Preventing Large-scale Losses of Medical Data

Sunday, May 15th, 2016

Has your personal information been stolen by hackers? Mine has – at least twice. My records were part of the 80-million-record Anthem breach in 2015. My information was also included in the 22-million-record breach at the Office of Personnel Management. While I appreciate having my online identity monitored and protected by two separate services as a result of these events, wouldn’t it be better if there were a way to prevent them?

Good news – there is a way! In a paper published last month in the Journal of Biomedical Informatics, I describe the new personal grid architecture that prevents large-scale data losses like these. Here’s how it works: Instead of storing all the records in a single database, each person’s record is stored in a separate file, separately encrypted, with its own complex password. In this way, even if a hacker were somehow able to take all the files, it would be necessary to break through strong encryption (a huge effort!) to get access to just a single record. That same difficult process would be needed to access each and every additional record. The incentive for the hacker is thus removed as the work needed to gain access far outweighs the potential value of each record.
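To make the idea concrete, here is a minimal sketch of "one record, one key, one file." It is an illustration under stated assumptions, not the implementation from the paper: the function names are hypothetical, the keys are simply held in a dictionary (the post does not say where keys would live), and the cipher is a toy keystream built from SHA-256 so the example runs with only the Python standard library. A real system would use a vetted authenticated cipher such as AES-GCM instead.

```python
import hashlib
import secrets
from typing import Tuple


def _keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Toy keystream (SHA-256 over key + nonce + counter). NOT for real use."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])


def encrypt_record(plaintext: bytes) -> Tuple[bytes, bytes]:
    """Encrypt one record under a freshly generated key; return (key, blob)."""
    key = secrets.token_bytes(32)    # a unique random key for this record only
    nonce = secrets.token_bytes(16)  # stored alongside the ciphertext
    stream = _keystream(key, nonce, len(plaintext))
    ciphertext = bytes(p ^ s for p, s in zip(plaintext, stream))
    return key, nonce + ciphertext


def decrypt_record(key: bytes, blob: bytes) -> bytes:
    """Reverse encrypt_record using that record's own key."""
    nonce, ciphertext = blob[:16], blob[16:]
    stream = _keystream(key, nonce, len(ciphertext))
    return bytes(c ^ s for c, s in zip(ciphertext, stream))


# Hypothetical sample data: each patient gets a separate key and a separate blob.
records = {
    "patient-001": b"flu shot: 2015-10-02",
    "patient-002": b"flu shot: none",
}
keys, files = {}, {}
for patient_id, data in records.items():
    keys[patient_id], files[patient_id] = encrypt_record(data)
```

Because every record has its own key, recovering one key (or breaking one file) reveals exactly one record and nothing about the rest.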

So – problem solved … almost. As with so many things, there is no “free lunch.” While organizing data this way does eliminate the potential for loss of the entire dataset all at once, it creates a new problem: slow searching. Much of the value of medical information comes from searching across many patients – for example, to find patients who need a flu shot or are eligible to participate in a particular clinical trial. Regular (relational) databases have indexes that store “pre-searched” information so that, much like the index of a book, finding a particular item just requires a quick lookup in the index (or multiple indexes). But a personal grid cannot have such indexes because they can be used to reconstruct most (if not all) of the original data for all the patients – thereby creating the same vulnerability to loss of all the data that was eliminated by separately storing each record.
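A small sketch shows why an index is as sensitive as the data it points to. The records, terms, and function names below are made up for illustration. The index is a simple "term to patients" lookup of the kind a database might keep, and the second function rebuilds every patient's record from that index alone.

```python
from collections import defaultdict

# Hypothetical records: each patient maps to a set of coded terms.
records = {
    "patient-001": {"diabetes", "metformin"},
    "patient-002": {"asthma"},
    "patient-003": {"diabetes", "asthma"},
}


def build_index(records):
    """Build a term -> set-of-patients index, like the index of a book."""
    index = defaultdict(set)
    for patient_id, terms in records.items():
        for term in terms:
            index[term].add(patient_id)
    return index


def reconstruct_from_index(index):
    """Invert the index to recover each patient's terms without the records."""
    rebuilt = defaultdict(set)
    for term, patient_ids in index.items():
        for patient_id in patient_ids:
            rebuilt[patient_id].add(term)
    return dict(rebuilt)


index = build_index(records)
leaked = reconstruct_from_index(index)  # equals the original records
```

In this toy case the reconstruction is complete; anyone who steals the index has effectively stolen the data.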

Without indexes, searching involves the slow step-by-step process of retrieving each record, decrypting it, then determining if whatever we are searching for is present. This type of search is called sequential, and much of the work in the field of computer science is devoted to finding ways to avoid it (such as by using the indexes in relational databases) because it is very slow, even with fast computers.
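The retrieve, decrypt, and check loop can be sketched in a few lines. This is only an illustration: the names are hypothetical, and `toy_decrypt` is a single-byte XOR placeholder standing in for real decryption so the example stays short and runnable.

```python
from typing import Callable, Dict, List


def sequential_search(
    store: Dict[str, bytes],
    keys: Dict[str, int],
    decrypt: Callable[[int, bytes], bytes],
    term: bytes,
) -> List[str]:
    """Visit every record in turn; the work grows with the number of records."""
    matches = []
    for patient_id, blob in store.items():          # 1. retrieve the record
        plaintext = decrypt(keys[patient_id], blob)  # 2. decrypt it with its own key
        if term in plaintext:                        # 3. check for the search term
            matches.append(patient_id)
    return matches


def toy_decrypt(key: int, blob: bytes) -> bytes:
    """Placeholder for real decryption: XOR every byte with a one-byte key."""
    return bytes(b ^ key for b in blob)


toy_encrypt = toy_decrypt  # XOR is its own inverse

# Hypothetical sample data, one key per patient.
keys = {"patient-001": 0x5A, "patient-002": 0x3C, "patient-003": 0x77}
notes = {
    "patient-001": b"flu shot given",
    "patient-002": b"no vaccination on file",
    "patient-003": b"flu shot declined",
}
store = {pid: toy_encrypt(keys[pid], text) for pid, text in notes.items()}

hits = sequential_search(store, keys, toy_decrypt, b"flu shot")
```

There is no shortcut in this loop: every record has to be decrypted before it can be checked, which is exactly why the search is slow.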

While the problem of slow searching in the personal grid can’t be totally eliminated, it can be greatly reduced so that this new architecture is feasible for everyday use. Instead of searching sequentially with one computer, we can divide this task among a large number of machines working in parallel. Because each record can be checked independently of all the others, 1,000 machines can complete the search roughly 1,000 times faster. That allows searches of multimillion-record personal grid databases in about an hour, which is fast enough for any purpose when searching across medical records of a population.
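Here is a rough sketch of dividing the search among workers, plus the back-of-the-envelope arithmetic. Several things are assumptions for illustration only: the function names, the use of threads on one machine in place of separate cloud servers, the omission of the decryption step, and the figures of 3.6 million records at one second per record (the post gives no per-record timing).

```python
from concurrent.futures import ThreadPoolExecutor


def scan_shard(shard, term):
    """Sequentially scan one worker's share of the records."""
    return [patient_id for patient_id, text in shard if term in text]


def split_into_shards(items, n_workers):
    """Deal the records out round-robin so each worker gets an equal share."""
    return [items[i::n_workers] for i in range(n_workers)]


def parallel_search(items, term, n_workers):
    """Run one sequential scan per shard at the same time, then merge results."""
    shards = split_into_shards(items, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = pool.map(lambda shard: scan_shard(shard, term), shards)
    return sorted(pid for shard_hits in results for pid in shard_hits)


def estimated_hours(n_records, seconds_per_record, n_machines):
    """Idealized estimate that ignores coordination and startup overhead."""
    return n_records * seconds_per_record / n_machines / 3600


# Hypothetical data: every third patient has a flu shot on file.
items = [
    (f"patient-{i:03d}", "flu shot given" if i % 3 == 0 else "no record")
    for i in range(12)
]
hits = parallel_search(items, "flu shot", n_workers=4)

one_machine = estimated_hours(3_600_000, 1.0, 1)        # 1000.0 hours
thousand_machines = estimated_hours(3_600_000, 1.0, 1000)  # 1.0 hour
```

Under those assumed numbers, a search that would take about six weeks on one machine finishes in about an hour on 1,000, which matches the scale described above.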

Where can we get 1,000 servers that can be used for this? It turns out this is relatively easy and inexpensive in today’s cloud computing environments. Cloud computing services are specifically designed to be able to allocate large numbers of servers on short notice to a computationally difficult problem. So the availability of cloud computing makes the personal grid architecture, and the high level of security it provides, practical and feasible today.

Who’s using the personal grid? No one at the moment, since the idea is new. I’m hoping that we’ll soon see some implementations, which seems likely given the huge costs involved in mitigating large-scale medical information breaches. Senior healthcare executives responsible for protecting our information should be very interested in this new ultra-secure approach to information architecture that removes the hugely expensive potential risk of total data loss.

Finally, I want to mention that the personal grid addresses one of the key objections people have to centralized repositories of medical records – namely, that a hacker might break in and take all the data at once. Readers of this blog know that I’m a long-time advocate for such repositories in communities (known as health record banks), with the records controlled by patients. With the personal grid architecture, worries about losing all the records to a hacker are eliminated. This means that health record banks can not only overcome the obstacles of privacy, stakeholder cooperation, and financial sustainability, but also provide the security needed so that all of us can be confident that our information is truly protected from unauthorized use.

The five-minute narrated slideshow posted below has more details.