The Myth of Anonymized Data

Today’s CNET story, “AOL, Netflix and the end of open access to research data”, describes how two large so-called “anonymized” databases have been re-identified, compromising the privacy of everyone in them. This provides yet another example of why “anonymized” data is a myth — and reinforces the need to avoid the release of large datasets of medical records, even if they are supposedly “de-identified.”

The first incident described involves the release of 500,000 subscribers’ movie ratings by Netflix in 2006. To protect the privacy of those subscribers, Netflix carefully removed all personal information. They offered $1 million to anyone who could develop an algorithm that would improve their movie recommendation system, a worthy goal. However, this week researchers announced that they had successfully re-identified the data using publicly available information.

A similar scenario occurred when AOL publicly released “de-identified” search data for 500,000 of its users. Some were re-identified within days.

The lesson in this is simple: THERE IS NO SUCH THING AS ANONYMIZED DATA. To some extent, it can always be re-identified. For those who are interested in more details, computer scientist Dr. Latanya Sweeney’s Data Privacy Lab at Carnegie Mellon has been studying this issue for years and developing the theory needed to understand it.
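To see how easily re-identification can happen, here is a minimal sketch of the kind of linkage attack this line of research describes: joining an "anonymized" dataset to a public one on quasi-identifiers such as ZIP code, birth date, and sex. All records and names below are hypothetical, invented purely for illustration.

```python
# Minimal illustration of a linkage attack: an "anonymized" medical
# dataset still carries quasi-identifiers (ZIP, birth date, sex) that
# can be joined against a public voter-roll-style list to recover
# names. All records here are hypothetical.

anonymized_medical = [
    {"zip": "02138", "dob": "1945-07-31", "sex": "F", "diagnosis": "hypertension"},
    {"zip": "60629", "dob": "1970-01-15", "sex": "M", "diagnosis": "diabetes"},
]

public_records = [
    {"name": "Alice Example", "zip": "02138", "dob": "1945-07-31", "sex": "F"},
    {"name": "Bob Sample", "zip": "60629", "dob": "1970-01-15", "sex": "M"},
]

def reidentify(medical, public):
    """Join the two datasets on the quasi-identifier triple."""
    index = {(p["zip"], p["dob"], p["sex"]): p["name"] for p in public}
    matches = []
    for rec in medical:
        key = (rec["zip"], rec["dob"], rec["sex"])
        if key in index:
            matches.append((index[key], rec["diagnosis"]))
    return matches

# Each "anonymous" record now has a name attached.
print(reidentify(anonymized_medical, public_records))
```

The point of the sketch is that no single field is identifying, yet the combination of a few ordinary fields often picks out one person uniquely.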

So what are the implications for medical data? As previously described in this space (Protecting Privacy While Searching Health Record Banks), each person’s complete health records need to be stored in a central location with all access under the control of that individual (or whomever they designate). To provide the tremendous research benefits available from searching this data, queries should be submitted to health record banks, but NO DATA SHOULD EVER BE RELEASED. Instead, the result of a query would be a count of the number of matches and a carefully controlled demographic summary. In this way, re-identification is prevented since no actual data is available. This allows all of us to have the fruits of medical research WITHOUT having to give up our privacy.
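A minimal sketch of such a count-only query interface follows. The record layout, the function names, and the small-count suppression threshold are all illustrative assumptions, not a specification of any actual health record bank:

```python
# Sketch of a count-only query interface: callers submit a predicate,
# and the bank returns only the NUMBER of matching records, never the
# records themselves. Small counts are suppressed so that narrow
# queries cannot single out an individual. The threshold and record
# layout are illustrative assumptions.

SUPPRESSION_THRESHOLD = 5  # counts below this are not reported

def count_query(records, predicate):
    """Return a match count; no record-level data ever leaves the bank."""
    n = sum(1 for r in records if predicate(r))
    if n < SUPPRESSION_THRESHOLD:
        return f"fewer than {SUPPRESSION_THRESHOLD} matches (suppressed)"
    return n

# Hypothetical records: ages 40-59, odd ages marked with a diagnosis.
records = [{"age": a, "diagnosis": "diabetes" if a % 2 else "none"}
           for a in range(40, 60)]

print(count_query(records, lambda r: r["diagnosis"] == "diabetes"))
print(count_query(records, lambda r: r["age"] == 41))  # too narrow: suppressed
```

The design choice worth noting is that suppression happens inside the bank, before anything is returned, so even a deliberately narrow query yields nothing that could be linked back to a person.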

Let’s hope Netflix and AOL have learned their lesson and that other organizations — especially health care institutions — are paying close attention.

7 Responses to “The Myth of Anonymized Data”

  1. Ed Dodds says:

    This, of course, presupposes that compromised health record bank employees will be without thumb drives or monitored to keep them from doing ad hoc searches of their own from within the facility. And of course no underpaid temp worker will ever have access; neither will someone with a political agenda, whether in the US or offshore. 😉 The reality is that in cases of identity theft, the police rarely do anything substantive, due to identity theft’s low priority in their budgetary scheme of things.

  2. Mr. Dodds,
    Thanks very much for your comment. Of course, employees and contractors of health record banks must be closely screened and monitored, just as is done today in “classified” facilities. Within such a facility, all USB ports of all machines will be deactivated (as is commonly done today) to prevent use of USB drives. Furthermore, USB devices, laptops, etc., would not be allowed in or out of such a facility. It would also be surrounded by a Faraday cage to prevent the escape of electromagnetic signals from the computer systems that could be intercepted outside. Finally, any output written to media within the facility would require special authorization from at least two staff members; the media itself would then be meticulously tracked. With proper precautions, security can be maintained.
    Note that this is in sharp contrast to how health care data is typically handled today. Now, your information is freely being used (with permission) for all sorts of purposes without your knowledge or consent.

  3. Richard Ward says:

    Patient rights and their expressed wishes are key issues in the lawsuit we are attempting to have reviewed by the US Supreme Court. When I, and thousands of other prostate cancer survivors, were told our donated tissues would not be used for the prostate cancer research we intended, but instead could be anonymized and used for other research, we began our efforts in the courts. I applaud your concerns about this highly questionable practice and your efforts to inform others about the need for security of healthcare data.

  4. Hi, I’m one of the authors of the Netflix paper, and my views on this are very similar.

  5. Tom Knorr says:

    Bill, great job, an agreeable solution. I think most directly identifying data (SSN, account numbers, …) should be one-way encrypted, something I have not seen used very often in medical data storage. That makes it more difficult to mine a stolen database e.g. for identity theft. Finally I think data can be separated in publicly available data (contact information, names, places, something one could find on the web) and medical case data. The main concern is the connection of the case data to the patient. If access to these separate data realms requires 2 different authorizations, (e.g. a query over the medical data returns an ID that then needs to be resolved in the public data) then unauthorized reconstruction of the full data set should be sufficiently difficult.
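The two ideas in the comment above, one-way encryption of direct identifiers and splitting demographic data from case data behind separate authorizations, can be sketched roughly as follows. The field names, key handling, and store layout are hypothetical assumptions for illustration, not a description of any deployed system:

```python
import hashlib
import hmac
import os

# Sketch of the comment's two ideas: (1) direct identifiers (e.g. SSN)
# are stored only as keyed one-way hashes, so a stolen database cannot
# be mined for them; (2) demographic and medical case data live in
# separate stores, linked only by an opaque ID, so reconstructing a
# full record requires two separate authorizations.

SECRET_KEY = os.urandom(32)  # held by the record bank, never stored with data

def one_way_id(ssn: str) -> str:
    """Keyed one-way hash (HMAC-SHA256) of a direct identifier."""
    return hmac.new(SECRET_KEY, ssn.encode(), hashlib.sha256).hexdigest()

# Store 1: demographic/"public" realm, keyed by opaque ID.
demographics = {}
# Store 2: medical case realm, keyed by the same opaque ID.
cases = {}

def admit(ssn, name, diagnosis):
    pid = one_way_id(ssn)
    demographics[pid] = {"name": name}
    cases[pid] = {"diagnosis": diagnosis}

admit("123-45-6789", "Alice Example", "hypertension")  # hypothetical patient

# A query over the case realm returns only opaque IDs...
hits = [pid for pid, c in cases.items() if c["diagnosis"] == "hypertension"]
# ...which require a second, separate authorization against the
# demographic realm to resolve to an actual person.
```

Because the hash is keyed, even an attacker who steals both stores and knows the SSN format cannot enumerate identifiers without also compromising the bank's secret key.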

  6. Here is another paper on that topic:

    El Emam K, Jabbouri S, Sams S, Drouet Y, Power M
    Evaluating Common De-Identification Heuristics for Personal Health Information
    J Med Internet Res 2006;8(4):e28

  7. Excellent post. I was checking continuously this weblog and I am impressed! Very useful information specially the last section 🙂 I deal with such information much. I was looking for this particular info for a very long time. Thanks and good luck.
