The Myth of Distributed Healthcare Queries (or Big Data Gone Bad)

Tuesday, May 21st, 2013

Every day we’re hearing more and more about the exciting new discoveries that are possible with “big data,” including in healthcare. Indeed, a number of activities have been organized to “federate” or connect multiple healthcare databases to facilitate addressing critical research and policy questions: Query Health (Office of the National Coordinator at HHS), Harvard’s i2b2 project, and the FDA Sentinel System (New Eng J Med 364:498-9, 2011). These systems all send healthcare queries to multiple databases and then aggregate the resultant counts into an overall result.

Unfortunately, the results of such queries are PROVABLY UNRELIABLE. This type of distributed database architecture violates a basic computer science principle: distributed databases are guaranteed to produce correct query results only if the data in each node is independent (Weber G: Federated queries of clinical data repositories: the sum of the parts does not equal the whole. J Am Med Informatics Assn 2013). Of course this sounds like (and is) technical jargon, so let me decode and explain it.

The easiest way to understand this concept is with a simple example. Let’s say you want to know how many patients have both diabetes and high blood pressure. You send a query to multiple databases saying “Tell me how many patients you have with both diabetes and high blood pressure.” The problem is that you don’t know whether, or to what extent, data about the same patient appears in multiple databases. Each database reports only the count of patients that satisfy the query, with no identifying information included. So if John Brown, who has both diabetes and high blood pressure, has been seen in two different institutions with databases, then John Brown will be counted twice. If Mary Jones, who also has diabetes and high blood pressure, has been seen by two different institutions, but one recorded only that she has diabetes and the other only that she has high blood pressure, she won’t be counted at all (even though she should be).
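The John Brown and Mary Jones example can be sketched in a few lines of Python. This is a toy illustration only: the two site dictionaries, the condition labels, and every function name here are made up for the example, and matching patients by name stands in for identity information that a real federated query does not have.

```python
from collections import Counter

# Hypothetical toy data: each site maps a patient to the conditions it has recorded.
site_a = {
    "John Brown": {"diabetes", "hypertension"},
    "Mary Jones": {"diabetes"},
}
site_b = {
    "John Brown": {"diabetes", "hypertension"},
    "Mary Jones": {"hypertension"},
}
REQUIRED = frozenset({"diabetes", "hypertension"})


def site_matches(site, required):
    """Patients whose record at this one site contains every required condition."""
    return [p for p, conds in site.items() if required <= conds]


def federated_count(sites, required):
    """What a federated query sees: each site returns a bare count, and counts are summed."""
    return sum(len(site_matches(s, required)) for s in sites)


def times_counted(sites, required):
    """How often each patient is counted (not observable when only counts are shared)."""
    counted = Counter()
    for s in sites:
        counted.update(site_matches(s, required))
    return counted


def patients_truly_matching(sites, required):
    """Ground truth: merge each patient's records from all sites, then apply the query."""
    merged = {}
    for s in sites:
        for p, conds in s.items():
            merged.setdefault(p, set()).update(conds)
    return {p for p, conds in merged.items() if required <= conds}


sites = [site_a, site_b]
counted = times_counted(sites, REQUIRED)
print(federated_count(sites, REQUIRED))   # 2
print(counted["John Brown"])              # 2 (counted twice)
print(counted["Mary Jones"])              # 0 (never counted)
print(sorted(patients_truly_matching(sites, REQUIRED)))  # ['John Brown', 'Mary Jones']
```

In this particular toy case the two errors happen to cancel, so the total looks right even though it counts John Brown twice and Mary Jones not at all. Add or remove one patient and the total is wrong, and nothing in the returned counts reveals which situation you are in.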

So, as you can see, this method of querying multiple databases and adding up the counts results in both over- and under-counting (i.e., INCORRECT results). And the errors can be large and unpredictable. This is because most patients receive their medical care in multiple places, leaving data at each. The medical records in one location may be complete (possibly leading to over-counting) or incomplete (possibly leading to under-counting), with no way to know in advance whether these two types of errors will balance or not for any specific query.

So how does the concept of “independence” fit in? In this case, one database is independent of another if all the data for each patient is in one and only one database. In other words, no patient has data in multiple places, so the data in each database is “independent” of all the others. In such a case, where you know that all of each patient’s data is in one and only one database, a query to multiple databases will produce correct results. This is because the decision about whether a given patient meets the conditions of the query is made based on complete data, and each patient’s data is only considered once.
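A minimal sketch of this independence condition, again with invented data and function names: if no patient appears in more than one database, summing the per-database counts gives the right answer. Note the assumption that patients can be identified across databases in order to run the check at all; that is exactly the information a count-only federated query lacks.

```python
REQUIRED = frozenset({"diabetes", "hypertension"})


def is_independent(sites):
    """True when no patient identifier appears in more than one site."""
    seen = set()
    for site in sites:
        if not seen.isdisjoint(site):   # iterating a dict yields its keys (patients)
            return False
        seen.update(site)
    return True


def federated_count(sites, required):
    """Sum of per-site counts of patients whose local record meets the query."""
    return sum(
        sum(1 for conds in site.values() if required <= conds)
        for site in sites
    )


# Hypothetical independent databases: each patient's complete record is in exactly one.
independent_sites = [
    {"John Brown": {"diabetes", "hypertension"}},
    {"Mary Jones": {"diabetes", "hypertension"}},
]

# Hypothetical overlapping databases: both patients have data in both places.
overlapping_sites = [
    {"John Brown": {"diabetes", "hypertension"}, "Mary Jones": {"diabetes"}},
    {"John Brown": {"diabetes", "hypertension"}, "Mary Jones": {"hypertension"}},
]

print(is_independent(independent_sites))             # True
print(federated_count(independent_sites, REQUIRED))  # 2, one per patient
print(is_independent(overlapping_sites))             # False
```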

Does this mean that all of the systems mentioned above are useless and should be discarded? Not necessarily. The results of queries to these distributed (but not independent) systems may still occasionally provide some helpful insights into medical phenomena. In a few cases, the possibility of over- and under-counting may not be critically important. For example, if we are looking for events that should not be occurring at all (like administering penicillin to patients who are allergic to it), any such events that are found are significant. But we must always keep in mind that the quantitative query results may not be accurate (or even close). Otherwise, we may draw potentially dangerous conclusions, such as finding a disease outbreak when none exists, or misallocating resources based on artificially high or low estimates of the number of people affected by a given disease. In other words, “big data” gone bad.

Finally, how can we avoid these problems? The best approach to avoiding incorrect query results from distributed medical information systems is to compile a comprehensive copy of all the records for each patient from all sources in a single place (but not necessarily the same place for everyone). The institutions that do this are known as health record banks, patient-controlled repositories of electronic health records. Because each patient’s complete record then resides in one and only one bank, the banks are independent of one another, which avoids the erroneous counts that distributed queries otherwise produce. With health record banks, we could actually have an effective and efficient system for aggregating patient information for research, policy, and public health.
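To make the idea concrete, here is a rough sketch of consolidating records into more than one bank and then querying the banks. Everything in it is hypothetical: the patients (including the invented third patient, Ann Lee), the bank names, the `bank_of` mapping, and the assumption that records from different institutions can be reliably matched to the same patient. It is not a description of how any actual health record bank is implemented.

```python
REQUIRED = frozenset({"diabetes", "hypertension"})

# Hypothetical institutional databases with overlapping patients.
site_a = {
    "John Brown": {"diabetes", "hypertension"},
    "Mary Jones": {"diabetes"},
    "Ann Lee": {"diabetes", "hypertension"},
}
site_b = {
    "John Brown": {"diabetes", "hypertension"},
    "Mary Jones": {"hypertension"},
    "Ann Lee": {"diabetes", "hypertension"},
}

# Hypothetical choice of bank per patient: one bank each, not the same bank for all.
bank_of = {"John Brown": "bank_1", "Mary Jones": "bank_2", "Ann Lee": "bank_1"}


def count_across(databases, required):
    """Sum of per-database counts of patients whose record meets the query."""
    return sum(
        sum(1 for conds in db.values() if required <= conds)
        for db in databases
    )


def build_banks(sites, bank_of):
    """Copy every record for a patient, from all sites, into that patient's single bank."""
    banks = {}
    for site in sites:
        for patient, conds in site.items():
            bank = banks.setdefault(bank_of[patient], {})
            bank.setdefault(patient, set()).update(conds)
    return banks


banks = build_banks([site_a, site_b], bank_of)
print(count_across([site_a, site_b], REQUIRED))  # 4: John and Ann twice, Mary never
print(count_across(banks.values(), REQUIRED))    # 3: each patient counted exactly once
```

The query sent to the banks is the same count-and-sum query as before. What changes is the data layout: each patient now appears in exactly one place, with a complete record.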

In addition, health record banks provide many other benefits, most importantly to patient care. The availability of comprehensive electronic patient information when and where needed can both improve care and reduce costs. With health record banks, such information is available and ready to be retrieved by providers with a single query. Leaving patient information where it’s created and putting it together when needed is both inefficient and prone to error (Lapsia V, Lamb K, and Yasnoff WA. Where should electronic records for patients be stored? Int J Med Informatics 81(12):821-7, 2012).

So why are we continuing to invest in a “federated” architecture for health information infrastructure that doesn’t work? It’s time for the health IT community to shift its efforts to building health record banks, for both patient care and research.