The Big Data Revolution: From Drug Development to Better Health Outcomes?

Published in Evidence-Based Oncology, May 2014, Volume 20, Issue SP7

Conventional analysis of discrete health data, such as blood pressure and cholesterol readings, identified 5000 patients at risk of developing congestive heart failure.

Automated analysis of doctors’ notes and other unstructured information from Carilion Clinic’s 8 hospitals turned up 3500 more.1

Early treatment should prevent many of those additional cases from ever developing, saving hundreds of lives and millions of dollars in the communities of western Virginia.

Such efforts to avert the chronic diseases that kill the majority of Americans and consume the majority of their healthcare dollars rank among the most promising medical applications for “Big Data” analysis.

But there are countless others.

A staggering amount of medically useful information is available for study—a century of published studies, decades of insurance claims—and that stock expands every time a doctor completes an electronic medical record or a runner dons a heart-rate monitor.

Software that can find, interpret, and analyze it all may eventually revolutionize healthcare.

A McKinsey & Company analysis, for example, predicts such programs will soon save at least $300 billion a year in American healthcare spending—and possibly much more.2

“Why is Big Data emerging in healthcare now? There are really 3 reasons,” said McKinsey director Nicolaus Henke. “The first is availability. We have so much more captured, machine-readable data available to us than we did just a few years ago. The second reason is that it’s much cheaper and easier to link these data. The third reason is a big imperative to understand population health better…it’s important both for outcomes and costs.”

Indeed, even with all the limitations in both data and software, would-be innovators are already finding significant ways to use the existing data to improve patient health.

The most famous applications to date lie in taking some of the data that modern life automatically generates about individual activity and reusing it to benefit that individual.

Smartphone applications that tap GPS and clock functionality to track runs have millions of users. Diet applications that use phone cameras and Internet connectivity to help users track what they eat have even more.

Medical practices are using similar tools to help patients.

Billing records have always recorded when patients come in for Papanicolaou tests (Pap smears), but now software sold to gynecologists can automatically look through those records, infer which patients are overdue for another Pap smear, and send reminders to those patients.3

Such programs can also look through medical records to see which women began the sequence of shots needed for human papillomavirus (HPV) vaccination and call to remind them about the next shot.3
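The logic behind such reminder systems is simple enough to sketch. The snippet below is a minimal illustration in Python, assuming a hypothetical record layout and a 3-year screening interval; real products query a practice's billing or EMR database rather than an in-memory list.

```python
from datetime import date, timedelta

# Hypothetical record structure and screening interval, for illustration only;
# actual reminder software reads a practice's billing/EMR database.
PAP_INTERVAL = timedelta(days=3 * 365)  # common 3-year screening guideline

patients = [
    {"name": "A. Smith", "last_pap": date(2010, 3, 1)},
    {"name": "B. Jones", "last_pap": date(2013, 9, 15)},
]

def overdue_for_pap(patient, today):
    """Flag patients whose last recorded Pap test is older than the interval."""
    return today - patient["last_pap"] > PAP_INTERVAL

today = date(2014, 5, 1)
reminders = [p["name"] for p in patients if overdue_for_pap(p, today)]
print(reminders)  # only A. Smith, last tested in 2010, is overdue
```

The same scan-and-flag pattern covers the HPV case: replace the single last-test date with the dates of the shots received so far and the recommended interval to the next one.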

Applications like that, which use relevant data from each individual to help that same individual, may provide substantial health benefits, but experts see even more promise in tools that use both individual data and collective data, such as the tools that IBM used to predict congestive heart failure (CHF) at Carilion.

“Traditional models use a handful of medical measurements to predict CHF,” said Ed Macko, IBM’s chief technology officer for healthcare & life sciences.

“Our systems—after scanning not only structured data but also free-written material from doctors’ notes, journals, and other sources—found dozens of relevant factors, including stuff that has rarely been considered before, like whether the patient has a job or someone at home that can provide care during illness.”


Such deep analysis allowed IBM’s technology to identify 70% more at-risk patients than traditional tools, while maintaining an estimated 85% accuracy rate that matches prior standards.

And each passing month increases both the number of patients identified and the accuracy of the prediction.

IBM has plenty of competitors, big and small, that want to use Big Data to improve healthcare. The McKinsey report estimates that 200 new companies have already entered the space. Older companies, universities, government agencies, and other nonprofits are also getting into the act.

Much of their work resembles the project at Carilion. It seeks to predict which people will become chronically ill—be it from CHF, diabetes, chronic obstructive pulmonary disease, drug addiction, or a handful of other problems—and prevent the downward spiral.

“It makes sense to focus here because a relatively small number of very ill people account for a huge percentage of both the suffering and the cost,” said Erica Mobley, senior manager at a hospital-monitoring nonprofit called The Leapfrog Group.

“Hospital systems have also focused on using Big Data to expand and improve upon data-driven decision making,” said Mobley, who noted that the real analytical pioneers among hospitals tend to be self-insuring university systems that can get a full picture of patients by using complete medical records, drug records, and insurance records.

The University of Pittsburgh Medical Center, for example, announced in 2012 that it was working with outside companies to create an enterprise data warehouse that would draw on more than 200 data sources to provide doctors with individualized care recommendations for particular patients.4

“Ever more data, sometimes right down to the genetic level, give hospitals the ability to help staff determine the correct decision in ever more specific situations. These data-driven decisions replace instinct or gut feeling, which studies have generally shown to be little better than raw guesswork.”

Big Data is also helping groups like Leapfrog improve their hospital rankings.

When Leapfrog was founded in 2000, hospitals reported so little data on safety that sophisticated analysis was unnecessary. Now, thanks to efforts by Leapfrog and other groups to increase transparency, patients can compare hospitals on issues as specific as the likelihood that the doctors will leave something inside them after surgery or that the staff will give them the wrong type of blood.

To help patients make sense of all those extra data, Leapfrog (which continues to press for still more disclosure) now uses sophisticated analytics to weight the different factors and compile a single letter grade for each facility.

Government agencies have also begun using Big Data to improve healthcare.

The FDA has launched a number of projects that mine and analyze data, including a program called Mini-Sentinel that automatically combs medical databases for signs of drug safety issues that were not detected before approval.

The numbers involved are vast. An FDA report from January revealed that as of July 2012, the Mini-Sentinel system had already collected records of some 3.8 billion medical visits and 3.5 billion dispensations of medication for 160 million Americans.5

For all those records, however, questions remain about the reliability of the analysis performed by the current system. A research letter published in January in JAMA Internal Medicine, for example, noted that traditional studies comparing the bleeding risk of warfarin (Coumadin) and dabigatran (Pradaxa) have all found substantially more risk with dabigatran, while Mini-Sentinel found more with warfarin.6

Looking forward, the FDA reportedly plans to expand its automatic monitoring system to read sources like Facebook and Twitter for signs of drug safety issues.7

Similar techniques have already demonstrated some usefulness. Google, for example, has demonstrated that it can often spot a regional outbreak of flu earlier than health authorities simply by noting the prevalence of flu-related Web searches.
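The underlying idea is a simple anomaly check on query volume: flag a region when the share of flu-related searches climbs well above its quiet-period baseline. A toy sketch in Python, with invented weekly figures (Google's actual models are far more sophisticated):

```python
# Weekly share of all searches that are flu-related in one region.
# These figures are invented for illustration.
weekly_flu_query_share = [0.010, 0.011, 0.010, 0.012, 0.011, 0.019, 0.025]

# Baseline: average of the first five (quiet) weeks.
baseline = sum(weekly_flu_query_share[:5]) / 5
threshold = baseline * 1.5  # alert when activity is 50% above baseline

alerts = [week for week, share in enumerate(weekly_flu_query_share)
          if share > threshold]
print(alerts)  # weeks 5 and 6 exceed the threshold
```

Because search data arrive continuously, such a check can fire days before clinic-visit statistics are compiled and reported, which is the source of the head start over health authorities.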

Researchers at Stanford and Columbia, moreover, were able to find a drug interaction the FDA had missed—the tendency of paroxetine (Paxil) and pravastatin (Pravachol) to raise blood sugar when used together—by analyzing tens of millions of search queries.8

Amid its efforts to use Big Data to monitor the safety of marketed drugs, the FDA also hopes its collection of drug trial information can help it develop software to better predict the behavior of experimental drugs in the human body.

Agency officials are using their vast archives of data to help build physiologically based pharmacokinetic models to predict drug absorption. Such models may spot potential problems with new drugs and improve the FDA’s ability to evaluate them.9
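As a much-simplified illustration of what such models predict, here is a classical one-compartment model with first-order absorption, far cruder than a true physiologically based (PBPK) model; all parameter values below are invented for illustration.

```python
import math

def concentration(t, dose=500.0, F=0.8, ka=1.2, ke=0.15, V=40.0):
    """Plasma concentration (mg/L) at time t (hours) after a single oral dose.

    One-compartment model with first-order absorption:
    C(t) = F*dose*ka / (V*(ka - ke)) * (exp(-ke*t) - exp(-ka*t))
    F = bioavailable fraction, ka/ke = absorption/elimination rate constants,
    V = volume of distribution. All values here are hypothetical.
    """
    return (F * dose * ka) / (V * (ka - ke)) * (math.exp(-ke * t) - math.exp(-ka * t))

# Concentration rises as the drug is absorbed, peaks, then declines
# as elimination dominates.
curve = [concentration(t) for t in range(0, 25, 4)]
```

A PBPK model replaces this single lumped compartment with physiologically meaningful ones (gut, liver, plasma, tissues) so that absorption problems can be traced to specific mechanisms, but the output is the same kind of concentration-time prediction.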

Of course, the FDA’s mountain of trial data could prove useful to many health-related analyses, so the agency plans to throw much of it open to outside researchers. FDA officials have launched a resource called the Janus Clinical Trials Repository, designed not only to release terabytes of information but also to make it user friendly.10

The FDA is also tapping outside organizations for help.

It funded a Center for Excellence in Regulatory Science and Innovation (CERSI) at the University of Maryland to help it use Big Data (and many other tools) to modernize and improve the review and evaluation of drugs and medical devices.11

The CERSI’s work on Big Data and healthcare, which includes a recent conference on the subject, nicely complements other efforts by Maryland to collect and harness records, such as its Research HARBOR (Helping Advance Research By Organizing Resources) project.12

“Assembling databases in useful ways has been very hard work. Claims databases lack the detail that researchers want. Medical records are only just going electronic—and even the electronic records we have are often incomplete and sometimes inaccurate,” said Eleanor M. Perfetto, PhD, MS, professor of pharmaceutical health services research at the University of Maryland’s School of Pharmacy.

“Still, while there is much work to be done, not only with traditional data sources but also completely new ones such as social media, we are making progress.”

Indeed, researchers can access data from insurers such as UnitedHealthcare, government entities such as the United Kingdom’s National Health Service, or the companies that make electronic medical record software. There are also data sellers like Humedica that try to link data from several sources to give researchers more holistic views of patient health.

Such data have many uses, but the most valuable, commercially speaking, may be the development of new treatments.

Many device and drug makers think Big Data can significantly improve the success rates of their laboratories and help them bring more drugs to market, faster. Their projects vary widely. Some are monitoring social media, analyzing what people say about their products, and considering that feedback in new designs. Others are using archived medical records to determine the characteristics of target populations and thus improve enrollment criteria for drug trials.

Most of these projects have yet to advance beyond pilot programs and other early-stage initiatives.

The same could be said about virtually all efforts to better healthcare with Big Data. The successes of these efforts, while sometimes impressive, have generally been limited in scope, and many obstacles will hinder attempts to expand them to the system as a whole.

Patient data, as Perfetto said, is sometimes wrong, often sketchy, and almost always stored in dozens of different databases that must be accessed separately, if they can be accessed at all. Territorialism, privacy concerns, and other issues will hinder adequate data assembly. What’s more, computer software suffers real limitations in its ability to interpret and analyze the available material.

Big Data failures still outnumber successes, and some very easy-sounding analyses still lie outside the realm of possibility.

That said, each week brings news of another promising application for data-parsing software, such as programs that aid drug development by “reading up” on the nearly endless supply of peer-reviewed articles published over the decades.

No one person—no team of people—could ever read all the relevant studies before choosing a drug target or a promising design, but programmers are “teaching” their computers to understand subject areas such as biology and chemistry and to “read” far more research than humans ever could.

One research hospital, in collaboration with IBM, used the company’s software to analyze decades’ worth of literature about p53, a protein involved in both normal cell growth and many types of cancer. Using information in those papers about kinases known to act on p53, the software built a general model of p53-kinase interaction. It then listed other proteins mentioned in the literature that were likely kinases that would interact with p53.
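The core technique, mining co-occurrences across a body of text, can be sketched in a few lines. The abstracts, protein names, and scoring rule below are all invented for illustration; IBM's actual system builds far richer semantic models of the chemistry involved.

```python
from collections import Counter

# Toy corpus: abstracts and candidate protein names are invented.
abstracts = [
    "KINASE_A phosphorylates p53 in response to DNA damage",
    "p53 activity is modulated by KINASE_A and KINASE_B",
    "KINASE_B binds p53; PROTEIN_C shows no p53 interaction",
    "phosphorylation of p53 by KINASE_A regulates the cell cycle",
]
candidates = ["KINASE_A", "KINASE_B", "PROTEIN_C"]

# Score each candidate by co-mention with p53, weighting mentions that
# also carry kinase-activity language ("phosphoryl...") more heavily.
scores = Counter()
for text in abstracts:
    if "p53" in text:
        for protein in candidates:
            if protein in text:
                weight = 2 if "phosphoryl" in text else 1
                scores[protein] += weight

print(scores.most_common())  # KINASE_A ranks highest
```

The top-ranked candidates become the prediction list: proteins the literature implicitly, but never explicitly, ties to p53 kinase activity, which bench scientists can then test.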

Most of the computer’s predictions proved accurate.

“This software isn’t going to cure cancer yet, but it did make significant new discoveries about a very heavily studied protein, and there is a significant possibility that some of these proteins could be medically useful,” said Ying Chen, a research staff member from IBM’s Watson Group.

“This technology is ready to make real contributions.”


1. IBM predictive analytics to detect patients at risk for heart failure [press release]. Armonk, NY: IBM Newsroom; February 19, 2014. Accessed April 30, 2014.

2. Kayyali B, Knott D, Van Kuiken S. The big data revolution in US healthcare: accelerating value and innovation. McKinsey & Company website. systems_and_services/the_big-data_revolution_in_us_health_care. Published April 2013. Accessed April 30, 2014.

3. Tracey JK. Big Data and the research HARBOR. Presented at: Leveraging Big Data II: What does it mean for improving product development and health care? February 11, 2014;

University of Maryland School of Pharmacy, Baltimore, MD. Accessed April 30, 2014.

4. UPMC fosters ‘personalized medicine’ with $100 million investment in sophisticated data warehouse and analytics [press release]. Pittsburgh, PA: University of Pittsburgh Medical Center; October 1, 2012. Accessed April 30, 2014.

5. Mini-Sentinel Data Score. Mini-Sentinel Distributed Database Year 3 Summary Report. Mini-Sentinel website. Distributed-Database-Summary-Report.pdf. Published July 2012. Accessed April 30, 2014.

6. Sipahi I, Celik S, Tozun N. A comparison of results of the US Food and Drug Administration’s Mini-Sentinel program with randomized clinical trials: the case of gastrointestinal tract bleeding with dabigatran. JAMA Intern Med. 2014;174(1):150-151.

7. Karlin S. Adverse events in social media: FDA expects Signal Detection Revolution. The Pink Sheet. Published January 27, 2014. Accessed April 30, 2014.

8. Markoff J. Unreported side effects of drugs found using data, study finds. The New York Times. March 7, 2013. Accessed April 30, 2014.

9. Florian J. FDA use of Big Data in modeling and simulations. Presented at: Leveraging Big Data II: What Does It Mean for Improving Product Development and Health Care? February 11, 2014; University of Maryland School of Pharmacy, Baltimore, MD. Accessed April 30, 2014.

10. Rosario L. Janus Clinical Trials repository: an update and insights into future directions. Presented at: Leveraging Big Data II: What Does It Mean for Improving Product Development and Health Care? February 11, 2014; University of Maryland School of Pharmacy, Baltimore, MD. Accessed April 30, 2014.

11. Center of Excellence in Regulatory Science and Innovation website. University of Maryland. Accessed April 30, 2014.

12. Clinical and Translational Sciences Institute website. University of Maryland. Accessed April 30, 2014.