OpenFDA – the Good, the Bad, and the Ugly
On June 3rd, the FDA launched OpenFDA, in an attempt to take large internal datasets and make them more accessible and usable by the developer and business community.
OpenFDA is delivered in a search-based API that should enable software developers to more easily build applications based on adverse event data from the FDA Adverse Event Reporting System (FAERS) dataset for the period 1/1/2004 to 6/30/2013. The FDA has announced plans to add device and food adverse events data to the framework, along with structured product labeling and recall data (update: drug and device recall data were added on July 16).
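To give a sense of what "search-based" means in practice, here is a minimal sketch of how a developer might build a query against the adverse events endpoint. The endpoint path, the `patient.drug.medicinalproduct` field name, and the `receivedate` range syntax reflect my understanding of the public OpenFDA documentation and should be verified there. `build_event_query` is a hypothetical helper, and the code only constructs the URL; it does not call the service.

```python
from urllib.parse import urlencode

# Assumed base URL for the drug adverse events endpoint; verify against the OpenFDA docs.
BASE_URL = "https://api.fda.gov/drug/event.json"


def build_event_query(drug_name, start="20040101", end="20130630", limit=1):
    """Build a search URL for adverse event reports naming a drug (hypothetical helper)."""
    # Assumed query syntax: field:"value" terms joined by AND, with a date range in brackets.
    search = (
        f'patient.drug.medicinalproduct:"{drug_name}"'
        f" AND receivedate:[{start} TO {end}]"
    )
    return f"{BASE_URL}?{urlencode({'search': search, 'limit': limit})}"


print(build_event_query("LIPITOR"))
```

The default date range mirrors the 1/1/2004 to 6/30/2013 window described above. Everything a developer sees comes back through queries like this one, which matters for the data quality issues discussed later in this post.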
The launch was heralded with the sort of buzz and hoopla usually reserved for a major product launch from a Silicon Valley startup. We have held off on any analysis and opinion until now to give our team the needed time to look through the system thoroughly.
Now, I readily admit, I am biased. As a data geek with 15+ years working on Big Data problems, I really want to love OpenFDA. It is, after all, a major step forward in terms of technology and, more importantly, philosophy from an agency that hasn’t exactly been a shining example of either in recent years.
In an ideal world, OpenFDA could usher in a world of new and improved tools and products that would improve patient safety and adherence, increase physician awareness of drug safety dangers, assist healthcare decision makers who are driving prescribing behavior with better decision support, and lower the overall cost of care by reducing avoidable side effects.
But we don’t live in an ideal world.
So, here are my thoughts on the Good, Bad, and Ugly of OpenFDA:
OpenFDA is a seal of approval over the use of FAERS data in multiple settings. This is the first time that FDA has confirmed what we have believed all along - that these data are valuable and should be used in multiple venues to improve patient safety. It has long seemed ridiculous that the FDA spends millions of dollars to collect these data, uses them for its own internal safety signaling and review processes, but then deters others in the healthcare community from deploying these same data in new and innovative ways.
OpenFDA gives the long awaited ‘all-clear’ to harness the power of these data to improve care throughout the healthcare system. We’re excited to see how this evolves in the product space – especially at the patient level - in the months and years ahead.
As with any new launch, once the excitement dies down, the true capabilities and limitations of the system are revealed. After careful review we’ve discovered several major concerns. Two of the big ones are detailed below:
1. Timeliness / Completeness. While OpenFDA has made 9.5 years’ worth of adverse event case reports available, the most current data provided is already more than a year old. One of the major drawbacks of FAERS is the delay in public data releases. FDA has the sole capability to make these data available more quickly, and we expected OpenFDA to include a concerted effort to speed up this process. The continued delay in data releases from FAERS puts patients at undue risk and undermines the utility of the OpenFDA program.
There are also concerns about OpenFDA’s commitment to keeping their API feed up-to-date. Days after the OpenFDA launch, FDA released new FAERS data through September 30, 2013. Now, six weeks later, those new data are still not incorporated into the OpenFDA feed.
We expect another major data load in the coming weeks. How long will it take for OpenFDA to update the API that will be populating all of these new apps and innovation? Will it update the API at all? When does the data become obsolete? When it is 2 years out of date? There’s no information about any of this on the OpenFDA website.
Further, while adverse events data through the FAERS system is readily available back to 1997, OpenFDA only made data available from 1/1/2004 – 6/30/2013. It’s unclear why OpenFDA capped the historical data at 9.5 years instead of the full 16-17 years available. Adding the additional years would have added over 1 million more adverse event cases to the available feed. For those trying to do meaningful statistical analysis on the underlying data, the extra years and cases are vitally important.
2. Data Cleansing. OpenFDA calls its clean-up process ‘harmonization’. AdverseEvents calls ours ‘optimization’. Whatever the term, the goal is the same – take these very dirty raw data and get them properly lined up so that you can see the true and complete side effect profile of a particular drug and compare/contrast those profiles on an apples-to-apples basis. It took us 18 months to do our initial clean-up work and develop RxFilter. But these data are constantly changing, and even the best automated algorithms aren’t enough. We have a dedicated staff of highly trained analysts in our California office who work constantly to keep these data properly cleaned and mapped to the right drug names, manufacturers, indications, classes, mechanisms of action, and other aggregation points that we continue to add on a regular basis.
OpenFDA has made some strides in its drug ‘harmonization’ by mapping to a series of common drug identifiers (notably, the National Drug Code and Structured Product Labeling), but by the team’s own admission, the harmonization process isn’t complete. They claim that 86% of all cases in OpenFDA are properly mapped and aggregated, but our initial review shows that number to be overstated. Chief among the myriad issues with OpenFDA’s harmonization process is that while it recognizes and maps verbatim drug names and some variations through its Medicinal Product field, it misses many others. It simply can’t take into account all of the adverse event cases that contain drug names with major misspellings, keystroke errors, and incorrect field data. This leaves hundreds of thousands of cases unmapped and virtually unsearchable via OpenFDA. For example, a search for the drug Lipitor in OpenFDA’s harmonized data produces 75,325 adverse event cases. When we search for Lipitor adverse event cases in our AdverseEvents Explorer platform for those same dates and using the same types of criteria embedded in the OpenFDA platform, we find 103,110 cases. Why the difference? We have found over 1,200 variations on the drug name Lipitor in the raw FAERS data. OpenFDA simply doesn’t capture them all.
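To illustrate why verbatim matching misses so many cases, here is a minimal sketch that compares exact matching with a simple similarity test from Python's standard library. This is not OpenFDA's harmonization process and not RxFilter; the variant spellings are made up for illustration, and the 0.8 threshold and the helper names are assumptions.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Uppercase and collapse whitespace so trivial differences do not matter."""
    return " ".join(name.upper().split())


def looks_like(raw_name, canonical, threshold=0.8):
    """Return True when a raw name is similar enough to the canonical drug name."""
    ratio = SequenceMatcher(None, normalize(raw_name), normalize(canonical)).ratio()
    return ratio >= threshold


# Made-up variants for illustration; these are not actual FAERS entries.
raw_names = ["LIPITOR", "lipitor ", "LIPTOR", "LIPITORR", "LIPITER", "ASPIRIN"]

exact = [n for n in raw_names if n == "LIPITOR"]
fuzzy = [n for n in raw_names if looks_like(n, "LIPITOR")]

print(len(exact), "exact match;", len(fuzzy), "fuzzy matches")
```

A similarity threshold is a blunt instrument. Set it too loosely and it will happily merge two different drugs with similar names, which is one reason automated matching alone does not settle the problem.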
Put simply, a large chunk of the adverse event cases provided by OpenFDA are out of alignment. How bad is the problem? It’s really difficult to tell exactly without thorough study, but I’m quite comfortable stating that the actual correct mapping rate is a lot lower than the 86% that OpenFDA claims. A 14% error rate makes analysis very difficult. An even higher error rate makes it impossible. And, coincidentally, as I was writing this post this week, some of the more active users of OpenFDA finally discovered some of these issues and started raising questions on the OpenFDA GitHub site. As one post plainly states, “…this skews the data quite a bit.”
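For a rough sense of scale, the Lipitor counts quoted earlier in this post can be turned into a capture rate. This is a back-of-the-envelope sketch for a single drug that treats the AdverseEvents Explorer count as the reference. It is not an estimate of OpenFDA's overall mapping rate.

```python
openfda_cases = 75_325    # Lipitor cases returned by OpenFDA's harmonized data
explorer_cases = 103_110  # Lipitor cases found in AdverseEvents Explorer

missing = explorer_cases - openfda_cases
capture_rate = openfda_cases / explorer_cases

print(f"Cases missing from OpenFDA: {missing:,}")
print(f"Share captured: {capture_rate:.1%}")
```

For this one drug, roughly 27% of the cases we find do not show up in the OpenFDA search.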
Now, in fairness, our automated process is not perfect either. This is why we add a manual step. After four years of work, we can machine-map 95-98% of new quarterly cases added, but the variation of data errors in FAERS is endless. Even an extra space in any field can throw the whole system into disarray. Keeping the dataset clean and optimized, and accounting for the 2-5% that our RxFilter process misses, takes manual curation and validation. I don’t see that ever changing. OpenFDA is starting with a (much) lower level of automated clean-up, doing no manual correction to bring the dataset into proper alignment, and since it is provided via API, does not empower its users to easily recognize these problems or implement their own solutions.
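The general shape of an automated pass followed by a manual step can be sketched as below. This is an illustration of the idea only, not RxFilter: the lookup table, the sample values, and the function names are all hypothetical.

```python
import re

# Hypothetical lookup table from cleaned-up raw names to a canonical drug name.
KNOWN_NAMES = {"LIPITOR": "LIPITOR", "ATORVASTATIN CALCIUM": "LIPITOR"}


def clean(raw):
    """Trim, uppercase, and collapse runs of whitespace in a raw field value."""
    return re.sub(r"\s+", " ", raw).strip().upper()


def map_cases(raw_names):
    """Split raw names into machine-mapped pairs and a manual review queue."""
    mapped, review_queue = [], []
    for raw in raw_names:
        canonical = KNOWN_NAMES.get(clean(raw))
        if canonical is None:
            review_queue.append(raw)  # left for a human analyst to resolve
        else:
            mapped.append((raw, canonical))
    return mapped, review_queue


sample = ["Lipitor", "LIPITOR  ", "atorvastatin  calcium", "LIPITOR 10MG?"]
mapped, review_queue = map_cases(sample)
print(f"machine-mapped: {len(mapped)}, needs manual review: {len(review_queue)}")
```

Stray spaces and casing are the easy part. The entries that land in the review queue are the ones that need a person, and a consumer of an API never gets to see that queue at all.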
Despite the issues raised above, the OpenFDA API feed is already being deployed and widely used. According to the FDA, since its launch, the adverse events API has been accessed by 18,000 devices with nearly 2.4 million API calls. The developers putting out those products – and the end users querying those products - don’t understand the scope of data shortcomings, errors, and timing issues that exist (the drug name issue described above is just scratching the surface). Unlike working with raw data, developers and users don’t have the means to easily identify the problems, let alone make the needed corrections; they simply see query results – errors and all. This will lead to incorrect and incomplete analysis and, I fear, potentially add to the already abundant misconceptions about FAERS data. In all, OpenFDA in its current state may end up doing more harm than good.
OpenFDA is a beta launch and that implies that there are improvements and upgrades yet to come. While we congratulate the OpenFDA team on their work to date, we hope that they will invest the necessary time and resources into addressing the issues stated in this post and other major flaws in the platform.
Unfortunately, the public proclamations made by those associated with OpenFDA indicate that their focus is now on adding more ‘open’ datasets rather than fixing and updating the adverse events data. In fact, just yesterday they rolled out an API for drug and device recall data (note: we have not reviewed that API and will not comment on it at this time). Even more unfortunately, the driving in-house technologist behind the OpenFDA effort has already left the program. Hopefully others will step in and continue to move the process forward.
Open Data is Great, But Not Enough
We would like nothing more than to see the OpenFDA effort succeed, as it will only help our company to move forward faster with ever increasing quality. So, if the FDA really wants to be as nimble as a Silicon Valley start-up, I’m calling on it to start acting like one. Especially with the disastrous launch of the Affordable Care Act website so fresh in everyone’s mind, FDA should utilize the lessons learned from every new tech product launch from the past fifteen years and fix the mistakes in the existing system before trying to add more data.
The concept of Open Data is great. But simply making data available isn't good enough. FDA needs to be transparent about the deficiencies of the current OpenFDA system (if they’re even aware of them), set a schedule for upgrades and development to fix the adverse events data, and commit the human and financial resources needed to make OpenFDA the truly revolutionary step forward in the advancement of healthcare that it can be.
There’s a widely held view in Silicon Valley that government and start-ups will – and should - always be at odds. I don’t share that view and it seems that OpenFDA could be a great platform to bring those two worlds closer together. But if some serious improvements aren’t made to the platform, I fear that OpenFDA will be branded as just another example of government wasting its time and our tax dollars.