Clinical Data Cleaning and Validation Steps

Wed, 04/15/2009 - 12:14pm

What every data manager should know about clinical trial data

Clinical data is one of the most valuable assets of a pharmaceutical company. Data is central to the whole clinical development process. It serves as the basis for the analysis, submission, approval, labeling and marketing of a compound. Without good clinical data (well organized, easily accessible and properly cleaned), the value of a drug may not be fully realized. The Data Warehousing Institute (TDWI) estimates that the cost of bad or 'dirty' data exceeds $600 billion annually. The actual cost of bad data may never be known, but it is safe to say that it is significant enough to move data quality up your to-do list.

The cost per patient of Phase III clinical studies of new pharmaceuticals exceeds $26,000 on average, according to a benchmarking report published by a business intelligence firm [1]. Phase II trials are comparatively cheaper, with the average cost falling just over $19,300 per patient. Phase I trials are even less expensive, at nearly $15,700 per patient. The usual number of volunteer patients per phase is 20 to 80 for Phase I, 20 to 300 for Phase II and 300 to 3000 for Phase III, which totals roughly $46,784,000 for an average trial across all three phases. Now imagine how much it would cost the company if all the collected data turned out to be "bad".
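
The total quoted above can be roughly reproduced with a short calculation. This is only a sketch: it assumes enrollment at the midpoint of each range and uses the rounded per-patient costs given in the text, and the names `phases` and `midpoint_cost` are illustrative. The result lands close to the quoted figure; the small gap presumably comes from the unrounded costs in the cited report.

```python
# Per-patient cost (USD, rounded as quoted in the text) and enrollment range per phase.
phases = {
    "Phase I": (15_700, 20, 80),
    "Phase II": (19_300, 20, 300),
    "Phase III": (26_000, 300, 3000),
}


def midpoint_cost(cost_per_patient, low, high):
    """Cost of one phase, assuming enrollment at the midpoint of its range."""
    return cost_per_patient * (low + high) / 2


total = sum(midpoint_cost(*phase) for phase in phases.values())
print(f"Estimated cost across all three phases: ${total:,.0f}")
```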

In practice, no matter how well designed and implemented, all studies have to deal with errors from various sources and their effects on study results. Even though trial designs are often unique and require different handling depending on the particular trial, data manipulation methodologies share a common set of tasks to be performed at certain stages (Figure I). Every stage involves certain data cleaning and validation procedures to ensure data consistency and accuracy (Table I).

Errors inevitably occur during data entry. The most common errors include typographical errors, copying errors, coding errors and range errors [2]. The normal editing process involves a series of edit checks and queries [3]. Edit checks are special routines that run in the background of data entry and prevent wrong information from being entered. For example, an edit check will not allow a future date to be entered as the "visit date", or any gender other than male or female.
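
The two edit checks mentioned above can be sketched in a few lines. This is a minimal illustration, not the logic of any particular data entry system; the field names `visit_date` and `gender` and the function `run_edit_checks` are assumptions made for the example.

```python
from datetime import date

ALLOWED_GENDERS = {"male", "female"}  # the two values named in the text


def run_edit_checks(record, today=None):
    """Return a list of query messages for one data-entry record.

    `record` is a dict with hypothetical keys 'visit_date' (a date) and 'gender'.
    """
    today = today or date.today()
    queries = []

    visit_date = record.get("visit_date")
    if visit_date is None:
        queries.append("visit_date is missing")
    elif visit_date > today:
        queries.append(f"visit_date {visit_date} is in the future")

    gender = str(record.get("gender", "")).strip().lower()
    if gender not in ALLOWED_GENDERS:
        queries.append(f"gender {record.get('gender')!r} is not an allowed value")

    return queries


entry = {"visit_date": date(2009, 5, 1), "gender": "Femail"}
issues = run_edit_checks(entry, today=date(2009, 4, 15))
for issue in issues:
    print(issue)
```

In a real system a failed check would block the entry or raise a query rather than print a message.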

Any errors, omissions, or items requiring clarification or changes to the CRF should be noted on the Query Form and forwarded to the clinical monitor for processing.

The purpose of trial monitoring is to verify that:

a.) The rights and well-being of human subjects are protected.

b.) The reported trial data are accurate, complete and verifiable from source documents.

c.) The conduct of the trial is in compliance with the currently approved protocol/amendment(s), with GCP, and with applicable regulatory requirements. [4]

The sponsor should ensure that trials are properly monitored. Part of the monitor's responsibilities, depending on the sponsor's request, is to verify that the following data are consistent with the source data or documents:

1. Dose or therapy modifications

2. Adverse events

3. Concomitant medications and/or intercurrent illnesses

4. Visits, labs, examinations and tests required by the protocol that were not performed.

Reports of this type should be submitted in writing after each trial-site visit or trial-related communication, with appropriate information included (e.g. date, site, name of the monitor) [4].

Even though electronic data management techniques prevent a lot of "dirty data" during data collection, most of the actual data cleansing still happens afterwards. Statistical societies recommend that a description of data cleaning be a standard part of reporting statistical methods. In practice there is often confusion about what counts as data cleaning, what is a deviation from the clinical protocol and what is a data abnormality (Figure II).

The Society for Clinical Data Management, in its guidelines for good clinical data management practices, states: "Regulations and guidelines do not address minimum acceptable data quality levels for clinical trial data. In fact, there is limited published research investigating the distribution or characteristics of clinical trial data errors. Even less published information exists on methods of quantifying data quality" [5]. It is also essential that quality control be applied to each stage of data handling to ensure that all data are reliable and have been processed correctly [4].

The Data Validation stage refers to:

Missing data identification.

Missing data is usually identified by running standard data cleaning reports, which flag missing values or missing records. It is essential to understand the difference between handling missing data for data cleansing purposes and handling it for efficacy/safety analysis. Handling missing data for data cleaning purposes must answer the following questions:

1. Was missing data collected on CRFs?

2. Was missing data lost during data load or database manipulations?

3. Was missing data loaded fully and correctly?

Once these questions have been answered, and any data lost in loading or processing has been recovered, the data cleaning job for missing data is complete. Later, during analysis, statisticians decide how the missing data should be treated: whether the patient should be excluded from the analysis or whether techniques such as LOCF (Last Observation Carried Forward) can be used. Procedures for dealing with missing data, e.g., use of estimated or derived data, should be described. A detailed explanation should be provided as to how such estimations or derivations were done and what underlying assumptions were made [4].
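
A missing-data report built around the questions above might look like the following sketch. It assumes the CRF values and the loaded database values are both available as dictionaries keyed by subject ID; the function name `classify_missing`, the field names and the status labels are all illustrative.

```python
def classify_missing(crf_records, db_records, fields):
    """Compare CRF source values with loaded database values, field by field.

    Returns (subject_id, field, status) tuples, where status is
    'not collected' (blank on the CRF itself) or
    'lost in load' (present on the CRF but blank or absent in the database).
    """
    findings = []
    for subject_id, crf in crf_records.items():
        db = db_records.get(subject_id, {})
        for field in fields:
            if crf.get(field) is None:
                findings.append((subject_id, field, "not collected"))
            elif db.get(field) is None:
                findings.append((subject_id, field, "lost in load"))
    return findings


crf = {
    "001": {"weight_kg": 71.5, "sbp": 120},
    "002": {"weight_kg": None, "sbp": 135},
    "003": {"weight_kg": 64.0, "sbp": 118},
}
db = {
    "001": {"weight_kg": 71.5, "sbp": 120},
    "002": {"weight_kg": None, "sbp": 135},
    "003": {"weight_kg": 64.0, "sbp": None},
}

report = classify_missing(crf, db, ["weight_kg", "sbp"])
for row in report:
    print(row)
```

Values reported as 'lost in load' are a data cleaning problem and should be reloaded; values reported as 'not collected' are genuinely missing and are left for the statisticians to handle.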

Data reconciliation.

Data reconciliation takes place at the end of a clinical trial. It is a process that compares two sets of records to make sure they are in agreement, which includes matching each value against its source and confirming that it is accurate and valid.

Queries that arise during the reconciliation of the data should be handled in the same manner in which clinical queries are handled. Standard operating procedures (SOPs) and quality analysis should be a part of every study in which the company invests money to collect end-points, be they traditional or pharmacoeconomic end-points [4].
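
As an illustration of comparing two sets of records, the sketch below reconciles two dictionaries keyed by the same identifier. The choice of adverse event records held in a clinical database and a safety database is an assumption made for the example, as are the function name `reconcile` and the field names.

```python
def reconcile(records_a, records_b):
    """Compare two record sets keyed by the same identifier.

    Returns three lists: keys found only in A, keys found only in B, and
    (key, field, value_a, value_b) tuples for fields that disagree.
    """
    only_a = sorted(records_a.keys() - records_b.keys())
    only_b = sorted(records_b.keys() - records_a.keys())
    mismatches = []
    for key in sorted(records_a.keys() & records_b.keys()):
        a, b = records_a[key], records_b[key]
        for field in sorted(a.keys() | b.keys()):
            if a.get(field) != b.get(field):
                mismatches.append((key, field, a.get(field), b.get(field)))
    return only_a, only_b, mismatches


# Hypothetical adverse event records keyed by (subject ID, event term).
clinical_db = {
    ("001", "headache"): {"onset": "2009-01-10", "serious": False},
    ("002", "nausea"): {"onset": "2009-02-03", "serious": False},
}
safety_db = {
    ("001", "headache"): {"onset": "2009-01-12", "serious": False},
    ("003", "rash"): {"onset": "2009-02-20", "serious": True},
}

only_clinical, only_safety, mismatches = reconcile(clinical_db, safety_db)
print("Only in clinical database:", only_clinical)
print("Only in safety database:", only_safety)
print("Disagreeing fields:", mismatches)
```

Each item in the three lists would become a query to be resolved against the source documents.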

Outlier detection.

Most databases include a certain number of exceptional values. Methods of outlier detection are either manual inspection of graphically represented data or automated statistical models. Mendenhall [6] defines outliers as values "that lie very far from the middle of the distribution in either direction". Pyle [7], in "Data Preparation for Data Mining", refers to an outlier as a "single, or very low frequency, occurrence of the value of a variable that is far away from the bulk of the values of the variable". In fact, what counts as an outlier depends on the purpose of the analysis. If the analysis relates to efficacy or safety evaluation, then frequency of occurrence should be an important criterion for detecting outliers in categorical data. For data cleaning purposes, by contrast, only flawed values need to be identified. Flawed values are those that result from poor data quality, i.e., a data entry or data conversion error. Isolating such outliers is important for improving data quality, for following GCP guidelines, and for reducing the impact of outlying values on data analysis and data mining.
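
One common automated approach is the interquartile range (IQR) rule, sketched below. The article does not prescribe a particular statistical model, so the IQR rule, the multiplier `k=1.5`, the function name `flag_outliers` and the sample values are all assumptions made for illustration.

```python
from statistics import quantiles


def flag_outliers(values, k=1.5):
    """Return (index, value) pairs lying more than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [(i, v) for i, v in enumerate(values) if v < low or v > high]


# Hypothetical systolic blood pressure readings; the last looks like a keying error.
readings = [118, 122, 125, 119, 121, 124, 120, 1200]
flagged = flag_outliers(readings)
print(flagged)
```

A flagged value is only a candidate: it still has to be checked against the source document to decide whether it is a flawed value or a genuine extreme observation.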

Once data validation is completed, it is time for the Data Treatment stage, which mostly concerns identifying deviations from the study protocol. The industry guideline "Structure and Content of Clinical Study Reports" [9] states that all important deviations related to study inclusion or exclusion criteria, conduct of the trial, patient management or patient assessment should be described as Protocol Deviations. Unfortunately, this term is not defined any more clearly by either the HHS human subject regulations (45 CFR 46) or the FDA (21 CFR 50). A Protocol Deviation may also be defined as an inconsistency between the protocol reviewed and approved by the IRB and the activities actually performed.

It is still a challenge for most data managers and clinical team leads to distinguish "dirty data" from actual protocol deviations. In fact, full, clean and accurate data is essential for determining deviations from the protocol.

Deviation is a term that includes protocol exceptions, changes made to avoid immediate harm to subjects, and protocol violations. From a regulatory perspective, there are two main groups of Protocol Deviations:

1. Major (serious) Protocol Deviations: deviations that may affect the subject's rights, safety, or well-being, or the completeness and accuracy of the study data. Examples:

* Research subject received wrong treatment or dose.

* Subject did not meet inclusion criteria but was not withdrawn from the study (e.g. age requirements, certain health conditions, test results out of specified range etc.)

* Research subject received an excluded, concomitant medication.

* Breaches of confidentiality.

* Missed oral medications, not relating to treatment of toxicities, or a missed day of treatment with continuous therapy.

* Failure to obtain informed consent prior to initiation.

* Failure to report an adverse event.

2. Minor (non-serious) Protocol Deviations: changes, divergences, or departures from the study design or procedures that do not have a major impact on the subject's rights, safety or well-being, or on the completeness and reliability of the study data. They need not be submitted for prospective review. Examples:

* Participants do not show up for scheduled research visit.

* Study procedures conducted out-of-sequence.

* Failure to perform required lab tests, measurements or evaluations.

* Study visit conducted outside of required timeframe.
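
Two of the minor deviations listed above, missed visits and visits conducted outside the required timeframe, lend themselves to an automated check once the visit dates are clean. The sketch below assumes a hypothetical visit schedule expressed as a target study day plus a tolerance in days; the function name `visit_window_deviations` and all sample values are illustrative.

```python
from datetime import date


def visit_window_deviations(baseline, visits, schedule):
    """Flag missed visits and visits outside their allowed window.

    visits:   {visit_name: actual visit date}
    schedule: {visit_name: (target_day, tolerance_days)}, counted from baseline
    """
    deviations = []
    for name, (target_day, tolerance) in schedule.items():
        actual = visits.get(name)
        if actual is None:
            deviations.append((name, "visit missed"))
            continue
        offset = (actual - baseline).days - target_day
        if abs(offset) > tolerance:
            deviations.append((name, f"{offset:+d} days from target"))
    return deviations


# Hypothetical schedule and one subject's visit dates.
schedule = {"Week 2": (14, 3), "Week 4": (28, 3), "Week 8": (56, 7)}
baseline = date(2009, 1, 5)
visits = {"Week 2": date(2009, 1, 20), "Week 4": date(2009, 2, 9)}

found = visit_window_deviations(baseline, visits, schedule)
for item in found:
    print(item)
```

Running such a check on uncleaned dates would report data entry errors as deviations, which is why data validation comes first.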

Deviations can also be emergent or non-emergent in nature. If a deviation occurs in an emergency situation, such as when a departure from the protocol is required to protect the life or physical well-being of a participant, the sponsor and reviewer must be notified as soon as possible to comply with GCP guidelines and with 21 CFR 312.56(d), which sets a limit of 5 working days: "The sponsor shall discontinue the investigation as soon as possible, and in no event later than 5 working days after making the determination that the investigation should be discontinued. Upon request, FDA will confer with a sponsor on the need to discontinue an investigation".

Non-emergent deviations that represent a major change in the approved protocol should be submitted as a "change in research" (for medical devices only, 21 CFR 812.150). Until the IRB approves the request for the proposed changes, those deviations are considered non-compliance and should also be reported promptly.

From the data management perspective, there are six main groups of Protocol Deviations:

1. Protocol Inclusion/Exclusion criteria

2. Discontinuation of treatment

3. Compliance

4. Study drug related

5. Medication related

6. Pain related

The complexity and frequency of data manipulation depend on the data management system. Clinical trial data can come from a variety of sources: investigator sites, laboratories, directly from subjects, and partners. Data may be collected on paper CRFs as well as electronically. The Oracle Clinical system (Electronic Data Management) allows Remote Data Capture (RDC), which automatically performs data cleaning as part of its validation process. Web-based CRFs allow advanced edit checks during the data collection stage to prevent major data cleaning issues later. Once web-CRF data is uploaded to Oracle Clinical EDM, the system runs an automatic validation (basic data cleaning) routine, which helps inform data managers about data issues in a timely fashion.

When working in the clinical industry, it is essential to remember that, no matter how frequently or how fast data manipulation must occur, it follows a certain sequence and set of rules, regulated by the FDA (IND), various guidelines and GCP. Quality (ISO) standards require a clinical system to assign responsibility for the different aspects of quality monitoring and to ensure periodic audits of the clinical database. There should also be provision for corrective actions to be implemented, and the system should be revised or redesigned if deemed necessary [8].


1. "Clinical Operations: Accelerating Trials, Allocating Resources and Measuring Performance", Pharmaceutical Processing (Nov. 2006).

2. Shein-Chung Chow, Jen-pei Liu, "Design and Analysis of Clinical Trials: Concepts and Methodologies" (Wiley-Interscience, 2nd ed., 2004), p. 639.

3. Curtis L. Meinert, Susan Tonascia, "Clinical Trials: Design, Conduct and Analysis" (Oxford University Press, 1986), p. 168.

4. Andrew J. Fletcher, Lionel D. Edwards, Anthony W. Fox, Peter D. Stonier, "Principles and Practice of Pharmaceutical Medicine" (Wiley and Sons, Ltd., 2nd ed., 2007), pp. 88-91.

5. Society for Clinical Data Management, "Good clinical data management practices" (Society for Clinical Data Management, version 3.0, 2003).

6. Mendenhall W., Reinmuth J.E. & Beaver R.J., "Statistics for Management and Economics" (Duxbury Press, Belmont, CA, 7th ed., 1993).

7. Pyle D., "Data Preparation for Data Mining" (San Francisco, CA: Morgan Kaufmann, 1st ed., 1999), pp. 240-258.

8. Rondel R.K., Varley S.A., Webb C.F., "Clinical Data Management" (John Wiley & Sons, Ltd, 2nd ed., 2000), pp. 137-138.

9. U.S. Department of Health and Human Services, FDA, CDER, CBER, "Guideline for industry: E6 Good Clinical Practice: Consolidated Guidance" (Geneva: International Conference on Harmonization, 1996), accessed January 2009, pp. 12, 15, 35, 42.

About the author: Vera Pomerantseva has 9 years of experience in analysis, programming and designing data for various projects, including time spent as an IT consultant (Database Design and SAS programming) for Novartis. She has both an undergraduate and a graduate degree in Economics from the State University of Management, Moscow, Russia.
