Data Cleaning Tools for Specific EHR Datasets
EHR database systems differ in design and implementation. They may use different data dictionaries, and their relational schemas can differ from one system to another. For example, one database may use the World Health Organization ICD-10 code list, with around 14,000 codes, while another uses the ICD-10-CM code list, with more than 70,000 entries. Dates may be entered in European style (dd/mm/yyyy) or USA format (mm/dd/yyyy). Different countries use different formats for postal addresses, so cleaning geolocation data can be another challenge for data cleaning teams. Because of these discrepancies, data cleaning tools are usually developed for a specific EHR database system and may not be directly applicable to other databases. However, studying the methods, experience, and tools of other research groups is still beneficial and informative for researchers developing their own data cleaning tools.
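The date-format discrepancy can be illustrated with a minimal Python sketch. It assumes the source system's convention is known in advance and passed in explicitly. The function name `parse_date` and the style labels `"european"` and `"usa"` are hypothetical and not taken from any particular EHR tool.

```python
from datetime import datetime

def parse_date(text, style):
    """Parse a slash-separated date using an explicitly declared style.

    The style labels are illustrative assumptions; a real pipeline would
    take the convention from the source system's data dictionary.
    """
    formats = {"european": "%d/%m/%Y", "usa": "%m/%d/%Y"}
    return datetime.strptime(text, formats[style]).date()

# The same raw string yields two different calendar dates.
raw = "03/04/2020"
as_european = parse_date(raw, "european")  # 3 April 2020
as_usa = parse_date(raw, "usa")            # 4 March 2020

print(as_european.isoformat(), as_usa.isoformat())
```

Because a string such as 03/04/2020 is valid under both conventions, the format cannot be inferred reliably from a single value. It has to be known per source system.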
One example of such a dataset-specific tool is the rEHR package, developed by UK researchers for the R statistical software. In the UK, EHRs have been deployed near-universally in general practice for over 20 years (Springate et al. 2017). The following databases were used to create this universal EHR database:
- The Clinical Practice Research Datalink (CPRD, previously known as the General Practice Research Database, GPRD)
- The Health Improvement Network (THIN)
- QResearch
- The Doctors' Independent Network (DIN-LINK)
- Research One
This database is completely de-identified and made available for research (Springate et al. 2017).
The rEHR package is a wrapper around SQL queries that makes it easy and fast to interact with the underlying database, and it adds some further features. It can extract data from the database and convert the longitudinal data into a cohort dataset suitable for survival analysis. rEHR can create datasets with matched case and control groups and can cut cohort data by time-varying covariates. An additional feature is the ability to unify the measurement units for HbA1c.
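To make the idea of unit unification concrete, the sketch below converts HbA1c values to a single unit in Python. It is not rEHR's implementation (rEHR is an R package). It assumes values arrive either as a percentage (NGSP/DCCT) or in mmol/mol (IFCC) and applies the commonly published master equation, IFCC = 10.93 × NGSP − 23.50. The function name and unit labels are hypothetical.

```python
def hba1c_to_mmol_per_mol(value, unit):
    """Return an HbA1c value expressed in mmol/mol (IFCC units).

    Assumption: `unit` is either "%" (NGSP/DCCT) or "mmol/mol" (IFCC).
    Anything else is rejected rather than guessed.
    """
    unit = unit.strip().lower()
    if unit == "mmol/mol":
        return float(value)
    if unit == "%":
        # Master equation relating NGSP (%) to IFCC (mmol/mol).
        return 10.93 * float(value) - 23.50
    raise ValueError(f"Unrecognized HbA1c unit: {unit!r}")

# Mixed-unit readings, as they might appear across practices.
readings = [(6.5, "%"), (48, "mmol/mol"), (7.0, "%")]
unified = [round(hba1c_to_mmol_per_mol(v, u)) for v, u in readings]
print(unified)
```

Rejecting unknown units, instead of inferring them from the size of the value, keeps ambiguous records visible for manual review.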
There are also some statistical methods for data cleaning in the literature. For example, algorithms have been developed to detect and clean inaccurate adult height measurements using the median height during different stages of a patient's life (Muthalagu et al. 2014). This approach is simple and easy to implement, but it drops surprising or dramatic data points from the patient's history. For example, we expect neither a negative change in the patient's height nor a significant height gain after adolescence. However, we should keep in mind that there are exceptions to these assumptions, although they occur with extremely small probability. For instance, if the patient undergoes bilateral lower limb amputation, the change in height will be negative. On the other hand, if the patient undergoes spinal reconstructive surgery, he/she may gain height even after becoming an adult (Spencer et al. 2014). These situations are rare but, depending on the type of study and target population, ignoring them during the data cleaning process may remove some critical data points and add bias to the study.
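A median-based check of this kind can be sketched as follows. This is a simplified illustration and not the published algorithm of Muthalagu et al. It assumes all measurements belong to one adult patient, and the 10 cm tolerance is an arbitrary illustrative threshold. In line with the caveat above, it flags suspicious values for review instead of deleting them.

```python
from statistics import median

def flag_implausible_heights(heights_cm, tolerance_cm=10.0):
    """Pair each adult height with a flag marking it as suspicious.

    A value is flagged when it differs from the patient's median height
    by more than `tolerance_cm` (an assumed, illustrative threshold).
    """
    if not heights_cm:
        return []
    center = median(heights_cm)
    return [(h, abs(h - center) > tolerance_cm) for h in heights_cm]

# 17.1 looks like a misplaced decimal point for 171.
records = [170.0, 171.0, 17.1, 170.5, 169.5]
flags = flag_implausible_heights(records)
for height, suspicious in flags:
    print(height, "REVIEW" if suspicious else "ok")
```

Keeping the flagged values in the dataset lets a reviewer separate a typing error from a real change, such as one following amputation or spinal surgery.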
Data Quality Assessment
One of the first challenges an investigator faces is deciding whether the available data are suitable and good enough for conducting the study of interest. The answer to this question is critical, and it usually becomes apparent during the initial steps of the data cleaning process. For example, we may be interested in studying a rare infectious disease using a multicenter EHR database system. Initially, we may find that the database has enough records and subjects to conduct the study. However, in later data inspections, we may realize that the majority of the records have been recorded improperly and that they include an extremely high rate of missing data. These findings may force the research group to drop some blocks of records from the study, which may eventually result in a significant decrease in the number of subjects available to continue the study. Kahn et al. proposed data quality metrics to assess the appropriateness of the available data (Kahn et al. 2012). They divided these metrics into five different categories:
1. Attribute domain constraints: This metric measures the basic summary statistics of each of the desired attributes and assesses their distributions. If clinical knowledge suggests a certain distribution with a definite possible range of values for an attribute, we can estimate the level of error in the database for that specific attribute. For example, pulse oximetry values are usually above 95%, and we know the possible range of values is from 0% to 100%. So if we find a large number of values that are negative or greater than 100% for this attribute, it can be indicative of poor data quality (a high number of errors) (Elder et al. 2015).
2. Relational integrity rules: This metric deals with the structure of the database from which we have extracted our data. For example, if we discover that the primary key for each patient (which is called Patient_sk [super_key] in Cerner) is not unique for some patients, we can conclude that the database has a structural problem, and a higher ratio of this discrepancy reveals a lower quality of the data.
3. Historical data rules: This metric evaluates the quality of the time-varying attributes, such as the use of the same format at different time points.
4. State-dependent rules: This metric assesses whether the changes in the lifecycle of an object follow expected transitions. For example, if a patient is flagged as expired at one time point, we should not have any records at later time points implying that he/she is alive. An example is a blood pressure measurement recorded after the reported time of death. However, we should keep in mind that some results, such as toxicology reports, may become available and be recorded in the database even after death.
5. Attribute dependency rules: This metric measures the quality of dependent attributes. For example, if we observe a pregnancy diagnosis for a patient with multiple male gender records in different encounters, we may question the quality of the data (Kahn et al. 2012).
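Several of these rule categories can be expressed as simple checks. The sketch below covers four of the five (domain constraints, relational integrity, state-dependent rules, and attribute dependency rules) on toy in-memory records. The field names (`patient_id`, `sex`, `death_date`, `kind`, `value`) and the data are hypothetical and do not reflect the schema of Cerner or any other system.

```python
from collections import Counter
from datetime import date

# Toy records with hypothetical field names.
patients = [
    {"patient_id": 1, "sex": "M", "death_date": date(2015, 3, 1)},
    {"patient_id": 2, "sex": "F", "death_date": None},
    {"patient_id": 2, "sex": "F", "death_date": None},  # duplicated key
]
observations = [
    {"patient_id": 1, "date": date(2015, 2, 1), "kind": "spo2", "value": 97},
    {"patient_id": 1, "date": date(2015, 4, 1), "kind": "spo2", "value": 104},
    {"patient_id": 1, "date": date(2015, 2, 10), "kind": "diagnosis",
     "value": "pregnancy"},
    {"patient_id": 2, "date": date(2016, 1, 5), "kind": "spo2", "value": -3},
]

def domain_violations(obs):
    """Attribute domain constraint: SpO2 must lie between 0 and 100."""
    return [o for o in obs
            if o["kind"] == "spo2" and not 0 <= o["value"] <= 100]

def duplicate_keys(pats):
    """Relational integrity: patient identifiers must be unique."""
    counts = Counter(p["patient_id"] for p in pats)
    return sorted(k for k, n in counts.items() if n > 1)

def records_after_death(obs, pats):
    """State-dependent rule: observations dated after the recorded death."""
    death = {p["patient_id"]: p["death_date"]
             for p in pats if p["death_date"] is not None}
    return [o for o in obs
            if o["patient_id"] in death and o["date"] > death[o["patient_id"]]]

def sex_diagnosis_conflicts(obs, pats):
    """Attribute dependency rule: pregnancy diagnosis for a male patient."""
    male = {p["patient_id"] for p in pats if p["sex"] == "M"}
    return [o for o in obs
            if o["kind"] == "diagnosis" and o["value"] == "pregnancy"
            and o["patient_id"] in male]

print(len(domain_violations(observations)), "out-of-range SpO2 values")
print(duplicate_keys(patients), "duplicated patient identifiers")
print(len(records_after_death(observations, patients)), "records after death")
print(len(sex_diagnosis_conflicts(observations, patients)), "sex/diagnosis conflicts")
```

Each check returns the offending records, not only a count, so that the violation rate can be computed per attribute and legitimate exceptions, such as late toxicology reports, can be reviewed instead of being dropped automatically.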
This type of data quality assessment can guide us in addressing the following questions:
- Is this specific research feasible using the available dataset?
- How difficult and tedious will the data cleaning process be?
- What areas should the data cleaning team prioritize?
Some studies have used this framework to evaluate data quality and then cleaned their data mainly by dropping some of the records or attributes (Dziadkowiec et al. 2016), an approach that, although easy and straightforward, introduced bias into the final results.