Use of EMR/EHR Database for Research and Scientific Discoveries: Procedure and Life Cycle
Electronic medical records (EMRs) are digital versions of the paper charts kept for patients in a hospital, clinic, or clinician's office. An EMR system usually contains patients' medical and treatment histories, which support the clinician's decisions on diagnosis and treatment. EMRs allow clinicians to track patients' data over time, identify and remind patients who are due for preventive checkups and disease screenings, monitor patients, and improve healthcare quality. Electronic health records (EHRs) are much broader than EMRs: in addition to the EMR content, they contain all relevant health data of patients, which may include data from laboratories, specialists, nursing homes, and other healthcare providers. EHR systems are also designed to share a patient's data with all authorized clinicians, caregivers, stakeholders, and even the patients themselves. Thus, a fully functional EHR system enables all authorized healthcare providers to access the latest patient information anywhere and at any time, so that more coordinated and patient-centered care can be provided in a timely manner. At the same time, EHRs serve as documentation for administration and billing purposes. Recently, EHR data have become one of the major sources of real-world evidence used to evaluate treatments, improve diagnosis and healthcare quality, reduce side effects and adverse events of drugs, predict disease risks and treatment outcomes, and optimize and personalize patient care (MIT 2016).
Since EHR data are very complex and noisy, their analysis and interpretation require sophisticated statistical methods and data science techniques, as well as multidisciplinary collaboration between data scientists and domain experts. In addition, a novel data-driven research paradigm and state-of-the-art approaches are necessary to harness a large EHR database and translate it into clinical knowledge for best practice. Based on our experience, and taking a systematic perspective, we summarize the procedure and life cycle for using an EHR database for research and scientific discovery in the following steps:
1. Initiate a project: proposing a research topic with some potential high-impact biomedical/clinical questions or hypotheses
2. Data queries and data extraction
3. Data cleaning
4. Data pre-processing or processing
5. Data preparation
6. Data analysis, modeling and prediction
7. Result validation
8. Result interpretation
9. Publication and dissemination
This procedure is quite similar to the data mining procedure for knowledge discovery in databases (KDD) (McLachlan 2017, Fayyad, Piatetsky-Shapiro, and Smyth 1996, Fernandez-Arteaga et al. 2016, Holzinger, Dehmer, and Jurisica 2014, Mitra, Pal, and Mitra 2002). We explain each of these steps in detail in the following sections.
Initiate a Project
To initiate a project, one should start by proposing a research direction or topic, usually focused on a particular disease, treatment, medication, or other condition of interest. Ideally, the domain-specific clinicians, epidemiologists, or biomedical scientists on the multidisciplinary team initiate the project with some potential biomedical or clinical hypotheses or scientific questions, although these need not be specific at this stage. Since an EHR database usually contains data from a large number of patients and covers many different diseases, treatments, and conditions, it is easy to raise many clinical, biomedical, or epidemiological questions. However, it may not be easy to identify a good question.
What is a good question? Based on our experience, a good research question based on the EHR database should satisfy the following criteria:
- Clinically or scientifically important and high-impact: If the question can be answered or the hypothesis proved or disproved, the results and conclusions should be clinically important and have enough impact to be published in a high-impact journal.
- Appropriate for the available EHR data: The EHR data should be appropriate, or even the best available data, for addressing the proposed question or hypothesis. Sometimes the available EHR data are not good enough, or not the best choice, for the question or hypothesis. Ideally, one can justify that using EHR data is the only way to address the proposed question or hypothesis and that no alternative exists.
- Appropriate and reliable endpoint or outcome data: Such data must be available in, or derivable from, the EHR database for the proposed question or hypothesis. For any clinical or scientific question or hypothesis, appropriate endpoints or outcomes must be defined and identified, and sometimes good biomarkers can be used. It is necessary to confirm that these endpoint or outcome data are available and reliable in the EHR database. For example, to use mortality as the outcome or endpoint for evaluating a disease treatment, the researcher needs to carefully evaluate whether the EHR system reliably captures most of the deaths related to the treatment. For chronic disease treatments, this may not be the case, since the follow-up time in the EHR system is usually not long enough to capture deaths due to the chronic disease.
- Sufficiently large sample size: The sample size (the number of subjects, events, and/or measurements) is usually quite large in an EHR database. However, for a particular question or hypothesis, the subjects must be screened against the inclusion/exclusion criteria, and for questions or hypotheses related to rare diseases or rare events, the sample size may still be an issue. Thus, it is crucial to carefully define the study cohort based on the proposed question or hypothesis and to develop appropriate inclusion/exclusion criteria so that the sample size remains large enough.
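The sample-size criterion can be checked early by counting how many subjects survive each inclusion/exclusion step. The following is a minimal sketch only: the `Patient` fields, the criteria, the thresholds, and the five records are all hypothetical and invented for illustration, not taken from any real EHR schema.

```python
from dataclasses import dataclass


@dataclass
class Patient:
    patient_id: str
    age: int
    has_diagnosis: bool   # hypothetical flag: diagnosis of interest is recorded
    followup_days: int    # hypothetical: length of follow-up available in the EHR


def select_cohort(patients, min_age=18, min_followup_days=365):
    """Apply illustrative criteria in sequence and record the count after each step."""
    attrition = [("all patients", len(patients))]

    kept = [p for p in patients if p.has_diagnosis]
    attrition.append(("has diagnosis of interest", len(kept)))

    kept = [p for p in kept if p.age >= min_age]
    attrition.append((f"age >= {min_age}", len(kept)))

    kept = [p for p in kept if p.followup_days >= min_followup_days]
    attrition.append((f"follow-up >= {min_followup_days} days", len(kept)))

    return kept, attrition


# Made-up records, for illustration only.
patients = [
    Patient("p1", 54, True, 800),
    Patient("p2", 16, True, 900),
    Patient("p3", 61, False, 1200),
    Patient("p4", 47, True, 120),
    Patient("p5", 70, True, 400),
]

cohort, attrition = select_cohort(patients)
for step, n in attrition:
    print(f"{step}: {n}")
```

With these made-up records, the cohort shrinks from five patients to two. An attrition table of this kind shows which criterion removes the most subjects, which is useful when deciding whether a question about a rare disease or rare event is still feasible.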
The types of observational studies include: 1) Case study or case report: a descriptive report on one or a series of special or unique clinical cases, which is a good source for generating hypotheses rather than for exploring associations or cause-effect relationships; 2) Cross-sectional study: data on exposures or intervention/prevention treatments and on outcomes are collected simultaneously from a specified population at a given time point via a survey or other approaches. These studies are typically used to define incidence, prevalence, and associated risk factors for diseases or outcomes, but it is difficult to identify cause-effect relationships with them; 3) Case-control study: using historical data, control subjects are selected to match the cases on important factors so that the two groups can be compared reasonably fairly. This kind of study can be used to compare the effectiveness of exposures, treatments, and prevention/intervention strategies, but caution is needed because of the sample selection bias and potential confounding factors inherent in observational studies; 4) Cohort study: this includes both prospective and retrospective cohort studies, with longitudinal data collected over a period of time. This kind of study can provide some evidence on the cause-effect relationship between the exposure and the outcome; and 5) Real-world data study: data such as EHR data are collected for business operation, reporting, and practice purposes. These data were not collected for research, but they can be used to formulate any of the four types of studies mentioned above. Furthermore, real-world data such as EHR and insurance claim databases can be used to develop accurate predictive models of future disease burdens and outcomes, and for pharmacovigilance to detect rare adverse drug events, owing to their long follow-up periods and large sample sizes.
Note that some questions, such as treatment evaluation or group comparisons, may not be easy to address using retrospective EHR data. This is because patients are not randomized to the comparison groups, and the results can be confounded by many factors, including the severity of the patient's condition, that may not be available in the EHR database. Causal inference approaches may help address this confounding problem to some degree. Thus, in formulating a good study question or hypothesis, one should try to avoid comparisons between groups unless a fair comparison can be justified or supported by the data.
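To illustrate why non-randomized group comparisons can mislead, the sketch below compares a crude recovery rate with severity-stratified rates. The counts are synthetic and were invented so that severity confounds the comparison; the treatment labels `A` and `B` and the two severity strata are assumptions made for this example. Stratification is used here only as the simplest illustration, not as a recommended causal inference method.

```python
# Synthetic counts invented for illustration: (recovered, total) per treatment and severity.
# Treatment A is given mostly to severe patients, treatment B mostly to mild patients.
counts = {
    "A": {"mild": (90, 100), "severe": (300, 500)},
    "B": {"mild": (400, 500), "severe": (50, 100)},
}


def crude_rate(strata):
    """Recovery rate ignoring severity (all strata pooled)."""
    recovered = sum(r for r, _ in strata.values())
    total = sum(n for _, n in strata.values())
    return recovered / total


def stratum_rates(strata):
    """Recovery rate within each severity stratum."""
    return {severity: r / n for severity, (r, n) in strata.items()}


crude = {treatment: crude_rate(s) for treatment, s in counts.items()}
by_severity = {treatment: stratum_rates(s) for treatment, s in counts.items()}

for treatment in counts:
    print(treatment, "crude:", round(crude[treatment], 2), "by severity:", by_severity[treatment])
```

In these synthetic data, treatment B looks better in the crude comparison, yet treatment A has the higher recovery rate within both the mild and the severe stratum. If severity were not recorded in the EHR database, only the misleading crude comparison would be available.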
Identifying a good clinical research question based on the EHR database requires close collaboration between domain clinical/epidemiological scientists and data scientists. Initial data exploration is necessary, and it may take several iterations between the domain-specific clinicians/epidemiologists and the data scientists before a good question or hypothesis is identified. A vague or general question regarding a disease, treatment, or condition can serve as the initial target for data exploration by statisticians and data scientists. The exploration results can then be presented to the clinical or epidemiological experts so that more specific questions or hypotheses are gradually identified. A data extraction summary statistics report (see Chapter 3 for details) may be drafted to serve as a communication channel between the data scientists and the domain-specific clinicians/epidemiologists. The initial clinical or epidemiological question may be modified, improved, or even completely changed based on the data exploration results or during the data analysis stage; sometimes the question or hypothesis is not finalized until the final data analysis is done. Data visualization and descriptive/summary statistics are usually used for the initial data exploration.
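As a minimal sketch of the descriptive statistics used in initial data exploration, the function below summarizes one numeric variable and reports missing values separately, since missingness is itself informative in EHR data. The variable (systolic blood pressure) and its values are hypothetical, and the choice of summary fields is an assumption rather than the format of the report described in Chapter 3.

```python
import statistics


def summarize(values):
    """Summarize a numeric variable; None marks a missing measurement."""
    observed = [v for v in values if v is not None]
    return {
        "n": len(values),
        "missing": len(values) - len(observed),
        "mean": statistics.mean(observed),
        "median": statistics.median(observed),
        "min": min(observed),
        "max": max(observed),
    }


# Hypothetical systolic blood pressure readings (mmHg), for illustration only.
sbp = [120, 135, None, 150, 110, None, 145]
summary = summarize(sbp)
for name, value in summary.items():
    print(f"{name}: {value}")
```

A table of such summaries, one row per candidate variable, gives the clinicians and epidemiologists a concrete basis for narrowing a general question into a specific one.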
Our focus here is on clinical or scientific research projects. One may also initiate a statistical or computational methodology research project based on the EHR database. The procedure or life cycle for a methodology research project is different and will not be discussed in this book.