EHR/EMR Database Design and Structure: Database Queries
Most EHR/EMR database systems are based on relational database design. Oracle and Microsoft SQL server are among the most commonly used Database Management Systems (DBMS) employed by most EHR/EMR vendors. When healthcare data is acquired or purchased from vendors, the data format is usually in text files (or csv files). Similarly, publicly available and free healthcare data often exists in text file formats. For example, when we purchased the 2018 version of the Cerner Health Facts database, we received the EHR data in csv files. Furthermore, some vendors, including Cerner, provide design details of their database structure through data dictionaries and database schema diagrams. The data dictionary document that we received from Cerner explains the data sets and the data elements within the Health Facts, and the database schema defines the database tables, fields and relationships. Using the details of the database design, one can recreate a database system to host and manage healthcare data for research purposes.
3.3.1. Database Construction
In general, a database system development process can be divided into three stages: database design, implementation, and database maintenance. Database design starts with a conceptual data model and produces a specification of a logical schema. The output of this stage is a detailed relational specification, the logical schema, of all the tables and constraints needed to satisfy the description of the data in the conceptual data model. Implementation includes the construction of a database according to the specification of a logical schema. This will include the specification of an appropriate storage schema, security enforcement, external schema and so on. Finally, database maintenance, including operational maintenance, is to ensure that performance of the database system is maintained at an acceptable level, which requires effective communications among the team members to determine and implement changes as needed.
The design and construction of an EHR database system for research purposes should follow the description of the schema document received from the vendor. A traditional relational DB design is most common, but a distributed DB design is also an option for efficient data management.
3.3.2. Traditional Relational Database System
Database systems based on the relational data model are most commonly used in healthcare data management. In a relational model, data is organized in tables, linked with relationships, and controlled by constraints. Famous examples of relational database management systems include Oracle, Microsoft SQL server and PostgreSQL.
The steps for creating your own database system to manage healthcare data include,
- • Setup database environment (database server configuration)
- • Create schema and tables
- • Populate tables with data
- • Operate database server (run server process)
- 3.3.3. Distributed Database System
As database sizes keep rapidly increasing, in order to improve the performance of data manipulation, online analytical processing (OLAP) and online transaction processing (OLTP), the Apache Hadoop was developed as a framework for running applications on large clusters from commodity hardware. Hadoop provides a distributed file system (HDFS) that stores data on compute nodes, and implements a computational paradigm named Map/Reduce, where the application is divided into several small fragments of work, each may be executed or re-executed on any node in the cluster. The Apache HBase, established on Hadoop HDFS, provides the capability to read/write big data randomly in real-time. As an open-source, distributed non-relational database management system, HBase modeled after Google's Bigtable: A Distributed Storage System for Structured Data, developed by Chang et al. (Fay et al., 2006).
Unlike traditional SQL databases (e.g., Oracle, Microsoft SQL, MYSQL, PostgreSQL), the NoSQL databases (e.g., HBase, MongoDB, Cassandra DB) provide more flexible solution on unstructured data (e.g., photos, videos, documents and articles). They are generally document-oriented. In this way, the unstructured data can be stored in a single document that can be easily found, but it is not necessarily categorized into fields in relational databases.
For the Cerner EHR database, we employed the HBase framework mainly for performance concern, since many tables are very large with billions of records. In an HBase Cluster, the query and analysis performance are significantly enhanced. The comparison shows ~7x speed improvement on a two-node cluster.
Define Inclusion/Exclusion Criteria for Data Extraction
EHR/EMR data present a wealth of valuable information for researchers. Tapping into this potential necessitates collaboration between domain experts (i.e., clinicians, healthcare experts and physicians) and data analysts as a starting point in EHR research. This collaboration should stimulate and produce some interesting topics to study, which is not trivial. Most of the time we, as statisticians and data scientists, do not have a clear scientific question in mind. On the other hand, domain experts may have enough scientific knowledge to identify important questions or hypotheses in a biomedical or clinical field, but they may not know whether the data are good enough or analytic methods are available to address the questions or hypotheses. It is essential for domain experts and statisticians/data scientists to closely work together to identify good and important questions and hypotheses. This process sometimes is not easy and may require multiple iterations until a good question or hypothesis can be identified. More discussions on this point can also be found in Chapter 1 (Section 1.2).
Once the questions or hypotheses are determined and the objective of the project is clear, initial data extraction and data exploration can begin. In order to extract the appropriate data for the targeted questions, we need to clearly define inclusion/exclusion criteria for data extraction. The initial inclusion/exclusion criteria for data extraction can be simple but must be refined later after further data exploration and data cleaning. If the identified question or hypothesis is regarding a disease, all the diagnosis codes (e.g., ICD-9 codes) related to this disease need to be identified so that all the relevant data from all the patients with these diagnosis codes can be extracted from the EHR/EMR database. Sometimes this may not be trivial since there are many disease diagnosis codes possibly related to the disease and it may not be easy to identify all the diagnosis codes or ICD-9 codes by keyword search. Some domain knowledge related to the disease is needed to identify all synonyms of the disease. In this case, some natural language processing (NLP) approaches can also be used for this purpose. Sometimes a patient is likely a confirmed case of disease by reviewing his/her records of medications, treatment procedures or lab test results, but the disease diagnosis code or ICD-9 code of the disease did not appear in the patient's record due to missing data or other errors. In this case, more patients related to this particular disease can be possibly identified through medication, procedure and lab test tables. However, it needs domain knowledge to identify the medications, procedures and lab test results, which are almost uniquely related to this disease. Finally, it is necessary to use rigorous procedures and approaches to confirm the cases of disease diagnosis via formal disease phenotyping, which will be discussed in the next subsection.
Some other inclusion/exclusion criteria may also involve patient demographics such as age, gender or race, or the data from other EHR/EMR tables such as medication, procedure, lab test, clinical event, vital sign or microbiology tables. Some of these data or variables are quite accurate and complete, some others may have high uncertainty, high missing or error rate. It is quite challenging to accurately identify the targeted cohort of patients that meet all inclusion/exclusion criteria.