Data Management Framework
Now we have a reliable and secure server with a detailed convention for naming and organizing our files and folders, but a very important issue remains: reproducibility. It is crucial to be able to reproduce the data files, reports, and results of EHR research projects. It is not uncommon for other parts of your organization, or outside organizations, to ask how you produced specific results. Messy data makes this task challenging or even impossible. The steps described in the previous section lay the cornerstone of a data management framework (DMF). First, note that EHR data analysis is a multistep project in which each step usually depends on the data from previous steps (see Figure 2.1 and Chapter 1). This means that each data file or report depends on some other data or report. We can use directed acyclic graphs to visualize these dependencies and to create a semiautomatic framework for reproducing the data and analysis results. To describe this framework, we begin with an example project (Figure 2.9).
To keep things simple, we describe an imaginary sub-project (Mortality Prediction) from the SAH project, which belongs to the Cerner EHR working group. The actual network is bigger and more complicated, but the framework functions the same way for all of them. As you can see, we can arrange the data hierarchy in layers. Each shape in this figure is considered a node.

FIGURE 2.9
An example of data structure in DMF.
The Research Center (CBD_HS), Groups (Cerner), Projects (SAH), and Sub-projects (Mortality Prediction) layers are not data nodes, but they help us keep track of changes and organize the downstream nodes (the nodes lower than a specific node). The Extracted Data (e.g., extracted lab table), Cleaned Data (cleaned lab table), Prepared Data (prepared lab table), Reports (reports 1 and 2), and Papers (paper 1) are considered real data nodes in DMF. Each directional edge (arrow) shows the dependency relationship between the current node and its upstream nodes (nodes at the same level or a higher level). Using the proposed file and folder convention, a program can be written to scan all of the files and folders inside the main folder (CBD_HS) and create this network. Multiple visualization modules are available to draw the network as a two-dimensional graph (with concentric circles as layers) or a three-dimensional graph (with concentric spheres as layers). As you can see, this graph can be treated as a directed acyclic graph. Next, we describe the properties of nodes and edges in DMF (Figure 2.10).
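To make the directed acyclic graph idea concrete, the following is a minimal sketch rather than the DMF implementation. It assumes the dependencies have already been collected into a dictionary that maps each node to the set of its upstream nodes; the node names are hypothetical stand-ins for the data nodes in Figure 2.9, and the helper `update_order` is introduced here only for illustration. It uses the standard-library `graphlib` module (Python 3.9 or later) to confirm that the network has no cycles and to list the nodes so that every node appears after the nodes it depends on.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical node names; each node maps to the set of upstream nodes it needs.
dependencies = {
    "extracted_lab": set(),
    "cleaned_lab": {"extracted_lab"},
    "prepared_lab": {"cleaned_lab"},
    "report_1": {"prepared_lab"},
    "report_2": {"prepared_lab"},
    "paper_1": {"report_1", "report_2"},
}


def update_order(deps):
    """Return the nodes so that each one comes after all of its upstream nodes."""
    try:
        return list(TopologicalSorter(deps).static_order())
    except CycleError as err:
        # args[1] holds the list of nodes that form the detected cycle.
        raise ValueError(f"dependency graph is not acyclic: {err.args[1]}") from err


order = update_order(dependencies)
print(order)
```

An ordering of this kind is one way to decide in which sequence the nodes could be regenerated; a cycle in the submitted dependencies is reported as an error instead of producing an ordering.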
In Figure 2.10, we show the part of the network downstream of the Mortality Prediction sub-project from the SAH project. In this network, each node and each edge has its own properties. The edge property is the code that connects upstream nodes to a downstream node. For example, the edge connecting Cleaned Medication (upstream node) to Prepared Medication (downstream node) has Preparation code 1 (PC1) as its property. Some edge properties are inherent to the network, e.g., the nodes that the edge connects. We do not show these essential properties in the figure. The execution time can be added as a further edge property, so that in the future the admin can calculate a rough estimate of how long it takes to update all of the nodes in the DMF.
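The edge properties described above can be sketched as a small record type. This is an illustration under stated assumptions, not the DMF's actual data model: the `Edge` class, the code label `PC2`, the node names other than the medication example, and all timings are hypothetical. The estimate assumes the codes run one after another and counts a code once even when it is attached to several edges (for example, when one code combines several upstream nodes into a single downstream node).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Edge:
    upstream: str                    # inherent property: node the edge starts from
    downstream: str                  # inherent property: node the edge points to
    code: Optional[str] = None       # e.g. "PC1"; None when no code is attached
    seconds: Optional[float] = None  # last recorded execution time of that code


# Hypothetical edges and timings for illustration only.
edges = [
    Edge("cleaned_medication", "prepared_medication", code="PC1", seconds=120.0),
    Edge("cleaned_lab", "prepared_lab", code="PC2", seconds=300.0),
    Edge("prepared_lab", "report_1"),  # produced by hand, no code recorded
]


def estimated_update_seconds(edges):
    """Rough serial estimate: sum the recorded run time of each distinct code."""
    run_time_by_code = {}
    for edge in edges:
        if edge.code is not None and edge.seconds is not None:
            run_time_by_code[edge.code] = edge.seconds
    return sum(run_time_by_code.values())


print(estimated_update_seconds(edges))
```

Edges without code contribute nothing to the estimate, which matches the idea that those nodes have to be updated manually.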
Similarly, each node has the following property elements associated with it:
Name: Name of the file excluding the date part (last eight characters according to folder and file naming convention). It can be recorded by DMF during the scanning of the CBD_HS folder.
Type: The layer that the node belongs to. Again, this element can be generated by DMF during the scanning of the CBD_HS folder.
Dependencies: The upstream nodes that are needed to generate the current node. The user or users who created the file should inform the project manager or admin about the dependencies. These dependencies should be included in the data submission form (see the appendix).
Code (if applicable): The code that needs to be executed on the dependencies to create the current node. The code or command to run the code should be submitted when the user creates a file and wants to place it in the DMF.
Users: The users involved in generating the current node. This property will be recorded according to the data submission form. The users' email addresses can be retrieved using the User ID.
Date: The last eight characters of the file name according to the folder and file naming convention. It can be recorded by DMF during the scanning of the CBD_HS folder.

FIGURE 2.10
The edge and node properties in DMF.
In Figure 2.10, we show the properties of one node in each layer. Every node in the DMF has a similar property structure and information.
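The node properties listed above could be held in a record such as the one below. This is a sketch under assumptions: the class and function names are hypothetical, the date is assumed to be the eight digits immediately before the file extension, and the example file name and its underscore separator are invented for illustration. Name and Date are derived from the file name, Type is passed in as the layer found during scanning, and the remaining fields would be filled in from the data submission form.

```python
from dataclasses import dataclass, field
from pathlib import PurePath
from typing import List, Optional


@dataclass
class Node:
    name: str                    # file name without the date part
    type: str                    # layer the node belongs to, e.g. "Cleaned Data"
    date: str                    # last eight characters of the file name
    dependencies: List[str] = field(default_factory=list)  # from the submission form
    code: Optional[str] = None   # code or command that creates the node, if any
    users: List[str] = field(default_factory=list)         # User IDs from the form


def split_name_and_date(filename):
    """Split a file name into (name, date), assuming an eight-digit date suffix."""
    stem = PurePath(filename).stem          # drop the extension
    name, date = stem[:-8], stem[-8:]
    if not name or len(date) != 8 or not date.isdigit():
        raise ValueError(f"{filename!r} does not end in an eight-digit date")
    return name.rstrip("_- "), date         # drop an assumed separator


def node_from_file(filename, layer):
    """Create a node with the properties that can be read during scanning."""
    name, date = split_name_and_date(filename)
    return Node(name=name, type=layer, date=date)


# Hypothetical file name following the assumed pattern.
node = node_from_file("cleaned_lab_20190301.csv", "Cleaned Data")
print(node.name, node.date)
```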
You can set up the DMF to update the network in three different ways:
1. Update the network as soon as a file is added to the CBD_HS folder or its sub-folders.
2. Update the network at a pre-defined date and time (e.g., every midnight).
3. Update the network when the DMF administrator runs the update command.

FIGURE 2.11
Scenario 1: adding a newly extracted Procedure data to DMF.
Now we describe how the DMF behaves in the following scenarios.
Scenario 1 (adding a new file): In this scenario, the DMF will scan the CBD_HS folder and its sub-folders and detect the new file. Using the convention, it can decide which layer the new file (new node) belongs to, and it can create the properties of the new node according to the file submission form as described above (Figure 2.11).
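One way to detect a new file is to compare the result of the current scan with the result of the previous one. The sketch below assumes each scan has been reduced to a dictionary from node Name to Date; the function name and the example entries are hypothetical. The same comparison also surfaces files whose date part has changed, which is the trigger for the update scenarios.

```python
def diff_scans(previous, current):
    """Compare two {name: date} snapshots and return (new_names, changed_names)."""
    new = sorted(name for name in current if name not in previous)
    changed = sorted(
        name for name in current
        if name in previous and current[name] != previous[name]
    )
    return new, changed


# Hypothetical snapshots from two consecutive scans of the main folder.
previous_scan = {"extracted_lab": "20190101", "cleaned_lab": "20190105"}
current_scan = {
    "extracted_lab": "20190301",        # same name, different date
    "cleaned_lab": "20190105",          # unchanged
    "extracted_procedure": "20190301",  # not seen before
}

new_files, changed_files = diff_scans(previous_scan, current_scan)
print(new_files, changed_files)
```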
Scenario 2 (updating a file, same data structure): In this scenario, the DMF will scan the CBD_HS folder and its sub-folders, detect the updated date (the last eight characters of the file name), and mark the file as an "Updated" (same structure) node. The DMF will then detect all downstream nodes that are affected by the change and mark them as "Update needed". If a downstream node is generated by code, the DMF can regenerate it; otherwise, the node will remain in "Update needed" status (Figure 2.12). Using the properties of the affected nodes, the DMF will notify all of the affected users (by email or other messaging services). In Figure 2.12, the edges with a white circle on them have associated code.
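The propagation in Scenario 2 can be sketched as a walk over the network in dependency order. The following is an illustration, not the DMF's actual logic. It assumes `deps` maps each node to its upstream nodes and `code` maps a node to the code that generates it; the label "Regenerated", the node names, the code labels, and the user IDs are hypothetical. It also adds one assumption that the text does not state: a node is only treated as regenerable when every affected upstream node is itself up to date, since rerunning code on stale inputs would not help. No code is executed here; the function only plans the statuses.

```python
from graphlib import TopologicalSorter


def propagate_update(deps, code, updated):
    """Return {node: status} for `updated` and every node downstream of it."""
    status = {updated: "Updated"}
    # static_order() yields upstream nodes before the nodes that depend on them.
    for node in TopologicalSorter(deps).static_order():
        if node == updated:
            continue
        affected = [up for up in deps.get(node, ()) if up in status]
        if not affected:
            continue  # the change does not reach this node
        inputs_fresh = all(status[up] != "Update needed" for up in affected)
        can_rerun = code.get(node) is not None and inputs_fresh
        status[node] = "Regenerated" if can_rerun else "Update needed"
    return status


def users_to_notify(status, users):
    """Collect the users attached to every affected node."""
    return sorted({user for node in status for user in users.get(node, [])})


# Hypothetical network: two coded steps followed by a manual report and paper.
deps = {
    "cleaned_lab": {"extracted_lab"},
    "prepared_lab": {"cleaned_lab"},
    "report_1": {"prepared_lab"},
    "paper_1": {"report_1"},
    "cleaned_medication": {"extracted_medication"},
}
code = {"cleaned_lab": "CC1", "prepared_lab": "PC1"}
users = {"cleaned_lab": ["u01"], "report_1": ["u02"], "cleaned_medication": ["u03"]}

status = propagate_update(deps, code, "extracted_lab")
print(status)
print(users_to_notify(status, users))
```

Nodes that the change does not reach, such as the medication branch in this example, do not appear in the result and their users are not notified.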
Scenario 3 (updating a file, new data structure): In this scenario, the DMF will scan the CBD_HS folder and its sub-folders and detect the new file name. If the file is submitted as a new file, the DMF continues as in Scenario 1; but if it is an existing file with a new data structure according to the file submission form (i.e., the downstream code cannot be applied to it), it should be marked as an "Updated" (new structure) node. The DMF will then detect all downstream nodes that are affected by the change and mark them as "Update needed" (Figure 2.13). Using the properties of the affected nodes, the DMF will notify all of the affected users (by email or other messaging services).

FIGURE 2.12
Scenario 2: updating a file (same data structure).

FIGURE 2.13
Scenario 3: updating a file (new data structure).
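For Scenario 3, no downstream code can be applied, so the bookkeeping reduces to finding everything downstream of the changed node. A minimal sketch, assuming `children` maps each node to the nodes that directly depend on it; the function names and node names are hypothetical.

```python
def downstream_of(children, start):
    """Return every node reachable from `start` by following dependency edges."""
    seen = set()
    stack = list(children.get(start, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(children.get(node, ()))
    return seen


def mark_new_structure(children, updated):
    """Status map for a file whose data structure changed."""
    status = {node: "Update needed" for node in downstream_of(children, updated)}
    status[updated] = "Updated (new structure)"
    return status


# Hypothetical network: each node maps to its direct downstream nodes.
children = {
    "extracted_lab": ["cleaned_lab"],
    "cleaned_lab": ["prepared_lab"],
    "prepared_lab": ["report_1", "report_2"],
    "report_1": ["paper_1"],
    "report_2": ["paper_1"],
}

print(mark_new_structure(children, "cleaned_lab"))
```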
This Data Management Framework (DMF) is semi-automatic: it updates the data files and reports wherever code is available to do so. The DMF admin can add a function that uses a hashing function to create a unique ID for each file from its name and date. This type of ID can eliminate the need for long paths in the code and helps keep the data structure well organized.
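A file ID of this kind could be derived as follows. This is a sketch, not a prescribed scheme: the choice of SHA-256, the 12-character truncation, the `|` separator, the function name, and the example path are all assumptions made for illustration. Truncating the digest keeps the ID short but makes collisions possible in principle, so a real registry would need to check for them.

```python
import hashlib


def file_id(name, date, length=12):
    """Derive a short, repeatable ID from a file's name and date."""
    digest = hashlib.sha256(f"{name}|{date}".encode("utf-8")).hexdigest()
    return digest[:length]


# Hypothetical registry that lets code refer to a file by ID instead of its path.
registry = {
    file_id("cleaned_lab", "20190301"): "path/to/cleaned_lab_20190301.csv",
}

print(file_id("cleaned_lab", "20190301"))
```

Because the ID depends only on the name and date, analysis code can look up the path in the registry at run time and does not need to change when folders are reorganized.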