EHR Data Analytics and Predictions: Machine Learning Methods

Machine Learning Overview

The meaning of machine learning (ML) might be different for people from different research and application fields. To simplify, here ML refers to a broad range of statistical and computational methods, theories, and tools for data exploration and analytics. In comparison with traditional statistical methods, ML approaches are primarily data-driven and may not rely on a pre-specified mechanistic model. With the goal of extracting hidden knowledge from data, usually the larger the number of samples available for learning tasks, the better the performance of ML algorithms. Thanks to its power and flexibility, ML becomes a popular choice for researchers and scientists in solving a variety of complex problems involving large datasets with a huge number of variables. In general, ML methods may be classified into several categories: supervised learning, unsupervised learning, semi- supervised learning, and reinforcement learning.

As described in Shobha and Rangaswamy (2018), supervised learning mainly adopts two types of techniques: 1) classification techniques, which are used to predict discrete responses (binary or multi-classes); e.g., the risk category of a patient to develop a certain disease or condition, and 2) regression techniques, which are mostly used to predict continuous responses; e.g., changes in a patient's vital signs (like blood pressure or body temperature). Examples of ML algorithms commonly used for classification and/or regression include к-nearest neighbor (KNN), Naive Bayes, linear models, bagged and boosted decision trees, random forests, support vector machine (SVM), and neural networks.

Nowadays, supervised learning has been broadly employed in the healthcare fields such as the diagnosis of diseases, disease risk prediction, or mortality/outcome prediction. A few examples include researchers using the rich information in EHR database to predict hospitalization due to heart diseases (Dai et al. 2015), forecasting the risk of type 2 diabetes (Mani et al. 2012), or predict 6-month mortality among cancer patients (Parikh et al. 2019). In the previous two chapters, we have discussed statistical methods, neural networks, and deep learning approaches for EHR data analysis and predictions. In this chapter, we focus on some select supervised learning techniques. We briefly describe these supervised learning techniques and then illustrate their usefulness via an EHR application project on mortality prediction for subarachnoid hemorrhage (SAH) patients. We will also compare the performance of these ML methods based on the SAH dataset.

Machine Learning Methods

In order to understand random forest, we need to start by defining the concept of a decision tree. Briefly, a decision tree is a decision-making tool that utilizes a branching, tree-like graph. Let S be the training set of n samples (xj, y:),..., (xn, yn), where Xj e X is the input instances, yt e Y are the class labels associated with each x:. A decision tree is a classifier, h: X -> Y, that predicts the label у of x by making a series of decisions along a tree structure. A typical decision-tree starts with the root node, the algorithm then assigns this node a label according to the majority vote among all attributes over the training set. As we move from the root node to the leaf node, a series of tests must be performed. For each test, we examine the effect of splitting a single branching node. The entropy (or gain) is used to help the algorithm find the best numerical or categorical feature to split from. The algorithm then looks at the improvement associated with the gain from each split; among all possible splits, we choose the one that maximizes the gain or we just stop if further splitting does not result in an increase in the gain. This process can also end if a pre-specified maximum depth of the tree is reached.

An example of a simple decision tree is shown in Figure 10.1. In order to check if a given patient has a certain disease or not, the decision tree first examines the lab result of that patient. If this result is in the normal range, then the tree immediately predicts that the patient does not have the disease and stops there. Otherwise, the tree moves to examine the medication that the patient is taking. If that patient takes a medication related to the disease, the tree predicts that the patient has the disease.

Many algorithms for learning decision trees have been developed, such as ID3 and C4.5 and CART algorithm. For additional reading, we refer the


A simplified example of a decision tree modified from Shalev-Shwartz and Ben-David, 2014.

readers to the work by Hastie, Tibshirani, and Friedman (2009). Here we will look into ID3, which is a commonly used decision tree algorithm (Figure 10.2). Specifically, ID3 is built upon the concept of recursive calls as follows:


Pseudocode of ID3 algorithm modified from Shalev-Shwartz and Ben-David (2014).

ID3 takes a training set S and a feature subset A as input. The algorithm calculates the Gain(S, i), which measures the probability of incorrectly classifying a randomly-chosen element in the dataset if it was randomly labeled according to the class distribution in the dataset. Then, the algorithm evaluates the gain of a split of the tree according to the ith feature and chooses the best split by maximizing the gain. In classification, we calculate the gain using the Gini impurity or entropy.

The decision tree is good for a dataset with a small number of features. However, for a dataset with a large feature space, algorithms like ID3 can return a large size and highly complex tree. Such a tree may predict a training data set well but perform poorly on a testing set, and thus incur the overfitting problem. To address this problem, the random forest (RF) algorithm was introduced by Breiman (2001). RF works by constructing an ensemble of trees. The prediction of the random forest is obtained by a majority vote over the predictions of the individual trees (Shalev-Shwartz and Ben-David 2014) (Figure 10.3).

Specifically, given a training data set S = (xj, yf), ..., (x„„ ym), we will implement a random forest algorithm as follows: First, we sample with a replacement on the training set S to get smaller subsets; then, we train a classification tree algorithm (such as ID3) on all those extracted subsets; Eventually, after we get all the predictions from individual trees, we take the majority vote to get the final prediction. Therefore, the training algorithm for random forests applies the general technique of bagging, or by bootstrapping sampling from the training set uniformly and with the replacement, to decision trees. (James, Witten, and Tibshirani 2013) This bagging method decreases the variance of the model and does not increase the bias. Hence, it helps in preventing the problem of overfitting from decision trees as mentioned earlier.


A simplified structure of random forest algorithm modified from Subramaniam and Kaur (2019).

Extremely Randomized Tree

Extremely randomized tree (ERT) has been introduced by Geurts, Ernst and Wehenkel (2006). ERT is a new tree-based algorithm that introduces randomness totally or partially on splits. In terms of computational cost, ERT can be more efficient than the random forest. When a large number of attributes are involved, the computational efficiency of ERT is even attractive and thus might be a suitable tool for EHR data analysis (for instance, one patient could have hundreds or even thousands of laboratory tests, clinical procedures, and/or medications; therefore, the data dimension is often extremely high). There are two main differences that distinguish random forest from ERT: i) for the ERT algorithm, each split is chosen at random, while each split is optimal in the random forest method. Consecutively, the trees built in ERT are more diverse, ii) each tree in ERT is built by using the entire learning sample while in the random forest, each decision tree is built by a bootstrapped learning sample.

The variance of a single tree can be high and using ensemble methods could reduce such variance. Among the ensemble methods, the amount of reduction in variance introduced by bagging is less than that of random forest and ERT because averaging highly correlated trees (as in the bagging approach) will have less variance reduction than averaging uncorrelated trees. The ERT has a tendency to strongly reduce variance at the cost of increasing bias. In practice, one needs to find the best balance between variance and bias, which is called bias and variance trade-off. Overfitting is a common problem for decision tree models, especially when a tree has many levels. To address this issue, the recommended approach is to use the bagging method described earlier in the section of random forest.

Gradient Boosting

What is boosting? Simply put, it is a committee-based procedure like bagging. In the bagging approach, we calculate the average of all the predictions from bootstrap sampling. Therefore, bagging is parallelizable. However, boosting is a sequential method and cannot be parallelized in the same way as bagging (i.e., the residuals from the previous iteration are inputs to the current iteration). It takes the outputs of many weak classification models (usually decision trees) and combines them to form a more powerful "committee".

In gradient boosting, we build an ensemble of decision trees in a stage- wise fashion and use a functional gradient descent (FGD) procedure to minimize the loss. In FGD, we parametrize the trees and reduce the residual loss at each step by moving in the direction suggested by the gradient descent algorithm. (Hastie, Tibshirani and Friedman 2009).

Gradient-boosting(S, L, M)

Input: Training set S = {(*;,Уг)}?=1> loss function Цу, F(x)), number of iterations M


  • 1. Initialize model with Fo(x) = argmin £”=1 L(y;, y)
  • 2. From m = 1 to M:
  • 1. Compute the residuals:

ri = 1.....»


  • 2. Fit a base learner (such as decision tree) hm(x) to the residuals for training.
  • 3. Compute multiplier ym by solving the optimization function:

Ym = argmin £?=iL(yi, + yV(*i))

4. Update the model:

Fm(x) — Fm_ j(x) + Ym^miP^i)

Output: FM(x)


Pseudocode of gradient boosting modified from Hastie, Tibshirani, and Friedman (2009).

In the pseudo-code in Figure 10.4, it can be seen that by inserting different loss criteria L(y, f(x))Uy, fix)), we can do a customized implementation for gradient boosting.


Boosting is a very popular and effective machine learning method and is usually employed as a benchmark approach. Recall that boosting is an ensemble method, but instead of training all models parallelly, it builds models in a sequential manner such that the new models are built to correct the errors made by its predecessor.

Before getting into the details, it is necessary to point out that gradient tree boosting is the same as Gradient Boosting Machine (GBM) or gradient boosted regression tree (GBRT), and XGBoost is the nickname for eXtreme Gradient Boosting. XGBoost was proposed by Tianqi Chen in a research project as part of the Distributed (Deep) Machine Learning Community (DMLC) group (Chen et al. 2015). Boosting can be very slow in training time because it needs to sequentially train and build models; hence, such methods are not easily scalable. Thus, XGBoost focuses on both execution speed and model performance. XGBoost can do parallel tree construction and out-of-core computation for big datasets that do not even fit in available memory. Technically, XGBoost is a regularized method so it can control overfitting and may have better performance. In addition, XGBoost introduces a new sparsity aware algorithm, learning from the data in the optimal way to deal with missing values and to handle different kinds of sparsity patterns in a unified manner (e.g., the sparsity due to missing values, frequent zero entries, or artifacts caused by one-hot encoding). With many attractive features introduced into regular gradient boosting, XGBoost provides a parallel tree boosting solution to many data science problems in an efficient way.

Support Vector Machine (SVM)

The SVM algorithm was introduced for the first time in the "Statistical learning theory frameworks" paper by Vapnik (1995). Cortes and Vapnik (1995) later used SVM for binary classification problems. Later, the SVM framework has been significantly extended to various problems, including regression, ranking and clustering.

In many machine learning tasks, we often face the problem of sample complexity as the dimensionality of the feature space goes higher. The SVM algorithm tackles the sample complexity problem by finding the half-space separators.

We can look at the simple example in Figure 10.5, from which we can tell that the blue and red hyperplanes together separate the example data points of two different clusters (in green and yellow colors, respectively). We call those hyperplane lines the half-space separators.

In order to pick the optimal half-space, we need to maximize the margin of two clusters. From the example above, we would prefer the blue


A simplified example of SVM modified from Shalev-Shwartz and Ben-David (2014).

hyperplane since the blue line has a larger margin which means better separation. The margin is defined to be the minimal distance between a point in the training set and the hyperplane.

Hard-SVM is the learning algorithm that will run through the training set to find and return a half-space with the largest margin (a hyperplane that best separates the training set into predefined classes). The pseudo-code for hard SVM is described in Figure 10.6.

Intuitively, w is the margin and b is the bias, hard-SVM searches for the w of a minimal norm among all the vectors that separate the data, for which | (w, Xj) + b | > 1 for all i. From there, the output of hard-SVM is indeed the separating hyperplane with the largest margin. In other words, finding the largest margin half-space becomes the problem of finding w whose norm is minimal.

Hard-SVM formulation requires a strict condition that the training set is linearly separable, which is usually not the case for real-world data. When the training set is homogenous but not linearly separable, soft-SVM is more suitable. We can view soft-SVM as a less restrictive version of Hard-SVM that can work even if the training set is not linearly separable.

As described in Shalev-Shwartz, and Ben-David (2014), the optimization function in Equation (10.1) in Figure 10.6 enforces the hard constraint yf{w, Xj) + b) z 1 for all i. To address this, soft-SVM allows this constraint to be relaxed at some data points in the training set. This can be modeled by introducing non-negative slack variables, £ with i - 1, which

measures to what extent the constraint у ((w, x,) + b) £ 1 is being violated. Then, the algorithm replaces each constraint y; ((w, x,) + b) z 1 by i/fizo, X,-) + b) 2 1 — as in function (10.2) in Figure io.7. Soft-SVM jointly minimizes the norm of w (e margin) and the average of (the violations of the constraints). A parameter X controls the tradeoff between these two terms. The pseudo-code for soft-SVM is as in Figure 10.7.

We see that a Soft-SVM algorithm uses regularized loss minimization for Equation (10.2) in Figure 10.7 so that it can penalize not only training errors. Therefore, soft-SVM is often used to deal with a homogenous half-space. For further details, please refer to Shalev-Shwartz and Ben-David (2014).


Pseudocode of hard SVM modified from Shalev-Shwartz and Ben-David (2014).


Pseudocode of soft SVM modified from Shalev-Shwartz and Ben-David (2014).

For non-linear and inseparable data, we need to use the Kernel trick to help SVM's to form nonlinear boundaries. How does the Kernel trick work? First, the original data are passed through non-linear maps to form new data in a high-dimensional space. Then, we perform a function to take a dot product of the data after doing non-linear mapping on it. This function is the kernel function (Jakkula 2011). Different SVM algorithms use different types of kernel functions such as linear, nonlinear, sigmoid, polynomial, and radial basis function (RBF). The most popular one is RBF since it localizes and limits the response along the entire x-axis. (Hill and Lewicki 2006).

< Prev   CONTENTS   Source   Next >