Continuous Endpoints: HbA1c, BMI

Because diabetic patients often suffer from obesity, BMI is an important index of diabetes progression. To evaluate the validity of the proposed network structure, we examine its ability to predict BMI from the predictors.

Although it is a typical regression problem, predicting BMI is intrinsically difficult because of the amount of missingness in this data set and the noisy nature of EHR data. Therefore, several methods are included for comparison: ordinary least-squares (OLS) regression, random forest, a simple deep neural network (DNN), and the proposed CNN with TF-IDF weighted text. OLS and random forest are typical regression methods in the statistics and machine learning communities; limited by the scope of this study, we do not cover their details. The simple DNN shares the same set of predictors with quarterly repeated measures. The only difference is that the DNN does not impose any convolutional layers on the data; rather, it uses a flattened vector of length 819 (21 x 39) as the input layer, followed by 10 fully connected layers with 400 neurons each. The dropout rate is the same as in the CNN approach. Table 8.1 shows the performance of these methods in terms of R-squared based on fivefold cross-validation.
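The comparison above can be sketched with scikit-learn. This is a minimal illustration on synthetic stand-in data, not the study's actual cohort; the DNN is approximated here by a small `MLPRegressor`, which mimics only the fully connected shape of the chapter's 10-layer, 400-neuron network (it has no dropout layers, and the CNN variants are not reproducible without the real architecture).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-in for the cohort: 21 predictors over 39 quarters,
# flattened to vectors of length 819 (21 x 39), as the DNN baseline uses.
n_patients = 200
X = rng.normal(size=(n_patients, 21 * 39))
# Noisy BMI-like target driven by a few predictors plus noise.
y = 25.0 + X[:, :5].sum(axis=1) + rng.normal(scale=2.0, size=n_patients)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
models = {
    "OLS": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=0),
    # Small stand-in for the chapter's DNN (10 layers of 400 neurons with
    # dropout); this only mimics the fully connected structure.
    "DNN (stand-in)": MLPRegressor(hidden_layer_sizes=(50, 50),
                                   max_iter=50, random_state=0),
}

# Fivefold cross-validated R-squared for each method, as in Table 8.1.
results = {name: cross_val_score(m, X, y, cv=cv, scoring="r2")
           for name, m in models.items()}
for name, scores in results.items():
    print(f"{name}: mean R-squared = {scores.mean():.3f}")
```

On real EHR data the random forest's robustness to noise typically shows up here as a noticeably higher cross-validated R-squared than OLS.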

[Table 8.1 R-Squares of Various Methods: Random Forest; Deep Neural Network; Convolutional Neural Network; CNN with TF-IDF weighted Text]

Table 8.1 demonstrates the difficulty of predicting BMI for diabetic patients: all methods achieve R-squared values of around 0.5. The classic DNN does not beat the random forest, which is extremely robust in noisy settings. The proposed CNN approach, however, enjoys a 2% improvement, which implies that the trajectories of the predictors' values contribute to the prediction beyond the values themselves. In addition, adding the text information yields a further edge of 1%. This verifies the hypothesis that the text contains extra information about the patients, but the marginal improvement can hardly justify the increased training time and the complexity of the model.
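One plausible way to fold clinical text into the feature set is sketched below, with hypothetical notes and stand-in numeric features; the chapter's actual fusion of the weighted text with the CNN may differ. TF-IDF down-weights terms that appear in every note and up-weights note-specific terms, and the resulting vectors can be concatenated with the flattened quarterly predictors:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical clinical notes, one per patient; the real notes come from
# the EHR and are far longer and noisier.
notes = [
    "patient reports weight gain and elevated glucose",
    "glucose stable, counseled on diet and exercise",
    "obesity noted, metformin dose increased",
]

# TF-IDF weighting of the note vocabulary.
tfidf = TfidfVectorizer()
text_features = tfidf.fit_transform(notes).toarray()

# Stand-in for the flattened 21 x 39 quarterly measures per patient.
numeric = np.random.default_rng(0).normal(size=(len(notes), 21 * 39))

# Concatenate weighted text with the numeric predictors to form the
# augmented input of the "CNN with TF-IDF weighted text" variant.
combined = np.hstack([numeric, text_features])
print(combined.shape)  # (3, 819 + vocabulary size)
```

Because TF-IDF produces one weight per vocabulary term, the text block widens the input by the vocabulary size, which is part of why the 1% gain comes at a real cost in model size and training time.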
