Natural Language Processing

In the healthcare industry, much clinical information exists as written text, often in large volumes: laboratory reports, physical examination reports, operative notes, discharge summaries, and so on. This text is usually unstructured and cannot be interpreted directly by computer programs, which need special models to process it (Luo et al. 2016). Natural Language Processing (NLP) addresses this problem by identifying, in the patient notes, keywords that are relevant to a disease according to existing databases, thereby enriching the structured data that supports clinical decision making.
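
The keyword identification described above can be sketched as a simple lookup against a keyword database. This is a minimal illustration, not a clinical NLP pipeline: the `DISEASE_KEYWORDS` table, its terms, and the function name `extract_keywords` are all assumptions made for the example.

```python
import re

# Hypothetical keyword database (disease -> suggestive terms).
# The terms are illustrative only, not a validated clinical vocabulary.
DISEASE_KEYWORDS = {
    "diabetes": {"hyperglycemia", "insulin", "polyuria"},
    "sepsis": {"fever", "hypotension", "tachycardia"},
}

def extract_keywords(note, keyword_db=DISEASE_KEYWORDS):
    """Return, for each disease, the sorted keywords found in the note."""
    # Lowercase the note and split it into alphabetic tokens.
    tokens = set(re.findall(r"[a-z]+", note.lower()))
    found = {}
    for disease, terms in keyword_db.items():
        hits = sorted(terms & tokens)
        if hits:
            found[disease] = hits
    return found

note = "Patient presents with fever and hypotension. Started on insulin."
structured = extract_keywords(note)
print(structured)
```

The resulting dictionary is one way the free text could be turned into structured fields; real systems would also need to handle negation, abbreviations, and spelling variants, which this sketch ignores.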

Naïve Bayes

The Naïve Bayes classifier is a probabilistic method used for categorizing text, that is, for predicting the category to which a document belongs. It assumes that, given the class, each feature is unrelated to the other features. Under this independence assumption, every feature contributes independently to the probability that the document belongs to a certain category. It is one of the most efficient probabilistic classification algorithms and has been applied successfully to many medical problems.
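
The following is a minimal sketch of a Naïve Bayes text classifier, written to show how each word contributes independently to the score of a category. It assumes a multinomial word-count model with add-one (Laplace) smoothing, which the text above does not specify; the training notes and category names are invented for illustration.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(documents):
    """documents: list of (list_of_words, label) pairs."""
    class_counts = Counter()            # number of documents per class
    word_counts = defaultdict(Counter)  # word frequencies per class
    vocabulary = set()
    for words, label in documents:
        class_counts[label] += 1
        word_counts[label].update(words)
        vocabulary.update(words)
    return class_counts, word_counts, vocabulary

def predict(model, words):
    """Return the class with the highest log posterior score."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log P(class)
        total_words = sum(word_counts[label].values())
        for word in words:
            if word not in vocabulary:
                continue  # skip words never seen in training
            # log P(word | class), add-one smoothing (an assumption here)
            count = word_counts[label][word]
            score += math.log((count + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Invented toy training set.
training = [
    ("chest pain shortness breath".split(), "cardiology"),
    ("irregular heartbeat chest pain".split(), "cardiology"),
    ("skin rash itching".split(), "dermatology"),
    ("dry skin rash redness".split(), "dermatology"),
]
model = train_naive_bayes(training)
print(predict(model, "chest pain".split()))
print(predict(model, "itching rash".split()))
```

Because the per-word log probabilities are simply added, the independence assumption is visible directly in the inner loop.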

Deep Learning

Deep learning belongs to the machine learning family and is based on artificial neural networks; a deep network is a neural network with an increased number of layers. Compared with traditional machine learning algorithms, deep learning algorithms can learn more complex non-linear patterns in the data. The modules are pipelined and trainable. The approach is scalable and can perform automatic feature extraction from the data.
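
To make the idea of stacked layers and non-linear patterns concrete, here is a small sketch of a two-layer network that computes XOR, a pattern no single linear layer can represent. The weights are hand-picked for illustration rather than learned, and the helper names (`dense`, `xor_network`) are assumptions of this example.

```python
def relu(x):
    """Rectified linear unit: the non-linearity between layers."""
    return max(0.0, x)

def identity(x):
    return x

def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(w . x + b) for each unit."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

# Hand-picked weights (illustrative, not the result of training).
HIDDEN_W = [[1.0, 1.0], [1.0, 1.0]]
HIDDEN_B = [0.0, -1.0]
OUTPUT_W = [[1.0, -2.0]]
OUTPUT_B = [0.0]

def xor_network(x1, x2):
    """Pipeline of two layers: hidden (ReLU) then output (linear)."""
    hidden = dense([x1, x2], HIDDEN_W, HIDDEN_B, relu)
    return dense(hidden, OUTPUT_W, OUTPUT_B, identity)[0]

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor_network(a, b))
```

In practice the weights would be found by training, and real networks have many more units and layers; the sketch only shows how pipelined layers with a non-linearity extend what a linear model can express.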

In healthcare applications, these algorithms handle both machine learning and language processing tasks. The predominantly used deep learning algorithms are convolutional neural networks, deep belief networks, the multilayer perceptron, and recurrent neural networks. Deep learning remains one of the most effective classification approaches and has been used successfully to address many healthcare-related problems, such as healthcare report classification and journal classification.

Convolutional Neural Network

Convolutional neural networks (CNNs) were developed to handle high-dimensional data, that is, data with a large number of features. As proposed by LeCun and Bengio (1995), the inputs are the pixel values of images after normalization. Convolutional networks were inspired by biological processes: the connectivity pattern between neurons resembles that of the visual cortex, where individual cortical neurons respond to stimuli only in a restricted region of the visual field (the receptive field). The receptive fields of different neurons overlap, however, so the whole visual field is covered. The CNN passes the weighted pixel values of the image through the convolution layers, and sampling is done in the subsampling layers. The final output is a nested function of the input values, obtained by applying the layers one after another.
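
The two layer types mentioned above can be sketched in a few lines. This is a minimal illustration in plain Python, assuming a single channel, no padding, a hand-picked edge-detecting kernel, and average subsampling over 2 x 2 blocks; none of these choices come from the text, and the kernel would normally be learned.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image (no padding) and sum the
    weighted pixel values at each position."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(
                image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh)
                for dj in range(kw)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

def subsample(feature_map, size=2):
    """Average each non-overlapping size x size block."""
    out_h = len(feature_map) // size
    out_w = len(feature_map[0]) // size
    return [
        [
            sum(
                feature_map[i * size + di][j * size + dj]
                for di in range(size)
                for dj in range(size)
            ) / (size * size)
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

# Toy 5 x 5 image, already normalized to 0..1, with a vertical edge.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
# Hand-picked kernel that responds to left-to-right intensity increases.
kernel = [[-1, 1], [-1, 1]]

feature_map = convolve2d(image, kernel)  # 4 x 4
pooled = subsample(feature_map)          # 2 x 2
print(feature_map)
print(pooled)
```

Each output value of `convolve2d` depends only on a small patch of the image, which corresponds to the restricted, overlapping receptive fields described in the text.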

Phenotyping Algorithms

Phenotyping algorithms are applied to EHR data, usually collected from healthcare units, to identify patients with a given disease. The data may be unstructured and contain large amounts of text from physicians' reports, diagnostic results, and vital signs. A phenotyping algorithm is a special kind of model that combines many types of medical data points, such as specific codes for radiology results and billing, with natural language processing, which extracts information from physicians' free text. Machine learning algorithms such as support vector machines can be applied to identify arthritis in combination with patients' prescription records, improving the accuracy of predictive models of disease. For example, the prevalence of diabetes among patients can be suggested by examining the usage of hypoglycemic agents recorded in the prescription records.
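
The diabetes example can be sketched as a simple rule over prescription records. This is a hedged illustration only: the record layout, the patient identifiers, and the short `HYPOGLYCEMIC_AGENTS` list are assumptions for the example, and a real phenotype definition would combine several data sources.

```python
# Hypothetical, incomplete list of hypoglycemic agents used as the rule.
HYPOGLYCEMIC_AGENTS = {"metformin", "insulin", "glipizide"}

def suggest_diabetes(prescriptions):
    """prescriptions: dict mapping patient ID -> list of drug names.
    Return the sorted IDs of patients prescribed at least one
    hypoglycemic agent."""
    return sorted(
        patient_id
        for patient_id, drugs in prescriptions.items()
        if any(drug.lower() in HYPOGLYCEMIC_AGENTS for drug in drugs)
    )

# Invented prescription records.
records = {
    "p001": ["Metformin", "Lisinopril"],
    "p002": ["Ibuprofen"],
    "p003": ["Insulin"],
}
flagged = suggest_diabetes(records)
prevalence = len(flagged) / len(records)
print(flagged, round(prevalence, 2))
```

A rule like this only suggests the condition; it would typically be checked against diagnosis codes or clinical notes before a patient is assigned the phenotype.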

Use Cases

  • 1. Automated Trigger: an automated trigger for sepsis clinical decision support has been created using ML. It extracts text and vital signs to predict this life-threatening infection in patients. Natural Language Processing (NLP) is used to extract the data from the clinical text. The area under the curve (AUC) was found to be 0.667 without NLP and 0.86 with NLP. The accuracy of the model was also seen to increase when the language processing algorithm was used.
  • 2. Patient Risk Prediction: this is an important process because it supports decision making and assists physicians in making valuable predictions. The predicted test result values are used to check whether the treatment a patient has undergone is useful. It has been found that 97% of the predictive rules used appear more sensible when NLP is used. In other cases, the physician's own predictive ability is seen to be poor. For instance, oncologists were found to achieve only 20% accuracy when predicting the survival rates of ill patients.
  • 3. Cohort Building: this can be done by leveraging the oncology department's electronic health record data. A demonstration for non-small cell lung cancer was carried out using both structured and unstructured data. Using both types of data, 8,324 patients were found to be affected with non-small cell lung cancer. Of these 8,324 patients, more than 2,000 were found in the cohort formed from structured data. In addition, 1,090 further patients would be included in the cohort if only the structured data were used. It was found that more than 1,000 patients did not match the parameters of the study. Hence, only the more than 2,000 patients affected with non-small cell lung cancer were found to be the true cohort that can be used for analysis. This analysis highlights the importance of analyzing both structured and unstructured data.
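
The first use case compares models by their area under the curve (AUC). As a reminder of what that number measures, the sketch below computes AUC as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The labels and scores are invented for illustration and are unrelated to the sepsis study.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.
    labels: 1 for positive cases, 0 for negative cases.
    scores: model outputs, higher meaning more likely positive."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))

# Invented example: two positive and two negative cases.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]
print(auc(labels, scores))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is why a rise from 0.667 to 0.86 indicates a substantially better model.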

Conclusion

This chapter has outlined the pressing need to improve how structured, semi-structured, and unstructured healthcare data are stored, analyzed, and interpreted. Although powerful analysis tools already exist to help analysts, a lack of standardization continues to impede the overall process. Machine learning, language processing, and artificial intelligence have the potential to streamline the way unstructured data are utilized, but we tend to overlook the fact that machines are then making critical decisions instead of the physicians who traditionally made them. Regardless, all patients should expect and look forward to improved health outcomes as technological advancements continue to improve the way health data are used. Thus, this chapter has elaborated on the different forms of healthcare data with examples of relevant algorithms and use cases, helping readers understand the basic concepts of healthcare data analysis.

References

Asif, Muhammad, H. F. M. C. M. Martiniano, A. M. Vicente, and F. M. Couto "Identifying disease genes using machine learning and gene functional similarities, assessed through Gene Ontology". PLoS One 13(12) (2018): 12.

Ba, Mohan, and H. Sarojadevi "Disease diagnosis system by exploring machine learning algorithms". International Journal of Innovations in Engineering and Technology 10(2) (2018): 14-21.

Chen, Min, et al. "Disease prediction by machine learning over big data from healthcare communities". IEEE Access 5 (2017): 8869-8879.

Istrate, D., E. Castelli, M. Vacher, L. Besacier, and J.-F. Serignat "Information extraction from sound for medical telemonitoring". IEEE Transactions on Information Technology in Biomedicine 10(2) (2006): 264-274.

Jutel, Annemarie "Classification, disease, and diagnosis". Perspectives in Biology and Medicine 54(2) (2011): 189-205.

Kaur, Harleen, and Siri Krishan Wasan "Empirical study on applications of data mining techniques in healthcare". Journal of Computer Science 2(2) (2006): 194-200.

LeCun, Y., and Y. Bengio "Convolutional networks for images, speech, and time series". The Handbook of Brain Theory and Neural Networks 3361(10) (1995).

Luo, Jake, M. Wu, D. Gopukumar, and Y. Zhao "Big data application in biomedical research and health care: A literature review". Biomedical Informatics Insights 8 (2016): BII-S31559.

Petkovic, M., and W. Jonker "An overview of data models and query languages for content-based video retrieval". (2000).

Razia, Shaik, et al. "A review on disease diagnosis using machine learning techniques". International Journal of Pure and Applied Mathematics 117(16) (2017): 79-85.

Rosenbloom, S. Trent, et al. "Data from clinical notes: A perspective on the tension between structure and flexible documentation". Journal of the American Medical Informatics Association JAMIA 18(2) (2011): 181-186.

Sharmila, S., C. Leoni, Dharuman, and P. Venkatesan "Disease classification using machine learning algorithms-A comparative study". International Journal of Pure and Applied Mathematics 114(6) (2017): 1-10.

Vinitha, S., et al. "Disease prediction using machine learning over big data". Computer Science & Engineering: An International Journal (CSEIJ) 8 (2018): 1.

Wu, Honghan, et al. "SemEHR: A general-purpose semantic search system to surface semantic data from clinical notes for tailored care, trial recruitment, and clinical research". Journal of the American Medical Informatics Association JAMIA 25(5) (2018): 530-537.

