Extraction of Medical Entities Using a Matrix-Based Pattern-Matching Method

Ruchi Patel and Sanjay Tanwani


A massive number of research papers on disease treatment, prevention, and diagnostics have been published. Medical text data are the primary source of information for biomedical study and research. However, these research papers are scattered across a huge medical informatics literature that has been published by specialist doctors. It is difficult for doctors to read all of these publications and discover new knowledge. All of this information needs to be gathered in a single place so that a specialist doctor can obtain guidance on the most effective treatment and prevention. Health care professionals keep patient details, such as past medical history, signs and symptoms of diseases, tests and treatments, and medication, in clinical records like discharge summaries and patients’ prescriptions. These clinical records take the form of unstructured or semi-structured text. Extracting medical knowledge from an unstructured clinical dataset is a real challenge.

The Center of Biomedical Computing, named i2b2 (Informatics for Integrating Biology and the Bedside), has organized several challenges in natural-language processing (NLP) for extracting useful knowledge from clinical texts. In the 2010 challenge, one task focused on extracting clinical concepts such as problems, tests, and treatments from clinical records [1]. Medical concept detection is also called clinical entity recognition. Named entity recognition is an essential part of clinical NLP, because it is a crucial step in extracting knowledge from semi-structured or unstructured texts. Named entity recognition has two phases: entity boundary detection and entity type classification. Many systems have been developed for recognizing clinical entities using a machine learning approach [1], a rule-based approach [2], a dictionary look-up method [3], and a hybrid method [4]. Entity boundary detection is a sequence labeling problem, commonly handled with a BIO model, where B refers to “beginning,” I to “inside,” and O to “outside.” Other models are IOBW and IOBEW, where E refers to “end” and W refers to “single word” [5]. Some chunkers, like openNLP Chunker, Peregrine [6], Tree Tagger, and MetaMap [7], are available, but they do not detect boundaries precisely. MetaMap is an information extraction tool for medical texts that also uses the chunking method, but its precision and recall are both lower than those of other approaches.
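As a small illustration of the BIO model described above, the following Python sketch labels the tokens of a sentence given the token span of one entity. The function name, the span format (start inclusive, end exclusive), and the trailing words "was placed" are assumptions made for the example and do not come from the i2b2 data.

```python
def bio_tags(tokens, entity_spans):
    """Label every token B (beginning), I (inside), or O (outside).

    entity_spans is a list of (start, end) token indices, end exclusive.
    This span format is an assumption made for the illustration.
    """
    tags = ["O"] * len(tokens)
    for start, end in entity_spans:
        tags[start] = "B"                 # first token of the entity
        for i in range(start + 1, end):
            tags[i] = "I"                 # remaining tokens of the entity
    return tags


# The entity is one of the multiword examples from this chapter;
# "was placed" is invented context.
tokens = ["a", "permanent", "dual", "chamber", "rate",
          "responsive", "pacemaker", "was", "placed"]
tags = bio_tags(tokens, [(0, 7)])
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

The boundary problem discussed in this chapter amounts to deciding where the B and the last I of such a sequence fall.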

A single-word entity is easy to find and classify, but detecting the boundaries of a sequence of entities is still a vital issue in clinical text processing. Examples include “oxycodone-acetaminophen,” “saphaneous vein graft -> posterior descending artery,” and “a permanent dual chamber rate responsive pacemaker.” The proposed work explores the problem of correct boundary identification of clinical concepts. The proposed system uses Part of Speech (POS) as a feature for training the model. The system is based on a matrix model and performs multi-pattern matching. The matched patterns are converted to their corresponding words and mapped to the Unified Medical Language System (UMLS) for entity classification. The rest of this chapter is organized as follows. Section 8.2 illustrates the background of clinical named-entity recognition. Section 8.3 describes the proposed method and dataset. Section 8.4 presents system evaluation. Section 8.5 provides the experimental results and a discussion, and Section 8.6 concludes and indicates some new directions for further research.


Many NLP challenges have been organized around medical concept extraction tasks, such as the i2b2 2010 challenge for clinical notes [8], the ShARe/CLEF eHealth Shared Task [9], the BioNLP/NLPBA 2004 shared task with the GENIA dataset for identifying different biomedical entities [10], and the BioCreAtIvE challenge [11] for recognizing biological concepts, such as gene mention identification [12].

In previous work on clinical entity identification, different methods have been used, such as dictionary look-up, rule-based methods, and machine learning. The dictionary look-up method was used in [3], in which the authors identified clinical entities using dictionaries compiled from the corpus, experimented on the i2b2 2010 dataset, and obtained an average F score of 48% for the Beth dataset and 50% for the Partners dataset. The rule-based method was used in [13, 14], in which rules are created based on corpus words and word occurrences; words are then found in the corpus and mapped to a corresponding category, giving a 42% F score. Machine-learning-based approaches like SVM (support vector machine) and CRF (conditional random field) have been used for entity boundary identification and entity classification [1, 15], based on the beginning, inside, and outside (BIO) model for sequence labeling. An unsupervised approach has also been used to extract named entities from biomedical texts [16], in which the authors developed a noun phrase chunker followed by a filter based on inverse document frequency. The classification of multiword entities is carried out using the concept of distributional semantics.

A number of systems have been developed for medical entity recognition, such as MetaMap, which is used for concept extraction [17]. It uses UMLS for medical-term identification [18], but its results are worse than those of other approaches. MetaMap 2013v2 [13] gave a 40% F score and MetaMap 2010 [6] gave a 21.8% F score; previous versions [7] gave an average 15.5% F score. Other systems have been used for entity boundary identification, like Peregrine (F score 46.8%), OpenNLP chunker (F score 70.0%), and StanfordNer (F score 76.8%) [6]. Much previous research has addressed entity boundary detection in medical data. In [5, 19], the authors used the OpenNLP chunker for named entity recognition (NER) with IO, IOB, and IOBW models and, to obtain correct boundaries, applied post-processing with boundary adjustment rules. In [7], the authors compared several methods: MetaMapPlus, a rule-based method (F score 52.28%); a CRF method (F score 45.33%); and an SVM method (F score 76.17%).

For text pre-processing, several NLP tools have been used, like Lingpipe, Tree Tagger, OpenNLP, c-TAKES, Stanford parser [20], splitta, SPECIALIST, and Stanford CoreNLP. Sentence boundary detection with these tools is evaluated in [21], where the authors identified and discussed different errors, such as the handling of sentence splitters like semicolons and colons, though these errors were separate from their context. According to their evaluation, all of the tools except c-TAKES performed worse on clinical notes than on general-domain text.

Das et al. [22] proposed a neuro-fuzzy model with post-feature reduction to analyze complex biomedical data. In that paper, a class-belongingness fuzzification method is used to identify uncertainty issues in the input patterns. Fuzzification increases the number of input patterns; to handle this, post-feature reduction is used, which eliminates the irrelevant data from the input set. Das et al. [23] proposed a framework for the classification of medical diseases. To deal with the uncertainty of the data, that paper presents a linguistic neuro-fuzzy with feature extraction (LNF-FE) model, in which linguistic fuzzification is used to find the membership values and, from those values, the relevant data are retrieved using feature extraction. Finally, the reduced features are classified using an artificial neural network (ANN). Das et al. [24] used a particle swarm optimization (PSO) model to build a multilayer perceptron for classification. The model is capable of solving linear and non-linear problems. The back propagation algorithm is used for training the network. The performance of the proposed model is also compared with a multilayer perceptron (MLP) and a genetic algorithm (GA-MLP). Das et al. [25] used a hybrid neuro-fuzzy and feature reduction model for data analysis. Using fuzzification, the input pattern classes are identified, and the irrelevant data are removed using feature reduction methods. An ANN model is used for classifying the filtered data. The performance of the model is tested on ten different datasets.



The Informatics for Integrating Biology and the Bedside Center (i2b2) organized a challenge in 2010 that focused on NLP for clinical data. The dataset included patients’ discharge summaries and clinical notes provided by three institutions: the Beth Medical Center, Partners Healthcare, and the Pittsburgh Medical Center. The organizers manually annotated 826 clinical notes, which provided gold-standard data for the challenge [8].

To develop the new system for entity boundary detection, this work focuses on the clinical concept extraction task. For this task, 426 annotated progress notes from Partners and Beth were used: 170 progress notes formed the training set, and the remaining 256 notes formed the test set, used to assess the performance of the system against the reference data.

Proposed Method

The proposed system uses the concept of multi-pattern matching based on a matrix model [26] for clinical data. Fundamentally, a matrix is a rectangular array of numbers, symbols, or expressions arranged in rows and columns. The system performs parallel pattern matching between two matrices; matrices of the same size can be added or subtracted element by element. Every element of a matrix has a user-assigned value corresponding to the POS tag of an entity.

The system builds two matrices, one for the gold-standard training data and another for the test data. The framework of the proposed system (see Figure 8.1) is divided into a few components for medical concept extraction. The working of the components is described below.

Text Pre-Processing

i2b2 clinical notes contain various parts, such as discharge date, admission date, allergies, present illness history, past medical history, social history, family history, physical examination, pertinent results, discharge medication, and discharge instructions. These clinical notes are in an unstructured or semi-structured form of text [27]. Every part contains a small amount of information about the patient and includes various special characters, colons, semicolons, punctuation, hyphens, and so on. For medical concept identification, each sentence of each section needs pre-processing. In this system, the Natural Language Toolkit (NLTK) is used for text processing. Tokenization is performed with methods such as line_tokenize or word_tokenize, followed by the pos_tag method for part-of-speech tagging. Pre-processing is performed on both the gold-standard annotations and the test annotations. In the gold-standard annotations, each file has its named entities with their concepts. These entities are tagged with their POS. In the test dataset, every sentence of the file is tagged with POS. These tagged data are applied as input for creating the matrices.

FIGURE 8.1 Framework of proposed matrix-based system.

Trained Matrix Formation

The system is trained using gold-standard annotations for which a trained matrix was created. After the text processing of the gold-standard annotations of every file, POS tagged patterns are generated collectively for every single entity and sequence of entities. For matrix calculation, every POS tag is assigned a different value, as shown in Table 8.1. The frequency of each pattern among the different pattern sets of gold annotations is then ascertained. For obtaining frequent pattern sets, a user-defined minimum threshold value greater than or equal to 2 is required. These different frequent pattern sets are then applied as elements for the trained matrix.

TABLE 8.1 Different POS Tags with Assigned Value for Matrix Calculation.
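The frequency step described above can be sketched in Python as follows. This is only an illustration: the tag-to-value dictionary stands in for Table 8.1 with made-up numbers, and the function names, the Penn Treebank tags, and the toy gold patterns are assumptions.

```python
from collections import Counter

# Hypothetical tag values; the chapter's actual assignments are those of Table 8.1.
TAG_VALUE = {"DT": 1, "JJ": 2, "NN": 3, "NNS": 4, "IN": 5}


def frequent_patterns(entity_tag_sequences, min_support=2):
    """Keep POS patterns that occur at least min_support times."""
    counts = Counter(tuple(seq) for seq in entity_tag_sequences)
    return [list(pattern) for pattern, n in counts.items() if n >= min_support]


def encode(pattern):
    """Replace each POS tag with its assigned numeric value."""
    return [TAG_VALUE[tag] for tag in pattern]


# Toy POS patterns, one per gold-standard entity (invented for the example).
gold_patterns = [
    ["JJ", "NN"], ["NN"], ["JJ", "NN"],
    ["DT", "JJ", "NN"], ["NN"], ["NN", "IN", "NN"],
]
patterns = frequent_patterns(gold_patterns, min_support=2)
encoded = [encode(p) for p in patterns]
print(patterns)
print(encoded)
```

With a threshold of 2, patterns that occur only once in the gold annotations are dropped before the trained matrix is built.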

The trained matrix has k pattern sets, where every row represents one pattern set. The lengths of the longest and the shortest patterns are maxi and mini, respectively. In a(M × N), a is the trained matrix, M is k (the number of pattern sets), and N is maxi. Where a pattern is shorter than N, the remaining columns of its row are filled with “0.”
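A minimal Python sketch of this construction, assuming the patterns have already been converted to numeric tag values (the values and the function name below are hypothetical):

```python
def build_trained_matrix(encoded_patterns):
    """Stack encoded patterns as rows, padding shorter rows with 0.

    The number of rows is k (one per pattern set) and the number of
    columns is the length of the longest pattern.
    """
    width = max(len(pattern) for pattern in encoded_patterns)
    return [pattern + [0] * (width - len(pattern)) for pattern in encoded_patterns]


# Three hypothetical encoded patterns of lengths 2, 1, and 3.
trained = build_trained_matrix([[2, 3], [3], [1, 2, 3]])
for row in trained:
    print(row)
```

Using 0 as padding presumes that no POS tag is itself assigned the value 0.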

Test Matrix Formation

The test matrix (β) is created from the test data for every file independently. The POS-tagged test data, converted to their assigned values, are placed in the matrix. The numbers of rows and columns are the same as in the trained matrix a. Every complete sentence, in its converted form, is passed to the matrix, which is updated dynamically as each new sentence enters, after comparison with the trained matrix.

Pattern Matching

The system performs multi-pattern matching of the trained matrix with the test matrix by using matrix subtraction. Matrix a minus matrix β is equal to matrix B.

If a row of B is equal to 0 (all elements are 0), the pattern is completely matched. After each subtraction of β from a, matrix β is shifted circularly, and this is repeated until the end of the text. Every file’s test matrix is compared with the trained matrix. The exactly matched patterns are then converted back into their corresponding POS tags and, from those, into their entities or sequences of entities.
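The subtraction-based matching can be sketched in Python as below. This is one possible reading of the description, not the authors' implementation: it assumes that the test matrix is rebuilt at every token position of the encoded sentence, that each row holds a window as long as the corresponding trained pattern, and that a zero row in the difference marks a match. All names and numeric values are hypothetical.

```python
def pattern_lengths(trained):
    """Length of each trained pattern (0 is padding, not a tag value)."""
    return [sum(1 for value in row if value != 0) for row in trained]


def build_test_matrix(sentence, start, trained):
    """Test matrix with the same shape as the trained matrix.

    Row r holds the sentence window that starts at `start` and is as long
    as trained pattern r, padded with 0 (an assumed layout).
    """
    width = len(trained[0])
    rows = []
    for length in pattern_lengths(trained):
        window = sentence[start:start + length]
        rows.append(window + [0] * (width - len(window)))
    return rows


def match_patterns(sentence, trained):
    """Return (start, end, row) for every all-zero row of trained - test."""
    lengths = pattern_lengths(trained)
    matches = []
    for start in range(len(sentence)):          # shift the window along the text
        test = build_test_matrix(sentence, start, trained)
        for r, (t_row, s_row) in enumerate(zip(trained, test)):
            diff = [t - s for t, s in zip(t_row, s_row)]
            if all(d == 0 for d in diff):       # zero row: exact match
                matches.append((start, start + lengths[r], r))
    return matches


trained = [[2, 3, 0], [3, 0, 0], [1, 2, 3]]     # hypothetical trained matrix
sentence = [1, 2, 3, 5, 3]                      # hypothetical encoded sentence
print(match_patterns(sentence, trained))
```

Each match gives the token span and the matched row, so the span can be mapped back to the original words. Overlapping matches are all reported, which is consistent with the need for the pruning step that follows.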

Pruning Non-Medical Concepts

The system generates many single entities and sequences of entities, both medical and non-medical, because of the parallel pattern matching between the two matrices, and the non-medical concepts reduce precision. To improve the precision of the system, non-medical concepts are pruned according to post-processing rules. A few rule patterns are shown below, in which the semantic type of a medical concept, as given in the UMLS Metathesaurus [2], is used as a feature. UMLS provides categories of semantic types, which are used here for three medical concepts: problem, test, and treatment.

Each rule maps a set of semantic types y1 to ym to the concept class Y to which they belong, and Y is used for categorization. For example:

  • If y1 = ‘Finding’, y2 = ‘Sign or Symptom’, y3 = ‘Disease or Syndrome’, y4 = ‘Pathologic Function’ then Y = Problem
  • If y1 = ‘Laboratory or Test Result’, y2 = ‘Cell’, y3 = ‘Laboratory Procedure’, y4 = ‘Tissue’, y5 = ‘Clinical Attribute’ then Y = Test
  • If y1 = ‘Diagnostic Procedure’, y2 = ‘Therapeutic or Preventive Procedure’, y3 = ‘Organic Chemical’, y4 = ‘Pharmacologic Substance’, y5 = ‘Antibiotic’ then Y = Treatment

If the semantic type of a word token matches any of these categories, the token can be mapped to the corresponding medical concept class.
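The pruning rules listed above can be expressed as a simple lookup, sketched here in Python. The semantic types and classes are those named in the rules; the candidate terms and the semantic types attached to them are invented stand-ins for a real UMLS lookup, and the function name is hypothetical.

```python
# Semantic type -> concept class, taken from the three rules above.
SEMANTIC_TYPE_TO_CLASS = {
    "Finding": "Problem",
    "Sign or Symptom": "Problem",
    "Disease or Syndrome": "Problem",
    "Pathologic Function": "Problem",
    "Laboratory or Test Result": "Test",
    "Cell": "Test",
    "Laboratory Procedure": "Test",
    "Tissue": "Test",
    "Clinical Attribute": "Test",
    "Diagnostic Procedure": "Treatment",
    "Therapeutic or Preventive Procedure": "Treatment",
    "Organic Chemical": "Treatment",
    "Pharmacologic Substance": "Treatment",
    "Antibiotic": "Treatment",
}


def classify(semantic_types):
    """Return the class of the first recognized semantic type, else None."""
    for semantic_type in semantic_types:
        if semantic_type in SEMANTIC_TYPE_TO_CLASS:
            return SEMANTIC_TYPE_TO_CLASS[semantic_type]
    return None  # no medical semantic type: the candidate is pruned


# Hypothetical candidates with the semantic types a UMLS lookup might return.
candidates = {
    "chest pain": ["Sign or Symptom"],
    "oxycodone-acetaminophen": ["Pharmacologic Substance", "Organic Chemical"],
    "the patient": ["Patient or Disabled Group"],
}
kept = {}
for term, semantic_types in candidates.items():
    concept_class = classify(semantic_types)
    if concept_class is not None:
        kept[term] = concept_class
print(kept)
```

Candidates whose semantic types fall outside the three rules, such as "the patient" here, are dropped, which is how the pruning step is meant to raise precision.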
