Methods and Technologies Used in Text Mining
Text mining systems draw on a wide range of techniques and algorithms (computer programs) that are generally available as libraries for programming languages (e.g., Python, R) or as software applications for text mining (e.g., SAS Text Miner, IBM Watson NLU).
The market for text mining applications is highly fragmented. A 2018 Gartner study identified 39 of the most visible text analytics vendors (Davis et al., 2018). Examples of vendors listed in the Gartner report include SAS (Text Miner), IBM (Watson and SPSS Text Analytics for Surveys), Google (Cloud Natural Language API), Amazon (Amazon Comprehend), Microsoft Azure (Text Analytics), Expert System (Cogito), Lexalytics (Lexalytics Intelligence Platform), OpenText (OpenText Content Analytics, OpenText Sentiment), SAP (SAP HANA Text Mining XS classic), and Verint Systems (Text Analytics).
Reusing the classification proposed by researchers Martin Rajman and Martin Vesely of the Laboratory of Artificial Intelligence of the Swiss Federal Institute of Technology (Rajman and Vesely, 2004), we can classify text mining techniques and algorithms into one of three categories: document preprocessing, mining (or, data mining for textual data), and visualization.
- • Document preprocessing includes the following subcategories: data selection and filtering, data cleaning, document representation, morphological normalization and parsing, and semantic analysis;
- • Mining (data mining for textual data) includes clustering, classification, and entity and relation extraction; and
- • Visualization includes visualization techniques for multidimensional data and text summarization.
In the following subsections, we summarize the text mining techniques and algorithms used in each category and subcategory.
Data Selection and Filtering
Data selection and filtering techniques and algorithms are used to reduce the texts to their most relevant items for analysis or action. More specifically, data selection assists users with the identification and retrieval of related documents based on the explicit descriptive metadata with which they are associated, such as keywords or descriptors. Data filtering then evaluates the documents’ relevance based on their actual content using relevance measures.
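The two steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy documents, the keyword metadata, the token-ratio relevance measure, and the 0.2 threshold are all hypothetical and are not taken from any particular text mining product.

```python
# Toy corpus: each document carries descriptive metadata (keywords) and content.
# All documents, keywords, and thresholds are made-up illustration values.
DOCUMENTS = [
    {"id": 1, "keywords": {"finance", "risk"},
     "text": "Credit risk models estimate the risk of default."},
    {"id": 2, "keywords": {"finance"},
     "text": "The quarterly report was published on Monday."},
    {"id": 3, "keywords": {"sports"},
     "text": "The team took a big risk in the final minutes."},
]


def select_by_metadata(documents, wanted_keywords):
    """Data selection: keep documents whose metadata shares a keyword with the query."""
    return [d for d in documents if d["keywords"] & wanted_keywords]


def relevance(text, query_terms):
    """Hypothetical relevance measure: fraction of tokens that are query terms."""
    tokens = [t.strip(".,;:!?").lower() for t in text.split()]
    if not tokens:
        return 0.0
    return sum(t in query_terms for t in tokens) / len(tokens)


def filter_by_content(documents, query_terms, threshold):
    """Data filtering: keep documents whose content-based relevance reaches the threshold."""
    return [d for d in documents if relevance(d["text"], query_terms) >= threshold]


selected = select_by_metadata(DOCUMENTS, {"finance"})
filtered = filter_by_content(selected, {"risk", "default"}, threshold=0.2)
selected_ids = [d["id"] for d in selected]
filtered_ids = [d["id"] for d in filtered]
```

Selection looks only at the metadata, so document 2 passes it even though its content is unrelated to risk; the content-based filter then removes it.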
Data Cleaning
Data cleaning tools are used to remove noise such as spelling errors, inconsistencies, and unnecessary items from the textual data, and to identify metalinguistic information. These tools assist users in:
- • Correcting mistakes,
- • Normalizing text (e.g., letter case normalization, abbreviation normalization),
- • Removing passages that are not written in the language being processed,
- • Removing stop words (i.e., unnecessary words), punctuation marks, and special characters, and
- • Assigning metalinguistic tags to words. Techniques for assigning these tags include:
- o Named entity recognition (NER), a type of algorithm that identifies and labels mentions of named entities such as people, places, organizations, or dates, and
- o Part-of-speech (POS) tagging, a type of algorithm that assigns to each word a grammatical category, such as noun, verb, adjective, and others.
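Several of the cleaning steps listed above can be illustrated with a minimal sketch. The stop word list, the abbreviation table, and the choice to treat HTML-like tags as the non-language parts are assumptions made for this example only; spelling correction and metalinguistic tagging are omitted because they normally rely on dedicated language resources.

```python
import re
import string

# Tiny illustrative resources; real systems use much larger, language-specific lists.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to"}
ABBREVIATIONS = {"dept": "department", "govt": "government"}  # hypothetical mapping


def clean(text):
    """Return the list of cleaned tokens for a raw text string."""
    text = text.lower()                      # letter case normalization
    text = re.sub(r"<[^>]+>", " ", text)     # drop markup tags (not part of the language)
    # Remove punctuation and special characters.
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()
    tokens = [ABBREVIATIONS.get(t, t) for t in tokens]   # abbreviation normalization
    return [t for t in tokens if t not in STOP_WORDS]    # stop word removal


cleaned = clean("The Govt. of <b>Canada</b> is in the Dept. of Finance!")
```

The markup is stripped before the punctuation because the angle brackets that delimit a tag are themselves punctuation characters.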
Document Representation
Document representation algorithms transform unstructured text into a representation that the system can interpret for analysis or visualization. A simple approach is the bag-of-words (BoW) representation, in which a text such as a document or a sentence is represented as an unordered collection of its words, disregarding word order and grammar. A popular BoW representation for text documents is the vector space model (VSM). In a VSM, each text document is represented as a vector whose dimensions correspond to the words of the vocabulary, and each component holds the importance (weight) of the corresponding word in the document, typically measured in terms of frequency of occurrence (i.e., the number of times the word appears in the text document).
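As a minimal sketch of this idea, the following Python code builds frequency-weighted vectors for two toy sentences. The function names and the whitespace tokenization are assumptions made for illustration; production systems usually apply the cleaning steps described earlier and often use weights other than raw frequency.

```python
from collections import Counter


def bag_of_words(text):
    """Count word occurrences; word order is discarded."""
    return Counter(text.lower().split())


def to_vectors(texts):
    """Map each text to a frequency vector over a shared, sorted vocabulary."""
    bags = [bag_of_words(t) for t in texts]
    vocabulary = sorted(set().union(*bags))
    # A Counter returns 0 for words that do not occur in a given text.
    vectors = [[bag[word] for word in vocabulary] for bag in bags]
    return vocabulary, vectors


vocabulary, vectors = to_vectors(["the cat sat on the mat", "the dog sat"])
# vocabulary -> ['cat', 'dog', 'mat', 'on', 'sat', 'the']
# vectors    -> [[1, 0, 1, 1, 1, 2], [0, 1, 0, 0, 1, 1]]
```

Because only counts are kept, "the cat sat" and "sat the cat" produce identical bags, which is exactly the loss of word order that the BoW approach accepts.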
More complex representation models include more structured semantic models.
Keikha et al. (2008) distinguish four different types of document representation: N-grams, single terms, phrases, and a logic-based document representation called rich document representation (RDR).
- 1 The N-gram representation is a string-based representation with no linguistic processing. It is the simplest representation, where documents are represented as strings of n words.
- 2 The single-term approach is based on words with minimum linguistic processing. In this approach, documents are represented as vectors of their distinct words and their importance, as described earlier (the vector space model approach). Most often, the stem (i.e., root) of each word is used instead of the word itself to improve document matching.
- 3 The phrase approach is based on linguistically formed phrases and single words. It is a more sophisticated approach that involves extracting statistical or linguistic phrases and representing documents with their stemmed single words and phrases.
- 4 The rich document representation (RDR) provides a more semantic representation of a document. It is based on linguistic processing (such as part-of-speech tagging and matching rules) and represents documents as a set of logical terms and statements that describe the relationships in the text. For example, a preposition such as “for” in the sentence fragment “...operating systems for personal computers...” suggests a relationship between “operating systems” and “personal computers” (Keikha et al., 2008).
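To make the first of these representations concrete, here is a minimal sketch that extracts sequences of n words from a text. It assumes word-level n-grams and simple whitespace tokenization; the function name is hypothetical, and it is not claimed to reproduce the exact procedure of Keikha et al. (2008).

```python
def word_ngrams(text, n):
    """Return all contiguous sequences of n words, in order of appearance."""
    tokens = text.lower().split()
    # range() is empty when the text has fewer than n words.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


fragment = "operating systems for personal computers"
unigrams = word_ngrams(fragment, 1)
bigrams = word_ngrams(fragment, 2)
# bigrams -> ['operating systems', 'systems for', 'for personal', 'personal computers']
```

With n = 1 the output reduces to the individual words of the text, while larger values of n preserve some local word order at the cost of a much larger set of distinct features.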
There are also document representation models that are based on concepts (rather than on words alone). In a concept-based representation, a document is represented as a vector of concepts, and the importance of each concept is measured in terms of its frequency of occurrence in the document. The concepts themselves are organized in a hierarchical or lattice structure of “is-a” relations. For example, if a document contains the three related concepts, “cat,” “dog,” and “animal,” “animal” would be a super-class of both “cat” and “dog” (Da Costa Pereira & Tettamanzi, 2006).
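A concept-based vector can be sketched as follows. The tiny “is-a” table is hypothetical, and counting each occurrence of a concept toward its super-classes as well is one possible design choice made for this illustration; it is not necessarily the weighting scheme used by Da Costa Pereira and Tettamanzi (2006).

```python
from collections import Counter

# Hypothetical "is-a" hierarchy: child concept -> parent concept.
IS_A = {"cat": "animal", "dog": "animal"}
CONCEPTS = set(IS_A) | set(IS_A.values())


def ancestors(concept):
    """Follow "is-a" links upward and return all super-classes of a concept."""
    chain = []
    while concept in IS_A:
        concept = IS_A[concept]
        chain.append(concept)
    return chain


def concept_vector(tokens):
    """Count known concepts, also crediting each occurrence to its super-classes."""
    counts = Counter()
    for token in tokens:
        if token in CONCEPTS:
            counts[token] += 1
            for parent in ancestors(token):
                counts[parent] += 1
    return counts


vector = concept_vector("the cat chased the dog and another animal".split())
# vector -> Counter({'animal': 3, 'cat': 1, 'dog': 1})
```

Under this assumption, a document that mentions only cats and dogs still receives a non-zero weight for “animal,” which lets it match a query about animals.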
The more sophisticated approaches above allow greater understanding of the meaning of the texts by the software and enable systems to produce more accurate and useful results in terms of information retrieval, analysis, and visualization.
Morphological Normalization and Parsing
Morphological normalization refers to natural language tasks such as stemming (i.e., reduction of words to their stems), lemmatization (i.e., reduction of words to their dictionary forms, or lemmas), and part-of-speech tagging, as described earlier (see Data Cleaning subsection above).
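The following toy sketch contrasts rule-based stemming with dictionary-based lemmatization. The suffix list and the lemma table are deliberately tiny, hypothetical stand-ins; practical systems rely on established algorithms (such as the Porter stemmer) and on full lexical resources.

```python
SUFFIXES = ("ing", "ed", "es", "s")   # crude, illustrative suffix list
LEMMAS = {"ran": "run", "mice": "mouse", "better": "good"}   # hypothetical lookup table


def stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def lemmatize(word):
    """Look up irregular forms; fall back to the toy stemmer for regular ones."""
    return LEMMAS.get(word, stem(word))


stems = [stem(w) for w in ["walking", "walked", "cats", "mice"]]
lemmas = [lemmatize(w) for w in ["walking", "walked", "cats", "mice"]]
# stems  -> ['walk', 'walk', 'cat', 'mice']
# lemmas -> ['walk', 'walk', 'cat', 'mouse']
```

The irregular plural “mice” shows the difference: suffix stripping leaves it unchanged, whereas the dictionary lookup maps it to “mouse.”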
Parsing refers to the process of assigning syntactic structure to the normalized text, including text segmentation, sentence chunking into smaller syntactic units such as phrases, and the identification of syntactic relations between the identified units (Rajman & Vesely, 2004).
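Two of the parsing steps mentioned above, sentence segmentation and chunking, can be sketched as follows. The punctuation-based splitting rule, the tag names (DET, ADJ, NOUN, VERB), and the hand-tagged input are assumptions made for this example; real parsers use trained models and handle abbreviations, nested phrases, and syntactic relations.

```python
import re

NP_TAGS = {"DET", "ADJ", "NOUN"}   # tags allowed inside a toy noun-phrase chunk


def split_sentences(text):
    """Naive segmentation: split after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def chunk_noun_phrases(tagged_tokens):
    """Group maximal runs of determiner/adjective/noun tokens into chunks."""
    chunks, current = [], []
    for word, tag in tagged_tokens:
        if tag in NP_TAGS:
            current.append(word)
        else:
            if current:
                chunks.append(" ".join(current))
            current = []
    if current:
        chunks.append(" ".join(current))
    return chunks


sentences = split_sentences("The grey cat sleeps. Paul feeds it!")
# Part-of-speech tags are supplied by hand here; normally a tagger produces them.
tagged = [("the", "DET"), ("grey", "ADJ"), ("cat", "NOUN"),
          ("chased", "VERB"), ("a", "DET"), ("mouse", "NOUN")]
chunks = chunk_noun_phrases(tagged)
# chunks -> ['the grey cat', 'a mouse']
```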
Semantic Analysis
Semantic analysis tools resolve semantic ambiguities. These tools use techniques such as word sense disambiguation (WSD), anaphora resolution, and co-reference resolution. Semantic analysis also assesses topical proximity among and within the documents. These tools further increase the system's ability to interpret the meaning of texts.
A representative (non-exhaustive) list of these techniques and algorithms is summarized below:
- • Word sense disambiguation (WSD) enables systems to interpret words that have multiple meanings (i.e., word senses), depending on the context. WSD tools use context to assign the appropriate meaning to a word, determined by the other words around that word in the text.
- • Anaphora resolution tools assist in resolving what pronouns (e.g., they, he, she, it) or noun phrases refer to in a text. (Note: a noun phrase is a noun together with the modifiers of that noun. For example, in the noun phrases “his cat,” “Paul’s cat,” and “the grey cat,” “cat” is the noun, and “his,” “Paul’s,” and “the grey” are modifiers.)
- • Co-reference tools assist in finding expressions that refer to the same entity. Co-reference occurs when two mentions refer to the same entity. For example, in “She taught herself,” “She” and “herself” refer to the same person (Random House Kernerman Webster, n.d.).
- • Latent semantic analysis (LSA) refers to techniques and algorithms that assist in uncovering synonyms, homonyms, and term dependencies, such as pairs or groups of words.
- • Latent Dirichlet allocation (LDA), a popular topic modeling technique in natural language processing, refers to techniques and systems that identify topics from the words of a collection of documents, representing a document as a mixture of topics (and a topic as a mixture of words).
- • Measures used to assess topical proximity include vector space similarity measures such as the generalized Euclidean distance, cosine similarity, and Chi-square distance, and other measures such as measures based on phrase occurrence, measures based on the length of the document under evaluation, and the average document length in the whole collection, and others (Rajman & Vesely, 2004).
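Three of the vector space measures named in the list above can be sketched for frequency vectors as follows. These are simplified textbook forms chosen for illustration: the plain Euclidean distance is used as a special case of the generalized one, and the Chi-square distance shown is one common symmetric variant, so neither is claimed to match the exact definitions in Rajman and Vesely (2004).

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Plain (unweighted) Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def chi_square_distance(u, v):
    """Assumed variant: sum of (a - b)^2 / (a + b), skipping components where a + b = 0."""
    return sum((a - b) ** 2 / (a + b) for a, b in zip(u, v) if a + b > 0)


# Word-frequency vectors for two short texts over a shared six-word vocabulary.
doc_a = [1, 0, 1, 1, 1, 2]
doc_b = [0, 1, 0, 0, 1, 1]
similarity = cosine_similarity(doc_a, doc_b)
distance = euclidean_distance(doc_a, doc_b)
chi_square = chi_square_distance(doc_a, doc_b)
```

Cosine similarity depends only on the direction of the vectors, so it is less sensitive to document length than the two distance measures.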
Semantic analysis is a particularly challenging area in natural language processing, and it is evolving rapidly. This includes applying machine learning approaches to these methodologies, which further improves the software’s ability to interpret natural language.
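As a simple, non-machine-learning illustration of the word sense disambiguation technique described in this subsection, the sketch below picks the sense whose signature words overlap most with the surrounding context, in the spirit of the simplified Lesk approach. The sense inventory for “bank” is hypothetical and far smaller than a real lexical resource.

```python
# Hypothetical sense inventory: each sense is described by a set of signature words.
SENSES = {
    "bank": {
        "financial institution": {"money", "deposit", "loan", "account"},
        "river side": {"river", "water", "shore", "fishing"},
    }
}


def disambiguate(word, context_tokens):
    """Return the sense whose signature shares the most words with the context."""
    context = set(context_tokens)
    senses = SENSES[word]
    # On a tie, max() keeps the sense that was listed first.
    return max(senses, key=lambda sense: len(senses[sense] & context))


sense_1 = disambiguate("bank", "she opened an account at the bank to deposit money".split())
sense_2 = disambiguate("bank", "they went fishing on the river bank".split())
# sense_1 -> 'financial institution'
# sense_2 -> 'river side'
```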