Information Extraction

IE is the process of extracting useful information from unstructured or semi-structured text. The extracted information is either readily usable or requires additional processing and analysis. Although there is no strict rule on what information an IE system should extract, three types of information are commonly extracted by most IE systems: named entities, relations, and events.

■ Named Entity Recognition

As the simplest but most commonly performed IE subtask, named entity recognition (NER) aims to identify named entities and their types: person, organization, location, etc. Named entities carry important information about the text itself, and they can help build a common ground for other analytics tasks such as relation and event extraction, document summarization, question answering, semantic search, etc.

NER is usually formulated as a multi-class classification or sequence labeling task in the supervised learning setting. Word-level features (e.g., case, suffixes, prefixes, shape, and part-of-speech tag), contextual features (e.g., following token(s), preceding token(s), and their part-of-speech tags), and external knowledge (e.g., word clustering, phrasal clustering, and Wikipedia gazetteers) have been widely used in various supervised NER systems. Selecting such features requires considerable engineering skill and domain expertise.
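The word-level and contextual features mentioned above can be sketched as a feature-extraction function for a sequence labeling model; the feature names below are illustrative, not from any particular system.

```python
# Sketch of word-level and contextual feature extraction for a
# feature-based NER system. Feature names are illustrative.

def token_features(tokens, i):
    """Build a feature dict for tokens[i] in a sequence-labeling NER model."""
    word = tokens[i]
    feats = {
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),   # case information
        "word.isupper": word.isupper(),
        "suffix3": word[-3:],             # suffix feature
        "prefix3": word[:3],              # prefix feature
        # Word shape: uppercase -> X, lowercase -> x, digit -> d
        "shape": "".join("X" if c.isupper() else "x" if c.islower()
                         else "d" if c.isdigit() else c for c in word),
    }
    # Contextual features: preceding and following tokens
    feats["prev.lower"] = tokens[i - 1].lower() if i > 0 else "<BOS>"
    feats["next.lower"] = tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>"
    return feats

tokens = "George Bush traveled to France".split()
print(token_features(tokens, 0)["shape"])   # Xxxxxx
```

A classifier such as a conditional random field would consume these per-token feature dicts to predict entity labels.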

In recent years, deep learning (DL) based NER models (Li et al., 2018) have become dominant and have achieved state-of-the-art results. As mentioned earlier, compared with feature-based approaches, the key advantage of deep learning is its capability of automatically discovering latent, potentially complex representations. As a well-studied topic in NLP, NER components for languages such as English, Chinese, and French have been readily built into many IE systems. Nowadays, more focus is on building NER systems for low-resource languages (Cotterell and Duh, 2017).

■ Relation Extraction

With named entities identified in text, a further step is to determine how they are related. Consider the following example (Eisenstein, 2019, p. 387): “George Bush traveled to France on Thursday for a summit.”

This sentence introduces a relation between the entities referenced by George Bush and France. In the automatic content extraction (ACE) ontology,[1] the type of this relation is PHYSICAL, and the subtype is LOCATED. This relation would be written as follows:

PHYSICAL.LOCATED(GEORGE BUSH, FRANCE)
Early work on relation extraction focused on handcrafted patterns (Hearst, 1992). In a supervised learning setting, relation extraction is formulated as a classification task. In recent years, DL models based on recurrent neural networks have been developed that can jointly detect entities and extract the relations between them (Miwa and Sasaki, 2014).

Unlike classical relation extraction, where the set of relations is predefined, a relation in open information extraction (OpenIE) can be any (subject, relation, object) tuple of text. Extracting such tuples can be viewed as a lightweight version of semantic role labeling (Christensen et al., 2010).
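A deliberately naive sketch of OpenIE-style tuple extraction is shown below, using a single hardcoded pattern over the running example; real OpenIE systems rely on much richer syntactic analysis, and the pattern here is purely illustrative.

```python
import re

# Naive OpenIE-style sketch: extract a (subject, relation, object)
# tuple with a crude regular-expression pattern. The pattern below is
# hardcoded for the example sentence and is only an illustration.
PATTERN = re.compile(r"^(?P<subj>[A-Z][\w ]*?) (?P<rel>traveled to) (?P<obj>[A-Z]\w+)")

def extract_tuple(sentence):
    m = PATTERN.match(sentence)
    if m:
        return (m.group("subj"), m.group("rel"), m.group("obj"))
    return None

print(extract_tuple("George Bush traveled to France on Thursday for a summit."))
# ('George Bush', 'traveled to', 'France')
```

Note that the same sentence that yields the ACE relation PHYSICAL.LOCATED also yields an open tuple whose relation phrase is simply the text "traveled to".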

■ Event Extraction

Relation extraction links pairs of entities, but many real-world situations involve more than two entities. In event detection, a schema is provided for each event type (e.g., ELECTION, CRIME, or BANKRUPTCY), indicating all the possible properties of the event. The system is then required to fill in as many of these properties as possible. Event extraction generally involves finding 1) named entities; 2) the event trigger, usually a verb or noun that clearly expresses the occurrence of the event, for example, the trigger word “conviction” in a CONVICT event; and 3) event argument(s), the participants or general attributes (such as place and time) of the event. As a downstream task of named entity recognition, the performance of event extraction heavily depends on NER performance. More recent work has tried to formulate NER and event extraction as a joint task (Yang and Mitchell, 2016).
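The output of such schema-filling can be sketched as a simple data structure; the CONVICT schema and its argument names below are hypothetical, chosen to mirror the "conviction" example above.

```python
from dataclasses import dataclass, field

# Sketch of an event-extraction output: an event type, a trigger word
# that expresses the occurrence, and a dict of filled argument slots.
# The CONVICT schema and its slot names here are hypothetical.

@dataclass
class Event:
    event_type: str
    trigger: str                      # word expressing the occurrence
    arguments: dict = field(default_factory=dict)

# Hand-filled result for: "Smith's conviction in Boston on Tuesday ..."
event = Event(
    event_type="CONVICT",
    trigger="conviction",
    arguments={"defendant": "Smith", "place": "Boston", "time": "Tuesday"},
)
print(event.event_type, event.trigger)  # CONVICT conviction
```

An extraction system would populate such a structure automatically, leaving unfillable slots empty.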

Other NLP Applications in Data Analytics

Text summarization. This is the automatic process of shortening one or multiple documents into a summary that preserves the key information of the original text. It is a task that falls under the scope of natural language generation (NLG). There are two types of summarization: extractive and abstractive. The former extracts and reuses important nuggets (words, phrases, or sentences) from the original documents to create a summary. The latter paraphrases the intent of the original text in a new way. Most current summarization systems are extractive in that they focus on identifying important pieces of information to produce a coherent summary text. Text summarization covers a wide range of tasks such as headline generation, meeting minutes generation, search results presentation by search engines, customer feedback summarization, etc. With the advancement of NLP and machine learning, the methods and applications of text summarization are also evolving.
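A minimal extractive approach can be sketched by scoring sentences on word frequency and keeping the top-scoring ones; real systems add position, redundancy, and coherence signals, and the tiny stopword list here is purely illustrative.

```python
import re
from collections import Counter

# Minimal extractive-summarization sketch: score each sentence by the
# corpus frequency of its (non-stopword) words, keep the top sentences,
# and emit them in their original order.

STOPWORDS = {"the", "a", "an", "is", "of", "to", "and", "in", "it"}

def summarize(text, n_sentences=1):
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = [w for w in re.findall(r"\w+", text.lower()) if w not in STOPWORDS]
    freq = Counter(words)
    # Rank sentences by total frequency of their words
    scored = sorted(sentences,
                    key=lambda s: sum(freq[w] for w in re.findall(r"\w+", s.lower())),
                    reverse=True)
    chosen = set(scored[:n_sentences])
    # Restore original document order of the selected sentences
    return " ".join(s for s in sentences if s in chosen)

print(summarize("Cats sleep. Cats sleep a lot. Dogs bark."))  # Cats sleep a lot.
```

The extracted summary is always a verbatim subset of the input, which is exactly what distinguishes extractive from abstractive summarization.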

Chatbots in customer service. Chatbots for customer service have been used for a number of years to remove repetition from workflows and provide 24/7 instant support. The newer ones demonstrate a better understanding of language and are more interactive due to the application of NLP techniques. Compared with a traditional rule-based chatbot, an NLP-based chatbot can continue to learn from every interaction and automatically evolve to improve the quality of the support it offers in the future. A social chatbot such as Microsoft XiaoIce (Zhou et al., 2018) involves far more sophisticated and advanced NLP techniques, among many other techniques, and represents the highest-level achievement of chatbots in this generation.

NLP Text Preprocessing

Because of the noisy nature of unstructured text, text preprocessing is usually the first step in the pipeline of an NLP system. There are different ways to preprocess text, and the steps may vary from task to task. In the following, we list some of the common preprocessing steps in NLP.

Tokenization. This is the process of breaking a text into linguistically meaningful units (LMUs) called tokens, which are mostly words but can be phrases, symbols, and so on. The output of a tokenizer can then be fed as input for further processing such as NER, document classification, text summarization, etc. Challenges in tokenization depend on the type of language and the domain of the text. For example, English is a space-delimited language, but Chinese is not. Biomedical text contains many special symbols and punctuation marks, which makes tokenizing it different from tokenizing news text. For an agglutinative language such as Turkish, the tokenizer requires additional lexical and morphological knowledge.
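For a space-delimited language such as English, a basic tokenizer can be sketched with a single regular expression; this is only a minimal illustration and would not suffice for Chinese or Turkish as noted above.

```python
import re

# Simple regex tokenizer for a space-delimited language like English:
# keep words (allowing an internal apostrophe, as in "isn't") and split
# punctuation off as separate tokens.
TOKEN_RE = re.compile(r"\w+(?:'\w+)?|[^\w\s]")

def tokenize(text):
    return TOKEN_RE.findall(text)

print(tokenize("Dr. Smith's lab isn't in Boston."))
# ['Dr', '.', "Smith's", 'lab', "isn't", 'in', 'Boston', '.']
```

Even this tiny example shows a classic tokenization ambiguity: the period after "Dr" is an abbreviation marker, not a sentence boundary, yet the simple pattern splits it off anyway.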

Stopword removal. Stopwords are words that occur frequently but do not contribute to the content of the text. Due to their high frequency, their presence may introduce noise and confusion into the downstream steps. This is especially true for an information retrieval system. Examples of stopwords in English are “this,” “is,” “the,” “an,” etc. Depending on the specific task, a stopword list can be preestablished or generated on the fly.
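With a preestablished list, stopword removal is a simple filter over the token stream; the short list below is illustrative, and production systems use larger lists or derive one from corpus frequencies.

```python
# Stopword removal sketch using a small preestablished list.
# Real systems use larger lists or generate one from corpus statistics.

STOPWORDS = {"this", "is", "the", "an", "a", "of", "to"}

def remove_stopwords(tokens):
    # Compare case-insensitively so "This" is filtered like "this"
    return [t for t in tokens if t.lower() not in STOPWORDS]

print(remove_stopwords(["This", "is", "an", "information", "retrieval", "system"]))
# ['information', 'retrieval', 'system']
```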

Normalization. This is the process of transforming tokens into a single canonical form so that they are consistent when taken as input for further analysis. Common normalization techniques include:

- Lowercasing. A text may contain multiple variations of a word, such as “USA,” “usa,” and “Usa.” Lowercasing converts all such occurrences into lowercase. Though lowercasing is very useful in search because it makes the system insensitive to the case of the input keywords, it is not a good practice for a task such as NER, where uppercase/capitalization is an important feature of a named entity.

- Stemming and lemmatization. Both aim to reduce the inflected form of a word to its base form (e.g., from produce, produces, producing, product, products, production to produce). The difference is that the former does not necessarily return a dictionary word, while the latter does. Stemming uses a crude heuristic process that chops off the inflected part at the end of a word, whereas lemmatization uses a dictionary to map a word from its inflected form to its original form. Stemming has been shown to help improve search accuracy in some highly inflected languages such as Finnish, but not as much for English.
- Spell correction. Spelling mistakes, as commonly seen in social media text, can present an obstacle to processing. Spell correction has therefore become a necessary step in text preprocessing. Minimal edit distance[2] is a widely used technique that measures the number of edit steps (deletion, insertion, substitution) it takes to transform one string into another.
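Two of the techniques above can be sketched in a few lines: a deliberately crude suffix-stripping stemmer (not a real stemming algorithm such as Porter's) and the minimal edit distance computed by dynamic programming.

```python
# Normalization sketches: a crude suffix-stripping stemmer and the
# minimal (Levenshtein) edit distance used in spell correction.

SUFFIXES = ("ing", "tion", "es", "s")  # illustrative, not a real stemmer

def crude_stem(word):
    """Chop off a known suffix if enough of the word remains."""
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word

def edit_distance(a, b):
    """Minimal edit distance with unit-cost deletion, insertion, substitution."""
    prev = list(range(len(b) + 1))           # distances for the empty prefix of a
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (ca != cb))) # substitution (free if equal)
        prev = cur
    return prev[-1]

print(crude_stem("producing"))               # produc
print(edit_distance("recieve", "receive"))   # 2
```

Note that the stemmer's output "produc" is not a dictionary word, which is exactly the stemming/lemmatization distinction drawn above; a lemmatizer would return "produce".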

Basic NLP Text Enrichment Techniques

Latent information such as parts-of-speech of words or structural dependencies among words can be added to plain text through some basic text enrichment techniques.

Part-of-Speech (POS) Tagging. This is the process of assigning a part of speech (such as noun, verb, or adjective) to each token in a text. As one of the most well-studied NLP tasks, a state-of-the-art English POS tagger can achieve over 97% accuracy on all tokens. The main challenge lies in tagging words that were never seen in training; even on such words, a state-of-the-art tagger can still achieve over 90% accuracy.
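A common baseline for this task, far below the state of the art but useful for illustration, tags each word with its most frequent tag in a tagged corpus and falls back to a default tag for unseen words; the toy training data below is made up.

```python
from collections import Counter, defaultdict

# Baseline POS tagger sketch: tag each word with its most frequent tag
# in a (toy, made-up) tagged corpus; unseen words fall back to NN.
# State-of-the-art taggers also model context and word shape.

training = [("the", "DT"), ("dog", "NN"), ("runs", "VBZ"),
            ("the", "DT"), ("run", "NN"), ("dog", "NN")]

counts = defaultdict(Counter)
for word, pos in training:
    counts[word][pos] += 1

def tag(tokens):
    return [(t, counts[t.lower()].most_common(1)[0][0]
                if t.lower() in counts else "NN")
            for t in tokens]

print(tag(["The", "dog", "runs"]))
# [('The', 'DT'), ('dog', 'NN'), ('runs', 'VBZ')]
```

The fallback for unseen words is where baselines lose most accuracy, which is why handling never-seen words is singled out as the main challenge above.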

Syntactic Parsing. This is the process of analyzing and constructing the syntactic structure of a given sentence. Without it, it would be very difficult to determine the order and syntactic dependencies among the words in a sentence and to comprehend the sentence. It has therefore long been deemed one of the most important NLP tasks, and a great deal of theoretical and practical work has been done around it. SyntaxNet, released by Google in 2016, was announced to be the most accurate English parser at the time.
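The output of a dependency parser can be sketched as a head index per token; the analysis below is hand-written for illustration (with labels in the common Universal Dependencies style), not produced by an actual parser.

```python
# Sketch of a dependency parse represented as head indices: each token
# points to its syntactic head (0 = root). The analysis is hand-written
# for illustration, not produced by a parser.

tokens = ["George", "Bush", "traveled", "to", "France"]
heads  = [2, 3, 0, 3, 4]   # 1-based head index per token; "traveled" is the root
labels = ["compound", "nsubj", "root", "prep", "pobj"]

for tok, head, label in zip(tokens, heads, labels):
    parent = "ROOT" if head == 0 else tokens[head - 1]
    print(f"{tok} --{label}--> {parent}")
```

Reading off such arcs is how downstream components (e.g., relation extraction) determine that "George Bush" is the subject of "traveled" and "France" its destination.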


As described in this chapter, due to the availability of big data and computational resources and the advancement of machine learning techniques, there are many remarkable uses of NLP in data analytics today. As NLP continues to make data more “user-friendly” and “insightful,” it will be more and more widely adopted in all types of data analytics platforms. In spite of its wide application, NLP is still in its infancy compared with people's expectations for AI. Languages are complex, subtle, and ambiguous, and processing natural language is an extremely challenging task. Whether it is a low-level processing step such as tokenization or a high-level task such as machine translation, existing NLP systems are still far from perfect. Though there is still a long way to go, NLP is rapidly advancing along with deep learning techniques and revolutionizing the way people interact with computers and data. Looking to the future, it is clear that NLP-empowered applications will become even more capable and will continue to improve. As NLP embraces an unprecedented renaissance, it will play an indispensable part in the next generation of AI-empowered applications, and NLP applications will become ubiquitous in our lives.


Christensen, J. et al. (2010). Semantic Role Labeling for Open Information Extraction. Proceedings of the NAACL HLT 2010 First International Workshop on Formalisms and Methodology for Learning by Reading, pp. 52-61.

Cotterell, R., Duh, K. (2017). Low-Resource Named Entity Recognition with Cross-Lingual, Character-Level Neural Conditional Random Fields. Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pp. 91-96.

Eisenstein, J. (2019). Introduction to Natural Language Processing. Cambridge, MA: The MIT Press.

Hearst, M.A. (1992). Automatic Acquisition of Hyponyms from Large Text Corpora. Proceedings of the 14th Conference on Computational Linguistics, vol. 2, pp. 539-545.

Li, J. et al. (2018). A Survey on Deep Learning for Named Entity Recognition. arXiv preprint arXiv:1812.09449.

Liu, B. (2015). Sentiment Analysis: Mining Opinions, Sentiments, and Emotions. Cambridge, UK: Cambridge University Press.

Manning, C.D., Schütze, H. (1999). Foundations of Statistical Natural Language Processing. Cambridge, MA: The MIT Press.

Miwa, M., Sasaki, Y. (2014). Modeling Joint Entity and Relation Extraction with Table Representation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pp. 1858-1869. Doha, Qatar: Association for Computational Linguistics.

Yang, B., Mitchell, T. (2016). Joint Extraction of Events and Entities within a Document Context. Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 289-299.

Zhou, L. et al. (2018). The Design and Implementation of XiaoIce, an Empathetic Social Chatbot. arXiv preprint arXiv:1812.08989.
