14.5 ENGINEERING APPROACH TO NLP

In the engineering approach, a given text is preprocessed and the base forms of words are extracted. However, syntax analysis and the later stages cannot be carried out immediately after morphological analysis, because the computer needs the text in numerical form. The words must therefore be converted into vectors before further analysis with machine learning methods.

14.5.1 PREPROCESSING TEXT

The preprocessing step in the engineering approach includes the same steps as in the theoretical approach. On a computer, these preprocessing tasks are carried out mainly using regular expressions.

14.5.2 TEXT-TO-VECTOR CONVERSION

Text-to-vector conversion is the process of converting words into vector form for analysis. This process is also known as feature extraction: finding the relevant features required to build a machine learning model.

Word embeddings are simply numerical representations of text, and representing words numerically makes processing faster. There are various techniques for creating word embeddings. Traditional techniques for numerical representation are One Hot encoding and N-gram representation, and most of the newer methods based on language modeling use these traditional techniques as a first step. Today, word embeddings are created using machine learning techniques; FastText, GloVe, word2vec, etc., are examples of word embeddings created using neural networks (NNs).

14.5.2.1 ONE HOT ENCODING

Tokenized words are sent to the One Hot encoder for mapping.

Example: (“Hai,” “how,” “are,” “you”)

OUTPUT:

Hai - [1, 0, 0, 0]
how - [0, 1, 0, 0]
are - [0, 0, 1, 0]
you - [0, 0, 0, 1]

One hot encoded representation of the word "are" is [0, 0, 1, 0].
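As a minimal sketch (the function name and token list are illustrative, not from the text), this mapping can be produced in Python as follows:

tokens = ["Hai", "how", "are", "you"]

def one_hot(token, vocabulary):
    """Return a vector with 1 at the token's position in the vocabulary and 0 elsewhere."""
    vector = [0] * len(vocabulary)
    vector[vocabulary.index(token)] = 1
    return vector

for t in tokens:
    print(t, one_hot(t, tokens))
# Hai [1, 0, 0, 0]
# how [0, 1, 0, 0]
# are [0, 0, 1, 0]
# you [0, 0, 0, 1]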

14.5.2.2 N-GRAM

The N-gram is one of the most efficient techniques for predicting the next word given the previous words. N-grams work based on the chain rule of probability.

Consider a phone review for sentiment analysis: "Camera quality is not bad." The actual meaning of this sentence is "The phone has good camera quality." Sentiment analysis is the process of finding the sentiment (positive, negative, or neutral) about a product or topic. A simple approach finds the sentiment by tokenizing the words and checking whether each word appears in a list of positive, negative, or neutral terms. Here, both "not" and "bad" fall in the negative category, so the review is classified as negative even though it is a positive review. The N-gram concept is introduced to avoid such situations; N can be any value above 0.

Example: (Camera quality is not bad)

OUTPUT:

  • 1-gram: ("Camera," "quality," "is," "not," "bad")
  • 2-gram: ("Camera quality," "quality is," "is not," "not bad")
  • 3-gram: ("Camera quality is," "quality is not," "is not bad")

Using the bigram technique keeps the phrase "not bad" together and thus eliminates the problem of classifying the sentence as negative.
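A minimal Python sketch of generating these n-grams (function and variable names are illustrative):

def ngrams(tokens, n):
    """Return the list of n-grams (as strings) from a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "Camera quality is not bad".split()
print(ngrams(tokens, 1))  # ['Camera', 'quality', 'is', 'not', 'bad']
print(ngrams(tokens, 2))  # ['Camera quality', 'quality is', 'is not', 'not bad']
print(ngrams(tokens, 3))  # ['Camera quality is', 'quality is not', 'is not bad']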

14.5.2.3 WORD EMBEDDINGS

Newer word-embedding methods also help in finding similarities: words with similar meanings have similar representations. There are different types of word embeddings.

Frequency based:

  • Count Vectorizer
  • TF-IDF Vectorizer
  • Co-occurrence Vectorizer

Prediction based:

  • Continuous Bag of Words (CBOW)
  • Skip-Gram model

14.5.2.3.1 Count Vectorizer

The count vector technique works on top of one-hot encoding. It is similar to the one-hot encoding technique, but it can be used for a large corpus (collection of documents).

Consider a corpus of D (= 2) documents. Unique tokens are extracted from the corpus, excluding stop words.

Doc1: (He is greedy. Raghu is a bad and greedy person)

Doc2: (Ramu is greedy)

TABLE 14.3 Count Vector Representation

        greedy  bad  person  Raghu  Ramu
Doc1    2       1    1       1      0
Doc2    1       0    0       0      1

Count vector representation of the word "greedy": [2, 1] (column-wise).

The count vector method counts the occurrences of the word "greedy" in Doc1 and Doc2. It uses either the frequency of the word or merely its presence to get the count value. In this case, using frequency, "greedy" is represented as [2, 1]; using presence, "greedy" is represented as [1, 1].

Count vector representation of Doc1: [2 1 1 1 0] (row-wise).

This technique is not suitable for a large corpus with millions of words, since a huge matrix covering every word and document must be created, and such a matrix will be sparse.
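For example, a count vector like the one in Table 14.3 can be produced with scikit-learn's CountVectorizer; this is only a sketch, and the small hand-picked stop-word list is an assumption:

from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "He is greedy. Raghu is a bad and greedy person",  # Doc1
    "Ramu is greedy",                                  # Doc2
]

# A small, hand-picked stop-word list (an assumption for this sketch).
vectorizer = CountVectorizer(stop_words=["he", "is", "a", "and"])
counts = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # in recent versions: ['bad' 'greedy' 'person' 'raghu' 'ramu']
print(counts.toarray())
# [[1 2 1 1 0]
#  [0 1 0 0 1]]

Passing binary=True to CountVectorizer would record mere presence instead of frequency, giving the [1, 1] style representation mentioned above.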

14.5.2.3.2 Term Frequency-Inverse Document Frequency (TF-IDF)

It is one of the most popular techniques for finding whether a word is relevant in the specific document being analyzed. Words that occur frequently within a document but rarely across the rest of the corpus are treated as essential words.

Term frequency tells how frequently a given term appears in a document. For example, if "learning" appears 20 times in Doc1 and the total number of terms in the document is 200, then TF(learning) = 20/200 = 0.1.

Document frequency measures how common a term is across the corpus. Assume that the number of documents containing the term "learning" is 100 and the total number of documents in the corpus is 4,000,000; then DF(learning) = 100/4,000,000 = 0.000025. Taking log() to normalize this value gives log(0.000025) = -4.60205999133. If the number of documents containing the term is much smaller than the total number of documents in the corpus, this gives a negative value that is hard to compare. Therefore, the ratio is inverted, that is, IDF(term) = log(total number of documents / number of documents containing the term).

The IDF can compare quantities even if there is a considerable difference between them. For example, IDF(learning) = 4,000,000/100 = 40,000 before normalization; taking the log gives IDF(learning) = log(40,000) = 4.60205999133.

TF-IDF(learning, Doc1) = 0.1 * 4.60205999133 = 0.460205999133.

Taking another example, consider the term "this" in Doc1.

TF(this) = 60/200 = 0.3

IDF(this) = log(4,000,000/4,000,000) = log(1) = 0 (every document contains the term "this")

TF-IDF(this, Doc1) = 0.3 * 0 = 0, indicating that the term "this" is not relevant.
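The worked example above can be reproduced with a short sketch (base-10 logarithm, as in the example; the counts are the assumed figures from the text):

import math

def tf(term_count, total_terms):
    return term_count / total_terms

def idf(total_docs, docs_with_term):
    return math.log10(total_docs / docs_with_term)

# "learning": 20 occurrences out of 200 terms, found in 100 of 4,000,000 documents.
print(tf(20, 200) * idf(4_000_000, 100))        # ~0.4602

# "this": 60 occurrences out of 200 terms, found in all 4,000,000 documents.
print(tf(60, 200) * idf(4_000_000, 4_000_000))  # 0.0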

14.5.2.3.3 Co-occurrence Matrix

This method is based on the assumption that similar words tend to occur together. The co-occurrence of a word pair (W1, W2) is the number of times the pair has appeared together within a context window. The context window is similar to the one used in the N-gram language model, and its size can be varied depending on the application.

Corpus = "I love to learn. I love robotics. I love to read."

Context window size = 2 (the window takes two words on the left and right of the given word). For example, consider the word "to": "to" occurs with "I" in its context three times and with "love" two times. The phrases "love to" and "I love" tend to occur together.

TABLE 14.4 Co-occurrence Matrix Representation

          I  love  to  learn  robotics  read
I         0  4     3   1      2         0
love      4  0     2   1      2         1
to        3  2     0   1      0         1
learn     1  2     1   0      0         0
robotics  2  2     0   0      0         0
read      0  1     1   0      0         0

This model is rarely used, as it needs a very large matrix when the number of terms is high. In addition, the co-occurrence matrix is not itself a word vector representation; it must be converted into word vectors using, for example, principal component analysis. This model helps in building relationships such as the (Boy, Girl) and (Husband, Wife) pairs.
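A minimal sketch of building such co-occurrence counts (simple whitespace tokenization with punctuation stripped; function and variable names are illustrative):

from collections import defaultdict

def cooccurrence_counts(text, window=2):
    """Count how often each word pair appears within `window` words of each other."""
    tokens = [w.strip(".").lower() for w in text.split()]
    counts = defaultdict(int)
    for i, word in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if i != j:
                counts[(word, tokens[j])] += 1
    return counts

corpus = "I love to learn. I love robotics. I love to read."
counts = cooccurrence_counts(corpus, window=2)
print(counts[("to", "i")], counts[("to", "love")])  # 3 2, matching the text above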

14.5.2.3.4 Continuous Bag of Words

CBOW is a machine learning model created for predicting the probability of a word in a given context. A context is formed using the N-gram concept and may be a single word or a collection of words. A one-hot encoded vector is given as input to an NN, and the probability of the given word is predicted. This method consumes less memory and is more accurate; however, training an NN needs more time, as it formulates rules from examples. It is nevertheless an efficient and widely used model for predicting the probability of a word in a given context. The word "bat," for instance, can refer to an animal or a cricket bat, and CBOW places the word "bat" between the cluster of animal terms and the cluster of sports terms. CBOW takes a context as input and predicts the probability of the current word based on that context. The bag of words matrix is calculated as follows.

TABLE 14.5 Bag of Words Representation

              Not  Bad  Good  Story  Casting
Not bad       1    1    0     0      0
Good story    0    0    1     1      0
Good casting  0    0    1     0      1
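As a sketch of how CBOW training data is formed (window size, sentence, and helper name are illustrative, not from the source), each word becomes the target and its neighbours become the context:

def cbow_pairs(tokens, window=1):
    """Yield (context_words, target_word) pairs for CBOW training."""
    pairs = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        pairs.append((context, target))
    return pairs

tokens = "Camera quality is not bad".split()
for context, target in cbow_pairs(tokens, window=1):
    print(context, "->", target)
# ['quality'] -> Camera
# ['Camera', 'is'] -> quality
# ['quality', 'not'] -> is
# ['is', 'bad'] -> not
# ['not'] -> bad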

14.5.2.3.5 Skip-Gram Model

The skip-gram model is a machine learning model with a context window on both sides; it predicts the words surrounding a target word. Consider, as an example, a skip-gram model with context window size = 1.

Example: Corpus = Hello! How are you?

TABLE 14.6 Skip-Gram Model Representation

INPUT (Given Word)  OUTPUT 1 (Next Word)  OUTPUT 2 (Previous Word)
Hello               How                   -
How                 Are                   Hello
Are                 You                   How
You                 -                     Are

The skip-gram model with negative sampling has better accuracy than the other methods. In addition, it can identify the semantics of a word quickly; that is, it can distinguish whether the word "bank" refers to a financial institution or a riverbank.
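As a sketch, libraries such as gensim expose both models through a single Word2Vec class, where the sg flag selects skip-gram (sg=1) or CBOW (sg=0); the parameter names below follow recent gensim releases, and the toy corpus is illustrative:

from gensim.models import Word2Vec

# A toy corpus of pre-tokenized sentences.
sentences = [["hello", "how", "are", "you"],
             ["how", "are", "you", "doing"]]

# sg=1 trains a skip-gram model; sg=0 would train CBOW instead.
model = Word2Vec(sentences, vector_size=10, window=1, min_count=1, sg=1)

print(model.wv["hello"])               # the learned 10-dimensional vector for "hello"
print(model.wv.most_similar("hello"))  # nearest neighbours in the embedding space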

  • Google's word2vec uses the CBOW and skip-gram models.
  • Facebook's FastText is an extension of the word2vec model, but it feeds character N-grams of words instead of whole words. This helps in identifying the context of a given word even if that word is not present in the dictionary. Consider legal terms: some of them may not appear in a standard dictionary, but they can still be handled with the help of this method.

Example: Trigrams of the word "crane" >>> {cra, ran, ane}.
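A minimal sketch of generating such character trigrams (FastText itself additionally pads words with boundary markers, which are omitted here for simplicity):

def char_ngrams(word, n=3):
    """Return the character n-grams of a word."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

print(char_ngrams("crane"))  # ['cra', 'ran', 'ane']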

14.5.3 NLP METHODOLOGIES

The extracted features are used to implement different applications with the help of machine learning or rule-based methods. Not every application needs all the processing steps such as syntax, semantic, and pragmatic analysis; it varies depending on the application. There are different approaches to processing natural language.

Symbolic approach: This approach to NLP handles problems using human-developed rules. Regular expressions and context-free grammars are examples of symbolic approaches. Morphological analysis and syntactic parsing are mainly done using symbolic methods.

Statistical approach: In this approach, NLP is done with the help of mathematical analysis. A large corpus is analyzed and trends are found; linguistic rules are then formulated from the analyzed trends. Machine learning models for classification, semantic processing, discourse processing, etc., are based on statistical approaches. Finite automata (FA), the Hidden Markov model, etc., are examples of statistical methods.

Connectionist approach: The connectionist approach to NLP combines symbolic and statistical approaches. It uses formulated rules for processing text and combines them with inferences obtained from the statistical approach to build specific applications efficiently. WSD, NLG, etc., can be done through the connectionist approach.

Another classification for NLP is based on the methods used for processing syntax, semantics, pragmatics, and discourse. It includes the following:

  • Rule-based methods.
  • Probabilistic and machine learning techniques.
  • Deep learning methods.

14.5.3.1 RULE-BASED METHODS

Rule-based approaches are very efficient for rule-governed tasks. Natural language has specific rules for syntax and semantics, so those tasks are rule based, and the analysis of syntax and semantics can be done quickly using rule-based approaches. The main disadvantage of a rule-based system is that its efficiency depends on the programmer who creates the rules: if the programmer is skilled, the system will also be efficient, and a skilled programmer can build the system within a limited period. Rule-based methods used in computers are regular expressions and FA. Regular expressions are very fast and easy to use; most preprocessing and morphological tasks are done using them. String matching, one of the everyday NLP tasks, can be done quickly using regular expressions.

For example: the set of strings over {0, 1} that end in three consecutive 1's.

(0 | 1)* 111.

The above expression matches any string that ends with 111. Strings matched by this regular expression include 0000001111, 0001110010111, 10101010111, etc.
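A minimal sketch of testing this pattern with Python's re module (the full-match anchoring is an assumption about how the pattern is applied):

import re

pattern = re.compile(r"(0|1)*111")

for s in ["0000001111", "0001110010111", "10101010111", "101010"]:
    print(s, bool(pattern.fullmatch(s)))
# The first three strings end in 111 and match; "101010" does not.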

  • For every regular expression R, there is a corresponding FA that accepts the set of strings generated by R.
  • For every FA, there is a corresponding regular expression that generates the set of strings accepted by that FA.

14.5.3.2 PROBABILISTIC AND MACHINE LEARNING METHODS

Probabilistic models for NLP are used in many NLP tasks. However, the most exciting application of the probabilistic model in NLP is predicting the next word, given a word. Likelihood maximization, conditional random fields, etc., are the most commonly used probabilistic methods in the area of NLP.

Machine learning models learn from examples. They formulate rules based on the given examples, and a machine learning model can then handle similar data very quickly. Many machine learning algorithms can be used for different purposes. There are two types of learning.

14.5.3.2.1 Supervised Learning

In this learning method, the programmer feeds the system with a set of example data for which the answers are already known. It is used for structured (labeled) datasets. Supervised machine learning algorithms used in NLP include:

  • Support vector machines.
  • Bayesian networks.
  • Maximum entropy.
  • NNs.

14.5.3.2.2 Unsupervised Learning

This method of learning is used for unstructured data whose labels are unknown. Unsupervised machine learning algorithms include clustering and dimensionality reduction.

Classical machine learning methods cannot handle new situations efficiently. Deep learning methods are introduced to handle such problems; sentiment analysis, dialog systems, etc., are modeled using deep learning because new data comes up every time. NLP tasks such as named entity recognition and text or document classification are done using machine learning models.

14.5.3.3 DEEP LEARNING METHODS

An NN with more than one hidden layer is called a deep learning network. In machine learning, classification is done by feeding examples along with essential features, whereas in deep learning there is no need to supply features for classification: the network learns the features automatically from the given examples. The main disadvantage of a deep learning network is that it needs a massive dataset of examples, since it learns features from them. Consider a dataset of vehicle images in which it is required to predict whether an input image is a car: a classical machine learning model must be fed the features of a car (it has four wheels, four doors, etc.), whereas a deep learning network does not need the features to be specified.

FIGURE 14.9 Architecture of a feedforward deep neural network.
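A minimal sketch of such a feedforward deep network in PyTorch (the layer sizes and the two-class output are illustrative assumptions):

import torch
import torch.nn as nn

# A feedforward network with two hidden layers (hence "deep"); sizes are illustrative.
model = nn.Sequential(
    nn.Linear(300, 128),  # e.g., a 300-dimensional input vector
    nn.ReLU(),
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Linear(64, 2),     # e.g., two output classes such as positive/negative
)

x = torch.randn(8, 300)   # a batch of 8 input vectors
print(model(x).shape)     # torch.Size([8, 2])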

14.5.3.3.1 Convolutional Neural Network (CNN)

CNN is a deep neural network (DNN) model mainly used for image classification.

FIGURE 14.10 Convolutional neural network.

Most of the computer vision applications that exist today, such as photo tagging and age prediction, are based on CNNs. A CNN uses convolutional filters for analyzing different features simultaneously. The main advantages of the CNN include the following:

  • Ability to learn features from unformatted data: It can find relevant information in raw data. A CNN can easily find the linguistic features in a sentence and can be used to process diverse morphology, syntax, and other phenomena that do not follow a specific pattern.
  • Less training time as the weights are shared: Using the same weights for inputs helps in reducing the training time. Since the weights are shared across neurons, there is no need to train every neuron separately.

A CNN with pooling is better than a plain CNN. Pooling helps in reducing the size of the feature maps and the number of locations to be examined by the following layers, which also reduces the training time.
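A minimal PyTorch sketch of a 1D convolution with pooling applied over a sequence of word vectors (all dimensions are illustrative):

import torch
import torch.nn as nn

embed_dim, seq_len = 50, 20
text_cnn = nn.Sequential(
    nn.Conv1d(in_channels=embed_dim, out_channels=32, kernel_size=3),  # convolve over word windows
    nn.ReLU(),
    nn.MaxPool1d(kernel_size=2),  # pooling shrinks the feature maps
)

x = torch.randn(4, embed_dim, seq_len)  # batch of 4 sentences as (channels, length)
print(text_cnn(x).shape)                # torch.Size([4, 32, 9])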

14.5.3.3.2 Recursive Neural Network (RcNN)

RcNNs are created by applying the same set of weights recursively over a structured input to generate a structured prediction. They can be used to learn sequential and hierarchical structure. Language is seen as a recursive structure in which words become the leaf nodes and subphrases become child nodes that compose the parent node, the sentence. It is easy to model parse trees using an RcNN, since it is used for building hierarchical structures. The CNN shares weights between nodes, whereas the RcNN shares weights between layers.

FIGURE 14.11 Recursive neural network.

14.5.3.3.3 Recurrent Neural Network (RNN)

An RNN is a feedforward NN with a feedback loop, and it is used mainly for sequential and time-series data. RNNs can be used for classifying, clustering, and making predictions. While writing an email in Gmail, one gets predictions for the next word based on the previous words; this is an application of RNNs. The absence of memory makes a plain RNN unable to process long sequences efficiently, and the vanishing gradient problem makes such a network challenging to train.
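A minimal PyTorch sketch of passing a batch of sequences through an RNN layer (sizes are illustrative):

import torch
import torch.nn as nn

rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)

x = torch.randn(4, 10, 8)          # batch of 4 sequences, 10 time steps, 8 features each
output, hidden = rnn(x)
print(output.shape, hidden.shape)  # torch.Size([4, 10, 16]) torch.Size([1, 4, 16])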

14.5.3.3.4 Long Short-Term Memory (LSTM)

LSTM is a modified version of the RNN that eliminates its shortcomings. A plain RNN cannot retain memory and fails in processing long sequences; an RNN with a memory cell inside the neurons is called an LSTM. The presence of memory in the LSTM helps in capturing syntactic and semantic dependencies very quickly: it has memory capacity and can store data over time. An LSTM has input, output, and forget gates. During learning, an LSTM cell sends its hidden state and memory cell state to the adjacent LSTM cells. There are different types of LSTM networks.

• One-to-one model

It is used for predicting common phrases.

Example: ("Pretty") >>> ("Good")

• One-to-many model

It can be used for question generation.

Example: ("Fine") >>> ("How," "are," "you," "?")

• Many-to-one model

The many-to-one model can be used for predicting the next word.

Example: ("How," "are") >>> ("you")

• Many-to-many model

The many-to-many model can be used for generating a sequence of words, given a sequence of words.

Example: ("How," "are," "you," "?") >>> ("I," "am," "fine")

FIGURE 14.12 Recurrent neural network.

FIGURE 14.13 Types of LSTM.

14.5.3.3.5 Gated Recurrent Unit (GRU)

The GRU was introduced to overcome the vanishing gradient problem of the RNN and is similar to the LSTM. However, the GRU has only two gates, namely, the reset and update gates. The GRU can perform as well as the LSTM without a separate memory unit: it has more control over its hidden state and can eliminate the vanishing gradient problem without using memory. The learning procedure for the GRU is the same as for the LSTM, but the GRU uses the hidden state as its internal vector.

14.5.3.3.6 Encoder-Decoder Network

An LSTM cannot handle variable-length input and output: it cannot read an entire input sequence and retain it in its memory. For example, if "How are you?" is to be converted into another language, the entire sequence must be fed to the decoding architecture, but an LSTM cannot retain the entire sequence and cannot feed it directly. To solve this problem, the encoder-decoder architecture is introduced. The encoder-decoder network is a sequence-to-sequence architecture in which both input and output are sequences. The encoder uses an LSTM network to receive the input and convert the entire sequence into a fixed-length vector. That fixed-length vector is then given as input to the decoder LSTM, which increases output efficiency. Figure 14.14 shows the encoder-decoder network: "How are you?" is converted into a fixed-length vector "W," which is given as input to the decoder, and the decoder generates a reply to the given input.
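A minimal PyTorch sketch of the encoder-decoder idea with LSTM layers (vocabulary size, dimensions, and the greedy decoding loop are illustrative assumptions; the model is untrained, so the generated tokens are random):

import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_dim = 1000, 64, 128

embed = nn.Embedding(vocab_size, embed_dim)
encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
output_layer = nn.Linear(hidden_dim, vocab_size)

# Encode the input sequence into a fixed-length state (the "W" vector of Figure 14.14).
src = torch.randint(0, vocab_size, (1, 4))   # e.g., token ids of "How are you ?"
_, (h, c) = encoder(embed(src))

# Decode one token at a time, starting from a start-of-sequence token (id 0 here).
token = torch.zeros(1, 1, dtype=torch.long)
for _ in range(3):                            # generate a short reply
    out, (h, c) = decoder(embed(token), (h, c))
    token = output_layer(out).argmax(dim=-1)  # pick the most likely next token
    print(token.item())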

FIGURE 14.14 Encoder-decoder network.

 