14.4 THEORETICAL APPROACH TO NLP

In the theoretical approach, a given sentence is subjected to initial processing, and words are separated from the sentence. The words appear in different forms (e.g., past tense and present tense), so they must be converted into their base forms through morphological analysis. To find the actual meaning of a sentence, it must be in proper syntax. Syntax analysis enables syntactic checking and tagging of words with POS. POS tagging helps in finding the meaning of a sentence quickly. Semantic analysis then finds the meaning of a sentence by combining word meanings and building relationship charts/graphs. Finally, pragmatic and discourse analysis finds the actual meaning of a sentence by considering the situation in which it is uttered.

14.4.1 PREPROCESSING

Real-world data are unstructured and complicated, especially with the emergence of social media. The scattered and unorganized data should be combined and organized adequately for efficient processing. Irrelevant information affects the efficiency of the final output. In addition, the use of abbreviations, smileys, etc., can create confusion. Therefore, the data must be converted into a normalized form before processing. Most preprocessing tasks are done using regular expressions. The essential preprocessing tasks include the following.

14.4.1.1 SPELLING CORRECTION

Words with incorrect spellings are identified and corrected. Most of the data processed today include customer reviews, social media comments, etc., where short word forms are commonly used. Therefore, the spellings must be corrected to avoid preprocessing delays. Processing text data without spelling correction may take more time, as finding the proper word in the dictionary is a tedious task.

Example: “Its ur responsibility to handle dat case” >>> “It is your responsibility to handle that case.”
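A minimal sketch of this kind of normalization in Python, using a hand-built lookup table of short forms (the table and function name are illustrative; a production system would use dictionary lookup with edit-distance search, or a library such as SymSpell or TextBlob):

import re

# Illustrative lookup table of chat short forms; a real system would use a
# full dictionary with edit-distance search rather than a fixed map.
# Note: mapping "its" to "it is" is context-blind; it is shown here only to
# reproduce the running example from the text.
SHORT_FORMS = {"ur": "your", "dat": "that", "its": "it is"}

def normalize(text):
    # Replace each short form only where it appears as a whole word.
    pattern = r"\b(" + "|".join(SHORT_FORMS) + r")\b"
    return re.sub(pattern,
                  lambda m: SHORT_FORMS[m.group(0).lower()],
                  text, flags=re.IGNORECASE)

print(normalize("Its ur responsibility to handle dat case"))
# -> "it is your responsibility to handle that case"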

14.4.1.2 CASE CONVERSION

Case conversion is a normalization procedure in which all data are converted to lowercase to make processing easier. This makes processing more manageable, but it may create problems in some situations. For example, “March” with a capital “M” represents a month, whereas “march” with a lowercase “m” refers to a military walk.

Example: “How are you?” >>> “how are you?”

14.4.1.3 PUNCTUATION REMOVAL

It is the process of removing punctuation marks from the text.

Example: Awesome! >>> Awesome.
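Both of these normalization steps are one-liners in Python; a small sketch using only the standard library:

import string

text = "How are you? Awesome!"
lowered = text.lower()   # case conversion
# Punctuation removal: delete every ASCII punctuation character in one pass.
no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
print(lowered)    # -> "how are you? awesome!"
print(no_punct)   # -> "how are you awesome"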

14.4.1.4 TEXT STANDARDIZATION

It is the process of converting abbreviations to their original forms to resolve ambiguity during processing.

Example: NLP >>> Natural Language Processing.

14.4.1.5 TOKENIZATION

Tokenization is the breaking of data into smaller units to ease processing. The main tokenization methods include the following.

Sentence tokenization: A piece of text is converted into individual sentences, typically by splitting at sentence-ending punctuation such as the period (.).

Example: “Learning is the activity of gaining knowledge. Learning enhances the awareness of the subjects of the study.” >>> “Learning is the activity of gaining knowledge.” “Learning enhances the awareness of the subjects of the study.”

Word tokenization: Breaking up of textual data into individual words. Sentences are converted to words by identifying blank space between words.

Example: “I am fine.” >>> “I,” “am,” “fine.”
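A short sketch of both methods with NLTK (assuming the library is installed; the name of the tokenizer resource varies slightly across NLTK versions):

import nltk
nltk.download("punkt", quiet=True)   # one-time download of tokenizer models
from nltk.tokenize import sent_tokenize, word_tokenize

text = ("Learning is the activity of gaining knowledge. "
        "Learning enhances the awareness of the subjects of the study.")
print(sent_tokenize(text))           # splits into the two sentences
print(word_tokenize("I am fine."))   # -> ['I', 'am', 'fine', '.']
# Note: word_tokenize also splits off the final period as its own token,
# which a naive split on blank spaces would not do.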

14.4.1.6 STOP WORDS REMOVAL

A stop word is a word that occurs commonly in every article. The words “are,” “was,” “and,” etc., are examples of stop words. The presence of stop words affects feature engineering very severely: their counts are high in a document, so there is a chance of treating them as relevant words while classifying documents, which reduces classification efficiency. Therefore, stop words are eliminated during preprocessing.

Example: (“The,” “book,” “is,” “on,” “the,” “table”) >>> (“The,” “book,” “table”).
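A sketch using NLTK's built-in English stop word list (note that the comparison below is case-insensitive, so “The” is removed as well):

import nltk
nltk.download("stopwords", quiet=True)
from nltk.corpus import stopwords

STOP = set(stopwords.words("english"))
tokens = ["The", "book", "is", "on", "the", "table"]
print([t for t in tokens if t.lower() not in STOP])   # -> ['book', 'table']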

14.4.2 MORPHOLOGICAL ANALYSIS

After tokenization, the words should be converted into their base forms. Words can have suffixes and prefixes that help sentences express their idea efficiently. However, those inflected words will not be present in the dictionary for further analysis. Therefore, the words must be converted into their base forms by trimming suffixes or prefixes, if any. Morphological analysis is done using two methods.

14.4.2.1 STEMMING

Stemming deals with different forms of the same word. For example, write can appear in many forms, such as write, wrote, writing, written, writer, and so on. Stemming converts words into their base forms by trimming inflectional endings such as -ing, -ed, -en, etc.

Example: writing >>> write, calves >>> calv.
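With NLTK's Porter stemmer (assuming NLTK is installed), both examples above can be reproduced; note that a stem need not be a dictionary word:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print(stemmer.stem("writing"))   # -> 'write'
print(stemmer.stem("calves"))    # -> 'calv' (not a valid dictionary word)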

14.4.2.2 LEMMATIZATION

Lemmatization does the same thing as stemming, but the difference is that the lemmatizer checks for the presence of the resulting word in the dictionary, so the output (the lemma) is always a valid word.

Word       Noun Lemmatizer    Verb Lemmatizer
Writing    Writing            Write
Calves     Calves             Calve
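The same behavior can be sketched with NLTK's WordNet lemmatizer, where the POS argument selects the noun or verb reading of the word:

import nltk
nltk.download("wordnet", quiet=True)
from nltk.stem import WordNetLemmatizer

lem = WordNetLemmatizer()
print(lem.lemmatize("writing", pos="n"))   # -> 'writing' (noun reading)
print(lem.lemmatize("writing", pos="v"))   # -> 'write'   (verb reading)
print(lem.lemmatize("calves", pos="v"))    # -> 'calve'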

14.4.3 SYNTAX ANALYSIS

It is essential to know the syntax and structure of language for efficient processing. Syntax analysis parses the text and annotates the text with POS tags. It helps in understanding the hierarchy of the sentence, and it also makes semantic analysis easier. Standard parsing techniques for understanding text syntax are mentioned below.

14.4.3.1 POS TAGGING

POS tagger labels the words in a sentence with a POS tag that is most suitable for that particular word. POS tagging helps in analyzing the semantic meaning of the sentence without much effort. Different POS include adjective, adverb, conjunction, determiner, noun, number, preposition, pronoun, verb, adjective phrase, adverb phrase, noun phrase, verb phrase, etc.

Example: Ram loves eating pizza >>> Ram/Noun loves/Verb eating/Verb pizza/Noun
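A sketch with NLTK's default tagger (assuming NLTK and its tagger models are installed; resource names vary slightly across NLTK versions). It uses the Penn Treebank tag set, so the Noun/Verb labels above appear as tags such as NNP, VBZ, VBG, and NN:

import nltk
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)
from nltk import pos_tag, word_tokenize

print(pos_tag(word_tokenize("Ram loves eating pizza")))
# Typical output:
# [('Ram', 'NNP'), ('loves', 'VBZ'), ('eating', 'VBG'), ('pizza', 'NN')]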

14.4.3.2 SHALLOW PARSING OR CHUNKING

Shallow parsing identifies the nonrecursive phrases in a sentence.

FIGURE 14.3 Shallow parsed tree.
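Chunking can be sketched with NLTK's regular-expression chunker; the grammar below is a toy rule saying that a noun phrase (NP) is an optional determiner, any number of adjectives, and a noun:

import nltk

parser = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN.*>}")
tagged = [("the", "DT"), ("little", "JJ"), ("dog", "NN"),
          ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
print(parser.parse(tagged))
# The two NPs are grouped into chunks; 'barked' and 'at' stay outside.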

14.4.3.3 CONSTITUENCY PARSING

Constituency parser finds the recursive phrases in a sentence.
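A toy illustration with NLTK's chart parser and a hand-written context-free grammar (real constituency parsers are trained on treebanks rather than written by hand):

import nltk

grammar = nltk.CFG.fromstring("""
S -> NP VP
VP -> V NP
NP -> 'Ram' | 'pizza'
V -> 'loves'
""")
parser = nltk.ChartParser(grammar)
for tree in parser.parse("Ram loves pizza".split()):
    print(tree)   # (S (NP Ram) (VP (V loves) (NP pizza)))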

14.4.3.4 DEPENDENCY PARSING

Dependency parser focuses on a word in the sentence and tries to create relationships with the other words. A relationship tag labels each edge.

FIGURE 14.4 Constituency parsed tree of a sentence.

FIGURE 14.5 Constituency parsed tree of the expression ((2*7)+(8*5)).

FIGURE 14.6 Dependency parsed tree of the expression ((2*7)+(8*5)).

FIGURE 14.7 Dependency parsed tree for a sentence.
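A dependency parse can be sketched with spaCy (assuming the library and its small English model are installed via python -m spacy download en_core_web_sm):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Ram loves eating pizza")
for token in doc:
    # One edge of the dependency tree per word: word, relation tag, head word.
    print(token.text, token.dep_, "head =", token.head.text)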

14.4.4 SEMANTIC ANALYSIS

Semantic analysis demands contextual awareness and a well-structured knowledge base for extracting an accurate meaning from a sentence. Human-computer interaction systems such as conversational agents generally deal with understanding the human language and generating an accurate response in return. Semantic analysis of human language is the critical technology that assists any cognitive computing system to interpret the meaning of natural language. An efficient cognitive system must have the ability to analyze unstructured data and make connections between related information by refining the learning process continuously. The power of linguistics and advances in semantic analysis facilitate cognitive computing to establish relevant relationships among the words in human interaction by considering the surrounding context and sense of every word.

The grammatical information transferred from the syntactic parser is fed into the semantic analysis phase after assigning a proper part of speech tag to each lexeme. Semantics is the process of understanding the linguistic meaning of each lexeme from the parser. For a sentence or statement, the meaning depends on the meaning of constituent parts as well as the composition of lexical items. Various other factors such as the context of the sentence, senses of words, the inclusion of phrases, utterance, and pronouns mentioned play a significant role in cracking the meaning of a sentence. A rich and versatile knowledge base also supports the power of language understanding and semantic processing to enact better reasoning for comprehending and decision making in cognitive systems.

14.4.4.1 THEORETICAL APPROACHES IN SEMANTIC ANALYSIS

This topic discusses two central questions in semantic processing:

  • How is the input to a semantic analyzer converted to an intermediate meaning representation?
  • How is this intermediate representation used to assign proper meaning to each constituent?

Meaning extraction from text data can be done based on two different approaches. The first is lexical semantics, which deals with extracting the meaning of each lexical item, while the latter, compositional semantics, considers the meaning of phrases, the composition of lexical parts, etc. For information extraction (IE), the initial challenge in semantic analysis is how to represent the meaning of a human utterance in the computer. Hence, first, the output from the syntax parser must be converted into a proper meaning representation using techniques such as first-order predicate logic, associative networks, frames or scripts, etc. The final transformation is done by approaches such as syntax-directed semantic analysis, semantic grammars, IE, etc.

14.4.4.2 LEXICAL SEMANTICS

The study of the meanings of word units in an entire sentence and their interrelation is lexical semantics. Semantic analysis of individual lexemes imposes analysis of their structure and labeling of their relations to other lexemes, accounting for similarities and differences among different lexemes in similar settings and the nature of relations among lexemes in a single setting [1]. Table 14.1 shows a list of lexical relations and their senses significant in computational as well as cognitive semantics. Distinguishing between the various senses of a word is one of the major issues identified in lexical semantics. Some of the issues related to lexical semantics are word sense disambiguation, semantic role labeling, and semantic selectional restrictions.

14.4.4.2.1 Word-Sense Disambiguation

The process of distinguishing word senses and choosing the most appropriate sense for a word is called WSD. One common approach is to identify word senses from the different contexts in which a given word is used. WSD supports many NLP tasks that address cognitive semantic issues, such as question answering, intelligent document retrieval, and text classification. The way that WSD is exploited in these and other applications varies widely based on the particular needs of the application [1].

One way of doing WSD is to use dictionaries and ontological relationships to analyze the different senses of words. It is a knowledge-based approach and highly depends on the quality of the knowledge base in use [2]. Another approach is based on machine learning, which involves supervised and unsupervised learning. Unsupervised learning uses clustering techniques for identifying different contexts and the co-occurrence of words. In supervised learning, a selected set of sense-tagged words along with corpora (like WordNet) is used to train the algorithm [2].
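A knowledge-based sketch using the simplified Lesk algorithm shipped with NLTK, which picks the WordNet sense whose dictionary definition overlaps most with the context words (Lesk is a classic baseline and often imperfect, so the chosen sense should be treated as a guess):

import nltk
nltk.download("wordnet", quiet=True)
nltk.download("punkt", quiet=True)
from nltk.wsd import lesk
from nltk.tokenize import word_tokenize

context = word_tokenize("I went to the bank to deposit my money")
sense = lesk(context, "bank", pos="n")   # choose a noun sense of 'bank'
if sense is not None:
    print(sense.name(), "-", sense.definition())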

TABLE 14.1 Common Lexical Relationships in the English Language

Homonymy: words with the same spelling and sound but a different meaning. Example: light (“The book is too light.” / “Light the lamp.”)
Polysemy: multiple related meanings within a single lexeme. Example: blood bank and question bank.
Synonymy: different lexemes with the same meaning. Example: big and large; fare and price.
Antonymy: words that express the opposite meaning. Example: fast and slow; cold and hot; rise and fall.
Hyponymy: the pairing of words where one lexeme denotes a subclass of another. Example: car is a subclass of vehicle; banana is a subclass of fruit.
Meronymy: words that denote the part-of relation. Example: an arm is a part of the body.
Holonymy: a word that indicates the whole of a part (reverse part-of). Example: the building is the whole that contains the window.

14.4.4.2.2 Semantic Role Labeling

An event represented in a dialog system can be presented in several different syntactic ways. Understanding the idea should not depend on the syntactic arrangement; this is another issue in meaning representation/semantic parsing. The solution is to map the meaning by identifying the verb and its arguments, not by their syntactic order.

Semantic role labeling is the task of identifying a verb or a predicate and its arguments in a sentence [1]. Each argument identified is labeled based on its semantic relation with the predicate. Semantic role labeling within the sentence makes the meaning representation independent of the syntactic arrangement. Both supervised and unsupervised methods are used in semantic role labeling. The semantic roles can come variously from resources such as PropBank, FrameNet, or VerbNet. It can also be extended to similar semantic roles that are introduced by other POS, such as nominal or adjectival elements [2]. Semantic role labeling is also called thematic role labeling, case role assignment, or even shallow semantic parsing [1].
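True semantic role labelers are trained on resources such as PropBank (e.g., the models distributed with AllenNLP). As a rough, purely illustrative stand-in, a predicate and its arguments can be read off a spaCy dependency parse; note this labels grammatical, not semantic, roles:

import spacy

nlp = spacy.load("en_core_web_sm")   # assumes the small English model

def crude_frames(text):
    # For each verb, collect its subject/object dependents as "arguments".
    for tok in nlp(text):
        if tok.pos_ == "VERB":
            args = {c.dep_: c.text for c in tok.children
                    if c.dep_ in ("nsubj", "nsubjpass", "dobj", "dative")}
            print(tok.lemma_, args)

crude_frames("Ram gave Sita a book")
# e.g. give {'nsubj': 'Ram', 'dative': 'Sita', 'dobj': 'book'}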

14.4.4.2.3 Semantic Selectional Restriction

Even though disambiguating word senses and attaching role labels to every word argument make the semantic analysis more accurate, certain word combinations still produce semantic violations.

Example: She likes to eat mountains.

This sentence is syntactically correct, but semantically it is not acceptable: the argument of the verb “eat” should be an edible thing. This cognitive issue in semantic analysis is handled by allowing predicate words to impose semantic constraints on their argument words, a process known as semantic selectional restriction. A violation of a selectional restriction may produce an anomalous sentence [1].
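One classic way to encode such a constraint is through an ontology: require the object of “eat” to fall under food in the WordNet hypernym hierarchy. A sketch with NLTK (food.n.01 is WordNet's general food synset):

import nltk
nltk.download("wordnet", quiet=True)
from nltk.corpus import wordnet as wn

def is_edible(word):
    # True if any noun sense of the word has food among its hypernyms.
    food = wn.synset("food.n.01")
    return any(food in sense.closure(lambda s: s.hypernyms())
               for sense in wn.synsets(word, pos=wn.NOUN))

print(is_edible("pizza"))      # True  -> "eat pizza" is acceptable
print(is_edible("mountain"))   # False -> "eat mountains" violates the restriction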

14.4.4.2.4 Compositional Semantics

The meaning of a phrase or composite expression can be obtained by combining the meaning of each word unit in the phrase and by the syntactic arrangement of these constituent words. The combination of such words often leads to meaning representation issues. Meaning representation is the first task of semantic analysis as every natural language statement should be converted into an equivalent meaning representation for further mapping. In compositional semantics, meaning representation is usually done by a logical approach or a truth-conditional approach, which is based on the principle of compositionality [3].

14.4.4.3 CHALLENGES IN SEMANTICS

Some significant issues in semantics are discussed as follows.

14.4.4.3.1 Meaning Representation

To perform the desired task based on a linguistic input, a cognitive system needs a rich semantic analyzer. It is necessary to bridge the gap between the linguistic input and the knowledge base. Hence, the primary step in any semantic analysis is to construct an equivalent meaning representation of the natural language input. An ideal meaning representation should be verifiable, unambiguous, and expressive [1]. A significant issue in meaning representation is understanding sentences that convey the same meaning but are structured with different lexical patterns, as in the following example.

Example: Today is sunny.

Today is a bit hot.

The meaning representation for the given examples should be the same. The conventional ways of meaning representation are listed in the following.

14.4.4.3.1.1 Predicate-Argument Structure

The words and phrases in a sentence keep relationships or dependencies that reflect the underlying concepts and meaning of the input text. Grammatical objects such as verbs assert a predicate-argument structure with other constituent words such as nouns, noun phrases, etc., in the sentence. Hence, the predicate-argument structure is a suitable format for the semantic representation of human language.

14.4.4.3.1.2 Model-Theoretic Semantics

In this method, a linguistic input is represented as objects, their properties, and object relationships in a model. Here, a model is implemented to represent the state of affairs in the world. If the model accurately captures the facts from the input, then it efficiently expresses the meaning representation. The domain of a model is simply the set of objects that are part of the application, or state of affairs, being represented. Each distinct concept, category, or individual in an application denotes a unique element in the domain.

This approach sometimes leads to truth-conditional semantics, which represents the world as a mathematical abstraction made up of sets and relates linguistic expressions to the model. Truth-conditional semantics takes the external aspect of meaning as fundamental. Here, a sentence is true or false depending on the state of affairs that obtains in the world, and the meaning of a proposition is its truth conditions [1].

14.4.4.3.1.3 First-Order Predicate Calculus

First-order logic or first-order predicate calculus (FOPC) is a sound method for representing meaning that satisfies verifiability and expressiveness. The basic building blocks of FOPC are constants, functions, and variables, which refer to objects and their properties in the real world as modeled in a knowledge base. It uses logical connectives to build composite and complex representations from simple predicates. Quantifiers, such as the existential and universal quantifiers, can be applied to make statements about collections of variables. One crucial requirement of semantics is that meaning representation should support inference. With the support of inference, FOPC deduces valid conclusions from existing knowledge.
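NLTK ships a small logic package that can parse FOPC expressions and run inference over them; a minimal sketch (assuming NLTK is installed):

from nltk.sem import Expression
from nltk.inference import ResolutionProver

read = Expression.fromstring
# Knowledge base: every student is a person; ram is a student.
kb = [read("all x.(student(x) -> person(x))"),
      read("student(ram)")]
goal = read("person(ram)")
# Resolution-based inference: does the knowledge base entail the goal?
print(ResolutionProver().prove(goal, kb))   # True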

14.4.4.3.1.4 Semantic Networks and Frames

A semantic net is a representation of factual knowledge as entities and their relationships. It uses nodes to represent entities and links to represent relationships. Frames are structures containing all relevant information about a type of entity and its instances. A frame works like slots and slot fillers, storing the fields and values of a record structure, and sketches a narrow concept in detail.

FIGURE 14.8 Semantic net with nodes and links.
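A semantic net can be held in a program as a simple set of (entity, relation, entity) triples; a toy sketch with made-up facts:

# Nodes are entities; each triple is a labeled link between two nodes.
triples = [("canary", "is-a", "bird"),
           ("bird", "is-a", "animal"),
           ("bird", "has-part", "wing"),
           ("canary", "color", "yellow")]

def links_from(entity):
    return [(rel, obj) for subj, rel, obj in triples if subj == entity]

print(links_from("canary"))   # -> [('is-a', 'bird'), ('color', 'yellow')]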

14.4.4.4 COMPUTATIONAL SEMANTICS

Computational semantics performs the computation and analysis of linguistic meaning for the natural language input. It automatically assigns meanings to the intermediate representation with common-sense reasoning.

14.4.4.4.1 Syntax Directed Semantic Analysis

A syntax-analyzed tree is given to the semantic analyzer, which uses grammar rules and lexical semantics to assign the correct semantic value to each lexical item. Semantic grammars encode semantic information into a syntactic grammar; they use context-free rewrite rules with nonterminal semantic constituents. Generally, a semantic error occurs when the meaning of knowledge is not communicated correctly.

14.4.4.4.2 Information Retrieval

The information retrieval system relies on the storage of documents and retrieval of data in response to a textual query. Keyword-based retrieval should apply WSD to ensure proper indexing and error-free document retrieval. In addition, in a multiword query, the relationship between keywords is semantic rather than syntactic. A concept-based document retrieval should adopt a cognitive semantic approach to tackle the issues such as term co-occurrence and relationship between keywords.
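A minimal keyword-based retrieval sketch with scikit-learn's TF-IDF vectorizer over a toy corpus; it also shows the limitation the paragraph points at, since matching here is purely lexical, with no word senses or semantic relationships involved:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["the cat sat on the mat",
        "dogs and cats make good pets",
        "stock prices fell sharply today"]
vec = TfidfVectorizer(stop_words="english")
matrix = vec.fit_transform(docs)                     # index the documents
scores = cosine_similarity(vec.transform(["cats and dogs"]), matrix)[0]
print(scores.argmax())   # -> 1, the document mentioning cats and dogs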

TABLE 14.2 Commonly Used Lexical Resources in Semantics for WSD and SRL

WordNet [17]: vast computational lexicon or dictionary that provides different word senses. Miller, G. A., 1995.
FrameNet [18]: lexical database using semantic frames; describes a type of event, relation, or entity and its participants. Baker, C. F., Fillmore, C. J., & Lowe, J. B., 1998.
VerbNet [19]: online verb lexicon. Schuler, K. K., 2005.
PropBank [20]: Proposition Bank; information about basic semantic propositions. Kingsbury, P. R., & Palmer, M., 2002.
ConceptNet [21]: multilingual semantic net and common-sense knowledge base. Liu, H., & Singh, P., 2004.
BabelNet [22]: multilingual encyclopedic dictionary, integrated with WordNet. Navigli, R., & Ponzetto, S. P., 2010.
HowNet [23]: online common-sense knowledge base. Dong, Z., & Dong, Q., 2003.

14.4.4.4.3 Information Extraction

IE is the activity of identifying and extracting data from a document and categorizing them semantically. The various tasks included in IE are as follows:

  • Name extraction: The process of identifying the names in a text and classifying them as people, organizations, locations, etc.
  • Entity extraction: Identifying all phrases which refer to objects of specific semantic classes, and linking phrases which refer to the same object.
  • Relation extraction: Identifying pairs of entities in a specific semantic relation.
  • Event extraction: Identifying instances of events of a particular type and the arguments of each event.

The goal of IE is only to capture selected types of relations, types of events, and other semantic distinctions that are specified in advance. IE systems are domain based, and most of them incorporate the semantic structure of that domain. Major domains of IE are news articles, medical records, and biomedical literature, which have large quantities of text, repeated entities, and events of the same type and where there is a need to distill the information to a structured database form [2, 3].

14.4.4.4.4 Named Entity Recognition

Named entity recognition is the process of extracting the named entities present in a text into pretagged categories such as “individuals,” “companies,” “places,” “organizations,” “cities,” “dates,” “product terminologies,” etc. It enriches the semantic knowledge of the content and helps promptly understand the subject of any given text. It is useful in applications such as news content analysis, business sentiment analysis, etc. Named entity recognition can provide article scanning based on relevant tags to reveal the significant people, organizations, and places discussed in them. It helps in the automatic classification of articles with fast content discovery. In business sentiment analysis, extracting the identity of people, places, dates, companies, products, jobs, and titles gives insight into people's opinions on a product and company.
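A short sketch with spaCy's pretrained pipeline (assuming the en_core_web_sm model is installed; the exact labels and spans depend on the model):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Dr. APJ Abdul Kalam was the President of India "
          "and the chairman of ISRO.")
for ent in doc.ents:
    print(ent.text, "->", ent.label_)
# Typically yields PERSON, GPE, and ORG entities for Kalam, India, and ISRO.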

14.4.5 PRAGMATIC AND DISCOURSE ANALYSIS

The study of natural language through the speaker's utterance and its interpretation based on the situational context is called pragmatics. It is closely connected to semantics, but the focus is on interpersonal communication. The sentence meaning given by semantics may vastly underdetermine the speaker's meaning. The goal of pragmatics is to bridge the gap between the formal sentence meaning and the speaker's intention based on the context.

14.4.5.1 DISCOURSE

A group of structured, related sentences is a discourse [1]. It refers to a sequence of sentences, where each sentence is interpreted in the context of the preceding sentences. Natural language utterances are never disconnected or unrelated sentences; instead, they are structured and coherently related one after another. Such coherent groups of sentences form a discourse.

14.4.5.2 ANAPHORA RESOLUTION

The linguistic action of referring back to a previously mentioned item in the text is termed anaphora. The word or phrase “referring back” in the text is known as the anaphor, and the thing to which it refers is its antecedent. When the anaphor refers to an antecedent and both have the same referent in the real world, they are termed coreferential. The interpretation of anaphors, known as anaphora resolution, has a vital role in the understanding of discourse.

Example: Santa loves ice cream. She works in an ice cream shop.

Here, She is the anaphor, and Santa is the antecedent; she and Santa are coreferential.

14.4.5.3 COREFERENCE

A natural language sentence in a discourse often contains a reference to a previously mentioned entity; such an expression is called a referring expression, and the object that is referred to is called the referent. References used in a sentence are often denoted as mentions, and mentions of the same entity are linked as being coreferences, or members of a coreference set. These can include pronouns, nominal entities, named entities, and events [2]. Coreference resolution intends to find the referring expressions in a text that refer to the same entity. The set of expressions that co-refer is known as a coreference chain. Reference resolution is the task of determining what objects are referred to by which linguistic expressions [1]. Application areas of coreference resolution are IE, summarization, and conversational agents.

Example: Dr. APJ Abdul Kalam was the President of India. Formerly, he was the chairman of ISRO.

Here, he and Dr. APJ Abdul Kalam are coreferences.
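Modern coreference resolvers are trained neural models, but the basic idea can be illustrated with a deliberately naive recency heuristic built on spaCy's named entities (this is not a real resolver and fails on anything nontrivial):

import spacy

nlp = spacy.load("en_core_web_sm")
PRONOUNS = {"he", "she", "it", "they"}

def naive_resolve(text):
    # Link each pronoun to the most recent preceding PERSON/ORG token.
    last = None
    for tok in nlp(text):
        if tok.ent_type_ in ("PERSON", "ORG"):
            last = tok.text
        elif tok.lower_ in PRONOUNS and last:
            print(f"{tok.text!r} -> {last!r}")

naive_resolve("Dr. APJ Abdul Kalam was the President of India. "
              "Formerly, he was the chairman of ISRO.")
# 'he' -> 'Kalam' (the last token of the most recent PERSON entity)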

14.4.5.4 TEXT COHERENCE

Text coherence has an influential role in determining the acceptability of a specific discourse. Coherence refers to the meaningful connection between text units. A large structured discourse unit is built by organizing coherent structures in a meaningful way. A coherent paragraph is unified, logical, consistent, and meaningful. However, it makes sense only when reading as a whole.

Example: One Man's Meat (by E. B. White) [https://literarydevices.net/coherence/]

“Scientific agriculture, however sound in principle, often seems strangely unrelated to, and unaware of, the vital, grueling job of making a living by farming. Farmers sense this quality in it as they study their bulletins, just as a poor man senses in a rich man an incomprehension of his own problems. The farmer of today knows, for example, that manure loses some of its value when exposed to the weather... But he knows also that to make hay, he needs settled weather—better weather than you usually get in June.”

The above passage is an excellent example of coherent text. The author describes scientific agriculture and intelligently connects that topic with the farmer's problems and the climate.

14.4.5.5 RHETORICAL STRUCTURE THEORY (RST)

RST explains the coherence of texts and is a model of text organization [2]. It can be used for the systematic analysis of text using rhetorical relations. A text analysis using RST creates a tree based on the rhetorical relations to represent the content. All relations in RST are based on the concept of a nucleus and a satellite. The nucleus holds a central independent idea, while the satellite is less central, and usually its interpretation is connected with the nucleus. There is a set of 25 rhetorical relations, including the evidence relation, which applies to two text spans where the satellite provides evidence for the claim contained in the nucleus [2].

 