Text Mining
S. KARTHIKEYAN,1 JEEVANANDAM JOTHEESWARAN,1 B. BALAMURUGAN,1 and JYOTIR MOY CHATTERJEE2
1School of Computing Science and Engineering, Galgotias University, Greater Noida, Uttar Pradesh, India, E-mail: link2karthikcse@gmail.com (S. Karthikeyan)
2School of Computing Science and Engineering, LBEF (APUTI), Kathmandu, Nepal
ABSTRACT
The major objective of text mining (also called text data mining or text analytics) is to extract patterns or information from the large volume of available unstructured or semi-structured text data. Data mining deals only with structured data, whereas text mining deals with semi-structured or unstructured data. Around 80% of the data stored worldwide is in unstructured or semi-structured form, which makes text mining essential for handling that data in a meaningful way. Many techniques are used to extract information, depending on the nature of the data, including sentiment analysis, natural language processing (NLP), information extraction, information retrieval, clustering, concept linkage, association rule mining (ARM), summarization, and topic tracking; each is discussed in this chapter. The major problem in text mining is the ambiguity of natural language: one word can be interpreted in multiple ways. Ambiguity is the primary challenge for researchers to address, and possible solutions are explained. Algorithms such as the genetic algorithm and differential evolution can be combined to obtain the desired result, and the output of an algorithm can be scaled to ensure the quality of text retrieval. Two measures, precision and recall, are used to assess text retrieval quality in text mining. Text mining has applications in healthcare, telecommunications, research paper categorization, market analysis, customer relationship management (CRM), banking, information technology, and other environments where huge volumes of unstructured data are generated.
INTRODUCTION
Text mining (TM) is defined as the nontrivial extraction of hidden and potentially useful information from textual data. TM is a relatively new field that attempts to extract meaningful information from natural language text. It may be characterized as the process of analyzing text to extract information that is useful for a particular purpose. Compared with the data in databases, text is unstructured, ambiguous, and hard to process. In modern culture, text is the most common vehicle for the formal exchange of information [43].
TM typically deals with texts whose function is the communication of factual information or opinions, and the motivation for trying to extract information from such text automatically is compelling even if success is only partial [8]. TM is similar to data mining (DM), except that DM tools are designed to handle structured data from databases, whereas TM can also work with unstructured or semi-structured data sets such as e-mail messages, text documents, and HTML documents. Consequently, TM is a much better solution [37].
TM usually involves structuring the input text, deriving patterns within the structured data, and finally evaluating and interpreting the output. The term TM is commonly used to denote any system that analyzes large quantities of natural language text and detects lexical or semantic usage patterns in an attempt to extract probably useful information.
Applications of TM include:
- • Healthcare & biomedical;
- • Social media;
- • Banking;
- • Customer relationship management (CRM);
- • Education;
- • Web-based software;
- • Business intelligence;
- • Sentiment analysis;
- • Research paper classification;
- • Security and biometric.
DATA MINING (DM) AND TEXT MINING (TM)
TM is the study and application of textual information extraction (IE) based on the principles of computational linguistics. As a form of exploratory data analysis, TM is a method that uses software to support decision-makers, researchers, and practitioners who work with large text collections in discovering new and relevant information. The stakeholder remains involved, interacting with the system in a semi-automated process [55].
DM is one step in the KDD (knowledge discovery from data) process. KDD is concerned with acquiring useful knowledge that is novel, significant, and valid [19]. DM requires little interaction between the analyst and the DM tool. It is an automatic process because DM tools automatically search the data for anomalies and possible relationships, thereby identifying problems that end-users have not yet identified [37], whereas plain data analysis relies on end-users to define the problem, select the data, and initiate the appropriate analyses to build the model and solve the problem [20].
ADVANTAGES AND DISADVANTAGES OF DATA MINING (DM)
7.2.1.1 ADVANTAGES OF DATA MINING (DM)
7.2.1.1.1 Marketing/Retail
DM enables sales and marketing organizations to build models from historical data that predict which audience will respond to new marketing campaigns, such as direct mail and online marketing. As a result, marketers have an appropriate way to sell profitable products to targeted customers.
7.2.1.1.2 Manufacturing Environment
By applying DM to operational engineering data, manufacturers can detect faulty equipment and determine optimal control parameters. For example, semiconductor manufacturers face the challenge that even when manufacturing conditions at different wafer production plants are similar, the quality of the wafers varies, and some wafers, for unknown reasons, even have defects.
7.2.1.1.3 Banking and Financial Management
Financial information about loans and credit risk can be obtained using DM. Likewise, DM helps banks detect fraudulent credit card transactions and so protect the card's owner.
7.2.1.1.4 Social Media Analysis
Facebook and Twitter are considered the most popular social networking websites. These sites have made it easy to communicate with friends and family without effort. People from different backgrounds come closer to one another by sharing their ideas, interests, and knowledge. Nowadays, it is easy for anyone to meet people who share their interests for learning and exchanging valuable information [56].
Additionally, social media is a combination of several learning modes, for instance e-learning and m-learning. On the various social networking sites, the most common way of interacting with one another is through text. People exchange their ideas through blogs, posts, and every other medium in which text is involved. TM techniques are used to make text correct so that anybody can write in the most appropriate way. TM implies the extraction of information that is not familiar to anybody.
7.2.1.2 DISADVANTAGES OF DATA MINING (DM)
7.2.1.2.1 Information Usage/Inaccurate Information
Intruders around the world access the data of authorized individuals or organizations and use it without the knowledge of the genuine owner. When users work with genuine information, decisions can be taken effectively, whereas the results will be ineffective if incorrect information is used for decision-making.
7.2.1.2.2 Privacy in Database
Users' trust in the internet is steadily declining because data can be intercepted by attackers at any point in time during transmission across blogs, forums, online businesses, social networks, and so on. For this reason, authenticated users still worry about whether their data is stored securely. Organizations collect data on customers by various means in order to identify similar patterns.
7.2.1.2.3 Security Issues
Organizations hold information on their employees and customers, including citizen identification numbers, birthdays, and other financial data. How securely this personal data is stored remains in question. Incidents occur regularly around the world in which intruders attack a central bank server or another server that holds sensitive data. This leads to huge financial losses for the organization and causes people to lose trust in it [57].
DATA MINING (DM) VERSUS TEXT MINING (TM)
In earlier years, IT professionals concentrated on DM, trying to extract knowledge from huge volumes of unstructured text. Because most organizations do not hold their text in a structured form, obtaining the desired result was a challenge. Many organizations take decisions based on the information available in large databases.
The major difference between TM and DM is that TM derives knowledge from both structured and unstructured text. It can deal with all three types of data: structured, unstructured, and semi-structured.
7.2.2.1 TEXT MINING (TM) APPROACHES
Just as DM is not a single methodology or technique for discovering knowledge from data, TM also comprises a wide variety of techniques and technologies [47]. All TM approaches share a common feature: they process text. For example:
- • Keyword-Based Technologies: The input is based on a selection of keywords in the text, which are filtered as a series of character strings.
- • Statistics Technologies: These are systems based on machine learning. Statistics technologies leverage a training set of documents that is used as a model to manage and categorize text.
- • Linguistic-Based Technologies: These make use of language processing systems. The output of the text analysis allows a shallow understanding of the logic, grammar, and structure of the text.
PRELIMINARY TEXT MINING (TM) METHODS
TEXT PREPROCESSING
Text preprocessing is an essential part of any NLP system, since the characters, words, and sentences identified at this stage are the fundamental units passed to all further processing stages, from analysis and tagging components, such as morphological analyzers and part-of-speech taggers, through applications, such as information retrieval and machine translation systems [6]. It is a collection of activities in which text documents are preprocessed. Text data often contains special formats, such as number formats and date formats, as well as very common words that are unlikely to help TM, such as prepositions, articles, and pronouns; these can be eliminated [6].
- • Purpose of Text Preprocessing in an NLP System:
- 1. To reduce the index file size of the text documents:
- • Stop words account for 20-30% of the total word count in a particular text document.
- • Stemming may reduce the index size by perhaps 50%.
- 2. To improve the efficiency and effectiveness of the IR system:
- • Stop words contribute little to TM and can degrade the retrieval system.
- • Stemming is used to match similar words in a text document.
TOKENIZATION
Tokenization is the process of splitting a character sequence into pieces called tokens, perhaps discarding certain characters at the same time. The text is divided into words, phrases, symbols, or other meaningful elements. The aim of tokenization is the exploration of the words in a sentence [1].
Tokenization is useful both in linguistics, where it is a form of text segmentation, and in computer science, where it forms part of lexical analysis. Textual data is only a block of characters at the beginning. All processes in information retrieval require the words of the data set. Hence, a parser requires the documents to be tokenized [1] (Figure 7.1).

FIGURE 7.1 Tokenization.
This may sound trivial, as the text is already stored in machine-readable formats. Nevertheless, some problems remain, such as the removal of punctuation marks. Other characters, such as brackets and hyphens, require processing as well. Furthermore, a tokenizer can enforce consistency in the documents; typical inconsistencies are differing number and time formats. Another problem is abbreviations and acronyms, which have to be transformed into a standard form. The main use of tokenization is identifying the meaningful keywords.
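The issues above (punctuation, brackets, hyphens) can be illustrated with a minimal sketch. The regular expression and the function name `tokenize` are assumptions chosen for this illustration, not a method prescribed by the chapter; the pattern keeps runs of letters and digits and allows an internal apostrophe or hyphen.

```python
import re

# Assumed token pattern: alphanumeric runs, optionally joined by an
# internal straight apostrophe or hyphen (e.g. "isn't", "well-known").
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*")


def tokenize(text):
    """Split text into tokens, discarding punctuation and brackets."""
    return TOKEN_PATTERN.findall(text)


tokens = tokenize("Text mining (TM) isn't well-known, is it?")
print(tokens)
# ['Text', 'mining', 'TM', "isn't", 'well-known', 'is', 'it']
```

A production tokenizer would also need rules for numbers, dates, and abbreviations, which this sketch deliberately leaves out.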
STOP WORD REMOVING
In a document, a large number of words recur regularly yet are essentially meaningless, as they are used only to join words in a sentence. It is generally accepted that stop words do not contribute to the context or content of textual documents. Because these words are repeated many times, they hinder further processing of the document in TM.
| Statement with Stop Words | Statement without Stop Words |
|---|---|
| Akilan was studying Computer Science | Akilan, studying, Computer Science |
| Researchers are working hard on new innovations | Researchers, working, hard, new innovations |
| Mining is the best method for taking decisions | Mining, method, taking, decisions |
| Technologies are making people dumb | Technologies, making, people, dumb |
Stop words are the most common words, such as ‘is,’ ‘was,’ ‘and,’ ‘are,’ and ‘this.’ They are not useful in the classification of text and should be removed. In the example above, the statement “Akilan was studying Computer Science” contains the stop word ‘was,’ which is not needed in text processing. In the next statement, “Researchers are working hard on new innovations,” the words ‘are’ and ‘on’ are stop words, which can be eliminated in the TM process [1].
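Stop word removal can be sketched as a simple filter against a stop list. The list below is a small, hypothetical one assembled for this illustration (it contains the words named in the text plus a few other common function words); real systems use much longer lists.

```python
# Hypothetical, deliberately small stop list for illustration only.
STOP_WORDS = {"is", "was", "and", "are", "this", "on", "the", "for", "a", "an", "of"}


def remove_stop_words(sentence):
    """Return the words of a sentence that are not in the stop list."""
    return [word for word in sentence.split() if word.lower() not in STOP_WORDS]


kept_first = remove_stop_words("Akilan was studying Computer Science")
kept_second = remove_stop_words("Researchers are working hard on new innovations")
print(kept_first)   # ['Akilan', 'studying', 'Computer', 'Science']
print(kept_second)  # ['Researchers', 'working', 'hard', 'new', 'innovations']
```

The comparison is case-insensitive so that a stop word at the start of a sentence is removed as well.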
STEMMING
Stemming falls under natural language processing (NLP) and natural language understanding (NLU): a word or term is reduced to its stem. For example, the words run, running, and runner all reduce to the stem run.
There are two kinds of errors in stemming:
- 1. Over-stemming; and
- 2. Under-stemming.
- • Over-Stemming: It occurs when two words with different stems are stemmed to the same root. This is also known as a false positive.
- • Under-Stemming: It occurs when two words that should be stemmed to the same root are not. This is also known as a false negative.
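The two error types can be made concrete with a small sketch that compares a stemmer's output against hand-made groups of words that ought to share a stem. Everything here is an assumption for illustration: `crude_stem` is a toy suffix stripper, and the word groups are invented, not taken from the chapter.

```python
from itertools import combinations


def crude_stem(word):
    """Toy stemmer (hypothetical): strip one common suffix if enough remains."""
    for suffix in ("ing", "er", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def stemming_errors(groups, stem):
    """Return (over_stemmed, under_stemmed) lists of word pairs.

    groups: list of sets; words in the same set should share a stem.
    """
    group_of = {word: i for i, group in enumerate(groups) for word in group}
    over, under = [], []
    for a, b in combinations(sorted(group_of), 2):
        same_group = group_of[a] == group_of[b]
        same_stem = stem(a) == stem(b)
        if same_stem and not same_group:
            over.append((a, b))    # false positive: wrongly merged
        elif same_group and not same_stem:
            under.append((a, b))   # false negative: wrongly kept apart
    return over, under


groups = [{"run", "running", "runner"}, {"wand", "wands"}, {"wander", "wandering"}]
over, under = stemming_errors(groups, crude_stem)
print(over)   # [('wand', 'wander'), ('wander', 'wands')]
print(under)  # [('run', 'runner'), ('run', 'running'), ('wander', 'wandering')]
```

Here the toy stemmer over-stems "wander" to "wand" and under-stems "running" to "runn" instead of "run".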
7.3.4.1 TYPES OF STEMMING ALGORITHMS
- 1. Table Lookup Approach: One way to do stemming is to store a table of all index terms and their stems. Terms from queries and indexes can then be stemmed via table lookup, using B-trees or hash tables. Such lookups are very fast; however, there are problems with this approach [1]. First, no such data exists for English as a whole, and even if it did, many terms would not be represented because they are domain specific and would require some other stemming method. The second problem is storage overhead.
- 2. Successor Variety: These stemmers are based on structural linguistics, which determines word and morpheme boundaries from the distribution of phonemes. The successor variety of a string is the number of distinct characters that follow it in the words of some body of text. For example, consider a body of text consisting of the following words [1].
Karthik, Kavin, Kenny, Karina, Krunal, Jeeva, Bala, Joy
Let us determine the successor variety for the word Karthik. The first letter in Karthik is K. K is followed in the text body by 3 characters, A, E, and R, so the successor variety of K is 3. Similarly, the successor variety of KAR is 2, since T and I follow KAR in the text body, and so on. Table 7.1 shows the successor variety for the word Karthik.
TABLE 7.1 Stemming Approach
| Prefix | Successor Variety | Letters |
|---|---|---|
| K | 3 | A, E, R |
| KA | 2 | R, V |
| KAR | 2 | T, I |
| KART | 1 | H |
Once the successor variety for a given word has been determined, this information is used to segment the word.
i. Cutoff Strategy: A cutoff value is selected and a boundary is identified whenever the cutoff value is reached.
ii. Peak and Plateau Strategy: A segment break is made after a character whose successor variety exceeds that of the characters immediately preceding and following it.
iii. Complete Word Strategy: A break is made after a segment if the segment is a complete word in the corpus.
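The successor variety counts for Karthik can be reproduced with a short sketch over the example word list. The function name `successor_variety` and the case-insensitive matching are assumptions made for this illustration; the segmentation strategies listed above would take these counts as their input.

```python
def successor_variety(prefix, corpus):
    """Count the distinct letters that follow `prefix` in the corpus words."""
    prefix = prefix.upper()
    followers = {
        word.upper()[len(prefix)]
        for word in corpus
        if word.upper().startswith(prefix) and len(word) > len(prefix)
    }
    return len(followers)


corpus = ["Karthik", "Kavin", "Kenny", "Karina", "Krunal", "Jeeva", "Bala", "Joy"]
word = "Karthik"

# Successor variety of every proper prefix of the word.
varieties = [
    (word[:i].upper(), successor_variety(word[:i], corpus))
    for i in range(1, len(word))
]
print(varieties)
# [('K', 3), ('KA', 2), ('KAR', 2), ('KART', 1), ('KARTH', 1), ('KARTHI', 1)]
```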
3. N-Gram Stemmers: This method was devised by Adamson and Boreham. It is called the shared digram method. A digram is a pair of consecutive letters. The technique is called the n-gram method because trigrams or, in general, n-grams could be used instead. In this method, association measures are calculated between pairs of terms based on their shared unique digrams [1].
For example, consider the two words running and runner: running -> ru un nn ni in ng; runner -> ru un nn ne er
In this example, the word running has 6 unique digrams, runner has 5 unique digrams, and the two words share 3 unique digrams: ru, un, and nn. Once the number of unique digrams has been found, a similarity measure based on the unique digrams is calculated using the Dice coefficient. The Dice coefficient is defined as:
$$S = \frac{2C}{A + B}$$
where: C = number of unique digrams shared by the two words; A = number of unique digrams in the first word; B = number of unique digrams in the second word. For running and runner, S = (2 × 3)/(6 + 5) = 6/11 ≈ 0.55.
Similarity measures are determined for all pairs of terms in the database, forming a similarity matrix. Once such a similarity matrix is available, the terms are clustered using a single-link clustering technique.
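The shared digram measure can be sketched in a few lines. The function names are assumptions for this illustration; the computation is the Dice coefficient over the sets of unique digrams, which for running and runner gives 6 and 5 unique digrams with 3 in common.

```python
def unique_digrams(word):
    """Return the set of unique pairs of consecutive letters in a word."""
    word = word.lower()
    return {word[i:i + 2] for i in range(len(word) - 1)}


def dice_coefficient(first, second):
    """Dice similarity 2C / (A + B) over the unique digrams of two words."""
    a, b = unique_digrams(first), unique_digrams(second)
    if not a and not b:
        return 0.0  # assumed convention when neither word has a digram
    return 2 * len(a & b) / (len(a) + len(b))


shared = unique_digrams("running") & unique_digrams("runner")
similarity = dice_coefficient("running", "runner")
print(sorted(shared))  # ['nn', 'ru', 'un']
print(similarity)      # 6/11, about 0.545
```

Computing this value for every pair of terms would give the entries of the similarity matrix that is then passed to the clustering step.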
4. Affix Removal Stemmers: Affix removal stemmers remove the suffixes or prefixes from terms, leaving the stem. One example of an affix removal stemmer is one that removes the plural forms of terms. A set of rules for such a stemmer is as follows:
i. If a word ends in “ies” but not “eies” or “aies”
Then “ies” -> “y”
ii. If a word ends in “es” but not “aes,” “ees,” or “oes”
Then “es” -> “e”
iii. If a word ends in “s” but not “us” or “ss”
Then “s” -> NULL
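The three plural-removal rules can be sketched as a chain of checks. The function name `s_stem` is hypothetical, and one detail is an assumption rather than something the text states: the rules are tried in order and only the first rule whose full condition holds is applied.

```python
def s_stem(word):
    """Remove a plural ending using the three rules, tried in order.

    Assumption: only the first rule whose condition holds is applied.
    """
    w = word.lower()
    # Rule i: "ies" -> "y", unless the word ends in "eies" or "aies".
    if w.endswith("ies") and not w.endswith(("eies", "aies")):
        return w[:-3] + "y"
    # Rule ii: "es" -> "e", unless the word ends in "aes", "ees", or "oes".
    if w.endswith("es") and not w.endswith(("aes", "ees", "oes")):
        return w[:-2] + "e"
    # Rule iii: drop a final "s", unless the word ends in "us" or "ss".
    if w.endswith("s") and not w.endswith(("us", "ss")):
        return w[:-1]
    return w


for plural in ["ponies", "plates", "trees", "cats", "glass", "bus"]:
    print(plural, "->", s_stem(plural))
# ponies -> pony, plates -> plate, trees -> tree,
# cats -> cat, glass -> glass, bus -> bus
```

The exceptions matter: "trees" is excluded from rule ii by the "ees" exception and is handled by rule iii instead, which yields "tree".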