What is text mining?
In Chapter 1, we introduced the term “text mining” and defined it as follows:
Text mining is a process of extracting information from various text sources (such as Word documents, PDF files, social media posts, emails, websites, articles, XML files, and others) to discover patterns, trends, and themes. Text found in these documents is typically unstructured, i.e., they are not in a predefined format that can be analyzed through data analytics software such as IDEA or ACL. Text mining is performed in two steps: (1) imposing structure on the text data sources, and then (2) using data mining techniques to extract relevant information.
(Sharda et al., 2014)
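The two steps in this definition can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not a method from the cited source: the sample documents and the function names (`tokenize`, `build_term_document_matrix`, `document_frequency`) are hypothetical, and the "mining" step is deliberately trivial (it only counts how many documents contain each term).

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_term_document_matrix(documents):
    """Step 1: impose structure by turning each document into a row of term counts."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, matrix


def document_frequency(vocabulary, matrix):
    """Step 2: a very simple mining step, counting the documents that contain each term."""
    return {
        term: sum(1 for row in matrix if row[j] > 0)
        for j, term in enumerate(vocabulary)
    }


# Hypothetical sample documents, invented for illustration only.
documents = [
    "Invoice approved and payment released",
    "Payment delayed pending invoice review",
    "Travel expense claim approved",
]

vocabulary, matrix = build_term_document_matrix(documents)
df = document_frequency(vocabulary, matrix)
print(df)
```

Once the text is held in a matrix of this kind, it has a predefined format, and conventional data mining techniques can in principle be applied to its rows and columns.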
There are multiple definitions of text mining in the literature. A critical preliminary point is that text mining goes beyond information retrieval.
The purpose of information retrieval is to differentiate between relevant and non-relevant texts, with the primary goal to enable information access or make it faster and more accurate. The purpose of text mining is to analyze texts and uncover new insights from them, with a primary goal to gain new actionable knowledge or to improve decision-making.
In other words, text mining is a form of data mining that analyzes patterns from unstructured text sources as opposed to structured sources, such as databases.
Early research on text mining refers to it as the process of knowledge discovery from extensive collections of unstructured textual data (Feldman & Dagan, 1995). In her essay on text mining (Hearst, 2003), Professor Marti Hearst of the University of California, Berkeley defines text mining as the computer-led discovery of new, previously unknown information by automatically extracting information from different written resources. She adds that “a key element is the linking together of the extracted information together to form new facts or new hypotheses to be explored further by more conventional means of experimentation”. In line with that stream of research, Professor Catherine Blake of the University of Illinois at Urbana-Champaign characterizes text mining as the process of identifying novel, interesting, and understandable patterns from a collection of texts (Blake, 2011, p. 126). For Professor Stephane Tuffery of the University of Rennes in France, text mining is the set of techniques and methods used for the automatic processing of natural language text data available in reasonably large quantities in the form of computer files, to extract and structure their contents and themes, for rapid (non-literary) analysis, the discovery of hidden data, or automatic decision making (Tuffery, 2011, p. 627).
Text mining is also referred to as text analytics, and the two terms are often used interchangeably in the literature and in practice. There is a subtle difference between them: text analytics may be viewed as more focused on visualization (graphs, reports), using the results of analyses produced by text mining. In this book, however, we use the terms interchangeably.
Text mining has gained tremendous momentum recently, mainly because of the exponential growth of content on the internet from web-enabled applications and social networks, most of which consists of unstructured text. Text mining has also grown because of the rapid digitalization of business processes as organizations moved away from paper to electronic documents and other digital records. Finally, significant technological advances in cloud computing, AI, the internet of things (IoT), and big data analytics have had a notable impact on the amount of data available for text mining.
The role of natural language processing in text mining
Text mining, as a process of knowledge discovery from unstructured texts, requires natural language interpretation. For this reason, natural language processing (NLP), which is the process by which a computer program interprets natural language like speech or text, is one of the most important concepts and methodologies for enabling text mining to extract actionable knowledge effectively and efficiently from texts.
As defined in Chapter 1, NLP is a subfield of AI that focuses on the interaction of computers and people using human languages. This subfield can be broken down further into two types: natural language understanding and natural language generation.
As we will discuss in this chapter, NLP techniques involve both statistical and machine learning approaches.
Overview of text mining research
Text mining is a multidisciplinary branch of knowledge that involves concepts and techniques from such disparate fields as the library and information sciences, linguistics, computer sciences, and data sciences (i.e., statistics and data mining). These techniques are applied in numerous domains, such as medical research, bibliometric studies, marketing, government, political research, and technology.
Considerable research has been published in the field of text mining in many of these domains. A study of the text mining literature published under the subject category “Information Science Library Science” in the Web of Science Database during 1999-2013 counted more than 36,000 text mining-related research papers, with research contributions in various subjects such as biology, technology, chemistry, physics, medical sciences, and the social sciences (Nagarkar & Kumbhar, 2015).
In the financial sector, abundant literature exists documenting the application of data mining in areas such as stock market prediction, financial risk analysis, and fraud detection. In contrast, text mining in financial applications has emerged only recently (Pejic-Bach et al., 2019). This development has allowed researchers and organizations to identify valuable new patterns and insights from unstructured text sources such as corporate documents, social media posts, emails, and call logs. Most of this emerging research focuses on external data sources, such as news and online media posts, for stock market prediction and fraud detection. The number of research studies using internal data sources is still relatively low (Pejic-Bach et al., 2019).
A 2016 survey of the applications of text mining in the financial domain (Kumar & Ravi, 2016) categorized the emerging literature in the following areas:
- Foreign exchange (FOREX) rate prediction: e.g., mining of news for exchange rate forecasting;
- Stock market prediction: e.g., mining of financial news and textual information available on websites for financial market predictions or stock price prediction;
- Customer relationship management: e.g., mining of customer reviews, complaints, news or social media contents for customer opinion or sentiment analysis; and
- Cybersecurity, fraud, and other risk detection: e.g., anomaly detection in email systems, spam, phishing, and malware detection, churn prediction, bankruptcy prediction, financial statement fraud.
However, more investigations remain to be done to advance text mining research in these areas as well as in new financial application areas.
A 2019 study (Lewis & Young, 2019) reviewed papers employing textual analysis methods published in leading accounting and finance journals over the period 2010-2018. Lewis and Young found that the accounting and finance profession has been slow to adopt mainstream natural language processing (NLP) methods. Instead, accounting and finance professionals preferred simple approaches such as keyword searches, word counts, and dictionaries that measure specific attributes such as tone and readability scores. The study also highlights the small but growing body of work applying specific NLP techniques, including machine learning classifiers and statistical methods for identifying topic structure, at the document or corpus level. Finally, the authors discuss the opportunities and challenges for improving NLP usage in financial reporting research, identifying access to text resources and interdisciplinary and intersectoral collaboration as being among the main impediments to future progress.
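To make the "simple approaches" mentioned above concrete, the following Python sketch shows a keyword search and a dictionary-based tone measure. It is a minimal sketch under stated assumptions, not the method of any cited study: the word lists are tiny, hypothetical examples rather than a published dictionary, the sample sentence is invented, and the tone formula (positive minus negative hits, divided by total hits) is one common convention among several.

```python
import re

# Hypothetical, deliberately tiny word lists for illustration only;
# they are not taken from any published tone dictionary.
POSITIVE_WORDS = {"growth", "improved", "strong", "gain"}
NEGATIVE_WORDS = {"loss", "decline", "impairment", "weak"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def keyword_hits(text, keywords):
    """Keyword search: count how often each keyword occurs in the text."""
    tokens = tokenize(text)
    return {kw: tokens.count(kw) for kw in keywords}


def tone_score(text):
    """Dictionary-based tone: (positive - negative) / (positive + negative).

    Returns 0.0 when the text contains no dictionary words.
    """
    tokens = tokenize(text)
    pos = sum(1 for t in tokens if t in POSITIVE_WORDS)
    neg = sum(1 for t in tokens if t in NEGATIVE_WORDS)
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total


# Invented sample sentence, for illustration only.
sample = "Revenue growth was strong, but an impairment loss led to a decline in profit."

print(keyword_hits(sample, ["loss", "fraud"]))
print(tone_score(sample))
```

Approaches of this kind are easy to implement and audit, which may partly explain their popularity, but they ignore word order, negation, and context, which the NLP techniques discussed in this chapter attempt to capture.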
In summary, the use of text mining in the finance and accounting literature is still a new trend. Future growth in adoption will be driven by:
- Improvements in data access,
- Increased collaboration among researchers from various disciplines and between academia and practice, and
- The growing availability of resources in big data analytics, machine learning, and artificial intelligence.