Machine Learning-Based Text Mining in Social Media for COVID-19
Tajinder Singh and Madhu Kumari
In the current era, social media is a primary source of textual corpora and real-time social text streams. HTML documents of many kinds carry vast and detailed information, attracting a wide range of users. Users post both essential and redundant information on social media to seek the attention of other users. During the COVID-19 pandemic, social media has played a key role and has become prominent in multiple ways, dealing not only with text but also with multimedia. Information posted by users can be extracted over time with analysis tools, which helps to understand the real-time situation and how it changes. Similar to data mining, text mining searches the collected social streams or corpus for relevant information and identifies interesting patterns. Every social media platform is packed with data about COVID-19, which makes these platforms a rich source for information-seekers who collect data and identify valuable patterns through analysis. Social media text is usually extracted in an unstructured form, which needs to be tamed before further analysis and classification. Online social media is a means of information-sharing that offers extensive ways to contribute to and participate in online communities and so become part of an information diffusion system. Several kinds of social media have expanded and now attract very large numbers of users. This availability of users and data in real time makes it possible to extract and analyze information from a variety of perspectives. Various authors have proposed methods and approaches for extracting social media text. These methods are used to distribute and connect with other online users to exchange information in multiple forms (Lifna and Vijayalakshmi, 2015).
Such platforms contribute greatly to producing and spreading social information in real time and to seeking the attention of other users, and as a result the number of social media users is growing exponentially. With the rapid evolution of COVID-19, the World Wide Web (WWW) is filled with news and articles related to it. The WWW has become a leading source for addressing the concerns of social media users and for producing real-time social stream data, including large text corpora. One study (Goldhahn et al., 2012) describes numerous schemes for assembling textual information, which can also be applied to text concerned with COVID-19. The assembled text needs to be processed before further analysis. Taming the text is therefore a valuable step through which the necessary, highest-quality information can be pulled out. The extracted data patterns can be used for multiple purposes such as polarity disambiguation and sentiment analysis (Bhadane et al., 2015; Hamdan et al., 2015; Saleiro et al., 2017), and social event detection, classification, and analysis on social media (Vavliakis et al., 2013; Zhou and Chen, 2014; Zhao and Mitra, 2007).
As COVID-19 gains more attention, it is also forcing people to change their lifestyles and social lives. A key aspect of this change is the advisories and recommendations issued by various health organizations through online and offline channels. Taking all the patterns related to COVID-19 together, its trend is observed to keep growing over time, and various precautionary recommendations are issued accordingly. Trend prediction (Aiello et al., 2013), topic tracking, and recommendation (Lin et al., 2011; Martinez et al., 2008) for the COVID-19 pandemic are therefore related to each other, either directly or indirectly. Multiple methods are used to represent information on social media, and for textual information the bag-of-words (BOW) model has become very common due to its simplicity. In the COVID-19 situation, the main goal is to understand the type of information shared on social media, in which named entity recognition plays a significant role (Quan and Ren, 2014). It helps to understand the nature of the collected text corpus or text stream in order to mine useful patterns of information (Croft et al., 2010) concerned with COVID-19.
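The bag-of-words representation mentioned above can be illustrated with a minimal sketch. It is not taken from the cited works: the tokenizer, the function names, and the two sample posts are assumptions made for illustration, and only the Python standard library is used.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (an assumed, simple tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bag_of_words(docs):
    """Return (vocabulary, vectors); each vector holds per-document term counts in vocabulary order."""
    counts = [Counter(tokenize(doc)) for doc in docs]
    vocab = sorted(set().union(*counts))
    vectors = [[count[term] for term in vocab] for count in counts]
    return vocab, vectors


# Hypothetical posts, invented for this example.
posts = [
    "Wear a mask and wash hands",
    "Mask mandate announced: wear a mask",
]
vocab, vectors = bag_of_words(posts)
```

Each post becomes a vector of word counts over a shared vocabulary, so word order is discarded. That loss of order is what makes the model simple, and it is also its main limitation.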
To bring more clarity to this vast area of text mining, its key roles in the COVID-19 pandemic are explained below:
- Information exploration and retrieval: Text related to COVID-19 is explored and retrieved by a user who wants to understand the hidden features of the extracted data. Extraction can be keyword-based, where the user passes a query to extract information, or index-based (Croft et al., 2010; Alves et al., 2019).
- Clustering of extracted text: Clustering plays a key role in collecting the text related to COVID-19 from the extracted data or in a real-time scenario (Ghai et al., 2016). Multiple algorithms are available to cluster the data, and online fast clustering is used effectively for fast processing of data in real time (Aggarwal, 2014).
- Classification of collected text: In text mining, classification is an important task which sorts the collected text into various segments. If a user wants to extract only the relevant information from a COVID-19 text corpus, classification comes into play (Santana et al., 2020).
- Mining of web: COVID-19 related information is available in huge amounts on the web in distinctive structures. Web crawlers are used to extract the data, usually in a structured way, from which patterns of information can be analyzed by applying machine learning algorithms (Desikan et al., 2006).
- Extraction of features and concept: Features are extracted from the collected text, which is usually unstructured. The attributes and features of the collected data can be visualized by extracting the features using machine learning algorithms. Similarly, if linked information exists in the collected text, the linkage is also analyzed and matched with the available text (Popowich, 2005; Gelfand et al., 1998).
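As an illustration of the keyword-based and index-based retrieval described in the first item of the list above, the following sketch builds a small inverted index and answers keyword queries against it. The function names, the AND semantics of the query, and the sample documents are assumptions made for this example, not details from the cited works.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return the sorted ids of documents that contain every token of the query."""
    tokens = tokenize(query)
    if not tokens:
        return []
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return sorted(result)


# Hypothetical documents, invented for this example.
docs = [
    "Vaccine trial results published",
    "Lockdown extended in the city",
    "New vaccine rollout begins after lockdown",
]
index = build_index(docs)
hits = search(index, "vaccine lockdown")
```

Here `hits` contains only the id of the third document, the one that mentions both query terms. A production system would typically add ranking (for example TF-IDF weighting) on top of this exact-match lookup.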
Activities Involved in COVID-19 Text Mining Process
Activity 1: Create the corpus: This is the first activity in the text mining process. Its main purpose is to build a corpus from the collected text in an appropriate way. In this step, all the features are combined, which helps to analyze the hidden patterns accurately (Figure 6.1). COVID-19 related features are obtained here and are analyzed further for a detailed study.
Activity 2: Text pre-processing: This activity removes unwanted data from the collected text and yields a structured representation with detailed attribute values. Its main aim is to remove slang, emoticons, abbreviations, misspelled words (Lifna and Vijayalakshmi, 2015; Kumar and Govindarajulu, 2013), and other information which is not required for the study of the COVID-19 text.
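A minimal sketch of such a pre-processing step is given below. It is an assumption-laden illustration rather than the procedure of the cited authors: the `SLANG` table is a hypothetical stand-in for a curated lexicon, the emoticon pattern covers only a few common ASCII forms, non-ASCII symbols are dropped on the assumption of English-only text, and spelling correction is not attempted.

```python
import re

# Hypothetical lookup table; a real study would use a curated slang/abbreviation lexicon.
SLANG = {"u": "you", "pls": "please", "gr8": "great", "b4": "before"}

URL_RE = re.compile(r"https?://\S+")
# Matches a few standalone ASCII emoticons such as :) ;-( =D (assumed, far from exhaustive).
EMOTICON_RE = re.compile(r"(?<!\S)[:;=]-?[)(DPp](?!\S)")
# Drops emoji and any other non-ASCII symbol (assumes English-only text).
NON_ASCII_RE = re.compile(r"[^\x00-\x7f]")


def preprocess(text):
    """Strip URLs, emoticons and non-ASCII symbols, then expand known slang tokens."""
    text = URL_RE.sub(" ", text)
    text = EMOTICON_RE.sub(" ", text)
    text = NON_ASCII_RE.sub(" ", text)
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [SLANG.get(token, token) for token in tokens]


# Hypothetical post, invented for this example.
cleaned = preprocess("Pls wear a mask b4 u go out :) https://example.com/x")
```

The result is a list of normalized tokens that can be passed to later steps such as feature extraction or classification.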
Activity 3: Knowledge extraction: New, well-structured knowledge is extracted from the processed text for the exact problem being addressed, using the structured COVID-19 text. This activity comprises several tasks, such as prediction (which involves classification and real-time series prediction), association and clustering of texts, trend prediction, and analysis to extract informative patterns.
FIGURE 6.1 Representation of COVID-19 text mining task.
The remainder of this chapter is organized as follows. Section 6.2 highlights the value of and motivation for text mining, and Section 6.3 elaborates the general outline of text pre-processing. Section 6.4 explains the various text pre-processing mechanisms helpful for COVID-19 text, and the extraction mechanism of social texts is described in Section 6.5. The scope of various machine learning approaches in text mining in social media for COVID-19 and recent approaches to text mining are explained in Sections 6.6 and 6.7, respectively. Impacts of the COVID-19 pandemic in various sectors are explained in Section 6.8, and Section 6.9 puts forward a conclusion.