General Outline of Text Pre-Processing in COVID-19

Text pre-processing is a sequence of steps designed to convert noisy text into a form suitable as input to an algorithm. Figure 6.3 illustrates the distinct operations of text pre-processing for COVID-19 analysis. Tokenization is the most important step because COVID-19 information appears sparsely and irregularly in the text. In tokenization, the input text is divided into small units, and the resulting tokens are tagged with their respective part of speech (POS) and COVID-19 event diffusion rate. In the next phase, tokens are normalized to a consistent form using lemmatization, and finally, stop words are typically filtered out.
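The steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the chapter's implementation: the stop list and the lemma table are tiny hypothetical stand-ins for real resources, the function names are invented for the example, and the POS and diffusion-rate tagging step is omitted because it needs a trained tagger.

```python
import re

# Hypothetical, deliberately tiny stop list; a real pipeline would use a much larger one.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}

# Hypothetical lookup table standing in for a real lemmatizer.
LEMMAS = {"cases": "case", "rising": "rise", "vaccines": "vaccine"}

# Words made of letters and digits; internal hyphens keep terms like "COVID-19" whole.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*")


def tokenize(text):
    """Divide raw text into small units (tokens), dropping punctuation."""
    return TOKEN_PATTERN.findall(text)


def normalize(token):
    """Lowercase a token and map it to its lemma when the toy table knows it."""
    lowered = token.lower()
    return LEMMAS.get(lowered, lowered)


def preprocess(text):
    """Run tokenization, normalization, and stop-word filtering in sequence."""
    tokens = tokenize(text)
    normalized = [normalize(token) for token in tokens]
    return [token for token in normalized if token not in STOP_WORDS]


print(preprocess("COVID-19 cases are rising in the city!"))
```

For the sample sentence, the pipeline keeps only the content-bearing tokens "covid-19", "case", "rise", and "city". Note how little remains of a short post once stop words are removed.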

Main Challenges Related to Social Media Text of COVID-19

COVID-19 in social media is an ambiguous notion that refers to generating and sharing COVID-19 data or contributing to social networking (Peetz, 2015).

FIGURE 6.3 Basic layout of text mining.

Many blogs and social media platforms related to COVID-19 are available, and all of them offer great potential for extracting data and finding meaningful patterns in it. Various channels, such as question answering, sentiments, mailing lists, and group discussion platforms, are designed to present information carefully, to help people, and to provide precautions related to this pandemic (Ren, 2016; Han, 2014). In other words, social media today includes various tools and methods that allow users to share, access, and diffuse information. Because social media is an integral part of users’ experience, social networking has become a feature of social media. Aichner and Jacob (2015) divide social media into different types, including microblogs, e-commerce portals, multimedia sharing, social networks, review platforms, social gaming, and virtual worlds. These domains are interesting sources of knowledge because of the features associated with them. Social media texts have special features, which are represented in Figure 6.3.

  • Tininess: In social media, users’ posts are usually short compared with traditional media. On Twitter, for example, posts, called “tweets,” are limited to 280 characters (Ren et al., 2013). Because tweets are so short, it is difficult to extract the required COVID-19 features from such a small amount of text.
  • Multilanguage: Users of social media can express their views in different languages and can choose any platform for discussion, sharing their views in groups, blogs, communities, or on any other social media platform. Participants may use different languages from one another, so understanding the real meaning of multilingual texts on social media is becoming challenging. During the COVID-19 pandemic, if the challenge of translation is not met, it becomes a problem for ordinary users to understand the actual system (Ren, 2016). Therefore, a machine learning mechanism is required to handle such challenges and provide clear-cut information to the people.

Social Text Quality Challenges

Stop list: Occurrence of frequent words in text
Text pre-processing: Remove the undesired information from collected text
Intelligibility of words: To provide a clear meaning in text
Predicting data annotation and its characteristics
Scope of ambiguity, data dependency
Various methods to tokenize words or phrases
Usual learning
Similarity measures and use of characterization

  • Opinions: On social media, posts typically express an opinion. An opinion can be defined as a quintuple (o_j, f_jk, so_ijkl, h_i, t_l), where “o_j” is a target object, “f_jk” is a feature of the object “o_j,” and “so_ijkl” is the sentiment value of the opinion of the opinion holder “h_i” on feature “f_jk” of object “o_j” at time “t_l.” An opinion can be positive, negative, or neutral, depending on the words expressed by the opinion holder (Pang and Lee, 2008).
  • Appropriateness: Social media is dynamic. It can provide precise information, but the textual representation of that information varies over time, and sometimes the whole structure of a sentence changes, including its meaning. Because of this dynamic behavior, analyzing the exact picture of COVID-19 detection, classification, and diffusion prediction in the current scenario is quite challenging (Table 6.1).
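
The opinion quintuple from the list above maps naturally onto a small record type. The following Python sketch is only an illustration: the class name, field names, example values, and the choice to store the sentiment as one of three strings are assumptions made for this example, not something the cited definition prescribes.

```python
from dataclasses import astuple, dataclass

# Assumed encoding of the three sentiment values named in the text.
ALLOWED_SENTIMENTS = {"positive", "negative", "neutral"}


@dataclass(frozen=True)
class Opinion:
    """One opinion, with fields in the same order as the quintuple."""
    target: str     # o_j: the target object
    feature: str    # f_jk: a feature of the target object
    sentiment: str  # so_ijkl: sentiment value of the opinion
    holder: str     # h_i: the opinion holder
    time: str       # t_l: when the opinion was expressed

    def __post_init__(self):
        # Reject sentiment values outside the three named in the text.
        if self.sentiment not in ALLOWED_SENTIMENTS:
            raise ValueError(f"unknown sentiment: {self.sentiment!r}")


# Hypothetical example values, chosen only to show the structure.
opinion = Opinion(
    target="vaccine",
    feature="availability",
    sentiment="positive",
    holder="user_42",
    time="2021-03-01",
)
print(astuple(opinion))
```

Keeping the record immutable (frozen) makes it safe to use opinions as dictionary keys or set members when aggregating them by holder, object, or time.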