Data Collection

Both combined and automatic data models were presented to extract large-scale data from different sources. Social network data from Twitter, Facebook, blogs, and WhatsApp can be collected on a massive scale through the application programming interfaces (APIs) provided by the various social websites, as well as through Flume, a standard big data tool employed to ingest the data. An API imposes rate limits on data collection that depend on the time window and the volume of data extracted. A modified linear recursive method has been utilized to retrieve data from these large-scale sources. After the query is formed, the relevant data is extracted and saved in central storage, the Hadoop Distributed File System (HDFS). Hence, the data obtained from diverse social websites provides a large number of parameters as a distinct group. As there are restrictions in India on geographical research, the Geo Location field of the data set was not used.
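The rate-limited, paged extraction described above can be sketched as follows. This is an illustrative sketch only: `fetch_page` is a stub standing in for a real social-media API client, and the rate-limit numbers are assumed placeholders, not values from any actual API.

```python
import time

# Hypothetical stand-in for a real social-media API client; a real
# implementation would call the site's REST API and handle authentication.
def fetch_page(page):
    data = [f"post-{page}-{i}" for i in range(3)]
    return data if page < 4 else []  # an empty page signals the end

def extract_all(rate_limit_per_window=100, window_seconds=900):
    """Page through the API, respecting a volume-based rate limit."""
    posts, calls, page = [], 0, 0
    while True:
        if calls >= rate_limit_per_window:
            time.sleep(window_seconds)  # wait out the rate-limit window
            calls = 0
        batch = fetch_page(page)
        calls += 1
        if not batch:
            break
        posts.extend(batch)  # in the pipeline these would be written to HDFS
        page += 1
    return posts

print(len(extract_all()))  # 12 stub posts collected
```

In a production pipeline the collected batches would be appended to HDFS files rather than held in memory.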

Data Preprocessing and Integration

Data extracted from social media arrives in large volumes and in unstructured JavaScript Object Notation (JSON) form, which contains noisy and irregular information. Data mining modules such as stop word removal, stemming, tokenization, and normalization are applied during preprocessing to retain only the significant text. MapReduce methods are also used for data preprocessing. The data transformation is shown in Fig. 7.2.
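The MapReduce preprocessing step can be illustrated with a small in-process simulation. The sample JSON posts below are invented for illustration; a real job would read records from HDFS and run the map and reduce phases across a cluster.

```python
import json
import re
from itertools import groupby

# Hypothetical raw JSON posts; real input would be streamed from HDFS.
raw = [json.dumps({"text": "Sugar levels HIGH today!!"}),
       json.dumps({"text": "sugar free diet, feeling good"})]

def mapper(record):
    """Map phase: normalize one JSON post and emit (word, 1) pairs."""
    text = json.loads(record)["text"].lower()
    for word in re.findall(r"[a-z]+", text):
        yield (word, 1)

def reducer(pairs):
    """Reduce phase: sum the counts for each word."""
    pairs = sorted(pairs)  # stands in for the shuffle/sort step
    return {w: sum(c for _, c in grp)
            for w, grp in groupby(pairs, key=lambda p: p[0])}

counts = reducer(p for rec in raw for p in mapper(rec))
print(counts["sugar"])  # 2 — the word appears in both posts
```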

Data Tokenization

Data tokenization is a way of splitting whole documents into sets of words delimited by blanks, commas, and other punctuation. The input data is segmented into words, phrases, meaningful components, and symbols, which are then provided as input for subsequent tasks. Data tokenization can operate in two modes: Mode and Characters. The Mode setting tokenizes data at selected break points, while the Characters setting segments the data at a specified character.
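The two tokenization modes can be sketched as follows; the break-point pattern and the `#` separator are assumptions chosen for illustration.

```python
import re

def tokenize_mode(text):
    """'Mode' tokenization: split at blanks, commas, and other punctuation."""
    return [t for t in re.split(r"[\s,.;!?]+", text) if t]

def tokenize_characters(text, sep="#"):
    """'Characters' tokenization: split at one specified character."""
    return [t for t in text.split(sep) if t]

print(tokenize_mode("High sugar, low insulin!"))  # ['High', 'sugar', 'low', 'insulin']
print(tokenize_characters("diet#sugar#risk"))     # ['diet', 'sugar', 'risk']
```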

Generating and Removing Stop Words

Stop word generation is performed by applying a contextual semantics approach, which extracts the relevant semantics from the co-occurrences of tokenized terms. Here, a circle is constructed for each term from its co-occurrence statistics, and simple trigonometry is used to determine the contextual sentiment.
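The trigonometric construction can be sketched as below: each context term is placed on a circle whose radius encodes its degree of correlation with the target term and whose angle encodes its prior sentiment. The function names and the toy numbers are illustrative assumptions, not data from this study.

```python
import math

def senti_point(correlation, prior_sentiment):
    """Place one context term on the circle: radius = correlation strength,
    angle = prior sentiment scaled to [-pi, pi]."""
    theta = prior_sentiment * math.pi
    return (correlation * math.cos(theta), correlation * math.sin(theta))

# Context terms co-occurring with a target word: (correlation, prior sentiment).
contexts = [(0.9, 0.5), (0.4, -0.25), (0.7, 0.1)]
points = [senti_point(r, s) for r, s in contexts]

# The SentiMedian is the mean point of all context terms on the circle.
senti_median = (sum(x for x, _ in points) / len(points),
                sum(y for _, y in points) / len(points))
print(senti_median)
```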

Detecting Stop Words with SentiCircles

Stop words in sentiment analysis (SA) are defined as terms with weak sentiment, and the contextual semantics process identifies them. The method relies on the SentiMedian, which falls in a small region


FIGURE 7.2 Data integration model.

close to the origin of the SentiCircle. A point on the circle has weak sentiment when its angle is near 0, and low correlation when its radius is near 0. This construction is used to build the SentiCircle and to compute the term's overall contextual semantics. The position of the SentiMedian determines whether a term falls in the stop-word region; the SentiCircle thus estimates the overall semantics and sentiment through its SentiMedian. A boundary is applied to delimit the stop-word region of the SentiCircle, and terms falling inside it are subsequently discarded.
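The boundary test above can be sketched in a few lines: a term whose SentiMedian lies inside a small circle around the origin carries weak sentiment and is treated as a stop word. The radius threshold 0.1 is an assumed tuning parameter, not a value given in the text.

```python
import math

def is_stop_word(senti_median, boundary=0.1):
    """A SentiMedian near the origin means weak sentiment => stop word."""
    x, y = senti_median
    return math.hypot(x, y) < boundary

print(is_stop_word((0.02, -0.03)))  # True: weak-sentiment term, discarded
print(is_stop_word((0.40, 0.25)))   # False: sentiment-bearing term, kept
```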

Stemming and Lemmatization

These modules are normalization steps applied in natural language processing (NLP) to reduce words to a normalized form. Stemming is a crude heuristic estimate: it strips word endings but does not guarantee a valid replacement word, whereas lemmatization returns the dictionary form (lemma) of a word. The rules generated by the stemming process are used to determine candidate stems.
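A minimal rule-based stemmer in the spirit of the suffix-stripping rules described above can be sketched as follows; the rule list here is a toy assumption, not a complete stemming rule set such as Porter's.

```python
# Each rule maps a suffix to its replacement; rules are tried in order.
RULES = [("ization", "ize"), ("ational", "ate"), ("ness", ""),
         ("ing", ""), ("s", "")]

def stem(word):
    """Return a candidate stem by applying the first matching suffix rule."""
    for suffix, replacement in RULES:
        # Require a minimum stem length so short words are left intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)] + replacement
    return word

print(stem("normalization"))  # 'normalize'
print(stem("testing"))        # 'test'
print(stem("sugars"))         # 'sugar'
```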

Corpus Generation

A corpus is a massive, structured collection of text used for performing statistical analysis and hypothesis testing. Here, a corpus is used to train the classifier to identify positive and negative sentiments. Two types of corpus, an emoticon corpus and a user-defined corpus, are used to classify and identify the emotions of the general public. First, the emoticon corpus is applied to classify data based on the sentiment words and icons used in sentences; it assigns linguistic scores to emoticons such as happy faces. The second corpus is a semi-automated corpus generated by domain experts. In addition, uncommon words are removed using the Term Frequency-Inverse Document Frequency (TF-IDF) technique.
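The TF-IDF weighting used for filtering can be sketched from its definition; the three toy documents below are invented for illustration.

```python
import math

# Toy tokenized documents standing in for the social-media corpus.
docs = [["sugar", "level", "high"],
        ["sugar", "diet", "plan"],
        ["insulin", "sugar", "level"]]

def tfidf(term, doc, docs):
    """TF-IDF = (term frequency in doc) * log(N / document frequency)."""
    tf = doc.count(term) / len(doc)
    df = sum(term in d for d in docs)
    return tf * math.log(len(docs) / df)

scores = {t: round(tfidf(t, docs[0], docs), 3) for t in docs[0]}
# A term appearing in every document (like "sugar") gets IDF = 0 and can
# be dropped; very rare terms can be filtered by a document-frequency floor.
print(scores["sugar"])  # 0.0 — appears everywhere, carries no weight
```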


When the tag model is applied, the tokenized data is tagged using part-of-speech (POS) tags, food tags, and external events to analyze the risk factors of diabetes. At the first level, the text data is tagged using a diabetic corpus-based n-gram model over varying n-gram ranges. POS tagging is defined as assigning a part-of-speech tag to each word in an input text; conventionally, the tag is attached to the word with a slash. Because of the massive volume of data to be POS-tagged, recent language techniques based on the English Penn Treebank tag set are applied to tag the social networking data, and noisy, unrelated data is grouped and removed during preprocessing.
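The slash notation and Penn Treebank tags can be illustrated with a tiny lexicon-based tagger. This is a sketch under strong assumptions: the lexicon is hand-made for the example, not a trained model, and unknown words simply default to NN.

```python
# Tiny illustrative lexicon mapping words to Penn Treebank POS tags.
LEXICON = {"patients": "NNS", "eat": "VBP", "sweet": "JJ",
           "food": "NN", "daily": "RB"}

def pos_tag(tokens):
    """Attach a POS tag to each token in word/TAG slash notation.
    Unknown words default to NN, a common baseline choice."""
    return [f"{t}/{LEXICON.get(t, 'NN')}" for t in tokens]

print(" ".join(pos_tag("patients eat sweet food daily".split())))
# patients/NNS eat/VBP sweet/JJ food/NN daily/RB
```

A real pipeline would replace the lexicon lookup with a statistical tagger trained on the Penn Treebank.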
