Data Analysis Stage

This section briefly discusses the processes involved in the data analysis stage.

MPHDT-T-Based Opinion Mining

Sentiment analysis (SA) is used to predict people's attitudes, emotions, and opinions, and the polarity of a text. It is carried out at three levels: document, sentence, and aspect. To compute users' opinions from social network data, the data are processed by applying MPHDT-T, which calculates the polarity of a statement, while TFIDF is applied to analyze the occurrence of words with respect to polarity.

Decision Tree (DT)

This section discusses the operation of the DT-based classification process and its subprocesses in detail.

Single Decision Tree (SDT)

The DT classification model employs a tree-like graph structure. The feature space is split sequentially into distinct regions, each corresponding to a class. Given a feature vector, the classifier assigns it to a region through a sequence of decisions made at the nodes along a path of the tree. For a feature vector $X \in \mathbb{R}^n$, the DT is developed in consecutive phases.

A set of binary questions is posed, of the form $X \in A$, with $A \subset X$, for categorical queries, or $X > C_j$, where $C_j$ is the applicable threshold value. For every feature, each feasible value of the threshold $C_j$ determines a particular division of the subset $X$.

Splitting Criterion

Every binary split of a node creates two descendant nodes. The criterion for splitting a node $t$ depends on the node impurity function $I(t)$. A variety of node impurity measures can be defined, as represented in Eq. (7.1):

$$I(t) = \phi\big(P(\omega_1 \mid t), P(\omega_2 \mid t), \ldots, P(\omega_M \mid t)\big) \tag{7.1}$$

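where $\phi$ is an arbitrary function and $P(\omega_i \mid t)$ denotes the probability that a vector in the subset $X_t$ associated with node $t$ belongs to the class $\omega_i$, $i = 1, 2, \ldots, M$. A common choice for $\phi$ is the entropy function from Shannon's information theory, as represented in Eq. (7.2):

$$I(t) = -\sum_{i=1}^{M} P(\omega_i \mid t) \log_2 P(\omega_i \mid t) \tag{7.2}$$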

where $\log_2$ is the base-2 logarithm and $M$ is the total number of classes. The decrease in node impurity is defined as represented in Eq. (7.3):

$$\Delta I(t) = I(t) - a_R I(t_R) - a_L I(t_L) \tag{7.3}$$

with $a_R$ and $a_L$ the proportions of the instances in node $t$ assigned to the right node $t_R$ and the left node $t_L$, respectively.

Stop-Splitting Rule

The simplest stop-splitting rule is to stop dividing a node when the highest value of $\Delta I(t)$, over all feasible divisions, is less than a threshold $T$. An alternative is to stop dividing either when the cardinality of the subset $X_t$ is sufficiently small or when $X_t$ is pure, in the sense that all of its points belong to a single class. A critical factor in designing the DT is its size: the tree must be large enough, but not so large that it learns the particular details of the training set and generalizes poorly. Studies have shown that using a threshold on the impurity decrease as the stop-splitting rule does not yield the optimal tree size; it often stops tree growth either too early or too late. The most commonly used approach is to grow the tree to a large size first and then prune its nodes according to a pruning criterion. Tree size is especially important in the current study because it addresses a two-class problem: a tree that is too large or too small represents the feature vectors inaccurately.

Class Assignment Rule

Once a node is declared terminal, it becomes a leaf, and its class label $\omega_j$ is assigned using the majority rule:

$$j = \arg\max_{i} P(\omega_i \mid t)$$

In other words, a leaf $t$ is assigned to the class to which the majority of the vectors in $X_t$ belong.
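The splitting, stop-splitting, and class assignment rules described in this section can be illustrated with a minimal Python sketch. It is not the implementation used in the study: the function names, the single-feature threshold search, and the default values of `threshold_T` and `min_size` are assumptions made only for illustration.

```python
import math
from collections import Counter


def entropy_impurity(labels):
    """Entropy impurity I(t) of the class labels at a node (cf. Eq. 7.2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def impurity_decrease(labels, left, right):
    """Decrease in impurity when a node is split into left/right (cf. Eq. 7.3)."""
    n = len(labels)
    a_l, a_r = len(left) / n, len(right) / n
    return (entropy_impurity(labels)
            - a_r * entropy_impurity(right)
            - a_l * entropy_impurity(left))


def best_threshold_split(values, labels):
    """Try each candidate threshold c for the question `x > c` on one feature."""
    best_gain, best_c = 0.0, None
    # The largest value is skipped: it would leave the right node empty.
    for c in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= c]
        right = [y for x, y in zip(values, labels) if x > c]
        gain = impurity_decrease(labels, left, right)
        if gain > best_gain:
            best_gain, best_c = gain, c
    return best_gain, best_c


def should_stop(labels, best_gain, threshold_T=0.01, min_size=2):
    """Stop if the best gain is below T, the node is small, or the node is pure.

    threshold_T and min_size are hypothetical defaults, not values from the study.
    """
    return (best_gain < threshold_T
            or len(labels) <= min_size
            or len(set(labels)) == 1)


def leaf_label(labels):
    """Majority rule: the class to which most vectors at the leaf belong."""
    return Counter(labels).most_common(1)[0][0]


# Toy two-class example (made-up data).
values = [1, 2, 3, 4]
labels = ["neg", "neg", "pos", "pos"]
gain, cut = best_threshold_split(values, labels)
```

On this toy data the search selects the threshold that separates the two classes completely, so both descendant nodes are pure and `should_stop` would terminate them on the next pass. A full tree builder would apply the same steps recursively over all features and then prune.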

Term Frequency-Inverse Document Frequency (TFIDF)

TFIDF is a popular statistical method used for classifying text data and indexing documents. It is based on how frequently a word is used in a sentence. Here, TFIDF is applied to analyze the terms linked to specific diseases in the generated corpus. The TFIDF is determined with the help of Eq. (7.5):

$$\mathrm{tfidf}(t_i, d) = \mathrm{tf}(t_i, d) \times \mathrm{idf}(t_i) \tag{7.5}$$

where $t_i$ denotes the $i$th word; $\mathrm{tfidf}(t_i, d)$ is the TFIDF of word $t_i$ in sentence $d$; $\mathrm{tf}(t_i, d)$ is the TF of word $t_i$ in the sentence; and $\mathrm{idf}(t_i)$ is the IDF. The normalized frequency $\mathrm{tf}(t_i, d)$ of word $t_i$ in sentence $d$ is estimated by Eq. (7.6).

IDF measures how rare a word is across the entire sentence collection. (When a word occurs in every sentence of the collection, its IDF is 0.)

$$\mathrm{idf}(t_i) = \log \frac{N}{|\{d \in D : t_i \in d\}|}$$

where $N$ is the total number of sentences in the corpus $D$, and $|\{d \in D : t_i \in d\}|$ is the number of sentences in which the term $t_i$ is present. The TFIDF computation is implemented in Python and run as a MapReduce task using the Hadoop Streaming utility.
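As a rough illustration of the TFIDF computation, the following Python sketch scores every word of a small tokenized corpus in memory. It is not the Hadoop Streaming job itself. It assumes that TF is the word count divided by the sentence length and that IDF uses the natural logarithm of $N$ over the number of sentences containing the word; the function name `tf_idf` and the sample sentences are hypothetical.

```python
import math
from collections import Counter


def tf_idf(sentences):
    """Return one {term: tfidf} dict per tokenized sentence.

    Assumed forms: tf = count / sentence length, idf = ln(N / df).
    """
    n_sentences = len(sentences)
    # Document frequency: number of sentences that contain each term.
    df = Counter(term for s in sentences for term in set(s))
    scores = []
    for s in sentences:
        counts = Counter(s)
        total = len(s)
        scores.append({
            term: (count / total) * math.log(n_sentences / df[term])
            for term, count in counts.items()
        })
    return scores


# Made-up sentences, used only to exercise the function.
corpus = [
    "fever and cough".split(),
    "fever and headache".split(),
    "cough and fatigue".split(),
]
scores = tf_idf(corpus)
```

The word "and" occurs in every sentence, so its IDF, and therefore its TFIDF, is 0, while a word found in only one sentence, such as "headache", receives the highest score. In a MapReduce setting the same counts would typically be split between a mapper that emits per-sentence word counts and a reducer that aggregates the sentence frequencies.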
