HIGH-VARIETY DATA ANALYSIS
Blackberry faced a serious outage when its email servers were down for more than a day. I tried powering my Blackberry off and on because I was not sure whether it was my device or the CSP. It never occurred to me that the outage could be at the Blackberry server itself. When I called the CSP, they were not aware of the problem. So I turned to one obvious source: Twitter. Sure enough, I found information about the Blackberry outage on Twitter. One of my clients told me that his vice president of customer service is always glued to Twitter looking for customer service problems. Often, someone discovers the problem on Twitter before the internal monitoring organization does. We found that a large number of junior staffers employed by marketing, customer service, and public relations search through social media for relevant information.
Traditional analytics has been focused primarily on structured data. Big data, however, is primarily unstructured, so we now have several options available. We can perform quantitative analysis on structured data as before. We can extract structure out of unstructured data and perform quantitative analysis on the extracted quantities. Last, but not least, there is a fair amount of nonquantitative analysis now available for unstructured data. I would like to explore a couple of techniques that are rapidly becoming popular with the vast amount of unstructured data and look at how these techniques are becoming mainstream with their powerful capabilities for organizing, categorizing, and analyzing big data.
Google and Yahoo rapidly became household names because of their ability to search the Web for specific topics. A typical search engine offers the ability to search documents using a set of search terms and may find a large number of candidate documents. It prioritizes the results based on preset criteria that can be influenced by how we choose the documents. If I have a large quantity of unstructured data, I can count words to find the most commonly used words. Wordle™ (www.wordle.net) supplies word clouds for the unstructured data provided to it. For example, figure 6.2 shows a word cloud for the text used in this book. The font size represents the number of times a word was used in the text.
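The word counting behind a word cloud can be sketched in a few lines of Python. This is only a minimal illustration, not how Wordle itself works: the stop-word list, the function name word_frequencies, and the sample sentence are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical stop-word list for illustration; a real one would be far longer.
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "is", "in", "for", "on"}


def word_frequencies(text, top_n=5):
    """Return the top_n most common words in text, ignoring case and stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return counts.most_common(top_n)


# Made-up sample text, not taken from the book.
sample = "Big data is big. Unstructured data is most of the data we see."
print(word_frequencies(sample, top_n=2))
```

In a word cloud, each count returned here would drive the font size of the corresponding word.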
Figure 6.2 A Wordle diagram of the text used in this book
This data can be laid out against other known dimensions. For example, IBM was working on unstructured data analytics in the Indian market. A fairly large number of customer comments were available publicly. The IBM analysts used text analytics to study the key words being used as plotted against time. Figure 6.3 shows the results of this word count plotted against time.10
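Laying keyword counts out against time can be sketched as a grouping step before plotting. The comments, dates, keywords, and the function name keyword_counts_by_month below are all hypothetical; they are not the data or the method used in the IBM study.

```python
from collections import Counter, defaultdict

# Hypothetical comments as (ISO date, text) pairs; not data from the study.
comments = [
    ("2011-01-15", "network is slow and recharge failed"),
    ("2011-01-20", "recharge did not work"),
    ("2011-02-03", "network coverage is poor"),
]

# Hypothetical keywords to track over time.
KEYWORDS = {"network", "recharge"}


def keyword_counts_by_month(dated_comments, keywords):
    """Map 'YYYY-MM' to a Counter of keyword occurrences in that month."""
    by_month = defaultdict(Counter)
    for date, text in dated_comments:
        month = date[:7]  # keep only the 'YYYY-MM' part of the date
        for word in text.lower().split():
            if word in keywords:
                by_month[month][word] += 1
    return dict(by_month)


series = keyword_counts_by_month(comments, KEYWORDS)
for month in sorted(series):
    print(month, dict(series[month]))
```

Each month's counts then become one point per keyword on a time plot.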
Once we start to categorize and count unstructured text, we can begin to extract information that can be used for qualitative analytics. Qualitative analytics can work with the available data and perform operations based on the characteristics of the data.
If we can classify the data into a set of hierarchies, we can determine whether a particular data item belongs to a set or not. This would be considered a nominal analysis. If we have an established hierarchy, we can deduce the set membership for higher levels of the hierarchy. In ordinal analysis, we can compare two data items. We can deduce whether one data item is better, higher, or smaller than another based on the comparative algebra available to ordinal analytics. Sentiment analysis is one such comparison. For example, let us consider a statement we analyzed from a customer complaint.
“Before 12 days, I was recharged my Data Card with XXX Plan. But I am still not able to connect via internet. I have made twise complain. But all was in vain. The contact number on Contact Us page is wrong, no one is picking up. I have made call to customer care but every guy telling me.. .”
Figure 6.3 Time plot of customer blog keywords in Indian market
As humans, it is obvious to us that the sentiment of this statement is negative. However, big data requires sentiment analysis on terabytes of data, which means we need to assign a positive or negative sentiment using a computer program. Phrases such as “I am not able,” “complain,” “in vain,” and “no one is picking up” are examples of negative sentiment. A sentiment lexicon can be used as a library to compare words against known “positive” or “negative” sentiment. A count of the number of negative sentiments is a qualitative analysis that can be performed on sentiment data, as we can differentiate between positive and negative sentiment and conclude that positive sentiment is better than negative sentiment. We can also create qualifiers such as “strong” sentiment and “weak” sentiment and compare the two sets of comments.
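A lexicon lookup of this kind can be sketched as follows. The negative phrases are the ones quoted above; the positive phrases, the function names, and the simple majority rule are assumptions made for illustration. A production lexicon would be far larger and would handle negation and word boundaries more carefully than this substring match does.

```python
# Negative phrases taken from the complaint discussed in the text.
NEGATIVE = ["not able", "complain", "in vain", "no one is picking up"]
# Hypothetical positive phrases, included only to make the comparison possible.
POSITIVE = ["thank", "great", "resolved"]


def count_phrases(text, phrases):
    """Count occurrences of each lexicon phrase using plain substring matching."""
    lowered = text.lower()
    return sum(lowered.count(phrase) for phrase in phrases)


def sentiment(text):
    """Return (label, positive_count, negative_count) for a piece of text."""
    neg = count_phrases(text, NEGATIVE)
    pos = count_phrases(text, POSITIVE)
    if neg > pos:
        label = "negative"
    elif pos > neg:
        label = "positive"
    else:
        label = "neutral"
    return label, pos, neg


complaint = (
    "But I am still not able to connect via internet. "
    "I have made twise complain. But all was in vain. "
    "The contact number on Contact Us page is wrong, no one is picking up."
)
print(sentiment(complaint))
```

Because the rule only compares counts, it assigns a label without any notion of how strong the sentiment is; qualifiers such as "strong" and "weak" would need additional lexicon entries or weights.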
In typical interval-scaled data, we can assign relative values to data, but may not have a point of origin. As a result, we can compute differences and deduce that the difference between one pair of data items is larger than the difference between another pair. For example, a strong positive sentiment may be better than a weak positive sentiment. However, these two data items are more similar to each other than a strong positive sentiment is to a strong negative sentiment.
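The comparison of differences can be illustrated with a small numeric scale. The scores below are purely hypothetical values chosen for the example; the text does not prescribe any particular numbers, and on an interval scale only the differences between them carry meaning.

```python
# Hypothetical interval-scale scores; only the differences are meaningful.
SCORES = {
    "strong negative": -2,
    "weak negative": -1,
    "weak positive": 1,
    "strong positive": 2,
}


def distance(a, b):
    """Absolute difference between the scores of two sentiment categories."""
    return abs(SCORES[a] - SCORES[b])


near = distance("strong positive", "weak positive")
far = distance("strong positive", "strong negative")

# The strong/weak positive pair is closer together than the
# strong positive/strong negative pair.
print(near, far, near < far)
```

Shifting every score by the same constant would leave both distances unchanged, which is what it means for the scale to have no fixed point of origin.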