Sentiment analysis and opinion mining

Section 2.3 focused on methods for extracting key words, phrases, and topics from text data for the purpose of identifying ideas for new products or improvements to existing ones. Another form of text analysis goes a step further and uncovers people’s sentiments or opinions about a concept, which is a product in our case. A sentiment is a negative, neutral, or positive view. In many instances, just a polarity - negative or positive - is used. Examples are:

  • 2. “I can’t live without it!”
  • • This is a positive sentiment.
  • 3. “I have mixed feelings about this product.”
  • • This is a neutral sentiment.
  • 4. “I really regret buying this!!”
  • • This is a negative sentiment.
  • 5. “Don’t waste your money on this.”
  • • This is a negative sentiment.

Some sentiment analysis methodologies also consider emotions rather than sentiments, but the concept is the same. See Sarkar [2016, p. 342] for some discussion. Also see Zhang and Liu [2017] for a more detailed discussion of sentiment analysis.

Market researchers have, of course, always captured consumers’ sentiments about a product through customer satisfaction surveys. They typically asked three questions:

  • 1. How satisfied are you with the performance of this product?
  • 2. How likely are you to recommend this product to someone else?
  • 3. How likely are you to switch to another product within the next six months?

These questions are obviously looking for sentiments or opinions. A numeric scale for them is usually a Likert five-point scale which the market researchers transform to top-two box, middle box, and bottom-two box.13 These are positive, neutral, and negative sentiments, respectively.

One form of analysis of these three questions is based on a Venn Diagram such as the one in Figure 7.25. The scores for the three satisfaction questions are converted to top-two box ratings (T2B), which are 0/1 values, and then simple proportions are calculated as the average of these 0/1 values for each of the eight possible intersections of the questions corresponding to the three Venn circles. A tabular report based on these proportions, similar to the one shown in Table 7.9, is also possible.
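The top-two box calculation described above can be sketched in a few lines of Python. This is only an illustration: the responses are made-up, the names `top_two_box`, `region_shares`, and `marginals` are hypothetical, and the sketch assumes the switching question has already been recoded so that a high score means the customer is likely to stay.

```python
from collections import Counter

# Hypothetical five-point Likert responses (5 = most favorable); each tuple is
# one respondent's answers to (satisfied, recommend, stay).
responses = [
    (5, 4, 4),
    (4, 5, 2),
    (2, 4, 1),
    (1, 2, 3),
    (5, 5, 5),
    (3, 3, 4),
    (4, 2, 5),
    (2, 1, 1),
]


def top_two_box(score, scale_max=5):
    """Return 1 if the score is in the top two boxes of the scale, else 0."""
    return 1 if score >= scale_max - 1 else 0


# Convert every answer to a 0/1 T2B indicator.
t2b = [tuple(top_two_box(s) for s in row) for row in responses]

# Each respondent falls in exactly one of the 2**3 = 8 Venn regions,
# identified by the (satisfied, recommend, stay) indicator pattern.
region_counts = Counter(t2b)
n = len(t2b)
region_shares = {region: count / n for region, count in region_counts.items()}

# The marginal T2B proportion for a question is the mean of its 0/1 values.
labels = ("satisfied", "recommend", "stay")
marginals = {label: sum(row[i] for row in t2b) / n for i, label in enumerate(labels)}

print(region_shares)
print(marginals)
```

The pattern `(1, 1, 1)` corresponds to the intersection of all three circles and `(0, 0, 0)` to the area outside all of them, so the region shares always sum to one.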

The difference in this analysis from the focus of this section is that text data are used in our current discussion, not a numeric scale.

Sentiment methodology overview

Sentiment analysis is based on a lexicon, a special dictionary that classifies words and phrases as negative/neutral/positive sentiments. There is a wide array of lexicons. See Sarkar [2016] for a discussion of some of them.

FIGURE 7.25 This Venn diagram shows one possible summarization of the three customer satisfaction questions in a typical customer satisfaction study. In terms of sentiment analysis, the intersection of the three circles shows a positive sentiment; the customers in this intersection are loyal to the product. Those outside all three circles have a strong negative sentiment. Customers at other intersections have varying degrees of neutral sentiment.

FIGURE 7.26 This is the sequence of steps for feature opinion analysis.

Sentiment analysis is done by tokenizing the words of a document (after preprocessing the text to delete stop-words, correct spelling, remove punctuation, change case, and so forth) and then passing the tokens through a lexicon. The document as a whole is classified for its sentiment following some procedure. See Sarkar [2016] for Python code for accomplishing this. Values representing the sentiment of each document are returned and used in statistical analysis. For example, simple frequency counts could be calculated. If the proportion unfavorable is larger than the proportion favorable, then the product has a problem.
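As a rough illustration of the lexicon approach, the sketch below scores a handful of reviews against a toy lexicon and then tabulates frequency counts. The lexicon, the stop-word list, and the function names are all hypothetical; a real lexicon is far larger, and this sketch makes no attempt to handle negation or spelling correction.

```python
import re
from collections import Counter

# Hypothetical toy lexicon: +1 for positive words, -1 for negative words.
LEXICON = {"love": 1, "great": 1, "excellent": 1,
           "regret": -1, "waste": -1, "broken": -1}
STOP_WORDS = {"i", "this", "the", "a", "is", "it", "on", "your", "my"}


def tokenize(text):
    """Lowercase the text, strip punctuation, and drop stop-words."""
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def classify(text):
    """Label a document by the sign of its summed lexicon scores."""
    score = sum(LEXICON.get(token, 0) for token in tokenize(text))
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


reviews = [
    "I love this, it is great!",
    "I really regret buying this!!",
    "Don't waste your money on this.",
    "I have mixed feelings about this product.",
]

labels = [classify(review) for review in reviews]
counts = Counter(labels)

# Flag a problem when unfavorable documents outnumber favorable ones.
has_problem = counts["negative"] > counts["positive"]
print(counts, has_problem)
```

Summing word scores and taking the sign is only one of many possible document-level procedures; it is used here because it is the simplest to read.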

A problem with sentiment analysis as usually applied is that it does not explicitly say what is wrong with the product. It does not say which aspect - feature or attribute - is an issue, only that something is wrong. In order to pinpoint a specific problem, one that can be addressed by the R&D staff, you still have to read through the texts. Several researchers have proposed a methodology for extracting the features or attributes referenced in product reviews. The opinions about these features are then classified. Quantitative measures are created that can be used in statistical and machine learning applications to further understand opinions of product features, with the opinions now pinpointed to specific features. See de Albornoz et al. [2011], Morales et al. [2008], and Plaza et al. [2010] for discussions.

TABLE 7.9 This is a tabular form of the Venn diagram in Figure 7.25. The counts (corresponding percentages) should add to the total sample size (100%). For this example, n = 1000. There are 683 in the T2B for Satisfaction; 877 for Recommend; and 548 for Stay. All others were in the bottom boxes with 66 in the bottom box for all three questions. These 66 are the strongly negative sentiment. This example is based on fictitious data.




Strongly Positive
Somewhat Positive
  • Satisfied & Recommend
  • Satisfied & Stay
  • Recommend & Stay
Somewhat Negative
  • Satisfied Only
  • Recommend Only
  • Stay Only
Strongly Negative



There are several steps in the process, which are illustrated in Figure 7.26. The first step, of course, is to compile a corpus of product reviews from internal logs, social media, or product review websites as I discussed in Chapter 2. Each review is a document in the corpus. Each document is preprocessed and then sentence-tokenized. The words in each sentence are then tokenized. The reason for this second tokenization is to allow you to refer back to the original sentence from which the word came. Each tokenized word is assigned a “sense or meaning” using WordNet. See Miller [1995] and Fellbaum [1998] for background on WordNet.
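The two-level tokenization can be sketched as follows. The regular expressions are a simple stand-in for a proper tokenizer, and the function names and sample reviews are hypothetical; the point is only that each word keeps the index of the document and sentence it came from.

```python
import re


def sentence_tokenize(document):
    """Split a review into sentences at ., !, or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", document.strip())
    return [part for part in parts if part]


def word_tokenize(sentence):
    """Lowercase a sentence and return its alphabetic tokens."""
    return re.findall(r"[a-z]+", sentence.lower())


def tokenize_corpus(corpus):
    """Return (doc_id, sentence_id, word) triples so that each word can be
    traced back to the sentence it came from."""
    records = []
    for doc_id, document in enumerate(corpus):
        for sent_id, sentence in enumerate(sentence_tokenize(document)):
            for word in word_tokenize(sentence):
                records.append((doc_id, sent_id, word))
    return records


corpus = [
    "The blinds are easy to install. The cord broke after a week!",
    "The robot vacuum misses corners.",
]
records = tokenize_corpus(corpus)
print(records[:8])
```

Keeping the sentence index alongside each word is what later allows a sentence to be associated with a product feature and scored for sentiment.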

WordNet is a lexical database that classifies words into a hierarchy of more abstract words and concepts. It maps words to their synonyms thus allowing a grouping of words. In this regard, some have equated WordNet to a combination dictionary and thesaurus. The important use for this application is the hierarchy of concepts that is formed. This hierarchy can be visualized as an inverted tree with the root, representing the most general abstract concept, at the top of a graph with branches flowing down from the root. Each branch can be split into new branches, the split points, called nodes, being a next lower (and less abstract) concept or word. This split continues until the original word is reached. At this point, the branches terminate at a final node. There will be as many final branches and terminal nodes as there are words originally submitted to the WordNet program. In our case, the tokenized words from a single sentence are used so there are as many terminal nodes as there are words in that sentence.

A hierarchy of general concepts contained in WordNet is shown in Figure 7.27. The root, an “entity”, is at the top and an “organism” is at the bottom.

FIGURE 7.27 This illustrates the lexical hierarchy in WordNet. Notice that the highest element in the hierarchy is an Entity, which could be a physical entity or an abstract entity.

Using WordNet, the tokens or words can be classified as nouns, verbs, adjectives, and adverbs. de Albornoz et al. [2011] note that only nouns, but not proper nouns, are useful for product features. These nouns are processed through the WordNet database and WordNet concepts are returned. The concepts are extended with their hypernyms. Hypernyms are words that have a more general, broader meaning than the word under consideration. For example, canine is a hypernym for dog since a canine could be a wolf, jackal, fox, or coyote as well as a domesticated dog.14 There are also hyponyms which are the opposite of hypernyms; they have a more specific meaning. Dog is a hyponym for canine.

If your focus is on hypernyms, then the flow of a tree is upward to more abstract concepts. If your focus is on hyponyms, then the flow is downward to more specific concepts. In Figure 7.27 and Figure 7.28, I chose to move up the trees to more abstract concepts.

For “blinds”, the hypernym is “protective covering.” For a robotic vacuum cleaner, the hypernym for “robot” is “automaton”, a more general concept involving any mechanism, including a person, that is self-operating.15 At an even more general level, a blind and an automaton are each a physical “entity”, something that exists, whether it be living or not, such as an abstract idea or an abstract entity.16 A graph for the words “dog”, “blind”, “robot”, and “vacuum” is shown in Figure 7.28.
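The upward and downward movement through a hierarchy can be illustrated with a small lookup table. The table below is a hypothetical, heavily truncated stand-in for WordNet: it uses the word pairs mentioned in the text but omits the many intermediate concepts a real WordNet chain would contain.

```python
# Hypothetical child -> parent (hypernym) links; not actual WordNet output.
HYPERNYM = {
    "dog": "canine",
    "canine": "physical entity",
    "blind": "protective covering",
    "protective covering": "physical entity",
    "robot": "automaton",
    "automaton": "physical entity",
    "physical entity": "entity",
}


def hypernym_path(word):
    """Walk upward from a word to the root, returning the whole chain."""
    path = [word]
    while path[-1] in HYPERNYM:
        path.append(HYPERNYM[path[-1]])
    return path


def hyponyms(concept):
    """Return the direct hyponyms (more specific terms) of a concept."""
    return sorted(child for child, parent in HYPERNYM.items() if parent == concept)


print(hypernym_path("dog"))
print(hyponyms("canine"))
```

Moving along `hypernym_path` is the upward flow to more abstract concepts; calling `hyponyms` moves one step down to more specific ones.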

FIGURE 7.28 This illustrates the lexical hierarchy from WordNet for four words mentioned in this book.

Following de Albornoz et al. [2011], a graph can be created for each sentence in a product review and then all the graphs are merged. The edges of the graph are weighted, the weights being a function of how far a concept is from the root. Those concepts that are further from the root have more specificity and so are weighted more than those that are close to the root. These weights are used to calculate the salience or importance of each concept. The salience of a concept is “the sum of the weights of the edges that have as source or target the given vertex.”17 The weights are multiplied by the frequency of occurrence of the concept in the corpus. The concepts are sorted by their saliences and grouped by a clustering algorithm such as hierarchical or k-means clustering. Each cluster is a product feature. A word cloud can be used to emphasize the importance of each feature. These features can be used to specify and design a new product. In one sense, they come directly from the customers via their written comments, but in another sense they are derived from the comments since few customers can clearly and succinctly indicate what they want. Customers can only complain about their problems; they cannot articulate their needs. This approach holds promise for uncovering those needs.
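A minimal sketch of the salience calculation follows. The graph, the concept frequencies, and especially the edge-weight rule (the depth of the edge's child node) are assumptions made for illustration; they are not the weighting formula of de Albornoz et al. [2011], and the clustering step is omitted.

```python
from collections import defaultdict

# Hypothetical merged concept graph as (child, parent) hypernym edges.
EDGES = [
    ("cord", "blind"),
    ("slat", "blind"),
    ("blind", "covering"),
    ("brush", "vacuum"),
    ("vacuum", "appliance"),
    ("covering", "entity"),
    ("appliance", "entity"),
]

# Hypothetical counts of how often each concept appears in the corpus.
FREQUENCY = {"cord": 4, "slat": 1, "blind": 6, "brush": 2, "vacuum": 5}

PARENT = dict(EDGES)


def depth(node):
    """Number of edges between a node and the root."""
    steps = 0
    while node in PARENT:
        node = PARENT[node]
        steps += 1
    return steps


def salience_scores(edges, frequency):
    """Sum the weights of the edges touching each vertex, then scale by the
    concept's corpus frequency (0 if the concept never appears)."""
    incident = defaultdict(float)
    for child, parent in edges:
        weight = depth(child)  # assumed rule: edges further from the root weigh more
        incident[child] += weight
        incident[parent] += weight
    return {vertex: total * frequency.get(vertex, 0) for vertex, total in incident.items()}


scores = salience_scores(EDGES, FREQUENCY)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked[:3])
```

The ranked concepts would then be passed to a clustering routine, with each resulting cluster read as a candidate product feature.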

de Albornoz et al. [2011] outline several heuristics for associating each sentence with a product feature. This results in a mapping of each sentence to at least one product feature. Then all the sentences associated with a feature can be analyzed for the sentiment or opinion expressed for that feature.


Most general statistical software can handle data visualization. R and Python have excellent data visualization capabilities but producing a graph, even a simple one, requires some program coding. Stata has a good graphics library but program coding is a bit awkward. The same holds for SAS. JMP is particularly good at data visualization because of its dynamic linking of a data table and a graph.


I presented a detailed case study of a drill-down of issues associated with a new product post-launch. The problems were identified through a dashboard but then investigated for root causes. The purpose of this detailed case study was to emphasize that Deep Data Analysis (DDA), the key message of this book, is not solely for new product development, but is also applicable for post-launch examination of that product. Just because a new product is launched does not mean that you should stop studying the data generated from it. Once launched, the product will succeed or fail, and as you know from Chapter 1, most fail. The root-cause analysis will help you identify why it failed so that the next new product would have a better chance for success.


Demonstration of linearization using log transformation

Let $f(x)$ be some function we want to approximate at a point $x = a$. The Taylor Series Expansion (TSE) of $f(x)$ at $x = a$ is

$$f(x) = \sum_{i=0}^{\infty} \frac{f^{(i)}(a)}{i!} (x - a)^i$$

where $f^{(i)}(a)$ is the $i$th derivative of the function evaluated at $a$; $f^{(0)}(a)$ is the original function evaluated at $a$; and $i!$ is the factorial function with $0! = 1$. If you set $f(x) = \ln x$, then the TSE of the natural log is

$$\ln x = \ln a + \frac{x - a}{a} + R$$

where $R$ is a remainder term that is usually ignored. Let $x = y_t$ and $a = y_{t-1}$. Then $\ln y_t = \ln y_{t-1} + \frac{y_t - y_{t-1}}{y_{t-1}}$. The last term is the percent change in $y$. Denote this by $g$. Then

$$\ln y_t = \ln y_{t-1} + g \tag{7.A.1}$$

Notice that the first term on the right-hand side in (7.A.1) is like the term on the left, just one step back. So you can write it as

$$\ln y_{t-1} = \ln y_{t-2} + g \tag{7.A.2}$$

and substitute (7.A.2) in (7.A.1) to get

$$\ln y_t = \ln y_{t-2} + 2g \tag{7.A.3}$$

Repeat this backward substitution until you get

$$\ln y_t = \ln y_0 + g \times t \tag{7.A.4}$$

where $y_0$ is the first or starting point for the $y$ data series. Clearly, (7.A.4) is a linear equation with intercept $\ln y_0$ and slope $g$. So the natural log transformation linearized the original data.
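A quick numerical check of this result is sketched below, assuming a hypothetical series that grows at a constant rate of 3% per period from a starting value of 100. The logged series rises by the same amount every period, and that amount, $\ln(1 + g)$, is close to $g$, which is what the first-order Taylor approximation predicts.

```python
import math

y0, g, periods = 100.0, 0.03, 10  # hypothetical starting value and growth rate

# A series that grows at a constant percent rate g each period.
y = [y0 * (1 + g) ** t for t in range(periods + 1)]
log_y = [math.log(value) for value in y]

# First differences of the logged series are constant, so ln(y_t) is linear in t.
diffs = [log_y[t] - log_y[t - 1] for t in range(1, len(log_y))]

slope_exact = math.log(1 + g)  # exact slope of ln(y_t) against t
slope_approx = g               # slope implied by the first-order approximation

print(diffs[0], slope_exact, slope_approx)
```

The gap between `slope_exact` and `slope_approx` is the ignored remainder term; it shrinks as the growth rate gets smaller.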

Demonstration of variance stabilization using log transformation

Time series are often characterized by the property that their variance increases over time. Such a series is said to be nonstationary. See Wei [2006] for a discussion of nonstationarity. This property is not desirable from a statistical point of view because changing variances are difficult to work with; constant variances are more desirable. A time series that has a constant variance is called stationary. We require a function that, when applied to a nonstationary time series, converts it to a stationary one. That is, we require a function that converts the variance of a time series to a constant. This section closely follows the demonstration by Wei [2006] for finding the transformation function.

Assume a time series $Y_t$, $t = 1, 2, \ldots, T$. Then this series is nonstationary if we can write $V[Y_t] = c \times f(\mu_t)$ where $c$ is a constant and $f(\cdot)$ is a function of the data. For example, we could assume that $V[Y_t] = c \times \mu_t^2$. We want a function $T(\cdot)$ that transforms $Y_t$ so that $V[T(Y_t)] = c$. Following Wei [2006], we could use the Taylor Series Expansion (TSE) to determine the transformation function.

We want a transformation function $T(Y_t)$ evaluated at $Y_t = \mu_t$ such that

$$T(Y_t) \approx T(\mu_t) + T'(\mu_t)(Y_t - \mu_t)$$

The variance of $T(Y_t)$ is

$$V[T(Y_t)] \approx [T'(\mu_t)]^2 \times V[Y_t] = c \times [T'(\mu_t)]^2 \times f(\mu_t)$$

where $T'(\mu_t)$ is the first derivative of $T$ evaluated at $\mu_t$. We need

$$T'(\mu_t) = \frac{1}{\sqrt{f(\mu_t)}} \tag{7.A.8}$$

Integrating (7.A.8) and using $V[Y_t] = c \times \mu_t^2$, then $T(\mu_t) = \ln(\mu_t)$. So the natural log is the transformation that stabilizes the variance; i.e., makes it a constant. Wei [2006] shows that this approach can be extended to produce the Box-Cox Power Transformations which have the natural log as one option. This class of transformations is given by

$$T(Y_t) = \frac{Y_t^{\lambda} - 1}{\lambda}$$

Transformations for different values of $\lambda$ are shown in Table 7.10.

TABLE 7.10








Source: Wei [2006]
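The Box-Cox family can be written as a short function, sketched here with a hypothetical name `box_cox`. The natural log is treated as the limiting case when $\lambda = 0$, where the power formula itself is undefined.

```python
import math


def box_cox(y, lam):
    """Box-Cox power transformation of a positive value y."""
    if y <= 0:
        raise ValueError("y must be positive")
    if lam == 0:
        return math.log(y)  # limiting case as lam approaches 0
    return (y ** lam - 1) / lam


# Evaluate the transformation at y = 4 for several values of lambda.
for lam in (-1.0, -0.5, 0.0, 0.5, 1.0):
    print(lam, box_cox(4.0, lam))
```

With $\lambda = 1$ the series is only shifted by one unit, so its variance is unchanged, while a value of $\lambda$ very close to zero gives nearly the same result as the log.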

TABLE 7.11 Elasticity values and interpretations.

Elasticity Value          Interpretation
$\eta = \infty$           Perfectly Elastic
$1 < \eta < \infty$       Elastic
$\eta = 1$                Unit Elastic
$0 < \eta < 1$            Inelastic
$\eta = 0$                Perfectly Inelastic

Constant elasticity models

A useful concept is the elasticity. An elasticity is a unitless measure of the percent change in the dependent variable for a percent change in an independent variable:

$$\eta = \frac{dY / Y}{dX / X} = \frac{dY}{dX} \times \frac{X}{Y}$$

Since an elasticity is a ratio of percent changes, it is natural to interpret values of $\eta$ with respect to 1: the percent change in $Y$ equals the percent change in $X$. Terminology is shown in Table 7.11.

A price elasticity of demand is

$$\eta_P = \frac{dQ}{dP} \times \frac{P}{Q}$$

For an inherently linear model, $Y_i = e^{\beta_0} \times X_i^{\beta_1}$, linearization using natural logs yields $\ln Y_i = \beta_0 + \beta_1 \ln X_i$. This is a log-log model. Note that

$$\eta = \frac{d \ln Y_i}{d \ln X_i} = \frac{dY_i}{dX_i} \times \frac{X_i}{Y_i} = \beta_1$$

The elasticity is the parameter $\beta_1$ and is a constant. The curve for $Y$ is called an isoelastic curve. See Paczkowski [2018] for a discussion.
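The constant-elasticity property can be checked numerically. The sketch below generates noise-free data from a hypothetical isoelastic curve with $\beta_0 = 2$ and $\beta_1 = -1.5$, then fits a straight line to the logged values with a small hand-written least-squares routine; the helper name and the data are assumptions for illustration only.

```python
import math


def ols_slope_intercept(x, y):
    """Simple least-squares fit of y on x; returns (slope, intercept)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


beta0, beta1 = 2.0, -1.5  # hypothetical parameters of the isoelastic curve
X = [1.0, 2.0, 4.0, 8.0, 16.0]
Y = [math.exp(beta0) * x ** beta1 for x in X]

# Regress ln(Y) on ln(X); the slope is the elasticity.
slope, intercept = ols_slope_intercept(
    [math.log(x) for x in X],
    [math.log(y) for y in Y],
)
print(slope, intercept)
```

Because the data lie exactly on the curve, the fitted slope reproduces the elasticity and the intercept reproduces $\beta_0$, up to floating-point error.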

Total revenue elasticity

Total revenue is defined as $TR = P \times Q(P)$. Taking the first derivative with respect to price yields

$$\frac{dTR}{dP} = Q + P \times \frac{dQ}{dP}$$

To convert this to an elasticity, multiply the left-hand side by $P$ and divide by $TR$, which amounts to multiplying the left-hand side by $1/Q$ based on the definition of $TR$. The right-hand side also must be multiplied by $1/Q$, yielding $\eta_P^{TR} = 1 + \frac{P}{Q} \times \frac{dQ}{dP}$. The second term is the definition of a price elasticity so $\eta_P^{TR} = 1 + \eta_P$.
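The relationship between the total revenue elasticity and the price elasticity can be verified numerically. The demand curve below is a hypothetical isoelastic one with a price elasticity of -1.5, and the elasticities are approximated with central differences, so the results agree with the formula only up to a small numerical error.

```python
def quantity(price, scale=1000.0, eta=-1.5):
    """Hypothetical isoelastic demand curve Q(P) = scale * P**eta."""
    return scale * price ** eta


def total_revenue(price):
    """Total revenue TR = P * Q(P)."""
    return price * quantity(price)


def elasticity(func, x, h=1e-6):
    """Numerical elasticity (df/dx) * (x / f) using a central difference."""
    derivative = (func(x + h) - func(x - h)) / (2 * h)
    return derivative * x / func(x)


price = 5.0
eta_q = elasticity(quantity, price)        # price elasticity of demand
eta_tr = elasticity(total_revenue, price)  # total revenue elasticity

print(eta_q, eta_tr)
```

With a price elasticity of -1.5, the total revenue elasticity is about -0.5, so a 1% price increase lowers revenue by roughly half a percent.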

Effects tests F-ratios

The effects tests for the OLS models are based on the sum of squares for the respective effects. The sum of squares is from the Model component of the ANOVA table with that effect included and excluded. Basically, an ANOVA is created with the effect and then without. The sum of squares for the effect is the difference in the Model sum of squares from the two ANOVA tables. For example, the Region effect sum of squares is the difference between an ANOVA table’s Model sum of squares with the Region variable and an ANOVA table’s Model sum of squares without the Region variable. The corresponding F-ratio is the effect’s mean square divided by the residual mean square from the ANOVA table.
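The arithmetic behind an effects test can be sketched as a small function. The sums of squares and degrees of freedom below are made-up values, and the function name is hypothetical; the residual mean square is taken from the ANOVA table that includes the effect.

```python
def effect_f_ratio(ss_model_with, ss_model_without, df_effect,
                   ss_residual, df_residual):
    """Return (effect sum of squares, F-ratio) for a single effect.

    ss_model_with / ss_model_without are the Model sums of squares from the
    ANOVA tables fitted with and without the effect; ss_residual and
    df_residual come from the table that includes the effect.
    """
    ss_effect = ss_model_with - ss_model_without
    ms_effect = ss_effect / df_effect
    ms_residual = ss_residual / df_residual
    return ss_effect, ms_effect / ms_residual


# Hypothetical values for a Region effect with four levels (3 degrees of freedom).
ss_effect, f_ratio = effect_f_ratio(
    ss_model_with=1200.0,
    ss_model_without=900.0,
    df_effect=3,
    ss_residual=2000.0,
    df_residual=100,
)
print(ss_effect, f_ratio)
```

Here the Region effect accounts for 300 of the Model sum of squares, giving a mean square of 100 against a residual mean square of 20.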


  • 1 B. Baesens. “Data Warehouses, Data Marts, Operational Data Stores, and Data Lakes: What’s in a Name?” Available at Data- Warehouses- Data-Marts-Operational-Data-Stores-and-Data-Lakes- Whats-in-a- Name-127417.aspx. Last accessed June 9, 2019.
  • 2 Also see,_transform,_load for a high-level overview of this process. Last accessed on June 6, 2019.
  • 3 There is another sense to dynamic - the real-time updating of the dashboard's contents. The continuous display of stock prices is an example. This is not my focus when I talk about a dynamic dashboard.
  • 4 See The quote is disputed as this link notes.
  • 5 Based on Last accessed June 7, 2019.
  • 6 The color issue refers to difficulties color-challenged people face.
  • 7 See Accessed January 29, 2019.
  • 8 I only consider two here, although actually there are more.
  • 9 It could also be OLS with appropriate transformation and use of polynomials, but I will focus on the logistic.
  • 10 These steps hold regardless of the software used for model estimation. How they are implemented does vary, but only slightly.
  • 11 From the sklearn User Guide. See Last accessed June 10, 2019.
  • 12 Recall that the odds ratio was discussed in Chapter 5.
  • 13 The “box” is old survey terminology that is still in use. It refers to the boxes people checked on a paper questionnaire. Now, they check boxes on an online questionnaire. For a five-point Likert Scale, “top-two box” refers to the top two boxes checked.
  • 14 Based on Last accessed December 26, 2018.
  • 15 See Last accessed December 26, 2018.
  • 16 See for WordNet which defines an entity as “that which is perceived or known or inferred to have its own distinct existence (living or nonliving).”
  • 17 de Albornoz et al. [2011]