Text processing

Even though a corpus is a collection of documents, so the raw text does have some structure, that structure is insufficient for statistical analysis aimed at uncovering patterns or Rich Information in the text. More structure has to be imposed. Basically, the raw text has to be encoded as numeric quantities appropriate for statistical analysis, since statistical analysis is based on numbers, not words. In this subsection, I describe methods for encoding text and the analysis methods applicable to the encoded data.

The “old fashioned” way to encode text data, before modern text processing software, was labor intensive. Consider the market research space, where text in the form of verbatim answers to questions is common (e.g., the ubiquitous “Other: Please Specify”). Clerks laboriously examined each verbatim response following a set of guidelines specified by the lead researcher. The guidelines specified key words and phrases expected to appear in responses; assigned a code number to them; and specified words and phrases believed to be meaningless or ignorable (e.g., articles such as “the”, “a”, and “an” and conjunctions such as “and”, to mention a few). These words and phrases were called, and still are called, stop-words. The clerk extracted key words or phrases while ignoring stop-words. The final result of this text processing was a list of words and phrases the clerk would then sort and tally into simple frequency counts, in a manner similar to a Stat 101 exercise in histogram construction. The top few words and/or phrases were then reported. See Leskovec et al. [2014] for some discussion.

This labor-intensive process (actually, the clerk) has been replaced by software, but the process is fundamentally the same. A basic list of stop-words is maintained that can, of course, be expanded to include industry, discipline, and geographically local or regional words, phrases, and even symbols that should be ignored. Some words on a stop-word list, however, may be important and should not be ignored. Words such as “The” and “An” (note the capitalization) could be part of a title and may be important. The implication is that not all words are equally unimportant: some should be kept because of their association with other words, giving those other words more meaning and context. The word “not” is a prime example. There are lists of stop-words but no universal list.

Some words are verbs with different conjugated forms, such as “be”, “am”, “is”, “was”, “were”, “are”, “been”, and “being”, all of which can be reduced to “be.” Other words are written in different forms merely by changing the ending of a base or stem word to create variations on the word. For example, consider a word I just used: “changing.” The stem or root word is “change.” Variations are “changing”, “changed”, and “changes.” In many applications, only the stem - “chang” in my example - is needed. The process of stemming, removing typical endings, is applied to words to extract and return just the stem. For example, stemming might return “chang-” where the hyphen indicates that something was dropped, but you do not know what.

Stemming is sometimes described as the crude process of deleting typical endings (e.g., “ed”, “es”, “ies”, “tion”) with the goal of returning the stem. Another process, lemmatization, is not crude but sophisticated, relying on dictionaries and parts-of-speech analysis to reduce words to their core, which is called a lemma. Stemming, however, is the most commonly used. An algorithm called Porter’s Algorithm is the most common method for stemming. Stemming may be overdone: it would reduce the words “operator”, “operating”, “operates”, “operation”, “operative”, and “operatives” to “oper-”. What would happen to “opera”? See Manning et al. [2008, p. 34] for a discussion.
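The crude ending-deletion version of stemming can be sketched in a few lines of Python. This is an illustrative toy, not Porter’s Algorithm; the suffix list and the `crude_stem` function are my own choices for demonstration.

```python
# A toy stemmer that deletes typical endings, as described in the text.
# This is NOT Porter's Algorithm; real stemmers apply ordered rules with
# conditions on what remains after a suffix is stripped.

SUFFIXES = ["ies", "tion", "ing", "ed", "es", "s"]  # longer suffixes first

def crude_stem(word):
    """Strip the first matching suffix; leave short remainders alone."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["changing", "changed", "changes", "operator", "opera"]:
    print(w, "->", crude_stem(w))
```

Note that this toy leaves “operator” and “opera” untouched, which shows both the crudeness of suffix stripping and why a real stemmer needs more careful rules.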

Punctuation and capitalization must also be dealt with. Punctuation marks are of no use for understanding the words themselves, so they need to be removed. Capitalized words, especially the first word of a sentence, are of little value unless they are proper names. These must also be dealt with by changing all words to lower case.

Modern text processing software scans the text in a document and tokenizes each word to create tokens. Tokenization is the process of separating each word and symbol (e.g., the ampersand “&”) into a separate entity or token. Some tokens are words per se and thus meaningful while others are not. It all depends on the text itself. The words “token”, “word”, and “term” are used interchangeably in the literature and I will use them interchangeably in what follows.

Tokenization is done with a text manipulation language called regular expressions. Regular expressions are powerful and widely used, but also arcane and difficult to understand and interpret, not to mention difficult to write. The approach consists of defining a pattern using metacharacters (e.g., the asterisk, question mark, and period or dot are three metacharacters) and regular characters (e.g., letters of the alphabet and the digits 0-9), which are meant to match corresponding symbols in a character string. Each metacharacter is a special pattern-matching instruction. The question mark metacharacter, for example, instructs the software to match zero or one occurrence of the preceding character in the pattern. The character string could contain letters, symbols, and digits. White spaces are also possible, and it is these white spaces that break a string into tokens. For example, the classic string “The quick brown fox” has three separating white spaces and would be decomposed into four tokens. Regular expression capabilities are found in many programming languages such as Python, Perl, and R, to mention a few. See Watt [2005] and Friedl [2002] for excellent introductions to regular expressions.
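A minimal sketch of whitespace tokenization with a regular expression, using the classic string from the text (the specific patterns are my own choices):

```python
import re

text = "The quick brown fox"

# Split on runs of white space; \s+ matches one or more white-space characters.
tokens = re.split(r"\s+", text)
print(tokens)       # ['The', 'quick', 'brown', 'fox']
print(len(tokens))  # 4

# Alternatively, match the tokens directly: \w+ matches runs of word characters.
# The ? metacharacter matches zero or one of the preceding character, so
# "colou?r" matches both the American and British spellings.
print(re.findall(r"colou?r", "color colour"))  # ['color', 'colour']
```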

There are two forms of tokenization: sentence and word. Sentence tokenization creates a token for each unit of text considered a sentence based on the local rules of grammar and punctuation. In the English language, for example, a period, question mark, or exclamation mark denotes the end of a sentence. Since much text is now computerized, a newline character is also counted as an ending mark for a sentence. A semicolon might count, depending on the text analysis software implementation.

Although sentence tokens are possible, most often only words are tokenized primarily because people are more interested in the frequency of occurrence of individual words rather than sentences. This is changing, however, because collections of words (e.g., phrases and clauses as well as sentences) convey meaning, sentiments, and opinions. I will discuss sentiment analysis and opinion mining in Chapter 7.

Word tokenization is not simple; there are complications. Examples are:

  • contractions formed by deleting a vowel, replacing it with an apostrophe, and joining parts (e.g., “won’t” rather than “will not”; “can’t” rather than “cannot”);
  • misspellings;
  • disallowed repetition of a letter (e.g., in English, a letter cannot be used more than twice in a row in a word); and
  • shortened versions of words (e.g., “lol” for “laughing out loud”), to mention a few.

Tokenizing a word such as “won’t” could yield “won” and “’t”, neither of which makes much sense. Some text software packages recognize this problem and return a single correct token: “won’t.” See Sarkar [2016] for an in-depth discussion of tokenization problems and how they are handled using Python tools.

Once you have word-tokenized, or just tokenized for short, a document, you then have a bag-of-words (BOW). This is an apt description for exactly what you have: a collection of words that, in and of itself, is useless; you must do something with it. A common operation, the one market researchers traditionally used as described above, is to count the occurrence (i.e., frequency) of each term across all documents in a corpus. Let f_ij ≥ 0 be the frequency of occurrence of term t_i in document j in corpus C. There is no j subscript on t_i because the document is not important; it is just a term. Term t_i may not occur in document j, so f_ij = 0 is possible. However, it must occur at least once somewhere in the corpus, so tf_i = Σ_j f_ij ≥ 1. You could simply array the terms by their frequency counts and focus on those that occur most often in all the documents. The frequencies are usually normalized by the total number of terms in the documents so a relative frequency is displayed. The raw, unnormalized frequencies across all documents are simply referred to as term frequencies (tf). Note that tf_i = Σ_j f_ij is the term frequency for term t_i over all documents in C. A simple display is a barchart with each word listed next to its frequency count and a bar showing the size of the count. An example is shown in Figure 2.4.
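Counting term frequencies over a bag-of-words takes only a few lines of Python; this sketch uses the standard library’s `collections.Counter`, and the tiny example corpus is my own:

```python
from collections import Counter

# A tiny corpus: each document is already tokenized and lower-cased.
corpus = [
    ["the", "product", "works", "well"],
    ["the", "service", "was", "rated", "high"],
]

# Term frequency tf_i summed over all documents in the corpus.
tf = Counter()
for doc in corpus:
    tf.update(doc)

print(tf["the"])          # occurs once in each document, so 2
print(tf.most_common(3))  # the top terms by frequency
```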

A problem with this approach is that documents have different lengths. For example, product reviews could vary in length from one word (e.g., “garbage”, “great”) to whole paragraphs and pages. Consequently, one word could appear more frequently in one document than in another merely because of the different document lengths. In addition, some words naturally occur more often than others


FIGURE 2.4 This is an example of a term frequency display for the product review “the product works well and the service was rated high but the price was too high.” Stop-words have been filtered out. Seven terms were retained out of 16 in the review. Of the seven terms, “high” occurred twice while the other six occurred once each.

but really do not contribute any meaning in the overall text. Stop-words such as “the”, “and”, and “an” are examples. In a product review, a customer could write “the product works well and the service was rated high but the price was too high!” As a BOW, there are 16 terms. The term “the” occurs three times, so tf_the = 3, while the term “was” occurs twice, so tf_was = 2. The word “high” also occurs twice, so tf_high = 2. All other words occur once each. Clearly the words “the” and “was” have no importance. They should be filtered out of the BOW before any analyses are done. These deleted words are the stop-words. The word “high” is important and should not be filtered out. A frequency count display might look like the one in Figure 2.4. The issue is weighting the terms so that those that are important and occur less frequently across all the documents are given more weight in an analysis while less important ones are given less weight. In Figure 2.4 all the terms have equal weight.
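Filtering stop-words before counting can be sketched as follows; the stop-word list here is a small hand-picked set for this one review, not a standard list:

```python
from collections import Counter

review = ("the product works well and the service was rated high "
          "but the price was too high")
tokens = review.split()

# A small, hand-picked stop-word list for this example only.
stop_words = {"the", "and", "was", "but", "too"}

kept = [t for t in tokens if t not in stop_words]
print(len(tokens))    # 16 terms in the raw BOW
print(Counter(kept))  # seven retained terms; "high" appears twice
```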

Weighting data is not unusual in empirical work. It is common in survey analysis, for example. In fact, it is probably the rule rather than the exception. See Valliant and Dever [2018] for reasons and methods for weighting survey data. Weights are also used in regression analysis when heteroskedasticity is an issue. In this case, the objective is to weight down those cases with large variances and weight up those with small variances to equalize variances across all the cases. This equalization is tantamount to the classical regression assumption of homoskedasticity, which is needed for the Gauss-Markov Theorem. In this latter situation, a variance generating process (VGP) is assumed and this process is the basis for determining the weights. See Hill et al. [2008] for a discussion.

In sample survey analysis, a simple random sampling (SRS) without replacement design is based on every object having the same chance of being selected for the sample as every other object. Problems do occur in real-world applications, but this assumption is a good starting point. See Valliant and Dever [2018] for a discussion of problems. In actual applications, an SRS design is rarely used. Instead, most sample designs use a probability sampling method in which each element or unit to be sampled (i.e., the Primary Sampling Unit or PSU) is assigned a nonzero probability of being selected, so SRS is a special case of probability sampling. In an SRS, each unit has the same probability of being selected. If N is the population size, then the selection probability on any single draw is p_i = 1/N, i = 1, ..., N. If the sample size is n, then the overall selection probabilities are n/N since each of the n units has the same selection probability. For example, if n = 3 and N = 10, then p_i = 3/10, so each person has a 0.30 chance of being selected for the study. These selection probabilities do not have to be all equal for probability sampling. See Heeringa et al. [2010] for a discussion.

A sampling weight is the reciprocal of the selection probability: w_i = 1/p_i. Sometimes the weight is interpreted as “the number (or share) of the population elements that is represented by the sample observations. Observation i, sampled with probability 1/10, represents 10 individuals in the population (herself and nine others).” See Heeringa et al. [2010, p. 35] for a discussion.

Referring to the terms in a corpus C, the probability of any term t_i being randomly selected or found in the N_C documents in the corpus is the number of

TABLE 2.1 This illustrates the calculation of term frequencies. The term does not appear in document 3, so f_i3 = 0 and the indicator function returns 0. Then n_i = 3, so Pr(t_i ∈ C) = n_i/N_C = 3/4.














documents in which t_i occurs, n_i, divided by the number of documents, N_C. If f_ij ≥ 0 is the frequency of occurrence of term t_i in document j in corpus C, and I(f_ij > 0) is the indicator function that returns 1 if the argument is true and 0 otherwise, then n_i = Σ_{j=1}^{N_C} I(f_ij > 0). Then

Pr(t_i ∈ C) = n_i/N_C.

Table 2.1 illustrates a simple example. Note that n_i is the number of documents t_i appears in, while f_ij, defined earlier, is the frequency count of t_i in a single document, document j. Analogous to sampling weights, the weight for term t_i is N_C/n_i, the inverse of the probability. This expression is the basis for the inverse document frequency (IDF). See Robertson [2004]. Applications use the log base 2 of this expression, or IDF_i = log2(N_C/n_i). The reason for the log2 lies in information retrieval theory, where a term is either retrieved or not. This is a binary situation, hence the base 2. If term t_i appears in all documents in C, so that it is a commonly occurring term, then n_i = N_C and log2(1) = 0, so the term has zero weight: it contributes no insight because it occurs all the time.
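A sketch of the document-frequency count n_i and the base-2 IDF weight in Python; the mini corpus and `doc_freq` helper are my own, and `math.log2` is the standard library’s base-2 logarithm:

```python
import math

# A made-up corpus of three tokenized documents.
corpus = [
    ["product", "failed"],
    ["product", "worked"],
    ["nice", "failed"],
]
N_C = len(corpus)

def doc_freq(term, corpus):
    """n_i: the number of documents in which the term occurs at least once."""
    return sum(1 for doc in corpus if term in doc)

n_product = doc_freq("product", corpus)
idf_product = math.log2(N_C / n_product)
print(n_product, idf_product)   # "product" appears in 2 of 3 documents

# A term in every document gets zero weight: log2(N_C / N_C) = log2(1) = 0.
print(math.log2(N_C / N_C))     # 0.0
```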

Variations exist for the IDF calculation. For example, JMP software uses the base-10 logarithm: IDF_i = log10(N_C/n_i).

Sometimes the denominator in the log term is increased by 1 to avoid the case of n_i = 0 (i.e., t_i does not appear in any document in the corpus), which should never occur since t_i would not be under consideration if it did. Sometimes 1 is added to the whole log term to reflect the case where all the terms are contained in all the documents in the corpus. With both adjustments, the IDF is then written as

IDF_i = 1 + log2(N_C/(1 + n_i)).

A final adjustment sometimes done is to divide the IDF by the Euclidean norm of the values. This normalizes the IDF. See Sarkar [2016] for some discussion.

Since IDF_i, however calculated, is a weight, it can be applied to the term frequency of t_i in document j, f_ij, through multiplication. The result is the term frequency-inverse document frequency measure (tfidf):

tfidf_ij = f_ij × IDF_i

for term t_i in document j. It is merely the weighted term frequency.

Since there is a set of terms for each document in the corpus, a matrix can be created that has as many rows as there are documents and as many columns as there are tokenized terms from all documents. If N_C is the number of documents and T is the number of unique terms across all documents in the corpus, then the matrix has size N_C × T. This is called the Document Term Matrix (DTM). This matrix will necessarily be sparse since not all terms will be in each document (i.e., in each row of the DTM). A sparse matrix is one with a lot of zeros. Aside from zeros, which indicate that a term does not appear in a document, the cells of the DTM can be populated with any of the following:

  • binary values: 0 if the term does not appear in the document and 1 if it does;
  • ternary values: 0 if the term does not appear in the document, 1 if it appears once, and 2 if it appears more than once;
  • frequency count: the number of times the term appears in the document;
  • log frequency count: log10(1 + x), where x is the term’s frequency count in the document;
  • tfidf: the weighted term frequency.

The tfidf is the most common.
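Building a small document term matrix with frequency-count cells can be sketched in a few lines of Python; the toy corpus and variable names are my own:

```python
from collections import Counter

# A toy corpus of three tokenized documents (the words are illustrative only).
corpus = [
    ["good", "product", "good"],
    ["bad", "service"],
    ["good", "service"],
]

# Columns: the sorted vocabulary across all documents.
vocab = sorted({term for doc in corpus for term in doc})

# Rows: one frequency-count vector per document, giving an N_C x T matrix.
dtm = [[Counter(doc)[term] for term in vocab] for doc in corpus]

print(vocab)   # ['bad', 'good', 'product', 'service']
print(dtm[0])  # [0, 2, 1, 0]
```

Swapping the cell rule (binary, ternary, log frequency, or tfidf) only changes the inner expression; the N_C × T shape of the matrix stays the same.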


Consider a simple example of three product reviews. Each review is a document, so N_C = 3. This is a corpus of size three. The three reviews are:

  1. The product failed the second time - it failed!
  2. The product worked the first time.
  3. It’s nice but failed.

Clearly, one review is favorable (review #2); one is moderate but notes that the product failed (review #3); and one is negative (review #1). There are 18 “terms” or tokens (the dash is deleted), but only seven unique terms or tokens have meaning. The stop-words and the dash were removed. A report on the seven terms is shown in Figure 2.5. It is easy to see that “failed” is the most frequently used word. The frequency of each term in each document is shown in Figure 2.6. Looking at the word “failed”, you can see that it occurs twice in the first document, not at all in the second, and once in the third.


FIGURE 2.5 This is an example for three product reviews. Even though there are three reviews (i.e., documents), there are seven terms after stop-words and punctuation are removed.


FIGURE 2.6 This is the DTM corresponding to the example with three cases. Notice that the column sums correspond to the frequency counts in Figure 2.5.

Since “failed” occurs in two documents and there are three documents, the inverse document frequency ratio for “failed” is 3/2 = 1.5, and log10(1.5) = 0.17609. This is shown in Table 2.2. This is a (weighted) DTM. This is the structure imposed on unstructured text data. These same calculations would be done for each of the seven terms. The complete set is shown in Figure 2.7.
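The Table 2.2 arithmetic for “failed” can be checked in a few lines of Python; `math.log10` is the standard library’s base-10 logarithm, matching the JMP convention used in this example:

```python
import math

N_C = 3         # three product reviews (documents)
n_failed = 2    # "failed" appears in documents 1 and 3
tf = [2, 0, 1]  # per-document counts of "failed" (Column B of Table 2.2)

idf = math.log10(N_C / n_failed)   # log10(1.5)
tfidf = [round(f * idf, 5) for f in tf]
print(round(idf, 5))  # 0.17609
print(tfidf)          # [0.35218, 0.0, 0.17609]
```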

Multivariate analyses

Once the DTM is created, a number of multivariate statistical procedures can be used to extract information from the text data. A common procedure is to extract key words and phrases, and groups of words and phrases as topics. Conceptually, phrase extraction is the process of taking groups of words once the document has been tokenized, each group based on a prespecified maximum size. A group of size n is an n-gram. If n = 1, the group is a unigram; for n = 2, it is a bigram; for n = 3, it is a trigram; etc. Creating n-grams is tantamount to creating a small window of a prespecified size, placing it over a vector of tokens, and treating all the words inside

TABLE 2.2 This illustrates a tfidf calculation for the term “failed”. Column B is the term frequency for “failed” for each document in Column A. The frequencies correspond to the values in the “failed” column of the DTM in Figure 2.6. Column C is the inverse document frequency ratio: the number of documents divided by the number of documents that contain the word “failed”. There are three documents and “failed” appears in two of them, so n_failed = Σ_j I(f_failed,j > 0) = 2. The ratio is 3/2 = 1.5. The log10 of the ratio is in Column D. The log10 is used by JMP, which was used for this example. The tfidf is in Column E and equals the value in Column B times the value in Column D. Notice that the values in Columns C and D are constant but the final weight in Column E varies because of the frequencies in Column B.



Document (A)   tf (B)   Ratio (C)    log10 (D)   tfidf (E)
1              2        3/2 = 1.5    0.17609     2 × 0.17609 = 0.35218
2              0        3/2 = 1.5    0.17609     0 × 0.17609 = 0.00000
3              1        3/2 = 1.5    0.17609     1 × 0.17609 = 0.17609


FIGURE 2.7 The weighted DTM for the example with three cases.

the window as a phrase. The window is then moved to the right one token, and all the words inside the new placement of the window form a new phrase. This continues until the end of the vector of tokens. The phrases are then counted and a report created showing the frequency and (sometimes) the length of each phrase. This is largely a counting function. See Sarkar [2016] for a discussion.
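The sliding-window construction of n-grams described above can be sketched as follows (the `ngrams` helper is my own):

```python
def ngrams(tokens, n):
    """Slide a window of size n across the token vector, one token at a time."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["the", "quick", "brown", "fox"]
print(ngrams(tokens, 2))  # three bigrams
print(ngrams(tokens, 3))  # two trigrams
```

A vector of m tokens yields m - n + 1 n-grams, which is why the phrase report is largely a counting exercise.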

Phrases are of limited use in part because many of them would be nonsensical since they are created by selecting groups of words. The groups do not have to make any sense at all. You, as the analyst, must decide what is a useful, meaningful phrase. Many believe that topics are more useful, especially because topics are extracted using a statistical modeling method. Topics are viewed as latent, much like factors in factor analysis. In fact, the extraction of topics can be viewed as factor analysis applied to text data. In factor analysis, the resulting extracted factors must be interpreted to give them meaning and usefulness. Interpretation is also needed with latent topics. Nonetheless, a model is still necessary for topic analysis. It is this model that distinguishes topics from phrases.

There are three latent topic methods in common use: [1] [2] [3]

Some have argued that LDA is not a good method because it is inconsistent: a different answer results from the same data set each time it is used. In addition, it uses hyperparameters, so more “tuning” is required to get the right set of hyperparameters. Hyperparameters are those parameters that are set prior to model estimation, in contrast to the parameters that are estimated. Hyperparameters define the model and its complexity. See Alpaydin [2014]. The method, however, is improving rapidly as research continues to add to its foundations.

LSA is the most used method. It is based on the Singular Value Decomposition (SVD) of a matrix; in this case the matrix is the DTM. This allows terms that are used in the same context and occur together more often than not to be grouped together, much like numeric data are grouped in factor analysis. The terms are given weights or loadings. The SVD is preferred because it gives one unique solution regardless of how often it is run, it does not have hyperparameters, it is fast, and it is unaffected by the sparseness of the DTM.

Non-negative Matrix Factorization is another matrix factorization method, but in this case the matrix is factored into two parts. A property of this factorization is that all three matrices - the original and the two factors - have non-negative elements.

The essence of the SVD of the DTM is that the matrix is decomposed into three submatrices simply called the left, center, and right matrices. The matrix product of these three submatrices returns the original DTM. The left matrix has columns corresponding to the rows of the DTM; the right has columns corresponding to the columns of the DTM; the center is a connection that has a special interpretation. The center matrix’s components are called the singular values. The columns in the left and right matrices are called eigenvectors. The eigenvectors have the feature that they are independent of each other in their respective submatrix. The squares of the singular values are eigenvalues that correspond to the eigenvectors. See the appendix to this chapter for a discussion of the Singular Value Decomposition.

The weights or loadings are the product of the singular values and the right set of eigenvectors. The right set is used because these correspond to the terms in the DTM. The products are sometimes called singular vectors. Once the SVD is applied, other multivariate procedures can be used with the results. This includes cluster analysis and predictive modeling.
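A sketch of the SVD of a small weighted DTM, assuming NumPy is available; `numpy.linalg.svd` returns the left singular vectors, the singular values, and the transposed right singular vectors, and the toy matrix values are illustrative only:

```python
import numpy as np

# A toy 3-document x 4-term weighted DTM (values are illustrative only).
dtm = np.array([
    [0.35, 0.00, 0.18, 0.00],
    [0.00, 0.18, 0.00, 0.18],
    [0.18, 0.00, 0.18, 0.00],
])

# Left (U), center (singular values s), and right (Vt) matrices.
U, s, Vt = np.linalg.svd(dtm, full_matrices=False)

# The product of the three pieces recovers the original DTM.
print(np.allclose(U @ np.diag(s) @ Vt, dtm))  # True

# Term loadings: singular values times the right singular vectors.
loadings = np.diag(s) @ Vt
print(loadings.shape)  # (3, 4): one row per dimension, one column per term
```

The squares of the entries of `s` are the eigenvalues mentioned in the text, and plotting the first two rows of `loadings` (or of `U` for documents) gives displays like Figure 2.10.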


There are 22 online reviews of a robotic vacuum cleaner that the marketing and product management teams know is not selling well. They want to produce the next-generation robotic cleaner and are looking for ideas for a better product. A quick perusal of the reviews reveals the main problems with the current design, and a perusal of competitor reviews does the same for their products. Suppose, however, that these reviews were in the hundreds, with each review being quite lengthy. Reading them all is inefficient. The 22 reviews for this example are just a snapshot to illustrate the technique. Each review is a document and the collection of 22 reviews is a corpus, so N_C = 22.


FIGURE 2.8 Word and phrase analysis of 22 robotic vacuum cleaner reviews. The list of terms on the left is truncated for display purposes.

Based on the 22 online reviews, a word and phrase analysis is shown in Figure 2.8. There are 1,579 tokens with 438 terms. This yields an average of 71.8 tokens per case or product review.

Notice that “app”, “robot”, and “work” are the top three terms. “Robot” should not be surprising since this is robotic technology, so this word could be added to a user-defined stop-word list, but I will leave it in for this example. The word “work” does not convey much insight, but “app” does. This word is at the top of the list, appearing 26 times in the 22 reviews, suggesting that the app is an issue, good or bad. A word cloud, a visual display of the word frequencies, is shown in Figure 2.9. The size of a word is proportional to its frequency.

The phrases column on the right of the report shows the frequency count of each phrase and the number of terms appearing in the phrase. “Wi fi” appears the most and has two terms in the phrase; it is a bigram. A glance at the top phrases, say the top 10, suggests that the vacuum cleaner has a tendency to stop in the middle


FIGURE 2.9 Word cloud based on the frequency of occurrence of each word. This clearly shows that “app” is important.


FIGURE 2.10 This is a plot of the first two dimensions of the SVD of the DTM using tfidf weights. The plot on the left is for the documents while the one on the right is for the terms.

of the room, takes a long time to run, and crashes. Now this is some insight into problems with the product. This is Rich Information.

A Latent Semantic Analysis (LSA) was done on the 22 reviews based on the SVD of the DTM. The tfidf was used to weight the DTM, and the weighted DTM was scaled to have a zero mean and unit standard deviation. The results for the first two singular vectors for the documents and the terms are shown in Figure 2.10. The graph on the left shows the distribution of the documents and the one on the right the terms. The terms are more important. Notice that there are three clusters of points with two, possibly three, outliers. The large term-cluster on the left, if highlighted, shows that the words “map”, “recon”, “well”, “run”, and “vacuum” are the dominant words for this group. This is shown in Figure 2.11. A second group is highlighted in Figure 2.12 while a third cluster is highlighted in Figure 2.13.


FIGURE 2.11 This set of graphs highlights specific points for the SVD of DTM using TFIDF weights.

The analysis can be extended by basically doing a factor analysis on the DTM but utilizing the SVD components. This results in “topics” which are like the factors in factor analysis. Each topic has terms comprising that topic with term loadings that show the impact on the topic. These are correlations exactly the way loadings are correlations in factor analysis. Figure 2.14 shows five topics extracted from the DTM.

  • [1] Latent Semantic Analysis (LSA), also sometimes referred to as Latent Semantic Indexing (LSI);
  • [2] Latent Dirichlet Allocation (LDA); and
  • [3] Non-negative Matrix Factorization (NMF).