Market segmentation has been an important component of marketing operations for a long time. Fundamentally, segmentation involves dividing the market into homogeneous subgroups and developing a different marketing campaign for each segment. Segmentation also includes developing a different pricing strategy and price point for each segment. This is called price segmentation in the marketing literature and price discrimination in the economic literature. See Paczkowski  for a discussion of price segmentation.
Many different approaches have been proposed for segmenting a market, all of which can be summarized in three categories:
- 1. a priori segmentation;
- 2. unsupervised learning segmentation; and
- 3. supervised learning (or model based) segmentation.
A priori segmentation refers to those situations in which the marketing team, the corporate culture, or the executive management have decided on the segments without regard to any data. Basically, they just intuitively know the segments because they just seem to fit the business. An example is electricity customers. Utilities sometimes divide their market into residential, commercial, and industrial customers merely because this is intuitive.7 Marketing campaigns and pricing would be developed for these segments.
Unsupervised and supervised learning refer to how understanding is gained from data. The analogy is learning in an educational situation, say a college class. The traditional education model consists of a professor who guides a student (called a “learner”) in processing, interpreting, and learning from books, articles, and lecture notes (i.e., input), and then tests the learner’s performance via a test with an assigned grade. There is a teacher, a learner, an input, and a performance measure; this is supervised educational learning. In statistical analysis, a model (e.g., a linear, logistic, or Poisson model) guides how a learner (the estimation technique, such as least squares or maximum likelihood) processes inputs (i.e., data), with the processing performance measured by a goodness-of-fit measure such as R² or a badness-of-fit measure such as the Akaike or Bayesian Information Criterion (AIC or BIC, respectively). This is called supervised learning in the statistical and Machine Learning spaces. The estimation technique learns from the data under the guidance of the model. The entire regression family of statistical methodologies consists of supervised learning methods.
If there is no college professor, so the students are left on their own to learn, then they are clearly unsupervised. In the statistical analysis and Machine Learning spaces, if there is no model but there is data, an algorithm for operating on that data, and (maybe) a performance criterion, then this situation is called unsupervised learning. The algorithm is not guided by a model but follows some procedure, which is the algorithm itself. Cluster analysis, with hierarchical and k-means as the two most popular examples, is in the family of unsupervised learning techniques. A performance measure may not exist for these methods. For hierarchical clustering, a Cubic Clustering Criterion (CCC) is sometimes used, but this is not without controversy. See Paczkowski [2016] for some discussion of the CCC and references.
The two learning approaches and their college counterparts for reference are summarized in Table 4.3.
Regarding segmentation, a latent class regression analysis is an example of a supervised learning segmentation method. Unsupervised learning segmentation would be some form of clustering. Of the two, unsupervised clustering is the more popular, although supervised latent class segmentation is gaining ground as software develops in this area and more market researchers are trained in supervised learning methods.
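TABLE 4.3 Learning comparisons.

| Component | College class | Supervised learning | Unsupervised learning |
|---|---|---|---|
| Teacher | Professor | Model (e.g., linear, logistic, Poisson) | None |
| Learner | Student | Estimation technique (e.g., least squares, maximum likelihood) | Algorithm |
| Input | Books, articles, lecture notes | Data | Data |
| Performance measure | Test grade | R², AIC, BIC | May not exist (e.g., CCC for hierarchical clustering) |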
The advantage of supervised learning is that a model guides the learning process. The model reflects assumptions about how the world works and what the important drivers are for determining a key variable as reflected in the model specification, the key variable being the dependent variable. Given the data (the independent variables) for the model, there could be only one solution for the unknown parameters, the weights on the independent variables, that produce the best predictions for the dependent variable. Consider OLS as an example. The best predictions are those that minimize the sum of the squared differences between the actual values of the dependent variable and their predictions. Estimates of the unknown parameters are chosen that yield this minimum. In this sense, the dependent variable and the model for it guide the selection of those parameters. There is only one set of estimated parameters; there is only one solution that weights the independent variables to predict the dependent variable.
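The OLS idea can be illustrated with a minimal sketch. The data, the "true" weights, and the size of the perturbation are all made up for illustration; the sketch also assumes the independent variables are not perfectly collinear, which is what makes the least squares solution unique.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 observations, an intercept and two independent variables.
n_obs = 200
X = np.column_stack([np.ones(n_obs), rng.normal(size=(n_obs, 2))])
true_beta = np.array([1.0, 2.0, -0.5])  # assumed weights, for illustration only
y = X @ true_beta + rng.normal(scale=0.3, size=n_obs)

# OLS: choose the weights that minimize the sum of squared prediction errors.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)


def sse(beta):
    """Sum of squared differences between actual and predicted values."""
    resid = y - X @ beta
    return float(resid @ resid)


# Any other set of weights gives a larger sum of squares.
perturbed = beta_hat + np.array([0.05, -0.05, 0.05])
print("estimated weights: ", np.round(beta_hat, 3))
print("SSE at the estimate:", round(sse(beta_hat), 3))
print("SSE when perturbed: ", round(sse(perturbed), 3))
```

Moving the weights away from the estimate in any direction raises the sum of squares, which is the sense in which the dependent variable and the model guide the selection of the parameters.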
For unsupervised learning, there is no model since there is no dependent variable. There cannot be. There are no assumptions about how the world works. In fact, the search is for something that yields relationships among a set of variables without any prior view of a relationship. Supervised learning has a prior view of a relationship, the model, while unsupervised learning does not. Unsupervised learning takes place using an algorithm with a specific set of parameter settings that search for relationships. These parameters are hyperparameters. You specify the hyperparameters; you do not search for them. Consequently, a simple change in the hyperparameters can produce a different relationship. Consider hierarchical cluster analysis, which is an unsupervised learning method. The parameter that must be set before clustering can be done is the type of clustering algorithm to be used. Most software allows you to select one of five algorithms: Average Linkage, Centroid Linkage, Ward’s Method, Single Linkage, and Complete Linkage. Each method can generate a different cluster solution. Most practicing statisticians, marketing researchers, and Machine Learning experts use Ward’s minimum variance method and select the most appealing solution from this method. See Everitt et al.  and Jobson [1992] for discussions of clustering methods.
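The effect of the linkage hyperparameter can be sketched with SciPy's hierarchical clustering routines. The customer data here are simulated, and the choice of three clusters is an assumption made only so that the solutions can be compared; this is not the software or data used in this chapter.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(1)

# Hypothetical data: 60 customers measured on two variables.
data = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=1.0, size=(20, 2)),
    rng.normal(loc=[3.0, 0.0], scale=1.0, size=(20, 2)),
    rng.normal(loc=[1.5, 3.0], scale=1.0, size=(20, 2)),
])

methods = ["average", "centroid", "ward", "single", "complete"]
solutions = {}
for method in methods:
    Z = linkage(data, method=method)  # Euclidean distances by default
    # Cut the tree so that at most three clusters are formed.
    solutions[method] = fcluster(Z, t=3, criterion="maxclust")
    sizes = np.bincount(solutions[method])[1:]
    print(f"{method:>8}: cluster sizes = {sizes.tolist()}")
```

Only the `method` argument changes between runs, yet the cluster sizes can differ from one linkage to the next, which is why the analyst's choice of hyperparameter matters.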
A supervised learning segmentation method is superior because it uses a model. In this category, there is latent regression analysis, which is a combination of regression analysis and latent class analysis. This will work well if the dependent variable is continuous. See Paczkowski  for an example using latent regression analysis for price segmentation. The latent class regression approach has been extended to a discrete choice case. Referring to (4.6), the modification involves conditioning the model on a grouping of the dependent variable. This leads to Pr_i(j | s). The seemingly small change in notation, the conditioning on a segment or group s, complicates estimation since the groupings are unknown. They have to be estimated simultaneously with the unknown systematic utility parameters. See Greene and Hensher  for a discussion and application. Also see Wen et al.  for an application to transportation carrier choice.
Another supervised learning option is decision trees, sometimes called recursive partitioning. Unlike latent regression analysis, decision tree analysis can handle a dependent variable that is continuous or discrete or categorical. If the dependent variable is continuous, then the decision trees are referred to as regression trees; if discrete, then they are referred to as categorical trees. The method, regardless of the nature of the dependent variable, is a recursive partitioning model because it proceeds in a recursive fashion to partition or divide the dependent variable space into smaller spaces using constants determined by the independent variables. These constants that partition the dependent variable space are selected based on the best independent variable, the one that does the best at accounting for the dependent variable based on a criterion measure, and the best division of that variable. If a “best” independent variable is discrete with, say, two levels, then the constants are the two levels. If the “best” independent variable is continuous, then the constants are based on the point or value that optimally divides that variable into two parts. In either case, a constant is determined. Once a partition, based on a constant, is determined, then the algorithm proceeds to find the next best variable and a constant that divides the space, but all given the first variable and its partition. The identification of succeeding variables that contribute to explaining the dependent variable and the associated partitions continues until some stopping rule is met. The resulting set of variables and partitions are displayed as a tree with the dependent variable as the root and the successive “best” variables and partitions as branches emanating from the root. The final set of partitions are interpreted as segments. See Paczkowski  for some discussion of decision trees in JMP.
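The recursive partitioning logic can be sketched for a continuous dependent variable (a regression tree). This is a bare-bones illustration under stated assumptions, not the algorithm in JMP or any other package: the criterion is the within-partition sum of squares, the stopping rule is a maximum depth plus a minimum partition size, and the data and variable names (`income`, `urban`) are hypothetical.

```python
import numpy as np


def best_split(X, y):
    """Return (column, cut point, SSE) for the best single binary split of y."""
    best = (None, None, np.inf)
    for col in range(X.shape[1]):
        values = np.unique(X[:, col])
        # Candidate constants: midpoints between adjacent distinct values.
        for cut in (values[:-1] + values[1:]) / 2:
            left, right = y[X[:, col] <= cut], y[X[:, col] > cut]
            sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if sse < best[2]:
                best = (col, float(cut), float(sse))
    return best


def grow(X, y, depth=0, max_depth=2, min_size=10):
    """Recursively partition the data; each leaf is a candidate segment."""
    leaf = {"n": len(y), "mean": float(y.mean())}
    if depth >= max_depth:          # stopping rule: tree is deep enough
        return leaf
    col, cut, _ = best_split(X, y)
    if col is None:                 # no variable can be split further
        return leaf
    mask = X[:, col] <= cut
    if mask.sum() < min_size or (~mask).sum() < min_size:
        return leaf                 # stopping rule: partitions too small
    return {
        "var": col,
        "cut": cut,
        "left": grow(X[mask], y[mask], depth + 1, max_depth, min_size),
        "right": grow(X[~mask], y[~mask], depth + 1, max_depth, min_size),
    }


# Hypothetical data: spending depends on an income threshold and an urban indicator.
rng = np.random.default_rng(2)
n = 200
income = rng.uniform(20, 100, size=n)
urban = rng.integers(0, 2, size=n)
spend = 10 + 5 * urban + 8 * (income > 60) + rng.normal(scale=1.0, size=n)

tree = grow(np.column_stack([income, urban]), spend)
print(tree)
```

The two-level discrete variable is split between its levels, the continuous variable is split at the value that best divides it, and each later split is found conditional on the earlier ones. The leaves of the printed dictionary play the role of segments.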
In the previous sections, I focused on newer methods for testing new products in the marketplace. This was predicated on there being just one version of a product. But suppose there are several. An older methodology for determining which of several versions of a product will sell is based on finding the best combination of those products with the notion that the combination will sell the best. It may be that each product alone does not garner enough market share to meet business objectives, but in combination with one or two others they could produce total sales to meet the objectives. A good example is ice cream. Offering one flavor (e.g., chocolate) may not attract enough customers to be profitable. Several flavors, however, could attract customers because of the wider selection even though they would still buy only one; they would buy at least one of the offered flavors. Offering a huge selection of flavors may not be practical because of the cost of producing, maintaining, and marketing them all. An optimal subset, a small combination, may be more profitable. Also, too many options may stifle purchases because customers could be overwhelmed and therefore just not make a purchase. There is a paradox of choice that involves creating purchase anxiety which customers may not be able to resolve and overcome, so they just do not purchase. See Schwartz  for the classic discussion of this interesting paradox.
A market research methodology named TURF, an acronym for “Total Unduplicated Reach and Frequency,” was developed to handle situations such as this. It has its origins in advertisement research and is quite old and simplistic, but it is still used, especially beyond its original intent. I will briefly discuss TURF in this section for product research. MaxDiff is a modern approach to testing claims and messages, as well as different versions of a product. I will discuss this approach in Chapter 5 and then show how it can be combined with TURF.
TURF was developed to determine the best or optimal set of magazines (or newspapers or TV shows) to use to promote a product. If there are five magazines, it may not be cost effective to promote in all five, not just because of the costs of placing an ad, but because customers may only buy, say, two of the five, so placing an ad in two will have the best exposure; it is not necessary to have all five. The percent exposure in the set of magazines is the set’s reach. The set is the combination of magazines. If there are n = 5 magazines, then the number of combinations of size two is
$$\binom{5}{2} = \frac{5!}{2!\,3!} = 10.$$
There are 10 combinations or sets of magazines, each consisting of 2 magazines. The question is: “Which of the 10 possible combinations of size two has the largest reach?” The number of times a customer is reached is the frequency. The proportion of times at least one item in the set is responded to is the reach. A complete discussion of reach and TURF is in Paczkowski .
To implement a TURF study for new products with different versions, customers are surveyed and asked their preference for the versions. They could simply be asked a Yes/No question such as “Would you buy this product?” Or they could be asked their preference on a nine-point Likert Scale. Regardless of the underlying scale, TURF requires that responses be dummy coded as 0 for No and 1 for Yes. For the first way to ask a question, the Yes/No responses just have to be encoded. For the second way, the preference rating could be dummy coded as 1 if the rating is a 7 or higher, 0 otherwise. This would create a top three box scale (T3B).
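The dummy coding just described takes only a line or two. The ratings and answers below are invented for illustration, and the variable names are my own.

```python
import numpy as np

# Hypothetical ratings on a 1-9 Likert scale:
# rows are customers, columns are product versions.
ratings = np.array([
    [9, 3, 7],
    [6, 8, 2],
    [7, 7, 1],
    [4, 5, 6],
])

# Top three box (T3B): 1 if the rating is 7, 8, or 9; 0 otherwise.
t3b = (ratings >= 7).astype(int)
print(t3b)

# Yes/No answers are encoded the same way: 1 for Yes, 0 for No.
answers = np.array(["Yes", "No", "Yes"])
yes_no = (answers == "Yes").astype(int)
print(yes_no)
```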
Suppose there are n products. The number of combinations of size r, r = 1, 2, ..., n, is given by
$$\binom{n}{r} = \frac{n!}{r!\,(n-r)!}.$$
Suppose r is restricted to some number less than n. All combinations of size i = 1, i = 2, ..., i = r < n can be determined. Let c be the total number of these combinations. That is,
$$c = \sum_{i=1}^{r} \binom{n}{i}.$$
For example, if r = 1 and n = 5, then c = 5; if r = 1, 2, then c = 5 + 10 = 15.
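These counts are easy to check with the Python standard library. The product labels A to E are placeholders.

```python
from itertools import combinations
from math import comb

n = 5

# Number of combinations of size two out of five items.
n_pairs = comb(n, 2)
print(n_pairs)  # 10

# Total number of combinations of size 1 and size 2.
c = sum(comb(n, i) for i in range(1, 3))
print(c)  # 15

# The pairs themselves, using placeholder labels.
pairs = list(combinations("ABCDE", 2))
print(pairs[:3])
```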
Out of the class of n candidate designs, only a few would be marketed, say r < n. This may be the case because of manufacturing constraints or marketing budget constraints that allow only a few products to be brought to market. Which combination of r products has the largest reach?
The TURF reach calculation proceeds by creating an indicator matrix that is c × n. The rows are the combinations and the columns are the products. For our example of r = 2, there are 10 rows for the 10 combinations and five columns for the five products. The cell values are either 0 or 1: 1 if the product is in a combination and 0 otherwise. A second m × n indicator matrix identifies the preference for each of the n products for each of the m customers surveyed.
The two indicator matrices are matrix multiplied: the m × n preference matrix is multiplied by the transpose of the c × n combination matrix. The result is an m × c matrix with cell values representing the number of products purchased in a combination. In our example of r = 2, the number of products in a combination could only be 0, 1, or 2.
Reach is formally defined as the proportion of customers purchasing at least one item in a combination. This is implemented by dummifying the m × c matrix of frequencies. If a frequency is greater than or equal to 1, the cell is recoded as 1; otherwise it is recoded as 0. Since the mean of 0/1 values is a proportion, the mean for each column in the recoded indicator matrix is the proportion of customers reached by the respective combination. This is the reach part of the TURF acronym. The sum of each column of the original frequency matrix is the frequency part in the TURF acronym. The Appendix to this chapter shows these calculations in matrix notation.
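The steps above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the JMP script used for this chapter: the function name `turf`, the six-customer preference matrix, and the product labels are all hypothetical.

```python
from itertools import combinations

import numpy as np


def turf(X, max_size, names):
    """Reach and frequency for every combination of 1..max_size products.

    X is the m x n 0/1 preference matrix (customers by products).
    """
    n = X.shape[1]
    combos = [s for r in range(1, max_size + 1) for s in combinations(range(n), r)]
    C = np.zeros((len(combos), n), dtype=int)  # c x n combination indicators
    for row, members in enumerate(combos):
        C[row, list(members)] = 1
    F = X @ C.T                   # m x c: products liked within each combination
    R = (F >= 1).astype(int)      # reached if at least one product is liked
    labels = ["+".join(names[j] for j in members) for members in combos]
    return labels, R.mean(axis=0), F.sum(axis=0)


# Hypothetical dummy-coded preferences: six customers, three products.
X = np.array([
    [1, 0, 0],
    [1, 1, 0],
    [0, 0, 1],
    [0, 0, 0],
    [0, 1, 1],
    [1, 0, 0],
])

labels, reach, frequency = turf(X, max_size=2, names=["A", "B", "C"])
for label, r, f in zip(labels, reach, frequency):
    print(f"{label:>4}: reach = {r:.3f}, frequency = {f}")
best = labels[int(np.argmax(reach))]
print("largest reach:", best)
```

In this made-up data the pair A+C reaches five of the six customers, more than any single product or any other pair, even though product B is liked as often as product C.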
As an example, consider a shampoo manufacturer with five new forms of shampoo for deeper and richer cleaning. All five cannot be marketed; just two. But which two? A sample of 275 consumers was recruited and asked to try the shampoos in home tests for five weeks, one shampoo per week. At the end of each week, they were asked to rate their likelihood to buy the tested shampoo on a 1-9 Likert Scale: 1 = Not at all Likely and 9 = Very Likely. The resulting scores were converted into the top three box using the encoding definition
$$\text{T3B} = \begin{cases} 1 & \text{if rating} \geq 7 \\ 0 & \text{otherwise.} \end{cases}$$
All combinations of the five shampoos taken one-at-a-time and two-at-a-time were determined. The one-at-a-time combinations are just simple proportions. There are a total of 15 combinations. The results of the TURF calculations are shown in Figure 4.5.
Some analysts extend the TURF framework to include the marginal contribution of each item to each set the item belongs to. That is, which item contributes the most to the set’s reach, followed by the second most contributory item, and so on. The objective is to identify which item in the set should be marketed first. An example report is shown in Figure 4.6.
FIGURE 4.5 TURF shampoo example for five shampoo products simply labeled A through E. The number of combinations was based on r = 1 and r = 2, giving 15 combinations. Of the 275 customers in the study, 119 would buy A, 92 would buy C, and 151 would buy at least one of the two shampoos, giving a reach of 0.549. So 54.9% of the market would buy at least one of A or C.
FIGURE 4.6 This illustrates the marginal contribution of the two products, A and C, in the top reach bundle of Figure 4.5. Shampoo A contributes the most to the combination, having a marginal reach of 0.43273, which happens to be its stand-alone reach proportion. Shampoo C contributes 0.11636 to the combination, resulting in a total reach of 0.54909 as in Figure 4.5.
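One plausible way to compute such marginal contributions is to add the items of a bundle one at a time, each time choosing the item that reaches the most customers not yet reached. The sketch below assumes that definition, which matches the pattern in Figure 4.6 (the first item's marginal reach equals its stand-alone reach), but it is not necessarily the calculation behind that figure. The function name and the data are hypothetical.

```python
import numpy as np


def marginal_reach(X, bundle):
    """Order a bundle's items by the additional reach each one contributes."""
    remaining = list(bundle)
    reached = np.zeros(X.shape[0], dtype=bool)
    steps = []
    while remaining:
        # Share of all customers newly reached by each remaining item.
        gains = [float(np.mean(~reached & (X[:, j] == 1))) for j in remaining]
        k = int(np.argmax(gains))
        item = remaining.pop(k)
        reached |= X[:, item] == 1
        steps.append((item, gains[k]))
    return steps


# Hypothetical dummy-coded preferences: six customers, three products (0=A, 1=B, 2=C).
X = np.array([
    [1, 0, 0],
    [1, 1, 0],
    [0, 0, 1],
    [0, 0, 0],
    [0, 1, 1],
    [1, 0, 0],
])

steps = marginal_reach(X, bundle=(0, 2))
total = sum(gain for _, gain in steps)
for item, gain in steps:
    print(f"product {item}: marginal reach = {gain:.5f}")
print(f"total reach = {total:.5f}")
```

The marginal contributions sum to the bundle's reach. The shampoo figures show the same arithmetic: 119/275 = 0.43273 for A, 151/275 = 0.54909 for the pair, and the difference, 32/275 = 0.11636, is the contribution of C.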
Stated preference discrete choice experiments can be designed using JMP from the SAS Institute Inc. This software has a powerful platform for choice designs as well as a good platform for estimating choice models. Nlogit, an extension of the econometric software Limdep, is the gold standard in choice modeling. This package has all the latest developments in the choice analysis area as it should since its developer is a leading researcher in choice analysis.8 Stata will also handle choice estimation. R has packages for estimation but they are a challenge to use.
Latent class regression modeling can be done using Latent Class Gold by Statistical Innovations, Inc. Decision trees can be grown using JMP, SAS, Python, and R.
I recommend JMP for its simple interface. Cluster analysis, both hierarchical and k-means, can be handled by almost all software packages. JMP has a good interface and is my typical choice.
The TURF calculations in this chapter were done using a JMP script written just for TURF analysis.
In this chapter, I described methods for testing a product design just prior to launch. Some of the testing relied on discrete choice analysis, a methodology that has become very popular in the market research space. This is a useful framework for testing products with customers to determine demand and final attribute setting before the product is released to the market. The advantage of this approach is the fact that you do not actually have to market the product to determine its demand. You can set up mock situations and use prototypes and achieve almost the same effects. Another advantage is that competitive products can be included in a study so a competitive effect can be derived. A disadvantage is that an experimental design is needed. Conjoint analysis, which I described in the previous chapter, also requires an experimental design, but the discrete choice design is more complicated. Special skill-sets are required for its implementation.
I also described the use of clinics for assessing demand. A discrete choice study could be included in a clinic, and usually is.
Finally, I gave an overview of price segmentation and TURF analysis to further your understanding of product testing at this stage of development.
This Appendix contains some detail on the TURF calculation.
Let X be an appropriately dummy-coded matrix of customer preferences for n products. If there are m customers, then X is m × n. An example of X is shown in Figure 4.7 for seven customers.
FIGURE 4.7 This illustrates the TURF calculations for responses by seven customers to five products. The notation corresponds to that in this Appendix.
Let C be an appropriately dummy-coded matrix of all combinations of 1 ≤ r < n products. You could have r = n, but this would not be practical if n is large. If c is the number of combinations, then C is c × n. It is not unusual for r to vary over r = 1, 2, .... For example, you could have combination sizes of r = 1 for each product independent of the others, r = 2 for pairs of products, r = 3 for 3-tuples of products, and so forth. The matrix C would contain dummies for all the needed possibilities. If r = 1 and r = 2 and n = 5, then C is 15 × 5. Obviously, for r = 1, an n × n identity matrix, I, is created, and C is the indicator matrix for the larger combination sizes augmented by I. An example of C for n = 5 is shown in Figure 4.7.
Frequencies are calculated as
$$F = XC^{\top}$$
where $C^{\top}$ is the transpose of C, so that F is m × c.
See Figure 4.7 for an example.
To calculate the reach for each combination, F must be transformed into a matrix of 0 and 1 values based on a threshold defining the reach. The threshold is usually 1: a customer is reached if he/she buys at least one of the products in the bundle. Larger thresholds are certainly possible. For example, referring to the ice cream example I mentioned earlier, people generally buy several flavors at once when they shop at grocery stores, especially if they have young children with differing tastes and preferences. A threshold might be set at, say, three ice creams in this case. Let $\mathbb{1}(\cdot)$ be the matrix indicator function such that
$$R_{ij} = \mathbb{1}(F_{ij}) = \begin{cases} 1 & \text{if } F_{ij} \geq 1 \\ 0 & \text{otherwise} \end{cases}$$
where $F_{ij}$ is the value in row i, column j of F, and the 1 in the condition is replaced by the chosen threshold if a larger one is used. The resulting matrix is the reach matrix, R. See Figure 4.7.
The reach for each combination is the vector of column means of R and the total frequency for each combination is the vector of column sums of F. See Figure 4.7.
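The effect of the threshold can be seen in a short NumPy sketch of the matrix calculations in this Appendix. The matrices below are invented (they are not those of Figure 4.7), and the function name and its `threshold` argument are my own labels.

```python
import numpy as np


def reach_and_frequency(X, C, threshold=1):
    """Column means of R and column sums of F for preference matrix X and combinations C."""
    F = X @ C.T                          # m x c frequency matrix
    R = (F >= threshold).astype(int)     # reach matrix for the given threshold
    return R.mean(axis=0), F.sum(axis=0)


# Hypothetical data: four customers, four ice cream flavors.
X = np.array([
    [1, 1, 1, 0],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 1, 1, 0],
])
# Two candidate bundles of three flavors each.
C = np.array([
    [1, 1, 1, 0],
    [0, 1, 1, 1],
])

reach_1, freq = reach_and_frequency(X, C, threshold=1)
reach_3, _ = reach_and_frequency(X, C, threshold=3)
print("reach with threshold 1:", reach_1)
print("reach with threshold 3:", reach_3)
print("frequency:", freq)
```

With a threshold of 1 both bundles reach every customer, so they cannot be ranked; with a threshold of 3 the first bundle reaches half of the customers and the second only a quarter. The frequencies do not depend on the threshold.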
- 1 See “AT&T Test Kitchen” in CIO Magazine (May 15, 1994) for a description of the Consumer Lab.
- 2 See the Decision Analysts white paper “Car Clinics (The Head-to-Head Contest)” at www.decisionanalyst.com/whitepapers/carclinics/. Last accessed April 22, 2019.
- 3 The Decision Analysts white paper “Car Clinics (The Head-to-Head Contest)” lists bulldozers, construction cranes, lawn mowers, chain saws, vacuum cleaners, refrigerators, and washing machines as examples.
- 4 Food and beverage taste testing is in the general area of sensory evaluation. See O’Mahony  and Meullenet et al. .
- 5 For cars, use could involve a test drive. For a household durable product such as a washing machine, it could involve actually doing a small laundry. For a food or beverage, it could involve eating or drinking the product(s).
- 6 See the Decision Analysts white paper “Car Clinics (The Head-to-Head Contest)” for a brief discussion of possibilities.
- 7 I was once the chief forecaster at a utility company that divided its market into these three groups. There was no reason for it other than this was what made sense.
- 8 The developer is William H. Greene of Econometric Software, Inc.