# Predictive modeling

Now that you have your data and have examined them visually, you can turn your attention to predictive modeling to address the product manager’s business problems. Recall that she wants to know the price elasticity for living room blinds now that actual market data are available, but she also wants to know the effect of changing the price level. The latter is a more complicated problem. See Paczkowski [2018] for a thorough discussion regarding price modeling.

FIGURE 7.12 This chart shows the relationship between the log of net unit sales and the log of the pocket price using a hex plot. Histograms are shown on the two margins of the graph to emphasize the distributions. You can see that the distributions are almost normal.

Every process has steps that should be followed. Predicting is no different. The steps for predictive modeling are:

1. split the data into training and testing data sets;
2. train a model with the training data set; and
3. test the trained model with the testing data set.

These steps hold even for forecasting the future, but that is not our focus here. Forecasting has its own complex issues, but the framework is the same. I will discuss these three general steps in the following subsections for predicting. Forecasting methods were already discussed in Chapter 6.

FIGURE 7.13 This chart shows the distributions of total discounts by the four marketing regions.

TABLE 7.6 These are the test results for differences in mean total discounts by marketing regions. The difference in the means is Group 2 - Group 1. The columns labeled “Lower” and “Upper” are the confidence limits around the mean difference. Notice that the Null Hypothesis is rejected for all combinations of the Southern Region and the other three (the last row of the table should be reversed to be consistent with the other Southern comparisons). Also notice that the difference in the means for the Southern Region is negative in all cases.

| Group 1 | Group 2 | Mean Difference | Lower | Upper | Reject Null? |
|---|---|---|---|---|---|
| Midwest | Northeast | -0.0038 | -0.0053 | -0.0024 | True |
| Midwest | South | -0.0549 | -0.0561 | -0.0537 | True |
| Midwest | West | -0.0003 | -0.0014 | 0.0008 | False |
| Northeast | South | -0.0511 | -0.0526 | -0.0496 | True |
| Northeast | West | 0.0035 | 0.0021 | 0.0049 | True |
| South | West | 0.0546 | 0.0535 | 0.0558 | True |

## Training and testing data sets

It is best practice to divide a data set into two parts for prediction purposes:

1. a training data set; and
2. a testing data set.

TABLE 7.7 These are the summary statistics to help interpret Table 7.6.

| Region | Count | Mean | Std. Dev. | Min | Q1 | Median | Q3 | Max |
|---|---|---|---|---|---|---|---|---|
| Midwest | 19565 | 0.294990 | 0.040023 | 0.023 | 0.267 | 0.295 | 0.323 | 0.414 |
| Northeast | 8704 | 0.291163 | 0.056135 | 0.139 | 0.249 | 0.293 | 0.334 | 0.431 |
| South | 15831 | 0.240060 | 0.028707 | 0.149 | 0.220 | 0.240 | 0.260 | 0.325 |
| West | 26170 | 0.294682 | 0.048876 | 0.158 | 0.258 | 0.294 | 0.331 | 0.431 |

FIGURE 7.14 This chart shows the distributions of the components of the total discount for the Southern Region.

The former is used for model estimation while the latter is used for model testing. When a model is estimated, we say that it learns what the estimates should be from the data we give the estimation procedure. In essence, the model is trained by the data. More precisely, the model is trained by a dependent variable in the sense that this variable guides how the parameters are estimated. It supervises the training, so this is sometimes called supervised learning. There is, of course, unsupervised learning, but that is another issue.

Once a model is trained, it has to perform; that is, predict. You need to check those predictions, but you cannot do that with the same data used in the training since the model has already seen those data. This is where the testing data come in. This data set is a new one the model has not seen. If the model is good, it should be able to predict the testing data, albeit with some random variation or noise. A rough rule-of-thumb is to split your data into 3/4 training and 1/4 testing. Another is 2/3 training and 1/3 testing. It all depends, of course, on the size of your original data. Basically, you need a larger portion in the training data set because learning is more challenging and data intensive.

FIGURE 7.15 This chart shows the trend of the mean monthly dealer discount for the Southern Region. Notice that several months are missing discounts and that the last few months indicate an upward trend.

FIGURE 7.16 This boxplot illustrates how anomalies, or outliers, are revealed by a boxplot. The data for this graph are simulated.

If time is not a characteristic of the data, then simple random sampling could be used to divide a data set into training and testing parts. However, if time is involved, then a split cannot be based on random sampling as I noted in Chapter 6 and the method I described there must be used. If the data are a panel data set consisting of both cross-sectional and time series dimensions, then it is best to randomly sample on the cross-sectional dimension and preserve the time series for those units.
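As a rough illustration of the two sampling-based splits just described, the following sketch uses only the Python standard library. The function names `random_split` and `panel_split`, the 3/4 default, and the dealer panel are all hypothetical and are not taken from the Case Study; the time-series split from Chapter 6 is not reproduced here.

```python
import random

def random_split(records, train_fraction=0.75, seed=42):
    """Simple random split for data with no time dimension."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

def panel_split(records, unit_key, train_fraction=0.75, seed=42):
    """Panel split: sample whole cross-sectional units so that each
    unit's time series stays intact in exactly one part."""
    units = sorted({row[unit_key] for row in records})
    train_units, _ = random_split(units, train_fraction, seed)
    train_units = set(train_units)
    train = [row for row in records if row[unit_key] in train_units]
    test = [row for row in records if row[unit_key] not in train_units]
    return train, test

# Hypothetical panel: 8 dealers, each observed for 6 months.
panel = [{"dealer": d, "month": m, "units": 10 * d + m}
         for d in range(1, 9) for m in range(1, 7)]

train, test = random_split(panel)
print(len(train), len(test))  # 36 12

p_train, p_test = panel_split(panel, unit_key="dealer")
print(sorted({row["dealer"] for row in p_test}))  # dealers held out for testing
```

In the panel version no dealer appears in both parts, so every retained time series is complete.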

## Training a model

Let me now discuss training a model with the training data set. The framework you select for predictive modeling depends on the nature of the dependent variable, the variable that guides the training. There are several cases or possibilities.8 For one case, the dependent variable is continuous; this is just OLS. For another case, the dependent variable is binary; this is where logistic regression is used. Both are members of a regression family, which is large. There is actually a third case that can handle either a continuous or discrete dependent variable but it does so by fitting constants to the data. This is also a member of the regression family. In a sense, these are all cousins. Regardless of the specific family member, a regression model fits:

• straight lines: this is OLS;
• curves: this is logistic regression9; and
• constants: this is decision trees.

These three cases are discussed in the following subsections.

## Case I: Training a model with a continuous dependent variable

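Technically, this is the model for an OLS regression. A model is

$$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_p X_{ip} + \epsilon_i$$

where:

$Y_i$ is the dependent variable for the $i^{th}$ observation, $i = 1, 2, \ldots, n$;

$\beta_0$ is the intercept;

$\beta_j$, $j = 1, 2, \ldots, p$, are the slopes;

$X_{ij}$, $j = 1, 2, \ldots, p$, are the independent variables; and

$\epsilon_i$ is the random disturbance: $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$.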

$Y_i$ is continuous and our goal is to estimate the parameters or slopes to say something about $Y_i$ which could, for example, be net unit sales. There should be an intercept, $\beta_0$, but I usually do not care about this since it just places the line or plane or hypersurface. The slopes, however, are important since they show the effects of the independent variables. These slopes are intimately related to the dependent variable which is why I refer to the model as being trained by the dependent variable.

Once the model is trained, you have to examine it statistically. This means you have to test the effects. You use an F-test for this purpose. You also have to check the relationship among the independent variables for multicollinearity, using the variance inflation factors (VIF). See Gujarati [2003] and Paczkowski [2018] for a discussion of VIF. There are other checks you need to do, but these will suffice here. In general, a model is trained following four steps:10

1. define a formula (i.e., the specific model to estimate);
2. instantiate the model (i.e., specify it);
3. fit the model; and
4. summarize the fitted model.

For the Case Study, a model for unit sales is

$$U_i = e^{\beta_0} \, P_i^{\beta_1} \, e^{\beta_2 \text{Midwest}_i + \beta_3 \text{Northeast}_i + \beta_4 \text{South}_i} \, e^{\epsilon_i}$$

where $U_i$ is net unit sales, $P_i$ is the pocket price, and the focus is on the pocket price and marketing regions. Notice only three regions are specified. Since Region is categorical, you must create dummy variables for it. There are four marketing regions corresponding to the four U.S. Census Regions so only three dummies should be created. The Western region is omitted to avoid the dummy variable trap. This region is the last in alphanumeric order and was arbitrarily selected as the base. Although omitted in the model specification, it is still present as the intercept. The intercept is the Western region and the estimated parameters for the other three regions show differences from the Western region which is the base. See Paczkowski [2018] and Gujarati [2003] for discussions about dummy variables, their interpretation, and the dummy variable trap.
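The dummy coding with the Western region as the base can be sketched in a few lines. This is only an illustration; the function name `dummy_code` is hypothetical, and statistical software normally creates these indicators for you.

```python
REGIONS = ["Midwest", "Northeast", "South", "West"]

def dummy_code(region, levels=REGIONS, base="West"):
    """Return 0/1 indicators for every level except the base level."""
    if region not in levels:
        raise ValueError(f"unknown region: {region!r}")
    return {level: int(region == level) for level in levels if level != base}

print(dummy_code("South"))  # {'Midwest': 0, 'Northeast': 0, 'South': 1}
print(dummy_code("West"))   # {'Midwest': 0, 'Northeast': 0, 'South': 0}
```

A Western observation has all three indicators equal to zero, which is why its effect is absorbed by the intercept.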

This model is an inherently linear model meaning it can be linearized by taking the natural log of both sides. The earlier histogram analysis showed that net sales are highly right-skewed but that the distribution for the log transformation is more symmetrical. This is why it is preferred. In addition to being inherently linear, this model is also a constant-elasticity (or isoelastic) model. The price elasticity is the estimated parameter, $\beta_1$. See Paczkowski [2018] for a discussion of this type of model and elasticities. Also see the Appendix to this chapter for the constant elasticity demonstration.

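Applying the natural log to both sides of the model yields

$$\ln(U_i) = \beta_0 + \beta_1 \ln(P_i) + \beta_2 \text{Midwest}_i + \beta_3 \text{Northeast}_i + \beta_4 \text{South}_i + \epsilon_i \tag{7.1}$$

where $U_i$ is net unit sales and $P_i$ is the pocket price.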

Once a model is specified, it must be instantiated, which simply means the estimation methodology must be specified for the model just stated. Since an OLS model is considered in this section, instantiation means that OLS must be specified. In addition to specifying the methodology, the data used with that model are also specified. This will be the training data set.

FIGURE 7.17 Regression summary for the linearized model, (7.1).

Next, the model is fit; that is, the unknown model parameters are estimated using the training data set. In some software, once the model is fit, a summary report is displayed which is the last step; in other software products, you have to call a summary function. Regardless, a summary of the fit is displayed.
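The four training steps can be sketched end to end without any statistics package. The code below is a minimal illustration under loud assumptions: the data are simulated (they are not the Case Study data), the true elasticity is set to -1.5 only to echo the text, the region effects are invented, and the helper names `solve` and `ols` are hypothetical. Real work would use a statistics library, which also reports standard errors and tests.

```python
import math
import random

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def ols(X, y):
    """OLS estimates from the normal equations (X'X) b = X'y."""
    k = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    return solve(xtx, xty)

# Simulated training data; every number here is an assumption.
rng = random.Random(0)
TRUE_REGION_EFFECT = {"Midwest": 0.10, "Northeast": 0.05, "South": -0.20, "West": 0.0}
DUMMIES = ("Midwest", "Northeast", "South")  # West is the base

# Step 1: the formula is log(units) ~ log(price) + Midwest + Northeast + South.
names = ["Intercept", "log(price)"] + list(DUMMIES)
X, y = [], []
for _ in range(2000):
    region = rng.choice(sorted(TRUE_REGION_EFFECT))
    log_price = math.log(rng.uniform(50.0, 150.0))
    X.append([1.0, log_price] + [float(region == d) for d in DUMMIES])
    y.append(8.0 - 1.5 * log_price + TRUE_REGION_EFFECT[region] + rng.gauss(0.0, 0.3))

# Steps 2 and 3: choose OLS as the estimator and fit it to the training data.
beta = ols(X, y)

# Step 4: summarize the fit (coefficients only in this sketch).
for name, estimate in zip(names, beta):
    print(f"{name:>12s}: {estimate: .3f}")
```

The coefficient on `log(price)` is the elasticity estimate; with simulated data it should land close to the -1.5 used to generate them.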

Estimation results for the Case Study are shown in Figure 7.17. The estimated price parameter is -1.5 which indicates that the demand for the new product is highly elastic. This should be expected since window treatments is a highly competitive business. In addition to blinds, there are also shades and drapes. In addition, this new product is a high-tech product that is voice activated which means that its appeal in the market may be hindered by the low-tech (i.e., old-fashioned) blinds currently in the market. If sales are suffering as the product manager’s dashboard and business intelligence systems indicate, lowering the price may be necessary. This elasticity helps justify this strategy.

FIGURE 7.18 Regression effects summary shows that the regions are significant. The price and region effects tests F-ratios are calculated as the respective sum of squares (SS) divided by the error SS in the ANOVA table. The SS are the difference in the ANOVA model SS when the effect is omitted. So the Region SS is the difference between the ANOVA model SS including and excluding the region variable. The Effects Summary report plots the p-values from the Effects Test report. These p-values are scaled using the negative of the log to the base 10 of the p-value. This transformation is called a log-worth which is $-\log_{10}(p\text{-value})$. See Paczkowski [2016] for a discussion of the log-worth and its applications.

Why lower the price rather than raise it? I show in the Appendix to this chapter and in Paczkowski [2018] that the total revenue elasticity with respect to a price change is $\eta^{TR}_{P} = 1 + \eta^{Q}_{P}$ where $\eta^{Q}_{P}$ is the price elasticity for unit sales with respect to the price. For this problem, $\eta^{Q}_{P} = -1.5$ so $\eta^{TR}_{P} = -0.5$. If price is lowered 1%, then revenue rises 0.5%. Clearly a price increase results in revenue declining.
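For readers who want the arithmetic behind the revenue elasticity, here is a short sketch of the standard derivation. The symbols $TR$, $P$, and $Q$ (total revenue, price, and unit sales) are notation chosen for this illustration; the full demonstration is in the Appendix to this chapter.

```latex
\begin{align*}
TR &= P \cdot Q(P) \\
\ln TR &= \ln P + \ln Q(P) \\
\eta^{TR}_{P} = \frac{d \ln TR}{d \ln P} &= 1 + \frac{d \ln Q}{d \ln P} = 1 + \eta^{Q}_{P} \\
\eta^{TR}_{P} &= 1 + (-1.5) = -0.5 \\
\%\Delta TR &\approx \eta^{TR}_{P} \times \%\Delta P = (-0.5) \times (-1\%) = +0.5\%
\end{align*}
```

The last line is a first-order approximation, so it is most reliable for small price changes.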

The regression results also indicate that regions vary in significance. An effects test (an F-Test) shows that regions as a whole are significant. The test results are shown in Figure 7.18.

The analysis could be taken a step further by estimating the interaction between price and region. There may be a regional price elasticity differential that could lead to an enhanced price discrimination strategy. The result of interacting price and region is shown in Figure 7.19. The effects are summarized in Figure 7.20. Notice that the interactions are insignificant.

## Testing a model

Once a model is trained, you should test it against the pristine testing data set. Recall that this is a data set the model did not see during its training. The estimated, trained coefficients could be applied to the testing data to predict values for the dependent variable. For the Case Study, first recognize that net unit sales are in natural log terms. You will convert back to net unit sales in “normal” terms by exponentiating the log term. The same holds for the price variable. Once this is done, an “r2_score” can be calculated to check the fit of actual vs. predicted values.

FIGURE 7.19 Regression summary of price and region interaction. An adjustment is made to the interaction term to avoid potential bias. This output was produced using JMP. See the JMP documentation for the adjustment.

The r2_score function “computes $R^2$, the coefficient of determination. It provides a measure of how well future samples are likely to be predicted by the model. The best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get a $R^2$ score of 0.0.”11 For the Case Study, the score is 0.138 which is not very good.
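To make the score concrete, the following sketch computes the coefficient of determination directly from its definition, 1 minus the residual sum of squares over the total sum of squares, after exponentiating log-scale predictions. The function name `r_squared` and the four data points are hypothetical; they are not the Case Study values.

```python
import math

def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

# Hypothetical testing data: the model predicts log units, so exponentiate first.
actual_units = [10.0, 20.0, 30.0, 40.0]
log_predictions = [math.log(v) for v in (12.0, 18.0, 33.0, 41.0)]
predicted_units = [math.exp(v) for v in log_predictions]

print(round(r_squared(actual_units, predicted_units), 3))  # 0.964

# A constant prediction equal to the mean of the actuals scores 0.0.
print(r_squared(actual_units, [25.0] * 4))  # 0.0
```

Predictions that are worse than the constant mean produce a negative score, which is the case the quoted documentation describes.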

You can graph the actual vs predicted values. Sometimes, however, the number of data points is too large to plot so a random sample may be needed. This is our situation for the Case Study. The data visualization methods described earlier could be used for this problem.

FIGURE 7.20 Regression effects summary for the price-region interaction.

TABLE 7.8 This is a tally of those who purchased each of the three window treatments. There were 372 (= 179 + 193) customers who did not purchase any of the three products but are in the data set because they purchased some other product in the company’s product line.

| Blinds | Drapes: No | Drapes: Yes | Drapes: Total | Shades: No | Shades: Yes | Shades: Total |
|---|---|---|---|---|---|---|
| No | 179 | 178 | 357 | 193 | 164 | 357 |
| Yes | 161 | 618 | 779 | 155 | 624 | 779 |
| Total | 340 | 796 | 1136 | 348 | 788 | 1136 |

You can also predict unit sales for different settings of the pocket price variable. This is scenario or what-if analysis. Since the pocket price is a function of the discounts given by the sales force, you might want to have separate equations for pocket price with different discounts. Recall that pocket price is the list price less discounts. A simple program could be written to use discounts as inputs and return pocket prices which are then used as inputs to the estimated model.
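A minimal sketch of such a program follows. It assumes, as a simplification, that each discount is a fraction of list price and that the fractions add up; the list price of 100 and the individual discount values are hypothetical. Only the -1.5 elasticity comes from the text. Because the model is constant-elasticity, the ratio of predicted sales between two prices depends only on the elasticity, so no intercept or region coefficient is needed.

```python
PRICE_ELASTICITY = -1.5  # estimated price parameter reported in the text

def pocket_price(list_price, discounts):
    """List price less discounts, each discount given as a fraction of list price."""
    total_discount = sum(discounts)
    if not 0.0 <= total_discount < 1.0:
        raise ValueError("total discount must be in [0, 1)")
    return list_price * (1.0 - total_discount)

def sales_ratio(new_price, old_price, elasticity=PRICE_ELASTICITY):
    """Predicted unit sales at new_price relative to old_price
    under a constant-elasticity demand model."""
    return (new_price / old_price) ** elasticity

# Hypothetical scenario: add 5 points of discount to a baseline discount structure.
baseline = pocket_price(100.0, [0.10, 0.08, 0.06, 0.05])
scenario = pocket_price(100.0, [0.10, 0.08, 0.06, 0.05, 0.05])

ratio = sales_ratio(scenario, baseline)
print(f"pocket price {baseline:.2f} -> {scenario:.2f}")
print(f"predicted unit sales change: {100 * (ratio - 1):+.1f}%")
```

Running several discount structures through the same two functions gives a simple what-if table.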

## Case II: Training a model with a discrete dependent variable

For the Case Study, the product manager wants to know the determinants or key drivers for a customer buying or not buying the new window blinds product. She is particularly interested in knowing the likelihood of buying blinds if window drapes and/or shades are purchased. To address her questions, you create a data table that has for each unique customer ID (CID) the total number of blinds, drapes, and shades purchased by each customer since the introduction of the new blinds. There were 1,136 unique customer IDs. An indicator variable was created for each product that was 1 if the total order was greater than zero and 0 otherwise. A tally is shown in Table 7.8.
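The construction of the customer-level table can be sketched as follows. The order lines, the tuple layout `(cid, product, quantity)`, and the function name `purchase_indicators` are hypothetical; the actual order data are not shown in the text.

```python
from collections import defaultdict

PRODUCTS = ("blinds", "drapes", "shades")

def purchase_indicators(orders):
    """Aggregate order lines to one row per customer ID with 0/1 purchase flags."""
    totals = defaultdict(lambda: dict.fromkeys(PRODUCTS, 0))
    for cid, product, quantity in orders:
        row = totals[cid]  # creates the customer row if it is missing
        if product in PRODUCTS:
            row[product] += quantity
    # Indicator is 1 if the total ordered is greater than zero and 0 otherwise.
    return {cid: {p: int(q > 0) for p, q in row.items()}
            for cid, row in totals.items()}

# Hypothetical order lines: (customer ID, product, quantity).
orders = [
    (101, "blinds", 2),
    (101, "drapes", 1),
    (102, "shades", 3),
    (103, "rugs", 1),   # bought another product only; kept with all zeros
]

for cid, flags in purchase_indicators(orders).items():
    print(cid, flags)
```

Customer 103 illustrates the group noted in Table 7.8 who appear in the data without buying any of the three window treatments.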

The data were split into training and testing data sets. The training set had 761 records and the testing data set had 375; the total for the two is the original 1,136.

A logit model was specified for the blind purchases. The model is for the probability that any customer either buys or does not buy the new living room blind as a function of several factors:

• whether or not drapes were ordered (0/1 coding);
• whether or not shades were ordered (0/1 coding);
• Marketing Region (dummy coded);
• Loyalty Program membership (0/1 coding);
• Buyer Rating (dummy coded);
• Customer Satisfaction Rating (T2B coded);
• population size of store location;
• median household income of store location;
• number of housing units of store location;
• number of occupied housing units of store location;
• ratio of occupied to total housing units; and
• credit rating (T2B).

Since there is a variable that measures the ratio of occupied to total housing units, the two separate housing-unit counts were dropped and only this ratio was used. This means 10 of the 12 variables listed were used to fit a model.

A logit regression model was fit to the training data with the blinds as the dependent variable. Estimation results are shown in Figure 7.21 and Figure 7.22. The odds ratios for drapes and shades, the two significant variables, are shown in Figure 7.23.12
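As a rough cross-check on the direction of these effects, crude odds ratios can be computed directly from the counts in Table 7.8. These are unadjusted, full-sample ratios, so they are not the values in Figure 7.23, which come from the fitted model on the training data with the other variables included. The function name `odds_ratio` is hypothetical.

```python
def odds_ratio(table):
    """Odds ratio for a 2x2 table [[a, b], [c, d]], computed as (a * d) / (b * c)."""
    (a, b), (c, d) = table
    return (a * d) / (b * c)

# Rows: blinds No / Yes. Columns: other product No / Yes. Counts from Table 7.8.
blinds_by_drapes = [[179, 178], [161, 618]]
blinds_by_shades = [[193, 164], [155, 624]]

print(round(odds_ratio(blinds_by_drapes), 2))  # 3.86
print(round(odds_ratio(blinds_by_shades), 2))  # 4.74
```

Both ratios are well above 1, meaning the odds of buying blinds are several times higher among customers who also bought drapes or shades.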

## Case III: Training a model with constants

Now let me consider what happens when you have constants to model. This is where decision trees come in. So let me discuss this class of models which are also in the regression family. See Beck [2008] for a discussion of decision trees as a regression family member.

Decision trees are used when the dependent variable is continuous or discrete; they are versatile. There are also independent variables as with OLS and logistic regression. These variables, however, are used to “cut” the dependent variable into groups by fitting constants to the dependent variable. This cutting up of the dependent variable amounts to creating smaller and smaller groups of the dependent variable and therefore the sample. The dependent variable is thus partitioned so the trees are sometimes called partition trees. Trees are graphed upside down with the root at the top and branches flowing down. There are “leaves” and it is these leaves that show or summarize the partitioning of the dependent variable.
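The idea of fitting constants can be illustrated with a single split of a continuous dependent variable. The sketch below searches for the cut point on one independent variable that minimizes the total squared error when each side is fitted with its own mean. It uses a squared-error criterion as an assumption for illustration; software packages may use other criteria. The data and the function name `best_split` are hypothetical.

```python
def best_split(x, y):
    """Find the cut on x that minimizes total squared error when each
    side of the cut is fitted with a constant (its mean)."""
    def sse(values):
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values)

    best = None
    for cut in sorted(set(x))[1:]:  # skip the minimum so both sides are non-empty
        left = [yi for xi, yi in zip(x, y) if xi < cut]
        right = [yi for xi, yi in zip(x, y) if xi >= cut]
        total = sse(left) + sse(right)
        if best is None or total < best[1]:
            best = (cut, total, sum(left) / len(left), sum(right) / len(right))
    return best

# Hypothetical data with an obvious jump between x = 3 and x = 4.
x = [1, 2, 3, 4, 5, 6]
y = [5.0, 6.0, 5.5, 20.0, 21.0, 19.0]

cut, total_sse, left_mean, right_mean = best_split(x, y)
print(cut, left_mean, right_mean)  # 4 5.5 20.0
```

A full tree repeats this search inside each resulting group, which produces the smaller and smaller partitions described above.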

FIGURE 7.21 This shows the logit regression model fit statistics for the training data.

For terminology, when the dependent variable is discrete with classes or groups the tree is called a classification tree; otherwise, it is a regression tree. The contents of the leaves, also called nodes, vary depending on the software. The leaves generally look like the one displayed in Figure 7.24. In this example, the dependent variable is buy or not buy the new blind, which is discrete, so the tree is a classification tree. The purchase of a shade is a key driver. The node has a sample size of 534 customers. The $G^2$ value is the Likelihood-ratio Chi-square value discussed in Chapter 5 and the log-worth is the transformed p-value for this chi-square. Recall that the logistic model estimates a probability. So does the classification tree. The predicted probabilities are 0.223 for Not Buy (the 0 value) and 0.777 for Buy (the 1 value). Since the predicted probability is larger for Buy, the predicted class for all those in this node is Buy.
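The two node quantities just described are simple to compute. In this sketch the node probabilities are the ones quoted above, while the p-value passed to `logworth` is a made-up illustration since the node's actual p-value is not given in the text. Both function names are hypothetical.

```python
import math

def predicted_class(probabilities):
    """Return the class label with the largest predicted probability."""
    return max(probabilities, key=probabilities.get)

def logworth(p_value):
    """Log-worth: the negative of the base-10 log of a p-value."""
    return -math.log10(p_value)

node = {"Not Buy": 0.223, "Buy": 0.777}
print(predicted_class(node))          # Buy

# Hypothetical p-value: smaller p-values give larger log-worth values.
print(round(logworth(0.001), 6))      # 3.0
```

A log-worth above 2 corresponds to a p-value below 0.01, which is why the transformation is convenient for ranking candidate splits.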

See Paczkowski [2016] for a discussion of decision trees using the JMP software and how to interpret the output.

FIGURE 7.22 This shows the logit regression model fit estimates for the training data.

# New product forecast error analysis

If a sales forecast was produced prior to launch, then it should be monitored in conjunction with sales data. The problem for forecast tracking is how the sales forecast was produced. If a judgement, naive, or constant mean method was used, then forecast tracking is most likely pointless: the forecast will be off simply because poor methods were used. Recall that a forecast has to be developed for planning purposes and this is probably as far as it should go when these methods are used. If analog product histories were combined either through simple averaging (e.g., “cluster-then-estimate” approach) or a more complex method (e.g., “cluster-while-estimate” approach), better forecasting methods could have been used and a forecast has more power and credibility. The forecast should be tracked and updated as new data become available. The updating is important because there are implications for all the business domains impacted by the original forecast. The issue is how to assess forecast accuracy.

Forecast accuracy and model fit are often confused. They are two different parts of an overall modeling process that, although separate and distinct, are nonetheless connected. A model that fits the training data very well may produce very good forecasts if the trends and patterns in the training data are stable and repeatable. But there is no guarantee that they will be. In addition, a model may forecast well into a testing period, but then may not do well outside this period. Conditions in the economy could suddenly change (e.g., a new tax policy is enacted) that could cause a radical change in the time series underlying the forecast. Or a new competitor could enter the market and completely change market dynamics. A model, on the other hand, that had a terrible fit to the training data or poor error performance in the testing data would most likely produce a forecast that was equally terrible.

FIGURE 7.23 This shows the odds ratios for drapes and shades for the logit regression model fit for the training data.

FIGURE 7.24 General composition of a decision tree node.

Model estimation must be judged using the basic criteria applied to all models. These include $R^2$ (where applicable), F-statistics, p-values, and so forth as part of a modeling strategy. A model that does poorly on these statistics should be discarded in favor of one that does well. Once a good model is selected, then a separate check for forecast accuracy should be done using the testing data as I described in Chapter 6. But the model may still not forecast well outside the testing period. You still must check its performance against actual values when they become available. The forecast error for an h-step ahead forecast, $F_T(h)$, is the difference $A_{T+h} - F_T(h)$. This differs from the definition in Chapter 6 by the use of the last value in the full time series sample, $T$, and not $T' < T$ for a testing period. All the error measures in Chapter 6 can be used with this slight adjustment. See Levenbach and Cleary [2006] for the measures and some examples. In addition, see Jr. [2015] for modeling strategies and model fit criteria. Finally, see Chatfield [2000] for a discussion of model uncertainty and forecast accuracy.
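Tracking post-launch forecast errors can be sketched with a few common summary measures. The measures chosen here (MAE, RMSE, and MAPE) are widely used examples and are an assumption on my part; the exact set defined in Chapter 6 is not reproduced in this section. The actual and forecast values are hypothetical.

```python
import math

def forecast_errors(actuals, forecasts):
    """Errors defined as actual minus forecast, one per forecast horizon."""
    return [a - f for a, f in zip(actuals, forecasts)]

def mae(errors):
    """Mean absolute error."""
    return sum(abs(e) for e in errors) / len(errors)

def rmse(errors):
    """Root mean squared error."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))

def mape(actuals, forecasts):
    """Mean absolute percentage error; requires nonzero actuals."""
    return 100.0 * sum(abs((a - f) / a) for a, f in zip(actuals, forecasts)) / len(actuals)

# Hypothetical post-launch actuals for horizons h = 1..4 and the forecasts made at T.
actuals = [100.0, 110.0, 120.0, 130.0]
forecasts = [90.0, 115.0, 120.0, 140.0]

errors = forecast_errors(actuals, forecasts)
print(errors)        # [10.0, -5.0, 0.0, -10.0]
print(mae(errors))   # 6.25
print(rmse(errors))  # 7.5
print(round(mape(actuals, forecasts), 2))
```

Recomputing these measures each time new actual values arrive gives a running picture of whether the forecast needs to be updated.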

# Additional external data – text once more

Tracking, root cause, and what-if analyses are not the sole analyses that must or should be done post-launch. In the modern social media age, customers do write online reviews. In the case of a business-to-business (B2B) business, those customers will be other businesses that buy at wholesale. Most likely they will not write reviews but rather go directly to their sales rep to voice an opinion. Their customers, the end-user consumers who are the ultimate customers, will, however, write reviews. It was an analysis of these that was the basis for some new product ideas described and discussed in Chapter 2. These same reviews, and the same text analysis described in Chapter 2, should be studied post-launch to determine if there are any problems. This is in the domain of Sentiment Analysis and Opinion Mining which I will discuss in the next section.

• “This is a great product. You must own one!”
• This is a positive sentiment.