Predictive modeling
Now that you have your data and you have visually examined them, you have to turn your attention to predictive modeling to address the product manager’s business
FIGURE 7.12 This chart shows the relationship between the log of net unit sales and the log of the pocket price using a hex plot. Histograms are shown on the two margins of the graph to emphasize the distributions. You can see that the distributions are almost normal.
problems. Recall that she wants to know the price elasticity for living room blinds now that actual market data are available, but she also wants to know the effect of changing the price level. The latter is a more complicated problem. See Paczkowski [2018] for a thorough discussion regarding price modeling.
Every process has steps that should be followed. Predicting is no different. The steps for predictive modeling are:
 1. split the data into training and testing data sets;
 2. train a model with the training data set; and
 3. test the trained model with the testing data set.
These steps hold even for forecasting the future, but that is not our focus here. Forecasting has its own complex issues, but the framework is the same. I will discuss
FIGURE 7.13 This chart shows the distributions of total discounts by the four marketing regions.
TABLE 7.6 These are the test results for differences in mean total discounts by marketing regions. The difference in the means is Group 2 - Group 1. The columns labeled “Lower” and “Upper” are the confidence limits around the mean difference. Notice that the Null Hypothesis is rejected for all combinations of the Southern Region and the other three (the last row of the table should be reversed to be consistent with the other Southern comparisons). Also notice that the difference in the means for the Southern Region is negative in all cases.
Group 1      Group 2      Mean Difference   Lower      Upper      Reject Null?
Midwest      Northeast    -0.0038           -0.0053    -0.0024    True
Midwest      South        -0.0549           -0.0561    -0.0537    True
Midwest      West         -0.0003           -0.0014     0.0008    False
Northeast    South        -0.0511           -0.0526    -0.0496    True
Northeast    West          0.0035            0.0021     0.0049    True
South        West          0.0546            0.0535     0.0558    True
these three general steps in the following subsections for predicting. Forecasting methods were already discussed in Chapter 6.
Training and testing data sets
It is best practice to divide a data set into two parts for prediction purposes:
TABLE 7.7 These are the summary statistics to help interpret Table 7.6.
Region       Count    Mean       Std. Dev.   Min     Q1      Median   Q3      Max
Midwest      19565    0.294990   0.040023    0.023   0.267   0.295    0.323   0.414
Northeast     8704    0.291163   0.056135    0.139   0.249   0.293    0.334   0.431
South        15831    0.240060   0.028707    0.149   0.220   0.240    0.260   0.325
West         26170    0.294682   0.048876    0.158   0.258   0.294    0.331   0.431
FIGURE 7.14 This chart shows the distributions of the components of the total discount for the Southern Region.
 1. a training data set; and
 2. a testing data set.
The former is used for model estimation while the latter is used for model testing. This is best practice to ensure optimal modeling. When a model is estimated, we say that it learns what the estimates should be from the data we give the estimation procedure. In essence, the model is trained by the data. Actually, the model is trained by a dependent variable in the sense that this variable guides how the parameters are estimated. It supervises the training so this is sometimes called supervised learning. There is, of course, unsupervised learning, but that is another issue.
Once a model is trained, it has to perform; that is, predict. You need to check those predictions, but you cannot do that with the same data used in the training since the model already saw or knows that data. This is where the testing data come
FIGURE 7.15 This chart shows the trend of the mean monthly dealer discount for the Southern Region. Notice that several months are missing discounts and that the last few months indicate an upward trend.
FIGURE 7.16 This boxplot illustrates how anomalies, or outliers, are revealed by a boxplot. The data for this graph are simulated.
in. This data set is a new one the model has not seen. If the model is good, it should be able to predict the testing data, albeit with some random variation or noise. A rough rule of thumb is to split your data into 3/4 training and 1/4 testing. Another is 2/3 training and 1/3 testing. It all depends, of course, on the size of your original data. Basically, you need a larger portion in the training data set because learning is more challenging and data intensive.
If time is not a characteristic of the data, then simple random sampling could be used to divide a data set into training and testing parts. However, if time is involved, then a split cannot be based on random sampling as I noted in Chapter 6 and the method I described there must be used. If the data are a panel data set consisting of both cross-sectional and time series dimensions, then it is best to randomly sample on the cross-sectional dimension and preserve the time series for those units.
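As a sketch of these two splitting strategies, the following uses pandas and scikit-learn on made-up data; the DataFrame and its columns are illustrative, not the Case Study's:

```python
# Sketch of the two ways to create training and testing data sets.
# The DataFrame and columns are illustrative only.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "month": range(100),                      # a time index
    "sales": [200 + 3 * i for i in range(100)],
})

# No time dimension: simple random sampling, 3/4 training, 1/4 testing.
train, test = train_test_split(df, test_size=0.25, random_state=42)

# Time dimension: preserve order and hold out the last quarter instead.
cut = int(len(df) * 0.75)
train_ts, test_ts = df.iloc[:cut], df.iloc[cut:]
```

For panel data, the same `train_test_split` call could be applied to the list of cross-sectional unit IDs rather than to the rows, keeping each unit's full time series intact.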
Training a model
Let me now discuss training a model with the training data set. The framework you select for predictive modeling depends on the nature of the dependent variable, the variable that guides the training. There are several cases or possibilities.^{8} For one case, the dependent variable is continuous; this is just OLS. For another case, the dependent variable is binary; this is where logistic regression is used. Both are members of a regression family, which is large. There is actually a third case that can handle either a continuous or discrete dependent variable but it does so by fitting constants to the data. This is also a member of the regression family. In a sense, these are all cousins. Regardless of the specific family member, a regression model fits:
 • straight lines: this is OLS;
 • curves: this is logistic regression^{9}; and
 • constants: this is decision trees.
These three cases are discussed in the following subsections.
Case I: Training a model with a continuous dependent variable
Technically, this is the model for an OLS regression. A model is

Y_i = β_0 + β_1 X_i1 + β_2 X_i2 + ... + β_p X_ip + ε_i

where:
Y_i is the dependent variable for the i-th observation, i = 1, 2, ..., n;
β_0 is the intercept;
β_j, j = 1, 2, ..., p, are the slopes;
X_ij, j = 1, 2, ..., p, are the independent variables; and
ε_i is the random disturbance: ε_i ~ N(0, σ²).
Y_i is continuous and our goal is to estimate the parameters, or slopes, to say something about Y_i which could, for example, be net unit sales. There should be an intercept, β_0, but I usually do not care about this since it just places the line or plane or hypersurface. The slopes, however, are important since they show the effects of the independent variables. These slopes are intimately related to the dependent variable, which is why I refer to the model as being trained by the dependent variable.
Once the model is trained, you have to examine it statistically. This means you have to test the effects. You use an F-test for this purpose. You also have to check the relationships among the independent variables for multicollinearity, using the variance inflation factors (VIF). See Gujarati [2003] and Paczkowski [2018] for a discussion of the VIF. There are other checks you need to do, but these will suffice here. In general, a model is trained following four steps:^{10}
 1. define a formula (i.e., the specific model to estimate);
 2. instantiate the model (i.e., specify it);
 3. fit the model; and
 4. summarize the fitted model.
For the Case Study, a model for unit sales is

U_i = e^{β_0} P_i^{β_1} e^{β_2 MW_i + β_3 NE_i + β_4 S_i + ε_i}

where U_i is net unit sales for observation i, P_i is the pocket price, and MW_i, NE_i, and S_i are dummy variables for the Midwest, Northeast, and Southern regions. The focus is on the pocket price and marketing regions. Notice only three regions are specified. Since Region is categorical, you must create dummy variables for it. There are four marketing regions corresponding to the four U.S. Census Regions so only three dummies should be created. The Western region is omitted to avoid the dummy variable trap. This region is the last in alphanumeric order and was arbitrarily selected as the base. Although omitted in the model specification, it is still present as the intercept. The intercept is the Western region and the estimated parameters for the other three regions show differences from the Western region, which is the base. See Paczkowski [2018] and Gujarati [2003] for discussions about dummy variables, their interpretation, and the dummy variable trap.
This model is an inherently linear model, meaning it can be linearized by taking the natural log of both sides. The earlier histogram analysis showed that net sales are highly right-skewed but that the distribution for the log transformation is more symmetrical. This is why it is preferred. In addition to being inherently linear, this model is also a constant-elasticity (or isoelastic) model. The price elasticity is the estimated parameter, β_1. See Paczkowski [2018] for a discussion of this type of model and elasticities. Also see the Appendix to this chapter for the constant elasticity demonstration.
Applying the natural log to both sides of the model yields

ln(U_i) = β_0 + β_1 ln(P_i) + β_2 MW_i + β_3 NE_i + β_4 S_i + ε_i    (7.1)
Once a model is specified, it must be instantiated, which simply means the estimation methodology must be specified for the model just stated. Since an OLS model is considered in this section, instantiation means that OLS must be specified. In addition to specifying the methodology, the data used with that model are also specified. This will be the training data set.
FIGURE 7.17 Regression summary for the linearized model (7.1).
Next, the model is fit; that is, the unknown model parameters are estimated using the training data set. In some software, once the model is fit, a summary report is displayed which is the last step; in other software products, you have to call a summary function. Regardless, a summary of the fit is displayed.
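The four steps can be sketched with Python's statsmodels formula API. The data, column names, and coefficient values below are simulated stand-ins, not the Case Study's; the formula mirrors the log-log specification with the Western region as the base:

```python
# Sketch of the four training steps using statsmodels' formula API.
# All data, names, and coefficients here are simulated, not the Case Study's.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 200
train = pd.DataFrame({
    "pocket_price": rng.uniform(50, 100, n),
    "region": rng.choice(["Midwest", "Northeast", "South", "West"], n),
})
# Simulate log-log demand with a price elasticity of -1.5.
train["net_sales"] = np.exp(5.0 - 1.5 * np.log(train["pocket_price"])
                            + rng.normal(0, 0.1, n))

# Step 1: define a formula; Treatment() makes West the omitted base region.
formula = 'np.log(net_sales) ~ np.log(pocket_price) + C(region, Treatment(reference="West"))'
# Step 2: instantiate the model with the training data.
model = smf.ols(formula, data=train)
# Step 3: fit the model (estimate the parameters).
result = model.fit()
# Step 4: summarize the fitted model.
print(result.summary())
```

With other software, such as JMP, the same four steps happen behind a dialog box, but the logic is identical.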
Estimation results for the Case Study are shown in Figure 7.17. The estimated price parameter is −1.5, which indicates that the demand for the new product is highly elastic. This should be expected since window treatments is a highly competitive business. In addition to blinds, there are also shades and drapes. In addition, this new product is a high-tech product that is voice activated, which means that its appeal in the market may be hindered by the low-tech (i.e., old-fashioned) blinds currently in the market. If sales are suffering as the product manager’s dashboard and business intelligence systems indicate, lowering the price may be necessary. This elasticity helps justify this strategy.
Why lower the price rather than raise it? I show in the Appendix to this chapter and in Paczkowski [2018] that the total revenue elasticity with respect to a price
FIGURE 7.18 Regression effects summary shows that the regions are significant. The price and region effects test F-ratios are calculated as the respective sum of squares (SS) divided by the error SS in the ANOVA table. The SS are the difference in the ANOVA model SS when the effect is omitted. So the Region SS is the difference in the ANOVA model SS including and excluding the region variable. The Effects Summary report plots the p-values from the Effects Test report. These p-values are scaled using the negative of the log to the base 10 of the p-value. This transformation is called a LogWorth, which is −log_{10}(p-value). See Paczkowski [2016] for a discussion of the LogWorth and its applications.
change is η_P^{TR} = 1 + η_P, where η_P is the price elasticity for unit sales with respect to the price. For this problem, η_P = −1.5 so η_P^{TR} = −0.5. If price is lowered 1%, then revenue rises 0.5%. Clearly, a price increase results in revenue declining.
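The arithmetic can be written out as a short check; the elasticity value is the one reported above:

```python
# The revenue-elasticity arithmetic from the text, written out.
eta_p = -1.5            # price elasticity of unit sales
eta_tr = 1 + eta_p      # total revenue elasticity with respect to price

# A 1% price cut changes revenue by eta_tr * (-1%) = +0.5%.
pct_price_change = -1.0
pct_revenue_change = eta_tr * pct_price_change
print(eta_tr, pct_revenue_change)  # -0.5 0.5
```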
The regression results also indicate that regions vary in significance. An effects test (an F-test) shows that regions as a whole are significant. The test results are shown in Figure 7.18.
The analysis could be taken a step further by estimating the interaction between price and region. There may be a regional price elasticity differential that could lead to an enhanced price discrimination strategy. The result of interacting price and region is shown in Figure 7.19. The effects are summarized in Figure 7.20. Notice that the interactions are insignificant.
Testing a model
Once a model is trained, you should test it against the pristine testing data set. Recall that this is a data set the model did not see during its training. The estimated, trained coefficients could be applied to the testing data to predict values for the dependent variable. For the Case Study, first recognize that net unit sales are in natural log terms. You will convert back to net unit sales in “normal” terms by exponentiating the log term. The same holds for the price variable. Once this is done, an “r2_score” can be calculated to check the fit of actual vs. predicted values.
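A minimal sketch of this back-transformation and scoring, assuming hypothetical log-scale actuals and predictions; scikit-learn's `r2_score` supplies the statistic:

```python
# Back-transform log-scale predictions and compute the fit score.
# The actual and predicted values below are hypothetical placeholders.
import numpy as np
from sklearn.metrics import r2_score

# Suppose these came from the trained model applied to the testing data.
log_actual = np.array([4.2, 4.8, 5.1, 4.5, 5.0])
log_predicted = np.array([4.3, 4.6, 5.0, 4.7, 4.9])

# Exponentiate to return to "normal" net-unit-sales terms.
actual = np.exp(log_actual)
predicted = np.exp(log_predicted)

score = r2_score(actual, predicted)
print(round(score, 3))
```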
FIGURE 7.19 Regression summary of price and region interaction. An adjustment is made to the interaction term to avoid potential bias. This output was produced using JMP. See the JMP documentation for the adjustment.
The r2_score function “computes R^{2}, the coefficient of determination. It provides a measure of how well future samples are likely to be predicted by the model. The best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get a R^{2} score of 0.0.”^{11} For the Case Study, the score is 0.138, which is not very good.
You can graph the actual vs predicted values. Sometimes, however, the number of data points is too large to plot so a random sample may be needed. This is our situation for the Case Study. The data visualization methods described earlier could be used for this problem.
You can also predict unit sales for different settings of the pocket price variable. This is scenario or whatif analysis. Since the pocket price is a function of the discounts given by the sales force, you might want to have separate equations for pocket price with different discounts. Recall that pocket price is the list price less
FIGURE 7.20 Regression effects summary for the priceregion interaction.
TABLE 7.8 This is a tally of those who purchased each of the three window treatments. There were 372 (= 179+ 193) customers who did not purchase any of the three products but are in the data set because they purchased some other product in the company’s product line.
                  Drapes                    Shades
Blinds      No      Yes     Total     No      Yes     Total
No          179     178     357       193     164     357
Yes         161     618     779       155     624     779
Total       340     796     1136      348     788     1136
discounts. A simple program could be written to use discounts as inputs and return pocket prices which are then used as inputs to the estimated model.
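Such a program might look like the following sketch; the list price, discount values, and model coefficients are all hypothetical placeholders, not the Case Study's estimates:

```python
# What-if sketch: discounts in, pocket price out, predicted sales back.
# LIST_PRICE and the coefficients B0, B_PRICE are hypothetical values.
import numpy as np

LIST_PRICE = 100.0
B0, B_PRICE = 5.0, -1.5

def pocket_price(discounts):
    """Pocket price = list price less all discounts (fractions of list)."""
    return LIST_PRICE * (1.0 - sum(discounts))

def predicted_units(discounts):
    """Predicted net unit sales from the constant-elasticity model."""
    return np.exp(B0 + B_PRICE * np.log(pocket_price(discounts)))

# Scenario: a 10% dealer discount plus a 5% competitive discount.
print(round(pocket_price([0.10, 0.05]), 2))    # 85.0
print(round(predicted_units([0.10, 0.05]), 3))
```

Each scenario is then just a different list of discounts passed to the same functions.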
Case II: Training a model with a discrete dependent variable
For the Case Study, the product manager wants to know the determinants or key drivers for a customer buying or not buying the new window blinds product. She is particularly interested in knowing the likelihood of buying blinds if window drapes and/or shades are purchased. To address her questions, you create a data table that has for each unique customer ID (CID) the total number of blinds, drapes, and shades purchased by each customer since the introduction of the new blinds. There were 1,136 unique customer IDs. An indicator variable was created for each product that was 1 if the total order was greater than zero and 0 otherwise. A tally is shown in Table 7.8.
The data were split into training and testing data sets. The training set had 761 records and the testing data set had 375; the total for the two is the original 1,136.
A logit model was specified for the blind purchases. The model is for the probability that any customer either buys or does not buy the new living room blind as a function of several factors:
 • whether or not drapes were ordered (0/1 coding);
 • whether or not shades were ordered (0/1 coding);
 • Marketing Region (dummy coded);
 • Loyalty Program membership (0/1 coding);
 • Buyer Rating (dummy coded);
 • Customer Satisfaction Rating (T2B coded);
 • population size of store location;
 • median household income of store location;
 • number of housing units of store location;
 • number of occupied housing units of store location;
 • ratio of occupied to total housing units; and
 • credit rating (T2B).
Since there is a variable that measures the ratio of occupied to total housing units, only this ratio was used rather than the separate occupied and total housing unit counts. This means ten of the twelve variables listed were used to fit a model.
A logit regression model was fit to the training data with the blinds as the dependent variable. Estimation results are shown in Figure 7.21 and Figure 7.22. The odds ratios for drapes and shades, the two significant variables, are shown in Figure 7.23.^{12}
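A sketch of such a logit fit with statsmodels, using simulated purchase indicators in place of the Case Study's data (only the drapes and shades drivers are included for brevity):

```python
# Sketch of a logit fit with statsmodels; simulated stand-in data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 761  # training-set size reported in the text
train = pd.DataFrame({
    "drapes": rng.integers(0, 2, n),
    "shades": rng.integers(0, 2, n),
})
# Simulate blinds purchases driven by drapes and shades ownership.
xb = -0.5 + 1.2 * train["drapes"] + 1.0 * train["shades"]
train["blinds"] = rng.binomial(1, 1.0 / (1.0 + np.exp(-xb)))

# Instantiate and fit the logit model for the buy/no-buy probability.
result = smf.logit("blinds ~ drapes + shades", data=train).fit(disp=False)

# Odds ratios are the exponentiated coefficients.
odds_ratios = np.exp(result.params)
print(odds_ratios)
```

An odds ratio above 1.0 for drapes, say, means a drapes buyer has higher odds of also buying the new blinds, which is the kind of reading Figure 7.23 supports.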
Case III: Training a model with constants
Now let me consider what happens when you have constants to model. This is where decision trees come in. So let me discuss this class of models which are also in the regression family. See Beck [2008] for a discussion of decision trees as a regression family member.
Decision trees are used when the dependent variable is continuous or discrete; they are versatile. There are also independent variables, as with OLS and logistic regression. These variables, however, are used to “cut” the dependent variable into groups by fitting constants to the dependent variable. This cutting up of the dependent variable amounts to creating smaller and smaller groups of the dependent variable, and therefore the sample. The dependent variable is thus partitioned, so the trees are sometimes called partition trees. Trees are graphed upside down with the root at the top and branches flowing down. There are “leaves” and it is these leaves that show or summarize the partitioning of the dependent variable.
For terminology, when the dependent variable is discrete with classes or groups the tree is called a classification tree; otherwise, it is a regression tree. The contents of the leaves, also called nodes, vary depending on the software. The leaves generally look like the one displayed in Figure 7.24. In this example, the dependent variable is buy or not buy the new blind, which is discrete, so the tree is a classification
FIGURE 7.21 This shows the logit regression model fit statistics for the training data.
tree. The purchase of a shade is a key driver. The node has a sample size of 534 customers. The G^{2} value is the Likelihood-ratio Chi-square value discussed in Chapter 5 and the LogWorth is the transformed p-value for this chi-square. Recall that the logistic model estimates a probability. So does the classification tree. The predicted probabilities are 0.223 for Not Buy (the 0 value) and 0.777 for Buy (the 1 value). Since the predicted probability is larger for Buy, the predicted class for all those in this node is Buy.
See Paczkowski [2016] for a discussion of decision trees using the JMP software and how to interpret the output.
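For readers working in Python rather than JMP, a classification tree can be sketched with scikit-learn; the data below are simulated, and the G^{2} and LogWorth statistics shown in JMP's nodes are not reproduced:

```python
# Classification-tree sketch with scikit-learn on simulated data.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2)
n = 1000
shades = rng.integers(0, 2, n)   # bought a shade? (0/1)
drapes = rng.integers(0, 2, n)   # bought drapes? (0/1)
X = np.column_stack([shades, drapes])
# Buying a shade raises the probability of buying the new blinds.
y = rng.binomial(1, np.where(shades == 1, 0.78, 0.35))

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Each leaf predicts the class with the larger estimated probability,
# just as the text describes for the JMP node.
proba = tree.predict_proba([[1, 0]])[0]
print(proba)
```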
FIGURE 7.22 This shows the logit regression model fit estimates for the training data.
New product forecast error analysis
If a sales forecast was produced prior to launch, then it should be monitored in conjunction with sales data. The problem for forecast tracking is how the sales forecast was produced. If a judgement, naive, or constant mean method was used, then forecast tracking is most likely pointless: the forecast will be off simply because poor methods were used. Recall that a forecast has to be developed for planning purposes, and this is probably as far as it should go when these methods are used. If analog product histories were combined either through simple averaging (e.g., the “cluster-then-estimate” approach) or a more complex method (e.g., the “cluster-while-estimate” approach), better forecasting methods could have been used and the forecast has more power and credibility. The forecast should be tracked and updated as new data become available. The updating is important because there are implications for all the business domains impacted by the original forecast. The issue is how to assess forecast accuracy.
Forecast accuracy and model fit are often confused. They are two different parts of an overall modeling process that, although separate and distinct, are nonetheless
FIGURE 7.23 This shows the odds ratios for drapes and shades for the logit regression model fit for the training data.
FIGURE 7.24 General composition of a decision tree node.
connected. A model that fits the training data very well may produce very good forecasts if the trends and patterns in the training data are stable and repeatable. But there is no guarantee that they will be. In addition, a model may forecast well into a testing period, but then may not do well outside this period. Conditions in the economy could suddenly change (e.g., a new tax policy is enacted) that could cause a radical change in the time series underlying the forecast. Or a new competitor could enter the market and completely change market dynamics. A model, on the other hand, that had a terrible fit to the training data or poor error performance in the testing data would most likely produce a forecast that was equally terrible.
Model estimation must be judged using the basic criteria applied to all models. This includes R^{2} (where applicable), F-statistics, p-values, and so forth as part of a modeling strategy. A model that does poorly on these statistics should be discarded in favor of one that does well. Once a good model is selected, then a separate check for forecast accuracy should be done using the testing data as I described in Chapter 6. But the model may still not forecast well outside the testing period. You still must check its performance against actual values when they become available. The forecast error for an h-step ahead forecast, F_T(h), is the difference A_{T+h} − F_T(h). This differs from the definition in Chapter 6 by the use of the last value in the full time series sample, T, and not T' < T for a testing period. All the error measures in Chapter 6 can be used with this slight adjustment. See Levenbach and Cleary [2006] for the measures and some examples. In addition, see jr. [2015] for modeling strategies and model fit criteria. Finally, see Chatfield [2000] for a discussion of model uncertainty and forecast accuracy.
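A small sketch of this post-launch error tracking, with made-up actuals and forecasts:

```python
# Post-launch forecast error with the adjusted definition:
# the error for the h-step ahead forecast is A_{T+h} - F_T(h).
# The actuals and forecasts below are made-up numbers.
import numpy as np

actuals = np.array([102.0, 98.0, 110.0])      # A_{T+1}, A_{T+2}, A_{T+3}
forecasts = np.array([100.0, 101.0, 104.0])   # F_T(1), F_T(2), F_T(3)

errors = actuals - forecasts                    # 2, -3, 6
mae = np.mean(np.abs(errors))                   # mean absolute error
mape = np.mean(np.abs(errors / actuals)) * 100  # mean absolute % error

print(round(mae, 2), round(mape, 2))
```

Any of the Chapter 6 error measures can be computed this way as new actuals arrive, which is the updating step described above.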
Additional external data – text once more
Tracking, root cause, and what-if analyses are not the sole analyses that must or should be done post-launch. In the modern social media age, customers do write online reviews. In the case of a business-to-business (B2B) business, those customers will be other businesses that buy at wholesale. Most likely they will not write reviews but rather go directly to their sales rep to voice an opinion. Their customers, the end-user consumers who are the ultimate customers, will, however, write reviews. It was an analysis of these that was the basis for some new product ideas described and discussed in Chapter 2. These same reviews, and the same text analysis described in Chapter 2, should be studied post-launch to determine if there are any problems. This is in the domain of Sentiment Analysis and Opinion Mining, which I will discuss in the next section.