Case study data sources
As for all the components of the new product development process I described in this book, data are needed. Data are the driving force for everything a business must do. To quote W. Edwards Deming: "In God we trust. All others must bring data."^{4} The data sources are both internal and external.
Internal business data are the most common data used, not only in dashboards but also for drilling down to root causes and for predictive modeling. These internal data come from a number of different sources or databases, perhaps consolidated in a data mart, such as, but certainly not limited to:
 • orders;
 • products;
 • pricing;
 • marketing;
 • customer; and
 • many others.
The external databases are too numerous to mention. At a minimum, the data analyst should have access to one or several that provide demographics on households and business firmographics: household income, age distribution, education distribution, and socioeconomic classification for households; revenue and number of employees for businesses.
The business analyst for the Case Study compiled a data dictionary for an initial analytical database she will use to address the product manager's requests. A data dictionary contains "metadata," which are data about the data. Metadata can be anything that documents the data. They are information about the distinct data elements such as^{3}:
 • means of creation;
 • purpose of the data;
 • time and date of creation;
 • creator/author/keeper of the data;
 • placement on a network (electronic form);
 • where the data were created;
 • what standards were used to create the data; and
 • and so forth.
For the Case Study, the metadata are restricted to:
 • variable name;
 • possible values or value ranges;
 • source; and
 • mnemonic.
The mnemonic is the label used in data files and in statistical and modeling output. The data dictionary for the Case Study is shown in Table 7.1. A customer background data dictionary, reflecting data from the customer database, is shown in Table 7.2. The customer data will be merged, or joined, with the basic orders data on a common identifier, the CID.
Case study data analysis
A basic data analysis should be done before any complex analysis, such as estimating a regression model. Two parts, or stages, are discussed here in what may look like a chronological order but which can really be done in either order. The two parts are an examination of descriptive statistics and data visualization.
The basic descriptive statistics include:
 • count of the observations;
 • mean;
 • standard deviation;
 • minimum value;
 • 25th percentile or first quartile (Q1);
 • 50th percentile or median (or Q2);
 • 75th percentile or third quartile (Q3); and
 • maximum value.
The last five are known as the Five Number Summary, a robust set of statistics unaffected by distribution skewness. Although unaffected, they help determine skewness and symmetry of the data distribution:
 1. Symmetric: (75% − 50%) = (50% − 25%)
 2. Right-skewed: (75% − 50%) > (50% − 25%)
 3. Left-skewed: (75% − 50%) < (50% − 25%)
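The Five Number Summary and the quartile-based skewness check above can be sketched with NumPy. The lognormal sample is an illustrative stand-in for any right-skewed variable:

```python
import numpy as np

# Illustrative right-skewed sample (lognormal draws); any numeric array works.
rng = np.random.default_rng(42)
x = rng.lognormal(mean=3.0, sigma=0.5, size=1_000)

# Five Number Summary: min, Q1, median, Q3, max.
q1, median, q3 = np.percentile(x, [25, 50, 75])
five_number = (x.min(), q1, median, q3, x.max())

# Quartile-based skewness check from the text.
if (q3 - median) > (median - q1):
    shape = "right-skewed"
elif (q3 - median) < (median - q1):
    shape = "left-skewed"
else:
    shape = "symmetric"
print(five_number, shape)
```

Because the summary uses only order statistics, it is unaffected by a handful of extreme values, which is exactly why it is described as robust.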
The descriptive statistics for the discounts are shown in Table 7.3.
A correlation is also helpful. Recall that a correlation shows association, not cause and effect. This distinction is important because the wrong interpretation of a correlation can lead to disastrous results. Also, a correlation is only between pairs of variables, never more than two, and its value varies between −1.0 and +1.0. A whole set of pairwise correlations can be calculated and displayed together in a correlation matrix. A correlation matrix for the discounts is shown in Table 7.4.
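A correlation matrix like Table 7.4 is one line in pandas. The discount columns below are simulated stand-ins, not the Case Study data:

```python
import numpy as np
import pandas as pd

# Hypothetical discount columns mirroring the Table 7.4 variables; real data
# would come from the merged analytical data set.
rng = np.random.default_rng(0)
discounts = pd.DataFrame({
    "Ddisc": rng.normal(0.12, 0.04, 500),
    "Cdisc": rng.normal(0.07, 0.02, 500),
    "Odisc": rng.normal(0.05, 0.01, 500),
    "Pdisc": rng.normal(0.04, 0.01, 500),
})

# Pairwise Pearson correlations: every value lies in [-1.0, +1.0],
# the diagonal is exactly 1.0, and the matrix is symmetric.
corr = discounts.corr()
print(corr.round(3))
```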
TABLE 7.1 This is a basic data dictionary for the Case Study showing the order detail for customers of living room blinds.

Variable | Values | Source | Mnemonic
Order Number | Nominal integer | Order Sys | Onum
Customer ID | Nominal integer | Customer Sys | CID
Transaction Date | MM/DD/YYYY | Order Sys | Tdate
Product Line ID | Five house rooms | Product Sys | Pline
Product Class ID | Item in line | Product Sys | Pclass
Units Sold | Units/order | Order Sys | Usales
Product Returned? | Yes/No | Order Sys | Return
Amount Returned | Number of units | Order Sys | returnAmount
Material Cost/Unit | USD cost of material | Product Sys | Mcost
List Price | USD list | Price Sys | Lprice
Dealer Discount | % to dealer (decimal) | Sales Sys | Ddisc
Competitive Discount | % for competition (decimal) | Sales Sys | Cdisc
Order Size Discount | % for size (decimal) | Sales Sys | Odisc
Customer Pickup Allowance | % for pickup (decimal) | Sales Sys | Pdisc
Total Discount | % discount | Sum of discounts | Tdisc
Pocket Price | USD | Lprice × (1 − Tdisc) | Pprice
Net Unit Sales | Net units/order | Usales − returnAmount | netUsales
Log of Net Unit Sales | Log net sales | log(netUsales) | log_netUsales
Log of Unit Sales | Log sales | log(Usales) | log_Usales
Log of Pocket Price | USD | log(Pprice) | log_Pprice
Revenue | USD | Usales × Pprice | Rev
Contribution | USD | Rev − Mcost | Con
Contribution Margin | % | Con/Rev | CM
Net Revenue | USD | (Usales − returnAmount) × Pprice | netRev
Lost Revenue | USD | Rev − netRev | lostRev
A second step in data analysis is data visualization. This is the latest "buzz phrase": everyone talks about data visualization. It is really just looking at your data using graphs. The graphs, however, could be somewhat complex and, hopefully, informative, conveying Rich Information. Graphs are generally more informative than tables. Small tables, like a 2x2, may be sufficient for presentations, but they usually cannot provide the Rich Information needed by decision makers. Larger
TABLE 7.2 This is a basic customer data dictionary for the Case Study. It shows the customer features that could eventually be used with the sales data to provide a complete picture of the customers and their orders. The Buyer Rating is based on the sales force evaluations of ease of doing business with that customer while the customer satisfaction is based on an annual satisfaction study.

Variable | Values | Source | Mnemonic
Customer ID | Nominal integer | Customer Sys | CID
State of Business | Two-character state code | Customer Sys | State
ZIP Code of Location | Five-digit ZIP (character) | Customer Sys | ZIP
Marketing Region | Four Census regions | Customer Sys | Region
In Loyalty Program | Yes/No | Customer Sys | loyaltyProgram
Buyer Rating | Poor, Good, Excellent | Customer Sys | buyerRating
Buyer Satisfaction | Five-point Likert | Customer Sys | buyerSatisfaction
TABLE 7.3 Descriptive statistics for the four discounts in the Case Study. The variable Tdisc, the total discount, is the sum of the four discounts; it was added after the two primary data sets were merged into one. Notice that the dealer discount (Ddisc) is right-skewed. Also note that the sample sizes differ slightly: the full data set has 70,270 records, which indicates that there are some missing data.

      | count | mean | std  | min  | 25%  | 50%  | 75%  | max
Ddisc | 70262 | 0.12 | 0.04 | 0.04 | 0.09 | 0.12 | 0.16 | 0.22
Cdisc | 70261 | 0.07 | 0.02 | 0.04 | 0.05 | 0.07 | 0.09 | 0.10
Odisc | 70266 | 0.05 | 0.01 | 0.02 | 0.04 | 0.05 | 0.06 | 0.08
Pdisc | 70268 | 0.04 | 0.01 | 0.02 | 0.03 | 0.04 | 0.05 | 0.06
Tdisc | 70270 | 0.28 | 0.05 | 0.02 | 0.24 | 0.28 | 0.32 | 0.43
TABLE 7.4 This is the correlation matrix for the four discounts. The total discount was not included because, as the sum of the discounts, it will be highly correlated with the four discounts and so it will not provide any new information.

      | Ddisc    | Cdisc    | Odisc    | Pdisc
Ddisc | 1.000000 | 0.000919 | 0.001962 | 0.003748
Cdisc | 0.000919 | 1.000000 | 0.002217 | 0.000396
Odisc | 0.001962 | 0.002217 | 1.000000 | 0.002888
Pdisc | 0.003748 | 0.000396 | 0.002888 | 1.000000
tables are needed, but we have difficulty seeing patterns in them, which is a drawback. Generally, visuals are superior to tables of numbers because we are visual creatures: we can see patterns in graphs much faster and more easily than in large tables. See Cleveland [1993] and Cleveland [1994] for thorough accounts of data visualization. Also see Tufte [1983] for the authoritative treatise on effective data visualization. It is also worthwhile to look at Wickham [2009] for useful discussions and implementations in R of the Grammar of Graphics due to Wilkinson [2005].
I cannot oversell visual displays, either. Not all visuals are created equal or are the best for revealing Rich Information. There are visual display issues, such as poor color choices and excessive use of 3D, that counter the benefits of visualization.^{6} Nonetheless, you cannot do effective Business Analytics without data visualization.
To say you need to create graphs to visualize data is insufficient. A logical question to ask is: "What do I look for in graphs?" I have my own list of "clues" to look for in graphs that tell you about the underlying processes generating the data and about data issues you need to note.
 1. Distributions
 • Symmetry/skewness.
 2. Relationships
 • Correlations or causality statements between two or more variables.
 3. Trends
 • Developments or changes over time (e.g., a tracking study).
 4. Patterns
 • Groupings of objects as in market segments.
 5. Anomalies
 • Outliers or unexpected values.
These are not in any special order. In fact, after a while you will unconsciously, almost instinctively, look for them as soon as you produce a graph. I will discuss these one at a time in the following sections.
Examine distributions
First on my list is an examination of the distribution of the data. The distribution is where the data are centered and how they are spread out from that point. Two display tools that reveal the distribution are the histogram and the boxplot. Both are introduced in a basic statistics course with the histogram being introduced first, almost within the first week, followed by the boxplot a few weeks later after summary statistics are introduced.
The histogram, although a basic, almost elementary, graph for one variable, has a major issue: the size, and therefore the number, of the bins to display, where the bins are represented by bars. Changing the size of the bins can dramatically change how the histogram is interpreted and, therefore, what kind of Rich Information it will provide. Most statistics packages have their own algorithm for determining bin sizes. One rule is the Freedman-Diaconis rule for the bin width:

    h = 2 × IQR(x) / n^(1/3)

where h = bin width, IQR(x) = Interquartile Range (= Q3 − Q1) for variable x, and n is the number of observations. The number of bins is then

    k = ⌈(max(x) − min(x)) / h⌉

You can set your own bin size in most data visualization software, but the default is usually sufficient. The Python package Seaborn uses this rule. Another rule is Sturges' Rule for determining the number of bins:

    k = ⌈log2(n)⌉ + 1

where n is the sample size. See the short note by Hyndman [1995] about issues with these rules.
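Both binning rules are easy to compute directly, and NumPy also implements them by name. A sketch on simulated data:

```python
import numpy as np

# Illustrative sample; any one-dimensional numeric array works here.
rng = np.random.default_rng(1)
x = rng.normal(size=10_000)

# Freedman-Diaconis: h = 2 * IQR(x) / n**(1/3); number of bins = range / h.
q1, q3 = np.percentile(x, [25, 75])
h = 2 * (q3 - q1) / len(x) ** (1 / 3)
fd_bins = int(np.ceil((x.max() - x.min()) / h))

# Sturges: number of bins = ceil(log2(n)) + 1.
sturges_bins = int(np.ceil(np.log2(len(x)))) + 1

# NumPy supports both rules by name (its internals may differ slightly
# in rounding from the hand computation above).
fd_edges = np.histogram_bin_edges(x, bins="fd")
sturges_edges = np.histogram_bin_edges(x, bins="sturges")
print(fd_bins, sturges_bins)
```

Note how different the two answers are for the same data: Sturges' Rule depends only on the sample size, while Freedman-Diaconis also adapts to the spread of the data.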
The shape of the histogram indicates the shape of the data’s distribution. There are three possibilities. The distribution could be:
 1. symmetric, the most desirable;
 2. right-skewed; or
 3. left-skewed.
A distribution is symmetric if the right and left sides of the histogram are approximately mirror images of each other. A distribution is skewed to the right if the right side of the histogram extends much farther out than the left; there is a long right tail. A distribution is skewed to the left if the left side of the histogram extends much farther out than the right; there is a long left tail. Skewness is an issue because skewed data affect many summary statistics, especially the mean or average. Three possible shapes are shown in Figure 7.4. The lower right is almost uniform while the lower left is almost normal; the one on the lower left is the most desirable.
Because of the bin issue, and also because a histogram is for a single variable, a better plot is the boxplot, which does not have this issue. This creative display shows the Five Number Summary mentioned earlier, and it can also show the effect of a second, qualitative variable. Figure 7.5 shows the anatomy of a boxplot.
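A minimal boxplot sketch with matplotlib, using a simulated skewed sample as a stand-in for the Case Study data, shows the fence convention described in Figure 7.5:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line for interactive use
import matplotlib.pyplot as plt

# Hypothetical right-skewed sample standing in for the Case Study variables.
rng = np.random.default_rng(7)
x = rng.lognormal(mean=0.0, sigma=0.6, size=1_000)

fig, ax = plt.subplots()
# whis=1.5 puts the fences at 1.5 times the IQR beyond the box; points
# beyond the fences are drawn individually as outliers ("fliers").
artists = ax.boxplot(x, whis=1.5)
n_outliers = len(artists["fliers"][0].get_ydata())
print(f"{n_outliers} points fall outside the fences")
fig.savefig("boxplot.png")
```

For a skewed sample like this one, expect many points above the upper fence and few or none below the lower one, exactly the asymmetry a boxplot is designed to expose.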
Examine relationships
The best way to examine and discover relationships between two or more quantitative variables is with a scatter plot. This shows combinations of values of the variables as points and all the points will show a relationship, if one exists. Figure 7.6 illustrates possibilities.
There is something to notice about the one in the lower left of Figure 7.6: there are a lot of data points! Maybe there is something hidden inside this graph, some Rich Information, hidden because there are too many data points. I will comment on this shortly.
When you see a nonlinear relationship, such as the one on the upper right of Figure 7.6, you could make it linear with a log transformation. The log actually does two things:
FIGURE 7.4 Three possible shapes for a histogram. The top left is left-skewed; the top right is right-skewed; the two bottom ones are symmetric. The bottom left histogram is normal while the bottom right is uniform.
 1. Straighten, or linearize, a curve
 • Linear relationships are easier to work with.
 • Linear relationships make it easier to handle probability distributions.
 2. Stabilize variances
 • Make the distributions more evenly spread.
 • Remove extreme skewness.
Demonstrations of these two properties are in the Appendix to this chapter.
Figure 7.7 shows the distribution of the net unit sales for the Case Study. Net sales are total, or gross, unit sales less returns. Notice that net sales are highly right-skewed, which distorts the impression of the data. Any statistical analysis is jeopardized by this skewness. Transforming the data with the natural log normalizes the data, so when you model net unit sales you should use a log transformation. Figure 7.8 shows the same data but on a natural log scale. The distribution is now fairly normal. As a recommendation, use the natural log of one plus the argument, or ln(1 + x), rather than ln(x), to avoid cases where x = 0: ln(0) is undefined, but ln(1) = 0, so you have a meaningful number. For the Case Study, it is possible that net sales could be zero if an entire order was returned. This is, in fact, the case, as shown by the Five Number Summary in Table 7.5.
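The ln(1 + x) recommendation is a one-liner in NumPy via `log1p`. The sample values below echo the Five Number Summary in Table 7.5, including the zero minimum:

```python
import numpy as np

# Net unit sales can be zero when an entire order is returned, so take
# ln(1 + x) rather than ln(x): ln(0) is undefined, but ln(1) = 0.
net_usales = np.array([0, 15, 24, 36, 338])

log_net = np.log1p(net_usales)  # log1p(x) computes ln(1 + x)
print(log_net.round(3))
```

`log1p` is also more numerically accurate than `np.log(1 + x)` when x is very small, though for sales counts the zero-handling is the relevant benefit.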
Log transforming data also produces a nice, useful economic interpretation: the estimated coefficient in an OLS model is an elasticity! This is demonstrated in the
FIGURE 7.5 This chart shows the anatomy of a boxplot. The distance inside the box is the Interquartile Range (IQR). The upper and lower fences are at most 1.5 times the IQR. Data points outside the fences are outliers.
TABLE 7.5 Five-number summary of net unit sales. Notice that the minimum is zero, which justifies adding 1 to the log transformation of net sales as discussed in the text. Since Q3 − Median > Median − Q1, net sales are right-skewed.

Minimum | Q1 | Median | Q3 | Maximum
0       | 15 | 24     | 36 | 338
Appendix. Also see Paczkowski [2018] for a discussion of this demand function, elasticities, and the log transformation.
Since one objective from the product manager is to estimate a price elasticity, you should graph net unit sales against the Pocket Price. I observed earlier that net unit sales were right-skewed but that a log transform shifted the distribution to a more normal one. The log of net unit sales as well as the log of the pocket price should therefore be used. This is a common transformation in empirical demand analysis because the slope of the line is then the elasticity. A scatter plot of the log of net sales versus the log of the pocket price is shown in Figure 7.9. The data cloud is dense because of the number of data points plotted, hence it is difficult to see any clear pattern. A regression line was overlaid on the graph but the line is barely discernible. This is a case of Large-N. See Carr et al. [1987] for a discussion of Large-N problems.
A Large-N case results in a dense scatter plot, making it difficult to see patterns. Options for handling this situation are:
FIGURE 7.6 This chart shows some examples of relationships revealed by a scatter plot. The upper left is the most desirable because it is the easiest to work with. The one on the lower left is the least desirable because there is no discernible pattern. The data for the top row are from the data set known as Anscombe's quartet. See https://en.wikipedia.org/wiki/Anscombe%27s_quartet. The data for the bottom row were randomly generated.
 1. select a random sample of the data;
 2. use a contour rather than a scatter plot; and
 3. use a hex bin plot.
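Two of these options, random sampling and the hex bin plot, can be sketched with matplotlib. The data below are simulated: a negative log-log price/quantity relationship buried in noise, standing in for the Case Study variables:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line for interactive use
import matplotlib.pyplot as plt

# Simulated Large-N demand data (assumed values, not the Case Study data).
rng = np.random.default_rng(3)
n = 70_000
log_price = rng.normal(0.0, 0.3, n)
log_sales = 3.0 - 2.0 * log_price + rng.normal(0.0, 0.5, n)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Option 1: a random sample of 500 points thins the cloud.
idx = rng.choice(n, size=500, replace=False)
axes[0].scatter(log_price[idx], log_sales[idx], s=8)
axes[0].set_title("Random sample (n = 500)")

# Option 3: a hex bin plot shades each hexagon by the count of points inside it.
hb = axes[1].hexbin(log_price, log_sales, gridsize=40)
axes[1].set_title("Hex bin plot")
fig.savefig("large_n.png")
```

In the sampled panel the negative slope is immediately visible; in the hex bin panel the same slope shows up as a dark diagonal band, with no points discarded.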
Sampling is powerful for reducing the Large-N problem and for allowing you to do cross-validation. Cross-validation means that you could draw several samples and use each one as a check against the others. You should see more or less the same relationships, not dramatic differences. There are two issues with sampling:
1. because you are sampling, you might not have the best view of the data, so outliers and the mode may be hidden;
FIGURE 7.7 This chart shows the distribution of net unit sales without a log transformation.
FIGURE 7.8 This chart shows the distribution of net unit sales with a (natural) log transformation.
2. you may actually introduce modes and outliers that are not really there, but appear to be there by luck of the draw of the sample.
FIGURE 7.9 This chart shows the relationship between the log of net unit sales and the log of the pocket price.
See Carr et al. [1987] for a discussion of using random sampling with Large-N data. These are problems with any sampling, whether for graphs, which is our concern here, or for surveys, clinical trials, and so forth. Figure 7.10 shows the same data but based on a sample of size n = 500. The negative relationship between net sales and pocket price is now clear.
An alternative is a contour plot, which shows contours of the data: different densities are shown with shaded areas, as in Figure 7.11. A drawback to contour plots is the hyperparameters defining the plot; different hyperparameter values give different impressions. See Dupont and W. Dale Plummer [2005] for a discussion and examples.
The hex bin plot is sometimes better because it involves drawing small hexagonal shapes, the coloring or shading of the shapes indicating the number of data points inside the hexagon. These shapes are less affected by hyperparameters and so are more robust. This is a much more effective way to visualize LargeN data. A hex bin plot is shown in Figure 7.12.
The product manager for the Case Study wanted some insight into discounts by marketing region. Boxplots will help reveal any differences. Figure 7.13 shows the distribution of total discounts by region. Notice that discounts are lowest in the Southern Region while the Midwest has a large number of low discounts. Also, the dispersion of discounts in the Southern Region is small relative to that in the other three regions. Let me drill down to verify the differences for the Southern Region. You can use Tukey's pairwise Honestly Significant Difference (HSD) test for this, which I mentioned in Chapter 2. The test adjusts for the fact that multiple pairwise tests are conducted, which increases the probability of making a false decision. The usual standard for a hypothesis test is α = 0.05; this is the probability of falsely rejecting the null hypothesis. When multiple tests are conducted, this probability becomes 1 − (1 − α)^k, where k is the number of pairwise tests. If k = 1, this simplifies to α. If there are n units to test, then there are k = n(n − 1)/2 tests. For the region problem, n = 4 so there are k = 6 pairwise comparison tests. Results of a Tukey HSD test of total discounts by region are shown in Table 7.6. Summary statistics are shown in Table 7.7 to help interpret Table 7.6. These results confirm the observation made from the boxplots in Figure 7.13. See Paczkowski [2016] for a discussion of Tukey's HSD test.

FIGURE 7.10 This chart shows the relationship between the log of net unit sales and the log of the pocket price using a random sample of n = 500 data points.
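One common implementation of Tukey's HSD is `pairwise_tukeyhsd` in statsmodels. The data below are simulated, with the South given a lower mean discount so the test has something to detect (assumed values, not the Case Study data):

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Simulated total discounts for four regions, 200 observations each.
rng = np.random.default_rng(11)
regions = np.repeat(["Midwest", "Northeast", "South", "West"], 200)
tdisc = np.concatenate([
    rng.normal(0.28, 0.05, 200),  # Midwest
    rng.normal(0.28, 0.05, 200),  # Northeast
    rng.normal(0.22, 0.05, 200),  # South: lower mean by construction
    rng.normal(0.28, 0.05, 200),  # West
])

# Tukey's HSD runs all k = 4*3/2 = 6 pairwise comparisons jointly
# at a family-wise alpha of 0.05.
result = pairwise_tukeyhsd(endog=tdisc, groups=regions, alpha=0.05)
print(result.summary())
```

The summary table lists each of the six pairs with its mean difference, adjusted p-value, and a reject/fail-to-reject flag; here the three pairs involving the South should be the ones flagged.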
A further step in the tracking analysis by region is to drill down on the components of total discounts for the Southern Region. Figure 7.14 shows boxplots of the components for the Southern Region. Notice that the dealer discount tends to be the largest while the order discount has the most variation. These discounts could affect sales and revenue.
Examine patterns
As rules of thumb, look for clumps, darkly shaded areas, or dispersion of the data, all depending on the type of graph you are using.
FIGURE 7.11 This chart shows the relationship between the log of net unit sales and the log of the pocket price using a contour plot. Smooth histograms, called kernel density plots, are shown on the two margins of the graph to emphasize the distributions. You can see that the distributions are almost normal.
Examine anomalies
Finally, look for anomalies, or outliers. These are suspicious points; some could be innocuous, while others could be pernicious, leading to wrong information, in other words, Poor Information. A boxplot is a great tool for showing outliers, as in Figure 7.16. The points above the upper fence are all considered to be outliers; in this case, there are many of them. There is also one point slightly below the lower fence, so there is only one low outlier. It is the mass at the top that is the issue in this example. These are cases that should be checked because they have the potential to adversely affect any statistics and modeling.