Case study data sources

As for all the components of the new product development process I described in this book, data are needed. Data are the driving force for everything a business must do. To quote W. E. Deming: "In God we trust. All others must bring data." The data sources are both internal and external.

Internal business data are the most common data used, not only in a dashboard but also for drill-down to root causes and predictive modeling. These internal data come from a number of different sources or databases, perhaps consolidated in a data mart, such as (but certainly not limited to):

  • orders;
  • products;
  • pricing;
  • marketing;
  • customer; and
  • many others.

The external databases are too numerous to mention. At the least, the data analyst should have access to one (or several) that provides household demographics and business firmographics: household income, age distribution, education distribution, and socioeconomic classification for households; revenue and number of employees for businesses.

The business analyst for the Case Study compiled a data dictionary for an initial analytical database she will use to address the product manager's requests. A data dictionary contains "metadata," which are data about the data. Metadata can be anything that documents the data. They are information about the distinct data elements, such as:

  • means of creation;
  • purpose of the data;
  • time and date of creation;
  • creator/author/keeper of the data;
  • placement on a network (electronic form);
  • where the data were created;
  • what standards were used to create the data; and
  • so forth.

For the Case Study, I will restrict the metadata to:

  • variable name;
  • possible values or value ranges;
  • source; and
  • mnemonic.

The mnemonic is the label used in data files and in statistical and modeling output. The data dictionary for the Case Study is shown in Table 7.1. A customer background data dictionary, reflecting data from the customer database, is shown in Table 7.2. The customer data will be merged, or joined, with the basic orders data based on a common identifier, the customer ID (CID).
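
As an illustration, this join is a few lines of pandas. The following is a minimal sketch, not the Case Study's actual code: the file names, and the DataFrame name analytical_df, are assumptions; the key column CID follows the data dictionaries.

```python
import pandas as pd

# Load the two primary data sets (file names are assumed for illustration)
orders = pd.read_csv("orders.csv")        # order detail, one row per order
customers = pd.read_csv("customers.csv")  # customer background data

# Merge (join) on the common identifier CID; a left join keeps every order
analytical_df = orders.merge(customers, on="CID", how="left")
```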

Case study data analysis

A basic data analysis should be done before any complex analysis such as estimating a regression model. There are two parts, or stages, discussed here in what could be interpreted as a chronological order but which can really be done in either order. The two parts are an examination of descriptive statistics and data visualization.

The basic descriptive statistics include:

  • count of the observations;
  • mean;
  • standard deviation;
  • minimum value;
  • 25th percentile or first quartile (Q1);
  • 50th percentile or median (Q2);
  • 75th percentile or third quartile (Q3); and
  • maximum value.

The last five are known as the Five Number Summary, a robust set of statistics unaffected by distribution skewness. Although unaffected, they help determine skewness and symmetry of the data distribution:

  1. Symmetric: (75% − 50%) = (50% − 25%)
  2. Right-Skewed: (75% − 50%) > (50% − 25%)
  3. Left-Skewed: (75% − 50%) < (50% − 25%)

The descriptive statistics for the discounts are shown in Table 7.3.
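
If the merged data are held in a pandas DataFrame, these statistics come directly from describe(). The following is a minimal sketch, assuming the analytical_df DataFrame from the earlier sketch and the discount mnemonics from Table 7.1:

```python
# analytical_df: the merged Case Study DataFrame (assumed from the earlier sketch)
discounts = analytical_df[["Ddisc", "Cdisc", "Odisc", "Pdisc", "Tdisc"]]

# count, mean, std, min, 25%, 50%, 75%, and max for each discount (as in Table 7.3)
print(discounts.describe().T)

# Quick skewness check from the quartiles: right-skewed if (Q3 - Q2) > (Q2 - Q1)
q1, q2, q3 = discounts.quantile([0.25, 0.50, 0.75]).values
print((q3 - q2) > (q2 - q1))
```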

A correlation is also helpful. Recall that a correlation shows association, not cause and effect. This distinction is important because you could have disastrous results if the wrong interpretation is placed on a correlation. Also, a correlation is only between pairs of variables, never more than two. A correlation value varies between -1.0 and +1.0. A whole set of pairwise correlations can be calculated and then shown in a correlation matrix. A correlation matrix for the discounts is shown in Table 7.4.
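
The pairwise correlation matrix is produced the same way. A sketch, again assuming the analytical_df DataFrame and the discount mnemonics:

```python
# analytical_df: the merged Case Study DataFrame from the earlier sketches
# Pearson correlations between the four discounts, as displayed in Table 7.4
corr_matrix = analytical_df[["Ddisc", "Cdisc", "Odisc", "Pdisc"]].corr()
print(corr_matrix.round(6))
```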

TABLE 7.1 This is a basic data dictionary for the Case Study showing the order detail for customers of living room blinds.

Variable | Values | Source | Mnemonic
Order Number | Nominal Integer | Order Sys | Onum
Customer ID | Nominal Integer | Customer Sys | CID
Transaction Date | MM/DD/YYYY | Order Sys | Tdate
Product Line ID | Five house rooms | Product Sys | Pline
Product Class ID | Item in line | Product Sys | Pclass
Units Sold | Units/order | Order Sys | Usales
Product Returned? | Yes/No | Order Sys | Return
Amount Returned | Number of units | Order Sys | returnAmount
Material Cost/Unit | USD cost of material | Product Sys | Mcost
List Price | USD list | Price Sys | Lprice
Dealer Discount | % to dealer (decimal) | Sales Sys | Ddisc
Competitive Discount | % for competition (decimal) | Sales Sys | Cdisc
Order Size Discount | % for size (decimal) | Sales Sys | Odisc
Customer Pickup Allowance | % for pickup (decimal) | Sales Sys | Pdisc
Total Discount | % discount | Sum of discounts | Tdisc
Pocket Price | USD | Lprice × (1 − Tdisc) | Pprice
Net Unit Sales | Net units/order | Usales − returnAmount | netUsales
Log of Net Unit Sales | Log net sales | log(netUsales) | log_netUsales
Log of Unit Sales | Log sales | log(Usales) | log_Usales
Log of Pocket Price | USD | log(Pprice) | log_Pprice
Revenue | USD | Usales × Pprice | Rev
Contribution | USD | Rev − Mcost | Con
Contribution Margin | % | Con/Rev | CM
Net Revenue | USD | (Usales − returnAmount) × Pprice | netRev
Lost Revenue | USD | Rev − netRev | lostRev

A second step in data analysis is data visualization. This is the latest "buzz phrase": everyone talks about data visualization. It is really just looking at your data using graphs. The graphs, however, could be somewhat complex and, hopefully, informative in that they convey Rich Information. Graphs are generally more informative than tables. Small tables, like a 2 × 2, may be sufficient for presentations, but they usually cannot provide the Rich Information needed by decision makers.

TABLE 7.2 This is a basic customer data dictionary for the Case Study. It shows the customer features that could eventually be used with the sales data to provide a complete picture of the customers and their orders. The Buyer Rating is based on the sales force evaluations of ease of doing business with that customer, while customer satisfaction is based on an annual satisfaction study.

Variable | Values | Source | Mnemonic
Customer ID | Nominal Integer | Customer Sys | CID
State of Business | Two-character state code | Customer Sys | State
ZIP Code of Location | Five-digit ZIP (Character) | Customer Sys | ZIP
Marketing Region | Four Census Regions | Customer Sys | Region
In Loyalty Program | Yes/No | Customer Sys | loyaltyProgram
Buyer Rating | Poor, Good, Excellent | Customer Sys | buyerRating
Buyer Satisfaction | Five-Point Likert | Customer Sys | buyerSatisfaction

TABLE 7.3 The descriptive statistics for the four discounts for the Case Study. The variable Tdisc is the total discount, the sum of the four discounts; it was added after the two primary data sets were merged into one. Notice that the dealer discount (Ddisc) is right-skewed. Also note that the sample sizes differ slightly. The full data set has 70,270 records, which indicates that there are some missing data.

      | count | mean | std  | min  | 25%  | 50%  | 75%  | max
Ddisc | 70262 | 0.12 | 0.04 | 0.04 | 0.09 | 0.12 | 0.16 | 0.22
Cdisc | 70261 | 0.07 | 0.02 | 0.04 | 0.05 | 0.07 | 0.09 | 0.10
Odisc | 70266 | 0.05 | 0.01 | 0.02 | 0.04 | 0.05 | 0.06 | 0.08
Pdisc | 70268 | 0.04 | 0.01 | 0.02 | 0.03 | 0.04 | 0.05 | 0.06
Tdisc | 70270 | 0.28 | 0.05 | 0.02 | 0.24 | 0.28 | 0.32 | 0.43

TABLE 7.4 This is the correlation matrix for the four discounts. The total discount was not included because, as the sum of discounts, it will be highly correlated with the four discounts and so it will not provide any new information.

      | Ddisc     | Cdisc     | Odisc    | Pdisc
Ddisc | 1.000000  | -0.000919 | 0.001962 | -0.003748
Cdisc | -0.000919 | 1.000000  | 0.002217 | 0.000396
Odisc | 0.001962  | 0.002217  | 1.000000 | 0.002888
Pdisc | -0.003748 | 0.000396  | 0.002888 | 1.000000

Larger tables are needed, but we have difficulty seeing patterns in large tables, so they have a drawback. Generally, visuals are superior to tables of numbers because we are visual creatures. We can see, for example, patterns in graphs much faster and more easily than in large tables. See Cleveland [1993] and Cleveland [1994] for thorough accounts of data visualization. Also see Tufte [1983] for the authoritative treatise on effective data visualization. It is worthwhile also to look at Wickham [2009] for useful discussions and implementations in R of the Grammar of Graphics due to Wilkinson [2005].

I do not want to oversell visual displays, either. Not all visuals are created equal or are the best for revealing Rich Information. There are visual display issues, such as poor color choices and excessive use of 3-D, that counter the benefits of visualization. Nonetheless, you cannot do effective Business Analytics without data visualization.

To say you need to create graphs to visualize data is insufficient. A logical question to ask is: "What do I look for in graphs?" I have my own list of "clues" to look for in graphs that tell you about the underlying processes generating the data and about data issues you need to note.

  1. Distributions
     • Symmetry/skewness.
  2. Relationships
     • Correlations or causality statements between two or more variables.
  3. Trends
     • Developments or changes over time (e.g., a tracking study).
  4. Patterns
     • Groupings of objects as in market segments.
  5. Anomalies
     • Outliers or unexpected values.

These are not in any special order. In fact, after a while you will unconsciously, almost instinctively, look for them as soon as you produce a graph. I will discuss these one at a time in the following sections.

Examine distributions

First on my list is an examination of the distribution of the data. The distribution is where the data are centered and how they are spread out from that point. Two display tools that reveal the distribution are the histogram and the boxplot. Both are introduced in a basic statistics course with the histogram being introduced first, almost within the first week, followed by the boxplot a few weeks later after summary statistics are introduced.

The histogram, although a basic almost elementary graph for one variable, has a major issue - the size, and therefore the number, of bins to display where the bins are represented by bars. Changing the size of the bins can dramatically change how the histogram is interpreted and, therefore, what kind of Rich Information it will provide. Most statistics packages have their own algorithm for determining bin sizes. One rule is the Freedman-Diaconis rule for bin width:

h = 2 × IQR(x) / n^(1/3)

where h = bin width, IQR(x) = Interquartile Range (= Q3 − Q1) for variable x, and n is the number of observations. The number of bins is then

number of bins = (max(x) − min(x)) / h, rounded up to the nearest integer.

You can set your own bin size in most data visualization software, but the default is usually sufficient. The Python package Seaborn uses this rule. Another rule is Sturges' Rule for determining the number of bins:

k = ⌈log2(n)⌉ + 1

where n is the sample size. See the short note by Hyndman [1995] about issues with these rules.
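
Both rules are easy to compute directly. The following is a minimal NumPy sketch under the assumption that x is a numeric array, such as net unit sales; the simulated data at the end are purely illustrative.

```python
import numpy as np

def freedman_diaconis_bins(x):
    """Number of bins from the Freedman-Diaconis rule: h = 2 * IQR / n^(1/3)."""
    x = np.asarray(x)
    q1, q3 = np.percentile(x, [25, 75])
    h = 2 * (q3 - q1) / len(x) ** (1 / 3)          # bin width
    return int(np.ceil((x.max() - x.min()) / h))   # number of bins

def sturges_bins(x):
    """Number of bins from Sturges' rule: k = ceil(log2(n)) + 1."""
    return int(np.ceil(np.log2(len(x)))) + 1

# Example with simulated right-skewed data (for illustration only)
x = np.random.lognormal(mean=3, sigma=0.5, size=70270)
print(freedman_diaconis_bins(x), sturges_bins(x))
```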

The shape of the histogram indicates the shape of the data’s distribution. There are three possibilities. The distribution could be:

  1. symmetric (the most desirable);
  2. right-skewed; or
  3. left-skewed.

A distribution is symmetric if the right and left sides of the histogram are approximately mirror images of each other. A distribution is skewed to the right if the right side of the histogram extends much farther out than the left; there is a long right tail. A distribution is skewed to the left if the left side of the histogram extends much farther out than the right; there is a long left tail. Skewness is an issue because skewed data impact many summary statistics, especially the mean or average. Possible shapes are shown in Figure 7.4. The lower right is almost uniform while the lower left is almost normal. The one on the lower left is the most desirable.

Because of the bin issue, and also because a histogram is for a single variable, a better plot is the boxplot, which does not have these issues. This creative display shows the Five Number Summary mentioned earlier, and it can also show the effect of a second, qualitative variable. Figure 7.5 shows the anatomy of a boxplot.
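
Both displays are one-liners in Seaborn. A sketch, assuming the merged analytical_df DataFrame and the mnemonics from Tables 7.1 and 7.2; Seaborn picks the histogram bin width automatically.

```python
import seaborn as sns
import matplotlib.pyplot as plt

# analytical_df: the merged Case Study DataFrame (assumed from earlier sketches)
fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Histogram of net unit sales; bin width chosen automatically by Seaborn
sns.histplot(data=analytical_df, x="netUsales", ax=axes[0])

# Boxplot of the total discount split by a second, qualitative variable (Region)
sns.boxplot(data=analytical_df, x="Region", y="Tdisc", ax=axes[1])

plt.tight_layout()
plt.show()
```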

Examine relationships

The best way to examine and discover relationships between two or more quantitative variables is with a scatter plot. This shows combinations of values of the variables as points and all the points will show a relationship, if one exists. Figure 7.6 illustrates possibilities.

There is something to notice about the one in the lower left of Figure 7.6: there are a lot of data points! Maybe there is something hidden inside this graph, some Rich Information, hidden because there are too many data points. I will comment on this shortly.

When you see a nonlinear relationship, such as the one on the upper right of Figure 7.6, you could make it linear with a log transformation. The log actually does two things:

FIGURE 7.4 Three possible shapes for a histogram. The top left is left skewed; the top right is right-skewed; the two bottom ones are symmetric. The bottom left histogram is normal while the bottom right is uniform.

  1. Straighten, or linearize, a curve
     • Linear relationships are easier to work with.
     • Linear relationships make it easier to handle probability distributions.
  2. Stabilize variances
     • Make the distributions more evenly spread.
     • Remove extreme skewness.

Demonstrations of these two properties are in the Appendix to this chapter.

Figure 7.7 shows the distribution of the net unit sales for the Case Study. Net sales are total, or gross, unit sales less returns. Notice that net sales are highly right-skewed, which distorts the impression of the data. Any statistical analysis is jeopardized by this skewness. Transforming the data with the natural log normalizes the data, so when you model net unit sales you should use a log transformation. Figure 7.8 shows the same data but on a natural log scale. The distribution is now fairly normal. As a recommendation, use the natural log of one plus the argument, or ln(1 + x), rather than ln(x), to avoid cases where x = 0: ln(0) is undefined, which is meaningless, but ln(1) = 0 so you have a meaningful number. For the Case Study, it is possible that net sales could be zero if the entire order was returned. This is, in fact, the case as shown by the Five Number Summary in Table 7.5.
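
A sketch of the recommended transformation: NumPy's log1p computes ln(1 + x) directly. The column names follow the Table 7.1 mnemonics, the analytical_df DataFrame is assumed from the earlier sketches, and the pocket price is assumed to be strictly positive.

```python
import numpy as np

# log1p(x) = ln(1 + x), which is defined even when net sales are zero
# (for example, when an entire order was returned)
analytical_df["log_netUsales"] = np.log1p(analytical_df["netUsales"])

# Pocket price is assumed strictly positive, so a plain log is safe here
analytical_df["log_Pprice"] = np.log(analytical_df["Pprice"])
```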

Log transforming data also produces a nice, useful economic interpretation: the estimated coefficient in an OLS model is an elasticity! This is demonstrated in the

FIGURE 7.5 This chart shows the anatomy of a boxplot. The distance inside the box is the Interquartile Range (IQR). The upper and lower fences are at most 1.5 times the IQR. Data points outside the fences are outliers.

TABLE 7.5 Five Number Summary of net unit sales. Notice that the minimum is zero, which justifies adding 1 inside the log transformation of net sales as discussed in the text. Since Q3 − Median > Median − Q1, net sales are right-skewed.

Minimum | Q1 | Median | Q3 | Maximum
0 | 15 | 24 | 36 | 338

Appendix. Also see Paczkowski [2018] for a discussion of this demand function, elasticities, and the log transformation.

Since one objective from the product manager is to estimate a price elasticity, you should graph net unit sales against the Pocket Price. I observed earlier that net unit sales were right-skewed but that a log transform shifted the distribution to a more normal one. The log of net unit sales as well as the log of pocket price should therefore be used. This is a common transformation in empirical demand analysis because the slope of the line in log-log space is the elasticity. A scatter plot of the log of net sales versus the log of the pocket price is shown in Figure 7.9. The data cloud is dense because of the number of data points plotted, hence it is difficult to see any clear pattern. A regression line was overlaid on the graph, but the line is barely discernible. This is a case of Large-N. See Carr et al. [1987] for a discussion of Large-N problems.

A Large-N case results in a dense scatter plot making it difficult to see patterns. Options for handling this situation are:

FIGURE 7.6 This chart shows some examples of relationships revealed by a scatter plot. The upper left is the most desirable because it is the easiest to work with. The one on the lower left is the least desirable because there is no discernible pattern. The data for the top row are from the data set known as Anscombe's quartet. See https://en.wikipedia.org/wiki/Anscombe%27s_quartet. The data for the bottom row were randomly generated.

  1. select a random sample of the data;
  2. use a contour plot rather than a scatter plot; and
  3. use a hex bin plot.

Sampling is powerful for reducing the Large-N problem and allowing you to do cross-validation. Cross-validation means that you could draw several samples and use each one as a check against the others. You should see the same relationships, more or less, but not dramatic differences. There are two issues with sampling:

1. because you are sampling, you might not have the best view of the data, so outliers and the mode may be hidden;

FIGURE 7.7 This chart shows the distribution of net unit sales without a log transformation.

FIGURE 7.8 This chart shows the distribution of net unit sales with a (natural) log transformation.

2. you may actually introduce modes and outliers that are not really there, but appear to be there by luck of the draw of the sample.

FIGURE 7.9 This chart shows the relationship between the log of net unit sales and the log of the pocket price.

See Carr et al. [1987] for a discussion of using random sampling with Large-N data. These are problems with any sampling, whether for graphs, which is our concern, or for surveys, or clinical trials, and so forth. Figure 7.10 shows the same data but based on a sample size of n = 500. The negative relationship between net sales and pocket price is clear.
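
A sketch of the sampling option, drawing n = 500 records and overlaying a regression line; the column names follow the Table 7.1 mnemonics and the analytical_df DataFrame is assumed from the earlier sketches.

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Draw a random sample to thin out the Large-N cloud (seed fixed for reproducibility)
sampled = analytical_df.sample(n=500, random_state=42)

# Scatter of log net unit sales against log pocket price with an OLS line overlaid
sns.regplot(data=sampled, x="log_Pprice", y="log_netUsales",
            scatter_kws={"s": 10}, line_kws={"color": "red"})
plt.show()
```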

An alternative is a contour plot which shows contours of the data. Different densities are shown with shaded areas as in Figure 7.11. A drawback to contour plots is the hyperparameters defining the plot: different hyperparameter values give different impressions. See Dupont and W. Dale Plummer [2005] for a discussion and examples.

The hex bin plot is sometimes better because it involves drawing small hexagonal shapes, the coloring or shading of the shapes indicating the number of data points inside the hexagon. These shapes are less affected by hyperparameters and so are more robust. This is a much more effective way to visualize Large-N data. A hex bin plot is shown in Figure 7.12.
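
A hex bin version of the same plot can be drawn directly with Matplotlib (Seaborn's jointplot also supports kind="hex"). A minimal sketch, under the same assumptions as the previous snippets:

```python
import matplotlib.pyplot as plt

# Hexagonal binning of the full Large-N data; darker hexagons hold more points
plt.hexbin(analytical_df["log_Pprice"], analytical_df["log_netUsales"],
           gridsize=40, cmap="Blues", mincnt=1)
plt.colorbar(label="Number of observations")
plt.xlabel("Log of Pocket Price")
plt.ylabel("Log of Net Unit Sales")
plt.show()
```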

The product manager for the Case Study wanted some insight into discounts by marketing regions. Boxplots will help reveal any differences. Figure 7.13 shows the distribution of total discounts by region. Notice that discounts are lowest in the Southern Region while the Midwest has a large number of low discounts. Also, the dispersion of discounts in the Southern Region is small relative to that in the other three regions. Let me drill down to verify the differences for the Southern Region. You can use Tukey's pairwise Honestly Significant Difference (HSD) test for this, which I mentioned in Chapter 2. The test adjusts for the fact that multiple pairwise tests

FIGURE 7.10 This chart shows the relationship between the log of net unit sales and the log of the pocket price using a random sample of n = 500 data points.

are conducted, which increases the probability of making a false decision. In fact, the usual standard for a hypothesis test is α = 0.05; this is the probability of falsely rejecting the null hypothesis. When multiple tests are conducted, this probability becomes 1 − (1 − α)^k where k is the number of pairwise tests. If k = 1, then this simplifies to α. If there are n units to test, then there are k = n × (n − 1)/2 tests. For the region problem, n = 4 so there are k = 6 pairwise comparison tests. Results of a Tukey HSD test of total discounts by region are shown in Table 7.6. Summary statistics are shown in Table 7.7 to help interpret Table 7.6. These results confirm the observation made from the boxplots in Figure 7.13. See Paczkowski [2016] for a discussion of Tukey's HSD test.
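
One way such a test could be run is with statsmodels. This is a sketch, not the code behind Table 7.6; the column names are the assumed mnemonics and analytical_df is the assumed merged DataFrame.

```python
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Drop records with missing values, then compare mean total discount by region
df = analytical_df.dropna(subset=["Tdisc", "Region"])
tukey = pairwise_tukeyhsd(endog=df["Tdisc"], groups=df["Region"], alpha=0.05)
print(tukey.summary())   # one row per pairwise comparison: 4 regions -> 6 tests
```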

A further step in the tracking analysis by region is to drill down on the components of total discounts for the Southern Region. Figure 7.14 shows boxplots of the components for the Southern Region. Notice that the dealer discount tends to be the largest while the order discount has the most variation. These discounts could affect sales and revenue.

Examine patterns

As rules-of-thumb, look for clumps, or darkly shaded areas, or dispersion of the data, all depending on the type of graph you are using.

FIGURE 7.11 This chart shows the relationship between the log of net unit sales and the log of the pocket price using a contour plot. Smooth histograms, called kernel density plots, are shown on the two margins of the graph to emphasize the distributions. You can see that the distributions are almost normal.

Examine anomalies

Finally, look for anomalies, or outliers. These are suspicious points, some of which could be innocuous, while others could be pernicious leading to the wrong information; in other words, Poor Information. A boxplot is a great tool for showing outliers as in Figure 7.16. The points above the upper fence are all considered to be outliers. In this case, there are many of them. There is also one point slightly below the lower fence so there is only one low outlier. It is the mass at the top that is the issue in this example. These are cases that should be checked because they have the potential to adversely affect any statistics and modeling.
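
The boxplot's outlier rule is easy to apply directly if you want to list the suspicious records. A sketch using the 1.5 × IQR fences described in Figure 7.5, with the netUsales mnemonic and the analytical_df DataFrame assumed as before:

```python
# Flag points outside the boxplot fences for net unit sales
q1, q3 = analytical_df["netUsales"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = analytical_df[(analytical_df["netUsales"] < lower) |
                         (analytical_df["netUsales"] > upper)]
print(len(outliers), "potential outliers to check")
```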
