Forecast error analysis
The previous sections focused on models or methods using the training portion of the data set. Recall that a portion of the time series data set was reserved, if possible, for testing. There are many ways to test a forecast, but they all basically rely on measures built on the forecast error at time t, defined as

Forecast Error_t = Actual_t − Forecast_{T'}(t)

where Actual_t is the actual value in the testing data set and Forecast_{T'}(t) is the forecasted value for period t of the testing period. The forecasted value is based on the training data that ended at time T'. The error measurement is illustrated in Figure 6.6.
FIGURE 6.6 This time line illustrates the values used for measuring forecast accuracy during model development. The training period, as described earlier, extends from t_0 to T' and the testing period from T' to T. At time T' + 1, the known actual is A_{T'+1} and the one-step ahead forecast into the testing period at this point in time is F_{T'}(1). The error at T' + 1 is the difference A_{T'+1} − F_{T'}(1).
The forecast error is sometimes called the out-of-sample error because it is based on data not used in the training (i.e., outside that data or sample).
There are a number of forecast error statistics, each based on the forecast error A_{T'+i} − F_{T'}(i). Some analysts calculate an error for each step ahead into the testing period and then graph these errors, perhaps with a bar chart, the goal being to see patterns in the errors. I do not recommend this because patterns may be difficult to discern. I recommend any one of the following:
Percentage Error: PE_{T'+i} = 100 × (A_{T'+i} − F_{T'}(i)) / A_{T'+i}

Mean Error: ME = (1/h) × Σ_{i=1}^{h} (A_{T'+i} − F_{T'}(i))

Mean Percentage Error: MPE = (100/h) × Σ_{i=1}^{h} (A_{T'+i} − F_{T'}(i)) / A_{T'+i}

Mean Absolute Percentage Error: MAPE = (100/h) × Σ_{i=1}^{h} | (A_{T'+i} − F_{T'}(i)) / A_{T'+i} |

Median Absolute Percentage Error: MdAPE = Median value of 100 × | (A_{T'+i} − F_{T'}(i)) / A_{T'+i} |, i = 1, 2, ..., h

Mean Square Error: MSE = (1/h) × Σ_{i=1}^{h} (A_{T'+i} − F_{T'}(i))²
Since the testing period only extends from time T' to time T, the steps ahead, h, can only go as far as time T, so 1 ≤ h ≤ T − T'.
The ME and MPE are useful supplements to a count of the frequency of under- and over-forecasts: the ME gives the average of the forecast errors expressed in the units of measurement of the data, while the MPE gives the average of the forecast errors as a percentage and so is unit-free. The Root Mean Square Error (RMSE), the square root of the MSE, is a popular measure.
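These error measures can be computed in a few lines. The following Python sketch is mine, not from the text; the function and data names are hypothetical, and the percentage-based measures assume no actual value is zero:

```python
import statistics

def error_measures(actuals, forecasts):
    """Compute the forecast error statistics defined above.

    actuals[i] is A_{T'+i+1} in the testing period; forecasts[i] is
    F_{T'}(i+1), the (i+1)-step ahead forecast made at time T'.
    """
    errors = [a - f for a, f in zip(actuals, forecasts)]
    pct = [100 * e / a for e, a in zip(errors, actuals)]  # assumes no zero actuals
    h = len(errors)
    mse = sum(e ** 2 for e in errors) / h
    return {
        "ME": sum(errors) / h,
        "MPE": sum(pct) / h,
        "MAPE": sum(abs(p) for p in pct) / h,
        "MdAPE": statistics.median(abs(p) for p in pct),
        "MSE": mse,
        "RMSE": mse ** 0.5,
    }

# Hypothetical testing-period actuals and the forecasts made at time T'
actuals = [102.0, 98.0, 105.0, 110.0]
forecasts = [100.0, 100.0, 100.0, 100.0]
stats = error_measures(actuals, forecasts)
```

Here the errors are 2, −2, 5, and 10, so the ME is 3.75 in the units of the data while the MSE is 33.25 in squared units.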
See Levenbach and Cleary [2006] for these statistics and some application examples. They also discuss the difference between model fit error and forecast error: the former applies to the data used for training and the latter to the data used for testing. The distinction should be clear.
I defined these error measures in terms of the training and testing data sets. The same measures are used when the full forecast is developed beyond period T. I mention this in Chapter 7.
Software
There are many software packages that handle time series models. JMP, SAS, Stata, R, and Python are excellent options. Hyndman and Athanasopoulos [2018] provide an excellent treatment of R for time series analysis and forecasting.
Summary
This is a very long and complex chapter. The forecasting methods outlined here are sophisticated enough that many of them warrant their own book. This is especially true of the ARIMA family of models. Nonetheless, new product development requires a sales forecast before launch, so these methods should be studied and considered.
Appendix
This appendix summarizes a general class or family of time series models used for forecasting. Three models mentioned in this chapter, Naive 1 (NF1), constant mean, and exponential smoothing, are special cases of an ARIMA specification. The first subsection reviews two operators commonly used in time series analysis that are used throughout this Appendix. The remaining sections review various models using these operators.
Time series definition
Following Parzen [1962], let S be a set of discrete, equidistant points in time at which data values are measured. Values collected or measured in continuous time are certainly possible, but most applications rely on discrete measurements. We write the set as S = {1, 2, 3, ..., T}, where T is the number of observations. An observation is a realization at time t of an underlying process. This realization is denoted as Y_t. The set of observations {Y_t, t ∈ S} is a time series. This is simply written as Y_1, Y_2, ..., Y_T. See Parzen [1962] for the definition.
Backshift and differencing operators
The backshift operator is a convenient tool to use when dealing with time series models. Another operator, the differencing operator, is related to the backshift operator. I present a high-level overview of both of them in this section.
Denote a time series of T observations as Y_1, Y_2, ..., Y_T. The backshift operator (B), when applied to a time series, produces a new series of lagged values: BY_t = Y_{t−1}. B can be applied successively. For example,

B²Y_t = B(BY_t) = BY_{t−1} = Y_{t−2}.

The exponent for B means to repeatedly apply the backshift in order to move backward a number of time periods equal to the "power", but the exponent is not a power; it just indicates the amount of backward shift. In general, based on reapplication of the basic definition, you have

B^k Y_t = Y_{t−k}.

Observe that B⁰ = 1 so that B⁰Y_t = Y_t; B⁰ is the identity operator. Also note that if c is a constant, then B^k c = c since a constant cannot be shifted by definition.
The differencing operator, ∇, gives the change in Y_t from the previous period: ∇Y_t = Y_t − Y_{t−1}. The differencing and backshift operators are related. Notice that

∇Y_t = Y_t − Y_{t−1} = Y_t − BY_t = (1 − B)Y_t.

Therefore, ∇ = 1 − B by equating "coefficients". The ∇ operator can also be applied successively. For example,

∇²Y_t = ∇(∇Y_t) = ∇(Y_t − Y_{t−1}) = Y_t − 2Y_{t−1} + Y_{t−2}.
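The two operators can be sketched on a list of numbers. This Python fragment is my own illustration (the names are mine); it represents a series as a list and drops the leading values for which the shifted series is undefined:

```python
def backshift(y, k=1):
    """Apply B^k: shift the series back k periods.

    The first k values of the shifted series are undefined, so the
    result is the original series truncated to its first T - k values,
    aligned with times k+1, ..., T.
    """
    return y[:-k] if k > 0 else y[:]

def difference(y):
    """Apply the differencing operator: (1 - B)Y_t = Y_t - Y_{t-1}."""
    return [y[t] - y[t - 1] for t in range(1, len(y))]

y = [3, 5, 9, 15]
assert difference(y) == [2, 4, 6]
# Applying the operator twice removes a quadratic-free trend pattern:
assert difference(difference(y)) == [2, 2]
```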
Random walk model and naive forecast
A random walk model is

Y_t = Y_{t−1} + e_t

where the error term is white noise: E(e_t) = 0, V(e_t) = σ_e², and COV(e_t, e_{t−j}) = 0, ∀t, j. Starting from some base number, Y_0, the series evolves as

Y_1 = Y_0 + e_1
Y_2 = Y_1 + e_2 = Y_0 + e_1 + e_2
...
Y_T = Y_0 + Σ_{i=1}^{T} e_i.
The last value, Y_T, is the accumulation of all the past white noise terms. This is the basis of the random walk theory of stock market prices. See Malkiel [1999] for the classic treatment of the stock market and random walks.
If e_t is white noise following a normal distribution, e_t ~ N(0, σ_e²), then Y_t is also normally distributed by the reproductive property of normals. This property states that a linear combination of normally distributed random variables is itself normally distributed. See Dudewicz and Mishra [1988] for the reproductive property.
Using the backshift operator, B, the random walk is

(1 − B)Y_t = e_t.

Since (1 − B)^{−1} = 1 + B + B² + ... = Σ_{i=0}^{∞} B^i, then

Y_t = Σ_{i=0}^{∞} B^i e_t = Σ_{i=0}^{∞} e_{t−i}

which can be truncated to going back to a finite past for practical purposes. Note that Y_0 is so far back (i.e., the infinite past) that it can be ignored. The infinite series Σ_{i=0}^{∞} B^i must converge to some value, K; otherwise the series explodes, which is impractical. More importantly, if the series diverges, then E(Y_t) = 0 × ∞, which is an indeterminate form. Nelson [1973] notes that this is a condition that must hold. Assume that the infinite sum is truncated at a finite number of terms, t. Then you have

Y_t = Σ_{i=0}^{t−1} e_{t−i}.

The mean and variance are then (note that Σ_{i=0}^{t−1} B^i converges because it is a finite sum)

E(Y_t) = 0 and V(Y_t) = t × σ_e².
Notice that the variance gets bigger the further out in time you go.
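The growing variance is easy to verify by simulation. The following Python sketch is my own (the names are mine, and it assumes normal white noise with σ_e = 1); it generates many random walk realizations and checks that the variance of Y_t at t = 10 is approximately t × σ_e² = 10 while the mean stays near zero:

```python
import random

random.seed(7)

def simulate_random_walk(T, sigma=1.0, y0=0.0):
    """One realization of Y_t = Y_{t-1} + e_t with N(0, sigma^2) white noise."""
    y, path = y0, []
    for _ in range(T):
        y += random.gauss(0.0, sigma)
        path.append(y)
    return path

# Monte Carlo check: V(Y_t) should be roughly t * sigma^2
n_reps, T = 20000, 10
finals = [simulate_random_walk(T)[-1] for _ in range(n_reps)]
mean = sum(finals) / n_reps
var = sum((x - mean) ** 2 for x in finals) / n_reps
# mean is near 0; var is near T * sigma^2 = 10
```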
Suppose you use the random walk model to forecast. What form will those forecasts take and how are they derived? Let the history X = {..., Y_{T−1}, Y_T} be given. Note for what follows that the actual Y_{T+1} is a random variable. By the random walk formula, Y_T(1) = Y_T + e_{T+1}. Y_T is known, so it is not a random variable, but e_{T+1} is an unknown random variable. Therefore, the one-step ahead forecast, Y_T(1), is a random variable. You can describe the probability distribution of the one-step ahead forecast, Y_T(1), conditioned on its past history. Its expected value is

E[Y_T(1) | X] = E[Y_T + e_{T+1} | X] = Y_T.

The expected number in the next period is just the current number. I will now drop the X to simplify notation, but remember that the forecasts are conditioned on this history.
The variance of Y_T(1) is

V[Y_T(1)] = V(Y_T + e_{T+1}) = V(e_{T+1}) = σ_e².

Remember that Y_T is an actual, so it is nonstochastic and thus has no variance.

If e_t is white noise following a normal distribution, then

Y_T(1) ~ N(Y_T, σ_e²)

by the reproductive property of normals. A 95% confidence interval is simply

Y_T ± 1.96 × σ_e.
Now extend the forecast to two steps ahead. Then

Y_T(2) = Y_T + e_{T+1} + e_{T+2}, so E[Y_T(2)] = Y_T.

The variance is

V[Y_T(2)] = V(e_{T+1}) + V(e_{T+2}) = 2 × σ_e².

The forecast is just the current value, Y_T, and the variance is twice σ_e². The forecast for h steps ahead is still Y_T and the variance is h × σ_e². Using the reproductive property of normals, Y_T(h) ~ N(Y_T, h × σ_e²), so the width of a forecast confidence interval grows with the square root of the number of steps ahead.
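The h-step ahead random walk (naive) forecast and its widening 95% interval can be sketched directly. This Python fragment is mine (the names are hypothetical, and σ_e is assumed known rather than estimated):

```python
def random_walk_forecast(y_T, h, sigma):
    """h-step ahead random walk forecast with a 95% interval.

    The point forecast is Y_T at every horizon; the interval half-width
    1.96 * sigma * sqrt(h) grows with the square root of h.
    """
    half = 1.96 * sigma * h ** 0.5
    return y_T, y_T - half, y_T + half

# Four steps ahead from Y_T = 100 with sigma_e = 2:
point, lo, hi = random_walk_forecast(100.0, 4, sigma=2.0)
# half-width = 1.96 * 2 * sqrt(4) = 7.84
```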
Random walk with drift
Consider the random walk model

Y_t = δ + Y_{t−1} + e_t

where δ is a constant. This is the random walk with drift model, where δ is the drift parameter. Starting at Y_0 as before, through successive substitution you get

Y_t = Y_0 + t × δ + Σ_{i=1}^{t} e_i.

The series keeps drifting by δ each period.
Using the backshift notation, you have

(1 − B)Y_t = δ + e_t
Y_t = (1 − B)^{−1}(δ + e_t).

Truncating the summations to t periods yields

Y_t = Σ_{i=0}^{t−1} B^i δ + Σ_{i=0}^{t−1} B^i e_t
    = t × δ + Σ_{i=0}^{t−1} e_{t−i}.

Note that Σ_{i=0}^{t−1} B^i δ = δ × (1 + B¹ + B² + ... + B^{t−1}), where the expression in parentheses has t terms; since a constant cannot be shifted, each term equals δ and the sum is t × δ. The second line then simplifies as shown.

The mean and variance are then

E(Y_t) = t × δ and V(Y_t) = t × σ_e².
The 1-step ahead forecast is Y_T(1) = Y_T + δ + e_{T+1}, with E[Y_T(1)] = Y_T + δ. It is easy to show that E[Y_T(h)] = Y_T + h × δ. The variance is V[Y_T(h)] = h × σ_e². So the drift affects the level, not the variance, of the forecast.
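A drift forecast can be sketched the same way. In this Python fragment (my own names; σ_e assumed known), the drift is estimated as the average one-period change in the history, a common practical choice; the drift shifts the level of the forecast while the interval width is unchanged from the no-drift case:

```python
def drift_forecast(y, h, sigma):
    """h-step ahead forecast from a random walk with drift.

    delta is estimated as the average change per period, which equals
    (Y_T - Y_1) / (T - 1) because intermediate changes telescope.
    """
    delta = (y[-1] - y[0]) / (len(y) - 1)
    point = y[-1] + h * delta           # level shifts by h * delta
    half = 1.96 * sigma * h ** 0.5      # same width as without drift
    return point, point - half, point + half

y = [10.0, 12.0, 13.0, 16.0]            # average change = 2.0 per period
point, lo, hi = drift_forecast(y, 2, sigma=1.0)
# point = 16 + 2 * 2 = 20
```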
Constant mean model
For the constant mean model,

Y_t = μ + e_t

the h-step ahead forecast is the estimated mean, Ȳ, so that E[Y_T(h)] = μ since E(e_t) = 0, ∀t. The h-step ahead forecast is unbiased. The variance of this forecast is

V[Y_T(h)] = V(Ȳ) = σ_e² / T.

See Gilchrist [1976] for this demonstration.
The forecast error is

e_T(h) = Y_{T+h} − Y_T(h) = (μ − Ȳ) + e_{T+h}

so E[e_T(h)] = 0. The mean square error (MSE) of the forecast is

MSE = σ_e² × (1 + 1/T).

You can now write a 95% confidence interval statement as

Ȳ ± 1.96 × σ_e × √(1 + 1/T).

An unbiased estimate of σ_e is the sample standard deviation of the historical data. See Gilchrist [1976] for a discussion of this model.
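A minimal Python sketch of the constant mean forecast follows (the names are mine; the sample standard deviation stands in for σ_e, and the √(1 + 1/T) factor reflects that the mean is estimated from T observations):

```python
import statistics

def constant_mean_forecast(history, conf=1.96):
    """Forecast under the constant mean model Y_t = mu + e_t.

    The point forecast at every horizon is the sample mean; the 95%
    interval uses s * sqrt(1 + 1/T), with s the sample standard
    deviation estimating sigma_e.
    """
    T = len(history)
    mu_hat = statistics.fmean(history)
    s = statistics.stdev(history)       # divides by T - 1
    half = conf * s * (1 + 1 / T) ** 0.5
    return mu_hat, mu_hat - half, mu_hat + half

point, lo, hi = constant_mean_forecast([9.0, 11.0, 10.0, 10.0])
# point is the sample mean, 10.0
```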
The ARIMA family of models
I will now consider a broad, general class of models that are sometimes called:

- stochastic time series models;
- time series models;
- Autoregressive Integrated Moving Average (ARIMA) models; or
- Box-Jenkins models

although ARIMA is the most common reference. In this class of models, the random element, e_t, plays a dominant role rather than being just an add-on error (i.e., disturbance) to a strictly deterministic model. The ARIMA model is actually a family of models with different members of the family (cousins, if you wish) that themselves define more specific cases. The ARIMA model has parameters whose specification defines these family members. The parameters are p, d, and q, which I specify below.
Consider the model

Y_t = φY_{t−1} + e_t

where e_t is white noise. If φ = 1, then this is a random walk. From backward substitution, you get

Y_t = φ^t Y_0 + Σ_{i=0}^{t−1} φ^i e_{t−i}.

So Y_t is a weighted sum of past white noise terms and the initial value of Y. Using our backshift operator, you have

(1 − φB)Y_t = e_t.

If t → ∞ and |φ| < 1, then you have an infinite sum that must converge. That is, Σ_{i=0}^{∞} φ^i B^i = K, where K is a finite (but perhaps large) constant.
A general form for our model of practical value is

Y_t = φ_1 Y_{t−1} + φ_2 Y_{t−2} + ... + φ_p Y_{t−p} + e_t.

This has the form of a regression model with p explanatory variables. It is called an autoregressive model of order p, or an AR(p) model. Using the backshift operator, B, you can write

(1 − φ_1 B − φ_2 B² − ... − φ_p B^p) Y_t = e_t

where the parameter p defines the order of the process. You could also write

Φ(B) Y_t = e_t

where

Φ(B) = 1 − φ_1 B − φ_2 B² − ... − φ_p B^p.

The polynomial Φ(B) is called the AR(p) operator.
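Because the AR(p) model looks like a regression on the series' own past, a one-step forecast with known coefficients is just a weighted sum of the p most recent values. A minimal Python sketch (my own names; the coefficients are assumed known rather than estimated):

```python
def ar_one_step(history, phis):
    """One-step AR(p) forecast with known coefficients:
    Y_hat_{T+1} = phi_1*Y_T + phi_2*Y_{T-1} + ... + phi_p*Y_{T-p+1}.
    """
    p = len(phis)
    recent = history[-p:][::-1]  # Y_T, Y_{T-1}, ..., Y_{T-p+1}
    return sum(phi * y for phi, y in zip(phis, recent))

# Hypothetical AR(2) with phi_1 = 0.5, phi_2 = 0.25:
# forecast = 0.5 * 6.0 + 0.25 * 8.0 = 5.0
forecast = ar_one_step([4.0, 8.0, 6.0], [0.5, 0.25])
```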
Suppose you now write a new model

Y_t = e_t − θ_1 e_{t−1} − θ_2 e_{t−2} − ... − θ_q e_{t−q}.

This can be rewritten as

Y_t = Θ(B) e_t, where Θ(B) = 1 − θ_1 B − θ_2 B² − ... − θ_q B^q.

This is a moving average of order q (MA(q)) model, and Θ(B) is the MA(q) operator. The parameter q defines the order of this process. This is a linear filter that takes the white noise series and transforms it to Y_t. The "moving average" name is misleading since the weights do not necessarily sum to 1.0, so do not confuse this with the moving averages discussed earlier.
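The linear-filter interpretation can be made concrete by simulation. This Python sketch is my own (names are mine; it assumes normal white noise with σ = 1); it generates an MA(1) series and checks the textbook moments, E(Y_t) = 0 and V(Y_t) = σ²(1 + θ₁²):

```python
import random

random.seed(1)

def simulate_ma(thetas, n, sigma=1.0):
    """Simulate an MA(q) process Y_t = e_t - theta_1*e_{t-1} - ... - theta_q*e_{t-q}."""
    q = len(thetas)
    e = [random.gauss(0.0, sigma) for _ in range(n + q)]  # q warm-up shocks
    return [e[t] - sum(th * e[t - j - 1] for j, th in enumerate(thetas))
            for t in range(q, n + q)]

# MA(1) with theta_1 = 0.6: variance should be near 1 + 0.36 = 1.36
y = simulate_ma([0.6], 5000)
mean_y = sum(y) / len(y)
var_y = sum((x - mean_y) ** 2 for x in y) / len(y)
```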
You can extend the model to include both AR and MA components to capture lingering effects and temporary shocks, respectively. The model is

Y_t = φ_1 Y_{t−1} + ... + φ_p Y_{t−p} + e_t − θ_1 e_{t−1} − ... − θ_q e_{t−q}.

This mixed model is called an autoregressive moving average model of order p and q, or simply ARMA(p,q), and is conveniently written as

Φ(B) Y_t = Θ(B) e_t

so that

Y_t = Φ(B)^{−1} Θ(B) e_t.
All econometric models should have a constant term unless economic theory strongly says otherwise, and it typically does not. The inclusion of a constant term also holds for time series models. An ARMA(1,1) model with a constant, for example, is

Y_t = δ + φ_1 Y_{t−1} + e_t − θ_1 e_{t−1}.
Most often, the level is removed by subtracting the mean μ so you have Ỹ_t = Y_t − μ. This way, E(Ỹ_t) = E(Y_t) − μ = 0. Therefore, there are no changes from what you already learned. An estimate of μ is, of course, the sample mean, Ȳ.
Suppose you have a nonstationary time series, that is, one with a randomly occurring shift in its level. You require a model whose behavior is not influenced by the level of the process. You can often eliminate the effect of the changing level by differencing; this is where the differencing operator, ∇, is used. For example, consider a random walk given by Y_t = Y_{t−1} + e_t. This model can be written as Y_t = BY_t + e_t, or (1 − B)Y_t = e_t. The left-hand side is equivalent to ∇Y_t, so ∇Y_t = e_t.
Using the differencing operator and removing the level, our basic mixed model is expanded to be

Φ(B) ∇^d Y_t = Θ(B) e_t
TABLE 6.1 This is a list of common time series and forecasting models derived from a general ARIMA model. Seasonal variations are also possible but these are not relevant for new product forecasts. Source: https://stats.stackexchange.com/questions/23864/what-common-forecasting-models-can-be-seen-as-special-cases-of-arima-models.

Model                        | ARIMA Specification
---------------------------- | -----------------------------
Constant Mean                | ARIMA(0, 0, 0) with constant
Random Walk or Naive (NF1)   | ARIMA(0, 1, 0)
Random Walk with Drift       | ARIMA(0, 1, 0) with constant
Simple Exponential Smoothing | ARIMA(0, 1, 1)
Holt's Exponential Smoothing | ARIMA(0, 2, 2)
Damped Holt's                | ARIMA(0, 1, 2)
where Φ(B)∇^d is the generalized autoregressive operator. This model is called the autoregressive integrated moving average model of order p, d, and q, or simply ARIMA(p,d,q). This is a general family since different values for the parameters define different types of models. Table 6.1 shows some key family members that typically arise in practice. As the table shows, the naive and constant mean models are special cases. Also, the classic random walk model is equivalent to the naive model and is also a special case of the ARIMA(p,d,q) model.
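The first few rows of Table 6.1 can be sketched as simple forecasting rules. This Python fragment is my own illustration (the names are mine): the naive forecast corresponds to ARIMA(0, 1, 0), the drift forecast to ARIMA(0, 1, 0) with constant, and the simple exponential smoothing recursion, with an assumed smoothing weight α, underlies the ARIMA(0, 1, 1) equivalence:

```python
def naive_forecast(y):
    """ARIMA(0, 1, 0): the forecast is the last observation."""
    return y[-1]

def drift_forecast(y):
    """ARIMA(0, 1, 0) with constant: last value plus the average change."""
    return y[-1] + (y[-1] - y[0]) / (len(y) - 1)

def ses_forecast(y, alpha=0.3):
    """ARIMA(0, 1, 1): the simple exponential smoothing recursion
    level_t = alpha * Y_t + (1 - alpha) * level_{t-1}."""
    level = y[0]
    for obs in y[1:]:
        level = alpha * obs + (1 - alpha) * level
    return level

y = [10.0, 12.0, 11.0, 13.0]
# naive: 13.0; drift: 13 + 3/3 = 14.0; SES with alpha = 0.5: 12.0
```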
For more details on this class of models including seasonal variations, see Wei [2006], Montgomery et al. [2008], and Nelson [1973]. Also see the original treatment in Box et al. [1994]. For implementation of many time series models in R, see Hyndman and Athanasopoulos [2018].
Notes
- 1 See https://marketing-insider.eu/categories-of-new-products/.
- 2 NP-hard problems are a complex topic in theoretical computer science and are concerned with the time needed to solve a problem as a function of the input size; the "P" refers to polynomial time. See www.quora.com/What-does-NP-hard-mean for a good explanation.
- 3 If they are positive and do not sum to 1.0, then the weighted moving average must be divided by the sum of the weights which forces them to sum to 1.0.