Forecast error analysis

The previous sections focused on models or methods using the training portion of the data set. Recall that a portion of the time series data set was reserved, if possible, for testing. There are many ways to test a forecast, but they all basically rely on measures based on the forecast error at time $t$, defined as

$$\text{Forecast Error}_t = \text{Actual}_t - \text{Forecast}_{T'}(t)$$

where $\text{Actual}_t$ is the actual value in the testing data set and $\text{Forecast}_{T'}(t)$ is the forecasted value in the testing data set. The forecasted value is based on the training data that ended at time $T'$. The error measurement is illustrated in Figure 6.6.

FIGURE 6.6 This time line illustrates the values used for measuring forecast accuracy during model development. The training period, as described earlier, extends from $t_0$ to $T'$ and the testing period from $T'$ to $T$. At time $T' + 1$, the known actual is $A_{T'+1}$ and the one-step ahead forecast into the testing period at this point in time is $F_{T'}(1)$. The error at $T' + 1$ is the difference $A_{T'+1} - F_{T'}(1)$.

The forecast error is sometimes called the out-of-sample error because it is based on data not used in the training (i.e., outside that data or sample).

There are a number of forecast error statistics, each based on the forecast error $A_{T'+i} - F_{T'}(i)$. Some analysts calculate an error for each step ahead into the testing period and then graph these errors, perhaps with a bar chart. The goal is to see a pattern in the errors. I do not recommend this because patterns may be difficult to discern. I recommend any one of the following:

$$\text{Percentage Error } PE_{T'+i} = 100 \times \frac{A_{T'+i} - F_{T'}(i)}{A_{T'+i}}$$

$$\text{Mean Error } ME = \frac{\sum_{i=1}^{h} \left( A_{T'+i} - F_{T'}(i) \right)}{h}$$

$$\text{Mean Percentage Error } MPE = 100 \times \frac{1}{h} \sum_{i=1}^{h} \frac{A_{T'+i} - F_{T'}(i)}{A_{T'+i}}$$

$$\text{Mean Absolute Percentage Error } MAPE = 100 \times \frac{1}{h} \sum_{i=1}^{h} \left| \frac{A_{T'+i} - F_{T'}(i)}{A_{T'+i}} \right|$$

$$\text{Median Absolute Percentage Error } MdAPE = \text{Median value of } \left| \frac{A_{T'+i} - F_{T'}(i)}{A_{T'+i}} \right| \times 100, \; i = 1, 2, \ldots, h$$

$$\text{Mean Square Error } MSE = \frac{\sum_{i=1}^{h} \left( A_{T'+i} - F_{T'}(i) \right)^2}{h}$$

Since the testing period only extends from time $T'$ to time $T$, the steps ahead, $h$, can only go as far as time $T$, so $1 \le h \le T - T'$.

The ME and MPE are useful supplements to a count of the frequency of under- and over-forecasts. The ME gives the average of the forecast errors expressed in the units of measurement of the data, while the MPE gives the average of the forecast errors as a percentage and is unit-free. The Root Mean Square Error (RMSE), the square root of the MSE, is a popular measure.
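These error measures are straightforward to compute. The following sketch, using NumPy (the function name and the example values are my own illustration, not from this chapter), collects them in one place:

```python
import numpy as np

def forecast_error_measures(actuals, forecasts):
    """Compute common forecast error statistics over the testing period.

    actuals   : the A_{T'+i} values from the testing data set
    forecasts : the F_{T'}(i) values, the i-step ahead forecasts
    """
    actuals = np.asarray(actuals, dtype=float)
    forecasts = np.asarray(forecasts, dtype=float)
    errors = actuals - forecasts           # forecast errors
    pct_errors = 100 * errors / actuals    # percentage errors

    mse = np.mean(errors ** 2)
    return {
        "ME": np.mean(errors),                   # Mean Error
        "MPE": np.mean(pct_errors),              # Mean Percentage Error
        "MAPE": np.mean(np.abs(pct_errors)),     # Mean Absolute Percentage Error
        "MdAPE": np.median(np.abs(pct_errors)),  # Median Absolute Percentage Error
        "MSE": mse,                              # Mean Square Error
        "RMSE": np.sqrt(mse),                    # Root Mean Square Error
    }

measures = forecast_error_measures([100, 110, 120], [90, 115, 120])
```

Note that the percentage-based measures fail if any actual value is zero, which is one reason the RMSE is often preferred.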

See Levenbach and Cleary [2006] for these statistics and some application examples. They also discuss the difference between model fit error and forecast error: the former applies to the data used for training and the latter to the data used for testing. The distinction should be clear.

I defined these error measures in terms of the training and testing data sets. The same measures are used when the full forecast is developed beyond period T. I mention this in Chapter 7.


There are many software packages that handle time series models. JMP, SAS, Stata, R, and Python are excellent options. Hyndman and Athanasopoulos [2018] provide an excellent treatment of R for time series analysis and forecasting.


This is a very long and complex chapter. The forecasting methods outlined here are sophisticated enough that many of them warrant their own book. This is especially true of the ARIMA family of models. Nonetheless, new product development requires a sales forecast before launch, so these methods should be studied and considered.


This appendix summarizes a general class or family of time series models used for forecasting. Three models mentioned in this chapter, Naive 1 (NF1), constant mean, and exponential smoothing, are special cases of an ARIMA specification. The first subsection reviews two operators commonly used in time series analysis and that are used in this appendix. The remaining sections review various models using these operators.

Time series definition

Following Parzen [1962], let a set of data values be measured at discrete equidistant points in time. Values collected or measured in continuous time are certainly possible, but most applications rely on discrete measurements. We write the set of time points as $S = \{1, 2, 3, \ldots, T\}$, where $T$ is the number of observations. An observation is a realization at time $t$ of an underlying process. This realization is denoted as $Y_t$. The set of observations $\{Y_t, t \in S\}$ is a time series. This is simply written as $Y_1, Y_2, \ldots, Y_T$. See Parzen [1962] for the definition.

Backshift and differencing operators

The backshift operator is a convenient tool to use when dealing with time series models. Another operator, the differencing operator, is related to the backshift operator. I present a high-level overview of both of them in this section.

Denote a time series of $T$ observations as $Y_1, Y_2, \ldots, Y_T$. The backshift operator ($B$), when applied to a time series, produces a new series of lagged values: $BY_t = Y_{t-1}$. $B$ can be applied successively. For example, $B^2 Y_t = B(BY_t) = BY_{t-1} = Y_{t-2}$.

The exponent for $B$ means to repeatedly apply the backshift in order to move backward a number of time periods equal to the "power", but the exponent is not a power; it just indicates the amount of backward shift. In general, based on repeated application of the basic definition, you have $B^k Y_t = Y_{t-k}$.

Observe that $B^0 = 1$ so that $B^0 Y_t = Y_t$; $B^0$ is the identity operator. Also note that if $c$ is a constant, then $B^k c = c$ since a constant cannot be shifted by definition.

The differencing operator, $\nabla$, gives the change in $Y_t$ from the previous period: $\nabla Y_t = Y_t - Y_{t-1}$. The differencing and backshift operators are related. Notice that $\nabla Y_t = Y_t - Y_{t-1} = Y_t - BY_t = (1 - B)Y_t$. Therefore, $\nabla = 1 - B$ by equating "coefficients". The $\nabla$ operator can also be applied successively. For example, $\nabla^2 Y_t = \nabla(Y_t - Y_{t-1}) = Y_t - 2Y_{t-1} + Y_{t-2}$.
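These operator identities are easy to verify numerically. Here is a minimal sketch using pandas, whose `shift` and `diff` methods play the roles of $B$ and $\nabla$ (the series values are arbitrary):

```python
import pandas as pd

y = pd.Series([5.0, 7.0, 6.0, 9.0, 11.0])

# Backshift operator: B Y_t = Y_{t-1}
by = y.shift(1)

# Differencing operator: (nabla) Y_t = Y_t - Y_{t-1}, i.e., (1 - B) Y_t
dy = y.diff()

# Verify that nabla = 1 - B term by term (NaN at t = 0, where no lag exists)
assert dy.equals(y - by)

# Second difference: nabla^2 Y_t = Y_t - 2 Y_{t-1} + Y_{t-2}
d2y = y.diff().diff()
assert d2y.equals(y - 2 * y.shift(1) + y.shift(2))
```

Each application of `diff` shortens the usable series by one observation, mirroring the loss of one initial value per differencing pass.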

Random walk model and naive forecast

A random walk model is

$$Y_t = Y_{t-1} + \epsilon_t$$

where the error term is white noise: $E(\epsilon_t) = 0$, $V(\epsilon_t) = \sigma_\epsilon^2$, and $COV(\epsilon_t, \epsilon_{t-j}) = 0, \forall t, j$. Starting from some base number, $Y_0$, the series evolves as

$$Y_t = Y_0 + \sum_{i=1}^{t} \epsilon_i.$$

The last value, $Y_T$, is the evolution of all the past white noise terms. This is the basis for the random walk theory of stock market prices. See Malkiel [1999] for the classic treatment of the stock market and random walks.

If $\epsilon_t$ is white noise following a normal distribution, $\epsilon_t \sim N(0, \sigma_\epsilon^2)$, then $Y_t$ is also normally distributed by the reproductive property of normals. This property states that a linear combination of normally distributed random variables is itself normally distributed. See Dudewicz and Mishra [1988] for the reproductive property.

Using the backshift operator, $B$, the random walk is

$$(1 - B)Y_t = \epsilon_t \quad\text{so}\quad Y_t = (1 - B)^{-1}\epsilon_t.$$

Since $(1 - B)^{-1} = 1 + B + B^2 + \ldots = \sum_{i=0}^{\infty} B^i$, then

$$Y_t = \sum_{i=0}^{\infty} B^i \epsilon_t = \sum_{i=0}^{\infty} \epsilon_{t-i},$$

which can be truncated to going back to a finite past for practical purposes. Note that $Y_0$ is so far back (i.e., the infinite past) that it can be ignored. The infinite series $\sum_{i=0}^{\infty} B^i$ must converge to some value, $K$. Otherwise, the series explodes, which is impractical. More importantly, if the series diverges, then $E(Y_t) = 0 \times \infty$, which is an indeterminate form. Nelson [1973] notes that this is a condition that must hold. Assume that the infinite sum is truncated at time $T$. Then you have

$$Y_T = \sum_{i=0}^{T-1} \epsilon_{T-i}.$$

The mean and variance are then (note that $\sum B^i$ converges because it is a finite sum)

$$E(Y_T) = 0 \quad\text{and}\quad V(Y_T) = T \times \sigma_\epsilon^2.$$

Notice that the variance gets bigger the further out in time you go.
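A quick simulation illustrates this variance growth. The sketch below (the seed, $\sigma_\epsilon$, and sizes are arbitrary choices of mine) generates many random walk paths and checks that the cross-path variance at time $t$ is close to $t \times \sigma_\epsilon^2$:

```python
import numpy as np

rng = np.random.default_rng(42)
sigma = 2.0
T, n_paths = 200, 20_000

# Each path is Y_t = sum of t white noise shocks (with Y_0 = 0)
shocks = rng.normal(0.0, sigma, size=(n_paths, T))
paths = shocks.cumsum(axis=1)

# Cross-path variances at t = 50 and t = 200; theory says t * sigma^2
var_50 = paths[:, 49].var()    # theory: 50 * 4 = 200
var_200 = paths[:, 199].var()  # theory: 200 * 4 = 800
```

The sample variances track the linear-in-$t$ theoretical values, up to Monte Carlo noise.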

Suppose you use the random walk model to forecast. What form will those forecasts take and how are they derived? Let the history $\mathcal{Y} = \{\ldots, Y_{T-1}, Y_T\}$ be given. Note for what follows that the actual $Y_{T+1}$ is a random variable. By the random walk formula, $Y_{T+1} = Y_T + \epsilon_{T+1}$. $Y_T$ is known so it is not a random variable, but $\epsilon_{T+1}$ is an unknown random variable. Therefore, the one-step ahead forecast, $Y_T(1)$, is a random variable. You can describe the probability distribution of the one-step ahead forecast, $Y_T(1)$, conditioned on its past history. Its expected value is

$$E[Y_T(1) \mid \mathcal{Y}] = E[Y_T + \epsilon_{T+1} \mid \mathcal{Y}] = Y_T.$$

The expected number in the next period is just the current number. I will now drop the $\mathcal{Y}$ to simplify notation, but remember that the forecasts are conditioned on this history.

The variance of $Y_T(1)$ is

$$V[Y_T(1)] = V[Y_T + \epsilon_{T+1}] = V(\epsilon_{T+1}) = \sigma_\epsilon^2.$$

Remember that $Y_T$ is an actual, so it is nonstochastic and thus has no variance.

If $\epsilon_t$ is white noise following a normal distribution, then

$$Y_T(1) \sim N(Y_T, \sigma_\epsilon^2)$$

by the reproductive property of normals. A 95% confidence interval is simply $Y_T \pm 1.96 \times \sigma_\epsilon$.

Now extend the forecast to two-steps ahead. Then

$$Y_T(2) = E[Y_{T+2}] = E[Y_T + \epsilon_{T+1} + \epsilon_{T+2}] = Y_T.$$

The variance is

$$V[Y_T(2)] = V[\epsilon_{T+1} + \epsilon_{T+2}] = 2 \times \sigma_\epsilon^2.$$

The forecast is just the current value, $Y_T$, and the variance is twice $\sigma_\epsilon^2$. The forecast for $h$-steps ahead is still $Y_T$ and the variance is $h \times \sigma_\epsilon^2$. Using the reproductive property of normals, then $Y_T(h) \sim N(Y_T, h\sigma_\epsilon^2)$ and a 95% confidence interval is $Y_T \pm 1.96 \times \sqrt{h}\,\sigma_\epsilon$. The confidence interval expands in proportion to the square root of the number of steps ahead.
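The random walk forecast and its widening interval can be packaged as a small helper. This is an illustrative sketch (the function name and inputs are my own), assuming $\sigma_\epsilon$ is known:

```python
import numpy as np

def random_walk_forecast(y_T, sigma, h):
    """h-step ahead forecast from a random walk: the point forecast is the
    last actual Y_T for every step; the variance grows linearly in the step."""
    steps = np.arange(1, h + 1)
    point = np.full(h, y_T)                    # Y_T(h) = Y_T for all h
    half_width = 1.96 * sigma * np.sqrt(steps)  # CI widens with sqrt(h)
    return point, point - half_width, point + half_width

point, lo, hi = random_walk_forecast(y_T=100.0, sigma=2.0, h=4)
```

In practice $\sigma_\epsilon$ would be estimated, for example from the standard deviation of the first differences of the training series.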

Random walk with drift

Consider the random walk model

$$Y_t = \delta + Y_{t-1} + \epsilon_t$$

where $\delta$ is a constant. This is the random walk with drift model, where $\delta$ is the drift parameter. Starting at $Y_0$ as before, through successive substitution you get

$$Y_t = Y_0 + t \times \delta + \sum_{i=1}^{t} \epsilon_i.$$

The series keeps drifting by $\delta$.

Using the backshift notation, you have

$$(1 - B)Y_t = \delta + \epsilon_t \quad\text{so}\quad Y_t = (1 - B)^{-1}\delta + (1 - B)^{-1}\epsilon_t.$$

Truncating the summations to $t$ periods yields

$$Y_t = \delta \times (1 + B^1 + B^2 + \ldots + B^{t-1}) + \sum_{i=0}^{t-1} \epsilon_{t-i} = t \times \delta + \sum_{i=0}^{t-1} \epsilon_{t-i}.$$

Note that $\delta \times (1 + B^1 + B^2 + \ldots + B^{t-1}) = t \times \delta$ since $B^k \delta = \delta$ and the expression in parentheses has $t$ terms.

The mean and variance are then

$$E(Y_t) = t \times \delta \quad\text{and}\quad V(Y_t) = t \times \sigma_\epsilon^2.$$

The 1-step ahead forecast is $Y_T(1) = Y_T + \delta + \epsilon_{T+1}$ with $E[Y_T(1)] = Y_T + \delta$. It is easy to show that $E[Y_T(h)] = Y_T + h \times \delta$. The variance is $V[Y_T(h)] = h \times \sigma_\epsilon^2$. So the drift affects the level and not the variance of the forecast.
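The drift version only shifts the point forecast; the interval half-width is unchanged from the driftless random walk. A sketch (function name mine, assuming known $\delta$ and $\sigma_\epsilon$):

```python
import numpy as np

def drift_forecast(y_T, delta, sigma, h):
    """Random walk with drift: E[Y_T(h)] = Y_T + h * delta,
    while V[Y_T(h)] = h * sigma**2, the same as without drift."""
    steps = np.arange(1, h + 1)
    point = y_T + steps * delta                 # level drifts by delta per step
    half_width = 1.96 * sigma * np.sqrt(steps)  # variance unaffected by drift
    return point, point - half_width, point + half_width

point, lo, hi = drift_forecast(y_T=100.0, delta=1.5, sigma=2.0, h=4)
```

An estimate of $\delta$ is the mean of the first differences of the training series.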

Constant mean model

For the constant mean model,

$$Y_t = \mu + \epsilon_t,$$

the $h$-step ahead forecast is the sample mean of the historical data, $Y_T(h) = \bar{Y}$, so that $E[Y_T(h)] = \mu$ since $E(\epsilon_t) = 0, \forall t$. The $h$-step ahead forecast is unbiased. The variance of this forecast is

$$V[Y_T(h)] = \frac{\sigma_\epsilon^2}{T}.$$

See Gilchrist [1976] for this demonstration.

The forecast error is

$$e_T(h) = Y_{T+h} - Y_T(h),$$

so $E[e_T(h)] = 0$. The mean square error (MSE) of the forecast is $\sigma_\epsilon^2 \times (1 + 1/T)$. You can now write a 95% confidence interval statement as

$$\bar{Y} \pm 1.96 \times \sigma_\epsilon \sqrt{1 + \frac{1}{T}}.$$

An unbiased estimate of $\sigma_\epsilon$ is the sample standard deviation of the historical data. See Gilchrist [1976] for discussion of this model.
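The constant mean forecast and its interval can be sketched as follows (the function name is mine; the interval uses the $\sigma_\epsilon^2(1 + 1/T)$ prediction variance with the sample standard deviation plugged in for $\sigma_\epsilon$):

```python
import numpy as np

def constant_mean_forecast(y, h, z=1.96):
    """Constant mean model: the forecast is the sample mean for every
    horizon; the prediction variance is sigma^2 * (1 + 1/T)."""
    y = np.asarray(y, dtype=float)
    T = len(y)
    y_bar = y.mean()
    s = y.std(ddof=1)                       # sample estimate of sigma_eps
    half_width = z * s * np.sqrt(1 + 1 / T)  # constant for all horizons
    point = np.full(h, y_bar)
    return point, point - half_width, point + half_width

point, lo, hi = constant_mean_forecast([10, 12, 11, 13, 14], h=3)
```

Unlike the random walk, the interval does not widen with the horizon, because each future value is an independent draw around the same mean.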

The ARIMA family of models

I will now consider a broad, general class of models that are sometimes called:

  • stochastic time series models;
  • time series models;
  • Autoregressive Integrated Moving Average (ARIMA) models; or
  • Box-Jenkins models

although ARIMA is the most common reference. In this class of models, the random element, $\epsilon_t$, plays a dominant role rather than being just an add-on error (i.e., disturbance) to a strictly deterministic model. The ARIMA model is actually a family of models with different members of the family (cousins, if you wish) that themselves define more specific cases. The ARIMA model has parameters whose specification defines these family members. The parameters are $p$, $d$, and $q$, which I will specify below.

Consider the model

$$Y_t = \phi Y_{t-1} + \epsilon_t$$

where $\epsilon_t$ is white noise. If $\phi = 1$, then this is a random walk. From backward substitution, you get

$$Y_t = \phi^t Y_0 + \sum_{i=0}^{t-1} \phi^i \epsilon_{t-i}.$$

So $Y_t$ is a weighted sum of past white noise terms and the initial value of $Y$. Using our backshift operator, you have

$$(1 - \phi B)Y_t = \epsilon_t \quad\text{so}\quad Y_t = (1 - \phi B)^{-1}\epsilon_t = \sum_{i=0}^{\infty} \phi^i B^i \epsilon_t.$$

If $t \to \infty$ and $|\phi| < 1$, then you have an infinite sum that must converge. That is, $\sum_{i=0}^{\infty} \phi^i B^i = K$ where $K$ is a finite (but perhaps large) constant.
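For a concrete check of this convergence, the geometric weights $\phi^i$ can be summed directly (the value of $\phi$ is an arbitrary illustration):

```python
import numpy as np

phi = 0.8
weights = phi ** np.arange(200)  # phi^i weights on past shocks eps_{t-i}

# For |phi| < 1 the weights decay geometrically, so their sum converges
total = weights.sum()            # approaches 1 / (1 - phi) = 5
```

With $|\phi| \ge 1$ the weights do not decay and the sum diverges, which is the explosive case ruled out in the text.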

A general form for our model of practical value is

$$Y_t = \phi_1 Y_{t-1} + \phi_2 Y_{t-2} + \ldots + \phi_p Y_{t-p} + \epsilon_t.$$

This has the form of a regression model with $p$ explanatory variables. It is called an autoregressive model of order $p$ or an AR($p$) model. Using the backshift operator, $B$, you can write

$$(1 - \phi_1 B - \phi_2 B^2 - \ldots - \phi_p B^p)Y_t = \epsilon_t$$

where the parameter $p$ defines the order of the process. You could also write $\Phi(B)Y_t = \epsilon_t$ where

$$\Phi(B) = 1 - \phi_1 B - \phi_2 B^2 - \ldots - \phi_p B^p.$$

The polynomial $\Phi(B)$ is called the AR($p$) operator.

Suppose you now write a new model

$$Y_t = \epsilon_t - \theta_1 \epsilon_{t-1} - \theta_2 \epsilon_{t-2} - \ldots - \theta_q \epsilon_{t-q}.$$

This can be rewritten as $Y_t = \Theta(B)\epsilon_t$ where

$$\Theta(B) = 1 - \theta_1 B - \theta_2 B^2 - \ldots - \theta_q B^q.$$

This is a moving average of order $q$ (MA($q$)) model and $\Theta(B)$ is the MA($q$) operator. The parameter $q$ defines the order of this process. This is a linear filter that takes the white noise series and transforms it to $Y_t$. The "moving average" name is misleading since the weights do not necessarily sum to 1.0, so do not confuse this with the moving averages discussed above.

You can extend the model to include both AR and MA components to capture lingering effects and temporary shocks, respectively. The model is

$$Y_t = \phi_1 Y_{t-1} + \ldots + \phi_p Y_{t-p} + \epsilon_t - \theta_1 \epsilon_{t-1} - \ldots - \theta_q \epsilon_{t-q}.$$

This mixed model is called an autoregressive moving average model of order $p$ and $q$, or simply ARMA($p,q$), and is conveniently written as

$$\Phi(B)Y_t = \Theta(B)\epsilon_t$$

so that $Y_t = \Phi^{-1}(B)\Theta(B)\epsilon_t$.

All econometric models should have a constant term unless economic theory strongly says otherwise, and it typically does not. The inclusion of a constant term also holds for time series models. An ARMA(1,1) model with a constant, for example, is

$$Y_t = \theta_0 + \phi_1 Y_{t-1} + \epsilon_t - \theta_1 \epsilon_{t-1}.$$

Most often, the level is removed by subtracting the mean $\mu$, so you have $\tilde{Y}_t = Y_t - \mu$. This way, $E(\tilde{Y}_t) = E(Y_t) - \mu = 0$. Therefore, there are no changes from what you already learned. An estimate of $\mu$ is, of course, the sample mean, $\bar{Y}$.

Suppose you have a nonstationary time series; that is, it has a randomly occurring shift in its level. You require a model whose behavior is not influenced by the level of the process. You can often eliminate the effect of the changing level by differencing. This is where the differencing operator, $\nabla$, is used. For example, consider a random walk given by $Y_t = Y_{t-1} + \epsilon_t$. This model can be written as $Y_t = BY_t + \epsilon_t$ or $(1 - B)Y_t = \epsilon_t$. The first term is equivalent to $\nabla Y_t$, so $\nabla Y_t = \epsilon_t$.

Using the differencing operator and removing the level, our basic mixed model is expanded to be

$$\Phi(B)\nabla^d Y_t = \Theta(B)\epsilon_t$$

TABLE 6.1 This is a list of common time series and forecasting models derived from a general ARIMA model. Seasonal variations are also possible but these are not relevant for new product forecasts. Source: common-forecasting-models-can-be-seen-as-special-cases-of-arima-models.

Model | ARIMA Specification
Constant Mean | ARIMA(0, 0, 0) with constant
Random Walk or Naive (NF1) | ARIMA(0, 1, 0)
Random Walk with Drift | ARIMA(0, 1, 0) with constant
Simple Exponential Smoothing | ARIMA(0, 1, 1)
Holt's Exponential Smoothing | ARIMA(0, 2, 2)
Damped Holt's | ARIMA(0, 1, 2)

where $\Phi(B)\nabla^d$ is the generalized autoregressive operator. This model is called the autoregressive integrated moving average model of order $p$, $d$, and $q$, or simply ARIMA($p,d,q$). This is a general family since different values for the parameters define different types of models. Table 6.1 shows some key family members that typically arise in practice. As the table shows, the naive and constant mean models are special cases. Also, the classic random walk model is equivalent to the naive model and is also a special case of the ARIMA($p,d,q$) model.

For more details on this class of models including seasonal variations, see Wei [2006], Montgomery et al. [2008], and Nelson [1973]. Also see the original treatment in Box et al. [1994]. For implementation of many time series models in R, see Hyndman and Athanasopoulos [2018].


  • 1 See
  • 2 NP-hard problems are a complex topic in theoretical computer science and are concerned with the time needed to solve a problem. For problems in the class P, the solution time is a polynomial function of the input size, hence the "P". See for a good explanation.
  • 3 If they are positive and do not sum to 1.0, then the weighted moving average must be divided by the sum of the weights which forces them to sum to 1.0.