Appendix 3.B
Brief overview of correspondence analysis
The mathematics of correspondence analysis can be a little challenging because it relies mostly on matrix algebra. The main matrix algebra tool is the Singular Value Decomposition (SVD) outlined in the Appendix to Chapter 2. Recall from that discussion that the SVD method decomposes a matrix A of size n × p into three parts which, when multiplied together, return the original matrix A:
A = UΣV^{T}
where
- U is an n × n orthogonal matrix such that U^{T}U = I where I is n × n;
- Σ is an n × r diagonal matrix such that Σ = diagonal(σ_{1}, σ_{2}, ...), whose diagonal values are the singular values; and
- V^{T} is an r × p orthogonal matrix such that V^{T}V = I where I is p × p.
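The decomposition can be checked numerically. The following is a minimal sketch using NumPy; the matrix values are made up for illustration. Note that `numpy.linalg.svd` with `full_matrices=True` returns U as n × n and V^{T} as p × p, so the diagonal matrix has to be rebuilt as n × p, which differs slightly from the dimensions listed above.

```python
import numpy as np

# Hypothetical 4 x 3 matrix; the values are made up for illustration.
A = np.array([
    [2.0, 0.0, 1.0],
    [1.0, 3.0, 0.0],
    [0.0, 1.0, 4.0],
    [1.0, 1.0, 1.0],
])
n, p = A.shape

# NumPy returns the singular values as a 1-D array in descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=True)

# Rebuild the rectangular diagonal matrix of singular values.
Sigma = np.zeros((n, p))
Sigma[:len(s), :len(s)] = np.diag(s)

# Multiplying the three parts together returns the original matrix.
A_rebuilt = U @ Sigma @ Vt
print(np.allclose(A_rebuilt, A))
print(np.allclose(U.T @ U, np.eye(n)))
```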
The following is based on Greenacre [2007] which is the definitive presentation and development of correspondence analysis.
An I × J matrix S of standardized residuals is the basis for correspondence analysis. The Singular Value Decomposition (SVD) is applied to S to give
S = UΛV^{T}.
The SVD is shown in the SVD Components portion of Figure 3.22. The matrix Λ is an I × J diagonal matrix with elements in descending order. The diagonal elements are the singular values. The left matrix, U, is I × I and provides information about the rows of the original crosstab. The right matrix, V^{T}, is J × J and provides information about the columns of the crosstab.
The main output of a correspondence analysis is a map, which means that plotting coordinates are needed. Since the crosstab is rows by columns, one set of plotting coordinates is needed for the rows and another for the columns. Both are given as functions of the SVD components. The row coordinates, designated as Φ, are based on the left matrix U, while the column coordinates, designated as Γ, are based on the right matrix V. These sets are sometimes called the Standard Row Coordinates and the Standard Column Coordinates, respectively. For plotting purposes, however, these are usually adjusted by scaling each dimension by its singular value. The adjusted sets are the Principal Row Coordinates and Principal Column Coordinates, respectively. These are shown in the Coordinates section of Figure 3.22. The full correspondence analysis for the example crosstab is shown in Figure 3.21. Notice that the plotting coordinates agree with the principal coordinates in Figure 3.21.
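The coordinate calculations can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the calculation behind Figures 3.21 and 3.22: the crosstab counts are hypothetical, and the formulas for the row and column masses, the standardized residuals, and the two sets of coordinates are the usual ones from the correspondence analysis literature (as in Greenacre [2007]).

```python
import numpy as np

# Hypothetical 3 x 3 crosstab of counts (NOT the table behind Figure 3.21).
counts = np.array([
    [120.0, 80.0, 40.0],
    [60.0, 90.0, 70.0],
    [30.0, 50.0, 110.0],
])

N = counts.sum()
P = counts / N              # correspondence matrix (cell proportions)
r = P.sum(axis=1)           # row masses
c = P.sum(axis=0)           # column masses

# Standardized residuals: (p_ij - r_i * c_j) / sqrt(r_i * c_j)
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))

U, sv, Vt = np.linalg.svd(S, full_matrices=False)

# Keep d = min(rows - 1, columns - 1) dimensions.
d = min(counts.shape) - 1
U, sv, V = U[:, :d], sv[:d], Vt.T[:, :d]

# Standard coordinates: singular vectors rescaled by the masses.
std_rows = U / np.sqrt(r)[:, None]
std_cols = V / np.sqrt(c)[:, None]

# Principal coordinates: standard coordinates scaled by the singular values.
prin_rows = std_rows * sv
prin_cols = std_cols * sv

total_inertia = (sv ** 2).sum()
chi2 = N * total_inertia
print(prin_rows)
print(prin_cols)
print(total_inertia, chi2)
```

The principal coordinates are the ones that would be plotted on the map. The last line ties the decomposition back to the Pearson chi-square of the table.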
The SVD provides more information than just the plotting coordinates. It also provides measures of the amount of variation in the table explained by the dimensions. These are the inertias. The inertia is the variation in the crosstab. See Greenacre and Korneliussen [2017].
It can be shown that the singular values from the SVD of the crosstab are related to the Pearson Chi-square value. The singular values are usually arranged in descending order with the corresponding singular vectors arranged accordingly. The square of a singular value is called the inertia of the corresponding dimension, where the concept of inertia comes from the physics of a rigid body. In particular, the moment of inertia governs the torque necessary to change the rotation of the rigid body. The formulas for the moment of inertia and the variance have the same form, hence in the correspondence analysis literature the variance of the table is referred to as the inertia.^{16} There is a singular value for each dimension extracted from the crosstab, where the total number of dimensions that could be extracted is d = min(r − 1, c − 1) for r rows and c columns of the table. The singular value for dimension i is SV_{i}.
If λ_{i} is the inertia for the i^{th} dimension, then the total inertia is Λ = Σ_{i=1}^{d} λ_{i}. It can be shown that λ_{i} = SV_{i}^{2} so Λ = Σ_{i=1}^{d} SV_{i}^{2}. It can also be shown that the total chi-square of the table is χ^{2} = N × Λ where N is the total sample size. This means that χ^{2} = N × Σ_{i=1}^{d} λ_{i}. From Figure 3.13, you have the data in Table 3.9.
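The arithmetic behind Table 3.9 can be reproduced directly from these relationships. The short sketch below takes the two singular values and the sample size reported with the table and computes the inertias, the chi-square contributions, and the percentages. Because the singular values are themselves rounded, the results should agree with the table only up to rounding.

```python
# Singular values and sample size as reported in Table 3.9.
singular_values = [0.14457, 0.02524]
N = 4300

# Inertia of each dimension is the squared singular value.
inertias = [sv ** 2 for sv in singular_values]
total_inertia = sum(inertias)

# Chi-square contribution of each dimension is N times its inertia.
chi_squares = [N * inertia for inertia in inertias]
percents = [100 * inertia / total_inertia for inertia in inertias]

cumulative = 0.0
for dim, (sv, inertia, chi2, pct) in enumerate(
    zip(singular_values, inertias, chi_squares, percents), start=1
):
    cumulative += pct
    print(f"{dim}  {sv:.5f}  {inertia:.5f}  {chi2:.3f}  {pct:.2f}  {cumulative:.2f}")

print(f"Total inertia = {total_inertia:.5f}, chi-square = {N * total_inertia:.3f}")
```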
TABLE 3.9 This table illustrates the calculations for the inertia values for the correspondence analysis.
Dimension | Singular Value | Inertia | Chi-Square | Percent | Cumulative Percent |
--- | --- | --- | --- | --- | --- |
1 | 0.14457 | 0.02090 | 89.873 | 97.04 | 97.04 |
2 | 0.02524 | 0.00064 | 2.740 | 2.96 | 100.00 |
Total | | 0.02154 | 92.613 | 100.00 | |
N = 4300 from Figure 3.5. χ^{2} = 92.613 from Figure 3.14.
FIGURE 3.21 This is a comprehensive correspondence analysis report for the example prototype table in Figure 3.20.
FIGURE 3.22 These are the details for the Singular Value Decomposition calculations for the correspondence analysis in Figure 3.21.
Appendix 3.C
Very brief overview of ordinary least squares analysis
Assume there is one dependent variable arranged as an n × 1 vector Y. Also assume there are p > 1 independent variables arranged in an n × (p + 1) matrix X where the first column consists of 1s for the constant term. Then a model is
Y = Xβ + ε (3.C.1)
where β is a (p + 1) × 1 vector of parameters to be estimated with the first element as the constant, and ε is an n × 1 vector of random disturbance terms. It is usually assumed that ε_{i} ~ N(0, σ^{2}).
It is easy to show that
β̂ = (X^{T}X)^{-1}X^{T}Y.
Then E(β̂) = β and V(β̂) = σ^{2} × (X^{T}X)^{-1}.
See Greene [2003] for a detailed development of this result. Also see Goldberger [1964] for a classic derivation. If there is perfect multicollinearity, then the X^{T}X matrix cannot be inverted and the parameters cannot be estimated.
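As a minimal sketch, the estimator and its covariance matrix can be computed with NumPy as follows. The data are simulated, so the parameter values, sample size, and disturbance standard deviation are all assumptions made for illustration. The unknown σ^{2} is replaced by the usual unbiased estimate based on the residuals, which is a standard step not spelled out above.

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated data: n observations on p = 2 independent variables (assumed values).
n, p = 200, 2
X_vars = rng.normal(size=(n, p))
X = np.column_stack([np.ones(n), X_vars])   # first column of 1s for the constant
beta_true = np.array([1.0, 2.0, -0.5])      # hypothetical parameters
sigma = 0.3                                 # hypothetical disturbance std. dev.
Y = X @ beta_true + rng.normal(scale=sigma, size=n)

# OLS estimator: solve (X'X) b = X'Y rather than inverting X'X explicitly.
XtX = X.T @ X
beta_hat = np.linalg.solve(XtX, X.T @ Y)

# Estimate sigma^2 from the residuals with n - (p + 1) degrees of freedom.
residuals = Y - X @ beta_hat
sigma2_hat = residuals @ residuals / (n - p - 1)

# Estimated covariance matrix of the parameter estimates.
cov_beta_hat = sigma2_hat * np.linalg.inv(XtX)

print(beta_hat)
print(np.sqrt(np.diag(cov_beta_hat)))       # standard errors
```

If two columns of `X` were exact copies of each other, `XtX` would be singular and `np.linalg.solve` would raise an error, which is the perfect multicollinearity case.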
Brief overview of principal components analysis
Principal components analysis works by finding a transformation of the X matrix into a new matrix such that the column vectors of the new matrix are uncorrelated.
An important first step in principal components analysis is to mean-center the data. This involves finding the mean of each variable and then subtracting these means from their respective variables. This removes differences in the levels of the variables that could otherwise negatively impact results. To mean-center, let X be the n × p matrix of variables. Let 1_{n} be a column vector with a 1 for each element so that 1_{n} is n × 1. Then a 1 × p row vector of means is given by
x̄ = (1_{n}^{T}1_{n})^{-1}1_{n}^{T}X
where (1_{n}^{T}1_{n})^{-1} = 1/n. The mean-centered matrix is then
X − 1_{n}x̄.
In what follows, X denotes this mean-centered matrix.
I can now do an SVD on X to get X = UΣP^{T} where U and P are orthogonal matrices and Σ is the diagonal matrix of singular values. Since P is orthogonal, PP^{T} = I, implying that P^{T} = P^{-1}. Similarly for U. Let T = UΣ so that X = TP^{T}. Then XP = TP^{T}P or T = XP. The matrix T is the matrix of principal component scores and P is the matrix of principal components that transform X. See Ng [2013].
Following Ng [2013], you can now write the covariance matrix for the principal component scores, cov(T, T), as
cov(T, T) = 1/(n − 1) T^{T}T = 1/(n − 1) (XP)^{T}(XP) = P^{T}SP
where S = 1/(n − 1) X^{T}X. The matrix S is a covariance matrix and is square so a spectral decomposition can be applied to get S = UDU^{T}, where here U is the orthogonal matrix of eigenvectors of S and D is the diagonal matrix of eigenvalues. Then,
cov(T, T) = P^{T}UDU^{T}P.
Let P = U, then
cov(T, T) = U^{T}UDU^{T}U = D.
Since D is diagonal, the diagonal elements are the variances of the scores and they are in decreasing order. Also, since D is diagonal, the off-diagonal elements are all zero, implying that the scores are uncorrelated. Finally, as noted by Lay [2012], the matrix of principal components, P, makes the covariance matrix for the scores diagonal.
Without loss of generality, the columns of T are arranged in descending order of the variance explained so that the first column explains the most variance in X, the second column explains the second most, and so forth. That is,
T = XP
where P is a p × p transformation matrix called the principal components and T is the resulting n × p matrix of principal component scores. Usually, only the first k < p columns of T are needed since they account for most of the variation in X. This reduced matrix can be denoted as T_{k}. In principal components regression, the reduced matrix T_{k} replaces the matrix X in the OLS formulation.
See Jolliffe [2002] for the definitive treatment of principal components analysis.
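The steps in this overview can be sketched with NumPy as follows. The data are simulated and the choice of k = 2 retained components is arbitrary, so both are assumptions for illustration only. The sketch mean-centers the data with the vector of 1s, applies the SVD, forms the scores, and checks that the covariance matrix of the scores is diagonal.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data with correlated columns (hypothetical values).
n, p = 100, 4
raw = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))

# Mean-center using the n x 1 vector of 1s.
ones = np.ones((n, 1))
means = np.linalg.inv(ones.T @ ones) @ ones.T @ raw   # 1 x p row vector of means
X = raw - ones @ means

# SVD of the centered matrix; rows of Pt are the principal components.
U, s, Pt = np.linalg.svd(X, full_matrices=False)
P = Pt.T

# Principal component scores: T = XP, which equals U times the singular values.
T = X @ P

# Covariance matrix of the scores; it should be diagonal with decreasing entries.
cov_T = T.T @ T / (n - 1)
print(np.round(cov_T, 6))

# Share of variance by component and the reduced score matrix.
explained = s ** 2 / (s ** 2).sum()
k = 2                                   # arbitrary choice for illustration
T_k = T[:, :k]
print(explained)
```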
Principal components regression analysis
Principal components regression analysis involves using the principal component scores as the independent variables in a regression model. The columns of this score matrix are orthogonal by construction so multicollinearity is not an issue. If Y is a column vector for the dependent variable, then the model is Y = Tβ, ignoring the disturbance term vector for simplicity, and OLS can be used.
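A minimal sketch of principal components regression is given below. The data are simulated with two nearly collinear variables, and both the number of retained components (k = 2) and the added column of 1s for the constant are assumptions made for this illustration rather than part of the description above.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data: the first two variables are nearly collinear (hypothetical).
n = 150
z = rng.normal(size=n)
raw = np.column_stack([
    z + 0.01 * rng.normal(size=n),
    z + 0.01 * rng.normal(size=n),
    rng.normal(size=n),
])
Y = 1.0 + raw @ np.array([1.0, 1.0, 0.5]) + rng.normal(scale=0.1, size=n)

# Mean-center and extract the principal components with the SVD.
X = raw - raw.mean(axis=0)
_, _, Pt = np.linalg.svd(X, full_matrices=False)
P = Pt.T

# Keep the first k score columns; the near-collinear direction is dropped.
k = 2
T_k = X @ P[:, :k]

# OLS of Y on a constant and the retained scores.
Z = np.column_stack([np.ones(n), T_k])
gamma_hat = np.linalg.solve(Z.T @ Z, Z.T @ Y)
fitted = Z @ gamma_hat

# Implied coefficients on the centered original variables.
beta_pcr = P[:, :k] @ gamma_hat[1:]
print(gamma_hat)
print(beta_pcr)
```

Because the score columns are orthogonal, the matrix `Z.T @ Z` is well conditioned even though the original variables are almost perfectly collinear.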
Brief overview of partial least squares analysis
Partial least squares (PLS), initially developed by Wold [1966b], works by finding linear combinations of the independent variables, called manifest variables, which are directly observable. The linear combinations are latent or hidden in the data and are sometimes called factors, components (as in PCA), latent vectors, or latent variables. The factors should be independent of each other and account for most of the variance of Y. This is akin to principal components analysis. PLS uses the result that X can be decomposed into X = TP^{T} as shown above. The matrix T is the score matrix for X. In particular, a single linear combination or factor can be extracted from the X matrix, say t, which is one of many such possible factors. This factor represents a reduced combination of the variables in X which means it can be used in regression models for predicting X and Y. Let the predictions be X̂_{0} and Ŷ_{0}. The subscript “0” on both predictions indicates that this is the base or initial prediction. The two predictions are based on OLS estimations using the OLS estimation formula from above. In this case, the extracted factor, t, is the independent variable and X_{0} and Y_{0} are the dependent variables. The prediction for X_{0} is given by X̂_{0} = t(t^{T}t)^{-1}t^{T}X_{0}. Similarly, Ŷ_{0} = t(t^{T}t)^{-1}t^{T}Y_{0}.
The factor t as a linear combination of the manifest independent variables is important. This is, however, only one factor combination out of many possible combinations. The combination used should meet a criterion: the factor for X should have the maximum covariance with a factor extracted for Y. The extracted factor for Y is u = Y_{0}q. The covariance is cov(t, u) = t^{T}u. So the objective is to extract factors (or latent linear combinations of manifest independent and dependent variables) such that the covariance between them is as large as possible.
Once the first pair of factors is extracted, you have to find another pair that meets the same criterion. You cannot, however, use the first pair again so its effect has to be deleted. This is done by subtraction, thus creating two new matrices. That is, you now have X_{1} = X_{0} − X̂_{0} and Y_{1} = Y_{0} − Ŷ_{0}. This is sometimes referred to as “partialing out” the effect of a factor. The process outlined above is repeated using these two new matrices. The overall process of doing OLS regressions and partialing out the predicted values is continued until either you reach a desired number of extracted factors or no more factors can be extracted. The combination of OLS regressions and partialing out predicted values is the basis of the name partial least squares.^{17}
Since predicted values are partialed out of both the X and Y matrices, an iterative algorithm can be specified. This is usually written as four successive steps for i = 0, 1, ..., n where n is the maximum number of iterations of the algorithm:
- 1. Estimate the X weights as w = X_{i}^{T}u(u^{T}u)^{-1}
- 2. Estimate the X factor scores as t = X_{i}w
- 3. Estimate the Y weights as c = Y_{i}^{T}t(t^{T}t)^{-1}
- 4. Estimate the Y factor scores as u = Y_{i}c
Stop the iterations either when the number of desired iterations (i.e., factors) is reached or no more factors can be extracted as determined by a convergence criterion.^{18} The SAS Proc PLS implementation uses a default of n = 200 iterations and a default convergence criterion of 10^{-12}.
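The four steps and the partialing-out step can be sketched as a small NumPy function. This is an illustrative sketch, not the SAS or JMP implementation. The function name `nipals_pls` is hypothetical, the data are simulated, the weights are normalized (a common convention not stated above), and convergence is judged by the change in the X scores rather than by the criterion in the note.

```python
import numpy as np

def nipals_pls(X, Y, n_factors, max_iter=200, tol=1e-12):
    """Extract PLS score pairs (t, u) one factor at a time (illustrative sketch)."""
    X_i = X - X.mean(axis=0)
    Y_i = Y - Y.mean(axis=0)
    T, U = [], []
    for _ in range(n_factors):
        u = Y_i[:, [0]]                      # start from the first column of Y
        t_old = None
        for _ in range(max_iter):
            w = X_i.T @ u / (u.T @ u)        # step 1: X weights
            w /= np.linalg.norm(w)           # normalization (assumed convention)
            t = X_i @ w                      # step 2: X factor scores
            c = Y_i.T @ t / (t.T @ t)        # step 3: Y weights
            u = Y_i @ c / (c.T @ c)          # step 4: Y factor scores (rescaled)
            if t_old is not None and (
                np.linalg.norm(t - t_old) < tol * np.linalg.norm(t)
            ):
                break
            t_old = t
        # Partial out the OLS predictions based on t from both matrices.
        X_i = X_i - t @ (t.T @ X_i) / (t.T @ t)
        Y_i = Y_i - t @ (t.T @ Y_i) / (t.T @ t)
        T.append(t)
        U.append(u)
    return np.hstack(T), np.hstack(U)

# Simulated example (hypothetical data).
rng = np.random.default_rng(7)
X = rng.normal(size=(50, 5))
Y = X[:, :2] @ rng.normal(size=(2, 2)) + 0.1 * rng.normal(size=(50, 2))
T, U = nipals_pls(X, Y, n_factors=2)
print(np.round(T.T @ T, 6))                  # off-diagonal entries are near zero
```

Because each extracted factor is partialed out of X before the next one is found, the columns of `T` come out mutually orthogonal, which is what the printed cross-product matrix shows.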
The algorithm outlined here is called the NIPALS Algorithm which stands for “Nonlinear Iterative Partial Least Squares.” It was developed by Wold [1966a]. An alternative algorithm is SIMPLS. See de Jong [1993] for a discussion.
There are software packages that implement this PLS algorithm. SAS has Proc PLS and JMP has a partial least squares platform. The book by Cox and Gaudard [2013] gives an excellent overview of PLS using JMP.
An interesting history of PLS is provided by Gaston Sanchez: “The Saga of PLS” at sagaofpls.github.io.
Notes
- 1 A “no selection” option is often included so that the respondents are not forced to select anything from a choice set if nothing appeals to them.
- 2 In some conjoint studies, all the products are presented at once and respondents are asked to rank them in terms of their preference. I do not like or advocate this approach because it becomes impractical if the number of products becomes moderately large or if they are similar.
- 3 See https://performingarts.uncg.edu/patech/airturn-vs-pageflip-cicada-a-bluetooth-pedal-showdown/ and www.musicnotes.com/now/tips/the-3-best-hands-free-pageturners/ for commentary on devices with these, and other, features. Also see Page turning solutions for musicians: A survey pdf file in Notes
- 4 The term “runs” comes from the design of experiments literature.
- 5 See Paczkowski [2016] and Paczkowski [2018] for an explanation.
- 6 Dummy coding is also called indicator coding. In the machine learning literature, it is called one-hot encoding.
- 7 Source: https://en.wikipedia.org/wiki/Semantic_differential#Use_of_adjectives
- 8 The concept of a data cube is foundational to relational database design and management. See my discussion in Chapter 7 as well as Lemahieu et al. [2018].
- 9 Recall that for two levels, two dummy variables can be created, but only one is needed to avoid the dummy variable trap. Hence, there is only one dummy variable per attribute for my example.
- 10 The idea of an offset from one end, say the beginning, is not unique. It also appears in many programming languages. The Python package called Pandas, for example, has data tables, called DataFrames, for which the rows are indexed beginning at zero. The zero indicates that the row is 0 rows from the beginning; the row indexed by 1, which is the second row, is offset from the beginning by 1 row.
- 11 This is based on Chuang et al. [2001].
- 12 Based on my experience at Bell Labs, we often created mock-ups of new service concepts that were used in customer testing. The actual services were too far from actual development so using something from the development team was not practical. The mock-up served a good purpose. See “AT&T Test Kitchen” in CIO Magazine (May 15, 1994).
- 13 See www.ipsos-ideas.com/articlc.cfm rid=2166 for some discussion of these definitions.
- 14 Some refer to this as the “normal” price in the market.
- 15 These descriptions are from Paczkowski [2018]. Permission to use granted by Routledge.
- 16 See https://stats.stackexchange.com/questions/85436/what-could-it-mean-to-rotate-a-distribution. Also see https://en.wikipedia.org/wiki/Variance. Both last accessed on February 9, 2019.
- 17 There is some controversy regarding what “partial” really means. See the discussion at https://stats.stackexchange.com/questions/135527/what-is-the-partial-in-partial-least-squares-methods, last accessed March 1, 2019.
- 18 The convergence criterion is based on the difference X_{i} − X_{i−1}.