Dimensionality Reduction for Classification and Cluster Validation

Dimensionality reduction is often used as a preprocessing step in classification, regression, and cluster validation (Lunga et al. 2014). Once a low-dimensional model is obtained, classification and cluster validation techniques can be applied.

Cluster Validation

Clustering can be applied to data visualization, data organization, and exploratory data analysis to name a few applications. In extracting clusters, methods are also needed to measure/validate the performance of cluster algorithms, especially for large-dimensional data bases.

Table 6.1 summarizes some typical cluster quality measures. Excellent accounts and further analytical approaches are given in column two of this table.

Temporal Correlations

Recent studies suggest that the presence of strong temporal correlations may affect the performance of nonlinear dimensionality reduction techniques (Schoeneman et al. 2018). An implicit assumption in some of these techniques is that the input data is weakly correlated.

Data correlation implies that the observation matrix X is not full rank. This also implies that features of the data share information, which may result in spurious information in the data pattern (Ghojogh et al. 2019).


Cluster Quality Measures



Davies-Bouldin index

Davies and Bouldin (1979)

Dunn index

Dunn (1974)

Haikidi indexes

Haikidi and Vazirgiannis (2002)

Calisnki-Harbasz criterion

Calinski and Harbasz (1974)

Gap statistics

Albalate and Suendermann (2009)


Rousseeuw (1987)

Root-mean-square standard total deviation, RMD, etc.

Hennig et al. (2016)

Experience from this research shows that linear analysis techniques may overestimate the latent dimension of the process and fail due to the highly correlated nature of dynamic data.

The issue of temporal correlation is especially important in the case of dynamic processes represented by dynamic trajectories, such as in the case of transient simulations. Few attempts have been made to study these issues in the power system literature.

Time-Scale Separation of System Dynamics

Detailed simulation results reveal that projection-based models generate accurate estimates of modal parameters even in the presence of measurement noise and can be used to extract and create reduced representations of large-scale dynamic processes characterized by separation of time scales (Arvizu and Messina 2016; Berry et al. 2013).

In the spirit of spectral methods, a general model of the form

is adopted, where d denotes the fast modes of interest and the model (6.11) is estimated based on the frequency range or time scale of interest.

Markov Dynamic Spatiotemporal Models

First-Order Gauss-Markov Process

Markov state models are a powerful approach to model transient processes and can be used to cluster data. Following Chodera et al. (2007), assume that a process is observed at times t = 0, Af, 2Af,... where Af denotes the observation interval. The sequence of observations can be represented in terms of the state the system visits at each of these discrete times; the sequence of states produced is a realization of a discrete-time.

These models assume that the probability vector at some time nAt can be represented by a model of the form

where P(nAt) is the probability matrix of the Markov chain and p(0) is the initial probability vector (Moghadas and Jaberi-Douraki 2019; Swope and Pitera 2004). This implies that the future state depends on the current state and the observational data.

For this process to be described by a Markov chain, it must satisfy the Markov property, whereby the probability of observing the system in any state in the sequence is independent of all but the previous state. Use of this model gives

where P(r)] denotes higher powers of the transition matrix.

The following general characteristics of this method can be noted:

  • 1. Matrix P is symmetric and positivity preserving, (ш,, > 0).
  • 2. From the properties of a Markov matrix, Matrix P is row-stochastic (i.e., (mu = 1, Vi), with 0 < m„ < 1).
  • 3. P is positive semidefinite. Further, since P is a Markov matrix, all its eigenvalues are smaller than or equal to 1 in absolute value.
  • 4. The eigenvector spectrum of the transition probability matrix gives information about transitions between subsets and the time scales at which the transitions occur. More precisely, the time scale for a given transition can be calculated as

where r is the lag time and A is the eigenvalue (Noe et al. 2006).

As a result, all eigenvalues A,• of P are real and the eigenvectors form a basis. This latter feature is exploited in Section 6.6.3 to cluster power system data.

First-Order Markov Differential Equation

Insight into the nature of Markov processes can be obtained from the study of a memoryless master equation. The general form of a first-order Markov differential equation is given by

where p(t) is an и-dimensional vector containing the probability to find the system in each of its и states at time t. The matrix К = [fc/y] is a rate matrix, where fc,y represents the transition rate constant from state i to state j. As discussed by Noe et al. (2016), the system dynamics can be described by a discrete-time Markov process using the transition matrix T (r).

To formalize the model, let p (the vector of transition probabilities) be defined as

One way to model the dynamics of trajectories is to rephrase (6.15) in the form

where p0 = p(0) is the probability that the walker remains in its current position after the time step At (Wang and Ferguson 2018).

The transition matrix, therefore, provides complementary information to that available from static approaches and can be used for clustering dynamic analysis of memoryless systems.

Markov Clustering Algorithm

The Markov clustering method is a recent addition to clustering dynamic trajectories (Enright et al. 2002). The method is based on the analysis of a transition network, which is obtained by (i) Mapping the dynamic trajectories onto a discrete set of microstates, and (ii) Building a transition network in which the nodes are the microstates and a link is placed between them if two microstates are visited one after the other along the trajectory.

To formalize the adopted model, let X e 9f'"xN be the matrix of measurements corresponding to a given operating scenario.

The Markov clustering algorithm is summarized in the Markov Clustering Algorithm box.

Markov Clustering Algorithm * 1 2 3 4

Given a trajectory matrix X e 9t"'xN

  • 1. Obtain the undirected graph using the procedures in Chapter 3.
  • 2. Build a transition matrix, P = \_Pji] e 9mxN, in which each element Pji represents the transition probability from node j to node i. Normalize the matrix.
  • 3. Expand by taking the ith power of the matrix

and normalize each column to one.

4. Inflate each of the columns of matrix M with power coefficient r as

where r; is the inflation operator.

  • 5. Repeat steps (3) and (4) until MCL converges to a matrix MMcl (r).
  • 6. Compute the first few right eigenvalues and eigenvectors of the resulting matrix /-> using Equation (6.4). This results in a set of real eigenvalues and eigenvectors и,.

As discussed in Cazade et al. (2015), the parameter r determines the granularity of the clustering. This approach can be used to identify state subgroups sharing common, dynamically meaningful characteristics. Experience shows that a relatively low value of r may be enough to capture the dynamics of interest. Several variations of this method have been discussed in the literature.

The next example illustrates the application of this method.

Example 6.1

In this example, the measured data from Section 3.6 is used to illustrate the application of the Markov clustering algorithm. The data set is the same used in Figure 3.11 in Section 3.6.

Using the enumerated procedure, the transition matrix, P, is obtained as

where D and К are as defined in Section 3.4.

The transition matrix for this example is


Application of the Markov clustering algorithm results in 14 modes; the algorithm takes four steps to converge (p = 4). Figure 6.4 shows the clusters extracted using this approach. Comparison of the spatial pattern for the most dominant mode with that obtained using DMs in Figure 6.5 shows the accuracy of the developed models.


Schematic illustration of dimensionality reduction. Values unnormalized.


Schematic illustration of dimensionality reduction. Values unnormalized, (a) Diffusion maps (b) Markov chain.

Predictive Modeling

All major analytical techniques described above can be used for predictive modeling. The idea is simple and can be outlined as follows:

  • 1. Given a set of dynamic trajectories, build a low-dimensional representation using a dimensionality reduction technique.
  • 2. Compute spatial and temporal coefficients and build a spatiotempo- ral representation.
  • 3. Approximate the solution at all future times.

A typical representation model of the output signal y(f) that takes into account the time-varying and possibly nonlinear behavior of the observed data is

where S(f) = £j=1 Aj(t)cos(cOj(t)).

In the above expression, y(t) is the observed output time series, m(t) is a trend or low frequency component, S(t) is an oscillatory or quasi-oscillatory component, and e(t) denotes noise. Predictions can be obtained for each individual oscillatory component S(t), j = 1,n and the total model prediction, y(t), is obtained by combining the individual predictions.

Sensor Selection and Placement

Previous studies have suggested that a few diffusion coordinates are often enough to describe the observed global system dynamics. In this section, the sensor selection problem is introduced in the context of dimensionality reduction techniques. This is followed by a discussion of extrapolation.

Sensor Placement

Recent analytical experience by the author using model reduction techniques suggests that the spatial patterns уr-f (extracted using nonlinear projection methods) can be used to detect good observables or preliminary placement of sensors and impact the determination of relevant signals for sensor placement and state reconstruction. This may be useful to reduce the dimension of the data and/or identify or eliminate redundancy. The latter problem has been recently studied using analytical techniques (Lee 2018).

In Messina (2015) techniques to locate PMUs in order to enhance observability of critical oscillatory modes were introduced. Sensor placement to monitor global system behavior involves the solution of three interrelated problems (Zhang and Bellingham 2008): (a) Determining a limited number of sensors to observe global oscillatory behavior, (b) Selecting the most appropriate type of measurement variables or signal sources so as to have an effective description of the system based on a small number of coarse variables, and (c) Data fusion as in Arvizu and Messina (2016). In addition, it is desirable to avoid redundant measurements to improve the performance of wide-area monitoring systems and extract maximum useful information. A related but separate question is how to reconstruct the observed system behavior from a small number of sensors (field reconstruction) for monitoring and forecasting of wide-area dynamic processes (Usman and Faruque 2019). Recent literature discusses the applicability of these approaches to real-world power system models.

Attention in Section 6.7.2 is centered on the solution of issues (a) and (b).

Mode Reconstruction

Given a set of measurements X = {x, }!^ , a problem of interest is constructing a measurement matrix (a multisensor observing network) P„/,s e or equivalently, determining a subset of sensor locations х0ы = {^, *k2 ■■■ **,,},

such that x0i,s = P0bSX with p « m that accurately represents the original data set (Zhang and Bellingham 2008; Castillo and Messina 2019). A second objective is to represent the ш-dimensional measured data by a low-dimensional modal decomposition that captures critical oscillatory behavior of concern; the low-dimensional model can then be used for prognosis, forecasting, and the development of early warning system. Conceptually, this is equivalent to finding good observables that best describe the state of the original system. As a byproduct, the measurement space can be used for estimating system behavior at unobserved system locations.

A simple illustration of this idea is given in Figure 6.6.

In practical applications, matrix P0()S is determined by the user based on several criteria such as experience or the application of correlation measures. Note that a related issue was discussed in Chapter 4 in the context of compressive sampling.

The discussion hypothesizes that the selected signals can be determined from the application of nonlinear dimensionality reduction techniques. The proposed approach consists of two main stages: (a) Extracting a low-dimensional representation using any of the dimensionality reduction techniques described in Chapter 3, and (b) Identifying observables from the spatial structure of the low-dimensional embedding.

For reference and comparison, a guided search method is adopted to place sensors (Messina 2015; Alonso et al. 2004). An alternative approach for determining sensor placement has been recently proposed by Castillo and Messina (2019).


Generic model for state reconstruction and data interpolation. (Based on Castillo and Messina, 2019).

To optimize the number (and location) of sensors, the error between the approximate field obtained using information from a few sensors (a subspace X,i), d and the measured data, X, is calculated. The first problem considered consists of calculating the lowest possible estimation error for the field by using a set of leading modes, assuming that, the first modes retain most of the energy present in the system. In other words, the aim is reducing the dimensionality of data by finding dimension d < D on which the original data can be projected with a small error X г= X,( + e,i, where e,i is the error associated with the approximation.

It can be shown (Castillo and Messina 2019) that the error that minimizes st) leads to find the smallest average (mean-squared) distance which measures the difference between the original field X and its projection Xlt, namely

where £,? ш,„ is the error associated with the (d-D) unimportant or unphysical modes that are discarded.

Example 6.2 will illustrate these ideas.

Example 6.2

As an example, data sets of bus voltage magnitude deviations shown in Figure 6.7 for the Australian test power system were used to identify a limited number of measurements that characterize the overall system behavior. Based on the system response, the data matrix is defined as XeSHN*ra =[V] Уг... V,,,]7, with m = 174 and N = 3000.


Bus voltage magnitude deviations following a three-phase short circuit at bus 217.

A two-step approach is utilized to detect good observables to characterize global motion: (a) Introduce a critical contingency that stimulates electromechanical modes in a frequency range of concern а>„ф, mx, and (b) Locate sensors to observe the dynamic of interest.

Results involve a three-phase stub fault at bus 217, cleared after 0.05 seconds; this fault is found to excite the two slowest modes in the system. The data set contains 30 seconds of bus voltage magnitude measurements at 59 system locations. Both, load buses and generator terminal buses are selected for analysis.

Table 6.2 lists optimal sensor locations using the approach shown in Messina (2015). In this table, the second column gives the cumulative energy. The technique identifies buses 308,315,300 and 509 as the best options to locate sensors.

As discussed in Section 6.4, dimensionality reduction provides a simple approach to estimate good observables for state reconstruction and sensor placement.

Figure 6.8 shows the time evolution of the dominant Koopman modes extracted using the procedure in Chapter 4, while Figure 6.9 shows the volt- age-based patterns extracted from the bus voltage measurements.

Figure 6.6 shows the voltage-based shape extracted from the time series in Figure 6.7 using Koopman analysis. For the sake of clarity and completeness, voltage-based shapes are obtained for the three slowest modes in the system. Results suggest that candidate sensor locations are associated with


Sensor Placement Results. Time Window: 1000-3000 s

Candidate Sets

Cumulative Energy




Dominant Koopman modes.


Voltage-based shapes associated with the Koopman modes in Figure 6.6. (a) Koopman mode 1, (b) Koopman mode 2, and (c) Koopman mode 3.

the dominant modes excited by the perturbation. This observation is substantially confirmed in the analysis of various test systems and operating conditions.

Based on voltage measurements, results show that dynamic reduction can be useful as an exploratory data mining tool for measured power system data.


As a motivation for spatial interpolation, let the time evolution of sensors measurements at location i be given by the POD-Galerkin expansion

where r denotes the number of selected measurements. Similar approximations can be obtained using other approaches such as DMD, DHR, or Prony analysis to mention a few techniques.

Let {s; :j = l,...,K} denote the location of measuring locations. An estimate of the time evolution of sensor location j (a site where no measurements are taken) can be obtained as

where the иц are prediction weights for the site i and the unobserved site j (Munoz et al. 2008). The general framework for data extrapolation is illustrated in Figure 6.10.

If necessary, the data can be standardized using the procedures describe in Chapter 3, in which, for system locations where no measurements are available—see Eq. (3.10), the mean values, q(x,), and standard deviations, cr(Xj), have to be estimated.

In the simplest case, weights can be obtained from topological conditions; more elaborate approaches that reduce spatial variability can be obtained using formal approaches such as Krigging, correlation analysis, or intelligent analysis techniques.

Data Classification

Once model reduction has been obtained, data classification techniques can be applied to the reduced model. Common classifiers include techniques such support vector machines and linear discriminant analysis techniques. While these methods can be applied directly to the original unreduced data, the large dimensionality of the models may preclude an in-depth analysis of features of interest.



< Prev   CONTENTS   Source   Next >