- Dimensionality Reduction and Feature Extraction and Classification
- Introduction: Background and Driving Forces
- Fundamentals of Dimensionality Reduction
- Formal Framework
- Numerical Considerations
- Data-Driven Feature Extraction Procedures
- Feature Extraction
- Feature Selection
- Dimensionality Reduction Methods
- Projection-Based Methods
- Coarse Graining of Dynamic Trajectories
- Koopman and DMD Analysis
- Physical Interpretation

# Dimensionality Reduction and Feature Extraction and Classification

## Introduction: Background and Driving Forces

Dimensionality reduction has been an active field of research in power system analysis and control in recent years. These methods produce low-dimensional representations of high-dimensional data, where the representation is chosen to accurately preserve relevant structure. A survey of dimension reduction techniques is given in (Fodor 2002).

Dimensionality reduction is important for various reasons. First, mapping high-dimensional data onto a low-dimensional manifold facilitates a better understanding of the underlying processes, helps eliminate or reduce irrelevant features and noise, and facilitates retrieval of selected dynamic information. In addition, an effective reduced description provides some intuition about the physical system or process and allows visualization of the data using two or three dimensions (Mendoza-Schrock et al. 2012). Finally, with a successful application of these methods, it is possible to find features that may not be apparent in the original space.

Most approaches to dimensionality reduction are based on three main steps: (a) Data normalization, (b) Dimension reduction, and (c) Clustering and data classification.
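The three steps can be sketched in a few lines of NumPy. This is only an illustrative sketch: PCA stands in for the reduction step and a plain 2-means loop for the clustering step, and the function name `reduce_and_cluster`, the synthetic data, and the seeding rule are assumptions made for the example rather than choices taken from the text.

```python
import numpy as np

def reduce_and_cluster(X, d=2, n_iter=20):
    """Normalize X (rows: observations, columns: sensors), keep d principal
    coordinates, and split the observations into two clusters."""
    # (a) Data normalization: zero mean and unit variance for each column.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # (b) Dimension reduction: PCA through the SVD of the normalized data.
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    Y = Z @ Vt[:d].T                       # low-dimensional coordinates, N x d
    # (c) Clustering: 2-means seeded with the extreme points of coordinate 1.
    centers = Y[[np.argmin(Y[:, 0]), np.argmax(Y[:, 0])]]
    for _ in range(n_iter):
        dist = np.linalg.norm(Y[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        centers = np.array([Y[labels == k].mean(axis=0) if np.any(labels == k)
                            else centers[k] for k in range(2)])
    return Y, labels

# Hypothetical data: two groups of 50 observations recorded at 10 sensors.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 10)), rng.normal(5.0, 1.0, (50, 10))])
Y, labels = reduce_and_cluster(X)
```

Clustering is carried out on the two retained coordinates rather than on the ten original columns, which is the point of performing step (b) before step (c).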

Multivariate dimension reduction techniques have been applied to power system measured data including POD/PCA analysis, PLS and their nonlinear variants, DM, and DMD to mention a few approaches (Chen et al. 2013; Ramos et al. 2019; Arvizu and Messina 2016). These models have been shown to be useful in the analysis and characterization of the global behavior of transient processes, as well as to extract and isolate the most dominant modes for model reduction of complex systems. Current analysis approaches, however, are limited in that they consider that the distribution of physically relevant states is highly clustered around a set of much lower dimensionality, and often result in full matrix representations.

In this chapter, a brief description and background of dimensionality reduction is presented, including a general review of related work in this field of research. The context and framework for the development of reduced order representations are set forth, with a particular focus on the extraction of spatiotemporal patterns of large data sets.

## Fundamentals of Dimensionality Reduction

### Formal Framework

Dimensionality reduction techniques map (project) high-dimensional data points in the original observation space ($\mathbb{R}^{m \times D}$) to a low-dimensional space ($\mathbb{R}^{m \times d}$) by constructing meaningful coordinates that are combinations of the original feature vectors. Usually, this is achieved by optimizing a certain criterion or objective function; see Van der Maaten et al. (2009) for a taxonomy of dimensionality reduction methods and the definition of constraints.

In practice, the number of coordinates $d \ll D$ is selected to retain or preserve as much information as possible; in power system applications, $m$ typically represents the number of sensors (or network points or states) and $d$ is the number of coordinates, which are often associated with global system behavior.

Figure 6.1 schematically summarizes the context and main purposes of dimensionality reduction. Inputs to this model are raw data collected from dynamic sensors or selected features derived from measured data. With current multichannel PMUs and other specialized sensors coexisting together, inputs to the data fusion process are multiple sets of data often differing in dimensionality, nature, and even resolution.

FIGURE 6.1

Basic dimensionality reduction pipeline.

Emerging from the data fusion center are techniques to visualize, classify, correlate, and forecast system dynamic behavior. Analysis tasks such as clustering, classification and regression, visualization, and event or anomaly detection may be carried out efficiently in the constructed low dimensional space (Chen et al. 2013).

Three main advantages can be obtained using such an approach: (a) Hidden parameters may be discovered, (b) Noise effects can be suppressed or reduced, and (c) Interesting and significant structures (patterns) are revealed or highlighted such as change points and hidden states. In addition, data can be efficiently visualized using a few dimensions.

The application of common dimensionality reduction techniques, however, faces several challenges:

- Systematically choosing effective low-dimensional coordinates is a difficult problem. Usually the mapping is non-invertible and special techniques are needed to relate the observed behavior in the low-dimensional space to the original (data) space (Arvizu and Messina 2016).
- Linear analysis methods may fail to characterize nonlinear relationships in measured data and result in poor or incorrect classification or clustering of measured data.
- Dimensionality reduction methods in other fields are often developed for data that have no dynamics (Berry et al. 2013). Moreover, many power system dynamic processes can be characterized by time scale separation, which may affect the performance of many techniques.

The following sections examine some open issues in the application of linear and nonlinear dimensionality reduction techniques to measured power system data in the context of data fusion and data mining applications.

### Numerical Considerations

The realization of practical dimensionality reduction techniques is challenging due to the large number of measurements collected, noise, and other effects. By reducing the model to the number of sensors or network points ($m$), the size of the problem can be substantially reduced. This task is also referred to as *input space reduction*.

With the increasing size of recording lengths ($N$), with $N > m$, algorithms with the ability to compress system information are needed. The issue of near real-time processing ($N$ small), on the other hand, requires special analysis techniques.

Spectral dimensionality reduction algorithms perform an eigendecomposition on a matrix that is proportional in size to the number of data points; when the number of data points increases, so does the time required to compute the eigendecomposition. For very large data sets, this decomposition step becomes intractable.

For typical power system applications involving WAMS, the dimension *m* is moderately large (in the order of a few dozen to hundreds of points or sensors), and therefore, efficient data reduction techniques can be developed.
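For the linear (PCA/POD) case, the effect of a moderate $m$ can be illustrated directly: the nonzero eigenvalues of the large $N \times N$ matrix $XX^T$ coincide with those of the small $m \times m$ matrix $X^TX$, so the decomposition can be carried out on the small matrix. The sketch below assumes a data matrix with $N$ time samples in rows and $m$ sensors in columns; the sizes and the random data are placeholders.

```python
import numpy as np

rng = np.random.default_rng(1)
N, m = 2000, 20                        # N time samples, m sensors (N >> m)
X = rng.standard_normal((N, m))
X -= X.mean(axis=0)

# Small problem: an m x m symmetric matrix, whatever the record length N.
lam_small, V = np.linalg.eigh(X.T @ X)
lam_small, V = lam_small[::-1], V[:, ::-1]      # sort in descending order

# Eigenvectors of the large N x N matrix X X^T, recovered without forming it.
U = (X @ V) / np.sqrt(lam_small)

# Check on the leading vector: (X X^T) u = lambda u.
residual = np.linalg.norm(X @ (X.T @ U[:, 0]) - lam_small[0] * U[:, 0])
```

This shortcut relies on the linear kernel; for kernel-based spectral methods the affinity matrix is generally $N \times N$ and other approximations are needed.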

## Data-Driven Feature Extraction Procedures

### Feature Extraction

Feature extraction consists of finding a set of measurements or a block of information with the objective of describing in a clear way the data or an event present in a signal. These measurements or features can then be used for detection, classification, or regression (prediction) tasks in power system analysis. Feature selection is also important for model reduction of power grid networks and visual representation of monitored system locations (Lee 2018; Ghojogh et al. 2019). As noted by Lunga et al. (2014), however, feature selection and feature extraction may result in some loss of information relative to the original (unreduced) data.

Formally, given a measurement matrix, $X = [x_1\ x_2\ \cdots\ x_N] \in \mathbb{R}^{m \times N}$, feature extraction aims at determining a reduced-order representation of the data, $\hat{X} = [\hat{x}_1\ \hat{x}_2\ \cdots\ \hat{x}_N] \in \mathbb{R}^{p \times N}$, such that $p \ll m$. Usually, the quality of the low-dimensional representation can be assessed using techniques such as the residual variance (Tenenbaum et al. 2000; Dzemyda et al. 2013; Lu et al. 2014).

### Feature Selection

Feature selection aims at selecting a subset of the most representative features $\{\hat{x}_1, \hat{x}_2, \ldots, \hat{x}_p\} \subset \{x_1, x_2, \ldots, x_m\}$ according to an objective function or criterion (Yan et al. 2006). At its most basic level, feature selection aims at determining a model of the form $\hat{X} = WX$, where $W$ is usually a binary matrix.

The reconstruction error can be expressed for the ith signal in the form or, at a global level, as
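As a rough illustration of selection with a binary matrix and of per-signal versus global error, the sketch below picks the $p$ highest-variance signals and measures how well all $m$ signals can be rebuilt from them. The variance criterion, the least-squares reconstruction, the Euclidean and Frobenius norms, and the name `select_features` are assumptions made for the example; they are not necessarily the definitions intended in the text.

```python
import numpy as np

def select_features(X, p):
    """Binary p x m matrix W that picks the p highest-variance rows of X."""
    m = X.shape[0]
    idx = np.sort(np.argsort(X.var(axis=1))[-p:])
    W = np.zeros((p, m))
    W[np.arange(p), idx] = 1.0
    return W

rng = np.random.default_rng(2)
t = np.linspace(0.0, 10.0, 500)
# Hypothetical data: 6 signals (rows) driven by 2 oscillations plus small noise.
S = np.vstack([np.sin(2 * np.pi * 0.3 * t), np.cos(2 * np.pi * 0.8 * t)])
M = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5],
              [0.3, -0.2], [-0.4, 0.1], [0.2, 0.6]])
X = M @ S + 0.01 * rng.standard_normal((6, t.size))

W = select_features(X, p=2)
X_hat = W @ X                                  # selected signals, p x N

# Least-squares reconstruction of every signal from the selected ones.
B = X @ np.linalg.pinv(X_hat)
X_rec = B @ X_hat
err_i = np.linalg.norm(X - X_rec, axis=1)      # error of the ith signal
err_global = np.linalg.norm(X - X_rec)         # Frobenius norm over all signals
```

Selected signals are reproduced exactly, so their entries in `err_i` are at round-off level; the remaining entries indicate how much information the discarded signals carried.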

Feature extraction analysis for power system applications differs from that in other fields in many respects. First, features of interest involve both pre-fault and post-fault system conditions (Li et al. 2015). Further, power system data is time-ordered, noisy, and uncertain.

## Dimensionality Reduction Methods

Dimensionality reduction methods produce low-dimensional representations of high-dimensional data where the representation is chosen to preserve or highlight features of interest according to a suitable optimization criterion that is specific to each method; refer to Lunga et al. (2014) for a description of affinities and constraints for common embedding algorithms. Dimensionality reduction techniques are often divided into convex and nonconvex according to the optimization criteria utilized (Van der Maaten et al. 2006). Other criteria are proposed in (Lee and Verleysen 2007).

Nonlinear dimensionality reduction methods can be broadly characterized as selection-based and projection-based. Among these methods, spectral dimensionality reduction (manifold-based) techniques have been successfully applied to power system data. Other dimensionality reduction tools, such as Koopman-based techniques and Markov algorithms are rapidly gaining acceptance in the power system literature (Ramos and Kutz 2019; Messina 2015).

As discussed in Chapters 3 and 4, the primary goal of nonlinear dimensionality reduction techniques is to obtain an optimal low-dimensional basis, $\psi_j(x)$, for representing an ensemble of high-dimensional data, $X$. This basis can then be used to formulate reduced-order models of the form

$$X(x,t) \approx a_0(t)\psi_0(x) + \sum_{j=1}^{d} a_j(t)\psi_j(x) \tag{6.3}$$

where $a_j(t)$ represent temporal trajectories, $\psi_j(x)$ are the spatial patterns, and $a_0(t)\psi_0(x)$ indicates average behavior (Long et al. 2016). Each temporal coefficient, $a_j(t)$, provides information about a particular frequency or a frequency range, while $\psi_j(x)$ gives information about spatial structure. Such a parameterization allows a broad range of physical processes and models to be studied.
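A minimal linear instance of such a parameterization is the POD/PCA expansion obtained from the SVD. In the sketch below the average term is taken as the time mean of the data with a constant temporal coefficient, which is one possible convention and an assumption here; the function name `pod_decompose` and the synthetic field are likewise illustrative.

```python
import numpy as np

def pod_decompose(X, d):
    """Split X (N time samples x m locations) into a mean term plus d modes."""
    psi0 = X.mean(axis=0)                  # average spatial pattern
    a0 = np.ones(X.shape[0])               # constant temporal coefficient
    U, s, Vt = np.linalg.svd(X - np.outer(a0, psi0), full_matrices=False)
    a = U[:, :d] * s[:d]                   # temporal coefficients a_j(t), N x d
    psi = Vt[:d].T                         # spatial patterns psi_j(x), m x d
    return a0, psi0, a, psi

t = np.linspace(0.0, 10.0, 400)
x = np.linspace(0.0, 1.0, 30)
# Hypothetical field: constant offset plus two oscillating spatial patterns.
X = (2.0
     + np.outer(np.sin(2 * np.pi * 0.5 * t), np.sin(np.pi * x))
     + 0.3 * np.outer(np.cos(2 * np.pi * 1.5 * t), np.sin(2 * np.pi * x)))

a0, psi0, a, psi = pod_decompose(X, d=2)
X_rec = np.outer(a0, psi0) + a @ psi.T     # mean term plus sum of a_j psi_j^T
rel_err = np.linalg.norm(X - X_rec) / np.linalg.norm(X)
```

Each column of `a` can then be inspected for its frequency content and each column of `psi` for its spatial shape.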

### Projection-Based Methods

The underlying idea behind these methods is the spectral decomposition of a (square) symmetric feature matrix (Strange and Zwiggelaar 2014, Singer 2009). Common examples include linear models such as PCA/POD, PLS, nonlinear projection methods, and Koopman-based representations.

To pursue these ideas further, consider a set of points, $X = \{x_i\}_{i=1}^{M} \in \mathbb{R}^{D}$ ($D$-dimensional feature or input space), representing a data matrix in a high-dimensional space $\mathbb{R}^{D}$, where $D$ usually corresponds to the number of recorded responses. The goal of dimensionality reduction techniques is to recover a set of $d$-dimensional data $\hat{X} = \{\hat{x}_i\}_{i=1}^{M} \in \mathbb{R}^{d}$, with $d \ll D$, such that $\hat{X}$ accurately captures the relevant features of $X$. Conceptually, this is achieved by computing the eigenvectors of a feature matrix derived from $X$. Implicit in this notion is the existence of a mapping $f: \mathbb{R}^{D} \to \mathbb{R}^{d}$ that maps high-dimensional data $x_i$ to low-dimensional $\hat{x}_i$.

Regardless of the dimensionality reduction technique adopted, these methods solve an eigenvalue problem of the form

$$P\psi_j = \lambda_j \psi_j$$

or, alternatively,

$$L\psi_j = \lambda_j \psi_j$$

In the first case, $P$ represents a Markov or transition matrix, a Hessian matrix, or an alignment matrix, to mention a few alternatives. In the second case, $L$ denotes, for instance, the graph Laplacian.

The corresponding spatial structure is given by a set of coordinates of the form

which are expected to capture the intrinsic degrees of freedom (latent variables) of the data set. A schematic representation of the reduction process is given in Figure 6.2. Examples of these methods are given in Chapters 3 and 5.
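One concrete instance of the Markov-matrix case is a basic diffusion map. The sketch below builds a Gaussian affinity matrix, normalizes it into a Markov matrix $P$, and uses its leading nontrivial eigenvectors, scaled by their eigenvalues, as coordinates. The median bandwidth heuristic, the eigenvalue scaling of the coordinates, and the function name `diffusion_coordinates` are assumptions made for illustration.

```python
import numpy as np

def diffusion_coordinates(X, d=2, eps=None):
    """Embed the rows of X (M points in R^D) into R^d with a basic diffusion map."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)  # squared distances
    if eps is None:
        eps = np.median(sq)                # kernel bandwidth (heuristic choice)
    K = np.exp(-sq / eps)                  # Gaussian affinity matrix
    deg = K.sum(axis=1)
    # P = D^{-1} K is a Markov matrix; its symmetric conjugate S has the same
    # eigenvalues and allows a symmetric eigensolver to be used.
    S = K / np.sqrt(np.outer(deg, deg))
    lam, Q = np.linalg.eigh(S)
    lam, Q = lam[::-1], Q[:, ::-1]         # descending eigenvalues
    psi = Q / np.sqrt(deg)[:, None]        # right eigenvectors of P
    # Drop the trivial pair (eigenvalue 1, constant eigenvector).
    return lam[1:d + 1], psi[:, 1:d + 1] * lam[1:d + 1]

# Hypothetical data: two tight groups of 30 points in a 5-dimensional space.
rng = np.random.default_rng(3)
pts = np.vstack([rng.normal(0.0, 0.1, (30, 5)), rng.normal(3.0, 0.1, (30, 5))])
lam, coords = diffusion_coordinates(pts, d=2)
```

For this data the sign of the first coordinate separates the two groups, which is the kind of latent structure the embedding is meant to expose.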

Remarks:

FIGURE 6.2

Illustration of dimension reduction. Dimensionality reduction involves mapping high-dimensional inputs into a low-dimensional representation space that best preserves the relevant features of the data.

Depending on the analysis approach, the modal coordinates, $\psi_k$, can be orthogonal or non-orthogonal and may exhibit some degree of correlation. This also applies to temporal components in Equation (6.3).

Now, attention is turned to the use of dimensionality reduction for cluster validation and classification. Whenever possible, a physical interpretation is provided.

### Coarse Graining of Dynamic Trajectories

As discussed earlier, power system data is time-varying in nature and may exhibit time scale separation. Slow motion is usually captured by the first few eigenvectors in Equation (6.5) and the associated temporal coefficients. Identifying these components, however, is not straightforward and may require special algorithms or analysis techniques and some fine tuning.

Two problems are of interest here:

1. Extracting the essential coordinates, $d$, from the low-dimensional representation, and
2. The analysis of control or physical interactions between states or physical phenomena.

The first problem has been recently studied using diffusion maps (Arvizu and Messina 2016). The second problem, in terms of Equation (6.5), concerns the analysis of mode interaction described in the context of the perturbation theory of linear systems and is illustrated schematically in Figure 6.3.

FIGURE 6.3

Schematic illustration of control interactions. (Based on Arvizu and Messina, 2016.)

One way to achieve this is to analyze each component $X_j = a_j(t)\psi_j(x)$ separately to infer spatiotemporal content. Another approach is to use a Markov chain analysis.

In Sections 6.4.3 and 6.6.3, two alternatives to dimensionality reduction are explored: Koopman analysis and Markov chains.

### Koopman and DMD Analysis

Koopman mode analysis and its variants provide an effective alternative to reduce system dimensionality. In the case of manifold learning techniques, the dominant eigenvectors (coordinates) approximate the most slowly evolving collective modes or variables over the data. Applications of this approach to both simulated and measured data are described in Barocio et al. (2015) and Ramos and Kutz (2019).

From dynamic decomposition theory, these models can be expressed in the general form

where the matrices $X_{\mathrm{dmd}_k}$, $k = 1, \ldots, p$, in Equation (6.6) capture specific behavior associated with modes of motion.
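One way to obtain mode-wise matrices of this kind is exact DMD, where each retained mode contributes a rank-one matrix and the contributions add up to the snapshot matrix. The sketch below is a generic textbook-style implementation, not necessarily the formulation used in the cited references; the function name `dmd_components` and the test system are hypothetical.

```python
import numpy as np

def dmd_components(X, r):
    """Exact DMD of snapshots X (m x N): eigenvalues and one matrix per mode."""
    X1, X2 = X[:, :-1], X[:, 1:]
    U, s, Vh = np.linalg.svd(X1, full_matrices=False)
    U, s, Vh = U[:, :r], s[:r], Vh[:r]
    A_tilde = U.conj().T @ X2 @ Vh.conj().T / s       # reduced operator, r x r
    lam, W = np.linalg.eig(A_tilde)
    Phi = (X2 @ Vh.conj().T / s) @ W                  # DMD modes, m x r
    # Mode amplitudes fitted to the first snapshot.
    b = np.linalg.lstsq(Phi, X[:, 0].astype(complex), rcond=None)[0]
    powers = lam[:, None] ** np.arange(X.shape[1])    # lam_k^0, lam_k^1, ...
    comps = [np.outer(Phi[:, k] * b[k], powers[k]) for k in range(r)]
    return lam, comps

def rot(theta, rho):
    """2 x 2 rotation by theta scaled by rho (a damped oscillation)."""
    c, s = np.cos(theta), np.sin(theta)
    return rho * np.array([[c, -s], [s, c]])

# Hypothetical linear system with two damped oscillatory modes.
A = np.zeros((4, 4))
A[:2, :2] = rot(0.2, 0.98)
A[2:, 2:] = rot(0.5, 0.95)
X = np.empty((4, 100))
X[:, 0] = [1.0, 0.0, 1.0, 0.0]
for k in range(99):
    X[:, k + 1] = A @ X[:, k]

lam, comps = dmd_components(X, r=4)
X_sum = sum(comps)                     # sum of the per-mode matrices
```

For real data the complex-conjugate pairs in `comps` would normally be combined so that each oscillatory mode is represented by a single real matrix.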

### Physical Interpretation

The Koopman modes have a useful and interesting interpretation as columns of a matrix of observability measures (Ortega and Messina 2017).

To illustrate these ideas, consider the linear system

$$\dot{z}(t) = Az(t) + Bu(t), \qquad y(t) = Cz(t), \qquad z(0) = z_0$$

where $z(t) \in \mathbb{R}^{n}$ is the vector of state variables; $y(t) \in \mathbb{R}^{m}$ is the vector of system outputs; $A$, $B$, and $C$ are constant matrices of appropriate dimensions; and $z_0$ is the vector of initial conditions.

As suggested by Chan (1984), a useful measure of observability is motivated by the expression

where $\lambda_i$ is the $i$th eigenvalue, and $v_i$ and $w_i$ are the corresponding right and left eigenvectors.
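For orientation, a standard expression of this type is the modal expansion of the free response of $\dot{z} = Az$, $y = Cz$. It is written here under the assumptions of zero input, distinct eigenvalues, and left and right eigenvectors normalized so that $w_i^{T} v_j = \delta_{ij}$; it may differ in detail from the expression used by Chan (1984).

```latex
y(t) = C e^{At} z_0
     = \sum_{i=1}^{n} (C v_i)\,\bigl(w_i^{T} z_0\bigr)\, e^{\lambda_i t},
\qquad A v_i = \lambda_i v_i, \quad w_i^{T} A = \lambda_i w_i^{T}
```

In this form the vector $C v_i$ weights how strongly mode $i$ appears in each output, while $w_i^{T} z_0$ measures how strongly the initial condition excites it. Collecting the vectors $C v_i$ column by column gives the product $CV$, with $V = [v_1 \ \cdots \ v_n]$.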

The observability matrix is defined as

It can be proved that, for the case of a linear system $x_{k+1} = Ax_k$, the Koopman eigenfunctions can be defined as $\varphi_i(x_k) = \langle x_k, w_i \rangle$, for the observables $f(x_k) = x_k$. It has been proved, in analogy with matrix $CV$, that, for the linear case, the empirical Ritz eigenvectors, $v_i$, can be interpreted as the columns of a matrix (Ortega and Messina 2017)

Analysis of this expression may provide insight about regions of nonlinearity and observability associated with the fundamental modes of motion.
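These relations can be checked numerically on a small example. The sketch below uses a hypothetical discrete-time system, forms the matrix $CV$ from the right eigenvectors, and verifies that the inner products with the left eigenvectors behave as Koopman eigenfunctions, that is, $\varphi_i(x_{k+1}) = \lambda_i \varphi_i(x_k)$. The matrices, the column-norm measure `obs_measure`, and the eigenvector normalization are assumptions chosen for the illustration.

```python
import numpy as np

# Hypothetical system x_{k+1} = A x_k with outputs y_k = C x_k.
A = np.array([[0.9, 0.2, 0.0],
              [-0.2, 0.9, 0.0],
              [0.0, 0.0, 0.5]])
C = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])

lam, V = np.linalg.eig(A)          # right eigenvectors are the columns of V
Wh = np.linalg.inv(V)              # rows: left eigenvectors, with Wh @ V = I

CV = C @ V                         # column i: footprint of mode i on the outputs
obs_measure = np.linalg.norm(CV, axis=0)

def phi(x):
    """Koopman eigenfunctions for the identity observable: <x, w_i>."""
    return Wh @ x

x0 = np.array([1.0, -0.5, 2.0])
x1 = A @ x0
eig_residual = np.linalg.norm(phi(x1) - lam * phi(x0))

# The state one step ahead is rebuilt from modes, eigenvalues, and eigenfunctions.
x1_modal = (V @ (lam * phi(x0))).real
```

A column of `CV` with a small norm indicates a mode that is weakly visible in the chosen outputs.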

In cases where the observational data can be associated with a dynamic process measured at different system locations, data points can be visualized as dynamic trajectories whose temporal (spatial) behavior may be highly correlated and exhibit nonlinear dynamics.

As suggested in Equation (6.3), spatiotemporal behavior can be approximated by a summation of block-matrices of the form $X_j = a_j(t)\psi_j(x)$.

- The eigenvectors (diffusion coordinates) $\psi_k$ give a good description of the slow dynamics (collective variables) of the system. Experience shows that only a few coordinates, $d$, are needed to capture system behavior.
- The mapping $\Psi: \mathbb{R}^{n} \to \mathbb{R}^{d}$ is only given at the recorded states. Techniques to extend this approach to nearby points in the original space, without requiring full recomputation of a new matrix and its eigenvectors, are given in Erban et al. (2007).
- Using this information, the modes that dominate the system response and the global measurements that contribute most to the oscillations can be accurately captured.