Data Reduction: Dimensionality Reduction

Dimensionality in machine learning and related data fields refers to the number of attributes in a data set, in other words, how many columns are considered as explanatory variables and included in the modeling step. It is essential to find the best low-dimensional representation of the data in order to avoid making models unnecessarily complicated, to prevent overfitting, and to avoid the inclusion of redundant variables.

TABLE 4.14
Sample Data with Nonnumeric Data and Its Encoded Form

Equipment ID | Productivity (Daily per Shift) | Operator | Date       | Shift | Shift Encoded
HT05         | 3,000                          | 4,210    | 01/07/2018 | Night | 1
HT02         | 4,100                          | 4,315    | 01/07/2018 | Night | 1
HT01         | 2,650                          | 4,166    | 01/07/2018 | Night | 1
HT01         | 4,000                          | 4,200    | 02/07/2018 | Day   | 0
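
As a minimal sketch of how the Shift Encoded column of Table 4.14 could be produced, the following assumes pandas; the column names and the mapping (Night to 1, Day to 0) are taken from the table, and everything else is illustrative:

    import pandas as pd

    # Records mirroring the string column of Table 4.14
    df = pd.DataFrame({
        "Equipment ID": ["HT05", "HT02", "HT01", "HT01"],
        "Shift": ["Night", "Night", "Night", "Day"],
    })

    # Map each category label to an integer code, matching the
    # "Shift Encoded" column of the table (Night -> 1, Day -> 0)
    df["Shift Encoded"] = df["Shift"].map({"Night": 1, "Day": 0})
    print(df)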

TABLE 4.15
Common Feature Scaling and Encoder Techniques, and Their Recommended Use

Normalization: Scaling a vector to its unit norm is useful whenever a distance-based algorithm is used. It is also useful when the data set has sparse features (many zeros).

Standardization: Useful in multivariate analysis when the features are on different scales. Artificial neural networks also converge better when the data are in standard-score form.

Fourier transform: Useful for high-dimensional time-series data.

Log transform: Useful when the data set has skewed features.

Min-Max: Useful for achieving better convergence of optimization algorithms. Gradient descent is highly sensitive to feature scale, so rescaling the data to a specific range can speed it up.

Encoders: Can be used to map one set of values to another; in particular, for string data an encoder can map string values to numeric values.
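
The following is an illustrative sketch of several transforms from Table 4.15, assuming scikit-learn's scaler implementations and a toy feature matrix (both are assumptions, not part of the original text):

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler, Normalizer, StandardScaler

    # Toy feature matrix; rows are samples, columns are features
    X = np.array([[3000.0, 4210.0],
                  [4100.0, 4315.0],
                  [2650.0, 4166.0],
                  [4000.0, 4200.0]])

    print(Normalizer().fit_transform(X))      # each row scaled to unit (L2) norm
    print(StandardScaler().fit_transform(X))  # each column to zero mean, unit variance
    print(MinMaxScaler().fit_transform(X))    # each column rescaled to [0, 1]
    print(np.log1p(X))                        # log transform for skewed features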

There are two main approaches for optimizing dimensionality: feature selection and feature extraction.

Feature extraction refers to simplifying multidimensional data by keeping fewer, derived dimensions while maintaining a meaningful representation. The final reduced representation should have a dimensionality that corresponds to the intrinsic dimensionality of the data, that is, the minimum number of parameters needed to preserve the observed properties of the original data set. The most popular methods are principal component analysis (PCA), singular value decomposition (SVD), and kernel PCA.
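
As a hedged illustration of feature extraction, the sketch below uses PCA as implemented in scikit-learn (one common implementation, not necessarily the one the text has in mind) on synthetic data:

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 10))  # 100 samples, 10 synthetic features

    # Project onto the two directions of highest variance
    pca = PCA(n_components=2)
    X_reduced = pca.fit_transform(X)

    print(X_reduced.shape)                # (100, 2)
    print(pca.explained_variance_ratio_)  # fraction of variance kept per component

In practice, the number of retained components is often chosen so that the cumulative explained variance ratio reaches a threshold such as 0.95.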

Feature selection methods find a smaller subset of a high-dimensional data set with which to build a data model. The primary strategies for feature selection are filter techniques (based mainly on statistical measures such as the Pearson correlation), wrapper techniques (which use a predictive model to score feature subsets), and embedded techniques (which perform feature selection while building the model and are mainly represented by regularization methods). Since feature extraction techniques change the original feature representation of the data and, consequently, limit interpretability, feature selection is often an attractive alternative.
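
A minimal sketch of one filter technique and one embedded technique, assuming scikit-learn and a synthetic regression data set (the choice of scorer, k, and regularization strength are all illustrative assumptions):

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.feature_selection import SelectKBest, f_regression
    from sklearn.linear_model import Lasso

    # Synthetic regression problem: 20 features, only 5 informative
    X, y = make_regression(n_samples=200, n_features=20,
                           n_informative=5, random_state=0)

    # Filter technique: keep the 5 features with the strongest
    # univariate statistical association with the target
    X_filtered = SelectKBest(f_regression, k=5).fit_transform(X, y)
    print(X_filtered.shape)  # (200, 5)

    # Embedded technique: L1 regularization (Lasso) drives the
    # coefficients of irrelevant features to exactly zero
    lasso = Lasso(alpha=1.0).fit(X, y)
    print(np.flatnonzero(lasso.coef_))  # indices of features that survived

Unlike feature extraction, both approaches return a subset of the original columns, so the selected features keep their original meaning.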

 