# Background Concepts

The steganalysis space has hitherto been dominated by SVMs as the machine learning tool of choice for classification problems. Interestingly, SVMs are inherently binary classifiers, but researchers have adapted them to multi-class classification by constructing architectures that combine multiple binary SVMs and aggregate their results for the overall multi-class task, relying on the kernel trick for nonlinear classification. Another inherent limitation of SVMs is their slow training. ELMs are newer machine learning tools that have been successfully applied to various classification and regression tasks. ELMs are intrinsically fast and require little human intervention; kernel ELMs further remove the dependence on random input weights and biases.

The performance of any machine learning algorithm depends equally on the feature set used. For universal (as opposed to targeted) steganalysis, the characteristics of the feature set play an important role. The SPAM feature set effectively captures the statistical relationships and inter-pixel dependencies of stego images and has been widely used for steganalysis of spatial-domain embedding algorithms. In the following paragraphs, we present some background concepts related to ELM, kernel ELM, and the SPAM feature set.

## Extreme Learning Machine

ELM (Huang et al. 2006) is a neural network architecture with a two-stage learning process. The architecture of ELM is summarized in Figure 10.1.

Assuming an image dataset consisting of $N$ image instances with a $d$-dimensional feature set, the input space to a typical ELM is $(X_i, T_i)$, where $X \in R^{N \times d}$ and the target classes $T \in R^{N \times c}$, with $c$ the number of target classes.

The output function of a single-hidden-layer feed-forward ELM network with $L$ hidden-layer neurons and activation function $g(x)$ can be represented as in Eq. (10.1):

$$\sum_{i=1}^{L} \beta_i \, g(w_i \cdot x_j + b_i) = o_j, \qquad j = 1, \dots, N \tag{10.1}$$

*Figure 10.1* ELM architecture.

where $w_i = [w_{i1}, w_{i2}, \dots, w_{id}]^T$ is the weight vector connecting the input layer with the $i$th hidden-layer neuron; $\beta_i = [\beta_{i1}, \beta_{i2}, \dots, \beta_{ic}]^T$ is the weight vector connecting the $i$th hidden-layer neuron with the output $o_j$; and $b_i$ is the threshold of the $i$th hidden-layer neuron. The input weights $w_i$ and biases $b_i$ are randomly initialized.

The standard ELM with $L$ hidden-layer neurons and activation function $g(x)$ can approximate the $N$ instances with zero error, which can be written compactly as in Eq. (10.2):

$$H\beta = T \tag{10.2}$$

where $H$ is the $N \times L$ hidden-layer output matrix with entries $H_{ji} = g(w_i \cdot x_j + b_i)$.

The output weights $\beta$ can be analytically determined by finding the unique smallest-norm least-squares solution of this linear system, Eq. (10.3):

$$\hat{\beta} = H^{\dagger} T \tag{10.3}$$

where $H^{\dagger}$ is the Moore–Penrose generalized inverse of the matrix $H$, which equals $(H^T H)^{-1} H^T$ when $H^T H$ is non-singular. Huang et al. (2006) proposed to add a positive value $1/C$ (where $C$ is a user-defined and optimized hyper-parameter) to the diagonal in the calculation of the output weights $\beta$, such that

$$\beta = H^T \left( \frac{I}{C} + H H^T \right)^{-1} T \tag{10.4}$$
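To make the two training stages concrete, the following is a minimal NumPy sketch of ELM training and prediction along the lines of Eqs. (10.1)–(10.4). The hidden-layer size, the `tanh` activation, and the toy data are illustrative assumptions, not choices made by the authors.

```python
import numpy as np

rng = np.random.default_rng(0)

def elm_train(X, T, L=100, C=1.0):
    """Train a single-hidden-layer ELM (sketch).

    X: (N, d) feature matrix; T: (N, c) one-hot target matrix.
    Stage 1: randomly initialize input weights and biases.
    Stage 2: solve for output weights analytically (regularized, Eq. 10.4).
    """
    N, d = X.shape
    W = rng.standard_normal((d, L))   # random input weights w_i
    b = rng.standard_normal(L)        # random biases b_i
    H = np.tanh(X @ W + b)            # hidden-layer output matrix, g = tanh
    # beta = H^T (I/C + H H^T)^{-1} T
    beta = H.T @ np.linalg.solve(np.eye(N) / C + H @ H.T, T)
    return W, b, beta

def elm_predict(X, W, b, beta):
    """Output function o = g(X W + b) beta (Eq. 10.1 in matrix form)."""
    return np.tanh(X @ W + b) @ beta
```

Note that, unlike iterative back-propagation, the only "training" is one linear solve, which is the source of ELM's speed.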

## Kernel ELM

Huang et al. (2012) proposed to substitute a kernel function for the inner product by applying the Mercer condition to ELM. The resulting kernel matrix can be represented as in Eq. (10.5):

$$\Omega_{\mathrm{ELM}} = H H^T: \qquad \Omega_{\mathrm{ELM}\,i,j} = h(x_i) \cdot h(x_j) = k(x_i, x_j) \tag{10.5}$$

where $k(x_i, x_j)$ is a kernel function. The output function can then be written as in Eq. (10.6):

$$f(x) = \left[ k(x, x_1), \dots, k(x, x_N) \right] \left( \frac{I}{C} + \Omega_{\mathrm{ELM}} \right)^{-1} T \tag{10.6}$$

Different kernel functions, such as polynomial, RBF, and wavelet functions, can be used with kernel-based ELM.
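The kernel formulation can be sketched in a few lines of NumPy. This is a minimal illustration of Eqs. (10.5)–(10.6) with a Gaussian RBF kernel; the parameter values and toy data are assumptions for demonstration only.

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.1):
    """Gaussian RBF kernel k(u, v) = exp(-gamma * ||u - v||^2)."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * d2)

def kelm_train(X, T, C=1.0, gamma=0.1):
    """Solve alpha = (I/C + Omega_ELM)^{-1} T, with Omega_ij = k(x_i, x_j)."""
    Omega = rbf_kernel(X, X, gamma)
    N = X.shape[0]
    return np.linalg.solve(np.eye(N) / C + Omega, T)

def kelm_predict(Xtest, Xtrain, alpha, gamma=0.1):
    """Output function f(x) = [k(x, x_1), ..., k(x, x_N)] alpha (Eq. 10.6)."""
    return rbf_kernel(Xtest, Xtrain, gamma) @ alpha
```

Because the hidden-layer mapping $h(\cdot)$ only ever appears through inner products, it never needs to be computed explicitly, so the random input weights and biases of the basic ELM disappear entirely.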

## SPAM (Subtractive Pixel Adjacency Matrix) Feature Set

This feature set was proposed by Pevny et al. (2010). The local dependencies between the differences of neighboring pixel values of an image are modeled as first-order and second-order Markov chains. The sample probability transition matrices computed from these chains constitute the feature space. The feature extraction steps are as follows:

1. Calculate the difference array $D$ for all eight directions, i.e., two along each of the horizontal, vertical, major-diagonal, and minor-diagonal directions. For example, for the horizontal left-to-right direction, $D^{\rightarrow}_{i,j} = I_{i,j} - I_{i,j+1}$, where $I$ is the image pixel array.

2. Model the first-order and second-order SPAM features by a Markov process. For the left-to-right direction, the first-order transition matrix is

$$M^{\rightarrow}_{x,y} = \Pr\left( D^{\rightarrow}_{i,j+1} = x \mid D^{\rightarrow}_{i,j} = y \right)$$

and the second-order transition matrix is

$$M^{\rightarrow}_{x,y,z} = \Pr\left( D^{\rightarrow}_{i,j+2} = x \mid D^{\rightarrow}_{i,j+1} = y,\ D^{\rightarrow}_{i,j} = z \right)$$

where $x, y, z \in \{-T, \dots, T\}$ and the differences are truncated to this range.

3. The horizontal and vertical matrices are averaged together, and the two diagonal matrices are averaged together separately, to obtain the final features $F^{1st}$ and $F^{2nd}$.

The authors showed that the value of $T$ must be chosen to trade off classification accuracy against computational complexity. They further demonstrated that increasing $T$ for the first-order features does not necessarily improve the performance of a steganalyzer, whereas the second-order features extracted with $T = 3$ exhibit much improved performance despite their increased dimensionality of 686.
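As a concrete sketch of steps 1–2 for a single direction, the following computes first-order SPAM-style features in NumPy. The restriction to one direction, the choice $T = 4$, and the toy image are illustrative simplifications; the full feature set repeats this for all eight directions and applies the averaging of step 3.

```python
import numpy as np

def spam_first_order(img, T=4):
    """First-order SPAM-style features for the horizontal left-to-right
    direction: truncated pixel differences modeled as a Markov chain."""
    D = img[:, :-1].astype(int) - img[:, 1:].astype(int)  # difference array
    D = np.clip(D, -T, T)                                 # truncate to [-T, T]
    # Sample transition matrix M[x, y] ~ Pr(D_{i,j+1} = x | D_{i,j} = y)
    M = np.zeros((2 * T + 1, 2 * T + 1))
    prev = D[:, :-1].ravel() + T   # shift values into index range [0, 2T]
    nxt = D[:, 1:].ravel() + T
    np.add.at(M, (nxt, prev), 1)   # count transitions y -> x
    col = M.sum(axis=0, keepdims=True)
    M = np.divide(M, col, out=np.zeros_like(M), where=col > 0)
    return M.ravel()               # (2T+1)^2 features for this direction
```

With $T = 4$ this yields $(2T+1)^2 = 81$ features per direction for the first-order model, which is why the second-order model with $T = 3$ grows to 686 features after averaging.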

# Proposed Methodology

The methodology used in this paper is presented in Figure 10.2.

The process of steganographic algorithm identification involves the following steps:

a. *Feature set extraction*: The SPAM features of the image set are extracted.

*Figure 10.2* Multi-class classification using KELM.

b. *Feature pre-processing:* The extracted features are scaled to the range [0, 1].
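A minimal sketch of this scaling step is shown below; the function names are illustrative. The scaling parameters should be learned from the training set only and then re-applied to the test set, so the classifier never sees test statistics.

```python
import numpy as np

def fit_minmax(Xtrain):
    """Learn per-feature minimum and span from the training set only."""
    lo, hi = Xtrain.min(axis=0), Xtrain.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # guard constant features
    return lo, span

def apply_minmax(X, lo, span):
    """Map features into [0, 1] using the training-set statistics."""
    return (X - lo) / span
```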

c. *KELM classifier training:* The KELM is trained using the cover and stego images of the training set.

d. *Hyper-parameter optimization:* The hyper-parameters are searched using the grid-search method, with 5-fold cross-validation used for the optimization. Here, the Gaussian radial basis function is used as the kernel, given by Eq. (10.10):

$$k(u, v) = \exp\left( -\gamma \, \| u - v \|^2 \right) \tag{10.10}$$

The two main parameters to be optimized are the penalty parameter $C$ in Eq. (10.4) and the kernel bandwidth $\gamma$ in Eq. (10.10). The parameter $C$ determines the trade-off between model complexity and minimization of the fitting error, while $\gamma$ defines the nonlinear mapping from the input space to the high-dimensional feature space.
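A minimal sketch of this grid search with 5-fold cross-validation is shown below. The candidate grids, the fold helper, and the `score_fn` callback (which would wrap KELM training and evaluation) are assumed names for illustration, not the authors' implementation.

```python
import numpy as np
from itertools import product

def kfold_indices(N, k=5, seed=0):
    """Shuffle sample indices and split them into k folds."""
    idx = np.random.default_rng(seed).permutation(N)
    return np.array_split(idx, k)

def grid_search(X, y, score_fn, Cs, gammas, k=5):
    """Exhaustive search over (C, gamma), scored by mean k-fold CV accuracy.

    score_fn(Xtr, ytr, Xte, yte, C, gamma) -> accuracy on the held-out fold.
    Returns ((best_C, best_gamma), best_mean_accuracy).
    """
    folds = kfold_indices(len(X), k)
    best = (None, -1.0)
    for C, g in product(Cs, gammas):
        accs = []
        for i in range(k):
            te = folds[i]
            tr = np.concatenate([folds[j] for j in range(k) if j != i])
            accs.append(score_fn(X[tr], y[tr], X[te], y[te], C, g))
        mean = float(np.mean(accs))
        if mean > best[1]:
            best = ((C, g), mean)
    return best
```

In practice the grids for $C$ and $\gamma$ are usually logarithmically spaced (e.g. powers of two), and the winning pair is then used to retrain the classifier on the full training set.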

e. *Test set classification:* Using the trained classifier’s model, the unseen test images are classified as either cover or one of the stego classes, thus identifying the embedding algorithm.