Feature extraction is a significant step in any recognition process. From the full set of features, the distinctive ones that drive accurate recognition of the speaker and the speech must be selected. Feature extraction also reduces the dimensionality of the feature vector. Although speaker recognition and SR systems are different, their feature extraction methods overlap. Various feature extraction algorithms are investigated, including perceptual linear prediction (PLP), Mel-frequency cepstral coefficients (MFCC), relative spectra perceptual linear predictive (RASTA-PLP), and linear predictive coding coefficients (LPCC).

The speech signal is not stationary, so it must be preprocessed before the feature extraction step. Preprocessing prepares the signal for feature extraction by removing the portions that carry no information, that is, silence. After silence removal, the speech signal is shorter, so it needs less memory for storage and less time for processing. Silence is removed by measuring the energy of the signal and discarding the parts whose energy falls below a chosen threshold. The formula for calculating energy is

$$E = \sum_{n} x^{2}(n)$$

where x(n) is the speech signal.
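As a rough illustration of energy-based silence removal, the sketch below splits a signal into short frames and keeps only the frames whose energy exceeds a threshold. The function names, the frame length, and the threshold value are assumptions chosen for this example; the chapter does not specify them.

```python
def frame_energy(frame):
    """Energy of a frame: the sum of its squared samples."""
    return sum(s * s for s in frame)


def remove_silence(signal, frame_len=4, threshold=0.01):
    """Keep only the frames whose energy exceeds the threshold.

    frame_len and threshold are illustrative values, not taken from the text.
    """
    kept = []
    for start in range(0, len(signal), frame_len):
        frame = signal[start:start + frame_len]
        if frame_energy(frame) > threshold:
            kept.extend(frame)
    return kept


# Toy signal: near-silence, a short burst, then near-silence again.
signal = [0.0, 0.01, 0.0, -0.01, 0.5, -0.4, 0.6, -0.5, 0.0, 0.0, 0.01, 0.0]
trimmed = remove_silence(signal)
```

Only the middle frame survives here, because the two outer frames have energies far below the threshold. In practice the threshold would be tuned to the recording conditions.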

To improve the signal-to-noise ratio, the energy at higher frequencies is boosted with a high-pass (pre-emphasis) filter. This is done because high frequencies carry less energy than low frequencies. The speech signal varies over time. Over a short interval, however, it may be treated as stationary, because phonemes are assumed to be stationary for periods of 5-100 ms. The signal is therefore split into short frames, and windowing is then applied to deal with the loss of data at the frame edges caused by framing. The samples in one frame of size P ms are multiplied by a window of size R ms. P should be less than R, which is why adjacent frames overlap. Many window types are available, such as Hanning, Chebyshev, Hamming, and Kaiser. There is no strict criterion for choosing the window type, except that a soft (tapered) window is used to avoid abrupt changes in signal characteristics.
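The preprocessing chain described above (high-frequency boost, framing, windowing) can be sketched as follows. This is a minimal example under stated assumptions: the pre-emphasis coefficient of 0.97 is a commonly used value rather than one given in the text, the Hamming window is just one of the listed choices, and the names `frame_len` and `hop` are introduced here for the window length and the frame step in samples.

```python
import math


def pre_emphasis(signal, alpha=0.97):
    """First-order high-pass filter: y(n) = x(n) - alpha * x(n - 1)."""
    if not signal:
        return []
    return [signal[0]] + [signal[n] - alpha * signal[n - 1]
                          for n in range(1, len(signal))]


def hamming(length):
    """Hamming window: w(n) = 0.54 - 0.46 * cos(2 * pi * n / (length - 1))."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1))
            for n in range(length)]


def windowed_frames(signal, frame_len, hop):
    """Split the signal into overlapping frames (hop < frame_len)
    and multiply each frame by the window, sample by sample."""
    window = hamming(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        frames.append([s * w for s, w in zip(frame, window)])
    return frames


# 16 samples, frames of 8 samples advancing by 4 samples: 50% overlap.
samples = [math.sin(0.3 * n) for n in range(16)]
frames = windowed_frames(pre_emphasis(samples), frame_len=8, hop=4)
```

Because the step is smaller than the window length, consecutive frames share samples, so data attenuated at the tapered edge of one frame sits near the centre of the next.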

Noisy surroundings have a strong impact on SR systems, and many algorithms have been proposed to deal with this problem [36]. Rabiner and Sambur [45] implemented a system in which the zero-crossing rate (ZCR) of the signal is calculated to measure the noise, together with the energy of the signal. However, the accuracy decreased when this system was applied in a real-time scenario, and the processing was too slow. A fast and efficient algorithm was therefore required. Hopper and Adhami [25] tried a system based on a digital signal processing algorithm, the fast Fourier transform (FFT). This algorithm achieved excellent recognition rates in noisy surroundings. Many authors have tried time-domain features together with frequency-domain features. In [48], voiced and unvoiced regions were detected using energy and ZCR, after which features of the speech signal were extracted using FFT. Various feature extraction techniques used by multiple investigators are as follows.
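To make the energy and ZCR measurements concrete, the sketch below computes both for a frame and applies a toy decision rule. It is not the method of any of the cited papers: the rule, the thresholds, and the function names are assumptions for illustration, based only on the common observation that voiced speech tends to have high energy and few zero crossings, while unvoiced speech has many.

```python
def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ.

    Assumes the frame has at least two samples.
    """
    crossings = sum(1 for a, b in zip(frame, frame[1:])
                    if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def short_time_energy(frame):
    """Sum of squared samples in the frame."""
    return sum(s * s for s in frame)


def classify_frame(frame, energy_thresh=0.1, zcr_thresh=0.5):
    """Toy rule with illustrative thresholds:
    low energy -> silence, otherwise low ZCR -> voiced, high ZCR -> unvoiced."""
    if short_time_energy(frame) < energy_thresh:
        return "silence"
    if zero_crossing_rate(frame) < zcr_thresh:
        return "voiced"
    return "unvoiced"


voiced_like = [0.5, 0.6, 0.7, 0.6, 0.5, -0.5, -0.6, -0.7]
unvoiced_like = [0.3, -0.3, 0.3, -0.3, 0.3, -0.3]
quiet = [0.0, 0.001, -0.001, 0.0]
labels = [classify_frame(f) for f in (voiced_like, unvoiced_like, quiet)]
```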


Spectral magnitude can be analyzed using linear predictive coding coefficients. This technique generates vocal tract coefficients and was among the most popular of the early algorithms. In it, the speech sample at the current time is treated as a linear combination of past speech samples. Many researchers have implemented the LPCC feature extraction technique in their systems [27, 28]. The number of coefficients can be reduced with another method known as wavelet decomposition [43, 50].
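The idea of predicting the current sample from past samples can be illustrated with the autocorrelation method and the Levinson-Durbin recursion, followed by the usual recursion that converts predictor coefficients into cepstral coefficients. This is a sketch of one standard way to obtain LPCC-style features, not necessarily the procedure used in the cited systems; the function names are chosen here, and the frame is assumed to be non-silent so that its zero-lag autocorrelation is positive.

```python
def autocorrelation(frame, max_lag):
    """r(lag) = sum over i of x(i) * x(i + lag), for lag = 0..max_lag."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def lpc(frame, order):
    """Levinson-Durbin recursion. Returns (coeffs, error), where
    x(n) is approximated by sum over k of coeffs[k - 1] * x(n - k)."""
    r = autocorrelation(frame, order)
    a = [0.0] * (order + 1)
    error = r[0]
    for i in range(1, order + 1):
        # Reflection coefficient for the current model order.
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / error
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        error *= 1.0 - k * k
    return a[1:], error


def lpc_to_cepstrum(coeffs, n_ceps):
    """c(n) = a(n) + sum_{k=1}^{n-1} (k / n) * c(k) * a(n - k),
    with a(n) = 0 for n greater than the predictor order."""
    p = len(coeffs)
    c = [0.0] * (n_ceps + 1)
    for n in range(1, n_ceps + 1):
        acc = coeffs[n - 1] if n <= p else 0.0
        for k in range(1, n):
            if n - k <= p:
                acc += (k / n) * c[k] * coeffs[n - k - 1]
        c[n] = acc
    return c[1:]


# A decaying sequence x(n) = 0.5 ** n is predicted almost exactly by 0.5 * x(n - 1).
decay = [0.5 ** n for n in range(20)]
coeffs, residual = lpc(decay, order=1)
cepstrum = lpc_to_cepstrum(coeffs, n_ceps=3)
```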


MFCC is a well-accepted method for feature extraction. First, the preprocessed speech signal is converted into the frequency domain using FFT to reveal its frequency content. The FFT output contains a lot of data that are not required, because at higher frequencies the human ear does not distinguish between closely spaced frequencies. The Mel scale reflects the way humans hear: it is linear up to 1000 Hz and logarithmic above it. Therefore, to calculate the energy level in each frequency band, Mel-scale analysis is done using Mel filters, and the energy in each filter is calculated. After that, the logarithm of the filter bank energies is taken, which matches the features more closely to human hearing.
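The Mel mapping can be illustrated with one widely used conversion formula, m = 2595 log10(1 + f / 700). The chapter does not state which formula it uses, so this particular form, the function names, and the 0-4000 Hz range in the example are assumptions. The sketch places filter centres at equal steps on the Mel scale and converts them back to Hz.

```python
import math


def hz_to_mel(freq_hz):
    """Common Mel formula: m = 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_center_frequencies(n_filters, f_low, f_high):
    """Centre frequencies (Hz) of n_filters filters spaced evenly in Mel
    between f_low and f_high (the two band edges are excluded)."""
    mel_low, mel_high = hz_to_mel(f_low), hz_to_mel(f_high)
    step = (mel_high - mel_low) / (n_filters + 1)
    return [mel_to_hz(mel_low + step * (i + 1)) for i in range(n_filters)]


centers = mel_center_frequencies(n_filters=10, f_low=0.0, f_high=4000.0)
gaps = [hi - lo for lo, hi in zip(centers, centers[1:])]
```

The gaps between consecutive centres grow with frequency, so the filters are narrow and dense at low frequencies and wide and sparse at high frequencies, which is the behaviour the paragraph above describes.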

Finally, the discrete cosine transform of the log filter bank energies is taken to decorrelate the overlapping energies. The first 13 coefficients are selected as the MFCC features, because higher-order coefficients degrade the recognition accuracy of the system and do not carry speaker- or speech-related information. Many researchers have implemented the MFCC method to improve the recognition rate [56, 60]. Experiments have also combined different feature extraction techniques. Trang et al. [54] described a new feature set that combines principal component analysis with MFCC.
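The last two steps, the logarithm and the discrete cosine transform, can be sketched as below. The example assumes an unnormalized type-II DCT and a bank of 26 filters; neither detail is given in the text, and the input energies are made-up positive numbers standing in for real Mel filter outputs.

```python
import math


def dct_ii(values):
    """Unnormalized DCT-II: c(k) = sum_i v(i) * cos(pi * k * (i + 0.5) / M)."""
    m = len(values)
    return [sum(v * math.cos(math.pi * k * (i + 0.5) / m)
                for i, v in enumerate(values))
            for k in range(m)]


def mfcc_from_filterbank(energies, n_coeffs=13):
    """Log of the filter bank energies, then DCT, keeping the first n_coeffs.

    All energies must be strictly positive for the logarithm to be defined.
    """
    log_energies = [math.log(e) for e in energies]
    return dct_ii(log_energies)[:n_coeffs]


# Hypothetical energies from a 26-filter bank (illustrative values only).
energies = [1.0 + 0.5 * math.cos(0.4 * i) for i in range(26)]
mfcc = mfcc_from_filterbank(energies)
```

A perfectly flat set of energies would put everything into coefficient 0 and leave the others at zero, which shows how the transform separates overall level from spectral shape.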


In 1990, Hermansky [18] developed the PLP technique, which was based on perceptual predictive coefficients. This method was modeled on the human hearing system to create specific features for recognition. Like the LPC technique, it works on the spectrum of the signal. The spectrum is transformed according to psychophysical findings, and the frequency scale of the signal is modified. PLP differs from MFCC in its spectral smoothing, which is done by linear prediction. The procedure consists of critical-band spectral resolution, followed by the equal-loudness curve and then the intensity-loudness power law. An all-pole model is used to approximate the auditory spectrum. The PLP technique is the only technique in which speaker-dependent features can be suppressed using a fifth-order all-pole model. Many results showed that PLP is much more useful than LPCC analysis, and PLP is quite popular for implementing speaker-independent systems.


Noise has a very severe effect on recognition systems. To suppress it, a bandpass filter can be added to the PLP technique. Hermansky et al. [17] implemented this technique, and many researchers have worked in this domain by combining PLP features with RASTA processing [19, 20]. The advantage of this technique is that it gives good results when conditions change unpredictably. The work focused on convolutional noise in the communication channel, with corrections made in the log spectral domain. The experimental analysis was performed on speech corrupted with convolutional noise. The results showed an order-of-magnitude improvement in error rate compared with the LPC or PLP technique. Moreover, the results were consistent across different databases and different recognition techniques. Hirsch et al. [22] implemented a filtering method in the power spectral domain, which reduced the background noise.


The database was created by four speakers, two females and two males (30-40 years old). The selected words are one, two, three, four, and five. Each speaker provided 100 samples of each word, giving a 2000-word database. MATLAB is used to implement the system. The features calculated with all the techniques mentioned above are stored as reference patterns. Then, in the testing phase, the uttered word is matched against the reference patterns using the dynamic time warping (DTW) algorithm, which calculates the distance between the uttered word and each reference pattern. Let the two speech patterns A and B, of length n and m, be

$$A = a_1, a_2, \ldots, a_n, \qquad B = b_1, b_2, \ldots, b_m$$
and the whole distance can be shown as

The reference pattern with the minimum distance is taken as the match for the uttered word. Because of variations between speech samples, the distance is nonzero even for a correct match.
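The matching step can be sketched with the textbook DTW recurrence, D(i, j) = d(a_i, b_j) + min(D(i-1, j), D(i, j-1), D(i-1, j-1)). This is the standard formulation and may differ in detail from the chapter's own formula. For brevity the sketch uses scalar features with an absolute-difference local cost, whereas real feature vectors would need a vector distance; the function names and the toy patterns are hypothetical.

```python
def dtw_distance(a, b):
    """Accumulated DTW distance between two sequences of scalar features."""
    n, m = len(a), len(b)
    inf = float("inf")
    # d[i][j] is the best accumulated cost of aligning a[:i] with b[:j].
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]


def recognize(utterance, references):
    """Return the label of the reference pattern closest to the utterance."""
    return min(references,
               key=lambda label: dtw_distance(utterance, references[label]))


# Toy reference patterns (illustrative values, not real speech features).
references = {"one": [1.0, 2.0, 3.0], "two": [5.0, 5.0, 5.0]}
best = recognize([1.0, 2.0, 2.0, 3.0], references)
```

The test utterance is a time-stretched copy of the first pattern, so warping aligns the repeated sample at no cost and the first label is selected.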

White Gaussian noise (WGN) is chosen to corrupt the speech samples. It is a basic noise model used for many naturally occurring processes, and its main advantage is that its power is uniformly distributed over all frequencies [53]. SR accuracy is measured with several metrics, such as precision, recall, accuracy, sensitivity, and specificity, defined as follows:

$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \text{Sensitivity} = \frac{TP}{TP + FN}$$

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \text{Specificity} = \frac{TN}{TN + FP}$$

where true positive (TP) represents correctly selected features, false positive (FP) represents falsely selected features, true negative (TN) represents correctly rejected features, and false negative (FN) represents falsely rejected features. Table 13.1 shows the results of the different implemented techniques.
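The metrics can be computed from the four counts as in the sketch below, which uses their standard definitions. The counts in the example are hypothetical and are not results from this chapter.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard metrics from true/false positive and negative counts."""
    return {
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),  # same quantity as recall
        "specificity": tn / (tn + fp),
    }


# Hypothetical counts, for illustration only.
metrics = classification_metrics(tp=90, fp=10, tn=80, fn=20)
```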

TABLE 13.1 Average Recognition Accuracy for Different Techniques

Average Recognition Rate (%) by Technique

Type of Speech Samples                            LPCC     PLP      RASTA-PLP   MFCC
Clean Speech Samples                              93.12    93.17    93.16       94.25
Speech Samples with White Gaussian Noise (WGN)    73.54    73.12    73.98       73.98
The results show that, for clean speech signals, the average recognition rate is 93.12% using the LPCC technique, 93.17% using the PLP technique, 93.16% using the RASTA-PLP technique, and 94.25% using the MFCC technique. The recognition rate decreases when the speech samples are corrupted with WGN: the average rate falls to 73.54% using the LPCC technique, 73.12% using the PLP technique, 73.98% using the RASTA-PLP technique, and 73.98% using the MFCC technique. The results indicate that the MFCC technique worked well in both clean and noisy environments.
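The corruption of clean samples with WGN for the noisy condition could be reproduced along the following lines. The chapter does not report the noise level it used, so the 10 dB signal-to-noise ratio here is an assumption, as are the function name and the fixed random seed (used only to make the example repeatable).

```python
import math
import random


def add_white_gaussian_noise(signal, snr_db, seed=0):
    """Add zero-mean Gaussian noise scaled to a target SNR in dB.

    The noise power is set to signal_power / 10 ** (snr_db / 10).
    """
    rng = random.Random(seed)
    signal_power = sum(s * s for s in signal) / len(signal)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    sigma = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, sigma) for s in signal]


# Illustrative clean signal: a sampled sinusoid.
clean = [math.sin(0.1 * n) for n in range(20000)]
noisy = add_white_gaussian_noise(clean, snr_db=10.0)
```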
