The speech signal conveys a great deal of information. At the first level, it carries the message itself; at the second level, it gives an idea about the speaker. Speech production is a very complex phenomenon comprising many levels of processing. First, message planning is done in the mind, and then language coding is performed. After coding, the neuromuscular command is generated, and finally the sound is produced through the vocal cords. Every human's speech differs from every other's because of various parameters: linguistic (lexical, syntactic, semantic, and pragmatic), paralinguistic (intentional, attitudinal, and stylistic), and nonlinguistic (physical and emotional). Therefore, the speech signal contains different segmental and suprasegmental features, which can be extracted for SPR as well as SR [31].


It is essential to know the individual speech production model when developing a speech model, as one can then extract speech features more accurately. A physiological model of human speech production is shown in Figure 13.6. The glottis is the opening of the vocal tract. The vocal tract is a cavity with two openings: the nostrils and the lips. The velum is an articulatory organ that moves up and down and is responsible for radiating sound either through the nose or through the lips. The flow of air is controlled by the vocal cords, which open and close accordingly. Effectively, a pulse-like excitation is applied to the vocal tract. The vocal tract acts as a filter: the pulses fed to it undergo spectral shaping, and this shaping varies from time to time depending on the words being uttered. The speech signal is a quasi-periodic waveform; that is, its statistical parameters remain nearly constant within a short time interval on the order of 10 to 20 ms. Therefore, short-time analysis of speech signals is required. Different sounds are produced with the help of articulatory organs such as the velum. When the velum moves downward, the oral passage from the vocal tract region to the lips is blocked, and hence nasal sounds are emitted. When the velum moves upward, the nostril passage is closed, and voiced sounds are produced through the lips. The tongue is another organ responsible for different sounds [10].
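The 10- to 20-ms stationarity assumption above is what motivates frame-based processing. A minimal sketch of short-time framing in Python (the frame and hop lengths and the sampling rate are illustrative choices, not taken from the text):

```python
import numpy as np

def frame_signal(signal, sample_rate, frame_ms=20, hop_ms=10):
    """Split a speech signal into short overlapping frames.

    Speech is quasi-periodic, so its statistics are roughly stationary
    over 10-20 ms; analysis is therefore done frame by frame.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop_len)
    return np.stack([signal[i * hop_len : i * hop_len + frame_len]
                     for i in range(n_frames)])

# 1 second of audio at 8 kHz -> 20 ms frames (160 samples) with a 10 ms hop
x = np.zeros(8000)
frames = frame_signal(x, 8000)
print(frames.shape)  # (99, 160)
```

Each row of the result is one analysis frame; features such as energy or zero-crossing rate are then computed per frame.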

Voiced speech is produced when the vocal cords vibrate during articulation, for example, /a/, /e/, /i/, /o/, /u/, whereas unvoiced speech is produced when there is no vibration of the vocal cords, for example, /t/, /p/ [11]. The human speech production system is modeled by a source-filter model: the source is the excitation generated by the glottal pulses, while the vocal tract and the effects of radiation are represented by a time-varying linear system. The actual speech spectrum is not uniform; some frequency components dominate others. This can be explained with the help of a cylindrical tube model that is open at one end (the lip end), in which resonances occur, creating nodes and antinodes. If the length of the tube is L = 17 cm (the distance between the larynx and the lips) and the velocity of sound in air is 340 m/s, then the first resonance frequency comes out to be 500 Hz, the second resonance frequency is 1500 Hz, and the third resonance frequency is 2500 Hz. These frequencies, called formant frequencies, vary within some range from person to person. Therefore, formants are one of the speech features that can be extracted from the speech signal.

Speech perception involves detecting sounds through the ear. The ear consists of an outer part, a middle part, and an inner part. The middle part of the ear consists of three bones, and the inner part consists of a snail-like structure known as the cochlea. The cochlea consists of fluid-filled chambers separated by the basilar membrane, as shown in Figure 13.7.


FIGURE 13.7 Human ears.

Sound waves are received by the ear canal and impinge on the eardrum. The eardrum vibrates, and these vibrations are passed on to the bones and then to the cochlea, where they set up resonances in the inner ear. The membranes are connected to the nerves, and thus the signal is transmitted to the brain. The audible range of human hearing is from 20 Hz to 20 kHz. The size of the cochlea is around 30 mm. The resolution of low-frequency sounds is much better than that of high-frequency sounds because different regions of the basilar membrane are sensitive to different frequencies. This frequency sensitivity is not linear: it is approximately linear at low frequencies and logarithmic at higher frequencies, and it is modeled by perceptual scales such as the Bark and Mel scales. In conclusion, the brain receives much more information about the low-frequency components of speech [35].
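One widely used formula for the Mel scale mentioned above maps frequency in Hz to a perceptual pitch value that grows roughly linearly at low frequencies and logarithmically above about 1 kHz (the constants 2595 and 700 are from the common O'Shaughnessy formulation, not from this text):

```python
import math

def hz_to_mel(f_hz):
    """Convert frequency in Hz to the Mel scale.

    Roughly linear at low frequencies, logarithmic above ~1 kHz;
    by construction, 1000 Hz maps to approximately 1000 mels.
    """
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

for f in (100, 500, 1000, 4000, 8000):
    print(f, "Hz ->", round(hz_to_mel(f), 1), "mel")
```

Note how equal steps in Hz correspond to ever-smaller steps in mels at high frequencies, mirroring the ear's coarser resolution there.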


The speech waveform has many characteristics such as loudness, voiced/unvoiced quality, fundamental frequency, pitch, spectral envelope, formants, etc. Speech features can be classified as physical and perceptual. Physical characteristics depend upon physical properties of the speech signal such as power, fundamental frequency, etc. The amplitude of the speech signal determines its power: more power means a louder signal. Silence zones in the speech signal can be detected using power measurement. Another physical feature is the fundamental frequency. The range of fundamental frequency for females is 165-255 Hz and for males is 85-155 Hz, so male and female voices can be separated using this attribute [26]. Pitch and prosody are perceptual features; pitch analysis can reveal the emotions of the speaker. Speech features are extracted by mathematical analysis of the signal, which can be done in the time domain as well as in the frequency domain. Various speech features, such as the energy and amplitude of the speech signal, are analyzed in the time domain using the zero-crossing rate (ZCR) and autocorrelation (AC) [16]. Time-domain analysis involves relatively simple processing, whereas the frequency domain requires methods with extensive calculations such as the Fourier transform, spectrograms, filter banks, cepstral analysis, etc. Spoken words are not exactly equal to what we write because of background noise, body language, channel variability, speaker variability, speaking style, dialects, etc. All of these factors affect the accuracy of SR.
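The two time-domain tools named above, ZCR and autocorrelation, can be sketched in a few lines. This is an illustrative implementation, not the one from the cited work: ZCR is computed as the fraction of sign changes between adjacent samples, and the fundamental frequency is taken from the autocorrelation peak restricted to the 85-255 Hz pitch range quoted in the text. The synthetic 120 Hz sine stands in for a voiced frame:

```python
import numpy as np

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ.

    Voiced speech (low frequency) has a low ZCR; unvoiced/fricative
    sounds have a much higher ZCR.
    """
    signs = np.sign(frame)
    return float(np.mean(signs[:-1] != signs[1:]))

def pitch_autocorr(frame, sample_rate, fmin=85.0, fmax=255.0):
    """Estimate fundamental frequency from the autocorrelation peak,
    searching only lags inside the human pitch range (85-255 Hz)."""
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sample_rate / fmax), int(sample_rate / fmin)
    lag = lo + int(np.argmax(ac[lo:hi]))
    return sample_rate / lag

sr = 8000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 120.0 * t)    # synthetic 120 Hz "voiced" frame

print("ZCR:", zero_crossing_rate(tone))          # low, ~0.03
print("F0 :", pitch_autocorr(tone, sr), "Hz")    # close to 120 Hz
```

An estimated F0 in the 85-155 Hz band would be classified as a male voice under the ranges given above.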


SR is quite easy for humans, but it is not easy for machines because of many factors such as speaking style, environment, and speaker characteristics [26].

  • The recognition rate is high when words are uttered in a definite style.
  • The environment plays a significant role in recognition accuracy: noisy surroundings yield lower recognition rates than clean surroundings.
  • Speaker characteristics, such as emotions, how fast or slowly he/she speaks, accent, and gender, affect the recognition rate.
  • Many difficulties arise when SR systems are built for different languages. Quality depends upon the language model and the applications where the system is to be used.


The work on speech and SPR systems can be divided into four generations. The first generation (1950-1969) focused mainly on phonetic methods. The second generation (1970-1990) worked on matching techniques. The third generation (1990-2000) worked on statistical modeling. Deep learning methods were implemented in the fourth generation (the 2000s onward).

13.5.1 FIRST GENERATION (FROM THE 1950s TO 1969)

Starting from the year 1950, many researchers implemented systems using acoustic-phonetic methods. A digit recognizer system was first implemented by Davis et al. [9] at Bell Laboratories using formant frequencies. Noise and unintentional data in the audio signal were removed using the time constant of the sampling model. This was one of the preferred methods of that era for matching the speaker's word. In this technique, the developed circuit splits the speech spectrum into two frequency bands, that is, below and above 900 Hz. A set of digits was kept in a table, the speaker's words were matched with the predefined words, and matching was performed based on the correlation among the values. In 1952, an SR system was implemented to recognize the digits from zero to nine; however, its accuracy depended upon the system being adjusted to the speaker. In 1956, a system based on phoneme classification was built by Wiren and Stubbs [57]. Unvoiced and voiced speech were separated using pitch frequency, and zero-crossing levels were used for amplitude measurement of the signal. The authors took the idea from the book published by Jakobson et al. [32] in 1952. In 1962, Fry [11] analyzed the role of acoustics in the human sound perception system, and on the basis of that analysis, many authors carried out research on this topic [47].


13.5.2 SECOND GENERATION (FROM 1970 TO 1990)

In 1970, Itakura and Atal [3, 29] showed that vocal tract features could be modeled in terms of time-varying features. The US Department of Defense funded much research in this area from 1971 to 1976, and vocabulary sizes grew beyond 10 words. In 1979, Rabiner et al. [46] implemented a recognition system for 26 isolated words. In the training session of the proposed model, the speaker speaks each word of the vocabulary over a direct dialed-up phone line, and then an AC mechanism is applied to locate the start and end of each word. The system is divided into two modes: mode one comprises recording the signals, and mode two is clustering the words. The clustering was performed using unsupervised clustering without averaging, and the simulation was done on four test sets. The proposed work was observed to be a substantial improvement over an alternative recognizer with a similar vocabulary. In 1976, statistical methods were implemented [33], and many researchers used statistical techniques in their work [4, 34]. Large vocabularies and speaker dependency of recognition accuracy were addressed with mixture-density HMM techniques [30]. After the evolution of neural network systems in 1989, the field received a further boost, and many researchers used algorithms such as the backpropagation neural network to train the data [7, 23, 42]. Fuzzy-based systems can also be combined with neural network systems.

13.5.3 THIRD GENERATION (FROM THE 1990s TO 2000s)

In the era of the 1990s, many techniques were implemented for building recognition systems, but the effect of noise was very prominent, so the focus was on reducing the impact of noise on recognition systems. Therefore, many techniques were combined to reduce noise, such as HMM, neural, and fuzzy techniques [58]. Bourlard et al. [7] extended the work done by Fry [11] by introducing a multilayered neural network. Many advanced neural network-based algorithms, such as the recurrent neural network, were implemented to reduce noise and increase accuracy [24].


13.5.4 FOURTH GENERATION (THE 2000s ONWARD)

In the fourth generation (the 2000s onward), deep-learning-based algorithms were implemented, so multilevel processing of features could be used to increase accuracy. However, the main challenge in this area was how to train the multilevel neural network, and many authors developed training methods. In 2006, Hinton et al. [21] introduced the deep belief network (DBN), in which layerwise training is performed. Results showed that the accuracy rate increased with layerwise training as compared to conventional networks [1, 8, 30, 39]. In 2012, Mohamed et al. [40] implemented a combined technique using a Gaussian mixture model with the DBN. Deep architectures became popular in this era; many researchers have explored the area, and many advances have emerged [13, 44, 51].
