V: Speech and NLP Applications in Cognition

Speech Recognition Fundamentals and Features

GURPREET KAUR1*, MOHIT SRIVASTAVA2, and AMOD KUMAR3

1University Institute of Engineering and Technology, Panjab University, Chandigarh 160025, India

2Chandigarh Engineering College, Landran, Mohali 140307, India
3National Institute of Technical Teachers Training and Research, Chandigarh 160026, India

*Corresponding author.

ABSTRACT

Speech is an essential mode of human communication: natural for humans but artificial for machines. The speech waveform conveys the message being communicated along with various parameters regarding the speaker, such as his/her age group, gender, and emotions. Speech processing comprises different areas according to its applications, such as communication, reproduction and transmission of the speech signal, and recognition of the speaker or the speech. A speech recognition system enables a machine to understand voice. In this chapter, the fundamentals of speech recognition, its features, and the difficulties faced are explored.

INTRODUCTION

Speech is the method by which humans communicate and express emotions. It is very easy for human beings but quite difficult for machines. Because of the large dependency of humans on machines, recognition of speech signals by machines has been a growing area for the last 65 years [12]. Today's need is a recognition system that performs well in any language and under any background noise, trained on a large dataset.

Researchers have developed systems that are available in the market, but the performance of those systems still requires improvement [16,26,35].

Speech processing is a vast area in terms of its purpose and applications. There may be different purposes of speech processing, such as:

  • Understanding of the speech signal as a means of communication.
  • Reproduction and transmission.
  • Automatic recognition of the speech signal.
  • Characteristics of the speaker.
  • Language identification.

Based on the purpose, speech processing has three significant areas:

  • Analysis/synthesis
  • Recognition
  • Coding

Analysis/synthesis of the speech waveform involves characterizing the spectral information of the speech signal for transmission and storage. The recognition field is categorized into speech recognition (SR), speaker recognition (SPR), and language recognition (LR): a system in which only the message is recognized is known as SR, one in which only the speaker is recognized is known as SPR, and one in which only the language is recognized is known as LR. These systems can be developed according to the needs of society. The next area is coding of speech signals, which is an application of data compression to audio signals. The classification of speech processing is shown in Figure 13.1.

SR, SPR, and integrated speaker and speech recognition (ISSR) systems are explained in the following section.

13.1.1 SPEECH RECOGNITION

SR is the field where words are identified from the spoken utterance, that is, what is the person saying? SR systems may be categorized into three types depending upon the speech mode, the speaker mode, and the vocabulary mode. The speech modes are isolated SR, connected SR, and continuous SR. In isolated SR, the speaker has to pause between words. In connected SR, words are spoken in sequence without long pauses. Continuous SR resembles natural speech. When a speaker can use the system without prior training, the system is known as speaker independent; when the speaker trains the system beforehand, it is known as speaker dependent. The vocabulary can be small (fewer than a hundred words), medium (on the order of 1000 words), or large (more than 5000 words). The basic building blocks of the SR system are shown in Figure 13.2.


FIGURE 13.1 Classification of speech processing.

Feature extraction is the process of selecting feature vectors that are helpful for SR while discarding speaker-related information. The acoustic model for SR models the contextual variations, whereas the language model captures linguistic information. The performance of SR systems depends upon the feature extraction and classification methods [5].
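As an illustration of this stage, the following is a minimal sketch of MFCC feature extraction, a widely used front end for SR; the chapter does not prescribe a particular toolkit, and the file name, sampling rate, and coefficient count below are assumptions:

```python
# Minimal MFCC feature-extraction sketch (illustrative only).
# Requires: pip install librosa
import librosa

def extract_mfcc(wav_path, n_mfcc=13):
    """Load a speech file and return an (n_frames, n_mfcc) MFCC matrix."""
    y, sr = librosa.load(wav_path, sr=16000)       # 16 kHz is a common ASR rate
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return mfcc.T                                  # one feature vector per frame

# Hypothetical usage:
# features = extract_mfcc("utterance.wav")         # shape: (n_frames, 13)
```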

13.1.2 SPEAKER RECOGNITION

SPR is the field where the speaker is recognized without consideration of the words, that is, who is speaking? The speech signal carries not only a linguistic message but also speaker-related information in terms of speaking style, emotions, accent, etc. The SPR field is further categorized into speaker verification (SV) and speaker identification (SI). SV confirms whether a person is who she/he claims to be. SI assigns an identity to a person from among all registered persons.

Furthermore, SPR can be text dependent (TD) or text independent (TID). TD systems are less flexible but more accurate because the system is trained on a specific dataset, whereas TID systems are more flexible but more complex because any new word can be spoken into the system. The SPR task is highly dependent upon the cooperation of the user. If SPR is used as a biometric, then, of course, the user will cooperate, but in forensic applications, the user may or may not be cooperative. Therefore, the vital block of such a system is the feature extraction stage. The basic building blocks of the SV system are shown in Figure 13.3 [59].


FIGURE 13.3 Block diagram of the SV system.

In the SV system, feature extraction is used to select speaker-related feature vectors, and speaker models are generated on the basis of training data. In the testing phase, the claimed user's features are matched against the existing models, and according to a threshold criterion, the claim is accepted or rejected. Applications of the SV system include banking transactions, access to restricted areas, attendance recording systems, etc. There is only a small difference between SI and SV systems: in an SI system, there may be more than two speaker models, and the identity assigned to the claimed speaker is chosen from among the registered speakers. The block diagram of the SI system is shown in Figure 13.4 [16].
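As an illustration of this accept/reject step, here is a minimal sketch using Gaussian mixture models, one common modeling choice that the chapter does not prescribe; the component count, threshold value, and use of a background model are assumptions:

```python
# Sketch of the SV decision: accept the claimed identity if the
# log-likelihood ratio against a background model exceeds a threshold.
# Requires: pip install scikit-learn
from sklearn.mixture import GaussianMixture

def train_speaker_model(train_features, n_components=8):
    """Fit a GMM to one speaker's training features, shape (n_frames, n_dims)."""
    gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
    gmm.fit(train_features)
    return gmm

def verify(claimed_model, background_model, test_features, threshold=0.0):
    """Return True (accept) or False (reject) for the claimed identity."""
    llr = claimed_model.score(test_features) - background_model.score(test_features)
    return llr > threshold
```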


FIGURE 13.4 Block diagram of the SI system.

The SI system is based upon one-to-many comparisons: the voice pattern of the unknown speaker is compared with all stored speaker models, the best match is reported, and the corresponding speaker ID is assigned. SV and SI systems perform well when they are TD systems, meaning that the speaker has to speak only those words/sentences that are known to the system. Such systems, however, have fewer applications compared to TID SV/SI systems. TD SV systems are used in home banking services, where a specific user speaks a PIN or password; because of security issues, however, these systems are less popular. TID SV systems are used in forensic applications and are quite popular.
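The one-to-many comparison can be sketched as follows (models are assumed to expose a score() method, as the GMMs in the SV sketch above do):

```python
# One-to-many SI matching: score the unknown utterance against every stored
# speaker model and report the identity of the best match.
def identify(speaker_models, test_features):
    """speaker_models: dict mapping speaker ID -> trained model."""
    scores = {sid: model.score(test_features)
              for sid, model in speaker_models.items()}
    return max(scores, key=scores.get)   # ID with the highest likelihood
```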

13.1.3 INTEGRATED SPEAKER AND SPEECH RECOGNITION

SPR and SR are distinct fields in terms of their objectives. However, some applications require combined SPR and SR systems.

Conventional feature extraction techniques are used separately in SPR and SR. The combined area of SPR and SR amounts to speaker-dependent SR. The block diagram for ISSR is shown in Figure 13.5. In ISSR, all basic blocks are similar except the feature extraction techniques [48, 49]: the feature vectors should now represent both speaker and speech characteristics. This type of system is used in command-and-control applications, where only a specific user can control the devices.
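One simple way to obtain such dual-purpose vectors is to concatenate message-related features with speaker-related ones; the particular combination below (MFCCs plus a pitch track) is an illustrative assumption, not the chapter's prescription:

```python
# Sketch of an ISSR-style feature vector carrying both speech content (MFCCs)
# and speaker-related information (pitch). Frame alignment here is crude.
# Requires: pip install librosa numpy
import numpy as np
import librosa

def issr_features(wav_path):
    y, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T    # (frames, 13)
    f0 = librosa.yin(y, fmin=50, fmax=400, sr=sr)           # pitch per frame
    f0 = np.resize(f0, mfcc.shape[0]).reshape(-1, 1)        # match frame count
    return np.hstack([mfcc, f0])                            # (frames, 14)
```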


FIGURE 13.5 Block diagram of SPR and SR systems.

VARIATIONS OF AUTOMATIC SPEECH RECOGNITION SYSTEMS

The speech modes are isolated word SR (IWSR), connected word recognition (CWR), and continuous SR (CSR). In IWSR, the speaker has to pause between words. In CWR, a string of words is spoken without long pauses between them.

CSR resembles natural human speech. In IWSR, the speech is treated as a single phrase and is recognized without any prior phonetic knowledge. For this, the speech is delimited by two points: a starting point and an ending point. Gupta and Sivakumar [15] implemented an IWSR system for the 10 digits of the Hindi language, in which the speech waveform is divided into frames using a Hamming window. Venkataramani [55] developed an online recognition system for 10 digits on a field-programmable gate array chip; the speech signals acquired by microphones are converted into phrases using a hidden Markov model (HMM). Many authors have also done research using electromyogram (EMG) signals and have shown the dependency of the speech signal on the EMG signal [37].
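A minimal sketch of the endpoint detection just described: frame the waveform with a Hamming window and mark the first and last frames whose short-time energy exceeds a threshold (the frame length, hop size, and threshold ratio below are illustrative assumptions):

```python
# Energy-based endpoint detection: find the starting and ending points of
# the speech region in a waveform y (a 1-D NumPy array of samples).
import numpy as np

def find_endpoints(y, frame_len=400, hop=160, energy_ratio=0.1):
    """Return (start_sample, end_sample) of the detected speech region."""
    window = np.hamming(frame_len)
    n_frames = 1 + (len(y) - frame_len) // hop
    energy = np.array([
        np.sum((y[i * hop:i * hop + frame_len] * window) ** 2)
        for i in range(n_frames)
    ])
    voiced = np.where(energy > energy_ratio * energy.max())[0]
    return voiced[0] * hop, voiced[-1] * hop + frame_len
```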

In CWR, the words are divided by short pauses, and the spoken strings may be small, average, or significant in size. Myers and Rabiner [41] implemented a CWR system using dynamic time warping (DTW), in which matching is done between the stored database and the uttered words (a sketch of DTW matching appears at the end of this section). Syama and Mary [52] applied the same method to a Malayalam recognition system. CSR operates on speech as it is naturally spoken by humans [14]. Thangarajan et al. [53] implemented such a system for the Tamil language; building it was quite challenging because of the variations in vowels and consonants, but the recognition rate can be increased with extensive training. Many authors have implemented systems for different languages; for example, Abushariah et al. [2] developed a CSR system using the Sphinx and HTK tools. A five-state HMM, with three emitting states, was deployed for the triphone acoustic model, and the statistical language model comprised unigram, bigram, and trigram models. The simulation was carried out on various combinations of speakers and sentences.

In addition, a manual validation and classification process, performed by a human, was included to verify the correct pronunciation of the words. From the results, it was concluded that the system performed best when the speakers were different but the sentences were similar, and it failed when both the speakers and the sentences were different.
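The DTW matching used in CWR systems such as Myers and Rabiner's can be sketched as follows (template storage and the Euclidean local distance are assumptions; the original work's exact formulation is not reproduced here):

```python
# Dynamic time warping: align a test feature sequence against each stored
# word template and report the word whose template aligns most cheaply.
import numpy as np

def dtw_distance(a, b):
    """DTW cost between feature sequences a (n, d) and b (m, d)."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])   # local distance
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def recognize(templates, test):
    """templates: dict mapping word -> stored feature sequence."""
    return min(templates, key=lambda w: dtw_distance(templates[w], test))
```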

 