Segmentation of the speech signal
As mentioned in the introduction, a speech signal consists of speech units separated by pauses. The speech units contain voiced and unvoiced regions, and within the voiced regions we distinguish stable and unstable intervals. The algorithms and criteria used to detect these structures are described in the following sections.
Speech and pause segments
We use Rabiner’s algorithm for determining the endpoints of isolated utterances [RAB 75] and extend it with a heuristic that finds the pauses between the spoken segments of a speech signal. We refer to the combined algorithm as the pause-finding algorithm. Rabiner’s algorithm classifies each signal frame, i.e. a short 10 ms section of the signal, as speech or pause based on the frame energy and the silence energy. The silence energy is the mean energy of an interval that contains silence or signal noise; our algorithm expects this interval at the beginning of the speech signal. The length of the interval over which the silence energy is computed is configurable; the default value is 100 ms.
First, the pause-finding algorithm calculates an initial segment list, in which each segment is characterized by its start and end sample positions and its segment type, either SPEECH or PAUSE. Some of the segments in this initial list are too short to form a true speech or pause segment. For instance, a glottal stop before a plosive or a low-energy speech segment is often identified as a pause segment. Second, the algorithm therefore merges pause segments that are too short with their neighboring speech segments and, in the same way, merges speech segments that are too short with their neighboring pause segments. The minimum lengths of both pause and speech segments are configurable. Finally, the algorithm extends the speech segments by a small, configurable length. This is necessary because the ends of speech segments may consist of low-energy phonemes, which the extension automatically includes. The pause-finding algorithm consists of six steps, which we present in the following:
- 1) Energies, peak energy and silence energy: the energies E(k), k = 0, ..., n - 1, are computed every 10 ms, each over a 10 ms window of the speech signal. The peak energy Emax is the maximum of all energies E(k). The silence energy Emin is the mean energy of the initial silence that is assumed to occur at the beginning of the speech signal.
- 2) Threshold ITL for speech/pause decision: the threshold ITL is computed from Emax and Emin as in [RAB 75].
- 3) Initial segment list: each frame is classified as either a speech frame or a pause frame by comparing its energy with ITL. A frame is a speech frame if its energy is larger than ITL and a pause frame otherwise. Consecutive speech frames form a speech segment, and consecutive pause frames form a pause segment of the initial segment list.
- 4) Merging of too short segments of type PAUSE: pause segments shorter than the configurable minimum pause length are merged with their neighboring speech segments. The default minimum pause length is 200 ms.
- 5) Merging of too short segments of type SPEECH: speech segments shorter than the configurable minimum speech length are merged with their neighboring pause segments. The default minimum speech length is 150 ms.
- 6) Extension of segments of type SPEECH: all segments of type SPEECH are extended by the length given by the configurable maximum speech segment extension (default value 50 ms). At the same time, the pause segments to the left and right of each speech segment are shortened by the same amount.
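The six steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used to produce the results in this chapter. All names (`Segment`, `find_pauses`, etc.) are hypothetical. The energy is assumed to be the mean squared amplitude of a frame. Step 2 cites [RAB 75] for the threshold; the sketch assumes the lower threshold as we recall it from that paper, ITL = min(0.03 (Emax - Emin) + Emin, 4 Emin). Two further assumptions are ours: segments at the signal boundaries are never merged, and the extension never consumes more than half of an adjacent pause.

```python
from dataclasses import dataclass

SPEECH, PAUSE = "SPEECH", "PAUSE"

@dataclass
class Segment:
    start: int  # first sample of the segment (inclusive)
    end: int    # one past the last sample (exclusive)
    kind: str   # SPEECH or PAUSE

def frame_energies(samples, frame_len):
    """Step 1: mean squared amplitude of each non-overlapping frame (assumed definition)."""
    n = len(samples) // frame_len
    return [sum(x * x for x in samples[k * frame_len:(k + 1) * frame_len]) / frame_len
            for k in range(n)]

def threshold_itl(energies, n_silence_frames):
    """Step 2: lower threshold, assuming the formula from [RAB 75] (see lead-in)."""
    silence = energies[:n_silence_frames]
    e_max = max(energies)
    e_min = sum(silence) / len(silence)
    return min(0.03 * (e_max - e_min) + e_min, 4.0 * e_min)

def initial_segments(energies, itl, frame_len):
    """Step 3: classify frames and join consecutive frames of the same type."""
    segs = []
    for k, e in enumerate(energies):
        kind = SPEECH if e > itl else PAUSE
        if segs and segs[-1].kind == kind:
            segs[-1].end = (k + 1) * frame_len
        else:
            segs.append(Segment(k * frame_len, (k + 1) * frame_len, kind))
    return segs

def merge_short(segs, kind, min_len):
    """Steps 4 and 5: absorb too-short interior segments of `kind` into their neighbors."""
    other = PAUSE if kind == SPEECH else SPEECH
    out = []
    for i, s in enumerate(segs):
        interior = 0 < i < len(segs) - 1  # assumption: boundary segments are kept
        too_short = s.kind == kind and interior and s.end - s.start < min_len
        new_kind = other if too_short else s.kind
        if out and out[-1].kind == new_kind:
            out[-1].end = s.end
        else:
            out.append(Segment(s.start, s.end, new_kind))
    return out

def extend_speech(segs, ext):
    """Step 6: move each speech/pause boundary into the pause by up to `ext` samples."""
    out = [Segment(s.start, s.end, s.kind) for s in segs]
    for left, right in zip(out, out[1:]):
        if left.kind == SPEECH and right.kind == PAUSE:
            shift = min(ext, (right.end - right.start) // 2)  # assumed cap
            left.end += shift
            right.start += shift
        elif left.kind == PAUSE and right.kind == SPEECH:
            shift = min(ext, (left.end - left.start) // 2)    # assumed cap
            left.end -= shift
            right.start -= shift
    return out

def find_pauses(samples, rate, silence_ms=100, min_pause_ms=200,
                min_speech_ms=150, ext_ms=50, frame_ms=10):
    """Run steps 1 to 6 with the default values given in the text."""
    to_samples = lambda t_ms: rate * t_ms // 1000
    frame_len = to_samples(frame_ms)
    energies = frame_energies(samples, frame_len)
    itl = threshold_itl(energies, max(1, silence_ms // frame_ms))
    segs = initial_segments(energies, itl, frame_len)
    segs = merge_short(segs, PAUSE, to_samples(min_pause_ms))
    segs = merge_short(segs, SPEECH, to_samples(min_speech_ms))
    return extend_speech(segs, to_samples(ext_ms))
```

Relabeling a too-short segment and then joining it with equal-typed neighbors is one simple way to realize the merging in steps 4 and 5. Because pauses are merged before speech segments, a brief low-energy dip inside an utterance is absorbed into the surrounding speech before the minimum speech length is checked.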
Figure 9.1 shows the result of the pause-finding algorithm for a speech recording from the Keele pitch database [PLA 95], in which a female speaker reads the beginning of the story “The north wind and the sun”.
Figure 9.1. Segmentation of a speech signal into speech (S) and pause (P) segments. For a color version of this figure, see www.iste.co.uk/sharp/cognitive.zip