Speech Coding

The most efficient speech coding systems for narrowband applications use the analysis-by-synthesis method, shown in Figure 1, in which the speech signal is analyzed during the coding process to estimate the main parameters that allow its synthesis during the decoding process. Two sets of speech parameters are usually estimated: (1) the linear filter parameters, which model the vocal tract and are estimated using linear prediction, and (2) the excitation sequence. Most speech coders estimate the linear filter in a similar way, whereas several different methods have been proposed to estimate the excitation sequence, which determines the synthesized speech quality and the achievable compression rates. Among these speech coding systems are multipulse excited (MPE) and regular pulse excited (RPE) linear predictive coding, and codebook excited linear predictive coding (CELP), which achieve bit rates between 9.6 kbits/s and 2.4 kbits/s with reasonably good speech quality. Table 1 shows the main characteristics of some of the most successful speech coders.

Analysis-by-synthesis codecs split the input speech s(n) into frames, usually about 20 ms long. For each frame, parameters are determined for a synthesis filter, and the excitation to this filter is then determined by finding the excitation signal that, when passed through the given synthesis filter, minimizes the error between the input speech and the reproduced speech. Finally, for each frame the encoder transmits information representing the synthesis filter parameters and the excitation to the decoder, where the given excitation is passed through the synthesis filter to give the reconstructed speech.
Here the synthesis filter is an all-pole filter, estimated using linear prediction methods under the assumption that the speech signal can be properly represented as an autoregressive process. The synthesis filter may also include a pitch filter to model the long-term periodicities present in voiced speech. Generally MPE and RPE coders will work without a pitch filter, although their performance is improved if one is included. For CELP coders, however, a pitch filter is extremely important, for reasons discussed next.
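As a concrete (if simplified) illustration, the coefficients of the all-pole synthesis filter can be obtained from a frame with the autocorrelation method and the Levinson-Durbin recursion. The sketch below is illustrative; the function name and the generic order of 10 are our own choices, not taken from any particular standard.

```python
import numpy as np

def lpc(frame, order=10):
    """Estimate all-pole synthesis filter coefficients A(z) for one
    speech frame: autocorrelation method + Levinson-Durbin recursion.
    Returns a = [1, a1, ..., ap] and the final prediction error."""
    n = len(frame)
    r = np.array([np.dot(frame[:n - k], frame[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        # reflection coefficient for order i
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err
        # order update: a_j += k * a_{i-j}, and a_i = k
        a[1:i + 1] = a[1:i + 1] + k * a[i - 1::-1]
        err *= (1.0 - k * k)
    return a, err
```

Applied to a frame generated by a known autoregressive process, the recursion recovers coefficients close to the true ones, consistent with the autoregressive model of speech assumed above.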

The error-weighting filter is used to shape the spectrum of the error signal in order to reduce its subjective loudness. This is possible because the error signal is at least partially masked by the speech in frequency regions where the speech has high energy. The weighting filter emphasizes the noise in frequency regions where the speech energy is low, so that minimizing the weighted error concentrates the energy of the error signal in regions where the speech has high energy; the error is then at least partially masked by the speech, reducing its subjective importance.

Such weighting is found to produce a significant improvement in the subjective quality of the reconstructed speech for analysis by synthesis coders.
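A commonly used form of this filter is W(z) = A(z)/A(z/γ), with 0 < γ < 1, which de-emphasizes the error near the formant peaks of A(z). The sketch below shows the coefficient construction and a plain direct-form filter; the function names are our own, and the value γ = 0.9 is only a typical choice.

```python
import numpy as np

def weighting_coeffs(a, gamma=0.9):
    """Perceptual weighting filter W(z) = A(z) / A(z/gamma):
    the numerator is A(z) itself, the denominator the bandwidth-expanded
    version with each a_k replaced by gamma**k * a_k (0 < gamma < 1)."""
    g = gamma ** np.arange(len(a))
    return np.asarray(a, dtype=float), np.asarray(a, dtype=float) * g

def iir_filter(b, a, x):
    """Direct-form IIR filtering, assuming a[0] == 1."""
    y = np.zeros(len(x))
    for n in range(len(x)):
        acc = sum(b[j] * x[n - j] for j in range(len(b)) if n >= j)
        acc -= sum(a[j] * y[n - j] for j in range(1, len(a)) if n >= j)
        y[n] = acc
    return y
```

Note that γ = 1 reduces W(z) to unity (no weighting), while smaller γ broadens the formant bandwidths of the denominator and so weights the error more heavily between formants.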

The main feature distinguishing the MPE, RPE, and CELP coders is how the excitation waveform u(n) for the synthesis filter is chosen. Conceptually, every possible waveform is passed through the filter to see what reconstructed speech signal that excitation would produce. The excitation that gives the minimum weighted error between the original and the reconstructed speech is then chosen by the encoder and used to drive the synthesis filter at the decoder. This determination of the excitation sequence allows analysis-by-synthesis coders to produce good quality speech at low bit rates. However, the numerical complexity required to determine the excitation signal in this way is huge; as a result, some means of reducing this complexity, without compromising the performance of the codec too badly, must be found.
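The closed-loop search described above can be sketched as follows. For brevity this toy version minimizes the plain squared error rather than the weighted error, and the candidate set stands in for whatever excitation model the coder uses; the function names are illustrative.

```python
import numpy as np

def synthesize(u, a):
    """All-pole synthesis filter 1/A(z): s[n] = u[n] - sum_k a_k s[n-k]."""
    s = np.zeros(len(u))
    for n in range(len(u)):
        s[n] = u[n] - sum(a[k] * s[n - k] for k in range(1, len(a)) if n >= k)
    return s

def search_excitation(target, candidates, a):
    """Analysis-by-synthesis search: synthesize every candidate
    excitation and keep the one with minimum squared error against
    the target speech frame."""
    errors = [np.sum((target - synthesize(u, a)) ** 2) for u in candidates]
    best = int(np.argmin(errors))
    return best, errors[best]
```

Even in this toy form the cost is one full filtering pass per candidate, which is exactly why practical coders constrain the excitation model, as discussed next.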

The differences between MPE, RPE, and CELP coders arise from the representation of the excitation signal u(n). In MPE the excitation is represented using non-uniformly spaced pulses, typically eight pulses per 10 ms. The position and amplitude of each pulse are determined by minimizing a given criterion, usually the mean square error, as shown in Figure 1.
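A common way to carry out this minimization is a sequential (greedy) search: pulses are placed one at a time, each at the residual position of largest magnitude. The sketch below is a deliberately crude version of that idea, operating directly on a residual signal rather than through the synthesis filter; names and the 8-pulse count per subframe follow the text.

```python
import numpy as np

def mpe_excitation(residual, num_pulses=8):
    """Greedy multipulse excitation sketch: repeatedly place a pulse at
    the largest remaining residual sample, using that sample as the
    pulse amplitude. Positions are unconstrained (non-uniform)."""
    u = np.zeros_like(residual, dtype=float)
    remaining = residual.astype(float).copy()
    for _ in range(num_pulses):
        pos = int(np.argmax(np.abs(remaining)))
        u[pos] = remaining[pos]
        remaining[pos] = 0.0          # this position is now covered
    return u
```

A real MPE coder re-optimizes amplitudes jointly and measures the error after synthesis filtering, but the greedy placement above captures the essential structure of the search.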

The regular pulse excited (RPE) coder is similar to MPE, except that the excitation is represented using a set of 10 uniformly spaced pulses in a 5 ms interval. In this approach the position of the first pulse is determined by minimizing the mean square error; once it is fixed, the positions of the remaining nine pulses follow automatically. Finally, the optimal amplitudes of all pulses are estimated by solving a set of simultaneous equations. The pan-European GSM mobile telephone system uses a simplified RPE codec, with long-term prediction, operating at 13 kbits/s. Figure 2 shows the difference between both excitation sequences.
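The grid-phase search can be illustrated as follows. In this sketch the pulse amplitudes are simply the residual samples on the chosen grid, a simplification of the joint least-squares amplitude solve mentioned above; the function name and spacing default are our own.

```python
import numpy as np

def rpe_select(residual, spacing=4):
    """Regular-pulse excitation sketch: pulses sit on a uniform grid,
    and only the grid phase (the position of the first pulse) is
    searched, here by keeping the phase that captures the most
    residual energy."""
    best = max(range(spacing),
               key=lambda p: np.sum(residual[p::spacing] ** 2))
    u = np.zeros_like(residual, dtype=float)
    u[best::spacing] = residual[best::spacing]
    return best, u
```

Because only the phase (2 bits for a spacing of 4) plus the grid amplitudes need transmitting, the position information is far cheaper than in MPE, which is what makes the GSM-style simplification attractive.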

Although MPE and RPE coders can provide good speech quality at rates of around 10 kbits/s and higher, they are not suitable for lower bit rates, owing to the large amount of information that must be transmitted about the positions and amplitudes of the excitation pulses. If we attempt to reduce the bit rate by using fewer pulses, or by coarsely quantizing their amplitudes, the reconstructed speech quality deteriorates rapidly, so other approaches are needed to produce good quality speech at rates below 10 kbits/s. A suitable approach to this end is CELP, proposed by Schroeder and Atal in 1985, which differs from MPE and RPE in that the excitation signal is effectively vector quantized. Here the excitation is given by an entry from a large vector-quantizer codebook and a gain term to control its power. Typically the codebook index is represented with about 10 bits (giving a codebook size of 1,024 entries) and the gain is coded with about 5 bits, so the bit rate necessary to transmit the excitation information is greatly reduced: around 15 bits, compared, for example, to the 47 bits used in the GSM RPE codec.

Early versions of CELP coders used codebooks containing white Gaussian sequences, because it was assumed that the long- and short-term predictors would remove nearly all the redundancy from the speech signal, producing a random noise-like residual. It was also shown that the short-term probability density function (pdf) of this residual was nearly Gaussian, so using such a codebook to produce the excitation for the long- and short-term synthesis filters could yield high quality speech. However, choosing which codebook entry to use in an analysis-by-synthesis procedure meant that every excitation sequence had to be passed through the synthesis filters to see how close the reconstructed speech it produced would be to the original. Because this procedure is computationally very expensive, much work has been carried out to reduce the complexity of CELP codecs, mainly by altering the structure of the codebook. Large advances have also been made in the speed of DSP chips, so that it is now relatively easy to implement a real-time CELP codec on a single, low-cost DSP chip. Several important speech coding standards have been defined based on CELP, such as the U.S. Department of Defense (DoD) 4.8 kbits/s codec and the CCITT 16 kbits/s low-delay CELP.
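The codebook-plus-gain search has a useful property: for each codebook entry the optimal gain has a closed form, g = ⟨target, y⟩ / ⟨y, y⟩, where y is the entry filtered through the synthesis filter. The sketch below exploits this (function names are our own, and the pitch filter and perceptual weighting are omitted for brevity).

```python
import numpy as np

def impulse_response(a, length):
    """Impulse response of the all-pole synthesis filter 1/A(z)."""
    h = np.zeros(length)
    for n in range(length):
        x = 1.0 if n == 0 else 0.0
        h[n] = x - sum(a[k] * h[n - k] for k in range(1, len(a)) if n >= k)
    return h

def celp_search(target, codebook, a):
    """Exhaustive CELP codebook search: filter each entry (as a
    convolution with the synthesis filter's impulse response), compute
    the closed-form optimal gain, and keep the entry with the smallest
    squared error. Returns (index, gain, error)."""
    h = impulse_response(a, len(target))
    best = (0, 0.0, np.inf)
    for i, c in enumerate(codebook):
        y = np.convolve(c, h)[:len(target)]
        g = np.dot(target, y) / np.dot(y, y)
        err = np.sum((target - g * y) ** 2)
        if err < best[2]:
            best = (i, g, err)
    return best
```

With a 1,024-entry codebook this loop runs 1,024 filterings per subframe, which makes the complexity problem discussed above concrete; structured (e.g. algebraic or sparse) codebooks exist precisely to shortcut this convolution.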

The CELP codec structure can be improved and used at rates below 4.8 kbits/s by classifying speech segments into voiced, unvoiced, and transition frames, which are then coded differently, with a specially designed encoder for each type. For example, for unvoiced frames the encoder will not use any long-term prediction, whereas for voiced frames such prediction is vital but the fixed codebook may be less important. Such class-dependent codecs are capable of producing reasonable quality speech at bit rates of 2.4 kbits/s.

Multi-band excitation (MBE) codecs work by declaring some regions of the frequency domain as voiced and others as unvoiced. For each frame they transmit a pitch period, spectral magnitude and phase information, and voiced/unvoiced decisions for the harmonics of the fundamental frequency. This structure produces good quality speech at 8 kbits/s. Table 1 provides a summary of some of the most significant CELP coders.
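A minimal frame classifier of the kind a class-dependent codec needs can be built from the zero-crossing rate alone: voiced speech is dominated by low frequencies (few crossings), unvoiced speech is noise-like (many crossings). The thresholds below are purely illustrative, not taken from any standard, and a real coder would combine several features.

```python
import numpy as np

def classify_frame(frame, zcr_voiced=0.2, zcr_unvoiced=0.35):
    """Crude voiced/unvoiced/transition decision from the zero-crossing
    rate (fraction of sample pairs whose sign changes)."""
    signs = np.sign(frame)
    zcr = np.mean(np.abs(np.diff(signs))) / 2.0
    if zcr < zcr_voiced:
        return "voiced"
    if zcr > zcr_unvoiced:
        return "unvoiced"
    return "transition"
```

For example, a 100 Hz tone sampled at 8 kHz crosses zero only every 40 samples and classifies as voiced, while white noise changes sign roughly every other sample and classifies as unvoiced.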

Bandwidths higher than the telephone bandwidth result in major subjective improvements. Thus a bandwidth of 50 Hz to 20 kHz not only improves the intelligibility and naturalness of audio and speech, but also adds a feeling of transparent communication, making speaker recognition easier. However, this implies storing and transmitting a much larger amount of data, unless efficient wideband coding schemes are used. Wideband speech and audio coding aims to minimize storage and transmission costs while providing an audio or speech signal with no audible differences between the compressed and the original signals, with a 20 kHz or higher bandwidth and a dynamic range equal to or above 90 dB.

Four key technologies play a very important role in achieving this goal: perceptual coding, frequency-domain coding, window switching, and dynamic bit allocation. Using these features, the speech signal is divided into a set of non-uniform subbands, so that the perceptually more significant components are encoded with more precision and the perceptually less significant frequency components with fewer bits. The subband approach also allows use of the masking effect, in which frequency components close to those with larger amplitude are masked and can therefore be discarded without audible degradation. These features, together with dynamic bit allocation, allow a significant reduction of the total number of bits required to encode the audio signal without perceptible degradation of its quality. Some of the most representative coders of this type are listed in Table 2.
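Dynamic bit allocation across subbands can be sketched with a simple greedy rule. The sketch assumes the classic approximation that each extra bit in a band cuts that band's quantization-noise power by a factor of 4 (6 dB per bit); in a real perceptual coder the energies would be replaced by noise-to-masking-threshold ratios.

```python
import numpy as np

def allocate_bits(subband_energy, total_bits):
    """Greedy dynamic bit allocation: at every step, give the next bit
    to the subband with the highest remaining noise power, modeling the
    noise in a band with b bits as energy / 4**b (6 dB/bit rule)."""
    subband_energy = np.asarray(subband_energy, dtype=float)
    bits = np.zeros(len(subband_energy), dtype=int)
    for _ in range(total_bits):
        noise = subband_energy / (4.0 ** bits)
        bits[int(np.argmax(noise))] += 1
    return bits
```

With energies of 100, 1, and 0.01 and a budget of 6 bits, the rule assigns 5, 1, and 0 bits respectively: the dominant band absorbs most of the budget while the weakest band, presumed masked, gets none.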