Performance Analysis of Advanced Hybrid Speech Coding Techniques in Time domain, Spectral domain and Perceptual domain

Speech coding is the art of creating a minimally redundant representation of the speech signal that can be efficiently transmitted or stored in digital media, and of decoding that signal with the best possible perceptual quality. Speech transmission over wireless networks relies on removing redundant information from the signal in a way that preserves the quality and intelligibility of the speech. In general, the lower the bit rate, the lower the quality of the reconstructed speech; nevertheless, there is a constant quest for better speech quality at lower bit rates. This paper presents a performance analysis of the quality of advanced hybrid speech coding techniques in the time domain, the spectral domain and the perceptual domain. The analysis covers three hybrid speech coding algorithms, CELP, G.729 Annex A and G.723.1, and assesses their quality for an English female speaker, an English male speaker and an Arabic female speaker using MATLAB simulation. The evaluation criteria include the following measures: Signal to Noise Ratio (SNR), Segmental Signal to Noise Ratio (SSNR), Log-Likelihood Ratio (LLR), Weighted Spectral Slope (WSS), Absolute Error, Perceptual Evaluation of Speech Quality (PESQ), and the ratings of speech distortion, background noise and predicted overall quality.


Introduction
The CELP coder is widely used in mobile communication as a generic algorithm for efficient, high-quality speech coding, and many standardized codecs are based on it. G.729 and G.723.1 are ITU-T standard speech codecs based on the CELP coder. The G.729 coder is also called the Conjugate Structure Algebraic CELP (CS-ACELP) coder, and G.723.1 is associated with Multi-pulse Maximum Likelihood Quantization (MP-MLQ), which it uses for the excitation at its higher bit rate. CELP, CS-ACELP and MP-MLQ all encode speech in frames using linear predictive analysis-by-synthesis coding. This paper is organized as follows. Section 2 introduces the CELP speech coder. Section 3 introduces the ITU-T G.723.1 speech coder. Section 4 introduces the ITU-T G.729 Annex A speech coder. Section 5 reviews the objective evaluation measures. Section 6 describes the MATLAB simulation of the objective speech quality measures for the three coders, with the results presented in tables and graphs. Finally, concluding remarks are given in Section 7.

CELP Speech Coder
The CELP coder forms a bridge between waveform coders and vocoders, as it compresses speech with a quality comparable to that of medium bit rate waveform coders [1]. CELP is one of the most efficient speech coding algorithms, compressing speech to a rate of 4.8 kbps while preserving its quality [1]. The CELP algorithm finds the best code word characterizing the excitation signal for each 30 ms speech frame. To do so, each code word is applied in turn as the excitation of the CELP synthesizer. The synthesized speech signal is then compared with the input speech signal and a difference signal is calculated. This difference signal is passed through a perceptual weighting filter, which yields the error signal e(n) [2]. The code word that gives the lowest power of the error signal e(n) is selected as the best code word for the frame. The characteristics of the formant weighting filter are chosen to ensure the best subjective perception of the synthesized speech signal, while the harmonic noise weighting filter controls the amount of error in the harmonics of the speech signal [2].
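The search loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the standardized coder: the toy codebook, the single LPC coefficient and the function names are hypothetical, the excitation gain is fixed to 1, and the perceptual weighting filter is omitted, so the plain error energy is minimized instead of the weighted one.

```python
def synthesize(excitation, lpc):
    """All-pole synthesis filter 1/A(z), with A(z) = 1 + sum_k lpc[k-1] * z^-k."""
    out = []
    for n, u in enumerate(excitation):
        y = u
        for k, a in enumerate(lpc, start=1):
            if n - k >= 0:
                y -= a * out[n - k]
        out.append(y)
    return out


def search_codebook(target, codebook, lpc):
    """Return (index, error_energy) of the code word with the lowest error energy."""
    best_index, best_energy = -1, float("inf")
    for index, codeword in enumerate(codebook):
        synthesized = synthesize(codeword, lpc)
        energy = sum((t - s) ** 2 for t, s in zip(target, synthesized))
        if energy < best_energy:
            best_index, best_energy = index, energy
    return best_index, best_energy


# Toy data (hypothetical): a 3-entry codebook and a first-order predictor.
lpc = [-0.9]
codebook = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, -1.0, 0.0],
]
target = synthesize(codebook[2], lpc)  # pretend this is the input frame
best_index, best_energy = search_codebook(target, codebook, lpc)
```

In a real CELP coder the error would first pass through the perceptual weighting filter and an optimal gain would be computed for every code word; both steps are left out here to keep the loop readable.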

ITU-T G.723.1 Speech Coder
ITU-T G.723.1, the standard speech coder for multimedia communication, has two modes with bit rates of 5.3 and 6.3 kbit/s. The coder is based on the principles of linear prediction analysis-by-synthesis coding and attempts to minimize a perceptually weighted error signal [4]. The encoder operates on frames of 240 samples (30 ms) each. Each frame is first divided into four subframes of 60 samples each. In addition, there is a look-ahead of 7.5 ms, so the coder has a total algorithmic delay of 37.5 ms. For every 60-sample subframe, a set of tenth-order LPC coefficients is computed. The LPC set of the last subframe is converted to LSP parameters, and the LSP set is divided into three sub-vectors, which are quantized using a predictive split vector quantizer (PSVQ). The unquantized LPC coefficients are used to construct the short-term perceptual weighting filter, which filters the speech of the entire frame to obtain the perceptually weighted speech signal. For every two subframes, the open-loop pitch lag is computed from the weighted speech signal. The speech signal of every subframe is then encoded by the adaptive codebook (ACB) and fixed codebook (FCB) search procedures. The ACB search uses a fifth-order pitch predictor to obtain the closed-loop pitch and gains. Finally, the stochastic excitation pulses are approximated by MP-MLQ excitation at the high bit rate (6.3 kbit/s) and by ACELP at the low bit rate (5.3 kbit/s) [5].
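The frame arithmetic quoted above can be checked with a few lines. This is only a sketch: the 8000 Hz sampling rate is assumed from the 240-samples-per-30-ms figure in the text, and the variable names are illustrative.

```python
SAMPLE_RATE_HZ = 8000   # assumed: 240 samples per 30 ms frame implies 8 kHz
FRAME_MS = 30.0         # frame duration from the text
LOOKAHEAD_MS = 7.5      # look-ahead from the text
SUBFRAMES_PER_FRAME = 4

frame_samples = int(SAMPLE_RATE_HZ * FRAME_MS / 1000)          # samples per frame
subframe_samples = frame_samples // SUBFRAMES_PER_FRAME         # samples per subframe
lookahead_samples = int(SAMPLE_RATE_HZ * LOOKAHEAD_MS / 1000)   # look-ahead in samples
algorithmic_delay_ms = FRAME_MS + LOOKAHEAD_MS                  # frame plus look-ahead
```

The look-ahead therefore corresponds to exactly one extra subframe of speech.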

ITU-T G.729-ANNEX A Speech Coder
ITU-T G.729 Annex A follows the same general coding and decoding algorithm as ITU-T G.729, which is based on Conjugate Structure Algebraic Code Excited Linear Prediction (CS-ACELP). The coder operates on speech frames of 10 ms, which is equivalent to 80 samples at the sampling rate of 8000 Hz [6]. Each 10 ms frame is first divided into two subframes of 40 samples each. There is a 5 ms look-ahead for linear prediction (LP) analysis, resulting in a total algorithmic delay of 15 ms. For every 10 ms frame, the speech signal is analyzed to extract the parameters of the Code-Excited Linear-Prediction (CELP) coding model. A set of tenth-order LPC coefficients is computed using the Levinson-Durbin algorithm. The LPC coefficients of the second subframe are converted to LSP coefficients and quantized using a predictive two-stage vector quantizer. The unquantized LPC coefficients are used to construct the short-term perceptual weighting filter. After the weighted speech signal has been computed, an open-loop pitch lag is estimated once per 10 ms frame from the perceptually weighted speech signal. Next, the ACB and FCB are searched to obtain the optimum excitation code vectors. The ACB search uses a first-order pitch predictor and a fractional pitch lag with a resolution of one third of a sample. In the FCB search, the stochastic excitation pulses are modeled using algebraic codebooks with four pulses [5].
The major algorithmic differences between G.729 Annex A and G.729 are summarized below:
- The perceptual weighting filter uses the quantized LP filter parameters and is given by W(z) = Â(z)/Â(z/γ) with a fixed value of γ = 0.75.
- Open-loop pitch analysis is simplified by using decimation while computing the correlations of the weighted speech.
- Computation of the impulse response of the weighted synthesis filter W(z)/Â(z), computation of the target signal, and updating of the filter states are simplified, since W(z)/Â(z) reduces to 1/Â(z/γ).
- The search of the adaptive codebook is simplified. The search maximizes the correlation between the past excitation and the backward-filtered target signal (the energy of the filtered past excitation is not considered).
- The search of the fixed algebraic codebook is simplified. Instead of the nested-loop focused search, an iterative depth-first tree search approach is used.
- At the decoder, the harmonic post filter is simplified by using only integer delays.
This annex describes the changes to the full implementation which have been made in order to reduce the codec algorithmic complexity.
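The simplification of the weighted synthesis filter to 1/Â(z/γ) can be illustrated numerically. Replacing z by z/γ in Â(z) = 1 + Σ a_k z^-k scales the k-th coefficient by γ^k, so the filter is obtained directly from the quantized LP coefficients. The sketch below assumes that sign convention; the second-order coefficients and the function names are hypothetical and chosen only for readability (the codec itself uses tenth-order LP).

```python
def bandwidth_expand(lpc, gamma=0.75):
    """Coefficients of A(z/gamma), given A(z) = 1 + sum_k lpc[k-1] * z^-k."""
    return [a * gamma ** k for k, a in enumerate(lpc, start=1)]


def impulse_response(denominator, length):
    """First `length` samples of the impulse response of 1/(1 + sum_k d[k-1] * z^-k)."""
    h = []
    for n in range(length):
        y = 1.0 if n == 0 else 0.0
        for k, d in enumerate(denominator, start=1):
            if n - k >= 0:
                y -= d * h[n - k]
        h.append(y)
    return h


# Hypothetical second-order example: A(z) = 1 - 1.6 z^-1 + 0.64 z^-2
quantized_lpc = [-1.6, 0.64]
weighted = bandwidth_expand(quantized_lpc, gamma=0.75)   # A(z / 0.75)
h = impulse_response(weighted, length=3)                 # response of 1/A(z / 0.75)
```

Because only one all-pole filter has to be run, the impulse response, the target signal and the filter-state update all share the same coefficient set.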

Objective Speech Quality Measures
The speech quality of a coding system can be linked to the perceived difference between the output of the system under test and a known reference signal. These differences are sometimes referred to as impairments. To evaluate the quality of a system, several types of objective analysis are carried out, based on the following measures [7]:

 Absolute error (ABSErr)
ABSErr is computed by summing the absolute error values over all samples [8]. If x(n) and x̂(n) represent the original speech signal and the synthesized speech signal respectively, then the error signal e(n) can be written as [7]:

$$ e(n) = x(n) - \hat{x}(n) \tag{1} $$

Then ABSErr is given by:

$$ \mathrm{ABSErr} = \sum_{n=0}^{N-1} \left| e(n) \right| \tag{2} $$

where N is the number of samples.
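A minimal sketch of this computation is given below, assuming that the two signals are time-aligned sequences of equal length and that the error magnitudes are summed; the function name is illustrative and not taken from the paper's MATLAB code.

```python
def absolute_error(original, synthesized):
    """Sum of |x(n) - x_hat(n)| over all samples."""
    if len(original) != len(synthesized):
        raise ValueError("signals must have the same length")
    return sum(abs(x - y) for x, y in zip(original, synthesized))


# Toy signals (hypothetical values)
abs_err = absolute_error([1.0, -2.0, 3.0], [0.5, -2.0, 2.0])
```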

 Percentage Error
Percentage error is calculated using the following formula:

 Mean Squared Error (MSE)
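The mean squared error (MSE) is defined as:

$$ \mathrm{MSE} = \frac{1}{N} \sum_{n=0}^{N-1} e^{2}(n) \tag{4} $$

where e(n) is the error signal and N is the number of samples.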

 Root Mean Squared Error (RMSE)
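RMSE is the square root of the mean squared error and is calculated as:

$$ \mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{n=0}^{N-1} e^{2}(n)} \tag{5} $$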
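The two error measures above can be sketched together, since RMSE is the square root of MSE. The function names are illustrative, and the signals are assumed to be time-aligned and of equal length.

```python
import math


def mean_squared_error(original, synthesized):
    """Average of the squared sample-wise error."""
    if len(original) != len(synthesized) or not original:
        raise ValueError("signals must be non-empty and of equal length")
    return sum((x - y) ** 2 for x, y in zip(original, synthesized)) / len(original)


def root_mean_squared_error(original, synthesized):
    """Square root of the mean squared error."""
    return math.sqrt(mean_squared_error(original, synthesized))


# Toy signals (hypothetical values): a single error of 2 over 4 samples
mse = mean_squared_error([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 2.0])
rmse = root_mean_squared_error([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 2.0])
```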

 Signal to Noise Ratio (SNR)
A widely used objective measure of speech quality is the SNR. It is the ratio of the average energy in the original speech waveform to the average energy in the error signal, and it represents the distortion introduced by the coding algorithm [8]. SNR can be calculated as follows:

$$ \mathrm{SNR} = 10 \log_{10} \frac{\sum_{n=0}^{N-1} x^{2}(n)}{\sum_{n=0}^{N-1} \left( x(n) - \hat{x}(n) \right)^{2}} \tag{6} $$

where x(n) is the original speech, x̂(n) the synthesized speech, and N the number of samples.
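A short sketch of the SNR computation follows. The function name is illustrative, and returning infinity when the two signals are identical is a choice made here, not something stated in the paper.

```python
import math


def snr_db(original, synthesized):
    """Ratio of signal energy to error energy, in dB."""
    if len(original) != len(synthesized):
        raise ValueError("signals must have the same length")
    signal = sum(x * x for x in original)
    noise = sum((x - y) ** 2 for x, y in zip(original, synthesized))
    if noise == 0.0:
        return math.inf  # assumption: identical signals give an unbounded SNR
    return 10.0 * math.log10(signal / noise)


# Toy signals (hypothetical values): error energy is 1/100 of the signal energy
snr = snr_db([10.0, 10.0, 10.0, 10.0], [9.0, 9.0, 9.0, 9.0])
```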

 Segmental Signal to noise ratio (SSNR)
SSNR is an improved version of the classical SNR in which the SNR is measured over quasi-stationary intervals (frames) of 15-30 ms and the individual SNR values are averaged. SSNR distinguishes between errors that occur in high-energy regions and errors in low-energy regions, where any error has a greater perceptual effect [9]. SSNR can be calculated as follows:

$$ \mathrm{SSNR} = \frac{10}{M} \sum_{m=0}^{M-1} \log_{10} \frac{\sum_{n=mL}^{mL+L-1} x^{2}(n)}{\sum_{n=mL}^{mL+L-1} \left( x(n) - \hat{x}(n) \right)^{2}} \tag{7} $$

where L is the frame length (number of samples) and M is the number of frames in the signal (N = ML).
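The frame-wise averaging can be sketched as below. The function name is illustrative. Clamping each frame's SNR to the range [-10, 35] dB is a common practice in objective speech quality evaluation that is assumed here; it is not stated in the paper, and it also covers frames with zero signal or zero error energy.

```python
import math


def segmental_snr_db(original, synthesized, frame_len, floor_db=-10.0, ceil_db=35.0):
    """Mean of the per-frame SNR values in dB (incomplete trailing frame is ignored)."""
    if len(original) != len(synthesized):
        raise ValueError("signals must have the same length")
    n_frames = len(original) // frame_len
    if n_frames == 0:
        raise ValueError("signal is shorter than one frame")
    total = 0.0
    for m in range(n_frames):
        start = m * frame_len
        x = original[start:start + frame_len]
        y = synthesized[start:start + frame_len]
        signal = sum(v * v for v in x)
        noise = sum((v - w) ** 2 for v, w in zip(x, y))
        if noise == 0.0:
            frame_snr = ceil_db      # assumption: error-free frame hits the ceiling
        elif signal == 0.0:
            frame_snr = floor_db     # assumption: silent frame hits the floor
        else:
            frame_snr = 10.0 * math.log10(signal / noise)
        total += min(max(frame_snr, floor_db), ceil_db)
    return total / n_frames


# Toy signals (hypothetical): a loud frame and a quiet frame with the same error
ssnr = segmental_snr_db([10.0, 10.0, 1.0, 1.0], [9.0, 9.0, 0.0, 0.0], frame_len=2)
```

In the toy example the loud frame scores 20 dB and the quiet frame 0 dB, so the SSNR is 10 dB, whereas the global SNR of the same signals is about 17 dB. At a sampling rate of 8000 Hz, a 15-30 ms frame corresponds to a `frame_len` of 120 to 240 samples.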

 The Log-Likelihood Ratio (LLR)
It is a distance measure that can be calculated directly from the LPC vectors of the clean and distorted speech. The LLR measure can be calculated as follows:

$$ d_{LLR}(\mathbf{a}_c, \mathbf{a}_d) = \log \frac{\mathbf{a}_d \mathbf{R}_c \mathbf{a}_d^{T}}{\mathbf{a}_c \mathbf{R}_c \mathbf{a}_c^{T}} \tag{8} $$

where a_c is the LPC vector of the original speech, a_d is the LPC vector of the synthesized speech, a^T is the transpose of a, and R_c is the autocorrelation matrix of the clean speech.
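A per-frame LLR computation can be sketched as follows. The LPC vectors are obtained here with the autocorrelation method and the Levinson-Durbin recursion, which is an assumption about the analysis used; the function names, the default order of 10 and the synthetic test signal are illustrative and not taken from the paper.

```python
import math
import random


def autocorrelation(frame, order):
    """Biased autocorrelation r[0..order] of one speech frame."""
    return [sum(frame[n] * frame[n - k] for n in range(k, len(frame)))
            for k in range(order + 1)]


def levinson_durbin(r, order):
    """LPC vector [1, a_1, ..., a_p] for A(z) = 1 + sum_k a_k * z^-k."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a


def quadratic_form(a, r):
    """a R a^T, with R the Toeplitz autocorrelation matrix built from r."""
    size = len(a)
    return sum(a[i] * a[j] * r[abs(i - j)] for i in range(size) for j in range(size))


def llr(clean_frame, distorted_frame, order=10):
    """Log-likelihood ratio between the LPC models of two frames."""
    r_clean = autocorrelation(clean_frame, order)
    a_clean = levinson_durbin(r_clean, order)
    a_dist = levinson_durbin(autocorrelation(distorted_frame, order), order)
    return math.log(quadratic_form(a_dist, r_clean) / quadratic_form(a_clean, r_clean))


# Synthetic 240-sample frame (hypothetical): an AR(2) process plus added noise
random.seed(0)
clean = [0.0, 0.0]
for _ in range(240):
    clean.append(1.3 * clean[-1] - 0.6 * clean[-2] + random.gauss(0.0, 1.0))
clean = clean[2:]
noisy = [x + random.gauss(0.0, 1.0) for x in clean]
llr_same = llr(clean, clean)
llr_noisy = llr(clean, noisy)
```

Identical frames give an LLR of zero, and the measure grows as the LPC model of the distorted frame departs from that of the clean frame. For a whole utterance the per-frame values would be averaged.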

 The Weighted Spectral Slope (WSS)
It is a direct spectral distance measure, based on a comparison of the smoothed spectra of the original and synthesized speech samples. The smoothed spectra can be obtained from LP analysis. WSS can be defined as follows:

$$ d_{WSS} = \frac{1}{M} \sum_{m=0}^{M-1} \frac{\sum_{j=1}^{K} W(j,m) \left( S_c(j,m) - S_d(j,m) \right)^{2}}{\sum_{j=1}^{K} W(j,m)} \tag{9} $$

where K is the number of bands, M is the total number of frames, and S_c(j, m) and S_d(j, m) are the spectral slopes (typically the spectral differences between neighboring bands) of the j-th band in the m-th frame for clean and distorted speech, respectively. W(j, m) are the weights.
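The weighted average over bands and frames can be sketched as below. The band levels and the weights are taken as inputs here: how the band spectra are computed and how the weights W(j, m) are derived is not specified in this section, so the uniform weights and the toy band levels in the example are purely hypothetical, as are the function names.

```python
def spectral_slopes(band_levels_db):
    """Differences between neighboring band levels of one frame."""
    return [hi - lo for lo, hi in zip(band_levels_db, band_levels_db[1:])]


def weighted_spectral_slope(clean_slopes, distorted_slopes, weights):
    """Frame-averaged, weight-normalized squared slope difference.

    Each argument is a list of frames; each frame is a list of K values.
    """
    n_frames = len(clean_slopes)
    if n_frames == 0:
        raise ValueError("at least one frame is required")
    total = 0.0
    for s_c, s_d, w in zip(clean_slopes, distorted_slopes, weights):
        numerator = sum(wj * (c - d) ** 2 for wj, c, d in zip(w, s_c, s_d))
        total += numerator / sum(w)
    return total / n_frames


# One toy frame with four band levels in dB (hypothetical values)
clean = [spectral_slopes([10.0, 14.0, 12.0, 6.0])]       # slopes: 4, -2, -6
distorted = [spectral_slopes([10.0, 12.0, 12.0, 8.0])]   # slopes: 2, 0, -4
uniform = [[1.0, 1.0, 1.0]]
wss = weighted_spectral_slope(clean, distorted, uniform)
```

With uniform weights the measure reduces to the mean squared slope difference, which is 4.0 for the toy frame.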

 Perceptual Evaluation of Speech Quality (PESQ)
It is an international standard (ITU-T Recommendation P.862) for estimating the Mean Opinion Score (MOS) from the original signal and its degraded version. PESQ measures the difference between the original signal and the synthesized signal and derives a score for which higher is better; the raw P.862 score ranges from -0.5 to 4.5. The PESQ score is computed as a linear combination of the average disturbance value D_ind and the average asymmetrical disturbance value A_ind as follows [10]:

$$ \mathrm{PESQ} = a_0 + a_1 D_{ind} + a_2 A_{ind} \tag{10} $$

where a_0 = 4.5, a_1 = -0.1 and a_2 = -0.0309.
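The final step of the PESQ computation is a simple linear combination, sketched below. The coefficients 4.5, -0.1 and -0.0309 are the values used in ITU-T P.862. The two disturbance values come out of the full perceptual model (level and time alignment, auditory transform, disturbance aggregation), which is not reproduced here, so the inputs in the example are hypothetical.

```python
def pesq_raw_score(d_ind, a_ind, a0=4.5, a1=-0.1, a2=-0.0309):
    """Linear combination of the average symmetric and asymmetric disturbances."""
    return a0 + a1 * d_ind + a2 * a_ind


# Hypothetical disturbance values, for illustration only
score_clean = pesq_raw_score(0.0, 0.0)      # no disturbance at all
score_coded = pesq_raw_score(10.0, 20.0)    # some coding distortion
```

Larger disturbances lower the score, and a signal with no measured disturbance reaches the upper limit of 4.5.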

MATLAB Simulation Results
To compare the performance of the implemented codec algorithms, a MATLAB simulation is carried out in which quality measures are computed for each algorithm in the time domain, the spectral domain and the perceptual domain. These measures are applied to the three hybrid speech coders to test the quality of each algorithm for an English female speaker, an English male speaker and an Arabic female speaker. The objective performance evaluation of the speech files includes the calculation of the Absolute Error, Mean Squared Error, Signal to Noise Ratio, Segmental Signal to Noise Ratio, Perceptual Evaluation of Speech Quality, Log-Likelihood Ratio, Weighted Spectral Slope, and the ratings of speech distortion, background noise and predicted overall quality.

Measuring the Quality performance of three algorithms for English female speaker using sound file (f1058.wav)
The wave file used for this analysis is f1058.wav, an English female speech recording of 22630 samples. The equations used to calculate the above parameters are those given in Section 5. The numerical results of the MATLAB simulation are listed in Table 1 and the resulting plots are shown in Figs. 1, 2 and 3. The results of the objective analysis are found to be satisfactory, as can be judged from these figures [11].

Measuring the Quality performance of three algorithms for English male speaker using sound file (male.wav)
The wave file used for this analysis is male.wav, an English male speech recording of 408226 samples. The equations used to calculate the above parameters are those given in Section 5. The numerical results of the MATLAB simulation are listed in Table 2 and the resulting plots are shown in Figs. 4, 5 and 6. The results of the objective analysis are found to be satisfactory, as can be judged from these figures.

Measuring the Quality performance of three algorithms for Arabic female speaker using sound file (test.wav)
The wave file used for this analysis is test.wav, an Arabic female speech recording of 58000 samples. The equations used to calculate the above parameters are those given in Section 5. The numerical results of the MATLAB simulation are listed in Table 3 and the resulting plots are shown in Figs. 7, 8 and 9. The results of the objective analysis are found to be satisfactory, as can be judged from these figures.

Conclusion
This paper presented a performance analysis of the quality of advanced hybrid speech coding techniques in the time domain, the spectral domain and the perceptual domain. The evaluation criteria were applied to three hybrid speech coding algorithms, CELP, G.729 Annex A and G.723.1, to assess their quality for an English female speaker, an English male speaker and an Arabic female speaker using MATLAB simulation. The criteria include the following measures: Signal to Noise Ratio (SNR), Segmental Signal to Noise Ratio (SSNR), Log-Likelihood Ratio (LLR), Weighted Spectral Slope (WSS), Absolute Error, Perceptual Evaluation of Speech Quality (PESQ), and the ratings of speech distortion, background noise and predicted overall quality. As can be seen from the obtained results and graphs, the speech produced by each codec remains of good, clearly audible quality, but the analytical results show that G.723.1 performs better than both CELP and G.729 Annex A, even though G.729 Annex A has a higher bit rate than G.723.1 and CELP. The G.729 Annex A coder has the lowest quality for the English male speaker and the Arabic female speaker; for the English female speaker, on the other hand, it performs slightly better than the CELP coder but worse than G.723.1.
This suggests that the changes introduced by the annex to reduce the algorithmic complexity of G.729 have affected the quality of the coder. It is generally observed that any increase in the complexity of an algorithm leads to an increase in delay. In future work, we may look for a method to reduce the complexity of G.729 Annex A and G.723.1 while maintaining speech quality.