PERCEPTUAL EVALUATION OF SPEECH QUALITY (PESQ) – A
NEW METHOD FOR SPEECH QUALITY ASSESSMENT OF
TELEPHONE NETWORKS AND CODECS

Antony W. Rix (1), John G. Beerends (2), Michael P. Hollier (1) and Andries P. Hekstra (2)
(1) PsyTechnics, B54/86 Adastral Park, Ipswich IP5 3RE, United Kingdom
(2) Royal PTT Nederland NV, NL-2260 Leidschendam, The Netherlands
E-mail: awr@iee.org
ABSTRACT

Previous objective speech quality assessment models, such as bark spectral distortion (BSD), the perceptual speech quality measure (PSQM), and measuring normalizing blocks (MNB), have been found to be suitable for assessing only a limited range of distortions. A new model has therefore been developed for use across a wider range of network conditions, including analogue connections, codecs, packet loss and variable delay. Known as perceptual evaluation of speech quality (PESQ), it is the result of integration of the perceptual analysis measurement system (PAMS) and PSQM99, an enhanced version of PSQM. PESQ is expected to become a new ITU-T recommendation P.862, replacing P.861 which specified PSQM and MNB.

1. INTRODUCTION

The motivation for using perceptual models to assess non-linear and error-prone audio communications systems is well-established and models have been proposed by many authors. Beerends and Stemerdink’s model, the perceptual speech quality measure (PSQM) [1], was adopted in 1996 as International Telecommunication Union (ITU-T) recommendation P.861 [2]. An alternative system based on measuring normalizing blocks (MNB) [3], proposed by Voran, was added in 1998 as an appendix to P.861. Another model by Beerends and Stemerdink, the perceptual audio quality measure (PAQM) [4], was combined with several different audio models to produce a method known as perceptual evaluation of audio quality (PEAQ), which became ITU-R recommendation BS.1387 in 1999 [5, 6].

Hollier’s extensions to the bark spectral distortion (BSD) model [7] led to the development of the perceptual analysis measurement system (PAMS) [8–11]. This was the first model in the literature to focus on end-to-end behaviour, including the effects of filtering and variable delay [10, 11].

These effects, along with certain types of coding distortion, packet loss and background noise, were found to cause earlier models – such as BSD, PSQM and MNB – to produce inaccurate scores [10–12]. A competition was therefore held by ITU-T study group 12 to select a new model with good performance across a very wide range of codecs and network conditions. The two algorithms with the highest performance in this competition, PAMS and PSQM99 (an updated and extended version of PSQM), were combined to produce a new model known as perceptual evaluation of speech quality (PESQ). This was selected in May 2000 as draft ITU-T recommendation P.862, and is expected to replace P.861 early in 2001 [12, 13].

The next section of this paper presents a description of the structure of PESQ and the key processes that it includes. This is followed by results from 38 known and 8 unknown subjective tests. The scope and limitations of PESQ are also discussed and conclusions are drawn.

2. DESCRIPTION OF PESQ

2.1 Overview

[Figure 1: Structure of perceptual evaluation of speech quality (PESQ) model. Blocks shown: reference signal and degraded signal (output of the system under test), each passing through level align, input filter, time align and equalise, and auditory transform; then disturbance processing, cognitive modelling, and prediction of perceived speech quality; a side path identifies bad intervals and re-aligns them.]

The structure of PESQ is shown in Figure 1. The model begins by level aligning both signals to a standard listening level. They are filtered (using an FFT) with an input filter to model a standard telephone handset. The signals are aligned in time and then processed through an auditory transform similar to that of PSQM. The transformation also involves equalising for linear filtering in the system and for gain variation. Two distortion parameters are extracted from the disturbance (the difference
between the transforms of the signals), and are aggregated in frequency and time and mapped to a prediction of subjective mean opinion score (MOS). Some details are discussed below.

2.2 Time alignment

The time alignment of PESQ assumes that the delay of the system is piecewise constant. This assumption appears to be valid for a wide range of systems, including packet-based transmission such as voice over IP (VoIP) [10, 11]. Delay changes are allowed in silent periods (where they will normally be inaudible) and in speech (where they are usually audible). The signals are aligned using the following steps [11].
•    Narrowband filter applied to both signals to emphasise perceptually important parts. These filtered signals are only used for time alignment.
•    Envelope-based delay estimation.
•    Division of reference signal into utterances.
•    Envelope-based delay estimation for each utterance.
•    Fine correlation histogram-based delay identification for each utterance.
•    Utterance splitting and re-alignment to test for delay changes during speech.
These give a delay estimate for each utterance, which is used to find the frame-by-frame delay for use in the auditory transform.

2.3 Auditory transform

The auditory transform in PESQ is a psychoacoustic model which maps the signals into a representation of perceived loudness in time and frequency. It includes the following stages.

Bark spectrum. An FFT with a Hamming window is used to calculate the instantaneous power spectrum in each frame, for 50% overlapping frames of 32 ms duration. This is grouped without smearing into 42 bins, equally spaced in perceptual frequency on a modified Bark scale similar to that of PSQM [2].

Frequency equalisation. The mean Bark spectrum for active speech frames is calculated. The ratio between the spectra of reference and degraded gives a transfer function estimate, assuming that the system under test has a constant frequency response. The reference is equalised to the degraded signal using this estimate, with bounds to limit the equalisation to ±20 dB.

Equalisation of gain variation. The ratio between the audible power of the reference and the degraded in each frame is used to identify gain variations. This is filtered with a first-order low-pass filter, and bounded, then the degraded signal is equalised to the reference.

Loudness mapping. The Bark spectrum is mapped to (Sone) loudness, including a frequency-dependent threshold and exponent. This gives the perceived loudness in each time-frequency cell.

2.4 Disturbance processing and cognitive modelling

The absolute difference between the degraded and the reference signals gives a measure of audible error. In PESQ, this is processed through several steps before a non-linear average over time and frequency is calculated.

Deletion. A deletion (a negative delay change) leaves a section which overlaps in the degraded signal. If the deletion is longer than half a frame, the overlapping sections are discarded.

Masking. Masking in each time-frequency cell is modelled using a simple threshold below which disturbances are inaudible; this is set to the lesser of the loudness of the reference and degraded signals, divided by four. The threshold is subtracted from the absolute loudness difference, and values less than zero are set to zero. Methods for applying masking over distances larger than one time-frequency cell were examined with earlier versions of PSQM and PSQM99, but did not improve overall performance [14], and were not used in PESQ.

Asymmetry. Unlike P.861 PSQM [2], PESQ computes two different error averages, one without and one with an asymmetry factor. The PESQ asymmetry factor is calculated from a stabilised ratio of the Bark spectral density of the degraded to the reference signals in each time-frequency cell. This is raised to the power 1.2 and is bounded with an upper limit of 12.0. Values smaller than 3.0 are set to zero. The asymmetric weighted disturbance, obtained by multiplying by this factor, thus measures only additive distortions.

2.5 Aggregation of disturbance in frequency and time

Following the understanding that localised errors dominate perception [9], PESQ integrates disturbance over several time-frequency scales using a method designed to take optimal account of the distribution of error in time and amplitude. The disturbance values are aggregated using an Lp norm, which calculates a non-linear average using the following formula:

$$L_p = \left( \frac{1}{N} \sum_{m=1}^{N} \mathrm{disturbance}[m]^{p} \right)^{1/p}$$

The disturbance is first summed across frequency using an Lp norm, giving a frame-by-frame measure of perceived distortion. This frame disturbance is multiplied by two weightings. The first weight is inversely proportional to the instantaneous energy of the reference, raised to the power 0.04, giving slightly greater emphasis on sections for which the reference is quieter. This process replaces the silent interval weighting used in P.861. After this, the frame disturbance is bounded with an upper limit of 45. The second weight gives reduced emphasis on the start of the signal if the total length is over 16 s, modelling the effect of short-term memory in subjective listening. This multiplies the frame disturbance at the start of the signal by a factor decreasing linearly from 1.0 (for files shorter than 16 seconds) to 0.5 (for files longer than 60 seconds).

After weighting, the frame disturbance is averaged in time over split second intervals of 20 frames (approx 320 ms, accounting for the overlap of frames) using Lp norms. These intervals overlap 50%, and no window function is used. The split second disturbance values are finally averaged over the length of the speech files, again using Lp norms. Thus the aggregation process uses three Lp norms – in general with different values of p – to map the disturbance to a single figure. The value of p is higher for averaging over the split second intervals to give greatest weight to localised distortions. The symmetric and asymmetric disturbance are averaged separately.
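
The three-stage Lp aggregation of section 2.5 can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the function names `lp_norm` and `aggregate` are hypothetical, the values of p are placeholders (the paper does not list them, saying only that p is higher for the split second stage), and the two frame weightings and the upper bound of 45 are omitted for brevity.

```python
def lp_norm(values, p):
    """Lp average of non-negative values: ((1/N) * sum(v**p)) ** (1/p)."""
    n = len(values)
    if n == 0:
        return 0.0
    return (sum(v ** p for v in values) / n) ** (1.0 / p)


def aggregate(cells, p_freq=2.0, p_interval=6.0, p_file=2.0, interval_len=20):
    """Map a time-frequency disturbance array to a single figure.

    cells: list of frames, each a list of per-band disturbance values.
    The p_* defaults are illustrative placeholders, not values from the paper.
    """
    # Stage 1: Lp norm across frequency gives one value per frame.
    frame_dist = [lp_norm(frame, p_freq) for frame in cells]

    # Stage 2: Lp norm over split second intervals of 20 frames that
    # overlap by 50% (hop of 10 frames), with no window function.
    hop = interval_len // 2
    last_start = max(len(frame_dist) - interval_len, 0)
    intervals = [
        lp_norm(frame_dist[start:start + interval_len], p_interval)
        for start in range(0, last_start + 1, hop)
    ]

    # Stage 3: Lp norm over the whole file.
    return lp_norm(intervals, p_file)
```

With these definitions, a single badly disturbed frame raises the aggregate far more when `p_interval` is large than when it is 1, which is the behaviour the text describes as giving greatest weight to localised distortions.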
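
The masking and asymmetry steps of section 2.4 can be sketched per time-frequency cell as follows. This is a minimal sketch, not the reference implementation: the function names are hypothetical, and the `stabiliser` constant added to both Bark spectral densities is an assumed placeholder, since the paper says only that the ratio is "stabilised".

```python
def masked_disturbance(loud_ref, loud_deg):
    """Symmetric disturbance in one cell after the simple masking threshold."""
    # Threshold: the lesser of the two loudness values, divided by four.
    threshold = min(loud_ref, loud_deg) / 4.0
    # Subtract from the absolute loudness difference; negatives become zero.
    return max(abs(loud_deg - loud_ref) - threshold, 0.0)


def asymmetry_factor(bark_ref, bark_deg, stabiliser=50.0):
    """Asymmetry factor from the ratio of degraded to reference Bark density.

    stabiliser is an assumed constant (not given in the paper) that keeps
    the ratio finite when the reference is close to zero.
    """
    ratio = (bark_deg + stabiliser) / (bark_ref + stabiliser)
    factor = ratio ** 1.2
    if factor < 3.0:
        return 0.0            # values smaller than 3.0 are set to zero
    return min(factor, 12.0)  # bounded with an upper limit of 12.0


def cell_disturbances(loud_ref, loud_deg, bark_ref, bark_deg):
    """Return (symmetric, asymmetric) disturbance for one cell."""
    sym = masked_disturbance(loud_ref, loud_deg)
    return sym, sym * asymmetry_factor(bark_ref, bark_deg)
```

Because the factor is zero unless the degraded spectrum clearly exceeds the reference, the asymmetric value responds only to components added by the system under test, matching the statement that it measures only additive distortions.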
2.6 Realignment of bad intervals

In certain cases the time alignment described in section 2.2 may fail to correctly identify a delay change, resulting in large errors for each section with incorrect delay. These are identified by labelling bad frames (which have a symmetric disturbance of more than 45) and joining together bad sections in which bad frames are separated by less than 5 good frames.

Each bad section is then realigned and the disturbance recalculated. Cross-correlation is used to find a new delay estimate. The auditory transform of the degraded signal is recalculated and the disturbance found. For each frame, if the realignment results in a lower disturbance value, the new value is used. Aggregation over split second intervals and the whole signal is performed after realignment.

2.7 MOS prediction and model calibration

To train PESQ a large number of different symmetric and asymmetric disturbance parameters were calculated by using multiple values of p for each of the three averaging stages. A linear combination of disturbance parameters was used as a predictor of subjective MOS. A further regression is required for each subjective test to account for context and voting preferences of different subjects, as discussed in section 3; for calibration a linear mapping was also used at this stage. Parameter selection was performed for all candidate sets of up to four disturbance parameters. The optimal combination – giving the highest average correlation coefficient – was found. This enabled the best parameters to be chosen from the full set of several hundred candidate disturbance parameters.

The use of partial compensation in PESQ, for example in equalising for gain modulation, avoids the need for using a large number of parameters to predict quality. A combination of only two parameters – one symmetric disturbance (d_SYM) and one asymmetric disturbance (d_ASYM) – gave a good balance between accuracy of prediction and ability to generalise. However, as this low-dimension model depends on earlier stages to incorporate complex perceptual effects, several design iterations were required. Coefficients in the auditory transform and disturbance processing were optimised then the optimal parameter combination was found, and the process repeated several times. Final training was performed on a database of 30 subjective tests, giving the following output mapping used in PESQ:

$$\mathrm{PESQMOS} = 4.5 - 0.1\, d_{\mathrm{SYM}} - 0.0309\, d_{\mathrm{ASYM}}$$

For normal subjective test material the values lie between 1.0 (bad) and 4.5 (no distortion). In extremely high distortion the PESQMOS may fall below 1.0, but this is very uncommon.

3. PERFORMANCE RESULTS

Following the methodology of the ITU-T competition, we used correlation coefficient and residual error distribution to quantify the performance of models at predicting subjective MOS. These metrics are calculated for each subjective test separately, after mapping the objective scores to the subjective scores for that test in a minimum squared error sense using monotonic 3rd-order polynomial regression. This mapping ensures that the comparison is made in the MOS domain whilst allowing for normal variations in subjective voting between tests.

Tests are grouped according to whether conditions were predominantly from mobile, fixed, voice over IP (VoIP) and multiple type networks. Tables 1 and 2 show correlation and residual error distribution for PESQ, PSQM and MNB [2] for 38 subjective tests that were available to the developers of PESQ. These included a wide range of simulated and real network measurements. Tables 3 and 4 present the results, for PESQ only, of an independent evaluation that was conducted after development was complete. All of this data relates to subjective listening tests carried out on the absolute category rating (ACR) listening quality (LQ) opinion scale. Test material consists of natural speech recordings of 8–12 s in duration, with four talkers (two male, two female) for each condition. The results are calculated per condition unless otherwise stated.

No. tests  Type              Corr. coeff.  PESQ   PSQM   MNB
19         Mobile network    average       0.962  0.924  0.883
                             worst-case    0.906  0.841  0.705
9          Fixed network     average       0.942  0.881  0.802
                             worst-case    0.902  0.657  0.596
10         VoIP/multi-type   average       0.921  0.679  0.694
                             worst-case    0.810  0.260  0.363

Table 1: Average and worst-case correlation coefficient for 38 subjective tests known during PESQ development, sub-divided by test type.

Absolute error range      <0.25  <0.5  <0.75  <1.0  <1.25
% errors in range, PESQ   74.7   93.9  99.2   99.9  100.0
% errors in range, PSQM   54.6   82.3  92.1   96.7  98.7
% errors in range, MNB    46.1   74.5  89.4   96.1  98.9

Table 2: Error distribution across all 38 known subjective tests.

Test  Type                                          Corr.
1     Mobile; real network measurements             0.979
2     Mobile; simulations                           0.943
3     Mobile; real networks, per file only          0.927
4     Fixed; simulations, 4–32 kbit/s codecs        0.992
5     Fixed; simulations, 4–32 kbit/s codecs        0.974
6     VoIP; simulations                             0.971
7     Multiple network types; simulations           0.881
8     VoIP frame erasure concealment; simulations   0.785

Table 3: Correlation coefficient, 8 unknown subjective tests (PESQ only).

Absolute error range      <0.25  <0.5  <0.75  <1.0   <1.25
% errors in range, PESQ   72.3   91.1  97.8   100.0  100.0

Table 4: Error distribution, 7 unknown subjective tests (PESQ only). Test 3 excluded as data was per-file only.

4. SCOPE AND APPLICATIONS

Using results such as those above, a range of applications and test conditions have been identified for which PESQ is believed to give accurate predictions of quality [13]. These include the following.

Codec and error distortions: waveform codecs (e.g. G.711, G.726), CELP/hybrid codecs at or above 4 kbit/s (e.g. G.728),
mobile codecs/systems including GSM FR, EFR, HR, AMR, CDMA EVRC, TDMA ACELP, VSELP, and TETRA; transcodings of various codecs; random, burst, and packet loss errors. PESQ can be used for applications such as codec and/or system evaluation, selection and optimisation.

Network behaviours: filtering e.g. due to analogue interfaces; time warping (variable delay) such as packet-based transmission in VoIP. This enables PESQ to be used in a wide range of end-to-end measurement applications with live and simulated networks. Background (environmental) noise, and noise processing, can be assessed by presenting PESQ with the clean, unprocessed original and the coded, noisy degraded signal.

One distortion type, replacement of speech by silence, causes all perceptual models difficulty in predicting MOS. Up to about 50 ms of front- and back-end clipping (due to voice activity detection) can have little to no subjective impact. However clipping during speech, e.g. packet loss concealment by silence, is often rated harshly by subjects – with a drop of over 1 MOS for 50 ms of clipping. PESQ scores in between these extremes: 50 ms clipping typically causes PESQMOS to fall by around 0.5 regardless of location. PESQ may thus correlate poorly with subjective MOS if this is a factor – such as test 8 in Table 3.

As a listening-only model with a fixed assumed listening level, PESQ should not be used to assess the effect of listening level, sidetone/talker echo, or conversational delay, and it is not intended for non-intrusive measurements. Certain other applications have not yet been fully characterised or may need parts of the model to be changed. These include: music quality; wideband telephony; the so-called “intermediate audio quality”; listener echo; very low bit-rate vocoders below 4 kbit/s; acoustic and head-and-torso simulator measurements.

In contrast, PSQM and MNB were only recommended for use in narrowband codec assessment [2], and were known to produce inaccurate predictions with certain types of codec, background noise, and end-to-end effects such as filtering and variable delay. The scope of PESQ is therefore very much wider than P.861.

5. CONCLUSION

The result of a major international collaboration, PESQ provides significantly higher correlation with subjective opinion than the models of P.861, PSQM and MNB. Results indicate that it gives accurate predictions of subjective quality in a very wide range of conditions, including those with background noise, analogue filtering, and/or variable delay. We believe that PESQ is suitable for many applications in assessing the speech quality of telephone networks and speech codecs.

6. ACKNOWLEDGEMENTS

Thanks are due to ITU-T study group 12 for organising and driving the recent competition, and in particular the other proponents (Ascom, Deutsche Telekom and Ericsson) who contributed valuable test data and provided stiff competition.

We would also like to thank the companies who acted as independent validation laboratories: AT&T, Lucent Technologies, Nortel Networks, and especially France Telecom R&D. We acknowledge the assistance of many of our colleagues at BT and KPN. Antony Rix is also supported by the Royal Commission for the Exhibition of 1851.

7. REFERENCES

[1] Beerends, J. G. and Stemerdink, J. A. “A perceptual speech-quality measure based on a psychoacoustic sound representation”. Journal of the Audio Engineering Society, 42 (3), 115–123, 1994.
[2] Objective quality measurement of telephone-band (300–3400 Hz) speech codecs. ITU-T Recommendation P.861, February 1998.
[3] Voran, S. “Objective estimation of perceived speech quality – part I: development of the measuring normalizing block technique”. IEEE Trans. Speech and Audio Processing, 7 (4), 371–382, July 1999.
[4] Beerends, J. G. and Stemerdink, J. A. “A perceptual audio quality measure based on a psychoacoustic sound representation”. Journal of the Audio Engineering Society, 40 (12), 963–974, 1992.
[5] Method for objective measurements of perceived audio quality. ITU-R Recommendation BS.1387, January 1999.
[6] Thiede, T., Treurniet, W. C., Bitto, R., Schmidmer, C., Sporer, T., Beerends, J. G., Colomes, C., Keyhl, M., Stoll, G., Brandenburg, K. and Feiten, B. “PEAQ–The ITU standard for objective measurement of perceived audio quality”. Journal of the Audio Engineering Society, 48 (1/2), 3–29, January/February 2000.
[7] Wang, S., Sekey, A. and Gersho, A. “An objective measure for predicting subjective quality of speech coders”. IEEE Journal on Selected Areas in Communications, 10 (5), 819–829, 1992.
[8] Hollier, M. P., Hawksford, M. O. and Guard, D. R. “Characterisation of communications systems using a speech-like test stimulus”. Journal of the Audio Engineering Society, 41 (12), 1008–1021, 1993.
[9] Hollier, M. P., Hawksford, M. O. and Guard, D. R. “Error activity and error entropy as a measure of psychoacoustic significance in the perceptual domain”. IEE Proc. Vision, Image and Signal Processing, 141 (3), 203–208, 1994.
[10] Rix, A. W., Reynolds, R. and Hollier, M. P. “Perceptual measurement of end-to-end speech quality over audio and packet-based networks”. 106th Audio Engineering Society Convention, pre-print no. 4873, May 1999.
[11] Rix, A. W. and Hollier, M. P. “The perceptual analysis measurement system for robust end-to-end speech quality assessment”. IEEE ICASSP, June 2000.
[12] Rix, A. W., Beerends, J. G., Hollier, M. P. and Hekstra, A. P. “PESQ – the new ITU standard for end-to-end speech quality assessment”. 109th Audio Engineering Society Convention, pre-print no. 5260, September 2000.
[13] Perceptual evaluation of speech quality (PESQ), an objective method for end-to-end speech quality assessment of narrowband telephone networks and speech codecs. ITU-T Draft Recommendation P.862, May 2000.
[14] Beerends, J. G. and Stemerdink, J. A. “The optimal time-frequency smearing and amplitude compression in measuring the quality of audio devices”. 94th Audio Engineering Society Convention, pre-print no. 3604, 1993.