0% found this document useful (0 votes)
140 views5 pages

Spoken Language Identification Using Hybrid Feature Extraction Methods

This document discusses using hybrid feature extraction methods for spoken language identification. It introduces two hybrid features: Bark Frequency Cepstral Coefficients (BFCC) and Revised Perceptual Linear Prediction Coefficients (RPLP), which combine MFCC and PLP features. The paper tests BFCC and RPLP, along with MFCC and PLP, on a database of speech samples from 10 languages using Vector Quantization with Dynamic Time Warping and Gaussian Mixture Models for classification. The results show the hybrid features, particularly RPLP with GMM, provide better language identification performance than the conventional MFCC and PLP features.

Uploaded by

Melissa Mulè
Copyright
© Attribution Non-Commercial (BY-NC)
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
140 views5 pages

Spoken Language Identification Using Hybrid Feature Extraction Methods

This document discusses using hybrid feature extraction methods for spoken language identification. It introduces two hybrid features: Bark Frequency Cepstral Coefficients (BFCC) and Revised Perceptual Linear Prediction Coefficients (RPLP), which combine MFCC and PLP features. The paper tests BFCC and RPLP, along with MFCC and PLP, on a database of speech samples from 10 languages using Vector Quantization with Dynamic Time Warping and Gaussian Mixture Models for classification. The results show the hybrid features, particularly RPLP with GMM, provide better language identification performance than the conventional MFCC and PLP features.

Uploaded by

Melissa Mulè
Copyright
© Attribution Non-Commercial (BY-NC)
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 5

JOURNAL OF TELECOMMUNICATIONS, VOLUME 1, ISSUE 2, MARCH 2010

11

Spoken Language Identification Using Hybrid


Feature Extraction Methods
Pawan Kumar, Astik Biswas, A .N. Mishra and Mahesh Chandra

Abstract—This paper introduces and motivates the use of hybrid robust feature extraction technique for spoken language identification (LID) sys
tem. The speech recognizers use a parametric form of a signal to get the most important distinguishable features of speech signal for recognition
task. In this paper Mel-frequency cepstral coefficients (MFCC), Perceptual linear prediction coefficients (PLP) along with two hybrid features are
used for language Identification. Two hybrid features, Bark Frequency Cepstral Coefficients (BFCC) and Revised Perceptual Linear Prediction
Coefficients (RPLP) were obtained from combination of MFCC and PLP. Two different classifiers, Vector Quantization (VQ) with Dynamic Time
Warping (DTW) and Gaussian Mixture Model (GMM) were used for classification. The experiment shows better identification rate using hybrid fea-
ture extraction techniques compared to conventional feature extraction methods.BFCC has shown better performance than MFCC with both clas-
sifiers. RPLP along with GMM has shown best identification performance among all feature extraction techniques.

Index Terms—Bark Frequency Cepstral Coefficient, Dynamic Time Warping, Language Identification, Gaussian Mixture Model, Mel Frequency
Cepstral Coefficient, Perceptual Linear Prediction, Revised Perceptual Linear Prediction, Vector Quantization

——————————  ——————————

1 INTRODUCTION

N ow Language Identification (LID) systems is an integral


part of telephone and speech input computer networks
which provide services in many languages. Automatic
A LID system has three major components, database prepara-
tion, feature extraction and classification as shown in Fig 1. A
prerequisite for the development and evaluation of automatic
language identification (language ID for short) is the problem speech recognition system is the availability of appropriate
of identifying the language being spoken from a sample of database. The recognition performance heavily depends on the
speech by a speaker. performance of the feature extraction block. Thus choice of
As with speech recognition, humans are the most accurate lan- features and its extraction from the speech signal should be
guage identification systems in the world today. Within such that it gives high identification performance with reason-
seconds of hearing speech, people are able to determine able amount of computation. There are two main methods
whether it is a language they know. If it is a language with used to parameterize the speech signal. Both PLP coefficients
which they are not familiar, they often can make subjective based on the linear prediction (LP) based technique [1, 2, 3]
judgments as to its similarity to a language they know, e.g., and MFCCs [4] based on discrete cosine transform (DCT) have
“sounds like Tamil”. A LID [1] system can be used to pre-sort some conceptual similarities of the speech signal processing.
the callers into the language they speak, so that the required But there are some differences, which can be important for
service will be provided in the language appropriate to the given conditions and for good identification performance.
talker. Examples of these LID services includes application like
travel information, automated dialogue system, spoken lan-
guage translation system, emergency assistance, language in-
terpretation, buying services International markets and tour-
ism add to the desirability of offering services in many lan-
guages. The languages of the world differ from one another
along many dimensions which have been codified as linguistic
Fig 1:- Block diagram of Language identification system
categories. These include phoneme inventory, phoneme se-
quences, syllable structure, prosodic, phonotactics, lexical
words and grammar. Therefore we hypothesize that an LID The long time goal of the work is to use two hybrid robust fea-
system which exploits each of these linguistic categories in ture extraction techniques which may give better identification
turn will have the necessary discriminative power to provide performance for noisy environment as well as clean environ-
good performance on short utterances. ment. Cepstral features were choosen because they yield high
identification accuracy, and are invariant to fixed linear spec-
————————————————
tral distortion from recording and transmission environment.
• Pawan Kumar is with the Department of ECE, Birla Institute of Technolo- Since speech production is usually modeled as a convolution
gy, Mesra, Ranchi, India.
• Astik Biswas is with the Department of ECE, Birla Institute of Technology, of the impulse response of the vocal tract filter with an excita-
Mesra, Ranchi, India. tion source, the cepstrum effectively de-convolves these two
• A. N. Mishra is with the Department of ECE, Birla Institute of Technolo- parts, resulting in a low time component corresponding to the
gy, Mesra, Ranchi, India.
• Dr. Mahesh Chandra is with Birla Institute of Techonology, Mesra, Ran- vocal tract system and a high time component corresponding
chi, India - 835215. to excitation source. Two hybrid robust feature extraction
techniques, revising PLP (RPLP)[5, 6] and Bark frequency cep-
stral coefficient (BFCC)[6] developed from basic parameteriza-
12

tion methods PLP and MFCC were used as feature extraction applied.
techniques. Vector Quantization (VQ) [7, 8, 9], Dynamic Time During normalization every sample value of the speech signal
Warping (DTW)[8, 10] and Gaussian Mixture Model is divided by the highest amplitude sample value. Mean of the
(GMM)[11, 12] were used for classifying the languages into speech signal is subtracted from the speech signal to remove
different classes. The Language Identification system was im- the DC offset and some of the disturbances induced by the
plemented with MATLAB 7.1. recording instruments. After pre-processing, language utter-
This paper is organized in five sections. Section1 introduces ances were divided into different frames of 25ms duration. The
the motivation of LID. Section2 gives the details of database second frame starts after 15ms of first frame and overlaps the
preparation. Different feature extraction techniques are ex- first frame by 10ms. The third frame starts after 15ms of second
plained in section3. Experimental setup & result are given in frame and overlaps the second frame by 10ms and this process
section4. Finally the conclusions are drawn in section5. continue till the end of all the frames of speech sample. Next
each frame was multiplied by a hamming window. After win-
2 DATABASE PREPARATION dowing first FFT and then Mel spaced filter banks are applied
to get the Mel-spectrum. The Mel scale is logarithmic scale
A database for three Indian Languages (Bengali, Hindi and resembling the way that the human ear perceives sound.
Telugu) has been prepared with 16 kHz sampling frequency
with 16 bits resolution. Each language consists of seven
different speakers and each speaker utterance was of one-
minute duration. All speakers of respective Languages uttered
same paragraph for one minute duration which were recorded
in a noise-free environment. The foreign Language samples
(Dutch, English, French, German, Italian, Russian, and
Spanish) has been downloaded from Internet [13] and
reformatted with 16 kHz sampling frequency & 16 bits
resolution. Thus there were total ten Languages with seven
different speakers for each language. So we have a total of 70
speech utterances. The duration of speech utterance of all
languages ranges from 35 sec to 70 sec. Goldwave 5.10 & Cool
Edit 96 software were used for database preparation and re-
sampling.

3 FEATURE EXTRACTION
The raw speech signal is complex and may not be suitable for
feeding as input to the automatic language identification
system; hence the need for a good front-end arises. The task of
this front-end is to extract all relevant acoustic information in a
compact form compatible with the acoustic models. In other
words, the preprocessing should remove all non-relevant
information such as background noise and characteristics of
the recording device, and encode the remaining (relevant)
information in a compact set of features that can be given as
input to the classifier. Features can be defined as a minimal
unit, which distinguishes maximally close classes. The entire
scheme for feature extraction using PLP, MFCC, BFCC and
RPLP techniques are shown in Fig 2.

3.1 Mel Frequency Cepstral Coefficient


Pre-emphasis filtering, normalization and mean subtraction
are the three steps in pre-processing. The digitized speech is
pre-emphasized using a digital filter with a transfer function
. Pre-emphasis filter [3] spectrally flat-
tens the signal and makes it less susceptible to finite precision
effects later in the signal processing. Due to possible mis-
match between training and test conditions, it is considered
Fig 2: Feature extraction process for different methods
good practice to reduce the amount of variation in the data
that does not carry important speech information as much as
possible. For instance, differences in loudness between re-
cordings are irrelevant for recognition. For reduction of such The filter bank is composed with 24 triangular filters that are
irrelevant sources of variation, normalization transforms are equally spaced on a log scale. The Mel-scale is represented by
13

the following:
In the second approach instead of using bark filter bank, Mel
filter bank has been applied to compute RPLP. The signal is pre-
……………. (1) emphasized before the segmentation and FFT spectrum is
processed by Mel scale filter bank. The resulting spectrum is
The natural logarithm is taken to transform into the cepstral converted to the cepstral coefficients using LP analysis with
domain and the discrete cosine transform (DCT) is finally prediction order of 13 followed by cepstral analysis.
applied to get the 24 MFCCs. The component due to the
periodic excitation source may be removed from the signal by
simply discarding the higher order coefficients. DCT de- 4 EXPERIMENTAL SETUP & RESULTS
correlates the features and arranges them in descending order
of information, they contain about speech signal. Hence 13 The whole work was carried in two different ways. In first
coefficients out of 24 coefficients are used as MFCC features in phase, the work was carried out with all ten language database
our case. MFCC features are more compact since the same and VQ was used for creating the language model and DTW
information can be contained in fewer coefficients. was used as classifier to classify languages into different classes.
In second phase, the work was carried out with ten language
3.2 Perceptual Linear Prediction database and Gaussian Mixture Model was used to generate the
An approach for linear prediction completely based on percep- language models and GMM was used as classifier to classify
tual criteria is the Perceptual Linear Prediction (PLP) [5]. This languages into different classes.
model includes the following perceptually motivated analyses:
1. Critical-band spectral resolution. The spectrum of the orig- 4.1 Training and Testing (VQ + DTW)
inal signal is warped into the Bark frequency scale, where a Here VQ was used for training the language model and DTW
critical-band masking curve is convolved to the signal. For the was used as classifier to classify languages into different classes.
case of PLP, trapezoidal shaped filters are applied at roughly one The experimental set-up is shown in fig 3. By using MFCC,
bark intervals, where the bark axis is derived from the frequency BFCC, PLP and RPLP techniques, features were extracted for
axis by using a warping function from Schroeder is given in each utterance of all 70 speakers of all languages.
equation number 2: The sub frames were of 25 ms with 10 ms overlapping. For all
kind of feature extraction techniques thirteen features are calcu-
lated for each frame.
…………. (2)

2. Equal-loudness pre-emphasis. The signal is pre-


emphasized by a simulated equal-loudness curve to match the
frequency magnitude response of the ear.
3. Intensity-loudness power law. The signal amplitude is
compressed by the cubic-root to match the nonlinear relation
between intensity of sound and perceived loudness.
After these operations, all signal components are perceptual-
ly equally weighted and we can, from the modified signal, make
a regular Linear prediction (LP) model.

3.3 Hybrid Features


In this experiment two main blocks as shown in fig.2 were inter-
changed to develop two hybrid feature extraction techniques.
The interest is to see the influence of the spectral processing on
the different cepstral transformation. The fig.2 shows the steps of
parameterization for the basic method and besides PLP and
MFCC the way of computing the hybrid techniques has been
shown by dashed arrow in fig 2.

Bark Frequency Cepstral Coefficient (BFCC)


Fig 3: VQ codebook generation for all ten Languages
BFCC is the process where we combine PLP processing of the
spectra and cosine transform to get the cepstral coefficients.
All feature vectors of all frames of an utterance were coded
Instead of using Mel filter bank, Bark filter bank has been ap-
into single feature vector using VQ. In this way 70 feature vec-
plied and equal loudness pre-emphasis with intensity to loud-
tors were prepared for all speakers of all languages. These fea-
ness power law has been applied to the MFCC like features.
ture vectors were stored for further use during classification.
Only first thirteen cepstral features of each windowed frame of
Further the seven feature vectors of all seven speakers of each
speech utterances were taken.
language were coded into a single feature vector using VQ.
Finally a total of 10 feature vectors were received, one feature
vector corresponding to one language.
Revised Perceptual Linear Prediction (RPLP)
14

During testing phase, languages were classified into their re- one language model with 2, 4, 8, and 16 component densities.
spective classes by measuring similarity of each feature vector Finally total of 60 speech utterances were used for training and
(of all 70 stored feature vectors) with the finally received ten 10 speech utterances were used for testing. By using MFCC,
feature vectors of ten languages. DTW was used for calculation BFCC, PLP and RPLP techniques, features were extracted for
to find similarity between two sequences which may vary in all 60 utterances of all languages. In order to have more tem-
time. Comparative results for all the feature extraction tech- poral information, the duration of each sample was divided
niques MFCC, BFCC, PLP and RPLP with VQ & DTW are into number of sub-frames. The sub frames were of 25 ms with
shown in fig 4. 10 ms overlapping. For all kind of feature extraction tech-
niques thirteen features are calculated for each frame.
All feature vectors of all frames of all utterances were stored
for further use in training. Then for all languages (i.e. feature
vectors of first six utterance of that particular language) were
used to create the corresponding language model with 2, 4, 8,
and 16 component densities. In this manner all language mod-
els were created with 2, 4, 8, and 16 component densities [11].
The testing phase was divided into three different sub phase.
At first testing was carried out for two second test utterances,
then testing was carried out for four second test utterances and
at last testing was carried out for ten second test utter-
ances.The languages were classified into their respective
classes on the basis of maximum log-likelihood [11, 12] w.r.t
each language model The objective is to find the language
model which has the maximum posteriori probability [14] for a
given observation sequence. The whole experimental setup is
Fig 4: Language identification results with MFCC, BFCC, PLP and shown in fig 5 and fig 6 respectively.
RPLP with VQ and DTW Comparative results for all the feature extraction techniques
MFCC, BFCC, PLP and RPLP with GMM are shown in table 1.
From the result it can be found that the identification perfor-
mance of BFCC is 0.6% better than MFCC features for LID. The
identification performance with PLP is further improved by
2.1% over BFCC. RPLP has shown the best identification per-
formance among all the feature extraction techniques.

4.1 Training and Testing (GMM+GMM)


During this phase, work was carried out by taking duration of
thirty second of speech for each sample of each language. Here
database was divided into two parts. First six utterances of Fig 6: Testing Phase
each language were used for training purpose and last utter- Table1: Language identification performance (%) of
ance was used for testing phase. During training phase, total of MFCC, BFCC, PLP and RPLP features with GMM
(6*30 = 180) seconds of speech per language was used to create

Fig 5: Mixture model generation for all ten Languages


15

Language identification results reveal that performace sion, Image and Signal Processing, IEE Proceedings, Oct 1995.
[9] Qu Dan, Wang Bingxi, Wel Xin,” Language Identification Using Vec-
increases with increament in length of test utterances. For each
tor Quantization.” Proc.IEEE Int. Conf .Speech Processing., June.2002,
model order, the identification performance for 2, 4, and 10 pp. 492-495.
second test utterance lengths are shown in table1. BFCC has [10] W. Abdulla, D. Chow, and G. Sin, “Cross-words reference template
shown better identification performance compared to MFCC. for DTW-based speech recognition systems”, Proc. IEEE TENCON,
Bangalore, India, 2003.
Further the performance with PLP has increased compared to [11] Reynolds, D. A. and Rose, R. C. "Robust text-independent speaker
BFCC. In this case also RPLP has shown best identification identification using Gaussian mixture speaker models”. IEEE Trans.
performance among all. It also has been seen that in most of Speech Audio Process. 3, 1995, pp. 72–83.
[12] T. K. Moon, "The Expectation-Maximization Algorithm,” IEEE Signal
the cases performance peaks at 8 mixture components. Processing Magazine, 1996, pp. 47-59.
[13] http://www.cmdnyc.com.
5 CONCLUSION [14] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood
from incomplete data via the EM algorithm," Journal of the Royal
Statistical Society, Series B, vol. 39, 1977, pp. 1-38,.
The Language Identification system provides satisfactory
results by using four different types of features, MFCC, BFCC,
Mr. Pawan Kumar has received his M. Sc in 2005 from
PLP and RPLP with two classifiers, VQ along with DTW and Ranchi University, Ranchi, India. Presently he is pur-
GMM. BFCC has shown better identification performance suing Ph.D. from Birla Institute of Technology, Mesra
(Jharkhand)-India in the field of speech Recognition. His
compared to MFCC because it is more invariant to fixed areas of interest are Speech and Signal Processing.
spectral distortion and channel noise compared to MFCC. PLP
features have shown further improvement in identification
performance. This is due to the fact that PLP is combination of Mr. Astik Biswas has received his B.Tech in 2008 from
both MFCC and LP based features. PLP features performed West Bengal University of Technology, Kolkata, India.
Presently he is pursuing ME from Birla Institute of Tech-
better because the signal was pre-emphasized by a simulated nology, Mesra (Jharkhand)-India in the field of speech
equal-loudness curve to match the frequency magnitude Recognition. His areas of interest are Speech and Signal
Processing, Digital Electronics.
response of the ear as well as all signal components were
perceptually equally weighted. RPLP features performed best
among all feature extraction techniques. This is due to the fact Mr. A. N. Mishra has received his B.Tech from Gulbarga
University, Gulbarga- India in 2000 and M.Tech from Uttar
that it takes advantage of preemphasis filter, Mel scale filter Pradesh Technical University, Lucknow (UP)-India in 2006.
bank along with Linear Prediction and cepstral analysis. Presently he pursuing Ph.D. from Birla Institute of Technolo-
gy, Mesra (Jharkhand)-India. He has worked as lecturer in the
All feature extractiontechniques performed better with GMM Department of Electronics & Communication Engg. at BBIET,
as compared to VQ and DTW because gaussian mixture Bulandshahr (UP) from Nov 2000 to July 2003. He has
worked as Lecturer in the Department of Electronics & Communication Engg. at
language model falls into the implicit segmentation approach
GLAITM ,Mathura (UP) from Oct 2003 to Aug 2005. He has worked as Reader in
to language identification. It also provides a probabilistic the Department of Electronics & Communication Engg. at BBIET ,Bulandshar
model of the underlying sounds of a person’s voice. (UP) from Aug 2005 to Oct 2007. He has worked as Reader in the Department of
Electronics & Communication Engg. at RKGIT ,Ghaziabad (UP) from Oct 2007 to
Jan 2009. Since Jan 2009, he is working as Reader and HOD in the Electronics
REFERENCES & Communication Engg. Department, GGIT, Greater Noida, (UP)-India. He has
published more than 4 research papers in the area of Speech, Signal and Image
[1] William Campbell, Terry Gleason “Advanced Language Recognition Processing at National/International level. His areas of interest are Speech, Signal
using Cepstral and Phonotactics: MITLL System Performance on the and Image Processing.
NIST 2005 Language Recognition Evaluation”. In Proc. IEE Odyssey,
2006. Dr. Mahesh Chandra received B.Sc. from Agra University,
Agra(U.P.)-India in 1990 and A.M.I.E. from I.E.I., Kolkat-
[2] J. Makhoul, “Linear prediction: A tutorial review,” Proc. of IEEE, vol.
ta(W.B.)-India in winter 1994. He received M.Tech. from
63, no. 4, pp. 561-580, 1975. J.N.T.U., Hyderabad-India in 2000 and Ph.D. from AMU, Ali-
[3] L. R. Rabiner and B. H. Juang, Fundamental of Speech Recognition, garh (U.P.)-India in 2008. He has worked as Reader & HOD in
1st ed., Pearson Education, Delhi. the Department of Electronics & Communication Engg. at
S.R.M.S. College of Engineering and Technology, Bareilly
[4] S. B. Davis and P. Mermelstein, “Comparison of parametric represen-
(U.P.)-India from Jan 2000 to June 2005. Since July 2005, he is working as
tations for monosyllabic word recognition in continuously spoken Reader in the Electronics & Communication Engg. Department, B.I.T., Mesra,
sentences,” IEEE Transactions on acoustics, speech and signal Process., Ranchi (Jharkhand)-India. He is a Life Member of ISTE, New Delhi-India and
ASSP, vol. 28, no. 4, pp. 357–366, 1980. Member of IEI Kolkata (W.B.)-India. He has published more than 23 research
[5] H. Hermansky. “Perceptual linear predictive (plp) analysis of papers in the area of Speech, Signal and Image Processing at Nation-
speech.” Journal of Acoust. Soc. Am., 87(4), 1990, pp.1738–1752. al/International level. He is currently guiding four Ph.D. students in the area of
Speech, Signal and Image Processing. His areas of interest are Speech, Signal
[6] Josef RAJNOHA, Petr POLLAK,” Modified Feature Extraction Me-
and Image Processing.
thods in Robust Speech Recognition.” Radioelektronika, 17th IEEE In-
ternational Conference, PP.1-4, 2007.
[7] Y. Linde, A. Buzo and R.M. Gray, “An Algorithm for Vector Quantiz-
er Design,” IEEE Transactions on Communication, vol. 28, pp. 84-95,
1980.
[8] Yu, K., Mason, J., Oglesby, J., “Speaker recognition using Hidden
Markov Models, dynamic time warping and vector quantization” Vi-

You might also like