Voice Recognition with Neural Networks,
Type-2 Fuzzy Logic and Genetic Algorithms
1. Apeksha Reddy, SDMCET, Dharwad. Email: apeksha.r.r@gmail.com
2. Ashrit Mangesh R, SDMCET, Dharwad. Email: mangeshashrit@gmail.com
Abstract: We describe in this paper the use of neural networks, fuzzy logic and genetic algorithms for voice recognition. In particular, we consider the case of speaker recognition by analyzing the sound signals with the help of intelligent techniques, such as neural networks and fuzzy systems. We use the neural networks for analyzing the sound signal of an unknown speaker, and after this first step, a set of type-2 fuzzy rules is used for decision making. We need to use fuzzy logic due to the uncertainty of the decision process. We also use genetic algorithms to optimize the architecture of the neural networks. We illustrate our approach with a sample of sound signals from real speakers in our institution.

Index Terms: Type-2 Fuzzy Logic, Neural Networks, Genetic Algorithms, Voice Recognition.
                       I. INTRODUCTION
Speaker recognition, which can be classified into identification and verification, is the process of automatically recognizing who is speaking on the basis of individual information included in speech waves. This technique makes it possible to use the speaker's voice to verify their identity and to control access to services such as voice dialling, banking by telephone, telephone shopping, database access services, information services, voice mail, security control for confidential information areas, and remote access to computers [10].

Fig. 1 shows the basic components of speaker identification and verification systems. Speaker identification is the process of determining which registered speaker provides a given utterance. Speaker verification, on the other hand, is the process of accepting or rejecting the identity claim of a speaker. Most applications in which a voice is used as the key to confirm the identity of a speaker are classified as speaker verification [11].

Speaker recognition methods can also be divided into text-dependent and text-independent methods. The former require the speaker to say key words or sentences having the same text for both training and recognition trials, whereas the latter do not rely on a specific text being spoken. Both text-dependent and text-independent methods share a problem, however: these systems can be deceived easily, because someone who plays back the recorded voice of a registered speaker saying the key words or sentences can be accepted as the registered speaker. To cope with this problem, there are methods in which a small set of words, such as digits, are used as key words, and each user is prompted to utter a given sequence of key words that is randomly chosen every time the system is used. Yet even this method is not completely reliable, since it can be deceived with advanced electronic recording equipment that can reproduce key words in a requested order. Therefore, a text-prompted speaker recognition method has recently been proposed.

Fig. 1. Basic structure of speaker recognition systems: (a) speaker identification; (b) speaker verification.

II. TRADITIONAL METHODS FOR SPEAKER RECOGNITION

Speaker identity is correlated with the physiological and behavioural characteristics of the speaker. These characteristics exist both in the spectral envelope (vocal tract characteristics) and in the supra-segmental features (voice source characteristics and dynamic features spanning several segments).
The most common short-term spectral measurements currently used are Linear Predictive Coding (LPC)-derived cepstral coefficients and their regression coefficients. A spectral envelope reconstructed from a truncated set of cepstral coefficients is much smoother than one reconstructed from LPC coefficients, so it provides a more stable representation from one repetition to another of a particular speaker's utterances. As for the regression coefficients, typically the first- and second-order coefficients are extracted at every frame period to represent the spectral dynamics. These coefficients are derivatives of the time functions of the cepstral coefficients and are respectively called the delta-cepstral and delta-delta-cepstral coefficients.
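To make the regression step concrete, the following is a minimal numpy sketch of the standard delta-coefficient formula; the window size N, the edge padding, and the random stand-in data are our assumptions rather than details given in the text:

```python
import numpy as np

def delta(features: np.ndarray, N: int = 2) -> np.ndarray:
    """Regression-based delta coefficients for a (frames x coeffs) matrix."""
    padded = np.pad(features, ((N, N), (0, 0)), mode="edge")  # repeat edge frames
    denom = 2 * sum(n * n for n in range(1, N + 1))
    T = len(features)
    # d[t] = sum_n n * (c[t+n] - c[t-n]) / (2 * sum_n n^2)
    return sum(n * (padded[N + n:T + N + n] - padded[N - n:T + N - n])
               for n in range(1, N + 1)) / denom

cepstra = np.random.randn(100, 12)  # stand-in for real cepstral frames
d1 = delta(cepstra)                 # delta-cepstral coefficients
d2 = delta(d1)                      # delta-delta-cepstral coefficients
```

Applying the same regression to the delta coefficients, as in the last line, yields the second-order (delta-delta) dynamics.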
A. Normalization Techniques

The most significant factor affecting automatic speaker recognition performance is variation in the signal characteristics from trial to trial (inter-session variability and variability over time). Variations arise from the speakers themselves, from differences in recording and transmission conditions, and from background noise. Speakers cannot repeat an utterance precisely the same way from trial to trial. It is well known that samples of the same utterance recorded in one session are much more highly correlated than samples recorded in separate sessions. There are also long-term changes in voices. It is important for speaker recognition systems to accommodate these variations. Two types of normalization techniques have been tried: one in the parameter domain, and the other in the distance/similarity domain.
B. Parameter-Domain Normalization

Spectral equalization, the so-called blind equalization method, is a typical normalization technique in the parameter domain that has been confirmed to be effective in reducing linear channel effects and long-term spectral variation. This method is especially effective for text-dependent speaker recognition applications that use sufficiently long utterances. Cepstral coefficients are averaged over the duration of an entire utterance, and the averaged values are subtracted from the cepstral coefficients of each frame. Additive variation in the log-spectral domain can be compensated for fairly well by this method. However, it unavoidably removes some text-dependent and speaker-specific features, so it is inappropriate for short utterances in speaker recognition applications.
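The averaging-and-subtraction step just described (commonly known as cepstral mean normalization) is small enough to sketch directly; the array layout is our assumption:

```python
import numpy as np

def cepstral_mean_normalization(cepstra: np.ndarray) -> np.ndarray:
    """Subtract the utterance-wide average from every frame's cepstrum.

    cepstra: (num_frames, num_coeffs) matrix of cepstral coefficients.
    Removes additive biases in the log-spectral domain, e.g. a linear channel.
    """
    mean = cepstra.mean(axis=0, keepdims=True)  # average over the whole utterance
    return cepstra - mean                       # per-frame normalized cepstra
```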
C. Distance/Similarity-Domain Normalization

A normalization method for distance (similarity, likelihood) values using a likelihood ratio has been proposed. The likelihood ratio is defined as the ratio of two conditional probabilities of the observed measurements of the utterance: the first probability is the likelihood of the acoustic data given the claimed identity of the speaker, and the second is the likelihood given that the speaker is an impostor. Likelihood-ratio normalization approximates optimal scoring in the Bayes sense.

A normalization method based on a posteriori probability has also been proposed. The difference between the normalization method based on the likelihood ratio and the method based on a posteriori probability is whether or not the claimed speaker is included in the speaker set for normalization; the speaker set used in the method based on the likelihood ratio does not include the claimed speaker, whereas the normalization term for the method based on a posteriori probability is calculated by using all the reference speakers, including the claimed speaker.

Experimental results indicate that the two normalization methods are almost equally effective. They both improve speaker separability and reduce the need for speaker-dependent or text-dependent thresholding, as compared with scoring using only a model of the claimed speaker.

A new method has recently been proposed in which the normalization term is approximated by the likelihood of a single mixture model representing the parameter distribution for all the reference speakers. An advantage of this method is that the computational cost of calculating the normalization term is very small, and it has been confirmed to give much better results than either of the above-mentioned normalization methods.
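As an illustration of likelihood-ratio scoring, here is a hedged sketch using Gaussian mixture models as the speaker and impostor (background) models; the mixture sizes, feature dimensionality and random stand-in data are our assumptions, not details from the text:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def likelihood_ratio_score(frames, claimed_gmm, background_gmm):
    """Log likelihood-ratio of an utterance: claimed speaker vs. impostors.

    frames: (num_frames, num_coeffs) feature matrix for the test utterance.
    Positive scores favour the claimed speaker; a threshold decides.
    """
    log_p_claimed = claimed_gmm.score_samples(frames).sum()
    log_p_impostor = background_gmm.score_samples(frames).sum()
    return log_p_claimed - log_p_impostor

# Hypothetical usage with random stand-ins for real cepstral features.
claimed = GaussianMixture(n_components=8).fit(np.random.randn(500, 12))
background = GaussianMixture(n_components=16).fit(np.random.randn(2000, 12))
score = likelihood_ratio_score(np.random.randn(80, 12), claimed, background)
accept = score > 0.0  # the threshold is application-dependent
```

Using a single background mixture for all reference speakers, as in the last paragraph above, makes the normalization term a single score_samples call, which is what keeps its computational cost small.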
D. Text-Dependent Speaker Recognition Methods

Text-dependent methods are usually based on template-matching techniques. In this approach, the input utterance is represented by a sequence of feature vectors, generally short-term spectral feature vectors. The time axes of the input utterance and of each reference template or reference model of the registered speakers are aligned using a dynamic time warping (DTW) algorithm, and the degree of similarity between them, accumulated from the beginning to the end of the utterance, is calculated (a minimal DTW sketch is given at the end of this subsection).

The hidden Markov model (HMM) can efficiently model statistical variation in spectral features. Therefore, HMM-based methods were introduced as extensions of the DTW-based methods, and have achieved significantly better recognition accuracies.
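The DTW alignment described above can be sketched as the classical dynamic-programming recurrence; the Euclidean local distance and the three-step transition rule are conventional choices, not details specified in the text:

```python
import numpy as np

def dtw_distance(x: np.ndarray, y: np.ndarray) -> float:
    """Accumulated frame-to-frame distance between two utterances.

    x: (n, d) and y: (m, d) sequences of short-term spectral feature vectors.
    Returns the DTW-aligned sum of distances; smaller means more similar.
    """
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(x[i - 1] - y[j - 1])  # local distance
            # Best of the three allowed warping steps.
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[n, m])

# The registered speaker whose reference template is closest is selected.
```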
E. Text-Independent Speaker Recognition Methods

One of the most successful text-independent recognition methods is based on vector quantization (VQ). In this method, VQ code-books consisting of a small number of representative feature vectors are used as an efficient means of characterizing speaker-specific features. A speaker-specific code-book is generated by clustering the training feature vectors of each speaker. In the recognition stage, an input utterance is vector-quantized using the code-book of each reference speaker, and the VQ distortion accumulated over the entire input utterance is used to make the recognition decision (see the sketch below).
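A sketch of this VQ approach, using scipy's k-means clustering to build a code-book and the mean quantization distortion as the matching score; the code-book size is our assumption:

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq

def train_codebook(training_frames: np.ndarray, size: int = 32) -> np.ndarray:
    """Cluster a speaker's training feature vectors into a small code-book."""
    codebook, _ = kmeans(training_frames.astype(float), size)
    return codebook

def vq_distortion(frames: np.ndarray, codebook: np.ndarray) -> float:
    """Average distortion of an utterance against one speaker's code-book."""
    _, distances = vq(frames.astype(float), codebook)  # nearest code vector per frame
    return float(distances.mean())

# Identification: the speaker whose code-book yields minimum distortion wins, e.g.
# speakers = {name: train_codebook(frames) for name, frames in training.items()}
# best = min(speakers, key=lambda s: vq_distortion(test_frames, speakers[s]))
```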
Temporal variation in speech signal parameters over the long term can be represented by stochastic Markovian transitions between states. Therefore, methods using an ergodic HMM, where all possible transitions between states are allowed, have been proposed. Speech segments are classified into one of the broad phonetic categories corresponding to the HMM states. After the classification, appropriate features are selected.

In the training phase, reference templates are generated and verification thresholds are computed for each phonetic category. In the verification phase, after the phonetic categorization, a comparison with the reference template for each particular category provides a verification score for that category. The final verification score is a weighted linear combination of the scores from each category.

This method was extended to the richer class of mixture autoregressive (AR) HMMs. In these models, the states are described as a linear combination (mixture) of AR sources. It can be shown that mixture models are equivalent to a larger HMM with simple states, with additional constraints on the possible transitions between states.

It has been shown that a continuous ergodic HMM method is far superior to a discrete ergodic HMM method, and that a continuous ergodic HMM method is as robust as a VQ-based method when enough training data is available. However, when little data is available, the VQ-based method is more robust than a continuous HMM method. A method using statistical dynamic features has recently been proposed. In this method, a multivariate auto-regression (MAR) model is applied to the time series of cepstral vectors and used to characterize speakers. It was reported that identification and verification rates were almost the same as those obtained by an HMM-based method.
F. Text-Prompted Speaker Recognition Method

In the text-prompted speaker recognition method, the recognition system prompts each user with a new key sentence every time the system is used, and it accepts the input utterance only when it decides that it was the registered speaker who repeated the prompted sentence. The sentence can be displayed as characters or spoken by a synthesized voice. Because the vocabulary is unlimited, prospective impostors cannot know in advance what sentence will be requested. Not only can this method accurately recognize speakers, but it can also reject utterances whose text differs from the prompted text, even if they are spoken by the registered speaker. A recorded voice can thus be correctly rejected.

This method is facilitated by using speaker-specific phoneme models as basic acoustic units. One of the major issues in applying this method is how to properly create these speaker-specific phoneme models from training utterances of a limited size. The phoneme models are represented by Gaussian-mixture continuous HMMs or tied-mixture HMMs, and they are made by adapting speaker-independent phoneme models to each speaker's voice. In order to properly adapt the models of phonemes that are not included in the training utterances, a new adaptation method based on tied-mixture HMMs was recently proposed.

In the recognition stage, the system concatenates the phoneme models of each registered speaker to create a sentence HMM, according to the prompted text. Then the likelihood of the input speech matching the sentence model is calculated and used for the speaker recognition decision. If the likelihood is high enough, the speaker is accepted as the claimed speaker.

Although many recent advances and successes in speaker recognition have been achieved, there are still many problems for which good solutions remain to be found. Most of these problems arise from variability, including speaker-generated variability and variability in channel and recording conditions. It is very important to investigate feature parameters that are stable over time, insensitive to variations in speaking manner (including speaking rate and level), and robust against variations in voice quality due to causes such as voice disguise or colds. It is also important to develop methods to cope with distortion due to telephone sets and channels, and with background and channel noise.

From the human-interface point of view, it is important to consider how the users should be prompted, and how recognition errors should be handled. Studies on ways to automatically extract the speech periods of each person separately from a dialogue involving more than two people have recently appeared as an extension of speaker recognition technology. This section was not intended to be a comprehensive review of speaker recognition technology; rather, it was intended to give an overview of recent advances and of the problems that must be solved in the future.
G. Speaker Verification

The speaker-specific characteristics of speech are due to differences in the physiological and behavioural aspects of the speech production system in humans. The main physiological aspect of the human speech production system is the vocal tract shape. The vocal tract modifies the spectral content of an acoustic wave as it passes through it, thereby producing speech. Hence, it is common in speaker verification systems to make use of features derived only from the vocal tract.

The acoustic wave is produced when the airflow from the lungs is carried by the trachea through the vocal folds. This source of excitation can be characterized as phonation, whispering, frication, compression, vibration, or a combination of these. Phonated excitation occurs when the airflow is modulated by the vocal folds. Whispered excitation is produced by airflow rushing through a small triangular opening between the arytenoid cartilages at the rear of the nearly closed vocal folds. Frication excitation is produced by constrictions in the vocal tract. Compression excitation results from releasing a completely closed and pressurized vocal tract. Vibration excitation is caused by air being forced through a closure other than the vocal folds, especially at the tongue. Speech produced by phonated excitation is called voiced, speech produced by phonated excitation plus frication is called mixed voiced, and speech produced by other types of excitation is called unvoiced.

Using cepstral analysis as described in the previous section, an utterance may be represented as a sequence of feature vectors. Utterances spoken by the same person but at different times result in similar, yet different, sequences of feature vectors. The purpose of voice modeling is to build a model that captures these variations in the extracted set of features.

There are two types of models that have been used extensively in speaker verification and speech recognition systems: stochastic models and template models. The stochastic model treats the speech production process as a parametric random process and assumes that the parameters of the underlying stochastic process can be estimated in a precise, well-defined manner. The template model attempts to model the speech production process in a non-parametric manner by retaining a number of sequences of feature vectors derived from multiple utterances of the same word by the same person. Template models dominated early work in speaker verification and speech recognition because the template model is intuitively more reasonable. However, recent work has demonstrated that stochastic models are more flexible and hence allow for better modelling of the speech production process. A very popular stochastic model for modelling the speech production process is the Hidden Markov Model (HMM). HMMs are extensions of conventional Markov models in which the observations are a probabilistic function of the state; i.e., the model is a doubly embedded stochastic process where the underlying stochastic process is not directly observable (it is hidden). The HMM can only be viewed through another set of stochastic processes that produce the sequence of observations.

The pattern matching process involves comparing a given set of input feature vectors against the speaker model for the claimed identity and computing a matching score. For the Hidden Markov models discussed above, the matching score is the probability that a given set of feature vectors was generated by a specific model; a minimal sketch of this computation is given below. We show in Figure 2 a schematic diagram of a typical speaker recognition system.
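To make the matching-score computation concrete, here is a minimal log-domain forward-algorithm sketch for the probability that an HMM generated an observation sequence; evaluating the per-frame emission log-probabilities (e.g. with per-state GMMs) is assumed to have been done elsewhere:

```python
import numpy as np
from scipy.special import logsumexp

def forward_log_likelihood(log_pi, log_A, log_B):
    """Log-probability that an HMM generated an observation sequence.

    log_pi: (S,) initial-state log-probabilities.
    log_A:  (S, S) transition log-probabilities.
    log_B:  (T, S) per-frame emission log-probabilities, already evaluated
            on the observed feature vectors (e.g. by per-state GMMs).
    """
    alpha = log_pi + log_B[0]  # forward variable at frame 0
    for t in range(1, len(log_B)):
        # alpha_j(t) = logsum_i [alpha_i(t-1) + log a_ij] + log b_j(o_t)
        alpha = logsumexp(alpha[:, None] + log_A, axis=0) + log_B[t]
    return float(logsumexp(alpha))

# The claimed speaker's model yields the matching score; for identification,
# the registered speaker whose model scores highest is selected.
```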
III. VOICE CAPTURING AND PROCESSING

The first step for achieving voice recognition is to capture the sound signal of the voice. We use a standard microphone for capturing the voice signal. After this, we use the sound recorder of the Windows operating system to record the sounds that belong to the database of voices of different persons. A fixed recording time is established to ensure homogeneity of the signals. We show in Figure 3 the sound signal recorder used in the experiments.

Fig. 3. Sound recorder used in the experiments.

After capturing the sound signals, these voice signals are digitized at a frequency of 8 kHz, and as a consequence we obtain a signal with 8008 sample points. This information is the one used for analyzing the voice.

We also used the Sound Forge 6.0 computer program for processing the sound signal. This program allows us to cancel noise in the signal, which may have come from environmental noise or from the sensitivity of the microphones. After using this computer program, we obtain a sound signal that is as pure as possible. The program can also use the fast Fourier transform for voice filtering. We show in Figure 4 the use of the computer program for a particular sound signal.

Fig. 4. Main window of the computer program for processing the signals.

We also show in Figure 5 the use of the Fast Fourier Transform (FFT) to obtain the spectral analysis of the word "way" in Spanish.

Fig. 5. Spectral analysis of a specific word using the FFT.
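The spectral analysis of Fig. 5 can be reproduced in outline with numpy's FFT; the Hann window and the stand-in signal are our assumptions, while the 8 kHz rate and the 8008-sample length follow the text:

```python
import numpy as np

def magnitude_spectrum(signal: np.ndarray, fs: int = 8000):
    """Single-sided FFT magnitude spectrum of a voice signal sampled at fs Hz."""
    spectrum = np.fft.rfft(signal * np.hanning(len(signal)))  # windowed FFT
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)          # bin frequencies in Hz
    return freqs, np.abs(spectrum)

signal = np.random.randn(8008)        # stand-in for a recorded word
freqs, mag = magnitude_spectrum(signal)
```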
IV. NEURAL NETWORKS FOR VOICE RECOGNITION

We used the sound signals of 20 words in Spanish as training data for a supervised feedforward neural network with one hidden layer. The training algorithm used was Resilient Backpropagation (trainrp), which has been used previously with good results (a code sketch of this setup is given at the end of this section). We show in Table I the results of the experiments with this type of neural network.

TABLE I. RESULTS OF FEEDFORWARD NEURAL NETWORKS FOR 20 WORDS IN SPANISH.

The results of Table I are for the Resilient Backpropagation training algorithm because this was the fastest learning algorithm found in all the experiments (it required only 7% of the total time of the experiments). The comparison of the time performance with the other training methods is shown in Figure 6.

Fig. 6. Comparison of the time performance of several training algorithms.

We now show in Table II a comparison of the recognition ability achieved with the different training algorithms for the supervised neural networks. We show average values over the experiments performed with all the training algorithms. We can appreciate from this table that the resilient backpropagation algorithm is also the most accurate method, with a 92% average recognition rate.

TABLE II. COMPARISON OF AVERAGE RECOGNITION OF FOUR TRAINING ALGORITHMS.

We describe below some simulation results of our approach for speaker recognition using neural networks. First, in Fig. 7 we have the sound signal of the word "example" with noise. Next, in Fig. 8 we have the identification of the word "example" without noise. We also show in Fig. 9 the word "layer" with noise. In Fig. 10, we show the identification of the correct word "layer" without noise.

Fig. 7. Input signal of the word "example" with noise.

Fig. 8. Identification of the word "example".

Fig. 9. Input signal of the word "layer" with noise added.

Fig. 10. Identification of the word "layer".

From Figures 7 to 10 it is clear that simple monolithic neural networks can be useful in voice recognition with a small number of words: even words with added noise can be identified, with at least a 92% recognition rate (for 20 words). Of course, for a larger set of words the recognition rate goes down and the computation time increases. For these reasons it is necessary to consider better methods for voice recognition.
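As an illustration of the setup described in this section, here is a compact numpy sketch of a one-hidden-layer feedforward network trained with a simplified resilient-backpropagation (Rprop) update; the feature dimensionality, the step-size constants and the random stand-in data are our assumptions, not values from the experiments:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in data: 20 words, each represented by a fixed-length feature vector
# (e.g. flattened spectral features from the FFT step).
X = rng.standard_normal((200, 64))   # 10 samples per word, 64 features
y = np.repeat(np.arange(20), 10)     # class label = word index
T = np.eye(20)[y]                    # one-hot targets

# One-hidden-layer network, as in the experiments described above.
W1 = rng.standard_normal((64, 50)) * 0.1
W2 = rng.standard_normal((50, 20)) * 0.1

def forward(X):
    H = np.tanh(X @ W1)
    O = 1.0 / (1.0 + np.exp(-(H @ W2)))  # sigmoid outputs read as memberships
    return H, O

# Simplified Rprop: per-weight step sizes adapted from the sign of the
# gradient only, which is what makes the method fast.
steps = [np.full_like(W1, 0.01), np.full_like(W2, 0.01)]
prev_grads = [np.zeros_like(W1), np.zeros_like(W2)]

for epoch in range(200):
    H, O = forward(X)
    dO = (O - T) * O * (1 - O)           # squared-error gradient at the output
    dH = (dO @ W2.T) * (1 - H**2)        # backpropagated to the hidden layer
    grads = [X.T @ dH, H.T @ dO]
    for W, g, pg, s in zip((W1, W2), grads, prev_grads, steps):
        sign = np.sign(g * pg)           # did the gradient keep its sign?
        s *= np.where(sign > 0, 1.2, np.where(sign < 0, 0.5, 1.0))
        np.clip(s, 1e-6, 1.0, out=s)
        W -= np.sign(g) * s              # move by step size, gradient sign only
        pg[...] = g

sse = np.sum((forward(X)[1] - T) ** 2)   # sum of squared errors, as reported
```

Constant factors in the gradient are absorbed by the adaptive step sizes, which is why Rprop only needs the gradient's sign.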
V. VOICE RECOGNITION WITH MODULAR NEURAL NETWORKS AND TYPE-2 FUZZY LOGIC
We can improve on the results obtained in the previous section by using modular neural networks, because modularity enables us to divide the problem of recognition into simpler sub-problems, which can be solved more easily. We also use type-2 fuzzy logic to model the uncertainty in the results given by the neural networks trained from the same training data. We describe in this section our modular neural network approach with the use of type-2 fuzzy logic in the integration of results.

We now show some examples to illustrate the hybrid approach. We use two modules with one neural network each in this modular architecture. Each module is trained with the same data, but the results are somewhat different due to the uncertainty involved in the learning process. In all cases, we use neural networks with one hidden layer of 50 nodes and "trainrp" as the learning algorithm. The difference in the results is then used to create an interval type-2 fuzzy set that represents the uncertainty in the classification of the word. The first example is of the word "example", which is shown in Fig. 11.

Fig. 11. Sound signal of the word "example".
Considering for now only 10 words in the training, we have that the first neural network gives the following results:

SSE = 4.17649e-005 (sum of squared errors)
Output = [0.0023, 0.0001, 0.0000, 0.0020, 0.0113, 0.0053, 0.0065, 0.9901, 0.0007, 0.0001]

The output can be interpreted as giving the membership values of the given sound signal to each of the 10 different words in the database. In this case, we can appreciate that the value of 0.9901 is the membership value to the word "example", which is very close to 1. But if we now train a second neural network with the same architecture, the results will be different, due to the different random initialization of the weights. We now give the results for the second neural network:

SSE = 0.0124899
Output = [0.0002, 0.0041, 0.0037, 0.0013, 0.0091, 0.0009, 0.0004, 0.9821, 0.0007, 0.0007]

We can note that the membership value to the word "example" is now 0.9821. With the two different membership values, we can define an interval [0.9821, 0.9901], which gives us the uncertainty in the membership of the input signal to the word "example" in the database. We have to use centroid defuzzification to obtain a single membership value (a numeric sketch follows Table III). If we now repeat the same procedure for the whole database, we obtain the results shown in Table III. In this table, we can see the results for a sample of 6 different words.

TABLE III. SUMMARY OF RESULTS FOR THE TWO MODULES (M1 AND M2) FOR A SET OF WORDS IN SPANISH.
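The interval construction and its defuzzification can be made concrete with the two output vectors quoted above; since the text does not spell out the type-reduction step, we take the centroid of each interval to be its midpoint:

```python
import numpy as np

# Outputs of the two modules for the same utterance: membership values for
# each of the 10 words (the vectors quoted above).
m1 = np.array([0.0023, 0.0001, 0.0000, 0.0020, 0.0113,
               0.0053, 0.0065, 0.9901, 0.0007, 0.0001])
m2 = np.array([0.0002, 0.0041, 0.0037, 0.0013, 0.0091,
               0.0009, 0.0004, 0.9821, 0.0007, 0.0007])

# Interval type-2 membership per word: [lower, upper] bounds of the footprint
# of uncertainty created by the disagreement between the modules.
lower = np.minimum(m1, m2)
upper = np.maximum(m1, m2)

# Centroid defuzzification of an interval reduces to its midpoint.
crisp = (lower + upper) / 2.0
word_index = int(np.argmax(crisp))   # index 7, i.e. the word "example"
```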
We now describe the complete modular neural network architecture (Fig. 12) for voice recognition, in which we now use three neural networks in each module. Also, each module only processes a part of the word, which is divided into three parts, one for each module.

Fig. 12. Complete modular neural network architecture for voice recognition.

We have also experimented with using a genetic algorithm for optimizing the number of layers and nodes of the neural networks of the modules, with very good results. The approach is similar to the one described in our previous work. We show in Fig. 13 an example of the use of a genetic algorithm for optimizing the number of layers and nodes of one of the neural networks in the modular architecture. In this figure we can appreciate the minimization of the fitness function, which takes into account two objectives: the sum of squared errors and the complexity of the neural network. A sketch of such an optimization loop is given below, after Fig. 13.

Fig. 13. Genetic algorithm showing the optimization of a neural network.
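A hedged sketch of such a genetic algorithm over (layers, nodes) genomes follows; the population size, selection scheme, mutation rate and the stand-in train_sse function are all our assumptions, with fitness = SSE plus a complexity penalty as described above:

```python
import numpy as np

rng = np.random.default_rng(1)

def train_sse(layers, nodes):
    # Placeholder: a real run would train a network with this architecture
    # and return its sum of squared errors on the training data.
    return 1.0 / (1.0 + 0.1 * layers * nodes) + 0.01 * rng.random()

def fitness(genome):
    """Fitness to minimize: training SSE plus a network-complexity penalty."""
    layers, nodes = genome
    return train_sse(layers, nodes) + 0.001 * layers * nodes

population = [(int(rng.integers(1, 4)), int(rng.integers(10, 101)))
              for _ in range(20)]

for generation in range(30):
    scored = sorted(population, key=fitness)
    parents = scored[:10]                     # truncation selection
    children = []
    for _ in range(10):
        a, b = rng.choice(len(parents), size=2, replace=False)
        layers = parents[a][0]                # one-point crossover
        nodes = parents[b][1]
        if rng.random() < 0.2:                # mutation of the node count
            nodes = int(np.clip(nodes + rng.integers(-10, 11), 10, 100))
        children.append((layers, nodes))
    population = parents + children

best = min(population, key=fitness)           # smallest SSE + complexity
```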
The same modular neural network approach was extended to the previous 20 words (mentioned in the previous section), and the recognition rate improved to 100%, which shows the advantage of modularity and of the utilization of type-2 fuzzy logic. We also have to say that the computation time was reduced slightly due to the use of modularity.
VI. CONCLUSIONS

We have described in this paper an intelligent approach to pattern recognition for the case of speaker identification. We first described the use of monolithic neural networks for voice recognition. We then described a modular neural network approach with type-2 fuzzy logic. We have shown examples of words for which a correct identification was achieved. We have performed tests with about 20 different words, which were spoken by three different speakers. The results are very good for the monolithic neural network approach, and excellent for the modular neural network approach. We have considered increasing the database of words, and with the modular approach we have been able to achieve about a 96% recognition rate on over 100 words. We still have to perform more tests with different words and levels of noise.

VII. ACKNOWLEDGEMENTS

1. We thank JNCE, Shivamogga for conducting Techzone 2k10 and giving us an opportunity to participate in it.
2. We thank the Principal, staff and management of our college, SDMCET, Dharwad, for continuously supporting us in completing this paper.
                          VIII. REFERENCES
[1] O. Castillo and P. Melin, "A New Approach for Plant Monitoring using
     Type-2 Fuzzy Logic and Fractal Theory", International Journal of
     General Systems, Taylor and Francis, Vol. 33, 2004, pp. 305-319.
[2] S. Furui, "Cepstral analysis technique for automatic speaker verification",
      IEEE Transactions on Acoustics, Speech and Signal Processing, 29(2),
      1981, pp. 254-272.
[3] S. Furui, "Research on individuality features in speech waves and
     automatic speaker recognition techniques", Speech Communication,
     5(2), 1986, pp. 183-197.
[4] S. Furui, "Speaker-independent isolated word recognition using dynamic
      features of the speech spectrum", IEEE Transactions on Acoustics,
      Speech and Signal Processing, 34(1), 1986, pp. 52-59.
[5] S. Furui, "Digital Speech Processing, Synthesis, and Recognition". Marcel
      Dekker, New York, 1989.
[6] S. Furui, "Speaker-dependent-feature extraction, recognition and
    processing techniques", Speech Communication, 10(5-6), 1991, pp. 505-
    520.
[7] S. Furui, "An overview of speaker recognition technology", Proceedings
      of the ESCA Workshop on Automatic Speaker Recognition,
      Identification and Verification, 1994, pp. 1-9.
[8] A. L. Higgins, L. Bahler, and J. Porter, "Speaker verification using
     randomized phrase prompting", Digital Signal Processing, Vol. 1, 1991,
     pp. 89-106.
[9] N. N. Karnik and J. M. Mendel, An Introduction to Type-2 Fuzzy Logic
     Systems, Technical Report, University of Southern California, 1998.
[10] T. Matsui, and S. Furui, "Concatenated phoneme models for text-variable
      speaker recognition", Proceedings of ICASSP'93, 1993, pp. 391-394.
[11] T. Matsui, and S. Furui, "Similarity normalization method for speaker
     verification based on a posteriori probability", Proceedings of the ESCA
     Workshop on Automatic Speaker Recognition, Identification and
     Verification, 1994, pp. 59-62.
[12] P. Melin, M. L. Acosta, and C. Felix, "Pattern Recognition Using Fuzzy
      Logic and Neural Networks", Proceedings of IC-AI'03, Las Vegas,
      USA, 2003, pp. 221-227.
[13] P. Melin and O. Castillo, "A New Method for Adaptive Control of Non-
      Linear Plants Using Type-2 Fuzzy Logic and Neural Networks",
      International Journal of General Systems, Taylor and Francis, Vol. 33, 2004.