Labman 2
Lab session 2 :
               Introduction to Hidden Markov Models
                                                       Think !
   If you get lost with some of the questions or some of the explanations, DO ASK the assistants or
the teacher for help : they are here to make the course understood. There is no such thing as a stupid
question, and the only obstacle to knowledge is laziness.
Contents
1 Preamble
2 Generating samples from Hidden Markov Models
3 Pattern recognition with HMMs
4 Optimal state sequence
5 Training of HMMs
1    Preamble
Useful formulas and definitions :
    - a Markov chain or process is a sequence of events, usually called states, the probability of each of
      which is dependent only on the event immediately preceding it.
    - a Hidden Markov Model (HMM) represents stochastic sequences as Markov chains where the states
      are not directly observed, but are associated with a probability density function (pdf). The gener-
      ation of a random sequence is then the result of a random walk in the chain (i.e. the browsing of
      a random sequence of states Q = {q1 , · · · qK }) and of a draw (called an emission) at each visit of a
      state.
      The sequence of states, which is the quantity of interest in speech recognition and in most of the
      other pattern recognition problems, can be observed only through the stochastic processes defined
      into each state (i.e. you must know the parameters of the pdfs of each state before being able to
      associate a sequence of states Q = {q1 , · · · qK } to a sequence of observations X = {x1 , · · · xK }).
      The true sequence of states is therefore hidden by a first layer of stochastic processes.
      HMMs are dynamic models, in the sense that they are specifically designed to account for some
      macroscopic structure of the random sequences. In the previous lab, concerned with Gaussian
      Statistics and Statistical Pattern Recognition, random sequences of observations were considered
      as the result of a series of independent draws in one or several Gaussian densities. To this simple
      statistical modeling scheme, HMMs add the specification of some statistical dependence between
      the (Gaussian) densities from which the observations are drawn.
    - HMM terminology :
        – the emission probabilities are the pdfs that characterize each state qi , i.e. p(x|qi ). To simplify
          the notations, they will be denoted bi (x). For practical reasons, they are usually Gaussian or
          combinations of Gaussians, but the states could be parameterized in terms of any other kind
          of pdf (including discrete probabilities and artificial neural networks).
        – the transition probabilities give the probability to go from a state i to a state j, i.e. P (qj |qi ).
          They are stored in matrices where each term aij denotes a probability P (qj |qi ).
        – non-emitting initial and final states : if a random sequence X = {x1 , · · · xK } has a finite
          length K, the fact that the sequence begins or ends has to be modeled as two additional
          discrete events. In HMMs, this corresponds to the addition of two non-emitting states, the
          initial state and the final state. Since their role is just to model the “start” or “end” events,
          they are not associated with some emission probabilities.
          The transitions starting from the initial state correspond to the modeling of an initial state dis-
          tribution P (qj |I), which indicates the probability to start the state sequence with the emitting
          state qj .
          The final state usually has only one non-null transition, which loops onto itself with a probability
          of 1 (it is an absorbing state), so that the state sequence gets “trapped” in it when it is
          reached.
                                                      1
                                                                                               1 PREAMBLE
        – ergodic versus left-right HMMs : a HMM allowing for transitions from any emitting state to
          any other emitting state is called an ergodic HMM. Alternatively, an HMM where the transitions
          only go from one state to itself or to a unique follower is called a left-right HMM.
The following 2-dimensional Gaussian densities will be used to model simulated vowel observations, where
the considered features are the two first formants :
                                                                                        
         Density N/a/ :    µ/a/ = [ 730, 1090]T      Σ/a/ = [ 1625  5300 ;  5300 53300]
         Density N/e/ :    µ/e/ = [ 530, 1840]T      Σ/e/ = [15025  7750 ;  7750 36725]
         Density N/i/ :    µ/i/ = [ 270, 2290]T      Σ/i/ = [ 2525  1200 ;  1200 36125]
         Density N/o/ :    µ/o/ = [ 570,  840]T      Σ/o/ = [ 2000  3600 ;  3600 20000]
         Density N/y/ :    µ/y/ = [ 440, 1020]T      Σ/y/ = [ 8000  8400 ;  8400 18500]
(Those densities have been used in the previous lab session.) They will be combined into Markov Models
that will be used to model some observation sequences. The resulting HMMs are described in table 1.
    The parameters of the densities and of the Markov models are stored in the file data.mat. A Markov
model named, e.g., hmm1 is stored as an object with fields hmm1.means, hmm1.vars and hmm1.trans, and
corresponds to the model HMM1 of table 1. The means field contains a list of mean vectors; the vars
field contains a list of covariance matrices; the trans field contains the transition matrix. E.g., to access
the mean of the 3rd state of hmm1, use :
>> hmm1.means{3}
The initial and final states are characterized by an empty mean and variance value.
HMM3 :
 • state 1 : initial state
 • state 2 : Gaussian N/a/
 • state 3 : Gaussian N/i/
 • state 4 : Gaussian N/y/
 • state 5 : final state

   Transition matrix :
                       0.0  1.0  0.0  0.0  0.0
                       0.0  0.5  0.5  0.0  0.0
                       0.0  0.0  0.5  0.5  0.0
                       0.0  0.0  0.0  0.5  0.5
                       0.0  0.0  0.0  0.0  1.0

   (Figure : left-right topology I → N/a/ → N/i/ → N/y/ → F, with a self-loop of probability 0.5
   on each emitting state.)
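The “random walk in the chain plus a draw at each visited state” generation described in the preamble can be sketched in a few lines. This is a hypothetical Python stand-in for the lab's MATLAB function genhmm (gen_hmm, trans, means, covs are illustration names), shown here with HMM3's parameters :

```python
import numpy as np

def gen_hmm(trans, means, covs, rng=None):
    """Random walk in the chain + Gaussian draw at each emitting state.

    trans : (N, N) transition matrix; states 0 and N-1 (0-based) are the
            non-emitting initial and final states (no means/covs entries).
    """
    rng = np.random.default_rng(rng)
    N = trans.shape[0]
    X, states = [], []
    q = 0                                   # start in the initial state
    while True:
        q = rng.choice(N, p=trans[q])       # transition draw
        if q == N - 1:                      # absorbed by the final state
            break
        X.append(rng.multivariate_normal(means[q], covs[q]))  # emission
        states.append(q)
    return np.array(X), np.array(states)

# HMM3: left-right, I -> N/a/ -> N/i/ -> N/y/ -> F, self-loops of 0.5
trans = np.array([[0.0, 1.0, 0.0, 0.0, 0.0],
                  [0.0, 0.5, 0.5, 0.0, 0.0],
                  [0.0, 0.0, 0.5, 0.5, 0.0],
                  [0.0, 0.0, 0.0, 0.5, 0.5],
                  [0.0, 0.0, 0.0, 0.0, 1.0]])
means = {1: [730, 1090], 2: [270, 2290], 3: [440, 1020]}
covs  = {1: [[1625, 5300], [5300, 53300]],
         2: [[2525, 1200], [1200, 36125]],
         3: [[8000, 8400], [8400, 18500]]}
X, st = gen_hmm(trans, means, covs, rng=0)
```

Because the chain is left-right, the drawn state sequence can only stay in place or move one state forward, and the sequence length varies from draw to draw.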
2 Generating samples from Hidden Markov Models

Generate a sample X from each of the Hidden Markov Models HMM1, HMM2 and HMM4. Use the
function genhmm (>> help genhmm) to do several draws with each of these models. View the resulting
samples and state sequences with the help of the functions plotseq and plotseq2.
Example :
Do a draw :
>> [X,stateSeq] = genhmm(hmm1);
Use the functions plotseq and plotseq2 to picture the obtained 2-dimensional data. In the resulting
views, the obtained sequences are represented by a yellow line where each point is overlaid with a colored
dot. The different colors indicate the state from which any particular point has been drawn.
>> figure; plotseq(X,stateSeq); % View of both dimensions as separate sequences
This view highlights the notion of sequence of states associated with a sequence of sample points.
>> figure; plotseq2(X,stateSeq,hmm1); % 2D view of the resulting sequence,
                                                % with the location of the Gaussian states
This view highlights the spatial distribution of the sample points.
Draw several new samples with the same parameters and visualize them :
>> clf; [X,stateSeq] = genhmm(hmm1); plotseq(X,stateSeq);
(To be repeated several times.)
Repeat with another model :
>> [X,stateSeq] = genhmm(hmm2);plotseq(X,stateSeq);
and re-iterate the experiment. Also re-iterate with model HMM4.
Questions :
    2. What is the effect of the different transition matrices on the sequences obtained during the current
       experiment ? Hence, what is the role of the transition probabilities in the Markovian modeling
       framework ?
    3. What is the role of the final state ? What would the generated sequences look like if the models
       did not include one ?

    4. In the case of HMMs with plain Gaussian emission probabilities, what quantities should be present
       in the complete parameter set Θ that specifies a particular model ?
      If the model is ergodic with N states (including the initial and final), and represents data of
      dimension D, what is the total number of parameters in Θ ?
5. Which type of HMM (ergodic or left-right) would you use to model words ?
Answers :
2. The transition matrix of HMM1 indicates that the probability of staying in a particular state is close
   to the probability of transiting to another state. Hence, it allows for frequent jumps from one state
   to any other state. The observation variable therefore frequently jumps from one “phoneme” to any
   other, forming sharply changing sequences like /a,i,a,y,y,i,a,y,y,· · · /.
       Alternatively, the transition matrix of HMM2 specifies high probabilities of staying in a particular
   state. Hence, it allows for more “stable” sequences, like /a,a,a,y,y,y,i,i,i,i,i,y,y,· · · /.
       Finally, the transition matrix of HMM4 also governs the order in which the states are browsed :
   the given probabilities force the observation variable to go through /a/, then to go through /i/, and
   finally to stay in /y/, e.g. /a,a,a,a,i,i,i,y,y,y,y,· · · /.
     Hence, the role of the transition probabilities is to introduce a temporal (or spatial) structure in
   the modeling of random sequences.
      Furthermore, the obtained sequences have variable lengths : the transition probabilities implicitly
   model a variability in the duration of the sequences. As a matter of fact, different speakers or
   different speaking conditions introduce a variability in the phoneme or word durations. In this
   respect, HMMs are particularly well adapted to speech modeling.
3. If we didn’t have a final state, the observation variable would wander from state to state indefinitely,
   and the model would necessarily correspond to sequences of infinite length.
4. In the case of HMMs with Gaussian emission probabilities, the parameter set Θ comprises :
      • the transition probabilities aij ;
      • the parameters of the Gaussian densities characterizing each state, i.e. the means µ i and the
        variances Σi .
   The initial state distribution is sometimes modeled as an additional parameter instead of being
   represented in the transition matrix.
   In the case of an ergodic HMM with N emitting states and Gaussian emission probabilities, we
   have :
      • (N − 2) × (N − 2) transitions, plus (N − 2) initial state probabilities and (N − 2) probabilities
        to go to the final state;
      • (N − 2) emitting states where each pdf is characterized by a D dimensional mean and a D × D
        covariance matrix.
    Hence, in this case, the total number of parameters is (N − 2) × (N + D × (D + 1)). Note that this
    number grows quadratically with the number of states and with the dimension of the data.
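The count above can be checked with a small hypothetical helper (n_params_ergodic is an illustration name) :

```python
def n_params_ergodic(N, D):
    """Total parameter count of an ergodic Gaussian HMM with N states
    (including the two non-emitting ones) and D-dimensional data."""
    E = N - 2                       # number of emitting states
    transitions = E * E + 2 * E     # E*E inner + E initial + E final
    gaussians = E * (D + D * D)     # mean (D) + full covariance (D*D)
    return transitions + gaussians  # = (N - 2) * (N + D * (D + 1))

# e.g. the 5-state models of this lab with 2-D formant data:
print(n_params_ergodic(5, 2))       # 3 * (5 + 2*3) = 33
```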
5. Words are made of ordered sequences of phonemes : /h/ is followed by /e/ and then by /l/ in the
   word “hello”. Each phoneme can in turn be considered as a particular random process (possibly
   Gaussian). This structure can be adequately modeled by a left-right HMM.
      In “real world” speech recognition, the phonemes themselves are often modeled as left-right HMMs
   rather than plain Gaussian densities (e.g. to model separately the attack, then the stable part of
   the phoneme and finally the “end” of it). Words are then represented by large HMMs made of
   concatenations of smaller phonetic HMMs.
3 Pattern recognition with HMMs

3.1 Likelihood of a sequence given a HMM

    - likelihood of an observation sequence X along a path Q : the probability that the sequence
      X = {x1 , · · · , xT } is emitted along the particular state sequence Q = {q1 , · · · , qT } is :

                                           T
                            p(X|Q, Θ) =    ∏   p(xi |qi , Θ) = bq1 (x1 ) · bq2 (x2 ) · · · bqT (xT )
                                          i=1
      i.e. it is the product of the emission probabilities computed along the considered path.
      In the previous lab, we had learned how to compute the likelihood of a single observation with
      respect to a Gaussian model. This method can be applied here, for each term xi , if the states
      contain Gaussian pdfs.
    - probability of a path Q : it is the product of the transition probabilities encountered along the
      path, including the entry from the initial state and the exit to the final state :

                            P (Q|Θ) = a1q1 · aq1 q2 · · · aqT −1 qT · aqT N

    - joint likelihood of an observation sequence X and a path Q : it consists in the probability that
      X and Q occur simultaneously, p(X, Q|Θ), and decomposes into a product of the two quantities
      defined previously :

                            p(X, Q|Θ) = p(X|Q, Θ) · P (Q|Θ)        (Bayes)

    - likelihood of a sequence given a model : the likelihood of X with respect to the model, irrespective
      of the path, is obtained by summing over all the possible state sequences :

                            p(X|Θ) =    Σ     p(X, Q|Θ)
                                     every Q

      i.e. it is the sum of the joint likelihoods of the sequence over all the possible state sequences
      allowed by the model.
    - the forward recursion : in practice, the enumeration of every possible state sequence is infeasible.
      Nevertheless, p(X|Θ) can be computed in a recursive way by the forward recursion. This algorithm
      defines a forward variable αt (i) corresponding to :

                            αt (i) = p(x1 , x2 , · · · , xt , qti |Θ)

      i.e. αt (i) is the probability of having observed the partial sequence {x1 , x2 , · · · , xt } and being in
      the state i at time t (event denoted qti in the course), given the parameters Θ. For a HMM with N
      states (where states 1 and N are the non-emitting initial and final states, and states 2 · · · N − 1 are
      emitting), αt (i) can be computed recursively as follows :
       1. Initialization

                            α1 (i) = a1i · bi (x1 ),        2 ≤ i ≤ N − 1

          where a1i are the transitions from the initial non-emitting state to the emitting states with pdfs
          bi, i=2···N −1 (x). Note that b1 (x) and bN (x) do not exist since they correspond to the non-emitting
          initial and final states.
       2. Recursion

                                        " N −1            #
                                           X                                    1 ≤ t ≤ T −1
                            αt+1 (j) =          αt (i) · aij  · bj (xt+1 ),
                                          i=2                                   2 ≤ j ≤ N −1
       3. Termination

                                           " N −1             #
                                              X
                            p(X|Θ) =               αT (i) · aiN
                                             i=2
          i.e. at the end of the observation sequence, sum the probabilities of the paths converging to
          the final state (state number N ).
(For more detail about the forward procedure, refer to [RJ93], chap.6.4.1).
     This procedure raises a very important implementation issue. As a matter of fact, the computation
     of the αt vector consists in products of a large number of values that are less than 1 (in general,
     significantly less than 1). Hence, after a few observations (t ≈ 10), the values of αt head exponentially
     towards 0, and the floating point arithmetic precision is exceeded (even in the case of double precision
     arithmetic). Two solutions exist for this problem. One consists in scaling the αt values and undoing
     the scaling at the end of the procedure : see [RJ93] for more explanations. The other consists
     in using log-likelihoods and log-probabilities, and computing log p(X|Θ) instead of p(X|Θ).
Questions :
  1. The following formula can be used to compute the log of a sum given the logs of the sum's arguments :

                            log(a + b) = f (log a, log b) = log a + log(1 + e(log b−log a) )

     Demonstrate it. For numerical precision reasons, is it preferable to use it with log a > log b or
     with log a < log b ?
  2. Express the log version of the forward recursion. (Don't fully develop the log of the sum in the
                                                  N
                                                  X       log           N
     recursion step, just call it “logsum” :          xi  ⟼   logsum i=1 (log xi ).) In addition to the arithmetic
                                                  i=1
     precision issues, what are the other computational advantages of the log version ?
3.2 Bayesian classification

Question :
The forward recursion allows us to compute the likelihood of an observation sequence with respect to
a HMM. Hence, given a sequence of features, we are able to find the most likely generative model in a
Maximum Likelihood sense. What additional quantities and assumptions do we need to perform a true
Bayesian classification rather than a Maximum Likelihood classification of the sequences ?
    Which additional condition makes the result of Bayesian classification equivalent to the result of ML
classification ?

Answer :
To perform a Bayesian classification, we need the prior probabilities P (Θi |Θ) of each model. In addition,
we can assume that all the observation sequences are equi-probable :

                                                 p(X|Θi , Θ)P (Θi |Θ)
                            P (Θi |X, Θ) =
                                                       p(X|Θ)
                                           ∝ p(X|Θi )P (Θi )

P (Θi ) can be determined by counting the probability of occurrence of each model (word or phoneme) in a
database covering the vocabulary to recognize (see lab session 4).
   If every model has the same prior probability, then Bayesian classification becomes equivalent to ML
classification.
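In code, the difference between the two decision rules is only the addition of the log priors before taking the argmax. A minimal Python sketch (classify and the numbers below are hypothetical; the log-likelihoods would come from the forward recursion, e.g. the lab's logfwd) :

```python
import numpy as np

def classify(log_likelihoods, log_priors=None):
    """MAP classification; with log_priors=None (or equal priors)
    this reduces to Maximum Likelihood classification."""
    scores = np.asarray(log_likelihoods, dtype=float)
    if log_priors is not None:
        scores = scores + np.asarray(log_priors, dtype=float)
    return int(np.argmax(scores))

loglik = [-1210.3, -1200.7, -1201.0]   # hypothetical log p(X|Theta_i)
print(classify(loglik))                              # ML picks model 1
print(classify(loglik, np.log([0.1, 0.1, 0.8])))     # a strong prior picks model 2
```

With likelihoods this close, a sufficiently skewed prior changes the decision, which is exactly the extra information Bayesian classification brings.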
Answers :
  1. Demonstration : write a = elog a and b = elog b ; then

                               a + b = elog a + elog b
                                     = elog a (1 + e(log b−log a) )

                          log(a + b) = log a + log(1 + e(log b−log a) )        QED.
     The computation of the exponential overflows double precision arithmetic for large arguments
     (≈ 700), whereas small (very negative) arguments merely underflow to 0. Similarly, implementations
     of the exponential are generally more precise for small arguments than for large ones (since an
     error on the input term is exponentially amplified). Hence, if log a > log b, the version
     log(a + b) = log a + log(1 + e(log b−log a) ) is more precise, since in this case (log b − log a) is
     negative and small in magnitude. If log a < log b, it is better to swap the terms.
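The demonstration and the swap heuristic can be checked numerically; a minimal Python sketch (log_add is an illustration name) :

```python
import math

def log_add(log_a, log_b):
    """log(a + b) from log a and log b; keep the larger term outside
    the exponential so that exp() sees a small negative argument."""
    if log_a < log_b:                       # swap so that log_a >= log_b
        log_a, log_b = log_b, log_a
    return log_a + math.log1p(math.exp(log_b - log_a))

# works even where a and b themselves would underflow to 0.0:
print(log_add(-1000.0, -1001.0))   # log(e^-1000 + e^-1001)
```

Computing `math.exp(-1000) + math.exp(-1001)` directly would return 0.0, so its log would be undefined; the log-domain version keeps full precision.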
  2. (a) Initialization

                            α1(log) (i) = log a1i + log bi (x1 ),        2 ≤ i ≤ N − 1

     (b) Recursion

                                                       h                       i                        1 ≤ t ≤ T −1
                            αt+1(log) (j) = logsum N −1  αt(log) (i) + log aij   + log bj (xt+1 ),
                                                   i=2                                                  2 ≤ j ≤ N −1

     (c) Termination

                                                        h                      i
                            log p(X|Θ) = logsum N −1      αT(log) (i) + log aiN
                                                i=2
     In addition to the precision issues, this version transforms the products into sums, which is more
     computationally efficient. Furthermore, if the emission probabilities are Gaussians, the computation
     of the log-likelihoods log(bj (xt )) eliminates the computation of the Gaussians’ exponential (see lab
     session 4).
These two points show that, once the theoretical barrier is crossed in the study of a particular
statistical model, the importance of the implementation issues must not be neglected.
3.3 Maximum Likelihood classification
Experiment :
Classify the sequences X1 , X2, · · · X6 , given in the file data.mat, in a maximum likelihood sense with
respect to the six Markov models given in table 1. Use the function logfwd to compute the log-forward
recursion expressed in the previous section. Store the results in a matrix (they will be used in the next
section) and note them in the table below.
Example :
>> plot(X1(:,1),X1(:,2));
>> logProb(1,1) = logfwd(X1,hmm1)
>> logProb(1,2) = logfwd(X1,hmm2)
etc.
>> logProb(3,2) = logfwd(X3,hmm2)
etc.
Filling the logProb matrix can be done automatically with the help of loops.
 Sequence      log p(X|Θ1 )   log p(X|Θ2 )   log p(X|Θ3 )   log p(X|Θ4 )   log p(X|Θ5 )   log p(X|Θ6 )   Most likely model
      X1
      X2
      X3
      X4
      X5
      X6
Answer :
X1 → HMM1, X2 → HMM3, X3 → HMM5, X4 → HMM4, X5 → HMM6, X6 → HMM2.
4 Optimal state sequence

The Viterbi algorithm finds, for an observation sequence X = {x1 , · · · , xT } and a model Θ, the single
best state sequence. It relies on two variables :

 • the highest likelihood δt (i) along a single path among all the paths ending in state i at time t :

                            δt (i) =      max        p(q1 , · · · , qt−1 , qt = i, x1 , · · · , xt |Θ)
                                      q1 ,··· ,qt−1

 • a variable ψt (i) which allows to keep track of the “best path” ending in state i at time t, by storing
   the best predecessor of state i (see the recursion below).
Note that these variables are vectors of (N − 2) elements, (N − 2) being the number of emitting states.
With the help of these variables, the algorithm takes the following steps :
    1. Initialization

                            δ1 (i) = a1i · bi (x1 ),        2 ≤ i ≤ N − 1
                            ψ1 (i) = 0

       where, again, a1i are the transitions from the initial non-emitting state to the emitting states with
       pdfs bi, i=2···N −1 (x), and where b1 (x) and bN (x) do not exist since they correspond to the non-emitting
       initial and final states.
    2. Recursion

                                                                                                  1 ≤ t ≤ T −1
                            δt+1 (j) =      max       [δt (i) · aij ] · bj (xt+1 ),
                                         2≤i≤N −1                                                 2 ≤ j ≤ N −1

                                                                                                  1 ≤ t ≤ T −1
                            ψt+1 (j) = arg max        [δt (i) · aij ] ,
                                         2≤i≤N −1                                                 2 ≤ j ≤ N −1
      “Optimal policy is composed of optimal sub-policies” : find the path that leads to a maximum likeli-
      hood considering the best likelihood at the previous step and the transitions from it; then multiply
      by the current likelihood given the current state. Hence, the best path is found by induction.
    3. Termination

                            p∗ (X|Θ) =      max       [δT (i) · aiN ]
                                         2≤i≤N −1

                            qT∗      = arg max        [δT (i) · aiN ]
                                         2≤i≤N −1

       Find the best likelihood when the end of the observation sequence is reached, given that the final
       state is the non-emitting state N .

    4. Backtracking

                            qt∗ = ψt+1 (q∗t+1 ),        t = T − 1, T − 2, · · · , 1

       i.e. the best path is read backwards, from the end of the sequence to its beginning, through the
       stored ψ variables.
Hence, the Viterbi algorithm delivers two useful results, given an observation sequence X = {x 1 , · · · , xT }
and a model Θ :
   • the selection, among all the possible paths in the considered model, of the best path Q ∗ = {q1∗ , · · · , qT∗ },
     which corresponds to the state sequence giving a maximum of likelihood to the observation se-
     quence X;
   • the likelihood along the best path, p(X, Q∗ |Θ) = p∗ (X|Θ). As opposed to the forward procedure,
     where all the possible paths are considered, the Viterbi algorithm computes the likelihood along
     the best path only.
(For more detail about the Viterbi algorithm, refer to [RJ93], chap.6.4.1).
Questions :
  1. From an algorithmic point of view, what is the main difference between the computation of the δ
     variable in the Viterbi algorithm and that of the α variable in the forward procedure ?
Answers :
1. The sums that were appearing in the computation of α become max operations in the computation
   of δ. Hence, the Viterbi procedure takes less computational power than the forward recursion.

2. (a) Initialization

                            δ1(log) (i) = log a1i + log bi (x1 ),        2 ≤ i ≤ N − 1
                            ψ1 (i) = 0

   (b) Recursion

                                                          h                        i                       1 ≤ t ≤ T −1
                            δt+1(log) (j) =      max        δt(log) (i) + log aij    + log bj (xt+1 ),
                                              2≤i≤N −1                                                     2 ≤ j ≤ N −1

                                                          h                        i                       1 ≤ t ≤ T −1
                            ψt+1 (j)     =  arg max         δt(log) (i) + log aij    ,
                                              2≤i≤N −1                                                     2 ≤ j ≤ N −1

   (c) Termination

                                                          h                        i
                            log p∗ (X|Θ) =       max        δT(log) (i) + log aiN
                                              2≤i≤N −1

                                                          h                        i
                            qT∗          =  arg max         δT(log) (i) + log aiN
                                              2≤i≤N −1

   (d) Backtracking : unchanged, qt∗ = ψt+1 (q∗t+1 ),  t = T − 1, · · · , 1.
Experiments :
  1. Use the function logvit to find the best path of the sequences X1 , · · · X6 with respect to the
     most likely model found in section 3.3 (i.e. X1 : HMM1, X2 : HMM3, X3 : HMM5, X4 : HMM4,
     X5 : HMM6 and X6 : HMM2). Compare with the state sequences ST1 , · · · ST6 originally used to
     generate X1 , · · · X6 (use the function compseq, which provides a view of the first dimension of the
     observations as a time series, and allows to compare the original alignment to the Viterbi solution).
  2. Use the function logvit to compute the probabilities of the sequences X1 , · · · X6 along the best
     paths with respect to each model Θ1 , · · · Θ6 . Note your results below. Compare with the log-
      likelihoods obtained in section 3.3 with the forward procedure.
Examples :
  1. Best paths and comparison with the original paths :
     >> figure;
     >> [STbest,bestProb] = logvit(X1,hmm1); compseq(X1,ST1,STbest);
     >> [STbest,bestProb] = logvit(X2,hmm3); compseq(X2,ST2,STbest);
     Repeat for the remaining sequences.
  2. Probabilities along the best paths for all the models :
     >> [STbest,bestProb(1,1)] = logvit(X1,hmm1);
     >> [STbest,bestProb(1,2)] = logvit(X1,hmm2);
     etc.
     >> [STbest,bestProb(3,2)] = logvit(X3,hmm2);
     etc. (You can also use loops here.)
       To compare with the complete log-likelihood, computed by the forward recursion :
      >> diffProb = logProb - bestProb
 Sequence     log p∗ (X|Θ1 )   log p∗ (X|Θ2 )   log p∗ (X|Θ3 )   log p∗ (X|Θ4 )   log p∗ (X|Θ5 )   log p∗ (X|Θ6 )   Most likely model
     X1
     X2
     X3
     X4
     X5
     X6
Question :
Is the likelihood along the best path a good approximation of the real likelihood of a sequence given a
model ?
Answer :
The values found for both likelihoods differ within an acceptable error margin. Furthermore, using the
best path likelihood does not, in most practical cases, modify the classification results. Finally, it further
alleviates the computational load since it replaces the sum or the logsum by a max in the recursive part of
the procedure. Hence, the likelihood along the best path can be considered as a good approximation of the
true likelihood.
5 Training of HMMs

Answers :
  1. It can be assumed that the observation sequences associated with each distinct phoneme obey specific
     probability densities. As in the previous lab, this means that the phonetic classes are assumed
     to be separable by Gaussian classifiers. Hence, the word /aiy/ can be assimilated to the result of
     drawing samples from the pdf N/a/ , then transiting to N/i/ and drawing samples again, and finally
     transiting to N/y/ and drawing samples. It therefore seems reasonable to model the word /aiy/ by
     a left-right HMM with three emitting states.
  2. If we know the phonetic boundaries for each instance, we know to which state belongs each training
     observation, and we can give a label (/a/, /i/ or /y/) to each feature vector. Hence, we can use the
     mean and variance estimators studied in the previous lab to compute the parameters of the Gaussian
     density associated with each state (or each label).
        By knowing the labels, we can also count the transitions from one state to the following (itself or
     another state). By dividing the transitions that start from a state by the total number of transitions
     from this state, we can determine the transition matrix.
  3. The Viterbi procedure allows to distribute some labels on a sequence of features. Hence, it is possible
     to perform unsupervised training in the following way :
      (a) Start with some arbitrary state sequences, which constitute an initial labeling. (The initial
          sequences are usually made of even distributions of phonetic labels along the length of each
          utterance.)
      (b) Update the model, relying on the current labeling.
      (c) Use the Viterbi algorithm to re-distribute some labels on the training examples.
      (d) If the new distribution of labels differs from the previous one, re-iterate (go to (b) ). One can
          also stop when the evolution of the likelihood of the training data becomes asymptotic to a
          higher bound.
     The principle of this algorithm is similar to the Viterbi-EM, used to train the Gaussians during
     the previous lab. Similarly, there exists a “soft” version, called the Baum-Welch algorithm, where
     each state participates to the labeling of the feature frames (this version uses the forward recursion
     instead of the Viterbi). The Baum-Welch algorithm is an EM algorithm specifically adapted to the
     training of HMMs (see [RJ93] for details), and is one of the most widely used training algorithms
     in “real world” speech recognition.
                                                                                                Answers :
   dure to train the model, making use of one of the algorithms studied during the present session.
3. Suppose you didn’t have the phonetic labeling (unsupervised training). Propose a recursive proce-
                                 2. How would you compute the parameters of the proposed HMM ?
   Justify your choice.
1. Which model architecture (ergodic or left-right) would you choose ? With how many states ?
instance of the word.
labeling of the data, i.e. some data structures that tell you where are the phoneme boundaries for each
/aiy/, and that you want to train a HMM for this word. Suppose also that this database comes with a
of training data. Suppose that you have a database containing several utterances of the imaginary word
In the previous lab session, we have learned how to estimate the parameters of Gaussian pdfs given a set
                                                                                               Questions :
database containing observation sequences, in a supervised or an unsupervised way.
the words that we want to model ? The solution is to estimate the parameters of the HMMs from a
compare the observations. But how to determine templates that represent efficiently the phonemes or
HMMs. As explained in section 3.3, these models have the role of stochastic templates to which we
Decoding or aligning acoustic feature sequences requires the prior specification of the parameters of some
                                                                      Training of HMMs                    5
                                                                               5 TRAINING OF HMMS
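The supervised estimation of answer 2 amounts to a few lines of counting. The sketch below is a pure-Python illustration with invented 1-D features and labels (the variable names, data and the `estimate_supervised` helper are hypothetical, not part of the lab's MATLAB toolbox): it fits a Gaussian per state and normalizes transition counts per source state.

```python
from collections import defaultdict

# Hypothetical labeled training data: each utterance is a list of
# (feature, state_label) pairs; 1-D features keep the example short.
utterances = [
    [(1.0, 'a'), (1.2, 'a'), (5.1, 'i'), (4.9, 'i'), (9.0, 'y')],
    [(0.8, 'a'), (5.0, 'i'), (5.2, 'i'), (8.8, 'y'), (9.1, 'y')],
]
states = ['a', 'i', 'y']

def estimate_supervised(utterances, states):
    """Fit one Gaussian per state and a transition matrix from labels."""
    feats = defaultdict(list)
    counts = {s: defaultdict(int) for s in states}
    for utt in utterances:
        for (x, s) in utt:
            feats[s].append(x)
        # Count transitions between consecutive labeled frames.
        for (_, s1), (_, s2) in zip(utt, utt[1:]):
            counts[s1][s2] += 1
    # Gaussian parameters: sample mean and variance per state.
    gauss = {}
    for s in states:
        xs = feats[s]
        mu = sum(xs) / len(xs)
        var = sum((x - mu) ** 2 for x in xs) / len(xs)
        gauss[s] = (mu, var)
    # Transition matrix: counts normalized per source state.
    A = {}
    for s1 in states:
        total = sum(counts[s1].values())
        A[s1] = {s2: counts[s1][s2] / total if total else 0.0
                 for s2 in states}
    return gauss, A

gauss, A = estimate_supervised(utterances, states)
print(gauss['a'])   # mean and variance of the /a/ state
print(A['a'])       # normalized transition counts leaving /a/
```

The unsupervised procedure of answer 3 simply alternates this estimation step with a Viterbi pass that re-labels each utterance under the current model, until the labeling stops changing.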
References
[RJ93] Lawrence Rabiner and Biing-Hwang Juang. Fundamentals of Speech Recognition. Prentice Hall,
       1993.