HTK : The Hidden Markov Model Toolkit

This page was created as an introduction to the Hidden Markov Model Toolkit (HTK) and as a set of working notes. The plan is essentially to work through the documentation available on the official site, "The HTK Book", translating it section by section and adding supplementary notes where appropriate.

The Fundamentals of HTK

HTK is a toolkit for building hidden Markov models (HMMs). HMMs can be used to model any time series, and the core of HTK is correspondingly general-purpose. However, HTK itself was designed primarily for building HMM-based speech processing tools, so much of the infrastructure support in HTK is dedicated to that task. As the accompanying figure illustrates, two major processing stages are involved. First, the HTK training tools are used to estimate the parameters of a set of HMMs from training utterances and their associated transcriptions. Second, unknown utterances are transcribed using the HTK recognition tools.

Most of this document is concerned with the mechanics of these two processes. Before going into the details, however, it is necessary to understand the basic ideas behind HMMs. This also helps in getting an overview of the toolkit and in appreciating how training and recognition are organised in HTK.


This first part of the book attempts to provide this information. In this chapter, the basic ideas of HMMs and their use in speech recognition are introduced. The following chapter then presents a brief overview of HTK and, for users of older versions, it highlights the main differences in version 2.0 and later. Finally in this tutorial part of the book, chapter 3 describes how a HMM-based speech recogniser can be built using HTK. It does this by describing the construction of a simple small vocabulary continuous speech recogniser.

The second part of the book then revisits the topics skimmed over here and discusses each in detail. This can be read in conjunction with the third and final part of the book which provides a reference manual for HTK. This includes a description of each tool, summaries of the various parameters used to configure HTK and a list of the error messages that it generates when things go wrong.

Finally, note that this book is concerned only with HTK as a tool-kit. It does not provide information for using the HTK libraries as a programming environment.

1.1 General Principles of HMMs


Speech recognition systems generally assume that the speech signal is a realisation of some message encoded as a sequence of one or more symbols (see Fig. 1.1). To effect the reverse operation of recognising the underlying symbol sequence given a spoken utterance, the continuous speech waveform is first converted to a sequence of equally spaced discrete parameter vectors. This sequence of parameter vectors is assumed to form an exact representation of the speech waveform on the basis that for the duration covered by a single vector (typically 10ms or so), the speech waveform can be regarded as being stationary. Although this is not strictly true, it is a reasonable approximation. Typical parametric representations in common use are smoothed spectra or linear prediction coefficients plus various other representations derived from these.
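
As a rough sketch of this framing step (this is not HTK's actual front end; the 25 ms frame length, 10 ms shift, and the toy log-energy feature below are assumptions made purely for illustration), the parameterisation can be pictured as follows:

```python
import numpy as np

def frame_signal(signal, sample_rate, frame_ms=25.0, shift_ms=10.0):
    """Split a waveform into fixed-length frames spaced shift_ms apart."""
    frame_len = int(sample_rate * frame_ms / 1000)   # samples per frame
    shift = int(sample_rate * shift_ms / 1000)       # samples between frame starts
    n_frames = 1 + max(0, (len(signal) - frame_len) // shift)
    return np.stack([signal[i * shift : i * shift + frame_len]
                     for i in range(n_frames)])

def log_energy(frames):
    """One crude feature per frame; real front ends use MFCCs, PLP, etc."""
    return np.log(np.sum(frames ** 2, axis=1) + 1e-10)

# toy usage: 1 second of noise at 16 kHz gives roughly 100 frames
signal = np.random.randn(16000)
frames = frame_signal(signal, 16000)
o = log_energy(frames)            # the observation sequence o_1 ... o_T
print(frames.shape, o.shape)
```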

The role of the recogniser is to effect a mapping between sequences of speech vectors and the wanted underlying symbol sequences. Two problems make this very difficult. Firstly, the mapping from symbols to speech is not one-to-one since different underlying symbols can give rise to similar speech sounds. Furthermore, there are large variations in the realised speech waveform due to speaker variability, mood, environment, etc. Secondly, the boundaries between symbols cannot be identified explicitly from the speech waveform. Hence, it is not possible to treat the speech waveform as a sequence of concatenated static patterns.

The second problem of not knowing the word boundary locations can be avoided by restricting the task to isolated word recognition. As shown in Fig. 1.2, this implies that the speech waveform corresponds to a single underlying symbol (e.g. word) chosen from a fixed vocabulary. Despite the fact that this simpler problem is somewhat artificial, it nevertheless has a wide range of practical applications. Furthermore, it serves as a good basis for introducing the basic ideas of HMM-based recognition before dealing with the more complex continuous speech case. Hence, isolated word recognition using HMMs will be dealt with first.

1.2 Isolated Word Recognition

Let each spoken word be represented by a sequence of speech vectors or observations \(O\), defined as

\[O = o_1,o_2,...,o_T\]

where \(o_t\) is the speech vector observed at time \(t\). The isolated word recognition problem can then be regarded as that of computing

\[\newcommand{\argmax}{\mathop{\rm arg~max}\limits} \newcommand{\argmin}{\mathop{\rm arg~min}\limits} \argmax_{i} {P(w_i|O)}\]

where \(w_i\) is the \(i\)'th vocabulary word. This probability is not computable directly but using Bayes’ Rule gives

\[P(w_i|O) = \frac{P(O|w_i)\,P(w_i)}{P(O)}\]

Thus, for a given set of prior probabilities \(P(w_i)\), the most probable spoken word depends only on the likelihood \(P(O|w_i)\). Given the dimensionality of the observation sequence \(O\), the direct estimation of the joint conditional probability \(P(o_1, o_2, \ldots | w_i)\) from examples of spoken words is not practicable. However, if a parametric model of word production such as a Markov model is assumed, then estimation from data is possible since the problem of estimating the class conditional observation densities \(P(O|w_i)\) is replaced by the much simpler problem of estimating the Markov model parameters.
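
As a minimal sketch of this decision rule (the per-word scoring functions and priors below are hypothetical stand-ins for trained models, not anything provided by HTK), choosing the most probable word amounts to:

```python
import math

def recognise(observations, word_models, priors):
    """Pick the vocabulary word maximising P(O|w_i) P(w_i).

    word_models maps each word to a function returning log P(O|w_i);
    priors maps each word to P(w_i).  Both are assumed to be given.
    """
    def score(word):
        return word_models[word](observations) + math.log(priors[word])
    return max(word_models, key=score)

# toy usage with made-up log-likelihoods standing in for trained HMM scores
models = {"yes": lambda O: -42.0, "no": lambda O: -57.3}
priors = {"yes": 0.5, "no": 0.5}
print(recognise([0.1, 0.2, 0.3], models, priors))   # -> yes
```

Working in log probabilities, as here, avoids numerical underflow when the likelihoods of long observation sequences become very small.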

In HMM based speech recognition, it is assumed that the sequence of observed speech vectors corresponding to each word is generated by a Markov model as shown in Fig. 1.3. A Markov model is a finite state machine which changes state once every time unit, and each time \(t\) that a state \(j\) is entered, a speech vector \(o_t\) is generated from the probability density \(b_j(o_t)\). Furthermore, the transition from state \(i\) to state \(j\) is also probabilistic and is governed by the discrete probability \(a_{ij}\). Fig. 1.3 shows an example of this process where the six state model moves through the state sequence \(X = 1, 2, 2, 3, 4, 4, 5, 6\) in order to generate the sequence \(o_1\) to \(o_6\). Notice that in HTK, the entry and exit states of a HMM are non-emitting. This is to facilitate the construction of composite models as explained in more detail later.
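
To make the generative picture concrete, here is a minimal sketch of sampling from such a model; the topology, transition probabilities, and Gaussian output densities are invented solely for illustration and do not correspond to any real HTK model:

```python
import numpy as np

rng = np.random.default_rng(0)

# A left-to-right HMM with a non-emitting entry state (1) and exit state (4),
# and Gaussian output densities for the emitting states 2 and 3.
A = np.array([[0.0, 1.0, 0.0, 0.0],    # transitions out of entry state 1
              [0.0, 0.6, 0.4, 0.0],    # transitions out of state 2
              [0.0, 0.0, 0.7, 0.3],    # transitions out of state 3
              [0.0, 0.0, 0.0, 0.0]])   # exit state 4 (no outgoing transitions)
means = {2: 0.0, 3: 3.0}               # parameters of the densities b_j(.)

state, X, O = 1, [1], []
while state != 4:
    state = int(rng.choice(4, p=A[state - 1])) + 1   # move with probability a_ij
    X.append(state)
    if state in means:                               # emitting states produce a vector
        O.append(rng.normal(means[state], 1.0))
print(X, O)
```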

The joint probability that \(O\) is generated by the model \(M\) moving through the state sequence \(X\) is calculated simply as the product of the transition probabilities and the output probabilities. So for the state sequence \(X\) in Fig. 1.3

\[P(O, X|M) = a_{12}\, b_2(o_1)\, a_{22}\, b_2(o_2)\, a_{23}\, b_3(o_3) \ldots\]
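
This product form translates directly into code. The sketch below assumes a transition matrix A indexed by the state numbering above (entry state 1, exit state N, both non-emitting) and a user-supplied output density function b; both are hypothetical placeholders rather than HTK data structures:

```python
import numpy as np

def joint_log_prob(A, b, O, X):
    """log P(O, X | M) for a known state sequence X = [x(0), x(1), ..., x(T), x(T+1)].

    A[i-1, j-1] holds the transition probability a_ij (states numbered from 1,
    with the entry and exit states non-emitting), and b(j, o) returns the
    output density b_j(o).  Both are assumed to be supplied by the caller.
    """
    T = len(O)
    logp = 0.0
    for t in range(1, T + 1):
        logp += np.log(A[X[t - 1] - 1, X[t] - 1])   # a_{x(t-1) x(t)}
        logp += np.log(b(X[t], O[t - 1]))           # b_{x(t)}(o_t)
    logp += np.log(A[X[T] - 1, X[T + 1] - 1])       # transition into the exit state
    return logp
```

With the state sequence of Fig. 1.3 this reproduces \(a_{12} b_2(o_1) a_{22} b_2(o_2) \ldots\) term by term, accumulated in the log domain to avoid underflow.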

However, in practice, only the observation sequence \(O\) is known and the underlying state sequence \(X\) is hidden. This is why it is called a Hidden Markov Model. Given that \(X\) is unknown, the required likelihood is computed by summing over all possible state sequences \(X = x(1), x(2), x(3), \ldots, x(T)\), that is

\[P(O|M) = \sum_{X} a_{x(0)x(1)} \prod_{t=1}^{T} b_{x(t)}(o_t)\, a_{x(t)x(t+1)}\]

where \(x(0)\) is constrained to be the model entry state and \(x(T+1)\) is constrained to be the model exit state.
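
Taken literally, the sum runs over every possible emitting state sequence, which the following sketch does by brute force (again with a hypothetical transition matrix A and output density b, as in the previous sketch); the enumeration is exponential in \(T\) and is only meant to mirror the formula, since practical systems compute the same quantity with efficient recursions:

```python
import itertools
import numpy as np

def likelihood_brute_force(A, b, O, N):
    """P(O|M) by summing P(O, X|M) over every possible emitting state sequence.

    N is the total number of states; state 1 is the non-emitting entry state
    and state N the non-emitting exit state, so the emitting states are 2..N-1.
    """
    T = len(O)
    total = 0.0
    for X in itertools.product(range(2, N), repeat=T):    # every emitting state sequence
        p = A[0, X[0] - 1]                                 # entry state into x(1)
        for t in range(T):
            p *= b(X[t], O[t])                             # output density for frame t+1
            nxt = X[t + 1] - 1 if t + 1 < T else N - 1     # next state; exit state after the last frame
            p *= A[X[t] - 1, nxt]
        total += p
    return total
```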
