Relative Density Nets: A New Way to Combine Backpropagation with HMM's

Andrew D. Brown
Department of Computer Science, University of Toronto
Toronto, Canada M5S 3G4
andy@cs.utoronto.ca

Geoffrey E. Hinton
Gatsby Unit, UCL
London, UK WC1N 3AR
hinton@gatsby.ucl.ac.uk

Abstract

Logistic units in the first hidden layer of a feedforward neural network compute the relative probability of a data point under two Gaussians. This leads us to consider substituting other density models. We present an architecture for performing discriminative learning of Hidden Markov Models using a network of many small HMM's. Experiments on speech data show it to be superior to the standard method of discriminatively training HMM's.

1 Introduction

A standard way of performing classification using a generative model is to divide the training cases into their respective classes and then train a set of class-conditional models. This unsupervised approach to classification is appealing for two reasons. It is possible to reduce overfitting, because the model learns the class-conditional input densities P(x | c) rather than the input-conditional class probabilities P(c | x). Also, provided that the model density is a good match to the underlying data density, then the decision provided by a probabilistic model is Bayes optimal. The problem with this unsupervised approach to using probabilistic models for classification is that, for reasons of computational efficiency and analytical convenience, very simple generative models are typically used, and the optimality of the procedure no longer holds. For this reason it is usually advantageous to train a classifier discriminatively.

In this paper we will look specifically at the problem of learning HMM's for classifying speech sequences. It is an application area where the assumption that the HMM is the correct generative model for the data is inaccurate and discriminative methods of training have been successful. The first section will give an overview of current methods of discriminatively training HMM classifiers. We will then introduce a new type of multi-layer backpropagation network which takes better advantage of the HMM's for discrimination. Finally, we present some simulations comparing the two methods.

[Figure 1: An Alphanet with one HMM per class. Each computes a score for the sequence and this feeds into a softmax output layer.]

2 Alphanets and Discriminative Learning

The unsupervised way of using an HMM for classifying a collection of sequences is to use the Baum-Welch algorithm [1] to fit one HMM per class. Then new sequences are classified by computing the probability of a sequence under each model and assigning it to the one with the highest probability. Speech recognition is one of the commonest applications of HMM's, but unfortunately an HMM is a poor model of the speech production process. For this reason speech researchers have looked at the possibility of improving the performance of an HMM classifier by using information from negative examples, i.e. examples drawn from classes other than the one which the HMM was meant to model. One way of doing this is to compute the mutual information between the class label and the data under the HMM density, and maximize that objective function [2]. It was later shown that this procedure could be viewed as a type of neural network (see Figure 1) in which the inputs to the network are the log-probability scores L(x_{1:T} | H_k) of the sequence under each hidden Markov model H_k [3].
In such a model there is one HMM per class, and the output is a softmax non-linearity:

P(k | x_{1:T}; H_1, \ldots, H_K) = y_k = \frac{\exp(L(x_{1:T} | H_k))}{\sum_{j=1}^{K} \exp(L(x_{1:T} | H_j))}    (1)

Training this model by maximizing the log probability of correct classification leads to a classifier which will perform better than an equivalent HMM model trained solely in an unsupervised manner. Such an architecture has been termed an "Alphanet" because it may be implemented as a recurrent neural network which mimics the forward pass of the forward-backward algorithm.[1]

[1] The results of the forward pass are the probabilities of the hidden states conditioned on the past observations, or "alphas" in standard HMM terminology.

3 Backpropagation Networks as Density Comparators

A multi-layer feedforward network is usually thought of as a flexible non-linear regression model, but if it uses the logistic function non-linearity in the hidden layer, there is an interesting interpretation of the operation performed by each hidden unit. Given a mixture of two Gaussians where we know the component priors P(G_k) and the component densities P(x | G_k), the posterior probability that Gaussian G_0 generated an observation x is a logistic function whose argument is the negative log-odds of the two classes [4]. This can clearly be seen by rearranging the expression for the posterior:

P(G_0 | x) = \frac{P(x | G_0) P(G_0)}{P(x | G_0) P(G_0) + P(x | G_1) P(G_1)}
           = \frac{1}{1 + \exp\left\{ -\log \frac{P(x | G_0)}{P(x | G_1)} - \log \frac{P(G_0)}{P(G_1)} \right\}}    (2)

If the class-conditional densities in question are multivariate Gaussians

P(x | G_k) = \frac{1}{|2\pi\Sigma|^{1/2}} \exp\left\{ -\tfrac{1}{2} (x - \mu_k)^T \Sigma^{-1} (x - \mu_k) \right\}    (3)

with equal covariance matrices, \Sigma, then the posterior class probability may be written in this familiar form:

P(G_0 | x) = \frac{1}{1 + \exp\{-(x^T w + b)\}}    (4)

where

w = \Sigma^{-1} (\mu_0 - \mu_1)    (5)

b = \tfrac{1}{2} (\mu_1 + \mu_0)^T \Sigma^{-1} (\mu_1 - \mu_0) + \log \frac{P(G_0)}{P(G_1)}    (6)

Thus, the multi-layer perceptron can be viewed as computing pairwise posteriors between Gaussians in the input space, and then combining these in the output layer to compute a decision.

4 A New Kind of Discriminative Net

This view of a feedforward network suggests variations in which other kinds of density models are used in place of Gaussians in the input space. In particular, instead of performing pairwise comparisons between Gaussians, the units in the first hidden layer can perform pairwise comparisons between the densities of an input sequence under M different HMM's. For a given sequence, the log-probability of the sequence under each HMM is computed and the difference in log-probability is used as input to the logistic hidden unit.[2] This is equivalent to computing the posterior responsibilities of a mixture of two HMM's with equal prior probabilities. In order to maximally leverage the information captured by the HMM's we use \binom{M}{2} hidden units so that all possible pairs are included. The output of a hidden unit is given by

h_{(mn)} = \sigma\big( L(x_{1:T} | H_m) - L(x_{1:T} | H_n) \big)    (7)

where we have used (mn) as an index over the set \binom{M}{2} of all unordered pairs of the HMM's. The results of this hidden layer computation are then combined using a fully connected layer of free weights, W, and finally passed through a softmax function to make the final decision:

a_k = \sum_{(mn) \in \binom{M}{2}} w_{(mn),k} \, h_{(mn)}    (8)

P(k | x_{1:T}; H_1, \ldots, H_M) = p_k = \frac{\exp(a_k)}{\sum_{k'=1}^{K} \exp(a_{k'})}    (9)

where we have used \sigma(\cdot) as shorthand for the logistic function, and p_k is the value of the kth output unit. The resulting architecture is shown in Figure 2.

[2] We take the time-averaged log-probability so that the scale of the inputs is independent of the length of the sequence.

[Figure 2: A multi-layer density net with HMM's in the input layer. The hidden layer units perform all pairwise comparisons between the HMM's.]
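To make the forward pass concrete, here is a minimal NumPy sketch of equations 7-9. It assumes each HMM object exposes a log_prob(x) method returning the time-averaged log-likelihood of the sequence (cf. footnote 2); the names hmms, W and log_prob are illustrative placeholders rather than anything defined in the paper.

```python
import itertools
import numpy as np

def rdn_forward(x, hmms, W):
    """Forward pass of a relative density net (eqs. 7-9).

    x    : observation sequence, shape (T, D)
    hmms : list of M HMM objects, each with a log_prob(x) method that
           returns the time-averaged log-likelihood L(x | H_m)  (hypothetical API)
    W    : free weights, shape (M choose 2, K), hidden units to classes
    """
    # Log-probability score of the sequence under every HMM
    L = np.array([h.log_prob(x) for h in hmms])          # shape (M,)

    # Eq. 7: one logistic unit per unordered pair (m, n), input is L_m - L_n
    pairs = list(itertools.combinations(range(len(hmms)), 2))
    diffs = np.array([L[m] - L[n] for m, n in pairs])
    h = 1.0 / (1.0 + np.exp(-diffs))                      # shape (M choose 2,)

    # Eq. 8: fully connected layer of free weights
    a = h @ W                                             # shape (K,)

    # Eq. 9: softmax over classes
    a = a - a.max()                                       # numerical stability
    p = np.exp(a) / np.exp(a).sum()
    return p, h, L, pairs
```

Note that the fixed ±1 connections of the RDN are implicit in the pairwise differences L[m] - L[n]; only W holds free weights.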
Because each unit in the hidden layer takes as input the difference in log-probability of two HMM's, this can be thought of as a fixed layer of weights connecting each hidden unit to a pair of HMM's with weights of ±1. In contrast to the Alphanet, which allocates one HMM to model each class, this network does not require a one-to-one alignment between models and classes, and it gets maximum discriminative benefit from the HMM's by comparing all pairs. Another benefit of this architecture is that it allows us to use more HMM's than there are classes. The unsupervised approach to training HMM classifiers is problematic because it depends on the assumption that a single HMM is a good model of the data and, in the case of speech, this is a poor assumption. Training the classifier discriminatively alleviates this drawback, and the multi-layer classifier goes even further in this direction by allowing many HMM's to be used to learn the decision boundaries between the classes. The intuition here is that many small HMM's can be a far more efficient way to characterize sequences than one big HMM. When many small HMM's cooperate to generate sequences, the mutual information between different parts of generated sequences scales linearly with the number of HMM's and only logarithmically with the number of hidden nodes in each HMM [5].

5 Derivative Updates for a Relative Density Network

The learning algorithm for an RDN is just the backpropagation algorithm applied to the network architecture as defined in equations 7, 8 and 9. The output layer is a distribution over class memberships of the data point x_{1:T}, and this is parameterized as a softmax function. We maximize the log probability of the correct classification, which is equivalent to minimizing the cross-entropy loss:

\ell = \sum_{k=1}^{K} t_k \log p_k    (10)

where p_k is the value of the kth output unit and t_k is an indicator variable which is equal to 1 if k is the true class. Taking derivatives of this expression with respect to the inputs of the output units yields

\frac{\partial \ell}{\partial a_k} = t_k - p_k    (11)

\frac{\partial \ell}{\partial w_{(mn),k}} = \frac{\partial \ell}{\partial a_k} \frac{\partial a_k}{\partial w_{(mn),k}} = (t_k - p_k) \, h_{(mn)}    (12)

The derivative of the output of the (mn)th hidden unit with respect to the output of the ith HMM, L_i, is

\frac{\partial h_{(mn)}}{\partial L_i} = \sigma(L_m - L_n)\,\big(1 - \sigma(L_m - L_n)\big)\,(\delta_{im} - \delta_{in})    (13)

where (\delta_{im} - \delta_{in}) is an indicator which equals +1 if i = m, -1 if i = n and zero otherwise. This derivative can be chained with the derivatives backpropagated from the output to the hidden layer.
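As a rough illustration of how equations 10-13 chain together, the sketch below reuses the hypothetical quantities returned by the forward-pass sketch above and computes the gradients of the objective with respect to the free weights and with respect to each HMM's log-probability score.

```python
import numpy as np

def rdn_backward(p, h, L, W, t, pairs):
    """Gradients of the objective in eq. 10 (log-probability of the correct
    class), chained through eqs. 11-13.  Moving parameters along these
    gradients increases the log-probability of correct classification.

    p     : softmax outputs, shape (K,)
    h     : hidden unit activations, shape (M choose 2,)
    L     : per-HMM log-probability scores, shape (M,)
    W     : hidden-to-output weights, shape (M choose 2, K)
    t     : one-hot target vector, shape (K,)
    pairs : list of unordered index pairs (m, n), one per hidden unit
    """
    # Eq. 11: derivative of the objective w.r.t. the softmax inputs a_k
    d_a = t - p                                           # shape (K,)

    # Eq. 12: derivative w.r.t. the free weights w_(mn),k
    d_W = np.outer(h, d_a)                                # shape (M choose 2, K)

    # Error signal backpropagated to the hidden unit outputs
    d_h = W @ d_a                                         # shape (M choose 2,)

    # Eq. 13: chain through each logistic unit to the HMM scores L_i
    d_L = np.zeros_like(L)
    for u, (m, n) in enumerate(pairs):
        slope = h[u] * (1.0 - h[u])                       # sigma'(L_m - L_n)
        d_L[m] += d_h[u] * slope
        d_L[n] -= d_h[u] * slope
    return d_W, d_L
```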
For the final step of the backpropagation procedure we need the derivative of the log-likelihood of each HMM with respect to its parameters. In the experiments we use HMM's with a single, axis-aligned, Gaussian output density per state. We use the following notation for the parameters:

A: a_{ij} is the transition probability from state i to state j
\pi: \pi_i is the initial state prior
\mu_i: mean vector for state i
v_i: vector of variances for state i
H: set of HMM parameters {A, \pi, \mu, v}

We also use the variable s_t to represent the state of the HMM at time t. We make use of the property of all latent variable density models that the derivative of the log-likelihood is equal to the expected derivative of the joint log-likelihood under the posterior distribution. For an HMM this means that:

\frac{\partial L(x_{1:T} | H)}{\partial H} = \sum_{s_{1:T}} P(s_{1:T} | x_{1:T}, H) \, \frac{\partial}{\partial H} \log P(x_{1:T}, s_{1:T} | H)    (14)

The expected joint log-likelihood of an HMM is:

\langle \log P(x_{1:T}, s_{1:T} | H) \rangle = \sum_i \langle \delta_{s_1,i} \rangle \log \pi_i
  + \sum_{t=2}^{T} \sum_{i,j} \langle \delta_{s_t,j} \delta_{s_{t-1},i} \rangle \log a_{ij}
  + \sum_{t=1}^{T} \sum_i \langle \delta_{s_t,i} \rangle \left[ -\tfrac{1}{2} \sum_{d=1}^{D} \log v_{i,d} - \tfrac{1}{2} \sum_{d=1}^{D} (x_{t,d} - \mu_{i,d})^2 / v_{i,d} \right] + \mathrm{const}    (15)

where \langle \cdot \rangle denotes expectations under the posterior distribution, and \langle \delta_{s_t,i} \rangle and \langle \delta_{s_t,j} \delta_{s_{t-1},i} \rangle are the expected state occupancies and transitions under this distribution. All the necessary expectations are computed by the forward-backward algorithm. We could take derivatives with respect to this functional directly, but that would require doing constrained gradient descent on the probabilities and the variances. Instead, we reparameterize the model using a softmax basis for probability vectors and an exponential basis for the variance parameters. This choice of basis allows us to do unconstrained optimization in the new basis. The new parameters are defined as follows:

a_{ij} = \frac{\exp(\lambda^{(a)}_{ij})}{\sum_{j'} \exp(\lambda^{(a)}_{ij'})}, \qquad
\pi_i = \frac{\exp(\lambda^{(\pi)}_i)}{\sum_{i'} \exp(\lambda^{(\pi)}_{i'})}, \qquad
v_{i,d} = \exp(\lambda^{(v)}_{i,d})

This results in the following derivatives:

\frac{\partial L(x_{1:T} | H)}{\partial \lambda^{(a)}_{ij}} = \sum_{t=2}^{T} \left( \langle \delta_{s_t,j} \delta_{s_{t-1},i} \rangle - \langle \delta_{s_{t-1},i} \rangle \, a_{ij} \right)    (16)

\frac{\partial L(x_{1:T} | H)}{\partial \lambda^{(\pi)}_i} = \langle \delta_{s_1,i} \rangle - \pi_i    (17)

\frac{\partial L(x_{1:T} | H)}{\partial \mu_{i,d}} = \sum_{t=1}^{T} \langle \delta_{s_t,i} \rangle (x_{t,d} - \mu_{i,d}) / v_{i,d}    (18)

\frac{\partial L(x_{1:T} | H)}{\partial \lambda^{(v)}_{i,d}} = \frac{1}{2} \sum_{t=1}^{T} \langle \delta_{s_t,i} \rangle \left( (x_{t,d} - \mu_{i,d})^2 / v_{i,d} - 1 \right)    (19)

When chained with the error signal backpropagated from the output, these derivatives give us the direction in which to move the parameters of each HMM in order to increase the log probability of the correct classification of the sequence.
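The gradients in equations 16-19 need only the expected state occupancies and transition counts produced by a forward-backward pass. The sketch below is one possible implementation under the softmax/exponential reparameterization described above; the array layout of gamma and xi, the function name, and the use of lambda for the unconstrained parameters are assumptions made for illustration.

```python
import numpy as np

def hmm_loglik_grads(x, gamma, xi, a, pi, mu, v):
    """Gradients of L(x_{1:T} | H) in the unconstrained basis (eqs. 16-19).

    x     : observations, shape (T, D)
    gamma : expected state occupancies <delta_{s_t,i}>, shape (T, S)
    xi    : expected transitions <delta_{s_t,j} delta_{s_{t-1},i}>, shape (T-1, S, S)
    a     : transition matrix, shape (S, S); pi : initial state prior, shape (S,)
    mu, v : per-state means and variances, shape (S, D)
    """
    # Eq. 16: softmax-basis gradient for the transition parameters
    d_lam_a = xi.sum(axis=0) - gamma[:-1].sum(axis=0)[:, None] * a

    # Eq. 17: softmax-basis gradient for the initial state prior
    d_lam_pi = gamma[0] - pi

    # Eq. 18: gradient for the state means
    diff = x[:, None, :] - mu[None, :, :]                 # shape (T, S, D)
    d_mu = (gamma[:, :, None] * diff / v[None, :, :]).sum(axis=0)

    # Eq. 19: exponential-basis gradient for the variances
    d_lam_v = 0.5 * (gamma[:, :, None] * (diff**2 / v[None, :, :] - 1.0)).sum(axis=0)

    return d_lam_a, d_lam_pi, d_mu, d_lam_v
```

Scaling these per-HMM gradients by the backpropagated error signal d_L from the earlier sketch gives the overall update direction for each HMM's parameters.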
6 Experiments

To evaluate the relative merits of the RDN, we compared it against an Alphanet on a speaker identification task. The data was taken from the CSLU 'Speaker Recognition' corpus. It consisted of 12 speakers uttering phrases consisting of 6 different sequences of connected digits, recorded multiple times (48) over the course of 12 recording sessions. The data was pre-emphasized and Fourier transformed in 32ms frames at a frame rate of 10ms. It was then filtered using 24 bandpass, mel-frequency scaled filters. The log magnitude filter response was then used as the feature vector for the HMM's. This pre-processing reduced the data dimensionality while retaining its spectral structure. While mel-cepstral coefficients are typically recommended for use with axis-aligned Gaussians, they destroy the spectral structure of the data, and we would like to allow for the possibility that, of the many HMM's, some of them will specialize on particular sub-bands of the frequency domain. They can do this by treating the variance as a measure of the importance of a particular frequency band, using large variances for unimportant bands, and small ones for bands to which they pay particular attention.

We compared the RDN with an Alphanet and three other models which were implemented as controls. The first of these was a network with a similar architecture to the RDN (as shown in Figure 2), except that instead of fixed connections of ±1, the hidden units have a set of adaptable weights to all M of the HMM's. We refer to this network as a comparative density net (CDN). A second control experiment used an architecture similar to a CDN without the hidden layer, i.e. there is a single layer of adaptable weights directly connecting the HMM's with the softmax output units. We label this architecture a CDN-1. The CDN-1 differs from the Alphanet in that each softmax output unit has adaptable connections to the HMM's and we can vary the number of HMM's, whereas the Alphanet has just one HMM per class directly connected to each softmax output unit. Finally, we implemented a version of a network similar to an Alphanet, but using a mixture of Gaussians as the input density model. The point of this comparison was to see if the HMM actually achieves a benefit from modelling the temporal aspects of the speaker recognition task. In each experiment an RDN constructed out of a set of M 4-state HMM's was compared to the four other networks, all matched to have the same number of free parameters, except for the MoGnet. In the case of the MoGnet, we used the same number of Gaussian mixture models as HMM's in the Alphanet, each with the same number of hidden states. Thus, it has fewer parameters, because it is lacking the transition probabilities of the HMM. We ran the experiment four times with
