This blog post discusses the paper "Recurrent Neural Network based Language Model" by Mikolov et al., presented at INTERSPEECH 2010.

Introduction

This paper introduces recurrent neural networks (RNNs) to language modeling. Previous language-modeling techniques were based solely on statistical computations over a large text corpus. Such language models (LMs) are collectively termed n-gram LMs and focus on the task of predicting the next word given the n − 1 preceding context words. In general, such LMs work by first obtaining n-gram probabilities (unigram, bigram, trigram, etc.) for all sequences observed in the training data, followed by the application of a smoothing technique (e.g. Kneser-Ney, Katz) to assign probabilities to rarely occurring word sequences. A popular variant is the backoff LM, which falls back to lower-order n-gram probabilities when the higher-order counts fall below a certain threshold. The major disadvantage of such LMs is that they lack an explicit representation of long-range dependencies.
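As a toy illustration of the backoff idea (not the paper's implementation), here is a bigram model that falls back to a penalized unigram estimate when the bigram count is too low; the class name, threshold, and penalty factor are illustrative assumptions:

```python
# Minimal sketch of a count-based bigram LM with a crude backoff to unigrams.
from collections import Counter

class BackoffBigramLM:
    def __init__(self, corpus, min_count=2, alpha=0.4):
        self.unigrams = Counter(corpus)
        self.bigrams = Counter(zip(corpus, corpus[1:]))
        self.total = len(corpus)
        self.min_count = min_count   # back off below this bigram count
        self.alpha = alpha           # penalty applied to the backed-off estimate

    def prob(self, prev, word):
        # Use the bigram estimate when it is reliably observed ...
        if self.bigrams[(prev, word)] >= self.min_count:
            return self.bigrams[(prev, word)] / self.unigrams[prev]
        # ... otherwise back off to the (penalized) unigram estimate.
        return self.alpha * self.unigrams[word] / self.total

corpus = "the cat sat on the mat the cat ate".split()
lm = BackoffBigramLM(corpus)
print(lm.prob("the", "cat"), lm.prob("cat", "mat"))
```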

The idea of deep-learning-based LMs was introduced by Yoshua Bengio, who first proposed a feedforward neural network to model an n-gram LM. The input to this network is a fixed-length context vector, and hence this model too, although proving much better than statistical models on several tasks, fails to represent long-range dependencies.
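A rough sketch of such a feedforward n-gram LM forward pass is shown below; all sizes and names are made-up assumptions, and the point is simply that the history is clipped to a fixed n − 1 words:

```python
# Illustrative feedforward n-gram LM (not Bengio's original code): the context
# length is fixed at n-1 words, which is exactly the limitation the RNN removes.
import numpy as np

V, d, h, n = 10_000, 64, 128, 4          # vocab, embedding, hidden, n-gram order
C = np.random.randn(V, d) * 0.1          # word embedding table
H = np.random.randn((n - 1) * d, h) * 0.1
U = np.random.randn(h, V) * 0.1

def feedforward_lm(context_ids):
    """context_ids: exactly n-1 word indices -- longer history is simply cut off."""
    x = np.concatenate([C[i] for i in context_ids])   # fixed-length input vector
    hdn = np.tanh(x @ H)
    logits = hdn @ U
    e = np.exp(logits - logits.max())
    return e / e.sum()                                 # softmax over the next word

p = feedforward_lm([17, 42, 7])   # probability distribution over the 10k vocab
print(p.shape, p.sum())
```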

Mikolov et al.'s approach is a step towards solving this dependency problem. Although it requires longer training times, it proved to be much better than existing LMs, achieving significantly lower perplexity and WER.

Network description

Firstly, the vocabulary is procured from the training data by counting the occurrences of each word; all words with counts below a threshold are then merged under the common token 'rare'. Every word in the vocabulary is assigned a one-hot vector $w$. The network has a hidden state $s$ whose dimension is a hyperparameter. Words are fed to the network sequentially, treating each input word as a time step (denoted by $t$). Thus, the input vector at time $t$ is given by $x(t) = w(t) + s(t-1)$ (here '+' denotes concatenation).
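To make the preprocessing concrete, here is a minimal NumPy sketch; the `min_count` threshold, the `<rare>` token name, and the hidden dimension are illustrative assumptions rather than the paper's settings:

```python
# Build the vocabulary with a rare-word cutoff and form x(t) = [w(t); s(t-1)].
import numpy as np
from collections import Counter

def build_vocab(tokens, min_count=5):
    counts = Counter(tokens)
    vocab = [w for w, c in counts.items() if c >= min_count] + ["<rare>"]
    return {w: i for i, w in enumerate(vocab)}

def one_hot(word, vocab):
    v = np.zeros(len(vocab))
    v[vocab.get(word, vocab["<rare>"])] = 1.0   # unseen/rare words share one slot
    return v

vocab = build_vocab("the cat sat on the mat the cat ate the fish".split(), min_count=2)
hidden_dim = 100                          # hyperparameter
s_prev = np.zeros(hidden_dim)             # s(t-1), previous hidden state
x_t = np.concatenate([one_hot("the", vocab), s_prev])   # x(t) = w(t) + s(t-1)
print(x_t.shape)
```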

The subsequent steps are as follows:

$$s_j(t) = f\Big(\sum_i x_i(t)\, u_{ji}\Big), \qquad y_k(t) = g\Big(\sum_j s_j(t)\, v_{kj}\Big),$$

where $f(z) = \frac{1}{1 + e^{-z}}$ is the sigmoid activation used in the hidden layer and $g(z_m) = \frac{e^{z_m}}{\sum_k e^{z_k}}$ is the softmax function used in the output layer.

The weight matrices $U = [u_{ji}]$ and $V = [v_{kj}]$ were initialized from a Gaussian distribution with mean 0 and standard deviation 0.1. Cross-entropy was used as the loss function:

$$L(t) = -\log y_{d(t)}(t),$$

where $d(t)$ is the index of the correct next word.
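The forward step and the loss can be written out directly from the equations above; the sketch below is a simplified single-example version, with the vocabulary and hidden sizes chosen arbitrarily:

```python
# One forward step of the RNN-LM and the per-step cross-entropy loss.
import numpy as np

vocab_size, hidden_size = 5000, 100               # illustrative sizes
rng = np.random.default_rng(0)
# U acts on the concatenated input x(t) = [w(t); s(t-1)],
# V maps the hidden state to the output layer; both drawn from N(0, 0.1).
U = rng.normal(0.0, 0.1, size=(hidden_size, vocab_size + hidden_size))
V = rng.normal(0.0, 0.1, size=(vocab_size, hidden_size))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def forward(w_t, s_prev):
    x_t = np.concatenate([w_t, s_prev])           # x(t) = w(t) + s(t-1)
    s_t = sigmoid(U @ x_t)                        # hidden state s(t)
    y_t = softmax(V @ s_t)                        # next-word distribution y(t)
    return s_t, y_t

def cross_entropy(y_t, target_idx):
    return -np.log(y_t[target_idx])               # -log P(correct next word)

w_t = np.zeros(vocab_size)
w_t[17] = 1.0                                     # one-hot current word
s_t, y_t = forward(w_t, np.zeros(hidden_size))
print(cross_entropy(y_t, 42))
```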

The next-word probabilities were interpreted in this manner ($C_{rare}$ is the total count of rare words):

$$P(w_{t+1} \mid w(t), s(t-1)) = \begin{cases} \dfrac{y_{rare}(t)}{C_{rare}} & \text{if } w_{t+1} \text{ is 'rare'} \\ y_{w_{t+1}}(t) & \text{otherwise.} \end{cases}$$
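A small sketch of how this rule can be applied to the softmax output; the index layout and the toy numbers are assumptions for illustration:

```python
# Share the single '<rare>' output probability uniformly among all rare words.
import numpy as np

def next_word_prob(y_t, word_idx, rare_idx, C_rare):
    """Return P(next word) from the softmax output y_t."""
    if word_idx == rare_idx:
        return y_t[rare_idx] / C_rare     # rare words split the rare-unit mass
    return y_t[word_idx]                  # frequent words keep their own unit

y_t = np.array([0.5, 0.3, 0.2])           # toy softmax output, index 2 = '<rare>'
print(next_word_prob(y_t, 1, rare_idx=2, C_rare=40))   # a frequent word
print(next_word_prob(y_t, 2, rare_idx=2, C_rare=40))   # any rare word
```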

Stochastic gradient descent was used for optimization with an initial learning rate $\alpha = 0.1$. The learning rate is halved if the validation loss does not decrease in an epoch, and training is stopped once the validation loss saturates, i.e. no longer decreases significantly. Backpropagation was truncated to a single time step ($\tau = 1$), which is a drastic simplification.
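The training schedule might look roughly like the sketch below; `train_epoch` and `validation_loss` are hypothetical stand-ins for the actual model code, not functions from the paper:

```python
# Placeholder routines standing in for the real SGD update and evaluation.
def train_epoch(model, data, lr):
    """One SGD pass over the training data, backprop truncated to one step."""
    pass

def validation_loss(model, data):
    """Mean cross-entropy on the validation set (placeholder value here)."""
    return 0.0

def train(model, train_data, valid_data, lr=0.1, tol=1e-3, max_epochs=50):
    best = float("inf")
    stalled = False
    for epoch in range(max_epochs):
        train_epoch(model, train_data, lr)
        loss = validation_loss(model, valid_data)
        improved = loss < best - tol
        best = min(best, loss)
        if stalled and not improved:
            break                      # validation loss has saturated: stop
        if not improved:
            stalled = True             # first stall: start decaying the rate
        if stalled:
            lr /= 2.0                  # halve the learning rate each further epoch
    return model

train(model=None, train_data=[], valid_data=[])
```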

They also proposed a dynamic RNN-LM, which keeps training even on the test data (with a constant learning rate $\alpha = 0.1$), but only for a single pass. Other proposed configurations include an ensemble of 3 RNN-LMs and a linear interpolation between an RNN-LM and a Kneser-Ney smoothed 5-gram model (KN5), i.e.

$$P(w \mid h) = \lambda\, P_{RNN}(w \mid h) + (1 - \lambda)\, P_{KN5}(w \mid h).$$
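The interpolation itself amounts to a one-liner; the weight and the probability values below are illustrative, not the paper's tuned settings:

```python
# Linear interpolation of next-word probabilities from the two models.
def interpolate(p_rnn, p_kn5, lam=0.75):
    return lam * p_rnn + (1.0 - lam) * p_kn5

print(interpolate(0.02, 0.005))   # 0.75 * P_RNN + 0.25 * P_KN5 for one word
```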

Result highlights

  • 18% reduction in WER with respect to the KN5 baseline, using an ensemble of 3 dynamic RNN-LMs
  • Perplexity of 112 obtained by linearly interpolating static and dynamic RNN-LMs while processing the test data