This blog post discusses the paper "Recurrent Neural Network based Language Model" by Mikolov et al., presented at INTERSPEECH 2010.

Introduction

This paper introduces recurrent neural networks (RNNs) to language modeling. Previous language-modeling techniques were based solely on statistical computations over a large text corpus. Such language models (LMs) are collectively termed n-gram LMs and focus on the task of predicting the next word given the n − 1 preceding context words. In general, such LMs work by first obtaining n-gram probabilities (unigram, bigram, trigram, etc.) for all sequences observed in the training data, followed by the application of a smoothing technique (e.g. Kneser-Ney, Katz) to assign probabilities to rarely occurring word sequences. A popular variant is the backoff LM, which falls back to lower-order n-gram probabilities when the higher-order counts fall below a certain threshold. The major disadvantage of such LMs is that they lack an explicit representation of long-range dependencies.
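As a toy illustration of the backoff idea (not the paper's implementation), here is a bigram model that falls back to a penalized unigram estimate when the bigram count is too low; the class name, threshold, and penalty factor are illustrative assumptions:

```python
# Minimal sketch of a count-based bigram LM with a crude backoff to unigrams.
from collections import Counter

class BackoffBigramLM:
    def __init__(self, corpus, min_count=2, alpha=0.4):
        self.unigrams = Counter(corpus)
        self.bigrams = Counter(zip(corpus, corpus[1:]))
        self.total = len(corpus)
        self.min_count = min_count   # back off below this bigram count
        self.alpha = alpha           # penalty applied to the backed-off estimate

    def prob(self, prev, word):
        # Use the bigram estimate when it is reliably observed ...
        if self.bigrams[(prev, word)] >= self.min_count:
            return self.bigrams[(prev, word)] / self.unigrams[prev]
        # ... otherwise back off to the (penalized) unigram estimate.
        return self.alpha * self.unigrams[word] / self.total

corpus = "the cat sat on the mat the cat ate".split()
lm = BackoffBigramLM(corpus)
print(lm.prob("the", "cat"), lm.prob("cat", "mat"))
```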

The idea of deep-learning-based LMs was introduced by Yoshua Bengio, who first proposed a feedforward neural network to model an n-gram LM. The input to this network is a fixed-length context vector, and hence this model too, although proving much better than statistical models on several tasks, fails to represent long-range dependencies.
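A rough sketch of such a feedforward n-gram LM forward pass is shown below; all sizes and names are made-up assumptions, and the point is simply that the history is clipped to a fixed n − 1 words:

```python
# Illustrative feedforward n-gram LM (not Bengio's original code): the context
# length is fixed at n-1 words, which is exactly the limitation the RNN removes.
import numpy as np

V, d, h, n = 10_000, 64, 128, 4          # vocab, embedding, hidden, n-gram order
C = np.random.randn(V, d) * 0.1          # word embedding table
H = np.random.randn((n - 1) * d, h) * 0.1
U = np.random.randn(h, V) * 0.1

def feedforward_lm(context_ids):
    """context_ids: exactly n-1 word indices -- longer history is simply cut off."""
    x = np.concatenate([C[i] for i in context_ids])   # fixed-length input vector
    hdn = np.tanh(x @ H)
    logits = hdn @ U
    e = np.exp(logits - logits.max())
    return e / e.sum()                                 # softmax over the next word

p = feedforward_lm([17, 42, 7])   # probability distribution over the 10k vocab
print(p.shape, p.sum())
```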

Mikolov et al.'s approach is a step towards solving this dependency problem. Although it requires longer training times, it proved to be much better than existing LMs, achieving significantly lower perplexity and WER.

Network description

Firstly, the vocabulary is procured from the training data by counting the occurrences of each word; all words with counts below a threshold are then merged under the common token 'rare'. Every word in the vocabulary is assigned a one-hot vector $w$. The network has a hidden state $s$ whose dimension is a hyperparameter. Words are fed to the network sequentially, treating each input word as a time step (denoted by $t$). Thus, the input vector at time $t$ is given by $x(t) = w(t) + s(t-1)$ (here '+' denotes concatenation).
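To make the preprocessing concrete, here is a minimal NumPy sketch; the `min_count` threshold, the `<rare>` token name, and the hidden dimension are illustrative assumptions rather than the paper's settings:

```python
# Build the vocabulary with a rare-word cutoff and form x(t) = [w(t); s(t-1)].
import numpy as np
from collections import Counter

def build_vocab(tokens, min_count=5):
    counts = Counter(tokens)
    vocab = [w for w, c in counts.items() if c >= min_count] + ["<rare>"]
    return {w: i for i, w in enumerate(vocab)}

def one_hot(word, vocab):
    v = np.zeros(len(vocab))
    v[vocab.get(word, vocab["<rare>"])] = 1.0   # unseen/rare words share one slot
    return v

vocab = build_vocab("the cat sat on the mat the cat ate the fish".split(), min_count=2)
hidden_dim = 100                          # hyperparameter
s_prev = np.zeros(hidden_dim)             # s(t-1), previous hidden state
x_t = np.concatenate([one_hot("the", vocab), s_prev])   # x(t) = w(t) + s(t-1)
print(x_t.shape)
```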

The subsequent steps are as follows:

$$s_j(t) = f\Big(\sum_i x_i(t)\, u_{ji}\Big), \qquad y_k(t) = g\Big(\sum_j s_j(t)\, v_{kj}\Big),$$

where $f(z) = \frac{1}{1 + e^{-z}}$ is the sigmoid activation used in the hidden layer and $g(z_m) = \frac{e^{z_m}}{\sum_k e^{z_k}}$ is the softmax function used in the output layer.

The weight matrices $U = [u_{ji}]$ and $V = [v_{kj}]$ were initialized from a Gaussian distribution with mean 0 and standard deviation 0.1. Cross-entropy was used as the loss function:

$$L(t) = -\log y_{d(t)}(t),$$

where $d(t)$ is the index of the correct next word.
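The forward step and the loss can be written out directly from the equations above; the sketch below is a simplified single-example version, with the vocabulary and hidden sizes chosen arbitrarily:

```python
# One forward step of the RNN-LM and the per-step cross-entropy loss.
import numpy as np

vocab_size, hidden_size = 5000, 100               # illustrative sizes
rng = np.random.default_rng(0)
# U acts on the concatenated input x(t) = [w(t); s(t-1)],
# V maps the hidden state to the output layer; both drawn from N(0, 0.1).
U = rng.normal(0.0, 0.1, size=(hidden_size, vocab_size + hidden_size))
V = rng.normal(0.0, 0.1, size=(vocab_size, hidden_size))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def forward(w_t, s_prev):
    x_t = np.concatenate([w_t, s_prev])           # x(t) = w(t) + s(t-1)
    s_t = sigmoid(U @ x_t)                        # hidden state s(t)
    y_t = softmax(V @ s_t)                        # next-word distribution y(t)
    return s_t, y_t

def cross_entropy(y_t, target_idx):
    return -np.log(y_t[target_idx])               # -log P(correct next word)

w_t = np.zeros(vocab_size)
w_t[17] = 1.0                                     # one-hot current word
s_t, y_t = forward(w_t, np.zeros(hidden_size))
print(cross_entropy(y_t, 42))
```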

The next-word probabilities were interpreted in this manner ($C_{rare}$ is the total count of rare words):

$$P(w_{t+1} \mid w(t), s(t-1)) = \begin{cases} \dfrac{y_{rare}(t)}{C_{rare}} & \text{if } w_{t+1} \text{ is 'rare'} \\ y_{w_{t+1}}(t) & \text{otherwise.} \end{cases}$$
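A small sketch of how this rule can be applied to the softmax output; the index layout and the toy numbers are assumptions for illustration:

```python
# Share the single '<rare>' output probability uniformly among all rare words.
import numpy as np

def next_word_prob(y_t, word_idx, rare_idx, C_rare):
    """Return P(next word) from the softmax output y_t."""
    if word_idx == rare_idx:
        return y_t[rare_idx] / C_rare     # rare words split the rare-unit mass
    return y_t[word_idx]                  # frequent words keep their own unit

y_t = np.array([0.5, 0.3, 0.2])           # toy softmax output, index 2 = '<rare>'
print(next_word_prob(y_t, 1, rare_idx=2, C_rare=40))   # a frequent word
print(next_word_prob(y_t, 2, rare_idx=2, C_rare=40))   # any rare word
```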

Stochastic gradient descent was used for optimization with an initial learning rate $\alpha = 0.1$. The learning rate is halved if the validation loss does not decrease in an epoch, and training is stopped once the validation loss saturates, i.e. no longer decreases significantly. Backpropagation was truncated to a single time step ($\tau = 1$), which is a drastic simplification.
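The training schedule might look roughly like the sketch below; `train_epoch` and `validation_loss` are hypothetical stand-ins for the actual model code, not functions from the paper:

```python
# Placeholder routines standing in for the real SGD update and evaluation.
def train_epoch(model, data, lr):
    """One SGD pass over the training data, backprop truncated to one step."""
    pass

def validation_loss(model, data):
    """Mean cross-entropy on the validation set (placeholder value here)."""
    return 0.0

def train(model, train_data, valid_data, lr=0.1, tol=1e-3, max_epochs=50):
    best = float("inf")
    stalled = False
    for epoch in range(max_epochs):
        train_epoch(model, train_data, lr)
        loss = validation_loss(model, valid_data)
        improved = loss < best - tol
        best = min(best, loss)
        if stalled and not improved:
            break                      # validation loss has saturated: stop
        if not improved:
            stalled = True             # first stall: start decaying the rate
        if stalled:
            lr /= 2.0                  # halve the learning rate each further epoch
    return model

train(model=None, train_data=[], valid_data=[])
```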

They also proposed a dynamic RNN-LM, which keeps training even on the test data (with a constant learning rate $\alpha = 0.1$), but only for a single pass. Other proposed configurations include an ensemble of 3 RNN-LMs and a linear interpolation between an RNN-LM and a Kneser-Ney smoothed 5-gram model (KN5), i.e.

$$P(w \mid h) = \lambda\, P_{RNN}(w \mid h) + (1 - \lambda)\, P_{KN5}(w \mid h).$$
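The interpolation itself amounts to a one-liner; the weight and the probability values below are illustrative, not the paper's tuned settings:

```python
# Linear interpolation of next-word probabilities from the two models.
def interpolate(p_rnn, p_kn5, lam=0.75):
    return lam * p_rnn + (1.0 - lam) * p_kn5

print(interpolate(0.02, 0.005))   # 0.75 * P_RNN + 0.25 * P_KN5 for one word
```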

Result highlights

  • 18% reduction in WER with respect to the KN5 baseline, using an ensemble of 3 dynamic RNN-LMs
  • Perplexity of 112 obtained by linearly interpolating static and dynamic RNN-LMs while processing the test data