Next Word Prediction using Katz Backoff Model - Part 2: N-gram model, Katz Backoff, and Good-Turing Discounting

Executive Summary

The Capstone Project of the Johns Hopkins Data Science Specialization is to build an NLP application that predicts the next word of a user's text input. In Part 1, we analysed the data and found that many uncommon words and word combinations (2- and 3-grams) can be removed from the corpora to reduce memory usage and speed up model building.

Part 2 will discuss the Katz backoff model and Good-Turing discounting algorithm powering the application.

The Method

The basic idea of word prediction can be illustrated with a sentence like:
\( \text{He likes to eat ice ...} \) You probably think the next word is cream. A model that predicts the next word from the previous \(N-1\) words is called an \(N\)-gram model. A 2-gram/bigram is just a 2-word or 2-token sequence \(w_{i-1}^i\), e.g. "ice cream"; a 3-gram/trigram is a 3-word sequence \(w_{i-2}^i\), e.g. "ice cream cone". A single word, e.g. "ice", is, not surprisingly, called a unigram.

N-gram model

An \(N\)-gram model predicts the last word \(w_{i}\) of an \(N\)-gram from the previous \(N-1\) word sequence \(w_{i-(N-1)}^{i-1}:w_{i-(N-1)}, \dots, w_{i-1}\), by calculating the probability \(P(w_{i} | w_{i-(N-1)}, \dots, w_{i-1})\). One simple version of an \(N\)-gram model is to count the relative frequencies of words following the previous \(N-1\) words, calculate the probabilities, and find the most probable next word. For example, suppose we want to predict the next word of eat ice; we can calculate the probability \( P(\text{cream}|\text{eat ice}) = \frac{C(\text{eat ice cream})}{C(\text{eat ice})}\) and then compare it with the probabilities of other word sequences, e.g. \(P(\text{skating}|\text{eat ice})\), \(P(\text{sheet}|\text{eat ice})\), etc. Note that \(C(\text{eat ice})\) is the number of times the word combination "eat ice" appears in the corpus. This way of estimating probabilities is called maximum likelihood estimation, or MLE. There is a more detailed discussion of \(N\)-gram models and related topics in chapter 4 of [5].
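
As a quick illustration of the counting above, here is a minimal Python sketch using a made-up toy corpus; the corpus and function names are only illustrative. Note that the estimate is undefined (a division by zero) for a history that never occurs in the corpus, which is exactly the problem the rest of this post addresses.

```python
# A minimal sketch of the MLE estimate above, using a made-up toy corpus.
from collections import Counter

corpus = "he likes to eat ice cream and she likes to eat ice cream too".split()

trigram_counts = Counter(zip(corpus, corpus[1:], corpus[2:]))
bigram_counts = Counter(zip(corpus, corpus[1:]))

def p_mle(w, history):
    """MLE estimate P(w | history) = C(history + w) / C(history), for a 2-word history."""
    h1, h2 = history
    return trigram_counts[(h1, h2, w)] / bigram_counts[(h1, h2)]

print(p_mle("cream", ("eat", "ice")))   # 1.0 in this toy corpus
```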

The major problem with the above MLE approach to training an \(N\)-gram model is that it cannot handle \(N\)-grams that are not seen in the training corpus: they get zero probability, even though some of them are perfectly acceptable word sequences in the language considered. Such zero-probability \(N\)-grams always appear in the model for any given corpus. There are many methods to smooth the probability distribution by assigning some non-zero probability to these unseen \(N\)-grams; some of them also adjust the \(N\)-grams with small counts. The method implemented in this project is a backoff model, in particular the Katz backoff model, which is introduced in the next section. It is used in combination with another smoothing technique, Good-Turing discounting, discussed in the sections that follow.

Katz Backoff Model

Backoff smoothing approximates the probability of an unseen \(N\)-gram by resorting to more frequent lower-order \(N\)-grams. In the Katz backoff model [6], the probability of an \(N\)-gram with zero count is approximated by backing off to the \((N-1)\)-gram. The backoff continues until a history (the word sequence preceding the candidate word) with non-zero count is encountered.

More formally, Katz Backoff can be expressed as:

$$ P_\text{Katz}(w_i|w_{i-N+1}^{i-1}) = \begin{cases} P^*(w_i|w_{i-N+1}^{i-1}),& \text{if } C(w_{i-N+1}^i) \gt 0 \\ \alpha(w_{i-N+1}^{i-1})P_\text{Katz}(w_i|w_{i-N+2}^{i-1}), & \text{otherwise.}\end{cases}$$

where \(P^*\) is the Good-Turing discounted probability, and \(\alpha(w_{i-N+1}^{i-1})\) is the normalizing constant that governs how the remaining probability mass, i.e. the backoff weight taken off from the seen higher-order \(N\)-grams, is distributed to the unseen \(N\)-grams, according to the distribution of the \((N-1)\)-grams \(w_{i-N+2}^{i}\). If \(C(w_{i-N+1}^{i-1})=0\), then \(\alpha=1\).

We describe the bigram and trigram models here as concrete examples. Note that the notations \(w_{i}, w_{i-1}\), etc. are replaced by \(x, y, z\) for clarity.

Bigram version

$$P_\text{katz}(z|y) = \begin{cases} P^*(z|y), & C(y,z) \gt 0 \\ \alpha(y)P_\text{ML}(z), & \text{else if }C(y)\gt0 \\ P_\text{ML}(z), & \text{otherwise} \end{cases}$$

where \(P_\text{ML}(z)\) is the maximum likelihood probability of \(z\), i.e. \(C(z)\) divided by the total count of unigrams in the corpus. The value of \(\alpha(y)\) is:

$$ \begin{split} \alpha(y) & = \frac{1-\sum\text{discounted probability of all observed bigrams starting with }y}{\sum \text{probability of }w\text{ over all unobserved bigrams }(y,w)} \\ & =\frac{1-\sum_{w:C(y,w)\gt0}P^*(w|y)}{\sum_{w:C(y,w)=0}P_\text{ML}(w)} \\ & = \frac{1-\sum_{w:C(y,w)\gt0}P^*(w|y)}{1-\sum_{w:C(y,w) \gt 0}P_\text{ML}(w)} \end{split} $$
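
The following is a minimal Python sketch of the bigram formulas above. It assumes two hypothetical lookup tables have already been built from the corpus: `p_star_bigram[(y, z)]`, the Good-Turing discounted probability \(P^*(z|y)\) of each observed bigram, and `p_ml_unigram[w]`, the unigram MLE probability of each word.

```python
# A sketch of bigram Katz backoff, assuming hypothetical precomputed tables:
# p_star_bigram[(y, z)] = P*(z|y) for observed bigrams, p_ml_unigram[w] = P_ML(w).

def alpha(y, p_star_bigram, p_ml_unigram):
    """Left-over probability mass of history y, to be spread over unseen bigrams (y, w)."""
    seen = {w for (h, w) in p_star_bigram if h == y}
    discounted_mass = sum(p_star_bigram[(y, w)] for w in seen)
    unseen_mass = 1.0 - sum(p_ml_unigram[w] for w in seen)
    return (1.0 - discounted_mass) / unseen_mass

def p_katz_bigram(z, y, p_star_bigram, p_ml_unigram):
    if (y, z) in p_star_bigram:                   # C(y, z) > 0: use the discounted probability
        return p_star_bigram[(y, z)]
    if any(h == y for (h, _) in p_star_bigram):   # C(y) > 0 but (y, z) unseen: back off
        return alpha(y, p_star_bigram, p_ml_unigram) * p_ml_unigram.get(z, 0.0)
    return p_ml_unigram.get(z, 0.0)               # history never seen: plain unigram MLE
```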

Trigram version

$$ P_\text{katz}(z|x,y)= \begin{cases} P^*(z|x,y), &\text{if } C(x,y,z) \gt 0 \\ \alpha(x,y)P_\text{katz}(z|y), & \text{else if } C(x,y) \gt 0 \\ P_\text{katz}(z|y), & \text{otherwise.} \end{cases} $$

where the value of \(\alpha(x,y)\) is:

$$ \begin{split} \alpha(x,y) & = \frac{1-\sum\text{discounted probability of all observed trigrams starting with }(x,y)}{1- \sum \text{discounted probability of }(y,w)\text{ over all observed trigrams }(x,y,w)}\\ & =\frac{1-\sum_{w:C(x,y,w)\gt0}P^*(w|x,y)}{1-\sum_{w:C(x,y,w)\gt0}P^*(w|y)} \end{split} $$
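
Continuing the sketch, the trigram version backs off to the bigram function above, again assuming a hypothetical table `p_star_trigram[(x, y, z)]` of Good-Turing discounted trigram probabilities \(P^*(z|x,y)\).

```python
# A sketch of trigram Katz backoff, reusing p_katz_bigram() from the bigram sketch.

def alpha_xy(x, y, p_star_trigram, p_star_bigram):
    """Left-over probability mass of history (x, y), per the trigram formula above."""
    seen = {w for (h1, h2, w) in p_star_trigram if (h1, h2) == (x, y)}
    discounted_mass = sum(p_star_trigram[(x, y, w)] for w in seen)
    bigram_mass = 1.0 - sum(p_star_bigram[(y, w)] for w in seen)
    return (1.0 - discounted_mass) / bigram_mass

def p_katz_trigram(z, x, y, p_star_trigram, p_star_bigram, p_ml_unigram):
    if (x, y, z) in p_star_trigram:                                 # C(x, y, z) > 0
        return p_star_trigram[(x, y, z)]
    if any((h1, h2) == (x, y) for (h1, h2, _) in p_star_trigram):   # C(x, y) > 0
        return (alpha_xy(x, y, p_star_trigram, p_star_bigram)
                * p_katz_bigram(z, y, p_star_bigram, p_ml_unigram))
    return p_katz_bigram(z, y, p_star_bigram, p_ml_unigram)         # (x, y) never seen
```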

Discounting

As mentioned above, one of the major problems of the MLE process for an \(N\)-gram model is the zero-probability word sequences in a sparse training dataset, because any corpus is limited and finite in size. Therefore some acceptable word sequences in the language considered will be missing from it. This problem can be tackled by smoothing, which takes some probability mass from the non-zero counts and assigns it to the zero-count terms. Discounting is a kind of smoothing algorithm that introduces a discount \(d_r\), the ratio of the discounted count to the original count:

$$ d_r=\frac{r^*}{r} $$

There are different types of smoothing algorithms, such as Laplace smoothing and Good-Turing smoothing. The one implemented in this project is Good-Turing discounting and is introduced below.

Good-Turing Discounting Algorithm

A key concept in Good-Turing smoothing is the frequency of frequency \(N_r\), which counts how many \(N\)-grams appear exactly \(r\) times. It is defined as:

$$ N_r=\sum_{x:C(x)=r}1 $$

For example, when we are considering the bigrams in a corpus, \(N_8=2\) means there are 2 bigrams occurring 8 times in the corpus, \(N_5=0\) means no bigram occurs 5 times, and \(N_0=120\) means there are 120 possible bigrams that do not appear in the corpus. We also define \(N\) to be the total number of objects observed: \(N=\sum rN_r\).
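
For instance, the frequency-of-frequency table of the toy corpus used earlier can be built with two rounds of counting (variable names are again only illustrative):

```python
from collections import Counter

corpus = "he likes to eat ice cream and she likes to eat ice cream too".split()
bigram_counts = Counter(zip(corpus, corpus[1:]))

n_r = Counter(bigram_counts.values())         # n_r[r] = number of distinct bigrams seen exactly r times
n_total = sum(r * n for r, n in n_r.items())  # N: total number of observed bigram tokens
```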

The intuition of Good-Turing estimation is to adjust the frequencies of low-frequency (small \(r\)) \(N\)-grams using frequencies of high-frequency (large \(r\)) \(N\)-grams:

$$ r^*=(r+1)\frac{N_{r+1}}{N_r} $$

Specifically, the strategy is to estimate the frequency of unobserved items/words (\(r=0\)) based on the frequency of items/words observed exactly once (\(r=1\)):

$$ P^*_\text{GT}(\text{unobserved items})=\frac{N_1}{N} $$

where \(N\) here is the total number of observed items in the training set. The Good-Turing discounted probability of other items is:

$$ P^*_\text{GT}=\frac{r^*}{N} $$
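
Continuing the counting sketch above, the raw Good-Turing estimates could be computed as follows; `n_r` and `n_total` come from that sketch and are purely illustrative.

```python
def r_star(r, n_r):
    """Raw Good-Turing adjusted count (r + 1) * N_{r+1} / N_r, without smoothing N_r."""
    return (r + 1) * n_r.get(r + 1, 0) / n_r[r]

p_unseen_total = n_r[1] / n_total                    # probability mass reserved for all unseen bigrams
p_gt = {r: r_star(r, n_r) / n_total for r in n_r}    # discounted probability of an item seen r times
# For the largest observed r, N_{r+1} = 0 and the raw estimate collapses to zero;
# this is one of the issues the Simple Good-Turing method below deals with.
```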

There are some complexities in the use of the original Good-Turing estimation [2], and a method called Simple Good-Turing [3] was proposed to make the implementation easier. The main steps of Simple Good-Turing include the following:

  1. Handle the case that most of the \(N_r\) are zero for large \(r\), even though these values are needed when calculating \(r^*\). Instead of depending on the raw count \(N_r\) of each non-zero frequency, these non-zero values are averaged and replaced by \(Z_r=\frac{N_r}{0.5(t-q)}\), where \(q\), \(r\), \(t\) are successive indices of non-zero \(N_r\). When \(r\) is the first index, \(q\) is taken as \(0\), and when \(r\) is the last index, let \(t=2r-q\). From now on, most of the estimated frequencies will be based on \(Z_r\) rather than \(N_r\), even though we may still refer to them as \(N_r\).
  2. Because the ranges of both \(r\) and \(Z_r\) are large, their values are mapped to log space.
  3. Fit a straight line \(\log(Z_r)=a+b \log(r)\) by linear regression to smooth \(Z_r\) (a code sketch of these steps follows this list).
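
A minimal sketch of steps 1 to 3 could look like the following, assuming `n_r` is a dictionary mapping each non-zero frequency \(r\) to \(N_r\); the function name is made up for illustration.

```python
import math

def simple_gt_fit(n_r):
    """Return a function r -> smoothed N_r, following the Simple Good-Turing recipe."""
    rs = sorted(n_r)                                     # non-zero frequencies in increasing order
    z = {}
    for i, r in enumerate(rs):
        q = rs[i - 1] if i > 0 else 0                    # previous non-zero frequency (0 for the first)
        t = rs[i + 1] if i + 1 < len(rs) else 2 * r - q  # next one (2r - q for the last)
        z[r] = n_r[r] / (0.5 * (t - q))                  # step 1: averaged count Z_r
    xs = [math.log(r) for r in rs]                       # step 2: map to log space
    ys = [math.log(z[r]) for r in rs]
    n = len(xs)                                          # step 3: least-squares fit of log(Z_r) = a + b*log(r)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum((x - x_bar) ** 2 for x in xs)
    a = y_bar - b * x_bar
    return lambda r: math.exp(a + b * math.log(r))       # smoothed N_r for any r > 0
```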

We can now calculate the discounted estimate \(r^*\) using the linear regression model from the steps above, even when \(Z_{r+1}\) is not available. In addition, the raw count \(r\) is considered reliable when it is large enough. Katz introduced a constant \(k\) and suggested setting \(k\) to 5:

$$ r^*=r \quad \text{for }r \gt k $$

Thus the final equation to compute \(r^*\) is

$$ r^*=\frac{(r+1)\frac{N_{r+1}}{N_r}-r\frac{(k+1)N_{k+1}}{N_1}}{1-\frac{(k+1)N_{k+1}}{N_1}}, \quad \text{for }1\leq r \leq k $$
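
As an illustration, the equation could be coded directly as below. For simplicity the sketch uses the raw \(N_r\) values; an actual implementation would substitute the smoothed values from the Simple Good-Turing fit above whenever \(N_{r+1}\) is zero.

```python
def discounted_count(r, n_r, k=5):
    """Final Katz/Good-Turing discounted count r* for a raw count r."""
    if r > k:
        return float(r)                           # large counts are treated as reliable
    a = (k + 1) * n_r.get(k + 1, 0) / n_r[1]      # the common correction term (k+1) * N_{k+1} / N_1
    return ((r + 1) * n_r.get(r + 1, 0) / n_r[r] - r * a) / (1 - a)
```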

We can then compute \(P^*_\text{GT}\) for all the observed and unobserved items in the training set, and combine it with the Katz backoff model to build the text prediction model. The implementation will be detailed in Part 3.

Other Methods

There are many smoothing methods other than Good-Turing smoothing and Katz backoff, such as Jelinek-Mercer interpolation [4], Witten-Bell smoothing [9], absolute discounting [8], Kneser-Ney smoothing [7], and modified Kneser-Ney [1]. It is worth exploring different methods and testing their performance in the future.

References

[1] Chen, S.F. and Goodman, J. 1999. An empirical study of smoothing techniques for language modeling. Computer Speech & Language. 13, 4 (1999), 359--394.

[2] Church, K.W. and Gale, W.A. 1991. A comparison of the enhanced Good-Turing and deleted estimation methods for estimating probabilities of English bigrams. Computer Speech & Language. 5, 1 (1991), 19--54.

[3] Gale, W.A. and Sampson, G. 1995. Good-Turing frequency estimation without tears. Journal of Quantitative Linguistics. 2, 3 (1995), 217--237.

[4] Jelinek, F. and Mercer, R. 1980. Interpolated estimation of Markov source parameters from sparse data. Proceedings of Workshop on Pattern Recognition in Practice (Jan. 1980), 381--397.

[5] Jurafsky, D. and Martin, J.H. 2009. Speech and Language Processing (2nd edition). Prentice-Hall, Inc.

[6] Katz, S.M. 1987. Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Transactions on Acoustics, Speech and Signal Processing (1987), 400--401.

[7] Kneser, R. and Ney, H. 1995. Improved backing-off for n-gram language modeling. ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings (Jun. 1995), 181--184, vol. 1.

[8] Ney, H. et al. 1994. On structuring probabilistic dependencies in stochastic language modelling. Computer Speech and Language. 8, (Jan. 1994), 1--38.

[9] Witten, I. and Bell, T. 1991. The zero-frequency problem: Estimating the probabilities of novel events in adaptive text compression. IEEE Transactions on Information Theory. 37, (Jul. 1991), 1085--1094.
