# Next Word Prediction using Katz Backoff Model - Part 2: N-gram model, Katz Backoff, and Good-Turing Discounting

## Executive Summary

The Capstone Project of the Johns Hopkins Data Science Specialization is to build an NLP application that predicts the next word of a user's text input. In Part 1, we analysed the data and found that many uncommon words and word combinations (2- and 3-grams) can be removed from the corpora to reduce memory usage and speed up model building.

Part 2 discusses the Katz backoff model and the Good-Turing discounting algorithm that power the application.

## The Method

The basic idea of word prediction can be illustrated with a sentence such as:

$$
\text{He likes to eat ice …}
$$

You probably think the next word is *cream*. A model that predicts the next word
from the previous \(N-1\) words is called an
**\(N\)-gram model**. A 2-gram/*bigram* is just a 2-word or 2-token
sequence \(w_{i-1}^i\), e.g. “ice cream”; a 3-gram/*trigram* is a
3-word sequence \(w_{i-2}^i\), e.g. “ice cream cone”. A single word,
e.g. “ice”, is, not surprisingly, called a *unigram*.

### N-gram Model

An \(N\)-gram model predicts the last word \(w_{i}\) of an
\(N\)-gram from the previous \(N-1\) word sequence
\(w_{i-(N-1)}^{i-1}:w_{i-(N-1)}, \dots, w_{i-1}\), by calculating
the probability \(P(w_{i} | w_{i-(N-1)}, \dots, w_{i-1})\). One
simple version of an \(N\)-gram model counts the *relative
frequencies* of words given the previous \(N-1\) words, calculates
the probabilities, and picks the most probable next word. For example,
to predict the next word of *eat ice*, we can calculate
the probability: $$
P(\text{cream}|\text{eat ice}) = \frac{C(\text{eat ice cream})}{C(\text{eat ice})}$$
and then compare it with the probabilities of other word sequences, e.g.
\(P(\text{skating}|\text{eat ice})\),
\(P(\text{sheet}|\text{eat ice})\), etc. Note that
\(C(\text{eat ice})\) is the count of the word combination “*eat ice*”
appearing in the corpus. This probability estimation is called
**maximum likelihood estimation** or **MLE**. There is a more detailed
discussion of \(N\)-gram models and related topics in chapter 4 of
^{1}.
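The count-and-divide MLE estimate above can be sketched in a few lines of Python. This is a toy illustration, not the project's actual implementation; the corpus, the `p_mle` helper, and its signature are all made up for the example:

```python
from collections import Counter

# Toy corpus; the real project builds these counts from the blog/news/twitter corpora.
tokens = "he likes to eat ice cream and she likes to eat ice cream too".split()

trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))
bigrams = Counter(zip(tokens, tokens[1:]))

def p_mle(w, history):
    """MLE estimate P(w | history) = C(history + w) / C(history), for a 2-word history."""
    h = tuple(history.split())
    return trigrams[h + (w,)] / bigrams[h]

print(p_mle("cream", "eat ice"))  # C("eat ice cream") / C("eat ice") = 2/2 = 1.0
```

In this tiny corpus *eat ice* is always followed by *cream*, so the MLE probability is 1 — and any word never seen after *eat ice* gets probability 0, which is exactly the problem discussed next.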

The major problem with training an \(N\)-gram model by MLE is that
it cannot handle \(N\)-grams not seen in the training corpus: they
are assigned zero probability, even though some of them are perfectly
acceptable word sequences in the language considered. Such
zero-probability \(N\)-grams always appear for any
given corpus. There are many different methods to *smooth* the
probability distributions by assigning some non-zero probability to
these unseen \(N\)-grams; some of the methods also adjust the
probabilities of \(N\)-grams with small counts. The method implemented
in this project is the **back-off** model, in particular the
**Katz backoff** model, used in combination with another smoothing
technique, **Good-Turing discounting**. Both are discussed in the
following sections.

### Katz Backoff Model

The backoff smoothing is to approximate the probability of an unseen
\(N\)-gram by resorting to more frequent lower order \(N\)-grams. In
Katz backoff model ^{2}, the probability of an \(N\)-gram with zero
count is approximated by backing off to \((N-1)\)-gram. The backoff
will further continue until a history (word sequence precedes the
candidate word) with non-zero count is encountered.

More formally, Katz backoff can be expressed as:

$$
P_\text{katz}(w_i \mid w_{i-N+1}^{i-1}) =
\begin{cases}
P^*(w_i \mid w_{i-N+1}^{i-1}) & \text{if } C(w_{i-N+1}^{i}) > 0 \\
\alpha(w_{i-N+1}^{i-1})\, P_\text{katz}(w_i \mid w_{i-N+2}^{i-1}) & \text{otherwise}
\end{cases}
$$

Where \(P^*\) is the Good-Turing discounted probability, and \(\alpha(w_{i-N+1}^{i-1})\) is the normalizing constant governing how the remaining probability mass, i.e. the backoff weight taken off from the seen higher-order \(N\)-grams, should be distributed to the unseen \(N\)-grams, according to the distribution of the \((N-1)\)-grams \(w_{i-N+2}^{i}\). If \(C(w_{i-N+1}^{i-1})=0\), then \(\alpha=1\).

We describe the bigram and trigram models here as a concrete example. Note that the notations \(w_{i}, w_{i-1}\) etc. are replaced by \(x, y, z\) for clear representation.

#### Bigram version

$$
P_\text{katz}(z \mid y) =
\begin{cases}
P^*(z \mid y) & \text{if } C(y,z) > 0 \\
\alpha(y)\, P_\text{ML}(z) & \text{otherwise}
\end{cases}
$$

where \(P_\text{ML}(z)\) is the maximum likelihood probability of \(z\), i.e. \(C(z)\) divided by the total count of unigrams in the corpus. The value of \(\alpha(y)\) is:

$$
\alpha(y) = \frac{1-\sum_{z:C(y,z)>0} P^*(z \mid y)}{\sum_{z:C(y,z)=0} P_\text{ML}(z)}
$$

#### Trigram version

$$
P_\text{katz}(z \mid x,y) =
\begin{cases}
P^*(z \mid x,y) & \text{if } C(x,y,z) > 0 \\
\alpha(x,y)\, P_\text{katz}(z \mid y) & \text{otherwise}
\end{cases}
$$

where the value of \(\alpha(x,y)\) is:

$$
\alpha(x,y) = \frac{1-\sum_{z:C(x,y,z)>0} P^*(z \mid x,y)}{\sum_{z:C(x,y,z)=0} P_\text{katz}(z \mid y)}
$$
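The bigram version can be sketched in Python. This is a minimal toy, with hypothetical counts and a stand-in discount (a fixed 0.5 subtracted from each seen count) in place of the actual Good-Turing \(P^*\); the point is the backoff structure, not the discounting:

```python
from collections import Counter

# Hypothetical toy counts; p_star would normally come from Good-Turing discounting.
unigram = Counter({"cream": 3, "skating": 1, "sheet": 1, "ice": 5})
bigram = Counter({("ice", "cream"): 3, ("ice", "skating"): 1})
N = sum(unigram.values())

def p_ml(z):
    """Maximum likelihood unigram probability C(z) / total unigram count."""
    return unigram[z] / N

def p_star(z, y):
    """Stand-in discounted probability: subtract a fixed 0.5 from each seen count."""
    return (bigram[(y, z)] - 0.5) / unigram[y]

def alpha(y):
    """Left-over probability mass of history y, spread over unseen continuations."""
    seen = [z for z in unigram if bigram[(y, z)] > 0]
    unseen = [z for z in unigram if bigram[(y, z)] == 0]
    left_over = 1.0 - sum(p_star(z, y) for z in seen)
    return left_over / sum(p_ml(z) for z in unseen)

def p_katz(z, y):
    if bigram[(y, z)] > 0:
        return p_star(z, y)       # seen bigram: use the discounted estimate
    return alpha(y) * p_ml(z)     # unseen bigram: back off to the unigram MLE
```

With these numbers, \(\sum_z P_\text{katz}(z \mid \text{ice}) = 1\): the 0.1 of mass removed from the two seen bigrams is redistributed to *sheet* and *ice* in proportion to their unigram probabilities.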

### Discounting

One of the major problems of the MLE process for an \(N\)-gram model, as
mentioned above, is the zero-probability word sequences in a sparse
training dataset: any corpus is limited and finite in size, so some
acceptable word sequences in the language considered will be missing.
This problem can be tackled by *smoothing*, which reduces some
probability mass from the non-zero counts and assigns it to the zero-count
terms. *Discounting* is a kind of smoothing algorithm that introduces a
discount \(d_r\), the ratio of the discounted count to the
original count:

$$ d_r=\frac{r^*}{r} $$

There are different types of smoothing algorithms, such as Laplace smoothing and Good-Turing smoothing. The one implemented in this project is Good-Turing discounting and is introduced below.

### Good-Turing Discounting Algorithm

A key concept in Good-Turing smoothing is the *frequency of
frequency* \(N_r\), which counts how many \(N\)-grams
appear \(r\) times. It is defined as:

$$ N_r=\sum_{x:C(x)=r}1 $$

For example, when considering bigrams in a corpus, \(N_8=2\) means there are 2 bigrams occurring 8 times in the corpus, \(N_5=0\) means no bigram occurs 5 times, and \(N_0=120\) means there are 120 possible bigrams that do not appear in the corpus. We also define \(N\) to be the total number of objects observed: \(N=\sum rN_r\).
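Computing the frequency-of-frequency table is just a count of counts. A small Python sketch on a toy corpus (illustrative only):

```python
from collections import Counter

# Toy corpus of 14 tokens, giving 13 bigram observations.
tokens = "he likes to eat ice cream and she likes to eat ice cream too".split()
bigram_counts = Counter(zip(tokens, tokens[1:]))

# Frequency of frequencies: N_r = number of distinct bigrams seen exactly r times.
freq_of_freq = Counter(bigram_counts.values())

# Total number of observed bigram tokens: N = sum of r * N_r.
N = sum(r * n_r for r, n_r in freq_of_freq.items())
```

Here five distinct bigrams occur once and four occur twice, so `freq_of_freq` is `{1: 5, 2: 4}` and \(N = 1\cdot5 + 2\cdot4 = 13\).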

The intuition of Good-Turing estimation is to adjust the frequencies of low-frequency (small \(r\)) \(N\)-grams using frequencies of high-frequency (large \(r\)) \(N\)-grams:

$$ r^*=(r+1)\frac{N_{r+1}}{N_r} $$

Specifically, the strategy is to estimate the frequency of unobserved items/words (\(r=0\)) based on frequency of items/words observed once (\(r=1\)).

$$ P^*_\text{GT}(\text{unobserved items})=\frac{N_1}{N} $$

where \(N\) here is the total number of observed items in the training set. The Good-Turing discounted probability of other items is:

$$ P^*_\text{GT}=\frac{r^*}{N} $$
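The two formulas above can be checked on a small hypothetical \(N_r\) table (the numbers below are made up for illustration):

```python
# Hypothetical frequency-of-frequency table for some set of bigrams.
N_r = {1: 120, 2: 40, 3: 24, 4: 13}
N = sum(r * n for r, n in N_r.items())  # total observations: 120 + 80 + 72 + 52 = 324

def r_star(r):
    """Good-Turing adjusted count r* = (r + 1) * N_{r+1} / N_r."""
    return (r + 1) * N_r.get(r + 1, 0) / N_r[r]

p_unseen_total = N_r[1] / N   # probability mass reserved for ALL unseen bigrams
p_seen_once = r_star(1) / N   # discounted probability of one bigram seen once
```

Note that a bigram seen once is discounted from a raw count of 1 down to \(r^* = 2 \cdot 40/120 = 2/3\); the mass removed this way is what funds the unseen events.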

There are some complexities in the use of the original Good-Turing
estimation ^{3}, and a method called **Simple Good-Turing** ^{4} was
proposed to make the implementation easier. The main steps of Simple
Good-Turing include the followings:

- To handle the case that most of the \(N_r\) are zero for large \(r\) (a value needed when calculating \(r^*\)), the raw counts \(N_r\) of the non-zero frequencies are averaged and replaced by \(Z_r=\frac{N_r}{0.5(t-q)}\), where \(q\), \(r\), \(t\) are successive indices of non-zero \(N_r\). When \(r\) is the first index, \(q\) is taken as \(0\), and when \(r\) is the last index, let \(t=2r-q\). From now on, most of the estimated frequencies are based on \(Z_r\) rather than \(N_r\), even though we may still refer to them as \(N_r\).
- Because both the ranges of \(r\) and \(Z_r\) are large, both the values of \(r\) and \(Z_r\) are mapped to log space.
- Fit a straight line by linear regression: \(\log(Z_r)=a+b \log(r)\) to smooth \(Z_r\).
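The three steps above can be sketched in Python with a hypothetical sparse \(N_r\) table (the numbers and the `smoothed_N` helper are illustrative; Simple Good-Turing implementations typically follow Gale & Sampson ^{4} more closely):

```python
import math

# Hypothetical sparse frequency-of-frequency table: N_r is zero for many large r.
nonzero = {1: 120, 2: 40, 3: 24, 5: 13, 10: 2}   # r -> N_r, keyed by non-zero r

rs = sorted(nonzero)
Z = {}
for i, r in enumerate(rs):
    q = rs[i - 1] if i > 0 else 0                      # previous non-zero index (0 for first)
    t = rs[i + 1] if i + 1 < len(rs) else 2 * r - q    # next index (extrapolated for last)
    Z[r] = nonzero[r] / (0.5 * (t - q))                # averaged count Z_r

# Fit the straight line log(Z_r) = a + b * log(r) by ordinary least squares.
xs = [math.log(r) for r in rs]
ys = [math.log(Z[r]) for r in rs]
n = len(xs)
b = (n * sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys)) / \
    (n * sum(x * x for x in xs) - sum(xs) ** 2)
a = (sum(ys) - b * sum(xs)) / n

def smoothed_N(r):
    """Smoothed N_r from the fitted line, usable even where the raw N_r is zero."""
    return math.exp(a + b * math.log(r))
```

For instance \(Z_3 = 24 / (0.5 \cdot (5 - 2)) = 16\), and the fitted line supplies a positive \(N_r\) estimate at \(r = 4\), where the raw count is zero.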

With the above linear regression model we can now calculate the discounted estimate \(r^*\) even when \(Z_{r+1}\) is not available from the raw counts. In addition, the raw count \(r\) is considered reliable when it is large enough; Katz introduced a constant \(k\) and suggested setting \(k\) to 5:

$$ r^*=r \quad \text{for }r \gt k $$

Thus the final equation to compute \(r^*\) is

$$ r^*=\frac{(r+1)\frac{N_{r+1}}{N_r}-r\frac{(k+1)N_{k+1}}{N_1}}{1-\frac{(k+1)N_{k+1}}{N_1}}, \quad \text{for }1\leq r \leq k $$
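This final equation is a direct translation into Python (the \(N_r\) values below are hypothetical; in practice they would be the smoothed \(Z_r\)-based estimates from the regression above):

```python
k = 5  # Katz's suggested reliability cutoff

# Hypothetical frequency-of-frequency counts N_1 .. N_7 for some set of bigrams.
N_r = {1: 268, 2: 112, 3: 70, 4: 41, 5: 24, 6: 14, 7: 9}

def r_star(r):
    """Discounted count: Good-Turing with Katz's cutoff (raw count kept for r > k)."""
    if r > k:
        return float(r)
    common = (k + 1) * N_r[k + 1] / N_r[1]   # (k+1) * N_{k+1} / N_1, shared term
    gt = (r + 1) * N_r[r + 1] / N_r[r]       # plain Good-Turing estimate
    return (gt - r * common) / (1 - common)
```

Counts above the cutoff are left untouched, while counts \(1 \le r \le k\) are discounted slightly less than plain Good-Turing would, so that the total discounted mass still matches \(N_1/N\).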

We can then compute \(P^*_\text{GT}\) for all the observed and unobserved items in the training set, and combine it with the Katz backoff model to build the text prediction model. The implementation will be detailed in Part 3.

## Other Methods

There are many smoothing methods other than Good-Turing smoothing and
Katz backoff, such as Jelinek-Mercer interpolation ^{5}, Witten-Bell
smoothing ^{6}, absolute discounting ^{7}, Kneser-Ney smoothing ^{8},
and modified Kneser-Ney ^{9}. It is worth exploring these methods and
testing their performance in the future.

## Reference

1. Jurafsky, D. and Martin, J.H. 2009. *Speech and Language Processing (2nd edition)*. Prentice-Hall, Inc.
2. Katz, S.M. 1987. Estimation of probabilities from sparse data for the language model component of a speech recognizer. *IEEE Transactions on Acoustics, Speech and Signal Processing* (1987), 400–401.
3. Church, K.W. and Gale, W.A. 1991. A comparison of the enhanced Good-Turing and deleted estimation methods for estimating probabilities of English bigrams. *Computer Speech & Language*. 5, 1 (1991), 19–54.
4. Gale, W.A. and Sampson, G. 1995. Good-Turing frequency estimation without tears. *Journal of Quantitative Linguistics*. 2, 3 (1995), 217–237.
5. Jelinek, F. and Mercer, R. 1980. Interpolated estimation of Markov source parameters from sparse data. *Proceedings of Workshop on Pattern Recognition in Practice* (Jan. 1980), 381–397.
6. Witten, I. and Bell, T. 1991. The zero-frequency problem: Estimating the probabilities of novel events in adaptive text compression. *IEEE Transactions on Information Theory*. 37, (Jul. 1991), 1085–1094.
7. Ney, H. et al. 1994. On structuring probabilistic dependencies in stochastic language modelling. *Computer Speech and Language*. 8, (Jan. 1994), 1–38.
8. Kneser, R. and Ney, H. 1995. Improved backing-off for n-gram language modeling. *ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings* (Jun. 1995), 181–184 vol. 1.
9. Chen, S.F. and Goodman, J. 1999. An empirical study of smoothing techniques for language modeling. *Computer Speech & Language*. 13, 4 (1999), 359–394.