Next Word Prediction using Katz Backoff Model - Part 1: The Data Analysis

Executive Summary

The Capstone Project of the Johns Hopkins Data Science Specialization is to build an NLP application that predicts the next word of a user's text input. This report discusses the nature of the project and the data, the model and algorithm powering the application, and the implementation of the application. Part 1 focuses on the analysis of the provided datasets, which will guide the implementation of the actual text prediction program.

The Project

The main focus of the project is to build a text prediction model, based on a large and unstructured body of English text, that predicts the next word the user intends to type. The required application is a Shiny web app implemented in R.

The Data

The corpora provided for the project include three categories of content, collected from blogs, news, and Twitter. More details about the corpora can be found here.

Each category comes in four languages: US English, German, Finnish, and Russian. The datasets are stored in correspondingly named text files, e.g. “de_DE.blogs.txt”, “en_US.news.txt”, “fi_FI.twitter.txt”, etc. As a proof of concept, we will focus only on the English datasets in this project. We will use the quanteda package to process and analyse the text data because of its support for quantitative analysis and its efficiency in corpus processing.

The first step is to load the three text files, which correspond to the blog, news, and Twitter content in the en_US locale. Next we create three quanteda corpora, one for each dataset. A corpus serves as a repository of the original text, along with some metadata that holds additional information about the corpus. The basic statistics of these files can then be extracted from the corresponding corpus objects.

Loading and Attributes of Data

To load the text files for quanteda to process, we also need the readtext package, which was originally developed as part of quanteda, to read the files and return a data.frame that quanteda recognizes. Because individual blog posts, news articles, and tweets are separated by line breaks, we also split each text file into separate documents for quanteda to process.

require(quanteda)
require(readtext)
require(stringi)
require(ggplot2)
require(plotly)

path = "E:/tmp/JHU_DS/SwiftData/final/en_US/en_US"
filetype = c(".blogs.txt", ".news.txt", ".twitter.txt")
blog = paste0(path, filetype[1])
news = paste0(path, filetype[2])
twitter = paste0(path, filetype[3])

# Create one corpus per source; each line of the raw file becomes a document
corp_b <- corpus(stri_split_lines1(readtext(blog)$text))
corp_n <- corpus(stri_split_lines1(readtext(news)$text))
corp_t <- corpus(stri_split_lines1(readtext(twitter)$text))
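As a quick sanity check on the claim above (an illustrative sketch, not part of the original analysis), quanteda's ndoc() and summary() report basic statistics directly from a corpus:

# Illustrative check of basic corpus statistics (not used in the analysis below)
ndoc(corp_b)            # number of line-level documents in the blog corpus
summary(corp_b, n = 3)  # types, tokens, and sentences for the first 3 documents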

File Size and Corpus Size

The sizes of the raw text files and of the corpora significantly affect how we analyse them, e.g. whether sampling is necessary or whether all the data can be combined, as well as how the prediction model should be built, which in turn greatly affects the performance of the application. Here we take a look at the file sizes and the sizes of the corresponding corpora:

# Object sizes of the in-memory corpora
sizes_corp <- c(format(object.size(corp_b), units="auto"),
                format(object.size(corp_n), units="auto"),
                format(object.size(corp_t), units="auto"))
# Sizes of the raw files on disk, formatted the same way
sizes_file <- c(utils:::format.object_size(file.size(blog), "auto"),
                utils:::format.object_size(file.size(news), "auto"),
                utils:::format.object_size(file.size(twitter), "auto"))
sizes_all <- matrix(c(sizes_file, sizes_corp), nrow = 2, byrow = T,
                    dimnames = list(c("File", "Corpus"), c("Blog", "News", "Twitter")))
sizes_all
##        Blog       News       Twitter   
## File   "200.4 Mb" "196.3 Mb" "159.4 Mb"
## Corpus "318.8 Mb" "25.1 Mb"  "481.3 Mb"

Exploratory Data Analysis

The EDA examines some properties of these corpora, including the number of texts and the longest and shortest texts (in number of words). The second part shows the distribution of text lengths. The final part presents an N-gram analysis of the corpora.

Corpus Properties

Before continuing the analysis, we first need to tokenize the corpora, which here means segmenting the texts in each corpus into words. We can then count the length of each text in the corpora in terms of tokens.

token_b <- tokens(corp_b)
token_n <- tokens(corp_n)
token_t <- tokens(corp_t)
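For example (an illustrative peek only, not part of the analysis), a tokens object can be converted to a list to inspect the tokens of a single document:

# Illustrative: look at the first 10 tokens of the first blog document
head(as.list(token_b)[[1]], 10)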

The number of blog posts, news articles, and tweets, as well as the longest and shortest texts in these datasets, are shown below:

df_b_len <- data.frame(len=sapply(token_b, length))
df_n_len <- data.frame(len=sapply(token_n, length))
df_t_len <- data.frame(len=sapply(token_t, length))

mtx_bstat <- matrix(nrow = 3, ncol = 3)
colnames(mtx_bstat) <- c("Number of Texts", "Longest Text", "Shortest Text")
rownames(mtx_bstat) <- c("Blog", "News", "Twitter")
mtx_bstat[1,] <- c(dim(df_b_len)[1], max(df_b_len), min(df_b_len))
mtx_bstat[2,] <- c(dim(df_n_len)[1], max(df_n_len), min(df_n_len))
mtx_bstat[3,] <- c(dim(df_t_len)[1], max(df_t_len), min(df_t_len))
mtx_bstat
##         Number of Texts Longest Text Shortest Text
## Blog             899288         7440             0
## News              77259         1210             1
## Twitter         2360148           68             1

Let’s take a look at those blog pieces with 0 words:

Blog:

head(corp_b[which(df_b_len==0)], 3)
##                                                                                                     text27249 
##                   "_________________________________________________________________________________________" 
##                                                                                                     text43246 
##              "______________________________________________________________________________________________" 
##                                                                                                     text54643 
## "___________________________________________________________________________________________________________"

The blog pieces with 0 words contain only underscores; we should remove this type of non-word element at a later stage to improve the analysis.
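One possible way to do this (a sketch only, not applied in the analysis that follows; corp_b_clean is a hypothetical name) is to flag documents whose tokens consist entirely of punctuation or symbols and keep only the rest:

# Sketch: flag documents with no word tokens at all, such as the
# underscore-only entries above, and build a corpus without them.
n_word_tokens <- ntoken(tokens(corp_b, remove_punct = TRUE, remove_symbols = TRUE))
sum(n_word_tokens == 0)                     # number of degenerate documents
corp_b_clean <- corp_b[n_word_tokens > 0]   # blog corpus with them removed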

Text Length Distribution

We first get a rough idea of the distribution of text lengths:

matrix(c(quantile(df_b_len$len, prob=c(0.5, 0.9, 0.99, 1)),
         quantile(df_n_len$len, prob=c(0.5, 0.9, 0.99, 1)),
         quantile(df_t_len$len, prob=c(0.5, 0.9, 0.99, 1))),
       byrow = T, nrow = 3, dimnames = list(c("Blog", "News", "Twitter"),
                                 c("50%", "90%", "99%", "100%")))
##         50% 90% 99% 100%
## Blog     33 111 230 7440
## News     36  71 125 1210
## Twitter  15  27  34   68

For blog and news, most texts are far shorter than the longest one. For example, 99% of the texts in the blog corpus are at most 230 tokens long, about 3% of the length of the longest (230/7440 ≈ 3%). The situation for Twitter is different because of the 140-character limit on tweets. We can further examine the distribution through the following plots.

Please note that the text lengths for the blog and news corpora are plotted on a base-10 logarithmic scale to reveal the distribution more clearly.

p_dist_b <- ggplot(df_b_len, aes(log10(len))) +
  geom_histogram(binwidth=density(log10(df_b_len$len))$bw, col="black", fill="red", alpha=0.8) +
  labs(x="Text Length (log10)", y="Text Count", title="Blog")
ggplotly(p_dist_b)
p_dist_n <- ggplot(df_n_len, aes(log10(len))) +
  geom_histogram(binwidth=density(log10(df_n_len$len))$bw, col="black", fill="green", alpha=0.8) +
  labs(x="Text Length (log10)", y="Text Count", title="News")
ggplotly(p_dist_n)
p_dist_t <- ggplot(df_t_len, aes(len)) +
  geom_histogram(binwidth=density(df_t_len$len)$bw, col="black", fill="blue", alpha=0.8) +
  labs(x="Text Length", y="Text Count", title="Tweet")
ggplotly(p_dist_t)

N-gram Analysis

Before analyzing the unigrams, bigrams, and trigrams of the corpora, we clean the datasets by removing numbers, symbols, punctuation, URLs, profanities, and function words (grammatical stopwords) that carry little or no substantive meaning.

# profanity.txt is a local plain-text list of profane words, one per line
profanity <- readLines("profanity.txt")
# Remove numbers, punctuation, symbols, and URLs, then drop English stopwords and profanities
token_b <- tokens(token_b, remove_numbers=T, remove_punct=T, remove_symbols=T, remove_url=T) %>% tokens_remove(c(stopwords("en"), profanity))
token_n <- tokens(token_n, remove_numbers=T, remove_punct=T, remove_symbols=T, remove_url=T) %>% tokens_remove(c(stopwords("en"), profanity))
token_t <- tokens(token_t, remove_numbers=T, remove_punct=T, remove_symbols=T, remove_url=T) %>% tokens_remove(c(stopwords("en"), profanity))

Top 20 Unigrams

A document-feature matrix is created for each of the token objects so that we can extract the counts of the words we are interested in.

dfm_b <- token_b %>% dfm()
dfm_n <- token_n %>% dfm()
dfm_t <- token_t %>% dfm()

Blog Corpus
dotchart(sort(topfeatures(dfm_b,20)), pch=19, xlab = "Count", main = "Top 20 Unigrams from Blog Corpus")

News Corpus
dotchart(sort(topfeatures(dfm_n,20)), pch=19, xlab = "Count", main = "Top 20 Unigrams from News Corpus")

Twitter Corpus
dotchart(sort(topfeatures(dfm_t,20)), pch=19, xlab = "Count", main = "Top 20 Unigrams from Twitter Corpus")

Uncommon Unigrams

We can see that the most common word in the blog corpus is “one” with count 125518, in the news corpus it is “said” with count 19170, and in the Twitter corpus it is “just” with count 150977. However, we are more interested in the insignificant words, i.e. words whose counts fall below a given threshold. We assume that this set of insignificant words can be removed from the training set without affecting the performance of the prediction model much, since it is uncommon for a user to type such words or word combinations.

low_limit = 1
freq_b_uni <- colSums(dfm_b)
sprintf("Tokens appear only once in Blog corpus: %.2f%%", (sum(freq_b_uni <= low_limit)/length(freq_b_uni))*100)
## [1] "Tokens appear only once in Blog corpus: 55.17%"

Let’s take a look at the proportion of uncommon words (unigram tokens here) in the different corpora, for different lower limits:

freq_n_uni <- colSums(dfm_n)
freq_t_uni <- colSums(dfm_t)
mtx_rare_terms_uni <- rbind(sapply(1:4, function(x) sprintf("%.2f%%", (sum(freq_b_uni<=x)/length(freq_b_uni))*100)),
      sapply(1:4, function(x) sprintf("%.2f%%", (sum(freq_n_uni<=x)/length(freq_n_uni))*100)),
      sapply(1:4, function(x) sprintf("%.2f%%", (sum(freq_t_uni<=x)/length(freq_t_uni))*100)))
rownames(mtx_rare_terms_uni) <- c("Blog", "News", "Twitter")
colnames(mtx_rare_terms_uni) <- 1:4
mtx_rare_terms_uni
##         1        2        3        4       
## Blog    "55.17%" "67.36%" "73.30%" "76.85%"
## News    "50.72%" "64.09%" "70.86%" "75.13%"
## Twitter "61.82%" "73.42%" "78.66%" "81.83%"

We can see that with the lower limit set to 4, roughly 75% to 82% of the words in the three corpora are uncommon. Even if we restrict the definition of an uncommon word to one that appears only once in the whole corpus, more than half (about 51% to 62%) of all words are still uncommon. We can therefore safely trim down the dataset by cutting these uncommon words when building the prediction model at a later stage. Trimming the corpus in this way also greatly reduces memory usage and speeds up model building.
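As a sketch of how this trimming could be done (illustrative only; dfm_b_trimmed is a hypothetical name, and this assumes a recent quanteda version where dfm_trim() accepts min_termfreq), dropping the hapax terms from the blog document-feature matrix looks like this:

# Sketch: drop features that occur only once in the blog document-feature matrix
dfm_b_trimmed <- dfm_trim(dfm_b, min_termfreq = 2)
nfeat(dfm_b)          # number of features before trimming
nfeat(dfm_b_trimmed)  # roughly 55% fewer features, per the table above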

Top 20 Bigrams

We also plot the 20 most common bigrams in these corpora as follows:

dfm_b_bi <- token_b %>% tokens_ngrams(2) %>% dfm()
dfm_n_bi <- token_n %>% tokens_ngrams(2) %>% dfm()
dfm_t_bi <- token_t %>% tokens_ngrams(2) %>% dfm()

Blog Corpus
dotchart(sort(topfeatures(dfm_b_bi,20)), pch=19, xlab = "Count", main = "Top 20 Bigrams from Blog Corpus")

News Corpus
dotchart(sort(topfeatures(dfm_n_bi,20)), pch=19, xlab = "Count", main = "Top 20 Bigrams from News Corpus")

Twitter Corpus
dotchart(sort(topfeatures(dfm_t_bi,20)), pch=19, xlab = "Count", main = "Top 20 Bigrams from Twitter Corpus")

Uncommon Bigrams

Similarly, we now take a look at the proportion of uncommon bigrams in these corpora:

freq_b_bi <- colSums(dfm_b_bi)
freq_n_bi <- colSums(dfm_n_bi)
freq_t_bi <- colSums(dfm_t_bi)
mtx_rare_terms_bi <- rbind(sapply(1:4, function(x) sprintf("%.2f%%", (sum(freq_b_bi<=x)/length(freq_b_bi))*100)),
      sapply(1:4, function(x) sprintf("%.2f%%", (sum(freq_n_bi<=x)/length(freq_n_bi))*100)),
      sapply(1:4, function(x) sprintf("%.2f%%", (sum(freq_t_bi<=x)/length(freq_t_bi))*100)))
rownames(mtx_rare_terms_bi) <- c("Blog", "News", "Twitter")
colnames(mtx_rare_terms_bi) <- 1:4
mtx_rare_terms_bi
##         1        2        3        4       
## Blog    "78.86%" "89.32%" "93.08%" "94.98%"
## News    "88.24%" "95.25%" "97.30%" "98.21%"
## Twitter "78.51%" "88.64%" "92.39%" "94.33%"

Top 20 Trigrams

Finally, let’s look at the most common trigrams in these corpora:

dfm_b_tri <- token_b %>% tokens_ngrams(3) %>% dfm()
dfm_n_tri <- token_n %>% tokens_ngrams(3) %>% dfm()
dfm_t_tri <- token_t %>% tokens_ngrams(3) %>% dfm()

Blog Corpus
dotchart(sort(topfeatures(dfm_b_tri,20)), pch=19, xlab = "Count", main = "Top 20 Trigrams from Blog Corpus")

News Corpus
dotchart(sort(topfeatures(dfm_n_tri,20)), pch=19, xlab = "Count", main = "Top 20 Trigrams from News Corpus")

Twitter Corpus
dotchart(sort(topfeatures(dfm_t_tri,20)), pch=19, xlab = "Count", main = "Top 20 Trigrams from Twitter Corpus")

Uncommon Trigrams

And here are the corresponding statistics for uncommon trigrams in the corpora:

freq_b_tri <- colSums(dfm_b_tri)
freq_n_tri <- colSums(dfm_n_tri)
freq_t_tri <- colSums(dfm_t_tri)
mtx_rare_terms_tri <- rbind(sapply(1:4, function(x) sprintf("%.2f%%", (sum(freq_b_tri<=x)/length(freq_b_tri))*100)),
      sapply(1:4, function(x) sprintf("%.2f%%", (sum(freq_n_tri<=x)/length(freq_n_tri))*100)),
      sapply(1:4, function(x) sprintf("%.2f%%", (sum(freq_t_tri<=x)/length(freq_t_tri))*100)))
rownames(mtx_rare_terms_tri) <- c("Blog", "News", "Twitter")
colnames(mtx_rare_terms_tri) <- 1:4
mtx_rare_terms_tri
##         1        2        3        4       
## Blog    "96.54%" "98.96%" "99.48%" "99.68%"
## News    "98.27%" "99.57%" "99.80%" "99.88%"
## Twitter "94.23%" "97.90%" "98.85%" "99.25%"

Summary

In this study, we find that the text datasets are:

  • Large in size, which may make model building difficult and hurt prediction performance in terms of memory usage and speed; sampling the corpora, as sketched below, is one possible mitigation.
  • Peppered with some meaningless texts that are worthless for prediction.
  • Filled with uncommon words, which can be discarded when building the model without greatly affecting its prediction performance.
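
As a sketch of the sampling idea mentioned in the first point (illustrative only; corp_b_small and the sample size of 10,000 documents are placeholders, not values used in this report), quanteda's corpus_sample() can draw a random subset of documents before model building:

# Sketch: work on a random subset of the blog corpus to keep memory use manageable
set.seed(1234)                                       # for reproducibility
corp_b_small <- corpus_sample(corp_b, size = 10000)  # 10,000 randomly chosen blog posts
ndoc(corp_b_small)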