Executive Summary

The Capstone Project of the Johns Hopkins Data Science Specialization is to build an NLP application that predicts the next word of a user's text input. In Part 1, we analysed the training dataset and found some characteristics that the implementation can make use of. In Part 2, we discussed the Good-Turing smoothing estimate and the Katz backoff model that power our text prediction application.
Executive Summary

The Capstone Project of the Johns Hopkins Data Science Specialization is to build an NLP application that predicts the next word of a user's text input. In Part 1, we analysed the data and found that many uncommon words and word combinations (2- and 3-grams) can be removed from the corpora to reduce memory usage and speed up model building.
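To make the pruning idea concrete, here is a minimal Python sketch of counting n-grams and dropping the rare ones. The function names, the toy sentence, and the `min_count` threshold are assumptions for illustration only; they are not the code or the cutoff used in the project.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every contiguous n-gram in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def prune(counts, min_count=2):
    """Keep only the n-grams seen at least min_count times (hypothetical threshold)."""
    return {gram: c for gram, c in counts.items() if c >= min_count}


# Toy corpus, for illustration only.
tokens = "the cat sat on the mat and the cat ran".split()
bigrams = ngram_counts(tokens, 2)
kept = prune(bigrams, min_count=2)
```

In this toy example only one of the distinct bigrams survives, which shows how much smaller the lookup table can become once rare combinations are dropped.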
Executive Summary

The Capstone Project of the Data Science Specialization offered by Johns Hopkins University on Coursera is to build an NLP application that predicts the next word of a user's text input. This report discusses the nature of the project and data, the model and algorithm powering the application, and the implementation of the application. Part 1 focuses on the analysis of the datasets provided, which will guide the implementation of the actual text prediction program.
We discussed the basic ideas of logistic regression in the previous post. The purpose of logistic regression is to find the optimal decision boundary, which separates the data into classes according to a categorical target feature. We also introduced the logistic function, or sigmoid function, as the regression model used to find that boundary. Now let’s take a look at how to achieve it.
Cost Function and Gradient Descent for Logistic Regression

We can still use gradient descent to train the logistic regression model.
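As a rough illustration, the sketch below trains a one-feature logistic regression model with batch gradient descent in plain Python. It assumes the usual cross-entropy cost, whose gradient is the mean of (prediction − label) times the feature; the toy tumor sizes, the learning rate, and the epoch count are made-up values, not taken from the post.

```python
import math


def sigmoid(z):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(xs, ys, alpha=0.1, epochs=5000):
    """Batch gradient descent on the mean cross-entropy cost (assumed cost)."""
    w0, w1 = 0.0, 0.0
    m = len(xs)
    for _ in range(epochs):
        # Prediction errors for the current weights.
        errors = [sigmoid(w0 + w1 * x) - y for x, y in zip(xs, ys)]
        # Partial derivatives of the cost with respect to w0 and w1.
        g0 = sum(errors) / m
        g1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Step against the gradient.
        w0 -= alpha * g0
        w1 -= alpha * g1
    return w0, w1


# Hypothetical tumor sizes with labels (1 = malignant, 0 = benign).
sizes = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]
labels = [0, 0, 0, 1, 1, 1]
w0, w1 = train_logistic(sizes, labels)
predictions = [1 if sigmoid(w0 + w1 * x) >= 0.5 else 0 for x in sizes]
```

The decision boundary is the size at which `w0 + w1 * x` equals zero, that is, where the predicted probability crosses 0.5.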
In the previous series of posts, we discussed simple and multivariate linear regression, which can be used to predict target features with continuous values. There are also prediction problems with categorical target features, where we want to train a model that predicts the class of unseen data. Logistic regression is one of these models.
Classification Problem

Imagine we have a tumor dataset that contains the Size of each tumor and whether the tumor is Malignant or not.
In the previous articles we discussed the basic concept of simple linear regression; how to measure the error of the regression model so that gradient descent can find the global optimum of the regression problem; how to extend the model to multivariate linear regression for real-world problems; and how to choose the learning rate and the initial weight values to start the algorithm. At this point, we can try to solve a real-world problem using linear regression.
Choosing Learning Rate

We introduced an important parameter, the learning rate \(\alpha\), in Linear Regression 2 – Gradient Descent without discussing how to choose its value. In fact, the choice of the learning rate significantly affects the performance of the algorithm. It determines the convergence speed of gradient descent, that is, the number of iterations needed to reach the minimum. The figures below, which we call learning graphs, show how different learning rates affect the speed of the algorithm.
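The effect can also be seen numerically. The sketch below is a toy illustration rather than the regression problem from the post: it runs gradient descent on the one-dimensional function \(f(w) = w^2\) and counts the iterations needed for different learning rates. The starting point, tolerance, and divergence cutoff are assumed values.

```python
def iterations_to_converge(alpha, w=10.0, tol=1e-6, max_iter=10000):
    """Count gradient descent steps on f(w) = w^2 until |w| < tol.

    Returns None if the iterate blows up or max_iter is reached.
    """
    for i in range(max_iter):
        if abs(w) < tol:
            return i
        if abs(w) > 1e12:  # treat a runaway iterate as divergence
            return None
        w -= alpha * 2 * w  # the gradient of w^2 is 2w
    return None


slow = iterations_to_converge(0.01)     # small steps: many iterations
fast = iterations_to_converge(0.1)      # larger steps: far fewer iterations
diverged = iterations_to_converge(1.1)  # too large: overshoots and diverges
```

A learning rate that is too small wastes iterations, while one that is too large makes each step overshoot the minimum by more than the previous step corrected, so the algorithm never converges.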
Simple linear regression can only model the relationship between the target feature and a single descriptive feature, which is rarely the case in real life. For example, the dataset of our toy example now has 4 features, including the target feature Rental Price:
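| Size | Rental Price | Floor | Number of bedrooms |
|------|--------------|-------|--------------------|
| 350  | 1840         | 15    | 1                  |
| 410  | 1682         | 3     | 2                  |
| 430  | 1555         | 7     | 1                  |
| 550  | 2609         | 4     | 2                  |
| …    | …            | …     | …                  |

To generalize simple linear regression to multivariate linear regression is straightforward.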
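One way to picture the generalization: the prediction becomes the intercept plus a weighted sum over all descriptive features, \(M_w(x) = w_0 + w_1x_1 + \dots + w_nx_n\). The Python sketch below evaluates that sum for the first row of the dataset; the weight values are hypothetical and chosen only to make the arithmetic easy to follow, not fitted to the data.

```python
def predict(weights, features):
    """Evaluate M_w(x) = w_0 + w_1*x_1 + ... + w_n*x_n."""
    bias, coefs = weights[0], weights[1:]
    return bias + sum(w * x for w, x in zip(coefs, features))


# Hypothetical weights: intercept, then Size, Floor, Number of bedrooms.
weights = [100.0, 3.0, 10.0, 200.0]
# First row of the toy dataset: Size=350, Floor=15, bedrooms=1.
row = [350, 15, 1]
predicted_price = predict(weights, row)  # 100 + 1050 + 150 + 200
```

With a single feature, the same function reduces to the simple linear regression model \(w_0 + w_1x_1\).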
Why We Need Gradient Descent

In the previous article, Linear Regression 1 – Simple Linear Regression and Cost Function, we introduced the concept of simple linear regression, which is basically to find a regression line model
$$M_w(x) = w_0 + w_1x_1$$ so that the prediction \(M_w(x)\) is as close to the \(y\) of our training data \((x,y)\) as possible. To find the best fit regression line, we are actually finding the optimal combination of the weight parameters \(w_0\) and \(w_1\) and trying to minimize the errors between the predictions and the actual values of target feature \(y\).
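To illustrate what "minimizing the errors" means, the sketch below compares two candidate weight pairs on a tiny made-up dataset. It assumes a mean squared error cost, which may differ by a constant factor from the cost function defined in the earlier article; the data points are generated from \(y = 1 + 2x\) purely for illustration.

```python
def predict(w0, w1, x):
    """Regression line M_w(x) = w0 + w1 * x."""
    return w0 + w1 * x


def mean_squared_error(w0, w1, xs, ys):
    """Average squared gap between predictions and targets (assumed cost)."""
    return sum((predict(w0, w1, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Made-up training data lying exactly on y = 1 + 2x.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

good_fit = mean_squared_error(1, 2, xs, ys)  # weights that match the data
poor_fit = mean_squared_error(0, 1, xs, ys)  # weights that miss every point
```

The better the weights, the smaller the cost, so finding the best-fit line amounts to searching the \((w_0, w_1)\) space for the lowest cost value.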