Linear Regression 1 - Simple Linear Regression and Cost Function

Error-based Learning

Imagine you are just starting to learn skipping: you may occasionally trip over the rope or even fall down. You then slightly adjust the jumping speed, the height, the strength of your arms, your foot balance, and so on. You may trip again, and you keep adjusting your posture and movement. One day, you find that skipping is just a piece of cake! You have learnt it by making countless errors but failing better each time, tuning your motion little by little. This is error-based learning. Linear regression is one of the error-based learning methods: it iteratively adjusts a set of coefficients and other parameters, based on the errors measured by a cost function, to find a model that can fulfil the regression analysis task.

Simple Linear Regression

Linear regression is a type of regression analysis, which is applied to predict continuous outcomes. It tries to develop a model that fits the given set of descriptive features and the corresponding outcomes, or responses. This process is called training the model, and the given set of data is the training set. Linear regression belongs to the family of supervised learning, as the values of the target feature are given along with the training set.

Before diving into the details of linear regression, let’s take the following dataset as our toy example, which records the Size and the Rental Price of some flats:

Size    Rental Price
350     1840
410     1682
430     1555
550     2609
480     1815
650     1900
900     4615
830     3512
730     2153
1200    6308

The scatter plot of the above data suggests a linear relationship between these two features.

[Figure: scatter plot and regression line of our toy example]

What we want to do is find a model, as shown in the figure below (yes, the model is just a straight line), that can capture this relationship between the size and the rental price of the flats. With such a model we can (1) understand how the size affects the rental price, and (2) predict the rental price of a new flat of a particular size. For example, a flat of size 900 should rent at around 4140 dollars. These are the two important things that linear regression can offer us.

[Figure: scatter plot and regression line of our toy example]

Model Representation

We can use the line equation to represent the model that captures the relationship between these two continuous features. A line equation is written as: $$y=mx+b$$ where \(m\) is the slope of the line and \(b\) is the y-intercept of the line. In our problem dataset the rental price depends only on the size of the flat, so we call this kind of linear regression problem with only one input variable Simple Linear Regression or Univariate Linear Regression. The general form of this model is: $$M_w(x) = \hat{y} = w_0x_0+w_1x_1$$ Here, \(w_0\) and \(w_1\) are called the weights of the model. In the case of simple linear regression, \(x_0\) is set to 1, so \(w_0x_0\) equals \(w_0\) and represents the y-intercept. \(w_1\) is the parameter, or coefficient, of the input variable \(x_1\). \(M_w(x)\), or \(\hat{y}\), is the predicted value. The objective of simple linear regression is to find the optimal weights of the line equation, the regression line, that captures the relationship between the descriptive feature and the target feature. We can then use it to predict the value of the target feature of a new instance of data, given the value of its descriptive feature. The model of our toy example is:

Rental Price = 5.44 × Size − 755.79
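To make this concrete, here is a minimal sketch (assuming Python with NumPy; the variable names are just for illustration) that fits the regression line to the toy data with an ordinary least-squares solve and then predicts the rental price of a flat of size 900:

```python
import numpy as np

# Toy dataset: flat sizes and the corresponding rental prices
size = np.array([350, 410, 430, 550, 480, 650, 900, 830, 730, 1200], dtype=float)
price = np.array([1840, 1682, 1555, 2609, 1815, 1900, 4615, 3512, 2153, 6308], dtype=float)

# Design matrix with x0 = 1 so that w0 * x0 is the y-intercept
X = np.column_stack([np.ones_like(size), size])

# Ordinary least-squares solution for the weights [w0, w1]
w, *_ = np.linalg.lstsq(X, price, rcond=None)
print(f"Rental Price = {w[1]:.2f} x Size + ({w[0]:.2f})")  # roughly 5.44 and -755.79

# Predict the rental price of a new flat of size 900
new_flat = np.array([1.0, 900.0])
print(f"Predicted rental price: {new_flat @ w:.0f}")        # around 4140
```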

Cost Function

In order to find the optimal values of the weights, we need to define a function that measures how well the model fits the training dataset. The function that measures the errors between the predictions \(M_w(x)\) made by the model and the actual values \(y\) of the target feature in the training dataset is called the cost function or error function. Without going too deep into the intuition behind its derivation, the cost function \(J\) of the weights \(w_0\) and \(w_1\) is defined as half the mean of the squared errors between the predictions and the actual values:

$$J(w_0,w_1)=\frac{1}{2m}\sum_{i=1}^m\Big(M_w(x^i)-y^i\Big)^2$$

where \(M_w(x^i)\) is the model's prediction for the \(i\)th instance of a training set with \(m\) instances, and \(y^i\) is the actual value corresponding to that instance. If we plug the general form of simple linear regression into the cost function, it expands to

$$J(w_0,w_1 )=\frac{1}{2m}\sum_{i=1}^m\Big((w_0+w_1x_1^i)-y^i \Big)^2$$

And this is the cost function of simple linear regression. Next, we shall discuss how to find the optimal regression line by minimising the cost function, and hence the error the model makes on a particular training set.
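The expanded cost function translates directly into code. Below is a small sketch (again assuming Python with NumPy; the names are illustrative) that evaluates \(J(w_0, w_1)\) for any pair of weights on the toy rental dataset:

```python
import numpy as np

def cost(w0, w1, x, y):
    """Half the mean squared error, J(w0, w1), of the line w0 + w1*x on the data (x, y)."""
    m = len(x)
    predictions = w0 + w1 * x      # M_w(x) for every training instance
    errors = predictions - y       # difference between prediction and actual value
    return np.sum(errors ** 2) / (2 * m)

size = np.array([350, 410, 430, 550, 480, 650, 900, 830, 730, 1200], dtype=float)
price = np.array([1840, 1682, 1555, 2609, 1815, 1900, 4615, 3512, 2153, 6308], dtype=float)

# The fitted weights from above give a lower cost than an arbitrary guess
print(cost(-755.79, 5.44, size, price))   # cost at (roughly) the fitted weights
print(cost(0.0, 3.0, size, price))        # cost of an arbitrary line, noticeably larger
```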

Error Surface

To discuss how to minimise the cost function, let’s assume we have another training set as below:

x    y
1    1
2    2
3    3

We are going to try different combinations of \(w_0\) and \(w_1\) and calculate the squared errors. For simplicity, we set \(w_0\) to 0 and only change \(w_1\) to see the behaviour of the cost function \(J\). If we put the values of the training data into the cost function, we get:

$$J(0,w_1)=\frac{1}{2×3}\Big((w_1-1)^2+(2w_1-2)^2+(3w_1-3)^2\Big)$$
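Since every term shares the common factor \((w_1-1)^2\), this simplifies to

$$J(0,w_1)=\frac{1}{6}\Big((w_1-1)^2+4(w_1-1)^2+9(w_1-1)^2\Big)=\frac{7}{3}(w_1-1)^2$$

so the cost grows quadratically as \(w_1\) moves away from 1.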

We list some values of \(w_1\) and the corresponding values of \(J\):

\(w_1\)    \(J(0,w_1)\)
0.1        1.89
0.5        0.583
1          0
1.5        0.583
1.9        1.89

and the corresponding regression lines:

[Figure: regression lines for different values of \(w_1\)]

You can see that when \(w_1\) equals 1, the line passes through all the training data and the squared error is 0, which minimises the cost function \(J\). We can confirm this by plotting the values of \(J\) against different values of \(w_1\), which is known as the error surface, as in the figure below:

[Figure: \(J\) plotted against \(w_1\)]

Although for this simple example we can try out different combinations of the weights to minimise \(J\), this kind of brute-force search is infeasible for most real-world problems. Fortunately, the properties of the linear regression model ensure that there exists a global minimum, and we can find it in a systematic way. We shall cover one of these methods, and discuss how to generalise the model to problems with more than one descriptive feature, in the next article.
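As a quick check on the table and error curve above, the following minimal sketch (assuming Python with NumPy; the variable names are just for illustration) sweeps candidate values of \(w_1\) with \(w_0\) fixed at 0 and prints the corresponding costs:

```python
import numpy as np

# Tiny training set from the error surface example: x = [1, 2, 3], y = [1, 2, 3]
x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])
m = len(x)

# Evaluate J(0, w1) for a few candidate slopes
for w1 in [0.1, 0.5, 1.0, 1.5, 1.9]:
    predictions = w1 * x                          # w0 is fixed at 0
    J = np.sum((predictions - y) ** 2) / (2 * m)  # half the mean squared error
    print(f"w1 = {w1:<4}  J = {J:.3f}")
```

Running it reproduces the values listed in the table, with the minimum \(J = 0\) at \(w_1 = 1\).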
