Linear Regression 1 – Simple Linear Regression and Cost Function

2017-01-17

Error-based Learning

Imagine you are just starting to learn to skip rope: at first you occasionally trip over the rope or even fall down.  You then slightly adjust your jumping speed, height, arm strength, foot balance, and so on.  You may trip again, and you keep adjusting your posture and movement.  Then one day you find that skipping is just a piece of cake!  You have learnt the skill by making countless errors but failing a little better each time, tuning your motion bit by bit.  This is error-based learning.  Linear regression is an error-based learning method: it iteratively adjusts a set of coefficients and other parameters, based on the errors measured by a cost function, until it finds a model that can fulfil the regression analysis task.

Simple Linear Regression

Linear regression is a type of regression analysis, which is used to predict continuous outcomes.  Linear regression tries to develop a model that fits the given set of descriptive features and the corresponding outcomes, or responses.  This process is called training the model, and the given set of data is the training set.  Linear regression belongs to the family of supervised learning, as the values of the target feature are provided along with the training set.

Before diving into the details of linear regression, let's take the following dataset as our toy example.  It records the Size (in m²) and the Rental Price (in dollars) of a few flats:

Size (m²)   Rental Price ($)
35          1840
41          1682
43          1555
55          2609
48          1815
65          1900
90          4615
83          3512
73          2153
120         6308

A scatter plot of the above data suggests that there is a linear relationship between these two features.

scatter plot of size versus rental price
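If you would like to reproduce this plot yourself, here is a minimal sketch using NumPy and Matplotlib (the variable names are my own, not part of the original article):

    import numpy as np
    import matplotlib.pyplot as plt

    # Toy dataset: flat size (m^2) and rental price ($)
    size = np.array([35, 41, 43, 55, 48, 65, 90, 83, 73, 120])
    rental_price = np.array([1840, 1682, 1555, 2609, 1815, 1900, 4615, 3512, 2153, 6308])

    plt.scatter(size, rental_price)
    plt.xlabel("Size (m^2)")
    plt.ylabel("Rental Price ($)")
    plt.show()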

What we want to do is to find a model, as shown in the figure below (yes, the model is just a straight line), that captures this relationship between the size and the rental price of the flats.  With such a model we can (1) understand how the size affects the rental price, and (2) predict the rental price of a new flat of a particular size.  For example, a 90 m² flat should rent at around 4140 dollars.  These are the two important things that linear regression can offer us.

scatter plot and regression line of rental price versus size

Model Representation

We can use the equation of a straight line to represent the model that captures the relationship between these two continuous features.  A line equation is written as:

    \[y=mx+b\]

where m is the slope of the line and b is the y-intercept.  In our dataset, the rental price depends only on the size of the flat; a linear regression problem with only one input variable like this is called Simple Linear Regression or Univariate Linear Regression.  The general form of this model is:

    \[\mathbb{M}_w(x)=\hat{y}=w_0x_0+w_1x_1\]

Here, w0 and w1 are called the weights of the model.  In linear regression, x0 is set to 1, so w0x0 equals w0 and represents the y-intercept.  w1 is the parameter, or coefficient, of the input variable x1.  \mathbb{M}_w(x), or \hat{y}, is the predicted value.  The objective of simple linear regression is to find the optimal weights of the line equation, or regression line, that captures the relationship between the descriptive feature and the target feature.  We can then use this line to predict the value of the target feature for a new instance, given its value of the descriptive feature.  The model of our toy example is

    \[\text{Rental Price} = 54.43 \times \text{Size} - 755.79\]
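As a sanity check, these weights can be recovered with an ordinary least-squares fit.  Here is a minimal sketch with NumPy (np.polyfit is just one convenient way to solve the least-squares problem; the prediction step also illustrates the x0 = 1 trick, where w0 acts as the y-intercept):

    import numpy as np

    size = np.array([35, 41, 43, 55, 48, 65, 90, 83, 73, 120])
    rental_price = np.array([1840, 1682, 1555, 2609, 1815, 1900, 4615, 3512, 2153, 6308])

    # Least-squares fit of a degree-1 polynomial: returns [slope, intercept]
    w1, w0 = np.polyfit(size, rental_price, 1)
    print(w1, w0)                    # roughly 54.43 and -755.8

    # Prediction M_w(x) = w0*x0 + w1*x1, with x0 fixed to 1
    x_new = np.array([1.0, 90.0])
    print(np.dot([w0, w1], x_new))   # a 90 m^2 flat -> roughly 4140 dollars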

Cost function

In order to find the optimal values of the weights, we need a function that measures how well the model fits the training dataset.  The function measuring the errors between the predictions \mathbb{M}_w(x) made by the model and the actual values y of the target feature in the training dataset is called the cost function or error function.  Without going too deep into the intuition behind its derivation, the cost function J of the weights w0 and w1 is defined as half the mean of the squared errors between the predictions and the actual values:

    \[J(w_0,w_1)=\frac{1}{2m}\sum^m_{i=1}\left(\mathbb{M}_w(x^{(i)})-y^{(i)}\right)^2\]

where \mathbb{M}_w(x^{(i)}) is the model's prediction for the ith instance in a training set of m instances, and y^{(i)} is the actual value of that instance.  If we plug the general form of simple linear regression into the cost function, it expands to

    \[J(w_0,w_1)=\frac{1}{2m}\sum^m_{i=1}\left((w_0+w_1x^{(i)}_1)-y^{(i)}\right)^2\]

And this is the cost function of simple linear regression.  Next, we shall discuss how to find the optimal regression line by minimising the cost function, and hence the error the model makes on a particular training set.
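Before moving on, here is a minimal sketch of this cost function in NumPy (the function name cost and the vectorised form are my own choices, assuming the simple linear model above):

    import numpy as np

    def cost(w0, w1, x, y):
        """J(w0, w1): half the mean squared error of the simple linear model."""
        m = len(y)
        errors = (w0 + w1 * x) - y      # M_w(x^(i)) - y^(i) for every instance
        return np.sum(errors ** 2) / (2 * m)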

Error surface

To discuss how to minimise the cost function, let’s assume we have another training set as below:

x y
1 1
2 2
3 3

We are going to try different combinations of w0 and w1 and calculate the squared errors.  For simplicity, we set w0 to 0 and change only w1 to see the behaviour of the cost function J.  Substituting the training data into the cost function, we have:

    \[J(0,w_1)=\frac{1}{2 \times 3}\left[(w_1-1)^2+(2w_1-2)^2+(3w_1-3)^2\right]\]

We list some values of w1 and the corresponding values of J:

w1 J(0,w1)
0.1 1.89
0.5 0.583
1 0
1.5 0.583
1.9 1.89

and the corresponding regression lines:

figure of regression lines of different w1
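You can check the values in the table above with a short script; a minimal sketch (restating the cost function from the earlier sketch so this snippet is self-contained):

    import numpy as np

    x = np.array([1.0, 2.0, 3.0])
    y = np.array([1.0, 2.0, 3.0])

    def cost(w0, w1, x, y):
        m = len(y)
        return np.sum(((w0 + w1 * x) - y) ** 2) / (2 * m)

    for w1 in [0.1, 0.5, 1.0, 1.5, 1.9]:
        print(w1, round(cost(0.0, w1, x, y), 3))   # 1.89, 0.583, 0.0, 0.583, 1.89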

You can see that when w1 equals 1, the line passes through all the training data and the squared error is 0, which minimises the cost function J.  We can confirm this by plotting the values of J against different values of w1; this plot is known as the error surface, shown in the figure below:

figure of the J vs w1
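If you would like to reproduce this curve, here is a minimal sketch (same toy data as above, with w0 fixed at 0; Matplotlib assumed):

    import numpy as np
    import matplotlib.pyplot as plt

    x = np.array([1.0, 2.0, 3.0])
    y = np.array([1.0, 2.0, 3.0])

    w1_values = np.linspace(-1, 3, 100)
    J = [np.sum((w1 * x - y) ** 2) / (2 * len(y)) for w1 in w1_values]  # w0 fixed at 0

    plt.plot(w1_values, J)
    plt.xlabel("w1")
    plt.ylabel("J(0, w1)")
    plt.show()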

Although for this simple example we can try out different combinations of the weights to minimise J, this kind of brute-force search is infeasible for most real-world problems.  Fortunately, the properties of the linear regression model ensure that a global minimum exists and can be found in systematic ways.  In the next article, we shall cover one of these methods and discuss how to generalise the model to problems with more than one descriptive feature.
