# Linear Regression 2 - Gradient Descent

## Why We Need Gradient Descent

In the previous article, Linear Regression 1 – Simple Linear Regression and Cost Function, we introduced simple linear regression, where the goal is to find a regression line model

$$M_w(x) = w_0 + w_1x_1$$
so that the prediction $$M_w(x)$$ is as close as possible to the $$y$$ of our training data $$(x,y)$$. To find the best-fit regression line, we look for the optimal combination of the weight parameters $$w_0$$ and $$w_1$$ that minimizes the errors between the predictions and the actual values of the target feature $$y$$. We then defined the error function, or cost function, that measures these errors as:

$$J(w_0,w_1)=\frac{1}{2m}\sum_{i=1}^m\Big(M_w(x^i)-y^i\Big)^2=\frac{1}{2m}\sum_{i=1}^m\Big((w_0+w_1x_1^i)-y^i\Big)^2$$
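As a concrete check, the cost function can be sketched in a few lines of NumPy (the function name and the toy data here are just illustrative):

```python
import numpy as np

def cost(w0, w1, x, y):
    """Cost J(w0, w1): mean squared error with the 1/(2m) factor."""
    m = len(x)
    predictions = w0 + w1 * x        # M_w(x) for every training instance
    return np.sum((predictions - y) ** 2) / (2 * m)

# Toy data lying exactly on the line y = 1 + 2x,
# so the cost at (w0, w1) = (1, 2) should be zero.
x = np.array([1.0, 2.0, 3.0])
y = np.array([3.0, 5.0, 7.0])
print(cost(1.0, 2.0, x, y))  # 0.0
print(cost(0.0, 0.0, x, y))  # positive for any other weight pair
```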

To find the minimum of the cost function, we first calculate its gradient. That is, we calculate the partial derivatives of $$J$$ with respect to $$w_0$$ and $$w_1$$:

$$\begin{cases} \frac{\partial J}{\partial w_0}=\frac{1}{m}\sum_{i=1}^m\Big((w_0+w_1x_1^i)-y^i\Big) \\ \frac{\partial J}{\partial w_1}=\frac{1}{m}\sum_{i=1}^m\Big(\big((w_0+w_1x_1^i)-y^i\big)x_1^i\Big) \end{cases}$$
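These two partial derivatives translate directly into code. A minimal sketch (self-contained, with made-up data for illustration):

```python
import numpy as np

def gradients(w0, w1, x, y):
    """Partial derivatives of J(w0, w1) with respect to w0 and w1."""
    m = len(x)
    errors = (w0 + w1 * x) - y       # prediction error per instance
    dw0 = np.sum(errors) / m         # dJ/dw0
    dw1 = np.sum(errors * x) / m     # dJ/dw1
    return dw0, dw1

# On data lying exactly on y = 1 + 2x, the gradient at (1, 2) vanishes.
x = np.array([1.0, 2.0, 3.0])
y = np.array([3.0, 5.0, 7.0])
print(gradients(1.0, 2.0, x, y))  # (0.0, 0.0)
```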

Linear regression is a convex problem, which means it has a single global optimum, and the gradient at that point equals zero. Therefore, we can set the two partial derivatives above to zero and solve the equations analytically to find the optimal $$w_0$$ and $$w_1$$. However, because of the problem size, in both the number of instances and the number of features, solving these equations directly is not feasible for most real-world problems. Instead of solving the equations and finding the optimal values in one shot, there are other approaches that learn the weights in an iterative manner, and one of them is Gradient Descent.
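For the simple one-feature case, setting the two partial derivatives to zero can actually be solved by hand, which yields the familiar least-squares formulas. A small sketch, with hypothetical data chosen so the answer is obvious:

```python
import numpy as np

# Hypothetical training data, generated exactly from y = 1 + 2x.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])

# Solving dJ/dw0 = 0 and dJ/dw1 = 0 analytically gives:
w1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
w0 = y.mean() - w1 * x.mean()
print(w0, w1)  # 1.0 2.0
```

This one-shot solution is exactly what becomes impractical at scale, motivating the iterative approach below.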

Gradient Descent starts by picking a random point, i.e. a set of random values of the weights ($$w_0$$ and $$w_1$$ in our example case), on the error surface. It then looks for a direction that moves the current point downward on the error surface: it determines the slope of the error surface at that point, so that we can move the point slightly along the slope and obtain a smaller error. The algorithm then repeats the slope-finding and moving steps, trying to reach the global minimum. The figure below shows an example of starting from a random point and gradually moving to the minimum. Blue line segments show the jumps between the $$(w_1, J)$$ pairs, finally reaching $$(1.0, 0)$$.

The gradient is the slope of a function $$f$$ at a certain point $$x$$, denoted $$\nabla_x f(x)$$. To determine the gradient at $$x$$, we first calculate the partial derivatives of $$f$$ with respect to each $$x_i$$. The partial derivative $$\frac{\partial}{\partial x_i}f(x)$$ shows how $$f$$ changes at $$x$$ when $$x_i$$ changes. The gradient is then defined as the vector of all partial derivatives $$\frac{\partial}{\partial x_i}f(x)$$. Knowing the slope, we can determine how to adjust the point $$x$$ in the next step to move toward the minimum, just like walking downhill from a random location in a valley, and that is why the algorithm is called Gradient Descent.

Using the model at the beginning as an example, in its simplest form the gradient descent algorithm does the following:

1. Start from a random pair of $$w_0$$ and $$w_1$$
2. Iteratively update $$w_0$$ and $$w_1$$ to reduce the cost function $$J(w_0, w_1)$$, that is, to reduce the error generated by the model, until a minimum is found or other stopping criteria are met.

There is an important assumption in the gradient descent algorithm: we treat the weights as independent of each other, so we can, and should, update them simultaneously. Keep this in mind in the calculations below.

To update the weights, we just need to adjust them along the gradient as below:

$$\begin{cases} w_0=w_0-\alpha\frac{\partial}{\partial w_0}J(w_0, w_1) \\ w_1=w_1-\alpha\frac{\partial}{\partial w_1}J(w_0, w_1) \end{cases}$$

where the partial derivatives are the ones calculated at the beginning. Hence the updating process in step 2 becomes:

$$\begin{cases} w_0=w_0-\alpha\frac{1}{m}\sum_{i=1}^m\Big((w_0+w_1x_1^i)-y^i\Big) \\ w_1=w_1-\alpha\frac{1}{m}\sum_{i=1}^m\Big(\big((w_0+w_1x_1^i)-y^i\big)x_1^i\Big) \end{cases}$$
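Putting the pieces together, the update rules above can be sketched as a small gradient descent loop. The learning rate, iteration count, and starting weights here are illustrative choices, not prescriptions:

```python
import numpy as np

def gradient_descent(x, y, alpha=0.1, iterations=2000):
    """Learn w0 and w1 by repeatedly stepping along the negative gradient."""
    m = len(x)
    w0, w1 = 0.0, 0.0                # fixed start here; a random pair also works
    for _ in range(iterations):
        errors = (w0 + w1 * x) - y
        dw0 = np.sum(errors) / m     # dJ/dw0
        dw1 = np.sum(errors * x) / m # dJ/dw1
        # Simultaneous update: both new weights use the OLD w0 and w1.
        w0, w1 = w0 - alpha * dw0, w1 - alpha * dw1
    return w0, w1

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])  # generated exactly from y = 1 + 2x
w0, w1 = gradient_descent(x, y)
print(round(w0, 3), round(w1, 3))  # 1.0 2.0
```

Note the tuple assignment `w0, w1 = ...`: both gradients are computed from the old weights before either weight changes, which is exactly the simultaneous update assumption discussed earlier.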

The constant $$\alpha$$ is called the learning rate. It determines how far each adjustment goes, and we shall discuss it in the next article.