Logistic Regression 2 – Cost Function, Gradient Descent and Other Optimization Algorithms

We discussed the basic ideas of logistic regression in the previous post. The purpose of logistic regression is to find the optimal decision boundary that separates data with a categorical target feature into its classes. We also introduced the logistic function, or sigmoid function, as the regression model used to find that boundary. Now let’s take a look at how to train it.

Cost Function and Gradient Descent for Logistic Regression

We can still use gradient descent to train the logistic regression model. The only difference is the cost function, since the model is now the sigmoid function rather than the equation of a line. Recall that the cost function \(J\) for linear regression, which measures the squared error between predictions and actual values, is:

$$J(w)=\frac{1}{2m}\sum^m_{i=1}\Big(M_w(x^i)-y^i\Big)^2$$

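With the sigmoid as the model, this squared-error cost is no longer convex in \(w\), so logistic regression uses the cross-entropy (log loss) cost instead: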

$$J(w)=-\frac{1}{m}\sum^m_{i=1}\Big[y^i\log M_w(x^i)+(1-y^i)\log \big(1-M_w(x^i)\big)\Big]$$

and the gradient is:

$$\frac{\partial J(w)}{\partial w_j}=\frac{1}{m}\sum^m_{i=1}\Big(M_w(x^i)-y^i\Big)x^i_j$$

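where \(M_w(x^i)=\frac{1}{1+e^{-w^Tx^i}}\) is the sigmoid model's prediction for the \(i\)-th sample, \(x^i_j\) is the \(j\)-th feature of that sample, and \(m\) is the number of samples.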
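As a quick sanity check on the two formulas above, here is a minimal sketch in plain Python that evaluates the cost and the gradient and compares the gradient against a finite-difference estimate. The function names and the toy data are made up for illustration, and each input is assumed to start with a constant 1 so that \(w_0\) acts as the bias.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, x):
    """M_w(x): sigmoid of the weighted sum. x[0] is assumed to be 1 (bias)."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)))

def cost(w, X, y):
    """Cross-entropy cost J(w), averaged over the m samples."""
    m = len(X)
    total = 0.0
    for xi, yi in zip(X, y):
        p = predict(w, xi)
        total += yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return -total / m

def gradient(w, X, y):
    """dJ/dw_j = (1/m) * sum_i (M_w(x^i) - y^i) * x^i_j for every j."""
    m = len(X)
    grad = [0.0] * len(w)
    for xi, yi in zip(X, y):
        error = predict(w, xi) - yi
        for j, xij in enumerate(xi):
            grad[j] += error * xij / m
    return grad

def numerical_gradient(w, X, y, h=1e-6):
    """Central finite differences, used only to check the formula."""
    grad = []
    for j in range(len(w)):
        w_plus, w_minus = list(w), list(w)
        w_plus[j] += h
        w_minus[j] -= h
        grad.append((cost(w_plus, X, y) - cost(w_minus, X, y)) / (2 * h))
    return grad

# Toy data (hypothetical): a bias column plus one feature.
X = [[1.0, 0.5], [1.0, 1.5], [1.0, 3.0], [1.0, 4.0]]
y = [0, 0, 1, 1]

print(cost([0.0, 0.0], X, y))  # every prediction is 0.5, so J = log(2), about 0.693
w = [0.1, -0.2]
print(gradient(w, X, y))
print(numerical_gradient(w, X, y))  # should agree closely with the line above
```

The cost function here does not guard against `log(0)`, which can occur when a prediction saturates at exactly 0 or 1; a practical implementation would usually clip the predictions first.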

Finally, just as in linear regression, gradient descent updates every weight \(w_j\) iteratively, with all weights updated simultaneously and \(\alpha\) as the learning rate:

$$w_j=w_j-\alpha \frac{1}{m}\sum^m_{i=1}\Big(M_w(x^i)-y^i\Big)x^i_j$$

We can now plug this new cost function and update rule into the gradient descent method introduced in the previous post to train multivariate logistic regression models.
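The update rule can be sketched as a short training loop. This is only an illustration under stated assumptions: the `train` function, the learning rate of 0.1, the fixed 5000 iterations and the toy data are all chosen for this example and are not taken from the previous post.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, x):
    """M_w(x): sigmoid of the weighted sum. x[0] is assumed to be 1 (bias)."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)))

def train(X, y, alpha=0.1, iterations=5000):
    """Batch gradient descent for logistic regression, starting from w = 0."""
    m, n = len(X), len(X[0])
    w = [0.0] * n
    for _ in range(iterations):
        # Prediction errors M_w(x^i) - y^i under the current weights.
        errors = [predict(w, xi) - yi for xi, yi in zip(X, y)]
        # Gradient component j: (1/m) * sum_i error_i * x^i_j.
        grad = [sum(e * xi[j] for e, xi in zip(errors, X)) / m for j in range(n)]
        # Simultaneous update of every weight.
        w = [wj - alpha * gj for wj, gj in zip(w, grad)]
    return w

# Toy data (hypothetical): a bias column plus one feature.
X = [[1.0, 0.5], [1.0, 1.5], [1.0, 3.0], [1.0, 4.0]]
y = [0, 0, 1, 1]

w = train(X, y)
labels = [1 if predict(w, xi) >= 0.5 else 0 for xi in X]
print(w)
print(labels)  # predicted classes using a 0.5 threshold
```

The whole gradient is computed from the old weights before any weight changes, which is what "simultaneous update" means in practice. A real implementation would typically also monitor the cost and stop once it no longer decreases.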

Other Optimization Algorithms

Gradient descent can be used to train models in optimization problems such as logistic regression. However, it can converge slowly when there are many features or when the features are poorly scaled, and its behaviour depends on a hand-picked learning rate \(\alpha\). Gradient descent uses only the first derivatives of the cost function at a point. More advanced optimization methods also use other properties of the function, such as the second derivatives collected in the Hessian matrix (the Jacobian of the gradient), or approximations of it. These methods include conjugate gradient, Newton's method, BFGS and L-BFGS. We list their names here for reference only and will not go through them in detail.
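To give a flavour of how second derivatives come into play, here is a rough sketch of Newton's method for a model with a bias and a single feature, where the Hessian is only a 2x2 matrix and can be inverted by hand. It relies on the standard Hessian of the cross-entropy cost, \(\frac{1}{m}\sum_i M_w(x^i)\big(1-M_w(x^i)\big)x^i_jx^i_k\), which is not derived in this post. The function names and the toy data are hypothetical.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def gradient_and_hessian(w, X, y):
    """Gradient and 2x2 Hessian of J(w) for inputs of the form [1, x]."""
    m = len(X)
    g = [0.0, 0.0]
    H = [[0.0, 0.0], [0.0, 0.0]]
    for xi, yi in zip(X, y):
        p = sigmoid(w[0] * xi[0] + w[1] * xi[1])
        for j in range(2):
            g[j] += (p - yi) * xi[j] / m
            for k in range(2):
                H[j][k] += p * (1.0 - p) * xi[j] * xi[k] / m
    return g, H

def newton(X, y, iterations=10):
    """Plain Newton iterations w <- w - H^{-1} g, with no line search."""
    w = [0.0, 0.0]
    for _ in range(iterations):
        g, H = gradient_and_hessian(w, X, y)
        det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
        # Closed-form inverse of the 2x2 Hessian applied to the gradient.
        step0 = (H[1][1] * g[0] - H[0][1] * g[1]) / det
        step1 = (H[0][0] * g[1] - H[1][0] * g[0]) / det
        w = [w[0] - step0, w[1] - step1]
    return w

# Toy data (hypothetical): the classes overlap, so a finite optimum exists.
X = [[1.0, 0.5], [1.0, 1.0], [1.0, 1.5], [1.0, 2.0],
     [1.0, 2.5], [1.0, 3.0], [1.0, 3.5], [1.0, 4.0]]
y = [0, 0, 1, 0, 1, 0, 1, 1]

w = newton(X, y)
g, _ = gradient_and_hessian(w, X, y)
print(w)
print(g)  # the gradient should be very close to zero at the optimum
```

Note that there is no learning rate to tune here, and only a handful of iterations are used. The trade-off is that each step needs the Hessian, which becomes expensive with many features; this is the cost that quasi-Newton methods such as BFGS and L-BFGS try to avoid by approximating it. On perfectly separable data the weights grow without bound and the Hessian becomes nearly singular, so this bare sketch would not be suitable there.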

Leo Mak
Make the world a better place, piece by piece.