In the previous articles we discussed the basic concept of simple linear regression; how to measure the error of a regression model so that gradient descent can be used to find the global optimum of the regression problem; how to develop a multivariate linear regression model for real-world problems; and how to choose the learning rate and the initial weight values to start the algorithm. At this point we can try to solve real-world problems using linear regression. However, there are several additional techniques we can employ to make linear regression work more effectively and efficiently.

## Feature Scaling

In our flat rental price example, the values of the feature *Size* may fall between 300 and 3000 square feet, while *Number of Bedroom* ranges from 1 to 4. It is reasonable to normalize the feature values before running the gradient descent algorithm, because choosing the initial weights becomes easier, especially for \(w_0\), the y-intercept of the regression model.

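There are different ways to normalize the feature values. One is to subtract the mean of the feature from each value and then divide the result by the difference between the maximum and minimum values of that feature:

\[\tilde{x}^{(i)} = \frac{x^{(i)} - \mu^{(i)}}{\max(x^{(i)}) - \min(x^{(i)})}\]

Another method is to replace the difference between the maximum and minimum with the standard deviation of that feature:

\[\tilde{x}^{(i)} = \frac{x^{(i)} - \mu^{(i)}}{\sigma^{(i)}}\]

where the superscript \(i\) denotes the \(i^{th}\) feature, \(\mu^{(i)}\) is its mean, and \(\sigma^{(i)}\) is its standard deviation.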
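The two scaling methods described above can be sketched with NumPy as follows. This is a minimal illustration rather than a reference implementation: the function names `mean_normalize` and `standardize` are made up for this example, and the sample values are a handful of *Size* and *Number of bedroom* entries from the flat rental dataset.

```python
import numpy as np

def mean_normalize(X):
    """Subtract each column's mean, then divide by that column's range (max - min)."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / (X.max(axis=0) - X.min(axis=0))

def standardize(X):
    """Subtract each column's mean, then divide by that column's standard deviation."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0)

# Columns: Size, Number of bedroom (one row per flat).
# Note: a constant column would have zero range and zero standard deviation,
# so this sketch assumes every feature takes at least two distinct values.
X = np.array([[350, 1], [410, 2], [430, 1], [550, 2], [480, 2]])

X_range = mean_normalize(X)   # every column now has mean 0 and range 1
X_std = standardize(X)        # every column now has mean 0 and standard deviation 1

print(X_range)
print(X_std)
```

Whichever method is used, the mean and the range (or standard deviation) computed on the training data should be kept, so that new instances can be scaled the same way before prediction.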

## Categorical Descriptive Features

Sometimes we may encounter categorical, rather than continuous, features in our dataset, as in the expanded flat rental dataset below, which has a new *Region* feature that takes the value A, B, or C.

Size | Rental Price | Floor | Number of bedroom | Region |
---|---|---|---|---|
350 | 1840 | 15 | 1 | A |
410 | 1682 | 3 | 2 | B |
430 | 1555 | 7 | 1 | C |
550 | 2609 | 4 | 2 | A |
480 | 1815 | 18 | 2 | B |
… | … | … | … | … |

Linear regression cannot handle this kind of feature directly. A common practice is to convert a categorical feature into several new binary features. For example, since the *Region* feature has 3 possible values (A, B, or C), we could replace it with 3 new descriptive features: *Region A*, *Region B*, and *Region C*, as shown in the table below. If an instance has value A for the original *Region* feature, then its new *Region A* feature is 1 and the other two are 0.

Size | Rental Price | Floor | Number of bedroom | Region A | Region B | Region C |
---|---|---|---|---|---|---|
350 | 1840 | 15 | 1 | 1 | 0 | 0 |
410 | 1682 | 3 | 2 | 0 | 1 | 0 |
430 | 1555 | 7 | 1 | 0 | 0 | 1 |
550 | 2609 | 4 | 2 | 1 | 0 | 0 |
480 | 1815 | 18 | 2 | 0 | 1 | 0 |

The drawback of this method is that as the number of features increases, more weights have to be optimized, especially when there are many categorical features or a feature has many distinct values. To alleviate the problem a bit, for each original categorical feature we can treat one of its values as the default and omit the corresponding binary feature. For example, we can set Region A as the default value of the original *Region* feature and omit it from the table. Now if a data entry has 0 for both *Region B* and *Region C*, we know the flat is in Region A. This way we add one fewer feature to the new dataset.
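The conversion described above, with and without a default value, might look like the following sketch. The helper `one_hot` and its `prefix` and `drop_first` parameters are hypothetical names chosen for this example; the input is the *Region* column of the five rows shown in the tables.

```python
import numpy as np

def one_hot(values, categories, prefix, drop_first=False):
    """Return (matrix, column_names) of 0/1 indicator columns.

    `categories` fixes the column order. With drop_first=True the first
    category becomes the implicit default and gets no column of its own.
    """
    kept = categories[1:] if drop_first else categories
    matrix = np.array([[1 if v == c else 0 for c in kept] for v in values])
    names = [f"{prefix} {c}" for c in kept]
    return matrix, names

regions = ["A", "B", "C", "A", "B"]   # Region column of the five sample rows

# Three binary features: Region A, Region B, Region C.
full, full_names = one_hot(regions, ["A", "B", "C"], prefix="Region")

# Two binary features: Region A is the default, encoded as (0, 0).
reduced, reduced_names = one_hot(regions, ["A", "B", "C"], prefix="Region", drop_first=True)

print(full_names, full.tolist())
print(reduced_names, reduced.tolist())
```

In the reduced encoding, the weights of *Region B* and *Region C* are interpreted relative to the default Region A, whose effect is absorbed by \(w_0\).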

## Polynomial Regression

Sometimes the relationship between the descriptive features and the target feature is non-linear, and our simple linear regression model cannot handle it very well. In such cases we need to generalize the model to include not just linear functions but also non-linear functions, which are called *basis functions*. One common class of these functions is polynomials \(p(x)\). The linear regression model now becomes:

\[M_w(x) = \sum_{k=0}^{b} w_k\, p_k(x)\]

where each \(p_k(x)\) is a basis function of the descriptive features.

For example, to better capture the feature \(x_1\), as shown in the figure below, the linear regression model may now include a quadratic term: \(M_w(x) = w_0x_0+w_1x_1+w_2x^2_1\).

Note that although the model now includes non-linear terms of \(x\), it is still linear in the weights. Since our goal is to optimize those weights, the model can again be solved by linear regression as usual.
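To make this concrete, here is a small sketch that fits the quadratic model \(M_w(x) = w_0x_0+w_1x_1+w_2x^2_1\) by treating \(x_1^2\) as just another feature column. Two assumptions to note: the data is synthetic and noise-free, generated from made-up weights, and the weights are found with NumPy's least-squares solver `np.linalg.lstsq` instead of the gradient descent used elsewhere in this series, simply to keep the example short. The helper name `quadratic_features` is invented for this illustration.

```python
import numpy as np

def quadratic_features(x1):
    """Build the columns x0 = 1, x1, and x1^2 for each instance."""
    x1 = np.asarray(x1, dtype=float)
    return np.column_stack([np.ones_like(x1), x1, x1 ** 2])

# Synthetic, noise-free data generated from made-up weights (2.0, 0.5, -1.5).
x1 = np.linspace(-3.0, 3.0, 25)
y = 2.0 + 0.5 * x1 - 1.5 * x1 ** 2

# The model is linear in w, so an ordinary linear least-squares fit applies.
Phi = quadratic_features(x1)
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)

print(w)   # estimated w0, w1, w2
```

Because the non-linearity lives entirely in the feature columns, the same gradient descent procedure from the earlier articles could be run on `Phi` without modification, ideally after scaling the columns.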

In fact, we can include higher-power terms than just the quadratic one, and even other types of basis functions. However, too many or overly complex functions may introduce the problem of **overfitting**: the model fits the training data too well but loses its ability to generalize, so it cannot accurately predict the target feature for new, unseen data. We can use **regularization** to address this problem; it will be discussed in detail later, together with the broader issues of overfitting and underfitting in error-based learning.
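One rough way to observe overfitting is to fit polynomials of increasing degree to a small training set and compare the error on the training data with the error on separate held-out data. The sketch below is only an illustration under stated assumptions: the data is synthetic (a made-up quadratic plus Gaussian noise), the degrees 1, 2, and 9 are arbitrary choices, and the fit again uses `np.linalg.lstsq` for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Synthetic data: a made-up quadratic relationship plus Gaussian noise."""
    x = rng.uniform(-1.0, 1.0, n)
    y = 1.0 + 2.0 * x - 3.0 * x ** 2 + rng.normal(0.0, 0.3, n)
    return x, y

def design(x, degree):
    """Columns x^0, x^1, ..., x^degree."""
    return np.vander(x, degree + 1, increasing=True)

def mse(w, x, y):
    """Mean squared error of the polynomial with weights w on (x, y)."""
    return float(np.mean((design(x, len(w) - 1) @ w - y) ** 2))

x_train, y_train = make_data(12)    # small training set
x_test, y_test = make_data(200)     # held-out data the model never sees

train_mse, test_mse = {}, {}
for degree in (1, 2, 9):
    w, *_ = np.linalg.lstsq(design(x_train, degree), y_train, rcond=None)
    train_mse[degree] = mse(w, x_train, y_train)
    test_mse[degree] = mse(w, x_test, y_test)
    print(degree, train_mse[degree], test_mse[degree])
```

The training error cannot go up as the degree grows, because each lower-degree model is a special case of the higher-degree one. The held-out error is the number to watch: when it rises while the training error keeps falling, the model has started to fit noise.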