Course v3 Lesson 5 Notes

Back propagation

Calculate the loss between the final activations of the output layer and the actual target values, then use the resulting losses to:

  1. Calculate the gradients with respect to the parameters and
  2. Update the parameters: $\text{parameters} -= \text{learning rate} \cdot \text{gradient of parameters}$.
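The two steps above can be sketched in plain Python on a toy one-parameter loss; all names and values here are purely illustrative:

```python
# One step of gradient descent on a toy quadratic loss L(w) = (w - 3)**2.
def step(w, lr=0.1):
    grad = 2 * (w - 3)      # 1. gradient of the loss w.r.t. the parameter
    return w - lr * grad    # 2. parameters -= learning rate * gradient

w = 0.0
for _ in range(50):
    w = step(w)
print(w)  # approaches the minimum at w = 3
```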

Fine tuning

Example: ResNet-34

  • The final layer, i.e. the final weight matrix, of ResNet-34 has 1000 columns because each image can belong to one of 1000 different categories, i.e. the length of the target vector is 1000.
  • When we employ ResNet-34 for transfer learning, the final layer is thrown away because our specific problem does not have the same 1000 categories as ImageNet.
  • The API replaces that layer with two new weight matrices and puts a ReLU in between. The size of the first matrix has a default value, and the size of the second matches the specific problem (data.c).
  • At the first stage of fine-tuning, we only need to train the newly added layers, not the previous ones, since those are already trained to recognize various objects. Therefore we freeze those layers, i.e. do not back-propagate gradients to them, which also saves some time and memory. But the most important thing is to retain the knowledge stored in the model.
  • At the second stage of fine-tuning we would like to make the whole model better, so we unfreeze the other layers so that all layers can be updated. However, we assume that the first few layers, which are responsible for more general concepts, e.g. identifying edges or colours, do not need as much training as the later layers. So the whole model is divided into several sections with different learning rates: smaller for the beginning of the model and larger for the layers close to the output activations. This technique is called using discriminative learning rates, and it can be applied with the slice syntax.
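One way to picture discriminative learning rates is with plain PyTorch parameter groups, which is roughly what the slice syntax builds under the hood; the model layout and learning-rate values below are purely illustrative:

```python
import torch
from torch import nn

# Toy model split into three sections; sizes and lrs are made up.
model = nn.Sequential(
    nn.Linear(10, 20),   # early layers: general features, smallest lr
    nn.Linear(20, 20),   # middle layers
    nn.Linear(20, 2),    # newly added head: largest lr
)

# One optimizer parameter group per section, with increasing learning rates.
optimizer = torch.optim.SGD([
    {"params": model[0].parameters(), "lr": 1e-5},
    {"params": model[1].parameters(), "lr": 1e-4},
    {"params": model[2].parameters(), "lr": 1e-3},
])
print([g["lr"] for g in optimizer.param_groups])
```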

Collaborative filtering in detail

  • Embedding: matrix multiplication of a weight matrix with a one-hot encoded matrix is the same as (mathematically identical to) an array lookup. This kind of lookup is called an embedding. It is a time- and memory-efficient way to perform one-hot encoded matrix multiplication.
  • Latent features/factors: in collaborative filtering, the weight matrices, i.e. the embeddings, encode features of the users, e.g. personal taste, and the corresponding features of the items, so that the model can correlate certain items with certain users. These features are called latent features or latent factors.
  • Bias: to encode or represent a general property of a user or an item, an extra number is added for each entry (user or item) and included in the linear function. This number is called the bias.
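The embedding-equals-lookup identity in the first bullet can be checked directly in PyTorch; the matrix sizes here are arbitrary:

```python
import torch

torch.manual_seed(0)
n_users, n_factors = 5, 3
weight = torch.randn(n_users, n_factors)   # the embedding weight matrix

# One-hot vector for user 2, multiplied by the weight matrix...
one_hot = torch.zeros(n_users)
one_hot[2] = 1.0
by_matmul = one_hot @ weight

# ...gives exactly the row that a plain array lookup returns.
by_lookup = weight[2]
print(torch.equal(by_matmul, by_lookup))  # True
```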

Training for collaborative filtering - revisit

  • When defining a range for the output of the sigmoid function, it is better to expand or relax the range a bit, since the sigmoid function is asymptotic and its extrema cannot be reached. For example, for a rating ranging between 0 and 5, it would be nice to set the range of the sigmoid function to, e.g., [-0.5, 5.5]. The actual extended range is application dependent.
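A minimal sketch of such a relaxed sigmoid range, assuming ratings between 0 and 5:

```python
import torch

def scaled_sigmoid(x, lo=-0.5, hi=5.5):
    # Map raw activations into (lo, hi); the endpoints are asymptotes,
    # so the relaxed range lets the model actually output 0 and 5.
    return torch.sigmoid(x) * (hi - lo) + lo

x = torch.tensor([-10.0, 0.0, 10.0])
print(scaled_sigmoid(x))  # stays strictly inside (-0.5, 5.5)
```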

Bias & weights

  • User and item bias: some users may rate most items with a very high or low score, and some items may get very high or low ratings no matter what features they have. To cope with this, we can introduce a bias term for each user and item. Adding these bias terms to the activation in the optimization process helps build a less biased rating model. The item bias may represent the intrinsic value of that item, regardless of what features it contains or what the values of those features are. The user bias represents the intrinsic preference of that user for items in general, no matter what that user's favourite features are.
  • Weights: often in collaborative filtering, as mentioned above, weights represent some latent or hidden features. It is often a good idea to reduce the dimension of layers (the number of weights) in a neural network, if necessary, by a procedure such as principal component analysis (PCA), in order to lower the number of activations to a level we would like to handle, e.g. to interpret the output result.
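As a sketch of the PCA step, torch.pca_lowrank can project an embedding matrix down to a handful of components; the matrix sizes here are made up:

```python
import torch

torch.manual_seed(0)
emb = torch.randn(100, 40)   # e.g. 100 items x 40 latent factors

# Project the 40 latent factors down to 3 principal components,
# a size that is easier to inspect and interpret.
U, S, V = torch.pca_lowrank(emb, q=3)
reduced = (emb - emb.mean(0)) @ V     # centered projection, shape (100, 3)
print(reduced.shape)
```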

SGD tricks

Weight decay

Instead of limiting the number of parameters to avoid overfitting, we can still have more parameters, but make them small. More parameters mean more nonlinearities, more interactions, etc. This is a way to have more parameters while penalizing complexity.

  • Loss function + sum of squares of the parameters multiplied by a number. That number is called the weight decay, which is the parameter wd in the various learner functions.

  • The wd is set to 0.01 by default, but we can often set it to a number between the default and 0.1.

  • To update the weights: without weight decay, the update looks like this: $w_t=w_{t-1}-lr \cdot \frac{dL}{dw_{t-1}}$, where $L$ is the loss function. For example, $L(x,w)=\text{mean squared error}(\text{model}(x,w), y)$, where $w$ and $x$ are the weights and independent variables respectively. This corresponds to the lines of code below:

    with torch.no_grad():    # update the weights without tracking gradients
        a.sub_(lr * a.grad)  # in-place subtraction: a = a - lr * a.grad

    where a.grad is $\frac{dL}{dw_{t-1}}$.

  • When weight decay $\text{wd}$ is introduced, the squared-parameter penalty is added to the loss, so the update process becomes: $w_t=w_{t-1}-lr \cdot \frac{d\left(L+\text{wd}\cdot\sum w^2\right)}{dw_{t-1}}$, which corresponds to the code below:

    wd = 1e-5  # example weight-decay value
    w2 = 0.0
    for p in parameters: w2 += (p**2).sum()   # sum of squared parameters
    loss = loss_function(y_hat, y) + w2*wd    # penalize large weights
  • Weight decay can affect the training process because it is now part of the loss function $L(x,w)$ as stated above. Updating the weights by finding the gradient of the loss function w.r.t. the weights $w$ involves the gradient of the additional term: $\frac{d(\text{wd} \cdot \sum w^2)}{dw} = 2\,\text{wd} \cdot w$. This process of adding a squared term to the loss function to penalize complexity is called L2 regularization, which is mathematically identical to weight decay.
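The gradient of the penalty term alone can be checked with autograd: differentiating $\text{wd}\cdot\sum w^2$ gives $2\,\text{wd}\cdot w$ for each weight; the values below are illustrative:

```python
import torch

wd = 0.1
w = torch.tensor([1.0, -2.0, 3.0], requires_grad=True)

# The penalty term alone: wd * sum of squared weights.
penalty = wd * (w ** 2).sum()
penalty.backward()

# Each gradient entry equals 2 * wd * w, the term derived above.
print(w.grad)
```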


  • SGD with momentum: the new step is based not only on the current gradient but also on the ‘direction’ of the last few updates: $S_t=\alpha\cdot grad+(1-\alpha)S_{t-1}$, where $S_t$ is the step at time $t$ and $\alpha$ is a constant between 0 and 1. This is called the *exponentially weighted moving average (EWMA)* of the last few steps, since the ‘memory’ of the values of previous steps is forgotten exponentially (because both $\alpha$ and $1-\alpha$ are less than 1).
  • rmsprop: the gradient is divided by the square root of the EWMA of the squared gradients: $S_t=\alpha\cdot grad^2+(1-\alpha)S_{t-1}$, which means that if the gradients are consistently small, the steps become bigger.
  • Adam: an optimization technique with dynamic learning rates that utilizes both momentum and rmsprop. Instead of dividing the raw gradient as in rmsprop, Adam divides the EWMA of the gradients (the momentum term) by the square root of the EWMA of the squared gradients.
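The momentum update in the first bullet can be sketched in plain Python on a toy quadratic loss; the learning rate and $\alpha$ values are illustrative:

```python
def momentum_step(w, grad, state, lr=0.1, alpha=0.1):
    # EWMA of the gradients: S_t = alpha * grad + (1 - alpha) * S_{t-1}
    state = alpha * grad + (1 - alpha) * state
    return w - lr * state, state

# Minimize the toy loss L(w) = (w - 3)**2.
w, state = 0.0, 0.0
for _ in range(300):
    grad = 2 * (w - 3)       # dL/dw
    w, state = momentum_step(w, grad, state)
print(w)  # approaches the minimum at w = 3
```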

One cycle policy fitting

Start with a low learning rate and a large momentum. Since the direction of the updates is consistent at this stage, the updates can go faster with a progressively higher learning rate and a progressively smaller momentum, until the learning rate reaches its maximum; then the changes of learning rate and momentum go in the reverse direction, with the learning rate annealing back down and the momentum rising again. This combination is called one cycle.
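The shape of such a cycle can be sketched as below; the cosine shape and the particular bounds are illustrative choices, not the exact schedule of any specific library:

```python
import math

def one_cycle(step, total, lr_max=1e-2, mom_min=0.85, mom_max=0.95):
    # Learning rate ramps up then down over the cycle,
    # while momentum does the opposite.
    pct = step / total
    if pct < 0.5:
        t = pct / 0.5            # first half: lr up, momentum down
    else:
        t = (1 - pct) / 0.5      # second half: lr down, momentum up
    cos = (1 - math.cos(math.pi * t)) / 2
    lr = lr_max * cos
    mom = mom_max - (mom_max - mom_min) * cos
    return lr, mom

lrs, moms = zip(*(one_cycle(s, 100) for s in range(101)))
print(max(lrs), min(moms))  # lr peaks mid-cycle exactly when momentum bottoms
```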

Cross entropy loss

To cope with categorical data in classification problems, (R)MSE loss is not very useful since it deals with continuous data in regression problems. We need a loss function that assigns a small loss when the prediction is correct with high confidence, and a big loss when the prediction is wrong with high confidence. For the prediction of a particular item, the cross entropy loss $\text{CE}$ is defined as:

$$\text{CE} = -\sum y_i\cdot \log(p(\hat{y}_i))$$

where $i \in [1 \cdots n]$ and $n$ is the number of possible classes, and

$$ y_i = \begin{cases} 1, & y \in \text{class }i,\\0, & \text{otherwise}\end{cases} $$

Since $y_i$ is the one-hot encoding of the class of the item, the cross entropy is the sum of the one-hot encoding multiplied by the log of the probabilities, or activations in more general terms. And it is equivalent to an index lookup of the log of the activation.

Caveat: to ensure the correctness of cross entropy when doing single-label multi-class classification, the sum of all the activations of an item should equal 1. In order for this property to hold, softmax is the activation function to use for the final layer. It is defined as: $\sigma(y_i)=\frac{e^{y_i}}{\sum_j e^{y_j}}$, where $y_i$ is the final-layer activation of the item for a specific class.
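The softmax-then-lookup view of cross entropy can be verified in PyTorch; the sizes here are arbitrary:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(1, 5)    # final-layer activations for 5 classes
target = torch.tensor([2])    # index of the true class

# Softmax turns the activations into probabilities that sum to 1...
probs = F.softmax(logits, dim=1)
# ...and cross entropy is the negative log of the probability found
# by an index lookup at the true class.
by_lookup = -torch.log(probs[0, target])
by_builtin = F.cross_entropy(logits, target)
print(torch.allclose(by_lookup, by_builtin))  # True
```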

Leo Mak
Enthusiast of Data Mining