Course v3 Lesson 6 Notes


Tabular learner

Rossmann Store Sales data set

  • When dealing with time-series data, most practical work does not use a recurrent neural network; an RNN is most powerful when the sequence of time points is the ONLY information we have. In real-world cases, a timestamp field can be expanded into many more features, such as hour, date, day of week, week, month, etc., treated as categorical variables, especially when the cardinality of each variable is not too high. With this kind of feature engineering, many kinds of time-series problems can be treated as tabular problems.
  • Preprocessor: run once on the training set before training to pre-process the data. The states or metadata created are shared with the validation and test sets. There are different types of preprocessing techniques, such as Categorify, which transforms categorical data into numerical representations, and FillMissing, which fills in missing values for continuous data.
  • Loss function in the problem: the Root Mean Square Percentage Error (RMSPE) measures the ratio, rather than the absolute error, of the predicted sales. To convert it to Root Mean Square Error, the most common loss function for regression problems, we first take the logarithm of the values, which turns a ratio into a difference. This is a common technique when dealing with variables that have long-tailed distributions, such as dollar amounts of sales or population counts, for which percentage differences are more relevant than absolute differences.
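The date expansion above can be sketched in plain pandas (fastai wraps the same idea in `add_datepart`); the column names and sample values here are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({"Date": pd.to_datetime(["2015-07-31", "2015-12-24"]),
                   "Sales": [5263, 8314]})

# Expand a single timestamp into categorical-friendly parts
df["Year"] = df["Date"].dt.year
df["Month"] = df["Date"].dt.month
df["Week"] = df["Date"].dt.isocalendar().week
df["Day"] = df["Date"].dt.day
df["Dayofweek"] = df["Date"].dt.dayofweek        # Monday = 0 ... Sunday = 6
df["Is_month_end"] = df["Date"].dt.is_month_end
```

Each derived column has low cardinality (at most 7, 12, 31, 53 distinct values), so all of them are good candidates for categorical embeddings in a tabular model.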
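A minimal sketch of what Categorify and FillMissing do under the hood, again in plain pandas with made-up column names; on real data the category list and the median would be computed on the training set and reused for validation and test:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"StoreType": ["a", "b", "a", None],
                   "CompetitionDistance": [570.0, np.nan, 14130.0, 620.0]})

# Categorify: map categorical strings to integer codes (missing becomes -1)
df["StoreType"] = df["StoreType"].astype("category").cat.codes

# FillMissing: replace NaNs in a continuous column with the median,
# and keep a boolean flag so the model knows the value was missing
df["CompetitionDistance_na"] = df["CompetitionDistance"].isna()
df["CompetitionDistance"] = df["CompetitionDistance"].fillna(
    df["CompetitionDistance"].median())
```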
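A toy example of the log trick with made-up sales figures: every prediction is 10% too high, and after taking logs that constant ratio error becomes a constant additive difference, so ordinary RMSE on the log values captures what RMSPE measures:

```python
import numpy as np

sales = np.array([100., 1000., 10000.])
pred = np.array([110., 1100., 11000.])   # each prediction is 10% too high

# RMSPE works on ratios, so all three errors count equally
rmspe = np.sqrt(np.mean(((sales - pred) / sales) ** 2))   # 0.1

# After a log transform, the 10% ratio error becomes the same
# additive difference log(1.1) at every scale, so plain RMSE applies
rmse_log = np.sqrt(np.mean((np.log(sales) - np.log(pred)) ** 2))
```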


Regularization

  • Many techniques are available: data augmentation, weight decay, dropout, batch normalization, etc.

Data augmentation

It is a kind of ‘cheap’ regularization that gives better generalization: it doesn’t make training take longer and, to an extent, doesn’t cause underfitting.

  • Two things to consider when doing image transformation:
    1. What level of transformation still gives a clear image of the same subject. For example, a transformed cat image should still show the same cat as the original image.
    2. Check the original data set, especially the validation set, to get a rough idea of what values or parameters of the transformations should be applied.
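Two label-preserving transforms can be sketched with plain numpy (the 4x4 random array is a stand-in for an image tensor; real pipelines would use the library's transform functions):

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.random((4, 4))          # stand-in for an image with values in [0, 1]

# Horizontal flip: label-preserving for most natural images
# (a flipped cat is still the same cat)
flipped = img[:, ::-1]

# Brightness jitter: scale pixel values by a random factor near 1.0,
# clipping so the result stays a valid image
factor = rng.uniform(0.8, 1.2)
jittered = np.clip(img * factor, 0.0, 1.0)
```

The magnitude limits (here the 0.8–1.2 brightness range) are exactly the parameters worth sanity-checking against the validation set, per point 2 above.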


Dropout

  • In short, throw away some percentage of the activations at random, so the model cannot memorize particular training examples, which would lead to overfitting. A different random subset of activations is thrown away for each mini-batch.
  • The activations are dropped with probability p during training, but are all present at test time. To keep the expected magnitudes consistent, the weights are multiplied by (1 − p) at test time; equivalently, modern frameworks divide the surviving activations by (1 − p) during training (“inverted dropout”), so no adjustment is needed at test time.
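The inverted-dropout scaling can be verified with a few lines of numpy on toy activations: after dropping half the units and dividing the survivors by (1 − p), the mean activation matches the test-time value where nothing is dropped:

```python
import numpy as np

rng = np.random.default_rng(42)
p = 0.5                                  # drop probability
acts = np.ones(100_000)                  # toy activations, all equal to 1.0

# Training pass: zero out each activation with probability p, then divide
# the survivors by (1 - p) so the expected value matches the test-time
# pass, where every unit is present and unscaled
mask = rng.random(acts.shape) >= p
train_out = acts * mask / (1 - p)

print(train_out.mean())                  # ≈ 1.0, same as the test-time mean
```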

Batch normalization

The output range of an activation layer is often not the same as the required range of the problem itself. For example, the range of an activation may be -1 to 1 while proper predictions should fall between 1 and 5, which makes the training process slow to adapt. Batch normalization adds a layer that takes activations as inputs, normalizes them, and then applies two learnable parameters per input, $\gamma$ and $\beta$ (trained with e.g. gradient descent like any other weights): the normalized inputs are multiplied by $\gamma$ and have $\beta$ added. These $\gamma$ and $\beta$ are effectively multiplicative and additive bias terms in the batch normalization layer, expanding or shrinking the output range and shifting the output up or down.

  • When batch norm is used, a higher learning rate can also be utilised.
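The training-time forward pass described above can be sketched in numpy (real layers also track running statistics for use at test time, which this sketch omits):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # Normalize each feature over the mini-batch, then rescale and shift
    # with the learnable parameters gamma (multiplicative) and beta (additive)
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(64, 10))   # batch of 64, 10 features
gamma, beta = np.ones(10), np.zeros(10)             # initial values
out = batch_norm(x, gamma, beta)                    # per-feature mean ≈ 0, std ≈ 1
```

With $\gamma = 1$ and $\beta = 0$ the output is simply the normalized input; during training the network is free to learn, say, $\gamma = 2, \beta = 3$ to stretch and shift the range toward whatever the target requires.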

Convolutional neural network

  • Kernel: a small matrix combined with patches of image pixels (elementwise multiplication, then a sum) to apply an effect or detect a feature in the image.
  • Convolution: the linear operation of sliding a kernel over the input, taking the elementwise product with each patch and summing the result.
  • Channel: the result of applying one kernel to the input. We can create many different channels by applying different kernels to the same input.
  • Stride n convolution: the kernel jumps over n positions (pixels, or elements of a matrix) each time. A normal convolution, where the kernel visits every position of the input, is a stride 1 convolution. A stride 2 convolution skips every other position, roughly halving the output height and width. Stride n convolutions are used to reduce memory usage.
  • Average pooling: take the average of all the values in each channel. Each of these averages represents the strength of one kind of feature.
  • To create a heatmap showing which areas have high activation, i.e. to locate the features in an image, instead of averaging the values within a channel as average pooling does, we can average the values across all channels at each spatial position. For example, if the width and height of the current layer’s feature map is 11x11 and the depth, i.e. the number of features, is 512, we can create an 11x11 heatmap by averaging the 512 values at each of the 11x11 “pixels”.
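A naive numpy convolution makes the kernel, channel, and stride definitions concrete (the edge-detector kernel is an illustrative choice; real frameworks use optimized implementations):

```python
import numpy as np

def conv2d(img, kernel, stride=1):
    # Slide the kernel over the input; at each position take the
    # elementwise product with the patch and sum it into one output value
    kh, kw = kernel.shape
    h = (img.shape[0] - kh) // stride + 1
    w = (img.shape[1] - kw) // stride + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            patch = img[i*stride:i*stride+kh, j*stride:j*stride+kw]
            out[i, j] = (patch * kernel).sum()
    return out

img = np.arange(36, dtype=float).reshape(6, 6)
edge = np.array([[-1., 1.]])                 # tiny horizontal edge-detector kernel

print(conv2d(img, edge).shape)               # (6, 5) with stride 1
print(conv2d(img, edge, stride=2).shape)     # (3, 3): stride 2 shrinks the output
```

Applying a second, different kernel to the same `img` would produce a second channel of the same shape.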
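The two ways of averaging a feature map, pooling per channel versus averaging across channels, differ only in which axes are reduced; a numpy sketch with random stand-in activations of the 512 x 11 x 11 shape mentioned above:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for a final feature map: 512 channels of 11x11 activations
feats = rng.random((512, 11, 11))

# Average pooling: one value per channel -> a 512-long feature vector
pooled = feats.mean(axis=(1, 2))        # shape (512,)

# Heatmap: average across channels at each spatial position -> 11x11 map
heatmap = feats.mean(axis=0)            # shape (11, 11)
```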

Data ethics

Theory and concepts can be neutral, but content and data, as well as the people who create the algorithms and collect the data, can carry large biases. This is a main concern especially for generative models of text, images, sound, video, etc. These biases don’t necessarily reflect what the world is, and can be systematically erroneous.

Biased data

  • There can be many different parties involved: the people creating the algorithm, the people implementing the software, the people using the software, etc.
  • The people who create the data set and the people who implement the algorithm may be in the best position to step outside the loop and point out possible problems.

Things to help

  • Think about how you would handle a situation BEFORE you are in it. Ethical issues are complex and don’t have clear or easy answers.
  • Do not make things that will cause massive amounts of bias and harm when creating data or a product that many people will use, research, and build on top of.
  • Think right from the start about the possible unintended consequences.
  • Feedback loops
    • Get involved with the people who might be impacted by the use of the algorithm and software.
    • Get humans back into the loop of a data product, to prevent the algorithm’s results from heading in a direction that is out of control.
    • Set up proper regulations.
Leo Mak
Enthusiast of Data Mining