Course v3 Lesson 4 Notes

Sentiment analysis (IMDB)

Transfer learning in NLP

  • Difficulties in training an NLP model:
    1. How to speak a language.
    2. World knowledge.
  • Start with a pre-trained model: a language model, i.e. a model that learns to predict the next word of a sentence.
  • Get the benefit of a pre-trained model trained on a much bigger dataset, e.g. Wikipedia.
  • No preset label is needed: this is self-supervised learning (a term popularised by Yann LeCun). Labels still exist in this kind of problem, but they are not created by humans; instead, they are built into the dataset.
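A minimal sketch of why no human labels are needed: in language modelling, the "label" for each position is simply the next token of the raw text itself.

```python
# The "labels" for a language model are built into the data: each training
# pair is (context so far, next word), derived from raw text alone.
text = "the cat sat on the mat".split()

pairs = [(text[:i], text[i]) for i in range(1, len(text))]

for context, target in pairs:
    print(context, "->", target)   # e.g. ['the'] -> cat
```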

Training the model
  1. Retrieve a pre-trained language model for NLP: WikiText-103.
  2. Create a new language model that is good at predicting the next word of movie reviews by fine-tuning the pre-trained language model (fine-tune the last layers) on the target corpus (IMDB reviews in this case).
  3. Fine-tune the language model into a classifier that labels movie reviews as positive or negative.

Training language model

  • A little trick for preparing the training dataset of a language model is to make use of the test/validation set as well. Since the language model only needs the independent variable (the text), not the labels, we can include those texts too, thereby increasing the size of the training dataset.
  • When our final goal is to train a classifier, we are not interested in predicting the next word, only in the part that understands a sentence, which is the encoder. So we can save just the encoder for the next phase of training.
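The encoder/head split can be sketched in plain PyTorch (a hedged illustration, not the fastai API; the module and file names are made up): the language model is an encoder plus a next-word head, and only the encoder's weights are saved and reloaded under the classifier's new head.

```python
import torch
import torch.nn as nn

vocab_size, emb_size, hidden = 100, 16, 32

class Encoder(nn.Module):
    """Turns token ids into hidden states that 'understand' the sentence."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_size)
        self.rnn = nn.LSTM(emb_size, hidden, batch_first=True)
    def forward(self, x):
        out, _ = self.rnn(self.emb(x))
        return out

encoder = Encoder()
lm_head = nn.Linear(hidden, vocab_size)   # predicts the next word

# ... after fine-tuning the language model, keep only the encoder ...
torch.save(encoder.state_dict(), "ft_enc.pth")

# The classifier reuses the fine-tuned encoder with a new head.
clf_encoder = Encoder()
clf_encoder.load_state_dict(torch.load("ft_enc.pth"))
clf_head = nn.Linear(hidden, 2)           # positive / negative

x = torch.randint(0, vocab_size, (1, 5))  # a batch of 5 token ids
logits = clf_head(clf_encoder(x)[:, -1])  # classify from the last hidden state
print(logits.shape)                       # torch.Size([1, 2])
```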

Training classifier

  • When training the classifier, make sure the vocabulary (i.e. the mapping between tokens and their ids) of the training data is the same as the language model's.
  • Training a language model takes a long time, but training a classifier on top of that language model should be fast.
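A small sketch of why the vocabulary must be shared (the tokens and ids here are made up): the classifier must map each token to the same id the language model used, with unseen words falling back to an unknown token.

```python
# The same token must map to the same id in both the language model and
# the classifier, so the classifier reuses the LM's vocabulary.
lm_vocab = {"<unk>": 0, "the": 1, "movie": 2, "was": 3, "great": 4}

def numericalize(tokens, vocab):
    """Map tokens to ids, falling back to <unk> for unseen words."""
    return [vocab.get(t, vocab["<unk>"]) for t in tokens]

review = "the movie was terrible".split()
ids = numericalize(review, lm_vocab)   # reuse the LM's vocab, not a new one
print(ids)                             # [1, 2, 3, 0]
```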

Tabular data

  • Using a neural network to analyse tabular data can reduce the work of feature engineering and make it simpler.

Data pre-processing

  • Categorical variables, unlike continuous ones, need to be transformed into embeddings.
  • Processor: similar to a transform in image processing (such as data augmentation), but dedicated to tabular data. One key difference is that processors run ahead of time, i.e. as pre-processing, while image transforms run on the fly for each instance.
  • Remember to apply the same set of processors to all the datasets (training, validation, test).
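The categorical-to-embedding idea above can be sketched in plain PyTorch (a hedged illustration; the column and sizes are hypothetical): each category id indexes into a learned embedding table, and the resulting vectors are concatenated with the continuous columns.

```python
import torch
import torch.nn as nn

# Hypothetical categorical column: day of week, 7 categories,
# each mapped to a learned 4-dimensional vector.
day_emb = nn.Embedding(num_embeddings=7, embedding_dim=4)

days = torch.tensor([0, 6, 3])                # three rows of category ids
vectors = day_emb(days)                       # shape: (3, 4)

cont = torch.tensor([[0.5], [1.2], [-0.3]])   # a continuous column, used as-is
features = torch.cat([vectors, cont], dim=1)  # shape: (3, 5), fed to the net
print(features.shape)
```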

Collaborative filtering

  • Goal: predict how much a given user likes an item, or what kind of users like a given item.
  • Two ways to store user-preference values: a list (of user–item–rating triples) or a sparse matrix.
  • “Cold start problem”: the collaborative system doesn’t have any data about new users or new items, so it is difficult to make suggestions to or about them. One way to solve this is to use another model, e.g. a metadata-driven model, to guess and fill in the gaps.
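The two storage layouts can be sketched as follows (a toy example with made-up users and ratings): the list form holds one triple per known rating, while the matrix form reserves a cell for every user–item pair, most of which stay empty — which is why the matrix is stored sparsely in practice.

```python
# List form: one (user, item, rating) triple per observed preference.
ratings = [("alice", "matrix", 5), ("bob", "matrix", 4), ("alice", "up", 3)]

users = sorted({u for u, _, _ in ratings})
items = sorted({i for _, i, _ in ratings})

# Matrix form: rows are users, columns are items, 0 means "no rating yet".
matrix = [[0] * len(items) for _ in users]
for u, i, r in ratings:
    matrix[users.index(u)][items.index(i)] = r

print(matrix)   # [[5, 3], [4, 0]] -- bob has not rated "up"
```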

Training with embedding

  • An embedding matrix is a matrix of weights that can be looked up/indexed into like an array to retrieve a vector. That vector is an embedding.
  • At the beginning of the training process, initialize the embeddings to random numbers for both users and items, then calculate the dot product of the embeddings.
  • We can make training easier and the model work better by restricting the network's output to a pre-defined range, e.g. 0 to 5 for a movie rating, using a method such as a scaled sigmoid function.
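The steps above can be sketched in plain PyTorch (a hedged illustration; the sizes and the `predict` helper are made up): randomly initialised user and item embeddings, a dot product, and a sigmoid rescaled to the valid rating range.

```python
import torch
import torch.nn as nn

n_users, n_items, n_factors = 10, 20, 5
user_emb = nn.Embedding(n_users, n_factors)   # randomly initialised weights
item_emb = nn.Embedding(n_items, n_factors)

def predict(user_ids, item_ids, y_min=0.0, y_max=5.0):
    # Dot product of the looked-up user and item vectors.
    dot = (user_emb(user_ids) * item_emb(item_ids)).sum(dim=1)
    # Sigmoid maps to (0, 1); rescale to the valid rating range.
    return torch.sigmoid(dot) * (y_max - y_min) + y_min

preds = predict(torch.tensor([0, 1]), torch.tensor([3, 7]))
print(preds)   # two values, each strictly between 0 and 5
```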

Terminologies in neural network

  • parameters: the numbers in the matrices that are multiplied with inputs, including weight matrices and bias vectors. The state stored and used to make calculations.
  • activations: the numbers that result from matrix multiplications or nonlinearities/activation functions. The results of calculations.
  • layer: a calculation step that results in a set of activations. Two special layers: the input layer and the output layer.
  • loss function: a function that compares the output to the true value, e.g. mean squared error (MSE), root mean squared error (RMSE).
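A small worked example of the two loss functions mentioned above, with made-up predictions and targets:

```python
import math

preds   = [3.0, 4.5, 2.0]
targets = [3.0, 5.0, 1.0]

# MSE: mean of the squared errors; RMSE: its square root,
# which is on the same scale as the target values.
mse = sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)
rmse = math.sqrt(mse)

print(mse, rmse)   # mse ~= 0.4167, rmse ~= 0.6455
```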
Leo Mak
Enthusiast of Data Mining