Course v3 Lesson 7 Notes

Basic CNN with batchnorm

Question: How are the numbers in the Param output column calculated? The doc shows the calculation, but the numbers don’t seem to match.
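One possible way to reason about those numbers, as a rough sketch rather than a confirmed answer: a 2D convolution has one weight per (input channel, output channel, kernel cell) plus an optional bias per output channel, and a batchnorm layer has one trainable scale and one trainable shift per channel. The helper names and the 1-to-8-channel, 3×3 example below are assumptions for illustration, not values taken from the lesson notebook.

```python
def conv2d_params(in_ch, out_ch, kernel_size, bias=True):
    """Trainable parameters of a 2D convolution with a square kernel."""
    weights = in_ch * out_ch * kernel_size * kernel_size
    return weights + (out_ch if bias else 0)


def batchnorm2d_params(channels):
    """Trainable parameters of batchnorm: one scale and one shift per channel."""
    return 2 * channels


# Hypothetical first layer: 1 input channel, 8 filters, 3x3 kernel.
print(conv2d_params(1, 8, 3))              # 1*8*3*3 + 8 = 80
print(conv2d_params(1, 8, 3, bias=False))  # 72
print(batchnorm2d_params(8))               # 16
```

If a summary shows a different number, it may be worth checking whether the conv layer was created without a bias, which is a common choice when batchnorm follows it.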

ResNet from scratch

A deeper network doesn’t always perform better, even when measured by training error.

The deeper network has higher training error. Source: “Deep Residual Learning for Image Recognition”, Kaiming He et al.


  • “Deep Residual Learning for Image Recognition” by Kaiming He et al.
  • “Visualizing the Loss Landscape of Neural Nets” by Hao Li et al.

Question: At 19:30 in the video, shouldn’t it be only 1 conv operation followed by a ReLU before the plus, instead of 2 conv operations?

Skip connection

The idea is to “skip” specific convolution layers by passing the input directly to the activation, where it is added to the output of those layers. This direct link between the input and the activation is called an identity connection or skip connection. A whole block of layers with a skip connection is called a res block.
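The res block idea can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: `tiny_layers` is a hypothetical stand-in for the convolution layers (any shape-preserving function would do), and activations are plain lists instead of tensors.

```python
def relu(values):
    return [max(0.0, v) for v in values]


def res_block(x, layers):
    """Return x + layers(x): the input skips past `layers` and is added back."""
    out = layers(x)
    return [xi + oi for xi, oi in zip(x, out)]


def tiny_layers(x):
    # Hypothetical stand-in for the skipped convolution layers.
    return [0.5 * v for v in relu(x)]


x = [1.0, -2.0, 3.0]
print(res_block(x, tiny_layers))                # [1.5, -2.0, 4.5]
print(res_block(x, lambda v: [0.0] * len(v)))   # [1.0, -2.0, 3.0]
```

The second call shows why the block is easy to train: if the skipped layers output zeros, the block reduces to the identity, so adding it should not make the network worse than the shallower one.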


Instead of the “plus” operator, use the “concatenate” operator. Such a block is called a dense block, and the model is called DenseNet.

DenseNet is memory intensive because the features of all previous layers are concatenated with those of later layers. However, the number of parameters is small, so it is worth experimenting with DenseNet on small data sets. Moreover, it is good for segmentation tasks: segmentation requires a kind of reconstruction of the original image, and the original input pixels are kept.
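A minimal sketch of the concatenation idea, in plain Python with lists standing in for channels. The `make_layer` helper and its arithmetic are hypothetical; the point is only that every layer sees all earlier features and that the feature list keeps growing.

```python
def dense_block(x, layers):
    """Each layer sees all features so far; its output is concatenated, not added."""
    features = list(x)
    for layer in layers:
        features = features + layer(features)  # concatenate along the "channel" axis
    return features


def make_layer(scale):
    # Hypothetical layer that produces 2 new features from everything before it.
    def layer(features):
        total = sum(features)
        return [scale * total, -scale * total]
    return layer


x = [1.0, 2.0]
out = dense_block(x, [make_layer(1.0), make_layer(0.5)])
print(out)       # [1.0, 2.0, 3.0, -3.0, 1.5, -1.5]
print(len(out))  # 6: the 2 inputs plus 2 new features per layer
```

The original inputs are still the first entries of the output, and the growing length is where the memory cost comes from.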


  • Downsampling: normal convolution with a stride of 2, or other tricks.
  • Upsampling: convolution with a stride of 1/2, also known as deconvolution or transposed convolution. There are many other ways to upsample, e.g. nearest neighbour interpolation or bilinear interpolation, that don’t waste as much computation time and power.
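The effect of the stride on the grid size, and the simplest of the upsampling alternatives, can be sketched as follows. The function names are made up for this note; the size formula is the usual one for a convolution along one dimension, and images are plain lists of rows.

```python
def conv_out_size(size, kernel_size, stride, padding):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel_size) // stride + 1


def upsample_nearest(image, factor=2):
    """Nearest-neighbour upsampling of a 2D grid given as a list of rows."""
    out = []
    for row in image:
        wide = [pixel for pixel in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


print(conv_out_size(28, kernel_size=3, stride=2, padding=1))  # 14: downsampled
print(conv_out_size(28, kernel_size=3, stride=1, padding=1))  # 28: size kept
print(upsample_nearest([[1, 2], [3, 4]]))
# [[1, 1, 2, 2], [1, 1, 2, 2], [3, 3, 4, 4], [3, 3, 4, 4]]
```

Nearest-neighbour upsampling only copies pixels, so a normal stride-1 convolution is typically applied afterwards to learn how to refine the enlarged grid.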

What U-Net does is concatenate a skip connection, called a cross connection in U-Net, from each part of the downsampling path (the encoder) to the part of the upsampling path (the decoder) with the same image size. U-Net is very good at image segmentation because the upsampling path can use the original pixels and the other information kept in the downsampling path, and combine them with its enlarged layers.
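A one-level toy version of the cross connection, assuming a 1D signal of even length instead of an image. `downsample`, `upsample` and `tiny_unet` are hypothetical helpers: averaging pairs stands in for a stride-2 convolution and repeating values stands in for the upsampling layer.

```python
def downsample(signal):
    """Halve the length by averaging neighbouring pairs (stand-in for a stride-2 conv)."""
    return [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal) - 1, 2)]


def upsample(signal):
    """Double the length by repeating each value (nearest-neighbour upsampling)."""
    return [v for v in signal for _ in range(2)]


def tiny_unet(signal):
    """One encoder/decoder level with a cross connection.

    Returns two channels per position: the upsampled feature and the saved input.
    """
    saved = signal                   # encoder activation kept for the cross connection
    bottleneck = downsample(signal)  # encoder: size n -> n/2
    up = upsample(bottleneck)        # decoder: size n/2 -> n
    return [[u, s] for u, s in zip(up, saved)]  # concatenate at the matching size


print(tiny_unet([1.0, 3.0, 5.0, 7.0]))
# [[2.0, 1.0], [2.0, 3.0], [6.0, 5.0], [6.0, 7.0]]
```

The upsampled channel alone has lost the difference between neighbouring values; the concatenated channel brings that detail back for the following layers to use.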

Image restoration

Besides segmentation, U-Net can also be used for image restoration. Image restoration is a kind of image generation that produces a better image. “Better” can mean many different things, for example:

  • Make a higher resolution image from a lower one
  • Make a colour image from a black-and-white one
  • Fill a missing piece of an image
  • Create various different styles of an image

To generate a training data set for image restoration, we can “crappify” the original image set, that is, make a lower-quality version of each original image. The “crappification” should include every kind of damage we want the model to be able to restore, such as those in the list above. This is important because the model cannot do what it has not learned.
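As a rough illustration of crappification for the "higher resolution" case only, the sketch below lowers the effective resolution of a small greyscale image. The `crappify` function here is a made-up pure-Python stand-in, not the function used in the course; a real pipeline would work on image files and could add other damage as well.

```python
def crappify(image, factor=2):
    """Lower the quality of a greyscale image (list of rows).

    Each factor x factor block is replaced by its mean, which mimics
    shrinking the image and blowing it back up to the original size.
    """
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for top in range(0, h, factor):
        for left in range(0, w, factor):
            rows = range(top, min(top + factor, h))
            cols = range(left, min(left + factor, w))
            block = [image[r][c] for r in rows for c in cols]
            mean = sum(block) / len(block)
            for r in rows:
                for c in cols:
                    out[r][c] = mean
    return out


original = [[0, 4, 8, 8],
            [4, 8, 8, 8],
            [0, 0, 2, 6],
            [0, 0, 6, 2]]
degraded = crappify(original)
training_pair = (degraded, original)  # model input, model target
print(degraded)
# [[4.0, 4.0, 8.0, 8.0], [4.0, 4.0, 8.0, 8.0], [0.0, 0.0, 4.0, 4.0], [0.0, 0.0, 4.0, 4.0]]
```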

When a U-Net is trained on this kind of image restoration task with pixel-wise MSE as the loss function, it can become pretty good at filling in a missing patch or removing an added patch, because such a patch is totally different from the pixels at the same place in the original image. But MSE is not well suited to creating a higher-resolution image that is close to the original, because the pixel-wise losses between the two are already very small in terms of colour and other pixel information. Therefore, we need another loss function that better captures other characteristics of an image, such as textures and other details of the objects in it.
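The difference in scale between the two cases can be seen with a small, made-up example. The 4×4 "image", the blanked patch and the softened version are all hypothetical values chosen for easy arithmetic.

```python
def pixel_mse(a, b):
    """Mean squared error between two same-sized images given as lists of rows."""
    diffs = [(pa - pb) ** 2 for ra, rb in zip(a, b) for pa, pb in zip(ra, rb)]
    return sum(diffs) / len(diffs)


original = [[100, 200, 100, 200],
            [200, 100, 200, 100],
            [100, 200, 100, 200],
            [200, 100, 200, 100]]

# Hypothetical "missing patch": the top-left 2x2 block is blanked out.
patched = [row[:] for row in original]
for r in range(2):
    for c in range(2):
        patched[r][c] = 0

# Hypothetical "slightly soft" image: every pixel moves 10 towards mid-grey.
softened = [[p + 10 if p < 150 else p - 10 for p in row] for row in original]

print(pixel_mse(original, patched))   # 6250.0
print(pixel_mse(original, softened))  # 100.0
```

The patch affects only a quarter of the pixels yet dominates the loss, while the loss of contrast across the whole image barely registers, so MSE gives the model little incentive to fix it.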

Generative adversarial loss

In addition to the pixel MSE used with the U-Net as described above, we can also train another model called a Discriminator or Critic. It is a binary classification model that uses a cross entropy loss function and tries to classify whether an image is an original or was generated by another model. The performance of the image generator is then measured by how well it generates images that fool the Critic, that is, images the Critic classifies as real.
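The two objectives can be written down with plain binary cross entropy. This is a sketch on single probabilities, assuming the Critic outputs the probability that an image is real; the function names and the example probabilities are hypothetical, and a real setup would average over batches and usually combine the adversarial term with the pixel loss.

```python
import math


def bce(prob_real, is_real):
    """Binary cross entropy for one prediction; prob_real is the Critic's P(real)."""
    return -math.log(prob_real) if is_real else -math.log(1.0 - prob_real)


def critic_loss(p_on_real, p_on_fake):
    """The Critic wants real images scored near 1 and generated ones near 0."""
    return bce(p_on_real, True) + bce(p_on_fake, False)


def generator_adversarial_loss(p_on_fake):
    """The generator wants the Critic to label its output as real."""
    return bce(p_on_fake, True)


# Hypothetical Critic outputs for a generated image.
fooled, not_fooled = 0.9, 0.1
print(generator_adversarial_loss(fooled) < generator_adversarial_loss(not_fooled))  # True
print(critic_loss(0.9, 0.1) < critic_loss(0.5, 0.5))                                # True
```

The same Critic output pulls the two losses in opposite directions, which is what makes the training adversarial.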

Schematic diagram of GAN. courtesy:

Feature loss / perceptual loss

To improve the accuracy of features such as texture in an image produced by a generative model, we can pass each pair of images through a pre-trained network, extract the features of both the generated image and the real image from the same hidden layer, and measure the difference between them with a loss function, e.g. MSE. In the paper cited below, the features are extracted from the non-linear / activation layer (ReLU) just before the layer that changes the grid size (Max Pooling 2d).

Training system that uses the loss network to measure feature losses in content and style between images. Source: “Perceptual Losses for Real-Time Style Transfer and Super-Resolution”, Johnson et al.
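The mechanics of a feature loss can be shown with a toy feature extractor. Everything below is an assumption made for illustration: `features` is a hand-written stand-in for one hidden layer of a pre-trained network (horizontal differences followed by ReLU), and the images are tiny lists of rows.

```python
def features(image):
    """Hypothetical stand-in for one hidden layer of a pre-trained network:
    differences between horizontal neighbours, followed by ReLU."""
    return [[max(0, row[c + 1] - row[c]) for c in range(len(row) - 1)]
            for row in image]


def mse(a, b):
    diffs = [(pa - pb) ** 2 for ra, rb in zip(a, b) for pa, pb in zip(ra, rb)]
    return sum(diffs) / len(diffs)


def feature_loss(generated, real):
    """MSE between the features of both images, taken from the same layer."""
    return mse(features(generated), features(real))


real = [[0, 10, 0, 10],
        [0, 10, 0, 10]]
brighter = [[p + 2 for p in row] for row in real]  # same texture, shifted brightness
flat = [[5, 5, 5, 5],
        [5, 5, 5, 5]]                              # texture lost entirely

print(mse(brighter, real), feature_loss(brighter, real))  # 4.0 0.0
print(mse(flat, real), feature_loss(flat, real))
```

With this toy extractor the brightness shift costs nothing in feature space, while the flat image, which has lost the texture, is penalised more heavily by the feature loss than by pixel MSE.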

Side notes

Reclaim GPU memory

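To reclaim GPU memory in a Jupyter notebook without restarting it, we can use the code below:

learner = None  # drop the reference to the object that uses a lot of GPU memory
import gc; gc.collect()  # then ask Python to collect the now-unreferenced object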

Study strategy

  • Watch videos several times
  • Write code and share it; write related prose that the you of six months ago could read.
  • Read papers
  • Help on forum and share stories (success or not!)
  • Get social: book club, meetups, study groups
  • Build things: small apps, work projects, libraries, etc. Just try to finish something PROPERLY.
Leo Mak
Enthusiast of Data Mining