# Bib Racer 02 - Training with RBNR Dataset

Transform the RBNR dataset to YOLO format, then train and test on it

Running event images with faces and bibs detected.

In the previous post, we talked about how to scrape and download photos from an online photo album of a trail running event using Selenium and BeautifulSoup. The next step is to identify the bib numbers in the photos automatically. This can be divided into two subtasks: the first is to locate the bib in an image, and the second is to recognize the number on the located bib. Naturally, we are going to handle the bib detection first.

There are several approaches to the bib detection problem. One is to use a text detection algorithm from image processing. For example, the Stroke Width Transform (SWT) [1] is used in the racing bib number recognition system here, together with face and torso detection. Another approach is to detect the bib(s) and bib number(s) in the image with a deep neural network, using schemes such as Region-based Convolutional Neural Networks (R-CNN) [2], U-Net [3], and YOLO (You Only Look Once).

In this project, we are going to train a neural network for this task. Specifically, we shall use a port of YOLOv3 by AlexeyAB to train a bib number detector. Jonathan Hui has a wonderful post explaining YOLO: Real-time Object Detection with YOLO, YOLOv2 and now YOLOv3.

## Data preparation - Training with existing data

To let YOLO detect the bib numbers in the images we retrieved previously, we need to annotate those images manually and train a detection model because, unlike common objects such as cats, dogs, and people, there are no existing model weights available for bibs and bib numbers. Before going through the manual image annotation process, we would like to test the training process on some existing data. Ben-Ami et al. created the RBNR dataset, which comprises 217 images together with ground truth bib numbers and bib positions. We are going to train a YOLO model on this dataset and see if something meaningful comes out of it.

### Transform data from RBNR to YOLO format

There are two parts of the data we need to take care of when creating a training set for YOLO: the images and the object information.

• Image: The default size of training images expected by YOLO is 416x416. Since the images in the RBNR dataset come in different sizes, we are going to resize them with OpenCV before feeding them to the YOLO model. In fact, with many implementations of YOLO this step is not necessary, as the implementation handles the resolution itself through various methods, e.g. cropping, padding, or resizing. However, we are going to resize the images separately before the training process.

```python
import os
import cv2
from google.colab.patches import cv2_imshow  # Colab only, to replace cv2.imshow

# data_path is assumed to be defined earlier and to end with '/'
img_dims = {}
os.chdir(data_path)
orig_img_path = data_path + 'train/'
yolo_img_path = data_path + 'yolo_train/'
yolo_w = 416
yolo_h = 416
for file in os.scandir(orig_img_path):
    if file.path.endswith(".JPG"):
        img = cv2.imread(file.path)  # array of shape (height, width, channels)
        # Remember the original size; the label conversion needs it later
        img_dims[file.name.split(".")[0]] = [img.shape[1], img.shape[0]]  # [width, height]
        # Resize the image
        cv2.imwrite(yolo_img_path + file.name, cv2.resize(img, (yolo_w, yolo_h), interpolation=cv2.INTER_AREA))
```
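For comparison with the plain stretch to 416x416 used here, the padding-based alternative is often called letterboxing. The following is a minimal sketch of the geometry only, in pure Python. The function name `letterbox_geometry` is hypothetical, and it is not meant to reproduce what any particular YOLO implementation does internally.

```python
def letterbox_geometry(img_w, img_h, target=416):
    """Return (new_w, new_h, pad_x, pad_y) for an aspect-preserving resize.

    The image is scaled so that its longer side fits `target`, then centred
    on a target x target canvas; pad_x / pad_y are the left / top borders.
    """
    scale = min(target / img_w, target / img_h)
    new_w = round(img_w * scale)
    new_h = round(img_h * scale)
    pad_x = (target - new_w) // 2
    pad_y = (target - new_h) // 2
    return new_w, new_h, pad_x, pad_y


# A 1024x768 landscape photo keeps its 4:3 ratio inside the square canvas.
print(letterbox_geometry(1024, 768))  # (416, 312, 0, 52)
```

With a plain stretch, relative YOLO labels remain valid as they are, because both axes are normalised by the image size. With letterboxing, the labels would have to be rescaled and shifted by the padding.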

• Object label position: To annotate the training images, we need to create a bounding box for each target object in the image and label it. The bounding box information is provided in the RBNR dataset. However, there are several differences between the annotation file of the RBNR dataset and the one needed by YOLO:

1. The RBNR annotation file is stored in Matlab format, while YOLO requires a plain text format file.
2. The bounding box position in RBNR is defined as minRow maxRow minCol maxCol in absolute positions, that is, pixel positions in the image. On the other hand, the format specifying the bounding box position in YOLO is x_center y_center width height, which are floating-point values relative to the width and height of the image. The following figure may give you some idea of it, where:
• The grey rectangle is the bounding box.
• x1, x2, y1, y2 represent minCol, maxCol, minRow, maxRow respectively.
• (cx, cy) is the (x, y) position of the centre (x_center, y_center) of the bounding box relative to the image.
• width, height are the relative width and height of the bounding box.
3. The object class name in the RBNR file is stored as the name of a Matlab matrix (facep and tagp), while in YOLO the classes are listed in a separate object name file and the annotation file stores the index number (0 and 1) of the class.
```python
import numpy as np
from scipy.io import loadmat  # reads the Matlab .mat annotation files

lb_path = data_path + 'valid_labels/'
yolo_lb_path = data_path + 'yolo_train_labels/'
for file in os.scandir(lb_path):
    if file.path.endswith(".mat"):
        key = file.name.split(".")[0]
        print(key + ':', img_dims[key])
        # Load the Matlab annotations (assumed to be read with scipy.io.loadmat)
        m = loadmat(file.path)
        img_w, img_h = img_dims[key]  # original [width, height] of the image
        with open(yolo_lb_path + key + '.txt', 'w') as yolo_f:
            # Class 0: faces ('facep'); class 1: bibs ('tagp').
            # Each row is [minRow, maxRow, minCol, maxCol] = [y1, y2, x1, x2].
            for class_id, name in enumerate(('facep', 'tagp')):
                for box in m[name]:
                    y1, y2, x1, x2 = box[:4].astype(float)
                    xcenter = (x1 + x2) / (2 * img_w)
                    ycenter = (y1 + y2) / (2 * img_h)
                    width = (x2 - x1) / img_w
                    height = (y2 - y1) / img_h
                    print(class_id, xcenter, ycenter, width, height, file=yolo_f)
```

As the code shows, the x coordinate of the bounding box centre is calculated as: $$x_{\text{center}} = \frac{x_1+x_2}{2\times \text{image width}}$$ and the width of the bounding box as: $$\text{width} = \frac{x_2-x_1}{\text{image width}}$$ The y coordinate and the height of the bounding box are calculated in the same way, using the image height.
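To make the conversion concrete, here is a small worked example as a standalone function. It is a sketch: the name `rbnr_to_yolo` and the box values are made up for illustration and do not come from the RBNR dataset. Note that the centre is divided by twice the image size (it is the average of two coordinates), while the width and height are divided by the image size only once.

```python
def rbnr_to_yolo(box, img_w, img_h):
    """Convert an RBNR box (minRow, maxRow, minCol, maxCol), in pixels,
    to YOLO's (x_center, y_center, width, height), relative to image size."""
    y1, y2, x1, x2 = (float(v) for v in box)
    x_center = (x1 + x2) / (2 * img_w)
    y_center = (y1 + y2) / (2 * img_h)
    width = (x2 - x1) / img_w
    height = (y2 - y1) / img_h
    return x_center, y_center, width, height


# Made-up example: a bib spanning rows 300-400 and columns 200-360
# in an 800x600 (width x height) image, written as a class-1 label line.
example = rbnr_to_yolo((300, 400, 200, 360), img_w=800, img_h=600)
print("1 " + " ".join(f"{v:.6f}" for v in example))
# 1 0.350000 0.583333 0.200000 0.166667
```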

Besides the images and annotation files, several other files are necessary for YOLO to train a model: the configuration file, the files listing the training and validation images, the object class name file, and a data file giving the number of classes in the dataset and the paths to the other files. Details on how to tune the configuration file and how to prepare the other metadata files can be found in AlexeyAB’s repository.
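As a rough illustration of what these metadata files contain, the sketch below builds the text of an object name file and a data file for our two classes, and computes the class-dependent values usually edited in a YOLOv3 configuration file. The helper names, the class labels `face` and `bib`, and all paths are assumptions made for this example; the AlexeyAB repository remains the reference for the actual layout.

```python
def darknet_metadata(class_names, train_list, valid_list, names_file, backup_dir):
    """Build the text of the object name file and of the data file."""
    # One class name per line; the line order defines the class index.
    names_text = "\n".join(class_names) + "\n"
    data_text = (
        f"classes = {len(class_names)}\n"
        f"train = {train_list}\n"
        f"valid = {valid_list}\n"
        f"names = {names_file}\n"
        f"backup = {backup_dir}\n"
    )
    return names_text, data_text


def cfg_values(num_classes):
    """Class-dependent values typically changed in a YOLOv3 .cfg file."""
    return {
        "classes": num_classes,            # set in each [yolo] layer
        "filters": (num_classes + 5) * 3,  # conv layer just before each [yolo] layer
    }


# Index 0 is the face class and index 1 the bib class, as in the label files.
names_text, data_text = darknet_metadata(
    ["face", "bib"],             # hypothetical class labels
    "data/train.txt",            # hypothetical paths
    "data/valid.txt",
    "data/obj.names",
    "backup/",
)
print(names_text)
print(data_text)
print(cfg_values(2))  # {'classes': 2, 'filters': 21}
```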

## Train a YOLO model with the dataset prepared

We now have the training images, annotation files, and all the necessary metadata and configuration files to start the training process. Instead of training a network from scratch, we start from the weights of the darknet53 model pre-trained on ImageNet, which can be downloaded here.

To start training, use the darknet detector train command with object data information file, configuration file and weights file as below:

```
./darknet detector train ./build/darknet/x64/data/obj.data ./build/darknet/x64/cfg/yolo-obj.cfg "$proj_path"darknet53.conv.74
```


The program then prints statistics about the current training progress, as below:

v3 (mse loss, Normalizer: (iou: 0.75, cls: 1.00) Region 94 Avg (IOU: 0.280618, GIOU: 0.231438), Class: 0.471559, Obj: 0.491790, No Obj: 0.443883, .5R: 0.333333, .75R: 0.000000, count: 3, class_loss = 913.653015, iou_loss = 4.227417, total_loss = 917.880432
v3 (mse loss, Normalizer: (iou: 0.75, cls: 1.00) Region 106 Avg (IOU: 0.450346, GIOU: 0.334497), Class: 0.444642, Obj: 0.313804, No Obj: 0.495205, .5R: 0.500000, .75R: 0.000000, count: 2, class_loss = 4492.003906, iou_loss = 1.055664, total_loss = 4493.059570
v3 (mse loss, Normalizer: (iou: 0.75, cls: 1.00) Region 82 Avg (IOU: 0.287228, GIOU: 0.175353), Class: 0.566153, Obj: 0.376341, No Obj: 0.479385, .5R: 0.000000, .75R: 0.000000, count: 3, class_loss = 269.133057, iou_loss = 1.595978, total_loss = 270.729034
v3 (mse loss, Normalizer: (iou: 0.75, cls: 1.00) Region 94 Avg (IOU: 0.293746, GIOU: 0.240471), Class: 0.360567, Obj: 0.353466, No Obj: 0.443728, .5R: 0.000000, .75R: 0.000000, count: 4, class_loss = 914.361938, iou_loss = 3.726074, total_loss = 918.088013
v3 (mse loss, Normalizer: (iou: 0.75, cls: 1.00) Region 106 Avg (IOU: 0.218296, GIOU: 0.103024), Class: 0.553546, Obj: 0.403100, No Obj: 0.494409, .5R: 0.000000, .75R: 0.000000, count: 5, class_loss = 4473.710449, iou_loss = 6.518066, total_loss = 4480.228516

3: 1893.809692, 1894.787720 avg loss, 0.000000 rate, 13.986173 seconds, 192 images


You can find the meaning of the parameters here and here. I just want to add that the first number (‘3’ in the last row) is the number of iterations the training has gone through.
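If you want to track these numbers over time, for example to draw a loss chart, the per-iteration summary line can be parsed with a regular expression. This is a sketch based only on the line format shown in the log above; other versions of darknet may print it differently, and the field names are my own reading of the output.

```python
import re

# Matches the per-iteration summary line, e.g.
# "3: 1893.809692, 1894.787720 avg loss, 0.000000 rate, 13.986173 seconds, 192 images"
SUMMARY_RE = re.compile(
    r"^\s*(\d+):\s*([\d.]+),\s*([\d.]+) avg loss,\s*([\d.]+) rate,"
    r"\s*([\d.]+) seconds,\s*(\d+) images"
)


def parse_summary(line):
    """Return the fields of a summary line as a dict, or None for other lines."""
    m = SUMMARY_RE.match(line)
    if m is None:
        return None
    return {
        "iteration": int(m.group(1)),
        "loss": float(m.group(2)),
        "avg_loss": float(m.group(3)),
        "rate": float(m.group(4)),
        "seconds": float(m.group(5)),
        "images": int(m.group(6)),
    }


line = "3: 1893.809692, 1894.787720 avg loss, 0.000000 rate, 13.986173 seconds, 192 images"
print(parse_summary(line))
```

Lines that do not match, such as the per-region `v3 (mse loss, ...` lines, simply return `None`, so the function can be applied to every line of a saved log.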

The AlexeyAB fork saves the weights every 100 iterations as yolo-obj_last.weights, and every 1000 iterations as yolo-obj_xxxx.weights, in the backup directory specified in the object data information file. Here is a detailed explanation of when to stop training. As a general rule, 2000 iterations per class is suggested as sufficient.

## Testing the trained model

Once we have the weights file, we can try to make some detections! Below is some information about the testing:

• Image resolution: original
• Iterations of weights files used: 1000 (model-1000), 2000 (model-2000), 3000 (model-3000)

If we look at the detections on different images from the various iterations, we find that the model trained for 2000 iterations gives better results than the others.

Model-1000 cannot detect anything in the first image. Model-2000, on the other hand, successfully detects many bibs as well as several faces, although it also misinterprets one of the windows as a bib. Model-3000, however, detects only a few bibs and no faces.

In the second sample, model-1000 detects only 2 bibs and 1 face. Model-3000 gives a similar result, while model-2000 detects more bibs and more faces.

We can presume that model-1000 is underfitting, while model-3000 is possibly overfitting, since its results on these samples are barely better than those of model-1000. Meanwhile, model-2000 seems to strike a balance and returns better results. To better understand when we should stop the training to avoid overfitting, we could take a look at the loss chart. Since the current phase only serves as a warm-up and test, we won’t bother much with it for now and shall do that when training on the previously downloaded images, our target dataset.

Now let’s take a look at some statistics of the performance of these models. The mAP in the table stands for mean Average Precision. Tarang Shah and again Jonathan Hui have given good explanations on mAP. Please note that the confidence threshold in tests is 0.25.

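| Iterations | 1000 | 2000 | 3000 |
|---|---|---|---|
| Precision | 0.78 | 0.69 | 0.80 |
| Recall | 0.42 | 0.85 | 0.67 |
| F1 | 0.55 | 0.77 | 0.73 |
| True Positive | 96 | 193 | 151 |
| False Positive | 27 | 85 | 38 |
| False Negative | 130 | 33 | 75 |
| mAP (IoU: 0.50) | 66.42% | 73.46% | 72.65% |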

The above table shows that model-2000 is able to detect many more objects than model-1000 and model-3000, though with a higher number of false positives, which means it also marks other things as faces or bibs. However, its recall, and hence its F1-score, is higher than that of the other two models.
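The precision, recall, and F1 values in the table follow directly from the true positive, false positive, and false negative counts. The short sketch below recomputes them from the counts reported above; the function name is my own.

```python
def detection_metrics(tp, fp, fn):
    """Return (precision, recall, F1) from detection counts."""
    precision = tp / (tp + fp)   # share of detections that are correct
    recall = tp / (tp + fn)      # share of ground-truth objects that are found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# (true positive, false positive, false negative) per model, from the table
counts = {1000: (96, 27, 130), 2000: (193, 85, 33), 3000: (151, 38, 75)}
for iterations, (tp, fp, fn) in counts.items():
    p, r, f1 = detection_metrics(tp, fp, fn)
    print(f"model-{iterations}: precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```

In every case true positives plus false negatives add up to 226, the number of ground-truth objects in the test set.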

Finally, let’s examine the performance of the models on each class. It seems that model-2000 works better at detecting faces, while model-3000 works better at detecting bibs. In general, the average precision of bib detection is higher than that of face detection.

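| Class | Iterations | 1000 | 2000 | 3000 |
|---|---|---|---|---|
| Face | True Positive | 60 | 90 | 65 |
| | False Positive | 22 | 34 | 16 |
| | Average Precision | 62.10% | 71.43% | 62.60% |
| Bib | True Positive | 36 | 103 | 86 |
| | False Positive | 5 | 51 | 22 |
| | Average Precision | 70.75% | 75.48% | 82.70% |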
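The mAP reported earlier is simply the mean of these per-class average precisions. As a quick check, the sketch below averages the face and bib values from the table; small differences in the last digit are expected because the table values are already rounded.

```python
def mean_average_precision(ap_by_class):
    """Mean of the per-class average precision values (in percent)."""
    return sum(ap_by_class.values()) / len(ap_by_class)


# Average precision per class and model, copied from the table
per_class_ap = {
    1000: {"face": 62.10, "bib": 70.75},
    2000: {"face": 71.43, "bib": 75.48},
    3000: {"face": 62.60, "bib": 82.70},
}
for iterations, ap in per_class_ap.items():
    print(f"model-{iterations}: mAP = {mean_average_precision(ap):.3f}%")
```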

In the next stage, we’ll work on preparing the training dataset from the previously downloaded images, to train a YOLO model that can detect the faces and bibs in trail running events. Feel free to share your views on anything regarding this post and topic in the comment section.

## Update - Comparison on training with original and resized images

You may wonder about the differences between training a YOLO model with original and with resized images. I have experimented with the two sets of training images, and here are some comparisons:

• 1000 iterations

• 2000 iterations

• 3000 iterations

• 4000 iterations

The above examples show that the model trained on resized images does not have a significant advantage over the model trained on original images. Therefore, we should train the model with the original images in the next phase, unless extra data augmentation is needed.

## References

1. Epshtein, B., Ofek, E., & Wexler, Y. (2010). Detecting Text in Natural Scenes with Stroke Width Transform. CVPR, IEEE.

2. Boonsim, N., & Kanjaruek, S. (2018). Racing Bib Number Localization Based on Region Convolutional Neural Networks. Proceedings of 2018 the 8th International Workshop on Computer Science and Engineering, pp. 293-297, Bangkok, 28-30 June, 2018.

3. Apap, A., & Seychell, D. (2019). Marathon Bib Number Recognition using Deep Learning. pp. 21-26. doi:10.1109/ISPA.2019.8868768.