Bib Racer 01 - Scrape Images

To identify bib numbers in images, first we need to have the images.

Racers with bib numbers. Courtesy: Fitz

I have participated in trail running races for two years. All of those races are exciting and unforgettable, with stunning scenic views and different kinds of challenges. At each event there are many enthusiastic photographers, amateur and professional, taking numerous pictures of racers and putting them online for download, either freely or for a fee. To find the photos of a particular racer, one either has to look through countless albums, each containing hundreds of photos, one by one with the naked eye, or use a website that lets you input a bib number and returns the photos tagged with that number.

Various styles of bib and bib number

A bib number is usually a combination of up to two letters and several digits, and is used to uniquely identify a competitor in a race. It is printed prominently on the bib, a piece of paper worn by the racer, so that a racer can be spotted and identified from a distance.
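
As a rough illustration of that format, a regular expression along the following lines could be used later to validate detected numbers. The exact pattern and digit bounds here are assumptions for illustration only, since formats vary between races:

import re

# Hypothetical pattern: up to two letters followed by one to five digits.
bib_pattern = re.compile(r'^[A-Z]{0,2}\d{1,5}$')

for candidate in ['1234', 'A567', 'HK042', 'hello']:
  print(candidate, bool(bib_pattern.match(candidate)))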

With the bib number, we should be able to find all the photos of a particular racer. However, this assumes that every photo of a racer has been tagged with the corresponding bib numbers. In reality, most freely available photos have not been tagged at all, and tagging them takes a lot of human effort since it is a manual process.

I am curious to know whether the tagging process can be automated, at least to a certain degree, by image processing techniques currently available. To accomplish this goal, the project will have two parts. The first is to tag all the photos with the bib numbers detected in them. The second is to identify individual racers across all the available photos, so that we can also tag photos in which a racer's bib number is not visible.

Retrieve images from web photo albums

First things first, we need the images. I participated in a trail running event last year called Rebel Walker, and there are several websites hosting photo albums of this event. I will use the photos of this race available from Running Biji HK as an example.

Parsing album page using BeautifulSoup

We are using Beautiful Soup 4 here to parse the webpage with Python. There are lots of nicely written BeautifulSoup tutorials online, and its documentation is a good place to start. To parse the content of the page, first we need to download the webpage using the urllib library:

from bs4 import BeautifulSoup
import urllib.request

def soup_page(page):
  try:
    return BeautifulSoup(page, 'lxml')
  except:
    print("Cannot fetch the requested page")

# Open album page
url_bj = "https://hk.running.biji.co/index.php?q=album&act=gallery_album&competition_id=1734"
albums_page = urllib.request.urlopen(url_bj)
soup = soup_page(albums_page)

Taking a look at the source code of the album list page, we can see that the album list section is under a ul with the gal-grid class. The link to each individual album is inside an <a> tag. Therefore we can save all the links to a list for further parsing in the next step.

Source code of album list section
# Locate albums section and retrieve all album links
albums_list = soup.find('ul', attrs={'class': 'gal-grid'})
lnks = albums_list.find_all('a')
album_pages = []
for l in lnks:
  album_pages.append('https://hk.running.biji.co'+l['href'])
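
A quick sanity check, simply printing how many album links were collected, helps confirm the parsing worked:

print('Found {} album pages.'.format(len(album_pages)))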

Scrape web page using Selenium

An album page displays all the photos in an album. However, it is a dynamic page with an infinite-scrolling effect: not all photos are loaded when the page is first opened, and more are loaded whenever the bottom of the page is reached. So we cannot simply download the content of an album page from the given link. To fully load the web page, we need help from a tool called Selenium WebDriver. It is a browser automation framework that can simulate the scrolling actions needed to load all the photos.

To make use of Selenium in Python, we first need to install it with pip: pip install selenium. Selenium requires a driver to interface with the chosen browser, so we also need to install a suitable driver. Details of the installation can be found in the documentation of Selenium with Python.

Setting up Selenium in Colab

As I am developing the project in Google Colab, the setup below is needed before Selenium can work for us. The steps are based on the instructions in this post.

# install chromium, its driver, and selenium:
!apt-get update
!apt install chromium-chromedriver
!cp /usr/lib/chromium-browser/chromedriver /usr/bin
!pip install selenium
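
To confirm that the driver is in place after the installation, one can optionally check its version:

# optional sanity check of the installed driver
!chromedriver --version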

Then we can import the webdriver module from selenium and set the appropriate options to make it work.

from selenium import webdriver
sln_options = webdriver.ChromeOptions()
sln_options.add_argument('--headless')
sln_options.add_argument('--no-sandbox')
sln_options.add_argument('--disable-dev-shm-usage')

Load the album page with Selenium

We can now define a function to fully load an album page with Selenium, which will then be processed by BeautifulSoup. The first step is to create a webdriver connected to a browser, Chrome here, with the options defined above, load the page at the given url, and find the height of that page:

driver = webdriver.Chrome('chromedriver', options=sln_options)
driver.get(url)

The height of the initially loaded page is determined by executing the JavaScript return document.body.scrollHeight, which is stored and compared later to determine whether the page has been fully loaded.

last_height = driver.execute_script("return document.body.scrollHeight")

We then keep trying to scroll to the bottom of the page:

driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

Give it some time, which is 1 second in this example, to load the page:

time.sleep(1.0)

Find the new height of the page and check whether any new content has been loaded. This process continues until the height of the page no longer changes, which means all the content of the page has been loaded.

new_height = driver.execute_script("return document.body.scrollHeight")
if new_height == last_height:
  break
last_height = new_height

The complete load_page function is shown below:

import time

def load_page(url):
    driver = webdriver.Chrome('chromedriver', options=sln_options)
    driver.get(url)
    last_height = driver.execute_script("return document.body.scrollHeight")
    while True:
        # Scroll to the bottom, wait for new photos to load, and stop
        # once the page height stops growing.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(1.0)
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height
    return driver
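
Note that load_page returns a live browser session. Once the HTML has been handed over for parsing, the session can be released with quit(); a minimal sketch of that usage (assuming we only need the page source) is:

photo_page = load_page(album_pages[0])
html = photo_page.page_source
photo_page.quit()  # close the browser and release its resources
soup = soup_page(html)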

Scrape the photos

With the page loading function ready, we can finally parse the album page and download the photos. For each photo album page, we first use Selenium to fully load the page, then feed it to BeautifulSoup for parsing the content.

photo_page = load_page(album)
soup = soup_page(photo_page.page_source)

Again, taking a look at the source code of the album page, we can see that the link to every image is in an <img> tag with the class photo_img.

Link to the image

Therefore we can save all the links of photos to a list as below:

photo_list = soup.find_all('img', attrs={'class': 'photo_img'})

Now let’s inspect the page containing the actual photo to see if there is any other information or resource that can benefit our project. From the source code, we find that each photo is available in two sizes: a smaller one whose file name starts with “600_” and whose height is 600 pixels, and a larger one whose file name starts with “1024_”, as shown below:

Two versions of the same photo in different sizes

We shall download the larger images in the hope of building a more accurate model. Since the URLs in photo_list all link to “600_” files, we need to replace the file names with the “1024_” versions before downloading. It is a bit hard-coded in style, but I think it is sufficient for the purpose of this task, and there is no need to waste time opening another page to find the exact link to the larger photo. Each image is downloaded with requests.get().content and saved to a file with the same name as on the web server, extracted using os.path.basename().

import os
import requests

img_path = 'images/'
os.makedirs(img_path, exist_ok=True)  # make sure the target folder exists
for i in range(len(photo_list)):
  lnk = photo_list[i]['src'].replace("600", "1024")
  with open(img_path + os.path.basename(lnk), "wb") as f:
    f.write(requests.get(lnk).content)
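
If the string "600" could appear elsewhere in the URL (for example in a folder name or a photo id), a slightly safer variant replaces only the leading "600_" of the file name. The helper below, to_large_url, is a hypothetical sketch of that idea:

def to_large_url(src):
  # Replace only the "600_" prefix of the file name with "1024_",
  # leaving the rest of the URL untouched.
  head, name = os.path.split(src)
  if name.startswith('600_'):
    name = '1024_' + name[len('600_'):]
  return head + '/' + name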

The complete photo download part is shown below:

a_num=0
for album in album_pages:
  photo_page = load_page(album)
  a_num+=1
  # Retrieve content of the album
  soup = soup_page(photo_page.page_source)
  photo_list = soup.find_all('img', attrs={'class': 'photo_img'})
  print('Album {} contains {} photos.'.format(a_num, len(photo_list)))
  # Download photos from an album
  for i in range(len(photo_list)):
    lnk = photo_list[i]['src'].replace("600", "1024")
    with open(img_path + os.path.basename(lnk), "wb") as f:
      f.write(requests.get(lnk).content)
  print('Finished processing album {}'.format(a_num))

Summary

Downloading racers’ photos from a photo gallery site is the first step of our bib/racer recognition and grouping project. We’ll try to extract the bib numbers from the available photos and organize them in the next step. The source code of the photo downloader, in Jupyter notebook format, is available on GitHub.

Leo Mak
Enthusiast of Data Mining