Finding correlation coefficients between ETFs with Python

Several months ago I finished reading the book The Intelligent Asset Allocator by William Bernstein. It is a really nice book if you want to have a solid idea and examples on portfolio theory, as well as guidance for building your own investment portfolio by allocating your asset into different classes. One of the main points of building effective portfolio is building with uncorrelated, or less correlated in reality, assets. Whether two assets are correlated or not, or more precisely, the level of correlation, is measured by correlation coefficient, which is ranging from -1 to +1. A positive correlation coefficient basically means the two assets change in same manner. Negatively correlated assets have a negative correlated coefficient, which change in reverse direction. Completely uncorrelated assets have coefficient 0, which we just do not know the movement of one asset from the movement of another asset. More about the usage of correlation in financial analysis can be found here.

The main investment vehicle I am assembling is through ETFs available in Hong Kong market. It is then interesting to find out the correlation coefficients of different combinations of ETFs. The list of all available ETFs in Hong Kong market is available in Once we get the list of targeted tickers of the securities, we can retrieve the corresponding historical data in Yahoo! Finance.

We are going to build a Python program to calculate the correlation coefficients of different ETFs for further analysis, which includes below four steps:

  1. Retrieve a list of ETFs
  2. Retrieve historical data of ETFs
  3. Scrutinize the data
  4. Create correlation matrix

Following the discussion and examples in The Intelligent Asset Allocator, the calculation of correlation coefficient is based on the percentage of change of monthly closing price of each ETF.

Retrieve list of ETFs

First thing first, we are going to retrieve all the ETFs available for trading in Stock Exchange of Hong Kong, which is available in the ETF search page of financial website Since the only information we need to retrieve the historical data from Yahoo! Finance is the ticker, the get_ETF_list function only return the list of names and tickers (ETF_Name), and the underlying indices of the ETFs (ETF_Index) for reference.

def get_ETF_list(lang='en'):
    return (ETF_Name, ETF_Index)

ETF_Name, ETF_Index = get_ETF_list('en')

The get_ETF_list function utilize a simple method read_html from pandas to download all the tables in a given url and save them to a list of dataframes.

    lang = lang  # 'tc' or 'en'
    url = '' + lang + '/stocks/etf/search.aspx?t=1&s=0&o=1&sl=1'
    dfs = pd.read_html(url, encoding='utf8')

Taking a look of the structure of the target webpage, we find that the table containing ETF information is the largest one amongst all the downloaded tables. Therefore we shall work on the dataframe with longest length.

    df = max(dfs, key=len)  # table includes all ETF information

Next we are going to clean up the downloaded ETF list. As there are some rows in the dataframe contain descriptive information, which is not needed, we are going to exclude those rows before returning the tickers and names of ETFs. Screenshot of ETF list

Since all the ETFs in the list element contain the string ‘.HK', we can find out the rows that do not contain ETF information and drop those rows. Moreover, as the format of variable ETF_Name is ‘Name of ETF #####.HK', we should split the string into the actual name of the security and the ticker. Finally we should also reset the dataframe index.

    ETF_Name = df.iloc[1:len(df), 0]
    non_etf_idx = ETF_Name[~ETF_Name.str.contains('.HK')].index
    ETF_Name = ETF_Name.drop(non_etf_idx)
    ETF_Name = ETF_Name.apply(str.rsplit, args=(' ', 1))
    ETF_Name.reset_index(inplace=True, drop=True)

Perform similar clean-up procedure on ETF_Index list.

    ETF_Index = df.iloc[1:len(df), 1]
    ETF_Index = ETF_Index.drop(non_etf_idx)
    ETF_Index.reset_index(inplace=True, drop=True)
    return (ETF_Name, ETF_Index)

Retrieve historical data of ETFs

We now have the list of names and tickers of all ETFs, and are ready to retrieve the corresponding historical data. As discussed previsouly, we choose to download the monthly closing prices of the securities to be the data for calculating correlation coefficients. To retrieve the set of historical data, we need to provide the start and end of the period, as well as the specific ticker.

Names = ETF_Name.tolist()
etf_tickers = list(zip(*Names))[1]
d1 = date(2017,1,1)
d2 =
etf_all = []
etf_all = [get_historical(d1, d2, etf.strip('0')) for etf in etf_tickers]

Let’s take a look on how the get_historical function retrieve the data we want:

def get_historical(date_start, date_end, ticker):
    start_tstmp = int(time.mktime(date_start.timetuple()))
    end_tstmp = int(time.mktime(date_end.timetuple()))
    his_url = '' + ticker + '/history?period1=' + str(start_tstmp) + '&period2=' + str(end_tstmp) + '&interval=1mo&filter=history&frequency=1mo'
    print("Getting data of", ticker, "...")
The `timetuple()` function converts the given `` parameter into a standard `time.struct_time` object that can be further coverted into a floating point number representing the time in seconds since the [*epoch*](, which are part of the parameters embedded in the requesting URL.

The table of historical data downloaded is stored in a pandas dataframe. Because there are some tickers which are not available in Yahoo! Finance page due to various reasons, we need to first make sure the table downloaded is at least in correct format by checking the first element of the first column is ‘Date’, which is the name of the column, as shown in below figure.

desired ETF historical data table format

  his_tbl = pd.read_html(his_url, header=0)
  df = his_tbl[0]
  if (df.columns[0] == 'Date'):
    # Get historical data
      return (None)

There are 7 columns in the the downloaded table including Date, Open, High, Low, Close, Adjusted Close and Volume. What we only need are the ‘Date’ and ‘Adjusted Close’ columns. As seen in above figure, there are rows in table indicating the dividend payout date and is not relevant to our analysis, so we need to drop those rows. Then we should parse the date and the adjusted close value to correct data formats (datetime64 and float64 respectively). Since the price data of each ticker will be merged together in later step for ease of calculation, we update the column name of the Adj Close column with the corresponding ticker. Finally, as the correlation coefficients of different tickers are calculated based on the percentage of their monthly closing price change, we add a new column to the dataframe to reflect that monthly percentage change using the pct_change() function.

        # Get historical data
        df = df.iloc[0:len(df) - 1, [0, 5]]  # Only retrieve 'Date' and 'Adj Close'
        non_price_idx = df[df.iloc[:, 1].str.contains('Dividend')].index
        df = df.drop(non_price_idx)  # Remove 'dividend' information row
        df.Date = pd.to_datetime(df.Date)  # Convert to datetime
        df.iloc[:, 1:df.shape[1]] = df.iloc[:, 1:df.shape[1]].apply(pd.to_numeric, errors='coerce')  # Convert to numeric
        col_name = ticker + '_Adj_Close'
        df.columns = ['Date', col_name]
        df[ticker + '_pct_diff'] = df[col_name].pct_change(periods=-1) * 100
        return (df.set_index('Date'))

Scrutinize the data

Now we have the data of all the available ETFs downloaded. Before computing the correlation coefficients from these data directly, it is worth to check that the data is clean and meaningful for comparison. For example, since some ETFs have very small transaction volume and numbers, a criteria is set to ensure that for each ETF, data should be available for at least 2/3 of the enquired period. All the ETF will be check by the can_add() function before adding to the final list of available ETFs.

month_diff = (d2.year-d1.year)*12 + (d2.month-d1.month)
min_len = round(month_diff / 1.5)  # at least the most recent 2/3 of the whole period contains data
etfs = [x for x in etf_all if can_add(x, min_len)]

The can_add() function checks on below 3 criteria:

  1. If there is any data downloaded
  2. If data is available for at least 2/3 of the whole enquired period
  3. If the most recent 2/3 of the equired period contains data
def can_add(etf, min_len):
    if etf is not None:
        if len(etf) > min_len:
            if etf.iloc[-min_len:,0].isna().sum() == 0:
                return True
    return False

Create correlation matrix

The final step of creating the correlation matrix is to combine the percentage change column of each ETF into one dataframe and compute the pairwise Pearson correlation coefficients of these columns.

etfs_diff = [df.filter(regex='pct_diff') for df in etfs]
etfs_diff_aggr = pd.concat(etfs_diff[:], axis=1)
etfs_diff_aggr = etfs_diff_aggr.rename(columns=lambda x: re.sub('_pct_diff', '', x))  # remove unnecessary string in column labels
etfs_corr = etfs_diff_aggr.corr()

Once the correlation matrix is ready, we can run different analyses on ETFs to find out their relationships. The details will be discussed in next post, Correlation analysis of ETF using Python. Stay tuned!

Leo Mak
Enthusiast of Data Mining
comments powered by Disqus