Correlation analysis of ETF using Python

Previously, we have covered why and how to create a correlation matrix of ETFs available in Hong Kong market using Python. Now we should do some actual correlation analyses on these securities, with the matrix just created. There are two kinds of analyses I am going to demonstrate, which are actually quite similar: one is to find out the n most uncorrelated ETFs in the whole market; the other one is to find out n most uncorrelated ETFs corresponding to a given specific ticker.

Find the n most uncorrelated securities

The idea to find out the n pairs of ETFs with smallest correlation coefficients is first to find out the smallest n correlation coefficients in the matrix, then look for their corresponding position pairs (row, column) in the matrix. Finally, the function returns the corresponding ticker names as well as the correlation coefficients which are the most uncorrelated ETFs in the whole market.

def find_nsmallest_pairs(corr_mtx, num_pairs):
    smallest_corrs = find_nsmallest(corr_mtx, num_pairs)
    smallest_pos = [np.argwhere(np.triu(corr_mtx, 1) == v)[0] for v in smallest_corrs]
    corrs_info = [[corr_mtx.index[smallest_pos[i][0]], corr_mtx.columns[smallest_pos[i][1]], smallest_corrs[i]] for i in range(len(smallest_corrs))]
    return corrs_info
The find_nsmallest() returns the n smallest correlation coefficients in the matrix corr_mtx, which will be discussed later. We can then use the numpy.argwhere() function to find out the corresponding indices of those n smallest correlation coefficients. Since a correlation matrix is symmetric, we only need to look for the indices from the upper or lower triangular matrix with numpy.triu() or numpy.tril(). The smallest_pos is a list of x-y indices corresponding to the smallest n values in the correlation matrix, and we can then retrieve the tickers of the corresponding indices as well as the correlation coefficients.

In order to look for the n smallest values in a pandas dataframe, we need to first cast the matrix into the numpy array. Again, due to the symmetric property of the correlation matrix, we only need to look into one of the triangular matrices. Since the range of a correlation coefficient is between -1 and +1, where 0 is the minimum by definition, the coefficients should be sorted according to their absolute values before finding out the minimum ones. Finally the corresponding real minimum values are looked up by their positions in the sorted list of absolute values.

def find_nsmallest(corr_mat, n):
    corr_arr = corr_mat.to_numpy()
    corrs = corr_arr[np.triu_indices(len(corr_arr), 1)]  # extract the upper triangle
    abs = np.abs(corrs)
    abs_sorted = np.sort(abs)
    pos = [np.argwhere(abs == abs_sorted[i]).item() for i in range(n)]
    return corrs[pos]
For example, the 8 pairs of tickers with lowest correlation coefficients are as below:
In[8]: find_nsmallest_pairs(etfs_corr, 8)
[['3019.HK', '3060.HK', 0.0002316007102952408],
 ['3007.HK', '3043.HK', -0.0008349599845091073],
 ['2819.HK', '3016.HK', 0.0008893844530078178],
 ['3055.HK', '3145.HK', 0.0008924172486858009],
 ['7302.HK', '82822.HK', 0.0009348115710049738],
 ['2832.HK', '3049.HK', 0.0009997182485107106],
 ['3085.HK', '83168.HK', -0.0011782217208005292],
 ['2833.HK', '3084.HK', 0.0015325610954576]]

The n most uncorrelated securities to a specific ticker

It is much easier to find the n most uncorrelated securities if a specific ticker is given, since we only need to look for the candidate values within a specific column in the dataframe.

def find_specific_nsmallest_pairs(ticker, corr_mtx, num_pairs):
    corrs = corr_mtx[corr_mtx.columns == ticker]
    return corrs[np.abs(corrs.T).nsmallest(num_pairs, ticker).index].T
For example, the followings show the 5 securities which are most uncorrelated with ticker ‘2800.HK’.
In[11]: find_specific_nsmallest_pairs('2800.HK', etfs_corr, 5)
3043.HK   0.003409
3027.HK   0.003778
83168.HK -0.005163
83188.HK  0.011657
82822.HK -0.014702

We can now plot the monthly price changes of some of the ETF pairs returned from previous operations. Please note the transaction period for below figures is from 2017 Janauary to 2019 September.

xlab = '2800.HK'
ticker_nsmallest_pairs = find_specific_nsmallest_pairs(xlab, etfs_corr, 5)
for i in range(pair_num):
    ylab = ticker_nsmallest_pairs.index[i]
    corrplot = etfs_diff_aggr.loc[:, [xlab, ylab]].plot.scatter(x=xlab, y=ylab)
    corrplot.set(xlabel=xlab+", monthly, %", ylabel=ylab+", monthly, %")

2800.HK/3072.HK, correlation = 0.0038

2800.HK/3072.HK, correlation = 0.0038

2800.HK/83168.HK, correlation = -0.0052

2800.HK/83168.HK, correlation = -0.0052

2800.HK/82822.HK, correlation = -0.015

2800.HK/82822.HK, correlation = -0.015

Excluding outliers

The above figures show that there are some extreme values, which are very much deviated from the normal ragne of percentage change. There are many reasons behind the existing of this kind of outliers, such as data incorrectness from the data source, the unhandled stock split or stock merge information, etc. Whatever the reasons are, the ETF correlation analysis can be greatly affected since it is sensitive to outliers, and we should exclude them before conducting any analysis, as shown in the below highlighted line of code.

etfs_diff = [df.filter(regex='pct_diff') for df in etfs]
etfs_diff_aggr = pd.concat(etfs_diff[:], axis=1)
etfs_diff_aggr = remove_outlier(etfs_diff_aggr)
etfs_diff_aggr = etfs_diff_aggr.rename(columns=lambda x: re.sub('_pct_diff', '', x))  # remove unnecessary string in column labels
etfs_corr = etfs_diff_aggr.corr()

The measure employed here to detect and exclude outliers is interquartile range (IQR). IQR is defined as substracting the first quartile, or the 25th percentile, from the third quartile, or the 75th percentile: $$ IQR = Q_3 - Q_1 $$ Outliers are then defined as observations outside of the range $[Q_1-1.5 \times IQR, Q_3+1.5 \times IQR]$

def remove_outlier(corr_df):
    q1s, q3s = np.nanpercentile(corr_df, [25, 75])
    iqrs = q3s - q1s
    lbs = q1s - 1.5 * iqrs
    ubs = q3s + 1.5 * iqrs
    corr_df[np.logical_or(corr_df < lbs, corr_df > ubs)] = np.nan

Let’s now run the correlation analysis on the excluded outliers ETFs data again.

The 8 most uncorrelated ETFs:

In[25]: find_nsmallest_pairs(etfs_corr, 8)
[['3141.HK', '3027.HK', 0.0001788345902729533],
 ['3027.HK', '2840.HK', 0.0005005550357437558],
 ['7230.HK', '3097.HK', -0.0008885981653538868],
 ['2819.HK', '3016.HK', 0.0008893844530078178],
 ['2825.HK', '3141.HK', 0.0013415966235783203],
 ['3012.HK', '3081.HK', 0.0013823011000694114],
 ['3086.HK', '2840.HK', -0.0023508470579590274],
 ['3070.HK', '3141.HK', -0.0024499785004589414]]

The 5 ETFs which are most uncorrelated with ticker ‘2800.HK’, which are very much different from the previous results:

In[26]: find_specific_nsmallest_pairs('2800.HK', etfs_corr, 5)
2819.HK   0.059241
2821.HK   0.064642
83168.HK  0.086489
3122.HK   0.095006
3141.HK  -0.100023

The differences is more clearly shown when the pairs of ETFs are plotted in scatter plots:

6 most uncorrelated ETFs with 2800.HK

6 most uncorrelated ETFs with 2800.HK

What’s next

We can now begin the ETF searching and asset building process according to the correlation between different securities. Certainly, there are a lot more besides just looking for low correlation coefficients in security analysis. For example, we may set a lower limit on daily or monthly transaction volume to ensure the liquidity of the securities to lower the bid-ask spread. Moreover, we can also look for ETFs with maximum or high enough correlation coefficients, indicating that they have similar behaviour, and choose the one with lowest cost, which is also one of the important dimensions of long-term investment.

The Python source code of this ETF correlation analysis report is available here.

Leo Mak
Enthusiast of Data Mining