Correlation analysis of ETF using Python
Previously, we have covered why and how to create a correlation matrix of ETFs available in Hong Kong market using Python. Now we should do some actual correlation analyses on these securities, with the matrix just created. There are two kinds of analyses I am going to demonstrate, which are actually quite similar: one is to find out the n most uncorrelated ETFs in the whole market; the other one is to find out n most uncorrelated ETFs corresponding to a given specific ticker.
Find the n most uncorrelated securities
The idea to find out the n pairs of ETFs with smallest correlation coefficients is first to find out the smallest n correlation coefficients in the matrix, then look for their corresponding position pairs (row, column) in the matrix. Finally, the function returns the corresponding ticker names as well as the correlation coefficients which are the most uncorrelated ETFs in the whole market.
def find_nsmallest_pairs(corr_mtx, num_pairs): smallest_corrs = find_nsmallest(corr_mtx, num_pairs) smallest_pos = [np.argwhere(np.triu(corr_mtx, 1) == v) for v in smallest_corrs] corrs_info = [[corr_mtx.index[smallest_pos[i]], corr_mtx.columns[smallest_pos[i]], smallest_corrs[i]] for i in range(len(smallest_corrs))] return corrs_info
find_nsmallest() returns the n smallest correlation coefficients in the matrix
corr_mtx, which will be discussed later. We can then use the
numpy.argwhere() function to find out the corresponding indices of those n smallest correlation coefficients. Since a correlation matrix is
symmetric, we only need to look for the indices from the upper or lower
triangular matrix with
smallest_pos is a list of x-y indices corresponding to the smallest n values in the correlation matrix, and we can then retrieve the tickers of the corresponding indices as well as the correlation coefficients.
In order to look for the n smallest values in a pandas dataframe, we need to first cast the matrix into the numpy array. Again, due to the symmetric property of the correlation matrix, we only need to look into one of the triangular matrices. Since the range of a correlation coefficient is between -1 and +1, where 0 is the minimum by definition, the coefficients should be sorted according to their absolute values before finding out the minimum ones. Finally the corresponding real minimum values are looked up by their positions in the sorted list of absolute values.
def find_nsmallest(corr_mat, n): corr_arr = corr_mat.to_numpy() corrs = corr_arr[np.triu_indices(len(corr_arr), 1)] # extract the upper triangle abs = np.abs(corrs) abs_sorted = np.sort(abs) pos = [np.argwhere(abs == abs_sorted[i]).item() for i in range(n)] return corrs[pos]
In: find_nsmallest_pairs(etfs_corr, 8) Out: [['3019.HK', '3060.HK', 0.0002316007102952408], ['3007.HK', '3043.HK', -0.0008349599845091073], ['2819.HK', '3016.HK', 0.0008893844530078178], ['3055.HK', '3145.HK', 0.0008924172486858009], ['7302.HK', '82822.HK', 0.0009348115710049738], ['2832.HK', '3049.HK', 0.0009997182485107106], ['3085.HK', '83168.HK', -0.0011782217208005292], ['2833.HK', '3084.HK', 0.0015325610954576]]
The n most uncorrelated securities to a specific ticker
It is much easier to find the n most uncorrelated securities if a specific ticker is given, since we only need to look for the candidate values within a specific column in the dataframe.
def find_specific_nsmallest_pairs(ticker, corr_mtx, num_pairs): corrs = corr_mtx[corr_mtx.columns == ticker] return corrs[np.abs(corrs.T).nsmallest(num_pairs, ticker).index].T
In: find_specific_nsmallest_pairs('2800.HK', etfs_corr, 5) Out: 2800.HK 3043.HK 0.003409 3027.HK 0.003778 83168.HK -0.005163 83188.HK 0.011657 82822.HK -0.014702
We can now plot the monthly price changes of some of the ETF pairs returned from previous operations. Please note the transaction period for below figures is from 2017 Janauary to 2019 September.
xlab = '2800.HK' ticker_nsmallest_pairs = find_specific_nsmallest_pairs(xlab, etfs_corr, 5) for i in range(pair_num): ylab = ticker_nsmallest_pairs.index[i] corrplot = etfs_diff_aggr.loc[:, [xlab, ylab]].plot.scatter(x=xlab, y=ylab) corrplot.set(xlabel=xlab+", monthly, %", ylabel=ylab+", monthly, %")
The above figures show that there are some extreme values, which are very much deviated from the normal ragne of percentage change. There are many reasons behind the existing of this kind of outliers, such as data incorrectness from the data source, the unhandled stock split or stock merge information, etc. Whatever the reasons are, the ETF correlation analysis can be greatly affected since it is sensitive to outliers, and we should exclude them before conducting any analysis, as shown in the below highlighted line of code.
etfs_diff = [df.filter(regex='pct_diff') for df in etfs] etfs_diff_aggr = pd.concat(etfs_diff[:], axis=1) etfs_diff_aggr = remove_outlier(etfs_diff_aggr) etfs_diff_aggr = etfs_diff_aggr.rename(columns=lambda x: re.sub('_pct_diff', '', x)) # remove unnecessary string in column labels etfs_corr = etfs_diff_aggr.corr()
The measure employed here to detect and exclude outliers is interquartile range (IQR). IQR is defined as substracting the first quartile, or the 25th percentile, from the third quartile, or the 75th percentile: $$ IQR = Q_3 - Q_1 $$ Outliers are then defined as observations outside of the range $[Q_1-1.5 \times IQR, Q_3+1.5 \times IQR]$
def remove_outlier(corr_df): q1s, q3s = np.nanpercentile(corr_df, [25, 75]) iqrs = q3s - q1s lbs = q1s - 1.5 * iqrs ubs = q3s + 1.5 * iqrs corr_df[np.logical_or(corr_df < lbs, corr_df > ubs)] = np.nan return(corr_df)
Let’s now run the correlation analysis on the excluded outliers ETFs data again.
The 8 most uncorrelated ETFs:
In: find_nsmallest_pairs(etfs_corr, 8) Out: [['3141.HK', '3027.HK', 0.0001788345902729533], ['3027.HK', '2840.HK', 0.0005005550357437558], ['7230.HK', '3097.HK', -0.0008885981653538868], ['2819.HK', '3016.HK', 0.0008893844530078178], ['2825.HK', '3141.HK', 0.0013415966235783203], ['3012.HK', '3081.HK', 0.0013823011000694114], ['3086.HK', '2840.HK', -0.0023508470579590274], ['3070.HK', '3141.HK', -0.0024499785004589414]]
The 5 ETFs which are most uncorrelated with ticker ‘2800.HK’, which are very much different from the previous results:
In: find_specific_nsmallest_pairs('2800.HK', etfs_corr, 5) Out: 2800.HK 2819.HK 0.059241 2821.HK 0.064642 83168.HK 0.086489 3122.HK 0.095006 3141.HK -0.100023
The differences is more clearly shown when the pairs of ETFs are plotted in scatter plots:
We can now begin the ETF searching and asset building process according to the correlation between different securities. Certainly, there are a lot more besides just looking for low correlation coefficients in security analysis. For example, we may set a lower limit on daily or monthly transaction volume to ensure the liquidity of the securities to lower the bid-ask spread. Moreover, we can also look for ETFs with maximum or high enough correlation coefficients, indicating that they have similar behaviour, and choose the one with lowest cost, which is also one of the important dimensions of long-term investment.
The Python source code of this ETF correlation analysis report is available here.