# Correlation analysis of ETF using Python

Previously, we have covered why and how to
create a correlation matrix of ETFs available in Hong Kong market using Python. Now we should do some actual correlation analyses on these securities, with the matrix just created. There are two kinds of analyses I am going to demonstrate, which are actually quite similar: one is to find out the *n* most uncorrelated ETFs in the whole market; the other one is to find out *n* most uncorrelated ETFs corresponding to a given specific ticker.

## Find the *n* most uncorrelated securities

The idea to find out the *n* pairs of ETFs with smallest correlation coefficients is first to find out the smallest *n* correlation coefficients in the matrix, then look for their corresponding position pairs (row, column) in the matrix. Finally, the function returns the corresponding ticker names as well as the correlation coefficients which are the most uncorrelated ETFs in the whole market.

```
def find_nsmallest_pairs(corr_mtx, num_pairs):
smallest_corrs = find_nsmallest(corr_mtx, num_pairs)
smallest_pos = [np.argwhere(np.triu(corr_mtx, 1) == v)[0] for v in smallest_corrs]
corrs_info = [[corr_mtx.index[smallest_pos[i][0]], corr_mtx.columns[smallest_pos[i][1]], smallest_corrs[i]] for i in range(len(smallest_corrs))]
return corrs_info
```

The `find_nsmallest()`

returns the *n* smallest correlation coefficients in the matrix `corr_mtx`

, which will be discussed later. We can then use the
`numpy.argwhere()`

function to find out the corresponding indices of those *n* smallest correlation coefficients. Since a correlation matrix is
symmetric, we only need to look for the indices from the upper or lower
triangular matrix with
`numpy.triu()`

or
`numpy.tril()`

. The `smallest_pos`

is a list of x-y indices corresponding to the smallest *n* values in the correlation matrix, and we can then retrieve the tickers of the corresponding indices as well as the correlation coefficients.

In order to look for the *n* smallest values in a pandas dataframe, we need to first cast the matrix into the numpy array. Again, due to the symmetric property of the correlation matrix, we only need to look into one of the triangular matrices. Since the range of a correlation coefficient is between -1 and +1, where 0 is the minimum by definition, the coefficients should be sorted according to their absolute values before finding out the minimum ones. Finally the corresponding real minimum values are looked up by their positions in the sorted list of absolute values.

```
def find_nsmallest(corr_mat, n):
corr_arr = corr_mat.to_numpy()
corrs = corr_arr[np.triu_indices(len(corr_arr), 1)] # extract the upper triangle
abs = np.abs(corrs)
abs_sorted = np.sort(abs)
pos = [np.argwhere(abs == abs_sorted[i]).item() for i in range(n)]
return corrs[pos]
```

```
In[8]: find_nsmallest_pairs(etfs_corr, 8)
Out[8]:
[['3019.HK', '3060.HK', 0.0002316007102952408],
['3007.HK', '3043.HK', -0.0008349599845091073],
['2819.HK', '3016.HK', 0.0008893844530078178],
['3055.HK', '3145.HK', 0.0008924172486858009],
['7302.HK', '82822.HK', 0.0009348115710049738],
['2832.HK', '3049.HK', 0.0009997182485107106],
['3085.HK', '83168.HK', -0.0011782217208005292],
['2833.HK', '3084.HK', 0.0015325610954576]]
```

## The *n* most uncorrelated securities to a specific ticker

It is much easier to find the *n* most uncorrelated securities if a specific ticker is given, since we only need to look for the candidate values within a specific column in the dataframe.

```
def find_specific_nsmallest_pairs(ticker, corr_mtx, num_pairs):
corrs = corr_mtx[corr_mtx.columns == ticker]
return corrs[np.abs(corrs.T).nsmallest(num_pairs, ticker).index].T
```

```
In[11]: find_specific_nsmallest_pairs('2800.HK', etfs_corr, 5)
Out[11]:
2800.HK
3043.HK 0.003409
3027.HK 0.003778
83168.HK -0.005163
83188.HK 0.011657
82822.HK -0.014702
```

We can now plot the monthly price changes of some of the ETF pairs returned from previous operations. Please note the transaction period for below figures is from 2017 Janauary to 2019 September.

```
xlab = '2800.HK'
ticker_nsmallest_pairs = find_specific_nsmallest_pairs(xlab, etfs_corr, 5)
for i in range(pair_num):
ylab = ticker_nsmallest_pairs.index[i]
corrplot = etfs_diff_aggr.loc[:, [xlab, ylab]].plot.scatter(x=xlab, y=ylab)
corrplot.set(xlabel=xlab+", monthly, %", ylabel=ylab+", monthly, %")
```

## Excluding outliers

The above figures show that there are some extreme values, which are very much deviated from the normal ragne of percentage change. There are many reasons behind the existing of this kind of outliers, such as data incorrectness from the data source, the unhandled stock split or stock merge information, etc. Whatever the reasons are, the ETF correlation analysis can be greatly affected since it is sensitive to outliers, and we should exclude them before conducting any analysis, as shown in the below highlighted line of code.

```
etfs_diff = [df.filter(regex='pct_diff') for df in etfs]
etfs_diff_aggr = pd.concat(etfs_diff[:], axis=1)
etfs_diff_aggr = remove_outlier(etfs_diff_aggr)
etfs_diff_aggr = etfs_diff_aggr.rename(columns=lambda x: re.sub('_pct_diff', '', x)) # remove unnecessary string in column labels
etfs_corr = etfs_diff_aggr.corr()
```

The measure employed here to detect and exclude outliers is
interquartile range (IQR). IQR is defined as substracting the first quartile, or the 25^{th}
percentile, from the third quartile, or the 75^{th} percentile:
$$
IQR = Q_3 - Q_1
$$
Outliers are then defined as observations outside of the range $[Q_1-1.5 \times IQR, Q_3+1.5 \times IQR]$

```
def remove_outlier(corr_df):
q1s, q3s = np.nanpercentile(corr_df, [25, 75])
iqrs = q3s - q1s
lbs = q1s - 1.5 * iqrs
ubs = q3s + 1.5 * iqrs
corr_df[np.logical_or(corr_df < lbs, corr_df > ubs)] = np.nan
return(corr_df)
```

Let’s now run the correlation analysis on the excluded outliers ETFs data again.

The 8 most uncorrelated ETFs:

```
In[25]: find_nsmallest_pairs(etfs_corr, 8)
Out[25]:
[['3141.HK', '3027.HK', 0.0001788345902729533],
['3027.HK', '2840.HK', 0.0005005550357437558],
['7230.HK', '3097.HK', -0.0008885981653538868],
['2819.HK', '3016.HK', 0.0008893844530078178],
['2825.HK', '3141.HK', 0.0013415966235783203],
['3012.HK', '3081.HK', 0.0013823011000694114],
['3086.HK', '2840.HK', -0.0023508470579590274],
['3070.HK', '3141.HK', -0.0024499785004589414]]
```

The 5 ETFs which are most uncorrelated with ticker ‘2800.HK’, which are very much different from the previous results:

```
In[26]: find_specific_nsmallest_pairs('2800.HK', etfs_corr, 5)
Out[26]:
2800.HK
2819.HK 0.059241
2821.HK 0.064642
83168.HK 0.086489
3122.HK 0.095006
3141.HK -0.100023
```

The differences is more clearly shown when the pairs of ETFs are plotted in scatter plots:

## What’s next

We can now begin the ETF searching and asset building process according to the correlation between different securities. Certainly, there are a lot more besides just looking for low correlation coefficients in security analysis. For example, we may set a lower limit on daily or monthly transaction volume to ensure the liquidity of the securities to lower the bid-ask spread. Moreover, we can also look for ETFs with maximum or high enough correlation coefficients, indicating that they have similar behaviour, and choose the one with lowest cost, which is also one of the important dimensions of long-term investment.

The Python source code of this ETF correlation analysis report is available here.