Friday, 6 September 2013

Correlations

This post discusses two types of correlation: the non-parametric Spearman's Rank Correlation, and the parametric Pearson Correlation.

Spearman's Rank Correlation

Spearman's Rank Correlation $\rho$ is a non-parametric measure of the relationship between two variables, and tests whether these variables have a monotonic relationship.

Suppose two samples of equal size, paired element by element, have a monotonic relationship. If the pairs are sorted by the magnitude of the first element in ascending order, the second elements will be in ascending order too. Likewise, if the pairs are sorted by the first element in descending order, the second elements will be in descending order too. The same holds if we sort by the second element of the pair and examine the ordering of the first element. Thus, both elements of the pairs are ordered in the same direction.

If two samples are anti-monotonic with respect to each other, then sorting the pairs by the magnitude of the first element in ascending order leaves the second elements in descending order. Likewise, if the pairs are sorted by the first element in descending order, the second elements will be in ascending order. The same holds if we sort by the second element of the pair and examine the ordering of the first element. Thus, the two elements of the pairs are ordered in opposite directions.

Spearman's Rank Correlation is bounded between $-1$ and $+1$. It takes the value $-1$ when the variables are related in a perfectly anti-monotonic manner, and $+1$ when they are related in a perfectly monotonic manner. It is given by the following equation:

$\Large \rho=\frac{\sum_{i=1}^{n}{(x_i-\bar{x})(y_i-\bar{y})}}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2\sum_{i=1}^{n}(y_i-\bar{y})^2}}$ (Eq. 1)

where $x_i$ and $y_i$ are the ranks obtained by converting the $n$ raw samples $x[i]$ and $y[i]$ respectively (tied raw values share the average of the ranks they occupy), and $\bar{x}$ and $\bar{y}$ are the means of the ranks $x_i$ and $y_i$ respectively.
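As an illustration, (Eq. 1) can be sketched in plain Python. The function names `average_ranks` and `spearman_rho` are hypothetical names chosen for this sketch, and tied raw values are assumed to share the average of the ranks they occupy.

```python
from math import sqrt

def average_ranks(values):
    """Rank values from 1 to n; tied values get the mean of the ranks they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks

def spearman_rho(x_raw, y_raw):
    """Eq. 1 applied to the ranks of the raw samples."""
    x = average_ranks(x_raw)
    y = average_ranks(y_raw)
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    numerator = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    denominator = sqrt(sum((a - x_bar) ** 2 for a in x) * sum((b - y_bar) ** 2 for b in y))
    return numerator / denominator

print(spearman_rho([10, 20, 30, 40], [1, 4, 9, 16]))  # both increase together
print(spearman_rho([10, 20, 30, 40], [16, 9, 4, 1]))  # ordered in opposite directions
```

The first call illustrates a monotonic relationship and the second an anti-monotonic one, so they should sit at the two ends of the range of $\rho$. The sketch does not guard against a sample whose values are all equal, where the denominator would be zero.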

If there are no duplicate values of the raw samples ($x[i]$ and $y[i]$), i.e. no tied ranks, the above formula (Eq. 1) simplifies to

$\Large \rho=1-\frac{6\sum_{i=1}^{n}d_i^2}{n(n^2-1)}$ (Eq. 2)

where $d_i=x_i-y_i$. A derivation of this follows.

When there are no ties

$\sum_{i=1}^{n}(x_i-\bar{x})^2=\sum_{i=1}^{n}(y_i-\bar{y})^2$ (Eq. 3)

Also, $\bar{x}=\bar{y}=\frac{n+1}{2}$ regardless of whether there are ties or not, because the ranks always sum to $\frac{n(n+1)}{2}$.


Expanding the left hand side of (Eq. 3), we have

$\large \sum_{i=1}^{n}(x_i-\bar{x})^2=\sum_{i=1}^{n}(x_i^2-2x_i\bar{x}+\bar{x}^2)$

$\large =\frac{n(n+1)(2n+1)}{6}-\frac{n(n+1)^2}{2}+\frac{n(n+1)^2}{4}$

$\large =\frac{n(n^2-1)}{12}$ (Eq. 4)

where we have used $\bar{x}=\frac{n+1}{2}$ and the identities

$\large \sum_{i=1}^{n}x_i^2=1^2+2^2+...+n^2=\frac{n(n+1)(2n+1)}{6}$

and 

$\large \sum_{i=1}^{n}x_i=1+2+...+n=\frac{n(n+1)}{2}$

Thus, since (Eq. 3) makes the two sums under the square root equal, the denominator of (Eq. 1) is $\large \frac{n(n^2-1)}{12}$.
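The closed form in (Eq. 4) can be checked numerically. The following is a minimal sketch, assuming untied ranks $1, 2, ..., n$ and using exact fractions to avoid rounding; the helper name `sum_sq_dev_of_ranks` is made up for this example.

```python
from fractions import Fraction

def sum_sq_dev_of_ranks(n):
    """Sum of squared deviations of the ranks 1..n from their mean (n+1)/2."""
    mean = Fraction(n + 1, 2)
    return sum((rank - mean) ** 2 for rank in range(1, n + 1))

for n in (2, 5, 10, 100):
    closed_form = Fraction(n * (n * n - 1), 12)  # right hand side of Eq. 4
    assert sum_sq_dev_of_ranks(n) == closed_form
    print(n, sum_sq_dev_of_ranks(n), closed_form)
```

This is a spot check for a handful of values of $n$, not a proof.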

Examining the numerator of (Eq. 1), we can add and subtract $\sum_{i=1}^{n}(x_i-\bar{x})^2$ to write it as

$\sum_{i=1}^{n}(x_i-\bar{x})^2 - \sum_{i=1}^{n}(x_i-\bar{x})^2 + \sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})$


Using (Eq. 3), the subtracted term can be split equally between the $x$ and $y$ sums, so the numerator becomes

$\sum_{i=1}^{n}(x_i-\bar{x})^2 - 0.5\sum_{i=1}^{n}(x_i-\bar{x})^2 - 0.5\sum_{i=1}^{n}(y_i-\bar{y})^2 + \sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})$

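The last three terms combine into $-0.5\sum_{i=1}^{n}\left((x_i-\bar{x})-(y_i-\bar{y})\right)^2$. Dividing the expanded numerator by the denominator, which by (Eq. 3) equals $\sum_{i=1}^{n}(x_i-\bar{x})^2$, (Eq. 1) becomes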

$\Large \rho=1-\frac{\sum_{i=1}^{n}(x_i-\bar{x}-y_i+\bar{y})^2}{2\sum_{i=1}^{n}(x_i-\bar{x})^2}$

Substituting (Eq. 4), and using $\bar{x}=\bar{y}$ to cancel the means in the numerator, we have

$\Large \rho=1-\frac{6\sum_{i=1}^{n}(x_i-y_i)^2}{n(n^2-1)}$

If we denote $d_i=x_i-y_i$, we end up with

$\Large \rho=1-\frac{6\sum_{i=1}^{n}d_i^2}{n(n^2-1)}$

which is the simplified expression for $\rho$ when no ties are present.
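To see the two formulas side by side, here is a small sketch that evaluates (Eq. 1) and (Eq. 2) directly on ranks. The function names `rho_eq1` and `rho_eq2` and the rank lists are illustrative choices, not part of any library.

```python
from math import sqrt

def rho_eq1(x, y):
    """Eq. 1: correlation of the ranks about their means."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    num = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    den = sqrt(sum((a - x_bar) ** 2 for a in x) * sum((b - y_bar) ** 2 for b in y))
    return num / den

def rho_eq2(x, y):
    """Eq. 2: shortcut based on squared rank differences (valid without ties)."""
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(x, y))
    return 1 - 6 * d_squared / (n * (n * n - 1))

# Ranks without ties: both formulas should agree.
untied_x, untied_y = [1, 2, 3, 4, 5], [2, 1, 4, 3, 5]
print(rho_eq1(untied_x, untied_y), rho_eq2(untied_x, untied_y))

# Ranks with a tie (2.5 appears twice): the shortcut is no longer exact.
tied_x, tied_y = [1, 4, 2.5, 2.5], [1, 4, 2, 3]
print(rho_eq1(tied_x, tied_y), rho_eq2(tied_x, tied_y))
```

For the untied ranks both formulas give about $0.8$. For the tied ranks (Eq. 1) gives about $0.9487$ while (Eq. 2) gives $0.95$, which is why (Eq. 2) should only be used when no ties are present.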

Because Spearman's Rank Correlation depends only on the ranks of the samples and not on their magnitudes, it is more robust to outliers in the datasets than a measure computed from the raw values.

Spearman's Rank Correlation - A worked example

Consider the raw sample pairs ($x[i]$,$y[i]$) as below

  1 , 1
  5 , 4
  3 , 2
  3 , 3

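Their ranks ($x_i$,$y_i$), with tied raw values sharing the average of the ranks they occupy, and the deviations from the mean rank ($\bar{x}=\bar{y}=2.5$) are:

($x[i]$,$y[i]$)    ($x_i$,$y_i$)    ($x_i$-$\bar{x}$)    ($y_i$-$\bar{y}$)
  1 , 1            1 , 1            -1.5                 -1.5
  5 , 4            4 , 4             1.5                  1.5
  3 , 2          2.5 , 2             0                   -0.5
  3 , 3          2.5 , 3             0                    0.5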

The numerator of $\rho$ is $(-1.5) \times (-1.5) + 1.5 \times 1.5 + 0 \times (-0.5) + 0 \times 0.5$, which is $4.5$. The denominator of $\rho$ is $\sqrt{(2.25 + 2.25 + 0 + 0) \times (2.25 + 2.25 + 0.25 + 0.25)}$, which is $4.743416$. Thus $\rho$ is $4.5/4.743416$, which is $0.948683$.
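The arithmetic above can be reproduced with a few lines of Python. This sketch takes the ranks from the worked example as given rather than computing them from the raw samples; the variable names are chosen for illustration.

```python
from math import sqrt

# Ranks from the worked example (the two raw x values of 3 share rank 2.5).
x_ranks = [1, 4, 2.5, 2.5]
y_ranks = [1, 4, 2, 3]

n = len(x_ranks)
x_bar = sum(x_ranks) / n  # mean rank of x
y_bar = sum(y_ranks) / n  # mean rank of y

numerator = sum((a - x_bar) * (b - y_bar) for a, b in zip(x_ranks, y_ranks))
sum_sq_x = sum((a - x_bar) ** 2 for a in x_ranks)
sum_sq_y = sum((b - y_bar) ** 2 for b in y_ranks)
denominator = sqrt(sum_sq_x * sum_sq_y)
rho = numerator / denominator

print(numerator)              # 4.5
print(round(denominator, 6))  # 4.743416
print(round(rho, 6))          # 0.948683
```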

There is a Spearman Rank Correlation Calculator in this blog, which can be found here.

Pearson Correlation

The Pearson Correlation is the parametric equivalent of the Spearman Rank Correlation, and is given by the equation

$\Large r=\frac{\sum_{i=1}^{n}{(x[i]-\bar{x})(y[i]-\bar{y})}}{\sqrt{\sum_{i=1}^{n}(x[i]-\bar{x})^2\sum_{i=1}^{n}(y[i]-\bar{y})^2}}$

where $x[i]$ and $y[i]$ are the raw samples from the two populations that are to be correlated, and $\bar{x}$ and $\bar{y}$ are their respective means.

Like the Spearman Rank Correlation, the Pearson Correlation has a minimum of $-1$ and a maximum of $+1$. Unlike the Spearman Rank Correlation, it is sensitive to the actual values of the samples and not just their ordering. It will be $+1$ only if the two populations are exactly linearly related with positive gradient, and $-1$ only if the two populations are exactly linearly related with negative gradient.
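The difference between the two measures shows up on data that is monotonic but not linear. The sketch below uses made-up data $y = x^3$; the helper names `pearson_r` and `ranks_no_ties` are hypothetical, and the ranking helper assumes there are no tied values.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson Correlation of two equal-length samples."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    num = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    den = sqrt(sum((a - x_bar) ** 2 for a in x) * sum((b - y_bar) ** 2 for b in y))
    return num / den

def ranks_no_ties(values):
    """Ranks 1..n; assumes all values are distinct (no tie handling)."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]  # monotonic, but not linear

r = pearson_r(x, y)                                  # computed on raw values
rho = pearson_r(ranks_no_ties(x), ranks_no_ties(y))  # computed on ranks

print(round(r, 4), rho)
```

Here the Pearson Correlation comes out at roughly $0.943$, below $+1$ because the relationship is not linear, while the rank-based value is $+1$ because the ordering of the two samples agrees exactly.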

It is worth noting that the square of the Pearson Correlation ($r^2$) is the Coefficient of Determination, which was encountered in the Linear Regression post in this blog. There is a Pearson Correlation Calculator in this blog, which can be found here.