Sunday, 5 January 2014

Chi-squared Test for Independence

In this blog post I will discuss Pearson's Chi-squared test for independence, using an example.

Pearson's Chi-squared test for independence is applied to outcomes that are arranged in a tabular form and tests for independence between two factors. The rows correspond to the levels for one factor and the columns to the levels for the other factor. The Null Hypothesis is that the two factors are independent.

For example, we could have two rows corresponding to gender (Female and Male), and three columns corresponding to what products were purchased from a supermarket (Product A, Product B and Product C). Each of the six entries is the number of people of that gender who bought that product. The Null Hypothesis is that the proportion of people who bought each of the three products is independent of their gender. If the Null Hypothesis holds, the proportion of females who bought Product A is equal to the proportion of all customers (males and females together) who bought Product A, the proportion of males who bought Product A is equal to that same overall proportion, and so on for the other products. To elaborate on this further, we can represent the outcomes in the table below.


Gender   Product A   Product B   Product C
Female   $a$         $b$         $c$
Male     $d$         $e$         $f$

For the Null Hypothesis to hold, the proportion of females who purchased Product A must be equal to the proportion of all customers, male and female, who purchased Product A:

$\Large \frac{a}{a+b+c}=\frac{a+d}{a+b+c+d+e+f}$

Rearranging the above equation, we obtain the expected value of $a$ as

$\Large \hat{a}=\frac{(a+d)(a+b+c)}{a+b+c+d+e+f}$

Similarly, the proportion of males who purchased Product A must be equal to the proportion of all customers, male and female, who purchased Product A:

$\Large \frac{d}{d+e+f}=\frac{a+d}{a+b+c+d+e+f}$

Rearranging the above equation in a similar fashion, we obtain the expected value of $d$ as

$\Large \hat{d}=\frac{(a+d)(d+e+f)}{a+b+c+d+e+f}$

Going through all the entries in the same way, the expected value of each entry turns out to be the product of the entry's row sum and the entry's column sum, divided by the total population.
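The rule above can be sketched in a few lines of Python. This is only an illustration: the function name `expected_counts` and the counts in `observed` are invented for this example and are not data from a real survey.

```python
def expected_counts(observed):
    """Expected counts under independence: row sum * column sum / grand total."""
    row_sums = [sum(row) for row in observed]
    col_sums = [sum(col) for col in zip(*observed)]
    total = sum(row_sums)
    return [[r * c / total for c in col_sums] for r in row_sums]

# Hypothetical counts, invented for illustration.
# Rows: Female, Male. Columns: Product A, Product B, Product C.
observed = [[20, 30, 50],
            [30, 30, 40]]

expected = expected_counts(observed)
for label, row in zip(["Female", "Male"], expected):
    print(label, row)
```

With these made-up counts both row sums are 100 and the column sums are 50, 60 and 90, so each row of expected values is 25, 30 and 45.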

Once the expected values of all the entries have been computed, the chi-squared test statistic is calculated as follows: for each entry, take the squared difference between the actual outcome and the expected value, divide it by the expected value, and then sum the results over all entries. Mathematically, this can be represented as


$\Large \chi^2=\sum_{m=1}^{r}\sum_{n=1}^{c}\frac{(O(m,n)-E(m,n))^2}{E(m,n)}$

where $r$ is the number of rows, $c$ is the number of columns, $O(m,n)$ is the observed outcome in row $m$, column $n$ and $E(m,n)$ is the corresponding expected value. The number of degrees of freedom for this statistic is $(r-1)(c-1)$, which is 2 for the table with two rows and three columns above. Based on the chi-squared statistic and the degrees of freedom we can calculate the p-value. If this value is less than a significance level of our choice (a commonly used value is 0.05), we reject the Null Hypothesis that the proportion of people who bought each of the three products is independent of their gender.
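As a rough sketch of the whole procedure, the Python below computes the statistic, the degrees of freedom and the p-value using only the standard library. The counts are again invented for illustration, and the helper names `chi2_statistic` and `chi2_sf` are my own. The p-value helper assumes a whole-number degrees of freedom and builds the upper-tail probability of the chi-squared distribution from the standard recurrence for the regularised incomplete gamma function.

```python
import math

def chi2_statistic(observed):
    """Return (chi-squared statistic, degrees of freedom) for a table of counts."""
    row_sums = [sum(row) for row in observed]
    col_sums = [sum(col) for col in zip(*observed)]
    total = sum(row_sums)
    chi2 = 0.0
    for m, row in enumerate(observed):
        for n, o in enumerate(row):
            e = row_sums[m] * col_sums[n] / total  # expected value E(m, n)
            chi2 += (o - e) ** 2 / e
    df = (len(row_sums) - 1) * (len(col_sums) - 1)
    return chi2, df

def chi2_sf(x, df):
    """Upper-tail probability P(X >= x) for chi-squared with integer df >= 1."""
    y = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)              # exact tail for df = 2
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))   # exact tail for df = 1
    while a < df / 2.0:
        # Q(a + 1, y) = Q(a, y) + y^a * exp(-y) / Gamma(a + 1)
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return q

# Hypothetical counts, invented for illustration.
# Rows: Female, Male. Columns: Product A, Product B, Product C.
observed = [[20, 30, 50],
            [30, 30, 40]]

chi2, df = chi2_statistic(observed)
p_value = chi2_sf(chi2, df)
alpha = 0.05  # significance level
reject = p_value < alpha
print(chi2, df, p_value, reject)
```

For these made-up counts the statistic is about 3.11 with 2 degrees of freedom, giving a p-value of roughly 0.21, so the Null Hypothesis would not be rejected at the 0.05 level. In practice a library routine such as `scipy.stats.chi2_contingency` does the same job. Note that it applies a continuity correction by default for tables with two rows and two columns.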