The Mann-Whitney U (MWU) test is a non-parametric statistical test that is widely used in many fields (e.g. medical statistics and psychology). It tests the null hypothesis that two groups (a.k.a. datasets or populations) are the same against the alternative hypothesis that they are not.

The following assumptions need to be made:

- The independent variable is the two groups.
- The dependent variable is the set of observations, which are either continuous or ordinal (i.e. they can be sorted by some criterion, such as a level).
- All observations from both groups are independent of one another. For example, if patients' blood pressure is being measured, each observation should come from a distinct patient.
- The data need not be normally distributed.

The MWU test is more robust than the Unpaired Student's t-test for data that is not normally distributed. We can think of the Unpaired Student's t-test as the parametric counterpart to the MWU test.

The basic idea is that the data from both groups are pooled and ranked (1, 2, ...) in ascending order of magnitude, regardless of which group each value belongs to. If the groups are very different, say Group 2 has samples much larger than those of Group 1, then the sum of the ranks for Group 2 will be much larger than that for Group 1. A U statistic is calculated, which can be intuitively thought of as the difference between the largest rank sum that the group with the larger rank sum could possibly attain and the rank sum it actually has. The *smaller* this difference, the greater the disparity in rank sums between the two groups, and the less likely it is that the difference in ranks is due to chance.

Consider Group 1 with $N1$ samples, and Group 2 with $N2$ samples.

Let $T1$ be the sum of ranks for Group 1, and $T2$ be the sum of ranks for Group 2.

We denote $TX$ as the larger of $T1$ and $T2$, and $NX$ as the number of samples of the group with the larger sum of ranks.

The U-statistic is given by:-

$\large U=(N1 \times N2) + \frac{NX(NX+1)}{2} - TX$
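
The formula above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `u_statistic` and the sample inputs are hypothetical, and the choice of Group 1 when the two rank sums are equal is an assumption, as the text does not cover that case.

```python
def u_statistic(n1, n2, t1, t2):
    """U from the group sizes and rank sums, following the formula above."""
    # Pick the group with the larger rank sum (TX) and its size (NX).
    # Assumption: if the rank sums are equal, Group 1 is used.
    if t1 >= t2:
        nx, tx = n1, t1
    else:
        nx, tx = n2, t2
    return n1 * n2 + nx * (nx + 1) / 2 - tx


# Hypothetical inputs: groups of size 3 and 4 with rank sums 7 and 21.
print(u_statistic(3, 4, 7, 21))  # 1.0
```

The function takes the rank sums as inputs, so the ranking step (including the handling of ties) has to be done beforehand.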

The value of $U$ is compared against a critical value $Ucrit$, and if $U$ is *less* than the critical value, the result is considered significant (note that many tests have a statistic that has to exceed a critical value for the result to be considered significant - the MWU test is almost the odd one out!).

The value of $Ucrit$ is found from a U-test table, and is determined by the values of $N1$ and $N2$, the significance level (say 5%), and whether we want a two-tailed or a one-tailed test.

If the values of $N1$ and $N2$ are high, we can use the z-test, as the U statistic can be reasonably well approximated by a Normal/Gaussian distribution with mean $\frac{N1N2}{2}$ and variance $\frac{N1N2(N1+N2+1)}{12}$.

If there are tied ranks, the variance needs to be slightly modified to

$\Large \frac{N1N2[(N1+N2+1)-\frac{\sum_{i=1}^{L}(t_i^3-t_i)}{(N1+N2)(N1+N2-1)}]}{12}$,

where $L$ is the number of groups of ties, and $t_i$ is the number of tied observations in the $i$th group.
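
To make the z-test concrete, the standardisation can be written out as below. This is a sketch assuming the usual convention of subtracting the mean and dividing by the standard deviation given above; when there are ties, the tie-corrected variance would replace the expression under the square root.

$\large z = \frac{U - \frac{N1N2}{2}}{\sqrt{\frac{N1N2(N1+N2+1)}{12}}}$

The resulting $z$ is then compared against the standard Normal distribution.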

__Worked Example__

Suppose we have two groups, with the following samples:

Group 1: 1,4,6,7,8,3,2,1

Group 2: 3,3,3,8,10,16,18,70,30

So $N1=8$, $N2=9$.

We first rank all the data as follows (ignoring ties for the moment - the tied values are 1, 3 and 8):-

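| Rank | Data | Group |
|------|------|-------|
| 1 | 1 | 1 |
| 2 | 1 | 1 |
| 3 | 2 | 1 |
| 4 | 3 | 2 |
| 5 | 3 | 2 |
| 6 | 3 | 2 |
| 7 | 3 | 1 |
| 8 | 4 | 1 |
| 9 | 6 | 1 |
| 10 | 7 | 1 |
| 11 | 8 | 1 |
| 12 | 8 | 2 |
| 13 | 10 | 2 |
| 14 | 16 | 2 |
| 15 | 18 | 2 |
| 16 | 30 | 2 |
| 17 | 70 | 2 |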

We deal with the ties by calculating the average of the ranks for the ties:-

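| Rank | Data | Group |
|------|------|-------|
| 1.5 | 1 | 1 |
| 1.5 | 1 | 1 |
| 3 | 2 | 1 |
| 5.5 | 3 | 2 |
| 5.5 | 3 | 2 |
| 5.5 | 3 | 2 |
| 5.5 | 3 | 1 |
| 8 | 4 | 1 |
| 9 | 6 | 1 |
| 10 | 7 | 1 |
| 11.5 | 8 | 1 |
| 11.5 | 8 | 2 |
| 13 | 10 | 2 |
| 14 | 16 | 2 |
| 15 | 18 | 2 |
| 16 | 30 | 2 |
| 17 | 70 | 2 |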
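
The tie-averaged ranking above can be reproduced with a short Python sketch. The helper name `average_ranks` is hypothetical and the approach (sort, then give each run of equal values the mean of the ranks it occupies) is just one straightforward way to do it.

```python
def average_ranks(values):
    """Return 1-based ranks for values, giving tied values their mean rank."""
    # Indices of the values in ascending order of magnitude.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        # The run occupies ranks start+1 .. end+1; use their average.
        mean_rank = (start + 1 + end + 1) / 2
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


group1 = [1, 4, 6, 7, 8, 3, 2, 1]
group2 = [3, 3, 3, 8, 10, 16, 18, 70, 30]

# Rank the pooled data, then split the ranks back into the two groups.
ranks = average_ranks(group1 + group2)
ranks_group1 = ranks[:len(group1)]
ranks_group2 = ranks[len(group1):]
print(ranks_group1)  # [1.5, 8.0, 9.0, 10.0, 11.5, 5.5, 3.0, 1.5]
print(ranks_group2)  # [5.5, 5.5, 5.5, 11.5, 13.0, 14.0, 15.0, 17.0, 16.0]
```

The ranks are printed in the original order of each group's samples, not in sorted order.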

The sum of ranks for Group 1 is:-

1.5 + 1.5 + 3 + 5.5 + 8 + 9 + 10 + 11.5 = 50

We can easily calculate the sum of ranks for Group 2. As we have 17 samples ($N1+N2$), the total of the ranks for the entire data is $\frac{(N1 + N2)(N1+N2+1)}{2}$, which is $17\times 18 \times 0.5$, i.e. 153.

So the Group 2 sum of ranks is $153-50=103$, which is greater than that of Group 1.

Thus Group 2 is the group with the larger sum of ranks, so $NX=N2=9$ and $TX=T2=103$.

We are now in a position to calculate U as follows:-

$U=(9 \times 8) + \frac{9(9+1)}{2} - 103 = 14$
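
As a sanity check, the same value can be obtained without ranks at all. For this data, U equals the number of (Group 1, Group 2) pairs in which the Group 1 value is the larger one, with each tied pair counted as a half. The snippet below is a minimal sketch of that pairwise count, not the method used by any particular calculator.

```python
group1 = [1, 4, 6, 7, 8, 3, 2, 1]
group2 = [3, 3, 3, 8, 10, 16, 18, 70, 30]

# Go through all N1 * N2 pairs: score 1 when the Group 1 value is larger,
# 0.5 when the two values are tied, and 0 otherwise.
u = sum(
    1.0 if x > y else 0.5 if x == y else 0.0
    for x in group1
    for y in group2
)
print(u)  # 14.0
```

Only the values 4, 6, 7 and 8 from Group 1 exceed anything in Group 2 (the three 3s), and the remaining contribution comes from the tied 3s and 8s.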

Suppose we are interested in an alpha of 5% for a two-tailed test. Looking up a table (say the one at http://www.lesn.appstate.edu/olson/stat_directory/Statistical%20procedures/Mann_Whitney%20U%20Test/Mann-Whitney%20Table.pdf) for $N1=8$ and $N2=9$, we find that $Ucrit=15$.

As $U<Ucrit$ we reject the null hypothesis that the groups are the same - the result is significant, but only just!

Next we come to the z-score calculation - which will be inaccurate as we only have 8 and 9 samples for Group 1 and 2 respectively.

We need to transform $U$ to a z-score, which under the null hypothesis approximately follows a standard normal distribution with zero mean and unit variance.

SciStatCalc uses the following technique, which does not apply the tie correction and so gives a slightly more conservative result (the results from GNU Octave were used as a baseline). We take the sum of ranks for Group 1 ($50$), and subtract $N1\times (N1+N2+1)\times 0.5 = 8\times 18\times 0.5=72$, resulting in $-22$. We need to normalise this by the standard deviation, which is $\sqrt{\frac{N1\times N2\times (N1+N2+1)}{12}}=\sqrt{\frac{8\times 9\times 18}{12}}=10.3923$.

So the z-score is $-22/10.3923=-2.11695$. The two-tailed p-value for this z-score is $0.034264$, which is less than our significance level of $0.05$, confirming that our result is indeed marginally significant.

If the tie correction to the variance were applied, the standard deviation would be $\sqrt{\frac{8\times 9\times (18 - 72/(16 \times 17))}{12}}=10.3156$. There are three groups of ties, containing 2, 4 and 2 tied values, giving $(2^3 - 2) + (4^3 - 4) + (2^3 - 2) = 72$, so the correction term is $72/((8+9)\times(8+9-1))$. The z-score would then be $-22/10.3156=-2.1327$, corresponding to a p-value of $0.03295$, which agrees with the results obtained using R. The p-value is still below our 5% level, so the result remains significant.
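
The two z-score calculations can be reproduced with the sketch below. The variable names are hypothetical, and the code simply follows the arithmetic in the text (rank sum of Group 1 minus its expected value, divided by the standard deviation with and without the tie correction). It is not SciStatCalc's actual code.

```python
import math

n1, n2 = 8, 9
t1 = 50.0               # sum of ranks for Group 1
tie_sizes = [2, 4, 2]   # the tied values 1, 3 and 8

n = n1 + n2
# Rank sum of Group 1 minus its expected value under the null hypothesis.
numerator = t1 - n1 * (n + 1) / 2

# Standard deviation without the tie correction.
sd_plain = math.sqrt(n1 * n2 * (n + 1) / 12)

# Standard deviation with the tie correction applied to the variance.
tie_term = sum(t**3 - t for t in tie_sizes) / (n * (n - 1))
sd_ties = math.sqrt(n1 * n2 * ((n + 1) - tie_term) / 12)

z_plain = numerator / sd_plain
z_ties = numerator / sd_ties
print(round(sd_plain, 4), round(z_plain, 4))  # 10.3923 -2.117
print(round(sd_ties, 4), round(z_ties, 4))    # 10.3156 -2.1327
```

The tie correction shrinks the standard deviation slightly, which is why the corrected z-score is a little further from zero.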

Below is a screenshot of the Mann-Whitney U-test on SciStatCalc:-

There is a Mann Whitney U-test Calculator in this blog, which can be found here.

__Calculating the p-value for a z-score__

To calculate the p-value for a z-score, you need to take the absolute value of the z-score, and calculate the area under a standard Gaussian/Normal distribution (mean 0, variance 1) from $abs(zscore)$ to infinity - i.e. calculate the area under the tail. Given the symmetric nature of the standard Normal distribution, you would end up with the same result from -infinity to $-abs(zscore)$. For a two-tailed result you need to multiply this result by 2.
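
The tail area can be computed with Python's standard library, as in the sketch below. The function name `p_value_from_z` and its `two_tailed` flag are hypothetical. The sketch relies on the identity that the upper-tail area of the standard Normal distribution beyond $x$ is $0.5 \times erfc(x/\sqrt{2})$.

```python
import math


def p_value_from_z(z, two_tailed=True):
    """p-value for a z-score under the standard Normal distribution."""
    # Area under the upper tail from abs(z) to infinity.
    tail = 0.5 * math.erfc(abs(z) / math.sqrt(2))
    # Double the tail area for a two-tailed result.
    return 2 * tail if two_tailed else tail


# z-scores from the worked example, without and with the tie correction.
print(round(p_value_from_z(-2.11695), 4))
print(round(p_value_from_z(-2.1327), 4))
```

Passing `two_tailed=False` returns the single tail area, i.e. half of the two-tailed value.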

A final note - the MWU test is also known as the Wilcoxon rank sum test.