Tuesday, 8 October 2013

Shapiro-Wilk Test: Testing for Normality

In this post I will describe an implementation of the Shapiro-Wilk test, which is a powerful test for whether a dataset has a Normal distribution. For many statistical tests, especially the parametric tests, it is necessary to assume that the datasets are distributed normally. Given a particular dataset, the Shapiro-Wilk test involves the calculation of a test statistic, denoted by $W$. In order not to reject the Null Hypothesis that the samples are indeed distributed Normally, the value of $W$ must exceed a critical value. 

Conceptually, the Shapiro-Wilk test examines the closeness between the data samples after they have been ordered and standardised (i.e. transformed to a zero mean, unity variance dataset) and what the samples would have been were they drawn from a standard Normal distribution and ordered. The ordered samples drawn from the standard Normal distribution would be monotonically increasing quantiles (inverse Cumulative Density Function (CDF)) of the standard Normal distribution. This metric of closeness (which is the Shapiro-Wilk test statistic $W$) is basically a square of the correlation between the ordered and standardised dataset and the inverse CDF values. The test statistic is thus a positive value with an upper bound of unity - the higher this value, the more Normal/Gaussian the data.

The objective of this post is to detail an algorithm where everything is self-contained.

I would like to  gratefully acknowledge the work of A. Trujillo-Ortiz et al, who had kindly converted the Fortran Algorithm AS R94 (Royston, 1995) to a Matlab m-file, and then put the m-file in Matlab File Exchange, thus opening up visibility of the algorithm to a wider audience. Their file is reproduced below in blue (without any modification), and can be run on Octave (a free to download Matlab-clone), should you lack access to Matlab, but are nevertheless familiar with the  tool.

function [W] = ShaWilstat(x)
%SHAWILTEST Shapiro-Wilk' W statistic for assessing a sample normality.
% This m-file is generating from the Fortran Algorithm AS R94 (Royston,
% 1995) [URL address http://lib.stat.cmu.edu/apstat/181]. Here we take only
% the procedure to generate the Shapiro-Wilk's W statistic, needed to feed
% the Royston's test for multivariate normality. Here, we present both the
% options for the sample kurtosis type: 1) Shapiro-Francia for leptokurtic,
% and 2) Shapiro-Wilk for the platykurtic ones. Do not assume that the
% result of the Shapiro-Wilk test is clear evidence of normality or non-
% normality, it is just one piece of evidence that can be helpful.
% Input:
%      x - data vector (3 < n < 5000)
% Output:
%      W - Shapiro-Wilk's W statistic
% Example: From the example given by Scholtz and Stephens (1987, p.922). We
% only take the data set from laboratory 1 of eight measurements of the
% smoothness of a certain type of paper:38.7,41.5,43.8,44.5,45.5,46.0,47.7,
% 58.0
% Data vector is:
%  x=[38.7,41.5,43.8,44.5,45.5,46.0,47.7,58.0];
% Calling on Matlab the function: 
%          W = ShaWilstat(x)
% Answer is:
% W = 0.8476
% Created by A. Trujillo-Ortiz, R. Hernandez-Walls, K. Barba-Rojo, 
%             and L. Cupul-Magana
%             Facultad de Ciencias Marinas
%             Universidad Autonoma de Baja California
%             Apdo. Postal 453
%             Ensenada, Baja California
%             Mexico.
%             atrujo@uabc.mx
% Copyright. November 18, 2007.
% Reference:
% Scholz, F.W. and Stephens, M.A. (1987), K-Sample Anderson-Darling Tests.
%     Journal of the American Statistical Association, 82:918-924.

x  =  x(:); %put data in a column vector
n = length(x); %sample size

if n < 3,
   error('Sample vector must have at least 3 valid observations.');

if n > 5000,
    warning('Shapiro-Wilk statistic might be inaccurate due to large sample size ( > 5000).');

x = sort(x); %sorting of data vector in ascending order
m = norminv(((1:n)' - 3/8) / (n + 0.25));
w = zeros(n,1); %preallocating weights

if kurtosis(x) > 3, %Shapiro-Francia test is better for leptokurtic samples
    w = 1/sqrt(m'*m) * m;
    W = (w' * x) ^2 / ((x - mean(x))' * (x - mean(x)));
else %Shapiro-Wilk test is better for platykurtic samples
    c = 1/sqrt(m' * m) * m;
    u = 1/sqrt(n);
    p1 = [-2.706056,4.434685,-2.071190,-0.147981,0.221157,c(n)];
    p2 = [-3.582633,5.682633,-1.752461,-0.293762,0.042981,c(n-1)];

    w(n) = polyval(p1,u);
    w(1) = -w(n);

    if n == 3,
        w(1) = 0.707106781;
        w(n) = -w(1);

    if n >= 6,
        w(n-1) = polyval(p2,u);
        w(2) = -w(n-1);
        ct =  3;
        phi = (m'*m - 2 * m(n)^2 - 2 * m(n-1)^2) / ...
                (1 - 2 * w(n)^2 - 2 * w(n-1)^2);
        ct = 2;
        phi = (m'*m - 2 * m(n)^2) / (1 - 2 * w(n)^2);

    w(ct:n-ct+1) = m(ct:n-ct+1) / sqrt(phi);

    W = (w' * x) ^2 / ((x - mean(x))' * (x - mean(x)));


Please note that (in the above m-file) the comment that W=0.8476 for dataset in vector x=[38.7,41.5,43.8,44.5,45.5,46.0,47.7,58.0] is incorrect - I obtain W=0.87293 (I have verified this on Octave running the above file, as well as running this test on my app SciStatCalc). Running the data through an online Shapiro-Wilk test calculator in 
results in 0.87311, which is closer to my result than 0.8476.  Running the dataset in
results in 0.872973, which is much closer to my result. Also, for the test to work for 3 samples phi should be set to 1. The bottom part of the Matlab function could be amended as follows (additional code in red)
    if n == 3,
        phi = 1;

    w(ct:n-ct+1) = m(ct:n-ct+1) / sqrt(phi);

    W = (w' * x) ^2 / ((x - mean(x))' * (x - mean(x)));


Let us describe the steps implemented in the above m-file in more general algorithmic terms, for not only the benefit of those unfamiliar with Matlab or Octave, but also for anyone who wishes to implement the algorithm in a programming language of their choice.

Assume we have a set of $N$ samples, $x(n)$, for $n=1,2,..,N$. First we sort these samples in ascending order of magnitude, denoting the sorted result by $x_s(n)$, for $n=1,2,..,N$.

We then calculate the mean $\mu$ of the dataset:
$\large \mu=\frac{1}{N}\sum_{n=1}^Nx_s(n)$

Next the standard deviation $\sigma$ is calculated:
$\large \sigma=\sqrt{\frac{1}{N-1}\sum_{n=1}^N(x_s(n)-\mu)^2}$

Having calculated the mean and standard deviation, the kurtosis $\gamma_2$ is calculated:
$\large \gamma_2=\frac{1}{N}\frac{\sum_{n=1}^N(x_s(n)-\mu)^4}{\sigma^4}-3$

Next we create $N$ samples of parameter $m(n)$, for $n=1,2,..,N$,

$\large m(n)=inverse\_gaussian\_cdf(\frac{n-(3/8)}{N+0.25},0,1)$

i.e. $m(n)$ is the output of the quantile function using the zero mean, unity variance (standard) Gaussian distribution, for probability $\frac{n-(3/8)}{N+0.25}$.

Denote $sum\_msq$ as the sum of squares of $m(n)$ as follows:
$\large sum\_msq=\sum_{n=1}^Nm(n)^2$

We normalise $m(n)$ by the square root of $sum\_msq$, resulting in the normalised samples
$\Large w(n)=\frac{m(n)}{\sqrt{sum\_msq}}$

Now if the calculated kurtosis $\gamma_2$ is greater than 3 for a leptokurtic set of samples (should this be if $\gamma_2$ is greater than 0?), we implement the Shapiro-Francia test as below to obtain our final result, the statistic $W$:
$\Large W=\frac{(\sum_{n=1}^Nw(n)x_s(n))^2}{\sigma^2\times (N-1)}$

When the kurtosis $\gamma_2$ is less than or equal to 3, a more detailed processing follows to obtain our desired statistic $W$, which is described next.

If there are only 3 elements ($N=3$), $w(1)=\sqrt{0.5}$, and $w(3)=-\sqrt{0.5}$.

When there are between 4 and 5 samples inclusive (i.e. $N=4$ or $N=5$), the following applies. Let $\large u=\frac{1}{\sqrt{N}}$. We evaluate the fifth order polynomial:
$\large y=\sum_{k=0}^{5}p_1[k]u^{(5-k)}$

where the above polynomial coefficients are:

The value $y$ is then substituted into element $N$ of $w$, i.e. it replaces $w(N)$, while $-y$ is substituted into the first element of $w$, i.e. $-y$ replaces $w(1)$.

Next we calculate $\phi$
$\Large \phi=\frac{sum\_msq-(2\times m(N)^2)}{1-(2\times w(N)^2)}$

and replace elements/indices $n=2$ to $n=N-1$ inclusive of $w(n)$ by $\large \frac{m(n)}{\sqrt{\phi}}$. Then we finally calculate our desired $W$ statistic as
$\Large W=\frac{(\sum_{n=1}^Nw(n)x_s(n))^2}{\sigma^2\times (N-1)}$

When there are 6 or more samples $N\geq 6$, we need to implement the following procedure. First we evaluate the following fifth order polynomial with a different set of coefficients
$\large z=\sum_{k=0}^{5}p_2[k]u^{(5-k)}$


we then set $w(N-1)$ to polynomial result $z$ and set $w(2)$ to $-z$. The value of $\phi$ is calculated as follows
$\Large \phi=\frac{sum\_msq-(2\times m(N)^2)-(2\times m(N-1)^2)}{1-(2w\times (N)^2)-(2\times w(N-1)^2)}$

Indices $n=3$ to $n=N-2$ inclusive of samples $w(n)$ are replaced by $\large \frac{m(n)}{\sqrt{\phi}}$.  Finally the $W$ statistic is calculated as
$\Large W=\frac{(\sum_{n=1}^Nw(n)x_s(n))^2}{\sigma^2\times (N-1)}$

The higher the value of $W$, the more likely the dataset is Normal distributed - $W$ has an upper bound of 1. In practice, we use a table of critical values of $W$ (see this posting), which depends on the number of samples in the dataset under analysis, and if our calculated value of $W$ is less than the critical value, we reject the Null Hypothesis that the samples are indeed drawn from a Normal/Gaussian distribution.

It is possible to calculate the p-value from the calculated W statistic. There is a FORTRAN routine in http://lib.stat.cmu.edu/apstat/R94. Not being that familiar with FORTRAN, I have translated the subsection that converts the W statistic to the p-value to a GNU Octave/Matlab function, listed below in blue (for a simpler implementation, I have assumed that the number of points to censor is 0).

function[res] = shapiro_pval(W,N)

% W : Shapiro-Wilk statistic
% N : Number of samples

% Function result - will store p value
res = 0;

pw = 0;

pi6 = 6/pi;

G  = [-0.2273E1, 0.459E0];
c3 = [0.5440E0, -0.39978E0, 0.25054E-1, -0.6714E-3];
c4 = [0.13822E1, -0.77857E0, 0.62767E-1, -0.20322E-2];

c5 = [-0.15861E1, -0.31082E0, -0.83751E-1, 0.38915E-2];
c6 = [-0.4803E0, -0.82676E-1, 0.30302E-2];

% stqr = 0.1047198E1;

stqr = sqrt(0.75);

y = log(1 - W);
xx = log(N);
m = 0;
s = 1;

if (N == 3)

  pw = pi6 * (asin(sqrt(W)) - stqr);

  if (pw < 0)
    pw = 0;

elseif (N <= 11)
  gma = (G(2)*N) + (G(1)*1);
  if (y > gma)
    pw = 1e-19;
    y = -log(gma - y);
    m = c3(4)*N^3 + c3(3)*N^2 + c3(2)*N + c3(1);
    s = exp(c4(4)*N^3 + c4(3)*N^2 + c4(2)*N + c4(1));
    pw = 1 - normcdf((y - m)/s);

  m  = c5(4)*xx^3 + c5(3)*xx^2 + c5(2)*xx + c5(1);
  s  = exp(c6(3)*xx^2 + c6(2)*xx + c6(1));
  pw = 1 - normcdf((y - m)/s);

res = pw;

Please note that there are a variety of methods to determine whether the samples of a particular dataset are indeed Normal distributed. One can plot a histogram of the set and examine its shape, as well as carry out a Q-Q plot.

An online Shapiro Wilk Test Calculator can be found in this blog here.