Wednesday, 4 September 2013

Linear Regression

The linear regression model is a fairly simple model, nevertheless it is used in many applications. 

What follows is a fairly mathematical derivation of the linear regression parameters.

Let us denote the observations as $y[n]$ and $x[n]$ for $n=0,1,..,N-1$ for the model 

$y[n] = ax[n]+b$

We assume that we can measure both $x[n]$ and $y[n]$ - the unknowns are $a$ and $b$, the gradient and intercept respectively. We can use the least squares error criterion to find estimates for $a$ and $b$, from data $x[n]$ and $y[n]$.

Let  $e[n] = y[n] - ax[n]-b$ be the error. We can represent the samples $e[n]$, $x[n]$ and $y[n]$ as length $N$ column vectors $\underline{e}$,$\underline{x}$ and $\underline{y}$ respectively.

Thus the error sum of squares (which can be regarded as a cost function $C$ to minimise over) will be


where $\underline{1}$ is a column vector with all elements set to 1.

Expanding this cost function results in:-


Differentiating $C$ w.r.t. to $a$ and $b$:-

$\frac{\partial C}{\partial a}=2a(\underline{x}^T\underline{x})+2b(\underline{x}^T\underline{1})-2\underline{x}^T\underline{y}$

$\frac{\partial C}{\partial b}=2a(\underline{x}^T\underline{1})+2b(\underline{1}^T\underline{1})-2\underline{y}^T\underline{1}$

In order to estimate $\hat{a}$ and $\hat{b}$ (the least squares estimates of $a$ and $a$ respectively) we need to set the above two partial differential equations to 0, and solve for $a$ and $b$. We need to solve the following matrix equation

$\left[\begin{array}{cc}\underline{x}^T\underline{x}&\underline{x}^T\underline{1}\\ \underline{x}^T\underline{1}&\underline{1}^T\underline{1}\end{array}\right]$$\left[\begin{array}{c}\hat{a}\\ \hat{b}\end{array}\right]$ = $\left[\begin{array}{c}\underline{x}^T\underline{y} \\ \underline{y}^T\underline{1} \end{array}\right]$

Inverting the matrix and pre multiplying both sides of the above matrix equation by the inverse results in:

$\left[\begin{array}{c}\hat{a} \\ \hat{b} \end{array}\right]$ = $\frac{1}{N(\underline{x}^T\underline{x})-(\underline{x}^T\underline{1})^2}$$\left[\begin{array}{cc}\underline{1}^T\underline{1}&-\underline{x}^T\underline{1}\\ -\underline{x}^T\underline{1}&\underline{x}^T\underline{x}\end{array}\right]$$\left[\begin{array}{c}\underline{x}^T\underline{y} \\ \underline{y}^T\underline{1} \end{array}\right]$

If we revert to a more explicit notation in terms of the samples, we have



where the denominator of the fraction in the above two equations is the matrix determinant.

Coefficient of Determination

Once we have calculated $\hat{a}$ and $\hat{b}$, we can see how well the linear model accounts for the data. This is achieved by finding the Coefficient of Determination ($r^2$). This is the ratio of the residual sum of squares ($rss$) to the total sum of squares ($tss$),

$\Large r^2=\frac{rss}{tss}$

and has a maximum value of 1, which is attained when the model is exactly linear. The larger $r^2$ is, the better the linear model fits the data. We can think of $r^2$ as the proportion of variation in data explained by the linear model, or the "goodness of fit" of the linear model.

The total sum of squares is given by



$\bar{y}=\frac{1}{N}\sum_{n=0}^{N-1}y[n]$ is the mean of $y$.

The residual sum of squares ($rss$) is the total sum of squares ($tss$) minus the sum of squared errors ($sse$)



If the data perfectly fits the linear model, $sse$ will equal 0, so that $tss$ will equal $rss$ - the coefficient of determination $r^2$ will be $1$.

Summarising, we have

$\Large r^2=\frac{\sum_{n=0}^{N-1}(y[n]-\bar{y})^2-\sum_{n=0}^{N-1}(y[n]-\hat{a}x[n]-\hat{b})^2}{\sum_{n=0}^{N-1}(y[n]-\bar{y})^2}$

A worked example

Now, let us consider the following example:-

$x=[1,2,3,4]$, $y=[1.5,6,7,8]$

The matrix determinant will be $(4\times (1 + 2^2+3^2+4^2))-(1+2+3+4)^2=20$.

The least squares estimate of $a$ is

$\hat{a}=[4\times((1\times 1.5) + (2\times 6) + (3\times 7) + (4\times 8)) \\ -( (1.5+6+7+8)\times(1+2+3+4))]/20 \\ =2.05$

The least squares estimate of $b$ is

$\hat{b}=[(1+2^2+3^2+4^2)\times(1.5+6+7+8)-((1+2+3+4)\times((1\times 1.5) \\+ (2\times 6) + (3\times 7) +(4\times 8)))]/20 \\ = (675 - 665)/20 \\ =0.5$

The $tss$ is


where the average of $y$ is $5.625$.

The $sse$ is

$(1.5 - (2.05\times 1) - 0.5)^2+(6 - (2.05\times 2) - 0.5)^2\\+(7 - (2.05\times 3) - 0.5)^2+(8 - (2.05\times 4) - 0.5)^2=3.675$

So that the $rss$ is $24.688-3.675=21.013$.

The coefficient of determination is


This result is quite high, and is fairly indicative of a linear model.

There is a Linear Regression Calculator in this blog, which can be found here