
Linear Regression: The Math(s)

Let’s suppose we have vectors $x^1, \dots , x^k \in \mathbb{R}^n$, where each coordinate represents a variable: perhaps $x^i_1$ is height, $x^i_2$ is weight, $x^i_3$ is age, and so on. Associated to each $x^i$ we have a response $y^i \in \mathbb{R}$, which is what we are trying to predict given $x^i$. The task is to find a model or algorithm that will spit out a good approximation to $y^i$ given $x^i$ as an input, in particular one that will also work for new, unseen pieces of data for which we do not know the response.

The linear regression approach to this is as follows. Let $X \in \mathbb{R}^{k \times n}$ be the matrix with rows $x^i$, and $Y \in \mathbb{R}^k$ be the vector with entries $y^i$. What we are seeking is a $\beta \in \mathbb{R}^n$ so that $X\beta = Y + \epsilon$, where $\epsilon$ is an error term, and $\beta$ is chosen to make this error as small as possible. So we minimize the error on the training data, and we hope that this generalizes to new cases.

The natural measure of error to use is the Euclidean metric given by the inner product, that is $\langle x,y \rangle = \sum \limits _{i=1}^nx_iy_i$ for $x,y \in \mathbb{R}^n$. Recall that an inner product is a function taking two vectors to their underlying scalar field, representing (at least for unit vectors) the magnitude of the projection of one onto the other; see Wikipedia for further details. This naturally gives rise to a norm defined by $\Vert x\Vert_2^2=\langle x,x \rangle.$
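As a quick sketch in NumPy (the vectors here are illustrative toy values, not anything from the data above), the inner product and the squared norm it induces are just:

```python
import numpy as np

# Two toy vectors in R^3 (made-up numbers for illustration).
x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, -1.0, 2.0])

# Inner product <x, y> = sum_i x_i * y_i
ip = np.dot(x, y)        # 1*4 + 2*(-1) + 3*2 = 8.0

# Squared Euclidean norm ||x||_2^2 = <x, x>
norm_sq = np.dot(x, x)   # 1 + 4 + 9 = 14.0
```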

Now we think geometrically: $\beta$ is telling us how to combine the columns of $X$, and so $\beta$ tells us where in the column space of $X$, which you can imagine as being like a plane, is the best approximation to $Y$. With our geometric cap on, the natural point in the column space to pick is the orthogonal projection of $Y$ onto the column space $R(X)$.

What does this mean? Well a vector $p$ is an orthogonal projection of $Y$ onto $R(X)$ when the vector between $p$ and $Y$, $Y-p$, is orthogonal to $R(X)$, meaning that $\langle x,Y-p \rangle=0$ for all $x \in R(X)$. Geometrically this means that the line you draw from $Y$ to $p$ in $R(X)$ hits it at right-angles – the line is normal to the space.
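We can see this orthogonality condition numerically. In this sketch (toy numbers of my choosing) the two columns of $X$ span a plane in $\mathbb{R}^3$, and the residual $Y-p$ comes out orthogonal to both columns, hence to everything in $R(X)$:

```python
import numpy as np

# Toy data: the two columns of X span a plane in R^3.
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
Y = np.array([1.0, 2.0, 4.0])

# Least-squares coefficients, then the projection p = X @ beta of Y onto R(X).
beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
p = X @ beta

# Y - p is orthogonal to every column of X, hence to all of R(X).
print(np.allclose(X.T @ (Y - p), 0))  # True
```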

Here is a picture of projection of a point onto a plane, courtesy of Wikibooks:

There $N$ is the normal to the plane; see how the line from the point to its projection is parallel to $N$, meaning that any vector in the plane is perpendicular to the path of the point to its projection.

How do we know that this is the right point to pick? Pythagoras’ theorem. Given orthogonal $x,y$ Pythagoras’ theorem states that

$\Vert x+y \Vert_2^2 = \Vert x \Vert_2^2 + \Vert y \Vert_2^2$.

So let $q \in R(X)$ be another candidate approximation to $Y$. Then $\langle Y-p, p-q\rangle=0$, since $p$ is the orthogonal projection and $p-q \in R(X)$, and so applying the above with $x = Y-p$ and $y = p-q$ gives

$\Vert (Y-p)+(p-q)\Vert_2^2=\Vert Y-q\Vert_2^2=\Vert Y-p\Vert_2^2+\Vert p-q \Vert_2^2$

and so

$\Vert Y-q\Vert_2 > \Vert Y-p\Vert_2$,

meaning that $q$ gives a larger error for any $q \ne p.$ Note that this is nothing more than the observation that the hypotenuse of a right-angled triangle is the longest side.
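Continuing the toy example from before, we can check the Pythagoras identity and the resulting inequality numerically for some other point $q$ of the column space (the choice of $q$ here is arbitrary):

```python
import numpy as np

# Same toy setup: the columns of X span a plane in R^3.
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
Y = np.array([1.0, 2.0, 4.0])

beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
p = X @ beta                    # orthogonal projection of Y onto R(X)
q = X @ np.array([1.0, 1.0])    # some other point of R(X)

# Pythagoras: ||Y-q||^2 = ||Y-p||^2 + ||p-q||^2, so q's error is larger.
lhs = np.linalg.norm(Y - q) ** 2
rhs = np.linalg.norm(Y - p) ** 2 + np.linalg.norm(p - q) ** 2
print(np.isclose(lhs, rhs))                           # True
print(np.linalg.norm(Y - q) > np.linalg.norm(Y - p))  # True
```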

Great, so now we need to find $\beta$. We know that $Y - X\beta$ is orthogonal to $R(X)$, and in particular to each column of $X$; writing these $n$ conditions in matrix form gives

$0= X^T (Y-X\beta) = X^TY-X^TX \beta$

giving that

$X^TY = X^TX \beta \ \ \ \ (*)$

and so

$(X^TX)^{-1}X^TY = \beta \ \ \ \ (**).$

Note that $X^TX$ will be invertible exactly when the columns of $X$ are linearly independent; if this fails, then we can discard some redundant columns and start again. The matrix $X(X^TX)^{-1}X^T$, which sends $Y$ to its projection $X\beta$, is called the projection matrix for the column space.  Simples.
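A small sketch of the projection matrix on the toy $X$ from earlier, checking two properties any projection must have:

```python
import numpy as np

X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])

# Projection matrix onto R(X): P = X (X^T X)^{-1} X^T.
P = X @ np.linalg.inv(X.T @ X) @ X.T

# Projecting twice is the same as projecting once (P is idempotent) ...
print(np.allclose(P @ P, P))  # True
# ... and vectors already in R(X), such as the columns of X, are fixed by P.
print(np.allclose(P @ X, X))  # True
```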

I should point out that in practice it is better not to solve for $\beta$ by inverting $X^TX$ as in $(**)$: you should leave it as $(*)$ and solve the linear system directly, since computing the inverse is more expensive and less numerically stable.
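To make the point concrete, here is a sketch on randomly generated toy data: all three routes below agree on $\beta$, but `np.linalg.solve` avoids forming the inverse, and `np.linalg.lstsq` solves the least-squares problem directly without even forming $X^TX$:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))   # toy data with linearly independent columns
Y = rng.normal(size=50)

# Preferred: solve the normal equations (*) as a linear system ...
beta_solve = np.linalg.solve(X.T @ X, X.T @ Y)

# ... rather than forming the explicit inverse as in (**).
beta_inv = np.linalg.inv(X.T @ X) @ X.T @ Y

# In practice np.linalg.lstsq is the usual choice: it solves the
# least-squares problem directly.
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

print(np.allclose(beta_solve, beta_inv))    # True
print(np.allclose(beta_solve, beta_lstsq))  # True
```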

Next time we will implement this in Python and test it on some toy data.