Let’s suppose we have vectors $x_1, \dots, x_n \in \mathbb{R}^p$ where each coordinate represents a variable, perhaps $x_{i1}$ is height, $x_{i2}$ is weight and $x_{i3}$ is age, and so on. Associated to each $x_i$ we have $y_i \in \mathbb{R}$, which we think of as the response, or what we are trying to predict given $x_i$. The task is to find a model or algorithm that will spit out a good approximation to $y_i$ given $x_i$ as an input, in particular one that will work for new unseen pieces of data for which we do not know the response.
The linear regression approach to this is as follows. Let $X$ be the $n \times p$ matrix with rows $x_i^T$, and $y$ be the $n \times 1$ matrix with rows $y_i$. What we are seeking is a $\beta \in \mathbb{R}^p$ so that $y = X\beta + \epsilon$, where $\epsilon$ is an error term, and $\beta$ is chosen to minimize this error. So we minimize the error on the training data, and we hope that this generalizes to new cases.
The natural measure of error to use is the Euclidean metric given by the inner product, that is $\langle u, v \rangle = u^T v = \sum_i u_i v_i$ for $u, v \in \mathbb{R}^n$. Recall that an inner product is a function taking two vectors to their underlying scalar field; it represents (at least for unit vectors) the magnitude of the projection of one onto the other. See Wikipedia for further details. This naturally gives rise to a norm defined by
$$\|u\| = \sqrt{\langle u, u \rangle}.$$
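To make the inner product and its norm concrete, here is a minimal sketch in plain Python. The function names `inner` and `norm` and the example vectors are my own choices for illustration, not anything fixed by the text.

```python
import math

def inner(u, v):
    """Euclidean inner product <u, v> = sum_i u_i * v_i."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    """Norm induced by the inner product: ||u|| = sqrt(<u, u>)."""
    return math.sqrt(inner(u, u))

# Illustrative vectors (made up for this example).
u = [3.0, 4.0]
v = [1.0, 0.0]

print(inner(u, v))  # 3.0: the length of the projection of u onto the unit vector v
print(norm(u))      # 5.0: the familiar 3-4-5 triangle
```

Since `v` is a unit vector here, `inner(u, v)` is exactly the length of the shadow `u` casts on it, which is the reading of the inner product described above.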
Now we think geometrically: $\beta$ is telling us how to combine the columns of $X$, and so tells us where in the column space of $X$, which you can imagine as being like a plane, is the best approximation to $y$. With our geometric cap on, the natural point in the column space to pick is the orthogonal projection of $y$ onto the column space $C(X)$.
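What does this mean? Well, a vector $\hat{y}$ in the column space $C(X)$ is an orthogonal projection of $y$ onto $C(X)$ when the vector between $y$ and $\hat{y}$, namely $y - \hat{y}$, is orthogonal to $C(X)$, meaning that $\langle y - \hat{y}, v \rangle = 0$ for all $v \in C(X)$. Geometrically this means that the line you draw from $y$ to $\hat{y}$ in $C(X)$ hits it at right angles: the line is normal to the space.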
Here is a picture of projection of a point onto a plane, courtesy of Wikibooks:
The picture shows the normal to the plane: see how the line from the point to its projection is parallel to this normal, meaning that any vector in the plane is perpendicular to the path of the point to its projection.
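The orthogonality condition can be checked numerically. The sketch below assumes a deliberately simple setup of my own choosing: the plane is the xy-plane in three dimensions, spanned by two columns, and the candidate projection just drops the z-coordinate. Because the inner product is linear, it is enough to test the residual against the spanning columns rather than against every vector in the plane.

```python
def inner(u, v):
    """Euclidean inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def is_orthogonal_to_span(r, cols, tol=1e-12):
    """True if r is orthogonal to every spanning column (hence to their span)."""
    return all(abs(inner(r, c)) <= tol for c in cols)

# Hypothetical example data: two columns spanning the xy-plane in R^3.
columns = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
y = [1.0, 2.0, 3.0]
y_hat = [1.0, 2.0, 0.0]  # candidate projection: drop the z-coordinate

# The vector from the projection to the point.
residual = [a - b for a, b in zip(y, y_hat)]

print(is_orthogonal_to_span(residual, columns))  # True

# A different point of the plane fails the test.
other = [1.0, 1.0, 0.0]
other_residual = [a - b for a, b in zip(y, other)]
print(is_orthogonal_to_span(other_residual, columns))  # False
```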
How do we know that this is the right point to pick? Pythagoras’ theorem. Given orthogonal $u, v$, Pythagoras’ theorem states that
$$\|u + v\|^2 = \|u\|^2 + \|v\|^2.$$
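So let $w$ be another point of the column space $C(X)$, a rival candidate for the orthogonal projection $\hat{y}$. Then $\langle y - \hat{y}, \hat{y} - w \rangle = 0$, since $\hat{y}$ is the orthogonal projection and $\hat{y} - w$ lies in $C(X)$, and so plugging $u = y - \hat{y}$ and $v = \hat{y} - w$ into the above gives
$$\|y - w\|^2 = \|y - \hat{y}\|^2 + \|\hat{y} - w\|^2 \geq \|y - \hat{y}\|^2,$$
meaning that $w$ gives a larger error for any $w \neq \hat{y}$. Note that this is nothing more than the observation that the hypotenuse of a right-angled triangle is the longest side.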
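Here is a small numerical sanity check of that argument, again a sketch with made-up numbers: the point, its projection onto the xy-plane, and three rival points in the plane are all my own illustrative choices. For each rival the squared error splits exactly as Pythagoras says, and it is always larger than the error of the projection.

```python
def sub(u, v):
    """Componentwise difference u - v."""
    return [a - b for a, b in zip(u, v)]

def sq_norm(u):
    """Squared Euclidean norm ||u||^2."""
    return sum(a * a for a in u)

# Hypothetical example data: project y onto the xy-plane in R^3.
y = [1.0, 2.0, 3.0]
y_hat = [1.0, 2.0, 0.0]
rivals = [[4.0, -2.0, 0.0], [0.0, 0.0, 0.0], [1.0, 2.5, 0.0]]

best_error = sq_norm(sub(y, y_hat))  # squared error of the projection

errors = []
for w in rivals:
    error = sq_norm(sub(y, w))     # squared error of the rival
    gap = sq_norm(sub(y_hat, w))   # squared distance from rival to projection
    errors.append(error)
    # Pythagoras: error == best_error + gap, so error > best_error.
    print(w, error, best_error + gap)
```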
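Great, so now we need to find $\beta$. Fortunately all we need to do is arrange everything we know in an equation. The residual $y - X\beta$ must be orthogonal to every column of $X$, which says exactly that $X^T(y - X\beta) = 0$, and so
$$X^T X \beta = X^T y \quad\Longrightarrow\quad \beta = (X^T X)^{-1} X^T y, \qquad \hat{y} = X\beta = X (X^T X)^{-1} X^T y.$$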
Note that $X^T X$ will be invertible when the columns of $X$ are linearly independent; if this fails, then we can discard some redundant columns and start again. The matrix $X (X^T X)^{-1} X^T$ is called the projection matrix for the column space. Simples.
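As a quick illustration ahead of a proper implementation, the sketch below forms $X^T X$ and $X^T y$ for a tiny made-up data set with two columns (a column of ones for an intercept, which is my own addition, and one variable) and solves the resulting 2 by 2 system by Cramer's rule. The data and variable names are assumptions for the example only.

```python
def inner(u, v):
    """Euclidean inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

# Hypothetical toy data.
x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 4.0, 7.0]
ones = [1.0] * len(x)
cols = [ones, x]  # the two columns of X

# Entries of X^T X are inner products of columns; X^T y likewise.
G = [[inner(ci, cj) for cj in cols] for ci in cols]
b = [inner(ci, y) for ci in cols]

# Solve the 2x2 system G beta = b by Cramer's rule.
det = G[0][0] * G[1][1] - G[0][1] * G[1][0]
if det == 0:
    raise ValueError("columns of X are linearly dependent")
beta = [
    (b[0] * G[1][1] - G[0][1] * b[1]) / det,
    (G[0][0] * b[1] - G[1][0] * b[0]) / det,
]

# Fitted values and residual; the residual should be orthogonal to each column.
y_hat = [beta[0] * o + beta[1] * xi for o, xi in zip(ones, x)]
residual = [yi - fi for yi, fi in zip(y, y_hat)]

print(beta)
print([inner(residual, c) for c in cols])  # both entries close to zero
```

Cramer's rule only makes sense here because the system is 2 by 2; it is used to keep the example short, not as a recommendation for larger problems.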
I should point out that it is actually better not to isolate $\beta$ on one side by inverting $X^T X$, as in $\beta = (X^T X)^{-1} X^T y$. You should leave it as $X^T X \beta = X^T y$ and solve the linear system: finding the inverse is computationally expensive!
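To show what "solve the linear system" can look like without ever forming an inverse, here is a minimal sketch of Gaussian elimination with partial pivoting using only the standard library. The function name `solve` and the example matrix are my own; in practice you would more likely reach for a library routine such as `numpy.linalg.solve`.

```python
def solve(A, c):
    """Solve A x = c by Gaussian elimination with partial pivoting.

    A is a square matrix given as a list of rows, c a list of numbers.
    The inverse of A is never computed.
    """
    n = len(A)
    # Work on an augmented copy so the inputs are not modified.
    M = [list(row) + [ci] for row, ci in zip(A, c)]
    for k in range(n):
        # Choose the row with the largest pivot to limit rounding error.
        p = max(range(k, n), key=lambda i: abs(M[i][k]))
        if M[p][k] == 0.0:
            raise ValueError("matrix is singular")
        M[k], M[p] = M[p], M[k]
        # Eliminate column k from the rows below the pivot.
        for i in range(k + 1, n):
            f = M[i][k] / M[k][k]
            for j in range(k, n + 1):
                M[i][j] -= f * M[k][j]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

# Hypothetical normal equations: X^T X and X^T y for a small toy problem.
XtX = [[4.0, 6.0], [6.0, 14.0]]
Xty = [15.0, 32.0]
beta = solve(XtX, Xty)
print(beta)
```

The singularity test here compares against exact zero, which is fine for a sketch but too strict for real data, where a nearly dependent set of columns would need a tolerance.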
Next time we will implement this in Python and test it on some toy data.