
Linear Regression: The Math(s)

Let’s suppose we have vectors x^1, \dots , x^k \in \mathbb{R}^n where each coordinate represents a variable: perhaps x_1 is height, x_2 is weight, x_3 is age, and so on. Associated to each x^i we have y^i \in \mathbb{R}, which we think of as the response, or what we are trying to predict given x^i. The task is to find a model or algorithm that will spit out a good approximation to y^i given x^i as an input, in particular one that will work for new, unseen pieces of data for which we do not know the response.

The linear regression approach to this is as follows. Let X \in \mathbb{R}^{k \times n} be the matrix with rows x^i, and Y \in \mathbb{R}^k be the vector with entries y^i. What we are seeking is a \beta \in \mathbb{R}^n so that Y = X\beta + \epsilon, where \epsilon is an error term and \beta is chosen to minimize this error. So we minimize the error on the training data, and we hope that this generalizes to new cases.
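To fix the shapes, here is a minimal NumPy sketch of the setup (the numbers are invented purely for illustration): the rows of X are the data points x^i and the entries of Y are the responses y^i.

```python
import numpy as np

# k = 5 data points, n = 3 variables (say height, weight, age); made-up values.
X = np.array([[1.70, 65.0, 29.0],
              [1.80, 80.0, 34.0],
              [1.65, 58.0, 22.0],
              [1.90, 92.0, 41.0],
              [1.75, 70.0, 30.0]])        # rows are the x^i
Y = np.array([3.1, 4.0, 2.5, 4.8, 3.4])   # Y[i] is the response y^i

print(X.shape, Y.shape)  # (5, 3) (5,)
```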

The natural measure of error to use is the Euclidean metric given by the inner product, that is \langle x,y \rangle = \sum \limits _{i=1}^n x_iy_i for x,y \in \mathbb{R}^n. Recall that an inner product is a function taking two vectors to their underlying scalar field, and that it represents (at least for unit vectors) the magnitude of the projection of one onto the other; see Wikipedia for further details. This naturally gives rise to a norm defined by \Vert x\Vert_2^2=\langle x,x \rangle.
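As a tiny illustration (the vectors are arbitrary), here are the inner product and the induced norm in NumPy:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])

inner = np.dot(x, y)     # <x, y> = 1*4 + 2*5 + 3*6 = 32
sq_norm = np.dot(x, x)   # ||x||_2^2 = <x, x> = 14

print(inner, sq_norm)
print(np.isclose(sq_norm, np.linalg.norm(x) ** 2))  # agrees with the built-in norm, up to rounding
```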

Now we think geometrically: \beta tells us how to combine the columns of X, so X\beta is a point in the column space of X, which you can imagine as being like a plane. With our geometric cap on, the natural point in the column space to pick as the best approximation to Y is the orthogonal projection of Y onto the column space R(X).

What does this mean? Well, a vector p is the orthogonal projection of Y onto R(X) when the vector from p to Y, namely Y-p, is orthogonal to R(X), meaning that \langle x, Y-p \rangle=0 for all x \in R(X). Geometrically this means that the line you draw from Y down to p meets R(X) at right angles: the line is normal to the space.

Here is a picture of the projection of a point onto a plane, courtesy of Wikibooks:

[Figure: projection of a point onto a plane, with normal N (Wikibooks: Linalg_projection_onto_plane_2)]

There N is the normal to the plane; see how the line from the point to its projection is parallel to N, meaning that any vector in the plane is perpendicular to the path from the point to its projection.

How do we know that this is the right point to pick? Pythagoras’ theorem. Given orthogonal x, y, Pythagoras’ theorem states that

\Vert x+y \Vert_2^2 = \Vert x \Vert_2^2 + \Vert y \Vert_2^2 .

So let q \in R(X) be another candidate for the best approximation to Y. Then \langle Y-p, p-q\rangle=0 since p is the orthogonal projection and p-q \in R(X), and so plugging x = Y-p, y = p-q into the above gives

\Vert (Y-p)+(p-q)\Vert_2^2=\Vert Y-q\Vert_2^2=\Vert Y-p\Vert_2^2+\Vert p-q \Vert_2^2

and so

\Vert Y-q\Vert_2 > \Vert Y-p\Vert_2,

meaning that q gives a larger error for any q \ne p. Note that this is nothing more than the observation that the hypotenuse of a right-angled triangle is the longest side.
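We can check both the orthogonality condition and this Pythagoras argument numerically. The sketch below uses random toy data and, for now, treats NumPy’s least-squares solver as a black box that returns the minimizing \beta; the explicit formula comes next.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
Y = rng.normal(size=5)

# Orthogonal projection p of Y onto R(X), via NumPy's least-squares solver.
beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
p = X @ beta

# Any other point q of R(X).
q = X @ (beta + rng.normal(size=3))

# Orthogonality of the residual: X^T (Y - p) is (numerically) zero.
print(np.allclose(X.T @ (Y - p), 0))

# Pythagoras: ||Y - q||^2 = ||Y - p||^2 + ||p - q||^2, so ||Y - q|| > ||Y - p||.
lhs = np.sum((Y - q) ** 2)
rhs = np.sum((Y - p) ** 2) + np.sum((p - q) ** 2)
print(np.isclose(lhs, rhs), np.linalg.norm(Y - q) > np.linalg.norm(Y - p))
```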

Great, so now we need to find \beta. Fortunately, all we need to do is arrange everything we know in an equation: Y - X\beta must be orthogonal to R(X), which is the same as being orthogonal to each column of X, and in matrix form that reads

0= X^T (Y-X\beta) = X^TY-X^TX \beta

giving that

X^TY = X^TX \beta \ \ \ \ (*)

and so

(X^TX)^{-1}X^TY = \beta \ \ \ \ (**).

Note that X^TX will be invertible when the columns of X are linearly independent; if this fails, we can discard some redundant columns and start again. The matrix X(X^TX)^{-1}X^T, which sends Y to its projection X\beta, is called the projection matrix for the column space. Simples.
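As a quick sanity check on randomly generated toy data, X(X^TX)^{-1}X^T really does behave like a projection: applying it twice changes nothing, and the residual it leaves behind is orthogonal to the column space.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 3))   # columns are (almost surely) linearly independent
Y = rng.normal(size=6)

P = X @ np.linalg.inv(X.T @ X) @ X.T   # projection matrix onto R(X)

print(np.allclose(P @ P, P))              # projecting twice changes nothing
print(np.allclose(X.T @ (Y - P @ Y), 0))  # residual is orthogonal to R(X)
```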

I should point out that it is actually better not to solve for \beta on one side by inverting X^TX as in (**); you should leave it as (*) and solve the linear system, since explicitly forming the inverse is computationally expensive and less numerically stable!
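To make that concrete, here is a small sketch on random toy data (the proper Python implementation is for next time) comparing solving the linear system (*) with forming the inverse as in (**); on well-conditioned data the two agree, but the solve is the better habit.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))
Y = rng.normal(size=100)

# (*): solve the linear system X^T X beta = X^T Y ...
beta_solve = np.linalg.solve(X.T @ X, X.T @ Y)

# ... rather than explicitly inverting X^T X as in (**).
beta_inv = np.linalg.inv(X.T @ X) @ X.T @ Y

# NumPy's built-in least-squares routine agrees with both on this toy data.
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

print(np.allclose(beta_solve, beta_inv), np.allclose(beta_solve, beta_lstsq))
```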

Next time we will implement this in Python and test it on some toy data.
