
Linear Regression: The Code

Ok, so last time we saw what linear regression is all about: you have data in the form of a matrix X, where rows correspond to datapoints and columns correspond to features, and a column vector Y of corresponding response values. We seek a \beta so that X\beta is as close to Y as possible.

We saw that we can find the solution by solving the linear system X^TX\beta = X^TY for \beta.
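In case you want a reminder of where that system comes from: writing the squared error as L(\beta) = (X\beta - Y)^T(X\beta - Y), its gradient with respect to \beta is \nabla_\beta L = 2X^T(X\beta - Y), and setting this to zero gives exactly X^TX\beta = X^TY.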

Doing this in Python is very simple using scipy.linalg! I have created an IPython notebook taking you through everything, which you can run interactively.

Let’s begin by creating some data to model to get a feel for what we are doing. We’ll start with a linear function f, and add some Gaussian noise to make it a bit more realistic, then use linear regression to fit that data, and see how well we do compared with the true function f.

#First get all the imports out of the way.
import numpy as np
from scipy import linalg
import random as rd
import matplotlib.pyplot as plt

def test_set(a, b, no_samples, intercept, slope, variance):
    #First we grab some random x-values in the range [a,b].
    X = [rd.uniform(a, b) for i in range(no_samples)]
    #Sorting them seems to make the plots work better later on.
    X.sort()
    #Now we define the linear function that will underlie our synthetic dataset: the true model.
    def f(x):
        return intercept + slope*x
    #We create a y value for each x, given by f(x) plus a small Gaussian perturbation.
    #(Note that rd.gauss takes the standard deviation as its second argument.)
    Y = [f(x) + rd.gauss(0, variance) for x in X]
    #Now we return these as numpy column vectors, as well as the points given by f without noise.
    return [np.array([X]).T, np.array([Y]).T, [f(x) for x in X]]

Great. Now let’s see what this looks like by plotting the data along with the graph of f.

X,Y,f = test_set(a=0, b=5, no_samples=70, intercept=1, slope=3, variance=6)
fig, ax = plt.subplots()
ax.plot(X, Y, '.')
ax.plot(X, f)
plt.show()

This will look something like:
[graph1: the noisy data points together with the graph of f]

but of course yours will not look exactly the same because the data is generated randomly.
The code for doing the linear regression is simplicity itself.

def linear_regression(X, Y):
    #Solve the linear system X^TX beta = X^TY for beta.
    beta = linalg.solve(X.T.dot(X), X.T.dot(Y))
    return beta
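As an aside (this isn't part of the original approach), forming X^TX and solving can be numerically delicate when the columns of X are nearly collinear; numpy also ships a least-squares solver that minimises \|X\beta - Y\|^2 directly. A minimal sketch of the same fit using it:

#A sketch of the same fit using numpy's built-in least-squares solver.
#np.linalg.lstsq minimises ||X beta - Y||^2 without explicitly forming X^TX.
def linear_regression_lstsq(X, Y):
    beta, residuals, rank, singular_values = np.linalg.lstsq(X, Y, rcond=None)
    return beta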

Now we want to test it on our data. Note that we add an extra column of ones to X: a linear combination of the columns of X alone always passes through the origin, and we want an intercept term.

X = np.hstack((X, np.ones((X.shape[0], 1), dtype=X.dtype)))
beta=linear_regression(X, Y)
fig, ax = plt.subplots()
ax.plot(X[:,0], Y, '.', label="True model + noise")
ax.plot(X[:,0], f, '-', color='red', label="True model")
ax.plot(X[:,0], X.dot(beta), color='green',label="Fitted model")
legend = ax.legend(loc='lower right', shadow=True)
plt.show()
print "Fitted Slope = ", beta[0][0], "True Slope = 3"
print "Intercept =", beta[1][0], "True Intercept = 1"

Producing something like:

[graph2: the noisy data, the true model, and the fitted model]

Fitted Slope =  3.30648992096 True Slope = 3
Intercept = -0.627196179429 True Intercept = 1

Well, the slope is not far off, but the intercept is wrong! This mostly comes from the noise we added: a larger variance will mean poorer approximations to the true model, and more data will mean better approximations. If you run the code a couple of times, you will see quite a lot of variation. So maybe what we need to do is try to quantify our uncertainty about the fit. We can do this by using bootstrapping, as we discussed previously.
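Before we get to the bootstrap, here is a quick sketch (not from the original post) to make that variation concrete: it regenerates the dataset at a few sample sizes, refits, and prints the coefficients. The exact numbers will differ every time you run it, but the fits should settle down as the sample size grows.

#Refit on freshly generated data of different sizes; the fitted slope and
#intercept should wobble less as no_samples grows.
for n in [20, 70, 500]:
    X_n, Y_n, _ = test_set(a=0, b=5, no_samples=n, intercept=1, slope=3, variance=6)
    X_n = np.hstack((X_n, np.ones((X_n.shape[0], 1), dtype=X_n.dtype)))
    b = linear_regression(X_n, Y_n)
    print("n =", n, "fitted slope =", b[0][0], "fitted intercept =", b[1][0])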

To get 95% confidence bounds for our fit, we will take 1000 resamples of our data, fit each one, and then extract the quantiles from their predictions.

First we get the beta for each resample.

betas=[]
N= X.shape[0]
for resample in range(1000):
    index = [rd.choice(range(N)) for i in range(N)]
    re_X=X[index,:]
    betas.append(linear_regression(re_X, Y[index,:]))

Then for each data point, we collect all of the predictions and take quantiles.

def confidence_bounds(x):
    #Predict at x with each bootstrap fit, then take (roughly) the 2.5% and 97.5% quantiles.
    predictions = sorted(float(x.dot(b)) for b in betas)
    return [predictions[24], predictions[-25]]
upper=[]
lower=[]
for x in X:
    conf_bounds = confidence_bounds(x)
    upper.append(conf_bounds[1])
    lower.append(conf_bounds[0])
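If you prefer, numpy can compute the same bounds for every data point at once. A vectorised sketch (equivalent up to how the quantiles are interpolated, so the numbers may differ very slightly from the loop above):

#Stack the bootstrap predictions: after transposing, one row per resample and
#one column per data point, then take percentiles down each column.
all_preds = np.hstack([X.dot(b) for b in betas]).T
lower = np.percentile(all_preds, 2.5, axis=0)
upper = np.percentile(all_preds, 97.5, axis=0)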

Now we put everything together in a plot.

fig, ax = plt.subplots()
ax.plot(X[:,0], Y, '.', label="True function + noise")
ax.plot(X[:,0], X.dot(beta), 'red', label="Linear regression fit")
ax.plot(X[:,0], upper, color='blue', label="Bootstrapped 95% confidence intervals for fit")
ax.plot(X[:,0], lower, 'blue')
ax.plot(X[:,0], f, 'green', label="True function")
legend = ax.legend(loc='upper left', shadow=True,prop={'size':8})
plt.show()

Which looks like this:

[graph3: the fit with bootstrapped 95% confidence bounds]

Next time we will cover logistic regression, used when we want to predict a binary outcome, which may prove useful on the Titanic data. The code for everything is here, but I recommend using this IPython notebook if you have it.
