
# Linear Regression: The Code

Ok so last time we saw what linear regression was all about: you have data in the form of a matrix $X$ where rows correspond to datapoints, and columns correspond to features, and a column vector $Y$ of corresponding response values. We seek a $\beta$ so that $X \beta$ is as close to $Y$ as possible.

We saw that we can find the solution by solving the linear system $X^TX\beta = X^TY$ for $\beta$.
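These are the normal equations: they are exactly the first-order condition for minimising the squared error $\|Y - X\beta\|^2$. Setting the gradient with respect to $\beta$ to zero gives

$$
\nabla_\beta \|Y - X\beta\|^2 = -2X^T(Y - X\beta) = 0
\quad\Longleftrightarrow\quad
X^TX\beta = X^TY.
$$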

Doing this in Python is very simple using scipy.linalg! I have created an IPython notebook taking you through everything, which you can run interactively.

Let’s begin by creating some data to model to get a feel for what we are doing. We’ll start with a linear function f, and add some Gaussian noise to make it a bit more realistic, then use linear regression to fit that data, and see how well we do compared with the true function f.

```python
# First get all the imports out of the way.
import numpy as np
from scipy import linalg
import random as rd
import matplotlib.pyplot as plt

def test_set(a, b, no_samples, intercept, slope, variance):
    # First we grab some random x-values in the range [a, b].
    X = [rd.uniform(a, b) for i in range(no_samples)]
    # Sorting them seems to make the plots work better later on.
    X.sort()
    # Now we define the linear function that will underlie our synthetic dataset: the true model.
    def f(x):
        return intercept + slope*x
    # We create a y value for each x, given by f(x) plus a small Gaussian perturbation.
    # (Note that rd.gauss takes a standard deviation as its second argument.)
    Y = [f(x) + rd.gauss(0, variance) for x in X]
    # Now we return these as numpy column vectors, as well as the points given by f without noise.
    return [np.array([X]).T, np.array([Y]).T, [f(x) for x in X]]
```


Great. Now let’s see what this looks like by plotting the data along with the graph of f.

```python
X, Y, f = test_set(a=0, b=5, no_samples=70, intercept=1, slope=3, variance=6)
fig, ax = plt.subplots()
ax.plot(X, Y, '.')
ax.plot(X, f)
plt.show()
```


This will look something like:

but of course yours will not look exactly the same because the data is generated randomly.
The code for doing the linear regression is simplicity itself.

```python
def linear_regression(X, Y):
    # Solve the linear system X^T X beta = X^T Y.
    beta = linalg.solve(X.T.dot(X), X.T.dot(Y))
    return beta
```
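One caveat, as an aside: forming $X^TX$ squares the condition number of the problem, so for badly conditioned data a dedicated least-squares solver such as `scipy.linalg.lstsq` is numerically safer. On well-behaved data the two approaches agree; here is a quick sketch checking that (the data and variable names here are my own, not part of the post's running example):

```python
import numpy as np
from scipy import linalg

# Build a small, well-conditioned synthetic problem: y = 3x + 1 plus noise.
rng = np.random.default_rng(0)
X = np.hstack([rng.uniform(0, 5, size=(50, 1)), np.ones((50, 1))])
Y = X @ np.array([[3.0], [1.0]]) + rng.normal(0, 0.1, size=(50, 1))

# Normal equations (as in linear_regression above) vs. a least-squares solver.
beta_normal = linalg.solve(X.T @ X, X.T @ Y)
beta_lstsq, *_ = linalg.lstsq(X, Y)

# beta_normal and beta_lstsq should agree to numerical precision.
```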


Now we want to test it on our data. Note that we need to add an extra column of ones to X: a linear combination of the columns of X alone always passes through the origin, and we want an intercept term.

```python
X = np.hstack((X, np.ones((X.shape[0], 1), dtype=X.dtype)))
beta = linear_regression(X, Y)
fig, ax = plt.subplots()
ax.plot(X[:, 0], Y, '.', label="True model + noise")
ax.plot(X[:, 0], f, '-', color='red', label="True model")
ax.plot(X[:, 0], X.dot(beta), color='green', label="Fitted model")
ax.legend()
plt.show()
print("Fitted Slope =", beta[0][0], "True Slope = 3")
print("Intercept =", beta[1][0], "True Intercept = 1")
```



Producing something like:

```
Fitted Slope =  3.30648992096 True Slope = 3
Intercept = -0.627196179429 True Intercept = 1
```

Well, the slope is not far off, but the intercept is wrong! This mostly comes from the noise we added: larger variance means poorer approximations to the true model, and more data means better approximations. If you run the code a couple of times, you will see quite a lot of variation. So maybe what we need to do is try and quantify our uncertainty about the fit. We can do this by bootstrapping, as we discussed previously.

To get $95\%$ confidence bounds for our fit, we will take 1000 resamples of our data, fit each one, and then extract the quantiles from their predictions.

First we get the beta for each resample.

```python
betas = []
N = X.shape[0]
for resample in range(1000):
    # Sample N row indices with replacement.
    index = [rd.choice(range(N)) for i in range(N)]
    re_X = X[index, :]
    betas.append(linear_regression(re_X, Y[index, :]))
```


Then for each data point, we collect all of the predictions and take quantiles.

```python
def confidence_bounds(x):
    # Predict at x under every bootstrap fit, then take the
    # 2.5% and 97.5% quantiles of the 1000 predictions.
    predictions = [float(x.dot(b)) for b in betas]
    predictions.sort()
    return [predictions[24], predictions[-25]]

upper = []
lower = []
for x in X:
    conf_bounds = confidence_bounds(x)
    upper.append(conf_bounds[1])
    lower.append(conf_bounds[0])
```
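As an aside, numpy can do the sort-and-slice in one call: stacking the bootstrap predictions into a single array and asking `np.percentile` for the 2.5th and 97.5th percentiles gives essentially the same bounds. A self-contained sketch (the function name `percentile_bounds` is my own, not from the post):

```python
import numpy as np

def percentile_bounds(X, betas, lo=2.5, hi=97.5):
    # Each X.dot(b) is a column of predictions; stacking the columns gives
    # an array of shape (n_points, n_resamples).
    preds = np.hstack([X.dot(b) for b in betas])
    lower = np.percentile(preds, lo, axis=1)
    upper = np.percentile(preds, hi, axis=1)
    return lower, upper
```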


Now we put everything together in a plot.

```python
fig, ax = plt.subplots()
ax.plot(X[:, 0], Y, '.', label="True function + noise")
ax.plot(X[:, 0], X.dot(beta), color='red', label="Linear regression fit")
ax.plot(X[:, 0], upper, color='blue', label="Bootstrapped 95% confidence intervals for fit")
ax.plot(X[:, 0], lower, color='blue')
ax.plot(X[:, 0], f, color='green', label="True function")
ax.legend()
plt.show()
```


Which looks like this:

Next time we will cover logistic regression, used when we want to predict a binary outcome, which may prove useful on the Titanic data. The code for everything is here, but I recommend using this IPython notebook if you have it.