Ok so last time we saw what linear regression was all about: you have data in the form of a matrix X where rows correspond to datapoints and columns correspond to features, and a column vector Y of corresponding response values. We seek a coefficient vector beta so that X*beta is as close to Y as possible.

We saw that we can find the solution by solving the linear system X^T X beta = X^T Y for beta.
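As a quick sanity check of those normal equations, here is a minimal sketch (the toy data below is made up for illustration): solving X^T X beta = X^T Y directly should agree with NumPy's built-in least-squares solver.

```python
import numpy as np

# Toy design matrix: one feature column plus a column of ones for the intercept.
X = np.array([[1.0, 1.0],
              [2.0, 1.0],
              [3.0, 1.0],
              [4.0, 1.0]])
Y = np.array([[3.1], [5.0], [6.9], [9.2]])

# Solve the normal equations X^T X beta = X^T Y.
beta_normal = np.linalg.solve(X.T.dot(X), X.T.dot(Y))

# Compare with NumPy's least-squares routine on the same data.
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

print(np.allclose(beta_normal, beta_lstsq))  # True
```

Both routes minimise the same squared error, so the coefficients agree to floating-point precision.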

Doing this in Python is very simple using scipy.linalg! I have created an IPython notebook taking you through everything, which you can run interactively.

Let’s begin by creating some data to model, to get a feel for what we are doing. We’ll start with a linear function f, add some Gaussian noise to make it a bit more realistic, then use linear regression to fit that data and see how well we do compared with the true function f.

# First get all the imports out of the way.
import numpy as np
from scipy import linalg
import random as rd
import matplotlib.pyplot as plt

def test_set(a, b, no_samples, intercept, slope, variance):
    # First we grab some random x-values in the range [a, b].
    X = [rd.uniform(a, b) for i in range(no_samples)]
    # Sorting them makes the line plots come out correctly later on.
    X.sort()
    # Now we define the linear function that will underly our synthetic
    # dataset: the true model.
    def f(x):
        return intercept + slope * x
    # We create a y-value for each x, given by f(x) plus a small perturbation
    # by Gaussian noise. (Note: rd.gauss takes the standard deviation as its
    # second argument, despite the parameter name here.)
    Y = [f(x) + rd.gauss(0, variance) for x in X]
    # Now we return these as numpy column vectors, as well as the points
    # given by f without noise.
    return [np.array([X]).T, np.array([Y]).T, [f(x) for x in X]]

Great. Now let’s see what this looks like by plotting the data along with the graph of f.

X, Y, f = test_set(a=0, b=5, no_samples=70, intercept=1, slope=3, variance=6)
fig, ax = plt.subplots()
ax.plot(X, Y, '.')
ax.plot(X, f)
plt.show()

This will look something like:

but of course yours will not look exactly the same because the data is generated randomly.

The code for doing the linear regression is simplicity itself.

def linear_regression(X, Y):
    # Solve the normal equations X^T X beta = X^T Y for beta.
    beta = linalg.solve(X.T.dot(X), X.T.dot(Y))
    return beta

Now we want to test it on our data. Note that we want to add an extra column to X filled with ones: if we just take a linear combination of the columns of X, the fitted line will always pass through the origin, and we want an intercept term.
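To see why the column of ones matters, here is a tiny made-up example: data generated by y = 10 + x has a clear nonzero intercept, and appending the ones column lets the fit recover it exactly.

```python
import numpy as np

# Three points lying exactly on y = 10 + x.
x = np.array([[1.0], [2.0], [3.0]])
y = np.array([[11.0], [12.0], [13.0]])

# Appending a column of ones adds the intercept term to the model.
X = np.hstack((x, np.ones_like(x)))
beta = np.linalg.solve(X.T.dot(X), X.T.dot(y))

print(beta.round(6).ravel())  # [ 1. 10.]  -- slope 1, intercept 10 recovered
```

Without the ones column, the same solve would force the line through the origin and get both slope and intercept wrong.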

X = np.hstack((X, np.ones((X.shape[0], 1), dtype=X.dtype)))
beta = linear_regression(X, Y)
fig, ax = plt.subplots()
ax.plot(X[:, 0], Y, '.', label="True model + noise")
ax.plot(X[:, 0], f, '-', color='red', label="True model")
ax.plot(X[:, 0], X.dot(beta), color='green', label="Fitted model")
legend = ax.legend(loc='lower right', shadow=True)
plt.show()
print("Fitted Slope =", beta[0][0], "True Slope = 3")
print("Intercept =", beta[1][0], "True Intercept = 1")

Producing something like:
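As a final hedge against bugs, we can cross-check our normal-equations fit against scipy.stats.linregress on the same kind of synthetic data (the data below is generated fresh here, so it is illustrative rather than identical to the plots above). Both compute ordinary least squares, so the slope and intercept should match.

```python
import numpy as np
from scipy import stats

# Synthetic data: true model y = 1 + 3x plus Gaussian noise.
x = np.linspace(0, 5, 70)
y = 1 + 3 * x + np.random.normal(0, 6, size=70)

# Our approach: design matrix with a ones column, then the normal equations.
X = np.hstack((x.reshape(-1, 1), np.ones((70, 1))))
Y = y.reshape(-1, 1)
beta = np.linalg.solve(X.T.dot(X), X.T.dot(Y))

# scipy's one-liner for simple linear regression.
result = stats.linregress(x, y)

print(np.isclose(beta[0][0], result.slope))      # True
print(np.isclose(beta[1][0], result.intercept))  # True
```

If the two disagree, the most likely culprit is a missing or misplaced intercept column.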
