Today we will learn about the random forest algorithm, and how to implement it in Python using the decision trees we have made before.

**History**

The random forest algorithm was first introduced by Leo Breiman, and he and Adele Cutler have a great guide to it here.

**The Algorithm**

The idea is to train many decision trees on bootstrapped samples of your data, and, when growing a node, to consider only a small random selection of the variables. We don’t bother pruning the trees; we just trust that the overfitting of each tree will be averaged out. This makes the algorithm fast, since few variables need to be searched at each split, and stable, since averaging trees grown on resampled training sets reduces the variability you get from any particular sample. It also yields a rich set of information about the model: each tree has ‘out-of-bag’ (OOB) data that was missed by its resample, which can be used to test that individual tree. From these tests we can estimate the importance of each variable and the generalisation error of the whole forest.
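As a rough sketch of the two sources of randomness described above (the names here are my own, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)

n_rows, n_features = 10, 5
X = rng.normal(size=(n_rows, n_features))

# Bootstrap: sample row indices with replacement. Roughly a third of
# the rows are typically left out -- these are the 'out-of-bag' rows.
boot = rng.integers(0, n_rows, size=n_rows)
oob = np.setdiff1d(np.arange(n_rows), boot)

# At each split we would search only a small random subset of features,
# rather than all of them.
m = 2
candidate_features = rng.choice(n_features, size=m, replace=False)

print('in-bag (unique):', sorted(set(boot.tolist())))
print('out-of-bag:     ', oob.tolist())
print('features tried at this split:', candidate_features.tolist())
```

Every row is either in-bag or out-of-bag, so the OOB rows give each tree a free test set it never trained on.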

The best way to explain it is to just do it.

**Random Forest in Python**

First we’ll create random tree classes, one for the classification case and one for the regression case.

```python
class ClassRandomTree(tic.ClassificationTree, object):
    def iter_split(self, node):
        # We overwrite the classification tree's exhaustive search with a
        # random sample: shuffle the columns, then select the first few.
        random.shuffle(self.columns)
        for i in range(self.number_random_vars):  # Set in train() below.
            attribute = self.columns[i]
            for pivot in self.get_pivots(node.local_data, attribute):
                self.fuse_vertex(node)
                self.split_vertex(vertex=node, pivot=pivot,
                                  split_attribute=attribute)
                yield

    def train(self, data, data_type_dict, min_node_size, max_depth,
              number_random_vars, response):
        self.number_random_vars = number_random_vars  # Variables sampled at each split.
        self.vertices = []
        self.edges = set([])
        self.leaves = set([])
        self.data = data
        self.data_type_dict = data_type_dict  # Tells the algorithm what kinds of variables we have.
        self.response = response  # Must be set before we filter the columns below.
        self.columns = data_type_dict.keys()
        if self.response in self.columns:
            self.columns.remove(self.response)
        self.min_node_size = min_node_size
        self.max_node_depth = max_depth
        self.grow_tree()  # Grows the tree without pruning.
```

Now we do exactly the same, but inherit from the regression tree class.

```python
class RegressRandomTree(tic.RegressionTree, object):
    def iter_split(self, node):
        random.shuffle(self.columns)
        for i in range(self.number_random_vars):
            attribute = self.columns[i]
            for pivot in self.get_pivots(node.local_data, attribute):
                self.fuse_vertex(node)
                self.split_vertex(vertex=node, pivot=pivot,
                                  split_attribute=attribute)
                yield

    def train(self, data, data_type_dict, min_node_size, max_depth,
              number_random_vars, response):
        self.number_random_vars = number_random_vars
        self.vertices = []
        self.edges = set([])
        self.leaves = set([])
        self.data = data
        self.data_type_dict = data_type_dict
        self.response = response  # Set before filtering the columns.
        self.columns = data_type_dict.keys()
        if self.response in self.columns:
            self.columns.remove(self.response)
        self.min_node_size = min_node_size
        self.max_node_depth = max_depth
        self.grow_tree()
```

Now that we can make a tree, we grow a forest. The basic idea is simple, and we add a couple of features to keep track of variable importance and the OOB error.

```python
class RandomForest(object):
    def __init__(self, m, train, test, data_type_dict, min_node_size,
                 max_depth, response, no_trees, missing=False,
                 classification=True):
        self.m = m  # Number of variables to select for each tree.
        self.train = train
        self.test = test
        self.data_type_dict = data_type_dict
        self.min_node_size = min_node_size
        self.max_depth = max_depth
        self.response = response
        self.no_trees = no_trees
        self.missing = missing
        self.classification = classification

    def predict(self, predict=False, OOBerror=False, variable_importance=False):
        N = len(self.train)
        predictions = []
        oob_error = []       # Keeps track of the OOB error of each tree.
        mean_oob_error = []  # Keeps track of the mean OOB error of the trees.
        importance_dict = {key: 0 for key in self.data_type_dict.keys()}
        # Will give an importance score for each variable.
        for tree in range(self.no_trees):
            data, leftovers = OOB_sample(self.train, N, True)  # Resample the data.
            if self.classification:
                R = ClassRandomTree(missing=self.missing)
            else:
                R = RegressRandomTree(missing=self.missing)
            R.train(data=data, data_type_dict=self.data_type_dict,
                    min_node_size=self.min_node_size,
                    max_depth=self.max_depth, response=self.response,
                    number_random_vars=self.m)
            if OOBerror or variable_importance:
                if self.classification:
                    oob_error.append(R.test(leftovers))  # Tests tree on unsampled data.
                    mean_oob_error.append(1 - np.mean(oob_error))
                else:
                    oob_error.append(R.test(leftovers) / float(len(leftovers)))
                    mean_oob_error.append(np.mean(oob_error))
            if variable_importance:
                # To assess a variable's importance, we see how the error of the
                # tree is affected by randomly permuting that variable's labels.
                for attribute in self.data_type_dict.keys():
                    shuffled_leftovers = shuffle_attribute_labels(leftovers, attribute)
                    local_class_rate = 1 - oob_error[tree]          # Unshuffled error.
                    shuffled_rate = 1 - R.test(shuffled_leftovers)  # Shuffled error.
                    importance_dict[attribute] += local_class_rate - shuffled_rate
                    # The difference between shuffled and unshuffled error
                    # is added to the dict.
            if predict:
                # Record each tree's predictions so that we can take their
                # mean or mode at the end.
                R.load_new_data(self.test)
                predictions.append(self.test.apply(R.predict, axis=1))
        if predict:
            predictions = tic.combine_predictions(predictions, not self.classification)
        if OOBerror or variable_importance:
            if self.classification:
                print 'Classification rate: ', mean_oob_error[-1]
            else:
                print 'Mean error: ', mean_oob_error[-1]
        if variable_importance:
            # Creates a graph of variable importance; a little fussing so the
            # bars are ranked, which is awkward with a dictionary.
            importance_dict = {attribute: importance_dict[attribute] / float(self.no_trees)
                               for attribute in importance_dict.keys()}
            D = sorted(importance_dict.items(), key=lambda x: x[1])
            values = [x[1] for x in D]
            keys = [x[0] for x in D]
            import matplotlib.pyplot as plt
            plt.barh(range(len(importance_dict)), values, align='center')
            plt.yticks(range(len(importance_dict)), keys)
            plt.show()
        return [predictions, mean_oob_error, importance_dict, oob_error]
```
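The final aggregation done by tic.combine_predictions is just the mode of the trees' votes for classification and the mean for regression. A hand-rolled sketch of that idea (my own helper, not the one from the tic module):

```python
from collections import Counter

def combine_votes(per_tree_predictions, regression=False):
    """Aggregate one prediction per tree into a single prediction per
    test row: mean for regression, majority vote for classification."""
    combined = []
    for votes in zip(*per_tree_predictions):  # One column of votes per row.
        if regression:
            combined.append(sum(votes) / float(len(votes)))
        else:
            combined.append(Counter(votes).most_common(1)[0][0])
    return combined

# Three trees voting on four test rows:
votes = [[1, 0, 1, 1],
         [1, 1, 0, 1],
         [0, 0, 1, 1]]
print(combine_votes(votes))  # [1, 0, 1, 1]
```

Each test row gets the label that most trees agreed on; in the regression case the same loop averages instead of voting.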

Now I just need to show you the code I used for shuffling the variable labels:

```python
def shuffle_attribute_labels(df, attribute):
    df = df.copy()
    col = list(df[attribute])
    np.random.shuffle(col)
    df[attribute] = col
    return df
```
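To see the permutation trick in isolation, here it is run on a toy frame (the column names are made up for illustration):

```python
import numpy as np
import pandas as pd

def shuffle_attribute_labels(df, attribute):
    # Same helper as above: permute one column, leave the rest alone.
    df = df.copy()
    col = list(df[attribute])
    np.random.shuffle(col)
    df[attribute] = col
    return df

df = pd.DataFrame({'Age': [22, 38, 26, 35],
                   'Fare': [7.25, 71.28, 7.92, 53.1]})
shuffled = shuffle_attribute_labels(df, 'Age')

# Only 'Age' is permuted; 'Fare' and the original frame are untouched,
# so any drop in accuracy on the shuffled frame is attributable to 'Age'.
```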

To illustrate this let’s run it on the Titanic data set. I created a new variable ‘Spouse’ that records whether or not the spouse of the person is known to have survived or not, and I’d like to know if this is a useful attribute.

```python
import cleantitanic2 as ct

train, test, data_type_dict = ct.clean()
R = RandomForest(4, train, test, data_type_dict, 1, 100, 'Survived', 100,
                 missing=False)
D = R.predict(predict=True, OOBerror=True, variable_importance=True)
prediction_path = 'C:/Documents and Settings/DIGIT/My Documents/Google Drive/Blogs/triangleinequality/Titanic/prediction.csv'
prediction_csv = pd.read_csv(prediction_path)
prediction_csv['Survived'] = D[0]
prediction_csv['Survived'] = prediction_csv['Survived'].map(lambda x: int(x))
prediction_csv.to_csv('C:/Documents and Settings/DIGIT/My Documents/Google Drive/Blogs/triangleinequality/Titanic/rf_test.csv', index=False)
```

Here I used 100 trees, a minimum node size of 1 and a maximum depth of 100. I also chose to look at only 4 variables at each split; a rule of thumb is to use about the square root of the number of variables, although you will want to refine this using cross-validation or the OOB error given by the random forest.

This produced the following graph, and gave an OOB estimate of 0.78942559327 for the classification rate.

As you can see, the variable ‘Spouse’ has negative importance! That means that using it is harmful to the model, since randomly shuffling its labels actually improved things, so I should forget about that one. We also see what we already knew: ‘Title’ and ‘Sex’ are very important. Of course ‘Title’ is really a refinement of the variable ‘Sex’, and so we see one of the problems with this measure: correlated variables! I have updated most of the modules I have written now, so if you want to run this code you should download all of these files; if you have any problems just let me know!