
Random Forest

Today we will learn about the random forest algorithm and how to implement it in Python, using the decision trees we have made before.

History

The random forest algorithm was first introduced by Leo Breiman, and he and Adele Cutler have a great guide to it here.

The Algorithm

The idea is to train many decision trees on bootstrapped samples of your data, and when we grow a node we only consider a small random selection of the variables. We don’t bother pruning the trees; we just trust that the overfitting of each individual tree will be averaged out. This has the advantage of being fast, since only a few variables need to be searched at each split, and stable, since averaging trees grown on resampled training sets reduces the variability you would get from any particular sample. It also gives a rich set of information about the model: each tree has ‘out-of-bag’ (OOB) data that was missed by its resample, and this can be used to test that individual tree. From these tests we can estimate the importance of each variable and the generalisation error of the random forest.
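In outline, and before we get to the real implementation below, the training loop looks roughly like this (just a sketch; bootstrap_sample and grow_unpruned_tree are placeholder names, not functions we define):

#Sketch only: bootstrap_sample and grow_unpruned_tree are placeholder names.
forest = []
for b in range(no_trees):
    sample, out_of_bag = bootstrap_sample(train)  #Resample the training data with replacement.
    tree = grow_unpruned_tree(sample, m)          #Consider only m random variables at each split.
    forest.append(tree)
#To predict: take the majority vote (classification) or the mean (regression) of the trees.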

The best way to explain it is to just do it.

Random Forest in Python
First we’ll create random tree classes, one for the regression case and one for the classification case.

import random #Used for the random variable selection below; tic is the decision tree module we made before.

class ClassRandomTree(tic.ClassificationTree, object):
    def iter_split(self, node): #We override the classification tree's exhaustive search
                                #with a random sample.
        random.shuffle(self.columns) #We shuffle the columns, then select the first few.
        for i in range(self.number_random_vars): #This attribute will be set by another object.
            attribute=self.columns[i]
            for pivot in self.get_pivots(node.local_data, attribute):
                self.fuse_vertex(node)
                self.split_vertex(vertex=node, pivot=pivot,
                                  split_attribute=attribute)
                yield

    def train(self, data, data_type_dict, min_node_size, max_depth,
              number_random_vars, response):
        self.number_random_vars=number_random_vars #The number of variables we consider at each split.
        self.vertices=[]
        self.edges=set([])
        self.leaves=set([])
        self.data=data
        self.data_type_dict=data_type_dict #Tells the algorithm what kinds of variables we have.
        self.response=response
        self.columns=list(data_type_dict.keys())
        if self.response in self.columns: #Never split on the response variable itself.
            self.columns.remove(self.response)
        self.min_node_size=min_node_size
        self.max_node_depth=max_depth
        self.grow_tree() #Grows the tree without pruning.

Now we do exactly the same, but inherit from the regression tree class.

class RegressRandomTree(tic.RegressionTree, object):
    def iter_split(self, node):
        random.shuffle(self.columns)
        for i in range(self.number_random_vars):
            attribute=self.columns[i]
            for pivot in self.get_pivots(node.local_data, attribute):
                self.fuse_vertex(node)
                self.split_vertex(vertex=node, pivot=pivot,
                                  split_attribute=attribute)
                yield

    def train(self,data, data_type_dict, min_node_size, max_depth,
              number_random_vars,response):
        self.number_random_vars=number_random_vars
        self.vertices=[]
        self.edges=set([])
        self.leaves=set([])
        self.data=data
        self.data_type_dict=data_type_dict
        self.response=response
        self.columns=list(data_type_dict.keys())
        if self.response in self.columns:
            self.columns.remove(self.response)
        self.min_node_size=min_node_size
        self.max_node_depth=max_depth
        self.grow_tree()

Now that we can make a tree, we grow a forest. The basic idea is simple, and we add a couple of features to keep track of variable importance and our OOB error.
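The class below calls a helper OOB_sample that draws a bootstrapped sample and also returns the rows that were left out. It isn’t reproduced in this post; a minimal version, assuming the training set is a pandas DataFrame with a unique index and that the third argument says whether to sample with replacement, might look like this:

import numpy as np

def OOB_sample(df, N, replace=True):
    #Draw a bootstrap sample of N rows, and also return the 'out-of-bag'
    #rows that were never picked, so each tree can be tested on unseen data.
    chosen = np.random.choice(df.index.values, size=N, replace=replace)
    sample = df.loc[chosen]
    leftovers = df[~df.index.isin(chosen)]
    return sample, leftovers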

import numpy as np #Used for averaging the OOB errors below.

class RandomForest(object):
    def __init__(self, m, train, test, data_type_dict, min_node_size,
                 max_depth, response, no_trees, missing=False, classification=True):
        self.m=m #Number of variables to consider at each split.
        self.train=train
        self.test=test
        self.data_type_dict=data_type_dict
        self.min_node_size=min_node_size
        self.max_depth=max_depth
        self.response=response
        self.no_trees=no_trees
        self.missing=missing
        self.classification=classification

    def predict(self, predict=False, OOBerror=False,
                variable_importance=False):
        N=len(self.train)
        predictions=[]
        oob_error=[] #Keeps track of the OOB error of each tree.
        mean_oob_error=[] #Keeps track of the mean OOB error of the trees.
        importance_dict ={key:0 for key in self.data_type_dict.keys()} #Will give an importance score
                                                                       #for each variable.
        for tree in range(self.no_trees):
            data,leftovers = OOB_sample(self.train,N, True) #We resample the data.
            if self.classification:
                R=ClassRandomTree(missing=self.missing)
            else:
                R=RegressRandomTree(missing=self.missing)
            R.train(data=data, data_type_dict=self.data_type_dict,
                    min_node_size=self.min_node_size,
                    max_depth=self.max_depth,
                    response=self.response,
                    number_random_vars=self.m)
            if OOBerror or variable_importance:
                if self.classification:
                    oob_error.append(R.test(leftovers)) #Tests tree on unsampled data.
                    mean_oob_error.append(1-np.mean(oob_error))
                else:
                    oob_error.append(R.test(leftovers)/float(len(leftovers)))
                    mean_oob_error.append(np.mean(oob_error))
            if variable_importance:
                for attribute in self.data_type_dict.keys():
                    #To assess a variable's importance, we see how the accuracy of the tree is affected
                    #by randomly permuting that variable's values in the OOB data.
                    shuffled_leftovers=shuffle_attribute_labels(leftovers, attribute)
                    local_class_rate= 1-oob_error[tree] #Unshuffled OOB classification rate.
                    shuffled_rate = 1-R.test(shuffled_leftovers) #Classification rate after shuffling.
                    importance_dict[attribute]+=local_class_rate-shuffled_rate #The drop in accuracy
                                                                               #caused by shuffling is
                                                                               #added to the dict.
            if predict:
                #Now we make our predictions from the tree, and record them so that
                #we can either take the mean or mode of the trees' predictions at the end.
                R.load_new_data(self.test)
                predictions.append(self.test.apply(R.predict, axis=1))
        if predict:
            predictions=tic.combine_predictions(predictions, not self.classification)

        if OOBerror or variable_importance:
            if self.classification:
                print 'Classification rate: ',mean_oob_error[-1]
            else:
                print 'Mean error: ', mean_oob_error[-1]

        if variable_importance:
            #Plot the variable importances; we do a little fussing so that
            #they come out ranked, which is a little awkward with a dictionary.
            importance_dict = {attribute: importance_dict[attribute]/float(self.no_trees)
                               for attribute in importance_dict.keys()}
            D=sorted(importance_dict.items(), key=lambda x: x[1])
            values= [x[1] for x in D]
            keys = [x[0] for x in D]
            import matplotlib.pyplot as plt
            plt.barh(range(len(importance_dict)), values, align='center')
            plt.yticks(range(len(importance_dict)), keys)
            plt.show()

        return [predictions, mean_oob_error, importance_dict, oob_error]

Now I just need to show you the code I used for shuffling the variable labels:

def shuffle_attribute_labels(df, attribute):
    df = df.copy()
    col = list(df[attribute])
    np.random.shuffle(col)
    df[attribute]=col
    return df
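To see what this does, here is a quick check on a made-up DataFrame: only the chosen column is permuted, so any relationship between it and the response is destroyed while the rest of the data is untouched.

import pandas as pd

toy = pd.DataFrame({'Sex': ['male', 'female', 'male', 'female'],
                    'Survived': [0, 1, 1, 1]})
print shuffle_attribute_labels(toy, 'Sex') #The 'Sex' column comes back in a random
                                           #order; 'Survived' is unchanged.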

To illustrate this let’s run it on the Titanic data set. I created a new variable ‘Spouse’ that records whether or not a person’s spouse is known to have survived, and I’d like to know whether this is a useful attribute.

import pandas as pd #Used below to read and write the prediction csv.
import cleantitanic2 as ct

train, test, data_type_dict = ct.clean()
#m=4 variables per split, minimum node size 1, maximum depth 100, 100 trees.
R=RandomForest(4, train, test, data_type_dict, 1, 100, 'Survived', 100, missing=False)
D=R.predict(predict=True, OOBerror=True, variable_importance=True)
prediction_path = 'C:/Documents and Settings/DIGIT/My Documents/Google Drive/Blogs/triangleinequality/Titanic/prediction.csv'
prediction_csv=pd.read_csv(prediction_path)
prediction_csv['Survived']=D[0]
prediction_csv['Survived']=prediction_csv['Survived'].map(lambda x: int(x))
prediction_csv.to_csv('C:/Documents and Settings/DIGIT/My Documents/Google Drive/Blogs/triangleinequality/Titanic/rf_test.csv',
                      index =False)

Here I used 100 trees, a minimum node size of 1 and a maximum depth of 100. I also chose to look at only 4 variables at each split; a rule of thumb is to use roughly the square root of the number of variables, although you will want to refine this using cross-validation or the OOB error given by the random forest, as in the sketch below.
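For example, a rough starting point for m, and a simple way to compare a few values of it using the OOB estimate, might look like this (just a sketch built on the classes above; growing several forests of 100 trees each will take a while):

import math

#Rule-of-thumb starting point: roughly the square root of the number of predictors
#(assuming the response is one of the keys in data_type_dict).
p = len(data_type_dict) - 1
m_start = int(round(math.sqrt(p)))

#Compare the OOB estimate for a few values of m around the starting point.
for m in [max(1, m_start - 1), m_start, m_start + 1]:
    forest = RandomForest(m, train, test, data_type_dict, 1, 100, 'Survived', 100, missing=False)
    forest.predict(OOBerror=True) #Prints the OOB estimate for this choice of m.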
The run above (with m=4) produced the graph below and an estimate of 0.78942559327 for the OOB classification rate (the figure the code prints).
[Figure: variable importance plot]

As you can see, the variable ‘Spouse’ has negative importance! That means that using it is harmful to the model, since randomly shuffling its values actually improved things, so I should forget about that one. We also see what we already knew: that ‘Title’ and ‘Sex’ are very important, although of course ‘Title’ is really a refinement of the variable ‘Sex’, and so we see one of the problems with this measure: correlated variables!

I have updated most of the modules I have written now, so if you want to run this code you should download all of these files; if you have any problems just let me know!
