Ok so now that we made decision trees, an easy algorithm to try is the random forest. This I will talk about next time. It involves OOB-out of bag resampling, and so I thought we could cover that in its own post.
The idea is extremely simple, you have datapoints and you pick points from these randomly, with replacement. If you want to visualise it: you draw a number from a hat, write it down, put it back in the hat and repeat.
This is also sometimes called bootstrapping. This can be used for a variety of purposes, underscoring them all is that you can simulate the variation present in the data, and so get an idea of how your parameter estimates, or models, could have performed given different samples from the same population. Thus if you want to know about the population mean, given your sample, you could create a 100 different resampled data sets from the sample and calculate the sample mean of each. Now you don’t just have a number, your estimate for the mean, you can see from the spread of the means how confident you ought to be in your estimate.
Similarly models can be sensitive to their training data and give inconsistent results, one way to combat this is to train several models on several training sets taken OOB and average their predictions, this reduces the variability of your predictions.
It can also be used for validation, if you resample from datapoints times, you can train on the resample, and test on all the data points you didn’t pick. But how many are there likely to be? I’m glad you asked.
A little probability
Let be the indicator variable for the event , that the datapoint was not picked.
Then the number of datapoints we missed out and can test with is .
Now we’ll take expectations. .
And then .
Putting this together we get .
And so the expected proportion remaining is as .
How cool is that! Euler’s constant just pops up everywhere! It means that as a rule of thumb you’ll have a third of the data left to test with.
A Little Python
The idea is simple, we create of a list of row indexes from our data set, then sample from that list with replacement as many times as desired, we then use these to select the rows to make our resampled dataset.
def OOB_sample(data, no_resamples, return_leftovers=False): data_rows=list(data.index) random_rows=[random.choice(data_rows) for row in range(no_resamples)] if not return_leftovers: return data.ix[random_rows] leftover_rows=filter(lambda x: not x in random_rows, data_rows) return[data.ix[random_rows], data.ix[leftover_rows]]
- To prune or not to prune: cross-validation is the answer (triangleinequality.wordpress.com)