Carolina Bento
2 min read · Jul 19, 2021


Hi Caroline Mendonça Costa,

With Random Forests you don’t need to explicitly split your dataset into training and testing. If you’d still like to do so, you can call the score method on that holdout set and obtain the mean accuracy of the model after training.
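As a sketch of that idea with scikit-learn (the dataset here is a synthetic toy set, just for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Toy classification dataset, purely illustrative
X, y = make_classification(n_samples=200, random_state=42)

# Optional explicit holdout split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# Mean accuracy of the trained model on the holdout set
holdout_accuracy = model.score(X_test, y_test)
```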

It seems like Random Forests use the entire dataset for both training and testing but, in practice, the algorithm creates a holdout dataset for each tree.

Here’s an example:

Say that your dataset is [1,2,3,4,5,6,7,8,9,10] and you want to run a Random Forests model with 4 trees, each one with 4 data points.

In the bootstrapping part of the algorithm, 4 sampled datasets with 4 elements each are created by randomly sampling with replacement from the original dataset.
The datasets used to train the trees could be:

  • Train set for Tree 1: [5, 3, 1, 7],
  • Train set for Tree 2: [10, 8, 4, 3],
  • Train set for Tree 3: [4, 7, 9, 1],
  • Train set for Tree 4: [6, 8, 3, 6]
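The bootstrapping step above can be sketched in plain Python. Note the sampling is with replacement, which is why Tree 4’s sample can contain the value 6 twice (the seed value here is arbitrary, just for reproducibility):

```python
import random

random.seed(1)  # arbitrary seed, only for reproducibility

dataset = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
n_trees = 4
sample_size = 4

# Sampling *with replacement*: the same data point may appear
# more than once in a single tree's training sample
bootstrap_samples = [
    random.choices(dataset, k=sample_size) for _ in range(n_trees)
]
```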

Each tree is trained on only a subset of the original dataset. Implicitly, there are data points each of these trees “has never seen before”, because they were not “picked” to train that tree. Just like in a typical holdout set.

The data points not used to build the tree are called Out-of-Bag observations.

For instance, these are the Out-of-Bag observations for each of the trees mentioned above:

  • Tree 1: [2, 4, 6, 8, 9, 10],
  • Tree 2: [1, 2, 5, 6, 7, 9],
  • Tree 3: [2, 3, 5, 6, 8, 10],
  • Tree 4: [1, 2, 4, 5, 7, 9, 10]

When it comes to the model evaluation step, the Out-of-Bag Error is calculated for each tree using its Out-of-Bag observations, the data points that tree “has never seen before”, and then averaged across all trees.
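In scikit-learn you get this evaluation by setting oob_score=True, which scores each data point using only the trees that did not see it during training (again on a synthetic toy dataset):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy classification dataset, purely illustrative
X, y = make_classification(n_samples=200, random_state=42)

# oob_score=True evaluates each data point with the trees
# whose bootstrap samples did not include it
model = RandomForestClassifier(
    n_estimators=100, oob_score=True, random_state=42
)
model.fit(X, y)

oob_accuracy = model.oob_score_   # mean Out-of-Bag accuracy
oob_error = 1 - oob_accuracy      # Out-of-Bag Error
```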

Thanks for asking 🙂


Written by Carolina Bento

Articles about Data Science and Machine Learning | @carolinabento
