How to work around the 53-category limit in R's randomForest?

6

In R, using the randomForest package, when I run randomForest() I get the following error message:

Error in randomForest.default(m, y, ...) : 
  Can not handle categorical predictors with more than 53 categories.

The factor in question has 57 categories. How can I change this limit or work around this problem?

    
asked by Márcio Mocellin 05.11.2018 at 17:22

1 answer

9

First, ask yourself whether you really need a categorical variable with this many levels. To split on an n-level nominal factor, an exhaustive search for the best split has to consider 2^n - 2 ways of sending a non-empty subset of the levels to one side (that is, 2^(n-1) - 1 distinct partitions, since each one is counted twice). At the 53-level limit, that is about 9.00719925e15 candidate subsets.

If your computer took 0.001 seconds to evaluate each of these, it would need about 285,616 years to finish fitting the model. That is roughly as long as humans have existed as a species on Earth.
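
The arithmetic behind these figures can be checked in a few lines of base R. This is only a sketch: it assumes the 53-level limit, the 0.001 seconds per candidate quoted above, and a 365-day year, and the variable names are purely illustrative.

```r
# Number of levels at the limit reported by the error message.
n_levels <- 53

# Non-empty proper subsets of the levels that could go to one side of a split.
subsets <- 2^n_levels - 2

# Each left/right partition is counted twice above, so halve it.
distinct_splits <- 2^(n_levels - 1) - 1

# Assumed cost per candidate, taken from the text (0.001 s), and a 365-day year.
seconds_per_candidate <- 0.001
seconds_per_year <- 60 * 60 * 24 * 365
years <- subsets * seconds_per_candidate / seconds_per_year

print(format(subsets, scientific = TRUE, digits = 9))
print(round(years))
```

Counting each partition only once (distinct_splits) halves the estimate, which does not change the conclusion.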

So I would start by asking why this variable has so many levels.

  • Could it be a numeric variable that was read in incorrectly? If so, treat it as numbers rather than categories.

  • If the variable is categorical, can it be treated as an ordinal variable? If so, random forest can handle it much faster, because an ordered variable with n levels has only n - 1 possible split points.

  • If the variable is nominal categorical, can it be simplified into fewer categories? For example, if they are countries of the world, is it possible to create a new variable called continent that will only have 6 levels?

  • If the variable is nominal categorical and not all of its levels are well represented, would it be possible to combine the low-frequency levels into a new level called Other?

These are the standard approaches I know of for a problem like this. It will not be possible to fit this model without turning this variable into something simpler.
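
The suggestions above could look roughly like this in base R. It is a minimal sketch rather than a definitive recipe: lump_rare_levels() is a hypothetical helper written for this example (not part of randomForest), the data are made up, and how many levels to keep is a choice you would have to make for your own variable.

```r
# Hypothetical helper: keep the `keep` most frequent levels of a factor and
# merge every other level into a single level named `other`.
lump_rare_levels <- function(x, keep = 10, other = "Other") {
  x <- as.factor(x)
  counts <- sort(table(x), decreasing = TRUE)
  top <- names(counts)[seq_len(min(keep, length(counts)))]
  values <- as.character(x)
  lumped <- ifelse(values %in% top, values, other)
  lumped[is.na(values)] <- NA  # keep missing values missing
  factor(lumped)
}

# Toy data (made up): 57 levels, 10 frequent ones and 47 that occur once.
level_names <- sprintf("c%02d", 1:57)
region <- factor(rep(level_names, times = c(rep(20, 10), rep(1, 47))))
print(nlevels(region))        # 57, above the limit

region_small <- lump_rare_levels(region, keep = 10)
print(nlevels(region_small))  # 11: the 10 frequent levels plus "Other"

# Numbers that were read in as a factor: convert through character first,
# because as.numeric() on a factor returns the internal level codes.
misread <- factor(c("10", "2.5", "300"))
as_numbers <- as.numeric(as.character(misread))

# A categorical variable with a natural order can be declared as ordered.
size <- factor(c("low", "high", "medium", "low"),
               levels = c("low", "medium", "high"), ordered = TRUE)
print(is.ordered(size))
```

The simplified column (region_small here) would then replace the original factor in the data passed to randomForest(). Whether an ordered factor is handled the way you expect is worth checking against the package documentation.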

        
    answered by 05.11.2018 / 17:54