How to stratify / divide a data.frame according to categories of a variable in R?


I'm running a linear regression model in R and would like to perform a stratified analysis by the categories of a variable X, which has 4 categories (X1, X2, X3 and X4).

I thought of splitting the data.frame by the categories of X, so that I would have 4 data.frames and could run the same model on each one.

I tried the following:

X1=data.frame[which(data.frame$X==1), ]

but the result was a data.frame X1 with 0 observations (rows), even though the column names appear.

What do you suggest to fix this? Thanks.

    
asked by anonymous 08.03.2017 / 20:42

1 answer


There are several ways to run multiple regressions by category in R. I'll show how to do it with base R functions and with dplyr. As an example, we will use the mtcars dataset.

Suppose you want to run the mpg ~ disp + hp regression for each level of the variable cyl of mtcars (there are 3 categories).

First of all, you can use the split() function to build a list with three different data.frames, one for each category:

data.frame_por_categoria <- split(mtcars, mtcars$cyl)
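
To check what split() produced before fitting anything, a quick inspection like the sketch below may help. It only uses base R and the built-in mtcars data; nothing here goes beyond the example above.

```r
# Split the built-in mtcars data by number of cylinders
data.frame_por_categoria <- split(mtcars, mtcars$cyl)

# One list element per level of cyl, named after the level
names(data.frame_por_categoria)

# Number of rows in each subset
sapply(data.frame_por_categoria, nrow)
```

Each element of the list is an ordinary data.frame with the same columns as mtcars, so it can be passed straight to lm().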

Now, just use lapply() to fit the regression to each data.frame:

modelos <- lapply(data.frame_por_categoria, function(x) lm(mpg ~ disp + hp, data = x))

The result, modelos, is a list with the three regressions. To access the first model:

modelos[[1]]
Call:
lm(formula = mpg ~ disp + hp, data = x)

Coefficients:
(Intercept)         disp           hp  
   43.04006     -0.11954     -0.04609  
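
If the goal is to compare the strata side by side rather than inspect one model at a time, something like the following sketch could be used. The name coeficientes is introduced here only for illustration; the rest repeats the steps above.

```r
# Rebuild the list of models so that this block runs on its own
data.frame_por_categoria <- split(mtcars, mtcars$cyl)
modelos <- lapply(data.frame_por_categoria,
                  function(x) lm(mpg ~ disp + hp, data = x))

# Coefficient matrix: one row per term, one column per level of cyl
coeficientes <- sapply(modelos, coef)
coeficientes

# R-squared of each stratified model
sapply(modelos, function(m) summary(m)$r.squared)
```

Because every model has the same terms, sapply() simplifies the result to a matrix, which makes the differences between categories easy to read.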

You can also do the same with the dplyr package.

You have to group by category and then use the do() function to run the regression, putting a dot (.) where the data.frame would go:

library(dplyr)
resultado <- mtcars %>% group_by(cyl) %>% do(modelo = lm(mpg ~ disp + hp, data = .))

The result, resultado, is a data.frame with a column named modelo, and each element of that column is a regression. To access the first model:

resultado$modelo[[1]]
Call:
lm(formula = mpg ~ disp + hp, data = .)

Coefficients:
(Intercept)         disp           hp  
   43.04006     -0.11954     -0.04609  
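
As for the empty data.frame in the question, the original data is not shown, so the cause can only be guessed. One plausible explanation is that X stores labels such as "X1" rather than the number 1, in which case X == 1 matches no rows. The sketch below uses a made-up data.frame called dados (hypothetical, not the asker's data) to illustrate that assumption.

```r
# Hypothetical data: X holds the labels "X1".."X4", not the numbers 1..4
dados <- data.frame(
  X = rep(c("X1", "X2", "X3", "X4"), each = 5),
  y = seq_len(20),
  z = rep(c(2, 7, 1, 8, 3), times = 4)
)

# Comparing the labels with the number 1 matches nothing:
# the result has 0 rows, but the column names are kept
nrow(dados[which(dados$X == 1), ])

# Inspect the values actually stored in X before filtering
table(dados$X)

# Comparing with the stored label selects the expected rows
nrow(dados[dados$X == "X1", ])

# split() avoids hard-coding the category values altogether
sapply(split(dados, dados$X), nrow)
```

If table() on the real column shows values different from the ones used in the comparison, that would explain the 0 rows; using split() sidesteps the problem either way.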
    
09.03.2017 / 02:34