List-columns are a data structure that comes in handy at various points when working with the tidyverse. They are mainly used as intermediate structures.
They can also be used in base R, but you have to wrap the list in the I() function to keep data.frame() from throwing an error. Example:
data.frame(idade = 1:5, nome = letters[1:5], lista = I(lapply(1:5, rnorm)))
idade nome lista
1 1 a 0.178046....
2 2 b 0.407768....
3 3 c -0.84749....
4 4 d -0.44864....
5 5 e 1.229863....
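To make the role of I() concrete, here is a minimal base R sketch. It assumes nothing beyond base R; the object names `err` and `df_base` are illustrative and not part of the original example.

```r
# Without I(), data.frame() tries to expand the list into separate
# columns; their lengths differ (1 to 5), so the call fails.
err <- tryCatch(
  data.frame(idade = 1:5, lista = lapply(1:5, rnorm)),
  error = function(e) conditionMessage(e)
)

# With I(), the list is kept "as is" as a single column.
df_base <- data.frame(idade = 1:5, lista = I(lapply(1:5, rnorm)))

# Each cell of the list-column is an ordinary numeric vector.
tamanhos <- vapply(df_base$lista, length, integer(1))
terceiro <- df_base$lista[[3]]  # numeric vector of length 3
```

Because each cell is a regular vector, the usual `[[` indexing and the apply family work on the column just as they would on any list.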
A good illustration of list-columns is a vectorized function that returns more than one value per element, used inside a mutate(). For example:
library(tidyverse)

df <- tribble(
  ~x1,
  "a,b,c",
  "d,e,f,g"
)

# str_split() returns one character vector per row, so x2 is a list-column
df %>%
  mutate(x2 = stringr::str_split(x1, ","))
#> # A tibble: 2 x 2
#> x1 x2
#> <chr> <list>
#> 1 a,b,c <chr [3]>
#> 2 d,e,f,g <chr [4]>
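As a small sketch of why the list-column works well as an intermediate structure: each element of x2 is an ordinary character vector, so the purrr map functions can summarize it row by row. The column names `n_pieces` and `first_piece` are illustrative assumptions, not part of the original example.

```r
library(dplyr)
library(purrr)
library(stringr)

df <- tibble(x1 = c("a,b,c", "d,e,f,g"))

res <- df %>%
  mutate(
    x2 = str_split(x1, ","),          # list-column of character vectors
    n_pieces = map_int(x2, length),   # number of pieces in each row
    first_piece = map_chr(x2, 1)      # first piece of each row
  )
```

The typed variants (`map_int()`, `map_chr()`) return plain atomic columns, so the result goes back to being a regular data frame column without any unnesting.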
Next, it is common to flatten the data frame with the unnest() function from tidyr:
# Name the column to unnest; calling unnest() with no columns is deprecated
df %>%
  mutate(x2 = stringr::str_split(x1, ",")) %>%
  unnest(x2)
#> # A tibble: 7 x 2
#> x1 x2
#> <chr> <chr>
#> 1 a,b,c a
#> 2 a,b,c b
#> 3 a,b,c c
#> 4 d,e,f,g d
#> 5 d,e,f,g e
#> 6 d,e,f,g f
#> # ... with 1 more row
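The reverse direction can be sketched as well. Assuming the same data, grouping the long table and wrapping the values in list() inside summarise() rebuilds a list-column; this is one possible way to do it, and the names `long` and `back` are illustrative.

```r
library(dplyr)
library(tidyr)
library(stringr)

df <- tibble(x1 = c("a,b,c", "d,e,f,g"))

# Long format: one row per piece
long <- df %>%
  mutate(x2 = str_split(x1, ",")) %>%
  unnest(x2)

# Back to one row per original string, with x2 as a list-column again
back <- long %>%
  group_by(x1) %>%
  summarise(x2 = list(x2))
```

This round trip is why list-columns are convenient as a temporary step: you can expand, work on the long table, and collapse again without losing the link to the original row.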
There are many other interesting use cases. Another example that I really like is the one enabled by the rsample package:
library(tidyverse)
library(rsample)
vfold_cv(mtcars, v = 5) %>%
  mutate(
    modelos = map(splits, ~lm(mpg ~ ., data = analysis(.x))),
    mse = map2_dbl(
      modelos, splits,
      ~mean((assessment(.y)$mpg - predict(.x, assessment(.y)))^2)
    )
  )
# 5-fold cross-validation
# A tibble: 5 x 4
splits id modelos mse
* <list> <chr> <list> <dbl>
1 <split [25/7]> Fold1 <S3: lm> 40.4
2 <split [25/7]> Fold2 <S3: lm> 5.99
3 <split [26/6]> Fold3 <S3: lm> 9.11
4 <split [26/6]> Fold4 <S3: lm> 11.6
5 <split [26/6]> Fold5 <S3: lm> 21.3
In the example above we fit a model for each fold of the cross-validation and then compute the mean squared error on the observations held out of each fold.
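To see what the two map calls do, it may help to unroll them for a single fold. The sketch below assumes rsample is installed and uses illustrative names (`folds`, `s`, `fit`, `held_out`, `mse_fold1`); the seed is arbitrary and only makes the split reproducible.

```r
library(rsample)

set.seed(1)  # arbitrary seed, so the folds are reproducible
folds <- vfold_cv(mtcars, v = 5)

# Each element of the `splits` list-column is one split object
s <- folds$splits[[1]]

# Fit on the analysis portion of the fold
fit <- lm(mpg ~ ., data = analysis(s))

# Evaluate on the held-out (assessment) portion
held_out <- assessment(s)
mse_fold1 <- mean((held_out$mpg - predict(fit, held_out))^2)
```

The pipeline above repeats exactly these steps for every row: map() stores each fitted model in the `modelos` list-column, and map2_dbl() walks over models and splits together to produce one numeric error per fold.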