What are list-columns in a data.frame?

5

The tidyverse encourages the use of list-columns in data.frames. But, after all,

  • What are list-columns?

  • On what occasions are they commonly used?

  • Can they be created in base R, or only in tibbles?

For example,

data.frame(idade = 1:5, nome = letters[1:5], lista = lapply(1:5, rnorm))
  

Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, :
  arguments imply differing number of rows: 1, 2, 3, 4, 5

tibble::tibble(idade = 1:5, nome = letters[1:5], lista = lapply(1:5, rnorm))
# A tibble: 5 x 3
  idade nome  lista    
  <int> <chr> <list>   
1     1 a     <dbl [1]>
2     2 b     <dbl [2]>
3     3 c     <dbl [3]>
4     4 d     <dbl [4]>
5     5 e     <dbl [5]>
    
asked by anonymous 26.12.2018 / 14:31

1 answer

5

List-columns are data.frame columns that are lists rather than atomic vectors, so each cell can hold an object of any type or length. They are useful in many situations when working with the tidyverse, mainly as intermediate structures.

They can also be used in base R, but you have to wrap the list in the I() function so that data.frame() does not throw an error. Example:

data.frame(idade = 1:5, nome = letters[1:5], lista = I(lapply(1:5, rnorm)))

  idade nome        lista
1     1    a 0.178046....
2     2    b 0.407768....
3     3    c -0.84749....
4     4    d -0.44864....
5     5    e 1.229863....
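As a small complement, here is a minimal sketch of another base R route that should work without I(): build the data.frame first and then assign the list as a new column. The object name df_base is only illustrative and is not part of the question.

```r
# Base R only: create the data.frame, then attach the list as a column.
df_base <- data.frame(idade = 1:5, nome = letters[1:5])
df_base$lista <- lapply(1:5, rnorm)  # one list element per row

# Each cell of the list-column holds a whole vector.
class(df_base$lista)     # the column is a plain list
lengths(df_base$lista)   # number of values stored in each row
df_base$lista[[3]]       # the vector stored in row 3
```

The assignment works because the list has exactly one element per row; a list of a different length would be rejected or recycled.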

A good illustration of list-columns is a vectorized function that returns more than one value per element, used inside mutate. For example:

library(tidyverse)  # provides tribble, mutate, %>% and unnest

df <- tribble(
  ~x1,
  "a,b,c", 
  "d,e,f,g"
) 

df %>% 
  mutate(x2 = stringr::str_split(x1, ","))
#> # A tibble: 2 x 2
#>   x1      x2       
#>   <chr>   <list>   
#> 1 a,b,c   <chr [3]>
#> 2 d,e,f,g <chr [4]>

Next, it is common to flatten the data.frame with the unnest function from tidyr:

df %>% 
  mutate(x2 = stringr::str_split(x1, ",")) %>% 
  unnest(x2)  # name the column explicitly; tidyr >= 1.0.0 warns on a bare unnest()
#> # A tibble: 7 x 2
#>   x1      x2   
#>   <chr>   <chr>
#> 1 a,b,c   a    
#> 2 a,b,c   b    
#> 3 a,b,c   c    
#> 4 d,e,f,g d    
#> 5 d,e,f,g e    
#> 6 d,e,f,g f    
#> # ... with 1 more row
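To make clearer what the unnest step does, here is a rough base R sketch of the same idea, under the assumption that we only need to flatten one list-column. The names x1, x2 and flat are illustrative; this is not how tidyr implements it.

```r
# Split each string into a character vector: the result is a list,
# i.e. what would become the list-column.
x1 <- c("a,b,c", "d,e,f,g")
x2 <- strsplit(x1, ",", fixed = TRUE)

# "Unnest" by hand: repeat each original value once per element of its
# list entry, and flatten the list into a single vector.
flat <- data.frame(
  x1 = rep(x1, lengths(x2)),
  x2 = unlist(x2),
  stringsAsFactors = FALSE
)
flat
```

Each row of the original data is repeated as many times as its list entry has elements, which is why the two input rows become seven.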

There are many other interesting use cases. Another example that I really like is the one produced by the rsample package:

library(tidyverse)
library(rsample)

vfold_cv(mtcars, v = 5) %>% 
  mutate(
    modelos = map(splits, ~lm(mpg ~ ., data = analysis(.x))),
    mse = map2_dbl(modelos, splits, ~mean((assessment(.y)$mpg - predict(.x, assessment(.y)))^2))
    )

#  5-fold cross-validation 
# A tibble: 5 x 4
  splits         id    modelos    mse
* <list>         <chr> <list>   <dbl>
1 <split [25/7]> Fold1 <S3: lm> 40.4 
2 <split [25/7]> Fold2 <S3: lm>  5.99
3 <split [26/6]> Fold3 <S3: lm>  9.11
4 <split [26/6]> Fold4 <S3: lm> 11.6 
5 <split [26/6]> Fold5 <S3: lm> 21.3 

In the example above we fit a model for each fold of the cross-validation and then compute the mean squared error on the observations held out in each fold.
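The same pattern of keeping fitted models in a list-column can be sketched without any packages. The example below is only an illustration under simplifying assumptions: it groups mtcars by cyl instead of using cross-validation folds, and the names por_cyl, res, modelo and inclinacao are hypothetical.

```r
# One group of rows per number of cylinders.
por_cyl <- split(mtcars, mtcars$cyl)

# One row per group, with the fitted model stored in a list-column.
res <- data.frame(cyl = names(por_cyl), stringsAsFactors = FALSE)
res$modelo <- unname(lapply(por_cyl, function(d) lm(mpg ~ wt, data = d)))

# Summarise each stored model into an ordinary numeric column.
res$inclinacao <- vapply(res$modelo, function(m) coef(m)[["wt"]], numeric(1))

res[, c("cyl", "inclinacao")]
```

As in the rsample example, the list-column is an intermediate structure: the models stay next to the rows they belong to until they are reduced to a plain column.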

    
26.12.2018 / 15:02