I find the following approach more concise for what you need.
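To make the examples reproducible, I'll assume an input list like the one below (a reconstruction based on the output shown further down; the question's actual data may differ):
lista <- list(
  num = list(1:10, 1:10, 1:10),
  chr = list(letters, letters, letters)
)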
library(purrr) # for the map function
library(tidyr) # for the unnest function
library(dplyr) # for the as_data_frame function
map(lista, ~map(.x, ~.x[1:10])) %>%
as_data_frame() %>%
unnest()
The result is this:
# A tibble: 30 × 2
num chr
<int> <chr>
1 1 a
2 2 b
3 3 c
4 4 d
5 5 e
6 6 f
7 7 g
8 8 h
9 9 i
10 10 j
# ... with 20 more rows
Another way, which also turns out quite nicely, is:
lista %>%
as_data_frame() %>%
mutate(chr = map(chr, ~.x[1:10])) %>%
unnest()
This second version uses list columns, that is, columns of a data.frame that are themselves lists. They are being widely used and have been popularized by Hadley Wickham; see the list-columns chapter of R for Data Science.
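A quick way to inspect the list-column structure before unnesting, using the assumed lista from above:
df <- as_data_frame(lista) # a 3-row tibble with two list columns
df$chr[[1]]                # each cell holds an entire character vector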
In the list-column example I modified only the chr column (unnest() requires the list columns being unnested to have matching lengths within each row, and num already holds length-10 vectors), but you could modify all the columns using:
lista %>%
as_data_frame() %>%
mutate_all(funs(map(., ~.x[1:10]))) %>%
unnest()
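A side note for readers on newer package versions: funs() and as_data_frame() have since been deprecated. With dplyr >= 1.0 and tidyr >= 1.0, an equivalent pipeline would be:
lista %>%
  as_tibble() %>%
  mutate(across(everything(), ~map(.x, ~.x[1:10]))) %>%
  unnest(cols = everything())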
Complementing Tomás's Benchmark
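The calls below use pegar_elem from Tomás's answer, which I won't repeat here; as a rough sketch (an assumption on my part, not his exact code), it extracts the given positions from each element with a for loop:
library(microbenchmark) # for the timings below
pegar_elem <- function(l, pos) {
  res <- vector("list", length(l))
  for (i in seq_along(l)) {
    res[[i]] <- l[[i]][pos] # keep only the requested positions
  }
  res
}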
> lista <- list(
+ num = lapply(1:10, function(x) sample(1:100, 20)),
+ chr = lapply(1:10, function(x) sample(letters, 20))
+ )
> microbenchmark(
+ solucao_tomas = {as.data.frame(sapply(lapply(lista, pegar_elem, 1:10), unlist))},
+ solucao_daniel = {unnest(as_data_frame(map(lista, ~map(.x, ~.x[1:10]))))}
+ )
Unit: microseconds
expr min lq mean median uq max neval
solucao_tomas 419.026 439.375 466.7568 454.947 476.889 695.780 100
solucao_daniel 2456.108 2559.625 2745.8009 2680.130 2836.733 4466.647 100
> lista <- list(
+ num = lapply(1:1000, function(x) sample(1:100, 20)),
+ chr = lapply(1:1000, function(x) sample(letters, 20))
+ )
> microbenchmark(
+ solucao_tomas = {as.data.frame(sapply(lapply(lista, pegar_elem, 1:10), unlist))},
+ solucao_daniel = {unnest(as_data_frame(map(lista, ~map(.x, ~.x[1:10]))))}
+ )
Unit: milliseconds
expr min lq mean median uq max neval
solucao_tomas 13.559905 14.15854 14.64829 14.56517 14.83060 16.89264 100
solucao_daniel 9.871144 10.27053 11.07952 10.80652 11.29402 19.82793 100
> lista <- list(
+ num = lapply(1:10000, function(x) sample(1:100, 20)),
+ chr = lapply(1:10000, function(x) sample(letters, 20))
+ )
> microbenchmark(
+ solucao_tomas = {as.data.frame(sapply(lapply(lista, pegar_elem, 1:10), unlist))},
+ solucao_daniel = {unnest(as_data_frame(map(lista, ~map(.x, ~.x[1:10]))))}
+ )
Unit: milliseconds
expr min lq mean median uq max neval
solucao_tomas 156.63202 171.06855 195.3683 180.86325 227.1462 271.7314 100
solucao_daniel 80.93934 91.22597 100.5079 96.73947 104.7544 154.6254 100
That is, when the list is small, Tomás's solution using for is more efficient, but the difference is on the order of microseconds (efficiency matters little when objects are small). As the objects grow, the solution using purrr, dplyr, and tidyr becomes more efficient: with lists of size 10,000 it is about 2x faster. In other words, this solution is efficient when it needs to be, that is, when the objects get large.