Efficiently select the first rows of each group

Suppose I have the following data frame:

set.seed(100)
base <- expand.grid(grupo = c("a", "b", "c", "d"), score = runif(100))

I want to select the rows with the lowest scores in each group, where the number of rows to keep per group is given by the table below:

qtds <- data.frame(grupo = levels(base$grupo), qtd = c(1, 2, 3, 4))
qtds

  grupo qtd
1     a   1
2     b   2
3     c   3
4     d   4

That is, I want to select the row with the lowest score in group a, the two rows with the lowest scores in group b, and so on.

At the moment, I'm doing this:

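library(dplyr)

novaBase <- data.frame()
for(i in levels(base$grupo)){
  # keep the qtd lowest-scoring rows of group i and append them to the result
  novaBase <- rbind(novaBase,
                    base %>%
                      filter(grupo == i) %>%
                      filter(row_number(score) <= qtds$qtd[qtds$grupo == i])
                    )
}
novaBase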

   grupo        score
1      a 0.0003950703
2      b 0.0003950703
3      b 0.0039051792
4      c 0.0003950703
5      c 0.0221628349
6      c 0.0039051792
7      d 0.0269371939
8      d 0.0003950703
9      d 0.0221628349
10     d 0.0039051792

This works, but it seems very inefficient, and the code is hard to understand. Does anyone know a better way?

r data.frame dplyr

asked by anonymous 28.01.2015 / 13:50

1 answer

score 2 · Accepted Answer

One way to do it with dplyr is to merge the counts into the data first, then sort by score and slice within each group:

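library(dplyr)

# attach each group's qtd to its rows (merge joins on the shared column grupo)
base2 <- merge(base, qtds)

# sort by score, then keep the first qtd rows of each group
base2 %>% group_by(grupo) %>% arrange(score) %>% slice(1:unique(qtd))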
Source: local data frame [10 x 3]
Groups: grupo

   grupo      score qtd
1      a 0.03014575   1
2      b 0.03014575   2
3      b 0.03780258   2
4      c 0.03014575   3
5      c 0.03780258   3
6      c 0.05638315   3
7      d 0.03014575   4
8      d 0.03780258   4
9      d 0.05638315   4
10     d 0.09151028   4
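
As a possible variant, the count table can be joined in with dplyr's inner_join and the rows filtered by rank inside each group, reusing the row_number(score) test from the question. This is only a sketch: it assumes a dplyr version that supports arrange(.by_group = TRUE), and the as.character() conversions are an added precaution so that the join key has the same type on both sides.

```r
library(dplyr)

# Same example data as in the question
set.seed(100)
base <- expand.grid(grupo = c("a", "b", "c", "d"), score = runif(100))
qtds <- data.frame(grupo = levels(base$grupo), qtd = c(1, 2, 3, 4))

# Assumption: convert the key to character on both sides so the join
# does not have to reconcile a factor with a character column
base$grupo <- as.character(base$grupo)
qtds$grupo <- as.character(qtds$grupo)

novaBase <- base %>%
  inner_join(qtds, by = "grupo") %>%     # add qtd to every row of its group
  group_by(grupo) %>%
  filter(row_number(score) <= qtd) %>%   # rank by score within the group
  arrange(score, .by_group = TRUE) %>%   # lowest score first inside each group
  ungroup()

novaBase
```

Unlike slice(1:unique(qtd)), the filter also behaves sensibly if some group has qtd equal to 0, since no rank is less than or equal to 0 and the group is simply dropped.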