Efficiently selecting the first rows of each group

Suppose I have the following data frame:

library(dplyr)  # provides %>%, filter() and row_number() used below
set.seed(100)
base <- expand.grid(grupo = c("a", "b", "c", "d"), score = runif(100))

I want to select the rows with the lowest scores in each group, where the number of rows to keep per group is given by the table below:

qtds <- data.frame(grupo = levels(base$grupo), qtd = c(1, 2, 3, 4))
qtds

  grupo qtd
1     a   1
2     b   2
3     c   3
4     d   4

That is, I want to select the row with the lowest score in group a, the two rows with the lowest scores in group b, and so on.

At the moment, I'm doing this:

novaBase <- data.frame()
for(i in levels(base$grupo)){
  novaBase <- rbind(novaBase,
                    base %>% 
                      filter(grupo == i) %>% 
                      filter(row_number(score) <= qtds$qtd[qtds$grupo == i])
                    )
}
novaBase

   grupo        score
1      a 0.0003950703
2      b 0.0003950703
3      b 0.0039051792
4      c 0.0003950703
5      c 0.0221628349
6      c 0.0039051792
7      d 0.0269371939
8      d 0.0003950703
9      d 0.0221628349
10     d 0.0039051792

This works, but it seems very inefficient, and the code is hard to understand. Does anyone know a better way?

    
asked by anonymous 28.01.2015 / 13:50

1 answer

One way to do this with dplyr is to merge the quantities into the data and then slice each group:

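# Attach each group's quantity (qtd) to its rows
base2 <- merge(base, qtds)

# Sort by score within each group, then keep the first qtd rows of each group
base2 %>% arrange(grupo, score) %>% group_by(grupo) %>% slice(seq_len(first(qtd)))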
Source: local data frame [10 x 3]
Groups: grupo

   grupo      score qtd
1      a 0.03014575   1
2      b 0.03014575   2
3      b 0.03780258   2
4      c 0.03014575   3
5      c 0.03780258   3
6      c 0.05638315   3
7      d 0.03014575   4
8      d 0.03780258   4
9      d 0.05638315   4
10     d 0.09151028   4
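
The same idea can be written without slice(), by ranking the scores inside each group and filtering on that rank. The following is only a sketch of that variation, not the answerer's code: it rebuilds the example data with stringsAsFactors = FALSE so the join key is plain character in both tables, and the name resultado is introduced here purely for illustration.

```r
library(dplyr)

# Rebuild the example data from the question (character keys, an assumption
# made here so that both tables join on the same column type)
set.seed(100)
base <- expand.grid(grupo = c("a", "b", "c", "d"), score = runif(100),
                    stringsAsFactors = FALSE)
qtds <- data.frame(grupo = c("a", "b", "c", "d"), qtd = c(1, 2, 3, 4),
                   stringsAsFactors = FALSE)

resultado <- base %>%
  inner_join(qtds, by = "grupo") %>%     # add each group's quantity to its rows
  group_by(grupo) %>%
  filter(row_number(score) <= qtd) %>%   # rank scores within the group, keep the lowest qtd
  ungroup() %>%
  arrange(grupo, score)                  # order the result for display

# Number of rows kept per group
table(resultado$grupo)
```

Because row_number() breaks ties by position, each group contributes exactly qtd rows even when scores repeat. If tied rows should all be kept, min_rank() could be used in place of row_number().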
    
28.01.2015 / 14:34