How to consolidate (aggregate or group) the values in a database?

15

Suppose I have the following database

vendas<-c(100,140,200,300,20,1000,200,3000)
vendedor<-c("A","B","A","B","C","C","D","A")
regiao<-c("Norte","Sul","Leste","Norte","Sul","Norte","Leste","Sul")
df<-data.frame(vendedor,regiao,vendas)

And want to look at total sales by seller and by seller / region.

How do I generate this new database with aggregate data for analysis?

    
asked by anonymous 27.02.2014 / 02:36

5 answers

14

Hadley recently created dplyr , a much faster version with more intuitive syntax than plyr . (links for CRAN and for the RStudio blog post )

No dplyr would look like this

library(dplyr)
group_by(df,vendedor)%>%summarise(Total=sum(vendas))
  vendedor Total
1        A  3300
2        B   440
3        C  1020
4        D   200

And grouping by seller and region

group_by(df,vendedor, regiao)%>%summarise(Total=sum(vendas))
   vendedor regiao Total
1        A  Leste   200
2        A  Norte   100
3        A    Sul  3000
4        B  Norte   300
5        B    Sul   140
6        C  Norte  1000
7        C    Sul    20
8        D  Leste   200

Editing: The most recent version of dplyr uses the %>% operator of magrittr .

    
28.02.2014 / 01:29
9

Another option is to use the data.table package:

df <- data.table(df)
df[,sum(vendas), by=vendedor]
df[,sum(vendas), by=data.table(vendedor, regiao)]

Packages like plyr and data.table have been created to make this (and other) types of analysis easier and faster, so if your bank is large or you have to do a lot of grouping, it might be worth having a study in them!

    
27.02.2014 / 20:05
5

You can do this with the function aggregate

aggregate(vendas ~ vendedor,
          data=df,
          FUN=sum)

  vendedor vendas
1        A   3300
2        B    440
3        C   1020
4        D    200

aggregate(vendas ~ vendedor + regiao,
          data=df,
          FUN=sum)

  vendedor regiao vendas
1        A  Leste    200
2        D  Leste    200
3        A  Norte    100
4        B  Norte    300
5        C  Norte   1000
6        A    Sul   3000
7        B    Sul    140
8        C    Sul     20

Another option is to use the package plyr

library(plyr)

ddply(df,
      c("vendedor"),
      summarise,
      total = sum(vendas))

  vendedor total
1        A  3300
2        B   440
3        C  1020
4        D   200

ddply(df,
      c("vendedor", "regiao"),
      summarise,
      total = sum(vendas))

  vendedor regiao total
1        A  Leste   200
2        A  Norte   100
3        A    Sul  3000
4        B  Norte   300
5        B    Sul   140
6        C  Norte  1000
7        C    Sul    20
8        D  Leste   200
    
27.02.2014 / 19:21
4

You can also use tapply :

By seller:

with(df, tapply(vendas, vendedor, sum))
 A    B    C    D 
3300  440 1020  200 

By seller and region:

with(df, tapply(vendas, list(vendedor, regiao), sum))
  Leste Norte  Sul
A   200   100 3000
B    NA   300  140
C    NA  1000   20
D   200    NA   NA
    
21.09.2014 / 19:12
3

Another option is to use sqldf

library(sqldf)
sqldf("SELECT vendedor, sum(vendas) from df group by vendedor")
sqldf("SELECT vendedor, regiao, sum(vendas) from df group by vendedor, regiao")
    
03.04.2014 / 21:15