Calculate average distance traveled with single tapply

2

I have a data set from the hflights package in R, which lists a number of flights in the US. I need to calculate the average distance traveled (Distance) for each day of the week (variable DayOfWeek), separately for flights with a delay of more than 60 and for those with a delay of less than 60 (variable DepDelay). A single tapply must be used.

I've tried something like this, but it's wrong:

y=(c(which(sapply(dados,is.numeric))))
y
apply(as.matrix(y),1,function(x){tapply(dados[,x],list(dados$DayofWeek,dados$DepDelay>60),mean)})
    
asked by anonymous 28.11.2014 / 04:49

3 answers

2

In my opinion, the most elegant way to do this is to use the dplyr and tidyr packages. Code using these functions is much simpler to read. Worth learning!

library(dplyr)
library(tidyr)

hflights %>% 
  filter(!is.na(DepDelay)) %>% # drop flights with a missing departure delay
  mutate(DepDelay2 = ifelse(DepDelay>60, ">60", "<=60")) %>% # label delays above 60
  group_by(DayOfWeek, DepDelay2) %>% # group by day and delay label
  summarise(media = mean(Distance)) %>% # aggregate with the mean
  spread(DepDelay2, media) # spread the labels into separate columns

# Source: local data frame [7 x 3]
# 
#   DayOfWeek     <=60      >60
# 1         1 783.1453 796.0078
# 2         2 778.1595 796.8847
# 3         3 779.2828 816.0626
# 4         4 782.5144 816.7166
# 5         5 785.8960 790.0505
# 6         6 823.0040 821.7922
# 7         7 797.2524 803.0468
    
28.11.2014 / 12:17
1

I do not know how to do this for both groups (DepDelay > 60 and DepDelay < 60) using a single tapply, but I would do it separately for each group:

tapply(X = hflights[which(hflights$DepDelay > 60),"Distance"], INDEX = hflights[which(hflights$DepDelay > 60),"DayOfWeek"], FUN = mean)
tapply(X = hflights[which(hflights$DepDelay < 60),"Distance"], INDEX = hflights[which(hflights$DepDelay < 60),"DayOfWeek"], FUN = mean)

Note that there are also 232 cases with DepDelay == 60, which neither of the two calls above covers. Keep that in mind if you want to consider all (non-canceled) flights in the data set.

EDITED  
Here is a (not very elegant) way of doing this with a single tapply:

tapply(X = hflights$Distance, INDEX = list(hflights$DayOfWeek, ifelse(test = hflights$DepDelay > 60, yes = "> 60", no = "<= 60")), FUN = mean)

The only problem is that this way you need to include the 232 cases with DepDelay == 60 in one of the two groups (in my code I put them in the second group, labeled "<= 60").
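
To make the shape of the result easier to see, here is a minimal sketch on a made-up data frame. The name toy and all of its values are hypothetical, only the column names are borrowed from hflights. It shows that passing two grouping vectors in INDEX gives a matrix with one row per day and one column per delay group, and where a delay of exactly 60 ends up:

```r
# Hypothetical toy data standing in for hflights (all values are made up).
toy <- data.frame(
  DayOfWeek = c(1, 1, 1, 2, 2, 2),
  DepDelay  = c(10, 60, 90, 61, 5, NA),
  Distance  = c(100, 200, 300, 400, 500, 600)
)

# DepDelay == 60 fails the `> 60` test, so it gets the "<= 60" label.
# A missing DepDelay gives an NA label, and tapply leaves that row out.
delay_group <- ifelse(toy$DepDelay > 60, "> 60", "<= 60")

# Two grouping vectors in INDEX: rows are days, columns are delay groups.
res <- tapply(X = toy$Distance,
              INDEX = list(toy$DayOfWeek, delay_group),
              FUN = mean)
print(res)
```

In this toy case the cell for day 1 and "<= 60" is the mean of 100 and 200, because the flight with a delay of exactly 60 is counted there. Swapping the test to `>= 60` (and relabeling) would move it to the other column.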

    
28.11.2014 / 06:26
1

Two other ways to do this:

Using tapply, like Nishimura, but with cut. The nice thing about cut is that it extends easily to more than two conditions: just add more breaks and labels:

tapply(X = dados$Distance, 
       INDEX = list(dados$DayOfWeek,
                    cut(dados$DepDelay, 
                        breaks =c(-Inf, 60, Inf), 
                        labels = c("menor60", "maior60"))),
       FUN = mean)
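
As an illustration of that extension, here is a sketch with three delay bands instead of two. The data frame toy, its values, the extra break at 15 and the labels are all hypothetical choices for the example, not something taken from the question:

```r
# Hypothetical toy data (made-up values), not the real hflights data.
toy <- data.frame(
  DayOfWeek = c(1, 1, 1, 2, 2, 2),
  DepDelay  = c(-5, 30, 120, 15, 60, 61),
  Distance  = c(100, 200, 300, 400, 500, 600)
)

# Three bins: (-Inf, 15], (15, 60], (60, Inf).
# cut closes intervals on the right by default, so DepDelay == 60
# falls in the middle bin and DepDelay == 15 in the first one.
delay_bin <- cut(toy$DepDelay,
                 breaks = c(-Inf, 15, 60, Inf),
                 labels = c("upto15", "15to60", "over60"))  # hypothetical labels

res3 <- tapply(X = toy$Distance,
               INDEX = list(toy$DayOfWeek, delay_bin),
               FUN = mean)
print(res3)
```

The number of labels must always be one less than the number of breaks. If the boundary values should go to the upper bin instead, cut accepts right = FALSE.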

Using dplyr without cut:

dados %>% 
  group_by(DayOfWeek) %>% 
  summarise(menor60 = mean(Distance[DepDelay <= 60], na.rm = TRUE),
            maior60 = mean(Distance[DepDelay > 60], na.rm = TRUE))

The dplyr solution with cut would be equivalent to Daniel's answer, with cut used in place of ifelse.
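
A rough sketch of what that could look like, on a made-up data frame (toy and its values are hypothetical, and the result is left in long format rather than spread into columns). It assumes a dplyr version that supports the .groups argument of summarise:

```r
library(dplyr)

# Hypothetical toy data standing in for hflights (all values are made up).
toy <- data.frame(
  DayOfWeek = c(1, 1, 1, 2, 2, 2),
  DepDelay  = c(10, 60, 90, 61, 5, NA),
  Distance  = c(100, 200, 300, 400, 500, 600)
)

res_cut <- toy %>%
  filter(!is.na(DepDelay)) %>%                      # drop rows with a missing delay
  mutate(delay_bin = cut(DepDelay,
                         breaks = c(-Inf, 60, Inf),
                         labels = c("menor60", "maior60"))) %>%
  group_by(DayOfWeek, delay_bin) %>%                # one group per day and delay bin
  summarise(media = mean(Distance), .groups = "drop")

print(res_cut)
```

As with the tapply version, a delay of exactly 60 ends up in the menor60 bin, because cut closes its intervals on the right by default.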

    
28.11.2014 / 15:18