Speed of joining tables in R

3

Good evening!

I join two tables in RStudio using merge, but I would like to know whether another approach (e.g. left_join) would be faster, because my tables can reach 8 million rows.

Thank you.

    
asked by anonymous 08.06.2018 / 04:06

1 answer

4

Ronaldo, how are you?

See this experiment comparing the merge() function with the inner_join() function from the dplyr package. By default merge() performs an inner join, so inner_join() is its direct counterpart.

# Ensure the random results are reproducible
set.seed(101)

# Generate two datasets with 8,000,000 observations each as an example
df1 <- data.frame(x = sample(seq(1,16000000,1),8000000),
                  y = sample(seq(1,16000000,1),8000000),
                  z = sample(seq(1,16000000,1),8000000))

df2 <- data.frame(x = sample(seq(1,16000000,1),8000000),
                  y = sample(seq(1,16000000,1),8000000),
                  z = sample(seq(1,16000000,1),8000000))


# Test the merge() function
system.time(dfa <- merge(df1, df2, by = c("x", "y")))

#    user  system elapsed 
# 115.911  2.563  122.016 


# Test the inner_join() function
library(dplyr)
system.time(dfb <- inner_join(df1, df2, by = c("x", "y")))

#   user  system elapsed
# 16.459   0.966  17.833

Note that on my machine merge() took 122 seconds (elapsed) to complete the operation, while inner_join() took just under 18 seconds, roughly 7 times faster.
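Since the question mentions left_join specifically, here is a minimal sketch of the same kind of comparison for a left join. It assumes dplyr is installed. The table size n and the column names z1 and z2 are made up for illustration, so increase n to get closer to your real data. In base R, the left-join equivalent is merge() with all.x = TRUE.

```r
library(dplyr)

set.seed(101)

# Small illustrative tables (hypothetical size; increase n to benchmark)
n <- 10000
df1 <- data.frame(x = sample(2 * n, n), z1 = sample(2 * n, n))
df2 <- data.frame(x = sample(2 * n, n), z2 = sample(2 * n, n))

# Left join in base R: all.x = TRUE keeps every row of df1
t_merge <- system.time(res_merge <- merge(df1, df2, by = "x", all.x = TRUE))

# Left join with dplyr
t_join <- system.time(res_join <- left_join(df1, df2, by = "x"))

# merge() sorts by the key while left_join() keeps the order of df1,
# so sort both results by x before comparing them column by column
res_merge <- res_merge[order(res_merge$x), ]
res_join <- res_join[order(res_join$x), ]
same_rows <- identical(res_merge$x, res_join$x) &&
  identical(res_merge$z1, res_join$z1) &&
  identical(res_merge$z2, res_join$z2)

# Elapsed time in seconds for each approach
print(c(merge = t_merge[["elapsed"]], left_join = t_join[["elapsed"]]))
```

The timings will vary by machine and by how many keys match, so it is worth running this on a sample of your own tables before deciding.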

    
08.06.2018 / 09:43