Good evening!
I join two tables in RStudio using merge(), but I would like to know whether another approach (e.g. left_join()) would be faster, because my tables can reach 8 million rows.
Thank you.
Ronaldo, how are you?
See this experiment comparing the merge() function with the inner_join() function from the dplyr package.
# Ensure the random results are reproducible
set.seed(101)
# Generate two example datasets with 8,000,000 observations each
df1 <- data.frame(x = sample(seq(1,16000000,1),8000000),
                  y = sample(seq(1,16000000,1),8000000),
                  z = sample(seq(1,16000000,1),8000000))
df2 <- data.frame(x = sample(seq(1,16000000,1),8000000),
                  y = sample(seq(1,16000000,1),8000000),
                  z = sample(seq(1,16000000,1),8000000))
# Time the merge() function
system.time(dfa <- merge(df1, df2, by = c("x", "y")))
# user system elapsed
# 115.911 2.563 122.016
# Time the inner_join() function
library(dplyr)
system.time(dfb <- inner_join(df1, df2, by = c("x", "y")))
# user system elapsed
# 16.459 0.966 17.833
Note that on my machine merge() took about 122 seconds (elapsed) to complete the operation, while inner_join() took only about 18 seconds, roughly seven times faster.
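Since the question mentions left_join() specifically, here is a minimal sketch of how it relates to merge(). The tables `orders` and `clients` and their columns are hypothetical, invented only for illustration. By default merge() keeps only matching rows (like inner_join()), so to reproduce a left join with merge() you would need all.x = TRUE.

```r
library(dplyr)

# Small hypothetical tables, for illustration only
orders  <- data.frame(id = c(1, 2, 3), amount = c(10, 20, 30))
clients <- data.frame(id = c(1, 3, 4), name = c("Ana", "Bia", "Caio"))

# left_join() keeps every row of `orders`; ids without a match get NA in `name`
left_res <- left_join(orders, clients, by = "id")

# Base R equivalent: merge() needs all.x = TRUE to keep unmatched left rows
merge_res <- merge(orders, clients, by = "id", all.x = TRUE)

# inner_join() keeps only the ids present in both tables
inner_res <- inner_join(orders, clients, by = "id")
```

I have not timed left_join() on the 8-million-row data above, so it is worth wrapping it in system.time() on your own tables before relying on the same speed-up.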