How to combine the rows of data frames that have different sets of variables in R?

5

In the case below, id identifies observations contained in two different data frames: in df1 the observations have only the variable z, while in df2 they have both x and z.

In cases like this, how can the two data frames be joined? rbind does not work, because the data frames do not have the same columns.

df1 <- data.frame(id=1:10, z = rnorm(10))
df2 <- data.frame(id=11:20, x = rnorm(10),z=rnorm(10))

dados <- rbind(df1,df2)

Ideally, a column that is missing from one of the data frames should be added to it, with NA assigned to the observations that do not have that variable.

    
asked by anonymous 14.01.2016 / 18:34

4 answers

4

In this case I always use the bind_rows function from dplyr:

library(dplyr)
dados <- bind_rows(df1,df2)

> dados
Source: local data frame [20 x 3]

      id          z           x
   (int)      (dbl)       (dbl)
1      1  0.8179472          NA
2      2  0.2624969          NA
3      3 -0.1684590          NA
4      4 -0.1239140          NA
5      5  0.4434778          NA
6      6 -0.8865578          NA
7      7  0.1160360          NA
8      8  0.5604733          NA
9      9 -2.2761215          NA
10    10 -0.7920775          NA
11    11  1.7650167 -1.38172797
12    12 -1.0004357  2.64345620
13    13 -1.6467084 -0.01361806
14    14  0.9055755  2.00354819
15    15 -0.1645952  0.57657614
16    16  0.2675339 -0.01727064
17    17  0.6383209 -0.43920834
18    18 -1.4729775 -0.35907320
19    19  0.9345417 -0.93673279
20    20 -0.7888048  0.36903134

I thought it would be useful to include a run-time comparison of all the alternatives here:

> microbenchmark(
+   base = merge(df1, df2, all = TRUE),
+   dplyr = dplyr::bind_rows(df1,df2),
+   data.table = data.table::rbindlist(list(df1,df2), fill = TRUE),
+   plyr = plyr::rbind.fill(df1,df2)  
+ )
Unit: microseconds
       expr      min        lq      mean    median        uq        max neval
       base 1370.788 1578.6680 2138.9646 1852.2805 2296.0775   8607.060   100
      dplyr   64.768  111.1450  205.0742  126.2580  161.3900   4055.948   100
 data.table  173.051  239.8905 2860.8464  280.5705  352.7535 253411.277   100
       plyr  362.365  440.6795  597.4301  506.5200  622.8745   4323.416   100

Note that the dplyr solution is the fastest of all. Comparing medians, it is more than 10x faster than base and about 2x faster than the data.table solution.

    
14.01.2016 / 18:40
3

You can use the rbind.fill function from the plyr package to do what you want:

library(plyr)
df1 <- data.frame(id=1:10, z = rnorm(10))
df2 <- data.frame(id=11:20, x = rnorm(10),z=rnorm(10))

dados <- rbind.fill(df1,df2)
    
14.01.2016 / 18:41
3

You can also use the rbindlist function of the data.table package, with the fill = TRUE option:

library(data.table)
rbindlist(list(df1,df2), fill = TRUE)
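
Keep in mind that rbindlist returns a data.table rather than a plain data frame. If a regular data frame is needed afterwards, a minimal sketch of the conversion could look like this (the seed and the name dt are only there for illustration):

```r
library(data.table)

set.seed(42)  # only so the example is reproducible
df1 <- data.frame(id = 1:10, z = rnorm(10))
df2 <- data.frame(id = 11:20, x = rnorm(10), z = rnorm(10))

# Stack the rows; x is filled with NA for the rows coming from df1
dt <- rbindlist(list(df1, df2), fill = TRUE)

# Convert the resulting data.table back to a plain data.frame
dados <- as.data.frame(dt)
class(dados)
```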
    
14.01.2016 / 19:25
1

One solution within base R itself is to use merge. Although it is typically used to join tables on matching columns, it also works like an rbind with fill as long as you give it the right arguments. This is a good option, especially if this would be your only use of the other packages (dplyr, data.table, plyr).

> df1 <- data.frame(id=1:10, z = rnorm(10))
> df2 <- data.frame(id=11:20, x = rnorm(10),z=rnorm(10))
> merge(df1, df2, all = TRUE)
   id           z            x
1   1  0.55248779           NA
2   2  0.18582807           NA
3   3  0.15226974           NA
4   4 -1.01512686           NA
5   5 -0.13544547           NA
6   6  0.07923429           NA
7   7  0.13807574           NA
8   8  0.08524223           NA
9   9  1.28002174           NA
10 10  0.46765903           NA
11 11 -0.46550011  0.083700363
12 12  0.02691267  0.623578122
13 13  1.60678489 -1.043026698
14 14 -1.77340124  0.001289941
15 15  0.89638862  1.388973550
16 16  0.47108478  0.047770795
17 17 -0.08128863 -0.099346609
18 18  0.24915884 -1.423919181
19 19  0.75685907  1.822885566
20 20  0.48232639 -1.172792347
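
One caveat worth keeping in mind: merge is a join on the shared columns (here id and z), so two rows that agree on all of them are combined into a single row instead of being stacked. The sketch below illustrates this with small made-up data frames (df_a and df_b are illustrative, not taken from the question) and shows one possible base R alternative. rbind_fill_base is a hypothetical helper name, not a function that base R provides.

```r
# Hypothetical helper (not part of base R): stack two data frames,
# adding any column that is missing from one of them as NA.
rbind_fill_base <- function(a, b) {
  all_cols <- union(names(a), names(b))
  for (col in setdiff(all_cols, names(a))) a[[col]] <- rep(NA, nrow(a))
  for (col in setdiff(all_cols, names(b))) b[[col]] <- rep(NA, nrow(b))
  # Both data frames now have the same columns, in the same order
  rbind(a[all_cols], b[all_cols])
}

# Made-up data: the row with id = 2 and z = 0.2 appears in both data frames
df_a <- data.frame(id = 1:2, z = c(0.1, 0.2))
df_b <- data.frame(id = 2:3, x = c(9, 8), z = c(0.2, 0.3))

merged  <- merge(df_a, df_b, all = TRUE)  # join: the shared row is combined
stacked <- rbind_fill_base(df_a, df_b)    # append: every row is kept

nrow(merged)   # 3
nrow(stacked)  # 4
```

In the question's example the id ranges do not overlap, so both approaches keep all 20 rows.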
    
15.01.2016 / 05:09