Complete observations in a data frame


Suppose I have the following dataset:

dados <- structure(list(Ano = c(2001L, 2001L, 2001L, 2002L, 2002L, 2002L, 
2003L, 2003L, 2003L, 2004L, 2004L, 2004L, 2005L, 2005L, 2005L, 
2005L), Categoria = structure(c(1L, 2L, 4L, 1L, 3L, 4L, 1L, 2L, 
3L, 2L, 3L, 4L, 1L, 2L, 3L, 4L), .Label = c("a", "b", "c", "d"
), class = "factor"), Valor = c(15, 21, 15, 14, 18, 20, 17, 21, 
20, 23, 20, 22, 25, 18, 21, 16)), .Names = c("Ano", "Categoria", 
"Valor"), row.names = c(NA, -16L), class = "data.frame")

dados
    Ano Categoria Valor
1  2001         a    15
2  2001         b    21
3  2001         d    15
4  2002         a    14
5  2002         c    18
6  2002         d    20
7  2003         a    17
8  2003         b    21
9  2003         c    20
10 2004         b    23
11 2004         c    20
12 2004         d    22
13 2005         a    25
14 2005         b    18
15 2005         c    21
16 2005         d    16

This data set has three columns: Ano (year), Categoria (category), and Valor (value). By definition, the years range from 2001 to 2005 and the only possible categories are a, b, c, and d. However, some categories were not recorded in some years: for example, category c does not appear in 2001, category b does not appear in 2002, and so on.

I would like to create a new data frame with 5 * 4 = 20 rows, which is the product of the number of years and the number of distinct categories. That is, the new data frame must contain every possible combination of Ano and Categoria. The value 0 must be assigned to every combination that does not exist in the original data frame. I want the following result:

    Ano Categoria Valor
1  2001         a    15
2  2001         b    21
3  2001         c     0
4  2001         d    15
5  2002         a    14
6  2002         b     0
7  2002         c    18
8  2002         d    20
9  2003         a    17
10 2003         b    21
11 2003         c    20
12 2003         d     0
13 2004         a     0
14 2004         b    23
15 2004         c    20
16 2004         d    22
17 2005         a    25
18 2005         b    18
19 2005         c    21
20 2005         d    16

How can I do this in an automated way?

    
asked by anonymous 05.09.2017 / 22:28

2 answers


With tidyr you can use the complete() function:

library(tidyr)
complete(dados, Ano, Categoria, fill = list(Valor = 0))
# A tibble: 20 x 3
     Ano Categoria Valor
   <int>    <fctr> <dbl>
 1  2001         a    15
 2  2001         b    21
 3  2001         c     0
 4  2001         d    15
 5  2002         a    14
 6  2002         b     0
 7  2002         c    18
 8  2002         d    20
 9  2003         a    17
10  2003         b    21
11  2003         c    20
12  2003         d     0
13  2004         a     0
14  2004         b    23
15  2004         c    20
16  2004         d    22
17  2005         a    25
18  2005         b    18
19  2005         c    21
20  2005         d    16
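
One edge case worth noting: complete() expands the values that actually occur in each column (for factors, all levels, including unused ones). If an entire year were missing from the data, it would not be created. A minimal sketch, assuming the years should be consecutive, passes tidyr's full_seq() for that column; the `exemplo` data frame and the `completo` name are hypothetical and used only for illustration.

```r
library(tidyr)

# Hypothetical data: the year 2002 is missing entirely,
# and level "c" of Categoria never appears.
exemplo <- data.frame(
  Ano = c(2001L, 2001L, 2003L),
  Categoria = factor(c("a", "b", "a"), levels = c("a", "b", "c")),
  Valor = c(15, 21, 17)
)

# full_seq() fills the gaps between the smallest and largest year;
# the factor column is expanded to all of its levels.
completo <- complete(exemplo, Ano = full_seq(Ano, 1), Categoria,
                     fill = list(Valor = 0))
nrow(completo)  # 3 years x 3 levels
```

With the data from the question this makes no difference, since every year from 2001 to 2005 appears at least once.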
    
06.09.2017 / 01:52

Using only base R, you can use expand.grid followed by merge:

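# Build every Ano x Categoria combination
tmp <- expand.grid(Ano = unique(dados$Ano), Categoria = unique(dados$Categoria))

# Keep all rows of tmp; combinations absent from dados get Valor = NA
res <- merge(dados, tmp, all.y = TRUE)
# Replace the missing values with 0
res$Valor[is.na(res$Valor)] <- 0
res

rm(tmp)  # final cleanup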
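
Because unique() only returns values that were observed, the grid above would miss a year that never appears in the data. A possible variation, sketched here under the assumption that the years are consecutive and that the factor levels define the valid categories, builds the grid from seq() and levels() instead. The `exemplo` data frame and the `grade` name are hypothetical.

```r
# Hypothetical data: the year 2002 and level "c" never appear.
exemplo <- data.frame(
  Ano = c(2001L, 2001L, 2003L),
  Categoria = factor(c("a", "b", "a"), levels = c("a", "b", "c")),
  Valor = c(15, 21, 17)
)

# Build the grid from the full year range and all factor levels,
# rather than from the values that happen to be observed.
grade <- expand.grid(
  Ano = seq(min(exemplo$Ano), max(exemplo$Ano)),
  Categoria = factor(levels(exemplo$Categoria),
                     levels = levels(exemplo$Categoria))
)

# Keep every row of the grid, then turn the unmatched NAs into 0.
res <- merge(exemplo, grade, all.y = TRUE)
res$Valor[is.na(res$Valor)] <- 0

# Order explicitly by year and category instead of relying on merge().
res <- res[order(res$Ano, res$Categoria), ]
nrow(res)  # 3 years x 3 levels
```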
    
05.09.2017 / 23:21