Complete observations in a data frame

Question

Suppose I have the following dataset:

dados <- structure(list(Ano = c(2001L, 2001L, 2001L, 2002L, 2002L, 2002L, 
2003L, 2003L, 2003L, 2004L, 2004L, 2004L, 2005L, 2005L, 2005L, 
2005L), Categoria = structure(c(1L, 2L, 4L, 1L, 3L, 4L, 1L, 2L, 
3L, 2L, 3L, 4L, 1L, 2L, 3L, 4L), .Label = c("a", "b", "c", "d"
), class = "factor"), Valor = c(15, 21, 15, 14, 18, 20, 17, 21, 
20, 23, 20, 22, 25, 18, 21, 16)), .Names = c("Ano", "Categoria", 
"Valor"), row.names = c(NA, -16L), class = "data.frame")

dados
    Ano Categoria Valor
1  2001         a    15
2  2001         b    21
3  2001         d    15
4  2002         a    14
5  2002         c    18
6  2002         d    20
7  2003         a    17
8  2003         b    21
9  2003         c    20
10 2004         b    23
11 2004         c    20
12 2004         d    22
13 2005         a    25
14 2005         b    18
15 2005         c    21
16 2005         d    16

This data set has 3 columns: Ano (year), Categoria (category), and Valor (value). By definition, the years range from 2001 to 2005 and the only possible categories are a, b, c, and d. However, some categories were not recorded in some years. For example, category c was not recorded in 2001, category b was not recorded in 2002, and so on.

I would like to create a new data frame with 5 * 4 = 20 rows, the product of the number of years and the number of distinct categories. That is, the new data frame must contain every possible combination of Ano and Categoria. Each combination that does not exist in the original data frame must be assigned the value 0. I want the following result:

    Ano Categoria Valor
1  2001         a    15
2  2001         b    21
3  2001         c     0
4  2001         d    15
5  2002         a    14
6  2002         b     0
7  2002         c    18
8  2002         d    20
9  2003         a    17
10 2003         b    21
11 2003         c    20
12 2003         d     0
13 2004         a     0
14 2004         b    23
15 2004         c    20
16 2004         d    22
17 2005         a    25
18 2005         b    18
19 2005         c    21
20 2005         d    16

How can I do this in an automated way?

r data.frame

asked by anonymous 05.09.2017 / 22:28

2 answers

score 3

Using only base R, you can call expand.grid followed by merge.

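# Every Ano x Categoria combination (5 years * 4 categories = 20 rows)
tmp <- expand.grid(Ano = unique(dados$Ano), Categoria = unique(dados$Categoria))

# Keep all rows of tmp; combinations missing from dados get Valor = NA
res <- merge(dados, tmp, all.y = TRUE)
res$Valor[is.na(res$Valor)] <- 0  # replace the NAs with 0
res

rm(tmp)  # final cleanup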
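
One caveat with unique(): it only sees values that occur at least once, so a year or category that never appears in the data would be left out of the grid. A minimal sketch of a variant that builds the grid from the declared domain instead (the full year range and the factor levels) might look like the following. The name grade and the compact data.frame() re-creation of dados are illustrative assumptions, not part of the original answer.

```r
# Compact re-creation of the question's data (same values as `dados`).
dados <- data.frame(
  Ano = c(2001L, 2001L, 2001L, 2002L, 2002L, 2002L, 2003L, 2003L, 2003L,
          2004L, 2004L, 2004L, 2005L, 2005L, 2005L, 2005L),
  Categoria = factor(c("a", "b", "d", "a", "c", "d", "a", "b", "c",
                       "b", "c", "d", "a", "b", "c", "d"),
                     levels = c("a", "b", "c", "d")),
  Valor = c(15, 21, 15, 14, 18, 20, 17, 21, 20, 23, 20, 22, 25, 18, 21, 16)
)

# Build the grid from the declared domain rather than from observed values:
# every year between the first and last one, and every factor level.
grade <- expand.grid(
  Ano = seq(min(dados$Ano), max(dados$Ano)),
  Categoria = factor(levels(dados$Categoria), levels = levels(dados$Categoria))
)

# Right join onto the grid, then replace the missing values with 0.
res <- merge(dados, grade, by = c("Ano", "Categoria"), all.y = TRUE)
res$Valor[is.na(res$Valor)] <- 0

# Sort explicitly and reset the row names so the output order is predictable.
res <- res[order(res$Ano, res$Categoria), ]
rownames(res) <- NULL
res
```

For this particular data set the result should match the one above, since every year and every category occurs at least once. The difference only matters when a whole year or level is absent.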

05.09.2017 / 23:21

score 5 · Accepted Answer

With tidyr you can use the complete() function:

library(tidyr)
complete(dados, Ano, Categoria, fill = list(Valor = 0))
# A tibble: 20 x 3
     Ano Categoria Valor
   <int>    <fctr> <dbl>
 1  2001         a    15
 2  2001         b    21
 3  2001         c     0
 4  2001         d    15
 5  2002         a    14
 6  2002         b     0
 7  2002         c    18
 8  2002         d    20
 9  2003         a    17
10  2003         b    21
11  2003         c    20
12  2003         d     0
13  2004         a     0
14  2004         b    23
15  2004         c    20
16  2004         d    22
17  2005         a    25
18  2005         b    18
19  2005         c    21
20  2005         d    16
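
As a hedged follow-up: complete() expands over the values of Ano that are present in the data, so a year with no rows at all would not be created. A sketch of one way to cover that case is to pass tidyr's full_seq() for the year column. The data frame parcial below is a hypothetical example (the question's data with 2003 removed), used only to show the effect.

```r
library(tidyr)

# Compact re-creation of the question's data (same values as `dados`).
dados <- data.frame(
  Ano = c(2001L, 2001L, 2001L, 2002L, 2002L, 2002L, 2003L, 2003L, 2003L,
          2004L, 2004L, 2004L, 2005L, 2005L, 2005L, 2005L),
  Categoria = factor(c("a", "b", "d", "a", "c", "d", "a", "b", "c",
                       "b", "c", "d", "a", "b", "c", "d"),
                     levels = c("a", "b", "c", "d")),
  Valor = c(15, 21, 15, 14, 18, 20, 17, 21, 20, 23, 20, 22, 25, 18, 21, 16)
)

# Hypothetical case: the year 2003 has no rows at all.
parcial <- dados[dados$Ano != 2003L, ]

# full_seq(Ano, 1) generates every year from min(Ano) to max(Ano) in steps
# of 1, so 2003 is restored and its four categories are filled with 0.
res <- complete(parcial, Ano = full_seq(Ano, 1), Categoria,
                fill = list(Valor = 0))
res
```

Categoria needs no such treatment here because it is a factor, and complete() expands factors over their full set of levels.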