Sampling without repetition in R, taking two variables into account

Question

I have two data sets: one with the rows I want to sample from, and another with the sample size for each date. The first, which is the actual database I need to sample, is called "bons" ("good") and is exemplified below:

CNPJ    data
333333  201601
333333  201612
111111  201612
111111  201610
111111  201607
111111  201611
22222   201605
22222   201606
22222   201610
22222   201509
99999   201605
99999   201612
99999   201611
99999   201601

The second data set, called "tamamostra", is below. It holds only the sample size I need for each date, and the sample must be drawn from CNPJs that do not repeat:

data    201509  201510  201512  201601  201602  201603  201604  201605  201606  201607  201610  201611  201612  Total
ruins   1       1       1       6       4       3       2       4       3       5       5       4       6       45
bons    3       3       3       14      10      7       5       10      7       12      12      10      14      105
Total   4       4       4       20      14      10      7       14      10      17      17      14      20      155

I need to draw a sample of the size given in the "bons" row for each date, without repeating any CNPJ. That is, for 201509 I need a sample of size 3 with 3 different CNPJs, and these CNPJs cannot be used again for the other dates; for 201601 I need a sample of size 14 with CNPJs that were not drawn for a previous date, and so on, ending with a full sample of size 105 made up of unique CNPJs. Note that some CNPJs do not have rows for certain dates.

I tried using a for loop with sample() to draw this sample, but since I did not specify that a CNPJ could not be repeated, some CNPJs were repeated:

for(i in 2:14){
bons1[i]<-subset(bons,data==tamamostra[1,i])[sample(nrow(subset(bons,data==tamamostra[1,i])), tamamostra[3,i]), ]
}

How can I do this in R? I believe the dplyr package might offer a way.

r dplyr

asked by anonymous 02.08.2018 / 19:44

1 answer

score 2 · Answer 1

Since your example data is too small for sampling without repetition, I am generating a larger, simpler data set just for demonstration:

dados <- data.frame(
  CNPJ = rep(1:20, each = 3),
  data = 2015:2017
)

tam <- data.frame(
  data = 2015:2017,
  bons = 1:3
)

The sample size table must be in "long" format (one row per date). For your data, you can convert it as follows:

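# header = FALSE keeps the row of dates as the first data row (columns V1..V15)
tamamostra <- read.table(header = FALSE, text = '
  data    201509  201510  201512  201601  201602  201603  201604  201605  201606  201607  201610  201611  201612  Total
  ruins   1       1       1       6       4       3       2       4       3       5       5       4       6       45
  bons    3       3       3       14      10      7       5       10      7       12      12      10      14      105
  Total   4       4       4       20      14      10      7       14      10      17      17      14      20      155')
# drop the label column and the Total column, then transpose: one row per date
tam <- as.data.frame(t(tamamostra[, -c(1, ncol(tamamostra))]))
# the first column (data, ruins, bons, Total) supplies the column names
names(tam) <- as.character(tamamostra[[1]])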
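
Before sampling, it can help to check that the requested sizes are feasible at all. The following is a minimal sketch in base R, rebuilding the demo tables so that it runs on its own; the names `disponivel` and `checagem` are illustrative and do not come from the question. Both conditions are necessary but not sufficient, since CNPJs are shared between dates.

```r
# Demo tables as above, rebuilt so this block is self-contained
dados <- data.frame(CNPJ = rep(1:20, each = 3), data = 2015:2017)
tam   <- data.frame(data = 2015:2017, bons = 1:3)

# Number of distinct CNPJs available on each date
disponivel <- tapply(dados$CNPJ, dados$data, function(x) length(unique(x)))

# Requested size next to availability, date by date (NA if a date has no rows)
checagem <- data.frame(
  data       = tam$data,
  pedido     = tam$bons,
  disponivel = as.vector(disponivel[as.character(tam$data)])
)

# Each date must have at least as many CNPJs as requested ...
por_data_ok <- all(checagem$pedido <= checagem$disponivel)
# ... and the total requested cannot exceed the number of distinct CNPJs
total_ok <- sum(tam$bons) <= length(unique(dados$CNPJ))
```

If either flag is FALSE (or NA), no sampling scheme can satisfy the table as given.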

Using a loop with subsetting

The idea here is to sample CNPJs sequentially, date by date, and to remove the drawn CNPJs from the data table before moving on to the next date:

# data.frame to receive the samples
amostra <- data.frame(
  CNPJ = NA,
  data = rep(tam$data, tam$bons)
)

# copy of the data, to preserve the original
dados.temp <- dados

for (d in tam$data) {
  # CNPJs still available on this date
  cand <- dados.temp$CNPJ[dados.temp$data == d]
  # draw by index: sample(x, n) would sample from 1:x when x is a single number
  samp.cnpj <- cand[sample.int(length(cand), size = tam$bons[tam$data == d])]
  amostra[amostra$data == d, 'CNPJ'] <- samp.cnpj
  # remove the drawn CNPJs from every date
  dados.temp <- dados.temp[!dados.temp$CNPJ %in% samp.cnpj, ]
}
rm(dados.temp, cand, samp.cnpj, d)

> amostra
  CNPJ data
1    6 2015
2   18 2016
3    8 2016
4    7 2017
5   15 2017
6   19 2017
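
The two requirements from the question (no repeated CNPJ, and the requested size on each date) can be checked directly on the result. A small sketch, using the output printed above typed in by hand so that the block runs on its own; the names `sem_repeticao`, `contagem` and `tamanhos_ok` are illustrative.

```r
# Result printed above, entered by hand
amostra <- data.frame(
  CNPJ = c(6, 18, 8, 7, 15, 19),
  data = c(2015, 2016, 2016, 2017, 2017, 2017)
)
tam <- data.frame(data = 2015:2017, bons = 1:3)

# 1) no CNPJ appears more than once
sem_repeticao <- anyDuplicated(amostra$CNPJ) == 0

# 2) each date received exactly the requested number of CNPJs
#    (factor() with explicit levels keeps dates that received nothing as 0)
contagem <- table(factor(amostra$data, levels = tam$data))
tamanhos_ok <- all(as.vector(contagem) == tam$bons)

sem_repeticao   # TRUE
tamanhos_ok     # TRUE
```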

Drawing a date for each CNPJ first

Here the idea is to first draw one date for each CNPJ (so that no CNPJ can be repeated) and then sample the CNPJs within each date, using the data.table package. This solution is potentially faster for a very large data set, but because each CNPJ is tied to a single date before sampling, a date may be left with fewer CNPJs than its required sample size.

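library(data.table)
setDT(dados)
# draw one date per CNPJ; indexing with sample.int() avoids sample()'s
# 1:n behaviour when a CNPJ has a single date
sorteio <- dados[, .(data = data[sample.int(.N, 1)]), by = CNPJ]
# attach the sample sizes, then sample CNPJs within each date
amostra <- sorteio[tam, on = 'data'][, .(CNPJ = CNPJ[sample.int(.N, bons[1])]), by = data]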

> amostra
   data CNPJ
1: 2015    9
2: 2016    1
3: 2016   16
4: 2017    2
5: 2017    8
6: 2017    7
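
One possible way to guard against the shortage mentioned above is to count the CNPJs assigned to each date after the draw and redraw if any date falls short. This is only a sketch in base R, not part of the original answer: `sortear_datas`, `max_tentativas` and the retry cap of 100 are assumptions made for illustration.

```r
set.seed(42)  # only to make the demo reproducible
dados <- data.frame(CNPJ = rep(1:20, each = 3), data = 2015:2017)
tam   <- data.frame(data = 2015:2017, bons = 1:3)

# Hypothetical helper: draw one date per CNPJ (base R version of the first step)
sortear_datas <- function(dados) {
  datas <- tapply(dados$data, dados$CNPJ,
                  function(d) d[sample.int(length(d), 1)])
  # tapply() orders its result by the sorted unique CNPJs
  data.frame(CNPJ = sort(unique(dados$CNPJ)), data = as.vector(datas))
}

# Redraw until every date has at least the required number of CNPJs
max_tentativas <- 100  # arbitrary cap, an assumption
for (tentativa in seq_len(max_tentativas)) {
  sorteio  <- sortear_datas(dados)
  contagem <- table(factor(sorteio$data, levels = tam$data))
  if (all(as.vector(contagem) >= tam$bons)) break
}

# FALSE here means no valid draw was found within the cap
suficiente <- all(as.vector(contagem) >= tam$bons)
```

When `suficiente` is TRUE, sampling within each date can proceed on `sorteio` as in the answer. If it stays FALSE, the sequential loop may be the safer choice.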

(Thanks to @juan-antonio-roldán-diaz for suggesting this idea.)