Create a new data frame from random values with a loop or another method

I have three data frames with different numbers of rows, and I would like to create a new data frame with 100 rows of random values drawn from them, according to three criteria:

  • A - Columns a and b hold 100 random values from data frame 1

  • B - The first 50 rows of columns c and d hold 50 matched random values from c1 and d1, i.e. values that occur in the same row of data frame 2

  • C - The following 50 rows (51-100) of columns c and d hold another 50 matched random values from c2 and d2, i.e. values that occur in the same row of data frame 3

I tried a loop, but it does not work. How can I fix it, or is there a better way to do this?

Here are the data, the script, and the expected result:

a <- c(4,6,7,3,2,5,6,9,6,5,8,6,7,8,9,7,6)
b <- c(40,60,70,30,20,NA,60,90,60,50,75,34,42,32,NA,45,29)

c1 <- c(1,2,3,4,5,6,7,8,9,10)
d1 <- c(10,9,8,7,6,5,4,3,2,1)

c2 <- c(11,12,13,14,15,16,17,18,19,20)
d2 <- c(20,19,18,17,16,15,14,13,12,11)

df1 <- data.frame(a,b)
df2 <- data.frame(c1,d1)
df3 <- data.frame(c2,d2)

#newdf (with 100 rows)

n <- 100
newdf <- data.frame(n=rep(1:n))
newdf$a <- NA 
newdf$b <- NA 
newdf$c <- NA
newdf$d<- NA

for (i in 1:50){
  newdf$a[i] <- sample(df1$a, 1, replace=T) # random value
  newdf$b[i] <- sample(df1$b, 1, replace=T) # random value 
  newdf$c[i] <- sample[df2$c1,1, replace=T] # one criterion
  newdf$d[i] <- sample[df2$d1,1, replace=T] # one criterion
}

for (i in 51:100){
  newdf$a[i] <- sample(df1$a, 1, replace=T) # random value
  newdf$b[i] <- sample(df1$b, 1, replace=T) # random value 
  newdf$c[i] <- sample[df3$c2,1, replace=T] # two criterion
  newdf$d[i] <- sample[df3$d2,1, replace=T] #two criterion
}

#Result 

a      b     c    d 
7     60     1    10 # row 1
6     50     3    8
2     90     5    6  # row 50
.
.
.
2     90     11    20  # row 51
.
.
.
    
asked by anonymous 10.04.2017 / 17:41

1 answer

I think the best way to solve this problem is without a loop. I solved it by drawing all the row indices at once and storing them in vectors called index_a, index_b, index_cd_50 and index_cd_100. These vectors hold the 100 positions drawn from df1 for column a, the 100 positions drawn from df1 for column b, the 50 rows drawn from df2 and the 50 rows drawn from df3.

The rows drawn from df2 fill positions 1 to 50 of newdf, and the rows drawn from df3 fill positions 51 to 100. Try running the code line by line to see what each step does.

a <- c(4,6,7,3,2,5,6,9,6,5,8,6,7,8,9,7,6)
b <- c(40,60,70,30,20,NA,60,90,60,50,75,34,42,32,NA,45,29)

c1 <- c(1,2,3,4,5,6,7,8,9,10)
d1 <- c(10,9,8,7,6,5,4,3,2,1)

c2 <- c(11,12,13,14,15,16,17,18,19,20)
d2 <- c(20,19,18,17,16,15,14,13,12,11)

df1 <- data.frame(a,b)
df2 <- data.frame(c1,d1)
df3 <- data.frame(c2,d2)

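# a and b are drawn independently, so each gets its own index vector
index_a <- sample(1:nrow(df1), 100, replace=TRUE)
index_b <- sample(1:nrow(df1), 100, replace=TRUE)

# one index vector per source data frame keeps c and d in the same row
index_cd_50  <- sample(1:nrow(df2), 50, replace=TRUE)

index_cd_100 <- sample(1:nrow(df3), 50, replace=TRUE)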

# rows 1-50 come from df2 (c1, d1); rows 51-100 come from df3 (c2, d2)
newdf <- data.frame(a=df1$a[index_a],
                    b=df1$b[index_b],
                    c=c(df2$c1[index_cd_50], df3$c2[index_cd_100]),
                    d=c(df2$d1[index_cd_50], df3$d2[index_cd_100]))
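
A quick way to confirm that c and d really stay paired is sketched below. It uses the same index-based idea and relies on a property of this particular example data: c1 + d1 is always 11 and c2 + d2 is always 31. The names idx2, idx3 and pairs_ok are only illustrative, and the set.seed call is an assumption added so that the draw can be reproduced.

```r
# Same data as in the question.
df1 <- data.frame(a = c(4,6,7,3,2,5,6,9,6,5,8,6,7,8,9,7,6),
                  b = c(40,60,70,30,20,NA,60,90,60,50,75,34,42,32,NA,45,29))
df2 <- data.frame(c1 = 1:10,  d1 = 10:1)
df3 <- data.frame(c2 = 11:20, d2 = 20:11)

set.seed(42)  # assumption: fixed seed, only to make the draw reproducible

# One row index per draw keeps c and d from the same source row.
idx2 <- sample(nrow(df2), 50, replace = TRUE)  # hypothetical name
idx3 <- sample(nrow(df3), 50, replace = TRUE)  # hypothetical name

newdf <- data.frame(
  a = sample(df1$a, 100, replace = TRUE),  # a and b are drawn independently
  b = sample(df1$b, 100, replace = TRUE),
  c = c(df2$c1[idx2], df3$c2[idx3]),       # rows 1-50 from df2, 51-100 from df3
  d = c(df2$d1[idx2], df3$d2[idx3])
)

# In this example data every df2 row sums to 11 and every df3 row to 31,
# so the row sums show whether the pairs stayed together.
pairs_ok <- all(newdf$c[1:50] + newdf$d[1:50] == 11) &&
            all(newdf$c[51:100] + newdf$d[51:100] == 31)
pairs_ok
```

If c and d were sampled separately, as in the loop from the question, this check would fail for most draws, because each value would come from a different row.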
    
10.04.2017 / 18:30