Efficiently create a summary table from a very large data set

6

In R, I have a very large data set and I want to create new columns from it. I will explain my problem with a very small example. In what follows, "1" means private school and "2" means public school. For example, suppose I have this data:

>Data
Casa Escola 
 1     1
 1     1
 1     2
 1     2
 2     1
 2     2
 2     1
 3     1
 3     1
 3     1
 3     1

In this case, house 1 has 4 residents who are in school, 2 in private school and 2 in public school. Similarly, house 2 has 3 residents in school, 2 in private and 1 in public. Finally, house 3 has 4 residents in school, all of them in private school.

I want a new matrix whose first column indicates the house; the second, the number of children in the household who are in school; the third, the number of those who are in private school; and the fourth, the number who are in public school. Something like this:

>matrix1
Casa    em_escola    part    publ
 1          4          2       2
 2          3          2       1
 3          4          4       0

I wrote the code shown below. The problem is that my original data set is very large, so the code takes hours to run. I also need to do the same thing for other data sets. Here is my code:

# Unique house identifiers
lista1 <- unique(Data$Casa)
length(lista1)
n <- length(lista1)

lista_aux <- 1:n

matrix1 <- data.frame(lista_aux, lista1)
nrow(matrix1)

for (i in 1:n)
{
  # Rows belonging to the i-th house
  matrix <- subset(Data, Casa == lista1[i])
  matrix1$em_escola[i] <- nrow(matrix)

  # Residents in private school (Escola == 1)
  mat1 <- subset(matrix, Escola == "1")
  matrix1$part[i] <- nrow(mat1)

  # Residents in public school (Escola == 2)
  mat2 <- subset(matrix, Escola == "2")
  matrix1$publ[i] <- nrow(mat2)
}

    
asked by anonymous 16.08.2014 / 00:21

2 answers

5

You can use the dplyr package to make your code simpler and, at the same time, more efficient:

library(dplyr)

Data <- data.frame(Casa=c(1,1,1,1,2,2,2,3,3,3,3),
    Escola=c(1,1,2,2,1,2,1,1,1,1,1))

matrix1 <- Data %>%
    group_by(Casa) %>%
    summarise(em_escola = n(),
        part = sum(Escola == 1),
        publ = sum(Escola == 2))

matrix1
    
16.08.2014 / 01:45
5

To complement, I'll also leave an answer using data.table. Both dplyr and data.table are extremely fast on large data sets. dplyr is, in my opinion, more intuitive, while data.table is more flexible.

library(data.table)
Data <- data.table(Data)
matrix1 <- Data[,list(em_escola = length(Escola),
           part=sum(Escola==1),
           publ = sum(Escola==2)), by=Casa]
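
If installing a package is not an option, the same counts can probably be obtained in base R with a single call to table(), which avoids the row-by-row subsetting of the loop in the question. This is only a minimal sketch: it assumes the columns are named Casa and Escola as in the example, that Casa holds numeric identifiers, and the name matrix1_base is a hypothetical one chosen here to avoid clashing with matrix1.

```r
# Example data from the question (1 = private school, 2 = public school)
Data <- data.frame(Casa   = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3),
                   Escola = c(1, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1))

# Cross-tabulate house by school type in one pass over the data.
# Fixing the factor levels keeps both columns even if one type never occurs.
counts <- table(Data$Casa, factor(Data$Escola, levels = c(1, 2)))

# Assemble the summary; matrix1_base is a hypothetical name
matrix1_base <- data.frame(
  Casa      = as.numeric(rownames(counts)),  # assumes numeric house ids
  em_escola = as.vector(rowSums(counts)),    # all residents in school
  part      = as.vector(counts[, "1"]),      # private school
  publ      = as.vector(counts[, "2"])       # public school
)

matrix1_base
```

On very large data the dplyr and data.table versions will most likely still be the better choice; the point here is only that no explicit loop is needed.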
    
16.08.2014 / 06:09