Computational efficiency in R - lists or vectors

7

I'm studying computational efficiency in R, generating matrices through different methods.

First, I fill a matrix column by column with a for loop and calculate the variance of each column:

    matriz <- matrix(rep(NA, 1000*200), nrow = 1000, ncol = 200)

    a = system.time(
      for(i in 1:200){
        matriz[,i] <- rnorm(1000, i)
        print(var(matriz[,i]))
      }
    )

I redo the exercise using apply:

    matriz <- matrix(rep(NA, 1000*200), nrow = 1000, ncol = 200)

    b = system.time({
      for(i in 1:200){
        matriz[,i] <- rnorm(1000, i)
      }
      apply(matriz, 2, var)
    })

I redo it again using the mapply function:

    matriz <- matrix(rep(NA, 1000*200), nrow = 1000, ncol = 200)

    c = system.time({
      matriz <- mapply(rnorm, n = 1000, mean = 1:200)
      v <- apply(matriz, 2, var)
    })

Then, in contrast to the matrix approach, I use lists, storing several matrices and calculating the mean of each one.

First with a for loop, printing each mean:

    lista <- list()

    d = system.time(
      for(i in 1:10){
        matriz <- matrix(rep(NA, 20*100), nrow = 20, ncol = 100)
        for(k in 1:20){
          # note: rnorm(20) is recycled to fill the 100 columns of row k
          matriz[k,] <- rnorm(20, mean = k)
        }
        lista[[i]] <- matriz
        print(mean(lista[[i]]))
      }
    )

Finally, I use lists with the lapply function:

    lista <- list()

    e = system.time({
      for(i in 1:20){
        matriz <- matrix(rep(NA, 20*200), nrow = 20, ncol = 200)
        for(k in 1:20){
          # note: rnorm(20) is recycled to fill the 200 columns of row k
          matriz[k,] <- rnorm(20, mean = k)
        }
        lista[[i]] <- matriz
      }
      lapply(lista, mean)
    })

The table below shows the calculation times (in seconds) for each method:

|   | user.self| sys.self| elapsed| user.child| sys.child|
|:--|---------:|--------:|-------:|----------:|---------:|
|a  |     0.028|    0.004|   0.030|          0|         0|
|b  |     0.024|    0.000|   0.023|          0|         0|
|c  |     0.020|    0.000|   0.021|          0|         0|
|d  |     0.006|    0.000|   0.006|          0|         0|
|e  |     0.007|    0.000|   0.006|          0|         0|

Of course the last two times will be smaller, since I computed much smaller matrices. However, can you explain the advantages and drawbacks of each method and why this occurs?

asked by anonymous 10.03.2018 / 04:04

1 answer

9

To evaluate code speed, it is very important to isolate each operation completely. In your case, you are measuring the time of two operations:

  • Creating the matrix of random values, with 1000 rows and 200 columns
  • Calculating the variance of each column

I would organize the problem as follows.

Create the matrices in R

    gerar_for <- function() {
    
      matriz <- matrix(rep(NA, 1000*200), nrow = 1000, ncol=200)
    
      for(i in 1:200){
        matriz[,i] <- rnorm(1000, i)
      }
    
      matriz
    }
    
    gerar_mapply <- function() {
      mapply(rnorm, n = 1000, mean = 1:200)
    }
    
    
    gerar_for_slow <- function() {
      matriz <- NULL
      for(i in 1:200){
        matriz <- cbind(matriz, rnorm(1000, i))
      }
      matriz
    }
    
    microbenchmark::microbenchmark(
      "for" = gerar_for(),
      "mapply" = gerar_mapply(),
      "for-slow" = gerar_for_slow()
    )
    
    Unit: milliseconds
         expr      min        lq     mean    median        uq      max neval cld
          for  15.6097  16.76431  20.0785  18.26528  20.10932 163.0261   100  a 
       mapply  15.5994  17.43291  22.1635  18.68548  21.00221 153.6971   100  a 
     for-slow 148.6910 169.03706 217.5798 178.62365 295.26370 373.7119   100   b
    

The microbenchmark function is very good for comparing the speed of functions, since it runs each function many times, ensuring that the time difference is not just due to a momentary slowdown on your machine.
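As an illustration (my addition, not part of the original benchmark), the times argument of microbenchmark controls how many repetitions are run; more repetitions give more stable estimates at the cost of a longer benchmark:

    # Sketch: 'times' sets the number of repetitions (the default is 100)
    microbenchmark::microbenchmark(
      "for" = gerar_for(),
      "mapply" = gerar_mapply(),
      times = 1000
    )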

From the table above, we see that there is not much difference between the first two approaches, while the version that grows the matrix dynamically is much slower: each cbind copies the entire matrix built so far.

Calculate the variance

    var_for <- function(matriz) {
      variancias <- numeric(200)
      for(i in 1:200) {
        variancias[i] <- var(matriz[,i])
      }
      variancias
    }
    
    var_apply <- function(matriz) {
      apply(matriz, 2, var)
    }
    
    var_for_slow <- function(matriz) {
      variancias <- NULL
      for(i in 1:200) {
        variancias <- c(variancias, var(matriz[,i]))
      }
      variancias
    }
    
    matriz <- gerar_for()
    
    microbenchmark::microbenchmark(
      "for" = var_for(matriz),
      "apply" = var_apply(matriz),
      "for-slow" = var_for_slow(matriz)
    )
    
    Unit: milliseconds
         expr      min       lq     mean   median       uq       max neval cld
          for 5.187810 5.506842 6.672243 5.834702 7.041265  24.80995   100   a
        apply 6.053562 6.822156 9.412554 7.345083 8.566811 152.58045   100   a
     for-slow 5.304672 5.587136 6.798713 6.063436 7.600376  13.52065   100   a
    

In the table above, we see that in this case there is not much difference among the three approaches.

Comparison:

As far as I understand, you are basically comparing the use of apply and for.

The advantage of for is that it is easy to write code in which one iteration depends on the result of the previous one (see the sketch below); this is not so simple with apply. The disadvantage of for is that it is easy to write slow code, as in the gerar_for_slow function above. Another disadvantage is that you usually have to write more lines of code.
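For example, here is a minimal sketch (my own illustration; passeio is a hypothetical variable, not from the original post) of an iteration that depends on the previous one, which for expresses naturally:

    # Sketch: a random walk, where step i needs the value of step i - 1;
    # this dependency is natural in a for loop but awkward with apply
    passeio <- numeric(100)
    for (i in 2:100) {
      passeio[i] <- passeio[i - 1] + rnorm(1)
    }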

apply is more or less the opposite of for: it is difficult to write code that depends on the previous iteration, but it is easier to avoid writing slow code.

For me, the biggest advantage of using apply is that you get used to thinking of R as a functional language, which makes it much easier to learn the language and go deeper into it.
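As a small taste of that functional style (my addition; it assumes the matriz generated above), the column variances can be computed by mapping an anonymous function over the column indices:

    # Sketch: equivalent to var_apply above, written in the functional style
    variancias <- sapply(1:200, function(i) var(matriz[, i]))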

About vectorization

apply should not be considered vectorization in R; apply is simply an alternative way of writing a for loop.

For a loop to be considered vectorized, it has to be written in a lower-level programming language (C, Fortran, C++, etc.), and that is what happens inside many R functions. For example:

    soma_for <- function(vetor) {
      soma <- 0
      for(i in 1:length(vetor)){
        soma <- soma + vetor[i]
      }
      soma
    }
    
    soma_vetorizada <- function(vetor) {
      sum(vetor)
    }
    
    vetor <- rnorm(1000)
    microbenchmark::microbenchmark(
      "for" = soma_for(vetor),
      "vetorizada" = soma_vetorizada(vetor)
    )
    
    Unit: microseconds
           expr    min     lq     mean  median     uq      max neval cld
            for 45.723 45.909 75.11931 46.0165 46.294 2773.788   100   b
     vetorizada  1.575  1.607 10.93954  1.6575  1.727  913.892   100  a 
    

So we see the difference in speed between the two implementations.

answered 10.03.2018 / 21:15