To evaluate code speed, it is very important to completely isolate the operations you want to measure. In your case, you are measuring the time of two operations:
1. Creating a 1000 × 200 matrix of random values
2. Calculating the variance of each column
I would organize the problem as follows.
Create arrays in R
gerar_for <- function() {
  # pre-allocate the full matrix, then fill it column by column
  matriz <- matrix(rep(NA, 1000 * 200), nrow = 1000, ncol = 200)
  for (i in 1:200) {
    matriz[, i] <- rnorm(1000, i)
  }
  matriz
}
gerar_mapply <- function() {
  # mapply builds the matrix directly, one column per mean
  mapply(rnorm, n = 1000, mean = 1:200)
}
gerar_for_slow <- function() {
  # grows the matrix with cbind() at every iteration,
  # forcing a full copy of everything accumulated so far
  matriz <- NULL
  for (i in 1:200) {
    matriz <- cbind(matriz, rnorm(1000, i))
  }
  matriz
}
microbenchmark::microbenchmark(
  "for" = gerar_for(),
  "mapply" = gerar_mapply(),
  "for-slow" = gerar_for_slow()
)
Unit: milliseconds
     expr      min        lq     mean    median        uq      max neval cld
      for  15.6097  16.76431  20.0785  18.26528  20.10932 163.0261   100   a
   mapply  15.5994  17.43291  22.1635  18.68548  21.00221 153.6971   100   a
 for-slow 148.6910 169.03706 217.5798 178.62365 295.26370 373.7119   100   b
The microbenchmark function is very good for comparing the speed of functions, since it runs each function many times (100 by default), ensuring that a measured difference is not just due to a momentary slowdown on your computer.
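If 100 runs is not enough on a noisy machine, you can increase the number of repetitions with the times argument (a sketch; 1000 is an arbitrary choice):
microbenchmark::microbenchmark(
  "for" = gerar_for(),
  "mapply" = gerar_mapply(),
  times = 1000  # repeat each expression 1000 times instead of the default 100
)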
From the table above, we see that there is not much difference between the first two approaches, while the one that grows the matrix dynamically with cbind() is about ten times slower.
Calculate the variance
var_for <- function(matriz) {
  # pre-allocate the result vector, then fill it in a loop
  variancias <- numeric(200)
  for (i in 1:200) {
    variancias[i] <- var(matriz[, i])
  }
  variancias
}
var_apply <- function(matriz) {
  # apply var() over the columns (MARGIN = 2)
  apply(matriz, 2, var)
}
var_for_slow <- function(matriz) {
  # grows the result vector with c() at every iteration
  variancias <- NULL
  for (i in 1:200) {
    variancias <- c(variancias, var(matriz[, i]))
  }
  variancias
}
matriz <- gerar_for()
microbenchmark::microbenchmark(
  "for" = var_for(matriz),
  "apply" = var_apply(matriz),
  "for-slow" = var_for_slow(matriz)
)
Unit: milliseconds
     expr      min       lq     mean   median       uq       max neval cld
      for 5.187810 5.506842 6.672243 5.834702 7.041265  24.80995   100   a
    apply 6.053562 6.822156 9.412554 7.345083 8.566811 152.58045   100   a
 for-slow 5.304672 5.587136 6.798713 6.063436 7.600376  13.52065   100   a
In the table above we see that, in this case, there is not much difference between the three approaches. Here even the version that grows its result dynamically is fine, because copying a vector of at most 200 elements is cheap compared to computing each variance.
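As a quick sanity check (a sketch, using the matriz generated above), we can confirm that the three implementations agree up to floating-point tolerance:
# all three variance functions should return the same values
all.equal(var_for(matriz), var_apply(matriz))
all.equal(var_for(matriz), var_for_slow(matriz))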
Comparison:
As far as I understand, you're basically comparing the use of apply and for.
The advantage of for is the ease of writing code in which an iteration depends on the result of the previous one; this is not so simple with apply (see the sketch after the next paragraph). The disadvantages of for are that it is easy to write slow code, as in the gerar_for_slow function above, and that you usually have to write more lines of code.
apply is more or less the opposite of for: it is difficult to write code that depends on the previous iteration, but it is also easier to avoid writing code that ends up slow.
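To make this concrete, here is a minimal sketch (the recurrence and the function names are my own, hypothetical choices): a sequence where each value depends on the previous one is natural with for, while the functional route needs Reduce rather than apply:
recorrencia_for <- function(n) {
  x <- numeric(n)
  x[1] <- 1
  for (i in 2:n) {
    # each element depends on the one before it
    x[i] <- x[i - 1] / 2 + 1
  }
  x
}
recorrencia_reduce <- function(n) {
  # Reduce() with accumulate = TRUE is the functional counterpart
  Reduce(function(anterior, i) anterior / 2 + 1,
         seq_len(n - 1), init = 1, accumulate = TRUE)
}
Both return the same vector, e.g. recorrencia_for(5) and recorrencia_reduce(5).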
For me, the biggest advantage of using apply is that you get used to thinking of R as a functional language, which makes it much easier to learn the language and go deeper into it.
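For example, the column variances above can also be written in that functional style with sapply, mapping a function over the column indices (equivalent to var_apply):
# same result as apply(matriz, 2, var)
sapply(seq_len(ncol(matriz)), function(i) var(matriz[, i]))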
About vectorization
apply should not be considered vectorization in R; apply is simply an alternative way of writing for.
To be considered vectorized, your loop has to be written in a lower-level programming language (C, Fortran, C++, etc.), and this is what happens in many R functions. For example:
soma_for <- function(vetor) {
  # explicit loop in R: one interpreted iteration per element
  soma <- 0
  for (i in 1:length(vetor)) {
    soma <- soma + vetor[i]
  }
  soma
}
soma_vetorizada <- function(vetor) {
  # sum() loops over the elements in C, not in R
  sum(vetor)
}
vetor <- rnorm(1000)
microbenchmark::microbenchmark(
  "for" = soma_for(vetor),
  "vetorizada" = soma_vetorizada(vetor)
)
Unit: microseconds
       expr    min     lq     mean  median     uq      max neval cld
        for 45.723 45.909 75.11931 46.0165 46.294 2773.788   100   b
 vetorizada  1.575  1.607 10.93954  1.6575  1.727  913.892   100   a
So we see the difference in speed between the two implementations: comparing the medians, the vectorized sum is almost 30 times faster than the explicit loop.
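Closing the circle with the variance example: the column variances can also be computed from vectorized building blocks such as colSums and colMeans. A sketch (var_vetorizada is my own name; the formula is the usual sample variance):
var_vetorizada <- function(matriz) {
  n <- nrow(matriz)
  medias <- colMeans(matriz)
  # sample variance per column: (sum(x^2) - n * mean^2) / (n - 1)
  (colSums(matriz^2) - n * medias^2) / (n - 1)
}
all.equal(var_vetorizada(matriz), var_apply(matriz))  # should be TRUE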