Why are loops slow in R? How to avoid them?


It is very common to hear (or read) that loops are inefficient in R and should be avoided (in this link, in this other link, and again in this one).

And proving this statement is simple:

numeros <- rnorm(10000)

com_loop <- function(vetor) {
  res <- 0
  for (i in seq_along(vetor)) {
    res <- res + vetor[i]
  }
  res
}

microbenchmark::microbenchmark(
  loop = com_loop(numeros),
  vetorizado = sum(numeros)
)

Unit: microseconds
       expr     min      lq      mean  median       uq      max neval
       loop 494.709 512.670 562.71062 514.723 551.9285 3074.480   100
 vetorizado   9.750  10.263  10.77702  10.264  10.2640   28.226   100

The questions I ask are:

  • Why are loops slow in R?
  • What alternatives are there? (packages, strategies, etc.)

    asked by anonymous 26.10.2017 / 14:51

    2 answers


    Excellent questions. Below I'll put my two cents about them.

    1. Why are loops slow in R?

    Loops are slow in R because this is an intrinsic feature of interpreted languages. All code written in R (an interpreted language, like Python or Ruby) is read and translated into machine instructions by the interpreter while the program runs.

    C, on the other hand, is a compiled language. All code written in C is first compiled into an executable in the native language of the machine's operating system and processor, and only then is it run.

    If we write a loop in an interpreted language such as R, the step of translating the R code into machine instructions happens on every iteration of the loop. This adds several extra steps to the execution of the program that do not exist in a compiled language, and each of these intermediate steps adds to the total execution time.

    I understand that this may not answer your question directly. Let me rephrase it as follows:

    Why are loops in R slower than vectorized code?

    Although it may not look like it, the answer to this question is in the description above. Many of R's native functions, such as sum in your example, were written in C, C++, or Fortran. Note the output that appears at the prompt when you type sum:

    sum
    function (..., na.rm = FALSE)  .Primitive("sum")
    

    This function was not written in R. It was certainly written in C, C++, or Fortran, which makes it much faster. After all, these are compiled languages, so the whole summation runs as native machine code instead of being interpreted element by element. Hence the run-time difference between sum and com_loop in the example from your question.
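The point above can be checked at the prompt. The following is a minimal sketch: com_loop is repeated from the question so the block runs on its own, and the seed is an arbitrary choice made here only for reproducibility.

```r
# Loop-based sum, copied from the question
com_loop <- function(vetor) {
  res <- 0
  for (i in seq_along(vetor)) {
    res <- res + vetor[i]
  }
  res
}

set.seed(42)  # arbitrary seed, only so the run is reproducible
numeros <- rnorm(10000)

# sum is a primitive implemented in compiled code; com_loop is ordinary R code
sum_is_primitive  <- is.primitive(sum)
loop_is_primitive <- is.primitive(com_loop)

# Both compute the same total, up to floating-point rounding
same_result <- isTRUE(all.equal(com_loop(numeros), sum(numeros)))
```

So the two functions do the same job; what differs is where the iteration happens.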

    2. What alternatives are there? (packages, strategies, etc.)

    Basically, there are three strategies to try to optimize code in R. However, they will not always work, since every case is different.

  • Use vectorized code

    For example, functions of the apply family have an advantage over loops. Often (though not always), using functions of this family will make your code faster. After all, R is a language that works best with vectors. The functions of the apply family use this feature of R optimally, and therefore often end up being faster than for (or while, etc.).

    In addition, in my opinion, they leave the code cleaner and easier to audit later.

  • Parallelize the code

    Use the parallel processing power of your computer. Instead of using a single core to do the job, distribute it across more cores. The most famous packages for this are foreach, parallel and doMC.

    Unfortunately, I have tried this in the past and was never able to make it work on Windows. I even suspect that it is impossible. However, these packages are easy to use on macOS and Linux.

  • Read the book The R Inferno. It presents many strategies beyond the two I mentioned above. The book opened my eyes in the past, showing me what I was doing wrong when writing my code. It covers 9 strategies in more detail than this summary does, and I am sure it will clarify many of your questions.
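The first two strategies in the list can be sketched side by side. This is only an illustration under stated assumptions: squaring each element is a hypothetical workload, all variable names are invented for the example, and two worker processes is an arbitrary choice. mclapply() from the parallel package works by forking, so on Windows it only runs with mc.cores = 1, which the sketch accounts for.

```r
library(parallel)  # distributed with base R

set.seed(1)  # arbitrary seed for reproducibility
numeros <- rnorm(1000)

# Explicit loop: one interpreted step per element
squares_loop <- numeric(length(numeros))
for (i in seq_along(numeros)) {
  squares_loop[i] <- numeros[i]^2
}

# Vectorized: a single call, the iteration happens in compiled code
squares_vec <- numeros^2

# apply family: vapply() also checks that every result is one double
squares_vapply <- vapply(numeros, function(x) x^2, numeric(1))

# Parallel: fork-based, so fall back to a single core on Windows
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L
squares_par <- unlist(mclapply(numeros, function(x) x^2, mc.cores = n_cores))
```

All four versions should return the same values; for a workload this small the plain vectorized form is the natural choice, and parallelism mainly pays off when each element takes long enough to outweigh the cost of starting the workers.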
    26.10.2017 / 16:20

    Complementing the excellent answer from @Marcos Nunes, the text that made me understand the difference between loops and vectorization was this one: Vectorization in R: Why?

    R is a high-level language, that is, it takes care of interpreting the code for you. For example, when you write code like this:

    i<-5.0
    

    You do not need to tell the computer:

  • that 5.0 is a floating-point number;
  • that "i" should store numeric data;
  • to find a place in memory for the number 5;
  • to register "i" as a pointer to that place in memory;
  • to convert i <- 5.0 to binary, since this is done when you press Enter;
  • that "i" no longer stores a number but a character, if you later change its value to, for example, i <- "b".

    When you put this inside a for, R repeats this process of interpretation on every iteration. And that is what makes the loop slow.

    On the other hand, if you put all the values in a single vector, this process of interpretation happens only once, which reduces processing time. This is why a vector accepts only one type of data, that is, you cannot have integers, factors and characters in the same vector: that would break the logic of vectorization, which is to perform those six steps only once.
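Both claims in the paragraph above are easy to observe at the prompt. A small sketch, with variable names chosen here purely for illustration:

```r
# Mixing types in one vector: R coerces everything to one common type
mixed <- c(1L, 2.5, "a")
mixed_class <- class(mixed)         # "character"
num_type <- typeof(c(1L, 2.5))      # "double"

# One vectorized call on the whole vector...
x <- c(1.5, 2.5, 3.5)
double_vec <- x * 2

# ...versus one interpreted step per element
double_loop <- numeric(length(x))
for (i in seq_along(x)) {
  double_loop[i] <- x[i] * 2
}

same_values <- identical(double_vec, double_loop)  # TRUE
```

Because every element of x is known to have the same type, the multiplication can be dispatched once for the whole vector rather than once per element.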

        
    28.10.2017 / 10:07