Why does inverting the array indices in a for loop, when populating the array, make the process faster?


I was doing some tests in Java and noticed a very strange behavior. I created a two-dimensional array and populated it in two different ways, filling it with random numbers up to 2^31 - 1 and changing only the order of the indices. Doing this gave me a considerable performance gain, but I do not understand why.

Here is the code:

public class Array {

    static int tamanhoX = 12000;
    static int tamanhoY = 12000;
    static int array[][] = new int[tamanhoX][tamanhoY];
    static long tempoInicio = 0, tempoFim = 0;
    static double mediaTempo = 0.0;

    public static void main(String[] args) {
        //Stores the time of each operation
        long tempos[] = new long[10];

        //Measures the operation time several times:
        for (int i = 0; i < tempos.length; i++) {
            //Stores the returned value in the array tempos[] at position i
            tempos[i] = calculaTempo();
        }

        //Sums all the values stored in the array tempos[]
        for (int i = 0; i < tempos.length; i++) {
            mediaTempo += tempos[i];
        }

        System.out.println("Total time: " + (mediaTempo / 1000) + "s");
        mediaTempo = mediaTempo / tempos.length; //Computes the average
        System.out.println("Average time: " + mediaTempo + "ms");
    }

    static long calculaTempo() {
        //Stores the start time of the operation in milliseconds
        tempoInicio = System.currentTimeMillis();
        for (int i = 0; i < tamanhoX; i++) {
            for (int j = 0; j < tamanhoY; j++) {
                //Fills with a random value:
                //Slower
                //array[j][i] = (int) (Math.random() * Integer.MAX_VALUE);
                //Faster
                array[i][j] = (int) (Math.random() * Integer.MAX_VALUE);
            }
        }
        //Stores the end time of the operation
        tempoFim = System.currentTimeMillis();
        System.out.println("Time: " + (tempoFim - tempoInicio) + "ms");
        return (tempoFim - tempoInicio);
    }
}

Notice that I changed only a single line: the array assignment. In my tests I averaged 90 seconds for:

array[j][i] = (int) (Math.random() * Integer.MAX_VALUE);

and 48 seconds for:

array[i][j] = (int) (Math.random() * Integer.MAX_VALUE);

I would love to understand the reason for this. Why does simply swapping the array indices cause such a time difference?

Caution! Do not run this code on computers with low memory. There is a risk of crashing your operating system.

asked by anonymous 13.03.2014 / 04:09

2 answers


On modern processors the bottleneck is not the CPU, but memory access (mitigated by the cache). This means that the fastest program is not [necessarily] the one that executes the fewest instructions, but the one that accesses memory most efficiently (i.e., causes the fewest cache misses).

In Java, a multidimensional array is represented as an array of arrays. This means that the outer array is an array of references, and each inner array is an array of integers. Each individual array occupies contiguous memory positions, but the inner arrays are not necessarily adjacent to one another:

int[][] array = new int[4][4];
// Equivalent to:
int[] a = { 0, 0, 0, 0 };
int[] b = { 0, 0, 0, 0 };
int[] c = { 0, 0, 0, 0 };
int[] d = { 0, 0, 0, 0 };
int[][] array = { a, b, c, d };
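One way to convince yourself that it really is an array of arrays: the inner arrays are allocated independently and may even have different lengths (a "jagged" array). A minimal sketch (class and method names are my own, for illustration):

```java
//Hypothetical demonstration that a Java 2D array is an array of row references.
public class Jagged {
    static int[][] build() {
        int[][] jagged = new int[3][];  //only the outer reference array is allocated here
        jagged[0] = new int[]{1};       //each row is a separate array object...
        jagged[1] = new int[]{1, 2};    //...allocated independently...
        jagged[2] = new int[]{1, 2, 3}; //...and can even have a different length
        return jagged;
    }

    public static void main(String[] args) {
        int[][] a = build();
        //Each row reports its own length
        System.out.println(a[0].length + " " + a[1].length + " " + a[2].length);
    }
}
```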

Let's assume, for simplicity, that each line of your L1 cache holds 4 elements, and that only two lines are available. At first, array is in the cache. When your loop does:

array[0][0] = ...;

It tries to load array (great, it's already in the cache!) and then a (not in the cache: it has to go to the next level and/or RAM). It then performs the assignment normally.

If after that it does:

array[0][1] = ...;

It will find array in the cache, and a also in the cache! The operation is quite fast. The same holds for array[0][2] and array[0][3]. Only when it tries to assign array[1][0] does it fail to find b in the cache, and it has to go fetch it.

But if instead he does:

array[1][0] = ...;

Then b will not be in the cache, and it will have to be fetched immediately. Next comes array[2][0], and behold: c is not in the cache either, so it fetches again. By the time it gets back to array[1][1], the loop will have loaded and evicted your program's entire memory region from the cache, and it all has to be done over again...
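A toy simulation can make this concrete. The sketch below uses the same assumed parameters as the example above (a fully associative LRU cache holding two lines of 4 ints each, over a 4x4 matrix) and counts misses for each traversal order:

```java
import java.util.ArrayDeque;

//Toy cache simulation (parameters assumed, as in the text): a fully
//associative LRU cache with 2 lines of 4 ints, traversing a 4x4 matrix.
public class CacheSim {
    static int misses(boolean rowMajor) {
        ArrayDeque<Integer> cache = new ArrayDeque<>(); //line ids, most recent first
        int misses = 0;
        for (int a = 0; a < 4; a++) {
            for (int b = 0; b < 4; b++) {
                int i = rowMajor ? a : b;
                //With 4-int lines, row i occupies exactly cache line id i
                int line = i;
                if (!cache.remove(line)) {              //not in cache: miss
                    misses++;
                    if (cache.size() == 2) cache.pollLast(); //evict least recently used
                }
                cache.addFirst(line);                   //mark as most recently used
            }
        }
        return misses;
    }

    public static void main(String[] args) {
        System.out.println("row-major misses:    " + misses(true));  //1 miss per row
        System.out.println("column-major misses: " + misses(false)); //every access misses
    }
}
```

With these assumed sizes, row-major traversal misses only once per row (4 misses in 16 accesses), while column-major traversal misses on every single access (16 of 16), exactly the effect described above.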

In practice, the only reason the difference is not even more striking is that the cache is not that small, and there are several levels (L1, L2...), so even if your data structure is not in L1, maybe it is in L2 (i.e., it does not have to be fetched from scratch). But the bigger the data, the bigger the difference becomes, especially if physical memory runs out and the system starts paging (i.e., using virtual memory / swap).
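To see this on a real machine, here is a minimal benchmark sketch (names and sizes are my own; absolute timings will vary with your hardware) that reads the same matrix in both orders:

```java
//Minimal sketch comparing row-major and column-major traversal of a 2D array.
public class Traversal {
    static final int N = 2000;
    static int[][] m = new int[N][N];

    static void fill() {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                m[i][j] = 1;
    }

    //Row by row: consecutive accesses hit adjacent elements of the same inner array
    static long sumRowMajor() {
        long s = 0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                s += m[i][j];
        return s;
    }

    //Column by column: every access jumps to a different inner array
    static long sumColMajor() {
        long s = 0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                s += m[i][j];
        return s;
    }

    public static void main(String[] args) {
        fill();
        long t0 = System.nanoTime();
        long a = sumRowMajor();
        long t1 = System.nanoTime();
        long b = sumColMajor();
        long t2 = System.nanoTime();
        //Both sums are identical; only the access order (and thus the time) differs
        System.out.println("row-major:    " + (t1 - t0) / 1_000_000 + " ms, sum = " + a);
        System.out.println("column-major: " + (t2 - t1) / 1_000_000 + " ms, sum = " + b);
    }
}
```

Both loops compute the same result; any time difference you observe comes purely from the memory access pattern.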

13.03.2014 / 04:39

As already mentioned, the processor has a small amount of very fast memory. One part of it is the TLB , which caches recently used virtual-to-physical address translations. Sitting inside the processor next to the MMU , it has a much faster read/write speed than RAM, which depends on external buses and so on.

If the for loop traverses the matrix column by column (vertically) while memory is laid out linearly (row by row), the TLB is "wasted": it has a relatively small size, and its entries keep being overwritten with newer memory addresses. But if the loop traverses the data linearly, the TLB entries are reused, because the most recently accessed addresses are accessed again. That reuse is what produces the performance gain: the processor does not need to consult the page table in RAM, saving several machine cycles.
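A back-of-the-envelope sketch of why column-wise traversal defeats the TLB. All figures below are assumptions for illustration: 12000 ints per row, 4-byte ints, and rows laid out back to back in one contiguous block (which Java does not guarantee, but which approximates the worst case):

```java
//Rough stride arithmetic (assumed layout: rows of 12000 4-byte ints, contiguous).
public class Stride {
    //Distance in bytes between two consecutive accesses for each traversal order
    static long strideBytes(int rowLength, int elemSize, boolean rowMajor) {
        return rowMajor ? elemSize : (long) rowLength * elemSize;
    }

    public static void main(String[] args) {
        System.out.println("row-major stride:    " + strideBytes(12000, 4, true) + " bytes");
        System.out.println("column-major stride: " + strideBytes(12000, 4, false) + " bytes");
    }
}
```

Under these assumptions, row-wise accesses advance only 4 bytes (usually staying on the same cache line and the same memory page), while column-wise accesses jump 48000 bytes, landing on a different 4 KiB page almost every time and exhausting the TLB's small number of entries.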

As far as I know, this cache exists on all current processors; it produces a very large performance gain and is an indispensable component today.

There is a very interesting article on a related subject here ; it's worth reading.


  • 14.03.2014 / 20:00