String Builder (or How to improve the performance of massive replacements)

8

TL; DR: I know there is a class called StringBuilder, either at . NET and Java , which allows you to perform operations on a text without generating a new string for each method call. It would be very convenient to have something similar in Javascript, however I can not find anything similar.

My goal: Let's assume that I have some considerable lengths of text on which I wish to perform certain substitution operations. By size, let's imagine strings with millions of characters.

I know that Javascript may not seem like an appropriate technology for this. But I want to do the replacements in real time, which excludes techniques like leaving a job to operate on a server. I also want to replace multiple snippets, based on the input of the user.

The problem: Javascript overrides end up being expensive in memory. Someone correct me if I am wrong - but if I have a string that occupies a megabyte , using the replace method of the String object, I will have an occupation of two megabytes : one of the original string, which will not exist until the garbage collector reclaims it, and another one of the new string. When you perform a new replacement, there will be three megabytes , and so on. In the thousandth change, we are already occupying in the vicinity of a gigabyte .

I'm thinking of approximate numbers, and considering that all substitutes are global (I use regular expressions with the g modifier, ie: /foo/g ).

Doubt: is there anything that plays the role of StringBuilder in Javascript? If it does not exist, is there any way it could be implemented?

    
asked by anonymous 16.05.2014 / 16:14

2 answers

6

I've been researching the subject and found this project StringBuilder . Which seems to be optimized for what you want.

Normally a code snippet would look like this:

var myString = "";
for (var i = 0; i < 1000; i++) {
    myString += i.toString();
}

The problem with this is that a lot of memory is used. After the third iteration the following values will be in memory:

""
"1"
"12"
"123"

Example using StringBuilder :

var myStringBuilder = StringBuilder();
for (var i = 0; i < 1000; i++) {
    myStringBuilder.append(i.toString());
}

After the third iteration, only these values will be in memory:

"1"
"2"
"3"
    
16.05.2014 / 16:45
4

In JavaScript, strings are immutable. This means that - whenever you try to make a substitution in a string - a new string will be created with the result. This makes it particularly problematic to efficiently implement something like String.replace , especially if regular expressions are involved.

In theory, you can improve this by using arrays instead of strings. For example, to concatenate multiple strings, accumulate them in an array and at the end join them using the join method:

var arr = [];
for (var i = 0; i < 1000; i++) {
    arr.push(i.toString());
}
var myString = arr.join('');

This solution (which by the way is the same strategy used by StringBuilder as suggested in the DBX8 response ) Unfortunately this is not ideal since JavaScript does not have a "character" data type. Instead, when referring to an index in the string what we have back is just another string:

var str = "abc";
console.log(typeof str[1]); // string
For this reason, representing a string as a list of "characters" (or even of code points, if we use whole numbers as elements of the array) will hardly be more efficient in memory [or in time]. Only when pieces are reasonably large do you expect to gain performance using this method (which may or may not be true in a regex replace case, depending on the case).

What's left is to use a lower-level structure, such as Uint16Array . When creating such an array ( ArrayBuffer exists in multiple formats ) only a copy is kept in memory. It can be modified at will that no additional copy is made, although one has to be careful not to exceed the space limit or things like this:

function stringParaBuffer(str) {
    var ret = new Uint16Array(str.length);
    for ( var i = 0 ; i < str.length ; i++ )
        ret[i] = str.charCodeAt(i);
    return ret;
}
function bufferParaString(arr) { return String.fromCharCode.apply(null, arr); }

var arr = stringParaBuffer("abc");  // Cria o buffer
arr[1] = "d".charCodeAt(0);         // Nenhuma cópia é feita
console.log(bufferParaString(arr)); // imprime: "adc"

arr[3] = "e".charCodeAt(0); // Falha silenciosamente, o array continua o mesmo
This "raw" solution can be used as a basis for building something more sophisticated (such as% retaining size, re-scaling as needed, etc.) that allows regex replace without excessive memory usage. But so far, as far as I know, there is no library ready for making use of this ...

    
16.05.2014 / 18:45