System.OutOfMemoryException - Parser for large files


I created a simple grammar to parse a file whose format is very similar to JSON. However, when I try to parse the file I get a System.OutOfMemoryException. This is because of the size of the file I'm trying to parse: it is 108 MB and has 4,682,073 lines.

When I parse smaller files, everything works normally. For this file, however, I noticed that when the memory used by the process reaches almost 2 GB, the exception is thrown and the program stops. The exception comes from the parser code generated by the ANTLR extension for Visual Studio.

How do I run the parser on really large files with ANTLR?

More information

The machine I'm running the parser on has 8 GB of memory and a 2.8 GHz processor (Intel Core 2 Duo).

Problem example

Sample file for reading

(
    :field ("ObjectName"
        :field (
            :field ("{6BF621F9-A0E2-49BB-A86B-3DE4750954F4}")
            :field (Value)
            :field (Value)
            :field (
                :Time ("Sun Jan 26 10:08:33 2014")
                :last_modified_utc (1390730913)
                :By ("Some message")
                :From (localhost)
            )
            :field ("Applications/application_fw1")
            :field (false)
            :field (false)
        )
        :field ()
        :field ()
        :field ()
        :field (0)
        :field (true)
        :field (true)
    )
.
.
.
Thousands of other fields.
.
.
.
)

The grammar

grammar Objects;

/*
 * Parser Rules
 */


compileUnit
    : obj
    ;


obj
    : OPEN ID? (field)* CLOSE
    ;

field
    : ':'(ID)? obj
    ;


/*
 * Lexer Rules
 */


OPEN 
    : '(' 
    ;

CLOSE 
    : ')' 
    ;

ID
    : (ALPHA | ALPHA_IN_STRING)
    ;


fragment
INT_ID
    : ('0'..'9')
    ;

fragment
ALPHA_EACH
    : 'A'..'Z' | 'a'..'z' | '_' | INT_ID | '-' | '.' | '@'
    ;

fragment
ALPHA
    : (ALPHA_EACH)+
    ;

fragment
ALPHA_IN_STRING
    : ('"' ( ~[\r\n] )+ '"')
    ;



WS
    // :    ' ' -> channel(HIDDEN)
    : [ \t\r\n]+ -> skip  // skip spaces, tabs, newlines
    ;

Running the parser

// text holds the contents of the 108 MB file to be read.
var input = new Antlr4.Runtime.AntlrInputStream(text);
var lexer = new ObjectsLexer(input);
var tokens = new Antlr4.Runtime.CommonTokenStream(lexer);
var parser = new ObjectsParser(tokens);

// Context for the compileUnit rule
// ERROR: the problem occurs here, when the tree for compileUnit starts being built.
// It never reaches the visitor; the exception is thrown inside compileUnit()
var ctx = parser.compileUnit();


// Run the visitor
new ObjectsVisitor().Visit(ctx);
    
asked by anonymous 06.11.2014 / 19:37

1 answer


You can configure a few things to avoid the problem:

When the compile unit is parsed, the framework tries to hold the whole file and the whole parse tree in memory. In theory, the application's address space is 4 GB, but I believe the 2 GB limit comes from the maximum size of a data structure within the process.

Once the buffer is no longer needed, the file is loaded in segments, as is the parse tree, and the memory problem is avoided.
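
One way this unbuffered setup might look in C# is sketched below. It is not taken from the answer itself. It assumes the Antlr4.Runtime version in use provides UnbufferedCharStream, UnbufferedTokenStream, CommonTokenFactory(bool copyText) and the Parser.BuildParseTree property, and that ObjectsLexer and ObjectsParser are the classes generated from the grammar in the question. The class name LargeFileParsing and the path parameter are made up for illustration.

```csharp
using System.IO;
using Antlr4.Runtime;

// Hypothetical helper; ObjectsLexer and ObjectsParser are the classes
// generated from the Objects grammar shown in the question.
static class LargeFileParsing
{
    public static void Parse(string path)
    {
        // Read from a stream instead of loading the whole 108 MB text into a string.
        using (var reader = new StreamReader(path))
        {
            // Keeps only a small sliding window of characters in memory.
            var input = new UnbufferedCharStream(reader);
            var lexer = new ObjectsLexer(input);

            // An unbuffered character stream discards consumed text, so each
            // token has to carry its own copy of the text (copyText = true).
            lexer.TokenFactory = new CommonTokenFactory(true);

            // Keeps only the tokens needed for lookahead instead of all of them.
            var tokens = new UnbufferedTokenStream(lexer);
            var parser = new ObjectsParser(tokens);

            // Assumption: no parse tree is needed afterwards, so none is built.
            parser.BuildParseTree = false;

            parser.compileUnit();
        }
    }
}
```

With BuildParseTree set to false, the context returned by compileUnit() has no children, so the ObjectsVisitor from the question would have nothing to visit. The per-field work would have to move into grammar actions or into a listener registered with parser.AddParseListener, which is called while parsing is still in progress.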

    
06.11.2014 / 21:32