What is the best way to parse a very large JSON file in Java?

3

I have a very large JSON file: it is 5 GB and has 652,339 lines. I was thinking of using the Gson library in Java.

I'd like to know the best way to parse it from the file, since no JSON framework I tried managed to read it properly. Example of a line from the file:

{"control": {"lang": {"lang": "pt", "d": 1395183935882, "v": 5}, "last": "UPDATE", "read": {"d": 1395183767992, "v": 3}, "update": {"d": 1395308552817, "v": 2}, "rule": {"entities": [80000, 84001, 80034, 84232, 84009, 84051, 84084, 80061], "d": 1395305209944, "v": 3}, "entities": {"entities": [80000, 84001, 80034, 84232, 84009, 84051, 84084, 80061]}, "terms": {"terms": [], "d": 1395249318552, "v": 3}, "coletas": [{"terms": [], "id": 97}]}, "picture": "https://fbexternal-a.akamaihd.net/safe_image.php?d=AQA10tlbPQBXIp4p&w=154&h=154&url=http%3A%2F%2Fimages.immedia.com.br%2F%2F9%2F9146_2_L.JPG", "story": "Georgevan Araujo compartilhou um link.", "updated_time": "2013-12-30T23:59:59", "from": {"name": "Georgevan Araujo", "id": "100000278536009"}, "description": "Segundo o ex-ministro da Fazenda, a prova de que o governo n\u00e3o tem nada de socialista \u00e9 que ele destruiu as suas duas principais empresas: a Petrobras e a Eletrobr\u00e1s", "caption": "www.infomoney.com.br", "privacy": {"value": ""}, "name": "\"O que o governo fez com a Petrobras foi uma trag\u00e9dia\", diz Delfim Netto", "application": {"namespace": "fbipad_", "name": "Facebook for iPad", "id": "173847642670370"}, "link": "http://www.infomoney.com.br/onde-investir/acoes/noticia/3086396/que-governo-fez-com-petrobras-foi-uma-tragedia-diz-delfim", "story_tags": {"0": [{"length": 16, "type": "user", "id": "100000278536009", "name": "Georgevan Araujo", "offset": 0}]}, "created_time": "2013-12-30T23:59:59", "_id": "100000278536009_719669731385638", "type": "link", "id": "100000278536009_719669731385638", "icon": "https://fbstatic-a.akamaihd.net/rsrc.php/v2/yD/r/aS8ecmYRys0.gif"}

I was thinking of:

  • Split this file into several smaller ones and parse them one by one
  • Create a database and load all the information into it for use by the application
  • Strip the JSON structure with a Java application and read the file as plain text

I don't think any of the above alternatives is ideal.

asked by anonymous 26.09.2015 / 16:11

2 answers

2

Since what I actually needed from this JSON were just a few tags, what I used was an element-by-element read, one element per call, as they were needed. For this I used the Jackson JSON API. My code below extracts only the title, url, text and entidades tags from the JSON above:

import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonParseException;
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.JsonToken;

import java.io.File;
import java.io.IOException;
import java.util.LinkedList;
import java.util.logging.Level;
import java.util.logging.Logger;

public class BrutoNewsJsonParser {

    JsonFactory factory;
    JsonParser jp;
    JsonToken current;

    public BrutoNewsJsonParser() {
        factory = new JsonFactory();
        jp = null;

        String path = "/home/nicolas/Documentos/X9dadosIC/Bruto/news_jul_dez_2013.json";

        try {
            // createParser replaces the deprecated createJsonParser in Jackson 2.x.
            jp = factory.createParser(new File(path));
        } catch (IOException ex) {
            Logger.getLogger(BrutoNewsJsonParser.class.getName()).log(Level.SEVERE, null, ex);
        }
    }

    public News ler() {
        EntidadesReader er = new EntidadesReader();
        String title = null, url = null, text = null;
        LinkedList<String> entidades = new LinkedList<>();
        boolean controleEntidades = true;

        // Depth counter: incremented on '{', decremented on '}'.
        // When it drops back to zero, one complete top-level object has been read.
        int contador = 0;

        try {
            current = jp.nextToken();
        } catch (IOException ex) {
            Logger.getLogger(BrutoNewsJsonParser.class.getName()).log(Level.SEVERE, null, ex);
        }

        if (current == JsonToken.START_OBJECT) {
            contador++;
        }

        while (contador != 0) {
            try {
                // On a value token, getCurrentName() still returns the name
                // of the field the value belongs to.
                String namefield = jp.getCurrentName();
                if ("title".equals(namefield) && current == JsonToken.VALUE_STRING) {
                    title = jp.getText();
                } else if ("url".equals(namefield) && current == JsonToken.VALUE_STRING) {
                    url = jp.getText();
                } else if ("text".equals(namefield) && current == JsonToken.VALUE_STRING) {
                    text = jp.getText();
                } else if ("entities".equals(namefield) && controleEntidades
                        && current == JsonToken.START_ARRAY) {
                    // Read only the first "entities" array of each object.
                    controleEntidades = false;
                    current = jp.nextToken();
                    while (current != JsonToken.END_ARRAY) {
                        entidades.add(er.traduzir(Integer.parseInt(jp.getText())));
                        current = jp.nextToken();
                    }
                }

                current = jp.nextToken();
                if (current == JsonToken.END_OBJECT) {
                    contador--;
                } else if (current == JsonToken.START_OBJECT) {
                    contador++;
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        try {
            // Skip the separator between top-level objects, if any.
            jp.nextToken();
        } catch (JsonParseException j) {
            // Ignore: the objects may be separated by raw text such as commas.
        } catch (IOException ex) {
            Logger.getLogger(BrutoNewsJsonParser.class.getName()).log(Level.SEVERE, null, ex);
        }
        // News and EntidadesReader are my own domain classes.
        return new News(title, url, text, entidades);
    }
}

With this, each call to the ler() method returns one more element from the file.
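For illustration, a minimal driver sketch for this class. It is an assumption, not part of the original code: the getTitle() accessor on News and the stop condition (a News whose fields are all null once the token stream is exhausted) are hypothetical.

public class Main {
    public static void main(String[] args) {
        BrutoNewsJsonParser parser = new BrutoNewsJsonParser();

        // Hypothetical driver loop: only one News object is held
        // in memory at a time, no matter how large the file is.
        while (true) {
            News news = parser.ler();
            // Assumed stop condition: when the token stream is exhausted,
            // ler() returns a News with all fields null.
            if (news.getTitle() == null) {
                break;
            }
            System.out.println(news.getTitle());
        }
    }
}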

24.10.2015 / 21:37
2

The database is probably the best solution because:

  • It is made to work with huge amounts of data;

  • 5 GB is far too much to keep in memory, especially in Java;

  • If the data needs to be reused, the whole parsing process would have to be repeated every time, which will certainly take a while.

I do not know of a specific tool that is built to handle, or that advertises being able to handle, that much data. But as long as you have no more than 2^31 records at any single level of your object tree, enough memory on your machine, and Java configured with a really large heap limit (8 GB+), I see no problem.

One detail that can make this much easier: if your file consists only of lines like the one shown, and nothing else, perhaps separated by commas, with each line being a complete JSON document, then you can process it line by line, treating each line as its own JSON file and sending it to the database as you go. That eliminates the memory problem mentioned above, with the advantage of being reasonably simple to do (see the sketch below).
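To make the line-by-line idea concrete, here is a minimal sketch using Jackson's ObjectMapper. The file name, the trailing-comma handling and the printed "id" field are assumptions for the example; in practice each parsed node would be turned into a database INSERT.

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class LineByLineLoader {
    public static void main(String[] args) throws IOException {
        ObjectMapper mapper = new ObjectMapper();
        // Hypothetical path: replace with the real 5 GB file.
        try (BufferedReader reader = Files.newBufferedReader(Paths.get("news.json"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                line = line.trim();
                if (line.isEmpty()) {
                    continue;
                }
                // If the lines are comma-separated, drop the trailing comma
                // so that each line is a self-contained JSON object.
                if (line.endsWith(",")) {
                    line = line.substring(0, line.length() - 1);
                }
                JsonNode node = mapper.readTree(line);
                // Only this one document is in memory at a time; at this
                // point it could be mapped to an INSERT into the database.
                System.out.println(node.path("id").asText());
            }
        }
    }
}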

26.09.2015 / 16:32