How to iterate over a huge number of records with Scala SORM

3

I want to iterate over a lot of records in a table with SORM, but I want to do it in a memory-efficient way.

Today I use this code:

Db.query[Items].whereEqual("title", someTitle).fetch.foreach { webCount =>
  // do something
}

The problem is that this code loads all the records before the loop starts processing each item. Is there any way to get a stream of records instead?

asked by anonymous 12.08.2014 / 17:16

2 answers

3

The methods Querier.fetch and Querier.fetchIds return Streams, but that does not mean you cannot run out of memory if you have to work with all of the returned objects at the same time.
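
To illustrate the pitfall, a minimal sketch (it reuses Db, Items and someTitle from the question; nothing else here is SORM-specific): Scala's Stream memoizes its elements, so holding a reference to the head retains every element already traversed:

val all = Db.query[Items].whereEqual("title", someTitle).fetch // lazy Stream
// Traversing it populates and memoizes every element, and the val `all`
// keeps the head reachable, so none of them can be garbage collected.
val total = all.size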

The Query object (built by Querier) has limit and offset properties that allow pagination in the traditional database sense:

val itemCount = Db.query[Items].whereEqual("title", someTitle).count()
// Paginated queries
Db.query[Items].whereEqual("title", someTitle).limit(10).offset(40).fetch // ...
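
Building on that, a minimal paging sketch (the batch size and the loop body are illustrative assumptions, not part of the SORM API):

val query = Db.query[Items].whereEqual("title", someTitle)
val total = query.count()
val pageSize = 100 // assumed batch size

// Process one page at a time, so at most `pageSize` entities
// are materialized in memory per iteration.
(0 until total by pageSize).foreach { offset =>
  query.limit(pageSize).offset(offset).fetch.foreach { item =>
    // do something with each item
  }
}

Note that offset-based pagination assumes the table does not change between pages; if rows are inserted or deleted mid-iteration, items can be skipped or processed twice.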
12.08.2014 / 17:38
2

According to a comment in the documentation, fetching entities is done in two phases:

  • The query, with all its filters and orderings, is executed in the database, but it retrieves only the id (primary key) of each record.

  • When you actually reach an item in the Stream returned by the query, the second phase occurs, in which all the other fields are read.

So, since the API does not seem to offer any other way to iterate over the results, this is already the most efficient approach and is unlikely to blow up memory; see the sketch below.
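
If that two-phase description still holds, simply consuming the Stream element by element, without keeping a reference to it, should keep memory bounded. A sketch under that assumption (process is a hypothetical handler, not part of SORM):

// Phase one: a single query fetches only the ids of the matching records.
Db.query[Items].whereEqual("title", someTitle).fetch.foreach { item =>
  // Phase two happens lazily: the remaining fields of `item` are read only
  // when the Stream reaches it. Since the Stream itself is never stored in
  // a val, already-processed elements can be garbage collected.
  process(item) // hypothetical handler
}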

    However, that comment is from two years ago, and I honestly do not know whether it still applies to newer versions.

    Original comment:


    Fetching entities in SORM always goes in two phases: first, all your filters - no matter how intricate, - orderings and etc get applied to a single multitable query which fetches just the ids of the matching entities; in the second phase SORM emits multiple queries to actually populate the resulting entities depending on the complexity of their structure. Since all the selects of the second phase are by primary keys, they are very cheap. But this area will definitely become a battlefield for all kinds of optimizations in the future. Contribution is much appreciated.


    There was actually a single implementation that was doing everything in a single phase: both querying and object population were being done in a single query in the version 0.1.0, but then it turned out that due to specifics of how joining tables works, it could fetch a million rows for certain multicollection entities, literally. So, downshifting to a simpler strategy turned out to be inevitable.


    The "Stream" thing is there intentionally. It delays the second phase fetching queries and objects population until you actually reach them in the returned Stream. Although this might be subject to change in the future.

    12.08.2014 / 17:48