Strategies for analyzing very large databases in R (that do not fit in RAM)


Suppose I have a huge database that does not fit into RAM. What strategies can I use to analyze this database in R, given that I cannot load it fully into memory?

PS: The question is not just about how to make R talk to a relational / non-relational database. Even if your data lives in a relational database, you still cannot load it all at once into R to run, say, a random forest or a regression.

asked by anonymous 28.08.2014 / 06:44

4 answers


R is a specialized language whose sweet spot is in-memory data analysis problems (an extremely significant set of problems).

That said, the R ecosystem is large and diverse, and solutions are emerging to address problems with huge volumes of data. Keep in mind that Big Data problems use rather specific techniques (and often specific software/hardware/file-system solutions and protocols), such as MapReduce. Do not assume that everything you can do with a data.frame is possible with gigantic data volumes, and even when a particular technique can be applied, do not assume the algorithms are alike. Issues such as regression with MapReduce are still open research problems; new algorithms and new implementations keep appearing inside and outside the R ecosystem (you can find more information in papers such as Robust Regression on MapReduce).


To give you a taste of where to start, there are already packages that implement out-of-memory techniques (the other answers mention several, such as ff and bigmemory).

28.08.2014 / 17:56

The answer depends on factors such as the analysis task you want to perform and the size of the dataset, that is, how big it is relative to the RAM (and sometimes the hard disk) of the machine on which you intend to carry out the analysis. There are a few cases to consider:

Regarding the size of the dataset:

  • Datasets larger than RAM but smaller than the hard disks common in personal computers, say around 20 GB.

  • Datasets larger than both the RAM and the hard disk of a personal computer.

Regarding the type of analysis:

  • Descriptive analyses, queries, and simple calculations.

  • More complex analyses, including modeling such as random forests, linear regressions, and so on.

When the dataset is of moderate size, larger than RAM but not so large that it is impossible to handle on a single PC, R packages like ff, bigmemory, or even the ScaleR package from Revolution Analytics can perform both simple and more complex analyses. A caveat in these cases: even with these packages, the procedure may be very slow relative to the user's needs. Another less well-known solution is the MADlib library, which extends Postgres and enables complex analyses of large datasets, such as linear/logistic regressions and random forests, directly from R through the PivotalR package.
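As a minimal sketch of the file-backed approach, the following assumes the bigmemory and biganalytics packages are installed and a hypothetical CSV file "data.csv" with numeric columns y and x:

```r
library(bigmemory)
library(biganalytics)

# read.big.matrix stores the data in a file-backed matrix on disk,
# so only small chunks ever occupy RAM at a time
X <- read.big.matrix("data.csv", header = TRUE, type = "double",
                     backingfile = "data.bin",
                     descriptorfile = "data.desc")

# biglm.big.matrix fits a linear regression by processing the
# matrix in chunks rather than loading it whole
fit <- biglm.big.matrix(y ~ x, data = X)
summary(fit)
```

The backing and descriptor files let later sessions re-attach to the same on-disk matrix without re-parsing the CSV.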

If the analysis involves only simple queries and descriptive statistics, an interesting solution might be to simply load the dataset into a Database Management System (DBMS) such as Postgres, MySQL, SQLite3, or MonetDB, and turn the calculations into SQL queries. Alternatively, you can use the dplyr package, defining the data source as one of these DBMSs; the package automatically converts dplyr operations into SQL code. In addition, dplyr lets you use Big Data services in the cloud, such as BigQuery, where the user can run queries directly from the terminal with dplyr, just as if working with a data.frame.
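A short sketch of the dplyr-over-DBMS workflow, assuming the dplyr, dbplyr, and RSQLite packages are installed and that a table "sales" (with illustrative columns region and value) already exists in the SQLite file "db.sqlite3":

```r
library(dplyr)

con <- DBI::dbConnect(RSQLite::SQLite(), "db.sqlite3")
sales <- tbl(con, "sales")   # lazy reference: nothing is loaded into RAM yet

summary_tbl <- sales %>%
  group_by(region) %>%
  summarise(total = sum(value, na.rm = TRUE))

show_query(summary_tbl)  # inspect the SQL that dplyr generated
collect(summary_tbl)     # only the small aggregated result enters R

DBI::dbDisconnect(con)
```

The heavy lifting happens inside the DBMS; collect() is the only step that materializes data in R's memory.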

In situations where the dataset is much larger than RAM, and sometimes intractable on a single computer, you need frameworks that allow distributed processing of large datasets, such as Apache Hadoop or Apache Spark. In these cases, depending on the type of analysis you want to perform, such as queries and simple calculations, Hadoop + R with the RHadoop packages or Spark + R with the SparkR package may suffice.
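A minimal SparkR sketch, assuming a working Spark installation with the SparkR package on the library path; the file path and column name are illustrative:

```r
library(SparkR)
sparkR.session(appName = "big-data-example")

# The Parquet file is read and processed by the Spark cluster,
# not by the local R process
df <- read.df("hdfs:///data/events.parquet", source = "parquet")

# Aggregations run distributed; only the (small) result is
# brought back to R as a local data.frame
res <- collect(agg(groupBy(df, df$type), count = n(df$type)))
print(res)

sparkR.session.stop()
```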

Both Hadoop and Spark have associated projects that implement machine-learning methods, such as Apache Mahout and MLlib, which are not available for use from R. However, there is the H2O engine from 0xdata, which has an R API that lets the user fit models on large datasets. MADlib, mentioned above, can also be used in distributed database management systems such as Greenplum, so that, together with the PivotalR package, it allows for complex analyses. Revolution's ScaleR package can also be used in these cases, on top of a Big Data backend infrastructure.
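To illustrate the H2O route, here is a sketch assuming the h2o package is installed; the file name and "target" column are illustrative:

```r
library(h2o)
h2o.init()   # starts (or connects to) a local H2O cluster

# importFile parses the file inside the H2O cluster, outside R's RAM;
# R only holds a reference to the frame
data <- h2o.importFile("big_data.csv")

# Random forest trained by the H2O engine itself
model <- h2o.randomForest(y = "target",
                          x = setdiff(colnames(data), "target"),
                          training_frame = data,
                          ntrees = 100)
h2o.performance(model)

h2o.shutdown(prompt = FALSE)
```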

26.09.2014 / 15:51

As a solution for handling large volumes of data, I suggest a database management system, the plain "rice and beans" choice being the relational model... In this context, the best FOSS software (satisfying the demands of robustness, stability, scalability, etc.) is PostgreSQL.

Some R users have been using PostgreSQL since ~2006, so the combination is already well established and documented: the "embedded R" module, PL/R (R Procedural Language for PostgreSQL), gives you the freedom to create database procedures in R (for example, UPDATE/INSERT/DELETE triggers written in R jargon instead of PL/pgSQL or some other unfamiliar language) and to perform virtually all R operations (over basic data types) in the database scripts themselves.
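As a small sketch of what PL/R looks like, assuming PostgreSQL with the PL/R extension installed; the function name and logic are illustrative:

```sql
-- Expose an R computation as an ordinary SQL function
CREATE OR REPLACE FUNCTION r_median(vals float8[])
RETURNS float8 AS $$
  median(vals)   -- plain R code, executed inside the database
$$ LANGUAGE plr;

-- Then usable in regular queries, e.g.:
-- SELECT r_median(array_agg(price)) FROM sales;
```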

There are some tips on how to install it over on the English-language Stack Overflow.

However, PL/R apparently still has the same RAM restrictions as R itself, as seen in this Stack Overflow answer.

23.05.2017 / 14:37

There is always the option of working with databases external to R and loading only the variables needed for the analysis (analyses that use all the variables at once are rare). Several packages let you work with other kinds of databases; among them I would highlight RSQLite. It includes functions that create SQLite databases from delimited files and from fixed-width files.
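A minimal sketch of this column-selection strategy, assuming the DBI and RSQLite packages are installed; the database file and column names are illustrative:

```r
library(DBI)

con <- dbConnect(RSQLite::SQLite(), "survey.sqlite3")

# Pull into R only the two variables the analysis needs,
# instead of the full table
df <- dbGetQuery(con,
  "SELECT age, income FROM respondents WHERE income IS NOT NULL")

# A regular in-memory analysis on the small extract
fit <- lm(income ~ age, data = df)
summary(fit)

dbDisconnect(con)
```

Since SQLite reads column values lazily from disk, R's memory footprint is bounded by the extract, not by the full table.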

29.08.2014 / 16:05