The answer depends on factors such as the analysis you want to perform and the size of the dataset, that is, how big it is relative to the RAM (and sometimes the hard disk) of the machine on which you intend to run the analysis. There are a few cases to consider:
Regarding the size of the dataset:

- Datasets larger than RAM but smaller than the hard disks common in personal computers, say around 20 GB.
- Datasets larger than both the RAM and the hard disk of a personal computer.
Regarding the type of analysis:

- Descriptive analyses, queries, and simple calculations.
- More complex analyses, including fitting models such as random forests, linear regressions, and so on.
When the dataset is of moderate size, larger than RAM but not so large that it cannot be handled on a single PC, R packages like ff, bigmemory, or the ScaleR package from Revolution Analytics can perform both simple and more complex analyses. One caveat in these cases is that, even with these packages, the procedure may be too slow for the user's needs. Another less well-known solution is the MADlib library, which extends PostgreSQL and enables complex analyses of large datasets, such as linear/logistic regressions and random forests, directly from R through the PivotalR package.
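To illustrate the first approach, here is a minimal sketch with bigmemory and biganalytics, assuming a numeric-only CSV; the file and column names are hypothetical:

```r
library(bigmemory)
library(biganalytics)

# read.big.matrix() memory-maps the file on disk instead of loading it
# into RAM; the backing files can be reused in later sessions.
x <- read.big.matrix("big_data.csv", header = TRUE, type = "double",
                     backingfile = "big_data.bin",
                     descriptorfile = "big_data.desc")

# Descriptive statistics without bringing the whole matrix into RAM
colmean(x)  # column means, provided by biganalytics

# Out-of-core linear regression ("y", "x1", "x2" are hypothetical columns)
biglm.big.matrix(y ~ x1 + x2, data = x)
```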
If the analysis involves only simple queries and descriptive statistics, an interesting solution is simply to load the dataset into a Database Management System (DBMS) such as PostgreSQL, MySQL, SQLite, or MonetDB, and turn the calculations into SQL queries. Alternatively, you can use the dplyr package, defining one of these DBMSs as the data source; the package then automatically translates dplyr operations into SQL. In addition to these alternatives, dplyr lets you use Big Data services in the cloud, such as BigQuery, where you can run queries directly from R with dplyr, just as you would with a data.frame.
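As a sketch of the DBMS route, here is the dplyr workflow against SQLite via DBI (it requires the dbplyr backend installed); the database file, table, and column names are hypothetical:

```r
library(DBI)
library(RSQLite)
library(dplyr)

con <- dbConnect(RSQLite::SQLite(), "data.sqlite3")
sales <- tbl(con, "sales")  # lazy reference; nothing is read into RAM yet

# dplyr translates the pipeline into SQL and runs it inside the DBMS;
# only the aggregated result is brought back into R by collect()
sales %>%
  group_by(region) %>%
  summarise(total = sum(amount, na.rm = TRUE)) %>%
  collect()

dbDisconnect(con)
```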
In situations where the dataset is much larger than RAM, and sometimes intractable on a single computer, you need frameworks that allow distributed processing of large datasets, such as Apache Hadoop or Apache Spark. In these cases, depending on the type of analysis you want to perform, such as simple queries and calculations, Hadoop + R with the rmr package (from the RHadoop project) or Spark + R with the SparkR package may suffice.
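A minimal SparkR sketch of that workflow, assuming Spark 2.x or later is installed and SPARK_HOME is set; the file and column names are hypothetical:

```r
library(SparkR)
sparkR.session(master = "local[*]")  # or the address of a real cluster

# The file is read and partitioned by Spark, not by the R session
df <- read.df("big_data.csv", source = "csv",
              header = "true", inferSchema = "true")

# A simple distributed aggregation: total amount per region
result <- agg(groupBy(df, df$region), total = sum(df$amount))
head(result)

sparkR.session.stop()
```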
Both Hadoop and Spark have associated projects that implement Machine Learning methods, Apache Mahout and MLlib respectively, but those are not available for use from R. However, the H2O engine from 0xdata has an R API that lets the user fit models on large datasets. MADlib, mentioned above, can also be used in distributed database management systems such as Greenplum, so that, together with the PivotalR package, it enables complex analyses. The Revolution ScaleR package can also be used in these cases, running on top of a Big Data backend infrastructure.
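To make the H2O route concrete, here is a minimal sketch of fitting a random forest through its R API; the file name, target column, and parameters are hypothetical:

```r
library(h2o)
h2o.init()  # starts (or connects to) a local H2O cluster

# The data is loaded into the H2O cluster; R holds only a reference
df <- h2o.importFile("big_data.csv")

# Fit a random forest on the cluster ("target" is a hypothetical column)
fit <- h2o.randomForest(y = "target",
                        x = setdiff(names(df), "target"),
                        training_frame = df,
                        ntrees = 100)
fit

h2o.shutdown(prompt = FALSE)
```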