Best SQL DBMS for large databases (RAIS microdata vs. RAM)

3

I'm trying to write a routine that automates importing the public RAIS microdata and lets me work with them on an ordinary computer (RAIS size & RAM).

What is the best SQL software (DBMS / protocol) to use? I would prefer RSQLite, because the database can be created from inside R itself, without the user having to install any other software.

But I would like to know the trade-offs relative to other DBMSs in terms of performance, the ability to handle large databases, and so on.

In practice, besides the SQL import loop, I would also like to automate the data download, since the files are made available by the MTE via Google Drive (I asked about this on the English site here). They also come compressed in .7z format and, from what I have seen, cannot be unpacked with R commands alone (though you can install 7-Zip and run it from the shell command line, from within R).
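A minimal sketch of that download-and-extract step, assuming 7-Zip is installed and on the PATH; the URL is a placeholder (the real links are the MTE Google Drive ones):

```r
# Build the 7-Zip extraction command ("7z" must be installed and on the PATH)
make_7z_cmd <- function(archive, out_dir) {
  paste("7z x", archive, paste0("-o", out_dir))
}

url  <- "http://example.com/RAIS2013.7z"        # hypothetical URL, for illustration only
dest <- file.path(tempdir(), "RAIS2013.7z")

# download.file(url, dest, mode = "wb")         # download the archive
# system(make_7z_cmd(dest, tempdir()))          # extract it from within R via the shell
```

The download and `system()` calls are commented out here because they depend on the network and on 7-Zip being present; in the actual routine they would run inside the loop over files.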

    
asked by anonymous 03.10.2014 / 03:18

1 answer

6

The choice of DBMS depends on what you will do with the data and on the type of data involved. Below are what I consider the advantages and disadvantages of three DBMSs: Postgres, MonetDB, and SQLite3.

SQLite3 is the simplest of the three and also the one with the fewest features and the least capacity to store data. Its great advantage is simplicity: you do not need to install anything on your computer, just load an R package, and the database can be transferred by simply copying a file. If the database is relatively small and SQLite3's performance is satisfactory in terms of speed and features, it is an excellent choice.

Postgres is certainly the one with the most features. Virtually any feature available in other DBMSs is also available in Postgres. There is also the MADlib library, which extends its features so that complex analyses can be performed, all directly from R, either with dplyr or with PivotalR. With these packages it is not necessary to write SQL queries directly, and users can work using only the knowledge of R they already have.
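A sketch of the dplyr route: the verbs are translated to SQL and executed inside the DBMS, and only `collect()` brings results into R. For Postgres you would connect with a Postgres driver (e.g. the RPostgres package); here an in-memory SQLite connection stands in so the example is self-contained, and the table and column names are made up:

```r
library(DBI)
library(dplyr)

# Stand-in connection; for Postgres you would use, e.g.,
# dbConnect(RPostgres::Postgres(), dbname = "rais", host = "localhost")
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "vinculos",
             data.frame(uf = c("SP", "SP", "RJ"),
                        salario = c(1000, 2000, 1200)))

# These verbs are translated to a single SQL query run by the DBMS
media <- tbl(con, "vinculos") %>%
  group_by(uf) %>%
  summarise(media = mean(salario, na.rm = TRUE)) %>%
  collect()

dbDisconnect(con)
```

The same pipeline works unchanged against a Postgres connection, which is the point of the dplyr abstraction: only the `dbConnect()` line changes.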

MonetDB is a columnar DBMS, that is, it stores data internally as columns. This is a big advantage of MonetDB, because read speed is significantly faster than in Postgres and SQLite3. However, the write cost is higher, so in some applications the gain on reads can be offset by the loss on writes.

So my suggestion is this: if the dataset is reasonably stable, i.e. there is no need for regular writes, MonetDB may be the best choice. If you need to write and read regularly, and the dataset is large but not much larger than RAM, SQLite3 may be a better option. Finally, if complex queries and modeling are required on very large volumes of data, perhaps the best option is Postgres + dplyr + PivotalR.

Especially in the case of microdata, where after the initial load there is practically no need to write to the database, read time can be critical. In these cases my preference is MonetDB. Another advantage of MonetDB for microdata is that, if the data come from complex survey samples, the sampling design must be included in the analysis; with the sqlsurvey package you can easily include the sampling design in the analysis using MonetDB.

    
03.10.2014 / 04:49