I have worked on projects involving the integration of various data sources, data transformation and reporting. The scripts are mainly in R, but occasionally I use other languages.
I have created the following directories: report/, script/, lib/, data/, raw-data/ and doc/. The source code lives in script/ and lib/, the data in data/ and raw-data/, and the reports in report/.
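At a glance, the layout and the purpose of each directory (as described below):

    report/     # generated HTML reports
    script/     # small transformation scripts and report scripts
    lib/        # functions shared by more than one script
    data/       # data generated by the scripts (.rds files)
    raw-data/   # manually created or obtained data (.csv and similar)
    doc/        # diagrams of the data frames and the transformation pipeline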
The general idea is to create small R scripts, each transforming the data one step further, until it reaches the formatted form used in the reports.
In raw-data/, I keep data that was created or obtained manually, usually in .csv or similar files. Scripts read data from raw-data/ (or data/), possibly perform some transformations (filtering, grouping, etc.) and create files in data/, usually with the saveRDS function, so they can be read back quickly with readRDS. Each script is small and usually writes a single .rds file containing one data frame. Functions used in more than one script live in files in the lib/ folder and are loaded with the source function (with the chdir=TRUE option). The scripts make extensive use of the dplyr package.
In the doc/ folder I try to keep two up-to-date diagrams: one showing the data frames and their columns, and another describing the data transformation pipeline (a diagram of scripts and data files, indicating for each script which files it reads and which it writes). The advantage of documenting the pipeline is that when some file changes (for example, because more recent data arrives), it is easy to determine which scripts need to be run, and in what order, to update the data used in the final reports. I use yEd to create the diagrams.
Some of the scripts in script/ generate reports. They are written to be compiled with knitr::spin and create HTML files in report/, often containing graphics generated with rCharts.
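A report script for knitr::spin might look roughly like this (file names and content are invented; the lines starting with #' become prose in the rendered report):

    # script/90-report-sales.R -- compiled with knitr::spin()
    #' # Monthly sales by region
    #'
    #' Figures are computed from data/sales.rds.

    library(dplyr)
    sales <- readRDS("data/sales.rds")

    #' The aggregated totals:
    knitr::kable(sales)

Calling knitr::spin("script/90-report-sales.R") turns the #' comments into text, runs the code chunks and, with the default knit and report options, produces an HTML report, which then goes in report/.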
The project is kept under version control with Git. I avoid keeping the files in data/ under version control, since they can be large and many of them can be regenerated from the scripts and the data in raw-data/. The exception is data files derived from external databases; those I do put under version control, so that people without database access can run the project.
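The corresponding .gitignore is along these lines (the whitelisted file name is just an example of a database-derived file):

    # data/ files can be regenerated from script/ and raw-data/, so don't track them...
    data/*.rds
    # ...except files derived from external databases, so that people
    # without database access can still run the project.
    !data/customers-from-db.rds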
An example project that uses this workflow can be found at link
The advantage of using a number of small, specialized scripts is that if a data source is updated, or a new column needs to be computed in some data frame, only the scripts that handle that data frame need to be run again. This matters especially when the project involves slow data transformations or access to external data sources such as the web or database management systems.