I have worked on projects involving the integration of various data sources, data transformation and reporting. The scripts are mainly in R, but occasionally I use other languages.
I have created the following directories: report/, script/, lib/, data/, raw-data/ and doc/. The source code lives in script/ and lib/, the data in data/ and raw-data/, and the reports in report/.
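At a glance, the layout and the purpose of each directory (as described below):

    report/     # generated HTML reports
    script/     # small transformation scripts and report scripts
    lib/        # functions shared by more than one script
    data/       # data generated by the scripts (.rds files)
    raw-data/   # manually created or obtained data (.csv and similar)
    doc/        # diagrams of the data frames and the transformation pipeline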
The general idea is to create small R scripts, each transforming the data one step further, until it reaches the formatted form used in the reports.
In raw-data/, I keep data that was created or obtained manually, usually in .csv or similar files. Scripts read data from raw-data/ (or data/), possibly perform some transformations (filtering, grouping, etc.) and create files in data/, usually with the saveRDS function, so they can be read back quickly with readRDS. Each script is small and usually writes a single .rds file containing one data frame. Functions used in more than one script live in files in the lib/ folder and are loaded with the source function (with the chdir=TRUE option). The scripts make extensive use of the dplyr package.
In the doc/ folder I try to keep two up-to-date diagrams: one showing the data frames and their columns, and another describing the data transformation pipeline (a diagram of scripts and data files, indicating for each script which files it reads and which it writes). The advantage of documenting the pipeline is that when some file changes (for example, because more recent data arrives), it is easy to determine which scripts need to be run, and in what order, to update the data used in the final reports. I use yEd to create the diagrams.
Some of the scripts in script/ generate reports. They are written to be compiled with knitr::spin and create HTML files in report/, often containing graphics generated with rCharts.
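A report script for knitr::spin might look roughly like this (file names and content are invented; the lines starting with #' become prose in the rendered report):

    # script/90-report-sales.R -- compiled with knitr::spin()
    #' # Monthly sales by region
    #'
    #' Figures are computed from data/sales.rds.

    library(dplyr)
    sales <- readRDS("data/sales.rds")

    #' The aggregated totals:
    knitr::kable(sales)

Calling knitr::spin("script/90-report-sales.R") turns the #' comments into text, runs the code chunks and, with the default knit and report options, produces an HTML report, which then goes in report/.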
The project is kept under version control with Git. I avoid keeping the files in data/ under version control, since they can be large and many of them can be regenerated from the scripts and the data in raw-data/. The exception is data files derived from external databases; those I do put under version control, so that people without database access can run the project.
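The corresponding .gitignore is along these lines (the whitelisted file name is just an example of a database-derived file):

    # data/ files can be regenerated from script/ and raw-data/, so don't track them...
    data/*.rds
    # ...except files derived from external databases, so that people
    # without database access can still run the project.
    !data/customers-from-db.rds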
An example project that uses this workflow can be found at link
The advantage of using a number of small, specialized scripts is that if a data source is updated, or a new column needs to be computed in some data frame, only the scripts that handle that data frame need to be run again. This matters especially when the project involves slow data transformations or access to external data sources such as the web or database management systems.