There are several utilities in the R ecosystem for reproducible research. The repana package (for Reproducible Analysis) helps you keep a common directory structure in which to save the files that are part of the main stream of production, the files that are products of that main stream, and modified files that are no longer part of the main stream but need to be kept, such as re-formatted reports or presentations.
The aspiration of this package is that you can set up an analysis with the make_structure() function, get access to the database with the get_con() function, have your tables in the database documented with the update_table() function, and reproduce a complete analysis by running the master() function.
make_structure() reads the config.yml file and ensures that all entries in the dirs section exist. The config.yml created by default will produce the following directories for the data, functions, database, logs, reports, and handmade entries of the dirs section:
_data to keep all data sources needed to reproduce the analysis
_functions to keep all functions programmed for the analysis
database to keep all secondary datasets and objects
logs to keep the log of the scripts
reports to keep all secondary analysis reports and sheets
handmade to keep all modified files and reports that should be kept
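The directory-creation step can be sketched in base R as follows. This is an illustration only, not repana's actual implementation; the dirs values mirror the defaults of the config.yml shown later, and the temporary project root is invented for the example.

```r
# Sketch: create the default directory structure inside a temporary project root.
root <- file.path(tempdir(), "myproject")
dirs <- list(data = "_data", functions = "_functions", handmade = "handmade",
             database = "database", reports = "reports", logs = "logs")
for (d in dirs) {
  dir.create(file.path(root, d), recursive = TRUE, showWarnings = FALSE)
}
all(dir.exists(file.path(root, unlist(dirs))))  # TRUE when all were created
```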
The directories can be used in your programs via the config::get("dirs") function.
The information in _data, _functions, and handmade, as well as the scripts in the root directory, should be preserved, as they are the core of your analysis. The files in database are created and recreated as a result of your analysis scripts; those are the results of your analysis.
clean_structure() cleans the directories included in the clean_before_new_analysis list, so a new analysis can be re-run without worrying that new and old results get mixed. If you use git, those directories are candidates to be excluded from version control by listing them in the .gitignore file. (make_structure() takes care of creating a .gitignore if it does not exist, or of adding those directories if they are not yet included.)
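The .gitignore maintenance just described can be sketched in base R. This is an illustration of the idea, not repana's actual implementation; the list of entries to ignore is assumed here to be the clean_before_new_analysis directories.

```r
# Sketch: add missing directories to .gitignore, creating the file if absent.
gitignore <- file.path(tempdir(), ".gitignore")
to_ignore <- c("database", "reports", "logs")
existing <- if (file.exists(gitignore)) readLines(gitignore) else character(0)
missing <- setdiff(to_ignore, existing)       # only add what is not there yet
writeLines(c(existing, missing), gitignore)
readLines(gitignore)
```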
make_structure() writes a config.yml if it does not exist. This file is read by the config::get() function. It contains the following entries under the default: tree:
dirs: defines the directories that make_structure will maintain. It has entry values for the data, functions, database, logs, reports, and handmade directories. If you prefer options other than the default values, you may change them. You may access those directories in your programs using config::get("dirs"). Note that the names of the data and functions entries do not have an underscore, but by default their values are _data and _functions. make_structure will create the paths with the values of the entries, but you access them in your program with the names of the entries. This gives you the freedom to direct the real path of a directory to any place you need.
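For example, building a file path from the dirs entries could look like this. A plain list stands in for the value returned by config::get("dirs") so the snippet is self-contained, and the file name is invented for the illustration.

```r
# dirs stands in for config::get("dirs"); note the entry name is "data"
# while its default value is "_data".
dirs <- list(data = "_data", functions = "_functions")
path <- file.path(dirs$data, "raw_measurements.csv")
path  # "_data/raw_measurements.csv"
```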
clean_before_new_analysis: defines the list of directories that should be cleaned every time you want to repeat the analysis from zero. These directories are also candidates for the .gitignore file.
defaultdb: is written with the parameters for an RSQLite SQLite driver.
You may add any other entries that your analysis requires. The config.yml file itself should also be included in your .gitignore file, as it contains values that change from system to system (e.g., driver parameters). You should therefore document which entries need to be defined so you can reproduce the analysis on another machine.
DBI and Pool connections are used as a way to keep data as well as results in a database system. You must provide, at least, values for the package and dbconnection entries, corresponding to the package that hosts the dbConnection and the name of the dbConnection function. Notice that entry names are lowercase. The rest of the entries must correspond to the parameters of your driver connection or pool connection.
Example to use RSQLite with a results.db file in the database directory:

defaultdb:
  package: RSQLite
  dbconnect: SQLite
  dbname: database/results.db
Example to use RPostgres with a PostgreSQL database:

defaultdb:
  package: RPostgres
  dbconnection: Postgres
  dbname: testdb
  host: localhost
  port: 5432
  user: username
  password: password
You can define several configurations to use different databases in the same analysis, but defaultdb will be used by default by the get_con() function.
update_table will save a data.frame into the database and will keep a log in the log_table table with the timestamps at which the table was updated in the database. log_table keeps a record of when the table was included in the database, along with a comment that helps trace its origin. You may include the date the data was obtained or the source of the data.
update_table(p_con, "iris", "from system")
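Conceptually, the record kept in log_table resembles the following sketch. The column names here are invented for illustration; repana's actual schema may differ.

```r
# Sketch of a log_table record: when the table entered the database and why.
log_entry <- data.frame(
  table   = "iris",
  updated = format(Sys.time(), "%Y-%m-%d %H:%M:%S"),
  comment = "from system",
  stringsAsFactors = FALSE
)
log_entry
```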
The master(pattern, start, stop, logdir, rscript_path) function executes each of the files identified by the pattern in a plain vanilla R process. The default pattern is "^[0-9][0-9].*\\.R$", which includes files like 04_make_report.R but not exploratory.R. Files are run in order starting from the first, but if for any reason you need to omit the first files, you may skip them with the start parameter.
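The default pattern can be checked against candidate file names with base R:

```r
pattern <- "^[0-9][0-9].*\\.R$"
grepl(pattern, "04_make_report.R")  # TRUE: starts with two digits, ends in .R
grepl(pattern, "exploratory.R")     # FALSE: no leading two-digit prefix
```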
logdir is the directory for the logs; by default, the logs directory.
rscript_path is the full path to the Rscript program, which is, in the end, the one that processes each R file. The current default is for an OS X system; implementations for Linux and Windows will be added soon.
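The selection and ordering step of master() can be sketched as follows. This is an illustration in base R with invented file names; repana's implementation additionally runs each file through Rscript and writes a log, which is only indicated in a comment here.

```r
# Sketch: find the numbered scripts and order them as master() would.
proj <- file.path(tempdir(), "run_demo")
dir.create(proj, showWarnings = FALSE)
file.create(file.path(proj, c("02_clean.R", "01_load.R", "exploratory.R")))
scripts <- sort(list.files(proj, pattern = "^[0-9][0-9].*\\.R$"))
scripts  # "01_load.R" "02_clean.R" -- exploratory.R is excluded
# Each script would then be run in a fresh process, e.g.:
# for (s in scripts) system2(rscript_path, file.path(proj, s))
```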
The master function uses functions from the
Here is an example of the config.yml created by make_structure():

default:
  dirs:
    data: _data
    functions: _functions
    handmade: handmade
    database: database
    reports: reports
    logs: logs
  clean_before_new_analysis:
    - database
    - reports
    - logs
  defaultdb:
    package: RSQLite
    dbconnect: SQLite
    dbname: ":memory:"
    driver: /usr/local/lib/libsqlite3odbc.dylib
    database: database/results.db
You may download the package from GitHub. Within R you may use:

devtools::install_github("johnaponte/repana", build_manual = TRUE, build_vignettes = TRUE)