Using large data files with workflowr

workflowr version 1.7.1

John Blischak

2023-08-22

Introduction

Workflowr provides many features to track the progress of your data analysis project and make it easier to reproduce both the current and previous versions of the project. However, this is only possible if the data files from previous versions can also be restored. In other words, even if you can obtain the code from six months ago, if you can’t obtain the data from six months ago, you won’t be able to reproduce your previous analysis.

Unfortunately, if you have large data files, you can’t simply commit them to the Git repository along with the code. GitHub rejects any pushed file larger than 100 MB, and keeping files well below this limit is good practice no matter which Git hosting service you use. Large files make each push and pull take much longer and increase the risk of the download timing out. This vignette discusses various strategies for versioning your large data files.

Option 0: Reconsider versioning your large data files

Before considering any of the options below, reconsider whether versioning your large data files is even necessary for your project, and if it is, which data files actually need to be versioned. Specifically, large raw data files that are never modified do not need to be versioned. Instead, you could follow these steps:

  1. Upload the files to an online data repository, a private FTP server, etc.
  2. Add a script to your workflowr project that can download all the files (a sketch is shown below).
  3. Include instructions in your README and on your workflowr website that explain how to download the files.

For example, an RNA sequencing project will produce FASTQ files that are large and won’t be modified. Rather than committing these files to the Git repository, they should be uploaded to GEO/SRA.
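
To make step 2 concrete, below is a minimal sketch of such a download script, assuming the raw files are hosted at a publicly accessible location. The file names and base_url are hypothetical placeholders.

    # Sketch of a download script, e.g. saved as code/download-data.R.
    # The file names and base URL are hypothetical placeholders.
    raw_files <- c(
      "sample1.fastq.gz",
      "sample2.fastq.gz"
    )
    base_url <- "https://example.org/project-data"

    # Store the raw files in a directory that is not committed to Git.
    dir.create("data/raw", recursive = TRUE, showWarnings = FALSE)

    for (f in raw_files) {
      dest <- file.path("data/raw", f)
      if (!file.exists(dest)) {
        download.file(url = file.path(base_url, f), destfile = dest, mode = "wb")
      }
    }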

Option 1: Record metadata

If your large data files are modified throughout the project, one option would be to record metadata about the data files, save it in a plain text file, and then commit the plain text file to the Git repository. For example, you could record the modification date, file size, MD5 checksum, number of rows, number of columns, column means, etc.
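
For instance, the sketch below records a few of these fields for a hypothetical file data/large-file.csv and writes them to a small plain text file that can be committed. The file names are placeholders.

    # Sketch: record metadata about a large data file in a small text file.
    # The file names are hypothetical placeholders.
    f <- "data/large-file.csv"
    dat <- read.csv(f)  # read once to record the dimensions

    meta <- data.frame(
      file     = f,
      modified = format(file.mtime(f)),
      size     = file.size(f),
      md5      = unname(tools::md5sum(f)),
      nrow     = nrow(dat),
      ncol     = ncol(dat)
    )

    # Commit this small metadata file instead of the large data file itself.
    write.csv(meta, "data/large-file-metadata.csv", row.names = FALSE)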

For example, if your data file contains observational measurements from a remote sensor, you could record the date of the last observation and commit this information. Then if you need to reproduce an analysis from six months ago, you could recreate the previous version of the data file by filtering on the date column.

Option 2: Use Git LFS (Large File Storage)

If you are comfortable using Git in the terminal, a good option is Git LFS. It is an extension to Git that adds extra functionality to the standard Git commands. Thus it is completely compatible with workflowr.

Instead of committing the large file to the Git repository, Git LFS commits a plain text file containing a unique hash and uploads the large file itself to a remote server. If you check out a previous version of the code, Git LFS uses the hash in the plain text file to download the corresponding previous version of the large data file from the server.

Git LFS is integrated into GitHub. However, a free account is allotted only 1 GB of storage and 1 GB of bandwidth per month, so you may have to upgrade to a paid GitHub account if you need to version many large data files.

See the Git LFS website to download the software and set it up to track your large data files.
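
If you prefer to stay within R, one possible approach is to call the Git LFS command-line tool via system2(). The sketch below assumes Git LFS is already installed, and the tracking pattern data/*.rds is only an example.

    # Sketch: drive the Git LFS command-line tool from R.
    # Assumes Git LFS is installed; the tracking pattern is an example.
    system2("git", c("lfs", "install"))
    system2("git", c("lfs", "track", "data/*.rds"))

    # Git LFS records tracked patterns in .gitattributes, which must be committed.
    workflowr::wflow_git_commit(".gitattributes",
                                "Track large data files with Git LFS")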

Note that for workflowr you can’t use Git LFS with any of the website files in docs/. GitHub Pages serves the website using the exact versions of the files in that directory on GitHub. In other words, it won’t pull the large data files from the LFS server. Therefore everything will look fine on your local machine, but break once pushed to GitHub.

As an example of a workflowr project that uses Git LFS, see the GitHub repository singlecell-qtl. Note that the large data files, e.g. data/eset/02192018.rds, contain the phrase “Stored with Git LFS”. If you download the repository with git clone, the large data files will only contain the unique hashes. See the contributing instructions for how to use Git LFS to download the latest version of the large data files.

Option 3: Use piggyback

An alternative to Git LFS is the R package piggyback. Its main advantages are that it doesn’t require a paid GitHub account or any extra Git configuration. Instead, it uses R functions to upload large data files to releases on your GitHub repository. The main disadvantage, especially for workflowr, is that it isn’t integrated with Git. Therefore you will have to version the large data files manually by uploading them via piggyback and recording the release version in a file in the workflowr project. This option is recommended if you anticipate substantial, but infrequent, changes to your large data files.
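
As a rough sketch of how this could look, the repository name, release tag, and file name below are hypothetical, and the release for the tag must already exist on GitHub.

    # Sketch: upload and download a large data file with piggyback.
    # The repository, tag, and file name are hypothetical placeholders.
    library(piggyback)

    repo <- "user/project"   # GitHub repository
    tag  <- "data-2023-08"   # an existing release created for this data snapshot

    # Upload the current version of the large file to the release.
    pb_upload("data/large-file.rds", repo = repo, tag = tag)

    # Later, or on another machine, download that version of the file.
    pb_download("data/large-file.rds", repo = repo, tag = tag)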

Option 4: Use a database

Importing large amounts of data into an R session can drastically degrade R’s performance or even cause it to crash. If you have a large amount of data stored in one or more tabular files, but only need to access a subset at a time, you should consider converting your large data files into a single database. Then you can query the database from R to obtain the subset of data needed for a particular analysis. Not only is this memory-efficient, but you will also benefit from the improved organization of your project’s data. See the CRAN Task View on Databases for resources on interacting with databases from R.
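
As an illustration, a minimal sketch using the DBI and RSQLite packages might look like the following. The file, table, and column names are hypothetical placeholders.

    # Sketch: import a large tabular file into a SQLite database once,
    # then query only the subset needed for a given analysis.
    library(DBI)

    con <- dbConnect(RSQLite::SQLite(), "data/project.sqlite")

    # One-time import of the large file into a database table.
    measurements <- read.csv("data/measurements.csv")
    dbWriteTable(con, "measurements", measurements, overwrite = TRUE)

    # In an analysis, pull only the rows that are needed.
    subset_2023 <- dbGetQuery(con, "SELECT * FROM measurements WHERE year = 2023")

    dbDisconnect(con)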