rrefine

VP Nagraj

2022-11-12

Introduction

OpenRefine (formerly Google Refine) is a popular, open source data cleaning software1. rrefine enables users to programmatically trigger data transfer between R and OpenRefine. Using the functions available in this package, you can import, export, apply data cleaning operations, or delete a project in OpenRefine directly from R. There are several client libraries for automating OpenRefine tasks via Python, nodeJS and Ruby2. rrefine extends this functionality to R users.

Installation

rrefine is available on CRAN:

install.packages("rrefine")

The latest version of the package is also available on GitHub and can be installed via devtools by using the following:

# install.packages("devtools")
devtools::install_github("vpnagraj/rrefine")
library(rrefine)

lateformeeting

rrefine includes a sample “dirty” data set to illustrate its features. This object (lateformeeting) is a simulated data frame that holds 63 observations of dates, days of the week, numbers of hours slept and indicators of whether or not the subject was on time for work. The data are recorded in inconsistent formats and will require cleaning in order to be parsed correctly by R. You can take a look at how messy things are below:

theDate what.day.whas.it sleephours was.i.on.time.for.work
1/8/99 Friday 7 Yes
Janury 15 1999 friday 5 No
1/25/1999 Monday 8 No
1/28/1999 thursday 8 Yes
2/9/99 Tuesday six Yes
2/11/1999 Thursday 7 No
February 15, 1999 Monday 8 Yes
2/19/1999 Friday 10 Yes
3/3/1999 Wednesday 5 n
3/4/1999 Thursday 9 Yes
3/5/1999 Fridday 8 Y
March 8, 1999 Monday 11 Yes
3/15/99 Monday 8 yep
3/16/1999 Tuesday 8 Yes
3/17/1999 Wedensday 6.5 N
4/1/99 Thursday 7 Yes
4/12/1999 Monday 8 Yes
4/20/1999 Tuesday 10 Yes
4/27/99 Tuesday 7 Yes
5/5/1999 Wednesday 6 N
5/12/1999 Wednseday 9 Yes
5/13/99 Thursday 7 Yes
5/17/1999 monday 7 and a half Y
May 25 1999 Tuesday 8 YES
5/27/1999 Thursday 8 No
6/1/1999 tuesday 7 Yes
6/7/1999 Monday 10 Yes
6/11/1999 Friday 6 NO
6/15/99 Tuesday 8 Yes
6/24/1999 thursday 7.5 Yes
7/1/1999 Thursday 7 No
July 6 1999 Wednesday 5 Yes
7/12/1999 Monday 8 Y
7/20/1999 tuesday 9 Yes
7/21/99 Wednesday 6 No
7/23/1999 fridya 8 Yes
8/3/1999 Tuesday 7 Yes
8/5/1999 Thursday 6 Y
8/16/1999 Modnay 8 Yes
Aug 27 1999 Friday 4 Yes
August 31, 1999 Teusday 7 yep
September 1, 1999 Wednesday 11 Yes
9/10/1999 Friday 7 Yes
9/20/1999 Monday 7 No
9/23/1999 thursdayy 7.5 Yes
9/28/1999 Monday 8 Yes
10/1/99 Friday 7 Y
10/4/1999 Monday 6 Yes
10/5/1999 tuesday 9 Yes
10/14/1999 Thursday 8 Yes
10/22/1999 Friday 7 No
10/25/99 Monday 7 Yes
10/27/1999 Wednesday 8 Y
11/2/1999 Tuesday 7 Yes
11/3/1999 Wednseday 8 Yes
11/8/1999 monday 5 Yes
11/18/99 Thursday 8 No
12/1/1999 Wednesday 7 Yes
12/3/1999 Friday 6 N
12/6/1999 Monday 6.5 Yes
12/13/1999 Monday 9 Yes
12/16/1999 Thursday 7 Yes
12/21/99 Tuesday 8 Y

refine_upload()

While the data cleaning could be performed using R, the operations here describe a typical scenario for OpenRefine users. The first step to creating a new project is to make sure OpenRefine is installed and running3. By default, the application will run locally at http://127.0.0.1:3333/. All of the functions in rrefine will assume the default local host name and port, however these can both be overridden4. Additionally, as of v1.1.0 the package will internally connect to the OpenRefine instance using a CSRF token in API requests5. The refine_upload() function allows you to pass the contents of a delimited text file (csv or tsv) along with a project name (optional) and an argument to automatically open the browser in which OpenRefine is running. The example below demonstrates this workflow using the lateformeeting sample data:

write.csv(lateformeeting, file = "lateformeeting.csv", row.names = FALSE)
refine_upload(file = "lateformeeting.csv", project.name = "lfm_cleanup", open.browser = TRUE)

With the project uploaded, you can perform any of the desired clean-up procedures in OpenRefine.

refine_operations()

Whether the data in OpenRefine has been uploaded via refine_upload() or another method, users can programmatically apply operations to projects using refine_operations(). This function will pass an arbitrary list of data cleaning operations to the specified project. Operations must be defined in valid JSON format6. In addition to the generic refine_operations() that can flexibly accept any valid JSON operation, the rrefine package includes a series of wrapper functions to perform common data cleaning procedures:

The example below demonstrates several operations using the lateformeeting sample data:

refine_add_column(new_column = "dotw_allcaps", 
                  base_column = "what.day.whas.it", 
                  value = "grel:value",
                  project.name = "lfm_cleanup")
refine_to_upper(column_name = "dotw_allcaps", project.name = "lfm_cleanup")
refine_export(project.name = "lfm_cleanup")$dotw_allcaps
##  [1] "FRIDAY"    "FRIDAY"    "MONDAY"    "THURSDAY"  "TUESDAY"   "THURSDAY" 
##  [7] "MONDAY"    "FRIDAY"    "WEDNESDAY" "THURSDAY"  "FRIDAY"    "MONDAY"   
## [13] "MONDAY"    "TUESDAY"   "WEDNESDAY" "THURSDAY"  "MONDAY"    "TUESDAY"  
## [19] "TUESDAY"   "WEDNESDAY" "WEDNESDAY" "THURSDAY"  "MONDAY"    "TUESDAY"  
## [25] "THURSDAY"  "TUESDAY"   "MONDAY"    "FRIDAY"    "TUESDAY"   "THURSDAY" 
## [31] "THURSDAY"  "WEDNESDAY" "MONDAY"    "TUESDAY"   "WEDNESDAY" "FRIDAY"   
## [37] "TUESDAY"   "THURSDAY"  "MONDAY"    "FRIDAY"    "TUESDAY"   "WEDNESDAY"
## [43] "FRIDAY"    "MONDAY"    "THURSDAY"  "MONDAY"    "FRIDAY"    "MONDAY"   
## [49] "TUESDAY"   "THURSDAY"  "FRIDAY"    "MONDAY"    "WEDNESDAY" "TUESDAY"  
## [55] "WEDNESDAY" "MONDAY"    "THURSDAY"  "WEDNESDAY" "FRIDAY"    "MONDAY"   
## [61] "MONDAY"    "THURSDAY"  "TUESDAY"
refine_remove_column(column = "dotw_allcaps", project.name = "lfm_cleanup")

refine_export()

Once you’ve cleaned up the data in OpenRefine you can pull it back into R for plotting, modeling, etc. by using refine_export(). This function will accept either the project name or the numerical unique identifier. It is only necessary to use both if there are multiple projects with the same name in your OpenRefine application. Note that the data is exported directly into R as a data frame and you can assign it to a new object.

lfm_clean <- refine_export(project.name = "lfm_cleanup")
lfm_clean
date dotw hours.slept on.time
1999-01-08T00:00:00Z Friday 7.0 Yes
1999-01-15T00:00:00Z Friday 5.0 No
1999-01-25T00:00:00Z Monday 8.0 No
1999-01-28T00:00:00Z Thursday 8.0 Yes
1999-02-09T00:00:00Z Tuesday 6.0 Yes
1999-02-11T00:00:00Z Thursday 7.0 No
1999-02-15T00:00:00Z Monday 8.0 Yes
1999-02-19T00:00:00Z Friday 10.0 Yes
1999-03-03T00:00:00Z Wednesday 5.0 No
1999-03-04T00:00:00Z Thursday 9.0 Yes
1999-03-05T00:00:00Z Friday 8.0 Yes
1999-03-08T00:00:00Z Monday 11.0 Yes
1999-03-15T00:00:00Z Monday 8.0 Yes
1999-03-16T00:00:00Z Tuesday 8.0 Yes
1999-03-17T00:00:00Z Wednesday 6.5 No
1999-04-01T00:00:00Z Thursday 7.0 Yes
1999-04-12T00:00:00Z Monday 8.0 Yes
1999-04-20T00:00:00Z Tuesday 10.0 Yes
1999-04-27T00:00:00Z Tuesday 7.0 Yes
1999-05-05T00:00:00Z Wednesday 6.0 No
1999-05-12T00:00:00Z Wednesday 9.0 Yes
1999-05-13T00:00:00Z Thursday 7.0 Yes
1999-05-17T00:00:00Z Monday 7.5 Yes
1999-05-25T00:00:00Z Tuesday 8.0 Yes
1999-05-27T00:00:00Z Thursday 8.0 No
1999-06-01T00:00:00Z Tuesday 7.0 Yes
1999-06-07T00:00:00Z Monday 10.0 Yes
1999-06-11T00:00:00Z Friday 6.0 No
1999-06-15T00:00:00Z Tuesday 8.0 Yes
1999-06-24T00:00:00Z Thursday 7.5 Yes
1999-07-01T00:00:00Z Thursday 7.0 No
1999-07-06T00:00:00Z Wednesday 5.0 Yes
1999-07-12T00:00:00Z Monday 8.0 Yes
1999-07-20T00:00:00Z Tuesday 9.0 Yes
1999-07-21T00:00:00Z Wednesday 6.0 No
1999-07-23T00:00:00Z Friday 8.0 Yes
1999-08-03T00:00:00Z Tuesday 7.0 Yes
1999-08-05T00:00:00Z Thursday 6.0 Yes
1999-08-16T00:00:00Z Monday 8.0 Yes
1999-08-27T00:00:00Z Friday 4.0 Yes
1999-08-31T00:00:00Z Tuesday 7.0 Yes
1999-09-01T00:00:00Z Wednesday 11.0 Yes
1999-09-10T00:00:00Z Friday 7.0 Yes
1999-09-20T00:00:00Z Monday 7.0 No
1999-09-23T00:00:00Z Thursday 7.5 Yes
1999-09-28T00:00:00Z Monday 8.0 Yes
1999-10-01T00:00:00Z Friday 7.0 Yes
1999-10-04T00:00:00Z Monday 6.0 Yes
1999-10-05T00:00:00Z Tuesday 9.0 Yes
1999-10-14T00:00:00Z Thursday 8.0 Yes
1999-10-22T00:00:00Z Friday 7.0 No
1999-10-25T00:00:00Z Monday 7.0 Yes
1999-10-27T00:00:00Z Wednesday 8.0 Yes
1999-11-02T00:00:00Z Tuesday 7.0 Yes
1999-11-03T00:00:00Z Wednesday 8.0 Yes
1999-11-08T00:00:00Z Monday 5.0 Yes
1999-11-18T00:00:00Z Thursday 8.0 No
1999-12-01T00:00:00Z Wednesday 7.0 Yes
1999-12-03T00:00:00Z Friday 6.0 No
1999-12-06T00:00:00Z Monday 6.5 Yes
1999-12-13T00:00:00Z Monday 9.0 Yes
1999-12-16T00:00:00Z Thursday 7.0 Yes
1999-12-21T00:00:00Z Tuesday 8.0 Yes

From there the clean data is available for analyses that couldn’t have been performed in its original format.

refine_delete()

To clean up your OpenRefine workspace you can delete projects using refine_delete(). Just like refine_export() it’s possible to pass either a project name or unique identifier to this function. And it is only necessary to use both if there are multiple projects with the same name.

refine_delete(project.name = "lfm_cleanup")

References


  1. https://openrefine.org/↩︎

  2. https://docs.openrefine.org/technical-reference/openrefine-api#third-party-software-libraries↩︎

  3. https://openrefine.org/download.html↩︎

  4. For documentation on how to specify a different host or port number see ?refine_path().↩︎

  5. https://github.com/OpenRefine/OpenRefine/wiki/Changes-for-3.3#csrf-protection-changes↩︎

  6. https://docs.openrefine.org/technical-reference/openrefine-api#apply-operations↩︎