SQLove - Simple Functions for Complex Queries

library(SQLove)

Introduction

The SQLove package provides helper functions for executing complex SQL queries from within an RStudio environment. The package leverages the DBI, RJDBC, and readr packages. Since each relational database setup is a little different, the user should check the compatibility of these packages with their workflow before attempting to deploy SQLove functions.
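For context, a typical connection might be established along the lines of the sketch below. This is a minimal illustration only; the driver class, JAR path, connection URL, and environment variable names are placeholders that you would replace with the details of your own database setup.

library(DBI)
library(RJDBC)

# Placeholder driver class and JAR location -- substitute your own
drv <- RJDBC::JDBC(driverClass = "com.example.jdbc.Driver",
                   classPath = "~/drivers/example-jdbc-driver.jar")

# Placeholder connection URL; credentials are pulled from environment variables
conn <- DBI::dbConnect(drv,
                       "jdbc:example://host:5439/database",
                       user = Sys.getenv("DB_USER"),
                       password = Sys.getenv("DB_PASSWORD"))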

The Problem

The DBI and RJDBC packages provide exceptionally powerful tools for integrating SQL queries into the R environment. In a typical R workflow, the user may want to bring down a simple table and then manipulate the data using powerful R tools such as the tidyverse. However, there are use cases where complex legacy SQL code is run to pull down data so that R is used for analytic purposes only. This is especially relevant when the user’s local machine has limited memory or low processing speed; in these cases, leveraging complex SQL queries can reduce the load on the local machine. However, the typical DBI::dbGetQuery() function only handles one SELECT statement at a time. The user would therefore need to break a complex SQL script that, for example, creates numerous temporary tables and then joins them, into multiple character strings passed to separate function calls before the final output matches what they would see if they simply ran the full script.

This is the problem that SQLove hopes to solve with the dbGetMultiQuery() function, which allows the user to pass a SQL file containing an unlimited number of query actions that ends with a single, final SELECT statement. Additionally, SQLove includes the dbSendMultiUpdate() function, which allows the user to run unlimited query actions to modify, update, or create database objects without pulling any information directly into the R environment. This function may be helpful for users who want to automate database actions outside of the database environment (e.g. small teams productionizing processes, teams who contract out for database management services but want to retain some individual processes, etc.).

dbGetMultiQuery

Let’s go through a simple example that demonstrates the functionality of the dbGetMultiQuery() function. Suppose that you have two tables that you want to modify and then join prior to bringing data into R. In the “leads” table, you have lead_id, date, and product. In the “sales” table, you have lead_id, date, and sale_amount. Here is one way you might modify and then join these tables in SQL:

CREATE TEMP TABLE lead_count AS
  SELECT
    lead_id
    ,COUNT(DISTINCT lead_id) as num_leads
    FROM leads
    WHERE lead_id NOT LIKE 'SPAM'
    GROUP BY 1;
    
CREATE TEMP TABLE lead_sales AS
    SELECT
      lead_id
      ,SUM(sale_amount) as tot_sales
      FROM sales
      WHERE lead_id NOT LIKE 'SPAM'
      GROUP BY 1;

SELECT a.* 
      FROM lead_count a 
      LEFT JOIN lead_sales b
        ON a.lead_id = b.lead_id

In a typical workflow for DBI, you would need to separate these statements and evaluate each individually (the first two using dbSendUpdate() and the final one using dbGetQuery()). This is where the dbGetMultiQuery() function shines!
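For comparison, that manual workflow might look something like the sketch below, assuming a DBI/RJDBC connection object named conn:

# Each statement is sent to the database separately; only the final SELECT
# returns data to the R environment.
RJDBC::dbSendUpdate(conn, "CREATE TEMP TABLE lead_count AS
  SELECT lead_id, COUNT(DISTINCT lead_id) as num_leads
  FROM leads WHERE lead_id NOT LIKE 'SPAM' GROUP BY 1")

RJDBC::dbSendUpdate(conn, "CREATE TEMP TABLE lead_sales AS
  SELECT lead_id, SUM(sale_amount) as tot_sales
  FROM sales WHERE lead_id NOT LIKE 'SPAM' GROUP BY 1")

sales_volume <- DBI::dbGetQuery(conn, "SELECT a.*
  FROM lead_count a
  LEFT JOIN lead_sales b ON a.lead_id = b.lead_id")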

After you have set up your connection per the DBI guidance, you can use the following convention to execute your full SQL script and load the result into R:

sales_volume <- dbGetMultiQuery(connection = conn,
                                sql_file_path = "~/path/to/file.sql")

Running this code will automatically parse each query and run the RJDBC::dbSendUpdate() function on all but the final statement; the final statement is run using the DBI::dbGetQuery() function. The function works by creating a list of character vectors, one per query, splitting on the ; character, which in SQL convention marks the end of a query action. It then loops through each vector in order until the final one in the list, where dbGetMultiQuery() anticipates a SELECT statement that will bring down the data in its final form to the R environment!
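The underlying idea can be sketched roughly as follows. This is a simplified illustration of the approach described above, not the package’s actual implementation:

# Simplified sketch of the idea behind dbGetMultiQuery() (not the real code)
run_multi_query <- function(connection, sql_file_path) {
  sql <- readr::read_file(sql_file_path)
  # Split the script on ";", which marks the end of each query action
  statements <- trimws(unlist(strsplit(sql, ";", fixed = TRUE)))
  statements <- statements[statements != ""]
  n <- length(statements)
  # Every statement except the last modifies the database without returning data
  for (s in statements[-n]) {
    RJDBC::dbSendUpdate(connection, s)
  }
  # The final statement is expected to be a SELECT that returns the data to R
  DBI::dbGetQuery(connection, statements[n])
}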

To avoid any difficulties in processing the SQL file across different SQL types, comments (both in the -- format and the /* */ format) are automatically removed from the SQL script after it is read into the R environment. Additionally, dbGetMultiQuery() includes base::gsub() functionality that allows the user to substitute strings. This may be valuable if the user wants to change a condition to compare data outputs interactively in the R environment. Below is an example of this functionality using the same SQL script as above:

spam_volume <- dbGetMultiQuery(connection = conn,
                               sql_file_path = "~/path/to/file.sql",
                               pattern = "NOT LIKE 'SPAM'",
                               replacement = "LIKE 'SPAM'")

In this example, instead of eliminating SPAM, we are retaining ONLY rows that are marked as SPAM so that we can evaluate the volume of SPAM in the “leads” and “sales” tables. It is important to note that the pattern and replacement arguments in this function replace all instances of the string across the FULL SQL SCRIPT. This means that if you want to replace a string in only one of the queries, you will need to be careful about how you call the arguments so you don’t accidentally alter a string in another query.
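Since the pattern is applied to the full script, one lightweight safeguard is to preview the substitution locally before sending anything to the database. The sketch below uses base R and readr directly; the fixed = TRUE arguments are simply a choice for this local check:

# Read the script locally and check how many places the pattern touches
sql_text <- readr::read_file("~/path/to/file.sql")
sum(gregexpr("NOT LIKE 'SPAM'", sql_text, fixed = TRUE)[[1]] > 0)

# Preview the substituted script before handing it to dbGetMultiQuery()
cat(gsub("NOT LIKE 'SPAM'", "LIKE 'SPAM'", sql_text, fixed = TRUE))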

dbSendMultiUpdate

For some users, there is an advantage to running SQL scripts from the R environment rather than within the database itself. For example, you may be able to more easily automate running R scripts during off hours if you are part of a small team without a dedicated ETL workflow. Alternatively, you may contract out your ETL needs but want to retain some processes in-house so your contractor doesn’t have access to your code. These reasons, and probably others not outlined here, underpin the dbSendMultiUpdate() function, which allows the user to run an unlimited number of SQL query statements that do not produce an output. Here is an example:

CREATE TEMP TABLE lead_count AS
  SELECT
    lead_id
    ,COUNT(DISTINCT lead_id) as num_leads
    FROM leads
    WHERE lead_id NOT LIKE 'SPAM'
    GROUP BY 1;
    
CREATE TEMP TABLE lead_sales AS
    SELECT
      lead_id
      ,SUM(sale_amount) as tot_sales
      FROM sales
      WHERE lead_id NOT LIKE 'SPAM'
      GROUP BY 1;

CREATE TABLE schema.lead_info AS
      SELECT a.*, b.tot_sales
      FROM lead_count a 
      LEFT JOIN lead_sales b
        ON a.lead_id = b.lead_id

You’ll notice that this script is almost identical to the one above. However, our final statement now creates a table in the relational database rather than bringing data down into the R environment. In this way, the dbSendMultiUpdate() function does not produce an output in R. Rather, it simply runs the SQL code in the relational database environment on your behalf.
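Assuming the script is saved to a file, and that dbSendMultiUpdate() takes the same connection and sql_file_path arguments as dbGetMultiQuery(), the call would look like this (the file path is a placeholder):

dbSendMultiUpdate(connection = conn,
                  sql_file_path = "~/path/to/build_lead_info.sql")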

This is important because, as a user of this function, you must perform quality assurance (QA) at the level of the SQL script itself. Because of this, it is recommended that this function be used as a production tool rather than as an interactive tool. While constructing the SQL script itself, the user will likely be better served working in a relational database IDE.