Use PRQL on R

PRQL (Pipelined Relational Query Language, pronounced “Prequel”) is a modern language for transforming data, can be compiled to SQL.

This package provides a simple function to convert a PRQL query string to a SQL string.

For example, this is a PRQL query.

from mtcars
filter cyl > 6
select {cyl, mpg}
derive {mpg_int = math.round 0 mpg}

And, this is the SQL query that is compiled from the PRQL query.

SELECT
  cyl,
  mpg,
  ROUND(mpg, 0) AS mpg_int
FROM
  mtcars
WHERE
  cyl > 6

To compile a PRQL string, just pass the query string to the prql_compile() function, like this.

library(prqlr)

"
from mtcars
filter cyl > 6
select {cyl, mpg}
derive {mpg_int = math.round 0 mpg}
" |>
  prql_compile() |>
  cat()
#> SELECT
#>   cyl,
#>   mpg,
#>   ROUND(mpg, 0) AS mpg_int
#> FROM
#>   mtcars
#> WHERE
#>   cyl > 6
#> 
#> -- Generated by PRQL compiler version:0.11.2 (https://prql-lang.org)

This output SQL query string can be used with already existing great packages that manipulate data with SQL.

Work with DB

Using it with the {DBI} package, we can execute PRQL queries against the database.

library(DBI)

# Create an ephemeral in-memory SQLite database
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Create a table inclueds `mtcars` data
dbWriteTable(con, "mtcars", mtcars)

# Execute a PRQL query
"
from mtcars
filter cyl > 6
select {cyl, mpg}
derive {mpg_int = math.round 0 mpg}
take 3
" |>
  prql_compile("sql.sqlite") |>
  dbGetQuery(con, statement = _)
#>   cyl  mpg mpg_int
#> 1   8 18.7      19
#> 2   8 14.3      14
#> 3   8 16.4      16

We can also use the sqldf::sqldf() function to automatically register Data Frames to the database.

"
from mtcars
filter cyl > 6
select {cyl, mpg}
derive {mpg_int = math.round 0 mpg}
take 3
" |>
  prql_compile("sql.sqlite") |>
  sqldf::sqldf()
#>   cyl  mpg mpg_int
#> 1   8 18.7      19
#> 2   8 14.3      14
#> 3   8 16.4      16

Since SQLite is used here via {RSQLite}, the target option of prql_compile() is set to "sql.sqlite".

Available target names can be found with the prql_get_targets() function.

Work with R Data Frames

Using {prqlr} with the {tidyquery} package, we can execute PRQL queries against R Data Frames via {dplyr}.

{dplyr} is a very popular R package for manipulating Data Frames, and the PRQL syntax is very similar to the {dplyr} syntax.

Let’s run a query that aggregates a Data Frame flights, contained in the {nycflights13} package.

library(tidyquery)
library(nycflights13)

"
from flights
filter (distance | in 200..300)
filter air_time != null
group {origin, dest} (
  aggregate {
    num_flts = count this,
    avg_delay = (average arr_delay | math.round 0)
  }
)
sort {-origin, avg_delay}
take 2
" |>
  prql_compile() |>
  query()
#> # A tibble: 2 × 4
#>   origin dest  num_flts avg_delay
#>   <chr>  <chr>    <int>     <dbl>
#> 1 LGA    BUF        122        -2
#> 2 LGA    PWM        273         2

This query can be written with {dplyr}’s functions as follows.

library(dplyr, warn.conflicts = FALSE)
library(nycflights13)

flights |>
  filter(
    distance |> between(200, 300),
    !is.na(air_time)
  ) |>
  group_by(origin, dest) |>
  summarise(
    num_flts = n(),
    avg_delay = mean(arr_delay, na.rm = TRUE) |> round(0),
    .groups = "drop"
  ) |>
  arrange(desc(origin), avg_delay) |>
  head(2)
#> # A tibble: 2 × 4
#>   origin dest  num_flts avg_delay
#>   <chr>  <chr>    <int>     <dbl>
#> 1 LGA    BUF        122        -2
#> 2 LGA    PWM        273         2

Note that {dplyr} queries can be generated by the tidyquery::show_dplyr() function!