🐿️ Sqrrl

Project
Easily build bespoke SQL queries programmatically in R
Published

June 30, 2017

UPDATE 10/17/2017: Don’t use this! I made it for myself so it works for what I needed it for. But you probably shouldn’t use this package. There are better ways of building SQL queries that are safer and better (and probably even easier). For now, let me just point you in the direction of db.rstudio.com, dplyr/dbplyr, and the recently added glue_sql() function in the glue package.

Project Links: source, documentation

sqrrl is a small collection of utility functions that help build text-based SQL queries in an R-style native-feeling and functional manner.

Unlike other packages that build SQL queries using an object-oriented style, sqrrl provides small functions that produce SQL snippets and can be linked together to compose SQL queries. The result is that the code to produce the SQL statement reads much like the SQL statement iteself. On the other hand, sqrrl doesn’t know anything about your database and can’t help you out with completions, etc.

Where this package is most useful is with Shiny web apps that interact with a MySQL backend. The utilities are all built so that queries can be built using column names and values stored inside ordinary R data structures.

The following is a quick demonstration of how the package works using the nyclights13 dataset. For more information on sqrrl, check out the package documentation.

Setup flights database

To demonstrate the features in sqrrl, let’s set up an in-memory SQLite database using the nycflights13 dataset featured in dplyr and dbplyr.

First, load (or install) the pacakges and functions that we need.

# ---- Workspace Setup ----
library('nycflights13') # install.packages('nycflights13')
library('DBI')          # install.packages('DBI')
library('dplyr')        # install.packages('dplyr')
library('dbplyr')       # install.packages('dbplyr')

# Load the sqrrl package
# devtools::isntall_github('gadenbuie/sqrrl')
library('sqrrl')

# Alias to create nice tables
as_table <- function(...) knitr::kable(..., format = 'html')

Then load the flights data frame from nycflights13 into the in-memory SQLite database (this code comes direclty from the dbplyr documentation).

# ---- Example Setup ----
# Create an in-memory SQLite database
con <- dbConnect(RSQLite::SQLite(), path = ":memory:")

# Use dplyr/dbplyr to copy flights table to the temp db
copy_to(con, nycflights13::flights, "flights",
  temporary = FALSE,
  indexes = list(
    c("year", "month", "day"),
    "carrier",
    "tailnum",
    "dest"
  )
)

# Show first 5 rows
dbGetQuery(con, 'SELECT * FROM flights LIMIT 5') %>%
  as_table
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest air_time distance hour minute time_hour
2013 1 1 517 515 2 830 819 11 UA 1545 N14228 EWR IAH 227 1400 5 15 1357034400
2013 1 1 533 529 4 850 830 20 UA 1714 N24211 LGA IAH 227 1416 5 29 1357034400
2013 1 1 542 540 2 923 850 33 AA 1141 N619AA JFK MIA 160 1089 5 40 1357034400
2013 1 1 544 545 -1 1004 1022 -18 B6 725 N804JB JFK BQN 183 1576 5 45 1357034400
2013 1 1 554 600 -6 812 837 -25 DL 461 N668DN LGA ATL 116 762 6 0 1357038000

Querying flights

Often, when I’m working with a database, I’ll create an alias for dbGetQuery with the database or table name. Inside the alias function I usually add any data type modifications that might need to be applied, and I suppress the warning messages that DBI outputs about data type conversions.

flights <- function(query, ...) {
  suppressWarnings(dbGetQuery(con, query, ...))
}

Now we can repeat the above SELECT statement using sqrrl, this time limiting the columns selected.

flight_cols <- c('year', 'month', 'day',
                 'carrier', 'flight', 'tailnum')
SELECT(flight_cols) %+%
  FROM('flights') %+%
  LIMIT(5) %>%
  flights %>%
  as_table
year month day carrier flight tailnum
2013 1 1 UA 1545 N14228
2013 1 1 UA 1714 N24211
2013 1 1 AA 1141 N619AA
2013 1 1 B6 725 N804JB
2013 1 1 DL 461 N668DN

Note that sqrrl provides the %+% infix operator, which is essentially just an alias for paste(x, y).

'a' %+% 'b'
[1] "a b"
# or PHP style without a padded space: paste0
'a' %.% 'b'
[1] "ab"

We can also do more complicated queries, like finding the average arrival delay, grouped by tail number:

SELECT('tailnum', delay = 'avg(arr_delay)', n = 'count(*)') %+%
  FROM('flights') %+%
  GROUP_BY('tailnum') %+%
  ORDER_BY(DESC('delay')) %+%
  LIMIT(10) %>%
  flights %>%
  as_table
tailnum delay n
N844MH 320.0000 1
N911DA 294.0000 1
N922EV 276.0000 1
N587NW 264.0000 1
N851NW 219.0000 1
N928DN 201.0000 1
N7715E 188.0000 1
N654UA 185.0000 1
N665MQ 174.6667 6
N427SW 157.0000 1

sqrrl also provides a wrapper around the python utility sqlformat that can be used to pretty-print SQL formats.

SELECT('tailnum', delay = 'avg(arr_delay)', n = 'count(*)') %+%
  FROM('flights') %+%
  GROUP_BY('tailnum') %+%
  ORDER_BY(DESC('delay')) %+%
  LIMIT(10) %>%
  sqlformat %>% cat
SELECT tailnum,
       avg(arr_delay) AS delay,
       count(*) AS n
  FROM flights
 GROUP BY tailnum
 ORDER BY delay DESC
 LIMIT 10

Let’s use the above as an inner query and filter on n > 100:

query_all_arr_delay <- SELECT(
  'tailnum', delay = 'avg(arr_delay)', n = 'count(*)'
) %+%
  FROM('flights') %+%
  GROUP_BY('tailnum') %+%
  ORDER_BY(DESC('delay'))

SELECT() %+%
  FROM(delay = parens(query_all_arr_delay)) %+%
  WHERE(gt(n = 100)) %+%
  LIMIT(10) %>%
  flights %>%
  as_table
tailnum delay n
N11119 30.30657 148
N16919 29.88745 251
N14998 27.92202 230
N15910 27.61132 280
N13123 25.97345 121
N11192 25.85235 154
N14950 25.28780 219
N21130 24.96610 126
N24128 24.91803 129
N22971 24.74766 230

Queries are just strings

Notice that unlike other packages, sqrrl can’t build the nested queries for you. You still need to understand the structure of the database and the structure of the query.

But when compared with the final output of the query, the sqrrl version looks a lot like SQL transliterated into R functions.

SELECT() %+%
  FROM(delay = parens(
    SELECT('tailnum', delay = 'avg(arr_delay)', n = 'count(*)') %+%
      FROM('flights') %+%
      GROUP_BY('tailnum') %+%
      ORDER_BY(DESC('delay'))
  )) %+%
  WHERE(gt(n = 100)) %+%
  LIMIT(10) %>%
  sqlformat() %>%
  cat()
SELECT *
  FROM (
        SELECT tailnum,
               avg(arr_delay) AS delay,
               count(*) AS n
          FROM flights
         GROUP BY tailnum
         ORDER BY delay DESC
       ) delay
 WHERE n>100
 LIMIT 10

For me, at least, where the goal is to write SQL queries as bare strings, sqrrl lets me write in R and think in SQL without having to add a huge number of paste and paste0 functions.

Everything in sqrrl takes input data from regular R data types and outputs an SQL snippet.

For an example of nearly everything each of the functions can do, see the Getting Started section in the documentation.

A more complicated SELECT query

As a final example, here is a fully-loaded select query.

SELECT('`year`', 'carrier', 'flight', 'dest',
       n = 'count(*)',
       avg_dist = 'avg(distance)',
       avg_air_time = 'avg(air_time)') %+%
  FROM(f = 'flights') %+%
  WHERE(
    BETWEEN('month', 6, 12),
    'carrier' %IN% c("UA", "AA", "US", "WN"),
    geq('dep_time' = 800),
    leq('air_time' = 120),
    'origin' %LIKE% 'JFK'
  ) %+%
  GROUP_BY('`year`', 'carrier', 'flight', 'dest') %+%
  ORDER_BY(DESC('n')) %+%
  LIMIT(10) %>%
  { sqlformat(.) %>% cat; . } %>%
  flights %>%
  as_table
SELECT `year`,
       carrier,
       flight,
       dest,
       count(*) AS n,
       avg(distance) AS avg_dist,
       avg(air_time) AS avg_air_time
  FROM flights f
 WHERE `month` BETWEEN 6 AND 12
   AND carrier IN ("UA", "AA", "US", "WN")
   AND dep_time>=800
   AND air_time<=120
   AND origin LIKE("JFK")
 GROUP BY `year`,
          carrier,
          flight,
          dest
 ORDER BY n DESC
 LIMIT 10
year carrier flight dest n avg_dist avg_air_time
2013 US 1831 CLT 178 541 86.95506
2013 US 425 CLT 126 541 84.92857
2013 AA 178 BOS 119 187 37.94118
2013 AA 256 BOS 117 187 39.13675
2013 AA 2314 BOS 115 187 37.85217
2013 US 1802 CLT 112 541 87.23214
2013 AA 84 BOS 101 187 37.95049
2013 AA 1850 BOS 94 187 38.46809
2013 AA 1838 BOS 93 187 37.83871
2013 AA 1762 BOS 86 187 38.47674

This query and table select the most popular flights from JFK between June and December of 2013 from the carriers UA, AA, US, and WN that depart JFK after 8:00 AM and have an air time of less than 2 hours.

Learn more

There’s more that the package can do, like JOINs, INSERTs, and UPDATEs that I haven’t gone into here.

There are also a number of wrappers, comparison operators and concatenators that can be used for wrapping strings in quotes — e.g. quotes() — comparing columns to values — e.g. geq(), eq(), lt(), neq() — and stringing together statements — e.g. AND(), OR(), %LIKE%, %IN%, BETWEEN().

There’s an example of nearly every single function and each of it’s possible configurations in the package documentation.

Hopefully this package is useful to someone other than myself (like you!). If you run into any problems, let me know or submit an issue on GitHub.