Getting Started

Dyfan Jones

The RAthena package aims to make it easier to work with data stored in AWS Athena. It provides three levels of interaction with AWS Athena: a low-level API, a DBI interface, and a dplyr interface.

Installing RAthena:

Because RAthena uses boto3, the AWS SDK for Python, Python 3+ is required. Please install Python 3+ either from Python directly or through Anaconda. To install RAthena:

# CRAN version
install.packages("RAthena")

# Dev version
remotes::install_github("dyfanjones/RAthena")

Next, install the Python package boto3. This can be done either with RAthena’s installation helper:

RAthena::install_boto()

Or with pip:

pip install boto3
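
After either installation route, a quick check from R can confirm that reticulate sees boto3. The sketch below assumes the reticulate package is installed and uses reticulate::py_module_available(); the variable name boto3_found is only illustrative.

```r
# Ask reticulate whether boto3 can be imported from the Python it binds to.
# The requireNamespace() guard avoids an error if reticulate itself is missing.
boto3_found <- requireNamespace("reticulate", quietly = TRUE) &&
  reticulate::py_module_available("boto3")

if (boto3_found) {
  message("boto3 is available to reticulate")
} else {
  message("boto3 not found; try RAthena::install_boto() or pip install boto3")
}
```

This check initialises Python for the session, so any environment selection with reticulate (use_condaenv(), use_virtualenv(), use_python()) should come before it.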

Python Environments:

If RAthena doesn’t pick up boto3 after running install_boto(), consider specifying the Python environment explicitly. install_boto() creates an environment named RAthena. This is either a Python virtual environment or a conda environment, depending on your system.

library(DBI)

# Specify python conda environment and force reticulate to use it
reticulate::use_condaenv("RAthena", required = TRUE)

# Or specify python virtual environment and force reticulate to use it
reticulate::use_virtualenv("RAthena", required = TRUE)

con <- dbConnect(RAthena::athena())

Note: Specifying a Python environment is not required if boto3 is installed in the root Python, or if R and Python are installed together in their own environment (for example, a conda environment).
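
If it is unclear which kind of environment install_boto() created, the choice can be made at run time. The following is a minimal sketch, assuming the environment is named RAthena as described above; virtualenv_list() and conda_list() are reticulate functions, while the variable names are illustrative only.

```r
library(reticulate)

env_name <- "RAthena"

# Names of the virtual environments known to reticulate (empty if none exist)
venvs <- virtualenv_list()

# conda_list() raises an error when conda is not installed, so fall back to
# an empty character vector in that case
conda_envs <- tryCatch(as.character(conda_list()$name),
                       error = function(e) character(0))

if (env_name %in% venvs) {
  use_virtualenv(env_name, required = TRUE)
} else if (env_name %in% conda_envs) {
  use_condaenv(env_name, required = TRUE)
} else {
  message("No environment named ", env_name,
          " found; reticulate will use its default Python")
}
```

Run this at the start of a session, before anything initialises Python, because reticulate binds to a single Python per R session.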

Docker Example:

To help users who want to run RAthena in Docker, a simple Dockerfile has been created here. To set up Docker, please refer to link. For demo purposes, we will use the example Dockerfile and run it locally:

# build docker image
docker build . -t rathena

# start container with aws credentials passed from local
docker run \
      -e AWS_ACCESS_KEY_ID="$(aws configure get aws_access_key_id)" \
      -e AWS_SECRET_ACCESS_KEY="$(aws configure get aws_secret_access_key)" \
      -e AWS_SESSION_TOKEN="$(aws configure get aws_session_token)" \
      -e AWS_DEFAULT_REGION="$(aws configure get region)" \
      -it rathena
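
Once inside the container, a quick check from R can confirm that the variables passed with -e arrived. This sketch uses only base R and reports whether each variable is set, without printing its value. AWS_SESSION_TOKEN may legitimately be empty when no temporary credentials are in use.

```r
aws_vars <- c("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY",
              "AWS_SESSION_TOKEN", "AWS_DEFAULT_REGION")

# TRUE where the variable is set to a non-empty string; the values themselves
# are never printed
is_set <- setNames(nzchar(Sys.getenv(aws_vars)), aws_vars)
print(is_set)
```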

When running RAthena in the Docker container, you may need to tell reticulate which Python to use.

reticulate::use_python("/usr/bin/python3")

library(DBI)

con <- dbConnect(RAthena::athena(), s3_staging_dir = "s3://mybucket/")

Usage:

Low-Level API:

library(DBI)
library(RAthena)

con <- dbConnect(athena())

# list all current work groups in AWS Athena
list_work_groups(con)

# Create a new work group
create_work_group(con, "demo_work_group", description = "This is a demo work group",
                  tags = tag_options(key = "demo_work_group", value = "demo_01"))
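
A demo work group is usually removed once it is no longer needed. The sketch below assumes RAthena's get_work_group() and delete_work_group() helpers, which sit alongside create_work_group() in the package; check the package reference for their exact arguments. The wrapper name remove_demo_work_group is hypothetical.

```r
# Hypothetical helper: inspect a work group, then delete it.
# `con` is a connection created with dbConnect(RAthena::athena()).
remove_demo_work_group <- function(con, work_group = "demo_work_group") {
  # Fetch the work group's details before removing it
  info <- RAthena::get_work_group(con, work_group)
  RAthena::delete_work_group(con, work_group)
  # Return the details invisibly so they can be kept for reference
  invisible(info)
}

# Usage, given the connection from the example above:
# remove_demo_work_group(con)
```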

DBI:

library(DBI)

con <- dbConnect(RAthena::athena())

# Get metadata 
dbGetInfo(con)

# $profile_name
# [1] "default"
# 
# $s3_staging
# [1] ######## NOTE: Please don't share your S3 bucket to the public
# 
# $dbms.name
# [1] "default"
# 
# $work_group
# [1] "primary"
# 
# $poll_interval
# NULL
# 
# $encryption_option
# NULL
# 
# $kms_key
# NULL
# 
# $expiration
# NULL
# 
# $region_name
# [1] "eu-west-1"
# 
# $boto3
# [1] "1.11.5"
# 
# $RAthena
# [1] "1.7.1"

# Write the iris data frame to a table in AWS Athena
dbWriteTable(con, "iris", iris)

dbGetQuery(con, "select * from iris limit 10")
# Info: (Data scanned: 860 Bytes)
#     sepal_length sepal_width petal_length petal_width species
#  1:          5.1         3.5          1.4         0.2  setosa
#  2:          4.9         3.0          1.4         0.2  setosa
#  3:          4.7         3.2          1.3         0.2  setosa
#  4:          4.6         3.1          1.5         0.2  setosa
#  5:          5.0         3.6          1.4         0.2  setosa
#  6:          5.4         3.9          1.7         0.4  setosa
#  7:          4.6         3.4          1.4         0.3  setosa
#  8:          5.0         3.4          1.5         0.2  setosa
#  9:          4.4         2.9          1.4         0.2  setosa
# 10:          4.9         3.1          1.5         0.1  setosa
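
With default arguments, DBI backends generally refuse to write to a table that already exists, so rerunning the dbWriteTable() call above may fail. One way around this is to guard the write. This is a minimal sketch using the DBI generics dbExistsTable() and dbListFields(); the helper name write_if_missing is hypothetical.

```r
# Hypothetical helper: write `value` to table `name` only if the table is not
# there yet, then return the table's column names.
write_if_missing <- function(con, name, value) {
  if (!DBI::dbExistsTable(con, name)) {
    DBI::dbWriteTable(con, name, value)
  }
  DBI::dbListFields(con, name)
}

# Usage with the connection from the example above:
# write_if_missing(con, "iris", iris)
```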

dplyr:

library(dplyr)

athena_iris <- tbl(con, "iris")

athena_iris %>%
  select(species, sepal_length, sepal_width) %>% 
  head(10) %>%
  collect()
# Info: (Data scanned: 860 Bytes)
# # A tibble: 10 x 3
#    species sepal_length sepal_width
#    <chr>          <dbl>       <dbl>
#  1 setosa           5.1         3.5
#  2 setosa           4.9         3  
#  3 setosa           4.7         3.2
#  4 setosa           4.6         3.1
#  5 setosa           5           3.6
#  6 setosa           5.4         3.9
#  7 setosa           4.6         3.4
#  8 setosa           5           3.4
#  9 setosa           4.4         2.9
# 10 setosa           4.9         3.1
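
The same lazy table can carry an aggregation, so that the summary is computed in Athena and only the aggregated rows come back to R. The sketch below assumes the iris table written earlier; the helper name species_means and the output column names are hypothetical.

```r
library(dplyr)

# Hypothetical helper: build a lazy query for the mean sepal measurements per
# species. The aggregation is computed in Athena when collect() is called.
species_means <- function(con, table = "iris") {
  tbl(con, table) %>%
    group_by(species) %>%
    summarise(
      mean_sepal_length = mean(sepal_length, na.rm = TRUE),
      mean_sepal_width = mean(sepal_width, na.rm = TRUE)
    )
}

# Usage with the connection from the examples above:
# species_means(con) %>% show_query()  # inspect the generated SQL
# species_means(con) %>% collect()     # run the query and return a tibble
```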