Introduction to seeker

Jake Hughey

2024-01-22

RNA-seq data

The seeker package is a wrapper around various command-line and R-based tools. Its main function, seeker(), processes bulk RNA-seq data. The main argument of seeker() is a list of parameters specifying which steps of RNA-seq data processing to perform and how to perform them. The list of parameters can come from a YAML file, an example of which is shown below.

study: 'PRJNA600892' # [string]
metadata:
  run: TRUE # [logical]
  bioproject: 'PRJNA600892' # [string]
  include:
    # [named list or NULL]
    colname: 'run_accession' # [string]
    values: ['SRR10876945', 'SRR10876946'] # [vector]
  # exclude # [named list or NULL]
    # colname # [string]
    # values # [vector]
fetch:
  run: TRUE # [logical]
  # keep # [logical or NULL]
  # overwrite # [logical or NULL]
  # keepSra # [logical or NULL]
  # prefetchCmd # [string or NULL]
  # prefetchArgs # [character vector or NULL]
  # fasterqdumpCmd # [string or NULL]
  # fasterqdumpArgs # [character vector or NULL]
  # pigzCmd # [string or NULL]
  # pigzArgs # [character vector or NULL]
trimgalore:
  run: TRUE # [logical]
  # keep # [logical or NULL]
  # cmd # [string or NULL]
  # args # [character vector or NULL]
  # pigzCmd # [string or NULL]
fastqc:
  run: TRUE # [logical]
  # keep # [logical or NULL]
  # cmd # [string or NULL]
  # args # [character vector or NULL]
salmon:
  run: TRUE # [logical]
  indexDir: '~/refgenie_genomes/alias/mm10/salmon_partial_sa_index/default' # [string]
  # sampleColname # [string or NULL]
  # keep # [logical or NULL]
  # cmd # [string or NULL]
  # args # [character vector or NULL]
multiqc:
  run: TRUE # [logical]
  # cmd # [string or NULL]
  # args # [character vector or NULL]
tximport:
  run: TRUE # [logical]
  tx2gene:
    # [named list or NULL]
    organism: 'mmusculus' # [string]
    # version # [number or NULL]
    # filename # [string or NULL]
  countsFromAbundance: 'lengthScaledTPM' # [string]
  # ignoreTxVersion # [logical or NULL]
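
Since seeker() takes a list, the same parameters can also be written directly in R. The following is a minimal sketch of a nested list that should correspond to the uncommented fields of the YAML above; it assumes the optional (commented-out) fields can simply be left out, which is worth confirming in ?seeker.

```r
# Nested list mirroring the uncommented fields of the example YAML.
# Assumption: optional fields that are omitted are treated as NULL.
params = list(
  study = 'PRJNA600892',
  metadata = list(
    run = TRUE,
    bioproject = 'PRJNA600892',
    include = list(
      colname = 'run_accession',
      values = c('SRR10876945', 'SRR10876946'))),
  fetch = list(run = TRUE),
  trimgalore = list(run = TRUE),
  fastqc = list(run = TRUE),
  salmon = list(
    run = TRUE,
    indexDir = '~/refgenie_genomes/alias/mm10/salmon_partial_sa_index/default'),
  multiqc = list(run = TRUE),
  tximport = list(
    run = TRUE,
    tx2gene = list(organism = 'mmusculus'),
    countsFromAbundance = 'lengthScaledTPM'))

# Show the top-level steps of the parameter list.
str(params, max.level = 1)
```

Each top-level element other than study is one processing step, and each step carries its own run flag.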

The example above ships with the package as PRJNA600892.yml, and an empty template YAML file is available at system.file('extdata', 'params_template.yml', package = 'seeker'). You can copy both YAML files to your working directory like so:

for (filename in c('PRJNA600892.yml', 'params_template.yml')) {
  file.copy(system.file('extdata', filename, package = 'seeker'), '.')
}

If you’ve already installed the system dependencies, such as with installSysDeps(), a basic way to run seeker() is then:

library('seeker')
doParallel::registerDoParallel()

yamlPath = 'PRJNA600892.yml'
params = yaml::read_yaml(yamlPath)
seeker(params)

Be aware that even this minimal example can take a while to run, since it fetches and processes raw sequencing reads.
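
Because the parameters are an ordinary R list once read, they can be adjusted before the call, for instance to switch off steps while testing. The sketch below uses a trimmed stand-in for the list returned by yaml::read_yaml() and assumes seeker() accepts run: FALSE for the quality-control steps; whether later steps depend on earlier ones should be checked in ?seeker.

```r
# Stand-in for yaml::read_yaml('PRJNA600892.yml'), trimmed to a few steps.
params = list(
  study = 'PRJNA600892',
  fastqc = list(run = TRUE),
  multiqc = list(run = TRUE),
  tximport = list(run = TRUE, countsFromAbundance = 'lengthScaledTPM'))

# modifyList() merges nested named lists, so only the named fields change
# and everything else in params is carried over untouched.
quickParams = utils::modifyList(
  params, list(fastqc = list(run = FALSE), multiqc = list(run = FALSE)))

# seeker(quickParams)  # requires seeker and its system dependencies
```

The original params list is left unmodified, so the full run can still be launched later from the same session.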

Microarray data

For microarray data, use the seekerArray() function, which can process data from NCBI GEO and ArrayExpress, as well as raw Affymetrix data stored locally. The main arguments are study and geneIdType. For example:

library('seeker')

study = 'GSE25585'
geneIdType = 'entrez'
seekerArray(study, geneIdType)
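
To process several studies in one session, the call above could be wrapped so that one failing study does not stop the rest. This is only a sketch: processStudies() and its processFunc argument are hypothetical helpers, not part of seeker, and the function to call is passed in so the wrapper can be tried without the package installed.

```r
# Hypothetical helper (not part of seeker): run processFunc on each study
# and record whether it succeeded or which error it raised.
processStudies = function(studies, geneIdType, processFunc) {
  vapply(studies, function(study) {
    tryCatch({
      processFunc(study, geneIdType)
      'ok'
    }, error = function(e) paste('failed:', conditionMessage(e)))
  }, character(1))
}

# With seeker installed, one would pass the real function, e.g.:
# statuses = processStudies(c('GSE25585'), 'entrez', seeker::seekerArray)
```

The result is a named character vector with one status per study accession.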