This is a tutorial for using the R package AnnotationBustR. AnnotationBustR reads in sequences from GenBank and, given a set of search terms, lets you quickly extract specific parts and write them to FASTA files. This is useful for extracting parts of concatenated or genomic sequences based on GenBank features, even when feature annotations for homologous loci vary (e.g. gene synonyms like COI, COX1, and COXI all being used for cytochrome oxidase subunit 1).
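To make the synonym problem concrete, here is a minimal base R sketch of mapping varying annotations onto one canonical locus name. The table layout, the column names `Locus` and `Name`, and the helper `canonical_locus` are illustrative assumptions, not the package's own data format or API. The CYTB/COB pair is added only as a second example.

```r
# Illustrative search-term table: each row maps one annotation synonym to a
# single canonical locus name. The column names are assumptions for this sketch.
search_terms <- data.frame(
  Locus = c("COI", "COI", "COI", "CYTB", "CYTB"),
  Name  = c("COI", "COX1", "COXI", "CYTB", "COB"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the canonical locus for each annotation, or NA
# when the annotation matches no known synonym. Matching ignores case.
canonical_locus <- function(annotations, terms) {
  idx <- match(toupper(annotations), toupper(terms$Name))
  terms$Locus[idx]
}

# "cox1" and "COXI" both resolve to COI; "ND1" is not in the table and gives NA.
canonical_locus(c("cox1", "COXI", "cob", "ND1"), search_terms)
```

The point of the design is that a single table of synonyms lets differently annotated records be treated as the same locus.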
In this tutorial we will cover the basics of how to use AnnotationBustR to extract parts of GenBank sequences. This is considerably faster than extracting them manually and requires minimal effort by the user. While command line utilities like BLAST can also work, they require building databases to search against, can be computationally intensive, and can have difficulty with highly complex sequences, like trans-spliced genes. They also require a far more complex query language to extract the subsequence and write it to a file. For example, with AnnotationBustR it is possible to extract into FASTA files every subsequence from a mitochondrial genome (38 sequences: 13 CDS, 22 tRNA, 2 rRNA, and 1 D-loop) in 26-36 seconds, which is considerably faster than doing so manually from the online GenBank features table. In this tutorial, we will discuss how to install AnnotationBustR, the basic AnnotationBustR pipeline, and how to use the functions that are included in AnnotationBustR.
In order to install the stable CRAN version of the AnnotationBustR package:
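A minimal sketch of this step, assuming the package is published on CRAN under the name `AnnotationBustR` and that a CRAN mirror is reachable:

```r
# Install the released version from CRAN (requires an internet connection).
install.packages("AnnotationBustR")
```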
We recommend using the stable CRAN version of this package. If for any reason you wish to use the development version instead, we recommend using the devtools package to install it temporarily from GitHub:
```r
# 1. Install 'devtools' if you do not already have it installed:
install.packages("devtools")

# 2. Load the 'devtools' package and temporarily install the development
#    version of 'AnnotationBustR' from GitHub:
library(devtools)
dev_mode(on = TRUE)
install_github("sborstein/AnnotationBustR")  # install the package from GitHub
library(AnnotationBustR)                     # load the package

# 3. Leave developer mode after using the development version of
#    'AnnotationBustR' so it will not remain on your system permanently.
dev_mode(on = FALSE)
```
To load AnnotationBustR and all of its functions/data:
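Loading uses the standard `library()` call, assuming the package has already been installed by one of the routes above:

```r
# Attach AnnotationBustR so its functions and bundled data are available.
library(AnnotationBustR)
```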
It is important to note that most of the functions within AnnotationBustR connect to sequence databases and require an internet connection.
##3.0: AnnotationBustR Work Flow

Before we begin a tutorial on how to use AnnotationBustR to extract sequences, let's first discuss the basic workflow of the functions in the package (Fig. 1). The orange box represents the step that occurs outside of AnnotationBustR. The only step you must do outside of AnnotationBustR is obtain a target list of accession numbers. This can be done either by downloading the accession numbers themselves from GenBank (https://www.ncbi.nlm.nih.gov/nuccore) or by using R packages like rentrez to find accessions of interest in R. All boxes in blue in the graphic below represent steps that occur using AnnotationBustR. Boxes in green represent steps that are not mandatory but may prove to be useful features of AnnotationBustR. In this tutorial, we will go through the steps in order, including the optional steps, to show how to fully use the AnnotationBustR package.
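The one step that happens outside AnnotationBustR, obtaining accession numbers, could be sketched with rentrez roughly as follows. This is an illustration under stated assumptions rather than part of the AnnotationBustR pipeline: the helper name `find_accessions`, the `retmax` default, and the example search term are all hypothetical. It relies on rentrez's `entrez_search()` and `entrez_fetch()` functions.

```r
# Hypothetical helper: search the NCBI nucleotide database (nuccore) and
# return the accession numbers of the matching records.
find_accessions <- function(term, retmax = 20) {
  if (!requireNamespace("rentrez", quietly = TRUE)) {
    stop("Package 'rentrez' is required: install.packages(\"rentrez\")")
  }
  # Search nuccore; hits$ids holds the NCBI record IDs of the matches.
  hits <- rentrez::entrez_search(db = "nuccore", term = term, retmax = retmax)
  if (length(hits$ids) == 0) {
    return(character(0))
  }
  # rettype = "acc" asks for accession numbers, returned as one
  # newline-separated string.
  acc_text <- rentrez::entrez_fetch(db = "nuccore", id = hits$ids, rettype = "acc")
  accessions <- strsplit(acc_text, "\n", fixed = TRUE)[[1]]
  accessions[nzchar(accessions)]  # drop empty trailing entries
}

# Example query (requires an internet connection); the search term is only
# an illustration and should be adapted to the taxa and loci of interest.
# my_accessions <- find_accessions("Percidae[Organism] AND mitochondrion[Filter]")
```

The resulting character vector can then serve as the target list of accessions for the AnnotationBustR steps that follow.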