Introduction

The protr package offers a unique and comprehensive toolkit for generating various numerical representation schemes of protein sequences. The descriptors included are extensively utilized in bioinformatics and chemogenomics research.

The commonly used descriptors listed in protr include amino acid composition, autocorrelation, CTD, conjoint traid, quasi-sequence order, pseudo amino acid composition, and profile-based descriptors derived by Position-Specific Scoring Matrix (PSSM).

The descriptors for proteochemometric (PCM) modeling include the scales-based descriptors derived by principal components analysis, factor analysis, multidimensional scaling, amino acid properties (AAindex), 20+ classes of 2D and 3D molecular descriptors (for example, Topological, WHIM, VHSE, among others), and BLOSUM/PAM matrix-derived descriptors.

The protr package also implemented parallelized similarity computation derived by pairwise protein sequence alignment and Gene Ontology (GO) semantic similarity measures. ProtrWeb, the web application built on protr, can be accessed from http://protr.org.

If you find protr useful in your research, please feel free to cite our paper:

Nan Xiao, Dong-Sheng Cao, Min-Feng Zhu, and Qing-Song Xu. (2015). protr/ProtrWeb: R package and web server for generating various numerical representation schemes of protein sequences. Bioinformatics 31 (11), 1857-1859.

BibTeX entry:

@article{Xiao2015,
  author  = {Xiao, Nan and Cao, Dong-Sheng and Zhu, Min-Feng and Xu, Qing-Song.},
  title   = {protr/{ProtrWeb}: {R} package and web server for generating various numerical representation schemes of protein sequences},
  journal = {Bioinformatics},
  year    = {2015},
  volume  = {31},
  number  = {11},
  pages   = {1857--1859},
  doi     = {10.1093/bioinformatics/btv042}
}

An example predictive modeling workflow

Here, we use the subcellular localization dataset of human proteins presented in Chou and Shen (2008) to demonstrate the basic usage of protr.

The complete dataset includes 3,134 protein sequences (2,750 different proteins) classified into 14 human subcellular locations. We selected two classes of proteins as our benchmark dataset. Class 1 contains 325 extracell proteins and class 2 includes 307 mitochondrion proteins. We aim to build a random forest classification model to classify these two types of proteins.

First, we load the protr package, then read the protein sequences stored in two separated FASTA files with readFASTA():

library("protr")

# Load FASTA files
# Note that `system.file()` is for accessing example files
# in the protr package. Replace it with your own file path.
extracell <- readFASTA(
  system.file(
    "protseq/extracell.fasta",
    package = "protr"
  )
)
mitonchon <- readFASTA(
  system.file(
    "protseq/mitochondrion.fasta",
    package = "protr"
  )
)

To read protein sequences stored in PDB format files, use readPDB() instead. The loaded sequences will be stored as two lists in R, and each component is a character string representing one protein sequence. In this case, there are 325 extracell protein sequences and 306 mitonchon protein sequences:

length(extracell)
## [1] 325
length(mitonchon)
## [1] 306

To ensure that the protein sequences only have the 20 standard amino acid types, which is usually required for the descriptor computation, we use the protcheck() function to do the amino acid type sanity check and remove the non-standard sequences:

extracell <- extracell[(sapply(extracell, protcheck))]
mitonchon <- mitonchon[(sapply(mitonchon, protcheck))]
length(extracell)
## [1] 323
length(mitonchon)
## [1] 304

Two protein sequences were removed from each class. For the remaining sequences, we calculate the Type II PseAAC descriptor, i.e., the amphiphilic pseudo amino acid composition (APseAAC) descriptor (Chou 2005) and make class labels for classification modeling.

# Calculate APseAAC descriptors
x1 <- t(sapply(extracell, extractAPAAC))
x2 <- t(sapply(mitonchon, extractAPAAC))
x <- rbind(x1, x2)

# Make class labels
labels <- as.factor(c(rep(0, length(extracell)), rep(1, length(mitonchon))))

In protr, the functions of commonly used descriptors for protein sequences and proteochemometric (PCM) modeling descriptors are named after extract...().

Next, we will split the data into a 75% training set and a 25% test set.

set.seed(1001)

# Split training and test set
tr.idx <- c(
  sample(1:nrow(x1), round(nrow(x1) * 0.75)),
  sample(nrow(x1) + 1:nrow(x2), round(nrow(x2) * 0.75))
)
te.idx <- setdiff(1:nrow(x), tr.idx)

x.tr <- x[tr.idx, ]
x.te <- x[te.idx, ]
y.tr <- labels[tr.idx]
y.te <- labels[te.idx]

We will train a random forest classification model on the training set with 5-fold cross-validation, using the randomForest package.

library("randomForest")
rf.fit <- randomForest(x.tr, y.tr, cv.fold = 5)
rf.fit

The training result is:

## Call:
##  randomForest(x = x.tr, y = y.tr, cv.fold = 5)
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 8
##
##         OOB estimate of  error rate: 25.11%
## Confusion matrix:
##     0   1 class.error
## 0 196  46   0.1900826
## 1  72 156   0.3157895

With the model trained on the training set, we predict on the test set and plot the ROC curve with the pROC package, as is shown in Figure 1.

# Predict on test set
rf.pred <- predict(rf.fit, newdata = x.te, type = "prob")[, 1]

# Plot ROC curve
library("pROC")
plot.roc(y.te, rf.pred, grid = TRUE, print.auc = TRUE)

The area under the ROC curve (AUC) is:

## Call:
## plot.roc.default(x = y.te, predictor = rf.pred, col = "#0080ff",
##                  grid = TRUE, print.auc = TRUE)
##
## Data: rf.pred in 81 controls (y.te 0) > 76 cases (y.te 1).
## Area under the curve: 0.8697
Figure 1: ROC curve for the test set of protein subcellular localization data.

Figure 1: ROC curve for the test set of protein subcellular localization data.

Package overview

The protr package (Xiao et al. 2015) implemented most of the state-of-the-art protein sequence descriptors with R. Generally, each type of the descriptor (feature) can be calculated with a function named extractX() in the protr package, where X stands for the abbreviation of the descriptor name. The descriptors are:

  • Amino acid composition
    • extractAAC() - Amino acid composition
    • extractDC() - Dipeptide composition
    • extractTC() - Tripeptide composition
  • Autocorrelation
    • extractMoreauBroto() - Normalized Moreau-Broto autocorrelation
    • extractMoran() - Moran autocorrelation
    • extractGeary() - Geary autocorrelation
  • CTD descriptors
    • extractCTDC() - Composition
    • extractCTDT() - Transition
    • extractCTDD() - Distribution
  • Conjoint triad descriptors
    • extractCTriad() - Conjoint triad descriptors
  • Quasi-sequence-order descriptors
    • extractSOCN() - Sequence-order-coupling number
    • extractQSO() - Quasi-sequence-order descriptors
  • Pseudo-amino acid composition
    • extractPAAC() - Pseudo-amino acid composition (PseAAC)
    • extractAPAAC() - Amphiphilic pseudo-amino acid composition (APseAAC)
  • Profile-based descriptors
    • extractPSSM()
    • extractPSSMAcc()
    • extractPSSMFeature()

The descriptors commonly used in Proteochemometric Modeling (PCM) implemented in protr include:

  • extractScales(), extractScalesGap() - Scales-based descriptors derived by Principal Components Analysis
    • extractProtFP(), extractProtFPGap() - Scales-based descriptors derived by amino acid properties from AAindex (a.k.a. Protein Fingerprint)
    • extractDescScales() - Scales-based descriptors derived by 20+ classes of 2D and 3D molecular descriptors (Topological, WHIM, VHSE, etc.)
  • extractFAScales() - Scales-based descriptors derived by Factor Analysis
  • extractMDSScales() - Scales-based descriptors derived by Multidimensional Scaling
  • extractBLOSUM() - BLOSUM and PAM matrix-derived descriptors

The protr package integrates the function of parallelized similarity score computation derived by local or global protein sequence alignment between a list of protein sequences; Biostrings and pwalign provide the sequence alignment computation, and the corresponding functions listed in the protr package include:

  • twoSeqSim() - Similarity calculation derived by sequence alignment between two protein sequences
  • parSeqSim() - Parallelized pairwise similarity calculation with a list of protein sequences (in-memory version)
  • parSeqSimDisk() - Parallelized pairwise similarity calculation with a list of protein sequences (disk-based version)
  • crossSetSim() - Parallelized pairwise similarity calculation between two sets of protein sequences (in-memory version)

protr also integrates the function for parallelized similarity score computation derived by Gene Ontology (GO) semantic similarity measures between a list of GO terms / Entrez Gene IDs, the GO similarity computation is provided by GOSemSim, the corresponding functions listed in the protr package include:

  • twoGOSim() - Similarity calculation derived by GO-terms semantic similarity measures between two GO terms / Entrez Gene IDs;
  • parGOSim() - Pairwise similarity calculation with a list of GO terms / Entrez Gene IDs.

To use the parSeqSim() function, we suggest the users to install the packages foreach and doParallel first in order to make the parallelized pairwise similarity computation available.

In the following sections, we will introduce the descriptors and function usage in this order.

Commonly used descriptors

Note: Users need to evaluate the underlying details of the descriptors provided intelligently, instead of using protr with their data blindly, especially for the descriptor types that have more flexibility. It would be wise for the users to use some negative and positive control comparisons where relevant to help guide the interpretation of the results.

A protein or peptide sequence with \(N\) amino acid residues can be generally represented as \(\{\,R_1, R_2, \ldots, R_n\,\}\), where \(R_i\) represents the residue at the \(i\)-th position in the sequence. The labels \(i\) and \(j\) are used to index amino acid position in a sequence, and \(r\), \(s\), \(t\) are used to represent the amino acid type. The computed descriptors are roughly categorized into 4 groups according to their major applications.

A protein sequence can be partitioned equally into segments. The descriptors designed for the complete sequence can often be applied to each individual segment.

Amino acid composition descriptor

The amino acid composition describes the fraction of each amino acid type within a protein sequence. The fractions of all 20 natural amino acids are calculated as follows:

\[ f(r) = \frac{N_r}{N} \quad r = 1, 2, \ldots, 20. \]

where \(N_r\) is the number of the amino acid type \(r\) and \(N\) is the length of the sequence.

As was described above, we can use the function extractAAC() to extract the descriptors (features) from protein sequences:

library("protr")

x <- readFASTA(system.file(
  "protseq/P00750.fasta",
  package = "protr"
))[[1]]

extractAAC(x)
#>          A          R          N          D          C          E          Q 
#> 0.06405694 0.07117438 0.03914591 0.05160142 0.06761566 0.04804270 0.04804270 
#>          G          H          I          L          K          M          F 
#> 0.08185053 0.03024911 0.03558719 0.07651246 0.03914591 0.01245552 0.03202847 
#>          P          S          T          W          Y          V 
#> 0.05338078 0.08896797 0.04448399 0.02313167 0.04270463 0.04982206

Here, using the function readFASTA(), we loaded a single protein sequence (P00750, Tissue-type plasminogen activator) from a FASTA format file. Then, we extracted the AAC descriptors with extractAAC(). The result returned is a named vector whose elements are tagged with the name of each amino acid.

Dipeptide composition descriptor

Dipeptide composition gives a 400-dimensional descriptor, defined as:

\[ f(r, s) = \frac{N_{rs}}{N - 1} \quad r, s = 1, 2, \ldots, 20. \]

where \(N_{rs}\) is the number of dipeptides represented by amino acid type \(r\) and type \(s\). Similar to extractAAC(), we use extractDC() to compute the descriptors:

dc <- extractDC(x)
head(dc, n = 30L)
#>          AA          RA          NA          DA          CA          EA 
#> 0.003565062 0.003565062 0.000000000 0.007130125 0.003565062 0.003565062 
#>          QA          GA          HA          IA          LA          KA 
#> 0.007130125 0.007130125 0.001782531 0.003565062 0.001782531 0.001782531 
#>          MA          FA          PA          SA          TA          WA 
#> 0.000000000 0.005347594 0.003565062 0.007130125 0.003565062 0.000000000 
#>          YA          VA          AR          RR          NR          DR 
#> 0.000000000 0.000000000 0.003565062 0.007130125 0.005347594 0.001782531 
#>          CR          ER          QR          GR          HR          IR 
#> 0.005347594 0.005347594 0.000000000 0.007130125 0.001782531 0.003565062

Here, we only showed the first 30 elements of the result vector and omitted the rest of the result. The element names of the returned vector are self-explanatory, as before.

Tripeptide composition descriptor

Tripeptide composition gives an 8000-dimensional descriptor, defined as:

\[ f(r, s, t) = \frac{N_{rst}}{N - 2} \quad r, s, t = 1, 2, \ldots, 20 \]

where \(N_{rst}\) is the number of tripeptides represented by amino acid type \(r\), \(s\), and \(t\). With the function extractTC(), we can easily obtain the length-8000 descriptor, to save some space, here we also omitted the long outputs:

tc <- extractTC(x)
head(tc, n = 36L)
#>         AAA         RAA         NAA         DAA         CAA         EAA 
#> 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 
#>         QAA         GAA         HAA         IAA         LAA         KAA 
#> 0.001785714 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 
#>         MAA         FAA         PAA         SAA         TAA         WAA 
#> 0.000000000 0.000000000 0.000000000 0.001785714 0.000000000 0.000000000 
#>         YAA         VAA         ARA         RRA         NRA         DRA 
#> 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 
#>         CRA         ERA         QRA         GRA         HRA         IRA 
#> 0.000000000 0.000000000 0.000000000 0.001785714 0.000000000 0.000000000 
#>         LRA         KRA         MRA         FRA         PRA         SRA 
#> 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000

Autocorrelation descriptors

Autocorrelation descriptors are defined based on the distribution of amino acid properties along the sequence. The amino acid properties used here are various types of amino acids index (Retrieved from the AAindex database, see Kawashima, Ogata, and Kanehisa (1999), Kawashima and Kanehisa (2000), and Kawashima et al. (2008); see Figure 2 for an illustrated example). Three types of autocorrelation descriptors are defined here and described below.

All the amino acid indices are centralized and standardized before the calculation, i.e.

\[ P_r = \frac{P_r - \bar{P}}{\sigma} \]

where \(\bar{P}\) is the average of the property of the 20 amino acids:

\[ \bar{P} = \frac{\sum_{r=1}^{20} P_r}{20} \quad \textrm{and} \quad \sigma = \sqrt{\frac{1}{2} \sum_{r=1}^{20} (P_r - \bar{P})^2} \]