The protr package offers a unique and comprehensive toolkit for generating various numerical representation schemes of protein sequences. The descriptors included are extensively utilized in bioinformatics and chemogenomics research.
The commonly used descriptors listed in protr include amino acid composition, autocorrelation, CTD, conjoint traid, quasi-sequence order, pseudo amino acid composition, and profile-based descriptors derived by Position-Specific Scoring Matrix (PSSM).
The descriptors for proteochemometric (PCM) modeling include the scales-based descriptors derived by principal components analysis, factor analysis, multidimensional scaling, amino acid properties (AAindex), 20+ classes of 2D and 3D molecular descriptors (for example, Topological, WHIM, VHSE, among others), and BLOSUM/PAM matrix-derived descriptors.
The protr package also implemented parallelized similarity computation derived by pairwise protein sequence alignment and Gene Ontology (GO) semantic similarity measures. ProtrWeb, the web application built on protr, can be accessed from http://protr.org.
If you find protr useful in your research, please feel free to cite our paper:
Nan Xiao, Dong-Sheng Cao, Min-Feng Zhu, and Qing-Song Xu. (2015). protr/ProtrWeb: R package and web server for generating various numerical representation schemes of protein sequences. Bioinformatics 31 (11), 1857-1859.
BibTeX entry:
@article{Xiao2015,
author = {Xiao, Nan and Cao, Dong-Sheng and Zhu, Min-Feng and Xu, Qing-Song.},
title = {protr/{ProtrWeb}: {R} package and web server for generating various numerical representation schemes of protein sequences},
journal = {Bioinformatics},
year = {2015},
volume = {31},
number = {11},
pages = {1857--1859},
doi = {10.1093/bioinformatics/btv042}
}
Here, we use the subcellular localization dataset of human proteins presented in Chou and Shen (2008) to demonstrate the basic usage of protr.
The complete dataset includes 3,134 protein sequences (2,750 different proteins) classified into 14 human subcellular locations. We selected two classes of proteins as our benchmark dataset. Class 1 contains 325 extracell proteins and class 2 includes 307 mitochondrion proteins. We aim to build a random forest classification model to classify these two types of proteins.
First, we load the protr package, then read the protein sequences
stored in two separated FASTA files with readFASTA()
:
library("protr")
# Load FASTA files
# Note that `system.file()` is for accessing example files
# in the protr package. Replace it with your own file path.
extracell <- readFASTA(
system.file(
"protseq/extracell.fasta",
package = "protr"
)
)
mitonchon <- readFASTA(
system.file(
"protseq/mitochondrion.fasta",
package = "protr"
)
)
To read protein sequences stored in PDB format files, use
readPDB()
instead. The loaded sequences will be stored as
two lists in R, and each component is a character string representing
one protein sequence. In this case, there are 325 extracell
protein sequences and 306 mitonchon protein sequences:
length(extracell)
## [1] 325
length(mitonchon)
## [1] 306
To ensure that the protein sequences only have the 20 standard amino
acid types, which is usually required for the descriptor computation, we
use the protcheck()
function to do the amino acid type
sanity check and remove the non-standard sequences:
extracell <- extracell[(sapply(extracell, protcheck))]
mitonchon <- mitonchon[(sapply(mitonchon, protcheck))]
length(extracell)
## [1] 323
length(mitonchon)
## [1] 304
Two protein sequences were removed from each class. For the remaining sequences, we calculate the Type II PseAAC descriptor, i.e., the amphiphilic pseudo amino acid composition (APseAAC) descriptor (Chou 2005) and make class labels for classification modeling.
# Calculate APseAAC descriptors
x1 <- t(sapply(extracell, extractAPAAC))
x2 <- t(sapply(mitonchon, extractAPAAC))
x <- rbind(x1, x2)
# Make class labels
labels <- as.factor(c(rep(0, length(extracell)), rep(1, length(mitonchon))))
In protr, the functions of commonly used descriptors for protein
sequences and proteochemometric (PCM) modeling descriptors are named
after extract...()
.
Next, we will split the data into a 75% training set and a 25% test set.
set.seed(1001)
# Split training and test set
tr.idx <- c(
sample(1:nrow(x1), round(nrow(x1) * 0.75)),
sample(nrow(x1) + 1:nrow(x2), round(nrow(x2) * 0.75))
)
te.idx <- setdiff(1:nrow(x), tr.idx)
x.tr <- x[tr.idx, ]
x.te <- x[te.idx, ]
y.tr <- labels[tr.idx]
y.te <- labels[te.idx]
We will train a random forest classification model on the training
set with 5-fold cross-validation, using the randomForest
package.
library("randomForest")
rf.fit <- randomForest(x.tr, y.tr, cv.fold = 5)
rf.fit
The training result is:
## Call:
## randomForest(x = x.tr, y = y.tr, cv.fold = 5)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 8
##
## OOB estimate of error rate: 25.11%
## Confusion matrix:
## 0 1 class.error
## 0 196 46 0.1900826
## 1 72 156 0.3157895
With the model trained on the training set, we predict on the test
set and plot the ROC curve with the pROC
package, as is
shown in Figure 1.
# Predict on test set
rf.pred <- predict(rf.fit, newdata = x.te, type = "prob")[, 1]
# Plot ROC curve
library("pROC")
plot.roc(y.te, rf.pred, grid = TRUE, print.auc = TRUE)
The area under the ROC curve (AUC) is:
## Call:
## plot.roc.default(x = y.te, predictor = rf.pred, col = "#0080ff",
## grid = TRUE, print.auc = TRUE)
##
## Data: rf.pred in 81 controls (y.te 0) > 76 cases (y.te 1).
## Area under the curve: 0.8697
The protr package (Xiao et al. 2015)
implemented most of the state-of-the-art protein sequence descriptors
with R. Generally, each type of the descriptor (feature) can be
calculated with a function named extractX()
in the protr
package, where X
stands for the abbreviation of the
descriptor name. The descriptors are:
extractAAC()
- Amino acid compositionextractDC()
- Dipeptide compositionextractTC()
- Tripeptide compositionextractMoreauBroto()
- Normalized Moreau-Broto
autocorrelationextractMoran()
- Moran autocorrelationextractGeary()
- Geary autocorrelationextractCTDC()
- CompositionextractCTDT()
- TransitionextractCTDD()
- DistributionextractCTriad()
- Conjoint triad descriptorsextractSOCN()
- Sequence-order-coupling numberextractQSO()
- Quasi-sequence-order descriptorsextractPAAC()
- Pseudo-amino acid composition
(PseAAC)extractAPAAC()
- Amphiphilic pseudo-amino acid
composition (APseAAC)extractPSSM()
extractPSSMAcc()
extractPSSMFeature()
The descriptors commonly used in Proteochemometric Modeling (PCM) implemented in protr include:
extractScales()
, extractScalesGap()
-
Scales-based descriptors derived by Principal Components Analysis
extractProtFP()
, extractProtFPGap()
-
Scales-based descriptors derived by amino acid properties from AAindex
(a.k.a. Protein Fingerprint)extractDescScales()
- Scales-based descriptors derived
by 20+ classes of 2D and 3D molecular descriptors (Topological, WHIM,
VHSE, etc.)extractFAScales()
- Scales-based descriptors derived by
Factor AnalysisextractMDSScales()
- Scales-based descriptors derived
by Multidimensional ScalingextractBLOSUM()
- BLOSUM and PAM matrix-derived
descriptorsThe protr package integrates the function of parallelized similarity score computation derived by local or global protein sequence alignment between a list of protein sequences; Biostrings and pwalign provide the sequence alignment computation, and the corresponding functions listed in the protr package include:
twoSeqSim()
- Similarity calculation derived by
sequence alignment between two protein sequencesparSeqSim()
- Parallelized pairwise similarity
calculation with a list of protein sequences (in-memory version)parSeqSimDisk()
- Parallelized pairwise similarity
calculation with a list of protein sequences (disk-based version)crossSetSim()
- Parallelized pairwise similarity
calculation between two sets of protein sequences (in-memory
version)protr also integrates the function for parallelized similarity score computation derived by Gene Ontology (GO) semantic similarity measures between a list of GO terms / Entrez Gene IDs, the GO similarity computation is provided by GOSemSim, the corresponding functions listed in the protr package include:
twoGOSim()
- Similarity calculation derived by GO-terms
semantic similarity measures between two GO terms / Entrez Gene
IDs;parGOSim()
- Pairwise similarity calculation with a
list of GO terms / Entrez Gene IDs.To use the parSeqSim()
function, we suggest the users to
install the packages foreach and doParallel first in order to make the
parallelized pairwise similarity computation available.
In the following sections, we will introduce the descriptors and function usage in this order.
Note: Users need to evaluate the underlying details of the descriptors provided intelligently, instead of using protr with their data blindly, especially for the descriptor types that have more flexibility. It would be wise for the users to use some negative and positive control comparisons where relevant to help guide the interpretation of the results.
A protein or peptide sequence with \(N\) amino acid residues can be generally represented as \(\{\,R_1, R_2, \ldots, R_n\,\}\), where \(R_i\) represents the residue at the \(i\)-th position in the sequence. The labels \(i\) and \(j\) are used to index amino acid position in a sequence, and \(r\), \(s\), \(t\) are used to represent the amino acid type. The computed descriptors are roughly categorized into 4 groups according to their major applications.
A protein sequence can be partitioned equally into segments. The descriptors designed for the complete sequence can often be applied to each individual segment.
The amino acid composition describes the fraction of each amino acid type within a protein sequence. The fractions of all 20 natural amino acids are calculated as follows:
\[ f(r) = \frac{N_r}{N} \quad r = 1, 2, \ldots, 20. \]
where \(N_r\) is the number of the amino acid type \(r\) and \(N\) is the length of the sequence.
As was described above, we can use the function
extractAAC()
to extract the descriptors (features) from
protein sequences:
library("protr")
x <- readFASTA(system.file(
"protseq/P00750.fasta",
package = "protr"
))[[1]]
extractAAC(x)
#> A R N D C E Q
#> 0.06405694 0.07117438 0.03914591 0.05160142 0.06761566 0.04804270 0.04804270
#> G H I L K M F
#> 0.08185053 0.03024911 0.03558719 0.07651246 0.03914591 0.01245552 0.03202847
#> P S T W Y V
#> 0.05338078 0.08896797 0.04448399 0.02313167 0.04270463 0.04982206
Here, using the function readFASTA()
, we loaded a single
protein sequence (P00750, Tissue-type plasminogen activator) from a
FASTA format file. Then, we extracted the AAC descriptors with
extractAAC()
. The result returned is a named vector whose
elements are tagged with the name of each amino acid.
Dipeptide composition gives a 400-dimensional descriptor, defined as:
\[ f(r, s) = \frac{N_{rs}}{N - 1} \quad r, s = 1, 2, \ldots, 20. \]
where \(N_{rs}\) is the number of
dipeptides represented by amino acid type \(r\) and type \(s\). Similar to extractAAC()
,
we use extractDC()
to compute the descriptors:
dc <- extractDC(x)
head(dc, n = 30L)
#> AA RA NA DA CA EA
#> 0.003565062 0.003565062 0.000000000 0.007130125 0.003565062 0.003565062
#> QA GA HA IA LA KA
#> 0.007130125 0.007130125 0.001782531 0.003565062 0.001782531 0.001782531
#> MA FA PA SA TA WA
#> 0.000000000 0.005347594 0.003565062 0.007130125 0.003565062 0.000000000
#> YA VA AR RR NR DR
#> 0.000000000 0.000000000 0.003565062 0.007130125 0.005347594 0.001782531
#> CR ER QR GR HR IR
#> 0.005347594 0.005347594 0.000000000 0.007130125 0.001782531 0.003565062
Here, we only showed the first 30 elements of the result vector and omitted the rest of the result. The element names of the returned vector are self-explanatory, as before.
Tripeptide composition gives an 8000-dimensional descriptor, defined as:
\[ f(r, s, t) = \frac{N_{rst}}{N - 2} \quad r, s, t = 1, 2, \ldots, 20 \]
where \(N_{rst}\) is the number of
tripeptides represented by amino acid type \(r\), \(s\), and \(t\). With the function
extractTC()
, we can easily obtain the length-8000
descriptor, to save some space, here we also omitted the long
outputs:
tc <- extractTC(x)
head(tc, n = 36L)
#> AAA RAA NAA DAA CAA EAA
#> 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
#> QAA GAA HAA IAA LAA KAA
#> 0.001785714 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
#> MAA FAA PAA SAA TAA WAA
#> 0.000000000 0.000000000 0.000000000 0.001785714 0.000000000 0.000000000
#> YAA VAA ARA RRA NRA DRA
#> 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
#> CRA ERA QRA GRA HRA IRA
#> 0.000000000 0.000000000 0.000000000 0.001785714 0.000000000 0.000000000
#> LRA KRA MRA FRA PRA SRA
#> 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
Autocorrelation descriptors are defined based on the distribution of amino acid properties along the sequence. The amino acid properties used here are various types of amino acids index (Retrieved from the AAindex database, see Kawashima, Ogata, and Kanehisa (1999), Kawashima and Kanehisa (2000), and Kawashima et al. (2008); see Figure 2 for an illustrated example). Three types of autocorrelation descriptors are defined here and described below.
All the amino acid indices are centralized and standardized before the calculation, i.e.
\[ P_r = \frac{P_r - \bar{P}}{\sigma} \]
where \(\bar{P}\) is the average of the property of the 20 amino acids:
\[ \bar{P} = \frac{\sum_{r=1}^{20} P_r}{20} \quad \textrm{and} \quad \sigma = \sqrt{\frac{1}{2} \sum_{r=1}^{20} (P_r - \bar{P})^2} \]