ampir predicts the probability that a protein is an antimicrobial peptide (AMP) using an SVM model trained on known AMP sequences from a wide diversity of organisms. However, the predict_amps function has a model argument that allows users to pass their own trained model object. A different trained model can be useful when, for example, users wish to use a taxon-specific model to predict AMPs in a restricted group of taxa.
This vignette walks through a mock example of how you can train your own model using the caret package. For more information on how to use caret and the functions used in this example, please see the extensive documentation written by its author, Max Kuhn.
First, a positive and a negative dataset have to be obtained. In this example, we want to predict AMPs in bats, so we train a model on protein sequences found in bats. The positive dataset consists of AMPs and the negative dataset consists of random sequences. Both datasets were obtained from UniProt:
For the positive dataset:
For the negative dataset:
Combine the positive and negative datasets
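A minimal sketch of the combining step, using toy data in place of the UniProt downloads. The sequence names and sequences are made up, and the seq_name / seq_aa column layout is an assumption about what ampir's read_faa returns.

```r
# Toy stand-ins for the two FASTA files; names and sequences are made up.
# The seq_name / seq_aa layout is assumed to match the output of ampir's read_faa().
bat_pos <- data.frame(
  seq_name = c("toy_amp_1", "toy_amp_2"),
  seq_aa   = c("GLFKKLAKKVAKTVAKGAAKLLG", "KWKKLLKKAGKAALHVGKKALKA"),
  stringsAsFactors = FALSE
)
bat_neg <- data.frame(
  seq_name = c("toy_bg_1", "toy_bg_2"),
  seq_aa   = c("MSTNPEDLQAVTEYLGSDDSAPT", "MAELDQTPVSYEGNLTDSVAEPQ"),
  stringsAsFactors = FALSE
)

# Label each set before stacking so the class survives the merge.
bat_pos$Label <- "Positive"
bat_neg$Label <- "Negative"

# Stack both sets into one data frame.
bats <- rbind(bat_pos, bat_neg)
table(bats$Label)
```

Labelling before the merge keeps the class information attached to each sequence, which the later feature table needs.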
Calculate features on the combined positive and negative dataset and add the label column
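The feature step could look roughly like the sketch below. It assumes ampir is installed and that calculate_features returns a data frame that keeps a seq_name column. The toy sequences are made up. Labels are attached by sequence name rather than by row position, as a precaution in case the feature function drops any sequence.

```r
library(ampir)

# Toy combined dataset (made-up sequences built from standard amino acids only).
bats <- data.frame(
  seq_name = c("toy_amp_1", "toy_amp_2", "toy_bg_1", "toy_bg_2"),
  seq_aa   = c("GLFKKLAKKVAKTVAKGAAKLLG", "KWKKLLKKAGKAALHVGKKALKA",
               "MSTNPEDLQAVTEYLGSDDSAPT", "MAELDQTPVSYEGNLTDSVAEPQ"),
  Label    = c("Positive", "Positive", "Negative", "Negative"),
  stringsAsFactors = FALSE
)

# Calculate features from the name and sequence columns only.
bats_features <- calculate_features(bats[, c("seq_name", "seq_aa")])

# Attach the class label by matching on sequence name (assumes a seq_name column in the output).
bats_features$Label <- factor(
  bats$Label[match(bats_features$seq_name, bats$seq_name)]
)

str(bats_features[, c("seq_name", "Label")])
```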
Split the feature set and create a train and test set with caret
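A sketch of a stratified split with caret's createDataPartition. The feature table here is synthetic (random numbers in hypothetical feat_a / feat_b columns) and only stands in for a real feature set. The 80/20 split proportion is an assumption, not a value stated in the text.

```r
library(caret)

set.seed(1)

# Synthetic stand-in for a feature table: one name column, two features, one label.
toy_features <- data.frame(
  seq_name = paste0("seq_", 1:40),
  feat_a   = rnorm(40),
  feat_b   = rnorm(40),
  Label    = factor(rep(c("Positive", "Negative"), each = 20))
)

# Sample 80% of the rows within each class so class balance is preserved.
train_index <- createDataPartition(y = toy_features$Label, p = 0.8, list = FALSE)

features_train <- toy_features[train_index, ]
features_test  <- toy_features[-train_index, ]

table(features_train$Label)
table(features_test$Label)
```

Splitting on the label keeps the ratio of positives to negatives similar in both sets, which matters when one class is much rarer than the other.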
Set the resampling method to repeated cross-validation and enable class probability calculation with caret
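In caret, resampling is configured through trainControl. The sketch below assumes 10 folds repeated 3 times. These numbers are illustrative choices, not values stated in the text.

```r
library(caret)

# Repeated k-fold cross-validation; classProbs = TRUE asks caret to compute
# class probabilities, which requires factor levels that are valid R names.
trctrl_prob <- trainControl(
  method     = "repeatedcv",
  number     = 10,  # folds per repeat (assumed value)
  repeats    = 3,   # number of repeats (assumed value)
  classProbs = TRUE
)

trctrl_prob$method
```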
Train the model using a support vector machine with a radial kernel with caret. Note: Other classification models are supported too. For example, to use a random forest model in caret, the method argument could be changed from “svmRadial” to “ranger”.
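A small training sketch on synthetic data, assuming the caret and kernlab packages are installed (caret's "svmRadial" method relies on kernlab). The two hypothetical features are drawn so that the classes are separable. The fold and repeat counts are kept small for speed and are not recommendations.

```r
library(caret)

set.seed(42)
n <- 30

# Synthetic training set: two informative features and a two-level label.
train_set <- data.frame(
  feat_a = c(rnorm(n, mean = 1), rnorm(n, mean = -1)),
  feat_b = c(rnorm(n, mean = 1), rnorm(n, mean = -1)),
  Label  = factor(rep(c("Positive", "Negative"), each = n))
)

ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 2,
                     classProbs = TRUE)

# Radial-kernel SVM; predictors are centred and scaled before fitting.
# Changing method to "ranger" would fit a random forest instead
# (this needs the ranger package).
toy_svm_model <- train(
  Label ~ .,
  data       = train_set,
  method     = "svmRadial",
  trControl  = ctrl,
  preProcess = c("center", "scale")
)

# Class probabilities are available because classProbs = TRUE was set.
head(predict(toy_svm_model, newdata = train_set, type = "prob"))
```

With a real feature table, any sequence name column would need to be excluded from the data passed to train, since it is an identifier rather than a predictor.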
Test the model on the test data with caret to get an indication of how well it performs
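One way to summarise test performance is caret's confusionMatrix. To keep the sketch short, the predicted labels below are written by hand. In a real run they would come from calling predict on the trained model with the test features.

```r
library(caret)

lv <- c("Positive", "Negative")

# Hand-written stand-ins for the true test labels and the model's predictions.
reference <- factor(c("Positive", "Positive", "Negative",
                      "Negative", "Positive", "Negative"), levels = lv)
predicted <- factor(c("Positive", "Negative", "Negative",
                      "Negative", "Positive", "Positive"), levels = lv)

# Treat "Positive" (AMP) as the class of interest for sensitivity and specificity.
cm <- confusionMatrix(data = predicted, reference = reference,
                      positive = "Positive")

cm$table
cm$overall["Accuracy"]
cm$byClass[c("Sensitivity", "Specificity")]
```

Setting the positive class explicitly avoids caret defaulting to the first factor level, which would otherwise decide what sensitivity refers to.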
Convert the bat feature test data back to the original FASTA-style format containing just the sequence name and sequence, as this is the required input for predict_amps
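A sketch of this conversion in base R: the test rows are looked up by name in the original sequence table, keeping only the name and sequence columns. All object names, column names and sequences here are illustrative.

```r
# Original sequence table (made-up sequences) with a label column.
bats <- data.frame(
  seq_name = c("seq_1", "seq_2", "seq_3", "seq_4"),
  seq_aa   = c("GLFKKLAKKVAKTVAKGAAKLLG", "KWKKLLKKAGKAALHVGKKALKA",
               "MSTNPEDLQAVTEYLGSDDSAPT", "MAELDQTPVSYEGNLTDSVAEPQ"),
  Label    = c("Positive", "Positive", "Negative", "Negative"),
  stringsAsFactors = FALSE
)

# Stand-in for the test portion of the feature table.
bats_features_test <- data.frame(
  seq_name = c("seq_2", "seq_4"),
  feat_a   = c(0.1, -0.3),
  Label    = factor(c("Positive", "Negative"))
)

# Keep only the sequences that ended up in the test set, without features or label.
bats_test_set <- bats[bats$seq_name %in% bats_features_test$seq_name,
                      c("seq_name", "seq_aa")]
bats_test_set
```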
Use the trained bat model in the predict_amps function on the bat test set
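The final call can be wrapped as sketched below, assuming ampir is installed. The helper predict_with_custom_model is hypothetical and not part of ampir. Its two-column check reflects the input requirement described above, and min_len = 5 is an assumed setting.

```r
library(ampir)

# Hypothetical helper: verify the input layout, then predict with a user-supplied model.
predict_with_custom_model <- function(faa_df, model, min_len = 5) {
  # The input should hold just the sequence name and the sequence.
  stopifnot(is.data.frame(faa_df), ncol(faa_df) == 2)
  predict_amps(faa_df, min_len = min_len, model = model)
}
```

With a two-column test set and a caret model trained on ampir's feature set (for example, objects named bats_test_set and my_bat_svm_model, both illustrative names), the call would be predict_with_custom_model(bats_test_set, my_bat_svm_model). The model must have been trained on the same features that ampir calculates, otherwise prediction cannot work.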