# Introduction

This vignette will give you an overview of how you can analyse bottom-up proteomics data or LiP-MS data using protti. If you would like to analyse dose-response data please refer to the dose-response data analysis vignette. Before analysing your data make sure that it is of sufficient quality and that you do not have any outliers. To do this you can take a look at the quality control vignette.

protti includes several functions that make it easy for the user to analyse and interpret data from bottom-up proteomics or LiP-MS experiments. The R package includes functions for

- Quality control (see quality control vignette for more detailed information)
- Data preparation
  - Median normalisation
  - Data filtering
  - Protein abundance calculation from precursor or peptide intensities
  - Imputation of missing values
  - Fetching of database information (ChEBI, GO, KEGG, MobiDB, UniProt)
  - Calculation of sequence coverage
- Data analysis
  - Statistical hypothesis tests
- Data visualisation
  - Volcano plots
  - Barcode plots (plots that show protein coverage and significant changes projected onto the protein sequence)
  - Wood’s plots
  - Profile plots
- Data interpretation
  - Treatment enrichment (check if your hits are enriched with known targets)
  - GO-term enrichment
  - Network analysis (based on STRING database information)
  - KEGG pathway enrichment

You can read more about specific functions and how to use them by calling e.g. ?normalise (for the normalise() function). Calling ? followed by the function name will display the function documentation and give you more detailed information about the function. This can be done for any of the functions included in the package.

This document will give you an overview of data preparation, data analysis, data visualisation and data interpretation functions included in protti. It will show you how they can be applied to your data. The examples in this file are run on a published LiP-MS data set. In the experiment a HeLa cell lysate was treated with different amounts of the drug rapamycin and the target of rapamycin (FKBP12) was successfully identified. For this vignette we are using a filtered version of the original data set where we include only the 10 μM treatment concentration and the control (untreated). For simplicity only 50 proteins are included in the data set.

The data set is produced from the output of Spectronaut™. However, if you have any other data such as DDA data that was searched with a different search engine you can still apply protti’s functions. Just make sure that your data frame contains tidy data. That means data should be contained in a long format (e.g. all sample names in one column) rather than a wide format (e.g. each sample name in its own column). You can easily achieve this by using the pivot_longer() function from the tidyr package. If you are unsure what your input data is supposed to look like, please use the create_synthetic_data() function and compare this to your data. You can also take a look at the input preparation vignette, there you will find all the necessary information on how to get your data into the correct format.
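As an illustration of this reshaping step, the small example below (using hypothetical column and sample names, not columns from the rapamycin data set) converts a wide-format data frame into the required long format with tidyr's pivot_longer():

```r
library(tidyr)

# Hypothetical wide-format data: one intensity column per sample
wide_data <- data.frame(
  peptide = c("PEPTIDER", "PEPTI"),
  sample1 = c(14, 16),
  sample2 = c(15, 18)
)

# Convert to long format: all sample names end up in one column
# ("sample") and all intensities in another ("intensity")
long_data <- pivot_longer(wide_data,
                          cols = c(sample1, sample2),
                          names_to = "sample",
                          values_to = "intensity")
```

After this step every row contains one observation, which is the structure all protti functions expect.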

The input data should have a similar structure to this example:

| Sample  | Replicate | Peptide  | Condition | log2(Intensity) |
|---------|-----------|----------|-----------|-----------------|
| sample1 | 1         | PEPTIDER | treated   | 14              |
| sample1 | 1         | PEPTI    | treated   | 16              |
| sample1 | 1         | PEPTIDE  | treated   | 17              |
| sample2 | 1         | PEPTIDER | untreated | 15              |
| sample2 | 1         | PEPTI    | untreated | 18              |
| sample2 | 1         | PEPTIDE  | untreated | 12              |

# How to use protti to analyse your data

## Getting started

Before we can start analysing our data, we need to load the protti package. This is done by using the base R function library(). In addition, we are also loading the packages magrittr and dplyr. Both magrittr and dplyr are part of the tidyverse, a collection of packages that provide useful functionalities for data processing and visualisation. If you use many tidyverse packages in your workflow you can easily load all at once by calling library(tidyverse).

```r
library(protti)
library(magrittr)
library(dplyr)
```

After having loaded the required packages we will load our data set into the R environment. In order to do this for your data set you can use the function read_protti(). This function is a wrapper around the fast fread() function from the data.table package and the clean_names() function from the janitor package. This will allow you to not only load your data into R very fast, but also to clean up the column names into lower snake case to make them more R-friendly. This will make it easier to remember them and to use them in your data analysis.

```r
# To read in your own data you can use read_protti()
your_data <- read_protti(filename = "mydata/data.csv")
```

For this example we are going to use the rapamycin_10uM test data set included in protti. To read in the file we are simply going to use the utils function data().

```r
data("rapamycin_10uM")
```

## Data preparation

### Log2 transformation, median normalisation and CV filtering

After inspecting the data and performing quality control (see quality control vignette for more information) we will now start to prepare the data for the analysis.

First, we remove decoy hits (used for false discovery rate estimation). Our example data set contains a column called eg_is_decoy that consists of logicals indicating whether or not the peptide is a decoy hit. To remove decoys we will use the dplyr function filter().

Next, we log2 transform our intensities, then normalise the data to the median value of all runs. To transform the intensities we use the dplyr function mutate() which creates a new column while maintaining the original column.

Note that we are also using the pipe operator %>% included in the R package magrittr. %>% takes the output of the preceding function and supplies it as the first argument of the following function. Using %>% makes code easier to read and follow.

For normalisation we are using the protti function normalise(). For this example we will use median normalisation (method = "median"). The function normalises intensities for each run to the median of all runs. This is only necessary if your search algorithm does not already median normalise your intensities. For the example data we have disabled median normalisation in Spectronaut, therefore we need to median normalise now. The formula for median normalisation is:

$\text{median normalised intensity} = \text{intensity} - \text{median}(\text{run intensity}) + \text{median}(\text{global intensity})$

To ensure that only good peptide measurements will be used for further analysis, we will also filter our data based on coefficients of variation (CV). In order to do this we are using the function filter_cv(). For this example we are retaining peptides with a CV < 25 % in at least one of the two conditions.

The CVs are calculated within the function according to the formula:

$CV = \frac{\text{standard deviation}}{\text{mean}} \cdot 100$

Note: The use of the filter_cv() function is optional. It might remove a lot of your data if your experiment was noisy. However, especially in these cases, the function removes poor-quality peptides and should improve the result. Importantly, if you use this function you should not afterwards apply the moderated t-test or the proDA algorithm for differential abundance estimation and significance testing: prior CV filtering alters the distributional assumptions of these tests and will likely lead to an inflated false positive rate (Bourgon 2010).

```r
data_normalised <- rapamycin_10uM %>%
  filter(eg_is_decoy == FALSE) %>%
  mutate(intensity_log2 = log2(fg_quantity)) %>%
  normalise(sample = r_file_name,
            intensity_log2 = intensity_log2,
            method = "median")

data_filtered <- data_normalised %>%
  filter_cv(grouping = eg_precursor_id,
            condition = r_condition,
            log2_intensity = intensity_log2,
            cv_limit = 0.25,
            min_conditions = 1)
```

### Remove non-proteotypic peptides

For LiP-MS analysis we commonly remove non-proteotypic peptides (i.e. peptides that could come from more than one protein). If you detect a change in non-proteotypic peptides it is not possible to clearly assign which protein it comes from and therefore which protein is affected by your treatment. If you are using the output from Spectronaut you will find a column called “pep_is_proteotypic”. This column contains logicals indicating whether your peptide is proteotypic or not.

To filter out non-proteotypic peptides we are using the dplyr function filter().

```r
data_filtered_proteotypic <- data_filtered %>%
  filter(pep_is_proteotypic == TRUE)
```

### Fetching database information and assigning peptide types

In order to obtain more information about our identified proteins we are going to use the function fetch_uniprot() to download information from UniProt directly.

fetch_uniprot() uses a vector of UniProt IDs as its input. We produce this vector with the base R function unique(), which extracts all unique elements of the selected protein ID column. In this case we want to download the full protein name, gene IDs, GO terms associated with molecular function, StringDB IDs, information on known interacting proteins, the locations of active and binding sites, PDB entries, the protein length and the protein sequence. Many more columns are available (see the fetch_uniprot() documentation for the possible options).

fetch_uniprot() returns a new data frame. In order to be able to merge this with our original data frame we have to rename the ID column to match the name of the protein ID column of our original data frame. To do this we use dplyr’s rename() function.

To merge the two data frames we use the dplyr function left_join(), matching the two data frames by the column “pg_protein_accessions”. By using left_join() we retain all rows from our original data frame while adding the columns from the data frame generated with fetch_uniprot().

Note: you can also directly join the UniProt data frame with your data without the need to rename its id column. You can specify in the by argument in left_join() that two columns are differently named.

Next, we would like to assign the trypticity of our peptides (i.e. if the peptides are fully-tryptic, semi-tryptic or non-tryptic). In order to do this we first need to define the peptide positions in the protein and find the preceding and following amino acids. To obtain this information we use the protti function find_peptide(). The output of this function can then be used in the function assign_peptide_type() which will add an additional column with the peptide trypticity information. By using the function calculate_sequence_coverage() we add an additional column to the data frame containing information on how much of the protein sequence we identified in our experiment.
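Putting these steps together, the pipeline could look like the following sketch. Note that the annotation column names returned by fetch_uniprot() follow the UniProt API and may differ between protti versions, and the Spectronaut column pep_stripped_sequence is assumed to be present in the data:

```r
# Vector of unique UniProt IDs in the data set
uniprot_ids <- unique(data_filtered_proteotypic$pg_protein_accessions)

# Fetch the desired annotation columns from UniProt
# (column names follow the current UniProt API; check your
# protti version's fetch_uniprot() documentation)
uniprot <- fetch_uniprot(
  uniprot_ids,
  columns = c("protein_name", "gene_names", "go_f", "xref_string",
              "cc_interaction", "ft_act_site", "ft_binding",
              "xref_pdb", "length", "sequence")
)

data_filtered_uniprot <- data_filtered_proteotypic %>%
  # join by differently named ID columns instead of renaming first
  left_join(uniprot, by = c("pg_protein_accessions" = "accession")) %>%
  # locate each peptide in its protein and find flanking amino acids
  find_peptide(protein_sequence = sequence,
               peptide_sequence = pep_stripped_sequence) %>%
  # classify peptides as fully-tryptic, semi-tryptic or non-tryptic
  assign_peptide_type(aa_before = aa_before,
                      last_aa = last_aa,
                      aa_after = aa_after) %>%
  # add per-protein sequence coverage
  calculate_sequence_coverage(protein_sequence = sequence,
                              peptides = pep_stripped_sequence)
```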

### Peptide profile plots

To see how the individual precursors in our target protein are changing with the treatment we plot profile plots by using the function peptide_profile_plot(). This is particularly useful as you can quickly see if your whole protein changes in abundance or only a fraction of precursors/peptides. If you have protein abundance data you can also use the plot to show changes in protein abundance over your treatment condition(s). By selecting multiple targets (as a vector) you can produce the plot for multiple proteins.

```r
FKBP12_intensity <- data_filtered_uniprot %>%
  filter(pg_protein_accessions == "P62942")

peptide_profile_plot(
  data = FKBP12_intensity,
  sample = r_file_name,
  peptide = eg_precursor_id,
  intensity_log2 = normalised_intensity_log2,
  grouping = pg_protein_accessions,
  targets = "P62942",
  protein_abundance_plot = FALSE
)
#> $P62942
```

protti includes additional helpful functions that are not meaningful for this small data set but apply to full-size data sets with global changes. These include the calculate_go_enrichment() function, which checks if your hits are enriched for specific gene ontology (GO) terms, the analyse_functional_network() function, which plots a STRING network of your hits based on information from StringDB, and the calculate_kegg_enrichment() function, which checks for enriched pathways among your hits. Furthermore, you can directly check for enrichment of a self-defined treatment with the calculate_treatment_enrichment() function.

For GO enrichment you would add an additional column to your data frame containing information on whether or not a hit is significant. You can do this with the dplyr function mutate(). Here we want the column to contain logicals that are TRUE when the adjusted p-value is below 0.01 and the log2(fold change) is below -1 or above 1, and FALSE otherwise. We use the ifelse() function to produce the logicals. Furthermore, we annotate whether a hit is a true positive by marking peptides of the known rapamycin-binding protein FKBP12.

For the network analysis we filter the previously produced data frame containing the is_significant column for significant hits. This data frame can then be used as an input for analyse_functional_network() to check if the proteins can be found in an interaction network based on StringDB information.

For calculate_kegg_enrichment() you first need to use the function fetch_kegg() to obtain the KEGG pathway identifiers for your data set. You can then use dplyr’s left_join() to join the output with the previously produced data frame containing the column indicating whether your hits are significant.

If you know all known interactors of your specific treatment you can check for an enrichment of these with the calculate_treatment_enrichment() function. This is particularly useful if your treatment has an effect on many proteins.
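The diff_abundance_data object used below is assumed to be the result of a differential abundance analysis, for example with protti’s assign_missingness() and calculate_diff_abundance() functions. A sketch under these assumptions (the ref_condition value and retained columns are illustrative and depend on your data set):

```r
diff_abundance_data <- data_filtered_uniprot %>%
  # annotate the type of data missingness for each precursor
  # ("control" is a placeholder for your reference condition)
  assign_missingness(sample = r_file_name,
                     condition = r_condition,
                     grouping = eg_precursor_id,
                     intensity = normalised_intensity_log2,
                     ref_condition = "control",
                     retain_columns = c(pg_protein_accessions)) %>%
  # estimate log2 fold changes (diff) and adjusted p-values (adj_pval);
  # a Welch t-test is used here since CV filtering was applied earlier
  calculate_diff_abundance(sample = r_file_name,
                           condition = r_condition,
                           grouping = eg_precursor_id,
                           intensity_log2 = normalised_intensity_log2,
                           missingness = missingness,
                           comparison = comparison,
                           method = "t-test")
```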

```r
diff_abundance_significant <- diff_abundance_data %>%
  # mark significant peptides
  mutate(is_significant = ifelse((adj_pval < 0.01 & abs(diff) > 1), TRUE, FALSE)) %>%
  # mark true positive hits
  mutate(binds_treatment = pg_protein_accessions == "P62942")
```

### GO enrichment using "molecular function" annotation from UniProt

```r
calculate_go_enrichment(
  data = diff_abundance_significant,
  protein_id = pg_protein_accessions,
  is_significant = is_significant,
  go_annotations_uniprot = go_molecular_function
)
```

### Network analysis

```r
network_input <- diff_abundance_significant %>%
  filter(is_significant == TRUE)

analyse_functional_network(data = network_input,
                           protein_id = pg_protein_accessions,
                           string_id = database_string,
                           binds_treatment = binds_treatment,
                           organism_id = 9606)
```

### KEGG pathway enrichment

```r
# First you need to load KEGG pathway annotations from the KEGG database
# for your specific organism of interest. In this case HeLa cells were
# used, therefore the organism of interest is homo sapiens (hsa)

kegg <- fetch_kegg(species = "hsa")

# Next we need to annotate our data with KEGG pathway IDs and perform enrichment analysis

diff_abundance_significant %>%
  # columns containing protein IDs are named differently
  left_join(kegg, by = c("pg_protein_accessions" = "uniprot_id")) %>%
  calculate_kegg_enrichment(protein_id = pg_protein_accessions,
                            is_significant = is_significant,
                            pathway_id = pathway_id,
                            pathway_name = pathway_name)
```

### Treatment enrichment analysis

```r
calculate_treatment_enrichment(diff_abundance_significant,
                               protein_id = pg_protein_accessions,
                               is_significant = is_significant,
                               binds_treatment = binds_treatment,
                               treatment_name = "Rapamycin")
```