pathfindR is a tool for enrichment analysis via active subnetworks. The package also offers functionalities to cluster the enriched terms and identify representative terms in each cluster, to score the enriched terms per sample and to visualize analysis results.
The functionalities of pathfindR are described in detail in Ulgen E, Ozisik O, Sezerman OU. 2019. pathfindR: An R Package for Comprehensive Identification of Enriched Pathways in Omics Data Through Active Subnetworks. Front. Genet. https://doi.org/10.3389/fgene.2019.00858
The observation that motivated us to develop pathfindR was that direct enrichment analysis of differential RNA/protein expression or DNA methylation results may not provide the researcher with the full picture. That is to say, enrichment analysis of a list of significant genes alone may not be informative enough to explain the underlying disease mechanisms. Therefore, we considered leveraging interaction information from a protein-protein interaction network (PIN) to identify distinct active subnetworks and then perform enrichment analyses on these subnetworks.
An active subnetwork can be defined as a group of interconnected genes in a PIN that predominantly consists of significantly altered genes. In other words, active subnetworks define distinct disease-associated sets of interacting genes, whether discovered through the original analysis or discovered because of being in interaction with a significant gene.
The active-subnetwork-oriented enrichment analysis approach of pathfindR can be summarized as follows: After processing the input, the input genes and their associated p values are mapped onto the PIN and an active subnetwork search is performed. The resulting active subnetworks are then filtered based on their scores and the number of significant genes they contain. This filtered list of active subnetworks is then used for enrichment analyses, i.e. using the genes in each of the active subnetworks, the significantly enriched terms (pathways/gene sets) are identified. Enriched terms with adjusted p values larger than the given threshold are discarded and the lowest adjusted p value (over all active subnetworks) for each term is kept. This process of active subnetwork search + enrichment analyses is repeated for a selected number of iterations, performed in parallel. Over all iterations, the lowest and the highest adjusted p values, as well as the number of occurrences, are reported for each significantly enriched term in the resulting data frame. An HTML report containing the results, with links to the visualizations of the enriched terms, is also provided. This active-subnetwork-oriented enrichment approach is demonstrated in the section Active-subnetwork-oriented Enrichment Analysis of this vignette.
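The per-term summary over iterations described above can be pictured with a small base R sketch. The data frame iter_res and its column names are hypothetical stand-ins for per-iteration results, and the values are made up. This is an illustration of the bookkeeping only, not pathfindR's internal code.

```r
# Hypothetical per-iteration results: one row per (iteration, term) pair that
# passed the adjusted-p threshold in that iteration (values are made up).
iter_res <- data.frame(
  iteration = c(1, 1, 2, 2, 3),
  ID        = c("term_A", "term_B", "term_A", "term_B", "term_A"),
  adj_p     = c(0.010, 0.040, 0.002, 0.030, 0.020)
)

# Summarise each term over all iterations
lowest_p   <- tapply(iter_res$adj_p, iter_res$ID, min)
highest_p  <- tapply(iter_res$adj_p, iter_res$ID, max)
occurrence <- tapply(iter_res$adj_p, iter_res$ID, length)

summary_df <- data.frame(
  ID         = names(lowest_p),
  occurrence = as.vector(occurrence),
  lowest_p   = as.vector(lowest_p),
  highest_p  = as.vector(highest_p)
)

# Most significant terms first
summary_df <- summary_df[order(summary_df$lowest_p), ]
print(summary_df)
```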
The enrichment analysis usually yields a great number of enriched terms whose biological functions are related. Therefore, we implemented two clustering approaches using a pairwise distance matrix based on the kappa statistics between the enriched terms (as proposed by Huang et al. [1]). Based on this distance metric, the user can perform either hierarchical (default) or fuzzy clustering of the enriched terms. Details of clustering and partitioning of enriched terms are presented in the Clustering Enriched Terms section of this vignette.
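To make the kappa statistic concrete, here is a minimal base R sketch that computes Cohen's kappa between two terms from their gene membership. The function kappa_between and the toy gene names are hypothetical, and the choice of gene universe (here, all genes under comparison) is an assumption. pathfindR's own implementation may differ in detail.

```r
# Cohen's kappa between two terms, based on gene membership over a shared
# gene universe (assumption: the universe is the set of all genes compared).
kappa_between <- function(genes_a, genes_b, universe) {
  in_a <- universe %in% genes_a
  in_b <- universe %in% genes_b
  n <- length(universe)
  # Observed agreement: genes in both terms or in neither term
  observed <- (sum(in_a & in_b) + sum(!in_a & !in_b)) / n
  # Agreement expected by chance, from the marginal counts
  expected <- (sum(in_a) * sum(in_b) + sum(!in_a) * sum(!in_b)) / n^2
  if (expected == 1) return(NA_real_)  # kappa is undefined in this case
  (observed - expected) / (1 - expected)
}

universe <- paste0("gene", 1:10)
term_1 <- paste0("gene", 1:5)
term_2 <- paste0("gene", 1:5)   # identical to term_1
term_3 <- paste0("gene", 6:10)  # no overlap with term_1
term_4 <- paste0("gene", 3:7)   # partial overlap with term_1

k_identical <- kappa_between(term_1, term_2, universe)
k_disjoint  <- kappa_between(term_1, term_3, universe)
k_partial   <- kappa_between(term_1, term_4, universe)

# Distance used for clustering: 1 - kappa
c(identical = 1 - k_identical, disjoint = 1 - k_disjoint, partial = 1 - k_partial)
```

Terms with heavily overlapping gene content get a kappa close to 1 and therefore a distance close to 0.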
Other functionalities of pathfindR including:
For convenience, we provide the wrapper function run_pathfindR() to be used for the active-subnetwork-oriented enrichment analysis. The input for this function must be a data frame consisting of the columns containing: Gene Symbols, Change Values (optional) and p values. The example data frame used in this vignette (input_df) is the dataset containing the differentially-expressed genes for the GEO dataset GSE15573 comparing 18 rheumatoid arthritis (RA) patients versus 15 healthy subjects.
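As an illustration of this layout, the sketch below builds a toy data frame with a gene symbol column, an optional change value column and a p value column. The column names, gene symbols and numbers are all hypothetical placeholders. The real example data should be used for actual analyses.

```r
# Toy input in the layout described above (all names and values are made up)
toy_input <- data.frame(
  Gene_symbol = c("GENE1", "GENE2", "GENE3", "GENE4"),  # hypothetical symbols
  logFC       = c(1.8, -0.9, 0.4, -2.1),                # optional change values
  adj_P_Val   = c(0.001, 0.020, 0.300, 0.004),          # p values
  stringsAsFactors = FALSE
)

# The same data without the optional change-value column
toy_input_no_change <- toy_input[, c("Gene_symbol", "adj_P_Val")]

# Basic sanity checks worth running before any analysis
stopifnot(
  !anyDuplicated(toy_input$Gene_symbol),
  all(toy_input$adj_P_Val >= 0 & toy_input$adj_P_Val <= 1)
)

head(toy_input)
```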
The first 6 rows of the example input data frame are displayed below:
For a detailed step-by-step explanation and an unwrapped demonstration of the active-subnetwork-oriented enrichment analysis, see the vignette Step-by-Step Execution of the pathfindR Enrichment Workflow.
Executing the workflow is straightforward (but does typically take several minutes):
This subsection demonstrates some (selected) useful arguments of run_pathfindR(). For a full list of arguments, see ?run_pathfindR or visit our GitHub wiki.
run_pathfindR() uses the input genes with p values < 0.05. To change this threshold, use the p_val_threshold argument:
output_df <- run_pathfindR(input_df, p_val_threshold = 0.01)
run_pathfindR() creates a directory named "pathfindR_Results" under the current working directory for writing the output files. To change the output directory, use the output_dir argument:
output_df <- run_pathfindR(input_df, output_dir = "this_is_my_output_directory")
This creates "this_is_my_output_directory" under the current working directory.
In essence, this argument is treated as a path so it can be used to create the output directory anywhere. For example, to create the directory "my_dir" under "~/Desktop" and run the analysis there, you may run:
output_df <- run_pathfindR(input_df, output_dir = "~/Desktop/my_dir")
Note: If the output directory (e.g. "my_dir") already exists, run_pathfindR() creates and works under "my_dir(1)". If that also exists, it creates "my_dir(2)" and so on. This was intentionally implemented so that any previous pathfindR results are not overwritten.
The active-subnetwork-oriented enrichment analyses can be performed on any gene sets (biological pathways, gene ontology terms, transcription factor target genes, miRNA target genes etc.). The available gene sets in pathfindR are “KEGG”, “Reactome”, “BioCarta”, “GO-All”, “GO-BP”, “GO-CC” and “GO-MF” (all for Homo sapiens). To change the default gene sets for enrichment analysis (hsa KEGG pathways), use the gene_sets argument:
output_df <- run_pathfindR(input_df, gene_sets = "GO-MF")
run_pathfindR() filters the gene sets by including only the terms containing at least 10 and at most 300 genes. To change the default behavior, you may change min_gset_size and max_gset_size:
## Including more terms for enrichment analysis
output_df <- run_pathfindR(input_df,
                           gene_sets = "GO-MF",
                           min_gset_size = 5,
                           max_gset_size = 500)
Note that increasing the number of terms for enrichment analysis may result in significantly longer run time.
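The effect of the size bounds can be sketched in base R on a toy list of gene sets. The helper filter_by_size and the toy sets are hypothetical. How pathfindR counts the genes of a term internally is not shown here and may differ.

```r
# Toy gene sets of different sizes (gene names are placeholders)
gene_sets_list <- list(
  set_small  = paste0("g", 1:4),
  set_medium = paste0("g", 1:25),
  set_large  = paste0("g", 1:400)
)

# Keep only the sets whose size lies within [min_size, max_size]
filter_by_size <- function(gsets, min_size = 10, max_size = 300) {
  sizes <- lengths(gsets)
  gsets[sizes >= min_size & sizes <= max_size]
}

kept_default <- names(filter_by_size(gene_sets_list))
kept_relaxed <- names(filter_by_size(gene_sets_list, min_size = 3, max_size = 500))

kept_default
kept_relaxed
```

Relaxing the bounds keeps more terms, which is why the run time can grow noticeably.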
If the user prefers to use another gene set source, the gene_sets argument should be set to "Custom" and the custom gene sets (list) and the custom gene set descriptions (named vector) should be supplied via the arguments custom_genes and custom_descriptions, respectively. See ?fetch_gene_set for more details and Analysis with Custom Gene Sets for a simple demonstration.
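A minimal sketch of the two objects described above is given below. The set IDs, gene symbols and descriptions are hypothetical, and it is an assumption of this sketch that the list names and the vector names are the matching term IDs. The actual call is left commented out because it requires pathfindR and an input data frame.

```r
# Custom gene sets: a list of character vectors, one element per term
# (assumption: list names are the term IDs)
custom_genes <- list(
  SET1 = c("GENE1", "GENE2", "GENE3"),
  SET2 = c("GENE2", "GENE4")
)

# Custom descriptions: a named vector, names matching the IDs above
custom_descriptions <- c(
  SET1 = "Hypothetical gene set one",
  SET2 = "Hypothetical gene set two"
)

# Check that both objects describe the same set of IDs
ids_match <- setequal(names(custom_genes), names(custom_descriptions))
ids_match

# Not run here (requires pathfindR and input_df):
# output_df <- run_pathfindR(input_df,
#                            gene_sets = "Custom",
#                            custom_genes = custom_genes,
#                            custom_descriptions = custom_descriptions)
```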
For details on obtaining organism-specific Gene Sets and PIN data, see the vignette Obtaining PIN and Gene Sets Data.
run_pathfindR() adjusts the enrichment p values via the “bonferroni” method and filters the enriched terms by adjusted p value < 0.05. To change the adjustment method and the threshold, set adj_method and enrichment_threshold, respectively:
output_df <- run_pathfindR(input_df,
                           adj_method = "fdr",
                           enrichment_threshold = 0.01)
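The difference between the two adjustment methods can be seen with base R's p.adjust() on a few made-up raw p values. This only illustrates the adjustment and thresholding step, under the assumption that adjusted values are compared against the threshold with a strict "less than".

```r
# Made-up raw enrichment p values for five terms
raw_p <- c(0.0004, 0.003, 0.012, 0.03, 0.2)

adj_bonf <- p.adjust(raw_p, method = "bonferroni")
adj_fdr  <- p.adjust(raw_p, method = "fdr")

threshold <- 0.05
comparison <- data.frame(
  raw_p,
  adj_bonf,
  adj_fdr,
  keep_bonf = adj_bonf < threshold,
  keep_fdr  = adj_fdr < threshold
)
comparison
```

Bonferroni is the more conservative of the two, so fewer terms pass the same threshold.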
For the active subnetwork search process, a protein-protein interaction network (PIN) is used. run_pathfindR() maps the input genes onto this PIN and identifies active subnetworks which are then used for enrichment analyses. To change the default PIN (“Biogrid”), use the pin_name_path argument:
output_df <- run_pathfindR(input_df, pin_name_path = "IntAct")
The pin_name_path argument can be one of “Biogrid”, “STRING”, “GeneMania”, “IntAct”, “KEGG”, “mmu_STRING” or it can be the path to a custom PIN file provided by the user.
# to use an external PIN of your choice
output_df <- run_pathfindR(input_df, pin_name_path = "/path/to/myPIN.sif")
NOTE: the PIN is also used for generating the background genes (in this case, all unique genes in the PIN) during hypergeometric-distribution-based tests in enrichment analyses. Therefore, a large PIN will generally result in better results.
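To show how the background size enters a hypergeometric test, the sketch below uses base R's phyper() with made-up counts. The helper enrichment_p and its argument names are hypothetical. It only illustrates that the same overlap gives a different p value under a different background, and is not pathfindR's own test code.

```r
# Upper-tail hypergeometric p value: probability of observing at least
# n_hits term genes among n_selected genes drawn from the background.
enrichment_p <- function(n_hits, n_selected, n_term, n_background) {
  phyper(n_hits - 1,              # P(X > n_hits - 1) = P(X >= n_hits)
         n_term,                  # term genes in the background
         n_background - n_term,   # non-term genes in the background
         n_selected,              # genes drawn (e.g. subnetwork genes)
         lower.tail = FALSE)
}

# Same overlap, two different background sizes (all counts are made up)
p_small_bg <- enrichment_p(n_hits = 8, n_selected = 50, n_term = 100, n_background = 5000)
p_large_bg <- enrichment_p(n_hits = 8, n_selected = 50, n_term = 100, n_background = 15000)

c(small_background = p_small_bg, large_background = p_large_bg)
```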
Currently, there are three algorithms implemented in pathfindR for active subnetwork search: Greedy Algorithm (default, based on Ideker et al. [2]), Simulated Annealing Algorithm (based on Ideker et al. [3]) and Genetic Algorithm (based on Ozisik et al. [4]). For a detailed discussion on which algorithm to use, see this wiki entry.
# for simulated annealing:
output_df <- run_pathfindR(input_df, search_method = "SA")
# for genetic algorithm:
output_df <- run_pathfindR(input_df, search_method = "GA")
Because the active subnetwork search algorithms are stochastic, run_pathfindR() may be set to iterate the active subnetwork identification and enrichment steps multiple times (by default 1 time). To change this number, set the iterations argument:
output_df <- run_pathfindR(input_df, iterations = 25)
run_pathfindR() uses a parallel loop (using the package foreach) for performing these iterations in parallel. By default, the number of processes to be used is determined automatically. To override, change n_processes:
# if not set, n_processes defaults to (number of detected cores - 1)
output_df <- run_pathfindR(input_df, iterations = 5, n_processes = 2)
run_pathfindR() returns a data frame of enriched terms. Columns are:
list_active_snw_genes, default is
change value > 0, if the change column was provided) in the input involved in the given term’s gene set, comma-separated. If the change column was not provided, all affected input genes are listed here.
change value < 0, if the change column was provided) in the input involved in the given term’s gene set, comma-separated
The first 2 rows of the output data frame of the example analysis on the rheumatoid arthritis gene-level differential expression input data (RA_input) are shown below:
|hsa05415||Diabetic cardiomyopathy||3.246357||10||0.0752907||0e+00||0e+00||NCF4, MMP9, NDUFA1, NDUFB3, UQCRQ, COX6A1, COX7A2, COX7C, GAPDH||ATP2A2, MTOR, PDHA1, PDHB, VDAC1, SLC25A5, PARP1|
|hsa04130||SNARE interactions in vesicular transport||4.599006||10||0.0118360||1e-07||1e-07||STX6||STX2, BET1L, SNAP23|
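Because the gene columns are comma-separated strings, it can be handy to split them back into vectors. The sketch below does this in base R for the two gene lists of the hsa04130 row shown above. The helper split_genes is hypothetical and not part of pathfindR.

```r
# Split a comma-separated gene string into a character vector,
# trimming whitespace and dropping empty entries
split_genes <- function(x) {
  genes <- trimws(strsplit(x, ",", fixed = TRUE)[[1]])
  genes[nzchar(genes)]
}

# Gene strings copied from the hsa04130 row above
genes_first  <- split_genes("STX6")
genes_second <- split_genes("STX2, BET1L, SNAP23")

# Number of input genes involved in the term
n_genes <- length(union(genes_first, genes_second))
n_genes
```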
run_pathfindR() also produces a graphical summary of enrichment results for the top 10 enriched terms, which can also be produced later by enrichment_chart(). You may also disable plotting this chart by setting plot_enrichment_chart=FALSE and later produce this plot via the function enrichment_chart():
# change number of top terms plotted (default = 10)
enrichment_chart(result_df = RA_output, top_terms = 15)
The function also creates an HTML report results.html that is saved in the output directory. This report contains links to two other HTML files:
This document contains the table of the active-subnetwork-oriented enrichment results (same as the returned data frame). By default, each enriched term description is linked to the visualization of the term, with the gene nodes colored according to their change values. If you choose not to create the visualization files, set visualize_enriched_terms = FALSE.
This document contains the table of converted gene symbols. Columns are:
During input processing, gene symbols that are not in the PIN are identified and excluded. For human genes, if aliases of these missing gene symbols are found in the PIN, these symbols are converted to the corresponding aliases (controlled by the argument convert2alias). This step is performed to best map the input data onto the PIN.
The document contains a second table of genes for which no interactions were identified after checking for alias symbols (so these could not be used during the analysis).
The wrapper function cluster_enriched_terms() can be used to perform clustering of enriched terms and partitioning the terms into biologically-relevant groups. Clustering can be performed via either the hierarchical or the fuzzy method, using the matrix of pairwise kappa statistics (a chance-corrected measure of co-occurrence between two sets of categorized data) between all enriched terms.
By default, cluster_enriched_terms() performs hierarchical clustering of the terms (using \(1 - \kappa\) as the distance metric). Iterating over \(2, 3, \ldots, n\) clusters (where \(n\) is the number of terms), cluster_enriched_terms() determines the optimal number of clusters by maximizing the average silhouette width, partitions the data into this optimal number of clusters and returns a data frame with cluster assignments.
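The procedure described above can be sketched in base R on a toy kappa matrix. Everything here is an assumption made for illustration: the kappa values are invented, average linkage is chosen arbitrarily, and avg_silhouette is a hypothetical helper written out by hand so the sketch needs no extra packages. It is not the implementation used by cluster_enriched_terms().

```r
# Average silhouette width for a clustering `cl` given distances `d`
avg_silhouette <- function(d, cl) {
  d <- as.matrix(d)
  widths <- numeric(nrow(d))
  for (i in seq_len(nrow(d))) {
    same <- cl == cl[i]
    same[i] <- FALSE
    if (!any(same)) next                      # singleton cluster: width 0
    a <- mean(d[i, same])                     # mean distance within own cluster
    others <- setdiff(unique(cl), cl[i])
    b <- min(vapply(others, function(k) mean(d[i, cl == k]), numeric(1)))
    denom <- max(a, b)
    widths[i] <- if (denom > 0) (b - a) / denom else 0
  }
  mean(widths)
}

# Toy kappa matrix with two groups of related terms (values are invented)
terms <- paste0("term", 1:6)
kappa_mat <- matrix(0.05, 6, 6, dimnames = list(terms, terms))
kappa_mat[1:3, 1:3] <- 0.8
kappa_mat[4:6, 4:6] <- 0.7
diag(kappa_mat) <- 1

# Hierarchical clustering with 1 - kappa as the distance
dist_mat <- as.dist(1 - kappa_mat)
hc <- hclust(dist_mat, method = "average")

# Try 2, 3, ..., n clusters and keep the k with the largest average silhouette
k_candidates <- 2:length(terms)
sil <- vapply(k_candidates,
              function(k) avg_silhouette(dist_mat, cutree(hc, k = k)),
              numeric(1))
best_k <- k_candidates[which.max(sil)]
clusters <- cutree(hc, k = best_k)

best_k
clusters
```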
RA_clustered <- cluster_enriched_terms(RA_output, plot_dend = FALSE, plot_clusters_graph = FALSE)
## First 2 rows of clustered data frame
knitr::kable(head(RA_clustered, 2))
|1||hsa05415||Diabetic cardiomyopathy||3.246357||10||0.0752907||0e+00||0e+00||NCF4, MMP9, NDUFA1, NDUFB3, UQCRQ, COX6A1, COX7A2, COX7C, GAPDH||ATP2A2, MTOR, PDHA1, PDHB, VDAC1, SLC25A5, PARP1||1||Representative|
|8||hsa00190||Oxidative phosphorylation||2.943760||10||0.0177005||3e-07||3e-07||NDUFA1, NDUFB3, UQCRQ, COX6A1, COX7A2, COX7C, ATP6V1D, ATP6V0E1||ATP6V0E2||1||Member|
## The representative terms
knitr::kable(RA_clustered[RA_clustered$Status == "Representative", ])
|1||hsa05415||Diabetic cardiomyopathy||3.2463571||10||0.0752907||0.0000000||0.0000000||NCF4, MMP9, NDUFA1, NDUFB3, UQCRQ, COX6A1, COX7A2, COX7C, GAPDH||ATP2A2, MTOR, PDHA1, PDHB, VDAC1, SLC25A5, PARP1||1||Representative|
|2||hsa04130||SNARE interactions in vesicular transport||4.5990059||10||0.0118360||0.0000001||0.0000001||STX6||STX2, BET1L, SNAP23||2||Representative|
|3||hsa03410||Base excision repair||5.7487574||1||0.0053191||0.0000001||0.0000001||POLE4||MUTYH, APEX2, POLD2, PARP1||3||Representative|
|4||hsa04064||NF-kappa B signaling pathway||2.9185999||10||0.0502592||0.0000001||0.0000001||LY96||PRKCQ, CARD11, TICAM1, IKBKB, PARP1, UBE2I, CSNK2A2||4||Representative|
|5||hsa03040||Spliceosome||3.6887860||10||0.0477457||0.0000002||0.0000446||SF3B6, LSM3, BUD31||SNRPB, SF3B2, U2AF2, PUF60, SNU13, DDX23, EIF4A3, HNRNPA1, PCBP1, SRSF8, SRSF5||5||Representative|
|6||hsa04714||Thermogenesis||2.5174653||10||0.0441982||0.0000003||0.0000003||NDUFA1, NDUFB3, UQCRQ, COX6A1, COX7A2, COX7C||ADCY7, CREB1, KDM1A, SMARCA4, ACTG1, ACTB, ARID1A, MTOR||6||Representative|
|7||hsa03013||RNA transport||2.6017234||10||0.0184111||0.0000003||0.0000419||NUP214||NUP62, NUP93, RANGAP1, UBE2I, SUMO3, GEMIN4, EIF2S3, EIF2B1, EIF4A3, RNPS1, SRRM1||7||Representative|
|15||hsa04722||Neurotrophin signaling pathway||2.8938660||10||0.0414102||0.0000088||0.0000151||SH2B3, CRKL, FASLG, CALM3, CALM1, ABL1, MAGED1, IRAK2, IKBKB||8||Representative|
|17||hsa04630||JAK-STAT signaling pathway||1.4500051||10||0.0115943||0.0000171||0.0000171||IL2RB, IL10RA, IL27RA, JAK1, PIAS3, MTOR||9||Representative|
|23||hsa05203||Viral carcinogenesis||1.8396024||10||0.0346832||0.0000446||0.0000446||GTF2B||CREB1, JAK1, SCRIB, RBL2, HDAC1, DNAJA3, SRF||10||Representative|
|33||hsa04110||Cell cycle||1.5299112||10||0.0118003||0.0002246||0.0004249||RBL2, ABL1, HDAC1, CDKN1C, ANAPC1||11||Representative|
|41||hsa03010||Ribosome||1.7473197||9||0.0058480||0.0007058||0.0459772||MRPS18C, RPS24, MRPL33, RPL26, RPL31, RPL39||RPLP2||12||Representative|
|42||hsa00020||Citrate cycle (TCA cycle)||3.7941799||10||0.0179106||0.0007264||0.0009951||MDH2, PDHA1, PDHB||13||Representative|
|48||hsa00630||Glyoxylate and dicarboxylate metabolism||2.6166758||2||0.0060061||0.0011912||0.0015442||MDH2, SHMT1||14||Representative|
|64||hsa05202||Transcriptional misregulation in cancer||2.2188187||10||0.0058310||0.0033621||0.0033621||MMP9, DDIT3||HDAC1, SIN3A, BCL11B, SLC45A3, EWSR1, IL2RB, TAF15, ASPSCR1||16||Representative|
|96||hsa04120||Ubiquitin mediated proteolysis||1.6616846||10||0.0058310||0.0214469||0.0414098||TRIP12||UBE2G1, UBE2I, HERC1, PIAS3, ANAPC1||17||Representative|
|101||hsa00340||Histidine metabolism||5.4202570||10||0.0058310||0.0335915||0.0335915||HNMT||ALDH9A1, CNDP2||18||Representative|
|108||hsa04350||TGF-beta signaling pathway||0.8526247||9||0.0058140||0.0404554||0.0458347||SMAD7, TGIF2||19||Representative|
After clustering, you may again plot the summary enrichment chart and display the enriched terms by clusters:
# plotting only selected clusters for better visualization
RA_selected <- subset(RA_clustered, Cluster %in% 5:7)
enrichment_chart(RA_selected, plot_by_cluster = TRUE)
For details, see
term_gene_heatmap() can be used to visualize the heatmap of genes that are involved in the enriched terms. This heatmap allows visual identification of the input genes involved in the enriched terms, as well as the common or distinct genes between different terms. If the input data frame (same as in run_pathfindR()) is supplied, the tile colors indicate the change values.
term_gene_heatmap(result_df = RA_output, genes_df = RA_input)
See the vignette Visualization of pathfindR Enrichment Results for more details.
The visualization function term_gene_graph() (adapted from the “Gene-Concept network visualization” by the R package enrichplot) can be utilized to visualize which genes are involved in the enriched terms. The function creates a term-gene graph which shows the connections between genes and biological terms (enriched pathways or gene sets). This allows for the investigation of multiple terms to which significant genes are related. This graph also enables visual determination of the degree of overlap between the enriched terms by identifying shared and/or distinct significant genes.
term_gene_graph(result_df = RA_output, use_description = TRUE)
See the vignette Visualization of pathfindR Enrichment Results for more details.
UpSet plots are plots of the intersections of sets as a matrix. UpSet_plot() creates a ggplot object of an UpSet plot where the x-axis is the UpSet plot of intersections of enriched terms. By default (i.e., method = "heatmap"), the main plot is a heatmap of genes at the corresponding intersections, colored by up/down regulation (if genes_df is provided, colored by change values). If method = "barplot", the main plot is bar plots of the number of genes at the corresponding intersections. Finally, if method = "boxplot" and genes_df is provided, then the main plot displays the boxplots of change values of the genes at the corresponding intersections.
UpSet_plot(result_df = RA_output, genes_df = RA_input)
See the vignette Visualization of pathfindR Enrichment Results for more details.
score_terms() can be used to calculate the agglomerated z score of each enriched term per sample. This allows the user to individually examine the scores and infer how a term is overall altered (activated or repressed) in a given sample or a group of samples.
## Vector of "Case" IDs
cases <- c("GSM389703", "GSM389704", "GSM389706", "GSM389708",
           "GSM389711", "GSM389714", "GSM389716", "GSM389717",
           "GSM389719", "GSM389721", "GSM389722", "GSM389724",
           "GSM389726", "GSM389727", "GSM389730", "GSM389731",
           "GSM389733", "GSM389735")

## Calculate scores for representative terms
## and plot heat map using term descriptions
score_matrix <- score_terms(enrichment_table = RA_clustered[RA_clustered$Status == "Representative", ],
                            exp_mat = RA_exp_mat,
                            cases = cases,
                            use_description = TRUE, # default FALSE
                            label_samples = FALSE, # default = TRUE
                            case_title = "RA", # default = "Case"
                            control_title = "Healthy", # default = "Control"
                            low = "#f7797d", # default = "green"
                            mid = "#fffde4", # default = "black"
                            high = "#1f4037") # default = "red"
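One way to picture a per-sample term score is sketched below in base R. It assumes that each gene is z-scored across samples and that a term's score in a sample is the mean z score of the term's genes. This aggregation rule is an assumption of the sketch, and score_terms() may compute its scores differently. The expression matrix, gene names and term definitions are all made up.

```r
set.seed(1)

# Toy expression matrix: 4 genes (rows) x 6 samples (columns), random values
exp_mat <- matrix(rnorm(4 * 6, mean = 8), nrow = 4,
                  dimnames = list(paste0("GENE", 1:4), paste0("S", 1:6)))

# Hypothetical terms; GENE9 is deliberately absent from the matrix
term_genes <- list(
  termA = c("GENE1", "GENE2"),
  termB = c("GENE3", "GENE4", "GENE9")
)

# z-score each gene across samples (scale() works column-wise, hence t())
z_mat <- t(scale(t(exp_mat)))

# Score of one term per sample: mean z score over the term's available genes
score_one_term <- function(genes) {
  genes <- intersect(genes, rownames(z_mat))
  colMeans(z_mat[genes, , drop = FALSE])
}

# Terms in rows, samples in columns
score_mat <- t(vapply(term_genes, score_one_term, numeric(ncol(z_mat))))
round(score_mat, 2)
```

Positive values suggest the term's genes are, on average, above their cross-sample mean in that sample, and negative values the opposite.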