Datasets Included in HaploCatcher

To use the HaploCatcher package, the data must be structured properly. There are three necessary data frames that must be available to use this package. Let’s take a look at them.

Data Structure - Gene Compendium

First, lets look at the head of the “gene_comp” file.

Gene Compendium Input File
Trait	Chromosome	Gene	Nursery	Line	FullSampleName	Call
sawfly_resistance_solid_stem	3B	sst1_solid_stem	SRPN	Geno_190	Geno_190	non_sst1_solid_stem
sawfly_resistance_solid_stem	3B	sst1_solid_stem	SRPN	Geno_100	Geno_100	non_sst1_solid_stem
sawfly_resistance_solid_stem	3B	sst1_solid_stem	RGON	Geno_49	Geno_49	non_sst1_solid_stem
sawfly_resistance_solid_stem	3B	sst1_solid_stem	CSU	Geno_71	Geno_71	non_sst1_solid_stem
sawfly_resistance_solid_stem	3B	sst1_solid_stem	CSU	Geno_123	Geno_123	non_sst1_solid_stem
sawfly_resistance_solid_stem	3B	sst1_solid_stem	CSU	Geno_88	Geno_88	non_sst1_solid_stem

The gene_comp file is a categorization file made by the user for their specific program. It contains seven columns which are as follows:

Trait: Defines what trait the locus affects. In this case, we are looking at Sst1 which affects resistance to sawfly in wheat.
Chromosome: Defines what chromosome the gene is located on. This will be specific to a marker_information file later on.
Gene: Defines the name of the gene.
Nursery: Defines what nursery/program the information was derived from.
Line: Defines the “line designation” given to a genotype. The line designation is denoted by the breeder.
FullSampleName: The name of the line which matches with the row names in the marker matrix.
Call: The haplotype for that line for that gene. This call column is very important for the functioning of the HaploCatcher package. This package only predicts biallelic loci. The locus can be only homozygous cases or include a heterozygous case.

For simplicity in the package, calls for gene haplotypes should be formatted like this:

Haplotype Calls
Call
non_sst1_solid_stem
sst1_solid_stem
het_sst1_solid_stem

The heterozygous case has the string “het_” in front of it like “het_sst1_solid_stem” and the negative case has the string “non_” in front of it like “non_sst1_solid_stem”.

If the user wants to only provide the positive and negative cases, without the heterozygous case, the format would be the same without “het_” category. To use the package properly, users must have at least four columns in the data present: FullSampleName, Chromosome, Gene, and Call. Other data may be present. These columns must both be properly named and case correct in the data.

For example, if the user was trying to predict the Pm34 locus in wheat with heterozygous cases, the data frame for the gene_comp would look like this:

Pm34 Example Input File
FullSampleName	Chromosome	Gene	Call
Geno_1	5D	pm34	pm34
Geno_2	5D	pm34	het_pm34
Geno_3	5D	pm34	non_pm34

Data Structure - Genotypic Matrix

The structure of the genotypic matrix is a full-rank (no missing data) numeric matrix of markers.

Marker Matrix Input File
	S3B_496866	S3B_619185
Geno_1	1	0
Geno_2	0	0
Geno_3	0	0
Geno_4	0	0
Geno_5	0	2

Where the columns of the matrix are markers, the rows of the matrix are genotypes (FullSampleName), and the the marker information is coded numerically. The marker matrix provided in the package is coded as 0, 1,and 2; where 0 is the major allele, 1 is the heterozygous haplotype, and 2 is the minor allele. The coding of the matrix must always be numeric, but it can be in any numeric format as long as the numbers are real integers.

Note: the column names must match information provided in the marker info file (see next section) and the row names must match the “FullSampleName” provided in the gene_comp!

Data Structure - Marker Information

The marker_info file contains information on all the markers found in the columns of the geno_mat file.

Marker Info Input File
Marker	Chromosome	BP_Position
S3B_496866	3B	496866
S3B_496906	3B	496906
S3B_509279	3B	509279
S3B_619176	3B	619176
S3B_619185	3B	619185
S3B_627259	3B	627259

The marker_info file has three required columns:

Marker: The name of the marker which is found in the column names of the marker matrix
Chromosome: The name of the chromosome (or linkage group) that the marker belongs to
BP_Position: The base pair position of the marker. This can be replaced by a numeric dummy variable or a centimorgan position!

Below is an example where a linkage group and centimorgan position is provided.

Marker Info Input File Example
Marker	Chromosome	BP_Position
TRACE_00102	1	0.0
TRACE_13112	1	0.5
TRACE_43821	1	1.2

Note: even though these are not chromosomes or base pair positions, you must leave the names of the columns titled as such! Furthermore, the columns for the marker_info and geno_comp are case sensitive

Pipeline Explanation

The total pipeline for the HaploCatcher package can be understood using the provided diagram (Figure 1).

Figure 1. A diagram of [A] input data structure and [B] the “auto_locus” function pipeline. Panel [A] shows a total data set that is partitioned into a training and test population. The training population in panel [A] shows a population of individuals, that is suggested to be comprised of more than 750 individuals, which have both genome wide marker and historical haplotype data. The testing population in panel [A] shows a testing population, which may be any size greater than zero, which only has genome wide marker data. Panel [B] shows the workflow of the “auto_locus” function. In the cross-validation step [I], the total training population is split in a user defined way (default is 80:20 split) and the 80% tuning population is used to train and select optimal hyper-parameters for a k-nearest neighbors (KNN) and random forest (RF) model. The trained models are then used to predict the haplotype of the validation population. The predicted haplotype is then compared to the ‘true’ haplotype and kappa (and accuracy) are calculated. This is repeated a user set number of times (default is 30). The best performing model based on accuracy or kappa (default is kappa) is then taken as the model to be used in forward prediction. There are two options post cross-validation: [IIA] a single model with a set seed for repeatability or [IIB] a user set number of random models (default is 30) used to create a consensus haplotype prediction.

The HaploCatcher package contains several different functions which are all placed into a pipeline for ease of access. The whole process hinges on having the “gene_comp” file, which has historical information about a locus of interest and a “geno_mat” file that contains all those individuals in the “geno_comp” file and individuals for which a predicted haplotype is desired (Figure 1A).

The “auto_locus” function (Figure 1B) has two distinct steps: (I) Selecting either a random forest or k-nearest neighbors model by cross-validation and (II) using the cross-validation results to select the desired model, training off the total available data, and either (IIA) set a random seed and predict haplotypes once or (IIB) set no random seed and predict many times to get a consensus vote by majority rule (Figure 1B).

Using the “auto_locus” Function

Now that we understand the format of the data and the pipeline process, it is time to use the main function “auto_locus”.

Read in Data from HaploCatcher Package

First, read in the data…

#library
library(HaploCatcher)

#read data from package
data("geno_mat")
data("gene_comp")
data("marker_info")

In this example, only individuals who are in the gene_comp are found in the geno_mat file provided in the package, so we will have to randomly partition the data so that we can show how the “auto_locus” function could be used for forward prediction of lines with only genome-wide marker data!

#set seed (for reproducible results)
set.seed(022294)

#randomly partition the training data and test from total
training_genotypes=sample(rownames(geno_mat), size = round(nrow(geno_mat)*0.8, 0))
testing_genotypes=rownames(geno_mat)[!rownames(geno_mat) %in% training_genotypes]

#nullify the seed we set so we don't mess with cross validation
set.seed(NULL)

In this example, we assume that the individuals in the “testing_genotypes” vector DO NOT have haplotype calls for Sst1 and we are trying to produce a predictive haplotype for these individuals.

Run “auto_locus” models

Now that we have defined our testing and training data sets, we can use these inputs for the auto_locus function. The auto_locus function has eight user defined inputs:

geno_mat: The genotypic matrix which contains marker information
gene_file: The gene compendium file that contains information about the Sst1 locus
gene_name: A character string which defines the name of the gene to be predicted found in the “gene_file” input
marker_info: The marker information file that contains all the chromosome and position information about the markers in the “geno_mat” file
chromosome: The chromosome that the gene resides upon
training_genotypes: A vector of FullSampleNames from the geno_mat row names that defines which individuals are to be used for training of the KNN or RF models
testing_genotypes: A vector of FullSampleNames from the geno_mat row names which are to be predicted
set_seed: A number used to set the seed of the function. This option must be defined if the “predict_by_vote” function is set to FALSE, which is the default setting.

The auto_locus function has many user defined setting that are optional, such as “include_hets” which either allows heterozygous lines to be predicted (TRUE) or not (FALSE). The default is FALSE.

Note: if your data has only positive and negative cases without heterozygous calls, you should set “include_hets” to FALSE, otherwise you will receive an error!

Furthermore, there are options for sequential modeling (parallel=FALSE) or parallel modeling (parallel=TRUE), as well as methods to suppress output of text (verbose=FALSE).

The example data set for the HaploCatcher package contains heterozygous calls, so we will perform models using the “include_hets” argument set to FALSE. We will also set n_perms=5 for expediency of models, even though the minimum should be 30.

#run without heterozygous individuals sequentially
results1<-auto_locus(geno_mat = geno_mat,
                     gene_file = gene_comp,
                     gene_name = "sst1_solid_stem",
                     marker_info = marker_info,
                     chromosome = "3B",
                     training_genotypes = training_genotypes,
                     testing_genotypes = testing_genotypes,
                     set_seed = 022294,
                     n_perms = 10,
                     verbose = FALSE)
#> Loading required package: lattice

Identify Location of Predictions in Results Object

Each result obtained puts out a list-of-list objects which contain several portions which can be found in the documentation for the “auto_locus” function. The predictions made for the training population can be found in the results$predictions pathway.

#show results for with and without hets
knitr::kable(results1$predictions[1:15,], 
             align = "c", 
             caption = "Heterozygous Excluded Results")

Heterozygous Excluded Results
FullSampleName	Prediction_RF
Geno_2	sst1_solid_stem
Geno_6	sst1_solid_stem
Geno_9	sst1_solid_stem
Geno_13	sst1_solid_stem
Geno_14	sst1_solid_stem
Geno_15	sst1_solid_stem
Geno_16	sst1_solid_stem
Geno_19	sst1_solid_stem
Geno_41	sst1_solid_stem
Geno_45	sst1_solid_stem
Geno_76	sst1_solid_stem
Geno_77	sst1_solid_stem
Geno_79	sst1_solid_stem
Geno_86	non_sst1_solid_stem
Geno_89	sst1_solid_stem

Prediction by Voting

For users who do not want to run a single final model with a set seed, there is the “predict_by_vote” option, which runs a series of random models to get a majority rule vote. Here we set the number of votes to 10 with the “n_votes” argument and the number of cross-validation permutations to 10 with the “n_perms” argument. We also silence the output using the “verbose=FALSE” argument.

Note that the predictions, when made through voting, may be found in the results$consensus_predictions pathway. This data frame shows the FullSampleName, a haplotype call titled “Consensus_Call” made by majority rule, and the different potential calls with a number of observations over all permutations of the model.

Visualizing Cross-Validation Results

If users which to visualize the cross-validation results outside of the model, they can use the “plot_locus_perm_cv” function by referencing the list-of-list in the results$cross_validation_results.

#plot out the cross validation results
plot_locus_perm_cv(results1$cross_validation_results)

Looking at Model Performance Parameters

Each results object has cross validation parameters to check the predictive ability of the data set. To see this, users can access summaries in the list-of-list results object. These parameters may be useful for publications or model selection.

#look at overall and by_class performance
knitr::kable(results1$cross_validation_results$Overall_Summary[,1:5], 
             align = "c", 
             caption = "Subsection of Overall Summary for 'results1' Object")

Subsection of Overall Summary for ‘results1’ Object
Model	Mean_Accuracy	Min_Accuracy	Max_Accuracy	SD_Accuracy
K-Nearest Neighbors	0.9750000	0.9591837	0.9897959	0.0121334
Random Forest	0.9755102	0.9591837	0.9846939	0.0089346

knitr::kable(results1$cross_validation_results$By_Class_Summary[,1:5],
             align = "c", 
             caption = "Subsection of By-Class Summary for 'results1' Object")

Subsection of By-Class Summary for ‘results1’ Object
Model	Mean_Sensitivity	Min_Sensitivity	Max_Sensitivity	SD_Sensitivity
K-Nearest Neighbors	0.9728407	0.9453125	1	0.0181457
Random Forest	0.9745744	0.9393939	1	0.0206555

An Introduction to HaploCatcher

By: Zachary J. Winn

Date: 03/28/2023

Rational for HaploCatcher