An Introduction to HaploCatcher

By: Zachary J. Winn

Date: 03/28/2023

Rational for HaploCatcher

The HaploCatcher package is based on the work done in Winn et al (2022) which demonstrated that relatively simple machine learning models could be used to generate haplotype information for genome-wide genotyped lines. The idea behind this approach was to leverage the historical PCR-based marker information in programs to identify haplotypes of interest in early generation material which has been sequenced for genomic selection. This approach offers several distinct advantages for breeding programs:

  1. Lines that are genotyped for genomic selection in early generations, which would not have PCR-based markers run on them, can receive haplotype information that they would otherwise not have.

  2. Users can apply this information for selection of desirable haplotypes in early generations.

  3. Haplotype information could become available earlier in the breeding process.

  4. Haplotypes could be predicted for historical lines for use in estimation of locus effects across time.

The development team of HaploCatcher made a package available for individuals to apply this method in their programs with as little intervention as possible. In this demonstration, we will discuss the potential uses of this package and the file formats required to run the functions in the HaploCatcher package.

Datasets Included in HaploCatcher

To use the HaploCatcher package, the data must be structured properly. There are three necessary data frames that must be available to use this package. Let’s take a look at them.

Data Structure - Gene Compendium

First, lets look at the head of the “gene_comp” file.

Gene Compendium Input File
Trait Chromosome Gene Nursery Line FullSampleName Call
sawfly_resistance_solid_stem 3B sst1_solid_stem SRPN Geno_190 Geno_190 non_sst1_solid_stem
sawfly_resistance_solid_stem 3B sst1_solid_stem SRPN Geno_100 Geno_100 non_sst1_solid_stem
sawfly_resistance_solid_stem 3B sst1_solid_stem RGON Geno_49 Geno_49 non_sst1_solid_stem
sawfly_resistance_solid_stem 3B sst1_solid_stem CSU Geno_71 Geno_71 non_sst1_solid_stem
sawfly_resistance_solid_stem 3B sst1_solid_stem CSU Geno_123 Geno_123 non_sst1_solid_stem
sawfly_resistance_solid_stem 3B sst1_solid_stem CSU Geno_88 Geno_88 non_sst1_solid_stem

The gene_comp file is a categorization file made by the user for their specific program. It contains seven columns which are as follows:

For simplicity in the package, calls for gene haplotypes should be formatted like this:

Haplotype Calls
Call
non_sst1_solid_stem
sst1_solid_stem
het_sst1_solid_stem

The heterozygous case has the string “het_” in front of it like “het_sst1_solid_stem” and the negative case has the string “non_” in front of it like “non_sst1_solid_stem”.

If the user wants to only provide the positive and negative cases, without the heterozygous case, the format would be the same without “het_” category. To use the package properly, users must have at least four columns in the data present: FullSampleName, Chromosome, Gene, and Call. Other data may be present. These columns must both be properly named and case correct in the data.

For example, if the user was trying to predict the Pm34 locus in wheat with heterozygous cases, the data frame for the gene_comp would look like this:

Pm34 Example Input File
FullSampleName Chromosome Gene Call
Geno_1 5D pm34 pm34
Geno_2 5D pm34 het_pm34
Geno_3 5D pm34 non_pm34

Data Structure - Genotypic Matrix

The structure of the genotypic matrix is a full-rank (no missing data) numeric matrix of markers.

Marker Matrix Input File
S3B_496866 S3B_496906 S3B_509279 S3B_619176 S3B_619185
Geno_1 1 0 0 0 0
Geno_2 0 0 0 0 0
Geno_3 0 0 0 0 0
Geno_4 0 0 0 0 0
Geno_5 0 0 0 0 2

Where the columns of the matrix are markers, the rows of the matrix are genotypes (FullSampleName), and the the marker information is coded numerically. The marker matrix provided in the package is coded as 0, 1,and 2; where 0 is the major allele, 1 is the heterozygous haplotype, and 2 is the minor allele. The coding of the matrix must always be numeric, but it can be in any numeric format as long as the numbers are real integers.

Note: the column names must match information provided in the marker info file (see next section) and the row names must match the “FullSampleName” provided in the gene_comp!

Data Structure - Marker Information

The marker_info file contains information on all the markers found in the columns of the geno_mat file.

Marker Info Input File
Marker Chromosome BP_Position
S3B_496866 3B 496866
S3B_496906 3B 496906
S3B_509279 3B 509279
S3B_619176 3B 619176
S3B_619185 3B 619185
S3B_627259 3B 627259

The marker_info file has three required columns:

Below is an example where a linkage group and centimorgan position is provided.

Marker Info Input File Example
Marker Chromosome BP_Position
TRACE_00102 1 0.0
TRACE_13112 1 0.5
TRACE_43821 1 1.2

Note: even though these are not chromosomes or base pair positions, you must leave the names of the columns titled as such! Furthermore, the columns for the marker_info and geno_comp are case sensitive

Pipeline Explanation

The total pipeline for the HaploCatcher package can be understood using the provided diagram (Figure 1).


Figure 1. A diagram of [A] input data structure and [B] the “auto_locus” function pipeline. Panel [A] shows a total data set that is partitioned into a training and test population. The training population in panel [A] shows a population of individuals, that is suggested to be comprised of more than 750 individuals, which have both genome wide marker and historical haplotype data. The testing population in panel [A] shows a testing population, which may be any size greater than zero, which only has genome wide marker data. Panel [B] shows the workflow of the “auto_locus” function. In the cross-validation step [I], the total training population is split in a user defined way (default is 80:20 split) and the 80% tuning population is used to train and select optimal hyper-parameters for a k-nearest neighbors (KNN) and random forest (RF) model. The trained models are then used to predict the haplotype of the validation population. The predicted haplotype is then compared to the ‘true’ haplotype and kappa (and accuracy) are calculated. This is repeated a user set number of times (default is 30). The best performing model based on accuracy or kappa (default is kappa) is then taken as the model to be used in forward prediction. There are two options post cross-validation: [IIA] a single model with a set seed for repeatability or [IIB] a user set number of random models (default is 30) used to create a consensus haplotype prediction.


The HaploCatcher package contains several different functions which are all placed into a pipeline for ease of access. The whole process hinges on having the “gene_comp” file, which has historical information about a locus of interest and a “geno_mat” file that contains all those individuals in the “geno_comp” file and individuals for which a predicted haplotype is desired (Figure 1A).

The “auto_locus” function (Figure 1B) has two distinct steps: (I) Selecting either a random forest or k-nearest neighbors model by cross-validation and (II) using the cross-validation results to select the desired model, training off the total available data, and either (IIA) set a random seed and predict haplotypes once or (IIB) set no random seed and predict many times to get a consensus vote by majority rule (Figure 1B).

Using the “auto_locus” Function

Now that we understand the format of the data and the pipeline process, it is time to use the main function “auto_locus”.

Read in Data from HaploCatcher Package

First, read in the data…

#library
library(HaploCatcher)

#read data from package
data("geno_mat")
data("gene_comp")
data("marker_info")

In this example, only individuals who are in the gene_comp are found in the geno_mat file provided in the package, so we will have to randomly partition the data so that we can show how the “auto_locus” function could be used for forward prediction of lines with only genome-wide marker data!

#set seed (for reproducible results)
set.seed(022294)

#randomly partition the training data and test from total
training_genotypes=sample(rownames(geno_mat), size = round(nrow(geno_mat)*0.8, 0))
testing_genotypes=rownames(geno_mat)[!rownames(geno_mat) %in% training_genotypes]

#nullify the seed we set so we don't mess with cross validation
set.seed(NULL)

In this example, we assume that the individuals in the “testing_genotypes” vector DO NOT have haplotype calls for Sst1 and we are trying to produce a predictive haplotype for these individuals.

Run “auto_locus” models

Now that we have defined our testing and training data sets, we can use these inputs for the auto_locus function. The auto_locus function has eight user defined inputs:

The auto_locus function has many user defined setting that are optional, such as “include_hets” which either allows heterozygous lines to be predicted (TRUE) or not (FALSE). The default is FALSE.

Note: if your data has only positive and negative cases without heterozygous calls, you should set “include_hets” to FALSE, otherwise you will receive an error!

Furthermore, there are options for sequential modeling (parallel=FALSE) or parallel modeling (parallel=TRUE), as well as methods to suppress output of text (verbose=FALSE).

The example data set for the HaploCatcher package contains heterozygous calls, so we will perform models using the “include_hets” argument set to FALSE. We will also set n_perms=5 for expediency of models, even though the minimum should be 30.

#run without heterozygous individuals sequentially
results1<-auto_locus(geno_mat = geno_mat,
                     gene_file = gene_comp,
                     gene_name = "sst1_solid_stem",
                     marker_info = marker_info,
                     chromosome = "3B",
                     training_genotypes = training_genotypes,
                     testing_genotypes = testing_genotypes,
                     set_seed = 022294,
                     n_perms = 10,
                     verbose = FALSE)
#> Loading required package: lattice

Identify Location of Predictions in Results Object

Each result obtained puts out a list-of-list objects which contain several portions which can be found in the documentation for the “auto_locus” function. The predictions made for the training population can be found in the results$predictions pathway.

#show results for with and without hets
knitr::kable(results1$predictions[1:15,], 
             align = "c", 
             caption = "Heterozygous Excluded Results")
Heterozygous Excluded Results
FullSampleName Prediction_RF
Geno_2 sst1_solid_stem
Geno_6 sst1_solid_stem
Geno_9 sst1_solid_stem
Geno_13 sst1_solid_stem
Geno_14 sst1_solid_stem
Geno_15 sst1_solid_stem
Geno_16 sst1_solid_stem
Geno_19 sst1_solid_stem
Geno_41 sst1_solid_stem
Geno_45 sst1_solid_stem
Geno_76 sst1_solid_stem
Geno_77 sst1_solid_stem
Geno_79 sst1_solid_stem
Geno_86 non_sst1_solid_stem
Geno_89 sst1_solid_stem

Prediction by Voting

For users who do not want to run a single final model with a set seed, there is the “predict_by_vote” option, which runs a series of random models to get a majority rule vote. Here we set the number of votes to 10 with the “n_votes” argument and the number of cross-validation permutations to 10 with the “n_perms” argument. We also silence the output using the “verbose=FALSE” argument.

Note that the predictions, when made through voting, may be found in the results$consensus_predictions pathway. This data frame shows the FullSampleName, a haplotype call titled “Consensus_Call” made by majority rule, and the different potential calls with a number of observations over all permutations of the model.

Visualizing Cross-Validation Results

If users which to visualize the cross-validation results outside of the model, they can use the “plot_locus_perm_cv” function by referencing the list-of-list in the results$cross_validation_results.

#plot out the cross validation results
plot_locus_perm_cv(results1$cross_validation_results)

Looking at Model Performance Parameters

Each results object has cross validation parameters to check the predictive ability of the data set. To see this, users can access summaries in the list-of-list results object. These parameters may be useful for publications or model selection.

#look at overall and by_class performance
knitr::kable(results1$cross_validation_results$Overall_Summary[,1:5], 
             align = "c", 
             caption = "Subsection of Overall Summary for 'results1' Object")
Subsection of Overall Summary for ‘results1’ Object
Model Mean_Accuracy Min_Accuracy Max_Accuracy SD_Accuracy
K-Nearest Neighbors 0.9750000 0.9591837 0.9897959 0.0121334
Random Forest 0.9755102 0.9591837 0.9846939 0.0089346
knitr::kable(results1$cross_validation_results$By_Class_Summary[,1:5],
             align = "c", 
             caption = "Subsection of By-Class Summary for 'results1' Object")
Subsection of By-Class Summary for ‘results1’ Object
Model Mean_Sensitivity Min_Sensitivity Max_Sensitivity SD_Sensitivity
K-Nearest Neighbors 0.9728407 0.9453125 1 0.0181457
Random Forest 0.9745744 0.9393939 1 0.0206555

Conclusion

In this tutorial, we have demonstrated how to use the HaploCatcher package. We hope that this tutorial has been informative for potential users and that it has fueled interest in the use of our package. We appreciate your interest in our work and thank you for taking the time to learn about HaploCatcher. If you have any further questions or comments, please do not hesitate to contact the package maintainer listed in the package description.

References