whoa Tutorial

Eric C. Anderson

2021-08-09

This is a small, lightweight package that lets users investigate the distribution of genotypes in genotype-by-sequencing (GBS) data where they expect (by and large) Hardy-Weinberg equilibrium, in order to assess rates of genotyping errors and the dependence of those rates on read depth.

The name comes from the bolded letters in this sentence:

Where’s my Heterozygotes at? Observations on genotyping Accuracy.

It also fits well with my reaction when I started investigating heterozygote miscall rates (rates at which true heterozygotes are incorrectly called as homozygotes) in some restriction associated digest (RAD) sequencing data sets—My eyes bugged out and I said, “Whoa!”

The package comes with a small bit of data from lobster to play with. The rest of this document shows a quick run through a few of the functions to do an analysis of a data set.

A quick run

Packages

# load up the package:
library(whoa)

Lobster data

Read about the lobster data here. Execute this if you want:

help("lobster_buz_2000")

The main thing to know is that it is a ‘vcfR’ object. You can make such an object yourself by reading in a variant call format (VCF) file using vcfR::read.vcfR().

Make a quick genotype frequency scatter plot

# first get compute expected and observed genotype frequencies
gfreqs <- exp_and_obs_geno_freqs(lobster_buz_2000)

# then plot those.  Set max_plot_loci so that all 2000
# loci will be plotted
geno_freqs_scatter(gfreqs, max_plot_loci = 2000)

Now infer an overall heterozygote miscall rate.

If we want to estimate the het miscall rate (over all read depth bins) we just set the minimum bin size to a very large value so it make just one bin, and use the infer_m function which implements a Markov chain Monte Carlo (MCMC) algorithm to estimate the heteryzogote miscall rate.

overall <- infer_m(lobster_buz_2000, minBin = 1e15)
#> Preparing data structures for MCMC
#> Running MCMC
#> Tidying output

Now look at that:

overall$m_posteriors
#> # A tibble: 1 x 6
#>     bin  mean  lo95  hi95 total_n mean_dp
#>   <int> <dbl> <dbl> <dbl>   <int>   <dbl>
#> 1     1 0.256 0.247 0.265   65291    65.9

Wow! (Or should we say “WHOA!”) A het miscall rate of around 25%.

Now infer a miscall rate for read depth bins

See the total_n above is about 65,000. That means 65,000 genotypes. (2000 loci typed at 36 individuals, with some missing data).
We will bin those up so that there are at least 2000 genotypes in each bin and then estimate the het miscall rate for each read depth bin.

binned <- infer_m(lobster_buz_2000, minBin = 2000)
#> Preparing data structures for MCMC
#> Running MCMC
#> Tidying output

And then we can plot the posterior mean and CIs for each read depth bin.

posteriors_plot(binned$m_posteriors)

Again, WHOA! The het miscall rate at low read depths is super high!