We provide some example usage of the oracle calculations available in
updog. These are particularly useful for read-depth determination. These calculations are described in detail in Gerard et al. (2018).
Suppose we have a sample of tetraploid individuals derived from an S1 cross (a single generation of selfing). Using domain expertise (either from previous studies or a pilot analysis), we’ve determined that our sequencing technology will produce relatively clean data. That is, the sequencing error rate will not be too large (say, ~0.001), the bias will be moderate (say, ~0.7 at the most extreme), and the majority of SNPs will have reasonable levels of overdispersion (say, less than 0.01). We want to know how deep we need to sequence.
oracle_mis, we can see how deep we need to sequence under the worst-case scenario we want to control (sequencing error rate = 0.001, bias = 0.7, overdispersion = 0.01) in order to obtain a misclassification error rate of at most, say, 0.05.
Before we do this, we also need the distribution of the offspring genotypes. We can get this distribution assuming various parental genotypes using the
get_q_array function. Typically, error rates will be larger when the allele-frequency is closer to 0.5. So we’ll start in the worst-case scenario of assuming that the parent has 2 copies of the reference allele.
This is what the genotype distribution for the offspring looks like:
Now, we are ready to iterate through read-depth’s until we reach one with an error rate less than 0.05.
Looks like we need a depth of 90 in order to get a misclassification error rate under 0.05.
oracle_mis returns the best misclassification error rate possible under these conditions (
ploidy = 4,
bias = 0.7,
seq = 0.001,
od = 0.01, and
pgeno = 2). In your actual analysis, you will have a worse misclassification error rate than that returned by
oracle_mis. However, if you have a lot of individuals in your sample, then this will act as a reasonable approximation to the error rate. In general though, you should sequence a little deeper than suggested by
Suppose we only have a budget to sequence to a depth of 30. Then what errors can we expect? We can use
oracle_plot to visualize the errors we can expect.
Most of the errors will be mistakes between genotypes 2/3 and mistakes between genotypes 1/2.
Even though the misclassification error rate is pretty high (0.14), the correlation of the oracle estimator with the true genotype is pretty reasonable (0.89). You can obtain this using the
Gerard, David, Luís Felipe Ventorim Ferrão, Antonio Augusto Franco Garcia, and Matthew Stephens. 2018. “Genotyping Polyploids from Messy Sequencing Data.” Genetics 210 (3). Genetics: 789–807. https://doi.org/10.1534/genetics.118.301468.