This vignette is additional to the basic
polymapR vignette “How to use polymapR”, to which we would direct readers’ attention to first.
Here, we will go through the main steps of polyploid linkage analysis using
polymapR  when discrete dosages are not available or not desired, but where genotype probabilities are instead available. Genotype probabilties are a direct output of many polyploid genotype callers such as
updog  or
polyRAD , although dosage probabilities from any other genotyping software can be used.
For more background information on the functions described here, please refer to the 2021 article of Liao et al. .
By their very nature, probabilistic datasets in polyploids are larger than discrete ones: each marker-individual combination has ploidy + 1 floating-point numbers associated with it, in comparison to a single integer value for a discrete dataset. For example, we might encounter the following probabilistic data for 5 SNP markers in a single tetraploid individual:
Using discrete dosages (e.g. by selecting the maximum probability) would give the following table for the same individual:
However, there are some advantages to probabilistic genotypes - in cases where uncertainty exists in the genotype assignment, it might not be obvious how to discretise the genotype scores. This may lead to observations being marked as missing data or even being entirely removed from the dataset, depending on what filtering thresholds are used.
Probabilistic genotypes are a standard output from many polyploid genotype calling procedures. This vignette assumes the R package
fitPoly has been used (developed to score the output of SNP arrays).
polymapR also includes functions to convert the output of other polyploid genotype callers developed for sequencing reads into a compatible format. For example, output of the
multiflex function of
updog can be re-formatted using the
convert_updog function, while
convert_polyRAD does the same with the
RADobject output of function
PipelineMapping2Parents in R package
In this vignette we will use a simulated SNP dataset from an autotetraploid F1 population with 200 individuals. This fictive organism has 5 chromosomes (2n = 4x = 10) and exhibits polysomic inheritance (although the pairing behaviour was that of random bivalents). The dataset was generated using
PedigreeSim  and the
GenoSim package (Liao, Tumino et al., in preparation) which simulates SNP array intensity values (and also sequencing reads) given an underlying population. The genotypes were called using
fitPoly  using the function
The data should be downloaded from the following FigShare repository and saved to your working directory:
We will assume you have already installed or updated the version of
polymapR to at least version 1.1.0. If not, please install the latest version of
polymapR from CRAN:
Assuming you have not altered the filename of the sample dataset, we will first load the genotype data into R:
Have a look at the layout of the data, as this is generally the format that is expected for probabilistic genotypes:
For clarity, we’ll also define what the parental and offspring samples are already. In this dataset there are two maternal replicates (‘P1’ and ‘P1a’) but only a single paternal replicate (‘P2’). It is quite common (recommended in fact) to have multiple parental replicates to ensure parental calls are reliable.
The first step in the mapping using probabilistic genotypes is to run the
checkF1 function (written by Roeland Voorrips). This looks for concordance between parental and offspring genotypes, as well as checking for marker skewness / distorted segregation under various genetic models (disomic, polysomic, mixed) and returns a number of different quality parameters that might be useful for flagging problematic markers. See
?checkF1 for more details. We will capture the output of
checkF1, as this will be needed in many subsequent function calls:
This function returns a list with elements
checked_F1 (the actual function output) and
meta (which carries the information about parameter settings used). Note that we have assumed that we are dealing with a polysomic species here (
polysomic = TRUE). In cases where there is uncertainty regarding the mode of inheritance, it can be useful to run
checkF1 multiple times, e.g. setting
polysomic = FALSE and
disomic = TRUE. Quality of the markers can be assessed by filtering on
qall_mult > 0 for example. In some cases, you may notice that parental scores have been made missing:
## m MarkerName parent1 parent2 F1_0 F1_1 F1_2 F1_3 ## 1 563 LG1_0.02 2 1 16.307194 91.47333 81.57135 10.648126 ## 2 1315 LG1_0.03 1 0 91.105597 83.89326 22.70502 2.296116 ## 3 437 LG1_0.13 0 NA 0.000000 101.97259 97.02741 0.000000 ## 4 1130 LG1_0.21 1 2 17.758076 75.43383 91.64195 15.166141 ## 5 512 LG1_0.64 4 NA 5.552898 32.96319 63.11849 72.637833 ## F1_4 F1_NA P1 P1a P2 bestfit frqInvalid_bestfit Pvalue_bestfit ## 1 0.00000 0 2 2 1 1551_0 0.000 0.3895 ## 2 0.00000 0 1 1 0 11_0 0.125 0.5856 ## 3 0.00000 1 0 0 2 11_1 0.000 0.7259 ## 4 0.00000 0 1 1 2 1551_0 0.000 0.6185 ## 5 23.72759 2 4 4 1 1551_1 0.028 0.0000 ## matchParent_bestfit bestParentfit frqInvalid_bestParentfit ## 1 Yes 1551_0 0.0000 ## 2 Yes 11_0 0.1250 ## 3 OneOK 11_1 0.0000 ## 4 Yes 1551_0 0.0000 ## 5 No 141_2 0.1945 ## Pvalue_bestParentfit matchParent_bestParentfit q1_segtypefit q2_parents ## 1 0.3895 Yes 1 1 ## 2 0.5856 Yes 0 1 ## 3 0.7259 OneOK 1 1 ## 4 0.6185 Yes 1 1 ## 5 0.0000 OneOK 0 1 ## q3_fracscored qall_mult qall_weights ## 1 1.0000 1.0000 1.0000 ## 2 1.0000 0.0000 0.4444 ## 3 0.9875 0.9875 0.9972 ## 4 1.0000 1.0000 1.0000 ## 5 0.9750 0.0000 0.4389
If you are interested, you can impute these values using the internal function
polymapR as follows:
For now, we will just focus our attention on potentially problematic markers that have had a
qall_mult value of 0:
There appear to be quite a large number of markers with questionable agreement between parental and offspring scores - we will remove these as follows:
If you are concerned about removing these markers, you can revisit them later on and try to add them back to the map. For now we will proceed with a cleaned-up dataset, as this makes the mapping a lot easier.
We can get an overview of the “precision” of probability scores using the function
By default a plot is generated giving some summary statistics of the marker scores, such as the mean genotype probability encountered etc. The output can be used to filter the marker dataset further (although here using the default parameters no filtering actually occurs):
The next step is to have a look at the distribution of maximum genotype probabilities, as we would ideally like these to be as close to 1 as possible in a clearly-scored dataset:
## interval count percent ## 1 0 ~ 0.05 0 0% ## 2 0.05 ~ 0.1 0 0% ## 3 0.1 ~ 0.15 0 0% ## 4 0.15 ~ 0.2 0 0% ## 5 0.2 ~ 0.25 0 0% ## 6 0.25 ~ 0.3 0 0% ## 7 0.3 ~ 0.35 0 0% ## 8 0.35 ~ 0.4 0 0% ## 9 0.4 ~ 0.45 19 0.01% ## 10 0.45 ~ 0.5 153 0.05% ## 11 0.5 ~ 0.55 1624 0.54% ## 12 0.55 ~ 0.6 1661 0.55% ## 13 0.6 ~ 0.65 1880 0.62% ## 14 0.65 ~ 0.7 2017 0.67% ## 15 0.7 ~ 0.75 2546 0.85% ## 16 0.75 ~ 0.8 2824 0.94% ## 17 0.8 ~ 0.85 2872 0.95% ## 18 0.85 ~ 0.9 3534 1.17% ## 19 0.9 ~ 0.95 6012 2% ## 20 0.95 ~ 1 275591 91.48% ## 21 NA 519 0.17%
This may give some indication of possible issues with the assigned genotypes. In the example above, over 90% of the data (marker-individual combinations) had an assigned probability of 0.95 or higher - suggesting a relatively high-quality dataset.
As in the original
polymapR pipeline, it is also possible to check the marker data for duplicate individuals. This is accomplished using the
## ## No duplicates found
Currently there is no specific function to check for duplicate markers using genotype probabilities. The current advice here is to discretise the data and use the
screen_for_duplicate_markers function from the original
polymapR pipeline, and then go back and filter the probabilistic dataset. Specific functions to assist with this task are planned but not currently implemented.
In the original
polymapR pipeline, marker conversion is an important step that reduces the number of marker segregation types to a smaller set (without any loss or distortion of the linkage information they contain). This function has been moved to an internal function as part of the
linkage.gp call and therefore is no longer directly run by the user.
For map ordering we use the ordering algorithm implemented in the
MDSMap package . This is achieved using the function
Once the map has been generated, we can use the marker assignment information to generate a phased linkage map (necessary for other subsequent steps e.g. IBD-based QTL mapping, available in the
For extra details e.g on map visualisations, see the main
1. Bourke PM et al: PolymapR—linkage analysis and genetic map construction from f1 populations of outcrossing polyploids. Bioinformatics 2018, 34:3496–3502.
2. Voorrips RE, Gort G, Vosman B: Genotype calling in tetraploid species from bi-allelic marker data using mixture models. BMC bioinformatics 2011, 12:172.
3. Gerard D, Ferrão LFV, Garcia AAF, Stephens M: Genotyping polyploids from messy sequencing data. Genetics 2018, 210:789–807.
4. Clark LV, Lipka AE, Sacks EJ: PolyRAD: Genotype calling with uncertainty from sequencing data in polyploids and diploids. G3: Genes, Genomes, Genetics 2019, 9:663–673.
5. Liao Y, Voorrips RE, Bourke PM, Tumino G, Arens P, Visser RG, Smulders MJ, Maliepaard C: Using probabilistic genotypes in linkage analysis of polyploids. Theoretical and Applied Genetics 2021, 134:2443–2457.
6. Voorrips RE, Maliepaard CA: The simulation of meiosis in diploid and tetraploid organisms using various genetic models. BMC bioinformatics 2012, 13:248.
7. Preedy KF, Hackett CA: A rapid marker ordering approach for high-density genetic linkage maps in experimental autotetraploid populations using multidimensional scaling. Theoretical and Applied Genetics 2016, 129:2117–2132.