Geometry of Genotyping Space in the Presence of Marker Typing Errors
The sample size (N) of a mapping population limits the marker density of the map. For a DH population with N = 200, the smallest non-zero recombination rate between two adjacent markers is 1/N = 0.5 %. In the absence of errors, all markers should therefore appear in groups of absolutely linked (AL) markers, with a distance of ≥0.5 cM between the groups. Typing errors erode these groups into “clouds” of falsely different markers. Figure 14.1 illustrates the formation of such a cloud from a set L of 11 AL markers in the multi-dimensional space of markers scored for a sub-sample of 16 individuals from the mapping population. In an ideal error-free situation, all 11 markers would vary identically across the 16 genotypes shown: in the 16-dimensional space these markers are in the same state (aababbbaaaabbaba) and belong to the set L (represented as dots within the grey circle). Due to typing errors, some of the markers change their 16-dimensional states and leave the set L (white holes); the corresponding genotypes will be erroneously recorded as “recombinants”.
Fig. 14.1 A geometric model of erosion of AL marker groups due to scoring errors (only 11 markers scored for 16 genotypes are shown)
The problem is how to select markers for building a reliable genetic map in a challenging situation: the data set includes thousands of markers per chromosome, a certain proportion of the markers are contaminated by erroneous data points, and some of the data points are missing.
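The erosion of an AL group described above can be illustrated with a small sketch. It uses the 16-genotype state quoted in the text; the marker names, the positions of the miscored individuals, and the helper functions are hypothetical and chosen only for illustration (they are not read from Fig. 14.1).

```python
# Minimal sketch: how scoring errors push markers out of an AL group.
TRUE_STATE = "aababbbaaaabbaba"  # shared state of the 11 AL markers (from the text)


def apply_errors(state, error_positions):
    """Return a copy of `state` with the allele miscored at the given individuals."""
    scores = list(state)
    for i in error_positions:
        scores[i] = "b" if scores[i] == "a" else "a"
    return "".join(scores)


def hamming(x, y):
    """Number of individuals at which two markers differ (apparent recombinants)."""
    if len(x) != len(y):
        raise ValueError("markers must be scored on the same individuals")
    return sum(a != b for a, b in zip(x, y))


# Hypothetical miscored individuals (0-based) for three of the 11 markers.
ERRORS = {"M3": [4], "M7": [0, 9], "M10": [15]}

observed = {
    f"M{k}": apply_errors(TRUE_STATE, ERRORS.get(f"M{k}", []))
    for k in range(1, 12)
}

# Markers whose observed state differs from the true one have left the set L.
left_L = {
    name: hamming(state, TRUE_STATE)
    for name, state in observed.items()
    if state != TRUE_STATE
}
print(left_L)  # {'M3': 1, 'M7': 2, 'M10': 1}

# Apparent recombination fraction of each eroded marker relative to the group.
apparent = {name: d / len(TRUE_STATE) for name, d in left_L.items()}
print(apparent)
```

In this toy sub-sample a single miscored individual already produces an apparent recombination fraction of 1/16, far above the 1/N resolution limit of the full population.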
The Proposed Method and Algorithm
We propose a method for addressing these problems based on a simple idea: with a very large number of scored markers (e.g., thousands or tens of thousands per chromosome) and a small-to-moderate population size, many markers cannot be resolved by recombination and should appear as groups of AL markers. However, some AL markers will appear as “recombinants” if even a small proportion of scores per marker are erroneous. Thus, markers from groups of absolutely linked markers can be trusted more than singleton markers. For sample size N and a proportion p of genotyping errors per marker, the probability that both alleles of a marker are correctly identified in all individuals can be estimated, under the assumption that the typing errors are independent, as $P = (1-p)^N \approx e^{-Np}$. Assuming a 1 % error rate within a group of AL markers and N = 100, about a third of the markers ($e^{-1} \approx 0.37$) will still remain error-free. In a DH population of N = 100 individuals, for a chromosome length of 100 cM the minimum interval length will be 1 cM. Consequently, the map cannot contain more than 101 distinct marker positions. If we genotyped 10,000 markers on this chromosome, only about 100 markers (referred to as skeletal markers) could be ordered, whereas the rest would remain absolutely linked to the skeletal markers. Thus, for building a skeleton map one can select presumably error-free markers based on the presence of “twins” in the sample, although there is also a small probability that non-identical markers become “twins” because of genotyping errors. Therefore, a threshold is introduced in our algorithm so that only markers with a sufficient number of twins are selected. In regions with a lower density of recombination events (e.g., those affected by the centromeric effect on recombination), the map will be less affected by typing errors.
Fig. 14.2 Scheme of the “twin” algorithm. Illustrated is the marker information flow in the process of map construction
The major steps of our algorithm for building ultra-dense genetic maps (Fig. 14.2), implemented in the MultiPoint software (multiqtl.com), are:
(a) Forming groups of markers with zero distance and selecting a “delegate” from each group containing no fewer twins than the predefined threshold (equal to 3 in Fig. 14.2);
(b) Moving all remaining markers, except the twins of the selected delegates, to the Heap;
(c) Clustering the delegate markers and ordering the obtained linkage groups (LG);
(d) Filling gaps and extending LG ends using markers from the Heap;
(e) Removing markers that violate map stability or the monotonic growth of distance between a marker and its successive neighbors.
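The arithmetic behind the error-free probability and steps (a) and (b) of the list above can be sketched as follows. This is a minimal illustration, not the MultiPoint implementation: the function names, the toy genotypes, and the interpretation of the threshold as a minimum group size are assumptions, and missing data and the later steps (c) to (e) are not handled.

```python
import math
from collections import defaultdict


def error_free_probability(n_individuals, error_rate):
    """Return (exact, approximate) probability that a marker has no typing errors.

    Assumes independent errors: P = (1 - p)^N, approximated by exp(-N * p).
    """
    exact = (1.0 - error_rate) ** n_individuals
    approx = math.exp(-n_individuals * error_rate)
    return exact, approx


def select_delegates(markers, min_group_size=3):
    """Sketch of steps (a) and (b): group identical markers and pick delegates.

    markers: dict mapping marker name -> genotype string such as "aabab".
    Assumption: the threshold is applied to the size of the zero-distance
    group, and identity means exact string equality (no missing scores).
    Returns (delegates, heap), where delegates maps each delegate name to
    the list of its twins and heap lists all other markers.
    """
    groups = defaultdict(list)
    for name, genotype in markers.items():
        groups[genotype].append(name)

    delegates = {}
    heap = []
    for members in groups.values():
        if len(members) >= min_group_size:
            # The first member acts as delegate; its twins stay attached to it.
            delegates[members[0]] = members[1:]
        else:
            # Groups below the threshold are set aside for gap filling later.
            heap.extend(members)
    return delegates, heap


# Hypothetical toy data: one group of three twins, one pair, one singleton.
toy_markers = {
    "m1": "aabab", "m2": "aabab", "m3": "aabab",
    "m4": "aabbb",  # differs from m1 in a single score
    "m5": "abbbb", "m6": "abbbb",
}
toy_delegates, toy_heap = select_delegates(toy_markers, min_group_size=3)
print(toy_delegates)  # {'m1': ['m2', 'm3']}
print(toy_heap)       # ['m4', 'm5', 'm6']
print(error_free_probability(100, 0.01))
```

With N = 100 and p = 0.01 both values returned by `error_free_probability` are close to 0.37, which is the "about a third" quoted in the text. Choosing the first member of a group as its delegate is arbitrary here, since all members of a group are identical by construction.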