A high-quality reference genome of bread wheat has not yet been completed. In the past few years, however, the wheat community has produced valuable genomic resources that can be used as a proxy reference: a whole genome shotgun (WGS) assembly of Chinese Spring based on 454 technology (Brenchley et al. 2012); the reference genomes for the diploid progenitor species Triticum urartu (2n = 14; AA) (Ling et al. 2013) and Aegilops tauschii (2n = 14; DD) (Jia et al. 2013); an assembly of the close Triticeae relative Hordeum vulgare L. (2n = 14; HH) (International Barley Genome Sequencing et al. 2012); and the genomic contigs from the Chromosome-based Survey Sequencing (CSS; International Wheat Genome Sequencing Consortium 2014). These collections can serve as supporting information, bearing in mind that discrepancies are expected for the related organisms and that specific genes under study may not be present in Chinese Spring. Nevertheless, the available genomic sequences facilitate NGS analyses that depend on a reliable genome reference.
SNPs that can be used to map genes of interest are becoming increasingly available. The CerealsDB website holds ~100,000 SNPs from British varieties (Allen et al. 2013) obtained by transcriptome and exome sequencing, although only a fraction (<8 %) have been converted into functional HTP assays. Recently, a subset of these has been incorporated into the ~82,000 iSelect array coordinated by Eduard Akhunov at Kansas State University (Wang et al. 2014), which follows the 9,000 SNP array (Cavanagh et al. 2013). These SNP arrays are an extremely valuable resource for the community in diversity and association studies. Their use is limited for large mapping populations, however, because of their cost, the difficulty of interpreting data when using diploid or alien introgressions, and the difficulty of reliably calling heterozygous individuals. For these reasons, alternative approaches are sometimes required to identify polymorphisms across target regions. NGS provides several methodologies to identify putative variations. WGS sequencing consists of sequencing random fragments of genomic DNA without any selection except for fragment size. The reads are then aligned to a reference and SNPs are called from the alignments. In several model organisms, these resequencing NGS approaches have been combined with bulks/pools of phenotypically distinct individuals to identify SNPs that are closely linked with the gene of interest within a single experiment.
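The step of calling SNPs from aligned reads can be illustrated with a toy example. The sketch below is only a minimal illustration, not the pipeline used in this chapter: it assumes gapless alignments given as (start, read) pairs, and the function name `call_snps`, the thresholds `min_depth` and `min_alt_fraction`, and the example sequences are all hypothetical. Real analyses rely on dedicated aligners and variant callers that also model base and mapping quality.

```python
from collections import Counter


def call_snps(reference, alignments, min_depth=4, min_alt_fraction=0.2):
    """Call candidate SNPs from gapless alignments (toy example).

    alignments: list of (0-based start position, read sequence).
    Returns tuples of (1-based position, ref base, alt base, alt reads, depth).
    """
    # Build a per-position count of observed bases (a simple pileup).
    pileup = [Counter() for _ in reference]
    for start, read in alignments:
        for offset, base in enumerate(read):
            pos = start + offset
            if 0 <= pos < len(reference):
                pileup[pos][base] += 1

    snps = []
    for pos, counts in enumerate(pileup):
        depth = sum(counts.values())
        if depth < min_depth:
            continue  # too few reads to make a call
        ref_base = reference[pos]
        for base, n in sorted(counts.items()):
            if base != ref_base and n / depth >= min_alt_fraction:
                snps.append((pos + 1, ref_base, base, n, depth))
    return snps


# Hypothetical reference and reads; three of the five reads covering
# position 4 carry an A instead of the reference T.
reference = "ACGTACGTAC"
alignments = [
    (0, "ACGTAC"),
    (0, "ACGAAC"),
    (2, "GAACGT"),
    (2, "GTACGT"),
    (3, "AACGTAC"),
    (4, "ACGTAC"),
]
snps = call_snps(reference, alignments)
print(snps)
```

Positions covered by fewer than `min_depth` reads are skipped, which is one reason why low per-individual coverage limits what can be called reliably.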
These approaches, collectively termed NGS-enabled genetics (Schneeberger and Weigel 2011), are rapidly evolving and have used different strategies. MutMap produces segregating populations from seeds with induced mutations, and the bulks are then sequenced using WGS (Abe et al. 2012; Takagi et al. 2013b). QTL-seq crosses two plants with opposite phenotypes; the progeny with extreme phenotypes are bulked and their DNA is extracted for sequencing (Takagi et al. 2013a). RenSeq focuses on R-genes by designing baits from known resistance genes and performing targeted resequencing (Jupe et al. 2013). These techniques do not scale well in wheat because they rely on a reliable reference sequence and a relatively small genome. A full Illumina HiSeq 2500 run can produce 600 Gbp, providing 35-fold coverage of the complete wheat genome. However, for a bulk of 20 plants this translates into fewer than two reads per position per individual.
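The coverage figures above follow from simple arithmetic, sketched here in Python. The 600 Gbp yield and the bulk of 20 plants are taken from the text; the 17 Gb genome size is the value quoted later in this section. The function and constant names are illustrative only.

```python
def fold_coverage(yield_bp, genome_bp):
    """Average fold coverage = sequenced bases / genome size."""
    return yield_bp / genome_bp


RUN_YIELD_BP = 600e9      # full HiSeq 2500 run, as stated in the text
WHEAT_GENOME_BP = 17e9    # bread wheat genome size quoted in this section
BULK_SIZE = 20            # plants pooled in one bulk

run_coverage = fold_coverage(RUN_YIELD_BP, WHEAT_GENOME_BP)
per_plant_coverage = run_coverage / BULK_SIZE

print(f"Whole-run coverage: {run_coverage:.1f}x")
print(f"Coverage per plant in the bulk: {per_plant_coverage:.2f}x")
```

The per-plant value assumes that every individual contributes equally to the pooled DNA, which is an idealisation.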
To reduce the complexity of the data, we propose to focus on sequencing the wheat transcriptome using RNA-Seq (Westermann et al. 2012) instead of genomic DNA. By sequencing the transcribed RNA we can use short read aligners and use transcript assemblies as reference (Fig. 22.2a). At first we used the wheat UniGenes set available at NCBI, which consists of collapsed homoeologous transcripts from a myriad of sources (Pontius et al. 2002). More recently, we have shifted to a phased transcriptome in which gene models are separated by their corresponding genome (Krasileva et al. 2013).
Fig. 22.2 RNA-Seq NGS-enabled genetics. (a) Representation of a typical RNA-Seq output; the method excludes non-coding regions from the genome and the coverage is correlated to the expression of the gene. (b) Illustration of a non-informative homoeologous SNP (G181T) present in both parental lines, and an informative allelic SNP (G184A), only present in the resistant progenitor Avocet S+ Yr15. The consensus sequences from the parental genotypes include this information in the form of ambiguity codes (K and R, respectively). In the bulks, the individual reads align across the reference sequence, with matches indicated by dots, and polymorphisms at positions 181 and 184 indicated by the corresponding nucleotide variants at those positions. The SNP index is calculated as the frequency of the informative allelic SNP in each bulk. The Bulk Frequency Ratio is the quotient of the resistant and susceptible bulk SNP Indexes. (c) Primer design: The allelic SNP (G184A) is used at the 3′ end of the differentiating primers. For the common primer, a homoeologous SNP is selected for the 3′ end to make the marker genome-specific. (d) The KASP assay output shows the intensities of the HEX and FAM fluorescence of individual plants as a single dot. The clusters near the X and Y-axis are composed of homozygous individuals, while the central cluster contains heterozygous plants
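The quantities defined in the caption of Fig. 22.2 (ambiguity codes in the parental consensus, the SNP index of each bulk, and the Bulk Frequency Ratio) can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, only two-allele IUPAC codes are handled, and the read counts in the example are invented for illustration rather than taken from the experiment.

```python
# Standard IUPAC codes for two-base ambiguities.
IUPAC_TWO_BASE = {
    frozenset("AG"): "R",
    frozenset("CT"): "Y",
    frozenset("GT"): "K",
    frozenset("AC"): "M",
    frozenset("CG"): "S",
    frozenset("AT"): "W",
}


def ambiguity_code(alleles):
    """Return the consensus symbol for one or two observed alleles."""
    observed = frozenset(alleles)
    if len(observed) == 1:
        return next(iter(observed))
    return IUPAC_TWO_BASE[observed]  # KeyError for three or more alleles


def snp_index(informative_reads, total_reads):
    """Frequency of the informative allele among reads covering the SNP."""
    if total_reads <= 0:
        raise ValueError("position has no coverage")
    return informative_reads / total_reads


def bulk_frequency_ratio(resistant_index, susceptible_index):
    """Quotient of the resistant and susceptible bulk SNP indexes."""
    if susceptible_index == 0:
        return float("inf")  # allele absent from the susceptible bulk
    return resistant_index / susceptible_index


# Consensus symbols for the two SNPs named in the caption.
homoeologous_code = ambiguity_code("GT")  # G181T
allelic_code = ambiguity_code("GA")       # G184A

# Invented read counts at position 184 (A allele reads, total reads).
resistant = snp_index(18, 30)
susceptible = snp_index(2, 40)
bfr = bulk_frequency_ratio(resistant, susceptible)

print(homoeologous_code, allelic_code)
print(f"SNP index: resistant {resistant:.2f}, susceptible {susceptible:.2f}")
print(f"Bulk Frequency Ratio: {bfr:.1f}")
```

A high ratio indicates that the informative allele is enriched in the resistant bulk, which is the signal used to select candidate SNPs linked to the gene of interest.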
RNA-Seq captures the full dynamic range of the transcriptome, an advantage over array platforms, which are restricted to the pre-defined set of variants incorporated into the array design. SNPs can be identified either by aligning to a known transcriptome or by de novo assembly of the transcriptome (Grabherr et al. 2011). With the use of F2 populations it is possible to create a panel of putative SNPs that enables haplotype analysis without a priori knowledge of the positions of the loci (Trick et al. 2012). RNA-Seq allows rapid access to SNPs in wheat and it scales well because the transcriptome is more than two orders of magnitude smaller than the genome (~80 Mb compared to 17 Gb). In principle, it is possible to sequence the transcriptome with an average coverage over 900x (keeping in mind variations of coverage due to expression) in a single HiSeq 2500 lane, as opposed to just over 1.5x of genomic sequence.
RNA-Seq was originally designed to characterize gene expression levels, so coverage varies with expression; this can bias SNP calling because the assumption of uniform coverage does not hold. To overcome these biases and analyse the volume of data produced by NGS, bioinformatics pipelines and access to high-performance computing are required. Although this is a potential barrier to adoption, new web-based, user-friendly graphical interfaces, such as Galaxy (Goecks et al. 2010), are enabling new users to access high-performance computing facilities. For NGS-enabled genetics in wheat, we propose a pipeline that integrates BSA (Michelmore et al. 1991; Trick et al. 2012), syntenic information from related grasses, and the use of the CSS to aid in the design of genome-specific primers from putative SNPs.