Complete Genomes of the Fruit Fly

The fruit fly (Drosophila melanogaster) is a classical model species in evolutionary biology. Kao et al. [139] investigated the origin of fruit flies from Central and North America using 121 individuals from America, Africa, and Europe. With Illumina sequencing of 23 lines from 12 locations and additional published data, they inferred 4,021,717 SNPs (out of a genome size of 139.5 Mb). Further filtering to remove low-quality SNPs, which likely represent false positives, resulted in 1,047,913 SNPs. The authors deposited the final VCF file on Dryad [1], which we will use here. Information on the origin of the 121 flies was found in the original publication. We made a file ‘geo_droso.txt’ (provided with the online resources of this book) with the individual labels, the locality, and the region coded with three letters as used in the original publication:

> geo <- read.delim("geo_droso.txt")

> str(geo)

'data.frame': 121 obs. of 3 variables:

$ ID : Factor w/ 121 levels "13_29","13.34": 1 2 3 4 ...

$ Locality: Factor w/ 16 levels "Birmingham, AL",..: 15 15 13 ...

$ Region : Factor w/ 6 levels "CAM","CAR","FRA",..: 5 5 5 5 ...
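Once such a table is loaded, a quick tabulation shows how the individuals are distributed among regions and localities. The sketch below uses a small made-up data frame in place of ‘geo_droso.txt’ (the labels, localities, and counts are hypothetical; only the three-column layout and the region codes follow the output above), so that it runs without the file. With the real data, the same calls would apply to geo.

```r
## Toy stand-in for 'geo_droso.txt': IDs, localities, and counts are
## hypothetical and only mimic the three-column layout shown above.
geo_toy <- data.frame(
    ID = paste0("ind", 1:6),
    Locality = c("Site A", "Site A", "Site B", "Site C", "Site C", "Site C"),
    Region = c("CAM", "CAM", "CAR", "FRA", "FRA", "FRA"),
    stringsAsFactors = TRUE
)

## Number of individuals per region
n_region <- table(geo_toy$Region)
print(n_region)

## Localities cross-tabulated against regions
print(table(geo_toy$Locality, geo_toy$Region))

## Each individual label should appear only once
stopifnot(!anyDuplicated(geo_toy$ID))
```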

Human Genomes

The ‘1000 Genomes Project’ started in 2008 with the initial goal to sequence one thousand human genomes in order to give a picture of the genomic variation within the world population. A study based on 1092 genomes was published in 2012 [271] followed by another one based on 2504 genomes three years later [272]. The Web site of this project [2] gives access to several sets of data. Twenty-five VCF files are provided (one for each chromosome and one for the mitochondrial genome) for a total of 16.2 GB compressed with GZIP.

Additional information is available from the above Web site in the text file ‘igsr_samples.tsv’ (downloaded from https://www.internationalgenome.org/data-portal/sample on 2019-06-11):

> samples.info <- read.delim("igsr_samples.tsv")

> str(samples.info)

'data.frame': 3904 obs. of 8 variables:

$ Sample.name : Factor w/ 3904 levels "HG00096",....

$ Sex : Factor w/ 2 levels "female","ma":....

$ Biosample.ID : Factor w/ 3504 levels "","SAME122....

$ Population.code : Factor w/ 29 levels "ACB","ASW","....

$ Population.name : Factor w/ 29 levels "African-Amer....

$ Superpopulation.code: Factor w/ 5 levels "AFR","AMR","E....

$ Superpopulation.name: Factor w/ 5 levels "African","Ame....

$ Data.collections : Factor w/ 18 levels "","1000 Geno....

This file has data on more individuals than in the VCF files. We thus match the individual labels from the first VCF file and test whether they are all in the above file:

> fl <- "ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz"

> labs <- VCFlabels(fl)

> all(labs %in% samples.info$Sample.name)

[1] TRUE
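Had the test returned FALSE, the natural next step would be to find out which labels are missing. The following is a minimal sketch with made-up vectors (labs_toy and known_toy are hypothetical and do not come from the 1000 Genomes files):

```r
## Hypothetical labels standing in for the VCF labels and the sample table
labs_toy <- c("S1", "S2", "S5")
known_toy <- c("S1", "S2", "S3", "S4")

## %in% returns one logical value per element of the left-hand vector
found <- labs_toy %in% known_toy

## all() collapses these values into a single TRUE/FALSE
all_found <- all(found)

## setdiff() lists the labels that have no match in the table
missing_labs <- setdiff(labs_toy, known_toy)
print(missing_labs)
```

Note that the test is not symmetric: it checks that every label from the VCF file is in the table, not the reverse, which is what is needed here since the table has more individuals.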

We can then create a smaller data frame with only the 2504 individuals and the variables we are interested in (stored in the vector vars):

> i <- match(labs, samples.info$Sample.name)

> vars <- c("Sex", "Population.code", "Superpopulation.code")

> DATA <- samples.info[i, vars]

> row.names(DATA) <- samples.info$Sample.name[i]

> str(DATA)

'data.frame': 2504 obs. of 3 variables:

$ Sex : Factor w/ 2 levels "female","male":....

$ Population.code : Factor w/ 29 levels "ACB","ASW","BEB".....

$ Superpopulation.code: Factor w/ 5 levels "AFR","AMR","EAS",....
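The use of match() above matters because it returns the row positions in the order of the labels, so the rows of the new data frame follow the order of the individuals in the VCF file rather than the order of the table. A small sketch with hypothetical data (info_toy and labs_toy are made up for illustration) makes this visible:

```r
## Hypothetical sample table with more rows than there are labels
info_toy <- data.frame(
    Sample.name = c("S4", "S1", "S3", "S2"),
    Sex = c("female", "male", "male", "female"),
    stringsAsFactors = TRUE
)
labs_toy <- c("S1", "S2", "S3")   # order as it would come from a VCF file

## Row position of each label in the table (NA if a label is absent)
i <- match(labs_toy, info_toy$Sample.name)

## drop = FALSE keeps a data frame even with a single column
DATA_toy <- info_toy[i, "Sex", drop = FALSE]
row.names(DATA_toy) <- info_toy$Sample.name[i]

## The rows now follow the order of the labels, not of the table
stopifnot(identical(row.names(DATA_toy), labs_toy))
```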

The script is easily modified, for instance if we want to keep variables other than the three selected here. We finally save this data frame in a file:

> saveRDS(DATA, "DATA_G1000.rds")
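A file written with saveRDS() is read back with readRDS(), which returns the object so that it can be assigned to any name. The sketch below checks the round trip on a small hypothetical data frame (DATA_toy is made up, and a temporary file is used instead of ‘DATA_G1000.rds’):

```r
## Hypothetical small data frame standing in for DATA
DATA_toy <- data.frame(
    Sex = factor(c("female", "male")),
    Population.code = factor(c("ACB", "ASW")),
    row.names = c("S1", "S2")
)

## Write to a temporary file so that nothing is left in the working directory
f <- tempfile(fileext = ".rds")
saveRDS(DATA_toy, f)

## readRDS() returns the stored object; factors and row names are preserved
DATA_back <- readRDS(f)
stopifnot(identical(DATA_back, DATA_toy))
unlink(f)
```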

  • [1] https://doi.org/10.5061/dryad.446sv.2
  • [2] https://www.internationalgenome.org/
 