Obviously colored by my interests.
Rapid radiation of common Eurasian Y chromosome haplogroups occurred significantly later than the out of Africa migration. M. Järve, International Consortium of the Estonian Centre for Genomics Estonian Biocentre and Department of Evolutionary Biology, University of Tartu, Tartu, Estonia.
Human genetic diversity outside Africa is low, which is commonly ascribed to a recent out of Africa bottleneck and subsequent rapid colonization of the rest of the world. Previous studies of the male-specific Y chromosome have shown that haplogroups common throughout non-African populations all coalesce to a small number of shared ancestral lineages, the branching order of which is only partly understood. Using 475 high coverage whole Y chromosome sequences, including 317 newly reported here, we selected reliable regions within the Y chromosome based on coverage analysis, mappability and sequence class. Based on these data, we refined the Y chromosome haplogroup tree, applying phylogenetic methods to establish the branching order and temporal dynamics of splits in non-African Y chromosome haplogroups. Compared to the length of the branches that separate African and non-African diversity, the internal branches distinguishing continental and sub-continental differences outside Africa are generally short, consistent with the model of a rapid initial colonization of Eurasia and Oceania. Following the split between African and non-African haplogroups [90 KYA (95% CI: 87-94 KYA)], the differentiation of South and Southeast Asian haplogroups H, S, M, and C did not begin until around 43 KYA, and haplogroups N and R, widely spread among Northeast and Northwest Eurasian populations, started to diversify significantly later [17 KYA (95% CI: 16-19 KYA) and 26 KYA (95% CI: 25-28 KYA), respectively]. Many major phylogenetic groups in different geographic regions seem to originate from the period around 50 KYA.
Use of Long-read-sequence Aided Phasing for Inference of Ancestry Assignment in Admixed Populations. F. L. Mendez1, S. S. Shringarpure1, A. Moreno1, E. R. Martin2, M. L. Cuccaro2, C. D. Bustamante1 1) Department of Genetics, Stanford University, Stanford, CA; 2) Center for Genetic Epidemiology and Statistical Genetics, University of Miami, Miami, FL.
Correct phase reconstruction of individual chromosomes is important for numerous genetic analyses, including inferring demographic parameters in admixture processes. Admixture is the result of interbreeding of previously differentiated populations. The chromosomes of admixed individuals are composed of segments that can be traced individually to one of the ancestral populations. The abundance and length distributions of these chromosomal segments provide crucial information on the admixture process; however, the correct inference of their length and ancestry requires phased data from the admixed individuals. Phase reconstruction can be performed by applying the rules of Mendelian segregation of variants or by using statistical methods that rely on population data. Alternatively, molecular phasing (the observation of different polymorphisms in the same chromosomal sequence) provides direct evidence of phase. Molecular phasing methods, such as long-read sequencing, may therefore be used to extend the range of confident phasing. We simulate long-read sequence data and explore the improvement that molecular phasing brings to the accuracy of ancestry assignment of chromosomal segments, to the definition of segment boundaries, and to overall admixture inference. We then use genotype information from 5 trios of Latino individuals, including 10 individuals with long-read sequence information, together with three reference panels of haplotypes of European, African, and Native American ancestry to evaluate the effect of molecular phasing on inference of the admixture process.
Linear Mixed Model-Based Admixture Mapping. L. Brown, T. Thornton Biostatistics, University of Washington, Seattle, WA.
Genetic studies in recently admixed populations can provide invaluable insight into novel risk factors contributing to disease. Population admixture results in combined genomes from previously isolated ancestral populations that may have discernible allele frequency differences due to natural selection and genetic drift. Genes that underlie ethnic differences in traits and that show differential risk by ancestry can be identified using admixture mapping. Compared to studies carried out in more ethnically homogenous populations, admixture mapping has potentially greater power to detect certain genetic variants. Linear mixed models have gained traction as a tool for genome-wide association studies. Mixed model methods have been shown to protect against spurious associations in structured samples, a common pitfall in genetic association studies, by directly accounting for sources of dependence including cryptic relatedness and population stratification. We present a linear mixed model approach for admixture mapping in the presence of population structure and hidden relatedness. We implement this method using local ancestry estimates based on genome-wide SNP data. We apply the method to analyze genetic associations with white blood cell count and C-reactive protein phenotypes in the African American cohort of the Women’s Health Initiative study. We demonstrate that our proposed linear mixed model method for admixture mapping provides a substantial improvement over widely used admixture mapping approaches.
Genetic evidence of archaic admixture in India. A. Basu, D. Das, S. Das, N. Biswas National Institute of BioMedical Genomics, Kalyani, India.
Comparisons with high-coverage Denisova and Neanderthal whole-genome sequences have revealed significant archaic admixture in all present-day non-African populations. Microblade tool usage has been reported from central India, yet no genetic study has examined archaic admixture in present-day South Asians. We report the first evidence of archaic admixture from whole-genome sequence data of 4 present-day Indians. The four individuals sequenced lie at the extremities of a two-dimensional principal component plot summarizing the extent of genomic variance in 237 Indians belonging to 20 linguistically and ethnically diverse populations sampled from different geographic locations in India. Their population identities are Onge, Jamatia, Panniya and Birhore. All individuals showed a slight excess of Denisova admixture (D-statistic 1.6-1.99) when compared with Eurasians, with admixture evidence increasing from Jamatia (1.6) to Onge (1.99). A similar pattern was observed when compared with Neanderthal. Our findings show evidence of archaic admixture in different present-day populations, not restricted to the proximity of archaeological evidence, indicating widespread admixture.
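For readers unfamiliar with the D-statistic quoted above, here is a minimal sketch of the usual ABBA-BABA count on which it is based. The function name, the 0/1 allele coding (0 = ancestral, 1 = derived) and the one-allele-per-genome input are my own assumptions for illustration; the abstract does not describe the authors' pipeline. Raw D as defined here lies between -1 and 1, so the values quoted in the abstract are presumably on a different scale (for example standardized scores), which the abstract does not state.

```python
def d_statistic(sites):
    """Raw ABBA-BABA D for four genomes (P1, P2, P3, outgroup).

    `sites` is an iterable of (p1, p2, p3, out) tuples, one allele per
    genome, coded 0/1. This input layout is a hypothetical choice made
    for the sketch, not something taken from the abstract.
    """
    abba = 0  # P2 shares the non-outgroup allele with P3
    baba = 0  # P1 shares the non-outgroup allele with P3
    for p1, p2, p3, out in sites:
        if p1 == out and p2 == p3 and p2 != out:
            abba += 1
        elif p2 == out and p1 == p3 and p1 != out:
            baba += 1
    total = abba + baba
    if total == 0:
        raise ValueError("no informative ABBA or BABA sites")
    return (abba - baba) / total


# Toy data: three ABBA sites, one BABA site, one uninformative site.
toy_sites = [(0, 1, 1, 0)] * 3 + [(1, 0, 1, 0)] + [(0, 0, 1, 0)]
toy_d = d_statistic(toy_sites)
```

With P3 set to the archaic genome, a positive D means P2 shares more derived alleles with the archaic genome than P1 does. Significance is usually judged from a resampling-based standard error, which this sketch omits.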
Inferring patterns of demography and assortative mating in the Thousand Genomes Project admixed populations from the Americas. E. E. Kenny1,5,6,7, C. Gignoux2, S. Baharian3, S. Musharoff2, B. Maples2, S. Shingarpure2, A. Auton4, C. D. Bustamante2, S. Gravel3, A. R. Martin2, The 1000 Genomes Consortium 1) Department of Genetics, Icahn School of Medicine Mount Sinai, New York, NY; 2) Department of Genetics, School of Medicine, Stanford University, California, CA; 3) Department of Human Genetics, McGill University, Montreal, Canada; 4) Department of Genetics, Albert Einstein College of Medicine, New York, NY; 5) Institute of Personalized Medicine, Icahn School of Medicine Mount Sinai, New York, NY; 6) Center of Statistical Genetics, Icahn School of Medicine Mount Sinai, New York, NY; 7) Institute of Genomics and Multiscale Biology, Icahn School of Medicine Mount Sinai, New York, NY.
The phase 3 release of the 1000 Genomes Project includes genotype and sequence data for 2,535 individuals from 26 populations around the globe. These include six populations from the Americas with mixed Native American, European and West African ancestry. We have identified admixture proportions in these six populations, which include African Caribbeans from Barbados (ACB), African Americans from south-west USA (ASW), Colombians from Medellin (CLM), Peruvians from Lima (PEL), Mexican-Americans from Los Angeles (MXL), and Puerto Ricans from Puerto Rico (PUR). We show the presence of Native American, European and African ancestry in all six populations. In particular, we identify six ASW individuals with > 20% Native American ancestry. The European component of these individuals looks most similar to Nordic ancestry, rather than the Spanish ancestry often seen in Hispanic/Latino individuals. Among the ACB, PEL, PUR, CLM and MXL populations, we find an excess of Native American and a dearth of European ancestry on chromosome X compared to the autosomes, indicating a history of non-random mating in these populations. We have also inferred local ancestry tracts (LAT), identifying haplotype-specific segments of ancestry across chromosomes. We assessed the accuracy of our tract calls and demonstrated accuracies >0.99, >0.98 and >0.97 in African, European and Native American tracts, respectively, across all populations. By modeling the distribution of ancestral tract lengths, we inferred the timings of migration in the two populations from the Americas that are new to phase 3, ACB and PEL. We estimated that the PEL have had more recent admixture with European and African individuals than other Hispanic/Latino groups in the Caribbean and throughout northern South America, consistent with known migration patterns. These analyses have given us an insight into the demographic history and migration patterns among admixed populations in the 1000 Genomes Project.
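To make the link between tract lengths and admixture timing concrete, here is a rough sketch under a textbook single-pulse approximation, which is not necessarily the model the authors fitted. It assumes that tracts of a given ancestry, measured in Morgans, are roughly exponentially distributed with rate T × (1 − m), where T is the number of generations since admixture and m is that ancestry's genome-wide proportion. The function name and inputs are hypothetical, and chromosome-end effects and multiple migration pulses are ignored.

```python
def estimate_generations(tract_lengths_morgans, ancestry_proportion):
    """Moment estimate of generations since a single admixture pulse.

    Assumes exponential tract lengths with rate T * (1 - m); this is a
    simplifying assumption for illustration only.
    """
    if not tract_lengths_morgans:
        raise ValueError("need at least one tract length")
    if not 0.0 <= ancestry_proportion < 1.0:
        raise ValueError("ancestry proportion must be in [0, 1)")
    mean_length = sum(tract_lengths_morgans) / len(tract_lengths_morgans)
    if mean_length <= 0.0:
        raise ValueError("tract lengths must be positive")
    # Mean of an exponential is 1 / rate, so T = 1 / (mean * (1 - m)).
    return 1.0 / (mean_length * (1.0 - ancestry_proportion))


# Made-up tracts with a mean of 0.2 Morgans from an ancestry at 50%.
toy_generations = estimate_generations([0.1, 0.2, 0.3], 0.5)
```

Shorter tracts imply older admixture, which is why longer European and African tracts in a population point to more recent gene flow.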
Sub-continental local ancestry inference in U.S. individuals. B. K. Maples1,2, J. K. Byrnes3, J. M. Granka3, K. Noto3, S. Shringarpure1, M. L. Carpenter1, M. J. Barber3, R. E. Curtis4, N. M. Myres4, C. A. Ball3, K. G. Chahine4, C. D. Bustamante1 1) Genetics, Stanford University, Stanford, CA; 2) Biomedical Informatics, Stanford University, Stanford, CA; 3) AncestryDNA L.L.C., San Francisco, CA; 4) AncestryDNA L.L.C., Provo, UT.
The United States was populated through a sequence of migratory waves including immigrants from numerous distinct source populations. This “melting pot” process has led to the majority of current U.S. residents being genetically admixed. Understanding this complex genetic diversity is of great interest to the field of population genetics and accounting for it is critical for medical genetics. Numerous methods have been developed for performing local ancestry inference (LAI) in which the ancestry of each genomic locus is estimated, but the majority of these methods are only accurate at the level of continental admixture (e.g. African Americans with ancestry from Africa and Europe). Sub-continental LAI is often more difficult as neighboring populations typically have reduced differentiation. In this study we apply the LAI method RFMix that has been shown to perform well at a sub-continental level (e.g. mixtures of Northern and Southern European ancestry). RFMix is seeded with a reference panel of samples with known origins, and then iteratively learns from a larger collection of test samples. The performance of this method greatly improves with larger reference panel and test sample sizes. Here we use more than 2,000 single-origin reference samples from Ancestry.com and 1000 Genomes, along with over 100,000 research-consented customer samples with admixed origins to train the model to perform inference on individuals with admixed European ancestry. We compare the genome-wide ancestry estimates from RFMix with pedigrees. Using pedigree data as a truth set, we tune the performance of RFMix. We then compare the performance of RFMix with results from the commonly used ancestry estimation method ADMIXTURE run in supervised mode with the same initial single-origin reference panel provided to RFMix. 
We also use single-origin samples to create synthetic admixed samples with known local ancestry patterns to assess the accuracy of RFMix to call individual segments in admixed Europeans. Finally, we apply the highly trained version of RFMix to the National Institute on Aging’s Health and Retirement Study data. We compare county-level geographic summaries of sub-continental ancestry estimates in this data to recent U.S. census data. We find strong evidence of fine-scale population structure with certain localities showing enrichment for particular ancestries (e.g. Irish ancestry in and around Boston and Scandinavian ancestry in the Midwestern states).
Inference of the demographic history of Japan using Approximate Bayesian Computation. C. D. Quinto1, K. R. Veeramah2, A. E. Woerner3, M. F. Hammer3 1) Graduate Interdisciplinary Program in Genetics, University of Arizona, Tucson, Arizona, USA; 2) Department of Ecology and Evolution, Stony Brook University, Stony Brook, NY, USA; 3) Arizona Research Laboratories, Division of Biotechnology, University of Arizona, Tucson, Arizona, USA.
The genetic exchange between differentiated populations, termed admixture, has increasingly been shown to be an important process in human history. The formation of Hispanic populations in the Americas is one of the best-known examples of this phenomenon. Another important but less well-known example is the origin of modern Japanese. At least two distinct incoming migrations are known to have occurred during the prehistory of Japan. The first took place at least 10,000 years ago and established the Jōmon culture, which was characterized by a semi-sedentary hunter-gatherer way of life and one of the earliest uses of ceramics. Then, around 2,300 years ago, a second migration to the archipelago brought rice agriculture and iron, and established the Yayoi culture. The mixture of the people belonging to these two cultures is believed to have formed the ancestors of the modern Japanese population. Although archaeological records provide information about the time of arrival of the Yayoi people to Japan, the dynamics of the admixture process are still unclear. Previous genetic studies, focusing on mitochondrial DNA and the Y chromosome, have supported an admixture model for the origin of the modern Japanese population. While genome-wide data have been used to investigate this question, there are currently no studies that infer the parameters describing the dynamics of the admixture process. Part of the reason for this is that explicit population genetic modeling is problematic when utilizing genome-wide arrays because of the underlying ascertainment bias in the choice of SNPs. To address these issues, we genotype 500,000 SNPs in 282 samples from populations across the Japanese archipelago and East Asia. We then attempt to correct for the ascertainment bias by using whole genome sequencing data to approximate the discovery sample used to ascertain SNPs. We utilize the SNP genotypes from the different populations to identify ancestry blocks in Japanese samples. 
The distribution of these blocks provides insights about the time and proportion of admixture, and this information is used in an Approximate Bayesian Computation analysis to infer other key demographic parameters such as divergence times, migration rates and population sizes.
A Fine-Scale Comparative Analysis of Population Structure, Divergence and Admixture in Han Chinese, Japanese and Korean Populations. S. Xu1, Y. Wang1, D. Lu1, Y. Chung2 1) Population Genomics, CAS-MPG Partner Institute for Computational Biology, Shanghai, Shanghai, China; 2) Integrated Research Center for Genome Polymorphism, Department of Microbiology, The Catholic University Medical College, Socho-gu Seoul 137-701, Korea.
In East Asia, human origins and dispersals remain poorly understood and debated. As the major ethnicities of East Asia, Han Chinese, Japanese and Korean people share many similarities in characteristics, language and culture. However, the genetic relationships, divergence times and subsequent gene flow among the three populations have not been well studied or quantitatively estimated. Here, we conducted a genome-wide study using over 900,000 single nucleotide polymorphisms (SNPs) and evaluated the population structure of 182 unrelated Han Chinese, 90 Japanese and 100 Korean individuals, and compared them with 663 individuals representing 8 world-wide populations. Our analysis revealed that Han Chinese, Japanese and Korean populations have distinct genetic makeup and can be well distinguished based on the genome-wide data, or on a panel of ancestry informative markers (AIMs) screened from genome-wide SNPs, indicating that they have been isolated for a substantially long time. Interestingly, population structure corresponds closely to the geographical distribution of the three populations, indicating that geography was an important factor in population differentiation. We identified a cline of north/south admixture, which is consistent with either a scenario of isolation by distance (IBD) or that of north/south migrations or both. We theorized that both the IBD effect and migrations could have resulted in such a pattern. On the other hand, our analysis revealed patterns of admixture which occurred after the initial splits of populations. We further estimated gene flow among the three populations. We concluded that the genetic structure of the present-day Han Chinese, Japanese and Korean people was shaped jointly by common origin, subsequent gene flow and local adaptation. Our results advance the understanding of the genetic relationship and population history in East Asia.
Analysis of autosomal and Y-chromosomal DNA Suggests West Asian Population Derivation from Northern Middle Eastern Populations in the post-Glacial Period. P. Zalloua1,2, F. Utro3, M. Haber1, L. Parida3, E. Matisoo-Smith3,4, D. Platt3 1) Genomics Laboratory, Graduate School, Beirut, Lebanon; 2) Harvard School of Public Health, Boston, MA, USA; 3) I.B.M. T. J. Watson Research Center, Yorktown Hgts, NY; 4) University of Otago, Dunedin 9054, New Zealand.
Analysis of Y DNA J and E haplogroups in West Asians (Georgians, Armenians, Turks, Syrians, Lebanese, Jordanians, Saudi Arabians, Yemenis, Kuwaitis) suggests expansions coming primarily from the north (Turkey, Georgia, Armenia), with an early differentiation between those who headed south along the Tigris-Euphrates, versus those who headed south along the Levantine coast. We sought to resolve whether southern variations represented evolution within separate ice age refugia, or evolved from the same northern refugia as suggested by Y chromosome data, by testing whether population divergence times between Saudi Arabia and Yemen versus Turkey, Syria, and Armenia predate the post-glacial expansions. We employed IRIS to compute times for grand most recent common ancestors applied between pairs of subjects drawn from Georgians, Armenians, Turks, Syrians, Lebanese, Jordanians, Saudi Arabians and Yemenis, as well as pairwise FST values based on the estimated times. We contrasted these results with raw SNP counts and pairwise FST values obtained from those counts. We applied MDS and hierarchical clustering to identify geographically informative relationships, and observed a clear pattern of a north-to-south gradient. Within the western Middle East, our results suggest population differentiation dates consistent with post Last Glacial expansions, with subsequent population constriction into the Fertile Crescent in the presence of admixture. Our estimates show a north-to-south differentiation time of ~24,800-18,200 y.a., well within the Last Glacial Period. However, the J1/J2 haplogroup splits that mark this diversion are dated by BATWING well into the Last Glacial Period, around 31 kya. These results place the genetic differentiation of the autosomal genome a bit more recent than the J1/J2 split. 
Expansions into Europe show a somewhat more recent record than those into Africa, with signals that show affinities with particular Middle Eastern regions, suggesting more recent trade impacts.
Population Genomics of the South American Andean Region. J. R. Homburger1, A. Moreno-Estrada1, C. R. Gignoux1, E. Sanchez-Rodriguez2, B. A. Pons-Estel3, E. Acevedo4, J. M. Cucho4, P. Miranda5, L. Catoggio6, M. A. García7, G. Berbotto8, A. Babini9, H. Scherbarth10, S. Toloza11, M. Alarcon-Riquelme2, C. D. Bustamante1 1) Department of Genetics, Stanford University, Stanford, CA, USA; 2) Centre for Genomics and Oncological Research (GENYO), University of Granada, Granada, Spain; 3) Sanatorio Parque, Rosario, Argentina; 4) Hospital Nacional Guillermo Almenara Irigoyen, Lima, Peru; 5) Facultad Medicina Occidente, Universidad de Chile, Santiago de Chile, Chile; 6) Hospital Italiano de Buenos Aires, Argentina; 7) H.I.G.A. General San Martin, La Plata, Argentina; 8) Hospital Eva Peron, Granadero Baigorria, Argentina; 9) Hospital Italiano de Córdoba, Córdoba, Argentina; 10) H.I.G.A. Oscar E. Alende, Mar del Plata, Argentina; 11) Hospital Interzonal San Juan Bautista, Catamarca, Argentina.
The South American continent has experienced multiple migration and admixture events. Here, we examine the genetic history of the Andean region using 551 individuals from Colombia, Ecuador, Peru, Chile, and Argentina genotyped on Illumina SNP arrays. Combining these data with individuals from the 1000 Genomes Project and the Population Reference Panel (POPRES), we show that the admixed individuals have varying degrees of Native American and European ancestry. We use ADMIXTURE and principal component analysis (PCA) to study the genetic ancestry of the admixed South American individuals. We show that, on average, the Peruvian individuals have the highest amount of Native American ancestry and the Argentinian individuals have the highest amount of European ancestry among the admixed South American samples. We also find that Andean indigenous groups account for the largest proportion of Native American ancestry in the South American individuals. On the other hand, the largest proportion of European ancestry in admixed individuals is from Southern Europe and the Iberian Peninsula. We aim to estimate the specific timing and the subcontinental origin of ancestral components involved in South American admixture by applying ancestry-specific PCA and tract length analysis to admixed genomes.
Fast individual ancestry inference from DNA sequence data leveraging allele frequencies from multiple populations. O. Libiger1,3, V. Bansal1,2 1) Scripps Translational Science Institute, La Jolla, CA; 2) Department of Pediatrics, University of California San Diego, La Jolla CA; 3) MD Revolution, San Diego, CA.
Estimation of individual ancestry from genetic data is useful for the analysis of disease association studies, understanding human population history and interpreting personal genomic variation. We describe a fast method for estimating the relative contribution of known reference populations to an individual’s genetic ancestry. Our method utilizes allele frequencies from the reference populations and individual genotype or sequence data to obtain a maximum likelihood estimate of the global admixture proportions using the BFGS optimization algorithm. It accounts for the uncertainty in genotypes present in sequence data by using genotype likelihoods instead of genotypes. Unlike previous methods, our method does not require individual genotype data from external reference panels and can utilize allele frequencies estimated from the analysis of homogeneous as well as admixed human populations. Simulation studies and application of the method to real datasets demonstrate that our method is 8-10 times faster than ADMIXTURE and has comparable accuracy. Using data from the 1000 Genomes Project, we show that our method can estimate genome-wide average ancestry of admixed individuals using exome or low-coverage sequence data. Finally, we demonstrate that our method can be used to estimate admixture proportions using pooled sequence data, making it a valuable tool for controlling for population stratification in sequencing-based association studies that utilize DNA pooling.
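The likelihood being maximized here is easy to sketch. The version below is deliberately simplified and is not the authors' implementation: it uses hard genotype calls instead of genotype likelihoods, and it swaps the BFGS optimizer for a plain EM update so that it runs with the standard library alone. The function name, argument layout and the clamping constant are my own assumptions.

```python
def estimate_admixture(genotypes, freqs, iterations=200, eps=1e-6):
    """EM estimate of one individual's global admixture proportions.

    genotypes: alt-allele counts (0, 1 or 2), one per SNP.
    freqs: freqs[k][j] is the alt-allele frequency of SNP j in
           reference population k.
    Model: genotype j ~ Binomial(2, sum_k q[k] * freqs[k][j]).
    """
    k = len(freqs)
    n = len(genotypes)
    if k == 0 or n == 0 or any(len(pop) != n for pop in freqs):
        raise ValueError("freqs must be K lists matching genotypes")
    # Keep frequencies away from 0 and 1 so no division by zero occurs.
    clamped = [[min(max(f, eps), 1.0 - eps) for f in pop] for pop in freqs]
    q = [1.0 / k] * k
    for _ in range(iterations):
        new_q = [0.0] * k
        for j, g in enumerate(genotypes):
            p = sum(q[i] * clamped[i][j] for i in range(k))
            for i in range(k):
                f = clamped[i][j]
                # Expected number of this individual's two allele copies
                # at SNP j that were drawn from population i.
                new_q[i] += g * q[i] * f / p
                new_q[i] += (2 - g) * q[i] * (1.0 - f) / (1.0 - p)
        q = [v / (2.0 * n) for v in new_q]
    return q


# Toy example: the individual is homozygous alt at 20 SNPs where
# population 0 has frequency 0.9 and population 1 has frequency 0.1.
toy_q = estimate_admixture([2] * 20, [[0.9] * 20, [0.1] * 20])
```

EM converges more slowly than a quasi-Newton method such as BFGS, which is presumably part of why the authors chose the latter, but both climb the same likelihood surface.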
How population growth affects linkage disequilibrium. A. Rogers Anthropology, University of Utah, Salt Lake City, UT.
The “LD curve” relates the linkage disequilibrium (LD) between pairs of nucleotide sites to the distance that separates them along the chromosome. It is used to map disease genes and to search for adaptive evolution. But it also responds to the history of population size. The present research describes new theoretical results about the effect of population history. When a population expands in size, the LD curve grows steeper, and this effect is especially pronounced following a bottleneck in population size. When a population shrinks, the LD curve rises but remains relatively flat. As LD converges toward a new equilibrium, its time path may not be monotonic. Following an episode of growth, for example, it declines to a low value before rising toward the new equilibrium. These changes happen at different rates for different LD statistics. They are especially slow for estimates of σ_d², which therefore allow inferences about ancient population history. For the human population of Europe, these results suggest a history of population growth.
The Structure of Linkage Disequilibrium in the Recently Admixed Populations. H. Zhang, J. Jung, B. Grant National Institute on Alcohol Abuse and Alcoholism, National Institutes of Health, Rockville, MD.
Linkage disequilibrium (LD) in human populations is found to be stronger with increasing geographic distance from Africa, which reflects the African origin of humans. Recently admixed populations (such as African Americans and Hispanic Americans) are more likely to harbor a larger number of genetic variants, relative to their inferred ancestral populations. However, the pattern of linkage disequilibrium in these admixed populations is not well studied. Here, we conduct an analysis of linkage disequilibrium at 659,184 single nucleotide polymorphisms (SNPs) in 924 unrelated samples from 11 HapMap3 populations and 24 samples from the Karitiana population (Native Americans in Brazil from the Human Genome Diversity Project). African Americans (ASW) derive their genomic ancestry from Africans and Europeans, with an average of 77.3% African and 20.0% European ancestry. Hispanic Americans (MXL) lie on a cline with an average of 45.5% European ancestry, 42.9% Native American ancestry, 4.9% East Asian ancestry and 4.4% African ancestry. The mean SNP-based haplotype heterozygosity across the whole genome in these two admixed populations is greater than that of their major inferred ancestral populations. We further use r² between all possible SNP pairs in various distance classes as a measure of LD and also focus on the proportion of SNP pairs with r² greater than 0.8. Both admixed populations show intermediate LD (as measured by r² and the proportion of SNP pairs with r²>0.8), compared with their two major inferred ancestral populations. The extent of LD (r²) in African Americans (ASW) is closer to that in the African population (YRI) in the short distance classes, while it becomes more similar to that of European Americans (CEU) as the distance class increases. The amount of LD (r²) in Hispanic Americans (MXL) shows a similar pattern, but it is much closer to that of European Americans (CEU) in all distance classes. 
The findings on the structure of LD in admixed populations are helpful to better understand the evolution of human population and the design of the genetic association studies in admixed populations.
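As a reminder of what the r² statistic in this abstract measures, here is a minimal sketch that computes it for one SNP pair and then the share of pairs above the 0.8 cutoff. It assumes phased haplotypes coded 0/1, which the abstract does not specify (genotype-based estimators also exist); function names and the toy data are hypothetical.

```python
def r_squared(hap_a, hap_b):
    """r^2 between two biallelic SNPs from phased 0/1 haplotypes."""
    n = len(hap_a)
    if n == 0 or n != len(hap_b):
        raise ValueError("haplotype lists must be non-empty and equal length")
    p_a = sum(hap_a) / n
    p_b = sum(hap_b) / n
    p_ab = sum(a * b for a, b in zip(hap_a, hap_b)) / n
    denom = p_a * (1.0 - p_a) * p_b * (1.0 - p_b)
    if denom == 0.0:
        raise ValueError("both SNPs must be polymorphic")
    d = p_ab - p_a * p_b  # classical disequilibrium coefficient D
    return d * d / denom


def fraction_high_ld(snp_pairs, threshold=0.8):
    """Share of SNP pairs whose r^2 exceeds `threshold`."""
    values = [r_squared(a, b) for a, b in snp_pairs]
    return sum(1 for v in values if v > threshold) / len(values)


# Toy haplotypes: one perfectly correlated pair, one independent pair.
linked = ([0, 1, 0, 1], [0, 1, 0, 1])
unlinked = ([0, 0, 1, 1], [0, 1, 0, 1])
toy_fraction = fraction_high_ld([linked, unlinked])
```

Binning pairs by physical distance and averaging r² within each bin gives the distance-class comparison described in the abstract.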
Using linkage disequilibrium to refine estimates of accelerating growth in human populations. M. Reppell1, J. Carlson1, S. Zöllner1,2, The BRIDGES Consortium 1) Department of Biostatistics, University of Michigan, Ann Arbor, MI; 2) Department of Psychiatry, University of Michigan, Ann Arbor, MI.
Correctly modeling the effective size of a population is critical to making accurate inferences about mutation and migration rates, and the strength of selective pressures. In humans, several large sequencing studies have given us novel insight into a genome characterized by an abundance of extremely rare genetic variation, consistent with a history of recent massive population growth. These large sequencing studies offer us unprecedented resolution for distinguishing between models of recent growth. To improve on conventional inference methods we propose a novel likelihood-based approach that incorporates pairwise r², a measure of linkage disequilibrium, in addition to the site frequency spectrum. We observe that over short genetic distances, pairwise r² is a function of the variance in ancestral tree branch lengths, and therefore contains information about ancestral population sizes lacking from the site frequency spectrum, which is a function only of the mean total ancestral branch lengths. Using simulations we show that with large samples, the inclusion of pairwise r² improves the accuracy of demographic inference in populations that have undergone recent growth, relative to methods relying solely on the site frequency spectrum. We quantify how increasing sample size increases the accuracy of inferences about recent demography, and magnifies the improvement our method yields versus conventional approaches. Lastly, we apply our method to regions defined as neutral in whole genome sequence data from ~4,000 European ancestry individuals sequenced as part of the BRIDGES consortium. This dataset has ideal features for our purposes, providing both a large sample and non-coding genetic regions free from evidence of ongoing selection, a mixture unavailable from exome-only or functional sequencing projects. 
We use a Monte Carlo method to estimate the likelihood of the observed data under a range of realistic growth models, including those incorporating continuous, accelerating, faster than exponential growth. With our data we are able to simultaneously make inferences about the mutation rate, μ, and the rate of accelerating growth experienced by the European population from which our sample is drawn.
Forensic Phenotyping in Brazilian population: SLC24A5 and ASIP as phenotypic predictors genes of skin, eye and hair color. C. Fridman, F. A. Lima, F. T. Gonçalves Dept of Legal Medicine, Ethics and Occupational He, University of São Paulo, São Paulo, São Paulo, Brazil.
Pigmentation is a highly variable and complex trait in humans, determined by the interaction of environmental factors, age, disease, drugs, hormones, exposure to ultraviolet radiation and genetic factors, including pigmentation genes. Many of these genes and their variants have been associated with phenotypic diversity of skin, eye and hair color in homogeneous populations. The SLC24A5, TYR, MC1R, SLC45A2, ASIP, OCA2 and HERC2 genes are noteworthy for their important contribution to the pigmentation process. Prediction of phenotypes from genetic information has benefited forensics in many countries because it has made it possible to infer physical characteristics from biological samples and thus guide criminal investigations. The aim of this study was to evaluate polymorphisms in the TYR, ASIP, SLC24A5 and SLC45A2 genes in a sample of 350 individuals from the admixed population of Brazil, with the intention of using the data in forensic genetics casework in several situations. Volunteers answered a questionnaire in which they self-reported their skin, eye and hair colors, sun sensitivity and lifestyle. No significant results were observed except for SLC24A5 and ASIP. Homozygosity for the polymorphic allele of rs1426654 in SLC24A5 (OR 32.88, p<0.0001) and of rs6058017 in ASIP (OR 8.68, p<0.007) showed the strongest association with fairer skin. In addition, homozygosity for the polymorphic allele in SLC24A5 was related to light eye color – green (OR 9.82 – p<0.0001), blond hair (OR 50.14 – p<0.0001) and also to increased sensitivity to sun exposure (OR 7.86 – p<< 0.0002). Our data suggest that the polymorphic allele (A) in the SLC24A5 and ASIP genes is correlated with characteristics of light pigmentation, while the ancestral allele (G) is related to darker traits. Our findings corroborate previously published data from studies in European and African populations. 
These associations between pigmentation genes and skin, eyes and hair color shows that it is possible to use molecular information of an individual to access its phenotypic traits and use the obtained in attempt to help forensic investigations. Additional analyzes are ongoing as part of a project that evaluates 600 samples to check possible associations of phenotypic pigmentation in the Brazilian population with the mentioned genes. Financial Support: FAPESP (2012/02043-6), LIM 40/HCFMUSP and Department of Legal Medicine, Ethics and Occupational Health – FMUSP.
Maternal Age Effect and Severe Germline Bottleneck in the Inheritance of Human Mitochondrial DNA. M. Su1, B. Rebolledo-Jaramillo2, N. Stoler2, J. A. McElhoe3, B. Dickins4, D. Blankenberg2, T. Korneliussen5, F. Chiaromonte6, R. Nielsen5, M. M. Holland3, I. M. Paul7, A. Nekrutenko2, K. D. Makova1 1) Department of Biology, Penn State University, USA; 2) Department of Biochemistry and Molecular Biology, Penn State University, USA; 3) Forensic Science Program, Penn State University, USA; 4) School of Science and Technology, Nottingham Trent University, UK; 5) The Department of Integrative Biology, the University of California at Berkeley, USA; 6) Department of Statistics, Penn State University, USA; 7) Department of Pediatrics, College of Medicine, Penn State University, USA.
The manifestation of mtDNA diseases depends on the frequency of heteroplasmy (the presence of several alleles in an individual), yet its transmission across generations cannot be readily predicted due to the lack of data on the size of the mtDNA bottleneck during oogenesis. For deleterious heteroplasmies, a severe bottleneck may abruptly transform a benign (low) frequency in a mother into a disease-causing (high) frequency in her child. Here we present a high-resolution study of heteroplasmy transmission conducted on blood and buccal mtDNA of 39 healthy mother-child pairs of European ancestry (a total of 156 samples, each sequenced at ~20,000x/site). On average, each individual carried one heteroplasmy, and one in eight individuals carried a disease-causing heteroplasmy, with minor allele frequency ≥1%. We observed frequent drastic heteroplasmy frequency shifts between generations and estimated the size of the bottleneck at only ~29-35 mtDNA molecules. Strikingly, we found a positive association between the number of heteroplasmies in a child and maternal age at fertilization, likely attributable to oocyte aging. Accounting for heteroplasmies, we estimate mtDNA germline mutation rate to be 1.3×10⁻⁸ mutations/site/year – lower than in previous pedigree studies but in agreement with phylogenetic studies, thus solving a long-standing controversy and informing the use of mtDNA in dating evolutionary events. This study takes advantage of droplet digital PCR (ddPCR) to validate heteroplasmies and confirm a de novo mutation. These results have profound implications for predicting the transmission of disease-causing mtDNA variants and illuminate mitochondrial genome evolutionary dynamics.
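To get a feel for why a bottleneck of only ~29-35 molecules produces drastic frequency shifts, the sketch below treats transmission as a single round of binomial sampling of N mtDNA molecules. This is a simplifying assumption for illustration, not the estimation procedure used in the abstract (which is not described there), and the function names are hypothetical.

```python
import math

def shift_sd(maternal_freq, bottleneck_size):
    """Standard deviation of the child's heteroplasmy frequency,
    assuming one round of binomial sampling of `bottleneck_size` molecules."""
    p = maternal_freq
    return math.sqrt(p * (1.0 - p) / bottleneck_size)

def bottleneck_from_variance(maternal_freq, observed_variance):
    """Invert Var = p(1-p)/N to get a rough bottleneck size N
    (same single-sampling assumption as above)."""
    p = maternal_freq
    return p * (1.0 - p) / observed_variance

# A 5% maternal heteroplasmy under tight vs. loose bottlenecks.
for n in (29, 35, 1000):
    print(n, round(shift_sd(0.05, n), 4))
```

Under this toy model the spread of child frequencies scales as 1/sqrt(N), so a bottleneck of a few dozen molecules gives shifts several times larger than one of a thousand.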
An estimate of the average number of recessive lethal mutations carried by humans. Z. Gao1, D. Waggoner2,3, M. Stephens2,4, C. Ober1,2,5, M. Przeworski6,7 1) Committee on Genetics, Genomics and Systems Biology; 2) Dept of Human Genetics; 3) Dept of Pediatrics; 4) Dept of Statistics; 5) Dept of Obstetrics and Gynecology, University of Chicago, Chicago, IL; 6) Dept of Biological Sciences; 7) Dept of Systems Biology, Columbia University, New York, NY.
The effects of inbreeding on human health depend critically on the number and severity of the recessive deleterious mutations carried by an individual. In humans, estimates of the burden of recessive mutations per individual are based either on comparisons between consanguineous and non-consanguineous couples, an approach that confounds socioeconomic and genetic effects, or on carrier screening for disease-causing mutations, which suffers from other biases, notably the highly incomplete catalogue of disease-causing mutations. To circumvent these limitations, we sought to estimate a lower bound of the burden by focusing on recessive lethal disorders in a founder population with almost complete Mendelian disease ascertainment and a known pedigree. By considering all autosomal recessive lethal diseases recognized in the population and simulating allele transmissions along the pedigree, we estimated that each haploid human genome carries on average approximately one autosomal recessive allele that leads to severe disorders at or after birth in homozygous condition. When compared with previous estimates, our result suggests that recessive mutations that are lethal constitute a substantial fraction of the total burden of recessive deleterious mutations in humans.
Inference of mutation rates using hidden relatedness. P. F. Palamara1,2,3, P. Wilton4, M. Fromer5,6, G. Kirov7, S. McCarroll3,6,8, P. Sklar5,9, M. Owen7, S. Purcell5,6,10, M. O’Donovan7, J. Wakeley4, I. Pe’er11, 12 1) Department of Biostatistics, Harvard School of Public Health, Boston, MA, USA; 2) Department of Epidemiology, Harvard School of Public Health, Boston, MA, USA; 3) Program in Medical and Population Genetics, Broad Institute of Harvard and MIT, Boston, MA, USA; 4) Department of Organismic and Evolutionary Biology, Harvard University, Boston, MA, USA; 5) Division of Psychiatric Genomics in the Department of Psychiatry, and Institute for Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, New York, NY, USA; 6) Stanley Center for Psychiatric Research, Broad Institute of Harvard and MIT, Boston, MA, USA; 7) Medical Research Council Centre for Neuropsychiatric Genetics and Genomics, Institute of Psychological Medicine and Clinical Neurosciences, Cardiff University, Cardiff, UK; 8) Department of Genetics, Harvard Medical School, Boston, Massachusetts, USA; 9) Friedman Brain Institute, Icahn School of Medicine at Mount Sinai, New York, NY, USA; 10) Analytic and Translational Genetics Unit, Psychiatric and Neurodevelopmental Genetics Unit, Massachusetts General Hospital, Boston, MA, USA; 11) Center for Computational Biology & Bioinformatics, Columbia University Medical Center, New York, NY, USA; 12) Department of Computer Science, Columbia University, New York, NY.
Reliably estimating the mutation rate in modern humans has several implications for our understanding of demographic history (Scally and Durbin, Nature Reviews Genetics 2012). Recent estimates of the mutation rate obtained using de novo mutations in next-generation sequencing of families, however, were found to disagree with phylogenetic mutation rates derived from fossil evidence, motivating the development of new analytical methods. We describe an approach for the inference of mutation rates based on sharing of identical-by-descent (IBD) segments in sequencing data across purportedly unrelated individuals from a population. Using coalescent theory, we derive theoretical results for the distribution of mutation events found on IBD segments longer than a specified centimorgan threshold, for arbitrary demographic settings, under the SMC and SMC’ models. Leveraging the relationship between the length and the age of shared IBD haplotypes, we devise a method to estimate both genotype error rates and mutation rates. The proposed approach based on hidden relatedness offers a substantial increase in statistical power compared to family-based analysis of de-novo mutations. This gain in power occurs despite the fact that the fraction of genome shared through long (e.g. >1cM) IBD segments across purportedly unrelated individuals is usually small, since IBD regions harbor events which have occurred in the recent past, over tens to hundreds of generations. Furthermore, analysis of de-novo mutations in trio-based studies is limited to genomic regions transmitted through known pedigree relationships, while when accurately phased data is available, mutation events can be analyzed on IBD segments across the quadratically larger set of all pairs of unrelated individuals. 
We validate the proposed methodology using synthetic datasets for a variety of demographic scenarios, and analyze mutation rates in 1246 trio-phased unrelated individuals from a recent exome sequencing study (Fromer et al., Nature 2014) of schizophrenia patients.
Hundreds of shared ‘deletions’ in ancient hominins are polymorphic in modern human populations. D. Radke1,2, C. Lee3, S. Sunyaev1,2 1) Harvard Medical School, Boston, MA; 2) Brigham and Women’s Hospital, Boston, MA; 3) The Jackson Laboratory for Genomic Medicine, Farmington, CT.
Deciphering the genetic uniqueness of modern humans in relation to distant hominins and other primates is one of the central goals of human evolutionary genomics. Recently, with the availability of high-coverage sequence data for both Neanderthal and Denisova, it is now possible to more precisely determine the particular loci responsible for modern human uniqueness. While much of the distinguishing variation may be due to single nucleotide variants, genomic structural variants may also play a crucial role. Structural variants can be a potent phenotype-shaping force, particularly for unbalanced events, such as deletions, as they can alter reading frames and remove regulatory component space. Analyzing sequence read depth across archaic genomes, we find hundreds of ‘deleted’ regions in Neanderthal and Denisova (including many shared deletions), which are polymorphic in modern human populations. Some shared deletions overlap genes, and shared deletions as a set have a significantly higher allele frequency in modern human populations. Because these deletions are polymorphic in modern humans, they may represent regions of modern human-specific insertion, regions lost in archaic human lineages, or deletions polymorphic in both modern and archaic populations.
Convergent mechanisms underlying hypoxia adaptation in Drosophila and Humans. A. R. Jha1,2,3, D. Zhou4,5, C. D. Brown1,2, G. G. Haddad4,5, M. Kreitman1,2,3, K. P. White1,2,3 1) Institute for Genomics and Systems Biology, The University of Chicago, Chicago; 2) Department of Human Genetics, The University of Chicago, Chicago, USA; 3) Department of Ecology and Evolution, The University of Chicago, Chicago, USA; 4) Department of Pediatrics, Division of Respiratory Medicine, University of California at San Diego, San Diego, CA, USA; 5) Rady Children’s Hospital, San Diego, CA, USA.
The ability to withstand low oxygen (hypoxia) is a highly polygenic yet mechanistically conserved trait that has important implications for both human health and evolution. However, little is known about the diversity of genetic mechanisms involved in hypoxia-adaptation in evolving populations. We used experimental evolution and whole-genome sequencing in Drosophila melanogaster to investigate the role of natural variation in adaptation to hypoxia. Using a Generalized Linear Mixed Model we identified significant allele frequency differences between three independently evolved hypoxia-tolerant populations and normoxic controls for ~4000 single nucleotide polymorphisms. Many of these variants are clustered in 66 distinct genomic regions representing long-distance linkage in our populations. These regions are enriched for genes associated with metabolic processes and contain genes that are differentially expressed between hypoxia-tolerant and normoxic populations. Additional genes associated with open tracheal system development and notch signaling pathways also showed evidence of directional selection. Knocking down the gene expression of a handful of candidate genes showed striking enhancement in survival in severe hypoxia, demonstrating their functional relevance in hypoxia adaptation. Using whole genome genotyping data from three high-altitude human populations, namely Sherpas, Tibetans, and Ethiopians, we show that the human orthologs of the genes under selection in flies are also under positive selection in all three high-altitude human populations. Therefore, comparative genomics approaches, such as the one we have taken here, can be powerful in revealing genes and pathways underlying evolutionarily ancient traits that have conserved functions for millions of years.
Evaluating the impact of recent human demography on the frequency spectra using numerical solution of time-inhomogeneous diffusion equation. E. Koch1, J. Novembre2 1) Department of Ecology and Evolution, University of Chicago, Chicago, IL; 2) Department of Human Genetics, University of Chicago, Chicago, IL.
Differences in recent demographic history appear to be an important driver of observed levels of genetic diversity among human populations. Recent attention has particularly centered on how populations that went through the out-of-Africa bottleneck have lower heterozygosity and polymorphic sites that are proportionally more likely to be nonsynonymous or predicted to be damaging. These results have suggested differences in the frequency spectrum of deleterious variation are also caused by varying population demographic histories. To investigate these phenomena in more detail, we perform numerical solutions to time-inhomogeneous diffusion equations for the allele frequency spectrum under the Poisson Random Field Model. This allows us to efficiently examine how the frequency spectra have evolved through time under a large number of possible human demographies and distributions of selective effects. We also are able to easily stratify variation observed today by the age at which the variation was generated. Using these tools, we demonstrate the ability of natural selection and demography to produce observed patterns and evaluate the relative impacts of population bottlenecks, recent growth rates, and changing efficacy of selection on the abundance of different variant types. The results emphasize how human frequency spectra are far from equilibrium and make clearer how frequencies are affected by major human demographic events at different timescales. For instance, in out-of-Africa populations the impacts of the bottleneck on the frequency spectra are still being realized, even as more recent growth events lead to an overlaid influx of rare variants. We quantify these effects and discuss their importance for interpretation of human genetic variation patterns.
The Genetic Architecture of Skin Pigmentation in the Southern African ≠Khomani San. A. R. Martin1, J. M. Granka2, C. R. Gignoux1, M. Lin3, C. Uren4, M. Möller4, C. J. Werely4, J. M. Kidd5, M. W. Feldman2, E. G. Hoal4, C. D. Bustamante1, B. M. Henn1,3 1) Genetics Department, Stanford University, Stanford, CA; 2) Department of Biological Sciences, Stanford University, Stanford, CA, 94305; 3) Department of Ecology and Evolution, SUNY Stony Brook, NY 11794; 4) Division of Molecular Biology and Human Genetics, Stellenbosch University, Tygerberg, South Africa; 5) Department of Human Genetics, University of Michigan, Ann Arbor MI.
Skin pigmentation is one of the most recognizably diverse phenotypes in humans across the globe, but its highly genetic basis has been primarily studied in northern European, Asian, and African American populations. The Eurasian pigmentation alleles are among the most differentiated variants in the genome, suggesting strong selection for light skin pigmentation. Light skin pigmentation is also observed in the far southern latitudes of Africa among KhoeSan hunter-gatherers of the Kalahari Desert. The KhoeSan hunter-gatherers are among the oldest human populations, believed to have diverged from other populations 100,000 years ago, and maintain extraordinary levels of genetic diversity. It is unknown whether light skin pigmentation represents convergent evolution or the ancestral human phenotype. We have collected ethnographic information, pigmentation phenotypes, and genotype data from 136 individuals in the ≠Khomani San from the Kalahari. To understand the genetic basis for light skin pigmentation, we have also exome sequenced 83 ≠Khomani San individuals to high coverage, generating one of the largest indigenous African exome datasets sequenced outside of the 1000 Genomes Project. In this study, ≠Khomani individuals have 11.5% admixture with Europeans and 10.9% admixture with Bantu speakers on average. European ancestry significantly lightens skin and explains 13.3% of the variance in pigmentation, and Bantu ancestry significantly darkens skin and explains 16.1% of the variance in pigmentation on average. We estimate that pigmentation is highly heritable (h2 = 0.887 ± 0.188 standard error) and find that most of the heritability can be explained by 50 known pigmentation genes (0.527 ± 0.310 or 64.1% on average). After controlling for admixture with European and Bantu-speaking populations, a linear mixed model GWAS approach does not identify variants significantly associated with pigmentation. 
However, pigmentation genes are among the most globally differentiated between the ≠Khomani San and European or Bantu individuals, and aggregating differentiation with association data improves power to detect variants influencing selected traits. We identify highly differentiated variants between the ≠Khomani and both European and Bantu populations in multiple canonical pigmentation genes, including OCA2 and MITF. Our results highlight the strength of diverse population studies to explain phenotypic variation impacted by human evolutionary history.
Association study confirms that two OCA2 polymorphisms are involved in normal skin pigmentation variation in East Asian populations. E. Parra, K. Eaton, P. Kavanagh, M. Edwards, S. Krithika Dept Anthropology, Univ Toronto, Toronto, ON, Canada.
The last decade has witnessed dramatic advances in our understanding of the genetic architecture of normal skin pigmentation variation in European populations. However, evidence is much more limited for East Asian populations. Recently, we carried out a study aimed at identifying putative signatures of positive selection in pigmentation candidate genes in populations of East Asian ancestry. Based on the list of genes that show putative signatures of selection in East Asia, we prioritized a number of polymorphisms based on 1/ allele frequency information (e.g. differences in frequency between East Asian and non-East Asian populations) 2/ potential functional effects (e.g. Polyphen, SIFT and CADD scores) and 3/ conservation (e.g. GERP++ scores). The panel of SNPs selected includes 3 markers in the LYST gene (rs3754234, rs7522053 and rs4659610), one marker in the MLPH gene (rs2292881), 2 markers in the OPRM1 gene (rs1799971 and rs6917661), one marker in the EGFR gene (rs2227983), 4 markers in the BNC2 gene (rs9406647, rs3739714, rs10756778 and rs10962591), one marker in the TH gene (rs4930046), 3 markers in the OCA2 gene (rs1800414, rs74653330 and rs7497270), one marker in the TRPM1 gene (rs3809578) and 2 markers in the MC1R gene (rs33932559 and rs885479). We evaluated the association of these polymorphisms with skin pigmentation measured quantitatively using a DSM II colorimeter in a sample comprising 452 individuals of East Asian ancestry. Two previously described nonsynonymous polymorphisms within the OCA2 gene, rs1800414 (His615Arg) and rs74653330 (Ala481Thr) were strongly associated with melanin levels in this sample. Under an additive model, the common rs1800414 G allele, coding for Arginine, is associated with a decrease of 0.9 units in melanin levels. The rs74653330 A allele, coding for Threonine, is present at low frequency in East Asia (around 3% in our sample) and has a stronger effect on melanin levels than rs1800414 (decrease of 1.3 melanin units). 
No significant associations with skin pigmentation were observed for any of the other variants.
Neanderthal Origin of the Haplotypes Carrying the Functional Variant Val92Met in the MC1R in Modern Humans. Q. Ding1, Y. Hu1, S. Xu2, C. Wang1, H. Li1, R. Zhang1, S. Yan1, J. Wang1, L. Jin1,2 1) State Key Laboratory of Genetic Engineering and Ministry of Education Key Laboratory of Contemporary Anthropology, School of Life Sciences, Fudan University, Shanghai, China; 2) CAS-MPG Partner Institute for Computational Biology, Shanghai Institute for Biological Sciences (SIBS), Chinese Academy of Sciences (CAS), Shanghai, China.
Skin color is one of the most visible and important phenotypes of modern humans. Melanocyte-stimulating hormone and its receptor played an important role in regulating skin color. Here we present evidence of Neanderthal introgression encompassing the melanocyte-stimulating hormone receptor gene MC1R. The haplotypes from Neanderthal introgression diverged with the Altai Neanderthal 103.3 KYA, which postdates the anatomically modern human – Neanderthal divergence. We further discovered that all of the putative Neanderthal introgressive haplotypes carry the Val92Met variant, a loss-of-function variant in MC1R that is associated with multiple dermatological traits including skin color and photoaging. Frequency of this Neanderthal introgression is low in Europeans (~5%), moderate in continental East Asians (~30%), and high in Taiwanese aborigines (60-70%). Since the putative Neanderthal introgressive haplotypes carry a loss-of-function variant that could alter the function of MC1R and is associated with multiple traits related to skin color, we speculate that this Neanderthal introgression, together with the previously reported Neanderthal introgression at HYAL2, may have played an important role in the local adaptation of modern Eurasians to sunlight intensity.
Whole genome sequencing to uncover adaptation to high altitude in the Andes. M. Muzzio1,2, K. Slivinski3, M. C. Yee4, T. Cooke5, C. D. Bustamante5, G. Bailliet1, C. M. Bravi1,2, E. E. Kenny3,4,6,7,8 1) Consejo Nacional de Investigaciones Cientificas y Tecnologicas, La Plata, Buenos Aires, Argentina; 2) Facultad de Ciencias Naturales y Museo, Universidad Nacional de La Plata, Argentina; 3) The Charles Bronfman Institute for Personalized Medicine, Icahn School of Medicine at Mount Sinai, NY; 4) Dinneny Lab. Carnegie Institution of Washington. Department of Plant Biology, CA; 5) Stanford University School of Medicine, CA; 6) Department of Genetics & Genomic Sciences, Icahn School of Medicine at Mount Sinai, NY; 7) The Center for Statistical Genetics, Icahn School of Medicine at Mount Sinai, NY; 8) The Institute for Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, NY.
There is interest in human adaptation to a diversity of environments, including finding the genetic basis of phenotypes favorable under pressures such as hypoxia. We have preliminary Illumina Exome Array data on a set of 43 individuals from high altitude villages in the Andes from the Humahuaca area, Argentina (~2500 meters above sea level) and 11 individuals from a neighboring lowland population, Tartagal, Argentina (less than 500 meters above sea level), all with over 90% Native American ancestry estimated using the Admixture software. Currently, we are sequencing full genomes of 10 individuals from each of these populations, in search of new population-specific variants. We will use the population branch statistic (PBS) to identify highly differentiated genomic regions between the highlanders (Andean) and lowlanders (Chaqueños). We will discuss the results of our scan in light of related work on the adaptation of Tibetans, Ethiopians, and other Andean populations to hypoxia.
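For readers unfamiliar with the PBS, here is a minimal sketch of its usual definition: each pairwise FST is log-transformed into a branch length, and the three branch lengths are combined to isolate the branch leading to the target population. The statistic needs three populations; the abstract names only highlanders and lowlanders, so the outgroup argument here is an assumption, as are the function and parameter names.

```python
import math

def branch_length(fst):
    """Log-transformed FST, T = -log(1 - FST)."""
    return -math.log(1.0 - fst)

def pbs(fst_target_sister, fst_target_outgroup, fst_sister_outgroup):
    """Population branch statistic for the target population:
    PBS = (T_target,sister + T_target,outgroup - T_sister,outgroup) / 2."""
    t_ts = branch_length(fst_target_sister)
    t_to = branch_length(fst_target_outgroup)
    t_so = branch_length(fst_sister_outgroup)
    return (t_ts + t_to - t_so) / 2.0

# Toy FST values for one genomic window (not real data).
print(round(pbs(0.30, 0.35, 0.10), 4))
```

A window scores high when the target is strongly differentiated from both other populations while those two remain close to each other, which is the pattern expected from change specific to the target lineage.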
IFNL3/IFNL4 region shows evidence for recent positive selection specific to Asian populations. G. L. Wojcik, C. D. Bustamante Department of Genetics, Stanford School of Medicine, Stanford, CA.
Hepatitis C virus (HCV) is a global health burden, chronically infecting 130-150 million people and causing 350,000-500,000 deaths per year from HCV-related liver disease. Twenty-five years after the discovery of HCV, there is no vaccine and treatment remains ineffective in a large proportion of individuals. Heterogeneity in clinical outcomes such as spontaneous clearance of the virus, as well as sustained virologic response (SVR) after treatment, has been observed between individuals of different genetic ancestry. Previous genetic studies have pinpointed a single nucleotide polymorphism (SNP) in the interferon-λ 3 and 4 (IFNL3/IFNL4) region (rs12979860) as being strongly associated with clinical outcome. While the derived and favorable allele of rs12979860 (C) is present globally, its frequency is greatly differentiated by continent with the lowest in African populations (34-49%), and the highest in Asian populations (89-96%). To determine if these differences are due to selective pressures, data from the phase 3 release of the 1000 Genomes Project (TGP) were analyzed for population-specific signatures of selection. Derived allele frequency (DAF), FST, nucleotide diversity (π), and haplotype structure were examined and compared in populations from Europe, Africa, Asia, and the Americas. A 5 kilobase (kb) region around IFNL3/IFNL4 showed decreased nucleotide diversity, high DAF, and increased haplotype homozygosity in Asian populations. This pattern is not found in Native American populations, suggesting recent positive selection specific to Asia. Historical selective pressures from HCV, or likely a related ancestral virus, may have driven the favorable rs12979860 allele to near fixation. However, Asia currently has disproportionately high HCV-related morbidity and mortality despite this adaptation, suggesting further evolution of the virus. Differences in clinical outcomes within Asian populations may therefore be also due to non-IFNL3/IFNL4 genetic variation.
Further studies are needed to identify additional genetic associations that will better our knowledge of how HCV interacts with the human immune system.
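As an illustration of one of the summary statistics named above, nucleotide diversity (π) is the average number of pairwise differences per site among sampled haplotypes. The sketch below computes it for a toy alignment; the function name and input format (equal-length strings) are assumptions for the example and say nothing about the pipeline used in the abstract.

```python
from itertools import combinations

def nucleotide_diversity(haplotypes):
    """Average pairwise differences per site across all haplotype pairs."""
    n_sites = len(haplotypes[0])
    pairs = list(combinations(haplotypes, 2))
    total_diffs = sum(
        sum(a != b for a, b in zip(h1, h2)) for h1, h2 in pairs
    )
    return total_diffs / (len(pairs) * n_sites)

# Toy region: low diversity is what a recent sweep would leave behind.
swept = ["ACGTACGT", "ACGTACGT", "ACGTACGA"]
neutral = ["ACGTACGT", "ACCTACGA", "TCGTAGGT"]
print(nucleotide_diversity(swept), nucleotide_diversity(neutral))
```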
A genome-wide natural selection scan using 1000 high-coverage, Alzheimer’s-specific whole-genome sequences. M. Ebbert1, H. Smith1, T. Dawson1, S. Grossman2, M. Norton3, J. Tschanz3, R. Munger4, C. Corcoran5, P. Ridge1, J. Kauwe1, ADNI 1) Department of Biology, Brigham Young University, Provo, UT; 2) Broad Institute of MIT and Harvard, Cambridge, MA; 3) Department of Family Consumer and Human Development, Utah State University, Logan, Utah; 4) Department of Nutrition, Dietetics, and Food Sciences, Utah State University, Logan, Utah; 5) Department of Mathematics and Statistics, Utah State University, Logan, Utah.
Natural selection studies have impacted genetic research and our understanding of human adaptations, including malaria resistance, skin pigmentation, and others. More recently, Grossman et al. discovered adaptations to bacterial response and specific human phenotypes by performing a genome-wide selection scan using the 1000 Genomes data—identifying specific adaptive mutations without foreknown, adaptive phenotypic traits. This scanning approach successfully reversed the study type from a hypothesis-driven to a hypothesis-generating study. While the genome-wide scan was successful, there are potential limitations: (1) the 1000 Genomes data has only 179 whole-genome sequences; (2) the sequences were low coverage (2-6x average coverage); and (3) genotypes for the 1000 Genomes data may be inaccurate due to low coverage and because they were not genotyped using modern ‘joint-calling’ algorithms. We are performing an updated analysis including 1000+ Alzheimer’s-specific, whole-genome sequences with 37x average coverage. Our data set includes 152 Alzheimer’s disease (AD) cases and 211 ‘super controls’. The ‘super controls’ are APOE ε4 positive individuals aged 75+ that do not exhibit AD symptoms. Using our large, high-coverage data set, we will explore whether larger sample size and deeper coverage reveals previously undiscovered loci under selection. We will also explore whether using an AD-specific data set will enhance selection signals related to AD under the premise that AD-related loci are known to be under selection. As such, AD may be the result of a conflicting pleiotropic effect of an otherwise beneficial genotype. After joint calling all samples using GATK’s HaplotypeCaller, we will perform a genome-wide natural selection scan using the Composite of Multiple Signals (CMS) algorithm on our data set to identify specific loci under selection. 
These results will be compared to Grossman et al.’s previous results to determine whether any new loci show evidence of selection and whether any previously identified regions were eliminated (potential false positives). Previous and newly identified loci will be examined for potential AD implications based on known disease associations and functional annotations. Top candidates will be tested using an association test. Natural selection studies reveal important genetic artifacts for observed phenotypes. Many AD-related genes are under selection and there are likely other undiscovered AD-related genes.
Evolutionary history of pigmentation candidate gene diversity in a Melanesian population. H. Norton, E. Werren Department of Anthropology, University of Cincinnati, Cincinnati, OH.
Skin, hair, and eye pigmentation are complex phenotypic traits determined by multiple loci. Human skin pigmentation is a trait that is believed to have evolved under strong natural selection in response to varying levels of ultraviolet radiation (UVR) intensity. Lighter skin color has evolved multiple times in human evolutionary history, but it is unclear if the darker skin color observed in many high UVR populations is also the result of evolutionary convergence (suggesting that population-specific mutations may have been favored by positive selection) or if instead ancestral variants associated with darker skin color have been maintained in high-UVR populations via purifying selection. To begin to address this question we compare DNA sequence variation from multiple pigmentation candidate genes in a Melanesian population to variation observed in European, East Asian, and African populations sequenced in the 1000 Genomes Project. Summaries of the site frequency spectrum, including Tajima’s D (TD), for three genes, ASIP, OCA2, and TYRP1, do not indicate that any of these genes were targeted by positive selection in the Melanesian population (ASIP TD = 0.037, OCA2 TD = -0.85, TYRP1 TD = -0.55). With the exception of a single novel haplotype in the OCA2 locus observed at a frequency of ~10% there is little evidence that Melanesians exhibit any high frequency population-specific haplotypes at these loci, suggesting that if an independent adaptation to high UVR conditions occurred in Melanesians then other pigmentation loci are responsible.
However, there is also little evidence that Melanesians are similar to Africans at these loci, which one might expect if Melanesians share ancestral haplotypes with other high UVR populations: pairwise FST estimates between Melanesians and Africans for the pigmentation loci examined here range from 0.043-0.443, and the majority of Melanesian haplotypes are common haplotypes shared between Africans, Europeans, and East Asians. We explore these patterns of sequence variation and inter-population divergence at pigmentation loci in the context of evolutionary models for pigmentation change in the human species and with consideration to Melanesian population history.
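Since the argument above leans on Tajima’s D values close to zero, a short sketch of how the statistic is computed may help. It compares two estimates of diversity, the mean pairwise difference (π) and the number of segregating sites scaled by a sample-size constant, and divides by the expected standard deviation of their difference. The function name and argument order are assumptions for this example.

```python
import math

def tajimas_d(n, seg_sites, mean_pairwise_diff):
    """Tajima's D from sample size n, number of segregating sites S,
    and mean number of pairwise differences (pi, not per-site)."""
    if seg_sites == 0:
        raise ValueError("Tajima's D is undefined when there are no segregating sites")
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / (i * i) for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / (a1 * a1)
    e1 = c1 / a1
    e2 = c2 / (a1 * a1 + a2)
    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    return (mean_pairwise_diff - seg_sites / a1) / math.sqrt(variance)

# Toy input: 10 sequences, 16 segregating sites, pi = 3.89.
print(round(tajimas_d(10, 16, 3.888889), 3))
```

Strongly negative values indicate an excess of rare variants (as after a sweep or population growth), while values near zero, like those reported for ASIP, OCA2, and TYRP1, are what neutrality at constant population size would predict.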
Inference of the strength of purifying selection based on haplotype patterns. D. Ortega Del Vecchyo1, K. E. Lohmueller1,2, J. Novembre3 1) Interdepartmental Program in Bioinformatics, University of California, Los Angeles, CA; 2) Department of Ecology and Evolutionary Biology, University of California, Los Angeles, CA; 3) Department of Human Genetics, University of Chicago, IL.
The strength of purifying selection is a central factor underlying levels of genetic diversity in a population and is important to characterize to understand the expected genetic architecture of disease traits. Recent sequencing studies with large sample sizes have revealed a much higher proportion of non-synonymous variants among rare versus common variants in human populations. This finding suggests that natural selection is acting against such variants to keep them at low frequencies in the population. To estimate the strength of purifying selection, we have developed a method that uses the lengths of pairwise haplotype identity among rare-variant-carrying haplotypes. Unlike previous approaches, our method conditions on the present-day frequency of the allele and is based on the intuition that alleles under purifying selection are on average younger than neutral alleles and, therefore should have higher average levels of haplotype identity among variant carriers. To obtain the probability distribution on the lengths of pairwise haplotype identity, one needs to perform two integrations: one over all possible allele frequency trajectories and another one over all pairwise coalescent times given a certain allele frequency trajectory. The integration over the space of possible allele frequency trajectories is done using a fast importance-sampling algorithm while the integration over the coalescent times is done using an analytical solution. Using the probability of the lengths of the haplotypes under different selective coefficients, we can calculate the likelihood for a selective coefficient for a single variant or set of variants. We use simulations to test how accurately the method estimates the selective coefficient under different demographic scenarios, such as a constant population size and a realistic model of European population growth. 
Variants with the same selective coefficient are harder to differentiate from neutral variants in scenarios of recent population growth. These methods will be applied to a set of 202 drug target genes sequenced in 14,002 individuals (Nelson et al, 2012, Science) to identify which genes are most likely to harbor damaging variants that may predispose to disease.
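The two nested integrations described in this abstract can be written schematically as below. The notation is an assumption introduced here, not taken from the abstract: L is a pairwise haplotype identity length, H an allele frequency trajectory, t a pairwise coalescence time, f the present-day allele frequency and s the selection coefficient. Treating pairs as independent in the product is a further simplifying assumption.

```latex
P(L \mid s, f) \;=\; \int_{H} P(H \mid s, f)
  \left[ \int_{0}^{\infty} P(L \mid t)\, P(t \mid H)\, dt \right] dH ,
\qquad
\mathcal{L}(s) \;=\; \prod_{k} P(L_k \mid s, f)
```

In this reading, the outer integral over H is the part the abstract handles by importance sampling, and the inner integral over t is the part solved analytically.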
Highlighting strongly differentiated regions using three high coverage genomes each from a set of worldwide human populations. L. Pagani1,2,3, T. Kivisild1 1) Division of Biological Anthropology, University of Cambridge, Cambridge, Cambridgeshire, United Kingdom; 2) The Wellcome Trust Sanger Institute, CB10 1SA, Hinxton, UK; 3) Molecular Anthropology Lab, Department of Biological Geological and Environmental Sciences, University of Bologna, Italy.
Following the steady reduction in sequencing costs, several international projects will shortly make available sets of 2-4 high coverage genomes each from hundreds of worldwide human populations. While these resources allow for refining the demographic histories of the studied populations, little can be done to detect signatures of differentiation, possibly driven by natural selection, in these populations. The selection scan methods available to date focus on various genomic components (SNPs, haplotypes, LD blocks), yet rely on genome frequencies rather than on the full sequence information. Here we show that the top 1% of genic regions analysed using only three genomes each from two populations (CEU and YRI) contains as many as 25% of the top 5% FST candidates obtained using 160 low coverage individuals from the 1000 Genomes Project. The three genomes from each chosen population are combined in three pairs, and FST based on average pairwise differences is calculated between populations. The average FST is computed on a sliding window of 10,000 or 50,000 bp across all the pop1-pop2 sets of genomic pairs. The top 1% of windows showing the highest differentiation were selected and inspected for their gene content. Of the 1785 genes identified by the FST scan based on the 160 low coverage individuals (taken as the gold standard), 98 were found among the 439 genes included in the top 1% of 50,000 bp windows of the YRI-CEU pairs. This 2.4-fold enrichment was significant by a chi-squared test (p = 10^-19). The empirical ranking nature of the gold standard did not allow a formal assessment of the false positive rate of our newly developed method. However, the overlap between the top genes retrieved using the 10,000 and 50,000 bp windows showed a significant enrichment in high ranking FST signals.
In summary, the proposed approach based on three genomes per population is capable of retrieving at least 25% of the genes under putative natural selection found by traditional methods. Ongoing power assessment will also inform on the optimal number of high coverage genomes per population required to further reduce the false positive rate. These promising results, given the limitations imposed by the small sample sizes, make our method suitable for application to newly sequenced populations (expected to be released in mid-June 2014, during the SMBE conference).
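As a rough illustration of FST computed from average pairwise differences in windows, here is a minimal Python sketch. It is not the authors' code. It assumes the estimator FST = 1 - pi_within / pi_between (the abstract does not say which estimator was used), treats each genome as a single haploid string, and uses non-overlapping rather than sliding windows. All function names are hypothetical.

```python
from itertools import combinations, product


def pairwise_diff(a, b):
    """Number of sites at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def fst_pairwise(pop1, pop2):
    """FST = 1 - pi_within / pi_between (assumed estimator, see lead-in)."""
    # Within-population pairs: with three genomes this gives three pairs each.
    within = [pairwise_diff(a, b)
              for pop in (pop1, pop2)
              for a, b in combinations(pop, 2)]
    # Between-population pairs: every pop1 genome against every pop2 genome.
    between = [pairwise_diff(a, b) for a, b in product(pop1, pop2)]
    pi_w, pi_b = mean(within), mean(between)
    return 0.0 if pi_b == 0 else 1.0 - pi_w / pi_b


def windowed_fst(pop1, pop2, window):
    """FST in consecutive non-overlapping windows of `window` sites."""
    length = len(pop1[0])
    results = []
    for start in range(0, length - window + 1, window):
        sl = slice(start, start + window)
        results.append((start, fst_pairwise([s[sl] for s in pop1],
                                            [s[sl] for s in pop2])))
    return results
```

Ranking the windows by the returned FST and keeping the top 1% would mirror the selection step described in the abstract.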
Searching for soft selective sweeps in worldwide human populations. Z. A. Szpiech1, R. H. Hernandez1,2,3 1) University of California, San Francisco, San Francisco, CA; 2) Institute for Human Genetics, University of California, San Francisco, San Francisco, CA; 3) Institute for Quantitative Biosciences (QB3), University of California, San Francisco, San Francisco, CA.
There is ample debate about the strength and mode of natural selection that has occurred in recent human evolution. This is particularly so for classical hard sweeps, during which an adaptive allele quickly drags a single haplotype to high frequency. An alternative model of adaptation involves soft sweeps, whereby multiple haplotypes are brought to high frequency (i.e. when a previously segregating neutral or slightly deleterious allele becomes adaptive in a new environment). Existing haplotype-based tests—such as the integrated haplotype score (iHS) that scans for positive selection by tracking the decay of haplotype homozygosity—work under the assumption that a positively selected region will be dominated by a single haplotype. However, iHS is expected to lose power under a soft sweep. Here we develop a statistic, inspired by iHS and recent work in Drosophila population genetics, designed to detect recent soft sweeps by tracking the decay of homozygosity of multiple haplotypes away from a core locus. We evaluate our statistic with rigorous simulations under multiple realistic models of human demography. We find that it has high power to detect both hard and soft sweeps and has improved power compared to iHS. In particular, for a fixed selection coefficient, our simulations suggest that we have greatest power to detect soft sweeps in African populations, which have been understudied to date. We apply this statistic and iHS to a large human genotype dataset of 1,728 unrelated individuals spanning 20 worldwide populations from the 1000 Genomes Project. A large number of regions identified by our statistic are not identified by iHS, in particular in African populations. This suggests a possibly important role of soft sweeps in recent human evolution.
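To make the idea of tracking homozygosity of multiple haplotypes concrete, here is a small Python sketch. It is not the statistic developed in the abstract, whose definition is not given. It assumes the H1 and H12 haplotype homozygosity measures from the Drosophila literature, where H12 pools the two most frequent haplotypes, and it computes them over all haplotypes rather than conditioning on the core allele as iHS does. Function names are hypothetical.

```python
from collections import Counter


def haplotype_homozygosity(haplotypes):
    """Return (H1, H12) for a list of haplotype strings.

    H1 is the sum of squared haplotype frequencies. H12 pools the two most
    frequent haplotypes before squaring, so it stays high when two
    haplotypes have both risen in frequency.
    """
    n = len(haplotypes)
    freqs = sorted((c / n for c in Counter(haplotypes).values()), reverse=True)
    h1 = sum(p * p for p in freqs)
    if len(freqs) < 2:
        return h1, h1
    h12 = (freqs[0] + freqs[1]) ** 2 + sum(p * p for p in freqs[2:])
    return h1, h12


def homozygosity_decay(haplotypes, core, max_dist):
    """(distance, H1, H12) for windows reaching `distance` sites each side of `core`."""
    out = []
    for d in range(max_dist + 1):
        lo, hi = max(0, core - d), core + d + 1
        out.append((d,) + haplotype_homozygosity([h[lo:hi] for h in haplotypes]))
    return out
```

With two equally common haplotypes, H1 drops to 0.5 while H12 remains 1.0, which is the kind of contrast a soft-sweep statistic would be expected to exploit.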
Natural selection at the melanocortin-3 receptor gene loci. I. Yoshiuchi Dept Diabetes Mellitus and Medicine, Yoshiuchi Medical Diabetes Institute, Kamakura, Kanagawa, Japan.
Obesity is significantly associated with type 2 diabetes mellitus, metabolic syndrome, hypertension, stroke, and cardiovascular diseases. The worldwide prevalence of obesity is increasing steadily. Obesity is a highly heritable disease that causes serious health problems. During the traditional cycles of feast and famine, natural selection of obesity-related genes would be significant because these genes control body weight and fat levels. Human adaptation to environmental changes in food supply, lifestyle, and geography may have influenced the selection of genes associated with the metabolism of glucose, lipids, carbohydrates, and energy. The melanocortin-3 receptor (MC3R) gene is one of the obesity-associated genes, and MC3R mutations have been shown to be associated with obesity. MC3R-deficient mice showed increased fat mass. Here, we aimed to uncover evidence of selection at the MC3R gene loci. We used a three-step approach to detect selection at the MC3R gene loci using the HapMap population data: Wright’s F statistics as a measure of population differentiation, the long-range haplotype test to test extended haplotypes, and the integrated haplotype score test. The integrated haplotype score test detected a signal of natural selection at the MC3R gene loci in the African population. This finding provides evidence of natural selection at the MC3R gene loci. Further discoveries are warranted on the adaptive evolution of obesity-associated genes.
Positive selection in smallpox associated genes among Mesoamericans. O. A. Garcia1, K. Arslanian2, D. Whorf1, M. Shriver3, L. G. Moore4, T. Brutsaert5, A. W. Bigham1 1) Department of Anthropology, University of Michigan, Ann Arbor, MI; 2) Department of Anthropology, Yale University, New Haven, CT; 3) Department of Anthropology, Penn State University, University Park, PA; 4) Department of Obstetrics and Gynecology, University of Colorado, Aurora, CO; 5) Department of Exercise Science, Syracuse University, Syracuse, NY.
During the colonization of Mesoamerica, one of the major causes of death was the introduction of novel infectious diseases. Among the most lethal infectious diseases was smallpox. Therefore, studying signatures of natural selection in genes related to smallpox infection and immune response not only provides a window to our evolutionary past but is also a particularly attractive strategy to identify host factors for modern infectious disease. To characterize host risk factors within Mesoamerican populations, we interrogated 906,600 SNPs assayed using the Affymetrix 6.0 genotyping array for signatures of natural selection in 231 immune response genes. Our populations included Mesoamericans (25 Maya and 14 Nahua, Mixtec, and Tlapanec speakers from Mexico) and Andeans (25 Aymara from Bolivia and 24 Quechua from Peru). Additionally, we used available data from 60 Europeans of northern European ancestry and 90 East Asians from China and Japan. We applied three statistical tests to identify signatures of natural selection: locus specific branch length (LSBL), the natural log of the ratio of heterozygosities (lnRH), and Tajima’s D. Furthermore, we analyzed partial and hard sweeps with two haplotype tests: integrated haplotype score (iHS) and cross population extended haplotype homozygosity (XP-EHH). We determined statistical significance based on an empirical distribution. Among our strongest results for positive selection were CD74, ZAP-70, and IKZF1, which were significant in all the statistical tests at the 5% and 1% levels in comparisons of Mesoamericans with East Asians and European Americans. Furthermore, they were statistically significant in comparison to the Andean populations. CD74 is the major histocompatibility complex class II (MHC II) invariant chain. Studies have shown that the CD74 protein functions as a receptor for the cytokine MIF, a critical immune response factor. ZAP-70 is an integral part of the T-cell signaling pathway, thereby regulating the adaptive immune response.
Several studies have shown CD74 and ZAP-70 expression to be correlated. IKZF1 has mostly been studied in autoimmune disorders as part of the pathway regulating haematopoiesis. The results of this study will aid future studies by pinpointing candidate genes for infectious disease susceptibility and resistance in Mesoamerican populations.
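For readers unfamiliar with LSBL, the following Python sketch shows the standard three-population calculation from pairwise FST values, plus an empirical p-value taken from a genome-wide distribution. It is an illustration under stated assumptions rather than the authors' pipeline: the abstract does not say how its empirical distribution was built, and the function names are hypothetical.

```python
def lsbl(fst_ab, fst_ac, fst_bc):
    """Locus-specific branch lengths for populations A, B and C.

    Each branch length is the part of the pairwise FST distances that is
    attributable to one population in an unrooted three-population tree.
    """
    a = (fst_ab + fst_ac - fst_bc) / 2
    b = (fst_ab + fst_bc - fst_ac) / 2
    c = (fst_ac + fst_bc - fst_ab) / 2
    return a, b, c


def empirical_p(value, genome_wide):
    """Fraction of genome-wide values that are at least as large as `value`."""
    return sum(v >= value for v in genome_wide) / len(genome_wide)
```

A SNP would then be flagged at the 5% or 1% level when `empirical_p` of its LSBL for the population of interest falls below 0.05 or 0.01.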
Selection and reduced population size cannot explain higher amounts of Neanderthal ancestry in East Asian than European human populations. B. Kim1, K. Lohmueller1,2 1) Ecology and Evolutionary Biology, University of California Los Angeles, Los Angeles, CA; 2) Interdepartmental Program in Bioinformatics, University of California Los Angeles, Los Angeles, CA.
Understanding the Neanderthal ancestry of modern humans may provide crucial insights into the evolution of different human populations. It is believed that Neanderthals admixed with European and Asian populations to a much greater degree than with African populations. Additionally, recent studies show a higher frequency of Neanderthal alleles in East Asians relative to Europeans. Several hypotheses to explain this difference have been proposed. One hypothesis posits that there was a single admixture event in the population ancestral to modern Europeans and East Asians and that many of the Neanderthal alleles were weakly deleterious in modern humans. Because East Asians have historically had smaller population sizes than Europeans, purifying selection may have been less effective at removing the Neanderthal alleles from East Asian populations, leading to the observed higher proportion of Neanderthal ancestry in East Asians. Here we test this hypothesis using forward-in-time population genetic simulations. These simulations include plausible models of European and East Asian population history which have been estimated from data as well as models of the fitness effects of Neanderthal alleles in humans that include different dominance scenarios and a distribution of selection coefficients. Starting with the same amount of Neanderthal ancestry in both populations, we find that the differences in population size between European and East Asians combined with purifying selection cannot lead to the observed increase in the amount of Neanderthal ancestry in East Asian populations. Furthermore, when starting with the same initial amount of Neanderthal ancestry in both populations, realistic population size changes alone are insufficient to decrease or increase the Neanderthal ancestry in one population relative to the other. The observed data must be explained by some other process, such as additional waves of Neanderthal admixture into East Asian populations.
Asian diversity project: a survey of population structure and local adaptations in Asian populations. X. Liu1,2, D. Lu3, W. Y. Saw1, T. H. Ong1, C. Simmons4, P. Suriyaphol5, S. Tongisma6, B. P. Hoh7, N. Kato8, Y. Y. Teo1,9 1) Saw Swee Hock School of Public Health, National University of Singapore, Singapore; 2) NUS Graduate School, National University of Singapore, Singapore; 3) Max Planck Independent Research Group on Population Genomics, Chinese Academy of Sciences and Max Planck Society Partner Institute for Computational Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China; 4) Oxford University Clinical Research Unit, Hospital for Tropical Diseases, Ho Chi Minh City, Viet Nam; 5) Division of Bioinformatics and Data Management for Research, Mahidol University, Bangkok, Thailand; 6) Genome Institute, National Center for Genetic Engineering and Biotechnology, Pathumtani, Thailand; 7) Institute of Medical Molecular Biotechnology (IMMB), Faculty of Medicine, Universiti Teknologi MARA (UiTM) Malaysia, Sg Buloh, Selangor, Malaysia; 8) Department of Gene Diagnostics and Therapeutics, Research Institute, National Center for Global Health and Medicine, Tokyo, Japan; 9) Department of Statistics and Applied Probability, National University of Singapore, Singapore.
As the largest continent on Earth, Asia hosts more than 60% of the world's human population. Great genetic diversity exists in the Asian populations. The HUGO Pan-Asian SNP consortium provided a valuable genetic resource of Asian populations and performed a thorough survey of genetic diversity and population history of Asian populations. However, the sparse coverage of SNPs made the analysis of natural adaptation difficult to perform. In this study, we collected dense genotyping data from 46 populations across Asia. More than 4093 individuals from East Asia, Central Asia, Southeast Asia and South Asia were genotyped on various genotyping platforms. Principal components analysis (PCA) and admixture analysis were performed to elucidate the population structure in ADP populations. Geography was found to play an important role in shaping the population structure of Asian populations, and the ADP populations were further grouped into East Asian, Central Asian, Southeast Asian and South Asian subgroups. We performed a genome-wide scan of positive selection signals in the ADP populations using iHS, XP-EHH and haploPS. A total of 669 candidate selection regions were detected across the 46 ADP populations. A PCA of the selection signals was performed to investigate the degree of sharing of the selection signals in the 46 populations. The clustering of populations by selection signals resembles the clustering inferred from population structure analysis. East and Southeast Asian groups share the largest number of selection signals, and the South Asian group possesses selection signals distinct from those of the rest of the Asian populations. For selection signals shared by multiple populations, we studied the origin of the selection, i.e., whether the selection originated from a single mutation in the common ancestor followed by subsequent gene flow, or was the result of convergent evolution, where the selection emerged separately from multiple mutation events. The origins of positive selection signals were investigated by calculating the haplotype similarity index. The haplotype similarity index identified 36 selection regions under convergent evolution, and most of them involve aboriginal populations from Southeast Asia.
The pleiotropic effects of EDARV370A in an admixed Uyghur population. Q. Peng1, J. Li1, J. Tan2,3, Y. Yang2,3, Y. Guan4, L. Zhang4, Y. Jiao4, P. Sabeti5,6, L. Jin1,2,3, S. Wang1 1) CAS-MPG Partner Institute for Computational Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China; 2) MOE Key Laboratory of Contemporary Anthropology, School of Life Sciences and Institutes of Biomedical Sciences, Fudan University, Shanghai, China; 3) CMC Institute of Health Sciences, Taizhou, Jiangsu Province, China; 4) Department of Biochemistry, Preclinical Medicine College, Xinjiang Medical University, Urumqi, Xinjiang, China; 5) The Broad Institute of Harvard and MIT, Cambridge, USA; 6) Center for Systems Biology, Department of Organismic and Evolutionary Biology.
An adaptive variant of the human Ectodysplasin receptor, EDARV370A, showed one of the strongest signals of recent positive selection in genome-wide scans. In transgenic mice and in humans, EDARV370A has been found to affect ectodermal-related phenotypes, including hair thickness and shape, active sweat gland density, and teeth formation. However, previous human studies were all based on East Asian populations, in which the frequency of the ancestral allele 370V is low. It is inconclusive whether the genetic model of EDARV370A is additive or dominant. The lack of power was due to the low presence of 370V homozygotes, which also made it impractical to explore a large spectrum of potentially affected ectodermal-related phenotypes. In this study, we took advantage of a population admixed between East Asians and Europeans, the Uyghur, to investigate the pleiotropic nature and the genetic model of EDARV370A. By examining a series of ectodermal-related phenotypes and the EDARV370A genotype in 294 Uyghur samples, we replicated the previous association findings for incisor shoveling (P=5.76×10^-12) and hair straightness (P=3.37×10^-3), and further confirmed that the association follows an additive genetic model. We also found EDARV370A associated with novel phenotypes including higher total sweat gland density (P=0.03) and triangular earlobes (P=2.05×10^-4). By revealing more pleiotropic effects of EDARV370A and confirming its genetic model, our study provides a more complete picture of the adaptive evolution of EDARV370A in human history.
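The distinction between an additive and a dominant model comes down to how genotypes are coded before regression. The Python sketch below shows the two codings and compares their least-squares fit. It is a generic illustration, not the analysis used in the study, and the phenotype values in the usage note are invented toy numbers with no connection to the Uyghur data.

```python
def code_genotype(n_derived, model):
    """Code a genotype (count of 370A alleles: 0, 1 or 2) under a genetic model."""
    if model == "additive":
        return n_derived                  # each copy adds the same effect
    if model == "dominant":
        return 1 if n_derived > 0 else 0  # one copy is enough for the full effect
    raise ValueError("unknown model: " + model)


def rss(xs, ys):
    """Residual sum of squares of the least-squares line y = a + b * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx if sxx else 0.0
    a = my - b * mx
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


def compare_models(genotypes, phenotypes):
    """Return the RSS under each coding; the lower value fits better."""
    return {model: rss([code_genotype(g, model) for g in genotypes], phenotypes)
            for model in ("additive", "dominant")}
```

For toy genotypes `[0, 0, 1, 1, 2, 2]` with toy phenotypes `[1.0, 1.0, 2.0, 2.0, 3.0, 3.0]`, the additive coding fits exactly while the dominant coding leaves residual variance. Telling the two apart requires enough individuals in all three genotype classes, which is the power problem the abstract describes for East Asian samples.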
A hidden Markov framework to estimate the timing of selection for hard sweeps. J. Smith1, M. Stephens2, M. Przeworski3, G. Coop4, J. Novembre2 1) Department of Ecology and Evolution, University of Chicago, Chicago, IL; 2) Department of Human Genetics, University of Chicago, Chicago, IL; 3) Department of Biological Sciences, Columbia University, New York, NY; 4) Department of Evolution and Ecology, University of California–Davis, Davis, CA.
Dispersal across the globe has resulted in humans occupying a wide range of ecological habitats. Natural selection seems to have played a role in this process, as current methods have identified a number of well supported loci that have undergone a recent selective sweep. In some cases, comparing estimates for the timing of selection with events in the historical/archeological record can provide a more clear picture of the ecological context driving adaptation in a population. For example, an overlap between cultural shifts towards dairy food production with the timing of selection on the lactase persistence allele has helped evaluate a possible cause for the observed selective sweep. As a result, there is substantial interest in methods to infer the age of a positively selected allele. A key principle for allele age estimation is that due to recombination and mutation, the signature of a selective sweep decays at a constant rate per generation. Current methods to estimate the age of selective sweeps either rely on a heuristic estimate of the length of the selected haplotype or employ a simulation-based framework to identify the distribution of ages that produce the observed summary statistics of the complete data. In practice the confidence intervals for these estimates are large. Here, we provide methods for inferring the ancestral haplotype of the selected allele and the recombination breakpoints off of this haplotype in order to provide more refined estimates of allele age. We do so using a hidden Markov model framework which allows us to integrate over uncertainty in recombination breakpoints. This framework uniquely uses both the present day length distribution of the ancestral haplotype and the number of derived mutations to estimate the number of generations since the sweep occurred. The joint use of haplotype lengths and derived mutations increases the total number of observed events and provides more narrow confidence intervals for the age estimate. 
Using this joint estimator on simulated data, 95% quantiles for estimates of sweep ages from 400 to 500 generations are within 35 generations of the true value, whereas estimates based on derived mutations or haplotype lengths alone give 95% quantiles ~70 generations from the true value. Future applications will revisit the timing of selection for lactase persistence in Northern Europeans, skin pigmentation alleles in Europe and Asia, and malaria resistance at the G6PD locus in Africa.
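The intuition that combining haplotype lengths with derived mutations adds observed events can be shown with a much simpler estimator than the hidden Markov model in the abstract. The Python sketch below is a toy under strong assumptions that are mine, not the authors': every sampled lineage descends independently from the ancestral haplotype t generations ago (a star-shaped genealogy), the genetic length from the selected site to the first breakpoint is exponential with rate t per Morgan, and mutations on each haplotype are Poisson with mean mu times length in bp times t. Function names are hypothetical.

```python
def age_from_lengths(lengths_morgans):
    """MLE of t when each length is Exponential with rate t (lengths in Morgans)."""
    return len(lengths_morgans) / sum(lengths_morgans)


def age_from_mutations(mutation_counts, lengths_bp, mu):
    """MLE of t when mutation counts are Poisson(mu * length_bp * t)."""
    return sum(mutation_counts) / (mu * sum(lengths_bp))


def age_joint(lengths_morgans, mutation_counts, lengths_bp, mu):
    """Joint MLE: total events (breakpoints + mutations) over total exposure."""
    events = len(lengths_morgans) + sum(mutation_counts)
    exposure = sum(lengths_morgans) + mu * sum(lengths_bp)
    return events / exposure
```

The joint estimate is a ratio with more events in the numerator than either single-source estimate, which is why its sampling variance is smaller under these assumptions.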
Genome wide survey of positive selection signals in African Americans since admixture. H. Wang1, Y. Choi2, X. Wang3, B. Tayo4, U. Broeckel5, C. Hanis6, S. Kardia7, S. Redline8, R. Cooper4, H. Tang2, X. Zhu1 1) Department of Epidemiology and Biostatistics, Case Western Reserve University, Cleveland, OH; 2) Department of Genetics, Stanford University, Stanford, CA; 3) Departments of Preventive Medicine, Biomedical Informatics, and Applied Mathematics and Statistics, Stony Brook University, Stony Brook, NY; 4) Department of Public Health Science, Loyola University Medical Center, Maywood, IL; 5) Human and Molecular Genetics Center, Medical College of Wisconsin, Milwaukee, WI; 6) Department of Epidemiology, Human Genetics and Environmental Sciences, University of Texas Health Science Center at Houston, Houston, TX; 7) Department of Epidemiology, University of Michigan, Ann Arbor, MI; 8) Department of Medicine, Harvard Medical School, Boston, MA, USA.
In an admixed population such as African Americans, an excess or deficiency of ancestry in a local genomic region may suggest natural selection. We scanned three large African American cohorts comprising 20,153 individuals but failed to identify any genome-wide significant excess or deficiency signals. We showed that the failure to identify any significant selection signals can be attributed to the estimated variance of the test, which consists of two components: variance due to sampling error and variance due to random genetic drift. The proportion of variance due to random genetic drift increases when sample size increases. Thus, a test based on examining local ancestry excess is not efficient and its power will not increase with increasing sample size. We also showed that the high correlations of local ancestries between different cohorts are due to historical recombination and random genetic drift. Assuming African Americans have been admixed for 8 to 12 generations, we estimated the effective population size as between 32,000 and 48,000.
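The argument about two variance components can be illustrated numerically. The Python sketch below rests on textbook approximations that are assumptions here, since the abstract gives no formulas: sampling variance is binomial over 2n chromosomes, and drift variance after g generations in a population of effective size Ne is p(1-p)(1-(1-1/(2Ne))^g). The ancestry proportion p is a free parameter and the function names are hypothetical.

```python
def ancestry_variance(p, n, generations, ne):
    """Return (sampling, drift) variance of mean local ancestry at one locus."""
    # Sampling error shrinks as more individuals (2n chromosomes) are sampled.
    sampling = p * (1 - p) / (2 * n)
    # Drift since admixture does not depend on how many individuals are sampled.
    drift = p * (1 - p) * (1 - (1 - 1 / (2 * ne)) ** generations)
    return sampling, drift


def drift_share(p, n, generations, ne):
    """Proportion of the total variance that is due to drift."""
    sampling, drift = ancestry_variance(p, n, generations, ne)
    return drift / (sampling + drift)
```

Because only the sampling term shrinks with n, the drift share rises toward 1 as cohorts grow, so larger samples stop adding power. The drift term depends on g and Ne almost entirely through their ratio, which is consistent with the paired estimates in the abstract: 8 generations with Ne = 32,000 and 12 generations with Ne = 48,000 give the same ratio.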
Khoisan hunter-gatherers have been the largest population throughout most of modern human demographic history. HL. Kim1,2, A. Ratan2, GH. Perry3, A. Montenegro4,5, W. Miller2, SC. Schuster1,2 1) Singapore Centre on Environmental Life Sciences Engineering, Nanyang Technological University, Singapore; 2) Center for Comparative Genomics and Bioinformatics, Pennsylvania State University, PA, USA; 3) Department of Anthropology, Pennsylvania State University, PA, USA; 4) Department of Geography, Ohio State University, OH, USA; 5) Campus do Litoral Paulista, Unesp – Univ Estadual Paulista, Brazil.
We sequenced, with high accuracy, the complete genomes of five Khoisan hunter-gatherers from the Kalahari Desert and one Bantu-speaking agriculturalist individual also from southern Africa. By comparison with a 420K SNP genotyping dataset from 490 worldwide individuals, admixture analyses showed that three of our Khoisan genomes from the Ju/’hoansi group (northern Khoisan) have no or minimal admixture from non-Khoisan populations, allowing us to assess the early demographic history of the human species. Population genomic analyses of our complete genome sequences along with those from eight non-Khoisan humans were performed to infer their effective population sizes. These analyses demonstrated that the Ju/’hoansi population has maintained a large effective population size and has been the population most isolated from all other human populations since the earliest population split between the Khoisan and other populations ~100-150 thousand years ago (kya). In contrast, all other human populations, including the ancestral Bantu-speaking agriculturalists (currently the largest population within Africa in terms of census size), have experienced severe bottlenecks and lost more than half of their genetic diversity from ~120 to 30 kya. According to paleoclimate records and models, west-central Africa became drier, while southern Africa experienced increases in precipitation, ~80-100 kya. We hypothesize that these climate differences might be related to the divergent ancestral population history within African human populations.
The Kalash isolate from Pakistan. Q. Ayub1, L. Pagani1,2, M. Mezzavilla1,3, C. Tyler-Smith1 1) The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, United Kingdom; 2) Division of Biological Anthropology, University of Cambridge, Cambridge, United Kingdom; 3) Institute for Maternal and Child Health — IRCCS “BurloGarofolo” — Trieste, University of Trieste, Trieste, Italy.
The Kalash represent an enigmatic isolated population that has been living for centuries in the Hindu Kush mountain ranges of present-day Pakistan. Previous uni-parental (Y and mitochondrial) DNA markers provided no support for their claimed Greek descent following invasion of this region by Alexander III of Macedon, and analysis of autosomal loci provides evidence of a strong genetic bottleneck. To understand their origins and demography further, we genotyped 23 unrelated Kalash samples on the Illumina HumanOmni2.5 BeadChip and sequenced a male individual at high coverage on an Illumina HiSeq 2000. Comparisons with neighboring populations confirmed results based on genotyping 650,000 common single-nucleotide polymorphisms in the Kalash samples from the Centre d'Etude du Polymorphisme Humain (CEPH) Human Genome Diversity Project (HGDP) Cell Line Panel. However, we observed no evidence for admixture as suggested recently by Hellenthal et al. The mean time of divergence between the Kalash and other populations currently residing in this region that also speak Indo-European languages was estimated to be 11.8 (10.6-12.6) KYA. Since the split, the Kalash have experienced little or no gene flow from their geographic neighbors and have maintained a low long-term effective population size (2,247-2,780). They could represent some of the earliest migrants into the Indian sub-continent.
Identifiability and efficient inference of population size histories and locus-specific mutation rates from large-sample genomic variation data. A. Bhaskar1,2, Y. X. R. Wang3, Y. S. Song1,2,3,4 1) Simons Institute for the Theory of Computing, University of California, Berkeley, Berkeley, CA; 2) Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, Berkeley, CA; 3) Department of Statistics, University of California, Berkeley, Berkeley, CA; 4) Department of Integrative Biology, University of California, Berkeley, Berkeley, CA.
Several recent large-sample human genetics studies have found a massive excess of rare variants compared to predictions of previously inferred demographic models of human history. A widely cited explanation is that such polymorphism patterns are indicative of explosive and accelerating population growth in recent human history. Using the site frequency spectrum (SFS), a summary of genetic variation in a set of sequences that counts the segregating sites as a function of the mutant allele frequency, we develop an efficient method for inferring recent population demography that can scale to samples involving tens or hundreds of thousands of individuals. Using analytic results for the expected SFS under the coalescent and by leveraging the technique of automatic differentiation, we develop a very efficient algorithm to infer piecewise-exponential models of the historical effective population size from the distribution of sample allele frequencies. Our method is orders of magnitude faster than previous demographic inference methods based on the frequency spectrum and can also accurately estimate locus-specific mutation rates. We show that our method can accurately infer multiple recent epochs of rapid exponential growth, a signal which is difficult to pick up with small sample sizes. We apply our method to a recent large-sample exome-sequencing dataset of 11,000 European individuals and find evidence of rapid recent exponential population growth of 1.5% per generation during the last 370 generations. We also study the statistical identifiability aspect of this inference problem. It has been recently shown that very different population demographies can generate the same SFS for arbitrarily large sample sizes. Although in principle this non-identifiability issue poses a thorny challenge to statistical inference, the population size functions involved in these counterexamples are arguably not biologically realistic. 
We revisit this problem and show that the SFS of even moderate-sized samples uniquely determines the population demography when the population size is piecewise-defined with each piece belonging to some family of biologically-motivated functions. In the cases of piecewise-constant, piecewise-exponential, and piecewise-generalized-exponential models, which are often assumed in population genomic inferences, we provide explicit values for the sample sizes that are sufficient for identifying the demographic model from the SFS.
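Since the whole method is built on the site frequency spectrum, a short Python sketch of how an SFS is tabulated may help. It also includes the classical neutral, constant-size expectation E[xi_i] = theta / i as a baseline and the compound growth implied by a per-generation rate. This is not the authors' inference algorithm, which uses analytic expected spectra under piecewise-exponential models and automatic differentiation. The haplotype encoding ('0' ancestral, '1' derived) and the function names are assumptions made here.

```python
def site_frequency_spectrum(haplotypes):
    """Unfolded SFS: entry i-1 counts sites where i of n haplotypes are derived."""
    n = len(haplotypes)
    sfs = [0] * (n - 1)
    for site in zip(*haplotypes):       # iterate over alignment columns
        k = site.count("1")
        if 0 < k < n:                   # keep segregating sites only
            sfs[k - 1] += 1
    return sfs


def expected_sfs_constant(n, theta):
    """Expected SFS for a neutral, constant-size population: theta / i."""
    return [theta / i for i in range(1, n)]


def growth_factor(rate, generations):
    """Fold change in size after compounding `rate` for `generations`."""
    return (1 + rate) ** generations
```

Under this arithmetic, growth of 1.5% per generation sustained for 370 generations compounds to a size increase of roughly 250-fold, which shows how a modest per-generation rate produces the large excess of rare variants described above.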
Identity by descent segments within and across worldwide populations from sequence data. S. R. Browning1, B. L. Browning1,2 1) Department of Biostatistics, University of Washington, Seattle, WA; 2) Department of Medicine, Division of Medical Genetics, University of Washington, Seattle, WA.
Segments of identity by descent (IBD) shared by individuals within and across populations provide information on key aspects of demographic history, such as effective population sizes and migration rates.
Sequence data present opportunities and challenges for IBD analysis. Sequence data are more informative than SNP array data, improving power to accurately detect smaller IBD segments and hence obtain higher levels of information about demographic history. On the other hand, low-coverage sequence data have high rates of error, whereas SNP array data are usually extremely accurate.
We recently developed two IBD segment detection methods: Refined IBD and IBDseq. Refined IBD is a haplotype-frequency-based method designed for SNP array data, while IBDseq is an allele-frequency-based method designed for low-coverage sequence data. Both methods were developed in the context of samples from a homogeneous population. When using frequency-based methods in a heterogeneous setting we expect increased rates of false-positive IBD within sub-populations.
We use 1000 Genomes Project data and simulated data to investigate the performance of the IBDseq and Refined IBD methods when analyzing sequence data from world-wide populations. We find that the allele-frequency-based IBDseq method suffers from increased rates of false positive detected IBD segments due to population heterogeneity, whereas the haplotype-frequency-based Refined IBD approach is much less affected. We develop a strategy using multiple runs of Refined IBD and a process of filling small gaps between adjacent detected segments in order to recover near-complete large IBD segments while having high power to detect short segments. Our approach enables powerful IBD detection in the 1000 Genomes project data.
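The gap-filling step can be pictured as an interval merge. The Python sketch below is a generic illustration, not the authors' implementation: it assumes segments for one pair of haplotypes are given as (start, end) tuples in any consistent unit, that segments from multiple runs have been concatenated into one list, and that a hypothetical `max_gap` threshold decides which gaps are small enough to bridge.

```python
def fill_gaps(segments, max_gap):
    """Merge IBD segments for one haplotype pair when gaps are at most `max_gap`.

    `segments` is an iterable of (start, end) tuples. Overlapping segments,
    for example from different runs, are merged as well.
    """
    merged = []
    for start, end in sorted(segments):
        if merged and start - merged[-1][1] <= max_gap:
            # Extend the previous segment across the gap or overlap.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

For example, `fill_gaps([(0, 10), (12, 20), (40, 50)], 5)` joins the first two segments across the gap of 2 and leaves the third separate, giving `[(0, 20), (40, 50)]`.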
The Population Genomic Landscape of Human Genetic Structure, Admixture History and Local Adaptation in Peninsular Malaysia. L. Deng1, B. Hoh2, D. Lu1, R. Fu1, M. Phipps3, S. Li4, A. Nur-Shafawati5, W. Hatin6, E. Ismail7, S. Mokhtar2, L. Jin4, B. Zilfalil5, C. Marshall8, S. Scherer8,9, F. Al-Mulla10, S. Xu1 1) Max Planck Independent Research Group on Population Genomics, Chinese Academy of Sciences and Max Planck Society (CAS-MPG) Partner Institute for Computational Biology (PICB), Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shangh; 2) Institute of Medical Molecular Biotechnology, Faculty of Medicine, Universiti Teknologi MARA, Sungai Buloh Campus, Jalan Hospital, 47000, Sungai Buloh, Selangor, Malaysia; 3) Jeffrey Cheah School of Medicine and Health Sciences, Monash University (Sunway Campus), Selangor 46150, Malaysia; 4) Ministry of Education (MOE) Key Laboratory of Contemporary Anthropology, School of Life Sciences and Institutes of Biomedical Sciences, Fudan University, Shanghai 200433, China; 5) Department of Pediatrics, School of Medical Sciences, Universiti Sains Malaysia, Kelantan 16150, Malaysia; 6) Human Genome Center, School of Medical Sciences, Universiti Sains Malaysia, Kelantan 16150, Malaysia; 7) School of Biosciences & Biotechnology, Faculty of Science & Technology, Universiti Kebangsaan Malaysia, Bangi 43600, Malaysia; 8) The Centre for Applied Genomics, The Hospital for Sick Children, Toronto, Ontario, Canada; 9) McLaughlin Centre and Department of Molecular Genetics, University of Toronto, Toronto, Canada; 10) Department of Pathology, Faculty of Medicine, Kuwait University, Safat 13110, Kuwait.
Peninsular Malaysia is a strategic region which might have played an important role in the initial peopling and subsequent human migrations in Asia. However, the genetic diversity and history of human populations—especially indigenous populations—inhabiting this area remain poorly understood. Here, we conducted a genome-wide study using over 900,000 single nucleotide polymorphisms (SNPs) in four major Malaysian ethnic groups (MEGs; Malay, Proto-Malay, Senoi and Negrito), and made comparisons of 17 world-wide populations. Our data revealed that Peninsular Malaysia has greater genetic diversity corresponding to its role as a contact zone of both early and recent human migrations in Asia. However, each single Orang Asli (indigenous) group was less diverse with a smaller effective population size (Ne) than a European or an East Asian population, indicating a substantial isolation of some duration for these groups. All four MEGs were genetically more similar to Asian populations than to other continental groups, and the divergence time between MEGs and East Asian populations (12,000—6,000 years ago) was also much shorter than that between East Asians and Europeans. Thus, Malaysian Orang Asli groups, despite their significantly different features, may share a common origin with the other Asian groups. Nevertheless, we identified traces of recent gene flow from non-Asians to MEGs. Finally, natural selection signatures were detected in a batch of genes associated with immune response, human height, skin pigmentation, hair and facial morphology and blood pressure in MEGs. Notable examples include SYN3 which is associated with human height in all Orang Asli groups, a height-related gene (PNPT1) and two blood pressure-related genes (CDH13 and PAX5) in Negritos. 
We conclude that a long isolation period, subsequent gene flow and local adaptations have jointly shaped the genetic architectures of MEGs, and this study provides insight into the peopling and human migration history in Southeast Asia.
Shared Identity by Descent segments within current Italian population reveals new details about recent population history. G. Fiorito1,2, C. Di Gaetano1,2, F. Rosa1, S. Guarrera1,2, B. Pardini1, A. Piazza1,2, G. Matullo1,2 1) Human Genetics Foundation, Turin, TORINO, Italy; 2) Department of Medical Sciences, University of Turin, Turin, Italy.
The inference of shared Identity by Descent (IBD) segments was recently enabled by high-resolution genomic data from large cohorts and novel algorithms for IBD detection. This approach makes it possible to examine the genetic structure of a population in more detail, as well as to obtain information about recent demographic events such as bottlenecks and migrations. This study aims to characterize the genetic variability within the Italian population. We present analytical results on the relationship between IBD sharing across 301 unrelated Italian individuals genotyped for about 2.5 million Single Nucleotide Polymorphisms (SNPs). Each sample has well-defined geographical origins (four grandparents coming from the same geographical region). Due to the well-known common ancestral origin of the Italian population we focused our attention on long-range and relatively recent shared IBD segments. By using Principal Component Analysis (PCA) and ancestry estimation, we ascertain Sardinia as the genetic outlier within Italy. Moreover, a certain degree of differentiation is still detectable within the Aosta Valley population. For each of the 11 subpopulations, we find a significantly higher number of shared IBD segments within than between populations, suggesting isolation by distance. Samples sharing the highest number of internal IBD blocks are Sardinian as expected, followed by those living in Aosta Valley, Tuscany and Sicily. We also evaluate the relationship between shared IBD segments and geographical distance. Contrary to what is expected, the decay of IBD with distance is not steeper for longer (recent) blocks. This result suggests a constant exchange due to several migratory waves within Italy and/or to the considerably high number of populations that have lived in Italy.
We finally demonstrate that regions of increased IBD sharing are enriched for structural variation and loci implicated in natural selection, and we highlight the relationship between shared IBD haplotypes and demographic events that occurred both in Sardinia and in the Italian peninsula. In conclusion, our results suggest that the study of shared IBD segments between populations is a useful method to detect novel details about relatively recent population history.
Identity by descent between humans, Denisovans, and Neandertals. S. Hochreiter, G. Povysil Institute of Bioinformatics, Johannes Kepler University Linz, Linz, Austria.
We analyze the sharing of very short identity by descent (IBD) segments between humans, Neandertals, and Denisovans to gain new insights into their demographic history. Short IBD segments convey information about events far back in time because the shorter IBD segments are, the older they are assumed to be. The identification of short IBD segments becomes possible through next generation sequencing (NGS), which offers high variant density and reports variants of all frequencies. Only recently HapFABIA has been proposed as the first method for detecting very short IBD segments in NGS data. HapFABIA utilizes rare variants to identify IBD segments with a low false discovery rate. We applied HapFABIA to the 1000 Genomes Project whole genome sequencing data to identify IBD segments which are shared within and between populations. Some IBD segments are shared with the reconstructed ancestral genome of humans and other primates. These segments are tagged by rare variants, consequently some rare variants have to be very old. Other IBD segments are also old since they are shared with Neandertals or Denisovans, which explains their shorter lengths. The Denisova genome most prominently matched IBD segments that are shared by Asians. Many of these segments were found exclusively in Asians and they are longer than segments shared between other continental populations and the Denisova genome. Therefore, we could confirm an introgression from Denisovans into ancestors of Asians after their migration out of Africa. While Neandertal-matching IBD segments are most often shared by Asians, Europeans share more than other populations, too. Again, many of the Neandertal-matching IBD segments are found exclusively in Asians, whereas Neandertal-matching IBD segments that are shared by Europeans are often found in other populations, too. Neandertal-matching IBD segments that are shared by Asians or Europeans are longer than those observed in Africans.
This hints at a gene flow from Neandertals into ancestors of Asians and Europeans after they left Africa. Interestingly, many Neandertal- or Denisova-matching IBD segments are predominantly observed in Africans – some of them even exclusively. IBD segments shared between Africans and Neandertals or Denisovans are strikingly short, therefore we assume that they are very old. This may indicate that these segments stem from ancestors of humans, Neandertals, and Denisovans and have survived in Africans.
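The assumption used in this abstract, that shorter IBD segments are older, follows from a standard back-of-the-envelope model rather than from anything specific to HapFABIA. If two haplotypes share an ancestor g generations ago, they are separated by 2g meioses, and under a simple recombination model the expected length of a shared segment is about 100/(2g) cM. The sketch below only encodes that textbook approximation; the function name is made up for illustration, and the model ignores detection thresholds, selection and demography.

```python
def expected_ibd_length_cm(generations: float) -> float:
    """Expected IBD segment length in cM for a common ancestor
    `generations` ago, under the simple model of 2*g meioses
    separating the two haplotypes (an idealized approximation)."""
    if generations <= 0:
        raise ValueError("generations must be positive")
    return 100.0 / (2.0 * generations)


if __name__ == "__main__":
    for g in (25, 250, 2500, 25000):
        print(g, expected_ibd_length_cm(g))
```

Under this approximation a common ancestor 25 generations back gives segments of about 2 cM on average, while one 25,000 generations back gives about 0.002 cM, which is why very old sharing can only be seen with dense sequence data.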
Exploring Detailed Demographic Histories of Human Populations Using SNP Frequency Spectrums. X. Liu, Y.-X. Fu Human Genetics Center, School of Public Health, The University of Texas Health Science Center at Houston, Houston, TX.
Inferring human demographic history using genetic information can shed light on important prehistoric evolutionary events such as population bottleneck, expansion, migration, and admixture, among others. It is also the foundation of many population genetics analyses, as demographic history is one of the most important forces shaping the polymorphic pattern of DNA sequences. We developed a novel model-free method called stairway plot, which infers detailed population size changes over time using SNP frequency spectrums. This method can be applied to low-coverage sequence data, pooled sequence data and even reference-free sequence data for species whose reference genomes are not yet available. Another advantage of this method is the ability to handle whole-genome sequences of hundreds of individuals. Using extensive simulation we compared our method to Li and Durbin’s method based on the pairwise sequentially Markovian coalescent (PSMC) framework and the results show that our method outperformed the PSMC method for inferring recent population size changes. We applied our method to the genomes of nine non-admixed populations (CEU, GBR, TSI, FIN, CHB, CHS, JPT, YRI and LWK) from the 1000 Genomes Project, and showed a detailed pattern of human population fluctuations from 10 to 500 thousand years ago (kya). The results supported many mainstream viewpoints on the demographic histories of human populations, and at the same time also produced several interesting observations worth further and more careful investigation.
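For readers unfamiliar with the input to such a method, here is a minimal sketch of how an unfolded SNP frequency spectrum can be tabulated from phased haplotypes. It does not implement the stairway plot inference itself. The 0/1 encoding (0 ancestral, 1 derived) and the function name are assumptions for illustration.

```python
from collections import Counter
from typing import List, Sequence


def site_frequency_spectrum(haplotypes: Sequence[Sequence[int]]) -> List[int]:
    """Return the unfolded SFS for n haplotypes.

    haplotypes: one row per haplotype, one column per site,
    with 0 = ancestral allele and 1 = derived allele (assumed encoding).
    The result has n-1 entries; entry k-1 counts the sites where the
    derived allele is carried by exactly k haplotypes.
    """
    n = len(haplotypes)
    if n < 2:
        raise ValueError("need at least two haplotypes")
    # zip(*rows) iterates over sites (columns).
    counts = Counter(sum(site) for site in zip(*haplotypes))
    return [counts.get(k, 0) for k in range(1, n)]


if __name__ == "__main__":
    haps = [
        [1, 0, 1, 0],
        [0, 0, 1, 0],
        [0, 1, 1, 0],
        [0, 0, 0, 0],
    ]
    print(site_frequency_spectrum(haps))  # [2, 0, 1]
```

In the toy example two sites are singletons, none are doubletons and one site has the derived allele on three of four haplotypes; the monomorphic fourth site is not counted.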
Exome sequencing of 3,000 individuals reveals differences in recent demographic history between East Asian and European populations. K. E. Lohmueller1, M. He2,3, Y. Li3, B. Kim1, L. Sun4, X. Zhang4, X. Jin3, K. Kristiansen3,5, T. Hansen6,7, J. Wang3, O. Pedersen7,8,9, E. Huerta-Sanchez10, R. Nielsen5,10 1) Ecology and Evolutionary Biology, University of California, Los Angeles, Los Angeles, CA; 2) Bioinformatics and Computational Biology Program, Iowa State University, Ames, IA; 3) BGI-Shenzhen, Shenzhen, China; 4) Department of Dermatology, First Affiliated Hospital, Anhui Medical University, Hefei, China; 5) Department of Biology, University of Copenhagen, Copenhagen, Denmark; 6) Faculty of Health Sciences, University of Southern Denmark, Odense, Denmark; 7) The Novo Nordisk Foundation Center for Basic Metabolic Research, University of Copenhagen, Copenhagen, Denmark; 8) Faculty of Health Sciences, Aarhus University, Aarhus, Denmark; 9) Institute of Biomedical Sciences, University of Copenhagen, Copenhagen, Denmark; 10) Integrative Biology, University of California, Berkeley, Berkeley, CA.
Studies of genetic variation in thousands of individuals have found evidence for extreme population growth within the last 10,000 years in European and African American populations. The magnitude of recent growth in other continental populations, such as East Asians, has received comparatively little attention. In order to learn more about recent population history in East Asia, here we analyze high-coverage exome sequencing data from 1,449 Han Chinese individuals sampled from the Anhui province of China and 1,449 Danish individuals. We estimated recent demographic history using the site frequency spectrum. We find that the current effective size of the Han is approximately 4-fold larger than that estimated in the Danish population. Thus, while previous studies of common variants suggest historically smaller effective sizes in East Asian populations relative to European populations, our estimates of recent effective population sizes show the opposite pattern and trend in the same direction as the census population sizes. Next, we characterize the relationship between our estimates of the current effective population sizes and the census sizes. The ratio of the census size (over the last 200 years) to the recent effective size is significantly higher in the Han population than in the Danish population (P<2×10⁻⁴). This difference can be explained by greater variance in reproductive success in the Han population as compared to the Danish population. Alternatively, this result could be due to greater migration into the Danish population than the Han population. While it is appreciated that effective sizes of human populations are smaller than the census sizes, here we demonstrate that the magnitude of this difference varies across populations, even after accounting for population size changes. Finally, we examine patterns of deleterious variants in the Han and Danish populations.
We find that the proportion of private variants that are nonsynonymous is higher in the Han sample (67.6%) than in the Danish sample (64.6%; P<10⁻¹⁰), consistent with recent population growth increasing the input of weakly deleterious mutations into the population that selection has not had sufficient time to remove. Our study provides the first analysis of recent population history and exploration of neutral and deleterious rare variants in an East Asian population.
Analysis of Genetic Diversity Representation of the 1000 Genomes in Worldwide Human Populations. D. Lu, S. Xu Partner institute for Computational Biology, Shanghai, Shanghai, China.
The 1000 Genomes Project (1KG) aims to provide a deep characterization of human genome sequence variation and, by design, was expected to provide a comprehensive resource on human genetic variation. With an effort of sequencing 2,500 individuals, 1KG is expected to cover the majority of human genetic diversity worldwide. However, it would be interesting to evaluate to what extent the 1KG data represent the genetic diversity of human populations in each region, which will give insight into the power of 1KG and also give guidance to regional efforts for further sequencing projects and study design. In this study, using analysis of population structure based on genome-wide single nucleotide polymorphism (SNP) data, we examined and evaluated the coverage of genetic diversity of 1KG samples with the available genome-wide data from 3,831 individuals representing 140 worldwide population samples. We demonstrated that the 1KG does not have sufficient coverage of human genetic diversity in Asia, especially in Southeast Asia. We thus suggest that better coverage of Southeast Asian populations be considered in 1KG, or that a regional effort be initiated to provide a more comprehensive characterization of human genetic diversity in Asia, which is important for both evolutionary and medical studies in the future.
Visualizing the Geographic Distribution of Genetic Variants. J. H. Marcus, J. Novembre Department of Human Genetics, University of Chicago, Chicago, IL.
One of the core features of any genetic variant, beyond its potential phenotypic effects or its frequency, is its geographic distribution. The geographic distribution of a genetic variant can shed light on where the variant first arose, in what populations it survived and spread within, and in turn help us learn about historical patterns of migration and natural selection. Collectively the geographic distribution of genetic variants can help to explain how populations have been related through time (e.g. levels of gene flow and divergence). For variants with large effects, it can also help us understand the geographic distribution of spatially-varying phenotypes. For these reasons, visual inspection of geographic maps for genetic variants is common practice in genetic studies. Here we develop a series of reusable interactive visualizations for illuminating the geographic distribution of genetic variants. We specifically address several non-trivial challenges of this type of visualization; in particular, how to represent non-uniform levels of uncertainty in allele frequencies due to variable sample sizes; how to represent results from data with >10,000 individuals in which allele frequencies can vary over 4 orders of magnitude; how to display data for regions of the globe with dense sampling of populations; and how to quickly access frequency data from large samples. To meet these challenges, we implement a flexible REST API for allowing for easy access to allele frequency and sample size data from large scale public genomic datasets. Built upon this API we develop a web-based browser, entitled the Geography of Genetic Variants (GGV) browser for visualizing the geographic distribution of genetic variants. The GGV browser rapidly provides maps of derived allele frequencies in populations distributed across the globe. 
The GGV browser builds upon past tools such as the HGDP Selection browser by allowing for more interactive features, new representations of rare variation, as well as incorporating uncertainty in allele frequency estimation. As ancillaries, we also develop a research visualization toolkit that includes a method for displaying high Fst outlier SNPs from the joint site frequency spectrum and an interactive version of commonly used PCA figures. We hope the GGV browser will be a valuable research and education tool for exploring population genetics data.
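One of the challenges named in this abstract is showing how uncertain an allele frequency is when sample sizes differ between populations. The abstract does not say how the GGV browser quantifies this, so the following is only one plausible way to do it: a Wilson score interval for a binomial proportion, with the function name and the default z value chosen for illustration.

```python
import math
from typing import Tuple


def wilson_interval(allele_count: int, n_chromosomes: int,
                    z: float = 1.96) -> Tuple[float, float]:
    """Approximate 95% Wilson score interval for an allele frequency.

    allele_count: copies of the allele observed.
    n_chromosomes: chromosomes sampled (2 x diploid individuals).
    This is an illustrative choice, not the GGV browser's method.
    """
    if n_chromosomes <= 0 or not 0 <= allele_count <= n_chromosomes:
        raise ValueError("invalid counts")
    n = n_chromosomes
    p = allele_count / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


if __name__ == "__main__":
    # Same point estimate (10%), very different sample sizes.
    print(wilson_interval(2, 20))
    print(wilson_interval(200, 2000))
```

Both calls have the same point estimate, but the interval from 20 chromosomes is far wider than the one from 2,000, which is the kind of difference a map of frequencies needs to convey.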
Finding the oasis of humanity in Neanderthal deserts. B. Vernot, JM. Akey Department of Genome Sciences, University of Washington, Seattle, WA.
As anatomically modern humans dispersed out of Africa, they encountered Neanderthals in Eurasia and low levels of hybridization occurred such that approximately 2% of each non-African’s genome is inherited from Neanderthal ancestors. Recently, we developed an approach to identify surviving Neanderthal lineages in contemporary individuals, and recovered over 600 Mb of the Neanderthal genome present in modern non-African populations. The map of surviving Neanderthal sequences shows marked heterogeneity across the genome, and we identified many “deserts of Neanderthal sequence” that are almost entirely devoid of Neanderthal sequence. These genomic regions are of particular interest because they delimit sequences that may confer uniquely human characteristics. For example, the largest Neanderthal desert is a 15Mb region on Chromosome 7, centered around the FOXP2 gene, which has previously been implicated in speech and language. Here, we present a detailed characterization of Neanderthal deserts by analyzing surviving archaic sequences in an expanded sample of geographically diverse individuals. We have developed a formal statistical test to identify genomic regions significantly depleted of Neanderthal lineages, and performed extensive simulations to infer the strength of purifying selection acting on these Neanderthal deserts. Additionally, we have utilized extensive bioinformatics analyses superimposing heterogenous functional genomics data to identify candidate causal variants. These analyses provide significant new insights into regions of the human genome that harbor sequences that have played a critical role in the evolution of anatomically modern humans, and suggest that regulatory sequences responsible for muscle, bone, and brain development were key differences between humans and Neanderthals. Vernot and Akey, Science, 2014.
Population structure in the UK: Rare variant analysis using whole genome sequencing in 3,621 samples in the UK10K cohorts project. K. Walter1, S. Metrustry2, E. Zeggini1, Y. Memari1, J. Min3, J. Huang1, M. Cocca4, S. Schiffels1, I. Mathieson5, D. Lawson6, N. Soranzo1, UK10K Consortium Cohorts Group 1) Human Genetics, Wellcome Trust Sanger Institute, Hinxton, United Kingdom; 2) Twin Research & Genetic Epidemiology, Kings College London, United Kingdom; 3) MRC CAiTE Centre, University of Bristol, United Kingdom; 4) Institute for Maternal and Child Health-IRCCS ʻʻBurlo Garofolo”-Trieste, University of Trieste, Italy; 5) Harvard Medical School, Boston MA 02115, United States; 6) Heilbronn Institute, School of Mathematics, University of Bristol, United Kingdom.
Population structure is a well-characterized potential confounder of association studies based on common variants, but the structural pattern for rare variants and their influence on association studies is less understood. The cohorts arm of the UK10K project undertook whole-genome sequencing at low-read depth (median ~7x) in nearly 4,000 individuals from two large population samples in the UK (TwinsUK, N=1,754 and the Avon Longitudinal Study of Parents and Children (ALSPAC), N=1,867) in a comprehensive exploration of associations between rare and common genetic variants and a set of 61 bio-medically important quantitative phenotypes. The two study cohorts have marked differences in demographic profile with ALSPAC participants originating from a geographically restricted area (Bristol) in the South West of the UK, while the TwinsUK participants were born in different parts of the UK. After stringent QC steps, the data set comprises 42 million SNPs, 3.5 million INDELs and about 18,000 large deletions across 3,621 study participants. Here we describe the extent to which geographic stratification exists at rare variants by focusing on 31 shared ‘core’ phenotypes in 1,139 twins with available place of birth data throughout the United Kingdom. We modeled genetic structuring using a Euclidian distance metric, a regional grid and generalized additive models (GAM) applied to latitudinal and longitudinal data, and for single nucleotide variants of different minor allele frequencies separately. We further modeled correlation of genotypic and phenotypic data at these geographical locations, and compared them to simulated datasets. Finally, we applied Mantel tests to analyze the significance of genotypic and phenotypic relationships given the distance metrics. 
Overall, these analyses suggested that there is a moderate genetic structuring of very rare alleles (MAF=0.1-0.3%); however, this structure is not associated with phenotypic variation and is unlikely to pose a serious concern for association studies of complex quantitative phenotypes and rare variation in the UK.
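The Mantel test used in this abstract asks whether two distance matrices over the same individuals (for example genetic and geographic distance) are correlated, with significance assessed by permuting the labels of one matrix. The abstract does not specify the variant or software used, so the sketch below is just the simple one-sided Pearson version in plain Python; all names are illustrative, and it assumes symmetric matrices that are not constant.

```python
import random
from statistics import mean
from typing import List, Sequence, Tuple

Matrix = Sequence[Sequence[float]]


def _upper(m: Matrix, order: List[int]) -> List[float]:
    """Upper-triangle entries of m after relabelling rows/columns by order."""
    n = len(order)
    return [m[order[i]][order[j]] for i in range(n) for j in range(i + 1, n)]


def _pearson(x: List[float], y: List[float]) -> float:
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def mantel_test(a: Matrix, b: Matrix, permutations: int = 999,
                seed: int = 0) -> Tuple[float, float]:
    """Return (r, one-sided p-value) for the correlation between a and b."""
    identity = list(range(len(a)))
    x = _upper(a, identity)
    observed = _pearson(x, _upper(b, identity))
    rng = random.Random(seed)
    hits = 0
    for _ in range(permutations):
        order = identity[:]
        rng.shuffle(order)  # permute individuals in matrix b only
        if _pearson(x, _upper(b, order)) >= observed:
            hits += 1
    # Count the observed arrangement itself as one permutation.
    return observed, (hits + 1) / (permutations + 1)


if __name__ == "__main__":
    # Toy data: six individuals on a line; genetic distance grows with
    # the square of geographic distance.
    geo = [[float(abs(i - j)) for j in range(6)] for i in range(6)]
    gen = [[float((i - j) ** 2) for j in range(6)] for i in range(6)]
    print(mantel_test(geo, gen))
```

Rows and columns are permuted together because the entries of a distance matrix are not independent, which is the reason a plain correlation test on the entries would be inappropriate.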
Sequencing the genomes of single cells. P. Ribaux, C. Borel, F. Santoni, E. Falconnet, S. E. Antonarakis Dept Genetic Medicine, Univ Geneva Medical School, Geneva, Switzerland.
Whole-genome amplification and next-generation sequencing advances enable investigation of somatic structural and nucleotide variation at single-cell resolution. The ultimate goals of our study are (i) to identify disease-associated somatic mutations and (ii) to uncover the extent of low-abundance DNA variations in individual cancer cells in order to elucidate mechanisms of tumor evolution. Because of the technical challenge of detecting and analyzing genomic heterogeneity among single cells, we first analyzed individual cells in culture and tested the robustness of our experimental workflow. We chose K562 cells, a human immortalized myelogenous leukemia line, and F-T21, a human primary Trisomy 21 fibroblast cell line. We used the C1 Single Cell Auto Prep System (Fluidigm) to capture hundreds of individual cells and to generate high-quality amplified DNA from individual cells. So far, 96 barcoded whole-exome libraries were sequenced at deep coverage (PE, 100bp). Variant calls (CNVs and SNVs) were generated with an in-house analysis pipeline. Here, we will discuss the amplification uniformity, the detectable fraction of the exome and the level of DNA contamination. By comparing single-cell and bulk-cell datasets, we will assess the percentage of allelic dropout for each single-cell exome based on the heterozygous SNVs. High quality single-cell genome sequence will greatly enhance the genetic analysis of somatic genomic disorders. C.B. and P.R. contributed equally.
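The allelic dropout comparison described here can be illustrated with a short sketch. The authors' pipeline is in-house and not described, so this follows one common definition: among sites that are heterozygous in the bulk sample and have a call in the single cell, the fraction that appear homozygous in the cell. The genotype strings and the site keys are assumptions for illustration.

```python
from typing import Dict

CALLED = {"0/0", "0/1", "1/1"}  # assumed unphased biallelic genotype codes


def allelic_dropout_rate(bulk: Dict[str, str], cell: Dict[str, str]) -> float:
    """Fraction of bulk-heterozygous sites called homozygous in one cell.

    Sites missing from `cell` (no coverage) are excluded from the
    denominator; they reflect locus dropout rather than allelic dropout.
    """
    het_sites = [site for site, gt in bulk.items() if gt == "0/1"]
    called = [site for site in het_sites if cell.get(site) in CALLED]
    if not called:
        raise ValueError("no bulk-heterozygous sites called in this cell")
    dropped = sum(1 for site in called if cell[site] != "0/1")
    return dropped / len(called)


if __name__ == "__main__":
    bulk = {"chr1:100": "0/1", "chr1:200": "0/1", "chr2:50": "1/1",
            "chr3:10": "0/1", "chr3:99": "0/1"}
    cell = {"chr1:100": "0/1", "chr1:200": "0/0", "chr2:50": "1/1",
            "chr3:10": "1/1"}  # chr3:99 has no call in this cell
    print(allelic_dropout_rate(bulk, cell))
```

In the toy data three bulk-heterozygous sites are called in the cell and two of them look homozygous, giving a dropout rate of two thirds.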
Monozygotic Twin Pairs: CNV and sequence concordance. A. Abdellaoui1, E. Ehli2, J. J. Hottenga1, Z. Weber2, H. Mbarek1, G. Willemsen1, T. van Beijsterveldt1, A. Brooks3, J. J. Hudziak4, P. F. Sullivan5, E. C. J. de Geus1, K. Ye6, P. E. Slagboom7, G. E. Davies2, D. I. Boomsma1 1) Biological Psychology, VU University Amsterdam, Amsterdam, Noord Holland, Netherlands; 2) Avera Institute for Human Genetics, Avera McKennan Hospital & University Health Center, Sioux Falls, SD, USA; 3) Department of Genetics, Rutgers, The State University of New Jersey, Piscataway, New Jersey, USA; 4) University of Vermont, College of Medicine, Burlington, VT, USA; 5) Department of Genetics, University of North Carolina, Chapel Hill, North Carolina, USA; 6) The Genome Institute, Washington University School of Medicine, St. Louis, Missouri, USA; 7) Molecular Epidemiology, Leiden University Medical Center, Leiden, Netherlands.
Monozygotic (MZ) twins are genetically identical at conception, making them informative subjects for studies on somatic mutations. Copy number variants (CNV) are responsible for a substantial part of genetic variation, have relatively high mutation rates, and have been associated with susceptibility to disease, such as autism and schizophrenia. We conducted a genome-wide survey for post-twinning de novo CNVs (i.e., not shared by co-twins) in ~1,100 MZ twin pairs who had been repeatedly phenotyped across a wide range of traits, and of which a large proportion has gene-expression and methylation data available. CNVs from 1,097 MZ twin pairs were measured in DNA from peripheral blood samples (mostly in adults) or buccal epithelium (mostly in children) with the Affymetrix 6.0 microarray. Whole-genome sequencing was performed in DNA from blood samples from 13 MZ twin pairs and their parents (12x coverage – Illumina, and 2 twin pairs additionally sequenced with Complete Genomics). We found a total of 153 putative post-twinning de novo CNVs >100 kb, of which the majority resided in the same unstable genomic region (15q11.2). Based on how well the raw intensity signals visually agreed with CNV calls made by the two algorithms, a first selection was made of eleven de novo CNVs from 15q11.2 for a first series of qPCR validation experiments. Two out of eleven post-twinning de novo CNVs were validated with qPCR in the same twin pair. This 13-year old twin pair did not show large phenotypic differences. The remaining putative de novo CNVs from 15q11.2 were found significantly more often in older twins, suggesting that we are capturing real signals. The large putative de novo CNVs detected with microarray data were not present in the subsample that had whole-genome sequence data available. We do expect the whole-genome sequence data to allow us to search for smaller de novo CNVs that cannot be detected with micro-array data.
Beyond the 1000 Genomes Project. L. Clarke, H. Zheng-Bradley, A. Datta, I. Streeter, D. Richardson, P. Flicek, The 1000 Genomes Consortium Vertebrate Genomics, European Molecular Biology Laboratory – European BioInformatics Institute (EMBL-EBI), The Wellcome Trust Genome Campus, Hinxton, Cambridge, United Kingdom.
The 1000 Genomes Project provides an essential reference catalog of human variation with more than 60 million variant sites ranging from single nucleotide polymorphisms to structural variant events including inversions and duplications. Also provided are global allele frequencies and genotypes for 2535 individuals from 26 different populations across Europe, Africa, East and South Asia and the Americas, which enable many other projects to better interpret their results. Primary uses for the 1000 Genomes data sets include imputation panels to create whole genome variant sets from exome or array-based genotypes; as filters of “normal” or shared variation in rare disease or cancer sequencing projects; and to explore demography and selection in human populations. The 1000 Genomes Project is now drawing to a close. Here we describe plans to maintain the resource in order to ensure it remains the valuable data set it is today by providing long-term support for the 1000 Genomes Project resource. For example, we will continue to host both the FTP site (ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp) and the project website (http://www.1000genomes.org) to ensure the community can access both the raw data and the documentation about the project. We will also create a stable version of the 1000 Genomes Browser (http://browser.1000genomes.org) based on the project’s final data release. This project specific Ensembl-based browser displays all of the 1000 Genomes variants as soon as possible and will use the GRCh37 assembly of the human reference genome. We will also maintain the existing tools and incorporate new ones as appropriate to enable users to easily access the data they desire.
Our most popular tools are the Data Slicer, which allows users to select genomic subsections of our alignment (BAM) and variant (VCF) files and thus download just the piece of the file they need, and the Variation Pattern Finder, which allows users to discover patterns of shared variation in a specific region of the genome. Other tools include the VCF to PED converter, which allows users to generate PLINK format files from remotely hosted VCF files, and the recently introduced Allele Frequency Calculator, which calculates allele frequencies in bulk for specific sub populations from our VCF files.
Next generation association studies in isolated populations. E. Zeggini1, L. Southam1,2, K. Panoutsopoulou1, K. Hatzikotoulas1, G. R. S. Ritchie1, A.-E. Farmaki3, I. Tachmazidou1, A. Matchan1, N. W. Rayner1,2, J. Schwartzentruber1, I. Ntalla3, E. Tsafantakis4, M. Karaleftheri5, G. Dedoussis3, A. Gilly1 1) Wellcome Trust Sanger Institute, Hinxton, United Kingdom; 2) Wellcome Trust Centre for Human Genetics, University of Oxford, UK; 3) Harokopio University Athens, Athens, Greece; 4) Anogia Medical Centre, Anogia, Greece; 5) Echinos Medical Centre, Echinos, Greece.
Isolated populations have unique characteristics that can be leveraged to increase power in genetic association studies. In founder populations genetic drift can drive trait-associated alleles to higher frequency and thus enable the identification of rare variant associations with smaller discovery sets. We have collected samples from two isolated populations in Greece (HELlenic Isolated Cohorts study): the Pomak villages (HELIC-Pomak) in the North of Greece; and the Mylopotamos villages (HELIC-MANOLIS) on Crete. All samples (n~3000) have information on a wide array of anthropometric, cardiometabolic, biochemical, haematological and diet-related traits, genotypes from the Illumina OmniExpress and exome-chip platforms, and are being whole-genome sequenced at low depth. Using 1x WGS data from 995 (HELIC-MANOLIS) individuals, we demonstrate that over 80% of true low-frequency (0.01<MAF<0.05) variants are found, compared to an average 60% for 0.001<MAF<0.01 and 40% for MAF<0.001. Genotype concordance reaches >95% and minor allele concordance >90% across the whole MAF spectrum. We replicate known association hits, thereby providing a proof of concept for a robust processing pipeline for low-depth WGS variant calls. Using genotype data, we find that 80% of subjects have at least one “surrogate parent” in the isolates, compared to 1% in the outbred Greek population. In the MANOLIS cohort we observe an enrichment of missense variants amongst the variants that have drifted up in frequency by >5 fold. We have previously reported a lipid traits association with a functional variant in the APOC3 gene in 1267 individuals in MANOLIS. The equivalent sample size needed to detect this in the general European population would be 67,000. 
In the Pomak cohort we find novel associations at variants on chr11p15.4 showing large allele frequency increases (from 0.2% in the general Greek population to 4.6% in the isolate) with haematological traits, for example with mean corpuscular volume (at rs11035019, beta=-1.249, p=3.45×10⁻²⁹). Their detection in cosmopolitan populations would necessitate thirteen times as many samples. We demonstrate the significant power gains that can be afforded by studying well-characterised founder populations.
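To put the figures quoted in this abstract on a common scale, here is a minimal arithmetic sketch. It only divides the numbers stated above; the variable names are mine, and it says nothing about how the authors actually computed power.

```python
# Figures quoted in the abstract (variable names are illustrative only).
manolis_n = 1267            # MANOLIS individuals in the APOC3 association
european_n = 67_000         # stated equivalent sample size in general Europeans
pomak_freq = 0.046          # chr11p15.4 allele frequency in the Pomak isolate
greek_freq = 0.002          # frequency in the general Greek population

# How many times larger the outbred sample would need to be.
sample_size_ratio = european_n / manolis_n

# Fold increase in allele frequency attributed to drift in the isolate.
frequency_enrichment = pomak_freq / greek_freq

print(f"APOC3 sample size ratio: {sample_size_ratio:.1f}x")
print(f"chr11p15.4 frequency enrichment: {frequency_enrichment:.0f}-fold")
```

On these numbers the APOC3 example corresponds to roughly a 53-fold saving in sample size, and the chr11p15.4 variants are about 23-fold more frequent in the isolate.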
Admixture mapping of exome genotyping data implicates region 15q21.2-22.3 with keloid risk in African Americans. K. S. Tsosie1,2, D. R. Velez Edwards1,3,4,5, S. M. Williams6, T. L. Edwards1,2,3,4, S. B. Russell1,7 1) Center for Human Genetics, Vanderbilt University, Nashville, TN; 2) Division of Epidemiology, Department of Medicine, Vanderbilt University, Nashville, TN; 3) Vanderbilt Epidemiology Center; 4) Institute for Medicine and Public Health; 5) Department of Obstetrics and Gynecology, Vanderbilt University, Nashville, TN; 6) Department of Genetics, Geisel School of Medicine, Dartmouth University, Hanover, NH; 7) Division of Dermatology, Department of Medicine, Vanderbilt University, Nashville, TN.
Keloids (MIM 148100) are benign dermal fibrotic tumors with no effective clinical remedy that affect people of recent African ancestry approximately 20 times more often than individuals of Caucasian descent. Possibly related fibroproliferative diseases with increased prevalence in African populations include hypertension, nephrosclerosis, allergic disease, and uterine fibroma. Familial aggregation and ancestral differences in risk among geographic subpopulations strongly suggest a genetic association between African ancestry, keloids and fibroproliferative disease risk. There are no published genome-wide studies of keloid risk in African ancestry subjects. We conducted admixture mapping (AM) and whole exome association in 478 African Americans (AAs: 122 cases, 356 controls) with exome arrays to identify regions of local ancestry and SNP genotypes under AM peaks associated with keloid risk. Results: The most significant association with keloids discovered by AM was observed on chr15q21.2-22.3. This 5Mb region includes NEDD4, which was previously implicated in keloid formation by GWAS in Japanese and later validated in Chinese. Though our study nominally replicated this finding by AM and genotype association, the most significant SNP genotype association under the AM peak was observed at MYO1E (rs747722, odds ratio [OR]=4.41, 95% confidence interval [CI]=2.29-8.50, p=9.07×10⁻⁶). A scan of all common genotype associations also identified an association at MYO7A on chr11q13.5 (rs35641839, OR=4.71, 95% CI=2.38-9.32, p=8.34×10⁻⁶). GWAS have linked the chr15q21.2-22.3 region with hypertension in AAs, asthma in Europeans, and atherosclerosis in a Finnish cohort, providing evidence for common genetic elements. Examination of earlier microarray data of fibroblasts from keloids and normal scars that included some subjects from this study also implicated chr15q21.2-22.3 as a causal region for keloids, with increased expression of MYO1E in keloids compared to normal scars.
Notably, MYO1E has been shown to be a crucial component of the invadosome, a structure involved in matrix degradation and invasion, and thus may have a functional role in the keloid phenotype. Conclusion: This study is the first to use AM and exome array association analysis to explore the genetics of keloids in AAs. Our findings, strengthened by support from expression data, further implicate a region on chr15q21.2-22.3 in the risk of keloids in AA, Japanese, and Chinese populations.
Exome sequencing of 487 Community Acquired Pneumonia patients. K. S. Elliott1, A. Ndungu1, T. C. Mills1, A. L. Rautanen1, P. Hutton2, C. Garrard2, A. Gordon3, C. M. Hinds4, M. Lathrop5, A. V. S. Hill1, S. J. Chapman1 1) Wellcome Trust Centre Human Genetics, University of Oxford, Oxford, UK; 2) Intensive Care Unit, John Radcliffe Hospital, Oxford, UK; 3) Anaesthetics, Pain Medicine and Intensive care, Imperial College, London, UK; 4) William Harvey Research Institute, Queen Mary University of London, London EC1M 6BQ, UK; 5) McGill University-Génome Québec Innovation Centre, Montreal, Canada.
Respiratory infection is the largest contributor to global disease burden and pneumonia kills over one million children each year. A major genetic component of infectious disease was demonstrated by a study of Danish adoptee children in which a 5.8-fold increased risk of death from infectious disease was observed if one of their biological parents had died prematurely from infection. Severe bacterial disease may exert enormous selective pressure leading to the finding of rare susceptibility variants of relatively recent origin. In order to identify such variants an exome sequencing study was undertaken. DNA samples from 487 adult UK individuals admitted to an intensive care unit with severe community-acquired pneumonia (CAP) were collected as part of a study of genetic predictors of death from sepsis in critically ill patients (Genomic Advances in Sepsis [GAinS]). Analysis of sepsis susceptibility was performed on a discovery cohort of 270 CAP samples compared to the UK10K ALSPAC control dataset. After stringent QC criteria were applied, 135,392 variants were identified. Of these, 43 reached the ExWAS significance threshold for association (p < 3.6×10⁻⁷) and an additional 63 variants were suggestive (p < 1×10⁻⁴). The exomes from the remaining 217 CAP patients are being analysed as a replication dataset against the UK10K TWINSUK control dataset. The sepsis outcome phenotype was also analysed, measured as 28-day mortality post-ICU admission within the combined cohort of 487 CAP patients (deaths n=237, survivors n=237, unknown n=13). In single-variant analysis of sepsis outcome, seven variants were identified reaching the ExWAS significance threshold, including variants in two related genes known to be involved in thrombosis. Collapsing methods with rare deleterious variants are being applied to detect gene-centric associations.
Identification of novel, large-effect genetic variants has the potential to significantly expand current understanding of sepsis biology and may have clinical applications.
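The abstract does not say how the ExWAS significance threshold was derived. As a rough check, and purely on my assumption that it is a Bonferroni correction at a family-wise alpha of 0.05 over the 135,392 variants that passed QC, the arithmetic looks like this:

```python
# Assumption (not stated in the abstract): Bonferroni correction at alpha = 0.05.
alpha = 0.05
n_variants = 135_392        # variants remaining after QC, from the abstract

# Per-variant threshold under a Bonferroni correction.
bonferroni_threshold = alpha / n_variants

print(f"Bonferroni threshold: {bonferroni_threshold:.2e}")
```

This gives about 3.7×10⁻⁷, close to the quoted p < 3.6×10⁻⁷, so a Bonferroni-style correction is a plausible reading, though the authors may have used a slightly different variant count or method.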