After further consideration, I have decided that I shall spare you my own clumsy exposition in plain English as to the details of site frequency spectrum calculations. There are after all enough points of interest in the paper at which I can throw my verbal talents more effectively. First, the abstract:
High-throughput sequencing technology enables population-level surveys of human genomic variation. Here, we examine the joint allele frequency distributions across continental human populations and present an approach for combining complementary aspects of whole-genome, low-coverage data and targeted high-coverage data. We apply this approach to data generated by the pilot phase of the Thousand Genomes Project, including whole-genome 2–4× coverage data for 179 samples from HapMap European, Asian, and African panels as well as high-coverage target sequencing of the exons of 800 genes from 697 individuals in seven populations. We use the site frequency spectra obtained from these data to infer demographic parameters for an Out-of-Africa model for populations of African, European, and Asian descent and to predict, by a jackknife-based approach, the amount of genetic diversity that will be discovered as sample sizes are increased. We predict that the number of discovered nonsynonymous coding variants will reach 100,000 in each population after ∼1,000 sequenced chromosomes per population, whereas ∼2,500 chromosomes will be needed for the same number of synonymous variants. Beyond this point, the number of segregating sites in the European and Asian panel populations is expected to overcome that of the African panel because of faster recent population growth. Overall, we find that the majority of human genomic variable sites are rare and exhibit little sharing among diverged populations. Our results emphasize that replication of disease association for specific rare genetic variants across diverged populations must overcome both reduced statistical power because of rarity and higher population divergence.
The first figure illustrates one of the clearest, though least surprising, findings in the paper: the lack of overlap of rare alleles across two distinct populations. In this panel they’re comparing Chinese from Beijing (CHB) and Yoruba from Nigeria (YRI). They focused on rare alleles, defined as variants present in 15 or fewer out of 100 in their sample. The union of the two populations yielded ~3,300 alleles, but only ~200 of these were found in both populations. In other words, well over 90% of these alleles were private to one population or the other. This immediately clues you in on the peculiarity of these genetic variants, as you should know that at any random polymorphic gene there will be far less between-population variance than this. The zone of intersection on the histogram is notably “flat,” while it is “cool” on the heat map. In contrast, the “edges” of the graphs, which are defined by alleles exclusive to each respective population, exhibit a wide distribution in counts (observe that there are many more very rare alleles than moderately rare alleles).
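For readers who want to see the bookkeeping behind a figure like this, here is a minimal sketch in Python. It is not the authors' pipeline: the function names, the toy allele counts, and the rule that a variant counts as "rare" only if it is at or below the cutoff in both populations are all my own assumptions for illustration. The last lines simply redo the arithmetic on the rounded numbers quoted above.

```python
from collections import Counter

RARE_MAX = 15  # "15 or fewer out of 100", the cutoff quoted in the post


def joint_sfs(counts_a, counts_b):
    """Tally sites by (allele count in A, allele count in B).

    This is a toy joint site frequency spectrum: each key is a pair of
    counts, each value is the number of sites with that pair.
    """
    return Counter(zip(counts_a, counts_b))


def rare_sharing(sfs, rare_max=RARE_MAX):
    """Return (shared, private) counts of rare variants.

    Assumption (mine, not the paper's): a variant is "rare" if its count
    is at most rare_max in both populations and nonzero in at least one.
    """
    shared = private = 0
    for (a, b), n_sites in sfs.items():
        if a == 0 and b == 0:
            continue  # not variable in either sample
        if a > rare_max or b > rare_max:
            continue  # too common to count as rare
        if a > 0 and b > 0:
            shared += n_sites
        else:
            private += n_sites
    return shared, private


# Hypothetical allele counts at eight sites (made-up numbers).
toy_a = [1, 0, 2, 0, 1, 3, 0, 20]
toy_b = [0, 1, 0, 4, 1, 0, 0, 2]
toy_shared, toy_private = rare_sharing(joint_sfs(toy_a, toy_b))

# The rounded figures from the post: ~3,300 rare alleles, ~200 shared.
private_fraction = 1 - 200 / 3300
print(toy_shared, toy_private, round(private_fraction, 3))
```

On the rounded numbers from the post the private fraction comes out at about 94%.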
An important aspect of this paper is that they synthesized results from “high coverage” and “low coverage” sequencing efforts. The former is highly accurate in terms of the actual state of the genome, but often very targeted and narrow (in this paper they focused on the exons of a set of genes, the regions of the genome which actually encode proteins). In contrast, the latter covers wider swaths of the genome (the full genome in this case), but may not be as accurate. One can immediately imagine the problem when one is fixing upon low frequency variants: errors in the data as well as limitations in the sample size may result in inflation or omission of alleles. When it comes to high frequency alleles error is of less account, because a mistake here and there will not change the qualitative assessment. In any case, by comparing the rare variants found in deeply covered regions of the genome with the presumed underestimates yielded by the more thinly covered projects, the authors generated parameters which allowed them to project the proportion of private alleles as a function of frequency across populations.
To the left you see a set of series on a line chart generated by their method. On the x-axis you have the minor allele frequency (the frequency of the less common variant at a locus). On the y-axis you have the fraction of alleles shared across the two populations. What is notable to me is how even two closely related populations tend to differ a great deal at very low frequencies! The Chinese data need a little explanation, I think. The Chinese in Denver are almost certainly skewed toward a South Chinese sample. Historically American Chinese were disproportionately Cantonese, while the newer immigration waves tend to be Fujianese, whether directly from Fujian, or ethnic Fujianese from Taiwan (where they are the majority). Though likely cosmopolitan, the Beijing Chinese are obviously going to sample more from the north of the country. This difference shows up on PCA plots, where the Beijing and Denver Chinese samples exhibit the distances from populations to their north and south that you’d expect if the latter were derived from southern Chinese populations.
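A curve of this kind can be sketched in a few lines of Python. To be clear, this is only an illustration of the idea and not the authors' estimator, which also has to correct for low coverage and sample size. The function name, the toy counts, and the choice to condition on the minor allele count in the first population are assumptions on my part.

```python
from collections import defaultdict


def sharing_by_count(counts_a, counts_b):
    """For each minor allele count k seen in population A, return the
    fraction of those variants that are also present in population B.

    Assumption: both lists hold minor allele counts for the same sites,
    in the same order.
    """
    seen = defaultdict(int)    # variants in A at count k
    shared = defaultdict(int)  # of those, variants also found in B
    for a, b in zip(counts_a, counts_b):
        if a == 0:
            continue  # not present in A, so not on A's curve
        seen[a] += 1
        if b > 0:
            shared[a] += 1
    return {k: shared[k] / seen[k] for k in sorted(seen)}


# Hypothetical counts at eight sites (made-up numbers).
toy_counts_a = [1, 1, 1, 1, 2, 2, 5, 5]
toy_counts_b = [0, 0, 0, 1, 0, 3, 2, 4]
toy_curve = sharing_by_count(toy_counts_a, toy_counts_b)
print(toy_curve)
```

In this made-up example the shared fraction climbs with allele count, which is the qualitative shape described above: the rarer the variant, the less likely it is to turn up in the other population.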
The fact that very rare alleles are not shared across even closely related populations should not be too surprising when you think about it of course (everything is so obvious in hindsight!). For example, much of southern China was populated by Han ~1,500 years ago, during the first interregnum between Chinese dynasties (a period of disunity of particularly great length, lasting three centuries). During the Song, ~1000 A.D., the Yangtze region and provinces to the south definitively surpassed the Yellow River basin in demographic heft. Without taking into account migration, this gives roughly 1,000 years on average, or 40 generations (assuming 25 years per generation), for new genetic variants to arise which might be private to the Han of the north and south of China respectively. The same process writ small certainly applies within putative populations, and there are going to be family-private alleles. That is, genetic markers of recent origin distinctive to family lineages (more broadly construed we already know this with tandem repeats, but here we’re focusing on single nucleotide polymorphisms, changes at a single base pair).
Finally, let’s hit their main demographic finding, which received a lot of coverage in The New York Times. They estimated that the last common ancestor of Asians and Africans in their data set lived on the order of ~50,000 years before the present. This is absolutely unsurprising. As they note, this is entirely consonant with the archeological record. What is fascinating is the confidence: a 95% interval of 45 to 69 thousand years. This immediately seemed congenially narrow to me, and they confirm this by reviewing earlier estimates from noisier data sets, which had much larger intervals. Here is the rough demographic model which they inferred from their data:
CEU refers to Utah whites, CHB to Chinese in Beijing, JPT to Japanese, and YRI to Yoruba. You can see that their estimate of the last common ancestor of Europeans and Asians is ~23 thousand years B.P., in line with other calculations, though a touch on the low side for my own taste. The N refers to population sizes, while the shape of the tree illustrates the non-African bottleneck followed by demographic expansion vs. the relatively constant African population size over the past ~100,000 years.
The real good stuff comes in the discussion. Here’s something that jumped out at me: “It should be emphasized that, because we use a single Western African population as our African panel, the divergence described by our model might have occurred earlier than the actual Out-of-Africa event.” Within the discussion it is noted repeatedly that their results are sensitive to a host of conditions. For example, they were limited in the populations they used, and their demographic-historical model was obviously not as complex as it could have been. These results then perhaps should be seen as an important guide, and a pointer to things to come, rather than a substantive marker to lay down and take to heart. Given the populations they had and the data available the method outlined here seems very useful, but there are still limitations imposed by the population set and the nature of the data (which will be obviated in the near future).
Finally, there’s the practical payoff in medical genetics. The New York Times accurately reflected the inference one could make from this: if lots of diseases which are common are due to a host of rare variants, then it is even more important to gain a better understanding of fine-grained human variation. Risk alleles found in one population via genome-wide association have often been found to predict well in other populations, which makes sense: if these more common variants are part of our common ancestral heritage, then they should be relatively robust to genetic background. Such may not be the case with many rare variants, which reflect the peculiarities of more recent history. If medicine is to be truly personal in the genomic sense, then it seems likely that it will be more context dependent than had been hoped 10 years ago.
Citation: Simon Gravel, Brenna M. Henn, Ryan N. Gutenkunst, Amit R. Indap, Gabor T. Marth, Andrew G. Clark, Fuli Yu, Richard A. Gibbs, The 1000 Genomes Project, & Carlos D. Bustamante (2011). Demographic history and rare allele sharing among human populations. PNAS. doi:10.1073/pnas.1019276108