Across the ~3 billion base pairs in the human genome there’s a fair amount of variation. That variation can be partitioned into different classes, somewhat artificial constructions of human categorization systems, but nevertheless mapping on to real demographic or life history events of particular importance. Some of the variation is specific to populations, some of it is specific to a set of populations, and there is also variation which we find only within families. Presumably when whole genome sequencing and analysis becomes the norm such distinctions will still have utility, but we should be able to tunnel down to whatever level of analysis we wish. But until that day comes we’re going to have to rely on population sets which are deeply sequenced and can serve as a reasonable representation of a subset of human variation.
Despite great progress in identifying genetic variants that influence human disease, most inherited risk remains unexplained. A more complete understanding requires genome-wide studies that fully examine less common alleles in populations with a wide range of ancestry. To inform the design and interpretation of such studies, we genotyped 1.6 million common single nucleotide polymorphisms (SNPs) in 1,184 reference individuals from 11 global populations, and sequenced ten 100-kilobase regions in 692 of these individuals. This integrated data set of common and rare alleles, called ‘HapMap 3’, includes both SNPs and copy number polymorphisms (CNPs). We characterized population-specific differences among low-frequency variants, measured the improvement in imputation accuracy afforded by the larger reference panel, especially in imputing SNPs with a minor allele frequency of ≤5%, and demonstrated the feasibility of imputing newly discovered CNPs and SNPs. This expanded public resource of genome variants in global populations supports deeper interrogation of genomic variation and its role in human disease, and serves as a step towards a high-resolution map of the landscape of human genetic variation.
Since the supplements are free to all I recommend you download them if you don’t have academic access. The main difference is that they’re not as pithy in the supplements, and the graphics are lower quality. The populations (the first four, CEU, CHB, JPT, and YRI, are the original HapMap populations):
Centre d’Etude du Polymorphisme Humain collected in Utah, USA, with ancestry from northern and western Europe (CEU)
Han Chinese in Beijing, China (CHB)
Japanese in Tokyo, Japan (JPT)
Yoruba in Ibadan, Nigeria (YRI)
African ancestry in the southwestern USA (ASW)
Chinese in metropolitan Denver, Colorado, USA (CHD)
Gujarati Indians in Houston, Texas, USA (GIH)
Luhya in Webuye, Kenya (LWK)
Maasai in Kinyawa, Kenya (MKK)
Mexican ancestry in Los Angeles, California, USA (MXL)
Tuscans in Italy (Toscani in Italia, TSI)
So memorize some of those abbreviations! One particular difference across these populations is that some are parent-offspring trios, and some are not. The CEU sample consists of trios, while the TSI sample does not. This obviously matters since you’re going to have clusters of relatedness within the CEU sample that you wouldn’t have within TSI. There are analytic upsides and downsides to having trios or not having trios, but for a major purpose of this sort of data set, covering worldwide human variation, you probably would want unrelated individuals within a population. These are the samples with trios: CEU, ASW, MXL, MKK, and YRI.
To get the SNPs and CNPs they merged the results from Affymetrix and Illumina chips, and came out with ~1.5 million variants across ~1,000 individuals. In terms of exploring big picture questions which are on a coarse scale this is pretty good, though I’m not sure that it’s that much better than the HGDP, which has so many populations (though about half the number of SNPs). Rather, one of the primary issues focused on in this paper is finding enough of the rarer variants, which may not have shown up in the initial panel because of its narrow population coverage, so as to perform imputation for purposes of statistical analysis in GWAS. So, for example, they compare the CEU vs. the CEU+TSI in imputing to a British study group. Here’s what they found (MAF = minor allele frequency):
For common SNPs (MAF ≥5%), the larger HapMap 3 reference panel made only a slight difference to the already excellent performance (mean r2 increased from 0.946 to 0.961). However, as expected there was greater improvement for rare (MAF <0.5%) and low-frequency SNPs (MAF = 0.5–5%). Their combined mean r2 increased from 0.60 to 0.76, driven by a large subset of rare SNPs (41%) and low-frequency SNPs (25%) where r2 increased by at least 0.1, yielding mean r2 improvement for these subsets of 0.62 and 0.49 respectively…
So the older HapMap data set was fine with more common variants, but a larger sample set really gave some returns with less common variants. This makes intuitive sense. What is interesting to me is that the CEU sample of Utah Whites is presumably genetically close to a group of British whites born in 1958, and yet adding a Tuscan sample was still useful. To get a sense of how the power of this sort of imputation drops off between populations (the greater the genetic distance, the fewer rare variants are shared), they imputed in a pairwise fashion, or compared a population to putative admixtures. So African Americans, who have primarily West African ancestry with a substantial proportion of European admixture, are best modeled once you combine CEU+YRI with appropriate weights. This is especially true for rare alleles. For common SNPs, r2 was 83% when imputing African Americans from the Yoruba alone and 86.5% from the Yoruba plus Utah Whites. For rare SNPs, it was 45.5% vs. 71.7%! Models which added the other HapMap 3 populations were actually less effective at imputation. East Eurasians have different genetic variants which simply confuse the picture.
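For readers who want to see what these numbers measure, here is a minimal sketch. It assumes that r2 is the squared Pearson correlation between true genotypes and imputed allele dosages, which is a common definition but may not match the paper's exact procedure. The frequency bins follow the quoted passage; the function names and the toy genotypes are made up for illustration.

```python
def minor_allele_frequency(genotypes):
    """genotypes: per-individual counts (0, 1 or 2) of the alternate allele."""
    alt_freq = sum(genotypes) / (2 * len(genotypes))
    return min(alt_freq, 1 - alt_freq)


def frequency_class(maf):
    """Bins from the quoted passage: rare <0.5%, low-frequency 0.5-5%, common >=5%."""
    if maf < 0.005:
        return "rare"
    if maf < 0.05:
        return "low-frequency"
    return "common"


def imputation_r2(true_genotypes, imputed_dosages):
    """Squared Pearson correlation between true genotypes and imputed dosages."""
    n = len(true_genotypes)
    mean_t = sum(true_genotypes) / n
    mean_d = sum(imputed_dosages) / n
    cov = sum((t - mean_t) * (d - mean_d)
              for t, d in zip(true_genotypes, imputed_dosages))
    var_t = sum((t - mean_t) ** 2 for t in true_genotypes)
    var_d = sum((d - mean_d) ** 2 for d in imputed_dosages)
    if var_t == 0 or var_d == 0:
        # Correlation is undefined at a monomorphic site; 0.0 is a choice made here.
        return 0.0
    return cov * cov / (var_t * var_d)


# Toy data (invented): ten individuals at one SNP.
truth = [0, 0, 0, 1, 0, 0, 2, 0, 1, 0]
dosage = [0.1, 0.0, 0.2, 0.8, 0.0, 0.1, 1.7, 0.0, 0.9, 0.1]

maf = minor_allele_frequency(truth)
r2 = imputation_r2(truth, dosage)
print(frequency_class(maf), round(r2, 3))
```

A mean r2 for a frequency class would then just be the average of this per-SNP value over all SNPs falling in that bin.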
It is intuitively obvious why rare alleles show up as you increase sample size. But why are rare alleles more distinctive across populations? If they’re common alleles they’re likely to have been around a long time, and so may be ancestral variants, or have had time to spread via gene flow. In contrast, rare alleles may be new, and so more distinctive across populations. Similarly, there are alleles which surely are passed down through families.
Figure 3 shows the impact of sample size on SNPs discovered:
Note the two groups of curves: African vs. non-African. This paper confirms the findings that Africans have more genetic diversity than other populations, while East Asians have less (presumably if Amerindians were in the sample they would round out the bottom). From the text:
As judged by this measure, informativeness varied greatly for different population pairs. Consistent with the observation that non-African diversity is largely a subset of African diversity…African samples provided a more complete discovery resource for variant sites in non-African samples than the converse…Focusing only on low-frequency variants in the original sample of 30 individuals (one or two copies, corresponding to allele frequencies of 3.3% or less), even African samples were highly incomplete for diversity outside of Africa, with informativeness ratios dropping to 40–60% in LWK and YRI…In general, for low-frequency variants only closely related populations did an adequate job of capturing variation…probably reflecting the recent origins of low-frequency variants. Two populations, LWK and GIH, stand out as being poorly captured by any of our other populations, the result of admixture with an ancestral population not closely related to any in our regional sequencing data…
So again, African genetic diversity can inform on other populations, but with low frequency allelic variants even Africans don’t have enough to account for non-African groups. As a historical matter much of that might be due to the fact that the non-African variants have emerged more recently, since the out of Africa event. Figure 2a shows the pairwise relationships between and within the populations measured by low frequency SNPs. More precisely, they took 30 random individuals from a population, and compared them to 30 random individuals from within the same population (without overlap), as well as 30 random individuals from other populations. The black bar is the same population comparison, while the colored bars represent across population comparisons. The higher the bar the better the across sample concordance; SNPs in one sample set map on well to those in the other sample set. First, observe the minimal difference between CEU & TSI. Europeans are relatively genetically homogeneous, and as far back as History and Geography of Human Genes it was evident that there was relatively minimal within continental variance. Next in line in relation to a CEU reference is GIH, the Gujaratis. This makes sense from all the other studies we know. South Asians are closer to West Eurasians than any other populations. Similarly, YRI are closest in correspondence with LWK, the Bantu sample from Kenya. But though the rank order of population relatedness is roughly similar to what you’d find in Fst, the authors note that the pairwise comparisons are not symmetrical. GIH was informative for 71% of TSI low frequency SNPs, but TSI was only informative for 55% of GIH. Why? GIH is more diverse, but it is also probable that the Gujaratis are a compound of a European-like and a non-European population, so what you’re seeing is overlap across the European fractions. Since the Tuscans lack the non-European fraction, the Gujaratis will have alleles which aren’t found within them.
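The asymmetry is easier to see with a toy calculation. The sketch below is my reading of the measure, not the paper's exact procedure: for the low-frequency sites in a target sample of 30 individuals (one or two copies out of 60 chromosomes), it asks what fraction are also seen at least once in a second sample. The function name, the site labels, and the copy counts are all invented.

```python
def informativeness(target_counts, discovery_counts, max_copies=2):
    """Fraction of the target sample's low-frequency variant sites
    (1..max_copies copies) that also appear in the discovery sample."""
    low_freq_sites = {site for site, copies in target_counts.items()
                      if 1 <= copies <= max_copies}
    if not low_freq_sites:
        return float("nan")
    seen = sum(1 for site in low_freq_sites
               if discovery_counts.get(site, 0) > 0)
    return seen / len(low_freq_sites)


# Two copies among 30 diploid individuals is the 3.3% ceiling in the quote.
max_low_frequency = 2 / (2 * 30)

# Invented copy counts per variant site in two samples of 30 individuals.
sample_x = {"s1": 1, "s2": 2, "s3": 1, "s4": 10}
sample_y = {"s1": 3, "s2": 1, "s4": 12, "s5": 1, "s6": 2, "s7": 1}

y_for_x = informativeness(sample_x, sample_y)  # how well Y captures X's rare sites
x_for_y = informativeness(sample_y, sample_x)  # how well X captures Y's rare sites
print(round(max_low_frequency, 3), round(y_for_x, 2), round(x_for_y, 2))
```

Here the sample with more private low-frequency sites (Y) is captured worse by the other sample than the reverse, which is the same shape as the GIH vs. TSI result.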
Speaking of the Gujaratis, there are some interesting results in the supplements which I want to highlight. They illustrate again the importance of context in PCA charts. They’re representations of reality, but only as good as your ability to interpret them and the inputs you’re giving them. Below are a set of images from the supplements, and you can skim them quickly. I’ve labelled them by population and context. Note how the populations shift positions based on the population set of variation you plug into the analysis. These are all the two largest components of variance.
Notice how Gujaratis and Mexican Americans overlap on the worldwide PCA plot. Why? Because their gene frequencies are a linear combination of East and West Eurasian genetic variance, to a first approximation. I’ve indicated before that the overlap disappears when you look at other components of variation. But as the second image shows, you don’t have to do that. Use only Mexican Americans, Europeans, and Gujaratis, and you see that Mexican Americans have a component of variance which is different from the other two. That’s because the non-European ancestry of Gujaratis is very different from that of the Mexican Americans, though both cluster together when set next to Europeans, East Asians, and Africans. Remember that in the worldwide set PC 1 is African vs. non-African, so removing Africans immediately frees up a dimension for the plot. The last figure shows Mexican Americans with Chinese and Europeans, and again, you see that there’s variation which isn’t simply a linear combination of Chinese and Europeans; Amerindians have their own uniqueness not found in either. In contrast, African Americans are a rather straightforward combination of West Africans and Europeans. Thankfully for African American genetics their parental populations were in the original HapMap. For Gujaratis and Mexican Americans you have only half the picture in the original HapMap, and you’d have to use the imperfect substitute of East Asians (very imperfect for Gujaratis, and somewhat so for Mexican Americans).
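The "linear combination" idea can be made concrete without running a PCA at all. The sketch below is only the geometry behind the plots, under the simplifying assumption that an admixed population's allele frequencies are a weighted average of its source populations' frequencies. The three source populations, their frequencies, and the function names are hypothetical.

```python
from math import sqrt


def admixed_frequencies(p_a, p_b, alpha):
    """Expected allele frequencies with ancestry fraction alpha from source A."""
    return [alpha * a + (1 - alpha) * b for a, b in zip(p_a, p_b)]


def position_on_axis(p, p_a, p_b):
    """Project p onto the line from source B (position 0) to source A (position 1)."""
    axis = [a - b for a, b in zip(p_a, p_b)]
    numerator = sum((x - b) * d for x, b, d in zip(p, p_b, axis))
    return numerator / sum(d * d for d in axis)


def distance_off_axis(p, p_a, p_b):
    """How far p sits from the A-B line, i.e. variation the two sources can't explain."""
    t = position_on_axis(p, p_a, p_b)
    return sqrt(sum((x - (b + t * (a - b))) ** 2
                    for x, a, b in zip(p, p_a, p_b)))


# Invented allele frequencies at four SNPs for three hypothetical sources.
source_a = [0.1, 0.8, 0.3, 0.6]
source_b = [0.7, 0.2, 0.5, 0.1]
source_c = [0.4, 0.5, 0.9, 0.2]

# A two-way mix of A and B falls exactly on the A-B line.
two_way = admixed_frequencies(source_a, source_b, 0.2)
# A mix of A with an unsampled source C does not, if only A and B are references.
with_unsampled = admixed_frequencies(source_a, source_c, 0.5)

print(round(position_on_axis(two_way, source_a, source_b), 3),
      round(distance_off_axis(two_way, source_a, source_b), 3),
      round(distance_off_axis(with_unsampled, source_a, source_b), 3))
```

The first case is the African American situation, where both parental populations are in the reference set. The second is closer to the Mexican American and Gujarati situation, where a stand-in population leaves a residual that shows up on another component.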
One final issue on phylogenetic relationships: the strange pattern among Gujaratis which I perceived among other South Asians as well is still evident. In the plot with Mexican Americans + Europeans + Gujaratis, the Gujaratis seem a linear combination of European + something else, what Reich et al. would term “Ancestral North Indian” + “Ancestral South Indian.” But the Gujarati + European plot shows that in the second component of variation there’s a difference between two clusters of Gujaratis. There’s something going on with the Gujarati group which is a touch closer to Europeans on the largest component of variance, because on the second dimension they’re deviated from the other Gujarati cluster and Europeans. This is similar in quality to the pattern with the South Asian data set with an orthogonal component of variation to the European-South Indian axis. The orthogonal component is striking among those which are between the Europeans and South Indians. The CEU + GIH + CHB plot doesn’t indicate to us that it’s East Asian either.
Of course the paper wasn’t just about validating the power of expanding the data set for medical genetics and clarifying phylogenetic relationships. There are several subsections, but I thought I’d jump to the end where they allude to detecting natural selection. This seems preliminary at least. They didn’t really go that much further for populations in the original HapMap, but found some interesting stuff for the new groups. To the left is a table from the supplements (I reedited it a bit) which shows loci which popped out of the CMS test for natural selection for Tuscans, Maasai, and Luhya (the latter two are Nilotic and Bantu from Kenya). I present the results for readers with an interest in particular loci who might see something in the list that does, or doesn’t, make sense to them. It seems that this part of the paper is primarily about showing that the new populations have some utility in fleshing out evolutionary phenomena which may have been missing in the original analyses of the HapMap because of constrained population coverage. Comparing Tuscans to CEU, and the Maasai to Luhya, should tell us something about the evolution of lactase persistence. These pairs consist of populations which are rather close to each other in terms of ancestry (especially the European groups), but local ecological and cultural conditions have no doubt applied different selection pressures (the majority of Tuscans seem to lack the lactase persistence allele common in northern Europe last I checked).
Finally, from the conclusion:
With improvements in sequencing technology, low-frequency variation is becoming increasingly accessible. This greater resolution will no doubt expand our ability to identify genes and variants associated with disease and other human traits. This study integrates CNPs and lower-frequency SNPs with common SNPs in a more diverse set of human populations than was previously available. The results underscore the need to characterize population-genetic parameters in each population, and for each stratum of allele frequency, as it is not possible to extrapolate from past experience with common alleles. As expected, lower-frequency variation is less shared across populations, even closely related ones, highlighting the importance of sampling widely to achieve a comprehensive understanding of human variation.
Intrepid readers can poke around the data themselves at the HapMap website.
Citation: The International HapMap 3 Consortium (2010). Integrating common and rare genetic variation in diverse human populations. Nature. DOI: 10.1038/nature09298