The Unz Review: An Alternative Media Selection
A Collection of Interesting, Important, and Controversial Perspectives Largely Excluded from the American Mainstream Media
Gene Expression Blog


The figure to the left is a three-dimensional representation of principal components 1, 2, and 3, generated from a sample of Gujaratis from Houston, and Chinese from Denver. When these two populations are pooled together the Chinese form a very homogeneous cluster. They don’t vary much across the three top explanatory dimensions of genetic variance. In contrast, the Gujaratis do vary. This is not surprising. In the supplements of Reconstructing Indian population history it was notable that the Gujaratis did tend to shake out into two distinct clusters in the PCAs. This is a finding you see over and over when you manipulate the HapMap Gujarati data set. In reality, there aren’t two equivalent clusters. Rather, there’s one “tight” cluster, which I will label “Gujarati_B” from now on in my data set, and another cluster, “Gujarati_A,” which really just consists of all the individuals who are outside of the Gujarati_B cluster. Even when compared to other South Asian populations these two distinct categories persist in the HapMap Gujaratis.

Zack has already identified a major difference between the two clusters: Gujarati_A has some individuals with much more “West Eurasian” ancestry. To be more formal about this in the future I simply assigned individuals in my merged data set to one of the two Gujarati clusters based on their position in the first two PCs. Last night I ran ADMIXTURE from K = 2 to K = 10, with 75,000 SNPs. I also removed the Native American groups, and added more European and East Asian samples from the HapMap. Below are some populations at K = 4:
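The cluster assignment described above can be sketched in a few lines. This is only an illustration of the idea, not the procedure actually used: the coordinates, the cluster centre, and the radius cutoff are all hypothetical values, and `assign_cluster` is a name invented for this example.

```python
from math import hypot

def assign_cluster(pc1, pc2, center, radius):
    """Label a sample Gujarati_B if it sits inside the tight cluster, else Gujarati_A."""
    distance = hypot(pc1 - center[0], pc2 - center[1])
    return "Gujarati_B" if distance <= radius else "Gujarati_A"

# Hypothetical (PC1, PC2) coordinates; these are not real HapMap values.
samples = {
    "s1": (0.010, 0.020),
    "s2": (0.012, 0.018),
    "s3": (0.060, -0.030),
    "s4": (0.011, 0.022),
}
center = (0.011, 0.020)  # assumed centre of the tight cluster
radius = 0.005           # assumed cutoff, e.g. chosen by eye from the PCA plot

labels = {name: assign_cluster(pc1, pc2, center, radius)
          for name, (pc1, pc2) in samples.items()}
print(labels)
```

Because Gujarati_A is defined only as “everything outside Gujarati_B,” a single distance cutoff around the tight cluster is enough; no second centroid is needed.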

Let’s drill down to the level of individuals. Here are the Gujarati individuals, along with Sindhis, and my parents (Bengali). I’ve sorted by the “European” and then “South Asian” components (light blue and green respectively, while purple is modal in Papuans and red in East Asians):

The ADMIXTURE plots are in total alignment with the PCA. In the PCA Gujarati_A exhibit a spectrum of distance from the European cluster, and in the ADMIXTURE you see the same. In contrast, Gujarati_B is relatively uniform. So what’s going on? I will be posting something similar over at Sepia Mutiny soon. But my guess is that Gujarati_B are a subset of Patels. In other words, they’re a genetically distinct jati. I suspect that Gujarati_A are a more diverse bunch from a number of different jatis.

Does this matter? I believe it does. If Gujarati_B are a distinct ethno-social group which is a subset of Gujaratis, then they may not be as good a proxy for South Asian medical genetics as Gujarati_A. More concretely, Gujarati_B may carry otherwise rare disease alleles at relatively high frequency because they’re an inbred clan. In contrast, while Gujarati_A may exhibit all the hallmarks of South Asian endogamy, if they’re a larger number of different groups, then they’ll have all sorts of different rare alleles. The ones they have in common may be more generally South Asian.


Across the ~3 billion or so base pairs in the human genome there’s a fair amount of variation. That variation can be partitioned into different classes, somewhat artificial constructions of human categorization systems, but nevertheless mapping on to real demographic or life history events of particular importance. Some of the variation is specific to populations, while some of it is specific to a set of populations, and there is also variation which we find only within families. Presumably when whole genome sequencing and analysis becomes the norm such distinctions will still have utility, but we should be able to tunnel down to whatever level of analysis we wish. But until that day comes we’re going to have to rely on population sets which are deeply sequenced and can serve as a reasonable representation of a subset of human variation. I mention some of these populations regularly on this weblog, the HGDP, HapMap and POPRES being three prominent data sets with a diverse range. These groups cover only a small subset of human populations, and of those populations only a small proportion of the genomes of individuals (albeit, the component which is likely to vary within the population). A new paper in Nature takes a close look at the expansion of the HapMap to a new set of populations. Since it’s out of the HapMap consortium the list of authors itself gives us a large set of individuals who might be of population genetic interest! (though not a representative set of human population variation; where are the Papuan employees of the Broad Institute?) Some of the data coming out of the next stage of the HapMap has been found in several papers already (often in the supplements), but this looks to be an overview and taste of what’s to come (the paper was submitted last fall). Integrating common and rare genetic variation in diverse human populations:

Despite great progress in identifying genetic variants that influence human disease, most inherited risk remains unexplained. A more complete understanding requires genome-wide studies that fully examine less common alleles in populations with a wide range of ancestry. To inform the design and interpretation of such studies, we genotyped 1.6 million common single nucleotide polymorphisms (SNPs) in 1,184 reference individuals from 11 global populations, and sequenced ten 100-kilobase regions in 692 of these individuals. This integrated data set of common and rare alleles, called ‘HapMap 3’, includes both SNPs and copy number polymorphisms (CNPs). We characterized population-specific differences among low-frequency variants, measured the improvement in imputation accuracy afforded by the larger reference panel, especially in imputing SNPs with a minor allele frequency of ≤5%, and demonstrated the feasibility of imputing newly discovered CNPs and SNPs. This expanded public resource of genome variants in global populations supports deeper interrogation of genomic variation and its role in human disease, and serves as a step towards a high-resolution map of the landscape of human genetic variation.

Since the supplements are free to all I recommend you download them if you don’t have academic access. The main difference is that they’re not as pithy in the supplements, and the graphics are lower quality. The populations (the original HapMap populations are the first group of four):

Centre d’Etude du Polymorphisme Humain collected in Utah, USA, with ancestry from northern and western Europe (CEU)
Han Chinese in Beijing, China (CHB)
Japanese in Tokyo, Japan (JPT)
Yoruba in Ibadan, Nigeria (YRI)

African ancestry in the southwestern USA (ASW)
Chinese in metropolitan Denver, Colorado, USA (CHD)
Gujarati Indians in Houston, Texas, USA (GIH)
Luhya in Webuye, Kenya (LWK)
Maasai in Kinyawa, Kenya (MKK)
Mexican ancestry in Los Angeles, California, USA (MXL)
Tuscans in Italy (Toscani in Italia, TSI)

So memorize some of those abbreviations! One particular difference across these populations is that some are parent-offspring trios, and some are not. So the CEU sample are trios, while the TSI are not. This obviously matters since you’re going to have clusters of relatedness within the CEU sample that you wouldn’t have within TSI. There are analytic upsides and downsides to having trios or not having trios, but, for a major purpose of this sort of data set, covering world wide human variation, you probably would want unrelated individuals within a population. These are the samples with trios: CEU, ASW, MXL, MKK, and YRI.

To get the SNPs and CNPs they merged the results from Affymetrix and Illumina chips, and came out with ~1.5 million variants across ~1,000 individuals. In terms of exploring big picture questions which are on a coarse scale this is pretty good, though I’m not sure that it’s that much better than the HGDP, which has so many populations (though about half the number of SNPs). Rather, one of the primary issues focused on in this paper is finding enough of the rarer variants, which may not have shown up in the initial panel because of its narrow population coverage, so as to perform imputation for purposes of statistical analysis in GWAS. So, for example, they compare the CEU vs. the CEU+TSI in imputing to a British study group. Here’s what they found (MAF = minor allele frequency):

For common SNPs (MAF ≥5%), the larger HapMap 3 reference panel made only a slight difference to the already excellent performance (mean r2 increased from 0.946 to 0.961). However, as expected there was greater improvement for rare (MAF <0.5%) and low-frequency SNPs (MAF = 0.5–5%). Their combined mean r2 increased from 0.60 to 0.76, driven by a large subset of rare SNPs (41%) and low-frequency SNPs (25%) where r2 increased by at least 0.1, yielding mean r2 improvement for these subsets of 0.62 and 0.49 respectively…

So the older HapMap data set was fine with more common variants, but a larger sample set really gave some returns with less common variants. This makes intuitive sense. What is interesting to me is that the CEU sample of Utah Whites is presumably genetically close to a group of British whites born in 1958, and yet adding a Tuscan sample was still useful. To get a sense of how the power of this sort of imputation drops off between populations (the greater the genetic distance, the fewer rare variants are shared), they imputed in a pairwise fashion, or compared a population to putative admixtures. So African Americans, who have a substantial proportion of European admixture with West African primary ancestry, are best modeled once you combine CEU+YRI with appropriate weights. This is especially true for rare alleles. For common SNPs, r2 was 83% when African Americans were imputed from the Yoruba alone, and 86.5% when they were imputed from the Yoruba plus Utah Whites. For rare SNPs, it was 45.5% vs. 71.7%! Models which added the other HapMap 3 populations were actually less effective at imputation. East Eurasians have different genetic variants which simply confuse the picture.
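For readers unfamiliar with the r2 figures quoted above, here is a minimal sketch of how such a number can be computed for one SNP, assuming r2 means the squared Pearson correlation between true genotypes (0, 1, or 2 copies of an allele) and imputed dosages. The genotype vectors are made up for illustration, and `squared_correlation` is a name invented for this example, not a function from any imputation package.

```python
def squared_correlation(truth, imputed):
    """Squared Pearson correlation between true genotypes and imputed dosages."""
    n = len(truth)
    mean_t = sum(truth) / n
    mean_i = sum(imputed) / n
    cov = sum((t - mean_t) * (i - mean_i) for t, i in zip(truth, imputed))
    var_t = sum((t - mean_t) ** 2 for t in truth)
    var_i = sum((i - mean_i) ** 2 for i in imputed)
    if var_t == 0 or var_i == 0:
        return 0.0  # correlation is undefined for a constant vector; 0.0 is a convention here
    return cov * cov / (var_t * var_i)

# Hypothetical data for six individuals at one SNP.
truth = [0, 1, 2, 0, 1, 0]                  # true allele counts
good = [0.1, 0.9, 1.9, 0.0, 1.1, 0.1]       # dosages from a well-matched reference panel
poor = [0.5, 0.4, 0.9, 0.6, 0.3, 0.5]       # dosages from a poorly matched panel

r2_good = squared_correlation(truth, good)
r2_poor = squared_correlation(truth, poor)
print(f"well-matched panel r2 = {r2_good:.2f}")
print(f"poorly matched panel r2 = {r2_poor:.2f}")
```

Averaging this quantity over many SNPs in a frequency bin gives a “mean r2” of the kind quoted from the paper.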

It is intuitively obvious why rare alleles show up as you increase sample size. But why are rare alleles more distinctive across populations? If they’re common alleles they’re likely to have been around a long time, and so may be ancestral variants, or have had time to spread via gene flow. In contrast, rare alleles may be new, and so more distinctive across populations. Similarly, there are alleles which surely are passed down through families.
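The sample size intuition can be made concrete with a little arithmetic. As a rough sketch, assume each of the 2N chromosomes in a sample of N diploid individuals independently carries the allele with probability equal to its population frequency; the chance of seeing the allele at least once is then 1 − (1 − p)^(2N). The function name and the frequencies below are chosen for illustration only.

```python
def detection_probability(allele_freq, n_individuals):
    """Chance an allele at this frequency appears at least once among
    2 * n_individuals chromosomes, assuming independent draws."""
    chromosomes = 2 * n_individuals
    return 1.0 - (1.0 - allele_freq) ** chromosomes

for freq in (0.05, 0.005):
    for n in (30, 100, 1000):
        p_seen = detection_probability(freq, n)
        print(f"freq={freq:<6} n={n:<5} P(seen at least once)={p_seen:.3f}")
```

Under this simple model a 5% allele is almost certain to turn up in even a small sample, while a 0.5% allele needs hundreds of individuals before it is reliably observed.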

Figure 3 shows the impact of sample size on SNPs discovered:


Note the two groups of curves: African vs. non-African. This paper confirms the findings that Africans have more genetic diversity than other populations, while East Asians have less (presumably if Amerindians were in the sample they would round out the bottom). From the text:

As judged by this measure, informativeness varied greatly for different population pairs. Consistent with the observation that non-African diversity is largely a subset of African diversity…African samples provided a more complete discovery resource for variant sites in non-African samples than the converse…Focusing only on low-frequency variants in the original sample of 30 individuals (one or two copies, corresponding to allele frequencies of 3.3% or less), even African samples were highly incomplete for diversity outside of Africa, with informativeness ratios dropping to 40–60% in LWK and YRI…In general, for low-frequency variants only closely related populations did an adequate job of capturing variation…probably reflecting the recent origins of low-frequency variants. Two populations, LWK and GIH, stand out as being poorly captured by any of our other populations, the result of admixture with an ancestral population not closely related to any in our regional sequencing data….

So again, African genetic diversity can inform on other populations, but with low frequency allelic variants even Africans don’t have enough to account for non-African groups. As a historical matter much of that might be due to the fact that the non-African variants have emerged more recently since the out of Africa event. Figure 2a shows the pairwise relationships between and within the populations measured by low frequency SNPs. More precisely, they took 30 random individuals from a population, and compared them to 30 random individuals from within the same population (without overlap), as well as 30 random individuals from other populations. The black bar is the same population comparison, while the colored bars represent across population comparisons. The higher the bar the better the across sample concordance; SNPs in one sample set map on well to those in the other sample set. First, observe the minimal difference between CEU & TSI. Europeans are relatively genetically homogeneous, and as far back as History and Geography of Human Genes it was evident that there was relatively minimal within continental variance. Next in line in relation to a CEU reference is GIH, the Gujaratis. This makes sense from all the other studies we know. South Asians are closer to West Eurasians than any other populations. Similarly, YRI are closest in correspondence with LWK, the Bantu sample from Kenya. But though the rank order of population relatedness is roughly similar to what you’d find in Fst, the authors note that the pairwise comparisons are not symmetrical. GIH was informative for 71% of TSI low frequency SNPs, but TSI was only informative for 55% of GIH. Why? GIH is more diverse, but it is also probable that the Gujaratis are a compound of a European-like and non-European population, so what you’re seeing is overlap across the European fractions. Since the Tuscans lack the non-European fraction the Gujaratis will have alleles which aren’t found within them.
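The asymmetry is easy to see with a toy example. The sketch below uses a simplified stand-in for the paper’s informativeness measure (the exact definition is in the paper, not here): the fraction of one sample’s low-frequency variant sites that are also seen in another sample. The site names and counts are invented, and `informativeness` is a hypothetical helper.

```python
def informativeness(target_sites, reference_sites):
    """Fraction of the target sample's variant sites also seen in the reference sample."""
    if not target_sites:
        raise ValueError("target sample has no variant sites")
    return len(target_sites & reference_sites) / len(target_sites)

# Hypothetical site identifiers; these are not real SNP IDs.
shared = {f"site{i}" for i in range(50)}
pop_x = shared | {f"x_only{i}" for i in range(20)}  # 70 sites, less private variation
pop_y = shared | {f"y_only{i}" for i in range(40)}  # 90 sites, more private variation

y_for_x = informativeness(pop_x, pop_y)  # share of X's sites found in Y: 50 of 70
x_for_y = informativeness(pop_y, pop_x)  # share of Y's sites found in X: 50 of 90
print(f"Y informative for X: {y_for_x:.0%}")
print(f"X informative for Y: {x_for_y:.0%}")

# The "low-frequency" cutoff in a sample of 30 diploid individuals:
chromosomes = 2 * 30
one_copy_freq = 1 / chromosomes  # about 1.7%
two_copy_freq = 2 / chromosomes  # about 3.3%
```

The shared sites are identical in both directions; only the denominator changes. The population with more variation of its own is the better reference and the harder target, which is the pattern described for GIH and TSI.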

Speaking of the Gujaratis, there are some interesting results in the supplements which I want to highlight. They illustrate again the importance of context in PCA charts. They’re representations of reality, but only as good as your ability to interpret them and the inputs you’re giving them. Below are a set of images from the supplements, and you can skim them quickly. I’ve labelled them by population and context. Note how the populations shift positions based on the population set of variation you plug into the analysis. These are all the two largest components of variance.

[Image gallery: PCA plots from the supplements, labelled by population and context]

Notice how Gujaratis and Mexican Americans overlap on the world wide PCA plot. Why? Because their gene frequencies are a linear combination of East and West Eurasian genetic variance, to a first approximation. I’ve indicated before that the overlap disappears when you look at other components of variation. But as the second image shows, you don’t have to do that. Use only Mexican Americans, Europeans, and Gujaratis, and you see that Mexican Americans have a component of variance which is different from the other two. That’s because the non-European ancestry of Gujaratis is very different from that of the Mexican Americans, though both cluster together when set next to Europeans, East Asians, and Africans. Remember that in the world wide set PC 1 is African vs. non-African, so removing Africans immediately frees up a dimension for the plot. The last figure shows Mexican Americans with Chinese and Europeans, and again, you see that there’s variation which isn’t simply a linear combination of Chinese and Europeans; Amerindians have their own uniqueness not found in either. In contrast, African Americans are a rather straightforward combination of West Africans and Europeans. Thankfully for African American genetics their parental populations were in the original HapMap. For Gujaratis and Mexican Americans you have only half the picture in the original HapMap, and you’d have to use the imperfect substitute of East Asians (very imperfect for Gujaratis, and somewhat so for Mexican Americans).
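The “linear combination” idea can be illustrated with allele frequencies directly. In this rough sketch an admixed population’s frequency at each SNP is modeled as a weighted average of two source populations, and the weight is recovered by least squares. All frequencies are invented, the function names are hypothetical, and real admixture inference (ADMIXTURE, PCA) is considerably more involved than this.

```python
def mix(freqs_a, freqs_b, weight_a):
    """Allele frequencies of a population drawing weight_a of its ancestry from A."""
    return [weight_a * a + (1.0 - weight_a) * b for a, b in zip(freqs_a, freqs_b)]

def estimate_weight(freqs_mixed, freqs_a, freqs_b):
    """Least-squares estimate of the ancestry share from source A."""
    num = sum((m - b) * (a - b) for m, a, b in zip(freqs_mixed, freqs_a, freqs_b))
    den = sum((a - b) ** 2 for a, b in zip(freqs_a, freqs_b))
    if den == 0:
        raise ValueError("source populations have identical frequencies")
    return num / den

def residual(freqs_mixed, freqs_a, freqs_b):
    """Squared error left over after the best two-source fit."""
    w = estimate_weight(freqs_mixed, freqs_a, freqs_b)
    fitted = mix(freqs_a, freqs_b, w)
    return sum((m - f) ** 2 for m, f in zip(freqs_mixed, fitted))

# Hypothetical allele frequencies at five SNPs.
source_a = [0.10, 0.80, 0.30, 0.55, 0.90]
source_b = [0.70, 0.20, 0.60, 0.05, 0.40]
missing = [0.40, 0.95, 0.05, 0.60, 0.10]  # a source absent from the reference set

two_way = mix(source_a, source_b, 0.2)    # both parental populations are sampled
other = mix(source_a, missing, 0.5)       # half the ancestry comes from the unsampled source

print(f"two-way weight: {estimate_weight(two_way, source_a, source_b):.2f}, "
      f"residual: {residual(two_way, source_a, source_b):.4f}")
print(f"missing-source residual: {residual(other, source_a, source_b):.4f}")
```

When both parental populations are in the reference set the fit is essentially exact, as with African Americans against West Africans and Europeans. When one parental population is unsampled and a substitute stands in for it, a residual remains, which is the extra component of variance that shows up for Mexican Americans.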

One final issue on phylogenetic relationships: the strange pattern among Gujaratis which I perceived among other South Asians as well is still evident. In the plot with Mexican Americans + Europeans + Gujaratis, the Gujaratis seem a linear combination of European + something else. What Reich et al. would term “Ancestral North Indian” + “Ancestral South Indian.” But the Gujarati + European plot shows that in the second component of variation there’s a difference between two clusters of Gujaratis. There’s something going on with the Gujarati group which is a touch closer to Europeans on the largest component of variance, because on the second dimension they’re deviated from the other Gujarati cluster and Europeans. This is similar in quality to the pattern with the South Asian data set with an orthogonal component of variation to the European-South Indian axis. The orthogonal component is striking among those which are between the Europeans and South Indians. The CEU + GIH + CHB plot doesn’t indicate to us that it’s East Asian either.

Of course the paper wasn’t just about validating the power of expanding the data set for medical genetics and clarifying phylogenetic relationships. There are several subsections, but I thought I’d jump to the end where they allude to detecting natural selection. This seems preliminary at least. They didn’t really go that much further for populations in the original HapMap, but found some interesting stuff for the new groups. To the left is a table from the supplements (I reedited it a bit) which shows loci which popped out of the CMS test for natural selection for Tuscans, Maasai, and Luhya (the latter two are Nilotic and Bantu groups from Kenya). I present the results for readers with an interest in particular loci who might see something in the list that does, or doesn’t, make sense to them. It seems that this part of the paper is primarily about showing that the new populations have some utility in fleshing out evolutionary phenomena which may have been missing in the original analyses of the HapMap because of constrained population coverage. Comparing Tuscans to CEU, and the Maasai to Luhya, should tell us something about the evolution of lactase persistence. These pairs consist of populations which are rather close to each other in terms of ancestry (especially the European groups), but local ecological and cultural conditions have no doubt applied different selection pressures (the majority of Tuscans seem to lack the lactase persistence allele common in northern Europe last I checked).

Finally, from the conclusion:

With improvements in sequencing technology, low-frequency variation is becoming increasingly accessible. This greater resolution will no doubt expand our ability to identify genes and variants associated with disease and other human traits. This study integrates CNPs and lower-frequency SNPs with common SNPs in a more diverse set of human populations than was previously available. The results underscore the need to characterize population-genetic parameters in each population, and for each stratum of allele frequency, as it is not possible to extrapolate from past experience with common alleles. As expected, lower-frequency variation is less shared across populations, even closely related ones, highlighting the importance of sampling widely to achieve a comprehensive understanding of human variation.

Intrepid readers can poke around the data themselves at the HapMap website.

Citation: The International HapMap 3 Consortium (2010). Integrating common and rare genetic variation in diverse human populations. Nature. DOI: 10.1038/nature09298


Had to link to a paper with such a title. Relative Impact of Nucleotide and Copy Number Variation on Gene Expression Phenotypes:

…SNPs and CNVs captured 83.6% and 17.7% of the total detected genetic variation in gene expression, respectively, but the signals from the two types of variation had little overlap….

Here’s a quote from a popular press article:

“We’ve been able to look back into our history and find changes that are older and likely to be shared among populations,” explained Dr Manolis Dermitzakis, senior author and Project Leader at the Wellcome Trust Sanger Institute. “But we also find many that are newer and less widespread.”

“These are part of our recent evolution and a step along the way to understanding the origin and personal consequences of genetic change, not least for our wellbeing. This is a first generation map of biologically important DNA sequence variation.”

• Category: Science • Tags: Evolution, Genetics, HapMap 
Razib Khan