The Unz Review: An Alternative Media Selection
A Collection of Interesting, Important, and Controversial Perspectives Largely Excluded from the American Mainstream Media
Razib Khan
Gene Expression Blog
Indian Genetics


Update: Here is a post that you must read, A note on the early expansions of the Indo-Europeans. The post dates to the middle of December, and is similar in many ways to my own thoughts. But the author rejects a two-wave model in which the first wave has a deep time history, and seems to favor the view that agriculture in South Asia was predominantly an indigenous development, and not primarily an exogenous event. Rather, they suggest that there were multiple waves of Indo-Aryans into South Asia, with the steppe cultures being parallel pulses from the Indo-Iranian ur-heimat. The primary criticism of the genetic interpretation that I would make is that, as I understand it, LD decay methods seem to catch the last admixture event and/or underestimate the time since mixture began. Therefore, though I accept a substantial mixture event ~4,000 years before the present, my own model presented below suggests that older ones occurred thousands of years earlier.

That being said, I have updated my own views to rather uncertain at this point. I would not be surprised if on the whole a model as the one proposed in the blog post is closer to the truth than the one below. My reasoning has less to do with the details of the argumentation, and more to do with authority.

1) the individual who wrote the above post has comparable mastery of the historical genetic descriptive results.

2) but, the individual has far superior understanding of the archaeology and philology in comparison to me.

Ignoring the details of any argument, on a priori grounds I find that the individual above could give a better appraisal of the probabilities in regards to South Asian archaeogenetics than I could. The main thing that is holding me back from suggesting that I now find their model more probable than mine is the issue in regards to LD and rolloff methods. But I’ve definitely increased my uncertainty, from ~25% to ~50%, with the balance split between the two models (or some combinations thereof).

End Update

Sometimes you see things in fragments, disparate threads, which only snap into focus in hindsight. In this post I will hazard a prediction of the results which are going to come out of remains from Indus valley sites in South Asia: they will confirm that there were two major demographic pulses which entered the subcontinent from the Northwest over the past 10,000 years. The first wave was genetically the dominant one in comparison to the second, and began at Mehrgarh 9,000 years ago. Its locus of origin was in the highlands of Western Asia, between the Caucasus and the Fertile Crescent. The second wave, though, left its mark culturally, as it is associated with Indo-Aryans, and likely derives ultimately from the trans-Volga steppe societies. The genetic signatures of the first-wave people are found in nearly every indigenous South Asian group, as they amalgamated with a deeply entrenched local group of peoples who were distantly related to those of Oceania and eastern Eurasia. In short, the local peoples are the “Ancestral South Indians” (ASI) and the first-wave newcomers are the “Ancestral North Indians” (ANI, see Reconstructing Indian Population History).

The figure above is from Upper Palaeolithic genomes reveal deep roots of modern Eurasians (open access), which found that ancient DNA from two samples in the northern Caucasus region represents a population which contributed to the origins of the steppe people who swept into Northern Europe ~4,500 years ago. It shows how contemporary populations are best modeled as admixture events between reference populations. What you see is that most South Asian groups are well modeled as a mixture between “Caucasian hunter-gatherers” (CHG) and another element which is labeled “South Asian” because it is mostly restricted to the subcontinent. But wait, there’s more! In the supporting materials the statistics show that though most South Asian groups have more potential mixture from the high quality CHG sequence, Kotias, a subset (unspecified Gujarati groups and Tiwaris) share more drift with the Afanasevo culture, which flourished in the Altai region of Central Asia between 5,500 and 4,500 years ago. We have enough ancient DNA to infer that the Afanasevo were basically the same people as the Yamna culture, who were present between the Volga and Dnieper, far to the west. The Tiwari are an upper caste group which is present across Northern India. The second wave component is clearly strongest in the Northwest, as indicated by the Kalash sharing so much drift with Mal’ta. Before subsequent waves of gene flow into the steppe people, which brought dollops of European farmer and hunter-gatherer ancestry into the mix, they had a higher fraction of Ancestral North Eurasian (ANE) than any contemporary Northern European population. Their contribution to South Asian groups on the Northwest fringe of the subcontinent then explains the presence of high fractions of ANE there.

A final aspect which needs to be mentioned is that the Z93 subclade of R1a1a is found across much of South Asia. Though it is correlated with higher caste, Indo-Aryan speaking populations, it is not exclusive to them. In fact it is found in substantial fractions among notionally primal tribal people in South India who traditionally practice primitive slash and burn agriculture and engage in extensive hunting and gathering. Ancient DNA results from the Srubna culture of Central Eurasia have yielded Z93 among buried males. This subclade is rather rare in this region today, and it succeeded groups which were carrying R1b, today dominant across Western Europe. The details are to be worked out, but I believe that the carriers of Z93 were associated with, but more expansive than, the Indo-Aryans. Beyond the limits of the folk migrations were outrider groups of males who integrated themselves into indigenous societies, often taking elite positions as members of a dominant patrilineage. If there was a strong bias for male descendants of a small number of these individuals, but not female ones, to have higher reproductive fitness, then over time their Y chromosomes might be far more common than their total genome contribution (to illustrate what I’m talking about, a recent paper on Australian Aboriginals reports that 56% of their Y chromosomes introgressed over the past 200 years from Europeans!).

Bringing it together, one implication of the above is that the Dravidian languages of the Indian subcontinent were probably brought by the West Asian farmers (perhaps confirming an ancient link to Elamite?). Therefore, the language(s) of the Indus valley civilization was probably a form of Dravidian. Another aspect to consider is that no South Asian population lacks the genetic imprint of these West Asian farmers. It seems likely that, as in Europe, the farmer populations which entered the subcontinent via the northwest totally marginalized most of the hunter-gatherer groups, which were numerically less substantial in any case. But why do all South Asian groups also exhibit ASI ancestry, which is deeply rooted in the subcontinent? Just as in Europe, the initial populations of farmers on the fringes of the subcontinent mixed with the local hunter-gatherers, producing a synthetic population which over time evolved its cultural toolkit to become better adapted to South Asian geographies. Once the crucial cultural adaptations occurred, the synthetic population underwent a phase of massive demographic expansion beyond its delimited ghetto on the fringes, where West Asian climatic parameters allowed for the initial phase of near total cultural transplantation. As in Europe, the expanding South Asian farmer groups absorbed hunter-gatherer substrate, accruing greater and greater ASI fractions on the wave of demographic advance, and so generating the ANI-ASI cline evident in genetic analyses. The presence of ASI in groups like the Pashtuns in Afghanistan is probably due to the fact that the synthetic populations, what we now term “South Asians” or “Indians” or “desis”, exhibited enough cultural hegemony and influence to reach deep into the plateau of modern Afghanistan and impact both the pre-Iranic and East Iranic people of Afghanistan (also, note that Indians were very common as slaves in the cities of Afghanistan during the early Islamic period).

The reason I took time to put this post up now is that it looks like the publication of ancient South Asian genomes from the Indus valley period is imminent. From The Guardian on December 30th, Rakhigarhi: Indian town could unlock mystery of Indus civilisation:

One has stood out: who exactly were the people of the Indus civilisation? A response may come within weeks.

“Our research will most definitely provide an answer. This will be a major breakthrough. I am very excited,” said Vasant Shinde, an Indian archaeologist leading current excavations at Rakhigarhi, which was discovered in 1965.

Shinde’s conclusions will be published in the new year. They are based on DNA sequences derived from four skeletons – of two men, a woman and a child – excavated eight months ago and checked against DNA data from tens of thousands of people from all across the subcontinent, central Asia and Iran.

They looked somewhat like a recent Miss America!


I predict that the Y chromosomal haplogroups will be H or J2. Both of these are common in Dravidian speaking groups of Southern India, and are found at some fractions in West Asia. I predict that these individuals will share gene flow with Kotias, and not with Central Eurasian groups. I predict that these individuals will not be enriched for ANE ancestry. I predict these individuals will have mtDNA lineages present in modern Indian populations, probably M. Though excavated in a region of South Asia where today lactase persistence (LP) is common, none of the individuals will carry the common derived Eurasian haplotype conferring LP. They will segregate for the derived variant of SLC24A5. On a PCA plot these individuals will cluster with non-Brahmin upper/middle caste South Indian populations, such as the Reddys of Andhra Pradesh.

Note: I’ve been told by friends for two years and more that there are efforts to sequence and type Indus valley individuals. But I have no inside information. If you are an individual in the media who has early access feel free to send me a PDF with the understanding that I will honor the embargo! (if you don’t send me the PDF I’m mildly confident I’ve already hit the major themes you are safeguarding)

• Category: Science • Tags: India, Indian Genetics 


In the comments below I remarked that the Parsi people of India, who reputedly arrived in India ~1000 years ago from Iran, are about 25 percent South Asian. By this, I mean that their ancestry is about 75 percent Iranian (presumably Persian), with 25 percent admixture from the South Asian populations amongst whom they lived. But my feeling about this was vague, and I decided to check the scientific literature. Unfortunately there hasn’t been a lot of work done in this area with cutting edge genomics. But a cursory examination shows that there’s been substantial migration of Indian women into the Parsi lineage via the mtDNA. In the figure to the right you see that “PA”, the Parsis, have a lot of “South Asian” mtDNA lineages compared to the Iranian groups. This mostly consists of South Asian branches of haplogroup M. It jumps out at you immediately when looking at the haplotypes that the Parsis carry on their mtDNA. I found less on the Y chromosomes, which are less informative in differentiating South Asians from Iranians in any case (the mtDNA difference is much greater between these two regions), but what I did find is that Parsis can be modeled as 100% Iranian on their paternal lineages. This is probably an exaggeration, but as a stylized fact I think it gets to the heart of the matter.

But what would really be useful are autosomal results. Those were hard to find. Noah Rosenberg’s 2006 paper on Indian genetic differentiation using microsatellites did have a Parsi sample. If you look at the results the Parsi do seem South Asian, roughly equivalent to Pathans, an Iranian speaking group in Pakistan which has strong South Asian affinities. But the sample set does not include any Iranian groups from Iran proper, but rather Middle Eastern groups from the Arab world or the Caucasus. Without such a reference population it is hard to gauge Parsi relatedness.

There was one last hope. Harappa DNA has been collecting results for many years now, and I was hoping that there was a Parsi in the sample. There was, just one. I took the Parsi and compared this individual to various Iranian and a few select Indian groups. Here are the admixture results (edited to show only the relevant ancestral clusters):

Ethnicity                   S.Indian  Baloch  Caucasian  NE.Euro  Mediterranean  SW.Asian
Kurd (Iraqi)                       0      29         40        4              6        16
Iraqi Arab                         1      11         30        0              5        44
Kurd (Iraqi)                       1      26         43        5              5        16
Kurd (Iraqi)                       1      28         43        5              5        13
Kurd (Iranian)                     1      29         41        7              6        12
Kurd Zaza Turkey                   2      23         43        6              6        13
Iranian                            2      24         43        5              7        13
Kurd (Turkish)                     2      26         46        6              6        10
Iranian                            2      28         47        7              3        10
Iranian                            2      29         43        3              8         8
Iranian                            2      30         44        4              2        13
Iraqi Arab                         3      20         39        0             10        19
Kurd Kurmanji Iraq                 4      21         41        4              7        15
Kurd from Turkey                   4      24         41        4              8        12
Iranian                            4      26         39        7              7        12
Kurd Yezidi Iraq                   4      26         39        4              7        13
Iranian                            4      27         41        5              6        11
Iranian                            4      29         37        4              4        12
Iraqi Arab                         5      19         38        5              7        19
Kurd Kurmanji Iraq                 5      24         39        4              8        13
Iranian                            5      25         38        5              7        12
Kurd (Iraqi)                       5      27         41        5              5        14
Iranian                            6      25         37        6              6        12
Kurd (Feyli)                       6      25         38        3              7        14
Iranian Khorasani                  8      29         35        9              2        11
Afghan Pashtun                    14      32         25       12              3         4
Pashtun (Kandahar)                15      34         25       10              0         5
Mumbai Parsi                      16      28         28        5              4        12
Afghan Pashtun                    20      36         17       11              0         5
Afghan Pashtun                    21      33         17        9              2         2
Pashtun                           21      35         18       10              0         5
Gujarati Khoja                    28      47         13        7              0         1
Gujarati Patel Muslim             34      32         13        3              3         6
Gujarati Sunni Vohra Surti        35      34         13        5              2         4
Gujarati Ganchi                   38      42          5        9              3         0
Gujarati Vaishnav Vania           45      36          4        4              1         3
Gujarati Jain                     46      36          6        4              0         0
Gujarati Vaniya                   52      37          2        6              0         1
Gujarati                          53      43          0        0              2         0
Gujarati                          56      39          0        0              2         0

The key is to focus on the “South Indian” ancestry. Though this is found in some Iranian groups, it drops off very rapidly once you move past groups like the Pathans. The Parsi individual has 16 percent South Indian ancestral component. Looking at the Iranian individuals, you can probably say that you might expect 5 percent from this population. The question is what is the Indian source population? There’s a lot of variation among these. But, if you take 50 percent South Indian for the South Asian source population, then you get:

(50 percent)*(0.25) + (5 percent)*(0.75) = 16.25%

So at least going by this one individual something like ~25 percent is probably correct for the Parsis in terms of how much “native” South Asian ancestry they’ve picked up. Since they are genetically quite homogeneous at this point an N = 1 might be sufficient to reach a conclusion. I’d be curious if anyone finds anything different.
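The back-of-the-envelope calculation above is a two-way mixture formula, and it can be inverted to ask what admixture fraction the observed 16 percent implies. The sketch below is only an illustration of that arithmetic: the function names are made up for this example, and the alternative source values in the loop are simply "S.Indian" percentages picked from Gujarati rows of the table to show how sensitive the answer is to the assumed source population.

```python
def expected_component(admix_fraction, source_level, base_level):
    """Expected level of a marker component in a two-way mixed population."""
    return admix_fraction * source_level + (1.0 - admix_fraction) * base_level


def solve_admix_fraction(observed_level, source_level, base_level):
    """Invert the two-way mixture formula for the admixture fraction."""
    if source_level == base_level:
        raise ValueError("source and base levels must differ")
    return (observed_level - base_level) / (source_level - base_level)


# Numbers from the post: 50% "S.Indian" in the assumed South Asian source,
# 5% in the Iranian baseline, 16% observed in the single Parsi sample.
forward = expected_component(0.25, 50.0, 5.0)
implied = solve_admix_fraction(16.0, 50.0, 5.0)
print(f"expected S.Indian at 25% admixture: {forward:.2f}%")
print(f"admixture implied by 16% observed:  {implied:.3f}")

# Sensitivity to the choice of source population (values from Gujarati rows).
for source in (35.0, 45.0, 50.0, 56.0):
    fraction = solve_admix_fraction(16.0, source, 5.0)
    print(f"source S.Indian = {source:.0f}% -> admixture fraction {fraction:.3f}")
```

With a 50 percent source the implied fraction comes out a little under one quarter, and it moves noticeably as the assumed source changes, which is why the choice of Indian source population matters here.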


Most people in South Asia speak one of two varieties of language, Indo-Aryan and Dravidian. These two are not particularly closely related. Indo-Aryan is a branch of Indo-European, as is evident in the plethora of obvious cognates with other Indo-European languages. I have a minimal fluency in Bengali, the easternmost of the Indo-European languages, and quite a bit more fluency with English, one of the westernmost, and the relationship was evident to me rather early on (e.g., grass vs. gash, man vs. manush, nose vs. nak). In contrast, to me Dravidian languages are peculiar because the accent and cadence are clearly South Asian, but they are utterly impenetrable (though there are many loan words into Indo-Aryan from Dravidian).

But in this post I’m going to explore the genetic relationships of the people who speak a subgroup of Austro-Asiatic languages indigenous to India, that of the Munda. The traditional question has always been whether the Austro-Asiatic languages are from India, or, whether they are from Southeast Asia. More precisely, did the Munda culture come to India, or is the Munda culture a relic of the original Austro-Asiatic domain in eastern India?

As background I believe it is important that readers understand that the territory between Vietnam and that of the Munda was likely dominated by Austro-Asiatic dialects ~2,000 years ago. Both the Burmese and Thai arrived in the historic period from southern China, and overthrew Mon or Khmer cultures which flourished in lowland Southeast Asia. In the case of both the Burmese and Thai it was a situation where the newcomers imposed their language upon the indigenous population, but by and large adopted most elements of high culture from the natives (e.g., Theravada Buddhism). The monarchies of Thailand and Burma drew directly from the Indic-inflected polities of the Khmer and Mon.

The recent extensive distribution and variety of Austro-Asiatic languages in Southeast Asia is suggestive of the likelihood that they derive from this area, but it is not a definitive point in that model’s favor. But there are now other, genetic, lines of inquiry. A few years ago a paper came out which reported that the Y chromosomal lineages of the Munda people which connect them to Southeast Asia are much more diverse in Southeast Asia. This matters because population expansions and migrations tend to homogenize lineages through greater genetic drift, with the “source” population more likely to maintain diversity. Additionally, there was also evidence of a genetic variant in EDAR which has the hallmark of a recent increase in frequency across eastern Asia. This seems to peg the Munda arrival to the Holocene, not the Pleistocene. Finally, there is the pattern of male lineages exhibiting some concordance with Southeast Asia, but female lineages being entirely indigenous. This is a classic expectation from a model of migration where there was a strong bias toward males because of the mobility of these groups, which lacked women and children.
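To make the "source populations keep more diversity" argument concrete, here is a minimal sketch using Nei's unbiased haplotype (gene) diversity. This is a standard summary statistic swapped in for illustration, not necessarily what the paper used, and the two haplotype lists are invented toy data.

```python
from collections import Counter


def haplotype_diversity(haplotypes):
    """Nei's unbiased gene diversity: n/(n-1) * (1 - sum of squared frequencies)."""
    n = len(haplotypes)
    if n < 2:
        raise ValueError("need at least two sampled haplotypes")
    counts = Counter(haplotypes)
    sum_sq = sum((count / n) ** 2 for count in counts.values())
    return n / (n - 1) * (1.0 - sum_sq)


# Hypothetical Y haplotype labels; NOT real data.
source_sample = ["h1", "h2", "h3", "h4", "h5", "h1", "h2", "h6"]   # putative source region
derived_sample = ["h1", "h1", "h1", "h1", "h2", "h1", "h1", "h1"]  # putative migrant group

h_source = haplotype_diversity(source_sample)
h_derived = haplotype_diversity(derived_sample)
print(f"source diversity:  {h_source:.3f}")
print(f"derived diversity: {h_derived:.3f}")
```

In this toy case the drifted, founder-like sample has much lower diversity than the putative source, which is the direction of the pattern reported for the Munda lineages relative to Southeast Asia.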

I decided to further explore the question using the Estonian Biocenter data sets, as well as the HGDP and HapMap. For those of you who are curious about the technical details, I LD pruned the Estonian Biocenter marker set from ~600,000 down to ~130,000. I also put the samples through --geno 0.01 and --mind 0.80 in PLINK to get high quality individuals and good coverage on markers. To be explicitly clear, I renamed and combined some of the populations in the original data set (e.g., Chamars = UP_Dalits). I ran a preliminary MDS to make sure that the data wasn’t strange, and it checked out.

So to do the analysis I ran TreeMix. I used Chinese Americans as the root outgroup population, and wanted 5 migrations, and also tried to correct for any remaining LD by looking across a window of 1,000 SNPs. You can view my first plot below.
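The PLINK filtering and TreeMix settings described above can be sketched as command lines. The snippet only builds and prints the commands; it does not run anything. Several things are assumptions for illustration: the file prefixes, the root population label, the order of filtering versus pruning, and the `--indep-pairwise` window, step, and r² values (the post does not say which pruning parameters were used). The conversion from PLINK output to TreeMix's allele-count input is a separate step that is not shown.

```python
import shlex


def plink_filter_command(in_prefix, out_prefix, geno=0.01, mind=0.80):
    """PLINK call applying the --geno and --mind thresholds quoted in the post."""
    return ["plink", "--bfile", in_prefix,
            "--geno", str(geno), "--mind", str(mind),
            "--make-bed", "--out", out_prefix]


def plink_ld_prune_commands(in_prefix, out_prefix, window=50, step=5, r2=0.2):
    """Two PLINK calls: list SNPs in approximate linkage equilibrium, then keep them.

    The window, step and r^2 values are placeholders, not the author's settings.
    """
    mark = ["plink", "--bfile", in_prefix,
            "--indep-pairwise", str(window), str(step), str(r2),
            "--out", in_prefix]
    keep = ["plink", "--bfile", in_prefix,
            "--extract", f"{in_prefix}.prune.in",
            "--make-bed", "--out", out_prefix]
    return [mark, keep]


def treemix_command(freq_file, out_stem, root, migrations=5, block_size=1000):
    """TreeMix call with a root population, m migration edges and k-SNP blocks."""
    return ["treemix", "-i", freq_file, "-root", root,
            "-m", str(migrations), "-k", str(block_size), "-o", out_stem]


# Hypothetical file names and population label.
commands = [plink_filter_command("merged", "merged_qc")]
commands += plink_ld_prune_commands("merged_qc", "merged_pruned")
commands.append(treemix_command("merged_pruned.treemix.frq.gz", "munda_m5",
                                root="Chinese_American"))

for cmd in commands:
    print(shlex.join(cmd))
```

Here `-m 5` corresponds to the five migration edges and `-k 1000` to grouping SNPs into blocks of 1,000 to account for residual linkage disequilibrium.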

The primary thing I would focus on is the gene flow from Cambodians to Munda. This is exactly what one might expect if the Munda were intrusive to South Asia. More interestingly, observe that there is no gene flow into Burmese from the South Asian groups, even though they are in much closer proximity to South Asia! This is probably picking up something deep in history then. The fact that the Munda diverge early from other South Asian groups is also in keeping with Admixture or Structure bar plot results: the South Asian ancestry of the Munda is relatively unadmixed.

Next I wanted to focus more on the eastern population flows. So I removed a lot of the western groups which overwhelmed my gene flow edges.

In this scenario again there is a gene flow parameter from the rough region of the Cambodian node. Perhaps more curiously, there is now a powerful gene flow parameter into the Burmese from the same locus. This is totally intelligible in light of the fact that the modern Burmese are genetically a hybrid population between Tibeto-Burman and Mon (Austro-Asiatic).

I’m certainly not ready to assert that the “case is closed.” But it seems that we need to shift our probabilities again toward the intrusive hypothesis.

Image credit: Wikipedia


A Cape Coloured family

I’ve mentioned the Cape Coloureds of South Africa on this weblog before. Culturally they’re Afrikaans in language and Dutch Reformed in religion (the possibly related Cape Malay group is Muslim, though also Afrikaans speaking traditionally). But racially they’re a very diverse lot. In this way they can be analogized to black Americans, who are ~75% West African and ~25% Northern European, with the variance in ancestral proportions being such that ~10% are ~50% or more European in ancestry. The Cape Coloureds though are much more complex. Some of their ancestry is almost certainly Bantu African. This element is related to the West African affinities of black Americans. And they have a Northern European element, which likely came in via the Dutch, German, and Huguenot settlers (mostly males). But the Cape Coloureds also have other contributions to their genetic heritage. Firstly, they have Khoisan ancestry, whether from Bushmen or Khoi. This is well known in their oral memory. The hinterlands of the Cape of Good Hope are beyond the ecological range of the Bantu agricultural toolkit, so the region was still dominated by the Khoisan when the Europeans arrived. But there are also other suggestions of ancestry from Asia. The existence of the Cape Malays, whose adherence to Islam derives from the Muslim slaves brought by the Dutch, hints at likely relationships to the populations of maritime Southeast Asia. Finally, there are the Indians. This element is not too well recalled in cultural memory. But the Dutch brought many slaves from India as well as Southeast Asia. The first Dutch governor of the Cape Colony had a maternal grandmother who was an Indian slave, by various accounts Goan or Bengali (the town of Stellenbosch is named for him). No doubt the usual lot of the descendants of Indian slaves during the Dutch era was far more likely to be absorption into the melange of the Coloured population than assimilation into what later became the Afrikaners.

Why is this aspect of Cape Coloured ancestry forgotten? I think part of the reason is that there is a large South African Indian community present today, but that community post-dates the Dutch period, and arrived with the British. When South Africans think of Indians they think of these people. Interestingly when the new genetic studies confirming Indian ancestry came on the scene I was “corrected” several times by Indians themselves when reporting this part of the Coloured heritage. They were under the impression I must be mistaken, as no one was familiar with the Cape Coloureds having Indian ancestry. Unfortunately pointing to PCA and STRUCTURE plots did not clear up the confusion.

In any case, thanks to the African Ancestry Project I now have three unrelated Coloured samples (I have more, but they are related). Since AAP is Afrocentric I thought it would be appropriate to run the Coloured samples separately first. So that’s what I did.

First, the methodology. I took the Gujaratis, Utah whites, Chinese from Denver, and Luhya (Bantu) from Kenya, and merged them with the Bushmen from the Henn et al. thick-marker data set. I also decided to add in the Yemeni Jews from Behar et al., mostly to check that the West Eurasian ancestry of the Cape Coloureds was in fact Northern European. I limited the Gujarati sample to those from “Gujarati_B”, which is the “more South Asian” cluster within the HapMap data set. I also reduced the numbers for a lot of HapMap populations. I’m looking at inter-continental differences, so I assumed that an N of ~20 would suffice. After merging these data sets with the Cape Coloured samples I pruned all the SNPs with missing data. This left me with ~230,000 markers. In my experience this is kind of overkill for ADMIXTURE at this level of genetic distance between the hypothetical parent populations, but better safe than sorry. I also ran the samples through EIGENSOFT to generate PCAs. Also know that I performed a few “trials” with Sandawe and Hadza from Henn et al., as well as with larger samples from the HapMap. That either added nothing on the margin, or just got confusing (there’s not really too much Sandawe and Hadza in the Cape Coloureds beyond what the Bantu must have picked up).
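Two of the data-preparation steps above, capping each reference population at roughly 20 individuals and dropping SNPs that have any missing call after the merge, can be illustrated with a toy sketch. This is plain Python on made-up data rather than the actual PLINK workflow, and the function names, sample IDs, and genotype encoding (0/1/2 with `None` for missing) are assumptions for the example.

```python
import random


def downsample(ids_by_population, n=20, seed=0):
    """Keep at most n randomly chosen sample IDs per population."""
    rng = random.Random(seed)
    return {pop: sorted(rng.sample(ids, min(n, len(ids))))
            for pop, ids in ids_by_population.items()}


def drop_snps_with_missing(genotypes):
    """Keep only SNP columns where every sample has a non-missing call.

    genotypes maps sample ID -> list of calls (0/1/2, or None if missing).
    Returns the filtered genotypes and the indices of the SNPs kept.
    """
    n_snps = len(next(iter(genotypes.values())))
    keep = [j for j in range(n_snps)
            if all(calls[j] is not None for calls in genotypes.values())]
    filtered = {sid: [calls[j] for j in keep] for sid, calls in genotypes.items()}
    return filtered, keep


# Toy inputs; the population labels and IDs are hypothetical.
panel = {"Gujarati_B": [f"GIH{i:02d}" for i in range(30)], "Coloured": ["C1", "C2", "C3"]}
subset = downsample(panel, n=20)

toy_genotypes = {"C1": [0, 1, None, 2], "GIH00": [1, None, 2, 0]}
clean, kept = drop_snps_with_missing(toy_genotypes)
print({pop: len(ids) for pop, ids in subset.items()})
print(kept, clean)
```

Small populations are left untouched by the cap, and only markers typed in every retained sample survive, which is why merging data sets from different chips shrinks the marker count.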

After I ran ADMIXTURE up to K = 7 it was clear that the optimal point in terms of informativeness was K = 6. You can see that the Cape Coloured samples have Northern European, Khoisan, Bantu African, Indian, and East Asian ancestry. There is a Yemeni component in two of the Coloured individuals which begs to be explained. This component is too high to be explained by Northern European ancestry alone. It could be explained by slaves from the Muslim Arab world. Also, the Indian reference sample used here was pruned to be very homogeneous. The slaves from South Asia were almost certainly much more diverse than the Gujarati_B population, which is mostly a group of Patels. Finally, sometimes when you run ADMIXTURE you see that combinations of atypical genetic backgrounds (e.g., Khoisan + Chinese) can generate components which are likely artifacts. This tends to be an issue when you have two components which aren’t normally found together, and one is at a far lower level than the other. I’ve noticed this in particular with people with low amounts of Sub-Saharan African ancestry and Eurasian genetic backgrounds. They often come out to be East African or Pygmy or Bushmen when the probability of this is likely to be very low a priori. Notice that a few of the Bushmen have the Yemeni component but nothing else besides what you’d expect. This to me increases the likelihood that the light green in the Coloureds is also an artifact of the Khoisan genetic background against one of the other components.

So below is the K = 6 ADMIXTURE plot, along with the informative PCAs. Observe that the three Coloureds have IDs.

Image Credit: Wikimedia Commons.


Two years ago Reconstructing Indian Population History reframed how we should view South Asian historical genomics. In short, Indians can be viewed as a hybrid between a West Eurasian group, “Ancestral North Indians” (ANI), and a very different group, “Ancestral South Indians” (ASI), which had distant connections to West and East Eurasians. At least to a first approximation. Last fall I posted on a new paper which surveyed the Austro-Asiatic speaking peoples of India, and concluded that they were exogenous to the subcontinent. This is an interesting point. Prehistoric treatments of South Asia often use linguistic terms to denote putative ancient populations. One model is that first it was the Munda, the most ancient Austro-Asiatics. Then the Dravidians. And finally the Indo-Aryans. These genetic data imply that the Munda arrived after the initial ANI-ASI synthesis. The Munda people of India can be thought of as ANI-ASI, with an overlay of East Eurasian ancestry.

Zack Ajmal’s K = 11 ADMIXTURE run has highlighted some further issues. He has a set of Austro-Asiatic samples, as well as a host of Indo-Aryan and Dravidian speaking populations. I believe we can now further clarify and refine our model of the peopling of India. Here it is:

1) ASI, circa ~10,000 years BP

2) ANI enters the subcontinent from the northwest, synthesis with ASI

3) The ancestors of the Munda enter from the northeast, synthesis with ANI + ASI in their region

4) A subsequent group of West Eurasians, related to the ANI, so I will term them ANI2 (abbreviated AN2 below), enters from the northwest and overlays the ANI + ASI synthesis. In the northeast quadrant of the subcontinent this group marginalizes the Munda people, who are either assimilated or escape to more remote locations. I believe that ANI2 is likely the Indo-Europeans, but it may be Dravidians as well

5) A second group of Austro-Asiatic peoples enters from the northeast, and synthesizes with the AN2 + ANI + ASI. In some regions they are absorbed (Assam), but in other regions they are culturally dominant (Meghalaya)

Below are two plots which illustrate where I’m coming from. The “S Asian” component from K = 11 above seems to overlap, but is not identical to, ANI. The “Onge” component plays a similar role with ASI. The “SW Asian” and “European” elements are pretty straightforward. They’re very closely related to the “S Asian” one, but they do separate from it. Their relationship to distant non-Indian groups as well as a gradient toward the northwest suggests to me a more recent arrival of this element.

Two patterns. For the Indo-European and Dravidian South Asian groups you see a vertical distribution which corresponds to populations which are a combination of ANI/ASI. But notice the perpendicular distribution of the Austro-Asiatic groups. The East Eurasian element in their ancestry means that they are not fully modeled by the two-way admixture. I believe that the “Onge” fraction, which tracks ASI, is overestimating ASI in the Austro-Asiatic groups, because this proportion just seems way too high in many Southeast Asian and Dai groups to be plausible to me as a perfect proxy for ASI in them. But in any case, note that the Austro-Asiatic groups seem to be mostly a mix of ANI/ASI like other South Asians. There is clearly one outlier population. I’ll get to them.

Below is a plot which shows the ratio of the sum of AN2 over the stabilized hybrid proportion.

We know from Reconstructing Indian Population History that South Indian tribals and Dalits have a fair amount of West Eurasian ANI. But, from the genome bloggers, and especially Zack’s further analyses, we can see that there is a further component of West Eurasian ancestry which is probably not ANI, but post-dates it. These components have affinities to Southwest Asia or Central Eurasia. They’re labeled “SW Asian” and “European” in Zack’s K = 11. Here’s the big thing you notice: this element increases southeast-northwest, and low caste to high caste. It’s almost absent among many Dravidian populations. It is very common in the northwest of the subcontinent.
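As a rough illustration of the ratio described above, the sketch below computes (SW Asian + Euro) divided by (S Asian + Onge) for a few rows copied from the table at the end of this post. Treating "S Asian" + "Onge" as the stabilized hybrid proportion is an assumption based on the description of those components as proxies for ANI and ASI; the function name is made up for this example.

```python
# (S Asian, Onge, SW Asian, Euro) percentages, copied from the table in this post.
rows = {
    "Paniya": (47, 45, 0, 0),
    "Velama": (60, 29, 7, 2),
    "Vaish": (52, 24, 6, 15),
    "Kashmiri pandit": (51, 18, 12, 15),
    "Khasi": (21, 21, 0, 3),
}


def an2_ratio(s_asian, onge, sw_asian, euro):
    """Ratio of the later West Eurasian components to the older hybrid components.

    Assumes AN2 = SW Asian + Euro and stabilized hybrid = S Asian + Onge.
    """
    hybrid = s_asian + onge
    if hybrid == 0:
        raise ValueError("hybrid proportion is zero")
    return (sw_asian + euro) / hybrid


ratios = {group: an2_ratio(*values) for group, values in rows.items()}
for group, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{group:16s} {ratio:.3f}")
```

On these rows the ratio rises from zero in a South Indian tribal group to its highest value in the Kashmiri sample, which matches the southeast-to-northwest and low-to-high caste gradient described above.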

Again, except for that one outlier, the Austro-Asiatic groups almost totally lack AN2, just like some Dravidian tribals. On the other hand, even the most AN2 groups in South Asia clearly have some ASI and ANI. But having ASI and ANI does not guarantee AN2. The East Eurasian component found in the Austro-Asiatics seems constrained to the northeast of the subcontinent by and large. Finally, we have the outlier Austro-Asiatic group.

These are the Khasi. They are not Munda, and seem to have closer relationships to other East Eurasian populations. They also have a small, but noticeable, AN2 component. What’s going on? I believe that the Khasi arrived in northeast India after those who brought AN2 had already marginalized the Munda. Some of the Khasi were probably assimilated into the post-Munda (Indo-European or Dravidian speaking) peasantry. But some of the Khasi maintained their identity in the highlands, where they also intermarried with the post-Munda population, which had AN2. In contrast, the Munda who retained their cultural identity had withdrawn and disengaged.

Here’s a table for your perusal (remember that ASI is inferred):

| Group | Language | Status | S Asian | Onge | E Asian | SW Asian | Euro | Siberian | ASI |
|---|---|---|---|---|---|---|---|---|---|
| Paniya | Dravidian | Tribe | 47% | 45% | 4% | 0% | 0% | 1% | 67% |
| Santhal | Austro-Asiatic | Tribe | 40% | 45% | 13% | 0% | 0% | 0% | 67% |
| Bonda | Austro-Asiatic | Tribe | 27% | 44% | 27% | 0% | 0% | 0% | 66% |
| Ho | Austro-Asiatic | Tribe | 34% | 44% | 20% | 0% | 0% | 0% | 66% |
| Kharia | Austro-Asiatic | Tribe | 33% | 44% | 21% | 0% | 0% | 0% | 65% |
| Savara | Austro-Asiatic | Tribe | 33% | 44% | 21% | 0% | 0% | 0% | 65% |
| Mawasi | Austro-Asiatic | Tribe | 38% | 44% | 16% | 0% | 0% | 1% | 65% |
| Juang | Austro-Asiatic | Tribe | 26% | 43% | 28% | 0% | 0% | 0% | 65% |
| Asur | Austro-Asiatic | Tribe | 42% | 42% | 14% | 0% | 0% | 0% | 64% |
| Gadaba | Austro-Asiatic | Tribe | 29% | 42% | 24% | 0% | 0% | 0% | 63% |
| Mala | Dravidian | Dalit | 58% | 40% | 1% | 0% | 0% | 0% | 60% |
| Kurumba | Dravidian | Tribe | 54% | 39% | 2% | 2% | 1% | 0% | 60% |
| Sahariya | Indo-European | Dalit | 44% | 39% | 12% | 0% | 2% | 1% | 59% |
| Chenchu | Dravidian | Tribe | 53% | 39% | 3% | 0% | 2% | 1% | 59% |
| Madiga | Dravidian | Dalit | 57% | 38% | 0% | 0% | 1% | 1% | 58% |
| Bhil | Indo-European | Tribe | 56% | 37% | 0% | 1% | 3% | 1% | 57% |
| North Kannadi | Dravidian | | 57% | 37% | 1% | 1% | 2% | 0% | 56% |
| Satnami | Indo-European | L Caste | 49% | 36% | 8% | 1% | 3% | 0% | 56% |
| Sakilli | Dravidian | Dalit | 59% | 36% | 1% | 2% | 0% | 0% | 55% |
| Kamsali | Dravidian | L Caste | 59% | 35% | 1% | 2% | 0% | 0% | 54% |
| Vysya | Dravidian | Mid Caste | 62% | 34% | 0% | 2% | 0% | 0% | 53% |
| Hallaki | Dravidian | Tribe | 57% | 34% | 0% | 3% | 3% | 1% | 53% |
| Tharu | Indo-European | Tribe | 52% | 32% | 3% | 3% | 6% | 2% | 50% |
| Naidu | Dravidian | U Caste | 59% | 32% | 0% | 4% | 2% | 1% | 50% |
| Lodi | Indo-European | L Caste | 58% | 32% | 1% | 2% | 6% | 0% | 50% |
| Velama | Dravidian | U Caste | 60% | 29% | 0% | 7% | 2% | 0% | 46% |
| Srivastava | Indo-European | U Caste | 56% | 28% | 0% | 4% | 10% | 0% | 44% |
| Gujaratis a | Indo-European | | 64% | 26% | 0% | 3% | 6% | 0% | 42% |
| Meghawal | Indo-European | Dalit | 55% | 25% | 0% | 8% | 10% | 1% | 41% |
| Cochin Jews | Dravidian | | 50% | 24% | 1% | 16% | 7% | 0% | 39% |
| Vaish | Indo-European | U Caste | 52% | 24% | 0% | 6% | 15% | 0% | 39% |
| Gujaratis b | Indo-European | | 56% | 22% | 0% | 7% | 13% | 0% | 36% |
| Khasi | Austro-Asiatic | Tribe | 21% | 21% | 48% | 0% | 3% | 5% | 36% |
| Bene Israel Jews | Indo-European | | 45% | 19% | 0% | 26% | 8% | 1% | 32% |
| Kashmiri Pandit | Indo-European | U Caste | 51% | 18% | 0% | 12% | 15% | 2% | 31% |
| Cambodian | | | 4% | 17% | 75% | 1% | 1% | 0% | 30% |
| Singapore Malay | | | 5% | 17% | 73% | 1% | 1% | 0% | 30% |
| Garo | Tibeto-Burman | Tribe | 8% | 17% | 65% | 0% | 0% | 9% | 29% |
| Sindhi | Indo-European | | 52% | 13% | 0% | 16% | 13% | 1% | 25% |
| Pathan | Indo-European | Tribe | 48% | 11% | 1% | 17% | 19% | 2% | 21% |
| Burusho | Isolate | Tribe | 47% | 10% | 6% | 12% | 18% | 5% | 21% |
| Lahu | Tibeto-Burman | | 0% | 10% | 86% | 0% | 0% | 3% | 20% |
| Dai | Tibeto-Burman | | 0% | 8% | 91% | 0% | 0% | 0% | 18% |
| Balochi | Indo-European | Tribe | 49% | 7% | 0% | 27% | 12% | 1% | 16% |
| Brahui | Dravidian | Tribe | 50% | 5% | 0% | 28% | 12% | 1% | 14% |
| Makrani | Indo-European | | 47% | 5% | 0% | 29% | 11% | 1% | 14% |
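
The table does not say how the inferred ASI column was derived from the ADMIXTURE components. As a rough check, and not a reconstruction of the actual method, a least-squares line through a handful of the (Onge, ASI) pairs above suggests that the ASI column is close to a linear rescaling of the “Onge” fraction. The choice of rows and the helper name `fit_line` are assumptions made purely for illustration.

```python
# (Onge %, inferred ASI %) pairs copied from seven rows of the table above.
pairs = {
    "Paniya": (45, 67),
    "Mala": (40, 60),
    "Tharu": (32, 50),
    "Gujaratis a": (26, 42),
    "Kashmiri pandit": (18, 31),
    "Sindhi": (13, 25),
    "Brahui": (5, 14),
}


def fit_line(points):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


slope, intercept = fit_line(list(pairs.values()))
print(f"ASI ~ {intercept:.1f} + {slope:.2f} * Onge")
for group, (onge, asi) in pairs.items():
    fitted = intercept + slope * onge
    print(f"{group:16s} Onge {onge:2d}%  table ASI {asi:2d}%  fitted {fitted:4.1f}%")
```

On these rows the fitted slope is a little above 1.3 with an intercept of roughly 7 to 8 points, which would explain why groups with almost no “Onge” component still show an inferred ASI in the teens. Whatever the real procedure was, it inherits any bias in the “Onge” fraction itself.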

• Category: Science • Tags: Genetics, Genomics, Indian Genetics, Indian genomics 

Zack Ajmal now has over 50 participants in the Harappa Ancestry Project. This does not include the Pakistani populations in the HGDP, the HapMap Gujaratis, or the Indians from the SGVP. Nevertheless, all these samples still barely cover the vast heart of South Asia, the Indo-Gangetic plain. Here is the provenance of the submitted samples Zack has so far:

  • Punjab: 7
  • Iran: 7
  • Tamil: 6
  • Bengal: 5
  • Andhra Pradesh: 2
  • Bihar: 2
  • Karnataka: 2
  • Caribbean Indian: 2
  • Kashmir: 2
  • Uttar Pradesh: 2
  • Sri Lankan: 2
  • Kerala: 2
  • Iraqi Arab: 2
  • Anglo-Indian: 1
  • Roma: 1
  • Goa: 1
  • Rajasthan: 1
  • Baloch: 1
  • Unknown: 1
  • Egyptian/Iraqi Jew: 1
  • Maharashtra: 1

Again, note the underrepresentation of two of India’s most populous states, Uttar Pradesh, ~200 million, and Bihar, ~100 million. Nevertheless, there are already some interesting yields from the project. Below I’ve reedited Zack’s static images (though go to his website for something more dynamic) with the labels of individuals. I’ve highlighted myself and my parents with the red pointers.

To the left is a set of plots and tables which I’ve spliced together from Zack’s various posts. What you need to know is that this is at K = 12, and I’ve used the labels that Zack gave the various putative “ancestral populations” which emerged out of his ADMIXTURE runs. I’ve also displayed the participants in the Harappa Ancestry Project so far, with their ethnic labels. Finally, smack in the middle you see the Fst values, standardized by the smallest between-population difference. So the values in the boxes represent the genetic distances between the inferred ancestral populations in the row and column (I also rounded, since I didn’t want to give the impression of excessive precision). This last point is important: these are not between-population distance measures across real populations. Rather, they’re distance measures across the inferred allele frequencies of the populations which emerge out of the parameters you constrain ADMIXTURE to, as well as the genetic variation which you throw into the pot for the algorithm in the first place.
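
The standardization described above is simple enough to sketch. This is only an illustration under assumed inputs: the component labels and Fst values below are invented, and `standardize` is a hypothetical helper, not something taken from Zack’s pipeline.

```python
# Hypothetical pairwise Fst values between three inferred ancestral components.
labels = ["Component 1", "Component 2", "Component 3"]
fst_matrix = [
    [0.00, 0.02, 0.16],
    [0.02, 0.00, 0.12],
    [0.16, 0.12, 0.00],
]


def standardize(matrix):
    """Divide every entry by the smallest off-diagonal value, then round."""
    smallest = min(
        value
        for i, row in enumerate(matrix)
        for j, value in enumerate(row)
        if i != j
    )
    return [[round(value / smallest) for value in row] for row in matrix]


standardized = standardize(fst_matrix)
for label, row in zip(labels, standardized):
    print(f"{label:12s}", row)
```

After this step the closest pair of components has a distance of 1 by construction, and every other box reads as a multiple of that smallest distance.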

In the broadest sense the first thing that jumps out at you is the high distance value between “Papuans” and everyone else. This is interesting. In fact, the genetic distance between Papuans and other ancestral populations is greater than the genetic distance between the putative African populations and other non-Africans, except Papuans. This goes to the point that you need to be very careful in making definitive inferences from these sorts of programs. Interestingly, the population to which the Papuans exhibit the least genetic distance is the “South Asians.” What does that mean? I think this has a straightforward explanation. I believe that the South Asian cluster is a hybridized compound, as suggested by Reconstructing Indian population history, and that the populations of Oceania represent a relatively “pure” eastern expansion of long resident southern Asian groups which have generally been submerged by admixture with other groups intrusive to the region. This also explains the fact that Cambodians share some of this Papuan component with various South Asian populations. Finally, I wouldn’t make too much of this, but in some ADMIXTURE runs which I’ve done the genuine Papuan population in the HGDP data set breaks into two ancestral components, of which the southern Asian groups from Pakistan to Cambodia share only one. Remember that Oceania was settled initially by Melanesians and Australians ~40-50,000 years ago, and it looks like the people of Melanesia and indigenous Australians date to this initial period. So connections between southern Asians and Papuans are likely very old, and the two groups have been distinctive for a long time.

As for the South Asian individuals surveyed so far, there’s nothing that surprising. The South Asian element tends to increase as one goes south and east. This is what you’d expect. And, the Pakistan/Caucasian component which spans much of western and central Asia is what connects the Iranian samples to the South Asian ones. The Iranians have very little of the South Asian component. This makes sense if the South Asian element is simply an outcome of an admixed population, and one of the ancestral groups from which this component derives, “Ancestral South Indians,” was generally not present to the west of Pakistan. The eastern Asian components are enriched among Bengalis, as you’d expect, but they’re found in different proportions among many individuals who hail from the northern fringe of South Asia more generally. It seems clear that the further west you go, the more likely the “eastern” element is going to be Turk, while the further east (and to some extent south) the more likely it is to be more southerly in provenance. Most of the other patterns are as you would expect. Finally, I’d like to point out that I suspect that Zack is the first one to post the ancestral fractions of someone from the Nadar caste using SNP-chip markers.

Here are all the details about participation.


Last week I announced the Harappa Ancestry Project. It now has its own dedicated website. Additionally, it has its own Facebook page. For Zack to get his own URL he needs about 10 more “likes,” so please like it (if you are so disposed)! Finally, from what I’ve heard the first wave of the 23andMe holiday sale results are coming online this week. Actually, one of the relatives who I purchased the kit for is in processing currently, so I know that we should have a bunch of new people in the system very, very soon.

Speaking of people, last I heard Zack had gotten about a dozen responses. That’s enough to start an initial round of runs, but obviously he needs more people. More importantly, the goal here is to get better population coverage. One of the things we know intuitively and also from the most current research is the existence of a lot of within-region population variation in South Asia which is structured by community. In other words, a sample of 30 people, where you have 3 from each of 10 different communities exhibiting geographical and caste diversity, is going to be far more useful right now than 300 Jatts from Indian Haryana. Getting 300 Jatts from Haryana would be interesting in that it would give you a window into intra-communal variance, but there are diminishing returns on the inferences you could make about South Asians as a whole.

If you know someone who has done the 23andMe testing and has preponderant ancestry from South Asia, Iran, Burma, or Tibet, please forward the URL for the Harappa Ancestry Project. If you are a 23andMe member, and involved in the forums, it might be useful to post a comment thread on this project, as the people you share genes with would see it.


I have put up a few posts warning readers to be careful of confusing PCA plots with real genetic variation. PCA plots are just ways to capture variation in large data sets and extract out the independent dimensions. It’s great at detecting population substructure because the largest components of variation often track between population differences, which consist of sets of correlated allele frequencies. Remember that PCA plots usually are constructed from the two largest dimensions of variation, so they will be drawn from just these correlated allele frequency differences between populations which emerge from historical separation and evolutionary events. Observe that African Americans are distributed along an axis between Europeans and West Africans. Since we know that these are the two parental populations this makes total sense; the between population differences (e.g., SLC24A5 and Duffy) are the raw material from which independent dimensions can pop out. But on a finer scale one has to be cautious because the distribution of elements on the plot as a function of principal components is sensitive to the variation you input to generate the dimensions in the first place.

I can give you a concrete example: me. I showed you my 23andMe ancestry painting yesterday. I didn’t show you my position on the HGDP data set because I’ve shared genes with others and I don’t want to take the step of displaying other peoples’ genetic data, even if at a remove. But, I have reedited some “demo” screenshots and placed where I am on the plot to illustrate what I’m talking about above. The first shot is my position on the two-dimensional plot of first and second principal components of genetic variation from the HGDP data set.

No surprise that I’m in the Central/South Asian cluster. But what may surprise you is that I’m not in the South Asian cluster, I’m in the Central Asian cluster. In the Central Asian cluster are Uyghurs and Hazaras. These are two hybrid populations, a mixture of West and East Eurasian elements. The Uyghurs are likely the outcome of a process of admixture between the Iranian and Tocharian Indo-European populations of the cities of the Tarim basin, and later Turkic speaking settlers who arrived in the wake of the expansion and later collapse of the first Uyghur Empire (the historical connection between the current Uyghurs and ancient Uyghurs is tenuous at best, and complicated). The Hazaras are a more recent population, likely emerging as the product of intermarriages between Mongol soldiers who arrived in the 13th century, and indigenous women, Persians, Turks, and assorted Indo-Iranian groups between the Zagros and Khyber Pass. It is somewhat ironic that I’m on the edge of the Hazara cluster since they are almost certainly in part descended from Genghis Khan’s family, and my own surname is Khan. But I know that my Y chromosomal lineage is R1a1, very common across Central and Southern Eurasia, and not a Mongolian one at all.

Zoom! Now we’ve constrained the input data set to the Central/South Asian groups. First, look at the Kalash. They’re strange, which is no surprise, they’re an inbred mountain group in Pakistan who have not adopted Islam. The Pakistani Taliban looks to be ending them as we speak. I really would prefer that they were just thrown out of the data set for this zoom view, because on this fine grained scale I don’t think they add much at all. They’re just an example of what long term endogamy can do to your allele frequencies. The bigger picture is the axis between the populations of Pakistan, and those of Central Asia. Observe that I’ve changed position. Whereas when taking world wide genetic variation into account I clustered with Central Asians, now I’m 2/3 of the way to the South Asian cluster. I will tell you that I’ve shared “genes” with around 50 South Asians now, from various parts of the subcontinent, and in the 23andMe plot they overlay the South Asians nearly perfectly. I’ve put labels at the approximate ethno-linguistic position. I’m an outlier. 23andMe tells me that I’m 43% “East Asian.” The typical South Asian is in the 10-30% range. My first assumption was that I have a lot of ancient South Indian, which just shows up as East Asian in their algorithm. With this in mind I tried sharing with a lot of South and East Indians, and found out two interesting points. First, South Indians seem no higher than 30-35% East Asian. Bengalis on the other hand are more East Asian, with Bangladeshis more East Asian than West Bengalis. My sample size for Bengalis is small, so take that with caution. Second, the PCA plots put the South Indians firmly in the South Asian cluster, but the Bengalis trail out toward my own position. This indicates again that different methods are telling you slightly different things. The PCA is only a thin slice of variation, but it’s highly informative of between population differences.
A Bengali and a South Indian with the same “East Asian” fraction in the ancestry painting nevertheless have consistently different positions on the PCA, with Bengalis closer to the East Asians. Additionally, there’s an ethnic Persian in this zoom plot that I’m describing, and they are positioned near the Balochi. But on the world wide plot they’re on the margins of the European cluster. Another illustration that position of an element is sensitive to the input data because of how the dimensions are generated.

Blaine Bettinger, who inspired me to post this, told a story with his ancestry painting which was plausible. What can I say? First, I have less than 1% African ancestry. This could be noise. But, I do observe that the South Asians with Muslim names are enriched in the set of those who I’ve shared genes with and who have less than 1%, but not 0%, African ancestry. Just as Muslim South Asians have non-trivial West Asian ancestry, I suspect that many of us have Sub-Saharan African ancestry through the same dynamic. Sub-Saharan African soldiers were prominent across South Asia with the arrival of Muslims. Bengal even had a period of rule by Abyssinian rulers. But the bigger issue for me is the East Asian component. Here is a figure from a paper published 4 years ago:


The figure shows Fst values comparing Indian Americans with Europeans and East Asians. Fst measures between population differences in allele frequency, in this case the alleles being 207 indels. Take a look at the Bengalis. These are West Bengalis, who I believe have a lesser East Asian component, but even there the allele frequency difference to East Asians is near that of Europeans. The Assamese, who speak a language very close to Bengali, are similar. Assam was ruled by a Tibeto-Burman people for nearly 600 years. The Oriya speakers, from the southwest of Bengal, are more distant from East Asians. As one goes south and east, and west and north, the distance from East Asians increases. This shouldn’t be that surprising, but nice to confirm. The fact that the genetic distance increases as one goes south means that for northeast South Asia you need to complexify the model from a two-way admixture with “ancient North Indians” and “ancient South Indians.” Set next to these two is an East Asian element, which is also clear in the Indo-Aryan peoples of Nepal.

Of course anyone who knows Bengalis won’t be totally surprised by an East Asian component to their ancestry. To the left are head shots of the two women who have dominated Bangladeshi politics for the past two decades, Khaleda Zia and Sheikh Hasina. They’re both Bengalis, but they do look different, and I know many people who look like one or the other (or a combination). My family is from one of the easternmost districts of Bengal, next to Tripura. In fact my late maternal grandmother lived in Tripura for some of her childhood (she was almost trampled to death by the Maharani of Tripura’s insane elephant as a young girl!). When I was a young child I once saw a black and white photo from my father’s college days, and I was curious who the Asiatic looking young man in the middle of the photograph was. Turns out it was my father! Sometimes our expectations affect how we perceive people. I have never perceived my father to have an Asian cast to his features as a more mature man, but others have told me that he does still exhibit them.

There is still the question of how Bengalis came to have this particular admixture. I think the most plausible scenario probably synthesizes conventional village-to-village intermarriage and isolation-by-distance, along with some component of migrationism. Tribes such as the Chakma have left Burma in historical time. The Chakma of Bangladesh now speak a dialect of Bengali, not their ancestral Sino-Tibetan tongue. I believe that a non-trivial portion of Bengalis have ancestors who were tribal people who shifted their religious identity to that of Hinduism or Islam (from Theravada Buddhism in the case of the Chakma, or animism in the case of the Garos before their Christianization). But eastern South Asia is adjacent to mainland Southeast Asia, and it stands to reason that continuous gene flow would over time also have introduced East Asian alleles into the Bengali gene pool.



Dienekes has a post up where he highlights the fact that the recent paper on South Asian metabolic diseases has a figure which elucidates population structure within the region. Accounting for structure is important for genome-wide associations since you might get spurious correlations if trait value/disease frequency is simply tracking cryptic population variation. Dienekes says:

The existence of two clusters is kind of obvious, while their interpretation is not as dots of the same color appear in both clusters: a placement of these individuals in a global context might have been useful here. Things are clearer at the top cluster which shows a clear gradient anchored by Punjabi Sikh and Hindu Tamils on either end.

Also of interest is the group of isolated Muslim/Christian individuals on the left which deviate strongly from the mainstream; these probably represent exogenous elements that don’t resemble the bulk of the Indian population.

The second issue is easily addressed. The Christian outliers both give English as their native language. That suggests to me that they’re Anglo-Indian, a community of mixed South Asian and European origin. South Asian Muslims are overwhelmingly of indigenous origin. But, a minority of the Muslim elite are West Asian, or have substantial West Asian ancestry, as is evident from the fact that they look white. Benazir Bhutto’s mother was of Kurdish and Persian ethnic background (her family was from Esfahan in Iran). I’ve reedited the religious & linguistic PC plots to fit onto the screen.


So what’s going on with the cluster which extends along the second principal component? The first component is probably just a European/West Asian-South Asian axis of variation. But I don’t understand where the variation for the second is coming from. Observe that the one South Indian group, Tamil speakers, are not represented in the secondary cluster. The plot reminded me of something I saw last fall.

Below is figure S4 from the supplements of Reconstructing Indian population history. I added some labels. The Indian cluster is tight when the genetic variation includes non-Indian groups. But, when you constrain the variation to Europeans and South Asians only, something strange happens:

The Gujarati sample is from Houston, and is from HapMap Phase 3. I have a suspicion that the secondary cluster among the Gujaratis here is of the same class of phenomenon as the secondary cluster in the first plot. The Anglo-Indians and West Asian Muslims serve as rough proxies for Europeans, and you have an expected European-South Asian axis. But you also have this strange orthogonal component. I had assumed that the plot from the Reich et al. paper was an anomaly, but I’m not so sure seeing the second paper.

• Category: Science • Tags: Genetics, Genomics, Indian Genetics, Indian genomics 
Despite the reality that I’ve cautioned against taking PCA plots too literally as Truth, unvarnished and without any interpretive juice needed, papers which rely on them are almost magnetically attractive to me. They transform complex patterns of variation which you are not privy to via your gestalt psychology into a two or at most three dimensional representation which you can grok immediately. That is why History and Geography of Genes was so engrossing. You recognize patterns which were otherwise unrecognizable. But how you interpret those patterns, that’s a wholly different matter. And how those patterns arise is also not something one can ignore.

First, let’s start with an easy case. To the left is a PCA plot with four populations. Nigerians, East Asians (Chinese + Japanese), Europeans (whites from Utah), and finally, African Americans. The x-axis is the first principal component of variation, and the y-axis the second. That means that the x-axis is the independent dimension of variation within the patterns of genetic data which explains the largest fraction of the total amount of genetic variation. The sum totality of the variation can be decomposed into a large set of independent dimensions which can be rank ordered from the largest explanatory components to the smaller ones, successively by number. In a human genetic context the first principal component invariably separates Africans from non-Africans, and the second principal component often maps onto a west-east axis from Europe to the New World. Subsequent principal components can often be useful in smoking out fine scale distinctions, or relationships which are confused by the existence of similar but different signals in admixed populations.

The interpretation of this plot is rather easy. You see that African Americans lie along a continuum between Nigerians and Europeans, skewed toward Nigerians, with some outliers toward East Asians. We know from other genetic findings that ~20% of the African American ancestral quanta is European, but, that quanta is not equally distributed across the population. ~10% of the African American population is more than 50% European in ancestry, while 90% is less than 50% European. And so you have a distribution which reflects this variation. As for the outliers, I will speculate and suggest that these are indications of Native American ancestry among some African Americans.
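
To make the “continuum” reading concrete, here is a minimal sketch of why a recently admixed group falls between its two source populations on the first principal component. Everything in it is a toy assumption: the allele frequencies, the sample sizes, the 80/20 ancestry split, and the helper names are invented, and the first axis is found by plain power iteration rather than by whatever software produced the plot.

```python
import random

rng = random.Random(0)
N_SNPS = 100

# Toy allele frequencies (assumed): the two source populations differ at every SNP.
freq_a = [0.9 if j < N_SNPS // 2 else 0.1 for j in range(N_SNPS)]
freq_b = [1.0 - p for p in freq_a]


def draw_individual(ancestry_a):
    """Genotype (0, 1 or 2 allele copies) per SNP for a given share of A ancestry."""
    genotype = []
    for pa, pb in zip(freq_a, freq_b):
        p = ancestry_a * pa + (1.0 - ancestry_a) * pb
        genotype.append((rng.random() < p) + (rng.random() < p))
    return genotype


labels = ["A"] * 10 + ["B"] * 10 + ["admixed"] * 10
ancestry = {"A": 1.0, "B": 0.0, "admixed": 0.8}
genotypes = [draw_individual(ancestry[lab]) for lab in labels]

# Center each SNP column before extracting components.
col_means = [sum(col) / len(genotypes) for col in zip(*genotypes)]
centered = [[g - m for g, m in zip(row, col_means)] for row in genotypes]


def project(rows, axis):
    return [sum(x * a for x, a in zip(row, axis)) for row in rows]


def first_axis(rows, iterations=50):
    """Power iteration: converges to the direction of greatest variance (PC 1)."""
    axis = [1.0 / (j + 1) for j in range(len(rows[0]))]  # arbitrary starting vector
    for _ in range(iterations):
        scores = project(rows, axis)
        axis = [sum(row[j] * s for row, s in zip(rows, scores))
                for j in range(len(axis))]
        norm = sum(v * v for v in axis) ** 0.5
        axis = [v / norm for v in axis]
    return axis


pc1 = project(centered, first_axis(centered))


def group_mean(name):
    values = [s for s, lab in zip(pc1, labels) if lab == name]
    return sum(values) / len(values)


mean_a, mean_b, mean_mix = group_mean("A"), group_mean("B"), group_mean("admixed")
# Position of the admixed group along the A-to-B axis (0 = at A, 1 = at B).
position = (mean_mix - mean_a) / (mean_b - mean_a)
print(f"admixed group sits {position:.2f} of the way from A to B on PC 1")
```

In this toy setup the admixed group lands roughly a fifth of the way from A to B, mirroring its assumed ancestry split. Real data are messier, since individual ancestry varies and other components of variation compete for the leading axes.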

The story I presented above is probably plausible as an explanation of the visual because we have a wealth of historical data to corroborate the plausibility of that narrative. The fit between the results from the technique of analysis of genetic variation and what scholars have long inferred from textual sources is relatively easy. It is far more difficult to look at a PCA plot, and generate a plausible narrative that you yourself accept with a high degree of confidence with little external support. It is with that caveat in mind that I present Toward a more uniform sampling of human genetic diversity: A survey of worldwide populations by high-density genotyping:

High-throughput genotyping data are useful for making inferences about human evolutionary history. However, the populations sampled to date are unevenly distributed, and some areas (e.g., South and Central Asia) have rarely been sampled in large-scale studies. To assess human genetic variation more evenly, we sampled 296 individuals from 13 worldwide populations that are not covered by previous studies. By combining these samples with a data set from our laboratory and the HapMap II samples, we assembled a final dataset of ~ 250,000 SNPs in 850 individuals from 40 populations. With more uniform sampling, the estimate of global genetic differentiation (FST) substantially decreases from ~ 16% with the HapMap II samples to ~ 11%. A panel of copy number variations typed in the same populations shows patterns of diversity similar to the SNP data, with highest diversity in African populations. This unique sample collection also permits new inferences about human evolutionary history. The comparison of haplotype variation among populations supports a single out-of-Africa migration event and suggests that the founding population of Eurasia may have been relatively large but isolated from Africans for a period of time. We also found a substantial affinity between populations from central Asia (Kyrgyzstani and Mongolian Buryat) and America, suggesting a central Asian contribution to New World founder populations.

The studies which came out of the original HapMap had northern Europeans, Yoruba from Nigeria, and Chinese & Japanese. These three populations can tell us a lot, but there’s something lacking in the coverage. The HGDP sample is better. But specifically because of political considerations it was not feasible to collect Indian samples, so Pakistani ones are used in their stead. Additionally, the HGDP sample is a touch biased toward isolated and distinctive populations, such as the Kalash of Pakistan. This genetic distinctiveness is important to catalog because it is fast disappearing. But the Kalash are so unique because of their long history of isolation, so one can’t really use them as a proxy population for Pakistanis, as one could with Sindhis. The POPRES sample seems to complement the HGDP well, but I don’t see it being used so much. Since the next phase of the HapMap has more populations, some of the deficiencies which emerged with the utilization of just three terminal groups (in a World Island context) will soon no longer be an issue.

But until that time it’s nice when studies come out which close some of the gaps in our knowledge of world wide genetic variation. This is one such study. I’m somewhat familiar with the samples already because I’ve seen it in an analysis of Indian populations. It seems that it is somewhat skewed toward South and Southeast Asian populations, but hey, these are groups which need to draw the long straw sometimes as well.

Before I go any further I should mention that they use a SNP-chip with hundreds of thousands of markers. Additionally, they looked at copy number variation. Two rather different types of variation within the genome, probably to double check that the outcomes were the same. Population historical events which shape patterns of genomic variation would presumably have a similar large scale effect on both types of variation. In their results that checked out, or so they claimed, as the paper is a manuscript without the supplements attached.

Though there’s some interesting fine-grained analysis to be had, they draw some macro-scale and deep time inferences as well. First, you probably know the famous fact that 15% of variation in genes is between races, and 85% within races. That’s derived from the Fst statistic, which is basically partitioning between and within population variance across two populations. Obviously the value of Fst varies by the set of populations you’re comparing. That between Mbuti Pygmies and Japanese is far higher than between Chinese and Japanese. Using the HapMap the Fst was 16%. About what you’d expect. To equalize sample sizes with the HapMap they randomly selected individuals from a pooled set grouped by continent from their populations, and calculated Fst. They found values around 11%. Why the difference? Because their data set included populations which were between the three clusters within the HapMap.
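
The effect of adding intermediate populations can be seen with a single toy locus. This sketch uses the heterozygosity form of the statistic, Fst = (H_T - H_S) / H_T, for equal-sized populations; the allele frequencies are invented for illustration and are not taken from the paper, so only the direction of the change is meaningful.

```python
def fst(frequencies):
    """Fst = (H_T - H_S) / H_T at one biallelic locus, equal-sized populations."""
    mean_p = sum(frequencies) / len(frequencies)
    h_total = 2 * mean_p * (1 - mean_p)                       # pooled heterozygosity
    h_within = sum(2 * p * (1 - p) for p in frequencies) / len(frequencies)
    return (h_total - h_within) / h_total


# Three well-separated populations, loosely analogous to a three-cluster sample.
three_clusters = [0.1, 0.5, 0.9]
# The same endpoints plus populations that fall in between.
with_intermediates = [0.1, 0.3, 0.5, 0.7, 0.9]

fst_three = fst(three_clusters)
fst_five = fst(with_intermediates)
print(f"three clusters:     Fst = {fst_three:.3f}")   # ≈ 0.427
print(f"with intermediates: Fst = {fst_five:.3f}")    # ≈ 0.320
```

The pooled heterozygosity is the same in both cases, but the intermediate populations carry more within-population heterozygosity, so a smaller share of the total variation is left between groups.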

This is naturally not a surprising result at all, but it does reiterate one issue which sometimes crops up: Platonism in relation to race. The northern European whites in the HapMap are the whites par excellence. Turks, who are perhaps more centrally located in the genetic variation of West Eurasian and North African peoples, what used to be termed “Caucasoid,” are “less white.” Similarly, Nigerians are more African than Ethiopians. Chinese and Japanese are more Asian than Burmese. And so forth. When modeling between group differences there is I think a somewhat old-fashioned tendency to consider some populations racial archetypes. That modulates the input which modifies the results somewhat. The analytical techniques may be as cold as stone, but they are used by flesh and blood human beings.

There is also some funny business going on with haplotype and SNP heterozygosities which I think needs to be highlighted, and speaks to the fact that SNP-chips are not perfect. They’re tools, and human tools are impacted by arbitrary or instrumental choices humans make. Let me quote:

We also compared the SNP and haplotype heterozygosity values in each population (Figure 2B). These two quantities are generally highly correlated, although there are several exceptions: First, SNP heterozygosity is higher than haplotype heterozygosity in European and Central Asian populations. This may reflect a SNP ascertainment bias, since many of these polymorphisms were historically selected to maximize heterozygosity in European populations. Second, the Pygmy sample shows a low SNP heterozygosity despite relatively high haplotype heterozygosity. This unusual pattern could be caused by stronger effects of SNP ascertainment bias in this population than in others. Indeed, a recent study of Khoisan individuals (another hunter-gatherer group from Africa) showed a similar pattern: despite high SNP heterozygosity (~60%) in whole-genome sequence data, a Khoisan individual showed low heterozygosity on the SNP microarray genotypes (~22%) . Alternatively, this difference could also reflect unique attributes of population history.

In plain English the gene chips were designed with Europeans in mind, so they don’t necessarily pick up all the variation in non-European groups, who are believe it or not genetically different. This issue cropped up (as alluded to in the above text) with the recent paper which sequenced some Bushmen as well as Desmond Tutu. The Bushmen have a lot of variation, this is well known, but they have variation at markers where Europeans don’t, and if Europeans don’t the chips may not look for polymorphism at that locus. This sort of thing probably doesn’t affect broad population relationships, but if you want to zoom in and do analysis which is sensitive to fine distinctions and quantitative differences, then it might be problematic.
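
A toy simulation shows the direction of this bias. It rests on strong simplifying assumptions that are mine and not the paper’s: allele frequencies in the two populations are drawn independently and uniformly, and the “chip” keeps only SNPs that are common in the population it was designed around.

```python
import random


def heterozygosity(p):
    """Expected heterozygosity at a biallelic SNP with allele frequency p."""
    return 2 * p * (1 - p)


def mean_het(frequencies):
    return sum(heterozygosity(p) for p in frequencies) / len(frequencies)


rng = random.Random(1)
# Toy assumption: frequencies in populations A and B are independent uniform draws.
snps = [(rng.random(), rng.random()) for _ in range(20000)]

# The "chip" retains only SNPs that are common in population A (the design population).
chip = [(pa, pb) for pa, pb in snps if 0.2 <= pa <= 0.8]

true_a = mean_het([pa for pa, _ in snps])
true_b = mean_het([pb for _, pb in snps])
chip_a = mean_het([pa for pa, _ in chip])
chip_b = mean_het([pb for _, pb in chip])

print(f"all SNPs:  A {true_a:.2f}  B {true_b:.2f}")
print(f"chip SNPs: A {chip_a:.2f}  B {chip_b:.2f}")
```

Across all simulated SNPs the two populations are equally diverse, yet on the chip population A looks clearly more heterozygous than B, purely because of how the markers were picked. Real populations share history, so their frequencies are correlated and the effect is less extreme, but it points the same way.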

Let’s jump to the pretty charts. First, a PCA plot with all of the individuals from all of the populations:


Note that PC 1 accounts for nearly eight times as much variation as PC 2. This speaks to the African vs. non-African gap. Because their data set is relatively thick in “intermediate” groups you see a spectrum. The vertical axis is obviously mostly east-west. And here’s the accompanying bar plot derived from the ADMIXTURE program. K = putative ancestral populations.


With this many populations at K = 12 I think you could write a fantasy novel worthy of Tolkien. K = 4 is more realistic. Among the African populations you see likely Eurasian admixture in some eastern, and it seems Bushmen, individuals. In Eurasia itself you see a clinal gradation of admixture between putative ancestral components that seems to follow longitude rather well.

Because so much of the variation in the total sample is due to Africans, removing them from the picture will allow us to focus more on the relationships of the Eurasian groups. And so that’s exactly what they did. Note that focusing on the Eurasian groups does not mean simply magnifying or zooming in on the Eurasian section of the PCA plot, rather, the plots are regenerated with a subset of the previous genetic variation. In other words, the dimensions will shake out a bit differently.
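A toy sketch of why regenerating the plot is not the same as zooming in. The three groups and their 2-D coordinates below are made up for illustration (real analyses run on hundreds of thousands of SNPs, not two axes); the point is only that the leading principal component is recomputed from whatever samples are left.

```python
import math


def covariance_2d(points):
    """Return the covariance entries (sxx, sxy, syy) of a list of 2-D points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    return sxx, sxy, syy


def first_pc(points, iterations=200):
    """Leading eigenvector of the covariance matrix by power iteration,
    plus the fraction of total variance it accounts for."""
    sxx, sxy, syy = covariance_2d(points)
    vx, vy = 1.0, 1.0
    for _ in range(iterations):
        nx = sxx * vx + sxy * vy
        ny = sxy * vx + syy * vy
        norm = math.hypot(nx, ny)
        vx, vy = nx / norm, ny / norm
    eigenvalue = vx * (sxx * vx + sxy * vy) + vy * (sxy * vx + syy * vy)
    return (vx, vy), eigenvalue / (sxx + syy)


# Hypothetical groups: A is far from B and C along x; B and C differ along y.
group_a = [(10.0, 0.0), (10.0, 0.2), (10.2, 0.0)]
group_b = [(0.0, 1.0), (0.2, 1.0), (0.0, 1.2)]
group_c = [(0.0, -1.0), (0.2, -1.0), (0.0, -1.2)]

pc_all, frac_all = first_pc(group_a + group_b + group_c)
pc_sub, frac_sub = first_pc(group_b + group_c)
print("all groups:", pc_all, frac_all)
print("without A: ", pc_sub, frac_sub)
```

With all three groups the first component lines up with the x axis, because the outlying group dominates the variance. Drop that group and the first component swings around to the y axis, the dimension that separates the remaining two.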

The first plot shows Eurasian populations as a whole. The second removes Europeans and Near Easterners.


Notice again the scale. The vast majority of the variance seems to be east-west. But, there is a noticeable north-south split. For the South Asian population it looks like they had Pakistanis who were farmers of modest means (Arain), high caste South Indians, and very low caste or tribal South Indians. For this Indian sample there’s a problem, and it’s the same problem which plagued the Up Series: they are looking at the very top and bottom of Indian society and ignoring the middle. Presumably the middle is going to be somewhere in the middle genetically as well, but nevertheless that’s something to consider in a paper which presumes to fill in the patchiness of others. In contrast, the Nepali sample was notably ethnically diverse, including both the dominant Indo-Aryan segment as well as the Tibeto-Burman Newar.

In the first panel there are some curious patterns with the Southeast Asian groups. Culturally, as in language and history, the Thai and Vietnamese have relatively recent roots in the southern regions of modern China. The Dai of Yunnan are the same people in origin as the Thai of Thailand and the Lao of Laos. Both derive from migrations from Yunnan. This is historically attested, even if somewhat fragmentarily. The heartland of the Vietnamese was in the Red River valley and north into southern China, and they spread down the coast and toward the Mekong only within the last 1,000 years. Southeast Asia was not uninhabited during this period. It was dominated by the Khmer Empire, which was slowly consumed by the expanding Thai and Vietnamese polities. Some scholars argue that French colonialism actually preserved an independent Khmer nation, which otherwise would have been divided between Thailand and Vietnam, as Poland was between Germany and Russia. So the Khmer are the indigenous people, while the Thai and Vietnamese are intrusive.

What do the PCA plots tell us? I do not know where the Vietnamese samples were collected. If they were from South Vietnam, then their close position to the Chinese suggests to me that there was substantial demographic replacement or expansion from the Red River valley. In contrast, the Thai are relatively distant from the Chinese. In fact, the Cambodians are somewhat closer to the Chinese! The samples here are small, and the sets overlap, so I wouldn’t put too much stock in that. But, Thailand is geographically closer to South Asia, so isolation by distance models would predict this pattern. It seems that the ethnogenesis of the Thai occurred through the expansion of the Thai identity, likely among Khmer peoples. And it is intriguing that the Iban, an indigenous people of Borneo, are closer to the Vietnamese than they are to the Cambodians. We know that there was substantial migration between coastal Vietnam and Maritime Southeast Asia; the Chams of central Vietnam, who were dominant in the southern half of the nation before the Vietnamese expansion, are a Malayan people who may have migrated from Borneo.

Shifting to the second panel there’s more here to say about the South Asians. First, geography. The two lower caste groups are actually Dalits from Andhra Pradesh, a South Indian state. Dalits used to be called outcastes, so they aren’t even lower caste, but without caste. The upper caste groups are Brahmins from Andhra Pradesh and Tamil Nadu. Finally, the Irula are tribal people from Tamil Nadu. To me the tribal samples often produce weird results, and I suspect that has to do with population bottlenecks and their demographic isolation. People leave the tribes (becoming part of the Hindu society, or converting to Islam or Christianity), but few join them. The Pakistani sample are Arain, a group of conventional Punjabi farmers who have a made-up ancestry from Arabs (obviously made up because they don’t cluster with Near Easterners). Let’s compare to a chart from Reich et al.:


It seems to me that they’re in rough agreement (Reich et al. uses the same two low caste groups from Andhra Pradesh for low caste South Indians, by the way). Though South Indian Brahmins speak South Indian languages, and reside amongst other South Indian groups, their genetic heritage is somewhat different. Similarly, tribal peoples are also distinct from caste Hindus. Reich et al. posit that South Asians can be modeled as a composite of two groups, Ancestral North Indians, ANI, and Ancestral South Indians, ASI. Presumably the former are intrusive to the subcontinent in relation to the latter. There seem to be two clear dimensions along which the ratio of ANI to ASI varies: geography and caste. The proportion of ASI seems to increase from the northwest to the southeast. And, the proportion of ANI seems to increase from tribal to low caste to upper caste. The Pakistani sample does not seem to be from an elite caste (or it does not seem they were converted from an elite caste), but they have more affinity with West Eurasian populations than South Indian Brahmins. It is likely that the latter are intrusive to the south, and have admixed with the local population.
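To illustrate what "modeled as a composite of two groups" means in the simplest possible terms, here is a sketch of a two-source mixture. It assumes the allele frequencies of both sources are known, which is not the case for ancestral populations like ANI and ASI, and it is not the estimator used in Reich et al. The function name and all the frequencies are made up.

```python
def mixture_fraction(p_mixed, p_a, p_b):
    """Least-squares estimate of alpha in
        p_mixed ~ alpha * p_a + (1 - alpha) * p_b,
    where each argument lists allele frequencies at the same sites."""
    num = sum((m - b) * (a - b) for m, a, b in zip(p_mixed, p_a, p_b))
    den = sum((a - b) ** 2 for a, b in zip(p_a, p_b))
    if den == 0:
        raise ValueError("source populations have identical frequencies")
    return num / den


# Made-up frequencies at five sites for two hypothetical source populations.
source_a = [0.10, 0.80, 0.50, 0.30, 0.90]
source_b = [0.60, 0.20, 0.40, 0.70, 0.10]

# A population constructed as an exact 70/30 mix of the two sources.
mixed = [0.7 * a + 0.3 * b for a, b in zip(source_a, source_b)]

alpha = mixture_fraction(mixed, source_a, source_b)
print(round(alpha, 3))
```

In this framing, the geographic and caste gradients described above correspond to different values of alpha for different groups.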

Finally, a word on the Nepali sample. On top of the ANI-ASI mixture, the Nepali groups have varying levels of Tibeto-Burman, and so East Asian, affinity. This is not a surprise if you have met Nepalis. The Assamese, and to a lesser extent Bengalis, also exhibit this pattern of Tibeto-Burman admixture. The Brahmins of Nepal are intrusive like the Brahmins of South India, and like the South Indians they admixed with the local substrate.

Next let’s move to an ADMIXTURE plot.


The selection of a particular K obviously is conditioned by the patterns which “fit” with what you know, and what you expect. With that caution aired, the population represented by red can easily be thought of as a Middle Eastern group which expanded with agriculture. That seems to be what the authors favor. The brown population is the modal Indian ancestral population, which has little presence outside the subcontinent (nice color coding by the way! Brown people are brown). A green color represents a population which the tribal group, the Irula, are heavily weighted on. This reminds me too much of the Kalash. I suspect that the Irula went through some bottleneck or other distinctive event, and some have assimilated to various low status groups in South India.

I’m not a fantasist intent on world-building, so I’ll stop with that in reading the tea leaves of the charts. But there’s an important section which I skipped over, and will move back to now. And that’s the deep time aspect:

A more likely explanation for the OoA bottleneck is that Eurasia was populated by a larger population that had been relatively isolated from other modern human populations for tens of thousands of years prior to the expansion. The first fossil evidence for modern humans outside of Africa is in the Middle East at Skhul and Qafzeh between 80,000-100,000 years ago, which is at least 20,000 years prior to the Eurasian diaspora. If a population of modern humans remained in the Middle East until the expansion into Eurasia, there would have been sufficient time for genetic drift to reduce heterozygosity dramatically before the Eurasia expansion. This “Middle East isolation” hypothesis provides a robust explanation for the relative homogeneity of European and Asian populations relative to African populations (see Figures 3A-B) and is supported by a recent maximum likelihood estimate of 140,000 years ago for the time of Eurasian-West African population separation. Interestingly, a recent study of the Neandertal genome suggests that the non-African individuals, but not the Africans, contain similar amount of admixture (1-4%) with the Neandertals. The authors suggest that the admixture must have happened between the Neandertals with an ancestral non-African population before the Eurasian expansion. Given the fossil, archaeological, and genetic evidence, the Middle East isolation hypothesis warrants rigorous evaluation as whole-genome sequence data become available.

Like the vast majority of genetic studies this work supports the Out of Africa hypothesis. Non-Africans are all branches from a specific African branch. Or more accurately, an African branch which left Africa. The reduction in heterozygosity, a measure of genetic variation, from Africans to Eurasians was large. Additionally, within Africa south of the Sahara there’s little difference in heterozygosity as a function of geography, but outside of Africa it drops off as a function of distance from Africa. A plausible model then is a radiation from a small ancestral population to the four corners of the world, going through a series of bottlenecks along the way. Or at least that’s a model supported by genomic data. But the drop in heterozygosity is so great that a quick separation from the parental African population would require an implausibly small number of founders (fewer than 10 in one generation). So, to explain the data, they are suggesting here that the original population was not quite so small, but was isolated from the large African population for thousands of years. They assume genetic drift reduced heterozygosity, but if the model is correct I suspect that the way it worked was that bottlenecks due to climatic fluctuations swept clean a lot of the genetic variation. But in the interregnum the isolated population may have interbred with Neandertals. In fact, perhaps they picked up genes from Neandertals when their own effective population was extremely small.
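The trade-off between a tiny founding group and a long isolation can be sketched with the textbook drift result that expected heterozygosity shrinks by a factor of (1 - 1/(2N)) per generation. This assumes an idealized Wright-Fisher population of constant effective size N. The function names and the numbers plugged in are illustrative only, not taken from the paper.

```python
import math


def het_after_drift(h0, n_e, generations):
    """Expected heterozygosity after drift in an idealized Wright-Fisher
    population of constant diploid effective size n_e."""
    return h0 * (1.0 - 1.0 / (2.0 * n_e)) ** generations


def generations_for_loss(fraction_retained, n_e):
    """Generations of drift needed to retain only the given fraction of
    the starting heterozygosity at effective size n_e."""
    return math.log(fraction_retained) / math.log(1.0 - 1.0 / (2.0 * n_e))


# A single generation at a very small size removes a visible chunk at once.
print(het_after_drift(1.0, 5, 1))

# A much larger isolated population needs many generations to lose 20%.
print(generations_for_loss(0.8, 1000))
```

The same proportional loss can come from a handful of founders in one generation or from a modest population kept isolated for a long time, which is the distinction the authors are drawing.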

In any case, a wide ranging paper. They manage to tie their results into two other blockbuster papers.

H/T Dienekes

Citation: Xing J, Watkins WS, Shlien A, Walker E, Huff CD, Witherspoon DJ, Zhang Y, Simonson TS, Weiss RB, Schiffman JD, Malkin D, Woodward SR, & Jorde LB (2010). Toward a more uniform sampling of human genetic diversity: A survey of worldwide populations by high-density genotyping. Genomics. PMID: 20643205

Razib Khan
About Razib Khan

"I have degrees in biology and biochemistry, a passion for genetics, history, and philosophy, and shrimp is my favorite food. If you want to know more, see the links at"