Razib Khan
South Asian genomics

Mughal Emperor Akbar

In Strange Parallels Victor Lieberman made a reference to “Turkicized Pathans.” The very term has been gnawing at me. To get some sense of the context, Lieberman was sketching out the impact of Islamic civilization upon Indian civilization. Sometimes this “impact” was very literal. The Arab armies had rolled into Sindh in the 8th century, but that influence upon India was militarily marginal. The first real Muslim raider of consequence was Mahmud of Ghazni, a Turkic warlord from what is today Afghanistan, who famously plundered the palaces and temples of North India circa 1000. But even here the impact is arguably superficial. Mahmud of Ghazni’s raids did not lead to a large Indian domain under his direct rule except in Punjab. Rather, these sallies into India were sources of supplementation to his broader fiscal resources. He was still fundamentally a Central Asian potentate fixated on Central Asian concerns. The real rise of Islamic civilization in India was precipitated by the Delhi Sultanate, a series of short-lived polities beginning circa 1200 which dominated the Indian subcontinent for centuries, until they were superseded by the far more robust Mughal Empire.

These Indo-Islamic dominions were often dominated by individuals of Turkic identity. By this, I mean that they were from a lineage of Turkic tribes which had filtered into the world of Islam in the centuries before 1000, enslaved or enrolled in the armies of Muslim warlords. But eventually these pawns turned the tables on their erstwhile masters and snatched the keys to the kingdom for themselves. Mahmud of Ghazni’s own family were originally servitors of the Iranian Muslim Samanid dynasty. But just as Rome was enslaved by Greece culturally after its conquest of Hellas, so many Turks freely granted the manifest superiority of the Persian language in the domain of culture. The irony is that the Persian language spread as the elite cultural vehicle along with the expansion of the Turks west and east, culminating with the rise of the Ottomans and Mughals. Thus you had a situation in Mughal India where the ruling dynasty, which was of proud Turco-Mongol origin along the paternal lineage, patronized Persian as the language of the court and of administration more generally.

But what about the Afghans? They were not invisible. Along with the Turks and Persians, who came with the sword and quill respectively to serve in the courts of India’s Islamic rulers, came auxiliaries of Afghans, mostly Pashtuns. Though a majority of the dynasts seem to claim Turkic antecedents, some are self-consciously Afghan, for example the Lodi dynasty. The influence of these people is evident in India today insofar as upper class Muslims often refer to themselves as “Pathans,” presumably pointing to an origin outside of India proper.

To me it seems that the development of the Pathan over time is hampered by the fact that they are a people who did not develop their own independent robust high culture. A variety of Persian was the language of high culture. The Other par excellence was the Turk. The Pathan was a background figure, the illiterate peasant or nomad who was of no concern, and who integrated into a Turkic or Persian world and identity on rising above that station. In this way I wonder if they resemble Kurds, another highland Iranian people who have persisted over the centuries but seem strangely invisible as great empires rise and fall. This invisibility, a decentralized lack of reliance on elite institutional structures, may be one reason that a coherent Kurdish or Pathan identity persisted over so long, and continues down to this day, despite the spread of Turkic languages in many zones of their broader region.

Though I probably know more about Indian history than most of you I’m still somewhat in the dark as to the detailed relationship between Turks and Afghans in Afghanistan and India during this period. Much of the literature focuses on the factional conflict between immigrants from outside India and those who were native-born, or between Muslim elites and Hindu elites. Secondarily there were divisions between Shia and Sunni, with Persians looming large.

I don’t have fluency in the languages to do primary research, but I do have genetic data sets to play with. I pruned my populations down to a few which were designed to explore the nature of East Asian ancestry among other groups, in particular Pathans. The population set has ~90,000 markers, and I ran ADMIXTURE across many K’s, with 9 seeming to be the most illuminating for my purposes. Below are three bar plots, one which shows population averages, and two which focus on specific populations at the grain of individuals. There are also two plots which visualize genetic distances between the hypothetical ancestral populations, labeled by modal group.
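To make the difference between the population-average plot and the individual-level plots concrete, here is a minimal Python sketch. It assumes the per-individual ancestry fractions are available as rows of K numbers (ADMIXTURE normally writes one such row per individual to its Q output file) alongside a population label for each individual. The function name, the population labels and the numbers are made up for illustration, and K = 3 is used only to keep the toy data short.

```python
from collections import defaultdict

def population_averages(q_rows, labels):
    """Average per-individual ancestry fractions within each population label."""
    sums = {}
    counts = defaultdict(int)
    for row, pop in zip(q_rows, labels):
        if pop not in sums:
            sums[pop] = [0.0] * len(row)
        # Accumulate each of the K components separately.
        sums[pop] = [s + x for s, x in zip(sums[pop], row)]
        counts[pop] += 1
    return {pop: [s / counts[pop] for s in sums[pop]] for pop in sums}

# Made-up toy data: 4 individuals, K = 3 (NOT values from the actual run).
q_rows = [
    [0.70, 0.20, 0.10],
    [0.60, 0.30, 0.10],
    [0.10, 0.10, 0.80],
    [0.20, 0.00, 0.80],
]
labels = ["PopA", "PopA", "PopB", "PopB"]

averages = population_averages(q_rows, labels)
for pop, fracs in averages.items():
    print(pop, [round(f, 2) for f in fracs])
```

An average can hide a lot: a population at 10% for some component could be made up of uniformly 10% individuals or of one heavily admixed individual among many at zero, which is why the individual-level plots are worth looking at as well.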

When you analyze East Asian data sets they always tend to divide first into a northeastern and southeastern component. In this case I have a northern Turkic group, Yakuts, in my population set. Since the data set is “Pakistan-centered” you see divisions of ancestral groups with a focus on that region. If I overloaded with Western European populations the outcomes might be very different in absolute terms. As other genome bloggers have found using a Pakistan-centered data set, South Asian populations can be separated into admixtures of three broad elements:

1) A north Pakistan centered element, modal in the Burusho ethno-linguistic isolate. This element seems rather distant from other West Eurasian components, and is broadly correlated with “West Asian” in other runs (though the overlap is imperfect because of the Pakistani bias of this data set).

2) A south Pakistan centered element, modal in the Brahui, a Dravidian ethno-linguistic group surrounded by the Iranian speaking Balochi, with whom they share most cultural features except language.

3) A more South Asian general element, which is represented here by the Gujarati_B sample (probably Patels). Its position on the Fst MDS is pretty much where you would expect from a South Asian population which is an admixture of West Eurasian and non-West Eurasian populations. Interestingly the proportion of this element in the Balochi and Brahui is in the same neighborhood as among the Cambodians. On the face of it I’m skeptical that there was a mass migration of South Asians to Cambodia, despite the Indic associations of early Khmer society. Rather, it seems more likely to be evidence of an ancient South Eurasian substratum which spanned the Indian subcontinent and Southeast Asia. On the other hand, Malays and Cambodians exhibit evidence of South Asian ancestry even when the Andaman Islander component is extracted out. Looking at other groups, and at some other results in the linked plot, I’m still leaning strongly toward the assumption that this is an artifact. But it will be something to investigate more closely in the future.

But there are other components at low proportions among the Pakistanis aside from the “big three.” Here are the population breakdowns in tabular form:

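| Population | Dai | Yakut | Burusho | Baltic | Brahui | Yemeni | Bantu | Sardinian | Gujarati |
|---|---|---|---|---|---|---|---|---|---|
| Brahui | 1% | 1% | 8% | 3% | 71% | 6% | 3% | 3% | 6% |
| Balochi | 1% | 1% | 12% | 3% | 62% | 6% | 2% | 2% | 10% |
| Makrani | 0% | 0% | 12% | 1% | 61% | 9% | 6% | 4% | 6% |
| Sindhi | 2% | 1% | 21% | 4% | 32% | 2% | 4% | 2% | 32% |
| Pathan | 2% | 3% | 27% | 11% | 22% | 5% | 1% | 2% | 27% |
| Iranians | 1% | 1% | 30% | 4% | 15% | 30% | 3% | 14% | 3% |
| Uzbeks | 14% | 25% | 18% | 18% | 7% | 8% | 0% | 4% | 5% |
| Turks | 1% | 4% | 25% | 12% | 7% | 27% | 0% | 22% | 1% |
| Syrians | 1% | 0% | 20% | 3% | 5% | 44% | 4% | 23% | 1% |
| Hazara | 21% | 29% | 27% | 7% | 5% | 4% | 0% | 2% | 5% |
| Gujarati_b | 1% | 1% | 2% | 2% | 3% | 2% | 0% | 2% | 88% |
| Georgians | 0% | 0% | 43% | 5% | 1% | 24% | 0% | 27% | 0% |
| Chuvashs | 2% | 20% | 3% | 71% | 1% | 1% | 0% | 1% | 1% |
| Cambodians | 86% | 1% | 0% | 0% | 1% | 0% | 1% | 0% | 10% |
| Belorussian | 0% | 0% | 1% | 84% | 1% | 2% | 0% | 11% | 1% |
| Burusho | 8% | 5% | 55% | 2% | 1% | 0% | 1% | 0% | 28% |
| Lithuanians | 0% | 0% | 0% | 95% | 0% | 1% | 0% | 4% | 0% |
| Yakut | 1% | 94% | 0% | 4% | 0% | 0% | 0% | 1% | 0% |
| Han | 83% | 17% | 0% | 0% | 0% | 0% | 0% | 0% | 0% |
| Yemen Jews | 0% | 0% | 0% | 0% | 0% | 99% | 0% | 1% | 0% |
| Dai | 100% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% |
| Sardinian | 0% | 0% | 0% | 1% | 0% | 2% | 0% | 97% | 0% |
| Miaozu | 89% | 11% | 0% | 0% | 0% | 0% | 0% | 0% | 0% |
| Bantu Kenya | 0% | 0% | 0% | 0% | 0% | 2% | 98% | 0% | 0% |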

I’ve highlighted Pakistani populations, and bolded the modal fraction for all groups. The important point is to look at the ratio between “Yakut” and “Dai.” This should be an indication of the “Turkicness” of a population, with higher ratios implying more Turkic ancestry. The Burusho are a good test case. They exhibit little intra-population variation in the non-trivial East Asian proportion. But they’re somewhat biased toward a Dai component. I suspect that this balance is evidence of Tibetan admixture, as opposed to Turkic. In contrast, the Pathan levels are low, but biased toward Yakut ancestry. The ratio is very similar to that of the geographically close Hazara population, which is clearly of Turco-Mongol ancestry in part, despite their adherence to a Persianate (Dari speaking) identity today. For me the point of curiosity is that the Pathan differ from the Baloch. The Baloch are in many ways just a more cosmopolitan spin on the Brahui, something that would make sense if the Iranian Baloch identity is an overlay upon an earlier Brahui layer. It would be interesting to know if the interaction with Turkic groups was historically known to be far less among the Baloch and Brahui than the Pathan, despite the geographically close position of these groups.
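The Yakut-to-Dai comparison can be worked through directly from the table. The short Python sketch below copies the Dai and Yakut percentages for a handful of populations and computes the ratio. The function name is hypothetical, and because the inputs are rounded whole percentages the ratios for populations at 1–3% are very coarse and should be read only as a direction of bias.

```python
# (Dai %, Yakut %) copied from the table above.
east_asian = {
    "Brahui": (1, 1),
    "Balochi": (1, 1),
    "Makrani": (0, 0),
    "Sindhi": (2, 1),
    "Pathan": (2, 3),
    "Hazara": (21, 29),
    "Uzbeks": (14, 25),
    "Burusho": (8, 5),
}

def yakut_dai_ratio(dai, yakut):
    """Return Yakut/Dai, or None when the Dai fraction rounds to zero."""
    if dai == 0:
        return None
    return yakut / dai

ratios = {pop: yakut_dai_ratio(d, y) for pop, (d, y) in east_asian.items()}

for pop, r in ratios.items():
    total = sum(east_asian[pop])  # combined East Asian share, in percent
    shown = "n/a" if r is None else f"{r:.2f}"
    print(f"{pop:8s} Dai+Yakut={total:2d}%  Yakut/Dai={shown}")
```

On these numbers the Pathan (1.50) and Hazara (about 1.38) both tilt toward Yakut, while the Burusho (about 0.63) tilt toward Dai, even though the Hazara carry roughly ten times the total East Asian share of the Pathan.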

My analysis here is superficial. The analytic techniques aren’t too deep or informative. Rather, my intent was to push forward the project of exploring parahistorical dynamics through genetics. By “parahistory” I’m not talking counterfactuals, but rather what textually and even physically focused analysis of the past misses because of its methodological constraints. Would we know of the possibility of a relict Dravidian substratum in the hills of Balochistan deep into the medieval era if not for the fact that Brahui persists to this day? (it is giving way to Baloch even today) Many populations remain “dark” to textual records, or are mentioned only as an aside to the “main stream” of military and fiscal concerns of rentier aristocracies or the poetic grandiloquence of literary elites. Archaeology in theory can compensate for this bias in the written word, but it is only an imperfect science in a positivistic sense. I am skeptical that archaeologists would have been bold enough to assert the existence of a Dravidian substratum in Balochistan even if there was a physical difference in the material objects. How would they even connect these people to Dravidian languages in the rest of South Asia anyhow?

As for the Turks and Pathans, some Turks did turn Pathan I believe. In fact one of the HGDP Pathans seems clearly to have been the product of recent admixture. That is not so surprising in light of history. Rather, it is critical to pin down specific values if we are ever to understand with any clarity the nature of ethnogenesis of the groups which met at the intersection of what the ancient Persians would have termed Iran, Turan, and Hind.

(Republished from Discover/GNXP by permission of author or representative)
Zack Ajmal now has over 50 participants in the Harappa Ancestry Project. This does not include the Pakistani populations in the HGDP, the HapMap Gujaratis, or the Indians from the SGVP. Nevertheless, all these samples still barely cover the vast heart of South Asia, the Indo-Gangetic plain. Here is the provenance of the submitted samples Zack has so far:

  • Punjab: 7
  • Iran: 7
  • Tamil: 6
  • Bengal: 5
  • Andhra Pradesh: 2
  • Bihar: 2
  • Karnataka: 2
  • Caribbean Indian: 2
  • Kashmir: 2
  • Uttar Pradesh: 2
  • Sri Lankan: 2
  • Kerala: 2
  • Iraqi Arab: 2
  • Anglo-Indian: 1
  • Roma: 1
  • Goa: 1
  • Rajasthan: 1
  • Baloch: 1
  • Unknown: 1
  • Egyptian/Iraqi Jew: 1
  • Maharashtra: 1

Again, note the underrepresentation of two of India’s most populous states, Uttar Pradesh, ~200 million, and Bihar, ~100 million. Nevertheless, there are already some interesting yields from the project. Below I’ve reedited Zack’s static images (though go to his website for something more dynamic) with the labels of individuals. I’ve highlighted myself and my parents with the red pointers.
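The scale of the underrepresentation is easy to quantify from the list above. This Python sketch tallies the submitted samples and compares the Uttar Pradesh and Bihar count against the rough population figures quoted in the text (~200 million and ~100 million). The variable names are illustrative only.

```python
# Submitted samples by provenance, copied from the list above.
samples = {
    "Punjab": 7, "Iran": 7, "Tamil": 6, "Bengal": 5,
    "Andhra Pradesh": 2, "Bihar": 2, "Karnataka": 2, "Caribbean Indian": 2,
    "Kashmir": 2, "Uttar Pradesh": 2, "Sri Lankan": 2, "Kerala": 2,
    "Iraqi Arab": 2, "Anglo-Indian": 1, "Roma": 1, "Goa": 1,
    "Rajasthan": 1, "Baloch": 1, "Unknown": 1, "Egyptian/Iraqi Jew": 1,
    "Maharashtra": 1,
}

# Approximate populations in millions, as quoted in the text.
population_millions = {"Uttar Pradesh": 200, "Bihar": 100}

total = sum(samples.values())
up_bihar = samples["Uttar Pradesh"] + samples["Bihar"]
share = up_bihar / total
per_100m = up_bihar / (sum(population_millions.values()) / 100)

print(f"total samples: {total}")
print(f"Uttar Pradesh + Bihar: {up_bihar} samples ({share:.1%} of the total)")
print(f"samples per 100 million people in those two states: {per_100m:.2f}")
```

Four samples out of 51 is under 8% of the project, for two states that together hold roughly 300 million people.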

To the left is a set of plots and tables which I’ve spliced together from Zack’s various posts. What you need to know is that this is at K = 12, and I’ve used the labels that Zack gave the various putative “ancestral populations” which emerged out of his ADMIXTURE runs. I’ve also displayed the participants in the Harappa Ancestry Project so far, with their ethnic labels. Finally, smack in the middle you see the Fst values, standardized by the smallest between-population difference. So the values in the boxes represent the genetic distances for the inferred ancestral populations in the row and column (I also rounded, since I didn’t want to give the impression of excessive precision). This last point is important: these are not between-population distance measures across real populations. Rather, they’re distance measures across the inferred allele frequencies of populations which emerge out of the parameters you constrain ADMIXTURE to, as well as the genetic variation which you throw into the pot for the algorithm in the first place.
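For readers who want to see what “standardized by the smallest between-population difference” amounts to, here is a minimal Python sketch under stated assumptions: the pairwise Fst values sit in a symmetric matrix with zeros on the diagonal, every off-diagonal value is positive, and each entry is divided by the smallest off-diagonal value and then rounded to a whole number. The function name, the labels and the Fst values are invented for illustration and are not taken from Zack’s runs.

```python
def standardize_fst(fst):
    """Divide each pairwise distance by the smallest off-diagonal value, then round."""
    n = len(fst)
    smallest = min(fst[i][j] for i in range(n) for j in range(n) if i != j)
    return [[round(fst[i][j] / smallest) for j in range(n)] for i in range(n)]

# Made-up Fst values between three hypothetical ancestral components.
labels = ["CompA", "CompB", "CompC"]
fst = [
    [0.00, 0.02, 0.12],
    [0.02, 0.00, 0.10],
    [0.12, 0.10, 0.00],
]

scaled = standardize_fst(fst)
for label, row in zip(labels, scaled):
    print(label, row)
```

After this rescaling the closest pair of components sits at 1 and every other entry reads as a multiple of that smallest distance, which makes the relative ordering easy to scan at the cost of discarding the absolute Fst values.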

In the broadest sense the first thing that jumps out at you is the high distance value between “Papuans” and everyone else. This is interesting. In fact, the genetic distance between Papuans and other ancestral populations is greater than the genetic distance between the putative African populations and other non-Africans, except Papuans. This goes to the point that you need to be very careful in making definitive inferences from these sorts of programs. Interestingly, the population to which the Papuans exhibit the least genetic distance is the “South Asians.” What does that mean? I think this has a straightforward explanation. I believe that the South Asian cluster is a hybridized compound, as suggested by Reconstructing Indian Population History, and that the populations of Oceania represent a relatively “pure” eastern expansion of long resident southern Asian groups which have generally been submerged by admixture with other groups intrusive to the region. This also explains the fact that Cambodians share some of this Papuan component with various South Asian populations. Finally, I wouldn’t make too much of this, but in some ADMIXTURE runs which I’ve done the genuine Papuan population in the HGDP data set breaks into two ancestral components, of which the southern Asian groups from Pakistan to Cambodia share only one. Remember that Oceania was settled initially by Melanesians and Australians ~40-50,000 years ago, and it looks like the people of Melanesia and indigenous Australians date to this initial period. So connections between southern Asians and Papuans are likely very old, and the two groups have been distinctive for a long time.

As for the South Asian individuals surveyed so far, there’s nothing that surprising. The South Asian element tends to increase as one goes south and east. This is what you’d expect. And the Pakistan/Caucasian component which spans much of western and central Asia is what connects the Iranian samples to the South Asian ones. The Iranians have very little of the South Asian component. This makes sense if the South Asian element is simply an outcome of an admixed population, and one of the ancestral groups from which this component derives, “Ancestral South Indians,” was generally not present to the west of Pakistan. The eastern Asian components are enriched among Bengalis, as you’d expect, but they’re found in different proportions among many individuals who hail from the northern fringe of South Asia more generally. It seems clear that the further west you go, the more likely the “eastern” element is going to be Turk, while the further east (and to some extent south) the more likely it is to be more southerly in provenance. Most of the other patterns are as you would expect. Finally, I’d like to point out that I suspect that Zack is the first one to post the ancestral fractions of someone from the Nadar caste using SNP-chip markers.

Here are all the details about participation.


Zack has been posting his data sources, as well as how he filtered and formatted them, all this week. I assume that the first wave of results will be online soon. As of yesterday, this is what he had (I know he got some more today):

- Punjab 7
- Bengal 1
- Bihar 1
- Tamil 5
- Karnataka 1
- Anglo-Indian 1
- Roma 1
- Iran 3

Whole swaths of north-central India are missing. I am hopeful that more people will join in after the first wave of results is put out there. But, from what I have discussed with Zack it looks plausible that the very first wave will have a richer set of results because of the necessity of preliminary steps. So there’s some benefit in getting in early. It’s really ridiculous to have literally 1 sample representing the 300 million people of Uttar Pradesh and Bihar. That’s 25% of South Asians represented by one person. I’ve gotten a commitment from one friend who was born in U.P. to give his data up once it comes in, but there have to be others out there. (the Bengali N should go up to 2 when I swap my parents in for me)

The public data sources have Gujaratis, Tamils, Pakistanis (Punjabis, Pathans, Sindhis), and some South Indian groups (Tamil and Telugu). This leaves a blank spot on the North Indian plain.

Here’s the brief for the project again.
