The Unz Review: An Alternative Media Selection
A Collection of Interesting, Important, and Controversial Perspectives Largely Excluded from the American Mainstream Media
Razib Khan
Gene Expression Blog
South Asian Genetics


For a while I’ve been playing around with the 1000 Genomes South Asian data. It’s an interesting exercise, because unlike other South Asian data sets it’s relatively generic, with minimal ethnic/caste labels. This is important because unlike other population groups that the 1000 Genomes has sampled, such as in Africa, Europe, and East Asia, the South Asian data exhibit genetic structure beyond their ethno-linguistic identity. For example, the “Telugu” and “Tamil” data in the 1000 Genomes both contain individuals who are clearly Brahmins. This is obvious because these individuals are positioned on the margins of northern South Asian groups, not their ethno-linguistic compatriots. So using a combination of Estonian Biocentre data, the HGDP, and some friends and my own family, I’ve partitioned the 1000 Genomes South Asians more finely than is presented in the raw downloads.

The PCA above is hard to make out because there are so many groups I relabeled. But I’ve put the pedigree file (with my friends removed) with new labels on Dropbox. The Gujarati and Punjabi populations I separated by “ANI-ness,” from most to least numerically (Gujurati_ANI_1 is the most ANI, for example). The large number of Patels I labeled separately, as they are pretty obvious (Zack Ajmal found that Patels for the Harappa project land right in the middle of this cluster of related individuals). Additionally, the Tamil, Telugu, and Bangladeshi populations had individuals who seem likely to have been scheduled caste or Dalit. I broke them out. I also removed a few outliers (e.g., one of the Telugu individuals was probably mixed caste, half-Brahmin and half non-Brahmin, so I removed them, and one of the Bangladeshis was likely a Bengali Brahmin or some such thing).

Some surprises for me in the 1000 Genomes. The “Punjabis” sampled from Lahore were very diverse. Many were clustering with Pathans in the HGDP (by the way, there were two Pathan clusters, so that I suspect that one of them is “Pathanized,” and I removed these). But there were others, such as Punjabi_ANI_4, who were not that different from more generic South Asians. I suspect these are Muhajirs who have become ethnically assimilated more or less (or, the 1000 Genomes just labeled everyone from Lahore as Punjabi). The Bangladeshis were ancestrally very homogeneous. Unlike the Tamils or Telugu speakers there wasn’t much of a separation of lower caste individuals, and not many were Brahmins either (I found one). There were a few individuals who were very distinct in the Bangladesh sample…they clustered with scheduled castes, and didn’t have much East Asian ancestry. I believe these people descend from migrants from India in the past few centuries because of the last fact and likely remain Hindu and maintain caste endogamy (two of them had adjacent IDs, so were probably sampled together?).

To the right is a representative TreeMix (you can see all the rest on Dropbox). The Bangladeshi scheduled caste individuals are in the tree next to Chamars, Dalits from North India. The Telugu sample in the 1000 Genomes is most similar to Velamas, who I got from the Estonian Biocentre data set. Velamas are middle castes from Andhra Pradesh, so probably representative of the group that the 1000 Genomes Telugus are sampled from. The Bangladeshi samples are somewhat near the Patels or Gujurati_ANI_4 in most of the runs, but have substantial East Asian ancestry. On the PCA above my parents, who are both from an eastern region of East Bengal, Comilla, are among the most East Asian of the Bangladeshis sampled. I also projected a friend whose family has deep roots in West Bengal and are Kayasthas. You can see that he is exactly between the Bangladeshis and other South Asians. This suggests that the East Asian cline in Bengal is very sharp. It does not really persist outside of that region. Additionally, the idea that there is widespread Austro-Asiatic ancestry in South Asia does not seem to be supported by these data…only Bengalis and Burusho, both with notable East Asian ancestry, are shifted toward East Asians. The ANI-ASI cline is really sufficient for everyone else.

Finally, I wanted to analyze inbreeding in the South Asian samples. I used plink’s default runs of homozygosity feature. The raw results are in the first Dropbox link. I invite you to check them out yourself. Looking to the left you see total runs of homozygosity in KB units across the genome. Notice that the Gujarati Patels are shifted to the right, but they have a narrow window. In contrast, the Bangladeshis are to the left, but have a few outlier individuals. The Patels are an endogamous Hindu group, and so likely have lots of medium length IBD tracts. But they don’t engage in marriage between close relations. In contrast, the Bangladeshis don’t seem to have practiced much endogamy at all, presumably because they’re Muslims and caste consciousness is weak among Bangladeshis from what I can tell (my family has very vague awareness by surnames what castes they were from, but no one really cares), but some of them engage in marriage between close kin.

The second shows the average length of the runs of homozygosity, so it is more informative of recent inbreeding. You can see that the Tamils have a flat distribution, because of lots of people who have long runs. Cousin marriage and uncle-niece marriage have been practiced by South Indian Hindus historically. The Punjabi samples also have long runs of homozygosity. One difference between Muslims in Pakistan and Muslims in Bangladesh seems to be that the Middle Eastern pattern of cousin marriage is much more ubiquitous in Pakistanis. I have no idea why there is this difference. Also, unlike Hindus in much of South Asia, Bangladeshis seem to exhibit little community level genetic structure. The thesis of Islam on the Bengal Frontier, that the strength of Islam in this region of Bengal was due to its relatively recent settlement and organization during the Muslim period, and that it was an unstructured frontier society, seems roughly supported by these genetic results.
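The two summaries used in these plots, total run length per individual and average run length per individual, can be sketched in a few lines. This is only a minimal illustration and not plink's implementation: the sample IDs and segment lengths are made up, and I assume each run has already been reduced to an (individual, length in KB) pair.

```python
from collections import defaultdict


def summarize_roh(segments):
    """Return {individual: (n_segments, total_kb, mean_kb)}.

    `segments` is an iterable of (individual_id, length_kb) pairs. This input
    shape is an assumption for illustration, not plink's file format.
    """
    lengths = defaultdict(list)
    for iid, kb in segments:
        lengths[iid].append(kb)
    return {
        iid: (len(kbs), sum(kbs), sum(kbs) / len(kbs))
        for iid, kbs in lengths.items()
    }


# Hypothetical individuals: several medium runs (long-term endogamy)
# versus a couple of very long runs (recent close-kin marriage).
toy_segments = [
    ("endogamous_1", 1500.0), ("endogamous_1", 1800.0),
    ("endogamous_1", 1600.0), ("endogamous_1", 1700.0),
    ("close_kin_1", 12000.0), ("close_kin_1", 9000.0),
]
summary = summarize_roh(toy_segments)
for iid, (n_seg, total_kb, mean_kb) in summary.items():
    print(iid, n_seg, total_kb, mean_kb)
```

In this toy case the mean run length separates the two individuals far more sharply than the total does, which is why the average is the better signal of recent inbreeding.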

A final thing I should note is that I appreciate the Estonian Biocentre releasing its raw data, but many of the samples seem to exhibit little co-ethnic association. I’m not sure whether this is a labeling problem or something else, but I discarded a lot of individuals (e.g., an Uttar Pradesh Brahmin placed among non-Brahmin South Indians). So for the South Asians, people should be cautious about using this data set without double checking (in contrast, the non-South Asians from that data set have never caused me this problem).

Anyway, please download the data and use it if useful. The IDs are the same you would recognize in the 1000 Genomes and HGDP etc. I put an ADMIXTURE file in there too for K = 4. Nothing surprising.

• Category: Science • Tags: Genetics, South Asian Genetics 

The above plot I generated using the 1000 Genomes data set. BEB = Bangladeshis from Dhaka, STU are Sri Lankan Tamils, ITU are Telugus, while PJL are Punjabis from Lahore, and GIH are Gujaratis (collected in Houston). These are big categories. The South Indian population sets exhibit some structure in terms of caste; there are a few Brahmins, as well as some Dalits. The Bengalis are strangely coherent for a South Asian population, shifted toward Cambodians. The Gujaratis are differentiated between a large number of Patels and various other groups. To my surprise the Punjabi samples are very diverse.

To a great extent it recapitulates the results of the 2009 paper Reconstructing Indian Population History. What you see to the left is the “ANI-ASI cline.” Basically South Asians, from Pashtuns all the way to Paniyas, fall along a spectrum of genetic distance from West Asian and European populations. A secondary element is that some groups, such as Bengalis and many Austro-Asiatic tribes, are shifted toward East Asians. An old hypothesis of the ethnogenesis of South Asian peoples is that they are a variegated mix of “Caucasoid” populations intrusive to the subcontinent, which was originally inhabited by an “Australoid” element. Though these terms are somewhat archaic, the general point seems to get at something visually clear: some South Asians look nearly Mediterranean in appearance, while others are hard to distinguish from Australian Aboriginals (at least superficially). And of course, most of us are somewhere in the middle.

The insight of the Reich group was to use Andaman Islanders as a proxy for a primal indigenous population, and infer that the admixture alluded to above consisted of a very West Eurasian-like population, the Ancestral North Indians (ANI), and an indigenous group closer to East Eurasians, though very diverged, the Ancestral South Indians (ASI). Ergo, the ANI-ASI cline. Using the most closely related population to infer the “ghost population,” they were able to infer admixture proportions even though no “pure” ASI group was available as a reference against which they could judge. Clever strategies like this are important, because the reference populations you use to adduce admixture events (or lack thereof) strongly impact the nature of your results. Using simple PCA or model-based clustering, as with ADMIXTURE, one would fix South Indian Dalits and tribal populations as the “purest” aboriginal people. ~100% “Australoid.” And other groups could be modeled as a “Caucasoid/Australoid” mix. But this model was not satisfactory because even low caste South Indian groups were more shifted toward West Eurasians than you’d expect.

Using a statistic called the F4 ratio, they estimated that ANI ranges from 65-75% in the Northwest Indian populations, down to 15-30% in the lower caste South Indian ones. A 2013 paper, Genetic Evidence for Recent Population Mixture in India, attempted to infer an admixture period (two to four thousand years before the present), as well as a possible secondary pulse in some Indo-European groups. This stands to reason today when you note that most Indian groups share the most unique drift trajectory with the ancient Caucasian hunter-gatherer found in Kotias, but a minority, mostly upper caste, are closer to Sintashta steppe culture.
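To give a sense of how an F4 ratio yields a mixture proportion, here is a minimal sketch on made-up allele frequencies. The population roles (A, O, B, C, X) are left abstract and are not the exact configuration used in the paper; the frequencies are hypothetical, and X is constructed as an exact 70/30 blend of B and C so that the algebra is visible. Real estimates also need standard errors (e.g., a block jackknife), which are omitted here.

```python
def f4(a, b, c, d):
    """f4(A, B; C, D): mean over SNPs of (a - b) * (c - d)."""
    n = len(a)
    return sum((a[i] - b[i]) * (c[i] - d[i]) for i in range(n)) / n


def f4_ratio(a, o, x, b, c):
    """Share of B-related ancestry in X: f4(A, O; X, C) / f4(A, O; B, C)."""
    return f4(a, o, x, c) / f4(a, o, b, c)


# Hypothetical allele frequencies at five SNPs.
o = [0.5, 0.5, 0.5, 0.5, 0.5]   # outgroup
a = [0.9, 0.2, 0.7, 0.4, 0.6]   # population related to source B
b = [0.8, 0.3, 0.6, 0.3, 0.7]   # source B
c = [0.4, 0.6, 0.5, 0.5, 0.4]   # source C

# Target population built as a 70% B / 30% C mixture (an assumption of this toy).
true_alpha = 0.7
x = [true_alpha * bi + (1 - true_alpha) * ci for bi, ci in zip(b, c)]

alpha = f4_ratio(a, o, x, b, c)
print(round(alpha, 6))
```

Because x - c equals 0.7 times (b - c) at every SNP, the numerator is 0.7 times the denominator, and the ratio recovers the mixing fraction.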

I’m putting this post up because people are asking me about a paper profiled in ArsTechnica, The caste system has left its mark on Indians’ genomes. Actually the 2009 Reich lab paper already concluded this. So what’s the major finding of this paper that makes it unique? We’ll start with the abstract, Genomic reconstruction of the history of extant populations of India reveals five distinct ancestral components and a complex structure:

India, harboring more than one-sixth of the world population, has been underrepresented in genome-wide studies of variation. Our analysis reveals that there are four dominant ancestries in mainland populations of India, contrary to two ancestries inferred earlier. We also show that (i) there is a distinctive ancestry of the Andaman and Nicobar Islands populations that is likely ancestral also to Oceanic populations, and (ii) the extant mainland populations admixed widely irrespective of ancestry, which was rapidly replaced by endogamy, particularly among Indo-European–speaking upper castes, about 70 generations ago. This coincides with the historical period of formulation and adoption of some relevant sociocultural norms.

So the two major results which warrant this paper being published are that instead of two ancestral populations, they posit four, and, the admixture between some of these is considerably more recent than in the 2013 paper. I think the first conclusion is wrong, and the second is too strong.

The authors make much of the fact that they have new samples. And their SNP-chip has a high density. But I’m confused why they didn’t integrate the 1000 Genomes data. The paper was received in early July 2015, and I know there was 1000 Genomes data from all the above groups by then. They didn’t even bother to use the HapMap GIH sample, which was definitely there!

The figure to the right shows the crux of their results. They used ADMIXTURE to break apart the ancestries of their Indian data set into four clusters. Through cross-validation they established that K = 4 was the optimal parameter fit for their data. Two of the populations are previously known: ANI and ASI. But they also find that there is an “Ancestral Austro-Asiatic” and an “Ancestral Tibeto-Burman” cluster, AAA and ATB respectively. Because they did not use full labels, it can be hard to decipher, but they use this plot to assert that people of the Khatri caste are nearly 100% ANI, while Paniyas are nearly 100% ASI. Additionally, they found several groups which were nearly 100% AAA and 100% ATB.

Long-time readers will see the immediate problem: you can’t use ADMIXTURE like this! There is no guarantee that a group that is 100% x actually is in a situation where x corresponds to a genuine discrete ancestral population that existed in reality. That is, these sorts of models push a certain number of ancestral populations, and force individuals into being combinations of those. The model is constrained by the data you are putting into it to generate the results. For example, if I took Uygurs and Europeans, and did a K = 2, the Uygurs may form one cluster, and the Europeans another, at 100% levels. But we know from history and other methodologies that the Uygurs are a recently mixed group (within the last 2,000 years). Nevertheless if you tell the package to assume K = 2 with Uygurs and Northern Europeans, then it will place these into two distinct groups. And in fact, the result tells you something real and significant about the relatedness of the individuals in the data…but it doesn’t tell you necessarily anything about the real population history.
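The point can be made concrete with a toy likelihood calculation. What follows is a sketch of an ADMIXTURE-style binomial model, not the program itself; the allele frequencies, genotypes, and population names are all hypothetical. A recently mixed group (half source 1, half source 2) is analyzed together with individuals from source 1 at K = 2. One parameterization reflects the real history, the other simply gives the mixed group its own "pure" cluster.

```python
import math


def log_likelihood(genotypes, q, p):
    """Binomial log-likelihood under an ADMIXTURE-style model.

    genotypes[i][j]: allele count (0, 1 or 2) of individual i at SNP j
    q[i][k]: ancestry fraction of individual i in cluster k
    p[k][j]: allele frequency of cluster k at SNP j
    """
    total = 0.0
    for i, row in enumerate(genotypes):
        for j, g in enumerate(row):
            f = sum(q[i][k] * p[k][j] for k in range(len(p)))
            total += (math.log(math.comb(2, g))
                      + g * math.log(f) + (2 - g) * math.log(1 - f))
    return total


source_1 = [0.9, 0.1, 0.8]
source_2 = [0.1, 0.7, 0.2]
mixed = [0.5 * s1 + 0.5 * s2 for s1, s2 in zip(source_1, source_2)]

# Two individuals from the mixed group, then two from source 1.
genotypes = [[1, 1, 1], [1, 0, 2], [2, 0, 2], [2, 0, 1]]

# "History": the mixed group is 50/50 between the two real sources.
ll_history = log_likelihood(
    genotypes,
    q=[[0.5, 0.5], [0.5, 0.5], [1.0, 0.0], [1.0, 0.0]],
    p=[source_1, source_2],
)

# "Labels": the mixed group is 100% of its own cluster.
ll_labels = log_likelihood(
    genotypes,
    q=[[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]],
    p=[mixed, source_1],
)

print(ll_history, ll_labels)
```

Both parameterizations imply the same per-individual allele frequencies, so the likelihood cannot tell them apart. A group showing up as 100% of one cluster is fully compatible with that group being a recent mixture.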

There’s a fair amount of evidence that Austro-Asiatic populations in India are not indigenous, nor are they pure. A major hole in this paper is the total lack of acknowledgement that Austro-Asiatic languages are much more common in Southeast Asia, and it seems likely that they were intrusive to India. If so, modern Austro-Asiatic peoples can be thought of as a compound of migrants with the local substrate.

The ATB element is found only in Austro-Asiatic tribes and Bengali Brahmins. That’s reasonable, because both populations exhibit a relationship to East Asian groups. While the Brahmins of South India absorbed a minor element of local Dravidian ancestry, those of Bengal absorbed Tibeto-Burman and Austro-Asiatic, which is found in higher concentrations among Bengalis proper.

To repeat, ADMIXTURE does not necessarily give you real population combinations!!! In fact, populations are to some extent a social construct, insofar as they’re just really collapsing the genetic variation which is the result of a particular demographic and pedigree history. The “ANI” group proffered here is an artifact. The Khatri are not a representative of a pure population which is similar to the ancestral ANI. The Paniya are not 100% ASI, they are just the most ASI. The Birhor are not 100% Ancestral Austro-Asiatic, they are just the most distinctively Austro-Asiatic. The Jamatia are not pure Ancestral Tibeto-Burman; most of these Northeastern tribes have some ANI/ASI admixture. They’re just the most Tibeto-Burman.

Instead of relying on ADMIXTURE so much, they should have also utilized D-stats and f-stats (not as sensitive to drift), as well as TreeMix. I think that would have quickly shown that some of these “pure” groups were mixed.

Second, there is the issue of time-since-admixture. They obtained lower values than the 2013 paper. Why? Because they use source populations (and probably the methodology) which are somewhat different from that earlier work. Honestly if some of these populations are compounds, then it doesn’t make sense to necessarily use them as idealized donors in an admixture event. The AAA tracts are most definitely artifacts in my opinion, since the tracts are the outcome of a previous admixture event.

Finally, the authors allude to a “Southern Route” out of Africa, and, imply that the Austro-Asiatic arrived with this. The best work today suggests that Austro-Asiatic peoples expanded with an agricultural wave ~4,000 years ago, with a locus of origin in the uplands of South China. Therefore, they are not primal. A simple inspection of the map of Austro-Asiatic languages forces one to ask the question of direction of migration.

I offer this critique in the spirit of post-publication review. Perhaps the authors will clarify, as I’m genuinely puzzled by the interpretations they offered.

• Category: Science • Tags: South Asian Genetics 

I’ve been harping on and on for the past few years that the patterns of contemporary genetic variation are probably only weakly tied to past patterns of genetic variation (though Henry Harpending warned me about this as far back as 2004). A major reason that scholars assumed otherwise is the axiom that most of the variation we see around us crystallized during the Last Glacial Maximum (~20 thousand years before the present).

This may be true in some cases, but I doubt it is true in most cases. I was pointed to a classic case of this problem just today. A reader alerted me to a short paper from this spring which attempts to ascertain the point of origin of the dominant mtDNA haplogroup among the Onge tribe of the Andaman Islanders, M31a1. This is an interesting issue because some researchers proposed, plausibly in the past, that these indigenous people in the Andaman Islands represent the descendants of the first wave “Out of Africa,” who took the rapid “beachcomber” path. Understanding the key to their genetics may then unlock the key to the “Out of Africa” event. Or so we thought. It looks like the human evolutionary past was a lot more complicated than we’d presumed.

The paper is in the Journal of Genetics and Genomics. Mitochondrial DNA evidence supports northeast Indian origin of the aboriginal Andamanese in the Late Paleolithic:

In view of the geographically closest location to Andaman archipelago, Myanmar was suggested to be the origin place of aboriginal Andamanese. However, for lacking any genetic information from this region, which has prevented to resolve the dispute on whether the aboriginal Andamanese were originated from mainland India or Myanmar. To solve this question and better understand the origin of the aboriginal Andamanese, we screened for haplogroups M31 (from which Andaman-specific lineage M31a1 branched off) and M32 among 846 mitochondrial DNAs (mtDNAs) sampled across Myanmar. As a result, two Myanmar individuals belonging to haplogroup M31 were identified, and completely sequencing the entire mtDNA genomes of both samples testified that the two M31 individuals observed in Myanmar were probably attributed to the recent gene flow from northeast India populations. Since no root lineages of haplogroup M31 or M32 were observed in Myanmar, it is unlikely that Myanmar may serve as the source place of the aboriginal Andamanese. To get further insight into the origin of this unique population, the detailed phylogenetic and phylogeographic analyses were performed by including additional 7 new entire mtDNA genomes and 113 M31 mtDNAs pinpointed from South Asian populations, and the results suggested that Andaman-specific M31a1 could in fact trace its origin to northeast India. Time estimation results further indicated that the Andaman archipelago was likely settled by modern humans from northeast India via the land-bridge which connected the Andaman archipelago and Myanmar around the Last Glacial Maximum (LGM), a scenario in well agreement with the evidence from linguistic and palaeoclimate studies.

Geologically, unless the Andaman Islanders’ ancestors were accomplished open ocean travelers, they almost certainly did arrive via Myanmar. The inference they’re making is based on the likely false axiom that mainland Southeast Asia has been genetically stable for the past 10 to 20 thousand years. It hasn’t been genetically stable over the past 1,000 years! The authors themselves offer up a good explanation for what’s going on here in the conclusion:

In summary, by extensively studying a large number of Myanmar samples, our results failed to find any root lineage of haplogroup M31 in Myanmar, therefore suggesting that aboriginal Andamanese were unlikely originated from Myanmar, the closest region to the Andaman archipelago in geographic. Nevertheless, we still cannot completely rule out the possibility that the matrilineal landscape in Myanmar had been largely shaped by the Neolithic immigrants from the neighboring regions, addressing this issue needs extensive studying on the Myanmar populations. Significantly, our further analyses strongly suggested that Andamanese-specific M31a1 finds its origin in northeast India. Therefore, it seems that the ancient people bearing M31a root type likely had peopled the Andaman archipelago via the land-bridge connecting the Andaman archipelago and southeast Asia continent around the LGM.

Bingo! At a minimum it seems likely that the Onge have been resident in the Andaman Islands for ~10 thousand years. Therefore we should be cautious I think about making too many inferences as to whether their ancestors were resident only in Myanmar, or spanned the South China Sea to the Indus, and so forth. But, I think we can grant that they arrived via Myanmar, and were once resident in Myanmar. The disjunction between mtDNA lineages in their rather large sample strongly implies that Myanmar has seen major demographic reshaping since the ancestors of the Andaman Islanders parted ways with their mainland kin. This stands to reason. It is almost certain that Myanmar was dominated by populations speaking Austro-Asiatic languages at some point in the past. These were replaced by the ancestors of the Burmans, Karen, etc. And to some extent even these have been displaced by newcomers, such as the Shan. But the Austro-Asiatic people themselves probably came from further east. If, and it’s a big if, the kin of the Andaman Islanders were the population which immediately predated the Austro-Asiatic groups, then there have been two linguistic shifts, likely accompanied by major genetic turnover. In fact I suspect there were probably more transitions in the past. I doubt hunter-gatherer populations were quite as static as we sometimes seem to posit, at least in the past 40 thousand years.

Here’s a table of haplogroup frequencies:

And here is how the branches of M31 are related to each other:

The Onge branch is distinct, as you might expect from an isolated island population. Using the molecular clock models they came up with a series of coalescences back to the last common ancestor (represented by the star in the figure above). I’ll quote them:

Previous work has suggested the “recent settlement” of the Andaman archipelago about 24 ± 9 kilo-years ago (kya) (Barik et al., 2008). In view of the time estimation results based on the updated phylogeny tree of haplogroups M31, peopling the Andaman archipelago would have occurred after the differentiation of lineage M31a (19.82 ± 10.01 kya) and before the divergence of M31a1 (7.96 ± 3.91 kya) (Table 2). Intriguingly, a similar result was achieved by studying the whole nuclear genome, in which the Andaman aboriginals were suggested to be originated from the potential ancestral populations of South Asian sub-continent before the admixture of ASI-ANI on the mainland (Reich et al., 2009). Noticeably, the paleoclimate evidence and data of Global Ocean Associates prepared for the office of Naval Research have showed that the sea level of Southeast Asia was about 120 m lower than that of today before 17 kya, and most of the sea level of Andaman sea was above 100 m today, supporting the existence of the potential land-bridge connecting Andaman archipelago and southeast Asia continent before the Last Glacial Maximum (LGM) (22–18 kya) ([Voris, 2000] and [Clark et al., 2009]). Taking into account the interesting distribution patterns and time estimation results of different subclades within haplogroup M31, it is likely that the ancestors of aboriginal Andamanese had arrived at Andaman arhcipelago around the LGM through the land-bridge before it was submerged with the raising of the sea level after the peak of the LGM.

They’re right that their number is in rough alignment with the results from Reich et al. The Andaman Islanders diverged from “Ancestral South Indians” on the order of a few tens of thousands of years before the present. But I wonder as to the value-add of their estimate when they have an interval of ~10 thousand years around their expectation. That being said, it seems clear that this mtDNA estimate at least pegs a lower boundary. As cultural anthropology would tell us, the Andaman Islanders diverged from mainland South Asians well before agriculture, and before the arrival of “Ancestral North Indians.”
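To see how wide these windows are, here is the arithmetic on the two quoted coalescence estimates. This is a minimal sketch that assumes the "±" values can be read as a simple symmetric range around the point estimate; the paper may define them differently.

```python
def interval(mean_kya, plus_minus_kya):
    """Symmetric range (low, high) around a point estimate, in kya."""
    return (mean_kya - plus_minus_kya, mean_kya + plus_minus_kya)


m31a = interval(19.82, 10.01)    # differentiation of M31a (quoted above)
m31a1 = interval(7.96, 3.91)     # divergence of M31a1 (quoted above)

# Settlement is placed after M31a arose and before M31a1 diverged,
# so the widest window runs from M31a1's low end to M31a's high end.
widest_window = (m31a1[0], m31a[1])

print([round(v, 2) for v in m31a])
print([round(v, 2) for v in m31a1])
print([round(v, 2) for v in widest_window])
```

Under that reading, the M31a estimate alone spans roughly 10 to 30 thousand years ago, which is why the estimate is better treated as a lower bound than as a date.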

On a final note, if the Andaman Islanders arrived ~20 thousand years before the present from the South Asian mainland they don’t tell us very much about the “Out of Africa” people. They’re not “living fossils,” and it was frankly somewhat stupid probably to think they would be. Until recently the “Out of Africa” event was pegged at ~50 thousand years, at its most recent. Even assuming this date the Andaman Islanders arrived in their present location closer to the present than the point at which their ancestors left Africa. But now there is more of a tendency to accept the possibility that the “Out of Africa” event wasn’t so cut & dried in any case, and may date as far back as ~100,000 years. If so we may simply have to acknowledge that fine-grained understanding of paleodemographics will always elude us if we can’t get our hands on a sample of ancient DNA. Even among pre-agricultural peoples there was probably too much population genetic turnover for the palimpsest to be teased apart with enough subtlety to read the tea leaves of the past.


Zack Ajmal now has over 50 participants in the Harappa Ancestry Project. This does not include the Pakistani populations in the HGDP, the HapMap Gujaratis, or the Indians from the SGVP. Nevertheless, all these samples still barely cover the vast heart of South Asia, the Indo-Gangetic plain. Here is the provenance of the submitted samples Zack has so far:

  • Punjab: 7
  • Iran: 7
  • Tamil: 6
  • Bengal: 5
  • Andhra Pradesh: 2
  • Bihar: 2
  • Karnataka: 2
  • Caribbean Indian: 2
  • Kashmir: 2
  • Uttar Pradesh: 2
  • Sri Lankan: 2
  • Kerala: 2
  • Iraqi Arab: 2
  • Anglo-Indian: 1
  • Roma: 1
  • Goa: 1
  • Rajasthan: 1
  • Baloch: 1
  • Unknown: 1
  • Egyptian/Iraqi Jew: 1
  • Maharashtra: 1

Again, note the underrepresentation of two of India’s most populous states, Uttar Pradesh, ~200 million, and Bihar, ~100 million. Nevertheless, there are already some interesting yields from the project. Below I’ve reedited Zack’s static images (though go to his website for something more dynamic) with the labels of individuals. I’ve highlighted myself and my parents with the red pointers.

To the left is a set of plots and tables which I’ve spliced together from Zack’s various posts. What you need to know is that this is at K = 12, and I’ve used the labels that Zack gave the various putative “ancestral populations” which emerged out of his ADMIXTURE runs. I’ve also displayed the participants in the Harappa Ancestry Project so far, with their ethnic labels. Finally, smack in the middle you see the Fst values, standardized by the smallest between population difference. So the values in the boxes represent the genetic distances for the inferred ancestral populations in the row and column (I also rounded, since I didn’t want to give the impression of excessive precision). This last point is important: these are not between population distance measures across real populations. Rather, they’re distance measures across the inferred allele frequencies of the populations which emerge out of the parameters you constrain ADMIXTURE to, as well as the genetic variation which you throw into the pot for the algorithm in the first place.
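The standardization described above is simple: divide every pairwise Fst by the smallest between-population value, then round. A minimal sketch, with made-up labels and Fst values that are not taken from Zack's runs:

```python
def standardize_fst(fst, digits=0):
    """Divide each pairwise value by the smallest off-diagonal value, then round."""
    n = len(fst)
    smallest = min(fst[i][j] for i in range(n) for j in range(n) if i != j)
    return [[round(fst[i][j] / smallest, digits) for j in range(n)]
            for i in range(n)]


# Hypothetical inferred clusters and pairwise Fst values.
labels = ["cluster_1", "cluster_2", "cluster_3"]
fst = [
    [0.00, 0.05, 0.25],
    [0.05, 0.00, 0.20],
    [0.25, 0.20, 0.00],
]

relative = standardize_fst(fst)
for name, row in zip(labels, relative):
    print(name, row)
```

After this step the closest pair sits at 1 and every other entry reads as a multiple of that smallest distance.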

In the broadest sense the first thing that jumps out at you is the high distance value between “Papuans” and everyone else. This is interesting. In fact, the genetic distance between Papuans and other ancestral populations is greater than the genetic distance between the putative African populations and other non-Africans, except Papuans. This goes to the point that you need to be very careful in making definitive inferences from these sorts of programs. Interestingly, the population to which the Papuans exhibit the least genetic distance is the “South Asians.” What does that mean? I think this has a straightforward explanation. I believe that the South Asian cluster is a hybridized compound, as suggested by Reconstructing Indian Population History, and that the populations of Oceania represent a relatively “pure” eastern expansion of long resident southern Asian groups which have generally been submerged by admixture with other groups intrusive to the region. This also explains the fact that Cambodians share some of this Papuan component with various South Asian populations. Finally, I wouldn’t make too much of this, but in some ADMIXTURE runs which I’ve done the genuine Papuan population in the HGDP data set breaks into two ancestral components, of which the southern Asian groups from Pakistan to Cambodia share only one. Remember that Oceania was settled initially by Melanesians and Australians ~40-50,000 years ago, and it looks like the people of Melanesia and indigenous Australians date to this initial period. So connections between southern Asians and Papuans are likely very old, and the two groups have been distinctive for a long time.

As for the South Asian individuals surveyed so far, there’s nothing that surprising. The South Asian element tends to increase as one goes south and east. This is what you’d expect. And, the Pakistan/Caucasian component which spans much of western and central Asia is what connects the Iranian samples to the South Asian ones. The Iranians have very little of the South Asian component. This makes sense if the South Asian element is simply an outcome of an admixed population, and one of the ancestral groups from which this component derives, “Ancestral South Indians,” was generally not present to the west of Pakistan. The eastern Asian components are enriched among Bengalis, as you’d expect, but they’re found in different proportions among many individuals who hail from the northern fringe of South Asia more generally. It seems clear that the further west you go, the more likely the “eastern” element is going to be Turk, while the further east (and to some extent south) the more likely it is to be more southerly in provenance. Most of the other patterns are as you would expect. Finally, I’d like to point out that I suspect that Zack is the first one to post the ancestral fractions of someone from the Nadar caste using SNP-chip markers.

Here are all the details about participation.


Last week I announced the Harappa Ancestry Project. It now has its own dedicated website. Additionally, it has its own Facebook page. For Zack to get his own URL he needs about 10 more “likes,” so please like it (if you are so disposed)! Finally, from what I’ve heard the first wave of the 23andMe holiday sale results are coming online this week. Actually, one of the relatives for whom I purchased the kit is in processing currently, so I know that we should have a bunch of new people in the system very, very soon.

Speaking of people, last I heard Zack had gotten about a dozen responses. That’s enough to start an initial round of runs, but obviously he needs more people. More importantly, the goal here is to get better population coverage. One of the things we know intuitively, and also from the most current research, is that there is a lot of within-region population variation in South Asia which is structured by community. In other words, a sample of 30 people, where you have 3 from each of 10 different communities exhibiting geographical and caste diversity, is going to be far more useful right now than 300 Jatts from Indian Haryana. Getting 300 Jatts from Haryana would be interesting in that it would give you a window into intra-communal variance, but there are diminishing returns on the inferences you could make about South Asians as a whole.

If you know someone who has done the 23andMe testing and has preponderant ancestry from South Asia, Iran, Burma, or Tibet, please forward the URL for the Harappa Ancestry Project. If you are a 23andMe member, and involved in the forums, it might be useful to post a comment thread on this project, as the people you share genes with would see it.


I have put up a few posts warning readers to be careful of confusing PCA plots with real genetic variation. PCA plots are just ways to capture variation in large data sets and extract out the independent dimensions. PCA is great at detecting population substructure because the largest components of variation often track between-population differences, which consist of sets of correlated allele frequencies. Remember that PCA plots usually are constructed from the two largest dimensions of variation, so they will be drawn from just these correlated allele frequency differences between populations which emerge from historical separation and evolutionary events. Observe that African Americans are distributed along an axis between Europeans and West Africans. Since we know that these are the two parental populations this makes total sense; the between-population differences (e.g., SLC24A5 and Duffy) are the raw material from which independent dimensions can pop out. But on a finer scale one has to be cautious, because the distribution of elements on the plot as a function of principal components is sensitive to the variation you input to generate the dimensions in the first place.
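The reason an admixed group falls along an axis between its parental populations can be sketched in a few lines. This is a minimal illustration, not the pipeline behind any real plot: the allele frequencies, the 0.8 admixture fraction, and the function names are all made up, and a simple difference vector stands in for a principal component. Expected allele frequencies in an admixed population are a weighted average of the parental frequencies, so any linear projection (which is what a principal component is) places the admixed group at the same weighted point between the parents.

```python
# Toy sketch: why an admixed population lands between its parents on a linear axis.
# All frequencies and the 0.8 admixture fraction are hypothetical.

def admixed_freqs(parent_a, parent_b, frac_a):
    """Expected allele frequencies for a mix that draws frac_a of its ancestry from A."""
    return [frac_a * a + (1.0 - frac_a) * b for a, b in zip(parent_a, parent_b)]

def project(freqs, axis):
    """Coordinate of a frequency vector along a linear axis (a dot product)."""
    return sum(f * w for f, w in zip(freqs, axis))

# Four made-up markers; the third has the same frequency in both parents,
# so it contributes nothing to the axis that separates them.
parent_a = [0.90, 0.10, 0.50, 0.70]
parent_b = [0.10, 0.80, 0.50, 0.20]
axis = [a - b for a, b in zip(parent_a, parent_b)]

frac_a = 0.8
mixed = admixed_freqs(parent_a, parent_b, frac_a)

coord_a = project(parent_a, axis)
coord_b = project(parent_b, axis)
coord_mixed = project(mixed, axis)

# Position of the admixed group as a fraction of the way from B to A.
position = (coord_mixed - coord_b) / (coord_a - coord_b)
print(round(position, 3))
```

The recovered position matches the admixture fraction that was put in, which is why the two parental populations bracket the admixed one on the plot. Real individuals scatter around that point because each carries a finite sample of alleles.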

I can give you a concrete example: me. I showed you my 23andMe ancestry painting yesterday. I didn’t show you my position on the HGDP data set because I’ve shared genes with others and I don’t want to take the step of displaying other peoples’ genetic data, even if at a remove. But, I have reedited some “demo” screenshots and placed where I am on the plot to illustrate what I’m talking about above. The first shot is my position on the two-dimensional plot of first and second principal components of genetic variation from the HGDP data set.

No surprise that I’m in the Central/South Asian cluster. But what may surprise you is that I’m not in the South Asian cluster, I’m in the Central Asian cluster. In the Central Asian cluster are Uyghurs and Hazaras. These are two hybrid populations, a mixture of West and East Eurasian elements. The Uyghurs are likely the outcome of a process of admixture between the Iranian and Tocharian Indo-European populations of the cities of the Tarim basin, and later Turkic-speaking settlers who arrived in the wake of the expansion and later collapse of the first Uyghur Empire (the historical connection between the current Uyghurs and ancient Uyghurs is tenuous at best, and complicated). The Hazaras are a more recent population, likely emerging as the product of intermarriages between Mongol soldiers who arrived in the 13th century, and indigenous women, Persians, Turks, and assorted Indo-Iranian groups between the Zagros and Khyber Pass. It is somewhat ironic that I’m on the edge of the Hazara cluster since they are almost certainly in part descended from Genghis Khan’s family, and my own surname is Khan. But I know that my Y chromosomal lineage is R1a1, very common across Central and Southern Eurasia, and not a Mongolian one at all.

Zoom! Now we’ve constrained the input data set to the Central/South Asian groups. First, look at the Kalash. They’re strange, which is no surprise: they’re an inbred mountain group in Pakistan who have not adopted Islam. The Pakistani Taliban looks to be ending them as we speak. I really would prefer that they were just thrown out of the data set for this zoom view, because on this fine-grained scale I don’t think they add much at all. They’re just an example of what long-term endogamy can do to your allele frequencies. The bigger picture is the axis between the populations of Pakistan and those of Central Asia. Observe that I’ve changed position. Whereas when taking worldwide genetic variation into account I clustered with Central Asians, now I’m 2/3 of the way to the South Asian cluster. I will tell you that I’ve shared “genes” with around 50 South Asians now, from various parts of the subcontinent, and in the 23andMe plot they overlay the South Asians nearly perfectly. I’ve put labels at the approximate ethno-linguistic position. I’m an outlier. 23andMe tells me that I’m 43% “East Asian.” The typical South Asian is in the 10-30% range. My first assumption was that I have a lot of “Ancestral South Indian,” which just shows up as East Asian in their algorithm. With this in mind I tried sharing with a lot of South and East Indians, and found out two interesting points. First, South Indians seem no higher than 30-35% East Asian. Bengalis on the other hand are more East Asian, with Bangladeshis more East Asian than West Bengalis. My sample size for Bengalis is small, so take that with caution. Second, the PCA plots put the South Indians firmly in the South Asian cluster, but the Bengalis trail out toward my own position. This indicates again that different methods are telling you slightly different things. The PCA is only a thin slice of variation, but it’s highly informative of between-population differences. A Bengali and a South Indian with the same “East Asian” fraction in the ancestry painting nevertheless have consistently different positions on the PCA, with Bengalis closer to the East Asians. Additionally, there’s an ethnic Persian in this zoom plot that I’m describing, and they are positioned near the Balochi. But on the worldwide plot they’re on the margins of the European cluster. This is another illustration that the position of an element is sensitive to the input data because of how the dimensions are generated.
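To make the dependence on input data concrete, here is a deliberately tiny sketch, assuming each sample has been boiled down to just two made-up numbers. Real PCA runs on hundreds of thousands of markers, and none of the panels, coordinates, or function names below come from 23andMe or the HGDP. The first principal axis is computed from a reference panel using the closed form for a 2x2 covariance matrix, and the same sample is then projected onto the axis derived from a "world" panel and from a "regional" panel.

```python
import math

def first_pc(points):
    """Mean and first principal axis of 2-D points (closed form for a 2x2 covariance)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in points) / n
    var_y = sum((y - mean_y) ** 2 for _, y in points) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in points) / n
    # Orientation of the direction of greatest variance.
    theta = 0.5 * math.atan2(2.0 * cov, var_x - var_y)
    ux, uy = math.cos(theta), math.sin(theta)
    # The sign of a principal axis is arbitrary; make its larger component positive.
    if (abs(ux) >= abs(uy) and ux < 0) or (abs(uy) > abs(ux) and uy < 0):
        ux, uy = -ux, -uy
    return (mean_x, mean_y), (ux, uy)

def pc1_coord(point, panel):
    """Project one sample onto the first principal axis of a reference panel."""
    (mean_x, mean_y), (ux, uy) = first_pc(panel)
    return (point[0] - mean_x) * ux + (point[1] - mean_y) * uy

# Hypothetical panels. In the world panel most variation runs along the first
# feature; in the regional panel most variation runs along the second.
world_panel = [(-4.0, 0.0), (4.0, 0.0), (0.0, 1.0), (0.0, -1.0)]
regional_panel = [(0.2, 1.0), (-0.2, 1.0), (0.2, -1.0), (-0.2, -1.0)]
sample = (1.0, 0.5)

world_coord = pc1_coord(sample, world_panel)
regional_coord = pc1_coord(sample, regional_panel)
print(round(world_coord, 3), round(regional_coord, 3))
```

With these made-up panels the same sample sits at about 1.0 on a world axis whose reference points run from -4 to 4, and at about 0.5 on a regional axis whose reference points run from -1 to 1. Nothing about the sample changed; only the panel that defined the axis did.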

Blaine Bettinger, who inspired me to post this, told a story with his ancestry painting which was plausible. What can I say? First, I have less than 1% African ancestry. This could be noise. But I do observe that the South Asians with Muslim names are enriched in the set of those with whom I’ve shared genes and who have less than 1%, but not 0%, African ancestry. Just as Muslim South Asians have non-trivial West Asian ancestry, I suspect that many of us have Sub-Saharan African ancestry through the same dynamic. Sub-Saharan African soldiers were prominent across South Asia with the arrival of Muslims. Bengal even had a period of rule by Abyssinian rulers. But the bigger issue for me is the East Asian component. Here is a figure from a paper published 4 years ago:


The figure shows Fst values comparing Indian Americans with Europeans and East Asians. Fst measures between-population differences in allele frequency, in this case the alleles being 207 indels. Take a look at the Bengalis. These are West Bengalis, who I believe have a lesser East Asian component, but even there the allele frequency difference to East Asians is near that of Europeans. The Assamese, who speak a language very close to Bengali, are similar. Assam was ruled by a Tibeto-Burman people for nearly 600 years. The Oriya speakers, from the southwest of Bengal, are more distant from East Asians. As one goes south and east, and west and north, the distance from East Asians increases. This shouldn’t be that surprising, but it is nice to confirm. The fact that the genetic distance increases as one goes south means that for northeast South Asia you need to complexify the model from a two-way admixture with “Ancestral North Indians” and “Ancestral South Indians.” Set next to these two is an East Asian element, which is also clear in the Indo-Aryan peoples of Nepal.
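For readers who have not worked with Fst, here is a rough sketch of one common way to compute it from allele frequencies at biallelic markers: Fst = (H_T - H_S) / H_T, where H_T is the expected heterozygosity of the pooled population and H_S is the average within-population heterozygosity. This is an assumption on my part about the general idea, not the estimator used in the paper (which may correct for sample size), and the frequencies and population names below are invented.

```python
def fst(freqs_pop1, freqs_pop2):
    """Two-population Fst from per-marker allele frequencies.

    Uses Fst = (H_T - H_S) / H_T with equal population weights, combining
    markers as a ratio of sums. No sample-size correction is applied.
    """
    numerator = 0.0
    denominator = 0.0
    for p1, p2 in zip(freqs_pop1, freqs_pop2):
        p_bar = (p1 + p2) / 2.0
        h_total = 2.0 * p_bar * (1.0 - p_bar)           # pooled heterozygosity
        h_within = (2.0 * p1 * (1.0 - p1)
                    + 2.0 * p2 * (1.0 - p2)) / 2.0      # mean within-population
        numerator += h_total - h_within
        denominator += h_total
    return numerator / denominator if denominator > 0 else 0.0

# Hypothetical frequencies at three markers for three made-up populations.
pop_x = [0.2, 0.5, 0.7]
pop_y = [0.3, 0.5, 0.6]   # similar to pop_x
pop_z = [0.8, 0.1, 0.2]   # quite different from pop_x

fst_xy = fst(pop_x, pop_y)
fst_xz = fst(pop_x, pop_z)
print(fst_xy < fst_xz)
```

Populations with similar allele frequencies give a value near 0, and populations fixed for different alleles give 1. The figure uses this kind of quantity when it compares each Indian group’s distance to Europeans and to East Asians.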

Of course anyone who knows Bengalis won’t be totally surprised by an East Asian component to their ancestry. To the left are head shots of the two women who have dominated Bangladeshi politics for the past two decades, Khaleda Zia and Sheikh Hasina. They’re both Bengalis, but they do look different, and I know many people who look like one or the other (or a combination). My family is from one of the easternmost districts of Bengal, next to Tripura. In fact my late maternal grandmother lived in Tripura for some of her childhood (she was almost trampled to death by the Maharani of Tripura’s insane elephant as a young girl!). When I was a young child I once saw a black and white photo from my father’s college days, and I was curious who the Asiatic-looking young man in the middle of the photograph was. Turns out it was my father! Sometimes our expectations affect how we perceive people. I have never perceived my father to have an Asian cast to his features as a more mature man, but others have told me that he does still exhibit one.

There is still the question of how Bengalis came to have this particular admixture. I think the most plausible scenario probably synthesizes conventional village-to-village intermarriage and isolation-by-distance, along with some component of migrationism. Tribes such as the Chakma have left Burma in historical time. The Chakma of Bangladesh now speak a dialect of Bengali, not their ancestral Sino-Tibetan tongue. I believe that a non-trivial portion of Bengalis have ancestors who were tribal people who shifted their religious identity to that of Hinduism or Islam (from Theravada Buddhism in the case of the Chakma, or animism in the case of the Garos before their Christianization). But eastern South Asia is adjacent to mainland Southeast Asia, and it stands to reason that continuous gene flow would over time also have introduced East Asian alleles into the Bengali gene pool.


Razib Khan
About Razib Khan

"I have degrees in biology and biochemistry, a passion for genetics, history, and philosophy, and shrimp is my favorite food. If you want to know more, see the links at"