The Unz Review: An Alternative Media Selection
A Collection of Interesting, Important, and Controversial Perspectives Largely Excluded from the American Mainstream Media
Razib Khan
Gene Expression Blog
Genome Blogging


It’s been 10 months since Zack Ajmal first contacted me about the possibility of the Harappa Ancestry Project. I was of two minds. On the one hand I did think there was a major problem with undersampling some regions of South Asia. But, it seemed that the 1000 Genomes would fix that soon enough. As it turns out the 1000 Genomes has been a bit slower than I had anticipated (and I assume that the nixing of the Indian samples was a matter of politics not science). So I’m glad Zack started the project when he did.

At this point he’s hit the zone of diminishing marginal returns when it comes to participants. Looking through his samples he has a little over 100 non-founders of unadmixed South Asian ancestry (I’m not a founder because both my parents are in the database). I decided to prune the individuals down to this selection, and tack on a lot of his reference populations, with a bias toward South Asians, and see what I could find. I used his K = 11 ADMIXTURE run, since this seems maximally informative for South Asians. You can find the file here.

One interesting aspect of Zack’s project is that he began to collect Y and mtDNA haplogroups at a certain point. Not too surprisingly, there was a preponderance of R1a1a. For many years now this paternal marker has been suggested to have some association with Indo-Iranians, though more recently researchers have suggested that in fact it’s a very old haplogroup sharply differentiated between a European branch and a South Asian one. Zack has 56 individuals with Y and mtDNA information in his database. These have to be males. He has 14 individuals with mtDNA information and no Y information. These are probably females (obviously there could be males who are only entering their mtDNA information, but this seems unlikely given that most of the results come from 23andMe). 27 of the males are R1a1a. 29 are not. The mean “Onge” proportion of those with R1a1a is 24%. Without? 24%. The values for “South Asian” are 56 and 55 percent respectively. In this likely skewed sample R1a1a doesn’t seem to predict the ancestral variation much.

How about we look at mtDNA. Haplogroup M is localized to South Asia. Dividing the population into M and not M you get the following values:

Not M, South Asian = 55%
Not M, Onge = 23%
M, South Asian = 56%
M, Onge = 23%

There doesn’t seem to be that much in uniparental markers, which aligns with my intuition, at least at this scale of analysis. So let’s look at the autosomal genome: the total genetic variation. If you’ve been following HAP the following won’t be news, but for those who haven’t, I thought I’d generate some plots.

The two-way admixture aspect of South Asian populations is evident in the HAP data. “Onge” refers to an element with affinities to Andaman Islanders. “S.Asian” seems to be some sort of compound, but with strong West Eurasian affinities. The axis is NW-SE, upper caste to lower caste, just as you’d expect.

There are two West Eurasian components which aren’t collapsed into “S.Asian”: “SW.Asian” and “European.” The names are rather self-evident. The interesting thing here is that “SW.Asian” tends to be elevated among South Indians, especially non-Brahmin upper castes. In contrast, there is far less “SW.Asian” amongst Northeast Indians, and proportionally more “European.” This is more evident when you look at populations in the reference set.

There are also some interesting caste/region patterns.

When you remove region from consideration it is interesting that Brahmins are somewhat “central” among South Asian populations.

In contrast, Punjabis are where geography would predict. That’s one reason it was somewhat problematic that the HGDP had only Pakistani groups for South Asians. They’re not very representative of South Asians as a whole.

Differences along the axis of caste become more clear when you correct for region, at least mostly.

Punjab is somewhat atypical here. I am now much more willing to credit migrations within the last 2,000 years accounting for the distinctiveness of groups like Jatts.

On a somewhat less exciting note, it looks like a lot of the genome blogging projects are losing steam. I’m pretty busy right now, so I haven’t been able to maintain AAP, though we’ll have another Merina soon. But I suspect it goes to show just how important collection of new data is to these endeavors. There’s only so much juice you can get out of the same data set. Right now we depend on research groups and the 1000 Genomes, as well as enthusiasts. At some point in the near future the genotypes won’t be the limiting factor. I think then you’ll see a renaissance of amateur ancestral genomics.

• Category: Science • Tags: Genome Blogging, Personal Genomics 

A few days ago I noticed that the Dodecad Ancestry Project had nearly 10,000 individuals! ~500 are participants in the project (like myself, I’m DOD075). But most of the individuals were derived from public or shared data sets. You can see them in the Google spreadsheet with all the results. It’s quite an accomplishment, and I commend Dienekes for it. I also have to enter into the record that Dodecad prompted my own forays into genome blogging, and Dienekes also helped Zack with pointers for Harappa in the early days.

• Category: Science • Tags: Genome Blogging, Genomics, Personal Genomics 

Dienekes Pontikos has just released DIY Dodecad, a DIY admixture analysis program. You can download the files yourself. It runs on both Linux and Windows. Since I already have tools in Linux I decided to try out the Windows version, and it seems to work fine. It is somewhat limited in that you start out with the parameters which Dienekes has set for you, but if you don’t want to write your own scripts and get familiar with all the scientific programs out there, I think this is a very good option. Additionally, it seems to run rather fast, so you won’t spend days experimenting with different parameters.

Dienekes has already run me, but I put my parents’ genotype files through the system. Here are the results:

Population        Razib   Mother  Father
East_European     6.9     6.5     4.3
West_European     1.7     3.1     5.5
Mediterranean     6.3     5.6     5.9
Neo_African       0       0       0
West_Asian        0       2       3.9
South_Asian       65.9    59.6    60.4
Northeast_Asian   2.9     3.8     3.6
Southeast_Asian   15.8    16.6    15.5
East_African      0       0       0.2
Southwest_Asian   0.5     2.5     0.7
Northwest_African 0       0       0
Palaeo_African    0       0.3     0

The main thing to notice is that my mother has more total East Asian ancestry than my father, and, that she has a Southwest Asian component which is at a few percent. These are always consistent findings in the dozens of ADMIXTURE runs I’ve done with various parameter settings and reference population mixes, so it’s nice that DIY Dodecad replicates those findings. Though the population sets seem a bit Eurocentric to me, so I would recommend it most for those with West Eurasian ancestry.
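
The comparison in the table can be checked with a few lines of arithmetic. This is only a sketch: treating Northeast_Asian plus Southeast_Asian as “total East Asian” is an assumption about how the comparison is made, and the dictionary layout is invented for illustration. The numbers themselves are copied from the table.

```python
# Values copied from the DIY Dodecad table (percent).
results = {
    "Razib":  {"Northeast_Asian": 2.9, "Southeast_Asian": 15.8, "Southwest_Asian": 0.5},
    "Mother": {"Northeast_Asian": 3.8, "Southeast_Asian": 16.6, "Southwest_Asian": 2.5},
    "Father": {"Northeast_Asian": 3.6, "Southeast_Asian": 15.5, "Southwest_Asian": 0.7},
}

def total_east_asian(components):
    """Sum the two East Asian components for one individual (assumed definition)."""
    return components["Northeast_Asian"] + components["Southeast_Asian"]

# Round to one decimal place to match the precision of the table.
east_asian = {name: round(total_east_asian(c), 1) for name, c in results.items()}
```

Under that assumption the mother comes out at 20.4 percent and the father at 19.1 percent, and the mother’s Southwest_Asian value (2.5) is the only one of the three above one percent.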


Attitudes on DNA ancestry tests:

The DNA ancestry testing industry is more than a decade old, yet details about it remain a mystery: there remain no reliable, empirical data on the number, motivations, and attitudes of customers to date, the number of products available and their characteristics, or the industry customs and standard practices that have emerged in the absence of specific governmental regulations. Here, we provide preliminary data collected in 2009 through indirect and direct participant observation, namely blog post analysis, generalized survey analysis, and targeted survey analysis. The attitudes include the first available data on attitudes of those of individuals who have and have not had their own DNA ancestry tested as well as individuals who are members of DNA ancestry-related social networking groups. In a new and fluid landscape, the results highlight the need for empirical data to guide policy discussions and should be interpreted collectively as an invitation for additional investigation of (1) the opinions of individuals purchasing these tests, individuals obtaining these tests through research participation, and individuals not obtaining these tests; (2) the psychosocial and behavioral reactions of individuals obtaining their DNA ancestry information with attention given both to expectations prior to testing and the sociotechnical architecture of the test used; and (3) the applications of DNA ancestry information in varying contexts.

If anyone wants the paper, email me, I can send you a copy. But really it’s just kind of dated because the information was collected in 2009, before the massive increase in 23andMe’s customer base which began in the spring of 2010. Additionally, “genome blogging” really hadn’t started much at that point.

In terms of the reactions to ancestry analysis, my personal experience after doing analysis on hundreds of people (most in public for AAP, but some in private) is that most are pretty calm about whatever they find out. On occasion you run into a stubborn person who is basically going to fix upon a really implausible explanation for a particular ancestral slice rather than the lowest hanging fruit. But there was one individual who had a freak out when their results were published, because it did not accord with family beliefs. I was kind of confused, and checked their results with their self-reported ethnicity. Weirdly the results were exactly what I would have expected from the self-reported ethnicity, so it was a really strange reaction.

• Category: Science • Tags: Genome Blogging, Genomics 

Iman, a Somali model

Since I started up the African Ancestry Project one of the primary sources of interest has been from individuals whose families hail from Northeast Africa. More specifically, the Horn of Africa: Ethiopia, Eritrea, and Somalia. The problem seems to be that 23andMe’s “ancestry painting” algorithm uses West African Yoruba as a reference population, and East Africans are often not well modeled as derivative of West Africans. So, for example, the Nubian individual who I’ve analyzed supposedly comes up to be well over 50% “European” in ancestry painting. Then again, I’m 55-60% “European” as well according to that method! So we shouldn’t take these judgments to heart too much. Obviously something was off, and thanks to Genome Bloggers like Dienekes Pontikos we know what the problem was: the populations of the Horn of Africa have almost no distinctive “Bantu” element to connect them with West Africans like the Yoruba. Additionally, a closer inspection shows that the “Eurasian” component present in these populations is very specific as well, almost totally derived from Arabian-like sources. When breaking apart the West Eurasian populations it is no surprise that Northern Europeans and Arabians are among the most distant pairs, even excluding recent Sub-Saharan African admixture. The HapMap Utah European American sample and the Nigerian Yoruba are very suboptimal for people with eastern African background. In contrast, African Americans are a mixture of West Africans and Northern Europeans, so the ancestry painting algorithm has nearly perfect reference populations for them. The results for African Americans may not be very detailed and rich, but they’re probably pretty accurate at the level of grain which they’re offering results.

Though I’m happy to give people of Northeast African ancestry more detailed results than 23andMe, one of my motivations for the African Ancestry Project was to obtain a data set which would allow me to explore the genomic variation in the east of Africa myself. This region is a strong candidate for “source” populations for non-Africans within the last 100,000 years, and it seems to have experienced rapid population turnover within the last 2,000-3,000 years. My data set is not particularly adequate to my ambitions, yet. But I do now have 5 unrelated Somalis. To my knowledge there hasn’t been much exploration of Somali genomics using thick-marker SNP chips, so why not? N = 5 is better than N = 0 in these cases of extreme undersampling.

Before I proceed to methods and results, I want to note that I put up most of my files here. It’s a ~25 MB compressed folder with images, spreadsheets, as well as raw output from ADMIXTURE and EIGENSOFT. I hope readers will take this as an invitation to poke around themselves.

Since my focus was on the Horn of Africa the coverage of populations is relatively constrained compared to what I normally run. From the HapMap I took the Yoruba, Masai, and Luhya. I renamed Masai “Nilotic Kenya” and the Luhya “Bantu Kenya.” The Behar et al. data set has a fair number of Ethiopians, gentiles and Jews. A reader helpfully labelled the various ethnicities by ID. I was going to do that myself, but because this tedious work was done for me I felt much more motivated to produce something instead of putting this task off! From the Behar et al. I also took some Arab populations, as well as Georgians, Lithuanians and Belorussians. I combined the two latter populations into “Baltic.” Syrians and Jordanians were converted to “Levantine” in the bar plots. I left Saudis, Yemenis, and Yemeni Jews disaggregated. Finally, I added some individuals from the AAP: all the people from the Horn of Africa who are unmixed in ancestry, as well as my Nubian individual. In the display that follows AAP members are combined with the ethnic groups which are appropriate in Behar et al.: Oromos, Amharas, and Tigray. Ethiopian Jews (the Beta Israel) I left as is. To mix it up I also brought over the Sandawe from Henn et al. The Somalis are all from AAP. They do not seem related (close relatives generally form their own cluster).
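
The relabeling and merging described above amounts to a lookup table applied to each sample’s population label. A minimal sketch, assuming the source labels are spelled as they appear in the text (the actual labels in the HapMap and Behar et al. files may differ):

```python
# Display labels described in the text; the source spellings are assumptions.
RELABEL = {
    "Masai": "Nilotic Kenya",
    "Luhya": "Bantu Kenya",
    "Lithuanians": "Baltic",
    "Belorussians": "Baltic",
    "Syrians": "Levantine",
    "Jordanians": "Levantine",
}

def display_label(population):
    """Return the plot label for a population, leaving unlisted groups unchanged."""
    return RELABEL.get(population, population)

# Groups left disaggregated in the text pass through untouched.
labels = [display_label(p) for p in ["Masai", "Syrians", "Saudis", "Yemeni Jews"]]
```

Keeping the mapping in one place makes it easy to see which groups were pooled when reading the bar plots.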

I tried to balance my populations in an ad hoc fashion. I took only ~30 Yoruba, but decided to add in more Masai, because they seemed to be a mixed population rather than a reference, and I wanted to flesh out their variation. I removed individuals who were closely related as per Zack Ajmal’s findings in his review of his reference data sets. After combining the data sets I was left with ~210,000 SNPs, with less than 0.1% missing. I ran this from K = 2 to K = 8 in ADMIXTURE, and, I also generated the top six independent dimensions of genetic variation in EIGENSOFT. I also took the Fst values from ADMIXTURE of the inferred ancestral populations and generated MDS representations of the genetic distances (though the original file can be found in the attached folder).
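
For readers who want to repeat the K = 2 to K = 8 sweep, here is one way to generate the command lines. It assumes the usual `admixture <input.bed> <K>` invocation and a PLINK-format input; the file name is hypothetical, and this is not necessarily how the runs above were launched.

```python
def admixture_commands(bed_file, k_min=2, k_max=8):
    """Build one ADMIXTURE command line per K, as argument lists."""
    return [["admixture", bed_file, str(k)] for k in range(k_min, k_max + 1)]

# Hypothetical input file name, for illustration only.
commands = admixture_commands("horn_of_africa.bed")
```

Each list can be handed to `subprocess.run` on a machine where the `admixture` binary is installed; the sketch itself only builds the arguments and runs nothing.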

There are several different types of plots below. The MDS and PCA should be rather straightforward. But a little explanation for the ADMIXTURE bar plots. There are three for every K. First, average results by population. Second, a fine-grained display of all the individuals from all the populations. Third, a fine-grained display of some populations of interest. Please note that in the second set of plots I don’t label all the individuals by population, since it would be unreadable. But they go alphabetically, so you should be able to see where populations start, and where they end.
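
The “average results by population” plots are per-population means of the individual ancestry fractions. A small sketch of that step, assuming one row of K fractions per individual and a parallel list of population labels; the numbers are invented placeholders, not values from these runs.

```python
def population_means(labels, q_rows):
    """Average per-individual ancestry fractions within each population.

    labels: population label for each individual
    q_rows: one list of K ancestry fractions per individual, in the same order
    """
    sums, counts = {}, {}
    for label, row in zip(labels, q_rows):
        if label not in sums:
            sums[label] = [0.0] * len(row)
            counts[label] = 0
        sums[label] = [s + x for s, x in zip(sums[label], row)]
        counts[label] += 1
    return {label: [s / counts[label] for s in sums[label]] for label in sums}

# Hypothetical K = 2 fractions for four individuals.
labels = ["Somali", "Somali", "Yoruba", "Yoruba"]
q_rows = [[0.6, 0.4], [0.7, 0.3], [0.0, 1.0], [0.1, 0.9]]
averages = population_means(labels, q_rows)
```

With small samples the mean can hide a lot of individual variation, which is why the fine-grained plots are worth looking at alongside the averages.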

Before you even look at the results and we discuss them, there is one clear issue which jumps out: there are closely related individuals or clans in the Masai data set which I need to remove in future runs. Though these individuals hogged up higher K’s it didn’t affect the relationships across other populations, so I decided to publish this now before refining it for the future. It’s a learning experience. You can see that these individuals form their own clusters in the MDS and PCA as well. At least the problem reoccurs systematically using different methodologies.

(note: some of the images are larger than shown, so if you want to see better labels for the fine-grained plots, get the image URL and look at it separately)

[Image gallery: ADMIXTURE bar plots for K = 2 to K = 8, MDS plots, and PCA plots]



The fact that the Masai “break down” at K = 6 is really problematic, as there’s information that’s probably lost here. But several immediate observations:

1) The Somalis, like the Ethiopian groups, show almost no impact from the Bantu expansion. This is in contrast to the one Nubian individual, who may have more West African ancestry through intermediate groups, or through direct contact with Bantus who were enslaved and brought to Sudan.

2) When you break apart West Eurasian ancestry the Ethiopian and Somali groups have their contribution almost exclusively from an ancestral component in southern Arabia. This makes some sense because of geography, but when you look at the fractions of “northern” admixture even among Yemeni Jews the proportions are not reflected among the Horn of Africa groups. One hypothesis which is consistent with this might be that the admixture event with the Arabian-like group occurred at a time when south Arabians were more genetically isolated and distinct from populations to the north. I suspect this is almost certainly going to be true before the camel, let alone Islam. Interestingly, just as the Nubian individual has more West African affinities, they also have more European affinities. The Nubian individual’s ancestry is simply more cosmopolitan than that of Ethiopians and Somalis, which is not historically that surprising.

3) There is a rough rank order of admixture estimates. In terms of Africanness it goes from Somali > Oromo > Beta Israel ~ Amhara > Tigray. The sample sizes are small though, so we should be cautious. The Amhara seem to vary the most. One might suspect that the Amhara, being the traditional core ethnicity of Ethiopia of late, assimilated other groups. If you look at the PCA the Somali actually look the most “East African” of the groups on PC 2. Note also the linear pattern of distribution of other Ethiopians and the Masai toward Arabians and Bantu respectively. This is suggestive of some sort of ancient admixture event between an East African substrate and other populations. I will label this population “Ancestral East Africans” (AEA).

4) The relationship of the Sandawe to the other groups is interesting. It seems clear that the Sandawe are related to the AEA, but are somewhat at a remove. Note that a “Sandawe” component is often found in low proportions outside of the Sandawe across East Africa. While the Ethiopians and Somalis do not have a Bantu aspect to their ancestry, they may have an “Ancestral Sandawe” (AS) one.

I don’t want to say more until I get the Masai data set fixed (and I might make recourse to some of Dienekes’ “tricks,” as well as supervised runs). But overall I would say that the ethnogenesis of the Semitic and Cushitic people of the Horn of Africa pre-dates the Bantu expansion. I will do some more playing with this, but they do not seem to generate an “Ethiopian-Somali” cluster so easily as South Asians do. This may be because they are never numerous in any of these analyses. Or, it may be due to the possibility that the admixture event was recent enough that the underlying populations are not as obscured as amongst South Asians. I lean toward the latter, for now. As in South Asia, I do not think that the ethnogenesis of the families of Ethiopian peoples is quite a “one off” admixture event. It is suggestive that you have two major language families, Semitic and Cushitic, in this region.

Image credit: Wikimedia


Both Eurogenes and Harappa now have map interfaces where you can drop in your place of origin if you’re a participant. If you have submitted your data you should add your information. We’re at a point where data is relatively plentiful, at least before the tsunami of whole genomes, so visualization and representation are of the essence.

Here’s HAP:

• Category: Science • Tags: Genome Blogging, Genomics, Personal Genomics 

Zack pointed me to two new ones, Fennoscandia Biogeographic Project, and Magnus Ducatus Lituaniae Project – BGA analysis project for the territories of former Grand Duchy of Lithuania. So I guess the circum-Baltic region is getting some thick coverage. The latter is also releasing some format conversion tools which seem to work in Windows, if you want to play with the analytic software yourself.

• Category: Science • Tags: Genome Blogging, Genomics, Personal Genomics 

At least about some things. In Guns, Germs, and Steel he argued that latitudinal diffusion of agricultural toolkits was much easier than longitudinal diffusion. This seems right, but one thing which I suspect Diamond, in hindsight, did not emphasize enough is that demographic diffusion and replacement can follow a similar pattern. I am probably not a “Neolithic population replacement” maximalist to the extent of someone like “Diogenes” or Peter Bellwood, but that is probably mostly a matter of my modest confidence about all of these sorts of issues. But, after running many trials of ADMIXTURE, along with perusing the results generated by Dienekes, David, and Zack, I am more confident in the position that agriculture and agriculture-bearing populations tend to initially follow paths of least ecological resistance. The distance between Lisbon and Damascus is about 4,000 km, while between Helsinki and Damascus it is about 3,000 km, but Lisbon has been much more affected by the migrations from the Middle East than Helsinki. The facilitation of water transportation as well as ecological similarities between Lisbon and Damascus, at least in relation to Helsinki, explain this phenomenon.
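
The Lisbon and Helsinki distances quoted above can be sanity-checked with a great-circle calculation. This sketch uses the haversine formula on a spherical Earth; the city coordinates are approximate values supplied here as assumptions, not figures from the post.

```python
import math

def great_circle_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Haversine great-circle distance between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))

# Approximate (latitude, longitude) in degrees; assumed values.
LISBON = (38.72, -9.14)
DAMASCUS = (33.51, 36.29)
HELSINKI = (60.17, 24.94)

lisbon_damascus = great_circle_km(*LISBON, *DAMASCUS)
helsinki_damascus = great_circle_km(*HELSINKI, *DAMASCUS)
```

Both results land close to the round figures in the text, with Helsinki the nearer of the two to Damascus, which is what makes the ADMIXTURE pattern notable.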

To illustrate this issue more broadly, let’s look at some ADMIXTURE results. Zack Ajmal at the Harappa Ancestry Project has one of the most cosmopolitan reference sets around, and he’s been posting results from his “reference 3” population, which merges a host of different study groups. Today he posted K = 6. That is, he generated 6 ancestral populations and allowed the program to assign proportions of each to individuals within the reference set. He labeled his putative ancestral populations:

– S Asian
– E Asian
– European
– SW Asian
– African
– American

Zack generated his usual nice bar plots, but I thought there might be another way to look at the relationships between the proportions. A scatter plot where each axis represents a proportion of a putative ancestral group. Below you see “SW Asian” on the y-axis and “European” on the x-axis:

First, I have to remind you that the ancestral groups which fall out of ADMIXTURE are not real ancestral groups necessarily. One good thing about doing your own runs is that you get a feel for the weirdness which bubbles out of the software. Populations which are evenly divided between two ancestral groups in an intuitive way can collapse in a higher K back into one element! Clearly hybridized populations can also transform into their own distinctive cluster. I tend to look at these sorts of results as suggestive pointers to relative relationships across populations. So, for example, there is the weird pattern that Western Europeans tend to have more “SW Asian” than Europeans from the Baltic, even though the latter region is closer as the crow flies to the Mid-East than the British Isles. What gives?

Water and climate. Much of the west of Europe is mild, while the Baltic is climatically harsh. Temperate climate societies, with technology and norms to suit moderate regimes, could be transplanted relatively easily to a valley in Ireland with sea access, as opposed to the interior of Russia’s center. In Zack’s run the Finns are the most “European” of Europeans, and have the least “SW Asian.” Intuitively this makes sense. The peoples of the Arabian peninsula are at the other pole. When you see deviations from the trend, that is due to the influence of other ancestral components. In the case of the Chuvash Turks, it is East Asian influence. For Iranians, it is South Asian. And for non-Jewish Yemenis, African. East Africa and Northeast Eurasia show parallel patterns. Both regions exhibit some level of “Caucasoid” admixture in direct proportion to distance, but, the influence is only from one stream. In East Africa it was an Arabian-affiliated element. In Northeast Eurasia it seems to have been mostly Northern European. That’s why they hug the margins of the plot. In contrast, groups like Mestizos and South Asians have ancestral contributions from both these putative groups.

Next let’s look at “European” against “S Asian”:

Nothing too surprising here. The main thing to note is that some groups in the Caucasus have a weird and unexpected affinity with South Asians. This has been extensively discussed by other genome bloggers, so I’ll leave that to the side.

Now “SW Asian” vs. “S Asian.”

Middle Eastern populations just get swapped out for Europeans now. The Caucasian groups interestingly are at the same position. But, there are some minor details which are notable. To spotlight that, I filtered the results to South Asian groups which had at least 10% combined “SW Asian” and “European” ancestry, and compared these two components:
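
The 10% filter is simple to express in code. A sketch under stated assumptions: group averages are stored as fractions in a dictionary, and the group names and values below are invented placeholders rather than Zack’s results.

```python
def west_eurasian_filter(groups, threshold=0.10):
    """Keep groups whose combined "SW Asian" + "European" share meets the threshold.

    Returns {group: (sw_asian, european)} for the groups that pass.
    """
    return {
        name: (q["SW Asian"], q["European"])
        for name, q in groups.items()
        if q["SW Asian"] + q["European"] >= threshold
    }

# Hypothetical group averages, for illustration only.
groups = {
    "Group A": {"SW Asian": 0.08, "European": 0.07},
    "Group B": {"SW Asian": 0.02, "European": 0.03},
}
kept = west_eurasian_filter(groups)
```

The returned pairs are what would go on the two axes of the follow-up scatter plot.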

Ah, the two Jewish groups are outliers! Just as we’d expect. They’re enriched for “SW Asian.” The Baloch, Brahui, and Makrani are three groups which are very closely associated at the junction of Iran, Pakistan, and Afghanistan. Historically the Baloch have had a presence in Oman because of their proximity, so it is likely that there has long been a maritime association with Southwest Asia on the part of this group. I suspect that the Baloch and Pathan have been diverging over time due to their different geographical positions.

Finally, in combination with the ancient DNA I think these ADMIXTURE results strongly point to multiple demographic impacts of outsiders in Europe and South Asia. I probably lean moderately to the proposition that the Basques are the descendants of Europe’s first farmers. But those first farmers were not the last newcomers. At higher K’s there’s a clear difference between French Basques and French and Spaniards. There is the presence of a component which is modal in western Asia among the non-Basque populations of western Europe. In Dienekes’ runs this element is present in Scandinavians, but not Finns. It is not dominant, but neither is it trivial. Similarly, if Reich et al. are correct the “S Asian” component is a compound of two other ancient ones. But it does seem that after the initial fusion between West Eurasian-like “Ancestral North Indians” and the indigenous “Ancestral South Indians” there were subsequent groups. The Austro-Asiatic populations have clear affinities to East Asian groups, but they are still predominantly “S Asian,” and lacking in the “European” or “SW Asian” components. That suggests to me that these last two were later intrusions, subsequent to the amalgamation of the Austro-Asiatic Mundari populations with the indigenous substrate.

• Category: Science • Tags: Genome Blogging, Genomics 

Zack Ajmal has been methodically working his way through issues in the public genomic data sets. Often it just involves noting duplicate samples across data sets, which need to be accounted for. But sometimes there seem to be problems within the uploaded data sets, for example relatively close related individuals. Today he highlights an issue which early on was noticeable in the Behar et al. data set:

Behar as in the Behar et al paper/dataset and not the Indian state of Bihar. The Behar dataset contains 4 samples of Paniya, which apparently is a Dravidian language of some Scheduled Tribes in Kerala.

I had always been suspicious of those four samples since one of them had admixture proportions similar to other South Indians but the other three were like Southeast Asians.

Since the Austroasiatic Paniya samples originated from Behar et al, I guess at some point before the Behar data being submitted to the GEO database the Paniyas got mislabeled.

I pulled down the Behar et al. data set too, and the Paniya just look weird enough that I just avoided them. Ideally this sort of stuff should be caught, but errors happen. Best to get as many eyeballs looking over everything.
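
One mechanical check that helps with duplicate samples across data sets is pairwise genotype concordance. The sketch below is an illustration only, not Zack’s procedure: the genotype coding (minor-allele counts), the 0.99 threshold, and the sample IDs are all assumptions, and it would not by itself catch mislabeled populations like the Paniya.

```python
from itertools import combinations

def concordance(g1, g2):
    """Fraction of SNPs with identical genotype calls, skipping missing calls (None)."""
    shared = [(a, b) for a, b in zip(g1, g2) if a is not None and b is not None]
    if not shared:
        return 0.0
    return sum(a == b for a, b in shared) / len(shared)

def likely_duplicates(samples, threshold=0.99):
    """Return pairs of sample IDs whose genotype concordance meets the threshold."""
    return [
        (id1, id2)
        for (id1, g1), (id2, g2) in combinations(samples.items(), 2)
        if concordance(g1, g2) >= threshold
    ]

# Hypothetical genotypes coded as minor-allele counts (0, 1, 2).
samples = {
    "setA_001": [0, 1, 2, 1, 0],
    "setB_017": [0, 1, 2, 1, 0],
    "setB_018": [2, 0, 1, 1, 0],
}
pairs = likely_duplicates(samples)
```

With real chip data the comparison would run over hundreds of thousands of SNPs, and close relatives would show up as elevated but not near-perfect concordance.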

• Category: Science • Tags: Genome Blogging, Genomics, Personal Genomics 
Razib Khan
About Razib Khan

"I have degrees in biology and biochemistry, a passion for genetics, history, and philosophy, and shrimp is my favorite food. If you want to know more, see the links at"