The Unz Review: An Alternative Media Selection
A Collection of Interesting, Important, and Controversial Perspectives Largely Excluded from the American Mainstream Media
Razib Khan
Gene Expression Blog


In the post yesterday I reported what was generally known about the Horn of Africa, that its populations seem to lie between those of Sub-Saharan Africa and Eurasia genetically. This is totally reasonable as a function of geography, but there are also suggestions that this is not simply a function of isolation by distance (i.e., populations at position 0.5 on the interval 0.0 to 1.0 would presumably exhibit equal affinities in both directions due to gene flow). For example, you observe the almost total lack of “Bantu” genetic influence on the Semitic and Cushitic populations of the Horn of Africa, and the lack of Eurasian influence in groups to the south and west of the Horn except to some extent the Masai.

Tacking horizontally in terms of discipline, over the past few generations there has been a veritable cottage industry making the case for the recent origin of many ethno-linguistic populations through a process of cultural self-creation. Clearly there are many cases of this, some of them studied in depth by anthropologists (e.g., the shift from Dinka to Nuer identity). But there has been an unfortunate tendency to over-generalize in this direction. In some ways this is peculiar insofar as these models presuppose the infinite plasticity of culture without observing the sharp and strong norms which those very same phenomena can enforce. The genetic isolation of non-Muslims in the Middle East after the rise of Islam seems rather well validated by the evidence from genomics. The norms of both Muslims and non-Muslims strongly biased them toward endogamy, and the nature of Islamic hegemony and domination was such that Muslims were the ones who were likely to have cosmopolitan affinities with the “Islamic international.” In contrast, non-Muslim minorities began a long process of involution after the Islamic Arab conquests, only disrupted in the past century by emigration and to a lesser extent emancipation.

So back to the Horn of Africa. The vast majority of the people of the Horn of Africa speak an Afro-Asiatic language. Arabic and Hebrew are the most famous members of this group, but it is a very broad classification, ranging from the dialects of the Berbers in the Maghreb all the way to ancient Akkadian. There are two large subfamilies of particular note and interest here: Semitic and Cushitic. The map above shows the distribution within the Horn of Africa. One can “quick & dirty” summarize the pattern here by observing that Semitic languages in Ethiopia tend to be concentrated in the north-central Christian highlands, while Cushitic is found everywhere else. Additionally, there is the confluence between religion and ethnicity, as there are Cushitic Muslims (Somalis, Afar, etc.) and Cushitic Christians (many Oromo, etc.). From what I can gather many Cushitic social and political elites have had a tendency toward assimilating into an Amhara Semitic identity (Haile Selassie’s mother was a Muslim Oromo). We could therefore generate a possible model where Semitic languages arrived late to Ethiopia and spread through elite emulation, so the difference between Semitic and Cushitic peoples should be marginal in the genomic dimension (such as the marginal differences between Hausa and Yoruba in Nigeria). Or, we could posit that the Semitic element is distinctive from a pre-existent Cushitic substratum.

To make a long story short by running more ADMIXTURE with a Horn of Africa centered data set I have discerned that one can actually differentiate Cushitic and Semitic elements in the Horn and tentatively identify them with different ancestral components. First, the technical details….

I began with the data set I started with in the runs I posted yesterday. Strange outliers in the Masai were removed. These are a few sets of individuals who “fix” for minority ancestral components. This is a tell that there’s structure within the Masai being picked up, but more like distantly related individuals, not ethnic level differences. After running this I noticed that a lot of the same then popped up in the non-Jewish Yemeni and Saudi samples. To some extent this is like “whack-a-mole.” If you remove one problem others simply pop out of the woodwork. So I removed all the non-Jewish Yemenis and Saudis. The number of markers remained the same, 210,000 SNPs.
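The outlier check described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the procedure actually used for these runs: the function name, the thresholds (0.99 for "fixing", 0.25 for a "minority" component), and the toy ancestry fractions are all hypothetical.

```python
from collections import defaultdict


def flag_fixed_outliers(q_rows, fix_threshold=0.99, minority_mean=0.25):
    """Flag individuals who 'fix' for a component that is minor in their population.

    q_rows: list of (individual_id, population, [ancestry fractions]).
    Both thresholds are illustrative assumptions, not values from the post.
    """
    by_pop = defaultdict(list)
    for _ind, pop, q in q_rows:
        by_pop[pop].append(q)

    # Mean fraction of each component within each population.
    pop_means = {
        pop: [sum(col) / len(col) for col in zip(*qs)]
        for pop, qs in by_pop.items()
    }

    flagged = []
    for ind, pop, q in q_rows:
        for k, frac in enumerate(q):
            if frac >= fix_threshold and pop_means[pop][k] <= minority_mean:
                flagged.append((ind, pop, k))
    return flagged


# Made-up ancestry fractions for five individuals at K = 3.
rows = [
    ("M1", "Nilotic Kenya", [0.6, 0.4, 0.0]),
    ("M2", "Nilotic Kenya", [0.6, 0.4, 0.0]),
    ("M3", "Nilotic Kenya", [0.6, 0.4, 0.0]),
    ("M4", "Nilotic Kenya", [0.6, 0.4, 0.0]),
    ("M5", "Nilotic Kenya", [0.0, 0.0, 1.0]),  # fixes for a rare component
]
print(flag_fixed_outliers(rows))
```

Flagged individuals would still need a look by eye, since a real ethnic-level component can also fix in some people.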

There were still a few issues with outliers, especially with the Bantu Kenya, and to a lesser extent the Levantine samples. But at this point I decided to go with it, since these are marginal to the story of the Horn of Africa in any case. I stated yesterday that in general Horn of Africa populations don’t present their own clusters, but are a composite of others, mostly East African and Arabian. After I removed some of the spurious Masai components and ran ADMIXTURE up to K = 10 I did finally get a Horn of Africa cluster, “HoAc”. Additionally, I also found that you can see systematic differences between Cushitic Oromo and Somalis, and the Semitic Amhara, Ethiopian Jews, and Tigray.

Below are bar plots of K = 7 and K = 9. The lower K’s aren’t too different from what I posted yesterday, while K = 8 and K = 10 have too many minor components. I’ve posted only fine-grained and Horn of Africa focused plots, instead of the more general summary plots which show average ancestral quanta. Also, below these I’ve posted two dimensional representations of genetic distances between inferred ancestral groups for K = 7 and K = 9. I’ve removed several components though, in the case of one because it was clearly a spurious “extended family” cluster, and in some cases to better visualize relationships.

To cut to the chase, it looks like all Horn of Africa populations share a HoAc base, which one might term “Cushitic,” though that is not totally accurate. On top of that base you see differences based on language family. The Semitic speaking groups have an ancestral component which is identical to the one fixed in Yemeni Jews, while the Cushitic speaking ones tend to lack this. But observe that the Semitic speaking populations generally have the component found in the Cushitic speaking groups, and especially the Somalis in which it often fixes. This is why I put the sequence of language-population expansions so that the Semitic is overlain upon a Cushitic base. Additionally, there does seem to be admixture from Nilotic groups into Ethiopian, but not Somali, populations. This is most consistent and evident in the Oromo, where an isolation by distance model seems plausible, as the Oromo are geographically the most likely to have interacted with Nilo-Saharan populations and the Somali the least.

Finally, please keep in mind that if the Somalis are 100% cluster X, that does not mean that the Somalis are derived from some real homogeneous ancestral cluster X. These ADMIXTURE components are very interesting in helping to flesh out relationships horizontally across populations today, but we should be cautious about what they can tell us about relationships vertically in terms of how populations emerged over time. A thoroughly admixed group can break out into its own distinctive cluster if it exhibits a level of internal homogeneity and the ancestral “reference” populations themselves no longer exist. This seems to be what has occurred in South Asia, where certain groups shake out as “100% South Asian,” but themselves on the deeper genomic level seem to be stabilized admixtures of ancient fusions between two ancestral groups which were very diverged. A South Asian analogy to the Horn of Africa might lead us to infer that Somalis are the equivalent of these populations, where they lack admixture with more recent arrivals to the region after the initial admixture event between “Ancestral East Africans” (AEA) and the Arabians of yore. This may simply be a function of geography and historical contingency, as the position of Somalis is more “sheltered” because of the quasi-peninsular nature of their region of the Horn. Additionally, Somalia is relatively dry and unsuitable for agriculture, making it perhaps less ecologically friendly than the highlands of Ethiopia to Semitic populations bringing a new agricultural toolkit.

There’s plenty more you can say, but I’ll hold off, and add a word of caution: it is very possible that I was looking for these specific clusters and arrived at them via confirmation bias. As I’ve noted before, if you tune ADMIXTURE’s parameters in the proper fashion you can “arrive” at the answers you want. How to protect against this? If I keep performing ad hoc runs and going by intuition, lots of repetition often helps. You naturally arrive at a sense of the underlying distribution of possibilities, and can guard against anchoring upon an outlier result, because you know that it is atypical (this is, though, one reason that ground-breaking results are ignored, as they don’t fit the paradigm, so there’s a flip-side to this bias). I also run cross-validation now and then to find the optimal number of K’s, but that really slows down the program, so this is a matter of trade offs for me. I’m rather sure that the differences between Ethiopian and Somali groups are robust, because the same pattern of relationships (e.g., the Amhara tendency to resemble the Tigray more than the Somali) reoccurs over and over. But I’m not so confident about the inference I’ve drawn here about the Afro-Asiatic language families and the partitioning of the Cushitic and Semitic groups.
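For readers who want to try the cross-validation step themselves: when ADMIXTURE is run with its `--cv` flag it prints a line of the form `CV error (K=8): ...` for each run, and the usual heuristic is to prefer the K with the lowest error. A minimal sketch of collecting those lines is below. The helper name and the error values in the example log are made up for illustration, and are not results from the runs described in this post.

```python
import re

# Matches the cross-validation line that ADMIXTURE prints with --cv.
CV_LINE = re.compile(r"CV error \(K=(\d+)\): ([0-9.]+)")


def best_k(log_text):
    """Return (K with the lowest CV error, {K: error}) from concatenated logs."""
    errors = {int(k): float(err) for k, err in CV_LINE.findall(log_text)}
    if not errors:
        raise ValueError("no 'CV error' lines found in the log text")
    return min(errors, key=errors.get), errors


# Hypothetical log excerpt; the numbers are invented for this example.
example_log = """\
CV error (K=7): 0.5512
CV error (K=8): 0.5498
CV error (K=9): 0.5503
"""

k, errors = best_k(example_log)
print(k, errors)
```

When the errors are as close together as in this invented log, the "optimal" K is not strongly supported, which is one more reason to look at the pattern across several K's.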

You can find some more files here.

Image credit: Wikipedia


Iman, a Somali model

Since I started up the African Ancestry Project one of the primary sources of interest has been from individuals whose family hail from Northeast Africa. More specifically, the Horn of Africa, Ethiopia, Eritrea, and Somalia. The problem seems to be that 23andMe’s “ancestry painting” algorithm uses West African Yoruba as a reference population, and East Africans are often not well modeled as derivative of West Africans. So, for example, the Nubian individual who I’ve analyzed supposedly comes up to be well over 50% “European” in ancestry painting. Then again, I’m 55-60% “European” as well according to that method! So we shouldn’t take these judgments to heart too much. Obviously something was off, and thanks to Genome Bloggers like Dienekes Pontikos we know what the problem was: the populations of the Horn of Africa have almost no distinctive “Bantu” element to connect them with West Africans like the Yoruba. Additionally, a closer inspection shows that the “Eurasian” component present in these populations is very specific as well, almost totally derived from Arabian-like sources. When breaking apart the West Eurasian populations it is no surprise that Northern Europeans and Arabians are among the most distant pairs, even excluding recent Sub-Saharan African admixture. The HapMap Utah European American sample and the Nigerian Yoruba are very suboptimal for people with eastern African background. In contrast, African Americans are a mixture of West Africans and Northern Europeans, so the ancestry painting algorithm has nearly perfect reference populations for them. The results for African Americans may not be very detailed and rich, but they’re probably pretty accurate at the level of grain which they’re offering results.

Though I’m happy to give people of Northeast African ancestry more detailed results than 23andMe, one of my motivations for the African Ancestry Project was to obtain a data set which would allow me to explore the genomic variation in the east of Africa myself. This region is a strong candidate for “source” populations for non-Africans within the last 100,000 years, and, it seems to have experienced rapid population turnover within the last 2,000-3,000 years. My data set is not particularly adequate to my ambitions, yet. But I do now have 5 unrelated Somalis. To my knowledge there hasn’t been much exploration of Somali genomics using thick-marker SNP chips, so why not? N = 5 is better than N = 0 in these cases of extreme undersampling.

Before I proceed to methods and results, I want to note that I put up most of my files here. It’s a ~25 MB compressed folder with images, spreadsheets, as well as raw output from ADMIXTURE and EIGENSOFT. I hope readers will take this as an invitation to poke around themselves.

Since my focus was on the Horn of Africa the coverage of populations is relatively constrained compared to what I normally run. From the HapMap I took the Yoruba, Masai, and Luhya. I renamed Masai “Nilotic Kenya” and the Luhya “Bantu Kenya.” The Behar et al. data set has a fair number of Ethiopians, gentiles and Jews. A reader helpfully labelled the various ethnicities by ID. I was going to do that myself, but because this tedious work was done for me I felt much more motivated to produce something instead of putting this task off! From the Behar et al. I also took some Arab populations, as well as Georgians, Lithuanians and Belorussians. I combined the two latter populations into “Baltic.” Syrians and Jordanians were converted to “Levantine” in the bar plots. I left Saudis, Yemenis, and Yemeni Jews disaggregated. Finally, I added some individuals from the AAP: all the people from the Horn of Africa who are unmixed in ancestry, as well as my Nubian individual. In the display that follows AAP members are combined with the ethnic groups which are appropriate in Behar et al.: Oromos, Amharas, and Tigray. Ethiopian Jews (the Beta Israel) I left as is. To mix it up I also brought over the Sandawe from Henn et al. The Somalis are all from AAP. They do not seem related (close relatives generally form their own cluster).

I tried to balance my populations in an ad hoc fashion. I took only ~30 Yoruba, but decided to add in more Masai, because they seemed to be a mixed population rather than a reference, and I wanted to flesh out their variation. I removed individuals who were closely related as per Zack Ajmal’s findings in his review of his reference data sets. After combining the data sets I was left with ~210,000 SNPs, with less than 0.1% missing. I ran this from K = 2 to K = 8 in ADMIXTURE, and, I also generated the top six independent dimensions of genetic variation in EIGENSOFT. I also took the Fst values from ADMIXTURE of the inferred ancestral populations and generated MDS representations of the genetic distances (though the original file can be found in the attached folder).
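To make the MDS step concrete, here is a minimal sketch of classical (Torgerson) MDS applied to a small matrix of pairwise distances. It is an assumption on my part that classical MDS is a fair stand-in here, since the post does not say which MDS variant or software produced the plots. The function names and the Fst-like values are hypothetical, and the eigenvectors are found by plain power iteration to keep the example free of dependencies.

```python
import math


def _matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def _top_eigenpair(m, iters=1000):
    """Dominant eigenvalue and unit eigenvector of a symmetric matrix (power iteration)."""
    n = len(m)
    v = [1.0 + 0.1 * i for i in range(n)]  # deterministic starting vector
    for _ in range(iters):
        w = _matvec(m, v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm < 1e-12:  # nothing left to extract
            return 0.0, [0.0] * n
        v = [x / norm for x in w]
    eigenvalue = sum(a * b for a, b in zip(v, _matvec(m, v)))
    return eigenvalue, v


def classical_mds(dist, dims=2):
    """Embed a symmetric distance matrix into `dims` coordinates per item."""
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(r) / n for r in sq]
    grand = sum(row_mean) / n
    # Double-centre the squared distances to get the Gram matrix B.
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand)
          for j in range(n)] for i in range(n)]
    coords = [[] for _ in range(n)]
    for _ in range(dims):
        lam, v = _top_eigenpair(b)
        scale = math.sqrt(max(lam, 0.0))
        for i in range(n):
            coords[i].append(scale * v[i])
        # Deflate so the next pass finds the next eigenvector.
        b = [[b[i][j] - lam * v[i] * v[j] for j in range(n)] for i in range(n)]
    return coords


# Made-up pairwise Fst-like values between three inferred components.
fst = [
    [0.00, 0.05, 0.15],
    [0.05, 0.00, 0.12],
    [0.15, 0.12, 0.00],
]
for label, xy in zip(["A", "B", "C"], classical_mds(fst)):
    print(label, xy)
```

Fst is not strictly a metric, so an embedding like this is a visualization aid, not an exact map of the distances.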

There are several different types of plots below. The MDS and PCA should be rather straightforward. But a little explanation for the ADMIXTURE bar plots. There are three for every K. First, average results by population. Second, a fine-grained display of all the individuals from all the populations. Third, a fine-grained display of some populations of interest. Please note that in the second set of plots I don’t label all the individuals by population, since it would be unreadable. But they go alphabetically, so you should be able to see where populations start, and where they end.
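The first type of plot, average results by population, is just the per-component mean of the individual ancestry fractions within each group. A minimal sketch follows, assuming the individual results are available as (population, fractions) pairs. The function name and the numbers are illustrative only, not values from the actual runs.

```python
from collections import defaultdict


def mean_ancestry_by_population(q_rows):
    """Average each ancestry component within each population.

    q_rows: list of (population, [ancestry fractions]) pairs.
    Populations come back in alphabetical order, matching the plot ordering.
    """
    grouped = defaultdict(list)
    for pop, q in q_rows:
        grouped[pop].append(q)
    return {
        pop: [sum(col) / len(col) for col in zip(*qs)]
        for pop, qs in sorted(grouped.items())
    }


# Made-up K = 2 fractions for three individuals.
example_rows = [
    ("Somali", [0.9, 0.1]),
    ("Somali", [0.7, 0.3]),
    ("Oromo", [0.5, 0.5]),
]
print(mean_ancestry_by_population(example_rows))
```

Averages hide within-group spread, which is why the fine-grained individual plots are shown alongside them.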

Before you even look at the results and we discuss them, there is one clear issue which jumps out: there are closely related individuals or clans in the Masai data set which I need to remove in future runs. Though these individuals hogged up higher K’s it didn’t affect the relationships across other populations, so I decided to publish this now before refining it for the future. It’s a learning experience. You can see that these individuals form their own clusters in the MDS and PCA as well. At least the problem reoccurs systematically using different methodologies.

(note: some of the images are larger than shown, so if you want to see better labels for the fine-grained plots, get the image URL and look at it separately)

[Image gallery: ADMIXTURE bar plots, MDS, and PCA plots]



The fact that the Masai “break down” at K = 6 is really problematic, as there’s information that’s probably lost here. But several immediate observations:

1) The Somalis, like the Ethiopian groups, show almost no impact from the Bantu expansion. This is in contrast to the one Nubian individual, who may have more West African ancestry through intermediate groups, or through direct contact with Bantus who were enslaved and brought to Sudan.

2) When you break apart West Eurasian ancestry the Ethiopian and Somali groups have their contribution almost exclusively from an ancestral component in southern Arabia. This makes some sense because of geography, but when you look at the fractions of “northern” admixture even among Yemeni Jews the proportions are not reflected among the Horn of Africa groups. One hypothesis which is consistent with this might be that the admixture event with the Arabian-like group occurred at a time when south Arabians were more genetically isolated and distinct from populations to the north. I suspect this is almost certainly going to be true before the camel, let alone Islam. Interestingly, just as the Nubian individual has more West African affinities, they also have more European affinities. The Nubian individual’s ancestry is simply more cosmopolitan than that of Ethiopians and Somalis, which is not historically that surprising.

3) There is a rough rank order of admixture estimates. In terms of Africanness it goes from Somali > Oromo > Beta Israel ~ Amhara > Tigray. The sample sizes are small though, so we should be cautious. The Amhara seem to vary the most. One might suspect that the Amhara, being the traditional core ethnicity of Ethiopia of late, assimilated other groups. If you look at the PCA the Somali actually look the most “East African” of the groups on PC 2. Note also the linear pattern of distribution of other Ethiopians and the Masai toward Arabians and Bantu respectively. This is suggestive of some sort of ancient admixture event between an East African substrate and other populations. I will label this population “Ancestral East Africans” (AEA).

4) The relationship of the Sandawe to the other groups is interesting. It seems clear that the Sandawe are related to the AEA, but are somewhat at a remove. Note that a “Sandawe” component is often found in low proportions outside of the Sandawe across East Africa. While the Ethiopians and Somalis do not have a Bantu aspect to their ancestry, they may have an “Ancestral Sandawe” (AS) one.

I don’t want to say more until I get the Masai data set fixed (and I might make recourse to some of Dienekes’ “tricks,” as well as supervised runs). But overall I would say that the ethnogenesis of the Semitic and Cushitic people of the Horn of Africa pre-dates the Bantu expansion. I will do some more playing with this, but they do not seem to generate an “Ethiopian-Somali” cluster so easily as South Asians do. This may be because they are never numerous in any of these analyses. Or, it may be due to the possibility that the admixture event was recent enough that the underlying populations are not as obscured as amongst South Asians. I lean toward the latter, for now. As in South Asia, I do not think that the ethnogenesis of the families of Ethiopian peoples is quite a “one off” admixture event. It is suggestive that you have two major language families, Semitic and Cushitic, in this region.

Image credit: Wikimedia


In the open thread someone asked: “Any recent stuff on the genetics of Ethiopians.” That prompted me to look around, because I’m curious too. Poking around Wikipedia I couldn’t find anything recent. A lot of the studies are older uniparental lineage based works (NRY and mtDNA). Ethiopia is interesting because unlike almost all other Sub-Saharan African nations it has a long written history. Culturally and linguistically it has both Sub-Saharan African, and non-Sub-Saharan African, affinities. The languages of highland Ethiopia are clearly Semitic. Those of lowland Ethiopia are Cushitic, a branch of the broader Afro-Asiatic language family concentrated around the Horn of Africa (Somali is a Cushitic language, though most Ethiopian nationals who speak a Cushitic dialect are of the Oromo group).

From a human evolutionary genetic perspective, Ethiopia also has specific interest. It is likely that the main recent pulse of humans Out of Africa traversed this region. Additionally, there is some evidence of deep time connections between the groups ancestral to Ethiopians and the Khoisan of southern Africa. It may be that Ethiopians and Khoisan are reservoirs of ancient genetic variation in Sub-Saharan Africa which has been overlain by Bantu in most other regions outside of West Africa. Finally, Ethiopians are known to have high altitude adaptations. This could be due to long term residence in the region, or, assimilation of favorable alleles from the long term residents by later populations.

Fortunately we can get a sense of the genetic affinities of Ethiopians thanks to a paper published last spring, The genome-wide structure of the Jewish people. The focus was clearly on Jews, but they surveyed Amhara & Tigray (Semitic speaking highlanders), Ethiopian Jews (similar ethnically to the Amhara & Tigray, but religiously non-Christian), and Oromo. In the PCA the Oromo and Semitic speaking populations are pretty obviously distinct clusters.

This just means that when you take worldwide genetic variation, and pull out the biggest independent dimensions, and then visualize individuals on the two largest dimensions in terms of how they explain variance, the Oromo and other Ethiopians don’t really intersect. Interestingly the Amhara and Tigray are almost indistinguishable, but the Ethiopian Jews are in their own cluster. There are, for the record, 7 Oromo, 7 Amhara, 5 Tigray, and 13 Ethiopian Jews in the sample.

Now let’s look at the genetic variation in ADMIXTURE. Remember this assigns the genomes of individuals in proportions to K ancestral units. As an example, if you had African Americans, Yoruba, and White Americans, in a total pool, and did K = 2, you might have a tendency where Yoruba and White Americans are in two totally different ancestral populations of K, while African Americans are 80% in one ancestry and 20% in another. The interpretation of this is straightforward, but when it comes to populations whose backgrounds we don’t know as well, one should be careful. The selection of a particular value for K is going to be really important, and we shouldn’t confuse the method with the reality which the method is trying to plumb.
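The K = 2 example can be put in numbers. The model behind ADMIXTURE treats an individual's expected allele frequency at each SNP as the ancestry-weighted average of the allele frequencies in the K ancestral populations. A minimal sketch for a single SNP is below. The function name and the two allele frequencies are hypothetical; only the 80%/20% split comes from the example in the text.

```python
def expected_allele_freq(q, p):
    """Expected allele frequency for one individual at one SNP.

    q: ancestry fractions across the K ancestral populations (must sum to 1).
    p: frequency of the allele in each of those K populations.
    """
    if len(q) != len(p):
        raise ValueError("q and p must have the same length K")
    if abs(sum(q) - 1.0) > 1e-9:
        raise ValueError("ancestry fractions must sum to 1")
    return sum(qk * pk for qk, pk in zip(q, p))


# 80% / 20% ancestry as in the text; allele frequencies are invented.
q_individual = (0.8, 0.2)
p_snp = (0.10, 0.60)
print(expected_allele_freq(q_individual, p_snp))  # 0.8 * 0.10 + 0.2 * 0.60
```

The program works in the other direction, estimating both the fractions and the ancestral frequencies from genotypes across many SNPs, which is why the components are statistical constructs and not necessarily real historical populations.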

First, K = 8 from Behar et al. I’ve reedited to highlight populations which might inform the variation of Ethiopians.

Now let’s look at a series of K’s. Note the changes.

Luckily for us, we don’t need to stop here. Dienekes included Behar’s Ethiopians (non-Jews) for Dodecad. Additionally, he included the Masai population from the HapMap. This turns out to be important because he found that Ethiopian Sub-Saharan ancestry is similar to that of the Masai, not the other African groups.

Dienekes also provided individual outputs. I’ve stitched together Ethiopians with Egyptians and Saudis. The color coding is the same as above.

You should be able to tell where the three groups start and stop pretty easily. I’m 99% sure that the six individuals with more East African and less Southwest Asian ancestry are all Oromo. Ethiopians, in particular highland Ethiopians, seem to me likely an ancient stabilized hybrid population between a population from Arabia, and a local Sub-Saharan population. This population seems unlikely to have been related to the peoples of West-Central Africa, who are associated with the Bantus across eastern and southern Africa. The Bantu agricultural toolkit runs into ecological constraints in various regions, and it is in those regions that non-Bantu populations have persisted. Ethiopia, with its unique climate and topography, naturally remains non-Bantu (as well as the Horn of Africa as a whole). The possible connections between Khoisan and Ethiopia may be a function of the fact that these areas harbor genetic variants which have disappeared in the intervening regions because of the Bantu expansion. I have a hard time accepting that the Bantu expansion was particularly eliminationist, but I am starting to suspect that outside of Ethiopia population densities were very, very, low.

The antiquity of this ancient hybridization event to me is attested by the fact that Ethiopians lack any of the other Middle Eastern components besides the one modal in Saudi Arabia. There is a great deal of intra-population variance in the Saudi data set. Why? Part of this must be the slave trade, as well as pilgrims who remained in places like Mecca. But, I think part of the untold story here is that there may have been a larger genetic impact on Arabia after the rise of Islam from the Levant than vice versa! Probably the gene flow precedes Islam, as Arabia was hooked into worldwide trade and population movements, which Ethiopia was relatively insulated from. The Saudi data set has several people who are “pure” Southwest Asian, but also several who have a great deal of West Asian + South European. These seem likely to be people who have some background in the Fertile Crescent.

Razib Khan
About Razib Khan

"I have degrees in biology and biochemistry, a passion for genetics, history, and philosophy, and shrimp is my favorite food. If you want to know more, see the links at"