
Razib Khan
Bantu Expansion

Some have asked what the point is in poking around African population structure when Tishkoff et al. and Henn et al. have done such a good job in terms of coverage. First, it is nice to run your own analyses so you can slice & dice to your preference, and not rely on the constrained menu provided by others. There’s value in home cooking; you can flavor to your taste. Second, you never know what data people might leave on your doorstep. I’ve received the genotypes of three Somalis. Nothing too surprising, a touch more Cushitic than the Ethiopians in Behar et al., but interesting nonetheless.

Also, you can see how ADMIXTURE tends to come to weird conclusions in certain circumstances. Below is a K = 12 run on ~50,000 SNPs. I’ve added a few Behar et al. and HGDP populations to the Henn et al. set, and pruned a lot of the African groups which seem redundant in terms of information. I’ve added a few geographically informative labels as well.

Observe below that there is a Fulani cluster. I think this is pretty much an artifact. At K = 7 the Fulani have a majority component which is modal in West Africa & Bantu speakers, and a minority component which is identical to the one modal in Mozabite Berbers from Algeria. The Mozabites reside in the far northern Sahara, and their modal component drops off as one goes east toward western Asia and the eastern Mediterranean. I suspect that what is showing up in ADMIXTURE is the ancient hybridization of the Fulani, and perhaps their demographic expansion from this core group. We have some glimmers of the prehistory of the Fulani, and no expectation for them to be such a distinctive cluster, so I naturally jump to these inferences. But it does make me reconsider the nature of the “Sandawe,” “Mbuti” or “San” clusters in ADMIXTURE. These populations are culturally distinctive in deep ways from their neighbors, so a reflexive inference one might make is that they’re “pure” ancient substrate groups which have been overlain and marginalized by their Bantu neighbors. But their prehistory is far murkier than the Fulani because of their geographical isolation, so there is far less to go on. These “ancient” isolated groups themselves may have gone through the same sort of distinctive recent ethnogenesis processes which we presume occurred with the Fulani (also, in the plot below the Biaka are pure; but in most of the bar plots they have a minor element which they share with their neighbors, probably due to greater admixture and interaction between western Pygmies and their Bantu neighbors than among the eastern ones).

OK, now let’s prune some of the “pure” and extraneous populations. Additionally, I’ll remove some of the K’s. So the proportions are going to be recalculated with a new base. So, keep in mind that the South African Bantus show elevated West African in part because the Khoisan proportion was removed, inflating the percentages for all the other elements.
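The rebasing works like this: once a component is dropped, each individual's remaining proportions are divided by their new sum so that they total 1 again. Here is a minimal sketch in Python. The component names and numbers are made up for a hypothetical individual and are not values from the run.

```python
def renormalize(proportions, dropped):
    """Drop the named components and rescale the rest so they sum to 1."""
    kept = {k: v for k, v in proportions.items() if k not in dropped}
    total = sum(kept.values())
    if total == 0:
        raise ValueError("nothing left after dropping components")
    return {k: v / total for k, v in kept.items()}


# Hypothetical individual: illustrative numbers only, not taken from the data.
individual = {"Bantu": 0.45, "W African": 0.30, "Khoisan": 0.25}
rebased = renormalize(individual, dropped={"Khoisan"})
print(rebased)
```

With the Khoisan quarter removed, the West African share rises from 0.30 to about 0.40 without any change in the underlying genotype, which is the inflation to keep in mind.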

Now let’s look at the pairwise Fst values between inferred populations. Remember, this measures the proportion of genetic variance which can be attributed to between population differences. The bigger the value, the larger the genetic distance. I’ll give the inferred populations labels, but don’t take those too seriously.
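To make the "proportion of variance" idea concrete, here is a sketch of the textbook single-SNP definition for two equally weighted populations: Fst = (H_T - H_S) / H_T, where H_T is the expected heterozygosity of the pooled population and H_S is the mean within-population heterozygosity. This is only an illustration. ADMIXTURE reports its own estimate across all SNPs, and the function name and example frequencies here are my assumptions.

```python
def fst_two_populations(p1, p2):
    """Wright's Fst at one biallelic SNP for two equally weighted populations.

    p1 and p2 are the frequencies of the same allele in each population.
    """
    p_bar = (p1 + p2) / 2
    h_total = 2 * p_bar * (1 - p_bar)                       # pooled heterozygosity
    h_within = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2  # mean within-population
    if h_total == 0:
        return 0.0  # allele fixed or absent everywhere: no variance to partition
    return (h_total - h_within) / h_total


print(fst_two_populations(0.5, 0.5))  # identical frequencies
print(fst_two_populations(0.1, 0.9))  # strongly diverged frequencies
print(fst_two_populations(0.0, 1.0))  # fixed for different alleles
```

Identical frequencies give 0, fixed differences give 1, and the 0.1 versus 0.9 case lands at roughly 0.64.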

Fst divergences between estimated populations:
| | Fulani | San | European | Maya | Nilotic | Biaka | W African | SW Asian | Sandawe | Mbuti | Mozabite | Bantu |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Fulani | 0.00 | 0.19 | 0.15 | 0.26 | 0.11 | 0.13 | 0.09 | 0.14 | 0.10 | 0.18 | 0.12 | 0.10 |
| San | 0.19 | 0.00 | 0.27 | 0.37 | 0.16 | 0.11 | 0.13 | 0.25 | 0.13 | 0.13 | 0.23 | 0.13 |
| European | 0.15 | 0.27 | 0.00 | 0.18 | 0.17 | 0.22 | 0.19 | 0.05 | 0.15 | 0.26 | 0.06 | 0.19 |
| Maya | 0.26 | 0.37 | 0.18 | 0.00 | 0.27 | 0.31 | 0.28 | 0.19 | 0.25 | 0.36 | 0.20 | 0.28 |
| Nilotic | 0.11 | 0.16 | 0.17 | 0.27 | 0.00 | 0.10 | 0.07 | 0.17 | 0.08 | 0.14 | 0.13 | 0.07 |
| Biaka | 0.13 | 0.11 | 0.22 | 0.31 | 0.10 | 0.00 | 0.07 | 0.21 | 0.09 | 0.09 | 0.18 | 0.07 |
| W African | 0.09 | 0.13 | 0.19 | 0.28 | 0.07 | 0.07 | 0.00 | 0.17 | 0.07 | 0.12 | 0.14 | 0.05 |
| SW Asian | 0.14 | 0.25 | 0.05 | 0.19 | 0.17 | 0.21 | 0.17 | 0.00 | 0.14 | 0.25 | 0.06 | 0.18 |
| Sandawe | 0.10 | 0.13 | 0.15 | 0.25 | 0.08 | 0.09 | 0.07 | 0.14 | 0.00 | 0.13 | 0.12 | 0.07 |
| Mbuti | 0.18 | 0.13 | 0.26 | 0.36 | 0.14 | 0.09 | 0.12 | 0.25 | 0.13 | 0.00 | 0.22 | 0.12 |
| Mozabite | 0.12 | 0.23 | 0.06 | 0.20 | 0.13 | 0.18 | 0.14 | 0.06 | 0.12 | 0.22 | 0.00 | 0.14 |
| Bantu | 0.10 | 0.13 | 0.19 | 0.28 | 0.07 | 0.07 | 0.05 | 0.18 | 0.07 | 0.12 | 0.14 | 0.00 |

Here’s the genetic distance between non-African groups and African ones on a bar plot.

Some consistent trends:

- Mbuti and Khoisan show the largest distance from non-Africans.

- Biaka are next. Again, this may be due to admixture between Biaka and neighboring groups, or to a closer relationship between the Biaka Pygmies and the non-Khoisan/Mbuti African groups with reference to the last common ancestors.

- Roughly equal distance of Bantus and West Africans.

- Marginally smaller distances between the Nilotic cluster and non-Africans.

- Finally, a consistently smaller difference between non-Africans and the Sandawe cluster.
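These trends can be checked against the Fst table above. The sketch below averages each African cluster's Fst to the European, Maya, SW Asian and Mozabite clusters and ranks the results. Grouping Mozabite with the non-Africans and using a plain mean are my assumptions for the illustration; the bar plot may have been built differently.

```python
NON_AFRICAN = ["European", "Maya", "SW Asian", "Mozabite"]

# Fst from each African cluster to the four clusters in NON_AFRICAN (same order),
# copied from the table above.
FST_TO_NON_AFRICAN = {
    "Fulani":    [0.15, 0.26, 0.14, 0.12],
    "San":       [0.27, 0.37, 0.25, 0.23],
    "Nilotic":   [0.17, 0.27, 0.17, 0.13],
    "Biaka":     [0.22, 0.31, 0.21, 0.18],
    "W African": [0.19, 0.28, 0.17, 0.14],
    "Sandawe":   [0.15, 0.25, 0.14, 0.12],
    "Mbuti":     [0.26, 0.36, 0.25, 0.22],
    "Bantu":     [0.19, 0.28, 0.18, 0.14],
}

# Rank clusters from most to least distant, by mean Fst to the non-African set.
ranking = sorted(
    ((name, sum(values) / len(values)) for name, values in FST_TO_NON_AFRICAN.items()),
    key=lambda pair: pair[1],
    reverse=True,
)

for name, mean_fst in ranking:
    print(f"{name:10s} {mean_fst:.3f}")
```

On these numbers the order is San, Mbuti, Biaka, Bantu, W African, Nilotic, Fulani, Sandawe. The Fulani sit almost level with the Sandawe, which fits the admixture already suspected for them.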

As always we need to remember that these probably aren’t pure concrete real ancestral groups. I have no hesitation in presuming some low level consistent gene flow over time between the western Mediterranean groups of which Mozabites are part and some of the Nilotic populations in north-central Africa. This equilibration of gene frequencies would reduce the Fst value naturally. Second, the relative closeness of the Sandawe cluster jumped out at me initially when I looked at the African data. It just strikes me as weird.

Here’s Wikipedia on the Sandawe:

The Sandawe are an agricultural ethnic group based in the Kondoa district of Dodoma Region in central Tanzania. In 2000 the Sandawe population was estimated to number 40,000.

The Sandawe language is a tonal language with clicks, apparently related to the Khoe languages of southern Africa. Recent research suggests that the ancestors of the Khoe were pastoralists, and migrated into southern Africa from the northeast, perhaps from the region of the modern Sandawe.

But the Sandawe don’t seem to be that close to the South African Bushmen samples. Here’s a multidimensional scaling of the Fst relationships of selected inferred ancestral African groups (weight the x-axis more):
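For anyone who wants to reproduce this kind of plot, here is a rough sketch of classical (Torgerson) MDS applied to the African block of the Fst table from earlier in the post. I don't know which MDS variant or software produced the original figure, so treat this as one reasonable choice. It uses plain power iteration in place of a proper eigensolver to keep the example dependency-free, and it treats Fst as if it were a Euclidean distance, which is only approximately true.

```python
import math

LABELS = ["Fulani", "San", "Nilotic", "Biaka", "W African", "Sandawe", "Mbuti", "Bantu"]
# Pairwise Fst among the African clusters, copied from the table earlier in the post.
FST = [
    [0.00, 0.19, 0.11, 0.13, 0.09, 0.10, 0.18, 0.10],  # Fulani
    [0.19, 0.00, 0.16, 0.11, 0.13, 0.13, 0.13, 0.13],  # San
    [0.11, 0.16, 0.00, 0.10, 0.07, 0.08, 0.14, 0.07],  # Nilotic
    [0.13, 0.11, 0.10, 0.00, 0.07, 0.09, 0.09, 0.07],  # Biaka
    [0.09, 0.13, 0.07, 0.07, 0.00, 0.07, 0.12, 0.05],  # W African
    [0.10, 0.13, 0.08, 0.09, 0.07, 0.00, 0.13, 0.07],  # Sandawe
    [0.18, 0.13, 0.14, 0.09, 0.12, 0.13, 0.00, 0.12],  # Mbuti
    [0.10, 0.13, 0.07, 0.07, 0.05, 0.07, 0.12, 0.00],  # Bantu
]


def matvec(matrix, vector):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


def classical_mds(dist, dims=2, iters=1000):
    """Embed a symmetric distance matrix in `dims` dimensions.

    Returns (coords, eigenvalues): coords[i] is the position of item i,
    and eigenvalues[k] is the weight carried by axis k.
    """
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    grand_mean = sum(row_mean) / n
    # Double-centred Gram matrix B = -1/2 * J * D^2 * J (symmetry assumed).
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
          for j in range(n)] for i in range(n)]
    coords = [[0.0] * dims for _ in range(n)]
    eigenvalues = []
    for axis in range(dims):
        v = [float(i + 1) for i in range(n)]  # arbitrary non-constant start vector
        for _ in range(iters):
            w = matvec(b, v)
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:  # nothing left to extract on this axis
                break
            v = [x / norm for x in w]
        length = math.sqrt(sum(x * x for x in v))
        v = [x / length for x in v]
        lam = sum(x * y for x, y in zip(v, matvec(b, v)))  # Rayleigh quotient
        eigenvalues.append(lam)
        scale = math.sqrt(max(lam, 0.0))
        for i in range(n):
            coords[i][axis] = scale * v[i]
        # Deflate so the next pass finds the next axis.
        b = [[b[i][j] - lam * v[i] * v[j] for j in range(n)] for i in range(n)]
    return coords, eigenvalues


coords, eigenvalues = classical_mds(FST)
for label, (x, y) in zip(LABELS, coords):
    print(f"{label:10s} {x:+.3f} {y:+.3f}")
print("axis weights:", eigenvalues)
```

The axis weights give one way to read "weight the x-axis more": each axis is scaled by the square root of its eigenvalue, so the first axis normally carries more of the distance structure than the second.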

An aspect of PCA plots which always jumps out at you is the gap between African groups and non-African ones, often spanned by populations which have likely recent admixture. One hypothesis to explain this is that there’s been little gene flow between Africa and the rest of the world since the Out of Africa event. Probably due to ecology (the Sahara). But here’s another explanation: the Bantu expansion has wiped clean much of the genetic variation of central and eastern Africa, the very variation which might span in part the African vs. non-African gap. The archaeology and anthropology indicate that both the groups currently dominant in much of eastern Africa and down to the south, the Bantu and Nilotic peoples, are intrusive on the scale of the past 3,000 years. So groups like the Hadza and the Sandawe are presumed to be relics of the older cultural and genetic variation. This may be why the Sandawe are closer to Eurasians than other African groups once you control for clear likely admixture (e.g., the Fulani). Or, it may be that the Sandawe themselves have an older admixture event due to back-migration from Eurasia….

Finally, let me leave you with a bunch of MDS plots which visualize the Fst differences.

Image Credit: Mark Dingemanse

I recall years ago someone on the blog of Jonathan Edelstein, a soc.history.what-if alum as well, mentioning offhand that archaeologists had “debunked” the idea of the Bantu demographic expansion. Because, unfortunately, much of archaeology consists of ideologically contingent fashion, it was certainly plausible to me that archaeologists had “debunked” the expansion of the Bantu peoples. But how to explain the clear linguistic uniformity of the Bantu dialects, from Xhosa of South Africa, up through Angola and Kenya, to Cameroon? One extreme model could be a sort of rapid cultural diffusion, perhaps mediated by a trivial demographic impact. The spread of English exhibits this hybrid dynamic. In some areas (e.g., Australia) there was a substantial, even dominant, English demographic migration coincident with the rise of Anglo culture. In other areas, such as Jamaica, by and large the crystallization of an Anglophone culture arose atop a different demographic substrate, which synthesized with the Anglo institutions (e.g., English language and Protestant religion). The United States could arguably be held up as an in-between case, with an English founding core population, around which there was an accretion of a non-Anglo-Saxon stream of immigrants who serially adopted the Anglo culture, more or less. Sometimes this co-option of Anglo-Saxon norms may surprise. “Black English” (i.e., Ebonics) actually seems to be a genetic descendant of lower class northern English dialects. Other distinctive components of black American culture (e.g., “jumping the broom”) can also plausibly be traced back to the British Isles.

So cultural change is in the “it’s complicated” segment of dynamics. We have to go on a case-by-case basis. For the Bantu expansion though we have a good answer now thanks to genetics: this cultural change almost certainly was accompanied by a massive demographic migration. Thanks to Brenna Henn and company you can even run some analyses on your desktop to confirm the reality of this model. I pulled down the 55,000 SNPs from various African populations, merged with Palestinians, Tuscans, and Maya as outgroups, and pruned down to ~40,000 after removing those which were missing in more than 1% of the cases. The Hadza are also gone, as they’re such a small isolated group that they always hogged K’s all by themselves. I ran a bunch of different ADMIXTURE runs, from K = 2 to 12. You can see all of them here, but let’s just focus on K = 12.
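The missingness filter is simple to express. Below is a toy sketch: a SNP is kept only if the fraction of individuals with no call at that SNP is at most the threshold. The data layout (one list per individual, `None` for a missing call) and the function name are assumptions made for the example. Tools such as PLINK offer the same kind of per-variant filter (its `--geno` option), though nothing above says which tool was actually used.

```python
def filter_snps_by_missingness(genotypes, max_missing=0.01):
    """Return the indices of SNPs whose missing-call rate is at most max_missing.

    genotypes: one list per individual, one entry per SNP;
    None marks a missing call (a toy encoding assumed for this sketch).
    """
    n_individuals = len(genotypes)
    n_snps = len(genotypes[0])
    kept = []
    for snp in range(n_snps):
        missing = sum(1 for person in genotypes if person[snp] is None)
        if missing / n_individuals <= max_missing:
            kept.append(snp)
    return kept


# Four made-up individuals typed at three SNPs (0/1/2 = allele counts).
toy = [
    [0, 1, None],
    [1, None, 2],
    [2, 1, 0],
    [0, 0, 1],
]
print(filter_snps_by_missingness(toy, max_missing=0.01))  # strict 1% cutoff
print(filter_snps_by_missingness(toy, max_missing=0.25))  # lenient cutoff
```

With only four individuals a single missing call already exceeds 1%, so the strict cutoff keeps only the first SNP. On a real panel with hundreds of samples the same rule trims far less aggressively.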

Below is a bar plot, somewhat sorted by ADMIXTURE elements. I’ve reedited some of the labels for clarity, adding regions. I’m sure some of you are as ignorant of where the Brong people are from (Ghana) as I was before I looked them up. Also, please be careful about ADMIXTURE. There is a “Fulani” ancestral component below, but I’m 90% sure that’s just an artifact of recent Fulani demographics + their unique genetic admixture.

K4, the dark green component, seems associated with Bantus and Bantu neighbors all across Africa. The lack of correspondence to geography is clearly suggestive of demographic leapfrogging. The existence of non-Bantu peoples in the wake of their migration (e.g., the Nilotic peoples in northeast Africa, the Pygmies, and the Sandawe) could be indicative of either ecological constraints on the Bantu toolkit (so the migrants simply moved around the uncongenial zones), or a later intrusion (this is often hypothesized to be what occurred to bring the Masai to Tanzania). There are no Horn of Africa samples here, but I have some 23andMe files, and I can tell you that, as Dienekes observed, the Sub-Saharan component among the people of Ethiopia and Somalia seems singularly lacking the Bantu element. Why? My own suspicion is that this region had its own agricultural (or pastoralist) way of life which rendered them demographically robust in the face of the Bantu, who simply turned south once they reached a zone of serious cultural resistance.

But there’s more. Of course there are Fst values, genetic distances, between these “ancestral” populations. You can find these, along with the frequencies, in an Excel file I uploaded. But let’s look at how the populations relate to each other on an MDS plot, which visualizes the pairwise distances on a two-dimensional plane. I’ve added labels this time. They should be pretty clear in terms of which K’s they correspond to.


For what it’s worth, the Sandawe are presumed to be the aboriginal people of Tanzania, at least in relation to the dominant Bantu around them.

Last weekend I mentioned a paper, The Genetic Structure and History of Africans and African Americans, which had the best coverage of disparate African populations we’ve seen so far. The map to the left shows the various ancestral population clusters inferred from the samples they had. Really the only failing is that they didn’t have samples from Angola, Zambia, Zimbabwe and Mozambique. Unfortunately, that’s not totally trivial. These are regions which were affected by the Bantu Expansion, with southern Angola in particular still having remnants of Khoisan language speakers which likely attest to the pre-Bantu populations. Luckily for us innovation and scientific ingenuity are such that minor questions can quickly be answered because of how cheap the basic methods have become. A new paper in The European Journal of Human Genetics tackles Mozambique in particular, and discerns a heretofore unknown possible population cluster. A genomic analysis identifies a novel component in the genetic structure of sub-Saharan African populations:

Studies of large sets of single nucleotide polymorphism (SNP) data have proven to be a powerful tool in the analysis of the genetic structure of human populations. In this work, we analyze genotyping data for 2841 SNPs in 12 sub-Saharan African populations, including a previously unsampled region of southeastern Africa (Mozambique). We show that robust results in a world-wide perspective can be obtained when analyzing only 1000 SNPs. Our main results both confirm the results of previous studies, and show new and interesting features in sub-Saharan African genetic complexity. There is a strong differentiation of Nilo-Saharans, much beyond what would be expected by geography. Hunter-gatherer populations (Khoisan and Pygmies) show a clear distinctiveness with very intrinsic Pygmy (and not only Khoisan) genetic features. Populations of the West Africa present an unexpected similarity among them, possibly the result of a population expansion. Finally, we find a strong differentiation of the southeastern Bantu population from Mozambique, which suggests an assimilation of a pre-Bantu substrate by Bantu speakers in the region.

The main value-add of the research was the 279 individuals from Mozambique, whom they plugged into previous data sets (e.g., HGDP, HapMap3). It must also be noted that they limited their genetic survey to ~2800 SNPs. This is sufficient for their purposes. Below are the figures of interest from the paper. Note immediately how Mozambique separates out at K = 4 in the first image. The subsequent figures are from PCA. The axes represent components of variation. The last panel shows a PCA plot transposed onto a map. In this case, PC1 & PC3.


The first figure is important because it suggests population structure we hadn’t known of in the Bantu Expansion. This doesn’t mean that it should be surprising. With Africa’s current level of genetic variation it seems implausible that the carriers of the Bantu culture would not have assimilated other groups along the wave of advance. In fact, as a cultural movement gains steam through positive feedback loops different societies may become co-opted into it, and spread the culture in their own turn. As an American example, I will give the Irish American Catholic hierarchy’s campaigns against German language parochial school instruction in the 19th century. Old English aside, the language of the Irish was originally not English, but by the early 19th century apparently English had already become dominant among the Roman Catholic peasantry of Ireland. So they brought English, not Gaelic, to the United States. Similarly, the spread of Islam in India occurred predominantly under the aegis of Turks and Afghans, not Arabs, while the spread of Islam in Southeast Asia was promoted by South Asian Muslim merchants in their turn. So you have Arab cultural forms in eastern Indonesia thanks to cultural expansions at two removes from the original Arab source (in fact, it could be argued that the Turks and Afghans were Islamicized through a Persian intermediate as well).

But it is the PCA plots which are of more curiosity for me. They note that it is the third component of variation which maps well onto geographic distance. In the paper they say:

This is the PC that is mostly correlated with geography…and the fact that it is the third rather than the first component, as would be expected if isolation by distance was the predominant force shaping genetic diversity…implies that directional population movements (such as the Bantu expansion) and barriers to gene flow (such as that between food producers and hunter gatherers) are more relevant than geographic distance to understand the genetic landscape of sub-Saharan Africa….

There were folk migrations in Africa. They might simply not have been the ones we are aware of, at least in our sparest conceptions. Those folk migrations were very recent, within the last ~2,000 years or so. Which is why the distinctive correlations between language and genes persist, especially on the outer edge of the wave of advance in southern Africa (in contrast, the Pygmies of the Congo have lost their native language, and the western Pygmies are highly admixed with their neighbors).

Citation: A genomic analysis identifies a novel component in the genetic structure of sub-Saharan African populations

Addendum: The life of Shaka may give us a clue as to the disturbances which pushed the Bantu ever outward.

