The Unz Review: An Alternative Media Selection
A Collection of Interesting, Important, and Controversial Perspectives Largely Excluded from the American Mainstream Media
Razib Khan
Gene Expression Blog



When I wrote the Pleistocene was humanity’s Hyborian age, I meant humanity. For contingent reasons the new genetic sciences of ancient DNA have elucidated the history of northwest Eurasia first. But prior to the Great Divergence Europe was not quite so exceptional. In fact the historian Victor Lieberman wrote Strange Parallels, his macrohistory of Eurasia, to highlight just how similar the trajectories of Western Europe and mainland Southeast Asia were up until the early modern era, when the West distanced itself from the rest. In short, European prehistory updates our priors for the prehistory of us all.

For various reasons having to do with professional responsibilities I look at TreeMix plots quite often. Like PCA, TreeMix is great for exploratory data analysis. You throw a bunch of populations in there, and it searches for a set of parameters which can fit the model. But often the results are weird.

They’re not weird because they’re “wrong.” They’re weird because we’re forcing data to give us answers, and the model pops out something which is reasonable with the conditions imposed on it. And often we just don’t have the big picture. Statistical inference was indicating strange connections between Native Americans and Europeans for the past decade…but it took ancient DNA from Siberia to resolve the mystery. Europeans and Amerindians exhibit ancestry from a shared common population. In Europe this ancestry is relatively recent, on the order of the past ~4 to 5 thousand years. Statistical genetic inference can tell us our model is missing something, but it cannot always specify clearly just exactly what we’re missing.

The image above from a TreeMix plot is hard to make out; click it. But what it will show you are two things which are strange:

1) Gene flow from a point between the East African node (mostly HapMap Masai, for what it’s worth) and Mbuti (HGDP) to the Papuans (HGDP).

2) Gene flow from near the East African node to the point which defines the whole set of East Eurasian, Amerindian, and Oceanian nodes.

I would laugh this off, but I see it all the time in TreeMix. I know I’m not the only one. I have no explanation for it. It’s obviously not recent admixture. Rather, there are affinities between populations which we just don’t have a good model for. Knowing what we know about ancient Europe, it is most likely that these seemingly inexplicable gene flow edges reflect prehistoric events which make sense only in the context of population patterns which have been totally obscured over the last 10,000 years. Ancient DNA from China will probably shed a great deal of light on these topics. I predict that the Chinese will exhibit the same discontinuity with their Paleolithic ancestors that modern Europeans do, and the affinities between East Eurasians and some Africans in these TreeMix plots are probably a shadow of a “ghost population” which has been absorbed in Eurasia, and may have contributed to some of the ancestry of a group which migrated back to Africa.

Notes: I set TreeMix to check for covariance across blocks of 1000 SNPs. I had 215,000 total markers in the data set (very high quality ones). I rooted it with Mbuti, set 5 migration edges, and ran it 10 times. They all looked the same. Most of the populations are pooled from public sources.
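For anyone who wants to try something similar, here is a minimal sketch of how those settings might map onto TreeMix command lines. It only builds the command strings and does not run anything. The input file name, the output prefix, and the population label "Mbuti" are hypothetical placeholders, not the files actually used here; the label has to match whatever appears in the header of your own input file. The flags (-i, -o, -root, -k, -m) are the standard TreeMix options for input, output prefix, root population, SNP block size, and number of migration edges.

```python
import shlex


def treemix_commands(input_path, out_prefix, root="Mbuti",
                     block_size=1000, migrations=5, runs=10):
    """Build one TreeMix command line per replicate run.

    input_path and out_prefix are hypothetical names for illustration;
    root must match a population label in the input file's header.
    """
    commands = []
    for run in range(1, runs + 1):
        args = [
            "treemix",
            "-i", input_path,            # gzipped allele-count input
            "-root", root,               # population used to root the tree
            "-k", str(block_size),       # SNPs per block
            "-m", str(migrations),       # number of migration edges
            "-o", f"{out_prefix}.{run}",  # separate output stem per run
        ]
        commands.append(shlex.join(args))
    return commands


if __name__ == "__main__":
    # Hypothetical file names, printed so they can be inspected or pasted.
    for cmd in treemix_commands("pooled.treemix.gz", "pooled_out"):
        print(cmd)
```

Writing each replicate to its own output stem makes it easy to line the ten trees up side by side and check whether they really do all look the same.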

• Category: Science • Tags: Genomics, TreeMix 

If you are going to use ADMIXTURE, you really need to read the original paper, Fast model-based estimation of ancestry in unrelated individuals (it’s not gated, so there’s no excuse). Though the original Jonathan Pritchard paper from 2000, Inference of Population Structure Using Multilocus Genotype Data, is probably sufficient. Unfortunately there’s a problem with these model-based barplots: people have a real hard time not reifying them excessively. Let’s call it “Plato’s revenge.” But really Plato only elaborated what’s obviously a pretty standard-issue cognitive tic: we like to think in absolute categories. This is most of the problem with “does race exist” discussions; you always need to move past the idea of Platonic constructions, which are by necessity social.

That’s what’s nice about PCA. It’s a visual representation of the underlying variation in the data, and the clusters are not pre-specified. Unfortunately fixing the parameter as K = 5 magically means for most people that there are actually 5 real populations. And “most people” even includes a lot of geneticists.

So I’m going to do an experiment, and move away from model-based clustering, and use TreeMix to explore data. Of course it behooves us to read the original paper, Inference of Population Splits and Mixtures from Genome-Wide Allele Frequency Data. I have done so, and three elements jump out at me:

– “This Gaussian model was first introduced by Cavalli-Sforza and Edwards[1], and the motivation for this model is outlined in Nicholson et al.[33], if the amount of genetic drift between the two populations is small (at most on a timescale of the same order as the effective population size), then the diffusion approximation to a Wright-Fisher model…”

– “We do not model the boundaries of the allele frequencies at zero and one, nor do we consider new mutations. This means that this model will be most accurate for alleles that were at intermediate frequency in the ancestral population.”

– “The contribution of each parental population is weighted; if we assume admixture occurs in a single generation….”

What I took from the above: First, beware of highly drifted populations (they will probably generate ‘long branches’). Second, it is probably best to apply a minor allele frequency filter so that you keep the markers at intermediate frequencies (the common threshold of 0.05 would probably suffice). Finally, a lot of admixture isn’t a single event. So that might introduce some distortions in the tree (or at least representations which mislead naive humans).
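To make the second point concrete, here is a minimal sketch of a minor allele frequency filter. It assumes a toy representation that is not from the original analysis: each SNP is a list of genotypes coded as 0, 1, or 2 copies of one allele, with None for a missing call. The function names and SNP IDs are made up for illustration; in a real pipeline a dedicated tool would normally do this step.

```python
def minor_allele_frequency(genotypes):
    """Return the minor allele frequency for one SNP.

    genotypes: list of 0/1/2 allele counts per individual, None if missing.
    (Toy encoding assumed for this sketch.)
    """
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0
    # Each diploid individual carries two alleles.
    freq = sum(called) / (2 * len(called))
    # The minor allele is whichever of the two is rarer.
    return min(freq, 1.0 - freq)


def filter_by_maf(snps, threshold=0.05):
    """Keep only SNPs whose minor allele frequency is at least threshold."""
    return {
        snp_id: genotypes
        for snp_id, genotypes in snps.items()
        if minor_allele_frequency(genotypes) >= threshold
    }


if __name__ == "__main__":
    # Hypothetical SNP IDs and genotypes.
    toy = {
        "snp_common": [0, 1, 1, 2, 0, 1, 2, 1, 0, 1],  # MAF 0.45, kept
        "snp_rare": [0] * 19 + [1],                    # MAF 0.025, dropped
        "snp_fixed": [2] * 10,                         # MAF 0.0, dropped
    }
    print(sorted(filter_by_maf(toy)))
```

The fixed and nearly fixed markers are exactly the ones sitting at the zero and one boundaries that the model does not handle, which is why dropping them is a cheap precaution.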

This first post is something of a trial. I’m not looking to answer any questions, just exploring. I have a data set (unfortunately some of the data is not public, so I won’t be posting the Dropbox link this time) which is mostly skewed toward Northern Europeans. Using PCA I removed individuals who were outliers and generated some reasonable clusters around the centroids of particular nations (i.e., the national clusters are those with individuals whose ancestors were all from a given nation, to the best of their knowledge).

The clusters are:

– E_Africa (HapMap Masai, with some outlier removal)
– England (I selected individuals who were distant from the Irish, without being German)
– Finland
– Germany (I selected individuals which were basically North German; the Netherlands to Saxony)
– Ireland
– Italy (these individuals are Southern Italian; from Rome down to Naples, but excluding Sicily)
– Mbuti_Pygmies (HGDP)
– Mozabite (HGDP, some outlier removal)
– N_Amerindian (HGDP, Pima and Maya; some outlier removal)
– S_Amerindian (HGDP, Surui and Karitiana; some outlier removal)
– NE_Asia (HapMap and private data for Japanese and Koreans)
– N_India (HapMap and 1000 Genomes Gujarati and Punjabis)
– N_WestAsia (Armenians and Turks)
– Papuan (HGDP)
– Poland (Removed all Jews from this data)
– Scotland (tried to remove individuals too close to Irish and English; this was not easy)
– SE_Asia (1000 Genomes Dai and Vietnamese)
– S_India (1000 Genomes Tamil and Telugu)
– Spain (1000 Genomes and private data)
– Sweden
– S_WestAsia (private data, Saudis and Kuwaitis, pruned from those with recent African ancestry)
– W_Africa (1000 Genomes Yoruba and Esan)
– Yakut (HGDP)

The merged data set has 290,000 SNPs. Its missingness is 0.25% (0.0025). But there are over 5,000 individuals in the data set, and that could hide some major biases in the distribution of missingness (e.g., the small number of HGDP Papuans could account for a lot of it). So I decided to remove all SNPs that had any missingness in the data. That leaves us with ~40,000 markers. That means all 40,000 of these markers are present as calls in all 5,000 individuals in the data. For PCA 40,000 is actually pretty good, so here are the first 6….
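Here is a minimal sketch of that filtering step, under the same toy assumption as before (a hypothetical dictionary of SNP ID to a list of genotype calls, with None standing in for a missing call). It is not the pipeline actually used; it just shows how a low overall missingness rate can coexist with missing calls at particular SNPs, and what "drop any SNP with a missing call" does.

```python
def overall_missingness(snps):
    """Fraction of all genotype calls in the data set that are missing."""
    total = sum(len(genotypes) for genotypes in snps.values())
    if total == 0:
        return 0.0
    missing = sum(g is None for genotypes in snps.values() for g in genotypes)
    return missing / total


def drop_snps_with_missing_calls(snps):
    """Keep only SNPs that have a call for every individual."""
    return {
        snp_id: genotypes
        for snp_id, genotypes in snps.items()
        if all(g is not None for g in genotypes)
    }


if __name__ == "__main__":
    # Hypothetical SNP IDs; four individuals per SNP.
    toy = {
        "snp_a": [0, 1, 2, 1],
        "snp_b": [0, None, 2, 1],  # one missing call, so this SNP goes
        "snp_c": [2, 2, 1, 0],
    }
    complete = drop_snps_with_missing_calls(toy)
    print(overall_missingness(toy), sorted(complete))
```

The trade-off is the one described above: the strict filter throws away most of the markers, but whatever remains cannot be biased by missing calls concentrated in a few small populations.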

[Figures: PCA plots, PC1/PC2, PC3/PC4, PC5/PC6]

The magnitudes of the dimensions are: 245, 142, 34, 27, 16, and 12. The first two, which form the “wing” shape which we’re all familiar with, represent Africa vs. everyone else, and then western vs. eastern Eurasia. These are not set in stone. Remember, what PCA does is yank out independent dimensions which explain the variation in the data. If you overload the data with a particular type of variation then it could change the rank order. Or, if you throw in a very inbred group then their component is going to be very salient. These methods depend on you not being dumb about how to interpret the data that you yourself are putting in there. Unfortunately, it’s easy to be dumb when you don’t have much foreknowledge about the data…that’s why you are doing the analysis!

Since you can’t read the PCA plots, you should click on them. They’ll pop out into something more readable. PC 3 separates out Eurasia north to south. This is a much smaller dimension than west to east. I think that fits intuition. The fourth PC separates the Amerindian groups. Really it’s a Surui vs. non-Surui axis. I really like PC 5 and PC 6, because they show the different European clusters more visibly. The issue is that there’s very little genetic variation in Europe when judged on a world-wide scale. But the lower components are starting to capture it. I’m not going to lie, ggplot’s default color scheme is hella confusing. I’ll tell you that the two populations away from the rest are Indians, with North Indians closer to Europeans than South Indians. And way up to the top right are Papuans. One way I like to think of these sorts of patterns post hoc is that the Indians are pointing to a “ghost population.” They’re not Papuans, but they have some distant affinity to Papuans….

Next I decided to run TreeMix. First with the full 290,000 SNP data set. Then with the 40,000 which have 0% missingness. I ran each of them 10 times and saved the output. I set them for 5 migrations. I’ll leave them without comment, except this: the problem I have with TreeMix is that I’m reassured when I see a migration edge that I’m expecting, but don’t know what to make of those which are surprising. The reason is that the algorithm can’t lie, but it can only work with the data and assumptions that go into it. When Joe Pickrell first came out with his TreeMix results there was a weird arrow going from the Amerindians to the Europeans. No one really knew what to make of this, though it wasn’t entirely surprising (something like this shows up in ADMIXTURE plots as well, and I saw it in Noah Rosenberg’s microsatellite STRUCTURE work as far back as 2005). After the fact we now can make sense of it. TreeMix was showing us the impact of the “Ancestral North Eurasians” as best it could. The Amerindians of the New World have the highest proportion of this ancestry, and the people of Northern Europe some of the highest fractions in the Old World. So it drew a migration edge from the former to the latter. When you put the Mal’ta (or Yamnaya) data into TreeMix that “spurious edge” disappears….


[Figures: TreeMix output for the full 290,000 SNP data set, runs 1 to 10 (FinalPool300KOut.1 to FinalPool300KOut.10)]
Here’s the 40,000 marker TreeMix output










• Category: Science • Tags: TreeMix 

Citation: Decker, Jared E., et al. "Worldwide Patterns of Ancestry, Divergence, and Admixture in Domesticated Cattle." arXiv preprint arXiv:1309.5118 (2013).

Citation: Decker JE, McKay SD, Rolf MM, Kim J, Molina Alcalá A, et al. (2014) Worldwide Patterns of Ancestry, Divergence, and Admixture in Domesticated Cattle. PLoS Genet 10(3): e1004254. doi:10.1371/journal.pgen.1004254


I am a man of a particular age, old enough to remember when the idea of thousands of what were then quaintly termed ‘molecular markers’ would have left one aghast at the surfeit of data. Today the term “post-genomic” strikes me as almost as anachronistic as the “information superhighway.” This is not the post-genomic era, it just is, the wildest dreams that were, are. But the glorious present of data abundance is not without its limitations and pitfalls. As a friend explained once, bioinformaticians just “do stuff,” sometimes without understanding why they do stuff. Somewhere along the way the bio part seems to have been forgotten in the hurry to assemble the next organism as the machine demands more and more for its hungry maw. But the mechanical monster slurping through the fire hose of data with a hacked together chimera of a regular expression isn’t without some purpose. Many biologists with an interest in evolution have a dream of dense markers painting vast swaths of the tree of life, an empire of phylogenetic information to be conquered.

But these vistas need some context, a horizon of information about the organism. This came to mind when I read Jared Decker’s new paper on the phylogenetics of domestic cattle, Worldwide Patterns of Ancestry, Divergence, and Admixture in Domesticated Cattle. In many ways it is a straightforward paper. You can see discussions on the earlier iterations over at Haldane’s Sieve (the preprint process seems to have worked to make it a more robust and clear publication from what I can tell!). Decker applies some straightforward methods (at least straightforward in 2014), in particular TreeMix, Admixture, and PCA, to a very large SNP marker data set with expansive geographic coverage. With ~40,000 SNPs these packages should blast through the data rather quickly (I’ve used all of them with this marker density, and with sample sizes close to Decker’s).

You can read the whole paper yourself since it is open access. To me it seems to reiterate that cattle truly are cattle, to be pulled and prodded and traded at the whim of human beings. The fact that many East African cattle have predominantly Indian heritage (one of the two major clades) illustrates the fact that domestic animals exhibit the protean tendencies of human culture, rather than biological organisms which are governed by standard geographical and morphological diversification through conventional population genetic pressures. But I have to still admit that much of the narrative force of this paper escapes me because I lack understanding of the cattle at a level beyond the plainly statistical genetic. In other words, the organism matters. Cattle geneticists who may “hum through” the plots may still be able to grasp the force of argument with a greater clarity because their understanding of the topic is fundamentally thicker than that of outsiders. Many of the paper’s inferences from genetic data clearly draw their plausibility from elements of natural history which bovine biologists would take for granted.

And this is just the beginning. Over the next decade it seems inevitable that the clusters at the heart of “genomics cores” across the world will be gorging on whole sequences of thousands of individuals for many organisms. It will be a “flood the zone” era for attempting to understand the tree of life. An army of bioinformaticists will be thrown at the data in human waves, absorbing shock after shock, slowly transforming the ad hoc kludge pipelines of the pre-Model T era of genomics into simpler turnkey solutions. And then the biology will come back to the fore, and the deep wellspring of knowledge held by those who focus on specific organisms is going to be the essence of the enterprise once more.

• Category: Science • Tags: Admixture, Genomics, PCA, TreeMix 

To understand nature in all its complexity we have to cut the riotous variety down to size. For ease of comprehension we formalize with math, verbalize with analogies, and visualize with representations. These approximations of reality are not reality, but when we look through the glass darkly they give us filaments of essential insight. Dalton’s model of the atom is false in important details (e.g., atoms turn out to be divisible into smaller particles), but it still has conceptual utility.

Likewise, the phylogenetic trees popularized by L. L. Cavalli-Sforza in The History and Geography of Human Genes are still useful in understanding the shape of the human demographic past. But it seems that the bifurcating model of the tree must now be strongly tinted by the shades of reticulation. In a stylized sense inter-specific phylogenies, which assume the approximate truth of the biological species concept (i.e., little gene flow across lineages), mislead us when we think of the phylogeny of species on the microevolutionary scale of population genetics. On an intra-specific scale gene flow is not just a nuisance parameter in the model, it is an essential phenomenon which must be accommodated into the framework.

This is on my mind because of the emergence of packages such as TreeMix and AdmixTools. Using software such as these on the numerous public data sets allows one to perceive the reality of admixture, and overlay lateral gene flow upon the tree as a natural expectation. But perhaps a deeper result is that the character of the tree itself is torn asunder. The figure above is from a new paper, Efficient moment-based inference of admixture parameters and sources of gene flow, which debuts MixMapper. The authors bring a lot of mathematical heft to their exposition, and I can’t say I follow all of it (though some of the details are very similar to Pickrell et al.’s). But in short it seems that in comparison to TreeMix, MixMapper allows for more powerful inference on a narrower set of populations, selected for exploring very specific questions. In contrast, TreeMix explores the whole landscape with minimal supervision. Having used the latter I can testify that that is true.

The big result from MixMapper is that it extends the result of Patterson et al., and confirms that modern Europeans seem to be an admixture between a “north Eurasian” population, and a vague “west Eurasian” population. Importantly, they find evidence of admixture in Sardinians, which implies that Patterson et al.’s original analyses were not sensitive to admixture in putative reference populations (note that Patterson is a coauthor on this paper as well). The rub, as noted in the paper, is that it is difficult to estimate admixture when you don’t have “pure” ancestral reference populations. And yet here the takeaway for me is that we may need to rethink our whole conception of pure ancestral populations, and imagine a human phylogenetic tree as a series of lattices in eternal flux, with admixed nodes periodically expanding so as to generate the artifice of a diversifying tree. The closer we look, the more likely it seems that most of the populations which have undergone demographic expansion in the past 10,000 years are also the products of admixture. Any story of the past 10,000 years, and likely the past 100,000 years, must give space at the center of the narrative arc to lateral gene flow across populations.

Cite: arXiv:1212.2555 [q-bio.PE]
Razib Khan
About Razib Khan

"I have degrees in biology and biochemistry, a passion for genetics, history, and philosophy, and shrimp is my favorite food. If you want to know more, see the links at"