The Unz Review: An Alternative Media Selection
A Collection of Interesting, Important, and Controversial Perspectives Largely Excluded from the American Mainstream Media
Razib Khan
Gene Expression Blog


Citation: Decker, Jared E., et al. "Worldwide Patterns of Ancestry, Divergence, and Admixture in Domesticated Cattle." arXiv preprint arXiv:1309.5118 (2013).

Citation: Decker JE, McKay SD, Rolf MM, Kim J, Molina Alcalá A, et al. (2014) Worldwide Patterns of Ancestry, Divergence, and Admixture in Domesticated Cattle. PLoS Genet 10(3): e1004254. doi:10.1371/journal.pgen.1004254


I am a man of a particular age, old enough to remember when the idea of thousands of what were then quaintly termed ‘molecular markers’ would have left one aghast at the surfeit of data. Today the term “post-genomic” strikes me as almost as anachronistic as the “information superhighway.” This is not the post-genomic era, it just is, the wildest dreams that were, are. But the glorious present of data abundance is not without its limitations and pitfalls. As a friend explained once, bioinformaticians just “do stuff,” sometimes without understanding why they do stuff. Somewhere along the way the bio part seems to have been forgotten in the hurry to assemble the next organism as the machine demands more and more for its hungry maw. But the mechanical monster slurping through the fire hose of data with a hacked together chimera of a regular expression isn’t without some purpose. Many biologists with an interest in evolution have a dream of dense markers painting vast swaths of the tree of life, an empire of phylogenetic information to be conquered.

But these vistas need some context, a horizon of information about the organism. This came to mind when I read Jared Decker’s new paper on the phylogenetics of domestic cattle, Worldwide Patterns of Ancestry, Divergence, and Admixture in Domesticated Cattle. In many ways it is a straightforward paper. You can see discussions on the earlier iterations over at Haldane’s Sieve (the preprint process seems to have worked to make it a more robust and clear publication from what I can tell!). Decker utilizes some straightforward methods (at least straightforward in 2014) on a very large SNP marker data set with expansive geographic coverage: in particular, TreeMix, Admixture, and PCA. With ~40,000 SNPs these packages should blast through the data rather quickly (I’ve used all of them with this marker density, and with sample sizes comparable to Decker’s).

You can read the whole paper yourself since it is open access. To me it seems to reiterate that cattle truly are cattle, to be pulled and prodded and traded at the whim of human beings. The fact that many East African cattle have predominantly Indian heritage (one of the two major clades) illustrates that domestic animals exhibit the protean tendencies of human culture, rather than behaving like biological organisms governed by standard geographical and morphological diversification through conventional population genetic pressures. But I still have to admit that much of the narrative force of this paper escapes me because I lack understanding of the cattle at a level beyond the plainly statistical genetic. In other words, the organism matters. Cattle geneticists who may “hum through” the plots may still be able to grasp the force of argument with a greater clarity because their understanding of the topic is fundamentally thicker than that of outsiders. Many of the paper’s inferences from genetic data clearly draw their plausibility from elements of natural history which bovine biologists would take for granted.

And this is just the beginning. Over the next decade it seems inevitable that the clusters at the heart of “genomics cores” across the world will be gorging on whole sequences of thousands of individuals for many organisms. It will be a “flood the zone” era for attempting to understand the tree of life. An army of bioinformaticists will be thrown at the data in human waves, absorbing shock after shock, slowly transforming the ad hoc kludge pipelines of the pre-Model T era of genomics into simpler turnkey solutions. And then the biology will come back to the fore, and the deep wellspring of knowledge held by those who focus on specific organisms is going to be the essence of the enterprise once more.

• Category: Science • Tags: Admixture, Genomics, PCA, TreeMix 

To the left is a PCA from The History and Geography of Human Genes. If you click it you will see a two dimensional plot with population labels. How were these plots generated? In short, these are really visual representations of a matrix of genetic distances (those distances generally being FST), which L. L. Cavalli-Sforza and colleagues computed from classical autosomal markers. Basically, the distances measure how much populations differ genetically. The unwieldy matrix tables can be visualized as a neighbor-joining tree, or as a two dimensional plot as you see here. But that’s not the end of the story.
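As a rough sketch of how a distance matrix becomes a two dimensional plot, here is classical multidimensional scaling with base R's `cmdscale`. The population names and distance values are invented for illustration; they are not Cavalli-Sforza's data, and his exact procedure may well have differed.

```r
# Toy pairwise genetic distances between four populations.
# These numbers are made up for illustration only.
pops <- c("PopA", "PopB", "PopC", "PopD")
fst <- matrix(c(0.00, 0.02, 0.10, 0.12,
                0.02, 0.00, 0.09, 0.11,
                0.10, 0.09, 0.00, 0.03,
                0.12, 0.11, 0.03, 0.00),
              nrow = 4, byrow = TRUE, dimnames = list(pops, pops))

# Classical MDS (principal coordinates analysis): embed the
# distance matrix in two dimensions.
coords <- cmdscale(as.dist(fst), k = 2)

# One labelled point per population.
plot(coords, type = "n", xlab = "Dimension 1", ylab = "Dimension 2")
text(coords, labels = rownames(coords))
```

Populations with small pairwise distances (PopA and PopB here) end up close together on the plot, which is all the two dimensional picture is really saying.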

In the past ten years, with high density SNP-chip arrays, these plots can often illustrate the position of an individual instead of just representing the relationships of populations (the methods differ, from principal components analysis or principal coordinates analysis to multi-dimensional scaling, but the outcomes are the same).


For example, the famous genetic map of Europe. Here you see the colors representing nationalities, and centroid positions of the populations as well as individuals. In this manner you can take in population genetic variation in a gestalt fashion. Nevertheless, these still leave something to be desired. They are precise and powerful, but they lack a certain elegance due to their scatter. When you have over a dozen color schemes, and overlapping populations, these are not minor matters. Additionally, the human eye is often not well tuned to note the finer gradients of density difference.

This is clear when you move from a manageable number of populations (e.g., Europeans), to the world. In these cases you have to color in specific regions, else you’d get lost rather quickly. I can illustrate this easily enough. I’ve a data set I’m running right now with ~3,000 individuals and 250,000 SNPs. It’s a merge of HGDP, Behar et al., HapMap, etc. I decided to use PLINK to generate an MDS plot.


Here you see the unadorned scatter. To the top of the plot are Asian populations, and to the right African ones. Europeans are at the vertex to the bottom left. This should be familiar to you, though you may have to rotate it. One way to extract some clarity out of this picture is to color code the regions, and give different symbols to the lowest level category. Yes, this helps, but there are still limitations (and to be frank I often have a hard time making out triangles on these plots). First and foremost, I think we need to be able to ascertain the variation in density of the scatter. A further plot will illustrate this (click to enlarge):

Most of the text is basically illegible. This is where a centroid method would do well; in lieu of a scatter of individuals you just label a population. Or, you could do something like allow points in various colors to represent populations, but put the labels at centroids only. This still runs into the problem that populations are not equidistant, so you can have crowding.
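A minimal sketch of the centroid-label idea in base R. The data are simulated, and the column names `C1`, `C2` and `Group` are assumptions chosen to resemble PLINK-style MDS output with a population label attached.

```r
# Toy MDS output: two coordinates plus a population label per individual.
set.seed(1)
mds <- data.frame(
  C1 = c(rnorm(20, 0, 0.5), rnorm(20, 3, 0.5)),
  C2 = c(rnorm(20, 0, 0.5), rnorm(20, 2, 0.5)),
  Group = rep(c("PopA", "PopB"), each = 20)
)

# One centroid (mean position) per population.
centroids <- aggregate(cbind(C1, C2) ~ Group, data = mds, FUN = mean)

# Individuals as small colored points, labels drawn only at the centroids.
plot(mds$C1, mds$C2, col = as.integer(factor(mds$Group)), pch = 20,
     xlab = "Coordinate 1", ylab = "Coordinate 2")
text(centroids$C1, centroids$C2, labels = centroids$Group, font = 2)
```

With many populations the centroid labels can still collide, which is the crowding problem noted above.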

Recently to address these issues I decided to use a ‘utilization distribution’ method which I saw in one of the ‘genetic map of Europe’ papers. The logic here is simple.

1) First, take the density distribution of the points on the plot by category and ‘smooth’ them. Basically this creates a continuous distribution where there was a discontinuous one.

2) Then demarcate the central ~90% area as the bounds of the population distribution. Color these bounding lines differently.
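The two steps can be sketched without adehabitat, using `kde2d` from the MASS package as a stand-in for the smoothing step. This is an approximation of the idea, not the adehabitat implementation; the helper name `density_contour`, the simulated data and the column names are all hypothetical.

```r
library(MASS)  # provides kde2d, a two dimensional kernel density estimate

# Return the contour line(s) enclosing roughly `prob` of the smoothed density.
density_contour <- function(x, y, prob = 0.90, n = 100) {
  # Pad the grid so the contours can close around the data.
  pad_x <- 0.25 * diff(range(x))
  pad_y <- 0.25 * diff(range(y))
  lims <- c(range(x) + c(-pad_x, pad_x), range(y) + c(-pad_y, pad_y))
  dens <- kde2d(x, y, n = n, lims = lims)   # step 1: smooth the points
  # Step 2: find the density level whose enclosed mass first reaches `prob`.
  z <- sort(as.vector(dens$z), decreasing = TRUE)
  mass <- cumsum(z) / sum(z)
  level <- z[which(mass >= prob)[1]]
  contourLines(dens$x, dens$y, dens$z, levels = level)
}

# Simulated two-population scatter.
set.seed(42)
mds <- data.frame(C1 = c(rnorm(200, 0, 1), rnorm(200, 4, 0.5)),
                  C2 = c(rnorm(200, 0, 1), rnorm(200, 3, 0.5)),
                  Group = rep(c("PopA", "PopB"), each = 200))
groups <- unique(mds$Group)
cols <- rainbow(length(groups))

plot(mds$C1, mds$C2, pch = 20, col = "grey",
     xlab = "Coordinate 1", ylab = "Coordinate 2")
outlines <- list()
for (i in seq_along(groups)) {
  g <- mds[mds$Group == groups[i], ]
  outlines[[groups[i]]] <- density_contour(g$C1, g$C2, prob = 0.90)
  # Color each population's bounding line differently.
  for (piece in outlines[[groups[i]]]) {
    lines(piece$x, piece$y, col = cols[i], lwd = 2)
  }
}
legend("topleft", legend = groups, col = cols, lty = 1, lwd = 2)
```

Because the outline follows density rather than the outermost points, a handful of stray individuals does not stretch the boundary.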

Below you see the results:

Obviously there are some kinks to be worked out. But you see two things. First, some groups are clearly subsets of other groups in their distribution. This is very hard to discern in the other visualization methods above. Second, these plots are taking density into account, so you aren’t distracted by outliers (which may be mislabeling by the analyst or the original collector of the samples).

My ultimate aim is to develop a script which will place the text near the suitable distribution zone, without crowding out other text. I have some ideas of how to do this “on the fly,” but it will take time to implement. Until then some of you may want to know a bit about the packages used for the above.

First, install the adehabitat package for R. Actually, you may want to install various tcl development packages first, because the former won’t install without the latter. Once you have that you need data. I assume you can generate the results from PLINK above. Once you have those you need a table with three columns:

1) x

2) y

3) the identification

Here’s some R that might help:

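library(adehabitat)
# MDSData is the data frame with MDS data (columns C1, C2 and Group)
cexValue = 0.5  # point size; adjust to taste
plot(MDSData$C1, MDSData$C2, cex=cexValue, xlab="Coordinate 1", ylab="Coordinate 2")

# process the data, keep only groups with at least 5 individuals
loc = subset(MDSData, Group %in% names(which(table(Group) >= 5)))
loc$X = loc$C1
loc$Y = loc$C2  # use the same two coordinates as the scatter above
# load ids
id = factor(loc$Group)
# create first parameter, two columns
xy = loc[, c("X", "Y")]
# estimate the utilization distribution for each group
vud = kernelUD(xy, id)

# 90% utilization
kVert = getverticeshr(vud, 90)
# I'm removing one of the populations (that step is not shown here)
groups = names(kVert)
kVertLength = length(kVert)
plot(kVert, add=TRUE, lwd=2, colpol=NA, colborder=rainbow(kVertLength))
legend('topright', groups, cex=.55, lty=1, lwd=3, col=rainbow(kVertLength))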
• Category: Science • Tags: Genetics, Genomics, PCA 

When Zack first mooted the idea of the Harappa Ancestry Project I had no idea what was coming down the pipe. I wonder if his daughter and wife are curious as to what’s happened to their computer! Since collecting the first wave of participants he’s been a result generating machine. Today he produced a fascinating three dimensional PCA (modifying Doug McDonald’s Javascript) using his “Reference 1” data set. He rescaled the dimensions appropriately so that they reflect how much of the genetic variance they explain. The largest principal component of variance is naturally Africa vs. non-Africa, the second is west to east in Eurasia, and the third is a north to south Eurasian axis.

I decided to be a thief and take Zack’s Javascript and resize it a bit to fit the width of my blog, blow up the font size, as well as change the background color and aspects of positioning. All to suit my perverse taste. You see the classic “L” shaped distribution familiar from the two-dimensional plots, but observe the “pucker” in the third dimension of South Asian, and to a lesser extent Southeast Asian, populations.

The topology of the first three independent dimensions of genetic variance of world populations kind of reminds me of a B-2 bomber:


Long time readers know that I have a fixation on people not taking PCA too literally as something concrete. Tonight I finally merged the HGDP data set with some of the HapMap ones I’ve been playing with, and tacked my parents onto the sample. I took the ~50 HGDP populations, added the Tuscans, the two Kenyan groups, and the Gujaratis, and merged them. I thinned the marker set to 105,000 SNPs (I had to flip the HGDP strand too). Then I just let Eigensoft do its magic, and 2 hours on I produced my own plot. I’m still getting the hang of the labeling issues, but first let’s look at what 23andMe produces (I’m green):

Now let’s see what I outputted:

I suspect that the gap between my parents and the main South Asian cluster is just an artifact of the lack of South and East Indians in the sample. Additionally, things would look different if I removed the Africans, since the first principal component would be freed up. More on that later. All in all, still pretty awesome that circa 2011 this sort of thing is just an evening’s concentration.

• Category: Science • Tags: Genetics, Genomics, PCA 

I have noted a few times that one thing you have to be careful about in two dimensional plots which show genetic variance is that the dimensions onto which the data are projected are often generated from the data itself. So adding more data can change the spatial relationships of previous data points. Additionally, in 23andMe’s global similarity advanced plot you are projected onto the dimensions generated from the HGDP data set. There are some practical reasons for this. First, it’s computationally intensive to recalculate components of variance every time someone is added to the data set. Second, it isn’t as if the ethnic identity of any given individual is validated. What would you do if an alien sent in a kit and spuriously put “French” as their ancestry?
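To make the distinction concrete, here is a small sketch in base R that contrasts projecting a new individual onto fixed components with recomputing the components. The genotypes are simulated and the helper `sim` is hypothetical; this is not how 23andMe implements its plot, only an illustration of the general idea.

```r
set.seed(7)
n_markers <- 200
# Two made-up reference populations with different allele frequencies.
freq_a <- runif(n_markers, 0.1, 0.9)
freq_b <- runif(n_markers, 0.1, 0.9)

# n individuals x markers, genotypes coded as 0/1/2 copies of an allele.
sim <- function(n, freq) {
  matrix(rbinom(n * length(freq), 2, rep(freq, each = n)), nrow = n)
}
reference <- rbind(sim(30, freq_a), sim(30, freq_b))

# Components are computed from the reference panel only.
pca <- prcomp(reference, center = TRUE, scale. = FALSE)

# A new individual is projected onto those fixed axes; the reference
# coordinates in pca$x do not move.
newcomer <- sim(1, freq_a)
projected <- predict(pca, newdata = newcomer)[, 1:2]

# Recomputing with the newcomer included yields new axes, so every
# existing point can shift (or flip sign) as well.
pca_all <- prcomp(rbind(reference, newcomer), center = TRUE, scale. = FALSE)
```

Projection is cheap and keeps the reference picture stable, but the newcomer can only be described in terms of whatever variation the reference panel happens to contain.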

So, in reply to this comment: “Let me rephrase: is there any difference when you switch to the world-wide plot? I imagine not, or you would’ve mentioned it.” Actually, there is a slight difference. Below on the right you have a “world view,” with my position being marked with green, and on the left a “zoom in” for Central/South Asia in the HGDP data set.

Because of the “busyness” of the plot it is hard to see the difference. But when I wasn’t “sharing” genes with people this is what you saw:

1) There is a definite gap between a Central Asian Hazara/Uyghur cluster and a South Asian one which consists of the Pakistani groups.

2) In the Central/South Asia zoom I’m in the gap between the two clusters, about 1/3 of the way toward the Central Asian cluster away from the South Asian cluster (the next closest individual shifted in that direction who isn’t a family member is Bangladeshi).

3) In contrast, in the world view I’m on the edge of the Central Asian cluster, toward the South Asian one, but definitely separated by a clean gap from it.

You can see some generalized differences between the two plots. The Central/South Asia view has a major linear cluster, with the Kalash a distinctive outgroup. In the world view this is not so, rather, you have a group of Pakistanis with non-trivial African admixture shifted in that direction (mostly Makrani, but one of the Sindhis in the HGDP data set seems to be a brownlatto!). Since there isn’t much African variance in the South Asian zoom aside from what the admixed individuals bring to the table naturally it doesn’t shake out as one of the two top dimensions. So what’s going on with me? I don’t have a good hypothesis, but I suspect that my likely Southeast Asian ancestry shifted me further toward the Asian cluster in the world view. There are some groups very closely related to the Burmese in the HGDP (e.g., Naxi) which are in the world view, and, naturally not in the Central/South Asia zoom. When you break ancestry into “European” and “Asian” components then the Hazara/Uyghur cluster is an OK substitute (both are hybrids, with “European” and “Asian” ancestry in about equal proportions), but this is actually a first approximation. These two groups have more “northern” Asian ancestry, while mine is more “southern.” Because of their inclusion in the Central/South Asia cluster the west-east dimension in Eurasia is constructed from more northern East Asian populations, which might underestimate my East Asian element.

There’s actually a much better example than me though who I’m sharing genes with. This individual is an ethnic Persian. Note that in the world view they seem to be on the margins of the European cluster, verging toward the Central/South Asia group. But when you do the Central/South Asia zoom view, they’re in that cluster! Note the very different positions. Their “neighbor” in the zoom view is totally different from their neighbor in the world view:

My argument for why I’m more “Asian” in the world view is that the world view has Asian groups to which I am closer, which are excluded in my zoom view. A much more extreme case seems to be happening with this Persian individual, whose family is from northern Iran and has an oral history of Russian ancestry on one of his lineages.

This is the sort of reason why I assume any reader who points to a paper and a plot and asserts that “this proves X” is somewhat cognitively challenged. The patterns in PCA aren’t necessarily arbitrary. But, they do need to be interpreted with care. One set of results isn’t dispositive of any given position in a debate, at least until you get to the ridiculous boundary conditions (in some ways, I think of a lot of genetic data visualization like I think of regression. It’s how people use/interpret it that is problematic, not the method itself).

Finally, doesn’t it seem ridiculous to you that South Asians are being projected onto a plot where the dimensions are generated from liminal populations? Imagine, if you will, that Europeans were projected onto a plot generated from the variance of Finnic and Slavic groups only. That’s a good analogy. The Pakistani groups in the HGDP data set are not good representatives of South Asian genetic variation, because they’re shifted to the margins of the distribution. That’s one reason that the Harappa Ancestry Project is so needful (and why if you just got your v3 results and are Iranian, Tibetan, Burmese, or South Asian, you should send it in. And v2 folks as well!).


Mike the Mad Biologist, whose bailiwick is the domain of the small, asks in the comments:

I don’t mean to bring up a tangential point to the post, but why does the field of human genetics use PCA to visualize relationships? When I see plots like those shown here that have a ‘geometric pattern’ to them (the sharp right angles; another common pattern is a Y-shape), that tells me that there are lots of samples with zeros for many of the Y-variables (i.e., alleles that are unique to certain populations). Thus, the spatial arrangement of the points is largely an artifact of an inappropriate method: how does one calculate a correlation matrix when many of the things one is correlating have values of zero?

If one really was keen on using PCA, one could calculate a pairwise distance matrix and then use that instead of the correlation matrix (Principal Coordinates Analysis).

Since I know some human geneticists do read this weblog, I thought it was worth throwing the question out there.
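For anyone who wants to try the comparison from the comment, here is a small base R sketch on simulated genotypes. It uses plain Euclidean distance, for which principal coordinates analysis gives the same configuration as PCA on the centered data (up to the sign of each axis); a different distance measure, which is presumably what the comment has in mind, would give a different picture. The data and variable names are invented for illustration.

```r
set.seed(3)
# 40 simulated individuals x 100 markers, genotypes coded 0/1/2.
geno <- matrix(rbinom(40 * 100, 2, 0.3), nrow = 40)

# PCA on the centered genotype matrix.
pca_scores <- prcomp(geno, center = TRUE, scale. = FALSE)$x[, 1:2]

# Principal coordinates analysis on pairwise Euclidean distances.
pcoa_scores <- cmdscale(dist(geno), k = 2)

# Largest difference between the two configurations, ignoring axis signs.
max(abs(abs(unname(pca_scores)) - abs(unname(pcoa_scores))))
```

Swapping `dist(geno)` for another distance matrix is the one-line change needed to explore how much the choice of distance matters.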

• Category: Science • Tags: Analysis, Genetics, PCA 
Razib Khan
About Razib Khan

"I have degrees in biology and biochemistry, a passion for genetics, history, and philosophy, and shrimp is my favorite food. If you want to know more, see the links at"