The Unz Review - Mobile
A Collection of Interesting, Important, and Controversial Perspectives Largely Excluded from the American Mainstream Media
Razib Khan
Gene Expression Blog
Data Visualization


I have noted a few times that one thing you have to be careful about in two-dimensional plots which show genetic variance is that the dimensions onto which the data are projected are often generated from the data itself. So adding more data can change the spatial relationships of previous data points. Additionally, in 23andMe’s global similarity advanced plot you are projected onto the dimensions generated from the HGDP data set. There are some practical reasons for this. First, it’s computationally intensive to recalculate components of variance every time someone is added to the data set. Second, it isn’t as if the ethnic identity of any given individual is validated. What would you do if an alien sent in a kit and spuriously put “French” as their ancestry?
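To make the projection idea concrete, here is a minimal sketch in Python. It assumes a plain principal component analysis done with NumPy on simulated genotypes; the function names (`fit_axes`, `project`), the panel sizes, and the data are all made up for illustration, and none of this is 23andMe's actual pipeline. The axes are computed once from a fixed reference panel, and a newcomer is then placed on those axes without recomputing anything.

```python
import numpy as np

def fit_axes(reference, n_components=2):
    """Compute the mean and top principal axes of a reference panel.

    reference: array of shape (individuals, markers).
    """
    mean = reference.mean(axis=0)
    # Rows of vt are the principal axes of the centered reference data.
    _, _, vt = np.linalg.svd(reference - mean, full_matrices=False)
    return mean, vt[:n_components]

def project(samples, mean, components):
    """Place samples on axes that were fitted elsewhere (no refitting)."""
    return (samples - mean) @ components.T

rng = np.random.default_rng(0)

# Simulated data (an assumption): 40 reference individuals, 200 markers,
# genotypes coded 0/1/2 and drawn from per-marker allele frequencies.
freqs = rng.uniform(0.1, 0.9, size=200)
reference = rng.binomial(2, freqs, size=(40, 200)).astype(float)
newcomer = rng.binomial(2, freqs, size=(1, 200)).astype(float)

mean, components = fit_axes(reference)
ref_coords = project(reference, mean, components)  # reference positions
new_coords = project(newcomer, mean, components)   # newcomer on the same axes
```

Because the newcomer never enters `fit_axes`, the positions of the reference individuals stay exactly where they were, and adding a customer costs one matrix product rather than a fresh decomposition.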

So, in reply to this comment: “Let me rephrase: is there any difference when you switch to the world-wide plot? I imagine not, or you would’ve mentioned it.” Actually, there is a slight difference. Below on the right you have a “world view,” with my position being marked with green, and on the left a “zoom in” for Central/South Asia in the HGDP data set.

Because of the “busyness” of the plot it is hard to see the difference. But when I wasn’t “sharing” genes with people this is what you saw:

1) There is a definite gap between a Central Asian Hazara/Uyghur cluster and a South Asian one which consists of the Pakistani groups.

2) In the Central/South Asia zoom I’m in the gap between the two clusters, about 1/3 of the way toward the Central Asian cluster away from the South Asian cluster (the next closest individual shifted in that direction who isn’t a family member is Bangladeshi).

3) In contrast, in the world view I’m on the edge of the Central Asian cluster, toward the South Asian one, but definitely separated by a clean gap from it.

You can see some generalized differences between the two plots. The Central/South Asia view has a major linear cluster, with the Kalash a distinctive outgroup. In the world view this is not so; rather, you have a group of Pakistanis with non-trivial African admixture shifted in that direction (mostly Makrani, but one of the Sindhis in the HGDP data set seems to be a brownlatto!). Since there isn’t much African variance in the South Asian zoom aside from what the admixed individuals bring to the table, naturally it doesn’t shake out as one of the two top dimensions. So what’s going on with me? I don’t have a good hypothesis, but I suspect that my likely Southeast Asian ancestry shifted me further toward the Asian cluster in the world view. There are some groups very closely related to the Burmese in the HGDP (e.g., Naxi) which are in the world view and, naturally, not in the Central/South Asia zoom. When you break ancestry into “European” and “Asian” components, the Hazara/Uyghur cluster is an OK substitute (both are hybrids, with “European” and “Asian” ancestry in about equal proportions), but this is only a first approximation. These two groups have more “northern” Asian ancestry, while mine is more “southern.” Because of their inclusion in the Central/South Asia cluster, the west-east dimension in Eurasia is constructed from more northern East Asian populations, which might underestimate my East Asian element.

There’s actually a much better example than me, though, among the people I’m sharing genes with. This individual is an ethnic Persian. Note that in the world view they seem to be on the margins of the European cluster, verging toward the Central/South Asia group. But when you do the Central/South Asia zoom view, they’re in that cluster! Note the very different positions. Their “neighbor” in the zoom view is totally different from their neighbor in the world view:

My argument for why I’m more “Asian” in the world view is that the world view has Asian groups to which I am closer, which are excluded in my zoom view. A much more extreme case seems to be happening with this Persian individual, whose family is from northern Iran and has an oral history of Russian ancestry on one of his lineages.
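The effect described above can be reproduced with a deliberately tiny toy example. The sketch below is not real genetic data: the three columns are hypothetical directions of variation, the panels and the query individual are invented, and plain PCA via NumPy stands in for whatever 23andMe actually computes. The "zoom" panel has no variation along the third direction, so axes built from it cannot see that part of the query individual; the "world" panel adds a group that varies along it.

```python
import numpy as np

def fit_axes(reference, n_components=2):
    """Mean and top principal axes of a reference panel."""
    mean = reference.mean(axis=0)
    _, _, vt = np.linalg.svd(reference - mean, full_matrices=False)
    return mean, vt[:n_components]

def project(samples, mean, components):
    """Place samples on previously fitted axes."""
    return (samples - mean) @ components.T

def nearest(query_coords, panel_coords):
    """Index of the panel member closest to the query in the 2-D plot."""
    return int(np.argmin(np.linalg.norm(panel_coords - query_coords, axis=1)))

# Hypothetical "zoom" panel: variation along the first two directions only.
zoom_panel = np.array([
    [0.0,  0.0, 0.0],
    [1.0,  0.2, 0.0],
    [2.0, -0.2, 0.0],
    [3.0,  0.1, 0.0],
    [4.0, -0.1, 0.0],
])
# Hypothetical extra group that only the "world" panel includes.
extra_group = np.array([
    [2.0,  0.0, 5.0],
    [2.1,  0.1, 5.2],
    [1.9, -0.1, 4.8],
])
world_panel = np.vstack([zoom_panel, extra_group])

# A query individual with a component along the third direction.
query = np.array([[2.0, 0.0, 3.0]])

zoom_mean, zoom_axes = fit_axes(zoom_panel)
world_mean, world_axes = fit_axes(world_panel)

zoom_neighbor = nearest(project(query, zoom_mean, zoom_axes),
                        project(zoom_panel, zoom_mean, zoom_axes))
world_neighbor = nearest(project(query, world_mean, world_axes),
                         project(world_panel, world_mean, world_axes))
```

In the zoom view the query lands in the middle of the linear cluster, because its third-direction component is simply discarded. In the world view its closest neighbor is a member of the extra group. The individual is the same in both cases; only the reference panel changed.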

This is the sort of reason why I assume any reader who points to a paper and a plot and asserts that “this proves X” is somewhat cognitively challenged. The patterns in PCA aren’t necessarily arbitrary. But they do need to be interpreted with care. One set of results isn’t dispositive of any given position in a debate, at least until you get to the ridiculous boundary conditions (in some ways, I think of a lot of genetic data visualization like I think of regression. It’s how people use/interpret it that is problematic, not the method itself).

Finally, doesn’t it seem ridiculous to you that South Asians are being projected onto a plot where the dimensions are generated from liminal populations? Imagine, if you will, that Europeans were projected onto a plot generated from the variance of Finnic and Slavic groups only. That’s a good analogy. The Pakistani groups in the HGDP data set are not good representatives of South Asian genetic variation, because they’re shifted to the margins of the distribution. That’s one reason that the Harappa Ancestry Project is so needful (and why if you just got your v3 results and are Iranian, Tibetan, Burmese, or South Asian, you should send them in. And v2 folks as well!).

(Republished from Discover/GNXP by permission of author or representative)

Chemistry likes to think of itself as the “central science.” Is that true? Intuitively it makes sense. But how can we measure that more rigorously? In comes the Stanford Dissertation Browser:

The Stanford Dissertation Browser is an experimental interface for document collections that enables richer interaction than search. Stanford’s PhD dissertation abstracts from 1993-2008 are presented through the lens of a text model that distills high-level similarity and word usage patterns in the data. You’ll see each Stanford department as a circle, colored by school and sized by the number of PhD students graduating from that department.

When you click a department, it becomes the focus of the browser and every other department moves to show its relative similarity to the centered department. The similarity scores are computed using a supervised mixture model based on Labeled LDA: every dissertation is taken as a weighted mixture of a unigram language model associated with every Stanford department. This lets us infer, that, say, dissertation X is 60% computer science, 20% physics, and so on. These scores are averaged within a department to compute department-level statistics (the similarities shown), and need not be symmetric. For instance, Economics dissertations at Stanford use more words from Political Science than vice versa. Essentially, the visualization shows word overlap between departments measured by letting the dissertations in one department borrow words from another department. Which departments borrow the most words from which others? The statistics are computed for each year in the data.
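The averaging step in that description, and why the resulting similarities need not be symmetric, can be sketched in a few lines of Python. This does not run Labeled LDA: the mixture weights below are invented numbers standing in for the model's output, and the function name `department_similarity` is hypothetical.

```python
import numpy as np

departments = ["Economics", "Political Science", "Physics"]

# Made-up per-dissertation mixture weights (each row sums to 1).
# Columns follow the order of `departments`.
weights = np.array([
    [0.70, 0.25, 0.05],  # an Economics dissertation
    [0.60, 0.30, 0.10],  # an Economics dissertation
    [0.10, 0.85, 0.05],  # a Political Science dissertation
    [0.15, 0.80, 0.05],  # a Political Science dissertation
    [0.05, 0.05, 0.90],  # a Physics dissertation
])
# Index of the department each dissertation was actually filed in.
home = np.array([0, 0, 1, 1, 2])

def department_similarity(weights, home, n_departments):
    """Row d holds the average mixture weights of dissertations filed in d."""
    sim = np.zeros((n_departments, n_departments))
    for d in range(n_departments):
        sim[d] = weights[home == d].mean(axis=0)
    return sim

sim = department_similarity(weights, home, len(departments))
econ_from_polisci = sim[0, 1]  # how much Economics borrows from Political Science
polisci_from_econ = sim[1, 0]  # how much Political Science borrows from Economics
```

With these invented weights, Economics borrows more from Political Science than the reverse, because each row is averaged over a different set of dissertations. That is all the asymmetry amounts to.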

You can play around with the browser here. I’m assuming at some point in the near future this sort of analysis is going to get much, much easier, because of the sea of data which powerful software can extract and visualize patterns out of. Below the fold are five screenshots I thought were of interest: Genetics, biology, and chemistry dissertations in 2008, and Anthropology in 2007 and 1998.

[Gallery: five screenshots from the Stanford Dissertation Browser]

(Republished from Discover/GNXP by permission of author or representative)
• Category: Science • Tags: Data Visualization 
Razib Khan
About Razib Khan

"I have degrees in biology and biochemistry, a passion for genetics, history, and philosophy, and shrimp is my favorite food. If you want to know more, see the links at"