The Unz Review - Mobile
A Collection of Interesting, Important, and Controversial Perspectives Largely Excluded from the American Mainstream Media
Gene Expression Blog
How Much Informative "Structure" Is in the HGDP Data Set?

A few weeks ago people were arguing about the utility of the model-based clustering packages which produce intuitive bar plots breaking down individual and population ancestry percentages. To understand the fundamental basis of these packages I’ll refer you to the original Pritchard et al. paper. As you probably know at this point, one of the major parameters of the packages is the K value, which refers to the number of populations assumed to be the constituents of the genetic variation. A key point is that those who use the packages are forcing the variation to fit a particular model. You can take the data for Icelanders, to pick an example, and set K = 100. It will produce results, but I suspect you’ll intuit that this really isn’t the best model in terms of fitting reality. Similarly, you can take a population of Northern Europeans, West Africans, and East Asians, and set K = 2. This will likely separate the Eurasians from the Africans, as that’s the natural phylogenetic affinity. But K = 3 is probably a better fit to the data. By this, I mean that Northern Europeans and East Asians are not, and have not been for a long time, random mating populations. K = 3 reflects this reality.

So far this is intuitive. Is there a formal way to check this? Yes, a variety. Structure outputs log likelihoods for each K. Admixture gives you cross-validation errors. For a full treatment of how Admixture estimates cross-validation error see Alexander et al. An intuitive way to interpret these values is that they give you a sense of where you are trying to squeeze too many K’s out of the data set. Admixture’s cross-validation value has a simple interpretation: look for the lowest point on the graph.
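
As a rough sketch of what "look for the lowest point" means in practice: when Admixture is run with cross-validation it prints a summary line of the form `CV error (K=3): 0.52341` for each run. The snippet below parses such lines and picks the K with the lowest error. The function names and the numbers in `example_log` are made up for illustration; they are not the HGDP results.

```python
import re

# Matches the summary line ADMIXTURE prints when run with --cv,
# e.g. "CV error (K=3): 0.52341".
CV_LINE = re.compile(r"CV error \(K=(\d+)\): ([0-9.]+)")


def parse_cv_errors(log_text):
    """Return {K: cv_error} for every CV summary line found in log_text."""
    return {int(k): float(err) for k, err in CV_LINE.findall(log_text)}


def best_k(cv_errors):
    """Return the K with the lowest cross-validation error."""
    return min(cv_errors, key=cv_errors.get)


# Made-up numbers, for illustration only (not the HGDP results).
example_log = """\
CV error (K=2): 0.5600
CV error (K=3): 0.5450
CV error (K=4): 0.5475
"""
errors = parse_cv_errors(example_log)
print(best_k(errors))
```

With real output you would concatenate the log files from each run and pass the combined text to `parse_cv_errors`.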

Going back to the HGDP data set, I wanted to know where that point on the scale of K’s was. Looking over the populations I assumed more than 5, but likely fewer than 20. That wide range tells you that I don’t honestly have a good intuition (some distinct populations are going to be hard to separate in pooled data sets because there hasn’t been much time since divergence, or they are not really genetically separate populations).


The first thing I did was prep the HGDP data a bit in terms of quality with Plink. I filtered to SNPs with minor allele frequencies greater than 0.05, to get variants which might be informative on the interpopulation scale. Then I removed SNPs which were missing in more than 1% of individuals. I also LD pruned the SNPs (basically thinning the markers to get rid of variants which weren’t adding more information because they were correlated with nearby SNPs). Finally, I removed individuals who were very closely related to others in the data set. This resulted in a data set of 1,024 individuals and 116,840 SNPs.
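
For readers who want to try something similar, here is a minimal sketch of how the allele frequency, missingness and LD pruning steps might be expressed as Plink invocations. It only builds the argument lists and does not run anything. The file prefixes (`hgdp`, `hgdp_qc`) are hypothetical, and the LD pruning window, step and r² threshold are assumptions, since the post does not state which values were used. The relatedness filter is not shown.

```python
def plink_qc_commands(prefix="hgdp", out="hgdp_qc",
                      maf=0.05, geno=0.01,
                      window=50, step=10, r2=0.1):
    """Build (but do not run) Plink argument lists for the QC steps.

    prefix/out are hypothetical file prefixes; window, step and r2 are
    assumed LD pruning settings, not values taken from the post.
    """
    filtered = [
        "plink", "--bfile", prefix,
        "--maf", str(maf),    # drop SNPs with minor allele frequency below the threshold
        "--geno", str(geno),  # drop SNPs missing in more than this fraction of individuals
        "--make-bed", "--out", out,
    ]
    # Writes <out>.prune.in, the list of SNPs that survive LD pruning.
    find_pruned = [
        "plink", "--bfile", out,
        "--indep-pairwise", str(window), str(step), str(r2),
        "--out", out,
    ]
    # Keeps only the SNPs listed in <out>.prune.in.
    apply_pruned = [
        "plink", "--bfile", out,
        "--extract", out + ".prune.in",
        "--make-bed", "--out", out + "_pruned",
    ]
    return [filtered, find_pruned, apply_pruned]


for cmd in plink_qc_commands():
    print(" ".join(cmd))
```

Keeping the thresholds as parameters makes it easy to see how the final SNP count changes when, for example, the allele frequency filter is relaxed.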

Then I ran Admixture 20 times with default five-fold cross-validation from K = 2 to K = 20. Here’s the result in a scatterplot:
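
To give a sense of the scale of that job, the following sketch enumerates one Admixture command per replicate and per K. The input file name is hypothetical, and giving each replicate its own random seed via `--seed` is an assumption: the post does not say how the 20 replicates were set up.

```python
def admixture_commands(bed="hgdp_qc_pruned.bed", k_min=2, k_max=20, replicates=20):
    """Return one ADMIXTURE argument list per (replicate, K) pair.

    The bed file name is hypothetical; the per-replicate seed is an assumption.
    """
    commands = []
    for rep in range(1, replicates + 1):
        for k in range(k_min, k_max + 1):
            commands.append([
                "admixture",
                "--cv",           # cross-validation; five folds is the default
                f"--seed={rep}",  # assumed: a different seed for each replicate
                bed, str(k),
            ])
    return commands


commands = admixture_commands()
print(len(commands))  # 19 values of K x 20 replicates = 380 runs
```

Admixture names its output files after the input prefix and K, so in a setup like this each replicate would probably need its own working directory to avoid overwriting earlier results.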

[Figure: cverrorbig, scatterplot of Admixture cross-validation error for K = 2 to K = 20]

You can’t see some of the points because the variation in error was so small at the lower K’s. It is clear that a small number of K’s does not accurately capture the variation in the HGDP data set. To put it differently, there aren’t four distinct randomly mating populations in the HGDP data set (K = 4).

Here’s a zoom in.

[Figure: cverrorZoom, zoomed view of the cross-validation error scatterplot]

These results make it clear there’s a ‘valley’ across the interval K = 11 to K = 16, with the lowest mean cross-validation error at K = 16. Not only does K = 16 have the lowest cross-validation error, but setting aside the K’s below 4 it has the lowest variation in cross-validation error as well. This does not mean that there are 16 natural populations which best define the world’s genetic variation. For why this is not so I’ll point you to Daniel Falush’s post “What did we learn from Rosenberg et al. 2002, actually?”, which highlights some other major dependencies of Structure-like model-based clustering.
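
The comparison of mean error and run-to-run variation across K can be sketched in a few lines. This assumes the cross-validation errors have already been collected into a dictionary keyed by K; the function names and the numbers in `toy_runs` are invented for illustration and are not the values behind the plots.

```python
from statistics import mean, stdev


def summarize_cv(runs):
    """Map {K: [cv_error per replicate]} to {K: (mean, sample std dev)}."""
    return {k: (mean(errs), stdev(errs)) for k, errs in runs.items()}


def lowest_mean_k(summary):
    """Return the K whose replicates have the lowest mean CV error."""
    return min(summary, key=lambda k: summary[k][0])


# Invented numbers, three replicates per K, for illustration only.
toy_runs = {
    4: [0.560, 0.570, 0.550],
    16: [0.530, 0.531, 0.529],
    20: [0.540, 0.545, 0.535],
}
summary = summarize_cv(toy_runs)
print(lowest_mean_k(summary))
```

Reporting the spread alongside the mean matters here because neighbouring K’s in a ‘valley’ can differ by less than the variation between replicates.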

But a complementary point is that the number of K’s within the data is not arbitrary and subjective. And that’s because human genetic variation exhibits geographic structure consistently across many forms of visualization and inference. A second, more tendentious point I would also like to add is that the new generation of population structure inference methodologies is pointing to the likelihood that human genetic variation did not emerge through isolation by distance dynamics across clinal gradients.

Addendum: I’m merging my 20 runs, starting with K = 16. But that’s going to take time. I’m also running K = 2 to K = 20 with a different data set, which expands beyond the HGDP, with 20 replicates.

 
• Category: Science • Tags: Genomics, HGDP, Race, Structure 
  1. anonymous says:

    “Similarly, you can take a population of Northern Europeans, West Africans, and East Asians, and set K = 2. This will likely separate the Eurasians from the Africans, as that’s the natural phylogenetic affinity. But K = 3 is probably a better fit to the data. By this, I mean that Northern Europeans and East Asians are not, and have not been for a long time, random mating populations. K = 3 reflects this reality.”

    So what you’re saying, Razib, is that all values of K make sense, but some values of K make “MORE SENSE,” to put it crudely.

    This makes intuitive sense because each “population”/“race” has substructure, and presumably this substructure would go all the way down to the “identical twin” level.

    Razib, if one set the value of K high enough and used enough loci, would different families form clusters, assuming that members of the same family are present in the data set?

  2. Razib, if one set the value of K high enough and used enough loci, would different families form clusters, assuming that members of the same family are present in the data set?

    families form clusters immediately. this is why i removed related individuals in the sample above.

  3. Razib, if one set the value of K high enough and used enough loci, would different families form clusters, assuming that members of the same family are present in the data set?

    or none make sense, but some make less sense.

  4. […] Agustín Fuentes: “Things to Know When Talking About Race and Genetics”  (A slightly more sophisticated version of Lewontin’s Fallacy, which is unsurprising given that Marxists Richard Lewontin and Stephen Jay Gould are Fuentes’ heroes (by his own account) and he seems to share their Marxist outlook. Parody of Fuentes: “Yes, I’m always confusing Northeast Asians and Sub-Sahara Africans.  I can barely tell them apart!” Sorry, Agustín, but race is real.  Fuentes writes more. Gregory Cochran posts on Lewontin’s Fallacy. Ron Unz responds here. Gregory Cochran responds here. Nicholas Wade responds. The Black Avenger responds. Major error uncovered in Fuentes’ hit pieces; see Steve Bloomberg’s response. B Weinberg on dishonest tactics of Raff and Fuentes. Steve Sailer responds. Chuck responds with “The idiocy of race denialism“. Razib Khan responds about Structure and the biological reality of race.) […]

  5. anonymous says:

    “or none make sense, but some make less sense.”

    What would be an example of a value of K yielding clusters that don’t make sense?

  6. Hello Razib

    First sorry for my poor english

    You said: «human genetic variation did not emerge through isolation by distance dynamics across clinal gradients.»

    I agree that isolation by distance is probably not the only mechanism, but i think it’s still an important part of the origin of genetic differentiation between human populations. But I can be wrong.

    By the way, according to you, which mechanisms played a major role in genetic differences between populations?

    I read some of your previous articles (always very interesting). I remember one of your articles about the genetic continuum between Europeans and East Asians, where you said that until a relatively recent period there was probably very little gene flow between these populations, and that the present-day genetic continuum is probably due to recent admixture (during historical times).

    This is also what you mean here?

    PS: Sorry if my question sounds stupid and if I don’t understand your present article correctly.

  7. What would be an example of a value of K yielding clusters that don’t make sense?

    K = 100 for iceland. iceland is NOT 100 separate random mating populations.

    I agree that isolation by distance is probably not the only mechanism, but i think it’s still an important part of the origin of genetic differentiation between human populations.

    By the way, according to you, which mechanisms played a major role in genetic differences between populations?

    isolation by distance is probably dominant on within-continental scales. also, it’s a good null hypothesis normally. but i think we might have to update due to new results.

    This is also what you mean here?

    yes, to a large extent.

    mountains, deserts, seas on the large scale. inter-group conflict and cultural discontinuity (e.g., language, etc.) on the smaller scale.

  8. Razib,

    Some thoughts to echo Falush’s comments about the STRUCTURE algorithm:

    It’s worth emphasizing again the extent to which the particular number of clusters best supported by this kind of analysis is heavily a function of the amount of data you have. I don’t have a good intuition for what the scaling between the number of individuals and the best supported K should be, but as I’m sure you know, if you doubled the number of individuals (and/or increased the number of markers included), even if all of those individuals came only from the same 52 ‘populations’ already in the HGDP dataset, the cross validation analysis would support a larger K because each individual would have more close relatives in the dataset, and thus it would be possible to predict individual genotypes more accurately using more finely subdivided slivers of the data (i.e. the clusters).

    So I agree that, subject to the constraints of the structure model, and given the data, the number of clusters well supported is not arbitrary, but given unlimited resources one could induce support for whatever number of clusters one wanted as “optimal” by sampling the correct number of individuals from the correct places.

    I know you know all of this, but I thought it worth spelling out in the comments.

    Jeremy

  9. jeremy, to follow up, i’d say the problem is that people confuse tools as means with tools as ends. IOW, clustering algorithms need complementary information from other sources to help make sense of them. that’s somewhat easy with human data. far less so when used for molecular ecological projects with understudied organisms….

  10. (and/or increased the number of markers included)

    this is checkable. LD pruning is what removed most of the markers, so i might just rerun with more markers. of course results would take weeks with cross-validation.

  11. Yes, I suspect running the analysis with all of the markers (especially the low frequency ones you removed*) would give you support for more clusters (although yeah, running the cross validation would be hell). As made clear by Engelhardt and Stephens 2010 (a paper I really enjoy, and which in conjunction with McVean’s genealogical interpretation of PCA paper has really made this stuff clear to me), the STRUCTURE algorithm is just an alternate way of decomposing the individual by individual kinship matrix (as opposed to, say, PCA), and so the more data you throw at it the more factors (clusters) it will be able to find support for, just like with PCA. For example, one may very well be able to find support for K = 100 as “optimal” within Iceland if you had full genomes for every person on the island (obviously the cross validation analysis would require some serious computing power). Those clusters would be “real” in the sense that they would represent real variation in the relationship structure of the population, but how to interpret them beyond that is unclear.

    ____________________________
    * And actually, it seems like using only SNPs above a frequency of 1/20 would make it pretty difficult to infer more than 20 clusters.

  12. since iceland isn’t totally panmictic that makes intuitive sense. though i think the key then is that often you filter for ‘related’ individuals in these sorts of analyses, so it seems that that would put a ceiling on how many clusters you’d get.

    and yes, i removed low frequency variants to capture ‘between population variation,’ so the die was loaded.

    i think i’ll rerun with only removing SNPs which are *missing* and see what the result is.

  13. Right, but even then you’d still have to make a choice about at what point two individuals were “unrelated” or not, and there’d be no escaping the fact that that choice would influence the number of clusters you’d have power to infer. Obviously this example has taken us pretty far past the typical STRUCTURE use cases, but I find that figuring out what a particular method will do when you ask it to do something it wasn’t really built for is often illuminating.

    Having thought this over a little bit now, it seems like a succinct way of summarizing things is to say that the cross validation results essentially tell you the largest K you’re allowed to look at in order to make inferences (because above that value you’re fitting noise, rather than signal), but the particular value of K identified is jointly a function of both the patterns of structure in the data and the sheer amount of data you’ve gathered, and thus isn’t very biologically meaningful.

  14. Hans,

    Claims of anything but trivial genetic contacts between Europe and East Asia during the medieval period are bogus.

    They’re based on admixture dates arrived at with rolloff, which always seem to be way too recent, and also on a total lack of understanding of the fine scale paternal ancestry of Eastern Europeans and East Central Asians/Siberians.

    For instance, one of the more absurd claims that is often made, even in scientific literature, is that the high frequencies of R1a in Eastern Europe might be in part the result of Turkic population movements into the region during the medieval period. But this ignores the fact that Turkic-specific lineages of R1a, which fall under R1a-Z93, are missing from Eastern Europe. For instance, refer to the maps here…

    http://www.nature.com/ejhg/journal/vaop/ncurrent/fig_tab/ejhg201450f3.html#figure-title

    By the way, I just left some questions at Daniel Falush’s blog about this topic. Let’s see if he replies…

    http://paintmychromosomes.blogspot.com.au/2014/02/globetrotter.html

