The Unz Review: An Alternative Media Selection
A Collection of Interesting, Important, and Controversial Perspectives Largely Excluded from the American Mainstream Media
Razib Khan
Gene Expression Blog


A friend recently emailed to ask about the best way to pick a proper “K” value when inferring population structure. K is simply the parameter which defines how many putative ancestral populations you have in your model to explain some data on genetic variation. Obviously some values of K are more informative than others about population history.

For example, if you had 100 Swedes and 100 Yoruba Nigerians, to model the population structure you could select K = 2 or K = 50. The algorithm would produce results in the latter case, but you “know” a priori that K = 2 is a very good model of the population history in a straightforward, interpretable sense. There’s just not much more juice for many clustering methods to squeeze out of this sort of data.

But it’s harder when you have population structure in organisms about which we don’t know much aside from the genetic data. How does one “objectively” select a K? The most common method is outlined in a 2005 paper, Detecting the number of clusters of individuals using the software structure: a simulation study:

The identification of genetically homogeneous groups of individuals is a long standing issue in population genetics. A recent Bayesian algorithm implemented in the software structure allows the identification of such groups. However, the ability of this algorithm to detect the true number of clusters (K) in a sample of individuals when patterns of dispersal among populations are not homogeneous has not been tested. The goal of this study is to carry out such tests, using various dispersal scenarios from data generated with an individual-based model. We found that in most cases the estimated ‘log probability of data’ does not provide a correct estimation of the number of clusters, K. However, using an ad hoc statistic ΔK based on the rate of change in the log probability of data between successive K values, we found that structure accurately detects the uppermost hierarchical level of structure for the scenarios we tested. As might be expected, the results are sensitive to the type of genetic marker used (AFLP vs. microsatellite), the number of loci scored, the number of populations sampled, and the number of individuals typed in each sample.

There’s an old saying, “garbage in, garbage out.” The method of ΔK is useful as far as it goes, but as inputs it takes the log likelihoods from the Structure program. For Admixture you can look at cross-validation. But these statistics are subject to various assumptions and approximations (in addition, some of the priors within the clustering algorithms are gross simplifications).
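The ΔK statistic itself is simple enough to sketch. Here is a minimal version, assuming you have collected the log likelihoods from replicate Structure runs at each K; the replicate values below are invented for illustration:

```python
import numpy as np

def delta_k(log_likelihoods):
    """Evanno-style ad hoc Delta-K from replicate Structure runs.

    log_likelihoods maps each K to a list of log likelihoods from
    replicate runs at that K. Delta-K is only defined for interior
    K values, since it needs both K-1 and K+1.
    """
    ks = sorted(log_likelihoods)
    means = {k: np.mean(log_likelihoods[k]) for k in ks}
    result = {}
    for k_prev, k, k_next in zip(ks, ks[1:], ks[2:]):
        # |L''(K)|: rate of change of the rate of change of mean L(K)
        second_diff = abs(means[k_next] - 2 * means[k] + means[k_prev])
        # normalized by the spread across replicates at this K
        spread = np.std(log_likelihoods[k], ddof=1)
        result[k] = second_diff / spread
    return result

# Invented log likelihoods: the gain per extra K collapses after K = 3
runs = {1: [-6000, -6001], 2: [-4900, -4901],
        3: [-3900, -3901], 4: [-3800, -3801], 5: [-3790, -3791]}
dk = delta_k(runs)
print(max(dk, key=dk.get))  # -> 3
```

Following the 2005 paper, the K with the largest ΔK is read as the uppermost hierarchical level of structure; note that ΔK says nothing about the smallest and largest K tested.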

This is one reason I was excited about Estimating the Number of Subpopulations (K) in Structured Populations:

A key quantity in the analysis of structured populations is the parameter K, which describes the number of subpopulations that make up the total population. Inference of K ideally proceeds via the model evidence, which is equivalent to the likelihood of the model. However, the evidence in favor of a particular value of K cannot usually be computed exactly, and instead programs such as Structure make use of heuristic estimators to approximate this quantity. We show—using simulated data sets small enough that the true evidence can be computed exactly—that these heuristics often fail to estimate the true evidence and that this can lead to incorrect conclusions about K. Our proposed solution is to use thermodynamic integration (TI) to estimate the model evidence. After outlining the TI methodology we demonstrate the effectiveness of this approach, using a range of simulated data sets. We find that TI can be used to obtain estimates of the model evidence that are more accurate and precise than those based on heuristics. Furthermore, estimates of K based on these values are found to be more reliable than those based on a suite of model comparison statistics. Finally, we test our solution in a reanalysis of a white-footed mouse data set. The TI methodology is implemented for models both with and without admixture in the software MavericK1.0.

The website for MavericK 1.0 is informative if you don’t have academic access.

Unfortunately, and probably not surprisingly, this method is not scalable to genomic data sets. E.g., they’re looking at 10, 20, or 50 loci. A “modest” human genotyping array will provide you with tens of thousands of loci (SNPs). A “standard” array will provide you with on the order of 500,000 SNPs.

But the conclusion of the paper is worth keeping in mind:

Finally, it is important to keep in mind that when thinking about population structure, we should not place too much emphasis on any single value of K. The simple models used by programs such as Structure and MavericK are highly idealized cartoons of real life, and so we cannot expect the results of model-based inference to be a perfect reflection of true population structure (see discussion in Waples and Gaggiotti 2006). Thus, while TI can help ensure that our results are statistically valid conditional on a particular evolutionary model, it can do nothing to ensure that the evolutionary model is appropriate for the data. Similarly—in spite of the results in Table 2—we do not advocate using the model evidence (estimated by TI or any other method) as a way of choosing the single “best” value of K. The chief advantage of the evidence in this context is that it can be used to obtain the complete posterior distribution of K, which is far more informative than any single point estimate. For example, by averaging over the distribution of K, weighted by the evidence, we can obtain estimates of parameters of biological interest (such as the admixture parameter a) without conditioning on a single population structure. Although one value of K may be most likely a posteriori, in general a range of values will be plausible, and we should entertain all of these possibilities when drawing conclusions.


• Category: Science • Tags: K, Structure 

Recently Daniel Falush’s group came out with a preprint, A tutorial on how (not) to over-interpret STRUCTURE/ADMIXTURE bar plots. If you read the science posts on this weblog (basically, if you read this weblog), and you haven’t read it, read it now.

At his weblog, Paint My Chromosomes, Falush has talked about both the production of the preprint (I had a minor stimulatory role), and the attempt to get it published somewhere. This reaction is strange to me:

We also had our first journal rejection, from eLife. It has not been my habit to live-tweet journal rejections and am not intending to start now. I am a journal editor myself and do not think the process would benefit from being turned into a public performance. I was disappointed because eLife claims to hold itself to higher standards, trying to change publication by judging papers on their true worth rather than on simple measures of impact and also because the reason given was silly:

“..but feel that the target audience is a rather specialised one.”

Of course I’m biased. But this strikes me as crazy. The third most cited paper in the history of the journal Genetics is Jonathan Pritchard’s Inference of Population Structure Using Multilocus Genotype Data. Take a look at the list, and note the papers that it is more cited than (e.g., a Sewall Wright paper from 1931, and Tajima’s 1989 paper!).

To be sure, the number of times that a paper is cited is not a good measure of how often it is read and understood. And that’s kind of the point of Falush’s preprint, to actually give some guidance to people who use model based clustering in a turnkey fashion without any deep comprehension of its limitations and biases. The nuts & bolts of the inferences of population structure may be specialized, but analysis of structure is a routine part of many different types of papers, in particular in medical genetics where variants may have different effects in different genetic backgrounds.

• Category: Science • Tags: Structure 

A few weeks ago people were arguing about the utility of the model based clustering packages which produce intuitive bar plots that break down individual and population percentages. To understand the fundamental basis of these packages I’ll refer you to the original Pritchard et al. paper. As you probably know at this point one of the major parameters of the packages is the K value, which refers to the number of populations which are going to be assumed as the constituents of the genetic variation. A key point is that those who use the packages are forcing the variation to fit a particular model. You can take the data for Icelanders, to pick an example, and set K = 100. It will produce results, but I suspect you’ll intuit that this really isn’t the best model in terms of fitting reality. Similarly, you can take a population of Northern Europeans, West Africans, and East Asians, and set K = 2. This will likely separate the Eurasians from the Africans, as that’s the natural phylogenetic affinity. But K = 3 is probably a better fit to the data. By this, I mean that Northern Europeans and East Asians are not, and have not been for a long time, random mating populations. K = 3 reflects this reality.

So far this is intuitive. Is there a formal way to check this? Yes, a variety. Structure outputs log likelihoods for each K. Admixture gives you cross-validation errors. For a full treatment of how Admixture estimates cross-validation error see Alexander et al. An intuitive way to think about these values is that they give you a sense of when you are trying to squeeze too many K’s out of the data set. Admixture’s cross-validation value has a simple interpretation: look for the lowest point on the graph.
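In practice that just means scraping the cross-validation lines out of Admixture’s log output and taking the minimum. A minimal sketch; the log text and error values below are invented for illustration:

```python
import re

# Admixture run with --cv prints lines of the form
# "CV error (K=3): 0.54012"; collect them and pick the lowest.
log_text = """
CV error (K=2): 0.55118
CV error (K=3): 0.54012
CV error (K=4): 0.53391
CV error (K=5): 0.53405
"""

cv = {int(k): float(v)
      for k, v in re.findall(r"CV error \(K=(\d+)\): ([\d.]+)", log_text)}
best_k = min(cv, key=cv.get)
print(best_k)  # -> 4
```

With replicate runs per K you would average the errors for each K first, then take the minimum of the means.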

Going back to the HGDP data set, I wanted to know where that point on the scale of K’s was. Looking over the populations I assumed more than 5, but likely less than 20. That wide range tells you that I don’t honestly have a good intuition (some distinct populations are going to be hard to separate in pooled data sets because there hasn’t been much time since divergence, or they are not really genetically separate populations).

The first thing I did was prep the HGDP data a bit in terms of quality with Plink. I filtered to SNPs with minor allele frequencies greater than 0.05, to get variants which might be informative on the interpopulation scale. Then I removed SNPs which were missing in more than 1% of individuals. Finally, I also LD pruned the SNPs (basically thinning the markers so that I got rid of variants which weren’t adding much information because they were near other SNPs). Additionally, I removed individuals who were very closely related to others in the data set. This resulted in a data set of 1,024 individuals and 116,840 SNPs.

Then I ran Admixture 20 times with default five-fold cross-validation from K = 2 to K = 20. Here’s the result in a scatterplot:


You can’t see some of the points because the variation in error was so small at the lower K’s. It is clear that a few K’s do not accurately capture the variation in the HGDP data set. To put it differently, there aren’t four distinct randomly mating populations in the HGDP data set (K = 4).

Here’s a zoom in.


These results make it clear there’s a ‘valley’ across the interval K = 11 to K = 16, with the lowest mean cross-validation error at K = 16. Not only does K = 16 have the lowest cross-validation error, but aside from the K’s below 4 it has the lowest variation in cross-validation error as well. This does not mean that there are 16 natural populations which best define the world’s genetic variation. For why this is not so I’ll point you to Daniel Falush’s post What did we learn from Rosenberg et al. 2002, actually?, which highlights some other major dependencies of Structure-like model based clustering.

But, a complementary point is that the number of K’s within the data is not arbitrary and subjective. And that’s because human genetic variation exhibits geographic structure consistently across many forms of visualization and inference. A second, more tendentious point I would also like to add is that the new generation of population structure inference methodologies are pointing to the likelihood that human genetic variation did not emerge through isolation by distance dynamics across clinal gradients.

Addendum: I’m merging my 20 runs, starting with K = 16. But that’s going to take time. I’m also running K = 2 to K = 20 with a different data set, which expands beyond the HGDP, with 20 replicates.

• Category: Science • Tags: Genomics, HGDP, Race, Structure 

Pritchard, Jonathan K., Matthew Stephens, and Peter Donnelly. “Inference of population structure using multilocus genotype data.” Genetics 155.2 (2000): 945-959.

Before there was Structure there was just structure. By this, I mean that population substructure has always been with us. The question is how we as humans shall characterize and visualize it in a manner which imparts some measure of wisdom and enlightenment. A simple fashion in which we can assess population substructure is to visualize the genetic distances across individuals or populations on a two dimensional plot. Another way which is quite popular is to represent the distance on a neighbor joining tree, as on the left. As you can see this is not always satisfying: dense trees with too many tips are often almost impossible to interpret beyond the most trivial inferences (though there is an aesthetic beauty in their feathery topology!). And where graphical representations such as neighbor-joining trees and MDS plots remove too much relevant information, cluttered FST matrices have the opposite problem. All the distance data is there in its glorious specific detail, but there’s very little Gestalt comprehension.

Rosenberg, Noah A., et al. “Genetic structure of human populations.” Science 298.5602 (2002): 2381-2385.

Into this confusing world stepped the Structure bar plot. When I say “Structure bar plot,” in 2013 I really mean the host of model-based clustering packages. Because it is faster I prefer Admixture. But Admixture is really just a twist on the basic rules of the game which Structure set. What you see to the right is one of the beautiful bar plots which have made their appearance regularly on this blog over the past half a decade or more. I’ve repeated what they do, and don’t mean, ad nauseam, though it doesn’t hurt to repeat oneself. What you see is how individuals from a range of human populations shake out at K = 6. More verbosely, assume that your pool of individuals can be thought of as an admixture, in various proportions, of six ancestral populations. Each line is an individual, and the proportional shading of each line in a specific color represents the contribution of a particular ancestral population (for K = 6, populations 1 through 6).

This is when I should remind you that this does not mean that these individuals are actually combinations of six ancestral populations. When you think about it, that is common sense. Just because someone generates a bar plot with a given K, that does not mean that that bar plot makes any sense. I could set K = 666, for example. The results would be totally without value (evil even!), but, they would be results, because if you put garbage in, the algorithm will produce something (garbage). This is why I say that population structure is concrete and ineffable. We know that it is the outcome of real history which we can grasp intuitively. But how we generate a map of that structure for our visual delectation and quantitative precision is far more dicey and slippery.

To truly understand what’s going on it might be useful to review the original paper which presented Structure, Inference of Population Structure Using Multilocus Genotype Data. Though there are follow-ups, the guts of the package are laid out in this initial publication. Basically you have some data, multilocus genotypes. Since Structure debuted in 2000, this was before the era of hundreds-of-thousands-loci SNP-chip data. Today the term multilocus sounds almost quaint. In 2000 the classical autosomal era was fading out, but people did still use RFLPs and what not. It is a testament to the robustness of the framework of Structure that it transitioned smoothly to the era of massive data sets. Roughly, the three major ingredients of Structure are the empirical genotype data, formal assumptions about population dynamics, and powerful computational techniques to map between the first two elements. In the language of the paper you have X, the genotypes of the individuals, Z, the populations, and P, the allele frequencies of the populations. They’re multi-dimensional vectors. That’s not as important here as the fact that you only observe X. The real grunt work of Structure is generating a vector, Q, which defines the contributions to each individual from the set of ancestral populations. This is done via an MCMC, which explores the space of probabilities, given the data and the priors which are baked into the cake of the package. Though some people seem to treat the details of the MCMC as a black box, actually having some intuition about how it works is often useful when you want to shift from default settings (there are indeed people who run Structure who are not clear about what the burn-in is exactly). What’s going on ultimately is that in structured populations the genotypes are not in Hardy-Weinberg equilibrium. Structure is attempting to find a solution which will result in populations in HWE.
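To make the HWE logic concrete, here is a toy sketch; this is not Structure itself, which integrates over Z and P with an MCMC, and the populations, allele frequencies, and genotypes below are all invented. Under Hardy-Weinberg equilibrium the probability of a diploid genotype (0, 1, or 2 copies of the reference allele) at a locus with allele frequency p is binomial(2, p), so given known allele frequencies an individual can be assigned to the population that maximizes the likelihood:

```python
import numpy as np

def genotype_loglik(genotypes, freqs):
    """Log likelihood of a diploid individual's genotypes under HWE.

    genotypes: per-locus counts (0, 1, or 2) of the reference allele.
    freqs: reference allele frequency at each locus in one population.
    Per locus: P(g | p) = C(2, g) * p**g * (1 - p)**(2 - g).
    """
    g = np.asarray(genotypes, dtype=float)
    p = np.asarray(freqs, dtype=float)
    comb = np.where(g == 1, 2.0, 1.0)  # C(2, g) is 2 only for heterozygotes
    return float(np.sum(np.log(comb) + g * np.log(p) + (2 - g) * np.log(1 - p)))

# Invented allele frequencies at four loci in two hypothetical populations
P = {"pop1": [0.9, 0.8, 0.1, 0.2],
     "pop2": [0.1, 0.2, 0.9, 0.8]}

individual = [2, 2, 0, 0]  # genotypes that match pop1's frequencies
assigned = max(P, key=lambda z: genotype_loglik(individual, P[z]))
print(assigned)  # -> pop1
```

Structure's no-admixture model is essentially this calculation with Z and P unknown; the admixture model further splits each allele copy's ancestry according to Q.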

This brings us to the question of how we make sense of the results and which K to select. If you run Structure you are probably iterating over many K values, and repeating the iteration multiple times. You will likely have to merge the outputs across replicates, because individual runs will vary. But in any case, each iteration generates a likelihood (which derives from the probability of the data given the K value). The most intuitive way to “pick” an appropriate K is to simply wait until the likelihood begins to plateau. This means that the algorithm can’t squeeze more informative juice out of the data going up the K values.* This may seem dry and tedious, but it brings home exactly why you should not view any given K as natural or real in a deep sense. The selection of a K has less to do with reality, and more to do with instrumentality. If, for example, your aim is to detect African ancestry in a worldwide population pool, then a low K will suffice, even if a higher K gives a better model fit (higher K values often take longer in the MCMC). In contrast, if you want to discern much finer population clusters then it is prudent to go up to the most informative K, no matter how long that might take.
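A crude way to operationalize “wait until the likelihood begins to plateau” is to pick the smallest K beyond which the gain in mean log likelihood falls below some tolerance. A minimal sketch; the likelihood values and the tolerance are invented for illustration:

```python
# Mean log likelihoods across replicate runs at each K (invented):
# big gains up to K = 4, then the curve flattens out.
mean_loglik = {2: -52000, 3: -48000, 4: -46500, 5: -46400, 6: -46390}

def plateau_k(loglik, tol=500):
    """Return the K at which the next K's gain drops below tol."""
    ks = sorted(loglik)
    for prev, k in zip(ks, ks[1:]):
        if loglik[k] - loglik[prev] < tol:
            return prev
    return ks[-1]

print(plateau_k(mean_loglik))  # -> 4
```

The choice of tolerance is exactly the kind of instrumental judgment call described above; there is no single “right” cutoff, which is why methods like ΔK and cross-validation exist.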

Today model-based clustering packages like Structure, frappe, and Admixture are part of the background furniture of the population genetic toolkit. There are now newer methods on the block. A package like TreeMix uses allele frequencies to transform the stale phylogram into a more informative set of graphs. Other frameworks do not rely on independent information locus after locus, but assimilate patterns across loci, generating ancestry tracts within individual genomes. Though some historical information can be inferred from Structure, it is often an ad hoc process which resembles reading tea leaves. Linkage disequilibrium methods have the advantage in that they explicitly explore historical processes in the genome. But with all that said, the Structure bar plot revolution of the aughts wrought a massive change, and what was once wondrous has become banal.

* The ad hoc Delta K statistic is very popular too. It combines the rate of change of the likelihoods and the variation across replicate runs.
