The Unz Review: An Alternative Media Selection
A Collection of Interesting, Important, and Controversial Perspectives Largely Excluded from the American Mainstream Media
Gene Expression Blog
How to Look at Population Structure


A friend asked me about population structure, and methods to ferret it out and classify it. So here is a quick survey of the major methods I’m familiar with and use now and then. I’ll go roughly in chronological order.

First, you have trees. These are pretty popular for macroevolutionary relationships, but on the population genetic scale (intraspecific, microevolutionary) you’re mostly talking about representing distances between groups in a tree format. You saw this in The History and Geography of Human Genes, where genetic distances in the form of Fst values (the proportion of genetic variation that lies between two groups rather than within them) were used as distance inputs.
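To make the "distance input" idea concrete, here is a minimal sketch of pairwise Fst computed from allele frequencies. It assumes biallelic loci, equal-sized populations, and the simple heterozygosity-based definition (H_T - H_S) / H_T summed across loci; the population names and frequencies are made up for illustration, and real estimators also correct for sample size. The resulting matrix is the kind of thing you would hand to a tree-building method such as neighbor-joining.

```python
def heterozygosity(p):
    """Expected heterozygosity 2p(1-p) at a biallelic locus."""
    return 2.0 * p * (1.0 - p)


def fst(freqs_a, freqs_b):
    """Fst = (H_T - H_S) / H_T, with numerator and denominator summed over loci.

    Assumes two equal-sized populations and ignores sampling error.
    """
    h_total = 0.0
    h_within = 0.0
    for pa, pb in zip(freqs_a, freqs_b):
        p_bar = (pa + pb) / 2.0
        h_total += heterozygosity(p_bar)
        h_within += (heterozygosity(pa) + heterozygosity(pb)) / 2.0
    if h_total == 0.0:
        return 0.0  # no variation at all, so no structure to measure
    return (h_total - h_within) / h_total


def distance_matrix(pops):
    """Pairwise Fst for every pair of populations, keyed by (name, name)."""
    names = sorted(pops)
    return {(a, b): fst(pops[a], pops[b]) for a in names for b in names}


# Hypothetical allele frequencies at three loci.
pops = {
    "A": [0.1, 0.5, 0.9],
    "B": [0.2, 0.5, 0.8],
    "C": [0.9, 0.1, 0.2],
}
distances = distance_matrix(pops)
for pair in sorted(distances):
    print(pair, round(distances[pair], 4))
```

In this toy data A and B come out much closer to each other than either is to C, which is exactly the information a tree then tries to draw.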

A problem with trees is that they don’t model gene flow, a major dynamic on a microevolutionary scale. Also, complex relationships can get elided in tree frameworks, and as you add more and more populations you often end up with an incomprehensible fan-like topology.

Then you have principal component analysis (PCA) and related methods (e.g., multidimensional scaling, which is very different in the sausage-making but generates a similar output). Like trees, this is a visualization of the variation, in this case on a two dimensional plot (please don’t bring up three dimensional PCA, there’s no such thing until holograms show up).

The problem with PCA is that different types of dynamics can lead to the same result. For example, someone who is an F1 of two distinct groups occupies the same position as a member of a population which happens to occupy a genetic position between the two groups. Additionally, by constraining the variation into two dimensions, the plot can mislead about relationships. There are many dimensions, but operationally you focus on two at a time.
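The F1 point can be checked with a bit of arithmetic. The sketch below is not a PCA; as a stand-in for the first principal component it uses the axis running from the mean genotype of group A to the mean genotype of group B, which is an assumption for illustration. All frequencies are hypothetical. It compares the expected position of an F1 (one allele from each parent group) with that of someone from a population whose allele frequencies sit halfway between the two.

```python
def expected_dosage_f1(freqs_a, freqs_b):
    """Expected allele count per locus for an F1: one allele from each parent group."""
    return [pa + pb for pa, pb in zip(freqs_a, freqs_b)]


def expected_dosage_population(freqs):
    """Expected allele count per locus for a member of a randomly mating population."""
    return [2.0 * p for p in freqs]


def position_on_axis(dosages, mean_a, mean_b):
    """Project onto the A-to-B axis, scaled so that A sits at 0 and B at 1."""
    axis = [b - a for a, b in zip(mean_a, mean_b)]
    numerator = sum((d - a) * ax for d, a, ax in zip(dosages, mean_a, axis))
    denominator = sum(ax * ax for ax in axis)
    return numerator / denominator


# Hypothetical allele frequencies in the two parent groups.
freqs_a = [0.1, 0.8, 0.3, 0.6]
freqs_b = [0.7, 0.2, 0.9, 0.1]
# A stabilized population sitting halfway along the cline between them.
freqs_mid = [(pa + pb) / 2.0 for pa, pb in zip(freqs_a, freqs_b)]

mean_a = expected_dosage_population(freqs_a)
mean_b = expected_dosage_population(freqs_b)

f1_position = position_on_axis(expected_dosage_f1(freqs_a, freqs_b), mean_a, mean_b)
cline_position = position_on_axis(expected_dosage_population(freqs_mid), mean_a, mean_b)
print(f1_position, cline_position)
```

Both land at the midpoint in expectation. What differs between them is heterozygosity (an F1 is heterozygous more often at loci where the parent groups differ), and that is not something a scatter of two components shows you.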

A paper of interest: Population Structure and Eigenanalysis.

Next you have model-based clustering, introduced in Jonathan Pritchard’s Inference of Population Structure Using Multilocus Genotype Data. There are many flavors of this, but they operate under the same framework. You have a model of population dynamics, and see how the genotype data can be explained by parameters of the model. Of particular interest is the assignment of each individual’s ancestry to one or more of K populations, which can be combined to explain the variation in the data.

Unlike PCA, these model-based methods are rather good at identifying people who are first generation mixes, as opposed to those from stabilized groups along a cline. But they also produce artifacts, because they are quite sensitive to the input data, and they lend themselves to cherry-picking.
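Here is a stripped-down sketch of the "explain the genotypes with model parameters" idea for K = 2. It makes some loud simplifying assumptions: the allele frequencies of the two source populations are taken as known (the real unsupervised methods infer frequencies and ancestry together), loci are independent, and the ancestry fraction q is found by a plain grid search. The function names and numbers are hypothetical.

```python
import math


def genotype_loglik(genotypes, freqs_a, freqs_b, q):
    """Log-likelihood of genotypes (0, 1 or 2 copies of an allele per locus)
    when each allele copy comes from population A with probability q."""
    total = 0.0
    for g, pa, pb in zip(genotypes, freqs_a, freqs_b):
        p = q * pa + (1.0 - q) * pb
        p = min(max(p, 1e-9), 1.0 - 1e-9)  # avoid log(0)
        total += math.log(math.comb(2, g)) + g * math.log(p) + (2 - g) * math.log(1.0 - p)
    return total


def estimate_ancestry(genotypes, freqs_a, freqs_b, steps=100):
    """Grid-search maximum likelihood estimate of the fraction of ancestry from A."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda q: genotype_loglik(genotypes, freqs_a, freqs_b, q))


# Hypothetical, strongly differentiated source populations at 20 loci.
freqs_a = [0.95] * 20
freqs_b = [0.05] * 20

print(estimate_ancestry([2] * 20, freqs_a, freqs_b))  # genotypes typical of A
print(estimate_ancestry([1] * 20, freqs_a, freqs_b))  # heterozygous everywhere
print(estimate_ancestry([0] * 20, freqs_a, freqs_b))  # genotypes typical of B
```

The sensitivity to input data mentioned above shows up even here: change which populations you supply as sources, or their frequencies, and the estimated fractions move with them.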

Earlier I said that one problem with the tree methods is that they don’t model gene flow. Joe Pickrell’s TreeMix does so. Like the original tree methods, and unlike PCA or unsupervised model-based clustering, you specify a set of populations. Then you compare the populations in terms of their genetic distance, and fit them to a tree, but add migration parameters to that tree where the fit between the tree and the data is most tenuous.

All visualizations are deformations of reality. TreeMix attempts to mitigate this somewhat by introducing another representation, that of migration.
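The "add migration where the fit is worst" step can be illustrated with a toy. This is not TreeMix's actual procedure, just a sketch of the intuition: given an observed matrix of relatedness between populations and the matrix implied by a fitted tree (both invented here), look for the pair whose observed relatedness most exceeds what the tree allows. That pair is the natural candidate for a migration edge.

```python
def worst_fit_pair(names, observed, fitted):
    """Return (residual, pop1, pop2) for the pair with the largest positive
    residual, i.e. the pair more closely related in the data than in the tree."""
    best = None
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            residual = observed[i][j] - fitted[i][j]
            if best is None or residual > best[0]:
                best = (residual, names[i], names[j])
    return best


# Hypothetical populations and relatedness matrices (symmetric).
names = ["W", "X", "Y", "Z"]
observed = [
    [1.00, 0.60, 0.20, 0.10],
    [0.60, 1.00, 0.45, 0.10],
    [0.20, 0.45, 1.00, 0.50],
    [0.10, 0.10, 0.50, 1.00],
]
fitted = [
    [1.00, 0.60, 0.20, 0.10],
    [0.60, 1.00, 0.20, 0.10],
    [0.20, 0.20, 1.00, 0.50],
    [0.10, 0.10, 0.50, 1.00],
]

residual, pop1, pop2 = worst_fit_pair(names, observed, fitted)
print(pop1, pop2, round(residual, 2))
```

In the made-up matrices X and Y share more than the tree can accommodate, so that is where you would try a migration parameter first and then refit.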

Next we have local ancestry methods. By local ancestry, basically we mean methods which can assign ancestry to particular regions of the genome. While tree methods measure differences across pooled populations, PCA and model-based methods compare genotypes between individuals (this is a simplification, but bear with me). Local ancestry methods, like RFMix, compare regions of the genome with each other.
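A naive sketch of what "assigning ancestry to regions" means is below. It is emphatically not RFMix's algorithm; it assumes a phased haplotype coded as 0/1 alleles, known reference allele frequencies for two populations labeled A and B, independent sites, and fixed non-overlapping windows, all of which are simplifications chosen for illustration. Each window gets the label of the reference population under which its alleles are more likely.

```python
import math


def site_loglik(allele, freq):
    """Log-probability of observing this allele (1 or 0) given the frequency of allele 1."""
    freq = min(max(freq, 1e-6), 1.0 - 1e-6)  # avoid log(0)
    return math.log(freq if allele == 1 else 1.0 - freq)


def local_ancestry(haplotype, freqs_a, freqs_b, window=5):
    """Label each window of sites 'A' or 'B' by comparing log-likelihoods.

    Returns a list of (start, end, label) with end exclusive.
    """
    calls = []
    for start in range(0, len(haplotype), window):
        end = min(start + window, len(haplotype))
        ll_a = sum(site_loglik(haplotype[i], freqs_a[i]) for i in range(start, end))
        ll_b = sum(site_loglik(haplotype[i], freqs_b[i]) for i in range(start, end))
        calls.append((start, end, "A" if ll_a >= ll_b else "B"))
    return calls


# Hypothetical reference frequencies and a haplotype that switches ancestry halfway.
freqs_a = [0.9] * 10
freqs_b = [0.1] * 10
haplotype = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

calls = local_ancestry(haplotype, freqs_a, freqs_b, window=5)
print(calls)
```

Real tools also smooth across neighboring windows and use the recombination map, which is a large part of why they need dense markers and good phasing.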

Related to, but not exactly the same as, local ancestry methods are haplotype based methods. In particular, I’m thinking of FineStructure and its related methods. These leverage variation across the genome in terms of haplotypes, rather than just looking at genotypes. They also tend to benefit from phasing, for obvious reasons. FineStructure and its relatives tend to need more marker density than model-based methods, which require more marker density than PCA, which requires more marker density than tree based methods. These haplotype based methods allow for correction of and accounting for forces such as genetic drift, which tend to skew results in other methods.

Finally, there is the AdmixTools framework, which is good for testing very explicit hypotheses. While many of the above methods, such as TreeMix and unsupervised model-based clustering, explore an almost open-ended space of structure possibilities, the methods in AdmixTools exist in large part to test narrowly delimited models. This goes to the fact that many of these methods are complementary, and you should use them together to arrive at a robust result. For example, if you are assigning populations for TreeMix, you should use PCA and model-based clustering to make sure that the populations are clear and distinct, and outliers are removed.
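As an example of a narrow, explicit test, here is a bare-bones sketch of the three-population (f3) statistic, one of the statistics in the AdmixTools family. It computes the average of (c - a)(c - b) over loci from population allele frequencies; a clearly negative value is evidence that C is a mixture of populations related to A and B. The sketch assumes exact allele frequencies and leaves out the sample-size correction and the standard errors a real analysis needs, and the frequencies are invented.

```python
def f3(freqs_c, freqs_a, freqs_b):
    """Naive f3(C; A, B): mean over loci of (c - a) * (c - b)."""
    products = [(c - a) * (c - b) for c, a, b in zip(freqs_c, freqs_a, freqs_b)]
    return sum(products) / len(products)


# Hypothetical allele frequencies at three loci.
freqs_a = [0.1, 0.9, 0.2]
freqs_b = [0.9, 0.1, 0.8]

# A test population sitting between A and B, as a recent mix would.
freqs_mixed = [0.5, 0.5, 0.5]
# A test population that has drifted away from A, not toward B.
freqs_drifted = [0.0, 1.0, 0.1]

print(f3(freqs_mixed, freqs_a, freqs_b))
print(f3(freqs_drifted, freqs_a, freqs_b))
```

The first value comes out negative and the second positive. Note that the test only answers the question you pose: you have to name C, A and B up front, which is why it pays to explore with PCA and clustering first.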

There’s a lot I left out, but many of the other methods are just twists on the ones above.

 
• Category: Science • Tags: Genomics 
  1. Even though trees don’t show gene flow as you say, it’s often the case that input which models populations as mixed and input that doesn’t (like Fst values) result in similar trees.
    Example (thanks to Matt for these):

  2. Hi, I found this paper by a philosopher really useful:
    http://www.academia.edu/5495326/The_Genetic_Reification_of_Race_A_Story_of_Two_Mathematical_Methods

    I definitely need to learn more genetics though.

  3. I see many hits with a google search for
    3-D point cloud graphics
    and some hits with
    3D point cloud graphics with web interface.
    Sorry I have little personal experience and so no recommendations.

  4. * “For example, if you are assigning populations for TreeMix, you should use PCA and model-based clustering to make sure that the populations are clear and distinct, and outliers are removed.”

    Are there population structure tools that automate the identification of likely outliers or hidden structure starting from an initial imperfectly accurate grouping of individuals into populations the way you might by hand with a PCA?

    In the way we do population genetics today with populations with a couple to a hundred or so samples per population the sorting of outliers by hand method works well enough. But, if you have samples of tens of thousands of people per population, the way a mature version of a service like 23andme does, you’d really need to automate the process.

    * Are there tools that are well suited to determining whether a given population has meaningful within population structure or not (perhaps quantifying the extent to which the population has structure), without elucidating the nature of that within group structure?

    I’m thinking about a tool that might complement a tool like Admixture’s which models an arbitrary number of K populations to let you know when to stop because you have captured all or X% of the structure existing in the sample.

    More concretely I am also thinking about a tool that could distinguish between say, the population within a state in India that would have lots of regional and caste structure, and a population like Lutherans from Wisconsin, or global unadmixed Chinese migrant merchant communities that might have very little further internal structure. Within group variation in genes alone wouldn’t suffice, because some groups like an unstructured group of Khoi-San might have high but unstructured genetic diversity, while others might have small but clearly demarkated divisions.

    I could imagine something like a Monte Carlo based approach that might compare the FST values between subpopulations divided on the basis of a large number of pre-determined haplotypes (in parts of the genome where all humans have reached fixation on just one) in the autosomal genome, with some haplotype divisions producing large FST values and some haplotype divisions producing low FST values in some coherent manner signaling likely substructure, while a sample with similar FST values for all haplotype divisions signaling little substructure, and assigned a single number to those results.

    You could figure out analytically the score that would result from pure random variation and assign that value 1.0, with a higher score reflecting more structure (perhaps in units of standard deviation or chi-squared), and with low values reflecting an unnaturally unstructured population like the relatively recent descendants of a small founding population (or a rogue sperm donor) or at zero, a population made up entirely of perfect clones of each other (a la Star Wars clone wars) or a population that reproduces non-sexually (which amounts to pretty much the same thing).

    • Replies: @Razib Khan
  5. @ohwilleke
    Are there population structure tools that automate the identification of likely outliers or hidden structure starting from an initial imperfectly accurate grouping of individuals into populations the way you might by hand with a PCA?

    finestructure includes a lot of the stuff u r talking about

    • Replies: @ohwilleke
  6. @Razib Khan
    Cool. Thanks. Is that open access or pricey?

  7. Razib:

    Is there a guide to all the acronyms you throw around so dumb asses like me can understand what you are talking about? If not, have you ever considered putting one together?
