The Unz Review
Gene Expression Blog
Size (Sample) Matters More Than Coverage
We live in an age where it’s almost anachronistic to talk about “-omics.” When a technology becomes seamless in our day-to-day lives it becomes unworthy of notice. That said, we’re still in the phase of genomics where many of the details of “best practices” are being hashed out (the proliferation of “pipelines” for relatively pedestrian tasks makes that clear). Recently I stumbled upon two papers which I thought would be useful to give a little more coverage to: Population genomics based on low coverage sequencing: how low should we go? and Assessing the Effect of Sequencing Depth and Sample Size in Population Genetics Inferences. At issue here is coverage versus sample size. By coverage I mean the expected number of reads that will hit a given nucleotide: at 100× you expect 100 hits on a base, while at 1× you expect only one. Because of random variation, many positions will end up above or below your expected coverage. Why this matters, on the most prosaic level, is that there is going to be error in the results you get back from sequencing, and if you have many hits on the same position you can distinguish true from false polymorphism. For many projects today people seem to prefer on the order of 30×.
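Depth varies from site to site; under the commonly used Poisson approximation for shotgun sequencing, the expected fraction of sites falling below any given depth follows directly. A minimal sketch (the 10-read threshold is just an illustration):

```python
import math

def frac_below(mean_coverage, threshold):
    """Expected fraction of sites with read depth below `threshold`,
    assuming per-base depth is Poisson-distributed with mean
    `mean_coverage` (the nominal coverage)."""
    return sum(
        math.exp(-mean_coverage) * mean_coverage ** k / math.factorial(k)
        for k in range(threshold)
    )

# At a nominal 30x almost no sites drop below 10 reads (~1e-5 of them),
# while at 4x nearly all of them do (~0.99).
shallow_at_30x = frac_below(30, 10)
shallow_at_4x = frac_below(4, 10)
```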


Citation: Fumagalli M (2013) Assessing the Effect of Sequencing Depth and Sample Size in Population Genetics Inferences. PLoS ONE 8(11): e79667. doi:10.1371/journal.pone.0079667

As the second paper is open access I’ll refer to its results, which are broadly in agreement with the first. When attempting to estimate simulated (so the author knows the “true” values) population genetic statistics or population substructure, increasing sample size at even 1-2× coverage gave much more bang for the buck than ratcheting up coverage. The methodology employed a trade-off between sample size and coverage, so that (sample size) × (coverage) remained invariant. The result wasn’t totally surprising to me in relation to population structure, since noisy and error-prone data can still be quite useful so long as there isn’t a systematic bias (i.e., the error is random, so you’re left with thousands of useful markers after employing stringent quality control). But it did surprise me how large the effect was for standard population genetic statistics of diversity. And the problems in that domain only increase when you have a rapidly growing population with an excess of rare variants (like humans), rather than a constant population size.*
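The trade-off is easy to reproduce in a toy Monte Carlo experiment. This is a sketch of the general idea, not the paper’s actual simulation design; the allele frequency of 0.3, the 1% error rate, and the naive read-counting estimator are all arbitrary choices for illustration. Hold the per-site read budget fixed and compare the error of an allele frequency estimate from many samples at low coverage against few samples at high coverage:

```python
import random

def freq_rmse(n_samples, coverage, p=0.3, err=0.01, reps=2000, seed=1):
    """RMSE of a naive read-counting allele-frequency estimate for
    `n_samples` diploid individuals sequenced at `coverage` reads per
    site, with per-read error rate `err`. Toy model for illustration."""
    rng = random.Random(seed)
    sq = 0.0
    for _ in range(reps):
        alt, total = 0, 0
        for _ in range(n_samples):
            g = (rng.random() < p) + (rng.random() < p)  # genotype: 0, 1, or 2
            for _ in range(coverage):
                is_alt = rng.random() < g / 2            # base drawn from genotype
                if rng.random() < err:
                    is_alt = not is_alt                  # sequencing error flips it
                alt += is_alt
                total += 1
        sq += (alt / total - p) ** 2
    return (sq / reps) ** 0.5

# Same budget of 200 reads per site: 100 samples at 2x gives a far
# smaller error than 10 samples at 20x, because with few individuals
# the genotype-sampling noise dominates no matter how deep you go.
```

The design point is that deep sequencing only shrinks the per-individual error, while the dominant term in the estimator’s variance comes from how many chromosomes you sampled in the first place.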

Finally, this is obviously a conclusion geared toward biologists focusing on population-scale dynamics, whether molecular ecologists or population geneticists. But as sequencing becomes more ubiquitous, and money remains finite, these sorts of balancing acts between coverage and sample size will increasingly come to the fore.

* Also, the author observes that by utilizing a probabilistic model of genotype uncertainty, such as that implemented in ANGSD, instead of a hard cutoff in variant calling, you can get a lot more juice out of low-coverage data.
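The gist of such genotype-likelihood approaches can be sketched in a few lines. This is a simplified symmetric-error model for a biallelic site, not ANGSD’s actual implementation (which works with per-base quality scores in log space); the allele labels and error rate are placeholders:

```python
def genotype_posteriors(bases, alt='G', err=0.01):
    """Posterior probability of each diploid genotype (0, 1, or 2 copies
    of the alt allele) given observed read bases at a biallelic site,
    under a flat genotype prior and a symmetric per-read error rate."""
    liks = []
    for g in (0, 1, 2):
        p_alt = g / 2  # chance a read drawn from this genotype is alt
        lik = 1.0
        for b in bases:
            # probability of observing an alt base, allowing for error
            p_obs_alt = p_alt * (1 - err) + (1 - p_alt) * err
            lik *= p_obs_alt if b == alt else 1 - p_obs_alt
        liks.append(lik)
    total = sum(liks)
    return [l / total for l in liks]

# Two reads, one ref and one alt: the heterozygote dominates, but the
# homozygotes retain some probability mass instead of being discarded
# by a hard cutoff.
post = genotype_posteriors(['A', 'G'])
```

Downstream statistics can then be computed by summing over these genotype probabilities rather than committing to a single hard call per site, which is where the extra juice at low coverage comes from.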

• Category: Science • Tags: Genomics 
3 Comments
  1. gillt says:

    But is dense sampling computationally feasible when scaling up to genomic datasets?

    The method described here circumvents the need for phylogeographic samples by using triplets plus an outgroup to estimate species divergence histories.

    Yang Z. 2010. A likelihood ratio test of speciation with gene flow using genomic sequence data. Genome Biol Evol. 2:200.

  2. But is dense sampling computationally feasible when scaling up to genomic datasets?

    what do you want to do? PCA goes reasonably fast on thousands of samples with hundreds of thousands of SNPs (what you might get out of a few X i think). even model-based clustering is feasible today with lots of samples with admixture and faststructure and all the competitors.

  3. gillt says:

    I was thinking of MCMC programs such as IM/IMa, which can’t handle large numbers of loci.


Comments are closed.
