The Unz Review - Mobile
A Collection of Interesting, Important, and Controversial Perspectives Largely Excluded from the American Mainstream Media
Razib Khan
Gene Expression Blog


To mark the release of the 1000 Genomes papers, here are pedigree files with the 2,500 1000 Genomes samples. The 290,000 SNPs overlap with HGDP and other public SNP-chip data sets. The .fam has the population IDs. For what it’s worth, I just used plink 2 to convert from VCF format.
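If you want to check the population labels yourself, here is a minimal Python sketch. It assumes the standard six-column PLINK .fam layout (family ID, individual ID, father, mother, sex, phenotype) and, as a guess on my part, that the population ID sits in the first (family ID) column. The function name and the sample rows are made up for illustration and are not taken from the actual files.

```python
from collections import Counter

def count_populations(fam_text, pop_column=0):
    """Count samples per population label in PLINK .fam text.

    Assumption: the population ID is stored in column `pop_column`
    (0 = family ID). Adjust if the files use a different column.
    """
    counts = Counter()
    for line in fam_text.splitlines():
        fields = line.split()
        if len(fields) < 6:
            # Skip blank or malformed rows rather than guessing.
            continue
        counts[fields[pop_column]] += 1
    return counts

# Hypothetical rows, only to show the layout.
example_fam = """\
YRI SAMPLE_A 0 0 1 -9
YRI SAMPLE_B 0 0 2 -9
CHB SAMPLE_C 0 0 2 -9
"""

for population, n in sorted(count_populations(example_fam).items()):
    print(population, n)
```

If the labels turn out to live in another column, passing a different `pop_column` is the only change needed.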

 
• Category: Science • Tags: Genome 

The Pith: The rarer the genetic variant, the more likely that variant is to be specific to a distinct population. Including information about the distribution of these genetic variants, which current techniques miss, can greatly increase the precision of statistical inferences.

A few days ago I mentioned in passing an article in The New York Times which reported on results from a paper which illustrated how starkly differentiated populations might be on rare alleles. By this, I mean that some genetic variants are present at very low frequencies. It turns out that many of these low frequency variants are private to particular populations, in contrast to higher frequency variants which span varied human populations. The explanation presented by one of the authors of the referenced paper was that higher frequency variants presumably date back to a time before human populations had become geographically diversified across the world. Shared variants at higher frequencies then are shadows of shared past history. In contrast, rare variants are a reflection of more recent events, narrowing the circle of those affected.

I have now read the paper in question, Demographic history and rare allele sharing among human populations. From what I can gather The New York Times article was really an elaboration upon some of the issues which were highlighted in the discussion. The “meat” of the paper in terms of methods and results is actually rather technical and deeply embedded in the language of mathematical statistics. For example:

After further consideration, I have decided that I shall spare you my own clumsy exposition in plain English as to the details of site frequency spectrum calculations. There are after all enough points of interest in the paper at which I can throw my verbal talents more effectively. First, the abstract:

High-throughput sequencing technology enables population-level surveys of human genomic variation. Here, we examine the joint allele frequency distributions across continental human populations and present an approach for combining complementary aspects of whole-genome, low-coverage data and targeted high-coverage data. We apply this approach to data generated by the pilot phase of the Thousand Genomes Project, including whole-genome 2–4× coverage data for 179 samples from HapMap European, Asian, and African panels as well as high-coverage target sequencing of the exons of 800 genes from 697 individuals in seven populations. We use the site frequency spectra obtained from these data to infer demographic parameters for an Out-of-Africa model for populations of African, European, and Asian descent and to predict, by a jackknife-based approach, the amount of genetic diversity that will be discovered as sample sizes are increased. We predict that the number of discovered nonsynonymous coding variants will reach 100,000 in each population after ∼1,000 sequenced chromosomes per population, whereas ∼2,500 chromosomes will be needed for the same number of synonymous variants. Beyond this point, the number of segregating sites in the European and Asian panel populations is expected to overcome that of the African panel because of faster recent population growth. Overall, we find that the majority of human genomic variable sites are rare and exhibit little sharing among diverged populations. Our results emphasize that replication of disease association for specific rare genetic variants across diverged populations must overcome both reduced statistical power because of rarity and higher population divergence.
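The site frequency spectrum itself is simple to state even if the inference built on it is not: for a sample of n chromosomes, count how many variable sites carry the variant allele exactly k times, for each k. The Python sketch below tabulates that from a toy 0/1 matrix. It is my own illustration, with made-up data and a made-up function name, and is not the authors' pipeline, which also has to deal with sequencing error and uneven coverage.

```python
def site_frequency_spectrum(sites):
    """Tabulate an (unfolded) site frequency spectrum.

    sites: list of sites, each a list of 0/1 allele states across the
           same n sampled chromosomes (1 = variant allele present).
    Returns a list `sfs` of length n + 1 where sfs[k] is the number of
    sites at which the variant allele is seen on exactly k chromosomes.
    """
    if not sites:
        return []
    n = len(sites[0])
    sfs = [0] * (n + 1)
    for site in sites:
        if len(site) != n:
            raise ValueError("every site must cover the same chromosomes")
        sfs[sum(site)] += 1
    return sfs

# Toy data: 5 variable sites scored on 4 chromosomes.
toy_sites = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
    [1, 1, 1, 0],
]

print(site_frequency_spectrum(toy_sites))  # [0, 3, 1, 1, 0]
```

Even in this toy the singletons (k = 1) outnumber everything else, which is the qualitative pattern the paper reports at much larger scale.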

The first figure illustrates one of the clearest, though most unsurprising, findings in the paper: the lack of overlap of rare alleles across two distinct populations. In this panel they’re comparing Chinese from Beijing (CHB) and Yoruba from Nigeria (YRI). They focused on rare alleles, defined as variants present in 15 or fewer out of 100 in their sample. The union of the two populations yielded ~3,300 alleles, but only ~200 of these intersected across the populations. In other words, well over 90% of these alleles were private to one population or the other. This immediately clues you in on the peculiarity of these genetic variants, as you should know that at any random polymorphic gene there will be far less between-population variance than this. The zone of intersection on the histogram is notably “flat,” while it is “cool” on the heat map. In contrast, the “edges” of the graphs, which are defined by alleles exclusive to each respective population, exhibit a wide distribution in counts (observe that there are many more very rare alleles than moderately rare alleles).
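To make the arithmetic concrete, here is a small Python sketch. The first function classifies rare variants as shared or private given per-population allele counts; the rule it uses (a variant counts as rare if it is seen at least once and no more than 15 times in either sample) is my assumption about how to read the figure, not a definition taken from the paper, and the toy counts are invented. The second function just reruns the ~3,300 and ~200 figures quoted above.

```python
def rare_allele_sharing(counts_a, counts_b, max_count=15):
    """Return (shared, private) tallies of rare variants.

    counts_a, counts_b: dicts mapping a variant ID to its allele count
    in each population sample. Assumed rule: keep a variant if it is
    observed at all and its count is <= max_count in both samples.
    """
    shared = private = 0
    for variant in set(counts_a) | set(counts_b):
        a = counts_a.get(variant, 0)
        b = counts_b.get(variant, 0)
        if max(a, b) == 0 or max(a, b) > max_count:
            continue  # unobserved, or too common to count as rare
        if a > 0 and b > 0:
            shared += 1
        else:
            private += 1
    return shared, private

def private_fraction(union_size, shared):
    """Fraction of variants in the union seen in only one population."""
    return (union_size - shared) / union_size

# Invented toy counts, only to exercise the classification rule.
toy_a = {"v1": 1, "v2": 3, "v3": 40, "v4": 0}
toy_b = {"v2": 2, "v4": 5, "v5": 1}
print(rare_allele_sharing(toy_a, toy_b))  # (1, 3)

# The approximate figures quoted in the text.
print(round(private_fraction(3300, 200), 3))  # 0.939
```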

An important aspect of this paper is that they synthesized results from “high coverage” and “low coverage” sequencing efforts. The former is highly accurate in terms of the actual state of the genome, but often very targeted and narrow (in this paper they focused on a set of exons, regions of the genome which actually encode proteins). In contrast, the latter covers wider swaths of the genome (the full genome in this case), but may not be as accurate. One can immediately imagine the problem when one is fixing upon low frequency variants: errors in the data as well as limitations in the sample size may result in inflation or omission of alleles. When it comes to high frequency alleles error is of less account because a mistake here and there will not change the qualitative assessment. In any case, by comparing the rare variants found in deeply covered regions of the genome with the presumed underestimates yielded by the more thinly covered projects, the authors generated parameters which allowed them to project the proportion of private alleles as a function of frequency across populations.

To the left you see a set of series on a line chart generated by their method. On the x-axis you have the minor allele frequency (the frequency of the rarer variant at a locus). For the y-axis you have the proportion of alleles shared across the two populations. What is notable to me is how even two closely related populations tend to differ a great deal at very low frequencies! The Chinese data needs a little explanation I think. The Chinese in Denver are almost certainly skewed toward a South Chinese sample. Historically American Chinese were disproportionately Cantonese, while the newer immigration waves tend to be Fujianese, whether directly from Fujian, or ethnic Fujianese from Taiwan (where they are the majority). Though likely cosmopolitan, the Beijing Chinese are obviously going to sample more from the north of the country. This difference shows up on PCA plots, where the Beijing and Denver Chinese samples exhibit the distances from populations to their north and south that you’d expect if the latter were derived from southern Chinese populations.

The fact that very rare alleles are not shared across even closely related populations should not be too surprising when you think about it of course (everything is so obvious in hindsight!). For example, much of southern China was populated by Han ~1,500 years ago, during the first interregnum between Chinese dynasties (a period of disunity of particularly great length, lasting three centuries). During the Song, ~1000 A.D., the Yangtze region and provinces to the south definitively surpassed the Yellow River basin in demographic heft. Without taking into account migration, this gives about 1,000 years on average, or 40 generations (assuming 25 years per generation), for new genetic variants to arise which might be private to the Han of the north and south of China respectively. The same process writ small certainly applies within putative populations, and there are going to be family private alleles. That is, genetic markers of recent origin distinctive to family lineages (more broadly construed we already know this with tandem repeats, but here we’re focusing on single nucleotide polymorphisms, changes on one base pair).

Finally, let’s hit their main demographic finding, which received a lot of coverage in The New York Times. They estimated that the last common ancestor of Asians and Africans in their data set was on the order of ~50,000 years before the present. This is absolutely unsurprising. As they note this is entirely consonant with the archeological record. What is fascinating is the confidence: 45 to 69 thousand years over the 95% interval. This immediately seemed congenially narrow to me, and they confirm this by reviewing earlier estimates with noisier data sets which had much larger intervals. Here is the rough demographic model which they inferred from their data:

CEU refers to Utah whites, CHB to Chinese in Beijing, JPT to Japanese, and YRI are Yoruba. You can see that their estimate of the last common ancestor of Europeans and Asians is ~23 thousand years B.P., in line with other calculations, though a touch on the low side for my own taste. The N refers to population sizes, while the nature of the tree illustrates the non-African bottleneck followed by demographic expansion vs. the relatively constant African population size over the past ~100,000 years.

The real good stuff comes in the discussion. Here’s something that jumped out at me: “It should be emphasized that, because we use a single Western African population as our African panel, the divergence described by our model might have occurred earlier than the actual Out-of-Africa event.” Within the discussion it is noted repeatedly that their results are sensitive to a host of conditions. For example, they were limited in the populations they used, and their demographic-historical model was obviously not as complex as it could have been. These results then perhaps should be seen as an important guide, and a pointer to things to come, rather than a substantive marker to lay down and take to heart. Given the populations they had and the data available the method outlined here seems very useful, but there are still limitations imposed by the population set and the nature of the data (which will be obviated in the near future).

Finally, there’s the practical payoff in medical genetics. The New York Times accurately reflected the inference one could make from this: if lots of diseases which are common are due to a host of rare variants, then it is even more important to gain a better understanding of fine-grained human variation. Risk alleles found in one population via genome-wide association have often been found to predict well in other populations, but if these more common variants are part of our common ancestral heritage, then they should be relatively robust to genetic background. Such may not be the case with many rare variants, which reflect the peculiarities of more recent history. If medicine is to be truly personal in the genomic sense, then it seems likely that it will be more context dependent than had been hoped 10 years ago.

Citation: Simon Gravel, Brenna M. Henn, Ryan N. Gutenkunst, Amit R. Indap, Gabor T. Marth, Andrew G. Clark, Fuli Yu, Richard A. Gibbs, The 1000 Genomes Project, & Carlos D. Bustamante (2011). Demographic history and rare allele sharing among human populations. PNAS. DOI: 10.1073/pnas.1019276108

(Republished from Discover/GNXP by permission of author or representative)
 

Recently a friend got their 23andMe genotype results, and was wondering if there was something they could do for the “greater good.” I told him that he should throw his genotype out to the public domain and attach his name to it. For various reasons he declined to go that far, but he did consent to my putting his genotype online without personal identifying information. I can tell you that he is a relatively young male of 100% (to his knowledge) Ashkenazi Jewish heritage.

You can get a zipped folder with the raw text file and a binary pedigree formatted file here. If you click the free download option after 30 seconds you’ll get the file within about 5 minutes on a broadband connection (that was my experience at least).
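If you would rather poke at the raw text file than the binary pedigree version, a minimal Python sketch follows. It assumes the usual 23andMe raw export layout as I understand it: comment lines starting with `#`, then tab-separated rows of rsid, chromosome, position, and genotype, with `--` marking a no-call. The function names and the sample rows (including the placeholder rsids) are invented for illustration.

```python
def parse_raw_genotype(text):
    """Parse 23andMe-style raw text into {rsid: (chrom, position, genotype)}.

    Assumed layout: '#' comment lines, then four tab-separated columns.
    """
    calls = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the commented header
        rsid, chrom, position, genotype = line.split("\t")
        calls[rsid] = (chrom, int(position), genotype)
    return calls

def no_call_rate(calls):
    """Fraction of markers whose genotype contains a '-' (no-call)."""
    if not calls:
        return 0.0
    missing = sum(1 for _, _, genotype in calls.values() if "-" in genotype)
    return missing / len(calls)

# Made-up rows with placeholder IDs, only to show the layout.
example_raw = (
    "# rsid\tchromosome\tposition\tgenotype\n"
    "rs0000001\t1\t1000\tAA\n"
    "rs0000002\t1\t2000\tAG\n"
    "rs0000003\tX\t3000\t--\n"
)

calls = parse_raw_genotype(example_raw)
print(len(calls), "markers; no-call rate", round(no_call_rate(calls), 3))
```

A dictionary keyed by rsid keeps lookups of individual markers trivial; for a full file with hundreds of thousands of rows that is still comfortably small in memory.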

If anyone else wants to throw their genotype to the public domain with as much or as little information as you want just email me at contactgxnp -at- gmail -dot- com. Here’s a spreadsheet with other people who have put their genotypes online. I want to put up a “roundup” post with a bunch of people who do just that in the near future.

(Republished from Discover/GNXP by permission of author or representative)
 
• Category: Science • Tags: Genetics, Genome, Genomics, Public Genotype 

At Genomes Unzipped Luke Jostins elaborates on how the genetic facts he now has about his paternal lineage change how he views his own personal history:

… my father’s father is Latvian, and the N1 haplogroup is not rare in the Baltic regions. In fact, the subgroup, N1c1, is more common in parts of Eastern Europe than it is in Asia.

Initially, this seemed to play nicely into a part of our ancient family history. There is a folk history, relayed to me by my Dad and my uncle Johnny, that Jostins blood may contain traces of Mongolian. The justification for this is that in around 1260, just before the civil war caused the Mongol Empire to die back in Europe, the Empire extended all the way to the Baltic States. It was at this point, my fellow N1c1-bearers hypothesise, that Mongolian DNA entered the Jostins line.

Unfortunately on closer inspection this tale is not really supported by the DNA evidence. The famous Mongol Expansion haplogroup is actually C3, which is the modal haplogroup of Mongolians. In contrast, N1c1 has existed in Europe for thousands of years, and is far too old and too wide-spread to represent a recent expansion.

To the left is a frequency map of the concentration of N1c1. Based on the current distribution, and the diversity being modal in the East Baltic, one has to be skeptical of a simple east-west model. Interestingly the frequency difference of this haplogroup between Finland and Sweden is very high. Also, a branch of N1c1 seems to be found among the Rurikids of Russia. This was the ruling dynasty of the Rus, a people who originally seem to have been ethnic Scandinavians from Sweden. Eventually they ruled over a polyglot state of Finns, Slavs and Scandinavians, and submerged their own identity with that of the Slavic peasants. In this they followed the example of the Bulgars, who were ethnically distinct from their Slavic subjects, but were totally absorbed excepting that their ethnonym persisted. There is some evidence that the Serbs are a similar case, an Iranian group which was eventually absorbed into the South Slav substrate.

Going back to northern Europe, let’s try to get some more perspective. Luke Jostins’ personal history is after all a slice of population history, and what we know about the background of the population impacts how Luke views his own personal history. To do that I thought I’d quickly poke around a few older papers on Baltic genetics which I had stashed away. It didn’t turn out to be so quick. But here are some figures. First, from Genome-Wide Analysis of Single Nucleotide Polymorphisms Uncovers Population Structure in Northern Europe:


From Genetic Structure of Europeans: A View from the North–East:


Finally, from Migration Waves to the Baltic Sea Region (N3 = N1c1):


Also see my recent posts on Northern European genetics, as well as the argument about agriculturalists vs. farmers. Ten years ago we had a few simple models, but now it gets more confusing and complicated. Confounders:

- Different reproductive skew parameters for males and females. In short, high fertility of “super-males” as well as dominance of patrilocality can produce different patterns in Y and mtDNA

- Selection on mtDNA. The “neutral” markers which we think of as neutral may not be neutral

- Poor correspondence between inferences of the past based on contemporary patterns of variation and what ancient DNA has discovered. Our assumptions are faulty, or we’re just too stupid to extract the real patterns

- Persistent problems with dating and typing some uniparental lineages. Consider the debate over the pan-Eurasian haplogroup R1a1a* (Dan MacArthur and I both carry this Y lineage, but what’s in a few letters?)

- Reality is complicated. This may be the most intractable issue over the long term

I have used the analogy of a palimpsest to describe the flow of genetic variation over time and space. I think that perhaps that is misleading in some fundamental ways. Demographic patterns are characterized by different dynamics, persistent and long standing “flows,” as well as punctuated “explosions.” Rather than a palimpsest, a better analogy might be the layering of geological strata. Although there are long periods of gentle wearing and layering, volcanism and earthquakes periodically erupt to disrupt the smooth accumulations. Sequences of catastrophic events can produce inversions.

Consider three dynamics:

- Isolation-by-distance. This is the conventional band/village-to-band/village process of gene flow. This may be analogized to sedimentary accumulation (mutations) and erosion (drift)

- Demic diffusion. The rapid demographic expansion into virgin territory by a culture which introduces a more efficient mode of production. One of the most recent occurrences of this was the rapid multiplication of New England Puritans from ~30,000 circa 1640 to over 700,000 150 years later. Not only did these New Englanders “fill up” their home territory, in the early years of the republic they burst out of the northeast and populated many regions of the Great Lakes. Demic diffusion is like an earthquake, a rapid and ordered shift of the local geology

- The leap frog. The settlement of Europeans in the southern cone of Latin America, Australia, or Mongols in eastern Iran, are instances of leap frogs. We have clear textual evidence of these leap frogs, but without that we wouldn’t know what to make of them. Leap frogs are like volcanic eruptions, reordering the layers beneath and also depositing new material from above
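To put a number on how explosive the New England case was, here is a back-of-the-envelope Python sketch using the ~30,000 and 700,000 figures from the demic diffusion item above. It assumes smooth exponential growth with no further immigration, which is my simplification, and since the later figure is "over 700,000" the rate is if anything an underestimate. The function names are mine.

```python
import math

def annual_growth_rate(start, end, years):
    """Continuous exponential growth rate implied by start -> end."""
    return math.log(end / start) / years

def doubling_time(rate):
    """Years needed to double at a given continuous growth rate."""
    return math.log(2) / rate

# Figures from the text: ~30,000 circa 1640, over 700,000 150 years later.
rate = annual_growth_rate(30_000, 700_000, 150)
print(f"growth rate: {rate:.3f} per year")            # growth rate: 0.021 per year
print(f"doubling time: {doubling_time(rate):.0f} years")  # doubling time: 33 years
```

A doubling roughly every generation, sustained for a century and a half, is what separates this kind of expansion from the slow village-to-village gene flow of isolation-by-distance.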

At least with Luke’s hypothesis about descent from Rurik he can test his own N1c1 profile against other Rurikids. Presumably the modal haplotype and its near relations are those of the original Rurik.

(Republished from Discover/GNXP by permission of author or representative)
 
• Category: Science • Tags: Finns, Genetics, Genome, Genomes Unzipped, Genomics 
Razib Khan
About Razib Khan

"I have degrees in biology and biochemistry, a passion for genetics, history, and philosophy, and shrimp is my favorite food. If you want to know more, see the links at http://www.razib.com"