The Unz Review - Mobile
A Collection of Interesting, Important, and Controversial Perspectives Largely Excluded from the American Mainstream Media
Razib Khan


The Pith: The rarer a genetic variant, the more likely it is to be specific to a distinct population. Including information about the distribution of rare variants missed by current techniques can greatly increase the precision of statistical inferences.

A few days ago I mentioned in passing an article in The New York Times which reported on results from a paper which illustrated how starkly differentiated populations might be on rare alleles. By this, I mean genetic variants which are present at very low frequencies. It turns out that many of these low frequency variants are private to particular populations, in contrast to higher frequency variants which span varied human populations. The explanation presented by one of the authors of the referenced paper was that higher frequency variants presumably date back to a time before human populations had become geographically diversified across the world. Shared variants at higher frequencies then are shadows of shared past history. In contrast, rare variants are a reflection of more recent events, narrowing the circle of those affected.

I have now read the paper in question, Demographic history and rare allele sharing among human populations. From what I can gather The New York Times article was really an elaboration upon some of the issues which were highlighted in the discussion. The “meat” of the paper in terms of methods and results is actually rather technical and deeply embedded in the language of mathematical statistics. For example:

After further consideration, I have decided that I shall spare you my own clumsy exposition in plain English as to the details of site frequency spectrum calculations. There are after all enough points of interest in the paper at which I can throw my verbal talents more effectively. First, the abstract:

High-throughput sequencing technology enables population-level surveys of human genomic variation. Here, we examine the joint allele frequency distributions across continental human populations and present an approach for combining complementary aspects of whole-genome, low-coverage data and targeted high-coverage data. We apply this approach to data generated by the pilot phase of the Thousand Genomes Project, including whole-genome 2–4× coverage data for 179 samples from HapMap European, Asian, and African panels as well as high-coverage target sequencing of the exons of 800 genes from 697 individuals in seven populations. We use the site frequency spectra obtained from these data to infer demographic parameters for an Out-of-Africa model for populations of African, European, and Asian descent and to predict, by a jackknife-based approach, the amount of genetic diversity that will be discovered as sample sizes are increased. We predict that the number of discovered nonsynonymous coding variants will reach 100,000 in each population after ∼1,000 sequenced chromosomes per population, whereas ∼2,500 chromosomes will be needed for the same number of synonymous variants. Beyond this point, the number of segregating sites in the European and Asian panel populations is expected to overcome that of the African panel because of faster recent population growth. Overall, we find that the majority of human genomic variable sites are rare and exhibit little sharing among diverged populations. Our results emphasize that replication of disease association for specific rare genetic variants across diverged populations must overcome both reduced statistical power because of rarity and higher population divergence.

The first figure illustrates one of the clearest, though most unsurprising, findings in the paper: the lack of overlap of rare alleles across two distinct populations. In this panel they’re comparing Chinese from Beijing (CHB) and Yoruba from Nigeria (YRI). They focused on rare alleles, defined as variants present in 15 or fewer out of 100 in their sample. The union of the two populations yielded ~3,300 alleles, but only ~200 of these intersected across the populations. In other words, well over 90% of these alleles were private to one population or the other. This immediately clues you in on the peculiarity of these genetic variants, as you should know that at any random polymorphic gene there will be far less between population variance than this. The zone of intersection on the histogram is notably “flat,” while it is “cool” on the heat map. In contrast, the “edges” of the graphs, which are defined by alleles exclusive to each respective population, exhibit a wide distribution in counts (observe that there are many more very rare alleles than moderately rare alleles).
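To make the bookkeeping concrete, here is a minimal sketch of how one might tally a joint frequency spectrum for two populations and then ask what fraction of the rare variants is private. The function names and the toy allele counts are my own illustrative assumptions, not the authors' code or data; only the ~3,300 and ~200 figures and the cutoff of 15 come from the discussion above.

```python
from collections import Counter

def joint_sfs(counts_a, counts_b):
    """Tally sites by (allele count in population A, allele count in population B)."""
    return Counter(zip(counts_a, counts_b))

def private_fraction(sfs, max_count):
    """Fraction of rare variable sites that are seen in only one population."""
    shared = 0
    private = 0
    for (a, b), n_sites in sfs.items():
        if a == 0 and b == 0:
            continue  # not variable in either sample
        if a > max_count or b > max_count:
            continue  # not rare under the chosen cutoff
        if a > 0 and b > 0:
            shared += n_sites
        else:
            private += n_sites
    total = shared + private
    return private / total if total else 0.0

# Hypothetical per-site allele counts in two samples (toy data).
toy_a = [1, 0, 2, 0, 3, 1, 0]
toy_b = [0, 1, 0, 4, 2, 0, 20]
toy_private = private_fraction(joint_sfs(toy_a, toy_b), max_count=15)

# The rough figures quoted above: ~3,300 rare alleles in the union, ~200 shared.
post_estimate = 1 - 200 / 3300
```

With the quoted figures, `post_estimate` comes out at roughly 0.94, which is where "well over 90%" comes from.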

An important aspect of this paper is that they synthesized results from “high coverage” and “low coverage” sequencing efforts. The former is highly accurate in terms of the actual state of the genome, but often very targeted and narrow (in this paper they focused on a set of exons, the regions of the genome which actually encode proteins). In contrast, the latter covers wider swaths of the genome (the full genome in this case), but may not be as accurate. One can immediately imagine the problem when one is fixing upon low frequency variants: errors in the data as well as limitations in the sample size may result in inflation or omission of alleles. When it comes to high frequency alleles error is of less account because a mistake here and there will not change the qualitative assessment. In any case, by comparing the rare variants found in deeply covered regions of the genome with the presumed underestimates which are yielded in the more thinly covered projects, the authors generated parameters which allowed them to project the proportion of private alleles as a function of frequency across populations.

To the left you see a set of series on a line chart generated by their method. On the x-axis you have the minor allele frequency (the rarer variant at a locus). On the y-axis you have the proportion of alleles shared across the two populations. What is notable to me is how even two closely related populations tend to differ a great deal at very low frequencies! The Chinese data need a little explanation I think. The Chinese in Denver are almost certainly skewed toward a South Chinese sample. Historically American Chinese were disproportionately Cantonese, while the newer immigration waves tend to be Fujianese, whether directly from Fujian, or ethnic Fujianese from Taiwan (where they are the majority). Though likely cosmopolitan, the Beijing Chinese are obviously going to sample more from the north of the country. This difference shows up on PCA plots, where the Beijing and Denver Chinese samples exhibit the distances from populations to their north and south that you’d expect if the latter was derived from southern Chinese populations.

The fact that very rare alleles are not shared across even closely related populations should not be too surprising when you think about it of course (everything is so obvious in hindsight!). For example, much of southern China was populated by Han ~1,500 years ago, during the first interregnum between Chinese dynasties (a period of disunity of particularly great length, lasting three centuries). During the Song ~1000 A.D. the Yangtze region and provinces to the south definitively surpassed the Yellow River basin in demographic heft. Without taking into account migration, this gives about 1,000 years on average, or 40 generations (assuming 25 years per generation), for new genetic variants to arise which might be private to the Han of the north and south of China respectively. The same process writ small certainly applies within putative populations, and there are going to be family private alleles. That is, genetic markers of recent origin distinctive to family lineages (more broadly construed we already know this with tandem repeats, but here we’re focusing on single nucleotide polymorphisms, changes on one base pair).

Finally, let’s hit their main demographic finding, which received a lot of coverage in The New York Times. They estimated that the last common ancestor of Asians and Africans in their data set was on the order of ~50,000 years before the present. This is absolutely unsurprising. As they note this is entirely consonant with the archeological record. What is fascinating is the confidence: 45 to 69 thousand years over the 95% interval. This immediately seemed congenially narrow to me, and they confirm this by reviewing earlier estimates with noisier data sets which had much larger intervals. Here is the rough demographic model which they inferred from their data:

CEU refers to Utah whites, CHB to Chinese in Beijing, JPT to Japanese, and YRI are Yoruba. You can see that their estimate of the last common ancestor of Europeans and Asians is ~23 thousand years B.P., in line with other calculations, though a touch on the low side for my own taste. The N refers to population sizes, while the nature of the tree illustrates the non-African bottleneck followed by demographic expansion vs. the relatively constant African population size over the past ~100,000 years.

The real good stuff comes in the discussion. Here’s something that jumped out at me: “It should be emphasized that, because we use a single Western African population as our African panel, the divergence described by our model might have occurred earlier than the actual Out-of-Africa event.” Within the discussion it is noted repeatedly that their results are sensitive to a host of conditions. For example, they were limited in the populations they used, and their demographic-historical model was obviously not as complex as it could have been. These results then perhaps should be seen as an important guide, and a pointer to things to come, rather than a substantive marker to lay down and take to heart. Given the populations they had and the data available the method outlined here seems very useful, but there are still limitations imposed by the population set and the nature of the data (which will be obviated in the near future).

Finally, there’s the practical payoff in medical genetics. The New York Times accurately reflected the inference one could make from this: if lots of diseases which are common are due to a host of rare variants, then it is even more important to gain a better understanding of fine-grained human variation. Risk alleles found in one population via genome-wide association have often been found to predict well in other populations, but if these more common variants are part of our common ancestral heritage, then they should be relatively robust to genetic background. Such may not be the case with many rare variants, which reflect the peculiarities of more recent history. If medicine is to be truly personal in the genomic sense, then it seems likely that it will be more context dependent than had been hoped 10 years ago.

Citation: Simon Gravel, Brenna M. Henn, Ryan N. Gutenkunst, Amit R. Indap, Gabor T. Marth, Andrew G. Clark, Fuli Yu, Richard A. Gibbs, The 1000 Genomes Project, & Carlos D. Bustamante (2011). Demographic history and rare allele sharing among human populations. PNAS. DOI: 10.1073/pnas.1019276108

(Republished from Discover/GNXP by permission of author or representative)
 

Two of the main avenues of research which I track rather closely in this space are genome-wide association studies (GWAS), which attempt to establish a connection between a trait/disease and particular genetic markers, and inquiries into the evolutionary parameters which shape the structure of variation within the human genome, often with specific relation to a particular trait/disease. By evolutionary parameters I mean stochastic and deterministic forces: mutation, migration, random drift, and natural selection. These two angles are obviously connected. Both focus on phenomena which are proximate in relation to the broader evolutionary principle: the ultimate raison d’être, replication. Stochastic forces such as random genetic drift reflect the error of sampling of genes from generation to generation during the process of reproduction, while adaptation through natural selection is an outcome of the variation of reproductive fitness as a function of variation of heritable traits. Both of these forces have been implicated in diseases and traits which come under the purview of GWAS (and linkage mapping).

GWAS are regularly in the news because of their relevance in identifying the causal genetic factors for specific diseases, schizophrenia for example. But they can be useful in a non-disease context as well. Human pigmentation is a character whose genetic architecture has been well elucidated thanks to a host of recent association studies. The common disease-common variant model has yielded spectacular results for pigmentation; it does seem a few common variants are responsible for most of the variation on this trait. But this has been the exception rather than the rule.

One reason for this disjunction between the promise of GWAS and the concrete tangible outcomes is that many traits/diseases of interest may be polygenic and quantitative. This implies that variation in phenotype is controlled by variation across many genes, and that the variation itself exhibits gradual continuity (a continuity which can be modeled as a normal distribution of values). The power of GWAS to detect correlated variation across genes and traits of small marginal effect is obviously limited. In contrast, it seems that about half a dozen genes can explain most of the between population variation in pigmentation. One SNP is able to account for 25-40% of the difference in shade between Europeans and Africans. This SNP is fixed in Europeans, nearly absent in Africans and East Asians, and segregating in both ancestral and derived variants in groups such as South Asians and African Americans. Traits such as schizophrenia and height are different: they are substantially heritable, so much of the variation at the population level of the trait is explainable by variation in genes, but that variation is not concentrated in a handful of loci. The effect size at any given locus may be small, or the variation may be accumulated through the sum of larger effect variants of low frequency. In other words, many common variants of small effect, or numerous distinctive rare variants of large effect.


These nuances of genetic architecture are not irrelevant to the possible evolutionary arc of the traits in question. One model of the adaptation leading to the high frequency of a trait or disease is that a novel mutation rapidly “sweeps” to fixation, or nearly to fixation. In other words, it shifts from nearly 0% to nearly 100% frequency in the population of alleles at that locus, driven by positive selection. This sort of rapid “hard sweep” would also result in “hitchhiking” of associated variants in the genomic regions adjacent to the originally favored mutant, producing regions of high linkage disequilibrium in the genome and haplotype blocks of associated alleles across loci. Such a model does seem possible in the case of some of the variants which are responsible for diversity of pigmentation. But this neat dovetailing between the strong association of a few variants with trait variance, and signatures of positive selection being driven by adaptation, is not so easy to come by in many instances.

There are other evolutionary possibilities in terms of what could drive a high frequency of particular alleles. Population bottlenecks and inbreeding can crank up the frequency of a variant simply through chance. This may be the origin of many traits and diseases expressed recessively or in quasi-Mendelian form which run in specific populations. Let’s set such stochastic possibilities to the side for now. The well of natural selection is not quite tapped out simply by models of positive selection drawing upon singular new mutations. Another model is that of “soft sweeps” operating upon standing genetic variation. Consider for example a trait which has a heritability of 0.50: 50% of the variance in trait value can be explained by variance in genes. Selection correlated with trait value can rapidly change the distribution of the trait within the population, as modeled by the breeder’s equation. But no new mutations are necessary in this model; rather, the frequencies of extant alleles change over time. In fact, as the proportions shift, novel combinations of alleles which were once too rare to be found together in the same individual will emerge, and so offer up the possibility that the mean trait value in generation t + n may be outside of the range of trait values at t = 0.
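For those who prefer to see the breeder’s equation in action, a minimal sketch follows. The equation itself is the standard R = h² × S (response equals heritability times selection differential). The heritability of 0.5 is the value used above, while the starting mean, the selection differential, and the assumption that heritability stays constant across generations are purely illustrative.

```python
def response_to_selection(h2, selection_differential):
    """Breeder's equation: R = h^2 * S, the per-generation shift in the trait mean."""
    return h2 * selection_differential

def project_mean(start_mean, h2, selection_differential, generations):
    """Trait mean after repeated selection, assuming h2 and S stay constant
    (an idealization; in practice selection erodes the variation it feeds on)."""
    mean = start_mean
    for _ in range(generations):
        mean += response_to_selection(h2, selection_differential)
    return mean

# Heritability 0.5 as in the text; a hypothetical selection differential of
# 2 trait units per generation, applied for 10 generations from a mean of 100.
one_generation = response_to_selection(0.5, 2.0)
after_ten = project_mean(100.0, 0.5, 2.0, 10)
```

In this toy case the mean moves one unit per generation, so ten generations of steady selection shift it by ten units without a single new mutation.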

Over time such selection on a quantitative trait theoretically exhausts its own fuel, genetic variation. But quite often this is not practically operative, because such traits are subject to a background level of novel mutation and balancing selection. Stabilizing selection around a median phenotype, as well as frequency dependence and shifting environmental pressures, may produce a circumstance where adaptation never moves beyond the transient flux toward a new equilibrium. The element of the eternal race is at the heart of the Red Queen’s Hypothesis, where pathogen and host engage in an evolutionary war, and host immune responses are subject to negative frequency dependence. As the frequency of an allele rises, its relative fitness declines. As its frequency declines, its fitness rises.
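Negative frequency dependence is easy to caricature in a few lines. The sketch below is a toy haploid model with two alleles in which the fitness of the focal allele falls linearly as it becomes common. The fitness function, the strength parameter, and the equilibrium value are all assumptions chosen for illustration, not estimates for any real host-pathogen system.

```python
def next_frequency(p, strength=0.5, equilibrium=0.5):
    """One generation of a haploid two-allele model in which the focal allele's
    relative fitness declines as its frequency rises (toy fitness function)."""
    w_focal = 1.0 + strength * (equilibrium - p)  # fitter when rare, less fit when common
    w_other = 1.0
    mean_fitness = p * w_focal + (1.0 - p) * w_other
    return p * w_focal / mean_fitness

def simulate(p0, generations, **kwargs):
    """Iterate the model and return the allele frequency at the end."""
    p = p0
    for _ in range(generations):
        p = next_frequency(p, **kwargs)
    return p

from_rare = simulate(0.05, 500)    # starts rare, rises toward the equilibrium
from_common = simulate(0.95, 500)  # starts common, falls toward the equilibrium
```

Whether the allele starts rare or common, it is pulled back toward the same intermediate frequency, so the variation is maintained rather than exhausted.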

Naturally such complex evolutionary models, subject to contingency and less powerful in their generality, only become appealing when simple hard sweep models no longer suffice. But it seems highly plausible that the genetic architecture of some traits, those which seem plagued by ‘missing heritability,’ is going to necessitate somewhat more baroque evolutionary models to explain their ultimate emergence & persistence. A new paper in PLoS Genetics tackles this complexity by looking at the patterns of variation of SNPs implicated in GWAS in the HGDP data set: Genome-Wide Association Study SNPs in the Human Genome Diversity Project Populations: Does Selection Affect Unlinked SNPs with Shared Trait Associations? First, the abstract:

Genome-wide association studies (GWAS) have identified more than 2,000 trait-SNP associations, and the number continues to increase. GWAS have focused on traits with potential consequences for human fitness, including many immunological, metabolic, cardiovascular, and behavioral phenotypes. Given the polygenic nature of complex traits, selection may exert its influence on them by altering allele frequencies at many associated loci, a possibility which has yet to be explored empirically. Here we use 38 different measures of allele frequency variation and 8 iHS scores to characterize over 1,300 GWAS SNPs in 53 globally distributed human populations. We apply these same techniques to evaluate SNPs grouped by trait association. We find that groups of SNPs associated with pigmentation, blood pressure, infectious disease, and autoimmune disease traits exhibit unusual allele frequency patterns and elevated iHS scores in certain geographical locations. We also find that GWAS SNPs have generally elevated scores for measures of allele frequency variation and for iHS in Eurasia and East Asia. Overall, we believe that our results provide evidence for selection on several complex traits that has caused changes in allele frequencies and/or elevated iHS scores at a number of associated loci. Since GWAS SNPs collectively exhibit elevated allele frequency measures and iHS scores, selection on complex traits may be quite widespread. Our findings are most consistent with this selection being either positive or negative, although the relative contributions of the two are difficult to discern. Our results also suggest that trait-SNP associations identified in Eurasian samples may not be present in Africa, Oceania, and the Americas, possibly due to differences in linkage disequilibrium patterns. This observation suggests that non-Eurasian and non-East Asian sample populations should be included in future GWAS

And now the author summary:

Natural selection exerts its influence by changing allele frequencies at genomic polymorphisms. Alleles associated with harmful traits decrease in frequency while those associated with beneficial traits become more common. In a simple case, selection acts on a trait controlled by a single polymorphism; a large change in allele frequency at this polymorphism can eliminate a deleterious phenotype from a population or fix a beneficial one. However, many phenotypes, including diseases like Type 2 Diabetes, Crohn’s disease, and prostate cancer, and physiological traits like height, weight, and hair color, are controlled by multiple genomic loci. Selection may act on such traits by influencing allele frequencies at a single associated polymorphism or by altering allele frequencies at many associated polymorphisms. To search for cases of the latter, we assembled groups of genomic polymorphisms sharing a common trait association and examined their allele frequencies across 53 globally distributed populations looking for commonalities in allelic behavior across geographical space. We find that variants associated with blood pressure tend to correlate with latitude, while those associated with HIV/AIDS progression correlate well with longitude. We also find evidence that selection may be acting worldwide to increase the frequencies of alleles that elevate autoimmune disease risk.

This is a paper where jumping to the methods might be useful. Though I’m sure that the authors did not intend it, sometimes it felt as if you were following the marble being manipulated by the carnival tender. Since I was not familiar with some of the terms for the statistics, a simple allusion to the methods without elaborating in detail did not suffice. In any case, the key here is that they focused on the set of SNPs which have been associated with trait variance in GWAS, and compared those to the total SNPs found in the HGDP data set of 53 populations. Note that not all SNPs in GWAS were in the HGDP SNP panel. But for the general questions being asked the intersection of SNPs sufficed. Additionally, they generated a further subset of SNPs which were highly likely to be associated with trait variance. These were SNPs where other SNPs of related function were within 1 MB, or, SNPs which were found in more than one GWAS.

There were four primary statistics within the paper: Delta, Fst, LLC, and iHS. Fst and iHS are familiar. Fst measures the extent of between population variance across a set of populations. High Fst means a great deal of population structure, while Fst ~ 0 means basically no population structure. iHS is a test to detect signatures of natural selection based on patterns of linkage disequilibrium in the genome. Basically the important thing for the purposes of this paper is that iHS tends to be good at detecting alleles at moderate frequencies still presumably going through sweeps. This is in contrast to the older EHH test, which only detects sweeps which are nearly complete. If the authors are focusing on polygenic traits and soft sweeps the likelihood of that showing up on EHH is low since that is predicated on hard, nearly complete, sweeps. LLC measures the correlation between the frequency of a trait-associated variant and latitude or longitude. Presumably this would be useful for smoking out those traits driven by ecological pressures (an obvious example in a general sense is the consistent change in surface-area-to-volume ratio across taxa as organisms proceed from warmer to colder climes). Finally, Delta measures the allele frequency difference between two populations, or two pooled groups of populations. The sign of Delta is simply a function of whether the allele frequency in question is higher in the first or second population in the comparison.
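To give a feel for three of these statistics, here is a rough sketch using textbook definitions: Delta as a simple frequency difference, Fst as (H_T - H_S) / H_T for a biallelic SNP with equally weighted populations, and an LLC-style measure as a plain Pearson correlation between allele frequency and latitude. These are simplified stand-ins and not necessarily the estimators used in the paper, and every number in the example is made up. iHS is left out because it requires phased haplotype data.

```python
import math

def delta(p_first, p_second):
    """Allele frequency difference; positive when the first population is higher."""
    return p_first - p_second

def fst(freqs):
    """Fst = (H_T - H_S) / H_T for one biallelic SNP, weighting populations equally."""
    p_bar = sum(freqs) / len(freqs)
    h_total = 2 * p_bar * (1 - p_bar)  # expected heterozygosity of the pooled population
    if h_total == 0:
        return 0.0  # the SNP is monomorphic everywhere
    h_within = sum(2 * p * (1 - p) for p in freqs) / len(freqs)
    return (h_total - h_within) / h_total

def pearson(xs, ys):
    """Pearson correlation, e.g. between latitude and allele frequency."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical allele frequencies in four populations at increasing latitude.
latitudes = [0.0, 20.0, 40.0, 60.0]
frequencies = [0.10, 0.25, 0.45, 0.70]
latitude_correlation = pearson(latitudes, frequencies)
```

A SNP with identical frequencies everywhere gives Fst of 0, while one fixed for alternate alleles in two populations gives Fst of 1, which matches the verbal description above.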

In doing their comparisons the authors did not simply compare across all 53 populations in a pairwise fashion. Rather, they often pooled continental or regional groups. To the left is a slice of table 1. It shows the populations used to generate the Delta values, and how they were pooled. The HGDP populations are broken down by region in a rather straightforward manner. But also note that some of the comparisons are between populations within regions, and those with different lifestyles. I assume that the comparisons highlighted within the paper were performed with the aim of squeezing maximal informative juice in such an exploratory endeavor. There are no obligate hunter-gatherers within the Eurasian populations in the HGDP data set to my knowledge, so a comparison between agriculturalists and hunter-gatherers would not be possible. There is such a comparison available in the African data set. The authors generated p-values by comparing the GWAS SNPs to random SNPs within the HGDP data set. In particular, they were looking for signatures of distinctiveness among the HGDP data set.
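The logic of generating a p-value by comparison against random SNPs can be sketched as an empirical p-value: ask what fraction of the background SNPs score at least as high as the SNP of interest. The add-one correction below is a common convention that I am assuming for illustration. The paper's exact procedure may differ, and the scores are invented.

```python
def empirical_p_value(observed, background_scores):
    """Share of background SNPs scoring at least as high as the observed SNP,
    with an add-one correction so the estimate is never exactly zero."""
    at_least_as_extreme = sum(1 for score in background_scores if score >= observed)
    return (at_least_as_extreme + 1) / (len(background_scores) + 1)

# Hypothetical background: 100 random SNPs with scores 0..99,
# and a GWAS SNP scoring 95 on the same statistic.
background = list(range(100))
p_value = empirical_p_value(95, background)
```

Here five of the hundred background SNPs match or beat the GWAS SNP, so the empirical p-value is 6/101, or about 0.06.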

Such distinctiveness is expected. The set of SNPs associated with diseases and traits of note is not likely to be a representative subset of the SNPs across the whole genome. Remember that under a neutral model of molecular evolution we should expect most genetic variation within the genome to be due to stochastic forces. Panel A of figure 1 shows that in fact the SNPs derived from GWAS did exhibit a different pattern from the total set of SNPs in the HGDP panel. Observe that the distribution of minor allele frequency (MAF) is somewhat skewed toward higher values for the GWAS SNPs. If the logic of GWAS is geared toward “common variants” which will be frequent enough within the population to generate an effect which is powerful enough to be picked up by the studies given their sample sizes, the bias toward more common variants (higher MAF) is understandable.

To the left are some SNPs and traits which had low p-values (i.e., they deviated from expectation beyond what you’d expect from random noise). Not very surprisingly they found that pigmentation related SNPs tended to show up strongly in all the measures of population differentiation and variation. rs28777 is found in SLC45A2, a locus which differentiates Europeans from non-Europeans. rs1834640 is in SLC24A5, which differentiates Europeans + Middle Easterners + Central/South Asians from other populations. rs12913832 is a “blue eye” related variant. That is, it’s one of the markers associated with blue vs. non-blue eye color differences in Europeans.

Seeing that pigmentation has been one of the few traits which has been well elucidated by the current techniques, it should be expected that more subtle and thorough methods aimed at detecting genetic variation across and within populations should stumble upon those markers first. The authors note that “SNPs and study groups associated with pigmentation and immunological traits made up a majority of those that reached significance in our analysis.” There has long been a tendency toward finding signatures of selection around pigmentation and disease related loci.

One pattern which was also evident in terms of geography in the patterns of low p-values was the tendency for Eurasian groups to be enriched. This is illustrated in figure 2. Most of the SNPs from the GWAS studies were derived from study populations which were European. Because of this there is probably a bias in the set of SNPs being evaluated toward those which are particularly informative for Europeans and related populations. Additionally, it may also be that Eurasians were subject to different selective pressures as they left the ancestral African environment ~150,000-50,000 years B.P. In any case, for purposes of medical analysis the authors did find that using SNPs from East Asian populations produced somewhat different results than using those from European populations. Though some studies have shown a broad applicability of SNPs across populations, there are no doubt many variants in non-European populations which have simply not been detected because GWAS studies are not particularly focused on non-European populations. Consider:

… However, our results indicate that SNPs associated with pigmentation in GWAS display unusual allele frequency patterns almost exclusively in Europe, the Middle East, and Central Asia. This suggests to us that there may be SNPs, perhaps in or near genes other than SLC45A2, IRF4, TYR, SLC24A4, HERC2, MC1R, and ASIP, which are associated with pigmentation in non-Eurasian populations, but which have yet to be identified by GWAS. GWAS for pigmentation traits carried out using non-European subjects are needed to explore this possibility further.

There are two major other classes of trait/disease which were found to vary systematically across the HGDP populations:

- High blood pressure associated variants seemed to decrease with latitude

- Infectious and autoimmune disease SNPs had elevated scores. Specifically, there were some HIV related SNPs associated with Europeans which seem to confer resistance

The first set of traits would naturally come out of GWAS derived SNPs, since so much medical research goes into identifying risk and treating high blood pressure and other circulatory ailments. A consistent pattern where geography and not ancestry predicts variation is an excellent tell for exogenous selective pressures. The physical nature of the earth is such that as mammals spread away from the equator their physiques will be reshaped by different sets of ecological parameters. Siberian populations have developed adaptations to cold stress, and there seem to be consistent cross-taxa shifts in body form to maximize or minimize heat radiation among mammals.

In the second case you have resistance to disease cropping up again, as well as pleiotropy, whereby genetic changes can have multiple downstream consequences. Often this is temporally simultaneous; consider the tame silver foxes. But sometimes you have a change in the past which has a subsequent consequence later in time due to different selective pressures. It is not that surprising that immunological responses can be multi-purpose, so even though Europeans did not develop resistance to HIV as a general selective pressure, similar pressures seem to have resulted in responses with general utility and now a specific use in relation to HIV. Selection can often be a blunt instrument, interposing itself into a network of interactions with multiple consequences, reshaping many traits simultaneously in the process of maximizing local fitness. This is most clear when you have a trait such as sickle-cell disease, which emerges only because the fitness benefit of heterozygosity is so great. But no doubt when it comes to many traits the byproducts are more subtle, or may seem cryptic to us. We still do not know why EDAR was driven to higher frequency in East Asians (less body odor and thick straight hair seem implausible targets for selection).

And just as natural selection can be blunt and rude in its impact on the covariance of genes and traits, so its relaxation may remove a suffocating vise. Consider the possibilities with blood pressure: perhaps the reason that northern Eurasians have lower blood pressure is that selection for other correlated traits associated with higher values was relaxed, allowing for fitness to be maximized in this particular dimension. Similarly, African Americans have a lower frequency of sickle-cell disease than their ~80% West African ancestry would entail, because without the pressure of endemic malaria selection for the heterozygote was removed, allowing for the purging of the allele from the gene pool.
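The sickle-cell logic can be sketched with the standard one-locus model of heterozygote advantage. If the two homozygotes have fitnesses 1 - s and 1 - t relative to the heterozygote, the allele carried by the 1 - t homozygote settles at a frequency of s / (s + t). Remove the heterozygote advantage (s = 0) and selection simply purges the allele. The selection coefficients below are illustrative assumptions, not measured values for any real population.

```python
def next_allele_frequency(q, s_normal, t_sickle):
    """One generation of selection at a single locus.
    Fitnesses: normal homozygote 1 - s_normal, heterozygote 1,
    sickle homozygote 1 - t_sickle. q is the sickle allele frequency."""
    p = 1.0 - q
    w_normal, w_het, w_sickle = 1.0 - s_normal, 1.0, 1.0 - t_sickle
    mean_fitness = p * p * w_normal + 2 * p * q * w_het + q * q * w_sickle
    return (p * q * w_het + q * q * w_sickle) / mean_fitness

def simulate(q0, generations, s_normal, t_sickle):
    """Iterate the model and return the allele frequency at the end."""
    q = q0
    for _ in range(generations):
        q = next_allele_frequency(q, s_normal, t_sickle)
    return q

# Hypothetical coefficients: malaria costs normal homozygotes 10% of fitness,
# sickle-cell disease costs affected homozygotes 80%.
S_MALARIA, T_DISEASE = 0.1, 0.8
with_malaria = simulate(0.01, 2000, S_MALARIA, T_DISEASE)
predicted_equilibrium = S_MALARIA / (S_MALARIA + T_DISEASE)

# Relax the malaria pressure (s = 0) and follow the allele for 10 generations.
after_relaxation = simulate(with_malaria, 10, 0.0, T_DISEASE)
```

Under these toy numbers the allele sits near 11% while malaria is endemic and then declines once the heterozygote advantage disappears, which is the qualitative pattern described above.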

Nevertheless, the authors do conclude:

Despite our broad-based approach, we found only a few examples of what may be a polygenic response to a single selective pressure. We did use stringent significance criteria which might mean that additional examples can be found among the study groups that did not quite meet our threshold of significance. It may also be that there is something about “GWAS” traits and their underlying genetics that served to undermine our approach.

They have several suggestions for why this didn’t pan out:
- The GWAS variants aren’t the primary source of the variation. It could be copy number variants, rare large effect variants (“synthetic”)

- Epistasis. Gene-gene interaction, which would mask or confound linear associations between variants and traits

- Low impact of selection on GWAS SNPs, or, balancing or negative selection

They finish:

In summary, we have examined 1,336 trait-associated SNPs in the 53 CEPH-HGDP populations looking for individual SNPs and groups of SNPs with unusual allele frequency patterns and elevated iHS scores. We identified 13 different traits with an associated SNP or study group that produced a significantly elevated score for at least one delta, Fst, LLC, or iHS measure, a small percentage of the total number of traits analyzed. We believe that the limited number of positive results could be due to our stringent significance criteria or to features of the genetic architecture of the traits themselves. Specifically, the roles of rare variants, epistasis, and pleiotropy in human complex traits are, although areas of active inquiry, still generally not well understood. Our measures may also not be optimal for detecting all types of selection acting on GWAS traits. It has been speculated that variants underlying complex traits will be influenced primarily by negative or balancing selection, which may not produce extreme values for our measures, particularly if these forces are relatively uniform across populations or are acting on many regions in the genome.
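For readers who have not met Fst, here is a minimal sketch of the simplest heterozygosity-based version for one biallelic SNP. It assumes equally weighted populations and is not the estimator the authors used; the frequencies are made up.

```python
def fst(freqs):
    """Fst = (H_T - H_S) / H_T for one biallelic SNP.

    freqs: allele frequency of the same allele in each population
    (populations weighted equally, which is an assumption of this sketch).
    """
    p_bar = sum(freqs) / len(freqs)
    h_total = 2 * p_bar * (1 - p_bar)                           # pooled heterozygosity
    h_within = sum(2 * p * (1 - p) for p in freqs) / len(freqs)  # mean within-population
    if h_total == 0:
        return 0.0  # allele fixed or absent everywhere: no differentiation to measure
    return (h_total - h_within) / h_total

# Hypothetical frequencies in two populations.
print(fst([0.5, 0.5]), fst([0.2, 0.8]), fst([0.0, 1.0]))
```

A SNP whose frequencies match across populations scores zero, and one fixed for different alleles scores one. Values which are unusually high for a trait-associated SNP are the sort of signal the paper was scanning for.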

If selective pressures on polygenic traits are so common perhaps genomicists are going to be thumbing through Introduction to Quantitative Genetics. These are traits and evolutionary processes which lack clear distinction. In many ways modeling positive selection and hard sweeps resembles the economics of equilibriums. When it comes to continuous and quantitative traits subject to the effect of many genes a different way of thinking has to come to the fore. The transient no longer becomes a punctuation between the stasis, but the thing in and of itself. There are for example HLA genes in humans which are found in chimpanzees, because the nature of the eternal race between host and pathogen means that all the old tricks are preserved, at least at low frequencies. Human variation in intelligence, height, and all sorts of other liabilities and characteristics, may have always been with us, being buffeted continuously by a swarm of selective pressures. The question is, can our crude statistical methods ever get a grip on this diffuse but all-powerful net?

Citation: Casto AM, & Feldman MW (2011). Genome-Wide Association Study SNPs in the Human Genome Diversity Project Populations: Does Selection Affect Unlinked SNPs with Shared Trait Associations? PLoS Genetics : 10.1371/journal.pgen.1001266

(Republished from Discover/GNXP by permission of author or representative)
 

Over the past decade evolutionary geneticist Mike Lynch has been articulating a model of genome complexity which relies on stochastic factors as the primary motive force by which genome size increases. The argument is articulated in a 2003 paper, and further elaborated in his book The Origins of Genome Architecture. There are several moving parts in the thesis, some of which require a rather fine-grained understanding of the biophysical structural complexity of the genome, the nature of Mendelian inheritance as a process, and finally, population genetics. But the core of the model is simple: there is an inverse relationship between long term effective population size and genome complexity. Low individual numbers ~ large values in terms of base pairs and counts of genetic elements such as introns.


A quick reminder: effective population size roughly denotes the number of individuals in the population who contribute genes to the next generation. So, in the case of insects with extremely high mortality in the larval stage the effective population size may be orders of magnitude smaller than the census size at any given generation evaluating over all stages of life history. In contrast, with humans a much larger proportion of children end up contributing to the genetic makeup of the subsequent generation. With large organisms I’ve heard you can sometimes use a rule of thumb that effective population size is ~1/3 of census size, though this probably overestimates the effective population size. One reason is reproductive variance, which reduces the effective population because many individuals contribute far less to the next generation than other individuals. The greater the variance, the more evolutionary genetic variation is impacted by a few individuals within the population at a given generation, reducing the effective population which contributes to the next (the reproductive variance is often assumed to be Poisson, but that is likely an underestimate). Additionally, there is the issue of variation over time. Long term effective population is much more sensitive to low bound values than high bound values, so it is liable to be much smaller than the census size at any given period for a species which goes through cycles. Humans for example have a relatively small long term effective population size evaluated over the past 100,000 years because we seem to have expanded from a small initial population. Mathematically, since long term effective population size is given by the harmonic mean, it stands to reason that low bound values would be critical. If that doesn’t make sense to you, remember the outsized impact which population bottlenecks may have on the long term trajectory of a species, in particular by removing genetic variation.
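To see how hard the harmonic mean punishes low values, here is a small sketch. The population sizes are invented: nine generations at 10,000 and a single bottleneck generation at 100.

```python
def long_term_ne(sizes):
    """Harmonic mean of per-generation effective population sizes."""
    return len(sizes) / sum(1.0 / n for n in sizes)

# Hypothetical history: one bottleneck generation among ten.
sizes = [10_000] * 9 + [100]

arithmetic = sum(sizes) / len(sizes)
harmonic = long_term_ne(sizes)

print(f"arithmetic mean: {arithmetic:.0f}")
print(f"harmonic mean (long term Ne): {harmonic:.0f}")
```

One generation out of ten at a hundredth of the usual size pulls the long term value down by roughly an order of magnitude, while the arithmetic mean barely notices.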

How does this influence genome complexity? Basically Lynch’s thesis is that when you reduce effective population you dampen the power of natural selection, specifically purifying selection, to prevent the addition of non-adaptive complexity through random processes. It isn’t that selection is rendered moot, rather, its signal is overwhelmed by the noise. Here’s the abstract of his 2003 paper:

Complete genomic sequences from diverse phylogenetic lineages reveal notable increases in genome complexity from prokaryotes to multicellular eukaryotes. The changes include gradual increases in gene number, resulting from the retention of duplicate genes, and more abrupt increases in the abundance of spliceosomal introns and mobile genetic elements. We argue that many of these modifications emerged passively in response to the long-term population-size reductions that accompanied increases in organism size. According to this model, much of the restructuring of eukaryotic genomes was initiated by nonadaptive processes, and this in turn provided novel substrates for the secondary evolution of phenotypic complexity by natural selection. The enormous long-term effective population sizes of prokaryotes may impose a substantial barrier to the evolution of complex genomes and morphologies.

The implication here is that prokaryotes with massive population sizes are biased toward smaller genomes by the more efficacious application of natural selection. In contrast, more complex organisms which have smaller population sizes, and so are more impacted by the random fluctuations generation to generation due to sampling variance, are less streamlined genomically because selection can do only so much against the swelling sea of noise. One intriguing argument of Lynch is that the genomic complexity is then later useful downstream as the building block of phenotypic complexity, but let’s set that aside for now.
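The signal-versus-noise point can be illustrated with Kimura's standard diffusion approximation for the fixation probability of a new mutation. This is a generic population genetics sketch, not a calculation from Lynch's paper: it assumes a diploid population whose census and effective sizes are equal, and the selection coefficient is an invented value for a slightly deleterious insertion.

```python
import math

def fixation_probability(n_e, s):
    """Kimura's approximation for a new mutation at frequency 1/(2N), with N = Ne."""
    if s == 0:
        return 1.0 / (2 * n_e)          # neutral case
    exponent = -4.0 * n_e * s
    if exponent > 700:                  # exp would overflow; probability is effectively zero
        return 0.0
    return math.expm1(-2.0 * s) / math.expm1(exponent)

s = -1e-4                               # hypothetical cost of a bit of excess DNA
for n_e in (100, 10_000, 1_000_000):
    neutral = 1.0 / (2 * n_e)
    relative = fixation_probability(n_e, s) / neutral
    print(f"Ne = {n_e:>9,}: fixation chance relative to neutral = {relative:.3g}")
```

With a small effective population the slightly harmful variant fixes almost as often as a neutral one, so drift is in charge. With a large effective population the same variant has essentially no chance, which is the streamlining the model attributes to prokaryotes.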

A new paper in PLoS Genetics challenges the statistical analysis of the original data which Lynch et al. used to make their case. Technically the argument was that there was an inverse relationship between Neu and genome size. Ne is effective population size, and u is the nucleotide mutation rate. Though the argument is technical, the basic objection should be easy to understand: there are other variables which may actually be responsible for the correlation which Lynch et al. discerned. To the paper, Did Genetic Drift Drive Increases in Genome Complexity?:

Genome size (the amount of nuclear DNA) varies tremendously across organisms but is not necessarily correlated with organismal complexity. For example, genome sizes just within the grasses vary nearly 20-fold, but large-genomed grass species are not obviously more complex in terms of morphology or physiology than are the small-genomed species. Recent explanations for genome size variation have instead been dominated by the idea that population size determines genome size: mutations that increase genome size are expected to drift to fixation in species with small populations, but such mutations would be eliminated in species with large populations where natural selection operates at higher efficiency. However, inferences from previous analyses are limited because they fail to recognize that species share evolutionary histories and thus are not necessarily statistically independent. Our analysis takes a phylogenetic perspective and, contrary to previous studies, finds no evidence that genome size or any of its components (e.g., transposon number, intron number) are related to population size. We suggest that genome size evolution is unlikely to be neatly explained by a single factor such as population size.

In the original analysis by Lynch et al. ~66% of the variation in genome size was explained by Neu! That’s a pretty large effect. Figure 1 illustrates how phylogeny could be a confound in adducing a relationship. Here’s some of the text which explains the figure:

In this hypothetical example, eight species have been measured for two traits, x and y, as indicated by pairs of values at the tips of the phylogenetic tree (A). Ordinary least-squares linear regression (OLS) indicates a statistically significant positive relationship (B; r-squared = 0.62, P = 0.02), potentially leading to an inference of a positive evolutionary association between x and y. However, inspection of the scatterplot (B) in relation to the phylogenetic relationships of the species (A) indicates that the association between x and y is negative for the four species within each of the two major lineages. Regression through the origin with phylogenetically independent contrasts…which is equivalent to phylogenetic generalized least squares (PGLS) analysis, accounts for the nonindependence of species and indicates no overall evolutionary relationship between the traits…The apparent pattern across species was driven by positively correlated trait change only at the basal split of the phylogeny; throughout the rest of the phylogeny, the traits mostly changed in opposite directions (A; basal contrast in red)….

The argument then seems to be that the relationship in the original work by Lynch was an artifact due to the evolutionary history of the species which he surveyed to infer the relationship. Instead of a general principle or law then what you have is an outcome of contingent historical processes. Not very neat and clean. You can see the taxa-clustered nature of the relationship in figure 1 from the 2003 paper in Science:

[Figure 1 from the 2003 paper in Science]

OK, now let’s look at the visualization of the same data set from this paper, as a tree to illustrate the correlations:

[Figure: the same data set displayed as a tree]

[Figure: OLS and PGLS scatterplots]

The last figure shows the difference between a scatterplot using conventional OLS regression, and the phylogenetic least squares model (PGLS). You go from an obvious linear relationship, which translated into the high r-squared noted above, to basically nothing (r-squared near zero, no statistical significance).
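The logic of the hypothetical eight-species example can be reproduced in a few lines with Felsenstein's independent contrasts. To be clear, the tree, the branch lengths (all set to 1), and the trait values below are made up by me; they are not the numbers from the paper's figure, just a toy with the same shape: two clades, a negative relationship within each, and a positive jump at the basal split.

```python
import math

def tip(name, length=1.0):
    return (name, length)

def join(left, right, length=1.0):
    return ((left, right), length)

def contrasts(node, values, out):
    """Append standardized contrasts to out; return (node value, adjusted branch length)."""
    content, length = node
    if isinstance(content, str):
        return values[content], length
    x1, v1 = contrasts(content[0], values, out)
    x2, v2 = contrasts(content[1], values, out)
    out.append((x1 - x2) / math.sqrt(v1 + v2))
    ancestor = (x1 / v1 + x2 / v2) / (1.0 / v1 + 1.0 / v2)
    return ancestor, length + v1 * v2 / (v1 + v2)

def ols_slope(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)

def origin_slope(cx, cy):
    """Regression through the origin, as used with independent contrasts."""
    return sum(a * b for a, b in zip(cx, cy)) / sum(a * a for a in cx)

# Hypothetical tree and traits: two clades of four species each.
names = ["a1", "a2", "a3", "a4", "b1", "b2", "b3", "b4"]
x = dict(zip(names, [1, 2, 3, 4, 5, 6, 7, 8]))
y = dict(zip(names, [4, 3, 2, 1, 8, 7, 6, 5]))
clade_a = join(join(tip("a1"), tip("a2")), join(tip("a3"), tip("a4")))
clade_b = join(join(tip("b1"), tip("b2")), join(tip("b3"), tip("b4")))
tree = join(clade_a, clade_b)

cx, cy = [], []
contrasts(tree, x, cx)
contrasts(tree, y, cy)

ols = ols_slope([x[n] for n in names], [y[n] for n in names])
pic = origin_slope(cx, cy)
print(f"OLS slope across species: {ols:.3f}")
print(f"slope from independent contrasts: {pic:.3f}")
```

Treating the eight species as independent points gives a clearly positive slope, while the contrasts, which count the basal split as one evolutionary event among seven, give a slope near zero.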

The paper itself isn’t that long, and the objection is pretty straightforward. They’re simply claiming that Lynch didn’t correct for an obvious alternative explanation/confound, and that we don’t know what we thought we knew. Additionally, there is the assertion that the idea that effective population size predicts genome size robustly is becoming conventional wisdom within the scientific community. I don’t know about that; this seems like such a young field in flux that I think they oversold how widespread this assumption is to make the force of their rebuttal more critical. Certainly the patterns in genome size can be quite perplexing, but my intuition is that an r-squared on the order of 2/3 of the variation in genome size being explained by one predictor variable is rather astounding. Obviously genome size is pretty easy to get in the “post-genomic era,” but Ne and u are harder to come by for many taxa, or even within a given taxon for a set of species of interest. It looks to me like an opportunity for experimental evolutionists, who can control the confounds, and observe changes within a lineage. And yet even if Neu is predictive as an independent variable all things controlled, what if all things are not usually controlled, and random acts of phylogenetic history are more important? Mike Lynch is credited in the acknowledgements, so I assume we’ll be seeing a response from him in the near future.

Citation: Whitney KD, & Garland T Jr (2010). Did Genetic Drift Drive Increases in Genome Complexity? PLoS Genetics : 10.1371/journal.pgen.1001080

(Republished from Discover/GNXP by permission of author or representative)
 
Razib Khan
About Razib Khan

"I have degrees in biology and biochemistry, a passion for genetics, history, and philosophy, and shrimp is my favorite food. If you want to know more, see the links at http://www.razib.com"