The Unz Review: An Alternative Media Selection
A Collection of Interesting, Important, and Controversial Perspectives Largely Excluded from the American Mainstream Media
Razib Khan
Gene Expression Blog
Evolutionary Genomics


In the early 1970s the eminent evolutionary geneticist Richard C. Lewontin wrote that population genetics “was like a complex and exquisite machine, designed to process a raw material that no one had succeeded in mining.” By this, Lewontin meant that in the 1930s, when R. A. Fisher, Sewall Wright and J. B. S. Haldane established the theoretical foundations of the field, the techniques to discover the variation in populations needed to test their suppositions were rather thin (naturally, this resulted in many controversies, see The Origins of Theoretical Population Genetics). Geneticists were using classical methods, utilizing salient phenotypes which were proxies for underlying genetic markers, and tracing patterns of co-inheritance of traits with known locations in the genetic map with novel mutants. Researchers were not even clear at that point as to the underlying biochemical structure of the particle of Mendelian inheritance, what we term DNA. That arrived onto the scene in the 1960s. But in the early 1970s when the above was written we’re not talking about DNA sequencing. Rather, this is the allozyme era, which Lewontin helped usher in with a paper in 1966. He expresses the excitement of the times later in the passage:

“Quite suddenly the situation has changed. The mother-lode has been tapped and facts in profusion have been poured into the hoppers of this theory machine. And from the other end has issued–nothing. It is not that the machine does not work, for a great clashing of gears is clearly audible, if not deafening, but it somehow cannot transform into a finished product the great volume of raw material that has been provided.”

Despite the pessimism expressed above, the emergence of molecular evolution stimulated the debates around neutral theory. Over a generation ago evolutionary geneticists were grappling with the swell of data which was confronting theoretical frameworks constructed in the early 20th century. Today we live in the “post-genomic” era, and now think in terms of whole genomes. The details may differ, but many of Lewontin’s observations in the 1970s still hold true, as novel results meet the paradigms of old. Last month in PNAS Brian Charlesworth published a paper which brought this to mind, Causes of natural variation in fitness: Evidence from studies of Drosophila populations. You may know Charlesworth as the coauthor of Elements of Evolutionary Genetics, an encyclopedia of a text which I highly recommend to all. In the paper, which is both a review for those of us not steeped in Drosophila genetics and a distillation of derivations to be found in the supplements, Charlesworth notes that there is a contradiction between the typical selection coefficients inferred for deleterious alleles from population genomics and those inferred from quantitative genetics. Population genomics is a new field, and involves sequencing many markers (often whole genomes) to good accuracy across a reasonable number of individuals. Quantitative genetics is a more classical framework utilizing statistical methods which interpret variation in traits within laboratory populations.

The fruit fly has a storied role in Mendelian genetics. To a great extent the study of the fruit fly is the early history of Mendelian genetics (see Lords of the Fly: Drosophila Genetics and the Experimental Life). Therefore it is natural that a large body of research exists in this area, and one can’t accept novel results obtained through new methods such as genomics at face value without some degree of skepticism. Charlesworth notes that the extremely small fitness effects of the mutations discovered via genomic methods are biased toward single nucleotide variants (SNVs), that is, point mutations. In contrast it seems likely that the larger effect mutations implied by quantitative genetic studies, which are rather rare, and so missed in population genomic sample sizes, are due to transposable elements (TEs) interspersing themselves across the genome, and presumably disrupting function. In line with older theoretical models, most of the variation in fitness is due to a small number of mutations. Presumably as genomic methods get better (e.g., longer reads to catch repeat elements and larger sample sizes) they will converge upon the older established quantitative genetic methods. One other interesting result in this paper is that much of the variation is due to balancing selection. For theoretical reasons balancing selection can not be pervasive across the genome (too much fitness variation would result in huge death rates per generation), but, of the variation within the population much of it is maintained by balancing selection according to Charlesworth. Another interesting dynamic is that the population genomic methods seem to be better at capturing the distribution of fitness effects in humans, because of our smaller effective population size. You can read the paper for the technical reason why, but the key here is to remember that one has to be careful about extrapolating from model organisms. The models are imperfect, and we need to take care never to outrun our ability to generalize.

As genomics becomes pervasive in population genetics this sort of analysis will be more common. Rather than “genome-of-the-week” papers we’ll move to actually trying to grapple with what the sequence data is telling us specifically about the lineage in question, and, what we can generalize from the results about evolution writ large. Some organisms have a long history of scientific study, so population genomics will supplement and complement. In other cases though organisms do not have such a rich literature and scientific culture, and the pitfalls that are highlighted here might alert us to the deficiencies in genomic methods.

Citation: Charlesworth, Brian. “Causes of natural variation in fitness: Evidence from studies of Drosophila populations.” Proceedings of the National Academy of Sciences (2015): 201423275.



When, how, and why different lineages of the tree of life diverged has long been a preoccupation of evolutionary science. Now one must add to that a caveat that it seems a great deal of the story also has to do with the entanglement of branches which were long separated. Paleontology has looked at the macroevolutionary patterns, and attempted to move from description to formal models which scaffold the long progress of natural history. Phylogenetics has painted the branches of the tree in loving detail, and attempted to infer patterns from the shape and pulses of the diversification. Population genetics has focused upon the microevolutionary parameters which shape the flux of the genetic makeup of particular lineages: drift, mutation, and selection. Now you have new fields such as population genomics, which fuse 21st century technologies with the questions and theoretical machinery of 20th century disciplines (in this case, population genetics, just as phylogenomics is an extension of phylogenetics).

Liu, Shiping, et al. "Population Genomics Reveal Recent Speciation and Rapid Evolutionary Adaptation in Polar Bears." Cell 157.4 (2014): 785-794.


Because of the monetary investment by organizations such as the NIH (among other factors) the -omics revolution has hit Homo sapiens first. But it is moving on, and that is important, because evolutionary science really can’t constrain itself purely to the human domain. Ultimate questions such as why there are so many species require actually surveying the nature of variation in the world out there. Nevertheless, currently most of the post-human work seems to be occurring in the classical ‘model organisms’ (e.g., Drosophila), or charismatic creatures, especially big mammals. A new paper in Cell, of all journals, is in the second class, Population Genomics Reveal Recent Speciation and Rapid Evolutionary Adaptation in Polar Bears. As you can infer from the title the paper looks at both the phylogenetic history of polar and brown bears and the evolutionary genetic functional differences between the two distinct lineages. As you can see their sampling coverage was limited to particular populations, which is reasonable in light of finite sequencing resources. They had 10 brown bears and 79 polar bears, with good coverage on a lot of them (~30x not atypical). The inferences necessarily derive from these populations, though they admit in the text you can only go so far with their limited geographic coverage.

Using a variety of methods (IBS tracts and ∂a∂i) they found that polar bears and brown bears (or at least the ones in their sample) diverged on the order of 500,000 years ago into two populations. More precisely, 479,000 to 343,000 years ago. This overlaps with the fossil evidence. It translates to a separation between the ancestral populations about 20,000 generations ago. The authors state:

… the distinct adaptations of polar bears may have evolved in less than 20,500 generations; this is truly exceptional for a large mammal. In this limited amount of time, polar bears became uniquely adapted to the extremities of life out on the Arctic sea ice, enabling them to inhabit some of the world’s harshest climates and most inhospitable conditions.

This seems a little hyperbolic to me. In fact the Neandertal-modern human divergence is only about half as far back in the past in generation time, and one could argue that our two lineages were pretty diverged as well. That being said, obviously there are huge visible and physiological differences between polar and brown bears. They include in their model estimates of effective population declines in the past, presumably due to the exigencies of the Pleistocene glaciations. Using paleontological results already known they suggest that the emergence and derivation of the polar bear lineage occurred during a period of separation from the ancestors of brown bears. In other words, allopatric speciation. In line with earlier work they also report evidence of long term gene flow between the two lineages, in particular, gene flow from polar bears to brown bears. This seems to be an old and continuous event which has become attenuated of late (they didn’t detect the sort of long haplotypes indicative of recent admixture).

A note of caution again, as the samples here are geographically limited. But using measures such as D-statistics, which attempt to infer patterns of admixture between populations, it does seem that the initial conclusion about decreased effective population implies expansion from small initial founder groups for modern extant lineages. One wonders if this is a commonality with large mammals which have been shaped by repeated glaciation events. Obviously I’m including humans here, but for humans we have a lot of evidence from ancient DNA that in fact there has been a lot of replacement.
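
For readers who have not met D-statistics before, here is a minimal sketch of the ABBA-BABA counting behind them. It assumes four aligned sequences ordered as (P1, P2, P3, outgroup), treats the outgroup base as ancestral, and uses only sites with exactly two alleles. The function name and the toy sequences are my own illustration, not anything taken from Liu et al., and real analyses work from allele frequencies with block-jackknife standard errors rather than raw counts.

```python
def d_statistic(p1, p2, p3, outgroup):
    """Patterson's D from four aligned sequences of equal length.

    ABBA sites: P2 and P3 share the derived allele.
    BABA sites: P1 and P3 share the derived allele.
    """
    abba = 0
    baba = 0
    for a, b, c, o in zip(p1, p2, p3, outgroup):
        if len({a, b, c, o}) != 2:
            continue  # skip invariant and multi-allelic sites
        if a == o and b == c and b != o:
            abba += 1
        elif b == o and a == c and a != o:
            baba += 1
    total = abba + baba
    if total == 0:
        return 0.0  # no informative sites
    return (abba - baba) / total


# Hypothetical toy alignment: three ABBA sites, one BABA site.
toy_d = d_statistic("AAAGAG", "GGGAAG", "GGGGAA", "AAAAAA")
print(toy_d)
```

With no gene flow the two site patterns should be about equally common and D should hover near zero. A value well above zero points to excess allele sharing between P2 and P3, and a value well below zero to excess sharing between P1 and P3.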

Perhaps more thoroughly persuasive is the evidence they report in the paper that the polar bear exhibits lots of evolutionary change from its ancestors in particular functional regions. Polar bears are highly carnivorous, and exhibit lots of morphological and metabolic differences from brown bears. In short, it is as if brown bears were put on a very high fat diet. The functional regions which indicate signatures of selection in polar bears don’t have corresponding hits in brown bears, which isn’t surprising. They’re adapted to different conditions. Additionally a lot of these changes in polar bears are inferred to be harmful in humans. Fast evolution often occurs by breaking things; loss of function. So not surprising. The question, then, is how polar bears function. Also, I wonder if brown bears themselves are derived in a manner which we don’t understand yet (the sample here is skewed toward polar bears). Though brown bears are generalists, so I presume that they’re probably closer to the ancestral morphology.

They conclude intriguingly:

…Such a drastic genetic response to chronically elevated levels of fat and cholesterol in the diet has not previously been reported. It certainly encourages a move beyond the standard model organisms in our search for the underlying genetic causes of human cardiovascular diseases.

As Sydney Brenner would say, we’ve learned enough (or not) about mouse diseases.

Citation: Liu, Shiping, et al. “Population Genomics Reveal Recent Speciation and Rapid Evolutionary Adaptation in Polar Bears.” Cell 157.4 (2014): 785-794.

Citation: Freedman, Adam H., et al. "Genome sequencing highlights the dynamic early history of dogs." PLoS Genetics 10.1 (2014): e1004016.


The more we scratch beneath the surface with powerful genomic techniques, the more we see that natural history which we had presumed to have a crisp understanding of is quite a bit more muddled. Once the muddle clears what we’ll gain is the gift of accurate complexity, but in many areas right now there is little such clarity. It is a truth that a new discovery or inference does not mean that there are enough points in space to construct a new explanatory constellation when the old does not suffice. Due to the biomedical focus of modern genomics there has been a disproportionate focus on humans, but over time it is clear that this will expand out across the tree of life, and the light shall give way to a temporary fog. First up are organisms of particular human interest and/or model organisms (the latter are species which are useful for elucidating general biological phenomena, and the subjects of study of a large community of researchers). Domestic dogs have the virtue of falling into both categories.


Red Basenji

There are many theories about the origins of our “best friend.” One school of thought (though not necessarily dominant) is that dogs are relatively recent obligate companions of humanity, part of the toolkit of the Neolithic revolution. To be fair this view was rejected by many researchers on the common sense grounds that dogs arrived with the Amerindians 10,000-15,000 years ago. These were clearly hunter-gatherer populations which predated the Neolithic. But there was some genomic research which did imply that even if there were early domestication events, the preponderance of modern domestic dog ancestry dated to the Middle East ~10,000 years before the present. The newest work in genomics seems to falsify that hypothesis rather robustly. These researchers have shown how looking closely and thoroughly at the whole genomes (billions of base pairs) of organisms, as opposed to a subset of polymorphisms (on the order of tens or hundreds of thousands of base pairs), can yield deeper historical insight.

A new paper out in PLOS GENETICS, Genome Sequencing Highlights the Dynamic Early History of Dogs, has been out as a preprint for a while now, but it seems useful to review what it tells us we now know, and don’t know. As illustrated by the figure above a key element of the revised natural history of the domestic dog must include a minimal level of complexity in the phylogenetic origins of the species. A caricature of the simplest story about the origin of the dog is that it is a tamed wolf. Highly derived from the ancestral state (many characteristics have shifted from the last common ancestor with wolves), but a wolf nonetheless. This idea needs updating because the work in the paper above highlights that extant wolves are not perfect representatives of Pleistocene wolf populations, from which dogs derive. This was already clear with some ancient DNA, but looking at whole genomes of three wolves from disparate regions of Eurasia, a West African Basenji, and an Australian Dingo (along with the Boxer as a reference domestic dog genome and a Golden Jackal as an outgroup), a major finding seems to be that modern dogs derive from a population of wolves which are not represented in the populations sampled above. This is important because many inferences about dogs are made simply by assuming that modern wolves are appropriate proxies for the last common ancestor of both lineages.

This substitution seems to be rather shakier than we’d have thought, and this comes into play most obviously in the genetic diversity and bottleneck results we’d take for granted. If modern wolves are the standard for the ancestral population from which dogs derive then the bottleneck is a relatively mild one of a few fold drop in size (wolves are more diverse, but not that much more diverse). But what the authors above found by looking at patterns of genetic diversity across the whole genomes of these wolves is that all three, sampled from Croatia, Israel, and China, also exhibit evidence of a population bottleneck. This makes more sense of the result that it looks as if modern dog lineages underwent a population bottleneck of roughly one order of magnitude (16 fold). The timing using different methods also definitely predates the Neolithic revolution ~10,000 years ago, and so aligns with the archaeological evidence. Wolves were the companions of hunter-gatherers first before they were associated with farmers. Any possible adaptation of dogs to a starchy diet occurred after the initial bottleneck and separation of the ancestors of this lineage from the ancestors of modern wolves (who seem to have enough variation to have had this trait as part of the ancestral range in any case). Additionally, there are dog lineages, such as the Dingo, which don’t exhibit any adaptation to starch diets, which makes historical sense as they did not coexist with agricultural populations until recently.

I do want to caution that genomics does not change everything. Many of the broad outlines of what was known before with classical genetic techniques, comparative anatomy, and paleontology, do hold up. For example domestic dogs do seem to form a monophyletic lineage. By this one simply means that domestic dogs the world over seem to share a small set of common ancestors, rather than being instances of convergent morphological evolution from disparate wolf lineages. What is more surprising though is that these results imply reciprocal monophyly with wolves. This means that domestic dogs are not a specialized branch of a particular population of modern wolves, but a sister lineage to contemporary wolves. Though it is common to say that a dog is just a tamed wolf, one might as easily state that a wolf is a wild dog (yes, I will grant that the dog is likely more derived, but I don’t think we can just substitute modern wolves for ancient ones and call it good). Both are subsets of a wider range of canid ancestors which flourished in the Pleistocene. The tests of admixture of particular lineages suggest that the origins of dogs involved gene flow with local wolf lineages. This would confound attempts to ascertain a particular zone of domestication or adaptation, as prior genetic affinities or clines in diversity may be due to gene flow rather than patterns of descent (earlier attempts to assert that domestic dogs derive from the Middle East or China may be premised on false assumptions, as well as limitations of less dense marker sets than whole genomes).

The main drawback of this study is obviously the limited sample size. It is freely acknowledged in the paper, but that is why the authors also attempted to select individuals from populations which were highly informative, both geographically and culturally (e.g., Dingoes are outside the range of the wolf, and not coexistent with ancient agricultural populations). I am more skeptical about assuming that the wolf samples are representative than I am about their selection of three dog lineages (Basenji, Dingo, and the Boxer reference). We know a lot more about the genetics and history of dogs than we do about wolves, and it seems more likely that there are going to be more surprising loose ends in the case of the latter than the former. But if I had to bet I’d say the authors are right, and their inferences are going to hold up (reciprocal monophyly, the bottleneck in wolves, etc.). Yet there’s no doubt going to be a lot of detail added to this model as the sample sizes increase, and ancient DNA is included in the analysis. Though recent studies seem to establish rather clearly that domestication was a function of the later Pleistocene (and not the Holocene) in the case of dogs, the exact details of where, when, and who, are still quite woolly.

But the ultimate big picture is emphasized by the title above: the Pleistocene is going to seem like a strange country after all is said and done. Many of the organisms which are going to be sequenced in great depth (high coverage) and large sample sizes first are mammals of Palearctic origin which were shaped by the Pleistocene. The importance of this geological period for humans has long been a subject of scholarly attention, but genomics and the light it sheds upon quirks of natural history, might emphasize the ecology-wide reshaping role that Ice Ages had upon the natural history of so many familiar and charismatic species. This is where genomics will open the door to evolutionary ecology of grand scope.

Citation: Freedman, Adam H., et al. “Genome sequencing highlights the dynamic early history of dogs.” PLoS Genetics 10.1 (2014): e1004016.

Related: Please see a post from one of the authors at Haldane’s Sieve.


A Tree of Life

Evolutionary processes which play out across the tree of life are subject to distinct dynamics which can shape and influence the structure and characteristics of individuals, populations, and whole ecosystems. For example, imagine the phylogeny and population genetic characteristics of organisms which are endemic to the islands of Hawaii. Because the Hawaiian islands are an isolated archipelago the expectation is that lineages native to the region are going to be less shaped by the parameter of migration, or gene flow between distinct populations, than might otherwise be the case. Additionally, presumably there was a “founding” event of these endemic Hawaiian lineages at some distant point in the past, so another expectation is that most of the populations would exhibit evidence of having gone through a genetic bottleneck, where the power of random drift was sharply increased for several generations. The various characteristics, or states, which we see in the present in an individual, population, or set of populations, are the outcome of a long historical process, a sequence of precise events. To understand evolution properly it behooves us to attempt to infer the nature and magnitude of these distinct dynamic parameters which have shaped the tree of life.
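
To make the bottleneck point concrete, here is a small sketch of the textbook expectation for how drift erodes heterozygosity in an idealized (Wright-Fisher) population: H_t = H_0 × (1 − 1/(2N))^t. The population sizes and starting heterozygosity below are hypothetical numbers chosen for illustration, not estimates for any Hawaiian lineage.

```python
def expected_heterozygosity(h0, n_e, generations):
    """Expected heterozygosity after drift in an idealized diploid population.

    h0          -- starting heterozygosity
    n_e         -- effective population size (assumed constant)
    generations -- number of generations elapsed
    """
    return h0 * (1.0 - 1.0 / (2.0 * n_e)) ** generations


# Hypothetical founding event: 10 generations at N = 10 versus N = 10,000.
bottlenecked = expected_heterozygosity(0.5, 10, 10)
large = expected_heterozygosity(0.5, 10_000, 10)
print(bottlenecked / 0.5, large / 0.5)
```

Ten generations at an effective size of 10 leave only about 60% of the original heterozygosity, while a population of 10,000 loses almost nothing over the same span. That asymmetry is why a brief founding event can leave a lasting mark on a lineage.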

Credit: Verisimilus

For many the image of evolutionary processes brings to mind something on a macro scale. Perhaps that of the changing nature of protean life on earth writ large, depicted on a broad canvas such as in David Attenborough’s majestic documentaries over millions of years and across geological scales. But one can also reduce the phenomenon to a finer-grain on a concrete level, as in specific DNA molecules. Or, transform it into a more abstract rendering manipulable by algebra, such as trajectories of allele frequencies over generations. Both of these reductions emphasize the genetic aspect of natural history.

Credit: Johnuniq

Obviously evolutionary processes are not just fundamentally the flux of genetic elements, but genes are crucial to the phenomena in a biological sense. It therefore stands to reason that if we look at patterns of variation within the genome we will be able to infer in some deep fashion the manner in which life on earth has evolved, and conclude something more general about the nature of biological evolution. These are not trivial affairs; it is not surprising that philosophy-of-biology is often caricatured as philosophy-of-evolution. One might dispute the characterization, but it can not be denied that some would contend that evolutionary processes in some way allow us to understand the nature of Being, rather than just how we came into being (Creationists depict evolution as a religion-like cult, which imparts the general flavor of some of the meta-science and philosophy which serves as intellectual subtext).

R. A. Fisher

But shifting from such near-metaphysical generalities to more in-the-trenches science as it is done, we are faced today with the swell of sequence data due to the genomic revolution. What does this matter for our understanding of evolution? Many of the original arguments of evolutionary geneticists such as R. A. Fisher and Sewall Wright were predicated on inferences from the inheritance patterns of a few genes which were easily identifiable by their phenotypic markers. But a more likely frame for the dispute was one where the inferences were purely theoretical, deduction with a minimal level of empirical messiness intervening. In contrast today we live in an age where someone may pity you if you don’t have a very well assembled genome of your organism (on the order of billions of base pairs for mammals), and so have to make do with SNP marker data of a few thousand per individual!

These new data, first and foremost from humans due to the funding priorities of biomedical science, have stimulated a renaissance of method development to take advantage of the richness of the genetic variation now being uncovered. Consider PSMC, which allows one to make demographic inferences of population history from one genome by surveying patterns of heterozygosity within a single individual. Last week I reviewed a preprint which illustrated the power of extensive data analysis in shading and refining previous results which seemed straightforward on the face of it. The reformulation yielded the possibility of natural selection as being a pervasive parameter in human evolution over the past ~100,000 years. The authors compared variation at different categories of bases (synonymous vs. nonsynonymous) across the genome to reinforce both old intuitions and extract novel insights.

Citation: Voight, Benjamin F., et al. “A map of recent positive selection in the human genome.” PLoS biology 4.3 (2006): e72.

Looking at differences between synonymous vs. nonsynonymous substitutions is a tried & tested technique with a fine pedigree, but more recently haplotype based methods to detect natural selection have been all the rage, due to the emergence of dense genome-wide marker sets. These allow for the inference of correlated patterns of markers across adjacent genomic segments. This trend toward haplotype methods naturally triggered their antithesis, and the resulting synthesis to some extent can be seen in two papers, both Grossman et al., A Composite of Multiple Signals Distinguishes Causal Variants in Regions of Positive Selection, and Identifying recent adaptations in large-scale genomic data. These are improvements upon earlier work in the aughts, a reassessment which had already started to occur in the literature after the excesses of genomic methods in their detection of ubiquitous selection in human populations. More specifically, the newer techniques focused on recent selective events which leave long blocks of the genome within populations homogenized. As causal markers rapidly increase in frequency due to positive selection, they drag along flanking regions in sweep events. For many generations after the initial selection event these flanking regions will produce regions of linkage disequilibrium, as recombination only slowly breaks apart the associations across loci. But a key drawback with these methods is that selection is not the only dynamic which results in long haplotypes and linkage disequilibrium. More specifically demographic stochasticity, colloquially the vicissitudes of population history, can also generate long homogeneous blocks of markers. The initial candidate regions yielded by a statistic like iHS were saturated by the effects of population specific history.
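
How slowly does recombination erode these associations? Under the standard textbook model (random mating, a large population, no ongoing selection) linkage disequilibrium between two loci decays geometrically: D_t = D_0 × (1 − r)^t, where r is the recombination fraction between them. The sketch below uses hypothetical values of r and my own function names; it is an illustration of that formula, not a reconstruction of anything in the Grossman et al. papers.

```python
import math


def ld_after(d0, r, generations):
    """Linkage disequilibrium remaining after a number of generations."""
    return d0 * (1.0 - r) ** generations


def generations_to_halve(r):
    """Generations needed for LD to fall to half its starting value."""
    return math.log(0.5) / math.log(1.0 - r)


# Hypothetical recombination fractions: loosely linked vs. tightly linked loci.
for r in (0.01, 0.0001):
    print(r, round(generations_to_halve(r)))
```

With r = 0.01 the association halves in roughly 69 generations, while at r = 0.0001 it takes several thousand. This is why markers immediately flanking a swept allele stay correlated with it long after the sweep itself.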

CMS, debuted in Grossman et al. 2010, is an attempt to correct for this bug, while retaining the power of haplotype based methods. Natural selection within the genome leaves more evidence behind in regards to its operation than just long haplotype blocks and linkage disequilibrium. Selected alleles often exhibit greater between population difference than the average region of the genome (i.e., higher Fst). Additionally, a new derived allele segregating within one population at a high frequency is often a telltale marker of recent adaptation, as a de novo mutation in a specific locale turns out to be beneficial. By combining tests which survey patterns of variation across loci (i.e., haplotype based methods) with those within loci and across populations (Fst based methods), CMS zeros in on a few precise narrow candidates by cross-checking with multiple tools. False positive hits aside, another major problem with relying upon a single coarse test is that it often highlights a large region as a target of natural selection. This does not necessarily allow for simple follow up when you have dozens of genes and millions of bases which are potential candidates.
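
As a rough illustration of the between-population differentiation statistic mentioned above (Fst), here is a minimal sketch for one biallelic locus in two equally weighted populations, using the classic definition Fst = (H_T − H_S) / H_T. The allele frequencies are hypothetical, the estimator ignores sample-size corrections, and it should not be read as the computation CMS itself performs.

```python
def fst_two_populations(p1, p2):
    """Fst at one biallelic locus from allele frequencies in two populations.

    H_S is the mean expected heterozygosity within populations;
    H_T is the expected heterozygosity of the pooled population.
    """
    h_s = (2.0 * p1 * (1.0 - p1) + 2.0 * p2 * (1.0 - p2)) / 2.0
    p_bar = (p1 + p2) / 2.0
    h_t = 2.0 * p_bar * (1.0 - p_bar)
    if h_t == 0.0:
        return 0.0  # locus is fixed for the same allele everywhere
    return (h_t - h_s) / h_t


# Hypothetical frequencies: a nearly disjoint allele vs. a shared one.
print(fst_two_populations(0.99, 0.01))
print(fst_two_populations(0.55, 0.45))
```

An allele that is close to fixed in one population and nearly absent in the other gives a value near 1, whereas similar frequencies in both give a value near 0. A scan looks for loci that sit far out in the upper tail relative to the genome-wide background.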

The second paper, Grossman et al. 2013, is less a map of genome-wide variation, than a scan of genome-wide variation with an intent to select choice targets for more detailed analysis. To no one’s surprise for human data sets loci implicated in salient physical characteristics such as height and pigmentation, metabolism, and immune response, are high on the list of candidates. No matter the genuine issue of false positives it does seem that recent human evolution (and frankly, evolution more generally) has a fixation on these traits, no pun intended. I do wonder sometimes if this is just a feature of the fact that we humans notice exterior phenotypes, as well as disease related markers (e.g., metabolic and immune illnesses). One of the major concerns in the second paper is that a selection signature without a phenotype is often without utility, but perhaps the phenotypes are lacking in utility because humans are blind in terms of what traits are of interest. I am still skeptical of explanations for what exactly the target of selection around the EDAR locus in East Asians is.

Two alleles of SLC24A5, citation: Norton, Heather L., et al. “Genetic evidence for the convergent evolution of light skin in Europeans and East Asians.” Molecular biology and evolution 24.3 (2007): 710-722.

One of the more intriguing results from CMS in Grossman et al. 2013 is that a locus with the strongest association with resistance to leprosy also contains SLC24A5. This locus has an allele within it that is almost disjoint in frequency between Europeans and Sub-Saharan Africans. By this, I mean that almost all Africans carry one base, while nearly all Europeans carry the other. The allele found in Europeans is dominant in West Asia, and present at frequencies as high as ~50% as far south and east as Sri Lanka. It is a gene which is famously correlated with lighter skin in humans and zebrafish. And yet there remains the mystery that it is present at very high frequencies rather far south, and it is certainly not a necessary condition for light skin. East Asians are nearly fixed for the ancestral variant which is common in Sub-Saharan Africa. A possible explanation is that these sorts of salient phenotypic loci have been reshaped due to very strong bouts of selection targeting particular diseases in the recent past. If this is correct, the phenotypic characteristics which we find salient in human beings may simply be pleiotropic side effects of selective sweeps anchored around disease resistance.

I am not proposing here that genomics can solve and explain evolution. The heirs of G. G. Simpson may have something to say about that. Rather, I am suggesting that the genetic piece of the puzzle will not be lacking in data to any extent within our lifetimes. My hunch is that many evolutionary genetic questions will be soluble when we have thousands of complete genomes of high quality on thousands of organisms. There is no likely windfall of fossils in the near future, so paleontology will have to continue to operate in a relatively data-constrained environment. For those who work in the domain of evolutionary genetics and genomics the onus is on human ingenuity, and analytic skill and savvy. The task is thinking hard and deep about difficult problems, rather than putting in long hours at the bench to glean more data.


Layers and layers….

There is the fact of evolution. And then there is the long-standing debate of how it proceeds. The former is a settled question with little intellectual juice left. The latter is the focus of evolutionary genetics, and evolutionary biology more broadly. The debate is an old one, and goes as far back as the 19th century, where you had arch-selectionists such as Alfred Russel Wallace (see A Reason For Everything) square off against pretty much the whole of the scholarly world (e.g., Thomas Henry Huxley, “Darwin’s Bulldog,” was less than convinced of the power of natural selection as the driving force of evolutionary change). This old disagreement planted the seeds for much more vociferous disputations in the wake of the fusion of evolutionary biology and genetics in the early 20th century. They range from the Wright-Fisher controversies of the early years of evolutionary genetics, to the neutralist vs. selectionist debate of the 1970s (which left bad feelings in some cases). A cartoon-view of the implication of the debates in regards to the power of selection as opposed to stochastic contingency can be found in the works of Stephen Jay Gould (see The Structure of Evolutionary Theory) and Richard Dawkins (see The Ancestor’s Tale): does evolution result in an infinitely creative assortment due to chance events, or does it drive toward a finite set of idealized forms which populate the possible parameter space?*

But ultimately these 10,000-foot debates are more a matter of philosophy than science. At least until the scientific questions are stripped of their controversy and an equilibrium consensus emerges. That will only occur through an accumulation of publications whose results are robust to time, and subtle enough to convince dissenters. This is why Enard et al.’s preprint, Genome wide signals of pervasive positive selection in human evolution, attracted my notice. With the emergence of genomics it has been humans first in line to be analyzed, as the best data is often found from this species, so no surprise there. So what is notable about this paper in light of the past 10 years of back and forth exploration of this topic?**

By taking a deeper and more subtle look at patterns of the variation in the human genome this group has inferred that adaptation through classic positive selection has been a pervasive feature of the human genome over the past ~100,000 years. This is not a trivial inference, because there has been a great deal of controversy as to the population genetic statistics which have been used to infer selection over the past 10 years with the arrival of genome-wide data sets (in particular, a tendency toward false positives). In fact, one group has posited that a more prominent selective force within the genome has been “background selection,” which refers to constraint upon genetic variation due to purification of numerous deleterious mutations and neighboring linked sites.

The sum totality of Enard et al. may seem abstruse, and even opaque, in terms of the method. But each element is actually rather simple and clear. The major gist is that many tests for selection within the genome focus on the differences between nonsynonymous and synonymous mutational variants. The former refer to base positions in the genome which result in a change in the amino acid state, while the latter are those (see the third positions) where different bases may still produce the same amino acid. The ratio between substitutions, replacements across lineages for particular base states, at these positions is a rough measure of adaptation driven by selection on the molecular level. Changes at synonymous positions are far less constrained by negative selection, while positive selection due to an increased fitness via new phenotypes is presumed to have occurred only via nonsynonymous changes. What Enard et al. point out is that the human genome is heterogeneous in the distribution of characteristics, and focusing on these sorts of pairwise differences in classes without accounting for other confounding variables may obscure the dynamics one is attempting to measure. In particular, they argue that evidence of positive selective sweeps is masked by the fact that background selection tends to be stronger in regions where synonymous mutational substitutions are more likely (i.e., they are more functionally constrained, so nonsynonymous variants will be disfavored). This results in elevated neutral diversity around regions of nonsynonymous substitutions vis-a-vis strongly constrained regions with synonymous substitutions. Once they correct for the power of background selection, the authors find evidence for sweeps of novel adaptive variants across the human genome, which had previously been hidden.

There are two interesting empirical findings from the 1000 Genomes data set. First, the authors find that positive selection tends to operate upon regulatory elements rather than coding sequence changes. You are probably aware that this is a major area of debate currently within the field of molecular evolutionary biology. Second, there seems to be less evidence for positive selection in Sub-Saharan Africans, or, less background selection in this population. My own hunch is that it is the former, that the demographic pulse across Eurasia, and to the New World and Australasia, naturally resulted in local adaptations as environmental conditions shifted. Though it may be that the African pathogenic environment is particularly well adapted to hominin immune systems, and so imposes a stronger cost upon novel mutations than is the case for non-Africans. So I do not dismiss the second idea out of hand.

Where this debate about the power of selection will end is anyone’s guess. Nor do I care. Rather, what’s important is getting a finer-grained map of the dynamics at work so that we may perceive reality with greater clarity. One must be cautious about extrapolating from humans (e.g., the authors point out that Drosophila genomes are richer in coding sequence proportionally). But the human results which emerge because of the coming swell of genomic data will be a useful outline for the possibilities in other organisms.

Citation: Genome wide signals of pervasive positive selection in human evolution

* The cartoon qualification is due to the fact that I am aware that selection is stochastic as well.

** Voight, Benjamin F., et al. “A map of recent positive selection in the human genome.” PLoS biology 4.3 (2006): e72., Sabeti, Pardis C., et al. “Detecting recent positive selection in the human genome from haplotype structure.” Nature 419.6909 (2002): 832-837., Wang, Eric T., et al. “Global landscape of recent inferred Darwinian selection for Homo sapiens.” Proceedings of the National Academy of Sciences of the United States of America 103.1 (2006): 135-140., Williamson, Scott H., et al. “Localizing recent adaptive evolution in the human genome.” PLoS genetics 3.6 (2007): e90., Hawks, John, et al. “Recent acceleration of human adaptive evolution.” Proceedings of the National Academy of Sciences 104.52 (2007): 20753-20758., Pickrell, Joseph K., et al. “Signals of recent positive selection in a worldwide sample of human populations.” Genome research 19.5 (2009): 826-837., Hernandez, Ryan D., et al. “Classic selective sweeps were rare in recent human evolution.” Science 331.6019 (2011): 920-924.


Central Dogma

One of the elementary aspects of understanding genetics on a biophysical scale is to characterize the set of processes which span the chasm between the raw sequence information of base pairs (e.g. AGCGGTCGCAAG….) and the assorted macromolecules which are woven together to create the collection of tissues, and enable the physiological processes, which result in the organism. This suite of phenomena is encapsulated most succinctly in the often maligned Central Dogma of Molecular Biology. In short, the information of the DNA sequence is transcribed and translated into proteins. Though for greater accuracy and precision one must always add the caveats of phenomena such as splicing. The baroque character of the range of processes is such that molecular genetics has become a massive enterprise, to a great extent superseding classical Mendelian genetics.

One critical structural detail from an evolutionary perspective is that the amino acids which are the building blocks of proteins are generally encoded by multiple nucleotide triplets, or codons. For example the amino acid glycine is “four-fold degenerate”: GGA, GGG, GGC, and GGU (in RNA, uracil, U, substitutes for the thymine, T, of DNA) all encode it. Notice that the change is confined to the third position in the codon. Altering the first or second position would transform the amino acid end product, and possibly perturb the function of the final protein (or perhaps disrupt transcription altogether in some cases). Third-position changes of this kind are synonymous substitutions because they don’t change the functional import of the sequence, as opposed to changes at the nonsynonymous positions (which may abolish or change function). In an evolutionary context one may presume that these synonymous substitutions are “silent.” Because natural selection operates upon heritable variation of a phenotype, and synonymous substitutions presumably do not change phenotype, it is often assumed that evolutionary change on these bases is selectively neutral. In contrast, nonsynonymous changes may be deleterious or beneficial (far more likely the former than the latter because breaking contingent complexity is easier than creating new contingent complexity). Therefore the ratio of genetic change on nonsynonymous and synonymous bases across lineages has been a common measure of possible selection on a gene.
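The degeneracy described above can be checked directly against the standard genetic code. What follows is a minimal Python sketch of my own, not something taken from the papers discussed here. The helper names `is_fourfold_degenerate` and `classify_change` are hypothetical, and codons are written in DNA letters (T rather than U).

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order and
# the third position varying fastest ("*" marks a stop codon).
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def is_fourfold_degenerate(codon):
    """True if all four bases at the third position encode the same amino acid."""
    return len({CODON_TABLE[codon[:2] + base] for base in BASES}) == 1


def classify_change(codon, position, new_base):
    """Label a single-base change at position 0, 1 or 2 within a codon."""
    mutant = codon[:position] + new_base + codon[position + 1:]
    if CODON_TABLE[codon] == CODON_TABLE[mutant]:
        return "synonymous"
    return "nonsynonymous"


print(is_fourfold_degenerate("GGA"))   # the glycine family
print(classify_change("GGA", 2, "C"))  # third position: GGA -> GGC
print(classify_change("GGA", 1, "C"))  # second position: GGA -> GCA
```

Under this table 32 of the 64 codons sit in four-fold degenerate families, which gives a sense of how many third positions fall into the “4D” class in principle.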

At this point I have sketched out in the most superficial sense a set of propositions which span the concrete physical realm of the biochemical mechanics of DNA to the abstract formal evolutionary genetic models which outline the trajectory of allele frequencies over time and space. But propositions are always embedded in axioms, and those axioms may not always be literally true. For example some codons, which are notionally equivalent in terms of their amino acid output, are favored due to biases derived from the various efficiencies of the translational machinery of the cell. After a fashion this too is natural selection, but it does not manifest via fitness of individual organisms at some stage of life history in a straightforward fashion. Then there are cases where synonymous mutations change the regulatory pathway in a significant manner. And so on. Despite all these deviations from the ideal, presumably the preponderance of researchers accept that the utility of the neutral framework for synonymous mutations allowed for the prior assumption that they were not subject to selection.

A new paper in PLoS GENETICS, Strong Purifying Selection at Synonymous Sites in D. melanogaster, takes aim at the robustness of this axiom by highlighting the likelihood that many synonymous positions in Drosophila are subject to strong purifying selection. That is, a putative silent transition produces significant functional differences which result in a major decrease in the fitness of the organisms, removing the mutant alleles from the pool of polymorphisms. Note the key qualifier here that the selection is strong. Dynamics such as mutational bias and regulatory differences mean that many would acknowledge a weak and gentle purifying selection on even synonymous sites. These authors contend something rather more radical.

Figure 1C

To be frank the paper is rather abstruse and dense in its prose, though impressive in its disciplinary breadth, ranging from statistical genetics to developmental biology. But the core result can be boiled down to raw counts of SNPs. In particular they used introns, which like synonymous sites are putatively neutral because they are not part of the final RNA transcript which generates the protein, as a reference against which to check their sites of interest. Though subtle, you can observe in the panel at the top of this post that there seems to be some deviation from neutrality in the 4D (for four-fold degenerate) sites. It is clearer in the second panel above. The synonymous sites seem to have less genetic variation than they should. This is a tell for purifying selection, which removes low frequency deleterious mutations from the population continuously. But why is this strong selection? The issue highlighted by the authors is that the data sets from previous research were simply not dense and rich enough to distinguish between strong and weak purifying selection, as on a coarser scale of analysis the effects would be rather similar. In contrast here the authors used more than 100 Drosophila lines, and assembled nearly 1 million 4D SNPs. With such a deep sampling of the population they were able to probe even small differences, as strong selection would be discernibly more effective in flushing out very low frequency alleles (consider that in smaller samples low N variants are simply likely to be missed).
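To make the idea of “deviation from neutrality” in allele counts concrete, here is a small Python sketch of the textbook expectation for a neutral site frequency spectrum. It is only a point of reference: as described above, the paper relies on introns as an empirical neutral standard, not on this formula. The function name `neutral_sfs_proportions` is hypothetical, the sample size of 100 merely echoes the “more than 100 lines” in the text, and the 1/i expectation assumes the standard neutral model with a constant population size.

```python
def neutral_sfs_proportions(n):
    """Expected share of segregating sites at which the derived allele is
    seen i times (i = 1 .. n-1) in a sample of n chromosomes, under the
    standard neutral model where the expected count is proportional to 1/i."""
    weights = [1.0 / i for i in range(1, n)]
    total = sum(weights)
    return [w / total for w in weights]


# Illustrative sample size only (an assumption, not the paper's exact n).
sfs = neutral_sfs_proportions(100)
print(f"expected share of singletons: {sfs[0]:.3f}")
print(f"expected share of doubletons: {sfs[1]:.3f}")
```

A class of sites under purifying selection would be expected to depart from a neutral reference spectrum like this one, and the deeper the sample, the more of the rare-allele end of the spectrum is actually observed.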

Being a paper in PLoS GENETICS it is free for all to read, so I will spare you all the gory details in terms of how they corrected for biases of GC content, possible selective sweeps distorting the signal from flanking regions, etc. They were able to use resampling techniques to confirm the robustness of their inferences, though the slicing of the data into numerous categories does concern me a bit. Additionally there is mention of utilizing “parsimony,” which is somewhat concerning, in particular due to the fact that the authors even concede that this may produce false conclusions. But the big picture result is rather impressive even if the details have a daunting number of moving parts. I should mention as well that they explored the possible role that codon bias might have in generating this pattern, and that does not seem likely (in particular because purifying selection seems to affect optimal and non-optimal codons). And, there were some rather strange results too, such as their finding that purifying selection was weaker on the X chromosome than the autosomes (contrary to my, and I think their, expectation).

The “back end” of the paper is different in that it analyzes the functional and developmental aspects of the genome regions of interest (4D sites). They report for example that purifying selection is operating on conserved sites across Drosophila species. Not surprising. But there also seems to be a significant amount of substitution and change across lineages at sites which are subject to purifying selection within lineages. This hints at gain-of-function mutations which distinguish Drosophila species. Finally, there are also broad patterns as to the temporal distribution of gene expression as they relate to 4D sites which are strongly conserved. As I am not well versed in developmental biology I will leave that to others, though the results seem suggestive, if opaque to me.

One paper does not overthrow 40 years of molecular evolution. And even if some of the primary assumptions and results validating neutral theory are wrong, that does not negate the utility of neutrality as a null hypothesis. But if synonymous sites are taken as a benchmark for neutrality, and have been subject to strong purifying selection all the while, then it does mean that our understanding of the balance of forces shaping the evolutionary genetic history of Drosophila may be quite wrong. The qualifier about Drosophila is I think warranted, because from what I recall earlier results reported ubiquitous selection in this model organism, and that may not hold for all taxa. The authors make the case for the generality of their results, and they may be right, but I think one should be more cautious about such claims. What this does tell us is that modern genomics and the scaling up of data is not revealing nature on just a finer scale, but may actually be smoking out structure and patterns which have long been hiding in plain sight.

Citation: Lawrie DS, Messer PW, Hershberg R, Petrov DA (2013) Strong Purifying Selection at Synonymous Sites in D. melanogaster. PLoS Genet 9(5): e1003527. doi:10.1371/journal.pgen.1003527


Credit: Ealgdyth

The horse is a beautiful animal.* That is not a trivial matter, but there is the added fact that historically it has been of great consequence. Obviously the rise of horses as vehicles of war is preeminent in our minds, but on a more prosaic level draft horses revolutionized many societies via their effect on agricultural productivity. Dogs may be man’s best friend, but horses are arguably** man’s most useful friend. Or at least they were. The critical importance of the horse is probably lost on modern people, but until the rise of the automobile they were ubiquitous in many large cities (this is clear when you view early films). Today horses are perceived to be luxurious playthings (ergo, the term “horse country”), but during the heyday of the horse-powered world they served the roles of tractors, tanks, automobiles, and telegraphs.

These are just some of the reasons that horse genomics may be of more than passing curiosity for those who are not to the manor born. Horses are part of our history, and as a large charismatic mammal there is a particular interest in the origins of this lineage. This is part of the reason that a new paper in Nature is important, Recalibrating Equus evolution using the genome sequence of an early Middle Pleistocene horse. But this is not the only reason that this paper is worthy of note. It extends the time frame of ancient DNA sequencing back by an order of magnitude, from ~50,000 years before the present to ~500,000 years before the present. Obviously that is a big leap, though it is not surprising that this DNA was retrieved from remains in Canada’s Yukon. Often mammalian ancient DNA breakthroughs, which entail the destruction of fossils, presage prehistoric analysis of our own lineage. But I am not quite sure that that will necessarily happen here (with the caveat that there is going to be a lot of ancient human DNA publication of more recent vintage over the next few years), as the expansion of Homo into the far north truly reached an extensive scope only with our particular lineage of sapiens sapiens.*** Nevertheless, this publication no doubt solidifies the new era in phylogenetics, where inferences of trees can be calibrated and checked against long extinct nodes and branches which had heretofore only been posited.

So what did they find in this paper that was worth noting? First, we have to note that their coverage on the ancient sample was not particularly high. Only about one per site. That means that there’s going to be a lot of noise in the system, but with the number of markers they have that’s not as much of an issue for phylogenetic inference of population history. It does concern me more when the authors are focusing on functional regions and possible differences between the ancient sample and modern lineages.

In any case, as you can see from the tree above, Przewalski’s horse is an outgroup to all modern domestic horse lineages. That is not so novel a result, as domestic horses probably derive from the West Eurasian steppe, while Przewalski’s horses are localized around Mongolia. Perhaps more surprising, there is no sign of admixture of domestic and Przewalski’s horse in the genomes of the latter. Using the Yukon sample as a calibration, the authors estimate that the divergence between domestic horses and the wild Mongolian lineage occurred on a population wide scale ~40,000 years before the present. Some quick back of the envelope calculations suggest to me that that’s about equivalent to ~150,000 human years correcting for generation time, in the same range as the separation of Bushmen from other populations. Also, I might mention that there is a karyotype difference between Przewalski’s and domestic horses (though they are inter-fertile), and apparently strong behavioral divergences. The latter is to be expected, as horses have been subject to very powerful selection on a range of phenotypes (how many animals would let another animal habitually ride them?), with skewed reproductive values between the sexes (i.e., one stallion siring many offspring) driving the dynamic. Finally, the authors note that Przewalski’s horse has relatively high genetic diversity despite a breeding bottleneck of ~15 (there are now >1,000 of them). Actually the original breeding program consciously crossed unrelated individuals, so this is confirmation of expectation. Again, this illustrates that bottlenecks in census size need not be catastrophic.
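The generation-time conversion mentioned above is simple arithmetic, sketched here in Python. The post does not say which generation times went into the back-of-the-envelope figure, so the values below (8 years for horses, 30 for humans) are my illustrative assumptions, chosen because they reproduce the ~150,000 number. The function name `rescale_years` is hypothetical.

```python
def rescale_years(years, gen_time_from, gen_time_to):
    """Convert a span of years in one species into the span covering the
    same number of generations in another species."""
    generations = years / gen_time_from
    return generations * gen_time_to


# Assumed generation times in years; neither value is stated in the post.
HORSE_GENERATION_YEARS = 8
HUMAN_GENERATION_YEARS = 30

human_equivalent = rescale_years(
    40_000, HORSE_GENERATION_YEARS, HUMAN_GENERATION_YEARS
)
print(f"{human_equivalent:,.0f} human-equivalent years")
```

Any pair of generation times with the same ratio (3.75 human years per horse year here) gives the same answer, so the result is only as good as that ratio.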

I will forgo extensive discussion of possible selection events in domestic horses, using the ancient Yukon sample as an outgroup, because it seems that Przewalski’s more than suffices for that. Rather, an interesting result here is that the authors push back the period of the emergence of the horse to ~4 million years, rather than ~2 million. It is implied that the horse mutation rate may be different from the human mutation rate. As you likely know, there are actually controversies about the nature of the human mutation rate and the viability of various clock estimates. With a paleontological peg the Yukon sample obviously serves as an excellent data point for purposes of calibration. Without getting into the technical guts the fact that it’s so ancient, and not that far from the older putative age for the origin of the horse, means that I have rather good confidence that the results here will stand the test of time.

There’s no point in me commenting on the technological wizardry that went into extracting and sequencing such degraded DNA. But much of the legwork in that domain is nevertheless truly impressive to me. The fact that they managed to get that much information out of >500,000 year old DNA is incredible when you consider that less than a generation ago sequencing one individual was perceived to be a herculean task!

Citation: Recalibrating Equus evolution using the genome sequence of an early Middle Pleistocene horse

* Also, frankly, somewhat stupid. I worked on a mule farm when I was an adolescent and the dull quiescent nature of horses in comparison to the irascible and cunning donkey was always a stark contrast.

** I say arguably because one can make the case that cattle or pig and such are of greater utility as domesticated animals. But an argument can nevertheless be had on all sides.

*** I am aware that archaic H. sapiens such as Neandertals were a northern subspecies, but in actuality their distribution was far more circumscribed than that of our own earliest forebears.

• Category: Science • Tags: Evolutionary Genomics, Genomics 

There’s an excellent paper up at Cell right now, Modeling Recent Human Evolution in Mice by Expression of a Selected EDAR Variant. It synthesizes genomics, computational modeling, as well as the effective execution of mouse models to explore non-pathological phenotypic variation in humans. It was likely due to the last element that this paper, which pushes the boundary on human evolutionary genomics, found its way to Cell (and the “impact factor” of course).

The focus here is on EDAR, a locus you may have heard of before. By fiddling with the EDAR locus researchers had earlier created “Asian mice.” More specifically, mice which exhibit a set of phenotypes which are known to distinguish East Asians from other populations, specifically around hair form and skin gland development. More generally EDAR is implicated in development of ectodermal tissues. That’s a very broad purview, so it isn’t surprising that modifying this locus results in a host of phenotypic changes. The figure above illustrates the modern distribution of the mutation which is found in East Asians in HGDP populations.

One thing to note is that the derived East Asian form of EDAR is found in Amerindian populations which certainly diverged from East Asians > 10,000 years before the present (more likely 15-20,000 years before the present). The two populations in West Eurasia where you find the derived East Asian EDAR variant are Hazaras and Uyghurs, both likely the products of recent admixture between East and West Eurasian populations. In Melanesia the EDAR frequency is correlated with Austronesian admixture. Not on the map, but also known, is that the Munda (Austro-Asiatic) tribal populations of South Asia also have low, but non-trivial, frequencies of East Asian EDAR. In this they are exceptional among South Asian groups without recent East Asian admixture. This lends credence to the idea that the Munda are descendants in part of Austro-Asiatic peoples intrusive from Southeast Asia, where most Austro-Asiatic languages are present.

And yet one thing that jumps out at me is that there is no East Asian EDAR in European populations, even in Russians. I am a bit confused by this result, because of the possibility of Siberian-affiliated population admixture with Europeans within the last 10,000 years, as adduced by several researchers (this is not an obscure result, it manifests in TreeMix repeatedly). The second figure shows the inferred region from which the East Asian EDAR haplotype expanded over the past 30,000 years. The authors utilized millions of forward simulations with a host of parameters to model the expansion of EDAR, so that it fit the distribution pattern that is realized (see the supplements here for the parameters). To make a long story short they infer that there was one mutation on the order of ~30,000 years before the present, and that it swept up in frequency driven by selection coefficients on the order of ~0.10 (a 10% increase in relative fitness, which is incredibly powerful!). This is on the extreme end of selective sweeps, and likely of the same class as the haplotype blocks which characterize SLC24A5 and LCT (the block is shorter, though that makes sense because of the deeper time depth). Again, I am perplexed why such an ancient allele, which is found in Amerindians, or Munda populations, is absent in Europeans who have putative East Eurasian admixture. The whole does not cohere for me. There is a weak point in one or more of my assumptions.
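To get a feel for how powerful a selection coefficient of ~0.10 is, here is a deliberately crude Python sketch. It is not the authors’ method: they ran spatially explicit forward simulations with many parameters, whereas this is a deterministic single-population model of genic (haploid-style) selection that ignores drift, dominance, and geography. The starting frequency (one copy among 20,000 chromosomes) and the 90% target are hypothetical values chosen for illustration, as are the function names.

```python
import math


def next_frequency(p, s):
    """One generation of genic selection: carriers have fitness 1 + s."""
    return p * (1 + s) / (1 + p * s)


def generations_to_reach(p0, p_target, s):
    """Iterate the recursion until the allele frequency reaches p_target."""
    p, generations = p0, 0
    while p < p_target:
        p = next_frequency(p, s)
        generations += 1
    return generations


def generations_closed_form(p0, p_target, s):
    """Same quantity from the fact that the odds p/(1-p) grow by a factor
    of (1 + s) every generation under this model."""
    odds_ratio = (p_target / (1 - p_target)) / (p0 / (1 - p0))
    return math.log(odds_ratio) / math.log(1 + s)


# Hypothetical scenario: a single new copy among 20,000 chromosomes.
p0, p_target, s = 1 / 20_000, 0.9, 0.10
print(generations_to_reach(p0, p_target, s))
print(round(generations_closed_form(p0, p_target, s), 1))
```

Under these assumptions the allele goes from a single copy to a large majority in on the order of a hundred-odd generations, which is very fast against a time depth of ~30,000 years. A real new mutation would also have to survive loss by drift while rare, which this sketch leaves out entirely.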

Then there’s the section on the mouse model. To me this aspect was ingenious, though I’m not particularly able to assess it on its technicalities. The earlier usage of mouse models to test the effects of mutations on EDAR was in the context of coarse copy number changes which resulted in massive dosage changes of protein. The phenotypic outcomes were rather extreme in that case. Here they used a “knockin” model where they recreated the specific EDAR point mutation. Instead of extreme phenotypes they found that the mice were much more normal in their range of traits, though the hair form shifts were well aligned with what occurred in humans. Additionally there were some changes in the number of eccrine glands, with a larger number in the derived East Asian EDAR carriers (with additive effect). Finally they noticed that there were differences in mammary gland pad area and branching. None of this is that surprising; EDAR is a significant regulatory gene which shapes the peripheries and exterior of an organism.

To double check the human relevance of what they found in the mouse model they performed a genome-wide association in a large cohort of Han Chinese. The correlations of particular traits were in the directions that they expected; those individuals with East Asian EDAR variants had thicker hair, shovel-shaped incisors, and a greater density of eccrine glands. It is perhaps important to note that the frequency of the derived variant is so high in Han populations that they didn’t have enough homozygote ancestral genotypes to perform statistics, so their comparisons involved heterozygotes with the derived mutant and also a copy of the ancestral state. This is like SLC24A5 in Europeans, where it is difficult to find individuals of European heritage who have double copies of the non-European modal variant.

Let’s review all the awesome things they did in this study. They dug deeply into the evolutionary genomics of the region around EDAR, concluding that this haplotype was driven up in frequency from one ancestral variant ~30,000 years ago in a hard selective sweep. And a sweep of notable strength in terms of selection coefficient. This may be one of the largest effect targets of natural selection in the genome of non-Africans over the past 50,000 years. Second, they used a humanized mouse model to explore the range of phenotypes correlated with this mutational change in East Asians. So you have a strong selection coefficient on a locus, and, a range of traits associated with changes on that locus. Third, they confirmed the correlation between the traits and the mutation in humans, despite there being prior research in this area (i.e., they reproduced). This is all great science, and shows the power of collaboration between the groups.

Much of the elegance and power of the paper applies to the discussion section as well, but to be frank this is where things start falling apart for me. You can get a sense of it in The New York Times piece, East Asian Physical Traits Linked to 35,000-Year-Old Mutation. The headline here points to a legitimately important inference from this line of research: many salient physical characteristics of the human races seem to be due to strong selection events at a few loci. In addition to EDAR I’m thinking of the pigmentation loci, such as SLC24A5. I wouldn’t be surprised if there was something similar for the epicanthic fold. If a trait is visible, and defines between-population differences, it is generally not genomically trivial. There’s usually a story underneath that difference.

In the broad scale of human natural history the problem that arises for me is that we have traits, we have genes under selection, but we have very weak stories to explain the mechanism and context of natural selection. Here there is a strong contrast with the loci around lactase persistence and malaria resistance. In those situations the causal mechanism for the selection seems relatively clear. Critics of evolutionary psychology are wont to accuse the field of ‘Just So’ storytelling, but the same problem crops up in the more intellectually insulated domain of evolutionary genomics (in part because the field is very new, and also mathematically and computationally abstruse). To illustrate what I’m talking about I’m going to quote from the discussion of the above paper:

A high density of eccrine glands is a key hominin adaptation that enables efficient evapo-transpiration during vigorous activities such as long-distance walking and running (Carrier et al., 1984; Bramble and Lieberman, 2004). An increased density of eccrine glands in 370A carriers might have been advantageous for East Asian hunter-gatherers during warm and humid seasons, which hinder evapo-transpiration.

Geological records indicate that China was relatively warm and humid between 40,000 and 32,000 years ago, but between32,000 and 15,000 years ago the climate became cooler and drier before warming again at the onset of the Holocene (Wang et al., 2001; Yuan et al., 2004). Throughout this time period, however, China may have remained relatively humid due to varying contribution from summer and winter monsoons.

High humidity, especially in the summers, may have provided a seasonally selective advantage for individuals better able to functionally activate more eccrine glands and thus sweat more effectively (Kuno, 1956). To explore this hypothesis, greater precision on when and where the allele was under selection—perhaps using ancient DNA sources—in conjunction with more detailed archaeological and climatic data are needed.

A climate adaptation is always a good bet. The problem I have with this hypothesis is that modern day gradients in the distribution of this allele are exactly the reverse of what one might expect in terms of adaptation to heat and humidity. Additionally, is there no cost to this adaptation? After the initial sweep upward, the populations where the derived EDAR mutant is found in high frequencies went through the incredible cold of the Last Glacial Maximum, and groups like the Yakuts are known to have cold adaptations today. Not only that, but the Amerindians from the arctic to the tropics all exhibit a cold adapted body morphology, the historical consequence of the long sojourn in Beringia.

Granted, the authors are not so simplistic, and the somewhat disjointed discussion alludes to the fact that EDAR has numerous phenotypic effects, and it may be subject to diverse positive selection pressures. This seems plausible on the surface, but this complexity of mechanism seems ill-fitted to the fact that the signal of selection around this locus is so clean and crisp. It seems that this is not going to be an easy story to unpack, and there’s a good deal of implicit acknowledgement of that fact in this paper. But tacked right at the end of the main text is this whopper:

It is worth noting that largely invisible structural changes resulting from the 370A allele that might confer functional advantage, such as increased eccrine gland number, are directly linked to visually obvious traits such as hair phenotypes and breast size. This creates conditions in which biases in mate preference could rapidly evolve and reinforce more direct competitive advantages. Consequently, the cumulative selective force acting over time on diverse traits caused by a single pleiotropic mutation could have driven the rise and spread of 370A.

A simple takeaway is that the initial climatic adaptation may have given way to a cultural/sexual selective adaptation, whereby there was a preference for “good hair” as exemplified by pre-Western East Asian canons (black and lustrous), as well as a bias toward small breasts. This aspect gets picked up in The New York Times piece of course. I’ll quote again:

But Joshua Akey, a geneticist at the University of Washington in Seattle, said he thought the more likely cause of the gene’s spread among East Asians was sexual selection. Thick hair and small breasts are visible sexual signals which, if preferred by men, could quickly become more common as the carriers had more children. The genes underlying conspicuous traits, like blue eyes and blond hair in Europeans, have very strong signals of selection, Dr. Akey said, and the sexually visible effects of EDAR are likely to have been stronger drivers of natural selection than sweat glands.

The passage here is ambiguous because the author of the article, Nick Wade, doesn’t use quotes, and I don’t know what is Akey and what is Wade’s gloss on Akey. For example, for theoretical reasons of reproductive skew (a few men can have many children) in general sexual selection is considered to be driven most often by female preference for male phenotypes. I assume Akey knows this, so I suspect that that section is Wade’s gloss (albeit, a reasonable one given the proposition of preference for smaller breasts). The main question on my mind is how seriously prominent population geneticists such as Joshua Akey actually take sexual selection to be as a force driving variation and selection in human populations. It seems that quite often sexual selection is presented as a deus ex machina. A phenomenon which can rescue our confusion as to the origins of a particular suite of traits. But our assessment of the likelihood of sexual selection presumably has to be premised on prior expectations informed by a balance of different forces one can gauge from the literature, and here my knowledge of the current sexual selection literature is weak. Perhaps my skepticism is premised on my ignorance, and the population geneticists who proffer up this explanation are more informed as to the state of the literature.

All this brings me back to the farcical title. When this paper first made news last week I was having dinner with a friend of Japanese heritage (who spent his elementary school years in Japan). I asked him point blank, “Do you like small breasts?” His initial response was “WTF!?! Razib,” but as a mouse geneticist he understood the thrust of my question after I outlined the above results to him. From personal communication with many East Asian American males I am not convinced that there is an overwhelmingly strong preference for small breasts within this subset of the population. But the key here is American. These are individuals immersed in American culture. The norms no doubt differ in East Asia. The typical visual representations of celebrity East Asian females that we see in the American media depict individuals who are slimmer and more understated in their secondary sexual characteristics than is the norm among Western female celebrities (e.g., Gong Li, the new crop of Korean pop stars, even taking into account the plastic surgery of the latter). Part of this is no doubt the reality that the normal range of variation across the population differs, and part of it may be the nature of aesthetic preferences.

But the possibility of deep rooted psychological reasons driving sexual selection (to my knowledge there was no culture which spanned South China and Siberia) brings us back to old ideas about the Pleistocene mind. And, it brings us back to evolutionary psychology, a field which is the whipping boy of both skeptics of the utility of evolutionary science in understanding human nature, and rigorous practitioners of evolutionary biology. And yet here it is not the evolutionary psychologists, but rock-ribbed statistical geneticists who I often see being quoted in the media invoking sexual selection. But do we know it is sexual selection, or is it just our best guess? Because more often than not best guesses are wrong (though best guesses are much more likely to be right than worst guesses!).

Evolutionary genomics has come a long way in the past 10 years. We know, for example, the genetic architecture and some aspects of the natural history of many traits. But, there are still shortcomings. Lactase persistence is the exception to the rule. Even a phenotype as straightforward as human pigmentation has no undisputed answer as to why it has been the repeated target of selection across Eurasia over the past 40,000 years. Oftentimes the right answer is simply that we just don’t know.



As most readers know I was at ASHG 2012. I’m going to divide this post in half. First, the generalities of the meeting. And second, specific posters, etc.


– Life Technologies/Ion Torrent apparently hires d-bag bros to represent them at conferences. The poster people were fine, but the guys manning the Ion Torrent Bus were total jackasses if they thought it would be funny/amusing/etc. Human resources acumen is not always a reflection of technological chops, but I sure don’t expect organizational competence if they (HR) thought it was smart to hire guys who thought (the d-bags) it would be amusing to alienate a selection of conference goers at ASHG. Go Affy & Illumina!

– Speaking of sequencing, there were some young companies trying to pitch technologies which will solve the problem of lack of long reads. I’m hopeful, but after the Pacific Biosciences fiasco of the late 2000s, I don’t think there’s a point in putting hopes on any given firm.

– I walked the poster hall, read the titles, and at least skimmed all 3,000+ posters’ abstracts. No surprise that genomics was all over the place. But perhaps a moderate surprise was how big exomes are getting for medically oriented people.

– Speaking of medical/clinical people, I noticed that in their presentations they used the word ‘Caucasian’ a lot. This was not evident in the pop-gen folks. It shows the influence of bureaucratic nomenclature in modern medicine, as they have taken to using somewhat nonsensical US Census Bureau categories.

– Twitter was a pretty big deal. There were so many interesting sessions that I found myself checking my feed constantly for the #ASHG2012 hashtag. It was also an easy way to figure out who else was at the same session (e.g., in my case, very often Luke Jostins).

– If you could track the patterns of movements of smartphones at the conference it would be interesting to see a network of clustering of individuals. For example, the evolutionary and population genomics posters were bounded by more straight-up informatics (e.g., software to clean your raw sequence data), from which there was bleed over. But right next to the evolution and population genomics sections (and I say genomics rather than genetics, because the latter has been totally subsumed by the former) you had some type of pediatric disease genetics aisles. I wasn’t the only one to have a freak out when I mistakenly kept on moving (i.e., you go from abstruse discussions of the population structure of Ethiopia, to concrete ones about the likely probability of death of a newborn with an autosomal dominant disorder, with photos of said newborn!).

– It was obvious which sessions were more multidisciplinary: just note the “churn” between speakers. People were switching sessions speaker-by-speaker, so if there was a stretch not to their liking, they would opt out.

– Number of questions per talk seemed to follow a power law. Many, many, talks had to have the moderator ask a token question. But there were a few panels where people rushed to the mics, and the moderators had to turn them away (this happened to me a couple of times, though I had the habit of sitting in the middle of aisles so that people wouldn’t have to edge past me, which disadvantaged me).

– 23andMe will supposedly have the new ancestry painting, with many more populations, up by the end of the year at the latest. I’ll believe it when I see it, but the person who was telling me this seemed totally sincere, and I’m hopeful.

– I drank a fair amount some of the nights, and have a lot of business cards from people I don’t remember. But one thing that seems to be emerging is a proliferation of intermediation and b2b services. With the diversity of choices it stands to reason that some firms are stepping into the clutter and attempting to make a profit by matching the two parties at the ends of the transaction. One person whom I do recall is Michael Heltzen of BlueSEQ, which he pitches as the “ of sequencing.”

– Overall this was a well run conference in terms of logistics. I’ll definitely be at Boston next year!

– Lots of stuff on archaic hominin admixture and selection. Lots.

– Friends don’t let friends use structure when they could use admixture. It seems that most people have switched to the latter, which is fast. But a few groups are still using the former. And they shouldn’t be, because their burn-in and replication parameters are set way low (I use structure for microsatellites) so that it won’t take a thousand years to converge. If you are doing this, why go for the power of Bayesian phylogenetics in the first place?

– Luke Jostins suggested that I looked different in real life from the head shot. I suspect that perhaps Luke has lower powers of perception in this domain. A very drunk member of a well respected lab decided to start yelling my name at a dim after party,* so I can’t look that different (and it isn’t as if there aren’t a lot of brown dudes walking around at these things).

– Konrad Karczewski gave me a “free the data” button which I wore, but there were mixed results when I asked if people were going to release their data sets. Some presenters offered to email me the data, but since I wasn’t flashing my badge I’m curious why they’d offer to even do this, as opposed to releasing it to millions of other strangers!


This section will mostly cover the talks and posters in the evolutionary and population genomics category. I can comment on the talks because I went to them, and on the posters because I looked at them multiple times. One thing to note is that many of the posters and some of the talks were on papers which are already in circulation (preprint or already published). I’m not going to touch on that much. I’ve reviewed/linked to most of those.

– One of the guys involved in fetal whole genome sequencing from last spring stated that the primary cost here is going to be in the sequencing. He’s also confident that they can move the sequencing much further back from 18 weeks (i.e., in terms of sample collection and analysis turnaround).

– There is a lot of talk about structural variants, etc., but for high-throughput sequencing methods we’re still not ‘there.’ I actually went to a CNV talk where the presenter presented some RFLP results! He stated that the reality was that for clinical purposes high-throughput isn’t feasible or accurate enough to distinguish 3 vs. 4 vs. 5 copies.

– I don’t get a lot of the CNV stuff which repeats SNP-results with CNVs. For example, the posters which recapitulated geographical fine-structure with CNV. This was OK for the first pub, but doing it over and over again seems gratuitous.

– Simon Gravel has some very awesome software.

– Luca Pagani is confident about rolloff’s admixture estimate for Ethiopia. He’s moving to Ethiopian whole genomes now, and plans on doing follow ups on this question (his own methods are in line with rolloff).

– The rumored paper (i.e., I’ve heard about this paper for a few years) which connects Northeast African populations with the Khoisan of southern Africa will finally be published soon. At least that was what I was told…as I noted, this result has been around for a long time, but somehow hasn’t been published. Basically the group has some Cushitic speaking samples from Kenya, and it looks like these are the Ethiopian analog to Andaman Islanders (or as close as you can get).

– We’ll see something on Afrikaner genomics soon enough. I wasn’t told explicitly, but it was pretty obvious.

– The Nielsen Group is still working on high altitude adaptations. They don’t see hard sweeps. Of course I didn’t get confirmation of whether these were old variants, but it looks as if a lot of preliminary stuff did not have the power to detect anything in the first group. As usual they are up to something.

– Speaking of the Nielsen Group, Melissa Wilson Sayres’s work on purifying selection on Y chromosomal lineages was persuasive to me. Basically, effective population size differences (e.g., polygyny) just can not explain the lower diversity of the Y lineages (they ran simulations). Luckily for the phylogeographers this won’t impact the utility of Y trees (positive selection would, but that’s not what she’s talking about). I’m a little confused whether it was Sayres’s talk or not, but these results may explain the discordance in coalescence between mtDNA and Y lineages (the former has a deeper coalescence).

– Also, Amy Goldberg from Noah Rosenberg’s lab presented some theoretical work that showed that complex demographic history has an impact on the variance, as opposed to the mean, effective population size you might infer for a given sex. Someone from Michael Hammer’s lab started asking me if I liked their research while I was looking at Amy’s poster, and I said sure (I’d blogged it), but her theoretical results also explain some of the weird stuff I’d seen out of their lab.

– Sriram Sankararaman had a poster on Neandertal admixture in modern human lineages. In the broad outlines the Reich lab and the Wall lab seem to agree (along with others, such as Melinda Yang in the Slatkin lab). We’re seeing the convergence of a new orthodoxy/paradigm. And they seem to agree broadly with Graham Coop’s conjecture.

– There was a lot of stuff on East Asian genetics, but nothing too cutting edge. I was kind of disappointed. A massive Y and mtDNA study did suggest two waves of admixture in the Tibetan highland, which a priori seems plausible to me. But the rule-of-thumb I have is not to bet against the Nielsen Group, which remains skeptical. Another paper suggested deep lineages of haplogroup M among the Burmans. This is interesting because the Burmans are presumably culturally somewhat intrusive, supplanting the Mon populations.

– The guy from the Peopling of the British Isles presented. Two points. First, ~40 percent of the ancestry in England proper seems Anglo-Saxon. Second, their clustering method seemed to find many more ‘micro-populations’ along the “Celtic Fringe” and in Scotland. Why? My hunch is that the Anglo-Saxon expansion wasn’t a diffusion process. Rather, the hordes of Hengist and Horsa probably admixed with the local Brythonic Celtic population on the East Anglian shore, and then rapidly expanded. There is a high probability of some later assimilation (there is some suggestion that Alfred the Great’s line were Brythonic nobles who were absorbed into the Anglo-Saxon power structure), but the emergence of a huge Anglo-Saxon/England proper cluster was very evident in the figure displayed. The main opposition to this thesis I can think of is that isolation-by-distance gene flow is very efficacious in the topography of England, but less so in the more rugged borderlands.

– Speaking of isolation-by-distance, an Estonian geneticist claimed to me that the distinction between Estonians and Finns probably has to do with the arrival of the original Finnic populations from the east, and their subsequent separation. While the Estonians engaged in gene flow with the Latvians, they diverged from the Finns across the water, who were more isolated until the Swedes arrived.

– There was a poster (didn’t talk to the presenters) which did whole genome analysis of a South Indian man or two, and indicated that there is evidence that these individuals are basal to all other non-Africans. This is another attempt to reaffirm the possibility of an ancient “southern route” out of Africa. I wasn’t convinced because there wasn’t much detailing of their methods (they pointed to a diversity estimate, but that’s not enough these days).

– Another Indian group confirmed a lot of stuff that Zack has found already, but supplemented it with lots of low caste/tribal samples, which most people lacked. They assert (rightly) that within South Asia there are genetic distances across populations/castes which are analogous to inter-continental differences.

– I am excited by the synthesis of spatial and genetic variation data…but am beginning to realize that this has limitations, because we can’t transpose genetic variation representation onto tesseracts (because we can’t visualize tesseracts). In short, two or three dimensional representations remove important information at the finer-grain. And it’s at the finer-grain that we’re focusing now.

– Apparently Mexicans and Chileans overestimate their European ancestry. The presenters found that 40-45% of the ancestry of their Chilean sample was Amerindian. I asked about sampling, and they admitted this might be an issue. The same applied to their other results. We need thicker data sets here. Basically if it’s a heterogeneous country, you can’t have a pie-graph labelled with that country.

– There was a poster on associating OCA2-HERC2 in Brazilians with hair, eye, and skin color. The evidence for an association of OCA2-HERC2 with skin color in unadmixed Europeans is mixed, but the association seems to show up in this population. Assuming stratification is not a problem (I believe they looked at that genomically), it seems that the effect on skin only shows up when you have a particular pigmentation genetic architecture. It’s a matter of statistics, not biology.

– Speaking of pigment, Mark Shriver had a poster which correlated perceived, apparent, and genomic racial ancestry. Perceived means how you’re perceived by others. Apparent is taking physical traits and averaging them quantitatively (facial features, skin color, etc.). And genomic ancestry is what you know about. Estimating ancestry quanta. The surprising thing is that people seemed to underestimate African ancestry from apparent physical features (looking at the scatter of apparent to genomic ancestry). This goes against folk wisdom, which asserts that “African features are dominant.”

– Lots of corrections of naive usages of Fst in the literature. A poster out of the Price lab suggested using likelihood ratios, and if not possible, Hudson’s Fst. This showed up multiple times in various forms. Fst will not die, but will be reborn!

– Saw a poster which claimed first cousin marriage decreases the expected height of offspring by 3 cm! (this was not in the evolution and pop genomics sections, and I probably should have spent more time looking over complex traits, etc., but there’s only so much you can do)

– More evidence of multiple migrations into New World. Lots of New World genomics. I didn’t talk to these presenters because they were always busy.

– Spencer Wells told me that they’d finally be publishing their paper using their Geno 2.0 results soon. They had really good population coverage, though I wish they’d had the bar plot rotated 90 degrees. I couldn’t read labels too well.

Finally, there was A LOT of software, and A LOT of methods. This is one of the things where I assume over the next decade it will shake out into a few big players. Right now labs are pumping out software to infer ancestry, phase data, etc., and playing up their advantages. This is all good, but at some point the focus will go back to biology, and the software will be the wind beneath its wings. I’m trying to free up time to play with some of the software, though much of it isn’t online yet (the presenters always assured me it would be up soon, but I know how that goes).
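Since Hudson’s Fst came up in the posters above, here is a minimal sketch of how that estimator is commonly written for biallelic SNPs, combined across loci as a ratio of sums. I’m assuming this is the form the poster had in mind; the function name and argument layout are my own illustration, not anyone’s released software.

```python
from typing import Sequence


def hudson_fst(p1: Sequence[float], p2: Sequence[float], n1: int, n2: int) -> float:
    """Hudson-style Fst from sample allele frequencies in two populations.

    p1, p2: per-SNP sample allele frequencies (same SNP order in both).
    n1, n2: number of sampled chromosomes per population (assumed constant
            across SNPs here, which is a simplification).
    """
    numerator = 0.0
    denominator = 0.0
    for a, b in zip(p1, p2):
        # Squared frequency difference, corrected for finite sample size
        numerator += (a - b) ** 2 - a * (1 - a) / (n1 - 1) - b * (1 - b) / (n2 - 1)
        # Probability that one draw from each population differs
        denominator += a * (1 - b) + b * (1 - a)
    if denominator == 0:
        raise ValueError("no variable sites between the two samples")
    # Ratio of sums across SNPs, rather than an average of per-SNP ratios
    return numerator / denominator
```

A fixed difference (frequency 1 in one sample, 0 in the other) gives an Fst of 1, while identical sample frequencies give a slightly negative value because of the sample-size correction.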

* This was not a pleasant experience.


The Pith: Natural selection comes in different flavors in its genetic constituents. Some of those constituents are more elusive than others. That makes “reading the label” a non-trivial activity.

As you may know when you look at patterns of variation in the genome of a given organism you can make various inferences from the nature of these patterns. But the power of those inferences is conditional on the details of the real demographic and evolutionary histories, as well as the assumptions made about the models which one is testing. When delving into the domain of population genomics some of the concepts and models may seem abstruse, but the reality is that such details are the stuff of which evolution is built. A new paper in PLoS Genetics may seem excessively esoteric and theoretical, but it speaks to very important processes which shape the evolutionary trajectory of a given population. The paper is titled Distinguishing between Selective Sweeps from Standing Variation and from a De Novo Mutation. Here’s the author summary:

Considerable effort has been devoted to detecting genes that are under natural selection, and hundreds of such genes have been identified in previous studies. Here, we present a method for extending these studies by inferring parameters, such as selection coefficients and the time when a selected variant arose. Of particular interest is the question whether the selective pressure was already present when the selected variant was first introduced into a population. In this case, the variant would be selected right after it originated in the population, a process we call selection from a de novo mutation. We contrast this with selection from standing variation, where the selected variant predates the selective pressure. We present a method to distinguish these two scenarios, test its accuracy, and apply it to seven human genes. We find three genes, ADH1B, EDAR, and LCT, that were presumably selected from a de novo mutation and two other genes, ASPM and PSCA, which we infer to be under selection from standing variation.

The dynamic which they refer to seems to be a reframing of the conundrum of detecting hard sweeps vs. soft sweeps. In the former case you have a new mutation, so its frequency is ~1/(2N). It is quickly subject to natural selection (though stochastic processes dominate at low frequencies, so probability of extinction is high), and adaptation drives the allele to fixation (or nearly to fixation). In the latter scenario you have a great deal of extant genetic variation, present in numerous different allelic variants. A novel selection pressure reshapes the frequency landscape, but you can not ascribe the genetic shift to only one allele. It is no surprise that the former is easier to model and detect than the latter. Much of the evolutionary genomics of the 2000s focused on hard sweeps from de novo mutations because they were low hanging fruit. The methods had reasonable power to detect them (as well as many false positives!). But of late many are suspecting that hard sweeps are not the full story, and that much of evolutionary genetic process may be characterized by a combination of hard sweeps, soft sweeps (from standing variation), various forms of negative selection, not to mention the plethora of possibilities which abound in the domain of balancing selection.
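To put a number on how high the probability of extinction is for a new mutation, here is a minimal sketch using Kimura’s diffusion approximation for the fixation probability. The assumptions are mine, not the paper’s: additive (genic) selection with heterozygote advantage s, an effective population size equal to the census size N, and a function name made up for illustration.

```python
import math
from typing import Optional


def fixation_probability(s: float, n: int, p0: Optional[float] = None) -> float:
    """Kimura's diffusion approximation, assuming Ne = N and s >= 0.

    s:  selective advantage of the heterozygote (genic selection assumed)
    n:  diploid population size
    p0: starting allele frequency; defaults to one new copy, 1/(2N)
    """
    if p0 is None:
        p0 = 1.0 / (2 * n)  # a single new copy among 2N chromosomes
    if s == 0:
        return p0  # neutral case: fixation probability equals starting frequency
    # expm1(-x) = exp(-x) - 1; the signs cancel in the ratio, and expm1
    # keeps precision when x is small
    return math.expm1(-4 * n * s * p0) / math.expm1(-4 * n * s)
```

With N = 10,000 and s = 0.01 this comes out a little under 2% (close to Haldane’s old 2s rule of thumb), so even a strongly favored new mutation is lost to drift the vast majority of the time.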

Many of the details of the paper may seem overly technical and opaque (and to be fair, I will say here that the figures are somewhat difficult to decrypt, though the subject is not one that lends itself to general clarity), but the major finding is straightforward, and illustrated in figure 4 (I’ve added labels):

– The y-axis represents the frequency of the selected allele(s) at the initial start of the selection phase

– The x-axis represents a population scaled selection coefficient: α = 4 Ns. Recall that N is the population size, and s is the standard selection coefficient, which measures the relative fitness difference between an individual/gene against the population median. A selection coefficient of 0.10 (10% increased fitness) is strong. One of 0.01 (1%) is modest.
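To make the scaled coefficient on the x-axis concrete, here is a small sketch that computes α = 4Ns and, as a rough companion, how many generations a favored allele needs to climb between two frequencies. The second function assumes a purely deterministic haploid-style model (fitness 1 + s versus 1, no drift), which is my simplification and not how the paper’s simulations work; both function names are hypothetical.

```python
import math


def scaled_selection(n: int, s: float) -> float:
    """Population scaled selection coefficient, alpha = 4 * N * s."""
    return 4 * n * s


def generations_to_reach(p0: float, p1: float, s: float) -> float:
    """Generations for a favored allele to go from frequency p0 to p1.

    Deterministic haploid (genic) model: each generation the odds
    p / (1 - p) are multiplied by (1 + s). Drift is ignored entirely.
    """
    odds_start = p0 / (1 - p0)
    odds_end = p1 / (1 - p1)
    return math.log(odds_end / odds_start) / math.log1p(s)
```

For N = 10,000 and s = 0.01, α is 400. Under the same toy model, going from a single copy (1/(2N)) to 99% takes on the order of 1,450 generations, which gives a feel for why the age and strength of a sweep are tied together.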

What the results above, derived from simulations using particular parameters relevant to population genetic models and the output statistics (e.g., iHS, EHH, Tajima’s D), show you is that it is easier to differentiate forms of selection when:

– For standing variation the selected variants are present at a higher initial frequency when selection initiates. This is not relevant for de novo mutation, where the frequency is very low by definition. Remember that the latter case is actually a subset of the former. If the standing variation model has a parameter which varies in frequency, as the proportion converges upon 1/(2N) you just get the de novo scenario.

– The stronger the selection event, the greater the power to detect and correctly assign selection for standing variation. This is rather straightforward at first blush. The main exception seems to be in panel e, where increased strength of selection decreases the ability to differentiate the models when the adaptive phase initiates when the initial allele frequency is low. I assume here you have a situation where it is difficult to distinguish the two models, as de novo and standing variation are converging. Note that it is easier to assign a hard sweep from a de novo mutation when the final frequency (or the frequency you are attempting to detect) is lower. Why? Probably because as the mutation fixes you are removing much of the variant genomic information you need to infer the trajectory of the selected variant (this is true for iHS).

All this may seem abstract. But what you need to do to make some sense of this is to visualize the trajectory of the evolutionary dynamics in temporal and concrete terms. For example, a de novo mutation which drives adaptation will rapidly expand in the population over time. Because of this phenomenon there will be a hitchhiking event where the flanking regions of the favored allele also rise in frequency. This generates an extended region of homogeneity in the genome, in direct proportion to the frequency of the haplotype. This block of homogeneity eventually decays as genetic recombination breaks apart the physical association of the markers which were found together on the original mutant by chance. This is why the power to detect these events declines over time; the perturbation decays, and the genome reverts to equilibrium. In contrast selection on standing variation is more complex, and therefore more difficult to detect, as it does not produce a clear and distinct signal as often. You may have numerous alleles dispersed across wide regions of the genome amenable to being driven up in frequency by adaptive pressures. This generates a mass action shift in variants, but does not entail the production of wide and distinctive homogeneous blocks across the genome. Rather, you have a larger number of alleles subject to less intensive individual selection. Though some of the same consequences are entailed as in the de novo mutation case, the magnitude will be sharply attenuated in any given region of the genome.
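The decay of that block of homogeneity can be sketched with the textbook result that linkage disequilibrium between two loci shrinks by a factor of (1 − r) each generation, where r is the recombination fraction between them. This is a bare-bones illustration that ignores drift and ongoing selection; the function names are mine.

```python
import math


def ld_after(d0: float, r: float, generations: int) -> float:
    """Expected linkage disequilibrium after some generations of recombination.

    d0: initial disequilibrium between the selected site and a flanking marker
    r:  recombination fraction between the two sites per generation
    """
    return d0 * (1 - r) ** generations


def ld_half_life(r: float) -> float:
    """Generations until the association has decayed to half its initial value."""
    return math.log(2) / -math.log1p(-r)
```

For a marker with r = 0.001 to the selected site the half-life works out to roughly 700 generations, while a marker ten times farther away loses the association about ten times faster. That is the sense in which the hitchhiking signal shrinks inward toward the selected site and fades over time.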

Though the conceptual & methodological issues here are of interest in and of themselves (e.g., can you trust the Approximate Bayesian Computation framework to generate simulations which give useful results?), there are also some analyses of real human genes. These are not revolutionary, they’re loci which have been analyzed before. But methods need to be judged against reality at some point, and this is an attempt. The table below shows their results.

Some of these genes should be familiar to you. If not, see the function column. I do want to mention that EDAR has been implicated in hair thickness in East Asians. The most amusing aspect of this gene is that it can turn mice into Asians, at least in their hair form. Obviously they focused on single populations. They note in the methods that more populations would introduce demographic complexities into their simulations, and it seems likely that they were already pushing the realistic boundaries of computations which you might want to run routinely in a laboratory. But, this simplification might explain some ambiguity with ADH1B, which has been found in West Asia as well (forgoing the straightforward model in all likelihood of one single sweep in East Asia). An important issue then may be the population sensitivity of these methods. One could imagine that selection at a gene is easy to discern in population A, but not population B. One population may shift to a different phenotype through standing variation, while another was subject to a hard sweep from de novo mutation. The devil here is in the details. There may not be one narrative to rule them all.

The most important result from this paper was its exploration of the reasonable parameter space over which one can make robust inferences about the specific variety of selection which is operative (or lack thereof). In the near future computational power and a surfeit of empirical data sets will make it so that there will be great temptation to generate reams of results in a blind fashion utilizing off the shelf techniques. But techniques without subtlety and human judgment can lead to confusion and falsity. It is useful to know the scenarios where one would expect large numbers of false positives or low statistical power, a priori. That way you may save yourself a great deal of time after the fact.

As for soft vs. hard sweeps, this isn’t simply a question of interest and relevance to population geneticists and genomicists. The nature of adaptation is a question of deep importance across evolutionary biology. The balance between these two phenomena is important in characterizing the mode and tempo of evolution. It may be that in fact the ratio varies across the tree of life, so that evolution may operate with slightly different rules contingent upon taxon.

Citation: Peter BM, Huerta-Sanchez E, Nielsen R (2012) Distinguishing between Selective Sweeps from Standing Variation and from a De Novo Mutation. PLoS Genet 8(10): e1003011. doi:10.1371/journal.pgen.1003011


A friend pointed me to the heated comment section of this article in Nature, Rebuilding the genome of a hidden ethnicity. The issue is that Nature originally stated that the Taino, the native people of Puerto Rico, were extinct. That resulted in an avalanche of angry comments, which one of the researchers, Carlos Bustamante, felt he had to address. Eventually Nature updated their text:

CORRECTED: This article originally stated that the Taíno were extinct, which is incorrect. Nature apologizes for the offence caused, and has corrected the text to better explain the research project described.

Here’s Wikipedia on the Taino today:

Heritage groups, such as the Jatibonicu Taíno Tribal Nation of Boriken, Puerto Rico (1970), the Taíno Nation of the Antilles (1993), the United Confederation of Taíno People (1998) and El Pueblo Guatu Ma-Cu A Boriken Puerto Rico (2000), have been established to foster Taíno culture. However, it is controversial as to whether these Heritage Groups represent Taíno Culture accurately as some Taino groups are known to ‘adopt’ other native traditions (mainly North American Indian). Many aspects of Taino culture has been lost to time and or blended with Spaniard and African culture on the Caribbean Islands. Peoples who claim to be of native descent in the islands of Puerto Rico, Hispaniola and Eastern Cuba attempt to maintain some form of cultural connection with their historic identities. Antonio de Moya, a Dominican educator, wrote in 1993, “the [Indian] genocide is the big lie of our history… the Dominican Taínos continue to live, 500 years after European contact.”

One of the ways that Taino activists now use to strengthen interest and identity is by the creation of two unique scripts. The scripts are used to write Spanish, not a retained language from pre-Columbian ancestors. The organization Guaka-kú teaches and uses their script among their own members, but the LGTK (Liga Guakía Taína-ké) has promoted their script among elementary and middle school students to strengthen their interest in Taino identity.

It is very likely that the Amerindian ancestry found in the Caribbean derives from that pre-Columbian population. And it may be that there are cultural forms which exhibit unbroken continuity. But it seems that the modern Taino are a re-precipitation out of a cultural milieu whose Amerindian self-identity had gone extinct. By analogy, Argentines have about the same proportion of Amerindian ancestry as Puerto Ricans on a population-wide basis. In fact, over 90% of the Amerindian distinctive ancestry in Argentina is not found in self-identified Amerindians (who do continue to exist as a minority, especially in the South). But to my knowledge for various cultural reasons there has not been a groundswell to shift the Argentine self-conception from being a European settler nation to a mestizo nation, let alone individuals declaring themselves Amerindian.

In comparison to the possibilities which are opened up in this case, the issue of Aboriginal genomics looks rather cut & dried. I suppose we would laugh if some people decided to “reclaim” their Neandertal heritage, but there’s a huge corpus of paleoanthropological scholarship which these individuals could draw upon to reconstruct their identities as Neandertals. It might sound ludicrous, but this is a world where a lot happens that you wouldn’t expect.


The Pith: The human X chromosome is subject to more pressure from natural selection, resulting in less genetic diversity. But, the differences in diversity of X chromosomes across human populations seem to be more a function of population history than differences in the power of natural selection across those populations.

In the past few years there has been a finding that the human X chromosome exhibits less genetic diversity than the non-sex regions of the genome, the autosome. Why? On the face of it this might seem inexplicable, but a few basic structural factors derived from the architecture of the human genome present themselves.

First, in males the X chromosome is hemizygous, rendering it more exposed to selection. This is rather straightforward once you move beyond the jargon. Human males have only one copy of genes which express on the X chromosome, because they have only one X chromosome. In contrast, females have two X chromosomes. This is the reason why sex linked traits in humans are disproportionately male. For genes on the X chromosome women can be carriers of many diseases because they have two copies of a gene, and one copy may be functional. In contrast, a male has only a functional or nonfunctional version of the gene, because he has one copy on the X chromosome. This is different from the case on the autosome, where both males and females have two copies of every gene.

This structural divergence matters for the selective dynamics operative upon the X chromosome vs. the autosome. On the autosome recessive traits pay far less of a cost in terms of fitness than they do on the X chromosome, because in the case of the latter they’re much more often exposed to natural selection via males. In the rest of the genome recessive traits only pay the cost of their shortcomings when they’re present as two copies in an individual, homozygotes. A simple quasi-formal example illustrates the process.

Imagine a population which has an allele which expresses recessively and has sharply reduced fitness when it is expressed. Assume that the allele in question, q, is present at a proportion of 0.50. All the other functional alleles are classed together as p, and are also 0.50. In the next generation the Hardy-Weinberg Equilibrium would entail that 75% of the individuals would not express the recessive trait, but 25% of the individuals would.* But for every copy of the deleterious allele which is expressed and so exposed to natural selection, there’s another copy of the deleterious allele which is “masked” in a heterozygous individual with one good copy, and so evades natural selection. As natural selection decreases the frequency of the deleterious allele fewer and fewer copies will be found in recessively expressed individuals, and so the power of selection to remove the allele will decrease as its own frequency declines. When the frequency of the deleterious allele is ~0.01, only about 1 out of 100 copies will be found in a homozygote exposed to natural selection. In this way genetic diversity of even deleterious alleles can be preserved as many low frequency recessively expressed variants.
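The arithmetic in the paragraph above can be checked with a few lines of Python. This is only a sketch under the same simplifying assumptions (a single autosomal locus, Hardy-Weinberg proportions, a fully recessive allele); the function name is made up for illustration.

```python
def fraction_exposed_autosomal(q: float) -> float:
    """Fraction of copies of a recessive allele that sit in homozygotes.

    Assumes Hardy-Weinberg proportions: homozygotes occur at frequency q**2
    and carry two copies each; heterozygotes occur at 2*p*q and carry one.
    """
    p = 1.0 - q
    copies_in_homozygotes = 2 * q**2        # exposed to selection
    copies_in_heterozygotes = 2 * p * q     # masked by a functional copy
    return copies_in_homozygotes / (copies_in_homozygotes + copies_in_heterozygotes)


for q in (0.50, 0.10, 0.01):
    exposed = fraction_exposed_autosomal(q)
    print(f"q = {q:.2f}: {exposed:.2%} of copies are in homozygotes")
```

The fraction simplifies to q itself, which is why the exposure of a recessive allele to selection falls in step with its frequency.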

The situation differs on the X chromosome. If the population consisted only of females then the model above would hold. The trait only expresses if a female has two copies of the faulty gene. But one out of every three X chromosomes in the typical human population is present in a male. That means that every deleterious allele on that X will bear its full cost if it happens to be in a male, a 1 out of 3 probability. So with an even sex ratio, when the deleterious allele is present as a fraction ~0.01 on the X chromosome, about 1 out of 3 copies will be expressed, overwhelmingly in males. This is a more than 30-fold difference between the X and autosome in terms of copies of a deleterious allele exposed to natural selection, all due to the hemizygosity of males.
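For a rough check on the X-linked case, here is a similar sketch. The assumptions are mine rather than anything stated above: an even sex ratio, the same allele frequency in both sexes, and Hardy-Weinberg proportions among females. The exact value depends on that bookkeeping, so treat the printed numbers as an order-of-magnitude check rather than a precise result.

```python
def fraction_exposed_x_linked(q: float) -> float:
    """Fraction of copies of a recessive X-linked allele that are expressed.

    Bookkeeping is per male-female pair (assumed even sex ratio):
    the male carries one X, the female carries two.
    """
    copies_in_males = q                   # every copy in a male is expressed
    copies_in_females = 2 * q             # expected copies carried by one female
    exposed_in_females = 2 * q**2         # copies sitting in homozygous females
    return (copies_in_males + exposed_in_females) / (copies_in_males + copies_in_females)


def fraction_exposed_autosomal(q: float) -> float:
    """Autosomal comparison: under Hardy-Weinberg this fraction equals q."""
    return q


q = 0.01
x_fraction = fraction_exposed_x_linked(q)
fold = x_fraction / fraction_exposed_autosomal(q)
print(f"X-linked: {x_fraction:.2%} of copies exposed, {fold:.0f}-fold the autosomal value")
```

As the allele becomes rare, nearly all of the exposure comes from the one X in three that sits in a male.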

But the effect of selection isn’t uniformly negative, the purification of bad gene copies from the population. Positive forces can also reduce diversity via a selective sweep. How and why this happens is rather straightforward. Imagine that you have a single base pair which fortuitously has a mutation which is very beneficial in a single individual. To make the expression simple imagine that it is dominant, and the individual is a heterozygote. The single individual who carries the favored mutation has a very large family because ~50 percent of their offspring also carry the favored mutation and are much more fit than the population average. And so on. This favored variant can spread very fast. Lactose tolerance is a good concrete case of this. When I say the favored variant spreads, I’m actually talking about one gene copy from one person which starts to increase in frequency because of its adaptive value. But recall that a single base pair is embedded within the genome, and that chromosomal regions are generally passed on together from parent to offspring. It’s quite often a package deal. When a favored allele emerges it enables the “hitchhiking” of nearby variants which have no selective advantage, except that they luckily exist next to a very adaptively beneficial allele (think of them as the gene’s “posse” or entourage). Of course genetic recombination breaks apart these associations over time, but this process takes generations. Until then what you see is the proliferation of a particular genomic segment along with the increase in frequency of the favored gene which is embedded in that particular region. By straightforward logic when a whole segment with associated alleles starts to increase in frequency aggregate genetic diversity decreases, as variation is swept aside.
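To get a feel for how fast a favored dominant variant can spread, here is a minimal deterministic sketch. It tracks only the frequency of the favored allele and ignores drift, recombination, and the hitchhiking variants themselves. The selection coefficient and population size are hypothetical values chosen for illustration, not estimates for lactase or any real locus.

```python
def next_frequency(p: float, s: float) -> float:
    """One generation of selection on a fully dominant beneficial allele.

    Assumed fitnesses: carriers (AA, Aa) = 1 + s, non-carriers (aa) = 1.
    """
    q = 1.0 - p
    mean_fitness = 1.0 + s * (1.0 - q * q)
    return p * (1.0 + s) / mean_fitness


def generations_to_reach(p0: float, s: float, target: float, max_gen: int = 100_000) -> int:
    """Count generations until the allele frequency first reaches `target`."""
    p, generations = p0, 0
    while p < target and generations < max_gen:
        p = next_frequency(p, s)
        generations += 1
    return generations


# Hypothetical scenario: one new copy among 10,000 diploids, 5% advantage.
p0 = 1 / (2 * 10_000)
print(generations_to_reach(p0, s=0.05, target=0.5), "generations to reach 50%")
```

In a real population a single new copy is very often lost to drift before selection can take hold, so this describes the trajectory conditional on the variant surviving its early generations.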

And yet evolution is not simply natural selection. There are two processes which have nothing to do with selection as such which might reduce genetic variation. The motor which both of these phenomena turn on is random genetic drift. As you increase the power of drift to fluctuate gene frequencies generation to generation you also increase its power to render alleles extinct as they are extinguished once they hit the zero frequency boundary condition. This is why populations which have gone through population bottlenecks are so homogeneous; drift has squeezed most of the variation out of the gene pool by capriciously favoring some alleles and eliminating most of the rest.
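The homogenizing effect of drift can be illustrated with the standard textbook expectation for an idealized Wright-Fisher population, in which heterozygosity decays by a factor of 1 - 1/(2N) each generation. The population sizes and time span below are arbitrary illustrations, not estimates for any real bottleneck.

```python
def expected_heterozygosity(h0: float, n_e: int, generations: int) -> float:
    """Expected heterozygosity after drift alone in an idealized population.

    Uses H_t = H_0 * (1 - 1/(2*N_e))**t, ignoring mutation, migration,
    and selection.
    """
    return h0 * (1.0 - 1.0 / (2.0 * n_e)) ** generations


# Hypothetical comparison: a bottlenecked population vs. a large one.
for n_e in (50, 5_000):
    h = expected_heterozygosity(h0=0.5, n_e=n_e, generations=100)
    print(f"N_e = {n_e}: expected heterozygosity after 100 generations = {h:.3f}")
```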

The dynamics relevant to this specific case are differences in male and female effective population size, and large fluctuations in long term effective population size. For purposes of reduced X chromosomal diversity one would have to posit lower female effective population size than male effective population size. The reason why this would impact the diversity of the X relative to the autosome is that the X spends 2/3 of its time in females, while the autosome spends 1/2 of its time in each sex. So if females have lower effective population sizes than males the X chromosome is being buffeted by greater stochastic forces than the autosome. More generally, the X chromosome has a lower effective population size even assuming sex balance because for every 4 copies of an autosomal chromosome there are 3 X chromosomes. Because of this reduced effective population size the X would be more sensitive to bottlenecks and the like, one of the consequences of which is reduced genetic diversity.
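The 3-to-4 relationship, and the way it shifts when the sexes differ in breeding numbers, follows from the standard textbook formulas for effective population size with unequal numbers of breeding males and females. This sketch is not taken from the paper, and the breeding numbers are made up.

```python
def ne_autosomal(n_males: float, n_females: float) -> float:
    """Effective size for autosomes: 4*Nm*Nf / (Nm + Nf)."""
    return 4 * n_males * n_females / (n_males + n_females)


def ne_x_linked(n_males: float, n_females: float) -> float:
    """Effective size for the X chromosome: 9*Nm*Nf / (4*Nm + 2*Nf)."""
    return 9 * n_males * n_females / (4 * n_males + 2 * n_females)


def x_to_autosome_ratio(n_males: float, n_females: float) -> float:
    return ne_x_linked(n_males, n_females) / ne_autosomal(n_males, n_females)


# Hypothetical breeding numbers (males, females).
for n_m, n_f in ((1000, 1000), (1000, 500), (500, 1000)):
    ratio = x_to_autosome_ratio(n_m, n_f)
    print(f"Nm = {n_m}, Nf = {n_f}: X/A effective size ratio = {ratio:.3f}")
```

With equal numbers the ratio is the familiar 3/4. Fewer breeding females than males pushes it below 3/4, and fewer breeding males pushes it above.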

All the above is important to keep in mind when reading a new report in Nature Genetics on the balance between selection and drift in reducing variation on the X chromosome and across populations. The second refers to the fact that Africans seem to exhibit less relative reduction of variation on the X chromosome than non-Africans. First, the paper’s abstract, Analyses of X-linked and autosomal genetic variation in population-scale whole genome sequencing:

The ratio of genetic diversity on chromosome X to that on the autosomes is sensitive to both natural selection and demography. On the basis of whole-genome sequences of 69 females, we report that whereas this ratio increases with genetic distance from genes across populations, it is lower in Europeans than in West Africans independent of proximity to genes. This relative reduction is most parsimoniously explained by differences in demographic history without the need to invoke natural selection.

This research is part of the trend I’ve alluded to toward looking at whole-genome sequences. Remember, a lot of the 1 million SNP papers are focusing only upon genetic variants, polymorphisms, across the 3 billion base pairs. These variants are especially informative, but they miss a lot of the genome. Additionally there are some statistical problems with bias in the selection of the variants because they’re usually tuned toward one population, Europeans (different populations have somewhat different variants across the genome). The takeaway is that the time is now nearly here when we can look at the genome at its most precise and fine-grained scale, rather than using approximations, whether it be one locus, or 1 million SNPs.

With this broad canvas in mind, if there’s one thing you’ve read about the genome it’s that much of it is not functional. It doesn’t code. There are zones of the genome which are intergenic, between genes. Natural selection generally targets functional regions, not intergenic ones. If natural selection is the primary dynamic effecting the pattern we see here then differences should manifest between genic and intergenic regions since selection plays a much larger role in the former than the latter, both in constraining variation and increasing the frequency of favored alleles.

The figure below has four panels. Every panel has an x-axis defined by distance from a gene, left to right with increasing distance. So the leftmost point can be thought of as genic, and the rightmost point as intergenic. The left panels define Europeans, and the right panels Africans. More precisely they’re displaying results from whole-genome sequences of 36 West African Yoruba and 33 European American females. The top row shows the change in raw nucleotide diversities for autosomes and X, and the bottom row illustrates the change in ratio of diversity of the two genomic classes (X vs. autosome) as a function of distance.

In molecular evolutionary genetics it is often useful to assume that the null hypothesis is neutrality. Basically that means that selection is not a main effect in driving the variation. Instead it’s a function of random forces such as mutation and drift. When one sees deviation from neutrality then one considers the effect of natural selection and the possibility of adaptation. You see here clear evidence for natural selection. The genetic diversity on the X chromosome has a much stronger relationship to distance from genes than the autosome. This matters because as you recall the X chromosome is much more brutally sculpted by natural selection on a priori grounds because disfavored alleles would be pruned more efficiently, while recessively expressing favored alleles would be less handicapped by the fact that their favored traits often did not express when they were present (because they were suppressed in heterozygotes). The pattern above is entirely in keeping with that model.

So now we’ve seen that a closer whole-genome examination of these samples implies that the X vs. autosomal difference in diversity is not just a function of neutral forces, but may have been driven by natural selection. But there’s a second part of the phenomenon: the disjunction is usually more stark in non-Africans. If so, does this imply that non-Africans have been subject to more natural selection? The manner in which they explored this question was clean and elegant: they compared the ratios of ratios as a function of distance from genes. By this, I mean that they looked at the ratio of diversity of the genome between the X and the autosome, and then generated a ratio from this value by comparing across Europeans and Africans. Unlike those above, the figure to the left shows no differences as a function of distance from genes. What does this tell us? If natural selection was more efficacious in Europeans than Africans then the differences in diversity across these two populations should be stronger near genic regions, because that is where the power of selection is most felt. Instead, what you see is that though the difference across X and autosomal genomes is real, it is consistent between the genomes of Africans and Europeans across the X and the autosome.
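The “ratio of ratios” is easy to lose track of in prose, so here is a sketch of the bookkeeping. Every number below is an invented placeholder, chosen only to mimic the qualitative pattern described above. None of them come from the paper, and the variable names are my own.

```python
# Invented diversity values in three bins of increasing distance from genes.
# These are placeholders for illustration, NOT data from the study.
african = {
    "autosome": [0.0010, 0.0011, 0.0012],
    "x": [0.00070, 0.00088, 0.00102],
}
european = {
    "autosome": [0.0008, 0.0009, 0.0010],
    "x": [0.000448, 0.000576, 0.00068],
}


def x_to_autosome(pop: dict) -> list:
    """X/A diversity ratio within one population, per distance bin."""
    return [x / a for x, a in zip(pop["x"], pop["autosome"])]


def ratio_of_ratios(pop_a: dict, pop_b: dict) -> list:
    """(X/A in pop_a) divided by (X/A in pop_b), per distance bin."""
    return [ra / rb for ra, rb in zip(x_to_autosome(pop_a), x_to_autosome(pop_b))]


print("X/A, African:  ", [round(r, 2) for r in x_to_autosome(african)])
print("X/A, European: ", [round(r, 2) for r in x_to_autosome(european)])
print("Ratio of ratios:", [round(r, 2) for r in ratio_of_ratios(european, african)])
```

In this toy setup X/A rises with distance from genes in both populations, which is the signature of selection, while the European-to-African ratio of ratios stays flat across bins, which is what a demographic explanation would predict.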

This suggests that the difference between Africans and Europeans is driven by demographics and not adaptation (positive selection) or functional constraint (negative or purifying selection). Random evolutionary forces don’t see genic or intergenic regions. They’re random, and blind or neutral to functional import. Unlike selection their impact is going to be genome-wide, just as the inter-regional differences we see here are.

In this case what happened? Going back to the beginning there were two specific possibilities: sex-biased migration and greater fluctuation in effective population size among non-Africans. The latter model is entirely consistent with an “Out of Africa” scenario where non-Africans derive from a small ancestral population which left Africa. This is the great “Out of Africa” bottleneck which seems to be a consistent finding by human molecular evolutionists. Because the X chromosome has a somewhat smaller effective population it would presumably have been more impacted by the homogenizing force of this bottleneck.

The first option though is intriguing, if peculiar. What if there were multiple “Out of Africa” pulses which consisted disproportionately of groups of young males? This would have enriched the genetic diversity of non-Africans on the autosome far more than the X chromosome, because the males would bring only one X chromosome for every two autosomes. I think the “Out of Africa” model is more plausible, but I’m not going to dismiss this scenario out of hand. We live in interesting and strange times when it comes to the origin of modern humans.

* $p^2 + 2pq + q^2 = 1 = 0.50^2 + 2(0.50)(0.50) + 0.50^2$

Citation: Gottipati, Srikanth, Arbiza, Leonardo, Siepel, Adam, Clark, Andrew G, & Keinan, Alon (2011). Relative autosomal, X-linked and X/A diversity are not correlated with genetic distance from the nearest gene. Nature Genetics : 10.1038/ng.877


The Pith: In this post I review a paper which covers the evolutionary dimension of human childbirth. Specifically, the traits and tendencies peculiar to our species, the genes which may underpin those traits and tendencies, and how that may relate to broader public health considerations.

Human babies are special. Unlike the offspring of organisms such as lizards or snakes human babies are exceedingly helpless, and exhibit an incredible amount of neoteny in relation to adults. This is true to some extent for all mammals, but obviously there’s still a difference between a newborn foal and a newborn human. One presumes that the closest analogs to human babies are those of our closest relatives, the “Great Apes.” And certainly the young of chimpanzees exhibit the same element of “cuteness” which is appealing to human adults. Still there is a difference of degree here. As a childophobic friend observed human infants resemble “larvae.” The ultimate and proximate reason for this relative underdevelopment of human newborns is usually attributed to our huge brains, which run up against the limiting factor of the pelvic opening of women. If a human baby developed for much longer through extended gestation then the mortality rates of their mothers during childbirth would rise. Therefore natural selection operated in the direction it could: shortening gestation times. You might say that in some ways then the human newborn is an extra-uterine fetus. A new paper in PLoS Genetics attempts to fix upon which specific genomic regions might be responsible for this accelerated human gestational clock. An Evolutionary Genomic Approach to Identify Genes Involved in Human Birth Timing:

Coordination of fetal maturation with birth timing is essential for mammalian reproduction . In humans, preterm birth is a disorder of profound global health significance. The signals initiating parturition in humans have remained elusive, due to divergence in physiological mechanisms between humans and model organisms typically studied. Because of relatively large human head size and narrow birth canal cross-sectional area compared to other primates, we hypothesized that genes involved in parturition would display accelerated evolution along the human and/or higher primate phylogenetic lineages to decrease the length of gestation and promote delivery of a smaller fetus that transits the birth canal more readily. Further, we tested whether current variation in such accelerated genes contributes to preterm birth risk. Evidence from allometric scaling of gestational age suggests human gestation has been shortened relative to other primates. Consistent with our hypothesis, many genes involved in reproduction show human acceleration in their coding or adjacent noncoding regions. We screened >8,400 SNPs in 150 human accelerated genes in 165 Finnish preterm and 163 control mothers for association with preterm birth. In this cohort, the most significant association was in FSHR, and 8 of the 10 most significant SNPs were in this gene. Further evidence for association of a linkage disequilibrium block of SNPs in FSHR, rs11686474, rs11680730, rs12473870, and rs1247381 was found in African Americans. By considering human acceleration, we identified a novel gene that may be associated with preterm birth, FSHR. We anticipate other human accelerated genes will similarly be associated with preterm birth risk and elucidate essential pathways for human parturition.

The practical significance of this research seems clear. A fair number of pregnancies end in preterm births, and identifying “at risk” genomic profiles can be very important information when it comes to allocation of finite medical resources. Additionally, if genomic regions which increase risk of preterm birth and correlate with reduced gestation time are well defined then one can possibly further explore the functional pathways, and perhaps even at some point in the future design treatments.

Below is the first figure from the paper. I’ve added some labels:

The results in sum are this:

– Humans have bigger brains controlling for body size than higher primates, who have bigger brains than mammals as a whole

– Humans and primates have equally large brains for their body size at birth

– Humans have larger brains at birth for their length of gestation than primates and other mammals

– Higher primates have longer gestation times in relation to their size at birth than other mammals, including humans!

The second figure shows this pattern on a phylogenetic tree of primates. You can see that the human branch is peculiar. When you have a set of characters which confound expectations from phylogeny that warrants some investigation. This may be a case where natural selection is at work, insofar as the characters of the lineage have diverged from that which is typical in the ancestral forms. They have derived characteristics distinctive to their clade, which is so important to the logic of modern systematics.

One could characterize human gestational length as a derived characteristic in relation to other primates, who exhibit a delay vis-a-vis other mammalian lineages. You might argue then that humans are “devolved throwbacks,” but in light of other aspects of human development, such as our longevity, I think this is too glib. Rather, there are only so many degrees of freedom which natural selection seems to have when it comes to balancing opposing fitness pressures in a matrix of genotypic and phenotypic states. In plain English, there’s only so much a human female pelvis can increase in terms of width before serious functional problems in locomotion make change in that direction unfeasible. Despite the fact that the brain is a metabolically expensive organ, our hominin lineage has tended toward larger cranial capacities over time. Therefore once the point of negative fitness returns on human female pelvic width was reached there had to be a change in the development of our species’ reproductive arc. If the pelvis was prevented from getting any wider due to biomechanics, and a large adult brain was a necessary condition of high fitness value for humans, then one had to accelerate the timing of childbirth so that the neonate exited while the cranium was manageable in circumference.

To explore the possible genomic underpinnings of this derived characteristic the authors of this paper focused on ‘human accelerated regions’ (HAR). These are areas of the genome which exhibit evidence of more rapid evolutionary change in humans than in their near taxonomic relations. This is suggestive of positive directional selection, though one must be careful of false positives here. Because these are loci which manifest a strong separation between humans and chimpanzees, they are often pegged as “genes which make us human.” Notwithstanding the media hype, it seems likely that focusing on these regions is a fruitful avenue by which to flesh out the nature of distinctive human characteristics, such as language, bipedality, or our large brains.

In this paper they identified noncoding as well as coding HARs. You are probably well aware of the debates around this topic. The short of it is that there are strong suspicions that some noncoding regions of the genome may regulate coding regions, and so be functionally very significant, even if indirectly. Functional significance implies that they are the targets of natural selection, both negative constraining pressures which conserve the genomic sequence, and positive directional selection which would result in substitutions at the base pair level. After obtaining a set of genes in HARs, they then narrowed that list to those which are known to be functionally significant in pathways which may be associated with gestation term length.

Eventually the authors focus on a set of markers in a noncoding region which define a genomic block in linkage disequilibrium. This is just a region of the genome with a set of correlated variants, and may be suggestive of natural selection or rapid fluctuations in population size changing the fraction of this fragment within the population from one ancestral copy. This block is within the FSH-receptor gene. Its full name is follicle stimulating hormone receptor. Modulating reproductively significant hormones is clearly a major way in which one might be able to somehow tune the pathways which modify the probability of birth at any given point in gestation. Their first study population for this gene was Finns, who are genetically homogeneous and presumably will maximize heritability of gestation length because of the minimization of environmental noise. Intriguingly they note that FSH-R “harbors SNPs with extreme iHS values in the Yoruban population, reflecting extended haplotype homozygosity and suggesting a recent selective sweep.” It’s important to remember that the research group which that iHS value came out of has walked back the confidence that one should give any particular iHS value.

But, if FSH-R was targeted for gestation term length in Finns and Yoruba, why? The iHS statistic in particular is sensitive to relatively recent selective events, in contrast to the other tests which the authors focused on which compare humans and non-humans across millions of years of evolutionary time (this is the case with most HARs measuring the rates of base pair substitution). All I can posit is that both Finns and Yoruba have gone through relatively recent population expansions due to the shift to agriculture. In a broad evolutionary framework once full genome sequencing becomes common one might want to compare populations which have had radically different demographic histories and see if there is genetic variation in gestation related genes. If the above iHS statistic for the Yoruba was not a false positive, then I would predict that the Bushmen would not exhibit a sweep, and would have a lower rate of preterm birth, all things controlled. In fact one would predict that populations which had experienced the Neolithic transition, and so constantly pushed up against the Malthusian trap with a high fertility and high mortality strategy, would have higher rates of premature birth than those groups which had not experienced such demographic changes.

Finally, these sorts of biological studies with medical relevance are clearly important in the era of universal health care. If there are inter-population differences in risk susceptibility that is crucial to know when evaluating the balance of environmental and genetic variables which affect health outcomes. Population level evaluations may be critically important so long as our genetic and genomic understanding of the root causes of such differences is provisional and tentative. All things equal an individual risk assessment is preferred, but we do not live in the best of all worlds.

I don’t know if the authors have found a marker which is going to be actionable for biomedical research. But I think their methods are a useful framework with which to start. “Evolution you can use,” so to speak.

Citation: Plunkett J, Doniger S, Orabona G, Morgan T, Haataja R, et al. (2011). An Evolutionary Genomic Approach to Identify Genes Involved in Human Birth Timing. PLoS Genetics : 10.1371/journal.pgen.1001365


Credit: Karl Magnacca

The Pith: In this post I review some findings of patterns of natural selection within the Drosophila fruit fly genome. I relate them to very similar findings, though in the opposite direction, in human genomics. Different forms of natural selection and their impact on the structure of the genome are also spotlighted over the course of the review. In particular, I explore how specific methods to detect adaptation on the genomic level may be biased by the assumptions of classical evolutionary genetic models. Finally, I try to place these details in the broader framework of how best to understand evolutionary process in the “big picture.”

A few days ago I titled a post “The evolution of man is no cartoon”. The reason I titled it such is that as the methods become more refined and our data sets more robust it seems that previously held models of how humans evolved, and evolution’s impact on our genomes, are being refined. Evolutionary genetics at its most elegantly spare can be reduced down to several general parameters. Drift, selection, migration, etc. Exogenous phenomena such as the flux in census size, or environmental variation, have a straightforward relationship to these parameters. But, to some extent the broadest truths are nearly trivial. Down to the brass tacks what are these general assertions telling us? We don’t know yet. We’re in a time of transitions, though not troubles. Going back to cartoons, starting around 1970 there were a series of debates which hinged around the role of deterministic adaptive forces and random neutral ones in the domain of evolutionary process. You have probably heard terms like “adaptationist,” “ultra-Darwinian,” and “evolution by jerks” thrown around. All great fun, and certainly ripe “hooks” to draw the public in, but ultimately that phase in the scientific discourse seems to have been beside the point. A transient between the age of Theory when there was too little of the empirics, and now the age of Data, when there is too little theory. Biology is a very contingent discipline, and it may be that questions of the power of selection or the relevance of neutral forces will loom large or small dependent upon the particular tip of the tree of life to which the question is being addressed. Evolution may not be a unitary oracle, but rather a cacophony from which we have to construct a harmonious symphony for our own mental sanity. Nature is one, and the joints which we carve out of nature’s wholeness are for our own benefit.

The age of molecular evolution, ushered in by the work on allozymes in the 1960s, was just a preface to the age of genomics. If Stephen Jay Gould and Richard Dawkins were in their prime today I wonder if the complexities of the issues on hand would be too much even for their verbal fluency in terms of formulating a concise quip with which to skewer one’s intellectual antagonists. Complexity does not make fodder for honest quips and barbs. You’re just as liable to inflict a wound upon your own side through clumsiness of rhetoric in the thicket of the data, which fires in all directions.

In any case, on this weblog I may focus on human genomics, but obviously there are other organisms in the cosmos. Because scientific funding follows biomedical application, humans have now come to the fore, but there is still utility in surveying the full taxonomic landscape. As it happens a paper in PLoS Genetics, which I noticed last week, is a perfect complement to the recent work on human selective sweeps. Pervasive Adaptive Protein Evolution Apparent in Diversity Patterns around Amino Acid Substitutions in Drosophila simulans:

In Drosophila, multiple lines of evidence converge in suggesting that beneficial substitutions to the genome may be common. All suffer from confounding factors, however, such that the interpretation of the evidence—in particular, conclusions about the rate and strength of beneficial substitutions—remains tentative. Here, we use genome-wide polymorphism data in D. simulans and sequenced genomes of its close relatives to construct a readily interpretable characterization of the effects of positive selection: the shape of average neutral diversity around amino acid substitutions. As expected under recurrent selective sweeps, we find a trough in diversity levels around amino acid but not around synonymous substitutions, a distinctive pattern that is not expected under alternative models. This characterization is richer than previous approaches, which relied on limited summaries of the data (e.g., the slope of a scatter plot), and relates to underlying selection parameters in a straightforward way, allowing us to make more reliable inferences about the prevalence and strength of adaptation. Specifically, we develop a coalescent-based model for the shape of the entire curve and use it to infer adaptive parameters by maximum likelihood. Our inference suggests that ~13% of amino acid substitutions cause selective sweeps. Interestingly, it reveals two classes of beneficial fixations: a minority (approximately 3%) that appears to have had large selective effects and accounts for most of the reduction in diversity, and the remaining 10%, which seem to have had very weak selective effects. These estimates therefore help to reconcile the apparent conflict among previously published estimates of the strength of selection. More generally, our findings provide unequivocal evidence for strongly beneficial substitutions in Drosophila and illustrate how the rapidly accumulating genome-wide data can be leveraged to address enduring questions about the genetic basis of adaptation.

Figure 1 C shows the top line. As you can see, there’s a “trough” around non-synonymous substitutions. Non-synonymous simply means that a base pair substitution at that position within the codon changes the amino acid encoded. In contrast, a synonymous change does not. A substitution is not just a mutant variant though. It is rather an assessment of a population level shift from one allele to another. Neutral theory posited that most substitutions were not driven by natural selection, but rather random walk processes. Ergo, most evolutionary change was not adaptive. A simple way to check the power of selection against this background of stochastic variation is to measure the ratio of substitution between non-synonymous and synonymous bases. But this sort of thing is more appropriate when comparing closely related species. In the paper on selective sweeps in humans obviously that’s not going on; they were looking within one species. Instead the authors looked at reduction of variation across regions which may have been targets of natural selection. The reduction occurs because when one particular allele becomes the target of strong positive selection it pulls along adjacent linked regions in a “hitchhiking” process. Recombination works against this, resulting in decay over time of the linkage disequilibrium which spikes in the wake of selection.

But these conceptions are predicated on a simple model of the emergence of variants, and the way selection does, or doesn’t, target these variants. One imagines a new mutant which arises against the ancestral genetic background. In a single-gene model the probability of fixation of a neutral mutant, that is, going to ~100% and substitution in the population, is 1/N (or 1/(2N) for diploids). In plain English the fixation probability for a neutral mutant is inversely proportional to the effective population size. In contrast, the probability of fixation of a mutant which is selectively favored is proportional to its selection coefficient, which simply measures its fitness as a ratio to that of the population mean. The fixation of neutral variants is a random walk, and the time until fixation is directly proportional to population size. In contrast, selectively favored variants can sweep to fixation rather quickly. Being very conservative one can infer that the fixation of lactose tolerance in Northern Europeans due to a mutation on the LCT gene took about 7,000 years, or a little less than 300 generations. Because of this rapidity recombination has far less leisure with which to “chop” apart the physical associations of variants on the ancestral mutant genetic background. No wonder the LCT locus has one of the longest “haplotype blocks” in the European genome; a sequence of associated markers.
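To put rough numbers on the contrast between neutral and favored mutants, here is a minimal sketch using Kimura's diffusion approximation for the fixation probability of a single new copy. The formula is the standard textbook one, not something taken from the paper, and the function name, the population size, and the assumption that effective size equals census size are all illustrative choices.

```python
import math


def fixation_probability(n_diploid: int, s: float) -> float:
    """Approximate fixation probability of one new mutant copy.

    n_diploid: number of diploid individuals (assumed here to equal the
               effective population size, which is a simplification).
    s:         selection coefficient of the mutant; 0 means neutral.
               This sketch is only meant for s >= 0.
    """
    if s == 0:
        # Neutral case: the mutant is one of 2N copies, all equally likely to fix.
        return 1.0 / (2 * n_diploid)
    # Kimura's diffusion approximation with starting frequency 1/(2N).
    return (1 - math.exp(-2 * s)) / (1 - math.exp(-4 * n_diploid * s))


if __name__ == "__main__":
    for s in (0.0, 0.001, 0.01, 0.05):
        p = fixation_probability(10_000, s)
        print(f"N=10000, s={s}: P(fixation) = {p:.6f}")
```

In a large population the favored case comes out close to 2s, which is the sense in which the fixation probability is proportional to the selection coefficient, while the neutral case shrinks as the population grows.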

But let’s modify our mental model a bit. Imagine that a genetic variant has been floating around at a low frequency for a long time. There may be many copies of the mutant, associated with different genetic variants due to the impact of recombination. We can for example imagine a recessively deleterious allele which persists in low frequencies because of the lack of efficacy of selection (most alleles are found in heterozygote individuals with normal fitness). Many variants have multiple effects. Imagine that this allele has a dominant phenotypic effect which goes from being neutral to being very selectively favored. Now you have a situation where the genomic region will be dragged upward in frequency during adaptation, but there will be many regions, not just one. Concretely, if the selective event occurred only a few generations after the original mutant the impact on the local genome would be much stronger in terms of generating homogenization than if the event occurred dozens of generations after the original mutant, as the original genetic background would have been recombined and so lost its distinctive coherency.

This is a form of natural selection from “standing variation.” Old mutants floating around in the background noise, rather than new mutants. In the paper above the authors find a fair amount of conventional selective sweeps, but, they suggest that the higher ratios of the proportion of the genome under natural selection found by some researchers in Drosophila may be due to the fact that some methods catch the whole basket of selection, while others focus on more tractable “cartoon” models.

Of the selection which can be modeled as a classic selective sweep the authors also found a “power law” effect. There was a combination of a few hits of powerful selection, and more numerous bouts of weak selection. This is not totally unexpected according to theory. Some of the human traits which have been amenable to genome-wide association, such as pigmentation, probably fall under this category. Most of the trait variance is due to a few genes of large effect, but there are a larger number of loci which account for the minority balance of variance. The same no doubt can hold across evolutionary time with the dynamics of natural selection.

But we also shouldn’t get lost in the genomic trees and lose sight of the forest. Not only are evolutionary processes subject to molecular scale parameters such as recombination and mutation rates, but they are also impacted by organism and population scale parameters. One presumes that fruit flies are subject to different pressures and have had a different history from human beings, just as both have from philopatric amphibians. Humans have an enormous census size, and we’ve undergone a massive change in lifestyle over the last 10,000 years. But as land bound mammals we may exhibit more population substructure than some species, for example birds with a wide range. Additionally, because of a low long term effective population size we have only so much genic variation to work with. Such a welter of details distorts attempts at elegance, but they need to be kept in mind.

The authors conclude:

In summary, our findings establish a distinctive, genome-wide signature of adaptation in D. simulans, suggesting that many amino acid substitutions are beneficial and are driven by two classes of selective effects. Enabled by a richer summary of diversity patterns that avoids an a priori choice of scale, these conclusions offer a coherent interpretation of the results of previous inferences. It will now be interesting to see whether similar findings emerge in other Drosophila species, which vary in their recombination rates, effective population sizes, and ecology.

I wouldn’t limit this just to Drosophila. Because the different fruit fly species have different distributions, natural histories, as well as common ancestral traits and genes, they’re an excellent laboratory of evolution. But eventually we’ll start sweeping our gazes across all the multitudinous branches of the tree of life. Soon.

Citation: Sattath S, Elyashiv E, Kolodny O, Rinott Y, & Sella G (2011). Pervasive Adaptive Protein Evolution Apparent in Diversity Patterns around Amino Acid Substitutions in Drosophila simulans PLoS Genetics : 10.1371/journal.pgen.100130

Does the chart above strike you as strange? What it shows is that the mean fitness of a population drops as you increase the rate of deleterious mutation (many more mutations are deleterious than favorable)…but at some point the fitness of the population bounces back, despite (or perhaps because of?) the deleterious mutations! This would seem, to me, an illustration of bizarro-world evolution. Worse is better! More is less! Deleterious is favorable? By definition deleterious isn’t favorable, so one would have to back up and check one’s premises.

And yet this seems just what a new paper in PLoS ONE is reporting. Purging Deleterious Mutations under Self Fertilization: Paradoxical Recovery in Fitness with Increasing Mutation Rate in Caenorhabditis elegans:

Compensatory mutations can be more frequent under high mutation rates and may alleviate a portion of the fitness lost due to the accumulation of deleterious mutations through epistatic interactions with deleterious mutations. The prolonged maintenance of tightly linked compensatory and deleterious mutations facilitated by self-fertilization may be responsible for the fitness increase as linkage disequilibrium between the compensatory and deleterious mutations preserves their epistatic interaction.

Got that? OK, you probably need some background first….

The authors used C. elegans as a model organism. This “worm” is ubiquitous in biology. There’s an enormous community of developmental biologists, geneticists, and neuroscientists who work with elegans as a model organism. For the purposes of evolutionary genetics you need to know a few things about elegans though. The vast majority of reproduction of elegans occurs through “selfing.” That is, most elegans are hermaphrodites who fertilize themselves. They’re obviously not asexual, but their habits are straight out of South Park. A small minority of reproductive events among elegans are sexual in a conventional manner, because a few of the worms in any given generation are males. For the purposes of this experiment you need to ignore this aspect; they’re focusing on the selfing. To do this they took males out of the equation, either by introducing a male killing mutation, xol-1, or by manually removing them.

So now we have just the selfers. If you pick up a standard pop gen text, e.g. Principles of Population Genetics, you’ll find out that selfers tend to have some peculiar and interesting properties when it comes to the long term arc of evolutionary genetics. In particular, they “purge” “genetic load” like crazy. What this means is that deleterious alleles get removed from selfing populations very fast through negative selection. Why? How?

Let’s go back to genetics 101. Imagine a locus where an individual is a heterozygote, and carries an allele which is “wild type” and another which is deleterious, and recessively expressed. Cystic fibrosis is a recessive disease that is common among Europeans. 1 out of 25 Europeans is a heterozygote, and there is a 1 out of 25 chance that these individuals will mate with someone who is also a carrier. Out of these pairings, 50% of the offspring will also be carriers, 25% will be wild type homozygotes, and 25% will express the cystic fibrosis disease because they’re homozygotes for the deleterious allele. With the numbers given that means 1 out of 2,500 births will result in a child with cystic fibrosis.

Cystic fibrosis is a lethal disease which sharply reduces fitness (many individuals are just infertile). This is negative selection against the deleterious allele. But, the selection is relatively weak. Why? Take a look at the ratio between those who carry the allele, but have normal fitness, and those who carry two copies and have reduced fitness. It’s 100 to 1. Most copies of the deleterious allele are “masked” from any negative fitness consequences because they’re paired up with a normal wild type which complements and compensates the function of the mutant variant. This is one reason why we carry so many deleterious alleles; they’re often paired up with a “good” copy which prevents the fitness of the individual from cratering.
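The cystic fibrosis arithmetic in the last two paragraphs can be checked in a few lines. This is only a sketch of the calculation as stated in the post (a 1 in 25 carrier frequency, random pairing, and a 1 in 4 Mendelian ratio); the variable names are my own.

```python
from fractions import Fraction

# Carrier (heterozygote) frequency quoted in the post.
carrier_freq = Fraction(1, 25)

# Chance that two carriers pair up, assuming random mating.
p_both_carriers = carrier_freq * carrier_freq

# Mendelian chance that a carrier x carrier cross yields an affected child.
p_affected_child = Fraction(1, 4)

# Frequency of affected births in the population.
affected_freq = p_both_carriers * p_affected_child

# How many carriers there are for every affected individual.
carriers_per_affected = carrier_freq / affected_freq

print("affected births:", affected_freq)
print("carriers per affected individual:", carriers_per_affected)
```

Exact fractions are used instead of floats so the two headline figures, 1 in 2,500 births and a 100 to 1 ratio of carriers to affected individuals, come out without rounding noise.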

Now let’s bring this back to selfing. In a human population we pair up with others. So you have to multiply independent probabilities, 1/25 × 1/25, to produce a Punnett square where two heterozygotes are crossed. In a scenario of selfing the probabilities are different. There’s perfect assorting of genotype to genotype for selfers, because the genotypes are simply being crossed with themselves. If you’re a fertile hermaphrodite who carries the mutant cystic fibrosis allele there’s a 25% chance that your offspring will be homozygotes for cystic fibrosis, because you know that the cross will be with another heterozygote (yourself). Now imagine that the whole population consists of selfers. Instead of 1 out of 100 copies being exposed to selection, 1 out of 2 copies are exposed to selection! This is how selfers purge genetic load so well. When selection only operates on homozygotes, their tendency to produce homozygotes means that deleterious alleles are far more exposed to selection. Why do selfing populations in the aggregate produce so many homozygotes? Heterozygotes mating with heterozygotes produce both heterozygotes and homozygotes. Homozygotes mating with homozygotes produce only homozygotes. The “toy” chart I’ve put together shows what happens when you take a uniform population of heterozygote selfers in generation 1, and allow them to reproduce down the generations. Each generation the proportion of heterozygotes, those individuals where deleterious alleles are masked and so protected from the purging power of natural selection, decreases. Selection becomes more and more efficacious in purging genetic load from the population.
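The logic behind the toy chart can be sketched as a simple recursion. The version below assumes pure selfing, no selection, and equal fertility for every genotype, so it only tracks how heterozygosity drains away; the function names and the tuple layout (wild type homozygote, heterozygote, mutant homozygote) are my own choices for illustration.

```python
from fractions import Fraction


def self_one_generation(freqs):
    """Genotype frequencies (AA, Aa, aa) after one round of pure selfing.

    Homozygotes breed true; each heterozygote yields 1/4 AA, 1/2 Aa, 1/4 aa.
    """
    hom_wild, het, hom_mutant = freqs
    quarter = het / 4
    return (hom_wild + quarter, het / 2, hom_mutant + quarter)


def selfing_trajectory(generations):
    """Frequencies for generations 1..generations, starting from all heterozygotes."""
    freqs = (Fraction(0), Fraction(1), Fraction(0))
    history = [freqs]
    for _ in range(generations - 1):
        freqs = self_one_generation(freqs)
        history.append(freqs)
    return history


if __name__ == "__main__":
    for gen, (hom_wild, het, hom_mutant) in enumerate(selfing_trajectory(6), start=1):
        print(f"generation {gen}: AA={hom_wild} Aa={het} aa={hom_mutant}")
```

Heterozygosity halves every generation, so under these assumptions nearly all copies of a recessive allele sit in homozygotes within a handful of generations, which is where selection can see them.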

There are still two other concepts important to understanding the implications of this paper. Epistasis and genetic linkage. But let’s move on to some results first, and then digest them with a further helping of conceptual condiments. Here are figures 3 & 4, which I’ve reedited a bit. On the left you see fitness (fecundity) as a function of the concentration of mutagen. In other words, as you move up the mutagen concentration on the x-axis the mutation rates are increasing. On the right you see a plot which shows the mean fitness after x # of generations, with each set of data points representing a different concentration of the mutagen. I’ve highlighted the lines with no mutagen, and maximal mutagen.

The bizarro aspect is the jump between 80 mM and 100 mM. As mutation rates increase there is a bounce back of fitness. Imagine that you were rolling a boulder up an incline which got progressively steeper in its grade. Common sense and basic physics would tell you that you’d have to use more and more force to move the boulder the same distance. Now imagine that beyond a certain grade of steepness you actually had to use less force! That would make no sense. In some ways that’s what’s going on here. But then, evolutionary processes may not be so linear and predictable as Newtonian mechanics.

Of course there could be some straightforward reasons for this strange behavior. For example, the xol-1 mutant which produced maleless populations may have had pleiotropic effects. To test for this they manually removed males from a population without the mutation, and obtained similar results. Additionally, they also took a divergent elegans line with the xol-1 mutant and performed the same experiments, and again the same pattern recapitulated itself. Finally, there’s always the possibility that resistance to the mutagen had developed above a certain concentration. If resistance to the mutagen had developed, presumably taking the population which had exhibited the increased fitness at ~100 mM, and placing it back into lower concentration environments, would produce a different response curve than we saw before. That is not what occurred, as you can see in figure 6.

Now that we have the core results under our belt, let’s move on to trying to make sense of how water can flow uphill like this. So back to the concepts, genetic linkage, and epistasis. The first is easy. Genes are arrayed along physical DNA strands. The closer the physical position of the genes, the more likely they are inherited together in a straightforward fashion. The kink in the expectation is recombination. In diploid organisms you have two copies of each gene on the two strands. Recombination can shuffle specific gene copies from one strand to the other (or, more accurately, break and recombine strands in a fashion so that both differ from the state before the event). The further the distance between any two gene copies on a physical strand, the greater the likelihood for recombination to separate the two. When two copies are very close there’s only a small physical distance across which recombination might operate to separate them. Therefore, the closer the copies the more “linked” the genes are.
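The relationship between physical distance and linkage can be made quantitative with the textbook decay formula for linkage disequilibrium, D_t = D_0 (1 - r)^t, where r is the recombination fraction between two loci. The sketch below assumes a large random mating population, which is not the situation of the selfing worms in the paper; the function name and the example values of r are illustrative only.

```python
def ld_after(d0: float, r: float, generations: int) -> float:
    """Linkage disequilibrium remaining after some generations of random mating.

    d0: starting disequilibrium between two loci.
    r:  recombination fraction between them (0 = never separated, 0.5 = unlinked).
    """
    return d0 * (1 - r) ** generations


if __name__ == "__main__":
    start = 0.25
    for r in (0.5, 0.1, 0.001):
        remaining = ld_after(start, r, 50)
        print(f"r={r}: D after 50 generations = {remaining:.6f}")
```

Loci that sit close together (small r) keep their association for a long time, while unlinked loci lose half of it every generation.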

Before explaining why this matters, let’s talk about epistasis. Epistasis can be thought of generally as gene-gene interaction. In the mechanistic molecular sense you’re referring to biophysical processes whereby one gene has some interaction with another gene. But there’s another way to think about it: in terms of fitness or trait value. In this sense epistasis as gene-gene interaction introduces non-linearities into the mapping of genotype to phenotype, as well as genotype to fitness. This is what matters for the purposes of this paper. In particular, epistasis manifesting as compensatory deleterious mutations.

So how does this matter for selfers? Recall that above we were talking about how selfers purge deleterious genetic load by cranking up the proportion of homozygotes exposed to negative selection. Implicitly our model was single locus. We were looking at one gene, and one mutant. But how about if you had a large number of mutants? Can selfers produce all those homozygotes simultaneously, and so purge the load efficiently? Purging load through natural selection entails reduced fitness for many members of a population; purge too much and the population crashes and you’re liable to just go extinct through mutational meltdown. This is where linkage and recombination come back to the fore. Recombination is often thought of as a way to create new genetic combinations. But in homozygous selfing lineages recombination doesn’t live up to that promise: there’s not enough heterozygosity within the genomes of these organisms for the shuffling of the strands across each other to produce anything new! Selfing lineages exhibit very strong linkage between sequences of genetic variants across loci because of the inability of recombination to break apart associations. So, if you have two genes, A and B, which are linked, and A is very fit and B is moderately unfit, if they are co-inherited B may sweep up to fixation with A. As you crank up mutation rates the theory predicts that deleterious alleles will simply swamp out the ability of selfing lineages to purge the load fast enough to prevent ultimate extinction. Even if the genetic background wasn’t homozygous, too many mutations within the genome would mean swapping out deleterious copies for other deleterious copies during recombination.

That theory was borne out more or less at concentrations of the mutagen below 100 mM. But then expectation was confounded. Why? This is where epistasis steps into the picture. In the previous model we implicitly assumed an additive model. Imagine the fitness of allele 1 at gene A ~ 3 and the fitness of allele 2 at gene B ~ -2. Summing them together ~ 1. And so on. Epistasis confuses this simple picture because it implies non-linear computations. The fitness value of A and B may be conditional on the state of a third gene, C. In any case, a compensatory mutation is one where more deleterious is in fact less deleterious. Precisely, having two deleterious mutations may actually have less of a fitness hit than having one deleterious mutation! In some ways this becomes a matter of semantics and analytic philosophy. -10 + -10 > -10 is just incoherent.
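The difference between the additive picture and a compensatory one can be shown with a toy lookup table. The additive numbers (3 and -2) come from the paragraph above; the epistatic fitness values and the mutation labels d1 and d2 are invented purely for illustration and are not from the paper.

```python
def additive_fitness(effects):
    """Additive model: the genotype's fitness is the sum of per-allele effects."""
    return sum(effects)


# Hypothetical fitness effects relative to wild type (0 = no change).
# Each single mutant costs 10, but the double mutant costs only 4,
# because the second mutation partly compensates for the first.
EPISTATIC_FITNESS = {
    frozenset(): 0,
    frozenset({"d1"}): -10,
    frozenset({"d2"}): -10,
    frozenset({"d1", "d2"}): -4,
}


def epistatic_fitness(mutations):
    """Epistatic model: fitness depends on the combination, not on a sum."""
    return EPISTATIC_FITNESS[frozenset(mutations)]


if __name__ == "__main__":
    print("additive, A=3 and B=-2:", additive_fitness([3, -2]))
    print("additive, two -10 mutations:", additive_fitness([-10, -10]))
    print("epistatic, d1 alone:", epistatic_fitness({"d1"}))
    print("epistatic, d1 with d2:", epistatic_fitness({"d1", "d2"}))
```

Under addition two hits of -10 can only make things worse, whereas a lookup keyed on the combination lets the double mutant sit above either single mutant, which is all that "compensatory" means here.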

Since this is not a philosophy blog, how does this relate to selfing lineages? It goes back to linkage. Recall that tight linkage may produce situations where recombination cannot break apart unhealthy associations where favorable variants are linked with unfavorable ones, and the latter may hitchhike with the former in selective sweeps (in populations with more heterozygosity recombination would increase the range of combinations across which selection operated; see Muller’s ratchet). This is the bad. But in the case of compensatory mutations the inability of recombination to break apart associations may be a positive. These epistatic interactions are contingent on robust combinations persisting. Recombination would break apart those combinations, preventing the fitness gains from persisting across generations. But in these selfing lineages the homogenized genetic backgrounds are relatively fixed palettes against which these mysterious genetic interactions which turn expectations upside down can perform their magic.

This paper had some moderately weird results. The response to mutagen concentration increases seemed robust within their set of experiments, but who knows how general this phenomenon is? A reliance on compensatory mutation also strikes me as only less weird because the results were so weird. In the last paragraph the authors seem to acknowledge the general strangeness at work:

Regardless of the mechanism driving the fitness increase exhibited by populations exposed to 100 mM EMS, the result is a testament to the resiliency of the genome. Consistent exposure to high mutation rates should wreak havoc on the genome, and repeated exposure to 80 mM EMS (Figure 5) appears to do just that. However, the genome is able to recover a large proportion of the fitness lost at 80 mM EMS when exposed to 100 mM EMS (Figure 3). This result is quite surprising and challenges the long-held beliefs concerning the relationship between mutation rates and fitness.

The long-held belief presumably being that high mutation rates are correlated with decreased mean fitness, and ultimately likely extinction. A great deal of post-apocalyptic fiction from the Cold War period was predicated on just this assumption. And clearly in most cases this seems to be a warranted axiom. On the other hand, sometimes in biology the minor exceptions are more important in explaining the patterns of diversity we see around us. If there was a veil of ignorance over us and we had to predict the nature of replicating organisms on this planet would we predict the incredible diversity we see all around us? Would we predict intelligent life? I suspect that there would be the preference for a simple and elegant model where life on earth was optimized toward extremely simple and highly robust rapid replicators. Prokaryotes. And to a first approximation that logical inference based on Darwinian assumptions would be correct. Prokaryotes are omnipresent. In fact, some estimate that there are 10 times as many bacterial cells within a human body as human cells. But obviously there are creatures on the Earth besides prokaryotes. And we care a great deal about this “residual” from the expected trend line….

Citation: Morran LT, Ohdera AH, & Phillips PC (2010). Purging Deleterious Mutations under Self Fertilization: Paradoxical Recovery in Fitness with Increasing Mutation Rate in Caenorhabditis elegans. PloS one, 5 (12) PMID: 21217820


The number 1 gets a lot more press than -1, and the concept of heterozygosity gets more attention than homozygosity. Concretely the difference between the latter two is rather straightforward. In diploid organisms the genes come in duplicates. If the alleles are the same, then they’re homozygous. If they’re different, then they’re heterozygous. Sex chromosomes can be an exception to this because in the heterogametic sex you generally have only one copy of a gene, as one of the chromosomes is sharply truncated. This is why human males are subject to X-linked recessive traits at such a great frequency in comparison to females; recessive expression is irrelevant when you don’t have a compensatory X chromosome to mask the malfunction of one allele.

Of course recessive traits are not simply a function of sex-linked traits. Consider microcephaly, an autosomal recessive disease. To manifest the trait you need two malfunctioning copies of the gene, one from each parent. In other words, you exhibit a homozygous genotype with two mutant copies. I suspect that this particularly common context of homozygosity, recessive autosomal diseases, is one reason why it is less commonly discussed outside of specialist circles: there is a whole cluster of medical and social factors which lead to homozygosity which are already the focus of attention. The genetic architecture of the trait is of less note than the etiology of the disease and the possible reasons in the family’s background which might have increased the risk probability, especially inbreeding. In contrast heterozygosity is generally not so disastrous. Even if functionality is not 100%, it is close enough for “government work.” The deleterious consequences of a malfunctioning allele are masked by the “wild type” good copy. The exceptions are in areas such as breeding for hybrid vigor, when heterozygote advantage may be coming to the fore. The details of complementation of two alleles matter a great deal to the bottom line, and the concept of hybrid vigor has percolated out to the general public, with the more informed being cognizant of heterozygosity. But homozygosity is of interest beyond the unfortunate instances when it is connected to a recessive disease. Like heterozygosity, homozygosity exists in spades across our genome. My 23andMe sample comes up as 67.6% homozygous on my SNPs (which are biased toward ~500,000 base pairs which tend to have population wide variation), while Dr. Daniel MacArthur’s results show him to be 68.1% homozygous across his SNPs. This is not atypical for outbred individuals. In contrast someone whose parents were first cousins can come up as ~72% homozygous.
This is important: zygosity is not telling you simply about the state of two alleles, in this case base pairs, it may also be telling you about the descent of two alleles. Obviously this is not always clear on the base pair level; mutations happen frequently enough that even if you carry two minor alleles it is not necessarily evidence that they’re identical by descent (IBD), or autozygous (just a term which denotes ancestry of the alleles from the same original copy). What you need to look for are genome-wide patterns of homozygosity, in particular “runs of homozygosity” (ROH). These are long sequences biased toward homozygous genotypes.
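Both statistics mentioned here, the overall share of homozygous SNPs and runs of homozygosity, are easy to sketch on a toy genotype list. This is a deliberately naive version: it assumes genotypes arrive as two-letter strings in chromosomal order and treats any single heterozygous call as ending a run, whereas real ROH callers work over sliding windows with tolerances for miscalls and missing data. The function names and the min_snps threshold are hypothetical.

```python
def fraction_homozygous(genotypes):
    """Share of SNPs whose two alleles are identical, e.g. 'AA' but not 'AG'."""
    return sum(g[0] == g[1] for g in genotypes) / len(genotypes)


def runs_of_homozygosity(genotypes, min_snps=3):
    """Return (start, end) index pairs, end exclusive, for each unbroken
    stretch of at least min_snps consecutive homozygous genotypes."""
    runs = []
    start = None
    for i, g in enumerate(genotypes):
        if g[0] == g[1]:
            if start is None:
                start = i  # a new homozygous stretch begins here
        else:
            if start is not None and i - start >= min_snps:
                runs.append((start, i))
            start = None  # a heterozygous call ends the stretch
    # Close a stretch that reaches the end of the list.
    if start is not None and len(genotypes) - start >= min_snps:
        runs.append((start, len(genotypes)))
    return runs


if __name__ == "__main__":
    toy = ["AA", "AG", "CC", "TT", "GG", "AA", "CT", "GG", "GG"]
    print("fraction homozygous:", fraction_homozygous(toy))
    print("runs:", runs_of_homozygosity(toy, min_snps=3))
```

The point of separating the two functions is the one made above: two genomes can have a similar overall fraction of homozygous SNPs while differing sharply in whether that homozygosity is scattered or concentrated in long runs.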

What ROH can tell you about an individual, and perhaps a population, becomes more clear when you conceptualize in your mind’s eye the basic dynamics which occur in the course of biological replication in diploid sexual organisms. Each individual receives half their autosomal genome from each parent. Though genes are abstractions, individual units at the root of a complex causal sequence which maps to a phenotype, a trait, they’re also physical entities embedded within the structure of DNA. This structure is a physical sequence, whereby you have adjacent base pairs, clusters of which define genes, intergenic regions, exons, introns, promoters, etc. In other words, the whole alphabet soup of molecular genetics. The spatial relationship of genes to each other along the chromosome allowed for linkage mapping decades before the biophysical substrate of DNA was known to be critical to the whole process. Particular sequences of alleles may therefore be inherited together, and form a haplotype. Over the generations the associations of these distinctive alleles in haplotypes dissolve through recombination, a physical process which erodes the structural integrity of chromosomal sequences.

With these basics in mind, let’s move to a specific repulsive example. Imagine a father who impregnates his daughter. Why is this repulsive to us? From a consequential “gene’s eye” perspective the father is suborning the beauty of sexual reproduction whereby genetic variation is mixed & matched across individuals. Colloquially, where the daughter would be 50% of the father genetically, the child of the daughter and her father would be 75% of the father genetically. From a gene-only perspective this may be favorable, as the father is coming closer to cloning himself, but we all know that the rate of breakdown of the “vehicle” in these individuals is high. Why? Inbreeding leads to a relatively massive increase in homozygosity as chromosomal segments identical by descent are paired off against each other. We know that the problem is that a host of nasty recessive diseases are highly likely in inbred individuals.

All humans carry a large load of deleterious alleles. Some of these may be potentially lethal. But like bombs without the trigger, a functional copy of the allele complements and masks the mutant variety and we carry on. Many of these mutants are particular to our family, and some of them are private even to ourselves, the outcome of de novo mutations which make each human a distinctive genetic island (at least until they reproduce and pass on their mutational distinctions). Therefore a man who mixes his own genes together in the act of incest is potentially lighting the fuse whereby these hidden malevolent mutants will explode from being cryptic genetic abnormalities toward full-blown disease monstrosities.

One statistic which would register incest would be ROH; naturally when you have long regions of recently IBD chromosomal segments adjacent to each other you’ll have a lot of homozygosity, since the paired alleles are replica copies. Assuming that an individual with many long ROH can survive and reproduce over time these massive swaths of homogeneity will be wiped away by mutation and recombination as well as outbreeding. Incest is still arguably a health disaster, but one can imagine the motive genetic engines of evolutionary variation healing the damage over time.
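To make the idea of a run concrete, here is a minimal sketch of how one might scan an ordered list of genotype calls for ROH. It is not the method used in the papers discussed here; the function name `find_roh`, the thresholds `min_snps` and `min_length_bp`, and the toy data are all assumptions made for illustration. Real callers also tolerate occasional heterozygous or missing calls within a window.

```python
def find_roh(positions, genotypes, min_snps=3, min_length_bp=1000):
    """Return (start_bp, end_bp, n_snps) for each run of consecutive homozygous calls.

    positions: base-pair coordinates, sorted ascending (one chromosome).
    genotypes: two-letter calls such as "AA" or "AG", aligned with positions.
    The two thresholds are hypothetical; real studies tune them carefully.
    """
    runs = []
    start_idx = None
    for i, gt in enumerate(genotypes):
        homozygous = len(gt) == 2 and gt[0] == gt[1]
        if homozygous:
            if start_idx is None:
                start_idx = i          # a new run begins here
        elif start_idx is not None:
            runs.append((start_idx, i - 1))  # a heterozygous call ends the run
            start_idx = None
    if start_idx is not None:
        runs.append((start_idx, len(genotypes) - 1))

    kept = []
    for a, b in runs:
        n_snps = b - a + 1
        length = positions[b] - positions[a]
        if n_snps >= min_snps and length >= min_length_bp:
            kept.append((positions[a], positions[b], n_snps))
    return kept


# Toy data, invented for illustration only.
positions = [100, 600, 1500, 2200, 3000, 3900, 4500, 5200]
genotypes = ["AA", "AG", "CC", "TT", "GG", "AA", "CT", "GG"]
print(find_roh(positions, genotypes))
```

Widening or narrowing the thresholds changes which runs are reported, which is why the exact parameters matter when comparing ROH figures across studies.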

And it doesn’t have to be so extreme. Father-daughter or sibling incest is only a boundary condition. First cousin marriages aren’t nearly as disastrous, the fecundity of British Pakistanis despite higher rates of genetic abnormalities being clear evidence of this. They are certainly more evolutionarily fit than non-Pakistani Brits, who do not reproduce at the clip of 4 children per family. These clans will exhibit more modest levels of ROH because the coefficient of relationship between cousins is only 1/8, as opposed to 1/2 between parents and children or full siblings.
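The fractions quoted above follow from simple path counting in a pedigree. The sketch below assumes the common ancestors are not themselves inbred, in which case the expected inbreeding coefficient F of a child is half the coefficient of relationship r between its parents. The helper name `relationship` and the path lists are mine, written for illustration.

```python
from fractions import Fraction


def relationship(paths):
    """Coefficient of relationship: sum over connecting paths of (1/2)**steps.

    Each entry in `paths` is the number of parent-child steps in one path
    linking the two relatives through a common ancestor (or directly, for
    parent and child). Assumes no inbreeding among the ancestors.
    """
    return sum(Fraction(1, 2) ** steps for steps in paths)


pairs = {
    "parent-child": [1],       # one direct step
    "full siblings": [2, 2],   # via the mother and via the father
    "first cousins": [4, 4],   # via each of the two shared grandparents
}

for name, paths in pairs.items():
    r = relationship(paths)
    # A child of this pair is expected to be autozygous at a fraction r/2 of loci.
    print(f"{name}: r = {r}, offspring F = {r / 2}")
```

So the child of a father-daughter union has an expected F of 1/4, against 1/16 for the child of first cousins, which is the sense in which cousin marriage is a far milder case.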

The figure to the left is from a 2008 paper on ROH in Europeans. Specifically these are Orcadians or part-Orcadians, a population you should be familiar with from the HGDP panel. Orcadians are natives of the Orkney islands just off the north coast of Scotland. Though of somewhat diverse origins, Viking, Scot and Pict, being islanders they’ve developed their own genetic peculiarities because of their isolation. A good rule of thumb is that any body of water is a fearsome barrier to casual gene flow. On the y-axis you see the total number of ROH in the genome of a given individual. I point you to the methods if you are curious as to the exact parameters they specified in their calculation. ROH is assessed over a window of the genome, and naturally one can vary its width, as well as the stringency in registering a particular region as a run or not a run. On the x-axis are the total lengths in terms of base pairs. What you see is a positive correlation between the number of ROH and the total genomic length of the sequences. Those Orcadians who are genetically more diverse because of non-Orcadian parentage have the least homozygosity in their genomes. Those who are products of recent cousin marriage have the most. But notice a peculiar pattern: there’s a curvilinear trend to the values. In those individuals who presumably have very high inbreeding coefficients the total length of ROH seems to exceed one’s expectation based on just the total number of ROH. Why? Because they have very long runs of homozygosity indeed. This is just what we’d expect from the sort of process I described earlier, where it takes many generations for the long chromosomal sequences to be broken apart by recombination.

Before I get you too excited about the genetics of European homozygosity, let’s take a wider view. Some of the same researchers who published the paper above have come out with a set of results which survey the world. Genomic Runs of Homozygosity Record Population History and Consanguinity:

The human genome is characterised by many runs of homozygous genotypes, where identical haplotypes were inherited from each parent. The length of each run is determined partly by the number of generations since the common ancestor: offspring of cousin marriages have long runs of homozygosity (ROH), while the numerous shorter tracts relate to shared ancestry tens and hundreds of generations ago. Human populations have experienced a wide range of demographic histories and hold diverse cultural attitudes to consanguinity. In a global population dataset, genome-wide analysis of long and shorter ROH allows categorisation of the mainly indigenous populations sampled here into four major groups in which the majority of the population are inferred to have: (a) recent parental relatedness (south and west Asians); (b) shared parental ancestry arising hundreds to thousands of years ago through long term isolation and restricted effective population size (Ne), but little recent inbreeding (Oceanians); (c) both ancient and recent parental relatedness (Native Americans); and (d) only the background level of shared ancestry relating to continental Ne (predominantly urban Europeans and East Asians; lowest of all in sub-Saharan African agriculturalists), and the occasional cryptically inbred individual. Moreover, individuals can be positioned along axes representing this demographic historic space. Long runs of homozygosity are therefore a globally widespread and under-appreciated characteristic of our genomes, which record past consanguinity and population isolation and provide a distinctive record of the demographic history of an individual’s ancestors. Individual ROH measures will also allow quantification of the disease risk arising from polygenic recessive effects.

Their data set consists of the HGDP sample populations, so you naturally have the broad geographic clusters such as Africa, Europe, West Asia, Central/South Asia, East Asia, Oceania, and the New World. Two big dynamics are superimposed upon each other in the patterns of ROH: “deep history” demographic processes such as bottlenecks and population expansions, and cultural anthropological patterns which we see around us such as cousin marriage within inbred clans. To find the former you need to survey the genome finely. In contrast the latter leaves pretty obvious signs genomically in the form of very long ROH, as well as clusters of recessive diseases.

The first figure shows the distribution of different lengths of ROH by population:


Here’s the take away:

– Oceanians have many short ROH, but as you increase the length of ROH threshold they are not exceptional at all

– The New World samples persist in having a disproportionate number of ROH no matter the length, though the number does drop as you increase the length threshold. This makes sense: the human genome is of finite length and you can only have so many very long ROH

– The West Asian and Central/South Asian populations seem to have more long ROHs than the other Eurasian or African groups, though they’re not exceptional in the lowest category

– The Africans have the least ROH, especially in the category of very short runs

Before I comment on these patterns in detail, let’s quickly check out the next figure. It looks at Africans only, but divides the sample into those which are hunter-gatherers and those which are agriculturalists.


The hunter-gatherers have more, and longer, ROH than the agriculturalists. Why? The answer in large part explains the geographical patterns as well: the agriculturalists have a larger long term effective population. Effective population, roughly speaking, refers to the portion of the population which contributes genetically to the next generation. Small effective populations mean a lot of genetic drift because of increased sampling variance, and tend to converge upon consanguinity. If your tribe is small enough the only people you may find to marry are your cousins. As I noted above, this will produce long ROH as individuals will have descent through multiple lines from the same ancestor, increasing the probability of autozygosity greatly. A similar logic, cousin marriage driven by custom rather than by small numbers, explains why West Asians and Central/South Asians are enriched for long ROH relative to other groups excepting Amerindians. Here’s a map from


Many Muslim societies practice cousin marriage, and many Muslims even argue that it is the Islamic practice (Muhammad married one of his cousins among his many wives. Strangely somehow these Muslims don’t argue that it is also the Muslim custom to marry old rich widows, though some do argue for the importance of marrying barely pubescent girls). Additionally, in India many Hindu groups in the South practice consanguineous marriages, including uncle-niece marriage. This is all occurring now, and so produces signatures of long ROH in many families. The final figure breaks down the individuals from selected populations, with again the y-axis being the number of ROH and the x-axis being the total length of the ROH:


The population sets are representative of broader geographic clusters. The Karitiana are from the Amazon, the Mandenka from Senegal, and the Balochi from Pakistan. If you don’t know where the French and Japanese are from, I would ask you never leave a comment on this weblog. Notice a few French, Mandenka, and Japanese individuals deviated away from their main clusters. These are cryptically inbred, perhaps their parents were cousins, or some of their grandparents were cousins. In contrast the Baloch have a wide range in terms of length of ROH; this is typical of populations where a large proportion of individuals are the products of cousin marriage, but many are not. The fact that individuals would exhibit a large variance of expected relatedness between their parents means that their own inbreeding coefficients and the genomic correlates (in this case ROH) would also vary greatly. The same parameter is operative among the Karitiana, an endangered ethnic group which presumably has a small “mate market” available to each individual.

So what about the Papuans? Their cluster is tight, and they don’t have nearly the total length of ROH as the Amazonian tribe. But remember that in the first figure they had many short ROH. A plausible explanation for this is that the Papuans went through an ancient bottleneck, from which they have expanded. The bottleneck increased genetic drift and so generated highly common haplotype blocks which combined to produce runs of homozygosity. But over time these blocks would have disintegrated through mutation and recombination. ROH in the Papuans then is simply a shadow of demographic events past, while ROH in the Baloch is evidence of demographic events present.

These two balancing realities are starkly illustrated in the supplements when you drill down to the South and Central Asian groups. In the figure it is clear that the group with the consistently highest number of ROH is the Kalash. This makes sense. The Kalash are a genetic isolate because they’re traditionally a pagan non-Muslim group isolated in the remote Chitral region of Pakistan. Because Muslims cannot join their tribe, for over a thousand years the gene flow has been unidirectional, as Kalash convert to Islam and so assimilate into the broader Pakistani society. In contrast the other Pakistani groups have a huge variance in the total amount of ROH. The individuals with the least ROH in both total length and number in the sample are Baloch, Brahui and Makrani, as are some of the individuals with the highest values on these statistics! While the Kalash have been slowly and consistently ground down by the pressure of small population size, the Baloch, Brahui, and Makrani are subject to the hammer-blows of several generations of first cousin marriages in inbred clans. These repeated marriages across the generations rapidly increase the ROH, as first cousins may be more closely related to each other genetically than they are anthropologically.

In the pre-genomic era it was simple to calculate inbreeding. Just look at pedigrees. From this you derived the inbreeding coefficient. The key is to remember that the relationships among the sum totality of one’s ancestors were critical in this calculation. In the USA marriages between first cousins occur between individuals whose grandparents are not usually related. But in other societies the generation of the grandparents, and perhaps great-grandparents, may also have been cousins. But pedigrees have limits, and may miss deep ancestry. The figure to the left, from the first paper I referenced, shows the relationship of the proportion of an individual’s ancestry which is identical by descent as calculated by genomic (ROH) methods on the y-axis and conventional ones on the x-axis (pedigree). There’s an obvious correlation, but observe the slight bias toward values above the line of best fit, and the fact that the y values are higher than the x. Genomic estimates capture common ancestry which lies outside the purview of conventional genealogy!
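A genomic inbreeding estimate of this kind is usually expressed as the fraction of the genome lying within ROH. The sketch below assumes that definition. The function name `f_roh`, the segment coordinates, and the genome length of 2.7 billion base pairs are placeholders I chose for illustration, not numbers from the paper.

```python
def f_roh(roh_segments, genome_length_bp):
    """Fraction of the surveyed genome covered by runs of homozygosity.

    roh_segments: iterable of (start_bp, end_bp) pairs, assumed non-overlapping.
    genome_length_bp: total length surveyed (a hypothetical value below).
    """
    total_in_runs = sum(end - start for start, end in roh_segments)
    return total_in_runs / genome_length_bp


# Invented example: two long runs totalling 180 Mb out of 2,700 Mb surveyed.
segments = [(0, 90_000_000), (100_000_000, 190_000_000)]
estimate = f_roh(segments, 2_700_000_000)

pedigree_f = 1 / 16  # expected F for the child of first cousins
print(f"F_ROH = {estimate:.4f} vs pedigree F = {pedigree_f:.4f}")
```

In this invented case the genomic estimate comes out slightly above the pedigree value, which is the direction of the bias described above, since a pedigree only counts the ancestors it records.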

The implications of these patterns are two-fold: first, looking backward toward human history, and second, forward toward biomedical science. Patterns of ROH here are roughly in line with a serial bottleneck model Out of Africa; the further populations are from Africa the more short ROH they have. African populations have the least of these because of their larger long term effective population size, and relative insulation from the bottlenecking process. A shorter term phenomenon is that of consanguineous marriage patterns, whether conscious and culturally normative (as in the Muslim world and parts of South Asia), or due to demographic constraint, as is the case among hunter-gatherers. These two processes together are relevant because of the prominence of recessive diseases within the domain of medical genetics. Clearly very long ROH is a sign of inbreeding, and so a likely higher susceptibility of an individual to a host of ailments. But the authors note that the sum effect of many short ROH may also be problematic, especially due to the fact that these together may form the preponderance of the ROH within the genomes of many populations.

So far I’ve basically alluded to demographic history, and how it shapes the genome through processes which are fundamentally neutral and stochastic. Inbreeding itself can be thought of as a form of super-charged drift, as the long term effective population of a breeding group collapses in on itself. But what about natural selection? I decided to take a closer look at the ROH of Dr. Daniel MacArthur of Genomes Unzipped. One of his longest regions is on Chromosome 2, is about 2 Mb in length, and runs from position 134606441 to position 136593184. In 23andMe there’s a position which I think might explain this: 136325116. That’s the number for rs4988235 in the 23andMe data file. Variation on this SNP tracks lactase persistence in Europeans. Dr. Daniel MacArthur has the genotype for lactase persistence in the homozygote form. Are we seeing here, in this long ROH, the long haplotype associated with lactase persistence, which rose rapidly in frequency in the last 10,000 years because of natural selection? In general the parameters outlined in the paper satisfy the broad sketch of human history, but there may be interesting detail on the margins left out of the picture.
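The coordinates quoted above are easy to check by hand, but for anyone repeating this on their own raw data, a sketch of the arithmetic follows. The variable names are mine, and the numbers are only the ones given in the text.

```python
# Coordinates on chromosome 2, as quoted in the text.
roh_start, roh_end = 134606441, 136593184
snp_pos = 136325116  # rs4988235 in the 23andMe data file, per the text

length_mb = (roh_end - roh_start) / 1e6
snp_inside_run = roh_start <= snp_pos <= roh_end

print(f"run length: {length_mb:.2f} Mb; SNP inside run: {snp_inside_run}")
```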

Finally, let’s go back to heterozygosity vs. homozygosity. I recently watched the documentary “Is it Better to be Mixed Race?” Setting aside the obvious reality that this sort of program reflects the Zeitgeist of the era (it is rather obvious that a Victorian scientist could have produced a different documentary, even with the same evidence), near the end there is a comparison of ROH across populations and individuals. The comparison was actually done by the research group which published the paper I just reviewed. If you jump to 38 minutes into the film and just watch they’ll lay out the results, but I’ll tell you what they found. They compared two European men, a South Indian woman, and a man whose father was English and mother Nigerian. The European men had the expected levels of homozygosity, on the higher end. The South Indian woman had lower levels of aggregate homozygosity. This should be expected, as India is relatively genetically diverse on a pan-Eurasian scale. Finally, the mixed race male had almost no homozygosity to speak of. The principal investigator admitted that out of 5,000 individuals whom he had tested and analyzed this was the most extreme result, and he had to recheck it. Why? Three factors:

– The mother is Nigerian, which is a population which is relatively genetically diverse

– The genetic distance between the father and mother is rather high

– Finally, because the man is a first generation hybrid on all the loci where Africans and Europeans tend to differ he’ll be much more likely to be heterozygous

I’ll let the authors have the last word:

Long ROH are a neglected feature of our genome, which we have shown here to be universally common in human populations and to correlate well with demographic history. ROH are, however, only partially predictable from an individual’s background (due to the stochastic nature of inheritance). As well as conferring susceptibility to recessive Mendelian diseases, ROH are also potentially an underappreciated risk factor for common complex diseases, given the evidence for a recessive component in many complex disease traits…they will allow quantification of the risk arising from recessive genetic variants in different populations.

Citation: Mirna Kirin, Ruth McQuillan, Christopher S. Franklin, Harry Campbell, Paul M. McKeigue, & James F. Wilson (2010). Genomic Runs of Homozygosity Record Population History and Consanguinity PLoS ONE : 10.1371/journal.pone.0013996

Image Credit: Allison Stillwell


Natural selection happens. It was hypothesized in copious detail by Charles Darwin, and has been confirmed in the laboratory, through observation, and also by inference via the methods of modern genomics. But science is more than broad brushes. We need to drill down to a more fine-grained level to understand the dynamics with precision and detail, and so generate novel inferences which may then be tested. For example, there are various flavors of natural selection: stabilizing selection, negative selection, and positive directional selection. In the first case natural selection buffets the phenotype about an ideal mean, in the second case deleterious phenotypes and their associated alleles are purged from the genome, and finally, natural selection can also drive a novel trait toward greater prominence, and concomitantly the allelic variants which are associated with the fitter phenotype.

The last case is of particular interest to many because it is often through positive natural selection that evolution as descent with modification occurs. Over time trait values and the nature of traits themselves shift such that a lineage changes its character beyond recognition. This phyletic gradualism and the scale independence of evolutionary process has been challenged, in particular from the domain of developmental biology (albeit not by all, or even most, developmental biologists). But ultimately no one doubts that a classical understanding of evolution as change in allele frequency, often driven by natural selection, is part of the larger puzzle of how the tree of life came to be. One of the phenomena associated with positive directional evolution is the selective sweep. How a selective sweep occurs, and its consequences, are rather straightforward. A genome consists of a sequence of base pairs (e.g., we have 3 billion base pairs). If a new mutation emerges at a particular base pair, a novel single nucleotide polymorphism (SNP), and that allelic variant is ~10% fitter than the ancestral variant, natural selection could drive up its frequency (the conditionality is due to the fact that in all likelihood it would still go extinct because of the power of stochastic forces when a mutant is at low frequency). So the variant could in theory shift from ~0% (1 out of N, N being the number of individuals in a population, 2N if diploid, and so forth) to ~100%. This would be the fixation of the novel variant, driven by selective dynamics. So what’s the sweep aspect? The sweep in this case refers to the effect of the very rapid rise in frequency of the SNP in question on the adjacent genomic region. What is termed a genetic hitchhiking dynamic results if the sweep occurs rapidly, so that nearby regions of the genome also move to fixation along with the favored SNP. 
But in a diploid organism with sexual reproduction genetic recombination persistently breaks apart associations across the physical genome. Therefore the span of the sequence of genetic markers nearby a favored SNP which form a haplotype is dependent on the rate of recombination as well as the rate of the rise in frequency of the allele, which is contingent on the strength of selection. A powerful selective sweep has the effect of homogenizing wide regions of the genome flanking the favored mutant; in other words the sweep “cleans” the gene pool of variation as one very long haplotype replaces many shorter haplotypes. As an example, in the genomes of Northern Europeans the locus LCT is characterized by a very long haplotype, which itself seems to correlate well with the trait of lactase persistence. The implication here is that the lactase persistence conferring variant arose relatively recently, and was swept up to near fixation by positive directional natural selection.
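The caveat that a favored new mutant will most likely still go extinct can be put in rough numbers. The sketch below uses Kimura's diffusion approximation for the fixation probability of a single new copy in a diploid population. It assumes additive (genic) selection with heterozygote advantage s, and that effective and census population sizes are equal. The function name and the population size are illustrative choices of mine.

```python
import math


def fixation_probability(s, n_diploid):
    """Kimura's approximation for a new mutant starting at frequency 1/(2N).

    s: selective advantage of the heterozygote (additive selection assumed).
    n_diploid: number of diploid individuals, taken to equal effective size.
    """
    if s == 0:
        return 1 / (2 * n_diploid)  # neutral case: pure drift
    return (1 - math.exp(-2 * s)) / (1 - math.exp(-4 * n_diploid * s))


p_fix = fixation_probability(0.10, 10_000)
print(f"favored mutant (s = 0.10): fixes with probability ~{p_fix:.3f}")
print(f"neutral mutant: fixes with probability {fixation_probability(0, 10_000):.6f}")
```

Under these assumptions even a 10% advantage leaves a new mutant with somewhat less than a one in five chance of fixing. That is far better than a neutral variant's odds, but extinction is still the likeliest outcome.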

That’s the broad theory. But as you know, evolution and its subcomponents are more than “just a theory,” they’re a set of models which are amenable to testing, whether through observation, or via controlled laboratory experiments. A new letter to Nature elaborates how exactly selective sweeps play out in Drosophila melanogaster, a classic “model organism.” Interestingly, this is a case of experimental evolution, something we are more familiar with from Richard Lenski’s E. coli. Genome-wide analysis of a long-term evolution experiment with Drosophila:

Experimental evolution systems allow the genomic study of adaptation, and so far this has been done primarily in asexual systems with small genomes, such as bacteria and yeast…Here we present whole-genome resequencing data from Drosophila melanogaster populations that have experienced over 600 generations of laboratory selection for accelerated development. Flies in these selected populations develop from egg to adult ~20% faster than flies of ancestral control populations, and have evolved a number of other correlated phenotypes. On the basis of 688,520 intermediate-frequency, high-quality single nucleotide polymorphisms, we identify several dozen genomic regions that show strong allele frequency differentiation between a pooled sample of five replicate populations selected for accelerated development and pooled controls. On the basis of resequencing data from a single replicate population with accelerated development, as well as single nucleotide polymorphism data from individual flies from each replicate population, we infer little allele frequency differentiation between replicate populations within a selection treatment. Signatures of selection are qualitatively different than what has been observed in asexual species; in our sexual populations, adaptation is not associated with ‘classic’ sweeps whereby newly arising, unconditionally advantageous mutations become fixed. More parsimonious explanations include ‘incomplete’ sweep models, in which mutations have not had enough time to fix, and ‘soft’ sweep models, in which selection acts on pre-existing, common genetic variants. We conclude that, at least for life history characters such as development time, unconditionally advantageous alleles rarely arise, are associated with small net fitness gains or cannot fix because selection coefficients change over time

Critical to understanding what’s going on here is the distinction they make between ‘classic’ ‘hard sweeps’ and ‘soft sweeps.’ Hard sweeps follow the spare description I outlined above:

1) A new mutant arises in the genetic background

2) Selection favors the mutant

3) The mutant rises in frequency and sweeps to fixation, 0% → 100%, replacing the ancestral variants

In contrast, for a soft sweep:

1) Selection favors a set of minor polymorphisms already segregating in the gene pool

2) These polymorphisms rise in frequency

3) But they may not sweep to fixation

In the first case the signature of natural selection will be clear, distinct, and indubitable. A novel haplotype which has replaced the ancestral variants and produced a wide region of genetic homogeneity as all other allele states are expunged by the sweep will have resulted. That isn’t what they saw at the genomic level.

But first, what did they do? The flies used in this experiment derive from a 30 year old lineage, and they selected them for 600 generations in the case of the treatments which were being driven to new phenotype values. 600 generations for humans would be about 15,000 years assuming 25 years per generation. If a trait is heritable, and you select offspring deviated away from the mean, over time you will see a shift in the trait value. This is classic quantitative genetics, and that’s what they saw. They had five lineages which exhibited accelerated development (ACO), and five which were controls which exhibited the ancestral phenotypes (CO). “Eclosion” refers to the fly’s emergence from the pupa. The lineages which were subject to selection had very different life histories from the control groups. The cluster of traits here shouldn’t be too surprising; we know from other taxa that short-lived fast-developing species tend to be smaller and metabolically more under-the-gun than the inverse.
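The quantitative genetic expectation here is the breeder's equation, R = h²S: the response per generation equals the narrow-sense heritability times the selection differential. The sketch below iterates it with invented numbers. The heritability, the selection differential and the starting mean are hypothetical and are not taken from the Drosophila experiment.

```python
def response_to_selection(h2, selection_differential):
    """Breeder's equation: expected shift in the offspring mean, R = h2 * S."""
    return h2 * selection_differential


mean_dev_time = 10.0  # hypothetical mean development time, in days
h2 = 0.3              # hypothetical narrow-sense heritability
S = -0.5              # chosen parents develop half a day faster than the mean

for generation in range(1, 6):
    mean_dev_time += response_to_selection(h2, S)
    print(f"generation {generation}: mean = {mean_dev_time:.2f} days")
```

This treats heritability and the selection differential as constant. Over hundreds of generations neither need be, which is part of why the genomic data are interesting.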

But the really interesting aspects of this study are not the phenotypes. Who hasn’t seen weird things among the Drosophila? That’s one of the reasons they were chosen as model organisms in the first place! Rather, they explored the patterns of genomic variation within and across the lineages, and integrated the results into a broader theoretical framework of how evolutionary processes occur, and their implications for the genome-wide structure one should see. Below I’ve stitched together figures 2 & 3, which illustrate particular patterns of genomic variation.


The left figure shows differences in allele frequencies between the ACO and CO pooled lineages. The spikes indicate large differences, with the dotted line representing the threshold where there’s a 0.1% random chance of such a between-population frequency difference. The vertical axis is log-scaled. The grey line at the bottom indicates the differences between one particular ACO lineage and the pooled ACO sample. In the right panel you see heterozygosities, with blue denoting the CO lineages, and red the selected ACO lineages which have shortened life histories. The grey again is a particular ACO lineage. Each vertical panel corresponds to a chromosomal arm of the Drosophila melanogaster genome.

First, note the widespread distribution of allele frequency differences between ACO and CO. Additionally, there’s little difference between the specific ACO lineage and the pooled sample. Despite their independent histories they seem to exhibit the same allelic configuration. Second, note that the heterozygosities in the case of the ACO pooled sample are lower than in the CO ancestral phenotype lineages. Why? Remember that selective sweeps should expunge genomic variation. But the sweeps do not seem to have gone to fixation, otherwise we’d see many more inverted peaks converging to heterozygosity of ~0, as the selected variant replaces all others in the population.
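Why an incomplete sweep leaves a dip rather than a drop to zero follows from the expected heterozygosity at a biallelic site, H = 2p(1 − p). The sketch below simply tabulates it. The allele frequencies are illustrative values of mine, not measurements from the paper.

```python
def heterozygosity(p):
    """Expected heterozygosity at a biallelic site with allele frequency p."""
    return 2 * p * (1 - p)


# Illustrative frequencies for a favored allele at stages of a sweep.
for p in (0.5, 0.8, 0.95, 1.0):
    print(f"p = {p:.2f} -> H = {heterozygosity(p):.3f}")
```

Only at p = 1.0, a completed sweep, does H reach zero. A variant stalled at 80% still leaves heterozygosity of about 0.32, a reduction rather than an erasure.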

What’s going on in the regions which exhibit differences between the controls and selected lineages? They looked at the ~650 non-synonymous SNPs on ~500 genes which were most differentiated between ACO and CO (L10FET score > 4) and found the following categories of genes enriched: imaginal disc development, smoothened signalling pathway, larval development, wing disc development, larval development (sensu Amphibia), metamorphosis, organ morphogenesis, imaginal disc morphogenesis, organ development and regionalization. Life history is complex. Combine the wide class of genes with the dispersed genomic impact of selection as evident in figures 2 and 3, and you get a good sense of the sort of consequences on the substrate level which quantitative genetic evolutionary dynamics have. Also of interest, they found that the X chromosome seemed enriched for signatures of selection and evolution. Why? They note that this chromosome would be more subject to selection for recessive or partially recessive expressing SNPs.
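As I read it, the L10FET score is the negative log10 of a Fisher's exact test p-value comparing allele counts between the two pooled samples. Treat that reading as an assumption, and the sketch below as an illustration of the idea rather than the authors' pipeline. The allele counts are invented, and the two-sided test is implemented directly from the hypergeometric distribution using only the standard library.

```python
import math


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    Sums the probabilities of all tables with the same margins that are
    no more probable than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x):
        return math.comb(row1, x) * math.comb(row2, col1 - x) / math.comb(n, col1)

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    return sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)  # small tolerance for float ties
    )


# Invented allele counts at one SNP: (allele 1, allele 2) in each pool.
aco_counts = (45, 5)
co_counts = (20, 30)
p_value = fisher_exact_two_sided(*aco_counts, *co_counts)
score = -math.log10(p_value)
print(f"p = {p_value:.2e}, -log10(p) = {score:.1f}")
```

A score above 4 corresponds to a p-value below 0.0001 for that SNP considered on its own.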

Clearly this study did not find the clean hard sweeps which theory may have predicted. Rather, the researchers found a lot of partially completed sweeps distributed all across the genome. Sound familiar? Before we move on to broader considerations, here are their explanations:

– The sweeps are hard, but haven’t reached fixation. So the selection coefficients have to be rather small for them to still be in transit

– Selection is operating on “standing variation.” That is, the genetic variation extant naturally within a given population, and which may be operated upon by natural selection to change the population trait value mean through classical breeding techniques

– And finally, selection coefficients (the greater fitness of positively selected variants against the population mean) may not be static parameters, but change over time as a function of allele frequency. This shouldn’t be that surprising. Frequency dependence and epistasis can impact on linear assumptions within a statistical genetic model. The authors refer to deleterious alleles or antagonistic pleiotropy as possible genetic level forces which also prevent fixation

I personally lean against the first option, because it seems like we see a similar pattern in human evolutionary genomics, lots of partial sweeps and incomplete fixation. How much time does a brother need? In the long run we’re dead, and heat death swallows the universe. In the short run evolutionary pressures are always shifting. Fix now, or forget it say I! The wide distribution of allelic differences as well as moderate heterozygosities seems to be an indication that a quantitative trait, life history, is being modified through mass action on genetic variation. Interestingly, there’s also the parallel to humans insofar as the X chromosome seems to have more signatures of selection and variation in this evolutionary experiment. Next question: who’s working on experimental evolution of 600 generations in mice?

Citation: Burke, Molly K., Dunham, Joseph P., Shahrestani, Parvin, Thornton, Kevin R., Rose, Michael R., & Long, Anthony D. (2010). Genome-wide analysis of a long-term evolution experiment with Drosophila Nature : 10.1038/nature09352

Image Credit: Karl Magnacca


With all the justified concern about “missing heritability”, the age of human genomics hasn’t been a total bust. As I have observed before, in 2005’s excellent book Mutants the evolutionary geneticist Armand M. Leroi asserted that we really didn’t have a good understanding of normal variation of human pigmentation. At the time I think it was a defensible claim, but within three years I’d say that most of the mystery had been cleared up. Though there are still some holes to be plugged, and details to be elucidated, the genetic architecture of pigmentation is now understood more or less. By the fall of 2006 Richard Sturm penned a review titled A golden age of human pigmentation genetics, an age which I think in some ways was probably closed with his 2009 review Molecular genetics of human pigmentation diversity. It’s not surprising that many of the traits that 23andMe tells you about have to do with your pigmentation. Of course there’s some limited utility in this; one assumes that most individuals don’t gain much benefit from the knowledge that they have an “85% chance of having brown eyes,” though it may be useful in terms of offspring prediction (I would say I have an 85% chance of having brown eyes, but since I’m not European the genetic background isn’t right to make that probability assertion). But as the golden age of pigmentation genetics comes to a close and the low hanging fruit is stripped bare, where next? I wonder if it may be altitude adaptations. Like pigmentation, altitude genetics has been around for a while, but it seems there’s a recent cresting of papers in the area, focusing in particular on the three canonical high altitude peoples, the Tibetans, Andeans, and the Ethiopians. Last spring two major groups came out with papers on the genetics of Tibetan altitude adaptation, and its evolutionary history, using somewhat different techniques. 
A new paper in PLoS Genetics builds upon that work (verifying two of the loci as targets of selection in Tibetans implicated in the previous papers), and, adds Andean populations to the mix to assess the possibilities of convergent adaptations. Identifying Signatures of Natural Selection in Tibetan and Andean Populations Using Dense Genome Scan Data:

High-altitude hypoxia is caused by decreased barometric pressure at high altitude, and results in severe physiological stress to the human body. Three human populations have resided at high altitude for millennia including Andeans on the Andean Altiplano, Tibetans on the Himalayan plateau, and Ethiopian highlanders on the Semian Plateau. Each of these populations exhibits a unique suite of physiological changes to the decreased oxygen available at altitude. However, we are just beginning to understand the genetic changes responsible for the observed physiology. The aim of the current study was to identify gene regions that may be involved in adaptation to high altitude in both Andeans and Tibetans. Genomic regions showing evidence of recent positive selection were identified in these two high-altitude human groups separately. We found compelling evidence of positive selection in HIF pathway genes, in the globin cluster located on chromosome 11, and in several chromosomal regions for Andeans and Tibetans. Our results suggest that key HIF regulatory and targeted genes are responsible for adaptation to altitude and implicate several distinct chromosomal regions. The candidate genes and gene regions identified in Andeans and Tibetans are largely distinct from one another. However, one HIF pathway gene, EGLN1, shows evidence of directional selection in both high-altitude populations.

In this paper the authors looked at around 50 Andeans (Quechua and Aymara speakers) and 50 Tibetans, and compared them to various outgroups. In addition to the European and Asian HapMap populations they also looked at some Amerindian populations. The map below shows the geographical scope of their sampling (the right inset are the Amerindian lowland groups):


The ancestral relationships of the two highland groups sampled in relation to the lowlanders were relatively straightforward. Panels A and B show PCA plots for the Andeans and Tibetans, while C and D show frappe bar plots. The only things notable for me are that the Quechua speakers seem to show residual European ancestry which the Aymara do not, and that the Colombian indigenous groups seem to have more affinity with Mesoamerican populations than with the other South American samples. I can give no insight as to the latter. As for the former, if it is not just a quirk of non-representativeness one may be seeing the higher number of Spanish men who married into the nobility of the Quechua speaking highlands than further south in the lands of the Aymara (though Potosi was in Bolivia, so this may not be plausible).

[Figure gallery: panels A and B, PCA plots for the Andeans and Tibetans; panels C and D, frappe bar plots]

We already have some evolutionary expectations of how these groups came to have these adaptations to their high altitude environments. It seems that the physiological processes for the three groups are somewhat different, and this has been a source of curiosity for geneticists for a long time. It stands to reason that if the physiology is somewhat varied, the genetics should be too, and that seems to be a broadly correct assumption. In this paper they took two general approaches, looking at the total genome, and focusing on specific candidate regions. From what I can tell they did not find much that was novel using the first technique, but they did clarify the relationship between Tibetans and Andeans in terms of their genetic adaptations a bit by looking at specific genes. As noted in the author summary it looks as if the two populations do have somewhat different genetic architectures. Many of the genes which seem to have been targets of selection do not overlap, and among those that do there seem to be different localized selection events, so that the haplotypes being driven by positive selection differ.

They used a combination of techniques to detect possible regions of natural selection:

– locus specific branch length (LSBL)

– the log of ratio of heterozygosities (lnRH)

– a modified Tajima’s D statistic

– whole genome long range haplotype (WGLRH)

LSBL is an elaboration on Fst, so it is finding between population differences in allele frequency. Recall that at any given locus you don’t expect much between population difference, so if there is a great deal of ecological adaptation you may see a lot of variance as a function of geography. Heterozygosity is simply a measure of the fraction of loci where the two gene copies are in different states. It’s just a way to measure genetic variation (though there are others). The Tajima’s D statistic is a test for whether the locus seems deviated from neutral expectations. This means that there may have been a bottleneck, selective sweep, or, balancing selection. Finally, the last test looks for sets of correlated markers within the genome. If there is a haplotype, a sequence of markers, at high frequency then it may be that you’re witnessing a genomic region which is in, or just after, the occurrence of a selective sweep.
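To make the first two statistics a little more concrete, here is a minimal Python sketch for a single biallelic SNP. It is an illustration under stated assumptions, not the pipeline used in the paper: the function names are hypothetical, the Fst estimator is the simplest heterozygosity-based one with two equally weighted populations, the branch length follows the usual three-population LSBL construction from pairwise distances, and the lnRH function is just the bare log ratio (the published statistic works on genomic windows and is standardized).

```python
import math

def expected_heterozygosity(p):
    """Expected heterozygosity at a biallelic locus with allele frequency p."""
    return 2.0 * p * (1.0 - p)

def pairwise_fst(p1, p2):
    """Simple Fst for one SNP between two equally weighted populations."""
    p_bar = (p1 + p2) / 2.0
    h_total = expected_heterozygosity(p_bar)
    if h_total == 0.0:
        return 0.0  # locus is fixed for the same allele in both populations
    h_within = (expected_heterozygosity(p1) + expected_heterozygosity(p2)) / 2.0
    return (h_total - h_within) / h_total

def lsbl(p_focal, p_b, p_c):
    """Locus specific branch length for the focal population, from three pairwise distances."""
    d_ab = pairwise_fst(p_focal, p_b)
    d_ac = pairwise_fst(p_focal, p_c)
    d_bc = pairwise_fst(p_b, p_c)
    return (d_ab + d_ac - d_bc) / 2.0

def ln_rh(h_focal, h_reference):
    """Log of the ratio of heterozygosities; negative when the focal population has lost variation."""
    return math.log(h_focal / h_reference)

# Made-up allele frequencies: the focal population differs sharply from two comparison groups.
focal_branch = lsbl(0.9, 0.1, 0.1)
other_branch = lsbl(0.1, 0.9, 0.1)
```

With these made-up frequencies nearly all of the differentiation is assigned to the focal population's branch, which is the kind of signal a scan for local adaptation is looking for.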

Why four different tests? Because one given test is not dispositive of natural selection. As noted with Tajima’s D, there are demographic processes of a stochastic nature which can produce false positives, so it is best not to live or die by one technique alone.
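For readers who want to see what a neutrality test looks like in practice, below is a sketch of the standard Tajima's D computed from a small set of aligned haplotypes. Note the assumptions: the paper used a modified version of the statistic, which is not reproduced here; the function name and the toy data are hypothetical; and haplotypes are represented as equal-length strings of 0s and 1s.

```python
import math
from itertools import combinations

def tajimas_d(haplotypes):
    """Standard Tajima's D for a list of equal-length haplotype strings."""
    n = len(haplotypes)
    n_sites = len(haplotypes[0])
    # S: number of segregating (polymorphic) sites in the sample
    seg_sites = sum(1 for j in range(n_sites) if len({h[j] for h in haplotypes}) > 1)
    if seg_sites == 0:
        return 0.0
    # pi: average number of pairwise differences between haplotypes
    n_pairs = n * (n - 1) / 2.0
    total_diffs = sum(
        sum(a != b for a, b in zip(h1, h2)) for h1, h2 in combinations(haplotypes, 2)
    )
    pi = total_diffs / n_pairs
    # Constants from Tajima (1989)
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    return (pi - seg_sites / a1) / math.sqrt(variance)

# Toy data: five singleton variants (an excess of rare alleles, as after a sweep or expansion)
rare_variants = ["10000", "01000", "00100", "00010", "00001"] + ["00000"] * 5
# Toy data: five variants all at 50% frequency (as under balancing selection or a bottleneck)
intermediate_variants = ["11111"] * 5 + ["00000"] * 5

d_rare = tajimas_d(rare_variants)
d_intermediate = tajimas_d(intermediate_variants)
```

The first sample gives a negative D and the second a positive one. Since demography can push the statistic in either direction too, this also illustrates why a single test is not dispositive.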

Here is figure 4, which shows the differences in allele frequencies on the EGLN1 gene:


We’ve seen EGLN1 before. In the figure above the left panels show the Andean derived SNPs, and the right panels the Tibetan ones. Note the differences in frequency in A and B. The red denotes statistically significant values for a statistic in panels C & D. Both Andeans and Tibetans show indications of selection, but the details in the patterns vary when you zoom in on the gene. The very last panel has an arrow which points to the SNPs in each population where the between population variance is maximized. Interestingly the ancestral allele seems to have risen in frequency here in the high altitude populations, as black denotes ancestral and red derived in the first and last panels.

Let me jump to their conclusion:

In summary, we performed a genome scan on high- and low-altitude human populations to identify selection-nominated candidate genes and gene regions in two long-resident high-altitude populations, Andeans and Tibetans. Several chromosomal regions show evidence of positive directional selection. These regions are unique to either Andeans or Tibetans, suggesting a lack of evolutionary convergence between these two highland populations. However, evidence of convergent evolution between Andeans and Tibetans is suggested based on the signal detected for the HIF regulatory gene EGLN1. In addition to EGLN1, a second HIF regulatory gene, EPAS1, as well as two HIF targeted genes, PRKAA1 and NOS2A, have been indentified as selection-nominated candidate genes in Tibetans (EPAS1) or Andeans (PRKAA1, NOS2A). PRKAA1 and NOS2A play major roles in physiological processes essential to human reproductive success…Thus, in addition to demonstrating the likely targets of natural selection and the operation of evolutionary processes, genome studies also have the clear potential for elucidating key pathways responsible for major causes of human morbidity and mortality. Based on the findings of this study, it will be important to confirm the results with genotype-phenotype association studies that link genotype to a specific high-altitude phenotype.

I wanted to show the alphabet soup of genes in case you’re a geneticist with an interest in any of these loci. I’ve seen these before in previous papers, so I assume the key that got this published in PLoS Genetics is the deep comparative dimension, as the researchers explored the lack or existence of evolutionary convergence between these two populations. Should the finding be surprising? I don’t think so. High altitudes are extreme environments, and the literature is filled with references to problems which emerge even in these populations because of the nature of their adaptations. There are likely deleterious side effects, especially if one of last spring’s papers on Tibetans is correct and they’re relatively recent settlers of the highlands. But you never know until you play the game, so it is good to confirm.

A further exploration of the genetic architecture and nature of adaptations, especially when the research is extended to Ethiopians, may give us a further window into contingency in evolutionary history. These three occurrences are basically three independent experiments. In this paper they indicate that some of the variants being subject to natural selection may have been in the ancestral population, so standing variation. Others are new mutations, unique and novel. Though there are different pathways to the final expression of the phenotype, which in the details of implementation (physiology) still differ across the groups, there are also genes which in this comparison seem to be implicated in both Tibetans and Andeans as having been subject to selection. How constrained is the sample space subject to possible selection and the implied G-matrix? How contingent are the evolutionary pathways that different populations take to attain the state of adaptive fitness in similar ecologies? These are the sort of long term questions which I think will be possibly answered as the tentative silver age of altitude adaptation gives way to the golden age.

Citation: Bigham A, Bauchet M, Pinto D, Mao X, & Akey JM (2010). Identifying Signatures of Natural Selection in Tibetan and Andean Populations Using Dense Genome Scan Data PLoS Genetics

Image Credit: Micah MacAllen

Note: I am aware that classically the silver age follows the golden age, instead of precedes it. But we live in Whiggish times indeed!

Over the past decade evolutionary geneticist Mike Lynch has been articulating a model of genome complexity which relies on stochastic factors as the primary motive force by which genome size increases. The argument is articulated in a 2003 paper, and further elaborated in his book The Origins of Genome Architecture. There are several moving parts in the thesis, some of which require a rather fine-grained understanding of the biophysical structural complexity of the genome, the nature of Mendelian inheritance as a process, and finally, population genetics. But the core of the model is simple: there is an inverse relationship between long term effective population size and genome complexity. Low individual numbers ~ large values in terms of base pairs and counts of genetic elements such as introns.

A quick reminder: effective population size denotes, roughly, the number of individuals in the population which contribute genes to the next generation. So, in the case of insects with extremely high mortality in the larval stage the effective population size may be orders of magnitude smaller than the census size at any given generation evaluated over all stages of life history. In contrast, with humans a much larger proportion of children end up contributing to the genetic makeup of the subsequent generation. With large organisms I’ve heard you can sometimes use a rule of thumb that effective population size is ~1/3 of census size, though this probably overestimates the effective population size. One reason is that reproductive variance reduces the effective population size, because many individuals contribute far less to the next generation than other individuals. The greater the variance, the more evolutionary genetic variation is impacted by a few individuals within the population at a given generation, reducing the effective population which contributes to the next (the reproductive variance is often assumed to be Poisson, but that is likely an underestimate). Additionally, there is the issue of variation over time. Long term effective population size is much more sensitive to low bound values than high bound values, so it is liable to be much smaller than the census size at any given period for a species which goes through cycles. Humans for example have a relatively small long term effective population size evaluated over the past 100,000 years because we seem to have expanded from a small initial population. Mathematically, since long term effective population size is given by the harmonic mean, it stands to reason that low bound values would be critical. If that doesn’t make sense to you, remember the outsized impact which population bottlenecks may have on the long term trajectory of a species, in particular by removing genetic variation.
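A quick numerical sketch may help with both points. The function names and the population sizes are made up for illustration; the first function is simply the harmonic mean over generations, and the second is the textbook approximation for a constant-size diploid population with variance V_k in offspring number, Ne ≈ (4N − 2)/(V_k + 2), which I am assuming is an adequate stand-in for the reproductive variance argument above.

```python
def long_term_ne(sizes_per_generation):
    """Long term effective size as the harmonic mean of per-generation sizes."""
    t = len(sizes_per_generation)
    return t / sum(1.0 / n for n in sizes_per_generation)

def ne_from_reproductive_variance(census_n, offspring_variance):
    """Textbook approximation for a constant-size diploid population."""
    return (4.0 * census_n - 2.0) / (offspring_variance + 2.0)

# One bottleneck generation of 10 individuals among generations of 1,000
bottlenecked = long_term_ne([1000, 1000, 10, 1000])   # about 38.8, far below the arithmetic mean of 752.5

# Poisson-like variance (V_k = 2) leaves Ne close to the census size...
poisson_like = ne_from_reproductive_variance(100, 2.0)   # 99.5
# ...while a heavily skewed distribution of offspring number shrinks it
skewed = ne_from_reproductive_variance(100, 18.0)        # 19.9
```

A single generation at 10 individuals drags the long term value down to about 39, even though the other three generations sit at 1,000.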

How does this influence genome complexity? Basically Lynch’s thesis is that when you reduce effective population size you dampen the power of natural selection, specifically purifying selection, to prevent the addition of non-adaptive complexity through random processes. It isn’t that selection is rendered moot; rather, its signal is overwhelmed by the noise. Here’s the abstract of his 2003 paper:

Complete genomic sequences from diverse phylogenetic lineages reveal notable increases in genome complexity from prokaryotes to multicellular eukaryotes. The changes include gradual increases in gene number, resulting from the retention of duplicate genes, and more abrupt increases in the abundance of spliceosomal introns and mobile genetic elements. We argue that many of these modifications emerged passively in response to the long-term population-size reductions that accompanied increases in organism size. According to this model, much of the restructuring of eukaryotic genomes was initiated by nonadaptive processes, and this in turn provided novel substrates for the secondary evolution of phenotypic complexity by natural selection. The enormous long-term effective population sizes of prokaryotes may impose a substantial barrier to the evolution of complex genomes and morphologies.

The implication here is that prokaryotes with massive population sizes are biased toward smaller genomes by the more efficacious application of natural selection. In contrast, more complex organisms which have smaller population sizes, and so are more impacted by the random fluctuations generation to generation due to sample variance, are less streamlined genomically because selection can do only so much against the swelling sea of noise. One intriguing argument of Lynch is that the genomic complexity is then later useful downstream as the building block of phenotypic complexity, but let’s set that aside for now.
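The signal versus noise intuition can be put into numbers with Kimura's diffusion approximation for the fixation probability of a new mutation. This is a sketch under assumptions of my own choosing, not anything taken from Lynch's paper: a single new copy in a diploid population, census size equal to effective size, and a hypothetical selection coefficient of -0.001 standing in for a slightly deleterious insertion.

```python
import math

def fixation_probability(s, n):
    """Kimura's approximation for a new mutation (initial frequency 1/(2N)) with selection coefficient s."""
    if s == 0.0:
        return 1.0 / (2.0 * n)  # neutral case: fixation probability equals initial frequency
    exponent = -4.0 * n * s
    if exponent > 700.0:
        return 0.0  # exp() would overflow; the probability is effectively zero
    numerator = -math.expm1(-2.0 * s)    # 1 - exp(-2s)
    denominator = -math.expm1(exponent)  # 1 - exp(-4Ns)
    return numerator / denominator

def relative_to_neutral(s, n):
    """Fixation probability as a fraction of the neutral expectation 1/(2N)."""
    return fixation_probability(s, n) * 2.0 * n

s_deleterious = -0.001  # hypothetical cost of a bit of extra DNA
small_population = relative_to_neutral(s_deleterious, 100)      # roughly 0.81
large_population = relative_to_neutral(s_deleterious, 100000)   # vanishingly small
```

In the small population the slightly deleterious variant fixes almost as often as a neutral one, while in the large population it essentially never does.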

A new paper in PLoS Genetics challenges the statistical analysis of the original data which Lynch et al. used to make their case. Technically the argument was that there was an inverse relationship between N_e u and genome size. N_e is effective population size, and u is nucleotide mutation rate. Though the argument is technical, the basic objection should be easy to understand: there are other variables which may actually be responsible for the correlation which Lynch et al. discerned. To the paper, Did Genetic Drift Drive Increases in Genome Complexity?:

Genome size (the amount of nuclear DNA) varies tremendously across organisms but is not necessarily correlated with organismal complexity. For example, genome sizes just within the grasses vary nearly 20-fold, but large-genomed grass species are not obviously more complex in terms of morphology or physiology than are the small-genomed species. Recent explanations for genome size variation have instead been dominated by the idea that population size determines genome size: mutations that increase genome size are expected to drift to fixation in species with small populations, but such mutations would be eliminated in species with large populations where natural selection operates at higher efficiency. However, inferences from previous analyses are limited because they fail to recognize that species share evolutionary histories and thus are not necessarily statistically independent. Our analysis takes a phylogenetic perspective and, contrary to previous studies, finds no evidence that genome size or any of its components (e.g., transposon number, intron number) are related to population size. We suggest that genome size evolution is unlikely to be neatly explained by a single factor such as population size.

In the original analysis by Lynch et al. ~66% of the variation in genome size was explained by N_e u! That’s a pretty large effect. Figure 1 illustrates how phylogeny could be a confound in adducing a relationship. Here’s some of the text which explains the figure:

In this hypothetical example, eight species have been measured for two traits, x and y, as indicated by pairs of values at the tips of the phylogenetic tree (A). Ordinary least-squares linear regression (OLS) indicates a statistically significant positive relationship (B; r-squared = 0.62, P = 0.02), potentially leading to an inference of a positive evolutionary association between x and y. However, inspection of the scatterplot (B) in relation to the phylogenetic relationships of the species (A) indicates that the association between x and y is negative for the four species within each of the two major lineages. Regression through the origin with phylogenetically independent contrasts…which is equivalent to phylogenetic generalized least squares (PGLS) analysis, accounts for the nonindependence of species and indicates no overall evolutionary relationship between the traits…The apparent pattern across species was driven by positively correlated trait change only at the basal split of the phylogeny; throughout the rest of the phylogeny, the traits mostly changed in opposite directions (A; basal contrast in red)….

The argument then seems to be that the relationship in the original work by Lynch was an artifact due to the evolutionary history of the species which he surveyed to infer the relationship. Instead of a general principle or law then what you have is an outcome of contingent historical processes. Not very neat and clean. You can see the taxa-clustered nature of the relationship in figure 1 from the 2003 paper in Science:


OK, now let’s look at the visualization of the same data set from this paper, as a tree to illustrate the correlations:


The last figure shows the difference between a scatterplot using conventional OLS regression, and the phylogenetic generalized least squares model (PGLS). You go from an obvious linear relationship, which translated into the high r-squared noted above, to basically nothing (r-squared near zero, no statistical significance).
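The logic of the confound is easy to reproduce with toy numbers. The sketch below is not PGLS or independent contrasts, which need a tree and branch lengths; it just compares a pooled OLS slope with the slopes inside each of two hypothetical clades, using made-up trait values arranged like the paper's hypothetical example (negative within each lineage, positive overall because of one basal split).

```python
def ols_slope(xs, ys):
    """Ordinary least-squares slope of y on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Made-up trait values for eight species in two clades.
clade_one_x, clade_one_y = [1, 2, 3, 4], [4, 3, 2, 1]
clade_two_x, clade_two_y = [11, 12, 13, 14], [14, 13, 12, 11]

# Treating all eight species as independent points gives a positive slope...
pooled_slope = ols_slope(clade_one_x + clade_two_x, clade_one_y + clade_two_y)
# ...even though the relationship within each clade is negative.
slope_within_one = ols_slope(clade_one_x, clade_one_y)
slope_within_two = ols_slope(clade_two_x, clade_two_y)
```

The pooled slope comes out around 0.9 while both within-clade slopes are exactly -1, so the apparent positive relationship is carried entirely by the difference between the two clades.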

The paper itself isn’t that long, and the objection is pretty straightforward. They’re simply claiming that Lynch didn’t correct for an obvious alternative explanation/confound, and that we don’t know what we thought we knew. Additionally, there is the assertion that the idea that effective population size predicts genome size robustly is becoming conventional wisdom within the scientific community. I don’t know about that; this seems like such a young field in flux that I think they oversold how widespread this assumption is to make the force of their rebuttal more critical. Certainly the patterns in genome size can be quite perplexing, but my intuition is that an r-squared on the order of 2/3 of the variation in genome size being explained by one predictor variable is rather astounding. Obviously genome size is pretty easy to get in the “post-genomic era,” but N_e and u are harder to come by for many taxa, or even within a given taxon for a set of species of interest. It looks to me like an opportunity for experimental evolutionists, who can control the confounds, and observe changes within a lineage. And yet even if N_e u is predictive as an independent variable all things controlled, what if all things are not usually controlled, and random acts of phylogenetic history are more important? Mike Lynch is credited in the acknowledgements, so I assume we’ll be seeing a response from him in the near future.

Citation: Whitney KD, & Garland T Jr (2010). Did Genetic Drift Drive Increases in Genome Complexity? PLoS Genetics : 10.1371/journal.pgen.1001080

One of the great things about evolutionary theory is that it is a formal abstraction of specific concrete aspects of reality and dynamics. It allows us to squeeze inferential juice from incomplete prior knowledge of the state of nature. In other words, you can make predictions and models instead of having to observe every last detail of the natural world. But abstractions, models and formalisms often leave out extraneous details. Sometimes those details turn out not to be so extraneous. Charles Darwin’s original theory of evolution had no coherent or plausible mechanism of inheritance. R. A. Fisher and others imported the empirical reality of Mendelism into the logic of evolutionary theory, to produce the framework of 20th century population genetics. Though it accepted the genetic inheritance process of Mendelism, this original synthesis was not informed by molecular biology, because it pre-dated molecular biology. After James Watson and Francis Crick uncovered the biophysical basis for Mendelism molecular evolution came to the fore, and neutral theory emerged as a response to the particular patterns of genetic variation which new molecular techniques were uncovering. And yet through this much of R. A. Fisher’s image of an abstract genetic variant floating against a statistical soup of background noise variation persisted, sometimes dismissed as “bean bag genetics”.

We’ve come a long way from the initial wave of discussions which were prompted by the molecular genetic revolution. We have epigenetics, evo-devo and variation in gene regulation. None of these processes “overthrow” evolutionary biology, though in some ways they may revolutionize aspects of it. Science is over the long haul after all an eternal revolution, as the boundaries of comprehension keep getting pushed outward. A few days ago I pointed to Sean Carroll’s recent work, which emphasizes that one must think beyond the sequence level, and focus on particular features such as cis-regulatory elements. Here we’ve been tunneling down to the level of the gene, but what about the traits, the phenotypes, which are affected by genetic variation?

It is well known that the sparest abstraction of genotypic-phenotypic relationship can be illustrated like so:

genetic variation → phenetic variation

But each element of this relation has to be examined in greater detail. What type of genetic variation? Sequence level variation? Epigenetic variation? The second component is perhaps the most fraught, with the arrow waving away the myriad details and interactions which no doubt lurk between genotype and phenotype. And finally you have the phenotype itself. Are they all created alike in quality so that we can ascribe to them dichotomous values and quantities?

A new paper in PNAS examines the particulars of morphological phenotypes and physiological phenotypes, and their genetic control, as well as rates of evolution. Contrasting genetic paths to morphological and physiological evolution:

The relative importance of protein function change and gene expression change in phenotypic evolution is a contentious, yet central topic in evolutionary biology. Analyzing 5,199 mouse genes with recorded mutant phenotypes, we find that genes exclusively affecting morphological traits when mutated (dubbed “morphogenes”) are grossly enriched with transcriptional regulators, whereas those exclusively affecting physiological traits (dubbed “physiogenes”) are enriched with channels, transporters, receptors, and enzymes. Compared to physiogenes, morphogenes are more likely to be essential and pleiotropic and less likely to be tissue specific. Morphogenes evolve faster in expression profile, but slower in protein sequence and gene gain/loss than physiogenes. Thus, morphological and physiological changes have a differential molecular basis; separating them helps discern the genetic mechanisms of phenotypic evolution.

Morphology here refers to gross anatomical features. The sort of traits and characteristics which a paleontologist or anatomist might take interest in. Physiology is more about function, and the physical structures which enable that function. It is naturally closer to the scale of molecular biology as physiology melts into biochemistry. Of course at the other end physiology also merges with anatomy as physiology occurs within features of interest to the anatomist. By way of generalization perhaps physiology may be considered more granular, while morphology more gross, in the context of this paper.

They used the mouse because it’s a species which has long served as a model organism, and there are a host of well known and characterized mutations for both physiology and morphology. Utilization of mice in these fields in the context of evolutionary research dates back to the early 20th century. So systems biologists have a lot of research that’s already been done to work with. They found 5199 mouse genes with known phenotypes in the Mouse Genome Informatics database. 821 affected only morphological traits and 912 affected only physiological traits.

Figure 1 shows the breakdown by Gene Ontology:


Going by what little I know about these topics the second to the fourth panels aren’t surprising. Morphological traits are built from molecular structures, while the transporter activity classes are a more cellular scale, and so would seem to be below the threshold of salience for morphological traits. The first panel is not something I’d expected, but it makes sense after the fact. Figure 2 clarifies. The right panels have proportions, the left counts.


The primary point is this: morphogenes seem to affect more traits than physiogenes, and their effect is less tissue specific when it comes to a particular trait. When this pattern is highlighted the enrichment toward transcriptional regulation makes more sense to me: transcriptional regulation might allow for more trait by trait level control of variation. If there is a relationship of many traits to one gene, that would probably impose a constraint on the sequence level to a greater extent than if the gene was implicated in variation on one trait. The gap in pleiotropy is closed somewhat when you constrain to essential genes, those whose mutation results in a decrease of fitness to zero (through death or lack of ability to reproduce). Pleiotropy presumably is constraining the genetic landscape toward particular fitness peaks. Tissue specificity seems understandable when you consider the localization of many physiological processes, and their biochemical complexities (I’m thinking of the vagaries of gene expression in the liver here).

But they looked at more than how the traits and genes distribute now; they tried to sniff out whether there were differences in the rate of evolution of morphogenes and physiogenes contingent upon the class of genetic variants. Remember that you have sequence level changes on exons which can alter proteins. You have cis-acting elements as critical cogs in gene regulation. And you have more gross genomic features such as gene duplication or deletion.

Figure 3 shows the differences between mice and humans on particular genes in relation to sequence level substitutions as well as gene expression profiles. Specifically in the case of the former you want to know the rate of nonsynonymous substitution, those substitutions at base pairs which change the amino acid translated, standardized by the rate of synonymous substitution (a proxy for the overall mutation rate). So panel C is the one to focus on. Note that physiogenes seem to have evolved more since the last divergence between human and mouse lineages than morphogenes. Why might this be? An immediate thought that comes to mind is that physiological processes with tissue-specific expression are liable to be modulated more often than gross morphology, which might be controlled by genes with a lot of pleiotropic effects and so constrained. Even when you control for tissue specificity the pattern remains, as evident in panel D. The pattern seems somewhat inverted in relation to rate of evolution when it comes to gene expression profiles, as you can see in the last three panels. Evolution happens, but by somewhat different genetic means in these cases. The authors finger pleiotropy in particular as the problem for sequence level evolution in morphogenes, as changes in proteins are much more likely to be problematic if those proteins are upstream from many more traits.
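To make the synonymous versus nonsynonymous distinction concrete, here is a small sketch that classifies codon differences between two aligned coding sequences using the standard genetic code. It is only a raw tally, not the rate estimate used in the paper: a proper estimate also normalizes by the number of synonymous and nonsynonymous sites and corrects for multiple hits. The function name and the two toy sequences are hypothetical.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG ("*" marks a stop)
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def count_codon_differences(seq_a, seq_b):
    """Return (synonymous, nonsynonymous) counts of differing codons between two aligned sequences."""
    if len(seq_a) != len(seq_b) or len(seq_a) % 3 != 0:
        raise ValueError("sequences must be aligned and a multiple of three in length")
    synonymous = 0
    nonsynonymous = 0
    for start in range(0, len(seq_a), 3):
        codon_a = seq_a[start:start + 3]
        codon_b = seq_b[start:start + 3]
        if codon_a == codon_b:
            continue
        if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            synonymous += 1      # same amino acid, silent change
        else:
            nonsynonymous += 1   # amino acid replaced
    return synonymous, nonsynonymous

# Toy sequences: GCT -> GCC keeps alanine, AAA -> AGA swaps lysine for arginine
counts = count_codon_differences("ATGGCTAAA", "ATGGCCAGA")
```

A gene under strong constraint would show few changes of the second kind relative to the first, which is the sort of pattern being compared between morphogenes and physiogenes.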

In a way these results show that evolution has to be a versatile designer. When it comes to physiogenes the illustrator is in charge, creating new traits from the most basic genetic raw material, changes in a base pair here and a base pair there. But for morphogenes evolution has to use the tools and tricks of photoshopping, making recourse to extant elements and rearranging or tweaking things here and there so as not to upset the complex applecart while modulating on the margins.

What about cis-acting regulatory elements? In the paper they allude to the argument of Sean Carroll that cis-acting regulatory elements are critical for the evolution of morphological traits. That would imply that morphogenes should be enriched vis-a-vis physiogenes for changes on these elements. They didn’t find that in figure 4. On the contrary.


But I don’t think they perceive their result as a rock-solid refutation of Carroll because it was somewhat indirect. I’ll quote from the paper:

…Because experimentally confirmed mammalian cis elements are few, are likely to have been confirmed in only one species, and are potentially biased toward certain classes of genes, we tested the above hypothesis by using cis-elements that were predicted exclusively by motif sequence conservation among a set of vertebrate genome sequences and recorded in the cisRED database (20). In cisRED, 8,440 predicted mouse cis-elements and 7,688 predicted human cis-elements were found to be in the proximity of 586 mouse morphogenes and their human orthologs, respectively. Similarly, 7,082 mouse cis-elements and 7,215 human cis-elements were predicted for 621 physiogenes….

I’m inclined to accept this result and its generalizability, but there’s a layer of analysis and modeling in this case which doesn’t exist in the others. Additionally, Carroll’s thesis is about the whole animal kingdom and a mouse-human comparison may be atypical.

Finally they wanted to look at gene duplication. They found:

Together with the D fam result, our analyses show that, whereas physiogene families expand/contract faster than morphogene families, the rate of expansion/contraction is relatively constant across lineages for a given family.

I wonder if the duplication here might have something to do with modulating dosages of various substrates in biochemical processes. This may have more direct relevance to physiological processes.

It is important to note, as they did, that the categories “morphogene” and “physiogene” are somewhat artificial, as is the distinction between morphology and physiology. Nature is fundamentally one, and we break it apart at particular joints for ease of our own abstractions and categorizations. Additionally all genes presumably have some effect on morphology and physiology, and though this exploration looks under the hood a bit more than some of the older abstractions it too is a simplification. The key is that the argument here seems to be that this breaking apart of categories and processes gives us a useful marginal return in comprehension of evolutionary dynamics. A trait is not always just a trait. Different classes of phenotypes may have different evolutionary genetic implications by their very nature. Some of this is common sense: those traits which are less functionally significant will exhibit more genic variation. But distinctions in terms of form and function themselves are at a further level of detail. And I presume that the generalizations we make from mouse-human comparisons, as here, have some limitations across the tree of life.

Citation: Liao BY, Weng MP, & Zhang J (2010). Contrasting genetic paths to morphological and physiological evolution. Proceedings of the National Academy of Sciences of the United States of America PMID: 20368429

Razib Khan
About Razib Khan

"I have degrees in biology and biochemistry, a passion for genetics, history, and philosophy, and shrimp is my favorite food. If you want to know more, see the links at"