The Unz Review - Mobile
A Collection of Interesting, Important, and Controversial Perspectives Largely Excluded from the American Mainstream Media

Recent genome-wide association studies have identified a large number of non-genic regions associated with disease risk; the standard interpretation of this observation is that these are regions involved in gene regulation. A few years back, though, another possibility was raised: what if there are simply a large number of genes in the human genome that we don’t know about yet? This hypothesis came from the observation, using gene expression microarrays, that there appeared to be much more transcription in the human genome than currently annotated (for example,[1]).

A new study, however, throws a bit of cold water on this possibility[2]. Using RNA sequencing, the authors show that 1) there doesn’t appear to be that much unannotated transcription, and 2) most of the unannotated transcription that does exist is connected in some way to previously known genes. I don’t want to focus too much on these observations; however, I do want to spend a bit of time on the new transcription that they do find. In particular, they identify ~8,000 novel, short, unspliced transcripts, some of which show evidence of evolutionary conservation. This seems like quite a few! However, my feeling is that many of these are not true transcripts, but rather experimental artifacts. Let me explain why.

Let’s take the example of such a transcript that the authors give us in their Figure 8, reproduced below. On the x-axis is the position along the genome, the top track shows a measure of evolutionary conservation, and the tracks below show measures of gene expression from RNA sequencing in various tissues. Notice the discrete peaks of mapped sequencing reads, consistently positioned in the different tissues.

Now, let’s take a look at this region in the UCSC genome browser—the centralized resource for all annotations of the human genome. If we consider the conservation track, we see the picture below. Like the authors of the paper, we see relatively high levels of sequence conservation in the region. However, if we look more closely at the multiple genome alignment that leads to this high sequence conservation, we notice a strange pattern—there are gaps in the alignment in the mouse and the dog (which both have high quality genome assemblies), but the sequence is present in human and much more distantly related species with lower quality genomes (like sticklebacks). This should give you pause—how could a DNA sequence be present in humans and sticklebacks, but absent from dogs and mice? There are some possibilities, but the most likely one is that this sequence is not conserved at all, but instead the alignment algorithm is aligning non-homologous, but similar, sequences in humans and the distantly related species.

What could lead to this phenomenon? One possibility is processed pseudogenes (a phenomenon where a transcribed–and, importantly, intronless–copy of a gene is re-inserted elsewhere in the genome): if processed pseudogenes arose independently in two lineages, the two copies could look pretty similar and might get aligned. Indeed, if we take the sequence in question and look to see if it matches anywhere else in the genome, we find it’s an excellent match to a gene on a different chromosome (see figure below).

This fact explains a number of observations about this region—the apparent sequence conservation is due to misalignment of the genomes in the region, and the apparent expression is due to RNA-Seq reads that cover exon-exon junctions[3]. In reality, the most likely situation is that there is neither conservation nor expression of the region. It’s unclear how many of the 8,000 putative new regions match this sort of profile, but my guess is that there are quite a few. I hasten to add that this actually supports the main conclusions made by the authors—there’s less unannotated transcription than previously reported, and even less than they themselves report[4]!

As a final aside, why is there no annotation of processed pseudogenes on the UCSC browser? With a bit of searching, I found a comprehensive catalog of processed pseudogene locations in the human genome[5] (the region discussed here is indeed annotated in that database). Unfortunately, these annotations are not found in the UCSC browser; my guess is the authors of this paper took a look at their region in the browser and saw nothing out of the ordinary. If the pseudogene annotation had been in this centralized browser, it might have raised a red flag. In any case, a good genome browser can make life a lot easier, and the more information the better.


[1] Cheng, Kapranov et al. (2005) Transcriptional Maps of 10 Human Chromosomes at 5-Nucleotide Resolution. Science. DOI: 10.1126/science.1108625

[2] van Bakel et al. (2010) Most “Dark Matter” Transcripts Are Associated With Known Genes. PLoS Biology. DOI:10.1371/journal.pbio.1000371

[3] Exon-exon junction sequences are separated by introns in the gene itself, but are together in the processed pseudogene. Thus any RNA-Seq read from the gene that spans the junction will match the pseudogene, but not the gene from which it came. Note that there are 11 exon-exon junctions in the gene, and about 11 peaks of gene expression in the pseudogene.
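To make this footnote concrete, here's a minimal sketch with invented toy sequences (not the actual locus): a read that straddles an exon-exon junction matches the intronless pseudogene contiguously, but has no contiguous match in the intron-containing parent gene.

```python
# Toy illustration of footnote [3]; all sequences are invented.
exon1 = "ATGGCCTTACGA"
exon2 = "GGTACCTTAGCA"
intron = "GTAAGT" + "T" * 20 + "AG"   # toy intron with canonical GT...AG ends

gene_genomic = exon1 + intron + exon2   # the gene as it sits in the genome
pseudogene = exon1 + exon2              # processed copy: intron spliced out

# An RNA-Seq read from the mature mRNA that straddles the junction:
read = (exon1 + exon2)[8:20]            # 12 bp centered on the junction

print(read in pseudogene)     # the read maps contiguously to the pseudogene
print(read in gene_genomic)   # but not to the parent gene, where the intron intervenes
```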

[4] It’s possible, of course, that some processed pseudogenes are actually expressed. However, this is probably a small fraction of all such pseudogenes, and would require more evidence than presented to prove.

[5] Zhang et al. (2003) Millions of Years of Evolution Preserved: A Comprehensive Catalog of the Processed Pseudogenes in the Human Genome. Genome Research. DOI: 10.1101/gr.1429003

• Category: Science 

A few months ago, I mentioned an article in Cell arguing that many results of genome-wide association studies are false positives. This is obviously wrong, and this week, a pair of letters to the editor (including one by Kai Wang summarizing arguments he made in various comment threads here and elsewhere) take the authors to task. The response from McClellan and King is laden with non sequiturs and simple factual errors, but they conclude:

We understand that many believe that most GWAS findings are valid….Currently, GWAS results fail to explain the vast majority of genetic influence on any human illness. *Further, most risk variants implicated by GWAS have no demonstrated biological, functional, or clinical relevance for disease.*

I’ve bolded the last sentence there, because it’s somewhat ironic that this was published the day after an epic genome-wide association study titled “Biological, clinical and population relevance of 95 loci for blood lipids”. This paper is worth a read not just if you’re interested in lipids, but because of the massive effort put into functional characterization of the identified loci. For a few genes, the authors are able to show that altering the expression level of the gene in mice leads to the exact phenotype they’d expect, highlighting a number of potential therapeutic targets. At one gene, a companion paper describes a heroic series of experiments to identify the precise mechanism by which a non-coding SNP exerts its effect.

I allow myself to hope that these sorts of experiments will lead to a push back against the bizarre notion that associations between non-coding polymorphisms and disease are somehow suspect.

• Category: Science 

One of the major criticisms leveled against genome-wide association studies for complex diseases is that they have identified loci which account for a relatively small proportion of the variance in most traits. The difference between this small proportion of variance explained by known loci and the (generally large) total amount of variance known to be due to genetic factors has been called the “missing heritability”. Much ink has been spilled speculating about where this missing heritability lies.

Two papers published this week suggest that maybe much of the heritability isn’t actually missing at all. The argument is simple: when performing a genome-wide association study, people use very stringent thresholds for calling a SNP associated with a trait. This is reasonable; people generally want to follow up only on true positives. However, there are probably many loci which don’t reach these highly stringent cutoffs but which truly influence the trait in question. Using methods to determine how much of the variance can be explained by these loci of smaller effect, one group suggests that about half of the heritability of height can be explained by common SNPs, and possibly close to all of it if other factors are taken into account. The authors have, in their discussion, one of the most reasonable, non-hyperbolic discussions of where the “missing heritability” lies, and how whole-genome sequencing will affect genome-wide association studies. It’s worth reading the whole thing, but here’s their conclusion:

If other complex traits in humans, including common diseases, have genetic architecture similar to that of height, then our results imply that larger GWASs will be needed to find individual SNPs that are significantly associated with these traits, because the variance typically explained by each SNP is so small. Even then, some of the genetic variance of a trait will be undetected because the genotyped SNPs are not in perfect LD with the causal variants. Deep resequencing studies are likely to uncover more polymorphisms, including causal variants that will be represented on future genotyping arrays. Our data provide strong evidence that the variation contributed by many of these causal variants is likely to be small and that very large sample sizes will be required to show that their individual effects are statistically significant. A similar conclusion was drawn recently for schizophrenia. In some cases the small variance will be due to a large effect for a rare allele, but this will still require a large sample size to reach significance. Genome-wide approaches like those used in our study can advance understanding of the nature of complex-trait variation and can be exploited for selection programs in agriculture and individual risk prediction in humans.
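The back-of-the-envelope accounting here can be sketched simply: under an additive model, a biallelic SNP with allele frequency p and per-allele effect β contributes 2p(1−p)β² to the genetic variance, so many thousands of SNPs individually far too small to reach genome-wide significance can jointly explain a sizeable fraction of it. This toy calculation (all numbers invented, and not the authors' actual estimation method, which fits all SNPs jointly) just illustrates the arithmetic:

```python
import random

random.seed(1)

# Invented parameters: 100,000 SNPs with tiny per-allele effects
# (in units of the trait's standard deviation, so total trait variance is 1).
n_snps = 100_000
snps = [(random.uniform(0.05, 0.5),     # allele frequency
         random.gauss(0.0, 0.003))      # tiny per-allele effect
        for _ in range(n_snps)]

# Each SNP contributes 2*p*(1-p)*beta^2 to the additive genetic variance.
var_explained = sum(2 * p * (1 - p) * beta ** 2 for p, beta in snps)
print(f"variance jointly explained by {n_snps} small-effect SNPs: {var_explained:.2f}")
```

No single SNP here would survive a genome-wide significance threshold, yet together they account for roughly a third of the trait variance under these made-up numbers.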


Park et al. (2010) Estimation of effect size distribution from genome-wide association studies and implications for future discoveries. Nature Genetics. doi:10.1038/ng.610.

Yang et al. (2010) Common SNPs explain a large proportion of the heritability for human height. Nature Genetics. doi:10.1038/ng.608.

• Category: Science 

Two articles in the New York Times this week revisit the promises made 10 years ago about how the sequencing of the human genome would revolutionize medicine (and how, obviously, it has not). There are things to quibble about–I could go on again about how some of the arguments against genome-wide association studies are silly, but hey, if silly arguments get published by prestigious people in prestigious journals, then I can hardly fault the Times for repeating them–but by and large the articles are spot on. The sequencing of the human genome has absolutely revolutionized biology, but that fact has had little impact on medicine to date.

There are reasons for optimism though. The most important reason is a bit of historical perspective–the road from identifying a gene involved in a disease to a treatment for that disease is…rocky, to say the least. The genetic cause of a particular type of cancer (CML) was discovered in 1960, but a treatment was approved by the FDA in 2001. For those keeping track, that’s 41 years. The gene for cystic fibrosis was discovered in 1988; drugs for that disease are finally in clinical trials now (and so will take at least 22 years to hit the market). Most drugs take well over a decade to go through the regulatory pipeline. So even if associations between genes and common diseases had been discovered the day the genome was published (obviously, it took half a decade to even get those associations), the drugs inspired by those findings would still not be out. This quote is particularly relevant:

“I don’t think any of us in the business believed it would be a cornucopia,” said Frank L. Douglas, the former head of research and development at the drug company Aventis. “What we did believe, however, was that it would get easier. We forgot our history.”

And once again, life imitates The Onion.

• Category: Science 

In a couple recent posts (and, I remember thanks to google, at least one very old–in internet years–post), I’ve pushed back against criticisms of genome-wide association studies using SNP genotyping arrays. This is despite the fact that I agree it’s clear that rare variants contribute to common diseases, and that sequencing technologies are eventually going to replace genotyping arrays for most genome-wide association studies. I’ve thought a bit about why I feel the need to push back on this (I’m not involved in any studies of common diseases, for what it’s worth), and I think there are two, entirely non-scientific reasons.

1. The self-aggrandizing tone and straw man arguments. McClellan and King, in their essay on the subject, refer to the “realization” that rare variants influence common diseases as a “paradigm shift”. This would seem to imply that no one had thought about this subject before, which is obviously preposterous. Almost a decade ago, at least a couple people crunched some numbers, and found that rare variants were likely to contribute quite a bit to common diseases. The relative contributions of alleles at different frequencies depend on a number of a priori unknowable parameters, of course, so what did you expect people to do? They could either sit on their hands for a decade or two and let sequencing technologies progress, or they could hope that the values of those parameters for their disease of interest were favorable, and give it a shot. For many diseases, this paid off. For a few, it has been overwhelmingly successful. Have these same people been champing at the bit to look at rare (or rarer) variants as technologies have progressed? Of course; anyone who presents the idea of using sequencing technologies to understand disease as novel is confused.

2. Insufficient amount of awe. Yes, ok, the genetic associations identified for many traits explain, at best, a few percent of the variance (the alternative approaches that were feasible a few years ago–i.e., sitting on your hands or doing candidate gene studies–would have yielded associations that explain 0% of the variance, of course, but I suppose that’s neither here nor there). But many of those associations are really cool. A variant that influences prostate cancer risk has the opposite effect on type II diabetes, providing evidence that the negative correlation between the two diseases is partially genetic. A number of associated alleles in several diseases have different effects depending on whether they were inherited maternally or paternally. A region influencing cancer risk exerts its effect by looping over 300,000 bases away to interact with a gene’s promoter. A SNP in a gene cluster that influences body patterning during development influences variation in number of teeth at the age of one. And now you want to argue we haven’t learned anything from genome-wide association studies? Are you kidding?

In any case, none of the above is particularly objective–maybe there are people out there who are absolutely shocked that common SNPs aren’t everything, or who think anything which isn’t immediately medically applicable is a waste of time. But for what it’s worth (not much), I’ll point out again that genome-wide association studies have revolutionized the study of human traits; the identification of genes involved in traits has become so routine that for some it’s now even boring!

• Category: Science 

I think it is probably (or should be) an uncontroversial statement to say that recent genome-wide association studies have revolutionized our understanding of the molecular basis of variation in disease risk in humans. From a handful of polymorphisms reliably associated with a few diseases, there are now hundreds of such associations for a wide spectrum of disease and non-disease traits. That said, these studies have been disappointing to some–even now, the genetic loci identified are generally a poor predictor of whether a person will get a disease or not. This has led to something of a backlash against these sorts of studies. Some of this backlash is fair enough, but some of the arguments presented are problematic. One bizarre argument that seems to be gaining some traction is that, since genome-wide association studies are finding many non-genic regions associated with disease risk, they’re not identifying anything functionally relevant. See, for example, this article in the New York Times, and a recent commentary by McClellan and King. Here are McClellan and King:

A major limitation of genome-wide association studies is the lack of any functional link between the vast majority of risk variants and the disorders they putatively influence…Very few published risk variants lie in coding regions, in UTRs, in promoters, or even in predicted intronic or intergenic regulatory regions. Far fewer have been shown to alter the function of any of these sequences. How did genome-wide association studies come to be populated by risk variants with no known function?

Their answer to this rhetorical question is that common SNPs (used on current genotyping platforms) are generally nonfunctional. The alternative, the evidence for which I’ll present here, is that our ability to predict functional SNPs is poor. In the phrase “no known function”, the emphasis should be on the word “known”.

So how could all these non-genic polymorphisms of unknown function influence disease risk? The obvious answer is that they influence gene regulation–the expression levels and/or timing of expression of relevant genes. Is there evidence that this is the case? Here are three points from the recent literature:

1. I’ll start with a recently published mouse model of cancer[1]. In this paper, the authors generated a mutant mouse which expressed a particular gene at 80% of its normal levels (this is in contrast to many studies of this type, which remove a gene completely). This is a rather subtle alteration of the physiology of a mouse. That said, these slightly modified mice developed a range of cancers at higher rates than controls. So the first point is: relatively slight changes in the expression of a gene can predispose to disease.

2. From the above, you might guess that polymorphisms in humans which lead to subtle changes in gene expression might be likely to also have shown up in genome-wide association studies (even if we don’t know the precise mechanism). This would be a correct guess. In a recent paper[2], a group showed that polymorphisms found to influence gene expression in human lymphoblastoid cell lines were more likely than control polymorphisms to also influence different traits. In a particular example, another group[3] asked whether polymorphisms associated with celiac disease (most of which were non-genic) were also influencing gene expression in blood. Of the 38 associated regions they found, 20 of them influenced gene expression. So the second point is, common polymorphisms with relatively subtle influences on gene expression can and do influence disease risk.

3. The last point is that there’s been one heavily-studied example of a polymorphism influencing disease risk despite being far from any known gene. This is a region on chromosome 8 associated with a number of cancers. In the last year, multiple groups have shown that this region contains a long-range enhancer element, with a common polymorphism in a binding site for a relevant transcription factor (for example,[4]). It’s unclear exactly how this polymorphism influences cancer risk, but the point remains: even loci extremely far from known genes can influence gene regulation.
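The kind of enrichment logic behind point 2 can be sketched with an exact binomial test: is 20 of 38 regions overlapping an eQTL more than you'd expect by chance? The 10% background overlap rate below is an assumed placeholder for illustration, not a number from the papers cited.

```python
from math import comb

# 20 of 38 celiac-associated regions influenced gene expression;
# assume (hypothetically) that 10% of random regions would.
n, k = 38, 20
p0 = 0.10

# One-sided exact binomial test: P(X >= k) under the background rate.
p_value = sum(comb(n, i) * p0**i * (1 - p0)**(n - i) for i in range(k, n + 1))
print(f"P(>= {k} of {n} regions overlap eQTLs by chance) = {p_value:.2e}")
```

Under any plausible background rate the excess is wildly significant, which is the qualitative point: trait-associated regions are enriched for regulatory function.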

In sum, the weight of evidence suggests that our lack of functional knowledge about the majority of signals coming from genome-wide association studies can be attributed not to some issue with how the studies are designed, but rather to a lack of understanding of the relevant biology. This will hopefully soon change.

[1] Alimonti et al. (2010) Subtle variations in Pten dose determine cancer susceptibility. Nature Genetics. doi:10.1038/ng.556

[2] Nicolae et al. (2010) Trait-Associated SNPs Are More Likely to Be eQTLs: Annotation to Enhance Discovery from GWAS. PLoS Genetics. doi:10.1371/journal.pgen.1000888

[3] Dubois et al. (2010) Multiple common variants for celiac disease influencing immune gene expression. Nature Genetics. doi:10.1038/ng.543

[4] Jia et al. (2009) Functional Enhancers at the Gene-Poor 8q24 Cancer-Linked Locus. PLoS Genetics. doi:10.1371/journal.pgen.1000597

• Category: Science 

A commentary published this week in the prestigious journal Cell is the latest salvo in the rare variants versus common variants “debate” (see my overall thoughts on this topic here). The commentary contains a number of false claims (e.g., that many SNP-disease associations found to date are false positives due to population structure) and non sequiturs (e.g., that the inability to find a known function for polymorphisms associated with disease means those polymorphisms have no effect on disease), which I’ll skip over in the interest of getting to the main point.

The conclusion of the authors, of course, is that the community should be doing full-genome resequencing of cases and controls to identify rare variants that cause disease. The more I see the argument presented as a bold “paradigm shift” (yes, believe it or not, the authors use this term), the more it makes me smile. It’s sort of like, back when the telescope was invented, someone writing a passionate essay saying, “Hey! You know where we should point this thing? The sky, you bloody idiots!”. The same people who have been very successful with genome-wide association studies using common SNPs have already begun to publish targeted resequencing studies (for example), and there’s no one in the field who hasn’t salivated over the prospect of dirt-cheap resequencing.

In any case, the authors of this essay have blinders on regarding the potential problems with this approach. As they say, the problem is in the biology. If the truth is that many common variants of extremely weak effect account for the majority of the variance in disease risk for many diseases, such is life, and resequencing studies simply won’t get around it. The authors cite a paper that argues this is exactly the case for schizophrenia, but of course don’t mention this conclusion, and seem oblivious to the problem it presents to their discussion. A prediction: after the first rounds of resequencing studies don’t account for all the so-called “missing heritability” of common diseases, the herd will come stampeding back to common variants (just as nonsensically as they seem to be stampeding away now).

• Category: Science 

Razib has a nice discussion of an interesting observation just published in PLoS Genetics–that there is a negative correlation between recombination rate in the human genome and population differentiation. This observation, along with the complementary observations of correlations between nucleotide diversity and recombination and between nucleotide diversity and density of functional elements, forms part of a growing body of literature establishing that the signatures of natural selection–positive and negative–have influenced overall patterns of genetic diversity in humans.

It’s important to emphasize again that these observations are influenced by both positive selection (the removal of genetic diversity at sites linked to advantageous alleles) and background selection (the removal of genetic diversity at sites linked to deleterious alleles). One important question is the relative role of these two forces in generating these overall patterns (the implications for human evolution of extensive positive selection are somewhat different from those of extensive negative selection); the authors discuss a couple of ways forward on addressing this.

The authors here also raise the intriguing possibility of leveraging populations which have diverged at different times to examine differences in the efficiency of natural selection over time; they don’t quite have the data to do this yet, but they certainly will in the next couple years. They do make the observation, using admittedly suboptimally ascertained data, that there does appear to be the same qualitative relationship–perhaps even stronger–between recombination rate and differentiation even between very closely related populations like the Chinese and the Japanese; though only suggestive, this raises the possibility that the signatures of selection (again, both positive and negative) are detectable even on a quite short timeframe. Overall, this is an exciting direction for the use of resequencing datasets that will be coming out soon.

Finally, since John Hawks doesn’t have comments, I’ll make a comment on his post on this paper. In particular, based on the observation above (about the relationship between differentiation between closely-related populations and recombination), he writes:

There are a lot more genes that are geographically circumscribed and low in frequency affecting FST at a more localized level, and fewer affecting major allele frequencies between continental regions.

Though this may be true, the correlation between FST and differentiation between closely-related populations observed here is almost certainly not due to any effect of this sort. The data used in the Chinese-Japanese comparison (for example) are from the Affymetrix and Illumina genotyping chips (i.e., HapMap 3), which contain mostly common variation and no (or very few) low-frequency SNPs specific to the Japanese (or Chinese). This effect is likely due to small differences in allele frequency between the Chinese and the Japanese at relatively *common, non-geographically circumscribed* SNPs. That is, imagine two SNPs, one at 55% frequency in Japan and 50% frequency in the rest of the world, and one at 50% frequency everywhere. Their observation (I think) is that SNPs of the former type are more common in low recombination rate areas of the genome, not that they find a bunch of new alleles that have arisen in the last few thousand years since those populations split. One could double-check this, but based on the chips they used, I’m pretty confident this is the case.
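Just how subtle the differentiation at such a SNP is can be seen by plugging the example frequencies into Wright's F_ST (a quick sketch assuming two equally sized populations, nothing specific to the papers discussed):

```python
def fst_two_pops(p1, p2):
    """Wright's F_ST for a biallelic SNP in two equally sized populations."""
    p_bar = (p1 + p2) / 2
    h_t = 2 * p_bar * (1 - p_bar)                  # expected het., pooled population
    h_s = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2  # mean within-population het.
    return (h_t - h_s) / h_t

# The two hypothetical SNPs from the text:
print(fst_two_pops(0.55, 0.50))   # 55% vs 50%: small but nonzero differentiation
print(fst_two_pops(0.50, 0.50))   # identical frequencies: F_ST of exactly zero
```

A 5% frequency shift at a common SNP gives an F_ST of only a few parts per thousand, so the signal in the Chinese-Japanese comparison is built from many such tiny shifts, not from population-specific alleles.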

• Category: Science 

There’s been a recent uptick in interest in the genetic architecture of complex traits (by which I mean the allele frequencies and effect sizes of the relevant loci), some of which has been driven by a much-hyped recent paper from David Goldstein’s group pointing out using simulations that, as one commenter put it, “LD exists”. Though the main point of that paper, that “[associations caused by rare variants] are likely to account for or contribute to many of the recently identified signals reported in genome-wide association studies”, is almost certainly wrong (depending on what you mean by “likely” or “contribute to” or “many”), what is true is that there are alleles at low frequency in the population that contribute to disease risk for just about any disease. Another thing that’s true is that there are alleles that are common in the population that contribute slightly to disease risk, as shown recently in schizophrenia.

The way to go about identifying these loci is straightforward, and I’m pretty sure all geneticists would agree on this: with an infinite budget, you would sequence the genomes of every individual with the disease and every individual without the disease, and do a truly genome-wide association study–identify all the polymorphisms that differ in frequency between people who have the disease and the people who don’t.

The problem, and this is where tempers start to flare, is that obviously the budget isn’t infinite. So, there’s the choice between collecting a sample of N individuals and typing them on a SNP chip (which are currently very skewed towards assaying common variation, though the next generation of chips is somewhat reducing that skew), or collecting a sample of N/10 (or probably fewer, but let’s go with an order of magnitude for argument’s sake) and performing full-genome sequencing. Which do you choose? If you think the rarer variation is more “important”, you choose the latter, while if you think the common variation is more “important”, you choose the former. If we define importance as the proportion of variance in a trait explained, this choice is based on your prior beliefs about what the relevant parameters are for the genetic architecture of your trait of interest. Once the price of sequencing drops sufficiently, this question becomes moot, but for the moment there’s a choice, and we find ourselves in this situation: people have heated, vehement arguments about prior beliefs that seem to outsiders like real, heady scientific debate, but are really about getting funding for your preferred study design. In 20 years this debate will be of interest only to historians (and maybe the people that had to suffer through it); there’s no real contentious scientific question.[1]
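The dependence on prior beliefs can be made concrete with a rough power calculation. Here's a sketch using a normal approximation to a two-proportion test on allele counts; all frequencies, effect sizes, and sample sizes below are invented for illustration, and which design "wins" flips entirely depending on which set of parameters you believe in.

```python
from math import sqrt, erf

def norm_cdf(x):
    return 0.5 * (1 + erf(x / sqrt(2)))

def power_allele_test(p_case, p_ctrl, n_per_group, alpha_z=5.45):
    """Approximate power of a two-proportion z-test on allele counts
    (2N alleles per group). alpha_z of ~5.45 corresponds to the usual
    genome-wide significance threshold of p < 5e-8 (two-sided)."""
    m = 2 * n_per_group                      # alleles per group
    p_bar = (p_case + p_ctrl) / 2
    se = sqrt(2 * p_bar * (1 - p_bar) / m)
    z = abs(p_case - p_ctrl) / se
    return norm_cdf(z - alpha_z)

# Same budget, two hypothetical designs:
# chip study: N = 10,000/group, common variant with a small frequency shift
power_chip = power_allele_test(0.33, 0.30, 10_000)
# sequencing study: N = 1,000/group, rarer variant with a larger relative shift
power_seq = power_allele_test(0.02, 0.01, 1_000)
print(f"chip design power: {power_chip:.2f}, sequencing design power: {power_seq:.4f}")
```

Under these made-up parameters the big chip study is well powered and the small sequencing study is hopeless; assume a different architecture (a rarer variant with a larger effect, or a bigger sequenced sample) and the conclusion reverses, which is exactly why the argument is really about priors.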

Personally, I lean slightly towards the common variation crowd, though not because I have a particularly strong feeling about it–bigger sample sizes are always better (or at least nice to have), and chips are covering rarer and rarer variation at a more-or-less fixed cost. But will cool things be found in the initial sequencing studies in smaller samples? Of course. It’s also important to note that sequencing studies are not a radical re-thinking of how to do disease genetics; they’re simply a more comprehensive way to do the exact same genome-wide association studies that people are doing now.

[1] There are some interesting scientific questions that could be answered by simply describing the genetic architecture of a trait (or perhaps more interestingly, comparing this across traits), but the volume of debate is probably not due to them.

• Category: Science 

I’ve been reading Sperm Biology: An Evolutionary Perspective, an engaging comparative look at, well, sperm biology. One fairly remarkable thing to me is that, while sperm evolve incredibly rapidly in morphology (at one point in the book, the claim is made that just about any animal can be distinguished visually by sperm cells alone[1]), the precise genetic changes involved in this variation are entirely unknown.

Given the massive divergence between the human and chimp Y chromosomes, the rapid evolution of sperm, and the fact that many genes on the Y are involved in spermatogenesis, it stands to reason that there is a large amount of variation in sperm-related traits within current human populations (there is very little work on this, so direct evidence is hard to come by), and that some of the relevant genetic loci lie on the Y chromosome.

So how variable is the Y chromosome within humans? It appears this is largely unknown as well (outside of the markers used for ancestry testing and the like), largely due to the fact that its repetitiveness makes it difficult to genotype or sequence. Here’s the thought: Pacific Biosciences now claims to be able to generate quality sequencing reads of up to a few kilobases; this alone might be enough to overcome the repetitiveness of the Y. Is it time for a “HapMap” of the Y chromosome, and incorporation of this chromosome into association studies for relevant traits?

[1] random aside: many nematodes, including C. elegans, have amoeba-like sperm rather than flagella. How many C. elegans genetics talks have I listened to without knowing this? Many.

• Category: Science 

David Goldstein and colleagues report today the results of a genome-wide association study for a particular side effect (treatment-induced anemia) of treatment for hepatitis C. It turns out that variants in a single gene–ITPA–are overwhelmingly associated with the development of this side effect. This is a nice, probably clinically important result, and there’s likely some interesting biology here as well.

One twist is that the authors identify two presumably causal variants in the gene–one a nonsynonymous SNP, and the other a SNP falling in a splice site. The authors make the following point:

Two related features of these observations are worth emphasizing. First, the ITPA variants constitute a clear example of a synthetic association in which the effects of rarer functional variants are observed as an association for a more common variant present on a whole-genome genotyping chip: indeed, the minor-allele frequency is higher for the top-associated SNP rs6051702 (19.4%) than for the causal variants rs1127354 (7.6%) and rs7270101 (12.3%) in European-Americans

Some readers will recall the paper recently published by this group on “synthetic associations”, where they posited a model for common diseases in which multiple rare (<5% minor allele frequency) SNPs in a gene can lead to the identification of an association with a common allele. Now, it appears, any gene with more than one functional variant, rare or not, fits their model!

That aside, I can see their point–the patterns of linkage disequilibrium around a locus with two causal variants lead in this case to a strong association signal at a SNP that happens to be correlated with both of them. But this isn’t a new phenomenon worthy of a special name; for example, multiple correlated SNPs in the MHC influence risk for celiac disease, and most people are happy to call that what it is–multiple causal variants at a locus. It seems a bit like the authors are shoehorning the data to fit their theory.

• Category: Science 

Daniel MacArthur points me to a Newsweek article on the bankruptcy of Decode Genetics. The author describes (one of) Decode’s problems like this:

The genetics of illness turned out to be more complex than researchers expected. At deCODE and elsewhere, the new genes linked to common diseases turned out to be rare or to have only small effects on individual risk. That killed any prospect of using deCODE’s discoveries to make blockbuster drugs.

The leap–that small genetic effect sizes mean no prospects for drug discovery–sounds reasonable, but is actually wrong. Here’s an example of why:

Consider a trait like, say, cholesterol levels. Massive genome-wide association studies have been performed on this trait, identifying a large number of loci of small effect. One of these loci is HMGCR, coding for HMG-CoA reductase, an important molecule in cholesterol synthesis. The allele identified increases cholesterol levels by 0.1 standard deviations, meaning a genetic test would have essentially no ability to predict cholesterol levels. By the logic of the Newsweek piece, any drug targeted at HMGCR would have no chance of becoming a blockbuster.
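To back up the claim that a 0.1 standard deviation effect has essentially no predictive value: under an additive model, the fraction of trait variance a SNP explains is 2p(1−p)β², where p is the allele frequency and β the per-allele effect in SD units. The frequency used below is an illustrative assumption of mine, not the actual HMGCR allele frequency:

```python
def variance_explained(p, beta):
    """Trait variance explained by a biallelic SNP under an additive
    model: allele frequency p, per-allele effect beta in SD units."""
    return 2 * p * (1 - p) * beta ** 2

# a 0.1 SD per-allele effect at an assumed 40% frequency
r2 = variance_explained(p=0.4, beta=0.1)   # about half a percent of variance
```

Half a percent of the variance is useless for individual prediction, which is the point: predictive value and biological (or pharmacological) relevance are different things.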

Any doctor knows where I’m going with this: among the best-selling drugs in the world today are the statins, which inhibit the activity of (the gene product of) HMGCR. Of course, statins had already been invented, so this is something of a cherry-picked example, but my guess is that there are tens of additional examples like this waiting to be discovered in the wealth of genome-wide association study data. Figuring out which GWAS hits are promising drug targets will take time, effort, and a good deal of luck; in my opinion, this is the major lesson from Decode (and not an all that surprising one)–drug development is really hard.

• Category: Science • Tags: Genetics 

In Why Evolution is True, Jerry Coyne has the following parenthetical aside about population variation in morphology in H. erectus:

(H. erectus from China…had shovel-shaped incisor teeth not found in other populations)

This stopped me dead in my tracks: modern East Asian populations have similar tooth morphology, caused in part by a positively-selected nonsynonymous change in the gene EDAR. Could this be an example of convergent evolution of tooth morphology in hominins?

However, a cursory Google search suggests that shovel-shaped incisors may be a trait general to H. erectus, not specific to Asian populations. Can anyone clarify this?

• Category: Science • Tags: Genetics 

Last week, I made a silly error in describing a problem in the sickle cell anemia example given by Dickson et al. (2010) as an empirical example of the phenomenon they call “synthetic association”. So allow me to take a mulligan, and re-try this:

The authors performed an association study in African-Americans, using ~200 individuals with sickle cell anemia as cases, and >7,000 controls. From their description, they simply performed a logistic regression of disease status on common polymorphisms genome-wide. This turned up a large (~2.5Mb) region surrounding HBB (known to harbour the rare disease-causing mutation) as highly associated with the phenotype. This large region of association stands in contrast, they argue, to the known patterns of linkage disequilibrium in the region, which extends over a few kilobases at most.

This observation, they argue, is an empirical example of how associations due to rare variants can lead to large blocks of associations at common variants. This effect is due to the fact that haplotypes surrounding rare variants are long, having had little time to be broken up by recombination. Under certain genetic models, this effect of “synthetic associations” is plausible; however, this example is a poor one for making their case.

The reason is that individuals with sickle cell anemia have two chromosomes of African ancestry in the region of HBB, while individuals without sickle cell anemia have approximately the background distribution of European and African chromosomes at the locus–~20% European and ~80% African. To put it another way, let X_d be the number of chromosomes of African ancestry an individual carries at a distance d from HBB (X_d can be 0, 1, or 2), and let Y be the number of chromosomes of African ancestry the individual carries at HBB itself. In the cases they’ve conditioned on Y = 2, while in the controls they have not. Since P(X_d) != P(X_d | Y = 2), much of their association signal is likely due simply to differences in ancestry between cases and controls in the HBB region (recall that admixture linkage disequilibrium in African-Americans extends for megabases).

More concretely, any SNP near the HBB locus that happened to be fixed for opposite alleles in Europe and Africa would have a whopping 20% allele frequency difference between cases and controls in their analysis, attributable simply to differences in local ancestry. That’s the extreme (and unlikely) situation, but alleles with more modest allele frequency differences between populations will show the same effect.
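A small simulation makes the conditioning argument concrete. This is an illustrative sketch, not the authors’ analysis: the 80/20 ancestry proportions, the sample sizes, and the assumption that the SNP sits in the same ancestry block as HBB and is fixed for opposite alleles (the extreme case above) are all assumptions of mine:

```python
import random

random.seed(1)

P_AFR = 0.8                # background prob. a chromosome is African at this locus
F_AFR, F_EUR = 1.0, 0.0    # extreme case: SNP fixed for opposite alleles

def chromosome():
    """One chromosome near HBB: (African ancestry?, SNP allele).
    The SNP is assumed to sit inside the same ancestry block as HBB,
    so its allele is fully determined by the block's ancestry."""
    afr = random.random() < P_AFR
    allele = random.random() < (F_AFR if afr else F_EUR)
    return afr, allele

def individual():
    return [chromosome(), chromosome()]

# Cases are conditioned on Y = 2 (both chromosomes African at HBB);
# controls are drawn from the background distribution, unconditioned.
cases, controls = [], []
while len(cases) < 2000:
    ind = individual()
    if all(afr for afr, _ in ind):
        cases.append(ind)
while len(controls) < 2000:
    controls.append(individual())

def allele_freq(sample):
    alleles = [allele for ind in sample for _, allele in ind]
    return sum(alleles) / len(alleles)

diff = allele_freq(cases) - allele_freq(controls)
```

With these assumptions the case–control frequency difference comes out near the full 20 points, purely from local ancestry; milder frequency differences between the source populations shrink, but do not eliminate, the artifact.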

To some extent, this is their point–the haplotype carrying the causal mutation is long. But the effect in this case is massively exaggerated by admixture, and the presentation of this exaggerated effect is misleading.

• Category: Science • Tags: Genetics 

There’s a bit of press surrounding the interesting result from David Goldstein’s group that, in certain situations, a number of “rare” (defined as an allele frequency less than 5%[1]) variants influencing a trait can lead to an association signal at “common” SNPs. This phenomenon the authors call a “synthetic association”.

The authors claim this is potentially the cause of many of the associations found in genome-wide association studies (with common SNPs), as well as a potential solution to the “missing heritability problem” (this isn’t mentioned in the paper itself, but rather in a Times article describing it). In other words, this could be a panacea for all the ills of the human genetics community. Unfortunately, this seems rather unlikely.

1. There are a range of parameter values for which “synthetic associations” are plausible–where the effect of the rare variants is small enough to have avoided detection by linkage studies but big enough to show up via correlation with common variants. This range of parameters is kind of small–from Figure 2, it looks like maybe a set of mutations at a gene with a genotypic relative risk greater than 2 but less than 6. Will this be the case for some loci? Sure, that sounds plausible. Is it going to explain everything? No, of course not.

2. It has been pointed out (rightly) that diseases that are selected against should have their genetic component enriched for rare variants. Goldstein himself has made this argument about diseases like schizophrenia. So if schizophrenia has all these rare variants, and rare variants cause rampant “synthetic associations” at common SNPs, why hasn’t anyone picked up whopping associations using common SNPs in schizophrenia?

3. The sickle cell anemia example, as presented in the paper, is extremely misleading. It seems the authors did a simple case-control test for sickle cell in an African-American population. Recall that African-Americans are an admixed population, with each individual carrying large chunks of “European” and “African” chromosomes. Anyone with sickle cell will have at least one block of African chromosome surrounding the beta-globin locus, while those without will have two chromosomes sampled from the overall distribution of chromosomes in the population–15-20% of which, approximately, will be of European descent[2]. So any SNP with an allele frequency difference between African and European populations in this region will show up as a highly significant association with the disease due to the way they’ve done the test, and these associations will extend out to the length of admixture linkage disequilibrium–well, well beyond the LD found in African populations alone. The presentation of this example in the paper–the large block of association contrasting with the small blocks of LD in the Yoruban population–is a bit silly.

If I had to guess and put a concrete bet on how this will play out, take the associations listed in their Table 1, which they call candidates for being due to synthetic associations. My bet: none of them are. Ok, maybe one.

[1] These sorts of thresholds are important to watch–in a year people will be calling things at 1% frequency “common” if it suits them for rhetorical purposes.

[2] Corrected from: “… will have two large blocks of “African” chromosomes surrounding the beta-globin locus, and everyone without will have at least one European chromosome in the same area”; see comments.

• Category: Science • Tags: Genetics 

Online this week in Science, a group presents a method for identifying genes under positive selection in humans, and gives some examples. I have somewhat mixed feelings about this paper, for reasons I’ll get to, but here’s their basic idea:

Readers of this site will likely be familiar with genome-wide scans for loci under positive selection in humans (see, eg., the links in this post). In such a scan, one decides on a statistic that measures some aspect of the data that should differ between selected and neutral loci–for example, extreme allele frequency differences between populations, or long haplotypes at high frequency–and calculates this statistic across the genome. One then chooses a threshold for calling a locus “interesting”, and looks at those loci for patterns–are there genes involved in particular phenotypes among them? Or protein-coding changes?

In this paper, the authors note that many of these statistics are measuring different aspects of the data, such that combining them should increase power to distinguish “interesting” loci from non-“interesting” loci. That is, if there’s an allele at 90% frequency in Europeans and 5% frequency in Asians, that’s interesting, but if that allele is surrounded by extensive haplotype structure in one of those populations, that’s even more interesting. The way they combine statistics is pretty straightforward–they essentially just multiply together empirical p-values from different tests as if they were independent. I wouldn’t believe the precise probabilities that come out of this procedure (for one, the statistics aren’t really fully independent), but it seems to work–in both simulations of new mutations that arise and are immediately under selection and in examples of selection signals where the causal variant is known (Figures 1-3)–for ranking SNPs in order of probability of being the causal SNP underlying a selection signal.
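A minimal sketch of the combination step as I understand it–multiply empirical p-values from the component tests (equivalently, sum their logs) and use the result only to rank SNPs. The statistic names and p-values below are hypothetical, and because the component tests are correlated, the output is a ranking score rather than a calibrated probability:

```python
import math

def composite_score(pvals):
    """Combine empirical p-values from several selection statistics by
    multiplying them as if independent; returned on the -log scale, so
    larger values mean a stronger combined signal."""
    return -sum(math.log(p) for p in pvals)

# Hypothetical empirical p-values for two SNPs from three statistics
# (say, a frequency-difference test and two haplotype-based tests):
strong = composite_score([0.01, 0.05, 0.02])
weak = composite_score([0.40, 0.30, 0.60])
```

Ranking SNPs by `strong` vs `weak` does what the paper needs–localizing the likely causal SNP within a selected region–without pretending the product is a real probability.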

With this, the authors have a systematic approach for localizing polymorphisms that have experienced recent selection. It’s necessarily somewhat heuristic, sure, but it does the job. They then want to apply this procedure to gain novel insight into recent human evolution. This is sort of the crux of the matter–does this new method actually give us new biological insight?

The novel biology presented consists of a few examples of selection signals where they now think they’ve identified a plausible mechanism for the selection–a protein-coding change in PCDH15, and regulatory changes near PAWR and USF1 (their Figure 4). On reflection, however, these examples aren’t new. Consider PCDH15–this gene was mentioned in a previous paper by the same group, where they called a protein-coding change in the gene one of the 22 strongest candidates for selection in humans (Table 1 here, and main text). It’s unclear what is gained with the new method (except perhaps to confirm their previous result?).

Or consider the regulatory changes near PAWR and USF1. The authors use available gene expression data to show that SNPs near these genes influence gene expression, and that the signals for selection and the signals for association with gene expression overlap. Early last year, a paper examined in detail the overlap between signals of this sort, and indeed, both of these genes are mentioned as examples where this overlap is observed. So using different methods, a different group published the same conclusion about these genes a year ago. Again, it’s unclear what one gains with this new method.

In general, then, this paper has interesting ideas, but puzzlingly fails to really take advantage of them[1]. That said, they’ve taken some preliminary steps down a path that is very likely to yield interesting results in the future.


[1] I wonder if I’m being too harsh on this paper just because it was published in a “big-name” journal. If this were published in Genetics, for example, I certainly wouldn’t be opining about whether or not it contains any novel biology.

Citation: Grossman et al. (2010) A Composite of Multiple Signals Distinguishes Causal Variants in Regions of Positive Selection. Science. DOI: 10.1126/science.1183863

• Category: Science • Tags: Population Genetics 

This week in Science, three papers report that the product of the gene PRDM9 is an important determinant of where recombination occurs in the genome during meiosis. Though this may sound like something of an esoteric discovery, it’s actually pretty remarkable, and brings together a number of lines of research in evolutionary genetics. How so?

A bit of background.

A few somewhat related facts:

1. A major goal in the study of speciation is the identification of the genes that underlie reproductive barriers between species. In 2008, the first such gene in mammals was found–in a cross between two subspecies of mouse where the male offspring are sterile (note that this follows Haldane’s rule), introduction of the “right” version of a single gene was sufficient to restore fertility. This gene? PRDM9, which encodes a histone methyltransferase expressed in the mouse germline. This gene has evolved rapidly across animals, especially in the part of the protein that binds DNA, suggesting it binds a sequence that is itself changing rapidly over evolutionary time.

2. The positions in the genome at which recombination occurs during meiosis are not scattered randomly, but rather cluster together in what are called “recombination hotspots“. Enriched within these hotspots in humans is a particular sequence motif, presumably an important binding site for whatever factor is controlling recombination. As this fact was becoming clear, a group compared the positions of these recombination hotspots between humans and chimpanzees. The result? The positions of these hotspots are remarkably different between the species. In fact, the positions of recombination hotspots in humans and chimpanzees are nearly non-overlapping, a fairly impressive fact given that the genomes themselves are 99.X% identical.

3. But perhaps #2 isn’t all that surprising. If there are two alleles at a hotspot, one of which is “hot” and the other of which is “cold” (i.e. it doesn’t initiate recombination), the mechanism of recombination results in gene conversion of the “hot” allele to the “cold” allele (for details, see here). This should result in the relatively rapid loss of recombination hotspots over evolutionary time, which in turn results in what has been called “the hotspot conversion paradox“–if hotspots should trend over time towards being “cold”, how is it that they exist at all? One plausible resolution of this paradox: a sequence or gene that doesn’t itself contain a hotspot might control the positioning of recombination elsewhere in the genome.

4. Indeed, such genes exist. In mice, two groups last year identified regions of the genome (though they didn’t at the time narrow them down to a gene) controlling the usage of individual hotspots. Importantly, one such region was located far from the hotspot it controls, indicating an important regulator of recombination positioning that acts at a distance. In humans, a group last year showed that there is extensive variability between humans in how often previously identified hotspots are used, and that this variation is heritable.
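The conversion bias in #3 can be made concrete with a toy deterministic model; the transmission distortion c and the starting frequency are made-up illustrative numbers, not estimates from any of these papers. If hot/cold heterozygotes transmit the hot allele with probability 1/2 − c rather than 1/2, then:

```python
def next_freq(p, c=0.01):
    """Expected hot-allele frequency after one generation when hot/cold
    heterozygotes (frequency 2p(1-p)) under-transmit the hot allele by c:
    p' = p^2 + 2p(1-p)*(1/2 - c) = p - 2c*p*(1-p)."""
    return p - 2 * c * p * (1 - p)

# track a hot allele from high frequency down to near-loss
p, generations = 0.9, 0
while p > 0.1:
    p = next_freq(p)
    generations += 1
```

The hot allele declines every generation it segregates, so with these made-up numbers the hotspot is driven toward extinction in a few hundred generations–hence the paradox, and hence the appeal of a trans-acting controller like PRDM9 that can move hotspots elsewhere.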

PRDM9 brings all of these observations together

These three papers all report that item #1 and items #2-4 above are all related. What do they show?

1. Two groups followed up on the observation in #4 above that there was a particular region in mouse controlling hotspot usage, and identified the relevant gene as PRDM9. One group went further, testing whether variation in this gene also influenced hotspot usage in humans. Remarkably, it did: variation in PRDM9 in both mice and humans leads to variation in hotspot usage. This variation changes the binding specificity of the protein, leading to changed hotspots and a resolution of the “hotspot conversion paradox” mentioned in #3 above.

2. Another group took a different route to a similar conclusion. They followed up the sequence motif mentioned in #2 above as being enriched in recombination hotspots in humans. The “hotspot conversion paradox” predicts that, if this motif is “hot”, it should be in the process of being removed from the human genome. Similarly, if it’s not “hot” in chimpanzees, it should not be in the process of being removed from the chimp genome. Indeed, this motif has been preferentially lost along the human lineage as compared to the chimp lineage. They then asked, what is binding this motif? They had two criteria–a protein with a predicted binding site similar to their motif, and lack of conservation of this protein between humans and chimpanzees. Only one gene fit these criteria–PRDM9. Thus, the rapid evolution of PRDM9 is responsible for the puzzling observation that recombination hotspots are entirely unconserved between humans and chimps.

A brief conclusion

I’ll reiterate that this is a pretty remarkable discovery, opening up the possibility of a direct link between the evolution of recombination and speciation. Is the effect of PRDM9 on recombination responsible for the conformation to Haldane’s rule in the mouse cross described in #1? Or is there some additional effect of this gene? Is the evolution of PRDM9 sufficient to describe the evolution of recombination hotspots in all animals? One can imagine a whole host of additional questions. Certainly, this is a story to be continued.

• Category: Science • Tags: Evolution, Genetics 

Understanding the precise molecular mechanisms underlying changes in animal morphology is a tricky problem–usually, two species that have diverged morphologically (say, mice and humans) are so distantly related as to make genetic study exceedingly difficult, if not impossible. For years, a group led by David Kingsley has been addressing this problem in a cleverly-chosen model–three-spined sticklebacks. Importantly for the question of morphological evolution, freshwater populations of this fish have lost many of the spines and much of the pelvic girdle carried by the saltwater populations (there are a number of hypotheses, probably not all mutually exclusive, for why this has been under selection).

In a new paper, this group demonstrates the precise genetic alteration underlying this change in a number of freshwater populations. Perhaps surprisingly, it appears to be due to the recurrent deletion (in different freshwater populations) of an enhancer of an important developmental gene. Strikingly, creating a transgenic freshwater fish carrying a copy of this enhancer (which it normally lacks) produces freshwater fish with a pelvis like that of the saltwater fish.

In fact, this enhancer seems to fall in a “fragile” (read: repeat-laden) region of the genome, which presumably increases the rate of deletion at the site. If one imagines there are a number of genetic paths to the reduced pelvis size favored in freshwater environments, the probability of each path depends on the mutation rate of the corresponding genetic change. In this case, many (though not all) freshwater populations have independently taken the same path, likely due to the increased mutation rate at this fragile site.


Citation: Chan et al. (2009) Adaptive Evolution of Pelvic Reduction in Sticklebacks by Recurrent Deletion of a Pitx1 Enhancer. Science. Published Online December 10, 2009 [DOI: 10.1126/science.1182213]

• Category: Science • Tags: Genetics 

There has previously been some discussion on this site about the failure of past candidate gene association studies for identification of genetic variants that truly influence a phenotype. Much of this involved discussion of the interpretation of p-values in this context (for example, see this thread). Nature Reviews Genetics has just published a must-read review for people interested in these topics.

• Category: Science 

I thought I’d point quickly to a really nice paper showing that the RNAi pathway, thought to be absent in budding yeasts, is actually only missing from baker’s yeast, Saccharomyces cerevisiae. Remarkably, the authors are able to reconstitute the pathway (which was presumably present in the ancestor of all budding yeasts) in S. cerevisiae with exogenous expression of only two genes. The authors close with a remark about the role of contingency (in particular with regards to the choice of model organism) in research:

While anticipating a productive future for RNAi research in budding yeasts, we note that if in the past S. castellii [a yeast with an endogenous RNAi pathway] rather than S. cerevisiae had been chosen as the model budding yeast, the history of RNAi research would have been dramatically different.

• Category: Science • Tags: Genetics 