The “missing heritability” problem: current genetic analysis cannot explain as much variance as population heritability estimates suggest it should. This has been a cue for “Down with twin studies” arguments, in which those of dramatic inclinations have chosen to imagine that heritability estimates were thereby disproved. Not so. I was never particularly worried about this argument, regarding it as only a matter of time before the genetic code was cracked sufficiently to bridge the gap.
Another problem about breaking the genetic code is that some important human characteristics, like height and intelligence, are controlled by many genes of small effect. As regards height, this is in fact a problem of proportionality: tall people are usually taller not just because they have longer legs, though they do, but because they are longer overall. Building a taller body involves a large set of changes. Indeed, perhaps as many as 20,000 SNPs are required, each of them doing only a little. Equally, for intelligence as many as 10,000 SNPs may be involved. However, if many SNPs are required for an important trait, each doing very little, it is hard to prove or disprove their involvement. Rather than just identifying significant SNPs, it is important to show that a technique can account for a good proportion of the overall variance. Prediction matters.
Now a paper comes along which claims to have hoovered up the SNP heritability variance for height, and to have done so by using machine learning, namely the LASSO or compressed sensing technique. It also gets 9% of the variance for scholastic attainment, close to the 10% I had previously mentioned as the current upper limit.
Accurate Genomic Prediction of Human Height. Louis Lello, Steven G. Avery, Laurent Tellier, Ana I. Vazquez, Gustavo de los Campos, and Stephen D.H. Hsu. bioRxiv preprint first posted online Sep. 18, 2017.
We construct genomic predictors for heritable and extremely complex human quantitative traits (height, heel bone density, and educational attainment) using modern methods in high dimensional statistics (i.e., machine learning). Replication tests show that these predictors capture, respectively, ∼40, 20, and 9 percent of total variance for the three traits. For example, predicted heights correlate ∼0.65 with actual height; actual heights of most individuals in validation samples are within a few cm of the prediction. The variance captured for height is comparable to the estimated SNP heritability from GCTA (GREML) analysis, and seems to be close to its asymptotic value (i.e., as sample size goes to infinity), suggesting that we have captured most of the heritability for the SNPs used. Thus, our results resolve the common SNP portion of the “missing heritability” problem – i.e., the gap between prediction R-squared and SNP heritability. The ∼20k activated SNPs in our height predictor reveal the genetic architecture of human height, at least for common SNPs. Our primary dataset is the UK Biobank cohort, comprised of almost 500k individual genotypes with multiple phenotypes. We also use other datasets and SNPs found in earlier GWAS for out-of-sample validation of our results.
The introduction sets out the problems clearly, and distinguishes the SNP hunting techniques from genomic prediction which, “based on whole genome regression methods, seek to construct the most accurate predictor of phenotype, tolerates possible inclusion of a small fraction of false-positive SNPs in the predictor set. The SNP heritability of the molecular markers used to build the predictor, can be interpreted as an upper bound to the variance that could be captured by the predictor”.
The authors have used the UK Biobank database with nearly 500,000 genotypes. The paper has, quite necessarily, very technical supplementary appendices, but the underlying approach is to use large samples of the data to train the learning procedure, and then test the results on samples of 5,000 genotypes which had been held apart for that purpose. In my primitive terms, the sample of discovery is used to generate the best predictor, warts and all, and that is tested on the sample of proof. I like this, because it is pragmatic, not burdened by too many prior assumptions about genes, uses all the data to advantage, and is willing to include weak signals.
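To make the "sample of discovery, sample of proof" idea concrete, here is a minimal sketch in plain Python. It is not the authors' pipeline: the genotypes are simulated, and the sizes (300 training and 100 held-out individuals, 40 SNPs of which 5 are causal), the allele frequency, the noise level and the penalty `lam` are all assumptions chosen so that the toy trait has a heritability of about 0.5. The LASSO is fitted by ordinary coordinate descent, and the held-out correlation and its square play the role of the paper's prediction correlation and variance captured.

```python
import random

def soft_threshold(z, t):
    """Shrink z toward zero by t; this is what gives LASSO its exact zeros."""
    return z - t if z > t else z + t if z < -t else 0.0

def lasso_cd(X, y, lam, n_iter=30):
    """Coordinate descent for (1/2n)*||y - X b||^2 + lam*||b||_1 (centred data)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X b, starting from b = 0
    col_sq = [sum(row[j] ** 2 for row in X) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return beta

def pearson(a, b):
    """Pearson correlation of two equal-length lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

# All of the following values are illustrative assumptions, not from the paper.
random.seed(0)
N_TRAIN, N_TEST, P = 300, 100, 40
CAUSAL = [0, 1, 2, 3, 4]   # SNPs with a true effect of 1 per allele
FREQ = 0.5                 # allele frequency, same for every SNP
NOISE_SD = 1.58            # genetic variance is 2.5, so h2 is about 0.5

def simulate(n):
    X, y = [], []
    for _ in range(n):
        g = [sum(random.random() < FREQ for _ in range(2)) for _ in range(P)]
        X.append(g)
        y.append(sum(g[j] for j in CAUSAL) + random.gauss(0.0, NOISE_SD))
    return X, y

X_train, y_train = simulate(N_TRAIN)   # sample of discovery
X_test, y_test = simulate(N_TEST)      # sample of proof, never used in fitting

# Centre with training-set means only, so nothing leaks from the held-out set.
col_means = [sum(row[j] for row in X_train) / N_TRAIN for j in range(P)]
y_mean = sum(y_train) / N_TRAIN

def center(X):
    return [[row[j] - col_means[j] for j in range(P)] for row in X]

beta = lasso_cd(center(X_train), [v - y_mean for v in y_train], lam=0.2)
selected = [j for j, b in enumerate(beta) if b != 0.0]
pred = [y_mean + sum(b * x for b, x in zip(beta, row)) for row in center(X_test)]
r = pearson(pred, y_test)

print("SNPs kept by the LASSO:", selected)
print("held-out correlation r = %.2f, variance captured r^2 = %.2f" % (r, r * r))
```

Squaring the held-out correlation is also how the abstract's two headline figures for height fit together: a correlation of about 0.65 corresponds to 0.65 × 0.65 ≈ 0.42, or roughly 40 percent of the variance.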
Fig 3 shown above reveals a good fit with the data for height.
As the authors say in their discussion:
Until recently most work with large genomic datasets has focused on finding associations between markers (e.g., SNPs) and phenotype. In contrast, we focused on optimal prediction of phenotype from available data. We show that much of the expected heritability from common SNPs can be captured, even for complex traits affected by thousands of variants. Recent studies using data from the interim release of the UKBB reported prediction correlations of about 0.5 for human height using roughly 100K individuals in the training. These studies forecast further improvement of prediction accuracy with increased sample size, which have been confirmed here.
We are optimistic that, given enough data and high-quality phenotypes, results similar to those for height might be obtained for other quantitative traits, such as cognitive ability or specific disease risk. There are numerous disease conditions with heritability in the 0.5 range, such as Alzheimer’s, Type I Diabetes, Obesity, Ovarian Cancer, Schizophrenia, etc. Even if the heritable risk for these conditions is controlled by thousands of genetic variants, our work suggests that effective predictors might be obtainable (i.e., comparable to the height predictor in Figure (4)). This would allow identification of individuals at high risk from genotypes alone. The public health benefits are potentially enormous.
We can roughly estimate the amount of case-control data required to capture most of the variance in disease risk. For a quantitative trait (e.g., height) with h² ∼ 0.5, our simulations predict that the phase transition in LASSO performance occurs at n ∼ 30s where n is the number of individuals in the sample and s is the sparsity of the trait (i.e., number of variants with non-zero effect sizes). For case-control data, we find n ∼ 100s (where n means number of cases with equal number controls) is sufficient. Thus, using our methods, analysis of ∼100k cases together with a similar number of controls might allow good prediction of highly heritable disease risk, even if the genetic architecture is complex and depends on a thousand or more genetic variants.
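The authors' rules of thumb are simple enough to turn into arithmetic. The sketch below only restates the two proportionalities quoted above (n ∼ 30s for a quantitative trait with heritability around 0.5, n ∼ 100s cases for case-control data); the function names are mine, and the figures are order-of-magnitude guides, not precise requirements.

```python
def samples_for_quantitative_trait(sparsity, factor=30):
    """Rough sample size at the LASSO phase transition, n ~ 30 * s (h2 about 0.5)."""
    return factor * sparsity

def cases_for_case_control(sparsity, factor=100):
    """Rough number of cases needed, n ~ 100 * s, with an equal number of controls."""
    return factor * sparsity

s = 1000  # a trait depending on a thousand variants, as in the quoted example
print("quantitative trait, individuals needed:", samples_for_quantitative_trait(s))
print("case-control, cases needed:", cases_for_case_control(s))
print("case-control, total including controls:", 2 * cases_for_case_control(s))
```

With s = 1,000 the case-control rule gives 100,000 cases, which is the ∼100k figure the authors mention.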
In summary, this is exciting stuff. It would appear that, given large samples and traits that meet the signal sparsity requirements, compressed sensing may help track down predictive formulas for many traits and conditions. The benefits are enormous, not least the greatest benefit of all, a gain in understanding.