In my last post “Even more genes for intelligence”, I alluded to the mysterious Hsu Boundary, and I encourage you to use this phrase as often as possible. Why should other researchers have a monopoly on jargon? The phrase should help you impress friends, and also curtail tedious conversations with persons who have limited understanding of sampling theory, themselves the biggest sample of all.
The “Hsu boundary” is Steve Hsu’s estimate that a sample size of roughly 1 million people may be required to reliably identify the genetic signals of intelligence. However, that has to be 1 million real persons, with individual data points, on which the best available techniques can be applied, not aggregated samples which are then subjected to a meta-analysis.
The reason for this is that the genetic code is a very long message. Even when summarized according to agreed principles, it generates a very large number of comparisons, and is rich soil for false positives. In reaction to that, significance thresholds are raised to correspondingly demanding levels, but that may rule out some real signals. A sample of at least 1 million, Steve calculated, would be required to get around this problem. Once gathered, more advanced methods, beyond linear regression, could be applied to the data.
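To put rough numbers on the multiple-comparisons problem, here is a minimal sketch. It assumes a simple Bonferroni correction, a conventional 0.05 significance level, and about 1 million tested SNPs. The function names are mine, and nothing here says which correction any particular study actually used.

```python
def expected_false_positives(n_tests, alpha=0.05):
    """Expected number of false hits if every test is null and run at level alpha."""
    return n_tests * alpha


def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test p-value threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


n_snps = 1_000_000  # assumed number of tested SNPs, for illustration only

print("Expected false positives without correction:", expected_false_positives(n_snps))
print("Bonferroni per-SNP threshold:", bonferroni_threshold(n_snps))
```

Under these assumptions the uncorrected analysis would throw up tens of thousands of spurious hits, while the corrected threshold is so strict that small real effects can only clear it in very large samples.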
Aggregated samples, put together by international collaborative projects, cannot always take the level of analysis down to the individual patient. They are doing meta-analysis, aggregating data from many sources. They share summary statistics, i.e., the statistical evidence from linear regression in favour of the association of a specific SNP with the phenotype. This has the advantage of making it easier to pool data, but it is not the most effective method for building a predictor. Hsu does not believe they will cross any special threshold using summary statistics on ~1M samples. They will, however, obtain better and better results as power increases. They will find patterns which have tighter confidence limits, and as such they will be identifying stronger signals.
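As an illustration of how pooling summary statistics tightens confidence limits, here is a minimal sketch of an inverse-variance weighted, fixed-effect meta-analysis for a single SNP. The choice of this particular pooling method is my assumption (consortia may use other schemes), and the cohort effect sizes and standard errors are invented numbers.

```python
import math


def pool_fixed_effect(cohorts):
    """Combine per-cohort (beta, standard_error) pairs by inverse-variance weighting.

    Returns the pooled effect estimate and its standard error.
    """
    weights = [1.0 / (se * se) for _, se in cohorts]
    total_weight = sum(weights)
    pooled_beta = sum(w * beta for w, (beta, _) in zip(weights, cohorts)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    return pooled_beta, pooled_se


# Hypothetical summary statistics for one SNP from three cohorts: (beta, standard error)
cohorts = [(0.020, 0.010), (0.035, 0.015), (0.025, 0.008)]

beta, se = pool_fixed_effect(cohorts)
print("Pooled beta:", round(beta, 4))
print("Pooled standard error:", round(se, 4))
print("Approximate 95% interval:", (round(beta - 1.96 * se, 4), round(beta + 1.96 * se, 4)))
```

Only the per-cohort estimates travel between sites, never the individual genomes, and the pooled standard error is always smaller than the smallest cohort standard error. That is the steady gain in power described above, with no sudden threshold.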
On a wider note, it may be available somewhere, but we need an accessible central register of the samples used in all studies, particularly for those studies that then go on to aggregate them for larger sample meta-analysis. This would allow us to understand overlaps between different meta-analyses.
A complexity that we have discussed before is that internationally aggregated samples on intelligence will probably have been measured with different tests. For once, the theory of general intelligence assists us here, in that a comparable g can be extracted from a broad range of testing procedures, putting all subjects onto the same g scale. An additional complexity is that for many samples no psychometric test scores are available, whereas scholastic test results are far more commonly obtainable. Scholastic attainment is very important, but it is not perfectly correlated with intelligence.
In a major study, Ian Deary and colleagues found a correlation of .8 between cognitive ability at 11 years and national examinations at age 16.
Intelligence and educational achievement. / Deary, Ian J.; Strand, Steve; Smith, Pauline; Fernandes, Cres. Intelligence, Vol. 35, No. 1, 2007, p. 13-21.
Excellent, but probably as high as can be achieved, and international scholastic levels will vary considerably, thus making the aggregation of subjects in different national school systems somewhat error prone. An even less powerful measure of intelligence is “years of education”. This is subject to many artefacts, typically that it is a reasonable measure when the extra years are only open to brighter students, but less so when nations are seeking to boost the abilities of all students by requiring them to stay in school longer.
Back to the analysis of genetic data. If you have all the individual data in one place, and have reliable and valid measures of mental ability, you can use more sophisticated machine learning techniques, for which Hsu predicts a threshold at a million or so genomes (it could be 2 million; the estimate is not that precise). Summary statistics plus linear regression has the advantage that it can be applied through meta-analysis without sharing samples: you can pool lots of data without altering the original ethical requirements, since individual data are not shared.
What are these more sophisticated machine learning techniques? Compressed Sensing is the front runner, and it is the technique for which Hsu predicts a boundary. It is a signal processing paradigm whose algorithm captures all the locations with some effect on intelligence, so long as there are not too many of them relative to the sample size.
Taking a reasonable heritability of roughly .5 and a high probability threshold for a real hit:
For heritability h2 = 0.5 and p ~ 1E06 SNPs, the value of C log p is ~ 30. For example, a trait which is controlled by s = 10k loci would require a sample size of n ~ 300k individuals to determine the (linear) genetic architecture.
We show using CS methods and theory that all loci of nonzero effect can be identified (selected) using an efficient algorithm, provided that they are sufficiently few in number (sparse) relative to sample size. For heritability h2 = 1, there is a sharp phase transition to complete selection as the sample size is increased. For heritability values less than one, complete selection can still occur although the transition is smoothed. The transition boundary is only weakly dependent on the total number of genotyped markers. The crossing of a transition boundary provides an objective means to determine when true effects are being recovered. For h2 = 0.5, we find that a sample size that is thirty times the number of nonzero loci is sufficient for good recovery.
This approach, then, identifies a real boundary. So long as the important signals are sparse (usually the case), a third of a million individuals suffice for a trait controlled by 10,000 loci.
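The arithmetic behind that figure is simple enough to sketch. The code below uses only the rule of thumb quoted above (sample size of about thirty times the number of nonzero loci at h2 = 0.5). The function names are mine, and running the rule backwards from a given sample size is my own illustration rather than a calculation taken from the paper.

```python
def required_sample_size(nonzero_loci, multiplier=30):
    """Rule of thumb quoted for h2 = 0.5: n is roughly multiplier * s."""
    return multiplier * nonzero_loci


def recoverable_loci(sample_size, multiplier=30):
    """The same rule read in reverse: the largest s that a given n could support."""
    return sample_size // multiplier


print(required_sample_size(10_000))   # 300000
print(recoverable_loci(1_000_000))    # 33333
```

On this reading, a sample of about a million would cope with a trait influenced by roughly thirty thousand loci, always assuming the multiplier of 30 holds at that scale.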
Finally, we appear to have come to the true Hsu boundary, a phase transition in which selection of signals becomes easier. Is it like moving from the troposphere to the stratosphere? Perhaps it is more like the familiar natural phase transition or phase boundary seen at a very precise threshold (e.g., 100 degrees Celsius), where the basic organization of atoms and molecules can change drastically (e.g., H2O changes from a liquid to a vapour).
Similarly, the behaviour of an optimization algorithm involving a million variables can change suddenly as the amount of data available increases. We see this behaviour in the case of Compressed Sensing applied to genomes, and it allows us to predict that something interesting will happen with complex traits like cognitive ability at a sample size of the order of a million individuals.
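For readers who want to see the flavour of this, here is a toy sketch, not the method used on real genomes. It assumes L1-penalized regression (the lasso) solved by coordinate descent as the Compressed Sensing algorithm, a noise-free trait (the h2 = 1 case), 100 simulated markers of which 3 have an effect, and an arbitrary penalty of 0.1. All names and numbers are my own choices for illustration.

```python
import random


def soft_threshold(value, lam):
    """Shrink value towards zero by lam, returning 0 inside [-lam, lam]."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0


def lasso_coordinate_descent(columns, y, lam, sweeps=50):
    """Minimise (1/2n)*||y - X b||^2 + lam*||b||_1 by cyclic coordinate descent.

    columns[j] holds the n values of marker j (column-major design matrix).
    """
    n, p = len(y), len(columns)
    beta = [0.0] * p
    residual = list(y)  # y - X*beta, with beta starting at zero
    col_scale = [sum(v * v for v in col) / n for col in columns]
    for _ in range(sweeps):
        for j in range(p):
            col, old = columns[j], beta[j]
            # Correlation of marker j with the residual, adding back its own contribution
            rho = sum(c * r for c, r in zip(col, residual)) / n + col_scale[j] * old
            new = soft_threshold(rho, lam) / col_scale[j]
            if new != old:
                delta = new - old
                for i in range(n):
                    residual[i] -= col[i] * delta
                beta[j] = new
    return beta


def recovery_rate(n, p=100, s=3, lam=0.1, trials=3, seed=0):
    """Fraction of true loci found among the s largest lasso coefficients."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        columns = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(p)]
        causal = rng.sample(range(p), s)
        # Noise-free trait: each causal marker has effect size 1 (the h2 = 1 case)
        y = [sum(columns[j][i] for j in causal) for i in range(n)]
        beta = lasso_coordinate_descent(columns, y, lam)
        top = sorted(range(p), key=lambda j: abs(beta[j]), reverse=True)[:s]
        hits += len(set(top) & set(causal))
    return hits / (trials * s)


for n in (10, 60):
    print("sample size", n, "-> fraction of true loci recovered:", recovery_rate(n))
```

With only a handful of individuals the algorithm cannot tell real markers from chance correlations. Once the sample is a comfortable multiple of the number of real loci, it should pick them out reliably, even though there are still more markers than people. Adding intermediate values of n to the loop lets you watch where the changeover happens in this toy setting.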
Machine learning is now providing new methods of data analysis, and this may eventually simplify the search for the genes which underpin intelligence.