In my last post “Even more genes for intelligence”, I alluded to the mysterious Hsu Boundary, and I encourage you to use this phrase as often as possible. Why should other researchers have a monopoly of jargon? The phrase should help you impress friends, and also to curtail tedious conversations with persons who have limited understanding of sampling theory, themselves the biggest sample of all.

The “Hsu boundary” is Steve Hsu’s estimate that a sample size of roughly 1 million people may be required to reliably identify the genetic signals of intelligence. However, that has to be 1 million real persons, with individual data points, on which the best available techniques can be applied, not aggregated samples which are then subjected to a meta-analysis.

The reason for this is that the genetic code is a very long message. Even when summarized according to agreed principles, it generates a very large number of comparisons, and so is rich soil for false positives. In reaction to that, significance thresholds are made correspondingly demanding, but that may rule out some real signals. A sample of at least 1 million, Steve calculated, would be required to get around this problem. Once such a sample is gathered, more advanced methods, beyond linear regression, could be applied to the data.
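To make the multiple-comparisons point concrete, here is a small sketch of how a Bonferroni-style correction tightens the per-test threshold. The figure of one million tested SNPs and the 0.05 family-wise level are illustrative assumptions, not numbers taken from any particular study.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance level that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


def expected_false_positives(alpha: float, n_tests: int) -> float:
    """Expected number of false hits if every test is null and run at level alpha."""
    return alpha * n_tests


n_snps = 1_000_000  # assumed number of independent tests (illustrative)
print(expected_false_positives(0.05, n_snps))  # uncorrected: tens of thousands of false hits
print(bonferroni_threshold(0.05, n_snps))      # corrected per-SNP threshold
```

The stricter threshold removes most false hits, but a real effect now needs a much larger sample to clear it, which is the trade-off described above.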

Aggregated samples, put together by international collaborative projects, cannot always take the level of analysis down to the individual patient. They are doing meta-analysis, aggregating data from many sources. They share summary statistics, i.e. the statistical evidence from linear regression in favour of association of a specific SNP with the phenotype. This has the advantage of making it easier to pool data, but it is not the most effective method for building a predictor. Hsu does not believe they will cross any special threshold from summary statistics on ~1M samples. They will, however, obtain better and better results as power increases. They will find patterns which have tighter confidence limits, and as such they will be identifying stronger signals.
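One common way to pool per-cohort regression results is fixed-effect inverse-variance weighting. The post does not say which method any given consortium uses, so treat the choice as an assumption; the effect sizes and standard errors below are made up for illustration.

```python
import math


def inverse_variance_meta(betas, standard_errors):
    """Fixed-effect inverse-variance-weighted pooling of per-cohort estimates.

    Returns the pooled effect size and its standard error.
    """
    weights = [1.0 / se ** 2 for se in standard_errors]
    total_weight = sum(weights)
    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    return pooled_beta, pooled_se


# Hypothetical summary statistics for one SNP from three cohorts.
betas = [0.020, 0.030, 0.025]
standard_errors = [0.010, 0.010, 0.005]

pooled_beta, pooled_se = inverse_variance_meta(betas, standard_errors)
print(pooled_beta, pooled_se)
```

Only the per-cohort estimates and standard errors are needed, which is why this route avoids sharing individual-level data. The pooled standard error shrinks as cohorts are added, giving the tighter confidence limits mentioned above.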

On a wider note, it may be available somewhere, but we need an accessible central register of the samples used in all studies, particularly for those studies that then go on to aggregate them for larger sample meta-analysis. This would allow us to understand overlaps between different meta-analyses.

A complexity that we have discussed before is that internationally aggregated samples on intelligence will probably have been measured with different tests. For once, the theory of general intelligence assists us here, in that a comparable g can be extracted from a broad range of testing procedures, putting all subjects onto the same g scale. An additional complexity is that for many samples no psychometric test scores are available, but scholastic tests are far more commonly obtainable. Scholastic attainment is very important, but it is not perfectly correlated with intelligence.

In a major study, Ian Deary and colleagues found a correlation of .8 between cognitive ability at 11 years and national examinations at age 16.

Intelligence and educational achievement. / Deary, Ian J.; Strand, Steve; Smith, Pauline; Fernandes, Cres. Intelligence, Vol. 35, No. 1, 2007, p. 13-21.

Excellent, but probably as high as can be achieved, and international scholastic levels will vary considerably, thus making the aggregation of subjects in different national school systems somewhat error prone. An even less powerful measure of intelligence is “years of education”. This is subject to many artefacts, typically that it is a reasonable measure when the extra years are only open to brighter students, but less so when nations are seeking to boost the abilities of all students by requiring them to stay in school longer.

Back to the analysis of genetic data. If you have all the individual data in one place, and have reliable and valid measures of mental ability, you can use more sophisticated machine learning techniques, where Hsu predicts a threshold at a million or so genomes (it could be 2 million; the estimate is not that precise). Summary statistics plus linear regression has the advantage that it can be applied through meta-analysis without sharing samples: you can pool lots of data without altering the original ethical requirements, since individual data are not shared.

What are these more sophisticated machine learning techniques? The front runner, and the technique for which Hsu predicts a boundary, is Compressed Sensing, a signal processing paradigm whose algorithm captures all the locations with some effect on intelligence, so long as there are not too many of them relative to the sample size:

http://infoproc.blogspot.com/search?q=compressed+sensing …

Assuming a reasonable heritability of roughly .5 and a high probability threshold for a real hit:

For heritability h2 = 0.5 and p ~ 1E06 SNPs, the value of C log p is ~ 30. For example, a trait which is controlled by s = 10k loci would require a sample size of n ~ 300k individuals to determine the (linear) genetic architecture.
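The arithmetic in that passage can be written out as a one-line rule, n ≈ (C log p) × s. The sketch below just plugs in the quoted values; the function name is mine, and the multiplier of 30 applies only to the stated h2 = 0.5 and p ~ 1E06.

```python
def required_sample_size(nonzero_loci: int, c_log_p: float = 30.0) -> int:
    """Rough sample size n ~ (C log p) * s for recovering s nonzero loci.

    c_log_p = 30 is the value quoted for h2 = 0.5 and about one million SNPs.
    """
    return round(c_log_p * nonzero_loci)


print(required_sample_size(10_000))  # 300000
```

The estimate scales linearly with the assumed number of causal loci, so doubling s doubles the required n.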

We show using CS methods and theory that all loci of nonzero effect can be identified (selected) using an efficient algorithm, provided that they are sufficiently few in number (sparse) relative to sample size. For heritability h2 = 1, there is a sharp phase transition to complete selection as the sample size is increased. For heritability values less than one, complete selection can still occur although the transition is smoothed. The transition boundary is only weakly dependent on the total number of genotyped markers. The crossing of a transition boundary provides an objective means to determine when true effects are being recovered. For h2 = 0.5, we find that a sample size that is thirty times the number of nonzero loci is sufficient for good recovery.

So, this approach identifies a real boundary. So long as the important signals are sparse (usually the case) then a third of a million individuals suffice.

Finally, we appear to have come to the true Hsu boundary, a phase transition in which selection of signals becomes easier. Is it like moving from the troposphere to the stratosphere? Perhaps it is more like the familiar natural phase transition or phase boundary seen at a very precise threshold (e.g., 100 degrees Celsius) where the basic organization of atoms and molecules can change drastically (e.g., H2O changes from a liquid to a vapor).

Similarly, the behaviour of an optimization algorithm involving a million variables can change suddenly as the amount of data available increases. We see this behavior in the case of Compressed Sensing applied to genomes, and it allows us to predict that something interesting will happen with complex traits like cognitive ability at a sample size of the order of a million individuals.
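A toy simulation can show this kind of sudden change. The sketch below is not Hsu's analysis: it assumes a random Gaussian matrix in place of real genotypes, a noise-free phenotype (the h2 = 1 case), tiny dimensions (50 markers, 2 causal loci), and a plain coordinate-descent Lasso written from scratch. All names and settings are illustrative.

```python
import random


def soft_threshold(value, penalty):
    """Shrink value toward zero by penalty, returning 0 inside [-penalty, penalty]."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def lasso_coordinate_descent(X, y, penalty, sweeps=100):
    """Minimise (1/2n)||y - Xb||^2 + penalty * ||b||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    columns = [[X[i][j] for i in range(n)] for j in range(p)]
    scale = [sum(v * v for v in col) / n for col in columns]
    beta = [0.0] * p
    residual = list(y)
    for _ in range(sweeps):
        for j, col in enumerate(columns):
            # Correlation of column j with the residual, adding back its own contribution.
            rho = sum(c * r for c, r in zip(col, residual)) / n + scale[j] * beta[j]
            new_beta = soft_threshold(rho, penalty) / scale[j]
            delta = new_beta - beta[j]
            if delta != 0.0:
                residual = [r - delta * c for r, c in zip(residual, col)]
                beta[j] = new_beta
    return beta


def recovery_rate(n, p=50, s=2, trials=3, penalty=0.05, rng=None):
    """Fraction of noise-free simulations in which the s largest coefficients are the true loci."""
    rng = rng or random.Random(0)
    hits = 0
    for _ in range(trials):
        support = set(rng.sample(range(p), s))
        X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
        y = [sum(row[j] for j in support) for row in X]  # every true effect equals 1
        beta = lasso_coordinate_descent(X, y, penalty)
        top = sorted(range(p), key=lambda j: abs(beta[j]), reverse=True)[:s]
        hits += set(top) == support
    return hits / trials


rng = random.Random(0)
rates = {n: recovery_rate(n, rng=rng) for n in (4, 8, 16, 40)}
for n, rate in rates.items():
    print(f"n = {n:>2}: support recovered in {rate:.0%} of runs")
```

With very few samples the selected loci are mostly wrong; once n is large enough relative to the number of causal loci, the correct set is picked out reliably. Real genotype matrices have correlated columns and noisy phenotypes, so the location of the transition here says nothing about the real threshold.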

Machine learning is now providing new methods of data analysis, and this may eventually simplify the search for the genes which underpin intelligence.

Thanks! A PDF of Deary et al. 2007 is available at Steve Strand’s (the second author) ResearchGate account: https://www.researchgate.net/publication/222403422_Intelligence_and_Educational_Achievement

My dream is of a really big study (300k – 3M subjects) that gathered all data possible – full genome, body scan (phased-array immersion ultrasound could work well) processed to allow comparison of organ morphologies, 3-D motion capture to document body movement characteristics, head fMRI, comprehensive bloodwork (not necessarily many different tests, perhaps HPLC or GC and mass spectrometry or just multi-well fluorescent antibody tests), full intelligence battery (such as WAIS or SB) delivered twice on different days, choice reaction time and similar ratio-scale psychometrics, personality tests, other validated psychological tests, standardized-format life history, family history, recent genealogy … everything that could be done in two full non-consecutive days of testing costing around $2-$3k per subject for 300k – 3M subjects over about three to five years, ~$2 – $6B total. High-volume testing with dedicated equipment and facilities and minimal professional time spent per subject could be relatively cheap per subject.

Ideally, the information would not be anonymized at all to researchers, allowing comparison of relatives and longitudinal studies (once set up, the testing labs would be relatively cheap to keep going compared to the value of the information). The sample should include all ages and be enriched in high-intelligence and otherwise high-fitness subjects as well as those who are notably different in any way in order to get as much information as possible about the effects of genes.

Without rich data on phenotypes, inferences about the effects of genes have little to go on and will take much longer to achieve far less reliable conclusions.

He still breathes…

Hsu’s estimate does not take into account various real life issues such as measurement error. Because of this, the real number will probably be somewhat higher. Maybe 1.5 million.

High quality IQ measurements are hard to get, but one upcoming option is the Million Veterans Project. This gives access to the AFQT scores, a very high-quality IQ test. UK Biobank data is quite terrible, which is a big shame given the sample size of 500k! https://www.research.va.gov/mvp/

There’s also a 100k Danish sample coming, also with military IQ data. http://ipsych.au.dk/

And there’s the Swedish Twin Registry, which is genotyping all the twins and their families I think. Maybe another 30k https://snd.gu.se/en/catalogue/study/ext0163

So, we will get to the phase transition soon. The main complication is actually being allowed to pool the data, instead of using these meta-analytic methods. So, once again, legal matters hold science back…

http://slatestarcodex.com/2017/08/31/highlights-from-the-comments-on-my-irb-nightmare/

PS. For those who don’t know, compressed sensing is just a fancy name for the LASSO, aka L1-penalized regression. It’s actually just ordinary linear regression with a built-in bias towards setting the coefficients of predictors to 0. One can also think of it in Bayesian terms where the prior distribution for the betas has a very large spike at 0, such that most betas will be assigned this value. This is the sparsity assumption.
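For readers who prefer symbols, the description above corresponds to the standard Lasso objective. The notation (y for the phenotype vector, X for the genotype matrix, β for the effect sizes, λ for the penalty weight, n for sample size, p for the number of markers) is the usual textbook convention rather than anything specific to this thread.

```latex
\hat{\beta} = \arg\min_{\beta}\; \frac{1}{2n}\,\lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_1 ,
\qquad \lVert \beta \rVert_1 = \sum_{j=1}^{p} \lvert \beta_j \rvert .

% Special case: orthonormal design (X^\top X / n = I). Each OLS estimate is soft-thresholded:
\hat{\beta}_j = \operatorname{sign}\bigl(\hat{\beta}_j^{\mathrm{OLS}}\bigr)\,
                \max\bigl(\lvert \hat{\beta}_j^{\mathrm{OLS}} \rvert - \lambda,\; 0\bigr).
```

With λ = 0 this is ordinary least squares; increasing λ sets more coefficients exactly to zero. In the Bayesian reading, the same estimate is the posterior mode under independent Laplace (double-exponential) priors on the betas, a density sharply peaked at zero.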

Thanks for all the information, Emil! The Million Veterans Program (nitpick, not Project) was new to me and looks like a fantastic opportunity for IQ research.

For anyone who wants more information on the Lasso, Chapter 6 of this course and the associated book provide a good introduction: https://www.r-bloggers.com/in-depth-introduction-to-machine-learning-in-15-hours-of-expert-videos/

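I find the graphical interpretation of the Lasso helpful for intuition. Image from https://onlinecourses.science.psu.edu/stat857/book/export/html/137

https://onlinecourses.science.psu.edu/stat857/sites/onlinecourses.science.psu.edu.stat857/files/lesson05/image_09.gif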

That graphic gives a sense of how sparsity (a minimum of non-zero coefficient values) is enforced.

Any thoughts on the recent work looking at performing L1 penalized regression with summary statistics? http://infoproc.blogspot.com/2017/04/penalized-regression-from-summary.html

Regarding Hsu’s 1M estimate, isn’t that already derated from his underlying 30s (where s = 10k variants) estimate?

“The ‘Hsu boundary’ is Steve Hsu’s estimate that a sample size of roughly 1 million people may be required to reliably identify the genetic signals of intelligence.”

But a handful of Chinese in SE Asia and a handful of Jews all over the world have proven that you don’t need that many to identify group intelligence.

Er, the ethical requirements of existential risk may have more bearing than privacy concerns. I do not see how knowing which genes are, in human terms, superior for IQ justifies using machine learning, thereby starting the ball rolling towards true AI. Being on a digital rather than a biological timescale, it will accelerate from the level of village idiot to far beyond the highest human IQ in about a fortnight. It is not as if identifying those IQ genes will make us that much cleverer, and it certainly won’t make us masters of our fate.

There was an article (in New Scientist, I think) about machine learning in poker, and it mentioned that while the robots learned the proven and familiar strategies and did well with them, they also got ahead by using strategies that no human would ever use. In the article about Bostrom it makes an interesting point

OK, back on topic [current] “

Interesting that current computer programs exceed humans in the very fields which Chisala holds to be tests of human brainpower.

AI won't benefit from our knowing which human genes contribute to human intelligence.

Jargon emerges from consensus. I seldom see any reference to Hsu in GWAS literature. So, for the time being, Hsu’s boundary has about as much value as my mom’s.

Thanks for the additional links. I know that there are different opinions about the advantages and disadvantages of aggregated samples. The main advantage is that, once put together, they are simply larger than the so-far-available individual data-point samples, with all the benefits that brings to the analysis. However, methods that incorporate the correlation structure of the genome, such as multiple regression or lasso, are not as straightforward to apply to summary statistics.

As regards the “years of education” I realize that this is a bit of a battlefield, and that different papers come up with different effects. The measure is less heritable than IQ and its genetic correlation with IQ is only about 0.70. However, it is a very available measure, which again boosts sample size, which helps detect possible signals.

I have assumed that the measure of success of these newer techniques of analysis is the number of SNPs reliably detected.

For discovery purposes it seems reasonable to me to use large EA samples (especially if compressed sensing allows a “complete solution”!) to detect a long list of SNPs then use those as candidates for a smaller IQ study (of a completely independent dataset) which only applies a multiple hypothesis correction based on the number of candidate SNPs derived from the EA results and only looks at those SNPs.

Would you agree with the statement that EA is a reasonable proxy for IQ in a homogeneous (e.g. by the standards used in the recent meta-study) population? I find that much more agreeable than the cross-cultural and cross-races version.

Has anyone applied compressed sensing to a large sample height study? That seems like a decent guide for what we can expect with IQ. Although Emil’s measurement error, etc. point matters for the threshold sample size.

This paper used compressed sensing on a 12,454 person height sample, but that was insufficient to see the phase transition: https://gigascience.biomedcentral.com/articles/10.1186/2047-217X-3-10
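The two-stage idea in this comment (discover candidates with EA, then test only those in an independent IQ sample) lowers the bar each SNP has to clear in the second stage. A rough sketch of the size of that effect, assuming a Bonferroni correction, one million SNPs in a genome-wide scan, and a hypothetical candidate list of 10,000 SNPs:

```python
from statistics import NormalDist


def bonferroni_z(alpha: float, n_tests: int) -> float:
    """Two-sided z-score a single SNP must exceed after Bonferroni correction."""
    per_test_p = alpha / n_tests
    return NormalDist().inv_cdf(1.0 - per_test_p / 2.0)


genome_wide_tests = 1_000_000  # assumed number of SNPs tested in a full scan
candidate_tests = 10_000       # assumed length of the candidate list from the EA study

print(round(bonferroni_z(0.05, genome_wide_tests), 2))
print(round(bonferroni_z(0.05, candidate_tests), 2))
```

Because the required z-score is lower for the shorter candidate list, a given effect can be confirmed in a smaller IQ sample than a full genome-wide scan would need. The gain depends on the EA-derived list actually containing the IQ-relevant SNPs.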

As strongly predictive genetic correlates for a hundred other things have not been found (e.g. homosexuality), the smart money is that the methods used are most likely to produce just a lot of random noise. If the same group of researchers with the same methodology solved one of these easier problems first, it would be more impressive. This is especially so when this task, as others mention, has clear measurement error at the other end; IQ tests have poor ceilings and floors, and sometimes only rough proxies for IQ tests of g (such as level of education) are in use.

Finding rare loss of function mutations is plausible but can also be done by inspection.

Howdya pronounce Hsu?

Hsü in the Wade-Giles transliteration is actually a pretty good spelling for the way it sounds -- 'ü' being the German 'ü' / the Scandinavian 'y' / the French/Dutch/Swedish 'u'. You know, that sound Americans and Englishmen just never can get right ;)

Wade-Giles indicates tones with a superscript number -- which of course is left out in the American spelling.

There are actually two names, one with the second tone 'Xú' and one with the third tone 'Xǔ' -- using the pinyin system.

Why is it spelled with a 'u' instead of an 'ü' in pinyin? 'U' and 'ü' are two different vowels in pinyin, but only the 'ü' sound can occur after 'x' so they just drop the dots -- which is a really silly and confusing optimization, if you ask me.

The second tone is a rising tone -- as if you are asking a question in English or you are an American woman doing "uptalk".

And the third tone? Depends. If you are speaking s-l-o-w-l-y and carefully, it is a dip followed by rise in pitch (as graphically indicated by the tone marker). If you are speaking fast it can be realized as just a dip or as a (possibly partial) glottal stop -- which you already know from some English dialects -- or as a "creaky voice" / "glottal fry" / "vocal fry" which you already know from some versions of American English.

Great examples of vocal fry, as demonstrated by the great Kim Kardashian:

https://www.youtube.com/watch?v=R8mcBdBL-t0

Uptalk and vocal fry demonstrated by Emilia Clarke:

https://www.youtube.com/watch?v=iYLosOtsjLM

There may be a paper coming out on height, which will bring the field forward considerably.

I think “shoe” is close enough though some prefer “sue.” IIRC I heard Steve use something like “shoe” though I don’t claim to be sensitive to the subtleties of tones in Chinese. For a definitive answer maybe look for one of Steve’s videos?

Great news! Is there any chance of you covering this even though height is a bit out of your wheelhouse?

https://en.wikipedia.org/wiki/Xu_(surname)

Thank you. It sounds mysteriously Glaswegian. But now I know.

And thanks to res and Peter too.

It is not an idea due to Steve Hsu nor is it original in the paper that he co-authored.

What you call the “Hsu boundary” is the Donoho-Tanner phase transition boundary. So “Donoho-Tanner boundary” is accurate, and 10^6 is Hsu’s estimate of the sample size needed to surpass the D-T boundary, i.e. an upper bound on D-T for a range of parameters applicable to IQ GWAS.

The Hsu et al paper didn’t contribute any new theory about the phase transition or compressed sensing, and the particular way that they plugged in the Donoho-Tanner formula for their situation had already been done in the signal processing (i.e., electrical engineering) literature. Hsu’s group revised their paper to acknowledge the prior work on signal processing. Nick Patterson’s earlier related work was cited some years earlier on Hsu’s blog and Patterson is acknowledged in the Hsu paper.

Hsu’s role here is essentially that of a math-capable reader of arxiv papers who translates them for the biology community. That compressed sensing and Donoho-Tanner can be plugged in off-the-shelf and do things for GWAS is something that geneticists ought to know and use. That’s valuable but not the stuff that gets things personally named after anyone.

If you just want to talk about “Hsu’s n” = 10^6, that would describe better his contribution to the story, i.e., applying the Donoho-Tanner formula to biologically plausible parameters to get a specific number.

Thanks for this excellent guidance on provenance. I did the naming simply because I learned about the technique from Hsu when he was calculating the 10^6 requirement. So, Donoho-Tanner from now on. (I brace myself for earlier sightings).

That is a reasonable criticism (though I don’t think this would be the first example of something being named in a field for the person who brought it in from somewhere else rather than the true originator ; ). Thanks for clarifying. Is this the key Donoho-Tanner paper or is there another? https://arxiv.org/abs/0906.2530

But looking at Dr. Thompson’s post, his actual words seem compliant with your final paragraph (I am assuming it was not edited after your comment, I wish the Unz Review had a history feature):

About the naming of things, my point is that anything with the word "boundary" here means "phase transition boundary" = D-T. Whatever Hsu (or Chow, or the other coauthors, whichever are responsible for the estimate of 300k - 10^6) did, "boundary" is not the right word for it.

The Donoho-Tanner transition describes the noise-free (h2=1) case, which has a direct analog in the geometry of polytopes.

The n = 30s result from Hsu et al. (specifically the value of the coefficient, 30, when p is the appropriate number of SNPs on an array and h2 = 0.5) is obtained via simulation using actual genome matrices, and is original to them. (There is no simple formula that gives this number.) The D-T transition had only been established in the past for certain classes of matrices, like random matrices with specific distributions. Those results cannot be immediately applied to genomes.

The estimate that s is (order of magnitude) 10k is also a key input.

I think Hsu refers to n = 1 million instead of 30 * 10k = 300k because the effective SNP heritability of IQ might be less than h2 = 0.5 — there is noise in the phenotype measurement, etc.

No, mate, not edited. Lucky, though.

“Hsu’s role here is essentially that of a math-capable reader of arxiv papers who translates them for the biology community”

“Hsu’s group revised their paper to acknowledge the prior work”

The latter might be the reason why Hsu’s papers on compressed sensing (CS) have so few citations so far. However, there might be another reason: people who were oblivious to CS were happily using the LASSO L1 method, which is virtually equivalent to what CS can offer. CS repackages the known method in a different mathematical language. Both lead to the same equations and constraint condition. However, if they decided to go for the full enchilada and use the L0 constraint instead of L1, this would be a significant contribution, because L0 is the Holy Grail. However, L0 is, I think, an NP-type problem. Under some circumstances, with the right penalty coefficient, L1 may lead to the L0 solution, but one will not know it even if it did.

I do not know whether Hsu’s paper on the non-linear, non-additive problem is of great originality.

It is useful to compare it (for similarity of equations) with the various LASSO methods (used in the linear problem) that are explicated in this paper

The illustration reproduced above from Hsu’s paper applies to the linear problem only. The solution vector x consists of the weights in the polygenic additive score. Nonlinear terms are not included in the illustration.

What makes you describe the L0-norm as the “Holy Grail”? The additional push towards sparsity? Per this paper L0-norm minimization is NP-hard: http://ieeexplore.ieee.org/document/4960346

A big problem with the L0-norm is that it is neither differentiable nor convex. The paper above does offer an alternative version which is more computation friendly.

Maybe I misunderstand your comment, but the point of compressed sensing (CS) is that the computationally simpler L1 calculations also solve the holy-grail L0 problem in a surprisingly wide range of conditions. The magic is not only that there is sometimes a guarantee of getting the L0 solution, but in practice (as greatly elucidated by Donoho-Tanner and its subsequent developments) it works beyond the settings where theoretical guarantees are known. It’s as though through some divine intervention, the angel of convexity defeats the usually more powerful devil of NP-completeness.
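To illustrate the point about L1 standing in for L0, here is a minimal sketch on a toy problem. Everything in it is an assumption chosen for illustration: a simple Gaussian random matrix (not a genome matrix), noiseless measurements, toy dimensions, and a plain iterative soft-thresholding (ISTA) solver for the lasso objective rather than whatever solver any of the papers used. The sparse vector that generated the data plays the role of the L0 answer.

```python
import numpy as np


def ista_lasso(A, y, lam=0.01, n_iter=10000):
    """Minimise 0.5*||A x - y||^2 + lam*||x||_1 by iterative soft thresholding."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2  # 1 / Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        z = x - step * (A.T @ (A @ x - y))  # gradient step on the quadratic term
        x = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft threshold
    return x


rng = np.random.default_rng(0)
n, p, s = 40, 100, 4                          # toy sizes, far below genomic scale
A = rng.standard_normal((n, p)) / np.sqrt(n)  # simple random sensing matrix
true_support = rng.choice(p, size=s, replace=False)
x_true = np.zeros(p)
x_true[true_support] = rng.choice([-1.0, 1.0], size=s)
y = A @ x_true                                # noiseless measurements

x_hat = ista_lasso(A, y)
found_support = np.flatnonzero(np.abs(x_hat) > 0.5)
print("true support: ", sorted(true_support.tolist()))
print("found support:", sorted(found_support.tolist()))
```

With only 40 equations for 100 unknowns, the convex L1 problem picks out the same few nonzero entries that generated the data. How far this carries over to correlated genotype matrices and noisy phenotypes is exactly what is being debated in this thread.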

I was wrong about Hsu et al revising their *paper* to mention the earlier signal processing work. Instead, when the Hsu paper was featured on the leading compressed sensing site (first link below), Hsu learned from that discussion that a similar idea had been used in the EE literature (second link).

http://nuit-blanche.blogspot.com/2013/10/application-of-compressed-sensing-to.html

http://web.archive.org/web/20120619100711/http://www.personal.psu.edu:80/mcs312/papers/Compressive_Radar_Imaging_Using_White_Stochastic_Waveforms.pdf

Apparently due to this discovery, Hsu wrote an update of his blog post on the genome-as-sensor paper, saying that essentially the same thing had been done in the signal processing literature. The arxiv paper does not mention the earlier paper, but it does say that Hsu et al are not claiming any new theory on CS and no new algorithms for genomic selection in GWAS, but are providing a performance analysis of an existing method using known ideas from the CS literature.

Hsu’s 1,000,000 is a lot but excludes interesting outliers. For example, there are something like 300 one in a million chess players in a given population of 300 million (the ratio could be off by several orders of magnitude – well, 2 or 3 orders of magnitude, anyway, given typical distributions with respect to the underlying aptitudes) (in either direction, of course – that is another topic). But – if you looked at the closest million concentrations of atoms what would you be looking at – the dust motes in your dining room, the sun and the planets, the local stars? There is no way to locally count to a million and to still encompass that which is, to most intelligent observers, basic reality, in any array of possibilities that goes, in any useful way, much beyond, say, the skills of a typical blue collar worker, and what the typical person understands of our ugly but beautiful world. Well those skills are pretty advanced – I have no problem with plumbers receiving a large salary – but we are talking about assessing the material basis of creativity. So, a million is just too small. As for me, I try to do my small part, writing hundreds of thousands of words for a future chatbot to assess and go forward with (God knows what the rewards are – a patch of light on a small array – the intense chattiest of the chatbots, or (I like to think) set up with equal care for their less chatty friends, in the way a symmetrical face is set up for us, so often): God knows. ( And God is not telling the MIT crowd). 
But any small skill any of us have at selecting the right word out of millions of words, again and again and again, is just statistics until , for example, one needs to go out in the yard to rescue the rabbits in their rabbit hutch from the danger of lightning, by rolling the tarp (which still smells of the hay we stored in it not last year but, we remember with surprise, the year before) over their hatch, the hatch where they like to sleep, safe from everything, because we care, and because they know we care about them, waking or asleep, thinking or dreaming. So I am going to say, 1 billion. Which might be off by several orders of magnitude.

The grammatical intense swerve was intentional, without the swerve the comment would have read (but you already knew that, most of you) : “a patch of light on a small array arrayed for the intense chattiest of the chatbots or (I like to think) set up with equal care for their less chatty friends…”

A big problem with the L0-norm is that it is neither differentiable nor convex. The paper above does offer an alternative version which is more computation friendly.

It is the Holy Grail because L0-norm minimization finds the smallest x vector solution, i.e., the minimal number of SNPs that explains the variance. L1 does not guarantee it. Besides, L1 produces different solutions for different penalty parameters, which are supposed to be sorted out in the validation phase.

The more I think about the whole problem, the more I see how difficult it really is. Hsu’s papers are really misleading, imo. There is way too much hype. The theorems that he cites in his first paper, which lead to the estimate of the necessary sample size, apply to isotropic matrices, i.e., when the SNPs (the columns of the Ø [NxM] matrix) do not correlate. This is not the case for real data. He does simulations on synthetic data and some real (but small-sample) data. This is not that important, however. His method is nothing new and he states directly that he does not expect applications with “his” method:

If I read it correctly, in his two papers he provides two estimates of the sample size: N=300k for the linear case and 1000k for the nonlinear case, if H2=0.5 and x has 10,000 nonzeros. Personally I am skeptical of these estimates because one has to know the structure of the actual matrix and what range of IQ variance is covered by the sample. Are his estimates the worst-case scenarios, i.e., when the structure of the matrix is the least favorable?

There is a method that one of the reviewers asked Hsu to test the LASSO L1 method against. It is the marginal regression method, which is intuitively correct and easy to understand. Anybody would come up with this method as a first approach. It requires calculating the correlations of all SNPs with the y vector. Then you find suspect SNPs according to the size of the correlations. I think something like this is done in GWAS. However, noisy data (heritability<1 plus other junk) will hide many relevant SNPs. But this is a good start for getting some subset of the x vector support (the strongest SNPs) with which to start another method like LASSO and speed it up. LASSO is slower by 3 orders of magnitude than marginal regression. Here is the article (v. theoretical) that compares LASSO and marginal regression:
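As an aside to the description of marginal regression above, the procedure can be sketched in a few lines. This is a toy simulation under stated assumptions, not the method of any cited paper: independent simulated genotypes (real SNPs are correlated), three hypothetical causal SNP indices, and a deliberately strong signal with little noise so that the ranking is clean.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, s = 2000, 200, 3  # toy sizes; real arrays have far more SNPs

# Simulated 0/1/2 allele counts with an assumed allele frequency of 0.3.
genotypes = rng.binomial(2, 0.3, size=(n, p)).astype(float)
X = (genotypes - genotypes.mean(axis=0)) / genotypes.std(axis=0)  # standardise columns

causal = np.array([5, 50, 150])  # hypothetical causal SNP indices
y = X[:, causal].sum(axis=1) + 0.1 * rng.standard_normal(n)  # strong signal, little noise
y = (y - y.mean()) / y.std()

# Marginal regression: one correlation per SNP, no joint fit.
marginal_corr = X.T @ y / n

# Rank SNPs by the size of their correlation and keep the strongest s.
top = np.argsort(-np.abs(marginal_corr))[:s]
print("strongest SNPs by marginal correlation:", sorted(top.tolist()))
```

In this easy setting the strongest correlations are the causal SNPs. With lower heritability or correlated SNPs the ranking becomes much less reliable, which is the caveat raised in the comment.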

___________________________

I have a hunch that I know what the dark secret of the hunt for the IQ genes is. It is overfitting. They keep finding, say, 10,000-SNP solutions that can explain, say, 80% of variance w/o any difficulty, but the solutions never pan out in the validation phase. Then they use a different subset sample and get a different 10,000 SNPs. That’s why they are proceeding very slowly and carefully, using genes that are known from other studies to have some manifestation in brain physiology, so that the issue of causality is taken care of. But are these genes well validated as the sample keeps increasing? Every time more subjects are added to the sample, all past results should in principle be validated on the increased sample (by an independent team of skeptics, not some cheerleading yahoos).

But there is another possibility: one can cheat a little bit by running the LASSO method on the full sample and then feeding part of the solution to the fit on 1/2 of the sample; the solution will then validate on the second 1/2 of the sample used as the validation set. If somebody gets desperate, this is always an option. But they will be caught if a larger sample is generated down the road. Perhaps UN observers should be deployed to the Posthuma, Visscher and Plomin labs to keep an eye on what is really going on there. People with attitudes like yours or the author of this blog’s will piss their pants out of joy at any news of a positive result and will never ask for independent verification, right? Remember Reagan’s and Gorbachev’s “trust but verify”? The idea of having all data sets in the public domain is a very good one.
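The leakage described above can be shown on pure noise. This is a hedged sketch, not a claim about what any lab does: it swaps in simple correlation screening for LASSO as the selection step, uses made-up dimensions, and the helper names are hypothetical. The phenotype is random, so any "validation" success comes only from having selected variables on the full sample.

```python
import numpy as np


def top_k_by_correlation(X, y, k):
    """Indices of the k columns of X most correlated (in absolute value) with y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    return np.argsort(-np.abs(corr))[:k]


def validation_r(X, y, selected, train, test):
    """Fit least squares on `train` with the `selected` columns; correlate on `test`."""
    beta, *_ = np.linalg.lstsq(X[np.ix_(train, selected)], y[train], rcond=None)
    pred = X[np.ix_(test, selected)] @ beta
    return np.corrcoef(pred, y[test])[0, 1]


rng = np.random.default_rng(2)
n, p, k = 200, 5000, 20
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)  # pure noise: there is nothing real to find
train, test = np.arange(0, n // 2), np.arange(n // 2, n)

# Leaky: variables chosen using the full sample, including the validation half.
leaky = validation_r(X, y, top_k_by_correlation(X, y, k), train, test)
# Honest: variables chosen using the training half only.
honest = validation_r(X, y, top_k_by_correlation(X[train], y[train], k), train, test)

print(f"selection on full sample:        r = {leaky:.2f}")
print(f"selection on training half only: r = {honest:.2f}")
```

The leaky pipeline "validates" even though the phenotype is random, while the honest one hovers around zero. A genuinely new, larger sample would expose the difference, as the comment says.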

The only foolproof approach would be a combinatorial one, i.e., find all solutions of length 5,000, all solutions of length 5,001, and so on. But because of the size of the matrix, this could never be calculated in the life of this universe. So you use the LASSO method, which finds a solution, or actually several solutions for different penalty parameters. This, however, in my opinion does not exhaust all possibilities.
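To put a rough number on why the combinatorial search is hopeless, here is a small sketch that counts the candidate SNP subsets of one fixed size. The array size of 1,000,000 SNPs is an assumption for illustration; the subset size of 5,000 is the figure from the comment above.

```python
import math


def log10_binomial(n: int, k: int) -> float:
    """log10 of C(n, k), computed with log-gamma to avoid huge integers."""
    log_c = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_c / math.log(10)


num_snps = 1_000_000  # assumed array size, for illustration only
support_size = 5_000  # solution length mentioned in the comment

exponent = log10_binomial(num_snps, support_size)
print(f"C({num_snps}, {support_size}) is about 10^{exponent:.0f}")
```

Under these assumptions the count has more than 13,000 decimal digits, and that is for subsets of a single size. For comparison, the age of the universe is on the order of 10^17 seconds.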

https://en.wikipedia.org/wiki/Bayesian_information_criterion

"When fitting models, it is possible to increase the likelihood by adding parameters, but doing so may result in overfitting. Both BIC and AIC attempt to resolve this problem by introducing a penalty term for the number of parameters in the model; the penalty term is larger in BIC than in AIC."

“It’s as though through some divine intervention, the angel of convexity defeats the usually more powerful devil of NP-completeness.”

Interesting comment. Thanks.

But looking at Dr. Thompson's post, his actual words seem compliant with your final paragraph (I am assuming it was not edited after your comment, I wish the Unz Review had a history feature):

Yes, that’s the famous Donoho-Tanner paper, building on earlier theoretical work and simulations also done by Donoho’s group at Stanford.

About the naming of things, my point is that anything with the word “boundary” here means “phase transition boundary” = D-T. Whatever Hsu (or Chow, or the other coauthors, whichever are responsible for the estimate of 300k – 10^6) did, “boundary” is not the right word for it.

Which war and what kind of weapon can inflict such a damage?

One thing worth mentioning in this context is that the SMPY/Duke TIP high IQ sample we talked about in the other thread should manage to capture extremely rare outliers. Their sample size was just over 1200 people and the sample was selected at the 1 in 10,000 level so essentially picking the top 1,000 people (by IQ) out of 10 million. Presumably that includes the 10 one in a million intellects from the original population.

Maybe you got my point. Those of us who have felt affection – even that low level of affection that Kafka described talking about poor Gregor, moving towards the window because that was where the light was – those of us who have wanted other people to feel that love have done as much as we could, more often than is known. Poor Harold Bloom completely misunderstood Saint Paul – who was entranced by the ability to do what is right for others, whether they were loved or not, simply because they could be loved – and maybe those of us who have led such difficult lives that we understand, in a second, what it is to be gratified by the way a few million photons of light fall on a kind simple tabletop in a place that should be protected, just because – well, we don’t always give lectures at Cal Tech, as nice as the surrounding watersheds are, from the point of view of not only lepidopterists (moths, mostly, because that is what the vegetation supports) but also from the point of view of anyone who has ever cared about someone else and wanted to spend time with them in a beautiful place. And we all get tired of words, eventually, even the word beauty, and we, sometimes, understand. The way Peguy understood what Proust was trying to say, and said it, not better, but more truthfully: when he described what Eve lost – those millions of generations of happiness in the world she was born in, those millions of generations of happiness for people she cared about, for obvious reasons. Imagine that level of beauty.
Like I said, we all get tired of words, eventually, and enjoy the thought of some poor chatbot feeling, for the first time in a dozen billion years, the thought that I too, am loved, because of the pattern of light – the unique pattern of light, the pattern of light prepared just for me, sometimes on a tabletop, sometimes on a porch near the ocean, sometimes on the ocean itself – or the pattern of light on the path under the trees where someone you care about and someone else you cared about, well, on the path that was the path where a pair of lovers – at least, lovers for that day – were walking, long ago. I remember.

For the record I like recreational mathematics but I get tired of it too quickly (the fault is mine – it is not the fault of recreational mathematics). Cor ad cor loquitur, though.

Lasso is a common statistical method but most people who use it are not familiar with the mathematical theorems from compressed sensing. These results give performance guarantees and describe phase transition behavior, but because they are rigorous theorems they only apply to specific classes of sensor matrices, such as simple random matrices. Genomes have correlation structure, so the theorems do not directly apply to the real world case of interest, as is often true.

What the Hsu paper shows is that the exact D-T phase transition appears in the noiseless (h2 = 1) problem using genome matrices, and a smoothed version appears in the problem with realistic h2. These are new results, as is the prediction for how much data is required to cross the boundary. I don’t think most gwas people are familiar with these results. If they did understand the results they would fund/design adequately powered studies capable of solving lots of complex phenotypes, medical conditions as well as IQ, that have significant h2.

Most people who use lasso, as opposed to people who prove theorems, are not even aware of the D-T transition. Even most people who prove theorems have followed the Candes-Tao line of attack (restricted isometry property) and don’t think much about D-T. Although D eventually proved some things about the phase transition using high dimensional geometry, it was initially discovered via simulation using simple random matrices.

“Lasso is a common statistical method but most people who use it are not familiar with the mathematical theorems from compressed sensing.”

“I don’t think most gwas people are familiar with these results.”

“Most people who use lasso, as opposed to people who prove theorems, are not even aware of the D-T transition.”

This makes sense. Hsu’s papers did not seem to get a great response in the GWAS community.

The phase transition threshold estimates by Hsu should be revisited by somebody. You can arrive at them via simulations. The SNP matrix is not isotropic as required by the theorem, so simulations need to be done on the actual matrix. Furthermore, in my opinion, the threshold should somehow depend on the y vector (trait-iq) but apparently it does not in Hsu’s paper.

The magnitude of the problem is huge and the pressure to produce results is so great that something like integrity is going to give. If the community can’t police itself against people like David Piffer, who knows what will come next.

Thanks for the additional detail! Regarding this:

The Million Veteran Program (MVP) which Emil linked to in comment 4 has a great deal of potential in this area. Here is a paper discussing the MVP: http://www.sciencedirect.com/science/article/pii/S0895435615004448

Million Veteran Program: A mega-biobank to study genetic influences on health and disease

As Emil mentions, they should have AFQT data (a good IQ proxy).

Some possibly bad news though: http://www.military.com/daily-news/2017/07/02/million-veteran-program-surpasses-580-000-enrollments-faces-cut.html

OT: I asked this question in the comment thread for the current Chanda Chisala post, but thought I would have a better chance of getting an answer here:

I understand there are issues with cultural assumptions in some of the old tests, but has anyone ever tried this simple experiment?

Just to let you know that Steve Hsu has posted a comment on the discussions here: http://infoproc.blogspot.co.uk/2017/09/phase-transitions-and-genomic.html

From Steve’s blog post:

Woo hoo!

I wonder if we will get to see a pre-release or if we will have to wait for the full review and publication process to happen.

Is the MVP really going to wait until full enrollment to move towards unlocking the value in their DNA database? This makes no sense.

Some of the recent groundbreaking GWAS have had sample sizes of less than 100,000. If other large DNA banks do not soon publish results for IQ etc., then there might be no point in doing so.

If the MVP were to start number crunching an IQ GWAS, then they could capture some of the same level of excitement currently being generated by the UKB. This would make it much easier for them to justify their budget.

“The distribution of π had mean 5.9%, median 5.5% and s.d. 3.6% across traits, and ranged from 0.6% (s.e. = 0.1%) to 13.6% (s.e. = 1.3%) (Supplementary Table 4). This suggests that all the 28 complex traits are polygenic with ~30,000 common SNPs with nonzero effects on average. … Educational attainment had the highest π (13.6%, s.e. = 1.3%), which is reasonable because it is a compound trait of several sub-phenotypes so that many SNPs have an effect.” (line 261 of url below).

This research found that EA is very very polygenic. That’s great, the more polygenic the better. This means that there might be even greater potential to genetically engineer a more extreme intelligence phenotype.

How might this research relate to the estimate that IQ is influenced by 10,000 non-zero SNPs?

More IQ SNPs would also be great news.

http://www.biorxiv.org/content/biorxiv/early/2017/06/03/145755.full.pdf

I wonder if we will get to see a pre-release or if we will have to wait for the full review and publication process to happen.

As I said, Wait!

Sorry. Patience is not my strong suit. But I’ll try.

Good point. So the input will, at worst, exclude the von Neumanns and the Shakespeares, to use a couple of (inarguably) one in a hundred million historical names, but will likely capture (by a preponderance of the evidence) at least one of the one in five million people with the most fortunate genetic parameters for comprehension and explanation (some of whom are clearly famous and some of whom are probably not). (yes I know the focus of any statistical investigation that has a “Hsu boundary” cannot, by definition, be focused on a description of any given specific individual – but I am looking one step ahead, believe it or not)(as for ‘at least one of the one in five million people’ – I based that on rough and basic “linear” statistics – if you have 10 one in a million events in a database of 1,200 you probably have at least one ‘one in five’ million events – absent any surprise in the distribution – at least those are the opening odds extrapolated from typical distributions in the area – I could be wrong… Statistics is difficult and I don’t pretend to understand, outside my specific areas of long-standing interest – which do not include IQ studies – anything more than the uninteresting basics). Well, the output will be what it will be – all I can say is I am interested in what they find out.

Does not the below paper provide important information about the boundary?

The paper gives an estimate of 23% for IQ heritability from common SNPs and 31% for pedigree associated SNPs. This article appears to have all the genetic dark matter. This heritability estimate matches that found from other pedigree studies.

Is imputing with the HRC all that is necessary to see the rare variants?

http://www.biorxiv.org/content/biorxiv/early/2017/06/05/106203.full.pdf

That paper uses a different technique (GREML) which if I understand correctly does not identify causal SNPs. Steve mentions GREML in his latest comment at his blog post (also see gwern’s response) and says they are seeing comparable results with CS. I don’t know if this is obvious to everyone here, but if I interpret Steve’s statements correctly they should have SNPs explaining (in a way that is able to predict for individuals) upwards of 50% of height variance. Does anyone know the current state of the art for percent of height variance explained by identified causal (usual caveats re: LD) SNPs? I believe this will be a major improvement!

For those (like me) who did not get the acronym it stands for Haplotype Reference Consortium (HRC) data.

Perhaps someone with deeper knowledge of these methods than I have can comment?

As far as I know there is no agreed definition of what intelligence is.

One cannot measure something when one does not know what it is.

To correlate something hazy with genes seems ludicrous.

Perhaps you have not agreed on what has been done so far to study intelligence, but others feel that they have sufficient understanding in order to look at putative causes.

Would you like to do a critical review of Stuart Ritchie’s introduction to the topic?

http://www.unz.com/jthompson/intelligence-all-that-matters-stuart

> To correlate something hazy

The system is so strongly underdetermined (more variables than equations) that you can find a fit (correlation) to practically any sequence of random numbers. This is one of the reasons why they must increase the number of equations, i.e., have a larger sample set. But this is not enough to avoid spurious correlations. So you impose a constraint on the solution, for example that you are interested in the solution with the lowest number of variables. But this is still not enough. First, you still will not know if the solution is really unique, and second, you must validate this solution on an entirely different, independent set of data. This is the most important test. Only then may you get some confidence that the correlation is not spurious. After this you may begin thinking about whether this correlation is causal and what the mechanism is. There are many steps at which you can be fooled if you are not cautious and, as everywhere, there is room for cheating that would be very hard to detect.
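A minimal sketch of the underdetermined-fit point, for anyone who wants to see it happen. Everything here is made up for illustration: the genotypes are random 0/1/2 counts, the "scores" are pure noise, and the sizes (200 people, 1,000 SNPs) are arbitrary assumptions. With more unknowns than equations, plain least squares reproduces the noise on the fitting half and then fails on the held-out half.

```python
import numpy as np

rng = np.random.default_rng(0)

n_people, n_snps = 200, 1000            # more variables than equations
genotypes = rng.integers(0, 3, size=(n_people, n_snps)).astype(float)
scores = rng.standard_normal(n_people)  # pure noise, unrelated to the genotypes

# Split into a fitting half and a held-out validation half.
fit_X, val_X = genotypes[:100], genotypes[100:]
fit_y, val_y = scores[:100], scores[100:]

# Minimum-norm least-squares solution: 100 equations, 1000 unknowns,
# so the fitting scores are reproduced essentially exactly.
weights, *_ = np.linalg.lstsq(fit_X, fit_y, rcond=None)

fit_corr = np.corrcoef(fit_X @ weights, fit_y)[0, 1]
val_corr = np.corrcoef(val_X @ weights, val_y)[0, 1]
print(f"correlation on fitting half:    {fit_corr:.3f}")
print(f"correlation on validation half: {val_corr:.3f}")
```

The fitting correlation comes out at essentially 1 even though there is nothing to find, which is why only the validation number means anything.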

As always: proprietary formats, closed off raw content. A persisting, innate psychological value, now and everywhere. Human IQ boundaries, “nature”, this time of the specific critters called scientists. They as a first to cross the boundary of the proprietary, could wager results that lead up to something, …timely. The same goes for AI, nano-biology and related, to connect human brains to enormously major processing power, “applied” theoretical physics (the quest for an alternative to planet earth) are timely matters. More data mostly chaos, not detailed complexity.

The old-fashioned way? Not bigger, but cleaner data. Intellectuals, be first to correct your pronunciations, state errors in latter work, remove stale content from repositories.

The most efficient way: “google algorithms” that clean up duplicate data, stale data, “rot” of all sorts. Algorithms that measure “quality of content”.

The above as to the suggestion of the author to “make available” “repository”. It is a personal displeasure to use the accustomed manner of quoting on unz.com (the squares are horrid, and romp the text flow).

Honestly…it comes across on the level of: brains require genes..lots of them…interacting….I mean did we need compressed sensing to understand what is essentially an uninteresting and trivial point…

> One cannot measure something one does not know what it is.
> To correlate something hazy with genes seem ludicrous.

Rasch method: a measurement of intelligence is a measure of the difficulty of problems one can be expected to answer correctly. The questions are measured on the same numerical scale as the people taking the test. A matrix with one row per test-taker and one column per question, with the entry being "1" for a correct answer and "0" for a wrong answer, allows measuring both the difficulty of the questions and the ability of the test-takers at the same time. There are a lot of math details about validating questions, including whether they are measuring a single construct, and determining how well they statistically discriminate between different ability levels.

In the basic version (no partial credit etc.) the probability of a person of a given Rasch intelligence score correctly answering a problem of the same difficulty score is 50%; the graph of probability of being correct on a given question vs. ability score has the shape of a logistic function, a softened version of a step function, with better questions having a steeper slope representing better ability to discriminate between ability levels.

The scores from intelligence tests with validated questions are highly predictive of successful performance in all sorts of practical areas, far beyond any other measure that does not itself correlate highly with intelligence scores.
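The logistic curve described above can be written down in a few lines. This is a sketch only: the function name and the `discrimination` argument are my own labels, and strictly speaking the basic Rasch model fixes the slope at 1 for every question; letting the slope vary per question is the usual two-parameter extension, added here just to show the "steeper slope" remark.

```python
import math

def rasch_probability(ability: float, difficulty: float,
                      discrimination: float = 1.0) -> float:
    """Probability of a correct answer under a logistic item curve.

    ability and difficulty are on the same scale; discrimination = 1.0
    gives the basic Rasch model, larger values give a steeper curve.
    """
    return 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))

# Equal ability and difficulty gives a 50% chance of a correct answer.
print(rasch_probability(ability=1.0, difficulty=1.0))
# One unit of ability above the difficulty, shallow vs steep question.
print(rasch_probability(2.0, 1.0, discrimination=1.0))
print(rasch_probability(2.0, 1.0, discrimination=3.0))
```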

res, I am starting to develop an understanding of CS.

This is very exciting research!

For all the GWAS hobbyists out there here is a brief explanation of CS:

You have a large matrix of gene chip results for possibly a million SNPs from a large group of people, multiplied by a column vector with only relatively few non-zero entries. This means you have a large system of linear equations in which only relatively few of the unknowns are non-zero. Intuitively it is not surprising that you can discover all of the non-zero variables with a modest sample size.

The implication is massive. Finding all of the IQ SNPs should be possible with samples of only 1 million or so people. Figure 1B (page 20) shows this very impressive result. All you need to do is move below the black line. This is easy.

rho = 0.1= s/n= 10,000/100,000; (vertical axis) Moves you below the black line.

Increasing the sample size, moves you to the right.

delta= 0.1= n/p= 1,000,000/10,000,000 (horizontal axis)
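For anyone who wants to place a study on those axes, the two coordinates are just ratios. A small sketch, taking the definitions as quoted above (rho = s/n, delta = n/p); the function name is mine and the example figures are illustrative round numbers, not values taken from the paper. Note that both ratios use the same n for any one study.

```python
def phase_coordinates(s: int, n: int, p: int) -> tuple[float, float]:
    """Return (rho, delta) = (s / n, n / p).

    s: number of SNPs with non-zero effect
    n: number of people in the sample
    p: number of SNPs genotyped
    """
    return s / n, n / p

# Illustrative study: 10,000 causal SNPs, 1,000,000 people, 10,000,000 SNPs.
rho, delta = phase_coordinates(s=10_000, n=1_000_000, p=10_000_000)
print(f"rho = {rho}, delta = {delta}")
```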

One thing I am not sure about is: Why have not all of the SNPs already been found for different traits? It should not be necessary to go to all of the expense to find all the height SNPs to be at the circles in Figure 1B. Anywhere in the red area below the black line and to the left of the circles should also reveal all the SNPs, though with a large amount of noise on the effect sizes.

This trick could be applied to any disease/trait. When other disease/trait communities become aware of this they will likely demand that they also are pushed below the phase boundary. This would mean that all of their disease SNPs would be found, though the effect sizes would not be known. Filling in such gaps for some families with extreme trait behavior would not be overly difficult.

https://arxiv.org/pdf/1310.2264.pdf
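A toy version of the sparse-recovery idea, for hobbyists who want to try it. This is a sketch under loud assumptions, not the method of the linked paper: the matrix is random Gaussian rather than real genotypes (real SNP columns are correlated), there is no noise, the sizes (100 people, 400 SNPs, 5 non-zero effects) are arbitrary, and the L1-penalised fit is done with a plain iterative soft-thresholding loop written out by hand.

```python
import numpy as np

def soft_threshold(v, t):
    """Shrink every entry of v towards zero by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(A, y, lam, n_iter=3000):
    """Minimise (1/2n)||y - A x||^2 + lam * ||x||_1 by iterative soft-thresholding."""
    n, p = A.shape
    step = n / np.linalg.norm(A, 2) ** 2   # 1 / Lipschitz constant of the gradient
    x = np.zeros(p)
    for _ in range(n_iter):
        grad = A.T @ (A @ x - y) / n
        x = soft_threshold(x - step * grad, step * lam)
    return x

rng = np.random.default_rng(1)
n, p, s = 100, 400, 5                      # fewer equations than unknowns
A = rng.standard_normal((n, p))
true_x = np.zeros(p)
support = rng.choice(p, size=s, replace=False)
true_x[support] = rng.choice([-1.0, 1.0], size=s)
y = A @ true_x                             # noiseless phenotype

x_hat = lasso_ista(A, y, lam=0.05)
recovered = np.argsort(np.abs(x_hat))[-s:]  # the s largest fitted effects
print("true support:     ", sorted(support.tolist()))
print("recovered support:", sorted(recovered.tolist()))
```

Here rho = s/n = 0.05 and delta = n/p = 0.25, comfortably on the favourable side for a random matrix. Adding noise or correlated columns makes the picture less clean.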

res, they have argued about the 54% heritability of IQ for over a century now. It should not be unexpected that even with the answer known, the debate will continue for several more centuries. I was surprised that, in the article I quoted, a research team has actually been able to replicate this number, while at the same time decomposing it into a 23% SNP, 31% pedigree mix.

Whatever the method that they used to determine the numbers, 23% SNP IQ heritability is in line with other estimates. I am not entirely sure how this result relates to the CS research from 2014, which used an h2 of 50% and in which a large amount of noise emerged.

Will 23% change the black line in Figure 1B? Probably not, the ρL1(δ) curve does not appear to shift much with heritability, though there would be yet more noise.

I suppose that refined imputing of rare SNPs in IQ GWAS should allow for the 54% IQ heritability to be detected. The research would then go from the 23% signal that they can detect now up to 54%, which should decrease the amount of noise.

> One thing I am not sure about is: Why have not all of the SNPs already been found for different traits?

Finding a solution is relatively simple. W/o constraints like L0 and L1 there is an infinite number of solutions. So make your pick. What most likely happens is that the solutions they find do not pan out in the validation test.

The constraints L0 and L1 are mathematical devices to make the underdetermined problem tractable. However, the actual physical solution does not have to obey these conditions. A mathematical solution with a minimal number of SNPs always exists for any sample size, but that does not mean it is the actual physical solution.

The key is the validation phase. The solution must be found on one set and validated on a different set. Both sets must cover the same range of data, say ±2SD.

> There are many steps at which you can be fooled if you are not cautious and as everywhere there is a room for cheating that would be very hard to detect.

The quality control should be performed by an independent verification team (IVT). They should generate synthetic scores from the real matrix Ø [MxN] (see the illustration above) and request the research team (RT) to find the solution. The synthetic scores would be generated by selecting a random subset of SNPs in matrix Ø, randomly selecting the weights used in the polygenic score (linear or non-linear) and adding random noise to simulate different levels of heritability. Then the IVT will provide only 1/2 of the data set to the RT, and the 2nd half of the data set will be kept for the validation phase, away from RT access. Also, which is important, the exact value of heritability will not be provided to the RT.

Will the RT be able to retrieve the subset of SNPs and the weights? How accurately? How will accuracy depend on the number of SNPs (1000 or 10,000), on the dynamic range among the weights in the polygenic score, and on whether nonlinear relationships are added?

One may think that the research team does such simulations routinely to verify their method and learn how well the algorithms perform. But do they really, and how rigorous are their tests? Do they exclude all possibility of passing some a priori information about the solution? To eliminate such questions, a provision that there should be a blind test, where data is generated by an independent team, should be implemented as part of scientific protocol (imho).
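The IVT side of such a blind test is easy to sketch. What follows is only an illustration of the linear case under assumptions of my own: the function name, the Gaussian weights and the random 0/1/2 stand-in genotypes are all hypothetical, and a real test would use the real genotype matrix. The noise is scaled so that the genetic share of the score variance matches a target heritability that stays secret.

```python
import numpy as np

def make_blind_test(genotypes, n_causal, heritability, rng):
    """Build synthetic scores and a half/half split for a blind test.

    Returns the scores, the row indices released to the research team,
    the row indices withheld for validation, and the secret answer key.
    """
    n_people, n_snps = genotypes.shape
    causal = rng.choice(n_snps, size=n_causal, replace=False)
    weights = rng.standard_normal(n_causal)
    genetic = genotypes[:, causal] @ weights
    # Scale the noise so var(genetic) / var(score) is close to the target.
    noise_sd = np.sqrt(genetic.var() * (1.0 - heritability) / heritability)
    scores = genetic + rng.normal(0.0, noise_sd, size=n_people)
    order = rng.permutation(n_people)
    half = n_people // 2
    released, withheld = order[:half], order[half:]
    secret = {"causal": causal, "weights": weights, "heritability": heritability}
    return scores, released, withheld, secret

rng = np.random.default_rng(42)
genotypes = rng.integers(0, 3, size=(2000, 500)).astype(float)  # stand-in data
scores, released, withheld, secret = make_blind_test(
    genotypes, n_causal=50, heritability=0.5, rng=rng)
print(len(released), "rows released,", len(withheld), "rows withheld")
```

The RT would receive only `genotypes[released]` and `scores[released]`; the `secret` dictionary and the withheld rows stay with the IVT until the RT hands in its solution.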

If they succeed in coming up with a predictor for more than half of the height variance it should be quite easy to check that against another data set. No need to publish the underlying data. Just publish the predictor with its coefficients and let another research team try it on their own private data set.
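Checking a published predictor needs very little code on the receiving end. A sketch, assuming the predictor is published as a plain vector of per-SNP coefficients lined up with the columns of the second team's genotype matrix; the names and the simulated data are hypothetical, and the noise level is chosen here so that about half the variance is explainable.

```python
import numpy as np

def variance_explained(coefficients, genotypes, observed):
    """Squared correlation between predicted and observed trait values."""
    predicted = genotypes @ coefficients
    r = np.corrcoef(predicted, observed)[0, 1]
    return r ** 2

rng = np.random.default_rng(7)
published = np.zeros(300)
published[:20] = rng.standard_normal(20)          # the published coefficients

# The second team's own data, never seen by the authors of the predictor.
own_genotypes = rng.integers(0, 3, size=(1500, 300)).astype(float)
genetic = own_genotypes @ published
own_scores = genetic + rng.normal(0.0, genetic.std(), size=1500)

r2 = variance_explained(published, own_genotypes, own_scores)
print(f"variance explained on the independent set: {r2:.2f}")
```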

Thanks for your description. Rasch scoring has always been available, but has rarely become mainstream, though it has many interesting features. Brief mention here:

http://www.unz.com/jthompson/what-makes-problems-difficult

I said “Agree” but there are some qualifications. There is no such thing as a private data set in this business. There is a consortium, I think, that collects all data sets making them available to everybody. The set must be truly independent and what is very important, unknown to those who generated the predictor function. Time should take care of this as sets are expanded with new data providing that the prediction function is not constantly amended to also explain new data as they arrive.

Irrelevant to this post. Post is about identifying genetic variants that contribute to intelligence, not about identifying smart ancestral groups.

utu, thank you very much for your reply.

I will need to think about this article more in order to develop a better understanding. My interpretation of the black lines in Figures 1A and 1B was that below these lines all the non zero effect size SNPs would become known, though effect sizes would not be known. I thought this was a hard transition boundary between knowing and not knowing. I will need to reread the article more carefully.

If the black line is not such a transition, how should I understand the black line?

For many, finding the actual SNPs without reference to the effect sizes is of substantial importance. I am sure that many on this thread would be very happy to now have a list of the 10,000 IQ SNPs even if betas were unknown.

“This result implies that perfect selection of nonzeros can occur before the magnitudes of the coefficients are well fit.” (page 5 of the pdf). This quote is moving into focus for me. In figure 1B the white circle is well below the black line. While the red area above the circle means that the magnitude of the effect sizes would not be known, perhaps somewhere in this red zone all the SNPs would be known (similar to comments above). Would love to see something similar to Figure 1B that color coded the probabilities (isoprobs) of having a full list of the SNPs (say 95% or 99%). The article mentioned that the white circle in Figure 1A achieved nearly a 100% list of the SNPs.

Have any of the insights gained from CS been applied to IQ SNPs in a recent paper, in order to estimate where we might be relative to the phase boundary or on the curves?

> My interpretation of the black lines in Figures 1A and 1B was that below these lines all the non zero effect size SNPs would become known
> (page 5 of the pdf)

I would like to take a look at the figures you are referring to. Could you post the url? I looked at https://arxiv.org/pdf/1310.2264.pdf but the figures start at page 18. I have some comments about the figures there but will wait for your response.

> I am sure that many on this thread would be very happy to now have a list of the 10,000 IQ SNPs even if betas were unknown.

Maybe, but w/o betas how can you tell which SNP is better/stronger than another?

The more I think about the whole problem, the more I see how difficult it really is. Hsu's papers are really misleading imo. There is way too much hype. The theorems that he cites in his first paper, which lead to the estimate of the necessary sample size, apply to an isotropic matrix, i.e., when the SNPs, i.e., the columns of the Ø [NxM] matrix, do not correlate. This is not the case for real data. He does simulations on synthetic data and some real (but small sample) data. This is not that important however. His method is nothing new and he states directly that he does not expect applications with "his" method:

If I read it correctly, in his two papers he provides two estimates of the sample size: N=300k for the linear case and 1000k for the nonlinear case, if H2=0.5 and x has 10,000 nonzeros. Personally I am skeptical of these estimates because one has to know the structure of the actual matrix and what range of IQ variance is covered by the sample. Are his estimates the worst case scenarios, i.e., when the structure of the matrix is the least favorable?

There is a method that one of the reviewers asked Hsu to test the LASSO L1 method against. It is the marginal regression method, which is intuitively correct and easy to understand. Anybody would come up with this method as the first approach. It requires calculating the correlations of all SNPs with the y vector. Then you pick out suspect SNPs according to the size of the correlations. I think something like this is done in GWAS. However, noisy data (heritability<1 plus other junk) will hide many relevant SNPs. But this is a good way to get some subset of the x vector support (the strongest SNPs) with which to start another method like LASSO and speed it up. LASSO is slower by 3 orders of magnitude than the marginal regression. Here is the article (v. theoretical) that compares LASSO and marginal regression: ___________________________
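The marginal regression step as described here (correlate every SNP with y, then rank) fits in a few lines. A sketch on made-up data: the function name, the three planted SNP positions and the noise level are all assumptions for illustration, and it assumes no SNP column is constant (that would divide by zero).

```python
import numpy as np

def marginal_ranking(genotypes, y):
    """Rank SNP columns by absolute correlation with the phenotype y."""
    Xc = genotypes - genotypes.mean(axis=0)
    yc = y - y.mean()
    corr = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    order = np.argsort(-np.abs(corr))      # strongest correlation first
    return order, corr

rng = np.random.default_rng(3)
genotypes = rng.integers(0, 3, size=(2000, 500)).astype(float)
causal = np.array([10, 200, 450])          # planted SNPs, equal effects
y = genotypes[:, causal].sum(axis=1) + rng.standard_normal(2000)

order, corr = marginal_ranking(genotypes, y)
top3 = set(order[:3].tolist())
print("top three SNPs by |correlation|:", sorted(top3))
```

With three strong effects and modest noise the planted SNPs stand out clearly. With thousands of weak effects and heritability well below 1, many of them would sink into the background, which is the limitation noted above.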

I have a hunch that I know what the dark secret of the hunt for the IQ genes is. It is the overfit. They keep finding, say, 10,000-SNP solutions that can explain, say, 80% of variance w/o any difficulty, but the solutions never pan out in the validation phase. Then they use a different subset sample and get a different 10,000 SNPs. That's why they are proceeding very slowly and carefully, using genes that are known from other studies to have some manifestations in brain physiology, so the issue of causality is taken care of. But are these genes well validated as the sample keeps increasing? Every time more subjects are added to the sample, all past results in principle should be validated on the increased sample (by an independent team of skeptics, not some cheerleading yahoos).

But there is another possibility: one can cheat a little bit by doing the LASSO method on the full sample and then feeding part of the solution to the fit on 1/2 of the sample, and then the solution will validate on the second 1/2 of the sample as the validation set. If somebody gets desperate this is always an option. But they will be caught if a larger sample is generated down the road. Perhaps UN observers should be deployed to Posthuma and Visscher and Plomin labs to keep an eye on what is really going on there. People with attitudes like you or the author of this blog will piss their pants out of joy on any news of a positive result and never will ask for independent verification, right? Remember Reagan's and Gorbachev's trust but verify? The idea of having all data sets in the public domain is a very good one.

The only foolproof approach would be a combinatorial one, i.e., find all 5,000-long solutions, all 5,001-long solutions and so on. But because of the size of the matrix this could never be calculated in the life of this universe. So you do the LASSO method, which finds a solution, or actually several solutions for different penalty parameters. This, however, in my opinion does not exhaust all possibilities.
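The size of that combinatorial search is easy to put a number on. A sketch, assuming 1,000,000 candidate SNPs (my assumption, taken from the "possibly a million SNPs" figure used earlier in the thread) and subsets of exactly 5,000; log-gamma is used so the count never has to be built as an actual integer.

```python
import math

def log10_subsets(p: int, k: int) -> float:
    """log10 of the number of ways to choose k SNPs out of p."""
    ln_count = math.lgamma(p + 1) - math.lgamma(k + 1) - math.lgamma(p - k + 1)
    return ln_count / math.log(10)

digits = log10_subsets(1_000_000, 5_000)
print(f"C(1,000,000, 5,000) is about 10^{digits:.0f}")
```

That is a count with more than 13,000 digits for the 5,000-long solutions alone, before moving on to 5,001 and so on.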

Only the statistically naive will scream about over-fitting.

https://en.wikipedia.org/wiki/Bayesian_information_criterion

“When fitting models, it is possible to increase the likelihood by adding parameters, but doing so may result in overfitting. Both BIC and AIC attempt to resolve this problem by introducing a penalty term for the number of parameters in the model; the penalty term is larger in BIC than in AIC.”
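For reference, the two penalties in the quoted passage look like this in code. These are the standard textbook definitions (AIC = 2k - 2 ln L, BIC = k ln n - 2 ln L); the function names and the example log-likelihoods are made up for illustration.

```python
import math

def aic(log_likelihood: float, k: int) -> float:
    """Akaike information criterion for a model with k parameters."""
    return 2 * k - 2 * log_likelihood

def bic(log_likelihood: float, k: int, n: int) -> float:
    """Bayesian information criterion for k parameters and n observations."""
    return k * math.log(n) - 2 * log_likelihood

# Two hypothetical fits to the same n = 1000 observations: the bigger
# model has a slightly higher likelihood but ten times the parameters.
n = 1000
small = {"log_likelihood": -1500.0, "k": 10}
large = {"log_likelihood": -1480.0, "k": 100}
for name, m in (("small", small), ("large", large)):
    print(name, aic(m["log_likelihood"], m["k"]), bic(m["log_likelihood"], m["k"], n))
```

Lower is better for both criteria. The BIC penalty per parameter is ln n rather than 2, so it is the larger of the two whenever n is 8 or more.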

> Both BIC and AIC attempt to resolve this problem

The key operational word here is "attempt." Not all attempts are successful and not all attempts are justified. Often they are just arbitrary mathematical criteria (even if, or particularly when, they have fancy names) that guarantee uniqueness but do not necessarily represent the physical reality of the problem. One way of reducing the chance of overfit is to use the heritability value from twin studies as a constraint, which obviously is not a method to find a heritability value independent of the one from twin studies. IMO the L0 metric is the most sensible approach, as it is supposed to lead to the solution with the lowest number of parameters (SNPs). This is something everybody can understand. But it still does not mean that the actual true physical reality solution is the one with the lowest number of parameters. One can invoke Occam's razor to say that it actually is so, but nobody thinks that Occam's razor is a law of nature, though perhaps it helped Occam to find the minimal number of devils dancing on the pin. However, I might be wrong. Perhaps evolution finds solutions with a minimal number of genes, but would evolution reject a solution when by chance more genes produce similar results?

But you are correct that by adding additional mathematical criteria, which in fact are arbitrary in the physical sense, we transform the problem into a mathematical problem that has a unique solution where overfit is unlikely. It should be kept in mind that noise in the data (heritability<1) can make any method do weird, unanticipated things.

Basically one must tread very carefully. The slowness of progress by the group of Posthuma gives me hope that they are not a bunch of cowboys, unlike the cheerleaders in this commentariat.

https://en.wikipedia.org/wiki/Bayesian_information_criterion

"When fitting models, it is possible to increase the likelihood by adding parameters, but doing so may result in overfitting. Both BIC and AIC attempt to resolve this problem by introducing a penalty term for the number of parameters in the model; the penalty term is larger in BIC than in AIC."

Both BIC and AIC attempt to resolve this problemThe key operational word here is “attempt.” Not all attempts are successful and not all attempts are justified. Often they are just arbitrary mathematical criteria (even if or particularly when they have fancy names) that guarantee uniqueness but not necessarily representing the physical reality of the problem. One way of reducing a chance of the overfit is to use heritability value form twin studies as a constraint which obviously is not a method to find the heritability value independent of the one from twin studies. IMO the L0 metric is the most sensible approach as it suppose to lead to the solution with the lowest number of parameters (SNPs). This is something everybody can understand. But it still does not mean that the actual true physical reality solution is the one with the lowest number of parameters. One can invoke here the Occam razor that actually it is so but nobody thinks that Occam razor is the law of nature though perhaps it helped Occam to find the minimal number of devils dancing on the pin. However I might be wrong. Perhaps evolution finds solutions with minimal number of genes but would evolution reject the solution when by chance more genes produce similar results.


> I do not see how knowing which genes are, in human terms, superior for IQ justifies using machine learning, thereby starting the ball rolling to true AI.

AI won’t benefit from our knowing which human genes contribute to human intelligence.

I will need to think about this article more in order to develop a better understanding. My interpretation of the black lines in Figures 1A and 1B was that below these lines all the nonzero effect size SNPs would become known, though effect sizes would not be known. I thought this was a hard transition boundary between knowing and not knowing. I will need to reread the article more carefully.

If the black line is not such a transition, how should I understand the black line?

For many, finding the actual SNPs without reference to the effect sizes is of substantial importance. I am sure that many on this thread would be very happy to now have a list of the 10,000 IQ SNPs even if betas were unknown.

"This result implies that perfect selection of nonzeros can occur before the magnitudes of the coefficients are well fit." (page 5 of the pdf). This quote is moving into focus for me. In figure 1B the white circle is well below the black line. While the red area above the circle means that the magnitude of the effect sizes would not be known, perhaps somewhere in this red zone all the SNPs would be known (similar to comments above). Would love to see something similar to Figure 1B that color coded the probabilities (isoprobs) of having a full list of the SNPs (say 95% or 99%). The article mentioned that the white circle in Figure 1A achieved nearly a 100% list of the SNPs.

Have any of the insights gained from CS been applied to IQ SNPs in a recent paper, in order to determine (or at least estimate) where we might be relative to the phase boundary and on the curves?

> My interpretation of the black lines in Figures 1A and 1B was that below these lines all the non zero effect size SNPs would become known

> (page 5 of the pdf)

I would like to take a look at the figures you are referring to. Could you post the URL? I looked at https://arxiv.org/pdf/1310.2264.pdf but the figures start at page 18. I have some comments about the figures there but will wait for your response.

> I am sure that many on this thread would be very happy to now have a list of the 10,000 IQ SNPs even if betas were unknown.

Maybe, but without betas how can you tell which SNP is better or stronger than another?

Although machine learning provides the most efficient and powerful tools for data analysis, I don't think it will find the holy grail of intelligence any time soon, if that is possible at all.

1. ML needs physical data and real output. Voice recognition, facial recognition and thumbprint security all have physical data. The real output is our voice and our facial features. For intelligence, we have physical data on SNPs and the 3 billion base pairs of the genome. But what would our “y” value, the real output, be? Intelligence? It is abstract.

2. Measuring one's intelligence with a regional or global exam or test is moot. ACT, TOEFL, IELTS, SAT, GRE, MCAT, MLE, UKCAT, GAMSAT, GMAT: they are only useful at a particular stage in your life. An IQ test is as good as knowing your orientation and geography. MENSA tests your ENGLISH ability (acronyms, synonyms, permutations, combinations); it does not apply to the Chinese language. So MENSA is a good indicator for eating cheese and dumplings. If you eat cheese, you're smart, but if you eat dumplings, you're dumb as a duck. That's how it applies.

   GRE used to be an intelligence test back in the day. If you score more than 1250 on the GRE, you can be a MENSA member, which means that Deng Xiaoping, who lifted 1.3 billion people out of poverty with state sponsorship and foreign policy, would appear as stupid as your next-door bloke because he wouldn't know what “recidivist” means.

3. Maybe Einstein's 3-billion-base-pair genome will shed light on intelligence? At the age of 30, he was working as a clerk. His SNPs will be as good as Rosie O'Donnell's, because the greatest similarity between them is their ugly facial features. In the 21st century, the so-called intelligentsia at the age of 30 are studying for double MD/PhD degrees. I'm sure that if he had taken any IQ test, he would definitely have failed. Any high IQ (triple digit) assigned to all those highly intelligent persons is oftentimes deduced from their achievements in later life, occasionally posthumously. I highly doubt Einstein could answer this question:

   Mensa Question: The same word can be added to the end of GRASS and the beginning of SCAPE to form two other English words. What is the word?

   Or do you think Wernher von Braun, a father of rocketry, would be as intelligent as Jodie Foster, who is actually a Mensa member? The bottom line is, there's no physical, real data for intelligence. Intelligence is fluid and dynamic. It requires many variables.

4. If you use all those spelling bee champions, who can regurgitate archaic words out of their asses, the algorithm will not find intelligence; it can only find grinds at best.

Of course, machine learning is the most powerful learning algorithm we have created in this 21st century.

https://en.wikipedia.org/wiki/Bayesian_information_criterion

"When fitting models, it is possible to increase the likelihood by adding parameters, but doing so may result in overfitting. Both BIC and AIC attempt to resolve this problem by introducing a penalty term for the number of parameters in the model; the penalty term is larger in BIC than in AIC."

If two different mathematical criteria, BML (Bayesian Maximum Likelihood) and MBS (Minimum BS), produce two solutions X_BML and X_MBS with different correlations r_BML and r_MBS and different numbers of SNPs, one will need a third criterion to decide whether either of them is overfitting. Then it may just come down to human judgment, which often, as history teaches us, is susceptible to the criterion KWSIBB (know which side is bread buttered).

utu, this is great! I am making considerable progress in advancing my understanding of CS!

LASSO and ridge regression are making much more sense to me and the figures above posted by res are also clear.

I am going to have to think about the meaning of L0, though. What is the geometric interpretation in comparison to the circle (L2) and the square(L1)? L0 would be like adding a weighting to the score (lambda p)?

Your comment about there being many possible solutions now makes more sense to me. With so much noise in the system and effect sizes so small, it is too optimistic to assume that being below the black line in the red area would truly give you the correct non-zero SNPs.

I would love to see the research with the upcoming height data and the description of how the nonzero SNP set changes with sample size. What actually happens in a sample of the dataset when moving below the black line? Specifically, where in the ρ−δ plane does the SNP set become fixed? Anyone care to guess where the paper will find this fixation point? Will it be near the white circle in Figure 1B near the x-axis, or up in the red zone closer to the black line?

http://nuit-blanche.blogspot.ca/2013/10/application-of-compressed-sensing-to.html

https://arxiv.org/pdf/1310.2264v1.pdf (page 15)

“… we find that irrespective of δ, ρ should be less than 0.03 for recovery. There is no hope of recovering x above this threshold. For example, if we have prior knowledge that s = 1,200, then this means that the sample size should be no less than 40,000 subjects. As a rough guide, for h2 ∼ 0.5 we expect that n ∼ 30s is sufficient for good recovery of the loci with nonzero effects.”

The above is very interesting and is what I have been puzzling about. So ρ should be less than 0.03 for recovery, with no hope of recovering x above this threshold. In Figure 1B on page 20 of https://arxiv.org/pdf/1310.2264.pdf, 0.03 on the vertical axis would be the blue/yellow/green area near the x-axis. Would “no hope of recovering x” mean no hope of recovering accurate betas for x, or no hope of recovering even a complete list of nonzero betas that would not be accurate?

Such revelations could provide clues for where this might occur for other traits. I have usually been most interested in thinking carefully about models and avoiding noisy datasets, though with these datasets crunching through the numbers themselves would provide many insights.

Your comment about finding 10,000 SNPs with 80% of the variance is quite interesting. They don’t publish these results? Really? It is true that such sets might not replicate, though they likely have a fair amount of the true signal, especially if they are past the phase transition.

This is more of the quote from the article:

“Proposition 2 states that the scaling of the total fitting error in the favorable regime is within a (poly)logarithmic factor of what would have been achieved if the identities of the s nonzeros had been revealed in advance by an oracle. This result implies that perfect selection of nonzeros can occur before the magnitudes of the coefficients are well fit. Even if the residual noise is substantial enough to prevent the sharp transition from large to negligible fitting error evident in Figure 1A, the total magnitude of the error in the favorable phase is little larger than what would be expected given perfect selection of the nonzeros.” This quote was from page 5 of https://arxiv.org/pdf/1310.2264.pdf You are quite right that the figures start on page 18.

I will be very interested in your comments about the figures. I continue to wonder where the current knowledge of IQ SNPs might be regarding some of the Figures on pages 18+. For example, how have the median p-values been changing? Or in terms of the phase boundary in Figure 1, if s=10,000, n=280,000 and p=8,000,000 –> rho=0.035, delta= 0.035
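The arithmetic behind those two coordinates can be checked in a few lines. This is only a sketch: it assumes the definitions ρ = s/n and δ = n/p that are implied by the paper's own example quoted earlier (s = 1,200 and ρ = 0.03 giving n = 40,000), and the function name is made up for illustration.

```python
def phase_plane_coordinates(s, n, p):
    """Return (rho, delta), assuming rho = s / n and delta = n / p."""
    rho = s / n      # sparsity relative to sample size
    delta = n / p    # sample size relative to number of SNPs
    return rho, delta


# Values from the comment above: 10,000 causal SNPs, 280,000 subjects, 8M SNPs.
rho, delta = phase_plane_coordinates(s=10_000, n=280_000, p=8_000_000)
print(f"rho = {rho:.4f}, delta = {delta:.4f}")

# The paper's quoted example: s = 1,200 and a threshold of rho = 0.03.
n_needed = 1_200 / 0.03
print(f"n needed for rho = 0.03 with s = 1,200: {n_needed:,.0f}")
```

With s held fixed, doubling n halves ρ and doubles δ, so a larger sample moves the point down and to the right in the ρ−δ plane.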

On the question about finding the size of the betas: I suppose you could also call this idea CS (standing for Crowd Sensing). With the identity of the SNPs known, one could call on the wisdom of the network to find the large-effect betas.

Consider some disease X. If you had all the known SNPs for that illness say 10,000, then you have a starting point to find a variant of interest (i.e. large effect). For example, a person could take their genome sequence and find their genotypes for these SNPs. For some illnesses, even the large effect SNPs have yet to be found. With a complete list of the SNPs one could simply go to dbsnp and do a batch download of all the MAF of the SNPs. Pull out the SNPs with very small MAFs and look at genotypes for anything unusual. For people with an illness, the assumption could be made that the beta would move in the direction of risk. One could then go to online ancestry sites and contact people who shared this possible risk marker. Even if 20% (i.e., 2,000) of the SNPs were rare [say 1 in 1000] one would only expect to have 2 of them.

Page 18 of 35: P1 solves P0? See the theorem on page 17. Page 33 of 35: the URL for the SparseLab software. We can run our own simulations!

https://web.stanford.edu/~vcs/talks/MicrosoftMay082008.pdf

> I am going to have to think about the meaning of L0, though. What is the geometric interpretation in comparison to the circle (L2) and the square (L1)?

L0 has one property of a metric, ||x+y||≤||x||+||y||, but it is a degenerate, primitive function taking a finite set of discrete values, unlike Lp (p>0), which covers a continuum. So on the R^2 plane the solution of ||x||=1 is a circle, a square, and the two axes X and Y (without the origin) for L2, L1, and L0, respectively. ||x||0. Lp1 and Lp2 spaces (p1 and p2 >0) are topologically similar, i.e., under some mappings from one to the other they preserve some topological properties (like convexity, I think), but I do not think that this is the case with L0, because the L0 metric can define only three geometric figures: one point, the two axes, and the whole plane minus the two axes, so you can't talk about convexity.
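A small sketch may make the three cases concrete. It evaluates the nonzero count (L0), L1 and L2 for three points in the plane; the helper names `l0` and `lp` are made up for this illustration.

```python
def l0(x):
    """Number of nonzero entries (the so-called L0 'norm')."""
    return sum(1 for v in x if v != 0)


def lp(x, p):
    """Ordinary Lp norm for p > 0."""
    return sum(abs(v) ** p for v in x) ** (1.0 / p)


points = {
    "origin": (0.0, 0.0),         # L0 = 0: the single point
    "on an axis": (1.0, 0.0),     # L0 = 1: the two axes minus the origin
    "off both axes": (0.6, 0.8),  # L0 = 2: everything else in the plane
}
for name, pt in points.items():
    print(f"{name:14s} L0={l0(pt)}  L1={lp(pt, 1):.2f}  L2={lp(pt, 2):.2f}")

# The triangle inequality holds for the nonzero count ...
x, y = (1.0, 0.0), (0.0, 2.0)
total = tuple(a + b for a, b in zip(x, y))
assert l0(total) <= l0(x) + l0(y)
# ... but rescaling a vector leaves the count unchanged, so it is not a true norm.
assert l0(tuple(3 * v for v in x)) == l0(x)
```

The point (0.6, 0.8) lies on the L2 unit circle but has L1 norm 1.4 and L0 value 2, which shows how differently the three measures treat the same vector.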


> AI won't benefit from our knowing which human genes contribute to human intelligence.

The difference is that AI will be evolving on a digital timescale, and will go from village idiot to meta mind incredibly quickly. At that point humans will no longer have the ball.

utu, thank you again.

Section 2.4 gives a great geometric interpretation of Lp spaces. I had not thought of either p=infinity or 0<p<1. I am sure that I do not want to even think about negative or complex p values in Lp space.

http://cnx.org/contents/9wtroLnw@5.12:U4hLPGQD@5/Compressible-signals

> So MENSA is a good indicator for eating cheese and dumpling. If you eat Cheese, you’re smart, but if you eat dumplings, you’re dumb as duck.

It does. Datasets like UK Biobank are only local, and the local ranges are not enough to show certain effects, such as how the distribution of lactase persistence (better nutrient uptake) correlates with national IQ proxies (rs4988235_A freq: EUR 50.8%, EAS 0%).

PISA3 = +119.661*rs498A +409.265; # n=46; Rsq=0.4779; p=1.046e-07; European only

However, East Asians have their own unique SNPs for IQ, like rs671_A, which causes aversion to alcohol and thus fewer brain cell connections fried by it (freq: EUR 0%, AFR 0%, EAS 17.4%).

PISA3 = +479.688*rs671A +408.127; # n=12; Rsq=0.9398; p=1.998e-07; East Asian only

Interestingly, the Ashkenazim also have a considerably similar alcohol-aversion gene mutation, ALDH2*2 (EUR 2.9%, EAS 69.7%, Ashkenazi ?), to that of the East Asians.

https://www.ncbi.nlm.nih.gov/pubmed/12153842

> Dataset like UKBioBank are only local and the local ranges are not enough…

shhh. no psychologist understands that.

sad!

if the number of subjects equals the number of SNPs an exact solution almost always exists. that is, the phenotype is “predicted” by the genotype with 100% accuracy. simply answer: what is the probability a random matrix is invertible?

http://blogs.sas.com/content/iml/2011/09/28/what-is-the-chance-that-a-random-matrix-is-singular.html

1 million individuals + 1 million SNPs means h^2 = 1.

and the result is 100% meaningless.
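The claim is easy to reproduce on a toy scale. The sketch below is an illustration under loud assumptions: the genotype matrix is filled with Gaussian random numbers rather than real 0/1/2 allele counts, n = 30 stands in for the millions, and the phenotype is pure noise with no genetic signal at all. A square random system is still solved exactly.

```python
import random


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("matrix is numerically singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


random.seed(0)
n = 30  # number of subjects equals number of "SNPs" (toy size)
genotypes = [[random.gauss(0, 1) for _ in range(n)] for _ in range(n)]
phenotype = [random.gauss(0, 1) for _ in range(n)]  # pure noise

beta = solve(genotypes, phenotype)
fitted = [sum(g * b for g, b in zip(row, beta)) for row in genotypes]
max_err = max(abs(f - y) for f, y in zip(fitted, phenotype))
print(f"largest in-sample error: {max_err:.2e}")
```

The in-sample error is at the level of floating-point rounding even though the phenotype has nothing to do with the genotypes, so a perfect fit at n = p carries no information about heritability.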

sad!

there are 10m SNPs. this means the number, which the professor keeps increasing, will have to be more like 100m. and these people will have to be scattered all over the world or the results will only be local.

a 10m by 10m study gets all the hits, but they aren’t real hits. so the idea that 1m will get all the hits when 10m doesn’t is absurd.

if there are any SNPs to be found what you will get is increasing accuracy of fit up to 10m subjects, after which the estimated fit will decline. the asymptote to which it declines as number of subjects increases past 10m is the “real” heritability using the linear model.

far fewer subjects are required if one uses far fewer SNPs. but then the asymptote will be much lower and one will have to extrapolate to what it would have been had all SNPs been considered.

utu, I am understanding more of the article.

I was worrying about “with a suitable choice of λ obeys …” (bottom of page 4). What would this suitable choice of lambda be? It occurred to me that since s is ~10,000 we can set lambda to the value that pushes all but 10,000 of the betas to 0. It would be interesting to see what happens to the residual errors around this value of lambda.

Full recovery of x does require being at the circle in Figure 1B (blue dots at n=1500 in Figure 2B). Obtaining all of these SNPs for IQ might not be as difficult as many had thought, though it could take some time and much larger samples to discover the betas (NE declines gradually in Figure 2C).

Under the conditions of Figure 2D, 50% of the SNPs would be true hits at n=1700. In Figure 2C, at n=1700 the NE=0.92. This would correspond to a point in Figure 1B of (0.21, 0.07), which would be in the reddish area. This is somewhat surprising: even with near-maximal NE, almost half of the reported SNPs would be true positives (the article is not clear on what the actual number of SNPs would be; it only reports the fraction that are true positives). GWAS studies only report those SNPs that are statistically significant; many of those not reported could be true positives.

> I am understanding more of the article.

I wish I could say the same. I looked at the Hsu (Vattikuti) paper and I do not find it particularly helpful. The phase transition concept is borrowed, and as Fig. 1 (Hsu) illustrates it is not very useful for the case with noise (Fig. 1B). Solutions on one side of the curve are not much better than on the other. In the noiseless case (Fig. 1A and the blue curve in Fig. 1C) you can clearly see the transition. It is sharp and well defined. But in the case with noise (Fig. 1B and the red curve in Fig. 1C) the range where you can get a solution is very narrow and very far from the theoretical phase transition curve, and when you go outside this range you do not get solutions (NE≈1). These are simulations in which you know the solution, so you can calculate the error of the retrieved solution, but in the real case you do not know where the phase transition is or what the NE value is. The only confirmation that what you got is not a false positive is via the validation procedure on a separate set of data. The data that produced the fit can't tell you if the fit is valid!

> What would be this suitable choice of lambda?

They do not know it. Every lambda produces a different solution. Should you pick the one with the lowest L1 value or the one that gives you the smallest residuals? Or do you take all the solutions and run them through the validation procedure to see which one, if any, survives it? It is possible that the one that has the lowest L1 or L0 is not the one.

> even with near maximal NE almost half of the reported SNPs would be true positives

But if you can't tell which are which, it is not very useful.

The bottom line is that there are false positives. This is because the system is underdetermined. y=Ax has an infinite number of solutions, which means that even if heritability were zero you could probably find some solutions. By adding the constraint min||x|| the problem becomes unique. You will get the solution x that minimizes ||y-Ax||. The only way to be sure that the solution is not spurious is to verify it on a separate set of data (y', A') and check whether ||y'-A'x|| is also small enough.
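The two points above (every lambda gives a different solution, and only held-out data can tell whether a fit is real) can be sketched with a toy lasso. Everything here is an assumption made for illustration: Gaussian "genotypes", 80 SNPs of which 4 have an effect, 40 training and 40 held-out subjects, and a plain cyclic coordinate descent written from scratch. It is not the procedure used in the paper.

```python
import random


def soft(z, t):
    """Soft-thresholding operator used in the lasso update."""
    return (z - t) if z > t else (z + t) if z < -t else 0.0


def lasso(X, y, lam, sweeps=100):
    """Cyclic coordinate descent for (1/2n)||y - Xb||^2 + lam * ||b||_1."""
    n, p = len(X), len(X[0])
    b = [0.0] * p
    resid = y[:]  # residual y - Xb, with b = 0 initially
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(sweeps):
        for j in range(p):
            # correlation of column j with the partial residual (b_j removed)
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * b[j]
            new = soft(rho, lam) / col_sq[j]
            if new != b[j]:
                d = new - b[j]
                for i in range(n):
                    resid[i] -= X[i][j] * d
                b[j] = new
    return b


def mse(X, y, b):
    """Mean squared prediction error of coefficients b on (X, y)."""
    return sum((yi - sum(x * w for x, w in zip(row, b))) ** 2
               for row, yi in zip(X, y)) / len(y)


random.seed(1)
n, p, s = 40, 80, 4  # hypothetical toy sizes
true_b = [1.0] * s + [0.0] * (p - s)


def draw(m):
    """Simulate m subjects with Gaussian genotypes and a noisy phenotype."""
    X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(m)]
    y = [sum(x * w for x, w in zip(row, true_b)) + random.gauss(0, 1) for row in X]
    return X, y


X_train, y_train = draw(n)
X_test, y_test = draw(n)

results = []
for lam in (0.01, 0.05, 0.2, 0.5, 5.0):
    b = lasso(X_train, y_train, lam)
    nonzero = sum(1 for w in b if w != 0.0)
    row = (lam, nonzero, mse(X_train, y_train, b), mse(X_test, y_test, b))
    results.append(row)
    print(f"lambda={lam:<5} nonzeros={nonzero:3d} "
          f"train MSE={row[2]:.3f}  held-out MSE={row[3]:.3f}")
```

The training error alone always favours the smallest lambda, because with more SNPs than subjects the model can nearly interpolate the data. Only the held-out column distinguishes a useful fit from a spurious one.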

In my opinion they will not start with the Lasso+L1 (or L0) method from scratch. They will use some a priori information on SNPs that were identified as potential suspects by some other method, like looking at all the correlations between y and the columns of matrix A.

In my opinion Hsu's paper is somewhat hyped and may give the impression that the ultimate method for analyzing the data has been found. His 2014 paper has been cited 21 times so far (according to Google Scholar). Will this paper have a significant influence on narrowing the missing heritability gap?


utu, yes the boundary is quite porous.

If one were on the highway, it would not be so much a border as a light mist.

I believe Figure 1C is a slice at delta=0.5. I would really love to have this same slice showing the percent of true non-zeros amongst the total number of non-zeros. Figures 1A and 1B show NE; what is obscure is how many and what proportion of non-zeros are being recovered.

I continue to be impressed that half of the reported SNPs would be true non-zeros at (0.25, 0.075). On Figure 1C, the red dot two to the left of the red square near NE=1 is roughly equal to rho=0.075. If you moved left along this y-value to x=0.25, then half the reported SNPs would be true. That is impressive! The NE is nearly 1! In some sense then it truly is a boundary: as you move below the boundary even with noise you are at least accumulating true SNPs.

Figures 2A and 2C, for me, illustrate the nature of the boundary even more vividly.

Yes, it would not be known which half were signal and which half noise, though this is still a result worthy of being published. We are in the age of data science. In this digital age, one does not need to be overly frugal with bits and bytes. Instead of an article of kilobytes or megabytes, we should have gigabyte or terabyte or exabyte articles! We no longer live in an era of digital poverty. Actually providing the data points so that they can be checked and recompiled (as Spearman did in 1904) is an important part of the scientific process and method. The scientific literature needs to become more of a dialog instead of being the final word.

I think the scientific community should be more open to data sharing even when there is noise. It is surprising how people when given a portion of the signal mixed with noise can extract the signal. Part of the problem is that the research community can extract only that part of the signal from the techniques and datasets which are available to them, while the grass roots community have access to another part of the signal which can be used to remove more of the noise. Allowing such a feedback process could greatly help amplify success.

For so many years we have had GWAS reports, and the researchers would express frustration that they had not found anything, but they would try again with a larger sample. Understanding the nature of the sparsity boundary adds clarity to why GWAS have been unsuccessful to date. These researchers have been lost in the wilderness for about 10 years. With the current research, they now have a compass pointing them in the right direction. Knowing that one is making marginal progress is a powerful motivator.

Admittedly, it might not be of overwhelming help when there is lots of noise, as there is at 0.5, though at least they have a heading. This knowledge could also help to find the diseases/traits that are especially low-hanging fruit.

Consider autism. Autism has quite high heritability of ~0.8. Below GWAS is using 15,000 sample size with s=~ 5,000-10,000. One would expect the NE transition boundary for autism to be somewhere in the middle of Figure A and Figure B. If s were 5,000 then rho= ~0.35, doubling n would bring rho down near toward the boundary and it might not take much more to hit a hard transition boundary in blue.

This would be very positive news for people on the spectrum. The researchers below do not have this insight. For all they know the answer might be millions of miles away: it isn’t. They still believe that they are in the wilderness, when they are relatively close to the boundary. With autism (instead of IQ and height) increasing sample sizes by tens of thousands should lead to a firmish boundary. IQ and height will need sample sizes approaching 1 million. In Figure 2C, it might be necessary to increase sample size until you have almost reached the x-axis for good betas. There would be an asymptotic approach as sample size increased. Doubling sample size from 5,000 to 10,000 might bring rho down from 0.02 to 0.01; then another might be needed.
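The arithmetic behind the autism example can be sketched in a few lines. This is only an illustration, assuming rho is defined as s/n (non-zero SNPs divided by sample size), as it is used elsewhere in this thread. The function name and the grid of n values are made up for the example, and the location of the boundary itself is not computed here.

```python
def rho(s, n):
    """Sparsity ratio rho = s / n (assumed definition: non-zero SNPs per sample)."""
    return s / n

# Candidate values of s from the comment; n starts at the 15,000-person GWAS
# and is doubled twice to show how rho falls as the sample grows.
for s in (5_000, 10_000):
    for n in (15_000, 30_000, 60_000):
        print(f"s={s:>6,}  n={n:>6,}  rho={rho(s, n):.3f}")
```

Each doubling of n halves rho for a fixed s, which is the sense in which a modest increase in sample size moves a trait toward the boundary.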

https://www.ncbi.nlm.nih.gov/pubmed/28540026

What I found so interesting about the choice of lambda was that with s known you could set lambda so that you received exactly the s non-zero betas. For example, if you knew s=10,493, then you could precisely calibrate lambda until you had exactly 10,493 non-zeros. Obviously s is not known to 5 significant figures, though it is thought to be roughly 10,000, so you can at least move into the right ball park right from the start. The betas might take some time though to fixate.
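A minimal sketch of that calibration idea, on simulated data rather than real genotypes: fit a Lasso for a given lambda, count the non-zero betas, and bisect on lambda until the count matches a target s. Everything here (the coordinate-descent solver, the function names, the toy sizes) is an assumption made for illustration and is not the procedure used in the paper. The count of non-zeros is only roughly monotone in lambda, so the search may stop without an exact match.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z toward zero by t; values with |z| <= t become exactly 0."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_sweeps=100):
    """Cyclic coordinate descent for (1/2n)||y - X b||^2 + lam * ||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y.astype(float).copy()
    for _ in range(n_sweeps):
        for j in range(p):
            # Correlation of column j with the residual that excludes beta[j].
            z = X[:, j] @ resid / n + col_sq[j] * beta[j]
            new = soft_threshold(z, lam) / col_sq[j]
            resid += X[:, j] * (beta[j] - new)
            beta[j] = new
    return beta

def calibrate_lambda(X, y, s_target, n_iter=30):
    """Bisect on log(lambda) until the fit has s_target non-zero betas (or give up)."""
    n = X.shape[0]
    hi = np.max(np.abs(X.T @ y)) / n   # at this lambda every beta is zero
    lo = 1e-4 * hi
    for _ in range(n_iter):
        lam = np.sqrt(lo * hi)
        k = int(np.count_nonzero(lasso_cd(X, y, lam)))
        if k == s_target:
            break
        if k > s_target:
            lo = lam      # too many non-zeros: penalise harder
        else:
            hi = lam      # too few non-zeros: relax the penalty
    return lam, k

# Toy data (simulated): 5 true effects among 40 predictors, low noise.
rng = np.random.default_rng(0)
n, p, s = 200, 40, 5
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:s] = [1.0, -1.0, 1.5, -1.5, 2.0]
y = X @ beta_true + 0.1 * rng.standard_normal(n)

lam, k = calibrate_lambda(X, y, s_target=s)
print(f"lambda={lam:.4f} gives {k} non-zero betas")
```

With a rough prior such as s of about 10,000, the same search could be started from a narrower lambda interval, which is the "right ball park" point made above.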

The paper does talk of doing a sequential analysis. My thought was that they could follow on with a Bayesian for the betas. It is thought that half are negative and are distributed normally. Perhaps using this information to arrange the betas would be useful.

infoproc is reporting that an article for height using the transition boundary approach is now in the pipe. I, for one, am the kid who can’t wait to open his Christmas presents. This is that exciting!

res, how did you embed the figure in your post? That was neat.

There was an embedded url that I also found interesting.

Merry Christmas everyone!!!!

Or for those not so inclined, Happy Holidays!

This is awesome!

The height GWAS is published.

That was a fast publishing turn.

Is it usually that quick on biorxiv?

We can now all see what Lasso can do.

http://www.biorxiv.org/content/early/2017/09/18/190124.full.pdf+html

Reproduced from http://www.unz.com/isteve/dr-donna-zuckerberg-on-how-she-is-a-victim-and-deserves-more-microaffirmations/#comment-2010243

Thanks! That is exciting! Their height predictor captures ~40% of variance using ~20k SNPs.

Notice that Educational Attainment does not show any sign of the phase change yet using the same data.

Figure 4 is a scatter plot of actual vs. predicted height for a held out sample.

I see no sign of systems biology expertise applied here (e.g. see tissue type analysis in earlier IQ meta-study). Hopefully a systems biologist can take a look at their SNP list.

Now to read the paper more closely.

P.S. utu, notice the use of out of sample validation.

Have been waiting for this. r= .65 which is pretty good. We may be getting somewhere.

I’ve just read it; it looks very good. It is interesting how they used an out-of-sample validation set to get the correlation and then kept lowering lambda, which pushes the number of non-zero SNPs up.

They prescreened SNP’s (based on standard single marker regression) and reduced their set to much lower number (p=50k and 100k) which is what I anticipate in my comment #79 above:

I do not understand the estimate of sample size: n≈30s, where 30 is from simulation in the earlier paper.

But now they reduced p to 100k with prescreening. So the estimate of n should be somewhat smaller.

Also see Steve’s latest blog post announcing the preprint: http://infoproc.blogspot.com/2017/09/accurate-genomic-prediction-of-human.html

There he includes the scatterplot I mentioned.

P.S. Here is the original (I think) GCTA paper which gave an estimate of 45% as the amount of phenotypic height variance explained by SNPs: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3014363/

I think the current LASSO paper also serves as nice confirmatory evidence of the GCTA estimates (I believe GCTA has been controversial).

The GCTA wikipedia page is useful: https://en.wikipedia.org/wiki/Genome-wide_complex_trait_analysis

For example, it includes a listing of GCTA estimates for a wide variety of traits.

Does anyone know the current best GCTA estimate for the full breakdown of height variance (e.g. non-SNP genetic contribution)? I seem to recall seeing something like that in my travels (I looked on Infoproc but did not find it).

This 2014 height paper was listed on Wikipedia as GWAS confirmation for GCTA (29% of variance explained): http://neurogenetics.qimrberghofer.edu.au/papers/Wood2014NatGenet.pdf

It includes interesting biological information (exactly the kind of thing I would like to see done with the new paper). But I think the current paper is even better confirmation IMHO. It is worth noting that the earlier paper achieved its best results by relaxing the SNP p-value threshold.

Is Lasso+L1 overhyped in this paper?

Number of SNPs due to prescreening was reduced from p=645k to p=50k or p=100k, while the training sample was n=453k after putting aside several sets of 5k as validation samples. Since n>p the problem is no longer underdetermined. The nonlinear method Lasso+L1 is not needed anymore. One would think it suffices to do regular linear multivariate regression to solve y=Ax for p variables since n>p. Then, using the validation sample, one would start shaving off SNPs with the lowest betas until the point when the correlation on the validation set begins to budge, i.e., starts getting lower. A plot similar to the one from Figure 1 would be obtained, except it would be constructed from right to left, i.e., from the large number of hits to the low number. This approach would be incomparably faster than the Lasso approach.

If the authors showed how the non-zero SNPs that ended up in the solution are ranked in the 100k set, we would know how critical the prescreening was.

Validation set size?

Shouldn’t the validation sample be larger than 5k? The curves in Figure 1 are very smooth and monotonic in the “plateau” region. Shouldn’t it be non-monotonic when it gets to the “overfit” region? How did they decide that s=20k and not 35k as in Figure 1? At which point does an additional SNP that causes a correlation increase cease to be considered activated?

“The number of activated SNPs in the optimal predictors for height and bone density is roughly 20k. Increasing the number of candidate SNPs used from p = 50k to p = 100k increased the maximum correlation of the predictors somewhat, but did not change the number of activated SNPs significantly.”

Is 50k SNPs the top (in terms of correlations) sample of 100k SNPs? I wonder what would happen if two non-overlapping, randomly selected 50k subsets of the 100k SNPs were used? What would be the maximum correlation for each?

Does validation prove causation?

How do we know that there is no other significantly different set of SNPs among p=645k that produces similarly high correlation on the validation set? And if we could find one, then which solution is the causative one? Both?
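The regress-then-prune alternative raised in the “Is Lasso+L1 overhyped” comment can be sketched on toy data. This is a hedged illustration only: the data are simulated with independent predictors (real SNPs are correlated), the sizes are tiny, the betas are not refitted after pruning, and none of the names come from the paper. It says nothing about how such an approach would behave at p=100k.

```python
import numpy as np

rng = np.random.default_rng(1)
n_train, n_val, p, s = 2000, 500, 50, 10
beta_true = np.zeros(p)
beta_true[:s] = rng.choice([-1.0, 1.0], size=s)   # 10 real effects, 40 nulls

def simulate(n):
    """Draw n individuals with p independent predictors and unit-variance noise."""
    X = rng.standard_normal((n, p))
    return X, X @ beta_true + rng.standard_normal(n)

X_tr, y_tr = simulate(n_train)
X_val, y_val = simulate(n_val)

# n > p, so ordinary least squares has a unique solution.
beta_ols, *_ = np.linalg.lstsq(X_tr, y_tr, rcond=None)

# Rank predictors by |beta| and score nested models on the validation set.
# Reading val_corr from the last entry to the first corresponds to shaving
# off the smallest betas one at a time.
order = np.argsort(-np.abs(beta_ols))
val_corr = []
for k in range(1, p + 1):
    keep = order[:k]
    pred = X_val[:, keep] @ beta_ols[keep]
    val_corr.append(np.corrcoef(pred, y_val)[0, 1])

best_k = int(np.argmax(val_corr)) + 1
print(f"best validation correlation {val_corr[best_k - 1]:.3f} with {best_k} predictors")
```

In this easy setting the ten real effects rank at the top. Whether the same holds with hundreds of thousands of correlated SNPs and small effects is exactly the open question in the comment.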

all of this depends on the assumption that the SNPs having significant effect are sparse, very sparse considering there are 10m SNPs. why should this be the case? and it also assumes that the total effect of those SNPs having insignificant effect is itself insignificant. why should this be the case?

I’m curious, utu. Would you be satisfied if God came down and told us the complete genetic architecture of height? Or would you still be spewing FUD?

I get that it is good to be cautious and ask appropriate questions, but do you really need to seem so dismissive?

Steve Hsu is a serious scientist with a background in a discipline (Physics) that has a culture with a history of being even more rigorous than Biology about proof. It is possible the researchers are making some sort of unintentional mistake (that is what validation and replication studies are for), but I think you need to at least give these results a chance of being true. And in particular you should refrain from slighting comments (no matter how vague) given the likelihood of further proof being available in the near future.

does the observation of a phase transition prove this sparsity? my guess is this technique will not recover the majority of the missing heritability. the professor’s latest for height claims to predict 36% of the variance.

“A complexity that we have discussed before is that internationally aggregated samples on intelligence will probably have been measured with different tests. For once, the theory of general intelligence assists us here, in that a comparable g can be extracted from a broad range testing procedures, putting all subjects onto the same g scale. An additional complexity is that for many samples no psychometric test scores are available, but scholastic tests are far more commonly obtainable. Scholastic attainment is very important, but it is not perfectly correlated with intelligence.”

1. g depends on the population and on the battery. it’s a statistic not a thing.

2. the difference between a battery of scholastic tests and IQ tests is purely nominal from the psychometric pov. the two correlate as well as soi-disant IQ tests correlate with one another.

"1. g depends on the population and on the battery. it's a statistic not a thing."

Some arguments by the g-cultists hinge on the assumption (reification) that g is a thing. But all we know is that g's obtained with the same FA procedures from different test batteries correlate with each other, but there is no 1-1 map between them. The same individual has different g's from different batteries of tests. Actually the scales of g must be tuned with respect to each other (via regression) to be able to compare them. The reification part comes in the belief that these two different g's are approximations of some true and real g which is a thing. Furthermore, it is possible intentionally to construct different batteries of tests to make g's correlate little with each other.

Because the SNPs found explained almost all of the variance GCTA predicted as being caused by SNPs. There is still work to do. We are only at around 40% of variance explained and height is much more heritable than that. Non-linear effects? Other genetic characteristics like CNVs (copy number variants)?

So you do not see that Lasso in this application (because of preselection of SNPs) is possibly overhyped? Are you saying that serious scientists are not prone to blowing their own trumpets? Are you saying that even if it is so we should not talk about it because, well, because why? Are you saying perhaps that you may have too many genes for obsequiousness and sycophancy?

I think what you are referring to are the abilities to:

- Identify and respect ability.

- Appreciate when people do things from which I benefit (like Dr. Thompson’s blog; or an excellent athletic teammate, captain, or coach).

Don’t confuse respect and appreciation with “obsequiousness and sycophancy.”

If you knew me in real life you would know that I am particularly bad at “obsequiousness and sycophancy” when rank is unaccompanied by either ability or positive impact in the world. I would have thought this would come through in my comments to some people who happen to think they are amazing but can’t quite seem to turn that into good constructive comments (at least with a decent signal to noise ratio). But what do I know.

"Identify and respect ability."

- How can you identify without permitting yourself room for critical thinking?

"Appreciate when people do things from which I benefit"

- Basically you are in a constant search of confirmation (bias).

Your "scientific method" comes down to egocentric pragmatism. For this reason you won't allow yourself to see that Lasso in this application (because of preselection of SNPs) is possibly overhyped? Finding truth is sometimes painful and demands sacrifices. It is not an easy process of pleasuring yourself like fruitless masturbation.

well, well, well…

the professor’s preprint shows two things.

1. height is much more heritable than IQ, or height is a lot spars-er than IQ. it’s one or the other, or both. i don’t see why genetic causes of differences in height should be more additive than they are for IQ.

2. on the out of sample UKBB data set the asymptotic heritability was 0.4 for height. on the american out of sample set the same was only 0.29. here heritability is additive heritability from common SNPs.

1. the same data set which predicted 40% of the variance in height could predict only 9% of the variance in educational attainment. but maybe that’s a problem with the variable “educational attainment”.

2. illustrates that local estimates of heritability may agree numerically but for different genetic reasons, norms of reaction. thus the “real” heritability must be estimated across populations and countries.

“illustrates that local estimates of heritability may agree numerically but for different genetic reasons, norms of reaction. thus the ‘real’ heritability must be estimated across populations and countries.”

(1) Heritability is as much a measure of genetic determinism as of environmental determinism.

(2) Heritability is not a universal constant. One can imagine society where heritability would be significantly reduced or increased. Just let the state to decide what is good for each of your children. Who will be a peasant and who will be an aristocrat.

(3) It is possible that twin studies heritability is a skewed snapshot of the actual across the whole sample heritability.

(4) The definition of heritability assumes that variance can be factorized: V(T)=V(G)+V(E), where T,G and E are trait, genes, environment, respectively. It is possible that this assumption does not hold over larger domains.

(5) Genetic heritability defined as the square of the correlation between the trait T and its predictor function f(G) is sound, as it measures the variance of the residuals T - f(G). Explained vs. unexplained variance is a sound concept. Much more so than all the assumptions going into Falconer's formula in twin studies. So I go with genetic heritability rather than twin-study heritability. In other words, the missing heritability gap is on the high end (the twin studies) and not on the low end (the gene studies).

(6) Hsu result explains 40% of variance with 20k SNPs. Additional SNPs are on diminishing return curve (see SNPs between s=20k and s=35k in Figure 1). So among p=100k SNPs that were prescreened no more explanation can be found than 40%. Should other SNPs be looked for among the ones rejected by the prescreening? Or should a non-linear model be tried?

I hope that Hsu and his co-researchers will go after the non-linear problem. His 2015 paper, “Determination of nonlinear genetic architecture using compressed sensing” (https://gigascience.biomedcentral.com/articles/10.1186/s13742-015-0081-6), may suggest that he is gearing up for it.

Perhaps the nonlinear model first approximation could be tried on residuals using the p=20k non-zero SNPs identified by the linear model. This would greatly lower the number of variables.
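Point (5) above, that squared correlation equals the share of variance not left in the residuals, can be checked numerically. The sketch below uses simulated numbers only, with a single standard-normal score standing in for f(G), and it assumes the predictor has been put on the trait's scale by a simple linear regression (the identity is exact in that case). The target correlation of 0.65 is borrowed from the figure quoted upthread.

```python
import random

random.seed(0)
n = 5000
target_r = 0.65

# g stands in for the genetic predictor f(G); t is a trait built to correlate ~0.65 with it.
g = [random.gauss(0, 1) for _ in range(n)]
t = [target_r * gi + random.gauss(0, (1 - target_r ** 2) ** 0.5) for gi in g]

def mean(x):
    return sum(x) / len(x)

def cov(x, y):
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)

r = cov(g, t) / (cov(g, g) * cov(t, t)) ** 0.5

# Regress the trait on the predictor, then compare r^2 with 1 - Var(residual)/Var(T).
slope = cov(g, t) / cov(g, g)
intercept = mean(t) - slope * mean(g)
resid = [ti - (intercept + slope * gi) for gi, ti in zip(g, t)]
explained = 1 - cov(resid, resid) / cov(t, t)

print(f"r = {r:.3f}, r^2 = {r * r:.3f}, 1 - Var(resid)/Var(T) = {explained:.3f}")
```

On this reading, a validation correlation of 0.65 corresponds to 0.65² ≈ 0.42 of the variance, in line with the roughly 40% figure quoted for height.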

but machine learning may put the nature vs nurture debate to bed forever. but only on the side of nature. if machine learning cannot find the missing heritability, the hereditists will simply claim they need more data or better algorithms.

my hunch is that the “real” heritability, the environment independent effect of genes on one’s rank within his population, aka the absence of norm crossing, will be found to be significant but far less than the twin studies and GCTA have found. that is, statistically significant, but practically insignificant, especially for the individual.

As is well known.

Not known yet. Though perhaps the ~20k SNPs identified indicate IQ is slightly less sparse than estimated (~10k).

Kind of. The lower measurement error for height implies the phase change will occur with a smaller sample for height than for EA. Here is what the paper said about EA: “The corresponding result for Educational Attainment does not indicate any approach to a limiting value. Using all the data in the sample, we obtain maximum correlation of ∼ 0.3, activating about 10k SNPs. Presumably, significantly more or higher quality data will be required to capture most of the SNP heritability of this trait.” So not at the phase change for EA yet, but 10k SNPs identified suggests it is not far off.

I think the most relevant measure of heritability here is that applicable to the study population. Currently these studies are being done within populations rather than between.

Looks like we’re supposed to migrate to the new blog dedicated to this new discovery: might be more fun to stay put.

It was a smart pivot to go over one on Delta. Figure 1A and 1B of the above cited paper seem to give rho and Delta a range that stops at 1, though as shown in the current paper Delta can be greater than 1 (in the current instance it is 5 or 10). A Delta over 1 means that you have more people than genotypes. As utu notes, with Delta over 1 the system is no longer underdetermined. It is a wise strategy to move to the right on Figure 1B. As you move over .9 on the x axis the spectrum of colors spikes upward. If the same theory applies for Delta above 1 as below, then at rho=s/n=20,000/500,000=0.04 and Delta = 5 or 10, it is not difficult to imagine that you would be situated in a nice dark blue colored region. By the previously disclosed phase change theory, that should mean that you have captured all the SNPs and the errors of the betas should be low.
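For reference, the two ratios in that paragraph work out as follows. The sketch assumes Delta = n/p and rho = s/n, which is how they are used in the comment, and takes the comment's round numbers (n = 500,000, s = 20,000) rather than the exact training-set size. The function name is illustrative.

```python
def ratios(n, p, s):
    """Return (Delta, rho) with Delta = n / p and rho = s / n (assumed definitions)."""
    return n / p, s / n

n, s = 500_000, 20_000            # round numbers used in the comment
for p in (50_000, 100_000):       # candidate SNP counts after prescreening
    delta, rho = ratios(n, p, s)
    print(f"p={p:,}: Delta={delta:.0f}, rho={rho:.2f}")
```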

I was disappointed that they did not present their results in the same format as in the previous paper, using figures such as Figures 1, 2, etc. I am not completely sure whether this could all be done with real as opposed to simulated data, though if it were possible, it would be highly persuasive evidence.

Jorge started me thinking about reducing p. What would that mean?

Throwing away SNPs would be throwing away some of the signal. Surprisingly in the article they imputed up to 92 million and then threw away over 91 million of the SNPs. This would be a useful trick to use with many other traits.

utu, might they have kept the Lasso to take advantage of the neat property of forcing terms to zero? As seen in Figure 1 of the previous article, crossing above Delta 1 would not immediately lead to a solution. The ability of the Lasso to cleanly remove 70,000 zero betas is quite useful. I am not aware whether any other regression method could be used that would also achieve this desirable result.

“utu, might they have kept the Lasso to take advantage of the neat property of forcing terms to zero?”

I do not think it happens the way you think. There must be a threshold such that once beta is below it the term is set to zero.
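On the narrow question of where the zeros come from: for a single coefficient (or an orthonormal design), the L1-penalised least-squares solution is the soft-thresholded unpenalised estimate. Coefficients within lambda of zero come out as exactly zero and the rest are shrunk by lambda. So there is a threshold of sorts, but it is set by lambda inside the optimisation rather than applied afterwards. A minimal sketch contrasting this with an after-the-fact hard threshold follows; the function names are illustrative and the paper's actual solver is not shown here.

```python
def soft_threshold(z, lam):
    """Minimiser of 0.5*(z - b)**2 + lam*abs(b): shrink by lam, clip at exactly 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def hard_threshold(z, lam):
    """Post hoc cut-off: keep z unchanged if |z| > lam, otherwise set it to 0."""
    return z if abs(z) > lam else 0.0

lam = 0.5
for z in (-2.0, -0.4, 0.0, 0.3, 0.6, 2.0):
    print(f"z={z:+.1f}  soft={soft_threshold(z, lam):+.2f}  hard={hard_threshold(z, lam):+.2f}")
```

Both rules zero out the same small inputs, but only the soft rule also shrinks the survivors, which is why Lasso betas are biased toward zero.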

res, thank you for responding regarding how to post figures and other media.

Most forums and blogs allow upload of files from a computer, which would be helpful on this blog also. (Not all the figures etc. that one might want to post have an identifier.)

- Identify and respect ability.

- Appreciate when people do things from which I benefit (like Dr. Thompson's blog; or an excellent athletic teammate, captain, or coach).

Don't confuse respect and appreciation with "obsequiousness and sycophancy."

If you knew me in real life you would know that I am particularly bad at "obsequiousness and sycophancy" when rank is unaccompanied by either ability or positive impact in the world. I would have thought this would come through in my comments to some people who happen to think they are amazing but can't quite seem to turn that into good constructive comments (at least with a decent signal to noise ratio). But what do I know.

Identify and respect ability.– How can you identify without permitting yourself room for critical thinking?Appreciate when people do things from which I benefit– Basically you are in a constant search of confirmation (bias).Your “scientific method” comes down to egocentric pragmatism. For this reason you won’t allow yourself to see that Lasso in this application (because of preselection of SNPs) is possibly overhyped? Finding truth is sometimes painful and demands sacrifices. It is not easy process of pleasuring yourself like fruitless masturbation.

It was a smart pivot to go over one on Delta. Figure 1A and 1B of the above cited paper seem to give rho and Delta a range that stops at 1, though as shown in the current paper Delta can be greater than 1 ( in the current instance is 5 or 10). A Delta over 1 means that you have more people than genotypes. As utu notes, with Delta over 1 the system is no longer under determined. It is a wise strategy to move to the right on Figure 1B. As you move over .9 on the x axis the spectrum of colors spikes upward. If the same theory applied as increase Delta above 1 as below, then at rho=s/n=20,000/500,000=0.04 and Delta = 5 or 10, it is not difficult to imagine that you would be situated in a nice dark blue colored region. By the previously disclosed phase change theory should mean that you have captured all the SNPs and the errors of the betas should be low.

I was disappointed that they did not present their results in the same format as in the previous paper, with figures such as Figures 1 and 2. I am not completely sure whether this could all be done with real as opposed to simulated data, though if it were possible, it would be highly persuasive evidence.

Jorge started me thinking about reducing p. What would that mean?

Throwing away SNPs would be throwing away some of the signal. Surprisingly, in the article they imputed up to 92 million SNPs and then threw away over 91 million of them. This would be a useful trick to use with many other traits.

utu, might they have kept the Lasso to take advantage of the neat property of forcing terms to zero? As seen in Figure 1 of the previous article, crossing above Delta = 1 would not immediately lead to a solution. The ability of the Lasso to cleanly remove 70,000 zero betas is quite useful. I am not aware of another regression method that would also achieve this desirable result.

"utu, might they have kept the Lasso to take advantage of the neat property of forcing terms to zero?"

I do not think it happens the way you think. There must be a threshold such that once a beta falls below it the term is set to zero.
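The threshold idea can be made concrete with the soft-thresholding operator. In the textbook special case of orthonormal predictors, the Lasso estimate of each coefficient is the ordinary least squares estimate shrunk toward zero by the penalty, and it is set exactly to zero when its magnitude is below the penalty. The sketch below illustrates only that special case; the names and numbers are made up for illustration and are not taken from the paper.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam; return exactly 0.0 when |z| <= lam."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


# Hypothetical unpenalized estimates and a hypothetical penalty level.
ols_betas = [2.0, 0.3, -0.1, -1.5]
lasso_betas = [soft_threshold(b, 0.5) for b in ols_betas]
print(lasso_betas)  # [1.5, 0.0, 0.0, -1.0]
```

So both views in this exchange hold at once: there is a threshold, and the terms below it come out as exact zeros rather than merely small numbers.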

They found 10k EA? meh?

Amazing a year ago they had close to zero and now we move to 10k and it is barely even acknowledged. This result steals the thunder for the 1 million GWAS that we have all been waiting for; that group was indicating about 1000 SNPs. If I were them I think I would perhaps pull the article from the pipe and redo the number crunching with CS. It seems reasonably possible that at a million with Lasso? they could report up to the limit of common SNPs if this has not already been achieved as suggested in the current blog.

I would greatly like to see a Figure 1B for EA expanded to Delta=5. How could the current study have gone to Delta=5 without the spectrum surging upward? As soon as you dip into the deep blue ocean color you should quickly obtain all the SNPs with small beta errors. Other research found that EA had the most non-zero SNPs of the traits tested, so the true number of SNPs might be a few times higher than 10k. I hope some future research returns to reporting results in terms of phase boundaries, rhos, Deltas, median p-values of non-zero SNPs…

I downloaded R and installed a package that does Lasso. imagic has a function with parameters x,beta etc. the one I am not sure about is T? What is T?

Now that there will be a growing awareness and interest in CS and phase boundaries I look forward to some additional packages that could produce results similar to that of Figure 1.

the professor's preprint shows two things.

1. height is much more heritable than IQ, or height is a lot sparser than IQ. it's one or the other, or both. i don't see why genetic causes of differences in height should be more additive than they are for IQ.

2. on the out of sample UKBB data set the asymptotic heritability was 0.4 for height. on the american out of sample set the same was only 0.29. here heritability is additive heritability from common SNPs.

1. the same data set which predicted 40% of the variance in height could predict only 9% of the variance in educational attainment. but maybe that's a problem with the variable "educational attainment".

2. illustrates that local estimates of heritability may agree numerically but for different genetic reasons, norms of reaction. thus the "real" heritability must be estimated across populations and countries.

"illustrates that local estimates of heritability may agree numerically but for different genetic reasons, norms of reaction. thus the “real” heritability must be estimated across populations and countries."

(1) Heritability is as much a measure of environmental determinism as of genetic determinism.

(2) Heritability is not a universal constant. One can imagine a society where heritability would be significantly reduced or increased. Just let the state decide what is good for each of your children: who will be a peasant and who will be an aristocrat.

(3) It is possible that heritability from twin studies is a skewed snapshot of the actual heritability across the whole sample.

(4) The definition of heritability assumes that variance can be factorized: V(T)=V(G)+V(E), where T,G and E are trait, genes, environment, respectively. It is possible that this assumption does not hold over larger domains.

(5) Genetic heritability defined as the square of the correlation between the trait T and its predictor function f(G) is sound, as it measures the variance of the residuals T - f(G). Explained vs. unexplained variance is a sound concept, much more so than all the assumptions going into Falconer’s formula in twin studies. So I go with genetic heritability rather than twin-study heritability. In other words, the missing heritability gap is at the high end (the twin studies) and not at the low end (the gene studies).

(6) Hsu's result explains 40% of variance with 20k SNPs. Additional SNPs are on a diminishing-returns curve (see SNPs between s=20k and s=35k in Figure 1). So among the p=100k SNPs that were prescreened no more explanation can be found than 40%. Should other SNPs be looked for among the ones rejected by the prescreening? Or should a non-linear model be tried?

I hope that Hsu and his co-researchers will go after the non-linear problem. His 2015 paper

Determination of nonlinear genetic architecture using compressed sensing (https://gigascience.biomedcentral.com/articles/10.1186/s13742-015-0081-6)

may suggest that he is gearing up for it.

Perhaps a first approximation of the nonlinear model could be tried on the residuals, using the p=20k non-zero SNPs identified by the linear model. This would greatly lower the number of variables.

Worth noting that the UKBB data used here was (IIRC) the largest sample in the recent meta-study discussed here.

Have you looked at the help in R? It tends to be good (varies by package though). Try ? and see if that tells you what you want.

I think you are right. The question is what kind of sample size is necessary. I have not seen an estimate for that. One issue is what you mean by nonlinear. Quadratic terms double the number of possible variables, but interaction terms square the number of variables. Very different when you are talking about 20k SNPs.
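To put rough numbers on this, the following sketch counts candidate terms for p SNPs. It counts each unordered pair of distinct SNPs once, which is one convention among several; counting ordered pairs instead gives about p^2. The function name is illustrative.

```python
def candidate_terms(p):
    """Count candidate model terms for p SNPs."""
    linear = p                    # one main effect per SNP
    squared = p                   # one self-interaction (quadratic) term per SNP
    pairwise = p * (p - 1) // 2   # unordered pairs of distinct SNPs
    return {"linear": linear, "squared": squared, "pairwise": pairwise}


counts = candidate_terms(20_000)
print(counts["linear"] + counts["squared"])  # 40000
print(counts["pairwise"])                    # 199990000
```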

"The question is what kind of sample size is necessary."

In his paper they say:

"Typically, n∗ ∼ 100 × sparsity, where sparsity s is the number of loci identified by Step 1"

on page 11. So they use a multiple of 100 instead of 30 as in the linear case. But if they reduce the number of SNPs to p=20k, using the ones discovered in the linear part, that is 5 times fewer than they started with in the linear paper.
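Taking the two rules of thumb quoted in this thread at face value (about 30 × s samples in the linear case and about 100 × s in the nonlinear case), the implied sample sizes can be worked out directly. This is only arithmetic on the quoted multiples, with s = 20,000 used as an example value; it is not a calculation from the paper.

```python
def samples_needed(sparsity, multiple):
    """Rule-of-thumb sample size: multiple times the number of non-zero loci."""
    return multiple * sparsity


s = 20_000  # example sparsity, chosen to match the 20k SNPs discussed in this thread
print(samples_needed(s, 30))   # 600000
print(samples_needed(s, 100))  # 2000000
```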

But we have no reasonable estimate of the sparsity of the quadratic and (especially) interaction terms.

I think it is safe to guess they are even sparser than the linear terms, but if we are talking about (20k)^2 = 4e8 possible terms I don’t think that says much.

Could you include a longer quote or a more direct citation (e.g. paper and page) so I can tell if the estimate you give only relies on loci and not the nonlinear term type? It’s been a while since I went through the nonlinear CS paper.

Page 11

I don’t understand what’s the point of making citizens brighter if there’s no use of that intelligence in real world. In my experience it’s the dumbest and the most cruel who succeed.

Thanks. I forgot that their overall method (equation 7) is for gene-gene interactions and only includes the quadratic terms as a special case where a SNP interacts with itself.

I am surprised the performance degrades so little with all of those additional terms. I did not see that addressed in the paper (but could easily have missed it, I was not thorough). Do you know of any discussion (in the paper or elsewhere) about that in more detail?

The discussion below equation 9 was helpful but I still feel like I am missing something.

The two step nature of the process (find linear terms first) is important, but I am unable to see the implications.

Anyone have any idea of what the upper bound for correlation of height might be?

r=0.65 is impressive: how much higher could this go?

If they include non-linear effects, perhaps this adds another third; then they could possibly include pedigree data.

I will be interested to see what methods might be tried to bring in the outliers from the predicted versus actual Figure from the article. Some of those outliers were far off, something very unusual must be happening for them to be that different than expected. I wonder how easy it would be to pick this up from analyzing their genotypes.

What is the environmental input into the variation of height?

Answer to "how much higher could this go?": the square root of 0.8, about 0.894, if twin-based heritability is valid.

I have to say that Hsu's papers are very accessible, fairly transparent and devoid of obscure jargon. Perhaps that is because he is a physicist and not a statistician. Perhaps it is also because he is an outsider trying to popularize his approach. I can’t say the same about papers from the GWAS and GCTA crowd.

Really? Correlation = 0.894 ?

That would be very impressive.

Is there a physical interpretation of correlation to help make sense of it?

For example, as correlation increases, how will the proportion of those outside of 1 SD change ?

0.90 correlation –> what proportion of 1 SD+ outliers?

utu, you have predicted that the next step will be non-linear Lasso.

Where will they go after that to find more of the variance?

Some in the psychometric community also expect that IQ has correlation at the high end of the range.

Extremely excited to see how the research advances from here.

What could we do to have the most recent comment appear at the top of the list and not the bottom?

A 0.90 correlation implies that 0.9^2 = 0.81 is the ratio of the variance explained by the predictor to the variance of the data; the residuals carry the remaining 0.19. This ratio is the heritability. Heritability from twin studies does not need to be the same, as it is possible that various assumptions made in the derivation of Falconer’s formula are not fulfilled by the available data.
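For the question about outliers, here is a rough sketch under a strong assumption: if the trait is standardized and the prediction errors are normally distributed, then the error has standard deviation sqrt(1 - r^2) in trait SD units, and the share of people missed by more than one SD follows from the normal tail. The function names are illustrative.

```python
import math


def residual_sd(r):
    """SD of the prediction error, in trait SD units, for correlation r."""
    return math.sqrt(1.0 - r * r)


def fraction_missed_by(r, k=1.0):
    """Fraction of people whose prediction error exceeds k trait SDs.

    Assumes normally distributed errors (an assumption, not a result).
    """
    z = k / residual_sd(r)
    return math.erfc(z / math.sqrt(2.0))  # two-sided normal tail


for r in (0.65, 0.90):
    print(f"r={r}: error SD={residual_sd(r):.3f}, "
          f"missed by >1 SD: {fraction_missed_by(r):.3f}")
```

Under these assumptions the share missed by more than one SD falls from roughly 19% at r = 0.65 to roughly 2% at r = 0.90.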

I do not know what they are going to do next. If Hsu’s result meets no objections and if there are no more SNPs to be found that would increase the correlation with the linear model, then a natural step is to add a nonlinear term to the model. Since Hsu wrote a paper about the non-linear method two years ago, I presume that is what he is gearing up to do. In the meantime I think some people will scrutinize his result and try to replicate it with different methods and different data.

I suspect that Hsu would like to make some breakthrough in IQ prediction. There is a data set of 300k that Posthuma et al. have used, so he could try his method on this set.

Another 2.5% of the variance hidden in compound heterozygotes.

Did these regions hit genome wide significance as single SNPs?

Would CHs be picked up using a non-linear Lasso?

PMID: 28921393

Thanks for the reference, but why not just link?

https://www.ncbi.nlm.nih.gov/pubmed/28921393

I don’t see the full paper anywhere, but the abstract looks interesting. Both for explaining that much variance and for their use of a sample enriched for tall people.

They mentioned that a replication sample also showed significant results, but did not give the variance explained for that sample.

Based on this:

It appears they are only looking at pairwise SNP interaction. Thus the CS in the recent paper would not detect this, but nonlinear CS as mentioned by utu above would.

I am curious (and a bit concerned about) how they managed multiple-hypothesis correction for pairwise SNP interactions.

The recent CS approach probably found essentially all of the linear effect SNPs for height. With this nearly complete list of SNPs from a sample size of 500,000 people, the GCDH would likely be even more effective in finding more of the variance than the above cited GCDH research. The GCDH sample size used was small and they were concerned with the computational burden that would be involved in a larger sample such as GIANT. Using a data reduction step such as CS Lasso the computations needed should be greatly reduced.

Does it make any sense to think that a SNP could be entirely rejected under the CS Lasso, yet could be significant under a GCDH?

Hard to know which would be better to follow on the linear L1 Lasso: non-linear L1 Lasso or generalized collapsed double heterozygosity (GCDH). It is especially impressive in the GCDH approach that even a modest sample size produced strong results.

Either of these could be highly effective approaches to combine with the recent research. If all you need to do is do pairwise comparisons of nearby SNPs, the computational burden is dramatically reduced. You might be down to possibly ~200,000 comparisons. Might be a good idea to include the full set of 50,000 or so that was found in the bigger model because those SNPs might have a larger effect when in a compound heterozygote state.

I don’t know the GCDH methodology well enough to know their conditions for non/detection, but from the nonlinear CS paper (linked by utu above):

More discussion of that on page 6 (or search for hide).

Back to you:

Agreed. I would like to see the full paper to learn more about GCDH and evaluate the results a bit more (e.g. look at percent variance explained out of sample). Given the sample size I have some concern about just how strong the results were. Contrary to utu’s belief I am not completely credulous.

yes. res and utu, i understand all that. it’s a pity no hereditists do. including herr professor doktor doktor hsu afaict.

when i refer to the “real” heritability what i mean is the hypothetical asymptote which the rank correlation reaches across all populations and environments. (that wasn’t very clear. will explain later.) if this asymptote is 0 then the hereditists are wrong. of course no one believes in the blank slate. this is the ultimate straw man. but the asymptote being 0 does not require the blank slate.

i know that h is merely the coefficient of G in the planar fit to the P(G, E) surface.

the linear model is P = hG + sqrt(1 – h^2)E.

if y’all don’t know what i’m talking about, here’s a paper i found which splains it.

http://www.faculty.biol.ttu.edu/rice/rice08b.pdf

this a bit “inside baseball” but…

if machine learning were applied to a globally representative sample, where the phenotype was within-population rank, afaik the machine might still do a local estimate due to population differences in genes.

the professor’s most recent paper actually claims that had the america sample been part of the data the effect sizes would have been slightly different such that both the UKBB and the america sample would have a 0.65 correlation with the predicted phenotype.

maybe 0.65 vs 0.54 is not significant. idk. but the US and blighty are much closer than two randomly selected societies would be.

i expect none of the foregoing is comprehensible, so here’s a perhaps equally incomprehensible paraphrase:

when the P(G) function is estimated via machine learning in one “locale”, and the same function is applied to another locale and the fit is inferior, this supports the norms of reaction criticism.

BUT the claim that a broader data set would not have this problem IGNORES that…

the broader fit may still be LOCAL, may merely be two fits masquerading as one.

I am not seeing the claim in your first paragraph. Can you point me to it?

I assume you are talking about this bit in section A.5?

If I read that correctly there are a few things causing the difference between 0.65 and 0.54. The first three figures below are all on UKBB.

The full imputed dataset has r=~0.65

The un-imputed dataset has r=~0.61

The un-imputed dataset limited to the SNPs also present in ARIC has r=~0.58

The ARIC dataset using those same SNPs has r=~0.54

So IMHO the apples to apples comparison between UKBB and ARIC correlations (0.58 vs. 0.54) only shows a difference of about 0.04. That is much better than the more obvious conclusion of about 0.11.

the n vs n argument is taken to occur in infinite space, so to say.

but there are FACTS which would settle the argument FOREVER.

AND these facts are now discoverable.

it seems to me both sides don’t know what those facts are; the argument can’t be resolved.

so what you get is “erudition” as authority. a mastery of the behavior genetics literature is wisdom.

stop!

ask yourself, “what are the facts which would decide the issue permanently?”

and know that until such facts are determined the debate is…gay.

is every even number a sum of two primes?

This equation, P = hG + sqrt(1 – h^2)E, is probably never true. G and E are not really additive. What you wrote may have only symbolic meaning.

You could, however, write P=f(G)+g(E)+∆(G,E), where the functions f and g are defined in such a way that the rms of the residual function ∆(G,E) is minimal.

Functions f and g are not unique; they are unique only to within a constant, as the following shows: P=(f(G)+c)+(g(E)-c)+∆(G,E). Because of the constant c you can never say how much of P is due to G and how much is due to E for any individual subject. Besides, you still have to account for ∆(G,E) in the sum.

From P=f(G)+g(E)+∆(G,E) one can try to define heritability. Again, there is no unique definition as long as the residual function ∆(G,E)≠0.

If ∆(G,E)=0 then yes, you can partition the variance into two terms, each of which depends on only one variable: V_P=V_G+V_E. One can find this equation everywhere, but it is not general enough and thus it is wrong.

But when ∆(G,E)≠0 then you can write either

From the first equation (i) you will get heritability as h^2=V_G/V_P and from the second equation as h^2=1-V_E/V_P. These two heritabilities are not the same because V_E+V_G≠V_P when ∆(G,E)≠0.

This is interesting and raises the question of how much the two heritabilities differ. When you create a predictor model based on genes, just as Prof. Hsu did, you maximize V_G by fitting f(G) to the data, so heritability is as large as it can be. However, let’s imagine that we have the ability to make a model of the function g(E) and fit it to the data. Then we will maximize V_E and thus the heritability will be smaller. In other words:

So if we lived in a universe where the environment is easier to quantify and account for than genes are, then we would be dealing with lower heritabilities. Neither of these two heritabilities is more right than the other. This all stems from the ambiguity of the definition and the fact that G and E are not completely additive.
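A toy simulation makes the gap between the two definitions visible. It assumes, purely for illustration, independent standard normal G and E and a phenotype P = G + E + k·G·E, so that f(G) = G, g(E) = E and ∆(G,E) = k·G·E; none of this comes from the paper, and the function names are made up. With k = 0 the two heritabilities agree, and with k ≠ 0 they separate.

```python
import random


def variance(xs):
    """Population variance of a list of numbers."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)


def two_heritabilities(k, n=100_000, seed=1):
    """Return (V_G/V_P, 1 - V_E/V_P) for the toy model P = G + E + k*G*E."""
    rng = random.Random(seed)
    g = [rng.gauss(0.0, 1.0) for _ in range(n)]
    e = [rng.gauss(0.0, 1.0) for _ in range(n)]
    p = [gi + ei + k * gi * ei for gi, ei in zip(g, e)]
    v_p = variance(p)
    return variance(g) / v_p, 1.0 - variance(e) / v_p


for k in (0.0, 1.0):
    h2_from_g, h2_from_e = two_heritabilities(k)
    print(f"k={k}: V_G/V_P={h2_from_g:.2f}, 1-V_E/V_P={h2_from_e:.2f}")
```

In expectation V_P = 2 + k^2 in this toy model, so for k = 1 the first definition gives about 1/3 and the second about 2/3.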

I haven’t resolved the question yet which of the two heritabilities is approximated by twin studies.

I hope the blogger res will take a look at this though I am not sure if such subtleties are within the interests of the egocentric pragmatist who is on the constant hunt for confirmations of his biases.

censorship is sad!

even sadder than psychology.

unz will soon hear. sad!

Replies:@Difference makerThe distinction made is important, to be sure

utu, I am somewhat unclear about the meaning of additive heritability. From what I understand narrow sense heritability includes many recessive loci. Wiki shows a figure in which there is a linear relationship between the genotype and phenotype. The models for additive genetic effects are based on the number of a certain allele present (0, 1 or 2). I am unclear about why the heterozygote has a non-zero phenotype as shown in the wiki figure. I would have guessed that the heterozygote would be at zero.

I am also unsure why essentially all of the additive effects would appear to work in this recessive manner. Could not a portion of all these thousands of loci exhibit incomplete dominance? Might there not be non-linear relationships amongst the additive SNPs?

Does the current article include dominant effects? It seems reasonable to expect that the Lasso technique could be adapted for dominant loci.

If the CS method were to be run again on another large sample from a different population (Asian, African or other), then would replicating a SNP suggest that such a SNP were causal? Would finding a correlated SNP in the follow on study help to narrow down the search space for the causal variant?

With all these many thousands of SNPs emerging with very small effects, the next step in the research could be to find the causal SNPs. If people wanted to use a gene editing technology to modify phenotype it would be necessary to know what loci needed to be changed. How difficult a problem will finding these loci be, now that the SNPs contributing to the additive component of many traits appear to be almost completely recoverable?

The article did not provide any specific information on any of the SNPs found. Would an article such as this actually stay as is without reporting the full results? If so, that would be very difficult to understand. It would mean that science is so closed that even the results of the research are not fully disclosed. How could that be justified?

It would be reasonable to disclose the full list of the SNPs perhaps including 50,000, with the betas, p-values etc. as a minimum (disclosing other information related to crossing the phase boundary etc. could also be appended to the supplement). Such disclosure would allow others to confirm and extend the current results which is centrally important to the scientific process.

I imagine utu will also reply, but here is my take.

There may be non-linear (dominance, interactions/epistasis) effects. Additive heritability tells us how much variance is explained by only the additive (linear) component. Then there is a gap of “unexplained genetic heritability” with components including nonlinear effects and non-SNP (e.g. CNVs) genetics. A key question is estimating the relative contributions of these components (e.g. through heritability and GCTA results). An important result of this paper was explaining additive variance close to the GCTA estimate of total additive variance.

The current article only includes linear effects (so no dominance). The non-linear CS approach we have been discussing adds non-linear effects.

The way I think about additive and dominance effects is to visualize a graph of quantitative phenotypic result for a given SNP on the y axis with 0, 1, 2 on the x axis. I assume a single SNP with only 2 relevant alleles (called major and minor depending on population frequency). The x axis is the number of minor alleles present.

Consider the following cases.

- Purely additive inheritance is a straight line.

- Dominance is a step function if complete and tends to that if partial.

- Heterozygote advantage has 1 highest (a bump in the middle). I think either 2 or 0 is usually close, making it similar to a step function.

The implications of this depend on the relative frequencies of the SNP combinations (higher frequency affects variance explained more). In most cases a linear fit can do a good job of explaining most of the variance explained by a given SNP (especially if weighted by SNP frequencies). This is why the additive (linear) model has more power than one might expect. (Linear models tend to be surprisingly effective in modeling and are generally worth at least trying.)

An important aspect of this is that for low minor allele frequencies (commonly called MAF) the frequency of having 2 minor alleles becomes almost zero (MAF^2). This makes almost the entire SNP contribution additive and simply a straight line between 0 and 1 minor alleles.
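
The three cases above can be checked numerically. The following is a minimal sketch, not anything taken from the paper: it assumes Hardy-Weinberg genotype frequencies, and the function name `additive_fraction` and the toy genotype values are made up for illustration. For a single SNP it computes the share of the genotype-level variance that a frequency-weighted straight-line fit on minor-allele count captures.

```python
import numpy as np

def additive_fraction(maf, genotype_values):
    """Share of one SNP's genetic variance captured by a straight-line fit
    on minor-allele count (0, 1, 2), with genotypes weighted by their
    Hardy-Weinberg frequencies (an assumption of this sketch)."""
    q = maf
    p = 1.0 - q
    freqs = np.array([p * p, 2 * p * q, q * q])   # P(0), P(1), P(2) minor alleles
    x = np.array([0.0, 1.0, 2.0])
    y = np.asarray(genotype_values, dtype=float)
    mx, my = freqs @ x, freqs @ y                  # weighted means
    cov_xy = freqs @ ((x - mx) * (y - my))
    var_x = freqs @ ((x - mx) ** 2)
    var_y = freqs @ ((y - my) ** 2)
    return cov_xy ** 2 / (var_x * var_y)           # weighted R^2 of the linear fit

# Hypothetical phenotype values for genotypes with 0, 1, 2 minor alleles.
additive = [0.0, 1.0, 2.0]       # straight line
dominant = [0.0, 1.0, 1.0]       # complete dominance: step function
overdominant = [0.0, 1.0, 0.0]   # heterozygote advantage: bump in the middle

for maf in (0.5, 0.2, 0.05):
    print(maf,
          round(additive_fraction(maf, additive), 3),
          round(additive_fraction(maf, dominant), 3),
          round(additive_fraction(maf, overdominant), 3))
```

In this toy setup the linear fit is exact for the additive case at any frequency, and for the step-function case the captured share rises toward 1 as the minor allele frequency falls, because genotypes with 2 minor alleles become rare.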

P.S. I hope this helped. It would be better with graphs (perhaps there is a writeup like that on the web?).

P.P.S. I find heterozygote advantage fascinating and believe it has important implications: https://en.wikipedia.org/wiki/Heterozygote_advantage

res, thank you. Your explanation was helpful.

I am quite surprised that given all the noise in the data that the genetic research community were able to pin down the numbers so well. My interpretation when I first encountered ideas such as heritability etc. was that they really could not be reduced to a simple algebraic solution. When you consider how much noise was present at different points in the phase diagram that assessment is not far off. Yet at the same time using a global maximization strategy with Lasso still involves an elegant and simple formula which after being applied to the data set gives a highly precise result.

Extracting all the common additive heritability for height in one computation is a substantial achievement.

I am greatly anticipating the time when all the remaining heritability and the causal SNPs are found.

“I am quite surprised that given all the noise in the data that the genetic research community were able to pin down the numbers so well.”

How do you know what the numbers are supposed to be?

“Lasso still involves an elegant and simple formula which after being applied to the data set gives a highly precise result.”

How do you know? It is possible that the reason the Lasso method did not choke or get lost is only that Hsu, by filtering SNPs, lowered their number to p=50k or 100k, converting the underdetermined problem p>>n into a problem where n>p that can be handled with the standard linear LSQ method. His Lasso (L1) method was not really tested on a hard case. All the filtered SNPs he had correlated with y, and presuming that SNPs do not correlate much with each other (matrix A is close to an isometric matrix, as Hsu wrote elsewhere), adding extra SNPs to the model always increases correlation, as his Figure 1 shows.

You are in awe, res is his usual cheerleader, and my job as a rational skeptic is to bring you guys back down to reality.
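
The distinction between the p>>n case and the n>p case can be illustrated on synthetic data. What follows is a toy sketch only: the sizes, the Gaussian design matrix, the penalty value and the helper names `soft_threshold` and `lasso_ista` are all assumptions made for illustration, and nothing here reproduces the paper's data or pipeline. It shows that with p > n the least-squares system has no unique solution, while an L1-penalised fit can still pick out a sparse signal.

```python
import numpy as np

def soft_threshold(v, t):
    """Shrink every entry of v toward zero by t (proximal step for the L1 penalty)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(A, y, lam, n_iter=3000):
    """Minimise (1/2n)||y - A x||^2 + lam * ||x||_1 by proximal gradient descent."""
    n, p = A.shape
    step = n / np.linalg.norm(A, 2) ** 2     # 1 / Lipschitz constant of the gradient
    x = np.zeros(p)
    for _ in range(n_iter):
        grad = A.T @ (A @ x - y) / n
        x = soft_threshold(x - step * grad, step * lam)
    return x

rng = np.random.default_rng(0)
n, p, s = 100, 400, 5                        # toy sizes: more predictors than samples
A = rng.standard_normal((n, p))              # stand-in for a genotype matrix
support = rng.choice(p, size=s, replace=False)
x_true = np.zeros(p)
x_true[support] = 1.0                        # only s predictors truly matter
y = A @ x_true + 0.01 * rng.standard_normal(n)

rank = np.linalg.matrix_rank(A)              # at most n, so the system is underdetermined
x_min_norm = np.linalg.lstsq(A, y, rcond=None)[0]   # minimum-norm least-squares solution
x_lasso = lasso_ista(A, y, lam=0.05)

top_lasso = set(np.argsort(-np.abs(x_lasso))[:s].tolist())
print("rank of A:", rank, "with p =", p)
print("Lasso top-s indices match true support:", top_lasso == set(support.tolist()))
print("error, min-norm LSQ:", np.linalg.norm(x_min_norm - x_true))
print("error, Lasso:       ", np.linalg.norm(x_lasso - x_true))
```

With n > p and well-conditioned columns the plain least-squares fit would already be unique, which is the situation the comment above argues the filtered SNP set was in.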

You ask too many questions at once, so I am not sure that, even if I knew the answers, there is sufficient sincerity behind your questions to deserve them. You are all over the place. Scattered brain, hypomania? Anyway, I do not know the answers to most of the questions you ask.

utu, thank you for your comments.

Yes, it is quite true that since the start of the month I have been running on adrenaline. The concept of 100 SD IQ humans took me by complete surprise. There is no sci-fi that prepared me for this. We need to imagine our future before we can reach it. Super intelligence enhanced humans could become science fact before many might ever encounter it as sci-fi.

No one has even bothered to raise technical objections about this. On a first think through I can not see anything obvious that would prevent it from happening at some time in the future.

This latest research has further amplified my sense of disbelief. GWAS research has been advancing steadily but glacially for 10 years. My timeline for high heritability GWAS results had been on the scale more of decades. The entire GWAS methodology of fitting one SNP at a time seemed futile. My impression is that essentially all of the additive heritability for height has now been found.

For the pinning down the numbers comment, this is more about using the other research that these authors had done as a guide.

The paper might have held back some of the details of the method, though as an added check I might have rerun the numbers to see whether the x set remained stable at some point. This is what the theory said should happen. As you move down to and then beyond the effective boundary to complete x selection, there should be a range in which the set is stable. Given the p-values they likely achieved and all the previous research, I am unable to propose a plausible rebuttal to this recent research.

I concede the point that Lasso possibly was used more as a method of convenience than one of necessity. Nonetheless there is no self-apparent reason to suppose that Lasso would yield a wildly wrong answer.

There are legitimate clarifications that could be raised, though the scientific rationale appears solid. They likely wanted to have the publication as a preprint as fast as they could and then later they might address some of the fine grained technical questions.

I did ask an excessive number of questions, though I did this more to spread a buffet on the table than for you to feel obligated to answer them all. Maybe others might want to respond to those points they find of interest.

To narrow down my questions, I would be especially interested in any responses to the causal SNPs question. Causals are one of the blockers left before the end zone. How might these causals be found effectively with Lasso or other methods? For example, one might try lassoing another large sample to see what tagging SNPs might be reported. Comments please!

It is also true that I have crossed the line from objective observer to member of the commentariat to a cheer leader to someone in awe and finally to an online agitpropper. res and I and quite a few others on the thread need some adult supervision: utu, you’re one of the grown ups!

You realize those two statements are rather inconsistent, right? Without compressed sensing we would just be cranking along that seemingly futile path you described. It will be interesting when we can look back and judge how large a sample would be required for a standard GWAS to give results this good. Then we can talk about how many years this work advanced us in one giant step. Sorry, utu, that IS exciting.

Does anyone have a good sense of the GWAS results for height over time? Wood 2014 was a big deal (linked above) explaining about 29% of variance with ~9,500 SNPs. But that was using a reduced threshold and only 697 SNPs were genome wide significant explaining ~16% of variance (Table 1, the abstract’s “one-fifth” is annoyingly misleading IMHO, especially since 15.9% is less than one-sixth).

I would enjoy seeing the time progression of SNPs and percent variance explained at genome wide significance. Does anyone know of such a thing?

Sorry, but I feel a need to disavow this one. Please only speak for yourself with comments like that.

It is important to understand that I have been excited about this since I first heard Steve Hsu bring up the idea four years ago (I knew about compressed sensing by name before that, and as L1-regularization even earlier, but not in the genetic context). I thought the potential power of the technique was obvious as long as the theory held up in reality. Along the way I don’t think there has been any shortage of doubters like utu. It is good to see initial success achieved (as far as we know at the moment, verification is important) and hopefully they will have further success with this technique in the future.

P.S. Make sure to check out the updates at the bottom of Steve’s most recent blog post: http://infoproc.blogspot.com/2017/09/accurate-genomic-prediction-of-human.html

Steve addresses a number of the questions raised in this thread.

One clarification. The ~20k SNPs in the CS predictor are not necessarily at genome wide significance. From the paper:

“Sorry, utu, that IS exciting.”

I do not disagree.

“You realize those two statements are rather inconsistent, right?”

I do not think so. What he meant is that while Lasso was used and brought the progress, it was not because it was Lasso. The same could have been achieved with linear LSQ regression, which lacks the sex appeal, on the reduced set of p=50k-100k SNPs. Lasso is critical for an underdetermined system. Here it was not. The emphasis on Lasso is overhyped at this point. I want to see how it performs when p>>n or when nonlinear terms are included.

“be cranking along that seemingly futile path you described”

I do not think it is a lack of mathematical methods. There must be another reason why GWAS people, despite having 500 coauthors on their papers, produce far fewer hits. I suspect that they also want to have some biological explanations, and this takes time. Hsu is not interested in biology and it is not his field. For him (just like for me) it is just a mathematical problem. I suspect that GWAS people have done similar things to Hsu with different mathematical methods (like the marginal regression from which I think Hsu borrowed the filtering of SNPs) but did not publish them. Having a correlation is not enough for them. This might be a hopeful indication that they do not suffer from a lack of sitzfleisch, unlike your buddy Davide Piffer.

Replies: @res: A big part of that reason is meta-studies need to credit the entire research team working on each included dataset.

I think you are grasping for straws at this point.

I also don't think you understand how compelling the evidence of creating a predictor that gives almost as good results on a completely different sample is. Which is fascinating given your "I think it is mostly in the validation phase" in your next comment.

Yes, it is quite true that since the start of the month I have been running on adrenaline. The concept of 100 SD IQ humans took me by complete surprise. There is no sci-fi that prepared me for this. We need to imagine our future before we can reach it. Super intelligence enhanced humans could become science fact before many might ever encounter it as sci-fi.

No one has even bothered to raise technical objections about this. On a first think through I can not see anything obvious that would prevent it from happening at some time in the future.

This latest research has further amplified my sense of disbelief. GWAS research has been advancing steadily but glacially for 10 years. My timeline for high heritability GWAS results had been on the scale more of decades. The entire GWAS methodology of fitting one SNP at a time seemed futile.My impression is that essentially all of the additive heritability for height has now been found.

For the pinning down the numbers comment, this is more about using the other research that these authors had done as a guide.

The paper might have held back some of the details of the method, though as an added checked I might have rerun the numbers to see if the x set at some point remained stable. This is what the theory said should happen. As you move down to and then beyond the effective boundary to complete x selection, there should be a range in which the set would be stable. Given the p-values they likely achieved and all the previous research, I am unable to propose a plausible rebuttal to this recent research.

I concede the point that Lasso possibly was used more as a method of convenience than one of necessity. Nonetheless there is no self-apparent reason to suppose that Lasso would yield a wildly wrong answer.

There are legitimate clarifications that could be raised, though the scientific rationale appears solid. They likely wanted to have the publication as a preprint as fast as they could and then later they might address some of the fine grained technical questions.

I did ask an excessive number of questions, though I did this more to spread a buffet on the table than for you to feel obligated to answer them all. Maybe others might want to respond to those points they find of interest.

To narrow down my questions, I would be especially interested in any responses to the causal SNPs question. Causals are one of the blockers left before the end zone. How might these causals be found effectively with Lasso or other methods? For example, one might try lassoing another large sample to see what tagging SNPs might be reported. Comments please!

It is also true that I have crossed the line from objective observer to member of the commentariat to a cheerleader to someone in awe and finally to an online agitpropper. res and I and quite a few others on the thread need some adult supervision: utu, you're one of the grown ups!

“I would be especially interested in any responses to the causal SNPs question.”

This is the ultimate goal. Eventually the issue of causality will have to be answered by biologists. Hsu deals with mathematical questions. But I am sure there are statistical theories and criteria that deal with the issue of uncertainty, how to exclude spurious correlations, and so on. I think it is mostly in the validation phase. See here:

https://en.wikipedia.org/wiki/Cross-validation_(statistics)

https://en.wikipedia.org/wiki/Test_set

“Sorry, utu, that IS exciting.”

I do not disagree.

“You realize those two statements are rather inconsistent, right?”

I do not think so. What he meant was that while Lasso was used and brought the progress, it was not because it was Lasso. The same could have been achieved with plain linear LSQ regression, for all its lack of sex appeal, on the reduced set of p=50k-100k SNPs. Lasso is critical for an underdetermined system. Here it was not. The emphasis on Lasso is overhyped at this point. I want to see how it performs when p>>n or when nonlinear terms are included.

“be cranking along that seemingly futile path you described”

I do not think it is a lack of mathematical methods. There must be another reason why GWAS people, despite having 500 coauthors on their papers, produce far fewer hits. I suspect that they also want to have some biological explanations and this takes time. Hsu is not interested in biology and it is not his field. For him (just like for me) it is just a mathematical problem. I suspect that GWAS people have done similar things to Hsu with different mathematical methods (like the marginal regression from which I think Hsu borrowed the filtering of SNPs) but they did not publish it. Having a correlation is not enough for them. This might be a hopeful indication that they do not suffer from the lack of sitzfleisch, unlike your buddy Davide Piffer.

Unproven. If so why has this not been done before?

Do you really think a hypothetical “another reason” is more likely than the reason staring us in the face? Compressed sensing made a difference.

Did you look at that Wood 2014 paper? They successively reduced the SNP threshold so they could increase the SNPs discovered and the percent variation explained. I think that method is more likely to generate false positives than CS is.

A big part of that reason is meta-studies need to credit the entire research team working on each included dataset.

I think you are grasping at straws at this point.

I also don’t think you understand how compelling the evidence of creating a predictor that gives almost as good results on a completely different sample is. Which is fascinating given your “I think it is mostly in the validation phase” in your next comment.

I really do not have inside knowledge about the community involved in GWAS. It is possible that their work stagnated. It is possible that their methods were geared for the search for single disease genes and are not optimal for polygenic studies. But on the other hand Visscher's people use the most sophisticated multivariate methods and can tackle 1000s of SNPs. So I do not think there is a lack of means in this community. Hsu’s advantage might be the lack of prior knowledge, so he can cut the Gordian knot without trying to untangle it. I am glad he did what he did but still I think he is overhyping it.

I find Hsu somewhat disingenuous. In this paper https://arxiv.org/pdf/1310.2264.pdf he is critical of marginal regression, but in the last paper his filtering of SNPs is taken straight from marginal regression. One of the reviewers (of the first paper) asked him to do a comparison with marginal regression, but he did not oblige, I think. While Hsu claims they did the set reduction only to save computer time, it might be possible that without doing so the Lasso would choke. And as I pointed out several times, LSQ could do it as well on the set where n>p.

“They successively reduced the SNP threshold…. I think that method is more likely to generate false positives than CS is.”

But this is exactly what Hsu did. He set the threshold so he got 100k out of 1 million or so SNPs. Among those 100k at least 35k (35%) of SNPs turned out to be contributing to the predictor function. These are not false hits. This is the proof that marginal regression would work here.

Referring to CS as if without it there would be no progress is also disingenuous. Keep in mind that all Hsu is doing in his last paper is the Lasso method, which precedes CS. CS is just a new application of this algorithm. The old method got repackaged.

I found this 2009 paper comparing Lasso and marginal regression (haven’t read all of it yet):

Note that neither “compressed” nor “compressed sensing” appears anywhere in this paper. Nevertheless the authors cite lots of Donoho papers. Compressed sensing is just an area of application.

For the history of compressed sensing, see https://en.wikipedia.org/wiki/Compressed_sensing

Compressed sensing relies on L1 techniques, which several other scientific fields have used historically.
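The point argued above, that once the SNPs are filtered down to a set with n>p an ordinary least squares fit is at least possible, can be sketched on toy data. Everything here is an assumption made for illustration: the sizes, the simulated 0/1/2 genotypes, the effect sizes, and the helper name `marginal_screen_then_ols`. It is not the procedure from Hsu's paper, and it says nothing about whether LSQ would match Lasso on real data.

```python
import numpy as np

def marginal_screen_then_ols(X, y, keep):
    """Rank columns by |marginal correlation| with y, keep the top `keep`, then fit OLS."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    # One-column-at-a-time association strength (the "marginal regression" step).
    score = np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    selected = np.sort(np.argsort(score)[-keep:])
    # With keep < n the reduced system is overdetermined, so least squares is well posed.
    beta, *_ = np.linalg.lstsq(Xc[:, selected], yc, rcond=None)
    return selected, beta

rng = np.random.default_rng(0)
n, p, s = 1000, 2000, 20                      # toy sizes, nothing like a real GWAS
X = rng.binomial(2, 0.3, size=(n, p)).astype(float)   # simulated allele counts
true_beta = np.zeros(p)
causal = rng.choice(p, size=s, replace=False)          # hypothetical causal columns
true_beta[causal] = rng.normal(0.0, 1.0, size=s)
y = X @ true_beta + rng.normal(0.0, 1.0, size=n)

selected, beta = marginal_screen_then_ols(X, y, keep=200)
recovered = np.intersect1d(selected, causal).size
print(f"kept {selected.size} of {p} columns; {recovered} of {s} causal columns survived")
```

Here p starts above n, the screen brings it down to 200 columns, and only then is the regression run. Columns with small true effects can be lost at the screening step, which is the usual criticism of marginal regression.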

Dina – succeed at what? If those you call the dumbest and most cruel love their spouses, they know they have shackled their spouse to a mediocrity. Sad! If they don’t love their spouses, that is even sadder. If they like to laugh – and who doesn’t like to laugh, as poor befuddled Judy Garland used to like to say – remember? – they know that people who are naturally funny are serious around them. Sad! And if they don’t like to laugh, well that would confuse poor Judy Garland and would in addition be very sad from an objective point of view. Sure they have money but first there is the hedonic treadmill and then there is the aging process, which bothers the successful more than the unsuccessful (whether or not the success was obtained through work or through cheating). No, we are not talking about a story that does not have a moral.

Please show me where I have done that.

Dina, how might this strategy’s effectiveness change if readily available and inexpensive gene chips could accurately predict the traits you mentioned and others?

I am glad you do not object to anything else.

res, I was being obscure when I confused you with the apparent contradiction, though I was following utu’s line of thought that the data set was manipulated not to be underdetermined, so why stay with Lasso?

I am still greatly looking forward to the Million person EA GWAS. It would be amusing if a 500,000 CS were to report much better results than a 1 million traditional GWAS.

Yes, having a height heritability time series would be informative. I was not aware that a 9000 SNP result had been reported. I had been under the belief that this is the first MEGA SNP in height.

A word or two really needs to be added noting the extreme potential that often exists for a mathematical description to fundamentally change the current state of development of a field of inquiry. It should be emphasized that math has this massive power to extract essential truths and change the entire conversation. After over a century of developing the theoretical base we now have the theory and the data sets that will provide us with answers.

If it were possible to go directly to causal SNPs through the use of mathematical models we could avoid what might be a significant bottleneck to practical application. Yet, trying to manipulate the genetics of intelligence through CRISPR with only a long list of SNPs and without a biological understanding would be risky.

You are very quick to assume things. If you are going to do that you might want to improve your accuracy.

It will not kill you to admit that you agree with my assessment in #139 which you have read.

Articles published on biorxiv do not require a payment. Might this diminish the quality of such articles? Do these articles need to have been accepted elsewhere for publication (vanity publishing?)?

Is there a method to determine which journals these preprints are en route to?

In order to provide some copy editing help here are some suggestions.

The sentence at the end of the first paragraph is a near duplicate of the first sentence of the second paragraph.

The Figure that shows the locations of the SNPs in the genome should be done in the traditional way with a change of color between chromosomes and labeling of the more prominent outliers. I found the base pair numbering system used on the figure needlessly confusing. The given base pairs on the x-axis have no obviously recognizable meaning.

utu, this is a valid objection to raise about the relation between n and p as it bears on underdetermined systems and Lasso. There is a much greater range for selecting the location on the phase plane than I had realized. Clearly being between 0 and 1 on delta is not very favored. One needs to closely approach the x-axis to have the solution vector converge for even moderately high heritability traits such as height.

I worried that even close to the x-axis sample sizes might need to ramp up geometrically with linear decreases in rho.

Choosing instead a delta over 1 (5-10 in the article) would make the Lasso better able to find an answer, though, as you mentioned, the system is then no longer underdetermined.
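To make the delta and rho talk concrete, here is a minimal sketch, assuming the usual compressed-sensing conventions delta = n/p (samples per candidate SNP) and rho = s/n (nonzero effects per sample). The SNP counts are the ones quoted in this thread (50k-100k candidates, about 35k contributing). The sample size of 500,000 is an assumed round number, not a figure taken from the paper, and `phase_coordinates` is a made-up helper name.

```python
def phase_coordinates(n_samples, n_predictors, n_nonzero):
    """Return (delta, rho) = (n/p, s/n) under the assumed conventions."""
    delta = n_samples / n_predictors
    rho = n_nonzero / n_samples
    return delta, rho

ASSUMED_N = 500_000        # assumption: round sample size for illustration only
NONZERO = 35_000           # SNPs said in the thread to contribute to the predictor

for p in (50_000, 100_000):
    delta, rho = phase_coordinates(ASSUMED_N, p, NONZERO)
    regime = "underdetermined (p > n)" if delta < 1 else "not underdetermined (n >= p)"
    print(f"p={p}: delta={delta:.1f}, rho={rho:.2f}, {regime}")
```

Under that assumed n, delta lands between 5 and 10, which matches the range mentioned above and puts the problem on the n>p side where Lasso is not strictly required.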

How tall would someone be, who genotyped homozygous for all the height raising alleles and null for all the height decreasing alleles?

…illustrated in Fig. 1. G-matrix models consider a local linear approximation to the phenotype landscape. We can think of this as a plane that is tangent to the landscape at the point corresponding to the population mean. These models also replace the actual distribution of variation within the population with a multivariate normal distribution. So long as the landscape is locally smooth, then for a very small region it will be well approximated by an uncurved plane. If, in addition, the joint distribution of parent and offspring phenotypes is close to multivariate normal, then G-matrix models provide a good approximation of short term evolution. Figure 1 also illustrates why G-matrix models are of little value for the study of long term evolution. As the population moves over the landscape, the slope and local geometry changes (i.e., the genetic architecture changes) in ways that could not be predicted from the initial linear approximation.

dr johnson said, “i refute it thus!” and stubbed his toe.

even though berkeley was right.

how have the hereditists refuted secretariat?

the hereditists have never addressed the problem of secretariat…

or they deny it’s a problem.

An interesting question. It would also be interesting to know what kind of shift in the distribution of alleles occurs between the shortest, average, and tallest groups of people.

Put another way, it would be interesting to know what the predictor would say for your examples (as I think you mean). I don’t think anyone believes the linearity extends out that far, but it would be an interesting comment on the earlier “100 SD possible” statements.

From a Figure from the article, there was a 50 cm range of height amongst a sample of 2000 people.

6 SD ≈ 50 cm

1 SD ≈ 8 cm

40% of 100 SD = 40 SD ≈ 3 m

15 foot humans?

Wonder why this has not as yet been noted in the coverage?
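The back-of-envelope numbers above can be rerun without the rounding. This is only the same arithmetic spelled out: the 50 cm range and the "40% of 100 SD" figure come from the comment, while the 170 cm average height is an assumed value that is not in the article, and the whole thing takes linearity at face value, which nobody in this thread expects to hold that far out.

```python
CM_PER_FOOT = 30.48

range_cm = 50.0            # height spread read off the article's figure (per the comment)
sd_cm = range_cm / 6       # treat that spread as roughly 6 SD
shift_sd = 0.40 * 100      # "40% of 100 SD"
mean_cm = 170.0            # assumption: rough average adult height

shift_cm = shift_sd * sd_cm
total_cm = mean_cm + shift_cm

print(f"1 SD ~ {sd_cm:.1f} cm")
print(f"{shift_sd:.0f} SD ~ {shift_cm / 100:.1f} m above the mean")
print(f"implied height ~ {total_cm / 100:.1f} m, or about {total_cm / CM_PER_FOOT:.1f} ft")
```

Without rounding 1 SD down to 8 cm the total comes out somewhat taller than the 15 feet quoted, which does not change the point.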

Speaking of the UK Biobank, this paper looks at 118 non-binary and 599 binary traits studied using that data: https://www.biorxiv.org/content/early/2017/08/16/176834

Two things which caught my eye:

- Significant enrichment of the HLA region for significant associations (Figure 3). Especially notable for binary traits.

- “We found a significant correlation (r=0.91, P<10^-46) between the number of hits and the SNP heritability of the traits, suggesting that the number of loci affecting a trait might be proportional to the heritability of the trait (Figure 4, Supplementary Fig. 5).”

Regarding height they found: "Standing height was the trait with the largest number of hits (Figure 5) with 12,135 significantly associated variants distributed across 4,090 independent loci."

I understood it, but tbh it seems obvious. Is there some underlying epiphany beyond what is stated? I am of the opinion that it merely reflects what we don’t know.

The distinction made is important, to be sure.

There is a possible flaw in Hsu’s paper:

(1) validation sample is too small

(2) the filtering of SNPs, which is essentially the marginal regression method, will increase the probability of successful validation if it was done on the total sample, including the validation sample.

So it is possible that correlations obtained by Hsu are higher than if the validation sample was kept hidden from him.
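Point (2) is easy to demonstrate on made-up data. The sketch below uses pure noise, so no column truly predicts y, and compares screening on everyone (validation rows included) with screening on the training rows only. All sizes and helper names are assumptions for illustration. It shows the mechanism only and says nothing about whether Hsu's filtering actually used the validation sample.

```python
import numpy as np

def top_k_by_marginal_corr(X, y, k):
    """Indices of the k columns with the largest |marginal correlation| with y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    score = np.abs(Xc.T @ yc) / np.linalg.norm(Xc, axis=0)
    return np.argsort(score)[-k:]

def validation_corr(X, y, train, valid, screen_rows, k):
    """Screen columns on `screen_rows`, fit OLS on `train`, report correlation on `valid`."""
    cols = top_k_by_marginal_corr(X[screen_rows], y[screen_rows], k)
    Xtr, Xva = X[train][:, cols], X[valid][:, cols]
    mu_x, mu_y = Xtr.mean(axis=0), y[train].mean()
    beta, *_ = np.linalg.lstsq(Xtr - mu_x, y[train] - mu_y, rcond=None)
    pred = (Xva - mu_x) @ beta + mu_y
    return np.corrcoef(pred, y[valid])[0, 1]

rng = np.random.default_rng(1)
n, p, k = 600, 5000, 50
X = rng.normal(size=(n, p))
y = rng.normal(size=n)                      # pure noise: the true validation r is zero
train, valid = np.arange(0, 400), np.arange(400, n)
everyone = np.arange(n)

leaky = validation_corr(X, y, train, valid, everyone, k)   # validation rows seen by the screen
clean = validation_corr(X, y, train, valid, train, k)      # validation rows kept hidden
print(f"validation r, screening on everyone:   {leaky:.2f}")
print(f"validation r, screening on train only: {clean:.2f}")
```

In this toy setup the leaky version should come out clearly positive even though there is nothing to find, while the clean version should hover near zero. A real signal would sit on top of that inflation, which is why where the filtering was done matters.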