Dear Prof Posthuma,

Thank you for your comments. These comments are not new, and that is not necessarily a bad thing. Actually, it works to my advantage because over the years I have had the opportunity to develop ways to rebut these criticisms.

One of the ways of answering your criticisms, and the one that most convinces me of the validity of my findings, is the new Monte Carlo approach I developed. I show that thousands of unlinked random SNPs (matched for Minor Allele Frequency using the SNPsnap algorithm) rarely (p<0.01) achieve the same predictive power as the polygenic scores built from GWAS hits. The issues you mention (Linkage Disequilibrium decay, different causal variants, etc.) simply create noise; they do not bias the results in one direction. There is no reason why Linkage Disequilibrium decay should produce the pattern we observe and magically match the IQ scores of populations so closely. As the paper you cite (Martin et al., 2017) explains: “We demonstrate that scores inferred from European GWASs are biased by genetic drift in other populations even when choosing the same causal variants, and that biases in any direction are possible and unpredictable”.

But genetic drift has been controlled for and ruled out in my papers by two different and complementary methods. First, a Mantel-like test, based on regressing phenotypic values on Fst distances and polygenic score distances, shows that polygenic scores predict average intelligence above and beyond Fst distances (i.e. drift and everything else that is not directional). Second, a Monte Carlo simulation with several thousand random SNPs shows that drift cannot plausibly explain my results: the GWAS hits correlate with population IQ more strongly than 99% or more of the random SNPs (for a demonstration, see my paper: https://www.preprints.org/manuscript/201701.0127/v3).

The factor analysis of GWAS hits produced even better results, outperforming 99.8% of the random SNPs. For a report, check: https://rpubs.com/Daxide/279148
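The Monte Carlo comparison described above can be sketched in a few lines. This is only a minimal illustration on synthetic data, not the code behind the linked report: the allele frequencies, the 26 phenotype values, the names `polygenic_score` and `monte_carlo_p`, and the add-one p-value convention are all assumptions, and the MAF matching done with SNPsnap is left out.

```python
import random

def pearson_r(x, y):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def polygenic_score(freqs, snp_ids):
    """Mean allele frequency over the chosen SNPs, one value per population."""
    n_pops = len(freqs[snp_ids[0]])
    return [sum(freqs[s][p] for s in snp_ids) / len(snp_ids) for p in range(n_pops)]

def monte_carlo_p(freqs, hit_ids, background_ids, phenotype, n_draws, rng):
    """Share of random SNP sets that correlate with the phenotype at least as
    strongly as the hit set (add-one convention, an assumption)."""
    observed = pearson_r(polygenic_score(freqs, hit_ids), phenotype)
    exceed = 0
    for _ in range(n_draws):
        draw = rng.sample(background_ids, len(hit_ids))
        if pearson_r(polygenic_score(freqs, draw), phenotype) >= observed:
            exceed += 1
    return observed, exceed, (exceed + 1) / (n_draws + 1)

# Synthetic stand-in data: 26 populations, 9 "hit" SNPs, 500 background SNPs.
rng = random.Random(1)
phenotype = [rng.gauss(0, 1) for _ in range(26)]
freqs = {}
hit_ids = [f"hit{i}" for i in range(9)]
background_ids = [f"bg{i}" for i in range(500)]
for s in hit_ids:
    # Hits are built to track the phenotype; frequencies are clamped to (0, 1).
    freqs[s] = [min(0.99, max(0.01, 0.5 + 0.1 * v + rng.gauss(0, 0.05))) for v in phenotype]
for s in background_ids:
    freqs[s] = [rng.uniform(0.05, 0.95) for _ in phenotype]

observed, exceed, p_value = monte_carlo_p(freqs, hit_ids, background_ids, phenotype, 200, rng)
print(f"observed r = {observed:.2f}, random sets at or above it: {exceed}/200, p = {p_value:.3f}")
```

Because the synthetic hits are constructed to follow the phenotype, the small p-value here is built in; the sketch only shows the mechanics of comparing a pre-specified SNP set against random draws.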

What is remarkable is that height GWAS hits fail to predict population IQ. Guess what they predict? Height. The East Asian advantage we observe for education or intelligence-related SNPs disappears and turns into a lower score for the notoriously not gigantic Chinese, Vietnamese and Japanese. A demonstration of this can be seen here: https://f1000research.com/articles/4-15/v3. Look at Table 1 and compare the polygenic scores to those for intelligence, such as Table 2 of my 2015 Intelligence paper http://www.sciencedirect.com/science/article/pii/S0160289615001087 or the more recent scores: (https://topseudoscience.wordpress.com/2017/06/02/new-genes-same-results-group-level-genotypic-intelligence-for-26-and-52-populations). They almost look like their mirror image, with ranks reversed.

An issue I see in the Martin et al. paper is that the polygenic scores were created using a very liberal p-value for inclusion thus pulling in a lot of false positives. False positives are expected to work like random SNPs, hence it is not surprising that they could not reproduce the results in non-Europeans.

When we home in on the causal variants by picking the right alleles, instead of using a brute-force approach, we tend to see that the same genes have the same effects across different super-populations. For example, countless studies have shown that the APOE4 allele is involved in Alzheimer’s disease and has a variety of health-related effects. This allele confers risk on African Americans and European Americans alike (http://www.nytimes.com/2013/04/10/health/african-americans-have-higher-risk-of-alzheimers-study-shows.html). Incidentally, I should mention that this variant also has a population pattern closely mirroring the intelligence polygenic scores, perhaps due to the general effect on cognition.

The strength of my approach is in using the SNPs that replicated across many GWAS studies, increasing the chance of dealing with true causal variants or SNPs in close Linkage Disequilibrium with them, hence reducing the effect of Linkage Disequilibrium decay.

And Europeans are not even the top scorers, which is what the “reference-population-bias” hypothesis would predict. This hypothesis is widespread but lacks any logical rationale. In fact, I consistently observed higher polygenic factor scores for East Asians than for Europeans. If there had been a pro-European (i.e. pro GWAS-reference population) bias built into the cross-population comparison, this would imply that my method underestimates all non-European scores, not just African ones. I am amused that the debate is fixated on the lower African scores and nobody notices the East Asian advantage. You cannot have it both ways: if my method had a pro-White bias, then the East Asian scores would also be underestimated. This would actually imply that the East Asian advantage is even bigger than the one I have found. This reductio ad absurdum shows the absurdity of the claims against my method.

Finally, a paper published this week, using GWAS hits, replicates the East Asian advantage on educational attainment found by several of my papers (although funnily they do not acknowledge my studies, even though one of the authors is familiar with my results, because a while ago I shared them with him via email): http://biorxiv.org/content/early/2017/06/04/146043

This paper strengthens the argument that SNPs which predict within-population differences can be used to predict between-population differences.

I recently published a paper where I put together all my main findings to date: https://www.preprints.org/manuscript/201706.0039/v1

That paper should be able to answer general questions about my findings and my methods.

In summary, within-population differences can be used to predict between-population differences.

This research program is based on two fundamental conjectures: (1) within-population differences in education, IQ etc are caused by the same causal variants everywhere, and (2) allele frequencies vary among populations. As long as these two conjectures are true, causal SNPs discovered in Europeans and the polygenic scores constructed from them can predict between-population differences.

The problem is that our present polygenic scores are not computed from known causal variants, but from GWAS hits that in the vast majority of cases are merely in linkage disequilibrium with the causal variants. Even if the same causal polymorphisms were polymorphic in all populations, the linkage phase and extent of linkage disequilibrium between GWAS hit and causal polymorphism is not necessarily the same everywhere. This is the most important reason why most of the polygenic scores defined for Europeans have low predictive power for non-Europeans. Also, causal variants that are polymorphic in Europeans may be monomorphic elsewhere, and vice versa.

Instead of bickering about the limitations of our present polygenic scores, what needs to be done is to make the transition from microarray-based discovery studies to sequencing-based fine mapping of the causal variants. This will necessitate large-scale studies in non-Europeans in order to capitalize on ethnic variations in linkage patterns. African populations are especially suitable for fine mapping because of their generally lower linkage disequilibrium.

The wider issue, not within science but more generally, is what is preferable in this case: knowledge or ignorance. Knowledge constrains the kinds of beliefs that people can reasonably hold. Racists have their own favored beliefs about polygenic scores, and politically correct types have different favored beliefs. We would inflict great emotional damage on these people by telling them the truth. That would be cruel and unreasonable, wouldn’t it?

LD decay is not racist. I think nobody is denying the existence of LD decay, or even that it is a problem. What the LD decay objection leaves unanswered is this: why, if these SNPs were pure noise, do they match population IQ better than almost all the random SNPs and better than the height SNPs? And not just in the first GWAS SNP set used by Piffer in 2015, but in all subsequent polygenic scores computed from independent studies?

LD decay is not complete and when choosing the SNPs replicated across studies it’s less of an issue because the chance of hitting on a true causal SNP is much higher, and even for the tag SNPs, the average LD decay will be lower. It’s simply a nuisance that is gonna add error to the prediction. As the Martin et al. paper pointed out, patterns of LD decay follow genetic drift and do not have bias for some populations. In other words, LD decay is not racist.

Sequencing is very expensive, and not necessary. Higher density arrays will also reduce the LD decay problems. One does not need the specific causal variants, just some marker in very close vicinity of it so that the decay is unproblematic.

The more important issue is getting a larger sample of countries, because n=26 is not very convincing no matter what is done. Keep in mind that there’s genomic autocorrelation too, so the real independent n sample size is much smaller.

Note: there is real data-based simulation evidence behind the claim of unitary causal patterns.

http://biorxiv.org/content/early/2016/11/03/085092

We might also note that the environmentalists' gamble on non-unitary causal patterns in variants is not wise, because it plays directly into the hands of people who say that the races are so different they should be labeled different species. Unitary causal patterns with some LD decay are more in line with the ‘only one species’ position.

IMO, direct evidence not convincing as of now, but looking forward to larger databases of genomic data, e.g. country level (of natives!).

Let’s look at both of your statements here. Your (2) is demonstrably true. Anybody who doubts that should take a look at a reference like SNPedia or for the visually inclined see:

Geography of Genetic Variants Browser – http://popgen.uchicago.edu/ggv

As for your (1) my sense is that Piffer’s work makes a looser conjecture. That the SNPs detected in (European) GWAS are indicative of selection effects everywhere. This is subject to LD issues as Piffer has discussed above and in comments, but there is no requirement that the European SNPs be the only relevant SNPs.

Although not directly applicable to this, this statement from the height paper linked above (thanks) seems related:

Some thoughts/questions about this passage:

- I don’t recall seeing something similar in the IQ/EA work. Did I miss it, or is it not relevant there, or …?

- Was the correction large enough to explain the observed height polygenic score results disparity for Africans (a 4″ underestimate IIRC)?

- What about African specific causal SNPs (e.g. Pygmy height?)?

P.S. Is anyone currently doing either fine mapping or large scale GWAS work on African populations?

We already have samples with more populations (52+) and ALL natives. I am talking about ALFRED. The big problem is coverage is very low so only about 10% of the variants are present. Nonetheless, it is still possible to create polygenic scores or factor analyze those. What we lose in terms of number of SNPs/genomic resolution we gain in population N/spatial resolution.

Interesting link. Thanks.

Do you have current cost data (in study quantities) for sequencing and lower/higher density arrays?

Nice “one species” point.

Rarely does not mean never, right? How rarely? If any other set of randomly selected SNPs has the same predictive power as the original 9 SNPs with respect to the sequence of 26 IQ numbers, then we may ask what else is lurking in the set of 10 million SNPs and what else they can predict. What is the predictive power of SNPs randomly selected from among 10 million? What else can they predict? If we generate a random sequence of 26 numbers, can we find 9 or more SNPs that will predict these numbers with r=0.9 correlation?

Clearly the author did not explore the issue. Perhaps there is nothing magical about this. Spurious correlations will occur in an underdetermined system where you have 10 million SNPs and only 26 data points. Perhaps we may find SNPs that will explain tomorrow’s lottery results in London, Lusaka and Tokyo.

Your comment simply shows a lack of understanding. You clearly have not read my papers, or you’d find the answer there. It is only you who has not even bothered exploring the issue and is uttering nonsense. The share of random sets of 9 SNPs that predict IQ with the same correlation varies between 0.2% and 1%, depending on the polygenic/factor score being used. See: https://rpubs.com/Daxide/279148

Poor Utu, you need to go to school (or at least start reading Wikipedia or something). You quote Davide’s “rarely (p<0.01)” and proceed asking just how rarely. Seriously? Come on, man, get an education.

I went to your preprint that you gave a link to on another thread and found this:

What were the top 8 correlations out of 818?

I found the answer

Eight randomly drawn groups produced better correlations with IQ than the 9 SNPs used in the study.
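The arithmetic behind the figures quoted in this exchange (8 random groups out of 818 doing better) can be written out directly. The function name `empirical_p` and the optional add-one convention are assumptions for illustration; the thread does not say which convention the paper used.

```python
def empirical_p(n_exceeding, n_draws, add_one=False):
    """Fraction of random draws that did at least as well as the tested set.

    With add_one=True the tested set itself is counted among the draws,
    a common convention that keeps the estimate above zero.
    """
    if add_one:
        return (n_exceeding + 1) / (n_draws + 1)
    return n_exceeding / n_draws

raw = empirical_p(8, 818)
corrected = empirical_p(8, 818, add_one=True)
print(f"raw: {raw:.4f}, add-one: {corrected:.4f}")  # raw: 0.0098, add-one: 0.0110
```

On these numbers the raw fraction sits just under 0.01 and the add-one version just above it, so the "p<0.01" label depends on the convention.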

utu,

Again, your lack of statistical reasoning in interpreting experimental results is showing. As the paper said, p < 0.01, i.e. the confidence level is better than 1 in 100. With 819 runs the upper limit of false positives is 8, which is exactly what you found.

In the life and social sciences the accepted significance level is p < 0.05, i.e. 1 in 20.

p < 0.01 already exceeds the expected significance level.

Go learn some more statistics.

You also have a preconceived, one-track mind that ALL random trials should be false, while in real life there could be real effects not yet discovered. Those 8 cases should be examined in more detail in case they are real.

Have you ever heard of stuff called p value or significance testing?

You can think of p= 0.01 this way. In a room with 1000 people, you will find on average only 10 whose poor understanding of statistics is as bad or worse than yours.

JT:

Finally, a paper published this week, using GWAS hits, replicates the East Asian advantage on educational attainment found by several of my papers (although funnily they do not acknowledge my studies, although one of the authors is familiar with my results, because a while ago I had shared my results with him via email): http://biorxiv.org/content/early/2017/06/04/146043

This paper strengthens the argument that SNPs which predict within-population differences can be used to predict between-population differences.

For this paper though, the supplemental tables 4 and 20 show that the binomial test results on the ≈85 polygenic alleles associated with educational attainment tend to come from East Asian positive:European negative results and East Asian positive:Native American negative results. East Asian positive:Near East negative and negative results also drive associations in the extended dataset.

The binomial tests on the ≈85 polygenic alleles under study for African vs Europeans, East Asians, etc. are essentially all neutral. Native Australian and Papuan scores also tend to be neutral in binomial tests in the extended data set.

Do you have any ideas on why you think the binomial tests have these results? Is the structure of between-population genetic variance of educational accomplishment East Asian>African/Oceanian>Native American/European, as they suggest?
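For readers unfamiliar with this kind of test, a binomial (sign) test on allele direction can be sketched as below. This is a generic exact two-sided test, not necessarily the procedure used in the paper's supplement; the function name and the count of 55 out of 85 are hypothetical and chosen only to show the call.

```python
from math import comb

def binomial_two_sided_p(k, n, p=0.5):
    """Exact two-sided binomial p-value: total probability of all outcomes
    that are no more likely than the observed count k."""
    probs = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    cutoff = probs[k] * (1 + 1e-9)  # small tolerance for floating-point ties
    return min(1.0, sum(q for q in probs if q <= cutoff))

# Hypothetical input: 55 of 85 trait-increasing alleles are more frequent in
# population A than in population B. Under "no directional difference" each
# allele is equally likely to go either way (p = 0.5).
p_value = binomial_two_sided_p(55, 85)
print(f"two-sided p = {p_value:.4f}")
```

A "neutral" result in the sense used above would correspond to a count close to half of the alleles, for which this p-value is close to 1.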

At this point those 8 are all positive and there is nothing false about them. The only criterion is the correlation with IQ and those 8 might be performing better than the original group, so there is no category of false or not-false.

What is the main claim of Davide Piffer’s paper: “Look guys, I found 9 SNPs with which I can explain IQ differences among populations with r=0.88.” And then he proceeds with doing his random search to prove some statistical significance (significance of what, I fail to see) and he quickly finds 8 other groups of 9 SNPs that outperform the original one in terms of correlation. So actually he undermines the significance of his claim. Finding 9 SNPs which explain IQ differences among populations with r=0.88 is apparently very easy. A random process can do it in 100 trials. There are millions of 9-SNP groups out there that can do the job. Should one paper be written for each case?

My question is why not continue the random search beyond 818 trials and find the one that maximizes the correlation with population IQs. How high can you get? Can you get to r=0.99? And if you did, how would you explain it? My next question would be as follows: replace the IQ list with 26 random numbers and search for the SNPs that correlate with it best. And do it for many different sets of 26 random numbers. Then analyze the random number sets for which you can get high correlations. Only then can you talk about the significance of the results against the spurious correlations that are bound to happen in this heavily underdetermined system (26 dependent variables and potentially millions of independent variables).
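The search proposed here, keeping the best-correlating set out of many random ones, can be sketched as follows. It is a toy under strong assumptions: the target is 26 random numbers, each "SNP set" is the mean of 9 independent uniform frequencies per population (real allele frequencies are correlated across related populations, which this ignores), and `running_best_r` is a hypothetical helper name.

```python
import random

def pearson_r(x, y):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def running_best_r(target, checkpoints, set_size, rng):
    """Best correlation found so far, recorded after each checkpoint
    number of random sets has been tried."""
    best, results = -1.0, {}
    for trial in range(1, max(checkpoints) + 1):
        # One random "SNP set": mean of set_size random frequencies per population.
        score = [sum(rng.random() for _ in range(set_size)) / set_size for _ in target]
        best = max(best, pearson_r(score, target))
        if trial in checkpoints:
            results[trial] = best
    return results

rng = random.Random(0)
random_target = [rng.gauss(0, 1) for _ in range(26)]  # stands in for the IQ list
best_by_trials = running_best_r(random_target, [10, 100, 1000], 9, rng)
for k, r in best_by_trials.items():
    print(f"best r after {k} random sets: {r:.2f}")
```

The best correlation can only rise as more sets are searched, which is the selection effect at issue. It applies to a set chosen because it scored best; whether it applies to a set fixed in advance from earlier GWAS is the point disputed in the replies.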

(Davide, I have been trying to find the referenced Table S1 and failed after looking in both your most recent paper and reference 8 aka doi: 10.20944/preprints201611.0047.v1 The latter mentions "table 13" which I also could not find. I looked for supplemental material for both papers but did not see any. Could you please point me to a description of your selection process for your 9 SNPs?)

The key point (which I have mentioned at least once recently) is that Piffer did not AFAICT pick these SNPs himself--they were derived from earlier GWAS. He is testing a hypothesis based on SNPs derived from other research. This is very different from cherry picking SNPs himself. Even if that were not the case, being able to replicate the results with more SNPs from later studies (as I have also mentioned) is additional evidence in favor of this being a real phenomenon.

In my opinion the people chastising you (utu) for lack of statistics knowledge are on target. If you really don't understand the idea of getting a p value from the random simulations described as a way of testing the hypothesis then you need to spend some time educating yourself. Your criticisms would have merit if Piffer had gone looking for best fitting SNPs from the whole genome and then presented those, but that's not what he did.

For utu's benefit, here is the definition of p value from http://www.statsdirect.com/help/basics/p_values.htm

Compare that definition to Piffer's random SNP methodology.

My understanding from reading papers like this one is that Piffer’s approach can, at best, take us only half-way towards understanding the genetic basis of race differences. That paper shows that at least half of the genetic variants influencing IQ are very rare; lots of the causal variants are different for each extended family. Piffer’s approach tests for differences in common variants between races. However, even if it was firmly established that the frequencies of common causal variants differ between races, that would not say anything about those rare variants that GWAS cannot find. I guess you could make some theoretical argument about selective pressures, but you cannot show it empirically because the variants are so rare.

What is the main claim of Davide Piffer's paper: "Look guys, I found 9 SNPs with which I can explain IQ differences among populations with r=0.88." And then he proceeds with doing his random search to prove some statistical significance (significance of what - I fail to see it) and he quickly finds 8 other groups of 9 SNP's that outperform the original one in terms of correlation. So actually he undermines the significance of his claim. Finding 9 SNPs which explain IQ differences among populations with r=0.88 is apparently very easy. A random process can do it in 100 trials. There are millions of 9 SNPs groups out there that can do the job. Should one paper be written for each case?

My question is why not continue the random search beyond 818 trials and find the one that maximizes the correlation with populations IQ's. How high can you get? Can you get to r=0.99? And if you did, how would you explain it? My next question would be as follows: Replace the IQ list with 26 random numbers and do the search for SNP's that will correlate with it the best. And do it for many different 26 random numbers sets. Then analyze the random numbers sets for which you can get high correlations. Only then you can talk about significance of the results against the spurious correlations that are bound to happen in this heavily undetermined system (26 dependent variables and potentially millions of independent variables).

You utter utter nonsense

You can think of p = 0.01 this way. In a room with 1000 people, you will find on average only 10 whose understanding of statistics is as bad as or worse than yours.

You are failing to see the significance of your own significance testing. The point is not that only 1% of randomly selected groups of 9 SNPs outperform the group you identified. It is the opposite. After a short search of 818 trials you found 8 groups that have higher correlations with population IQs than the group you originally identified. Why not concentrate on the one that yields the highest correlation? If you keep searching, perhaps you could find a group with r=0.999 on the N=26 population IQ set, and then you could write a paper titled: 99% of IQ variance explained with 9 SNPs. Don’t you want a higher correlation? Why settle for just r=0.88? You know exactly why: higher correlations strongly suggest that the effect might be spurious, which is likely because the N=26 set is small. So it is possible that the r=0.88 you obtained is also the result of a lucky spurious effect, and thus it might be meaningless. If you want to find out how significant your result is with respect to spurious effects, you need to randomize the IQ values and then search for groups of SNPs that correlate with them.

utu, you state

At the top of page 2 of his latest paper Piffer states: “Piffer [8] identified 9 genomic loci (table S1) that were replicated across the three largest GWAS of educational attainment published to date [9-11].”

(Davide, I have been trying to find the referenced Table S1 and failed after looking in both your most recent paper and reference 8, aka doi: 10.20944/preprints201611.0047.v1. The latter mentions “table 13”, which I also could not find. I looked for supplemental material for both papers but did not see any. Could you please point me to a description of your selection process for your 9 SNPs?)

The key point (which I have mentioned at least once recently) is that Piffer did not AFAICT pick these SNPs himself–they were derived from earlier GWAS. He is testing a hypothesis based on SNPs derived from other research. This is very different from cherry picking SNPs himself. Even if that were not the case, being able to replicate the results with more SNPs from later studies (as I have also mentioned) is additional evidence in favor of this being a real phenomenon.

In my opinion the people chastising you (utu) for lack of statistics knowledge are on target. If you really don’t understand the idea of getting a p value from the random simulations described as a way of testing the hypothesis then you need to spend some time educating yourself. Your criticisms would have merit if Piffer had gone looking for best fitting SNPs from the whole genome and then presented those, but that’s not what he did.

For utu’s benefit, here is the definition of p value from http://www.statsdirect.com/help/basics/p_values.htm

Compare that definition to Piffer’s random SNP methodology.
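
For readers following along, an empirical (Monte Carlo) p value of the kind discussed here can be sketched in a few lines. The function name `monte_carlo_p` is hypothetical, the null values below are placeholders rather than real simulation output, and the "+1" in numerator and denominator is one common convention (the plain proportion 8/818 is another).

```python
def monte_carlo_p(observed, null_values):
    """Share of null draws at least as extreme as the observed statistic.

    Uses the (k + 1) / (n + 1) convention, which counts the observed
    value itself as one draw from the null distribution.
    """
    exceed = sum(1 for v in null_values if v >= observed)
    return (exceed + 1) / (len(null_values) + 1)

# Placeholder null distribution: 818 random-SNP correlations, of which
# 8 beat the observed r = 0.88 (the counts mentioned in this thread).
null_rs = [0.95] * 8 + [0.10] * 810
p = monte_carlo_p(0.88, null_rs)
print(round(p, 4))
```

With 8 exceedances out of 818 trials this gives 9/819, which is close to the p≈0.01 figure being argued over.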

My point is not about cherry picking. He did the right thing. He identified a group of SNPs on the basis of previous studies and decided to test it against the set of 26 IQs. My point is the opposite of cherry picking. My point is about the prevalence (1%) of randomly selected groups of 9 SNPs that produce a correlation of r≥0.88. Why did Davide Piffer write a paper on the group of SNPs that he found in other publications, which “merely” produced r=0.88, when among the randomly found groups there were some with r>0.88? Shouldn’t a higher correlation warrant more interest?

Tell me what possibly one can’t understand about such a trivial concept. I was surprised that it even had a name. Small minds usually like to give lofty names to trivial things. There is a reason why statisticians do not enjoy the highest respect among mathematicians.

Utu lacks statistics knowledge and repeatedly fails to understand simple concepts, even when several people try to explain them to him. What I did is a Monte Carlo simulation. It’s not arcane or illegal, as Utu is trying to suggest; it’s totally legit, and it’s an empirical way of finding a p value. In this case it is, in my opinion, much better than traditional statistical tests, which rely on too many unsatisfied assumptions (e.g. normality, lack of spatial autocorrelation in the data, etc.).

See:

https://www.biomedware.com/files/documentation/OldCSHelp/MCR/Calculating_Monte_Carlo_p-values.htm

The supplementary table will be added to the final publication. In the meantime, you can view it here: https://docs.google.com/document/d/198eH3X87-969Hxv60MuuL7sGtPEmMYeB2RoM-pGd7cg/edit?usp=sharing

By the way, I have posted a new paper on calculating LD decay and getting rid of it when computing polygenic scores: https://rpubs.com/Daxide/283453

I don’t have a lot of free time to work on this so that is the provisional version.

Look, I found this on the net; I think it may fit your preoccupation with p-values.

Still, it was fortunate that you undertook this random search to satisfy the prevalent fashion in your field for p-value tests, because you found groups of SNPs that outperformed the one you had selected. The problem is that you do not realize the significance of your finding.

Have you bought any bridge lately ?

Let’s stop feeding this troll

Ok this is gonna be my last reply to Utu if he keeps trolling. If only he would read my papers more carefully he’d realize he could have spared us all his posts.

A while ago I had proposed the technique of reverse-engineering polygenic scores by finding those with a high correlation to them. It’s just not the focus of this paper. This paper adopts a more conservative approach. If we assume that the random SNPs with a higher correlation to IQ have some meaning, then my results would become even more significant than they are.

This is what I wrote (Piffer, 2015, Intelligence): “Finally, this method can be “reverse-engineered” to aid in the detection of new GWAS hits by selecting polymorphisms whose frequencies correlate with the polygenic score or selection factor. These genes (or “polygenes”) will have a higher probability of being intelligence related thus reducing the need for extremely large samples and the reliance upon ‘chance capitalization’ typical of current intelligence GWA studies”.

This is just not the focus of this paper. Look up Monte Carlo and then you will understand what I did.

Utu is not a reader, he is a writer.

You seem to rely more on subjective hand waving and foaming at the mouth than on objective statistical p values. Sad.

“If we assume that the random SNPs with a higher correlation to IQ have some meaning, then my results would become even more significant than they are.”

Your result (singular) was identifying 9 SNPs that highly correlated with population IQs. It turns out that 1 in 100 randomly selected groups of 9 SNPs has the same (or better) property. This neither strengthens nor weakens your result, but it raises the question of whether the correlations are spurious. You do not address the issue of spurious correlation. I suggested in my other trolling comments that IQ should be randomized to estimate statistically how likely a spurious correlation is on the set of 26 populations. This is the issue of the underdetermined system.

I am pretty confident that if you continued the random search you could easily find single SNPs, or groups of 2 or 3 SNPs, that also have high correlations with IQ. If randomly finding a group of 9 has p≈0.01, then this might imply that finding a SNP with the “right property” of belonging to such a group has probability (0.01)^(1/9)≈0.6, which is very high. Wow, 60% of the genome has the right kind of SNPs to contribute to a high correlation. When you did your randomization, it would have been useful to look at the histogram of correlations to see how close it was to a binomial distribution.
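
The arithmetic behind the ≈0.6 figure can be checked directly. The sketch assumes, as the comment implicitly does, that each of the 9 SNPs independently has the "right property" with the same probability q, so that q^9 = 0.01; that independence assumption is part of the heuristic, not an established fact.

```python
p_group = 0.01   # chance that a random group of 9 SNPs does as well
n_snps = 9

# Solve q ** n_snps == p_group for the per-SNP probability q.
q = p_group ** (1 / n_snps)
print(round(q, 3))
```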

Let’s look at implications of your result. The straight line that fits polygenic scores (PS) to IQ’s (Table 2 in your paper) with correlation r=0.91 is as follows:

IQ=24.5+131.1*PS

Let’s create virtual populations: Cloud 9, of people who have all 9 SNPs, and Cloud 8, of people who have 8 out of 9 SNPs. Your formula makes the following predictions about IQs for these populations:

Cloud 9: IQ=155 (9 SNPs)

Cloud 8: IQ=141 (8 out of 9 SNPs)

Cloud 7: IQ=126 (7 out of 9 SNPs)
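
These predictions follow from plugging PS = k/9 into the fitted line quoted above. A minimal check, where the helper name `predicted_iq` is only illustrative:

```python
INTERCEPT, SLOPE = 24.5, 131.1  # coefficients of the fitted line quoted above

def predicted_iq(n_carried, n_total=9):
    """IQ predicted by the fitted line for a polygenic score of n_carried / n_total."""
    return INTERCEPT + SLOPE * n_carried / n_total

predictions = {k: predicted_iq(k) for k in (9, 8, 7)}
for k, iq in predictions.items():
    print(f"Cloud {k}: PS={k}/9, predicted IQ={iq:.1f}")
```

The listed values 155, 141 and 126 are these predictions with the decimals dropped.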

The data in Table 8 do not preclude the existence of these high-IQ populations Cloud 9, Cloud 8 and Cloud 7. In Europe the lowest frequency is 0.229 for rs11584700, so Cloud 9 could be as large as 22.9% of the population, but obviously that depends on the mutual (spatial) correlations among frequencies, which I do not have. On the other hand, from the IQ distribution among Europeans we can estimate population sizes:

Cloud 9: IQ=155 0.01%

Cloud 8: IQ=141 0.3%

Cloud 7: IQ=126 4.1%
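
The percentages above look like upper-tail fractions of a normal IQ distribution. The sketch below reproduces them under two assumptions that the comment does not state explicitly: a mean of 100 with a standard deviation of 15, and "population size" meaning the share of people at or above the given IQ.

```python
import math

MEAN, SD = 100.0, 15.0  # assumed IQ scale, not stated in the comment

def share_at_or_above(iq):
    """Upper-tail probability of a normal distribution with MEAN and SD."""
    z = (iq - MEAN) / SD
    return 0.5 * math.erfc(z / math.sqrt(2))

tails = {iq: share_at_or_above(iq) for iq in (155, 141, 126)}
for iq, share in tails.items():
    print(f"IQ >= {iq}: {100 * share:.2f}%")
```

Under these assumptions the shares come out at roughly 0.01%, 0.3% and 4.1%, matching the figures listed.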

Go to some large genome database of Europeans and find out how large Cloud 9, Cloud 8 and Cloud 7 are. This should be your first check. If the results do not make sense, you throw in the towel. If they do make sense, which I think is highly unlikely, you should look at a large database of IQs and try to come up with a predictive model based on these or other SNPs. How big a correlation can you get there, when N=70,000 rather than N=26? If you get r=0.25, this will be worth publishing. But do not count on more.

______

You should relax a bit. And be more appreciative that somebody gave your paper time of day on some level of scrutiny.

Really, this is my last reply to this troll. You are entirely missing the point. I am not interested in predicting IQ within populations (we have GWAS for that). I apply GWAS results to between-population differences. You are making a silly confusion between within-population and between population variance. Since you show that you constantly fail to understand the logic of my method and what’s worse, even basic statistics, you are entirely unqualified to take part in this discussion.

“I am not interested in predicting IQ within populations”

I think you have just blinked, Davide.

The formula that follows from your fit of PS to IQ (IQ=24.5+131.1*PS) has objective existence independent of your interests or wishes. It is you who brought it to life and now it asks questions and you pretend that you do not hear them. But these are really your own questions that you failed or were afraid to ask in the first place. Remember the story of Golem?

“I apply GWAS results to between-population differences.”

That’s why I constructed two populations for you, Cloud 9 and Cloud 8, to get your attention. These two populations have polygenic scores PS=9/9=1 and PS=8/9≈0.889. While they do not have separate geographic locations or separate ethnic identities, they do exist, dispersed and distributed (just like a cloud) among and within other populations. If these populations existed on the map you would have included them in your studies, but since they are dispersed you want to pretend they do not exist? Where is your curiosity, and your willingness to follow through on the consequences of what you have started?

Anyway, the formula from your fit predicts that the IQs of Cloud 9 and Cloud 8 are 155 and 141, respectively. This can be verified by looking up a large group of individuals who qualify as members of Cloud 9 or Cloud 8 (who have all 9 or 8 out of 9 SNPs, respectively) and who underwent IQ testing.

“You are making a silly confusion between within-population and between population variance.”

I think you are being disingenuous in pretending that you can separate one from the other. What made you think of using the average frequency of 9 SNPs to predict average population IQ? Where does the average come from? It is the sum of all individual IQs divided by their number, right? For the average to contain information on the SNPs, individual IQs must contain information on the SNPs, and summing them up must not wipe that information out (average it out). So there must be a functional (not random) relationship between individual IQ and SNPs. You have established that the average IQ is a linear function of the average frequency of 9 SNPs. Granted, it does not follow that exactly the same relationship will hold for individual IQs, but the converse always holds: if individual IQ is a linear function of SNPs, then the average IQ is a linear function of the average SNP frequencies.
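
The last step of this argument (a linear individual-level relationship implies the same linear relationship between averages) can be illustrated with a small simulation. Everything here is hypothetical: the frequencies are made up, carrier status is treated as a 0/1 variable, and the individual-level model is assumed purely for the sake of the demonstration.

```python
import random

A, B = 24.5, 131.1  # intercept and slope of the fitted line quoted in this thread

def simulate_population(freqs, n_people, rng):
    """Assumption: each person carries each SNP independently with the given frequency."""
    return [[1 if rng.random() < f else 0 for f in freqs] for _ in range(n_people)]

def individual_iq(carried):
    # Hypothetical individual-level model: IQ is linear in the share of SNPs carried.
    return A + B * sum(carried) / len(carried)

rng = random.Random(1)
freqs = [0.30, 0.50, 0.25, 0.60, 0.40, 0.70, 0.35, 0.55, 0.45]  # made-up values
people = simulate_population(freqs, 10_000, rng)

# Average of individual IQs.
mean_iq = sum(individual_iq(p) for p in people) / len(people)

# The same line applied to the average SNP frequency in the sample.
sample_freqs = [sum(p[j] for p in people) / len(people) for j in range(len(freqs))]
mean_ps = sum(sample_freqs) / len(sample_freqs)
predicted_mean_iq = A + B * mean_ps

print(mean_iq, predicted_mean_iq)
```

The two numbers agree up to floating-point rounding, because averaging is itself a linear operation. As the comment concedes, the implication does not run the other way: a linear fit between population averages does not by itself establish the individual-level model.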

From what you wrote to Dr. Thompson (http://www.unz.com/jthompson/the-dna-of-genius-n2/) we know that Watson and Venter are members of population Cloud 8, which has an expected IQ of 141 according to the formula you have established. And lo and behold:

Watson IQ=120 (https://www.simonsfoundation.org/science_lives_video/james-d-watson/)

Venter IQ=141 (https://www.bbvaopenmind.com/en/craig-venter-the-man-who-knew-himself/)

These SNPs that explain variance within populations are markers of polygenic selection. They do not have to explain a lot of variance between populations, or even within populations. Again, after your statistical ignorance, your ignorance of evolutionary genetics is showing. The polygenic evolution model predicts that a few SNPs will have frequencies correlated with the frequencies of countless other SNPs. I just need to know the few most important SNPs to pick up a signal and infer the distribution of the other, unknown SNPs.

If selection pressure acted on these 9 SNPs by driving their frequencies up in population A compared to B, then it has also done the same to other SNPs. We don’t need to know what these other SNPs are because theory predicts that they will have similar distribution.

It would be interesting to see if the other SNPs correlating with either your polygenic score or the outcome variables themselves are showing up in the GWAS but just not at the extreme p-values needed to qualify.

It seems to me one could argue for a methodology using your techniques to select the thousand best correlating SNPs and then look at those in the GWAS but using a multiple hypothesis correction for only one thousand alternative hypotheses rather than the hundreds of thousands presented by using all SNPs.

Perhaps not justifiable for the outcome variable correlations since the same coincidence could affect both results (though if the within and between genetics really are as different as critics claim…), but I think it works for your polygenic scores.
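To put rough numbers on the multiple-testing point, here is a minimal sketch comparing Bonferroni thresholds for the two cases. The figure of 500,000 SNPs is a stand-in for "hundreds of thousands", alpha = 0.05 is the conventional choice, and Bonferroni is just one possible correction. The comparison also assumes the thousand candidates are picked without looking at the GWAS p-values.

```python
import math

def bonferroni_threshold(alpha, n_tests):
    """Per-test p-value threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_tests

alpha = 0.05
n_all = 500_000        # hypothetical: every SNP tested in the GWAS
n_candidates = 1_000   # pre-selected SNPs, as proposed above

t_all = bonferroni_threshold(alpha, n_all)
t_cand = bonferroni_threshold(alpha, n_candidates)

# How many units of -log10(p) the threshold relaxes by.
gap = math.log10(n_all / n_candidates)

print(f"all SNPs:   p < {t_all:.1e}  (-log10 p > {-math.log10(t_all):.2f})")
print(f"candidates: p < {t_cand:.1e}  (-log10 p > {-math.log10(t_cand):.2f})")
print(f"threshold relaxed by {gap:.2f} units of -log10(p)")
```

With these assumed counts the required -log10(p) drops by between 2 and 3 units, so SNPs sitting a little below the usual cutoff would become testable.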

It seems to me one could argue for a methodology using your techniques to select the thousand best correlating SNPs and then look at those in the GWAS but using a multiple hypothesis correction for only one thousand alternative hypotheses rather than the hundreds of thousands presented by using all SNPs.

Perhaps not justifiable for the outcome variable correlations since the same coincidence could affect both results (though if the within and between genetics really are as different as critics claim...), but I think it works for your polygenic scores.

Seems like a viable test. Currently I don’t have the time to work on this. Do you? I am happy to offer my support.

I have time, but I’m not sure how to get access to the GWAS results for SNPs giving strong (but not strong enough) signals. One approach would be to use your technique to find the SNPs and send the candidates to a GWAS researcher who could check them. Not sure if you know anyone like that friendly enough to your ideas to do it. I think the decoupling would also help reduce the possibility of people claiming cherry picking.

TLDR; I’d be happy to help if I have access to the necessary data.

P.S. On a related note, if there is other work like this (e.g. I have spent a fair amount of time with R and R markdown) that would be helpful to share perhaps we should talk. I can contact you offline (e.g. email) as long as you are OK with respecting my anonymity here.

P.P.S. I seem to remember seeing an analysis discussing the number of SNPs typically seen in GWAS for differing values of -log10(p), but I don’t have a handy reference. It might be helpful to estimate the number of SNPs (both possible and actual?) likely to appear in the region ~2 (i.e. factor of hundreds per above) below the current threshold.

Perhaps not justifiable for the outcome variable correlations since the same coincidence could affect both results (though if the within and between genetics really are as different as critics claim...), but I think it works for your polygenic scores.

Another thought about this idea. It would be interesting to take a look at it by “rerunning history.” Say by starting from the IQ GWAS (plural) leading to the 9 SNPs in the polygenic score, then using that polygenic score to power the methodology described on the old data. It would be interesting to see if any (how many?) of the IQ SNPs found more recently would have been found. I think this would make a good validation test for the methodology of using Piffer’s polygenic score to augment GWAS SNP discovery. It might also give insights about an appropriate threshold to use for the correlation cutoff.

I had proposed this reverse engineering of my method in my 2015 paper but had not thought about using existing GWAS to validate it. If you email me we can discuss it in depth.

pifferdavide@gmail.com

using that polygenic score

Traits like height, weight, and IQ that have an extended continuum must depend on a very large number of SNPs. A polygenic score which is a sum of N SNPs can produce only N discrete values regardless of whether all SNPs have the same weight or different weights. So either N must be very large, if you define the PS as a sum of SNPs, or the dependence is nonlinear and takes into account mutual interactions among SNPs. For example, for two SNPs you may have 3 possible outcomes:

(0,1)-->Y01; (1,0)-->Y10; (1,1)-->Y11

Then if you take into account all possible combinations among N SNPs you will have 2^N possible discrete values. What I am saying is that defining a polygenic score as a sum of SNPs is probably too simplistic.
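The counting can be checked by brute force for a small N. The sketch below assumes each SNP is coded 0/1 (absent/present), as in the two-SNP example, and uses made-up weights. It only counts how many distinct values an individual-level score can take and says nothing about population frequencies.

```python
from itertools import product

def distinct_scores(weights, codes=(0, 1)):
    """Count the distinct values a weighted-sum score takes over all genotypes."""
    scores = set()
    for genotype in product(codes, repeat=len(weights)):
        # Round to guard against floating-point noise when comparing sums.
        scores.add(round(sum(w * g for w, g in zip(weights, genotype)), 9))
    return len(scores)

n = 4
equal_weights = [1.0] * n
# Hypothetical weights chosen so that no two subsets of SNPs give the same sum.
distinct_weights = [1.0, 2.0, 4.0, 8.0]

print(distinct_scores(equal_weights))     # 5, i.e. n + 1
print(distinct_scores(distinct_weights))  # 16, i.e. 2**n
```

Under this coding an equal-weight sum takes N + 1 values, while weights for which no two subsets tie spread a plain weighted sum over all 2^N combinations.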

Two interesting papers:

Zeroing in on the Genetics of Intelligence, Ruben C. Arslan and Lars Penke

Results of a “GWAS Plus:” General Cognitive Ability Is Substantially Heritable and Massively Polygenic

http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0112390#s1

After all of this discussion have you still not figured out that the idea of the polygenic score is to serve as a signal of polygenic selection?! (i.e. not just as an explanation of the variance accounted for by those SNPs directly) The idea as I understand it is that even though the individual SNPs represent a small fraction of IQ variance they effectively indicate the overall selection effect operating on all IQ SNPs, which allows for more explanatory power than one would expect from the small percent variance explained.

Thanks for reminding us that before they figured out the need to use strict multiple hypothesis corrections in genetic studies false positives were common. I think all of the SNPs we are talking about postdate that realization.

P.S. Your N values argument applies only to individuals. Piffer’s work is primarily looking at population frequencies. An important difference. It helps to understand what you are criticizing before making the criticisms.

How far can we go by imputing the ALFRED data?