I am slowly learning the perverse art of headline writing, but retain an inherent allegiance to telling the truth: I am sure that there are the usual sex differences in Romanian men and women, as indicated in the traditional costumes above, but apparently no consistent differences in intelligence. A null result is as important as a positive result, so this finding must enter the mix for us to ponder. Does it show something specific about one country, or something general about our methods, or both?
Dragos Iliescu, Alexandra Ilie, Dan Ispas, Anca Dobrean, Aurel Ion Clinciu. Sex differences in intelligence: A multi-measure approach using nationally representative samples from Romania. Intelligence Volume 58, September–October 2016, Pages 54–6
Interestingly, the intelligence tests standardised in Romania cover the full range: almost as if no intellectual measure had been left out. Whatever the finding, one cannot easily quibble that another test would have shown a different result.
However, the Lynn hypothesis is that boys are late to mature, so it is only at adult ages that male advantage shows itself. The SON test goes up to 8 years, so is not relevant. The WISC-IV goes up to 17 years so is partially relevant. The Raven test covers the full age range, so is relevant:
Sample sizes are small, which reduces the chance of “significance”, but 10 out of 13 age bands show male advantage to some small degree. Advantage Lynn.
For the 12 adult groups on the MAB-II the story collapses. Overall IQ favours men for 10 out of 12, but only one is significant, the rest tiny. Performance IQ shows male advantage for 10 out of 12, but most are infinitesimal, so forgettable.
For GAMA there are 14 adult age groups, of which 11 show male advantage, but mostly tiny ones, only 3 being significant.
For IST there are 10 adult age groups, of which 2 show male advantage, and only the female advantage is significant.
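For readers who like to check such tallies, here is a rough sketch of an exact two-sided sign test applied to the counts quoted above. It is not an analysis from the paper. It assumes each age band is an independent coin flip, that bands not favouring men favour women (ties are not reported here), and it ignores effect sizes and sample sizes altogether, so it is no substitute for a proper meta-analysis. The function name `sign_test_p` is my own.

```python
from math import comb


def sign_test_p(k, n):
    """Two-sided exact sign test.

    Probability, if male and female advantage were equally likely in each
    band, of a split at least as lopsided as k out of n.
    """
    tail = min(k, n - k)  # size of the smaller side of the split
    p_one_side = sum(comb(n, i) for i in range(tail + 1)) / 2 ** n
    return min(1.0, 2 * p_one_side)


# (bands favouring men, total bands), as quoted in the text above
counts = {
    "Raven": (10, 13),
    "MAB-II overall IQ": (10, 12),
    "GAMA": (11, 14),
    "IST": (2, 10),
}

for test, (k, n) in counts.items():
    print(f"{test}: {k}/{n} favour men, sign-test p = {sign_test_p(k, n):.3f}")
```

A count of directions treats an infinitesimal difference the same as a large one, which is why tallies of this kind can look more impressive than the underlying effects deserve.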
Looking at the individual test results as a whole the picture is, as the authors imply, unconvincing on the male advantage hypothesis, even among those tests that cover adults.
However, almost none of these tests report the raw scores, which is a considerable problem in ability testing. Why not? Well, many intelligence tests have idiosyncratic scoring systems according to the material used, number of items, additions for quick completion, reductions for partial errors, and so on. So the real raw scores are changed into scaled scores, and those scaled scores may be drawn from different tables according to age. There is some scope for blurring reality. It should not affect sex differences, but the change from raw to scaled scores is not something easy to track down. This certainly has an impact on Flynn effect calculations. Looking at the raw scores on coding tasks or digits forwards and backwards for each age (where the raw score is a real ratio scale) would be very interesting, and should knock on the head any residual doubts.
If you inspect the torrent of individual results in the paper, there is little evidence of any consistent pattern of sex differences. The sample sizes for each age band are respectable though not large, so it was with some relief that I turned to their overall meta-analysis of the results in Table 7, though that table is a little hard to read. A positive Cohen’s d score reveals a male advantage. The Q score is the Chi-square test result, with the degrees of freedom in brackets. The I squared statistic takes the chi-square result corrected for degrees of freedom and gives the percentage of the variability in effect estimates that is due to heterogeneity rather than sampling error.
However, to test Lynn’s hypothesis we should have a Table 8 which restricts itself to adults aged 17 and over, across the whole adult age range. This would be interesting.
The authors say:
The only two scores with a significant (though small) effect are the Raven (d = 0.11, p < 0.01), and the Performance subscore of the SON-R (d = 0.12, p < 0.01), both in favor of males. In the case of the SON-R, medium heterogeneity is signalled by the data: Q(5) = 10.01, p < 0.10, I² = 50.04, i.e. 50% of the total variability in this set of effect sizes are due to between-subsamples variability (true heterogeneity). In the case of the Raven scores, heterogeneity is not present: Q(22) = 21.34, ns., I² = 0.00; i.e. all variability in effect size estimates is due to sampling error within subsamples.
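To see how the heterogeneity percentage follows from Q and its degrees of freedom, here is a minimal sketch using the standard Higgins formula, I² = 100 × (Q − df) / Q, floored at zero. I am assuming this is the formula behind the figures quoted above; the function name `i_squared` is mine, not the authors'.

```python
def i_squared(q, df):
    """Higgins' I-squared, in percent.

    Share of the variability in effect sizes that exceeds what sampling
    error alone would predict; negative values are truncated to zero.
    """
    if q <= 0:
        return 0.0
    return max(0.0, (q - df) / q) * 100


# SON-R Performance: Q(5) = 10.01
print(round(i_squared(10.01, 5), 2))

# Raven: Q(22) = 21.34, i.e. Q below its degrees of freedom
print(i_squared(21.34, 22))
```

With the rounded Q values quoted, the SON-R figure comes out at about 50.05, a whisker away from the reported 50.04 (presumably because the published Q is itself rounded), and the Raven figure is zero because Q falls below its degrees of freedom.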
Of course, as Richard Lynn found out, the Wechsler may have been fiddled with a bit to brush away some sex differences, but I doubt that can have been the case for all the other measures, particularly the Raven, designed long ago.
The authors do not bother to remark on something which caught my eye: the Wechsler Intelligence Scale for Children shows a lot of heterogeneity on Full Scale IQ, Verbal IQ and Perceptual Reasoning IQ. The Multidimensional Aptitude Battery and Intelligence Structure tests also show a fair amount of heterogeneity, compared with none for the Raven test. Of course, Richard Lynn might argue that the children’s scale does not prove anything, but that the adult form (not used here) would do so.
The authors conclude:
The random and non-replicable pattern of differences observed in the current research seems to support the conclusion that any sex mean or variance differences are likely spurious and the result of sampling or measurement errors than substantive and stable effects. This conclusion is supported for both general intelligence and second-level (more specific) abilities (e.g. performance vs. reasoning, verbal vs. performance, fluid vs. crystallized).
Cautiously, they admit:
The current study has a number of limitations. First, even though all the 6 samples on which we report data are carefully selected nationally representative samples, they are not comparable in volume to some of the samples on which data was reported in other studies, such as Deary et al. (2003), or Lohman and Lakin (2009). Therefore, while they make an important contribution for an understudied culture, they may only have a limited impact on the international state of knowledge. Second, some of the tests used in the current research were developed to be as sex neutral as possible. At least for the WISC-IV and SON-R, item bias was examined both by trained judges and through item analysis, and the GAMA and MAB-II were developed with the clear objective of minimizing adverse impact by gender. This may have affected the results and contributed to our null effect conclusion.
My comment: “sex neutral” sounds impeccable, but the general drift of test construction is towards sex difference suppression.
Their final word:
Research on group differences in intelligence is a politically charged topic with important societal consequences. Therefore, we strongly encourage researchers examining group differences in intelligence to pay close attention to the quality of the samples used and make efforts for increasing their representativeness.
In fact, I think the authors have done very well. They have set out results from many intelligence tests, not just one, on a good national sample. No, it is not the whole nation, as with the Scottish data. No, there was not a meta-analysis of the adult data separately (though it probably would not come up with much), but overall it certainly gives pause to the acceptance of the sex difference findings in other work.
Is it all down to Romania, and some special sex-difference-annulling culture, as so sedulously sought by some people? Has Romania achieved what the Nordics strove for but could not attain? Although I believe in exceptional countries, as an outside observer I cannot find anything in Romania’s long and rich history which leads me to believe that sex differences were deliberately diminished. However, Romanian readers are invited to send me further and better particulars.