Newspapers have very warmly received an international project which, in the authors’ view, strongly suggests that healthy babies are all alike in their developmental milestones, at least as determined by a study of particular centres in different parts of the world.
The study has the following general features: Find healthy pregnant women in several different comfortable parts of the world and then check whether the development of their children is the same or different between these centres. If the same, argue that race cannot be an explanation for differences between continental groups, since once they are equalized for health, child developmental differences disappear. This could well be true, so the excitement generated by the updated findings is understandable.
Newspapers are hardly to blame for reporting this study in glowing terms. The authors are bold enough to say:
It is evident that across developmental and growth parameters, only a very small percentage (around 10%) of the total variance in these fundamental human functions can be explained by differences among these populations (Fig. 3). The present results and previous publications, presented together in Fig. 3, support the position that most of the observed differences in growth and neurodevelopment across general populations or countries are primarily due to socioeconomic, educational and class disparities, i.e. postal codes define the health profiles of humans better than their genetic code.
For completeness, here is Fig 3
As regards differences between the peoples of different continents, the authors argue there is nothing much to see here, particularly on cognitive abilities, though there may be something happening with children’s behaviour. However, the authors suggest the behaviour difference is because of cultural differences in how people rate behaviour, not because children actually behave differently. Odd, because the authors were trying to ensure standard procedures were used across different sites, so as to be able to make valid statements about differences and similarities. You would have thought they would have ironed these things out in this large and long-term program of work. Anyway, for whatever reason, negative behaviours and emotional reactions vary between sites. Some kids seem to be more of a nuisance in some places.
You may see that Fig 3 shows very little difference in HC (head circumference), which has often been a bone of contention. Here are the actual figures for head circumference in centimeters at 37 weeks, given as mean (standard deviation), taken from the 2014 paper:
UK 34.5 (1.3)
USA 34.5 (1.4)
Brazil 34.2 (1.2)
Kenya 34.2 (1.2)
Italy 34.0 (1.2)
China 33.6 (1.2)
Oman 33.6 (1.1)
India 33.1 (1.1)
As you can see, UK and USA head circumferences are the largest and have the largest standard deviations, while India has the smallest mean and, together with Oman, the smallest standard deviation. Indeed, the Indian mean is 1.4 cm below the UK mean, which is slightly more than one UK standard deviation (1.3 cm). Put like that, the centres differ somewhat in the brain size of the children.
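To make the size of these gaps easier to judge, here is a minimal sketch that expresses each centre's mean as a distance from the UK mean in UK standard deviations. The figures are the ones listed above; using the UK standard deviation as the yardstick is simply the choice made in the text, and the function name is my own.

```python
# Head circumference by site as (mean cm, SD cm), copied from the list above.
head_circumference = {
    "UK": (34.5, 1.3),
    "USA": (34.5, 1.4),
    "Brazil": (34.2, 1.2),
    "Kenya": (34.2, 1.2),
    "Italy": (34.0, 1.2),
    "China": (33.6, 1.2),
    "Oman": (33.6, 1.1),
    "India": (33.1, 1.1),
}


def gap_in_uk_sds(site):
    """How far a site's mean lies below the UK mean, in UK standard deviations."""
    uk_mean, uk_sd = head_circumference["UK"]
    site_mean, _ = head_circumference[site]
    return (uk_mean - site_mean) / uk_sd


for site in head_circumference:
    print(f"{site:7s} {gap_in_uk_sds(site):.2f}")
```

On these numbers India comes out at about 1.08 UK standard deviations below the UK mean, and China and Oman at about 0.69.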
What can we say about the apparent lack of any study centre differences in cognitive abilities? Few psychometricians would suggest that cognitive abilities could be reliably assessed at age 2. The Wechsler Preschool and Primary Scale of Intelligence makes a brave start at 2 years and 6 months. Others find it better to wait till 4 years of age, or better still 7 years of age or, for the sweet spot of early testing with reasonable predictive power for adulthood, 11 years of age.
Let us see what these researchers have included in their cognitive assessment of two year olds.
The following is taken from their Inter-NDA instruction manual
1) Make a tower of 5 blocks. There are no higher scores for children who can do the task immediately. Any child doing it within 3 trials gets the same score as a child who does it on the first trial. A child who builds a 4 block tower gets the same score as a child who only achieves 3 blocks. This may lead to a lack of discrimination among brighter children.
2) Naming 4 colours. Better task, but naming of 1 or 2 colours gets lumped together. Some loss of discrimination.
3) Matching cubes of same colour. Good scoring system, giving a valid 3 point scale.
4) Handing cube to examiner. Simple scoring, the first to use a time cut-off.
5) Puts spoon in cup when asked. This is a very easy test, because many children will have seen spoons in cups. Some kids might put the spoon in the cup without being asked. It isn’t a pure test of language comprehension. The scoring system loses discrimination at the higher end. A child who does it immediately gets the same score as a child who takes 3 trials to get the hang of it. The child who takes a full 5 trials to do it gets the same score as those who do it in 4 trials. Once again, there is a ceiling effect in the scoring system.
6) Match 3 shapes on board. Again, a very easy test, with 3 shapes to be put in their respective holes. Using 4 or even 5 might have given a more discriminative test. Again, the scoring system loses discrimination in the higher range, exactly as described above.
7) Point to the door/entrance in the room. Simple task, same loss of discrimination at higher end.
8) Place raisin into a small opening. A coordination motor task, but a weak test of cognition.
9) Drinks water from cup. A weak test of cognition.
10) Looks at something pointed at. A weak test of cognition.
11) Pretends to drink from a cup. Interesting idea, and a better scoring system.
12) Pretends to make a cup of tea. Some cultural loading here? Test of whether the child can do a pouring motion with a toy teapot.
13) Give the dolly some tea. Imitation.
14) Horizontal scribble. Again, interesting, but scoring not sensitive to brighter children.
15) Finding a bracelet placed in full view under a cloth. Scoring system again could do with more range.
16) Child’s use of plurals when shown objects. Good language test, but again the scoring could be more precise.
There are then several tasks to be rated on the basis of parental report: can ask for toilet, runs back to mother, goes up steps, throws ball near something, kicks ball.
Then a language item about syllabic babbling, good topic, but again very crudely measured. Next items, all reasonable and interesting: uses two words together; indicates “no” by gesture; uses a pronoun; count of how many words the child uses during the assessment (this is a good item, but with restricted range at the top); how many 3 word sentences used (another good item, but with restricted range at top); whether child can follow the topic of conversation (good); combines word and gesture (good).
Summary: testing cognition in 2 year olds is difficult, and the authors have tried a wide range of tasks, which is good. However, the result is a very mixed bag in terms of varying cognitive demands. It would be good to see what the individual item responses look like, and any correlational and factorial analyses. They do not acknowledge any help from psychometricians, and that shows in the test construction. In my view the end result is a blunt instrument, which is fine for roughly screening the lower to middle range and identifying very slow developers, but does not allow brighter children to shine. The item response data might prove me wrong, but I think their scoring systems will severely diminish the relative advantage of brighter children.
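The ceiling effect complained about above can be illustrated with a toy simulation. This is only a sketch under assumptions of my own: a made-up latent "ability", a made-up logistic link between ability and success on each trial, and a hypothetical graded scoring scheme for comparison. The lumped scoring follows the pattern described for the spoon-in-cup item (trials 1 to 3 share one score, trials 4 and 5 share another). None of it uses the study's data.

```python
import math
import random


def graded_score(trials_needed):
    """Hypothetical graded scoring: fewer trials earn more points; None means no success."""
    return 0 if trials_needed is None else 6 - trials_needed


def lumped_score(trials_needed):
    """Lumped scoring: trials 1-3 share one score, trials 4-5 share another."""
    if trials_needed is None:
        return 0
    return 2 if trials_needed <= 3 else 1


def trials_to_success(ability, rng, max_trials=5):
    """First successful trial (1..max_trials), or None if the child never succeeds."""
    p_success = 1.0 / (1.0 + math.exp(-ability))  # assumed link, purely illustrative
    for trial in range(1, max_trials + 1):
        if rng.random() < p_success:
            return trial
    return None


def pearson(xs, ys):
    """Plain Pearson correlation, written out to avoid version-specific helpers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(0)  # fixed seed so the run is repeatable
abilities = [rng.gauss(0, 1) for _ in range(5000)]
trials = [trials_to_success(a, rng) for a in abilities]
graded = [graded_score(t) for t in trials]
lumped = [lumped_score(t) for t in trials]

print("correlation with ability, graded scoring:", round(pearson(abilities, graded), 3))
print("correlation with ability, lumped scoring:", round(pearson(abilities, lumped), 3))
print("distinct scores for success on trials 1-3:",
      len({graded_score(t) for t in (1, 2, 3)}), "graded versus",
      len({lumped_score(t) for t in (1, 2, 3)}), "lumped")
```

The last line is the point at issue: under lumped scoring a child who succeeds at once cannot be told apart from one who needs three attempts. One would expect the graded scores to track ability more closely, but only the real item response data can show how much is lost in practice.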
There then follows a good observational section about the child’s behaviour during testing. I will not go into this further, other than to say that it probably provided very useful data.
Just to tell you about the selection criteria for entry into the study: healthy, breastfed children with minimum environmental, health, and nutrition constraints on growth from six populations in Brazil, Ghana, India, Norway, Oman, and the USA (n=8406). Results of the study showed striking similarity in linear growth in children from the six sites, thereby justifying pooling data to construct one international growth standard from birth to 5 years of age, which has since been adopted worldwide.
This study suggests that if we boost health outcomes for everyone in the world, many apparent racial differences will be shown to be due to bad environments and bad health systems.
Of course, if races actually differ, by choosing to study only those who are healthy and diligent then the differences could be brushed under the carpet in the form of selective sampling, different rejection rates, and for different reasons (early age of childbirth in some samples, later age in others). For example, soldiers in the US Army do not show much in the way of racial differences in intelligence. This is because many candidates from some racial groups have already been rejected by the intelligence-based selection procedure.
If you take the very brightest in the poorer parts of the world, who have risen by immense efforts to that level, such that they can afford environments which are excellent by local standards, and compare them with the average in wealthy countries, you might be picking from very different parts of the bell curve. This could well misrepresent real differences by apparently selecting for similar health environments across the world.
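The statistical point about selection can be sketched with two purely hypothetical populations. The numbers below are invented for illustration and are not estimates for any real group: both populations have a normally distributed trait with the same spread, their means differ by half a standard deviation, and only individuals above a common cut-off are admitted to the sample. The sketch uses the standard formula for the mean of a truncated normal distribution.

```python
from statistics import NormalDist


def mean_above_cutoff(mu, sigma, cutoff):
    """Mean of a normal(mu, sigma) trait among those scoring above the cut-off."""
    std = NormalDist()
    z = (cutoff - mu) / sigma
    return mu + sigma * std.pdf(z) / (1.0 - std.cdf(z))


def share_selected(mu, sigma, cutoff):
    """Fraction of a normal(mu, sigma) population that clears the cut-off."""
    return 1.0 - NormalDist(mu, sigma).cdf(cutoff)


# Hypothetical populations: means 0.0 and -0.5, SD 1.0, common cut-off 1.0.
mu_a, mu_b, sigma, cutoff = 0.0, -0.5, 1.0, 1.0

gap_before = mu_a - mu_b
gap_after = mean_above_cutoff(mu_a, sigma, cutoff) - mean_above_cutoff(mu_b, sigma, cutoff)

print(f"gap before selection: {gap_before:.2f}")
print(f"gap after selection:  {gap_after:.2f}")
print(f"share selected: A {share_selected(mu_a, sigma, cutoff):.1%}, "
      f"B {share_selected(mu_b, sigma, cutoff):.1%}")
```

With these made-up inputs the gap between the selected groups shrinks from 0.5 to roughly 0.09, while the share clearing the cut-off differs markedly (about 16% against about 7%). That is why the rejection rates at each centre matter: similar outcomes among the selected tell you little about the populations they were drawn from.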
The fly in the ointment is that by picking healthy elites across the world the groups may inadvertently have been roughly equalized for intelligence. This is really hard to determine. The Oxford sample is very highly educated, and even there only 22% of Oxford mothers met the study criterion. This sample seems to be super-selected for big brains. The results for each study centre have to be studied in detail.
most common reasons for ineligibility were maternal age younger than 18 years or older than 35 years (915, 11%), maternal height less than 153 cm (1022, 12%; mostly in India and Oman), and BMI of 30 kg/m2 or higher (1009, 12%; mostly in the UK and USA). The contribution of each site to the total study population ranged from 7% (311/4607) in the USA to 14% (640/4607) in the UK. Of the 4607 enrolled, we excluded 36 women (0·8%) who developed severe conditions during pregnancy or took up smoking or drug use, and 71 (1·5%) were lost to follow-up or withdrew consent. Of the 4422 women who had live singleton births, 4321 (98%) had newborn babies without congenital malformations; their data comprised the FGLS population in this study.
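The counts in that passage can be checked with a few lines of arithmetic. This is just a sketch that re-derives the quoted percentages from the quoted counts; the variable names are mine.

```python
# Counts as quoted in the passage above.
enrolled = 4607
site_counts = {"USA": 311, "UK": 640}
excluded_severe_or_substance = 36
lost_or_withdrew = 71
live_singleton_births = 4422
without_malformations = 4321


def pct(part, whole):
    """Percentage of whole represented by part."""
    return 100.0 * part / whole


for site, count in site_counts.items():
    print(f"{site}: {pct(count, enrolled):.1f}% of enrolled")

remaining = enrolled - excluded_severe_or_substance - lost_or_withdrew
not_itemized = remaining - live_singleton_births

print(f"excluded: {pct(excluded_severe_or_substance, enrolled):.1f}%")
print(f"lost or withdrew: {pct(lost_or_withdrew, enrolled):.1f}%")
print(f"remaining after exclusions and losses: {remaining}")
print(f"gap to live singleton births, not itemized in the passage: {not_itemized}")
print(f"without malformations: {pct(without_malformations, live_singleton_births):.1f}%")
```

The percentages agree with the quoted ones after rounding. The exclusions and losses leave 4500 women, so 78 pregnancies separate that figure from the 4422 live singleton births, and the quoted passage does not itemize them.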
The other disappointment about this very large study is that it stopped at age 2, long before proper intelligence testing could be carried out. It would be fascinating to do the follow-up, now that many of the children are of an age when testing is more strongly predictive of adult achievements.
Curiously for a study which was later used to say that there were no racial differences, the racial composition of the samples is not mentioned in the main results paper.
For example, the Kenyan sample is in Parkland suburb, originally designed by colonial Britain for civil servants, and is said to have a high Asian population, that is to say Indians from India. Was that the case in this sample?
The Brazilian sample was in Pelotas. The study details from the website do not mention the racial background of the sample. I wrote about Pelotas some time ago:
The State of Rio Grande do Sul, in which the city of Pelotas is situated, is 79% European 9.5% African and 11% American Indian. Did the high-status sample conform to those proportions, or was it purely European? I cannot find this described in the paper, but it may be in some annexes.
The Indian centre was in Maharashtra, which is the 8th highest province in national examinations in India, with a score of 258. Not the top (275), but well above average, and Indian provinces are very different in their scholastic attainments, more so than US states.
And, as already stated, Oxford mothers were all highly educated but only 22% of pilot study mothers met criteria for inclusion. This is super-selection for big brains.
It should be a simple matter to mention racial backgrounds in the paper (it may be in supplementary details somewhere).
The samples are not of equal sizes. The Kenyan sample is three times as big as the USA sample. This could unbalance some of the comparisons, though it is just a fact of life in international studies that some places are easier to study than others.
The study deliberately selected healthy women in un-polluted neighbourhoods.
At the individual level, we recruited mothers (and their newborn babies) for FGLS aged older than or equal to 18 years and younger than or equal to 35 years, who measured greater than or equal to 153 cm in height, had BMI greater than or equal to 18·5 kg/m2 and less than 30 kg/m2, who had no clinically relevant obstetric or gynaecological history, initiated antenatal care at less than 14 weeks of gestation (by menstrual dates), and met the entry criteria of optimum health, nutrition, education, and socioeconomic status.
They describe this as selecting for health. Of course, it may also select for intelligence. They say that about a third of the mothers in these areas met criteria. It would be good to know how representative these women were of their nations as a whole, rather than of the wealthy urban areas with good health services chosen for the study. More important would be to find the educational backgrounds of the mothers.
This is a very large and well-funded project, and I think that some opportunities have been missed.
1) There may have been psychometric input, but I think they mostly consulted child psychiatry and obstetrics departments, and the cognitive assessment could have been better, and the scoring system improved.
2) Dan Freedman’s babies. This was a great opportunity to look at neonate behaviour in the first few days and try to replicate Freedman’s work.
3) Full genomes on children and parents.
Considering the strong statements the authors made to the press, about post codes being more important than genetic codes, they have chosen not to make that comparison in their actual study.
Lastly, the study patients did not undergo genetic profiling and, although this might seem to be a limitation, the eight populations included in the study are unlikely to be homogeneous when compared with each other.
Well, the lack of homogeneity does not make up for an opportunity missed. Also, as already mentioned, genetic profiling would allow us to see which racial groups were represented in each of the centres, and to look at racial groups whichever centre they were tested at. This is a well-funded study (Gates Foundation) which could easily have afforded a SNP type analysis for about $150 per child and a full genome for about $1000. Why not measure the code you are seeking to disparage?
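For a rough sense of scale, the sketch below multiplies those unit prices by a cohort size. The cohort size is my assumption: I use the 4321 newborns from the eligibility passage quoted earlier, and the number of children actually followed to age 2 may well differ. Parents are not included.

```python
# Assumption: cohort size taken from the 4321 newborns quoted earlier.
# Unit prices are the rough figures given in the text.
children = 4321
snp_cost_per_child = 150
genome_cost_per_child = 1000

snp_total = children * snp_cost_per_child
genome_total = children * genome_cost_per_child

print(f"SNP analysis for all children: ${snp_total:,}")
print(f"Full genomes for all children: ${genome_total:,}")
```

On those assumptions the SNP analysis comes to about $0.65 million and full genomes to about $4.3 million.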
What do I think of this study?
I think the authors have over-sold their findings. Their method is interesting and their findings could well be right (that once health is at a high standard, child development is relatively uniform in terms of major milestones), but a number of things are still uncertain.
They should have included genetic profiling. There is no longer any excuse for avoiding it. There is certainly no excuse for excluding the possible contribution of a variable you choose not to measure.
In that vein, why avoid even a description of the racial composition of each sample? The Brazilian sample is probably largely European in origin. The Kenyan sample might include many Indians. Why not disclose these matters if your wish is to show the unimportance of genetics? They may be of no consequence, but it would be good to be sure about it.
ALSPAC did it years ago:
The Avon Longitudinal Study of Children and Parents (ALSPAC) was established to understand how genetic and environmental characteristics influence health and development in parents and children. All pregnant women resident in a defined area in the South West of England, with an expected date of delivery between 1st April 1991 and 31st December 1992, were eligible and 13 761 women (contributing 13 867 pregnancies) were recruited. These women have been followed over the last 19–22 years and have completed up to 20 questionnaires, have had detailed data abstracted from their medical records and have information on any cancer diagnoses and deaths through record linkage. A follow-up assessment was completed 17–18 years postnatal at which anthropometry, blood pressure, fat, lean and bone mass and carotid intima media thickness were assessed, and a fasting blood sample taken. The second follow-up clinic, which additionally measures cognitive function, physical capability, physical activity (with accelerometer) and wrist bone architecture, is underway and two further assessments with similar measurements will take place over the next 5 years. There is a detailed biobank that includes DNA, with genome-wide data available on >10 000, stored serum and plasma taken repeatedly since pregnancy and other samples; a wide range of data on completed biospecimen assays are available. Details of how to access these data are provided in this cohort profile.
Two years of age is too early to make statements about cognitive ability, but the cognitive items could have been improved and scored more accurately. There is probably a reasonable cognitive measure lurking there. Another opportunity lost, though the next wave of testing can very probably provide further and better particulars. A truly negative result would have been really interesting.
In summary, there may well be a lot here which is worth considering, and which will contribute to our knowledge of early child development. The very restricted selection criteria may have been selecting on factors other than just health. With genome sequencing and some intelligence test data when the children are 7 and/or 11 years of age this study will be more informative.