When I started work in September 1968 one of the first things I was taught was that intelligence testing had a long history, and that many of the subtests in the Wechsler assessments had been taken from previous research. Kohs’ blocks (1920), I used to mutter, when people talked about Block Design. I was also taught something about the Stanford-Binet tests, which I would not be using, because some clinicians still used them, and there was historical data I would need to know about. In hearing about skilled Binet testers I learned about dynamic testing: moving from one domain to another as quickly as possible, simply to establish general levels efficiently. I also learned that such procedures were possible only once one had achieved a very good knowledge of the test.
I was required to know my material almost by heart so that I could concentrate on every aspect of the patient’s behaviour. After 200 test administrations I began to feel confident I had seen it all, and knew all the error types intimately. On my 201st administration I encountered an entirely new error on Block Design, a highly unusual scope and size error. Even psychologists can learn something.
Does history matter? I think so. The early history of intelligence testing allows us to test the idea that IQ items are “arbitrary” and have no relevance to real life problems.
First publication of subtests in the Stanford-Binet 5, WAIS-IV, WISC-V, and WPPSI-IV. Aisa Gibbons & Russell T. Warne. Intelligence, Volume 75, July–August 2019, Pages 9–18.
The authors discuss the pre-history of intelligence testing from 1905 onwards. The period up to 1920 was extremely productive, and testing was popular, perhaps because of its widespread use in the military. Binet was interested in the lower levels of ability, Terman in the highest. Tests have to cater for the entire range, all 7 tribes of intellect. Not only that, they have to maintain discriminatory power throughout the whole range, though that is hard to do at the extremes.
Wechsler favored test formats and items that (a) showed high discrimination in intelligence across much of the continuum of ability, (b) produced scores with high reliability, (c) correlated strongly with other widely accepted measures of intelligence, and (d) correlated with “pragmatic” subjective ratings of intelligence from people who knew the examinee—such as a work supervisor (Wechsler, 1944).
That is a good summary of what an intelligence test item must achieve: discrimination, reliability, validity with other tests and, most importantly, with intelligence in everyday life.
Gibbons and Warne show that many tests go back a long way, and are earlier than generally realized. Their list of tests is an excellent way to understand all the tasks which have constituted the core elements of intelligence testing.
I learned a great deal reading through this section of the paper. For example, I did not know that Binet said of his early reasoning test that it was the best of the lot:
“the 1908 scale (of reasoning) has three images, each containing at least one human figure. The child then was asked to describe the picture, and more complex responses based on interpretation (rather than simply naming objects in the image) were viewed as indicative of greater intellectual ability. Binet found this subtest so useful when diagnosing intellectual disabilities that he wrote, “Very few tests yield so much information as this one. … We place it above all the others, and if we were obliged to retain only one, we should not hesitate to select this one” (Binet & Simon, 1908/1916, p. 189).”
Intelligence goes beyond the obvious.
Here are some historical points which were news to me:
Jean Marc Gaspard Itard was the first to use a form board-like task when he studied and educated a young boy found in the wild (named the “wild boy of Aveyron”) in 1798.
The very similar visual puzzles and object assembly subtests have an origin in the puzzles used for entertainment and geography education, which were first created in the 1750s in England and were in wide-spread use in the early 20th century when the first intelligence tests were being created (Norgate, 2007).
One discovery that we found striking was the diverse sources of inspiration for subtests. While the majority did have roots in the creation of cognitive tests, others have their origin in games (the delayed response subtest, the object assembly subtest), classroom lessons (the block design subtest), the study of a feral child (form boards and related subtests), school assessments (vocabulary subtest) and more. To us, this means that items on intelligence tests often have a connection with the real world—even when they are presented in a standardized, acontextual testing setting. Additionally, this undercuts the suggestion that critics of intelligence testing often make that intelligence test items are meaningless tasks that are divorced from any relationship to an examinee’s environment (e.g., Gould, 1981).
On the other hand, one criticism of intelligence tests seems justified from our study: subtests that appear on popular intelligence tests have changed little in the past century (Linn, 1986). While one could argue that the enduring appeal of these subtests is due to their high performance in measuring intelligence, the fact remains that many of these subtests were often created with little guiding theory or understanding of how the brain and mind work to solve problems (Naglieri, 2007). While sophisticated theories regarding test construction and the inter-relationships of cognitive abilities have developed in recent decades (e.g., Carroll, 1993), it is often not clear exactly how the tasks on modern intelligence tasks elicit examinees to use their mental abilities to respond to test items.
One way to test this criticism is to devise new tests more suited to the present age. Of course, others have already had that thought, and have created computer games which measure intelligence. Fun, but is this a big advance? It is only a gain if the results are more accurate, more predictive of real-life achievements, and more speedily obtained. That is a hard bar to clear, since reasonable overall measures can be obtained in a few minutes. More likely, corporations are already measuring our intelligence very quickly and surreptitiously by noting our Google searches, Facebook likes, and perhaps even our commenting histories.
A more pressing problem, to which the authors allude in passing, is that new-fangled tests are launched each year, and most fall out of use. The reason is that Wechsler testers have become highly pragmatic, and do not take kindly to complicated administration procedures, nor to test materials which are difficult to assemble and present quickly.
The reality appears to be that any puzzling task taps ability, and there are diminishing returns when adding new psychometric tasks. This is the familiar “indifference of the indicator” which Spearman proposed in 1904. This does not exclude finding that individuals have strengths and weaknesses in specific domains; it simply means that all tasks lead to g, either quickly or slowly, to slightly varying degrees.