Peter Schonemann Bibliography


Peter H. Schonemann

This material is abstracted from Dr. Schonemann's web page:
http://www.psych.purdue.edu/~phs/phs.html


IQ Controversy


(a) Problem of defining "intelligence"

In his controversial revival of the eugenic traditions of the 20s, Arthur Jensen (1969) made explicit reference to Spearman's factor model as a vehicle for defining "intelligence". In view of the factor indeterminacy problem (see above, factor analysis), this approach is not viable [40, 47, 52, 57, 83]. Recourse to concrete IQ tests is equally unsatisfactory, because different tests are often quite poorly correlated. In fact, this was the reason why Spearman had postulated the g-model in the first place. Contrary to what is sometimes claimed, conventional IQ tests prove to be poor predictors of criteria of interest, including scholastic achievement. For example, the SAT - a close relative of conventional "verbal" IQ tests such as the Army Alpha - is consistently outperformed by previous grades as a predictor of subsequent grades, especially as the prediction interval is lengthened. For long range criteria (such as graduation or GPA at graduation), the SAT usually accounts for less than 5% of the criterion variance (Humphreys, 1967; Donlon, 1984). As might be expected, the findings for the GRE are even worse. In two recent, still unpublished large scale validity studies, Horn and Hofer (undated) and Sternberg (1998) found that the validities of the GRE for predicting successful completion of graduate training are effectively zero.
 

  • This means that no one knows what "intelligence" is after 100 years of feverish "research". These disappointing results are especially disconcerting if viewed against the historical record of the mental test movement, which Jensen and his followers have tried to revive by linking untenable validity claims for IQ to equally untenable "heritability" claims (see Quantitative Behavior Genetics, below).

 

(b) Spearman's Hypothesis

In the early 80s, Jensen (in Bias in Mental Testing, 1980) revived a casual observation Spearman had made in 1927, namely that the subtests most highly loaded on his general intelligence factor "g" showed the largest black/white contrasts ("Spearman's Hypothesis"). Jensen, after substituting the largest principal component (PC1) for "g", interpreted this as new, compelling evidence for the existence of g which seemed to corroborate his central claim that Blacks, on average, are deficient in g compared to Whites, and that these differences are primarily genetic, not cultural, in origin.

In [43] I drew attention to the fact that this result can be explained as an artifact which has nothing to do with Blacks or g, but rather arises with any data, including randomly generated data, provided they exhibit a sufficiently large mean difference vector. This explanation was subsequently challenged by Shockley, who correctly pointed out that it was limited to a positive relation between the mean differences and the weights of the PC1 of the pooled group, while most of Jensen's data showed such a positive relation within each group PC1. I therefore extended my results to this more general situation by imposing multinormality as an additional condition. I showed mathematically, geometrically, empirically, and by random simulation, the following result:

If one splits a multinormal distribution of positively intercorrelated variables into a high and a low group, then one will find

(a) that the mean differences between both groups are monotonically related to the loadings on the largest principal component,

(b) if both groups are of equal size, then the cosine between both vectors will be unity, and

(c) if the groups are of different size, then the effect will be more pronounced for the larger group [68, 82, 83].
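
The artifact can be illustrated with a small simulation. The following is a minimal sketch and not the procedure of [68, 82, 83]: it assumes that the split is a median split on the unweighted sum score, it generates the positively intercorrelated normal variables from arbitrary illustrative weights, and it compares the mean differences only with the PC1 of the pooled group. All function names and parameter values are hypothetical.

```python
import random

def simulate_scores(n=4000, weights=(0.8, 0.7, 0.6, 0.5, 0.4), seed=1):
    """Draw n rows of positively intercorrelated normal variables.

    Assumption for illustration: variable j is weights[j] * z + noise, so
    variables i and j correlate weights[i] * weights[j] > 0. No real data.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)
        rows.append([a * z + (1.0 - a * a) ** 0.5 * rng.gauss(0.0, 1.0)
                     for a in weights])
    return rows

def mean_vector(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]

def covariance(rows):
    m = mean_vector(rows)
    centered = [[v - mj for v, mj in zip(row, m)] for row in rows]
    p = len(m)
    return [[sum(c[i] * c[j] for c in centered) / (len(rows) - 1)
             for j in range(p)] for i in range(p)]

def first_principal_component(cov, iterations=200):
    """Leading eigenvector of cov by power iteration (unit length)."""
    v = [1.0] * len(cov)
    for _ in range(iterations):
        w = [sum(c * x for c, x in zip(row, v)) for row in cov]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sum(a * a for a in u) ** 0.5 * sum(b * b for b in v) ** 0.5)

def split_cosine(rows):
    """Cosine between high-minus-low mean differences and the pooled PC1."""
    ordered = sorted(rows, key=sum)      # median split on the sum score
    half = len(ordered) // 2
    low, high = ordered[:half], ordered[half:]
    diff = [h - l for h, l in zip(mean_vector(high), mean_vector(low))]
    return cosine(diff, first_principal_component(covariance(rows)))

rows = simulate_scores()
print(round(split_cosine(rows), 3))
```

Under these assumptions the printed cosine comes out close to 1, although the data are nothing but random numbers and the two "groups" were created by the split itself.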

  • Thus, Spearman's Hypothesis does not warrant any of the far-reaching claims Jensen and some of his followers (e.g., Herrnstein and Murray) have attached to it. In particular, it does not validate the existence of a general ability g as Jensen has asserted, nor does it have any bearing on the race question.

Publication [82] is a Target Article on this topic, followed by numerous commentaries. Most of them endorse the stringency of the above reasoning. For a chronicle of the protracted history of the target article, see [85].

 

(c) Hit-Rate Bias

In view of the considerable implications of a mistaken interpretation of discrepancies in IQ performance of various ethnic groups, much attention has been focused on the question whether these discrepancies might conceivably be the result of a bias that favors some groups over others (e.g., males over females). A. Jensen (1980) has devoted a whole book to this issue, entitled "Bias in Mental Testing". He concluded that such worries are unwarranted so far as the Black/White discrepancy is concerned. If anything, the tests seemed to overpredict Black criterion performance.

This reasoning follows traditional lines in emphasizing the institutional point of view (e.g., that of universities) over that of the testees, by focussing on regression equations and validity coefficients. From an institutional point of view, a test is useful if it improves the composition of the subgroup that is eventually hired as a result of superior test performance. From this point of view it can be shown that even a test with low validity has merit as long as the hiring institution employs a sufficiently stringent admission quota (by raising the test cut-off).

However, this narrow institutional perspective ignores two important aspects of the bias problem:
 

(a) the base-rate problem:

By solely focusing on the regression equation and correlations (predictive validities), the traditional approach to the bias problem ignores the fact that a test with positive validity can still be worse than useless if the base-rates (= proportion of qualified candidates) are sufficiently skewed. To illustrate this briefly, suppose the base-rate of a clinical syndrome (e.g., schizophrenia) is 1%. In this case we would be able to achieve 99% correct prediction by simply predicting that everybody is "normal", regardless of test performance. For a test to achieve the same degree of correct prediction, it would have to have an unrealistically high predictive validity.

More generally, validity coefficients by themselves (in the absence of knowledge of base rate and quota) are meaningless as indicators of the practical utility of a test.

This was already known to Meehl and Rosen (1955), but has been conveniently forgotten in the meantime.
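
The arithmetic of the 1% example can be sketched as follows. The text speaks of predictive validity; as a simplifying assumption, this sketch describes the test instead by a sensitivity and a specificity, and the 90% figures are invented for illustration. The function names are hypothetical.

```python
def accuracy_predict_all_normal(base_rate):
    """Proportion correct when everybody is declared "normal"."""
    return 1.0 - base_rate

def accuracy_with_test(base_rate, sensitivity, specificity):
    """Proportion correct when every decision follows the test."""
    return base_rate * sensitivity + (1.0 - base_rate) * specificity

def specificity_needed(base_rate, sensitivity):
    """Specificity at which the test merely ties the base-rate prediction."""
    target = accuracy_predict_all_normal(base_rate)
    return (target - base_rate * sensitivity) / (1.0 - base_rate)

base_rate = 0.01                                   # 1% base-rate from the text
print(accuracy_predict_all_normal(base_rate))      # ignore the test entirely
print(accuracy_with_test(base_rate, 0.90, 0.90))   # hypothetical 90%/90% test
print(specificity_needed(base_rate, 1.0))          # even with perfect sensitivity
```

With these invented figures the test does worse than ignoring it: it would need a specificity of roughly 99% just to tie the blanket prediction.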
 

(b) the interests of the testee (as opposed to those of the hiring or admitting institution):

Once the bias problem is cast into the language of prediction error frequencies (rather than validities and regression equations, disregarding base-rates), it becomes immediately obvious that to the same extent that low validity tests benefit institutions when (admission) quotas are tightened, they penalize qualified applicants by wrongly rejecting them as a result of poor test validity:

Traditionally, the conditional probability that a candidate will be successful (e.g., graduate), given that he passes the test, is called the "success ratio" (SR) of the test. Following standard terminology of signal detection theory, let HR denote the "hit-rate", which is the reverse conditional probability, namely that a qualified candidate passes the test. Finally, let Q denote the (admission) "quota" and BR the base-rate (the proportion of qualified candidates in the unselected population).

Then Bayes' Theorem asserts:

                                                                 SR = HR x BR/Q,

which expresses the conventional institutional point of view: The smaller we make the quota (by raising the test cut-off), the larger will be the success ratio, because Q shows up in the denominator.

However, if we adopt the point of view of qualified candidates, then we find (by solving the above equation for the hit-rate):

                                                                HR = SR x Q /BR.

Now Q appears in the numerator. Hence, the tighter the admission quota, the smaller will be the hit-rate, that is, the chance that a qualified student is admitted.
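
The two readings of the same relation can be checked on a small table of counts. This is a minimal sketch; the applicant pool below is made up for illustration and the function names are hypothetical.

```python
def success_ratio(hit_rate_value, base_rate, quota):
    """SR = HR x BR / Q: chance that an admitted candidate is qualified."""
    return hit_rate_value * base_rate / quota

def hit_rate(success_ratio_value, quota, base_rate):
    """HR = SR x Q / BR: chance that a qualified candidate is admitted."""
    return success_ratio_value * quota / base_rate

# Hypothetical applicant pool, for illustration only.
applicants = 1000
qualified = 600                  # BR = 0.6
admitted = 200                   # Q  = 0.2
admitted_and_qualified = 150

br = qualified / applicants
q = admitted / applicants
sr = admitted_and_qualified / admitted      # counted directly from the table
hr = admitted_and_qualified / qualified     # counted directly from the table

print(sr, success_ratio(hr, br, q))         # both routes give the same SR
print(hr, hit_rate(sr, q, br))              # both routes give the same HR
```

In this made-up pool three out of four admitted candidates are qualified, yet only one in four qualified candidates is admitted.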

Although these simple relations have been known for a long time, they have been consistently ignored or downplayed in the mental test literature. In particular, so far as I know, few if any systematic investigations of the actual hit-rates as a function of validity, base-rate, and quotas seem to have been made in the past. Nor have the experts shown much interest in the problem whether the tests may be biased against minorities in terms of hit-rates.

In [78] we derive simple approximations for hit-rates and tabulate them as a function of validity, quota, and base-rate. We also derive a bound on hit-rates,

                                                                   HR < Q/BR,

which follows from HR = SR x Q/BR because SR, being a probability, cannot exceed 1, and which says that tightening the quota inevitably penalizes the qualified students by lowering the hit-rate.
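
The bound and the effect of tightening the quota can be checked numerically. The sketch below does not reproduce the approximations or tables of [78]; it is a plain Monte Carlo simulation which assumes that test score and criterion are bivariate normal with correlation equal to the validity. The validity of .35 and the base-rate of .6 are illustrative values, and the function name is hypothetical.

```python
import random

def simulate_rates(validity, quota, base_rate, n=20000, seed=0):
    """Return (hit_rate, success_ratio) from simulated test/criterion pairs."""
    rng = random.Random(seed)
    resid_sd = (1.0 - validity ** 2) ** 0.5
    pairs = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)                              # test score
        y = validity * x + resid_sd * rng.gauss(0.0, 1.0)    # criterion
        pairs.append((x, y))
    n_admitted = round(quota * n)          # the top Q of test scores pass
    n_qualified = round(base_rate * n)     # the top BR on the criterion qualify
    x_cut = sorted((x for x, _ in pairs), reverse=True)[n_admitted - 1]
    y_cut = sorted((y for _, y in pairs), reverse=True)[n_qualified - 1]
    hits = sum(1 for x, y in pairs if x >= x_cut and y >= y_cut)
    return hits / n_qualified, hits / n_admitted

validity, base_rate = 0.35, 0.6            # illustrative values only
for quota in (0.5, 0.3, 0.1):
    hr, sr = simulate_rates(validity, quota, base_rate)
    # columns: quota, simulated HR, simulated SR, bound Q/BR
    print(quota, round(hr, 3), round(sr, 3), round(quota / base_rate, 3))
```

Because the same simulated applicants are reused for every quota, each tighter quota admits a subset of the previous admissions, so the simulated hit-rate can only fall while staying below Q/BR.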

Finally, we review a number of data sets to evaluate the hit-rates of the SAT and ACT for different ethnic groups and also to evaluate the hit-rate bias, i.e., the extent to which conventional tests favor or discriminate against subgroups in terms of the chances of qualified students to pass the test. We found that these conventional admission tests discriminate against Blacks, and that the bias increases as the admission quotas are tightened. In [48] these results are further refined and extended to include formulae for estimating the minimum validity needed for given quota and base rate, so that use of the test improves the percentage of correct decisions over random admissions. Basically, in the actual observed validity range (.3 - .4), no test improves over random admission in terms of overall correct decisions as soon as one of the two base rates exceeds .7.

  • Some of these results impact directly on the ongoing debate about the admission standards of the NCAA, and on the affirmative action debate more generally. In the past, such discussions have often been misguided by the fallacious notion that all one has to do to "raise standards" is to raise test cut-offs. The above formula underlines the need to clearly distinguish between predictor standards and criterion standards, especially if the test validities are low, as they usually are. In this case, raising test standards may not raise criterion standards as much as it raises discrimination against minorities.

 

Quantitative Behavior Genetics  
 

Presumably, one reason for the astonishing persistence of the IQ myth in the face of overwhelming prior and posterior odds against it is the unbroken chain of excessive heritability claims for "intelligence", which IQ tests are supposed to "measure". However, if "intelligence" is undefined, and Spearman's g is beset with numerous problems, not the least of which is universal (and generally acknowledged) rejection of Spearman's model by the data, then how can the heritability of "intelligence" exceed that of milk production of cows and egg production of hens?

These problems are addressed in a series of recent publications [54, 60, 61, 62, 63, 70, 71, 72, 75, 81]. In [70] it is shown that a once widely used "heritability estimate" (Holzinger's h**2) is mathematically unsound, because Holzinger had made a mistake in his derivations. On the other hand, another such estimate, though mathematically valid, never fits any data. This could have been obvious for a long time because it produces an inordinate number of inadmissible estimates. However, such estimates often found their way into print without challenge or comment. Moreover, it also produces excessive "heritabilities" for variables which plainly have nothing to do with genes. For example, the "heritability" of answers to the question: "Did you have your back rubbed last year?" turns out to be 92% for males and 21% for females [81].

  • The main problem is that all such estimates rely on simplistic mathematical models which necessarily make unrealistically stringent assumptions that were rarely tested. Once they are tested, one finds they are usually violated by the data. A comprehensive review of these issues is attempted in [81], where further references to specific subproblems can be found.