Peter Schonemann Bibliography
Peter H. Schonemann
This
material is abstracted from Dr. Schonemann's web page:
http://www.psych.purdue.edu/~phs/phs.html
IQ Controversy
(a) Problem
of defining "intelligence":
In
his controversial revival of the eugenic traditions
of the 20s, Arthur Jensen (1969) made explicit reference to
Spearman's factor model as a vehicle for defining
"intelligence". In view of the factor indeterminacy
problem (see above, factor analysis), this approach is not viable
[40, 47, 52, 57, 83]. Recourse to concrete IQ tests is equally
unsatisfactory, because different tests are often quite poorly
correlated. In fact, this was the reason why Spearman had postulated
the g-model in the first place. Contrary to what is sometimes
claimed, conventional IQ tests prove to be poor
predictors of criteria of interest, including scholastic
achievement. For example, the SAT - a close relative of conventional
"verbal" IQ tests such as the Army Alpha - is consistently
outperformed by previous grades as a predictor of subsequent
grades, especially as the prediction interval is lengthened.
For long range criteria (such as graduation or GPA at graduation),
the SAT usually accounts for less than 5% of the criterion variance
(Humphreys, 1967; Donlon, 1984). As might be expected, the findings
for the GRE are even worse. In two recent, still unpublished
large-scale validity studies, Horn and Hofer (undated) and Sternberg
(1998) found that the validities of the GRE for predicting
successful completion of graduate training are effectively zero.
This
means that no one knows what "intelligence" is after
100 years of feverish "research". These disappointing
results are especially disconcerting if viewed against the
historical record of the mental test movement which Jensen
and his followers have tried to revive by linking untenable
validity claims for IQ to equally untenable "heritability"
claims (see Quantitative Behavior Genetics, below).
(b)
Spearman's Hypothesis
In
the early 80s, Jensen (in Bias in Mental Testing, 1980) revived
a casual observation Spearman had made in 1927: Spearman reported
that subtests most highly loaded on his general
intelligence factor "g" showed the largest black/white
contrasts ("Spearman's Hypothesis"). Jensen,
after substituting the largest principal component (PC1) for
"g", interpreted this as new, compelling evidence
for the existence of g which seemed to corroborate
his central claim that Blacks, on average, are deficient in
g compared to Whites, and that these differences are primarily
genetic, not cultural, in origin.
In
[43] I drew attention to the fact that this result can
be explained as an artifact which has nothing to do with Blacks
or g, but rather arises with any data, including randomly generated
data, provided they exhibit a sufficiently large mean difference
vector. This explanation was subsequently challenged by
Shockley, who correctly pointed out that it was limited to a
positive relation between the mean differences and the weights
of the PC1 of the pooled group, while most
of Jensen's data showed such a positive relation with the PC1
computed within each group. I therefore extended my results to this
more general situation by imposing multinormality as an
additional condition. I showed mathematically, geometrically,
empirically, and by random simulation, the following result:
If one splits a multinormal distribution of positively intercorrelated
variables into a high and a low group, then one will find
(a) that the mean differences between both groups are monotonically
related to the loadings on the largest principal component,
(b) if both groups are of equal size, then the cosine between
both vectors will be unity, and
(c) if the groups are of different size, then the effect will
be more pronounced for the larger group [68, 82, 83].
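The pooled-group version of this result can be illustrated with a small
simulation. The following is only a minimal sketch, not the procedure used
in [68, 82, 83]. It rests on several assumptions that are not in the text
above: the data come from a hypothetical one-factor model with made-up
loadings, the split is made at the median of the pooled PC1 score, and only
the PC1 of the pooled group is examined (the within-group case is not shown).
All function names are illustrative.

```python
import math
import random


def pc1(cov, iters=200):
    """Unit-length first principal component of `cov`, by power iteration."""
    p = len(cov)
    w = [1.0] * p
    for _ in range(iters):
        w = [sum(cov[i][j] * w[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(v * v for v in w))
        w = [v / norm for v in w]
    return w


def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))


def simulate(n=4000, seed=0):
    rng = random.Random(seed)
    # Hypothetical loadings; they only serve to make all variables
    # positively intercorrelated and multinormal.
    a = [0.8, 0.7, 0.6, 0.5, 0.4, 0.3]
    p = len(a)
    data = []
    for _ in range(n):
        g = rng.gauss(0, 1)
        data.append([a[j] * g + math.sqrt(1 - a[j] ** 2) * rng.gauss(0, 1)
                     for j in range(p)])

    # Sample covariance matrix of the pooled data.
    means = [sum(row[j] for row in data) / n for j in range(p)]
    cov = [[sum((row[i] - means[i]) * (row[j] - means[j]) for row in data)
            / (n - 1) for j in range(p)] for i in range(p)]

    # Assumed splitting rule: median split on the pooled PC1 score,
    # which yields two groups of equal size.
    w = pc1(cov)
    scores = [sum(w[j] * row[j] for j in range(p)) for row in data]
    cut = sorted(scores)[n // 2]
    high = [row for row, s in zip(data, scores) if s >= cut]
    low = [row for row, s in zip(data, scores) if s < cut]

    # Mean difference vector (high minus low) and its cosine with PC1.
    diff = [sum(r[j] for r in high) / len(high)
            - sum(r[j] for r in low) / len(low) for j in range(p)]
    return w, diff, cosine(w, diff)


weights, mean_diff, cos = simulate()
print("PC1 weights:     ", [round(v, 2) for v in weights])
print("mean differences:", [round(v, 2) for v in mean_diff])
print("cosine:          ", round(cos, 3))
```

No group labels and no general factor "g" with any substantive meaning enter
this sketch. Under the stated assumptions the cosine should nevertheless come
out close to 1, because the expected mean difference vector is proportional
to the PC1 weights.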
Thus,
Spearman's Hypothesis does not warrant any of the far-reaching
claims Jensen and some of his followers (e.g., Herrnstein
and Murray) have attached to it. In particular, it does
not validate the existence of a general ability g as Jensen
has asserted, nor does it have any bearing on the race question.
Publication
[82] is a Target Article on this topic, followed by numerous
commentaries. Most of them endorse the stringency of the above
reasoning. For a chronicle of the protracted history of the
target article, see [85].
(c)
Hit-Rate Bias
In
view of the considerable implications of a mistaken interpretation
of discrepancies in IQ performance of various ethnic groups,
much attention has been focused on the question whether these
discrepancies might conceivably be the result of a bias
that favors some groups over others (e.g., males over females).
A. Jensen (1980) has devoted a whole book to this issue, entitled
"Bias in Mental Testing". He concluded that such worries
are unwarranted so far as the Black/White discrepancy is concerned.
If anything, the tests seemed to overpredict Black criterion
performance.
This
reasoning follows traditional lines in emphasizing the institutional
point of view (e.g., that of universities) over that of the testees,
by focussing on regression equations and validity
coefficients. From an institutional point of view, a test is
useful if it improves the composition of the subgroup that is
eventually hired as a result of superior test performance.
From this point of view it can be shown that even a test with
low validity has merit as long as the hiring institution employs
a sufficiently stringent admission quota (by raising the test
cut-off).
However,
this narrow institutional perspective ignores two important
aspects of the bias problem:
(a)
the base-rate problem:
By
solely focusing on the regression equation and correlations
(predictive validities), the traditional approach to the
bias problem ignores the fact that a test can be worse than
useless even if it has positive validity if the base-rates
(= proportion of qualified candidates) are sufficiently skewed.
To illustrate this briefly, suppose the base-rate of
a clinical syndrome (e.g., schizophrenia) is 1%. In this
case we would be able to achieve 99% correct prediction by
simply predicting that everybody is "normal", regardless
of test performance. For a test to achieve the same
degree of correct prediction, it would have to have an unrealistically
high predictive validity.
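The arithmetic behind this illustration can be sketched as follows. The
function name, its parameters, and the figures for the test (1% of the
population diagnosed, 30% of those diagnoses correct) are hypothetical and
serve only to show the bookkeeping; the 1% base-rate is the one used above.

```python
def accuracy(base_rate, flagged, ppv):
    """Proportion of correct decisions.

    base_rate: proportion of the population that has the syndrome
    flagged:   proportion of the population the test diagnoses as cases
    ppv:       proportion of those diagnoses that are correct
    """
    true_pos = ppv * flagged
    false_pos = flagged - true_pos
    false_neg = base_rate - true_pos
    if false_neg < 0:
        raise ValueError("more correct diagnoses than actual cases")
    return 1.0 - false_pos - false_neg


# Calling everybody "normal": nobody is flagged, all cases are missed.
baseline = accuracy(base_rate=0.01, flagged=0.0, ppv=0.0)

# Hypothetical test that flags 1% of the population and is right 30% of the time.
with_test = accuracy(base_rate=0.01, flagged=0.01, ppv=0.3)

print(f"everybody 'normal': {baseline:.3f}")
print(f"hypothetical test:  {with_test:.3f}")
```

In this bookkeeping the proportion of correct decisions equals
1 - base_rate - flagged + 2 x ppv x flagged, so the test beats the
"everybody is normal" rule only if more than half of its positive diagnoses
are correct.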
More
generally, validity coefficients by themselves (in the absence
of knowledge of base rate and quota) are meaningless as
indicators of the practical utility of a test.
This
was already known to Meehl and Rosen (1955), but has been
conveniently forgotten in the meantime.
(b)
the interests of the testee (as opposed to those of
the hiring or admitting institution):
Once
the bias problem is cast into the language of prediction error
frequencies (rather than validities and regression equations,
disregarding base-rates), then it becomes immediately obvious
that to the same extent that low validity tests benefit
institutions when (admission) quotas are tightened, they penalize
qualified applicants by wrongly rejecting them as a result
of poor test validity:
Traditionally,
the conditional probability that a candidate will be successful
(e.g., graduate), given that he passes the test, is called
the "success ratio" (SR) of the test. Following
standard terminology of signal detection theory, let HR denote
the "hit-rate", which is the reverse conditional
probability, that a qualified candidate passes the test.
Finally, let Q denote the (admission) "quota" and
BR the base-rate (the proportion of qualified candidates in
the unselected population).
Then
Bayes' Theorem asserts:
SR = HR x BR/Q,
which
expresses the conventional institutional point of view:
The smaller we make the quota (by raising the test cut-off),
the larger will be the success ratio, because Q shows up in
the denominator.
However,
if we adopt the point of view of qualified candidates,
then we find (by solving the above equation for the
hit-rate):
HR = SR x Q /BR.
Now
Q appears in the numerator. Hence, the tighter the admission
quota, the smaller will be the hit-rate, i.e., the chance of a
qualified student being admitted.
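A small numerical illustration of these two relations, using invented counts
for one applicant pool under a lenient and a strict cut-off. The counts and
the function name are hypothetical; only the definitions of SR, HR, Q and BR
are taken from the text above.

```python
def rates(n_total, n_qualified, n_admitted, n_admitted_qualified):
    """Return (BR, Q, SR, HR) computed from raw counts."""
    br = n_qualified / n_total               # base-rate
    q = n_admitted / n_total                 # quota
    sr = n_admitted_qualified / n_admitted   # P(qualified | admitted)
    hr = n_admitted_qualified / n_qualified  # P(admitted | qualified)
    return br, q, sr, hr


# Hypothetical pool: 1000 applicants, 600 of them qualified.
lenient = rates(1000, 600, n_admitted=500, n_admitted_qualified=350)
strict = rates(1000, 600, n_admitted=200, n_admitted_qualified=160)

for label, (br, q, sr, hr) in (("lenient", lenient), ("strict", strict)):
    # Both columns printed for SR should agree: SR = HR x BR / Q.
    print(f"{label}: Q={q:.2f}  SR={sr:.2f}  HR x BR/Q={hr * br / q:.2f}  HR={hr:.2f}")
```

With these invented counts, tightening the quota from .50 to .20 raises the
success ratio from .70 to .80, while the hit-rate drops from about .58 to
about .27: the institution gains while the qualified applicant loses.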
Although
these simple relations have been known for a long time, they
have been consistently ignored or downplayed in the mental
test literature. In particular, so far as I know, few if any
systematic investigations of the actual hit-rates as
a function of validity, base-rate, and quotas seem to
have been made in the past. Nor have the experts shown
much interest in the problem whether the tests may
be biased against minorities in terms of hit-rates.
In
[78] we derive simple approximations for hit-rates and
tabulate them as a function of validity, quota, and base-rate.
We also derive a bound on hit-rates,
HR < Q/BR,
which
says that tightening the quota inevitably penalizes the qualified
students by lowering the hit-rate.
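The bound follows from HR = SR x Q/BR because the success ratio SR cannot
exceed 1. The sketch below tabulates this ceiling for a hypothetical
base-rate of .6; the function name and the chosen quotas are illustrative
assumptions, not values from [78].

```python
def hit_rate_ceiling(quota, base_rate):
    """Largest hit-rate compatible with HR = SR x Q / BR when SR <= 1."""
    if not (0 < base_rate <= 1 and 0 <= quota <= 1):
        raise ValueError("quota and base_rate must be proportions")
    # A hit-rate is a probability, so it is also capped at 1.
    return min(1.0, quota / base_rate)


base_rate = 0.6  # hypothetical proportion of qualified applicants
for quota in (0.6, 0.4, 0.2, 0.1):
    ceiling = hit_rate_ceiling(quota, base_rate)
    print(f"Q={quota:.1f}  HR can be at most {ceiling:.3f}")
```

The ceiling would be reached only by a test that admits no unqualified
candidate at all. For any real test the hit-rate lies below it, and the
ceiling itself shrinks in proportion to the quota.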
Finally,
we review a number of data sets to evaluate the hit-rates
of the SAT and ACT for different ethnic groups and also to evaluate
the hit-rate bias, i.e., the extent to which conventional
tests favor or discriminate against subgroups in terms of the
chances of qualified students to pass the test. We found
that these conventional admission tests discriminate against
Blacks, and that the bias increases as the admission quotas
are tightened. In [48] these results are further refined and
extended to include formulae for estimating the minimum validity
needed for given quota and base rate, so that use of the test
improves the percentage of correct decisions over random admissions.
Basically, in the actual observed validity range (.3 - .4),
no test improves over random admission in terms of overall correct
decisions as soon as one of the two base rates exceeds .7.
Some
of these results impact directly on the ongoing debate about
the admission standards of the NCAA, and on the affirmative
action debate more generally. In the past, such discussions
have often been misguided by the fallacious notion that
all one has to do to "raise standards" is to raise
test cut-offs. The above formula underlines the need to
clearly distinguish between predictor standards and criterion
standards, especially if the test validities are low,
as they usually are. In this case, raising test standards
may not raise criterion standards as much as it raises
discrimination against minorities.
Quantitative
Behavior Genetics
Presumably,
one reason for the astonishing persistence of the IQ myth in
the face of overwhelming prior and posterior odds against it
is the unbroken chain of excessive heritability claims
for "intelligence", which IQ tests are supposed
to "measure". However, if "intelligence"
is undefined, and Spearman's g is beset with numerous problems,
not the least of which is universal (and generally acknowledged)
rejection of Spearman's model by the data, then how can the
heritability of "intelligence" exceed that of milk
production of cows and egg production of hens?
These
problems are addressed in a series of recent publications,
[54, 60, 61, 62, 63, 70, 71, 72, 75, 81]. In [70] it is shown
that a once widely used "heritability estimate" (Holzinger's
h**2) is mathematically unsound, because Holzinger had made
a mistake in his derivations. On the other hand, another
such estimate, though mathematically valid, never fits any data.
This could have been obvious for a long time because it produces
an inordinate number of inadmissible estimates. However, such
estimates often found their way into print without challenge or comment.
Moreover, it also produces excessive "heritabilities"
for variables which plainly have nothing to do with genes. For
example, the "heritability" of answers to the question:
"Did you have your back rubbed last year?" turns out to be
92% for males and 21% for females [81].
The
main problem is that all such estimates rely on simplistic
mathematical models that necessarily make unrealistically
stringent assumptions, which were rarely tested. Once they
are tested, one finds they are usually violated by the data.
A comprehensive review of these issues is attempted in [81],
where further references to specific subproblems can be found.