|
Based on the book Developing and Using Tests Effectively by Jacobs and Chase,
1992. Reliability deals with the consistency of measurements. The
reliability of an assessment is a measure of the consistency with which the test
produces the same result under different but comparable conditions (Example:
similar populations of students getting similar scores).
For a test to be reliable it must adequately reflect the objectives of the
teaching unit.
The best way to estimate reliability is to have two measurements of a common
trait (course unit) for a common group of people (your students). This would
mean having two parallel or equivalent forms of the same test.
Alternatively, take a single test and split it into two halves. You get a
student’s score on the odd-numbered items and another score on the
even-numbered items. This is not the best way to do a reliability assessment,
but it will give you some indication if the number of test questions is large
enough (fifty or more).
Another way is to look at how consistently students perform on each item, in
effect treating each item as a mini test.
Using Split Halves to Test Reliability
The test must be focused on a common domain of knowledge.
The limitation is that you end up with two short tests, each of which is a less
reliable measure than the full test.
However, you can use the Spearman-Brown prophecy formula to estimate the
reliability of the test at its original length. What we are seeking is that a
student’s score on the first half (or their rank) is close to their score (or
rank) on the second half. Ideally, you want correlations of about .70 or .80.
(Additional information can be found at
http://www.jmu.edu/assessment/wm_library/Reliability_validity.pdf)
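The split-half procedure and the Spearman-Brown correction can be sketched in a
few lines of Python. This is a minimal illustration under stated assumptions,
not the authors' own procedure: the function names and the 0/1 response data
are hypothetical, and the correction is the standard two-halves form,
r_full = 2r / (1 + r).

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman_brown(r_half):
    """Estimate full-length reliability from the half-test correlation."""
    return 2 * r_half / (1 + r_half)


def split_half_reliability(responses):
    """responses: one list per student, 1 = correct, 0 = incorrect per item."""
    odd_scores = [sum(r[0::2]) for r in responses]   # items 1, 3, 5, ...
    even_scores = [sum(r[1::2]) for r in responses]  # items 2, 4, 6, ...
    r_half = pearson(odd_scores, even_scores)
    return r_half, spearman_brown(r_half)


# Hypothetical data: 5 students, 6 items (far too few items for real use).
responses = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 0, 1, 0],
    [1, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
]
r_half, r_full = split_half_reliability(responses)
print(f"half-test correlation: {r_half:.2f}, corrected: {r_full:.2f}")
```

The corrected value is always at least as large as the half-test correlation
when that correlation is positive, which reflects the point above that each
half is a shorter, less reliable test.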
Factors that Influence Reliability
- The length of the test
- The time limits for the test
- The nature of the student group (i.e., if the group is quite homogeneous,
the reliability will be lower than if the group is fairly heterogeneous)
- The difficulty of the test items (i.e., if the test items are too difficult,
then the spread of scores will be small)
- A common set of instructions
- A common environment in which to attempt the test
- The scoring procedure of the test
- Whether students are aware of how they will be assessed: the length, time,
content or objectives, and value of the test
How to Improve Test Reliability
- Tests should be long enough to sample the content well
- Time limit should allow most students to finish
- Items should be free of ambiguity and tricks
- Directions should be clear and concise
- There should be few items that all students get wrong or right
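The last point can be checked after a test is scored by computing the
proportion of students who answered each item correctly. The sketch below is
only an illustration: the function names and the small 0/1 data set are
hypothetical.

```python
def item_difficulty(responses):
    """Proportion of students answering each item correctly (0.0 to 1.0)."""
    n_students = len(responses)
    n_items = len(responses[0])
    return [sum(r[i] for r in responses) / n_students for i in range(n_items)]


def flag_uninformative_items(responses):
    """Return 1-based numbers of items that every student got right or wrong."""
    difficulties = item_difficulty(responses)
    return [i + 1 for i, p in enumerate(difficulties) if p in (0.0, 1.0)]


# Hypothetical data: 3 students, 4 items (1 = correct, 0 = incorrect).
scored = [
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [1, 1, 0, 0],
]
print(item_difficulty(scored))
print(flag_uninformative_items(scored))
```

Items flagged this way do not separate students from one another, so they add
length to the test without adding information about differences in
performance.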
What about Essay Tests and Reliability?
The biggest issue is the consistency of the reader (teacher) of the test
answers.
- Are the questions too wide in scope?
- Has the reader developed a prescribed scoring method?
- A checklist of the things that need to be in the answer and their point value
- A written response to each question by the instructor that is reviewed
before reading the student answers
Some practices that promote reliability
- Reading all of the answers to one question across all tests, so you focus
on one answer at a time
- Higher reliability is achieved if two readers are used for the tests
Estimating Reliability
- Test-retest reliability: a correlation between the score from
giving the same test twice to the same students.
- Parallel-forms reliability: a correlation between scores of the
same students on two equivalent forms of the same test.
- Internal consistency reliability: correlation or consistency
indices among items on a single test. For example, a student should earn a
similar score on the odd questions as on the even questions of a test.
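As one illustration of an internal consistency index, the sketch below computes
Cronbach's alpha. The source does not name a specific index, so the choice of
alpha is an assumption, and the function name and data are hypothetical. The
formula used is alpha = k / (k - 1) * (1 - sum of item variances / variance of
total scores), where k is the number of items.

```python
from statistics import pvariance


def cronbach_alpha(responses):
    """responses: one list of item scores per student, all the same length."""
    k = len(responses[0])
    # Variance of each item across students
    item_vars = [pvariance([r[i] for r in responses]) for i in range(k)]
    # Variance of students' total scores
    total_var = pvariance([sum(r) for r in responses])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical data: 5 students, 6 items (1 = correct, 0 = incorrect).
answers = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 0, 1, 0],
    [1, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
]
print(f"alpha = {cronbach_alpha(answers):.3f}")
```

Unlike the odd/even split, this index uses every item at once, which matches
the idea of treating each item as a mini test.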
Other factors in Test Reliability
- Tests that are too easy
- Tests that are too hard; this encourages guessing, which introduces random
error
- The more questions, the better the reliability and the less impact guessing
will have on the score.
- The "true" or "exact" score of a student on a test should be seen as
falling within a range. An 85% is not significantly better or worse than an
88% or an 82%. There is always a degree of error in any test score.
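One common way to put numbers on this range is the standard error of
measurement, SEM = SD * sqrt(1 - reliability). The source does not give this
formula, so treat the sketch below as an assumption about how the range might
be computed; the standard deviation of 10 points and the reliability of .84
are made-up values, and the function names are hypothetical.

```python
from math import sqrt


def standard_error_of_measurement(score_sd, reliability):
    """SEM from the spread of scores and the test's reliability estimate."""
    return score_sd * sqrt(1 - reliability)


def score_band(observed, score_sd, reliability, n_sem=1):
    """Range of n_sem standard errors on either side of an observed score."""
    sem = standard_error_of_measurement(score_sd, reliability)
    return observed - n_sem * sem, observed + n_sem * sem


# Assumed values: class SD of 10 percentage points, reliability of .84.
low, high = score_band(85, score_sd=10, reliability=0.84)
print(f"An observed 85% plausibly reflects a score between {low:.0f}% and {high:.0f}%")
```

With these assumed values the band runs from about 81% to 89%, so scores of
82% and 88% both fall inside the band around 85%.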
Test Validity
- A valid assessment procedure is one which actually tests
what it sets out to test—i.e. one which accurately measures the behavior
described by the objective under scrutiny.
- A test is valid if it provides data that increases the
accuracy of decisions about a person or object.
- Tests do not have general validity. They are valid in
relation to specific variables, such as intelligence or achievement of course
objectives.
Measuring Test Validity
The obvious way to collect evidence of a test’s validity is to compare a
student’s score on a test to some external measure of the same trait that the
test measures.
Example: comparing ACT scores with freshman grades in college.
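A comparison of this kind is usually summarized as a correlation between the
test scores and the external measure. The sketch below assumes a Pearson
correlation is the summary wanted; the function name and the ACT and grade
point values are invented for illustration and are not real data.

```python
from math import sqrt


def validity_coefficient(test_scores, criterion_scores):
    """Pearson correlation between test scores and an external criterion."""
    n = len(test_scores)
    mean_t = sum(test_scores) / n
    mean_c = sum(criterion_scores) / n
    cov = sum((t - mean_t) * (c - mean_c)
              for t, c in zip(test_scores, criterion_scores))
    var_t = sum((t - mean_t) ** 2 for t in test_scores)
    var_c = sum((c - mean_c) ** 2 for c in criterion_scores)
    return cov / sqrt(var_t * var_c)


# Invented values for five students: ACT composite and freshman GPA.
act = [18, 22, 25, 29, 31]
gpa = [2.4, 2.9, 3.0, 3.6, 3.7]
print(f"validity coefficient: {validity_coefficient(act, gpa):.2f}")
```

A coefficient near zero would suggest the test tells us little about the
external measure, whatever its reliability.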
The focus is on how well the test samples a domain of behaviors or
knowledge about which we will make an inference.
Tests are not measures of an entire domain, but samples of the desired
behavior from which we draw conclusions about a student’s knowledge of an
entire domain.
Content Validation
- The extent to which the test questions reflect the
entire body of the content that the test is designed to measure
- Tests are only a sample of what a student knows, so the
validity of the test depends on the representativeness of the sample
- The domain of interest measured is not only a subject
matter domain but also a behavioral domain (what kinds of mental operations
should be tested)
- Well-written tests not only sample all the material
taught but also do so in a representative way: the percentage of questions on a
test should reflect the time and importance given to a topic.
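The proportional rule in the last point can be worked out mechanically. The
sketch below splits a fixed number of questions across topics according to the
class time spent on each. The topic names, the hours, and the function name are
hypothetical, and the rounding scheme (largest remainder) is just one
reasonable choice.

```python
def allocate_items(topic_weights, total_items):
    """Split total_items across topics in proportion to their weights.

    Uses largest-remainder rounding so the counts always sum to total_items.
    """
    total_weight = sum(topic_weights.values())
    exact = {t: total_items * w / total_weight for t, w in topic_weights.items()}
    counts = {t: int(x) for t, x in exact.items()}  # round down first
    leftover = total_items - sum(counts.values())
    # Give the remaining items to the topics with the largest fractional parts
    by_remainder = sorted(exact, key=lambda t: exact[t] - counts[t], reverse=True)
    for topic in by_remainder[:leftover]:
        counts[topic] += 1
    return counts


# Hypothetical unit: hours of class time spent on each topic.
hours = {"fractions": 6, "decimals": 3, "percent": 1}
print(allocate_items(hours, total_items=40))
```

Weights could just as well combine time with a judgment of importance; the
point is only that the share of questions follows the share of emphasis.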
How to Ensure Test Validity
- The test should not test other "things" like vocabulary or writing skills
unless these were part of the course objectives.
- If the instructor samples the course objectives in proportion to their
importance in the course, the instructor’s test will have content validity.
- Use a Table of Specifications (see appendix)
- The more the test is free from influences or irrelevant variables that
threaten its validity, the more accurate the findings will be.
- Examples of irrelevant variables: vocabulary, ambiguity, minute details,
grammatical incorrectness, too little time, etc.
Also items like:
- Directions are not clear
- Test requires inappropriate levels of skills that are not part of the
course objectives
- Test items are poorly written
- Test length does not allow for adequate sampling of content
- Complexity and subjectivity of scoring can rank some students inaccurately
- If the scoring process has many steps, there are many opportunities for
mistakes
- If the scoring is subjective, it is easily influenced by factors that are not
part of the teaching objectives
Other Factors that Affect Test Validity
- The test-taking skills of the students: guessing strategies, good
allocation of time.
- Test-wiseness: the ability to use clues to obtain a score higher than the
score that is deserved, so that students appear to know more than they do.
- Response sets: the tendency to respond on a test in a certain consistent
way, such as always marking true when unsure or always choosing the longest
answer on multiple choice when uncertain.
- Anxiety and motivation: performance may be impaired by high anxiety
or low motivation. Research is unclear on whether the underlying problem is
poor test-taking skills, which lead to both poor performance and anxiety.
Several studies have shown no causal link between anxiety and test scores, but
we do know the brain secretes neurochemicals under stress that can interfere
with cognitive processes.
Administrative Factors
- The way a test is administered, such as "this is going to be hard" comments
- Proctored by someone else
- Students believe cheating is "ignored"
- Clarity of instructions
- Coaching and practice—drill and practice to the test can affect scores
- Test bias: the manner in which a test is constructed gives some people an
unfair advantage over others (this can help or hurt scores). Test bias exists
when individuals from different groups who are equally able do not have an
equal probability of success (Anderson 1980). The question is whether the
differences on tests result from factors irrelevant to what the test was
designed to measure, or whether such differences mirror true differences
between the groups in what the test was intended to measure. The key is to
look for equally able groups that perform differently.
Random Errors and Systematic Errors
- Random errors by definition are random: the amount and direction of the
test error differs unsystematically from one measurement to the next and from
one person to the next.
- Random errors reduce test reliability
Examples:
- Underpredicting or overpredicting the math ability levels of a test group
- Students being tired
- Students being upset
- The test being just before a big event
- Students being cold, hot, or hungry
Systematic Errors Affect Test Validity
Unknown to the test developer or the test taker, the test measures something
that it was not intended to measure.
Example: a math test that has word problems that require a very good reading
ability – if the score is lower it may be due to reading ability and not math
ability. |