Ch. 5 for Unit 3 Flashcards
reliability coefficient / test-retest reliability
A reliability coefficient is a statistic that quantifies reliability, ranging from 0 (not at all reliable) to 1 (perfectly reliable). Test-retest reliability is an estimate of reliability obtained by correlating pairs of scores from the same people on two different administrations of the same test. The test-retest measure is appropriate when evaluating the reliability of a test that purports to measure something that is relatively stable over time, such as a personality trait. If the characteristic being measured is assumed to fluctuate over time, then there would be little sense in assessing the reliability of the test using the test-retest method.
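Not from the chapter, but a minimal Python sketch of how a test-retest coefficient could be computed; the scores below are invented for illustration.

```python
import numpy as np

# Hypothetical scores for the same 8 people on two administrations of a test.
time_1 = np.array([12, 15, 9, 20, 17, 11, 14, 18])
time_2 = np.array([13, 14, 10, 19, 18, 10, 15, 17])

# The test-retest estimate is simply the Pearson correlation between the two sets of scores.
r_test_retest = np.corrcoef(time_1, time_2)[0, 1]
print(f"Test-retest reliability estimate: {r_test_retest:.2f}")  # about .96 for these made-up scores
```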
measurement error
refers to the inherent uncertainty associated with any measurement, even after care has been taken to minimize preventable mistakes.
true score
the measurement of a quantity if there were no measurement error at all.
carryover effects/ practice effects/ fatigue effects
Measurement processes that alter what is measured are termed carryover effects.
In ability tests, practice effects are carryover effects in which the test itself provides an opportunity to learn and practice the ability being measured. Fatigue effects are carryover effects in which repeated testing reduces overall mental energy or motivation to perform on a test.
Why is the true score not necessarily the “truth”?
By definition, a true score is tied to the measurement instrument used.
construct score.
If you are interested in the truth independent of measurement, you are not looking for the so-called true score, but what psychologists call the construct score. A construct is a theoretical variable we believe exists, such as depression, agreeableness, or reading ability. A construct score is a person’s standing on a theoretical variable independent of any particular measurement. If we could create tests that perfectly measured theoretical constructs, the true score and the construct score would be identical. Unfortunately, all tests are flawed. The long-term average of many measurements using a flawed measurement procedure is still called a true score, flaws and all.
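A small, hypothetical simulation (not from the chapter; all values invented) illustrating the distinction: averaging many scores from a flawed test washes out random error and converges on the true score, which can still differ from the construct score because the test’s systematic flaw is part of it.

```python
import numpy as np

rng = np.random.default_rng(0)

construct_score = 100     # the person's standing on the theoretical variable
systematic_flaw = -3      # hypothetical flaw built into this particular test
random_error = rng.normal(0, 5, size=100_000)  # random error on each administration

# Each observed score = construct score + the test's systematic flaw + random error.
observed = construct_score + systematic_flaw + random_error

# The long-run average of the flawed measurements is the true score (about 97),
# which is stable and consistent yet still not the construct score (100).
print(f"Long-run average (true score): {observed.mean():.2f}")
```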
Why bother with true scores when construct scores are clearly more important?
Because true scores help us understand and calculate reliability, and without reliability a test cannot be valid. In Chapter 6, we will discuss test validity in greater detail. Whenever we evaluate a test’s validity, we first check that its reliability is sufficient. The lower the test’s reliability, the lower the test’s validity. Yet high reliability does not guarantee high validity. A deeply flawed test that gives consistent measurements is reliable but not valid.
true variance vs. error variance
Variance from true differences is true variance, and variance from irrelevant, random sources is error variance. If σ² represents the total observed variance, σ²_tr the true variance, and σ²_e the error variance, the relationship can be expressed as

σ² = σ²_tr + σ²_e

In this equation, the total variance in an observed distribution of test scores (σ²) equals the sum of the true variance (σ²_tr) and the error variance (σ²_e).
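A quick illustrative sketch (numbers made up) of the decomposition, along with the common interpretation of a reliability coefficient as the proportion of total variance that is true variance.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical true scores for 1,000 testtakers, plus random error on one administration.
true_scores = rng.normal(50, 10, size=1_000)   # true variance around 100
errors = rng.normal(0, 5, size=1_000)          # error variance around 25
observed = true_scores + errors

total_var = observed.var()
true_var = true_scores.var()
error_var = errors.var()

print(f"total variance:        {total_var:.1f}")
print(f"true + error variance: {true_var + error_var:.1f}")  # approximately equals the total
# Reliability can be thought of as the share of total variance that is true variance (~.80 here).
print(f"true / total = {true_var / total_var:.2f}")
```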
Random error vs systematic error
Random error consists of unpredictable fluctuations and inconsistencies of other variables in the measurement process. Sometimes referred to as “noise,” this source of error fluctuates from one testing situation to another with no discernible pattern that would systematically raise or lower scores.
Random errors increase or decrease test scores unpredictably. On average and in the long run, random errors tend to cancel each other out. In contrast to random errors, systematic errors do not cancel each other out because they influence test scores in a consistent direction: they either consistently inflate scores or consistently deflate them.
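A toy simulation (assumed values, not from the text) showing why random errors cancel out in the long run while a systematic error shifts every score in the same direction.

```python
import numpy as np

rng = np.random.default_rng(2)
true_score = 80
n = 10_000

# Random error: unpredictable noise with no consistent direction.
random_err = rng.normal(0, 4, size=n)
# Systematic error: a consistent influence, e.g. a hypothetical +3-point miscalibration.
systematic_err = 3.0

print(f"mean random error: {random_err.mean():+.3f}")  # near 0: it cancels out
print(f"mean score, random error only:     {(true_score + random_err).mean():.2f}")                    # near 80
print(f"mean score, plus systematic error: {(true_score + random_err + systematic_err).mean():.2f}")  # near 83
```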
bias
Bias refers to the degree to which systematic error influences the measurement.
item sampling or content sampling,
terms that refer to variation among items within a test as well as to variation among items between tests.
testtaker variables.
Pressing emotional problems, physical discomfort, lack of sleep, and the effects of drugs or medication can all be sources of error variance. Formal learning experiences, casual life experiences, therapy, illness, and changes in mood or mental state are other potential sources of testtaker-related error variance.
Examiner-related variables
Examiner-related variables are potential sources of error variance. The examiner’s physical appearance and demeanor—even the presence or absence of an examiner—are some factors for consideration here. Some examiners in some testing situations might knowingly or unwittingly depart from the procedure prescribed for a particular test. On an oral examination, some examiners may unwittingly provide clues by emphasizing key words as they pose questions.
methodological error
For example, interviewers may not have been trained properly, the wording in the questionnaire may have been ambiguous, or the items may have somehow been biased to favor one or another of the candidates.
coefficient of stability.
When the interval between testing is greater than six months, the estimate of test-retest reliability is often referred to as the coefficient of stability.
replicability crisis.
Here it will be argued that the major causal factors are (1) a general lack of published replication attempts in the professional literature, (2) editorial preferences for positive over negative findings, and (3) questionable research practices on the part of authors of published studies.
Preregistration
involves publicly committing to a set of procedures prior to carrying out a study. Using such a procedure, there can be no doubt as to the number of observations planned, and the number of measures anticipated. In fact, there are now several websites that allow researchers to preregister their research plans.
coefficient of equivalence.
The degree of the relationship between various forms of a test can be evaluated by means of an alternate-forms or parallel-forms coefficient of reliability, which is often termed the coefficient of equivalence.
Parallel forms/ parallel forms reliability
Parallel forms exist when, for each form of the test, the means and the variances of observed test scores are equal.
parallel forms reliability refers to an estimate of the extent to which item sampling and other errors have affected test scores on versions of the same test when, for each form of the test, the means and variances of observed test scores are equal.
Alternate forms/ alternate forms reliability
Alternate forms are simply different versions of a test that have been constructed so as to be parallel. Although they do not meet the requirements for the legitimate designation “parallel,” alternate forms of a test are typically designed to be equivalent with respect to variables such as content and level of difficulty.
alternate forms reliability refers to an estimate of the extent to which these different forms of the same test have been affected by item sampling error, or other error. Estimating alternate forms reliability is straightforward: Calculate the correlation between scores from a representative sample of individuals who have taken both tests.
How is Obtaining estimates of alternate-forms reliability and parallel-forms reliability similar?
(1) Two test administrations with the same group are required, and (2) test scores may be affected by factors such as motivation, fatigue, or intervening events such as practice, learning, or therapy (although not as much as when the same test is administered twice).
internal consistency estimate of reliability or as an estimate of inter-item consistency.
An estimate of the reliability of a test can be obtained without developing an alternate form of the test and without having to administer the test twice to the same people. Deriving this type of estimate entails an evaluation of the internal consistency of the test items.
split-half reliability
is obtained by correlating two pairs of scores obtained from equivalent halves of a single test administered once. It is a useful measure of reliability when it is impractical or undesirable to assess reliability with two tests or to administer a test twice (because of factors such as time or expense). The computation of a coefficient of split-half reliability generally entails three steps:
Step 1. Divide the test into equivalent halves.
Step 2. Calculate a Pearson r between scores on the two halves of the test.
Step 3. Adjust the half-test reliability using the Spearman–Brown formula (discussed shortly).
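A minimal Python sketch of the three steps, using invented 0/1 item responses; the odd-versus-even split and the two-half Spearman–Brown formula (corrected r = 2r / (1 + r)) are the standard choices assumed here.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical data: 30 testtakers answering 10 dichotomous (0/1) items,
# generated so that more able people tend to answer more items correctly.
ability = rng.normal(0, 1, size=(30, 1))
difficulty = np.linspace(-1, 1, 10)
p_correct = 1 / (1 + np.exp(-(ability - difficulty)))
responses = (rng.random((30, 10)) < p_correct).astype(int)

# Step 1: divide the test into equivalent halves (here, odd- vs. even-numbered items).
half_a = responses[:, 0::2].sum(axis=1)
half_b = responses[:, 1::2].sum(axis=1)

# Step 2: calculate a Pearson r between scores on the two halves.
r_half = np.corrcoef(half_a, half_b)[0, 1]

# Step 3: adjust the half-test reliability with the Spearman–Brown formula.
r_split_half = 2 * r_half / (1 + r_half)
print(f"half-test r = {r_half:.2f}, split-half reliability estimate = {r_split_half:.2f}")
```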
GOAL = a primary objective in splitting a test in half for the purpose of obtaining a split-half reliability estimate is to create what might be called “mini-parallel-forms,” with each half equal to the other—or as nearly equal as humanly possible—in format, stylistic, statistical, and related aspects.
Why should a test not simply be divided in the middle when calculating split-half reliability?
Simply dividing the test in the middle is not recommended because this procedure would likely spuriously raise or lower the reliability coefficient. Different amounts of fatigue for the first as opposed to the second part of the test, different amounts of test anxiety, and differences in item difficulty as a function of placement in the test are all factors to consider.