Test construction Flashcards
Classical test theory is based on the assumption that
obtained test scores (X) are due to a combination of true score variability (T) and measurement error (E): i.e., X = T + E.
measurement error is due to
random factors that affect the test performance of examinees in unpredictable ways
Test reliability refers to
the extent to which a test provides consistent information
interpretation of reliability coefficient
They’re always interpreted directly as the amount of variability in obtained test scores that’s due to true score variability. For instance, if a test has a reliability coefficient of .80, this means that 80% of variability in obtained test scores is due to true score variability and the remaining 20% is due to measurement error.
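Why the coefficient can be read directly this way (a sketch of the standard CTT reasoning, not part of the original card): T and E are assumed to be uncorrelated, so their variances add, and the reliability coefficient is defined as the ratio of true score variance to obtained score variance:

σ²(X) = σ²(T) + σ²(E), and reliability = σ²(T) / σ²(X)

A coefficient of .80 therefore means true score variance accounts for 80% of obtained score variance, leaving the remaining 20% as error variance.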
4 Methods for Estimating Reliability:
There are four main methods for assessing a test’s reliability: test-retest, alternate forms, internal consistency, and inter-rater.
Test-retest reliability provides information about
the consistency of scores over time.
test-retest reliability involves
administering the test to a sample of examinees, re-administering the test to the same examinees at a later time, and correlating the two sets of scores.
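As a concrete sketch of the correlation step (the scores and sample size below are made up for illustration), the test-retest coefficient is simply the Pearson correlation between the two administrations; the same correlation step applies to alternate forms reliability, covered next:

```python
import numpy as np

# Hypothetical scores for the same six examinees at two administrations
time1 = np.array([85, 72, 90, 66, 78, 81])
time2 = np.array([83, 75, 88, 70, 76, 84])

# Test-retest reliability = Pearson correlation between the two score sets
r_tt = np.corrcoef(time1, time2)[0, 1]
print(f"test-retest reliability: {r_tt:.2f}")
```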
Test-retest reliability is useful for tests that are designed to measure a characteristic that’s ___ over time.
stable
Alternate forms reliability provides information about
the consistency of scores over different forms of the test and, when the second form is administered at a later time, the consistency of scores over time.
alternate forms reliability involves
administering one form of the test to a sample of examinees, administering the other form to the same examinees, and correlating the two sets of scores. Alternate forms reliability is important whenever a test has more than one form.
Internal consistency reliability provides information
on the consistency of scores over different test items and is useful for tests that are designed to measure a single content domain or aspect of behavior.
best reliability methods for speed tests
Internal consistency reliability is NOT a good measure for speed tests because it tends to overestimate their reliability. For speed tests, test-retest and alternate forms reliability are appropriate.
3 key methods for INTERNAL CONSISTENCY reliability
Coefficient Alpha (Cronbach’s Alpha)
Kuder-Richardson 20 (KR-20)
Split-Half Reliability
coefficient/cronbach’s alpha used for tests with ___
multiply scored items (e.g., Likert-type ratings) or other multiple response formats
kuder-richardson 20 (KR-20) is used for tests with ___
dichotomous (e.g. right/wrong) scoring
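A minimal sketch of the computation (the function and the response matrix below are mine, for illustration): coefficient alpha is [k/(k − 1)] × (1 − Σ item variances / total score variance), and applying it to dichotomously scored 0/1 items yields KR-20:

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """items: examinees x items score matrix. For 0/1 items this equals KR-20."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)       # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)   # variance of examinees' total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical dichotomous (right/wrong) responses: 5 examinees x 4 items
responses = np.array([
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 1],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
])
print(f"KR-20 / coefficient alpha: {cronbach_alpha(responses):.2f}")
```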
issue with split-half reliability is
Because each half contains only half of the test's items (and, all else being equal, a shorter test is less reliable), a split-half reliability coefficient underestimates a test's reliability. It's usually corrected with the Spearman-Brown prophecy formula, which is used to estimate the effects of lengthening or shortening a test on its reliability coefficient; a sketch follows.
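A small sketch of the Spearman-Brown prophecy formula (function name and sample values are mine): predicted reliability = (k × r) / (1 + (k − 1) × r), where k is the factor by which the test is lengthened. Correcting a split-half coefficient uses k = 2, since each half is only half the full test's length:

```python
def spearman_brown(r: float, k: float) -> float:
    """Predicted reliability when a test is lengthened by a factor of k."""
    return (k * r) / (1 + (k - 1) * r)

# Correct a split-half coefficient of .70 to full test length (k = 2)
print(f"corrected split-half: {spearman_brown(0.70, 2):.2f}")  # ~.82
# Predicted reliability if the same test were cut in half (k = 0.5)
print(f"halved test: {spearman_brown(0.70, 0.5):.2f}")         # ~.54
```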
Inter-rater reliability is important for measures that are ____scored and provides information on the consistency of scores or ratings assigned by different raters.
subjectively
methods used to evaluate inter-rater reliability:
Percent agreement
Cohen’s kappa coefficient (also known as the kappa statistic)
issue with percent agreement method
Percent agreement can be calculated for two or more raters. A problem with this method is that it does not take chance agreement into account, which can result in an overestimate of reliability.
when ratings represent a nominal scale, the best inter-rater reliability method is
Cohen’s kappa coefficient (the kappa statistic), which is one of several inter-rater reliability coefficients that are corrected for chance agreement between raters.
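A minimal sketch contrasting the two methods on made-up ratings from two raters: kappa subtracts the agreement expected by chance (p_e, computed from each rater's marginal category frequencies), so it runs lower than raw percent agreement:

```python
from collections import Counter

# Hypothetical nominal ratings (categories A/B) from two raters for 10 cases
rater1 = ["A", "A", "B", "A", "B", "A", "A", "B", "A", "A"]
rater2 = ["A", "B", "B", "A", "B", "A", "A", "A", "A", "A"]

n = len(rater1)
p_o = sum(a == b for a, b in zip(rater1, rater2)) / n  # percent agreement

# Chance agreement from the raters' marginal proportions
c1, c2 = Counter(rater1), Counter(rater2)
p_e = sum((c1[c] / n) * (c2[c] / n) for c in set(rater1) | set(rater2))

kappa = (p_o - p_e) / (1 - p_e)
print(f"percent agreement: {p_o:.2f}, kappa: {kappa:.2f}")  # .80 vs ~.47
```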
The reliability of subjective ratings can be affected by _____, which occurs when two or more raters communicate with each other while assigning ratings, resulting in increased consistency (but often decreased accuracy) in ratings and an overestimate of inter-rater reliability.
consensual observer drift.
3 Factors that Affect the Reliability Coefficient
content homogeneity (the more homogeneous the test content, the larger the reliability coefficient)
range of scores (an unrestricted range of scores yields a larger reliability coefficient)
guessing (the easier it is to answer items correctly by guessing, the smaller the reliability coefficient)
when a test’s reliability coefficient is .81, the reliability index is the ____
square root of .81, which is .90. (The reliability index estimates the correlation between obtained test scores and true scores.)
For dichotomously scored items, an item’s difficulty level (p) indicates the percentage of examinees who
answered the item correctly.
when 50 of 100 examinees answered an item correctly, the item’s p value is
50/100, or .50.
For mastery tests (tests used to identify examinees who have mastered a certain level of knowledge or skill), ___ p values are preferred.
higher (most examinees who have mastered the material should answer each item correctly, so p values of .80 or .90 may be preferred)
The item discrimination index (D) ranges from ___and indicates the difference between the percentage of examinees with ___ and the percentage of examinees with __
-1.0 to +1.0; high total test scores (often the top 27%) who answered the item correctly; low total test scores (often the bottom 27%) who answered the item correctly.
what is the optimal difficulty level for a four-answer multiple choice question
the chance of choosing the correct answer to a four-answer multiple-choice question by guessing is .25, and the optimal difficulty level for this type of item is halfway between the chance-level p value and 1.0: (1.0 + .25)/2 = 1.25/2 = .625.
when 90% of examinees in the high-scoring group and 20% of examinees in the low-scoring group answered an item correctly, the item’s D value is
.90 minus .20, which is .70.
an item’s difficulty level affects its ability to discriminate, with items of ___ having higher levels of discrimination.
moderate difficulty
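A small sketch pulling the item analysis cards together (function names are mine; the values come from the cards above):

```python
def p_value(n_correct: int, n_total: int) -> float:
    """Item difficulty: proportion of examinees who answered the item correctly."""
    return n_correct / n_total

def optimal_p(n_options: int) -> float:
    """Optimal difficulty corrected for guessing: midpoint between chance and 1.0."""
    chance = 1 / n_options
    return (1.0 + chance) / 2

def discrimination(p_high: float, p_low: float) -> float:
    """Item discrimination index D: proportion correct in the high-scoring group
    minus proportion correct in the low-scoring group."""
    return p_high - p_low

print(f"p: {p_value(50, 100):.2f}")                  # .50
print(f"optimal p (4 options): {optimal_p(4):.3f}")  # .625
print(f"D: {discrimination(0.90, 0.20):.2f}")        # .70
```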
if a test has a standard deviation of 5 and a reliability coefficient of .84, its standard error of measurement equals
5 times the square root of 1 minus .84 (SEM = SD × √(1 − r)):
1 minus .84 is .16,
the square root of .16 is .4,
5 times .4 is 2.
In other words, when a test’s standard deviation is 5 and its reliability coefficient is .84, its standard error of measurement is 2.
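A small sketch tying these computations together (function names are mine): SEM = SD × √(1 − r), and the reliability index from the earlier card is √r. The SEM is commonly used to build a confidence band around an obtained score:

```python
import math

def sem(sd: float, r: float) -> float:
    """Standard error of measurement: SD * sqrt(1 - reliability coefficient)."""
    return sd * math.sqrt(1 - r)

def reliability_index(r: float) -> float:
    """Estimated correlation between obtained scores and true scores: sqrt(r)."""
    return math.sqrt(r)

print(f"SEM: {sem(5, 0.84):.1f}")                           # 2.0
print(f"reliability index: {reliability_index(0.81):.2f}")  # .90
# e.g., a 68% confidence band around an obtained score of 100 is 100 +/- 2
```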