Test Construction Flashcards
Test
A “test” is a systematic method for measuring a sample of behavior. Although the exact procedures used to develop a test depend on its purpose and format, test construction ordinarily involves specifying the test’s purpose, generating test items, administering the items to a sample of examinees for the purpose of item analysis, evaluating the test’s reliability and validity, and establishing norms.
Relevance
Refers to the extent to which test items contribute to achieving the stated goals of testing.
Its determination is based on qualitative judgments regarding 3 factors:
- Content Appropriateness (Does item assess content it’s designed to evaluate?)
- Taxonomic level (Does item reflect approp. cog./ability level?)
- Extraneous Abilities (To what extent does the item require knowledge, skills, or abilities outside the domain of interest?)
Item Difficulty
(Ordinal Scale) An item’s difficulty level is calculated by dividing the # of individuals who answered the item correctly by the total number of individuals.
p = (# of examinees passing the item) / (Total # of examinees)
p value ranges from 0 (nobody answered item correctly; very difficult) to 1.0 (Item answered correctly by all; very easy).
An item difficulty index of p=.50 is optimal because it maximizes differentiation between individuals w/high & low ability & helps ensure a high reliability coefficient.
Ex: Developers of the EPPP would be interested in assessing item difficulty levels to make sure the exam does not contain too many items that are either too easy or too difficult.
One exception: for true/false tests, bc the probability of answering the question correctly by chance is .50, the optimal difficulty level is p=.75 (Item Difficulty Index = p).
For a multiple-choice item w/4 options, the probability of answering the item correctly by guessing is 25%, so the optimum p value is halfway btwn 1.0 & .25, which is .625.
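A minimal sketch of these calculations in Python (the response matrix and the optimal_p helper are made-up for illustration):

```python
import numpy as np

# Hypothetical item responses: rows = 6 examinees, columns = 3 items (1 = correct, 0 = incorrect)
responses = np.array([
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [0, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
])

# Item difficulty index: proportion of examinees who answered each item correctly
p = responses.mean(axis=0)
print(p)  # approx. [0.83, 0.67, 0.33]

def optimal_p(n_options=None):
    """Optimal difficulty: halfway between 1.0 and the probability of guessing correctly."""
    chance = 0.0 if n_options is None else 1.0 / n_options
    return (1.0 + chance) / 2.0

print(optimal_p())             # 0.5   (no guessing possible, e.g., free response)
print(optimal_p(n_options=2))  # 0.75  (true/false)
print(optimal_p(n_options=4))  # 0.625 (4-option multiple choice)
```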
Item Difficulty Index (p)
(Ordinal Scale) For most tests a test developer wants items w/p values close to .50
When the goal of testing is to choose a certain # of top performers, the optimal p value corresponds to the proportion of examinees to be chosen.
The optimal value is affected by the likelihood that examinees can select the correct answer by guessing, w/the preferred difficulty level being halfway btwn 100% of examinees answering the item correctly & the probability of answering correctly by guessing.
The optimum p value also depends on the test’s ceiling & floor:
- Adequate Ceiling: Occurs when the test can distinguish btwn examinees w/high levels of the attribute being measured. Ceiling is maximized by including a large proportion of items w/a low p value (difficult items).
- Adequate Floor: Occurs when the test can distinguish btwn examinees w/low levels of the attribute being measured. Floor is maximized by including a large proportion of items w/a high p value (easy items).
Item Discrimination
Refers to the extent to which a test item discriminates (differentiates) btwn examinees who obtain high vs. low scores on the entire test or on an external criterion.
To calculate it, identify the examinees in the sample who obtained the highest & lowest scores on the test (the upper- & lower-scoring groups) &, for each item, subtract the % of examinees in the lower-scoring group (L) from the % of examinees in the upper-scoring group (U) who answered the item correctly:
D = U - L
The item discrimination index (D) ranges from -1.0 to + 1.0.
- D = +1.0: all examinees in the upper group & none in the lower group answered the item correctly.
- D = 0: the same percentage of examinees in both groups answered the item correctly.
- D = -1.0: none of the examinees in the upper group & all examinees in the lower group answered the item correctly.
For most tests, a D of .35 or higher is considered acceptable; items w/a difficulty level of p=.50 have the greatest potential for maximum discrimination.
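A short Python sketch of D = U - L (the two groups and their item responses are hypothetical):

```python
import numpy as np

# Hypothetical responses to one item (1 = correct, 0 = incorrect) for examinees who
# scored in the upper vs. lower group on the total test
upper_group = np.array([1, 1, 1, 0, 1])  # 80% answered the item correctly
lower_group = np.array([0, 1, 0, 0, 0])  # 20% answered the item correctly

def item_discrimination(upper, lower):
    """D = proportion correct in the upper group minus proportion correct in the lower group."""
    return upper.mean() - lower.mean()

D = item_discrimination(upper_group, lower_group)
print(round(D, 2))  # 0.6, which exceeds the .35 rule of thumb
```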
Item Response Theory (IRT)
Advantages of IRT are that item parameters are sample invariant (same across different samples) & performance on different sets of items or tests can be easily equated.
Use of IRT involves deriving an item characteristic curve for each item. (IRT = ICC)
Item Characteristic Curve (ICC)
When using IRT, an ICC is constructed for each item by plotting the proportion of examinees in the tryout sample who answered the item correctly against either:
- The total test score,
- Performance on an external criterion, or
- A mathematically-derived estimate of a latent ability or trait.
The curve provides info. on the relationship btwn an examinee’s level on the ability or trait measured by the test & the probability that he/she will respond to the item correctly.
The difficulty level of an item is indicated by the ability level (on a scale that typically ranges from -3 to +3) at which 50% of examinees in the sample answered the item correctly. (Ex: if that point falls at an ability level of 0, the item’s difficulty level is 0, i.e., average ability.)
The item’s ability to discriminate btwn high & low achievers is indicated by the slope of the curve; the steeper the slope, the greater the discrimination.
Probability of guessing correctly is indicated by the point where the curve intercepts the vertical axis.
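A sketch of a three-parameter logistic ICC, one common way such curves are modeled in IRT; the parameter values below (a, b, c) are purely illustrative:

```python
import math

def icc(theta, a=1.0, b=0.0, c=0.25):
    """Three-parameter logistic item characteristic curve.
    theta: ability level (typically about -3 to +3)
    a: discrimination (slope), b: difficulty, c: probability of guessing correctly (lower asymptote)."""
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

# With difficulty b = 0, an examinee of average ability has a probability of a correct
# response halfway between the guessing floor (c) and 1.0; a steeper slope (larger a)
# means sharper discrimination around that point.
for theta in (-3, 0, 3):
    print(theta, round(icc(theta), 3))  # approx. 0.286, 0.625, 0.964
```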
Classical Test Theory
Theory of measurement that regards observed variability in test scores as reflecting 2 components:
- True Score Variability (True differences btwn examinees on the attribute(s) measured by the test) &
- Variability due to measurement (random) error.
Reliability is a measure of true score variability.
An examinee’s obtained test score (X) is composed of 2 components:
- (T) a true score component &
- (E) an error component:
X = T + E
(Ex: X, the score obtained on the licensing exam, is likely due to both T, the examinee’s knowledge of the tested content, & E, the effects of random factors such as anxiety, the way items were written, attention, etc.)
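A small simulation of the X = T + E decomposition (the means and standard deviations are arbitrary); it also previews the reliability coefficient below, since var(T)/var(X) is the proportion of true score variability:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

T = rng.normal(loc=500, scale=10, size=n)  # true scores
E = rng.normal(loc=0, scale=5, size=n)     # random measurement error
X = T + E                                  # obtained scores

# Observed variance is approximately true score variance plus error variance
print(round(X.var(), 1), round(T.var() + E.var(), 1))

# Proportion of observed variance that is true score variance (the reliability): approx. 100/125 = .80
print(round(T.var() / X.var(), 2))
```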
Reliability
Refers to the consistency of test scores; i.e., the extent to which a test measures an attribute w/out being affected by random fluctuations (measurement error) that produce inconsistencies over time, across items, or over different forms.
A test is reliable:
- To the degree that it is free from error & provides info. about examinees’ “true” test scores &
- To the degree that it provides repeatable, consistent results.
(A limitation of classical test theory: item & test parameters are sample-dependent, unlike in IRT.)
Reliability is est. by evaluating consistency in scores over time, across different forms of the test, across different test items, or across different raters. These methods are based on the assumption that:
- True score variability is consistent
- Variability due to measurement error is inconsistent.
Methods for establishing reliability include:
- Test-retest,
- Alternative forms,
- Split-half,
- Coefficient alpha, and
- Inter-rater.
Reliability Coefficient
Most methods produce a reliability coefficient: a correlation coefficient calculated by correlating at least 2 scores obtained from each examinee in a sample. It ranges in value from:
- 0.0 (all variability is due to measurement error; no reliability) to +1.0 (all variability is true score variability; perfect reliability).
- The coefficient is interpreted directly as the proportion of variability in a set of test scores that is attributable to true score variability.
(Ex: a reliability of .80 indicates that 80% of the variability in test scores is true score variability; the remaining 20% [1 - .80] is due to measurement error.)
(Reliability coefficient = rₓₓ)
Test-Retest Reliability
A method for assessing reliability that involves administering the same test to the same group of examinees on 2 different occasions & correlating the 2 sets of scores (test-retest reliability).
Yields a Coefficient of Stability: Provides a measure of test score stability over time.
Approp. for tests designed to measure attributes/characteristics that are stable over time & not affected by repeated measurements, or that are affected by repeated measurement only in a systematic way (ex: aptitude).
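A minimal Python sketch (the scores are hypothetical); the same correlation approach applies to alternate forms reliability, where Form A scores are correlated with Form B scores:

```python
import numpy as np

# Hypothetical scores for the same 6 examinees on two administrations of the same test
time1 = np.array([85, 78, 92, 70, 88, 75])
time2 = np.array([83, 80, 90, 72, 85, 77])

# The coefficient of stability is the Pearson correlation between the two sets of scores
r_xx = np.corrcoef(time1, time2)[0, 1]
print(round(r_xx, 2))
```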
Alternate Forms Reliability
Provides a measure of test score consistency over 2 forms of the test (aka parallel forms/equivalent forms).
- Method for est. a test’s reliability that entails administering 2 equivalent forms of the test to the same group of examinees & correlating the 2 sets of scores.
Forms can be administered at about the same time (coefficient of equivalence) or at different times (coefficient of equivalence & stability).
Considered by some experts to be the best (most thorough) method for assessing reliability.
Best for determining the reliability of tests:
- designed to measure attributes that are stable over time &
- not affected by repeated measurements.
Not approp. for characteristics that fluctuate over time or when exposure to 1 form is likely to affect perf. on the other in an unsystematic way.
Internal Consistency Reliability
Degree to which items included in the test are measuring the same characteristic & indicates the degree of consistency across different test items.
- Approp. for tests that measure a single content or behavior domain (Ex: subtest but not entire test)
- Useful for est. the reliability of tests that measure characteristics that fluctuate over time or are susceptible to memory or practice effects.
Includes split-half reliability & Cronbach’s coefficient alpha/KR-20.
Split-Half Reliability
A method for assessing internal consistency reliability & involves:
- “splitting” the test in half (e.g., odd- versus even-numbered items) & correlating examinees’ scores on the 2 halves of the test.
Since the size of a reliability coefficient is affected by test length & each half contains only half of the test’s items, the split-half method tends to underestimate a test’s true reliability.
The split-half reliability coefficient is usually corrected with the Spearman-Brown formula.
- The Spearman-Brown formula can also be used more generally to est. the effects of shortening or lengthening a test on its reliability coefficent.
(Split-half Reliability = Spearman-Brown Formula)
Not approp. for speeded tests in which score depends on speed of responding.
Shorter tests are less reliable than longer ones.
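A sketch of the split-half procedure with the Spearman-Brown correction (the simulated item responses are made-up data):

```python
import numpy as np

# Simulate dichotomous responses: rows = 50 examinees, columns = 10 items
rng = np.random.default_rng(1)
ability = rng.normal(size=50)
items = ((ability[:, None] + rng.normal(size=(50, 10))) > 0).astype(int)

# Odd-even split: correlate total scores on the two half-tests
odd_half = items[:, 0::2].sum(axis=1)
even_half = items[:, 1::2].sum(axis=1)
r_half = np.corrcoef(odd_half, even_half)[0, 1]

# Spearman-Brown correction estimates the reliability of the full-length test
r_full = 2 * r_half / (1 + r_half)
print(round(r_half, 2), round(r_full, 2))  # r_full is larger than r_half
```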
Spearman-Brown Formula
The Spearman-Brown formula estimates what the test’s reliability coefficient would be if it were based on the full-length test, providing a better estimate of the test’s true reliability.
(Split-half Reliability = Spearman-Brown Formula)
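The general form of the formula, with a worked example (the reliability values are illustrative):

```python
def spearman_brown(r_xx, length_factor):
    """Estimated reliability when a test is lengthened (or shortened) by length_factor,
    assuming the added items are comparable in content and quality."""
    return (length_factor * r_xx) / (1 + (length_factor - 1) * r_xx)

print(round(spearman_brown(0.60, 2.0), 2))  # doubling a test with r = .60 gives .75
print(round(spearman_brown(0.60, 0.5), 2))  # halving it gives about .43
```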
Coefficient Alpha
Method for assessing internal consistency reliability that provides an index of average inter-item consistency rather than the consistency between 2 halves of the test.
KR-20 can be used as a substitute for coefficient alpha when test items are scored dichotomously.
(KR-20 = Coefficient Alpha)
Kuder-Richardson Formula 20 (KR-20)
Kuder-Richardson Formula 20 (KR-20) can be used as a substitute for coefficient alpha when test items are scored dichotomously (scored as right or wrong; Ex: T/F & multiple choice questions).
(KR-20 = Coefficient Alpha)
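A sketch of the coefficient alpha calculation; because the toy responses below are scored 0/1, the result is also the KR-20 value:

```python
import numpy as np

def cronbach_alpha(item_scores):
    """Coefficient alpha for an examinee-by-item score matrix.
    alpha = (k / (k - 1)) * (1 - sum of item variances / variance of total scores)."""
    scores = np.asarray(item_scores, dtype=float)
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1).sum()
    total_var = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars / total_var)

# Hypothetical dichotomous responses (rows = examinees, columns = items)
data = [[1, 1, 1, 1],
        [1, 1, 1, 0],
        [1, 1, 0, 1],
        [0, 0, 0, 0],
        [1, 1, 1, 1],
        [0, 0, 1, 0]]
print(round(cronbach_alpha(data), 2))  # approx. .79 for this toy data
```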
Inter-Rater Reliability
Important for tests that are subjectively scored (i.e., scored on the basis of judgment), such as essay & projective tests.
Inter-rater reliability helps ensure that an examinee will obtain the same score no matter who does the scoring.
The scores assigned by different raters can be used to calculate a correlation (reliability) coefficient or to determine the percent agreement btwn raters; however, percent agreement can be artificially inflated by the effects of chance agreement.
Alternatively, a special correlation coefficient can be used, such as:
- Cohen’s Kappa Statistic: Used to measure agreement btwn 2 raters when scores represent a nominal scale.
- Kendall’s Coefficient of Concordance: Used to measure agreement btwn 3 or more raters when scores are reported as ranks.
Reliability coefficients over .80 are generally considered acceptable.
(Kappa Statistic = Inter-rater reliability)
Cohen’s Kappa Statistic & Kendall’s Coefficient of Concordance
Cohen’s Kappa Statistic: A correlation coefficient used to assess agreement btwn 2 raters (inter-rater reliability) when scores represent a nominal scale.
(Kappa Statistic = Inter-rater reliability)
Kendall’s Coefficient of Concordance: Used to measure agreement btwn 3 or more raters when scores are reported as ranks.
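A sketch of Cohen’s kappa for 2 raters assigning nominal categories (the ratings are hypothetical); Kendall’s coefficient of concordance would instead require ranks from 3 or more raters:

```python
import numpy as np

def cohens_kappa(rater1, rater2):
    """Agreement between 2 raters on a nominal scale, corrected for chance agreement."""
    rater1, rater2 = np.asarray(rater1), np.asarray(rater2)
    categories = np.union1d(rater1, rater2)
    p_observed = np.mean(rater1 == rater2)
    # Chance agreement: product of the raters' marginal proportions, summed over categories
    p_chance = sum(np.mean(rater1 == c) * np.mean(rater2 == c) for c in categories)
    return (p_observed - p_chance) / (1 - p_chance)

# Hypothetical nominal ratings (e.g., diagnostic categories) from two raters
r1 = ["A", "A", "B", "B", "C", "A", "B", "C", "C", "A"]
r2 = ["A", "A", "B", "C", "C", "A", "B", "C", "B", "A"]
print(round(cohens_kappa(r1, r2), 2))  # approx. .70 (80% raw agreement, 34% expected by chance)
```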
Factors that Affect the size of the Reliability Coefficient
4 factors affect the size of a test’s reliability coefficient:
- Test Length
- Range of Scores/Heterogeneity
- Test Content
- Guessing
Test Length
The larger the sample of the attribute being measured by a test, the smaller the relative effects of measurement error & the more likely the test will provide dependable, consistent information.
Longer tests are generally more reliable than shorter tests.
Test length is increased by adding items of similar content & quality.
The Spearman-Brown formula can be used to estimate the effect of lengthening a test on its reliability, but it tends to overestimate the test’s true reliability.
Range of Scores
A test’s reliability coefficient increases with the heterogeneity of the sample w/regard to the attributes measured by the test, because greater heterogeneity increases the range of scores.
The range of scores is also maximized when item difficulty levels are moderate (p=.50).
Reliability is maximized when the range of scores is unrestricted.
Test Content
The more homogeneous a test is w/regard to content, the higher its reliability coefficient.
This is easiest to understand in terms of internal consistency: the more consistent the test items are in terms of content, the larger the coefficient alpha or split-half reliability coefficient.
Guessing
A test’s reliability coefficient is also affected by the probability that examinees can guess the correct answer to test items.
As the probability of correctly guessing increases, the reliability coefficient decreases.
The more difficult it is to choose the correct answer by guessing, the larger the reliability coefficient.
Confidence Interval
Helps a test user to estimate the range w/in which an examinee’s true score is likely to fall given their obtained score.
Bc tests are not totally reliable, an examinee’s obtained score may or may not be his/her true score.
The best way to interpret an examinee’s obtained score is to construct a confidence interval around that score.
- A confidence interval indicates the range w/in which an examinee’s true score is likely to fall given the obtained score.
- It is derived using the Standard Error of Measurement (SEM):
- 68% = +/- 1 SEM from obtained score
- 95% = +/- 2 SEM from obtained score
- 99% = +/- 3 SEM from obtained score
Standard Error of Measurement (SEM) = Used to construct a confidence interval around a measured or obtained score.
Standard Error of Measurement
It is used to construct a confidence interval around an examinee’s obtained (measured) score.
The SEM is calculated by multiplying the standard deviation of the test scores by the square root of 1 minus the reliability coefficient.
This is an index of the amount of error that can be expected in obtained scores due to the unreliability of the test.
SEM = SDₓ√(1 - rₓₓ)
Ex: A psychologist administers an interpersonal assertiveness test to a sales applicant who receives a score of 80. Since the test’s reliability is less than 1.0, the psych. knows that this score might be an imprecise est. of the applicant’s true score & decides to use the standard error of measurement to construct a 95% confidence interval. Assuming that the test’s reliability coefficient is .84 and its standard deviation is 10, the standard error of measurement is equal to 4.0.
The psych. constructs a 95% confidence interval by adding and subtracting 2 standard errors from the applicant’s obtained score: 80 ±2(4.0) = 72 to 88. There is a 95% chance that the applicant’s true score falls between 72 and 88.
SEM = SDₓ√(1 - rₓₓ) = 10√(1 - .84) = 10(.4) = 4.0
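The same worked example in a short Python sketch (the function names are my own):

```python
import math

def sem(sd, r_xx):
    """Standard error of measurement: SD of the test scores times sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - r_xx)

def confidence_interval(obtained_score, sd, r_xx, n_sem=2):
    """+/- 1 SEM ~ 68%, +/- 2 SEM ~ 95%, +/- 3 SEM ~ 99% confidence."""
    s = sem(sd, r_xx)
    return obtained_score - n_sem * s, obtained_score + n_sem * s

print(round(sem(10, 0.84), 1))  # 4.0

low, high = confidence_interval(80, 10, 0.84)
print(round(low, 1), round(high, 1))  # 72.0 88.0 -> the 95% confidence interval
```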
Validity
Refers to a test’s accuracy, i.e., the extent to which the test measures what it is intended to measure.
3 Different Types of Validity include:
- Content Validity (content or behavior)
- Construct Validity (hypothetical trait or construct)
- Criterion-Related Validity (status/perf. on external criterion)
Content Validity
Important for tests designed to measure a specific content or behavior domain; requires that items be an accurate & representative sample of the content domain(s) they represent.
Content validity is not the same as face validity.
Most important for achievement & job sample tests.
Determined primarily by “expert judgment”
This is of concern when a test has been designed to measure 1 or more specific content/behavior domains.
A test has content validity when its items are a representative sample of the domain(s) that the test is intended to measure.
Usually built into a test as it is being constructed & involves:
- Clearly defining the content/behavior domain,
- Dividing the domain into categories & sub-categories, &
- Writing or selecting items that represent each sub-category, so that the test contains a representative sample of items.
After test devel., content validity is checked by having subject matter experts systematically evaluate the test & determine whether its items are an adequate & representative sample of the content or behavior domain.
When scores on the test (X) are important bc they provide info. about how much each examinee knows about a content domain or about the examinee’s status on the trait being measured, content or construct validity is of interest.