Test Construction Flashcards
reliability
amount of consistency, repeatability, and dependability in scores obtained on a given test
classical test theory
any obtained score is a combination of truth and error; total variability = true score variability + error variability; reliability is the proportion of total variability that is true score variability
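The variance decomposition above can be illustrated with a minimal simulation (all numbers here are hypothetical: a true-score SD of 15 and an error SD of 5 are assumed for illustration):

```python
import random

random.seed(0)

# Classical test theory: observed = true + error.
# Simulate 1,000 examinees with independent error.
true_scores = [random.gauss(100, 15) for _ in range(1000)]
errors = [random.gauss(0, 5) for _ in range(1000)]
observed = [t + e for t, e in zip(true_scores, errors)]

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# Reliability = true-score variance / total observed variance.
reliability = variance(true_scores) / variance(observed)
# With SD_true = 15 and SD_error = 5, the expected value is
# 225 / (225 + 25) = 0.90; the sample estimate will be close to that.
print(round(reliability, 2))
```

The ratio recovers the definition directly: the more error variance added, the smaller the proportion of variability that reflects true scores.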
reliability coefficient
denoted rxx or rtt; commonly derived by correlating the scores obtained on a test at one point in time (x or t) with the scores obtained at a second point in time (x or t)
common sources of error in tests (3)
content sampling, time sampling, test heterogeneity
content sampling error
occurs when a test, by chance, has items that tap into a test-taker's knowledge base or items that don't tap into a test-taker's knowledge
time sampling error
occurs when a test is given at two different points in time and the scores on each administration are different because of factors related to the passage of time (e.g. forgetting over time)
test heterogeneity error
occurs when a test contains heterogeneous items that tap more than one content domain
factors affecting reliability
number of items: reliability INCREASES as the number of items increases; homogeneity of items (items tapping similar content): reliability INCREASES with increased homogeneity; range of scores: an unrestricted range maximizes reliability, and range of scores INCREASES with increased subject heterogeneity; ability to guess: true/false tests are easier to guess, and reliability DECREASES as the ability to guess increases
four estimates of reliability
test-retest reliability; parallel forms reliability; internal consistency reliability (split-half reliability, Kuder-Richardson KR-20 & KR-21, Cronbach's alpha); interrater reliability
test-retest reliability
expressed as a coefficient of stability; involves correlating pairs of scores from the same sample of people administered the identical test at two points in time; major source of error = time sampling (the correlation decreases as the time interval between administrations increases)
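The coefficient of stability is simply the Pearson correlation between the two administrations. A minimal sketch, with made-up scores for five people (the data are hypothetical, chosen only to illustrate the computation):

```python
def pearson_r(x, y):
    """Pearson correlation between paired scores, e.g. the same
    people tested at time 1 and time 2."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Hypothetical scores for five people at two administrations:
time1 = [88, 92, 75, 60, 95]
time2 = [86, 90, 78, 65, 93]
# People keep nearly the same rank order, so the coefficient
# of stability is high (close to 1.0).
print(round(pearson_r(time1, time2), 3))
```

The same correlation machinery underlies parallel forms reliability and the criterion-related validity coefficient; only what is being correlated changes.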
parallel forms reliability
expressed as a coefficient of equivalence; involves correlating the scores obtained by the same group of people on two roughly equivalent but not identical forms of the same test administered at two different points in time; major sources of error = time sampling and content sampling (subjects may be more or less familiar with the items on one version of the test)
internal consistency reliability
looks at the consistency of scores within the test; the test is administered only once to one group of people; estimated via split-half reliability, Kuder-Richardson (KR-20 & KR-21), or Cronbach's coefficient alpha
split-half reliability
calculated by splitting the test in half and correlating the scores each person obtains on the two halves; the Spearman-Brown formula is typically used to correct the half-test correlation up to full test length; major source of error = item or content sampling (someone might, by chance, know more of the items on one half)
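The Spearman-Brown formula mentioned above is a one-liner; the same formula also explains the earlier point that reliability increases with the number of items. A minimal sketch (the 0.70 half-test correlation is a hypothetical value):

```python
def spearman_brown(r, n=2):
    """Spearman-Brown prophecy formula: projected reliability of a
    test lengthened by factor n, given the reliability (or half-test
    correlation) r. For split-half reliability, n = 2, since the
    full test is twice as long as each half."""
    return n * r / (1 + (n - 1) * r)

# A half-test correlation of 0.70 implies a higher full-length
# reliability: 2 * 0.70 / (1 + 0.70) ≈ 0.824
print(round(spearman_brown(0.70), 3))
```

Setting n > 2 projects what reliability would become if the test were lengthened further, which is why adding (homogeneous) items raises reliability.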
Kuder-Richardson (KR-20 & KR-21) & Cronbach’s Coefficient Alpha
sophisticated forms of internal consistency reliability; involve analyzing the correlation of each item with every other item on the test; reliability equals the mean of the correlation coefficients for every possible split-half; KR-20 & KR-21: used when items are scored dichotomously (correct or incorrect); Cronbach's coefficient alpha: used when items are scored non-dichotomously and there is a range of possible scores for each item or category (e.g. Likert scale); major sources of error = content sampling and test heterogeneity
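A minimal sketch of the alpha computation, using the standard variance form of the formula (the 5-person, 4-item dataset is hypothetical; for dichotomous 0/1 items, alpha reduces to KR-20):

```python
def cronbach_alpha(scores):
    """Cronbach's alpha from a people-by-items score matrix:
    alpha = k/(k-1) * (1 - sum(item variances) / variance(totals)).
    For dichotomous 0/1 items this equals KR-20."""
    k = len(scores[0])

    def var(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    item_vars = [var([person[i] for person in scores]) for i in range(k)]
    total_var = var([sum(person) for person in scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Five people answering four dichotomous items (1 = correct):
data = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]
# item variances sum to 0.80, total-score variance is 2.0,
# so alpha = (4/3) * (1 - 0.80/2.0) = 0.8
print(round(cronbach_alpha(data), 3))
```

When items tap the same domain (as in this nearly Guttman-ordered data), total-score variance dwarfs the summed item variances and alpha is high; heterogeneous items push alpha down, matching the listed sources of error.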
interrater reliability
looks at the degree of agreement between two or more scorers when a test is subjectively scored
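One common index of interrater reliability is Cohen's kappa, which corrects raw percent agreement for agreement expected by chance. A minimal sketch with hypothetical pass/fail ratings from two raters:

```python
def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters on the same cases:
    kappa = (p_observed - p_chance) / (1 - p_chance)."""
    n = len(rater_a)
    categories = set(rater_a) | set(rater_b)
    p_obs = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal proportions.
    p_chance = sum(
        (rater_a.count(c) / n) * (rater_b.count(c) / n) for c in categories
    )
    return (p_obs - p_chance) / (1 - p_chance)

# Hypothetical ratings of six essays by two scorers:
a = ["pass", "pass", "fail", "pass", "fail", "pass"]
b = ["pass", "fail", "fail", "pass", "fail", "pass"]
# Raw agreement is 5/6, chance agreement is 0.5,
# so kappa = (5/6 - 1/2) / (1 - 1/2) = 2/3
print(round(cohens_kappa(a, b), 3))
```

Kappa is lower than raw agreement whenever chance agreement is substantial, which is why it is preferred over simple percent agreement for subjectively scored tests.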
standard error of measurement
based on a theoretical distribution: the scores one person would obtain if he/she were tested hundreds of times with alternate or equivalent forms of the test; it is the standard deviation of that theoretically normal distribution of test scores obtained by one individual on equivalent tests; ranges from 0.0 to the SD of the test; if the test were perfectly reliable, the standard error of measurement would be 0.0
there is a 95% probability that a person's true score lies within two standard errors of measurement of the obtained score
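The two cards above can be computed directly with the standard formula SEM = SD × √(1 − rxx). A minimal sketch, assuming an IQ-style test with SD = 15 and reliability 0.91 (both values hypothetical):

```python
import math

def sem(sd, reliability):
    """Standard error of measurement: SD * sqrt(1 - r_xx)."""
    return sd * math.sqrt(1 - reliability)

# Hypothetical IQ-style test: SD = 15, reliability = 0.91.
s = sem(15, 0.91)          # 15 * sqrt(0.09) = 4.5
# ~95% band for the true score around an obtained score of 100:
band = (100 - 2 * s, 100 + 2 * s)
print(round(s, 2), [round(x, 1) for x in band])
```

Note the two endpoints promised by the flashcard: a perfectly reliable test (rxx = 1.0) gives SEM = 0, and a completely unreliable one (rxx = 0.0) gives SEM equal to the test's SD.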
content validity
addresses how adequately a test samples a particular content area; quantified by asking a panel of experts to rate each item as essential, useful but not essential, or not necessary; no numerical validity coefficient is derived
criterion-related validity
looks at how adequately a test score can be used to infer, predict, or estimate a criterion outcome (e.g. how well SAT scores predict college GPA); the coefficient (rxy) ranges from -1.0 to +1.0; validities as low as 0.20 may be considered acceptable; two subtypes: concurrent validity and predictive validity