Test Construction Flashcards
true score variability
variability due to real differences in ability or knowledge in the test-takers
error variability
variability caused by chance or random factors
classical test theory
observed score = true score + error; total test score variability = true score variability + error variability
reliability
the amount of consistency, repeatability, and dependability in scores obtained on a test
reliability coefficient
- represented as ‘r’
- ranges from 0.00-1.00
- minimum acceptability is 0.8
- two factors that affect the size are: range of test scores and the homogeneity of test content
sources of errors in tests
- content sampling
- time sampling (ex - forgetting over time)
- test heterogeneity
factors that affect reliability
- number of items (the more the better)
- homogeneity (the more similar the items are, the better)
- range of scores (the greater the range, the better)
- ability to guess (true/false = the least reliable)
test-retest reliability
(or coefficient of stability)
- correlating pairs of scores from the same sample of people who are administered the identical test at two points in time
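A minimal Python sketch (hypothetical scores, not from the source) of computing a test-retest coefficient as the Pearson correlation between two administrations:
    from statistics import correlation  # Python 3.10+

    time1 = [12, 15, 9, 20, 17, 11]   # hypothetical scores at the first administration
    time2 = [13, 14, 10, 19, 18, 12]  # the same examinees retested later

    r_test_retest = correlation(time1, time2)  # coefficient of stability
    print(round(r_test_retest, 2))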
parallel forms reliability
(or coefficient of equivalence)
- correlating the scores obtained by the same group of people on two roughly equivalent but not identical forms of the same test administered at two different points in time
internal consistency reliability
- looks at the consistency of the scores within the test
- 2 ways: Kuder-Richardson or Cronbach’s coefficient alpha
split half reliability
- splitting the test in half (ex - odd vs. even numbered items) and correlating the scores on the two halves; because the correlation is based on only half the number of items, it underestimates the full test's reliability
Spearman-Brown prophecy formula
- used to correct split-half reliability estimates for the shortened test length
- tells us how much more reliable the test would be if it were longer
*inappropriate for speeded tests
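A minimal Python sketch (hypothetical half-test scores) of split-half reliability with the Spearman-Brown correction to estimate full-length reliability:
    from statistics import correlation

    # hypothetical totals on the odd- vs. even-numbered items for six examinees
    odd_half = [10, 7, 12, 5, 9, 11]
    even_half = [9, 8, 11, 6, 10, 12]

    r_half = correlation(odd_half, even_half)  # reliability of a half-length test
    r_full = (2 * r_half) / (1 + r_half)       # Spearman-Brown estimate for the full-length test
    print(round(r_half, 2), round(r_full, 2))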
Kuder-Richardson and Cronbach’s coefficient alpha
- involve analysis of the correlation of each item with every other item in the test (reliability/internal consistency)
- KR-20 is used when items are scored dichotomously (ex - right or wrong)
- Cronbach’s is used when items are scored non-dichotomously (ex - Likert scale)
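A minimal Python sketch (hypothetical item responses) of Cronbach's alpha; with dichotomous 0/1 items this reduces to KR-20:
    from statistics import pvariance

    # hypothetical dichotomous responses: rows = examinees, columns = items
    data = [
        [1, 1, 0, 1],
        [1, 0, 0, 1],
        [0, 1, 1, 1],
        [1, 1, 1, 1],
        [0, 0, 0, 1],
    ]

    k = len(data[0])                                    # number of items
    item_vars = [pvariance(col) for col in zip(*data)]  # variance of each item
    total_var = pvariance([sum(row) for row in data])   # variance of total scores
    alpha = (k / (k - 1)) * (1 - sum(item_vars) / total_var)
    print(round(alpha, 2))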
interrater reliability
- the degree of agreement between two or more scorers when a test is subjectively scored
- best way to improve = provide opportunity for group discussion, practice exercises, and feedback
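A minimal Python sketch (hypothetical ratings) of one simple interrater index, percent agreement; chance-corrected indices such as Cohen's kappa are also common:
    # hypothetical pass/fail ratings from two scorers on the same ten responses
    rater_a = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
    rater_b = [1, 0, 0, 1, 0, 1, 1, 1, 1, 1]

    agreements = sum(a == b for a, b in zip(rater_a, rater_b))
    print(agreements / len(rater_a))  # 0.8 -> the two scorers agree on 80% of responses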
validity
- the meaningfulness, usefulness, or accuracy of a measure
3 basic types of validity
- content
- criterion
- construct
face validity
- the degree to which a test subjectively appears to measure what it says it measures
content validity
- how adequately a test samples a particular content area
true positive
- test takers who are accurately identified as possessing what is being measured
*correct prediction
false positive
- test takers who are inaccurately identified as possessing what is being measured
*incorrect prediction
true negative
- test takers who are accurately identified as not possessing what is being measured
*correct prediction
false negatives
- test takers who are inaccurately identified as not possessing what is being measured
*incorrect prediction
item difficulty
- represented by ‘p’
- can range in value from 0 to 1.0 (0 = very difficult, 1.0 = very easy)
- difficulty level (p) = the proportion of test takers who answered the item correctly
- items should have an average difficulty level of 0.50, with a range of about 0.30 to 0.80
example - if ‘p’ is 0.10, that means only 10% of people got the item right (therefore it was difficult)
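A minimal Python sketch (hypothetical responses) of computing an item's difficulty level:
    # hypothetical responses to one item (1 = correct, 0 = incorrect)
    responses = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]

    p = sum(responses) / len(responses)  # proportion who answered correctly
    print(p)  # 0.3 -> a fairly difficult item (only 30% got it right)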
diagnostic validity
- the degree to which a test accurately identifies those who have and those who DO NOT have the disorder
- an ideal situation would result in high sensitivity, specificity, hit rate, and predictive values with few false positives and few false negatives
convergent validity
the degree to which scores have high correlations with scores on another measure that assesses the same thing
standard error of measurement
an index of the amount of measurement error expected in an obtained score; used to construct a confidence interval around that score
confidence intervals
68% = 1 standard error of measurement
95% = 2 standard error of measurement
99% = 3 standard error of measurement
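A minimal Python sketch (hypothetical numbers) using the usual formula SEM = SD × √(1 − reliability) to build these intervals:
    import math

    sd = 15             # hypothetical standard deviation of the test scores
    reliability = 0.91  # hypothetical reliability coefficient
    score = 110         # an examinee's obtained score

    sem = sd * math.sqrt(1 - reliability)       # standard error of measurement
    ci_68 = (score - 1 * sem, score + 1 * sem)
    ci_95 = (score - 2 * sem, score + 2 * sem)
    ci_99 = (score - 3 * sem, score + 3 * sem)
    print(round(sem, 1), ci_95)  # SEM = 4.5, 95% CI = (101.0, 119.0)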
content validity
Does the content accurately measure all of the topics/items it’s intending to measure?
Example: In a stats exam, content validity would be very low if it only asked you how to calculate the mean. Content validity would increase if it asked additional questions about other things within the stats world as well.
criterion validity
Does the content accurately reflect a set of abilities in a current or future setting?
2 kinds:
Concurrent criterion = does this test accurately assess my student’s current level of ability?
Predictive criterion = does this test accurately assess how my students will do in the future?
construct validity
does your test measure the construct it claims to measure (ex - aggression)
construct = a group of interrelated variables that you care about
Increase item difficulty
0= very difficult
1= very easy
So to increase difficulty, you want to add items that have p values as close to 0 as possible
Sensitivity vs specificity
Sensitivity: senses people WITH the diagnosis = true positives / (TP + FN)
Specificity: specifies people who DON’T have the diagnosis = true negatives / (TN + FP)
(Note: these differ from the predictive values: positive predictive value = TP / (TP + FP); negative predictive value = TN / (TN + FN))
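A minimal Python sketch (hypothetical counts) computing all four indices from a 2x2 classification table:
    # hypothetical validation-study counts
    TP, FP, TN, FN = 40, 10, 80, 20

    sensitivity = TP / (TP + FN)  # proportion of people WITH the disorder who test positive
    specificity = TN / (TN + FP)  # proportion of people WITHOUT the disorder who test negative
    ppv = TP / (TP + FP)          # positive predictive value
    npv = TN / (TN + FN)          # negative predictive value
    print(sensitivity, specificity, ppv, npv)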
Factor matrix & communality
Tool that helps us see how different things (variables) are connected or share common traits.
Communality: the proportion of a variable’s variance that is explained by the common factors
We rotate the initial factor matrix to obtain one that is easier to interpret
Size of Standard error of mean increases as
Population SD INCREASES and sample size DECREASES
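A minimal Python sketch (hypothetical values) of the standard error of the mean, SE = SD / √n:
    import math

    population_sd = 15
    n = 25
    se_mean = population_sd / math.sqrt(n)  # larger SD or smaller n -> larger SE
    print(se_mean)  # 3.0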
Raising a test’s cut-off score will have which effects?
Raising the cut-off score will result in fewer applicants being hired.
DECREASE: number of false positives
INCREASE: number of true negatives
When test scores represent an interval or ratio scale and the distribution is skewed, the best measure of central tendency is what?
Median
The point at which an item characteristic curve intercepts the Y (vertical) axis provides information about what?
Probability of answering item correctly by guessing
Banding (statistical banding)
Puts test scores into groups based on a range of possible errors. People with scores in the same group are considered equally good at something (assuming small score differences don’t matter for job performance)
Attenuation formula
Used to estimate what the maximum criterion-related validity coefficient would be if the predictor and/or criterion had a reliability coefficient of 1.0
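A minimal Python sketch (hypothetical coefficients) of the correction for attenuation, r_xy / √(r_xx · r_yy):
    import math

    r_xy = 0.40  # hypothetical observed validity coefficient
    r_xx = 0.80  # hypothetical reliability of the predictor
    r_yy = 0.90  # hypothetical reliability of the criterion

    # estimated validity if predictor and criterion were perfectly reliable
    r_corrected = r_xy / math.sqrt(r_xx * r_yy)
    print(round(r_corrected, 2))  # about 0.47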
When the prevalence of a disorder increases, how are a test’s positive and negative predictive values affected?
When prevalence INCREASES the POSITIVE predictive value INCREASES and NEGATIVE predictive value DECREASES
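A minimal Python sketch (assuming a hypothetical test with sensitivity and specificity of 0.90 each) showing this effect of base rates:
    def predictive_values(prevalence, sens=0.90, spec=0.90):
        tp = prevalence * sens
        fp = (1 - prevalence) * (1 - spec)
        tn = (1 - prevalence) * spec
        fn = prevalence * (1 - sens)
        return tp / (tp + fp), tn / (tn + fn)  # (PPV, NPV)

    print(predictive_values(0.05))  # low prevalence  -> PPV ~0.32, NPV ~0.99
    print(predictive_values(0.30))  # high prevalence -> PPV ~0.79, NPV ~0.95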
How to calculate standard error of estimate
68% = predicted criterion score ± 1 standard error of estimate
95% = predicted criterion score ± 2 standard errors of estimate
99% = predicted criterion score ± 3 standard errors of estimate
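A minimal Python sketch (hypothetical numbers) using the usual formula SEE = SD_y × √(1 − r²) to build such an interval:
    import math

    sd_y = 10       # hypothetical SD of the criterion
    r_xy = 0.60     # hypothetical validity coefficient
    predicted = 75  # a person's predicted criterion score

    see = sd_y * math.sqrt(1 - r_xy ** 2)  # standard error of estimate
    ci_95 = (predicted - 2 * see, predicted + 2 * see)
    print(round(see, 1), ci_95)  # SEE = 8.0, 95% CI = (59.0, 91.0)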