Test Construction Flashcards
Item Difficulty
• The proportion of examinees in the tryout sample who answer the item correctly
• Used to measure examinees' knowledge or skill level
Range: 0-1
“p”
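The calculation can be sketched in Python with hypothetical response data (1 = correct, 0 = incorrect):

```python
# Hypothetical responses to one item across 10 examinees.
responses = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]

# Item difficulty (p) = proportion of examinees answering correctly.
p = sum(responses) / len(responses)
print(p)  # 0.7
```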
Item Discrimination
• refers to how much an item discriminates between examinees who obtain low or high scores on the test or an external criterion
• Calculated by subtracting the percent of examinees in the lower scoring group who answered the item correctly from the percent of examinees in the upper-scoring group who answered the item correctly
“D”; Range: -1 to +1; .35 is acceptable
Item Discrimination:
-1, 0, +1
+1: everyone in the upper-scoring group answered the item correctly and everyone in the lower-scoring group answered it incorrectly
0: the same percentage of both groups answered the item correctly
-1: everyone in the lower-scoring group answered the item correctly and everyone in the upper-scoring group answered it incorrectly
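The subtraction can be sketched with hypothetical group proportions:

```python
# Hypothetical proportions answering the item correctly in each group.
upper_correct = 0.80  # upper-scoring group
lower_correct = 0.45  # lower-scoring group

# D = p(upper) - p(lower); ranges from -1 to +1.
D = upper_correct - lower_correct
print(round(D, 2))  # 0.35
```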
Classical Test Theory
• variability in test scores reflects a combination of true score variability and variability due to measurement (random) error: X = T + E, i.e., Total Variability (X) = True Score Variability (T) + Measurement Error Variability (E)
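The decomposition can be illustrated with hypothetical true scores and errors; when T and E are uncorrelated, the variances add:

```python
from statistics import pvariance

# Hypothetical true scores and random errors (chosen so their covariance is 0).
true_scores = [95, 100, 105, 110]
errors = [2, -2, -2, 2]
observed = [t + e for t, e in zip(true_scores, errors)]  # X = T + E

# Total variability equals true score variability plus error variability.
total_var = pvariance(observed)
parts_var = pvariance(true_scores) + pvariance(errors)
print(total_var, parts_var)  # 35.25 35.25
```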
True Score Variability
Actual test knowledge
Measurement Error
environment or guessing
Reliability (4 things)
ensures an examinee's score reflects their true score; symbol: rxx; range is 0-1; .80 is acceptable
Reliability Coefficient
measure of true score variability
Reliability 4 methods (4)
- test-retest
- alternate forms
- internal consistency
- inter-rater
Test-Retest Reliability (3)
Consistency over time; also known as the coefficient of stability; NOT appropriate for characteristics that fluctuate over time or are affected by random effects
Alternate Forms Reliability (3)
Consistency across two forms of a test; aka parallel forms reliability; appropriate when the attribute is stable over time; NOT for fluctuating characteristics or random effects
Internal Consistency (3)
Degree of consistency across different test items; appropriate for a single content or behavior domain; measured by split-half reliability and Cronbach's coefficient alpha
Split-Half Reliability (6)
Associated with internal consistency; the test is split in half and the halves are correlated; the coefficient is corrected with the Spearman-Brown prophecy formula; NOT for speeded tests; related: Cronbach's coefficient alpha, Kuder-Richardson Formula 20
Spearman-Brown Formula
Associated with internal consistency; used with split-half reliability and, more generally, to estimate the effect of shortening or lengthening a test on its reliability coefficient
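The formula can be sketched in Python; r is the obtained reliability and k the factor by which test length changes (k = 2 gives the classic split-half correction):

```python
def spearman_brown(r, k):
    """Predicted reliability when test length is multiplied by k."""
    return (k * r) / (1 + (k - 1) * r)

# Correcting a hypothetical split-half correlation of .60 (k = 2):
corrected = spearman_brown(0.60, 2)
print(round(corrected, 2))  # 0.75
```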
Cronbach’s Coefficient Alpha
Used with Split-Half Reliability
“mean of all possible split-half correlation coefficients”; can't be used with “forced choice” items; Cronbach's α is used with continuous variables
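A minimal sketch of the coefficient alpha computation, using hypothetical data (one score list per item, aligned across examinees):

```python
from statistics import pvariance

def cronbach_alpha(items):
    """items: one list of scores per item, aligned across examinees."""
    k = len(items)
    item_var_sum = sum(pvariance(item) for item in items)
    totals = [sum(scores) for scores in zip(*items)]  # each examinee's total
    return (k / (k - 1)) * (1 - item_var_sum / pvariance(totals))

# Three hypothetical continuous items administered to four examinees:
alpha = cronbach_alpha([[1, 2, 3, 4], [2, 3, 4, 5], [1, 3, 2, 4]])
print(round(alpha, 2))  # 0.95
```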
Kuder-Richardson Formula 20 (KR-20) (3)
Associated with split-half reliability; can be used as a substitute for coefficient alpha when test items are scored dichotomously; used for true/false and multiple-choice questions with a right or wrong answer
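A sketch with hypothetical dichotomous (0/1) items; KR-20 is coefficient alpha specialized to right/wrong scoring:

```python
from statistics import pvariance

def kr20(items):
    """items: one list of 0/1 scores per item, aligned across examinees."""
    k = len(items)
    pq_sum = 0.0
    for item in items:
        p = sum(item) / len(item)   # proportion answering this item correctly
        pq_sum += p * (1 - p)
    totals = [sum(scores) for scores in zip(*items)]
    return (k / (k - 1)) * (1 - pq_sum / pvariance(totals))

# Three hypothetical true/false items, four examinees:
reliability = kr20([[1, 1, 0, 0], [1, 1, 0, 0], [1, 0, 1, 0]])
print(round(reliability, 2))  # 0.6
```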
Inter-rater Reliability (5)
Important for measures that are subjectively scored (e.g., essays); ensures examinees obtain the same score no matter who does the scoring; measured using percent agreement (which overestimates inter-rater reliability), Cohen's kappa statistic, or Kendall's coefficient of concordance
Cohen’s kappa statistic
Associated with inter-rater reliability; used to measure agreement between two raters when scores represent a nominal scale
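A minimal sketch with two hypothetical raters assigning nominal categories:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters (nominal scores)."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum((count_a[c] / n) * (count_b[c] / n)
                   for c in set(rater_a) | set(rater_b))
    return (observed - expected) / (1 - expected)

# Hypothetical ratings of four cases:
kappa = cohens_kappa(["yes", "yes", "no", "no"], ["yes", "no", "no", "no"])
print(kappa)  # 0.5
```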
Kendall’s coefficient of concordance
Associated with inter-rater reliability; used to measure agreement between three or more raters when scores are reported as ranks
Factors that Affect Reliability (4)
Test length: longer tests are more reliable
Range of scores: a wider (unrestricted) range of scores increases the size of the reliability coefficient
Content of the test: more homogenous, more reliable
Likelihood of items can be answered by guessing: less choice/guessing, more reliable
Confidence Interval
- indicates the range within which an examinee’s true score is likely to fall given his/her obtained score
- derived using the standard error of measurement (SEM)
Standard Error of Measurement (SEM)
o used to obtain a confidence interval around obtained test score
• 68% confidence interval: one SEM is added to and subtracted from the obtained score
• 95% confidence interval: two SEMs are added to and subtracted from the obtained score
• 99% confidence interval: three SEMs are added to and subtracted from the obtained score
SEM = standard deviation × √(1 − rxx)
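The formula and the confidence-interval bands can be sketched with hypothetical numbers:

```python
import math

sd, rxx = 15, 0.91      # hypothetical test SD and reliability coefficient
sem = sd * math.sqrt(1 - rxx)
print(round(sem, 1))    # 4.5

obtained = 100
# 95% confidence interval: obtained score +/- 2 SEM
low, high = obtained - 2 * sem, obtained + 2 * sem
print(round(low, 1), round(high, 1))  # 91.0 109.0
```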
Consensual observer drift
Associated with inter-rater reliability; occurs when two or more observers working together influence each other's ratings on a behavioral rating scale so that they assign ratings in a similar idiosyncratic way.
Coefficient of concordance:
is another measure of inter-rater reliability.
Validity
Refers to a test’s accuracy in terms of the extent to which the test measures what it was designed to measure (main ones: content, construct, criterion);
Content Validity
measures a specific content or behavior domain
Construct Validity
measures a theoretical hypothetical trait or construct
Criterion-related Validity
used to predict or estimate an examinee’s status on an external criterion
Face Validity
Refers to whether or not test items “look like” they're measuring what the test is designed to measure; not an actual type of validity
multitrait-multimethod matrix
Used with construct validity; a table of correlation coefficients that provide information about a test’s convergent and divergent (discriminant) validity
Discriminant Validity
Assessed in terms of convergent and divergent validity
Multitrait-Multimethod Matrix (4 measures)
- Measure being validated
- Measure of the same trait using a different method
- Measure of an unrelated trait using the same method
- Measure of an unrelated trait using a different method
Convergent Validity
Associated with discriminant validity; the correlation between the test we're validating and a measure of the same trait using a different method
Divergent Validity
Associated with discriminant validity; the correlations between the test we're validating and measures of unrelated traits
Factor Analysis
A more complex way to assess construct validity as well as discriminant validity:
1. Administer tests to a sample of examinees
2. Derive and interpret the correlation matrix
3. Extract the initial factor matrix (difficult to interpret)
4. Rotate the factor matrix (makes it easier to interpret)
communality
Associated with factor analysis; the proportion of variance in a single variable that is explained by all of the factors combined
Factor Matrix
Orthogonal means uncorrelated,
Oblique means correlated
Criterion-Related Validity
Important when test scores will be used to predict or estimate status on a criterion (a different measure); evaluated by correlating scores on the test (predictor) with scores on the criterion for a sample of examinees to obtain a criterion-related validity coefficient, which is always less than ±1
Concurrent Validity:
Involves obtaining scores on the predictor and criterion at about the same time (current status); contrast with predictive validity
Predictive Validity
Involves obtaining predictor scores prior to obtaining criterion scores
Standard Error of Estimate:
Used to construct a confidence interval around a predicted criterion score (vs. the SEM, which is used for an obtained test score); SEest = SDy × √(1 − rxy²)
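The formula with hypothetical numbers:

```python
import math

sd_y, rxy = 10, 0.60    # hypothetical criterion SD and validity coefficient
se_est = sd_y * math.sqrt(1 - rxy ** 2)
print(round(se_est, 1))  # 8.0

predicted = 50
# 68% confidence interval: predicted criterion score +/- 1 SEest
low, high = predicted - se_est, predicted + se_est
print(round(low, 1), round(high, 1))  # 42.0 58.0
```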
SEM vs Sest
SEM : confidence interval around a measure or obtained score
SEest: confidence interval around a predicted score
Validity vs. Reliability
Reliability is a necessary but not sufficient condition for validity; e.g., a valid test must be reliable, but reliability doesn't guarantee validity. The validity coefficient cannot exceed the square root of the reliability coefficient:
rxy ≤ √rxx
Steps in Validating a Predictor:
- Conduct a job analysis
- Select/develop the predictor and criterion
- Obtain and correlate scores on the predictor and criterion
- Check for adverse impact
- Evaluate incremental validity
- Cross-validate
Incremental Validity:
Refers to the increase in decision-making accuracy that use of a predictor provides
Incremental Validity Scatterplot:
Criterion: Y-axis; Predictor: X-axis
Incremental Validity Calculation
Calculated by subtracting the base rate from the positive hit rate
Positive Hit Rate
Associated with incremental validity; the proportion of people hired using the predictor who are successful on the criterion (true positives / total positives); incremental validity = positive hit rate − base rate
Base Rate
Associated with incremental validity; the proportion of people hired without the predictor who are successful (number of successful individuals / total number of individuals); incremental validity = positive hit rate − base rate
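The positive hit rate, base rate, and incremental validity can be sketched with hypothetical selection numbers:

```python
# Hypothetical selection outcomes:
true_positives = 30     # hired using the predictor AND successful on the criterion
total_positives = 40    # everyone hired using the predictor
successful = 50         # successful individuals without the predictor
total = 100             # total number of individuals

positive_hit_rate = true_positives / total_positives   # 0.75
base_rate = successful / total                         # 0.5
incremental_validity = positive_hit_rate - base_rate
print(incremental_validity)  # 0.25
```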
Specificity
Associated with incremental validity; refers to the identification of true negatives (the percentage of cases in the validation sample who do not have the disorder and were accurately classified by the test as not having it).
Sensitivity
Associated with incremental validity; refers to the probability that a predictor will correctly identify people with the disorder from the pool of people with the disorder; calculated as true positives / (true positives + false negatives).
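Both can be computed from hypothetical confusion-matrix counts:

```python
# Hypothetical validation-sample counts:
tp, fn = 40, 10   # have the disorder: correctly vs. incorrectly classified
tn, fp = 85, 15   # do not have the disorder: correctly vs. incorrectly classified

sensitivity = tp / (tp + fn)   # true positives / all who have the disorder
specificity = tn / (tn + fp)   # true negatives / all who do not
print(sensitivity, specificity)  # 0.8 0.85
```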
Norm-Referenced Interpretation:
Compares an examinee's test score to scores obtained by a standardization sample or other comparison group; the raw score is converted to a score that indicates his/her relative standing in the comparison group, e.g., standard score, percentile rank, z-score, T score, deviation IQ
Percentile Rank (4)
- Ranges from 1-99
- Expresses an examinee's score in terms of the percentage of examinees who achieved lower scores
- Distribution is always flat (rectangular) regardless of the shape of the raw score distribution
- Maximizes differences in the middle of the raw score distribution and minimizes differences at the extremes
Nonlinear Transformation:
Changes the shape of the original raw score distribution
Limitation: such scores indicate an examinee's relative position in a distribution but do not provide information about differences between examinees' raw scores
Standard Scores:
Indicates the examinee’s relative standing in the comparison group in terms of standard deviations from the mean
Z-scores, T-scores, and deviation IQ
• Z-score distribution has a mean of 0 and standard deviation of 1
o if an examinee obtains a score of 110 on a test that has a mean of 100 and standard deviation of 10, his/her z-score is +1.0
•T score: mean of 50 and SD of 10
• Deviation IQ score: mean of 100 and SD of 15
Z Scores
o calculated by subtracting the mean of the distribution from the examinee’s score to obtain a deviation score and dividing the deviation score by the distribution’s standard deviation
o if an examinee obtains a score of 110 on a test that has a mean of 100 and standard deviation of 10, his/her z-score is +1.0
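The example on the card, carried through to T and deviation IQ scores:

```python
raw, mean, sd = 110, 100, 10      # example from the card
z = (raw - mean) / sd             # z-score: mean 0, SD 1
t_score = 50 + 10 * z             # T score: mean 50, SD 10
deviation_iq = 100 + 15 * z       # deviation IQ: mean 100, SD 15
print(z, t_score, deviation_iq)   # 1.0 60.0 115.0
```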
Criterion-Referenced Interpretation
Involves interpreting an examinee's score in terms of a predefined standard, e.g., a percent correct (percentage) score; a cutoff score is usually set; also used to interpret likely status on an external criterion using a regression equation or expectancy table
Ex. Pass or fail test
Leptokurtic
Distribution of scores that is more pointed than normal distribution
Platykurtic
distribution of scores that is more flat than the normal distribution
Eigenvalue
Indicates the total amount of variability in a set of tests or other variables that is explained by an identified component or factor