Test construction Flashcards
What is the item discrimination index?
The item discrimination index (D) is the difference between the percentage of examinees with high total test scores who answered the item correctly and the percentage of examinees with low total test scores who answered the item correctly; it ranges from -1.0 to +1.0. When the same percentage of examinees in the two groups answered the item correctly, D equals 0.
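A minimal sketch of how D might be computed from a matrix of scored item responses; the data, the 27% grouping rule, and the variable names are illustrative assumptions, not part of the cards:

```python
import numpy as np

# Hypothetical 0/1 responses: rows = examinees, columns = items.
responses = np.random.default_rng(0).integers(0, 2, size=(100, 10))

totals = responses.sum(axis=1)
high = responses[totals >= np.quantile(totals, 0.73)]   # top ~27% of total scores
low = responses[totals <= np.quantile(totals, 0.27)]    # bottom ~27% of total scores

# D for each item = proportion correct in the high-scoring group
# minus proportion correct in the low-scoring group (ranges from -1.0 to +1.0).
D = high.mean(axis=0) - low.mean(axis=0)
print(np.round(D, 2))
```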
How can you increase the reliability coefficient?
Reliability coefficients tend to be larger for longer tests than for shorter tests, as long as the added items address the same content as the original items, and when the tryout sample is heterogeneous with regard to the attribute measured by the test so that there is an unrestricted range of scores.
-reliability is maximized when the range of scores is unrestricted; when examinees are heterogeneous on the measured attribute, the range of scores is maximized
-item difficulty also affects the range: tests made up of all easy or all hard items produce uniformly high or low scores, so the average item difficulty should be mid-range
Explain classical test theory
Classical test theory is also known as true score theory and predicts that obtained test scores (X) are due to a combination of true score variability (T) and measurement error (E), with measurement error referring to random factors that affect test performance in unpredictable ways.
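A small simulated illustration of X = T + E, showing that reliability works out to the ratio of true score variance to obtained score variance; all numbers are made up for demonstration:

```python
import numpy as np

rng = np.random.default_rng(1)
T = rng.normal(100, 15, size=10_000)   # hypothetical true scores
E = rng.normal(0, 5, size=10_000)      # random measurement error
X = T + E                              # obtained scores

# Reliability = true score variance / obtained score variance (about .90 here).
print(round(T.var() / X.var(), 2))
```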
How do you interpret a reliability coefficient?
A reliability coefficient is interpreted directly as the amount of variability in test scores that’s due to true score variability. When a test’s reliability coefficient is .90, this means that 90% of variability in test scores is due to true score variability and the remaining 10% is due to measurement error.
What is the spearman-brown formula used for?
Test length is one of the factors that affects the size of the reliability coefficient, and the Spearman-Brown formula is often used to estimate the effects of lengthening or shortening a test on its reliability coefficient. This formula is especially useful for correcting the split-half reliability coefficient because assessing split-half reliability involves splitting the test in half and calculating a reliability coefficient for the two halves of the test. Therefore, split-half reliability tends to underestimate a test’s actual reliability, and the Spearman-Brown formula is used to estimate the reliability coefficient for the full length of the test.
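A sketch of the Spearman-Brown prophecy formula, where n is the factor by which test length changes (the function name is mine):

```python
def spearman_brown(r: float, n: float) -> float:
    """Estimated reliability after changing test length by a factor of n."""
    return (n * r) / (1 + (n - 1) * r)

# Correcting a split-half coefficient of .75 to the full-length test (n = 2):
print(round(spearman_brown(0.75, 2), 2))   # 0.86
```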
When is Cohen’s Kappa coefficient used?
The kappa coefficient is used to assess the consistency of ratings assigned by two raters when the ratings represent a nominal scale (e.g., when a rating scale classifies children as either meeting or not meeting the diagnostic criteria for ADHD).
used to evaluate inter-rater reliability
corrected for chance agreement between raters
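A minimal kappa computation for two raters with nominal ratings: observed agreement (p_o) is corrected for chance agreement (p_e). The rating data are hypothetical; if scikit-learn is available, sklearn.metrics.cohen_kappa_score should give the same result:

```python
# Hypothetical nominal ratings from two raters (e.g., meets / does not meet criteria).
rater1 = ["yes", "yes", "no", "no", "yes", "no", "yes", "no", "no", "no"]
rater2 = ["yes", "no", "no", "no", "yes", "no", "yes", "yes", "no", "no"]

n = len(rater1)
p_o = sum(a == b for a, b in zip(rater1, rater2)) / n    # observed agreement
categories = set(rater1) | set(rater2)
p_e = sum((rater1.count(c) / n) * (rater2.count(c) / n) for c in categories)  # chance agreement
kappa = (p_o - p_e) / (1 - p_e)
print(round(kappa, 2))   # 0.58 for this made-up data
```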
When do you use the Kuder-Richardson 20?
Kuder-Richardson 20 (KR-20) can be used to assess a test's internal consistency reliability when test items are dichotomously scored (e.g., as correct or incorrect); it is an alternative to Cronbach's alpha.
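A sketch of the KR-20 calculation for dichotomously scored items, where k is the number of items and p and q are the proportions passing and failing each item; the response matrix is hypothetical:

```python
import numpy as np

# Hypothetical 0/1 item responses: rows = examinees, columns = items.
X = np.array([[1, 1, 0, 1, 0],
              [1, 0, 0, 1, 1],
              [0, 0, 1, 0, 0],
              [1, 1, 1, 1, 1],
              [0, 1, 0, 0, 1]])

k = X.shape[1]
p = X.mean(axis=0)                  # proportion answering each item correctly
q = 1 - p
total_score_var = X.sum(axis=1).var()
kr20 = (k / (k - 1)) * (1 - (p * q).sum() / total_score_var)
print(round(kr20, 2))
```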
What is test reliability?
extent to which a test provides consistent info
r = reliability coefficient (a correlation coefficient)
-ranges from 0 to 1.0
-interpreted as the amount of variability in test scores that's due to true score variability
-do NOT square this, interpret as is
formula to calculate standard error of measurement
SEM = (SD)(square root of 1 - r)
where r = reliability coefficient
how to construct CI for 68%, 95% and 99%
from the person’s score add/subtract 1 SEM for 68% CI, 2 SEM for 95% CI, and 3 SEM for 99% CI
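A sketch of the SEM and confidence-interval arithmetic using the 1/2/3-SEM convention from these cards; the SD, reliability, and obtained score are hypothetical:

```python
import math

sd, r = 15, 0.91          # hypothetical test SD and reliability coefficient
score = 110               # hypothetical obtained score

sem = sd * math.sqrt(1 - r)    # standard error of measurement = 4.5 here
for n_sem, level in [(1, 68), (2, 95), (3, 99)]:
    print(f"{level}% CI: {score - n_sem * sem:.1f} to {score + n_sem * sem:.1f}")
```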
what does squaring a correlation coefficient tell you?
a correlation coefficient can only be squared when it represents the correlation between two different measures (e.g., a validity coefficient); a reliability coefficient is interpreted directly, without squaring
when squared, it provides a measure of shared variability, i.e., the variance in one measure that is "accounted for by" or "explained by" the other
What does cronbach’s alpha measure?
internal consistency reliability
What is the problem with split-half reliability?
for split-half reliability you split the test in half, administer it, and then look at the correlation between the two halves
the problem is that shorter tests are less reliable than longer tests, so the split-half reliability coefficient underestimates the full test's true reliability
this is corrected with the Spearman-Brown prophecy formula
what is percent agreement?
used to assess inter-rater reliability for 2 or more raters, does not take chance agreement into account and can overestimate reliability
Cohen's kappa is preferred because it is corrected for chance agreement
What are factors that affect the reliability coefficient?
content homogeneity- leads to larger reliability coefficients
range of scores- reliability coefficients are larger when the range of test scores is unrestricted
guessing- the easier it is to guess the correct answer to the items, the lower the reliability coefficient
What is item analysis used for in test construction?
to determine which items to include based on difficulty level and ability to discriminate between examinees who obtain high and low scores
how is item difficulty determined
for dichotomous items, item difficulty (p) is the proportion of examinees who answered the item correctly; it ranges from 0 to 1.0, with smaller values indicating more difficult items
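A quick sketch of item difficulty for dichotomously scored items (the response matrix is hypothetical):

```python
import numpy as np

# Rows = examinees, columns = items; 1 = correct, 0 = incorrect.
responses = np.array([[1, 1, 0],
                      [1, 0, 0],
                      [1, 1, 1],
                      [0, 1, 0]])

p = responses.mean(axis=0)   # item difficulty: proportion answering each item correctly
print(p)                     # smaller p = more difficult item
```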
What is item response theory?
an alternative to classical test theory. CTT is test based, IRT is item based.
overcomes limitations of CTT: better suited for developing computerized adaptive tests
What is an item characteristic curve and what does it tell you?
describes the relationship between each item and the latent trait being measured by the test
x-axis = level of the ability/latent trait (often estimated from total test scores)
y-axis = probability of answering the item correctly
location of the curve = difficulty parameter (b); curves shifted toward the left side of the graph belong to easier items (answered correctly at lower trait levels), curves shifted toward the right belong to more difficult items
slope of the curve = discrimination parameter (a); how well the item discriminates between individuals with high and low levels of the trait; steeper slope = better discrimination
point where the curve crosses the y-axis (the lower asymptote) = guessing parameter (c), the probability of answering correctly by guessing; the closer to 0, the more difficult the item is to guess (see the 3PL sketch below)
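A sketch of a three-parameter logistic (3PL) item characteristic curve, where b is the difficulty (location), a is the discrimination (slope), and c is the guessing parameter (lower asymptote); the parameter values are illustrative:

```python
import numpy as np

def icc_3pl(theta, a, b, c):
    """Probability of a correct response at trait level theta under the 3PL model."""
    return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

theta = np.linspace(-3, 3, 7)                              # trait levels
print(np.round(icc_3pl(theta, a=1.5, b=0.0, c=0.20), 2))   # easier items have lower b (curve shifted left)
```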
What is content validity?
items of the test are a clearly representative sample of the domain being tested
What is construct validity?
important for tests designed to measure a hypothetical trait that cannot be directly observed but is inferred from behavior
includes convergent and divergent (discriminant) validity
convergent- degree to which scores on test have high correlation with scores on other measures designed to assess the same or related construct
divergent- degree to which test scores have low correlations with measures of other unrelated constructs
What is multitrait-multimethod matrix used for?
provides info about a test's reliability and its convergent and divergent validity
the test and 3 other measures are administered: 1) a test assessing the same trait with a different method, 2) a test of an unrelated trait using the same method, 3) a test of an unrelated trait using a different method
correlate all pairs of test scores and interpret the resulting matrix of correlations
how do you interpret the correlations from a multitrait-multimethod matrix
monotrait-monomethod- this is the reliability coefficient (e.g., coefficient alpha), obtained by correlating the test with itself
monotrait-heteromethod- correlation between the new test and the test that measures the same trait with a different method; when this coefficient is large, it provides evidence of convergent validity
heterotrait-monomethod- correlation between the new test and the test of a different trait using the same method; a small correlation demonstrates divergent validity
heterotrait-heteromethod- correlation between the new test and the test that assesses an unrelated trait with a different method; a small correlation is evidence of divergent validity
What is factor analysis used for?
to assess a test’s convergent and divergent validity
administer the test being developed along with tests of similar and unrelated traits, correlate all pairs of scores, put the correlations in a correlation matrix, derive a factor matrix, rotate the factor matrix, then name and interpret the factors
the factor matrix has to be rotated so it can be more easily interpreted
how do you interpret the factor loadings of a factor analysis?
factor loadings are correlation coefficients between each test and each factor identified by the statistical procedure
square each loading to determine how much variability in the test is explained by variability in the factor, and look at which factor each test loads on; a factor loading of .80 for Test A on Factor I means that .64 (64%) of the variance in Test A is accounted for by Factor I
communality column = amount of variability in each test that is explained by all the identified factors; calculate it by squaring each of the test's factor loadings and adding them: loadings of .80 and .10 give .64 + .01 = .65, so 65% of the variability in Test A scores is explained by Factors I and II (see the sketch below)
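The squaring-and-summing arithmetic from the example above, written out with the card's .80 and .10 loadings:

```python
loadings = {"Factor I": 0.80, "Factor II": 0.10}   # Test A's factor loadings

explained = {factor: round(l ** 2, 2) for factor, l in loadings.items()}
communality = sum(explained.values())

print(explained)                 # {'Factor I': 0.64, 'Factor II': 0.01}
print(round(communality, 2))     # 0.65 -> 65% of Test A's variance explained by both factors
```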
What is criterion-related validity and what are concurrent and predictive validity?
criterion-related validity is important for tests that will be used to predict or estimate scores on another measure (the criterion); e.g., predictor = test of job knowledge, criterion = measure of job performance
concurrent validity = scores on both the predictor and criterion are obtained at about the same time; used when the predictor will be used to estimate current status on the criterion
predictive validity = used when the predictor will be used to predict future performance on the criterion, e.g., a test used to predict future job performance if the applicant is hired
interpreting criterion-related validity coefficient
-ranges from -1 to +1
-the closer to +/-1, the more accurately predictor scores predict criterion scores
-squaring the correlation coefficient tells you amount of variability shared by the two measures
What are cross-validation and shrinkage?
the initial correlation coefficient between the predictor and criterion is likely to overestimate the true correlation
when the test is cross-validated (validated on a new sample), the correlation coefficient is likely to be smaller because the same chance factors that inflated the original coefficient are unlikely to be present in the new sample = shrinkage
shrinkage is greatest when the initial validation sample is small and, for a multiple correlation, when the number of predictors is large
Interpreting standard error of the estimate
can use the SEE to calculate a CI around a person's predicted criterion score (just like the CI constructed with the SEM). SEE = (SD of criterion scores)(square root of 1 - the criterion-related validity coefficient squared)
SEE ranges from 0 to the size of the SD
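A sketch of the SEE formula and a confidence interval around a predicted criterion score; the SD, validity coefficient, and predicted score are hypothetical:

```python
import math

sd_criterion = 10    # hypothetical SD of criterion scores
r_xy = 0.60          # hypothetical criterion-related validity coefficient
predicted = 75       # hypothetical predicted criterion score

see = sd_criterion * math.sqrt(1 - r_xy ** 2)    # standard error of estimate = 8.0 here
print(f"68% CI: {predicted - see:.1f} to {predicted + see:.1f}")
```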
What is incremental validity and how do you calculate it?
the increase in the accuracy of predictions about criterion performance that occurs when the new predictor is added to the existing selection procedure
conduct a criterion-related validity study to see how many more accurate predictions are made using the new predictor compared to the old procedure alone
how do you perform a criterion-related validity study?
administer the new predictor along with the old measure used to make hiring decisions; 3 months later obtain criterion scores, set cutoff scores for the predictor and criterion, and count how many employees fall into each category:
true positives- high scores on predictor and criterion
false positives- high scores on predictor and low scores on criterion
true negatives- low scores on predictor and criterion
false negatives- low score on predictor, high score on criterion
calculate incremental validity by subtracting the base rate from the positive hit rate (see the sketch below):
positive hit rate (proportion of employees with high predictor scores who also have high criterion scores) MINUS base rate (number of employees with high criterion scores divided by the total number of employees)
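A sketch of the hit-rate arithmetic with hypothetical counts for the four cells:

```python
# Hypothetical counts from a criterion-related validity study.
TP, FP, TN, FN = 30, 10, 40, 20    # true/false positives, true/false negatives
total = TP + FP + TN + FN

positive_hit_rate = TP / (TP + FP)    # of those above the predictor cutoff,
                                      # the proportion with high criterion scores
base_rate = (TP + FN) / total         # proportion of all employees with high criterion scores

incremental_validity = positive_hit_rate - base_rate
print(round(incremental_validity, 2))   # 0.75 - 0.50 = 0.25
```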
how does changing the predictor cutoff score in a criterion-related validity study affect the number of true and false positives and negatives?
raising the predictor cutoff score will result in fewer people being hired, fewer true and false positives, and more true and false negatives
lowering the cutoff score will result in more people being hired, more true/false positives, and fewer true/false negatives
What is diagnostic efficiency?
aka diagnostic validity or diagnostic accuracy; the ability of a test to accurately distinguish between people who do and do not have a disorder
What are sensitivity and specificity? Hit rate?
sensitivity- proportion of people with the disorder who are correctly identified as having the disorder: TP/(TP + FN)
specificity- proportion of people without the disorder who are correctly identified as not having the disorder: TN/(TN + FP)
hit rate- overall correct classification rate, the proportion of all people correctly classified by the test: (TP + TN)/total (see the sketch below)
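The same four cells turned into sensitivity, specificity, and hit rate; the counts are hypothetical:

```python
TP, FP, TN, FN = 40, 15, 120, 10    # hypothetical classification counts

sensitivity = TP / (TP + FN)                  # correct among those with the disorder
specificity = TN / (TN + FP)                  # correct among those without the disorder
hit_rate = (TP + TN) / (TP + FP + TN + FN)    # overall proportion correctly classified

print(round(sensitivity, 2), round(specificity, 2), round(hit_rate, 2))
```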
what are positive predictive value and negative predictive value?
positive predictive value = probability that a person who tests positive for the disorder actually has the disorder: TP/(TP + FP)
negative predictive value = probability that a person who tests negative for the disorder actually does not have the disorder: TN/(TN + FN)
sensitivity and specificity do not vary from setting to setting, but positive and negative predictive values depend on the prevalence of the disorder in each setting (see the sketch below)
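A sketch showing how PPV and NPV shift with prevalence while sensitivity and specificity stay fixed; the sensitivity, specificity, and prevalence values are made up:

```python
def predictive_values(sensitivity, specificity, prevalence):
    """PPV and NPV for a given disorder prevalence (Bayes' rule)."""
    tp = sensitivity * prevalence
    fp = (1 - specificity) * (1 - prevalence)
    tn = specificity * (1 - prevalence)
    fn = (1 - sensitivity) * prevalence
    return tp / (tp + fp), tn / (tn + fn)

for prev in (0.05, 0.50):
    ppv, npv = predictive_values(0.90, 0.80, prev)
    print(f"prevalence {prev:.0%}: PPV = {ppv:.2f}, NPV = {npv:.2f}")
```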
relationship between reliability and validity
a predictor's reliability places a ceiling on its validity: the criterion-related validity coefficient can be no greater than the predictor's reliability index (the square root of the predictor's reliability coefficient)
if the reliability coefficient = .81, then the criterion-related validity coefficient can be no greater than the square root of .81 = .90
What are the SD equivalents for percentile ranks
2 %ile = -2SD
16%ile = -1SD
50%ile= 0 SD
84%ile = 1 SD
98%ile = 2 SD
T-score = mean of 50, SD of 10
z-score = mean of 0, SD of 1
full scale IQ score = mean of 100, SD of 15
stanines= mean of 5, SD of 2
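A sketch converting a z-score into the other standard scores listed above; scipy's normal CDF gives the percentile, and the stanine here is just the linear mean-5/SD-2 conversion from the card, clipped to the 1-9 range:

```python
from scipy.stats import norm

z = 1.0                                      # hypothetical z-score

t_score = 50 + 10 * z                        # T-score: mean 50, SD 10
iq = 100 + 15 * z                            # full scale IQ: mean 100, SD 15
stanine = min(9, max(1, round(5 + 2 * z)))   # stanine: mean 5, SD 2
percentile = norm.cdf(z) * 100               # ~84th percentile for z = +1

print(t_score, iq, stanine, round(percentile))
```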