Test Construction Flashcards
psychological test
an objective and standardized measure of a sample of behavior
standardization
uniformity of procedure in administering and scoring the test;
test conditions and scoring procedures should be the same for all examinees
norms
the scores of a representative sample of the population on a particular test;
interpretation of most psychological tests involves comparing an individual’s test score to norms
conceptual points about norms
1) norms are obtained from a sample that is truly representative of the population for which the test is designed;
2) to be truly representative, a sample must be reasonably large;
3) examinee’s score should be compared to the scores obtained by a representative sample of the population to which he or she belongs;
4) norm-referenced scores indicate an examinee’s standing on a test as compared to other persons, which permits comparison of an individual’s performance on different tests;
5) don’t provide a universal standard of “good” or “bad” performance - represent the performance of persons in the standardization sample
objective
administration, scoring, and interpretation of scores are “independent of the subjective judgment of the particular examiner”;
the examinee will obtain the same score regardless of who administers or scores the test
sample of behavior
a test measures a sample of the behavior domain in question rather than the entire domain
reliability
yields repeatable, dependable, and consistent results;
yields scores that closely reflect examinees’ true scores on whatever attribute it measures
validity
measures what it purports to measure
maximum performance
tells us about an examinee’s best possible performance, or what a person can do;
achievement and aptitude tests
typical performance
tell us what an examinee usually does or feels;
interest and personality tests
pure speed (speeded) test
the examinee’s response rate is assessed;
have time limits and consist of items that all (or almost all) examinees would answer correctly if given enough time
power test
assesses the level of difficulty a person can attain;
no time limit or a time limit that permits most or all examinees to attempt all items;
items are arranged in order from least difficult to most difficult
mastery tests
designed to determine whether a person can attain a pre-established level of acceptable performance;
“all or none” score (e.g., pass/fail);
commonly employed to test basic skills (e.g., basic reading, basic math) at the elementary school level
ipsative measure
the individual examinee (as opposed to a norm group or external criterion) is the frame of reference in score reporting;
scores are reported in terms of the relative strength of attributes within the individual examinee;
scores reflect which needs are strongest or weakest within the examinee, rather than as compared to a norm group;
examinees express a preference for one item over others rather than responding to each item individually (e.g., they are required to choose which of two statements appeals to them most)
normative measures
provide a measure of the absolute strength of each attribute measured by the test;
examinees answer every item;
score can be compared to those of other examinees
classical test theory
a given examinee’s obtained test score consists of two components: truth and error
true score
reflects the examinee’s actual status on whatever attribute is being measured by the test
error (measurement error)
factors that are irrelevant to whatever is being measured; random;
does not affect all examinees in the same way
reliability coefficient
a correlation coefficient that ranges in value from 0.0 to +1.0;
indicates the proportion of variability that is true score variability;
0.0 - test is completely unreliable; observed variability (differences) in test scores due entirely to random factors;
1.0 - perfect reliability; no error - all observed variability reflects true variability;
.90 - 90% of observed variability in obtained test scores due to true score differences among examinees and the remaining 10% of observed variability represents measurement error;
unlike other correlation coefficients, it is not squared for interpretation; the coefficient itself directly indicates the proportion of true score variability
test-retest reliability coefficient (“coefficient of stability”)
administering the same test to the same group of people, and then correlating scores on the first and second administrations
“time sampling”
factors related to time that are sources of measurement error for the test-retest coefficient;
from one administration to the next, there may be changes in exam conditions (noises, weather) or factors such as illness, fatigue, worry, etc.
practice effects
doing better the second time around due to practice
drawbacks of test-retest reliability coefficient
examinees systematically tend to remember their previous responses;
not appropriate for assessing the reliability of tests that measure unstable attributes (mood);
recommended only for tests that are not appreciably affected by repetition; very few psychological tests fall into this category
alternate forms (equivalent forms or parallel forms) reliability coefficient
administering two equivalent forms of a test to the same group of examinees, and then obtaining the correlation between the two sets of scores
drawbacks of alternate forms reliability coefficient
tends to be lower than the test-retest reliability coefficient;
sources of measurement error: differences in content between the 2 forms (some do better on Form A, others do better on Form B) and passage of time, since the two forms cannot be administered at the same time;
impractical and costly to construct two versions of the same test;
should not be used to assess the reliability of a test that measures an unstable trait
internal consistency
obtaining correlations among individual items;
split-half reliability, Cronbach’s coefficient alpha, Kuder-Richardson Formula 20;
administer the test once to a single group of examinees
split-half reliability
dividing the test in two and obtaining a correlation between the halves as if they were two shorter tests
Spearman-Brown formula
estimates the effect that shortening (or lengthening) a test will have on the reliability coefficient
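Illustration: a minimal Python sketch of the Spearman-Brown prophecy formula; the function name and the example values (a split-half correlation of .70 corrected for doubling the length) are hypothetical, not from the cards.

```python
def spearman_brown(r_old, n):
    """Estimate reliability after changing test length.

    r_old: reliability of the original (shorter or longer) test
    n: factor by which the number of items changes (2 = doubled, 0.5 = halved)
    """
    return (n * r_old) / (1 + (n - 1) * r_old)

# Correcting a split-half correlation of .70 up to full-length reliability (n = 2):
print(round(spearman_brown(0.70, 2), 2))  # 0.82
```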
drawbacks of split-half reliability
correlation will vary depending on how the items are divided;
splitting the test in this manner artificially lowers the reliability coefficient since the longer a test, the more reliable it will be - so use Spearman-Brown formula
Kuder-Richardson Formula 20 (KR-20)
indicates the average degree of inter-item consistency;
used when the test items are dichotomously scored (right-wrong, yes/no)
coefficient alpha
indicates the average degree of inter-item consistency;
used for tests with multiple-scored items (“usually”, “sometimes”, “rarely”, “never”)
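Illustration: a minimal Python sketch of coefficient alpha computed from an item-score matrix; with dichotomously scored (0/1) items the same computation reduces to KR-20. The sample responses and function name are made up for illustration.

```python
def coefficient_alpha(items):
    """items: one inner list of item scores per examinee."""
    k = len(items[0])                       # number of items
    def var(xs):                            # population variance
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)
    item_vars = [var([row[i] for row in items]) for i in range(k)]
    total_var = var([sum(row) for row in items])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Four examinees' responses to three dichotomously scored items (KR-20 case):
scores = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
print(round(coefficient_alpha(scores), 2))  # 0.75
```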
pros and cons of internal consistency reliability
pros: good for assessing the reliability of tests that measure unstable traits or are affected by repeated administration;
cons: major source of measurement error is item heterogeneity; inappropriate for assessing the reliability of speed tests.
content sampling, or item heterogeneity
degree that items are different in terms of the content they sample
interscorer (or inter-rater) reliability
calculating a correlation coefficient between the scores of two different raters
kappa coefficient
measure of the agreement between two judges who each rate a set of objects using nominal scales
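Illustration: a minimal Python sketch of Cohen's kappa for two raters assigning the same objects to nominal categories; the behavior labels and ratings are invented for illustration.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Agreement between two raters, corrected for chance agreement."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: sum of the products of each category's marginal proportions.
    expected = sum((freq_a[c] / n) * (freq_b[c] / n)
                   for c in set(rater_a) | set(rater_b))
    return (observed - expected) / (1 - expected)

a = ["on-task", "on-task", "off-task", "on-task", "off-task"]
b = ["on-task", "off-task", "off-task", "on-task", "off-task"]
print(round(cohens_kappa(a, b), 2))  # 0.62
```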
mutually exclusive categories
a particular behavior clearly belongs to one and only one category
exhaustive categories
the categories cover all possible responses or behaviors
duration recording
rater records the elapsed time during which the target behavior or behaviors occur
frequency recording
observer keeps count of the number of times the target behavior occurs;
useful for recording behaviors of short duration and those where duration is not important
interval recording
observing a subject at a given interval and noting whether the subject is engaging or not engaging in the target behavior during that interval;
useful for behaviors that do not have a fixed beginning or end
continuous recording
recording all the behavior of the target subject during each observation session
standard error of measurement (σmeas)
indicates how much error an individual test score can be expected to have;
used to construct a confidence interval
confidence interval
the range within which an examinee’s true score is likely to fall, given his or her obtained score
SEM formula
σmeas = SDx √(1 − rxx)
σmeas = standard error of measurement
SDx = standard deviation of test scores
rxx = reliability coefficient
CI formulas
68% = obtained score ± (1)(σmeas);
95% = obtained score ± (1.96)(σmeas);
99% = obtained score ± (2.58)(σmeas)
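Illustration: a small worked Python sketch combining the σmeas formula and the confidence-interval multipliers above; the SD, reliability coefficient, and obtained score are invented values.

```python
import math

sd_x = 15          # standard deviation of test scores (illustrative)
r_xx = 0.91        # reliability coefficient (illustrative)
obtained = 110     # an examinee's obtained score (illustrative)

sem = sd_x * math.sqrt(1 - r_xx)                  # 15 * sqrt(.09) = 4.5
ci_95 = (obtained - 1.96 * sem, obtained + 1.96 * sem)

print(round(sem, 2))                              # 4.5
print(tuple(round(x, 1) for x in ci_95))          # (101.2, 118.8)
```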
factors affecting reliability
- short tests are less reliable than longer tests
- as the group taking a test becomes more homogeneous, the variability of the scores - and hence the reliability coefficient - decreases
- if test items are too difficult, most people will get low scores on the test; if items are too easy, most people will get high scores, decreasing score variability, resulting in a lower reliability coefficient
- the higher the probability that examinees can guess the correct answer to items, the lower the reliability coefficient
- for inter-item consistency measured by the KR-20 or coefficient alpha methods, reliability is increased as the items become more homogeneous
content validity
the extent to which the test items adequately and representatively sample the content area to be measured;
educational achievement tests, work samples, EPPP
assessment of content validity
judgment and agreement of subject matter experts;
high correlation with other tests that purport to sample the same content domain;
students who are known to have succeeded in learning a particular content domain do well on a test designed to sample that domain
face validity
appears valid to examinees who take it, personnel who administer it, and other technically untrained observers
criterion-related validity
useful for predicting an individual’s behavior in specified situations;
applied situations (select employees, college admissions, place students in special classes)
criterion-related validity coefficient
a correlation coefficient (Pearson r) is used to determine the correlation between the predictor and the criterion
criterion-related validity coefficient formula
rxy
“x” refers to the predictor
“y” refers to the criterion
validation
the procedures used to determine how valid a predictor is
concurrent validation
the predictor and the criterion data are collected at or at about the same time;
when a test is useful for estimating current behavior or status on a criterion, it is said to have high concurrent validity;
focus on current status on a criterion
predictive validation
scores on the predictor are collected first, and the criterion data are collected at some future point;
when a test is useful for predicting future behavior or status on a criterion, it is said to have high predictive validity;
designed to predict future status
standard error of estimate (or σest)
used to estimate the range in which a person’s actual criterion score is likely to fall, given the criterion score predicted for him or her by a predictor
standard error of estimate formula
σest = SDy √(1 − rxy²)
σest = standard error of estimate
SDy = standard deviation of criterion scores
rxy = validity coefficient
CI for standard error of estimate
68% = ± (1)(σest) of predicted criterion score;
95% = ± (1.96)(σest);
99% = ±(2.58)(σest )
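Illustration: a parallel Python sketch for the standard error of estimate; the criterion SD, validity coefficient, and predicted criterion score are hypothetical values.

```python
import math

sd_y = 10            # standard deviation of criterion scores (illustrative)
r_xy = 0.60          # validity coefficient (illustrative)
predicted_y = 75     # criterion score predicted from the predictor (illustrative)

see = sd_y * math.sqrt(1 - r_xy ** 2)             # 10 * sqrt(1 - .36) = 8.0
ci_68 = (predicted_y - see, predicted_y + see)

print(round(see, 1))                              # 8.0
print(tuple(round(x, 1) for x in ci_68))          # (67.0, 83.0)
```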
differences between standard error of estimate and standard error of measurement
1) SEM is related to the reliability coefficient; SEE is related to the validity coefficient
2) SEM used to estimate where an examinee’s true test score is likely to fall, given obtained score on that same test - no predictor measure is involved; SEE used to determine where an examinee’s actual criterion score is likely to fall, given the criterion score that was predicted by another measure - predictor is being used
criterion cutoff
the minimum level of criterion performance a person must meet or exceed to be considered successful
predictor cutoff score
if the examinee scores at or above the predictor cutoff score he or she is selected, but if the examinee scores below the predictor cutoff score, he or she is rejected
True Positives (or Valid Acceptances)
scored above the cutoff point on the predictor and turn out to be successful on the criterion;
predictor said they would be successful on the job and it was right
False Positives (or False Acceptances)
scored above the cutoff point on the predictor but did not turn out to be successful on the criterion;
the predictor wrongly indicated that they would be successful on the job
True Negatives (or Valid Rejections)
scored below the cutoff point on the predictor and turned out to be unsuccessful on the criterion;
predictor correctly indicated that they would be unsuccessful on the job
False Negatives (or Invalid Rejections)
scored below the cutoff point on the predictor but turned out to be successful on the criterion;
predictor incorrectly indicated that they would be unsuccessful on the job
“positive” and “negative” for predictor
“positive”: predictor says the person should be selected;
“negative”: predictor says the person should not be selected
“true” and “false” for predictor
where the person actually stands on the criterion;
“true”: predictor classified the person into the correct criterion group;
“false”: predictor made an incorrect classification
predictor’s functional utility
determine the increase in the proportion of correct hiring decisions that would result from using the predictor as a selection tool, relative to when it is not used
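Illustration: a Python sketch that tallies the four selection outcomes above and compares the base rate (everyone hired, no predictor) with the hit rate among those the predictor selects; the applicant data and cutoffs are invented for illustration.

```python
# Each tuple: (predictor score, criterion score) for one applicant (hypothetical data).
applicants = [(82, 7), (75, 4), (60, 6), (55, 3), (90, 8), (48, 5), (70, 2), (85, 9)]
predictor_cutoff = 65     # select applicants scoring at or above this value
criterion_cutoff = 5      # "successful" means a criterion score at or above this value

outcomes = {"true_pos": 0, "false_pos": 0, "true_neg": 0, "false_neg": 0}
for pred, crit in applicants:
    selected = pred >= predictor_cutoff
    successful = crit >= criterion_cutoff
    if selected and successful:
        outcomes["true_pos"] += 1
    elif selected and not successful:
        outcomes["false_pos"] += 1
    elif not selected and not successful:
        outcomes["true_neg"] += 1
    else:
        outcomes["false_neg"] += 1

# Base rate: proportion successful if everyone were hired (predictor not used).
base_rate = sum(crit >= criterion_cutoff for _, crit in applicants) / len(applicants)
# Hit rate among those the predictor says to select.
selected_n = outcomes["true_pos"] + outcomes["false_pos"]
hit_rate = outcomes["true_pos"] / selected_n

print(outcomes)
print(round(base_rate, 2), round(hit_rate, 2))
```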
Factors Affecting the Validity Coefficient
1) Heterogeneity of Examinees: the validity coefficient is lowered if there is a restricted range of scores on either the predictor or the criterion; the more homogeneous the validation group, the lower the validity coefficient
2) Reliability of Predictor and Criterion: for a predictor to be valid, both the predictor and the criterion must be reliable - an unreliable test will always be invalid, but a reliable test will not always be valid
3) Moderator Variables: the criterion-related validity of a test may vary among subgroups within a population as a function of moderator variables
4) Cross-Validation: after a test is validated, it is typically re-validated with a sample of individuals different from the original validation sample
moderator variables
variables that influence the relationship between two other variables
differential validity
the test is significantly more valid for one subgroup than for another
cross-validation
after a test is validated, it is typically re-validated with a sample of individuals different from the original validation sample
shrinkage
reduction that occurs in a criterion-related validity coefficient upon cross-validation;
occurs because the predictor is “tailor-made” for the original validation sample and doesn’t fully generalize to other samples
when is shrinkage greatest
the original validation sample is small;
the original item pool is large;
the number of items retained is small relative to the number of items in the item pool;
items are not chosen based on a previously formulated hypothesis or experience with the criterion
criterion contamination
occurs when, in the process of validating a test, knowledge of examinees’ predictor scores influences their criterion scores;
artificially inflates the validity coefficient - it makes the predictor look more valid than it actually is
construct
a psychological variable that is abstract
construct validity
measures a theoretical construct or trait
convergent validity
requires that different ways of measuring the same trait yield similar results (WISC, WJ);
tests that measure the same trait have a high correlation, even when they use different methods
discriminant (divergent) validity
low correlation with another test that measures a different construct;
two tests that measure different traits have a low correlation, even when they use the same method
multitrait-multimethod matrix
assessment of two or more traits by two or more methods (self-report inventory, peer ratings, projective test)
monotrait-monomethod coefficients
indicate the correlation between the measure and itself and are therefore reliability coefficients
monotrait-heteromethod coefficients
correlations between two measures that assess the same (mono) trait using different (hetero) methods;
if a test has convergent validity, this correlation should be high
heterotrait-monomethod coefficients
correlations between two measures that measure different (hetero) traits using the same (mono) method;
if a test has discriminant validity, this coefficient should be low
heterotrait-heteromethod coefficients
correlations between two measures that measure different (hetero) traits using different (hetero) methods;
if a test has discriminant validity, this correlation should be low
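Illustration: a small Python sketch showing where the four coefficient types above sit in a 2-trait by 2-method matrix; the traits (anxiety, depression), methods (self-report, peer rating), and the pairing logic are hypothetical examples, not from the cards.

```python
# Each measure is a (trait, method) pair; the matrix correlates every pair of measures.
measures = [("anxiety", "self-report"), ("anxiety", "peer rating"),
            ("depression", "self-report"), ("depression", "peer rating")]

def coefficient_type(m1, m2):
    same_trait = m1[0] == m2[0]
    same_method = m1[1] == m2[1]
    if m1 == m2:
        return "monotrait-monomethod (reliability)"
    if same_trait and not same_method:
        return "monotrait-heteromethod (convergent validity: should be HIGH)"
    if not same_trait and same_method:
        return "heterotrait-monomethod (discriminant validity: should be LOW)"
    return "heterotrait-heteromethod (discriminant validity: should be LOWEST)"

for i, m1 in enumerate(measures):
    for m2 in measures[i:]:
        print(m1, m2, "->", coefficient_type(m1, m2))
```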
factor analysis
reducing a set of many variables (e.g., tests) to fewer variables to assess construct validity of a test;
detect structure in several variables;
can allow you to start with a large number of variables and classify them into sets
underlying constructs
constructs that the tests in the analysis are not directly intended to measure but that underlie performance on them;
(AKA latent variables)
factor loading
the correlation between a given test and a given factor;
range from +1 to -1;
can be squared to determine the proportion of variability in the test accounted for by the factor
communality (h2)
the proportion of a test’s variance that is attributable to the factors;
part of true variability shared with other tests
common variance
variance that a test shares with the other tests included in the analysis (i.e., the factors also account for variance in those other tests)
unique variance (u2)
variance specific to the test and not explained by the factors;
part of true variability unique to the test itself
explained variance, or eigenvalues
measure of the amount of variance in all the tests accounted for by the factor
things you should know about eigenvalues
1) factors will be ordered in terms of the size of their eigenvalue - Factor I larger than Factor II, which is larger than Factor III, etc. Factor I will explain more of “what’s going on” in the tests than Factor II;
2) sum of the eigenvalues can be no larger than the number of tests included in the analysis
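Illustration: a minimal Python sketch, using a made-up loading matrix for three tests and two factors, of how squared loadings roll up into communalities (row sums) and eigenvalues (column sums); the test names and loadings are hypothetical.

```python
# Hypothetical factor loadings: rows = tests, columns = Factor I and Factor II.
loadings = {
    "vocabulary": [0.80, 0.10],
    "analogies":  [0.70, 0.20],
    "digit span": [0.20, 0.60],
}

# Communality (h2): proportion of a test's variance explained by all factors together.
communalities = {test: sum(l ** 2 for l in row) for test, row in loadings.items()}

# Eigenvalue: variance across all tests accounted for by a single factor.
eigenvalues = [sum(row[f] ** 2 for row in loadings.values()) for f in range(2)]

print(communalities)   # e.g., vocabulary: .80^2 + .10^2 = .65
print(eigenvalues)     # Factor I > Factor II; their sum cannot exceed the number of tests (3)
```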
rotation
procedure that facilitates interpretation of a factor matrix;
re-dividing the test’s communalities so that a clearer pattern of loadings emerges
orthogonal
factors that are independent of each other (uncorrelated)
oblique
factors that are correlated with each other to some degree
factorial validity
when a test correlates highly with a factor it would be expected to correlate with
differences between principal components and factor analysis
1) terminology: “factor” in factor analysis is usually referred to as a principal component or an eigenvector in principal components analysis
2) in principal components analysis variance has 2 elements: explained variance and error variance; in factor analysis, the variance has 3 elements: communality, specificity, and error
3) in principal components analysis, the factors (or components, or eigenvectors) are always uncorrelated
cluster analysis
place objects into categories;
develop a taxonomy or classification system
differences between cluster analysis and factor analysis
1) only variables that are measured using interval or ratio data can be used in a factor analysis; variables measured using any type of data can be included in a cluster analysis
2) factors in factor analysis are usually interpreted as underlying traits or constructs measured by the variables in the analysis; clusters in cluster analysis are just categories, and not necessarily traits
3) cluster analysis used in studies where there is an a priori hypothesis regarding what categories the objects will cluster into; factor analysis used to test a hypothesis regarding what traits a set of variables measures
relationship between reliability and validity
a test is reliable if it measures “something,” and a test is valid if that “something” is what the test developer claims it is;
for a test to be valid, it must be reliable;
the validity coefficient is less than or, at the most, equal to the square root of the reliability coefficient - it can’t be higher;
reliability places an upper limit on validity
correction for attenuation
the formula answers the following question: “What would the validity coefficient of my predictor be if both the predictor and the criterion were perfectly reliable?”;
what would happen to the validity coefficient if reliability (of both the predictor and the criterion) were higher
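Illustration: a quick Python sketch of the correction for attenuation and of the reliability “ceiling” on validity described above; the coefficients used are illustrative values.

```python
import math

r_xy = 0.40    # observed validity coefficient (illustrative)
r_xx = 0.80    # reliability of the predictor (illustrative)
r_yy = 0.50    # reliability of the criterion (illustrative)

# Correction for attenuation: estimated validity if both measures were perfectly reliable.
corrected = r_xy / math.sqrt(r_xx * r_yy)
print(round(corrected, 2))              # 0.63

# Reliability places an upper limit on validity: r_xy can be no larger than sqrt(r_xx).
print(r_xy <= math.sqrt(r_xx))          # True
```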
item analysis
used to determine which items will be retained for the final version of the test;
can be qualitative (e.g., evaluating item content) or quantitative (e.g., measuring item difficulty and item discrimination)
item difficulty index (“p”)
the proportion of examinees who answer the item correctly;
the higher the p value, the less difficult the item;
ideal items have p = ~.50
p values are on an ordinal scale only
item difficulty index for specific test types
gifted: .25;
mastery: .80 to .90;
true/false: .75;
multiple choice: .60
item discrimination
degree to which a test item differentiates among examinees in terms of the behavior that the test is designed to measure
item discrimination index (“D”)
the difference between the proportion of examinees in the upper-scoring group and the proportion in the lower-scoring group who answer the item correctly; ranges from -1.0 to +1.0, with larger positive values indicating better discrimination
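Illustration: a small Python sketch that computes the item difficulty index p and the discrimination index D (upper-group proportion correct minus lower-group proportion correct) for one item; the response data are fabricated.

```python
# 1 = correct, 0 = incorrect on a single item (hypothetical responses).
upper_group = [1, 1, 1, 1, 0]   # examinees with the highest total test scores
lower_group = [1, 0, 0, 0, 0]   # examinees with the lowest total test scores

all_responses = upper_group + lower_group
p = sum(all_responses) / len(all_responses)                                  # difficulty
d = sum(upper_group) / len(upper_group) - sum(lower_group) / len(lower_group)  # discrimination

print(round(p, 2))   # 0.5 -> near the ideal difficulty of ~.50
print(round(d, 2))   # 0.6 -> item discriminates well between high and low scorers
```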
item characteristic curves (ICCs)
graphs that depict each item in terms of how difficult the item was for individuals in different ability groups
item response theory assumptions about test items
1) performance on an item is related to the estimated amount of a latent trait being measured by the item; implies that the scores of individuals tested with different items can be directly compared to each other since all the items measure the same latent trait.
2) results of testing are sample free (“invariance of item parameters”) - an item should have the same parameters (difficulty and discrimination levels) across all random samples of a population so it can be used with any individual to provide an estimate of their ability
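Illustration: a Python sketch of a two-parameter logistic item characteristic curve, the kind of function IRT fits for each item; the parameter names and values here are illustrative assumptions, not taken from the cards.

```python
import math

def icc(theta, a, b):
    """Probability of a correct response given ability (theta).

    a: item discrimination parameter (slope of the curve)
    b: item difficulty parameter (ability level where P = .50)
    """
    return 1 / (1 + math.exp(-a * (theta - b)))

# An item of moderate difficulty (b = 0) and good discrimination (a = 1.5):
for theta in (-2, -1, 0, 1, 2):
    print(theta, round(icc(theta, a=1.5, b=0.0), 2))
```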
adaptive testing of ability
administering a set of items tailored to the examinee’s estimated level of ability
norm-referenced interpretation
comparing an examinee’s score to norms (scores of other examinees in a standardization sample);
indicates where the examinee stands in relation to others who have taken the test
developmental norms
indicate how far along the normal developmental path an individual has progressed
mental age (MA) score
comparing an examinee’s score to the average performance of others at different age levels
grade equivalent scores
computing the average raw score obtained by children in each grade;
for educational achievement tests
disadvantages of developmental norms
don’t permit comparisons of individuals at different age levels;
grade equivalent scores on different tests are not comparable
within-group norms
provide a comparison of the examinee’s score to those of the most nearly comparable standardization sample
percentile rank (PR)
the percentage of persons in the standardization sample who fall below a given raw score
pros and cons of percentile rank
pro: easy to understand and interpret;
con: represent ranks (ordinal data) and therefore do not allow interpretations in terms of absolute amount of difference between scores
standard scores
express a raw score’s distance from the mean in terms of standard deviation units;
tell us how many standard deviation units a person’s score is above or below the mean
pros of using standard scores
scores can be compared across different age groups;
allow for interpretation in terms of the absolute amount of differences between scores
Z-scores
directly indicates how many standard deviation units a score falls above or below the mean
T-scores
mean of 50 and a SD of 10;
a T-score of 60 falls 1 standard deviation above the mean
Stanine Scores
scores range from 1 to 9;
mean of 5 and a SD of 2
Deviation IQ scores
mean of 100 and a standard deviation of 15
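Illustration: a Python sketch converting one raw score into the standard scores defined above (z, T, stanine, deviation IQ); the raw score and norm-group mean and SD are made-up values.

```python
raw, mean, sd = 65, 50, 10      # hypothetical raw score and norm-group statistics

z = (raw - mean) / sd                         # z-score: SD units above/below the mean
t = 50 + 10 * z                               # T-score: mean 50, SD 10
iq = 100 + 15 * z                             # deviation IQ: mean 100, SD 15
stanine = max(1, min(9, round(5 + 2 * z)))    # stanine: mean 5, SD 2, limited to 1-9

print(z, t, iq, stanine)                      # 1.5 65.0 122.5 8
```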
differential prediction
a case where given scores on a predictor test predict different outcomes for different subgroups
single-group validity
a test is valid for one subgroup but not another subgroup
sensitivity of a test
the proportion of correctly identified cases;
the ratio of examinees whom the test correctly identifies as having the characteristic to the total number of examinees who actually possess the characteristic
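Illustration: a tiny Python sketch of sensitivity computed from hypothetical counts of true positives and false negatives.

```python
true_positives = 40    # test correctly identifies people who have the characteristic
false_negatives = 10   # test misses people who actually have the characteristic

sensitivity = true_positives / (true_positives + false_negatives)
print(sensitivity)     # 0.8
```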
triangulation
attempt to increase reliability by reducing systematic or method error through a strategy in which the researcher employs multiple methods of measurement (e.g., observation, survey, archival data)
calibration
attempt to increase reliability by increasing homogeneity of ratings through feedback to the raters, when multiple raters are used;
raters might meet during pretesting of the instrument to discuss items on which they have disagreed, seeking to reach consensus on rules for rating items (e.g., defining what constitutes a “2” on an item dealing with job performance)