Test Construction Flashcards
psychological test
an objective and standardized measure of a sample of behavior
standardization
uniformity of procedure in administering and scoring the test;
test conditions and scoring procedures should be the same for all examinees
norms
the scores of a representative sample of the population on a particular test;
interpretation of most psychological tests involves comparing an individual’s test score to norms
conceptual points about norms
1) norms are obtained from a sample that is truly representative of the population for which the test is designed;
2) to be truly representative, a sample must be reasonably large;
3) examinee’s score should be compared to the scores obtained by a representative sample of the population to which he or she belongs;
4) norm-referenced scores indicate an examinee’s standing on a test as compared to other persons, which permits comparison of an individual’s performance on different tests;
5) norms don’t provide a universal standard of “good” or “bad” performance; they represent only the performance of persons in the standardization sample
objective
administration, scoring, and interpretation of scores are “independent of the subjective judgment of the particular examiner”;
the examinee will obtain the same score regardless of who administers or scores the test
sample of behavior
the test measures a sample of the behavior in question rather than the entire behavioral domain
reliability
yields repeatable, dependable, and consistent results;
yields examinees’ true scores on whatever attribute it measures
validity
measures what it purports to measure
maximum performance
tells us about an examinee’s best possible performance, or what a person can do;
achievement and aptitude tests
typical performance
tells us what an examinee usually does or feels;
interest and personality tests
pure speed (speeded) test
the examinee’s response rate is assessed;
have time limits and consist of items that all (or almost all) examinees would answer correctly if given enough time
power test
assesses the level of difficulty a person can attain;
no time limit or a time limit that permits most or all examinees to attempt all items;
items are arranged in order from least difficult to most difficult
mastery tests
designed to determine whether a person can attain a pre-established level of acceptable performance;
“all or none” score (e.g., pass/fail);
commonly employed to test basic skills (e.g., basic reading, basic math) at the elementary school level
ipsative measure
the individual (as opposed to a norm group or external criterion) is the frame of reference in score reporting;
scores are reported in terms of the relative strength of attributes within the individual examinee;
scores reflect which needs are strongest or weakest within the examinee, rather than as compared to a norm group;
examinees express a preference for one item over others rather than responding to each item individually (e.g., they must choose which of two statements appeals to them most)
normative measures
provide a measure of the absolute strength of each attribute measured by the test;
examinees answer every item;
score can be compared to those of other examinees
classical test theory
a given examinee’s obtained test score consists of two components: a true score and measurement error
true score
reflects the examinee’s actual status on whatever attribute is being measured by the test
error (measurement error)
factors that are irrelevant to whatever is being measured; random;
does not affect all examinees in the same way
reliability coefficient
a correlation coefficient that ranges in value from 0.0 to +1.0;
indicates the proportion of variability that is true score variability;
0.0 - test is completely unreliable; observed variability (differences) in test scores due entirely to random factors;
1.0 - perfect reliability; no error - all observed variability reflects true variability;
.90 - 90% of observed variability in obtained test scores due to true score differences among examinees and the remaining 10% of observed variability represents measurement error;
unlike other correlation coefficients, it is never squared for interpretation; it is read directly as the proportion of true score variability
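To make the true score/error decomposition concrete, here is a minimal Python sketch (all values simulated for illustration; NumPy assumed, not from any actual test data):

import numpy as np

rng = np.random.default_rng(0)

# Classical test theory: observed score = true score + random error
true_scores = rng.normal(loc=100, scale=15, size=10_000)  # examinees' actual status
error = rng.normal(loc=0, scale=5, size=10_000)           # random, irrelevant factors
observed = true_scores + error

# Reliability = proportion of observed variance that is true score variance
reliability = true_scores.var() / observed.var()
print(round(reliability, 2))  # ~0.90: 90% true variance, 10% measurement error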
test-retest reliability coefficient (“coefficient of stability”)
administering the same test to the same group of people, and then correlating scores on the first and second administrations
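As a sketch, the coefficient is simply the Pearson correlation between the two administrations (the scores below are invented for illustration; NumPy assumed):

import numpy as np

first = np.array([85, 92, 78, 88, 95, 70, 82, 90])   # first administration
second = np.array([83, 94, 80, 85, 96, 72, 80, 91])  # same examinees on retest

# Test-retest reliability ("coefficient of stability") = Pearson r between the two
r_xx = np.corrcoef(first, second)[0, 1]
print(round(r_xx, 2))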
“time sampling”
factors related to time that are sources of measurement error for the test-retest coefficient;
from one administration to the next, there may be changes in exam conditions (noise, weather) or in examinee factors such as illness, fatigue, worry, etc.
practice effects
doing better the second time around due to practice
drawbacks of test-retest reliability coefficient
examinees tend to remember their previous responses, which can systematically inflate the coefficient;
not appropriate for assessing the reliability of tests that measure unstable attributes (e.g., mood);
recommended only for tests that are not appreciably affected by repetition, and very few psychological tests fall into this category
alternate forms (equivalent forms or parallel forms) reliability coefficient
administering two equivalent forms of a test to the same group of examinees, and then obtaining the correlation between the two sets of scores
drawbacks of alternate forms reliability coefficient
tends to be lower than the test-retest reliability coefficient;
sources of measurement error: differences in content between the two forms (some examinees do better on Form A, others on Form B) and the passage of time, since the two forms cannot be administered simultaneously;
impractical and costly to construct two versions of the same test;
should not be used to assess the reliability of a test that measures an unstable trait
internal consistency
obtaining correlations among individual items;
split-half reliability, Cronbach’s coefficient alpha, Kuder-Richardson Formula 20;
administer the test once to a single group of examinees
split-half reliability
dividing the test in two and obtaining a correlation between the halves as if they were two shorter tests
Spearman-Brown formula
estimates the effect that shortening (or lengthening) a test will have on the reliability coefficient
drawbacks of split-half reliability
correlation will vary depending on how the items are divided;
splitting the test in this manner artificially lowers the reliability coefficient, since the longer a test, the more reliable it is; the Spearman-Brown formula is used to correct for this (see the sketch below)
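A sketch of the correction, using the standard Spearman-Brown prophecy formula r' = n·r / (1 + (n − 1)·r), where n is the factor by which test length changes (n = 2 restores a split-half correlation to full length):

def spearman_brown(r, n=2.0):
    """Estimate reliability after changing test length by a factor of n."""
    return n * r / (1 + (n - 1) * r)

# A half-length correlation of .70 underestimates the full test's reliability;
# doubling the length (n = 2) yields the corrected estimate.
print(round(spearman_brown(0.70), 2))  # 0.82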
Kuder-Richardson Formula 20 (KR-20)
indicates the average degree of inter-item consistency;
used when the test items are dichotomously scored (right/wrong, yes/no)
coefficient alpha
indicates the average degree of inter-item consistency;
used for tests whose items have multiple possible scores (e.g., “usually”, “sometimes”, “rarely”, “never”)
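A minimal sketch of the computation from the standard alpha formula, α = (k / (k − 1)) × (1 − Σ item variances / total score variance); with dichotomous 0/1 items, alpha reduces to KR-20 (the data below are invented; NumPy assumed):

import numpy as np

# Rows = examinees, columns = items; 0/1 scoring, so alpha here equals KR-20
scores = np.array([
    [1, 1, 1, 0],
    [1, 0, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 1, 0, 0],
])

k = scores.shape[1]
sum_item_vars = scores.var(axis=0, ddof=1).sum()  # sum of the item variances
total_var = scores.sum(axis=1).var(ddof=1)        # variance of examinees' total scores
alpha = (k / (k - 1)) * (1 - sum_item_vars / total_var)
print(round(alpha, 2))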
pros and cons of internal consistency reliability
pros: good for assessing the reliability of tests that measure unstable traits or are affected by repeated administration;
cons: the major source of measurement error is item heterogeneity; inappropriate for assessing the reliability of speed tests
content sampling, or item heterogeneity
degree that items are different in terms of the content they sample
interscorer (or inter-rater) reliability
calculating a correlation coefficient between the scores of two different raters
kappa coefficient
measure of the agreement between two judges who each rate a set of objects using nominal scales
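A sketch of the computation, using the standard formula κ = (p_o − p_e) / (1 − p_e), where p_o is observed agreement and p_e is agreement expected by chance (the ratings below are invented):

from collections import Counter

rater_a = ["on-task", "off-task", "on-task", "on-task", "off-task", "on-task"]
rater_b = ["on-task", "off-task", "off-task", "on-task", "off-task", "on-task"]

n = len(rater_a)
p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n  # observed agreement

# Chance agreement: product of the raters' marginal proportions, summed over categories
counts_a, counts_b = Counter(rater_a), Counter(rater_b)
p_e = sum(counts_a[c] * counts_b[c] for c in counts_a) / n**2

kappa = (p_o - p_e) / (1 - p_e)
print(round(kappa, 2))  # 0.67 here: agreement well above chance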
mutually exclusive categories
a particular behavior clearly belongs to one and only one category
exhaustive categories
the categories cover all possible responses or behaviors
duration recording
rater records the elapsed time during which the target behavior or behaviors occur
frequency recording
observer keeps count of the number of times the target behavior occurs;
useful for recording behaviors of short duration and those where duration is not important
interval recording
observing a subject at a given interval and noting whether the subject is engaging or not engaging in the target behavior during that interval;
useful for behaviors that do not have a fixed beginning or end
continuous recording
recording all the behavior of the target subject during each observation session
standard error of measurement (σmeas)
indicates how much error an individual test score can be expected to have;
used to construct a confidence interval
confidence interval
the range within which an examinee’s true score is likely to fall, given his or her obtained score
SEM formula
σmeas = SDx × √(1 − rxx)
σmeas = standard error of measurement
SDx = standard deviation of test scores
rxx = reliability coefficient
CI formulas
obtained score ± 1.0 × σmeas = 68% CI;
obtained score ± 1.96 × σmeas = 95% CI;
obtained score ± 2.58 × σmeas = 99% CI
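A sketch tying the two formulas together (all values invented for illustration):

import math

sd_x = 15        # standard deviation of test scores
r_xx = 0.91      # reliability coefficient
obtained = 110   # an examinee's obtained score

sem = sd_x * math.sqrt(1 - r_xx)  # standard error of measurement = 4.5 here

# Confidence intervals around the obtained score
for z, level in [(1.0, 68), (1.96, 95), (2.58, 99)]:
    print(f"{level}% CI: {obtained - z * sem:.1f} to {obtained + z * sem:.1f}")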
factors affecting reliability
- short tests are less reliable than longer tests
- as the group taking a test becomes more homogeneous, the variability of the scores - and hence the reliability coefficient - decreases
- if test items are too difficult, most people will get low scores on the test; if items are too easy, most people will get high scores, decreasing score variability, resulting in a lower reliability coefficient
- the higher the probability that examinees can guess the correct answer to items, the lower the reliability coefficient
- for inter-item consistency measured by the KR-20 or coefficient alpha methods, reliability is increased as the items become more homogeneous
content validity
the extent to which the test items adequately and representatively sample the content area to be measured;
educational achievement tests, work samples, EPPP
assessment of content validity
judgment and agreement of subject matter experts;
high correlation with other tests that purport to sample the same content domain;
students who are known to have succeeded in learning a particular content domain do well on a test designed to sample that domain
face validity
appears valid to examinees who take it, personnel who administer it, and other technically untrained observers
criterion-related validity
useful for predicting an individual’s behavior in specified situations;
applied situations (select employees, college admissions, place students in special classes)
criterion-related validity coefficient
a correlation coefficient (typically a Pearson r) computed between predictor scores and criterion scores
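Sketch: the validity coefficient is the same Pearson r, here computed between a predictor (e.g., a selection test) and a criterion (e.g., later job performance); the numbers below are invented:

import numpy as np

predictor = np.array([55, 62, 48, 70, 66, 58])        # selection test scores
criterion = np.array([3.1, 3.8, 2.5, 4.4, 4.0, 3.2])  # later performance ratings

validity = np.corrcoef(predictor, criterion)[0, 1]
print(round(validity, 2))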