Test Construction Flashcards

1
Q

psychological test

A

an objective and standardized measure of a sample of behavior

2
Q

standardization

A

uniformity of procedure in administering and scoring the test;
test conditions and scoring procedures should be the same for all examinees

3
Q

norms

A

the scores of a representative sample of the population on a particular test;
interpretation of most psychological tests involves comparing an individual’s test score to norms

4
Q

conceptual points about norms

A

1) norms are obtained from a sample that is truly representative of the population for which the test is designed;
2) to be truly representative, a sample must be reasonably large;
3) examinee’s score should be compared to the scores obtained by a representative sample of the population to which he or she belongs;
4) norm-referenced scores indicate an examinee’s standing on a test as compared to other persons, which permits comparison of an individual’s performance on different tests;
5) norms don't provide a universal standard of "good" or "bad" performance - they represent the performance of persons in the standardization sample

5
Q

objective

A

administration, scoring, and interpretation of scores are “independent of the subjective judgment of the particular examiner”;
the examinee will obtain the same score regardless of who administers or scores the test

6
Q

sample of behavior

A

a test measures only a sample of the behavior domain in question, not the entire domain

7
Q

reliability

A

yields repeatable, dependable, and consistent results;
yields examinees' true scores on whatever attribute it measures

8
Q

validity

A

measures what it purports to measure

9
Q

maximum performance

A

tells us about an examinee’s best possible performance, or what a person can do;
achievement and aptitude tests

10
Q

typical performance

A

tells us what an examinee usually does or feels;
interest and personality tests

11
Q

pure speed (speeded) test

A

the examinee’s response rate is assessed;
have time limits and consist of items that all (or almost all) examinees would answer correctly if given enough time

12
Q

power test

A

assesses the level of difficulty a person can attain;
no time limit or a time limit that permits most or all examinees to attempt all items;
items are arranged in order from least difficult to most difficult

13
Q

mastery tests

A

designed to determine whether a person can attain a pre-established level of acceptable performance;
“all or none” score (e.g., pass/fail);
commonly employed to test basic skills (e.g., basic reading, basic math) at the elementary school level

14
Q

ipsative measure

A

the individual themselves (as opposed to a norm group or external criterion) is the frame of reference in score reporting;
scores are reported in terms of the relative strength of attributes within the individual examinee;
scores reflect which needs are strongest or weakest within the examinee, rather than as compared to a norm group;
examinees express a preference for one item over others, rather than responding to each item individually - e.g., the examinee is required to choose which of two statements appeals to them most

15
Q

normative measures

A

provide a measure of the absolute strength of each attribute measured by the test;
examinees answer every item;
score can be compared to those of other examinees

16
Q

classical test theory

A

a given examinee’s obtained test score consists of two components: truth and error

17
Q

true score

A

reflects the examinee’s actual status on whatever attribute is being measured by the test

18
Q

error (measurement error)

A

factors that are irrelevant to whatever is being measured; random;
does not affect all examinees in the same way

19
Q

reliability coefficient

A

a correlation coefficient that ranges in value from 0.0 to +1.0;
indicates the proportion of variability that is true score variability;
0.0 - test is completely unreliable; observed variability (differences) in test scores due entirely to random factors;
1.0 - perfect reliability; no error - all observed variability reflects true variability;
.90 - 90% of observed variability in obtained test scores due to true score differences among examinees and the remaining 10% of observed variability represents measurement error;
unlike other correlation coefficients, it is not squared to be interpreted - the coefficient itself directly indicates the proportion of true score variability

20
Q

test-retest reliability coefficient (“coefficient of stability”)

A

administering the same test to the same group of people, and then correlating scores on the first and second administrations
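
In code, the coefficient of stability is simply the Pearson correlation between the two administrations. A minimal sketch with hypothetical scores (numpy assumed available):

```python
import numpy as np

# Hypothetical scores for 6 examinees on two administrations of the same test
first = np.array([12, 18, 25, 31, 22, 27])
second = np.array([14, 17, 27, 30, 24, 26])

# Test-retest reliability = Pearson r between the two sets of scores
r_tt = np.corrcoef(first, second)[0, 1]
print(f"coefficient of stability = {r_tt:.2f}")
```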

21
Q

“time sampling”

A

factors related to time that are sources of measurement error for the test-retest coefficient;
from one administration to the next, there may be changes in exam conditions (noises, weather) or factors such as illness, fatigue, worry, etc.

22
Q

practice effects

A

doing better the second time around due to practice

23
Q

drawbacks of test-retest reliability coefficient

A

examinees systematically tend to remember their previous responses;
not appropriate for assessing the reliability of tests that measure unstable attributes (mood);
recommended only for tests that are not appreciably affected by repetition - very few psychological tests fall into this category

24
Q

alternate forms (equivalent forms or parallel forms) reliability coefficient

A

administering two equivalent forms of a test to the same group of examinees, and then obtaining the correlation between the two sets of scores

25
Q

drawbacks of alternate forms reliability coefficient

A

tends to be lower than the test-retest reliability coefficient;
sources of measurement error: differences in content between the 2 forms (some do better on Form A, others do better on Form B) and passage of time, since the two forms cannot be administered at the same time;
impractical and costly to construct two versions of the same test;
should not be used to assess the reliability of a test that measures an unstable trait

26
Q

internal consistency

A

obtaining correlations among individual items;
split-half reliability, Cronbach’s coefficient alpha, Kuder-Richardson Formula 20;
administer the test once to a single group of examinees

27
Q

split-half reliability

A

dividing the test in two and obtaining a correlation between the halves as if they were two shorter tests

28
Q

Spearman-Brown formula

A

estimates the effect that shortening (or lengthening) a test will have on the reliability coefficient
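
The formula itself is standard: r_new = k * r / (1 + (k - 1) * r), where k is the factor by which test length changes. A minimal sketch:

```python
def spearman_brown(r, k):
    """Predicted reliability when test length is multiplied by k
    (k = 2 doubles the test; k = 0.5 halves it)."""
    return k * r / (1 + (k - 1) * r)

# Correcting a hypothetical split-half correlation of .60 to full length (k = 2)
print(spearman_brown(0.60, 2))  # 0.75
```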

29
Q

drawbacks of split-half reliability

A

correlation will vary depending on how the items are divided;
splitting the test in this manner artificially lowers the reliability coefficient since the longer a test, the more reliable it will be - so the Spearman-Brown formula is used to correct the split-half coefficient

30
Q

Kuder-Richardson Formula 20 (KR-20)

A

indicates the average degree of inter-item consistency;
used when the test items are dichotomously scored (right-wrong, yes/no)

31
Q

coefficient alpha

A

indicates the average degree of inter-item consistency;
used for tests with multiple-scored items (“usually”, “sometimes”, “rarely”, “never”)
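
KR-20 and coefficient alpha reduce to the same computation: alpha = (k / (k - 1)) * (1 - sum of item variances / total score variance), with KR-20 as the special case for 0/1 items. A minimal sketch with hypothetical data:

```python
import numpy as np

def coefficient_alpha(items):
    """Coefficient alpha for an examinees-by-items score matrix;
    equals KR-20 when items are dichotomously (0/1) scored."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]                          # number of items
    item_vars = items.var(axis=0, ddof=1)       # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)   # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical 4-item multiple-scored data for 5 examinees
scores = [[3, 4, 3, 4],
          [2, 2, 3, 2],
          [4, 4, 4, 3],
          [1, 2, 1, 2],
          [3, 3, 4, 4]]
print(f"alpha = {coefficient_alpha(scores):.2f}")
```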

32
Q

pros and cons of internal consistency reliability

A

pros: good for assessing the reliability of tests that measure unstable traits or are affected by repeated administration;

cons: major source of measurement error is item heterogeneity; inappropriate for assessing the reliability of speed tests.

33
Q

content sampling, or item heterogeneity

A

the degree to which items differ in terms of the content they sample

34
Q

interscorer (or inter-rater) reliability

A

calculating a correlation coefficient between the scores of two different raters

35
Q

kappa coefficient

A

measure of the agreement between two judges who each rate a set of objects using nominal scales
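
The usual computation is kappa = (po - pe) / (1 - pe), where po is the observed proportion of agreement and pe is the agreement expected by chance from each rater's marginal totals. A minimal sketch with hypothetical ratings:

```python
from collections import Counter

def cohen_kappa(rater1, rater2):
    """Chance-corrected agreement between two raters using nominal categories."""
    n = len(rater1)
    po = sum(a == b for a, b in zip(rater1, rater2)) / n  # observed agreement
    c1, c2 = Counter(rater1), Counter(rater2)
    pe = sum(c1[cat] * c2[cat] for cat in c1) / n ** 2    # chance agreement
    return (po - pe) / (1 - pe)

# Hypothetical nominal ratings of 10 observed behaviors by two judges
judge1 = ["on-task", "off-task", "on-task", "on-task", "off-task",
          "on-task", "on-task", "off-task", "on-task", "on-task"]
judge2 = ["on-task", "off-task", "on-task", "off-task", "off-task",
          "on-task", "on-task", "on-task", "on-task", "on-task"]
print(f"kappa = {cohen_kappa(judge1, judge2):.2f}")  # 0.52
```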

36
Q

mutually exclusive categories

A

a particular behavior clearly belongs to one and only one category

37
Q

exhaustive categories

A

the categories cover all possible responses or behaviors

38
Q

duration recording

A

rater records the elapsed time during which the target behavior or behaviors occur

39
Q

frequency recording

A

observer keeps count of the number of times the target behavior occurs;
useful for recording behaviors of short duration and those where duration is not important

40
Q

interval recording

A

observing a subject at a given interval and noting whether the subject is engaging or not engaging in the target behavior during that interval;
useful for behaviors that do not have a fixed beginning or end

41
Q

continuous recording

A

recording all the behavior of the target subject during each observation session

42
Q

standard error of measurement (σmeas)

A

indicates how much error an individual test score can be expected to have;
used to construct a confidence interval

43
Q

confidence interval

A

the range within which an examinee’s true score is likely to fall, given his or her obtained score

44
Q

SEM formula

A

SEmeas = SDx × √(1 − rxx)
SEmeas = standard error of measurement
SDx = standard deviation of test scores
rxx = reliability coefficient

45
Q

CI formulas

A

68% = ± (1)(σmeas) of obtained score;
95% = ± (1.96)(σmeas);
99% = ± (2.58)(σmeas)
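
Putting the last two cards together, a worked sketch with hypothetical numbers (SD = 15, rxx = .91, obtained score = 110):

```python
import math

def sem(sd, rxx):
    """Standard error of measurement: SD * sqrt(1 - rxx)."""
    return sd * math.sqrt(1 - rxx)

def true_score_ci(obtained, sd, rxx, z=1.96):
    """CI around an obtained score: z = 1.0 -> 68%, 1.96 -> 95%, 2.58 -> 99%."""
    error = z * sem(sd, rxx)
    return obtained - error, obtained + error

print(f"SEM = {sem(15, 0.91):.1f}")   # 4.5
print(true_score_ci(110, 15, 0.91))   # 95% CI: (101.18, 118.82)
```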

46
Q

factors affecting reliability

A

1) short tests are less reliable than longer tests;
2) as the group taking a test becomes more homogeneous, the variability of the scores - and hence the reliability coefficient - decreases;
3) if test items are too difficult, most people will get low scores; if items are too easy, most people will get high scores - either way, score variability decreases, resulting in a lower reliability coefficient;
4) the higher the probability that examinees can guess the correct answers to items, the lower the reliability coefficient;
5) for inter-item consistency measured by the KR-20 or coefficient alpha methods, reliability increases as the items become more homogeneous

47
Q

content validity

A

the extent to which the test items adequately and representatively sample the content area to be measured;
educational achievement tests, work samples, EPPP

48
Q

assessment of content validity

A

judgment and agreement of subject matter experts;
high correlation with other tests that purport to sample the same content domain;
students who are known to have succeeded in learning a particular content domain do well on a test designed to sample that domain

49
Q

face validity

A

appears valid to examinees who take it, personnel who administer it, and other technically untrained observers

50
Q

criterion-related validity

A

useful for predicting an individual’s behavior in specified situations;
applied situations (select employees, college admissions, place students in special classes)

51
Q

criterion-related validity coefficient

A

a correlation coefficient (Pearson r) is used to determine the correlation between the predictor and the criterion

52
Q

criterion-related validity coefficient formula

A

rxy
“x” refers to the predictor
“y” refers to the criterion

53
Q

validation

A

the procedures used to determine how valid a predictor is

54
Q

concurrent validation

A

the predictor and the criterion data are collected at or about the same time;
when a test is useful for identifying a given current behavior, we say it has high concurrent validity;
focus on current status on a criterion

55
Q

predictive validation

A

scores on the predictor are collected first, and the criterion data are collected at some future point;
when a test is useful for predicting a future behavior, we say it has high predictive validity;
designed to predict future status

56
Q

standard error of estimate (or σest)

A

estimate the range in which a person's actual criterion score is likely to fall, given the criterion score predicted for him/her by a predictor

57
Q

standard error of estimate formula

A

SEest = SDy × √(1 − rxy²)
SEest = standard error of estimate
SDy = standard deviation of criterion scores
rxy = validity coefficient

58
Q

CI for standard error of estimate

A

68% = ± (1)(σest) of predicted criterion score;
95% = ± (1.96)(σest);
99% = ± (2.58)(σest)
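
Analogous to the σmeas sketch earlier, a worked example with hypothetical numbers (SDy = 10, rxy = .60, predicted criterion score = 85):

```python
import math

def se_est(sd_y, r_xy):
    """Standard error of estimate: SDy * sqrt(1 - rxy^2)."""
    return sd_y * math.sqrt(1 - r_xy ** 2)

see = se_est(10, 0.60)
print(f"SEest = {see:.1f}")  # 8.0
print(f"95% CI: {85 - 1.96 * see:.1f} to {85 + 1.96 * see:.1f}")  # 69.3 to 100.7
```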

59
Q

differences between standard error of estimate and standard error of measurement

A

1) SEM is related to the reliability coefficient; SEE is related to the validity coefficient
2) SEM used to estimate where an examinee’s true test score is likely to fall, given obtained score on that same test - no predictor measure is involved; SEE used to determine where an examinee’s actual criterion score is likely to fall, given the criterion score that was predicted by another measure - predictor is being used

60
Q

criterion cutoff

A

the minimum standard of criterion performance a person must meet or exceed to be considered successful

61
Q

predictor cutoff score

A

if the examinee scores at or above the predictor cutoff score, he or she is selected; if the examinee scores below it, he or she is rejected

62
Q

True Positives (or Valid Acceptances)

A

scored above the cutoff point on the predictor and turn out to be successful on the criterion;
predictor said they would be successful on the job and it was right

63
Q

False Positives (or False Acceptances)

A

scored above the cutoff point on the predictor but did not turn out to be successful on the criterion;
the predictor wrongly indicated that they would be successful on the job

64
Q

True Negatives (or Valid Rejections)

A

scored below the cutoff point on the predictor and turned out to be unsuccessful on the criterion;
predictor correctly indicated that they would be unsuccessful on the job

65
Q

False Negatives (or Invalid Rejections)

A

scored below the cutoff point on the predictor but turned out to be successful on the criterion;
predictor incorrectly indicated that they would be unsuccessful on the job

66
Q

“positive” and “negative” for predictor

A

“positive”: predictor says the person should be selected;
“negative”: predictor says the person should not be selected

67
Q

“true” and “false” for predictor

A

where the person actually stands on the criterion;
“true”: predictor classified the person into the correct criterion group;
“false”: predictor made an incorrect classification

68
Q

predictor’s functional utility

A

determine the increase in the proportion of correct hiring decisions that would result from using the predictor as a selection tool, relative to when it is not used
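
A minimal sketch tying the cutoff and outcome cards together, using hypothetical predictor scores, criterion ratings, and cutoffs; functional utility is shown here as the gain in the proportion of successful hires over the base rate:

```python
def selection_outcomes(predictor, criterion, pred_cut, crit_cut):
    """Tally true/false positives/negatives for given predictor and
    criterion cutoffs, plus a simple index of the predictor's utility."""
    tp = fp = tn = fn = 0
    for x, y in zip(predictor, criterion):
        selected, successful = x >= pred_cut, y >= crit_cut
        if selected and successful:
            tp += 1  # valid acceptance
        elif selected:
            fp += 1  # false acceptance
        elif successful:
            fn += 1  # invalid rejection
        else:
            tn += 1  # valid rejection
    n = tp + fp + tn + fn
    return {"TP": tp, "FP": fp, "TN": tn, "FN": fn,
            "base rate": (tp + fn) / n,   # proportion successful if all were hired
            "hit rate": tp / (tp + fp)}   # proportion successful among those hired

# Hypothetical data for 8 applicants
pred = [75, 82, 60, 90, 55, 70, 88, 65]
crit = [7, 8, 5, 9, 6, 4, 8, 5]
print(selection_outcomes(pred, crit, pred_cut=70, crit_cut=6))
# hit rate (.80) exceeds base rate (.625), so the predictor adds utility
```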

69
Q

Factors Affecting the Validity Coefficient

A

1) Heterogeneity of Examinees: the validity coefficient is lowered if there is a restricted range of scores on either the predictor or the criterion - the more homogeneous the validation group, the lower the validity coefficient;
2) Reliability of Predictor and Criterion: for a predictor to be valid, both the predictor and the criterion must be reliable - an unreliable test will always be invalid, but a reliable test will not always be valid;
3) Moderator Variables: the criterion-related validity of a test may vary among subgroups within a population due to moderator variables;
4) Cross-Validation: after a test is validated, it is typically re-validated with a sample of individuals different from the original validation sample, which typically lowers (shrinks) the validity coefficient

70
Q

moderator variables

A

variables that influence the relationship between two other variables

71
Q

differential validity

A

test is more valid for one subgroup than for another

72
Q

cross-validation

A

after a test is validated, it is typically re-validated with a sample of individuals different from the original validation sample

73
Q

shrinkage

A

reduction that occurs in a criterion-related validity coefficient upon cross-validation;
occurs because the predictor is "tailor-made" for the original validation sample and doesn't fully generalize to other samples

74
Q

when is shrinkage greatest

A

the original validation sample is small;
the original item pool is large;
the number of items retained is small relative to the number of items in the item pool;
items are not chosen based on a previously formulated hypothesis or experience with the criterion

75
Q

criterion contamination

A

in the process of validating a test, the predictor scores themselves influence any individual’s criterion status;
artificially inflates the validity coefficient - it makes the predictor look more valid than it actually is

76
Q

construct

A

a psychological variable that is abstract

77
Q

construct validity

A

measures a theoretical construct or trait

78
Q

convergent validity

A

requires that different ways of measuring the same trait yield similar results (WISC, WJ);
tests that measure the same trait have a high correlation, even when they use different methods

79
Q

discriminant (divergent) validity

A

low correlation with another test that measures a different construct;
two tests that measure different traits have a low correlation, even when they use the same method

80
Q

multitrait-multimethod matrix

A

assessment of two or more traits by two or more methods (self-report inventory, peer ratings, projective test)

81
Q

monotrait-monomethod coefficients

A

indicate the correlation between the measure and itself and are therefore reliability coefficients

82
Q

monotrait-heteromethod coefficients

A

correlations between two measures that assess the same (mono) trait using different (hetero) methods;
if a test has convergent validity, this correlation should be high

83
Q

heterotrait-monomethod coefficients

A

correlations between two measures that measure different (hetero) traits using the same (mono) method;
if a test has discriminant validity, this coefficient should be low

84
Q

heterotrait-heteromethod coefficients

A

correlations between two measures that measure different (hetero) traits using different (hetero) methods;
if a test has discriminant validity, this correlation should be low

85
Q

factor analysis

A

reducing a set of many variables (e.g., tests) to fewer variables to assess construct validity of a test;
detect structure in several variables;
can allow you to start with a large number of variables and classify them into sets

86
Q

underlying constructs

A

constructs (AKA latent variables) that the tests in the analysis are not directly intended to measure;
the factors that emerge from a factor analysis are interpreted as these underlying constructs

87
Q

factor loading

A

the correlation between a given test and a given factor;
range from +1 to -1;
can be squared to determine the proportion of variability in the test accounted for by the factor

88
Q

communality (h2)

A

determine the proportion of variance of a test that is attributable to the factors;
part of true variability shared with other tests

89
Q

common variance

A

the part of a test's variance that the factors also account for in the other tests included in the analysis

90
Q

unique variance (u2)

A

variance specific to the test and not explained by the factors;
part of true variability unique to the test itself

91
Q

explained variance, or eigenvalues

A

measure of the amount of variance in all the tests accounted for by the factor

92
Q

things you should know about eigenvalues

A

1) factors will be ordered in terms of the size of their eigenvalue - Factor I larger than Factor II, which is larger than Factor III, etc. Factor I will explain more of “what’s going on” in the tests than Factor II;
2) sum of the eigenvalues can be no larger than the number of tests included in the analysis
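
A minimal numeric sketch of the loading/communality/eigenvalue cards, using a hypothetical loading matrix for 4 tests and 2 factors; h2 is the row sum of squared loadings and each eigenvalue is the column sum:

```python
import numpy as np

# Hypothetical factor loadings: 4 tests (rows) x 2 factors (columns)
loadings = np.array([[0.80, 0.10],
                     [0.75, 0.20],
                     [0.15, 0.70],
                     [0.20, 0.65]])

h2 = (loadings ** 2).sum(axis=1)           # communality, e.g., .80^2 + .10^2 = .65
eigenvalues = (loadings ** 2).sum(axis=0)  # variance explained by each factor

print("communalities:", h2)
print("eigenvalues:", eigenvalues)  # Factor I > Factor II
print("sum of eigenvalues:", eigenvalues.sum(), "(cannot exceed 4, the number of tests)")
```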

93
Q

rotation

A

procedure that facilitates interpretation of a factor matrix;
re-dividing the tests' communalities so that a clearer pattern of loadings emerges

94
Q

orthogonal

A

factors that are independent of each other (uncorrelated)

95
Q

oblique

A

factors that are correlated with each other to some degree

96
Q

factorial validity

A

when a test correlates highly with a factor it would be expected to correlate with

97
Q

differences between principal components and factor analysis

A

1) terminology: “factor” in factor analysis is usually referred to as a principal component or an eigenvector in principal components analysis
2) in principal components analysis variance has 2 elements: explained variance and error variance; in factor analysis, the variance has 3 elements: communality, specificity, and error
3) in principal components analysis, the factors (or components, or eigenvectors) are always uncorrelated

98
Q

cluster analysis

A

place objects into categories;
develop a taxonomy or classification system

99
Q

differences between cluster analysis and factor analysis

A

1) only variables that are measured using interval or ratio data can be used in a factor analysis; variables measured using any type of data can be included in a cluster analysis
2) factors in factor analysis are usually interpreted as underlying traits or constructs measured by the variables in the analysis; clusters in cluster analysis are just categories, and not necessarily traits
3) cluster analysis is used in studies where there is an a priori hypothesis regarding what categories the objects will cluster into; factor analysis is used to test a hypothesis regarding what traits a set of variables measures

100
Q

relationship between reliability and validity

A

a test is reliable if it measures “something,” and a test is valid if that “something” is what the test developer claims it is;
for a test to be valid, it must be reliable;
the validity coefficient is less than or, at the most, equal to the square root of the reliability coefficient - it can’t be higher;
reliability places an upper limit on validity

101
Q

correction for attenuation

A

the formula answers the following question: “What would the validity coefficient of my predictor be if both the predictor and the criterion were perfectly reliable?”;
what would happen to the validity coefficient if reliability (of both the predictor and the criterion) were higher
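
Both relationships are simple to compute. A minimal sketch with hypothetical coefficients:

```python
import math

def max_validity(rxx):
    """Reliability places an upper limit on validity: rxy <= sqrt(rxx)."""
    return math.sqrt(rxx)

def correct_for_attenuation(rxy, rxx, ryy):
    """Validity coefficient if both predictor and criterion were
    perfectly reliable: rxy / sqrt(rxx * ryy)."""
    return rxy / math.sqrt(rxx * ryy)

print(f"{max_validity(0.81):.2f}")                         # 0.90
print(f"{correct_for_attenuation(0.40, 0.80, 0.70):.2f}")  # 0.53
```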

102
Q

item analysis

A

used to determine which items will be retained for the final version of the test;
can be qualitative (content of the test) and quantitative (measurement of item difficulty, item discrimination)

103
Q

item difficulty index (“p”)

A

the percentage of examinees who answer the item correctly;
the higher the p value, the less difficult the item;
ideal items have p = ~.50;
p values represent an ordinal scale only
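
A minimal sketch with hypothetical 0/1 item responses:

```python
def item_difficulty(responses):
    """p = proportion of examinees answering the item correctly
    (1 = correct, 0 = incorrect); a higher p means an easier item."""
    return sum(responses) / len(responses)

# Hypothetical responses of 10 examinees to a single item
print(item_difficulty([1, 1, 0, 1, 0, 1, 1, 0, 1, 0]))  # p = 0.6
```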

104
Q

item difficulty index for:
gifted
mastery
true/false
multiple choice

A

gifted: .25;
mastery: .80 to .90;
true/false: .75;
multiple choice: .60

105
Q

item discrimination

A

degree to which a test item differentiates among examinees in terms of the behavior that the test is designed to measure

106
Q

item discrimination index (“D”)

A

the proportion of examinees in the upper-scoring group who answer the item correctly minus the proportion in the lower-scoring group who do so; ranges from -1.0 to +1.0, with larger positive values indicating better discrimination;
when selecting items to predict a criterion, choose items that have high correlations with the criterion but low correlations with each other

107
Q

item characteristic curves (ICCs)

A

graphs that depict each item in terms of how difficult the item was for individuals in different ability groups

108
Q

item response theory assumptions about test items

A

1) performance on an item is related to the estimated amount of a latent trait being measured by the item; implies that the scores of individuals tested with different items can be directly compared to each other since all the items measure the same latent trait.

2) results of testing are sample free (“invariance of item parameters”) - an item should have the same parameters (difficulty and discrimination levels) across all random samples of a population so it can be used with any individual to provide an estimate of their ability

109
Q

adaptive testing of ability

A

administering a set of items tailored to the examinee’s estimated level of ability

110
Q

norm-referenced interpretation

A

comparing an examinee’s score to norms (scores of other examinees in a standardization sample);
indicates where the examinee stands in relation to others who have taken the test

111
Q

developmental norms

A

indicate how far along the normal developmental path an individual has progressed

112
Q

mental age (MA) score

A

comparing an examinee’s score to the average performance of others at different age levels

113
Q

grade equivalent scores

A

computing the average raw score obtained by children in each grade;
for educational achievement tests

114
Q

disadvantages of developmental norms

A

don’t permit comparisons of individuals at different age levels;
grade equivalent scores on different tests are not comparable

115
Q

within-group norms

A

provide a comparison of the examinee’s score to those of the most nearly comparable standardization sample

116
Q

percentile rank (PR)

A

the percentage of persons in the standardization sample who fall below a given raw score

117
Q

pros and cons of percentile rank

A

pro: easy to understand and interpret;
con: represent ranks (ordinal data) and therefore do not allow interpretations in terms of absolute amount of difference between scores

118
Q

standard scores

A

express a raw score’s distance from the mean in terms of standard deviation units;
tell us how many standard deviation units a person’s score is above or below the mean

119
Q

pros of using standard scores

A

scores can be compared across different age groups;
allow for interpretation in terms of the absolute amount of differences between scores

120
Q

Z-scores

A

directly indicate how many standard deviation units a score falls above or below the mean

121
Q

T-scores

A

mean of 50 and a SD of 10;
a T-score of 60 falls 1 standard deviation above the mean

122
Q

Stanine Scores

A

scores range from 1 to 9;
mean of 5 and a SD of 2

123
Q

Deviation IQ scores

A

mean of 100 and a standard deviation of 15
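
The preceding standard-score cards are all transformations of z. A minimal conversion sketch with a hypothetical raw score (the stanine is approximated here by rounding 2z + 5 and clipping to 1-9):

```python
def z_score(x, mean, sd):
    """Distance from the mean in standard deviation units."""
    return (x - mean) / sd

def t_score(z):       # mean 50, SD 10
    return 50 + 10 * z

def deviation_iq(z):  # mean 100, SD 15
    return 100 + 15 * z

def stanine(z):       # mean 5, SD 2, clipped to the 1-9 range
    return min(9, max(1, round(5 + 2 * z)))

z = z_score(65, mean=50, sd=15)  # hypothetical raw score of 65
print(z, t_score(z), deviation_iq(z), stanine(z))  # 1.0 60.0 115.0 7
```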

124
Q

differential prediction

A

a case where given scores on a predictor test predict different outcomes for different subgroups

125
Q

single-group validity

A

a test is valid for one subgroup but not another subgroup

126
Q

sensitivity of a test

A

the proportion of correctly identified cases;
the ratio of examinees whom the test correctly identifies as having the characteristic to the total number of examinees who actually possess the characteristic

127
Q

triangulation

A

attempt to increase reliability by reducing systematic or method error through a strategy in which the researcher employs multiple methods of measurement (e.g., observation, survey, archival data)

128
Q

calibration

A

attempt to increase reliability by increasing homogeneity of ratings through feedback to the raters, when multiple raters are used;
raters might meet during pretesting of the instrument to discuss items on which they have disagreed, seeking to reach consensus on rules for rating items (e.g., defining a "2" for an item dealing with job performance)