Test Construction Flashcards

1
Q

What is Classical Test Theory?

A

A theory of measurement used for developing and evaluating tests, also known as true score test theory

2
Q

What is the formula representing the relationship between obtained test scores, true score variability, and measurement error?

A

X = T + E, where X is the obtained score, T is the true score, and E is measurement error

3
Q

What does true score variability (T) represent?

A

Actual differences among examinees regarding what the test measures

4
Q

What is measurement error (E)?

A

Random factors affecting test performance in unpredictable ways

5
Q

What are some examples of measurement error?

A
  • Distractions during testing
  • Ambiguously worded test items
  • Examinee fatigue
6
Q

What does test reliability refer to?

A

The extent to which a test provides consistent information

7
Q

What is a reliability coefficient?

A

A type of correlation coefficient that ranges from 0 to 1.0

8
Q

How is a reliability coefficient interpreted?

A

As the proportion of variability in obtained test scores that is due to true score variability

For example, a reliability coefficient of .84 indicates that 84% of the variability in obtained scores reflects true score differences.

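A minimal simulation of this interpretation (hypothetical values; NumPy assumed available): generating true scores and random error per X = T + E shows that the reliability coefficient is approximately var(T) / var(X).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical example: 10,000 examinees, true-score SD = 4, error SD = 2.
T = rng.normal(loc=50, scale=4, size=10_000)  # true scores
E = rng.normal(loc=0, scale=2, size=10_000)   # random measurement error
X = T + E                                     # obtained scores

# Reliability = true score variance / obtained score variance.
print(round(T.var() / X.var(), 2))  # ~0.80: 80% of obtained-score variability is due to T
```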
9
Q

What reliability coefficient is considered minimally acceptable for many tests?

A

0.70 or higher

10
Q

What reliability coefficient is usually required for high-stakes tests?

A

0.90 or higher

11
Q

What are the four main methods for assessing a test’s reliability?

A
  • Test-retest
  • Alternate forms
  • Internal consistency
  • Inter-rater
12
Q

What does test-retest reliability measure?

A

The consistency of scores over time

13
Q

How is alternate forms reliability assessed?

A

By correlating scores from different forms of the test administered to the same examinees

14
Q

What does internal consistency reliability measure?

A

The consistency of scores over different test items

15
Q

Why is internal consistency reliability not useful for speed tests?

A

It tends to overestimate their reliability

16
Q

What is coefficient alpha also known as?

A

Cronbach’s alpha

17
Q

What is Kuder-Richardson 20 (KR-20) used for?

A

Evaluating internal consistency reliability for dichotomously scored items

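A sketch of the KR-20 computation (data and function name are illustrative):

```python
import numpy as np

def kr20(items: np.ndarray) -> float:
    # items: matrix of dichotomous (0/1) scores, rows = examinees, cols = items.
    k = items.shape[1]                          # number of items
    p = items.mean(axis=0)                      # proportion answering each item correctly
    q = 1 - p                                   # proportion answering incorrectly
    total_var = items.sum(axis=1).var(ddof=1)   # variance of total test scores
    return (k / (k - 1)) * (1 - (p * q).sum() / total_var)

# Hypothetical scores for 5 examinees on 4 items:
data = np.array([[1, 1, 1, 0],
                 [1, 0, 1, 1],
                 [0, 0, 1, 0],
                 [1, 1, 1, 1],
                 [0, 0, 0, 0]])
print(round(kr20(data), 2))  # ~0.90 for this toy data set
```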
18
Q

What is the split-half reliability method?

A

Correlating scores from two halves of a test

19
Q

What is a drawback of split-half reliability?

A

It underestimates a test’s reliability because each half is only half the length of the full test, and shorter tests tend to be less reliable

20
Q

What formula is used to correct split-half reliability?

A

Spearman-Brown prophecy formula

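A minimal sketch of the formula (function name illustrative): a split-half correlation is projected back to full-test length by treating the full test as twice as long as each half.

```python
def spearman_brown(r: float, length_factor: float = 2.0) -> float:
    # Projected reliability when a test is lengthened by `length_factor`;
    # 2.0 corrects a split-half coefficient to full-test length.
    return (length_factor * r) / (1 + (length_factor - 1) * r)

# Hypothetical split-half correlation of .60 between the two half-tests:
print(round(spearman_brown(0.60), 2))  # 0.75 -> estimated full-test reliability
```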
21
Q

What does inter-rater reliability assess?

A

The consistency of scores or ratings assigned by different raters

22
Q

What methods are used to evaluate inter-rater reliability?

A
  • Percent agreement
  • Cohen’s kappa coefficient
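A minimal sketch of both indices (hypothetical ratings; function names illustrative):

```python
from collections import Counter

def percent_agreement(r1, r2):
    # Proportion of cases the two raters categorize identically.
    return sum(a == b for a, b in zip(r1, r2)) / len(r1)

def cohens_kappa(r1, r2):
    # Agreement corrected for the agreement expected by chance alone.
    n = len(r1)
    p_obs = percent_agreement(r1, r2)
    c1, c2 = Counter(r1), Counter(r2)
    # Chance agreement: summed products of the raters' marginal proportions.
    p_chance = sum((c1[k] / n) * (c2[k] / n) for k in set(r1) | set(r2))
    return (p_obs - p_chance) / (1 - p_chance)

rater1 = ["yes", "yes", "no", "no", "yes", "no"]
rater2 = ["yes", "no", "no", "no", "yes", "yes"]
print(round(percent_agreement(rater1, rater2), 2))  # 0.67, partly due to chance
print(round(cohens_kappa(rater1, rater2), 2))       # 0.33 after the chance correction
```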
23
Q

What is a limitation of percent agreement in inter-rater reliability?

A

It does not account for chance agreement

24
Q

What is consensual observer drift?

A

Increased consistency (but often decreased accuracy) in ratings due to raters communicating

25
Q

How can consensual observer drift be reduced?

A
  • Not having raters work together
  • Providing adequate training
  • Regularly monitoring accuracy
26
Q

What factor affects the size of the reliability coefficient related to content?

A

Content homogeneity

Tests that are homogeneous regarding content tend to have larger reliability coefficients than heterogeneous tests, especially for internal consistency reliability.

27
Q

How does the range of scores influence reliability coefficients?

A

Larger reliability coefficients occur when test scores are unrestricted in range

This happens when the sample includes examinees with high, moderate, and low levels of the characteristics measured.

28
Q

What impact does guessing have on reliability coefficients?

A

Easier guessing leads to lower reliability coefficients

True/false tests are therefore likely to be less reliable than multiple-choice tests with three or more answer choices.

29
Q

What is the reliability index?

A

Theoretical correlation between observed test scores and true test scores

Calculated by taking the square root of the reliability coefficient.

30
Q

What does an item analysis determine in test development?

A

Which items to include based on difficulty level and discrimination ability

It is a process used in classical test theory.

31
Q

How is item difficulty (p) calculated?

A

p = number of correct answers / total number of examinees

Ranges from 0 to 1.0, with smaller values indicating more difficult items.

32
Q

What is the preferred range of item difficulty for most tests?

A

p = .30 to .70

Moderately difficult items are preferred, but optimal values may vary based on the test purpose.

33
Q

What is the optimal item difficulty level for mastery tests?

A

Lower p values are preferred

For example, an optimal average item difficulty of about .20 might be used when the test is designed to identify only the top 20% of examinees.

34
Q

How is the optimal difficulty level for guessing calculated?

A

Optimal p = (1.0 + probability of guessing) / 2

For a four-answer multiple-choice question, this would be (1.0 + .25) / 2 = .625.

35
Q

What does the item discrimination index (D) measure?

A

The difference in the proportion of correct responses between examinees with high total test scores and those with low total test scores

Ranges from -1.0 to +1.0, with higher D values indicating better discrimination.

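The item-analysis statistics above (difficulty p, the guessing-adjusted optimal p, and the discrimination index D) reduce to a few lines of arithmetic. A sketch with hypothetical data; a median split defines the high/low groups here for simplicity (top and bottom 27% is another common choice):

```python
import numpy as np

def item_difficulty(item):
    # p = number of correct answers / total number of examinees.
    return item.mean()

def discrimination_index(item, totals):
    # D = proportion correct in the high-total-score group minus
    # the proportion correct in the low-total-score group.
    order = np.argsort(totals)
    half = len(item) // 2
    return item[order[-half:]].mean() - item[order[:half]].mean()

def optimal_p_with_guessing(n_choices):
    # Optimal p = (1.0 + probability of guessing correctly) / 2.
    return (1.0 + 1.0 / n_choices) / 2

item = np.array([1, 1, 0, 1, 0, 0])          # one item's 0/1 scores
totals = np.array([30, 27, 15, 22, 12, 18])  # total test scores
print(item_difficulty(item))                 # 0.5
print(discrimination_index(item, totals))    # 1.0: perfect discrimination here
print(optimal_p_with_guessing(4))            # 0.625 for four-choice items
```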
36
Q

What is an acceptable D value for most tests?

A

D value of .30 or higher

Items of moderate difficulty typically have higher discrimination levels.

37
Q

What does a reliability coefficient less than 1.0 indicate about a test score?

A

An examinee’s obtained test score may or may not be their true score.

38
Q

What is a confidence interval in the context of test scores?

A

It indicates the range within which an examinee’s true score is likely to be based on their obtained score.

39
Q

How is the standard error of measurement calculated?

A

It is calculated by multiplying the test’s standard deviation by the square root of 1 minus the reliability coefficient.

40
Q

What is the standard error of measurement if the standard deviation is 5 and the reliability coefficient is .84?

A

2

Calculated as 5 × √(1 − .84) = 5 × √.16 = 5 × .4 = 2.

41
Q

How do you construct a 68% confidence interval around an obtained test score?

A

Add and subtract one standard error of measurement to and from the obtained score.

42
Q

How do you construct a 95% confidence interval around an obtained test score?

A

Add and subtract two standard errors of measurement to and from the obtained score.

43
Q

How do you construct a 99% confidence interval around an obtained test score?

A

Add and subtract three standard errors of measurement to and from the obtained score.

44
Q

What is the 95% confidence interval for an examinee who scored 90 with a standard error of measurement of 5?

A

80 to 100 (90 ± 2 × 5).

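The SEM and confidence-interval rules above, as a minimal sketch (function names illustrative):

```python
import math

def sem(sd, reliability):
    # Standard error of measurement: SD * sqrt(1 - reliability coefficient).
    return sd * math.sqrt(1 - reliability)

def confidence_interval(score, sem_value, n_sems):
    # 68% CI uses 1 SEM, 95% uses 2, 99% uses 3.
    return score - n_sems * sem_value, score + n_sems * sem_value

print(round(sem(5, 0.84), 2))         # 2.0 -- the worked example above
print(confidence_interval(90, 5, 2))  # (80, 100) -- the 95% interval above
```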
45
Q

What does Item Response Theory (IRT) focus on?

A

Examinees’ responses to individual test items.

46
Q

How does IRT differ from Classical Test Theory (CTT)?

A

CTT is test-based and focuses on total test scores, while IRT is item-based.

47
Q

What advantage does IRT have over CTT regarding item parameters?

A

IRT uses mathematical modeling and large samples to derive sample-invariant item parameters, i.e., parameter estimates that do not depend on the characteristics of the particular sample used to derive them.

48
Q

What is a computerized adaptive test?

A

A test that tailors items to each examinee by presenting items appropriate for their level of the trait.

49
Q

What is another name for Item Response Theory?

A

Latent trait theory.

50
Q

What does the item characteristic curve (ICC) represent?

A

The relationship between each item and the latent trait measured by the test.

51
Q

What are the two axes of the ICC graph?

A

Total test scores (horizontal/x-axis) and probabilities of endorsing or answering the item correctly (vertical/y-axis).

52
Q

What does the difficulty parameter in IRT indicate?

A

The level of the trait required for a 50% probability of endorsing or answering the item correctly.

53
Q

What does the discrimination parameter in IRT indicate?

A

How well the item can discriminate between individuals with high and low levels of the trait.

54
Q

What does the slope of the ICC indicate?

A

The steeper the slope, the better the discrimination of the item.

55
Q

What does the y-axis crossing point of the ICC represent?

A

The probability of guessing correctly.
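The three parameters above can be read directly off a three-parameter logistic ICC. A sketch (the letters a, b, c are standard IRT notation; the values are hypothetical); note that with a nonzero guessing floor, the probability at trait level b is (1 + c)/2 rather than exactly .50:

```python
import math

def icc_3pl(theta, a, b, c):
    # Probability of answering the item correctly at trait level theta:
    # a = discrimination (slope), b = difficulty, c = lower asymptote
    # (the y-axis crossing point: the probability of guessing correctly).
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

# Hypothetical four-choice item with a guessing floor of .25:
for theta in (-3, -1, 0, 1, 3):
    print(theta, round(icc_3pl(theta, a=1.2, b=0.0, c=0.25), 2))
```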

56
Q

Fill in the blank: When the y-axis crossing point of the ICC is closer to 0, it indicates that _______.

A

it is more difficult for examinees to choose the correct answer by guessing.

57
Q

What does adequate reliability in a test indicate?

A

Test scores can be expected to be consistent

Adequate reliability does not indicate that the test measures what it was designed to measure.

58
Q

Define validity in the context of testing.

A

The degree to which evidence and theory support the interpretation of test scores for proposed uses of tests

Validity is a unitary concept, incorporating multiple sources of validity evidence.

59
Q

What are the three traditional types of validity?

A
  • Content Validity
  • Construct Validity
  • Criterion-related Validity
60
Q

List the five sources of validity evidence.

A
  • Evidence based on test content
  • The response process
  • The internal structure of the test
  • Relationships with other variables
  • The consequences of testing
61
Q

What is content validity?

A

Evidence that a test measures one or more content or behavior domains

Important for achievement tests and work samples.

62
Q

How is content validity established during test development?

A

By clearly defining the domain to be assessed and including representative items

Subject matter experts systematically review items for domain coverage.

63
Q

What is face validity?

A

The extent to which test items ‘look valid’ to examinees

Not an actual type of validity, but can affect examinees’ willingness to perform well.

64
Q

Define construct validity.

A

Evidence that a test measures a hypothetical trait inferred from behavior

Important for traits like intelligence and motivation.

65
Q

What is convergent validity?

A

The degree to which scores on the test correlate with scores on other measures of the same or related constructs.

66
Q

What is divergent validity?

A

The degree to which scores on the test have low correlations with scores on measures of unrelated constructs

Also known as discriminant validity.

67
Q

What is a multitrait-multimethod matrix?

A

A table of correlation coefficients that provide information about a test’s reliability and validity

Used to assess convergent and divergent validity.

68
Q

What is a monotrait-monomethod coefficient?

A

A reliability coefficient for the same trait using the same method.

69
Q

What does a large monotrait-heteromethod coefficient indicate?

A

A large correlation between measures of the same trait obtained with different methods (e.g., a self-report sociability test and an observer rating of sociability) provides evidence of the test’s convergent validity.

70
Q

What does a small heterotrait-monomethod coefficient indicate?

A

A small correlation between measures of different traits obtained with the same method (e.g., self-report measures of sociability and of dominance) provides evidence of the test’s divergent validity.

71
Q

What is factor analysis?

A

A statistical method used to assess a test’s convergent and divergent validity

Involves several steps including administering tests and correlating scores.

72
Q

List the basic steps in factor analysis.

A
  • Administer the test to a sample
  • Correlate all pairs of scores
  • Derive the initial factor matrix
  • Rotate and interpret the factor matrix
73
Q

What are factor loadings?

A

Correlation coefficients indicating the relationship between each test and identified factors.

74
Q

How is communality calculated?

A

By squaring and adding the factor loadings when factors are orthogonal.
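A worked one-liner with hypothetical orthogonal loadings:

```python
# Hypothetical loadings of one test on two orthogonal factors:
loading_1, loading_2 = 0.70, 0.20

# Communality: proportion of the test's variance explained by the factors.
print(round(loading_1 ** 2 + loading_2 ** 2, 2))  # 0.49 + 0.04 = 0.53
```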

75
Q

What do high correlations with Factor I and low correlations with Factor II indicate in factor analysis?

A

Evidence of convergent and divergent validity for the test being validated.

76
Q

What is the purpose of naming factors in factor analysis?

A

To interpret the identified factors based on the correlation patterns of the tests.

77
Q

What is criterion-related validity?

A

It is of interest whenever scores on a test predict or estimate scores on another measure.

78
Q

What is an example of criterion-related validity?

A

Evaluating a job knowledge test used for hiring decisions by predicting job performance scores.

79
Q

What are the two types of criterion-related validity?

A

Concurrent and predictive validity.

80
Q

How is concurrent validity evaluated?

A

By obtaining scores on the predictor and criterion at about the same time.

81
Q

When is concurrent validity most important?

A

When predictor scores estimate current status on the criterion.

82
Q

How is predictive validity evaluated?

A

By obtaining scores on the predictor before obtaining scores on the criterion.

83
Q

When is predictive validity most important?

A

When predictor scores estimate future status on the criterion.

84
Q

What does the criterion-related validity coefficient range from?

A

-1 to +1.

85
Q

What does a criterion-related validity coefficient closer to ±1 indicate?

A

More accurate predictor scores for predicting criterion scores.

86
Q

How can the amount of variability explained by one measure be determined?

A

By squaring the criterion-related validity coefficient.

87
Q

If a job knowledge test has a validity coefficient of .70, what does .70 squared indicate?

A

49% of variability in job performance is explained by job knowledge.

88
Q

What is cross-validation?

A

Validating a predictor on a new sample to determine whether the original validity coefficient holds up.

89
Q

Why might the initial correlation coefficient overestimate the true correlation?

A

Because chance (random) factors in the original sample can inflate the correlation coefficient.

90
Q

What happens to the correlation coefficient during cross-validation?

A

It is likely to shrink.

91
Q

When is shrinkage greatest in correlation coefficients?

A

When the initial sample is small and the number of predictors is large.

92
Q

Fill in the blank: The predictor is the _______ and the measure of job performance is the criterion.

A

job knowledge test.

93
Q

What is the standard error of estimate used for?

A

It is used to construct a confidence interval around a person’s predicted criterion score

94
Q

How is a confidence interval defined?

A

It indicates the range within which an examinee’s true criterion score is likely to fall given his or her predicted score

95
Q

What is the relationship between the standard error of estimate and the normal curve?

A

A 68% confidence interval adds and subtracts one standard error, a 95% confidence interval adds and subtracts two, and a 99% confidence interval adds and subtracts three

96
Q

How is the standard error of estimate calculated?

A

By multiplying the criterion measure’s standard deviation by the square root of 1 minus the criterion-related validity coefficient squared

97
Q

What is the range of the standard error of estimate?

A

It ranges from 0 to the size of the criterion measure’s standard deviation

98
Q

What happens to the standard error of estimate when the validity coefficient is +1 or -1?

A

The standard error is 0
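A sketch of the calculation, confirming the two boundary cases above (values hypothetical; function name illustrative):

```python
import math

def se_estimate(criterion_sd, validity):
    # Standard error of estimate: criterion SD * sqrt(1 - validity coefficient squared).
    return criterion_sd * math.sqrt(1 - validity ** 2)

print(round(se_estimate(10, 0.0), 2))  # 10.0: a useless predictor leaves the full criterion SD
print(round(se_estimate(10, 0.6), 2))  # 8.0
print(round(se_estimate(10, 1.0), 2))  # 0.0: a perfect predictor leaves no estimation error
```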

99
Q

What is the correction for attenuation formula used for?

A

To estimate the maximum validity coefficient if the predictor and/or criterion had a reliability coefficient of 1.0
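A minimal sketch of the formula (the coefficients are hypothetical):

```python
import math

def correction_for_attenuation(r_xy, r_xx, r_yy):
    # Estimated validity if predictor (x) and criterion (y) were perfectly
    # reliable: r_xy / sqrt(r_xx * r_yy).
    return r_xy / math.sqrt(r_xx * r_yy)

# Observed validity .42, predictor reliability .80, criterion reliability .70:
print(round(correction_for_attenuation(0.42, 0.80, 0.70), 2))  # ~0.56
```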

100
Q

What does clinical utility refer to?

A

The extent to which a test is useful for clinical purposes

101
Q

Define incremental validity

A

The increase in the accuracy of predictions about criterion performance by adding a new predictor

102
Q

What are true positives?

A

Recently hired employees who obtained high scores on both the predictor and criterion

103
Q

What are false positives?

A

Recently hired employees who obtained high scores on the predictor but low scores on the criterion

104
Q

What are true negatives?

A

Recently hired employees who obtained low scores on both the predictor and criterion

105
Q

What are false negatives?

A

Recently hired employees who obtained low scores on the predictor but high scores on the criterion

106
Q

How is the base rate calculated?

A

By dividing the number of employees with high scores on the criterion by the total number of employees

107
Q

What is the positive hit rate?

A

The proportion of employees who would have been hired based on their scores on the new predictor and who also obtained high scores on the criterion

108
Q

What is diagnostic efficiency?

A

The ability of a test to correctly distinguish between people who do and do not have a disorder

109
Q

What is sensitivity in the context of testing?

A

The proportion of people with the disorder identified by the test as having the disorder

110
Q

What is specificity?

A

The proportion of people without the disorder identified by the test as not having the disorder

111
Q

Define hit rate

A

The proportion of people correctly categorized by the test

112
Q

What does positive predictive value indicate?

A

The probability that a person who tests positive actually has the disorder

113
Q

What does negative predictive value indicate?

A

The probability that a person who tests negative does not actually have the disorder

114
Q

True or False: A test’s positive and negative predictive values are stable across different settings.

A

False

Positive and negative predictive values depend on the base rate (prevalence) of the disorder, which varies from setting to setting.
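A sketch that makes this concrete: identical sensitivity and specificity yield very different predictive values at different base rates (all counts hypothetical):

```python
def diagnostic_stats(tp, fp, tn, fn):
    # Indices computed from a 2x2 classification table.
    return {
        "sensitivity": tp / (tp + fn),  # disordered cases the test flags
        "specificity": tn / (tn + fp),  # non-disordered cases it clears
        "hit_rate": (tp + tn) / (tp + fp + tn + fn),
        "ppv": tp / (tp + fp),          # positive predictive value
        "npv": tn / (tn + fn),          # negative predictive value
    }

# Base rate 10% vs. 50%, same sensitivity (.90) and specificity (.95):
print(diagnostic_stats(tp=90, fp=45, tn=855, fn=10))   # ppv ~= .67
print(diagnostic_stats(tp=450, fp=25, tn=475, fn=50))  # ppv ~= .95
```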

115
Q

What is the relationship between reliability and validity?

A

A predictor’s reliability always places a ceiling on its validity

116
Q

How is the reliability index calculated?

A

It is the square root of the predictor’s reliability coefficient

117
Q

What are norm-referenced scores?

A

Scores that indicate how well an examinee performed compared to a standardization sample.

Norm-referenced scores include percentile ranks and standard scores.

118
Q

What is the primary objective of using norm-referenced scores?

A

To make distinctions among individuals or groups in terms of the ability or trait assessed by a test.

(Urbina, 2014, p. 212)

119
Q

What does a percentile rank indicate?

A

The percentage of examinees in the reference group who scored at or below a given score.

For example, a percentile rank of 82 means that 82% of the reference group obtained scores at or below the examinee’s score.

120
Q

How is the conversion of raw scores to percentile ranks described?

A

As a nonlinear transformation.

This is because the distribution of percentile ranks is always rectangular (uniform), regardless of the shape of the raw score distribution.

121
Q

What do standard scores indicate?

A

How well an examinee did in terms of standard deviations from the mean score obtained by the reference group.

Standard scores include z-scores, T-scores, IQ scores, and stanines.

122
Q

What is the mean and standard deviation of the z-score distribution?

A

Mean = 0, Standard Deviation = 1.0.

123
Q

How is a z-score calculated?

A

z = (X – M)/SD, where X is the raw score, M is the mean, and SD is the standard deviation.

124
Q

What does a T-score of 40 indicate?

A

The examinee’s raw score is one standard deviation below the mean.

T-scores have a mean of 50 and standard deviation of 10.

125
Q

What are the mean and standard deviation for full-scale IQ scores on the SB-5 and Wechsler tests?

A

Mean = 100, Standard Deviation = 15.
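The standard-score conversions above are simple linear transformations of z; a minimal sketch (function names illustrative):

```python
def z_score(x, mean, sd):
    # z = (X - M) / SD.
    return (x - mean) / sd

def t_score(z):
    # T-scores: mean 50, SD 10.
    return 50 + 10 * z

def iq_score(z):
    # Wechsler/SB-5 full-scale IQ: mean 100, SD 15.
    return 100 + 15 * z

z = z_score(40, mean=50, sd=10)    # a raw score one SD below the mean
print(z, t_score(z), iq_score(z))  # -1.0 40.0 85.0
```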

126
Q

What does a stanine of 5 represent?

A

Raw scores that range from .25 standard deviations below to .25 standard deviations above the mean.

Stanines have a mean of 5 and a standard deviation of 2.

127
Q

What is the primary objective of criterion-referenced scores?

A

To evaluate a person’s or group’s degree of competence or mastery against a preestablished standard of performance.

(Urbina, 2014, p. 121)

128
Q

What do percentage scores indicate?

A

The percentage of test items that examinees answered correctly.

129
Q

What is a cutoff score?

A

A predetermined score that distinguishes between mastery and non-mastery of content.

130
Q

What is an expectancy table?

A

A table that predicts an examinee’s expected score on another measure based on their obtained test score.

131
Q

What is the difference between cutoff scores and ranking in selection decisions?

A

Cutoff scores select candidates above a certain score; ranking selects candidates from highest to lowest scores.

132
Q

What is banding in the context of test scores?

A

Grouping test scores into bands based on the standard error of measurement to consider scores within each band as equivalent.

133
Q

True or False: Banding helps reduce adverse impact by including members of minority groups within score bands.

A

True.

134
Q

What is the relationship between a PR of 2 and SD in a normal distribution?

A

PR of 2 is equivalent to -2 SD.

135
Q

What is the relationship between a PR of 16 and SD in a normal distribution?

A

PR of 16 is equivalent to -1 SD.

136
Q

What is the relationship between a PR of 84 and SD in a normal distribution?

A

PR of 84 is equivalent to +1 SD.

137
Q

What is the relationship between a PR of 98 and SD in a normal distribution?

A

PR of 98 is equivalent to +2 SD.
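These PR/SD pairings follow directly from the normal curve; a quick check with SciPy (assumed available), rounding to whole percentile ranks as the cards above do:

```python
from scipy.stats import norm

# Percentile rank = percentage of a normal distribution at or below z:
for z in (-2, -1, 0, 1, 2):
    print(z, round(norm.cdf(z) * 100))  # -2 -> 2, -1 -> 16, 0 -> 50, 1 -> 84, 2 -> 98
```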