Test Construction Flashcards

(137 cards)

1
What is Classical Test Theory?
A theory of measurement used for developing and evaluating tests, also known as true score test theory.

2
What formula represents the relationship between obtained test scores, true score variability, and measurement error?
X = T + E, where X is the obtained score, T is true score variability, and E is measurement error.

3
What does true score variability (T) represent?
Actual differences among examinees regarding what the test measures.

4
What is measurement error (E)?
Random factors affecting test performance in unpredictable ways.

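The X = T + E model is easy to see in a small simulation. Below is a minimal Python sketch (all values hypothetical): true scores and random error are generated separately, and reliability falls out as the share of obtained-score variance contributed by true-score variance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population: 10,000 examinees.
true_scores = rng.normal(loc=50, scale=10, size=10_000)  # T
errors = rng.normal(loc=0, scale=5, size=10_000)         # E: random, mean 0
obtained = true_scores + errors                          # X = T + E

# Reliability = true-score variance / obtained-score variance.
# With SDs of 10 and 5, that is 100 / (100 + 25) = .80.
print(round(true_scores.var() / obtained.var(), 2))  # ~0.8
```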
5
What are some examples of measurement error?
  • Distractions during testing
  • Ambiguously worded test items
  • Examinee fatigue
6
What does test reliability refer to?
The extent to which a test provides consistent information.

7
What is a reliability coefficient?
A type of correlation coefficient that ranges from 0 to 1.0.

8
How is a reliability coefficient interpreted?
As the proportion of variability in obtained test scores that is due to true score variability.

9
What reliability coefficient is considered minimally acceptable for many tests?
0.70 or higher.

10
What reliability coefficient is usually required for high-stakes tests?
0.90 or higher.

11
What are the four main methods for assessing a test’s reliability?
  • Test-retest
  • Alternate forms
  • Internal consistency
  • Inter-rater
12
What does test-retest reliability measure?
The consistency of scores over time.

13
How is alternate forms reliability assessed?
By correlating scores from different forms of the test administered to the same examinees.

14
What does internal consistency reliability measure?
The consistency of scores over different test items.

15
Why is internal consistency reliability not useful for speed tests?
It tends to overestimate their reliability.

16
What is coefficient alpha also known as?
Cronbach’s alpha.

17
What is Kuder-Richardson 20 (KR-20) used for?
Evaluating internal consistency reliability for dichotomously scored items.

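Here is a small sketch of coefficient alpha for a made-up examinees-by-items matrix of 0/1 scores; with dichotomous items this computation is equivalent to KR-20. The data and function name are hypothetical, not from the source.

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Coefficient alpha for an examinees-by-items score matrix;
    equals KR-20 when items are scored 0/1."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1).sum()
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances / total_variance)

# Hypothetical responses: 5 examinees x 4 dichotomous items.
scores = np.array([[1, 1, 1, 0],
                   [1, 1, 0, 0],
                   [1, 0, 0, 0],
                   [1, 1, 1, 1],
                   [0, 0, 0, 0]])
print(round(cronbach_alpha(scores), 2))  # 0.8
```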
18
What is the split-half reliability method?
Correlating scores from two halves of a test.

19
What is a drawback of split-half reliability?
It underestimates a test’s reliability because each half has fewer items than the full test, and shorter tests tend to be less reliable.

20
What formula is used to correct split-half reliability?
The Spearman-Brown prophecy formula.

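A sketch of the Spearman-Brown prophecy formula, which projects reliability when test length changes by a factor n; correcting a split-half coefficient uses n = 2. The .60 split-half correlation is a hypothetical value.

```python
def spearman_brown(r: float, n: float) -> float:
    """Projected reliability when test length changes by factor n."""
    return (n * r) / (1 + (n - 1) * r)

# Correct a hypothetical split-half correlation of .60 to full test length:
print(round(spearman_brown(0.60, n=2), 2))  # 0.75
```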
21
What does inter-rater reliability assess?
The consistency of scores or ratings assigned by different raters.

22
What methods are used to evaluate inter-rater reliability?
  • Percent agreement
  • Cohen’s kappa coefficient
23
What is a limitation of percent agreement in inter-rater reliability?
It does not account for chance agreement.

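Cohen's kappa corrects percent agreement for the agreement expected by chance alone. A minimal sketch with two hypothetical raters:

```python
def cohens_kappa(ratings_a: list, ratings_b: list) -> float:
    """Chance-corrected agreement between two raters over the same cases."""
    n = len(ratings_a)
    categories = set(ratings_a) | set(ratings_b)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    expected = sum((ratings_a.count(c) / n) * (ratings_b.count(c) / n)
                   for c in categories)
    return (observed - expected) / (1 - expected)

rater_a = ["yes", "yes", "no", "no", "yes", "no"]
rater_b = ["yes", "no", "no", "no", "yes", "yes"]
# Percent agreement is 4/6 = .67, but chance agreement is .50 here,
# so kappa is lower: (.67 - .50) / (1 - .50).
print(round(cohens_kappa(rater_a, rater_b), 2))  # 0.33
```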
24
What is consensual observer drift?
Increased consistency (but often decreased accuracy) in ratings due to raters communicating.

25
How can consensual observer drift be reduced?
  • Not having raters work together
  • Providing adequate training
  • Regularly monitoring accuracy
26
What factor affects the size of the reliability coefficient related to content?
Content homogeneity
Note: Tests that are homogeneous regarding content tend to have larger reliability coefficients than heterogeneous tests, especially for internal consistency reliability.
27
How does the range of scores influence reliability coefficients?
Larger reliability coefficients occur when test scores are unrestricted in range.
Note: This happens when the sample includes examinees with high, moderate, and low levels of the characteristics measured.
28
What impact does guessing have on reliability coefficients?
Easier guessing leads to lower reliability coefficients.
Note: True/false tests are likely less reliable than multiple-choice tests with three or more answer choices.
29
What is the reliability index?
The theoretical correlation between observed test scores and true test scores.
Note: Calculated by taking the square root of the reliability coefficient.
30
What does an item analysis determine in test development?
Which items to include based on difficulty level and discrimination ability.
Note: It is a process used in classical test theory.
31
How is item difficulty (p) calculated?
p = number of correct answers / total number of examinees.
Note: p ranges from 0 to 1.0, with smaller values indicating more difficult items.
32
What is the preferred range of item difficulty for most tests?
p = .30 to .70.
Note: Moderately difficult items are preferred, but optimal values may vary based on the test's purpose.
33
What is the optimal item difficulty level when a test is used to identify a fixed percentage of examinees?
An average p value close to that percentage.
Note: For example, an optimal average item difficulty of .20 might be used when the test must identify the top 20% of examinees.
34
How is the optimal difficulty level for guessing calculated?
Optimal p = (1.0 + probability of guessing correctly) / 2.
Note: For a four-answer multiple-choice question, this would be (1.0 + .25) / 2 = .625.
35
What does the item discrimination index (D) measure?
The difference in correct responses between high and low total test score groups.
Note: D ranges from -1.0 to +1.0, with higher values indicating better discrimination.
36
What is an acceptable D value for most tests?
A D value of .30 or higher.
Note: Items of moderate difficulty typically have higher discrimination levels.
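A sketch of both item-analysis statistics on hypothetical data. Item difficulty is just the proportion correct; the discrimination index compares proportions correct in the highest- and lowest-scoring groups (the group fraction is an assumption, here the top/bottom 30%).

```python
import numpy as np

def item_difficulty(item_correct: np.ndarray) -> float:
    """p: proportion of examinees answering the item correctly (0/1 vector)."""
    return item_correct.mean()

def discrimination_index(item_correct: np.ndarray,
                         total_scores: np.ndarray,
                         fraction: float = 0.3) -> float:
    """D: p in the high total-score group minus p in the low group."""
    k = max(1, int(len(total_scores) * fraction))
    order = np.argsort(total_scores)
    low, high = order[:k], order[-k:]
    return item_correct[high].mean() - item_correct[low].mean()

# Hypothetical 0/1 item responses and total test scores for 10 examinees:
item = np.array([1, 1, 0, 1, 0, 0, 1, 0, 1, 1])
totals = np.array([9, 8, 3, 7, 2, 4, 8, 1, 6, 9])
print(item_difficulty(item))               # 0.6
print(discrimination_index(item, totals))  # 1.0: top scorers pass, bottom fail
```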
37
What does a reliability coefficient less than 1.0 indicate about a test score?
An examinee’s obtained test score may or may not be their true score.
38
What is a confidence interval in the context of test scores?
It indicates the range within which an examinee’s true score is likely to be based on their obtained score.
39
How is the standard error of measurement calculated?
It is calculated by multiplying the test’s standard deviation by the square root of 1 minus the reliability coefficient.
40
What is the standard error of measurement if the standard deviation is 5 and the reliability coefficient is .84?
2.0 (SEM = 5 × √(1 − .84) = 5 × √.16 = 5 × .4 = 2.0).
41
How do you construct a 68% confidence interval around an obtained test score?
Add and subtract one standard error of measurement to and from the obtained score.
42
How do you construct a 95% confidence interval around an obtained test score?
Add and subtract two standard errors of measurement to and from the obtained score.
43
How do you construct a 99% confidence interval around an obtained test score?
Add and subtract three standard errors of measurement to and from the obtained score.
44
What is the 95% confidence interval for an examinee who scored 90 with a standard error of measurement of 5?
80 to 100.
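The standard error of measurement and its confidence intervals reduce to two one-line computations. A sketch reproducing both worked examples from the cards above:

```python
import math

def sem(sd: float, reliability: float) -> float:
    """Standard error of measurement: SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - reliability)

def confidence_interval(score: float, sem_value: float, n_sems: int):
    """68%, 95%, and 99% intervals use 1, 2, and 3 SEMs respectively."""
    return score - n_sems * sem_value, score + n_sems * sem_value

print(round(sem(sd=5, reliability=0.84), 2))           # 2.0
print(confidence_interval(90, sem_value=5, n_sems=2))  # (80, 100): 95% CI
```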
45
What does Item Response Theory (IRT) focus on?
Examinees’ responses to individual test items.
46
How does IRT differ from Classical Test Theory (CTT)?
CTT is test-based and focuses on total test scores, while IRT is item-based.
47
What advantage does IRT have over CTT regarding item parameters?
IRT derives sample-invariant item parameters using mathematical techniques and a large sample, so item statistics do not depend on the particular group of examinees tested.
48
What is a computerized adaptive test?
A test that tailors items to each examinee by presenting items appropriate for their level of the trait.
49
What is another name for Item Response Theory?
Latent trait theory.
50
What does the item characteristic curve (ICC) represent?
The relationship between each item and the latent trait measured by the test.
51
What are the two axes of the ICC graph?
The examinee’s trait level (in practice often estimated from total test scores) on the horizontal/x-axis, and the probability of endorsing or answering the item correctly on the vertical/y-axis.
52
What does the difficulty parameter in IRT indicate?
The level of the trait required for a 50% probability of endorsing or answering the item correctly.
53
What does the discrimination parameter in IRT indicate?
How well the item can discriminate between individuals with high and low levels of the trait.
54
What does the slope of the ICC indicate?
The steeper the slope, the better the discrimination of the item.
55
What does the y-axis crossing point of the ICC represent?
The probability of guessing correctly.
56
Fill in the blank: When the y-axis crossing point of the ICC is closer to 0, it indicates that _______.
it is more difficult for examinees to choose the correct answer by guessing.
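The difficulty, discrimination, and guessing parameters described in the cards above correspond to a three-parameter logistic (3PL) model, a common IRT formulation; the sketch below uses hypothetical parameter values. Note that with a nonzero guessing parameter c, the probability at the difficulty level b is (1 + c) / 2, i.e., halfway between the guessing floor and 1.

```python
import math

def icc_3pl(theta: float, a: float, b: float, c: float) -> float:
    """Three-parameter logistic ICC.
    theta: trait level; a: discrimination (slope);
    b: difficulty; c: guessing (y-axis crossing point)."""
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

# Hypothetical four-option multiple-choice item: guessing floor .25,
# moderate discrimination, difficulty at theta = 0.
for theta in (-2, 0, 2):
    print(theta, round(icc_3pl(theta, a=1.2, b=0.0, c=0.25), 2))
# -2 0.31   0 0.62   2 0.94
```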
57
What does adequate reliability in a test indicate?
Test scores can be expected to be consistent.
Note: Adequate reliability does not indicate that the test measures what it was designed to measure.
58
Define validity in the context of testing.
The degree to which evidence and theory support the interpretation of test scores for proposed uses of tests.
Note: Validity is a unitary concept, incorporating multiple sources of validity evidence.
59
What are the three traditional types of validity?
  • Content validity
  • Construct validity
  • Criterion-related validity
60
List the five sources of validity evidence.
  • Evidence based on test content
  • The response process
  • The internal structure of the test
  • Relationships with other variables
  • The consequences of testing
61
What is content validity?
Evidence that a test measures one or more content or behavior domains.
Note: Important for achievement tests and work samples.
62
How is content validity established during test development?
By clearly defining the domain to be assessed and including representative items.
Note: Subject matter experts systematically review items for domain coverage.
63
What is face validity?
The extent to which test items "look valid" to examinees.
Note: Not an actual type of validity, but it can affect examinees' willingness to perform well.
64
Define construct validity.
Evidence that a test measures a hypothetical trait inferred from behavior.
Note: Important for traits like intelligence and motivation.
65
What is convergent validity?
The degree to which scores on the test correlate with scores on other measures of the same or related constructs.
66
What is divergent validity?
The degree to which scores on the test have low correlations with scores on measures of unrelated constructs.
Note: Also known as discriminant validity.
67
What is a multitrait-multimethod matrix?
A table of correlation coefficients that provides information about a test’s reliability and validity.
Note: Used to assess convergent and divergent validity.
68
What is a monotrait-monomethod coefficient?
A reliability coefficient for the same trait using the same method.
69
What does a large monotrait-heteromethod coefficient indicate?
Evidence of the test’s convergent validity, since the same trait measured by different methods correlates highly (e.g., a self-report sociability test and peer ratings of sociability).
70
What does a small heterotrait-monomethod coefficient indicate?
Evidence of the test’s divergent validity, since measures of different traits should not correlate highly even when they use the same method.
71
What is factor analysis?
A statistical method used to assess a test’s convergent and divergent validity.
Note: It involves several steps, including administering tests and correlating scores.
72
List the basic steps in factor analysis.
  • Administer the test to a sample
  • Correlate all pairs of scores
  • Derive the initial factor matrix
  • Rotate and interpret the factor matrix
73
What are factor loadings?
Correlation coefficients indicating the relationship between each test and identified factors.
74
How is communality calculated?
By squaring and adding the factor loadings when factors are orthogonal.
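A sketch of the communality computation under the orthogonal-factors assumption (the loadings are hypothetical):

```python
# Hypothetical orthogonal factor loadings for one test on two factors:
loadings = [0.70, 0.30]  # Factor I, Factor II
communality = sum(loading ** 2 for loading in loadings)
print(round(communality, 2))  # 0.49 + 0.09 = 0.58
```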
75
What do high correlations with Factor I and low correlations with Factor II indicate in factor analysis?
Evidence of convergent and divergent validity for the test being validated.
76
What is the purpose of naming factors in factor analysis?
To interpret the identified factors based on the correlation patterns of the tests.
77
What is criterion-related validity?
Evidence that scores on one measure (the predictor) predict or estimate scores on another measure (the criterion).
78
What is an example of criterion-related validity?
Evaluating a job knowledge test used for hiring decisions by predicting job performance scores.
79
What are the two types of criterion-related validity?
Concurrent and predictive validity.
80
How is concurrent validity evaluated?
By obtaining scores on the predictor and criterion at about the same time.
81
When is concurrent validity most important?
When predictor scores estimate current status on the criterion.
82
How is predictive validity evaluated?
By obtaining scores on the predictor before obtaining scores on the criterion.
83
When is predictive validity most important?
When predictor scores estimate future status on the criterion.
84
What does the criterion-related validity coefficient range from?
-1 to +1.
85
What does a criterion-related validity coefficient closer to ±1 indicate?
More accurate predictor scores for predicting criterion scores.
86
How can the amount of variability explained by one measure be determined?
By squaring the criterion-related validity coefficient.
87
If a job knowledge test has a validity coefficient of .70, what does .70 squared indicate?
49% of variability in job performance is explained by job knowledge.
88
What is cross-validation?
Administering the predictor to a new sample to check whether the original validity coefficient holds up.
89
Why might the initial correlation coefficient overestimate the true correlation?
Because chance (random) factors in the original sample can inflate the correlation.
90
What happens to the correlation coefficient during cross-validation?
It is likely to shrink.
91
When is shrinkage greatest in correlation coefficients?
When the initial sample is small and the number of predictors is large.
92
Fill in the blank: The predictor is the _______ and the measure of job performance is the criterion.
job knowledge test.
93
What is the standard error of estimate used for?
It is used to construct a confidence interval around a person's predicted criterion score
94
How is a confidence interval defined?
It indicates the range within which an examinee’s true criterion score is likely to fall given his or her predicted score
95
What is the relationship between the standard error of estimate and the normal curve?
A 68% confidence interval adds and subtracts one standard error, a 95% confidence interval adds and subtracts two, and a 99% confidence interval adds and subtracts three
96
How is the standard error of estimate calculated?
By multiplying the criterion measure’s standard deviation by the square root of 1 minus the criterion-related validity coefficient squared
97
What is the range of the standard error of estimate?
It ranges from 0 to the size of the criterion measure’s standard deviation
98
What happens to the standard error of estimate when the validity coefficient is +1 or -1?
The standard error is 0
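A sketch of the standard error of estimate; the SD and validity coefficient below are hypothetical values. It also shows the boundary case from the card above: a perfect validity coefficient leaves no estimation error.

```python
import math

def standard_error_of_estimate(sd_criterion: float, validity: float) -> float:
    """SEE = criterion SD * sqrt(1 - validity^2)."""
    return sd_criterion * math.sqrt(1 - validity ** 2)

print(round(standard_error_of_estimate(10, 0.60), 1))  # 8.0
print(standard_error_of_estimate(10, 1.0))             # 0.0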
99
What is the correction for attenuation formula used for?
To estimate the maximum validity coefficient if the predictor and/or criterion had a reliability coefficient of 1.0
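A sketch of the correction for attenuation with hypothetical coefficients; it estimates what the validity coefficient would be if the predictor and criterion were perfectly reliable.

```python
import math

def correct_for_attenuation(r_xy: float, r_xx: float, r_yy: float) -> float:
    """r_xy: observed validity; r_xx, r_yy: predictor/criterion reliabilities."""
    return r_xy / math.sqrt(r_xx * r_yy)

# Observed validity .40, predictor reliability .70, criterion reliability .60:
print(round(correct_for_attenuation(0.40, 0.70, 0.60), 2))  # 0.62
```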
100
What does clinical utility refer to?
The extent to which a test is useful for clinical purposes
101
Define incremental validity
The increase in the accuracy of predictions about criterion performance by adding a new predictor
102
What are true positives?
Recently hired employees who obtained high scores on both the predictor and criterion
103
What are false positives?
Recently hired employees who obtained high scores on the predictor but low scores on the criterion
104
What are true negatives?
Recently hired employees who obtained low scores on both the predictor and criterion
105
What are false negatives?
Recently hired employees who obtained low scores on the predictor but high scores on the criterion
106
How is the base rate calculated?
By dividing the number of employees with high scores on the criterion by the total number of employees
107
What is the positive hit rate?
The proportion of employees who would have been hired using their scores on the new predictor and obtained high scores on the criterion
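A sketch tying these counts together (all counts hypothetical). One common way to express incremental validity in this framework is the positive hit rate minus the base rate:

```python
def base_rate(high_on_criterion: int, total: int) -> float:
    """Proportion of employees successful on the criterion without the predictor."""
    return high_on_criterion / total

def positive_hit_rate(true_positives: int, false_positives: int) -> float:
    """Among those the predictor would select, the proportion who succeed."""
    return true_positives / (true_positives + false_positives)

# Hypothetical counts for 100 recently hired employees:
br = base_rate(high_on_criterion=60, total=100)
phr = positive_hit_rate(true_positives=45, false_positives=5)
print(br, phr, round(phr - br, 2))  # 0.6 0.9 0.3 -> incremental validity of .30
```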
108
What is diagnostic efficiency?
The ability of a test to correctly distinguish between people who do and do not have a disorder
109
What is sensitivity in the context of testing?
The proportion of people with the disorder identified by the test as having the disorder
110
What is specificity?
The proportion of people without the disorder identified by the test as not having the disorder
111
Define hit rate
The proportion of people correctly categorized by the test
112
What does positive predictive value indicate?
The probability that a person who tests positive actually has the disorder
113
What does negative predictive value indicate?
The probability that a person who tests negative does not actually have the disorder
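All five diagnostic-efficiency indices come from the same 2x2 table of test results against actual diagnostic status. A sketch with hypothetical counts:

```python
def diagnostic_efficiency(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Indices from a 2x2 table: test positive/negative vs. disorder present/absent."""
    return {
        "sensitivity": tp / (tp + fn),  # disordered cases the test flags
        "specificity": tn / (tn + fp),  # non-disordered cases the test clears
        "ppv": tp / (tp + fp),          # positive predictive value
        "npv": tn / (tn + fn),          # negative predictive value
        "hit_rate": (tp + tn) / (tp + fp + tn + fn),
    }

for name, value in diagnostic_efficiency(tp=40, fp=10, tn=80, fn=20).items():
    print(name, round(value, 2))
# sensitivity 0.67, specificity 0.89, ppv 0.8, npv 0.8, hit_rate 0.8
```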
114
True or False: A test's positive and negative predictive values are stable across different settings.
False. Positive and negative predictive values depend on the base rate (prevalence) of the disorder, which varies across settings.
115
What is the relationship between reliability and validity?
A predictor’s reliability always places a ceiling on its validity
116
How is the reliability index calculated?
It is the square root of the predictor’s reliability coefficient
117
What are norm-referenced scores?
Scores that indicate how well an examinee performed compared to a standardization sample.
Note: Norm-referenced scores include percentile ranks and standard scores.
118
What is the primary objective of using norm-referenced scores?
To make distinctions among individuals or groups in terms of the ability or trait assessed by a test (Urbina, 2014, p. 212).
119
What does a percentile rank indicate?
The percentage of examinees in the reference group who scored at or below a given score.
Note: For example, a percentile rank of 82 means 82% of the reference group scored at or below the examinee’s score.
120
How is the conversion of raw scores to percentile ranks described?
As a nonlinear transformation.
Note: This is because the percentile rank distribution is always rectangular.
121
What do standard scores indicate?
How well an examinee did in terms of standard deviations from the mean score obtained by the reference group.
Note: Standard scores include z-scores, T-scores, IQ scores, and stanines.
122
What is the mean and standard deviation of the z-score distribution?
Mean = 0, Standard Deviation = 1.0.
123
How is a z-score calculated?
z = (X – M)/SD, where X is the raw score, M is the mean, and SD is the standard deviation.
124
What does a T-score of 40 indicate?
The examinee's raw score is one standard deviation below the mean.
Note: T-scores have a mean of 50 and a standard deviation of 10.
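The z-score and T-score conversions in the cards above are one line each. A sketch using a hypothetical raw score from a distribution with mean 100 and SD 15:

```python
def z_score(x: float, mean: float, sd: float) -> float:
    """z = (X - M) / SD."""
    return (x - mean) / sd

def t_score(z: float) -> float:
    """T-scores rescale z to mean 50, SD 10."""
    return 50 + 10 * z

z = z_score(x=85, mean=100, sd=15)  # one SD below the mean
print(z, t_score(z))                # -1.0 40.0
```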
125
What are the mean and standard deviation for full-scale IQ scores on the SB-5 and Wechsler tests?
Mean = 100, Standard Deviation = 15.
126
What does a stanine of 5 represent?
Raw scores that range from .25 standard deviations below to .25 standard deviations above the mean.
Note: Stanines have a mean of 5 and a standard deviation of 2.
127
What is the primary objective of criterion-referenced scores?
To evaluate a person’s or group’s degree of competence or mastery against a preestablished standard of performance (Urbina, 2014, p. 121).
128
What do percentage scores indicate?
The percentage of test items that examinees answered correctly.
129
What is a cutoff score?
A predetermined score that distinguishes between mastery and non-mastery of content.
130
What is an expectancy table?
A table that predicts an examinee’s expected score on another measure based on their obtained test score.
131
What is the difference between cutoff scores and ranking in selection decisions?
Cutoff scores select candidates above a certain score; ranking selects candidates from highest to lowest scores.
132
What is banding in the context of test scores?
Grouping test scores into bands based on the standard error of measurement to consider scores within each band as equivalent.
133
True or False: Banding helps reduce adverse impact by including members of minority groups within score bands.
True.
134
What is the relationship between a PR of 2 and SD in a normal distribution?
PR of 2 is equivalent to -2 SD.
135
What is the relationship between a PR of 16 and SD in a normal distribution?
PR of 16 is equivalent to -1 SD.
136
What is the relationship between a PR of 84 and SD in a normal distribution?
PR of 84 is equivalent to +1 SD.
137
What is the relationship between a PR of 98 and SD in a normal distribution?
PR of 98 is equivalent to +2 SD.
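These four PR-SD pairs are just points on the normal curve. A sketch that recovers them from the standard normal cumulative distribution (implemented with math.erf, so no external libraries are needed):

```python
import math

def percentile_rank(z: float) -> float:
    """PR of a z-score under the normal curve: 100 * standard normal CDF."""
    return 100 * 0.5 * (1 + math.erf(z / math.sqrt(2)))

for z in (-2, -1, 1, 2):
    print(z, round(percentile_rank(z)))  # 2, 16, 84, 98
```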