Test Construction 2 Flashcards
What is the range of p values (Item difficulty index)?
a. -1.0 to 1.0
b. 0 to 2.0
c. 0 to 1.5
d. 0 to 1.0
d
How to calculate the item difficulty index
p = number of examinees passing the item / total number of examinees
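A minimal Python sketch of this calculation (the examinee counts are hypothetical):

```python
def item_difficulty(num_passing, num_examinees):
    """Item difficulty index p: the proportion of examinees who pass the item."""
    return num_passing / num_examinees

# Hypothetical example: 30 of 40 examinees answered the item correctly.
p = item_difficulty(30, 40)
print(p)  # 0.75
```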
What does a larger p value indicate?
a. better reliability
b. easier items
c. better item discrimination
d. harder items
b
In most situations, a p value of _____ is optimal. One exception is the case of a true/false test, for which the optimal p value is ____.
.50; .75
This refers to the extent to which a test item is able to differentiate between examinees who obtain high versus low scores on the entire test or on an external criterion.
Item discrimination
The item discrimination index ranges from:
a. -1.0 to 1.0
b. 0 to 2.0
c. 0 to 1.5
d. 0 to 1.0
a
For most tests, an item with a discrimination index of ____ or higher is considered acceptable.
.35
If all examinees in the upper group and none in the lower group answered the item correctly, D is equal to _____.
1.0
If none of the examinees in the upper group and all examinees in the lower group answered the item correctly, D equals _____.
-1.0
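One common way to compute D is the proportion of the upper group answering the item correctly minus the proportion of the lower group answering it correctly; a Python sketch with hypothetical group sizes:

```python
def discrimination_index(upper_correct, upper_n, lower_correct, lower_n):
    """D: proportion correct in the upper group minus proportion correct in the lower group."""
    return upper_correct / upper_n - lower_correct / lower_n

# All 20 upper-group examinees and none of the 20 lower-group examinees correct:
print(discrimination_index(20, 20, 0, 20))  # 1.0
# None of the upper group and all of the lower group correct:
print(discrimination_index(0, 20, 20, 20))  # -1.0
```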
Test construction is usually based on one of two theories:
classical test theory
item response theory
Advantages of item response theory are that item parameters are sample invariant and performance on different sets of items or tests can be easily __________. Use of IRT involves deriving an ____________________________ for each item.
equated; item characteristic curve
Whenever we administer a test to examinees, we would like to know how much of their scores reflects “truth” and how much reflects error. It is a measure of ________ that provides us with an estimate of the proportion of variability in examinees’ obtained scores that is due to true differences among examinees on the attributes measured by the test. When a test is ________, it provides dependable, consistent results.
Reliability; reliable
Most methods for estimating reliability produce a reliability coefficient, which is a correlation coefficient that ranges in value from:
a. -1.0 to 1.0
b. 0 to 2.0
c. 0 to 1.5
d. 0 to 1.0
d
When a test’s reliability coefficient is 0.0, this means that all variability in obtained test scores is due to __________________.
measurement error
When a test’s reliability coefficient is +1.0, this indicates that all variability in scores ______________.
reflects true score variability
A reliability coefficient of .84 indicates that ____% of variability in scores is due to true score differences among examinees, while the remaining _____% is due to measurement error.
84; 16
This method for estimating reliability involves administering the same test to the same group of examinees on two different occasions and then correlating the two sets of scores.
a. Alternate forms reliability
b. Test-retest reliability
c. Split-half reliability
d. Inter-rater reliability
b
An ______________ coefficient is calculated by administering two equivalent forms of a test to the same group of examinees and correlating the two sets of scores.
a. Alternate forms reliability
b. Test-retest reliability
c. Split-half reliability
d. Inter-rater reliability
a
The test-retest reliability coefficient is also known as the coefficient of ____________.
stability
The alternate forms reliability coefficient is also referred to as the coefficient of ______________.
equivalence (and stability)
To assess ____________________, a test is administered once to a single group of examinees.
a. Alternate forms reliability
b. Test-retest reliability
c. Split-half reliability
d. Internal consistency reliability
d
A _________________ coefficient is calculated by splitting the test in half and correlating examinees’ scores on the two halves. Because the size of a reliability coefficient is affected by the test length, the split-half method tends to __________ a test’s true reliability. Consequently, the ____________ formula is often used in conjunction with split-half reliability to obtain an estimate of what the test’s true reliability is.
split-half; underestimate; Spearman-Brown
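The general Spearman-Brown formula is r_new = nr / (1 + (n - 1)r), where n is the factor by which test length is multiplied. A Python sketch with hypothetical coefficients (n = 2 corrects a split-half coefficient, since each half is only half as long as the full test):

```python
def spearman_brown(r, n):
    """Predicted reliability when test length is multiplied by a factor of n."""
    return (n * r) / (1 + (n - 1) * r)

# Hypothetical split-half coefficient of .60, corrected to full test length:
print(round(spearman_brown(0.60, 2), 2))  # 0.75
# Hypothetical estimate of the effect of doubling a test's length, starting from r = .80:
print(round(spearman_brown(0.80, 2), 2))  # 0.89
```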
____________, another method used to assess internal consistency reliability, indicates the average inter-item consistency rather than the consistency between two halves of the test. The _____________ can be used as a substitute for it when test items are scored dichotomously (right or wrong).
Coefficient alpha; Kuder-Richardson Formula 20
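A worked Python sketch of KR-20 on a small set of hypothetical dichotomous item responses (rows are examinees, columns are items):

```python
# scores[i][j] = 1 if examinee i answered item j correctly, else 0 (hypothetical data).
scores = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]

k = len(scores[0])                      # number of items
n = len(scores)                         # number of examinees
totals = [sum(row) for row in scores]   # each examinee's total score
mean_total = sum(totals) / n
variance = sum((t - mean_total) ** 2 for t in totals) / n  # variance of total scores

# Sum of p * q over items, where p is each item's difficulty index and q = 1 - p.
sum_pq = sum(
    (sum(row[j] for row in scores) / n) * (1 - sum(row[j] for row in scores) / n)
    for j in range(k)
)

kr20 = (k / (k - 1)) * (1 - sum_pq / variance)
print(round(kr20, 2))  # 0.8
```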
____________ should be assessed when a test is subjectively scored. The scores assigned by different raters can be used to calculate a ___________________ or to determine the percent agreement between raters.
Inter-rater reliability; correlation coefficient (kappa statistic)
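A Python sketch of both statistics for two hypothetical raters making pass/fail judgments (kappa corrects percent agreement for the agreement expected by chance):

```python
# Hypothetical ratings from two raters on ten examinees.
rater1 = ["pass", "pass", "fail", "pass", "fail", "pass", "fail", "fail", "pass", "pass"]
rater2 = ["pass", "pass", "fail", "fail", "fail", "pass", "fail", "pass", "pass", "pass"]

n = len(rater1)
percent_agreement = sum(a == b for a, b in zip(rater1, rater2)) / n

# Chance agreement is estimated from each rater's marginal proportions.
categories = set(rater1) | set(rater2)
p_chance = sum((rater1.count(c) / n) * (rater2.count(c) / n) for c in categories)
kappa = (percent_agreement - p_chance) / (1 - p_chance)

print(percent_agreement)  # 0.8
print(round(kappa, 2))    # 0.58
```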
The ___________ could be used to estimate the effects of increasing or reducing the number of items on a test.
Spearman-Brown prophecy formula
While different types of tests can be expected to have different levels of reliability, for most tests, reliability coefficients of ______ or larger are considered acceptable.
.80
The magnitude of a reliability coefficient is affected by several factors. In general, the longer a test, the ____________ its reliability coefficient.
larger
The __________________ is useful for indicating how closely an individual examinee’s obtained score is likely to approximate his or her true score. It is calculated by multiplying the standard deviation of the test scores by the ________ of one minus the reliability coefficient.
standard error of measurement; square root
The standard error of measurement is used to construct a ______________ around an examinee’s obtained score.
confidence interval
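A Python sketch using a hypothetical standard deviation, reliability coefficient, and obtained score:

```python
import math

def standard_error_of_measurement(sd, reliability):
    """SEM = standard deviation of test scores * square root of (1 - reliability)."""
    return sd * math.sqrt(1 - reliability)

# Hypothetical test: SD = 10, reliability coefficient = .84
sem = standard_error_of_measurement(10, 0.84)
print(round(sem, 2))  # 4.0

# 95% confidence interval around a hypothetical obtained score of 100 (+/- 1.96 SEM):
obtained = 100
print(round(obtained - 1.96 * sem, 2), round(obtained + 1.96 * sem, 2))  # 92.16 107.84
```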
________ refers to a test’s accuracy: A test is ______ when it measures what it is intended to measure.
Validity; valid
There are three main forms of validity: _______ is of concern whenever a test has been designed to measure one or more content or behavior domains. _________ is important when a test will be used to measure a hypothetical construct such as achievement, motivation, intelligence, or mechanical aptitude. __________ is of interest when a test has been designed to estimate or predict performance on another measure.
Content validity; construct validity; criterion-related
When scores on the test (X) are important because they provide information on how much each examinee knows about a content domain or on each examinee’s status with regard to the trait being measured, then ___________ or ________ validity, respectively, is of interest. However, when the test (X) scores will be used to predict scores on some other measure (Y) and it is the scores on Y that are of most interest, then ___________ validity is of greatest concern.
content; construct; criterion-related
High correlations with measures of the same trait provide evidence of the test’s ___________ validity, while low correlations with measures of unrelated characteristics provide evidence of the test’s ___________ validity.
convergent; discriminant (divergent)
The ________________ is used to systematically organize the data collected when assessing a test’s convergent and discriminant validity. It indicates that a test has convergent validity when the ________________ coefficients are large and discriminant validity when the _______________ and the ___________ coefficients are small.
multitrait-multimethod matrix; monotrait-heteromethod; heterotrait-heteromethod; heterotrait-monomethod
Factor analysis is used to identify the factors (dimensions) that underlie the ___________ among a set of tests. One use of the data obtained in a factor analysis is to determine if a test has ______________.
intercorrelations; construct validity
Using factor analysis, a test is shown to have construct validity when it has ______________ correlations with the factor(s) it is expected to correlate with and ______ correlations with the factor(s) it is not expected to correlate with.
high; low
In a factor matrix, the correlation between a test and a factor is referred to as ___________. This correlation can be interpreted in terms of shared variability. For example, if a test has a correlation of .50 with Factor I, this means that _____ percent of variability in test scores is explained by Factor I.
factor loading; 25
In factor analysis, when the identified factors are ______________ (uncorrelated), a test’s communality can be calculated by summing the __________________. If a test has a correlation of .50 with Factor I and a correlation of .20 with Factor II and the factors are uncorrelated, the test’s communality is equal to _____. This means that ____% of the variability in test scores is explained by the identified factors, while the remaining variability is due to some combination of specificity and measurement error.
orthogonal; .29; 29
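With orthogonal factors, the communality is simply the sum of the squared factor loadings; a Python sketch using the loadings from the card above:

```python
# Factor loadings of .50 on Factor I and .20 on Factor II (orthogonal factors).
loadings = [0.50, 0.20]

# Communality: the sum of the squared loadings.
communality = sum(loading ** 2 for loading in loadings)
print(round(communality, 2))  # 0.29
```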
When the purpose of testing is to draw conclusions about performance on another measure, the test is referred to as the _____________ and the other measure is called the _____________.
predictor; criterion
There are two types of criterion-related validity: When establishing __________ validity, the predictor is administered to a sample of examinees prior to the criterion. It is the appropriate type of validity when the goal of testing is to predict ________ status on the criterion.
When evaluating ______________ validity, the predictor and criterion are administered at about the same time. It is the preferred method for assessing validity when the purpose of testing is to estimate ________ status on the criterion.
predictive; future; concurrent; current
This is used to construct a confidence interval around an individual’s predicted criterion score.
standard error of estimate
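The standard error of estimate equals the criterion’s standard deviation times the square root of one minus the squared validity coefficient; a Python sketch with hypothetical values:

```python
import math

def standard_error_of_estimate(sd_criterion, validity):
    """SEE = criterion SD * square root of (1 - squared validity coefficient)."""
    return sd_criterion * math.sqrt(1 - validity ** 2)

# Hypothetical values: criterion SD = 15, validity coefficient = .60
see = standard_error_of_estimate(15, 0.60)
print(round(see, 2))  # 12.0

# 95% confidence interval around a hypothetical predicted criterion score of 80:
predicted = 80
print(round(predicted - 1.96 * see, 2), round(predicted + 1.96 * see, 2))  # 56.48 103.52
```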
The data collected in a concurrent or predictive validity study can be used to assess a predictor’s _________________, or the increase in correct decisions that can be expected if the predictor is used as a decision-making tool.
incremental validity
Study tip: Remember that it is the ___________ that determines whether a person is a positive or a negative, and the ________ that determines whether he/she is a “true” or a “false.”
predictor; criterion
The optimal item difficulty level for a true/false test is:
a. .25
b. .50
c. .75
d. 1.00
c
For a test item that has an item discrimination index of +1.0, you would expect:
a. high achievers to be more likely to answer the item correctly than low achievers
b. low achievers to be more likely to answer the item correctly than high achievers
c. low and high achievers to be equally likely to answer the item correctly
d. low and high achievers to be equally likely to answer the item incorrectly
a
In terms of item response theory, the slope (steepness) of the item characteristic curve indicates the item’s:
a. difficulty
b. discriminability
c. reliability
d. validity
b. When using an item characteristic curve, an item’s ability to discriminate between high and low achievers is indicated by the slope of the curve - the steeper the slope, the greater the discrimination.
A researcher correlates scores on two alternate forms of an achievement test and obtains a correlation coefficient of .80. This means that ____% of observed test score variability reflects true score variability:
a. 80
b. 64
c. 36
d. 20
a
To estimate the effects of lengthening a 50-item test to 100 items on the test’s reliability, you would use which of the following:
a. Pearson r
b. Kuder-Richardson Formula 20
c. kappa coefficient
d. Spearman-Brown Formula
d
To assess the internal consistency of a test that contains 50 items which are each scored as “right” or “wrong,” you would use which of the following:
a. KR-20
b. Spearman-Brown
c. kappa statistic
d. coefficient of concordance
a
You administer a test to a group of examinees on April 1 and then re-administer the test to the same group of examinees on May 1. When you correlate the two sets of scores, you will have obtained:
a. coefficient of consistency
b. coefficient of determination
c. coefficient of equivalence
d. coefficient of stability
d
The kappa statistic for a test is .90. This means that the test has:
a. adequate inter-rater reliability
b. adequate internal consistency reliability
c. inadequate intra-rater reliability
d. inadequate alternate forms reliability
a
Refers to the extent to which test items contribute to achieving the stated goals of testing
relevance