Item Analysis and Reliability Flashcards
A “____” is a systematic method for measuring a sample of behavior. Although the exact procedures used to develop a test depend on its ____ and ____, ____ ____ ordinarily involves specifying the test’s purpose, generating test items, administering the items to a sample of examinees for the purpose of item analysis, evaluating the test’s reliability and validity, and establishing norms. In this chapter, the basic principles of test construction are summarized.
Test; Purpose and Format; Test Construction
Two measurement frameworks are most commonly used to construct and evaluate tests ____ ____ ____ and ____ ____ ____.
Classical Test Theory and Item Response Theory
Classical test theory (CTT) has a longer history than item response theory, and its theoretical assumptions are considered to be ____ (they can be easily met by ____ ____ ____). It also focuses more on ____-____ than item-level information (i.e., on relating an examinee’s obtained test score to their true test score), and it is most useful when the size of the sample being used to develop the test is ____.
Weaker; Most Test Data; Test-Level; Small
In contrast, the underlying assumptions of item response theory (IRT) are ____ ____ (more difficult to meet). It focuses on ____-____ ____ (i.e., on the relationship between an examinee’s response to a test item and his/her status regarding the latent trait being measured by the item), and it requires a ____ ____.
More Stringent; Item-Level Information; Large Sample
____ ____ ____ (___) provides two methods of item analysis — item difficulty and item discrimination — as well as methods for assessing test reliability.
Classical Test Theory (CTT)
____ ____ is measured by calculating an item difficulty index (p).
Item Difficulty
The value of p ranges from _ to _, with larger values indicating ____ ____: When p is equal to 0, this indicates that ____ of the examinees in the tryout sample answered the item correctly, and when p is equal to 1.0, this means that the item was answered ____ by ____ examinees.
0 to 1.0; Easier Items; None; Correctly; All
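The p calculation can be sketched in a few lines of Python; the function name and sample responses below are illustrative assumptions, not part of the original flashcards:

```python
# Hypothetical sketch: item difficulty index (p) under CTT.
# responses: 1 = correct, 0 = incorrect, for one item across a tryout sample.
def item_difficulty(responses):
    """p = number correct / total number of examinees; ranges from 0 to 1.0."""
    return sum(responses) / len(responses)

# 7 of 10 examinees answered the item correctly, so p = .70 (a fairly easy item).
print(item_difficulty([1, 1, 1, 0, 1, 1, 0, 1, 0, 1]))  # -> 0.7
```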
For many tests, items with ____ ____ ____ (p values close to .50) are retained. This strategy is useful because it increases test score ____, helps ensure that scores will be normally distributed, provides maximum ____ between examinees, and helps maximize the test’s ____. The optimal difficulty level is affected, however, by several factors. One factor is the likelihood that an examinee can choose the correct answer by ____, with the preferred difficulty level being ____ between 1.0 (100% of examinees answered the item correctly) and the level of success expected by chance alone.
Moderate Difficulty Levels; Variability; Discrimination; Reliability; Guessing; Halfway
For true/false items, the probability of obtaining a correct answer by chance alone is .50, so the preferred difficulty level is .75, which is ____ between 1.0 and .50. Another factor that affects the optimal item difficulty level is the ____ of ____. If the goal is to choose a certain number of examinees, the optimal difficulty level corresponds to the proportion of examinees to be selected. For a graduate school admissions test, if only 15% of applicants are to be admitted, items will be chosen so that the average difficulty level for all items included in the test is about .15.
Halfway; Goal of Testing
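The “halfway between 1.0 and the chance level” rule can be expressed as a one-line calculation; this is an illustrative sketch, not part of the original material:

```python
def optimal_p(chance_level):
    # Preferred item difficulty: halfway between 1.0 (everyone correct)
    # and the success rate expected by chance alone.
    return (1.0 + chance_level) / 2

print(optimal_p(0.50))  # true/false items -> 0.75
print(optimal_p(0.25))  # four-option multiple-choice items -> 0.625
```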
____ ____ refers to the extent to which a test item discriminates (differentiates) between examinees who obtain high versus low scores on the entire test and is measured with the ____ ____ ____ (_). This requires identifying the examinees in the tryout sample who obtained the ____ and ____ ____ on the ____ (often the upper 27% and lower 27%) and, for each item, ____ the percent of examinees in the ____-____ ____ (_) from the percent of examinees in the ____-____ ____ (_) who answered the item correctly: D = U – L
Item Discrimination; Item Discrimination Index (D); Highest and Lowest Scores on the Test; Subtracting; Lower-Scoring Group (L); Upper-Scoring Group (U)
The item discrimination index ranges from __ to __. If all examinees in the upper group and none in the lower group answered the item correctly, D is equal to __. If none of the examinees in the upper group and all examinees in the lower group answered the item correctly, D equals __. For most tests, an item with a discrimination index of __ or higher is considered acceptable. As noted above, items with moderate difficulty levels (around .50) have the greatest potential for ____ ____.
-1.0 to +1.0; +1.0; –1.0; .35; Maximum Discrimination
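The D = U – L formula above can be sketched as follows; the function name and the sample groups are hypothetical:

```python
def discrimination_index(upper_correct, lower_correct):
    # D = U - L: proportion correct in the upper-scoring group minus
    # proportion correct in the lower-scoring group (often top/bottom 27%).
    U = sum(upper_correct) / len(upper_correct)
    L = sum(lower_correct) / len(lower_correct)
    return U - L

# 9 of 10 high scorers vs. 3 of 10 low scorers answered correctly: D = .60
print(round(discrimination_index([1] * 9 + [0], [1] * 3 + [0] * 7), 2))  # -> 0.6
```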
For the exam, you want to know the range of p and D and how to interpret a specific value — e.g., ____________________________________. Also, keep in mind that, in general, a ____________________________________.
Know that a p value of .50 means that 50% of examinees answered the item correctly and that a D value of +1.0 means that all examinees in the highest scoring group and none of the examinees in the lowest scoring group answered the item correctly; p value of .50 is preferred, but that the optimal value depends on the likelihood that the correct answer can be chosen by guessing
Critics of CTT point out that it has several limitations. One of the biggest problems is that item and test parameters are ____-____. That is, the item difficulty and item discrimination indices, the reliability coefficient, and other measures derived from CTT are likely to ____ from ____ to ____. Another problem is that it is ____ to equate scores obtained on different tests that have been developed on the basis of CTT: A total score of 50 on an English test does not necessarily mean the same thing as a total score of 50 on a math test or a different English test. According to its advocates, ___ overcomes these problems and has several other advantages as well.
Sample-Dependent; Vary from Sample to Sample; Difficult; IRT (Item Response Theory)
The item characteristics (parameters) derived from IRT are considered to be ____ ____ — i.e., they are the same across different samples. Also, because test scores are reported in terms of an ____ ____ on the ____ ____ ____ (rather than in terms of a total test score), it is possible to equate scores from different sets of items and from different tests. Another advantage is that the use of IRT makes it easier to develop ____-____ ____, in which the administration of subsequent items is based on the examinee’s performance on ____ ____.
Sample Invariant; Examinee’s Level; Trait Being Measured; Computer-Adaptive Tests; Previous Items
When using IRT, an ____ ____ ____ (___) is constructed for each item by plotting the proportion of examinees in the tryout sample who answered the item correctly against either the total test score, performance on an external criterion, or a mathematically derived estimate of the latent ability or trait being measured by the item. The curve provides information on the relationship between an ____ ____ on that ____ or ____ and the ____ that he/she will respond to the item ____. The various IRT models produce ICCs that provide information on either one, two, or three parameters. The ICC in Figure 1 provides information on all three parameters — ____, ____, and ____ of ____ ____.
Item Characteristic Curve (ICC); Examinee’s Level; Ability or Trait; Probability; Correctly; Difficulty, Discrimination, and Probability of Guessing Correctly
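One common form of the three-parameter ICC is the three-parameter logistic (3PL) model; the sketch below and its parameter values are illustrative assumptions:

```python
import math

def icc_3pl(theta, a, b, c):
    # Three-parameter logistic ICC: probability of a correct response given
    # ability theta, discrimination a, difficulty b, and guessing parameter c.
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

# With no guessing (c = 0), an examinee whose ability equals the item's
# difficulty (theta == b) has a 50% chance of answering correctly.
print(icc_3pl(theta=0.0, a=1.5, b=0.0, c=0.0))  # -> 0.5
```

A steeper slope (larger a) corresponds to better discrimination, and a nonzero c raises the curve's lower asymptote, reflecting the probability of guessing correctly.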
An item’s ____ of ____ is indicated by the ability level at which 50% of examinees in the tryout sample provided a correct response. The difficulty level for the item depicted in Figure 1 is about 0, which corresponds to an ____ ____ ____ and indicates that the item is of ____ ____ — i.e., a person with average ability has a __% chance of answering this item correctly.
Level of Difficulty; Average Ability Level; Medium Difficulty; 50%
The item’s ____ to ____ between high and low achievers is indicated by the slope of the curve — the ____ the slope, the greater the discrimination. The item depicted in Figure 1 has good discrimination: It indicates that examinees with low ability (below 0) are more likely to answer the item ____, while those with high ability (above 0) are more likely to answer it ____.
Ability to Discriminate; Steeper; Incorrectly; Correctly
The ____ of ____ ____ is indicated by the point at which the ICC intercepts the vertical axis. Figure 1 indicates that there is a low probability of guessing correctly for this item: Only a small proportion of examinees with very low ability answered the item ____.
Probability of Guessing Correctly; Correctly
For the exam, keep the term “____ ____ ____” linked with its abbreviation and know how difficulty level, discrimination, and probability of guessing correctly are indicated by the ___.
Item Characteristic Curve; ICC
An item analysis is conducted to determine which items to retain in the final version of a test. An item difficulty index (p) is calculated by dividing the number of examinees who answered the item correctly by the (1) ________. It ranges in value from (2) ____. In general, an item difficulty level of (3) ____ is preferred because it not only maximizes (4) ____ between examinees of low and high ability but also helps ensure that the test has high (5) ____.
(1) total number of examinees; (2) 0 to 1.0; (3) .50; (4) discrimination; (5) reliability
However, the optimal difficulty level is affected by the probability that an examinee can answer the item correctly by guessing. For this reason, the optimal p value for true/false items is (6) ____. An (7) ____ index (D) is calculated by subtracting the percent of examinees in the lower-scoring group from the percent of examinees in the upper-scoring group who answered the item correctly. It ranges in value from (8) ____.
(6) .75; (7) item discrimination; (8) -1.0 to +1.0
Advantages of item response theory (IRT) are that item parameters are (9) ____ invariant and performance on different sets of items or tests can be easily (10) ____. Use of IRT involves deriving an item (11) ____ for each item that provides information on one, two, or three parameters — i.e., difficulty, discrimination, and (12) _________.
(9) sample; (10) equated; (11) characteristic curve; (12) probability of guessing correctly
From the perspective of ____ ____ ____, an examinee’s obtained test score (X) is composed of two components, a true score component (T) and an error component (E): X = T + E
Classical Test Theory
The ____ ____ ____ reflects the examinee’s status with regard to the attribute that is measured by the test, while the ____ ____ represents measurement error. Measurement error is ____ ____: It is due to factors that are ____ to what is being ____ by the ____ and that have an ____ (unsystematic) ____ on an examinee’s test score. The score you obtain on the licensing exam (X) is likely to be due both to the knowledge you have about the topics addressed by exam items (T) and the effects of random factors (E) such as the way test items are written, any alterations in anxiety, attention, or motivation you experience while taking the test, and the accuracy of your “educated guesses.”
True Score Component; Error Component; Random Error; Irrelevant; Measured; Test; Unpredictable Effect
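The X = T + E decomposition can be illustrated with a small simulation; the true score and error spread below are arbitrary values chosen for the sketch:

```python
import random

random.seed(42)

# Sketch of X = T + E: a fixed true score plus random, unsystematic error.
true_score = 80.0
obtained_scores = [true_score + random.gauss(0, 5) for _ in range(1000)]

# Because error is random (unsystematic), it averages out across many
# administrations, so the mean obtained score approximates the true score.
mean_obtained = sum(obtained_scores) / len(obtained_scores)
print(round(mean_obtained, 1))
```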
Whenever we administer a test to examinees, we would like to know how much of their scores reflects “____” and how much reflects ____. A measure of ____ provides us with an estimate of the proportion of variability that is due to true differences among examinees on the attribute(s) measured by the test. When a test is reliable, it provides ____, ____ ____ and, for this reason, the term ____ is often given as a synonym for reliability.
Truth; Error; Reliability; Dependable; Consistent Results; Consistency
Ideally, a test’s reliability (true score variability) could be measured ____. However, this is ____ ____, and reliability must be ____. There are several ways to estimate a test’s reliability. Each involves assessing the ____ of an examinee’s ____ over time, across different ____ ____, or across ____ ____ and is based on the assumption that variability that is consistent is ____ ____ ____, while variability that is inconsistent reflects ____ (____) ____.
Directly; Not Possible; Estimated; Consistency; Scores; Content Samples; Different Scorers; True Score Variability; Measurement (Random) Error
Most methods for estimating reliability produce a ____ ____, which is a correlation coefficient that ranges in value from _ to ____. When a test’s reliability coefficient is _, this means that all variability in obtained test scores is due to measurement error. Conversely, when a test’s reliability coefficient is ____, this indicates that all variability in scores reflects true score variability. The reliability coefficient is symbolized with the letter “_” and a subscript that contains two of the ____ ____ or ____ (e.g., “rxx”). The ____ indicates that the correlation coefficient was calculated by correlating a test with itself rather than with some other measure.
Reliability Coefficient; 0.0 to +1.0; 0.0; +1.0; r; Same Letters or Numbers; Subscript
Regardless of the method used to calculate a reliability coefficient, the coefficient is interpreted directly as the ____ of ____ in ____ ____ ____ that reflects ____ ____ ____. For example, a reliability coefficient of .84 indicates that 84% of variability in scores is due to true score differences among examinees, while the remaining 16% (1.00 − .84) is due to measurement error.
Proportion of Variability in Obtained Test Scores; True Score Variability
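The interpretation of a reliability coefficient can be checked with simple arithmetic; this sketch just restates the .84 example above:

```python
# Interpreting a reliability coefficient of .84:
r_xx = 0.84
true_score_share = r_xx       # proportion of score variability that is "true"
error_share = 1.0 - r_xx      # proportion due to measurement error
print(true_score_share, round(error_share, 2))  # -> 0.84 0.16
```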
Note that a reliability coefficient does not provide any information about what is actually being ____ by a test. A reliability coefficient only indicates whether the attribute measured by the test — whatever it is — is being assessed in a ____, ____ ____. Whether the test is actually assessing what it was designed to measure is addressed by an analysis of the test’s ____.
Measured; Consistent, Precise Way; Validity