Item Analysis and Reliability Flashcards

1
Q

A “____” is a systematic method for measuring a sample of behavior. Although the exact procedures used to develop a test depend on its ____ and ____, ____ ____ ordinarily involves specifying the test’s purpose, generating test items, administering the items to a sample of examinees for the purpose of item analysis, evaluating the test’s reliability and validity, and establishing norms. In this chapter, the basic principles of test construction are summarized.

A

Test; Purpose and Format; Test Construction

2
Q

Two measurement frameworks are most commonly used to construct and evaluate tests: ____ ____ ____ and ____ ____ ____.

A

Classical Test Theory and Item Response Theory

4
Q

Classical test theory (CTT) has a longer history than item response theory, and its theoretical assumptions are considered to be ____ (i.e., they can be easily met by ____ ____ ____). It also focuses more on ____-____ than on item-level information (i.e., on relating an examinee’s obtained test score to their true test score), and it is most useful when the size of the sample being used to develop the test is ____.

A

Weaker; Most Test Data; Test-Level; Small

5
Q

In contrast, the underlying assumptions of item response theory (IRT) are ____ ____ (more difficult to meet). It focuses on ____-____ ____ (i.e., on the relationship between an examinee’s response to a test item and his/her status regarding the latent trait being measured by the item), and it requires a ____ ____.

A

More Stringent; Item-Level Information; Large Sample

6
Q

____ ____ ____ (___) provides two methods of item analysis — item difficulty and item discrimination — as well as methods for assessing test reliability.

A

Classical Test Theory (CTT)

7
Q

____ ____ is measured by calculating an item difficulty index (p).

A

Item Difficulty

8
Q

The value of p ranges from _ to _, with larger values indicating ____ ____: When p is equal to 0, this indicates that ____ of the examinees in the tryout sample answered the item correctly, and when p is equal to 1.0, this means that the item was answered ____ by ____ examinees.

A

0 to 1.0; Easier Items; None; Correctly; All

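The p index described above is simple enough to compute directly. A minimal sketch in Python, using hypothetical 0/1 item responses from a tryout sample:

```python
# Item difficulty index: p = (number answering correctly) / (total examinees).
# The response vector below is hypothetical (1 = correct, 0 = incorrect).
def item_difficulty(responses):
    return sum(responses) / len(responses)

tryout = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]  # 10 examinees, 7 correct
p = item_difficulty(tryout)
print(p)  # 0.7 -> a fairly easy item; p = .50 is the usual target
```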
9
Q

For many tests, items with ____ ____ ____ (p values close to .50) are retained. This strategy is useful because it increases test score ____, helps ensure that scores will be normally distributed, provides maximum ____ between examinees, and helps maximize the test’s ____. The optimal difficulty level is affected, however, by several factors. One factor is the likelihood that an examinee can choose the correct answer by ____, with the preferred level being ____ ____ _ (100% of examinees answered the item correctly) and the level of success expected by chance alone.

A

Moderate Difficulty Levels; Variability; Discrimination; Reliability; Guessing; Halfway Between 1.0

10
Q

For true/false items, the probability of obtaining a correct answer by chance alone is .50, so the preferred difficulty level is .75, which is ____ between 1.0 and .50. Another factor that affects the optimal item difficulty level is the ____ of ____. If the goal is to choose a certain number of examinees, the optimal difficulty level corresponds to the proportion of examinees to be selected. For a graduate school admissions test, if only 15% of applicants are to be admitted, items will be chosen so that the average difficulty level for all items included in the test is about .15.

A

Halfway; Goal of Testing

11
Q

____ ____ refers to the extent to which a test item discriminates (differentiates) between examinees who obtain high versus low scores on the entire test and is measured with the ____ ____ ____ (_). This requires identifying the examinees in the tryout sample who obtained the ____ and ____ ____ on the ____ (often the upper 27% and lower 27%) and, for each item, ____ the percent of examinees in the ____-____ ____ (_) from the percent of examinees in the ____-____ ____ (_) who answered the item correctly: D = U – L

A

Item Discrimination; Item Discrimination Index (D); Highest and Lowest Scores on the Test; Subtracting; Lower-Scoring Group (L); Upper-Scoring Group (U)

12
Q

The item discrimination index ranges from __ to __. If all examinees in the upper group and none in the lower group answered the item correctly, D is equal to __. If none of the examinees in the upper group and all examinees in the lower group answered the item correctly, D equals __. For most tests, an item with a discrimination index of __ or higher is considered acceptable. As noted above, items with moderate difficulty levels (around .50) have the greatest potential for ____ ____.

A

-1.0 to +1.0; +1.0; –1.0; .35; Maximum Discrimination

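The D = U – L calculation can be sketched as follows; the tryout data and the handling of the 27% cutoff are illustrative assumptions:

```python
# Item discrimination index D = U - L, where U and L are the proportions of
# examinees in the upper- and lower-scoring 27% who answered the item correctly.
# Input: hypothetical (total_test_score, item_correct) pairs for one item.
def discrimination_index(scores_and_item, fraction=0.27):
    ranked = sorted(scores_and_item, key=lambda t: t[0])
    n = max(1, round(len(ranked) * fraction))   # size of each extreme group
    lower, upper = ranked[:n], ranked[-n:]
    L = sum(item for _, item in lower) / n
    U = sum(item for _, item in upper) / n
    return U - L

data = [(95, 1), (90, 1), (88, 1), (70, 1), (65, 0),
        (60, 1), (55, 0), (40, 0), (35, 0), (30, 0)]
print(discrimination_index(data))  # upper 3 all correct, lower 3 all wrong -> 1.0
```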
13
Q

For the exam, you want to know the range of p and D and how to interpret a specific value — e.g., ____________________________________. Also, keep in mind that, in general, a ____________________________________.

A

Know that a p value of .50 means that 50% of examinees answered the item correctly and that a D value of +1.0 means that all examinees in the highest scoring group and none of the examinees in the lowest scoring group answered the item correctly; p value of .50 is preferred, but that the optimal value depends on the likelihood that the correct answer can be chosen by guessing

14
Q

Critics of CTT point out that it has several limitations. One of the biggest problems is that item and test parameters are ____-____. That is, the item difficulty and item discrimination indices, the reliability coefficient, and other measures derived from CTT are likely to ____ from ____ to ____. Another problem is that it is ____ to equate scores obtained on different tests that have been developed on the basis of CTT: A total score of 50 on an English test does not necessarily mean the same thing as a total score of 50 on a math test or a different English test. According to its advocates, ___ overcomes these problems and has several other advantages as well.

A

Sample-Dependent; Vary from Sample to Sample; Difficult; IRT (Item Response Theory)

15
Q

The item characteristics (parameters) derived from IRT are considered to be ____ ____ — i.e., they are the same across different samples. Also, because test scores are reported in terms of an ____ ____ on the ____ ____ ____ (rather than in terms of a total test score), it is possible to equate scores from different sets of items and from different tests. Another advantage is that the use of IRT makes it easier to develop ____-____ ____, in which the administration of subsequent items is based on the examinee’s performance on ____ ____.

A

Sample Invariant; Examinee’s Level; Trait Being Measured; Computer-Adaptive Tests; Previous Items

16
Q

When using IRT, an ____ ____ ____ (___) is constructed for each item by plotting the proportion of examinees in the tryout sample who answered the item correctly against either the total test score, performance on an external criterion, or a mathematically derived estimate of the latent ability or trait being measured by the item. The curve provides information on the relationship between an ____ ____ on that ____ or ____ and the ____ that he/she will respond to the item ____. The various IRT models produce ICCs that provide information on either one, two, or three parameters. The ICC in Figure 1 provides information on all three parameters — ____, ____, and ____ of ____ ____.

A

Item Characteristic Curve (ICC); Examinee’s Level; Ability or Trait; Probability; Correctly; Difficulty, Discrimination, and Probability of Guessing Correctly

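A common way to generate an ICC with all three parameters is the three-parameter logistic (3PL) model. The source does not name a specific model, so the 3PL form is an assumption here, and the parameter values (a, b, c) are hypothetical:

```python
import math

# Three-parameter logistic (3PL) IRT model: probability that an examinee with
# ability theta answers the item correctly, given difficulty b, discrimination a,
# and guessing parameter c. All parameter values are illustrative only.
def p_correct(theta, a=1.5, b=0.0, c=0.20):
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

# At theta = b the curve passes through the midpoint between c and 1.0:
print(round(p_correct(0.0), 2))  # 0.6 = 0.20 + (1 - 0.20) / 2
```

The guessing parameter c is the lower asymptote (where the curve meets the vertical axis for very low ability), and a steeper slope a means better discrimination, matching the card's description of Figure 1.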
17
Q

An item’s ____ of ____ is indicated by the ability level at which 50% of examinees in the tryout sample provided a correct response. The difficulty level for the item depicted in Figure 1 is about 0, which corresponds to an ____ ____ ____ and indicates that the item is of ____ ____ — i.e., a person with average ability has a _% chance of answering this item correctly.

A

Level of Difficulty; Average Ability Level; Medium Difficulty; 50%

18
Q

The item’s ____ to ____ between high and low achievers is indicated by the slope of the curve — the ____ the slope, the greater the discrimination. The item depicted in Figure 1 has good discrimination: It indicates that examinees with low ability (below 0) are more likely to answer the item ____, while those with high ability (above 0) are more likely to answer it ____.

A

Ability to Discriminate; Steeper; Incorrectly; Correctly

19
Q

The ____ of ____ ____ is indicated by the point at which the ICC intercepts the vertical axis. Figure 1 indicates that there is a low probability of guessing correctly for this item: Only a small proportion of examinees with very low ability answered the item ____.

A

Probability of Guessing Correctly; Correctly

20
Q

For the exam, keep “____ ____ ____” linked with its abbreviation and know how difficulty level, discrimination, and probability of guessing correctly are indicated by the ___.

A

Item Characteristic Curve; ICC

21
Q

An item analysis is conducted to determine which items to retain in the final version of a test. An item difficulty index (p) is calculated by dividing the number of examinees who answered the item correctly by the (1) ________. It ranges in value from (2) ____. In general, an item difficulty level of (3) ____ is preferred because it not only maximizes (4) ____ between examinees of low and high ability but also helps ensure that the test has high (5) ____.

A

(1) total number of examinees; (2) 0 to 1.0; (3) .50; (4) discrimination; (5) reliability

22
Q

However, the optimal difficulty level is affected by the probability that an examinee can answer the item correctly by guessing. For this reason, the optimal p value for true/false items is (6) ____. An (7) ____ index (D) is calculated by subtracting the percent of examinees in the lower-scoring group from the percent of examinees in the upper-scoring group who answered the item correctly. It ranges in value from (8) ____.

A

(6) .75; (7) item discrimination; (8) -1.0 to +1.0

23
Q

Advantages of item response theory (IRT) are that item parameters are (9) ____ invariant and performance on different sets of items or tests can be easily (10) ____. Use of IRT involves deriving an item (11) ____ for each item that provides information on one, two, or three parameters — i.e., difficulty, discrimination, and (12) _________.

A

(9) sample; (10) equated; (11) characteristic curve; (12) probability of guessing correctly

24
Q

From the perspective of ____ ____ ____, an examinee’s obtained test score (X) is composed of two components, a true score component (T) and an error component (E): X = T + E

A

Classical Test Theory

25
Q

The ____ ____ ____ reflects the examinee’s status with regard to the attribute that is measured by the test, while the ____ ____ represents measurement error. Measurement error is ____ ____: It is due to factors that are ____ to what is being ____ by the ____ and that have an ____ (unsystematic) ____ on an examinee’s test score. The score you obtain on the licensing exam (X) is likely to be due both to the knowledge you have about the topics addressed by exam items (T) and the effects of random factors (E) such as the way test items are written, any alterations in anxiety, attention, or motivation you experience while taking the test, and the accuracy of your “educated guesses.”

A

True Score Component; Error Component; Random Error; Irrelevant; Measured; Test; Unpredictable Effect

26
Q

Whenever we administer a test to examinees, we would like to know how much of their scores reflects “____” and how much reflects ____. A measure of ____ provides us with an estimate of the proportion of variability that is due to true differences among examinees on the attribute(s) measured by the test. When a test is reliable, it provides ____, ____ ____ and, for this reason, the term ____ is often given as a synonym for reliability.

A

Truth; Error; Reliability; Dependable; Consistent Results; Consistency

27
Q

Ideally, a test’s reliability (true score variability) could be measured ____. However, this is ____ ____, and reliability must be ____. There are several ways to estimate a test’s reliability. Each involves assessing the ____ of an examinee’s ____ over time, across different ____ ____, or across ____ ____ and is based on the assumption that variability that is consistent is ____ ____ ____, while variability that is inconsistent reflects ____ (____) ____.

A

Directly; Not Possible; Estimated; Consistency; Scores; Content Samples; Different Scorers; True Score Variability; Measurement (Random) Error

28
Q

Most methods for estimating reliability produce a ____ ____, which is a correlation coefficient that ranges in value from _ to _. When a test’s reliability coefficient is _, this means that all variability in obtained test scores is due to measurement error. Conversely, when a test’s reliability coefficient is _, this indicates that all variability in scores reflects true score variability. The reliability coefficient is symbolized with the letter “_” and a subscript that contains two of the ____ ____ or ____ (e.g., “rxx”). The ____ indicates that the correlation coefficient was calculated by correlating a test with itself rather than with some other measure.

A

Reliability Coefficient; 0.0 to +1.0; 0.0; +1.0; r; Same Letters or Numbers; Subscript

29
Q

Regardless of the method used to calculate a reliability coefficient, the coefficient is interpreted directly as the ____ of ____ in ____ ____ ____ that reflects ____ ____ ____. For example, a reliability coefficient of .84 indicates that 84% of variability in scores is due to true score differences among examinees, while the remaining 16% (1.00 − .84) is due to measurement error.

A

Proportion of Variability in Obtained Test Scores; True Score Variability

30
Q

Note that a reliability coefficient does not provide any information about what is actually being ____ by a test. A reliability coefficient only indicates whether the attribute measured by the test — whatever it is — is being assessed in a ____, ____ ____. Whether the test is actually assessing what it was designed to measure is addressed by an analysis of the test’s ____.

A

Measured; Consistent, Precise Way; Validity

31
Q

In contrast to other correlation coefficients, the reliability coefficient is ____ ____ when interpreting it but is interpreted ____ as a measure of ____ ____ ____. When a test has a reliability coefficient of .89, this means that 89% of variability in obtained scores is true score variability.

A

Never Squared; Directly; True Score Variability

32
Q

The selection of a method for estimating reliability depends on the ____ of the ____. As noted below, each method entails ____ ____ and is affected by different sources of ____. For many tests, ____ than ____ ____ should be used.

A

Nature of the Test; Different Procedures; Error; More than One Method

33
Q

____-____ ____: The test-retest method for estimating reliability involves administering the same test to the same group of examinees on two different occasions and then correlating the two sets of scores. When using this method, the ____ ____ indicates the degree of stability (consistency) of examinees’ scores over time and is also known as the ____ of ____.

A

Test-Retest Reliability; Reliability Coefficient; Coefficient of Stability
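The test-retest coefficient is just the Pearson correlation between the two administrations. A minimal sketch with hypothetical scores for five examinees:

```python
# Test-retest reliability (coefficient of stability): correlate scores from two
# administrations of the same test to the same examinees. Data are hypothetical.
def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

time1 = [10, 12, 15, 18, 20]   # first administration
time2 = [11, 13, 14, 19, 21]   # second administration, weeks later
print(round(pearson_r(time1, time2), 3))  # coefficient of stability
```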

34
Q

The primary sources of measurement error for test-retest reliability are any random factors related to the ____ that ____ between the ____ ____ of the ____. These time sampling factors include random fluctuations in ____ ____ ____ (e.g., changes in anxiety or motivation) and random variations in the ____ ____.

A

Time that Passes; Two Administrations of the Test; Examinees Over Time; Testing Situation

35
Q

Test-retest reliability is appropriate for determining the ____ of ____ designed to measure attributes that are ____ ____ ____ ____ and that are not affected by ____ ____. It would be appropriate for a ____ of ____, which is a stable characteristic, but not for a ____ of ____, since mood ____ ____ ____, or a test of ____, which might be affected by previous exposure to test items.

A

Reliability of Tests; Relatively Stable Over Time; Repeated Measurement; Test of Aptitude; Test of Mood; Fluctuates Over Time; Creativity

36
Q

____ (____, ____) ____ ____: To assess a test’s ____ ____ ____, two equivalent forms of the test are administered to the same group of examinees and the two sets of scores are correlated. Alternate forms reliability indicates the ____ of ____ to different ____ ____ (the two test forms) and, when the forms are administered at different times, the consistency of ____ ____ ____.

A

Alternate (Equivalent, Parallel) Forms Reliability; Alternate Forms Reliability; Consistency of Responding; Item Samples; Responding Over Time

37
Q

The alternate forms reliability coefficient is also called the ____ of ____ when the two forms are administered at about the same time and the ____ of ____ and ____ when a relatively long period of time separates administration of the two forms. The primary source of measurement error for alternate forms reliability is ____ ____, or error introduced by an ____ between different examinees’ ____ and the different ____ ____ by the ____ ____ in the ____ ____. When administration of the two forms is separated by a period of time, ____ ____ ____ also contribute to error.

A

Coefficient of Equivalence; Coefficient of Equivalence and Stability; Content Sampling; Interaction; Knowledge; Content Assessed; Items Included; Two Forms; Time Sampling Factors

38
Q

Like test-retest reliability, alternate forms reliability is not appropriate when the attribute measured by the test is likely to ____ ____ ____ and the forms will be administered at ____ ____ or when scores are likely to be affected by ____ ____ (e.g., by practice effects). Although alternate forms reliability is considered by some experts to be the ____ ____ (and ____) ____ for estimating ____, it is not often assessed due to the ____ in ____ ____ that are ____ ____.

A

Fluctuate Over Time; Different Times; Repeated Measurement; Most Rigorous (and Best) Method; Reliability; Difficulty in Developing Forms that are Truly Equivalent

39
Q

____ ____ ____: Reliability can also be estimated by measuring the ____ ____ of a test. ____-____ ____ and ____ ____ are two methods for evaluating internal consistency. Both involve administering the test ____ to a ____ ____ of ____, and both yield a reliability coefficient that is also known as the ____ of ____ ____.

A

Internal Consistency Reliability; Internal Consistency; Split-Half Reliability and Coefficient Alpha; Once to a Single Group of Examinees; Coefficient of Internal Consistency

40
Q

To determine a test’s ____-____ ____, the test is split into equal halves so that each examinee has two scores (one for each half of the test), and scores on the two halves are then correlated. Tests can be split in several ways, but probably the most common way is to divide the test on the basis of ____-____ ____-____ ____. A problem with the split-half method is that it produces a reliability coefficient that is based on test scores that were derived from ____-____ of the ____ ____ of the ____. If a test contains 30 items, each score is based on 15 items.

A

Split-Half Reliability; Odd-Versus Even-Numbered Items; One-Half of the Entire Length of the Test

41
Q

Because reliability tends to decrease as the length of a test ____, the split-half reliability coefficient usually ____ a test’s ____ ____. For this reason, the split-half reliability coefficient is ordinarily corrected using the ____-____ ____ ____, which provides an estimate of what the reliability coefficient would have been had it been based on the full length of the test.

A

Decreases; Underestimates a test’s True Reliability; Spearman-Brown Prophecy Formula
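The Spearman-Brown correction for a half-length split is r_full = 2·r_half / (1 + r_half). A sketch, with an illustrative half-test correlation:

```python
# Spearman-Brown correction for split-half reliability: estimates the
# reliability of the full-length test from the correlation between its halves.
def spearman_brown(r_half):
    return 2 * r_half / (1 + r_half)

# Hypothetical odd/even split correlation of .70 for a 30-item test:
print(round(spearman_brown(0.70), 3))  # 0.824, higher than the uncorrected .70
```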

42
Q

Cronbach’s ____ ____ also involves administering the test once to a single group of examinees. However, rather than splitting the test in half, a ____ ____ is used to determine the ____ ____ of ____-____ ____. One way to interpret coefficient alpha is as the ____ ____ that would be obtained from ____ ____ ____ of the ____. Coefficient alpha tends to be ____ and can be considered the ____ ____ of a test’s ____ (Novick & Lewis, 1967). When test items are scored dichotomously (right or wrong), a variation of coefficient alpha known as the ____-____ ____ _ (KR-20) can be used.

A

Coefficient Alpha; Special Formula; Average Degree of Inter-Item Consistency; Average Reliability; All Possible Splits of the Test; Conservative; Lower Boundary; Reliability; Kuder-Richardson Formula 20
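Coefficient alpha can be computed from the item variances and the total-score variance: alpha = k/(k−1) × (1 − sum of item variances / variance of total scores). The dichotomous item-response matrix below is hypothetical (and with 0/1 scoring this computation coincides with KR-20):

```python
from statistics import pvariance

# Cronbach's coefficient alpha from an examinee-by-item score matrix
# (rows = examinees, columns = items). Data below are hypothetical 0/1 scores.
def cronbach_alpha(rows):
    k = len(rows[0])
    items = list(zip(*rows))                       # columns = items
    item_vars = sum(pvariance(col) for col in items)
    total_var = pvariance([sum(r) for r in rows])
    return k / (k - 1) * (1 - item_vars / total_var)

data = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]  # 4 examinees, 3 items
print(cronbach_alpha(data))  # 0.75 for this perfectly ordered response pattern
```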

43
Q

____ ____ is a source of error for both split-half reliability and coefficient alpha. For split-half reliability, content sampling refers to the ____ resulting from differences between the ____ of the ____ ____ of the ____ (i.e., the items included in one half may better fit the knowledge of some examinees than items in the other half); for coefficient alpha, content (item) sampling refers to differences between ____ ____ ____ rather than between test halves. For coefficient alpha, the ____ of the ____ ____ is an additional source of ____.

A

Content Sampling; Error; Content; Two Halves of the Test; Individual Test Items; Heterogeneity of the Content Domain; Error

44
Q

A test is ____ with regard to content domain when its items measure several different domains of knowledge or behavior. The greater the heterogeneity of the content domain, the lower the ____-____ ____ and the ____ the ____ of ____ ____. Coefficient alpha could be expected to be ____ for a 200-item test that contains items assessing knowledge of test construction, statistics, ethics, industrial-organizational psychology, clinical psychology, and psychopathology than for a 200-item test that contains questions on test construction only.

A

Heterogeneous; Inter-Item Correlations; Lower the Magnitude of Coefficient Alpha; Smaller

45
Q

The methods for assessing internal consistency reliability are useful when a test is designed to measure a ____ ____, when the characteristic measured by the test ____ ____ ____, or when scores are likely to be affected by ____ ____ to the ____. They are not appropriate for assessing the reliability of ____ ____ because, for these tests, they tend to produce spuriously high coefficients. (A ____ ____ has a fixed time limit that allows few, if any, examinees to respond to all items, and an examinee’s performance on the test depends on their ____ of ____.) For speed tests, ____ ____ ____ is usually the best choice.

A

Single Characteristic; Fluctuates Over Time; Repeated Exposure to the Test; Speed Tests; Speed Test; Speed of Responding; Alternate Forms Reliability

46
Q

____-____ (____-____, ____-____) ____: ____-____ ____ is of concern whenever test scores depend on a rater’s judgment. A test constructor would want to make sure that an essay test, a behavioral observation scale, or a projective personality test has adequate ____-____ ____. This type of reliability is assessed by calculating a ____ ____ or the ____ ____ for the scores or ratings assigned by two or more raters.

A

Inter-Rater (Inter-Scorer, Inter-Observer) Reliability; Inter-Rater Reliability; Inter-Rater Reliability; Correlation Coefficient; Percent Agreement

47
Q

____ ____: The ____ ____ and ____ of ____ are two correlation coefficients that are used to measure inter-rater reliability. The ____ ____ (k) is also known as the kappa coefficient and Cohen’s kappa and is used when scores or ratings represent a nominal or ordinal scale of measurement. The ____ of ____ is also known as Kendall’s coefficient of concordance and is used to assess inter-rater reliability when there are three or more raters and ratings are reported as ranks.

A

Correlation Coefficient; Kappa Statistic and Coefficient of Concordance; Kappa Statistic; Coefficient of Concordance

48
Q

____ ____: ____ ____ is calculated by dividing the number of items or observations in which raters are in agreement by the total number of items or observations. For example, if two raters assign the same ratings to 40 of 50 behavioral observations, the percent agreement is 80%. Percent agreement is ____ to ____ and ____, but it can lead to ____ ____ because it does not take into account the ____ of ____ that would have occurred by ____ ____. This is a particular problem for ____ ____ ____ that require raters to record the ____ of a ____ ____. In this situation, the degree of chance agreement is high whenever the behavior has a high rate of occurrence, and percent agreement will provide an ____ ____ of the measure’s ____.

A

Percent Agreement; Percent Agreement; Easy to Calculate and Interpret; Erroneous Conclusions; Level of Agreement; Chance Alone; Behavioral Observation Scales; Frequency of a Specific Behavior; Inflated Estimate; Reliability
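The contrast between percent agreement and a chance-corrected index can be seen with Cohen's kappa, which subtracts the agreement expected by chance alone. The two raters' nominal ratings below are hypothetical:

```python
# Percent agreement vs. Cohen's kappa for two raters assigning nominal
# categories. Kappa corrects for chance agreement, which inflates percent
# agreement when one category has a high base rate. Ratings are hypothetical.
def percent_agreement(r1, r2):
    return sum(a == b for a, b in zip(r1, r2)) / len(r1)

def cohens_kappa(r1, r2):
    n = len(r1)
    cats = set(r1) | set(r2)
    po = percent_agreement(r1, r2)                              # observed
    pe = sum((r1.count(c) / n) * (r2.count(c) / n) for c in cats)  # chance
    return (po - pe) / (1 - pe)

rater1 = ["yes", "yes", "yes", "yes", "no", "yes", "yes", "no", "yes", "yes"]
rater2 = ["yes", "yes", "yes", "yes", "yes", "yes", "yes", "no", "yes", "no"]
print(percent_agreement(rater1, rater2))          # 0.8 looks high...
print(round(cohens_kappa(rater1, rater2), 3))     # 0.375: much of it is chance
```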

49
Q

Sources of error for inter-rater reliability include factors related to the ____ such as lack of motivation and rater biases and ____ of the ____ ____. An inter-rater reliability coefficient is likely to be ____, for instance, when rating categories are not exhaustive (i.e., don’t include all possible responses or behaviors) and/or are not ____ ____. In addition, the inter-rater reliability of a behavioral rating scale may be affected by ____ ____ ____, which occurs when two (or more) observers working together influence each other’s ratings so that they both assign ratings in a similarly idiosyncratic way.

A

Raters; Characteristic of the Measuring Device; Low; Mutually Exclusive; Consensual Observer Drift

50
Q

Unlike other sources of error, consensual observer drift tends to ____ ____ ____-____ ____. The reliability (and validity) of ratings can be improved in several ways, but, overall, the best way is to provide raters with ____ ____ and ____ ____.

A

Artificially Inflate Inter-Rater Reliability; Adequate Training and Periodic Retraining

51
Q

Know that the Spearman-Brown formula is linked with ____-____ ____, KR-20 with ____ ____, and the kappa statistic with ____-____ ____. Also know that ____ ____ ____ is the most rigorous method for estimating reliability and that internal consistency reliability is not appropriate for ____ ____.

A

Split-Half Reliability; Coefficient Alpha; Inter-Rater Reliability; Alternate Forms Reliability; Speed Tests

52
Q

From the perspective of (1) ____ test theory, variability in test scores reflects two factors: true differences between examinees on the attribute measured by the test and differences due to (2) ____. Reliability is a measure of the amount of variability in obtained test scores that is due to (3) ____ variability.

A

(1) classical; (2) measurement (random) error; (3) true score

53
Q

A test’s reliability is commonly estimated by calculating a reliability coefficient, which is a type of (4) ____ coefficient. The reliability coefficient ranges in value from (5) ____ and is interpreted directly as a measure of (6) ____ variability. For example, if a test has a reliability coefficient of .91, this means that (7) ____% of variability in obtained test scores is due to (8) ____ variability, while the remaining 9% reflects (9) ___.

A

(4) correlation; (5) 0 to +1.0; (6) true score; (7) 91; (8) true score; (9) measurement error

54
Q

Test-retest reliability is assessed by administering a test to the same group of examinees at two different (10) ____ and then (11) ____ the two sets of scores. The test-retest reliability coefficient is also known as the coefficient of (12) ____. An alternate forms reliability coefficient is calculated by administering two (13) ____ of a test to the same group of examinees and correlating the two sets of scores. The alternate forms reliability coefficient is also referred to as the coefficient of (14) ___________.

A

(10) times; (11) correlating; (12) stability; (13) equivalent forms; (14) equivalence (or equivalence and stability when there is a long period of time between administration of the two forms)

55
Q

To assess internal consistency reliability, a test is administered once to a single group of examinees. A (15) ____ reliability coefficient is calculated by splitting the test in half and correlating examinees’ scores on the two halves. Because the size of a reliability coefficient is affected by test length, the split-half method tends to (16) ____ a test’s true reliability. Consequently, the (17) ____ formula is often used in conjunction with split-half reliability to obtain an estimate of what the test’s true reliability is.

A

(15) split-half; (16) underestimate; (17) Spearman-Brown

56
Q

Coefficient (18) ____, another method used to assess internal consistency reliability, indicates the average inter-item consistency rather than the consistency between two halves of the test. The Kuder-Richardson Formula 20 can be used as a substitute for coefficient alpha when test items are scored (19) ____. Split-half reliability, coefficient alpha, and KR-20 are not appropriate for speed tests because they tend to (20) ____ the reliability of these tests.

A

(18) alpha; (19) dichotomously; (20) overestimate

57
Q

Finally, inter-rater reliability should be assessed whenever a test is (21) ____ scored. The scores assigned by different raters can be used to calculate a (22) ____ coefficient — for example, the (23) ____ statistic which can be used when ratings represent a nominal or ordinal scale of measurement. Alternatively, percent agreement between raters can be calculated. A problem with this approach is that the resulting index of reliability can be artificially inflated by the effects of (24) ____.

A

(21) subjectively; (22) correlation (reliability); (23) kappa; (24) chance agreement

58
Q

The magnitude of the reliability coefficient is affected not only by the sources of error described above but also by the ____ of the ____, the ____ of the ____ ____, and the probability that the correct response to items can be selected by ____.

A

Length of the Test; Range of the Test Scores; Guessing

59
Q

____ ____: The ____ the ____ of the ____ ____ ____ by a test, the less the relative effects of measurement error and the more likely the sample will provide dependable, consistent information. Consequently, a general rule is that the longer the ____ ____, the larger the test’s reliability coefficient.

A

Test Length; Larger the Sample; Attribute Being Measured; Test Length

60
Q

The ____-____ ____ ____ is most associated with split-half reliability but can be used whenever a test developer wants to estimate the effects of ____ or ____ a ____ on its ____ ____. For instance, if a 100-item test has a reliability coefficient of .84, the Spearman-Brown formula could be used to estimate the effects of increasing the number of items to 150 or reducing the number to 50.

A

Spearman-Brown Prophecy Formula; Lengthening or Shortening a Test; Reliability Coefficient
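The Spearman-Brown prophecy formula itself is r_new = (k × r_old) / (1 + (k − 1) × r_old), where k is the factor by which the test is lengthened or shortened. A minimal Python sketch of the card's 100-item example:

```python
def spearman_brown(r_old: float, k: float) -> float:
    """Estimate reliability after changing test length by factor k.
    r_new = (k * r_old) / (1 + (k - 1) * r_old)
    """
    return (k * r_old) / (1 + (k - 1) * r_old)

# The card's example: a 100-item test with a reliability coefficient of .84.
print(round(spearman_brown(0.84, 150 / 100), 2))  # lengthened to 150 items -> 0.89
print(round(spearman_brown(0.84, 50 / 100), 2))   # shortened to 50 items  -> 0.72
```

As the next card notes, these are best-case estimates: if the added items tap a different content domain or are noisier than the originals, the actual reliability will fall short of the formula's prediction.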

61
Q

A problem with the Spearman-Brown formula is that it does not always yield an ____ ____ of ____: In general, it tends to ____ a test’s ____ ____, and this is most likely to be the case when added items do not measure the ____ ____ ____ as the ____ ____ and/or are more susceptible to the ____ of ____ ____.

A

Accurate Estimate of Reliability; Overestimate; True Reliability; Same Content Domain as the Original Items; Effects of Measurement Error

62
Q

____ of ____ ____: Since the reliability coefficient is a correlation coefficient, it is maximized when the ____ of ____ is unrestricted. The range is directly affected by the ____ of ____ of ____ with regard to the ____ ____ by the ____: When examinees are ____, the range of scores is ____. The range is also affected by the ____ ____ of the ____ ____. When all items are either very difficult or very easy, all examinees will obtain either ____ or ____ ____, resulting in a ____ ____. Therefore, the best strategy is to choose items so that the average difficulty level is in the ____-____ (p = .50).

A

Range of Test Scores; Range of Scores; Degree of Similarity of Examinees; Attribute Measured by the Test; Heterogeneous; Maximized; Difficulty Level of the Test Items; Low or High Scores; Restricted Range; Mid-Range

63
Q

____: A test’s reliability coefficient is also affected by the probability that examinees can ____ the ____ ____ to test items. As the probability of correctly guessing answers increases, the reliability coefficient ____. All other things being equal, a true/false test will have a ____ ____ ____ than a four-alternative multiple-choice test which, in turn, will have a ____ ____ ____ than a free recall test.

A

Guessing; Guess the Correct Answers; Decreases; Lower Reliability Coefficient; Lower Reliability Coefficient

64
Q

The interpretation of a test’s reliability entails considering its effects on the scores achieved by a ____ of ____ as well as the score obtained by a ____ ____.

A

Group of Examinees; Single Examinee

65
Q

The ____ ____: As discussed above, a ____ ____ is interpreted directly as the proportion of variability in a set of test scores that is attributable to true score variability. A reliability coefficient of .84 indicates that 84% of variability in test scores is due to ____ ____ ____ among examinees, while the remaining 16% is due to ____ ____. Although different types of tests can be expected to have different levels of reliability, for most tests, reliability coefficients of ____ or ____ are considered acceptable.

A

The Reliability Coefficient; Reliability Coefficient; True Score Difference; Measurement Error; .80 or Larger

66
Q

The ____ ____ of ____: The reliability coefficient is useful for estimating the proportion of ____ ____ ____ in a set of ____ ____, but it is not particularly helpful for interpreting an ____ ____ obtained test score. When an examinee receives a score of 80 on a 100-item test that has a reliability coefficient of .84, for instance, we can only conclude that, since the test is not perfectly reliable, the examinee’s obtained score might or might not be his or her ____ ____.

A

The Standard Error of Measurement; True Score Variability; Test Scores; Individual Examinee’s; True Score

67
Q

A common practice when interpreting an examinee’s obtained score is to construct a confidence interval around that score. The ____ ____ helps a test user estimate the ____ within which an examinee’s ____ ____ is likely to ____ given his or her ____ ____. This range is calculated using the ____ ____ of ____ (___), which is an index of the amount of error that can be expected in obtained scores due to the unreliability of the test. (When raw scores have been converted to percentile ranks, the confidence interval is referred to as a ____ ____.)

A

Confidence Interval; Range; True Score; Fall; Obtained Score; Standard Error of Measurement (SEM); Percentile Band

68
Q

As shown by the formula, the magnitude of the standard error of measurement is affected by the ____ ____ of the ____ ____ and the ____ ____ ____: The ____ the test’s standard deviation and the ____ its reliability coefficient, the ____ the standard error of measurement (and vice versa). For example, when the reliability coefficient equals 1.0, the standard error equals 0; but when the reliability coefficient equals 0, the standard error is equal to the standard deviation of the test scores.

A

Standard Deviation of the Test Scores and the Test’s Reliability Coefficient; Lower; Higher; Smaller

69
Q

Because the standard error is a type of ____ ____, it can be interpreted in terms of the areas under the ____ ____. Regarding confidence intervals, this means that a 68% confidence interval is constructed by adding and subtracting ____ ____ ____ to an examinee’s obtained score; a 95% confidence interval is constructed by adding and subtracting ____ ____ ____; and a 99% confidence interval is constructed by adding and subtracting ____ ____ ____.

A

Standard Deviation; Areas; Normal Curve; One Standard Error; Two Standard Errors; Three Standard Errors

70
Q

____ ____ of the ____ ____ ____ ____: Test users sometimes calculate a ____ ____ to compare an examinee’s performance on ____ ____ ____, an examinee’s performance on the same test when it is administered to the examinee on ____ ____ ____ (e.g., before and after an intervention has been administered), or the ____ of ____ ____ ____ on the ____ ____. Because each test score contains some degree of ____ ____, a difference score contains ____ ____ from ____ ____ ____ and, consequently, must be ____ with ____.

A

Standard Error of the Difference Between Two Scores; Difference Score; Two Different Tests; Two Different Occasions; Performance of Two Different Examinees on the Same Test; Measurement Error; Measurement Error from Both Test Scores; Interpreted with Caution

71
Q

The ____ ____ of the ____ is used to help a test user determine whether a difference score is significant and is calculated by ____ the ____ of the ___ of the ____ ____ to the ____ of ___ of the ____ ____ and taking the ____ ____ of the ____. It can be interpreted in terms of ____ under the ____ ____ — i.e., when two scores differ by one standard error of the difference, there is a _% chance that the difference between the scores represents a true score difference; when two scores differ by two standard errors of the difference, there is a _% chance that the difference between the scores represents a true score difference; and when two scores differ by three standard errors of the difference, there is a _% chance that the difference between the scores represents a true score difference.

A

Standard Error of the Difference; Adding; Square; SEM; First Score; Square of SEM; Second Score; Square Root; Result; Areas; Normal Curve; 68%; 95%; 99%
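The calculation the card describes (square each SEM, add the squares, take the square root) can be sketched in Python. The numbers are hypothetical: two tests, each assumed to have SD = 10 and a reliability coefficient of .91, so each SEM is 3.0:

```python
import math

def sem(sd: float, reliability: float) -> float:
    """Standard error of measurement: SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - reliability)

def se_difference(sem_1: float, sem_2: float) -> float:
    """Standard error of the difference: sqrt(SEM1^2 + SEM2^2)."""
    return math.sqrt(sem_1 ** 2 + sem_2 ** 2)

# Hypothetical example: each test has SD = 10 and reliability .91,
# so each SEM = 3.0 and the SED = sqrt(9 + 9) ~ 4.24.
s = sem(10, 0.91)
print(round(se_difference(s, s), 2))  # 4.24
```

Note that the result (4.24) is larger than either SEM (3.0), which is the point of the last card in this section: a difference score accumulates measurement error from both scores.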

72
Q

The magnitude of a reliability coefficient is affected by several factors. In general, the longer a test, the (1) ____ its reliability coefficient. The (2) ____ formula is used to estimate the effects of lengthening or (3) ____ a test on its reliability coefficient. If the new items do not represent the same content domain as the original items or are more susceptible to measurement error, this formula is likely to (4) ____ the effects of lengthening the test.

A

(1) larger; (2) Spearman-Brown; (3) shortening; (4) overestimate

73
Q

Like other correlation coefficients, the reliability coefficient is affected by the range of scores: The greater the range, the (5) ____ the reliability coefficient. To maximize a test’s reliability coefficient, the tryout sample should include people who are (6) ____ with regard to the attribute(s) measured by the test. A reliability coefficient is also affected by the probability that an examinee can select the correct answer to a test question simply by guessing. The easier it is to guess the correct answer, the (7) ____ the reliability coefficient.

A

(5) larger; (6) heterogeneous; (7) smaller

74
Q

While the reliability coefficient is useful for assessing the amount of variability in test scores that is due to (8) ____ variability for a group of examinees, it does not directly indicate how much we can expect an individual examinee’s obtained score to reflect his or her true score. The standard error of (9) ____ is useful for this purpose. It is calculated by multiplying the standard deviation of the test scores by the (10) ____ of one minus the reliability coefficient.

A

(8) true score; (9) measurement; (10) square root

75
Q

For example, if a test’s standard deviation is 10 and its reliability coefficient is .91, the standard error of measurement is equal to (11) ____. The standard error of measurement is used to construct a (12) ____ interval around an examinee’s obtained (“measured”) score. In terms of magnitude, the standard error of the difference between two scores is always (13) ____ than the SEM of either score because it reflects measurement error from both test scores.

A

(11) 3.0; (12) confidence; (13) larger
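A quick check of the card's arithmetic (SD = 10, r = .91), plus a confidence interval built the way the earlier cards describe. The obtained score of 80 is a hypothetical value chosen for illustration:

```python
import math

def sem(sd: float, reliability: float) -> float:
    """Standard error of measurement: SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - reliability)

s = sem(10, 0.91)
print(round(s, 1))  # 3.0

# 95% confidence interval: obtained score plus/minus two standard errors.
obtained = 80  # hypothetical obtained score
low, high = obtained - 2 * s, obtained + 2 * s
print(round(low, 1), round(high, 1))  # 74.0 86.0
```

That is, with 95% confidence the examinee's true score falls between 74 and 86; a 68% interval would use one SEM and a 99% interval would use three.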