Test construction Flashcards
A researcher employs multiple methods of measurement in an attempt to increase reliability by reducing systematic error. This strategy is referred to as: Select one: A. calibration B. intraclass correlation (ICC) C. triangulation D. correction for attenuation
Correct Answer is: C
Triangulation is the attempt to increase reliability by reducing systematic or method error through a strategy in which the researcher employs multiple methods of measurement (e.g., observation, survey, archival data). If the alternative methods do not share the same source of systematic error, examination of data from the alternative methods gives insight into how individual scores may be adjusted to come closer to reflecting true scores, thereby increasing reliability.
calibration
Calibration is the attempt to increase reliability by increasing the homogeneity of ratings through feedback to the raters when multiple raters are used. For example, raters might meet during pretesting of the instrument to discuss items on which they have disagreed, seeking to reach consensus on rules for rating items (e.g., defining what constitutes a "2" on an item dealing with job performance).
intraclass correlation (ICC)
Intraclass correlation (ICC) is used to measure inter-rater reliability for two or more raters and may also be used to assess test-retest reliability. ICC may be conceptualized as the ratio of between-groups variance to total variance.
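As a sketch of that conceptualization, one common form, the one-way random-effects ICC(1,1), can be computed from ANOVA mean squares; the ratings below are invented for illustration:

```python
import numpy as np

def icc_one_way(ratings):
    """One-way random-effects ICC(1,1) for a targets-by-raters matrix."""
    x = np.asarray(ratings, dtype=float)
    n, k = x.shape
    grand = x.mean()
    # Between-targets mean square
    msb = k * ((x.mean(axis=1) - grand) ** 2).sum() / (n - 1)
    # Within-targets (error) mean square
    msw = ((x - x.mean(axis=1, keepdims=True)) ** 2).sum() / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)

# Two raters in perfect agreement on three targets -> ICC of 1.0
print(icc_one_way([[1, 1], [2, 2], [3, 3]]))  # 1.0
```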
correction for attenuation
Correction for attenuation is a method used to adjust a correlation coefficient upward to account for measurement error in the two correlated variables; such errors always lower the observed correlation relative to what it would have been if both variables had been measured with perfect reliability.
The reliability statistic that can be interpreted as the average of all possible split-half coefficients is Select one: A. the Spearman-Brown formula. B. Cronbach's coefficient alpha. C. chi-square. D. point-biserial coefficient.
Correct Answer is: B
According to classical test theory, the reliability of a test indicates the degree to which examinees’ scores are free from error and reflect their “true” test score. Reliability is typically measured by obtaining the correlation between scores on the same test, such as by having examinees take then retake the test and correlating both sets of scores (test-retest reliability) or by dividing the test in half and correlating scores on both halves (split-half reliability). Cronbach’s alpha, like split-half reliability, is categorized as an internal consistency reliability coefficient. Its calculation is based on the average of all inter-item correlations, which are correlations between responses on two individual items. Mathematically, Cronbach’s alpha works out to the average of all possible split-half correlations (there are many possible split-half correlations because there are many different ways of splitting the test in half).
Regarding the other choices, the Spearman-Brown formula is used to estimate the effects of lengthening a test on its reliability coefficient. Longer tests are typically more reliable. The Spearman-Brown formula is commonly used to adjust the split-half coefficient to estimate what reliability would have been if the halved tests had as many items as the full test. The chi-square test is used to test predictions about observed versus expected frequency distributions of nominal, or categorical, data; for example, if you flip a coin 100 times, you can use the chi-square test to determine if the distribution of heads versus tails outcomes falls into the expected range or if there is evidence that the coin toss was “fixed.” And the point-biserial correlation coefficient is used to correlate dichotomously scaled variables with interval or ratio data; for example, it can be used to correlate responses on test items scored as correct or incorrect with scores on the test as a whole.
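A minimal sketch of the two reliability calculations above (item scores and coefficient values are invented for illustration): Cronbach's alpha computed from an examinees-by-items score matrix, and the Spearman-Brown prediction for a lengthened test:

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha from an examinees-by-items matrix of item scores."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                          # number of items
    item_vars = scores.var(axis=0, ddof=1)       # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of total test scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

def spearman_brown(r, n):
    """Predicted reliability when a test with reliability r is lengthened n-fold.

    With n = 2, this adjusts a split-half coefficient up to full-test length."""
    return n * r / (1 + (n - 1) * r)

# Perfectly consistent items yield alpha = 1.0
print(cronbach_alpha([[1, 1], [2, 2], [3, 3]]))  # 1.0
# A split-half coefficient of .60 projects to .75 at full length
print(spearman_brown(0.60, 2))                   # approximately 0.75
```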
If a job selection test has lower validity for Hispanics than for Whites or African Americans, you could say that ethnicity is acting as a: Select one: A. confounding variable B. criterion contaminator C. discriminant variable D. moderator variable
Correct Answer is: D
A moderator variable is any variable which moderates, or influences, the relationship between two other variables. If the validity of a job selection test is different for different ethnic groups (i.e. there is differential validity), then ethnicity would be considered a moderator variable since it is influencing the relationship between the test (predictor) and actual job performance (the criterion).
A confounding variable is a variable in a research study that is not of interest to the researcher but exerts a systematic effect on the dependent variable. Criterion contamination is the artificial inflation of validity that can occur when raters subjectively score ratees on a criterion measure after being informed how the ratees scored on the predictor.
In a factor analysis, an eigenvalue corresponds to
Select one:
A. the number of latent variables.
B. the strength of the relationship between factors.
C. the level of significance of the factor analysis.
D. the explained variance of one of the factors.
Correct Answer is: D
When a factor analysis produces a series of factors, it is useful to determine how much of the variance is accounted for by each factor. An eigenvalue is based on the factor loadings of all the variables in the factor analysis to a particular factor. When the factor loadings are high, the eigenvalue will be large. A large eigenvalue would mean that a particular factor accounts for a large proportion of the variance among the variables.
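As a small numeric illustration (the loadings are invented), an eigenvalue can be computed as the sum of squared loadings of all variables on a factor:

```python
import numpy as np

# Hypothetical loadings of four tests on a single factor
loadings = np.array([0.8, 0.7, 0.6, 0.5])

# Eigenvalue: sum of squared loadings on the factor
eigenvalue = float((loadings ** 2).sum())    # 0.64 + 0.49 + 0.36 + 0.25 = 1.74

# Proportion of total variance among the four tests explained by this factor
prop_explained = eigenvalue / len(loadings)  # 1.74 / 4 = 0.435
```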
The factor loading for Test A and Factor II is .80 in a factor matrix. This means that:
Select one:
A. only 80% of variability in Test A is accounted for by the factor analysis
B. only 64% of variability in Test A is accounted for by the factor analysis
C. 80% of variability in Test A is accounted for by Factor II
D. 64% of variability in Test A is accounted for by Factor II
Correct Answer is: D
The correlation coefficient for a test and an identified factor is referred to as a factor loading. To obtain a measure of shared variability, the factor loading is squared. In this example, the factor loading is .80, meaning that 64% (.80 squared) of the variability in the test is accounted for by the factor.
The other identified factor(s) probably also account for some of the variability in Test A, which is why "only 64% of variability in Test A is accounted for by the factor analysis" is not the best answer.
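In code, squaring the loading gives the variance shared with the factor; the communality (here using a hypothetical Factor I loading, invented for illustration) shows how other factors add to the variance explained in Test A:

```python
loading_f2 = 0.80            # Test A's loading on Factor II (from the question)
shared = loading_f2 ** 2     # 0.64 -> Factor II accounts for 64% of Test A's variance

# Hypothetical loading of Test A on Factor I, for illustration only
loading_f1 = 0.30
communality = loading_f1 ** 2 + loading_f2 ** 2  # 0.09 + 0.64 = 0.73
```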
Likert scales are most useful for: Select one: A. dichotomizing quantitative data B. quantifying objective data C. quantifying subjective data D. ordering categorical data
Correct Answer is: C
Attitudes are subjective phenomena. Likert scales indicate the degree to which a person agrees or disagrees with an attitudinal statement. Using a Likert scale, attitudes are quantified, or represented in terms of ordinal scores.
Which statement is most correct?
Select one:
A. High reliability assumes high validity.
B. High validity assumes high reliability.
C. Low validity assumes low reliability.
D. Low reliability assumes low validity.
Correct Answer is: B
This question is difficult because the language of the response choices is convoluted and imprecise. We don’t write questions like this because we’re sadistic; it’s just that you’ll sometimes see this type of language on the exam as well, and we want to prepare you. What you need to do on questions like this is bring to mind what you know about the issue being asked about, and to choose the answer that best applies. Here, you should bring to mind what you know about the relationship between reliability and validity: For a test to have high validity, it must be reliable; however, for a test to have high reliability, it does not necessarily have to be valid. With this in mind, you should see that “high validity assumes high reliability” is the best answer. This means that a precondition of high validity is high reliability.
The second best choice states that low reliability assumes low validity. This is a true statement if you interpret the word “assume” to mean “implies” or “predicts.” But if you interpret the word “assume” to mean “depends on” or “is preconditioned by,” the statement is not correct.
A person obtains a raw score of 70 on a Math test with a mean of 50 and an SD of 10; a percentile rank of 84 on a History test; and a T-score of 65 on an English test. What is the relative order of each of these scores? Select one: A. History >> Math >> English B. Math >> History >> English C. History >> English >> Math D. Math >> English >> History
Correct Answer is: D
Before we can compare different forms of scores, we must transform them into some common standardized measure. On a Math test with a mean of 50 and an SD of 10, a raw score of 70 falls 2 standard deviations above the mean. Assuming a normal distribution of scores, a percentile rank of 84 on the History test is equivalent to 1 standard deviation above the mean. If you haven't memorized that, you could still figure it out: 50% of all scores in a normal distribution fall below the mean, and 68% of scores fall within +/- 1 SD of the mean. Dividing 68% by 2 gives 34% (the percentage of scores that fall between the mean and +1 SD); adding that 34% to the 50% that fall below the mean gives a percentile rank of 84. Thus, the 84th-percentile score is equivalent to 1 SD above the mean. Finally, looking at the T-score on the English test: T-scores always have a mean of 50 and an SD of 10, so a T-score of 65 is equivalent to 1.5 standard deviations above the mean. Comparing the three test scores, the highest score was in Math at 2 SDs above the mean, followed by English at 1.5 SDs above the mean, and History at 1 SD above the mean.
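The comparison can be sketched by converting each score to a z-score (standard deviations above the mean); `statistics.NormalDist` recovers the z-value behind the 84th percentile:

```python
from statistics import NormalDist

z_math = (70 - 50) / 10                   # raw score, mean 50, SD 10 -> 2.0 SDs
z_history = NormalDist().inv_cdf(0.84)    # 84th percentile -> roughly 1 SD
z_english = (65 - 50) / 10                # T-score: mean 50, SD 10 -> 1.5 SDs

# Math (2 SDs) > English (1.5 SDs) > History (about 1 SD)
assert z_math > z_english > z_history
```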
Computer-adaptive testing will yield
Select one:
A. more accurate results for high scorers on a test.
B. more accurate results for low scorers on a test.
C. more accurate results for examinees who score in the middle range of a test.
D. equally accurate results across all ranges of scores on a test
Correct Answer is: D
In computerized adaptive testing, the examinee's previous responses are used to tailor the test to his or her ability level. As a result, measurement precision is roughly uniform across ability levels, so scores are equally accurate for low, middle, and high scorers.
The kappa statistic is used to evaluate reliability when data are: Select one: A. interval or ratio (continuous) B. nominal or ordinal (discontinuous) C. metric D. nonlinear
Correct Answer is: B
The kappa statistic is used to evaluate inter-rater reliability, or the consistency of ratings assigned by two raters, when data are nominal or ordinal. Interval and ratio data are sometimes referred to as metric data.
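A minimal sketch of Cohen's kappa for two raters assigning nominal categories (the ratings below are invented for illustration):

```python
def cohen_kappa(rater1, rater2):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    n = len(rater1)
    categories = set(rater1) | set(rater2)
    # Observed proportion of agreement
    p_obs = sum(a == b for a, b in zip(rater1, rater2)) / n
    # Agreement expected by chance, from each rater's marginal proportions
    p_chance = sum((rater1.count(c) / n) * (rater2.count(c) / n)
                   for c in categories)
    return (p_obs - p_chance) / (1 - p_chance)

print(cohen_kappa(list("aabb"), list("aabb")))  # 1.0 (perfect agreement)
print(cohen_kappa(list("aabb"), list("abab")))  # 0.0 (agreement at chance level)
```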
Which of the following would be used to determine the probability that examinees of different ability levels are able to answer a particular test item correctly?
Select one:
A. criterion-related validity coefficient
B. item discrimination index
C. item difficulty index
D. item characteristic curve
Correct Answer is: D
Item characteristic curves (ICCs), which are associated with item response theory, are graphs that depict individual test items in terms of the percentage of individuals in different ability groups who answered the item correctly. For example, an ICC for an individual test item might show that 80% of people in the highest ability group, 40% of people in the middle ability group, and 5% of people in the lowest ability group answered the item correctly. Although costly to derive, ICCs provide much information about individual test items, including their difficulty, discriminability, and the probability that the item will be guessed correctly.
The slope of the item response curve, with respect to item response theory, indicates an item's: Select one: A. reliability B. validity C. difficulty D. discriminability
Correct Answer is: D
The item response curve provides information about an item's difficulty, its ability to discriminate between examinees who are high and low on the characteristic being measured, and the probability of correctly answering the item by guessing. The position of the curve indicates the item's difficulty, and the steeper the slope of the curve, the better the item discriminates (the correct response) between examinees who are high and low on the characteristic being measured. The item response curve does not indicate reliability or validity, so those options are incorrect.
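Under item response theory, this can be sketched with the two-parameter logistic (2PL) model, in which `a` controls the slope (discrimination) and `b` the position (difficulty); the parameter values below are invented:

```python
import math

def p_correct(theta, a, b):
    """2PL item response curve: probability of a correct response at
    ability theta, for an item with discrimination a and difficulty b."""
    return 1 / (1 + math.exp(-a * (theta - b)))

# At theta = b, the probability is .50 regardless of the slope
print(p_correct(0.0, a=1.0, b=0.0))   # 0.5

# A steeper slope separates high- and low-ability examinees more sharply
steep = p_correct(1.0, a=2.0, b=0.0) - p_correct(-1.0, a=2.0, b=0.0)
flat = p_correct(1.0, a=0.5, b=0.0) - p_correct(-1.0, a=0.5, b=0.0)
assert steep > flat
```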
What are the minimum and maximum values of the standard error of measurement?
Select one:
A. 0 and the standard deviation of test scores
B. 0 and 1
C. 1 and the standard deviation of test scores
D. -1 and 1
Correct Answer is: A
This question is best answered with reference to the formula for the standard error of measurement, which appears in the Psychology-Test Construction section: the standard deviation of test scores is multiplied by the square root of one minus the reliability coefficient, i.e., SEM = SD(x) × √(1 − rxx). You need to know the minimum and maximum values of the reliability coefficient – 0 and +1.0, respectively. If the reliability coefficient is +1.0, the formula gives a standard error of measurement of 0, which is its minimum value. And when the reliability coefficient is 0, the formula gives a standard error of measurement equal to the standard deviation of test scores, which is its maximum value.
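The formula SEM = SD × √(1 − r) can be checked directly at the two extremes of the reliability coefficient:

```python
import math

def sem(sd, reliability):
    """Standard error of measurement: SD of test scores times sqrt(1 - rxx)."""
    return sd * math.sqrt(1 - reliability)

print(sem(10, 1.0))   # 0.0  (perfect reliability -> minimum value)
print(sem(10, 0.0))   # 10.0 (zero reliability -> maximum value: the SD itself)
```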
Which of the following methods of establishing a test's reliability is, all other things being equal, likely to be lowest? Select one: A. split-half B. Cronbach's alpha C. alternate forms D. test-retest
Correct Answer is: C
You probably remember that the alternate forms coefficient is considered by many to be the best reliability coefficient to use when practical (if you don’t, commit this factoid to memory now). Everything else being equal, it is also likely to have a lower magnitude than the other types of reliability coefficients. The reason for this is similar to the reason why it is considered the best one to use. To obtain an alternate forms coefficient, one must administer two forms of the same test to a group of examinees, and correlate scores on the two forms. The two forms of the test are administered at different times and (because they are different forms) contain different items or content. In other words, there are two sources of error (or factors that could lower the coefficient) for the alternate forms coefficient: the time interval and different content (in technical terms, these sources of error are referred to respectively as “time sampling” and “content sampling”). The alternate forms coefficient is considered the best reliability coefficient by many because, for it to be high, the test must demonstrate consistency across both a time interval and different content.
When constructing an achievement test, which of the following would be useful for comparing total test scores of a sample of examinees to the proportion of examinees who answer each item correctly? Select one: A. classical test theory B. item response theory C. generalizability theory D. item utility theory
Correct Answer is: B
The question describes the kind of information that is provided in an item response curve, which is constructed for each item to determine its characteristics when using item response theory as the basis for test development. (Note that there is no such thing as “item utility theory.”)