10- Test Construction Flashcards
An eigenvalue is the:
Select one:
A. proportion of variance attributable to two or more factors
B. amount of variance in all the tests accounted for by a factor
C. effect of one independent variable, without consideration of the effects of other independent variables.
D. strength of the relationship between factors
Correct Answer is: B
In a factor analysis or principal components analysis, the explained variance, or “eigenvalue,” indicates the amount of variance in all the tests accounted for by a factor.
proportion of variance attributable to two or more factors
This choice describes “communality” which is another outcome of a factor analysis.
effect of one independent variable, without consideration of the effects of other independent variables.
This is the definition of a “main effect”.
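As a concrete sketch (hypothetical numbers, plain Python): for two standardized tests that correlate r, the correlation matrix [[1, r], [r, 1]] has eigenvalues 1 + r and 1 - r, and each eigenvalue divided by the number of tests gives the proportion of total variance that factor explains.

```python
# Eigenvalues of a 2x2 correlation matrix have the closed form 1 + r and 1 - r.
# Hypothetical example: two tests correlating .60.
def eigenvalues_2x2(r):
    return 1 + r, 1 - r

first, second = eigenvalues_2x2(0.60)
n_tests = 2  # for standardized tests, total variance equals the number of tests
print(first / n_tests)   # proportion of total variance the first factor accounts for
```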
Additional Information: Explained Variance (or Eigenvalues)
In a study examining the effects of relaxation training on test-taking anxiety, a pre-test measure of anxiety is administered to a group of self-identified highly anxious test takers resulting in a split-half reliability coefficient of .75. If the pre-test is administered to a randomly selected group of the same number of people the split-half reliability coefficient will most likely be: Select one: A. Greater than .75 B. Less than .75 C. Equal to .75 D. impossible to predict
Correct Answer is: A
A general rule for all correlation coefficients, including reliability coefficients, is that the more heterogeneous the group, i.e., the wider the variability, the higher the coefficient will be. Since a randomly selected group would be more heterogeneous than a group of highly anxious test-takers, the randomly selected group would most likely have a higher reliability coefficient.
When looking at an item characteristic curve (ICC), which of the following provides information about how well the item discriminates between high and low achievers?
Select one:
A. the Y-intercept
B. the slope of the curve
C. the position of the curve (left versus right)
D. the position of the curve (top versus bottom)
Correct Answer is: B
An item response curve provides up to three pieces of information about a test item: its difficulty (“the position of the curve (left versus right)”), its ability to discriminate between high and low scorers (the correct answer), and the probability of answering the item correctly just by guessing (“the Y-intercept”).
Additional Information: Item Response Theory and Item Response Curve
Adding more easy to moderately easy items to a difficult test will:
Select one:
A. increase the test’s floor.
B. decrease the test’s floor.
C. alter the test’s floor only if there is an equal number of difficult to moderately difficult items.
D. have no effect on the test’s floor.
Correct Answer is: B
As you may have guessed, “floor” refers to the lowest scores on a test (ceiling refers to the highest scores). Adding more easy to moderately easy items would lower, or decrease, the floor, allowing for better discrimination among people at the low end.
Additional Information: Ceiling and Floor Effects
Adding more items to a test would most likely:
Select one:
A. increase the test’s reliability
B. decrease the test’s validity
C. have no effect on the test’s reliability or validity
D. preclude the use of the Spearman-Brown prophecy formula
Correct Answer is: A
Lengthening a test, that is, adding more test items, generally results in an increase in the test’s reliability. For example, a test consisting of only 3 questions would probably be more reliable if we added 10 more items.
The Spearman-Brown formula is specifically used to estimate the reliability of a test if it were lengthened or shortened.
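The Spearman-Brown formula itself is simple enough to sketch directly: r_new = k*r / (1 + (k - 1)*r), where k is the factor by which the test is lengthened. The reliability values below are hypothetical.

```python
def spearman_brown(r_old, k):
    """Predicted reliability when test length is multiplied by k."""
    return (k * r_old) / (1 + (k - 1) * r_old)

# Hypothetical: a 3-item test with reliability .40 lengthened to 13 items.
k = 13 / 3
print(round(spearman_brown(0.40, k), 2))   # reliability rises with length
```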
Additional Information: Factors Affecting Reliability
The appropriate kind of validity for a test depends on the test’s purpose. For example, for the psychology licensing exam:
Select one:
A. construct validity is most important because it measures the hypothetical trait of “competence.”
B. content validity is most important because it measures knowledge of various content domains in the field of psychology.
C. criterion-related validity is most important because it predicts which psychologists will and will not do well as professionals.
D. no evidence of validity is required.
Correct Answer is: B
The psychology licensing exam is considered a measure of knowledge of various areas in the field of psychology and, therefore, is essentially an achievement-type test. Measures of content knowledge should have adequate content validity.
Additional Information: Content Validity
A test developer creates a new test of anxiety sensitivity and correlates it with an existing measure of anxiety sensitivity. The test developer is operating under the assumption that Select one: A. the new test is valid. B. the existing test is valid. C. the new test is reliable. D. the existing test is reliable.
Correct Answer is: B
The question is describing an example of obtaining evidence for a test’s construct validity. Construct validity refers to the degree to which a test measures a theoretical construct that it purports to measure; anxiety sensitivity is an example of a theoretical construct measured in psychological tests. A high correlation between a new test and an existing test that measures the same construct offers evidence of convergent validity, which is a type of construct validity. Another type is divergent validity, which is the degree to which a test has a low correlation with another test that measures a different construct. Correlating scores on a new test with an existing test to assess the new test’s convergent validity requires an assumption that the existing test is valid; i.e., that it actually does measure the construct.
Additional Information: Construct Validity
Rotation is used in factor analysis to:
Select one:
A. get an easier pattern of factor loadings to interpret.
B. increase the magnitude of the communalities.
C. reduce the magnitude of the communalities.
D. reduce the effects of measurement error on the factor loadings.
Correct Answer is: A
Factors are rotated to obtain a pattern that’s easier to interpret since the pattern of factor loadings in the initial factor matrix is often difficult to interpret.
Rotation alters the magnitude of the factor loadings but not the magnitude of the communalities (“increase the magnitude of the communalities” and “reduce the magnitude of the communalities”) and does not reduce the effects of measurement error (“reduce the effects of measurement error on the factor loadings”).
Additional Information: Interpreting and Naming the Factors
When seeking results that would be sensitive to the ____________ of the test-taker, test-retest reliability would need to be the highest. Select one: A. maturity B. mood C. aptitude D. gender
Correct Answer is: D
Test-retest reliability is appropriate for determining the reliability of tests designed to measure attributes that are not affected by repeated measurement and that are relatively stable over time. The characteristics or traits represented in the incorrect choices (“maturity,” “mood,” and “aptitude”) fluctuate over time and would negatively affect test-retest results.
Additional Information: Test-Retest Reliability
Researchers are interested in detecting differential item functioning (DIF). Which method would not be used? Select one: A. SIBTEST B. Mantel-Haenszel C. Lord's chi-square D. cluster analysis
Correct Answer is: D
In the context of item response theory, differential item functioning (DIF), or item bias analysis, refers to a difference in the probability of a correct or positive response to an item among individuals from different subpopulations who are equal on the latent or underlying attribute measured by the test. The SIBTEST (simultaneous item bias test), Mantel-Haenszel, and Lord’s chi-square are statistical techniques used to identify DIF. Cluster analysis is a statistical technique used to develop a classification system or taxonomy; it would not detect item bias or group differences.
Additional Information: Item Response Theory and Item Response Curve
A measure of relative strength of a score within an individual is referred to as a(n): Select one: A. ipsative score B. normative score C. standard score D. independent variable
Correct Answer is: A
Ipsative scores report an examinee’s scores using the examinee him or herself as a frame of reference. They indicate the relative strength of a score within an individual but, unlike normative measures, do not provide the absolute strength of a domain relative to a normative group. Examples of ipsative scores are the results of a forced choice measure.
Additional Information: Ipsative vs. Normative Measures
Discriminant and convergent validity are classified as examples of: Select one: A. construct validity. B. content validity C. face validity. D. concurrent validity
Correct Answer is: A
There are many ways to assess the validity of a test. If we correlate our test with another test that is supposed to measure the same thing, we’ll expect the two to have a high correlation; if they do, the tests will be said to have convergent validity. If our test has a low correlation with other tests measuring something our test is not supposed to measure, it will be said to have discriminant (or divergent) validity. Convergent and divergent validity are both types of construct validity.
Additional Information: Construct Validity
A negative item discrimination (D) indicates:
Select one:
A. an index equal to zero.
B. more high-achieving examinees than low-achieving examinees answered the item correctly.
C. an item was answered correctly by the same number of low- and high-achieving students.
D. more low-achieving examinees answered the item correctly than high-achieving.
Correct Answer is: D
The discrimination index, D, ranges from +1.0 to -1.0. It is the number of people in the upper (high-scoring) group who answered the item correctly, minus the number of people in the lower-scoring group who answered the item correctly, divided by the number of people in the larger of the two groups. An item will have a discrimination index of zero if everyone gets it correct or everyone gets it incorrect. A negative item discrimination index indicates that the item was answered correctly by more low-achieving students than by high-achieving students. In other words, a poor student may make a guess, select that response, and come up with the correct answer without any real understanding of what is being assessed, whereas good students (like EPPP candidates) may be suspicious of a question that looks too easy, read too much into it, and end up less successful than those who guess.
more high-achieving examinees than low-achieving examinees answered the item correctly.
A positive item discrimination index indicates that the item was answered correctly by more high-achieving students than by low-achieving students.
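The formula described above is easy to sketch with hypothetical numbers:

```python
def discrimination_index(upper_correct, lower_correct, upper_n, lower_n):
    """D: correct answers in the upper group minus correct answers in the
    lower group, divided by the size of the larger group (range -1.0 to +1.0)."""
    return (upper_correct - lower_correct) / max(upper_n, lower_n)

# Hypothetical item: 6 of 20 high scorers answered correctly, but 14 of 20 low scorers did.
print(discrimination_index(6, 14, 20, 20))   # negative D: item favors low achievers
```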
Additional Information: Item Discrimination
Likert scales are most useful for: Select one: A. dichotomizing quantitative data B. quantifying objective data C. quantifying subjective data D. ordering categorical data
Correct Answer is: C
Attitudes are subjective phenomena. Likert scales indicate the degree to which a person agrees or disagrees with an attitudinal statement. Using a Likert scale, attitudes are quantified, i.e., represented in terms of ordinal scores.
Additional Information: Scales of Measurement
On the MMPI-2, what percentage of the general population the test is intended for can be expected to obtain a T-score between 40 and 60 on the depression scale? Select one: A. 50 B. 68 C. 95 D. 99
Correct Answer is: B
A T-score is a standardized score. Standardization involves converting raw scores into scores that indicate how many standard deviations the values are above or below the mean. A T-score is a standard score with a mean of 50 and a standard deviation of 10. Results of personality inventories such as the MMPI-2 are commonly reported in terms of T-scores. Other standard scores include z-scores, with a mean of 0 and a standard deviation of 1, and IQ scores, with a mean of 100 and a standard deviation of 15.

When values are normally distributed in a population, standardization facilitates interpretation of test scores by making it easier to see where a test-taker stands on the variable in relation to others in the population. This is because, due to the properties of a normal distribution, one always knows the percentage of cases that fall within standard deviation ranges of the mean. For example, in a normal distribution, 68.26% of scores will fall within one standard deviation of the mean, or in a T-score distribution, between 40 and 60, so 68% is the best answer to this question.

Another example: 95.44% of scores fall within two standard deviations of the mean; therefore, 4.56% will have scores 2 standard deviation units or more above or below the mean. By dividing 4.56 in half, we can see that 2.28% of test-takers will score 70 or above on any MMPI scale, and 2.28% will score 30 or below.
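The percentages quoted above come straight from the normal curve and can be checked with Python's standard library (NormalDist is in the stdlib from Python 3.8):

```python
from statistics import NormalDist

t_scores = NormalDist(mu=50, sigma=10)   # T-score distribution

within_1_sd = t_scores.cdf(60) - t_scores.cdf(40)   # between T = 40 and T = 60
within_2_sd = t_scores.cdf(70) - t_scores.cdf(30)   # between T = 30 and T = 70

print(round(within_1_sd * 100, 2))   # ~68.27
print(round(within_2_sd * 100, 2))   # ~95.45
```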
Additional Information: Standard Scores
A condition necessary for pooled variance is: Select one: A. unequal sample sizes B. equal sample sizes C. unequal covariances D. equal covariances
Correct Answer is: B
Pooled variance is the weighted average of the variances of each group, “weighted” by the number of subjects in each group. Use of a pooled variance assumes that the population variances are approximately equal, even though the sample variances differ. When the population variances are known or assumed to be equal, the estimate may be labeled “equal variances assumed,” “common variance,” or “pooled variance.” “Equal variances not assumed,” or separate variances, is appropriate for normally distributed values when the population variances are known to be unequal or cannot be assumed to be equal.
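The standard pooled-variance computation (weights of n - 1 per group; the numbers below are hypothetical) can be sketched as:

```python
def pooled_variance(var1, n1, var2, n2):
    """Weighted average of two sample variances, weighted by degrees of freedom."""
    return ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)

# Hypothetical equal-sized groups with sample variances 16 and 24.
print(pooled_variance(16, 30, 24, 30))   # 20.0 -> a simple average when sizes match
```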
Additional Information: The Variance
In a clinical trial of a new drug, the null hypothesis is the new drug is, on average, no better than the current drug. It is concluded that the two drugs produce the same effect when in fact the new drug is superior. This is:
Select one:
A. corrected by reducing the power of the test
B. corrected by reducing the sample size
C. a Type I error
D. a Type II error
Correct Answer is: D
Type II errors occur when the null hypothesis is not rejected when it is in fact false; Type I errors are often considered more serious as the null hypothesis is wrongly rejected. For example, in the clinical trial of a new drug, this would be concluding that the new drug was better when in fact it was not. Type I and II errors are inversely related: as the probability of a Type I error increases, the probability of a Type II error decreases, and vice versa.
Which of the following statements is not true regarding concurrent validity?
Select one:
A. It is used to establish criterion-related validity.
B. It is appropriate for tests designed to assess a person’s future status on a criterion.
C. It is obtained by collecting predictor and criterion scores at about the same time.
D. It indicates the extent to which a test yields the same results as other measures of the same phenomenon.
Correct Answer is: B
There are two ways to establish the criterion-related validity of a test: concurrent validation and predictive validation. In concurrent validation, predictor and criterion scores are collected at about the same time; by contrast, in predictive validation, predictor scores are collected first and criterion data are collected at some future point. Concurrent validity indicates the extent to which a test yields the same results as other measures of the same phenomenon. For example, if you developed a new test for depression, you might administer it along with the BDI and measure the concurrent validity of the two tests.
Additional Information: Concurrent vs. Predictive Validation
Cluster analysis would most likely be used to
Select one:
A. construct a “taxonomy” of criminal personality types.
B. obtain descriptive information about a particular case.
C. test the hypothesis that an independent variable has an effect on a dependent variable.
D. test statistical hypotheses when the assumption of independence of observations is violated.
Correct Answer is: A
The purpose of cluster analysis is to place objects into categories. More technically, the technique is designed to help one develop a taxonomy, or classification system of variables. The results of a cluster analysis indicate which variables cluster together into categories. The technique is sometimes used to divide a population of individuals into subtypes.
Additional Information: Techniques Related to Factor Analysis
Which of the following illustrates the concept of shrinkage?
Select one:
A. extremely depressed individuals obtain a high score on a depression inventory the first time they take it, but obtain a slightly lower score the second time they take it
B. items that have collectively been shown to be a valid way to diagnose a sample of individuals as depressed prove to be less valid when used for a different sample
C. the self-esteem of depressed individuals shrinks when they are faced with very difficult tasks
D. abilities such as short-term memory and response speed diminish as we get older
Correct Answer is: B
Shrinkage can be an issue when a predictor test is developed by testing out a pool of items on a validation (“try-out”) sample and then choosing the items that have the highest correlation with the criterion. When the chosen items are administered to a second sample, they usually don’t work quite as well – in other words, the validity coefficient shrinks. This occurs because of chance factors operating in the original validation sample that are not present in the second sample.
Additional Information: Factors Affecting the Validity Coefficient
In the multitrait-multimethod matrix, a large heterotrait-monomethod coefficient would indicate: Select one: A. low convergent validity. B. high convergent validity. C. high divergent validity. D. low divergent validity.
Correct Answer is: D
Use of a multitrait-multimethod matrix is one method of assessing a test’s construct validity. The matrix contains correlations among different tests that measure both the same and different traits using similar and different methodologies. The heterotrait-monomethod coefficient, one of the correlation coefficients that would appear on this matrix, reflects the correlation between two tests that measure different traits using similar methods. An example might be the correlation between a test of depression based on self-report data and a test of anxiety also based on self-report data. If a test has good divergent validity, this correlation would be low. Divergent validity is the degree to which a test has a low correlation with other tests that do not measure the same construct. Using the above example, a test of depression would have poor divergent validity if it had a high correlation with other tests that purportedly measure different traits, such as anxiety. This would be evidence that the depression test is measuring traits that are unrelated to depression.
Additional Information: Convergent and Discriminant (Divergent) Validation
A kappa coefficient of .93 would indicate that the two tests
Select one:
A. measure what they are supposed to.
B. have a high degree of agreement between their raters.
C. aren’t especially reliable.
D. present test items with a high level of difficulty.
Correct Answer is: B
The kappa coefficient is used to evaluate inter-rater reliability. A coefficient in the lower .90s indicates high reliability.
This option (“measure what they are supposed to”) is a layman’s definition of the general concept of validity.
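Cohen's kappa, the usual kappa statistic for two raters, can be sketched from a hypothetical agreement table: observed agreement corrected for the agreement expected by chance.

```python
def cohens_kappa(table):
    """Kappa from a square agreement table (rows = rater A, columns = rater B)."""
    total = sum(sum(row) for row in table)
    observed = sum(table[i][i] for i in range(len(table))) / total
    expected = sum(
        (sum(table[i]) / total) * (sum(row[i] for row in table) / total)
        for i in range(len(table))
    )
    return (observed - expected) / (1 - expected)

# Hypothetical: two raters diagnose 100 patients, agreeing on 90.
ratings = [[45, 5],
           [5, 45]]
print(round(cohens_kappa(ratings), 2))   # 0.8 -> high agreement beyond chance
```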
Additional Information: Interscorer Reliability
Kuder-Richardson reliability applies to Select one: A. split-half reliability. B. test-retest stability. C. Likert scales. D. tests with dichotomously scored questions.
Correct Answer is: D
The Kuder-Richardson formula is one of several statistical indices of a test’s internal consistency reliability. It is used to assess the inter-item consistency of tests that are dichotomously scored (e.g., scored as right or wrong).
Additional Information: Internal Consistency Reliability
In designing a new test of a psychological construct, you correlate it with an old test the new one will replace. Your assumption in this situation is that:
Select one:
A. the old test is invalid.
B. the old test is valid but out of date.
C. the old test is better than the new test.
D. the old test and the new test are both culture-fair.
Correct Answer is: B
the old test is valid but out of date.
This choice is the only one that makes logical sense. In the assessment of the construct validity of a new test, a common practice is to correlate that test with another test that measures the same construct. For this technique to work, the other test must be a valid measure of the construct. So in this situation, it is assumed that the old test is valid, but at the same time, it is being replaced. Of the choices listed, the correct option provides a reason why a valid test would be replaced.
Additional Information: Construct Validity
A large monotrait-heteromethod coefficient in a multitrait-multimethod matrix indicates evidence of: Select one: A. convergent validity B. concurrent validity C. predictive validity D. discriminant validity
Correct Answer is: A
A multitrait-multimethod matrix is a complicated method for assessing convergent and discriminant validity. Convergent validity requires that different ways of measuring the same trait yield the same result. Monotrait-heteromethod coefficients are correlations between two measures that assess the same trait using different methods; therefore, if a test has convergent validity, this correlation should be high. Heterotrait-monomethod and heterotrait-heteromethod coefficients both bear on discriminant validity, and monotrait-monomethod coefficients are reliability coefficients.
Additional Information: Convergent and Discriminant (Divergent) Validation
In computing test reliability, to control for practice effects one would use a(n):
I. split-half reliability coefficient.
II. alternative forms reliability coefficient.
III. test-retest reliability coefficient.
Select one:
A. I and III only
B. I and II only
C. II and III only
D. II only
Correct Answer is: B
The clue here is the practice effect. That means that if you give a test, just taking it will give the person practice so that next time, he or she is not a naive person. To control for that, we want to eliminate the situation where the person is administered the same test again. So we do not use test-retest. We can use the two other methods listed. We can use split-half since, here, only one administration is used (the two parts are thought of as two different tests). And, in the alternative forms method, a different test is given the second time, controlling for the effects of taking the same test twice.
If an examinee correctly guesses the answers to a test, the reliability coefficient: Select one: A. is not affected B. stays the same C. decreases D. increases
Correct Answer is: C
One of the factors that affect the reliability coefficient is guessing. Guessing correctly decreases the reliability coefficient. The incorrect options (“is not affected,” “stays the same,” and “increases”) are not true with regard to the reliability coefficient.
Additional Information: Factors Affecting Reliability
A researcher employs multiple methods of measurement in an attempt to increase reliability by reducing systematic error. This strategy is referred to as: Select one: A. calibration B. intraclass correlation (ICC) C. triangulation D. correction for attenuation
Correct Answer is: C
Triangulation is the attempt to increase reliability by reducing systematic or method error through a strategy in which the researcher employs multiple methods of measurement (e.g., observation, survey, archival data). If the alternative methods do not share the same source of systematic error, examination of data from the alternative methods gives insight into how individual scores may be adjusted to come closer to reflecting true scores, thereby increasing reliability.
calibration
Calibration is the attempt to increase reliability by increasing the homogeneity of ratings through feedback to the raters, when multiple raters are used. For example, raters might meet during pretesting of the instrument to discuss items on which they have disagreed, seeking to reach consensus on rules for rating items (e.g., defining a “2” for an item dealing with job performance).
intraclass correlation (ICC)
Intraclass correlation (ICC) is used to measure inter-rater reliability for two or more raters and may also be used to assess test-retest reliability. ICC may be conceptualized as the ratio of between-groups variance to total variance.
correction for attenuation
Correction for attenuation is a method used to adjust correlation coefficients upward because of errors of measurement when two measured variables are correlated; the errors always serve to lower the correlation coefficient as compared with what it would have been if the measurement of the two variables had been perfectly reliable.
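The correction-for-attenuation formula divides the observed correlation by the square root of the product of the two reliabilities; a sketch with hypothetical values:

```python
def correct_for_attenuation(r_xy, r_xx, r_yy):
    """Estimated correlation if both measures were perfectly reliable."""
    return r_xy / (r_xx * r_yy) ** 0.5

# Hypothetical: observed r = .30; reliabilities .70 and .80.
print(round(correct_for_attenuation(0.30, 0.70, 0.80), 2))   # corrected upward to ~0.40
```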
Additional Information: Factors Affecting Reliability
If, in a normally-shaped distribution, the mean is 100 and the standard error of measurement is 5, what would the 95% confidence interval be for an examinee who receives a score of 90? Select one: A. 75-105 B. 80-100 C. 90-100 D. 95-105
Correct Answer is: B
The standard error of measurement indicates how much error an individual test score can be expected to have. A confidence interval indicates the range within which an examinee’s true score is likely to fall, given his or her obtained score. To calculate the 95% confidence interval we simply add and subtract two standard errors of measurement to the obtained score. Two standard errors of measurement in this case equal 10. We’re told that the examinee’s obtained score is 90. 90 +/- 10 results in a confidence interval of 80 to 100. In other words, we can be 95% confident that the examinee’s true score falls between 80 and 100.
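The arithmetic above is simple to express directly:

```python
def confidence_interval(obtained, sem, z=2):
    """True-score interval: obtained score +/- z standard errors of measurement.
    z = 1 gives roughly a 68% interval, z = 2 roughly a 95% interval."""
    return obtained - z * sem, obtained + z * sem

print(confidence_interval(90, 5, z=2))   # (80, 100), the interval computed above
```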
Additional Information: Standard Error of Measurement
In the multitrait-multimethod matrix, a low heterotrait-heteromethod coefficient would indicate: Select one: A. low convergent validity B. low divergent validity C. high convergent validity D. high divergent validity
Correct Answer is: D
Use of a multitrait-multimethod matrix is one method of assessing a test’s construct validity. The matrix contains correlations among different tests that measure both the same and different traits using similar and different methodologies. The heterotrait-heteromethod coefficient, one of the correlation coefficients that would appear on this matrix, reflects the correlation between two tests that measure different (hetero) traits using different (hetero) methods. An example might be the correlation between vocabulary subtest scores on the WAIS-IV for intelligence and scores on the Beck Depression Inventory for depression. Since these measures presumably measure different constructs, the correlation coefficient should be low, indicating high divergent or discriminant validity.
Additional Information: Convergent and Discriminant (Divergent) Validation
A way to define criterion in regard to determining criterion related validity is that the criterion is: Select one: A. The predictor test B. The validity measure C. The predictee D. The content.
Correct Answer is: C
To determine criterion-related validity, scores on a predictor test are correlated with an outside criterion. The criterion is that which is being predicted, or the “predictee.”
Additional Information: Relationship between Reliability and Validity
Raising the cutoff score on a predictor test would have the effect of Select one: A. increasing true positives B. decreasing false positives C. decreasing true negatives D. decreasing false negatives.
Correct Answer is: B
A simple way to answer this question is with reference to a chart such as the one displayed under the topic “Criterion-Related Validity” in the Psychology-Test Construction section of your materials. If you look at this chart, you can see that increasing the predictor cutoff score (i.e., moving the vertical line to the right) decreases the number of false positives as well as true positives (you can also see that the number of both true and false negatives would be increased).
You can also think about this question more abstractly by coming up with an example. Imagine, for instance, that a general knowledge test is used as a predictor of job success. If the cutoff score on this test is raised, fewer people will score above this cutoff and, therefore, fewer people will be predicted to be successful. Another way of saying this is that fewer people will come up “positive” on this predictor. This applies to both true positives and false positives.
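The effect can also be verified on a small hypothetical data set: counting positives at two cutoffs shows both true and false positives dropping as the cutoff rises.

```python
# Hypothetical applicants: (predictor score, actually successful on the job).
applicants = [(55, False), (60, True), (65, False), (70, True),
              (75, True), (80, False), (85, True), (90, True)]

def positives(cutoff):
    """Count (true positives, false positives) at a given predictor cutoff."""
    true_pos = sum(1 for score, success in applicants if score >= cutoff and success)
    false_pos = sum(1 for score, success in applicants if score >= cutoff and not success)
    return true_pos, false_pos

print(positives(60))   # (5, 2) at the lower cutoff
print(positives(80))   # (2, 1): raising the cutoff decreases both kinds of positives
```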
Additional Information: Decision-Making
If, in a normally-shaped distribution, the mean is 100 and the standard error of measurement is 10, what would the 68% confidence interval be for an examinee who receives a score of 95?
Select one:
A. 85 to 105
B. 90 to 100
C. 90 to 110
D. impossible to calculate without the reliability coefficient
Correct Answer is: A
The standard error of measurement indicates how much error an individual test score can be expected to have. A confidence interval indicates the range within which an examinee’s true score is likely to fall, given his or her obtained score. To calculate the 68% confidence interval we simply add and subtract one standard error of measurement to the obtained score.
impossible to calculate without the reliability coefficient
This choice is incorrect because although the reliability coefficient is needed to calculate a standard error of measurement, in this case, we are provided with the standard error.
Additional Information: Standard Error of Measurement
The cutoff IQ score for placement in a school district's gifted program is 135. The parent of a child who scored 133 might be interested in knowing the test's standard error of measurement in order to estimate the child's Select one: A. true score. B. mean score. C. error score. D. criterion score.
Correct Answer is: A
The question is just a roundabout way of asking “what is the standard error of measurement?”, though it does supply a practical application of the concept. According to classical test theory, an obtained test score consists of truth and error. The truth component reflects the degree to which the score reflects the actual characteristic the test measures, and the error component reflects random or chance factors affecting the score. For instance, on an IQ test, a score will reflect to some degree the person’s “true” IQ and to some degree chance factors such as whether the person was tired the day he took the test, whether some of the questions happen to be a particularly good fit with the person’s knowledge base, etc. The standard error of measurement of a test indicates the expected amount of error a score on that test will contain. It can be used to answer the question, “given an obtained score, what is the likely true score?” For example, if the test referenced had a standard error of measurement of 5, there would be a 68% chance that the true test score lies within one standard error of measurement of the obtained score (between 128 and 138 in this case), and a 95% chance that the true score lies within two standard errors of measurement (between 123 and 143). So in the example, the parent would be interested in knowing the test’s standard error of measurement because the higher it is, the greater the possibility that an obtained score of 133 actually reflects a true score of 135 or above.
Additional Information: Standard Error of Measurement
When using a rating scale, several psychologists agree on the same diagnosis for one patient. This is a sign that the scale is Select one: A. reliable. B. valid. C. reliable and valid. D. neither reliable nor valid.
Correct Answer is: A
The rating scale described by the question has good inter-rater reliability, or consistency across raters. However, it may or may not have good validity; that is, it may or may not measure what it purports to measure. The question illustrates that high reliability is a necessary but not a sufficient condition for high validity.
Additional Information: Interscorer Reliability
A test with limited ceiling would have a ____________ distribution shape. Select one: A. normal B. flat C. positively skewed D. negatively skewed
Correct Answer is: D
A test with a limited ceiling has an inadequate number of difficult items, so most examinees score near the top of the scale, leaving a tail of relatively few low scores. The distribution would therefore be negatively skewed.
Additional Information: Skewed Distributions, Measures of Central Tendency