Exam #1 Flashcards
- An examinee obtains a score of 70 on a test that has a mean of 80, a standard deviation of 15, and a standard error of measurement of 5. The 95% confidence interval for the examinee’s score is:
50-90
55-85
60-80
65-75
The Correct Answer is “C”
C. A confidence interval indicates the range within which an examinee's true score is likely to fall, given his or her obtained score. The standard error of measurement indicates how much error an individual test score can be expected to have and is used to construct confidence intervals. To calculate the 68% confidence interval, add one standard error of measurement to, and subtract one from, the obtained score. To calculate the 95% confidence interval, add and subtract two standard errors of measurement. Two standard errors of measurement in this case equal 10. We’re told that the examinee’s obtained score is 70. 70 ± 10 results in a confidence interval of 60 to 80. In other words, we can be 95% confident that the examinee’s true score falls between 60 and 80.
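The arithmetic is simple enough to sanity-check in a few lines of Python (a minimal sketch; z = 2 is the rounded convention for the 95% interval):

```python
def confidence_interval(obtained_score, sem, z=2):
    """Confidence interval around an obtained score.

    z = 1 gives the ~68% interval; z = 2 gives the ~95% interval
    (1.96 is the exact value, but 2 is the convention used here).
    """
    return obtained_score - z * sem, obtained_score + z * sem

# The item above: obtained score of 70, SEM of 5
print(confidence_interval(70, 5, z=2))  # (60, 80) -> the 95% interval
print(confidence_interval(70, 5, z=1))  # (65, 75) -> the 68% interval
```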
- Kuder-Richardson reliability applies to
split-half reliability.
test-retest stability.
Likert scales.
tests with dichotomously scored questions.
The Correct Answer is “D”
The Kuder-Richardson formula is one of several statistical indices of a test’s internal consistency reliability. It is used to assess the inter-item consistency of tests that are dichotomously scored (e.g., scored as right or wrong).
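For a concrete picture of how KR-20 is computed, here is a minimal Python sketch (the 0/1 response matrix is invented for illustration):

```python
import numpy as np

def kr20(scores):
    """Kuder-Richardson Formula 20 for dichotomously scored items.

    scores: examinees x items matrix of 0/1 responses.
    """
    k = scores.shape[1]                          # number of items
    p = scores.mean(axis=0)                      # proportion passing each item
    q = 1 - p                                    # proportion failing each item
    total_var = scores.sum(axis=1).var(ddof=0)   # variance of total scores
    return (k / (k - 1)) * (1 - (p * q).sum() / total_var)

# Hypothetical 0/1 responses: 5 examinees x 4 items
X = np.array([[1, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 1, 1],
              [0, 0, 1, 0],
              [1, 1, 0, 0]])
print(round(kr20(X), 3))
```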
- Which of the following statements is not true regarding concurrent validity?
It is used to establish criterion-related validity.
It is appropriate for tests designed to assess a person's future status on a criterion.
It is obtained by collecting predictor and criterion scores at about the same time.
It indicates the extent to which a test yields the same results as other measures of the same phenomenon.
The Correct Answer is “B”
There are two ways to establish the criterion-related validity of a test: concurrent validation and predictive validation. In concurrent validation, predictor and criterion scores are collected at about the same time; by contrast, in predictive validation, predictor scores are collected first and criterion data are collected at some future point. Concurrent validity indicates the extent to which a test yields the same results as other measures of the same phenomenon. For example, if you developed a new test for depression, you might administer it along with the BDI at about the same time and correlate the two sets of scores.
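In code, that final step is just a correlation; a minimal sketch with made-up scores:

```python
import numpy as np

# Hypothetical scores on a new depression test and on the BDI,
# collected at about the same time (a concurrent validation design)
new_test = np.array([12, 25, 7, 31, 18, 22, 9, 27])
bdi      = np.array([10, 28, 5, 33, 15, 24, 8, 30])

# The concurrent validity coefficient is the correlation between
# the two sets of scores
r = np.corrcoef(new_test, bdi)[0, 1]
print(round(r, 3))  # a high r supports concurrent validity
```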
- A company wants its clerical employees to be very efficient, accurate, and fast. Examinees are given a perceptual speed test on which they indicate whether two names are exactly identical or slightly different. The reliability of the test would be best assessed by:
test-retest
Cronbach’s coefficient alpha
split-half
Kuder-Richardson Formula 20
The Correct Answer is “A”
A. Perceptual speed tests are highly speeded and are composed of very easy items that, it is assumed, every examinee could answer correctly given unlimited time. The best way to estimate the reliability of a speed test is to administer separately timed forms and correlate them, so a test-retest or alternate-forms coefficient is the best way to assess the reliability of the test in this question. The other response choices are all methods for assessing internal consistency reliability. Those are useful when a test is designed to measure a single characteristic, when the characteristic measured by the test fluctuates over time, or when scores are likely to be affected by repeated exposure to the test. However, they are not appropriate for assessing the reliability of speed tests because they tend to produce spuriously high coefficients.
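To see why the coefficients come out spuriously high, consider a toy simulation (all numbers invented): on a pure speed test, every examinee who reaches an item gets it right, so the odd and even halves move in lockstep by construction.

```python
import numpy as np

rng = np.random.default_rng(0)
n_items = 40

# Each simulated examinee answers the first n items correctly and
# never reaches the rest; examinees differ only in speed
items_reached = rng.integers(10, n_items, size=50)
scores = np.array([[1] * n + [0] * (n_items - n) for n in items_reached])

# Odd-even split-half: the two half scores track each other almost
# perfectly, so the coefficient is spuriously high
odd_half = scores[:, ::2].sum(axis=1)
even_half = scores[:, 1::2].sum(axis=1)
print(round(float(np.corrcoef(odd_half, even_half)[0, 1]), 3))  # ~1.0
```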
- Which of the following descriptive words for tests are most opposite in nature?
speed and power
subjective and aptitude
norm-referenced and standardized
maximal and ipsative
The Correct Answer is “A”
Pure speed tests and pure power tests are opposite ends of a continuum. A speed test is one with a strict time limit and easy items that most or all examinees are expected to answer correctly. Speed tests measure examinees’ response speed. A power test is one with no or a generous time limit but with items ranging from easy to very difficult (usually ordered from least to most difficult). Power tests measure level of content mastered.
- The kappa statistic is used to evaluate reliability when data are:
interval or ratio (continuous)
nominal or ordinal (discontinuous)
metric
nonlinear
The Correct Answer is “B”
B. The kappa statistic is used to evaluate inter-rater reliability, or the consistency of ratings assigned by two raters, when data are nominal or ordinal. Interval and ratio data are sometimes referred to as metric data.
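As a concrete illustration, here is a minimal sketch of Cohen's kappa, the most common version of the statistic (the ratings are hypothetical):

```python
import numpy as np

def cohens_kappa(rater1, rater2, categories):
    """Cohen's kappa for two raters assigning nominal categories."""
    n = len(rater1)
    # Observed agreement: proportion of cases on which the raters match
    p_o = np.mean(np.array(rater1) == np.array(rater2))
    # Chance agreement: for each category, the product of the two
    # raters' marginal proportions, summed over categories
    p_e = sum((rater1.count(c) / n) * (rater2.count(c) / n)
              for c in categories)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical diagnostic ratings from two clinicians
r1 = ["dep", "anx", "dep", "dep", "anx", "dep"]
r2 = ["dep", "anx", "anx", "dep", "anx", "dep"]
print(round(cohens_kappa(r1, r2, {"dep", "anx"}), 3))  # 0.667
```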
- The purpose of rotation in factor analysis is to facilitate interpretation of the factors. Rotation:
alters the factor loadings for each variable but not the eigenvalue for each factor
alters the eigenvalue for each factor but not the factor loadings for the variables
alters the factor loadings for each variable and the eigenvalue for each factor
does not alter the eigenvalue for each factor nor the factor loadings for the variables
The Correct Answer is “C”
C. In factor analysis, rotating the factors changes the factor loadings for the variables and the eigenvalue for each factor, although the total of the eigenvalues remains the same.
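This is easy to verify numerically for an orthogonal rotation, since a factor's eigenvalue equals the sum of its squared loadings (the loadings and rotation angle below are invented):

```python
import numpy as np

# Hypothetical unrotated loadings: 4 variables x 2 factors
L = np.array([[0.8, 0.3],
              [0.7, 0.4],
              [0.2, 0.9],
              [0.3, 0.8]])

theta = np.radians(30)                    # an arbitrary rotation angle
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
L_rot = L @ R                             # rotated loadings

# Eigenvalue of each factor = sum of its squared loadings
print((L ** 2).sum(axis=0))        # per-factor eigenvalues before rotation
print((L_rot ** 2).sum(axis=0))    # different per-factor values after
print((L ** 2).sum(), (L_rot ** 2).sum())  # the totals are identical
```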
- What value is preferred for the average item difficulty level in order to maximize the size of a test’s reliability coefficient?
10.0
0.5
1.0
0.0
The Correct Answer is “B”
The item difficulty index ranges from 0 to 1 and indicates the proportion of examinees who answered the item correctly. Items with a moderate difficulty level, typically 0.5, are preferred because moderate difficulty maximizes the test’s reliability.
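One way to see why: a dichotomous item's score variance is p(1 − p), which peaks at p = 0.5, and items need variance to discriminate among examinees. A quick check:

```python
# Variance of a dichotomous item as a function of its difficulty p
for p in [0.1, 0.3, 0.5, 0.7, 0.9]:
    print(p, round(p * (1 - p), 2))
# p = 0.5 yields the maximum variance (0.25), which is what allows
# an item to discriminate among examinees
```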
- Which of the following would be used to determine the probability that examinees of different ability levels are able to answer a particular test item correctly?
criterion-related validity coefficient
item discrimination index
item difficulty index
item characteristic curve
The Correct Answer is “D”
Item characteristic curves (ICCs), which are associated with item response theory, are graphs that depict individual test items in terms of the percentage of individuals in different ability groups who answered the item correctly. For example, an ICC for an individual test item might show that 80% of people in the highest ability group, 40% of people in the middle ability group, and 5% of people in the lowest ability group answered the item correctly. Although costly to derive, ICCs provide much information about individual test items, including their difficulty, their discriminability, and the probability that the item will be guessed correctly.
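Under item response theory these curves are typically modeled with a logistic function; here is a sketch of the three-parameter logistic (3PL) form with invented item parameters:

```python
import math

def icc_3pl(theta, a, b, c):
    """Three-parameter logistic ICC: the probability that an examinee
    with ability theta answers the item correctly.
    a = discrimination, b = difficulty, c = guessing parameter."""
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

# Hypothetical item: moderate difficulty, decent discrimination,
# 20% chance of answering correctly by guessing
for theta in [-2, -1, 0, 1, 2]:
    print(theta, round(icc_3pl(theta, a=1.2, b=0.0, c=0.2), 2))
```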
- The reliability statistic that can be interpreted as the average of all possible split-half coefficients is:
the Spearman-Brown formula.
Cronbach’s coefficient alpha.
chi-square.
point-biserial coefficient.
The Correct Answer is “B”
According to classical test theory, the reliability of a test indicates the degree to which examinees’ scores are free from error and reflect their “true” test score. Reliability is typically measured by obtaining the correlation between scores on the same test, such as by having examinees take and then retake the test and correlating the two sets of scores (test-retest reliability) or by dividing the test in half and correlating scores on both halves (split-half reliability). Cronbach’s alpha, like split-half reliability, is categorized as an internal consistency reliability coefficient. Its calculation is based on the average of all inter-item correlations, which are correlations between responses on two individual items. Mathematically, Cronbach’s alpha works out to the average of all possible split-half correlations (there are many possible split-half correlations because there are many different ways of splitting the test in half).

Regarding the other choices: the Spearman-Brown formula is used to estimate the effect of lengthening a test on its reliability coefficient. Longer tests are typically more reliable, and the formula is commonly used to adjust the split-half coefficient to estimate what reliability would have been if the halved tests had as many items as the full test. The chi-square test is used to test predictions about observed versus expected frequency distributions of nominal, or categorical, data; for example, if you flip a coin 100 times, you can use the chi-square test to determine whether the distribution of heads versus tails falls within the expected range or whether there is evidence that the coin toss was “fixed.” And the point-biserial correlation coefficient is used to correlate a dichotomous variable with interval or ratio data; for example, it can be used to correlate responses on test items scored as correct or incorrect with scores on the test as a whole.
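A minimal numeric sketch of both statistics (the rating matrix and the 0.70 half-test coefficient are made up):

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha from an examinees x items score matrix."""
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)        # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)    # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

def spearman_brown(r, length_factor):
    """Projected reliability if the test were length_factor times as
    long; length_factor = 2 steps a half-test coefficient up to the
    full-length test."""
    return (length_factor * r) / (1 + (length_factor - 1) * r)

# Hypothetical 5-point ratings: 6 examinees x 4 items
X = np.array([[4, 5, 4, 3],
              [2, 3, 2, 2],
              [5, 5, 4, 5],
              [3, 3, 3, 2],
              [4, 4, 5, 4],
              [1, 2, 1, 2]])
print(round(cronbach_alpha(X), 3))
print(round(spearman_brown(0.70, 2), 3))  # 0.824
```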
- In the multitrait-multimethod matrix, a large heterotrait-monomethod coefficient would indicate:
low convergent validity.
high convergent validity.
high divergent validity.
low divergent validity
The Correct Answer is “D”
D. Use of a multitrait-multimethod matrix is one method of assessing a test’s construct validity. The matrix contains correlations among different tests that measure both the same and different traits using similar and different methodologies. The heterotrait-monomethod coefficient, one of the correlation coefficients that would appear on this matrix, reflects the correlation between two tests that measure different traits using similar methods. An example might be the correlation between a test of depression based on self-report data and a test of anxiety also based on self-report data. If a test has good divergent validity, this correlation would be low. Divergent validity is the degree to which a test has a low correlation with other tests that do not measure the same construct. Using the above example, a test of depression would have poor divergent validity if it had a high correlation with other tests that purportedly measure different traits, such as anxiety. This would be evidence that the depression test is measuring traits that are unrelated to depression.
- If you find that your job selection measure yields too many “false positives,” what could you do to correct the problem?
raise the predictor cutoff score and/or lower the criterion cutoff score
raise the predictor cutoff score and/or raise the criterion cutoff score
lower the predictor cutoff score and/or raise the criterion cutoff score
lower the predictor cutoff score and/or lower the criterion cutoff score
The Correct Answer is “A”
On a job selection test, a “false positive” is someone who is identified by the test as successful but who does not turn out to be successful, as measured by a performance criterion. If you raise the selection test cutoff score, you will reduce false positives: by making it harder to “pass” the test, you ensure that the people who do pass are more qualified and therefore more likely to be successful. Lowering the criterion cutoff, in effect, makes your definition of success more lax; it becomes easier to be considered successful, and many of the people who were false positives will now count as true positives.
If you understand concepts in pictures better than in words, refer to the Test Construction section, where a graph is used to explain this idea.
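If a numeric example helps too, here is a toy sketch in Python (all scores and cutoffs are invented):

```python
import numpy as np

# Hypothetical (predictor score, criterion score) pairs for applicants
predictor = np.array([55, 62, 70, 75, 80, 85, 90, 95])
criterion = np.array([40, 55, 50, 65, 60, 75, 80, 85])

def false_positives(pred_cutoff, crit_cutoff):
    """Selected by the test (predictor >= cutoff) but unsuccessful
    on the job (criterion < cutoff)."""
    selected = predictor >= pred_cutoff
    unsuccessful = criterion < crit_cutoff
    return int((selected & unsuccessful).sum())

print(false_positives(pred_cutoff=70, crit_cutoff=65))  # baseline: 2
print(false_positives(pred_cutoff=85, crit_cutoff=65))  # raise predictor cutoff: 0
print(false_positives(pred_cutoff=70, crit_cutoff=55))  # lower criterion cutoff: 1
```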
- Discriminant and convergent validity are classified as examples of:
construct validity.
content validity
face validity.
concurrent validity.
The Correct Answer is “A”
There are many ways to assess the validity of a test. If we correlate our test with another test that is supposed to measure the same thing, we’ll expect the two to have a high correlation; if they do, the tests will be said to have convergent validity. If our test has a low correlation with other tests measuring something our test is not supposed to measure, it will be said to have discriminant (or divergent) validity. Convergent and divergent validity are both types of construct validity.
- In the multitrait-multimethod matrix, a low heterotrait-heteromethod coefficient would indicate:
low convergent validity
low divergent validity
high convergent validity
high divergent validity
The Correct Answer is “D”
Use of a multitrait-multimethod matrix is one method of assessing a test’s construct validity. The matrix contains correlations among different tests that measure both the same and different traits using similar and different methodologies. The heterotrait-heteromethod coefficient, one of the correlation coefficients that would appear on this matrix, reflects the correlation between two tests that measure different (hetero) traits using different (hetero) methods. An example might be the correlation between vocabulary subtest scores on the WAIS-III for intelligence and scores on the Beck Depression Inventory for depression. Since these measures presumably measure different constructs, the correlation coefficient should be low, indicating high divergent or discriminant validity.
- The rotation of factors can be either orthogonal or oblique in factor analysis. An oblique rotation would be chosen when the:
effects of one or more variables have been removed from X and Y.
effects of one or more variables have been removed from X only.
variables included in the analysis are uncorrelated.
variables included in the analysis are correlated.
The Correct Answer is “D”
D. An oblique rotation is used when the variables included in the analysis are considered to be correlated. When the variables included in the analysis are believed to be uncorrelated (c.), an orthogonal rotation is used. Response choice “a.” describes partial correlation (the effects of a third variable are removed from both X and Y), and “b.” describes semipartial correlation (the effects are removed from X only).