Exam #1 Flashcards

1
Q
  1. An examinee obtains a score of 70 on a test that has a mean of 80, a standard deviation of 15, and a standard error of measurement of 5. The 95% confidence interval for the examinee’s score is:
50-90
55-85
60-80
65-75
A

The Correct Answer is “C”

C. A confidence interval indicates the range within which an examinee’s true score is likely to fall, given his or her obtained score. The standard error of measurement indicates how much error an individual test score can be expected to have and is used to construct confidence intervals. To calculate the 68% confidence interval, add and subtract one standard error of measurement to and from the obtained score; to calculate the 95% confidence interval, add and subtract two standard errors of measurement. Two standard errors of measurement in this case equal 10, and the examinee’s obtained score is 70. 70 ± 10 results in a confidence interval of 60 to 80. In other words, we can be 95% confident that the examinee’s true score falls between 60 and 80.
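
As a quick check of the arithmetic, here is a minimal sketch of the calculation (values taken from the question; the 2 × SEM margin is the conventional rounding of the 1.96 z-value):

```python
# 95% confidence interval for a true score, given an obtained score and the SEM
obtained = 70
sem = 5
margin = 2 * sem                                # two standard errors of measurement = 10
print((obtained - margin, obtained + margin))   # (60, 80) -> answer C
```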

2
Q
  1. Kuder-Richardson reliability applies to
split-half reliability.
test-retest stability.
Likert scales.
tests with dichotomously scored questions.
A

The Correct Answer is “D”

The Kuder-Richardson formula is one of several statistical indices of a test’s internal consistency reliability. It is used to assess the inter-item consistency of tests that are dichotomously scored (e.g., scored as right or wrong).
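
For reference, KR-20 is computed from each item’s proportion correct and the variance of total scores. A minimal sketch with toy data (not from the source):

```python
import numpy as np

def kr20(items: np.ndarray) -> float:
    """KR-20 for an (examinees x items) matrix of dichotomous 0/1 scores."""
    k = items.shape[1]                    # number of items
    p = items.mean(axis=0)                # proportion answering each item correctly
    total_var = items.sum(axis=1).var()   # variance of examinees' total scores
    return (k / (k - 1)) * (1 - (p * (1 - p)).sum() / total_var)

# toy data: 4 examinees x 3 items, scored right (1) or wrong (0)
scores = np.array([[1, 1, 1],
                   [1, 1, 0],
                   [1, 0, 0],
                   [0, 0, 0]])
print(kr20(scores))   # 0.75 for this toy matrix
```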

3
Q
  1. Which of the following statements is not true regarding concurrent validity?
It is used to establish criterion-related validity.
It is appropriate for tests designed to assess a person's future status on a criterion.
It is obtained by collecting predictor and criterion scores at about the same time.
It indicates the extent to which a test yields the same results as other measures of the same phenomenon.
A

The Correct Answer is “B”

There are two ways to establish the criterion-related validity of a test: concurrent validation and predictive validation. In concurrent validation, predictor and criterion scores are collected at about the same time; by contrast, in predictive validation, predictor scores are collected first and criterion data are collected at some future point. Because choice B describes predictive rather than concurrent validation, it is the statement that is not true. Concurrent validity indicates the extent to which a test yields the same results as other measures of the same phenomenon. For example, if you developed a new test for depression, you might administer it along with the BDI at about the same time and correlate the two sets of scores to assess the new test’s concurrent validity.
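
In practice, the criterion-related validity coefficient is just the correlation between predictor and criterion scores. A minimal sketch of the depression-test example (hypothetical scores; `new_test` and `bdi` are illustrative names):

```python
import numpy as np

# hypothetical scores collected at about the same time (concurrent validation)
new_test = np.array([12, 25, 7, 30, 18, 22])   # new depression test
bdi      = np.array([10, 28, 9, 33, 15, 24])   # Beck Depression Inventory

# the validity coefficient is the Pearson correlation between the two
r = np.corrcoef(new_test, bdi)[0, 1]
print(round(r, 2))   # a high r supports concurrent validity
```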

4
Q
  1. A company wants its clerical employees to be very efficient, accurate and fast. Examinees are given a perceptual speed test on which they indicate whether two names are exactly identical or slightly different. The reliability of the test would be best assessed by:
    test-retest
    Cronbach’s coefficient alpha
    split-half
    Kuder-Richardson Formula 20
A

The Correct Answer is “A”

A. Perceptual speed tests are highly speeded and are composed of very easy items that, it is assumed, every examinee could answer correctly given unlimited time. The best way to estimate the reliability of a speed test is to administer separately timed forms and correlate them; therefore, a test-retest or alternate-forms coefficient would be the best way to assess the reliability of the test in this question. The other response choices are all methods for assessing internal consistency reliability. These are useful when a test is designed to measure a single characteristic, when the characteristic measured by the test fluctuates over time, or when scores are likely to be affected by repeated exposure to the test. However, they are not appropriate for assessing the reliability of speed tests because they tend to produce spuriously high coefficients.

5
Q
  1. Which of the following descriptive words for tests are most opposite in nature?
    speed and power
    subjective and aptitude
    norm-referenced and standardized
    maximal and ipsative
A

The Correct Answer is “A”

Pure speed tests and pure power tests are opposite ends of a continuum. A speed test is one with a strict time limit and easy items that most or all examinees are expected to answer correctly; speed tests measure examinees’ response speed. A power test is one with no time limit, or a generous one, but with items ranging from easy to very difficult (usually ordered from least to most difficult); power tests measure the level of content mastered.

6
Q
  1. The kappa statistic is used to evaluate reliability when data are:
    interval or ratio (continuous)
    nominal or ordinal (discontinuous)
    metric
    nonlinear
A

The Correct Answer is “B”

B. The kappa statistic is used to evaluate inter-rater reliability, or the consistency of ratings assigned by two raters, when data are nominal or ordinal. Interval and ratio data are sometimes referred to as metric.
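
For reference, kappa corrects the observed agreement for the agreement expected by chance, estimated from the raters’ marginal totals. A minimal sketch with toy ratings (not from the source):

```python
import numpy as np

def cohens_kappa(table: np.ndarray) -> float:
    """Cohen's kappa from a two-rater agreement (contingency) table."""
    n = table.sum()
    p_o = np.trace(table) / n                                    # observed agreement
    p_e = (table.sum(axis=0) * table.sum(axis=1)).sum() / n**2   # chance agreement
    return (p_o - p_e) / (1 - p_e)

# two raters assigning 50 cases to three nominal categories
table = np.array([[15,  2,  1],
                  [ 3, 12,  2],
                  [ 1,  2, 12]])
print(round(cohens_kappa(table), 2))   # ~0.67
```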

7
Q
  1. The purpose of rotation in factor analysis is to facilitate interpretation of the factors. Rotation:
alters the factor loadings for each variable but not the eigenvalue for each factor
alters the eigenvalue for each factor but not the factor loadings for the variables
alters the factor loadings for each variable and the eigenvalue for each factor
does not alter the eigenvalue for each factor nor the factor loadings for the variables
A

The Correct Answer is “C”

C. In factor analysis, rotating the factors changes the factor loadings for the variables and the eigenvalue for each factor, although the total of the eigenvalues remains the same.
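
A small numpy sketch of this point, using toy loadings: an orthogonal rotation changes each factor’s sum of squared loadings (its eigenvalue) but leaves their total unchanged:

```python
import numpy as np

# toy loading matrix: 4 variables x 2 factors
loadings = np.array([[0.8, 0.3],
                     [0.7, 0.4],
                     [0.2, 0.9],
                     [0.3, 0.8]])

theta = np.radians(30)                       # an arbitrary orthogonal rotation
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])
rotated = loadings @ rotation

print((loadings ** 2).sum(axis=0))   # per-factor eigenvalues before rotation
print((rotated ** 2).sum(axis=0))    # ...they change after rotation
print((loadings ** 2).sum(), (rotated ** 2).sum())   # but the total is the same
```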

8
Q
  1. What value is preferred for the average item difficulty level in order to maximize the size of a test’s reliability coefficient?
10.0
0.5
1.0
0.0
A

The Correct Answer is “B”

The item difficulty index ranges from 0 to 1 and indicates the proportion of examinees who answered the item correctly. Items with a moderate difficulty level, typically 0.5, are preferred because this helps to maximize the test’s reliability.
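
The reasoning, stated numerically: a dichotomous item’s variance is p(1 − p), which peaks at p = 0.5, so mid-difficulty items contribute the most score variance (and therefore the most to reliability). A one-line check:

```python
# variance of a dichotomous item is p * (1 - p); it is largest at p = 0.5
for p in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(p, p * (1 - p))
```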

9
Q
  1. Which of the following would be used to determine the probability that examinees of different ability levels are able to answer a particular test item correctly?
criterion-related validity coefficient
item discrimination index
item difficulty index
item characteristic curve
A

The Correct Answer is “D”

Item characteristic curves (ICCs), which are associated with item response theory, are graphs that depict individual test items in terms of the percentage of individuals in different ability groups who answered the item correctly. For example, an ICC for an individual test item might show that 80% of people in the highest ability group, 40% of people in the middle ability group, and 5% of people in the lowest ability group answered the item correctly. Although costly to derive, ICCs provide much information about individual test items, including their difficulty, discriminability, and the probability that the item will be guessed correctly.
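
ICCs are commonly modeled with a logistic function; here is a minimal sketch of the three-parameter logistic (3PL) form used in item response theory, where a is discrimination, b is difficulty, and c is the guessing parameter (the parameter values are illustrative):

```python
import numpy as np

def icc_3pl(theta: float, a: float = 1.2, b: float = 0.0, c: float = 0.2) -> float:
    """Probability of a correct answer given ability theta (3PL model)."""
    return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

# probability of answering correctly at low, middle, and high ability levels
for ability in (-2.0, 0.0, 2.0):
    print(ability, round(icc_3pl(ability), 2))   # rises with ability
```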

10
Q
  1. The reliability statistic that can be interpreted as the average of all possible split-half coefficients is
    the Spearman-Brown formula.
    Cronbach’s coefficient alpha.
    chi-square.
    point-biserial coefficient.
A

The Correct Answer is “B”

According to classical test theory, the reliability of a test indicates the degree to which examinees’ scores are free from error and reflect their “true” test scores. Reliability is typically measured by obtaining the correlation between scores on the same test, such as by having examinees take and then retake the test and correlating the two sets of scores (test-retest reliability), or by dividing the test in half and correlating scores on the two halves (split-half reliability). Cronbach’s alpha, like split-half reliability, is categorized as an internal consistency reliability coefficient. Its calculation is based on the average of all inter-item correlations, which are correlations between responses on two individual items. Mathematically, Cronbach’s alpha works out to the average of all possible split-half correlations (there are many possible split-half correlations because there are many different ways of splitting the test in half).

Regarding the other choices: the Spearman-Brown formula is used to estimate the effect of lengthening a test on its reliability coefficient (longer tests are typically more reliable), and it is commonly used to adjust the split-half coefficient to estimate what the reliability would have been if each half had as many items as the full test. The chi-square test is used to test predictions about observed versus expected frequency distributions of nominal, or categorical, data; for example, if you flip a coin 100 times, you can use the chi-square test to determine whether the distribution of heads versus tails falls within the expected range or whether there is evidence that the coin toss was “fixed.” And the point-biserial correlation coefficient is used to correlate a dichotomously scored variable with interval or ratio data; for example, it can be used to correlate responses on test items scored as correct or incorrect with scores on the test as a whole.
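
Two of the statistics above in miniature (a sketch with toy values, not from the source): coefficient alpha computed from an item-score matrix, and the Spearman-Brown adjustment of a split-half coefficient:

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Coefficient alpha for an (examinees x items) score matrix."""
    k = items.shape[1]
    item_vars = items.var(axis=0).sum()    # sum of the item variances
    total_var = items.sum(axis=1).var()    # variance of the total scores
    return (k / (k - 1)) * (1 - item_vars / total_var)

def spearman_brown(r: float, factor: float) -> float:
    """Projected reliability when test length is multiplied by `factor`."""
    return factor * r / (1 + (factor - 1) * r)

# toy data: 4 examinees x 3 items
scores = np.array([[3, 2, 3],
                   [2, 2, 1],
                   [3, 1, 2],
                   [1, 1, 1]])
print(round(cronbach_alpha(scores), 2))    # 0.75 for this toy matrix

# adjust a split-half coefficient of .70 back up to full test length (factor 2)
print(round(spearman_brown(0.70, 2), 2))   # ~0.82
```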

11
Q
  1. In the multitrait-multimethod matrix, a large heterotrait-monomethod coefficient would indicate:
    low convergent validity.
    high convergent validity.
    high divergent validity.
    low divergent validity.
A

The Correct Answer is “D”

D. Use of a multitrait-multimethod matrix is one method of assessing a test’s construct validity. The matrix contains correlations among different tests that measure both the same and different traits using similar and different methodologies. The heterotrait-monomethod coefficient, one of the correlation coefficients that would appear on this matrix, reflects the correlation between two tests that measure different traits using similar methods. An example might be the correlation between a test of depression based on self-report data and a test of anxiety also based on self-report data. If a test has good divergent validity, this correlation would be low. Divergent validity is the degree to which a test has a low correlation with other tests that do not measure the same construct. Using the above example, a test of depression would have poor divergent validity if it had a high correlation with other tests that purportedly measure different traits, such as anxiety. This would be evidence that the depression test is measuring traits that are unrelated to depression.

12
Q
  1. If you find that your job selection measure yields too many “false positives,” what could you do to correct the problem?
    raise the predictor cutoff score and/or lower the criterion cutoff score
    raise the predictor cutoff score and/or raise the criterion cutoff score
    lower the predictor cutoff score and/or raise the criterion cutoff score
    lower the predictor cutoff score and/or lower the criterion cutoff score
A

The Correct Answer is “A”

On a job selection test, a “false positive” is someone who is identified by the test as successful but who does not turn out to be successful, as measured by a performance criterion. If you raise the selection test cutoff score, you will reduce false positives, since, by making it harder to “pass” the test, you will be ensuring that the people who do pass are more qualified and therefore more likely to be successful. By lowering the criterion score, what you are in effect doing is making your definition of success more lax. It therefore becomes easier to be considered successful, and many of the people who were false positives will now be considered true positives.
If you understand concepts in pictures better than in words, refer to the Test Construction section, where a graph is used to explain this idea.
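
A minimal sketch of the same idea with hypothetical scores (`predictor` and `criterion` are illustrative data, not from the source): raising the predictor cutoff, or lowering the criterion cutoff, both shrink the false-positive count.

```python
import numpy as np

# hypothetical (predictor, criterion) scores for a group of past hires
predictor = np.array([55, 60, 65, 70, 75, 80, 85, 90])
criterion = np.array([40, 70, 50, 65, 80, 60, 85, 90])

def false_positives(pred_cut: int, crit_cut: int) -> int:
    # selected by the test (>= predictor cutoff) but unsuccessful (< criterion cutoff)
    return int(((predictor >= pred_cut) & (criterion < crit_cut)).sum())

print(false_positives(pred_cut=60, crit_cut=70))   # 3 with the original cutoffs
print(false_positives(pred_cut=75, crit_cut=70))   # 1 after raising the predictor cutoff
print(false_positives(pred_cut=60, crit_cut=60))   # 1 after lowering the criterion cutoff
```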

13
Q
  1. Discriminant and convergent validity are classified as examples of:
    construct validity.
    content validity.
    face validity.
    concurrent validity.
A

The Correct Answer is “A”

There are many ways to assess the validity of a test. If we correlate our test with another test that is supposed to measure the same thing, we’ll expect the two to have a high correlation; if they do, the tests will be said to have convergent validity. If our test has a low correlation with other tests measuring something our test is not supposed to measure, it will be said to have discriminant (or divergent) validity. Convergent and divergent validity are both types of construct validity.

14
Q
  1. In the multitrait-multimethod matrix, a low heterotrait-heteromethod coefficient would indicate:
    low convergent validity
    low divergent validity
    high convergent validity
    high divergent validity
A

The Correct Answer is “D”

Use of a multitrait-multimethod matrix is one method of assessing a test’s construct validity. The matrix contains correlations among different tests that measure both the same and different traits using similar and different methodologies. The heterotrait-heteromethod coefficient, one of the correlation coefficients that would appear on this matrix, reflects the correlation between two tests that measure different (hetero) traits using different (hetero) methods. An example might be the correlation between vocabulary subtest scores on the WAIS-III for intelligence and scores on the Beck Depression Inventory for depression. Since these measures presumably measure different constructs, the correlation coefficient should be low, indicating high divergent or discriminant validity.

15
Q
  1. The rotation of factors can be either orthogonal or oblique in factor analysis. An oblique rotation would be chosen when the:
    effects of one or more variables have been removed from X and Y.
    effects of one or more variables have been removed from X only.
    variables included in the analysis are uncorrelated.
    variables included in the analysis are correlated.
A

The Correct Answer is “D”

D. An oblique rotation is used when the variables included in the analysis are considered to be correlated. When the variables included in the analysis are believed to be uncorrelated (c.), an orthogonal rotation is used. Response choice “a.” describes partial correlation (the effects of one or more variables are removed from both X and Y), and “b.” describes semipartial correlation (the effects are removed from X only).

16
Q
  1. The item difficulty (“p”) index yields information about the difficulty of test items in terms of a(n) _________ scale of measurement.
nominal
ordinal
interval
ratio
A

The Correct Answer is “B”

An item difficulty index indicates the percentage of individuals who answer a particular item correctly. For example, if an item has a difficulty index of .80, it means that 80% of test-takers answered the item correctly. Although it appears that the item difficulty index is a ratio scale of measurement, according to Anastasi (1982) it is actually an ordinal scale because it does not necessarily indicate equivalent differences in difficulty.

17
Q
  1. If, in a normally shaped distribution, the mean is 100 and the standard error of measurement is 10, what would the 68% confidence interval be for an examinee who receives a score of 95?
85 to 105
90 to 100
90 to 110
impossible to calculate without the reliability coefficient
A

The Correct Answer is “A”

The standard error of measurement indicates how much error an individual test score can be expected to have. A confidence interval indicates the range within which an examinee’s true score is likely to fall, given his or her obtained score. To calculate the 68% confidence interval, we simply add and subtract one standard error of measurement to and from the obtained score. Choice D is incorrect because, although the reliability coefficient is needed to calculate a standard error of measurement, in this case we are provided with the standard error.

18
Q
  1. The cutoff IQ score for placement in a school district’s gifted program is 135. The parent of a child who scored 133 might be interested in knowing the test’s standard error of measurement in order to estimate the child’s
true score.
mean score.
error score.
criterion score.
A

The Correct Answer is “A”

The question is just a roundabout way of asking “what is the standard error of measurement?”, though it does supply a practical application of the concept. According to classical test theory, an obtained test score consists of truth and error. The truth component reflects the degree to which the score reflects the actual characteristic the test measures, and the error component reflects random or chance factors affecting the score. For instance, a score on an IQ test will reflect to some degree the person’s “true” IQ and to some degree chance factors, such as whether the person was tired the day he took the test or whether some of the questions happen to be a particularly good fit with the person’s knowledge base. The standard error of measurement of a test indicates the expected amount of error a score on that test will contain. It can be used to answer the question, “given an obtained score, what is the likely true score?” For example, if the test referenced had a standard error of measurement of 5, there would be a 68% chance that the true score lies within one standard error of measurement of the obtained score (between 128 and 138 in this case) and a 95% chance that the true score lies within two standard errors of measurement (between 123 and 143). So the parent would be interested to know the test’s standard error of measurement because the higher it is, the greater the possibility that an obtained score of 133 actually reflects a true score of 135 or above.

19
Q
  1. Determining test-retest reliability would be most appropriate for which of the following types of tests?
brief
speed
state
trait
A

The Correct Answer is “D”

As the name implies, test-retest reliability involves administering a test to the same group of examinees at two different times and then correlating the two sets of scores. This would be most appropriate when evaluating a test that purports to measure a stable trait, since it should not be significantly affected by the passage of time between test administrations.