Test construction Flashcards

1
Q
A researcher employs multiple methods of measurement in an attempt to increase reliability by reducing systematic error. This strategy is referred to as:
Select one:
A. calibration
B. intraclass correlation (ICC)
C. triangulation
D. correction for attenuation
A

Correct Answer is: C
Triangulation is the attempt to increase reliability by reducing systematic or method error through a strategy in which the researcher employs multiple methods of measurement (e.g., observation, survey, archival data). If the alternative methods do not share the same source of systematic error, examination of data from the alternative methods gives insight into how individual scores may be adjusted to come closer to reflecting true scores, thereby increasing reliability.
calibration

Calibration is the attempt to increase reliability by increasing homogeneity of ratings through feedback to the raters, when multiple raters are used. For example, raters might meet during pretesting of the instrument to discuss items on which they have disagreed, seeking to reach consensus on rules for rating items (e.g., defining a “2” for an item dealing with job performance).

intraclass correlation (ICC)

Intraclass correlation (ICC) is used to measure inter-rater reliability for two or more raters and may also be used to assess test-retest reliability. ICC may be conceptualized as the ratio of between-groups variance to total variance.

correction for attenuation

Correction for attenuation is a method used to adjust correlation coefficients upward because of errors of measurement when two measured variables are correlated; the errors always serve to lower the correlation coefficient as compared with what it would have been if the measurement of the two variables had been perfectly reliable.

2
Q
The reliability statistic that can be interpreted as the average of all possible split-half coefficients is
Select one:
A. the Spearman-Brown formula.
B. Cronbach's coefficient alpha.
C. chi-square.
D. point-biserial coefficient.
A

Correct Answer is: B
According to classical test theory, the reliability of a test indicates the degree to which examinees’ scores are free from error and reflect their “true” test score. Reliability is typically measured by obtaining the correlation between scores on the same test, such as by having examinees take and then retake the test and correlating the two sets of scores (test-retest reliability) or by dividing the test in half and correlating scores on the two halves (split-half reliability). Cronbach’s alpha, like split-half reliability, is categorized as an internal consistency reliability coefficient. Its calculation is based on the average of all inter-item correlations, which are correlations between responses on two individual items. Mathematically, Cronbach’s alpha works out to the average of all possible split-half correlations (there are many possible split-half correlations because there are many different ways of splitting the test in half).
Regarding the other choices, the Spearman-Brown formula is used to estimate the effects of lengthening a test on its reliability coefficient. Longer tests are typically more reliable. The Spearman-Brown formula is commonly used to adjust the split-half coefficient to estimate what reliability would have been if the halved tests had as many items as the full test. The chi-square test is used to test predictions about observed versus expected frequency distributions of nominal, or categorical, data; for example, if you flip a coin 100 times, you can use the chi-square test to determine if the distribution of heads versus tails outcomes falls into the expected range or if there is evidence that the coin toss was “fixed.” And the point-biserial correlation coefficient is used to correlate dichotomously scaled variables with interval or ratio data; for example, it can be used to correlate responses on test items scored as correct or incorrect with scores on the test as a whole.
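If it helps to see the arithmetic, here is a minimal Python sketch of the standard formula for coefficient alpha, alpha = (k/(k-1)) x (1 - sum of item variances / total-score variance); the score matrix is hypothetical toy data, not from any real test.

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Coefficient alpha for an (examinees x items) matrix of item scores."""
    k = scores.shape[1]                          # number of items
    item_vars = scores.var(axis=0, ddof=1)       # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of examinees' total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical toy data: 5 examinees answering 4 items
scores = np.array([[3, 4, 3, 4],
                   [2, 2, 3, 2],
                   [5, 4, 5, 5],
                   [1, 2, 1, 2],
                   [4, 4, 4, 3]])
print(round(cronbach_alpha(scores), 3))  # ~0.95: items hang together well
```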

3
Q
If a job selection test has lower validity for Hispanic examinees as compared to White or African American examinees, you could say that ethnicity is acting as a:
Select one:
A. confounding variable
B. criterion contaminator
C. discriminant variable
D. moderator variable
A

Correct Answer is: D
A moderator variable is any variable which moderates, or influences, the relationship between two other variables. If the validity of a job selection test is different for different ethnic groups (i.e., there is differential validity), then ethnicity would be considered a moderator variable since it is influencing the relationship between the test (predictor) and actual job performance (the criterion).
A confounding variable is a variable in a research study which is not of interest to the researcher, but which exerts a systematic effect on the DV. Criterion contamination is the artificial inflation of validity which can occur when raters subjectively score ratees on a criterion measure after they have been informed how the ratees scored on the predictor.

4
Q

In a factor analysis, an eigenvalue corresponds to
Select one:
A. the number of latent variables.
B. the strength of the relationship between factors.
C. the level of significance of the factor analysis.
D. the explained variance of one of the factors.

A

Correct Answer is: D
When a factor analysis produces a series of factors, it is useful to determine how much of the variance is accounted for by each factor. An eigenvalue is based on the factor loadings of all the variables in the factor analysis to a particular factor. When the factor loadings are high, the eigenvalue will be large. A large eigenvalue would mean that a particular factor accounts for a large proportion of the variance among the variables.

5
Q

The factor loading for Test A and Factor II is .80 in a factor matrix. This means that:
Select one:
A. only 80% of variability in Test A is accounted for by the factor analysis
B. only 64% of variability in Test A is accounted for by the factor analysis
C. 80% of variability in Test A is accounted for by Factor II
D. 64% of variability in Test A is accounted for by Factor II

A

Correct Answer is: D
The correlation coefficient for a test and an identified factor is referred to as a factor loading. To obtain a measure of shared variability, the factor loading is squared. In this example, the factor loading is .80, meaning that 64% (.80 squared) of variability in the test is accounted for by the factor.
The other identified factor(s) probably also account for some variability in Test A, which is why the option “only 64% of variability in Test A is accounted for by the factor analysis” is not the best answer.
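As a quick illustration, here is a minimal Python sketch of that arithmetic. The Factor I loading of .40 is hypothetical (it is not given in the question); with orthogonal factors, the sum of squared loadings, called the communality, is the total variability in Test A explained by the full analysis.

```python
# Hypothetical loadings of Test A on two orthogonal factors
loading_factor1 = 0.40   # assumed for illustration; not given in the question
loading_factor2 = 0.80   # the loading stated in the question

var_from_factor2 = loading_factor2 ** 2                    # 0.64 -> 64% explained by Factor II
communality = loading_factor1 ** 2 + loading_factor2 ** 2  # 0.80 -> variance explained by all factors
print(var_from_factor2, communality)
```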

6
Q
Likert scales are most useful for:
Select one:
A. dichotomizing quantitative data
B. quantifying objective data
C. quantifying subjective data
D. ordering categorical data
A

Correct Answer is: C
Attitudes are subjective phenomena. Likert scales indicate the degree to which a person agrees or disagrees with an attitudinal statement. Using a Likert scale, attitudes are quantified, that is, represented in terms of ordinal scores.

7
Q

Which statement is most correct?
Select one:
A. High reliability assumes high validity.
B. High validity assumes high reliability.
C. Low validity assumes low reliability.
D. Low reliability assumes low validity.

A

Correct Answer is: B
This question is difficult because the language of the response choices is convoluted and imprecise. We don’t write questions like this because we’re sadistic; it’s just that you’ll sometimes see this type of language on the exam as well, and we want to prepare you. What you need to do on questions like this is bring to mind what you know about the issue being asked about, and to choose the answer that best applies. Here, you should bring to mind what you know about the relationship between reliability and validity: For a test to have high validity, it must be reliable; however, for a test to have high reliability, it does not necessarily have to be valid. With this in mind, you should see that “high validity assumes high reliability” is the best answer. This means that a precondition of high validity is high reliability.
The second best choice states that low reliability assumes low validity. This is a true statement if you interpret the word “assume” to mean “implies” or “predicts.” But if you interpret the word “assume” to mean “depends on” or “is preconditioned by,” the statement is not correct.

8
Q
A person obtains a raw score of 70 on a Math test with a mean of 50 and an SD of 10; a percentile rank of 84 on a History test; and a T-score of 65 on an English test. What is the relative order of each of these scores?
Select one:
A. History >> Math >> English
B. Math >> History >> English
C. History >> English >> Math
D. Math >> English >> History
A

Correct Answer is: D
Before we can compare different forms of scores, we must transform them into some form of standardized measure. A Math test with a mean of 50 and an SD of 10 indicates that a raw score of 70 falls 2 standard deviations above the mean. Assuming a normal distribution of scores, a percentile rank of 84 on a History test is equivalent to 1 standard deviation above the mean. If you haven’t memorized that, you could still figure it out: Remember that 50% of all scores in a normal distribution fall below the mean and 50% fall above the mean, and 68% of scores fall within +/- 1 SD of the mean. If you divide 68% by 2, you get 34% (the percentage of scores that fall between the mean and +1 SD). If you then add that 34% to the 50% that fall below the mean, you get a percentile rank of 84. Thus, the 84th percentile is equivalent to 1 SD above the mean. Finally, looking at the T-score on the English test: T-scores always have a mean of 50 and an SD of 10, so a T-score of 65 is equivalent to 1½ standard deviations above the mean. Comparing the 3 test scores, we find the highest score was in Math at 2 SDs above the mean, followed by English at 1½ SDs above the mean, and History at 1 SD above the mean.
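The same conversions in a minimal Python sketch; scipy's inverse normal CDF recovers the z-score corresponding to a percentile rank:

```python
from scipy.stats import norm

# Math: raw score with known mean and SD
z_math = (70 - 50) / 10        # 2.0 SDs above the mean

# History: percentile rank of 84 -> z via the inverse normal CDF
z_history = norm.ppf(0.84)     # ~0.99, i.e., about 1 SD above the mean

# English: T-scores always have mean 50 and SD 10
z_english = (65 - 50) / 10     # 1.5 SDs above the mean

print(z_math, z_english, z_history)  # Math > English > History
```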

9
Q

Computer-adaptive testing will yield
Select one:
A. more accurate results for high scorers on a test.
B. more accurate results for low scorers on a test.
C. more accurate results for examinees who score in the middle range of a test.
D. equally accurate results across all ranges of scores on a test.

A

Correct Answer is: D
In computerized adaptive testing, the examinee’s previous responses are used to tailor the test to his or her ability. As a result, scores are measured with comparable accuracy across all ability levels.

10
Q
The kappa statistic is used to evaluate reliability when data are:
Select one:
A. interval or ratio (continuous)
B. nominal or ordinal (discontinuous)
C. metric
D. nonlinear
A

Correct Answer is: B
The kappa statistic is used to evaluate inter-rater reliability, or the consistency of ratings assigned by two raters, when data are nominal or ordinal. Interval and ratio data are sometimes referred to as metric data.
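As a point of reference, kappa corrects observed agreement for the agreement expected by chance. A minimal Python sketch with hypothetical agreement proportions:

```python
def cohens_kappa(p_observed: float, p_chance: float) -> float:
    """kappa = (p_o - p_e) / (1 - p_e): observed agreement corrected for chance."""
    return (p_observed - p_chance) / (1 - p_chance)

# Hypothetical: two raters agree on 85% of cases; 50% agreement expected by chance
print(cohens_kappa(0.85, 0.50))  # 0.7
```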

11
Q

Which of the following would be used to determine the probability that examinees of different ability levels are able to answer a particular test item correctly?
Select one:
A. criterion-related validity coefficient
B. item discrimination index
C. item difficulty index
D. item characteristic curve

A

Correct Answer is: D
Item characteristic curves (ICCs), which are associated with item response theory, are graphs that depict individual test items in terms of the percentage of individuals in different ability groups who answered the item correctly. For example, an ICC for an individual test item might show that 80% of people in the highest ability group, 40% of people in the middle ability group, and 5% of people in the lowest ability group answered the item correctly. Although costly to derive, ICCs provide much information about individual test items, including their difficulty, discriminability, and probability that the item will be guessed correctly.

12
Q
The slope of the item response curve, with respect to item response theory, indicates an item's:
Select one:
A. reliability
B. validity
C. difficulty
D. discriminability
A

Correct Answer is: D
The item response curve provides information about an item’s difficulty; its ability to discriminate between examinees who are high and low on the characteristic being measured; and the probability of correctly answering the item by guessing. The position of the curve indicates the item’s difficulty, and the steeper the slope of the curve, the better the item discriminates between examinees who are high and low on the characteristic being measured, so discriminability is the correct response. The item response curve does not indicate reliability or validity, making those options incorrect.
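These curves are often written as a three-parameter logistic function, where a is the slope (discrimination), b the position (difficulty), and c the lower asymptote (guessing). A minimal Python sketch with hypothetical parameter values (some texts include a 1.7 scaling constant, omitted here):

```python
import math

def icc_3pl(theta: float, a: float, b: float, c: float) -> float:
    """Three-parameter logistic item characteristic curve.

    a = discrimination (slope), b = difficulty (position of the curve),
    c = guessing parameter (lower asymptote).
    """
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

# Hypothetical item: steep slope (a = 2.0) discriminates sharply near its difficulty (b = 0.0)
for theta in (-2.0, 0.0, 2.0):
    print(theta, round(icc_3pl(theta, a=2.0, b=0.0, c=0.2), 2))  # 0.21, 0.6, 0.99
```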

13
Q

What are the minimum and maximum values of the standard error of measurement?
Select one:
A. 0 and the standard deviation of test scores
B. 0 and 1
C. 1 and the standard deviation of test scores
D. -1 and 1

A

Correct Answer is: A
This question is best answered with reference to the formula for the standard error of measurement, which appears in the Psychology-Test Construction section: the reliability coefficient is subtracted from 1, the square root of that value is taken, and the result is multiplied by the standard deviation of the test scores, i.e., SEM = SD(x) * sqrt(1 - r). You need to know the minimum and maximum values of the reliability coefficient – 0 and +1.0, respectively. If the reliability coefficient is +1.0, you will find from the formula that the standard error of measurement is 0, which is its minimum value. And when the reliability coefficient is 0, you find from the formula that the standard error of measurement is equal to the standard deviation of test scores, which is its maximum value.
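A minimal Python sketch of the formula confirms both endpoints (the reliability of .91 is a hypothetical intermediate value added for illustration):

```python
import math

def sem(sd: float, reliability: float) -> float:
    """Standard error of measurement: SEM = SD * sqrt(1 - r)."""
    return sd * math.sqrt(1 - reliability)

sd = 10
print(sem(sd, 1.0))   # 0.0  -> minimum: perfect reliability
print(sem(sd, 0.0))   # 10.0 -> maximum: equals the SD of test scores
print(sem(sd, 0.91))  # 3.0  -> a typical intermediate case
```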

14
Q
Which of the following methods of establishing a test's reliability is, all other things being equal, likely to be lowest?
Select one:
A. split-half
B. Cronbach's alpha
C. alternate forms
D. test-retest
A

Correct Answer is: C
You probably remember that the alternate forms coefficient is considered by many to be the best reliability coefficient to use when practical (if you don’t, commit this factoid to memory now). Everything else being equal, it is also likely to have a lower magnitude than the other types of reliability coefficients. The reason for this is similar to the reason why it is considered the best one to use. To obtain an alternate forms coefficient, one must administer two forms of the same test to a group of examinees, and correlate scores on the two forms. The two forms of the test are administered at different times and (because they are different forms) contain different items or content. In other words, there are two sources of error (or factors that could lower the coefficient) for the alternate forms coefficient: the time interval and different content (in technical terms, these sources of error are referred to respectively as “time sampling” and “content sampling”). The alternate forms coefficient is considered the best reliability coefficient by many because, for it to be high, the test must demonstrate consistency across both a time interval and different content.

15
Q
When constructing an achievement test, which of the following would be useful for comparing total test scores of a sample of examinees to the proportion of examinees who answer each item correctly?
Select one:
A. classical test theory
B. item response theory
C. generalizability theory
D. item utility theory
A

Correct Answer is: B
The question describes the kind of information that is provided in an item response curve, which is constructed for each item to determine its characteristics when using item response theory as the basis for test development. (Note that there is no such thing as “item utility theory.”)

16
Q

Limited “floor” would be the biggest problem when a test will be used to
Select one:
A. distinguish between mildly and moderately retarded children.
B. distinguish between above-average and gifted students.
C. distinguish between successful and unsuccessful trainees.
D. distinguish between satisfied and dissatisfied customers.

A

Correct Answer is: A
Floor refers to a test’s ability to distinguish between examinees at the low end of the distribution, which would be an issue when distinguishing between those with mild versus moderate retardation. Limited floor occurs when the test does not contain enough easy items.
Note that “ceiling” would be of concern for tests designed to distinguish between examinees at the high end of the distribution, such as distinguishing between above-average and gifted students.

17
Q

Which of the following statements is not true regarding concurrent validity?
Select one:
A. It is used to establish criterion-related validity.
B. It is appropriate for tests designed to assess a person’s future status on a criterion.
C. It is obtained by collecting predictor and criterion scores at about the same time.
D. It indicates the extent to which a test yields the same results as other measures of the same phenomenon.

A

Correct Answer is: B
There are two ways to establish the criterion-related validity of a test: concurrent validation and predictive validation. In concurrent validation, predictor and criterion scores are collected at about the same time; by contrast, in predictive validation, predictor scores are collected first and criterion data are collected at some future point. Concurrent validity indicates the extent to which a test yields the same results as other measures of the same phenomenon. For example, if you developed a new test for depression, you might administer it along with the BDI and measure the concurrent validity of the two tests.

18
Q

All of the following statements regarding item response theory are true, except
Select one:
A. it cannot be applied in the attempt to develop culture-fair tests.
B. it’s a useful theory in the development of computer programs designed to create tests tailored to the individual’s level of ability.
C. one of its assumptions is that test items measure a “latent trait.”
D. it usually has little practical significance unless one is working with very large samples.

A

Correct Answer is: A
Item response theory is a highly technical mathematical approach to item analysis. Its use is based on a number of complex mathematical assumptions. One of these assumptions, known as invariance of item parameters, holds that the characteristics of items should be the same for all theoretically equivalent groups of subjects chosen from the same population. Thus, any culture-free test should demonstrate such invariance; i.e., a set of items shouldn’t have a different set of characteristics for minority and non-minority subgroups.
it cannot be applied in the attempt to develop culture-fair tests.

For this reason, item response theory has been applied to the development of culture-free tests, so this choice is not a true statement. The other choices are all true statements about item response theory, and therefore incorrect answers to this question.

it’s a useful theory in the development of computer programs designed to create tests tailored to the individual’s level of ability.

Consistent with this choice, item response theory is the theoretical basis of computer adaptive assessment, in which tests tailored to the examinee’s ability level are computer generated.

one of its assumptions is that test items measure a “latent trait.”

As stated by this choice, an assumption of item response theory is that items measure a latent trait, such as intelligence or general ability.

it usually has little practical significance unless one is working with very large samples.

And, finally, research supports the notion that the assumptions of item response theory only hold true for very large samples.

19
Q
Form A is administered to a group of employees in the spring and then again in the fall. Using this method, what type of reliability is measured?
Select one:
A. split-half
B. equivalence
C. stability
D. internal consistency
A

Correct Answer is: C
Test-retest reliability, or the coefficient of stability, involves administering the same test to the same group on two occasions and then correlating the scores.
split-half

Split-half reliability is a method of determining internal consistency reliability.

equivalence

Alternative forms reliability, or coefficient of equivalence, consists of administering two alternate forms of a test to the same group and then correlating the scores.

internal consistency

Internal consistency reliability utilizes a single test administration and involves obtaining correlations among individual test items.

20
Q
In a study assessing the predictive validity of the SAT test to predict college success, it is found the SAT scores have a statistically significant correlation of .47 with the criterion, first year college GPA. A follow-up study separating the data by gender finds that for a given SAT score, the predicted GPA scores are higher for women than for men. This situation is most clearly an example of
Select one:
A. single group validity.
B. differential validity.
C. differential prediction.
D. adverse impact.
A

Correct Answer is: C
Differential prediction is a bit of a technical term, but in a non-technical way, it can be defined as a case where given scores on a predictor test predict different outcomes for different subgroups. Using the example in the question: if the average predicted GPA for men scoring 500 on the verbal SAT was 2.7, the average predicted GPA for females with the same SAT score was 3.3, and this type of difference is statistically significant across scores on the SAT, then use of the SAT would result in differential prediction based on gender. Differential prediction could result in selection bias in favor of one group at the expense of others. In the example under discussion, if 500 were the cutoff score for college admission, the men selected for admission would be less qualified than the women selected, and there would be a number of women not selected for admission who were equally or more qualified than the men who were selected. So use of the test would not be fair to female candidates.
Regarding the other choices: differential validity means that a test is more valid for one subgroup than for another, and single-group validity means that a test is valid for one subgroup but not for another. In both cases, the validity coefficient, or the correlation between the predictor and criterion, is different for different subgroups. This could be, but is not necessarily, the cause of differential prediction. In our example, it could be that, even though criterion scores are different for men and women at the same SAT score, the SAT predicts those scores at the same accuracy level for both groups (e.g., the score 500 provides the same level of predictive power for both men and women). In this scenario, the validity coefficients would be the same for both groups. Finally, adverse impact occurs when the use of a selection test results in a substantially lower rate of selection for one subgroup as compared to another–specifically, when the selection rate of one subgroup is 80% or less of the selection rate of another. For example, if use of the SAT resulted in 80% of males and 50% of females being admitted to college, the test would have adverse impact against females (50/80 = .625, or 62.5%, which is less than 80%). Since the question contains no information about the selection rates for men and women, this is not the best choice.
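The four-fifths (80%) rule described above reduces to a single division; a minimal Python sketch using the numbers from this example:

```python
def four_fifths_violation(rate_focal: float, rate_reference: float) -> bool:
    """True if the focal group's selection rate is less than 80% of the reference group's."""
    return (rate_focal / rate_reference) < 0.80

# The card's example: 50% of women vs. 80% of men admitted
print(four_fifths_violation(0.50, 0.80))  # True (0.625 < 0.80 -> adverse impact)
```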

21
Q

The sensitivity of a screening for a psychological disorder refers to
Select one:
A. the ratio of correct to incorrect diagnostic decisions its use results in.
B. the proportion of correct diagnostic decisions its use results in.
C. the proportion of individuals without the disorder it identifies.
D. the proportion of individuals with the disorder it identifies.

A

Correct Answer is: D
In any test used to make a “yes/no” decision (e.g., screening tests, medical tests such as pregnancy tests, and job selection tests in some cases), the term “sensitivity” refers to the proportion of correctly identified cases–i.e., the ratio of examinees whom the test correctly identifies as having the characteristic to the total number of examinees who actually possess the characteristic. You can also conceptualize sensitivity in terms of true positives and false negatives. A “positive” on a screening test means that the test identified the person as having the condition, while a “negative” means the test classified the person as not having the condition. The terms “true” and “false” in this context refer to the accuracy or correctness of test results. Therefore, sensitivity can be defined as the ratio of true positives (people with the condition whom the test correctly detects) to the sum of true positives and false negatives (all the examinees who have the condition).
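As a minimal Python sketch of that ratio (the counts are hypothetical):

```python
def sensitivity(true_positives: int, false_negatives: int) -> float:
    """Proportion of people who actually have the disorder that the screen flags."""
    return true_positives / (true_positives + false_negatives)

# Hypothetical screen: of 100 people with the disorder, 90 are flagged, 10 are missed
print(sensitivity(90, 10))  # 0.9
```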

22
Q

When conducting a factor analysis, an oblique rotation is preferred when:
Select one:
A. more than two factors have been extracted.
B. the underlying traits are believed to be dependent.
C. the assumption of homoscedasticity has been violated.
D. the number of factors is equal to the number of tests

A

Correct Answer is: B
In the context of factor analysis, “oblique” means correlated or dependent. (“Orthogonal” means uncorrelated or independent.)

23
Q

A negative item discrimination (D) indicates:
Select one:
A. an index equal to zero.
B. more high-achieving examinees than low-achieving examinees answered the item correctly.
C. an item was answered correctly by the same number of low- and high-achieving students.
D. more low-achieving examinees answered the item correctly than high-achieving.

A

Correct Answer is: D
The discrimination index, D, has a value range from +1.0 to -1.0 and is the number of people in the upper (high-scoring) group who answered the item correctly minus the number of people in the lower-scoring group who answered the item correctly, divided by the number of people in the larger of the two groups. An item will have a discrimination index equal to zero if everyone gets it correct or everyone gets it incorrect. A negative item discrimination index indicates that the item was answered correctly by more low-achieving students than by high-achieving students. In other words, a poor student may make a guess, select that response, and come up with the correct answer without any real understanding of what is being assessed, whereas good students (like EPPP candidates) may be suspicious of a question that looks too easy, may read too much into it, and may end up being less successful than those who guess.
more high-achieving examinees than low-achieving examinees answered the item correctly.

A positive item discrimination index indicates that the item was answered correctly by more high-achieving students than by low-achieving students.
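A minimal Python sketch of this formula, following the definition above, with hypothetical counts for two groups of 20 examinees each:

```python
def discrimination_index(upper_correct: int, lower_correct: int, larger_group_n: int) -> float:
    """D = (correct in upper group - correct in lower group) / n of the larger group."""
    return (upper_correct - lower_correct) / larger_group_n

print(discrimination_index(18, 8, 20))   # +0.5: item discriminates in the expected direction
print(discrimination_index(8, 18, 20))   # -0.5: low scorers outperform high scorers
print(discrimination_index(20, 20, 20))  #  0.0: everyone answered correctly
```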

24
Q
A condition necessary for pooled variance is:
Select one:
A. unequal sample sizes
B. equal sample sizes
C. unequal covariances
D. equal covariances
A

Correct Answer is: B
Pooled variance is the weighted average of the group variances, weighted by the number of subjects in each group. Use of a pooled variance assumes that the population variances are approximately the same, even though the sample variances differ. When the population variances are known or can be assumed to be equal, the estimate may be labeled “equal variances assumed,” “common variance,” or “pooled variance.” “Equal variances not assumed,” or separate variances, is appropriate for normally distributed individual values when the population variances are known to be unequal or cannot be assumed to be equal.
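For reference, the usual two-group pooled-variance formula weights each sample variance by its degrees of freedom; a minimal Python sketch with hypothetical values:

```python
def pooled_variance(n1: int, var1: float, n2: int, var2: float) -> float:
    """Weighted average of two sample variances, weighted by degrees of freedom."""
    return ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)

# Hypothetical groups: the larger sample pulls the pooled estimate toward its own variance
print(round(pooled_variance(30, 12.0, 10, 20.0), 2))  # 13.89
```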

25
Q

R-squared is used as an indicator of:
Select one:
A. The number of values that are free to vary in a statistical calculation
B. The variability of scores
C. How much your ability to predict is improved using the regression line
D. The relationship between two variables that have a nonlinear relationship

A

Correct Answer is: C
You might have been able to guess correctly using the process of elimination. If so, note that R-squared tells you how much your ability to predict is improved by using the regression line, compared to not using it. The greatest possible improvement is 1 and the least is 0.
The number of values that are free to vary in a statistical calculation

This choice is the definition of degrees of freedom.

The variability of scores

This is the definition of variance.

The relationship between two variables that have a nonlinear relationship

And this is a description of the coefficient eta.
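A minimal Python sketch with hypothetical predictor and criterion scores, showing R-squared as the squared Pearson correlation:

```python
import numpy as np

# Hypothetical predictor (x) and criterion (y) scores
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])

r = np.corrcoef(x, y)[0, 1]   # Pearson correlation
r_squared = r ** 2            # improvement in prediction from using the regression line
print(round(r_squared, 3))    # near 1: the line predicts y almost perfectly
```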

26
Q

The primary purpose of rotation in factor analysis is to
Select one:
A. facilitate interpretation of the data.
B. improve the mathematical fit of the solution.
C. obtain uncorrelated factors.
D. improve the predictive validity of the factors

A

Correct Answer is: A
Factor analysis is a statistical procedure that is designed to reduce measurements on a number of variables to fewer, underlying variables. Factor analysis is based on the assumption that variables or measures highly correlated with each other measure the same or a similar underlying construct, or factor. For example, a researcher might administer 250 proposed items on a personality test and use factor analysis to identify latent factors that could account for variability in responses to the items. These factors would then be interpreted based on logical analysis or the researcher’s theories. If one of the factors identified by the analysis correlated highly with items that asked about the person’s happiness, level of energy, and hopelessness, that factor might be labeled “Depressive Tendencies.” In factor analysis, rotation is usually the final statistical step. Its purpose is to facilitate the interpretation of data by identifying variables that load (i.e., correlate) highly on one factor and not others.