Test Construction Flashcards by Michael Gale

An examiner administers and
scores the same test numerous
times without deviating from the
procedure in order to reduce
the possibility of measurement
error. This exemplifies what?

Standardization

The scores of a representative
population sample on a test, against
which an examiner compares an
individual's scores, are referred to as
\_\_\_\_\_\_\_\_; while they allow for
comparisons of a person's
performance on different tests, they
do not provide the ultimate standard
of performance.

Norms

A psychological test that is
regarded as \_\_\_\_\_\_\_\_ is
administered, scored, and
interpreted independent of
the subjective judgment of
the examiner.

Objective

The SAT and GRE are
examples of \_\_\_\_\_\_\_\_ tests,
as they provide information
about a person's best possible
performance, while the MMPI-2
and PAI are \_\_\_\_\_\_\_\_ tests,
providing information about a
person's usual experience.

Maximum
performance;
typical
performance

________ tests assess the difficulty level
an examinee can attain (e.g., Information
from WAIS), ________ tests assess the
person’s response rate (e.g., Digit
Symbol from WAIS), and ________ tests
help determine whether an individual can
attain a certain level of acceptable
performance (e.g., test of reading skills).

Power;
speed;
mastery

A \_\_\_\_\_\_\_\_ occurs when an instrument
cannot take on a value higher than some
limit due to the measure not including
enough difficult items, resulting in all
high-achieving examinees getting similar
scores (test is too easy); conversely, a
\_\_\_\_\_\_\_\_ occurs when an instrument
cannot take on a lower value and thus all
low-achieving examinees get similar
scores (test is too hard).

Ceiling
effect; floor
effect

In contrast to normative
measures, these types of
measures require individuals to
use their own frame of
reference to compare 2 or more
desirable options and choose
the one that is most preferred.

Ipsative measures

\_\_\_\_\_\_\_\_ is the consistency of
a test, or the degree to which a
test provides the same results
under the same conditions;
\_\_\_\_\_\_\_\_ refers to the degree
that a test measures what it
claims to be measuring.

Reliability; validity

A perfectly reliable test would yield every
examinee’s ________ every time it was
administered, as this would indicate the
examinee’s actual ability on whatever the
test is measuring; however, a test is
never perfectly reliable due to ________,
which is random and can be caused by
environmental noise, examinee’s mood
on testing day, and any other number of
factors.

True score;
measurement
error

The most commonly used methods of estimating
reliability of a test use a correlation coefficient,
referred to as the ________, ranging in value
from 0.0 to +1.0, where coefficients closer to 0.0
indicate less reliability and values closer to +1.0
indicate increasing reliability; the coefficient is
not squared to determine the proportion of
variability, unlike other correlation coefficients,
rather it is interpreted directly.

Reliability coefficient

A researcher administers the same
instrument to the same group of
college students on 2 separate
occasions; following the second
administration, the researcher
correlates scores on the first and second
administrations. What type of
reliability is the researcher
attempting to obtain?

Test-retest
reliability (or
“coefficient of
stability”)

TRUE or FALSE: It is not
recommended to use the
test-retest coefficient when
attempting to obtain
reliability for a test that
measures attributes that
are unstable (e.g., mood).

TRUE: Low coefficients, in
such cases, would likely
be more a reflection of the
attribute's unreliability
rather than the test's
unreliability

A researcher administers one
form of a test on one day, then
administers an equivalent form
to the same group of people at
a later date/time. What type of
reliability is being sought in this
example?

Alternate forms
reliability (or “coefficient
of equivalence;”
parallel-forms reliability)

When correlations are obtained among individual
test items, ________ reliability is being
assessed; the 3 methods for obtaining this
reliability include ________ (involves dividing
test into 2 parts then correlating responses from
the 2 parts), ________ (used when test items are
dichotomously scored- e.g., “true/false”), and
________ (used for tests with multiple-scored
items- e.g., “never/rarely/sometimes/always”).

Internal consistency (or
"coefficient of internal
consistency"); split-half;
Kuder-Richardson
Formula 20; Cronbach's
coefficient alpha
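
Worked example (a minimal sketch, not part of the original deck): the card names KR-20 and coefficient alpha but gives no formulas. The code below assumes the usual textbook forms, with population variances throughout, and uses hypothetical function names and made-up response data. For dichotomous (0/1) items the two coefficients coincide, because the variance of a 0/1 item is p(1 − p).

```python
from statistics import pvariance

def cronbach_alpha(rows):
    """rows: one list of item scores per examinee (any numeric scoring)."""
    k = len(rows[0])
    item_vars = [pvariance([row[i] for row in rows]) for i in range(k)]
    total_var = pvariance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

def kr20(rows):
    """rows: one list of 0/1 item scores per examinee."""
    k, n = len(rows[0]), len(rows)
    p = [sum(row[i] for row in rows) / n for i in range(k)]  # item difficulties
    sum_pq = sum(pi * (1 - pi) for pi in p)
    total_var = pvariance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - sum_pq / total_var)

# Hypothetical data: 5 examinees, 4 true/false items (1 = correct)
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
print(round(kr20(responses), 3), round(cronbach_alpha(responses), 3))
```

Both values come out at about .80 for this data set.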

While the split-half method
usually lowers the
reliability coefficient
artificially, the \_\_\_\_\_\_\_\_ can
be used to correct for the
effects of shortening the
measure.

Spearman-Brown prediction formula
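
Worked example (a sketch under stated assumptions): the card does not give the formula, so the code assumes the standard Spearman-Brown form, r_new = n·r / (1 + (n − 1)·r), where n is the factor by which the test is lengthened. For a split-half correction n = 2. The function name is hypothetical.

```python
def spearman_brown(r, length_factor=2.0):
    """Predicted reliability after lengthening a test by `length_factor`.

    r: reliability of the shorter form (e.g., the split-half correlation).
    length_factor=2 corrects a split-half coefficient to full test length.
    """
    return (length_factor * r) / (1 + (length_factor - 1) * r)

# A split-half correlation of .60 corrects to a full-length estimate of .75
print(round(spearman_brown(0.60), 2))
```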

Measures of internal
consistency are not
good at assessing
reliability for
\_\_\_\_\_\_\_\_ tests.

Speed tests, as the
correlation would
be spuriously
inflated

Instruments that rely on
rater judgments should
have high
\_\_\_\_\_\_\_\_ reliability, which
is increased when scoring
categories are \_\_\_\_\_\_\_\_
and \_\_\_\_\_\_\_\_.

Inter-rater (interscorer);
mutually exclusive (a
particular behavior belongs to
a single category); exhaustive
(categories cover all possible
responses/behaviors)

The \_\_\_\_\_\_\_\_ estimates the
amount of error to be expected
in an individual test score and
is used to determine a range,
referred to as a/an \_\_\_\_\_\_\_\_,
within which an examinee's true
score will likely fall.

Standard Error of
Measurement;
confidence
interval

What is the
formula for the
standard error of
measurement?

SEM = SDx × √(1 − rxx)
(SDx = standard deviation
of test scores; rxx =
reliability coefficient)

What is the probability that a
person's true score lies within a
range of plus or minus 1
standard error of measurement
(SEM) of their obtained score?
How about plus or minus 1.96
(2) SEM? And finally, plus or
minus 2.58 (2.5) SEM?

68% of the
time; 95% of
the time; 99%
of the time
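
Worked example (a minimal sketch): this applies SEM = SDx × √(1 − rxx) and the ±1, ±1.96, and ±2.58 SEM bands from the cards. The test values (SD = 15, rxx = .91, obtained score = 100) and the function names are illustrative assumptions.

```python
from math import sqrt

def standard_error_of_measurement(sd_x, r_xx):
    """SEM = SDx * sqrt(1 - rxx)."""
    return sd_x * sqrt(1 - r_xx)

def confidence_interval(obtained, sd_x, r_xx, multiplier):
    """Band of +/- multiplier * SEM around an obtained score."""
    half_width = multiplier * standard_error_of_measurement(sd_x, r_xx)
    return obtained - half_width, obtained + half_width

sem = standard_error_of_measurement(15, 0.91)  # about 4.5
for label, mult in (("68%", 1.0), ("95%", 1.96), ("99%", 2.58)):
    low, high = confidence_interval(100, 15, 0.91, mult)
    print(label, round(low, 2), round(high, 2))
```

With rxx = 1.0 the SEM is 0, and a larger SDx gives a larger SEM. This matches the "inversely; positively" card further down.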

TRUE or FALSE:
Hypothetically, a test
with a reliability
coefficient of +1.0 would
have a standard error of
measurement of 0.0.

TRUE: A test
with perfect
reliability will
have no error

The standard error of
measurement is \_\_\_\_\_\_\_\_
related to the reliability
coefficient (rxx) and
\_\_\_\_\_\_\_\_ related to the
standard deviation of test
scores (SDx).

Inversely; positively

What reliability
coefficient, when
practical, is the
best to use?

Alternate-forms

Classical test
theory states that
an observed score
reflects \_\_\_\_\_\_\_\_
plus \_\_\_\_\_\_\_\_.

True score
variance;
random error
variance

Methods of recording behaviors include ________ recording (elapsed time that behavior occurs is recorded), ________ recording (number of times behavior occurs is recorded), ________ recording (rater notes whether subject engages in behavior during given time period), and ________ recording (all behavior during an observation session is recorded).

Duration; frequency; interval; continuous

Simply put, ________ refers to the degree to which a test measures what it purports to measure.

Validity

A depression scale that only assesses the affective aspects of depression but fails to account for the behavioral aspects would be lacking what type of validity?

Content validity, which refers to the extent to which test items represent all facets of the content area being measured (e.g., EPPP)

TRUE or FALSE: Content validity assessment requires a degree of agreement between experts in the subject matter, thus it includes an element of subjectivity.

TRUE: Tests should also correlate highly with other tests that measure the same content domain

In contrast to content validity, ________ occurs when a test appears valid to examinees, administrators, and other untrained observers; it is not technically a type of test validity.

Face validity

A personality test that effectively predicts the future behavior of an examinee has what type of validity?

Criterion-related validity, which is obtained by correlating scores on a predictor test to some external criterion (e.g., academic achievement, job performance)

Criterion-related validity is assessed using a/an ________ to determine the relationship between the predictor and the criterion; for interpretation this value can be squared, producing the "________," which indicates the proportion of variability in the criterion that is explained by variability in the predictor.

Correlation coefficient; coefficient of determination
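
Worked example (a sketch, assuming a Pearson correlation is the coefficient in question): the code correlates hypothetical predictor and criterion scores, then squares the result to get the coefficient of determination. The data and function name are made up for illustration.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / sqrt(s_xx * s_yy)

predictor = [1, 2, 3, 4, 5]   # hypothetical aptitude scores
criterion = [2, 4, 5, 4, 5]   # hypothetical performance ratings

r_xy = pearson_r(predictor, criterion)   # validity coefficient, about .77
r_squared = r_xy ** 2                    # coefficient of determination, about .60
print(round(r_xy, 2), round(r_squared, 2))
```

Here about 60% of the variability in the criterion is explained by variability in the predictor.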

The process of ________ validation involves the predictor and the criterion being collected at the same time, providing information regarding a test's usefulness for predicting a given current behavior; ________ validation involves a waiting period between collection of predictor scores and criterion data, providing information regarding a test's usefulness for predicting future behavior.

Concurrent; predictive

When interpreting a person's predicted score on a given criterion measure, the ________ will determine within what range of scores their actual score will likely fall.

Standard Error of Estimate

The standard error of measurement constructs a confidence interval around an examinee's ________ score (using a reliability coefficient), while the standard error of estimate does the same for an examinee's ________ score (using a validity coefficient).

Obtained; predicted
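
Worked example (a sketch under stated assumptions): the cards do not give the formula for the standard error of estimate, so the code assumes the standard textbook form, SEest = SDy × √(1 − rxy²), where SDy is the standard deviation of criterion scores and rxy is the validity coefficient. The numbers and names are illustrative.

```python
from math import sqrt

def standard_error_of_estimate(sd_y, r_xy):
    """SEest = SDy * sqrt(1 - rxy**2); uses the validity coefficient."""
    return sd_y * sqrt(1 - r_xy ** 2)

def predicted_score_interval(predicted, sd_y, r_xy, multiplier=1.96):
    """Band of +/- multiplier * SEest around a predicted criterion score."""
    half_width = multiplier * standard_error_of_estimate(sd_y, r_xy)
    return predicted - half_width, predicted + half_width

se_est = standard_error_of_estimate(10, 0.60)        # about 8.0
low, high = predicted_score_interval(70, 10, 0.60)   # roughly 54.3 to 85.7
print(round(se_est, 2), round(low, 2), round(high, 2))
```

Note the contrast with the standard error of measurement, which uses 1 − rxx (unsquared) and is built around an obtained score.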

Interviewees are given an aptitude test (predictor) to predict work success (criterion), with hiring contingent on achieving a certain minimum score, called a/an ________ score. The manager then rates performance on work tasks, an indication of success, and only those who score above a certain ________ are deemed successful.

Predictor cutoff; criterion cutoff

Scoring above both the predictor and criterion cutoff points produces ________; scoring above the predictor cutoff point but below the criterion cutoff point produces ________; scoring below the predictor cutoff point but above the criterion cutoff point produces ________; and scoring below both the predictor and criterion cutoff points produces ________.

True positives (valid acceptances); false positives (false acceptances); false negatives (invalid rejections); true negatives (valid rejections)
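
Worked example (a minimal sketch): the four outcomes can be written as a two-way decision on the predictor and criterion cutoffs. Treating a score exactly at a cutoff as passing is an assumption made here; the cards only say "above" and "below". Cutoff values and the function name are hypothetical.

```python
def classify_decision(predictor, criterion, predictor_cutoff, criterion_cutoff):
    """Label one examinee by predictor and criterion cutoffs.

    Assumption: a score equal to a cutoff counts as meeting it.
    """
    selected = predictor >= predictor_cutoff     # would have been hired
    successful = criterion >= criterion_cutoff   # rated successful on the job
    if selected and successful:
        return "true positive"
    if selected and not successful:
        return "false positive"
    if not selected and successful:
        return "false negative"
    return "true negative"

# Hypothetical cutoffs: 50 on the aptitude test, 3 on the manager's rating
for scores in ((60, 4), (60, 2), (40, 4), (40, 2)):
    print(scores, classify_decision(*scores, 50, 3))
```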

Some factors contributing to a low validity coefficient include the validation group being ________ or the predictor and/or criterion being ________.

Homogeneous; unreliable

When a test has a different validity coefficient for one group compared to another, the variables affecting validity are called ________ variables; when this is the case, the test is said to have ________.

Moderator; differential validity

This is the process whereby an already validated test is re-validated with a different sample of people than the original validation sample.

Cross-validation

What term is used to describe the reduction that occurs in a criterion-related validity coefficient after cross-validation?

Shrinkage

The greatest shrinkage occurs when the original validation sample is ________, the original item pool is ________, the number of items retained is ________ relative to the items in the item pool, and/or items are not chosen based on ________ or ________.

Small; large; small; previously formulated hypothesis; experience with the criterion

________ is one way a predictor might end up looking more valid than it actually is, which occurs when predictor scores themselves influence any person's criterion status (e.g., manager is aware that factory worker did well on predictor, this knowledge positively influences manager's ratings on criterion performance).

Criterion contamination

How is criterion contamination prevented?

Criterion raters should have no prior knowledge of examinees' predictor scores

Theorized psychological variables (e.g., personality, intelligence) that are abstract and not directly observable are referred to as ________, hence ________ provides an indication of the degree to which an instrument measures or correlates with such variables.

Constructs; construct validity

A newly developed test of personality has a high correlation with the MMPI-2 and a low correlation with the Wechsler Memory Scale, indicating the test has both ________ validity and ________ validity, respectively.

Convergent; discriminant/divergent - both are forms of construct validity

TRUE or FALSE: The only time a low correlation coefficient provides evidence of high validity is when discriminant validity is indicated due to there being a low correlation between 2 tests that measure different constructs.

TRUE: In all other cases, high validity is indicated by a high correlation coefficient

What complex procedure for assessing convergent and discriminant validity requires the assessment of 2 or more traits (e.g., personality, depression) by 2 or more methods (e.g., self-report, peer rating)?

Multitrait-multimethod matrix

When using the multitrait-multimethod matrix, ________ validity is indicated when tests that measure the same traits are highly correlated, even when different methods of measurement are used; conversely, ________ validity is indicated when tests that measure different constructs are minimally correlated, even when the same method of measurement is used.

Convergent; discriminant

The ________ coefficient is a reliability coefficient, as it indicates the correlation between the measure and itself; correlations between two measures that measure the same trait using different methods are called ________ coefficients; correlations between two measures that measure different traits using the same method are called ________ coefficients; and correlations between 2 measures that measure different traits using different methods are called ________ coefficients.

Monotrait-monomethod; monotrait-heteromethod; heterotrait-monomethod; heterotrait-heteromethod

When assessing validity using the multitrait-multimethod matrix, convergent validity is indicated when there is a high ________ correlation, while discriminant validity is indicated by a low ________ correlation and further confirmed by a ________ heterotrait-heteromethod correlation.

Monotrait-heteromethod; heterotrait-monomethod; low

________, often used to assess the construct validity of a test or tests, involves reducing a larger set of variables into fewer classified sets of variables based on the construct that is primarily "picked-up" by each measure; each variable is correlated with every other variable, creating a ________.

Factor analysis; factor matrix

The main purpose of factor analysis is to reveal how many and to what degree underlying constructs, also called ________ due to the fact that the analysis does not directly intend to measure them, can account for scores on a larger number of tests.

Latent variables

In a hypothetical factor analysis, the factor matrix indicates a correlation coefficient of .68 between the depression subscale of the MMPI-2 and Factor II. What term is used to describe the correlation between the depression subscale and Factor II?

Factor loading, which refers to the correlation between a given test and a given factor (e.g., the depression subscale loads .68 on Factor II); it can be squared to determine proportion of variability

________ determines the proportion of variance of a test that is attributable to the factors; it is the sum of squared factor loadings.

Communality (h-squared); the sum-of-squared-loadings rule does not hold when oblique rotation is used

The amount of variability in a test that can be explained by whatever traits are represented by the factors is referred to as ________, while variance that is specific to the test and not explained by the factors is referred to as ________.

Common variance (represents communality); unique variance (represents specificity)
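
Worked example (a minimal sketch, assuming uncorrelated factors): communality is the sum of a test's squared loadings across factors, and whatever is left over (1 − h²) is the variance the factors do not explain. The loadings and function name are hypothetical.

```python
def communality(loadings):
    """Sum of squared factor loadings for one test (orthogonal factors assumed)."""
    return sum(loading ** 2 for loading in loadings)

# Hypothetical test loading .60 on Factor I and .50 on Factor II
h_squared = communality([0.60, 0.50])   # .36 + .25, about .61
unexplained = 1 - h_squared             # about .39, not explained by the factors
print(round(h_squared, 2), round(unexplained, 2))
```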

In a factor analysis, these values indicate the amount of variance in all the tests accounted for by the factor; they are analyzed to determine whether or not the factor is accounting for a significant amount of variability in the tests.

Eigenvalues (or explained variance)

If a factor analysis is performed on 8 tests, what is the largest the sum of the eigenvalues can be?

Since the sum of the eigenvalues can be no larger than the number of tests included in the factor analysis, the answer is 8
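
Worked example (a sketch under stated assumptions): for uncorrelated factors, each factor's explained variance can be computed as the sum of squared loadings down its column of the factor matrix. Because each test is standardized to a variance of 1, the column sums cannot add up to more than the number of tests. The 3-test, 2-factor matrix below is made up for illustration.

```python
def explained_variance_per_factor(factor_matrix):
    """Column sums of squared loadings (rows = tests, columns = factors)."""
    n_factors = len(factor_matrix[0])
    return [sum(row[j] ** 2 for row in factor_matrix) for j in range(n_factors)]

# Hypothetical loadings for 3 tests on 2 orthogonal factors
factor_matrix = [
    [0.8, 0.1],
    [0.7, 0.2],
    [0.1, 0.9],
]
eigenvalues = explained_variance_per_factor(factor_matrix)
print([round(value, 2) for value in eigenvalues], round(sum(eigenvalues), 2))
```

Here the two values are about 1.14 and 0.86, summing to 2.0, which is below the ceiling of 3 (the number of tests).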

A procedure that facilitates factor matrix interpretation is ________, which involves re-dividing the test's communalities so that a clearer pattern of loadings emerges.

Rotation

Two general rotation strategies include ________ for factors that are uncorrelated (independent of each other) and ________ for correlated factors; the decision as to which one is used is based on the researcher's theoretical assumptions.

Orthogonal; oblique

When construct validity is being assessed using factor analysis, a high correlation between a test and a factor the test is expected to correlate highly with is referred to as what?

Factorial validity

While factor analysis assumes variance in a variable is composed of ________, ________, and ________, principal components analysis assumes variance is composed of ________ and ________.

Communality; specificity; error; explained variance; error variance

Factor is to factor analysis as ________ or ________ is to principal components analysis.

Principal component; eigenvector

What method might a researcher who is interested in developing a taxonomy (classification system) of different personality characteristics use?

Cluster analysis

In ________ analysis, only interval and ratio data can be used and researchers typically have an a priori hypothesis about what traits a set of variables measure; by contrast, ________ can be performed using any type of data (interval, ratio, nominal, ordinal) and is not designed for studies where the researcher has an a priori hypothesis.

Factor; cluster analysis

TRUE or FALSE: A reliable test is not always a valid test, though a valid test must be a reliable test.

TRUE: Reliability is a necessary but not sufficient condition for validity

The ________ coefficient is less than or equal to the square root of the ________ coefficient; it cannot be any higher, thus the latter sets a ________ on the former.

Validity; reliability; ceiling (or upper-limit)

A researcher discovers a test has low reliability; however, she is interested in what the validity coefficient of the predictor would be if both the predictor and the criterion were perfectly reliable. What formula would she use?

Correction for attenuation
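
Worked example (a sketch under stated assumptions): the validity ceiling follows the card above (validity ≤ √reliability). The cards name the correction for attenuation without giving it, so the code assumes the standard form, r_corrected = rxy / √(rxx × ryy), where rxx and ryy are the reliabilities of the predictor and criterion. The numbers and function names are illustrative.

```python
from math import sqrt

def validity_ceiling(r_xx):
    """Upper limit on a validity coefficient given the test's reliability."""
    return sqrt(r_xx)

def correct_for_attenuation(r_xy, r_xx, r_yy):
    """Estimated validity if predictor and criterion were perfectly reliable."""
    return r_xy / sqrt(r_xx * r_yy)

print(round(validity_ceiling(0.81), 2))                        # .90
print(round(correct_for_attenuation(0.36, 0.81, 0.64), 2))     # .50
```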

What is the correlation between the factors in a factor analysis where an orthogonal rotation is used?

By definition, the correlation would be 0.0

What is used to determine which test items will be retained for the final version of a test and to ensure that a test is both reliable and valid from the start?

Item analysis

The ________ the p-value, the ________ the item.

Higher (lower); less difficult (more difficult)

The percentage of examinees that answer an item correctly is referred to as a/an ________, which is abbreviated ________; most test developers prefer items with a ________ value at or around ________.

Item difficulty index; p; p; .50

The rule-of-thumb for item difficulty on a test is that the optimal difficulty level of test items should be approximately halfway between 1.0 (i.e., everyone is correct) and the level of success expected by chance alone. That known, what is the optimal item difficulty level of a multiple choice test with 4 options (e.g., EPPP)?

p = .625, the point halfway between 1.0 and the chance level of .25; ideally, 62.5% of examinees would answer such an item correctly
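
Worked example (a minimal sketch): the rule of thumb in the question is the midpoint between 1.0 and the chance success rate, where chance is 1 divided by the number of answer options. The function name is hypothetical.

```python
def optimal_item_difficulty(n_options):
    """Midpoint between perfect performance (1.0) and chance (1 / n_options)."""
    chance = 1 / n_options
    return (1.0 + chance) / 2

print(optimal_item_difficulty(4))   # 0.625 for a 4-option multiple-choice item
print(optimal_item_difficulty(2))   # 0.75 for a true/false item
```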

According to Anastasi, the p-level expresses item difficulty in terms of an ________ scale, as conclusions cannot be made about the differences in difficulty between items, only that certain items are easier/harder than others.

Ordinal (difficulty levels are rankings, according to Anastasi)

The degree to which a test item differentiates among test-takers in terms of the behavior the test is designed to measure is called ________ and can be assessed by calculating a/an ________, which is abbreviated as "________."

Item discrimination; item discrimination index; D
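
Worked example (a sketch under stated assumptions): the cards do not say how D is computed. One common textbook approach, assumed here, subtracts the proportion of a low-scoring group that passes the item from the proportion of a high-scoring group that passes it, so D ranges from −1.0 to +1.0. The group counts and function name are made up.

```python
def discrimination_index(upper_correct, upper_n, lower_correct, lower_n):
    """D = proportion correct in the upper group minus that in the lower group."""
    return upper_correct / upper_n - lower_correct / lower_n

# Hypothetical item: 16 of 20 high scorers and 6 of 20 low scorers answer correctly
d = discrimination_index(16, 20, 6, 20)
print(round(d, 2))   # 0.5
```

An item that everyone passes (or everyone fails) gives D = 0, which is one way to see why difficulty places a ceiling on discrimination.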

An item on a measure of anxiety would have good ________ if low-anxiety examinees consistently answered it differently than high-anxiety examinees.

Discriminability (item discrimination)

An item's ________ level places a ceiling on its ________ index; higher levels of discriminability are associated with ________ levels of difficulty.

Difficulty; discrimination; moderate

TRUE or FALSE: The reliability of a test will decrease as the mean discrimination index (D) increases.

FALSE: There is a direct correlation between test reliability and mean D

A graphical depiction of both item difficulty and item discrimination is called a/an ________; analysis based on ________ is derived from these.

Item characteristic curve (ICC); item response theory

What are the 2 technical properties of an item characteristic curve that are used to describe it?

Item difficulty and item discrimination

Item response theory assumes (1) performance on an item is related to the estimated amount of a/an ________ being measured by the item, and (2) ________ (an item should have the same characteristics regardless of the sample of people taking the test).

Latent trait; invariance of item parameters

The computerized selection of test items for individual examinees is referred to as what?

Computer adaptive assessment (or testing)

What item difficulty level is associated with the maximum level of differentiation among examinees?

.50, indicating half answered correctly and half answered incorrectly

What factor most affects an item's difficulty level?

Characteristics of examinees

What type of interpretation indicates where the examinee stands in relation to others who have taken the same test?

Norm-referenced interpretation

Providing a general indication as to the progression a person has made along the normal developmental path, ________ norms include ________ and ________.

Developmental; mental age; grade equivalent scores

What is the calculation for ratio IQ?

(mental age/chronological age) x 100

A 20-year-old performs as well on a test as the average 10-year-old. His mental age is ________ and his ratio IQ is ________.

10 years; 50
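
Worked example (a minimal sketch): the ratio IQ formula from the card above, (mental age / chronological age) × 100, applied to the 20-year-old example. The function name is hypothetical.

```python
def ratio_iq(mental_age, chronological_age):
    """Ratio IQ = (mental age / chronological age) * 100."""
    return mental_age / chronological_age * 100

print(ratio_iq(10, 20))   # 50.0
```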

Indicating the grade level a person's performance is equivalent to, ________ are typically used in the interpretation of educational achievement tests.

Grade equivalent scores (e.g., Wide Range Achievement Test, 4th Ed [WRAT-4])

TRUE or FALSE: When using developmental norms, scores obtained by people of different age groups are not comparable.

TRUE: This is due to the fact that standard deviation is not accounted for

Including percentile ranks and standard scores, ________ norms compare examinee scores to those of the most nearly comparable standardization sample.

Within-group

Z-scores, T-scores, stanine scores, and deviation IQ scores are all examples of ________, which express a raw score's distance from the mean in terms of standard deviation units.

Standard scores

Identify the mean (M) and standard deviation (sd) of: z-scores, T-scores, stanine scores, and deviation IQ scores.

Z-score (M = 0, sd = 1); T-score (M = 50, sd = 10); Stanine (M = 5, sd = about 2); Deviation IQ (M = 100, sd = 15)
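
Worked example (a sketch under stated assumptions): each standard score is a linear rescaling of z using the means and standard deviations listed above. The stanine conversion shown (round 5 + 2z, then limit to 1 through 9) is a common approximation assumed here; published stanines are defined by percentile bands. Function names and the raw-score example are hypothetical.

```python
def z_score(raw, mean, sd):
    """Distance of a raw score from the mean in standard deviation units."""
    return (raw - mean) / sd

def t_score(z):
    return 50 + 10 * z           # M = 50, sd = 10

def deviation_iq(z):
    return 100 + 15 * z          # M = 100, sd = 15

def stanine(z):
    # Approximation: M = 5, sd of about 2, limited to the 1-9 range
    return max(1, min(9, round(5 + 2 * z)))

z = z_score(raw=60, mean=50, sd=10)   # one sd above the mean
print(z, t_score(z), deviation_iq(z), stanine(z))   # 1.0 60.0 115.0 7
```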