Chapters 7-8 Flashcards
usefulness or practical value of testing to improve efficiency
Utility
used to refer to the usefulness or practical value of a training program or
intervention
Utility
Factors that affect a test’s utility
- Psychometric Soundness
- Cost
- Benefits
The reliability and validity of a test, which give us an indication of the practical value of its scores
Psychometric Soundness
They tell us whether decisions are cost-effective
Psychometric Soundness
A test must be valid to be useful, but a valid test is not always a useful test, especially if testtakers do not follow test directions
True
It refers to disadvantages, losses or expenses in both economic and noneconomic terms
Cost
It refers to profits, gains or advantages
Benefit
It is a family of techniques that entail a cost-benefit analysis designed to yield information relevant to a decision about the usefulness and/or practical value of a tool of assessment
Utility Analysis
provides an indication of the likelihood that a testtaker will score within some interval of scores on a criterion measure; an interval may be categorized as “passing,” “acceptable,” or “failing”
Expectancy Table/Chart
estimate of the percentage of employees hired on the basis of a particular test who will be successful at their jobs
Taylor-Russell Tables
used for obtaining the difference between the means of the selected and unselected groups to derive an index of what the test is adding to an already established procedure
Naylor-Shine Tables
A formula used to calculate the dollar amount of a utility gain resulting from the
use of a particular selection instrument under specified conditions
Brogden-Cronbach-Gleser Formula
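The formula is: utility gain = (N)(T)(r_xy)(SD_y)(Z_m) − (N)(C). A minimal Python sketch of the arithmetic, with every input value hypothetical:

```python
# Brogden-Cronbach-Gleser utility gain (all values hypothetical):
# utility_gain = (N)(T)(r_xy)(SD_y)(Z_m) - (N)(C)
N = 10        # number of applicants selected per year
T = 2.0       # average tenure in the position, in years
r_xy = 0.40   # criterion-related validity coefficient of the test
SD_y = 8000   # standard deviation of job performance in dollars
Z_m = 1.0     # mean standardized test score of the selected applicants
C = 200       # cost of testing one applicant

utility_gain = (N * T * r_xy * SD_y * Z_m) - (N * C)
print(f"Estimated utility gain: ${utility_gain:,.2f}")  # $62,000.00
```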
an estimate of the benefit (monetary/otherwise) of using a particular
test or selection method
Utility gain
a body of methods used to quantitatively evaluate selection procedures,
diagnostic classifications, therapeutic interventions or other assessment or
intervention-related procedures in terms of how optimal they are (most typically
from a cost-benefit perspective)
Decision Theory
a correct classification
hit
a qualified driver is hired; an unqualified driver is not hired
It is a hit
an incorrect classification; a mistake
miss
a qualified driver is not hired; an unqualified driver is hired
It is a miss
the proportion of people that an assessment tool accurately identified
as possessing a particular variable
hit rate
the proportion of qualified drivers with a passing score who actually
gain permanent employee status; the proportion of unqualified drivers with a
failing score who did not gain permanent status
This is a hit rate
the proportion of people that an assessment tool inaccurately identified
as possessing a particular variable
miss rate
the proportion of drivers who were inaccurately predicted to be qualified; the proportion of drivers who were inaccurately predicted to be unqualified
This is a miss rate
falsely indicates that the testtaker possesses a particular variable; example: a driver who is hired is not qualified
false positive
falsely indicates that the testtaker does not possess a particular variable; the assessment tool says not to hire, but the driver would have been rated as qualified
false negative
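To tie the four outcomes together, here is a toy Python sketch of the driver-hiring example (the function name and labels are illustrative, not from the source):

```python
# Classifying a selection decision against the driver-hiring example.
# "Positive" means the assessment tool says to hire.
def classify(tool_says_hire: bool, actually_qualified: bool) -> str:
    if tool_says_hire and actually_qualified:
        return "hit (true positive)"
    if not tool_says_hire and not actually_qualified:
        return "hit (true negative)"
    if tool_says_hire and not actually_qualified:
        return "miss: false positive"
    return "miss: false negative"

print(classify(True, False))   # false positive: an unqualified driver is hired
print(classify(False, True))   # false negative: a qualified driver is not hired
```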
Some practical considerations
The Pool of Job Applicants
The Complexity of the Job
The Cut Score in Use
a (usually numerical) reference point derived as a result of a judgment and used to divide a set of data into two or more classifications, with some action to be taken
or some inference to be made on the basis of these classifications
Cut Score/Cutoff Score
dictate what sort of information will be required as well as the
specific methods to be used
objective of utility analysis
Used to measure costs vs. benefits
Expectancy Data
- Based on norm-related considerations rather than on the relationship of test scores to a criterion (normative)
- Also called norm-referenced cut score
- Ex.) top 10% of test scores get A’s
Relative cut score
- Set with reference to a judgment concerning a minimum level of proficiency required to be included in a particular classification (criterion-referenced)
- Also called absolute cut score
Fixed cut score
using two or more cut scores with reference to one predictor for the purpose of categorizing
testtakers
Multiple cut scores
Ex.) having cut scores that mark an A, B, C, etc., all measuring the same predictor
Multiple cut scores
the achievement of a particular cut score on one test is necessary in order to
advance to the next stage of evaluation in the selection process
Multiple-stage or multiple-hurdle selection
written application → group interview → personal interview
Multiple-stage or multiple-hurdle selection
assumption is made that high scores on one attribute can compensate for low scores on another attribute
Compensatory model of selection
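A compensatory model is typically implemented as a weighted composite (as with multiple regression weights); a toy sketch with hypothetical weights and scores:

```python
# Compensatory model of selection: a weighted composite lets a high score
# on one predictor offset a low score on another (weights are hypothetical).
weights = {"test": 0.6, "interview": 0.4}
applicant = {"test": 85, "interview": 55}   # weak interview, strong test score
composite = sum(weights[k] * applicant[k] for k in weights)
print(composite)  # 73.0 -- could still clear a composite cut score of, say, 70
```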
Who devised Angoff method?
William Angoff
a way to set fixed cut scores that entails averaging the judgments of experts; must have high inter-rater reliability
Angoff Method
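A minimal sketch of the Angoff arithmetic, assuming hypothetical ratings in which each expert estimates the probability that a minimally competent testtaker answers each item correctly:

```python
# Angoff method sketch: probabilities are summed per expert, then the
# experts' implied cut scores are averaged (all ratings hypothetical).
expert_ratings = [
    [0.6, 0.8, 0.5, 0.9],  # expert 1, items 1-4
    [0.7, 0.7, 0.4, 0.8],  # expert 2
    [0.5, 0.9, 0.6, 0.9],  # expert 3
]
per_expert = [sum(r) for r in expert_ratings]  # [2.8, 2.6, 2.9]
cut_score = sum(per_expert) / len(per_expert)  # ~2.77 out of 4 items
print(round(cut_score, 2))
```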
a system of collecting data on a predictor of interest from groups known to
possess (and not to possess) a trait, attribute or ability of interest
Known Groups Method/Method of Contrasting Groups
a cut score is set on the test that best discriminates high performers from low performers
Known Groups Method/Method of Contrasting Groups
- in order to “pass” the test, the testtaker must answer items that have some minimum level of difficulty, which is determined by experts and serves as the cut score
Item Response Theory (IRT)-Based Methods
- Based on the testtaker’s performance across all items on a test
- Some portion of the test items must be answered correctly
IRT-Based Methods
a technique for identifying cut scores based on the number of positions to be
filled
Method of Predictive Yield
a family of statistical techniques used to shed light on the relationship between certain variables and two or more naturally occurring groups
Discriminant Analysis
determining the difficulty level reflected by the cut score
Item mapping method
test items are listed, one per page, in ascending order of difficulty. An expert places a bookmark to mark the divide that separates testtakers who have acquired the minimal knowledge, skills, or abilities from those who have not. Problems include the training of experts, possible floor and ceiling effects, and the optimal length of item booklets
Bookmark Method
Steps in Test Development
- TEST CONCEPTUALIZATION
- TEST CONSTRUCTION
- TEST TRYOUT
- ITEM ANALYSIS
- TEST REVISION
Conception of idea by the test developer
Test Conceptualization
An emerging social phenomenon or pattern of behavior might serve
as the stimulus for the development of a new test.
Test Conceptualization
An item for which high scorers on the test respond correctly while low scorers respond to that same item incorrectly
Norm-referenced conceptualization
The conceptualization focuses on the construct or material that needs to be mastered
Criterion-referenced conceptualization
testtakers who have mastered the material get a particular item right, whereas those who have not mastered it get that same item wrong
Criterion-referenced conceptualization
prototype of the test; necessary for research purposes, but not required for a teacher-made test
Pilot work
To know whether some items should be included in the final form of the instrument
Pilot work
the test developer typically attempts to determine how
best to measure a targeted construct
Pilot work
process of setting rules for assigning numbers in
measurement.
Scaling
credited with being at the forefront of efforts to develop methodologically sound scaling methods
L. L. Thurstone
Raw scores converted to a scale that ranges from 1 to 9
Stanine scale
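A sketch of the conversion, using the classic 4-7-12-17-20-17-12-7-4 percentage bands for stanines 1-9 (mapping from percentile rank is a simplification of a norm-based conversion):

```python
import bisect

# Stanine sketch: percentile ranks mapped onto the 1-9 stanine scale using
# the classic 4-7-12-17-20-17-12-7-4 percent bands.
bounds = [4, 11, 23, 40, 60, 77, 89, 96]  # cumulative upper percentiles, stanines 1-8

def stanine(percentile: float) -> int:
    return bisect.bisect_right(bounds, percentile) + 1

print(stanine(50))  # 5 -- the middle stanine
print(stanine(97))  # 9 -- top 4 percent
```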
measuring one construct
Unidimensional Scale
measuring more than one construct
Multidimensional Scale
entails judgments of a stimulus in comparison with every other stimulus on the scale (best to worst)
Comparative Scaling
stimuli are placed into one of two or more alternative categories that differ quantitatively with
respect to some continuum (section 1, section 2, section 3)
Categorical Scaling
Can be defined as a grouping of words, statements, or symbols on which judgments of the strength of a particular trait, attitude, or emotion are indicated by the testtaker
Rating Scale
when the final score is obtained by summing the ratings across all the items
Summative Scale
a type of summative rating scale wherein each item presents the testtaker with five alternative responses, usually on an agree-disagree or approve-disapprove continuum. It is ordinal in nature
Likert Scale
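A minimal sketch of summative (Likert-type) scoring with hypothetical 5-point items; the reverse-keying step is a common convention, not something stated above:

```python
# Summative scoring of a 5-item Likert-type scale
# (1 = strongly disagree ... 5 = strongly agree; responses hypothetical).
responses = [4, 5, 3, 2, 5]
reverse_keyed = {3}  # item at index 3 is worded negatively, so recode it
scored = [6 - r if i in reverse_keyed else r for i, r in enumerate(responses)]
print(sum(scored))  # 21 -- the summative scale score
```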
scaling method whereby one of a pair of stimuli (such
as photos) is selected according to a rule (such as –
“select the one that is more appealing”)
Paired Comparison
presented with two stimuli and asked to compare
Paired comparison
judging of a stimulus in comparison with every
other stimulus on the scale
Comparative Scaling
testtaker places stimuli into a category; those categories differ quantitatively on a spectrum
Categorical Scaling
items range from sequentially weaker to stronger expressions of attitude, belief, or feeling. A
testtaker who agrees with the stronger statement is assumed to also agree with the milder statements
Guttman Scale/Scalogram Analysis
a scale wherein items range
sequentially from weaker to stronger expressions of the
attitude or belief being measured
Guttman Scale/Scalogram Analysis
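A toy check of the cumulative (scalogram) property, assuming items are listed from mildest to strongest:

```python
# Guttman scale sketch: with items ordered mild -> strong, a pattern fits
# the scale only if agreement never resumes after a disagreement.
def is_cumulative(pattern):  # 1 = agree, 0 = disagree
    return all(a >= b for a, b in zip(pattern, pattern[1:]))

print(is_cumulative([1, 1, 1, 0, 0]))  # True: scalable response pattern
print(is_cumulative([1, 0, 1, 0, 0]))  # False: violates the cumulative assumption
```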
Developer of Guttman Scale/Scalogram Analysis
Louis Guttman
direct estimation, because there is no need to transform the testtaker’s responses to another scale; it is presumed to be interval in nature
Thurstone’s Equal Appearing Intervals Method
When devising a standardized test using a multiple-choice format, it is usually advisable that the first draft contains approximately ______ the number of items that the final version of the test will contain
twice
What to consider in writing items
- range of content that the items should cover
- which item format should be employed
- the number of items to be written, in total and for each content area covered
reservoir from which items will or will not be drawn for the final version of the test
Item pool
Item pool should be about _____ the number of questions as final will have
double
variables such as the form, plan, structure, arrangement and layout of individual test items
Item format
the collection of items to be further evaluated for possible selection for use in an item bank
Item pool
testtaker selects a response from a set of alternative responses
Selected-Response Format
What type of item format is multiple choice, true-false, and matching
Selected-Response Format
testtaker supplies or creates
the correct answer
Constructed-Response Format
Item format that includes completion item, short answer and essay
constructed-response format
a relatively large and easily accessible collection of test questions
Item bank
interactive, computer-administered testtaking process wherein items presented to the testtaker are based in part on testtaker’s performance on
previous items.
Computerized Adaptive Testing (CAT)
the diminished utility of an assessment tool for distinguishing testtakers at the low end of the ability, trait, or other attribute being measured
floor effect
diminished utility of an assessment tool for distinguishing testtakers at the high end of the ability, trait, attribute being measured
ceiling effect
ability of computer to tailor the content and order of presentation of test items on the basis of responses to previous items
item branching
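A toy illustration of the idea (the step size and difficulty scale are invented for the example):

```python
# Item-branching sketch: move to a harder item after a correct response
# and to an easier item after an incorrect one.
def next_difficulty(current: float, correct: bool, step: float = 0.1) -> float:
    return min(1.0, current + step) if correct else max(0.0, current - step)

d = 0.5                        # start at medium difficulty
d = next_difficulty(d, True)   # correct -> branch to a harder item (0.6)
d = next_difficulty(d, False)  # incorrect -> branch back down (0.5)
print(d)
```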
testtakers earn cumulative credit with regard to a particular construct
cumulative scoring
testtaker responses earn credit toward placement in a particular class or category with other testtakers whose pattern of responses is presumably similar in some way
class/category scoring
comparing a testtaker’s score on one scale within a test to another scale within that same test
ipsative scoring
John’s need for achievement is higher than his need for affiliation
ipsative scoring
offers two alternatives for each item
dichotomous format
resembles the dichotomous format except that each item has more than two alternatives
polytomous format
incorrect choices in multiple choice
distractors
describes the chances that a low-ability testtaker will obtain each score
guessing threshold
uses more choices than Likert; 10-point rating scale
category format
respondent is given a 100-millimeter line and asked to place a mark between two well-defined endpoints; often used to measure self-rated health
Visual analogue scale
subject receives a long list of adjectives and indicates whether each one is characteristic of himself or herself
Adjective checklist
Obtained by calculating the proportion of the total number of testtakers who answered the item correctly; denoted “p”
Item-Difficulty Index
Higher p indicates
easier items
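Worked arithmetic for the index, with hypothetical counts:

```python
# Item-difficulty index: p = (testtakers answering correctly) / (total testtakers)
n_correct = 50
n_total = 80
p = n_correct / n_total
print(p)  # 0.625 -- the higher p is, the easier the item
```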
Difficulty can be replaced with _________________in non-achievement tests
endorsement
- Indication of the internal consistency of a test
- Equal to the product of the item-score standard deviation (s) and the correlation (r) between the item score and the total test score
- Factor analysis and inter-item consistency can also be used to evaluate it
Item-Reliability Index
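A minimal sketch of the computation; for a dichotomously scored item the item-score standard deviation is sqrt(p(1 − p)), and the values below are hypothetical:

```python
import math

# Item-reliability index = s * r (item-score SD times item-total correlation)
p = 0.6                      # proportion answering the item correctly
s = math.sqrt(p * (1 - p))   # SD of a dichotomous item, ~0.49
r = 0.35                     # correlation between item score and total score
print(round(s * r, 3))       # ~0.171
```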
Statistic designed to provide an indication of the degree to which a test is measuring what it purports to measure. It requires the item-score standard deviation and the correlation between the item score and the criterion score
Item-Validity Index
means greater number of high scorers answering the item correctly
higher d
means low-scoring examinees are more likely to answer the item correctly than high-scoring examinees
negative d
compares performance on a particular item with performance in the upper and lower regions of a distribution of continuous test scores
Item-Discrimination Index
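Worked arithmetic using the usual formula d = (U − L) / n, where U and L are the numbers of correct responses in the upper and lower scoring groups and n is the size of each group (counts hypothetical):

```python
# Item-discrimination index: d = (U - L) / n
U = 24   # upper-group testtakers who answered the item correctly
L = 10   # lower-group testtakers who answered the item correctly
n = 32   # number of testtakers in each group
print((U - L) / n)  # 0.4375 -- positive d: high scorers pass the item more often
```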
Graphic representation of item difficulty and discrimination
Item-Characteristic Curves
techniques of data generation and analysis that rely primarily on verbal rather than mathematical or statistical procedures
Qualitative method
various nonstatistical procedures designed to explore how individual test items work
Qualitative item analysis
- approach to cognitive assessment that entails respondents vocalizing thoughts as they occur
- used to shed light on the testtaker’s thought processes during the administration of a test
“Think aloud” test administration
study of test items in which they are examined for fairness to all prospective testtakers as well as for the presence of offensive language, stereotypes, or
situations
Sensitivity Review
Finding the correlation between performance on the item and performance on the total test
The Point Biserial Method
Correlation between a dichotomous variable and a continuous variable
point biserial correlation
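Because the point-biserial coefficient equals the Pearson correlation computed with a 0/1 item variable, it can be sketched directly (scores hypothetical; statistics.correlation requires Python 3.10+):

```python
from statistics import correlation

# Point-biserial: Pearson correlation between a dichotomous item score (0/1)
# and the continuous total test score.
item_scores = [1, 1, 0, 1, 0, 0, 1, 0]
total_scores = [38, 41, 22, 35, 27, 19, 44, 25]
print(round(correlation(item_scores, total_scores), 2))  # ~0.93 for these made-up data
```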
revalidation of a test on a sample of testtakers other than those on whom test performance was originally found to be a valid predictor of some criterion
Cross-validation
decrease in item validities that inevitably occurs after cross-validation of findings
Validity Shrinkage
test validation process conducted on two or
more tests using the same sample of testtakers
Co-validation
when co-validation is used in conjunction with the creation of norms or the revision of existing norms
Co-norming
test protocol scored by a highly authoritative scorer that is designed as a model for scoring and as a mechanism for resolving scoring discrepancies
anchor protocol
a discrepancy between scoring in an anchor protocol and the scoring of another protocol
scoring drift
phenomenon wherein an item functions differently in one group of testtakers as compared to another group of testtakers known to have the same level of the underlying trait
Differential item functioning (DIF)
(level of difficulty) optimal average item difficulty (whole test)
0.5
(level of difficulty) average item difficulty on individual items
0.3 to 0.8
(level of difficulty) true or false
0.75
(level of difficulty) multiple choice (4 choices)
0.625
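These benchmark values follow from placing the optimal difficulty halfway between the chance success rate and 1.00:

```python
# Optimal item difficulty = (chance success rate + 1.0) / 2
def optimal_difficulty(n_choices: int) -> float:
    return (1.0 / n_choices + 1.0) / 2

print(optimal_difficulty(2))  # 0.75  -- true-false items
print(optimal_difficulty(4))  # 0.625 -- four-option multiple choice
```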