Chapters 7-8 Flashcards
usefulness or practical value of testing to improve efficiency
Utility
used to refer to the usefulness or practical value of a training program or
intervention
Utility
Factors that affect a test’s utility
- Psychometric Soundness
- Cost
- Benefits
The reliability and validity of a test, which give us an indication of the practical value of its scores
Psychometric Soundness
They tell us whether decisions are cost-effective
Psychometric Soundness
A test must be valid to be useful, but a valid test is not always a useful test, especially if testtakers do not follow test directions
True
It refers to disadvantages, losses or expenses in both economic and noneconomic terms
Cost
It refers to profits, gains or advantages
Benefit
It is a family of techniques that entail a cost-benefit analysis designed to yield information relevant to a decision about the usefulness and/or practical value of a tool of assessment
Utility Analysis
provides an indication of the likelihood that a testtaker will score within some interval of scores on a criterion measure; an interval may be categorized as “passing,” “acceptable,” or “failing”
Expectancy Table/Chart
estimate of the percentage of employees hired on the basis of a particular test who will be successful at their jobs
Taylor-Russell Tables
used for obtaining the difference between the means of the selected and unselected groups to derive an index of what the test is adding to an already established procedure
Naylor-Shine Tables
A formula used to calculate the dollar amount of a utility gain resulting from the
use of a particular selection instrument under specified conditions
Brogden-Cronbach-Gleser Formula
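The formula is: utility gain = (N)(T)(r_xy)(SD_y)(Z_m) − (N)(C). A minimal Python sketch of the arithmetic, with every input value hypothetical:

```python
# Brogden-Cronbach-Gleser utility gain (all values hypothetical):
# utility_gain = (N)(T)(r_xy)(SD_y)(Z_m) - (N)(C)
N = 10        # number of applicants selected per year
T = 2.0       # average tenure in the position, in years
r_xy = 0.40   # criterion-related validity coefficient of the test
SD_y = 8000   # standard deviation of job performance in dollars
Z_m = 1.0     # mean standardized test score of the selected applicants
C = 200       # cost of testing one applicant

utility_gain = (N * T * r_xy * SD_y * Z_m) - (N * C)
print(f"Estimated utility gain: ${utility_gain:,.2f}")  # $62,000.00
```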
an estimate of the benefit (monetary/otherwise) of using a particular
test or selection method
Utility gain
a body of methods used to quantitatively evaluate selection procedures,
diagnostic classifications, therapeutic interventions or other assessment or
intervention-related procedures in terms of how optimal they are (most typically
from a cost-benefit perspective)
Decision Theory
a correct classification
hit
a qualified driver is hired; an unqualified driver is not hired
It is a hit
an incorrect classification; a mistake
miss
a qualified driver is not hired; an unqualified driver is hired
It is a miss
the proportion of people that an assessment tool accurately identified
as possessing a particular variable
hit rate
the proportion of qualified drivers with a passing score who actually
gain permanent employee status; the proportion of unqualified drivers with a
failing score who did not gain permanent status
This is a hit rate
the proportion of people that an assessment tool inaccurately identified
as possessing a particular variable
miss rate
the proportion of drivers who were inaccurately predicted to be qualified; the proportion of drivers who were inaccurately predicted to be unqualified
This is a miss rate
falsely indicates that the testtaker possesses a particular variable; example: a driver who is hired is not qualified
false positive
falsely indicates that the testtaker does not possess a particular variable; the assessment tool says not to hire, but the driver would have been rated as qualified
false negative
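To tie the four outcomes together, here is a toy Python sketch of the driver-hiring example (the function name and labels are illustrative, not from the source):

```python
# Classifying a selection decision against the driver-hiring example.
# "Positive" means the assessment tool says to hire.
def classify(tool_says_hire: bool, actually_qualified: bool) -> str:
    if tool_says_hire and actually_qualified:
        return "hit (true positive)"
    if not tool_says_hire and not actually_qualified:
        return "hit (true negative)"
    if tool_says_hire and not actually_qualified:
        return "miss: false positive"
    return "miss: false negative"

print(classify(True, False))   # false positive: an unqualified driver is hired
print(classify(False, True))   # false negative: a qualified driver is not hired
```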
Some practical considerations
The Pool of Job Applicants
The Complexity of the Job
The Cut Score in Use
a (usually numerical) reference point derived as a result of a judgment and used to divide a set of data into two or more classifications, with some action to be taken
or some inference to be made on the basis of these classifications
Cut Score/Cutoff Score
dictate what sort of information will be required as well as the
specific methods to be used
objective of utility analysis
Used to measure costs vs. benefits
Expectancy Data
- Based on norm-related considerations rather than on the relationship of test scores to a criterion (normative)
- Also called norm-referenced cut score
- Ex.) top 10% of test scores get A’s
Relative cut score
- Set with reference to a judgment concerning a minimum level of proficiency required to be included in a particular classification (criterion-referenced)
- Also called absolute cut score
Fixed cut score
using two or more cut scores with reference to one predictor for the purpose of categorizing
testtakers
Multiple cut scores
Ex.) having cut scores that mark an A, B, C, etc., all measuring the same predictor
Multiple cut scores
the achievement of a particular cut score on one test is necessary in order to
advance to the next stage of evaluation in the selection process
Multiple-stage or multiple-hurdle selection
written application → group interview → personal interview
Multiple-stage or multiple-hurdle selection
assumption is made that high scores on one attribute can compensate for low scores on another attribute
Compensatory model of selection
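A compensatory model is typically implemented as a weighted composite (as with multiple regression weights); a toy sketch with hypothetical weights and scores:

```python
# Compensatory model of selection: a weighted composite lets a high score
# on one predictor offset a low score on another (weights are hypothetical).
weights = {"test": 0.6, "interview": 0.4}
applicant = {"test": 85, "interview": 55}   # weak interview, strong test score
composite = sum(weights[k] * applicant[k] for k in weights)
print(composite)  # 73.0 -- could still clear a composite cut score of, say, 70
```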
Who devised Angoff method?
William Angoff
a way to set fixed cut scores that entails averaging the judgments of experts; must have high inter-rater reliability
Angoff Method
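A minimal sketch of the Angoff arithmetic, assuming hypothetical ratings in which each expert estimates the probability that a minimally competent testtaker answers each item correctly:

```python
# Angoff method sketch: probabilities are summed per expert, then the
# experts' implied cut scores are averaged (all ratings hypothetical).
expert_ratings = [
    [0.6, 0.8, 0.5, 0.9],  # expert 1, items 1-4
    [0.7, 0.7, 0.4, 0.8],  # expert 2
    [0.5, 0.9, 0.6, 0.9],  # expert 3
]
per_expert = [sum(r) for r in expert_ratings]  # [2.8, 2.6, 2.9]
cut_score = sum(per_expert) / len(per_expert)  # ~2.77 out of 4 items
print(round(cut_score, 2))
```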
a system of collecting data on a predictor of interest from groups known to
possess (and not to possess) a trait, attribute or ability of interest
Known Groups Method/Method of Contrasting Groups
a cut score is set on the test that best discriminates high performers from low performers
Known Groups Method/Method of Contrasting Groups
- in order to “pass” the test, the testtaker must answer items that have some minimum level of difficulty, which is determined by experts and serves as the cut score
Item Response Theory (IRT)-Based Methods
- Based on the testtaker’s performance across all items on a test
- Some portion of the test items must be answered correctly
IRT-Based Methods
a technique for identifying cut scores based on the number of positions to be
filled
Method of Predictive Yield
a family of statistical techniques used to shed light on the relationship between certain variables and two or more naturally occurring groups
Discriminant Analysis
determining the difficulty level reflected by the cut score
Item mapping method
test items are listed, one per page, in ascending order of difficulty. An expert places a bookmark to mark the divide that separates testtakers who have acquired the minimal knowledge, skills, or abilities from those who have not. Problems include the training of experts, possible floor and ceiling effects, and the optimal length of item booklets
Bookmark Method
Steps in Test Development
- TEST CONCEPTUALIZATION
- TEST CONSTRUCTION
- TEST TRYOUT
- ITEM ANALYSIS
- TEST REVISION
Conception of idea by the test developer
Test Conceptualization
An emerging social phenomenon or pattern of behavior might serve
as the stimulus for the development of a new test.
Test Conceptualization
An item for which high scorers on the test respond correctly while low scorers respond to that same item incorrectly
Norm-referenced conceptualization
The conceptualization focuses on the construct or material that needs to be mastered
Criterion-referenced conceptualization
testtakers who have mastered the material get a particular item right, whereas those who have not mastered it get that same item wrong
Criterion-referenced conceptualization
prototype of the test; necessary for research purposes, but not required for a teacher-made test
Pilot work
To know whether some items should be included in the final form of the instrument
Pilot work
the test developer typically attempts to determine how
best to measure a targeted construct
Pilot work
process of setting rules for assigning numbers in
measurement.
Scaling
credited with being at the forefront of efforts to develop methodologically sound scaling methods
L. L. Thurstone
Raw scores converted to a scale that ranges from 1 to 9
Stanine scale
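A sketch of the conversion, using the classic 4-7-12-17-20-17-12-7-4 percentage bands for stanines 1-9 (mapping from percentile rank is a simplification of a norm-based conversion):

```python
import bisect

# Stanine sketch: percentile ranks mapped onto the 1-9 stanine scale using
# the classic 4-7-12-17-20-17-12-7-4 percent bands.
bounds = [4, 11, 23, 40, 60, 77, 89, 96]  # cumulative upper percentiles, stanines 1-8

def stanine(percentile: float) -> int:
    return bisect.bisect_right(bounds, percentile) + 1

print(stanine(50))  # 5 -- the middle stanine
print(stanine(97))  # 9 -- top 4 percent
```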
measuring one construct
Unidimensional Scale
measuring more than one construct
Multidimensional Scale
entails judgments of a stimulus in comparison with every other stimulus on the scale (best to worst)
Comparative Scaling
stimuli are placed into one of two or more alternative categories that differ quantitatively with
respect to some continuum (section 1, section 2, section 3)
Categorical Scaling
Can be defined as a grouping of words, statements, or symbols on which judgments of the strength of a particular trait, attitude, or emotion are indicated by the testtaker
Rating Scale
when the final score is obtained by summing the ratings across all the items
Summative Scale
a type of summative rating scale wherein each item presents the testtaker with five alternative responses, usually on an agree-disagree or approve-disapprove continuum. It is ordinal in nature
Likert Scale
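A minimal sketch of summative (Likert-type) scoring with hypothetical 5-point items; the reverse-keying step is a common convention, not something stated above:

```python
# Summative scoring of a 5-item Likert-type scale
# (1 = strongly disagree ... 5 = strongly agree; responses hypothetical).
responses = [4, 5, 3, 2, 5]
reverse_keyed = {3}  # item at index 3 is worded negatively, so recode it
scored = [6 - r if i in reverse_keyed else r for i, r in enumerate(responses)]
print(sum(scored))  # 21 -- the summative scale score
```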
scaling method whereby one of a pair of stimuli (such
as photos) is selected according to a rule (such as –
“select the one that is more appealing”)
Paired Comparison
presented with two stimuli and asked to compare
Paired comparison
judging of a stimulus in comparison with every
other stimulus on the scale
Comparative Scaling
testtaker places stimuli into a category; those categories differ quantitatively on a spectrum
Categorical Scaling
items range from sequentially weaker to stronger expressions of attitude, belief, or feeling. A
testtaker who agrees with the stronger statement is assumed to also agree with the milder statements
Guttman Scale/Scalogram Analysis
a scale wherein items range
sequentially from weaker to stronger expressions of the
attitude or belief being measured
Guttman Scale/Scalogram Analysis
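A toy check of the cumulative (scalogram) property, assuming items are listed from mildest to strongest:

```python
# Guttman scale sketch: with items ordered mild -> strong, a pattern fits
# the scale only if agreement never resumes after a disagreement.
def is_cumulative(pattern):  # 1 = agree, 0 = disagree
    return all(a >= b for a, b in zip(pattern, pattern[1:]))

print(is_cumulative([1, 1, 1, 0, 0]))  # True: scalable response pattern
print(is_cumulative([1, 0, 1, 0, 0]))  # False: violates the cumulative assumption
```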
Developer of Guttman Scale/Scalogram Analysis
Louis Guttman
direct estimation, because there is no need to transform the testtaker’s responses to another scale; it is presumed to be interval in nature
Thurstone’s Equal Appearing Intervals Method
When devising a standardized test using a multiple-choice format, it is usually advisable that the first draft contains approximately ______ the number of items that the final version of the test will contain
twice
What to consider in writing items
- range of content that the items should cover
- which item format should be employed
- the number of items to be written, in total and for each content area covered
reservoir from which items will or will not be drawn for the final version of the test
Item pool
Item pool should be about _____ the number of questions as final will have
double
variables such as the form, plan, structure, arrangement and layout of individual test items
Item format
the collection of items to be further evaluated for possible selection for use in an item bank
Item pool
testtaker selects a response from a set of alternative responses
Selected-Response Format
What type of item format is multiple choice, true-false, and matching
Selected-Response Format
testtaker supplies or creates
the correct answer
Constructed-Response Format
Item format that includes completion item, short answer and essay
constructed-response format
a relatively large and easily accessible collection of test questions
Item bank
interactive, computer-administered testtaking process wherein items presented to the testtaker are based in part on testtaker’s performance on
previous items.
Computerized Adaptive Testing (CAT)
the diminished utility of an assessment tool for distinguishing testtakers at the low end of the ability, trait, or other attribute being measured
floor effect
diminished utility of an assessment tool for distinguishing testtakers at the high end of the ability, trait, attribute being measured
ceiling effect
ability of computer to tailor the content and order of presentation of test items on the basis of responses to previous items
item branching
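A toy illustration of the idea (the step size and difficulty scale are invented for the example):

```python
# Item-branching sketch: move to a harder item after a correct response
# and to an easier item after an incorrect one.
def next_difficulty(current: float, correct: bool, step: float = 0.1) -> float:
    return min(1.0, current + step) if correct else max(0.0, current - step)

d = 0.5                        # start at medium difficulty
d = next_difficulty(d, True)   # correct -> branch to a harder item (0.6)
d = next_difficulty(d, False)  # incorrect -> branch back down (0.5)
print(d)
```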
testtakers earn cumulative credit with regard to a particular construct
cumulative scoring
testtaker responses earn credit toward placement in a particular class or category with other testtakers whose pattern of responses is presumably similar in some way
class/category scoring
comparing a testtaker’s score on one scale within a test to another scale within that same test
ipsative scoring
John’s need for achievement is higher than his need for affiliation
ipsative scoring
offers two alternatives for each item
dichotomous format
resembles the dichotomous format except that each item has more than two alternatives
polytomous format
incorrect choices in multiple choice
distractors
describes the chances that a low-ability testtaker will obtain each score
guessing threshold
uses more choices than Likert; 10-point rating scale
category format
respondent is given a 100-millimeter line and asked to place a mark between two well-defined endpoints; often used to measure self-rated health
Visual analogue scale
subject receives a long list of adjectives and indicates whether each one is characteristic of himself or herself
Adjective checklist
Obtained by calculating the proportion of the total number of testtakers who answered the item correctly; denoted “p”
Item-Difficulty Index
Higher p indicates
easier items
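Worked arithmetic for the index, with hypothetical counts:

```python
# Item-difficulty index: p = (testtakers answering correctly) / (total testtakers)
n_correct = 50
n_total = 80
p = n_correct / n_total
print(p)  # 0.625 -- the higher p is, the easier the item
```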
Difficulty can be replaced with _________________in non-achievement tests
endorsement
- Indication of the internal consistency of a test
- Equal to the product of the item-score standard deviation (s) and the correlation (r) between the item score and the total test score
- Factor analysis and inter-item consistency can also be used to evaluate it
Item-Reliability Index
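A minimal sketch of the computation; for a dichotomously scored item the item-score standard deviation is sqrt(p(1 − p)), and the values below are hypothetical:

```python
import math

# Item-reliability index = s * r (item-score SD times item-total correlation)
p = 0.6                      # proportion answering the item correctly
s = math.sqrt(p * (1 - p))   # SD of a dichotomous item, ~0.49
r = 0.35                     # correlation between item score and total score
print(round(s * r, 3))       # ~0.171
```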
Statistic designed to provide an indication of the degree to which a test is measuring what it purports to measure. It requires the item-score standard deviation and the correlation between the item score and the criterion score
Item-Validity Index
means greater number of high scorers answering the item correctly
higher d
means low-scoring examinees are more likely to answer the item correctly than high-scoring examinees
negative d
compares performance on a particular item with performance in the upper and lower regions of a distribution of continuous test scores
Item-Discrimination Index
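Worked arithmetic using the usual formula d = (U − L) / n, where U and L are the numbers of correct responses in the upper and lower scoring groups and n is the size of each group (counts hypothetical):

```python
# Item-discrimination index: d = (U - L) / n
U = 24   # upper-group testtakers who answered the item correctly
L = 10   # lower-group testtakers who answered the item correctly
n = 32   # number of testtakers in each group
print((U - L) / n)  # 0.4375 -- positive d: high scorers pass the item more often
```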
Graphic representation of item difficulty and discrimination
Item-Characteristic Curves
techniques of data generation and analysis that rely primarily on verbal rather than mathematical or statistical procedures
Qualitative method
various nonstatistical procedures designed to explore how individual test items work
Qualitative item analysis
- approach to cognitive assessment that entails respondents vocalizing thoughts as they occur
- used to shed light on the testtaker’s thought processes during the administration of a test
“Think aloud” test administration
study of test items in which they are examined for fairness to all prospective testtakers as well as for the presence of offensive language, stereotypes, or
situations
Sensitivity Review
Finding the correlation between performance on the item and performance on the total test
The Point Biserial Method
Correlation between a dichotomous variable and a continuous variable
point biserial correlation
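Because the point-biserial coefficient equals the Pearson correlation computed with a 0/1 item variable, it can be sketched directly (scores hypothetical; statistics.correlation requires Python 3.10+):

```python
from statistics import correlation

# Point-biserial: Pearson correlation between a dichotomous item score (0/1)
# and the continuous total test score.
item_scores = [1, 1, 0, 1, 0, 0, 1, 0]
total_scores = [38, 41, 22, 35, 27, 19, 44, 25]
print(round(correlation(item_scores, total_scores), 2))  # ~0.93 for these made-up data
```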
revalidation of a test on a sample of testtakers other than those on whom test performance was originally found to be a valid predictor of some criterion
Cross-validation
decrease in item validities that inevitably occurs after cross-validation of findings
Validity Shrinkage
test validation process conducted on two or
more tests using the same sample of testtakers
Co-validation
when co-validation is used in conjunction with the creation of norms or the revision of existing norms
Co-norming
test protocol scored by a highly authoritative scorer that is designed as a model for scoring and as a mechanism for resolving scoring discrepancies
anchor protocol
a discrepancy between scoring in an anchor protocol and the scoring of another protocol
scoring drift
phenomenon wherein an item functions differently in one group of testtakers as compared to another group of testtakers known to have the same level of the underlying trait
Differential item functioning (DIF)
(level of difficulty) optimal average item difficulty (whole test)
0.5
(level of difficulty) average item difficulty on individual items
0.3 to 0.8
(level of difficulty) true or false
0.75
(level of difficulty) multiple choice (4 choices)
0.625
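These benchmark values follow from placing the optimal difficulty halfway between the chance success rate and 1.00:

```python
# Optimal item difficulty = (chance success rate + 1.0) / 2
def optimal_difficulty(n_choices: int) -> float:
    return (1.0 / n_choices + 1.0) / 2

print(optimal_difficulty(2))  # 0.75  -- true-false items
print(optimal_difficulty(4))  # 0.625 -- four-option multiple choice
```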