Test Development Flashcards
An emerging social phenomenon or pattern of behavior may serve as the stimulus for developing a new test, as may the need to assess mastery in an emerging occupation or profession.
Test Conceptualization
Criterion-referenced testing and assessment are commonly employed in _ and _ contexts.
Licensing
Educational context
The items that best discriminate between 2 groups would be considered the _ items.
Good items
A good item on a _ test is an item for which high scorers on the test respond correctly and low scorers respond incorrectly.
Norm-referenced test
The preliminary research surrounding the creation of a prototype of the test. It is done to evaluate whether test items should be included in the final form of the instrument.
Pilot work
The process by which a measuring device is designed and calibrated and by which numbers are assigned to different amounts of the trait, attribute, or characteristic being measured.
Scaling
He is credited with being at the forefront of efforts to develop methodologically sound scaling methods.
L. L. Thurstone
Types of scales
Age-based scale
Grade-based scale
Stanine scale
A type of scale where all raw scores on the test are to be transformed into scores that can range from 1-9.
Stanine scale
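The stanine transformation above can be sketched in code. This is a hypothetical illustration, assuming the common shortcut of linearly rescaling z-scores to a mean of 5 and SD of 2, then clipping to the 1-9 range; operational stanine tables assign scores by fixed percentile bands instead.

```python
# Sketch: converting raw scores to stanines via z-scores
# (assumed shortcut: stanine = round(2z + 5), clipped to 1-9).
from statistics import mean, pstdev

def to_stanines(raw_scores):
    m, s = mean(raw_scores), pstdev(raw_scores)
    stanines = []
    for x in raw_scores:
        z = (x - m) / s                      # standardize the raw score
        st = round(z * 2 + 5)                # rescale to mean 5, SD 2
        stanines.append(min(9, max(1, st)))  # clip to the 1-9 range
    return stanines

print(to_stanines([10, 12, 14, 16, 18, 20, 22, 24, 26]))
```

Note that the middle raw score maps to a stanine of 5 and the extremes are pulled toward, but never outside, 1 and 9.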
The 3 scaling methods
Rating Scale
Summative scale
Likert scale
A grouping of words, statements or symbols on which judgments of the strength of a particular trait, attitude or emotion are indicated by the test taker.
Rating Scale
Test score is obtained by summing the rating across all the items.
Summative scale
A type of summative rating scale that is used extensively in psychology to scale attitudes. Each item presents the testtaker with five alternative responses, usually on an agree-disagree or approve-disapprove continuum.
Likert scale
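Summative (Likert) scoring can be sketched as follows. This is a minimal illustration, assuming a 1-5 response format and the common practice of reverse-keying negatively worded items; the item indices and scores are made up.

```python
# Sketch of summative scoring: each item is rated 1-5 on an
# agree-disagree continuum, reverse-keyed items are flipped,
# and the total score is the sum of ratings across all items.

def score_likert(responses, reverse_keyed=()):
    total = 0
    for i, r in enumerate(responses):
        if i in reverse_keyed:
            r = 6 - r        # flip a 1-5 rating (5 becomes 1, etc.)
        total += r
    return total

# Hypothetical example: item 2 is worded negatively, so it is reverse-scored.
print(score_likert([5, 4, 1, 3], reverse_keyed={2}))  # → 17
```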
When one dimension is presumed to underlie the ratings.
Unidimensional
When more than 1 dimension is thought to guide the testtaker’s responses.
Multidimensional
What are the 4 scaling methods that produce ordinal data?
Method of paired comparison
Comparative scaling
Categorical scaling
Guttman scale
A scaling method that produces ordinal data. Testtakers are presented with pairs of stimuli which they are asked to compare, and they must select one of the stimuli according to some rule. Then they receive a higher score for selecting the option deemed more justifiable by the majority of a group of judges.
Method of Paired comparison
A scaling method that produces ordinal data. Stimuli such as printed cards, drawings, photographs or other objects are typically presented to testtakers for evaluation and must be sorted from most justifiable to least justifiable. It can also be accomplished through the use of a list of items on a sheet of paper.
Comparative scaling
A scaling method that produces ordinal data. Stimuli are placed into one of two or more alternative categories that differ quantitatively with respect to some continuum.
Categorical scaling
A scaling method that produces ordinal data. Items on it range sequentially from weaker to stronger expressions of the attitude, belief or feeling being measured. All respondents who agree with the stronger statements of the attitude will also agree with milder statements.
Guttman scale
The resulting data of a Guttman scale are analyzed by means of this. This is an item-analysis procedure and approach to test development that involves a graphic mapping of a testtaker’s responses.
Scalogram Analysis
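The cumulative property of a Guttman scale can be checked programmatically. This is a hypothetical sketch, assuming items are ordered from mildest to strongest and responses are coded 1 (agree) and 0 (disagree); a consistent pattern is a run of agreements followed only by disagreements.

```python
# Sketch: a Guttman-consistent response pattern (items ordered mild
# to strong) looks like 1,1,1,0,0 — once a respondent disagrees,
# no agreement with a stronger statement should follow.

def is_guttman_pattern(responses):
    seen_zero = False
    for r in responses:
        if r == 0:
            seen_zero = True
        elif seen_zero:       # an agreement after a disagreement
            return False
    return True

print(is_guttman_pattern([1, 1, 1, 0, 0]))  # consistent pattern
print(is_guttman_pattern([1, 0, 1, 0, 0]))  # inconsistent pattern
```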
The reservoir from which items will or will not be drawn for the final version of the test. It includes items available for use as well as new items created especially for the test.
Item pool
It is the form, plan, structure, arrangement and layout of individual test items.
Item format
The two types of item format:
Selected response format
Constructed response format
It requires testtakers to select a response from a set of alternative responses.
Selected response format
3 Types of selected response format:
Multiple choice format
Matching item
True-false
Several incorrect alternatives or options in a multiple choice format are referred to as _.
Distractors or foils
A selected response format where the testtaker is presented with 2 columns where they have to determine which response is best associated with which premise.
Matching item
A multiple choice item format that contains only two possible responses (binary choice) (agree or not, yes or no, right or wrong, fact or opinion). It usually takes the form of a sentence.
True-false
3 types of constructed response items:
Completion item
Short-answer item
Essay
A constructed response format that requires the examinee to provide a word or phrase that completes a sentence.
Completion item
A constructed response format where a word, term, sentence or paragraph may qualify as an answer.
Short-answer item
A constructed response format that requires the testtaker to respond to a question by writing a composition, typically one that demonstrates recall of facts, understanding, analysis and/or interpretation.
Essay
A relatively large and easily accessible collection of test questions.
Item bank
An interactive, computer-administered test-taking process wherein items presented to the testtaker are based in part on the test takers’ performance on previous items.
Computerized adaptive testing
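The branching idea behind computerized adaptive testing can be sketched simply. This is only an illustration of the principle, with made-up step sizes; real CAT systems select items using item response theory, not a fixed increment.

```python
# Sketch of the adaptive principle: after each response, move to a
# harder item if the answer was correct and an easier item if not.

def next_difficulty(current, correct, step=0.5):
    return current + step if correct else current - step

level = 0.0                          # start at average difficulty
for answer in [True, True, False]:   # hypothetical response sequence
    level = next_difficulty(level, answer)
print(level)  # → 0.5
```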
It refers to the diminished utility of an assessment tool for distinguishing testtakers at the low end of the ability, trait or other attribute being measured. Testtakers who have not yet achieved such ability might fail all the items.
Floor effect
It refers to the diminished utility of an assessment tool for distinguishing testtakers at the high end of the attribute being measured. Testtakers who answer all of the items correctly are likely to conclude that the test was too easy.
Ceiling effect
The ability of the computer to tailor the content and order of presentation of test items on the basis of responses to previous items.
Item branching
What are the 3 different scoring models?
Cumulative model
Class scoring or Category scoring
Ipsative scoring
Scoring model where the higher the score on the test, the higher the testtakers are on the ability or characteristic that the test purports to measure.
Cumulative model
Scoring model where testtakers earn credit toward placement in a particular class or category with other testtakers whose patterns of responses are presumably similar in some way. Used by some diagnostic systems.
Class scoring or Category scoring
Scoring model that compares a testtaker’s score on one scale within a test to another scale within that same test.
Ipsative scoring
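Ipsative scoring can be illustrated with a short sketch. This is a hypothetical example, assuming the simplest within-person comparison (each scale expressed relative to the testtaker's own mean across scales); the scale names and scores are invented.

```python
# Sketch of ipsative scoring: a testtaker's score on one scale is
# compared to that same testtaker's scores on other scales within
# the same test, rather than to other people's scores.

def ipsative_profile(scale_scores):
    m = sum(scale_scores.values()) / len(scale_scores)
    # Each scale is expressed relative to the testtaker's own mean.
    return {scale: score - m for scale, score in scale_scores.items()}

print(ipsative_profile({"dominance": 18, "affiliation": 12, "autonomy": 15}))
```

The resulting profile says only that this person is higher on one scale than another, not how they compare with other testtakers.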
The informal rule of thumb for test tryout is that there should be no fewer than _ subjects and preferably as many as _ for each item on the test.
5
10
Factors that actually are just artifacts of the small sample size.
Phantom factors
A lowercase italic “p” is used to denote _.
Item Difficulty
The larger the item difficulty index, the _ the item.
Easier
The optimal average item difficulty for maximum discrimination among the abilities of testtakers.
Approximately 0.5
The range of difficulty for individual items on the test.
0.3-0.8
For the possible effect of guessing, the optimal average item difficulty is usually the midpoint between _ and the chance success proportion.
1.00
The probability of answering correctly by random guessing.
Chance success proportion
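The difficulty statistics above can be computed directly. This is a minimal sketch, assuming dichotomously scored items (1 = correct, 0 = incorrect) and a multiple-choice format where the chance success proportion is 1 divided by the number of options.

```python
# Sketch: item-difficulty index p is the proportion of testtakers
# answering the item correctly; the guessing-adjusted optimal
# difficulty is the midpoint between 1.00 and the chance success
# proportion (1 / number of response options).

def item_difficulty(responses):
    return sum(responses) / len(responses)   # 1 = correct, 0 = incorrect

def optimal_difficulty(n_options):
    chance = 1 / n_options                   # chance success proportion
    return (1.00 + chance) / 2               # midpoint with 1.00

print(item_difficulty([1, 1, 1, 0, 0]))      # p = 0.6
print(optimal_difficulty(4))                 # 4-option item → 0.625
```

For a four-option multiple-choice item, chance success is .25, so the optimal average item difficulty is (1.00 + .25) / 2 = .625.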
The higher the item-reliability index, the greater the test’s _.
Internal consistency
A statistical tool useful in determining whether items on a test appear to be measuring the same thing.
Factor analysis
It is a statistic designed to provide an indication of the degree to which a test is measuring what it purports to measure.
Item-validity index
The higher the item validity index, the greater the test’s _.
Criterion-related validity
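The item-validity index can be sketched in code. Treat the exact formula as an assumption: one common formulation multiplies the item-score standard deviation by the correlation between the item score and the criterion score.

```python
# Sketch of one formulation of the item-validity index (assumed):
# item-score standard deviation × item-criterion correlation.
from statistics import pstdev

def pearson_r(x, y):
    # Pearson correlation between two equal-length score lists.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

def item_validity_index(item_scores, criterion_scores):
    return pstdev(item_scores) * pearson_r(item_scores, criterion_scores)

# Hypothetical data: item score (1/0) and criterion score per testtaker.
print(item_validity_index([1, 1, 0, 0], [10, 9, 3, 2]))
```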
It compares performance on a particular item with performance in the upper and lower regions of a distribution of continuous test scores.
Item discrimination index
Item discrimination index is symbolized by _.
Lowercase italic “d”
The _ the value of d, the more adequately the item discriminates the higher-scoring from the lower-scoring testtakers.
Higher
The highest possible value of d.
+1.00
The value of d that indicates the item is not discriminating for there is the same proportion of members of the upper and lower groups who pass the item.
0
The lowest value that an index of item discrimination can take. It indicates that all members of the upper group failed the item and all members of the lower group passed it.
-1.00
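The values of d above can be reproduced with a short sketch, assuming the common definition of d as the difference between the proportions of the upper- and lower-scoring groups who pass the item.

```python
# Sketch: item-discrimination index d = (proportion of upper group
# passing the item) - (proportion of lower group passing it).
# d ranges from -1.00 (all of the lower group and none of the
# upper group pass) to +1.00.

def discrimination_index(upper_pass, upper_n, lower_pass, lower_n):
    return upper_pass / upper_n - lower_pass / lower_n

print(round(discrimination_index(9, 10, 3, 10), 2))   # 0.6: discriminates well
print(round(discrimination_index(5, 10, 5, 10), 2))   # 0.0: no discrimination
print(round(discrimination_index(0, 10, 10, 10), 2))  # -1.0: reversed item
```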
It is a graphic representation of item difficulty and discrimination. The steeper the slope, the greater the item discrimination.
Item-characteristic curves
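The slope-discrimination relationship can be illustrated with a logistic model. This is a sketch, assuming the two-parameter logistic form common in item response theory, where a (the slope) reflects discrimination and b reflects difficulty; the parameter values are made up.

```python
# Sketch: a two-parameter logistic item-characteristic curve.
# P(correct) rises with ability theta; a steeper slope (larger a)
# separates low- and high-ability testtakers more sharply.
import math

def icc(theta, a, b):
    # Probability of a correct response at ability level theta.
    return 1 / (1 + math.exp(-a * (theta - b)))

# A highly discriminating item (a=2) vs. a weak one (a=0.5), both b=0.
for theta in (-1, 0, 1):
    print(round(icc(theta, a=2, b=0), 3), round(icc(theta, a=0.5, b=0), 3))
```

At theta = 0 both items give P = .5, but the steep item's probabilities spread much further apart across ability levels.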
It is an item that favors one particular group of examinees in relation to another when differences in group ability are controlled.
Biased test item
It is exemplified by different shapes of item-characteristic curves for different groups when the 2 groups do not differ in total test score.
Differential item functioning
These are techniques of data generation and analysis that rely primarily on verbal rather than mathematical or statistical procedures.
Qualitative methods
It is a general term for various nonstatistical procedures designed to explore how individual test items work. It involves exploration of the issues through verbal means.
Qualitative item analysis
A qualitative research tool designed to shed light on the testtaker’s thought processes during the administration of a test. They are asked to think aloud as they respond to each item.
“Think aloud” test administration
A study of test items, typically conducted during the test development process, in which the items are examined for fairness to all prospective testtakers and for the presence of offensive language, stereotypes or situations.
Sensitivity review
It refers to the revalidation of a test on a sample of testtakers other than those on whom test performance was originally found to be a valid predictor of some criterion.
Cross validation
The decrease in item validities that inevitably occurs after cross-validation of findings.
Validity shrinkage
Test validation process conducted on two or more tests using the same sample of testtakers.
Co-validation
Co-validation conducted in conjunction with the creation of norms or the revision of existing norms.
Co-norming
Discrepancies between scorers are resolved by a third scorer, who is called the _.
Resolver
A protocol scored by a highly authoritative scorer, designed as a model for scoring and as a mechanism for resolving scoring discrepancies.
Anchor protocol
Discrepancy between scoring in an anchor protocol and the scoring of another protocol.
Scoring drift
Items that respondents from different groups, at the same level of the underlying trait, have different probabilities of endorsing as a function of their group membership.
Differential item functioning (DIF) items