Chapter 8 Test Development Flashcards

1
Q

Biased test item:

A

A biased test item is one that favours one particular group of examinees in relation to another when differences in group ability are controlled.
p.264

2
Q

How to detect a biased test item?

A

Methods of item analysis include inspection of item characteristic curves (ICCs).
Specific items are identified as biased if they exhibit differential item functioning:
the item characteristic curves for the different groups should not be statistically different.

3
Q

What is the order of Test Development from conceptualization?

A
1. Test conceptualization
2. Test construction
3. Test tryout
4. Item analysis
5. Test revision (then back to test tryout again, in a cycle)
p.234
4
Q

What is a good item on a norm referenced achievement test?

A

An item that high scorers on the test as a whole tend to answer correctly,
and that low scorers on the test tend to answer incorrectly.

5
Q

What pattern should occur on a criterion referenced test?

A

On a criterion-referenced test, the pattern of results may be the same as on a norm-referenced test:
high scorers get a particular item right whereas low scorers get it wrong.
p.235

6
Q

Criterion-referenced test: difference …

A

Ideally, each item on a criterion-referenced test addresses whether the test taker has met a certain criterion (e.g., whether a trainee pilot has mastered the required skills).
A norm-referenced interpretation is insufficient when knowledge of mastery is needed.
p.236

7
Q

Pilot work

A

Refers to the preliminary research surrounding the creation of a prototype of the test.
The test developer typically attempts to determine how best to measure a targeted construct.

8
Q

What is scaling?

A

Scaling is the process of setting rules for assigning numbers in measurement.
A process by which a measuring device is designed and calibrated and by which numbers - scale values - are assigned to different amounts of the trait, attribute or characteristic being measured.

9
Q

Stanine scale?

A

When raw scores are transformed to a scale that ranges from 1 to 9.
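A minimal Python sketch of one way stanines can be computed, assuming a linear z-score transformation (mean 5, SD 2, clipped to 1-9); test publishers may instead use fixed percentile bands. The raw scores are invented.

```python
import statistics

def to_stanines(raw_scores):
    """Convert raw scores to stanines: mean 5, SD 2, clipped to 1-9."""
    mean = statistics.mean(raw_scores)
    sd = statistics.stdev(raw_scores)
    stanines = []
    for x in raw_scores:
        z = (x - mean) / sd                 # standardize the raw score
        s = round(2 * z + 5)                # rescale to mean 5, SD 2
        stanines.append(min(9, max(1, s)))  # clip to the 1-9 range
    return stanines

print(to_stanines([55, 60, 62, 70, 75, 80, 91]))
```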

10
Q

What is the MDBS?

A

The MDBS (Morally Debatable Behaviours Scale) is an example of a rating scale.
It has 30 items, each rated on a 10-point scale from "never justified" to "always justified".
Rating scales are:
A grouping of words, statements, or symbols on which judgements of the strength of a particular trait, attitude, or emotion are indicated by the test taker.
p.239

11
Q

What is a rating scale?

A

Rating scales are:
A grouping of words, statements or symbols on which judgements of the strength of a particular trait, attitude or emotion are indicated by the test taker.
Used to record judgements of oneself, others, experiences, or objects, and they can take several forms.
p.239

12
Q

What is a summative scale?

A

A scale on which the final test score is obtained by summing the ratings across all the items.
p.240

13
Q

What is the Likert Scale?

A

A summative scale used to scale attitudes.
Each item offers five alternative responses (sometimes seven),
usually on an agree-disagree or
approve-disapprove continuum.

Use of Likert scales results in ordinal-level data.
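A minimal Python sketch of summative Likert scoring; the ratings are invented, and the reverse-keyed flip is an illustrative detail not taken from the card.

```python
def summative_score(ratings, reverse_keyed=(), scale_max=5):
    """Sum Likert ratings across items to obtain the total scale score."""
    total = 0
    for i, r in enumerate(ratings):
        if i in reverse_keyed:
            r = scale_max + 1 - r   # flip a reverse-worded item first
        total += r
    return total

# One respondent on a five-item, 5-point agree-disagree scale;
# item 2 (0-based) is worded in the opposite direction:
print(summative_score([4, 5, 2, 4, 3], reverse_keyed={2}))  # 20
```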

14
Q

Unidimensional rating scale?

A

Only one dimension is underlying the ratings.

15
Q

Multidimensional rating scales.

A

More than one dimension is thought to guide the test taker's responses,
i.e., more than one dimension is tapped by an item. p.241

16
Q

Method of paired comparisons?

A

A scaling method that produces ordinal data.
Test takers are presented with pairs of stimuli (e.g., two photos, two statements, two objects).
They must select one of the stimuli according to some rule.
p.241
An advantage is that it forces test takers to choose between items.
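A minimal Python sketch of one common way paired-comparison choices can be scored: each stimulus earns a point per selection, and the totals give an ordinal ranking. The stimulus names and choices are invented.

```python
from collections import Counter

def score_paired_comparisons(choices):
    """`choices` maps each presented pair to the stimulus the test taker selected."""
    points = Counter()
    for pair, selected in choices.items():
        points[selected] += 1        # one point each time a stimulus is chosen
    return points.most_common()      # ordinal ranking, most preferred first

choices = {
    ("photo A", "photo B"): "photo A",
    ("photo A", "photo C"): "photo C",
    ("photo B", "photo C"): "photo C",
}
print(score_paired_comparisons(choices))
# [('photo C', 2), ('photo A', 1)] -- photo B earned no points
```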

17
Q

Categorical scaling

A
Relies on sorting.
Stimuli are placed into one of two or more alternative categories that differ quantitatively with respect to some continuum.
e.g., the MDBS-R: sorting 30 cards into 3 piles:
behaviours never justified
sometimes justified
always justified
18
Q

Guttman scale:

A

A scaling method that yields ordinal-level measures.
Items range sequentially from weaker to stronger expressions of the attitude, belief, or feeling being measured.
A key feature is that all respondents who agree with the stronger statements will also agree with the milder statements.
Assessed by scalogram analysis.
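A minimal Python sketch of the cumulative pattern a perfect Guttman scale implies; real scalogram analysis tolerates and counts departures from this pattern.

```python
def is_guttman_consistent(responses):
    """Check one respondent's 1/0 endorsements, ordered mildest to strongest.

    A perfect Guttman pattern looks like 1...1 0...0: agreeing with a
    stronger statement implies agreeing with every milder one.
    """
    seen_zero = False
    for r in responses:
        if r == 0:
            seen_zero = True
        elif seen_zero:          # a 1 after a 0 breaks the cumulative pattern
            return False
    return True

print(is_guttman_consistent([1, 1, 1, 0, 0]))  # True
print(is_guttman_consistent([1, 0, 1, 0, 0]))  # False
```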

19
Q

Scalogram analysis.

A

An item analysis procedure and approach to test development that involves a graphic mapping of a test taker’s responses.
p.242.
Guttman scale.

20
Q

Item pool

A

An item pool is the reservoir from which items will or will not be drawn for the final version of a test.

21
Q

Item format

A

Variables such as the form, plan, structure, arrangement, and layout of individual test items are collectively referred to as item format.
Two broad types:
Selected-response format
Constructed-response format

22
Q

Selected response format

A

Requires test takers to select a response from a set of alternative responses.
e.g., multiple-choice format
matching
true/false

23
Q

Constructed response format.

A

Requires test takers to supply or to create the correct answer, not merely to select it.
e.g., essay
short answer

24
Q

Multiple choice format.

A

3 elements:

  1. a stem
  2. a correct alternative or option
  3. several incorrect alternatives or options (distractors or foils)
25
Q

What sort of item is a matching item?

A

In a matching item, the test taker is presented with two columns:
premises on the left and responses on the right.
The test taker's task is to determine which response is best associated with which premise.
p.246

26
Q

Binary choice item.

A
A multiple-choice item that contains only two possible responses.
e.g., true/false
agree/disagree
yes/no
fact/opinion
right/wrong
27
Q

Constructed response format:

A

Completion item
Short answer
Essay

28
Q

Computer administration items:

A

Advantages:
Ability to store items in an item bank.
Item bank = large collection of testing questions.
Ability to individualize testing through item branching.

29
Q

Computerized adaptive testing.

A

CAT refers to an interactive, computer-administered test-taking process wherein items presented to the test taker are based in part on the test taker's performance on previous items.
p.248
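A minimal Python sketch of the adaptive idea behind CAT, assuming an invented step-halving rule rather than the IRT-based estimation a real CAT engine would use. Item names and difficulties are invented.

```python
def run_cat(item_difficulties, answer, start=0.0, step=1.0, n_items=5):
    """Present items adaptively and return the final ability estimate."""
    ability = start
    unused = dict(item_difficulties)    # difficulties on a z-like scale
    for _ in range(n_items):
        # present the unused item whose difficulty is closest to the estimate
        item = min(unused, key=lambda i: abs(unused[i] - ability))
        del unused[item]
        if answer(item):                # correct -> probe with harder items
            ability += step
        else:                           # incorrect -> probe with easier items
            ability -= step
        step /= 2                       # home in on the estimate
    return ability

items = {"q1": -2.0, "q2": -1.0, "q3": 0.0, "q4": 1.0, "q5": 2.0, "q6": 0.5}
print(run_cat(items, answer=lambda item: item in {"q3", "q4", "q6"}))
```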

30
Q

Floor effects

A

A floor effect refers to the diminished utility of an assessment tool for distinguishing test takers at the low end of the ability, trait, or other attribute being measured.
i.e., the test is too hard at the low end.
Solution: add some less difficult items.

31
Q

Ceiling effect

A

A ceiling effect refers to the diminished utility of an assessment tool for distinguishing test takers at the high end of the ability, trait, or other attribute being measured.
i.e., the test is too easy.
Solution: add some harder questions.

32
Q

Item branching

A

The ability of the computer to tailor the content and order of presentation of test items on the basis of responses to previous items.
e.g., branching based on patterns such as consecutive correct responses.
p.252
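A minimal Python sketch of rule-based item branching; the two-in-a-row rule is an invented illustration of branching on consecutive responses, not a published rule.

```python
def next_level(level, recent, n_levels=3):
    """Choose the next difficulty level from the response history."""
    if recent[-2:] == [True, True]:
        return min(n_levels - 1, level + 1)   # branch to harder items
    if recent[-2:] == [False, False]:
        return max(0, level - 1)              # branch to easier items
    return level                              # otherwise stay at this level

level, history = 1, []
for correct in [True, True, False, False, True]:
    history.append(correct)
    level = next_level(level, history)
print(level)  # 1
```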

33
Q

Class or category scoring.

A

Test taker responses earn credit toward placement in a particular class or category with other test takers whose pattern of responses is similar.

34
Q

Ipsative Scoring

A

A scoring model that compares a test taker's score on one scale within a test with their score on another scale within that same test.
p. 253.

35
Q

Item fairness.

Biased item

A

A biased item is one that favours one particular group of examinees in relation to another when differences in group ability are controlled.

36
Q

What do Item Characteristic Curves do?

A

They can be used to identify biased items.
Specific items are identified as biased in a statistical sense if they exhibit differential item functioning, i.e., different shapes of item characteristic curves for different groups.
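A minimal Python sketch, assuming the standard two-parameter logistic (2PL) IRT model, which the card does not name; the parameters are invented. Different curves for the same item across ability-matched groups suggest DIF.

```python
import math

def icc(theta, a, b):
    """2PL item characteristic curve: P(correct | ability theta)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Hypothetical parameters estimated separately for two ability-matched
# groups; the same item appears harder for group 2, suggesting DIF.
group_1 = {"a": 1.2, "b": 0.0}
group_2 = {"a": 1.2, "b": 0.8}

for theta in (-2, -1, 0, 1, 2):
    p1, p2 = icc(theta, **group_1), icc(theta, **group_2)
    print(f"theta={theta:+d}  group 1: {p1:.2f}  group 2: {p2:.2f}")
```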

37
Q

Qualitative Item Analysis

A

A general term for various nonstatistical procedures designed to explore how individual test items work.
Compares individual test items to each other and to the test as a whole.
Qualitative methods include:
interviews
group discussions

38
Q

Think aloud test administration

A

A cognitive assessment approach in which respondents verbalize their thoughts as they occur.
p.266 table

39
Q

Qualitative Analysis

Expert panels

A

e.g., a sensitivity review: a study of items, conducted during the test development process, in which items are examined for fairness to all prospective test takers and for the presence of offensive language, stereotypes, etc.

40
Q

Test Revision

A

Some items from the original pool will be eliminated and others will be rewritten.

Items are examined for being too difficult, too easy, biased, etc.

41
Q

Cross-validation

A

Cross-validation refers to the revalidation of a test on a sample of test takers other than those on whom test performance was originally found to be a predictor of some criterion.

42
Q

Validity Shrinkage

A

Validity shrinkage is the decrease in item validities that occurs after cross-validation of findings.
Such shrinkage is expected and is integral to the test development process.

43
Q

Co-validation

A

Co-validation is a test validation process conducted on two or more tests using the same sample of test takers.

44
Q

Co-norming

A

When used in conjunction with the creation of norms or the revision of existing norms, co-validation may also be referred to as co-norming.

A current trend among test publishers who publish more than one test designed for use with the same population is to co-validate and/or co-norm tests.
Economical.

45
Q

Anchor protocol

A

A mechanism for ensuring consistency in scoring:
a test protocol scored by an authoritative scorer that is designed as a model for scoring and as a mechanism for resolving scoring discrepancies.

46
Q

Scoring drift

A

Scoring drift is a discrepancy between the scoring in an anchor protocol and the scoring of another protocol.

Once protocols are scored, the data from them must be entered into a database.

47
Q

Item banks

A

Each of the items assembled as part of an item bank has undergone rigorous qualitative and quantitative evaluation.

Many items come from existing instruments.
New items may be written.
All items constitute the item pool.
p.274

48
Q

What scales of measurement are there?

A

Likert scales (e.g., 1 = strongly disagree to 7 = strongly agree)
Binary choice scales (e.g., true/false; like/dislike)
Forced choice (e.g., "I am happy most of the time" OR "I am sad most of the time")
Semantic differential scales (e.g., strong …… weak)

49
Q

Writing test items

What’s the first step?

A

To create an item pool.

Two general item format options:

  1. selected response items
  2. constructed response items
50
Q

What are the 4 analytic tools that test developers use to analyze and select items?

A
  • Item difficulty index
  • Item discrimination index
  • Item validity index
  • Item reliability index
51
Q

Item difficulty.

How is it calculated?

A

The item difficulty index (p) is calculated as the proportion of test takers who answered the item correctly.

p values range from 0 to 1.

Each item has a corresponding p value, e.g.,
p1 is read "item difficulty index for item 1".
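A minimal Python sketch of the item difficulty index, using an invented response matrix (rows = test takers, columns = items; 1 = correct, 0 = incorrect).

```python
responses = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 1, 0, 1],
]

def item_difficulty(responses):
    n_takers = len(responses)
    n_items = len(responses[0])
    return [sum(row[j] for row in responses) / n_takers  # proportion correct
            for j in range(n_items)]

print(item_difficulty(responses))  # [0.75, 0.75, 0.25, 1.0]
```

Note that the last item (p = 1.0) is answered correctly by everyone, so it cannot discriminate between test takers (see card 53).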

52
Q

What is the ideal level of item difficulty for a test as a whole?

A

It is calculated as the average of all the p values for the test's items.

The optimal average item difficulty is 0.5.

i.e., individual items should range in difficulty from about 0.3 (somewhat difficult) to 0.8 (somewhat easy).
The effect of guessing must be taken into account.

53
Q

Which items do not discriminate between test takers?

A

Items that everyone answers correctly (p = 1)
or that no one answers correctly (p = 0)
DO NOT DISCRIMINATE between test takers.

54
Q

What is the Item Discrimination Index?

A

The item discrimination index is the degree to which an item differentiates correctly on the behaviour the test is designed to measure.

i.e., an item is good if most of the high scorers on the test overall answer the item correctly
and most of the low scorers on the test answer the item incorrectly.
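A minimal Python sketch of one common form of the discrimination index d: the difference between the proportions of upper- and lower-scoring groups answering the item correctly. Splitting into halves is an assumption here; analysts often use the top and bottom 27% instead. The data are invented.

```python
def discrimination_index(item_correct, total_scores):
    """d = proportion correct in the upper group minus the lower group."""
    order = sorted(range(len(total_scores)), key=lambda i: total_scores[i])
    half = len(order) // 2
    lower, upper = order[:half], order[-half:]
    p_upper = sum(item_correct[i] for i in upper) / len(upper)
    p_lower = sum(item_correct[i] for i in lower) / len(lower)
    return p_upper - p_lower           # d ranges from -1 to +1

item = [1, 1, 1, 0, 0, 1]              # who got this item right
totals = [90, 80, 75, 60, 50, 85]      # total scores on the whole test
print(discrimination_index(item, totals))  # ~0.67: discriminates well
```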

55
Q

Item difficulty

Formula

A

Optimal item difficulty, taking guessing into account:

(1 + probability of chance success) / 2

e.g., for an item with a .25 chance of a correct guess:

(1 + .25) / 2 = .625 ≈ .63

which is the optimal difficulty for that item.
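A minimal Python sketch of the guessing adjustment above: the optimal p is the midpoint between chance-level success and a perfect 1.0.

```python
def optimal_difficulty(n_options):
    """Midpoint between chance-level success and a perfect 1.0."""
    chance = 1.0 / n_options          # probability of a lucky guess
    return (1.0 + chance) / 2.0

print(optimal_difficulty(4))  # 0.625 (~ .63) for four-option items
print(optimal_difficulty(2))  # 0.75 for true-false items
```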

56
Q

Anchor Protocol?

A

A test answer sheet developed by a test publisher to check the accuracy of examiners' scoring and to resolve scoring discrepancies.