Chapter 6 Flashcards
Major Steps in Test Development
- Define purpose
- Preliminary design issues
- Item preparation
- Item analysis
- Standardization/research
- Final materials and publication
Statement of Purpose
- Often just a single sentence
- Focuses on the trait being measured and how scores will be interpreted
Mode of Administration
- Group versus individual administration is an issue to consider
- Individual administration lets the examiner observe processes while the person is taking the test
Length of Exam
- A short test limits reliability
- A longer test is generally more reliable
Item Format
- The type of item to use is an issue to consider in the design
- Essay and other open-ended formats introduce some scoring subjectivity
Number of Scores
- An issue to consider when designing the test
- More scores give the test wider application
- Improves marketability
Administrator Training
- How much training administrators will need should be considered when designing the exam
Background Research
- A literature search on the construct is an issue to address
- Include discussions with practitioners if the test will be used by clinicians
Item Preparation
- Stimulus/item stem (question, apparatus, etc.)
- Response format (M-C, T-F)
- Scoring procedures (correct/incorrect, partial credit, etc.)
- Conditions governing response (time limit, probing of responses, etc.)
Selected Response Items (Fixed Response)
- T/F, M/C, Likert, etc.
- Objectively scored
- Assigning of points
- Keep content accurate, keep wording simple, and avoid making the correct answer too obvious
Constructed Response Items (Free Response)
Fill-in-the-blank, short-answer, essay, and similar free-response test items
Inter-Rater Reliability
Scoring requires judgment, so a high degree of agreement among raters is crucial to ensure responses are evaluated in the same, or a similar, way
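One simple way to quantify inter-rater agreement is the percentage of responses two raters score identically. A minimal sketch with made-up ratings (percent agreement is only one of several possible indices):

```python
# Hypothetical scores two raters assigned to the same ten essay responses.
rater_a = [4, 3, 5, 2, 4, 3, 5, 1, 4, 2]
rater_b = [4, 3, 4, 2, 4, 3, 5, 2, 4, 2]

# Percent agreement: proportion of responses both raters scored identically.
agreement = sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)
print(f"Percent agreement: {agreement:.0%}")  # 80%
```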
Holistic
- Scoring scheme in which the rater makes a single overall judgment about quality
- Based on the overall impression the paper gives
Analytic
- Scoring scheme in which the response is rated on several separate dimensions
- For example grammar, organization, and vocabulary, each with its own criteria
Point System
Scoring scheme in which specific points must be included in the answer to earn full credit
Automated Scoring
- Scoring by computer programs that simulate human judgment
- A development of the “And Now” (current) period of testing history
- Also used for scoring essays
Suggestions for Writing Constructed Response Items
- Clear directions
- Avoid optional items (e.g., choosing to answer 3 out of 5 essays)
- Be specific about scoring procedure when preparing questions
- Score anonymously
- Use sufficient number of items to maximize reliability and validity
Three Selected-Response Advantages
- Scoring reliability
- Scoring efficiency
- Temporal efficiency
Two Constructed Response Advantages
- Easier observation of behavior and thought processes
- Can explore unusual areas and material that multiple-choice items cannot address
Item Analysis
Involves the statistical analysis of data obtained from an item tryout
Three Phases of Item Analysis
- Item tryout
- Statistical analysis
- Item selection
Informal Item Tryout
5-10 people similar to those for whom the test is intended comment on the items and directions
Formal Item Tryout
Administration of test items to samples of examinees who are representative of the target population for the test
Independent Study
- Conducting a study exclusively for the purpose of item analysis
- Most common practice
- Formal practice for item tryout, subjects often paid
Attachment
- Including tryout items in the regular administration of an existing test
- The SAT and GRE, for example
Item Difficulty (p)
- Proportion of examinees answering the item correctly
- A p-value of .95 means the item is very easy (95% got it right)
- A p-value of .15 means the item is very difficult
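A minimal sketch of the p-value computation, using made-up 0/1 item responses:

```python
# Hypothetical item responses: 1 = correct, 0 = incorrect, one entry per examinee.
responses = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]

# Item difficulty p = proportion of examinees answering the item correctly.
p = sum(responses) / len(responses)
print(p)  # 0.8 -> a fairly easy item
```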
Item Discrimination (D)
- An item’s ability to differentiate statistically in a desired way between groups of examinees
- D = simple difference in percent correct between the high-scoring and low-scoring groups
- Maximum possible differentiation occurs when about 50% of examinees answer the item correctly
Truman Kelley
- Showed that, statistically, the best way to compute D is to compare the top 27% and bottom 27% of examinees
- Has become the “industry standard” for splits
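A sketch of the upper/lower 27% approach with made-up data (exact group-forming details vary across texts):

```python
# Hypothetical data: each row is (total test score, 0/1 response to the item of interest).
examinees = [(95, 1), (90, 1), (88, 1), (85, 1), (80, 1), (74, 1), (70, 0),
             (65, 1), (60, 0), (55, 1), (50, 0), (45, 0), (40, 0), (35, 0)]

# Sort by total score and take the top and bottom 27% (Kelley's split).
examinees.sort(key=lambda e: e[0], reverse=True)
n_group = max(1, round(0.27 * len(examinees)))
high, low = examinees[:n_group], examinees[-n_group:]

# D = proportion correct in the high group minus proportion correct in the low group.
p_high = sum(item for _, item in high) / n_group
p_low = sum(item for _, item in low) / n_group
print(p_high - p_low)  # 1.0 here, the maximum possible D (all high correct, all low incorrect)
```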
Factor Analysis
- Used to select items that will yield relatively independent/meaningful scores
- Applications commonly include attitude scales and personality/interest evaluations
- Basic approach: inter-correlations among the items are factor analyzed and underlying dimensions (factors) are identified
High Loading Item
- A loading is the correlation between an item and a factor
- Loadings of .30 or higher in the factor analysis count as high
- High-loading items are the ones selected for inclusion in the final test
- An item with high cross-loadings (loading on more than one factor) is NOT a good item
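A rough sketch of loading-based item selection. It assumes scikit-learn's FactorAnalysis; the simulated data and the .30-per-factor rule are illustrative, not prescribed by the chapter:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

# Hypothetical data: 200 examinees x 10 items, simulated with a rough two-factor structure.
rng = np.random.default_rng(0)
factors = rng.normal(size=(200, 2))
true_loadings = np.zeros((10, 2))
true_loadings[:5, 0] = 0.8   # items 0-4 driven mainly by factor 1
true_loadings[5:, 1] = 0.8   # items 5-9 driven mainly by factor 2
responses = factors @ true_loadings.T + rng.normal(scale=0.6, size=(200, 10))

# Fit a two-factor model; components_.T holds each item's loading on each factor.
fa = FactorAnalysis(n_components=2).fit(responses)
loadings = np.abs(fa.components_.T)

# Keep items loading .30 or higher on exactly one factor; high cross-loadings mark poor items.
keep = [i for i, row in enumerate(loadings) if row.max() >= 0.30 and (row >= 0.30).sum() == 1]
print("Items retained:", keep)
```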
Five Guidelines for Item Selection
- Number of items
- Content considerations
- High discrimination indices
- Relationship between p-value and D
- Average difficulty level
Increased Number of Items
Increased reliability
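The link between test length and reliability is commonly quantified with the Spearman-Brown prophecy formula, which is not named in these notes but fits here as background; the numbers below are hypothetical:

```python
def spearman_brown(reliability: float, length_factor: float) -> float:
    """Predicted reliability of a test lengthened by the given factor."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)

# Doubling a test whose current reliability is .70 (hypothetical numbers):
print(round(spearman_brown(0.70, 2.0), 2))  # 0.82
```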
Starting with Easier Test Items
Increases motivation
High Discrimination Indices
0.3 to 0.5
Maximum possible D-value
- Occurs when p-value is at its midpoint
- Maximum D (1.0) when p = 0.5
Mean Score
The expected mean raw score equals the sum of the item p-values
To get an easy test…
Use items with high p-values (closer to 1)
To get a difficult test…
Use items with low p-values (closer to 0)
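A quick arithmetic check of the last three cards, using made-up p-values:

```python
# Expected mean raw score on a test of 0/1 items = sum of the item p-values.
easy_items = [0.90, 0.85, 0.95, 0.80, 0.90]   # high p-values -> easy test
hard_items = [0.20, 0.15, 0.25, 0.30, 0.10]   # low p-values  -> difficult test

print(round(sum(easy_items), 2))  # 4.4 of a possible 5 -> most examinees score high
print(round(sum(hard_items), 2))  # 1.0 of a possible 5 -> most examinees score low
```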
Discrimination Index
Difference in percent correct between high-scoring and low-scoring groups
Standardization
- Used to develop the norms for the test
- Should be the exact version that is published (changing items throws off established norms)
- Representativeness is key
Equating Programs
- Might be conducted at the same time as the standardization program
- Alternate forms
- Revised editions
- Different levels (such as K-12)
Final Test Forms
- Test booklets
- Technical manuals (psychometrics, how norms obtained, etc.)
- Administration and scoring manuals (how to score, etc.)
- Score reports and services
- Supplementary materials
Continuing Research on Published Tests
- Updating norms
- Applicability of the test to various other populations
Test Fairness
A test measures a trait with equivalent validity in different groups
Test Bias
- A test does not measure the trait in the same way across different groups
- A simple difference in average performance between groups does not by itself constitute bias
- Bias exists only when the difference in averages does NOT correspond to a real difference in the underlying trait
- Group averages should differ if the groups really do differ in the trait we are trying to measure