Chapter 6 Flashcards
Major Steps in Test Development
- Define purpose
- Preliminary design issues
- Item preparation
- Item analysis
- Standardization/research
- Final materials and publication
Statement of Purpose
- Often just a single sentence
- Focuses on the trait being measured and how scores will be interpreted
Mode of Administration
- Group versus individual administration is an issue to consider
- Individual administration lets the examiner observe processes while the person is taking the test
Length of Exam
- A short test limits reliability
- A longer test is generally more reliable
Item Format
- The type of item to use is an issue to consider in the design
- Essay and other open-ended formats introduce some scoring subjectivity
Number of Scores
- An issue to consider when designing the test
- More scores give the test wider application
- Improves marketability
Administrator Training
- How much training administrators will need should be considered when designing the exam
Background Research
- A literature search on the construct is an issue to address
- Include discussions with practitioners if the test will be used by clinicians
Item Preparation
- Stimulus/item stem (question, apparatus, etc.)
- Response format (M-C, T-F)
- Scoring procedures (correct/incorrect, partial credit, etc.)
- Conditions governing response (time limit, probing of responses, etc.)
Selected Response Items (Fixed Response)
- T/F, M/C, Likert, etc.
- Objectively scored
- Assigning of points
- Keep content accurate, keep wording simple, and avoid making the correct answer too obvious
Constructed Response Items (Free Response)
Fill-in-the-blank, short-answer, essay, and similar free-response test items
Inter-Rater Reliability
Scoring requires judgment, so a high degree of agreement among raters is crucial to ensure responses are evaluated in the same, or a similar, way
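One simple way to quantify inter-rater agreement is the percentage of responses two raters score identically. A minimal sketch with made-up ratings (percent agreement is only one of several possible indices):

```python
# Hypothetical scores two raters assigned to the same ten essay responses.
rater_a = [4, 3, 5, 2, 4, 3, 5, 1, 4, 2]
rater_b = [4, 3, 4, 2, 4, 3, 5, 2, 4, 2]

# Percent agreement: proportion of responses both raters scored identically.
agreement = sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)
print(f"Percent agreement: {agreement:.0%}")  # 80%
```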
Holistic
- Scoring scheme in which the rater makes a single overall judgment about quality
- Based on the overall impression the paper gives
Analytic
- Scoring scheme in which the response is rated on several separate dimensions
- For example grammar, organization, and vocabulary, each with its own criteria
Point System
Scoring scheme in which specific points must be included in the answer to earn full credit
Automated Scoring
- Scoring by computer programs that simulate human judgment
- A development of the “And Now” (current) period of testing history
- Also used for scoring essays
Suggestions for Writing Constructed Response Items
- Clear directions
- Avoid optional items (e.g., choosing to answer 3 out of 5 essays)
- Be specific about scoring procedure when preparing questions
- Score anonymously
- Use sufficient number of items to maximize reliability and validity
Three Selected-Response Advantages
- Scoring reliability
- Scoring efficiency
- Temporal efficiency
Two Constructed Response Advantages
- Easier observation of behavior and thought processes
- Can explore unusual areas and material that multiple-choice items cannot address
Item Analysis
Involves the statistical analysis of data obtained from an item tryout
Three Phases of Item Analysis
- Item tryout
- Statistical analysis
- Item selection
Informal Item Tryout
5-10 people similar to those for whom the test is intended comment on the items and directions
Formal Item Tryout
Administration of test items to samples of examinees who are representative of the target population for the test
Independent Study
- Conducting a study exclusively for the purpose of item analysis
- Most common practice
- Formal practice for item tryout, subjects often paid
Attachment
- Including tryout items in the regular administration of an existing test
- The SAT and GRE, for example
Item Difficulty (p)
- Proportion of examinees answering the item correctly
- A p-value of .95 means the item is very easy (95% got it right)
- A p-value of .15 means the item is very difficult
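A minimal sketch of the p-value computation, using made-up 0/1 item responses:

```python
# Hypothetical item responses: 1 = correct, 0 = incorrect, one entry per examinee.
responses = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]

# Item difficulty p = proportion of examinees answering the item correctly.
p = sum(responses) / len(responses)
print(p)  # 0.8 -> a fairly easy item
```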
Item Discrimination (D)
- An item’s ability to differentiate statistically in a desired way between groups of examinees
- D = simple difference in percent correct between the high-scoring and low-scoring groups
- Maximum possible differentiation occurs when about 50% of examinees answer the item correctly
Truman Kelley
- Showed that, statistically, the best way to compute D is to compare the top 27% and bottom 27% of examinees
- Has become the “industry standard” for splits
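A sketch of the upper/lower 27% approach with made-up data (exact group-forming details vary across texts):

```python
# Hypothetical data: each row is (total test score, 0/1 response to the item of interest).
examinees = [(95, 1), (90, 1), (88, 1), (85, 1), (80, 1), (74, 1), (70, 0),
             (65, 1), (60, 0), (55, 1), (50, 0), (45, 0), (40, 0), (35, 0)]

# Sort by total score and take the top and bottom 27% (Kelley's split).
examinees.sort(key=lambda e: e[0], reverse=True)
n_group = max(1, round(0.27 * len(examinees)))
high, low = examinees[:n_group], examinees[-n_group:]

# D = proportion correct in the high group minus proportion correct in the low group.
p_high = sum(item for _, item in high) / n_group
p_low = sum(item for _, item in low) / n_group
print(p_high - p_low)  # 1.0 here, the maximum possible D (all high correct, all low incorrect)
```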
Factor Analysis
- Used to select items that will yield relatively independent/meaningful scores
- Applications commonly include attitude scales and personality/interest evaluations
- Basic approach: inter-correlations among the items are factor analyzed and underlying dimensions (factors) are identified
High Loading Item
- A loading is the correlation between an item and a factor
- Loadings of .30 or higher in the factor analysis count as high
- High-loading items are the ones selected for inclusion in the final test
- An item with high cross-loadings (loading on more than one factor) is NOT a good item
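A rough sketch of loading-based item selection. It assumes scikit-learn's FactorAnalysis; the simulated data and the .30-per-factor rule are illustrative, not prescribed by the chapter:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

# Hypothetical data: 200 examinees x 10 items, simulated with a rough two-factor structure.
rng = np.random.default_rng(0)
factors = rng.normal(size=(200, 2))
true_loadings = np.zeros((10, 2))
true_loadings[:5, 0] = 0.8   # items 0-4 driven mainly by factor 1
true_loadings[5:, 1] = 0.8   # items 5-9 driven mainly by factor 2
responses = factors @ true_loadings.T + rng.normal(scale=0.6, size=(200, 10))

# Fit a two-factor model; components_.T holds each item's loading on each factor.
fa = FactorAnalysis(n_components=2).fit(responses)
loadings = np.abs(fa.components_.T)

# Keep items loading .30 or higher on exactly one factor; high cross-loadings mark poor items.
keep = [i for i, row in enumerate(loadings) if row.max() >= 0.30 and (row >= 0.30).sum() == 1]
print("Items retained:", keep)
```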
Five Guidelines for Item Selection
- Number of items
- Content considerations
- High discrimination indices
- Relationship between p-value and D
- Average difficulty level
Increased Number of Items
Increased reliability
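The link between test length and reliability is commonly quantified with the Spearman-Brown prophecy formula, which is not named in these notes but fits here as background; the numbers below are hypothetical:

```python
def spearman_brown(reliability: float, length_factor: float) -> float:
    """Predicted reliability of a test lengthened by the given factor."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)

# Doubling a test whose current reliability is .70 (hypothetical numbers):
print(round(spearman_brown(0.70, 2.0), 2))  # 0.82
```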
Starting with Easier Test Items
Increases motivation
High Discrimination Indices
0.3 to 0.5
Maximum possible D-value
- Occurs when p-value is at its midpoint
- Maximum D (1.0) when p = 0.5
Mean Score
The expected mean raw score equals the sum of the item p-values
To get an easy test…
Use items with high p-values (closer to 1)
To get a difficult test…
Use items with low p-values (closer to 0)
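A quick arithmetic check of the last three cards, using made-up p-values:

```python
# Expected mean raw score on a test of 0/1 items = sum of the item p-values.
easy_items = [0.90, 0.85, 0.95, 0.80, 0.90]   # high p-values -> easy test
hard_items = [0.20, 0.15, 0.25, 0.30, 0.10]   # low p-values  -> difficult test

print(round(sum(easy_items), 2))  # 4.4 of a possible 5 -> most examinees score high
print(round(sum(hard_items), 2))  # 1.0 of a possible 5 -> most examinees score low
```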
Discrimination Index
Difference in percent correct between high-scoring and low-scoring groups
Standardization
- Used to develop the norms for the test
- Should be the exact version that is published (changing items throws off established norms)
- Representativeness is key
Equating Programs
- Might be conducted at the same time as the standardization program
- Alternate forms
- Revised editions
- Different levels (such as K-12)
Final Test Forms
- Test booklets
- Technical manuals (psychometrics, how norms obtained, etc.)
- Administration and scoring manuals (how to score, etc.)
- Score reports and services
- Supplementary materials
Continuing Research on Published Tests
- Updating norms
- Applicability of the test to various other populations
Test Fairness
A test measures a trait with equivalent validity in different groups
Test Bias
- A test does not measure the trait in the same way across different groups
- A simple difference in average performance between groups does not by itself constitute bias
- Bias exists only when the difference in averages does NOT correspond to a real difference in the underlying trait
- Group averages should differ if the groups really do differ in the trait we are trying to measure