Ch 6 - Item Statistics Flashcards
Test items
units that make up a test and the means through which samples of test takers' behaviour are gathered
Item analysis
general term that refers to all the techniques used to assess the characteristics of test items and evaluate their quality during the process of test development and test construction
Qualitative item analysis
relies on the judgements of reviewers concerning the substantive/stylistic characteristics of items, as well as their accuracy and fairness
Reviewers typically evaluate:
○ Appropriateness of item content and format to the purpose of the test and the population for which it's designed
○ Clarity of expression
○ Grammatical correctness
○ Adherence to some basic rules for writing items that have evolved over time
Quantitative item analysis
variety of statistical procedures designed to ascertain the psychometric characteristics of items based on the responses obtained from the samples used in the process of test development
Bias in the context of psychometrics
measurement bias: systematic error that enters into scores and affects their meaning in relation to what the scores are designed to measure/predict
Steps of test development
- Generating item pool - creating test items, and their administration/scoring procedures
- Submit the item pool to qualitative analysis by experts
- Revise/replace items that are problematic
- Try out the items on samples that are representative of the intended population
- Evaluate the results through quantitative item analysis
- Add/modify/delete items as needed
- Conduct additional trial administrations to check whether item statistics remain stable across samples (AKA cross-validation)
- Determine the length of the test and the sequencing of items, and the scoring/administration procedures
- Administer the test to a new sample - representative of the population - in order to develop normative data
- Publish the test, along with administration/scoring manual and intended uses, development procedures, standardization data, reliability/validity studies, and materials needed for test administration, scoring and interpretation
**Steps apply mostly to paper-and-pencil tests; for computerized adaptive testing (CAT) the procedures are different - they rely more on item banking
Tests also need to go through this process again when they are revised - due to the changing norms/criteria/Flynn effect mentioned in ch 3
Selected-Response Items
AKA Objective or fixed-response items
Closed-ended in nature - a limited number of alternatives from which the respondent can choose
In ability tests:
• Multiple choice, true-false, ranking, matching
• Usually scored as pass-fail
In personality tests:
• Dichotomous (true-false, yes-no, like-dislike, etc.)
• Polytomous (more than 2 options)
Scaled in terms of degree of acceptance, intensity of agreement, frequency, etc
Forced-Choice items
Respondent needs to choose which option represents them the most or the least
Each of the options represents a construct
Ipsative scores in the context of forced choice
Resulting scores are ipsative in nature: essentially ordinal numbers that reflect test takers' rankings of the constructs assessed by the scales within a forced-choice format test (see the sketch below)
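A minimal sketch of how ipsative scores arise from forced-choice blocks, using hypothetical construct names and made-up ranks (none of these values come from the text):

```python
# Each forced-choice block asks the respondent to rank options, and
# each option is tied to a construct. Accumulating the ranks per
# construct yields ipsative scores: every respondent's total is the
# same constant, so scores only show relative standing.
blocks = [  # hypothetical ranks: 3 = most like me, 1 = least like me
    {"dominance": 3, "sociability": 1, "conscientiousness": 2},
    {"dominance": 2, "sociability": 3, "conscientiousness": 1},
]

totals = {}
for block in blocks:
    for construct, rank in block.items():
        totals[construct] = totals.get(construct, 0) + rank

print(totals)  # {'dominance': 5, 'sociability': 4, 'conscientiousness': 3}
# The grand total (12 here) is identical for every respondent.
```

Because every respondent's scores sum to the same constant, the numbers only rank constructs within a person rather than comparing people on an absolute scale.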
Advantages of Selected-Response Items
• Ease and objectivity of scoring - enhances reliability, saves time
• Make efficient use of testing time
• Can be administered individually, but also collectively
Can be easily transformed into numerical scales - facilitates quantitative analysis
Disadvantages of Selected-Response Items
• Issue of guessing (the chance of a correct guess can be as high as 50% with dichotomous items)
• Similarly, wrong answers can happen due to inattention, haste, etc
• Items can be misleading
• Can be more easily manipulated because of demand characteristics
○ Many personality inventories use validity scales to account for that
• Preparing selected-response items is difficult and requires great skill
○ Carelessly constructed items can include:
§ Options not grammatically consistent with the question
§ Options susceptible to more than one interpretation
§ Options so implausible that they can be easily dismissed
• Selected-response items are less flexible
Constructed-Response Items
AKA free-response items
Variety is limitless - constructed responses may involve writing samples, free oral responses, performances of any kind, and products of all sorts
In ability tests
• Essay questions
• Fill-in-the-blanks
• Thorough instructions and procedural rules are indispensable for standard administration of free-response items
○ Time limits
○ Medium, manner or length of the required response
○ Whether access to materials/instruments is permitted
In personality tests
• Interviews
• Biographical data
• Behavioural observations
• Projective techniques (AKA performance-based measures of personality)
○ Responses to ambiguous stimuli
Respondents can respond freely, revealing aspects of their personality
Advantages of Constructed-Response Items
• Provide richer samples of the behaviour of examinees
• Offer a wider range of possibilities/creative approaches to test/assess
Elicit authentic samples of behaviour
Disadvantages of Constructed-Response Items
• Scoring is more time-consuming and complex because of the subjectivity involved
○ Even with scoring rubrics
• Checking for inter-rater reliability is essential
• Scorers need constant monitoring and thorough training
• Projective responses are even more susceptible to subjective scoring errors
• Because answering takes longer, fewer items can be completed in the same amount of time than with selected-response items
○ Shorter tests are more prone to content sampling errors and produce less consistent scores
○ Lower reliability
Response length can vary - therefore the number of scorable elements also varies
Meaning of “discrimination” in psychometrics
Considered a desirable feature of test items. It refers to the extent to which items elicit responses that accurately differentiate test takers along the dimensions that tests are designed to evaluate
Item validity
most important aspect of quantitative item analysis
• Whether a specific item carries its own weight within a test by eliciting information that advances the purpose of the test
Item discrimination
way to refer to item validity statistics
• Refers to the extent to which an item accurately differentiates among test takers with regard to the trait/behaviour the test is supposed to measure
For ability tests, item analysis for validity includes item validity, discrimination, AND? (2)
Item difficulty
Item fairness
How is Item Difficulty Gauged? (CTT)
At the outset, test specifications prepared by experts in the field can be used as difficulty criteria
Once it’s administered to a group: quantitative indexes can be obtained (normative perspective)
• Using the % of test takers who answer an item correctly (AKA proportion/percentage passing, “p”)
• The higher p, the easier the item is
• p is an ordinal number (like percentile ranks), so it's often converted to a z score
• Once items have z scores, their difficulty can be compared across various groups by administering anchor items (a common set of items) to 2+ groups (see the sketch after this list)
• Formulas to estimate the difficulty of additional items across the groups in question can be derived based on the established relationships among the anchor items - AKA absolute scaling
○ Allows for the difficulty of items to be placed on a uniform numerical scale
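A minimal sketch of the p statistic and one common p-to-z conversion, assuming a small made-up 0/1 response matrix (the data, and the clamping of extreme p values, are illustrative choices):

```python
from statistics import NormalDist

# Rows = test takers, columns = items; 1 = pass, 0 = fail (made-up data)
responses = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 1, 0, 1],
    [1, 1, 0, 0],
]
n_takers = len(responses)

for item in range(len(responses[0])):
    p = sum(row[item] for row in responses) / n_takers  # proportion passing
    # One convention: difficulty as the normal deviate above which a
    # proportion p of the group falls, so harder items get higher z
    p_safe = min(max(p, 0.01), 0.99)  # inv_cdf is undefined at 0 and 1
    z = NormalDist().inv_cdf(1 - p_safe)
    print(f"Item {item + 1}: p = {p:.2f}, z = {z:+.2f}")
```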
Explain this: “For any given group/test, the average score on a test is the same as the average difficulty of its items”
- Ex: in a classroom test designed to evaluate how much of the content students grasped, there will be items that everyone gets (p = 1), others that the average student gets (p = 0.7), and very few, if any, that no student gets (p = 0), so that the average grade will be around 0.7-0.8
- In a test designed to identify the top 10% of students, we expect most items to have a p value of around 0.1, so that the average score will be around 0.1 (a quick numerical check of this identity follows)
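A quick numerical check of the identity with made-up 0/1 data: the mean proportion-correct score across test takers equals the mean of the item p values:

```python
responses = [  # rows = test takers, columns = items (made-up data)
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [0, 1, 1],
]
n, k = len(responses), len(responses[0])

# Average score, expressed as a proportion of items passed
mean_score = sum(sum(row) / k for row in responses) / n
# Average item difficulty (mean of the item p values)
mean_p = sum(sum(row[j] for row in responses) / n for j in range(k)) / k

print(mean_score, mean_p)  # both 0.666... - the two averages coincide
```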
Distractors
the incorrect alternatives in multiple choice items
Can have great influence on item difficulty:
• The number of distractors influences the probability of guessing right/wrong
The plausibility of the distractors to test takers who don't know the right answer significantly influences the difficulty of the item
Analyses of distractors need to be conducted:
• Proportion of time respondents choose each distractor
• To detect possible flaws and eventually replace the ones that don’t work correctly
If a distractor is never chosen, or is chosen more often than the right answer, it's not working (a counting sketch follows)
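A minimal counting sketch of a distractor analysis, assuming made-up responses to a single four-option item whose key is "B":

```python
from collections import Counter

choices = ["B", "C", "B", "A", "B", "D", "B", "C", "B", "A",
           "B", "B", "C", "B", "A", "B", "D", "B", "B", "C"]
key = "B"

counts = Counter(choices)
for option in "ABCD":
    share = counts[option] / len(choices)
    print(f"{option}: {share:.0%}" + (" (key)" if option == key else ""))

# Red flags: a distractor chosen by ~0% of respondents (too implausible
# to attract anyone) or chosen more often than the key (misleading, or
# possibly a miskeyed item).
```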
Is Item Difficulty a Relevant Concept in Personality Testing?
- Selected response: The ability of the test takers to understand the items (reading and vocabulary abilities) must be taken into consideration so that their answers are more truthful
- Projective tasks: require a certain proficiency in the mode of answering (talking/writing)
Item Validity
Refers to the extent to which items elicit responses that accurately differentiate test takers in terms of the behaviours, knowledge, or other characteristics that a test is designed to evaluate
• Discriminating power: the most basic quality of test items
• Validity indexes/indexes of item discrimination - obtained using some criterion of the test takers' standing on the construct that the test assesses. The criterion can be:
○ Internal criteria (ex: total score on the test) - increase the homogeneity of the test (increase reliability due to interitem consistency)
§ Often used for tests evaluating a single construct/trait
§ Based on the assumption that all test items should correlate highly with the construct of interest, and with each other
○ External criteria (ex: age, education, diagnosis, etc.) - increase score validity
§ Often used for tests evaluating many different aspects/constructs
§ The correlation between the items and test scores is not expected to be high
○ A combination of both
Index of discrimination statistic (D) (CTT)
○ Used for hand calculations when a computer is not accessible
○ For validity of items
○ Mainly applied to pass/fail items in ability tests, but other types of binary scoring are also possible
○ Test takers must be divided into criterion groups based on scores or an external criterion
§ Usually the top and bottom thirds are taken as the groups to be compared
§ D is the difference between the % of test takers in the upper and the lower criterion groups who pass a given item (see the sketch after this list)
○ Can range from +100 to -100 (or, as proportions, from +1 to -1)
§ Positive D indicates that more individuals in the upper criterion group than in the lower passed the item (the most desirable values of D are those closest to +1)
§ Negative D indicates that the item discriminates in the opposite direction and needs to be fixed/discarded
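A minimal sketch of computing D, assuming made-up (total score, item result) pairs and using the top and bottom thirds as the criterion groups:

```python
data = [  # (total_score, item_passed) - made-up values
    (95, 1), (90, 1), (88, 0), (85, 1), (80, 0), (78, 1),
    (75, 0), (70, 1), (65, 0), (60, 0), (55, 0), (50, 0),
]
data.sort(key=lambda pair: pair[0], reverse=True)

third = len(data) // 3
upper, lower = data[:third], data[-third:]

p_upper = sum(passed for _, passed in upper) / len(upper)
p_lower = sum(passed for _, passed in lower) / len(lower)

D = p_upper - p_lower  # -1 to +1 (or -100 to +100 as percentages)
print(f"D = {D:+.2f}")  # here +0.75: the item discriminates well
```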
2 other correlational indexes (other than D) to measure item validity
○ Most widely used classical test theory methods for expressing item validity
○ The type of coefficient chosen depends on the nature of the 2 variables that are to be correlated (AKA the item scores and the criterion measures)
§ When item scores are dichotomous and the criterion measure is continuous - the point-biserial correlation (rpb) is best
§ When item and criterion measures are both dichotomous - phi coefficient is best
Both of these can range from -1 to +1 and are interpreted the same way as a Pearson r (both are computed in the sketch below)
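A sketch of both coefficients with made-up data: the point-biserial computed as a Pearson r between a 0/1 item and a continuous criterion, and phi computed from the cell counts of a 2x2 table:

```python
import math

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Point-biserial: Pearson r between a dichotomous item (0/1) and a
# continuous criterion (here, made-up total scores)
item = [1, 1, 0, 1, 0, 0, 1, 0]
totals = [38, 35, 20, 30, 18, 22, 33, 25]
print(f"r_pb = {pearson(item, totals):+.2f}")

# Phi: both variables dichotomous; a, b, c, d are the four cell
# counts of the 2x2 table (hypothetical values)
a, b, c, d = 12, 3, 4, 11
phi = (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
print(f"phi = {phi:+.2f}")
```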
3 types of tests regarding speed
Speed here refers to speed of performance
Tests can be classified in 3 types
• Pure speed tests
○ Simply measure the speed with which test takers can perform a task
○ Difficulty is manipulated mainly through timing
○ Score is often the number of items completed in the allotted time
• Pure power tests
○ Have no time limits
○ Difficulty is manipulated by increasing or decreasing the complexity of items
○ Items are in ascending order of difficulty
○ Only the best respondents can answer all items
• Tests that blend speed and power
In any test that’s closely timed, the p value is a function of the position of items within the test rather than of their intrinsic difficulty/validity
Item-test regression
To construct one, calculate the proportion of individuals at each total-score level who passed a given item
Item-test regression graphs combine info on both item difficulty and item discrimination - they allow one to visualize how each item functions within the group that was tested (see the sketch below)
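A minimal sketch of the computation behind an item-test regression, with made-up (total score, item result) pairs; plotting proportion passing against total score gives the regression graph:

```python
from collections import defaultdict

records = [  # (total_score, item_passed) - made-up values
    (3, 0), (3, 0), (4, 0), (4, 1), (5, 0), (5, 1),
    (6, 1), (6, 1), (7, 1), (7, 1), (8, 1), (8, 1),
]

by_score = defaultdict(list)
for total, passed in records:
    by_score[total].append(passed)

for total in sorted(by_score):
    prop = sum(by_score[total]) / len(by_score[total])
    print(f"total score {total}: {prop:.0%} passed")

# A curve that rises with total score indicates the item discriminates:
# higher-scoring test takers pass it more often.
```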
Item Response Theory
Variety of models that can be used to design/develop new tests and to evaluate existing ones
• IRT differs from classical test theory in:
○ The mathematical formulas the models employ
○ The number of item characteristics they account for
○ The number of trait/ability dimensions they specify as the objective of measurement
○ The methods used, which differ depending on whether items are dichotomous or polytomous (a 2PL sketch follows)
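The chapter doesn't commit to a particular model, but a minimal sketch of one widely used option, the two-parameter logistic (2PL) for dichotomous items, shows the kind of formula involved (parameter values are illustrative):

```python
import math

def p_correct(theta, a, b):
    """2PL: probability of a correct response given ability theta,
    item discrimination a, and item difficulty b."""
    return 1 / (1 + math.exp(-a * (theta - b)))

# Two hypothetical items of equal difficulty but different discrimination
for theta in [-2, -1, 0, 1, 2]:
    steep = p_correct(theta, a=2.0, b=0.0)  # highly discriminating
    flat = p_correct(theta, a=0.5, b=0.0)   # weakly discriminating
    print(f"theta {theta:+}: steep item {steep:.2f}, flat item {flat:.2f}")
```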