test construction Flashcards
interpreting item analysis
e.g. the interpretation might be that the item was too difficult, confusing, or invalid, in which case the teacher can replace or modify it, perhaps using information from the item’s discrimination index or from an analysis of response options.
The fairest tests for all students are tests which are valid and reliable. To improve the quality of tests, item analysis can identify items which are too difficult (or too easy if a teacher has that concern), are not able to differentiate between those who have learned the content and those who have not, or have distractors which are not plausible.
types of reliability
Test-retest
Inter-rater
Internal consistency
discrimination index
Measures an item’s ability to discriminate between those who scored high and those who scored low on the total test. For an item to be valid, it must be answered correctly because of a person’s ability or knowledge rather than chance, bias, etc. To calculate: split the total scores into a top half and a bottom half, create a difficulty index for each group, and subtract the low group’s value from the top group’s. The result will be between -1 and +1, as in the sketch below.
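A minimal sketch of this calculation, assuming `scores` is a students-by-items array of 0/1 marks (1 = correct); the function name and data layout are illustrative, not from the source.

```python
import numpy as np

def discrimination_index(scores):
    """Per-item discrimination: difficulty in the high group minus the low group."""
    totals = scores.sum(axis=1)                  # each student's total score
    order = np.argsort(totals)
    half = len(totals) // 2
    low, high = scores[order[:half]], scores[order[-half:]]
    # difficulty (proportion correct) within each group; difference is -1 to +1
    return high.mean(axis=0) - low.mean(axis=0)
```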
analysis of response options
A comparison of the proportion of students choosing each response option. This fine-tunes the ‘distractor’ options in an MCQ and shows which one students are ‘falling for’ when they don’t know the answer. The better the distractors, the higher the validity, as only the students who know the answer will choose the correct option. To calculate: for each answer option, divide the number of students who chose that option by the number of students taking the test.
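A small illustrative sketch for a single MCQ item; the response data here are assumed, purely to show the calculation.

```python
from collections import Counter

responses = ["A", "C", "A", "B", "A", "D", "A", "C"]   # assumed example data
counts = Counter(responses)
# proportion of students choosing each option
proportions = {opt: n / len(responses) for opt, n in counts.items()}
print(proportions)  # {'A': 0.5, 'C': 0.25, 'B': 0.125, 'D': 0.125}
```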
Graded attempts – the number of question attempts where grading is complete. Higher numbers of graded attempts produce more reliable statistics.
difficulty index
Item analysis produces a difficulty index (the proportion of people who got the item right, so really an ‘easiness’ index), which will be between 0 and 1.
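A short sketch with hypothetical marks; each column mean is that item’s difficulty (easiness) index.

```python
import numpy as np

scores = np.array([[1, 0, 1],
                   [1, 1, 0],
                   [0, 0, 1],
                   [1, 1, 1]])        # assumed 0/1 marks, students x items
difficulty = scores.mean(axis=0)      # between 0 and 1; higher = easier
print(difficulty)                     # [0.75 0.5  0.75]
```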
SE (standard error of measurement)
The amount of variability in a student’s score that is due to chance.
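A minimal sketch using the usual formula SEM = SD * sqrt(1 - reliability); the SD and reliability values are assumed for illustration.

```python
import math

sd_total = 8.0        # standard deviation of observed test scores (assumed)
reliability = 0.84    # e.g. Cronbach's alpha (assumed)
sem = sd_total * math.sqrt(1 - reliability)
print(round(sem, 2))  # 3.2
```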
internal consistency and types
Determines whether test items designed to measure an underlying construct/trait actually do so – internal consistency is high when items testing the same thing yield similar scores. It normally involves determining correlations between items and how well they predict each other.
The split-halves test for internal consistency reliability is the easiest type and involves dividing a test into two halves. The results from both halves are statistically analysed, and if there is weak correlation between the two, then there is a reliability problem with the test. The division of the questions into two sets must be random. Split-halves testing was a popular way to measure reliability because of its simplicity and speed.
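A brief sketch of a split-half estimate, assuming `scores` is a students-by-items array; it adds the standard Spearman-Brown correction to estimate full-length reliability (the seed and splitting rule are illustrative).

```python
import numpy as np

def split_half_reliability(scores, seed=0):
    rng = np.random.default_rng(seed)
    items = rng.permutation(scores.shape[1])      # random division of items
    half_a = scores[:, items[::2]].sum(axis=1)    # score on one half
    half_b = scores[:, items[1::2]].sum(axis=1)   # score on the other half
    r = np.corrcoef(half_a, half_b)[0, 1]         # correlation between halves
    return 2 * r / (1 + r)                        # Spearman-Brown correction
```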
Cronbach’s alpha is a reliability coefficient which is computed and should be higher than 0.7. The Cronbach’s alpha test not only averages the correlation between every possible combination of split halves, but also allows multi-level responses. The test also takes into account both the size of the sample and the number of potential responses. A 40-question test with possible ratings of 1-5 is seen as having more accuracy than a ten-question test with three possible levels of response.
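A minimal sketch of the standard alpha formula, alpha = k/(k-1) * (1 - sum of item variances / variance of total scores), assuming `scores` is a students-by-items array of numeric responses.

```python
import numpy as np

def cronbach_alpha(scores):
    k = scores.shape[1]
    item_var = scores.var(axis=0, ddof=1).sum()    # sum of item variances
    total_var = scores.sum(axis=1).var(ddof=1)     # variance of total scores
    return (k / (k - 1)) * (1 - item_var / total_var)
```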
problems with Cronbach’s alpha
The first problem is that alpha is dependent not only on the magnitude of the correlations among items, but also on the number of items in the scale. A scale can be made to look more ‘homogenous’ simply by doubling the number of items, even though the average correlation remains the same. This leads directly to the second problem. If we have two scales which each measure a distinct aspect, and combine them to form one long scale, alpha would probably be high, although the merged scale is obviously tapping two different attributes. Third, if alpha is too high, then it may suggest a high level of item redundancy; that is, a number of items asking the same question in slightly different ways.
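A quick illustration of the first problem, using the standardised-alpha formula alpha = k*r / (1 + (k-1)*r), where r is the average inter-item correlation (the 0.3 value is assumed): doubling the number of items at the same average correlation raises alpha.

```python
def standardised_alpha(k, r_bar):
    return k * r_bar / (1 + (k - 1) * r_bar)

print(round(standardised_alpha(10, 0.3), 2))  # 0.81
print(round(standardised_alpha(20, 0.3), 2))  # 0.90; same correlation, higher alpha
```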
item production
1) Identify knowledge or skills tested using course outcomes or professional standards
2) Create blueprint
Identifies: the learning to be assessed; the number of items for each learning area and their weights; the types of test items (short answer, true/false, MCQ, ordered response, matching, etc.); the points for each item and for the overall test
3) Create items of appropriate types – should include a variety of selected-response types and be appropriate to Bloom’s taxonomy
4) Follow item-writing guidelines
constructing criterion referenced tests
Set up the criterion groups. Sample size matters (a bigger n gives more reliable norms and statistics)
The item trial is administered to the sample/criterion group. There should be at least 200 in the sample, and if both men and women are included, then 200 of each
Statistical analysis e.g. CFA
….BUT
selection of criterion referenced groups
Lack of psychological meaning
factor analysis
Researchers should first conduct EFA to identify cross-loading items (i.e., for subsequent removal from the analysis if necessary).
Hence, CFA should be used to confirm factor structure, while EFA should be used to identify potentially problematic items.
CFA does not provide evidence of cross-loading items (a major source of insufficient discriminant validity); a simple cross-loading check is sketched below.
Where construct inter-correlations are high, researchers would look to conduct some form of discriminant validity assessment on the constructs involved, to give greater confidence to later interpretation of findings.
Due to numerous problems, this approach has now largely been replaced with item analysis.
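A simple sketch of the cross-loading check mentioned above, applied to an EFA loadings matrix (items x factors); the loadings and the 0.20 gap threshold are assumptions for illustration only.

```python
import numpy as np

loadings = np.array([[0.72, 0.10],
                     [0.65, 0.08],
                     [0.48, 0.41],    # cross-loading item
                     [0.05, 0.70]])   # assumed EFA output

# flag items whose two largest absolute loadings are too close to separate cleanly
sorted_abs = np.sort(np.abs(loadings), axis=1)[:, ::-1]
cross_loading = (sorted_abs[:, 0] - sorted_abs[:, 1]) < 0.20
print(np.where(cross_loading)[0])     # item 2 is a candidate for removal
```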
ITEM SELECTION
Based on item analysis. All items with significant discrimination are selected; if there are too many, keep the 20-30 (depending on the desired length of the test) with the best discrimination indices. Reliability should be at least 0.7.
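A small sketch of that selection step, assuming `d` is the array of per-item discrimination indices from the earlier card; the 25-item cut-off is just an example within the 20-30 range.

```python
import numpy as np

def select_items(d, n_keep=25):
    keep = np.argsort(d)[::-1][:n_keep]   # items with the highest discrimination
    return np.sort(keep)                  # indices of the retained items
```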
item trials
Create ~100 items, to be cut by about half for the final version of the test
Find a large representative sample – separate samples for males and females
Based on this sample, create norms and standardised scores