Test Construction and Interpretation Flashcards
Define Psychological Test
An objective and standardized measure of a sample of behaviour
Norm-Referenced Scores: Pros and Cons
Pros:
* allows for comparison of an individual's performance on different tests
* E.g. one score may look better than another, but we can only tell by comparing each to the scores of similar others
Cons:
* don’t provide an absolute or universal standard of good or bad performance
What is meant by a ‘Sample of Behaviour’ in tests?
A measure can’t test ALL of a behaviour; it tests only a sample that should be representative of the entire construct it is measuring
Reliability: define
Consistency of results between testings
Validity: define
The degree to which a test measures what it is designed to measure
Test Characteristics
Maximum vs Typical Performance
Maximum: the examinee’s best possible performance
Typical: what an examinee typically does or feels
Test Characteristics
Speed
Power
Mastery
Speed: measures response rate under a time limit
Power: assesses the level of difficulty a person can attain. No time limit
Mastery: determine if a person can attain pre-established level of acceptable performance (e.g. the EPPP)
Ceiling Effects
If a test doesn’t include an adequate range of items at the hard end, it limits the information the test can provide about high scorers
E.g. if there aren’t enough challenging questions, everyone may get the max score
Threatens internal validity
Floor Effects
Not enough items on the easy end, so all low-achieving test takers are likely to score similarly
Threatens internal validity
Ipsative Measure
Define
The individual is the frame of reference in score reporting, not a norm group
Questions involve expressing preference for one thing over another
e.g. a personal preference inventory
Normative Measure
Define
Measures the strength of each attribute assessed by the test
Every item is answered in its own right, not chosen from amongst other options
Classical Test Theory
Reliability
People’s test scores consist of 2 things:
1. Truth
2. Error
True Score: the score that reflects the person’s actual level of whatever skill or attribute is being measured
Error: factors irrelevant to what is being measured that impact score (e.g. noise, luck, mood)
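A minimal sketch of the standard notation behind this decomposition (the symbols are the usual classical test theory ones, not taken from the card itself):

```latex
X = T + E, \qquad
r_{XX} = \frac{\sigma^2_T}{\sigma^2_X} = \frac{\sigma^2_T}{\sigma^2_T + \sigma^2_E}
```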
Reliability Coefficient
Correlation: 0.0 to +1.0
0.0 = entirely unreliable
0.90 = 90% of observed variability is due to true score differences; 10% due to measurement error
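A quick worked example of that interpretation, using made-up numbers:

```python
# Hypothetical values: observed-score variance of 100 and a reliability of .90
observed_var = 100.0
reliability = 0.90

true_var = reliability * observed_var  # 90.0 -> variability due to true score differences
error_var = observed_var - true_var    # 10.0 -> variability due to measurement error
print(true_var, error_var)
```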
Test-Retest Reliability
AKA coefficient of stability
Need to get the timing right: too soon (practice effects, memory), too long an interval (more chance of random error or true change in the attribute)
Not good for unstable attributes (e.g. mood)
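A minimal sketch of how the coefficient of stability could be computed (invented scores; numpy is used just for illustration):

```python
import numpy as np

# Hypothetical scores for the same six examinees at two testings
time1 = np.array([12, 15, 9, 20, 14, 17])
time2 = np.array([13, 14, 10, 19, 15, 18])

# Test-retest reliability is the Pearson correlation between the two administrations
r_tt = np.corrcoef(time1, time2)[0, 1]
print(round(r_tt, 2))
```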
Alternate Forms Reliability
AKA coefficient of equivalence
Give 2 different forms of a test to the same group
Error due to content differences between the two forms, or time error. Time error is reduced by giving the forms in succession
Don’t use w/ unstable traits
How to measure Internal Consistency Reliability?
- Split-half reliability
- Cronbach’s coefficient alpha
- Kuder-Richardson Formula 20
Split-Half Reliability
Internal Consistency
Divide the test in two and correlate scores on the two halves
Shorter tests are inherently less reliable; the Spearman-Brown formula mitigates this by estimating the reliability of the full-length test from the half-test correlation
Not the most recommended method
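A rough sketch of an odd-even split with the Spearman-Brown correction (invented item data; numpy used for convenience):

```python
import numpy as np

# Hypothetical item scores: rows = examinees, columns = items
items = np.array([
    [1, 1, 1, 1, 0, 1],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0, 0],
    [1, 0, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0],
])

# Odd-even split: total each half, then correlate the half-test scores
half_a = items[:, 0::2].sum(axis=1)
half_b = items[:, 1::2].sum(axis=1)
r_half = np.corrcoef(half_a, half_b)[0, 1]

# Spearman-Brown: estimates full-length reliability from the half-test correlation
r_full = (2 * r_half) / (1 + r_half)
print(round(r_half, 2), round(r_full, 2))
```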
Coefficient Alpha
Internal Consistency
Single administration, measure average degree of inter-item consistency
Used for tests whose items can take multiple score values (e.g. Likert-type items)
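A small sketch of the coefficient alpha computation (invented Likert-type data; standard alpha formula):

```python
import numpy as np

# Hypothetical Likert-type item scores: rows = examinees, columns = items
items = np.array([
    [4, 5, 4, 3],
    [2, 2, 3, 2],
    [5, 5, 4, 5],
    [3, 4, 3, 3],
    [1, 2, 2, 1],
], dtype=float)

k = items.shape[1]                         # number of items
item_vars = items.var(axis=0, ddof=1)      # variance of each item
total_var = items.sum(axis=1).var(ddof=1)  # variance of total scores

# Cronbach's alpha: k/(k-1) * (1 - sum of item variances / total-score variance)
alpha = (k / (k - 1)) * (1 - item_vars.sum() / total_var)
print(round(alpha, 2))
```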
Kuder-Richardson Formula 20
Internal Consistency
Single administration, inter-item consistency
Used on dichotomously scored tests
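A parallel sketch for KR-20 on dichotomous items (invented data; KR-20 is the special case of alpha where each item's variance is p*q):

```python
import numpy as np

# Hypothetical dichotomously scored items: 1 = correct, 0 = incorrect
items = np.array([
    [1, 1, 1, 0, 1],
    [1, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [1, 1, 0, 1, 1],
])

k = items.shape[1]
p = items.mean(axis=0)                     # proportion passing each item
q = 1 - p                                  # proportion failing each item
total_var = items.sum(axis=1).var(ddof=0)  # total-score variance (population form, to match p*q)

# KR-20: coefficient alpha specialised to items scored 0/1
kr20 = (k / (k - 1)) * (1 - (p * q).sum() / total_var)
print(round(kr20, 2))
```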
How to measure Internal Consistency of speed tests?
Test-retest or alternate forms
Inter-item methods would yield spuriously high (near-perfect) coefficients
Interscorer Reliability
What increases it?
- Raters well trained
- Raters know they are being observed
- Scoring categories should be mutually exclusive and exhaustive
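A bare-bones sketch of one common interscorer reliability index, simple percent agreement, using invented ratings (kappa or intraclass correlations are more refined alternatives):

```python
# Hypothetical category codes assigned by two trained raters to the same
# ten observation intervals (categories are mutually exclusive and exhaustive)
rater_1 = ["on-task", "off-task", "on-task", "on-task", "disruptive",
           "on-task", "off-task", "on-task", "disruptive", "on-task"]
rater_2 = ["on-task", "off-task", "on-task", "off-task", "disruptive",
           "on-task", "off-task", "on-task", "disruptive", "on-task"]

# Percent agreement: proportion of intervals on which the two raters agree
agreements = sum(a == b for a, b in zip(rater_1, rater_2))
print(agreements / len(rater_1))  # 0.9
```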
What does Mutually Exclusive mean?
A behaviour belongs to one and only one category
Duration Recording
Interscorer Reliability
Rater records elapsed time during which target behaviour occurs
Frequency Recording
Interscorer Reliability
Observer keeps count of no. of times the target behaviour occurs