Test Construction and Interpretation Flashcards
Define Psychological Test
An objective and standardized measure of a sample of behaviour
Norm-Referenced Scores: Pros and Cons
Pros:
* allows for comparison of an individual's performance on different tests
* E.g. one score may look better, but we can only tell by comparing it to the scores of similar others
Cons:
* don’t provide an absolute or universal standard of good or bad performance
What is meant by a ‘Sample of Behaviour’ in tests?
A measure can't test ALL of a behaviour; it tests only a sample that should be representative of the entire concept it is measuring
Reliability: define
Consistency of results between testings
Validity: define
The degree to which a test measures what it is designed to measure
Test Characteristics
Maximum vs Typical Performance
Maximum: an examinee's best possible performance
Typical: what an examinee typically does or feels
Test Characteristics
Speed
Power
Mastery
Speed: response rate measured
Power: assesses the level of difficulty a person can attain; no time limit
Mastery: determines whether a person can attain a pre-established level of acceptable performance (e.g. the EPPP)
Ceiling Effects
If a test doesn't include an adequate range of items at the hard end, it limits the information the test can provide
E.g. if there aren’t enough challenging questions, everyone may get the max score
Threatens internal validity
Floor Effects
Not enough items on the easy end, so all low-achieving test takers are likely to score similarly
Threatens internal validity
Ipsative Measure
Define
The individual is the frame of reference in score reporting, not a norm group
Questions involve expressing preference for one thing over another
e.g. a personal preference inventory
Normative Measure
Define
Measures the strength of each attribute assessed by the test
Every item is answered, not chosen from amongst other options
Classical Test Theory
Reliability
People’s test scores consist of 2 things:
1. Truth
2. Error
True Score: the score that reflects the person's actual level of whatever is being measured
Error: factors irrelevant to what is being measured that impact score (e.g. noise, luck, mood)
Reliability Coefficient
Correlation: 0.0 to +1.0
0.0 = entirely unreliable
0.90 = 90% of observed variability is due to true score differences; 10% due to measurement error
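A minimal numeric sketch of this interpretation (the variance figures below are invented for illustration): under classical test theory, observed-score variance is the sum of true-score variance and error variance, and the reliability coefficient is the true-score share of that total.

```python
# Hypothetical variances, chosen only to illustrate the 0.90 example above
true_score_variance = 90.0
error_variance = 10.0
observed_variance = true_score_variance + error_variance  # X = T + E

reliability = true_score_variance / observed_variance
print(reliability)  # 0.9 -> 90% true-score variance, 10% measurement error
```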
Test-Retest Reliability
AKA coefficient of stability
Need to get the interval right: too soon (practice effects, memory); too far apart (more chance of random error)
Not good for unstable attributes (e.g. mood)
Alternate Forms Reliability
AKA coefficient of equivalence
Give 2 different forms of a test to the same group
Error comes from content differences between the two forms, or from time error; time error is reduced by giving the two forms in immediate succession
Don’t use w/ unstable traits
How to measure Internal Consistency Reliability?
- Split-half reliability
- Cronbach’s coefficient alpha
- Kuder-Richardson Formula 20
Split-Half Reliability
Internal Consistency
Divide the test in two and correlate scores on the two halves
Shorter tests are inherently less reliable; the Spearman-Brown formula can mitigate this by estimating the effect of test length on reliability
Not the most recommended
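A small sketch of the Spearman-Brown correction referred to above (the .70 half-test correlation is a made-up example):

```python
def spearman_brown(r, length_factor=2.0):
    """Estimate reliability after multiplying test length by `length_factor`;
    length_factor=2.0 corrects a half-test correlation up to full-test length."""
    return (length_factor * r) / (1 + (length_factor - 1) * r)

# A half-test correlation of .70 corresponds to roughly .82 for the full test
print(round(spearman_brown(0.70), 2))  # 0.82
```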
Coefficient Alpha
Internal Consistency
Single administration; measures the average degree of inter-item consistency
Used for tests w/ multi-point (non-dichotomous) items
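A minimal sketch of how coefficient alpha is computed from a single administration, assuming a small made-up examinee-by-item score matrix:

```python
import numpy as np

def cronbach_alpha(scores):
    """alpha = k/(k-1) * (1 - sum of item variances / variance of total scores)."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                              # number of items
    item_variances = scores.var(axis=0, ddof=1)
    total_variance = scores.sum(axis=1).var(ddof=1)  # variance of examinees' total scores
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Hypothetical 4 examinees x 3 multi-point items
print(round(cronbach_alpha([[3, 4, 3], [2, 2, 3], [5, 4, 5], [1, 2, 2]]), 2))
```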
Kuder-Richardson Formula 20
Internal Consistency
Single administration, inter-item consistency
Used on dichotomously scored tests
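A parallel sketch for KR-20, which applies the same idea to right/wrong (0/1) items; the data below are invented:

```python
import numpy as np

def kr20(scores):
    """KR-20 = k/(k-1) * (1 - sum(p*q) / variance of total scores),
    where p is the proportion passing each item and q = 1 - p."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    p = scores.mean(axis=0)
    total_variance = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - (p * (1 - p)).sum() / total_variance)

# Hypothetical 4 examinees x 4 dichotomous items
print(round(kr20([[1, 1, 1, 0], [1, 0, 1, 0], [1, 1, 1, 1], [0, 0, 1, 0]]), 2))
```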
How to measure the reliability of speed tests?
Test-retest or alternate forms
Inter-item (internal consistency) methods would yield spuriously perfect scores
Interscorer Reliability
What increases it?
- Raters well trained
- Raters know they are being observed
- Scoring categories should be mutually exclusive and exhaustive
What does Mutually Exclusive mean?
A behaviour belongs to one and only one category
Duration Recording
Interscorer Reliability
Rater records elapsed time during which target behaviour occurs
Frequency Recording
Interscorer Reliability
Observer keeps count of no. of times the target behaviour occurs
Interval Recording
Interscorer Reliability
Observing subject at given intervals and noting whether the target behaviour occurs
Good for behaviours with no fixed beginning or end
Continuous Recording
Interscorer Reliability
Record all behaviour of the subject during the observation session
Standard Error of Measurement
How much error an individual test score can be expected to have
Used to construct a confidence interval, which is the range within which someone's true score is likely to fall
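A minimal sketch of the usual SEM formula and a confidence interval built from it (the SD, reliability, and observed score are made-up values):

```python
import math

def sem(sd, reliability):
    """Standard error of measurement: SEM = SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - reliability)

s = sem(15, 0.89)            # hypothetical scale: SD = 15, reliability = .89 -> SEM ~ 5
observed = 110               # hypothetical observed score
ci_95 = (observed - 1.96 * s, observed + 1.96 * s)   # ~ (100.2, 119.8)
print(round(s, 1), ci_95)
```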
What factors affect reliability?
- Length of test
- Homogeneity of testing group
- Floor/ceiling effects
- Guessing correct answers
Content Validity
The extent to which the test items adequately and representatively sample the content area to be measured
Shown through correlation w/ other tests that assess the same content
Criterion Related Validity: Define
Is the test useful for predicting an individual's behaviour in specified situations?
Criterion = ‘that which is being predicted’
E.g. the SAT is correlated with university GPA to establish the relationship and determine criterion validity
Used in applied situations (selecting employees, college admissions, special classes)
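A minimal sketch (all numbers invented) of a criterion-related validity coefficient as the correlation between test scores and the criterion they are meant to predict:

```python
import numpy as np

test_scores = np.array([1200, 1350, 1100, 1450, 1000, 1300])  # hypothetical predictor scores
criterion = np.array([3.1, 3.6, 2.9, 3.8, 2.5, 3.4])          # hypothetical later GPA

validity_coefficient = np.corrcoef(test_scores, criterion)[0, 1]
print(round(validity_coefficient, 2))
```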