Test Construction and Interpretation Flashcards
Define Psychological Test
An objective and standardized measure of a sample of behaviour
Norm-Referenced Scores: Pros and Cons
Pros:
* allows for comparison of an individual's performance on different tests
* E.g. one raw score may look better than another, but we can only tell by comparing each to the scores of similar others
Cons:
* don’t provide an absolute or universal standard of good or bad performance
What is meant by a ‘Sample of Behaviour’ in tests?
A measure can't test ALL of a behaviour; it tests only a sample that should be representative of the entire concept being measured
Reliability: define
Consistency of results between testings
Validity: define
The degree to which a test measures what it is designed to measure
Test Characteristics
Maximum vs Typical Performance
Maximum: examinee's best possible performance
Typical: what an examinee typically does or feels
Test Characteristics
Speed
Power
Mastery
Speed: response rate measured
Power: assesses the level of difficulty a person can attain. No time limit
Mastery: determine if a person can attain pre-established level of acceptable performance (e.g. the EPPP)
Ceiling Effects
If a test doesn't include an adequate range of items at the hard end, it limits the information the test can provide about high-scoring examinees
E.g. if there aren’t enough challenging questions, everyone may get the max score
Threatens internal validity
Floor Effects
Not enough items on the easy end, so all low achieving test takers are likely to score similarly
Threatens internal validity
Ipsative Measure
Define
The individual is the frame of reference in score reporting, not a norm group
Questions involve expressing preference for one thing over another
e.g. a personal preference inventory
Normative Measure
Define
Measures the absolute strength of each attribute assessed by the test
Every item is answered on its own scale, rather than being chosen over other options
Classical Test Theory
Reliability
People’s test scores consist of 2 things:
1. Truth
2. Error
True Score: the score that reflects the examinee's actual level of whatever is being measured
Error: factors irrelevant to what is being measured that impact score (e.g. noise, luck, mood)
Reliability Coefficient
Correlation: 0.0 to +1.0
0.0 = entirely unreliable
0.90 = 90% of observed variability is due to true score differences; 10% due to measurement error
Test-Retest Reliability
AKA coefficient of stability
Need to get the timing right; too soon (practice effects, memory), too long (true changes in the attribute and intervening events add error)
Not good for unstable attributes (e.g. mood)
Alternate Forms Reliability
AKA coefficient of equivalence
Give 2 different forms of a test to the same group
Error comes from content differences between the two forms and from the time between administrations; time-related error is reduced by giving the forms in immediate succession
Don’t use w/ unstable traits
How to measure Internal Consistency Reliability?
- Split-half reliability
- Cronbach’s coefficient alpha
- Kuder-Richardson Formula 20
Split-Half Reliability
Internal Consistency
Divide the test in two and correlate scores on the two halves
Shorter tests are inherently less reliable; the Spearman-Brown formula can mitigate this by estimating the effect of test length on reliability (see the sketch below)
Not the most recommended
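A minimal sketch of the Spearman-Brown prophecy formula referenced above (the function name is just illustrative):

```python
def spearman_brown(r_old: float, n: float) -> float:
    """Estimate reliability when test length is multiplied by a factor of n.

    r_old: reliability of the original-length test (e.g. one half of a split test)
    n: factor by which the number of items changes (2 = doubled)
    """
    return (n * r_old) / (1 + (n - 1) * r_old)

# Split-half example: the two halves correlate .70, so the estimated
# reliability of the full-length test is about .82
print(round(spearman_brown(0.70, 2), 2))  # 0.82
```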
Coefficient Alpha
Internal Consistency
Single administration, measure average degree of inter-item consistency
Used for tests w/ multiple scored items
Kuder-Richardson Formula 20
Internal Consistency
Single administration, inter-item consistency
Used on dichotomously scored tests
How to measure Internal Consistency of speed tests?
Test-retest or alternate forms
Inter-item methods would yield spuriously high (near-perfect) coefficients
Interscorer Reliability
What increases it?
- Raters well trained
- Raters know they are being observed
- Scoring categories should be mutually exclusive and exhaustive
What does Mutually Exclusive mean?
A behaviour belongs to one and only one category
Duration Recording
Interscorer Reliability
Rater records elapsed time during which target behaviour occurs
Frequency Recording
Interscorer Reliability
Observer keeps count of no. of times the target behaviour occurs
Interval Recording
Interscorer Reliability
Observing subject at given intervals and noting whether the target behaviour occurs
Good for behaviours with no fixed beginning or end
Continuous Recording
Interscorer Reliability
Record all behaviour of the subject during the observation session
Standard Error of Measurement
How much error an individual test score can be expected to have
Used to construct a confidence interval, which is the range within which someone's true score is likely to fall
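A minimal sketch of the standard SEM formula (the SD and reliability values below are hypothetical):

```python
import math

def sem(sd: float, reliability: float) -> float:
    """Standard error of measurement: SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - reliability)

# Hypothetical IQ-style test: SD = 15, reliability = .91
error = sem(15, 0.91)  # 4.5
# 95% confidence interval around an observed score of 100
ci_95 = (100 - 1.96 * error, 100 + 1.96 * error)  # roughly 91 to 109
```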
What factors affect reliability?
- Length of test
- Homogeneity of testing group
- Floor/ceiling effects
- Guessing correct answers
Content Validity
The extent to which the test items adequately and representatively sample the content area to be measured
Demonstrated primarily by judging whether the items adequately cover the domain; supported by correlations with other tests that assess the same content
Criterion Related Validity: Define
Is it useful for predicting an individual's behaviour in specified situations?
Criterion = ‘that which is being predicted’
E.g. correlating the SAT with university GPA to establish the relationship and determine criterion validity
Used in applied situations (selecting employees, college admissions, special classes)
Criterion-Related Validity Coefficient
rxy (x = predictor; y = criterion)
Range: -1.0 to +1.0
Few exceed .60
What is the Coefficient of Determination?
Criterion Validity
The square of a correlation coefficient, which shows the variability in criterion that is explained by variability in the predictor
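A one-line worked example, using a hypothetical validity coefficient:

```python
r_xy = 0.60            # hypothetical predictor-criterion correlation
r_squared = r_xy ** 2  # 0.36: 36% of criterion variability is explained by the predictor
```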
Concurrent Validation
Criterion Validity
The predictor and criterion data are collected at same time
It predicts a current behaviour
E.g. job selection test for therapists given to current therapists, and it is correlated with their current performance ratings from supervisors
When would you use Concurrent Validation?
Criterion Validity
When you need the current status of a criterion
May be used over predictive for cost and convenience
Predictive Validity
Criterion Validity
Predictor scores are collected first, criterion data collected later
E.g. does the GRE predict grad school performance?
Standard Error of Estimate
Criterion Validity
Used when interpreting an individual's predicted score on a criterion measure
There will be a difference between the predicted criterion score and the actual score; the standard error of estimate reflects the typical size of that prediction error
E.g. using SAT score to predict GPA via a regression equation
Equation for the Standard Error of Estimate
Criterion Validity
SE_est = SD_y x √(1 - r_xy²)
* SE_est = standard error of estimate
* SD_y = standard deviation of criterion scores
* r_xy = validity coefficient
This can be used to make a confidence interval
*Likely won’t need to remember equation for exam. But do need to for SEM
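A minimal sketch applying the formula above (the SD, validity coefficient, and predicted GPA are hypothetical):

```python
import math

def se_estimate(sd_y: float, r_xy: float) -> float:
    """Standard error of estimate: SD_y * sqrt(1 - r_xy**2)."""
    return sd_y * math.sqrt(1 - r_xy ** 2)

# Hypothetical: criterion SD = 0.5 GPA points, validity coefficient = .40
err = se_estimate(0.5, 0.40)                  # ~0.46
# 95% confidence interval around a predicted GPA of 3.0
ci_95 = (3.0 - 1.96 * err, 3.0 + 1.96 * err)  # roughly 2.1 to 3.9
```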
How can you use Criterion Validity to make decisions?
Criterion Cut off Point
Predict if someone is likely to make it above the cut off and be selected (e.g. all students w/ GPA of 3.0+)
What is a predictor's Functional Utility?
Criterion Validity
Determine the increase in correct decision making that would result from using the predictor as a selection tool
Calculated once predictor and criterion cut off points are made
4 possibilities for Criterion Cut Off Point Scores
- True Positives: scored above cut off, and were successful
- False Positives: scored above cut off, not successful
- True Negatives: scored below cutoff, unsuccessful
- False Negatives: scored below cutoff, successful
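A minimal sketch (cut off values hypothetical) showing how each case falls into one of the four outcomes:

```python
def classify(predictor: float, criterion: float, pred_cut: float, crit_cut: float) -> str:
    """Assign a case to one of the four decision outcomes."""
    selected = predictor >= pred_cut    # scored above the predictor cut off
    successful = criterion >= crit_cut  # met the criterion cut off
    if selected and successful:
        return "true positive"
    if selected and not successful:
        return "false positive"
    if not selected and not successful:
        return "true negative"
    return "false negative"

# e.g. SAT cut off of 1200, GPA cut off of 3.0
print(classify(1250, 3.4, 1200, 3.0))  # true positive
print(classify(1150, 3.2, 1200, 3.0))  # false negative
```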
Heterogeneity of Examinees
Factors that Affect Validity Coefficient
A restricted range of scores will lower the validity coefficient
Homogenous groups = lower validity coefficient
Reliability of Predictor and Criterion
Factors that Affect Validity Coefficient
They must both be reliable for a predictor to be valid
High reliability does not guarantee good validity
Moderator Variables
Factors that Affect Validity Coefficient
Differential Validity
What are they? a variable (e.g. age, gender) that affects the strength of the relationship between the predictor and the criterion
Differential Validity: a test has this if there are different validity coefficients for different groups
Cross-Validation
Factors that Affect Validity Coefficient
Shrinkage
After a test is validated, it's re-validated with a different group of people
Shrinkage: when the validity coefficient drops after cross-validation, because the predictor ended up being 'tailor-made' to the original validation sample
Criterion Contamination
Factors that Affect Validity Coefficient
What is it? knowledge of someone's predictor score impacts their criterion score
Prevention: people involved in assigning criterion ratings should not know the person's predictor score
Construct Validity
What is it? the degree to which a test measures the construct it is intended to
How Measured? over time, based on accumulation of evidence
Convergent Validity
Construct Validity
What is it? different ways of measuring the same trait yield similar results
Discriminant Validity
Construct Validity
What is it? when a test does NOT correlate with another test that measures something different
What is a Multi-Trait Multi-Method Matrix?
Construct Validity
Assessment of 2 or more traits by 2 or more methods.
Convergent Validity if tests that measure same traits have a high correlation, even when different methods used
Discriminant Validity when two tests that measure different traits have a low correlation, even when they use the same method
4 types of correlation coefficients in the Multitrait-Multimethod Matrix
Construct Validity
- Monotrait-monomethod: correlation between a measure and itself. RELIABILITY
- Monotrait-heteromethod: correlation between two measures of same trait w/ different methods
- Heterotrait-monomethod: correlation between two measures of different traits using same method
- Heterotrait-heteromethod: correlation between two measures of different traits using different methods
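A toy illustration of the four cell types (traits, methods, and all correlations are hypothetical):

```python
# Two traits (anxiety A, depression D) measured by two methods (1 = self-report, 2 = clinician rating)
mtmm_cells = {
    ("A1", "A1"): 0.90,  # monotrait-monomethod: reliability
    ("A1", "A2"): 0.65,  # monotrait-heteromethod: convergent validity (should be high)
    ("A1", "D1"): 0.25,  # heterotrait-monomethod: discriminant validity (should be low)
    ("A1", "D2"): 0.10,  # heterotrait-heteromethod: should be the lowest of all
}
```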
What is Factor Analysis?
Construct Validity
A stats procedure that reduces a set of many variables to fewer ‘themed’ variables (underlying constructs/latent variables)
Factor Analysis: Factor Loading
Construct Validity
Correlation between a given test and a given factor
+1 to -1
Can be squared to determine the proportion of variability in the test accounted for by the factor
Factor Analysis: Communality
Common Variance
Unique Variance
Construct Validity
Measures: The proportion of variance of a test that is attributable to the factors
How Measured? factor loadings are squared and added
Symbol: h²
Common Variance: the variance a test shares with the common factors
Unique Variance: variance specific to the test, unrelated to the factors
* Subtract communality from 1.00
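A worked example with hypothetical loadings on two orthogonal factors:

```python
# Hypothetical test loading .60 on Factor I and .50 on Factor II
loadings = [0.60, 0.50]
communality = sum(l ** 2 for l in loadings)  # h^2 = 0.36 + 0.25 = 0.61 (common variance)
unique_variance = 1.0 - communality          # 0.39 (specificity plus error)
```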
Explained Variance (Eigenvalues)
Construct Validity
What are they? measure of the amount of variance in all the tests accounted for by the factor
Convert to percentage: (eigenvalue x 100) / (# of tests)
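A quick worked example with hypothetical numbers:

```python
eigenvalue, n_tests = 2.5, 10
pct_explained = eigenvalue * 100 / n_tests  # 25% of the total variance across the 10 tests
```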
Interpreting & Naming the Factors
Rotation
Construct Validity
You must make inferences based on theory about what the factors are measuring (e.g. based on the contents of items that load highly on that factor)
Rotation: a procedure that places factors in a new position relative to the tests. Aids in interpretation
2 Types of Rotation
Interpreting & Naming Factors
Construct Validity
- Orthogonal: factors are independent of each other
- Oblique: factors that are correlated w/ each other to some degree
Notes: communality only exists for orthogonal
Post-rotation, eigenvalues may have changed. Eigenvalue only used for unrotated factors.
Factorial Validity
Construct Validity
What is it? when a test correlates highly with a factor it would be expected to
Principal Components Analysis
Construct Validity
Similar to Factor Analysis:
* reduce large set of variables to underlying constructs
* Factor matrix
* Eigenvalues: square & sum factor loadings
* Underlying factors ordered in terms of explanatory power
Differences to Factor Analysis:
* Factor = principal component/eigenvector
* no distinction between communality and specificity (variance is only divided into explained and error variance)
* Components are always uncorrelated, i.e. no such thing as oblique rotation
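A hedged sketch using scikit-learn (not part of the cards) to contrast the two analyses on hypothetical data:

```python
import numpy as np
from sklearn.decomposition import PCA, FactorAnalysis

rng = np.random.default_rng(0)
scores = rng.normal(size=(200, 6))    # 200 examinees x 6 hypothetical test scores

pca = PCA(n_components=2).fit(scores)
print(pca.explained_variance_ratio_)  # proportion of total variance each component explains

fa = FactorAnalysis(n_components=2).fit(scores)
print(fa.components_)                 # factor loadings
print(fa.noise_variance_)             # unique variance per test, which PCA does not model
```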
Cluster Analysis
Construct Validity
Purpose: develop a taxonomy/classification
Used to divide a group into similar subtypes (e.g. types of criminals)
Differences to Factor Analysis:
* Any type of data can be used for CA, whereas only interval or ratio for FA
* Clusters are just categories, not latent variables
* Not used when there is a pre-existing hypothesis, whereas FA has one
Relationship between Reliability and Validity
A test can be reliable but not valid
For a test to be valid, it must be reliable (if it doesn’t have consistent results, it’s only measuring random error)
The validity coefficient is less than or equal to the square root of the reliability coefficient
Correction for Attenuation
Validity
This equation can show you what would happen to the validity of a test if both the criterion and predictor had higher reliability
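A sketch of the classic correction-for-attenuation formula, which estimates validity if both measures were perfectly reliable (values hypothetical):

```python
import math

def correct_for_attenuation(r_xy: float, r_xx: float, r_yy: float) -> float:
    """Estimated validity if predictor and criterion were perfectly reliable."""
    return r_xy / math.sqrt(r_xx * r_yy)

# Hypothetical: observed validity .40, predictor reliability .80, criterion reliability .70
print(round(correct_for_attenuation(0.40, 0.80, 0.70), 2))  # ~0.53
```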
How can Item Analysis help Reliability and Validity?
It can have them built into the test, item by item
Item Difficulty
- The percentage of examinees who answer it correctly (item difficulty index, p)
- Moderate difficulty items are most common; increase score variability which increases reliability & validity
- Change based on purpose of the test
- Avg difficulty should be halfway between 1.0 and level of success expected by chance
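A worked example of the 'halfway between 1.0 and chance' rule, assuming a 4-option multiple-choice item:

```python
chance = 0.25                   # probability of guessing a 4-option item correctly
optimal_p = (1.0 + chance) / 2  # 0.625: the target average item difficulty
```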
What scale is associated with the p level according to Anne Anastasi?
Item Difficulty
Ordinal scale
Why? equivalent differences in p value do not indicate equivalent differences in difficulty
e.g. we can conclude which items are easier than others, but that doesn’t mean the difference in difficulty between items is equal to the difference between other items
Item Discrimination
Degree to which an item differentiates among examinees in terms of the behaviour it is designed to measure
e.g. depressed people consistently answer an item differently than non-depressed people
How to measure Item Discrimination?
Correlate Item Response with Total Score: those w/ highest correlation are kept. Useful when test only measures one thing
Correlate Item with Criterion Measure: choose items that correlate with criterion but not w/ each other
Item Discrimination Index: D
Divide the group into the top and bottom 27%. For each item, subtract the % of examinees in the low-scoring group who answered correctly from the % in the high-scoring group who answered correctly (D = U - L)
Range: -1.0 to +1.0 (-100 to +100 if expressed as percentages)
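A worked example with hypothetical proportions:

```python
# 80% of the top-scoring 27% answer the item correctly vs 30% of the bottom 27%
U, L = 0.80, 0.30
D = U - L  # 0.50; values near +1 discriminate well, 0 = none, negative = reversed
```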
Relationship between Item Difficulty and Item Discrimination
Difficulty level places a ceiling on discrimination index (if everybody or nobody answers it correctly, there is no discrimination)
Moderate difficulty items have best discrimination
Item Response Theory: Define
How is it displayed?
What does it show?
Based on Item Characteristic Curves, which depict, for each item, the probability of a correct response for examinees at different ability levels
Slope on the graph shows discrimination (steeper curve = greater discrimination)
Difficulty, discrimination, and probability of answering correctly
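A minimal sketch of a two-parameter logistic item characteristic curve, one common IRT model (parameter values hypothetical):

```python
import math

def icc_2pl(theta: float, a: float, b: float) -> float:
    """Probability of a correct response under a 2-parameter logistic model.

    theta: examinee ability; a: discrimination (slope); b: difficulty (location).
    """
    return 1 / (1 + math.exp(-a * (theta - b)))

# Hypothetical item with difficulty b = 0.5 and discrimination a = 1.5
print(round(icc_2pl(0.5, 1.5, 0.5), 2))  # 0.50 when ability equals difficulty
print(round(icc_2pl(1.5, 1.5, 0.5), 2))  # ~0.82: higher ability, higher probability
```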
2 Assumptions of Item Response Theory
- Performance on item is related to estimated amount of a latent trait being measured by item
- Results of testing are sample free (invariance of item parameters)
An item should have same difficulty & discrimination across all random samples of a population
Why do we need to compare peoples test scores to a norm?
Test Interpretation
Because without a reference point, test results mean nothing
2 types of Developmental Norms
Test Interpretation
Mental Age
* Compare score to the avg performance of others at different age levels
* Used to calculate ratio IQ score: (Mental Age/Chronological Age) x 100 (see the worked example after this card)
Grade Equivalent
* Primarily used for interpretation of educational achievement tests
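A quick ratio IQ example with hypothetical ages:

```python
mental_age, chronological_age = 12, 10
ratio_iq = (mental_age / chronological_age) * 100  # 120
```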
Disadvantages of Developmental Norms
Test Interpretation
- Don't allow for comparison of individuals at different age levels, because the standard deviation is not taken into account
Within-Group Norms
Test Interpretation
- Compare score to those of most similar standardization sample
- E.g. percentile ranks, standard scores
Percentile Ranks
Test Interpretation
- Indicates the percentage of people in standardization sample who fall below a given raw score
- E.g. 90th percentile = you scored better than 90% of others
- Disadvantage: ordinal data, so we can't quantify the difference in scores between someone at the 90th and someone at the 80th percentile rank
Standard Scores: define
Test Interpretation
- Show a raw score’s distance from the mean in standard deviation units
- Can compare an individual at different ages
4 types of Standard Scores
Test Interpretation
Z-Score
* shows how many SDs above/below the mean. E.g. +1.0 = one SD above the mean
T-Score
* have a mean of 50, SD of 10
* T score 60 = score falls 1 SD above mean
Stanine Score
* Literally means ‘standard 9’, scores range 1-9
* Mean of 5, SD of 2
Deviation IQ Score
* mean 100, SD 15
* E.g. IQ tests
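A minimal sketch converting one hypothetical raw score into each type of standard score:

```python
# Hypothetical: raw score of 65 on a test with mean 50 and SD 10
raw, mean, sd = 65, 50, 10
z = (raw - mean) / sd        # 1.5 SDs above the mean
t_score = 50 + 10 * z        # 65
deviation_iq = 100 + 15 * z  # 122.5
stanine = max(1, min(9, round(5 + 2 * z)))  # 8 (stanines are capped at 1-9)
```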