Reliability & Validity Flashcards
Reliability
- Are the results consistent?
- Provides an estimate of the proportion of unsystematic error; need to know the degree of error to determine reliability
Validity
- Does it measure what it says it measures?
- Overall eval of evidence and degree of trustworthiness
- Determine if enough support exists to use the test in a certain way
Classical Test Theory
- Observed score = T + E
- T is the true score if the test is completely free from error
- E is the error
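The T + E decomposition can be sketched with a small simulation; the true scores, error sizes, and sample size below are hypothetical, chosen only to show that observed-score variance equals true-score variance plus error variance:

```python
import random
import statistics

random.seed(0)

# Hypothetical example: each examinee has a fixed true score T;
# every administration adds random unsystematic error E.
true_scores = [random.gauss(50, 10) for _ in range(1000)]
observed = [t + random.gauss(0, 5) for t in true_scores]  # X = T + E

var_true = statistics.pvariance(true_scores)
var_obs = statistics.pvariance(observed)

# Under CTT, Var(X) = Var(T) + Var(E), so reliability = Var(T) / Var(X);
# here that is roughly 100 / (100 + 25) = 0.8
reliability = var_true / var_obs
```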
Unsystematic Error
- Random errors: mood, health, fatigue
- Administration differences
- Scoring differences
- Random guessing
Systematic Error
Constant errors that occur every time the test is administered, such as a typo in an item
Reliability Related to Validity
- High validity can occur only if high reliability exists
- High validity cannot occur with low reliability
- High reliability does not guarantee high validity
Correlation Related to Reliability
- Correlation: Statistical technique used to examine consistency
- Reliability is often based on consistency between two sets of scores
Positive Correlation
As one increases, so does the other
Negative Correlation
As one increases, the other decreases
Correlation Coefficient (Pearson-Product Moment)
- Correlation coefficient: numerical indicator of the relationship between two sets of data
- PPM correlation coefficient - most common
- -1 to +1: closer to absolute value 1=stronger relationship
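A minimal sketch of the Pearson product-moment coefficient; the two score lists are hypothetical, standing in for two sets of test scores:

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation between two score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical first and second administrations of the same test
first = [70, 75, 80, 85, 90]
second = [72, 74, 79, 88, 91]
r = pearson_r(first, second)  # close to +1: strong positive relationship
```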
Test-Retest
- Give same test twice to same group
- Correlation between first and second administration (typically 2-6 weeks apart)
- Possible influences: length of gap (shorter gap —> higher correlation), changes in administration, interventions, practice effects
- Ex: skills-based test
Alternate Forms
- Very difficult
- Correlation of scores from two equivalent forms of a test
- Measures stability (over time) and equivalence (construct similarity)
- Use samples of different items from the same domain
Internal Consistency
- One administration
- One form of instrument
- Divides instrument and correlates the scores from the different portions
Split-Half Reliability
- Given once then split in half to determine reliability
- Need to divide instrument into equivalent halves, like even and odd
- Problem: dividing instrument in half makes number of items smaller —> smaller correlation
A first-half v. second-half split doesn’t work if the test increases in difficulty; the Spearman-Brown formula is the usual correction for the smaller-correlation problem
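A minimal sketch of an odd/even split with the standard Spearman-Brown correction; the 0/1 response matrix is hypothetical:

```python
def pearson_r(xs, ys):
    """Pearson correlation between two score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Hypothetical 0/1 item responses: six examinees on a 6-item test
responses = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0, 1],
    [1, 1, 0, 1, 1, 0],
    [1, 0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]

# Odd/even split: total each half per examinee, then correlate the halves
odd = [sum(row[0::2]) for row in responses]
even = [sum(row[1::2]) for row in responses]
r_half = pearson_r(odd, even)

# Spearman-Brown correction estimates reliability at full test length,
# offsetting the smaller correlation from halving the number of items
r_full = (2 * r_half) / (1 + r_half)
```

Note that `r_full` is always at least as large as `r_half`, reflecting that longer tests tend to be more reliable.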
Kuder-Richardson
- KR-20: heterogeneous items
- KR-21: homogeneous items - single construct (cannot be used if items differ in difficulty)
- Lower reliability coefficient than split-half
- Purpose: Estimate the average of all split-half reliabilities from all ways of splitting the instrument
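The KR-20 formula can be sketched directly; the 0/1 response matrix is hypothetical:

```python
import statistics

def kr20(responses):
    """KR-20 reliability estimate for dichotomous (0/1) item responses."""
    k = len(responses[0])                     # number of items
    n = len(responses)                        # number of examinees
    totals = [sum(row) for row in responses]  # total score per examinee
    var_total = statistics.pvariance(totals)
    # sum of p*q over items: p = proportion correct, q = 1 - p
    ps = [sum(row[i] for row in responses) / n for i in range(k)]
    pq = sum(p * (1 - p) for p in ps)
    return (k / (k - 1)) * (1 - pq / var_total)

# Hypothetical 0/1 responses: six examinees, six items
responses = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0, 1],
    [1, 1, 0, 1, 1, 0],
    [1, 0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
r = kr20(responses)
```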
Coefficient Alpha
- Used for non-dichotomous scoring
- Ex: Likert scales
- Cronbach’s alpha
- Takes into account variance of each item
- Conservative estimate of reliability
- Most common
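A minimal sketch of the standard coefficient-alpha formula for non-dichotomous data; the 1-5 Likert responses are hypothetical:

```python
import statistics

def cronbach_alpha(responses):
    """Cronbach's alpha for examinee rows of numeric item scores."""
    k = len(responses[0])
    # variance of each item across respondents (alpha accounts for these)
    item_vars = [statistics.pvariance([row[i] for row in responses])
                 for i in range(k)]
    totals = [sum(row) for row in responses]
    return (k / (k - 1)) * (1 - sum(item_vars) / statistics.pvariance(totals))

# Hypothetical 1-5 Likert responses: five respondents, four items
likert = [
    [4, 5, 4, 5],
    [3, 4, 3, 4],
    [5, 5, 4, 5],
    [2, 3, 2, 2],
    [3, 3, 3, 4],
]
alpha = cronbach_alpha(likert)
```

With dichotomous (0/1) data, this formula reduces to KR-20.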
Standard Error of Measurement (SEM)
- Provides estimate of range of scores if someone were to take instrument repeatedly
- Based on idea that if someone takes test multiple times, scores would fall into a normal distribution
SEM v. SD
- SD is spread of scores between students
- SEM is spread of scores for one student
- Uses same estimations
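The SEM combines the two ideas above: it is the group SD scaled down by the test's reliability, SEM = SD * sqrt(1 - r). The SD, reliability, and observed score below are hypothetical:

```python
import math

def sem(sd, reliability):
    """Standard error of measurement: SEM = SD * sqrt(1 - r)."""
    return sd * math.sqrt(1 - reliability)

# Hypothetical instrument: SD = 15 across students, reliability = 0.91
s = sem(15, 0.91)            # 15 * sqrt(0.09) = 4.5

# ~68% of one student's repeated scores would fall within 1 SEM;
# band around a hypothetical observed score of 100:
band = (100 - s, 100 + s)
```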
Content-Related Validity
- Test items measure the objectives they are supposed to measure
- Focus on how content was determined
- May be based on test creator’s own analysis of topic or expert analysis
- How well do test items reflect the domain of material being tested
Criterion-Related Validity
- Test scores related to specific criterion/variable
- Sources of criterion scores: academic achievement, level of education, performance in specialized training, job performance, psychiatric diagnosis, ratings by supervisors, correlations with previously available tests
Concurrent Validity (Criterion-Related)
- Scores on test and criterion measure are collected at same point
- Ex: achievement, certification
- Scores typically higher than predictive validity coefficients
- Require reliable and bias-free measures
Predictive Validity (Criterion-Related)
- Test is administered first and scores on criterion measure are collected at a later time
- Ex: SAT, college GPA
- Require reliable and bias-free measures
Construct Validity
- What do scores on this test mean or signify
- Construct: Grouping of variables that make up observed behavior patterns
- Ex: Self-efficacy, personality
- Measured by correlation of 2 scores or factor analysis
- Often seen in psych tests
Convergent v. Discriminant (Construct Validity)
- Convergent: Positive correlation with other tests measuring the same/similar construct
- Discriminant: Low or no correlation with tests measuring different constructs
Threats to Construct Validity
- Too many variables
- Under-representation: failing to measure parts of the construct
- Extra questions
- Items are too similar
Overall Threats to Validity
- History: outside events during course of test
- Maturation: natural development with age
- Testing: repeat testing; changes due to practice
- Instrumentation: changes in measurement procedures
- Statistical regression: regression to mean after extreme score first time
- Interaction: any combo of 2
- Mortality: drop out
- Collection of subjects: bias of collecting subjects and assigning to groups
Face Validity
- Not a legitimate form of validity evidence
- Based on appearance of the measure and its test items
Types of Evidence
- Test content
- Response processes
- Internal structure
- Relations to other variables
- Consequences of testing
Item Analysis
- Examine and eval each item in the test —> get rid of items that don’t work
- Done during instrument development or revision
Item Difficulty
- Index reflecting proportion of people getting item correct
- 0.0= no one got it correct
- 1.0= everyone got it correct
- 0.5= ideal for differentiation
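The difficulty index is just the proportion correct; the 0/1 responses below are hypothetical:

```python
def item_difficulty(item_responses):
    """Proportion of examinees answering the item correctly (p)."""
    return sum(item_responses) / len(item_responses)

# Hypothetical 0/1 responses from ten examinees on one item
responses = [1, 1, 0, 1, 0, 1, 1, 0, 1, 0]
p = item_difficulty(responses)   # 0.6, near the ideal 0.5 for differentiation
```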
Item Discrimination
- Degree to which item correctly differentiates among test takers
- Extreme group method: 2 groups - high scores, low scores (works with normal distribution)
- Correlational method: performance of test v. item
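The extreme group method can be sketched as the difference in proportion correct between the top- and bottom-scoring groups; the group data below are hypothetical:

```python
def discrimination_index(upper_group, lower_group):
    """Extreme-group discrimination: p(upper) - p(lower) for one item."""
    p_upper = sum(upper_group) / len(upper_group)
    p_lower = sum(lower_group) / len(lower_group)
    return p_upper - p_lower

# Hypothetical 0/1 responses on one item from the highest- and
# lowest-scoring examinees (e.g., top and bottom 27% of total scores)
upper = [1, 1, 1, 0, 1]
lower = [0, 1, 0, 0, 0]
d = discrimination_index(upper, lower)   # 0.8 - 0.2 = 0.6
```

A positive index means high scorers got the item right more often than low scorers, i.e., the item differentiates correctly.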
Item Response Theory (IRT)
- Focus on each item; considers the mathematical relationship between ability and item performance
- 2 major assumptions: unidimensionality, local independence
- Most common in testing where there is a right/wrong answer v. preference
- Models student ability using each question instead of aggregate score
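As one IRT illustration (the simple one-parameter Rasch model, not necessarily the model a given test uses), the probability of a correct response depends only on the gap between the examinee's ability and the item's difficulty:

```python
import math

def rasch_p(ability, difficulty):
    """Rasch (1PL) IRT model: P(correct) given ability and item difficulty."""
    return 1 / (1 + math.exp(-(ability - difficulty)))

# When ability equals item difficulty, P(correct) is exactly 0.5
p_equal = rasch_p(0.0, 0.0)

# Higher ability on the same item -> higher probability of a correct answer
p_high = rasch_p(2.0, 0.0)
```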
Unidimensionality
Each item measures one ability or trait
Local Independence
A response to one item is unrelated to responses on other items, once ability is accounted for
Selecting Tests
- Determine what info is needed
- Search assessment resources
- Eval possible instruments
Administering Tests
- Pre-testing procedures
- Administration
- Scoring: by hand, computer, Internet
Communicating Results
- Simple language
- Individual v. Group
- Written v. Oral
- Communicate test’s strengths and limitations
- Know the manual
- Describe results v. just reporting scores
- Use various results
- Involve client
- Encourage asking questions
- Relate test to a goal
Problems with Reporting Results
- Acceptance
- Readiness of client
- Negative results
- Flat profiles: results don’t differentiate anything
- Motivation and attitude
Communicating Test Results for Parents
- Identifying information
- Reason for referral
- Background info
- Test results and interpretation
- Diagnostic impressions and summary