Week 3 Flashcards
Reliability
- Synonym for dependability and consistency
- Refers to consistency of measurement in psychometrics
- Not necessarily reflecting good or bad results, just consistent ones
- Test may be reliable in one context but not another
Reliability Coefficient
- Quantifies reliability, ranging from 0 (not reliable) to 1 (perfectly reliable)
Measurement Error
- In everyday language, "error" means some kind of preventable mistake
- In science the meaning is broader: measurement imperfection, which is inevitable
e.g. a reading of 25 could really be 24.9978 or 25.1232 - small fluctuations like these can be rounded away, but measurement error is not always trivial
- Noticeable differences are routinely observed - think of building a steel bridge in a hot climate
True Scores
- Can never be observed directly
- Useful fiction that allows us to understand reliability
- At best true scores are guessed by averaging many measurements
Repeated Measurements
- Repeated measurement can have problems
- Time between testing has an effect
- Some states are in constant flux, so they may average differently at different times
Carryover Effects
- An effect of being tested in one condition on participants' behaviour in later conditions
- The practice effect: participants perform a task better in later conditions because they have had a chance to practise it
- The fatigue effect: repeated measurement causes performance to decline due to fatigue
Construct Score
- A theoretical variable we believe exists, such as depression, agreeableness, or reading ability
- Tests of constructs are imperfect, so an observed score is never the True Score
- Long-term averages can still come close to the True Score, flaws and all
True Scores
- We can never observe True Scores directly
- The concept helps us understand reliability
- High reliability does not mean high validity
Concept of Reliability
- True Score is the long-term average of many measurements
- Assumes no carryover effects
- T = True Score
- X = Observed Score (the measurement)
- E = Measurement Error
- If the observed score is mostly determined by measurement error, the test is unreliable
- Better if the observed score is mostly determined by the True Score
X = T + E
Variance (σ²)
- Useful to describe test score variability
- Standard Deviation Squared
- Can be broken into components
- True Scores are stable and give consistency to tests
Reliability refers to the proportion of total variance attributable to T: reliability = σ²(T) / σ²(X)
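The decomposition X = T + E can be checked with a small simulation (a sketch with made-up numbers, not from the course materials): give each person a fixed True Score, add random error to each measurement, and compare the variance ratio to its theoretical value.

```python
import random
import statistics

random.seed(0)

# Hypothetical simulation: each person has a fixed True Score (T),
# and every measurement adds random error (E), so X = T + E.
true_scores = [random.gauss(100, 15) for _ in range(10_000)]   # sigma_T = 15
observed = [t + random.gauss(0, 5) for t in true_scores]       # sigma_E = 5

var_T = statistics.pvariance(true_scores)
var_X = statistics.pvariance(observed)

# Reliability = proportion of total variance attributable to T.
reliability = var_T / var_X
# Theoretical value: 15**2 / (15**2 + 5**2) = 225 / 250 = 0.90
print(round(reliability, 2))
```

With a large sample the simulated ratio lands very close to the theoretical 0.90; the more of the total variance that comes from error, the lower this number.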
Measurement Error
Chapter 5 - Page 297
Measuring Psychological Constructs
- Constructs can’t be observed
- Can be inferred from what we observe
- Observe behaviour
- Observe responses to self report scales
Characteristics of a Typical Scale/Sub-Scale
- Statements or questions designed to measure a construct/behaviour
- Fixed-choice responses, consistent across the scale
- Responses are correlated because they assess the same thing
- Responses are averaged to find an overall score
- Some items are reverse coded
- Strong psychometric properties: reliability, validity & factor structure
- Normative or Standardised data collected from a wide range of people
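The scoring steps above (reverse-code flagged items, then average) can be sketched in Python; the items and the score_scale helper are hypothetical, loosely modelled on extraversion-style items.

```python
# Hypothetical 5-item scale on a 1-5 Likert response format.
# Items marked True are reverse coded (worded in the opposite direction).
SCALE_MIN, SCALE_MAX = 1, 5
items = [
    ("I am the life of the party",      False),
    ("I keep in the background",        True),   # reverse coded
    ("I talk to a lot of people",       False),
    ("I don't like to draw attention",  True),   # reverse coded
    ("I start conversations",           False),
]

def score_scale(responses):
    """Reverse-code flagged items, then average to get the overall score."""
    recoded = []
    for (text, reverse), r in zip(items, responses):
        recoded.append((SCALE_MIN + SCALE_MAX) - r if reverse else r)
    return sum(recoded) / len(recoded)

# A fairly extraverted respondent: high on positive items, low on reversed ones.
print(score_scale([5, 2, 4, 1, 5]))  # reversed 2 -> 4 and 1 -> 5, mean = 4.6
```

Reverse coding uses (min + max) − response, so a 2 on a 1-5 item becomes a 4; without it, reversed items would drag the average toward the midpoint.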
Commercial Scales
- Pay per use
- Published by commercial publishers
- Commonly used for clinical or applied purposes like recruitment or diagnosis
- Sometimes used in research
- Expensive
- Detailed normative data
- MMPI, NEO-PI, Beck Depression Inventory
Non-Commercial Scales
- Free to use
- Published in books, journal articles or online
- Often used for research purposes
- Unlikely to be published with Normative Data
- MINI-IPIP, Person Environment Fit Scale
- Typically used by research students
Uni Dimensional Scales
- Measures only one construct
- All items are intercorrelated
- All items averaged to derive overall score
e.g. Relationship Satisfaction Scale, Beck Depression Inventory
Multi Dimensional Scales
- Measures multiple constructs
- Each construct has a sub-scale and is a separate variable
- Items within each sub-scale are intercorrelated
- Each sub-scale is averaged to derive its score
- Adding up scores across constructs is meaningless
- Sub-scales do not necessarily correlate with each other
- Separate reliability and validity data are calculated for each sub-scale.
e.g.: MINI-IPIP (Big 5); Person Environment Fit (three fit dimensions).
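A minimal sketch of scoring a multi-dimensional scale (construct names and responses are invented): each sub-scale is averaged separately, and no overall total is computed because summing across constructs would be meaningless.

```python
# Hypothetical responses to a multi-dimensional scale: each construct
# gets its own sub-scale and yields its own variable.
responses = {
    "extraversion":      [4, 5, 3, 4],
    "agreeableness":     [2, 3, 2, 2],
    "conscientiousness": [5, 4, 5, 5],
}

# Score each sub-scale separately; never sum across constructs.
subscale_scores = {name: sum(r) / len(r) for name, r in responses.items()}
print(subscale_scores)
# {'extraversion': 4.0, 'agreeableness': 2.25, 'conscientiousness': 4.75}
```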
Define Reliability
- How consistent are the tools we use to measure a construct
- Does it produce the same results over time
- Unreliable measures cannot be trusted
Four types of Reliability
- Internal Consistency
- Test-Retest
- Alternate Forms
- Inter Rater
Internal Consistency
- Consistency amongst items.
- Responses to all the items on a sub-scale should be similar/consistent
- If all items on the sub-scale measure the same thing, people should respond similarly to them
Test-Retest
- Consistency over time
- Scale scores at time 1 should be very similar to their scores at time 2.
Alternate-Forms
- Consistency over equivalent versions
- Scale scores on version 1 should be similar to their scores on version 2.
Inter-Rater
- Consistency over Observers/Raters
- Multiple observers/researchers should provide similar accounts of the same event or behaviour
Internal Consistency - Split-Half Method
- A measure is split in half and averages for the first half are correlated with averages for the second half
- Often multiple ways to split a scale
- Each different split will have a different reliability
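A sketch of the split-half method on simulated data (all numbers are hypothetical). The Spearman-Brown step at the end is the standard correction that steps the half-length correlation up to an estimate of full-length reliability.

```python
import random
import statistics

def pearson(x, y):
    """Pearson correlation, computed from scratch."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / (statistics.pstdev(x) * statistics.pstdev(y) * len(x))

random.seed(1)
# Hypothetical data: 200 respondents, 10 items all tapping one construct.
trait = [random.gauss(0, 1) for _ in range(200)]
data = [[t + random.gauss(0, 1) for _ in range(10)] for t in trait]

# Odd-even split: average the even-indexed and odd-indexed items separately.
half1 = [statistics.fmean(row[0::2]) for row in data]
half2 = [statistics.fmean(row[1::2]) for row in data]
r_halves = pearson(half1, half2)

# Spearman-Brown correction: estimate full-length reliability
# from the correlation between the two half-length scores.
split_half = 2 * r_halves / (1 + r_halves)
print(round(split_half, 2))
```

A different split (e.g. first five vs last five items) would give a slightly different coefficient, which is exactly the problem Cronbach's alpha addresses.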
Internal Consistency - Cronbach’s Alpha
- Solves the problem of multiple split-half results
- Reports the average of all possible split-half reliabilities
- Increasing the number of related items on a scale increases its internal consistency
- Only remove items if doing so substantially improves Cronbach’s Alpha
- Estimates of 0.7 or above are generally considered good
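Cronbach's alpha can be computed directly from the standard formula α = k/(k−1) · (1 − Σ item variances / variance of totals); the responses below are invented for illustration.

```python
import statistics

def cronbach_alpha(data):
    """Cronbach's alpha: k/(k-1) * (1 - sum of item variances / variance of totals)."""
    k = len(data[0])  # number of items
    item_vars = [statistics.pvariance([row[i] for row in data]) for i in range(k)]
    total_var = statistics.pvariance([sum(row) for row in data])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Hypothetical responses: 6 people x 3 items that track each other closely,
# so internal consistency should be high.
data = [
    [4, 5, 4],
    [2, 2, 3],
    [5, 5, 5],
    [1, 2, 1],
    [3, 3, 4],
    [4, 4, 4],
]
print(round(cronbach_alpha(data), 2))  # high alpha: items rise and fall together
```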
Acceptable Internal Consistency
- Estimates of 0.7 or over are good
- Higher reliability estimates needed for diagnostic tools
Effect of Unreliable Measures - Attenuation
- If a measure is unreliable, its correlations with other variables are attenuated
- That is, the observed correlations are weaker than the true correlations
- e.g. even if the true correlation between two variables is 0.6, unreliable measures will produce a smaller observed correlation
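The classical attenuation formula, r_observed = r_true × √(r_xx · r_yy), shows how unreliability shrinks a true correlation of 0.6 (the `attenuated` helper is just an illustration, not course material).

```python
import math

def attenuated(r_true, rel_x, rel_y):
    """Observed correlation = true correlation * sqrt(reliability_x * reliability_y)."""
    return r_true * math.sqrt(rel_x * rel_y)

# A true correlation of 0.6 between two constructs, measured with
# scales of varying reliability:
print(round(attenuated(0.6, 1.0, 1.0), 2))  # perfectly reliable -> 0.6
print(round(attenuated(0.6, 0.7, 0.7), 2))  # typical reliability -> 0.42
print(round(attenuated(0.6, 0.4, 0.4), 2))  # poor reliability -> 0.24
```

With poor measures the 0.6 relationship shows up as a weak 0.24, which is why unreliable measures make relationships between constructs difficult to detect.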
Why is reliability important?
- Unreliable measures cannot be trusted.
- Unreliable measures make relationships between constructs difficult to detect.
- The larger the error, the less reliable the measure
- Observed Score = True Score + Error
Test-Retest Reliability
- If construct is stable, we should arrive at the same result over multiple tests
- Assumes life circumstances haven’t changed dramatically between tests
- Both sets of data need to come from the same respondents
- As the time between tests increases, reliability decreases
- A test-retest coefficient is not interpretable without knowing the interval between tests
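A test-retest coefficient is simply the correlation between the two sets of scores; a sketch with made-up scores from the same eight respondents at two time points:

```python
import statistics

def pearson(x, y):
    """Pearson correlation, computed from scratch."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = statistics.pstdev(x) * statistics.pstdev(y) * len(x)
    return num / den

# Hypothetical scale scores from the same 8 respondents, two weeks apart.
time1 = [3.2, 4.1, 2.5, 4.8, 3.9, 2.1, 4.4, 3.0]
time2 = [3.4, 4.0, 2.8, 4.6, 3.7, 2.4, 4.5, 2.9]

r_test_retest = pearson(time1, time2)
print(round(r_test_retest, 2))  # well above the conventional 0.7 benchmark
```

Both columns must come from the same respondents, in the same order, or the pairing (and the coefficient) is meaningless.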
Acceptable Test-Retest Reliability
Test-Retest coefficients of 0.7 or over are generally considered fine.
Need to be interpreted with consideration to:
* The length of the test-retest interval.
* The stability of the characteristic being measured (e.g., a trait vs a state).
* The internal consistency of the measure.
* The impact of practice effects (use alternate forms?)
Test-Retest Practice Effects
- Ability tests are troublesome because people may remember the answers after the first test
- Alternate forms of the test are a good idea but difficult to create
Validity
- Does a test measure what it is intended to measure?
- Measure can be reliable but not valid
- Not all validity evidence involves statistics
- Measures are valid for specific purposes
- Validity is not an inherent property of a measure
- Evidence for validity builds up over time
- Different authors discuss validity in different but overlapping ways
- This can be confusing
Four Types of Validity
- Face
- Content
- Criterion
- Construct
Face Validity
- How much does the measure look like it really measures what it says it’s measuring?
e.g. appear to measure extraversion/sociability
Content Validity
- Samples the full range/breadth of a factor
- Is it covering everything we think it should be covering
- Items cover all aspects of extraversion
e.g., talkative AND adventurous AND active etc
Criterion Validity
- Relates to physiological or behavioural manifestations of the factor
- Measured in the present and in the future
- Concurrent and predictive forms
e.g. Measure predicts current baseline cortical arousal levels, as well as future sociable behaviour at parties, work, school etc
Construct Validity
- Related to other convergent measures of the same factor
- Not related to discriminant measures
- Does it behave consistently with theoretical predictions
e.g. Sales people have higher scores on this measure than accountants.
Face Validity - more
- How much a test looks like it is measuring what it says it is measuring.
- Crude, but can influence motivation to take the test seriously etc.
Content Validity - More
- How much a measuring instrument covers a sample of the behaviours to be measured
- e.g. Extraversion Captures:
Active = No
Assertive = No
Energetic = No
Outgoing = Yes
Talkative = Yes
Gesturally expressive = No
Gregarious = Yes
Criterion Validity - more
- How much scores on a measure predict a behavioural or physiological criterion
- Is it related to things we expect it to be related to?
Two types of Criterion Validity
- Concurrent
- Predictive
Concurrent Validity
- A type of Criterion Validity
- Correlation between scores on an Extraversion measure and baseline physiological arousal
- Both measures taken at the same time.
- Introverts have higher baseline arousal than extraverts
Predictive Validity
- A type of Criterion Validity
- A measure has good predictive validity if it can predict future behaviour or preferences
e.g. An Extraversion measure can predict whether you prefer to study alone or with others
Construct Validity
- How much a measure actually measures the construct it claims to measure
- Conceptualises a theory perspective
e.g. How well does an IQ test measure intelligence?
How well does a choice between a toy gun and a doll reflect aggression?
- The more abstract the construct, the harder it is to establish construct validity
Two Main Types of Construct Validity
- Convergent
- Discriminant
Convergent Validity
- A type of Construct Validity
- Strong relationship between the test and another similar test
- e.g. Allen Extraversion Questionnaire & Extraversion subscale from the NEO-PI
Discriminant Validity
- A type of Construct Validity
- No or weak relationship between the current test and a measure of something different
e.g. Allen Extraversion Questionnaire & Beck Depression Inventory
- Can also be examined using factor analysis and by testing theoretical predictions