testing, validity, reliability Flashcards
measurement:
administering a test for the purpose of obtaining a score or information
evaluation:
interpreting a score, placing a value on it, and making a decision based on the results
construct:
theoretical representation of a characteristic; a concept that can be defined but NOT directly measured
ex: motivation, teamwork
test battery:
series of tests put together to answer a question
examples of constructs in kinesiology
- body image
- balance
- fatigue
- power
- body composition
- metabolic health
etc.
5 aspects to consider for ethical practice:
- fairness
- privacy and confidentiality
- data ownership and protection
- safety
- participant experience
consequential validity:
consequences of test use and interpretation, e.g. group differences in test scores due to bias or to measuring skills that aren't part of the construct
construct underrepresentation:
construct defined too narrowly, resulting in failure to include all important components
test bias:
meanings and implications of test scores are different for a particular subgroup than for the rest of the test-takers
4 ways to mitigate risk/ ensure client is fully informed:
- emergency action plan
- informed consent
- pre-participation screening
- MSK screening
5 stages of testing
- test selection
- preparation
- administration
- data processing
- decision-making and feedback
5 important characteristics of test selection
reliability/ objectivity
validity
sensitivity
practicality
participant burden
4 challenges in test selection
- test protocols evolve
- tests may be experimental, not routine
- field tests may be more valid for test construct but provide unreliable data
- subjects may not listen/ follow/ remember instructions
4 main sources of measurement error:
- test
- test taker
- tester
- environment
ways to reduce measurement error:
calibrate equipment
clear pretest instructions
control environment
use a valid and reliable test
train testers
provide warm up
administration: typical testing sequence (8)
- informed consent, preparticipation screening
- non-fatiguing tests ex: body comp, flexibility
- agility
- max power and strength
- sprint
- local muscular endurance
- fatiguing anaerobic power/ capacity
- aerobic power/ capacity (after sufficient rest)
why interpret data/ results
- to indicate client’s performance compared to norms/ set of standards
multiple trials: can use best score or mean score
criterion score:
measure used to indicate person’s ability:
- should be most accurate measure of construct
obtained by:
- participation
- known valid criterion
- expert judges
ex: pass/fail, proficient/ not
reliability:
consistency/ repeatability of observation; degree to which repeated measurements of trait are reproducible under same conditions (test is free from measurement error)
validity:
“most important test characteristic”
degree to which a test measures what it’s supposed to:
- reliability
- relevance
- appropriateness of score
can a test be valid but not reliable?
no
3 stage process of validity:
- definitional (what should test measure)
- confirmatory (how well does test measure construct)
- theory-testing (do results match up w definition)
logical validity:
does test measure construct of interest?
- assessed by interviews w experts in subject
*subjective, weak type of evidence
ex: single leg static balance
construct validity:
degree to which test measures hypothetical construct, typically through comparison of results to behaviour
- not best form of evidence
- usually abstract ex: generosity
- statistical procedures to confirm theory
ex: would vert jump or stair climb test best measure power in volleyball
3 types of construct validity:
- known differences (test to see if actual differences exists b/n populations)
- convergent (2+ tests measure the same construct) and discriminant (2+ tests measure different constructs)
- factor analysis
criterion validity:
relationship b/n scores on test that measure construct and recognized standard/ criterion
ex: run test and compare to gold standard
- use 1+ statistical test
- r >= 0.80 (positive or negative)
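The criterion-validity coefficient is a correlation between test and criterion scores. A minimal sketch of computing Pearson's r (all scores here are hypothetical, made up for illustration):

```python
from statistics import mean, stdev

def pearson_r(x, y):
    """Sample Pearson correlation between test scores and criterion scores."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    return cov / (stdev(x) * stdev(y))

# hypothetical field-test scores vs. gold-standard criterion scores
test_scores = [30, 35, 40, 45, 50]
criterion_scores = [31, 36, 39, 46, 49]
r = pearson_r(test_scores, criterion_scores)
# |r| >= 0.80 would suggest acceptable criterion validity
```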
examples of suitable criterion measures
- expert judge ratings
- tournament standings
- predetermined criteria
- future outcome
criterion test:
DIRECT measure of fitness component
ex: aerobic fitness, strength, etc.
or
INDIRECT protocol
ex: body composition, anaerobic power
concurrent criterion validity:
criterion is measured at ~the same time as the alternate measure and the scores are compared
ex: those w/ fastest 1.5 mile run should have highest VO2 max scores
predictive criterion validity:
new protocol is subsequently used to predict performance - criterion measured some time later
ex: equation to predict VO2 max from 1.5 mile run should equal subject’s actual VO2 max score
- often uses regression analysis
what does regression analysis allow you to predict?
a continuous dependent variable from a number of independent variables
assumptions:
- subject-to-predictor ratio of at least 5:1
- data is normally distributed
- linear relationship b/n variables
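Predictive criterion validity via simple linear regression can be sketched as follows: fit a line to (run time, measured VO2 max) pairs, then use it to predict VO2 max from a new run time. The data below are hypothetical, invented for illustration:

```python
from statistics import mean

def fit_line(x, y):
    """Least-squares slope and intercept for simple linear regression."""
    mx, my = mean(x), mean(y)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return slope, my - slope * mx

# hypothetical data: 1.5-mile run time (min) vs. measured VO2 max (ml/kg/min)
times = [9.0, 10.0, 11.0, 12.0, 13.0]
vo2 = [55.0, 50.0, 46.0, 41.0, 37.0]
slope, intercept = fit_line(times, vo2)

# predict VO2 max for a new subject with a 10.5-min run
predicted = slope * 10.5 + intercept
```

Predictive validity would then be judged by how closely such predictions track subjects' actual VO2 max scores measured later.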
can you have reliability without validity?
yes
stability reliability:
measured w same instrument on 2 separate occasions
- scores shouldn’t really change
- beware of maturation/ practice effects
3 factors contributing to low stability reliability:
- test taker (may be injured/ fatigued/ trained to improve)
- test (different measuring instrument, issue w/ calibration)
- tester (may be difficult to standardize if test requires judgement)
internal-consistency reliability:
measures collected in single day/ session - at least 2 trials of test
ex: skin folds
- consistent scoring by test takers throughout a test or from trial to trial
differences between stability and internal consistency reliability:
time factor: cognitive learning will occur between administrations
changes in day to day performance:
- stability = unaffected
- internal consistency = major source of error
coefficients are not comparable
when is a test considered objective?
when there is consistency b/n 2+ judgements of the same performance and the scorer's personal opinion and bias are eliminated
objectivity depends on (3):
1. competency of judges
2. clarity of scoring system
3. degree to which judges can assign scores
when is ICC used and what is the required R value for excellent reliability? below avg reliability?
“intraclass correlation coefficient”
used w small samples w/ data from 2+ sessions
excellent: = or > 0.75
below avg = < 0.40
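One common form of the ICC is the one-way model, ICC(1,1) = (MSB - MSW) / (MSB + (k-1)·MSW), where MSB/MSW are the between- and within-subjects mean squares and k is the number of trials. A minimal sketch (subject scores are hypothetical; other ICC forms exist and give different values):

```python
from statistics import mean

def icc_oneway(scores):
    """One-way ICC(1,1). scores = list of per-subject trial lists."""
    n = len(scores)      # number of subjects
    k = len(scores[0])   # trials/sessions per subject
    grand = mean(v for row in scores for v in row)
    subj_means = [mean(row) for row in scores]
    msb = k * sum((m - grand) ** 2 for m in subj_means) / (n - 1)
    msw = sum((v - mean(row)) ** 2 for row in scores for v in row) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)

# hypothetical: 4 subjects, 2 sessions each
icc = icc_oneway([[10, 11], [14, 15], [20, 19], [25, 26]])
# icc >= 0.75 would be rated excellent by the criterion above
```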
coefficient of variation (Cv):
degree of variation b/n testing trials in participant’s repeated measurements
- used to compare variability of measurements
how to calculate Cv?
acceptable Cv %?
Cv = (standard deviation)/(mean) x 100
acceptable: Cv < 10% (some sources use 5%)
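The Cv formula above in code, using hypothetical repeated-trial data:

```python
from statistics import mean, stdev

def coefficient_of_variation(trials):
    """Cv (%) = standard deviation / mean x 100, across repeated trials."""
    return stdev(trials) / mean(trials) * 100

# hypothetical repeated vertical-jump heights (cm) for one participant
jumps = [41.0, 42.5, 40.5, 41.5]
cv = coefficient_of_variation(jumps)
# cv under 10% would be considered acceptable trial-to-trial variation
```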
when can we expect reliability?
- testing environment favourable for good performance
- people are motivated, ready to be tested, informed, familiar
- person administering test is trained & competent
what might happen if same test is done on repeated days?
- familiarization and cognitive learning
- low stability reliability
- increased time b/n tests may produce more reliable response
why and when do we calibrate?
to confirm accuracy of measurement
at least every 6 months (unless otherwise indicated by manufacturer)
What does a Bland-Altman plot show? can be used to estimate?
- shows level of agreement b/n 2 methods: plots the difference between the measurements against the mean of the measurements
- used to estimate reliability AND establish concurrent validity
- points clustered tightly around the mean-difference (bias) line = better agreement
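The numbers behind a Bland-Altman plot are the bias (mean difference) and the 95% limits of agreement (bias ± 1.96 SD of the differences). A minimal sketch with hypothetical VO2 max scores from two methods:

```python
from statistics import mean, stdev

def bland_altman(a, b):
    """Bias and 95% limits of agreement between two measurement methods."""
    diffs = [x - y for x, y in zip(a, b)]
    bias = mean(diffs)
    sd = stdev(diffs)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd  # bias, lower LoA, upper LoA

# hypothetical VO2 max scores: lab criterion vs. field estimate (ml/kg/min)
lab = [48.0, 52.0, 45.0, 60.0, 55.0]
field = [47.0, 53.5, 44.0, 61.0, 54.0]
bias, lower, upper = bland_altman(lab, field)
# narrow limits of agreement around a bias near zero = good agreement
```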