Mid Tri Exam Flashcards
Reliability
the consistency or repeatability of measures.
Validity
Are we measuring what we are trying to measure?
Inter-rater or inter-observer reliability
Used to assess the degree to which different raters/observers give consistent estimates of the same phenomenon - agreement between the scores of two or more independent observers or judges.
Especially important when measures are subjective
Test-retest reliability
Used to assess the consistency of a measure from one time to another.
The correlation between scores across two administrations of the measure (take the test once, then take it again at a later time)
Parallel-forms reliability
Used to assess the consistency of the results of two tests constructed in the same way from the same content domain
- Split-half reliability (splitting a test into 2 halves; if the test is reliable, scores on the two halves should be similar)
- Item-total correlation (correlating each component with the total score; e.g. comparing the mid-tri exam to the overall mark might not be the best comparison, because someone might do poorly on exams but well on lab reports)
Internal consistency reliability
Used to assess the consistency of results across items within a test
Cronbach’s alpha - the average correlation among all possible pairs of items (a value of .80 or above is desirable)
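A minimal Python sketch (hypothetical questionnaire data) showing how Cronbach's alpha can be computed from a participants-by-items score matrix, using the standard variance-based computational formula:

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for a participants x items score matrix."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]                         # number of items
    item_vars = items.var(axis=0, ddof=1)      # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)  # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Made-up 5-participant, 4-item questionnaire
scores = [[3, 4, 3, 4],
          [2, 2, 3, 2],
          [5, 4, 5, 5],
          [1, 2, 1, 2],
          [4, 4, 4, 3]]
print(round(cronbach_alpha(scores), 2))  # aim for .80 or above
```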
Reliability puts a ceiling on validity
If reliability is .70 validity can only reach .70
It can be reliable without being valid
It can’t be valid without being reliable
Construct validity
Measurement validity - a construct refers to a behaviour or process that we are interested in studying; construct validity is the extent to which our measures and manipulations actually capture that construct
Both measures and manipulations must be valid!
Manipulations can be
instructional
environmental
stooges
instructional manipulations
experimental conditions defined by what you tell participants
environmental manipulations
stage an event, present a stimulus, induce a state
stooges manipulations
use fake participants (stooges) to alter the experimental conditions
Convergent validity
Do scores on the measure correlate with scores on other similar measures related to the construct
Relates to the degree to which the measure converges on (is similar to) other constructs that it theoretically should be similar to
Discriminant (divergent) validity
Do scores on the measure have low correlations with scores on other different measures that are unrelated to the construct
Relates to the degree to which the measure diverges from (is dissimilar to) other constructs that it should not be similar to
Face validity
At face value, does the measure seem to be a good translation of the construct?
Does it make sense?
Ask experts in the field
Content validity
Does the measure assess the entire range of characteristics that are representative of the construct it is intending to measure
Criterion validity
concurrent - Do scores on the measure distinguish participants on other variables that we would expect to be related to it (depressives from non-depressives, criminals from non-criminals)
predictive - Are scores on the measure able to predict future outcomes (attitudes, behaviours, performance)
How to Correct manipulations
Reduce random error (replicate procedure)
Reduce experimenter bias
Reduce participant bias
Ensure manipulation has construct validity
Do a manipulation check - ask participants about various aspects
External validity
extent to which the results can be generalised to other relevant populations, settings or times
Studies have good external validity when results can be replicated
Ecological validity
Population generalisation
Environmental generalisation
Temporal generalisation
Ecological validity
The extent to which the results can be generalised to real-life settings
Population generalisation
Applying the results from an experiment to a group of participants that is different and more encompassing than those used in the original experiment
Environmental generalisation
Applying the results from an experiment to a situation or environment that differs from that of the original experiment
Temporal generalisation
Applying the results from an experiment to a time that is different from the time when the original experiment was conducted
Internal Validity
ability to draw conclusions about causal relationships from the results of a study
The extent to which we can say that any effects on the DV were caused by the IV
The elimination of alternative explanations for the observed relationships
Inferences of cause and effect require 3 elements for strong internal validity
Co-variation
Temporal precedence
Elimination of alternative explanations
Threats to internal validity
Selection bias
Maturation
Statistical Regression
Mortality
History
Testing
Practice Effect
Instrumentation
observer reactivity
Social desirability
Controlling these threats
Randomly allocate participants
Treat all conditions equally except for intended IV manipulations
Use appropriate control conditions
Use double blind studies where possible
Experimenter bias
errors in a research study due to the notions or beliefs of the experimenter
Selection bias
A threat to internal validity that can occur if participants are chosen in such a way that groups are not equal before the experiment
Differences after the experiment may reflect differences that existed before the experiment began plus a treatment effect
Maturation
changes in participants during an experiment, or between measurements of the DV, due to time (age, cognitive development)
Permanent - (age, biological growth, cognitive development)
Temporary - (fatigue, boredom, hunger)
Most commonly naturally occurring developmental processes (especially in children)
Statistical Regression
regression towards the mean
Participants with extreme scores on the first measurement of the dv tend to have scores closer to the mean on the second measurement
Subsequent scores are still likely to be extreme in the same direction but not as extreme
When you have extreme scores, it is difficult to maintain that degree of extremity over repeated measures
If participants are selected on the basis of extreme scores, regression to the mean is always going to be a possible explanation for higher or lower scores on a repeated test
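A small simulation can make regression to the mean concrete. This sketch (made-up numbers) generates two test scores that share a true score plus independent random error, then shows that participants selected for extreme first scores average closer to the mean on the retest:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
true_score = rng.normal(100, 10, n)        # stable underlying ability
test1 = true_score + rng.normal(0, 10, n)  # observed score = true score + error
test2 = true_score + rng.normal(0, 10, n)

extreme = test1 > np.percentile(test1, 95)  # select on extreme first scores
print(test1[extreme].mean())  # far above the mean of 100
print(test2[extreme].mean())  # still high, but closer to 100 on the retest
```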
Mortality
attrition (premature dropouts and differential dropout across experimental conditions)
Relates to premature dropouts
Differential dropout rates can occur if the intervention is unpleasant
History
outside events that may influence participants in the experiment in a repeated measures design
History can include major events like terrorist attacks or smaller personal changes like joining a gym, changing jobs
If they are relevant to the study in some way these events can influence the DV score
Prior Testing
Prior measurements of the DV may influence the results
Measuring the DV can cause a change in the DV
Participant becomes aware of the study aims
Practice Effect
When a beneficial effect on a DV measurement is caused by previous experience with the test
Instrumentation
changes due to the measuring device
observer reactivity
changes to behaviour when being watched
Social desirability
changes that arise because people want to present themselves in the best possible light
Has homogeneity of variance been met, and what does this mean for the analyses?
Homogeneity of variance has been met so analyses should continue.
one-way ANOVA
used for comparing several means, in situations where we have more than two conditions
Assumptions of ANOVA
Levels of measurement
Random sampling
Independence of observations
Normal distribution
Homogeneity of variance
Levels of measurement
Dependent variable must be measured at the interval or ratio level
Random sampling
Participants must be obtained using a random sample from the population of interest
Independence of observations
The observations that make up the data must be independent of one another (one person can’t contribute multiple observations; every score must come from a different participant)
Violation of this assumption is very serious as it dramatically increases the type 1 error rate
Normal distribution
The populations from which the samples are taken are assumed to be normally distributed
Need to check this for each group separately in one-way ANOVA
Homogeneity of variance
Samples are obtained from populations of equal variances
ANOVA is fairly robust to violations of this assumption, provided the group sizes are reasonably similar
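A minimal sketch of how these checks might look in Python with scipy (hypothetical group scores): Levene's test for homogeneity of variance, followed by the one-way ANOVA itself:

```python
from scipy import stats

# Made-up scores for three independent groups
g1 = [5, 7, 6, 8, 7]
g2 = [6, 9, 8, 10, 9]
g3 = [4, 5, 5, 6, 4]

# Levene's test: a non-significant p suggests homogeneity of variance holds
lev_stat, lev_p = stats.levene(g1, g2, g3)
print(f"Levene p = {lev_p:.3f}")

# One-way ANOVA: F reported to 2 decimal places, as the notes specify
f_stat, p = stats.f_oneway(g1, g2, g3)
print(f"F = {f_stat:.2f}, p = {p:.3f}")
```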
F ratio is reported to how many decimal places
2
Family wise error
the probability of at least 1 false positive when multiple comparisons are being tested.
Types of multiple comparisons
Planned comparisons (contrasts) prior to study
Post-hoc analyses post study
Planned comparisons (contrasts)
prior to study
A priori
Break down variance into component parts
Test specific hypotheses
rules
Once a group has been singled out - it cannot be used in another contrast
Each contrast must only compare 2 chunks of variation
There should always be 1 less comparison than the number of groups (number of contrasts = k-1)
If the rules are met we are doing:
Orthogonal contrasts - compare unique chunks of variance
If not:
Non-orthogonal contrasts
Overlap or use the same chunks of variance in multiple comparisons
Require careful interpretation
Lead to an increased type 1 error rate
Post-hoc analyses
post study
Compare all groups using stricter alpha values (this reduces type 1 error rate)
Polynomial contrasts
Only used when the IV is ordinal
Linear
Quadratic
Cubic
Quartic
Standard contrasts
Orthogonal: Helmert and difference
Non-orthogonal: deviation, simple, repeated
Helmert
Compare each category to the mean of subsequent categories (based on the order they are coded in SPSS, which might be alphabetical)
With 3 groups
1 vs 2 & 3
2 vs 3
With 4 groups
1 vs 2 & 3 & 4
2 vs 3 & 4
3 vs 4
Difference planned contrast
Compare each category to the mean of the previous categories
With 3 groups
3 vs 2 & 1
2 vs 1
With 4 groups
4 vs 3 & 2 & 1
3 vs 2 & 1
2 vs 1
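For a concrete check of orthogonality, the sketch below (assuming the usual 3-group weight codings for the contrasts above) writes the Helmert and difference contrasts as weight vectors and confirms that their rows have a zero dot product, the defining property of orthogonal contrasts:

```python
import numpy as np

# Helmert weights for 3 groups
helmert = np.array([
    [2, -1, -1],   # group 1 vs groups 2 & 3
    [0,  1, -1],   # group 2 vs group 3
])

# Difference weights for 3 groups
difference = np.array([
    [-1, -1, 2],   # group 3 vs groups 2 & 1
    [-1,  1, 0],   # group 2 vs group 1
])

# Orthogonal contrasts have a zero dot product between weight rows
print(np.dot(helmert[0], helmert[1]))        # 0 -> orthogonal
print(np.dot(difference[0], difference[1]))  # 0 -> orthogonal
```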
Post-hoc tests
Involve comparing all possible differences between pairs of means
A good approach for exploratory research, or where there are no predefined specific hypotheses
Simplest post-hoc test is the Bonferroni correction
Bonferroni alpha =
alpha / number of tests
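A quick worked example with made-up numbers: with an overall alpha of .05 and 3 comparisons, each comparison is tested at a stricter level:

```python
alpha = 0.05
n_tests = 3
bonferroni_alpha = alpha / n_tests  # each comparison must beat this value
print(round(bonferroni_alpha, 4))   # 0.0167
```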
Tukey’s HSD
Honestly significant difference
The cumulative probability of a type 1 error never exceeds the specified level of significance (p<.05)
Supplies a single critical value (HSD) for evaluating the significance of each pair of means
The critical value (HSD) increases (with each additional group mean)
It becomes more difficult to reject the null hypothesis as a greater number of group means are compared
If the absolute (obtained) difference between two means exceeds the critical value for HSD, the null hypothesis for that pair of means can be rejected
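As an illustration (hypothetical data), statsmodels provides pairwise_tukeyhsd, which runs every pairwise comparison against the single HSD criterion described above:

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Made-up scores for three groups of five participants
scores = np.array([5, 7, 6, 8, 7, 6, 9, 8, 10, 9, 4, 5, 5, 6, 4])
groups = np.array(['A'] * 5 + ['B'] * 5 + ['C'] * 5)

# Compares every pair of group means, controlling the family wise error rate
result = pairwise_tukeyhsd(scores, groups, alpha=0.05)
print(result.summary())
```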
Statistical power is difficult to calculate but important to know
Hypothesis testing
Estimating statistical power
Factors that influence power
Effect Size formulas
Eta squared
Omega Squared
r
Cohen’s d
Statistical power means
the probability that a study will detect an effect when there is an effect there to be detected
calculating Statistical power
Power = 1 - beta, where beta is the probability of a type 2 error
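A minimal sketch using statsmodels (assumed design: 3 groups, 60 participants total, alpha = .05, and a medium effect of Cohen's f = 0.25) to estimate the power of a one-way ANOVA:

```python
from statsmodels.stats.power import FTestAnovaPower

# Probability of detecting the effect given this design
power = FTestAnovaPower().power(effect_size=0.25, nobs=60,
                                alpha=0.05, k_groups=3)
print(round(power, 2))
```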
Factors affecting statistical power
Alpha level
Error variance
Sample size
Effect size
Alpha level
A larger (more lenient) alpha shrinks beta; if beta is smaller the statistical power will be larger
Error variance
With the same group means, lower error variance gives greater power
Decreasing the amount of variability increases the chance we will find an effect if there is one
Sample size
Works similarly to error variance
Increasing the sample size will decrease the error variance
By testing more people we are able to better describe the distribution
At around 30 people the central limit theorem kicks in and the gain from each extra participant diminishes, but 30 is normally a good baseline for understanding a population
Effect size
The magnitude of difference between our samples
Larger effect size - the means are further apart
Power will increase as we increase effect size
Effect size measurements
Main effect (ANOVA)
Eta Squared
Omega Squared
Multiple comparisons (planned contrast or post-hoc)
r
Cohen’s d
Eta squared
Used for main effect
η² = SS_between / SS_total
Omega Squared
Used for main effect
Most accurate measure of effect size for main effect
ω² = (SS_between - (df_between × MS_within)) / (SS_total + MS_within)
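Both effect sizes can be computed straight from an ANOVA summary table. This sketch uses made-up SS, df and MS values; note that ω² comes out smaller than η², as expected from the less biased formula:

```python
# Hypothetical ANOVA summary values
ss_between, ss_total = 52.0, 214.0
df_between, ms_within = 2, 6.75

eta_squared = ss_between / ss_total
omega_squared = (ss_between - df_between * ms_within) / (ss_total + ms_within)

print(round(eta_squared, 3))    # 0.243: proportion of total variance explained
print(round(omega_squared, 3))  # 0.174: the less biased estimate is smaller
```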
Effect sizes for planned contrasts
r
Used for follow-up tests
Particularly useful for planned contrasts
r = √(t² / (t² + df))
Effect sizes for Post-hoc tests
Cohen’s d
Used for follow up tests
Can be used for tukey’s post-hoc tests
s_pooled = √(((n1 - 1)s1² + (n2 - 1)s2²) / (n1 + n2 - 2))
d = (M1 - M2) / s_pooled
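A small Python sketch (hypothetical t, df, means, SDs and group sizes) implementing both follow-up effect sizes from the formulas above:

```python
import math

def contrast_r(t, df):
    """Effect size r for a planned contrast from its t statistic."""
    return math.sqrt(t**2 / (t**2 + df))

def cohens_d(m1, m2, s1, s2, n1, n2):
    """Cohen's d using the pooled standard deviation."""
    s_pooled = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    return (m1 - m2) / s_pooled

print(round(contrast_r(t=2.5, df=27), 2))               # ~0.43
print(round(cohens_d(10.2, 8.1, 2.0, 2.3, 15, 15), 2))  # ~0.97
```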
Calculating required sample size
Using power and effect size to calculate sample size
Before running an experiment we want to ensure that if there is an effect present, we will observe it (power)
Part of that will rely on testing sufficient numbers of participants (see effect of sample size on power)
We can use effect size and desired power to estimate how many people we will need to test
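A minimal sketch with statsmodels (assumed inputs: 3 groups, alpha = .05, desired power = .80, medium effect Cohen's f = 0.25) that solves for the total sample size:

```python
from statsmodels.stats.power import FTestAnovaPower

# Leave nobs unspecified and supply the other inputs to solve for total N
n_total = FTestAnovaPower().solve_power(effect_size=0.25, alpha=0.05,
                                        power=0.80, k_groups=3)
print(round(n_total))  # total participants needed, split across the 3 groups
```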