Test 2 Flashcards


Why do we use statistics?


 We use statistics as checks on our own
biases and to help us better answer our research question (RQ),
to understand the shape of the data, and to validate our intuitions about patterns within the data.

Do music lessons make kids smarter?
Causal claim (Mozart Effect)

Schellenberg (2004)
 Method:
o Over 36 weeks, four groups of 6-year-olds had
music lessons added to their course work.
Children were taught either keyboard,
voice, drama or no lessons by qualified
 Why use three different treatment groups?
o No lessons is a control group but having
three other experimental conditions helps
us gain a better understanding of what it is
about music lessons which cause this
effect on IQ.
o Comparing drama to music lessons helps
understand if it’s something specific about
music or just being creative.
o Comparing the two music groups shows whether it is
music in general or a specific type of
 Matching:
o He matched the four groups on extraneous
variables: age, family income (SES), and IQ
before lessons. This is essentially a pre-test/
post-test design, which allows us to
compare IQ before and after the
experimental manipulation (lessons).
 Results:
o IQ gains were greater for
music lessons (keyboard and voice) relative
to control and drama lessons. This
suggests that it is something about music
which increases IQ and not just creative
o How do we know if these effects are
meaningful? We need to use statistics to
find out whether these between-group differences
are statistically significant and not just due to
sampling error.


What do we need random assignment for?


> to create equivalent groups
> to meet assumptions of t-tests and ANOVA
> to rule out confounds (condition of


If we are comparing three group means, why not run t-tests?


We would need multiple t-tests, and the more tests we run, the more we inflate our false positive rate.
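
A rough sketch of how the false positive rate grows with the number of tests. It assumes the tests are independent and each uses alpha = .05; pairwise t-tests on the same groups are not fully independent, so treat the numbers as an approximation. The function names are illustrative only.

```python
def pairwise_comparisons(n_groups: int) -> int:
    """Number of distinct pairs of groups that could be compared."""
    return n_groups * (n_groups - 1) // 2


def familywise_error_rate(alpha: float, n_tests: int) -> float:
    """Chance of at least one false positive across n independent tests."""
    return 1 - (1 - alpha) ** n_tests


for n_groups in (3, 4, 5):
    k = pairwise_comparisons(n_groups)
    rate = familywise_error_rate(0.05, k)
    print(f"{n_groups} groups -> {k} t-tests -> familywise error ~ {rate:.3f}")
```

With three groups this gives roughly .14, close to the ".05 x number of tests" rule of thumb.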


Analysis of Variance (ANOVA)


 F statistic: between-groups variance (how
groups differ from each other) divided by
within-group variance (how people differ from
others in their group).
 Compares the effect due to the IV to the variance
which naturally occurs in your population.
 If the null hypothesis is not true, the sample
distributions for each group should overlap
little: the means should be quite
different and give us a big F statistic.
 We want between-group variance to be
high and within-group differences (noise)
to be low.


How do we calculate variance (s²)?


 Null Hypothesis: All kids drawn from the
same population (i.e., 36 participants all
from the same population and no effect of
the IV on groups).
 Variance: calculate each participant’s
distance from the mean. Since some will
be + and some -, they cannot just be
added together because they will sum to zero.
Instead, we square each distance (which
removes the - sign), add the squared
distances together, and divide by n - 1.
Bigger number = more spread from the mean;
a small variance indicates that scores fall
close to the mean.
 Total Sums of Squares (SStotal)
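
A minimal sketch of the variance calculation described above. The scores are made up for illustration; the result is checked against Python's built-in `statistics.variance`, which also divides by n - 1.

```python
import statistics


def sample_variance(scores):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(scores)
    mean = sum(scores) / n
    deviations = [x - mean for x in scores]      # some +, some -, they sum to 0
    sum_of_squares = sum(d ** 2 for d in deviations)  # squaring removes the sign
    return sum_of_squares / (n - 1)


scores = [2, 4, 4, 6, 9]  # hypothetical data
print(sample_variance(scores))
print(statistics.variance(scores))  # same value from the standard library
```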


SSbetween: Total Sums of Squares Between Groups Variance

SSwithin groups: Sums of Squares within


SSbetween: Total Sums of Squares Between Groups Variance
 Compare the mean of each group to the
overall grand mean
 Subtract the grand mean from each group
mean, square the difference, multiply by the
group size, and add these up to see how much
the groups differ from the overall mean.
 If the group means are all different from one
another then SSbetween will be larger. If
they’re very similar it will be small,
suggesting no effect of the IV.

SSwithin groups: Sums of Squares within
 We want within group variance to be small
 How does each participant differ from their
group mean?


Sums of Squares and Mean Squares


§ In ANOVA we calculate variance using a
technique called sums of squares (SS).
o SStotal = how much each participant varies
from the overall mean (squared)
o SSbetween = how much each group varies
from the overall mean (squared).
o SSwithin = how much each person varies
from their own group mean (squared).
Spread of data within one group.
o SStotal = SSb + SSw
§ Mean Squares (MS) are adjusted for n by
dividing SS by df
o dfbetween = #Groups -1
o dfwithin = #Participants - #Groups
§ MSbetween = SSb/dfb (Mean Square
Between groups = sums of squares
between divided by degrees of freedom
between).
§ MSwithin = SSw/dfw (mean square within
groups = sums of squares within divided by
degrees of freedom within).
§ F = MSb/MSw (F statistic = Mean Square
Between divided by Mean Square Within).
§ WARNING: MSw also called MSresidual or

Mean Squares
 The more people we have in each group,
the bigger the sums of squares will be.
 What we want to know is, on average, how
far people are from the mean (mean
square).
 Mean Squares (MS) are adjusted for n by
dividing SS by df
o dfbetween = #Groups -1
o dfwithin = #Participants - #Groups
 MSbetween = SSb/dfb
 MSwithin = SSw/dfw

*Two degrees of freedom in ANOVA (between = top/numerator; within = bottom/denominator).
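
A small sketch of the whole calculation, from sums of squares to F, for a one-way design. The three groups and their scores are invented for illustration (they are not Schellenberg's data), and the function name is an assumption.

```python
def one_way_anova(groups):
    """Return SS, df, MS and F for a list of groups (each a list of scores)."""
    all_scores = [x for g in groups for x in g]
    n_total = len(all_scores)
    grand_mean = sum(all_scores) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    # SSbetween: each group mean vs the grand mean, weighted by group size
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # SSwithin: each score vs its own group mean
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)
    # SStotal: each score vs the grand mean
    ss_total = sum((x - grand_mean) ** 2 for x in all_scores)

    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return {
        "ss_between": ss_between, "ss_within": ss_within, "ss_total": ss_total,
        "df_between": df_between, "df_within": df_within,
        "ms_between": ms_between, "ms_within": ms_within,
        "F": ms_between / ms_within,
    }


# Hypothetical scores for control, drama and music groups
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
result = one_way_anova(groups)
for name, value in result.items():
    print(name, round(value, 3))
```

In this toy data SStotal equals SSbetween + SSwithin, which is a handy check when computing by hand.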


ANOVA F Statistic Calculation:

F Distribution


§ F = MSbetween/MSwithin
§ We compare the F statistic to the F
sampling distribution, which tells us how
often the null hypothesis would produce an F
that big.
§ All F values are positive: the distribution
starts at 0 and lies entirely above it
(one-tailed; peaks just under one).
§ The bigger the F statistic, the less likely it is
to have been produced under the null hypothesis.
§ We want the between-group variance to be
bigger than the within-group variance for the F
statistic to be big.
§ An F of about 1 tells us that the between-group
variance is no bigger than the within-group
variance (consistent with no effect of the IV,
i.e. the null is true).
§ Critical region: if the null hypothesis is true,
it will produce an F statistic of 3.5 or bigger
less than 5% of the time.

F Distribution
 F is a sampling distribution of possible F
values if the null hypothesis is true.
 The exact size/shape of F depends on the
degrees of freedom
 If the groups differ from each other a lot
compared to how much people (or animals)
differ from others in their condition, you get
a large F.
 Reject the null if p < .05
 F is always positive
 No difference between one-tailed and two-
tailed F

Different df for different F distributions
§ df(x,y)
o X = number of groups – 1
o Y = total N – number of groups
§ In our example, we have 3 groups of 12 = 36
participants so the DF would be (2,33).
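
The df(x, y) rule can be written as a tiny helper. This is only a sketch; the function name is made up, and the group sizes come from the 3 groups of 12 example above.

```python
def anova_df(group_sizes):
    """Return (df_between, df_within) for a one-way between-subjects ANOVA."""
    n_groups = len(group_sizes)
    total_n = sum(group_sizes)
    return n_groups - 1, total_n - n_groups


print(anova_df([12, 12, 12]))  # 3 groups of 12 participants
```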


How does Jamovi treat subject variables in a quasi-experimental factorial ANOVA?


o A one-way ANOVA compares three or more group
means. Jamovi does not care whether the
grouping variable was manipulated or is a
subject variable; statistically, all that
matters is that you have three groups. In
experimental studies we
need the IV to be manipulated (not subject
o ANOVA is not JUST useful for
experimentalists; it is used any time you want
to compare three or more group means.
Experimentalists and non-experimentalists
both use it, and both use correlation or
regression as well.


Grand mean


It is useful because it corresponds to the null hypothesis: if the null is true, the 3+ groups are sampled from the same population, so the group means should all be close to the grand mean.


Recap about F Ratios


o We calculate the F statistic, a ratio of
between-group variance to within-group variance
(participants around their group mean). We want the
between-group variance to be bigger than
the within-group variance to get a big F
statistic, which makes us more likely to reject the null hypothesis.
o F = mean square between / mean square within.
Mean square between = SSbetween/dfbetween and
mean square within = SSwithin/dfwithin. Jamovi does this
all for you. We then use the F statistic to
identify its corresponding p-value by
comparing it to the F sampling
distribution, whose shape depends on the df (sample size
and number of groups). Under the null
hypothesis the groups come from the same
population and differences are due to
sampling error and not the IV. If F falls in the 5% rejection
region, a value that extreme would occur less than 5% of the
time under the null, so an effect of the IV is likely
to be present.
o Unlike t-tests, which have one df (N - #groups),
ANOVA has two (one for the numerator
and one for the denominator; df = x,y: number of groups
minus 1 and total N minus number of groups;
e.g., 2,33 for 3 groups of 12 = 36).


A significant F tells me that my groups differ, but not how they differ.


 All the F tells me is that there is a group
mean difference that is statistically
significant. It doesn’t tell me which ones
are different and in what direction. We
need to look at our descriptive statistics
and each group’s mean to find this out!
 Remember:
 I could do t-tests to compare each group
to each other group.
o 3 t-tests
o Probability of a false positive for each one
= .05
o Probability of a false positive in at least one of
them = ~ .15
o i.e., roughly .05 x the number of tests you run
(more precisely, 1 - .95^3 ≈ .14)
 Therefore, we use post hoc tests which
adjust for the number of tests you run.


Post-hoc tests


 Comparisons of pairs of means after finding
a significant F.
 Used when I have no hypothesis about
how the means might differ from each
other (2-tailed; no hypothesis in the
direction of the mean group difference[s]).
o Post-hocs are like t-tests to compare 2
means, but they have been adjusted to
correct for the increased chance of Type 1
error.
o They penalise you for running multiple tests by
being stricter on the significance value so that
after all of them are done the collective
false positive risk stays at about .05. Generally,
.05 divided by the number of tests you run.
 Note: we can do contrasts instead of post
hoc tests if we have a prediction of the
direction of the mean group difference
 Post-Hoc on IV (class):
o No correction = t-test comparison without
o Tukey (general fits most situations; not too
strict or too loose)
o Tick effect size (how big is the mean group
difference? This cannot be answered with the p-value!
Since each post hoc is a test of the difference
between two group means, the effect size we use is
Cohen’s d).

 df does not change (unlike a t-test, where
the df would be 22 = 12 x 2 - 2; ANOVA 33
= 12 per group x 3 - 3). Ptukey tells us the
significance of the group difference and
Cohen’s d tells us how big this difference is.
 These Cohen’s d values are big effect sizes. We
identify the direction of group differences by
looking at the data! Which group mean is
higher? Not the p-value or the +/- of Cohen’s d!
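
A sketch of the ".05 divided by the number of tests" idea (the Bonferroni approach). Tukey's correction, which the notes recommend in Jamovi, is computed differently and is usually less strict; it is not reproduced here. The comparison names and p-values are invented for illustration.

```python
def bonferroni_alpha(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold that keeps the overall risk near alpha."""
    return alpha / n_tests


def significant_after_correction(p_values: dict, alpha: float = 0.05) -> dict:
    """Flag which comparisons survive the stricter per-test threshold."""
    threshold = bonferroni_alpha(alpha, len(p_values))
    return {name: p < threshold for name, p in p_values.items()}


# Hypothetical uncorrected p-values from three pairwise comparisons
p_values = {
    "keyboard vs control": 0.004,
    "voice vs control": 0.030,
    "keyboard vs voice": 0.600,
}
print(round(bonferroni_alpha(0.05, len(p_values)), 4))
print(significant_after_correction(p_values))
```

Note that .030 would pass an uncorrected .05 threshold but not the corrected one, which is the "penalty" for running several tests.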

• Only do post-hocs if the interaction is
• Check for the specific means you want to
compare (use the graph to help you
• No corrections necessary as long as your
comparisons are relevant to your


Planned Comparisons/Contrasts (instead of post hocs!)


• Sometimes, a certain comparison is
critical to testing my hypothesis.
• Just do it! (just don’t do too many of
them, and make sure they are justified by
the hypothesis).


Factorial Designs


2 or more independent variables
• Each variable can be manipulated within- or
• Each variable can have 2 or more levels
(that’s what makes them categorical!)
• Some variables can be subject variables
(quasi-experiment; i.e., male or female
jurors; a good way to test for the
generalisability of the findings to other


Why add variables?


i.e., factorial designs
1. It is efficient. Assess 2 or more causes at
2. To refine a theory (because it depends…
e.g., in Stroop effect)
3. To isolate a particular process of interest.
4. To assess change over time (e.g. in a pre-
test/post-test design; mindfulness vs
5. To increase external validity (extend to
other populations, stimuli, situations…
subject variables in negotiation phase)

Note: Interactions and moderations are the
same thing. Moderation analyses typically look at
continuous predictors, whereas
experimental ANOVA uses categorical
predictors (levels of the IV).


Hypothesis in Factorials


can be:
> interaction only (i.e., no main effect, but
effect dependent on the level of the other
> main effects & interaction

*always want to know about an interaction


Variables plotted on graph


The DV always goes on the y-axis but either IV can go on the x-axis or the key.

We decide what IV goes on the X-axis by referring back to our research question to see what makes the most sense.


Main effects are ___s and interactions are ___s of _____s


• Main effects are the averages (each level of
one IV averaged across the levels of the other).
• To look at the main effect of drug, we
average both placebo groups and
average both Prozac groups, then compare
the two. Is one average higher than the
other? Yes, Prozac appears to work better
than placebo.
• To look at the main effect of therapy, we average
improvement in both CBT groups and compare it
to the average improvement in both waitlist
groups. Is one average higher than the other?
Yes, the CBT group improved more than the waitlist
control group.
• Interactions are differences between
differences.
• We calculate the difference in means
between the two levels of one IV at each
level of the other IV, then calculate the
difference between those differences. The
difference of differences for Prozac vs
placebo and the difference of differences
for CBT vs waitlist equal
the same value either way you do it; this tells
us the magnitude of the interaction.
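
A worked sketch of "averages" and "differences of differences" for a 2 x 2 design. The four cell means are invented to mimic the drug x therapy example (improvement scores); they are not real data.

```python
# Hypothetical mean improvement in each cell: (drug, therapy) -> mean
cell_means = {
    ("prozac", "cbt"): 10.0,
    ("prozac", "waitlist"): 2.0,
    ("placebo", "cbt"): 2.0,
    ("placebo", "waitlist"): 2.0,
}


def marginal_mean(factor_index: int, level: str) -> float:
    """Average the cells that share one level of a factor (0 = drug, 1 = therapy)."""
    values = [m for key, m in cell_means.items() if key[factor_index] == level]
    return sum(values) / len(values)


# Main effects: compare the averages
main_effect_drug = marginal_mean(0, "prozac") - marginal_mean(0, "placebo")
main_effect_therapy = marginal_mean(1, "cbt") - marginal_mean(1, "waitlist")

# Interaction: difference of differences, computed both ways
drug_effect_in_cbt = cell_means[("prozac", "cbt")] - cell_means[("placebo", "cbt")]
drug_effect_in_waitlist = (cell_means[("prozac", "waitlist")]
                           - cell_means[("placebo", "waitlist")])
interaction_via_drug = drug_effect_in_cbt - drug_effect_in_waitlist

therapy_effect_on_prozac = (cell_means[("prozac", "cbt")]
                            - cell_means[("prozac", "waitlist")])
therapy_effect_on_placebo = (cell_means[("placebo", "cbt")]
                             - cell_means[("placebo", "waitlist")])
interaction_via_therapy = therapy_effect_on_prozac - therapy_effect_on_placebo

print(main_effect_drug, main_effect_therapy)
print(interaction_via_drug, interaction_via_therapy)  # same value either way
```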


Difference between Oneway ANOVA and ANOVA?


• One-way ANOVA (1 IV with three
or more levels)
• ANOVA (2 or more IVs, each with 2 or more levels)


ANOVA with 3x groups have…


2x df, 3 f statistics,


IF a significant main effect has more than 2 levels, you need to do a…


post-hoc test to determine where the differences are (just like oneway ANOVA).


What are the advantages of combining two IVs in the same study?


• Using the same data, running an ANOVA
with 2 IVs rather than 1 gives different
p-values and F ratios!
• “Drug” has a larger effect size and lower
p-value when Therapy is included in the
• Why?
o Look at the residuals.
o The residuals (denominator;
unexplained variance) are larger when only one IV is
included, which reduces the size of the F
statistic, makes the p-value larger and the
partial eta squared smaller.
o Some of that variability can be explained
by the second IV if it is included in the model
(moved from unexplained to explained variance).
o This only works if the added IV is related
to the DV (increases power); if not, it will
only make things worse.

  • Adding (useful) factors decreases the
    residuals, increasing F, decreasing p.
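
A sketch of why the F for "Drug" grows when Therapy is added to the model. It assumes a balanced 2 x 2 between-subjects design and uses invented scores; the residual in the two-IV model is taken as the within-cell sum of squares, which is what a full factorial model leaves unexplained.

```python
def mean(values):
    return sum(values) / len(values)


# Hypothetical improvement scores: (drug, therapy) -> scores in that cell
cells = {
    ("placebo", "waitlist"): [1, 2, 3],
    ("placebo", "cbt"): [5, 6, 7],
    ("prozac", "waitlist"): [3, 4, 5],
    ("prozac", "cbt"): [7, 8, 9],
}


def f_for_drug(cells, include_therapy: bool):
    """Return (F for drug, residual SS) with or without therapy in the model."""
    scores = [x for values in cells.values() for x in values]
    grand_mean = mean(scores)
    ss_total = sum((x - grand_mean) ** 2 for x in scores)

    drug_levels = sorted({drug for drug, _ in cells})
    ss_drug = 0.0
    for level in drug_levels:
        level_scores = [x for (drug, _), values in cells.items()
                        if drug == level for x in values]
        ss_drug += len(level_scores) * (mean(level_scores) - grand_mean) ** 2
    df_drug = len(drug_levels) - 1

    if include_therapy:
        # Residual = spread of scores around their own cell mean
        ss_residual = sum((x - mean(values)) ** 2
                          for values in cells.values() for x in values)
        df_residual = len(scores) - len(cells)
    else:
        # Residual = everything drug alone does not explain
        ss_residual = ss_total - ss_drug
        df_residual = len(scores) - len(drug_levels)

    f_value = (ss_drug / df_drug) / (ss_residual / df_residual)
    return f_value, ss_residual


print(f_for_drug(cells, include_therapy=False))
print(f_for_drug(cells, include_therapy=True))
```

The sum of squares for drug is identical in both runs; only the residual (the denominator) shrinks once therapy accounts for some of the spread.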

Interaction Write Up:

& Main Effects


Describing a significant Interaction:
1. Split one of the IVs into levels
2. Compare the effect of the other IV at each
3. Explain how the differences are different

• For participants in CBT, the drug produced
improvement over placebo. However, in
those on the waitlist, the drug was no more
effective than placebo.
• For participants taking the drug, CBT was
more effective than staying on the waitlist.
For participants taking the placebo, CBT
was no more effective than the waitlist.
• The effectiveness of the drug depended on
the therapy.
• The effectiveness of therapy depended on
the drug.

What about the main effects?
• Main effect of drug isn’t meaningful.
• Main effect of therapy isn’t meaningful.
• These main effects are qualified by the
interaction.


ANOVA with within-subject variables


For an ANOVA with within-subject variables (both variables manipulated within-subjects), the only difference is the statistical test: a repeated measures ANOVA, which does not care about overall differences between people, just differences within individuals between conditions.

A 2 x 2 within-subjects factorial uses a repeated measures ANOVA (we use it any time there is a within-subjects variable; the only difference is whether you call the design a mixed or a within-subjects factorial).

Jamovi doesn’t contain the theoretical variables, only their operationalised levels. We need to tell Jamovi what we are measuring.


Assumption of sphericity only matters …

Homogeneity of variance …


The assumption of sphericity only matters in within-subjects designs with three or more levels. Sphericity means the variances of the difference scores between each pair of levels are equal; with only two levels there is just one set of differences, so sphericity cannot be violated.

Homogeneity of variance concerns whether the variance (SDs) within groups differs; unlike sphericity, this can possibly be violated in a within-subjects design with two levels!


Small-N designs
Establishing causality when we do not have group means to compare

When do we use small-N designs?


• To establish a causal effect of IV on DV
within a small number of participants
- Research question concerns a very small
sample (not enough people to sample from)
- Situations where we cannot recruit a
sufficiently-powered sample (too small for
inferential statistics)
- When we expect substantial variability in
individual responses (a group mean isn’t
useful for highly variable responses; the
mean doesn’t end up describing most
• A small-N design establishes causal
relationships through replicating the effect
of IV on the DV (to prove consistency)
- Consistent change in DV as IV is
manipulated, with little variability (level or
trend)
- Direct replication of the IV’s effect within
the participant (exact same participants,
conditions and context)
- Systematic replication of the IV’s effect
across participants or contexts (different
participants, context or conditions)
• Control over other variables is achieved by
- Establishing a baseline for the behaviour
without intervention (acts as a control
- Collecting multiple observations until we
see consistency in behaviour (more
confident in the IV effect on DV; helps
establish that the DV is under the control of
the IV and supports claims of causality!)
- Replicating the change in DV with the
introduction of the intervention (comparing
change in DV from baseline-intervention
 Is one AB relationship enough to
demonstrate this? What is the problem with
inferring causality from a single-phase
 An AB design does not rule out history
effects: extraneous variables which may
have caused the change in behaviour and
so provide an alternative explanation for the
results, i.e. something that happens at the
same time as the IV.
 Solution = reversal design


Reversal Designs


• Phase change between A and B happens
more than once (not a single phase design!)
• ABA (baseline-intervention-baseline; most
common reversal line; does their behaviour
go back to baseline after the intervention is
removed; under control of IV’s presence or
absence) or ABAB (baseline-intervention-
• Meets the final replication criteria to
demonstrate control by the IV (behaviour
changing at each phase change supports
causality claims that the DV is under the
control of the IV)
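
A crude sketch of the replication logic in an ABAB design: the behaviour should shift in the expected direction at every phase change. The observations are invented (minutes late to class), and comparing phase means is only a stand-in for the visual inspection of level, trend and variability that small-N researchers normally rely on.

```python
# Hypothetical observations per phase: A = baseline, B = intervention
phases = [
    ("A", [32, 30, 35]),
    ("B", [5, 3, 4]),
    ("A", [28, 31, 30]),
    ("B", [4, 2, 3]),
]


def phase_means(phases):
    """Mean level of the behaviour in each phase, in order."""
    return [(label, sum(obs) / len(obs)) for label, obs in phases]


def effect_replicated(phases, lower_is_better: bool = True) -> bool:
    """True if every phase change moves the mean in the expected direction."""
    means = [m for _, m in phase_means(phases)]
    labels = [label for label, _ in phases]
    # Sign of the change we expect when the intervention is introduced
    direction = -1 if lower_is_better else 1
    for i in range(1, len(phases)):
        change = means[i] - means[i - 1]
        expected = direction if labels[i] == "B" else -direction
        if change * expected <= 0:  # no change, or change in the wrong direction
            return False
    return True


print(phase_means(phases))
print(effect_replicated(phases))
```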

Example Reversal Design:
(Bicard et al., 2012)

• College athletes at risk of academic failure.
The researchers propose that one causal
factor for poor academic performance is
being late to class.
• The intervention would be to get students
to attend class on time.
• Minutes late to class (DV), weeks into
trimester (timeline), texting (intervention;
they have to text their student counsellor
when they are coming to class), baseline
(no texting)
• At baseline they were 30+
minutes late to class.
• Intervention caused their behaviour to get
• The removal of the intervention caused the
behaviour to go back to baseline.
• Re-introduction of intervention improves
behaviour again.
• An ABAB design is good for:
o Behaviours that can be unlearnt
o It is not ethical to use ABAB designs when
the behaviour is harmful; we cannot remove an
intervention that is working (you can use
alternative methods to reversals to
prove causality; still need to show cause-

*Intervention behaviour different from
baseline. Return to baseline reverses the


Multiple Baseline Designs


• Single AB phase change that is replicated
across participants, behaviours or contexts
• At least twice
• Target outcomes must be independent
(observations in one individual are
independent of observations in another)
• Useful for effects that can’t be reversed
• Behaviour can’t be un-learnt
• Not ethical to withdraw treatment
• Example: Multiple Baselines (same
behaviour and context but different
• Notice in this graph:
- Multiple students are being tested with the
same intervention at different times (the
intervention doesn’t start at the same time
for every participant)
- Why do they start at different times?
(Staggered baselines are a common
feature.) The goal is to rule out
history effects, where extraneous variables
impact the DV at the same time as
the IV and provide alternative
explanations for the change in DV.

Example Multiple Baseline Design (same context and participants but different behaviour)

  • Pairs of students enrolled in a
    mathematics course. The experimenters
    want to test the effects of a peer tutoring
    intervention (home-based) on test scores
    (three different types of mathematics
  • Participants had a history of maltreatment,
    which is associated with poorer maths and
    literacy skills.
  • Baseline (test scores without intervention;
    rewarded with money): no answers correct
  • Intervention (one member was taught how to
    solve it; that participant taught the other and
    test scores were calculated): scores got
  • If children did not reach mastery with the
    intervention alone, they were provided with
    additional interventions.
  • A staggered baseline is present (specific to
    multiple behaviours in the same participants) for
    ruling out history effects, and it gives us a chance
    to see if the effect of the intervention on
    one maths equation (DV) influences the
    others. Causality claims are impaired if the
    intervention affects subsequent behaviours.

*Intervention effect replicated across

  • Say I’m not happy with this intervention
    because it doesn’t work very well, and
    I want to compare the effectiveness of two
    different treatments, or I’m not sure which
    component of the intervention is causing
    the effect (i.e., I want to separate things
    out)? Use an alternating treatments design
    with the same participants.

Alternating Treatment Designs


• At least 2 interventions are tested, usually
one at a time, and frequently alternated
from session to session
• Combinations of interventions can be used
to test for interactions (e.g., A-B-C-B-C: baseline,
intervention 1, interventions 1 + 2,
intervention 1, interventions 1 + 2; or baseline,
intervention 1, intervention 2, interventions 1
+ 2).

Example: Alternating treatment design
- Kallie is 6, autistic and has a high rate of
PICA (compulsive consumption of inedible
objects i.e., Christmas decorations or
destroying them)
- The intervention[s] to reduce the frequency
of this harmful behaviour
- Baseline:
- Functional Analysis Context: baited PICA
items on the table; safe to eat but look like
the unsafe items she likes to eat
- Holiday decoration context: attempts to eat
holiday decorations or destroy them but
was blocked by the experimenter
- High levels (clinically significant) of PICA
behaviours and warrants an intervention
- B (DRA)
- Differential reinforcement of alternative
behaviour (block PICA attempts, and
rewards non-harmful interactions with the
toys/objects with edible food items she
- Not punishing bad behaviour, but rewarding
good behaviour to reduce the frequency of bad
- We saw reductions in target behaviour but
not to a clinically significant level (normal
level; 1 PICA attempt per minute is too
- C (DRA and Facial Screen)
- Reinforcement of good behaviour
- Any time she tries to eat an inedible object,
the experimenter places a hand over
her hand and eyes for 30 s. This is a form of
stimulus avoidance (not restraining her at
all). Then they redirect her attention to
objects that are safe to play with.
- Strong behaviour reduction with instances
of 0 PICA behaviours (clinically significant)
- At NO point do we return to baseline: the
sequence is intervention, both interventions,
intervention, both interventions
- We can see that in the second DRA phase
her behaviour gets worse, which illustrates
that DRA on its own doesn’t
have a lasting effect and that it performs
better in combination with the facial screen
- Reversal between intervention phases

  • Behaviour changes as the treatment is

Mitter et al. (2015)


Changing Criterion Designs

  • Used for behaviour where we expect a
    gradual rather than an immediate change, e.g.
    more resistant behaviours such as smoking
    • Level of required behaviour is varied across
    trials (level of intervention or its criteria
    changes; gets harder and we measure
    peoples change in response to level
    • A change in criterion for target behaviour is
    based on successfully maintaining
    behaviour at the previous criterion
    (maintaining target behaviour from all levels
    of IV)
    • Causal relationship is demonstrated by
    successive replication of a change in
    behaviour with changes to the IV

• Pole vaulter university student
• Student is not lifting their arms high enough
(technique) to clear the bar, as they should
be able to so an intervention is needed.
• The height to which they raise their hands when
pole vaulting is a long-term
behaviour that is now a habit and harder to
change (more resistant to change)
• Baseline (pole vault attempts without
intervention; the usual height they keep
their hands at and the height they can clear)
- Intervention: “reach” is shouted as they are about
to jump, and a feedback pole gives them
instant auditory feedback if they raise their
arms high enough.
• Levels: baseline was 225 so first criterion
was to learn to do 230, then 235, then 240,
then 245 etc.
- They got better over time, and had to reach
the stability criterion (consistency of correct
arm extensions) before the criterion level
was changed
- At 252 they were not able to reach stability
criteria and the study was terminated
• Tips for a good criterion study:
- Increases in criterion need to be small
enough to be reasonably met but big
enough to be able to see when behaviour
changes (small changes are not good for highly
variable behaviour!).
- Criterion changes should not be made at
predictable times (random or staggered, to
rule out history or maturation effects).
- Returning to baseline or a previous criterion level
can support causal claims about the IV-DV relationship.

*Successive demonstration of a change in
behaviour with changes to the IV (level)
*We are more concerned with clinical
significance rather than statistical
significance (healthy behaviour levels more
important than having the best experiment)
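
A sketch of the changing criterion procedure described above: the criterion only moves up once behaviour is stable at the current level. The stability rule used here (three consecutive attempts at or above the criterion) and the observed heights are assumptions for illustration; the source only says a stability criterion had to be reached. The 230 starting criterion and 5-unit steps echo the pole vault example.

```python
def run_changing_criterion(observations, start_criterion, step, stable_run=3):
    """Raise the criterion by `step` after `stable_run` consecutive successes.

    Returns the final criterion and a log of (observation, criterion in force).
    """
    criterion = start_criterion
    streak = 0
    history = []
    for height in observations:
        history.append((height, criterion))
        if height >= criterion:
            streak += 1
        else:
            streak = 0  # a miss resets the run of stable behaviour
        if streak == stable_run:
            criterion += step  # behaviour is stable, so make the target harder
            streak = 0
    return criterion, history


# Hypothetical hand heights across successive vault attempts
observations = [231, 232, 230, 236, 234, 235, 237, 236, 241, 240, 242]
final_criterion, history = run_changing_criterion(observations, 230, 5)
print(final_criterion)
for height, criterion in history:
    print(height, criterion)
```

If the logged behaviour tracks each new criterion step by step, that successive replication is what supports the causal claim.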


Validity of small-N designs

1. Construct Validity
• How well do the operational variables map
  onto the theoretical variables?
2. Internal Validity
• Are there other possible explanations for 
  the findings?
• Are there confounds?
• Is it the best design to address the 
3. External Validity
• Can the conclusions generalise to other
  people, other stimuli, other contexts?
4. Statistical Validity
• How big is the effect?
• Does the study have sufficient power? 
• Is the data treated appropriately?
• Are the statistical conclusions (e.g., 
  significance) justified?

Sufficient Power
• Does the study have a big enough sample
size to identify the effect size we are
interested in?
• Small sample sizes can only detect big
effects.
• The bigger the sample size the smaller the
effect size you can identify.
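
A rough sketch of the sample size and effect size trade-off. It uses a common normal approximation for a two-tailed comparison of two independent groups, n per group ≈ 2 x ((z for alpha + z for power) / d)², with alpha = .05 and power = .80 as assumed defaults. Dedicated power software will give slightly larger numbers because it uses the t distribution.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(effect_size_d: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate participants needed per group to detect Cohen's d."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed critical value
    z_power = NormalDist().inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / effect_size_d) ** 2)


for d in (0.8, 0.5, 0.2):  # conventional big, medium and small effects
    print(f"d = {d}: about {n_per_group(d)} per group")
```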


Threats to Good Replicable Science:

  • One problem with study design is low
    statistical power: small effects cannot be
    detected reliably, which leads to unreplicable findings.
  • failure to control for bias
  • P- hacking (repeat statistical tests till one is
  • Harking (hypothesis after results are
  • publication bias (only publish significant
    and experimental hypothesis testing)

Step 1: Do good Science

  1. Make clear predictions based on
  2. Ensure you have enough statistical power
    (sample size is sufficient for effect size you
    are interested in, big effects need smaller
    sample sizes, but smaller effects need
    bigger samples).
  3. Set a stopping rule
  4. Reduce flexibility in data analysis
    (predetermined DV variables; exclusion
    criteria, subgroups and covariate analysis).
  5. Adjust for multiple comparisons when
  6. Upgrade statistical skills and

*How does this apply to small-N designs, which
do not have samples of people? In small-N
designs it could be argued that we exploit
optional stopping: the researcher can extend
the number of trials until they reach
a consistent/stable pattern of behaviour. The
approach is also flexible in how many trials
we do and at what point we change phases
(baseline-intervention); these are dynamic
changes we can make in a small-N design.
Does this mean small-N designs are not
doing good science?


Replication allows us to determine ____ in a small N design but…

Optional Stopping Rules
Inductive and Deductive Reasoning

  • Replication allows us to determine
    causality in a small N design but one phase
    change is not sufficient to conclude that
    the IV has power over the DV, too many.
  • Unlike group mean designs, small n
    designs treat each participant as a unit of
    replication (this is an advantage of small n
    designs, within one study with multiple
    participants you are replicating an effect;
    two participants = two replication). This
    disproves the critique that small n’s do not
    care about replication. They do, it is just
    built into the design itself.
  • Is it okay to do research without a stopping
    rule? People who stop responding or do
    not complete all trials have their data
    removed for not being complete. Is this a
    problem? To answer this we will look at the
    inductive and deductive model of
  • Group-based studies use a deductive
    approach, theory-hypothesis-data
  • Small-N designs are inductive in nature:
    observing behaviour, finding patterns, and
    using theory to explain the patterns found
    (being dynamic is necessary for clinical
    work where the goal is to reduce negative
    behaviour and replace it with positive
    behaviour; a null effect cannot be found because
    we keep the study going until we see the
    behaviour change we want). Small-N designs can
    be deductive when they test the
    effectiveness of an intervention (testing a
    theory-informed intervention on a child and
    taking observations to see if it works).
  • Most Small N designs are a mixture of
    inductive and deductive reasoning. A
    mixture of theory-driven and explanatory
    work. This is true of all psychology research
    (including group designs) where there is no
    clear distinction between inductive and
    deductive work.

example Small N design experiment abstract write up

  1. Set up goal with clinical significance: little
    research on PICA interventions
  2. Link their goal to literature (previous)
  3. Identify purpose of the current study:
    identifying the function of someone’s
    behaviour (functional analysis is an
    inductive goal! Observe people’s behaviour
    in different contexts to find out what it
    causing it and keep doing study till it is
    fixed- flexible/dynamic/no problems with
    publication bias because there is no null
    findings! Not testing a
    hypothesis/prediction) but there is
    deductive as well (evaluate the function of
    treatment informed by research; theory-
    driven intervention selection and using
    data to see if it is effective and supports our
    hypothesis, can have null effect and
    publication bias). Deductive reasoning
    issues will include what we have identified
    looking at group-level analysis.
  4. This example is a mixture of inductive and
    deductive approaches. Small N designs
    allow them to test their primary goal of
    identifying the function of the girl’s
    behaviour and keeping it under control
    whilst being flexible enough to test a
    secondary deductive hypothesis, see
    which theory-driven treatment is most
    effective. This is not a clear distinction to
    find in studies!

Step 2: Transparency in Reporting

  1. Report all measures, variables and
  2. Clearly distinguish confirmatory (planned)
    from exploratory (unplanned) analysis
  3. Clearly document hypothesis, predictions,
    design decisions, procedures.
  4. Share data in a public repository (if
    ethically possible)
  5. Share analysis codes and results
  6. Share research materials
  7. Improve journal standards

*Once we have identified the best design for
our RQ, we should then declare as much as
possible. In deductive studies we are
committing to our design decisions and
there is no flexibility to change them
throughout the study if we see it not
panning out as we expected. However,
an inductive design is much more dynamic
and we expect changes to be made in the
study, but we still need to declare the logic
or reasoning we will use to make these
changes (i.e., what will I use to decide when
to change phases).
- If our hypothesis is conceptualised at the
individual level then it should be tested at
the individual level with a small n design. If
it is more generalised to groups of people
on average performing the same then it
should be tested with a group design. The
design will decide what statistical validity
questions you need to ask.
- For example, an exploratory aim we asked
was that people who were distracted were
expected to take longer to respond. Not
theory driven, just an intuitive expectation
we had with no specific hypothesis was


External Validity in Small N designs:


• Each participant is a replication unit which
supports external validity of IV-DV to
another person, context or behaviour
(specific advantage of small N over group
designs, which compare averages, means,
that ignores individual differences)

Example: concern of replication of IV-DV effect (visitor contact impacts tuatara welfare; animal subjects) does a replication of effect in 3 subjects mean it will generalise to other tuataras?
• Do tuataras get distressed with visitor
• Enough in the literature to justify that this
should be studied because some animals
are distressed when handled but others
• They make analysis at the species level
(group differences expected within a
species) and individual differences (within
members of a species is appropriate to use
small n design)
• Small n design meets animal ethics (3 R’s:
reduce, refine and replace; to use the
minimum number of animals you need to
answer question)

van Heerbeek et al. 2021
• Baseline: no handling count and record
target behaviours
• Intervention: handling (held in hand for 30
minutes and touched by visitors) and count
and record target behaviours
• High variability in baseline visibility; when
distressed they burrow (cannot be seen).
Do they burrow more when distressed?
• Visitation days are set and cannot be
changed which is a limitation because we
cannot rule out extraneous variables which
coincide with these days. No staggering of
start in baseline! Does allow us to rule out
environmental factors though.
• Dark line shows that tuataras burrowed
more (were less visible) on visitor days
where contact was high which indicates it
distressed the animals.
• External validity is high in this example
because the tuatara and ecological context
it is meant to apply to are included in the
study. We can be confident the behaviour
will generalise outside of the study and be
repeated in the real world.
• Small N designs generally have high
external validity! The subjects being studied
are who it is meant to be applied to. Not
clear if it can generalise to others but
definitely the people in the study.
• Side note: a single study does not lead to
changes in policy, or interventions. Multiple
studies are needed in research which
provide converging evidence!


Statistical Power

In general:


- The more power you have the better
- How to design a study to have the most
power that you can


We aim to collect as much data as we can to decide about the truth, but there is
always room for error:

what 2 errors can we make?


• True Positive: Reject the null and the null is
false.
• True Negative: Fail to reject the null and the null
is true.
• False Positive (Alpha):
- Type 1 error
- Is our tolerance for being wrong which we
arbitrarily set at .05 aka 5% of the time we
reject the null when the null is true.
- In other words, we conclude that the
between-group differences is due to the
effect of the IV but is actually due to
sampling error.
• False Negative (Beta):
- Type 2 error.
- Fail to reject the null hypothesis when it is
actually false.
- In other words, have insufficient evidence
to reject the null hypothesis when an effect
of IV is present.
• These two errors are independent from one


Power = 1 – β


For example,
- if beta (false negative) is set at .2 (20%) then
your power is .8 (80%). 80% Power means
you have an 80% probability of rejecting
the null hypothesis IF it is actually false.
90% Power means you have a 90%
probability of rejecting the null hypothesis
IF it is actually false.
- Calculating the power before you start the
study is important to understand what is the
probability of finding a significant difference
in the study.
- In other words, with 80% power, if a true
effect exists and we ran the study ten times, we
would expect to find a significant difference
about 8/10 times.
- Before the replication crisis studies typically
only had 40% power which is a waste of
time and resources. We need to design
better studies with more statistical power =
better science.
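
A rough sketch of Power = 1 – β for a two-group, two-tailed comparison. It uses a normal approximation rather than the exact t distribution, so the numbers are only approximate; the function name `approx_power` and the example values (d = .5, 64 or 20 per group) are made up for illustration.

```python
from statistics import NormalDist

def approx_power(d, n_per_group, alpha=0.05):
    """Approximate power of a two-tailed, two-group test (normal approximation)."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)          # cut-off needed for significance
    # Expected size of the test statistic if the true effect size is d
    expected_stat = d * (n_per_group / 2) ** 0.5
    # Chance of landing beyond the cut-off in the true direction
    # (ignores the tiny chance of a significant result in the wrong direction)
    return 1 - z.cdf(z_crit - expected_stat)

power_big_study = approx_power(d=0.5, n_per_group=64)
power_small_study = approx_power(d=0.5, n_per_group=20)
beta_big_study = 1 - power_big_study           # false negative rate

print(round(power_big_study, 2), round(power_small_study, 2))
```

With the same effect size, the smaller study has far less power, which is the "waste of time and resources" problem described above.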


What determines Beta? How do we get more power?


Cohen’s d: difference between means / SD

  • The bigger the difference between groups
    (numerator) the bigger the effect size.
  • Cohen’s d uses SD NOT SE because SD is
    not affected by sample size (n).
  • The bigger the SD (denominator) the
    smaller the effect size.

Cohen’s d & Distribution overlap:

*Measures the difference in distribution overlap between two groups

  • Calculated by (M1 - M2)/SD, e.g., |100 - 115| = 15;
    15/10 = 1.5, which is a big effect size
  • |100 - 115|/60 = 15/60 = .25 (the same mean group
    difference with a larger SD means there is
    more overlap and a smaller effect size).

*Two things determine effect sizes: mean
group difference and SD (variability)
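
The two worked examples can be checked with a few lines of Python. This is only a sketch: `cohens_d` is a made-up helper name, and it takes a single SD as given rather than computing a pooled SD from raw scores.

```python
def cohens_d(mean_1, mean_2, sd):
    """Effect size: absolute mean difference in SD units."""
    return abs(mean_1 - mean_2) / sd

d_small_sd = cohens_d(100, 115, 10)   # same 15-point difference, small spread
d_large_sd = cohens_d(100, 115, 60)   # same 15-point difference, large spread

print(d_small_sd, d_large_sd)
```

The mean difference is identical in both calls; only the SD changes, and that alone moves d from 1.5 down to .25.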


How to get more power:

*To get a bigger t and smaller p


Design a study to optimize power:
*Applies to any experimental or
quasiexperimental design with 2+ groups

(A) Increase the difference between groups
(B) Decrease SE
a. Decrease SD
b. Increase N

*because the SE is the SD divided by the square
root of N. A smaller numerator (SD) and a bigger
denominator (N) make a smaller SE.
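
A quick sketch of why both routes shrink the SE, assuming the usual formula SE = SD / √N. The helper name and the numbers are hypothetical.

```python
import math

def standard_error(sd, n):
    """Standard error of the mean: SD divided by the square root of N."""
    return sd / math.sqrt(n)

se_start = standard_error(sd=10, n=25)       # starting point
se_more_n = standard_error(sd=10, n=100)     # route (b): increase N
se_less_sd = standard_error(sd=5, n=25)      # route (a): decrease SD

print(se_start, se_more_n, se_less_sd)
```

Note that N has to be quadrupled to halve the SE, whereas halving the SD halves it directly.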

(A) Increase Difference Between

• Increase the strength of the manipulation
o i.e., more sessions of CBT
o i.e., higher dosage of drugs
o Wouldn’t want to use a strong
manipulation all the time: with (-)
valence stimuli it can be unethical, and
in a within-subjects design it can
introduce demand characteristics
• Sample from extreme groups/ends of the
o Has its own limitations; regression to the
mean and not being able to know what
happens on average.

*Always aim for the strongest manipulation
you can, so that if we find no effect we can be
confident it is because the effect is not present
rather than because of insufficient power to detect it.

(B) Decrease SD
• Standardise measurement
o high consistency, reliability and validity
of measure to reduce variability.
• Homogeneous sample
o Less variability the IV has to compete
with but impairs external validity;
generalize to other groups; if it doesn’t
work in a homogeneous sample it is not
likely to work in a heterogeneous
• Matched pairs
o Match people for important extraneous
• Within-subjects designs
o I don’t care that people vary from one

(C) Increase N
• Incentives /rewards for participation
• Online data collection rather than in person
• Collaborative studies (many lab method;
multiple labs run same study and share data)

*harder to do due to cost in time and


What about alpha?


• We set the alpha or significance level at
.05. If the p-value is less than .05 we have
sufficient evidence to reject the null
hypothesis and accept that there is a 5%
chance we are making a false positive error.
In other words, we reject the null
hypothesis and conclude the IV caused the
effect on the DV but it is actually due to
sampling error.
• A one-tailed test has more power than a
two tailed test. For example, in a two tailed
test we predict an effect will be present but
we don’t know in what direction so the 5%
is split between both tails and requires a
larger t (+/-) to produce a significant p-value.
• In contrast, a one tailed test is when we
make a prediction on the direction of the
effect and get to pool the 5% at one end of
the tail which means we need a smaller t
statistic for the p-value to be significant.
One caveat is that if you predict in the
wrong direction the result will be
non-significant, even if it is significant in the
other direction.
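
A small sketch of the cut-offs involved. It uses the normal (z) distribution as a stand-in for t, which is an assumption made for simplicity; real t cut-offs are a little larger and depend on df.

```python
from statistics import NormalDist

alpha = 0.05
z = NormalDist()

# Two-tailed: the 5% is split, 2.5% in each tail
two_tailed_cutoff = z.inv_cdf(1 - alpha / 2)
# One-tailed: the whole 5% is pooled in the predicted tail
one_tailed_cutoff = z.inv_cdf(1 - alpha)

print(round(two_tailed_cutoff, 2), round(one_tailed_cutoff, 2))
```

The one-tailed cut-off is lower, so a smaller test statistic is enough to reach significance, but only in the predicted direction.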


Calculating Power
Use statistical software (G*Power) to
calculate it for you


• Function between N, mean group
differences and SD
• Will ask:
o One tail or two
o Predicted effect size
o Alpha
o How much power do you want (.90 is
o Equal number of participants in each
• It then tells you how many participants you
need to meet these criteria (decide before
sampling, acts as a stopping rule)
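
To see how those inputs combine, here is a rough sketch of a sample-size calculation for two independent groups. It is not what the power software does internally: it uses a common normal-approximation formula, so it tends to come out a participant or two lower than an exact t-based answer. The function name `n_per_group` is hypothetical.

```python
import math
from statistics import NormalDist

def n_per_group(d, power=0.80, alpha=0.05, tails=2):
    """Approximate participants needed per group (equal group sizes)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / tails)   # cut-off for significance
    z_power = z.inv_cdf(power)               # extra distance needed for the desired power
    return math.ceil(2 * ((z_alpha + z_power) / d) ** 2)

n_two_tailed = n_per_group(d=0.5, power=0.80, tails=2)
n_one_tailed = n_per_group(d=0.5, power=0.80, tails=1)
n_small_effect = n_per_group(d=0.2, power=0.80, tails=2)

print(n_two_tailed, n_one_tailed, n_small_effect)
```

Smaller predicted effects, more power, or a two-tailed test all push the required N up.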


Smallest Effect Size of Interest (SESOI)


• Detecting small effects requires precise
measurement and large N but both are
costly design elements.
- Solution: consider the theoretical and
practical significance of finding an effect to
justify the costs of detecting smaller effects.
- Use literature to see what effect sizes
others have found
- Start with a medium effect size (d = .5) as a base
- What is the smallest effect size do you care
about? Small effect sizes are costly (time,
money and effort)
- Theoretically, some people may want
evidence of an effect no matter the cost; in
physics a small effect can really influence theory, but in
psychology this is not always the case.
- In psychology, practical significance is
often more important: what effect size would be
important enough to inform treatment
design? Improve daily functioning?

*Small effect sizes may be theoretically
interesting but not practically


Interpreting null effects

  1. Your experiment didn’t fail! (significant
    effects are not the goal of research, we are
    aiming to find the truth!)
  2. Do you have sufficient power? (not large
    enough sample size to have sufficient
    evidence to reject the null-hypothesis)
  3. Did you effectively manipulate your IV?
    (construct validity; did the IV manipulation
    measure what we intended it to? Did we
    include a manipulation check?)
  4. Do you replicate your null effects?
    (answers not found in a single study; same
    effect found in another study supports that
    it is a null effect and not issues with
    experimental design)
  5. Did you preregister your hypotheses?
    (faith in null effect; because it requires you
    to do a power analysis).

Can we ever show that the null is true?


• Not with NHST (the logic of null hypothesis
testing is that we assume what the shape of
the sampling distribution is when the null
hypothesis is true and then we look for
evidence that would allow us to reject the
null-hypothesis; we have two options we
can reject the null hypothesis or fail to reject
the null hypothesis; when we fail to reject
the null hypothesis, we are saying that we
have insufficient evidence to reject it, it
does not mean that we accept it; NHST is
not designed to answer this question! Only
via replication of same effect can we
conclude this!)
• Demonstrate high power
• Replicate the null
• Use Bayesian statistics (an alternative to
NHST where they weigh the evidence for
the null and the experimental hypothesis; it
doesn’t rely on the logic of rejecting the null
or failing to reject the null as NHST does; it
would allow you to say that there is more
evidence for the null and not the
experimental; which would allow us to claim
if the null hypothesis is true, unlike NHST).

*Absence of evidence does not equal
evidence of absence! = having insufficient
evidence to reject the null hypothesis, does
not mean it is true! I don’t have evidence
that they are equivalent I just don’t have
sufficient evidence to show that they are different


Quasi-Experimental Designs

*Designs that are like experiments (manipulation of
the IV, but without the same level of
control as a true experiment)

Three examples


Criteria for Causal Inferences
1. An association between two variables.
2. The cause comes before the effect
(temporal precedence) – manipulating IV
measure the effect on DV
3. Alternative explanations are controlled
(control of extraneous variables; RA,
Expectations, control groups, order etc.
the only difference between experimental
group and control should be the IV).

*This is not always possible or desirable.

a) Non-equivalent groups
b) Pre-test/post-test
c) Interrupted time series


Non-equivalent Groups


*When your variable of interest cannot be
manipulated (no causal claim can be made;
direction of effect and alternative
explanations are not ruled out).
• Subject variables
- culture, age, IQ, personality, performance,
gender, income or education level.
• Ethical concerns
- fear, anxiety, depression, pain, malnutrition.

Adding a Participant Variable
*Does your experimental effect generalise to
other populations?

Example: Cultural Variations in Anger in negotiations (north American and Asian-American)

• Design:
- 2 (culture; cannot be manipulated) x 2
(emotion) factorial design with a subject
• Theory:
- anger is a negative emotion, but it is
effective in negotiations (instrumental
• Hypothesis:
- Anger is an adaptive mechanism that
demonstrates strength and encourages
• Prediction:
- If anger encourages concession, then
participants in the anger condition will be
more likely to offer the warranty.
• Method:
- Given a scenario where you are trying to
sell a product to the client who wants the
warranty thrown in before they accept your
offer (the warranty is expensive and you do not
want to offer it without them accepting the offer).
- The IV: end of script the client either says it
in an angry or non-angry tone.
- At the end of script they were asked two
- DV: What is the likelihood you will give the
client the warranty? 1-7 Likert scale
- Manipulation check: how angry do you
think the client was? Construct validity did
we actually make people think they were

• Results:
- People in angry condition concede more
(give warranty) than in the no anger
- Independent t-test (one variable, two levels,
manipulated between groups)

• Samples:
o WEIRD white, educated, industrialized, rich
and democratic (biased sample only reflects
a small proportion of the world; psychology
undergraduate samples, majority of psyc
samples, not generalizable to other groups).

Why important? For theory
• Anger is a good example of how WEIRD
samples are a problem. Cultures vary in their
expectance/tolerance of public displays of
• emotions-as-social-information model theory
behind study: cultures vary on what
emotions are appropriate to display in public
and therefore will influence their utility in
acts such as negotiations. Collectivist
cultures disapprove of public display of
anger, Western cultures are more accepting of its
intrinsic value.

- Anger condition is an independent variable,
but culture is a subject variable. This does
not affect analysis, but it affects
the interpretation of the results.
- Jamovi will treat culture as an IV in the
analysis, but when interpreting it, it is up to us
as the researchers to remember that no causal claims can
be made because we didn’t manipulate it
(affects interpretation, not stats).

Clustered bar graph:
- Replicates previous research that client’s
anger in negotiations lead to more
concessions than non-anger in European
- However, Asian Americans had the opposite
effect. More anger led to fewer concessions
than no anger in negotiations.
- = cross-over interaction (no main effects of
anger or culture because anger has the opposite
effect at each level of culture).
- Categorical Variables (2x) should be
presented as a clustered bar graph

Line Graph:
cross pattern

Write Up
• Concession making (primary dv) write up:
- Introduce analysis and variables
- Main effect
- Main effect
- Interaction
- Post hocs 
*Same write up steps for quasi-experimental and true experimental designs

Evaluating Validity (non-equivalent groups)
Internal validity
External validity


Internal validity
• Better than an association study because
anger is manipulated (cross-sectional or
correlational studies; more control in
quasi-experimental and can make causal
claims about anger).
• But, cannot make causal claims about
culture as the cause of the difference,
because it was not manipulated (practical
knowledge to know if what works in one
culture may not work in another even if I do
not know the causal mechanism)

External validity
• Better than study in a homogeneous
population (generalisability of effect to
other cultures; using student samples for
convenience and power supports internal
validity but sacrifices external validity; once
effect is found in student sample can
replicate in more heterogenous or different
samples to see if the effect generalises to
other groups)
• Still constrained by experimental
methodology (manipulated anger, not a real-world
situation, still a valuable extension of
our knowledge).
• Other examples:
• Medical conditions
• Anything which can not be randomly
assigned with manipulation of IV


Pre-test/Post Designs


*quasi-experimental design (studies for
companies to improve their services where
practical or cost constraints are present and
make quasi-experimental designs the better

When you want to measure change within individuals but cannot have a control group or counterbalance order
a) Cost/practical constraints (can you provide
it to one group and not the other, can you
afford to)
b) Participants are in a cohort (a class, a
programme, a neighbourhood = have to
apply it to everyone)
c) Carry-over concerns in a within-subjects
design (can’t counterbalance using a
standard within-subjects design, so use a
quasi-experimental pre-post to test one
order = VR fear/neutral studies where fear
response would contaminate neutral

Note: before we looked at true-experimental between subject’s pre-test/post-test design


Evaluating Validity (pre-post test design)
Internal validity
External validity


Evaluating Validity
Internal Validity
o Better than association study because IV
is manipulated (direction of effect is
o Risk of history, maturation, regression to
the mean (threats to internal validity
without control groups)
o Reduce threats to internal validity by
running a non-equivalent comparison
group if possible (not a proper control
group if not randomly assigned and can
have self-selection effects, but it at least
gives a comparison against which
maturation, regression to the mean and
history effects can be assessed).

External Validity
o Similar to experiment (really only
applicable to the sample I studied it in)


Interrupted Time Series (like pre-post design but better)

  • Use it because it helps address
    scores that vary randomly and erratically
    over time
  • For example, covid-19 cases can still have
    an overall trend but go up and down
    erratically day to day. If we run a pre-post
    design on variables which are highly
    unstable, the effect we see may not be due
    to the IV but just to random variations in the
  • If I looked at two given points I would make
    an inaccurate conclusion from the data. We
    need to look at the overall pattern before
    and after the intervention to determine its

Intervention Research
*it’s a common method used in these areas
• Government policy (i.e., banning cell-phones
whilst driving)
• Organisational change (i.e., health care,
education systems adopt new strategies
and want to test their effectiveness)
• Management strategies
• Catastrophic events (i.e., compare online-in-
person teaching due to covid-19; natural
disasters disrupting daily functioning
psychologically, economically and

*It would be impractical, unethical and too
expensive to test these with a true
experiment but we can still make strong
causal claims using a quasi-experimental

Interrupted Time Series Designs:
- We take a series of measurements (months
in a row, or years with historical data; days),
then introduce IV, then time series for post
is taken to look at the overall trend in the
data pre-to-post and we can make stronger
claims about the effect being stable (IV-DV)
- As opposed to normal pre-post design with
one time point, IV, second time point (dv)

Example: Interrupted Time Series Design
Spoelman et al (2017)
- Reducing the number of consultations people
make with their physicians for easy to solve
medical issues.
- They made a website for patients to get
medical advice (FAQ) for minor and easy to
solve medical issues.
- Measured number of medical consultations
(per 100 people) before and after website
was introduced.
- Post intervention (2 years) we
see a general decrease in number of
consultations (the trend reversed from increasing
to decreasing).
- Collecting enough data pre and post to
wash away individual differences day to day
and allows us to see the overall trend in the
data and gives us more confidence that IV
caused the DV.

*Where an experiment would’ve been
unethical (to withhold from some patient
when you believe it will be beneficial)
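
A sketch of why the whole series matters more than two points. The monthly numbers below are made up for illustration (they are NOT the data from the study), and `slope` is a hypothetical helper that fits a simple least-squares trend line to each phase.

```python
def slope(ys):
    """Least-squares slope of ys against time points 0, 1, 2, ..."""
    n = len(ys)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

# Hypothetical consultations per 100 people, one value per month
pre = [50, 53, 51, 55, 54, 57, 59, 56]    # before the intervention
post = [58, 55, 56, 52, 53, 50, 49, 47]   # after the intervention

pre_slope = slope(pre)     # overall trend before
post_slope = slope(post)   # overall trend after

# A simple pre-post design would compare only one point on either side
naive_change = post[0] - pre[-1]

print(round(pre_slope, 2), round(post_slope, 2), naive_change)
```

In this made-up series the two-point comparison suggests consultations went up, while the trend lines show an increasing trend turning into a decreasing one.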


Evaluating Validity (Interrupted Time Series)
Internal validity
External validity


Evaluating Validity
Internal Validity
• Better than single pre-test/post-test
because it controls for random variation
(more internally valid than other
quasi-experimental designs)
• IV may be confounded with other factors,
so alternative explanations are possible
(because no control group; do not know what
it is about the website that causes this

External Validity
• Very High (almost always done in real-
world situations; not in the lab so can not
do a true experiment)
• Still may not generalise to other contexts
(specific to the real-world situation that it
was conducted in; would need to replicate
in other organisations or countries)


Assumptions of Parametric Tests


• Normality (normally distributed)
o the DV is normally distributed
o the mean is a good estimate of the
o If variables are not normally distributed,
the means of two groups are no longer
good group estimates and cause
problems for t or f test.

• Variance (equal SD)
o the variance within each condition is
similar (homogeneous)
o affects formula for pooled variance (t) or
SS residual (ANOVA)
o When calculating the denominator, it
assumes that these variances are relatively
equal and can be pooled together. If they
are not equal, then the pooled variance is
not a good estimate of the average

• Sphericity
o only applies to repeated measures ANOVA
(within-subjects variables with more than 2
o like a variance assumption, but based on
the variance of the difference scores
between conditions, not variance within
each condition


Violating homogeneity of variance


• Student t test which has one measure of
variance (s2; pooled variance from both
groups; averaging variance across groups
is inappropriate because it doesn’t reflect
either group well; would make it too small
and not reflect the actual SE).
• Welch’s t test (more conservative; where
the formula considers the S2 (variance) of
both groups rather than pooling it).
• Note – these assumptions only apply to
between-subject comparisons (2 sets of
variance); a paired t-test is the difference
between condition 1-2 so there is only one
set of variance then

*Note: these assumptions only apply to
between-subject comparisons

• In Jamovi, it will do both Student’s t and
Welch’s t; if the assumption is violated then use
the information from Welch’s t. The key
difference will be that the df will be
smaller (more conservative) with
Welch’s than with Student’s t test.
• The error in using a Student’s t (without
correction) when SDs are not equal is that
the SE can be artificially small, and the t
artificially big, resulting in more false
positives. Violating assumptions increases
false positive rates, so we penalize
ourselves by reducing our df, to reshape the
sampling distribution and p-value. Some
people think we should just use Welch’s
t test all the time, since it doesn’t affect the
p-value too much and we can be safe.
• It is therefore more conservative, and less
likely to produce a spurious significant

*f’s and df are different
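
A sketch comparing the two tests by hand, using the textbook formulas (pooled variance for Student’s t, separate variances plus the Welch-Satterthwaite df for Welch’s t). The scores are hypothetical and chosen so the smaller group has the much larger variance, which is the case where pooling makes the SE too small.

```python
from statistics import mean, variance

def student_and_welch(a, b):
    """Return (t_student, df_student, t_welch, df_welch) for two independent groups."""
    n1, n2 = len(a), len(b)
    v1, v2 = variance(a), variance(b)      # sample variances (n - 1 denominator)
    diff = mean(a) - mean(b)

    # Student: one pooled variance for both groups
    pooled = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    t_student = diff / (pooled * (1 / n1 + 1 / n2)) ** 0.5
    df_student = n1 + n2 - 2

    # Welch: each group keeps its own variance, and df is penalised
    se_sq = v1 / n1 + v2 / n2
    t_welch = diff / se_sq ** 0.5
    df_welch = se_sq ** 2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
    return t_student, df_student, t_welch, df_welch

group_a = [20, 40, 60, 80]                    # small group, very spread out
group_b = [38, 39, 40, 41, 42, 40, 39, 41]    # larger group, tightly clustered

t_s, df_s, t_w, df_w = student_and_welch(group_a, group_b)
print(round(t_s, 2), df_s, round(t_w, 2), round(df_w, 2))
```

Here Student’s t comes out bigger with more df, while Welch’s t is smaller with far fewer df, so Welch is the more conservative of the two.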


Violating Sphericity


Repeated measures ANOVA with 3+ levels
- In a paired t test, it compares the difference
between condition A and B for each
participant. We end up with one group of
participants with difference scores, we can
calculate the mean difference and SD
scores from this. This means we can not
violate homogeneity of variance because
we only have one set of difference scores,
there is nothing for it to be homogeneous
- In an ANOVA with 3+ levels, we now have
three sets of differences (A-B, B-C, A-C)
with three different SE’s which need to be
equal within one another (difference of
- We will ask for the sphericity test, the
Greenhouse-Geisser correction (if sphericity is violated)
and the homogeneity test (for any
between-subjects variables).
- Greenhouse-Geisser correction makes the df
smaller (more conservative)
- e.g., a df of 2 becomes 1.76 (1.76/2 = penalization of .88)
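
The penalization is just a multiplication of the df by the correction value (often called epsilon). A small sketch, assuming a one-way repeated measures ANOVA; the sample size of 20 is made up for illustration.

```python
def greenhouse_geisser_df(levels, n_participants, epsilon):
    """Multiply the effect and error df by the correction value epsilon."""
    df_effect = levels - 1
    df_error = (levels - 1) * (n_participants - 1)
    return epsilon * df_effect, epsilon * df_error

# 3 conditions, hypothetical 20 participants, epsilon of .88
df_effect_corrected, df_error_corrected = greenhouse_geisser_df(3, 20, 0.88)
print(round(df_effect_corrected, 2), round(df_error_corrected, 2))
```

Both df shrink by the same proportion, which makes the test more conservative.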


If we violate normality…


When our data is not normally distributed (i.e., skewed) we cannot run parametric tests (t/f tests). Why?

Normality assumptions
• Parametric tests are based on comparisons
of means (assumes that means are a good
estimate of the group average, but if data is
skewed then the mean will produce
artificially big/small group differences,
because it is pulled in one direction +/- of
the tail; outliers warp means and
subsequently t and p-value).
• Using the means to represent a group or
conditions assumes the mean is a good

Solution: Use Non-parametric tests (ranked test)
• Based on medians, not means (median is
better because ½ of participants scores will
fall above or below the median; is based off
of participants and not their scores)
• Various ways of ranking individual data
points, to determine if high/low scores are
more likely in one condition than another
(clear split in the data)
• More conservative than parametric tests
(less likely to give a false positive)
• Less powerful than parametric tests (we are
throwing out individual scores/data and just
focus on high/low without looking at how
different they are)

(A) Mann-Whitney U Test
1. Non-parametric alternative to independent
2. Rank order all the participant RTs
3. Null hypothesis – all RTs are equally likely
to come from either condition
4. Research hypothesis – the faster RTs are
more likely to come from the one
condition, and the slower RTs are more
likely to come from the other condition (½
participants above the median will come
from group A and ½ participants below
the median from group B; what is the
probability of finding an 80-20 split?)
5. Use for non-normal data, especially if N is

*will lose power.
*No df for non-parametric tests because they
are about the sampling distribution which
non-parametric tests do not use (i.e., they
rank data and look at the probability people
are in one group or the other).
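
A sketch of the ranking logic behind the U statistic, written from the standard textbook definition (rank everything together, sum the ranks for one group, then U = rank sum minus n(n+1)/2, taking the smaller of the two Us). The RTs are hypothetical and the helper names are made up; in practice the software does this for you and also gives the p-value.

```python
def average_ranks(values):
    """Rank values from 1 (smallest); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over any run of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1      # positions are 0-based, ranks 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks

def mann_whitney_u(a, b):
    """Smaller of the two U statistics for independent groups a and b."""
    ranks = average_ranks(list(a) + list(b))
    rank_sum_a = sum(ranks[:len(a)])
    u_a = rank_sum_a - len(a) * (len(a) + 1) / 2
    u_b = len(a) * len(b) - u_a
    return min(u_a, u_b)

# Hypothetical RTs (ms): every RT in one condition is faster than the other
u_separated = mann_whitney_u([310, 325, 340, 355], [400, 420, 450, 480])
# Hypothetical RTs where fast and slow scores are mixed across conditions
u_mixed = mann_whitney_u([310, 340, 420, 450], [325, 355, 400, 480])
print(u_separated, u_mixed)
```

A U of 0 means the groups do not overlap at all in rank order; the more the ranks are mixed between conditions, the larger U gets.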

(B) Wilcoxon Signed Rank test
1. Non-parametric alternative to a paired t-test
(e.g., comparing Condition A to Condition
B within-subjects)
2. For each participant, sort them into two
groups based on whether they score
higher in Condition A or Condition B (will
½ participants be better at A and ½ B = no
difference between conditions; if more
people on A than B, 80-20 split then it is
highly unlikely that these between
condition differences are due to chance
and we would reject the null hypothesis).
3. Null hypothesis – two equal groups
(people equally likely to be better in
Condition A than Condition B)
4. Research hypothesis – unequal groups
(more people with A>B than B>A).

*Normality in difference scores and not in
the raw data (variables is for independent t)
*No df for non-parametric tests because they
do not use sampling distribution, they rank
*We would report it as w = 3.50, p = .008,
np2 = .873

(C) Kruskal-Wallis (alternative to oneway

  • Comparing 3+ means; between subjects
  • Run normal ANOVA
  • If we’ve violated normality assumption (i.e.,
    small n or skewed data)
  • Rerun test using a Kruskal-Wallis non-
    parametric ANOVA
  • Chi square test statistic (χ²)
  • Report it as χ²(2) = 11.40, p = .003

Post Hocs
- We know there is a significant difference
but not what group means are significantly
different, so we use post hocs (with or
without correction if we have directional
prediction or not).
- Non-parametric oneway ANOVA’s use
Dwass-Steel-Critchlow-Fligner pairwise

(D) Friedman Test (alternative to repeated 
- Chi square (χ²)
- Pairwise comparisons AKA post hocs = 
- df = number of conditions - 1, so 2 df = 3 conditions


4 Parametric Tests and their Non-Parametric Substitutes


(A) Independent t-test → Mann-Whitney U test
(B) Paired t-test → Wilcoxon Signed Rank test
(C) Oneway ANOVA (between) → Kruskal-Wallis test
(D) Oneway Repeated ANOVA → Friedman test


What about factorial ANOVA? We ignore violations to normality typically in these cases


• ANOVA is robust to minor violations of
normality (ignore them; its not surprising
that one group may be slightly different
than the others but once they’re averaged
out it shouldn’t matter; 2 x 2 has 4 groups)
• Transformations (if big violations to
normality we can transform the data; there
is no non-parametric test for a factorial

- Many options!
- Different transformations are possible,
depending on the shape of your original
- A common option for heavily positively
skewed data is a logarithmic transformation. Take
the log of each value using the base-10
exponent system (10^1, where values
of 10 become 1; 10^2, where values of 100
become 2; 10^3, where values of 1000
become 3, etc.). Gaps between large values
shrink much more than gaps between small
values, so scores out in the
tail get pulled closer to the middle of the
distribution, making it look more like a
normal distribution. It admits that these
values are different but not enough to
justifiably pull the mean that far out. Note
that it’s a transformed mean, not the actual
mean. We don’t care about the specific
value, more the effect of the IV on the DV mean
(bigger, - skew; smaller, + skew) between
two groups.
- Commonly use logarithmic
transformations on EEG data in the lab. The
scores still sit in the high end of the
distribution but closer to the mean.
- Inverse (e.g., 850 becomes 1/850). Flip the data by putting 1
over each individual score to scrunch the tail in
closer to the middle of the distribution.
- Exponents
- The type of transformation you use
depends on the problem with your data!
- Logarithmic data transformations are good
for (+) not (-) skews. If scores were bunched at the
high end, you will only make it worse.

*We find an expert source to help decide
what would be the best transformation for
our data.
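
A minimal sketch of the log transformation described above, using only the Python standard library. The scores are made up for illustration (they are not from the lecture), and `skewness` is a hypothetical helper that computes the simple population skewness.

```python
import math
from statistics import mean

# Made-up, positively skewed scores (e.g., reaction times in ms); not from the notes.
scores = [200, 220, 250, 260, 300, 320, 400, 850, 2000]

def skewness(values):
    """Population skewness: positive when the long tail is on the high end."""
    m = mean(values)
    n = len(values)
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Base-10 log of every score: 10 -> 1, 100 -> 2, 1000 -> 3, and so on.
log_scores = [math.log10(v) for v in scores]

print("skew of raw scores:", round(skewness(scores), 2))
print("skew of log scores:", round(skewness(log_scores), 2))
```

The skew of the logged scores should come out smaller than the skew of the raw scores, because the large values in the tail are compressed the most.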


Take-home messages

  1. Check your assumptions
  2. Use corrections for variance and
    sphericity-related violations
  3. Use non-parametric tests for normality
    violations
  4. Use transformations if non-parametric
    tests aren’t possible (i.e., factorial
    ANOVA)
  5. Ignore minor violations in factorial and
    repeated ANOVA (false positives are not
    really affected)
  6. Get help! (people specialize in
    understanding what correction is best
    for different data problems; go ask

Types of scales (4)

Types of variables (2)

Nominal (categorical) 
- no link between categories
- can not average or say bigger or smaller = 
- gender, eye colour
Ordinal (ranked categorical)
- there is a natural meaningful way to 
  rank/order categories 
- position in a race, item Q’s gradually 
  increase in extremity
- can not average them

Interval Scale (continuous)
- numerical value is genuinely meaningful
- differences between intervals/scores is
- temperature
- addition, subtraction and averaging are
meaningful, but ratios are not, because 0
is arbitrary (it does not mean absence of the variable)

Ratio (interval)

  • 0 is meaningful = absence of variable
  • scores/numbers are meaningful
  • can multiply and divide
  • reaction time
Types of variables:
-there is nothing in-between two 
- year went to school
- nominal, ordinal, interval and ratio

- interval & ratio
- a variable where, given any two values, it
is logical for there to be a value in
between
- RT, temperature (degrees Celsius and
Fahrenheit are interval; Kelvin is ratio),
# of t/f answers correct on a test (ratio),
Likert scale (interval)


If we square the t-test statistic we get


If we square the t-test statistic we get the F statistic for the ANOVA run on the same data
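
A small sketch that checks this relationship by hand, assuming two independent groups and the Student (pooled-variance) t-test. The data and the helper names `student_t` and `oneway_f` are made up for illustration.

```python
from statistics import mean, variance

# Hypothetical scores for two independent groups (illustrative only).
group_a = [4.0, 5.0, 6.0, 7.0, 8.0]
group_b = [6.0, 7.0, 9.0, 10.0, 11.0]

def student_t(x, y):
    """Independent-samples t using a pooled variance estimate."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    se = (pooled_var * (1 / nx + 1 / ny)) ** 0.5
    return (mean(x) - mean(y)) / se

def oneway_f(*groups):
    """One-way between-subjects ANOVA: F = MS_between / MS_within."""
    all_scores = [s for g in groups for s in g]
    grand_mean = mean(all_scores)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((s - mean(g)) ** 2 for g in groups for s in g)
    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

t = student_t(group_a, group_b)
f = oneway_f(group_a, group_b)
print("t squared:", t ** 2)
print("F:        ", f)
```

The two printed values should agree up to floating-point rounding. The equality only applies when there are exactly two groups, since a t-test cannot compare more.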


Graphs have two primary functions:


Graphs have two primary functions: to help us understand our own data or to communicate our findings to the public


Reading: histogram and box plots


Histograms are the simplest graphs; they work best with interval or ratio data and give you an overall impression of the variable. Their shape is influenced by the number of bins used, which is a disadvantage relative to plots that do not depend on binning. Their strength is that they show the entire spread of the data; the flaw is that with many bins they are not compact. They are not helpful for nominal data.

Boxplots (box and whiskers) work best for interval and ratio data. They include a visual presentation of the median, IQR and range of the data. They are compact and useful for exploratory analysis of your own data, and a good way to identify outliers.
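
A rough sketch of the numbers a boxplot summarises, using the Python standard library. The scores are made up, and the 1.5 × IQR rule for flagging outliers is an assumed convention (it is common in boxplot software, but the reading notes above do not state it).

```python
from statistics import median, quantiles

# Made-up scores with one extreme value (illustrative only).
scores = [12, 14, 15, 15, 16, 17, 18, 19, 21, 45]

# Quartiles; the "inclusive" method treats the data as the whole population.
q1, q2, q3 = quantiles(scores, n=4, method="inclusive")
iqr = q3 - q1

# Assumed convention: anything beyond 1.5 * IQR from the box is an outlier.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [s for s in scores if s < lower_fence or s > upper_fence]

print("median:", median(scores))
print("IQR:", iqr)
print("range:", (min(scores), max(scores)))
print("outliers:", outliers)
```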


Reading: Null hypothesis Testing


The goal of a hypothesis test is not to show that the alternative hypothesis is (probably) true. The goal is to show that the null hypothesis is (probably) false. Like a court trial, the null hypothesis is deemed true until we find sufficient evidence to prove beyond a reasonable doubt that it is false. The goal behind statistical hypothesis testing is not to eliminate errors, but to minimise them. There will always be error. If we reject a null hypothesis that is actually true then we have made a type I error. On the other hand, if we retain the null hypothesis when it is in fact false then we have made a type II error.
The single most important design principle of the test is to control the probability of a type I error, to keep it below some fixed probability. This probability, which is denoted α, is called the significance level of the test. And I’ll say it again, because it is so central to the whole set-up, a hypothesis test is said to have significance level α if the type I error rate is no larger than α.
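So, what about the type II error rate? Well, we’d also like to keep those under control too, and we denote this probability by β. However, it’s much more common to refer to the power of the test, that is the probability with which we reject a null hypothesis when it really is false, which is 1 − β. To help keep this straight, here’s the same table again but with the relevant numbers added:

  • H0 is true, retain H0: 1 − α (probability of correct retention)
  • H0 is true, reject H0: α (type I error rate)
  • H0 is false, retain H0: β (type II error rate)
  • H0 is false, reject H0: 1 − β (power of the test)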

A “powerful” hypothesis test is one that has a small value of β, while still keeping α fixed at some (small) desired level (e.g., .05).
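
A simulation sketch of α and power, under simplifying assumptions that are not in the reading: it uses a two-tailed z-test with a known population SD (not a t-test), and the sample size, effect size and number of simulated studies are arbitrary. `rejection_rate` is a hypothetical helper.

```python
import random
from statistics import NormalDist, mean

ALPHA = 0.05
N = 25          # sample size per simulated study (arbitrary)
SIGMA = 1.0     # population SD, assumed known so that a z-test applies
CRITICAL_Z = NormalDist().inv_cdf(1 - ALPHA / 2)   # two-tailed cutoff

def rejection_rate(true_mean, n_studies=4000, seed=1):
    """Share of simulated studies that reject H0: population mean = 0."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_studies):
        sample = [rng.gauss(true_mean, SIGMA) for _ in range(N)]
        z = mean(sample) / (SIGMA / N ** 0.5)
        if abs(z) > CRITICAL_Z:
            rejections += 1
    return rejections / n_studies

# H0 is true: every rejection is a type I error, so the rate should sit near alpha.
type_i_rate = rejection_rate(true_mean=0.0)

# H0 is false: every rejection is a correct decision, so the rate estimates power.
power = rejection_rate(true_mean=0.5)

print("type I error rate:", type_i_rate)
print("power (1 - beta):", power)
print("type II error rate (beta):", 1 - power)
```

Increasing `N` or the true mean in the second call should raise the power while the type I error rate stays near α.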
An aside regarding the language you use to talk about hypothesis testing. First, one thing you really want to avoid is the word “prove”. A statistical test really doesn’t prove that a hypothesis is true or false. Proof implies certainty and, as the saying goes, statistics means never having to say you’re certain.


Reading: Test statistics and sampling distributions


We calculate a test statistic from our sample and compare it to its corresponding sampling distribution (the values we would expect if the null hypothesis were true). If our test statistic falls in the tail of the sampling distribution (within the rejection region), then it is unlikely that the null hypothesis produced our results. The p-value is the probability of obtaining a test statistic at least as extreme as ours, given that the null is true.

  • nothing about proving null is wrong or the research hypothesis is right
  • to be in the tail the test statistic either has to be very big or very small (the most extreme 5% in one tail for a one-tailed hypothesis, or 2.5% in each tail for a two-tailed hypothesis).
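
A short sketch of how a p-value and the rejection-region cutoffs come from a sampling distribution. It assumes, for simplicity, that the test statistic is a z score, so the sampling distribution under the null is the standard normal; the observed value 2.1 is hypothetical.

```python
from statistics import NormalDist

# Assumption: the statistic is a z score, so under the null it follows N(0, 1).
null_distribution = NormalDist(mu=0.0, sigma=1.0)

def p_value(z, tails=2):
    """Probability of a statistic at least as extreme as z, given the null.

    With tails=1 this assumes the effect is in the predicted direction.
    """
    upper_tail = 1 - null_distribution.cdf(abs(z))
    return upper_tail * tails

observed_z = 2.1   # hypothetical test statistic

print("two-tailed p:", round(p_value(observed_z, tails=2), 4))
print("one-tailed p:", round(p_value(observed_z, tails=1), 4))

# Cutoffs that mark the rejection region at alpha = .05.
print("two-tailed cutoff (2.5% in each tail):", round(null_distribution.inv_cdf(0.975), 2))
print("one-tailed cutoff (5% in one tail):", round(null_distribution.inv_cdf(0.95), 2))
```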

Statistically significant simply means we have enough evidence to reject the null and conclude there is a significant difference present. It doesn’t tell us how big or how important this finding is to practice. It doesn’t tell us if our study was “good”. It doesn’t tell us the probability that the null is true.

We don’t usually talk in terms of minimising Type II errors. Instead, we talk about maximising the power of the test. Since power is defined as 1 − β, this is the same thing.


Reading: Comparing two means


The +/- sign is arbitrary for test statistics. The main difference is that the standard error calculations are different. If the two populations have different standard deviations, then it’s a complete nonsense to try to calculate a pooled standard deviation estimate, because you’re averaging apples and oranges.

Table 11.1: A (very) rough guide to interpreting Cohen’s d. My personal recommendation is to not use these blindly. The d statistic has a natural interpretation in and of itself. It re-describes the difference in means as the number of standard deviations that separates those means. So it’s generally a good idea to think about what that means in practical terms. In some contexts a “small” effect could be of big practical importance. In other situations a “large” effect may not be all that interesting.
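
A minimal sketch of Cohen’s d as "the number of standard deviations that separates the means". It assumes the pooled-standard-deviation version of d, which is one common choice among several; the data are made up.

```python
from statistics import mean, variance

def cohens_d(x, y):
    """Difference in means divided by the pooled standard deviation."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    return (mean(x) - mean(y)) / pooled_var ** 0.5

# Hypothetical scores for two groups (illustrative only).
treatment = [6.0, 7.0, 9.0, 10.0, 11.0]
control = [4.0, 5.0, 6.0, 7.0, 8.0]

d = cohens_d(treatment, control)
print(f"The group means are {d:.2f} standard deviations apart")
```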

In statistical jargon, this makes them nonparametric tests. While avoiding the normality assumption is nice, there’s a drawback: the Wilcoxon test is usually less powerful than the t-test (i.e., higher Type II error rate).

An independent samples t-test is used to compare the means of two groups, and tests the null hypothesis that they have the same mean. It comes in two forms: the Student test (Section 11.3) assumes that the groups have the same standard deviation, the Welch test (Section 11.4) does not. 
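
A sketch contrasting the two forms by hand. The helper names and data are hypothetical; the Welch degrees of freedom use the Welch-Satterthwaite approximation. Only the t statistic and df are computed here, not the p-value.

```python
from statistics import mean, variance

def student_t(x, y):
    """Student test: pools the two variances (assumes equal SDs)."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    se = (pooled_var * (1 / nx + 1 / ny)) ** 0.5
    return (mean(x) - mean(y)) / se, nx + ny - 2

def welch_t(x, y):
    """Welch test: keeps the variances separate and adjusts the df."""
    nx, ny = len(x), len(y)
    vx, vy = variance(x) / nx, variance(y) / ny
    se = (vx + vy) ** 0.5
    df = (vx + vy) ** 2 / (vx ** 2 / (nx - 1) + vy ** 2 / (ny - 1))
    return (mean(x) - mean(y)) / se, df

# Hypothetical groups with unequal sizes and very different spreads.
small_spread = [10.0, 11.0, 12.0, 13.0, 14.0, 11.0, 12.0, 13.0]
large_spread = [5.0, 12.0, 20.0, 27.0]

t_student, df_student = student_t(small_spread, large_spread)
t_welch, df_welch = welch_t(small_spread, large_spread)
print("Student: t =", round(t_student, 3), "df =", df_student)
print("Welch:   t =", round(t_welch, 3), "df =", round(df_welch, 2))
```

With unequal group sizes and unequal spreads the two tests give different t values and the Welch df drops below the Student df; with equal group sizes the t values coincide and only the df differ.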

A paired samples t-test is used when you have two scores from each person, and you want to test the null hypothesis that the two scores have the same mean. It is equivalent to taking the difference between the two scores for each person, and then running a one sample t-test on the difference scores. (Section 11.5) 
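
A sketch of that equivalence, computed two ways on made-up before/after scores: once as a one-sample t-test on the difference scores, and once from the two original variables using their variances and covariance.

```python
from statistics import mean, stdev, variance

# Hypothetical before/after scores for the same five people (illustrative only).
before = [10.0, 12.0, 9.0, 14.0, 11.0]
after = [12.0, 15.0, 9.0, 17.0, 13.0]
n = len(before)

# Route 1: one-sample t-test on the difference scores (H0: mean difference = 0).
diffs = [a - b for a, b in zip(after, before)]
t_from_diffs = mean(diffs) / (stdev(diffs) / n ** 0.5)

# Route 2: paired formula written in terms of the two original variables.
mean_before, mean_after = mean(before), mean(after)
cov = sum((b - mean_before) * (a - mean_after)
          for b, a in zip(before, after)) / (n - 1)
se_paired = ((variance(before) + variance(after) - 2 * cov) / n) ** 0.5
t_paired = (mean_after - mean_before) / se_paired

print("t from difference scores:", t_from_diffs)
print("t from paired formula:   ", t_paired)
print("df:", n - 1)
```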


What test do we do if we have a categorical DV?


Chi square