Weeks 5-6 Flashcards

1
Q

Statistics…

  • allow us to evaluate the….
  • In experiments, most of the statistics is comparing….
  • Statistical analysis used depends on the…
A

• Statistics allow us to evaluate the evidence to
determine whether our IV had any effect.
• In experiments, most of the statistics involves
comparing groups (2+) on one, two, or three independent
variables: mean group differences.
• The statistical analysis used depends on the design
(within/between-subjects) and the nature of the variables
(DV: categorical or continuous; levels: 2+).

2
Q

How do we test whether 28 ms is sufficient evidence to reject the null hypothesis? What does the __ ___ tell us?

A

Use the sampling distribution: the probability of finding the same result or a bigger one given that the null hypothesis is true (i.e., the probability of the result being due to sampling error and not the manipulation of the IV).

3
Q

The null hypothesis means…. It is believed to be…

A
  • If the likelihood of finding the same results, without
    IV manipulation, given the null hypothesis is true, is
    less than 5% (p-value less than .05), we reject the
    null hypothesis.
  • The null hypothesis is believed to be true until
    proven otherwise.
4
Q

The test statistic is ___ divided by ___. We compare it to the ___ to test….

A

• Group difference divided by standard error.
• This gives a ratio: variability between groups that is
explained by the IV, over the natural variability in the
sample (sampling error: how far a sample mean tends to
fall from the population mean, i.e., our confidence that
it is close to the true mean).
• We compare it to the sampling distribution to see how
often we would get a t this big or bigger if the null
hypothesis were true.
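To make the ratio concrete, here is a minimal sketch of an independent t computed by hand; the two reaction-time arrays are made-up illustrative values, not data from the lecture:

```python
import numpy as np

# Hypothetical reaction times (ms) for two independent groups (made-up values)
angry = np.array([702.0, 745.0, 688.0, 770.0, 731.0])
happy = np.array([715.0, 752.0, 699.0, 781.0, 744.0])

n1, n2 = len(angry), len(happy)
diff = angry.mean() - happy.mean()  # numerator: the group difference

# Denominator: standard error of the difference, via the pooled variance
pooled_var = ((n1 - 1) * angry.var(ddof=1) +
              (n2 - 1) * happy.var(ddof=1)) / (n1 + n2 - 2)
se = np.sqrt(pooled_var * (1 / n1 + 1 / n2))

t = diff / se  # the ratio: effect of the IV over sampling error
print(f"t({n1 + n2 - 2}) = {t:.3f}")
```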

5
Q

What is the sampling distribution? It samples from a ___ population. Its shape is determined by ___. A larger N is more likely to give us…. Its tail is where…. Where in the distribution will a t-score closer to zero fall?

A

• The sampling distribution tells us how likely it is to
get a t-score this big (or bigger) if the null hypothesis
is true, when sampling randomly from the same
population.
• It shows how often I will get different values of a
test statistic (e.g., t) by randomly sampling from a
SINGLE population.
• The shape of the sampling distribution depends on how
big my sample is (for an independent t-test, degrees
of freedom = N – 2).
• The larger the N, the smaller the differences between
groups should be if the null hypothesis is true, i.e.,
t-values closer to 0.
• The tail of the sampling distribution is the rejection
region, where it is unlikely that the test statistic was
produced by the null hypothesis being true (sampling
error rather than the IV).
• A t-score closer to 0 is more likely to have been
produced by the null and not the IV; bigger test
statistics are better, landing in the tails.
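You can build this sampling distribution yourself: repeatedly draw two groups from a single population (so the null is true by construction), compute t each time, and see how rarely large values occur. A minimal simulation sketch (the population parameters are arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_per_group, reps = 10, 10_000

# Both groups come from the SAME population, so the null is true by construction
ts = np.array([
    stats.ttest_ind(rng.normal(700, 100, n_per_group),
                    rng.normal(700, 100, n_per_group)).statistic
    for _ in range(reps)
])

# Most t-values cluster near 0; big ones land in the tails and are rare
print(f"P(|t| >= 2.101) ≈ {np.mean(np.abs(ts) >= 2.101):.3f}")  # ≈ .05 for df = 18
```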

6
Q

What do we look at first when examining independent and dependent t-test outputs?

A

*Not the p-value first!
1. Descriptive statistics:
mean differences between groups (the most important
information). The SE here refers to the mean (92.4 in
this example): I would have to accept variability of
roughly 90 milliseconds either side of the mean, so I am
not confident my sample mean is true of the group.
2. Assumptions:
are the assumptions of my t-test met? Do I need to run
a non-parametric test instead?
3. Only then do we look at the test statistic, p-value,
and Cohen's d.

7
Q

If the t-test statistic is 0.218 and the p-value is .83, what does this tell us?

A

83% of the time, the null hypothesis will produce a test statistic of 0.218 or greater. Since .83 is far above the 5% significance cut-off, we fail to reject the null hypothesis.

8
Q
  • Reject null: if the likelihood of the null hypothesis
    being true is less than …..
  • Bidirectional ….
  • Directional ….
A
  • Reject null: if the likelihood of the null hypothesis
    being true is less than 5% (i.e., the null produces a
    test statistic this size or greater less than 5% of
    the time).
  • Bidirectional: 5% rejection region split across two
    tails, 2.5% each side; you need a bigger t to fit in
    the smaller tail, i.e., more evidence to reject.
  • Directional: 5% rejection region in one tail; a
    smaller t is needed, so less evidence is required.
    Divide the (two-tailed) p-value by two to account for
    the bigger rejection region.
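A sketch of the two-tailed/one-tailed relationship, using the worked values from these cards (t = 0.218, df = 18):

```python
from scipy import stats

t, df = 0.218, 18
p_two = 2 * stats.t.sf(abs(t), df)  # two-tailed: the 5% is split across both tails
p_one = p_two / 2                   # one-tailed: the whole 5% sits in one tail
print(f"two-tailed p = {p_two:.3f}, one-tailed p = {p_one:.3f}")  # ≈ .830 and .415
```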
9
Q

P-value vs Cohen's d: what do they each tell us?

A

• P-value:
How confident I am that the null hypothesis did not
produce this data (less than .05 means less than a 5%
chance that the t-score was due to the null hypothesis,
i.e., sampling error, rather than the IV). P-values do
not tell me how big the effect is! A smaller p-value
does not mean a bigger effect; the p-value tells us
about our confidence that the effect is due to the IV
(i.e., is the effect statistically significant, not how
big is the effect).
• Cohen's d:
Effect size is given by Cohen's d = difference between
means / SD (pooled), not SE, so it is not corrected
for n.
How big is the effect of the IV on the DV?
The minus sign is arbitrary in Jamovi, so ignore it.
Directional hypothesis = one-tailed hypothesis: divide
.83 / 2 = .415.
P-value .415 (one-tailed; still non-significant).
P-value .83 (two-tailed; non-significant).
d = 0.098 is smaller than small (small = .2, medium = .5,
large = .8).

10
Q

three steps of reporting a t-test:

A
  1. Introduce your test (IV, DV, and the t-test should be
    mentioned in the first sentence).
    "Mean response times to angry and happy faces
    were compared with an independent t-test."
  2. Report your test statistics: t, df, p, d (to 3 dp).
    If t and d are negative, drop the negative sign (it's
    arbitrary). It is assumed your t is two-tailed; if you
    are doing a one-tailed test, divide p by 2 and report
    that it is one-tailed (note – there is another way to
    do this in JAMOVI which we'll learn Friday). P-values
    don't get a leading 0 (they can never be more than
    one); t and d get leading zeros. Indicate whether you
    reject or fail to reject the null hypothesis.
    "Results failed to reject the null hypothesis, t(18) =
    0.218, p = .415 (one-tailed), d = 0.098."
  3. Describe the effect in English, providing descriptive
    statistics (M, SD).
    "There was no significant difference in RT between
    participants who searched for an angry face (M =
    730 ms, SD = 292 ms) and those who searched for a
    happy face (M = 738 ms, SD = 282 ms)."
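A small Python sketch that applies these formatting rules; report_t is a hypothetical helper, and the values are the worked statistics from this card:

```python
# Hypothetical helper that applies the formatting rules from this card
def report_t(t: float, df: int, p: float, d: float, one_tailed: bool = False) -> str:
    p_str = f"{p:.3f}".lstrip("0")  # p-values never get a leading zero
    tail = " (one-tailed)" if one_tailed else ""
    return f"t({df}) = {abs(t):.3f}, p = {p_str}{tail}, d = {abs(d):.3f}"

print(report_t(0.218, 18, 0.415, 0.098, one_tailed=True))
# t(18) = 0.218, p = .415 (one-tailed), d = 0.098
```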
11
Q

Within-subjects design: each row is _ ___; it has __ SE, mean difference, and SDs.

The test statistic for a dependent t-test is calculated by….

It will produce a bigger ___ and ___ relative to between-subjects designs?

A

• Each row is a person.
• Same variability in the sample, but the SE, SD, and
mean difference are smaller (the IV's effect looks weaker
now that the random noise in the data is removed).
• SE is 3 ms.
• SD is 10 ms (small).
• Mean difference is 12 ms (small).

• The test statistic is the difference between conditions
(not the overall responses, but the calculated mean
difference of 12 ms) divided by the SE of the mean of
the differences.
• The smaller error in a within-subjects design means
the test statistic will be bigger and more likely to be
statistically significant (though the effect itself is
not big).

• A bigger t-test statistic and Cohen's d.
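A minimal sketch of that calculation; the paired reaction times are made-up values chosen so the mean difference is 12 ms:

```python
import numpy as np

# Hypothetical paired reaction times (ms): index i in both arrays is the same person
cond_a = np.array([712.0, 745.0, 690.0, 770.0, 733.0])
cond_b = np.array([700.0, 731.0, 680.0, 756.0, 723.0])

diffs = cond_a - cond_b                       # one difference score per person
se = diffs.std(ddof=1) / np.sqrt(len(diffs))  # SE of the mean of the differences
t = diffs.mean() / se                         # mean difference / its SE
print(f"mean difference = {diffs.mean():.1f} ms, t({len(diffs) - 1}) = {t:.3f}")
```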

12
Q

A paired t-test is also called what (3) things and is used when….

A

Also called

  • Dependent t-test
  • Matched t-test
  • Repeated measures t-test

Used when comparing two within-subjects conditions (the same participants in both).

13
Q

The larger the effect size (Cohen's d), the __ the distributions of the two groups overlap.

.2 =
.5 =
.8 =
2 =

A

Less.

.2 = 83% overlap, with 58% of the control group falling
below the experimental group's mean.
.5 = 67% overlap, with 69% of the control group falling
below the experimental group's mean.
.8 = 53% overlap, with 79% of the control group falling
below the experimental group's mean.
2 = 19% overlap.
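Assuming two normal distributions with equal SDs, the "% of the control group below the experimental mean" figures are just the normal CDF evaluated at d; a quick check:

```python
from scipy import stats

# Proportion of the control group scoring below the experimental group's mean,
# assuming both groups are normal with equal SDs
for d in (0.2, 0.5, 0.8, 2.0):
    print(f"d = {d}: {stats.norm.cdf(d):.0%} of controls fall below the experimental mean")
# 58%, 69%, 79%, 98%
```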

14
Q

Misconceptions about p-values

It tells me ___ but doesn't tell me what (4) things…

A

A p-value is…
the likelihood/probability of getting a statistical result
(t, F, r, etc.) that big (or bigger) IF THE NULL HYPOTHESIS IS TRUE.

Misconceptions about p-values:
1. The p-value tells me how big (or important) my effect
is.
2. If I reject the null hypothesis, my research hypothesis
must be true.
3. If I fail to reject the null hypothesis, the null
hypothesis must be true.
4. If I find a significant effect, I must have conducted my
experiment well (e.g., the experiment “worked”).

15
Q

The replication crisis is now called the __ ___ and occurs…

A

The credibility revolution; it occurs in all sciences.

16
Q
Daryl Bem's (2011) precognition study showed us....
He was likely to have made what (3) errors?
A

His findings went against the foundational theories of cognition within psychology and sparked public outcry. His work illustrated that there were problems in psychology's statistical methods and procedures if they could produce these results.

  1. HARKing: hypothesising after the results were known
    (e.g., claiming the effect applied to only one subgroup).
  2. Dropping outliers.
  3. Sample sizes of 40, 40, and 200 across conditions,
    which looks like optional stopping.
17
Q

What is publication bias and why is it an issue?
In which science is it worst?
A study which aimed to test whether classic psychology findings were replicable found…
What role do journals play in publication bias? It leads to (4) things.

A

• Publication bias is where journals select studies
that show significant effects, which means we have a
very biased foundational knowledge (only half of the
picture).
• Psychology has the highest bias towards positive
effects (90%). This is partly due to people studying
things they already know will come out positive (not the
point of science); it is more common to find null effects
than positive effects.
• The experiment still works if we get a null effect. We
are searching for the truth, not for significant effects.
Sometimes the answer is no (null).
• If we only published positive effects, we would expect
them to be easy to reproduce (not the case). A study
aimed to reproduce significant effects from classic
studies in well-reputed journals.

They found that almost all the replications had smaller effect sizes than the original effects. The replication rate was 40% (more than half of what we "knew" was wrong).

We need to replicate, but why don't we? It's boring, and journals do not like to publish replications. This highlighted that psychological science wasn't credible for the following reasons:
• Fraud
• Sexy (but unlikely) findings dominated the literature
• Proliferation of positive results
• Many studies fail to replicate

18
Q

False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant.

They identified 4 ways we can lie to ourselves about the patterns in our data:

A

• One reason for the inability to reproduce findings is
how we analyze data. Journals are the gatekeepers of
science and prefer exciting positive findings that are
more likely to be cited (increasing the journal's
reputation).
• This motivates researchers to want to find positive
results: we lie to ourselves to find significant
effects.

They identified 4 ways we can lie to ourselves about the patterns in our data:
1. Having two dependent variables (adding another
hypothesis doubles the chance of the null hypothesis
creating a significant effect; the false positive rate
increases).
2. Adding 10 more observations per cell (optional
stopping: when results are non-significant, you run more
people until you get a significant effect; this increases
the false positive rate).
3. Controlling for gender, or for the interaction of
gender and treatment (a non-significant male–female
effect, so you only look at one gender; this doubles the
hypotheses and the false positive rate, and the smaller
sample size reduces the effectiveness of random
assignment).
4. Dropping (or not dropping) one of three conditions:
choosing which conditions to keep based on which
combination gives significant findings.

*These techniques are legitimate if decided BEFORE data
collection.
*We should have a false positive rate of 5%; with two
dependent variables the rate goes up to 9.5%, and if we
combine these methods it can get as high as a 60%
false positive rate.
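Where the inflation numbers come from: for k independent tests at α = .05, the chance of at least one false positive is 1 − (1 − α)^k. A quick check (the card's 9.5% appears to come from a simulation with correlated DVs, so the independent-test arithmetic lands slightly higher):

```python
alpha = 0.05
for k in (1, 2, 3):
    familywise = 1 - (1 - alpha) ** k  # chance of at least one false positive in k tests
    print(f"{k} independent test(s): false positive rate = {familywise:.1%}")
# 5.0%, 9.8%, 14.3%
```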

19
Q

Optional Stopping: what is it?

A

In optional stopping, we run a study and stop if the results are significant. If not, we run a few more participants and check again, repeating until we get a significant effect or run out of time/money/participants.
• Each check is a new test of the hypothesis (this
increases the false positive rate).
• P-values are very unstable, especially when Ns are
low (when random assignment is less able to create equal
groups).

*You need to decide BEFORE data collection what the
sample size will be and when you will stop.
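A simulation sketch of the inflation (the population, batch size, and cap are arbitrary choices): the null is true throughout, yet re-checking after every batch pushes the false positive rate well above 5%.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
reps, false_positives = 5_000, 0

for _ in range(reps):
    # The null is true: both groups are always drawn from the same population
    a = list(rng.normal(0, 1, 10))
    b = list(rng.normal(0, 1, 10))
    while True:
        if stats.ttest_ind(a, b).pvalue < .05:  # "significant" -> stop and report
            false_positives += 1
            break
        if len(a) >= 50:                        # out of time/money/participants
            break
        a.extend(rng.normal(0, 1, 10))          # run 10 more per cell and re-check
        b.extend(rng.normal(0, 1, 10))

print(f"false positive rate ≈ {false_positives / reps:.1%}")  # well above 5%
```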

20
Q

HARKing:

A

HARKing (Hypothesising After Results are Known)
• Where you test multiple hypotheses until you get a
significant result and claim that was your original
hypothesis. With three hypotheses, the 5% false
positive rate increases to 14.3%. Every hypothesis we
add after failing to support our original hypothesis
further increases the false positive rate.

21
Q

Summary:
(9) questionable research practices
(3) problems in psychological research
The solution requires what three steps?

A
Questionable Research Practices
•	Multiple DVs
•	Multiple scoring systems
•	Dropping observations
•	Dropping conditions
•	Unplanned subgroup analyses
•	Unplanned covariates
•	Underpowered studies
•	Optional stopping
•	HARKing

*Increases false positive rates! Can be intentional or
accidental.

Problems:
1. Replicability
   • Effects aren't real
2. Reproducibility
   • Conclusions aren't justified
3. Publication bias
   • The literature doesn't reflect the science (only a
     small proportion of it)

The solution is open science practices:
Step 1: Do good science
Step 2: Transparency in reporting (share better science).
Step 3: Preregistration

22
Q

Step 1: Do good science (6)

A
  1. Make clear predictions based on hypotheses.
  2. Ensure you have enough statistical power (e.g., a
    sample size that gives a stable p-value; the smallest
    effect size of interest).
  3. Set a stopping rule (how many subjects).
  4. Reduce flexibility in data analysis (predetermine
    dependent variables, exclusion criteria,
    subgroups/covariates).
  5. Adjust for multiple comparisons when appropriate
    (Bonferroni correction! Adjust for multiple statistical
    analyses by setting a more conservative significance
    level; each repeated test multiplies the false positive
    rate). See the sketch below.
  6. Upgrade statistical skills and understanding (know
    your statistics and the consequences of each decision).
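A one-line sketch of the Bonferroni correction from point 5 (m = 3 tests is an arbitrary example):

```python
# Bonferroni correction: divide the significance level by the number of tests
alpha, m = 0.05, 3           # m = 3 planned comparisons (arbitrary example)
alpha_adjusted = alpha / m   # each test is now judged against ~.0167
print(f"adjusted alpha = {alpha_adjusted:.4f}")
```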
23
Q

Step 2: Transparency in reporting (share better science) (7)

A
  1. Report all measures, conditions, and variables.
    “We report how we determined our sample size, all
    data exclusions (if any), all manipulations, and all
    measures in the study” (the 21-word solution).
  2. Clearly distinguish confirmatory (planned) from
    exploratory (unplanned) analyses.
  3. Clearly document hypotheses, predictions, design
    decisions, procedures.
  4. Share data in a public repository (if ethically
    possible; not with small samples where participants
    could be identified or in clinical studies where
    participants do not want their data shared).
  5. Share analysis code and results.
  6. Share research materials.
  7. Improve journal standards.
24
Q

Step 3: Preregistration (8)

A
  1. Preregister hypotheses, predictions, measures,
    planned analyses.
    ◦ Time-stamped and posted online.
    ◦ Public or private (available to reviewers)
    ◦ Have data been collected?
  2. What is the main question or hypothesis?
  3. Describe the key DVs, and how they will be
    measured.
  4. What conditions will participants be assigned to?
  5. Specify the analysis to test the main hypothesis.
  6. Explain how outliers will be handled, and how
    observations will be excluded.
  7. How many observations will be collected (trials), and
    what will determine the sample size?
  8. Anything else you would like to pre-register
    (exploratory analysis)?

*AsPredicted is an example of a preregistration site
where you answer these (8) questions BEFORE you
begin your study and post them on a public platform like
the OSF. If you are worried about people stealing your
hypothesis, you can keep it secret until you have
conducted your study and sent it to a journal for
review/publishing.
*You do this whether you use your own or others' data!
*P-values are for testing hypotheses. Exploratory
analysis is fine if you preregister it and don't interpret
it as a hypothesis test.
*This makes replication easier, to test the same or an
alternative hypothesis.
*Journals are beginning to request that researchers do
this; it makes researchers less worried about finding
null results, and journals are becoming more likely to
publish null results.

25
Q

Who owns science?

A

• Originally, journals are the gatekeepers of science:
they determine what gets published, favour
significant findings, and can restrict access to those
who will pay a subscription fee.
• One solution is, as a researcher, to pay to have
your work published in an open-access journal.
• Pre-prints are an alternative, where you post your
manuscript before the article is accepted by a
journal (it may or may not have been peer reviewed).
A pre-print server is a public archive, and the studies
on it are not guaranteed to be accepted by the journal
they are sent to for publishing. It is only the study's
manuscript; researchers are not allowed to use the
journal's fancy formatting; people have to pay for
that.

26
Q

Cohen's d is calculated…

A

Mean group difference / SD (pooled).
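A minimal sketch with made-up group data (Jamovi computes this for you; this is just the formula by hand):

```python
import numpy as np

# Hypothetical reaction times (ms) for two groups (made-up values)
angry = np.array([702.0, 745.0, 688.0, 770.0, 731.0])
happy = np.array([715.0, 752.0, 699.0, 781.0, 744.0])

n1, n2 = len(angry), len(happy)
pooled_sd = np.sqrt(((n1 - 1) * angry.var(ddof=1) +
                     (n2 - 1) * happy.var(ddof=1)) / (n1 + n2 - 2))

d = (angry.mean() - happy.mean()) / pooled_sd  # mean group difference / pooled SD
print(f"d = {abs(d):.3f}")                     # the sign is arbitrary; drop it
```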