Week 5 - Sampling and random error & Week 6 - Statistical significance Flashcards
What is a sample?
A sample is a selected subset of a source population
Ideally should be representative of source population
What is a source population?
The source population is the group of all
individuals in which we are interested to assess some parameter(s)
What is sampling?
The process of selecting a number of
individuals from all individuals found in a source population
Many different sampling methods.
What is a sampling frame?
A list (or database) containing all
individuals in a population, which is used for sampling
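As a minimal sketch of how a sample can be drawn from a sampling frame, the snippet below performs simple random sampling without replacement. The frame of 1,000 ID numbers and the function name `simple_random_sample` are made up for illustration; simple random sampling is only one of the many sampling methods mentioned above.

```python
import random

# Hypothetical sampling frame: one ID number for every individual
# in the source population (1,000 made-up IDs).
sampling_frame = list(range(1, 1001))

def simple_random_sample(frame, n, seed=None):
    """Select n sampling units from the frame, without replacement."""
    rng = random.Random(seed)
    return rng.sample(frame, n)

sample = simple_random_sample(sampling_frame, n=50, seed=1)
print(len(sample), len(set(sample)))  # 50 50 (no individual is selected twice)
```

Every individual in the frame has the same chance of being selected, which is what makes the sample representative on average.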
What are sampling units?
The sampling units are the individuals to be potentially
selected.
Sampling units are usually individual
people, but we could also have larger sampling units (e.g.
families, streets, hospitals, schools, etc.)
Who can be part of the source population?
The source population can be the general population (e.g.
the total population of a country or city), but can also be a
specific sub-population (e.g. all smokers of a country, all
patients with heart disease, all children with cancer, etc.)
Describe what the sample should represent for each type of research.
- In descriptive research (i.e. when we want to investigate the
prevalence/incidence of a condition in a population), it is
particularly important that the sample accurately
represents the specific source population
- In analytic research (i.e. when we investigate the association
between exposure and outcome), we can be more general
regarding the source population, depending on the
research question of interest
- In situations where we investigate a biological effect on
some disease (e.g. effect of smoking on risk of cancer), we
can be more general in identifying the source population
(i.e. not necessarily restricted to a specific country/region)
- In situations where we investigate social/cultural effects
(e.g. effect of social class on risk of heart disease), we have
to be more careful and restrict the source population to the
specific country/region from which the sample was derived
What is an estimate?
To determine the proportion of a characteristic in a
population, we usually measure it in a sample
Therefore what we measure is an ESTIMATE. This estimate
carries an inherent error (sampling error)
The sample estimate attempts to quantify the
corresponding population parameter
What is statistical inference?
- When the sample estimate is used to draw conclusions
(inferences) about the population from which the sample
was taken, this is called STATISTICAL INFERENCE
- Statistical inference, as the name suggests, involves the use
of statistics to determine the degree of uncertainty in the
estimate of interest
What is a parameter?
A parameter is a measurement of a quantity (or association)
in a population in which we are interested, e.g.:
- mean age
- prevalence of obesity
- mean difference in blood pressure between men and women
- Odds Ratio for association between smoking and cancer
What might the population parameter and the sample estimate be for a given variable? (example)
Sample estimate mean = 3.75
Population parameter = 3.72
What is sampling variation?
The difference (variation) between different sample
estimates derived from the same source population
What is sampling error?
The difference in magnitude between the sample estimates
and the actual population parameter caused by measuring a
quantity (or association) in a sample rather than in the
source population
Also called “random error”, because it depends on chance
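The difference between sampling variation and sampling error can be illustrated with a small simulation. This is a sketch under stated assumptions: the "source population" is 10,000 made-up ages, and in real research the population parameter is unknown, so the sampling error cannot be computed directly as it is here.

```python
import random
import statistics

rng = random.Random(42)

# Made-up source population: 10,000 ages.
population = [rng.gauss(40, 12) for _ in range(10_000)]
population_parameter = statistics.mean(population)  # true mean age

def sample_estimate(n):
    """Mean age in one simple random sample of size n."""
    return statistics.mean(rng.sample(population, n))

# Sampling variation: repeated samples give different estimates.
estimates = [sample_estimate(100) for _ in range(5)]

# Sampling error: each estimate minus the population parameter.
sampling_errors = [est - population_parameter for est in estimates]

for est, err in zip(estimates, sampling_errors):
    print(f"estimate = {est:.2f}, sampling error = {err:+.2f}")
```

The estimates differ from one another (sampling variation) and each differs from the population parameter by chance alone (sampling error).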
What happens when you decrease sample size?
- Sampling variation increases
- Sampling error increases
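The effect of sample size can be checked with a simulation. The sketch below assumes a made-up population of 10,000 ages and a hypothetical helper `spread_and_error`; it draws 500 samples at each sample size and reports how much the estimates vary and how far, on average, they fall from the population parameter.

```python
import random
import statistics

rng = random.Random(0)
population = [rng.gauss(40, 12) for _ in range(10_000)]  # made-up ages
true_mean = statistics.mean(population)

def spread_and_error(n, repeats=500):
    """Return (SD of the estimates, mean absolute sampling error) for size n."""
    estimates = [statistics.mean(rng.sample(population, n)) for _ in range(repeats)]
    variation = statistics.stdev(estimates)
    mean_abs_error = statistics.mean([abs(e - true_mean) for e in estimates])
    return variation, mean_abs_error

results = {n: spread_and_error(n) for n in (10, 100, 1000)}
for n, (variation, error) in results.items():
    print(f"n = {n:4d}: sampling variation (SD) = {variation:.2f}, "
          f"mean absolute sampling error = {error:.2f}")
```

Both quantities shrink as n grows, which is the same statement as above read in the other direction.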
NB! All principles covered thus far apply to all measures of frequency and association (incidence, risk ratio, rate ratio, mean difference, correlation coefficient, regression coefficient); all are termed ‘estimates’ when calculated from a ‘sample’.
What is sampling distribution?
The distribution of the estimates calculated from many repeated
samples of the same size, e.g. plotted on a histogram
[Figures: sampling distribution for a very small, a larger and a very large sample size]
What is standard error?
The standard error describes the uncertainty of how well
the sample estimate represents the population parameter
It essentially estimates the standard deviation of the
sampling distribution, i.e. the average error that can occur
whenever we take a sample of a certain size n
What is the standard error formula?
Standard error can be estimated from a single (!) sample
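The formula itself is not reproduced in these notes. As a hedged sketch, the snippet below uses the standard textbook formulas for two common estimates: s / sqrt(n) for a mean and sqrt(p(1 - p) / n) for a proportion. It then checks, on made-up data, that the standard error from one sample is close to the standard deviation of the sampling distribution obtained from 1,000 repeated samples. The function names are hypothetical.

```python
import math
import random
import statistics

def se_mean(values):
    """Standard error of a sample mean: s / sqrt(n)."""
    return statistics.stdev(values) / math.sqrt(len(values))

def se_proportion(p, n):
    """Standard error of a sample proportion: sqrt(p * (1 - p) / n)."""
    return math.sqrt(p * (1 - p) / n)

rng = random.Random(7)
population = [rng.gauss(40, 12) for _ in range(100_000)]  # made-up ages

# Standard error estimated from ONE sample of n = 200.
one_sample = rng.sample(population, 200)
se_from_one_sample = se_mean(one_sample)

# Empirical check: SD of the sampling distribution from 1,000 repeated samples.
repeated_means = [statistics.mean(rng.sample(population, 200)) for _ in range(1000)]
sd_of_sampling_distribution = statistics.stdev(repeated_means)

print(f"SE from one sample:          {se_from_one_sample:.3f}")
print(f"SD of sampling distribution: {sd_of_sampling_distribution:.3f}")
```

The two numbers should be close, which is why a single sample is enough to quantify the uncertainty of an estimate.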
What is the 95% Confidence Interval?
- Confidence intervals indicate a range (interval) within which
we are confident (with some degree of uncertainty) that the
true population parameter lies
- The 95% Confidence Interval (95% CI) for a sample estimate is
calculated as:
- Lower confidence limit:
sample estimate - 1.96*standard error
- Upper confidence limit:
sample estimate + 1.96*standard error
- Interpretation: We are 95% confident that
the population parameter is contained within the interval
sample estimate +/- 1.96 SE
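The calculation above can be sketched in a few lines. The blood pressure values are made up and `ci95` is a hypothetical helper. The multiplier 1.96 comes from the normal distribution; with a sample as small as this one a t-based multiplier would normally be used, so the numbers are illustrative only.

```python
import math
import statistics

def ci95(estimate, standard_error):
    """95% CI as (lower limit, upper limit): estimate -/+ 1.96 * SE."""
    margin = 1.96 * standard_error
    return estimate - margin, estimate + margin

# Made-up sample of systolic blood pressure (mmHg).
sample = [118, 125, 131, 122, 140, 135, 128, 119, 133, 127]

mean = statistics.mean(sample)
se = statistics.stdev(sample) / math.sqrt(len(sample))  # SE of the mean
lower, upper = ci95(mean, se)

print(f"mean = {mean:.1f}, SE = {se:.2f}, 95% CI = ({lower:.1f}, {upper:.1f})")
```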
What are the two things we assess when we assess associations?
The presence of
an association and the magnitude of this association
What are the 2 possibilities for any given association?
- The association does not exist in the population
(i.e. the two variables are not linked)
- The association exists in the population (the two
variables are linked)
What are the two hypotheses about an association called?
- The Null hypothesis (H0) always states that there is no
association between the two variables in the population
- The Alternative hypothesis (HA) always states that there
is an association between the two variables in the
population
Explain the formal process of hypothesis testing.
- Define the statistical null (H0) and alternative (HA) hypotheses
- Start by assuming NO association exists in the population, i.e.
start with H0
- Define what is sufficient evidence against H0: the significance
level
- Collect some sample data from the population (evidence)
- Does the sample estimate provide sufficient evidence against H0
(i.e. no association)? Or, alternatively, could the sample estimate
be explained by random error alone, i.e. is it consistent with the
expected sampling variation if no association exists in the
population?
- Calculate the value of the test statistic (using the sample)
- Using the test statistic, derive the probability that quantifies
the strength of evidence against H0: the p-value
- Interpret the p-value: often in the context of the significance level
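The steps above can be sketched for one concrete case: a difference in mean blood pressure between two groups. The notes do not say which test statistic is used, so this example assumes a large-sample z-test, which is only one of several possible tests. The data are simulated and the function name `two_sample_z_test` is hypothetical.

```python
import math
import random
import statistics
from statistics import NormalDist

ALPHA = 0.05  # significance level, chosen before looking at the data

def two_sample_z_test(group_a, group_b):
    """Return (mean difference, test statistic z, two-sided p-value)."""
    diff = statistics.mean(group_a) - statistics.mean(group_b)
    se = math.sqrt(statistics.variance(group_a) / len(group_a)
                   + statistics.variance(group_b) / len(group_b))
    z = diff / se  # how many standard errors the estimate lies from 0 (H0)
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return diff, z, p_value

# Collect sample data (made-up systolic blood pressure values, mmHg).
rng = random.Random(3)
men = [rng.gauss(130, 15) for _ in range(120)]
women = [rng.gauss(125, 15) for _ in range(120)]

# Test statistic and p-value, computed under H0 (no difference).
diff, z, p_value = two_sample_z_test(men, women)

# Interpret the p-value in the context of the significance level.
decision = "reject H0" if p_value < ALPHA else "fail to reject H0"
print(f"difference = {diff:.2f}, z = {z:.2f}, p = {p_value:.4f} -> {decision}")
```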
What is the p-value?
- The probability of obtaining an association as strong as (or
stronger than) the one observed in our sample, if in fact there is no
association present in the source population (i.e. H0 is true)
- The lower the p-value, the smaller the chance that we could have
obtained an association this strong (or stronger) in our sample if no
true association existed (in the population)
- Thus, the lower the p-value, the more we think about rejecting H0
(no association exists in population) in favour of HA (association
exists in population)
- Generally, the stronger the association (and the
larger the value of the test statistic), the lower the p-value
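The definition can be made concrete by simulating what happens when H0 is true. The sketch below uses a permutation test, a technique swapped in here for illustration and not necessarily the one used in the course: group labels are shuffled many times, which mimics "no association", and the p-value is the share of shuffles that give a difference at least as large as the observed one. The data and the function name are made up.

```python
import random
import statistics

def permutation_p_value(group_a, group_b, n_permutations=5000, seed=0):
    """Share of label shuffles with a mean difference >= the observed one."""
    rng = random.Random(seed)
    observed = abs(statistics.mean(group_a) - statistics.mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # break any link between group and outcome
        diff = abs(statistics.mean(pooled[:n_a]) - statistics.mean(pooled[n_a:]))
        if diff >= observed:
            as_extreme += 1
    return as_extreme / n_permutations

# Made-up outcome values for two groups.
treated = [5.1, 4.8, 6.0, 5.5, 5.9, 6.2, 5.4, 5.8]
control = [4.9, 4.4, 5.0, 4.7, 5.2, 4.6, 5.1, 4.5]

print(f"p-value = {permutation_p_value(treated, control):.4f}")
```

A small p-value means that differences this large rarely arise from random error alone.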
What’s the relationship between p-value and association?
Inversely related: for a given sample size, the stronger the
association, the lower the p-value.
What is the significance level?
It is the cut-off that defines what counts as sufficient evidence against H0, i.e. how low the p-value must be.
- Often a significance level of 5% is chosen and therefore a
p-value of <0.05 is used to infer statistical significance
- An estimate with a p-value of <0.05 is deemed statistically
significant
Rejecting or not the Null Hypothesis based on p-value
- The p-value is used as evidence for rejecting or not rejecting
the Null hypothesis H0 in favour of the alternative HA
- If the p-value is <0.05 (or whatever the chosen significance
level was), we reject H0 (no association in the population)
- If the p-value is ≥0.05 (or whatever the chosen significance
level was), we cannot reject H0 (no association in the
population)
IMPORTANT TO REMEMBER
In hypothesis testing:
- We either have, or do not have, enough evidence to “reject H0”
- We can only either “reject H0” or “fail to reject H0”
- We cannot confirm whether HA or H0 is true
What can we expect to happen to the p-value if we assume association?
Generally, if we assume the presence of an association in
the source population:
- larger sample sizes will give smaller p-values
- estimates of larger magnitude will also give
smaller p-values
What 2 factors affect the p-value?
(Just like with 95% CI)
1. Sample size
2. Magnitude of association
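Both factors can be seen in a short calculation. The sketch assumes a z-test for a mean difference between two equally sized groups with a common standard deviation. These assumptions, the numbers and the function name are chosen for illustration only.

```python
import math
from statistics import NormalDist

def p_value_mean_difference(diff, sd, n_per_group):
    """Two-sided p-value for a mean difference (large-sample z-test)."""
    se = sd * math.sqrt(2 / n_per_group)  # SE of a difference of two means
    z = diff / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Same magnitude (5 units), increasing sample size.
for n in (20, 80, 320):
    print(f"diff = 5, n = {n:3d} per group: p = {p_value_mean_difference(5, 15, n):.4f}")

# Same sample size (80 per group), increasing magnitude.
for diff in (2, 5, 8):
    print(f"diff = {diff}, n =  80 per group: p = {p_value_mean_difference(diff, 15, 80):.4f}")
```

A larger n shrinks the standard error and a larger difference raises the numerator, so both push the test statistic up and the p-value down.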
Which two tools can be used to decide whether or not to reject the Null hypothesis?
- p-value
- 95% Confidence Interval
How can 95% CI be used to reject or not the Null Hypothesis?
- Mean difference: If the 95% CI includes 0 then H0 cannot be
rejected. This is because 0 (meaning no difference between the
two means) is a plausible value in the source population.
- Regression coefficient and correlation coefficient: If
the 95% CI includes 0 then H0 cannot be rejected.
This is because 0 (meaning no correlation between
the two variables, or a slope of 0) is a plausible value
in the source population.
- Odds Ratio/Risk Ratio/Rate Ratio: If the 95% CI
includes 1 then H0 cannot be rejected. This is
because 1 (meaning equal risks, rates or odds
between the two groups) is a plausible value in the
source population.
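For ratio measures the check against 1 can be sketched as follows. The 95% CI for the Odds Ratio is computed on the log scale using the common Woolf approximation, SE(log OR) = sqrt(1/a + 1/b + 1/c + 1/d). This method is an assumption, since the notes do not say how the interval is obtained. The 2x2 counts and function names are made up.

```python
import math

def odds_ratio_ci95(a, b, c, d):
    """OR and 95% CI from a 2x2 table.

    a = exposed cases, b = exposed non-cases,
    c = unexposed cases, d = unexposed non-cases.
    """
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - 1.96 * se_log_or)
    upper = math.exp(math.log(odds_ratio) + 1.96 * se_log_or)
    return odds_ratio, lower, upper

def includes_null(lower, upper, null_value):
    """True if the interval contains the 'no association' value."""
    return lower <= null_value <= upper

# Made-up study: 30/100 exposed and 15/100 unexposed develop the disease.
or_, lower, upper = odds_ratio_ci95(30, 70, 15, 85)
verdict = "cannot reject H0" if includes_null(lower, upper, 1) else "reject H0"
print(f"OR = {or_:.2f}, 95% CI = ({lower:.2f}, {upper:.2f}) -> {verdict}")
```

For a mean difference, regression coefficient or correlation coefficient the same `includes_null` check would be called with a null value of 0.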
Statistical Significance Summarised
- In cases where the p-value of an estimate is ≥0.05 or
when the 95% Confidence Interval includes 0 (mean
difference, regression coefficient, correlation
coefficient) or 1 (Odds Ratio/Risk Ratio/Rate Ratio),
then the estimate is considered not statistically
significant, thus the study finding is not conclusive.
- In cases where the p-value of an estimate is <0.05 or
when the 95% Confidence Interval does not include 0
(mean difference, regression coefficient, correlation
coefficient) or 1 (Odds Ratio/Risk Ratio/Rate Ratio),
then the estimate is considered statistically significant,
thus the study finding is conclusive
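The two criteria in this summary agree with each other when both are based on the same standard error and the normal multiplier 1.96. The sketch below checks this for an estimate whose null value is 0. The function name `summarise` and the numbers are hypothetical, and ratio measures would be handled on the log scale, where the null value 1 becomes 0.

```python
from statistics import NormalDist

def summarise(estimate, standard_error, null_value=0.0):
    """Return (p-value, 95% CI, whether the CI excludes the null value)."""
    z = (estimate - null_value) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    lower = estimate - 1.96 * standard_error
    upper = estimate + 1.96 * standard_error
    ci_excludes_null = not (lower <= null_value <= upper)
    return p_value, (lower, upper), ci_excludes_null

for estimate in (1.5, 2.0):  # made-up mean differences, each with SE = 1.0
    p, (lower, upper), excludes = summarise(estimate, 1.0)
    label = "statistically significant" if p < 0.05 else "not statistically significant"
    print(f"estimate = {estimate}: p = {p:.4f}, "
          f"95% CI = ({lower:.2f}, {upper:.2f}), excludes 0: {excludes} -> {label}")
```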
LOOK AT EXAMPLES IN SLIDES