Inferential statistics Flashcards
Outlier
An outlying observation, or outlier, is one that appears to deviate
markedly from other members of the sample in which it occurs.
Common criteria:
• more than 3 SD away from the mean
• more than 1.5 times the IQR beyond the quartiles (mild) or more than 3 times
the IQR (extreme): the customary boxplot criteria
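The two outlier rules above can be sketched in a few lines of Python. This is only an illustration: the function name `flag_outliers` and the data are made up, and the quartiles come from `statistics.quantiles` with its default method, which may differ slightly from the quartiles other software uses for boxplots.

```python
import statistics

def flag_outliers(values):
    """Return (beyond_3_sd, mild_or_worse, extreme) for a list of numbers."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    # Quartiles with the default ("exclusive") method of the statistics module.
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    beyond_3_sd = [v for v in values if abs(v - mean) > 3 * sd]
    # Mild fence: 1.5 * IQR beyond the quartiles (extreme values also pass it).
    mild_or_worse = [v for v in values
                     if v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr]
    # Extreme fence: 3 * IQR beyond the quartiles.
    extreme = [v for v in values
               if v < q1 - 3 * iqr or v > q3 + 3 * iqr]
    return beyond_3_sd, mild_or_worse, extreme

# Hypothetical data with one suspicious value.
data = [10, 11, 12, 12, 13, 13, 14, 15, 16, 40]
beyond_3_sd, mild_or_worse, extreme = flag_outliers(data)
print(beyond_3_sd, mild_or_worse, extreme)
```

In this made-up sample the value 40 is caught by the IQR rule but not by the 3 SD rule, because the outlier itself inflates the SD.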
Confidence Interval (CI)
We need to quantify our uncertainty about the population value.
• a confidence interval states our uncertainty
• confidence intervals are available for means,
differences between means, proportions, correlations…
A confidence interval is a range of values surrounding the sample estimate that is likely to contain the population parameter. Example: the estimate is 0.25. With a 95% CI, intervals built this way would cover the true value in about 95 of 100 repeated samples.
95% CI ≈ mean ± 2 SE
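As a rough sketch of the "mean plus or minus 2 SE" rule, the following Python snippet computes an approximate 95% CI for a sample mean. The function name `ci95` and the sample values are invented for illustration, and the factor 2 is the rule of thumb from the card; for small samples the exact multiplier from the t distribution is somewhat larger.

```python
import math
import statistics

def ci95(values):
    """Approximate 95% CI for the mean: mean +/- 2 standard errors."""
    mean = statistics.mean(values)
    # Standard error of the mean: sample SD divided by sqrt(n).
    se = statistics.stdev(values) / math.sqrt(len(values))
    return mean - 2 * se, mean + 2 * se

# Hypothetical measurements.
sample = [4.1, 5.0, 4.7, 5.3, 4.9, 5.1, 4.6, 5.2]
low, high = ci95(sample)
print(f"mean = {statistics.mean(sample):.3f}, 95% CI = ({low:.3f}, {high:.3f})")
```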
H0 (null hypothesis)
• H0 with one group:
- our sample proportion is not different from the
known population proportion
- our sample mean is not different from the known population mean
• H0 with two or more groups:
- groups are not different from each other regarding their population proportions or means
• alternative hypothesis: the same sentences, but without the word "not"
p-value
Gives the probability of observing a difference as large as, or larger than, the one seen in the sample if H0 were true!
• Can we reject the null hypothesis?
• We will never know if the null hypothesis is true or false!!
• We almost always observe a difference!!
• One group: we need to know the population mean/proportion! The p-value tells us how often random sampling from that population would give a difference as large as, or larger than, the one observed; it does not(!) tell us how sure we are that the sample is different
• Two groups: tells you how often you would get the observed or a larger difference by random sampling from two populations if the means/proportions of both populations were equal
• The p-value does not(!!) tell us how sure we are that there really is a difference
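The two-group reading of the p-value can be made concrete with an exact permutation test. This is a swapped-in illustration rather than the t-test the cards refer to elsewhere: it enumerates every way of splitting the pooled values into two groups and counts how often the absolute difference in means is at least as large as the observed one. The function name and the data are hypothetical.

```python
from itertools import combinations
from statistics import mean

def permutation_p_value(group_a, group_b):
    """Two-sided exact permutation p-value for a difference in means."""
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    indices = range(len(pooled))
    total = 0
    at_least_as_large = 0
    # Under H0 the group labels are arbitrary, so try every relabelling.
    for chosen in combinations(indices, len(group_a)):
        chosen_set = set(chosen)
        a = [pooled[i] for i in chosen]
        b = [pooled[i] for i in indices if i not in chosen_set]
        total += 1
        # Small tolerance guards against floating-point ties.
        if abs(mean(a) - mean(b)) >= observed - 1e-12:
            at_least_as_large += 1
    return at_least_as_large / total

# Hypothetical measurements from two groups.
group_a = [5.1, 4.9, 5.6, 5.8, 6.0]
group_b = [6.9, 7.2, 6.5, 7.0, 7.4]
p = permutation_p_value(group_a, group_b)
print(f"p = {p:.4f}")
```

Full enumeration is only practical for small groups; for larger samples one would draw random relabellings instead.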
CI vs p-value
- p-values only measure how easily chance alone could produce the observed difference
- CIs tell you about your estimate (e.g. a proportion) and the plausible values in the population
- CIs are often the better measure, but unfortunately they are used less frequently in science
- maybe because reporting one number is easier than reporting two?
Significance
• statistically significant does not necessarily mean that
the observation is "important"
• just a customary threshold (α) for the p-value to claim
significance (mostly < 0.05)
• significant *, highly significant **, extremely
significant ***
Testing: Numeric vs Numeric, Numeric vs Categoric, Categoric vs Categoric
• Numeric vs Numeric
- Correlation, Regression
• Numeric vs Categoric
- t-test, anova
• Categoric vs Categoric
- chisq-test, fisher-test
Central Limit Theorem
No matter how the population is distributed: the distribution of sample means will approximate a Gaussian distribution if the sample size is large enough.
What is large? It depends:
• a less normal population needs larger samples (100 should be enough in most cases)
• a more normal population needs smaller samples (10 or more)
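A small simulation can illustrate the theorem. The sketch below draws repeated samples from a strongly skewed population (an exponential distribution with mean 1, chosen here purely as an example) and looks at the distribution of the sample means. The function name, seed, and sample sizes are arbitrary choices.

```python
import random
import statistics

def sample_means(n, repeats=2000, seed=1):
    """Means of `repeats` samples of size `n` from an exponential population."""
    rng = random.Random(seed)
    return [statistics.mean([rng.expovariate(1.0) for _ in range(n)])
            for _ in range(repeats)]

means = sample_means(n=100)
# The population has mean 1 and SD 1, so the sample means should centre
# near 1 with a spread near 1 / sqrt(100) = 0.1, and look roughly symmetric.
print(f"mean of sample means:   {statistics.mean(means):.3f}")
print(f"SD of sample means:     {statistics.stdev(means):.3f}")
print(f"median of sample means: {statistics.median(means):.3f}")
```

Rerunning with a small `n` (for example 3) shows a visibly skewed distribution of means, which is why the required sample size depends on how non-normal the population is.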
Properties of a Normal Distribution
• symmetrical, bell-shaped distribution
• extends in both directions to infinity
• mean and median are close to each other
• 95% of all values are within 2 SD of the mean
• this assumption gives very wrong results if the distribution is non-normal!!
• normal data --> t.test
• non-normal data, skewed or multi-modal distributions --> wilcox.test
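The "95% within 2 SD" property can be checked directly with `statistics.NormalDist` from Python's standard library. The helper name `share_within` is made up for this sketch.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

def share_within(k):
    """Fraction of a normal distribution lying within k SD of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

within_2_sd = share_within(2)
print(f"within 2 SD: {within_2_sd:.4f}")  # roughly 0.95
```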
t distribution
- derives from the normal distribution
- t is the difference between the sample mean and the population mean, divided by the SEM (standard error of the mean)
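The definition of t above translates into a short calculation. This is a minimal sketch for the one-sample case; the function name and the numbers are illustrative only.

```python
import math
import statistics

def t_statistic(sample, population_mean):
    """(sample mean - population mean) / SEM for a one-sample comparison."""
    # SEM: sample standard deviation divided by sqrt(n).
    sem = statistics.stdev(sample) / math.sqrt(len(sample))
    return (statistics.mean(sample) - population_mean) / sem

# Hypothetical sample compared with an assumed known population mean of 4.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
t = t_statistic(sample, population_mean=4)
print(f"t = {t:.4f}")
```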
Effect size: Cohen's d
How large is the difference between two group means in comparison to the standard deviation?
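One common way to compute Cohen's d is to divide the difference in means by the pooled standard deviation of the two groups. The card does not say which standard deviation is meant, so the pooled SD is an assumption of this sketch, as are the function name and the data.

```python
from math import sqrt
from statistics import mean, variance

def cohens_d(group_a, group_b):
    """Difference in means divided by the pooled SD (an assumed convention)."""
    n_a, n_b = len(group_a), len(group_b)
    # Pooled variance: the two sample variances weighted by degrees of freedom.
    pooled_var = ((n_a - 1) * variance(group_a)
                  + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)

# Hypothetical groups: the means differ by 1, both SDs are 2.
d = cohens_d([2, 4, 6], [1, 3, 5])
print(f"d = {d:.2f}")
```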
Correlation
• observe the association between two numerical
variables
• if two numerical variables are associated we say they are correlated
• the correlation coefficient is a quantity that describes the strength of the association
4 interpretations of r
Why do the variables correlate so well?
• Lipid content of membranes determines insulin sensitivity?
• Insulin sensitivity affects lipid content?
• Insulin sensitivity and lipid content are controlled by
a third factor?
• There is no correlation; our r is just a random finding (type 1 error)?
• We do not know the truth …
• Correlation does not mean causation!!!
Correlation: r-squared (r^2)
• r^2 is often also called the coefficient of determination
• r is between -1 and 1
• r^2 is between 0 and 1 and never larger than |r|
• r^2 is interpreted as the fraction of variance that is
shared between the variables
• runners: 0.19^2 = 0.036, meaning that only 3.6% of
the variance of time is shared with age
• students: 0.78^2 = 0.6084, meaning that about 61% of the weight variance is shared with height
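The arithmetic in the two examples can be reproduced with a few lines of Python. The r values are the ones quoted on the card; the helper name `shared_variance` is made up.

```python
def shared_variance(r):
    """Fraction of variance shared between two variables: r squared."""
    return r ** 2

# r values quoted in the runners and students examples.
examples = {"runners": 0.19, "students": 0.78}
for label, r in examples.items():
    r2 = shared_variance(r)
    print(f"{label}: r = {r}, r^2 = {r2:.4f} ({100 * r2:.1f}% shared variance)")
```

Because |r| is at most 1, squaring can only shrink it, so a correlation that looks moderate can correspond to a small share of variance.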
Influence of outliers
• Just one point can change everything with the Pearson correlation!
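To see how strongly a single point can move the Pearson correlation, the sketch below computes r for a small made-up data set with and without one extreme point. The function `pearson_r` is a plain textbook implementation written for this illustration.

```python
from math import sqrt
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical data with a weak negative association.
x = [1, 2, 3, 4, 5]
y = [3, 5, 1, 4, 2]
r_without = pearson_r(x, y)

# The same data plus one extreme point far from the rest.
r_with = pearson_r(x + [20], y + [20])

print(f"without outlier: r = {r_without:.2f}")
print(f"with outlier:    r = {r_with:.2f}")
```

Here one added point turns a weak negative correlation into a strong positive one, which is why a scatter plot should be checked before an r value is interpreted.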