Inferential statistics Flashcards
Outlier
An outlying observation, or outlier, is one that appears to deviate
markedly from other members of the sample in which it occurs.
Different measures:
• more than 3 SD away from the mean
• more than 1.5 times the IQR below Q1 or above Q3 (mild),
3 times (extreme) --> custom boxplot criteria
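A minimal R sketch of both criteria (the vector x is made up for illustration):
x <- c(rnorm(50, mean = 10, sd = 2), 25)  # 50 normal values plus one extreme value
z <- (x - mean(x)) / sd(x)
x[abs(z) > 3]                             # criterion 1: more than 3 SD from the mean
q <- quantile(x, c(0.25, 0.75)); iqr <- IQR(x)
x[x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr]  # mild outliers (boxplot default)
x[x < q[1] - 3 * iqr | x > q[2] + 3 * iqr]      # extreme outliers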
Confidence Interval (CI)
We need to quantify uncertainty about the population value:
• a confidence interval states our uncertainty
• confidence intervals are available for means,
differences between means, proportions, correlations…
A confidence interval is a range of values surrounding the sample estimate that is likely to contain the population parameter. Example: the estimate is 0.25; with a 95% CI, the interval would cover the true population value in 95 of 100 repeated samples.
CI95% = mean +/- 2 SE
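A small sketch of the mean +/- 2 SE rule next to the exact version from t.test (sample is made up):
x <- rnorm(30, mean = 5, sd = 1)   # made-up sample
se <- sd(x) / sqrt(length(x))      # standard error of the mean
mean(x) + c(-2, 2) * se            # approximate 95% CI: mean +/- 2 SE
t.test(x)$conf.int                 # exact t-based 95% CI for comparison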
H0
• H0 with one group:
- our sample proportion is not different from the
known population proportion
- our sample mean is not different from the known population mean
• H0 with two or more groups:
- groups are not different from each other regarding their population proportions or means
• alternative hypothesis: the same sentences, but without "not"
p-value
gives the probability of observing a difference as large as or even larger than the one seen in the sample if H0 were true!
• Can we reject the null hypothesis?
• We will never know if the null hypothesis is true or false!!
• We almost always observe a difference!!
• One group: we need to know the population
mean/proportion! Tells us how sure we are that our sample is not(!) different from the population
• Two groups: tells you how often you would get the observed or a larger difference by random sampling from two populations if the means/proportions of both populations were equal
• The p-value does not(!!) tell us how sure we are that there is really a difference
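Where the p-value shows up in practice, with two made-up groups:
a <- rnorm(20, mean = 10, sd = 2)
b <- rnorm(20, mean = 11, sd = 2)
t.test(a, b)$p.value  # chance of a difference this large or larger if H0 were true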
CI vs p-value
- p-values are, in a sense, only a measure of randomness
- CIs tell you about your proportions and the probable value in the population
- CIs are often the better measure, but unfortunately they are used less frequently in science
- maybe because reporting one number is easier than two?
Significance
• statistically significant does not necessarily mean that
the observation is "important"
• just a custom threshold (alpha) for the p-value to claim
significance (mostly < 0.05)
• significant *, highly significant **, extremely
significant ***
Testing: Numeric vs Numeric, Numeric vs Categoric, Categoric vs Categoric
• Numeric vs Numeric
- Correlation, Regression
• Numeric vs Categoric
- t-test, anova
• Categoric vs Categoric
- chisq-test, fisher-test
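A minimal R sketch matching each pairing to its base-R test (all data made up for illustration):
x <- rnorm(30); y <- x + rnorm(30)        # numeric vs numeric
cor.test(x, y)                            # correlation
g <- factor(rep(c("A", "B"), each = 15))  # numeric vs categoric
t.test(x ~ g)                             # two groups; anova(lm(x ~ g)) for more
t1 <- factor(sample(c("yes", "no"), 30, replace = TRUE))  # categoric vs categoric
t2 <- factor(sample(c("yes", "no"), 30, replace = TRUE))
chisq.test(table(t1, t2))                 # or fisher.test(table(t1, t2)) for small counts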
Central Limit Theorem
No matter how the population is distributed: the distribution of sample means will approximate a Gaussian distribution if the sample size is large enough.
What is large? It depends:
• the less normal the distribution, the more samples needed (100 should be enough in any case)
• the more normal the distribution, the fewer needed (10 or more)
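A quick simulation sketch of the theorem (the skewed exponential population is chosen arbitrarily):
pop_means <- replicate(10000, mean(rexp(100)))  # means of 10000 samples, n = 100 each
hist(pop_means, breaks = 50)                    # approximately bell-shaped despite the skewed population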
Properties of a Normal Distribution
• symmetrical, bell-shaped distribution
• extends in both directions to infinity
• mean and median are close to each other
• 95% of all values are within 2 SD
• this assumption gives very wrong results if the distribution is non-normal!!
• normal data --> t.test
• non-normal data, skewed or multi-modal distributions --> wilcox.test (see the sketch below)
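A small sketch of the last two bullets, with made-up normal and skewed samples:
a <- rnorm(20, mean = 10, sd = 2); b <- rnorm(20, mean = 12, sd = 2)
t.test(a, b)                        # normal data --> t.test
s1 <- rexp(20); s2 <- rexp(20, rate = 0.5)
wilcox.test(s1, s2)                 # skewed data --> wilcox.test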
t distribution
- derives from the normal distribution
- t is the difference between the sample mean and the population mean, divided by the SEM:
t = (mean - mu) / SEM
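Computing t by hand and checking it against t.test (one-sample case; the data and the assumed population mean of 10 are made up):
x <- rnorm(25, mean = 10.5, sd = 2)
mu <- 10                              # assumed known population mean
sem <- sd(x) / sqrt(length(x))        # standard error of the mean
(mean(x) - mu) / sem                  # t by hand
t.test(x, mu = mu)$statistic          # the same value from t.test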
Effect size: Cohen's d
How large is the difference between two groups in comparison to the standard deviation?
d = (mean1 - mean2) / pooled SD
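A minimal sketch; the function name cohens_d and the pooled-SD variant are my own choices:
cohens_d <- function(a, b) {
  pooled_sd <- sqrt(((length(a) - 1) * var(a) + (length(b) - 1) * var(b)) /
                    (length(a) + length(b) - 2))
  (mean(a) - mean(b)) / pooled_sd
}
cohens_d(rnorm(30, 10, 2), rnorm(30, 12, 2))  # about -1: groups one SD apart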
Correlation
• observe the association between two numerical
variables
• if two numerical variables are associated we say they are correlated
• the correlation coefficient is a quantity that describes the strength of the association
4 interpretations of r
Why do the variables correlate so well?
• Lipid content of membranes determines insulin sensitivity?
• Insulin sensitivity affects lipid content?
• Insulin sensitivity and lipid content are controlled by a
third factor?
• There is no correlation, our r is just a random finding (type 1 error)?
• We do not know the truth …
• Correlation does not mean causation!!!
Correlation: r-squared (r^2)
• r^2 is often also called the coefficient of determination
• r is between -1 and 1
• r^2 is between 0 and 1, and no larger than |r|
• r^2 is interpreted as the fraction of variance that is
shared between the variables
• runners: 0.19^2 = 0.036, meaning that only 3.6% of
the variance of time is shared with age
• students: 0.78^2 = 0.6084, meaning that 60% of the weight variance is shared with body size
Influence of outliers
• Just one point can change everything with Pearson correlation!
Spearman Rank Correlation
- Spearman correlation is more robust against outliers!
- Correlation with one outlier is not significant!!
- Spearman correlation is calculated on ranks of values.
- It’s a non-parametric test.
- It does not assume a normal distribution of the data.
- It is more conservative.
- If in doubt use Spearman correlation.
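A sketch of the outlier effect with made-up data: ten unrelated points plus one extreme pair:
x <- c(1:10, 30); y <- c(rnorm(10), 100)
cor(x, y, method = "pearson")   # large, driven almost entirely by the single outlier
cor(x, y, method = "spearman")  # computed on ranks, far less affected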
Effect size: Pearson's r and Spearman's rs
• Pearson's r and Spearman's rs are quite similar in their values
• but rs^2 is the proportion of shared variance for the ranks
• Kendall's tau is numerically different: about
66-75% of r or rs; don't square it
• r of 0.1 small effect, 1% of variance
• r of 0.3 medium effect, 9% of variance
• r of 0.5 large effect, 25% of variance
partial correlation
In probability theory and statistics, partial correlation measures the degree of association between two random variables, with the effect of a set of controlling random variables removed. If we are interested in finding whether or to what extent there is a numerical relationship between two variables of interest, using their correlation coefficient will give misleading results if there is another, confounding, variable that is numerically related to both variables of
interest.
• partial correlation of body height and weight after removing the effect of sex
• correlation of shoe size and writing capabilities after removing the effect of …
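A residual-based sketch in base R (partial_cor is my own helper; for linear relationships this equals the partial correlation):
partial_cor <- function(x, y, z) {
  cor(resid(lm(x ~ z)), resid(lm(y ~ z)))  # correlate what is left after removing z
}
z <- rnorm(100)          # confounder
x <- z + rnorm(100)      # both variables depend on z
y <- z + rnorm(100)
cor(x, y)                # substantial, but driven by z
partial_cor(x, y, z)     # near zero once z is removed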
Mutual Information
- Relates the count of a specific combination to the count of all combinations (for example, the number of all AC pairs compared to all pairs)
  - Gives correlation information: how far the pairs across the sequence deviate from statistical independence
  - Incidentally also called Boltzmann entropy, …
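A rough sketch computing mutual information from pair counts, as described above (mutual_info is my own helper):
mutual_info <- function(a, b) {
  pxy <- table(a, b) / length(a)           # joint pair frequencies (e.g. AC pairs)
  px <- rowSums(pxy); py <- colSums(pxy)   # marginal frequencies
  ind <- outer(px, py)                     # expected frequencies under independence
  sum(ifelse(pxy > 0, pxy * log2(pxy / ind), 0))
}
a <- sample(c("A", "C", "G", "T"), 1000, replace = TRUE)
mutual_info(a, a)          # maximal: a sequence determines itself
mutual_info(a, sample(a))  # near zero: shuffling destroys the association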
Correlation vs Regression
Correlation
• description of an undirected relationship between two or more
variables
• how strong it is
• direction is not known, not existing or we are simply not interested
• phones in household and baby deaths
Regression
• description of a directed relationship between two or more variables
• one variable influences the other
• smoking and cancer
• weight and height
• model to describe the relationship
• model to predict one variable
Regression: aims and types
- looking for a trend: linear, sigmoid, exponential
- curve fitting: which model is most similar to the data
- prediction: predict response variable Y from X
- standard curve: assays
Regression Types
• simple linear regression (numerical variables)
• multiple linear regression (numerical variables)
• logistic regression (Y is categorical)
• non-linear regression (numerical variables)
• regression trees (Y is numerical)
• classification trees (Y is categorical)
Simple linear regression
• most common regression type
• method to fit the best straight line to a cloud of data points
• one variable (independent) is used to predict a second (dependent)
In statistics, simple linear regression is a linear regression model with a single explanatory variable. That is, it concerns two-dimensional sample points with one independent variable and one dependent variable (conventionally, the x and y coordinates in a Cartesian coordinate system) and finds a linear function (a non-vertical straight line) that, as accurately as possible, predicts the dependent variable values as a function of the independent variable. The adjective simple refers to the fact that the outcome variable is related to a single predictor.
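A minimal lm() sketch with made-up data:
x <- rnorm(50, mean = 20, sd = 5)     # independent variable
y <- 2 * x + 1 + rnorm(50, sd = 3)    # dependent variable with noise
fit <- lm(y ~ x)                      # fit the best straight line
coef(fit)                             # intercept and slope, close to 1 and 2
plot(x, y); abline(fit)               # data cloud with the fitted line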
Regression: predict values
Use the equation to determine Y for certain values of X. Example: what would be the insulin response for C20-22 values of 0, 15, 16, and 17%?
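A sketch with predict(); the data frame and variable names are made up, standing in for the C20-22 / insulin example:
df <- data.frame(x = rnorm(50, 10, 3))
df$y <- 0.5 * df$x + rnorm(50)
fit <- lm(y ~ x, data = df)
predict(fit, newdata = data.frame(x = c(0, 15, 16, 17)))  # Y at X = 0, 15, 16, 17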
Linear regression: slope and intercept
Intercept (a, alpha): value of Y if X is zero (Y-intercept).
Slope (b, beta): increase in Y per one unit of X.
Example: y = 2x + 1
beta = sum((x_i - mean(x)) * (y_i - mean(y))) / sum((x_i - mean(x))^2)
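The formula computed by hand and checked against lm (made-up data):
x <- rnorm(40); y <- 2 * x + 1 + rnorm(40)
beta <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
alpha <- mean(y) - beta * mean(x)   # the intercept follows from the means
c(alpha, beta)                      # close to 1 and 2
coef(lm(y ~ x))                     # the same values from lm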
Residual in regression (and error)
Error (or disturbance) of an observed value is the deviation of the observed value from the (unobservable) true value of a quantity of interest (for example, a population mean).
Residual of an observed value is the difference between the observed value and the estimated value of the quantity of interest (for example a sample mean) (Wikipedia).
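A tiny sketch of the residual definition: observed minus fitted values:
x <- rnorm(30); y <- x + rnorm(30)
fit <- lm(y ~ x)
all.equal(as.numeric(residuals(fit)), as.numeric(y - fitted(fit)))  # TRUE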
When to use a non-parametric test?
- outcome is a rank or score
- some values are "too high" or "too low" (off the measurable scale)
- the measurements, although numbers, are just not Gaussian
- you can try to transform the data
- log-transformation is often used
Tests with Gaussian vs non-Gaussian data and large vs small sample size
Non-Gaussian data:
• large sample: parametric test OK (but what is large? it depends how far the data are from the normal distribution)
• small sample: parametric test not OK; p-value wrong!
Gaussian data:
• large sample: non-parametric test OK; p-value slightly too large
• small sample: non-parametric test not OK; p-value much too high; lack of statistical power
(log2/log10) transformation of the data
If your data turn out not to be normally distributed, you can try to transform them into an approximately normal distribution. If that succeeds, you then run the further analyses, such as significance tests, on the transformed data. This makes it possible to apply parametric methods that require a normal distribution.
Other problems with the distribution, such as heteroskedasticity, non-linearity, or outliers, can sometimes also be fixed with transformations.
• natural, log2, and log10 logarithms are most often used
• zero values present? Add 1 to all values: log(n+1)
• negative values: asinh transformation
• if used, back-transform results (for instance confidence intervals) for
reporting
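A sketch of the workflow on made-up right-skewed data:
x <- rexp(50, rate = 0.1)     # right-skewed data
lx <- log10(x + 1)            # log(n+1), in case zeros are present
hist(lx)                      # closer to a normal distribution
ci <- t.test(lx)$conf.int     # CI on the log scale
10^ci - 1                     # back-transform for reporting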
Multiple testing correction
Bonferroni correction
- controls the family-wise error rate (FWER)
  - FWER = probability that one or more tests are false positives
- either multiply all p-values by N (and only what then still lies below 0.05 is positive)
- … or compute a new alpha: alpha_new = alpha_old / number of hypotheses
Multiple testing correction: Holm procedure
- still controls the FWER
- sort the p-values by size
- set several significance thresholds:
for the smallest p-value: alpha_new = alpha_old / N
for the second smallest: alpha_new = alpha_old / (N - 1)
…
- null hypotheses are rejected for all p-values smaller than their respective threshold (until a threshold is exceeded for the first time)
Multiple testing: Benjamini-Hochberg procedure
Controls the FDR
- sort the p-values
- set several significance thresholds:
for the smallest p-value: alpha_new = alpha / N
for the second smallest: alpha_new = 2 * alpha / N
for the third smallest: alpha_new = 3 * alpha / N
Advantage over Holm: here, further p-values can still be rejected after a threshold is exceeded for the first time
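All three corrections are available in base R via p.adjust; a quick sketch with made-up p-values:
p <- c(0.001, 0.008, 0.039, 0.041, 0.20)
p.adjust(p, method = "bonferroni")  # multiply by N (capped at 1)
p.adjust(p, method = "holm")        # step-down FWER control
p.adjust(p, method = "BH")          # Benjamini-Hochberg FDR control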
Visualizing p-values and random p-values: histogram
rn <- rnorm(100, mean = 10, sd = 2)  # simulate 100 normal values as test input
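Extending that line into a full sketch: p-values from many tests where H0 is true should be uniformly distributed:
p <- replicate(1000, t.test(rnorm(100, mean = 10, sd = 2),
                            rnorm(100, mean = 10, sd = 2))$p.value)
hist(p, breaks = 20)  # roughly flat; about 5% fall below 0.05 by chance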