Inferential statistics Flashcards

1
Q

Outlier

A

An outlying observation, or outlier, is one that appears to deviate
markedly from other members of the sample in which it occurs.
Different measures:
• more than 3 SD away from the mean
• more than 1.5 times the IQR beyond the quartiles (mild) or
more than 3 times (extreme): custom boxplot criteria
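The IQR criterion above can be sketched directly in code; a minimal illustration (Python in place of R, with an invented data vector):

```python
import numpy as np

def find_outliers(values, k=1.5):
    """Flag values beyond k * IQR from the quartiles (k=1.5 mild, k=3 extreme)."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]

data = [10, 12, 11, 13, 12, 11, 40]
print(find_outliers(data))        # [40.] flagged as a mild outlier
print(find_outliers(data, k=3))   # [40.] still flagged as extreme
```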

2
Q

Confidence Interval (CI)

A

Confidence interval
we need to quantify uncertainty about the population value
• a confidence interval states our uncertainty
• confidence intervals are available for means,
differences between means, proportions, correlations…
A confidence interval is a range of values surrounding the sample estimate that is likely to contain the population parameter. Example: the estimate is 0.25; with a 95% CI, the interval would cover the true population value in 95 out of 100 repeated samples.
CI95% = mean +/- 2 SE
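The mean ± 2 SE rule is the large-sample shortcut; a small sketch using the exact t quantile instead (Python/scipy in place of R, data invented):

```python
import numpy as np
from scipy import stats

def mean_ci95(sample):
    """95% CI for a mean: mean +/- t * SE (t is about 2 for moderate n)."""
    sample = np.asarray(sample, dtype=float)
    se = sample.std(ddof=1) / np.sqrt(len(sample))
    t = stats.t.ppf(0.975, df=len(sample) - 1)
    return sample.mean() - t * se, sample.mean() + t * se

low, high = mean_ci95([4.8, 5.1, 5.0, 5.3, 4.9, 5.2])
print(round(low, 2), round(high, 2))  # interval centred on the sample mean 5.05
```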

3
Q

H0

A

• H0 with one group:
- our sample proportion is not different from the known population proportion
- our sample mean is not different from the known population mean
• H0 with two or more groups:
- groups are not different from each other regarding their population proportions or means
• alternative hypothesis (H1): the same sentences, but without the "not"

4
Q

p-value

A

Gives the probability of observing a difference as large as, or larger than, the one seen in the sample if H0 were true!

• Can we reject the null hypothesis?
• We will never know if the null hypothesis is true or false!!
• We almost always observe a difference!!
• One group: we need to know the population mean/proportion! Tells us how sure we are that our sample is not(!) different from the population
• Two groups: tells you how often you would get the observed or a larger difference by random sampling from two populations whose means/proportions are equal
• The p-value does not(!!) tell us how sure we are that there is really a difference
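The two-group definition can be made concrete with a permutation test, which literally counts how often random relabelling under H0 produces a difference at least as large as the observed one (a sketch with invented data):

```python
import numpy as np

def perm_pvalue(a, b, n_perm=10000, seed=0):
    """Two-group permutation test: fraction of label shuffles whose
    absolute mean difference is at least the observed one (H0: equal means)."""
    rng = np.random.default_rng(seed)
    a, b = np.asarray(a, float), np.asarray(b, float)
    observed = abs(a.mean() - b.mean())
    pooled = np.concatenate([a, b])
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(pooled[:len(a)].mean() - pooled[len(a):].mean())
        count += diff >= observed - 1e-12  # tolerance for float summation order
    return count / n_perm

# clearly separated groups: few shuffles reach the observed difference
print(perm_pvalue([5.1, 4.9, 5.3, 5.0], [6.2, 6.0, 6.4, 6.1]))
```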

5
Q

CI vs p-value

A
  • p-values are essentially only a measure of randomness
  • CIs tell you about your estimates and the probable value in the population
  • CIs are often the better measure, but unfortunately less frequently used in science
  • maybe because reporting one number is easier than two?
6
Q

Significance

A

• statistically significant does not necessarily mean that
the observation is "important"
• just a custom threshold (alpha) for the p-value to claim
significance (mostly < 0.05)
• significant *, highly significant **, extremely
significant ***

7
Q

Testing Numeric vs Numeric
Numeric vs Categoric
Categoric vs. Categoric

A

• Numeric vs Numeric
- correlation, regression
• Numeric vs Categoric
- t-test, ANOVA (today)
• Categoric vs Categoric
- chisq.test, fisher.test
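As a sketch of the categoric-vs-categoric row, chisq.test and fisher.test have direct scipy counterparts (the contingency counts are invented):

```python
import numpy as np
from scipy import stats

# 2x2 contingency table: rows = groups, columns = outcome counts (invented)
table = np.array([[20, 10],
                  [12, 18]])

chi2, p_chi, dof, expected = stats.chi2_contingency(table)
odds_ratio, p_fisher = stats.fisher_exact(table)
print(round(p_chi, 3), round(p_fisher, 3))  # both tests give a p-value for the same question
```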

8
Q

Central Limit Theorem

A

No matter how the population is distributed: the population of sample means will approximate a Gaussian distribution if the sample size is large enough

What is "large"? It depends:
• the less normal the distribution, the more samples are needed (100 should be enough in any case)
• the more normal the distribution, the fewer (10 or more)
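The theorem is easy to check by simulation; here sample means drawn from a strongly skewed exponential population come out approximately Gaussian (a Python sketch):

```python
import numpy as np

# CLT sketch: 10000 samples of size 100 from a skewed exponential population
rng = np.random.default_rng(42)
sample_means = rng.exponential(scale=1.0, size=(10000, 100)).mean(axis=1)

# For exp(1): population mean 1, SD 1, so means of n=100 samples
# should centre on 1 with SD about 1/sqrt(100) = 0.1
print(round(sample_means.mean(), 2))  # close to 1.0
print(round(sample_means.std(), 2))   # close to 0.1
```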

9
Q

Properties of a Normal Distribution

A
• symmetrical bell-shaped distribution
• extends in both directions to infinity
• mean and median are close to each other
• 95% of all values are within 2 SD
• this assumption gives very wrong results if the distribution is non-normal!!
• normal data --> t.test
• non-normal data, skewed or multi-modal
distributions --> wilcox.test
10
Q

t distribution

A
  • derives from the normal distribution
  • t is the difference between the sample mean and the population mean, divided by the SEM:
    t = (sample mean - population mean) / SEM
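The definition of t can be checked directly against t.test's Python counterpart (invented sample, hypothesised population mean of 10):

```python
import numpy as np
from scipy import stats

sample = np.array([10.2, 9.8, 10.5, 10.1, 9.9, 10.4])
mu0 = 10.0  # hypothesised population mean

# t by hand: (sample mean - population mean) / SEM
sem = sample.std(ddof=1) / np.sqrt(len(sample))
t_manual = (sample.mean() - mu0) / sem

t_scipy, p = stats.ttest_1samp(sample, mu0)
print(round(t_manual, 4), round(t_scipy, 4))  # both values agree
```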

11
Q

Effect size: Cohen's d

A

How large is the difference between the two group means compared to the (pooled) standard deviation: d = (mean1 - mean2) / SDpooled.
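A minimal sketch of the pooled-SD version of d (data invented):

```python
import numpy as np

def cohens_d(a, b):
    """Cohen's d: difference of group means divided by the pooled SD."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

print(round(cohens_d([1, 2, 3, 4, 5], [3, 4, 5, 6, 7]), 4))  # -1.2649: a large effect
```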

12
Q

Correlation

A

• observe the association between two numerical
variables
• if two numerical variables are associated we say they are correlated
• the correlation coefficient is a quantity that describes the strength of the association

13
Q

4 interpretations of r

A

Why do the variables correlate so well?
• Lipid content of membranes determines insulin sensitivity?
• Insulin sensitivity affects lipid content?
• Insulin sensitivity and lipid content are controlled by a third factor?
• There is no correlation, our r is just a random finding (type 1 error)?
• We do not know the truth …
• Correlation does not mean causation!!!

14
Q

Correlation: r-squared r^2

A

• r² is often also called the coefficient of determination
• r is between -1 and 1
• r² is between 0 and 1 and not larger than |r|
• r² is interpreted as the fraction of variance that is shared between the variables
• runners: 0.19² = 0.036, meaning that only 3.6% of the variance in time is shared with age
• students: 0.78² = 0.6084, meaning that about 60% of the weight variance is shared with height
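A quick sketch of the r / r² relationship with simulated height-weight style data (all numbers invented):

```python
import numpy as np
from scipy import stats

# Toy data: a linear relationship plus noise
rng = np.random.default_rng(1)
x = np.linspace(150, 200, 50)             # e.g. height in cm
y = 0.9 * x - 100 + rng.normal(0, 5, 50)  # e.g. weight in kg

r, p = stats.pearsonr(x, y)
print(round(r, 3), round(r**2, 3))  # r**2 = fraction of shared variance, smaller than r
```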

15
Q

Influence of outliers

A

• Just one point can change everything with Pearson correlation!

16
Q

Spearman Rank Correlation

A
  • Spearman correlation is more robust against outliers!
  • In the example, the correlation with one outlier is not significant!!
  • Spearman correlation is calculated on the ranks of the values.
  • It is a non-parametric test.
  • It does not assume a normal distribution of the data.
  • It is more conservative.
  • If in doubt, use Spearman correlation.
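The robustness claim can be demonstrated: with essentially uncorrelated data plus one extreme point, Pearson's r is dragged near 1 while Spearman's rank-based rs stays low (data invented):

```python
import numpy as np
from scipy import stats

# Bulk of the data is uncorrelated; one extreme point dominates Pearson
x = np.array([1, 2, 3, 4, 5, 6, 7, 100])
y = np.array([5, 2, 4, 1, 3, 5, 2, 90])

r_pearson, _ = stats.pearsonr(x, y)
r_spearman, _ = stats.spearmanr(x, y)
print(round(r_pearson, 2), round(r_spearman, 2))  # Pearson near 1, Spearman low
```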
17
Q

Effect size: Pearson's r and Spearman's rs

A

• Pearson's r and Spearman's rs are quite similar in their values
• but rs² is the proportion of shared rank variance
• Kendall's tau is numerically different: about 66-75% of r or rs; don't square it
• r of 0.1: small effect, 1% of variance
• r of 0.3: medium effect, 9% of variance
• r of 0.5: large effect, 25% of variance

18
Q

partial correlation

A

In probability theory and statistics, partial correlation measures the degree of association between two random variables with the effect of a set of controlling random variables removed. If we want to know whether, or to what extent, there is a numerical relationship between two variables of interest, their plain correlation coefficient will give misleading results if a confounding variable is numerically related to both.
• partial correlation of body height and weight after removing the effect of sex
• correlation of shoe size and writing capabilities after removing the effect of …
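A sketch of the idea with a simulated confounder z that drives both x and y; the partial correlation is computed by correlating the residuals after regressing each variable on z (all data invented):

```python
import numpy as np
from scipy import stats

def partial_corr(x, y, z):
    """Partial correlation of x and y controlling for z:
    correlate the residuals after regressing each variable on z."""
    x, y, z = (np.asarray(v, float) for v in (x, y, z))
    rx = x - np.polyval(np.polyfit(z, x, 1), z)
    ry = y - np.polyval(np.polyfit(z, y, 1), z)
    return stats.pearsonr(rx, ry)[0]

rng = np.random.default_rng(0)
z = rng.normal(size=200)                  # confounder
x = z + rng.normal(scale=0.5, size=200)   # driven by z
y = z + rng.normal(scale=0.5, size=200)   # driven by z
print(round(stats.pearsonr(x, y)[0], 2))  # high raw correlation
print(round(partial_corr(x, y, z), 2))    # near zero once z is removed
```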

19
Q

Mutual Information

A
  • Relates the count of a particular pairing to the count of all pairings (for example, the number of AC pairs compared to all pairs)

o Gives correlation information: how strongly, compared to statistical independence, the pairs across the sequence are correlated
o Incidentally also called Boltzmann entropy, …
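A minimal plug-in estimator of this definition (MI in bits from pair counts); the AC-pair example from the card is mimicked with invented aligned columns:

```python
from collections import Counter
import math

def mutual_information(pairs):
    """MI (in bits) between two aligned symbol columns, estimated from counts:
    sum over pairs of p(a,b) * log2(p(a,b) / (p(a) * p(b)))."""
    n = len(pairs)
    pxy = Counter(pairs)
    px = Counter(a for a, _ in pairs)
    py = Counter(b for _, b in pairs)
    mi = 0.0
    for (a, b), c in pxy.items():
        mi += (c / n) * math.log2(c * n / (px[a] * py[b]))
    return mi

# Perfectly dependent columns: knowing one base fixes the other
print(mutual_information([("A", "C"), ("A", "C"), ("G", "U"), ("G", "U")]))  # 1.0 bit
```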

20
Q

Correlation vs Regression

A

Correlation
• description of an undirected relationship between two or more variables
• how strong it is
• direction is not known, not existing, or we are simply not interested
• phones in household and baby deaths

Regression
• description of a directed relationship between two or more variables
• one variable influences the other
• smoking and cancer
• weight and height
• model to describe the relationship
• model to predict one variable
21
Q

Regression; aims, types;

A
  • looking for a trend: linear, sigmoid, exponential
  • curve fitting: which model is most similar to the data
  • prediction: predict response variable Y from X
  • standard curve: assays

Regression Types
• simple linear regression (numerical variables)
• multiple linear regression (numerical variables)
• logistic regression (Y is categorical)
• non-linear regression (numerical variables)
• regression trees (Y is numerical)
• classification trees (Y is categorical)

22
Q

Simple linear regression

A
Simple Linear Regression
• most common regression type
• method to fit a best straight line to a cloud of data points
• one (independent) variable is used to predict a second (dependent) variable
In statistics, simple linear regression is a linear regression model with a single explanatory variable. That is, it concerns two-dimensional sample points with one independent variable and one dependent variable (conventionally, the x and y coordinates in a Cartesian coordinate system) and finds a linear function (a non-vertical straight line) that, as accurately as possible, predicts the dependent variable values as a function of the independent variable. The adjective simple refers to the fact that the outcome variable is related to a single predictor.
23
Q

Regression predict values

A

Use the equation to determine Y for certain values of X. Example: what would the insulin response be for C20.22 values of 0, 15, 16 and 17%?

24
Q

Linear regression: slope and intercept

A

Intercept (a, alpha): value of Y when X is zero (Y-intercept).
Slope (b, beta): increase in Y per one unit of X.
Example: y = 2x + 1
beta = sum((xi - xbar) * (yi - ybar)) / sum((xi - xbar)^2)
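The slope formula translated directly into code; it recovers the card's example line y = 2x + 1 exactly from points on that line:

```python
import numpy as np

def fit_line(x, y):
    """Least-squares fit:
    b = sum((xi - xbar)(yi - ybar)) / sum((xi - xbar)^2),  a = ybar - b * xbar."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    a = y.mean() - b * x.mean()
    return a, b

# Points on y = 2x + 1
a, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(a, b)  # 1.0 2.0
```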

25
Q

Residual in regression (and error)

A

Error (or disturbance) of an observed value is the deviation of the observed value from the (unobservable) true value of a quantity of interest (for example, a population mean).
Residual of an observed value is the difference between the observed value and the estimated value of the quantity of interest (for example a sample mean) (Wikipedia).

26
Q

When use a non-parametric test?

A
  • outcome is a rank or score
  • some values are "too high" or "too low": off-scale to measure
  • measurements, although numbers, are just not Gaussian
  • you can try to transform the data
  • log-transformation is often used
27
Q

Tests with gaussian distribution and large vs small size

A
Non-Gaussian data
• large sample: parametric tests ok (but what is "large"? depends how far from the normal distribution)
• small sample: parametric tests not ok; p-value wrong!

Gaussian data
• large sample: non-parametric tests ok; p-value slightly too large
• small sample: non-parametric tests not ok; p-value much too high; lack of statistical power

28
Q

(log2/log10) transformation of the data

A

If your data turn out not to be normally distributed, you can try to transform them into an approximately normal distribution. If that succeeds, you run the subsequent analyses such as significance tests on the transformed data. It is then possible to apply parametric methods that require normality.

Other problems with the distribution, such as heteroscedasticity, non-linearity or outliers, may also be fixable with transformations.
• natural log, log2 and log10 are the most often used
• values of zero present? Add 1 to all values: log(n + 1)
• negative values: asinh transformation
• if used, back-transform for instance confidence intervals for
reporting
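A small sketch of the listed options (skewed data invented):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
raw = rng.lognormal(mean=2.0, sigma=0.5, size=1000)  # right-skewed data

transformed = np.log(raw)                 # natural log; np.log2 / np.log10 work likewise
with_zeros = np.log1p([0, 3, 7])          # log(n + 1) handles zeros
negatives = np.arcsinh([-5.0, 0.0, 5.0])  # asinh handles negative values

# the transformation removes most of the skew
print(round(stats.skew(raw), 1), round(stats.skew(transformed), 1))
```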

29
Q

Multiple testing correction

Bonferroni correction

A
  • controls the family-wise error rate (FWER)
    o FWER = probability that one or more tests are false positives
  • either multiply all p-values by N (and only what is then still below 0.05 is positive)
  • … or compute a new alpha = old alpha / number of hypotheses
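Both equivalent formulations in one sketch (p-values invented):

```python
def bonferroni(pvalues, alpha=0.05):
    """Bonferroni: multiply each p-value by N (capped at 1); reject if still < alpha.
    Equivalent to comparing the raw p-values against alpha / N."""
    n = len(pvalues)
    adjusted = [min(1.0, p * n) for p in pvalues]
    rejected = [p_adj < alpha for p_adj in adjusted]
    return adjusted, rejected

adjusted, rejected = bonferroni([0.001, 0.02, 0.04, 0.3])
print([round(p, 4) for p in adjusted])  # [0.004, 0.08, 0.16, 1.0]
print(rejected)                         # [True, False, False, False]
```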
30
Q

Multiple testing correction: Holm procedure

A
  • still controls the FWER
  • sort the p-values by size
  • define several significance thresholds:
    for the smallest p-value: alphaNew = oldAlpha / N
    for the second smallest: alphaNew = oldAlpha / (N - 1)
  • null hypotheses are rejected for all p-values smaller than their corresponding threshold (until a threshold is exceeded for the first time)
31
Q

Multiple testing correction: Benjamini-Hochberg procedure

A
Controls the FDR (false discovery rate)
- sort the p-values
- define several significance thresholds:
for the smallest p-value:
alphaNew = alpha / N

for the second smallest:
alphaNew = 2 x alpha / N

for the third smallest:
alphaNew = 3 x alpha / N

Advantage over Holm: here, further p-values can still be rejected after the first threshold is exceeded
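A direct implementation of the step-up procedure described above (p-values invented):

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """BH step-up: the i-th smallest p-value is compared to i * alpha / N.
    Reject H0 for all p-values up to the LAST one under its threshold,
    even if an earlier one exceeded its threshold."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    last = -1
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank * alpha / n:
            last = rank
    rejected = [False] * n
    for rank, i in enumerate(order, start=1):
        if rank <= last:
            rejected[i] = True
    return rejected

# thresholds for N=5: 0.01, 0.02, 0.03, 0.04, 0.05
print(benjamini_hochberg([0.001, 0.012, 0.03, 0.04, 0.8]))  # [True, True, True, True, False]
```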

32
Q

Visualizing p-values

and random p-values

A

histogram

rn=rnorm(100,mean=10,sd=2)
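The rnorm line above generates data under H0 (true mean 10); repeating the test many times and histogramming the p-values shows the uniform distribution expected under H0. A Python sketch of the same idea:

```python
import numpy as np
from scipy import stats

# Under H0 (true mean really is 10), p-values are uniformly distributed;
# a histogram of many simulated p-values is roughly flat.
rng = np.random.default_rng(0)
pvals = [stats.ttest_1samp(rng.normal(10, 2, size=100), 10).pvalue
         for _ in range(1000)]
hist, _ = np.histogram(pvals, bins=10, range=(0, 1))
print(hist.sum())  # 1000 p-values spread roughly evenly over the 10 bins
```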