Inferential statistics Flashcards
Outlier
An outlying observation, or outlier, is one that appears to deviate
markedly from other members of the sample in which it occurs.
Different measures:
• more than 3 SD away from the mean
• more than 1.5 times the IQR below Q1 or above Q3 (mild),
3 times (extreme) --> custom boxplot criteria
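A minimal R sketch of both criteria (the vector x is made up for illustration):
x <- c(rnorm(50, mean = 10, sd = 2), 25)  # 50 normal values plus one extreme value
z <- (x - mean(x)) / sd(x)
x[abs(z) > 3]                             # criterion 1: more than 3 SD from the mean
q <- quantile(x, c(0.25, 0.75)); iqr <- IQR(x)
x[x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr]  # mild outliers (boxplot default)
x[x < q[1] - 3 * iqr | x > q[2] + 3 * iqr]      # extreme outliers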
Confidence Interval (CI)
We need to quantify uncertainty about the population value:
• a confidence interval states our uncertainty
• confidence intervals are available for means,
differences between means, proportions, correlations…
A confidence interval is a range of values surrounding the sample estimate that is likely to contain the population parameter. Example: the estimate is 0.25; with a 95% CI, the interval would cover the true population value in 95 of 100 repeated samples.
CI95% = mean +/- 2 SE
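A small sketch of the mean +/- 2 SE rule next to the exact version from t.test (sample is made up):
x <- rnorm(30, mean = 5, sd = 1)   # made-up sample
se <- sd(x) / sqrt(length(x))      # standard error of the mean
mean(x) + c(-2, 2) * se            # approximate 95% CI: mean +/- 2 SE
t.test(x)$conf.int                 # exact t-based 95% CI for comparison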
H0
• H0 with one group:
- our sample proportion is not different from the
known population proportion
- our sample mean is not different from the known population mean
• H0 with two or more groups:
- groups are not different from each other regarding their population proportions or means
• alternative hypothesis: the same sentences, but without "not"
p-value
gives the probability of observing a difference as large as or even larger than the one seen in the sample if H0 were true!
• Can we reject the null hypothesis?
• We will never know if the null hypothesis is true or false!!
• We almost always observe a difference!!
• One group: we need to know the population
mean/proportion! Tells us how sure we are that our sample is not(!) different from the population
• Two groups: tells you how often you would get the observed or a larger difference by random sampling from two populations if the means/proportions of both populations were equal
• The p-value does not(!!) tell us how sure we are that there is really a difference
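Where the p-value shows up in practice, with two made-up groups:
a <- rnorm(20, mean = 10, sd = 2)
b <- rnorm(20, mean = 11, sd = 2)
t.test(a, b)$p.value  # chance of a difference this large or larger if H0 were true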
CI vs p-value
- p-values are, in a sense, only a measure of randomness
- CIs tell you about your proportions and the probable value in the population
- CIs are often the better measure, but unfortunately they are used less frequently in science
- maybe because reporting one number is easier than two?
Significance
• statistically significant does not necessarily mean that
the observation is "important"
• just a custom threshold (alpha) for the p-value to claim
significance (mostly < 0.05)
• significant *, highly significant **, extremely
significant ***
Testing: Numeric vs Numeric, Numeric vs Categoric, Categoric vs Categoric
• Numeric vs Numeric
- Correlation, Regression
• Numeric vs Categoric
- t-test, anova
• Categoric vs Categoric
- chisq-test, fisher-test
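A minimal R sketch matching each pairing to its base-R test (all data made up for illustration):
x <- rnorm(30); y <- x + rnorm(30)        # numeric vs numeric
cor.test(x, y)                            # correlation
g <- factor(rep(c("A", "B"), each = 15))  # numeric vs categoric
t.test(x ~ g)                             # two groups; anova(lm(x ~ g)) for more
t1 <- factor(sample(c("yes", "no"), 30, replace = TRUE))  # categoric vs categoric
t2 <- factor(sample(c("yes", "no"), 30, replace = TRUE))
chisq.test(table(t1, t2))                 # or fisher.test(table(t1, t2)) for small counts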
Central Limit Theorem
No matter how the population is distributed: the distribution of sample means will approximate a Gaussian distribution if the sample size is large enough.
What is large? It depends:
• the less normal the distribution, the more samples needed (100 should be enough in any case)
• the more normal the distribution, the fewer needed (10 or more)
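A quick simulation sketch of the theorem (the skewed exponential population is chosen arbitrarily):
pop_means <- replicate(10000, mean(rexp(100)))  # means of 10000 samples, n = 100 each
hist(pop_means, breaks = 50)                    # approximately bell-shaped despite the skewed population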
Properties of a Normal Distribution
• symmetrical, bell-shaped distribution
• extends in both directions to infinity
• mean and median are close to each other
• 95% of all values are within 2 SD
• this assumption gives very wrong results if the distribution is non-normal!!
• normal data --> t.test
• non-normal data, skewed or multi-modal distributions --> wilcox.test (see the sketch below)
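A small sketch of the last two bullets, with made-up normal and skewed samples:
a <- rnorm(20, mean = 10, sd = 2); b <- rnorm(20, mean = 12, sd = 2)
t.test(a, b)                        # normal data --> t.test
s1 <- rexp(20); s2 <- rexp(20, rate = 0.5)
wilcox.test(s1, s2)                 # skewed data --> wilcox.test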
t distribution
- derives from the normal distribution
- t is the difference between the sample mean and the population mean, divided by the SEM:
t = (mean - mu) / SEM
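Computing t by hand and checking it against t.test (one-sample case; the data and the assumed population mean of 10 are made up):
x <- rnorm(25, mean = 10.5, sd = 2)
mu <- 10                              # assumed known population mean
sem <- sd(x) / sqrt(length(x))        # standard error of the mean
(mean(x) - mu) / sem                  # t by hand
t.test(x, mu = mu)$statistic          # the same value from t.test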
Effect size: Cohen's d
How large is the difference between two groups in comparison to the standard deviation?
d = (mean1 - mean2) / pooled SD
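A minimal sketch; the function name cohens_d and the pooled-SD variant are my own choices:
cohens_d <- function(a, b) {
  pooled_sd <- sqrt(((length(a) - 1) * var(a) + (length(b) - 1) * var(b)) /
                    (length(a) + length(b) - 2))
  (mean(a) - mean(b)) / pooled_sd
}
cohens_d(rnorm(30, 10, 2), rnorm(30, 12, 2))  # about -1: groups one SD apart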
Correlation
• observe the association between two numerical
variables
• if two numerical variables are associated we say they are correlated
• the correlation coefficient is a quantity that describes the strength of the association
4 interpretations of r
Why do the variables correlate so well?
• Lipid content of membranes determines insulin sensitivity?
• Insulin sensitivity affects lipid content?
• Insulin sensitivity and lipid content are controlled by a
third factor?
• There is no correlation, our r is just a random finding (type 1 error)?
• We do not know the truth …
• Correlation does not mean causation!!!
Correlation: r-squared (r^2)
• r^2 is often also called the coefficient of determination
• r is between -1 and 1
• r^2 is between 0 and 1, and no larger than |r|
• r^2 is interpreted as the fraction of variance that is
shared between the variables
• runners: 0.19^2 = 0.036, meaning that only 3.6% of
the variance of time is shared with age
• students: 0.78^2 = 0.6084, meaning that 60% of the weight variance is shared with body size
Influence of outliers
• Just one point can change everything with Pearson correlation!
Spearman Rank Correlation
- Spearman correlation is more robust against outliers!
- Correlation with one outlier is not significant!!
- Spearman correlation is calculated on ranks of values.
- It’s a non-parametric test.
- It does not assume a normal distribution of the data.
- It is more conservative.
- If in doubt use Spearman correlation.
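A sketch of the outlier effect with made-up data: ten unrelated points plus one extreme pair:
x <- c(1:10, 30); y <- c(rnorm(10), 100)
cor(x, y, method = "pearson")   # large, driven almost entirely by the single outlier
cor(x, y, method = "spearman")  # computed on ranks, far less affected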
Effect size: Pearson's r and Spearman's rs
• Pearson's r and Spearman's rs are quite similar in their values
• but rs^2 is the proportion of shared variance for the ranks
• Kendall's tau is numerically different: about
66-75% of r or rs; don't square it
• r of 0.1 small effect, 1% of variance
• r of 0.3 medium effect, 9% of variance
• r of 0.5 large effect, 25% of variance
partial correlation
In probability theory and statistics, partial correlation measures the degree of association between two random variables, with the effect of a set of controlling random variables removed. If we are interested in finding whether or to what extent there is a numerical relationship between two variables of interest, using their correlation coefficient will give misleading results if there is another, confounding, variable that is numerically related to both variables of
interest.
• partial correlation of body height and weight after removing the effect of sex
• correlation of shoe size and writing capabilities after removing the effect of …
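A residual-based sketch in base R (partial_cor is my own helper; for linear relationships this equals the partial correlation):
partial_cor <- function(x, y, z) {
  cor(resid(lm(x ~ z)), resid(lm(y ~ z)))  # correlate what is left after removing z
}
z <- rnorm(100)          # confounder
x <- z + rnorm(100)      # both variables depend on z
y <- z + rnorm(100)
cor(x, y)                # substantial, but driven by z
partial_cor(x, y, z)     # near zero once z is removed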
Mutual Information
- Relates the count of a specific combination to the count of all combinations (for example, the number of all AC pairs compared to all pairs)
  - Gives correlation information: how far the pairs across the sequence deviate from statistical independence
  - Incidentally also called Boltzmann entropy, …
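A rough sketch computing mutual information from pair counts, as described above (mutual_info is my own helper):
mutual_info <- function(a, b) {
  pxy <- table(a, b) / length(a)           # joint pair frequencies (e.g. AC pairs)
  px <- rowSums(pxy); py <- colSums(pxy)   # marginal frequencies
  ind <- outer(px, py)                     # expected frequencies under independence
  sum(ifelse(pxy > 0, pxy * log2(pxy / ind), 0))
}
a <- sample(c("A", "C", "G", "T"), 1000, replace = TRUE)
mutual_info(a, a)          # maximal: a sequence determines itself
mutual_info(a, sample(a))  # near zero: shuffling destroys the association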
Correlation vs Regression
Correlation
• description of an undirected relationship between two or more
variables
• how strong it is
• direction is not known, not existing or we are simply not interested
• phones in household and baby deaths
Regression
• description of a directed relationship between two or more variables
• one variable influences the other
• smoking and cancer
• weight and height
• model to describe the relationship
• model to predict one variable
Regression: aims and types
- looking for a trend: linear, sigmoid, exponential
- curve fitting: which model is most similar to the data
- prediction: predict response variable Y from X
- standard curve: assays
Regression Types
• simple linear regression (numerical variables)
• multiple linear regression (numerical variables)
• logistic regression (Y is categorical)
• non-linear regression (numerical variables)
• regression trees (Y is numerical)
• classification trees (Y is categorical)
Simple linear regression
• most common regression type
• method to fit the best straight line to a cloud of data points
• one variable (independent) is used to predict a second (dependent)
In statistics, simple linear regression is a linear regression model with a single explanatory variable. That is, it concerns two-dimensional sample points with one independent variable and one dependent variable (conventionally, the x and y coordinates in a Cartesian coordinate system) and finds a linear function (a non-vertical straight line) that, as accurately as possible, predicts the dependent variable values as a function of the independent variable. The adjective simple refers to the fact that the outcome variable is related to a single predictor.
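A minimal lm() sketch with made-up data:
x <- rnorm(50, mean = 20, sd = 5)     # independent variable
y <- 2 * x + 1 + rnorm(50, sd = 3)    # dependent variable with noise
fit <- lm(y ~ x)                      # fit the best straight line
coef(fit)                             # intercept and slope, close to 1 and 2
plot(x, y); abline(fit)               # data cloud with the fitted line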
Regression: predict values
Use the equation to determine Y for certain values of X. Example: what would be the insulin response for C20-22 values of 0, 15, 16, and 17%?
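A sketch with predict(); the data frame and variable names are made up, standing in for the C20-22 / insulin example:
df <- data.frame(x = rnorm(50, 10, 3))
df$y <- 0.5 * df$x + rnorm(50)
fit <- lm(y ~ x, data = df)
predict(fit, newdata = data.frame(x = c(0, 15, 16, 17)))  # Y at X = 0, 15, 16, 17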
Linear regression: slope and intercept
Intercept (a, alpha): value of Y if X is zero (Y-intercept).
Slope (b, beta): increase in Y per one unit of X.
Example: y = 2x + 1
beta = sum((x_i - mean(x)) * (y_i - mean(y))) / sum((x_i - mean(x))^2)
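The formula computed by hand and checked against lm (made-up data):
x <- rnorm(40); y <- 2 * x + 1 + rnorm(40)
beta <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
alpha <- mean(y) - beta * mean(x)   # the intercept follows from the means
c(alpha, beta)                      # close to 1 and 2
coef(lm(y ~ x))                     # the same values from lm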
Residual in regression (and error)
Error (or disturbance) of an observed value is the deviation of the observed value from the (unobservable) true value of a quantity of interest (for example, a population mean).
Residual of an observed value is the difference between the observed value and the estimated value of the quantity of interest (for example a sample mean) (Wikipedia).
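A tiny sketch of the residual definition: observed minus fitted values:
x <- rnorm(30); y <- x + rnorm(30)
fit <- lm(y ~ x)
all.equal(as.numeric(residuals(fit)), as.numeric(y - fitted(fit)))  # TRUE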
When to use a non-parametric test?
- outcome is a rank or score
- some values are "too high" or "too low" (off the measurable scale)
- the measurements, although numbers, are just not Gaussian
- you can try to transform the data
- log-transformation is often used
Tests with Gaussian vs non-Gaussian data and large vs small sample size
Non-Gaussian data:
• large sample: parametric test OK (but what is large? it depends how far the data are from the normal distribution)
• small sample: parametric test not OK; p-value wrong!
Gaussian data:
• large sample: non-parametric test OK; p-value slightly too large
• small sample: non-parametric test not OK; p-value much too high; lack of statistical power
(log2/log10) transformation of the data
If your data turn out not to be normally distributed, you can try to transform them into an approximately normal distribution. If that succeeds, you then run the further analyses, such as significance tests, on the transformed data. This makes it possible to apply parametric methods that require a normal distribution.
Other problems with the distribution, such as heteroskedasticity, non-linearity, or outliers, can sometimes also be fixed with transformations.
• natural, log2, and log10 logarithms are most often used
• zero values present? Add 1 to all values: log(n+1)
• negative values: asinh transformation
• if used, back-transform results (for instance confidence intervals) for
reporting
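A sketch of the workflow on made-up right-skewed data:
x <- rexp(50, rate = 0.1)     # right-skewed data
lx <- log10(x + 1)            # log(n+1), in case zeros are present
hist(lx)                      # closer to a normal distribution
ci <- t.test(lx)$conf.int     # CI on the log scale
10^ci - 1                     # back-transform for reporting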
Multiple testing correction
Bonferroni correction
- controls the family-wise error rate (FWER)
  - FWER = probability that one or more tests are false positives
- either multiply all p-values by N (and only what then still lies below 0.05 is positive)
- … or compute a new alpha: alpha_new = alpha_old / number of hypotheses
Multiple testing correction: Holm procedure
- still controls the FWER
- sort the p-values by size
- set several significance thresholds:
for the smallest p-value: alpha_new = alpha_old / N
for the second smallest: alpha_new = alpha_old / (N - 1)
…
- null hypotheses are rejected for all p-values smaller than their respective threshold (until a threshold is exceeded for the first time)
Multiple testing: Benjamini-Hochberg procedure
Controls the FDR
- sort the p-values
- set several significance thresholds:
for the smallest p-value: alpha_new = alpha / N
for the second smallest: alpha_new = 2 * alpha / N
for the third smallest: alpha_new = 3 * alpha / N
Advantage over Holm: here, further p-values can still be rejected after a threshold is exceeded for the first time
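All three corrections are available in base R via p.adjust; a quick sketch with made-up p-values:
p <- c(0.001, 0.008, 0.039, 0.041, 0.20)
p.adjust(p, method = "bonferroni")  # multiply by N (capped at 1)
p.adjust(p, method = "holm")        # step-down FWER control
p.adjust(p, method = "BH")          # Benjamini-Hochberg FDR control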
Visualizing p-values and random p-values: histogram
rn <- rnorm(100, mean = 10, sd = 2)  # simulate 100 normal values as test input
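Extending that line into a full sketch: p-values from many tests where H0 is true should be uniformly distributed:
p <- replicate(1000, t.test(rnorm(100, mean = 10, sd = 2),
                            rnorm(100, mean = 10, sd = 2))$p.value)
hist(p, breaks = 20)  # roughly flat; about 5% fall below 0.05 by chance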