topics Flashcards
1
Q
distribution for CI
A
t-distribution
2
Q
distribution for minimal sample size
A
z-distribution
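E.g., solving E = z(alpha/2) * sigma / sqrt(n) for n gives the minimal sample size n = (z(alpha/2) * sigma / E)^2, rounded up (sigma is assumed known, hence z rather than t).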
3
Q
2E and E
A
- 2E = full range of CI
- E = half of CI range = margin of error
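For a t-based CI of the mean, for example, E = t(alpha/2; n-1) * s / sqrt(n), and the CI is mean(x) +/- E.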
4
Q
power
A
- probability of correctly rejecting a false H0 (a correct decision)
- higher sample sizes yield higher power
5
Q
influence of sample size
A
the same deviation from H0 with more data yields a lower p-value
6
Q
bootstrap CI
A
- sample from the original dataset
- enlarging B will reduce the variation
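A minimal R sketch of one common variant (x is a placeholder sample; the CI uses 2*T minus the bootstrap quantiles):
x <- rnorm(50, mean = 5)                       # placeholder data
B <- 1000
Tstar <- numeric(B)
for (b in 1:B) Tstar[b] <- mean(sample(x, replace = TRUE))   # resample original data
c(2 * mean(x) - quantile(Tstar, 0.975),        # 95% bootstrap CI for the mean
  2 * mean(x) - quantile(Tstar, 0.025))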
7
Q
bootstrap test
A
- sample from H0 distribution
- compare t-value of original data to surrogate T* values
- p-value is determined by proportion of T*-values exceeding the t-value of the data
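A minimal R sketch, assuming as placeholders H0: data ~ Exp(1) and T = max:
x <- rexp(50); B <- 1000                       # placeholder data and replicate count
t_obs <- max(x)                                # t-value of the original data
Tstar <- numeric(B)
for (b in 1:B) Tstar[b] <- max(rexp(length(x), rate = 1))    # surrogate data from H0
mean(Tstar > t_obs)                            # right-sided p-value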
8
Q
sign test: test statistic
A
- number of observations larger than m0 (observations equal to m0 are dropped)
- binomial test with p = 0.5 is done on this count
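A minimal R sketch (x and m0 are placeholders):
x <- rnorm(30, mean = 5.3); m0 <- 5            # placeholder data and null median
n_eff <- sum(x != m0)                          # observations equal to m0 are dropped
binom.test(sum(x > m0), n_eff, p = 0.5)        # binomial test on the count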
9
Q
wilcoxon signed rank test
A
- requires symmetric population
- one sample or (difference between) matched pairs (wilcox.test() with 1 argument)
- lose a lot of information but is really robust
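In R (x, y, and m0 are placeholders):
x <- rnorm(20, 5); y <- rnorm(20, 4.5); m0 <- 5
wilcox.test(x, mu = m0)                        # one sample
wilcox.test(x - y, mu = 0)                     # matched pairs via the differences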
10
Q
paired sample permutation test:
- what is permuted
- what is logic behind this
A
- permute original (x,y) labels
- under H0 of no difference between distributions of X and Y within pairs, permuting the labels should not change the distribution of T
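A minimal R sketch: swapping the (x, y) labels within a pair flips the sign of that pair's difference (x, y are placeholders):
x <- rnorm(20, 5); y <- rnorm(20, 4.5)         # placeholder paired samples
t_obs <- mean(x - y)
B <- 1000; Tstar <- numeric(B)
for (b in 1:B) {
  flip <- sample(c(1, -1), length(x), replace = TRUE)        # permute labels per pair
  Tstar[b] <- mean(flip * (x - y))
}
2 * min(mean(Tstar >= t_obs), mean(Tstar <= t_obs))          # two-sided p-value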
11
Q
how to test dependence in two paired samples
A
- pearson’s correlation test
- spearman’s rank correlation test
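In R (x, y are placeholder paired samples):
x <- rnorm(20); y <- x + rnorm(20)
cor.test(x, y)                                 # pearson (default)
cor.test(x, y, method = "spearman")            # spearman rank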
12
Q
two paired samples tests
A
- sign test
- wilcoxon signed rank test
–> uses wilcox.test() with 1 argument
- permutation test
- t.test(x,y,paired=TRUE)
13
Q
two independent samples tests
A
- mann-whitney test
- kolmogorov-smirnov test
- t.test(x, y)
14
Q
mann whitney test
A
- based on ranks
- uses wilcox.test() with 2 arguments
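In R (x, y are placeholders):
x <- rnorm(20); y <- rnorm(25, 0.5)
wilcox.test(x, y)                              # two arguments -> Mann-Whitney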
15
Q
kolmogorov-smirnov test
A
- tests whether two distributions are the same
- sensitive to differences in the histograms
- T = maximum vertical difference between the summed (cumulative) histograms, i.e. the empirical CDFs
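In R (x, y are placeholders):
x <- rnorm(20); y <- rnorm(25, 0.5)
ks.test(x, y)                                  # T = max vertical distance between the two empirical CDFs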
16
Q
one-way ANOVA
A
- N*I experimental units (I groups of N)
- with I = 2 it reduces to the two-sample t-test
- always right sided
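A minimal R sketch (y and factor f are placeholders):
df <- data.frame(y = rnorm(30), f = factor(rep(1:3, each = 10)))
anova(lm(y ~ f, data = df))                    # right-sided F test for the factor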
17
Q
SSa and RSS
A
- SSa: variance due to factor
- RSS: variance not explained by factor in the model
18
Q
kruskal wallis test
A
- nonparametric anova
- based on ranks
- distribution of W under H0 = chi^2(I-1) (chi-squared with I-1 degrees of freedom)
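In R (reusing the placeholder data frame from the one-way ANOVA sketch above):
kruskal.test(y ~ f, data = df)                 # W is compared to chi^2(I-1)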
19
Q
independent samples permutation test
- what is permuted
- what is logic behind this
A
- example: one-way ANOVA
1. group labels are permuted
2. permuting the group labels should not affect the group means if there is no effect (see the sketch below)
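A minimal R sketch (placeholder data; the F statistic is recomputed after each permutation of the group labels):
fstat <- function(y, f) anova(lm(y ~ f))[1, "F value"]
df <- data.frame(y = rnorm(30), f = factor(rep(1:3, each = 10)))
t_obs <- fstat(df$y, df$f)
Tstar <- replicate(1000, fstat(df$y, sample(df$f)))          # permute group labels
mean(Tstar > t_obs)                            # right-sided p-value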
20
Q
two way ANOVA
A
- N*I*J experimental units
- main and interaction effects are tested
- I + J + 1 linear restrictions: treatment and sum parametrizations
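A minimal R sketch (f1, f2 are placeholder factors):
df <- data.frame(y = rnorm(40), f1 = factor(rep(1:2, 20)), f2 = factor(rep(1:4, each = 10)))
anova(lm(y ~ f1 * f2, data = df))              # tests main effects and interaction
contrasts(df$f1) <- contr.sum(2)               # switch to the sum parametrization (2 levels)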
21
Q
F statistic
A
- always right sided
- explained variance / unexplained variance (each scaled by its degrees of freedom)
22
Q
interaction plot
A
interaction shows up as nonparallel curves
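In R (placeholder columns from a two-way design in df):
interaction.plot(df$f1, df$f2, df$y)           # nonparallel curves suggest interaction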
23
Q
testing interaction
A
- model that includes interaction –> only significance of interaction effect is relevant
- model without interaction –> additive model. check for presence of main effect
24
Q
block designs in 2way anova
A
- randomized block design
–> block = variable not of interest
–> don't look at the significance of the block variable in the output
- repeated measures
–> block = ID
–> exchangeable case: errors within a single unit are exchangeable, meaning that ordering is irrelevant
–> lack of exchangeability makes the block design invalid
- friedman test
–> nonparametric version of the 2 designs above (see the sketch below)
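A minimal R sketch of the friedman test (long-format placeholder data with a treatment factor and block ID):
df <- data.frame(y = rnorm(24), treatment = factor(rep(1:3, 8)), id = factor(rep(1:8, each = 3)))
friedman.test(y ~ treatment | id, data = df)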
25
Q
block designs for random effects
A
1. crossover design
–> 2 outcomes per experimental unit (paired samples)
–> apply the treatments in opposite orders between conditions
–> treatment, learning, and sequence effects
2. split plot design
–> 2 treatment factors (independent samples)
–> subplot and whole plot
- to get p-values, anova(reduced model, full model)
- (1|f) for a random effect block
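A minimal R sketch with the lme4 package (assumed installed; df is a placeholder data frame with columns y, f1, f2, and block ID id):
library(lme4)
full    <- lmer(y ~ f1 * f2 + (1 | id), data = df, REML = FALSE)
reduced <- lmer(y ~ f1 + f2 + (1 | id), data = df, REML = FALSE)
anova(reduced, full)                           # p-value for the interaction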
26
Q
unbalanced design
A
- order of variables in the model matters
- variable of interest goes last
- otherwise, p-values are unreliable
27
Q
difference RBD and split-plot design
A
RBD:
- 1 level of blocks
- fixed effects
SPD:
- 2 levels of (randomized) blocks (whole and subplots)
- mixed effects
28
Q
fixed and mixed designs
A
fixed
1. one way ANOVA
2. two way ANOVA
3. randomized block design
4. repeated measures block design
mixed
1. crossover design (paired)
2. split-plot design (independent)
29
Q
contingency tables
A
- count of units in cross categories
- test statistic: sum over cells of (observed - expected)^2 / expected counts
- always right-sided (p-value = 1 - pchisq(T, df))
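A minimal R sketch (placeholder 2x2 counts):
tab <- matrix(c(10, 20, 30, 25), nrow = 2)     # placeholder contingency table
chisq.test(tab)                                # same as 1 - pchisq(T, df)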
30
Q
fisher's exact test
A
- for 2x2 tables
- odds ratio is used
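In R, reusing the placeholder 2x2 table above:
fisher.test(tab)                               # reports the estimated odds ratio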
31
Q
simple linear regression
A
comparable to pearson's correlation test
- will give exactly the same t-score and p-value
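A minimal R check (x, y placeholders; both calls give the same t and p):
x <- rnorm(30); y <- 2 * x + rnorm(30)
cor.test(x, y)
summary(lm(y ~ x))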
32
Q
multiple linear regression
A
- multiple explanatory variables
- to find the best parameters, we minimize the sum of squared differences (SSE)
33
Q
global model fit
A
- sigma hat: residual standard error (sigma hat squared = estimated error variance)
- R^2: proportion of explained variance compared to base model Y = B0 + e
- F-statistic and overall p-value
- all of these are found at the bottom of the output
34
Q
coefficients in multiple linear regression
A
- not all variables have explanatory power
- we need to find the relevant ones by testing for individual coefficients
- these are found in the individual rows of the output
35
Q
step up and step down method
A
step down: remove highest nonsignificant variable
step up: add significant variable that yields maximum increase in R^2
36
Q
preferred linear model has
A
1. fewest variables
2. highest R^2 (or only slight decrease)
3. interpretability
37
Q
confidence interval
A
for the population mean of Ynew (the mean response at a new x value)
38
Q
prediction interval
A
- for an individual observation of Ynew
- wider than the CI because the individual error term is also taken into account
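A minimal R sketch contrasting the two intervals (placeholder model and new point):
df <- data.frame(x = 1:20, y = 2 * (1:20) + rnorm(20))
fit <- lm(y ~ x, data = df)
new <- data.frame(x = 10)
predict(fit, new, interval = "confidence")     # CI for the mean response
predict(fit, new, interval = "prediction")     # wider PI for one new observation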
39
Q
model assumptions linear regression
A
1. linearity of the relationship
2. normality of the errors
40
Q
outliers
A
extremely low or high observation on the response variable
41
Q
leverage point
A
extremely low or high observation on the explanatory variable
42
Q
effect of leverage point
A
- can be studied by testing the model fit with and without the leverage point
- if the parameters change drastically by deleting this point, it's called an influence point
- cook's distance quantifies the influence of an observation on the predictions (> 1 indicates an influence point)
43
Q
mean shift outlier model
A
- dummy vector with all 0s but 1 at the outlier index
- include it as a variable in the model
- if the dummy variable is significant, the outlier is significant (see the sketch below)
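A minimal R sketch (df is a placeholder data frame with response y and predictor x; index 17 is a placeholder outlier position):
df$d <- 0; df$d[17] <- 1                       # dummy: 1 only at the outlier index
summary(lm(y ~ x + d, data = df))              # significant d -> significant outlier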
44
Q
collinearity
A
- linear relations between explanatory variables, meaning they explain the same variation
- straight line in the scatterplot
- reflected in large variances and large CIs –> unreliable estimates
45
Q
how to investigate collinearity
A
1. pairwise linear correlations
2. variance inflation factor, VIF (> 5 = concern; see the sketch below)
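A minimal R sketch (assuming the car package is installed; y, x1, x2, x3 in df are placeholders):
library(car)
vif(lm(y ~ x1 + x2 + x3, data = df))           # VIF > 5 suggests collinearity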
46
Q
ANCOVA
A
- extends ANOVA by including one or more variables that are expected to influence the dependent variable, but are not of primary interest
- adjusts the DV for the covariates by holding them constant
- variable not of interest is continuous (unlike RBD)
- the only relevant p-value is for the variable of interest
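A minimal R sketch (factor of interest f and continuous covariate x in df are placeholders; the covariate goes first, the factor of interest last):
anova(lm(y ~ x + f, data = df))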
47
Q
summary() parameter estimates
A
gives coefficient estimates as the difference between ai and a1
48
Q
anova()
A
gives p-values and F-statistics for whole model terms
49
Q
interaction between relevant factor and irrelevant variable (ANCOVA)
A
- H0: B1 = ... = Bi
- parallel lines = no interaction
- modeled with B_i instead of gamma
- look at the interaction p-value in the output; the other values should be calculated separately
50
Q
order of factors
A
1. does not matter in balanced ANOVA
2. matters in unbalanced ANOVA
3. matters in ANCOVA (always)
4. matters in logistic regression (always)
51
Q
family wise error rate
A
- probability of making a Type I error (false positive) when multiple comparisons are tested
- to guarantee FWER < 0.05, we use the bonferroni correction (alpha_ind = 0.05/m)
52
Q
multiple testing arises when
A
1. there are many parameters of interest
2. investigating all differences between the levels of a factor (a set of effects) in ANOVA
53
Q
simultaneous testing
A
- usually everything is compared to B1 or a1. this is not simultaneous testing
- tukey etc. show adjusted p-values for simultaneous testing of all Bs
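A minimal R sketch (y and factor f in df are placeholders):
TukeyHSD(aov(y ~ f, data = df))                # adjusted p-values for all pairwise differences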
54
Q
logistic regression
A
- binary outcome
- linear model for the log odds
- probability of success
55
Q
log odds
A
- log odds = log(p(success)/p(failure)) = linear model
- odds = e^model
56
Q
a change delta in the linear predictor
A
multiplies the odds by e^delta
57
Q
linear predictor
A
the additive model of intercept plus coefficients times covariates: B0 + B1x1 + ... + Bpxp
58
Q
odds
A
e^(linear predictor)
59
Q
p(y=1)
A
e^(linear predictor) / (1 + e^(linear predictor)) = 1/(1 + e^(-linear predictor))
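A minimal R sketch tying these cards together (df with a 0/1 response y and placeholder predictors x1, x2):
fit <- glm(y ~ x1 + x2, data = df, family = binomial)
exp(coef(fit))                                 # multiplicative change in the odds per unit
predict(fit, type = "response")                # fitted P(Y = 1)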
60
Q
poisson regression: lambda
A
- if Y ~ poisson(lambda), then E(Y) = var(Y) = lambda
- the larger the lambda parameter, the larger the values of Y on average, and the larger the spread in the values of Y
- for very large values of lambda, the poisson distribution is approximately normal
61
Q
lambda is modelled as
A
- log(lambda) = model
- lambda = e^model
- QQplot is not useful here
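A minimal R sketch (df with a placeholder count response):
fit <- glm(counts ~ x1 + x2, data = df, family = poisson)
exp(coef(fit))                                 # multiplicative effect on lambda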
62
Q
survival analysis
A
- analysis of lifetimes
- survival function: probability of surviving beyond time t, S(t) = P(T > t)
63
Q
hazard function
A
- rate of dying within a short interval
- how likely the event is to happen at a particular moment in time
64
Q
censoring
A
- incomplete observation of the survival time of a unit
- event indicator di = 1 if Ti <= Ci (event observed); di = 0 means censored, i.e. the event has not happened yet
65
Q
Kaplan-Meier estimator of the survival function
A
- only categorical IVs
- survival probabilities for specific times
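A minimal R sketch with the survival package (assumed installed; df with placeholder columns time, status, group):
library(survival)
fit <- survfit(Surv(time, status) ~ group, data = df)
summary(fit, times = c(1, 5))                  # survival probabilities at chosen times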
66
Q
Nelson-Aalen estimator of the cumulative hazard function
A
- step function that increases only at times where events occur
67
Q
log rank test
A
- tests whether 2+ survival curves are identical
- can only deal with grouped data
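In R, with the same placeholder columns as the Kaplan-Meier sketch above:
survdiff(Surv(time, status) ~ group, data = df)   # log-rank test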
68
Q
proportional hazards model
A
- unlike the KM model, it can take many kinds of predictors
- main feature: coefficients can be estimated by maximizing the partial likelihood
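A minimal R sketch (age is a placeholder continuous predictor):
fit <- coxph(Surv(time, status) ~ age + group, data = df)
exp(coef(fit))                                 # estimated hazard ratios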
69
Q
treatment parametrization
A
- 1 group is a reference group
- ai are expressed as the difference between ai and a1
- can be set with contrasts()
70
Q
sum parametrization
A
- ai are expressed as deviations from the mean
- the ai sum (and hence average) to 0