Statistics Flashcards
P-value
probability, assuming H0 is true, of obtaining a test statistic at least as extreme as the one observed from the data
(if the p value is small then the observation would be unlikely under H0, so the data are statistically significant enough to reject the null hypothesis)
Confidence interval
measures the degree of uncertainty in a sampling method; the range of plausible values for an unknown parameter, constructed around the test statistic so that it covers the true value with a stated probability (e.g. 95%)
Odds ratio
describes the strength of the association between two events; for a 2x2 table with counts a, b, c, d it is ad/bc, i.e. the odds of A in the presence of B divided by the odds of A in the absence of B: [p1/(1-p1)] / [p2/(1-p2)]
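A quick sketch of the ad/bc formula, using made-up counts from a 2x2 table:

```python
# Odds ratio from a 2x2 table (counts are illustrative, not real data):
#             outcome   no outcome
# exposed        a          b
# unexposed      c          d
a, b, c, d = 20, 80, 10, 90

odds_exposed = a / b                          # odds of the outcome when exposed
odds_unexposed = c / d                        # odds of the outcome when unexposed
odds_ratio = odds_exposed / odds_unexposed    # equals (a*d)/(b*c)

print(odds_ratio)  # 2.25
```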
Z distribution
the standard normal distribution; used for test statistics when the variance is known or the sample size is large
t distribution
used in place of the normal when the variance is unknown and must be estimated from the sample, especially for small sample sizes; has heavier tails than the normal
power
probability of avoiding a type 2 error, i.e. of correctly rejecting a false H0; power = 1 - P(type 2 error)
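As a sketch, the power of a one-sided z-test with known variance can be computed directly from the normal CDF; the effect size, sigma, n and alpha below are illustrative choices:

```python
import math

def phi(x):
    """Standard normal CDF, built from the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Power of a one-sided z-test of H0: mu = 0 vs H1: mu = delta,
# with known sigma, sample size n, significance level alpha = 0.05.
delta, sigma, n = 0.5, 1.0, 25   # illustrative values
z_alpha = 1.6449                 # upper 5% point of the standard normal

noncentrality = delta / (sigma / math.sqrt(n))  # standardised effect = 2.5 here
power = phi(noncentrality - z_alpha)            # P(reject H0 | H1 true)
beta = 1.0 - power                              # P(type 2 error)
print(round(power, 3))
```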
Type 1 error
H0 is rejected when H0 is in fact true (false positive)
Type 2 error
H0 is not rejected when H0 is in fact false (false negative)
Unbiased
the expected value of the estimator equals the parameter it estimates
MSE
MSE(T) = Var(T) + bias(T)^2
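The decomposition can be checked by simulation; the deliberately biased estimator below (a sample mean shrunk by 10%) is purely illustrative:

```python
import random
import statistics

random.seed(0)
mu, sigma, n, reps = 2.0, 1.0, 10, 20000

# Deliberately biased estimator of mu: shrink the sample mean by 10%.
estimates = []
for _ in range(reps):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    estimates.append(0.9 * statistics.fmean(sample))

mse = statistics.fmean((t - mu) ** 2 for t in estimates)  # empirical MSE
var = statistics.pvariance(estimates)                     # Var(T)
bias = statistics.fmean(estimates) - mu                   # bias(T)

# MSE(T) = Var(T) + bias(T)^2 holds exactly for these empirical quantities.
print(round(mse, 4), round(var + bias ** 2, 4))
```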
consistent
the estimator converges (in probability) to the parameter as n increases
Method of moments
equate sample moments to population moments, e.g. set the sample mean equal to E(X) and solve for the parameter; equate the variance (second moment) as well when the distribution has multiple parameters
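For example, for an exponential sample with E(X) = 1/rate, equating the sample mean to E(X) gives the method-of-moments estimate (the data below are made up):

```python
import statistics

# Method of moments for an Exponential(rate) sample:
# E(X) = 1/rate, so setting the sample mean equal to E(X) gives rate_hat = 1/mean.
sample = [0.8, 1.9, 0.3, 2.4, 1.1, 0.5]  # illustrative data
rate_hat = 1.0 / statistics.fmean(sample)
print(round(rate_hat, 3))
```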
MLE
method for choosing the ‘best’ parameter value: the one that maximises the likelihood, i.e. the probability (or density) of the observed sample
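A small sketch for Bernoulli data, where the MLE has the closed form k/n (the sample proportion); a brute-force search over the log-likelihood confirms it (data illustrative):

```python
import math

# MLE for a Bernoulli parameter p, using illustrative 0/1 outcomes.
data = [1, 0, 1, 1, 0, 1, 1, 0]
k, n = sum(data), len(data)

def log_lik(p):
    """Bernoulli log-likelihood: k successes, n - k failures."""
    return k * math.log(p) + (n - k) * math.log(1 - p)

grid = [i / 1000 for i in range(1, 1000)]   # candidate p values in (0, 1)
p_grid = max(grid, key=log_lik)             # brute-force maximiser
p_closed = k / n                            # closed-form MLE: sample proportion
print(p_closed, p_grid)
```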
Sufficient
a statistic T is sufficient if it contains all the information about the parameter that is in the sample (the conditional distribution of the sample given T does not depend on the parameter)
Neyman-Fisher factorisation theorem
a statistic T is sufficient for theta if and only if the likelihood factorises as L(theta; x) = h(x) g(T(x), theta)
Cramér-Rao lower bound
the variance of any unbiased estimator is at least 1/I(theta), where I(theta) is the Fisher information
Invariance property
if theta-hat is the MLE of theta and g is a 1-1 monotonic function, then g(theta-hat) is the MLE of g(theta)
significance level/ size
P(type 1 error) = P(reject H0 | H0 true)
Statistic to use when comparing two variances
the F statistic, the ratio of the two sample variances (a ratio of independent chi-squared variables, each divided by its degrees of freedom)
What to do if testing two means (non independent)
take the differences within each pair, then apply a one-sample t test to the differences (paired t test)
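A minimal paired t-test sketch with made-up before/after measurements:

```python
import math
import statistics

# Paired t-test: take within-pair differences and run a one-sample
# t-test on them against a mean of zero. Values are illustrative.
before = [12.1, 11.4, 13.0, 10.8, 12.6]
after_ = [11.2, 10.9, 12.1, 10.1, 11.8]
diffs = [b - a for b, a in zip(before, after_)]

d_bar = statistics.fmean(diffs)
s_d = statistics.stdev(diffs)            # sample standard deviation of differences
n = len(diffs)
t_stat = d_bar / (s_d / math.sqrt(n))    # compare with a t distribution on n-1 df
print(round(t_stat, 3))
```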
difference between classical and Bayesian
classical: theta would be an unknown fixed parameter and the likelihood is a function of the sample
Bayesian: assumes the parameter is a random variable which we assign prior beliefs onto
model deviance
sum of the squared differences between the observed values and the model's fitted values (the residual sum of squares)
least squares
approximate a solution (the parameters of the model) by minimising the sum of squared residuals
Gauss-Markov Theorem
if β̂ is the least squares estimator of β, then aᵀβ̂ is the unique linear unbiased estimator of aᵀβ with minimum variance
test for existence of regression
using the F statistic (based on the difference of model deviances)
When to transform variables of a model
transform the variables of a model (Y or X) if the residuals show a pattern (non-random scatter or non-constant variance)
total sum of squares
the total variability in y
coefficient of determination (R^2)
proportion of variability explained by the regression (model)
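A sketch tying the least squares and R^2 cards together: fit a one-predictor model by least squares, then compute R^2 = 1 - SSE/SST (the data are illustrative):

```python
import statistics

# Least squares fit of y = b0 + b1*x for one predictor, then
# R^2 = 1 - SSE/SST, the proportion of variability explained.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]   # made-up, roughly linear data

x_bar, y_bar = statistics.fmean(x), statistics.fmean(y)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
b0 = y_bar - b1 * x_bar

fitted = [b0 + b1 * xi for xi in x]
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))  # residual sum of squares
sst = sum((yi - y_bar) ** 2 for yi in y)                # total sum of squares
r_squared = 1.0 - sse / sst
print(round(r_squared, 4))
```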
ANOVA
uses the F statistic to test for the existence of regression, i.e. whether the model explains a significant part of the variability
decision rule
formal rule which spells out the circumstances under which you would reject the null hypothesis
Bernoulli distribution
two outcomes (success or failure), with probability p
Binomial distribution
two outcomes (success or failure), repeated n times, probability p, x is number of successes
Geometric distribution
two outcomes (success or failure), x is the number of trials until the first success occurs
Poisson distribution
models counts of events occurring at a constant rate, with parameter lambda (originally used to approximate the binomial distribution with large n and small p)
Chi-squared distribution
the square of a standard normal variable is chi-squared with 1 degree of freedom; the sum of n independent squared standard normals is chi-squared with n degrees of freedom
Central limit theorem
when independent random variables are added, their properly normalised sum tends toward a normal distribution, even if the original variables themselves are not normally distributed
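A quick simulation sketch: standardised sums of uniform variables have sample mean near 0 and standard deviation near 1, as the CLT predicts (the seed and sizes are arbitrary choices):

```python
import random
import statistics

random.seed(1)

# CLT illustration: sums of 30 Uniform(0,1) variables (mean 0.5, variance 1/12),
# standardised so they should look approximately standard normal.
n, reps = 30, 5000
mu, var = 0.5, 1.0 / 12.0

zs = []
for _ in range(reps):
    s = sum(random.random() for _ in range(n))
    zs.append((s - n * mu) / (n * var) ** 0.5)

print(round(statistics.fmean(zs), 2), round(statistics.stdev(zs), 2))
```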
Confounding variable
say we want to show the effects of X on Y then C is a confounder if it has an effect on X and Y. So confounding is when the true effect of X on Y is hidden by another variable.
Interaction
an interaction is when the effect of one variable on the outcome depends on another variable.
What information would you need from a clinician in order to perform a sample size calculation?
We need the power (equivalently the type 2 error rate), the type 1 error rate, the clinically relevant difference, the known variance sigma^2, and for two samples the allocation ratio between treatment groups. We then approximate n via the Z (normal) distribution and iterate using the t distribution to refine n.
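A sketch of the normal-approximation step for a one-sample, two-sided test; the quantiles are hard-coded for alpha = 0.05 and 90% power, and sigma and the difference are illustrative:

```python
import math

# One-sample, two-sided sample size via the normal approximation:
# n >= ((z_{1-alpha/2} + z_{1-beta}) * sigma / delta)^2.
z_half_alpha = 1.9600  # z_{0.975}, for alpha = 0.05 two-sided
z_beta = 1.2816        # z_{0.90}, for 90% power
sigma = 10.0           # assumed known standard deviation (illustrative)
delta = 5.0            # clinically relevant difference (illustrative)

n = ((z_half_alpha + z_beta) * sigma / delta) ** 2
print(math.ceil(n))    # round up to the next whole subject
```

In practice this normal-based n is then refined by iterating with the t distribution, since the t quantiles themselves depend on n.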
Standard error
a measure of the statistical accuracy of an estimate, equal to the standard deviation of the sampling distribution of the estimator
Standard deviation
a quantity expressing by how much the members of a group differ from the mean value for the group