Probability and Statistics Flashcards
What is the binomial coefficient
(n k) = n! / (k! (n-k)!)
what is a Bernoulli trial
a random experiment that has only 2 possible outcomes (success or failure)
rule for probability that 2 independent events both occur
‘and’ rule -> multiplication: P(a and b) = P(a) P(b)
rule for probability that one or another event occurs
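‘or’ rule -> addition: P(a or b) = P(a) + P(b) - P(a and b)
(the last term is 0 when the events are mutually exclusive)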
how to find probability that a occurs given that b occurs
P(a | b) = P(a and b) / P(b)
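A minimal sketch of the conditional probability rule, using one fair six-sided die. The events (a = even, b = greater than 3) and the helper name `prob` are illustrative assumptions, not part of the cards.

```python
from fractions import Fraction

# Sample space for one fair six-sided die (illustrative assumption).
outcomes = range(1, 7)

def prob(event):
    """Probability of an event given as a predicate over equally likely outcomes."""
    favourable = sum(1 for o in outcomes if event(o))
    return Fraction(favourable, len(outcomes))

def is_even(o):            # event a
    return o % 2 == 0

def greater_than_3(o):     # event b
    return o > 3

p_b = prob(greater_than_3)
p_a_and_b = prob(lambda o: is_even(o) and greater_than_3(o))
p_a_given_b = p_a_and_b / p_b   # P(a | b) = P(a and b) / P(b)
print(p_a_given_b)              # 2/3
```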
pmf (probability mass function) for binomial distribution
P(X=k) =
(n k) p^k (1-p)^(n-k)
binomial coefficient × probability of k successes × probability of n-k failures
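The binomial formula can be checked numerically. A small sketch, assuming Python 3.8+ for `math.comb`; the function name `binomial_pmf` and the coin-toss numbers are illustrative.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for n independent Bernoulli trials with success probability p."""
    # comb(n, k) is the binomial coefficient n! / (k! (n-k)!)
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Probability of exactly 2 heads in 4 fair coin tosses.
print(binomial_pmf(2, 4, 0.5))  # 0.375
```

Summing `binomial_pmf(k, n, p)` over k = 0..n should give 1, which is a handy sanity check.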
definition of Expectation
the sum of all possible outcomes, weighted by their probabilities
When can the Poisson distribution be used
large n
small p
(ie rare events)
formula for µ, the density parameter
µ = np (=E(x))
n = number of trials
p = probability of success
pmf for Poisson distribution
P(X=k) ≈
e^(-µ) µ^k / k!
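To see how close the Poisson form e^(-µ) µ^k / k! gets to the exact binomial for large n and small p, here is a rough comparison. The values n = 1000 and p = 0.002 are arbitrary illustrative choices, and the function names are made up for this sketch.

```python
from math import comb, exp, factorial

def binomial_pmf(k, n, p):
    """Exact binomial probability of k successes in n trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, mu):
    """Poisson probability of k events with mean mu."""
    return exp(-mu) * mu**k / factorial(k)

n, p = 1000, 0.002   # large n, small p (illustrative values)
mu = n * p           # mu = np

# Print k, binomial probability, Poisson approximation side by side.
for k in range(5):
    print(k, round(binomial_pmf(k, n, p), 4), round(poisson_pmf(k, mu), 4))
```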
E(x) for binomial distribution
np
E(x) for Poisson distribution
µ = np
what are the parameters for the geometric distribution
p, probability of success
pmf for geometric distribution
P(X = k) =
(1 - p)^(k-1) p
probability of k-1 failures followed by one success
expectation E(x) for geometric distribution
1 / p
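A quick numerical check that the expectation comes out as 1 / p, taking P(X = k) = (1 - p)^(k-1) p with k the trial on which the first success occurs. The value p = 0.2 and the cut-off of 1000 terms are assumptions for illustration.

```python
def geometric_pmf(k, p):
    """P(X = k): first success on trial k, after k - 1 failures."""
    return (1 - p)**(k - 1) * p

p = 0.2
# Truncated sum of k * P(X = k); terms beyond k = 1000 are negligible for this p.
expectation = sum(k * geometric_pmf(k, p) for k in range(1, 1001))
print(round(expectation, 6), 1 / p)
```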
parameters for the exponential distribution
lambda = the rate parameter
what’s the difference between the exponential and geometric distribution
geometric = discrete
exponential = continuous
the exponential is the continuous analogue of the geometric: it approximates the geometric when n gets large and p gets very small
pdf for exponential distribution
f(x) =
lambda e^(-lambda x)
cdf for exponential distribution
F(x) =
1 - e^(-lambda x)
(if you can’t remember it, integrate the pdf between 0 and x)
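The "integrate the pdf" tip can be verified numerically. This sketch uses a simple midpoint rule (an illustrative choice, not a recommended integrator) with assumed values lambda = 0.5 and x = 3.

```python
from math import exp

def exp_pdf(x, lam):
    """Exponential pdf: lambda * e^(-lambda x), for x >= 0."""
    return lam * exp(-lam * x)

def exp_cdf(x, lam):
    """Exponential cdf: 1 - e^(-lambda x), for x >= 0."""
    return 1 - exp(-lam * x)

def integrate_pdf(x, lam, steps=10_000):
    """Midpoint-rule integral of the pdf from 0 to x."""
    h = x / steps
    return sum(exp_pdf((i + 0.5) * h, lam) for i in range(steps)) * h

lam, x = 0.5, 3.0   # illustrative values
# The two numbers should agree to several decimal places.
print(exp_cdf(x, lam), integrate_pdf(x, lam))
```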
what does the cdf show
an expression that gives the probability that a random variable X takes a value less than or equal to x (for the exponential, between 0 and x)
expected value of the exponential distribution
1 / lambda
parameters of the normal distribution
µ - the mean
sigma - the standard deviation (std)
expectation for normal distribution
µ = mean
what does the Z scale do (normal distribution)
measures how many stds a point lies from the mean of its parent distribution
normalises the data
formula for Z scale
Z = (Xi - µ) / std
Xi = point
µ = mean of parent distribution
std = std of parent distribution
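A tiny worked example of the Z formula. The numbers (a point at 130 from a parent distribution with mean 100 and std 15) are made up for illustration.

```python
def z_score(x, mu, std):
    """Number of standard deviations that x lies from the mean mu."""
    return (x - mu) / std

print(z_score(130, 100, 15))  # 2.0
```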
critical value for 2 tailed standard normal at alpha=0.05
±1.96
µ ± 1.96*sigma if not standardised
when is the t distribution used
small sample size
don’t know the population std (it is estimated from the sample)
difference between t distribution and normal distribution
t has heavier tails, therefore has more extreme critical values for the same significance level
as the sample size in t increases, the t distribution tends to the normal
formula for t scale
t = (X - µ) / Sx
X = sample mean
µ = population mean (often unknown)
Sx = standard error of mean
what is standard error of mean (SEM)
Sx = s / root(n)
s = sample std
n = sample size
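Putting the t formula and the SEM together for a one-sample case. The sample values and the hypothesised mean `mu0` are invented for illustration. `statistics.stdev` uses the n - 1 denominator, which matches the sample std s.

```python
from math import sqrt
from statistics import mean, stdev

def t_statistic(sample, mu0):
    """One-sample t: (sample mean - mu0) / standard error of the mean."""
    s = stdev(sample)               # sample std (n - 1 denominator)
    sem = s / sqrt(len(sample))     # Sx = s / root(n)
    return (mean(sample) - mu0) / sem

sample = [5.1, 4.9, 5.6, 5.8, 6.0]  # illustrative data
t = t_statistic(sample, mu0=5.0)
print(round(t, 3))
```

The resulting t would then be compared with a t-table critical value for n - 1 degrees of freedom.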
what is the p value
probability of observing a result equal to or more extreme than the observed outcome, assuming the null hypothesis is true
what is a type one error
rejecting the null when it is true
‘False positive’
what is a type two error
failing to reject the null when it is false
‘False negative’
what is alpha level
significance level: the threshold the p value must fall below for us to reject the null
probability of a type one error
why shouldn’t you use multiple t tests for multiple comparisons
the probability of making at least one type 1 error gets large as the number of tests grows
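A rough illustration of how fast the error rate grows. It assumes the m tests are independent (a simplification, since pairwise t tests on the same groups are not), so the chance of at least one false positive is 1 - (1 - alpha)^m.

```python
def familywise_error(alpha, m):
    """Chance of at least one type 1 error across m independent tests."""
    return 1 - (1 - alpha)**m

# Error rate for 1, 3 and 10 tests at alpha = 0.05.
for m in (1, 3, 10):
    print(m, round(familywise_error(0.05, m), 3))
```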
what should you use instead of multiple t tests for comparisons
ANOVA
what is the within-group variance
comparing the distribution of replicates to their treatment mean
what is the among/between group variance
comparing the distribution of the treatment means to the grand mean
what is the F statistic in ANOVA
among-group variance / within-group variance
what are treatments in ANOVA
the different samples
what are replicates in ANOVA
sample units within treatments
formula for Chi-square test statistic
∑ (o - e)^2 / e
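A short sketch of the chi-square sum. The observed and expected counts are invented for illustration.

```python
def chi_square(observed, expected):
    """Sum of (o - e)^2 / e over all categories."""
    return sum((o - e)**2 / e for o, e in zip(observed, expected))

observed = [18, 22, 20]   # illustrative counts
expected = [20, 20, 20]
print(chi_square(observed, expected))  # 0.4
```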
formula for Pearson’s r test statistic
(use Z scale)
r = ∑(Zxi * Zyi) / (n-1)
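A sketch of Pearson's r computed as the sum of products of paired z-scores, divided by n - 1. The z-scores use the sample std (n - 1 denominator); the data are made up so that y is an exact multiple of x.

```python
from statistics import mean, stdev

def pearson_r(xs, ys):
    """Pearson's r as the sum of z-score products over n - 1."""
    mx, my = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)   # sample stds (n - 1 denominator)
    n = len(xs)
    return sum(((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)) / (n - 1)

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]               # perfectly linear in xs
print(round(pearson_r(xs, ys), 6))  # 1.0
```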
formula for slope estimate, b of a regression line
b = ∑(Xi - X)(Yi - Y) / ∑(Xi - X)^2
Xi, Yi = x and y values
X, Y = means of the x and y values
residual formula
residual = Yi - Ŷi
observed y value minus the predicted value of y on the regression line
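A minimal sketch tying the slope and residual cards together. The intercept formula a = Y - bX (with X, Y the means) is an assumption taken from standard least squares, not from the cards, and the data are invented.

```python
from statistics import mean

def fit_line(xs, ys):
    """Least-squares intercept a and slope b for y = a + b x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar)**2 for x in xs)
    b = sxy / sxx
    a = y_bar - b * x_bar   # assumption: standard least-squares intercept
    return a, b

xs = [1, 2, 3, 4]           # illustrative data
ys = [2, 4, 5, 8]
a, b = fit_line(xs, ys)

# Residual = observed y minus the y predicted by the fitted line.
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
print(b, residuals)
```

For a least-squares fit with an intercept, the residuals should sum to (approximately) zero.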
problems with regression analysis
- induced correlations (ie variables that must sum to a constant such as 100% or 1, eg mineral compositions, can appear correlated even when no real relationship exists)
- correlation does not imply causation
- pseudoreplication (data taken from a single area are treated as independent replicates but don’t represent the whole population)
What is the t-test used for
- test whether a sample is drawn from a population of specific mean
- test if means of 2 samples differ
what is the ANOVA test used for
- test whether ≥ 3 samples are drawn from populations with equal means
(like Student’s t test, but for more than two samples)
what is the Chi-square test used for
- test how well observed categorical data fit a given model/expected values
How to find within-group variance
s.s / d.f
s.s. = ∑(Xi - X)^2
-> distance from treatment means
d.f. = n-1 (for each treatment, then added together (ie total replicates - number of treatments))
how to find among (between) group variance
s.s / d.f
s.s. = ∑ ni (Xti - Xg)^2
-> squared distance of treatment means from grand mean, weighted by the number of replicates ni in each treatment
d.f. = number of treatments - 1
when do you reject ANOVA null hypothesis
when F statistic > critical (table) value, based on numerator (among) and denominator (within) degrees of freedom
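A sketch of the whole F calculation from raw replicates: within-group and among-group sums of squares, their degrees of freedom, then the ratio. The among-group sum of squares is weighted by the number of replicates in each treatment. The three treatment groups are invented data, and the critical value would still come from an F table.

```python
from statistics import mean

def anova_f(groups):
    """One-way ANOVA F statistic: among-group variance / within-group variance."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    k, n_total = len(groups), len(all_values)

    # Within: squared distance of each replicate from its treatment mean.
    ss_within = sum((x - mean(g))**2 for g in groups for x in g)
    # Among: squared distance of treatment means from the grand mean,
    # weighted by the number of replicates in each treatment.
    ss_among = sum(len(g) * (mean(g) - grand_mean)**2 for g in groups)

    ms_within = ss_within / (n_total - k)   # d.f. = total replicates - treatments
    ms_among = ss_among / (k - 1)           # d.f. = treatments - 1
    return ms_among / ms_within

groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]  # illustrative treatments
print(round(anova_f(groups), 6))            # 13.0
```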
assumptions for t test
- data from normally distributed populations
- data from populations of equal variance
- samples drawn at random from parent distributions
assumptions for ANOVA
- data drawn from normally distributed populations
- data from populations of equal variance
- data independent of one another