BIO 330 Flashcards
sampling error imposes
imprecision (accuracy intact)
caused by chance
sampling bias imposes
inaccuracy (precision intact)
accurate sample
unbiased
precise sample
low sampling error
good sample
accurate
precise
random
large
2 types of data
numerical
categorical
numerical data
continuous
discrete
categorical data
nominal
ordinal
types of variable
response
explanatory
response variable
dependent
outcome
Y
explanatory variable
independent
predictor
X
subsamples treated as true replicates
pseudoreplication
subsamples are useful for
increasing precision of the estimate for an individual sample (multiple subsamples from the same site are averaged)
contingency table
explanatory variable - columns
response variable - rows
include totals of columns and rows
2 data descriptions
central tendency
width
central tendency
mean
median
mode
width (spread)
range, standard deviation, variance, coefficient of variation, IQR
effect of outliers on mean
shifts mean towards outliers- sensitive to extremes
median doesn’t shift
sample variance s^2 =
Σ(Y_i - Ybar)^2 / (n - 1)
coefficient of variation CV =
100% ( s / Ybar )
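A minimal Python sketch of these spread formulas (data values are made up for illustration):

```python
# Sample variance, SD, and coefficient of variation by hand (hypothetical data)
y = [4.2, 5.1, 3.8, 4.9, 5.5]                      # e.g. masses in g
n = len(y)
ybar = sum(y) / n
s2 = sum((yi - ybar) ** 2 for yi in y) / (n - 1)   # sample variance, n - 1 denominator
s = s2 ** 0.5                                      # standard deviation
cv = 100 * s / ybar                                # coefficient of variation (%)
print(f"mean={ybar:.2f}  s^2={s2:.3f}  s={s:.3f}  CV={cv:.1f}%")
```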
high CV
more variability
skewed box plot
left skewed - first quartile/lower whisker stretched (long tail toward low values)
right skewed - third quartile/upper whisker stretched (long tail toward high values)
when/why random sample
uniform study area
removes bias in sample selection
when/why systematic sample
detect patterns along gradient- fixed intervals along transect/belt
using quadrats
more is better
stop when mean/variance stabilize (asymptote)
what does changing n do to sampling distribution
reduces spread (narrows graph) - increases precision
standard error of estimate SE_Ybar =
s / sqrt(n)
SD vs. SE
SD- spread of distribution/deviation from mean
SE - precision of an estimate (ex. mean)
95% CI ~=
+/- 2SE
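A quick sketch of SE and the ±2SE rule of thumb in Python (hypothetical data):

```python
import statistics

y = [4.2, 5.1, 3.8, 4.9, 5.5]            # hypothetical sample
n = len(y)
ybar = statistics.mean(y)
se = statistics.stdev(y) / n ** 0.5      # SE_Ybar = s / sqrt(n)
lo, hi = ybar - 2 * se, ybar + 2 * se    # approximate 95% CI
print(f"Ybar={ybar:.2f}  SE={se:.3f}  ~95% CI: ({lo:.2f}, {hi:.2f})")
```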
kurtosis
leptokurtic- sharper peak (+)
platykurtic- rounder peak (-)
mesokurtic- normal (0)
Normal distribution, 1SD
~2/3 of the area under the curve (2SD = 95%)
random trial
process/experiment with ≥2 possible outcomes whose occurrence cannot be predicted
sample space
all possible outcomes
event
any subset of the sample space (≥1 outcome)
mutually exclusive events
P[A and B] = 0
mutually exclusive addition rule
P[7 U 11] = P[7] + P[11]
general addition rule
P[A U B] = P[A] + P[B] - P[A and B]
multiplication rule
independent events
P[A and B] = P[A] x P[B]
conditional probability
P[A | B] = P[A and B] / P[B]
collection of individuals easily available to researcher
sample of convenience
random sample
every unit has an equal chance of selection, units are selected independently, minimizes bias, makes it possible to measure sampling error
problem with sample of convenience
assume unbiased/independent- no guarantee
volunteer bias
health conscious, low income, ill, more time, angry, less prudish
frequency distribution
describes # of times each value of a variable occurs in sample
probability distribution
distribution of variable in whole population
absolute frequency
# of times a value is observed
relative frequency
proportion of individuals which have that value
experimental studies can
determine cause and effect
*cause
observational studies can
only point to cause
*correlations
quantifying precision
smaller range of values (spread)
determining accuracy
usually can’t- don’t know true value
nominal categorical data with 2 choices
binomial
why aim for numerical data
it can be converted to categorical if need be
species richness
discrete (count)
rates
continuous
large sample
less affected by chance
lower sampling error
lower bias
rounding
round to one decimal place more than measurement (in calculations)
proportions
p^ = # of observations in category of interest/ total # of observations in all categories
sum of squares
it is squared so that each value is +, so they don’t cancel each other out
n - 1 corrects the bias from estimating the population mean with the sample mean
CV used for
relative measures- comparing data sets
sampling distribution
probability distribution of all values for an estimate that we might obtain when we sample a population, centred at true µ
values outside of CI
implausible
how many quadrats to use
till cumulative number of observations asymptotes
law of total probability
P[A] = Σ P[B_i].P[A | B_i]
for all B_i's
null distribution
sampling distribution of the test statistic if Ho were true (repeat the trial many times, graph the test statistics)
Type I error
P[Reject Ho | Ho true] = alpha
reject null
P-value < alpha
Type II error
P[do not reject Ho | Ho false]
Power
P[Reject Ho | Ho false]
increases with large n
decreases P[Type II E]
test statistic
used to evaluate whether data are reasonably expected under Ho
p-value
probability of getting data as extreme or more extreme, given Ho is true
statistically significant
data differ from H_o
not necessarily important- depends on magnitude of difference and n
why not reduce alpha
would decrease P[Type I] but increase P[Type II]
continuous probability
P[Y = y] =
0
sampling without replacement
ex. drawing cards
(1/52).(1/51).(1/50)
Bayes Theorem
P[A | B] = P[B | A].P[A] / P[B], where P[B] comes from the law of total probability
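A worked sketch of the conditional-probability rules above; the disease prevalence and test error rates are invented for illustration:

```python
# Hypothetical diagnostic test: P[D] = 0.01, sensitivity 0.95, false-positive rate 0.05
p_d = 0.01
p_pos_given_d = 0.95
p_pos_given_not_d = 0.05

# Law of total probability: P[+] = Σ P[B_i].P[+ | B_i]
p_pos = p_d * p_pos_given_d + (1 - p_d) * p_pos_given_not_d

# Bayes' theorem: P[D | +] = P[+ | D].P[D] / P[+]
p_d_given_pos = p_pos_given_d * p_d / p_pos
print(f"P[+] = {p_pos:.4f}  P[D | +] = {p_d_given_pos:.3f}")   # ~0.161
```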
P-value > alpha
do not reject Ho
data are consistent with Ho
meaning of ‘z’ in standardization
how many sd’s Y is from µ
standardization for sample mean, t =
(Ybar - µ) / (s / sqrt(n))
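A sketch of the t statistic computed by hand and checked against scipy (sample values and µ0 are hypothetical):

```python
import statistics
from scipy import stats

y = [5.1, 4.8, 5.6, 5.0, 4.7, 5.3]        # hypothetical sample
mu0 = 5.5                                  # µ proposed by Ho
t = (statistics.mean(y) - mu0) / (statistics.stdev(y) / len(y) ** 0.5)
res = stats.ttest_1samp(y, popmean=mu0)    # same t, with df = n - 1
print(f"t by hand = {t:.3f}  scipy t = {res.statistic:.3f}  P = {res.pvalue:.4f}")
```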
CI on µ
Ybar ± tcrit.SE
SE is the SE of Ybar
tcrit = t_alpha(1 or 2 sided) with the appropriate degrees of freedom
1 sample t-test
compares sample mean from normal pop. to population µ proposed by Ho
why n - 1 degrees of freedom
last value is not free to vary if mean is a specified value
1 sample t-test assumptions
data are a random sample
variable is normally distributed in pop.
paired t-test assumptions
pairs are a random sample from pop.
paired differences are normally distributed in the pop.
how to tell whether to reject with t-test
if test statistic is further into tails than critical t then reject
2 sample design compares
treatment vs. control
2 sample t-test assumptions
both samples are random samples
variable is normally distributed in each group
standard deviation in two groups ~ equal
degrees of freedom
1 sample t-test: n - 1
paired t-test: n - 1
2 sample t-test: n1 + n2 - 2
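A scipy sketch of the 2-sample comparison (hypothetical treatment/control data); `equal_var=False` gives Welch's t-test, which appears later as the fix for unequal variances:

```python
from scipy import stats

control = [3.1, 3.5, 2.9, 3.8, 3.3]      # hypothetical
treatment = [4.0, 4.4, 3.9, 4.6, 4.1]    # hypothetical

pooled = stats.ttest_ind(treatment, control, equal_var=True)   # df = n1 + n2 - 2
welch = stats.ttest_ind(treatment, control, equal_var=False)   # Welch's t-test
print(f"pooled: t={pooled.statistic:.3f} P={pooled.pvalue:.4f}")
print(f"Welch:  t={welch.statistic:.3f} P={welch.pvalue:.4f}")
```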
confounding variables
mask/distort causal relationships btw measured variables
problem w/ observational studies
impossible to isolate the effect of a single variable from confounding variables
experimental artifacts
bias resulting from experiment, unnatural conditions
problem w/ experimental studies
should try to mimic natural environment
minimum study design requirements
knowledge of initial/natural conditions via preliminary data to ID hypotheses and confounding variables
controls to reduce bias
replication to reduce sampling error
study design process
develop clear statement of research question
list possible outcomes
develop experimental plan
check for design problems
developing a clear statement of research question
ID question, Ho, Ha
choose factors, response variable
what is being tested? will the experiment actually test this?
list possible outcomes of experiment
ID sample space
explain how each outcome supports/refutes Ho
consider external risk factors
develop experimental plan
based on step 1
outline different experimental designs
check literature for existing/accepted designs
develop experimental plan based on step 2
what kind of data will you have- aim for numerical
what type of statistical test will you use
minimize bias in experimental plan
control group
randomization
blinding
minimize sampling error in experimental plan
replication
balance
blocking
types of controls
positive
negative
positive control
treatment that should produce obvious, strong effect
ensuring experiment design doesn’t block effect
negative control
subjects go through all same steps but do not receive treatment- no effect
maintaining power with controls
add controls w/o reducing sample size - too many control samples use up resources and reduce power
placebo effect
improvement in condition from psychological effect
randomization
breaks correlation btw explanatory variable and confounding variables (averages effects of confounding variables)
blinding
conceals from subjects/researchers which treatment was received
prevent conscious/unconscious changes in behaviour
single blind or double blind
better chance of IDing treatment effect if
sampling error/noise is minimized
replication =
smaller SE, tighter CI
spatial autocorrelation
samples from the same area are correlated, not independent (unless testing differences within that population)
temporal autocorrelation
measurement at one pt in time is directly correlated w/ the one before/after it
balance =
small SE, narrow CI
blocking
accounts for extraneous variation by putting experimental units that are similar into ‘blocks’
only concerned w/ differences within block- differences btw blocks don’t matter
lowers noise
factorial design
most powerful study design
study multiple treatments and their interactions
equal replication of all combinations of treatment
checking for pseudoreplication
check degrees of freedom - if very large, there's a problem
overestimated df = easier to reject Ho - pretending we have more power than we do
determining sample size, plan for
precision, power, data loss
determining sample size, wanting precision
want a narrow CI
n ~ 8(sigma/uncertainty)^2
uncertainty is 1/2 CI
determining sample size, wanting power
detecting an effect/difference - plan for the probability of rejecting a false Ho
n ~ 16(sigma/D)^2
D is min. effect size you want to detect
power is 0.8
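Both rules of thumb as a sketch (the sigma, uncertainty, and D values are hypothetical):

```python
import math

def n_for_precision(sigma, uncertainty):
    # n ~ 8(sigma/uncertainty)^2, uncertainty = half the width of the 95% CI
    return math.ceil(8 * (sigma / uncertainty) ** 2)

def n_for_power(sigma, D):
    # n ~ 16(sigma/D)^2 per group, for power ~0.8
    return math.ceil(16 * (sigma / D) ** 2)

print(n_for_precision(sigma=2.0, uncertainty=0.5))   # 128
print(n_for_power(sigma=2.0, D=1.0))                 # 64
```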
ethics
avoid trivial experiment
collaborate to streamline efforts
substitute models for live animals when possible
keep encounters brief to reduce stress
most important in experimental study design
check common design problems
sample size (precision,power,data loss)
get a second opinion
most important in observational study design
keep track of confounding variables
good skewness range for normality
[-1,1]
normal quantile plot
QQ plot
compares data w/ standardized value, should follow a straight line
right skew in QQ plot
above line (more positive data)
Shapiro-Wilk test
works like a hypothesis test, Ho: data are normal
estimates pop. mean and SD using sample data, tests the match to a normal distribution with the same mean and SD
p-value < alpha, reject Ho (don’t want to reject)
testing normality
Histogram
QQ plot
Shapiro-Wilk
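A sketch of two of these checks with scipy (simulated data stand in for a real sample):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
y = rng.normal(loc=10, scale=2, size=40)     # hypothetical sample

w, p = stats.shapiro(y)                      # Ho: data are normal
print(f"Shapiro-Wilk W={w:.3f}  P={p:.3f}")  # P > alpha: do not reject normality

# QQ plot coordinates (theoretical quantiles vs. ordered data); plot to check for a straight line
(osm, osr), (slope, intercept, r) = stats.probplot(y, dist="norm")
```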
normality tests sensitive
especially to outliers - leads to over-rejection
sensitive to sample size
large n = more power
testing equal variances
Levene’s test
Levene’s test
Ho: sigma1 = sigma2
compute each data point's absolute deviation from its group mean, then test whether the groups differ in the means of these deviations
p-value < alpha reject (don’t want to reject)
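A scipy sketch of Levene's test; `center='mean'` matches the mean-based version described above (scipy's default uses the median). Groups are hypothetical:

```python
from scipy import stats

g1 = [3.1, 3.5, 2.9, 3.8, 3.3]              # hypothetical
g2 = [4.0, 5.4, 2.9, 5.6, 3.1]              # hypothetical, more spread out

w, p = stats.levene(g1, g2, center='mean')  # Ho: sigma1 = sigma2
print(f"Levene W={w:.3f}  P={p:.3f}")       # P < alpha: variances unequal
```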
how to handle violations of test assumptions
ignore it
transform data
use nonparametric test
use permutation test
when to ignore normality
CLT: if n > 30, sample means are ~normally distributed
depends on data set though
can’t ignore normality and compare one set skewed left with one skewed right
when to ignore equal variances
n large, n1 ~ n2
3 fold difference in SD usually ok
if can’t ignore violation of equal variances
Welch’s t-test- computes SE and df differently
most common transformations
log, arcsine, square-root
log - only if all data > 0
nonparametrics
assume less about underlying distributions
usually based on rank data
Ho: ranks are same btw groups
sign test (instead of t test)
sign test
compares the sample median to the median proposed by Ho
each data pt- record whether above (+) or below (-) the Ho median
if Ho is true in sign test
half data will be above Ho, half will be below
sign test p-value
use binomial distribution - probability of getting a result as extreme as (or more extreme than) yours if Ho is true, compare to alpha
binomial
P[Y ≤ y] = Σ_{i=0..y} (n choose i) p^i (1-p)^(n-i)
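A sketch of the sign test via the binomial distribution, using scipy's `binomtest` (data and Ho median are hypothetical):

```python
from scipy import stats

y = [12, 15, 9, 18, 14, 16, 11, 17]    # hypothetical data
median0 = 10                            # median proposed by Ho

above = sum(1 for yi in y if yi > median0)   # '+' signs
n = sum(1 for yi in y if yi != median0)      # values tied with median0 are dropped

res = stats.binomtest(above, n=n, p=0.5)     # under Ho, signs ~ binomial(n, 0.5)
print(f"{above}/{n} above  P = {res.pvalue:.4f}")
```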
Mann-Whitney U-test
compare 2 groups using ranks
doesn’t assume normality
assumes distributions are same shape
rank all data from both groups together, sum ranks for individual groups
Mann-Whitney U-test equation
U1 = n1.n2 + [n1(n1+1)/2] - R1
U2 = n1.n2 - U1
interpreting Mann-Whitney U-test
choose larger of U1, U2 (test statistic) - compare to critical U from U distribution (table E)
note that Ucrit = U_alpha(2-sided), n1, n2
uses n1, n2, not df
U < Ucrit: do not reject Ho (2 groups not statistically different)
why Mann-Whitney doesn’t use DF
not looking at estimating mean/variance, just comparing the shapes
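A scipy sketch of the U-test (hypothetical groups); scipy reports a P-value directly rather than comparing U to table E:

```python
from scipy import stats

g1 = [7.2, 8.1, 6.9, 9.0, 7.7]    # hypothetical
g2 = [5.9, 6.4, 6.1, 7.0, 5.5]    # hypothetical

u, p = stats.mannwhitneyu(g1, g2, alternative="two-sided")
print(f"U = {u}  P = {p:.4f}")
```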
problem with non-parametrics
low power- P[Type II] higher– especially with low n
ranking data = major info loss
avoid use
Type I not altered
comparing > 2 groups
ANOVA - analysis of variance
Ho: µ1 = µ2 = µ3 = µ4….
why use ANOVA
multiple t-tests to compare >2 groups increase Type I error- more tests = higher chance of falling within alpha
P[Type I]
1 - ( 1 - alpha ) ^N
N is number of t-tests you do
ex. 5 groups - 10 unique pairwise tests - P[TI] = 0.4
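Checking that arithmetic in a couple of lines:

```python
alpha = 0.05
n_groups = 5
n_tests = n_groups * (n_groups - 1) // 2   # 10 unique pairwise tests
print(1 - (1 - alpha) ** n_tests)          # ~0.401, i.e. P[Type I] ~ 0.4
```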
ANOVA tests
is there more variation btw groups than can be attributed to chance- breaks it down into: total variation, btw group variation, within group variation
maintains P[TI] = alpha
between-group variation
effect of interest (signal)
within-group variation
sampling error (noise)
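A scipy sketch of a one-way ANOVA across three hypothetical groups:

```python
from scipy import stats

g1 = [3.2, 3.8, 3.5, 3.9]    # hypothetical
g2 = [4.1, 4.6, 4.4, 4.8]    # hypothetical
g3 = [3.0, 3.3, 2.9, 3.4]    # hypothetical

f, p = stats.f_oneway(g1, g2, g3)    # Ho: µ1 = µ2 = µ3
print(f"F = {f:.3f}  P = {p:.4f}")   # P < alpha: btw-group variation exceeds chance
```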
2x2 ANOVA design
cross 2 different variables - look at all treatment combinations and test for effects/interactions in all directions
2 variables w/controls = 8 options
Hypothesis test steps
State Ho, Ha
calculate test statistic
determine critical value of null distribution (or P-value)
compare test statistic to critical value (or P-value to sig. level)
evaluate Ho using alpha