vocab definitions Flashcards
sample of convenience
a collection of individuals that happen to be available at the time
variable
a measured characteristic on individuals from a population under study
data
measurements of one or more variables made on a collection of individuals
explanatory variable
a variable we use to predict or explain a response variable
response variable
a variable that is predicted or explained from an explanatory variable
populations
a group of all individuals or groups that you want to study
sample
a subset ideally randomly chosen from a population you wish to study
parameters
things we want to know about the population
estimates
values calculated from a sample to help understand parameters
bias
a systematic discrepancy between estimates and the true population characteristic
volunteer bias
volunteers for a study are likely to be different, on average, from the population
sampling error
chance difference from the truth
precision
the spread of estimates resulting from sampling error
-gives a similar answer repeatedly
accurate or unbiased
the average of the estimates obtained is centered on the true population value
-accuracy (on average gets the correct answer)
random sample
in a random sample, each member of a population has an equal and independent chance of being selected
categorical variables (attribute or qualitative variables)
describe membership in a category or group
numerical variable
measurements of individuals that are quantitative and have magnitude (numbers)
continuous
numerical data that can take on any real-number value within some range. Between any two values of a continuous variable, an infinite number of other values are possible.
discrete
numerical data that come in indivisible units. Example: number of amino acids in a protein and numerical rating of a statistics professor in a student evaluation are discrete numerical measurements
frequency
the number of observations having a particular value of the measurement
frequency distribution
describes the number of times each value of a variable occurs in a sample
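A minimal sketch of tallying a frequency distribution. The sample values are made up for illustration and are not from any real study.

```python
from collections import Counter

# Hypothetical sample: a discrete measurement recorded on 10 individuals.
sample = [2, 3, 3, 1, 2, 3, 4, 2, 3, 1]

# Frequency: the number of observations having each value.
frequency = Counter(sample)

# Relative frequency: the proportion of observations having each value.
relative_frequency = {
    value: count / len(sample) for value, count in frequency.items()
}

for value in sorted(frequency):
    print(value, frequency[value], relative_frequency[value])
```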
independence
two events are independent if the occurrence of one gives no information about whether the second will occur
multiplication principle
if two events A and B are independent, then Pr[A and B] = Pr[A] x Pr[B]
The addition principle
If two events A and B are mutually exclusive, then Pr[A or B]= Pr[A] + Pr[B]
Probability distribution
A probability distribution describes the true relative frequency of all possible values of a random variable
Mutually exclusive
if two events are mutually exclusive, they cannot both be true
Pr[A and B] = 0
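A small worked example of the multiplication and addition principles, assuming a fair six-sided die (the die is an illustrative assumption, not part of the cards).

```python
from fractions import Fraction

# Assumed example: a fair six-sided die, so each face has probability 1/6.
pr_face = Fraction(1, 6)

# Addition principle: "roll a 1" and "roll a 2" on a single roll are
# mutually exclusive, so Pr[1 or 2] = Pr[1] + Pr[2].
pr_one_or_two = pr_face + pr_face

# Multiplication principle: two separate rolls are independent,
# so Pr[6 on first and 6 on second] = Pr[6] x Pr[6].
pr_two_sixes = pr_face * pr_face

print(pr_one_or_two)  # 1/3
print(pr_two_sixes)   # 1/36
```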
probability
The probability of an event is its true relative frequency: the proportion of times the event would occur if we repeated the same process over and over
pseudoreplication
the error that occurs when samples are not independent but are treated as though they are
standard error
the standard error of an estimate is the standard deviation of its sampling distribution. It predicts the sampling error of the estimate
standard error of an estimate
the standard deviation of its sampling distribution
It predicts the sampling error of the estimate
conditional probability
the conditional probability of an event is the probability of that event occurring given that a condition is met.
Pr[X|Y]
confidence interval
the 95% confidence interval provides a plausible range for a parameter. All values for the parameter lying within the interval are plausible, given the data, whereas those outside are unlikely
The 2SE rule of thumb
the interval from Ybar - 2 SEYbar to Ybar + 2 SEYbar provides a rough estimate of the 95% confidence interval for the mean
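A rough sketch of the 2SE rule of thumb. The measurements are hypothetical, and the standard error is estimated as s / sqrt(n), the usual estimate of the standard error of the mean.

```python
import math
import statistics

# Hypothetical sample of measurements (illustrative values only).
sample = [4.1, 5.3, 3.8, 4.9, 5.0, 4.4, 4.7, 5.2]

n = len(sample)
mean = statistics.mean(sample)   # Ybar
sd = statistics.stdev(sample)    # sample SD s (divides by n - 1)
se = sd / math.sqrt(n)           # estimated standard error of the mean

# 2SE rule of thumb: approximate 95% confidence interval for the mean.
lower = mean - 2 * se
upper = mean + 2 * se

print(f"mean = {mean:.3f}, SE = {se:.3f}")
print(f"approximate 95% CI: {lower:.3f} to {upper:.3f}")
```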
what does a χ^2 goodness-of-fit test do?
compares count data to a model of the expected frequencies of a set of categories
-it is an approximation (don't use it when there is little data)
H0: the data come from a specified probability distribution
χ^2 = sum over all classes of (observed - expected)^2 / expected
Degrees of freedom
the number of degrees of freedom of a test specifies which of a family of distributions to use
for χ^2: df = number of categories - number of parameters estimated from the data - 1
Critical value
the value of the test statistic where P= alpha
What are test statistics
A test statistic is a number calculated from the data and the null hypothesis that can be compared to a standard distribution to find the P-value of the test
What are the assumptions of the χ^2 test?
- the sample is random
- no more than 20% of categories have expected < 5
- no category has expected < 1
when these conditions are not met, the approximations behind the χ^2 test do not work
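The goodness-of-fit calculation can be sketched as follows. The counts and the 1:2:1 null model are hypothetical. The critical value 5.99 is the standard table value for df = 2 at alpha = 0.05.

```python
# Hypothetical counts in three categories, tested against a 1:2:1 null model.
observed = [28, 62, 30]
null_proportions = [0.25, 0.50, 0.25]

total = sum(observed)
expected = [p * total for p in null_proportions]

# Check the expected-frequency assumptions before trusting the approximation.
share_below_5 = sum(e < 5 for e in expected) / len(expected)
assumptions_ok = share_below_5 <= 0.20 and min(expected) >= 1

# Test statistic: sum over all classes of (observed - expected)^2 / expected.
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# No parameters were estimated from the data in this example.
parameters_estimated = 0
df = len(observed) - parameters_estimated - 1

critical_value = 5.99  # table value for df = 2, alpha = 0.05
reject_null = chi_square > critical_value

print(assumptions_ok, round(chi_square, 3), df, reject_null)
```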
what is a discrete distribution?
a probability distribution describing a discrete numerical random variable
example:
number of heads from 10 flips of a coin
number of flowers in a square meter
number of disease outbreaks in a year
Poisson distribution
a mathematical probability distribution.
- describes the probability that a certain number of events occur in a block of time or space, when those events happen independently of each other and occur with equal probability at every point in time or space
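The cards do not give the formula, so as a supplement: the standard Poisson probability is Pr[X] = e^(-mu) mu^X / X!, where mu is the mean number of events per block of time or space. A minimal sketch, with a hypothetical mean of 2 events per block:

```python
import math


def poisson_probability(x, mu):
    """Probability of exactly x events when the mean count per block is mu."""
    return math.exp(-mu) * mu ** x / math.factorial(x)


mu = 2.0  # hypothetical mean number of events per block

for x in range(6):
    print(x, round(poisson_probability(x, mu), 4))
```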
χ^2 contingency analysis
tests the independence of two or more categorical variables
Fisher's exact test
for 2x2 contingency analysis
does not make assumptions about the size of expectations
use when you can't do χ^2 contingency analysis
*don’t need to do by hand
odds
the probability of success divided by the probability of failure
Odds ratio
the odds of success in one group divided by the odds of success in another group
OR < 1 means the odds of the bad outcome are lower in the first group
OR > 1 means the odds of the bad outcome are higher in the first group
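A short sketch of odds and the odds ratio. The two probabilities are hypothetical, with "success" standing for the bad outcome.

```python
def odds(p):
    """Odds: probability of success divided by probability of failure."""
    return p / (1 - p)


# Hypothetical probabilities of the bad outcome in two groups.
p_treatment = 0.10
p_control = 0.25

odds_ratio = odds(p_treatment) / odds(p_control)

# OR < 1 here: the odds of the bad outcome are lower in the treatment group.
print(round(odds(p_treatment), 3), round(odds(p_control), 3), round(odds_ratio, 3))
```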
properties of a good sample
-independent selection of individuals
-random selection of individuals
-sufficiently large
sampling error
The difference between the estimate and the average value of the estimate
larger samples on average will have ____ sampling error
smaller
best way to graph a numerical variable frequency
histogram
cumulative frequency distribution
The cumulative frequency of a value is the proportion of individuals equal to or less than the value
when graphed, it goes from 0 to 1 on the y-axis and never decreases
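A minimal sketch of a cumulative frequency calculation on a made-up sample. The function name is hypothetical.

```python
# Hypothetical sample of measurements.
sample = [3, 1, 4, 1, 5, 9, 2, 6]


def cumulative_frequency(value, data):
    """Proportion of individuals with a measurement <= value."""
    return sum(1 for y in data if y <= value) / len(data)


# Evaluated at increasing values, the result climbs from 0 to 1
# and never decreases.
for value in sorted(set(sample)):
    print(value, cumulative_frequency(value, sample))
```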
best way to show association between two categorical variables
contingency table
grouped bar graph
mosaic plot
best way to show association between categorical and numerical variable
multiple histograms
best way to show association between two numerical variables
scatter plot
how to calculate mean
Ybar = (sum of Yi) / n, where n = sample size
median
The median is the middle measurement in a set of ordered data
Mode
the mode is the most frequent measurement
Range
the maximum minus the minimum
Small samples tend to give ___ estimates of the range than large samples.
So sample range is a _______ of the true range of the population
Small samples tend to give _lower_ estimates of the range than large samples.
So sample range is a biased estimator of the true range of the population
Variance in a population
sigma^2 = sum of (Yi - mu)^2 / N
N is the number of individuals in the population
mu = true mean of the population
Sample variance
s^2 = sum of (Yi - Ybar)^2 / (n - 1)
n=sample size
Ybar= sample mean
Standard deviation (SD)
positive square root of the variance
sigma is the true standard deviation
s is the sample standard deviation
s = square root of s^2 = sqrt(sum of (Yi - Ybar)^2 / (n - 1))
coefficient of variation (CV)
CV = 100% x s / Ybar
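The sample mean, variance, standard deviation, and coefficient of variation can be computed step by step. This is a sketch on made-up data, and the comments show the arithmetic for these particular values.

```python
import math
import statistics

# Hypothetical sample of six measurements.
sample = [2.0, 4.0, 4.0, 5.0, 7.0, 8.0]

n = len(sample)
mean = sum(sample) / n                                   # 30 / 6 = 5.0
sum_of_squares = sum((y - mean) ** 2 for y in sample)    # 9+1+1+0+4+9 = 24
variance = sum_of_squares / (n - 1)                      # 24 / 5 = 4.8
sd = math.sqrt(variance)                                 # positive square root
cv = 100 * sd / mean                                     # in percent

# The standard library uses the same n - 1 formulas.
assert abs(variance - statistics.variance(sample)) < 1e-12
assert abs(sd - statistics.stdev(sample)) < 1e-12

print(mean, variance, round(sd, 3), round(cv, 1))
```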
skew
a measurement of asymmetry
refers to the pointy tail of a distribution
Standard error of the mean
standard error of the mean:
sigma_Ybar = sigma / sqrt(n)
Estimate of the standard error of the mean
SEYbar = s / sqrt(n)
gives us some knowledge of the likely difference between our sample mean and the true population mean
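A simulation sketch of why sigma / sqrt(n) is the standard error of the mean: across many random samples, the standard deviation of the sample means should land close to it. The normal population, the seed, and the sample sizes are all illustrative assumptions.

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

sigma = 1.0        # assumed true population standard deviation
n = 25             # size of each sample
replicates = 2000  # number of simulated samples

sample_means = []
for _ in range(replicates):
    sample = [random.gauss(0.0, sigma) for _ in range(n)]
    sample_means.append(statistics.mean(sample))

# SD of the sampling distribution of the mean, estimated by simulation.
observed_se = statistics.stdev(sample_means)
# Value predicted by the formula sigma / sqrt(n).
predicted_se = sigma / math.sqrt(n)

print(round(observed_se, 3), predicted_se)
```

Raising n in this sketch shrinks both numbers, which matches the card saying larger samples have smaller sampling error on average.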
law of total probability
Pr[X] = sum over all values of Y of Pr[X|Y] Pr[Y]
probability of a positive result using the law of total probability (example where the events are not independent)
Pr[positive result] = Pr[positive result | X] Pr[X] + Pr[positive result | Y] Pr[Y]
Bayes theorem
Pr[A|B]= Pr[B|A]Pr[A]/ Pr[B]
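A worked sketch that combines the law of total probability with Bayes' theorem for a diagnostic test. The prevalence and test error rates are hypothetical numbers chosen for illustration.

```python
# Hypothetical inputs (assumptions, not real data).
pr_disease = 0.01              # Pr[disease]
pr_pos_given_disease = 0.90    # Pr[positive | disease]
pr_pos_given_healthy = 0.05    # Pr[positive | no disease]

pr_healthy = 1 - pr_disease

# Law of total probability: sum Pr[positive | state] Pr[state] over states.
pr_positive = (pr_pos_given_disease * pr_disease
               + pr_pos_given_healthy * pr_healthy)

# Bayes' theorem: Pr[A|B] = Pr[B|A] Pr[A] / Pr[B].
pr_disease_given_pos = pr_pos_given_disease * pr_disease / pr_positive

print(round(pr_positive, 4))           # 0.0585
print(round(pr_disease_given_pos, 4))  # 0.1538
```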
what does hypothesis testing do
hypothesis testing asks how unusual it would be to get data that differ as much as ours from what the null hypothesis predicts
If the data would be quite unlikely under H0 we reject H0
hypotheses are about populations but are tested with data from? with what assumption?
samples, with the assumption that the sample is random
Null hypothesis
a specific statement about a population parameter made for the purposes of argument.
usually the simplest statement
Alternative hypothesis
represent all other possible parameter values except that stated in the null hypothesis
usually the statement of greatest interest
A good null hypothesis
would be interesting if proven wrong
What is a P-value
the probability of getting the data, or something as or more unusual, if the null hypothesis were true
How do you find the P-value?
Simulation
Parametric tests
Permutation
Statistical significance
The significance level, alpha, is a probability used as a criterion for rejecting the null hypothesis
If the P-value for a test is less than or equal to alpha then the null hypothesis is rejected
often 0.05
A large sample will tend to give an estimate with a ____ confidence interval
A larger sample will give ____ a false null hypothesis
A large sample will tend to give an estimate with a narrower confidence interval
A larger sample will give _more power to reject_ a false null hypothesis
Type I error
Rejecting a true null hypothesis
Probability of Type I error is alpha (the significance level)
Type II error
Not rejecting a false null hypothesis
The probability of a Type II error is beta
The smaller beta the more power a test has
Power
The ability of a test to reject a false null hypothesis
Power = 1- beta
Most tests are ___ tailed tests which means…
most tests are two-tailed tests and this means that a deviation in either direction would reject the null hypothesis
normally alpha is divided into alpha/2 on one side and alpha/2 on the other
One-tailed tests
are only used when the other tail is nonsensical
example: comparing grades on a multiple choice test to that expected by random guessing
Critical value
The value of a test statistic beyond which the null hypothesis can be rejected
If a hypothesis test rejects a null hypothesis (P<0.05), the value proposed by the null hypothesis is
outside the 95% confidence interval
confounding variable
an unmeasured variable that may be the cause of both X and Y
a proportion
a fraction of individuals having a particular attribute
binomial distribution
describes the probability of a given number of successes from a fixed number of independent trials
Pr[X] = (n choose X) p^X (1-p)^(n-X)
n trials; p probability of success
properties of the binomial distribution
mean and variance of the number of successes
mu = np
sigma^2= np(1-p)
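A sketch of the binomial formula that also checks numerically that the mean and variance match np and np(1-p). The values n = 10 and p = 0.5 are arbitrary illustrative choices.

```python
import math


def binomial_probability(x, n, p):
    """Pr[X] = (n choose X) p^X (1-p)^(n-X)."""
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)


n, p = 10, 0.5  # arbitrary example: 10 trials, success probability 0.5

probabilities = [binomial_probability(x, n, p) for x in range(n + 1)]

# Mean and variance computed directly from the distribution.
mean = sum(x * pr for x, pr in enumerate(probabilities))
variance = sum((x - mean) ** 2 * pr for x, pr in enumerate(probabilities))

print(mean, n * p)                 # both 5.0
print(variance, n * p * (1 - p))   # both 2.5
```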
proportion of successes in a sample
phat= X/n
the hat (^) shows that this is an estimate of p
properties of the sample proportion (phat)
mean: p
variance: p(1-p)/n
Agresti-Coull confidence interval
p' - 1.96 sqrt(p'(1-p')/(n+4)) <= p <= p' + 1.96 sqrt(p'(1-p')/(n+4))
p' = (X+2)/(n+4)
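A minimal sketch of the Agresti-Coull interval. The function name is hypothetical, and X = 2 with n = 18 is borrowed from the binomial test example in these cards.

```python
import math


def agresti_coull_interval(x, n, z=1.96):
    """Approximate 95% confidence interval for a proportion (z = 1.96)."""
    p_prime = (x + 2) / (n + 4)
    half_width = z * math.sqrt(p_prime * (1 - p_prime) / (n + 4))
    return p_prime - half_width, p_prime + half_width


lower, upper = agresti_coull_interval(2, 18)
print(round(lower, 3), round(upper, 3))
```

The interval is centered on p' rather than on phat = X/n, which is what the added 2 and 4 accomplish.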
The binomial test
uses data to test whether a population proportion p matches a null expectation for the proportion
example:
H0: dog food is chosen at best 20% of the time (H0: p = 0.2)
n=18, p0= 0.2, X=2
P-value= 2(Pr[2]+Pr[1]+Pr[0])
example: P = 0.543 > 0.05, therefore we cannot reject the null hypothesis (it is plausible that people do not prefer pate over dog food)
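The P-value in this example can be reproduced from the binomial formula. This is a sketch that doubles the lower tail, as in the card, and the helper function name is hypothetical.

```python
import math


def binomial_probability(x, n, p):
    """Pr[X] = (n choose X) p^X (1-p)^(n-X)."""
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)


n, p0, observed = 18, 0.2, 2

# Lower tail: Pr[2] + Pr[1] + Pr[0] under the null hypothesis.
lower_tail = sum(binomial_probability(x, n, p0) for x in range(observed + 1))

# Two-tailed P-value: double the tail, capped at 1.
p_value = min(1.0, 2 * lower_tail)

print(round(p_value, 3))  # 0.543
```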