Statistics I Flashcards
Difference between an observational study and a survey
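An observational study records data on subjects without imposing a treatment; a survey is an observational study that requests information from the subjects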
Difference between binomial and normal distribution
Binomial: the variable counts the number of successes in a fixed number of trials
Normal: the variable takes on values that occur according to the “bell-shaped curve”
What is the t-distribution
Used when the variable is based on sample averages and you have limited data
What is correlation
The strength and direction of the linear relationship between x and y
Census vs. sample
Census is the entire population, sample is only part of it
Mean, median, mode
Mean: average
Median: the middle value of the sorted data, with an equal number of data points above and below it
Mode: the data point that occurred the most
Standard deviation equation
s = sqrt( Σ(x_i - x̄)² / (n - 1) )
x̄ = sample mean
n = sample size
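A minimal sketch of the sample standard deviation, assuming the card refers to the sample (n - 1) form. The data values are hypothetical.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample

n = len(data)
mean = sum(data) / n
# Sample standard deviation: sum of squared deviations divided by n - 1.
s_manual = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))
# The standard library uses the same n - 1 form.
s_library = statistics.stdev(data)
print(s_manual, s_library)
```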
What is the empirical rule
68 / 95 / 99.7 rule
For roughly normal data:
About 68% of the data lies within 1 standard deviation of the mean
About 95% of the data lies within 2 standard deviations of the mean
About 99.7% of the data lies within 3 standard deviations of the mean
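The three percentages can be checked against the standard normal curve. This is a small sketch using Python's statistics.NormalDist; the variable names are only illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, std dev 1

# Area under the curve between -k and +k standard deviations.
within = {k: standard_normal.cdf(k) - standard_normal.cdf(-k) for k in (1, 2, 3)}
for k, area in within.items():
    print(f"within {k} std dev: {area:.4f}")
```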
Distribution of z score
Central Limit Theorem
Gives you the ability to measure how much your sample mean will vary, without having to take any other sample means to compare it with
Gives you the ability to use confidence intervals and hypothesis tests
Basically, if you keep taking samples of a set size, the resulting distribution of the sample means will be approximately normal, whatever the shape of the population. The larger the sample size, the more “normal” the distribution is.
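A quick simulation sketch of the idea, assuming a skewed population (exponential with mean 1 and std dev 1). The sample size, number of samples, and seed are arbitrary choices for illustration.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable
sample_size = 40
num_samples = 2000

# Draw many samples from a skewed population and keep each sample's mean.
sample_means = [
    statistics.fmean([rng.expovariate(1.0) for _ in range(sample_size)])
    for _ in range(num_samples)
]

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)
# Expect roughly 1 and 1 / sqrt(40), about 0.158.
print(mean_of_means, sd_of_means)
```

The sample means cluster around the population mean, and their spread is close to the population std dev divided by the square root of the sample size.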
What is the basic definition of a z distribution
A normal distribution with mean = 0 and std dev = 1
Blind vs. double-blind
Blind: the participant doesn’t know which treatment they are receiving
Double-blind: neither the participant nor the administrator knows
Margin of error
Measures the maximum amount by which the sample results are expected to differ from those of the actual population, at a given confidence level.
It is usually reported as the plus-or-minus part of a confidence interval
Confidence Interval
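A range of values, built from sample data, that is likely to contain the population parameter (such as the mean). The confidence level (e.g. 95%) is the percentage of such intervals that would capture the true value over repeated sampling.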
Hypothesis test
Data collected from a sample and measured against a claim about a population parameter
p-value
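The probability of getting sample results at least as extreme as the ones observed, assuming the null hypothesis is true. It measures the strength of the evidence against the null hypothesis.
The null hypothesis is the claim that’s on trial.
The alternative hypothesis is the one you would believe if the null hypothesis were untrue.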
p-value < 0.05 indicates strong evidence against the null hypothesis, so reject it
p-value > 0.05 indicates weak evidence against the null hypothesis, so you fail to reject it
p-value very close to 0.05 is marginal and could go either way
What is the relationship between mean, median, and skew
If the mean is larger than the median, skewed right
If the mean is smaller than the median, skewed left
Skewed right has a tail off to the right
Skewed left has a tail off to the left
What is the definition of a percentile
The percentage of data that falls at or below a particular data point. The data doesn’t have to come from a continuous distribution; it can be discrete counts.
What is the “five number summary” of a dataset
[minimum, 25th percentile (first quartile, Q1), median (50th percentile), 75th percentile (third quartile, Q3), maximum]
Interquartile range (IQR) is Q3 - Q1
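A short sketch of the five number summary and IQR for a hypothetical dataset. Quartile conventions differ between textbooks and calculators; this one assumes the "inclusive" method of statistics.quantiles.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # hypothetical dataset

# quantiles(..., n=4) returns the three cut points Q1, median, Q3.
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
five_number = [min(data), q1, median, q3, max(data)]
iqr = q3 - q1
print(five_number, iqr)
```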
Box plot
Great way to represent the five number summary
What are the characteristics of a binomial
- fixed number of trials
- each trial is either a success or failure
- there is a probability of success that is constant for each trial
- trials are independent (the outcome of one doesn’t influence others)
Equation for determining the probability of a certain number of desirable outcomes in a binomial distribution
b(x; n, P) = nCx * P^x * (1 - P)^(n - x)
Where:
b = binomial probability
x = total number of “successes” (pass or fail, heads or tails, etc.)
P = probability of a success on an individual trial
n = number of trials
“A coin is tossed 10 times. What is the probability of getting exactly 6 heads?”
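The coin question can be worked with the usual binomial probability formula. This is a sketch; the function name binomial_probability is only illustrative.

```python
from math import comb

def binomial_probability(x, n, p):
    """Probability of exactly x successes in n independent trials."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

# 10 tosses of a fair coin, exactly 6 heads: 210 / 1024.
prob_six_heads = binomial_probability(6, 10, 0.5)
print(prob_six_heads)
```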
Combination and choose notation
“n choose r”
A coin is tossed 10 times, what is the probability of getting exactly 6 heads
10C6 is the notation for the formula n! / (r! (n - r)!), here with n = 10 and r = 6
Also written as (10 over 6): the number 10 above the number 6 inside parentheses
Relationship between variance and standard deviation
Standard deviation is the square root of variance
Equations for mean and variance of binomial distributions
The mean of X (number of favorable occurrences) is
μ = np
The variance of X is
σ² = np(1-p)
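As a sanity check, the shortcut formulas can be compared with the mean and variance computed directly from the binomial probabilities. The values n = 10 and p = 0.5 are arbitrary.

```python
from math import comb

n, p = 10, 0.5  # hypothetical number of trials and success probability

# Probability of each possible number of successes, 0 through n.
pmf = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]

mean = sum(k * pk for k, pk in enumerate(pmf))
variance = sum((k - mean) ** 2 * pk for k, pk in enumerate(pmf))
print(mean, n * p)
print(variance, n * p * (1 - p))
```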
For a normal distribution, what is significant about the two inflection points of the curve
The two inflection points sit exactly 1 standard deviation from the mean, one on each side
Equation for z score
z = (x - μ) / σ
How to find the corresponding x value when given a percentile
Look up the z value for the percentile, then solve the z score equation for x: x = μ + z * σ
(TI-84: DISTR > invNorm())
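A sketch of the same lookup in Python, where NormalDist.inv_cdf plays the role of invNorm. The mean, std dev, and percentile are hypothetical values.

```python
from statistics import NormalDist

mu, sigma = 100, 15   # hypothetical mean and std dev
percentile = 0.90     # 90th percentile

# Step 1: z value for the percentile. Step 2: convert z back to x.
z = NormalDist().inv_cdf(percentile)
x = mu + z * sigma

# Same answer in one step, using the distribution's own mean and std dev.
x_direct = NormalDist(mu, sigma).inv_cdf(percentile)
print(z, x, x_direct)
```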
What to do if the standard binomial equations fail you (numbers too high for the factorials)
Approximate it with a normal distribution
The following conditions must be true
n * p >= 10
n * (1-p) >= 10
You will need to calculate the mean (np) and std dev (sqrt(np(1-p))) to get the z score, then find the percentile
(TI-84: DISTR > normalcdf())
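A sketch of the approximation for P(X <= 55) with n = 100 and p = 0.5 (hypothetical numbers), next to the exact binomial answer for comparison. It uses the plain z score described above; a continuity correction (using 55.5 instead of 55) is a common refinement that the card does not mention.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p, x = 100, 0.5, 55  # hypothetical problem: P(X <= 55)

# Conditions for using the normal approximation.
conditions_ok = n * p >= 10 and n * (1 - p) >= 10

mu = n * p
sigma = sqrt(n * p * (1 - p))
z = (x - mu) / sigma
approx = NormalDist().cdf(z)

# Exact answer, feasible here because Python integers do not overflow.
exact = sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(x + 1))
print(conditions_ok, z, approx, exact)
```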
CDF vs. PDF
CDF: cumulative distribution function (eventually rises up to 1)
PDF: probability density function (gives the height of the curve at each value and doesn’t accumulate up to 1, like the bell curve of the normal distribution)
Basically, the CDF is good for a range of occurrences. Instead of a specific number of successes (i.e. “3 successes”), this function gives you the probability that there will be 0 to x successes in n trials. In other words, if you put X=3 it will give you the probability of 0, 1, 2 and 3 successes (all together).
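A small binomial sketch of the difference, assuming 10 trials with p = 0.5. The helper names binom_pmf and binom_cdf are only illustrative (they mirror the calculator's binompdf and binomcdf).

```python
from math import comb

def binom_pmf(x, n, p):
    """Probability of exactly x successes in n trials."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

def binom_cdf(x, n, p):
    """Probability of 0, 1, ..., x successes in n trials, all together."""
    return sum(binom_pmf(k, n, p) for k in range(x + 1))

p_exactly_3 = binom_pmf(3, 10, 0.5)   # 120 / 1024
p_at_most_3 = binom_cdf(3, 10, 0.5)   # (1 + 10 + 45 + 120) / 1024
print(p_exactly_3, p_at_most_3)
```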
Basics of the t distribution
Shorter and fatter (heavier tails) than the z distribution; gets taller and skinnier as the sample size increases
Used when you only have a sample and are trying to determine facts about the population
What is degrees of freedom
Used to describe t distributions
Equal to sample size - 1 (n-1)
Notated as a subscript, e.g. t9 (9 degrees of freedom)
t30 or higher is desirable (very close to normal)
Something to keep in mind about probability distributions
They can be one-sided (one-tailed) or two-sided (two-tailed). Be careful which one the problem calls for.
Formula for standard error of the mean
This is the standard deviation of the sampling distribution of the sample mean…
σx̄ = σ / sqrt(n)
Relationship between confidence interval, margin of error, and critical value
Margin of error = Critical value x Standard deviation of the statistic
or
Margin of error = Critical value x Standard error of the statistic
Standard error is a function of sample size and standard deviation. It plays the same role as the standard deviation of the statistic, but it is estimated from the sample because you don’t know the population parameters.
If the confidence level is 95%, then alpha is equal to 0.05. The critical probability is 1 - (alpha / 2) = 0.975. The critical value is the z or t score associated with that probability. Then plug it back into the margin of error equation.
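The steps above can be sketched as follows, assuming a z critical value and a hypothetical sample (std dev 12, n = 36). For a small sample a t critical value would be used instead.

```python
from math import sqrt
from statistics import NormalDist

confidence = 0.95
alpha = 1 - confidence
critical_probability = 1 - alpha / 2          # 0.975
z_star = NormalDist().inv_cdf(critical_probability)

s, n = 12.0, 36                               # hypothetical sample std dev and size
standard_error = s / sqrt(n)
margin_of_error = z_star * standard_error
print(z_star, standard_error, margin_of_error)
```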
Central limit theorem basically says
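The distribution of sample means is approximately normal when the sample size is large enough, whatever the shape of the population; n = 30 is the usual rule-of-thumb transition point for sample size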
When calculating the z score when you need to use standard error:
z = (x̄ - μ) / (σ / sqrt(n))   (CLT)
What is p hat (p̂)?
p hat is the proportion of individuals in the sample who have a particular characteristic
What is standard error: σp̂
σp̂ = sqrt( p(1 - p) / n ), where p is the proportion (estimated by p̂ from the sample) and n is the sample size
For the CLT to apply to proportions, n needs to be large enough for
np and n(1-p) to be greater than or equal to 10
μ vs. x̄
population mean vs. sample mean
(you can also have a μx̄, the mean of the sampling distribution of x̄)
How do you get the percentage given a z score on the TI-84
You have to use normalcdf()
The range has to run between that z score and an extreme z score (like -999 or 999)
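For comparison, a sketch of the same lookup in Python. NormalDist.cdf gives the area below a z score directly, so no extreme endpoint is needed. The z score 1.5 is a hypothetical example.

```python
from statistics import NormalDist

z = 1.5  # hypothetical z score

below = NormalDist().cdf(z)   # comparable to normalcdf(-999, 1.5)
above = 1 - below             # comparable to normalcdf(1.5, 999)
print(below, above)
```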
Basically, margin of error can be two things
Calculate the margin of error for a sample proportion (sort of like binomial, e.g. approve/disapprove of politicians)
or
Calculate the margin of error for a sample mean
Margin of error = Critical value x Standard deviation of the statistic
or
Margin of error = Critical value x Standard error of the statistic
If the sample size is too small to use the CLT, what do you do
If you can assume it came from a normal distribution, use t-values
Trick: if the population standard dev, σ, is not given, you can use the sample standard dev and use t values
What are standard errors
The building blocks of confidence intervals. A confidence interval is a statistic plus or minus a margin of error, and the margin of error is the number of standard errors you need.
The number of standard errors required is called the critical value (z*), also called the z-star value
During hypothesis testing, what are you testing the p value against
The significance level, or alpha level (typically 0.05)
What is a Type I error?
Rejecting the null hypothesis when you shouldn’t (it is actually true)
What is a Type II error?
Not rejecting the null hypothesis when you should have (it is actually false)
Equation for a t-test
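A sketch of one common textbook form, the one-sample t statistic t = (x̄ - μ0) / (s / sqrt(n)) with n - 1 degrees of freedom. The sample values and the claimed mean mu_0 are hypothetical.

```python
import math
import statistics

sample = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample
mu_0 = 4                           # hypothetical claimed population mean

n = len(sample)
x_bar = statistics.fmean(sample)
s = statistics.stdev(sample)       # sample std dev (n - 1 form)

t = (x_bar - mu_0) / (s / math.sqrt(n))
degrees_of_freedom = n - 1
print(t, degrees_of_freedom)
```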
Equation for test statistic for a single proportion
Equation for comparing two independent population averages
Test for an average difference (the paired t-test)
d is for differences
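A sketch of the paired t statistic in its usual textbook form, t = (d̄ - 0) / (s_d / sqrt(n)), where d̄ and s_d are the mean and std dev of the differences. The before/after values are hypothetical.

```python
import math
import statistics

before = [10, 12, 9, 11, 13]   # hypothetical paired measurements
after = [11, 14, 12, 13, 15]

# One difference per pair; the test is then a one-sample t-test on these.
d = [a - b for a, b in zip(after, before)]
n = len(d)
d_bar = statistics.fmean(d)
s_d = statistics.stdev(d)

t = (d_bar - 0) / (s_d / math.sqrt(n))
print(d, t)
```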
Equation for comparing two population proportions
0 because the theoretical difference between proportions is zero
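A sketch of one common form of the two-proportion z statistic, assuming the pooled proportion is used for the standard error. The counts are hypothetical.

```python
import math

x1, n1 = 60, 100   # hypothetical successes and sample size, group 1
x2, n2 = 45, 100   # hypothetical successes and sample size, group 2

p1_hat = x1 / n1
p2_hat = x2 / n2
p_pooled = (x1 + x2) / (n1 + n2)

# The 0 is the difference between proportions claimed by the null hypothesis.
standard_error = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
z = ((p1_hat - p2_hat) - 0) / standard_error
print(z)
```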
Correlation equation
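r = [ 1 / (n - 1) ] * Σ (x - x̄)(y - ȳ) / (s_x * s_y)
s_x and s_y are the sample std devs; bars are the means of the samples
How to calculate:
For each (x, y), multiply the differences from the means, (x - x̄)(y - ȳ), then add up all of those results
Divide that sum by s_x * s_y, then by n - 1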
-1: perfect negative linear relationship
0: no linear relationship
1: perfect positive linear relationship
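A sketch of the correlation calculation, assuming the usual sample form with n - 1 in the denominator. The (x, y) values are hypothetical.

```python
import math
import statistics

xs = [1, 2, 3, 4, 5]   # hypothetical data
ys = [2, 4, 5, 4, 5]

n = len(xs)
x_bar, y_bar = statistics.fmean(xs), statistics.fmean(ys)
s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)

# Multiply the differences from the means for each pair, then add them up.
sum_of_products = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
r = sum_of_products / ((n - 1) * s_x * s_y)
print(r)
```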
What is the best fitting line (regression line)
The line that minimizes the sum of squares for error (SSE)
Slope is m = r * (s_y / s_x), where r is the correlation and s_x, s_y are the standard deviations; the y-intercept is b = ȳ - m * x̄, calculated using the two means
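A sketch of the regression line built from the correlation, the two standard deviations, and the two means. The data are hypothetical, and the names slope and intercept are only illustrative.

```python
import statistics

xs = [1, 2, 3, 4, 5]   # hypothetical data
ys = [2, 4, 5, 4, 5]

n = len(xs)
x_bar, y_bar = statistics.fmean(xs), statistics.fmean(ys)
s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)
r = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)

slope = r * s_y / s_x                # rise in y per unit of x
intercept = y_bar - slope * x_bar    # the line passes through (x_bar, y_bar)
print(slope, intercept)
```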
What is a confounding variable
A variable Z that influences both X and Y. Given Z, there is no association between X and Y. However, not observing Z creates a spurious association between X and Y. In that case, Z is called a confounding factor.
Marginal distributions between two way tables
Pick the row or column variable, and divide each of its subtotals by the grand total.
How to work with joint distribution in two way tables
Divide each cell by the grand total. Sum of all should be 1.
Conditional distribution in a two way table
“Find the conditional distribution of gender by country”
Say there’s 3 countries…
The result will be 3 totals all equal to 1, with each having a percentage of gender
If it’s “find x by y”, each category of y adds up to 1: if y is the row variable, each row adds up to 1; if y is the column variable, each column adds up to 1
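A sketch of the joint, marginal, and conditional distributions for a hypothetical gender-by-country table of counts. The table layout (countries as rows) and all the numbers are assumptions for illustration.

```python
# Hypothetical two-way table of counts: rows are countries, columns are genders.
table = {
    "Country A": {"Female": 30, "Male": 20},
    "Country B": {"Female": 10, "Male": 40},
    "Country C": {"Female": 25, "Male": 25},
}

grand_total = sum(sum(row.values()) for row in table.values())

# Joint distribution: each cell divided by the grand total.
joint = {c: {g: v / grand_total for g, v in row.items()} for c, row in table.items()}

# Marginal distribution of country: each row subtotal divided by the grand total.
marginal_country = {c: sum(row.values()) / grand_total for c, row in table.items()}

# Conditional distribution of gender by country: each cell divided by its row total.
gender_by_country = {
    c: {g: v / sum(row.values()) for g, v in row.items()} for c, row in table.items()
}
print(gender_by_country)
```

Each of the three countries gets its own gender percentages that add up to 1.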
Ways you can determine independence in a two way table
Compare the results of two conditional distributions (check if they match)
Compare the marginal and conditional distributions to check for independence
^ for anything bigger than a 2x2 table, go to the Chi-square test