Intro Statistics Flashcards
central limit theorem
If x_bar is the mean of a random sample X1, X2, …, Xn of size n from a distribution with a finite mean mu and a finite positive variance sigma^2, then the distribution of W = (x_bar - mu)/(sigma/sqrt(n)) approaches N(0,1) as n approaches infinity.
Equivalently, for large n the sample mean x_bar is approximately distributed N(mu, sigma/sqrt(n)).
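A minimal simulation sketch (numpy assumed; the exponential population and the sample sizes are arbitrary choices) showing standardized sample means approaching N(0,1):

    import numpy as np

    rng = np.random.default_rng(0)
    mu, sigma, n = 1.0, 1.0, 100                     # exponential(1) has mean 1, stdev 1
    samples = rng.exponential(mu, size=(10_000, n))  # 10,000 samples of size n
    w = (samples.mean(axis=1) - mu) / (sigma / np.sqrt(n))
    print(w.mean(), w.std())                         # approximately 0 and 1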
binomial distribution
with parameters n and p is the discrete probability distribution of the number of successes in a sequence of n independent experiments, each asking a yes–no question
P(X = k) = C(n,k) * p^k * (1 - p)^(n-k)
C(n,k) = n! / (k! (n - k)!)
mu = n*p, sigma^2 = n*p*(1-p)
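A quick sketch of the pmf, mean, and variance (Python's math.comb; the n and p values are arbitrary):

    from math import comb

    def binom_pmf(k, n, p):
        # P(X = k) = C(n, k) * p^k * (1 - p)^(n - k)
        return comb(n, k) * p**k * (1 - p) ** (n - k)

    n, p = 10, 0.3
    print(binom_pmf(3, n, p))        # P(X = 3)
    print(n * p, n * p * (1 - p))    # mean and variance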
Accuracy
the proportion of true results (both true positives and true negatives) among the total number of cases examined.
accuracy = (tp + tn) / (tp + tn + fp + fn)
Precision
precision (also called positive predictive value) is the fraction of relevant instances among the retrieved instances
precision = tp / (tp + fp)
Recall
recall (also known as sensitivity) is the fraction of relevant instances that are retrieved, out of the total number of relevant instances
recall = tp / (tp + fn)
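The three metrics from a confusion matrix, as a sketch (the counts are made up):

    def classification_metrics(tp, tn, fp, fn):
        accuracy = (tp + tn) / (tp + tn + fp + fn)
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        return accuracy, precision, recall

    print(classification_metrics(tp=40, tn=45, fp=5, fn=10))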
type I error
a type I error is the rejection of a true null hypothesis (also known as a “false positive” finding)
a type I error is to falsely infer the existence of something that is not there
type II error
type II error is retaining a false null hypothesis (also known as a “false negative” finding)
a type II error is to falsely infer the absence of something that is there
Kullback–Leibler divergence
a measure of how one probability distribution diverges from a second, expected probability distribution
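A sketch for discrete distributions (numpy assumed; note D_KL is asymmetric and requires q > 0 wherever p > 0):

    import numpy as np

    def kl_divergence(p, q):
        # D_KL(P || Q) = sum_i p_i * log(p_i / q_i)
        p, q = np.asarray(p, float), np.asarray(q, float)
        mask = p > 0
        return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

    print(kl_divergence([0.5, 0.5], [0.9, 0.1]))   # > 0; zero only when P == Q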
Kolmogorov–Smirnov test
is a nonparametric test of the equality of continuous, one-dimensional probability distributions that can be used to compare a sample with a reference probability distribution (one-sample K–S test), or to compare two samples (two-sample K–S test)
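Both forms via scipy.stats (the simulated data are arbitrary):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    x = rng.normal(size=200)
    print(stats.kstest(x, "norm"))                   # one-sample K-S against N(0,1)
    print(stats.ks_2samp(x, rng.uniform(size=200)))  # two-sample K-S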
Bootstrap
statistical method for estimating the sampling distribution of an estimator by sampling with replacement from the original sample, most often with the purpose of deriving robust estimates of standard errors and confidence intervals of a population parameter like a mean, median, proportion, odds ratio, correlation coefficient or regression coefficient.
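A percentile-bootstrap sketch for a mean (numpy assumed; the data and replicate count are arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    data = rng.normal(loc=5.0, scale=2.0, size=50)   # stand-in sample
    boot_means = np.array([
        rng.choice(data, size=data.size, replace=True).mean()
        for _ in range(5_000)
    ])
    print(np.percentile(boot_means, [2.5, 97.5]))    # 95% percentile CI for the mean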
Jackknife
The jackknife estimator of a parameter is found by systematically leaving out each observation from a dataset and calculating the estimate and then finding the average of these calculations.
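A leave-one-out sketch for the mean (numpy assumed; the data are made up):

    import numpy as np

    data = np.array([2.1, 3.4, 1.9, 4.2, 2.8])
    # estimate on each leave-one-out subset, then average
    loo = np.array([np.delete(data, i).mean() for i in range(data.size)])
    print(loo.mean())   # jackknife estimate (equals the sample mean for this statistic)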
Permutation test
the distribution of the test statistic under the null hypothesis is obtained by calculating all possible values of the test statistic under rearrangements of the labels on the observed data points. In other words, the way treatments were allocated to subjects in the design is mirrored in the analysis.
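A Monte Carlo permutation test sketch for a difference of means (numpy assumed; the groups are made up):

    import numpy as np

    rng = np.random.default_rng(0)
    a = np.array([12.1, 14.3, 11.8, 13.9])   # stand-in treatment group
    b = np.array([10.2, 11.1, 12.0, 10.7])   # stand-in control group
    observed = a.mean() - b.mean()
    pooled = np.concatenate([a, b])
    diffs = []
    for _ in range(10_000):                  # sample label rearrangements
        perm = rng.permutation(pooled)
        diffs.append(perm[:a.size].mean() - perm[a.size:].mean())
    print(np.mean(np.abs(diffs) >= abs(observed)))   # two-sided P-value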
Two tailed test
appropriate if the estimated value may be more than or less than the reference value, for example, whether a test taker may score above or below the historical average
One tailed test
appropriate if the estimated value may depart from the reference value in only one direction, for example, whether a machine produces more than one-percent defective products
Assessing normality
Subtract the mean and divide by the standard deviation, then compare to standard normal values (normal scores)
Box plot
Quantitative variables (often split by a categorical grouping variable); shows the shape of a distribution, its central value, and its variability
Median: black center line
Box top and bottom: first and third quartiles
Whiskers: vertical lines extending up to 1.5 times the IQR from the box
Points beyond the whiskers are plotted individually as potential outliers
IQR
Inter quartile range
Distance between first and third quartiles
Two way table
a two-way table presents categorical data by counting the number of observations that fall into each combination of the two variables' categories
Correlation coefficient
r = 1/(n-1) * Sum( ((x - x_bar)/s_x) * ((y - y_bar)/s_y) )
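The formula translated directly (numpy assumed; checked against np.corrcoef):

    import numpy as np

    def corr(x, y):
        # r = 1/(n-1) * sum(((x - x_bar)/s_x) * ((y - y_bar)/s_y))
        x, y = np.asarray(x, float), np.asarray(y, float)
        zx = (x - x.mean()) / x.std(ddof=1)
        zy = (y - y.mean()) / y.std(ddof=1)
        return float(np.sum(zx * zy) / (x.size - 1))

    x, y = [1, 2, 3, 4, 5], [2, 4, 5, 4, 5]
    print(corr(x, y), np.corrcoef(x, y)[0, 1])   # should agree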
ANOVA
Analysis of variance is a statistical method used to test differences between two or more means by comparing variation within and between groups
Parameter
parameter is a number describing a population, such as a percentage or proportion.
Example: the true proportion p of defective items in the entire population
Statistic
is a number which may be computed from the data observed in a random sample without requiring the use of any unknown parameters, such as a sample mean.
Example: take a sample of 300 items and observe that 15 are defective; the computed statistic p_hat = 15/300 = 0.05 is an estimate of the parameter p
Biased estimator
if a statistic is systematically skewed away from the true parameter p, it is considered a biased estimator of that parameter
Unbiased estimator
unbiased estimator will have a sampling distribution whose mean is equal to the true value of the parameter.
Variability of statistic
Determined by the spread of its sampling distribution. In general, larger samples will have smaller variability
Probability model
mathematical representation of a random phenomenon. It is defined by its sample space, events within the sample space, and probabilities associated with each event.
Sample space
set of all possible outcomes
Probability
numerical value assigned to a given event A. The probability of an event is written P(A), and describes the long-run relative frequency of the event.
Rule 1: Any probability P(A) is a number between 0 and 1 (0 <= P(A) <= 1).
Rule 2: The probability of the sample space S is equal to 1 (P(S) = 1).
Probability disjoint
If two events have no outcomes in common
Rule 3: If two events A and B are disjoint, then the probability of either event is the sum of the probabilities of the two events:
P(A or B) = P(A) + P(B).
Probability union
chance of any (one or more) of two or more events occurring is called the union of the events. The probability of the union of disjoint events is the sum of their individual probabilities.
If two events A and B are not disjoint, then the probability of their union (the event that A or B occurs) is equal to the sum of their probabilities minus the probability of their intersection.
Probability complement
Rule 4: The probability that any event A does not occur is P(A^c) = 1 - P(A).
Probability independence
If the outcome of the first event has no effect on the probability of the second event, the two events are independent.
Rule 5: If two events A and B are independent, then the probability of both events is the product of the probabilities for each event:
P(A and B) = P(A)P(B)
Probability intersection
chance of all of two or more events occurring
For independent events, the probability of the intersection of two or more events is the product of the probabilities.
Conditional probability
the conditional probability of event B is the probability that the event will occur given the knowledge that event A has already occurred
If events A and B are not independent, then the probability of the intersection of A and B (the probability that both events occur) is defined by
P(A and B) = P(A)P(B|A).
From this definition, the conditional probability P(B|A) is easily obtained by dividing by P(A):
P(B|A)= P(A and B)/P(A)
Random variable
is a variable whose possible values are numerical outcomes of a random phenomenon
Law of large numbers
the law of large numbers states that the sample mean computed from an increasingly large number of observations of a random variable approaches the distribution mean
Properties of random variate means
mu_(a+bX) = a + b*mu_X
mu_(X+Y) = mu_X + mu_Y
Properties of random variate variance
sigma^2_(a+bX) = b^2 * sigma^2_X
sigma^2_(X+Y) = sigma^2_X + sigma^2_Y (for independent X and Y)
Sample mean and variance
mu_(x_bar) = mu, sigma_(x_bar) = sigma/sqrt(n)
The standard deviation of the sample mean gets smaller as n goes up
If distribution of population is normal then distribution of sample mean is normal with mean mu and stdev sigma/sqrt(n)
Tests of Significance for Two Unknown Means and Known Standard Deviations
two-sample z statistic
given independent samples of sizes n1 and n2 from two normal populations with unknown means mu1 and mu2 and known standard deviations sigma1 and sigma2, the test statistic comparing the means is known as the two-sample z statistic
z = ((x1 - x2) - (mu1 - mu2))/ sqrt((sigma1^2/n1) + (sigma2^2/n2))
Tests of Significance for Two Unknown Means and Unknown Standard Deviations
two-sample t-statistic
t = ((x1 - x2) - (mu1 - mu2))/ sqrt((s1^2/n1) + (s2^2/n2))
confidence interval
(x1 - x2) +/- t*(sqrt(s1^2/n1 + s2^2/n2))
conservative P-values may be obtained using the t(k) distribution where k represents the smaller of n1-1 and n2-1
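scipy's Welch test computes this unpooled statistic (the samples are made up; equal_var=False selects the unpooled form):

    from scipy import stats

    a = [23.1, 25.4, 24.8, 22.9, 26.0]   # stand-in sample 1
    b = [21.5, 22.8, 20.9, 23.3, 22.1]   # stand-in sample 2
    print(stats.ttest_ind(a, b, equal_var=False))   # two-sample t without pooling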
Pooled t-statistic
assumes the same variance in both populations
s_p^2 = ((n1 - 1)s1^2 + (n2 - 1)s2^2) / (n1 + n2 - 2)
t = ((x1 - x2) - (mu1 - mu2)) / (s_p * sqrt((1/n1) + (1/n2)))
sample proportion (categorical)
given a simple random sample of size n from a population, the number of “successes” X divided by the sample size n gives us p_hat the sample proportion
The count X follows a binomial distribution; the sample proportion p_hat has mean p and variance p(1-p)/n, and is approximately normal for large n.
An approximate level C confidence interval for p is p_hat +/- z* sqrt((p_hat(1-p_hat))/n), where z* is the upper (1-C)/2 critical value from the standard normal distribution.
Confidence Intervals for Unknown Mean and Known Standard Deviation
For a population with unknown mean mu and known standard deviation sigma, a confidence interval for the population mean, based on a simple random sample (SRS) of size n, is x_bar +/- z*(sigma/sqrt(n)), where z* is the upper (1-C)/2 critical value for the standard normal distribution.
Level C
gives the probability that the interval produced by the method employed includes the true value of the parameter theta
Confidence Intervals for Unknown Mean and Unknown Standard Deviation
For a population with unknown mean mu and unknown standard deviation, a confidence interval for the population mean, based on a simple random sample (SRS) of size n, is x_bar +/- t* (s/sqrt(n)), where t* is the upper (1-C)/2 critical value for the t distribution with n-1 degrees of freedom, t(n-1).
s = sample standard deviation; s/sqrt(n) is the standard error of the mean
Significance Tests for Unknown Mean and Known Standard Deviation
For claims about a population mean from a population with a normal distribution or for any sample with large sample size n (for which the sample mean will follow a normal distribution by the Central Limit Theorem), if the standard deviation sigma is known, the appropriate significance test is known as the z-test, where the test statistic is defined as
z = (x_bar - mu_0)/(sigma/sqrt(n))
Power of a test
the probability that a fixed level significance test will reject the null hypothesis H0 when a particular alternative value of the parameter is true.
Significance Tests for Unknown Mean and Unknown Standard Deviation
For claims about a population mean from a population with a normal distribution, or for any sample with large sample size n (for which the sample mean will follow a normal distribution by the Central Limit Theorem), with unknown standard deviation, the appropriate significance test is known as the t-test, where the test statistic is defined as
t = (x_bar - mu_0)/(s/sqrt(n))
Sign test
To perform a sign test on matched pairs data, take the difference between the two measurements in each pair and count the number of non-zero differences n. Of these, count the number of positive differences X. Determine the probability of observing X positive differences for a B(n,1/2) distribution, and use this probability as a P-value for the null hypothesis.
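A sketch using the binomial distribution directly (scipy assumed; the paired measurements are made up):

    import numpy as np
    from scipy import stats

    before = np.array([140, 135, 150, 142, 139, 148])
    after = np.array([135, 136, 145, 140, 140, 141])
    d = after - before
    d = d[d != 0]                     # drop zero differences
    x = int(np.sum(d > 0))            # number of positive differences
    print(stats.binomtest(x, d.size, p=0.5).pvalue)   # P-value from B(n, 1/2)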
categorical test single proportion
To test the null hypothesis H0: p = p0 against a one- or two-sided alternative hypothesis Ha, replace p with p0 in the test statistic
z = (p_hat - p0)/sqrt((p0*(1-p0))/n)
categorical sample size single proportion
n = (z*/m)^2 * p*(1-p*), where m is the desired margin of error and p* is a guessed value of the proportion.
The required sample size is maximized when p* = 0.5, in which case
n = (z*/(2m))^2.
Comparison of Two Proportions
An approximate level C confidence interval for p1 - p2 is (p1_hat - p2_hat) +/- z* sD, where z* is the upper (1-C)/2 critical value from the standard normal distribution.
sD = sqrt( (p1_hat(1-p1_hat)/n1) + (p2_hat(1-p2_hat)/n2) )
Test two proportions
To test the null hypothesis H0: p1 = p2 against a one- or two-sided alternative hypothesis Ha, first compute a pooled estimate of the common proportion:
p_hat = (X1 + X2)/(n1 + n2), where X1 and X2 represent the number of "successes" in each sample
sP = sqrt(p_hat(1-p_hat)(1/n1 + 1/n2))
z = (p1_hat - p2_hat)/sP follows the standard normal distribution (with mean 0 and standard deviation 1). The test statistic z is used to compute the P-value
chi-squared statistic
chi^2 = Sum( (observed - expected)^2/expected )
chi-squared distribution
a random variable is said to have a chi-square distribution with m degrees of freedom if it is the sum of the squares of m independent standard normal random variables
The distribution of the chi-square test statistic based on k counts is approximately the chi-square distribution with m = k-1 degrees of freedom, denoted chi^2(k-1).
chi-squared fitting
In general, if we estimate d parameters under the null hypothesis with k possible counts the degrees of freedom for the associated chi-square distribution will be k - 1 - d.
chi-squared hypothesis test
use the chi-square test to test the validity of a distribution assumed for a random phenomenon. The test evaluates the null hypotheses H0 (that the data are governed by the assumed distribution) against the alternative (that the data are not drawn from the assumed distribution).
Let p1, p2, …, pk denote the probabilities hypothesized for k possible outcomes. In n independent trials, let Y1, Y2, …, Yk denote the observed counts of each outcome, to be compared with the expected counts np1, np2, …, npk.
The chi-square test statistic is
q_(k-1) = (Y1 - np1)^2/(np1) + (Y2 - np2)^2/(np2) + … + (Yk - npk)^2/(npk)
Reject H0 if this value exceeds the upper alpha critical value of the chi^2(k-1) distribution, where alpha is the desired level of significance.
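A goodness-of-fit sketch (scipy assumed; the die-roll counts are made up):

    import numpy as np
    from scipy import stats

    observed = np.array([22, 21, 22, 27, 22, 36])     # stand-in die rolls, n = 150
    expected = np.full(6, observed.sum() / 6)         # fair-die hypothesis p_i = 1/6
    stat = np.sum((observed - expected) ** 2 / expected)
    print(stat, stats.chisquare(observed, expected))  # compare against chi^2(5)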
Permutations and combinations
Permutations are for lists (order matters)
Combinations are for groups (order doesn’t matter)
Permutation formula
P(n,k)=n!/(n-k)!
You have n items and want to find the number of ways k of those items can be ordered
n pick k
Combination formula
C(n,k) = n!/(k!(n-k)!)
Multinomial coefficient
n!/(k1! * k2! * k3! * … * km!)
as the number of ways of depositing n distinct objects into m distinct bins, with k1 objects in the first bin, k2 objects in the second bin, and so on.
the number of distinct ways to permute a multiset of n elements, and ki are the multiplicities of each of the distinct elements
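All three counts via Python's math module (the example values are arbitrary):

    from math import comb, factorial, perm

    print(perm(5, 2))   # P(5,2) = 5!/3! = 20 ordered pairs
    print(comb(5, 2))   # C(5,2) = 10 unordered pairs

    def multinomial(ks):
        # n! / (k1! * k2! * ... * km!), where n = sum(ks)
        out = factorial(sum(ks))
        for k in ks:
            out //= factorial(k)
        return out

    print(multinomial([2, 1, 1]))   # distinct arrangements of "AABC" = 12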