Concepts Flashcards
What are some common measures of central tendency?
Mean - the average value (sum of n numbers/n)
- geometric mean can be useful for rates of growth (product of n numbers to the 1/n-th power)
- harmonic mean is used to compute the F1 score in machine learning (n divided by the sum from i = 1 to n of 1/x_i)
Median - the middle value
- more resilient to extreme values than the mean
Mode - the most common value
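A minimal sketch of these measures using Python's standard-library statistics module (the sample values are made up):

```python
import statistics

data = [2, 4, 4, 4, 8]  # made-up sample values

print(statistics.mean(data))            # arithmetic mean: sum / n
print(statistics.geometric_mean(data))  # (product of values) ** (1/n)
print(statistics.harmonic_mean(data))   # n / (sum of 1/x_i)
print(statistics.median(data))          # middle value
print(statistics.mode(data))            # most common value: 4
```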
What are some common measures of spread?
Range (maximum value - minimum value)
Variance - the average squared deviation of values from the mean (sample variance s^2 = (1/(n-1)) * sum from i = 1 to n of (x_i - mean of x)^2; population variance sigma^2 divides by n instead)
Standard deviation - the square root of the variance
Quartiles
- Q1 = the median of the lower half of the data (25th percentile)
- Q2 = median (50th percentile)
- Q3 = the median of the upper half of the data (75th percentile)
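A quick sketch of these measures with the standard-library statistics module (made-up data):

```python
import statistics

data = [1, 3, 5, 7, 9, 11]  # made-up sample values

print(max(data) - min(data))            # range
print(statistics.variance(data))        # sample variance, divides by n - 1
print(statistics.stdev(data))           # standard deviation = sqrt(variance)
print(statistics.quantiles(data, n=4))  # [Q1, Q2 (median), Q3]
```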
What is a p-value? (Or, explain statistical significance.)
- A number between 0 and 1
- The probability of observing data at least as extreme as the data actually observed, assuming the null hypothesis is true
- P-values are used in hypothesis testing to decide whether to reject a null hypothesis
- There is typically a threshold (the significance level) below which the null hypothesis is rejected; historically this is p = 0.05
- If p <= 0.05, there is at most a 5% probability of observing data this extreme if the null hypothesis is true
- Results that meet this criterion are said to be statistically significant
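A hand-rolled simulation sketch of the idea, assuming a fair-coin null hypothesis and a made-up observation: the p-value is estimated as the fraction of simulated null outcomes at least as extreme as what was observed.

```python
import random

# Null hypothesis: the coin is fair (p = 0.5).
# Made-up observation: 60 heads in 100 flips.
observed_heads = 60
n_flips, n_sims = 100, 100_000

extreme = 0
for _ in range(n_sims):
    heads = sum(random.random() < 0.5 for _ in range(n_flips))
    if heads >= observed_heads:  # one-tailed: at least as extreme
        extreme += 1

p_value = extreme / n_sims
print(p_value)  # ~0.028 < 0.05, so the null would be rejected here
```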
What is a null hypothesis?
- The hypothesis being tested during a hypothesis test; based on the data, it is either rejected or not rejected
- Often, the hypothesis is that the treatment has no effect, or that there’s no difference between two groups
What is R^2?
- R^2 is the proportion of the variance in the dependent variable (y) that can be explained by the independent variable (x)
- It’s also called the coefficient of determination
- It’s a goodness of fit measure in linear regression, among other things
How to calculate:
- Let the residuals be e_i = y_i,actual - y_i,predicted
- Total sum of squares = SS_tot = sum of (y_i,actual - mean of y)^2
- Residual sum of squares = SS_res = sum of (e_i)^2
- R^2 = 1 - (SS_res / SS_tot)
If R^2 = 0, none of the variance in y is explained by x.
If R^2 = 1, all of the variance in y is explained by x.
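A short sketch of the calculation, using made-up actual and predicted values:

```python
def r_squared(y_actual, y_predicted):
    """R^2 = 1 - SS_res / SS_tot."""
    y_mean = sum(y_actual) / len(y_actual)
    ss_tot = sum((y - y_mean) ** 2 for y in y_actual)
    ss_res = sum((ya - yp) ** 2 for ya, yp in zip(y_actual, y_predicted))
    return 1 - ss_res / ss_tot

# made-up predictions from some fitted line; result is 0.98
print(r_squared([1, 2, 3, 4], [0.9, 2.1, 3.2, 3.8]))
```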
Explain the chi-squared test.
A statistical test used to determine whether observed counts across two or more categories differ significantly from the counts expected under the null hypothesis.
Null hypothesis: The categories are independent of each other (e.g., being female is independent of being blond).
Assumptions:
- Data are independent
How to perform the test:
- Calculate the chi-squared statistic = sum from i = 1 to k of (x_i - m_i)^2 / m_i = (sum from i = 1 to k of x_i^2 / m_i) - n, where k = number of categories, x_i = observed count in category i, n = total number of observations, and m_i = expected count in category i (m_i = n * p_i)
- Use a chi-squared table to find the p-value and decide whether to reject the null hypothesis
- To use a chi-squared table, you will need to know the degrees of freedom for the test. This is usually k - 1 (a single set of categories). For tests with two sets of categories (an r x c contingency table), dof = (r - 1)(c - 1).
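A sketch using scipy.stats with made-up counts (a single set of categories, so dof = k - 1):

```python
from scipy.stats import chisquare

# Made-up counts: do 120 die rolls land uniformly across the 6 faces?
observed = [15, 22, 19, 25, 17, 22]
expected = [20] * 6  # m_i = n * p_i = 120 * (1/6)

stat, p = chisquare(f_obs=observed, f_exp=expected)  # dof = k - 1 = 5
print(stat, p)  # reject the null of uniformity if p falls below the threshold
```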
Explain degrees of freedom.
The number of values in the final calculation of a statistic that are free to vary.
Explain the Central Limit Theorem (CLT)
The CLT states that given repeated draws of sample size n (with replacement) from an underlying distribution with a mean of mu and a standard deviation of sigma, the distribution of sample means will approximate a normal distribution with a mean of mu and a standard deviation of sigma/sqrt(n) (the standard error). Sample sums likewise approximate a normal distribution, with mean n*mu and standard deviation sigma*sqrt(n).
The approximation improves as n gets larger.
This works regardless of the shape of the original, underlying distribution.
This is useful because many distributions in the real world are not normal. We can get around this problem by treating a sampling distribution as if it were normal, as long as the sample size is large enough. Exploratory data analysis can help us figure out how large is “large enough.”
In practice, the standard deviation sigma of the underlying distribution is often not known, and we have to approximate it using the sample standard deviation.
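A simulation sketch, assuming an exponential underlying distribution (clearly not normal) with lambda = 1, so mu = sigma = 1:

```python
import random
import statistics

n = 50  # sample size
means = [statistics.mean(random.expovariate(1.0) for _ in range(n))
         for _ in range(10_000)]

# CLT prediction: mean ~ mu = 1.0, standard deviation ~ sigma/sqrt(n) ~ 0.141
print(statistics.mean(means))
print(statistics.stdev(means))
```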
Information on the normal distribution.
PDF = f(x) = (1 / (sigma * sqrt(2*pi))) * e^(-(x - mu)^2 / (2*sigma^2)) where mu is the location parameter and sigma is the scale parameter
Mean = median = mode = mu
Range = -infinity to infinity
Variance = sigma^2
The sum of two or more independent normal random variables is itself normal; the means and variances are additive.
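A quick check of the PDF formula against Python's built-in NormalDist (made-up values for mu, sigma, and x):

```python
import math
from statistics import NormalDist

mu, sigma, x = 0.0, 1.0, 1.5  # made-up values

# The PDF written out from the formula above
by_hand = math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))
print(by_hand, NormalDist(mu, sigma).pdf(x))  # the two should match
```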
Information on the binomial distribution
A discrete distribution that’s used to model the number of successes and failures in a sample of size n drawn with replacement from a population of size N (or, a series of Bernoulli trials).
PMF = (n choose k) * p^k * (1 - p)^(n-k) where n is the number of trials, k is the number of successes, and p is the probability of success
n choose k is the binomial coefficient and = n! / (k! * (n-k)!)
Mean = np
Range = integers from 0 to n
Variance = np * (1-p)
Normal approximation to the binomial can be used when n is large and/or p is close to 1/2: B(n, p) ~ N(np, np * (1-p))
- a good rule of thumb: the normal approximation can be used if np and n(1-p) are both > 5
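A sketch comparing an exact binomial tail probability to its normal approximation, with made-up n, p, and k (here np = n(1-p) = 20 > 5, so the rule of thumb is satisfied):

```python
import math
from statistics import NormalDist

n, p, k = 40, 0.5, 25  # made-up: P(X >= 25) for X ~ B(40, 0.5)

# Exact tail probability from the PMF
exact = sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Normal approximation N(np, np(1-p)), with a continuity correction
approx = 1 - NormalDist(n * p, math.sqrt(n * p * (1 - p))).cdf(k - 0.5)

print(exact, approx)  # the two should be close
```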
Information on the Poisson distribution
A discrete distribution closely related to the exponential distribution. Models the number of times an event occurs during a fixed time interval.
PMF = (lambda^k * e^(-lambda)) / k! where k is the number of events and lambda is the expected number of events per unit time (event rate)
Mean = lambda
Range = integers >= 0
Variance = lambda
Commonly used to model count data.
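A small sketch of the PMF with a made-up rate, checking that it behaves like a probability distribution with mean lambda:

```python
import math

lam = 3.0  # made-up event rate

def poisson_pmf(k, lam):
    return lam**k * math.exp(-lam) / math.factorial(k)

print(poisson_pmf(2, lam))                              # P(exactly 2 events)
print(sum(poisson_pmf(k, lam) for k in range(50)))      # ~1.0: PMF sums to 1
print(sum(k * poisson_pmf(k, lam) for k in range(50)))  # ~lambda: the mean
```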
Information on the exponential distribution.
A continuous probability distribution that describes the time between events in a Poisson process (a discrete series of events where the average timing between events is known, but the exact timing of events is random). Often used to model the time or space between events.
PDF = lambda * e^(-lambda * x) where lambda is the rate parameter (e.g., the average number of hurricanes per year)
Mean = 1/lambda
Range = real numbers >= 0
Variance = 1/lambda^2
The exponential distribution is “memoryless” - as you continue to wait, the probability of an event occurring “soon” remains constant.
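A numeric sketch of memorylessness with made-up values: the survival function is P(X > x) = e^(-lambda*x), so P(X > s + t | X > s) works out to exactly P(X > t).

```python
import math

lam = 0.5        # made-up rate parameter
s, t = 2.0, 3.0  # made-up waiting times

def survival(x):
    return math.exp(-lam * x)  # P(X > x) for the exponential

# Memorylessness: having already waited s, the chance of waiting t more
# is the same as the chance of waiting t from the start.
print(survival(s + t) / survival(s), survival(t))  # identical
```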
What is a t-test?
A one sample t-test determines whether the sample mean of a set of data is statistically significantly different from a distribution with known mean (and unknown variance). In other words, it determines how likely the difference between the sample mean and the distribution mean would occur by chance.
Null hypothesis: The sample mean is not different from the distribution mean.
Assumptions:
- Data are independent
- Data are normally distributed
- Homoskedasticity (homogeneity of variance; relevant when comparing two samples)
What is a z-test?
A Z-test is used to determine whether the mean of a data set is statistically different from a normal distribution with known mean and variance.
Null hypothesis: The sample mean is not significantly different from the known mean.
Assumption: The data are normally distributed.
Z = (x bar - mu_0) / (sigma / sqrt(n)) where x bar is the sample mean, mu_0 is the mean of the known distribution, sigma is the standard deviation of the known distribution, and n is the sample size.
To reject the null hypothesis at the 95% confidence level, |Z| must exceed 1.96 (or approximately 2) for a two-tailed test.
For normally-distributed data, about 68% of the data are within one standard deviation of the mean (z = 1) and about 95% are within two standard deviations (z = 2).
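A sketch of the calculation with made-up numbers, using the standard-library NormalDist for the two-tailed p-value:

```python
import math
from statistics import NormalDist

# Made-up example: sample of n = 30 with mean 103; known distribution N(100, 15^2)
x_bar, mu_0, sigma, n = 103.0, 100.0, 15.0, 30

z = (x_bar - mu_0) / (sigma / math.sqrt(n))  # note: sigma / sqrt(n)
p = 2 * (1 - NormalDist().cdf(abs(z)))       # two-tailed p-value
print(z, p)  # reject the null at the 95% level only if |z| > 1.96
```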
Explain a 1-sample t-test.
Compares the mean of a single group against a known mean. (This differs from a Z-test in that the variance of the "known" distribution is unknown and must be estimated using the standard deviation of the sampled data.)
Null hypothesis: The sample mean is equal to the known mean.
t = (x bar - mu_0) / (s/sqrt(n)) where x bar is the sample mean, mu_0 is the mean of the known distribution, s is the sample standard deviation, and n is the sample size.
Degrees of freedom = n-1
To reject the null hypothesis, compare t against a critical value from a t-table (using n - 1 degrees of freedom).
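A sketch with made-up data, computing t by hand and checking it against scipy.stats.ttest_1samp:

```python
import math
import statistics
from scipy import stats

data = [5.1, 4.9, 6.2, 5.8, 5.5, 6.0, 5.3]  # made-up sample
mu_0 = 5.0                                  # known mean to test against

x_bar = statistics.mean(data)
s = statistics.stdev(data)               # sample standard deviation
n = len(data)
t = (x_bar - mu_0) / (s / math.sqrt(n))  # formula above, dof = n - 1

print(t)
print(stats.ttest_1samp(data, mu_0))     # should report the same t statistic
```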