Concepts Flashcards
What are some common measures of central tendency?
Mean - the average value (sum of n numbers/n)
- geometric mean can be useful for rates of growth (product of n numbers to the 1/n-th power)
- harmonic mean is used to compute the F1 score in machine learning (n divided by the sum from i = 1 to n of 1/x_i)
Median - the middle value
- more resilient to extreme values than the mean
Mode - the most common value
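A minimal sketch of these measures using Python's standard-library statistics module (the sample values are made up):

```python
import statistics

data = [2, 4, 4, 4, 8]  # made-up sample values

print(statistics.mean(data))            # arithmetic mean: sum / n
print(statistics.geometric_mean(data))  # (product of values) ** (1/n)
print(statistics.harmonic_mean(data))   # n / (sum of 1/x_i)
print(statistics.median(data))          # middle value
print(statistics.mode(data))            # most common value: 4
```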
What are some common measures of spread?
Range (maximum value - minimum value)
Variance - the average squared deviation of values from the mean (sample variance s^2 = (1/(n-1)) * sum from i = 1 to n of (x_i - mean of x)^2; population variance sigma^2 divides by n instead)
Standard deviation - the square root of the variance
Quartiles
- Q1 = the median of the lower half of the data (25th percentile)
- Q2 = median (50th percentile)
- Q3 = the median of the upper half of the data (75th percentile)
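A quick sketch of these measures with the standard-library statistics module (made-up data):

```python
import statistics

data = [1, 3, 5, 7, 9, 11]  # made-up sample values

print(max(data) - min(data))            # range
print(statistics.variance(data))        # sample variance, divides by n - 1
print(statistics.stdev(data))           # standard deviation = sqrt(variance)
print(statistics.quantiles(data, n=4))  # [Q1, Q2 (median), Q3]
```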
What is a p-value? (Or, explain statistical significance.)
- A number between 0 and 1
- The probability of observing data at least as extreme as the data actually observed, assuming the null hypothesis is true
- P-values are used in hypothesis testing to decide whether to reject a null hypothesis
- There is typically a threshold (the significance level) below which the null hypothesis is rejected; historically this is p = 0.05
- If p <= 0.05, there is at most a 5% probability of observing data this extreme if the null hypothesis is true
- Results that meet this criterion are said to be statistically significant
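A hand-rolled simulation sketch of the idea, assuming a fair-coin null hypothesis and a made-up observation: the p-value is estimated as the fraction of simulated null outcomes at least as extreme as what was observed.

```python
import random

# Null hypothesis: the coin is fair (p = 0.5).
# Made-up observation: 60 heads in 100 flips.
observed_heads = 60
n_flips, n_sims = 100, 100_000

extreme = 0
for _ in range(n_sims):
    heads = sum(random.random() < 0.5 for _ in range(n_flips))
    if heads >= observed_heads:  # one-tailed: at least as extreme
        extreme += 1

p_value = extreme / n_sims
print(p_value)  # ~0.028 < 0.05, so the null would be rejected here
```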
What is a null hypothesis?
- The hypothesis being tested during a hypothesis test; based on the data, it is either rejected or not rejected
- Often, the hypothesis is that the treatment has no effect, or that there’s no difference between two groups
What is R^2?
- R^2 is the proportion of the variance in the dependent variable (y) that can be explained by the independent variable (x)
- It’s also called the coefficient of determination
- It’s a goodness of fit measure in linear regression, among other things
How to calculate:
- Let the residuals be e_i = y_i,actual - y_i,predicted
- Total sum of squares = SS_tot = sum of (y_i,actual - mean of y)^2
- Residual sum of squares = SS_res = sum of (e_i)^2
- R^2 = 1 - (SS_res / SS_tot)
If R^2 = 0, none of the variance in y is explained by x.
If R^2 = 1, all of the variance in y is explained by x.
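A short sketch of the calculation, using made-up actual and predicted values:

```python
def r_squared(y_actual, y_predicted):
    """R^2 = 1 - SS_res / SS_tot."""
    y_mean = sum(y_actual) / len(y_actual)
    ss_tot = sum((y - y_mean) ** 2 for y in y_actual)
    ss_res = sum((ya - yp) ** 2 for ya, yp in zip(y_actual, y_predicted))
    return 1 - ss_res / ss_tot

# made-up predictions from some fitted line; result is 0.98
print(r_squared([1, 2, 3, 4], [0.9, 2.1, 3.2, 3.8]))
```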
Explain the chi-squared test.
A statistical test used to determine whether observed counts across two or more categories differ significantly from the counts expected under the null hypothesis.
Null hypothesis: The categories are independent of each other (e.g., being female is independent of being blond).
Assumptions:
- Data are independent
How to perform the test:
- Calculate the chi-squared statistic = sum from i = 1 to k of (x_i - m_i)^2 / m_i = (sum from i = 1 to k of x_i^2 / m_i) - n, where k = number of categories, x_i = observed count in category i, n = total number of observations, and m_i = expected count in category i (m_i = n * p_i)
- Use a chi-squared table to find the p-value and decide whether to reject the null hypothesis
- To use a chi-squared table, you will need to know the degrees of freedom for the test. This is usually k - 1 (a single set of categories). For tests with two sets of categories (an r x c contingency table), dof = (r - 1)(c - 1).
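A sketch using scipy.stats with made-up counts (a single set of categories, so dof = k - 1):

```python
from scipy.stats import chisquare

# Made-up counts: do 120 die rolls land uniformly across the 6 faces?
observed = [15, 22, 19, 25, 17, 22]
expected = [20] * 6  # m_i = n * p_i = 120 * (1/6)

stat, p = chisquare(f_obs=observed, f_exp=expected)  # dof = k - 1 = 5
print(stat, p)  # reject the null of uniformity if p falls below the threshold
```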
Explain degrees of freedom.
The number of values in the final calculation of a statistic that are free to vary.
Explain the Central Limit Theorem (CLT)
The CLT states that given repeated draws of sample size n (with replacement) from an underlying distribution with a mean of mu and a standard deviation of sigma, the distribution of sample means will approximate a normal distribution with a mean of mu and a standard deviation of sigma/sqrt(n) (the standard error). Sample sums likewise approximate a normal distribution, with mean n*mu and standard deviation sigma*sqrt(n).
The approximation improves as n gets larger.
This works regardless of the shape of the original, underlying distribution.
This is useful because many distributions in the real world are not normal. We can get around this problem by treating a sampling distribution as if it were normal, as long as the sample size is large enough. Exploratory data analysis can help us figure out how large is “large enough.”
In practice, the standard deviation sigma of the underlying distribution is often not known, and we have to approximate it using the sample standard deviation.
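A simulation sketch, assuming an exponential underlying distribution (clearly not normal) with lambda = 1, so mu = sigma = 1:

```python
import random
import statistics

n = 50  # sample size
means = [statistics.mean(random.expovariate(1.0) for _ in range(n))
         for _ in range(10_000)]

# CLT prediction: mean ~ mu = 1.0, standard deviation ~ sigma/sqrt(n) ~ 0.141
print(statistics.mean(means))
print(statistics.stdev(means))
```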
Information on the normal distribution.
PDF = f(x) = (1 / (sigma * sqrt(2*pi))) * e^(-(x - mu)^2 / (2*sigma^2)) where mu is the location parameter and sigma is the scale parameter
Mean = median = mode = mu
Range = -infinity to infinity
Variance = sigma^2
The sum of two or more independent normal random variables is itself normal; the means and variances are additive.
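A quick check of the PDF formula against Python's built-in NormalDist (made-up values for mu, sigma, and x):

```python
import math
from statistics import NormalDist

mu, sigma, x = 0.0, 1.0, 1.5  # made-up values

# The PDF written out from the formula above
by_hand = math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))
print(by_hand, NormalDist(mu, sigma).pdf(x))  # the two should match
```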
Information on the binomial distribution
A discrete distribution that’s used to model the number of successes and failures in a sample of size n drawn with replacement from a population of size N (or, a series of Bernoulli trials).
PMF = (n choose k) * p^k * (1 - p)^(n-k) where n is the number of trials, k is the number of successes, and p is the probability of success
n choose k is the binomial coefficient and = n! / (k! * (n-k)!)
Mean = np
Range = integers from 0 to n
Variance = np * (1-p)
Normal approximation to the binomial can be used when n is large and/or p is close to 1/2: B(n, p) ~ N(np, np * (1-p))
- a good rule of thumb: the normal approximation can be used if np and n(1-p) are both > 5
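A sketch comparing an exact binomial tail probability to its normal approximation, with made-up n, p, and k (here np = n(1-p) = 20 > 5, so the rule of thumb is satisfied):

```python
import math
from statistics import NormalDist

n, p, k = 40, 0.5, 25  # made-up: P(X >= 25) for X ~ B(40, 0.5)

# Exact tail probability from the PMF
exact = sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Normal approximation N(np, np(1-p)), with a continuity correction
approx = 1 - NormalDist(n * p, math.sqrt(n * p * (1 - p))).cdf(k - 0.5)

print(exact, approx)  # the two should be close
```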
Information on the Poisson distribution
A discrete distribution closely related to the exponential distribution. Models the number of times an event occurs during a fixed time interval.
PMF = (lambda^k * e^(-lambda)) / k! where k is the number of events and lambda is the expected number of events per unit time (event rate)
Mean = lambda
Range = integers >= 0
Variance = lambda
Commonly used to model count data.
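A small sketch of the PMF with a made-up rate, checking that it behaves like a probability distribution with mean lambda:

```python
import math

lam = 3.0  # made-up event rate

def poisson_pmf(k, lam):
    return lam**k * math.exp(-lam) / math.factorial(k)

print(poisson_pmf(2, lam))                              # P(exactly 2 events)
print(sum(poisson_pmf(k, lam) for k in range(50)))      # ~1.0: PMF sums to 1
print(sum(k * poisson_pmf(k, lam) for k in range(50)))  # ~lambda: the mean
```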
Information on the exponential distribution.
A continuous probability distribution that describes the time between events in a Poisson process (a discrete series of events where the average timing between events is known, but the exact timing of events is random). Often used to model the time or space between events.
PDF = lambda * e^(-lambda * x) where lambda is the rate parameter (e.g., the average number of hurricanes per year)
Mean = 1/lambda
Range = real numbers >= 0
Variance = 1/lambda^2
The exponential distribution is “memoryless” - as you continue to wait, the probability of an event occurring “soon” remains constant.
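A numeric sketch of memorylessness with made-up values: the survival function is P(X > x) = e^(-lambda*x), so P(X > s + t | X > s) works out to exactly P(X > t).

```python
import math

lam = 0.5        # made-up rate parameter
s, t = 2.0, 3.0  # made-up waiting times

def survival(x):
    return math.exp(-lam * x)  # P(X > x) for the exponential

# Memorylessness: having already waited s, the chance of waiting t more
# is the same as the chance of waiting t from the start.
print(survival(s + t) / survival(s), survival(t))  # identical
```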
What is a t-test?
A one sample t-test determines whether the sample mean of a set of data is statistically significantly different from a distribution with known mean (and unknown variance). In other words, it determines how likely the difference between the sample mean and the distribution mean would occur by chance.
Null hypothesis: The sample mean is not different from the distribution mean.
Assumptions:
- Data are independent
- Data are normally distributed
- Homoskedasticity (homogeneity of variance; relevant when comparing two samples)
What is a z-test?
A Z-test is used to determine whether the mean of a data set is statistically different from a normal distribution with known mean and variance.
Null hypothesis: The sample mean is not significantly different from the known mean.
Assumption: The data are normally distributed.
Z = (x bar - mu_0) / (sigma / sqrt(n)) where x bar is the sample mean, mu_0 is the mean of the known distribution, sigma is the standard deviation of the known distribution, and n is the sample size.
To reject the null hypothesis at the 95% confidence level, |Z| must exceed 1.96 (or approximately 2) for a two-tailed test.
For normally-distributed data, about 68% of the data are within one standard deviation of the mean (z = 1) and about 95% are within two standard deviations (z = 2).
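A sketch of the calculation with made-up numbers, using the standard-library NormalDist for the two-tailed p-value:

```python
import math
from statistics import NormalDist

# Made-up example: sample of n = 30 with mean 103; known distribution N(100, 15^2)
x_bar, mu_0, sigma, n = 103.0, 100.0, 15.0, 30

z = (x_bar - mu_0) / (sigma / math.sqrt(n))  # note: sigma / sqrt(n)
p = 2 * (1 - NormalDist().cdf(abs(z)))       # two-tailed p-value
print(z, p)  # reject the null at the 95% level only if |z| > 1.96
```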
Explain a 1-sample t-test.
Compares the mean of a single group against a known mean. (This differs from a Z-test in that the variance of the "known" distribution is unknown and must be estimated using the standard deviation of the sampled data.)
Null hypothesis: The sample mean is equal to the known mean.
t = (x bar - mu_0) / (s/sqrt(n)) where x bar is the sample mean, mu_0 is the mean of the known distribution, s is the sample standard deviation, and n is the sample size.
Degrees of freedom = n-1
To reject the null hypothesis, compare t against a critical value from a t-table (using n - 1 degrees of freedom).
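A sketch with made-up data, computing t by hand and checking it against scipy.stats.ttest_1samp:

```python
import math
import statistics
from scipy import stats

data = [5.1, 4.9, 6.2, 5.8, 5.5, 6.0, 5.3]  # made-up sample
mu_0 = 5.0                                  # known mean to test against

x_bar = statistics.mean(data)
s = statistics.stdev(data)               # sample standard deviation
n = len(data)
t = (x_bar - mu_0) / (s / math.sqrt(n))  # formula above, dof = n - 1

print(t)
print(stats.ttest_1samp(data, mu_0))     # should report the same t statistic
```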