General Statistics Flashcards
What is parallel slopes regression?
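A special case of multiple regression with one numeric and one categorical explanatory variable and no interaction term. Each category gets its own intercept, but all categories share the same slope, so the fitted lines are parallel.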
What is Simpson’s Paradox?
Simpson’s Paradox occurs when the trend shown by a model fit to the whole dataset is very different from the trends shown by models fit to subsets of that dataset. In the most extreme case, you may see a positive slope on the whole dataset and negative slopes on every subset (or the other way around).
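The extreme case described above can be reproduced with a few made-up points. This is a minimal sketch: the data values and the helper name `slope` are invented for illustration, and the slope is the ordinary least-squares slope of a simple regression.

```python
def slope(xs, ys):
    """Least-squares slope of y on x for a simple linear regression."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var

# Made-up data: two subsets, each with a downward trend.
group_a_x, group_a_y = [1, 2, 3], [5, 4, 3]
group_b_x, group_b_y = [6, 7, 8], [10, 9, 8]

slope_a = slope(group_a_x, group_a_y)
slope_b = slope(group_b_x, group_b_y)
# Pool both subsets into one dataset and fit a single line.
slope_all = slope(group_a_x + group_b_x, group_a_y + group_b_y)

print("subset slopes:", slope_a, slope_b)  # both negative
print("pooled slope:", slope_all)          # positive
```

Group B sits higher on both axes than group A, which is what pulls the pooled line upward even though each subset trends downward.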
Interpret this interaction regression model. What does each coefficient mean?
Height = B0 + B1*Bacteria + B2*Sun + B3*Bacteria*Sun
Without the interaction term, B1 can be interpreted as the unique effect of Bacteria on Height. With the interaction term this no longer holds, because the effect of Bacteria on Height now differs for different values of Sun. B1 is therefore the effect of Bacteria on Height only when Sun = 0.
B2 is likewise the effect of Sun on Height only when Bacteria = 0.
B0 is the expected Height when Bacteria = 0 and Sun = 0, and B3 is the amount by which the effect of Bacteria changes for each one-unit increase in Sun.
The overall effect of Bacteria on Height is now B1 + B3 * Sun. So if we have the following coefficients:
Height = 35 + 4.2*Bacteria + 9*Sun + 3.2*Bacteria*Sun
So for Sun = 0, a 1-unit increase in Bacteria results in an increase of 4.2 units in Height. For Sun = 1, the effect of Bacteria is 4.2 + 3.2*1 = 7.4, so a 1-unit increase in Bacteria is expected to increase Height by 7.4 units.
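The arithmetic above can be checked with a short sketch. The coefficients are the ones from the example equation; the function names `predict_height` and `bacteria_effect` are made up for illustration.

```python
# Coefficients from the example: Height = 35 + 4.2*Bacteria + 9*Sun + 3.2*Bacteria*Sun
B0, B1, B2, B3 = 35, 4.2, 9, 3.2

def predict_height(bacteria, sun):
    """Predicted Height from the interaction model."""
    return B0 + B1 * bacteria + B2 * sun + B3 * bacteria * sun

def bacteria_effect(sun):
    """Change in Height per 1-unit increase in Bacteria at a given Sun."""
    return B1 + B3 * sun

for sun in (0, 1):
    # Compare the formula B1 + B3*Sun with an actual 1-unit step in Bacteria.
    step = predict_height(1, sun) - predict_height(0, sun)
    print(f"Sun={sun}: effect={bacteria_effect(sun):.1f}, step={step:.1f}")
```

Both columns agree at every value of Sun, which is the sense in which B1 alone stops being "the" effect of Bacteria once the interaction term is present.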
What are the basic assumptions that linear regression makes about the data?
- Linearity of the data - relationship between x & y is linear
- Normality of residuals - the residual errors are assumed to be normally distributed
- Homogeneity of residual variance - the residuals are assumed to have a constant variance (homoscedasticity)
- Independence of residuals - the error terms are assumed to be independent of one another
What is the difference between parametric and nonparametric statistics?
Parametric statistics are based on assumptions about the distribution of the population from which the sample was taken. Example: Student’s t-test
Nonparametric statistics are not based on assumptions about the specific distribution of the population. They are used when that distribution is unknown or when the parametric assumptions are clearly violated. Example: Mann-Whitney-Wilcoxon test
Which statistical test is used to assess the difference in means of two groups? (Parametric and nonparametric)
Parametric - Student’s t-test
Nonparametric - Mann-Whitney-Wilcoxon rank-sum test (strictly, it compares the ranks of the two groups rather than their means)
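To make the two tests concrete, here is a minimal sketch that computes only the test statistics by hand. The data values and function names are made up, the t statistic uses the pooled-variance (equal variances) form, and turning either statistic into a p-value needs the matching reference distribution, which a statistics library would normally supply.

```python
from math import sqrt
from statistics import mean, variance

# Made-up measurements for two independent groups.
group_a = [5.1, 4.9, 6.2, 5.8, 6.0]
group_b = [4.2, 4.8, 5.0, 4.5, 4.1]

def student_t(a, b):
    """Two-sample Student's t statistic using a pooled variance."""
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var * (1 / n_a + 1 / n_b))

def mann_whitney_u(a, b):
    """U statistic for a: number of (a, b) pairs where a is larger; ties count one half."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)

print("t statistic:", student_t(group_a, group_b))
print("U statistic:", mann_whitney_u(group_a, group_b))
```

The t statistic works with the means and variances directly, while U only uses which value in each pair is larger, which is why the rank-based test does not depend on the shape of the distribution.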
What statistical test is used to compare the means of more than two groups? (parametric and nonparametric)
Parametric - ANOVA: extension of t-test to compare more than two groups
Nonparametric - Kruskal-Wallis rank sum test (extension of the Wilcoxon rank-sum test to more than two groups)
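As a rough illustration of the parametric case only, the one-way ANOVA F statistic is the between-group mean square divided by the within-group mean square. The sketch below uses made-up data and a made-up function name, and it stops at the statistic; the Kruskal-Wallis test is not shown.

```python
from statistics import mean

def anova_f(groups):
    """One-way ANOVA F statistic: between-group over within-group mean square."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of each value around its own group mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Made-up data: three groups of three observations each.
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_stat = anova_f(groups)
print("F statistic:", f_stat)
```

A large F means the group means differ by more than the spread inside the groups would suggest.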
What statistical test is used to compare the variances of two groups? (parametric and nonparametric)
Parametric - F-test for 2 groups, Bartlett’s or Levene’s for multiple groups/samples
Nonparametric -
Interpret the coefficient for X2 as if it were a categorical variable.
Yi = B0 + B1*X1i + B2*X2i + ei.
Y = 42 + 2.3*X1 + 11*X2
B2 is the average difference in Y between the category for which X2 = 0 (the reference group) and the category for which X2 = 1 (the comparison group).
So compared to when X2 = 0, we would expect Y to be 11 units greater when X2 = 1, controlling for X1.
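A quick sketch of this interpretation, using the coefficients from the example. The category labels and the name `predict_y` are hypothetical; the point is that the categorical variable is coded 0 for the reference group and 1 for the comparison group.

```python
# Hypothetical labels for the two categories of X2, coded as a 0/1 dummy variable.
DUMMY = {"reference": 0, "comparison": 1}

def predict_y(x1, category):
    """Predicted Y from Y = 42 + 2.3*X1 + 11*X2, with X2 dummy-coded."""
    return 42 + 2.3 * x1 + 11 * DUMMY[category]

# Hold X1 fixed and switch only the category.
for x1 in (0.0, 5.0, 10.0):
    diff = predict_y(x1, "comparison") - predict_y(x1, "reference")
    print(f"X1={x1}: difference between groups = {diff:.1f}")
```

The difference is the same at every value of X1, which is what "controlling for X1" means in a model without an interaction term.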
What is a confusion matrix?
A confusion matrix is a table of actual vs. predicted class labels, where each cell counts how many observations fall into that combination. It is used to assess the performance of classification models such as logistic regression, and the term is common in machine learning.
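For a two-class problem the matrix has four cells. The following is a minimal sketch with made-up labels (1 = positive, 0 = negative); the variable names are illustrative only.

```python
from collections import Counter

# Made-up actual and predicted class labels for eight observations.
actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

# Count each (actual, predicted) combination.
counts = Counter(zip(actual, predicted))
tp = counts[(1, 1)]  # true positives
tn = counts[(0, 0)]  # true negatives
fp = counts[(0, 1)]  # false positives
fn = counts[(1, 0)]  # false negatives

print("           pred 0  pred 1")
print(f"actual 0   {tn:6d}  {fp:6d}")
print(f"actual 1   {fn:6d}  {tp:6d}")

accuracy = (tp + tn) / len(actual)
print("accuracy:", accuracy)
```

Summary measures such as accuracy are read straight off the cells: correct predictions sit on the diagonal.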
How does linear regression relate to the generalized linear model?
Linear regression is a special case of the GLM: the one where the response is assumed to be normally distributed and the link function is the identity function, so the mean of Y is modeled directly without any transformation.
What is a “link function” in regression?
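The link function connects the expected value of Y to the right-hand side of a regression equation (the linear predictor). It makes the two compatible: for example, the logit link maps a probability between 0 and 1 onto the whole real line, which is the range a linear predictor can take.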
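As one concrete example, logistic regression uses the logit link. The sketch below uses hypothetical coefficients `b0` and `b1` and shows that the linear predictor can take any real value while the inverse link always returns a valid probability.

```python
import math

def logit(p):
    """Link: map a probability in (0, 1) to the real line."""
    return math.log(p / (1 - p))

def inverse_logit(eta):
    """Inverse link: map any real number back to a probability in (0, 1)."""
    return 1 / (1 + math.exp(-eta))

# Hypothetical coefficients for a one-predictor logistic regression.
b0, b1 = -1.0, 0.5

for x in (-10, 0, 10):
    eta = b0 + b1 * x  # linear predictor, unbounded
    print(f"x={x}: linear predictor={eta}, probability={inverse_logit(eta):.4f}")
```

With the identity link of linear regression, the same step would simply return the linear predictor unchanged.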
When can you use least-squares regression and/or maximum likelihood estimation to solve a GLM equation?
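Least-squares and MLE give the same coefficient estimates for a linear regression problem with normally distributed errors.
Other types of regression under the GLM (logistic, Poisson, etc.) are fit with MLE, because ordinary least squares does not apply to them directly.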
What is a logarithm?
In its simplest form, a logarithm answers the question “how many of one number do we multiply to get another number?”
For example, Log2(8) is asking, “how many 2’s do we multiply to get 8?” Therefore, Log2(8) = 3
Another way of looking at this is to solve 2^X = 8 for X, which gives X = 3.
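The same example in code, as a small sketch. `math.log2` is the standard-library base-2 logarithm; `count_factors` is a made-up helper that literally counts the multiplications and only works for whole-number answers.

```python
import math

def count_factors(base, target):
    """Count how many times base must be multiplied to reach target exactly."""
    count, product = 0, 1
    while product < target:
        product *= base
        count += 1
    # Return None when target is not an exact power of base.
    return count if product == target else None

print(count_factors(2, 8))  # how many 2's multiply to 8
print(math.log2(8))         # the same question asked as a logarithm
print(2 ** 3)               # and the inverse: 2^3
```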
What is a parameter in statistics?
In statistics, a parameter is any numerical quantity that summarizes or describes an aspect of a statistical population, such as its mean or standard deviation. It is usually unknown and is estimated from a sample.
A parameter is to a population as a statistic is to a sample.
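A small sketch of the distinction, assuming a toy population that is fully known (which is rarely true in practice). The population values, sample size, and seed are arbitrary choices for illustration.

```python
import random
from statistics import mean

# Toy population: the whole numbers 1 to 100.
population = list(range(1, 101))
population_mean = mean(population)  # a parameter: describes the population

random.seed(0)  # fixed seed so the sample is repeatable
sample = random.sample(population, 10)
sample_mean = mean(sample)  # a statistic: describes the sample

print("population mean (parameter):", population_mean)
print("sample mean (statistic):", sample_mean)
```

Drawing a different sample would change the statistic, while the parameter stays fixed.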