General Statistics Flashcards

1
Q

What is parallel slopes regression?

A

A special case of multiple regression with 1 numeric and 1 categorical explanatory variable, fit without an interaction term: each category gets its own intercept, but all categories share the same slope (hence “parallel slopes”)
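A minimal sketch of fitting such a model by hand with NumPy least squares (the data and true coefficients here are made up for illustration):

```python
import numpy as np

# Hypothetical data: one numeric predictor (x), one two-level categorical (g).
# True model: shared slope 2.0, baseline intercept 1.0, group shift +3.0.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
g = rng.integers(0, 2, 200)
y = 1.0 + 3.0 * g + 2.0 * x + rng.normal(0, 0.1, 200)

# Parallel slopes design matrix: intercept column, shared slope column,
# and a dummy column that only shifts the intercept per category.
X = np.column_stack([np.ones_like(x), x, g])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, slope, group_shift = coef
```

The single `slope` coefficient applies to both groups, while `group_shift` moves the line up or down per category.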

2
Q

What is Simpson’s Paradox?

A

Simpson’s Paradox occurs when the trend of a model fit on the whole dataset is very different from the trends shown by models fit on subsets of that dataset. In the most extreme case, you may see a positive slope on the whole dataset and negative slopes on every subset of that dataset (or the other way around).
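A tiny made-up dataset that produces the extreme version of the paradox:

```python
import numpy as np

# Two subgroups, each with a clearly negative trend, but the second subgroup
# sits higher on both axes, so the pooled trend comes out positive.
x_a = np.linspace(0, 1, 50)
y_a = 5 - x_a                      # subgroup A: slope -1
x_b = np.linspace(2, 3, 50)
y_b = 10 - x_b                     # subgroup B: slope -1

x_all = np.concatenate([x_a, x_b])
y_all = np.concatenate([y_a, y_b])

slope_a = np.polyfit(x_a, y_a, 1)[0]       # negative
slope_b = np.polyfit(x_b, y_b, 1)[0]       # negative
slope_pooled = np.polyfit(x_all, y_all, 1)[0]  # positive
```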

3
Q

Interpret this interaction regression model. What does each coefficient mean?

Height = B0 + B1*Bacteria + B2*Sun + B3*Bacteria*Sun

A

Without the interaction term, we can interpret B1 as the unique effect of bacteria on height. With the interaction term we can no longer do so as the effect of bacteria on height is now different for different values of Sun. Thus B1 is now interpreted as the unique effect of bacteria on Height ONLY WHEN Sun = 0.

B2 is the unique effect of the Sun when bacteria = 0

The overall effect of Bacteria on Height is now B1 + B3*Sun. Suppose we have the following coefficients:

Height = 35 + 4.2*Bacteria + 9*Sun + 3.2*Bacteria*Sun

For Sun = 0, a 1-unit increase in Bacteria results in an increase of 4.2 units in Height. For Sun = 1, the effect of Bacteria is 4.2 + 3.2 = 7.4. Thus, for a 1-unit increase in Bacteria when Sun = 1, we would expect an increase of 7.4 units in Height.
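Plugging the example coefficients into a quick Python check:

```python
# Example coefficients from the card:
# Height = 35 + 4.2*Bacteria + 9*Sun + 3.2*Bacteria*Sun
def predicted_height(bacteria, sun):
    return 35 + 4.2 * bacteria + 9 * sun + 3.2 * bacteria * sun

# The effect of one extra unit of Bacteria depends on Sun:
effect_sun0 = predicted_height(1, 0) - predicted_height(0, 0)  # 4.2
effect_sun1 = predicted_height(1, 1) - predicted_height(0, 1)  # 4.2 + 3.2 = 7.4
```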

4
Q

What are the basic assumptions that linear regression makes about the data?

A
  1. Linearity of the data - the relationship between x & y is linear
  2. Normality of residuals - the residual errors are assumed to be normally distributed
  3. Homogeneity of residual variance - the residuals are assumed to have a constant variance (homoscedasticity)
  4. Independence of residual error terms
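A quick sketch of checking some of these assumptions on simulated data (the data are made up; residual diagnostics in practice would also include plots):

```python
import numpy as np
from scipy import stats

# Simulate data that satisfies the assumptions, fit a line, inspect residuals.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 300)
y = 2.0 + 1.5 * x + rng.normal(0, 1.0, 300)

slope, intercept = np.polyfit(x, y, 1)
residuals = y - (intercept + slope * x)

# Least-squares residuals average to ~0 by construction.
resid_mean = residuals.mean()

# Normality of residuals: Shapiro-Wilk (a high p-value means no evidence
# against normality; a low one suggests assumption 2 is violated).
shapiro_stat, shapiro_p = stats.shapiro(residuals)
```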
5
Q

What is the difference between parametric and nonparametric statistics?

A

Parametric statistics are based on assumptions about the distribution of the population from which the sample was taken. Example: Student’s t-test

Nonparametric statistics are not based on assumptions about the distribution of the population. In many cases the distribution of the population is unknown. These are cases when nonparametric statistics are used. Example: Mann-Whitney-Wilcoxon test

6
Q

Which statistical test is used to assess the difference in means of two groups? (Parametric and nonparametric)

A

Parametric - Student’s t-test
Nonparametric - Mann Whitney Wilcoxon rank test
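Both tests are available in scipy.stats; a quick sketch on simulated groups whose true means differ:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
group1 = rng.normal(10.0, 2.0, 100)
group2 = rng.normal(12.0, 2.0, 100)

t_stat, t_p = stats.ttest_ind(group1, group2)     # parametric
u_stat, u_p = stats.mannwhitneyu(group1, group2)  # nonparametric (rank-based)
```

With a true mean difference this large, both tests should report a small p-value.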

7
Q

What statistical test is used to compare the means of more than two groups? (parametric and nonparametric)

A

Parametric - ANOVA: extension of t-test to compare more than two groups

Nonparametric - Kruskal-Wallis rank sum test (extended version of Wilcoxon rank test)
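Both versions in scipy.stats, on three simulated groups:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
a = rng.normal(5.0, 1.0, 60)
b = rng.normal(5.5, 1.0, 60)
c = rng.normal(7.0, 1.0, 60)

f_stat, anova_p = stats.f_oneway(a, b, c)   # parametric one-way ANOVA
h_stat, kw_p = stats.kruskal(a, b, c)       # nonparametric Kruskal-Wallis
```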

8
Q

What statistical test is used to compare the variances of two groups? (parametric and nonparametric)

A

Parametric - F-test for 2 groups; Bartlett’s or Levene’s for multiple groups/samples (Levene’s is the more robust of the two when normality is doubtful)

Nonparametric - Fligner-Killeen test (rank-based)
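scipy.stats ships all of these, including the rank-based Fligner-Killeen test, a common nonparametric choice. A sketch on two simulated groups with unequal spread:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
narrow = rng.normal(0.0, 1.0, 200)
wide = rng.normal(0.0, 3.0, 200)

bart_stat, bart_p = stats.bartlett(narrow, wide)  # parametric, assumes normality
lev_stat, lev_p = stats.levene(narrow, wide)      # robust to non-normality
flig_stat, flig_p = stats.fligner(narrow, wide)   # nonparametric (rank-based)
```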

9
Q

Interpret the coefficient for X2 as if it were a categorical variable.

Yi = B0 + B1*X1i + B2*X2i + ei.

Y = 42 + 2.3*X1 + 11*X2

A

B2 is the average difference in Y between the category for which X2 = 0 (the reference group) and the category for which X2 = 1 (the comparison group).

So compared to when X2 = 0, we would expect Y to be 11 units greater when X2 = 1, controlling for X1.
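Checking that interpretation numerically with the card’s coefficients:

```python
# Example coefficients from the card: Y = 42 + 2.3*X1 + 11*X2
def predict(x1, x2):
    return 42 + 2.3 * x1 + 11 * x2

# Holding X1 fixed at any value, flipping X2 from 0 to 1 changes Y by 11.
diff = predict(3.0, 1) - predict(3.0, 0)
```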

10
Q

What is a confusion matrix?

A

A confusion matrix is a tabular representation of actual vs. predicted class labels. It is used in logistic regression (and classification generally) to visualize and assess the performance of a model. The term is also used a lot in machine learning.
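Built by hand for a binary classifier (labels made up for illustration):

```python
import numpy as np

# Hypothetical actual vs. predicted labels for a binary classifier.
actual    = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
predicted = np.array([1, 0, 0, 1, 0, 1, 1, 0, 1, 0])

# 2x2 confusion matrix: rows = actual class, columns = predicted class.
cm = np.zeros((2, 2), dtype=int)
for a, p in zip(actual, predicted):
    cm[a, p] += 1

tn, fp = cm[0]           # actual 0: correctly vs. wrongly predicted
fn, tp = cm[1]           # actual 1: missed vs. caught
accuracy = (tp + tn) / cm.sum()
```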

11
Q

How does linear regression relate to the generalized linear model?

A

Linear regression is a special case of the GLM: the case where the link function is just the identity function, since Y does not need to be transformed.

12
Q

What is a “link function” in regression?

A

The link function makes the distribution of Y compatible with the right-hand side of a regression equation.

13
Q

When can you use least-squares regression and/or maximum likelihood estimation to solve a GLM equation?

A

Least-squares and MLE will give the same result for a linear regression problem.

You can only use MLE for the other types of regression under the GLM (logistic, Poisson, etc.)

14
Q

What is a logarithm?

A

In its simplest form, a logarithm answers the question “how many of one number do we multiply to get another number?”

For example, Log2(8) is asking, “how many 2’s do we multiply to get 8?” Therefore, Log2(8) = 3

Another way of looking at this is solving 2^X = 8 for X, which gives X = 3.
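In Python:

```python
import math

# log2(8) asks: 2 raised to what power gives 8?
x = math.log2(8)   # 3.0
check = 2 ** x     # back to 8.0
```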

15
Q

What is a parameter in statistics?

A

In statistics, a parameter is any measured quantity of a statistical population that summarizes or describes an aspect of the population, such as a mean or standard deviation.

A parameter is to a population as a statistic is to a sample.

16
Q

What is the difference between the likelihood function and the probability density function?

A
  1. A probability density function expresses the probability of observing our data given the underlying distribution parameters. It assumes that the parameters are known.
  2. The likelihood function expresses the likelihood of parameter values occurring given the observed data. It assumes that the parameters are unknown.

probability and likelihood do not mean the same thing in statistics

Probability attaches to results; likelihood attaches to hypotheses

https://www.psychologicalscience.org/observer/bayes-for-beginners-probability-and-likelihood#:~:text=The%20distinction%20between%20probability%20and,results%3B%20likelihood%20attaches%20to%20hypotheses.&text=There%20are%20only%2011%20possible,one%20of%20the%20possible%20results.
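A sketch of the two views with scipy.stats.norm (the data points are made up):

```python
import numpy as np
from scipy import stats

# PDF view: parameters fixed (mean 0, sd 1), the data point varies.
density_at_0 = stats.norm.pdf(0.0, loc=0.0, scale=1.0)   # ~0.3989

# Likelihood view: the observed data are fixed, the parameter (mean) varies.
data = np.array([1.8, 2.1, 2.4])

def log_likelihood(mu):
    return np.sum(stats.norm.logpdf(data, loc=mu, scale=1.0))

# For a normal model with known sd, the likelihood peaks at the sample mean.
mus = np.linspace(0.0, 4.0, 401)
best_mu = mus[np.argmax([log_likelihood(m) for m in mus])]
```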

17
Q

What is Maximum Likelihood Estimation?

A

A statistical method for estimating the parameters of a model. In MLE, the parameters are chosen to maximize the likelihood that the assumed model produced the observed data.
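A numerical sketch: estimate the mean of a normal model (sd held fixed) by minimizing the negative log-likelihood on simulated data. For this model the MLE should equal the sample mean:

```python
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(3)
data = rng.normal(5.0, 2.0, 500)

# Negative log-likelihood of a normal model with unknown mean, sd fixed at 2.
def neg_log_likelihood(mu):
    return -np.sum(stats.norm.logpdf(data, loc=mu, scale=2.0))

result = optimize.minimize_scalar(neg_log_likelihood, bounds=(0, 10), method="bounded")
mu_hat = result.x
```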

18
Q

What is the central limit theorem?

A

The sampling distribution of a statistic (such as the mean) becomes closer to the normal distribution as the sample size increases.

  • this only applies when samples are random and independent from one another
  • very useful for large populations: drawing more and larger samples gives you better estimates of the true mean and standard deviation
  • the central limit theorem applies to all distributions with finite variance, including binomial and Poisson, discrete and continuous
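A quick simulation: sample means drawn from a skewed (exponential) population cluster near the population mean, with spread roughly sd/√n:

```python
import numpy as np

# Draw many samples from a clearly non-normal (exponential) population
# and look at the distribution of the sample means.
rng = np.random.default_rng(0)
n, n_samples = 100, 5000
sample_means = rng.exponential(scale=1.0, size=(n_samples, n)).mean(axis=1)

# CLT: the means cluster near the population mean (1.0) with
# spread roughly population sd / sqrt(n) = 1 / 10 = 0.1.
mean_of_means = sample_means.mean()
sd_of_means = sample_means.std()
```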
19
Q

What is the difference between supervised and unsupervised learning in statistics?

A

Supervised learning: for each observation of a predictor you have a corresponding observation of the response variable

Unsupervised learning: you DO NOT have a corresponding observation of the response variable for each measurement of a predictor variable. In this case you try to understand the relationships between variables or between observations. Ex. cluster analysis

Semi-supervised: when you have some supervised data and some unsupervised data

20
Q

What is a regression problem vs a classification problem?

A

Regression problem - problems with a quantitative response

Classification problem - problems with a qualitative response

However, the distinction isn’t necessarily hard and fast. For example, logistic regression is technically regression but involves a qualitative response variable and thus sort of belongs in both categories.

21
Q

What is the difference between frequentist statistics and Bayesian statistics?

A

The frequentist believes that probability represents long term frequencies of repeatable events such as flipping a coin. Frequentists do not attach probabilities to hypotheses or unknown values.

Bayesian approach uses probabilities to represent the uncertainty in any event or hypothesis.

Bayesian approaches assign probability to events on the basis of confidence/belief. This confidence is updated in light of new evidence.

In the frequentist sense, probability can only be assigned to repeated events.

Frequentists focus on point estimates while bayesians focus on probability distributions.

Parameters are assigned a probability for bayesians while parameters are typically fixed for frequentists.

22
Q

What is the relationship and difference between covariance and correlation?

A

Covariance refers to the systematic relationship between two random variables, in which a change in one variable is reflected by a change in the other. Values range from -infinity to infinity; the greater the magnitude, the stronger the relationship.

Cov(X,Y) = Σ (Xi – x̄)(Yi – ȳ) / (n – 1)

Correlation is a measure that determines the degree to which two or more random variables move in sequence. Covariance is the numerator of the correlation formula.

Correlation Coefficient = ∑(x(i) – mean(x))*(y(i) – mean(y)) / √(∑(x(i) – mean(x))² * ∑(y(i) – mean(y))²)

Correlation = covariance / (standard deviation of x * standard deviation of y)

So covariance measures how changes in two variables move together, and correlation measures the degree to which that co-movement holds after standardizing by each variable’s spread. Holding covariance fixed, the greater the variance of either x or y, the lower the correlation; the greater the covariance, the greater the correlation.
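Verifying the relationship numerically with NumPy (simulated data with a known positive relationship):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(0.0, 1.0, 1000)
y = 0.8 * x + rng.normal(0.0, 0.5, 1000)

cov_xy = np.cov(x, y)[0, 1]    # sample covariance (n-1 in the denominator)
corr_manual = cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))
corr_builtin = np.corrcoef(x, y)[0, 1]
```

The hand-computed correlation matches NumPy’s built-in, confirming that covariance is the numerator of the correlation formula.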

23
Q

What is an estimator?

A

An estimator is a rule for calculating an estimate of a given quantity based on observed data. For example, the sample mean is an estimator for the population mean.

The formula used to calculate a value from a sample is called the estimator; the value is called the estimate.

So really it’s a formula

24
Q

What is SMOTE?

A

Synthetic Minority Oversampling Technique

Suppose you’re working on a health-insurance fraud detection problem. In such problems we generally observe that of every 100 insurance claims, 99 are non-fraudulent and 1 is fraudulent. A binary classifier can therefore trivially predict every outcome as 0 (non-fraudulent) and achieve a great accuracy of 99%. Clearly, in such cases where the class distribution is skewed, the accuracy metric is biased and not preferable.

This is where SMOTE comes in. SMOTE is an oversampling technique in which synthetic samples are generated for the minority class. It helps overcome the overfitting problem posed by random oversampling: rather than duplicating existing minority instances, it works in feature space, generating new instances by interpolating between minority-class instances that lie close together.

https://www.analyticsvidhya.com/blog/2020/10/overcoming-class-imbalance-using-smote-techniques/
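The interpolation idea can be sketched in a few lines. This is just the core step, not the full SMOTE algorithm (which uses k nearest neighbors and a target balancing ratio); the minority-class data here are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
minority = rng.normal(loc=5.0, scale=0.5, size=(20, 2))  # hypothetical minority class

point = minority[0]
# Nearest neighbor among the other minority points (Euclidean distance).
dists = np.linalg.norm(minority[1:] - point, axis=1)
neighbor = minority[1:][np.argmin(dists)]

# The synthetic sample lies on the segment between the point and its neighbor.
gap = rng.uniform(0, 1)
synthetic = point + gap * (neighbor - point)
```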

25
Q

What is “deviance”?

A

Deviance is a key concept in logistic regression. Intuitively, it measures how far the fitted logistic model deviates from a perfect model. It is defined as twice the difference in log-likelihoods between the saturated model (a perfectly fitting model) and the fitted model:

Deviance = 2 * (loglikelihood of the saturated model - loglikelihood of the fitted model)

For binary outcomes the saturated model’s loglikelihood is 0, so this reduces to -2 * the loglikelihood of the fitted model.

Lower deviance means a better fit; 0 would be a perfect fit.

Similar to RSS for linear models
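A quick illustration with made-up binary outcomes and fitted probabilities: the deviance is -2 times the Bernoulli log-likelihood, and a better-fitting model gets a lower number:

```python
import numpy as np

y = np.array([1, 0, 1, 1, 0])
p_good = np.array([0.9, 0.1, 0.8, 0.95, 0.2])  # probabilities close to outcomes
p_poor = np.array([0.5, 0.5, 0.5, 0.5, 0.5])   # uninformative model

def deviance(y, p):
    # -2 * Bernoulli log-likelihood of the fitted probabilities
    return -2 * np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

dev_good = deviance(y, p_good)
dev_poor = deviance(y, p_poor)
```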

26
Q

What is the difference between using statistical methods for prediction vs inference?

A

When we are trying to predict something, we don’t care so much about the underlying model; we just care how accurate its predictions are.

When we care about the relationships between predictors and outcomes, then we are doing statistics for inference. In this case we care very much about the underlying model b/c we often want to make recommendations about how to change those variables in order to affect the outcome.

Ex. If all I care about is predicting student success, I don’t care about the underlying model except for how accurate it is. But if I care about trying to increase student success then I care a lot about the underlying model b/c it will help me infer what variables are importantly related to student success.

27
Q

Explain the difference between standard deviation and standard error. What is the formula for each?

A

Standard deviation: tells you how far each data point is from its average value; tells you how the overall data is distributed around the mean

Formula: square root of the variance

Standard error: the standard deviation of a sampling distribution of a statistic; tells you how the mean itself is distributed across the sampling distribution

Formula: standard deviation / square root of the sample size (for the standard error of the mean: SE = s / √n)
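A quick numerical check on a simulated sample:

```python
import numpy as np

rng = np.random.default_rng(5)
sample = rng.normal(loc=50.0, scale=10.0, size=400)

sd = np.std(sample, ddof=1)       # spread of the data around its mean
se = sd / np.sqrt(len(sample))    # spread of the sample mean itself
```

With n = 400, the standard error is the standard deviation divided by 20, so it is much smaller than the standard deviation.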

28
Q

What is the difference between a confidence interval and a prediction interval?

A

Confidence intervals tell you how well you estimated a parameter of interest

Prediction intervals tell you where you can expect to see the next data point sampled.

Prediction intervals are always wider than confidence intervals b/c they account for 2 sources of uncertainty; the uncertainty in estimating a population parameter and the random variation of individual values. Confidence intervals only account for uncertainty of estimating population parameter.

A confidence interval gives a range for E[y|x]. A prediction interval gives a range for y itself. Given a certain value of x, what range can we be confident that a single point of y will fall in? To illustrate the difference, imagine that we could get perfect estimates of our β coefficients. Then, our estimate of E[y|x] would be perfect. But we still wouldn’t be sure what y itself was because there is a true error term that we need to consider. Our confidence “interval” would just be a point because we estimate E[y|x] exactly right, but our prediction interval would be wider because we take the true error term into account.

https://www.graphpad.com/support/faq/the-distinction-between-confidence-intervals-prediction-intervals-and-tolerance-intervals/
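The same two-sources-of-uncertainty logic shows up in the simplest setting, estimating a normal mean (this sketch uses the textbook interval formulas on simulated data, not a regression):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
sample = rng.normal(20.0, 3.0, 50)
n = len(sample)
m, s = sample.mean(), sample.std(ddof=1)
t = stats.t.ppf(0.975, df=n - 1)

ci_half = t * s / np.sqrt(n)           # uncertainty in the mean only
pi_half = t * s * np.sqrt(1 + 1 / n)   # mean uncertainty + individual variation
```

The prediction half-width is always larger, because it keeps the individual-variation term that the confidence interval drops.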