Part I: Regressions Flashcards
What is correlation?
Measures the strength and direction of a statistical relationship between two variables.
What is covariance?
If high values of one variable go together with high values of the other, and the same holds for low values, the covariance is positive. If the opposite holds, it is negative. Covariance is scale dependent, meaning it is harder to interpret than the correlation.
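A minimal sketch of the scale dependence with NumPy (made-up data; variable names are illustrative):

```python
import numpy as np

# Made-up heights and weights
height_m = np.array([1.60, 1.70, 1.75, 1.80, 1.90])
weight_kg = np.array([55.0, 65.0, 70.0, 78.0, 90.0])

# Covariance depends on the units of the variables...
cov_m = np.cov(height_m, weight_kg)[0, 1]
cov_cm = np.cov(height_m * 100, weight_kg)[0, 1]  # same data, height now in cm

# ...while correlation is unit-free and always in [-1, 1]
corr = np.corrcoef(height_m, weight_kg)[0, 1]

print(cov_m, cov_cm, corr)  # covariance scales by 100; correlation is unchanged
```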
What is ordinary least squares (OLS)?
Fitting a line by minimizing the sum of squared vertical distances (residuals) between the data points and the line.
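A small sketch of OLS via the normal equations in NumPy (made-up data):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2.0 + 3.0 * x + rng.normal(0, 1, size=50)  # true intercept 2, slope 3

# Design matrix with an intercept column
X = np.column_stack([np.ones_like(x), x])

# Normal equations: beta = (X'X)^{-1} X'y minimizes sum((y - X @ beta)**2)
beta = np.linalg.solve(X.T @ X, X.T @ y)
print(beta)  # approximately [2, 3]
```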
What is linear regression?
A regression model where we assume the relationship between the response and predictor is linear. The response variable is continuous.
What is logistic regression?
A regression model where the response variable is binary and we assume the relationship between the response and predictor is logistic (S curve).
The response is therefore between 0 and 1, and we can interpret it as a predicted probability.
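A minimal sketch with scikit-learn (assuming it is installed; data is made up):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, size=(200, 1))

# Binary response generated from a logistic (S-curve) relationship
p = 1 / (1 + np.exp(-(0.5 + 2.0 * x[:, 0])))
y = rng.binomial(1, p)

model = LogisticRegression().fit(x, y)
print(model.predict_proba([[1.0]])[:, 1])  # predicted probability that y = 1
```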
When is logistic regression used over linear?
If the response variable is binary (i.e., we want to predict a class rather than a continuous value).
What are AIC and BIC and how are they used?
They are information criteria (the Akaike and Bayesian Information Criterion) used to compare different models: both reward goodness of fit and penalize model complexity. Lower scores = better models.
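For reference, the standard definitions, with $k$ estimated parameters, $n$ observations, and maximized likelihood $\hat{L}$:

$$\mathrm{AIC} = 2k - 2\ln\hat{L}, \qquad \mathrm{BIC} = k\ln(n) - 2\ln\hat{L}$$

Both reward fit and penalize complexity; BIC penalizes extra parameters more heavily once $n \geq 8$ (since then $\ln n > 2$).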
What is a regression?
A way to understand the relationship between a dependent variable y (also known as the response or outcome variable) and one or more independent variables x (also known as the predictors or explanatory variables). The dependent variable must be continuous (it can take any value within a range). The independent variables can be continuous, discrete (a countable number of specific values), or categorical. Regressions usually use the least squares method, where we find the line that fits the data best (with minimal squared distance between the points and the line, i.e. least squares).
What are the assumptions for linear regressions?
- The errors should follow the normal distribution
- The errors should be independent (meaning that we cannot have correlation between the residuals/errors)
- Homoscedasticity (meaning that we should have equal error variance across the predictions; if not, the confidence intervals can be negatively affected; see the diagnostic sketch below)
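A quick diagnostic sketch for the normality and homoscedasticity assumptions, using statsmodels and SciPy on made-up data (the tests shown are standard, but this is just one way to check):

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=100)
y = 1.0 + 2.0 * x + rng.normal(0, 1, size=100)

fit = sm.OLS(y, sm.add_constant(x)).fit()
resid = fit.resid

# Normality of the errors: Shapiro-Wilk (large p-value -> no evidence against normality)
print(stats.shapiro(resid))

# Homoscedasticity: Breusch-Pagan (large p-value -> no evidence of unequal variance)
print(het_breuschpagan(resid, fit.model.exog))
```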
How do we measure correlation in linear regressions?
The Pearson correlation coefficient ($r$).
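Pearson's $r$ is the covariance rescaled by the standard deviations, which makes it unit-free:

$$r = \frac{\mathrm{cov}(X, Y)}{\sigma_X \, \sigma_Y} \in [-1, 1]$$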
How can we estimate linear regression variables?
Maximum likelihood estimation (MLE or REML) or ordinary least squares (OLS).
IMPORTANT: both give the same results, but MLE is faster and requires less computational power.
What is Maximum likelihood estimation?
A method that finds the parameter values that make the observed data most likely. It is statistical inference that uses a probabilistic data-generating model to estimate a model's parameters. The idea is to build a curve showing how likely we are to see observations at each location. Instead of a model based on the means of the data, we therefore have a model based on the likelihood of seeing the observations.
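A minimal sketch of MLE for a linear regression with SciPy, assuming normally distributed errors (made-up data); the fitted intercept and slope should match OLS:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=100)
y = 2.0 + 3.0 * x + rng.normal(0, 1.5, size=100)

def neg_log_likelihood(params):
    intercept, slope, log_sigma = params  # log-sigma keeps the scale positive
    mu = intercept + slope * x
    # Total log-density of the observations under N(mu, sigma^2), negated
    return -norm.logpdf(y, loc=mu, scale=np.exp(log_sigma)).sum()

result = minimize(neg_log_likelihood, x0=[0.0, 0.0, 0.0])
print(result.x)  # approximately [2, 3, log(1.5)]
```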
What is the main drawback from using MLE?
The problem is that it does not take into account the decrease in degrees of freedom that comes with estimating the mean.
Therefore the variance estimate will be biased (downward).
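Concretely, for a linear model with $n$ observations, $p$ estimated coefficients, and residuals $e_i$:

$$\hat{\sigma}^2_{\mathrm{MLE}} = \frac{1}{n}\sum_{i=1}^{n} e_i^2 \quad \text{(biased downward)}, \qquad \hat{\sigma}^2_{\mathrm{unbiased}} = \frac{1}{n-p}\sum_{i=1}^{n} e_i^2$$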
What is the difference between MLE and REML? Why can we not compare REML results?
REML: Good for multilevel models.
This corrects the bias in MLE by accounting for the degrees of freedom lost in estimating the fixed effects.
REML is specifically designed to provide more accurate estimates of variance components (random effects) by adjusting for the degrees of freedom used by the fixed effects.
We can, however, not compare REML results (e.g., likelihoods or AIC values) across models with different fixed effects, because REML works with a transformation of the data that removes the fixed effects, and this transformation differs between models.
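A sketch of the two fits with statsmodels MixedLM (made-up grouped data; column and variable names are illustrative):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
n_groups, n_per = 10, 20
group = np.repeat(np.arange(n_groups), n_per)
group_effect = rng.normal(0, 1, n_groups)[group]  # random intercept per group
x = rng.uniform(0, 10, n_groups * n_per)
y = 1.0 + 2.0 * x + group_effect + rng.normal(0, 1, n_groups * n_per)
df = pd.DataFrame({"y": y, "x": x, "group": group})

# REML: better variance-component estimates, but its likelihoods are not
# comparable across models with different fixed effects
m_reml = smf.mixedlm("y ~ x", df, groups=df["group"]).fit(reml=True)

# ML (reml=False): use this when comparing fixed-effects structures (e.g. via AIC)
m_ml = smf.mixedlm("y ~ x", df, groups=df["group"]).fit(reml=False)
print(m_reml.summary())
```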
What are SSTO, SSR, and SSE? How can they be used to calculate $R^2$?
Variation measures.
SSR (Regression Sum of Squares): Represents the variation that is explained by the regression line (the fitted values $\hat{Y}_i$). It is the sum of the squares of the differences between each predicted $\hat{Y}_i$ and the overall mean $\bar{Y}$.
SSE (Error Sum of Squares): Represents the unexplained variation or the variation that is due to random error. It is the sum of the squares of the differences between each observed $Y_i$ and the predicted $\hat{Y}_i$.
SSTO (Total Sum of Squares): SSR + SSE. Represents the total variation in the observed variable $Y$. It is the sum of the squares of the differences between each observed $Y_i$ and the overall mean $\bar{Y}$.
Coefficient of determination:
$R^2 = \mathrm{SSR}/\mathrm{SSTO}$
This coefficient measures the proportion of the total variation in the response that can be explained by the linear regression; it lies in the range $[0, 1]$.
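A quick numerical check of the decomposition and of $R^2$ with NumPy (made-up data):

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(0, 10, size=60)
y = 1.0 + 2.0 * x + rng.normal(0, 2, size=60)

# Fit a simple OLS line and compute the fitted values
slope, intercept = np.polyfit(x, y, 1)
y_hat = intercept + slope * x

ssr = np.sum((y_hat - y.mean()) ** 2)  # explained variation
sse = np.sum((y - y_hat) ** 2)         # unexplained (error) variation
ssto = np.sum((y - y.mean()) ** 2)     # total variation

print(np.isclose(ssto, ssr + sse))     # True: SSTO = SSR + SSE
print(ssr / ssto)                      # R^2
```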