4: Simple linear regression Flashcards
Simple linear regression
A model for a continuous response variable and a continuous explanatory variable, between which a linear relationship is assumed. The errors (residuals) are assumed to be normally distributed.
Confidence Interval (CI)
The range that may contain the true mean, with a specified level of confidence (commonly 95% or 99%).
The CI is recomputed from each sample, so every time you repeat an experiment the CI is different; any single interval may or may not contain the true mean. With 95% confidence, about 95% of such intervals will contain the true mean.
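A minimal simulation sketch of this repeated-sampling idea, using hypothetical normal data (the true mean, sample size, and seed are illustrative assumptions):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_mean, n, reps = 10.0, 30, 1000

covered = 0
for _ in range(reps):
    sample = rng.normal(true_mean, 2.0, n)
    # 95% t-based CI for the mean: mean ± t* · s/√n
    lo, hi = stats.t.interval(0.95, df=n - 1,
                              loc=sample.mean(),
                              scale=stats.sem(sample))
    covered += lo <= true_mean <= hi

coverage = covered / reps  # should land close to 0.95
```

Each run produces a different interval, but roughly 95% of them capture the true mean.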
Difference Total Sum of Squares (TSS) and Residual Sum of Squares (RSS)
TSS gives you an idea of the total variation in your data, while RSS tells you how much variation remains after accounting for the model’s predictions
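A small sketch of the TSS/RSS distinction on toy data (the x/y values are made up for illustration):

```python
import numpy as np

# Toy data: y roughly linear in x (illustrative values)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Fit y = b0 + b1*x by least squares
b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x

tss = np.sum((y - y.mean()) ** 2)  # total variation in y
rss = np.sum((y - y_hat) ** 2)     # variation left after the model
r_squared = 1 - rss / tss          # share of variation explained
```

RSS is always at most TSS for a least-squares fit with an intercept, and 1 − RSS/TSS is the familiar R².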
homoscedasticity-heteroscedasticity
Homoscedasticity means equal spread (constant variance) of the residuals; heteroscedasticity means non-constant variance, often visible as a cone shape in the residual plot. Many statistical tests, including linear regression, assume homoscedasticity. If this assumption is violated (heteroscedasticity), it can affect the validity of the test results.
Heteroscedasticity can be detected with the Breusch-Pagan test or the White test;
If heteroscedasticity is detected, transformations (like taking the logarithm) of the dependent variable can sometimes help stabilize the variance;
Heteroscedasticity indicates that the model might not be a good fit for the data, and addressing it is crucial for obtaining reliable statistical results.
Response Variable
also known as the dependent variable, is the outcome or the variable that you are trying to predict.
It changes in response to the explanatory variables.
## Footnote
Example: In a study examining how study hours affect test scores, the test score is the response variable because it is what you want to measure or predict based on study hours.
Explanatory Variable
also known as the independent variable or predictor variable, is the variable that you manipulate or observe to see how it affects the response variable. It is used to explain changes in the response variable.
the number of study hours is the explanatory variable, as it is the factor you think will influence the test scores.
Linear model
A linear model assumes a straight-line relationship between the response variable and the explanatory variable.
Linear function: y = β0 + β1·x
This means that as the explanatory variable changes, the response variable changes in a predictable linear manner.
Model Diagnostics for linear models
- Residuals vs Fitted plot: checks linearity. If the pattern is non-linear, try a transformation; a parabolic shape suggests quadratic data.
- QQ-plot: assesses the normality of your residuals. If the points fall along a straight line (typically the diagonal), the residuals are approximately normally distributed; a skewed or S-shaped pattern can indicate Poisson or binomial data, and points falling outside the confidence interval indicate non-normality -> consider a GLM. Normality can also be assessed with the Shapiro-Wilk test (a significant p-value means non-normality) or an omnibus test, which also includes skewness and kurtosis checks.
- Scale-Location plot: checks whether the residuals have constant variance across the fitted values (homoscedasticity); a difference of ≤ 0.5 is acceptable. A good plot has points scattered randomly, while a funnel-shaped pattern suggests heteroscedasticity; the Breusch-Pagan test can also be used (a significant p-value means heteroscedasticity/non-constant variance). Non-constant variance often comes with non-linearity -> try a transformation.
- Cook's distance: checks for outliers. A value of 0.5 is anomalous and 1 indicates an outlier; high leverage means the point pulls hard on the estimates.
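The numerical checks above can be sketched with scipy on synthetic data (seed, sample size, and coefficients are illustrative assumptions; here the errors really are normal):

```python
import numpy as np
from scipy import stats

# Synthetic data with truly normal errors (illustrative)
rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 100)
y = 2.0 + 1.5 * x + rng.normal(0, 1.0, 100)

# Residuals from a straight-line fit
b1, b0 = np.polyfit(x, y, 1)
resid = y - (b0 + b1 * x)

# Shapiro-Wilk: a significant p-value (< 0.05) suggests non-normal residuals
w_stat, p_value = stats.shapiro(resid)

# QQ-plot data: r close to 1 means the points lie near the diagonal
(osm, osr), (slope, intercept, r) = stats.probplot(resid, dist="norm")
```

`probplot` returns the ordered quantile pairs you would plot in a QQ-plot, plus the fit of those points to a straight line.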
Box-Cox
Used to check which transformation fits best: if a transformation's λ falls within the confidence interval, it is a viable transformation.
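A sketch using scipy's Box-Cox on made-up right-skewed data; passing `alpha` returns a confidence interval for λ:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
# Right-skewed positive data (Box-Cox requires strictly positive values)
data = rng.lognormal(mean=0.0, sigma=0.7, size=200)

# With alpha set, boxcox returns the transformed data, the best lambda,
# and a confidence interval for lambda
transformed, lmbda, (ci_low, ci_high) = stats.boxcox(data, alpha=0.05)

# If a simple lambda (e.g. 0 for log, 0.5 for square root) lies inside
# the CI, that transformation is a viable choice
```

For lognormal data the estimated λ tends to sit near 0, pointing to a log transformation.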
Intercept and slope in regression table
The intercept is where the line crosses the y-axis (at x = 0). The slope of X1 gives the change in y for one step on the x-axis: for example, with an intercept of 2 and a slope of 9.87, the predicted y at x = 1 is 2 + 9.87 = 11.87, so with the points (0, 2) and (1, 11.87) you can draw the slope.
Ordinary Least Squares (OLS)
Method to minimize the total amount of noise (error) by minimizing the residual sum of squares (RSS)
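A minimal sketch of OLS via its closed-form solution, on toy data (the x/y values are illustrative); it shows that any other line has a larger RSS than the least-squares fit:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.9])

# Closed-form OLS for y = b0 + b1*x: choose b0, b1 to minimize
# RSS = sum((y_i - b0 - b1*x_i)^2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

rss = np.sum((y - (b0 + b1 * x)) ** 2)

# Shifting the intercept (or slope) away from the OLS solution
# strictly increases the RSS
rss_other = np.sum((y - (b0 + 0.1 + b1 * x)) ** 2)
```

Because the OLS residuals sum to zero, the perturbed line's RSS exceeds the OLS RSS by exactly n·0.1².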
Difference correlation and causation
Correlation shows that two variables move together, while causation means that a change in one variable directly produces a change in the other; correlation alone does not imply causation.