Linear Regression Flashcards
What is one of the most common methods of prediction?
Regression Analysis
it is used whenever we have a causal relationship between variables
What is a Linear Regression?
a linear regression is a linear approximation of a causal relationship between two or more variables
How is the Dependent Variable labeled? (the predicted variable)
as Y
How are Independent Variables labeled? (the predictors)
x1, x2, etc.
In Y hat - what does the hat denote?
An estimated or predicted value
What is the simple linear regression formula?
Y hat = b0 + b1 * x1
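A minimal Python sketch of this equation in action; the coefficient and predictor values below are made up for illustration:

```python
import numpy as np

# Hypothetical coefficients: b0 is the intercept, b1 the slope.
b0, b1 = 0.275, 0.0017

x1 = np.array([1700, 1800, 1900])  # made-up predictor values
y_hat = b0 + b1 * x1               # Y hat = b0 + b1 * x1

print(y_hat)  # one prediction per x1 value
```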
You have an ice-cream shop. You noticed a relationship between the number of cones you order and the number of ice-creams you sell. Is this a suitable situation for regression analysis?
Yes
No
No
You are trying to predict the amount of beer consumed in the US, depending on the state. Is this regression material?
Yes
No
Yes
What does correlation measure?
The degree of relationship of two variables
it doesn’t capture causality but shows that two variables move together (no matter in which direction)
What is the purpose of regression analysis?
To see how one variable affects another, or what changes it causes the other
unlike correlation, it does not merely show a degree of connection, but cause and effect
Which statement is false?
Correlation does not imply causation.
Correlation is symmetrical regarding both variables.
Correlation could be represented as a line.
Correlation does not capture the direction of the causal relationship.
Correlation could be represented as a line.
What does it mean if x and y have a positive correlation?
An increase in x translates to a decrease in y.
An increase in y translates to a decrease in x.
The variables x and y tend to move in the same direction.
None of the above
The variables x and y tend to move in the same direction.
Assume you have the following sample regression: y = 6 + x. If we draw the regression line, what would be its slope?
1
6
x
None of the above
1
What does a p-value of 0.503 suggest about the intercept coefficient?
It is significantly different from 0.
It is not significantly different from 0.
It is equal to 0.503.
None of the above.
It is not significantly different from 0.
What does a p-value of 0.000 suggest about the coefficient (x)?
It is significantly different from 0.
It is not significantly different from 0.
It does not tell us anything.
None of the above.
It is significantly different from 0.
What is the predicted GPA of students with an SAT score of 1850? (Unlike in the lectures, this time assume that any coefficient with a p-value greater than 0.05 is not significantly different from 0)
3.42
3.06
3.23
3.145
3.145
Using the value of the coefficients in front of const and SAT, let’s write down the corresponding formula for linear regression, namely:
GPA = 0.2750 + 0.0017SAT
We can see that the const variable has a p-value of 0.503, which makes it statistically insignificant. The question asks us to make a prediction excluding such insignificant variables, which reduces the equation above to
GPA = 0.0017SAT
Now, plugging in SAT = 1850, we obtain 0.0017 × 1850 = 3.145, the desired result.
What is the Sum of Squares Total?
denoted: SST, or TSS
the sum of the squared differences between the observed dependent variable and its mean
measures the total variability of the dataset
What is the Sum of Squares Regression?
SSR or ESS
the sum of the squared differences between the predicted values and the mean of the dependent variable
a measure that describes how well your line fits the data
if equal to SST then the model captures all the variability and is perfect
What is the Sum of Squares Error?
SSE or RSS
the sum of the squared differences between the observed values and the predicted values
the smaller the error the better the estimation power of the regression
What is the connection between SST, SSR, and SSE?
SST = SSR + SSE
the total variability of the dataset = the explained variability by the regression line + the unexplained variability
a lower error will cause a more powerful regression
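A short Python sketch, on made-up data, that computes all three sums of squares and verifies the identity; the ratio SSR/SST is the R-squared covered below:

```python
import numpy as np

# Made-up observations; in practice y_hat comes from your fitted model.
x = np.array([1600, 1700, 1800, 1900, 2000])
y = np.array([3.0, 3.2, 3.4, 3.3, 3.8])

b1, b0 = np.polyfit(x, y, 1)  # OLS fit of a straight line
y_hat = b0 + b1 * x

sst = np.sum((y - y.mean()) ** 2)      # total variability
ssr = np.sum((y_hat - y.mean()) ** 2)  # variability explained by the line
sse = np.sum((y - y_hat) ** 2)         # unexplained variability

print(np.isclose(sst, ssr + sse))  # True: SST = SSR + SSE
print(ssr / sst)                   # the R-squared
```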
Which of the following is true?
SST = SSR + SSE
SSR = SST + SSE
SSE = SST + SSR
SST = SSR + SSE
What is the OLS?
Ordinary Least Squares
The most common method to estimate the linear regression equation
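A minimal sketch of an OLS fit with Python's statsmodels, on made-up SAT/GPA-style numbers; the printed summary includes the coefficients, p-values, R-squared, and F-statistic that later cards refer to:

```python
import numpy as np
import statsmodels.api as sm

# Made-up data standing in for SAT scores and GPAs.
sat = np.array([1714, 1664, 1760, 1685, 1693, 1670, 1764])
gpa = np.array([2.40, 2.52, 2.54, 2.74, 2.83, 2.91, 3.00])

X = sm.add_constant(sat)        # adds the intercept (const) column
results = sm.OLS(gpa, X).fit()  # ordinary least squares

print(results.params)    # b0 and b1
print(results.summary()) # coefficients, p-values, R-squared, F-statistic
```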
What software do beginner statisticians prefer?
Excel, SPSS, SAS, STATA
What software do data scientists prefer?
Programming languages like R and Python
they offer limitless capabilities and unmatched speed
What are other methods for determining the regression line?
- Generalized Least Squares
- Maximum likelihood estimation
- Bayesian Regression
- Kernel Regression
- Gaussian Process Regression
Since OLS (Ordinary Least Squares) is simple enough to understand, why do advanced statisticians prefer using programming languages to solve regressions?
Limitless capabilities and unmatched speed.
Other software cannot compute so many calculations.
Huge datasets cannot be used in Excel
None of the above.
Limitless capabilities and unmatched speed.
What is the R-squared?
R2 = SSR/SST
it measures the goodness of fit of your model - the more factors you include in your regression, the higher the R-squared
a relative measure and takes values ranging from 0-1
R2 = 0 means your regression line explains none of the variability of the data.
R2 = 1 means your regression line explains all of the variability and is perfect
Typical range: 0.2 - 0.9
SST = 1245, SSR = 945, SSE = 300. What is the R-squared of this regression?
0.24
0.52
0.76
0.87
0.76 (SSR/SST = 945/1245 ≈ 0.759)
The R-squared measures:
How well your data fits the regression line
How well your regression line fits your data
How well your data fits your model
How well your model fits your data
How well your model fits your data
i.e., it measures how much of the total variability is explained by our model
What is the best fitting model?
The least SSE
the lower the SSE the higher the SSR
The more powerful the model is
Why do we prefer using a multiple linear regression model to a simple linear regression model?
Easier to compute.
Having more independent variables makes the graphical representation clearer.
More realistic - things often depend on 2, 3, 10 or even more factors.
None of the above.
More realistic - things often depend on 2, 3, 10 or even more factors.
also, multiple regressions are always better than simple ones, as with each additional variable you add, the explanatory power may only increase or stay the same.
What is the Adjusted R-squared?
A version of the R-squared that penalizes the excessive use of variables; it is almost always smaller than the R-squared
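For reference, the standard formula, with n observations and p predictors:

$\bar{R}^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}$

Since (n - 1)/(n - p - 1) > 1 whenever p ≥ 1, the adjustment can only pull the value down.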
The adjusted R-squared is a measure that:
measures how well your model fits the data
measures how well your model fits the data but penalizes the excessive use of variables
measures how well your model fits the data but penalizes excessive use of p-values
measures how well your data fits the model but penalizes the excessive use of variables
measures how well your model fits the data but penalizes the excessive use of variables
The adjusted R-squared is:
usually bigger than the R-squared
usually smaller than the R-squared
usually the same as the R-squared
incomparable to the R-squared
usually smaller than the R-squared
What can you tell about a new variable if adding it increases R-squared but decreases the adjusted R-squared?
The variable improves our model
The variable can be omitted since it holds no predictive power
It has a quadratic relationship with the dependent variable
None of the above
The variable can be omitted since it holds no predictive power
What is the F-statistic?
It is used for testing the overall significance of the model
The lower the F-statistic, the closer the model is to being non-significant
It follows an F distribution. It is used for tests.
What are the 5 linear regression assumptions?
- Linearity
- No endogeneity
- Normality and homoscedasticity
- No autocorrelation
- No multicollinearity
What is one of the biggest mistakes you can make in OLS?
To perform a regression that violates one of the 5 assumptions.
If a regression assumption is violated:
Some things change.
You cannot perform a regression.
Performing regression analysis will yield an incorrect result.
It is no big deal.
Performing regression analysis will yield an incorrect result.
Why is a Linear Regression called linear?
Because the equation is linear
How can you verify if the relationship between two variables is linear?
Plot the independent variable x1 against the dependent variable y on a scatter plot; if the result looks like a line, then a linear regression model is suitable
if the relationship is non-linear, you should not use the data before transforming it appropriately
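A minimal matplotlib sketch of this check, on made-up data:

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up data with a roughly linear relationship plus noise.
rng = np.random.default_rng(42)
x1 = rng.uniform(0, 10, 100)
y = 2 + 3 * x1 + rng.normal(0, 2, 100)

plt.scatter(x1, y)  # if the cloud looks like a line, linearity holds
plt.xlabel("x1 (independent variable)")
plt.ylabel("y (dependent variable)")
plt.show()
```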
What are some fixes for when a relationship between x1 and y is not linear?
- Run a non-linear regression
- Exponential transformation
- Log transformation
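A sketch of the log-transformation fix, on made-up exponential data; after taking log(y) the relationship becomes linear and OLS applies:

```python
import numpy as np

# Made-up data: y grows exponentially in x1.
rng = np.random.default_rng(0)
x1 = np.linspace(1, 10, 50)
y = np.exp(0.5 * x1) * rng.lognormal(0, 0.1, 50)

# log(y) = 0.5 * x1 + noise, so regress log(y) on x1 instead of y.
y_log = np.log(y)
b1, b0 = np.polyfit(x1, y_log, 1)
print(b0, b1)  # roughly 0 and 0.5
```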
What should you do if you want to employ a linear regression but the relationship in your data is not linear?
Not use it.
Ignore it and proceed with your analysis
Transform it appropriately before using it.
None of the above.
Transform it appropriately before using it.
What is No Endogeneity of regressors?
Endogeneity refers to situations in which a predictor in a linear regression model is correlated to the error term.
The error becomes correlated with everything else
What is Omitted Variable Bias?
It happens when you forget to include a relevant variable
everything you don’t explain in your model goes into the error
Leads to biased and counterintuitive estimates
What are the sources of Endogeneity?
There is a wide range of sources of endogeneity. The common ones can be classified as omitted variables, simultaneity, and measurement error.
The easiest way to detect an omitted variable bias is through:
the error term
the independent variables
the dependent variable
sophisticated software
the error term
What should you do if the data exhibits heteroscedasticity?
Try to identify and remove outliers
Try a log transformation
Try to reduce bias by accounting for omitted variables
All of the above.
All of the above.
How does one detect autocorrelation?
Plot all the residuals on a graph and look for patterns. If you can't find any, you are safe.
or
Durbin-Watson test: values range from 0 to 4
2 -> no autocorrelation
<1 and >3 are cause for alarm
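A minimal sketch of the Durbin-Watson test with statsmodels, on made-up data:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

# Made-up data; in practice you test the residuals of your own model.
rng = np.random.default_rng(1)
x = np.arange(50)
y = 0.5 * x + rng.normal(0, 1, 50)

results = sm.OLS(y, sm.add_constant(x)).fit()
print(durbin_watson(results.resid))  # ~2 means no autocorrelation
```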
Autocorrelation is not likely to be observed in:
time series data
sample data
panel data
cross-sectional data
cross-sectional data
How do you fix autocorrelation when using a linear regression model?
Try to identify and remove outliers.
Use log transformation.
Try to reduce bias by accounting for omitted variables.
None of the above.
None of the above.
What is multicollinearity?
When two or more variables have a high correlation
How do we fix multicollinearity?
- Drop one of the two variables;
- Transform them into one;
- Keep them both.
How do we determine multicollinearity?
Before creating the regression, by checking the correlation between each pair of independent variables
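A minimal pandas sketch of this check, with made-up predictors:

```python
import pandas as pd

# Made-up independent variables; size and rooms are likely to be correlated.
df = pd.DataFrame({
    "size":  [65, 74, 80, 92, 100],
    "rooms": [2, 3, 3, 4, 4],
})

# Pairwise correlations; values near 1 or -1 flag potential multicollinearity.
print(df.corr())
```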
No multicollinearity is:
easy to spot and easy to fix
easy to spot but hard to fix
hard to spot but easy to fix
hard to spot and hard to fix
easy to spot and easy to fix
What is a Dummy Variable?
A variable that is used to include categorical data into a regression model
How do you determine the variables that are unneeded in a model?
feature selection through p-values
if a variable has a p-value > 0.05, we can disregard it
R-squared will also show how well the model fits the data
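A minimal statsmodels sketch of p-value-based feature selection, with a made-up noise variable (x2) that should come out insignificant:

```python
import numpy as np
import statsmodels.api as sm

# Made-up data: x1 drives y, x2 is pure noise.
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)
y = 1.0 + 2.0 * x1 + rng.normal(size=100)

X = sm.add_constant(np.column_stack([x1, x2]))
results = sm.OLS(y, X).fit()

# p-values for [const, x1, x2]; x2's should exceed 0.05,
# flagging it as a candidate to drop.
print(results.pvalues)
```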
What does feature selection do?
It simplifies models, improves speed, and prevents a series of unwanted issues arising from having too many features
What is a common problem when working with numerical data in linear regressions? What is the fix?
Differences in magnitudes
The fix is standardization, also called feature scaling or normalization - all the same thing
How: subtracting the mean and dividing by the standard deviation
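A minimal sketch with scikit-learn's StandardScaler (plain NumPy works just as well), on made-up inputs of very different magnitudes:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up inputs: column 0 is on a ~1000s scale, column 1 on a ~1s scale.
X = np.array([[1714.0, 2.4], [1760.0, 2.9], [1850.0, 3.2], [1900.0, 3.5]])

scaler = StandardScaler()           # subtracts the mean, divides by the std
X_scaled = scaler.fit_transform(X)

print(X_scaled.mean(axis=0))  # ~0 per column
print(X_scaled.std(axis=0))   # 1 per column
```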
What are Standardized Coefficients or Weights?
The coefficients of a regression performed on standardized inputs; the bigger the weight, the bigger its impact on the result.
What is another name for Intercepts in ML?
Bias (if the model needs to have one)
How do you interpret Weights in result summaries?
The closer a weight is to 0, the smaller its impact;
the bigger the weight, the bigger its impact
What is overfitting and how do we deal with it?
The model has focused on the particular training set so much that it has "missed the point"
To deal with it, split the dataset into two - a training set and a test set
Splits of 80/20 or 90/10 are common
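A minimal sketch of an 80/20 split with scikit-learn's train_test_split, on made-up data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up dataset: 100 observations, 3 features.
X = np.random.rand(100, 3)
y = np.random.rand(100)

# Train on 80%, hold out 20% to evaluate how the model generalizes.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(len(X_train), len(X_test))  # 80 20
```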
What is underfitting?
The model has not captured the underlying logic of the data
it provides an answer that is far from correct
You'll realize that there are no relationships to be found, or that you need a different model
What is one of the best ways of checking for multicollinearity?
Through VIF - variance inflation factor
VIF = 1 - no multicollinearity at all (the minimum value of the measure)
1 < VIF < 5 - considered perfectly okay
VIF > 5 - could be unacceptable
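A minimal statsmodels sketch of computing VIFs, with two deliberately collinear made-up predictors:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Made-up predictors: x2 is almost an exact multiple of x1.
rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.random(100)})
df["x2"] = 2 * df["x1"] + rng.normal(0, 0.01, 100)

X = sm.add_constant(df)
vifs = {col: variance_inflation_factor(X.values, i)
        for i, col in enumerate(X.columns) if col != "const"}
print(vifs)  # both VIFs should be far above 5 here
```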