Linear Regression Flashcards

1
Q

What is one of the most common methods of prediction?

A

Regression Analysis

it is used whenever there is a causal relationship between variables

2
Q

What is a Linear Regression?

A

a linear regression is a linear approximation of a causal relationship between two or more variables

3
Q

How is the Dependent Variable labeled? (the predicted variable)

A

as Y

4
Q

How are Independent Variables labeled? (the predictors)

A

x1, x2, etc.

5
Q

In Y hat - what does the hat denote?

A

An estimated or predicted value

6
Q

What is the simple linear regression formula?

A

Y hat = b0 + b1 * x1
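The formula can be sketched in a couple of lines of Python; b0 and b1 below are hypothetical coefficients chosen purely for illustration.

```python
# Hypothetical coefficients for illustration: intercept b0 and slope b1.
b0 = 2.0
b1 = 0.5

def predict(x1):
    # Y hat = b0 + b1 * x1
    return b0 + b1 * x1

print(predict(10))  # 2.0 + 0.5 * 10 = 7.0
```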

7
Q

You have an ice-cream shop. You noticed a relationship between the number of cones you order and the number of ice-creams you sell. Is this a suitable situation for regression analysis?

Yes

No

A

No

8
Q

You are trying to predict the amount of beer consumed in the US, depending on the state. Is this regression material?

Yes

No

A

Yes

9
Q

What does correlation measure?

A

The degree of relationship between two variables

it doesn’t capture causality but shows that two variables move together (no matter in which direction)

10
Q

What is the purpose of regression analysis?

A

To see how one variable affects another, or what changes in one variable cause changes in the other

unlike correlation, it captures not just the degree of connection but cause and effect

11
Q

Which statement is false?

Correlation does not imply causation.

Correlation is symmetrical regarding both variables.

Correlation could be represented as a line.

Correlation does not capture the direction of the causal relationship.

A

Correlation could be represented as a line.

12
Q

What does it mean if x and y have a positive correlation?

An increase in x translates to a decrease in y.

An increase in y translates to a decrease in x.

The variables x and y tend to move in the same direction.

None of the above

A

The variables x and y tend to move in the same direction.

13
Q

Assume you have the following sample regression: y = 6 + x. If we draw the regression line, what would be its slope?

1

6

x

None of the above

A

1

14
Q

What does a p-value of 0.503 suggest about the intercept coefficient?

It is significantly different from 0.

It is not significantly different from 0.

It is equal to 0.503.

None of the above.

A

It is not significantly different from 0.

15
Q

What does a p-value of 0.000 suggest about the coefficient (x)?

It is significantly different from 0.

It is not significantly different from 0.

It does not tell us anything.

None of the above.

A

It is significantly different from 0.

16
Q

What is the predicted GPA of students with an SAT score of 1850? (Unlike in the lectures, this time assume that any coefficient with a p-value greater than 0.05 is not significantly different from 0)

3.42

3.06

3.23

3.145

A

3.145

Using the coefficients in front of const and SAT, we can write down the corresponding linear regression equation:
GPA = 0.2750 + 0.0017 * SAT
The const variable has a p-value of 0.503, which makes it statistically insignificant. The question asks for a prediction excluding such insignificant variables, which reduces the equation to:
GPA = 0.0017 * SAT
Plugging in SAT = 1850, we obtain the desired result: GPA = 3.145.
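The arithmetic in this explanation can be checked directly, using the coefficients quoted in the card:

```python
# Coefficients from the regression output quoted in the card.
b_const = 0.2750  # p-value 0.503 -> insignificant, excluded from the prediction
b_sat = 0.0017    # p-value 0.000 -> significant, kept

gpa = b_sat * 1850  # intercept dropped per the question's instruction
print(round(gpa, 3))  # 3.145
```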


17
Q

What is the Sum of Squares Total?

A

denoted: SST, or TSS

the sum of the squared differences between the observed dependent variable and its mean

measures the total variability of the dataset

18
Q

What is the Sum of Squares Regression?

A

SSR or ESS

sum of the squared differences between the predicted value and the mean of the dependent variable

a measure that describes how well your line fits the data

if equal to SST then the model captures all the variability and is perfect

19
Q

What is the Sum of Squares Error?

A

SSE or RSS

the sum of the squared differences between the observed value and the predicted value

the smaller the error, the better the estimation power of the regression

20
Q

What is the connection between SST, SSR, and SSE?

A

SST = SSR + SSE

the total variability of the dataset = the explained variability by the regression line + the unexplained variability

a lower error will cause a more powerful regression
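The decomposition can be verified numerically. The toy data below and the use of numpy's `polyfit` for the least-squares line are illustrative assumptions, not material from the course:

```python
import numpy as np

# Toy data; np.polyfit(deg=1) returns the least-squares (OLS) slope and intercept.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 6.0])

b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x

sst = np.sum((y - y.mean()) ** 2)      # total variability
ssr = np.sum((y_hat - y.mean()) ** 2)  # explained variability
sse = np.sum((y - y_hat) ** 2)         # unexplained variability

print(np.isclose(sst, ssr + sse))  # True for an OLS fit with an intercept
```

Note the identity SST = SSR + SSE holds exactly only for predictions from an OLS fit that includes an intercept; arbitrary predictions would not satisfy it.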

21
Q

Which of the following is true?

SST = SSR + SSE

SSR = SST + SSE

SSE = SST + SSR

A

SST = SSR + SSE

22
Q

What is the OLS?

A

Ordinary Least Squares

The most common method to estimate the linear regression equation
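For a simple regression, the OLS estimates have well-known closed-form formulas, sketched below on made-up data:

```python
import numpy as np

# OLS estimates for a simple regression, via the closed-form formulas:
# b1 = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2),  b0 = y_bar - b1 * x_bar
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 2.0, 5.0])

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
print(round(b0, 2), round(b1, 2))  # 1.1 1.1
```

These are the values that minimize the sum of squared errors, which is where the method gets its name.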

23
Q

What software do beginner statisticians prefer?

A

Excel, SPSS, SAS, STATA

24
Q

What software do data scientists prefer?

A

Programming languages like R and Python

they offer limitless capabilities and unmatched speed

25
Q

What are other methods for determining the regression line?

A

- Generalized Least Squares
- Maximum likelihood estimation
- Bayesian Regression
- Kernel Regression
- Gaussian Process Regression

26
Q

Since OLS (Ordinary Least Squares) is simple enough to understand, why do advanced statisticians prefer using programming languages to solve regressions?

Limitless capabilities and unmatched speed.

Other software cannot compute so many calculations.

Huge datasets cannot be used in Excel.

None of the above.

A

Limitless capabilities and unmatched speed.

27
Q

What is the R-squared?

A

R2 = SSR / SST

it measures the goodness of fit of your model

a relative measure, taking values from 0 to 1:
R2 = 0 means your regression line explains none of the variability of the data
R2 = 1 means your regression line explains all of the variability and is perfect

the more factors you include in your regression, the higher the R-squared

typical range: 0.2 - 0.9

28
Q

SST = 1245, SSR = 945, SSE = 300. What is the R-squared of this regression?

0.24

0.52

0.76

0.87

A

0.76

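With the numbers from this card, the computation is:

```python
# Sums of squares given in the card.
sst, ssr, sse = 1245, 945, 300

r_squared = ssr / sst       # equivalently: 1 - sse / sst
print(round(r_squared, 2))  # 0.76
```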
29
Q

The R-squared measures:

How well your data fits the regression line

How well your regression line fits your data

How well your data fits your model

How well your model fits your data

A

How well your model fits your data

i.e. it measures how much of the total variability is explained by our model

30
Q

What is the best fitting model?

A

The one with the lowest SSE

the lower the SSE, the higher the SSR, and the more powerful the model

31
Q

Why do we prefer using a multiple linear regression model to a simple linear regression model?

Easier to compute.

Having more independent variables makes the graphical representation clearer.

More realistic - things often depend on 2, 3, 10 or even more factors.

None of the above.

A

More realistic - things often depend on 2, 3, 10 or even more factors.

Also, a multiple regression always fits at least as well as a simple one: with each additional variable you add, the explanatory power may only increase or stay the same.

32
Q

What is the Adjusted R-squared?

A

Always smaller than the R-squared, as it penalizes the excessive use of variables

33
Q

The adjusted R-squared is a measure that:

measures how well your model fits the data

measures how well your model fits the data but penalizes the excessive use of variables

measures how well your model fits the data but penalizes excessive use of p-values

measures how well your data fits the model but penalizes the excessive use of variables

A

measures how well your model fits the data but penalizes the excessive use of variables

34
Q

The adjusted R-squared is:

usually bigger than the R-squared

usually smaller than the R-squared

usually the same as the R-squared

incomparable to the R-squared

A

usually smaller than the R-squared

35
Q

What can you tell about a new variable if adding it increases the R-squared but decreases the adjusted R-squared?

The variable improves our model

The variable can be omitted since it holds no predictive power

It has a quadratic relationship with the dependent variable

None of the above

A

The variable can be omitted since it holds no predictive power

36
Q

What is the F-statistic?

A

It is used for testing the overall significance of the model

the lower the F-statistic, the closer the model is to a non-significant one

it follows an F distribution and is used for hypothesis tests

37
Q

What are the 5 linear regression assumptions?

A

1. Linearity
2. No endogeneity
3. Normality and homoscedasticity
4. No autocorrelation
5. No multicollinearity

38
Q

What is one of the biggest mistakes you can make in OLS?

A

To perform a regression that violates one of the 5 assumptions

39
Q

If a regression assumption is violated:

Some things change.

You cannot perform a regression.

Performing regression analysis will yield an incorrect result.

It is no big deal.

A

Performing regression analysis will yield an incorrect result.

40
Q

Why is a Linear Regression called linear?

A

Because the equation is linear

41
Q

How can you verify if the relationship between two variables is linear?

A

Plot the independent variable x1 against the dependent variable y on a scatter plot; if the result looks like a line, a linear regression model is suitable

if the relationship is non-linear, you should not use the data before transforming it appropriately

42
Q

What are some fixes for when a relationship between x1 and y is not linear?

A

1. Run a non-linear regression
2. Exponential transformation
3. Log transformation

*if the relationship is non-linear, you should not use the data before transforming it appropriately

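A log transformation can be sketched with numpy: an exponential relationship becomes linear in log-space, so a linear fit applies to the transformed data. The synthetic data and parameter values below are assumptions for illustration.

```python
import numpy as np

# Synthetic data following y = 2 * exp(0.5 * x): non-linear in x,
# but log(y) = log(2) + 0.5 * x is linear, so a linear fit applies.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * np.exp(0.5 * x)

slope, intercept = np.polyfit(x, np.log(y), 1)  # fit a line in log-space
print(round(slope, 3), round(np.exp(intercept), 3))  # recovers 0.5 and 2.0
```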
43
Q

What should you do if you want to employ a linear regression but the relationship in your data is not linear?

Not use it.

Ignore it and proceed with your analysis.

Transform it appropriately before using it.

None of the above.

A

Transform it appropriately before using it.

44
Q

What is No Endogeneity of regressors?

A

Endogeneity refers to situations in which a predictor in a linear regression model is correlated with the error term

the error becomes correlated with everything else

45
Q

What is Omitted Variable Bias?

A

It happens when you forget to include a relevant variable

everything you don't explain in your model goes into the error

leads to biased and counterintuitive estimates

46
Q

What are the sources of Endogeneity?

A

There is a wide range of sources of endogeneity; the most common can be classified as omitted variables, simultaneity, and measurement error

47
Q

The easiest way to detect an omitted variable bias is through:

the error term

the independent variables

the dependent variable

sophisticated software

A

the error term

48
Q

What should you do if the data exhibits heteroscedasticity?

Try to identify and remove outliers

Try a log transformation

Try to reduce bias by accounting for omitted variables

All of the above.

A

All of the above.

49
Q

How does one detect autocorrelation?

A

Plot all the residuals on a graph and look for patterns; if you can't find any, you are safe

or use the Durbin-Watson test: values range from 0 to 4
2 -> no autocorrelation
values below 1 or above 3 are cause for alarm

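The Durbin-Watson statistic has a simple textbook formula; the residuals below are made up to show a value near the alarming end of the scale.

```python
import numpy as np

# DW = sum((e_t - e_{t-1})^2) / sum(e_t^2); ranges from 0 to 4,
# 2 means no autocorrelation, below 1 or above 3 is cause for alarm.
def durbin_watson(residuals):
    diff = np.diff(residuals)
    return np.sum(diff ** 2) / np.sum(residuals ** 2)

# Alternating residuals exhibit strong negative autocorrelation.
e = np.array([1.0, -1.0, 1.0, -1.0, 1.0])
print(durbin_watson(e))  # 3.2, above 3 -> cause for alarm
```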
50
Q

Autocorrelation is not likely to be observed in:

time series data

sample data

panel data

cross-sectional data

A

cross-sectional data

51
Q

How do you fix autocorrelation when using a linear regression model?

Try to identify and remove outliers.

Use log transformation.

Try to reduce bias by accounting for omitted variables.

None of the above.

A

None of the above.

52
Q

What is multicollinearity?

A

When two or more variables have a high correlation

53
Q

How do we fix multicollinearity?

A

1. Drop one of the two variables
2. Transform them into one
3. Keep them both

54
Q

How do we determine multicollinearity?

A

Before creating the regression, check the correlation between each pair of independent variables

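The pairwise check can be sketched with numpy's correlation matrix; the predictor values below are made up, with x2 deliberately constructed as a linear function of x1.

```python
import numpy as np

# Made-up predictors: x2 is a linear function of x1, so they are collinear.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = 2.0 * x1 + 0.1
x3 = np.array([5.0, 1.0, 4.0, 2.0, 3.0])

corr = np.corrcoef([x1, x2, x3])  # pairwise correlation matrix
print(round(corr[0, 1], 2))       # 1.0 -> x1 and x2 are perfectly correlated
```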
55
Q

No multicollinearity is:

easy to spot and easy to fix

easy to spot but hard to fix

hard to spot but easy to fix

hard to spot and hard to fix

A

easy to spot and easy to fix

56
Q

What is a Dummy Variable?

A

A variable that is used to include categorical data into a regression model

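A minimal sketch of creating a dummy variable; the `attendance` categories below are a hypothetical example, not data from the course.

```python
# A dummy variable maps a category to 0/1 so it can enter the regression.
attendance = ["yes", "no", "yes", "yes", "no"]
dummy = [1 if value == "yes" else 0 for value in attendance]
print(dummy)  # [1, 0, 1, 1, 0]
```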
58
Q

How do you determine the variables that are unneeded in a model?

A

Feature selection through p-values: if a variable has a p-value > 0.05, we can disregard it

the R-squared will also show how well the model fits

59
Q

What does feature selection do?

A

It simplifies models, improves speed, and prevents a series of unwanted issues arising from having too many features

60
Q

What is a common problem when working with numerical data in linear regressions? What is the fix?

A

Differences in magnitudes

The fix: standardization, also called feature scaling or normalization (all names for the same thing here)

How: subtract the mean and divide by the standard deviation

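The transformation is one line with numpy; the sample values below are arbitrary.

```python
import numpy as np

# Standardization: subtract the mean, divide by the standard deviation.
x = np.array([10.0, 20.0, 30.0, 40.0])
x_std = (x - x.mean()) / x.std()

# The result has mean ~0 and standard deviation ~1.
print(x_std.mean(), x_std.std())
```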
61
Q

What are Standardized Coefficients or Weights?

A

The coefficients obtained after standardizing the inputs; the bigger the weight, the bigger its impact on the result

62
Q

What is another name for Intercepts in ML?

A

Bias (if the model needs to have one)

63
Q

How do you interpret Weights in result summaries?

A

The closer a weight is to 0, the smaller its impact; the bigger the weight, the bigger its impact

64
Q

What is overfitting and how do we deal with it?

A

The training has focused on the particular training set so much that it has "missed the point"

Fix: split the dataset into two - a training set and a test set; splits of 80/20 or 90/10 are common

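An 80/20 split can be sketched by shuffling row indices; the dataset size and seed below are arbitrary assumptions.

```python
import numpy as np

# A sketch of an 80/20 train/test split on row indices.
rng = np.random.default_rng(0)  # fixed seed so the split is reproducible
n = 100
indices = rng.permutation(n)

split = int(0.8 * n)
train_idx, test_idx = indices[:split], indices[split:]
print(len(train_idx), len(test_idx))  # 80 20
```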
65
Q

What is underfitting?

A

The model has not captured the underlying logic of the data

it provides an answer that is far from correct

you'll realize either that there are no relationships to be found or that you need a different model

66
Q

What is one of the best ways of checking for multicollinearity?

A

Through the VIF (variance inflation factor):
VIF = 1 -> no multicollinearity at all (the minimum value of the measure)
1 < VIF < 5 -> considered perfectly okay
VIF > 5 -> could be unacceptable

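The VIF formula itself is simple: for each predictor, regress it on the other predictors and plug the resulting R-squared into 1 / (1 - R^2).

```python
# VIF for one predictor = 1 / (1 - R^2), where R^2 comes from regressing
# that predictor on all the other predictors.
def vif(r_squared):
    return 1.0 / (1.0 - r_squared)

print(vif(0.0))            # 1.0 -> no multicollinearity at all
print(round(vif(0.8), 1))  # 5.0 -> borderline
```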