Statsmodels Flashcards
What is the linear regression process in short?
Get sample data
Design a model that works for that sample
Make predictions for the whole population
Explain dependent and independent variables
There’s a dependent variable labeled Y (predicted)
And independent variables labeled x1, x2, …, xk (predictors)
Y = F(x1, x2, …, xk)
The dependent variable Y is a function of the independent variables x1 to xk
Explain the coefficients in y = ß0 + ß1x1 + ε
These are the ß values in the model formula
ß1 –> Quantifies the effect of x on y –> Differs per country for example (income example)
ß0 –> Constant –> Like minimum wage in the income vs education example
ε –> Represents the error of estimation –> Is 0 on average
Easiest regression model in formula?
Simple linear regression model
y = ß0 + ß1x1 + ε
y –> Variable we are trying to predict –> Dependent variable
x –> independent variable
Goal is to predict the value of y provided we have the value of x
What is the sample data equivalent of the simple linear regression equation?
ŷ = b0 + b1x1
ŷ –> Estimated / Predicted value
b0 –> Estimate of the intercept ß0
b1 –> Estimate of the coefficient ß1 (quantifies the effect of x1)
x1 –> Sample data for independent variable
Correlation vs. Regression
Correlation
Measures relationship between two variables
Movement together
Formula is symmetric –> ρ(x,y) = ρ(y,x)
Graph –> Single point
Regression
Describes how one variable affects the other, i.e. what changes it causes in the other
Cause and effect
Formula –> One way
Graph –> Line
What can you do with the following modules:
numpy
pandas
scipy
statsmodels.api
matplotlib
seaborn
sklearn
numpy
Working with multi-dimensional arrays
pandas
Enhances numpy
Organize data in tabular form
Along with descriptive data
scipy
numpy, pandas and matplotlib are part of the broader SciPy ecosystem; the scipy package itself adds further scientific-computing routines on top of numpy
statsmodels.api
Built on top of numpy and scipy –> Regressions and other statistical models
matplotlib
2D plotting specially designed for visualization of numpy computations
seaborn
Python visualization library based on matplotlib
sklearn
scikit-learn –> Machine learning library
Check which packages are installed
Anaconda Navigator
CMD.exe Prompt
Write “conda list”
Upload and access data
Put the csv file in the same folder as the notebook file
data = pd.read_csv('filename')
Write 'data' –> The data will show up in the output
Pull up statistical data of your dataset
data.describe()
What are the steps for plotting regression data?
Import relevant libraries
Load the data
Declare the dependent and the independent variables
Explore the data
Regression itself
Plot the regression line on the initial scatter
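A minimal sketch of these steps, assuming a hypothetical CSV file sample_data.csv with numeric columns SAT (predictor) and GPA (target); the file and column names are made up for illustration:

import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt

# Load the data
data = pd.read_csv('sample_data.csv')

# Declare the dependent (y) and independent (x1) variables
y = data['GPA']
x1 = data['SAT']

# Explore the data
print(data.describe())

# Regression itself: add a constant column so the model contains ß0
x = sm.add_constant(x1)
results = sm.OLS(y, x).fit()
print(results.summary())

# Plot the regression line on the initial scatter
b0, b1 = results.params           # const (ß0) and the SAT coefficient (ß1)
plt.scatter(x1, y)
plt.plot(x1, b0 + b1 * x1, c='orange')
plt.xlabel('SAT')
plt.ylabel('GPA')
plt.show()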
How to find ß0?
Read the regression results table to find the numbers for the fitted line
coef –> The column with the estimated coefficients
const –> The row whose coef value is ß0
How to determine whether variable is significant?
Hypothesis testing based on H0: ß=0
In the results table this corresponds to the t and P>|t| columns
p-value < 0.05 means that the variable is significant
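Assuming the fitted results object from the sketch above, the same numbers can be read programmatically:

# ß0 sits in the 'const' row of the coef column
print(results.params['const'])

# p-values of the H0: ß = 0 tests, one per variable
print(results.pvalues)
print(results.pvalues < 0.05)     # True –> significant at the 5% level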
What to do if a coefficient (e.g. ß0) is not significantly different from 0?
Leave that variable out of the formula –> it is not used when predicting the expected value
Explore determinants of a good regression
Sum of squares total (SST or TSS)
Sum of squares regression (SSR)
Sum of squares error (SSE)
Sum of squares total (SST) formula plus meaning?
∑(yi - ȳ)²
Think of this as the dispersion of the observed values around their mean
Measures the total variability of the dataset
Sum of squares regression (SSR) formula plus meaning?
∑(ŷi - ȳ)²
Sum of the squared differences between the predicted values and the mean of the dependent variable
Measures how well the line fits the data
If SSR = SST –> Then the regression line is perfect meaning all the spots are ON the line
Sum of squares error (SSE) formula plus meaning?
∑eᵢ²
eᵢ –> Difference between the observed value and the predicted value
The smaller the error, the better the estimation power of the regression
What is the connection between SST, SSR & SSE
SST = SSR + SSE
In words: The total variability of the dataset is equal to the variability explained by the regression line plus the unexplained variability (error)
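A quick check of this identity on the hypothetical fitted model above. Note that statsmodels uses different names than these cards: results.ess is the cards' SSR, results.ssr is the cards' SSE (sum of squared residuals), and results.centered_tss is SST:

sst = results.centered_tss        # SST in these cards
ssr = results.ess                 # SSR in these cards (explained sum of squares)
sse = results.ssr                 # SSE in these cards (sum of squared errors)
print(sst, ssr + sse)             # the two numbers should be (numerically) equal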
What is OLS?
OLS –> Ordinary least squares
Most common method to estimate the linear regression equation
Least squares stands for minimizing the SSE (error) –> Lower error –> better explanatory power
OLS is the line with the smallest error –> Closest to all points simultaneously
There are other methods to calculate regression. OLS is simple and powerful enough for most problems
What is R-squared and how to interpret it?
R-squared –> How well your model fits your data
Intuitive tool when in the right hands
R² = SSR / SST
R² = 0 –> Regression explains NONE of the variability
R² = 1 –> Regression explains ALL of the variability
What you will typically observe is values from 0.2 to 0.9
What is a good R squared?
No rule of thumb!
Depends on the context and complexity of the topic whether the number is a strong indicator
With a mediocre R², one might need additional indicators to explain the relationship
The more factors you include in your regression –> The higher the R squared
Why multiple regressions?
Good models require multiple regressions in order to address the higher complexity of problems
Population (multiple regression) model
More independent variables (more than one)
ŷ –> Predicted (inferred) value
b0 –> Intercept
x1…xk –> Independent variables
b1…bk –> Coefficients
ŷ = b0 + b1x1 + b2x2 + b3x3 + … + bkxk
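A minimal multiple-regression sketch, continuing the hypothetical data above and assuming a second predictor column Attendance already coded as 0/1 (both column names are made up):

import statsmodels.api as sm

x = data[['SAT', 'Attendance']]   # several independent variables
x = sm.add_constant(x)            # adds the intercept b0
results_multi = sm.OLS(y, x).fit()
print(results_multi.summary())    # one coefficient b1…bk per predictor, plus const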
What is the goal of building a model?
Goal is to limit the SSE as much as possible
With each additional variable we increase the explanatory power!
What is adjusted R-Squared?
Adjusted R-squared is a modified version of R-squared that has been adjusted for the number of predictors in the model. The adjusted R-squared increases when the new term improves the model more than would be expected by chance. It decreases when a predictor improves the model by less than expected.
The R-squared measures how much of the total variability is explained by our model
Adding variables never lowers R², so multiple regressions always look better than simple ones –> this is exactly what the adjustment corrects for
Denoted as: r̄²
How is the adjusted R squared formula build up, with which variables?
r̄² = 1 - (1 - R²) * (n - 1)/(n - p - 1)
So:
R-squared
n = Total sample size
p = number of predictors
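Computing it by hand and comparing with what statsmodels reports, using the hypothetical multiple regression above:

r2 = results_multi.rsquared
n = int(results_multi.nobs)            # total sample size
p = results_multi.df_model             # number of predictors (constant excluded)
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(adj_r2, results_multi.rsquared_adj)   # the two values should match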
How big is Adjusted R squared compared to R squared?
r̄² is always smaller than R²
It penalizes excessive use of variables
How to use p-value to determine if a variable should stay?
It’s only significant when p-value < 0.05
How to interpret Adjusted R squared?
Compare adjusted R-squared before and after adding the variable: did it increase or decrease?
What are the consequences of adding useless data?
The formula changes –> The intercept ß0 and the other coefficients take on different values
Thus the bias from the useless variable is reflected in the coefficients of the other variables
What is the simplicity/explanatory power tradeoff
Simplicity tends to be rewarded more than squeezing out a marginally higher explanatory power!
What is the F-statistic and how is it used?
F-statistic –> Follows an F-distribution
It is used for testing the overall significance of the model
F-test
Null Hypothesis is that all betas are equal to 0 –> H0: ß1 = ß2 = ß3 = 0
H1: at least one ßi≠0
If all betas are 0, then the model is useless
Compare F-statistic with or without variable –> Lower F-statistic means closer to a non-significant model
Prob(F-statistic) can still be significant, but notice the change –> if it increases after adding the variable, drop the variable
Interpretation of F-statistic?
Prob(F-statistic) very low –> We say overall model is significant
The lower the F-statistic –> The closer to a non-significant model
Don’t forget to look for the 3 zeroes after the dot
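Both numbers are also available directly on the hypothetical fitted results object:

print(results_multi.fvalue)       # the F-statistic itself
print(results_multi.f_pvalue)     # Prob(F-statistic) –> look for the zeroes after the dot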
How to verify linearity?
Plot the data –> If data points form something that can be explained as a straight line –> Then linear regression is suitable
Regression assumption:
Explain Endogeneity
σ(X, ε) = 0 : ∀x, ε
Endogeneity means the error (difference between observed and predicted values) is correlated with an independent variable –> this problem is commonly referred to as omitted variable bias
Explain Omitted variable bias
In general
Omitted variable bias occurs when you forget to include a variable. This is reflected in the error term as the factor you forgot about is included in the error. In this way, the error is not random but includes a systematic part (the omitted variable).
Whether you include or omit an x variable changes the error term –> therefore x and ε end up somewhat correlated
Regression assumption:
Explain Normality and homoscedasticity
ε ~ N(0, σ²)
Comprises:
Normality –> We assume the error term is normally distributed
Zero mean
Homoscedasticity
When in doubt about including variable, what should you do?
Just include the variable –> Worst thing that can happen is that it leads to inefficient estimates
You can then immediately drop that variable
Leaving out a great variable does a lot more harm!
What if error term is not normally distributed in Normality and homoscedasticity?
CLT Applies
Remember:
In probability theory, the central limit theorem (CLT) establishes that, in many situations, for independent and identically distributed random variables, the sampling distribution of the standardized sample mean tends towards the standard normal distribution even if the original variables themselves are not normally distributed.
What does diffusing mean?
Diffusing (fanning out) means the points sit close to the line for lower values but spread out for higher values –> We don’t like this pattern –> Heteroscedasticity
Example of heteroscedasticity?
Poor person will have the same dinner every day –> Low variability
Rich person will eat out and then dine in the next day –> High variability thus we expect heteroscedasticity
How to prevent heteroscedasticity?
Check for omitted variable bias (OVB)
Look for outliers
Log Transform –> A statistician’s best friend
Apply logarithmic axes
Changing the scale of x reduces the spread of the graph
New model is called: Semi-log model
Denoted as:
ŷ = b0 + b1(log x1)
or:
log ŷ = b0 + b1x1
Meaning:
For log ŷ = b0 + b1x1: as x increases by 1 unit, y increases by roughly b1 × 100 percent
For ŷ = b0 + b1(log x1): as x increases by 1 percent, y increases by roughly b1 / 100 units
Log-log model?
When using log on both axes:
log ŷ = b0 + b1(logx1)
Interpretation: As X increases by 1 percent, Y increases by b1 percent
Relation is known as elasticity
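A sketch of both transformations, continuing the hypothetical columns from above (np.log is the natural logarithm):

import numpy as np
import statsmodels.api as sm

# Semi-log model: only the dependent variable is log-transformed
log_y = np.log(data['GPA'])
semi_log = sm.OLS(log_y, sm.add_constant(data['SAT'])).fit()

# Log-log model: both sides are log-transformed
log_x = np.log(data['SAT'])
log_log = sm.OLS(log_y, sm.add_constant(log_x)).fit()
print(log_log.params)             # the slope is interpreted as an elasticity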
Regression assumption: Explain ‘No autocorrelation’
Autocorrelation is a mathematical representation of the degree of similarity between a given time series and a lagged version of itself over successive time intervals. It’s conceptually similar to the correlation between two different time series, but autocorrelation uses the same time series twice: once in its original form and once lagged one or more time periods.
No autocorrelation
σ(εi, εj) = 0 : ∀i ≠ j
Errors are assumed to be uncorrelated
Highly unlikely to find it in cross sectional data
Very common in time-series data such as stock prices
Spot autocorrelation
Look at the graph –> If you can’t find any, you are safe
Durbin-Watson test
Generally its values fall between 0 and 4
2 –> No autocorrelation
<1 and >3 are a cause for alarm
Conclusion: When in the presence of autocorrelation avoid the linear regression model
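The Durbin-Watson statistic is printed in results.summary(), and can also be computed on the residuals directly:

from statsmodels.stats.stattools import durbin_watson

dw = durbin_watson(results.resid)
print(dw)                         # ~2 –> no autocorrelation; <1 or >3 –> cause for alarm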
Regression assumption: No Multicollinearity
ρ(xi, xj) ≈ 1 : ∀i, j; i ≠ j
Is observed when 2 or more variables have a high correlation among each other
Example: a = 2 + 5 * b
In this case there is no point in using both a and b because they are correlated
ρ(c, d) = 0.9 –> Imperfect multicollinearity
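One common way to check for multicollinearity (not covered in the cards above) is the variance inflation factor, sketched here on the hypothetical predictors:

from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm

x = sm.add_constant(data[['SAT', 'Attendance']])
vifs = [variance_inflation_factor(x.values, i) for i in range(x.shape[1])]
print(dict(zip(x.columns, vifs))) # VIF near 1 –> fine; very large –> multicollinearity (ignore 'const')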
How to deal with categorical data?
Use a dummy variable –> Transform yes and no into 1 and 0
You can do calculations with 0 and 1
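A one-line sketch, assuming a hypothetical categorical column Attendance containing 'Yes'/'No' strings:

data['Attendance'] = data['Attendance'].map({'Yes': 1, 'No': 0})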
What will the categorical data graph look like?
You will get two different models:
When the dummy = 0: ŷ = b0 + b1x1 + b2·0 = b0 + b1x1
When the dummy = 1: ŷ = b0 + b1x1 + b2·1 = (b0 + b2) + b1x1
Results in two lines with equal slope but different intercept
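Plotting the two lines from the hypothetical multiple regression above (params order: const, SAT, Attendance):

import matplotlib.pyplot as plt

b0, b1, b2 = results_multi.params
plt.scatter(data['SAT'], data['GPA'], c=data['Attendance'])
plt.plot(data['SAT'], b0 + b1 * data['SAT'], c='red')            # dummy = 0
plt.plot(data['SAT'], (b0 + b2) + b1 * data['SAT'], c='green')   # dummy = 1
plt.show()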