Statsmodels Flashcards
What is the linear regression process in short?
Get sample data
Design a model that works for that sample
Make predictions for the whole population
Explain dependent and independent variables
There’s a dependent variable labeled Y (predicted)
And independent variables labeled x1, x2, …, xk (predictors)
Y = F(x1, x2, …, xk)
The dependent variable Y is a function of the independent variables x1 to xk
Explain the coefficients in y = ß0 + ß1x1 + ε
These are the ß values in the model formula
ß1 –> Quantifies the effect of x1 on y –> Differs by context, e.g. per country in the income vs. education example
ß0 –> Constant (intercept) –> Like the minimum wage in the income vs. education example
ε –> Represents the error of estimation –> Is 0 on average
Easiest regression model in formula?
Simple linear regression model
y = ß0 + ß1x1 + ε
y –> Variable we are trying to predict –> Dependent variable
x –> independent variable
Goal is to predict the value of y provided we have the value of x
What is the sample data equivalent of the simple linear regression equation?
ŷ = b0 + b1x1
ŷ –> Estimated / Predicted value
b1 –> Estimated coefficient –> Quantifies the effect of x1 on ŷ
x1 –> Sample data for independent variable
Correlation vs. Regression
Correlation
Measures relationship between two variables
Movement together
Formula interchangeable –> ρ(x,y) = ρ(y,x)
Graph –> Single point
Regression
Shows how one variable affects the other, i.e. what changes it causes
Cause and effect
Formula –> One way
Graph –> Line
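The symmetry of correlation versus the one-way nature of regression can be checked numerically. A small sketch with synthetic data (the relationship y ≈ 0.5·x and the seed are assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(0, 1, 100)
y = 0.5 * x + rng.normal(0, 1, 100)

# Correlation is symmetric: corr(x, y) equals corr(y, x)
r_xy = np.corrcoef(x, y)[0, 1]
r_yx = np.corrcoef(y, x)[0, 1]

# Regression is one-way: the slope of y on x differs from the slope of x on y
slope_y_on_x = np.polyfit(x, y, 1)[0]
slope_x_on_y = np.polyfit(y, x, 1)[0]
```

Swapping the variables leaves the correlation untouched but produces a different regression line.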
What can you do with the following modules:
numpy
pandas
scipy
statsmodels.api
matplotlib
seaborn
sklearn
numpy
Working with multi-dimensional arrays
pandas
Enhances numpy
Organize data in tabular form
Along with descriptive data
scipy
Scientific computing routines –> numpy, pandas and matplotlib are part of the wider SciPy ecosystem
statsmodels.api
Built on top of numpy and scipy –> Regressions and statistical models
matplotlib
2D plotting library especially designed for visualization of numpy computations
seaborn
Python visualization library based on matplotlib
sklearn
scikit learn –> Machine learning libraries
Check which packages are installed
Anaconda Navigator
CMD.exe Prompt
Write “conda list”
Upload and access data
Put the csv file in the same folder as the notebook file
data = pd.read_csv('filename')
Write data on its own line –> The DataFrame will show up in the output
Pull up statistical data of your dataset
data.describe()
What are the steps for plotting regression data?
Import relevant libraries
Load the data
Declare the dependent and the independent variables
Explore the data
Regression itself
Plot the regression line on the initial scatter
How to find ß0?
Read the numbers for plotting the regression line from the results summary table
coef –> column holding the estimated coefficients
const –> the row whose coef value is ß0
How to determine whether variable is significant?
Hypothesis testing based on H0: ß=0
In the results table these are the t and P>|t| columns
p-value < 0.05 means that the variable is significant
What to do if ß0 is not significantly different from 0?
It is not included in the formula and thus left out of the prediction of the expected value
Explore determinants of a good regression
Sum of squares total (SST or TSS)
Sum of squares regression (SSR)
Sum of squares error (SSE)
Sum of squares total (SST) formula plus meaning?
∑(yi - ȳ)²
Can think of this as the dispersion of the observed values around their mean
Measures the total variability of the dataset
Sum of squares regression (SSR) formula plus meaning?
∑(ŷi - ȳ)²
Sum of (predicted value minus mean of the dependent variable), squared
Measures how well the line fits the data
If SSR = SST –> The regression line is perfect, meaning all the points are ON the line
Sum of squares error (SSE) formula plus meaning?
∑ei², where ei = yi - ŷi
Difference between the observed value and the predicted value
The smaller the error, the better the estimation power of the regression
What is the connection between SST, SSR & SSE
SST = SSR + SSE
In words: The total variability of the dataset is equal to the variability explained by the regression line plus the unexplained variability (error)
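The identity SST = SSR + SSE can be verified numerically for any OLS fit with an intercept. A sketch with synthetic data (the relationship and seed are assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 80)
y = 1 + 0.7 * x + rng.normal(0, 1, 80)

b1, b0 = np.polyfit(x, y, 1)        # least-squares slope and intercept
y_hat = b0 + b1 * x
y_bar = y.mean()

sst = np.sum((y - y_bar) ** 2)      # total variability
ssr = np.sum((y_hat - y_bar) ** 2)  # variability explained by the line
sse = np.sum((y - y_hat) ** 2)      # unexplained variability (error)
# For OLS with an intercept, sst equals ssr + sse up to floating-point error
```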
What is OLS?
OLS –> Ordinary least squares
Most common method to estimate the linear regression equation
Least squares stands for minimizing the SSE (error) –> Lower error –> better explanatory power
OLS is the line with the smallest error –> Closest to all points simultaneously
There are other methods to calculate regression. OLS is simple and powerful enough for most problems
What is R-squared and how to interpret it?
R-squared –> How well your model fits your data
Intuitive tool when in the right hands
R² = SSR / SST
R² = 0 –> Regression explains NONE of the variability
R² = 1 –> Regression explains ALL of the variability
What you will typically observe are values from 0.2 to 0.9