SKlearn Flashcards
What is the independent variable called in sklearn?
Feature
What is the dependent variable called?
Output, Target
Find the R-squared in sklearn
reg.score(x_matrix,y)
Notice x has been reshaped to a 2D array (a single-column matrix)
Find the coefficients in sklearn
reg.coef_
Result is an array containing all coefficients
Find the intercept in sklearn
reg.intercept_
–> Returns a float
Making predictions in sklearn
reg.predict(input)
Returns an array, not a float, because the predict method can take more than one input at once
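The cards above can be sketched end to end; a minimal example on synthetic data (the numbers here are made up for illustration):

```python
# Fit a simple linear regression, then inspect R-squared, coefficients,
# intercept, and predictions (synthetic data).
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([1, 2, 3, 4, 5])                # one feature, 1D
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])

x_matrix = x.reshape(-1, 1)                  # sklearn expects 2D: samples x features
reg = LinearRegression()
reg.fit(x_matrix, y)

r_squared = reg.score(x_matrix, y)           # R-squared, a float
coefs = reg.coef_                            # array with one coefficient per feature
intercept = reg.intercept_                   # a float
preds = reg.predict(np.array([[6], [7]]))    # array: one prediction per input row
```

Note that `predict` returns an array even for a single input row.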
What is the ML word for observation?
Sample
Each row in the dataset is a sample
How to calculate Adjusted R-squared in Python?
Set a cell to Markdown to document the formula
Put the formula in Python –> Adjusted R² = 1 − (1 − R²)(n − 1) / (n − p − 1)
r2 = reg.score(x,y)
n = x.shape[0]
p = x.shape[1]
Notice that x does not need to be reshaped because it already contains 2 variables (it is already 2D). Then plug these variables into the formula.
Remember: Adjusted R-squared builds on R-squared and penalizes for the number of variables included in the model
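The steps above, sketched on synthetic two-feature data, using the standard adjusted R-squared formula:

```python
# Adjusted R-squared: 1 - (1 - R2) * (n - 1) / (n - p - 1), where n is the
# number of samples and p the number of features (synthetic data).
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.normal(size=(50, 2))                       # 2 features, so no reshape needed
y = 3 * x[:, 0] - 2 * x[:, 1] + rng.normal(scale=0.1, size=50)

reg = LinearRegression().fit(x, y)
r2 = reg.score(x, y)
n = x.shape[0]                                     # number of samples (rows)
p = x.shape[1]                                     # number of features (columns)
adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
```

Because the formula penalizes extra features, the adjusted value is never larger than the plain R-squared.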
What is the advantage of feature selection?
Simplifies models
Improves speed and prevents a series of unwanted issues arising from having too many features
What can you do with the F-statistic?
Test whether model has merit
Null Hypothesis is that all betas are equal to 0 –> H0: ß1 = ß2 = ß3 = 0
If all betas are 0, then the model is useless
What is an F-Statistic?
Similar to a T statistic from a T-test
T-test will tell you if a single variable is statistically significant
F-test will tell you if a group of variables is jointly significant
Based on hypothesis that all betas are equal to 0 –> H0: ß1 = ß2 = ß3 = 0
How to interpret the P-value in the results table?
A low P-value (< 0.05) means that the coefficient likely does not equal zero.
A high P-value (> 0.05) means that we cannot conclude that the explanatory variable affects the dependent variable (here: if Average_Pulse affects Calorie_Burnage).
A high P-value is also called an insignificant P-value.
How is the P-value denoted in the results table?
P>|t|
How to interpret the F-statistic? And the P-value change?
Compare the F-statistic with and without the variable –> A lower F-statistic means the model is closer to non-significant
Prob(F-statistic) can still be significant, but notice the change –> If it's higher, then drop the variable
What will this return?
from sklearn.feature_selection import f_regression
f_regression(x,y)
2 Arrays
1 with the F-statistics
1 with the corresponding p-values –> each one the Prob(F-statistic) of that simple regression
What does feature_selection.f_regression do?
It runs a simple linear regression of each feature against the dependent variable
from sklearn.feature_selection import f_regression
f_regression(x,y)
How to extract the p-values from regression results?
p_values = f_regression(x,y)[1]
Since the first array (index 0) contains the F-statistics
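A short sketch of both cards, on synthetic data where only the first feature is actually related to y:

```python
# f_regression returns two arrays: the F-statistics and the p-values of a
# simple regression of each feature against y (synthetic data).
import numpy as np
from sklearn.feature_selection import f_regression

rng = np.random.default_rng(2)
x = rng.normal(size=(80, 3))
y = 2 * x[:, 0] + rng.normal(size=80)      # only feature 0 drives y

f_statistics, p_values = f_regression(x, y)

# extracting the p-values by index works the same way
p_values_alt = f_regression(x, y)[1]
```

Each array has one entry per feature; feature 0 should come out highly significant.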
Do p-values reflect the interconnection of the features in our multiple linear regression?
No
Why is Feature Scaling / Standardization needed?
A common problem when working with numerical data is a difference in magnitude between features
Feature scaling transforms the data into a standard scale so all numbers are of the same magnitude
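A minimal sketch using sklearn's StandardScaler on made-up data whose columns differ greatly in magnitude:

```python
# StandardScaler standardizes each column to mean 0 and standard deviation 1.
import numpy as np
from sklearn.preprocessing import StandardScaler

x = np.array([[1.0, 1000.0],
              [2.0, 2000.0],
              [3.0, 3000.0]])     # columns on very different scales

scaler = StandardScaler()
scaler.fit(x)                     # learns the per-column mean and std
x_scaled = scaler.transform(x)    # both columns now have the same magnitude
```

Keeping the fitted `scaler` object around matters: the same object must later transform any new input.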
What is feature scaling also called?
Standardization / Normalization
What are coefficients and intercept called in Machine Learning
Weights and Bias
What is the reason SKlearn does not really support p-values?
Most ML practitioners perform some kind of feature scaling, after which the (even very small) weights of the variables become directly comparable, so p-values are less necessary.
What would you need to do after standardizing the dataset?
After standardizing the dataset, the input for predict must be standardized as well
The input must be standardized in the same way (using the same fitted scaler)
x_simple_matrix = x_scaled[:,0].reshape(-1,1)
What happens here?
You are taking the first column out of x_scaled and reshaping it into a 2D single-column array, as sklearn expects
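The last two cards combined, as a sketch on synthetic data: slice one scaled column for a simple regression, and push any new observation through the same scaler before predicting.

```python
# After standardizing, new inputs for predict must be transformed with the
# SAME fitted scaler; [:, 0].reshape(-1, 1) keeps the first column as 2D.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
x = rng.normal(loc=50, scale=10, size=(40, 2))
y = 0.5 * x[:, 0] + rng.normal(size=40)

scaler = StandardScaler().fit(x)
x_scaled = scaler.transform(x)

# keep only the first (scaled) feature, reshaped to a 2D column
x_simple_matrix = x_scaled[:, 0].reshape(-1, 1)
reg = LinearRegression().fit(x_simple_matrix, y)

# a new observation goes through the same scaler before predict
new_obs = scaler.transform([[55.0, 48.0]])
prediction = reg.predict(new_obs[:, 0].reshape(-1, 1))
```

Transforming new inputs with a freshly fitted scaler (instead of the original one) would silently shift the predictions.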