Regression Flashcards
KNN Classification - training and predicting
Training: store all the data
Prediction:
1. calculate the distance from x to all points in your dataset
2. sort the points in your dataset by increasing distance from x
3. predict the majority label of the k closest points (see the sketch below)
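A minimal sketch of those prediction steps in Python (NumPy, Euclidean distance, made-up toy data):

```python
from collections import Counter

import numpy as np

def knn_predict(X_train, y_train, x, k):
    # 1. distance from x to every point in the training set (Euclidean)
    dists = np.linalg.norm(X_train - x, axis=1)
    # 2. sort points by increasing distance and keep the k closest
    nearest = np.argsort(dists)[:k]
    # 3. predict the majority label among those k points
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[0, 0], [1, 1], [5, 5], [6, 5]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.5, 0.5]), k=3))  # -> 0
```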
Distance
- euclidean distance, manhattan distance, cosine distance = 1 - cosine similarity
KNN Regression
We take the average of the target values of the k nearest neighbors
KNN Hyperparameters
K
- how many neighbors we use. General rule of thumb: k = sqrt(n), then do your grid search from there (see the sketch below).
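A rough sketch of that workflow with scikit-learn (synthetic data; searching a window around sqrt(n) is just one reasonable choice):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=400, n_features=5, random_state=0)

# start around k = sqrt(n), then grid search from there
start_k = int(np.sqrt(len(X)))  # 20 for n = 400
param_grid = {"n_neighbors": list(range(max(1, start_k - 10), start_k + 11))}

search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```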
KNN - noise vs signal
lower k tends to overfit - it captures more of the noise along with the signal
higher k tends to underfit - it captures less noise, but also less of the signal
standardization
use standardization when your data has varying scales
(data point - mean) / standard deviation
rescales each feature to mean = 0 and standard deviation = 1
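A small sketch showing the formula by hand and the equivalent scikit-learn StandardScaler (made-up numbers):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# e.g. height in cm and income in dollars: very different scales
X = np.array([[170.0, 60000.0],
              [160.0, 45000.0],
              [185.0, 90000.0]])

# (data point - mean) / standard deviation, per feature (column)
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

# the same thing with scikit-learn
X_scaled = StandardScaler().fit_transform(X)
print(np.allclose(X_manual, X_scaled))  # True: each column now has mean 0, std 1
```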
KNN (pros, cons)
pros:
- super simple
- training is trivial
- easy to add more data
- few hyperparameters
cons:
- if you have a lot of features, you need a lot more data, and gathering more data can be costly
- high prediction cost
- bad with high dimensions (the curse of dimensionality) - distances become less meaningful, and performance can degrade even with only a handful of features
- categorical features don’t work well
Mean Squared Error (MSE)
expected value of the square of the error
MSE = (1/n) * sum( (predicted - actual)**2 )
- the average squared difference between the estimated values and the actual values
- mean of all the square errors in your model and data
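A quick worked example of the formula (made-up numbers):

```python
import numpy as np

actual = np.array([3.0, 5.0, 7.0])
predicted = np.array([2.5, 5.0, 8.0])

# average of the squared differences: (0.25 + 0.0 + 1.0) / 3
mse = np.mean((predicted - actual) ** 2)
print(mse)  # 0.4166...
```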
Irreducible error
Error that we can’t do anything about. Even if we had all possible data and could build a perfect model, we can’t predict values exactly.
Bias and variance
- errors that we can control
bias = failing to capture some of the signal (underfitting) - we are consistently off in the same way
variance = error from being too sensitive to the particular training data, so predictions on new real-world data are inconsistent
Together they describe where the errors come from and how consistently we’re off.
When capturing more signal, you naturally capture more noise too, which increases variance.
k fold cross validation
train test split - first reserve data for the final test set.
then do k-fold cross-validation on the remaining data: in each fold, one part serves as the validation set and the rest as the training set (see the sketch below).
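A minimal sketch of that split-then-cross-validate workflow with scikit-learn (synthetic data, linear regression as a stand-in model):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_regression(n_samples=200, n_features=3, noise=10, random_state=0)

# reserve data for the ultimate test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# k-fold cross-validation (k=5) on the training portion only
scores = cross_val_score(LinearRegression(), X_train, y_train, cv=5)
print(scores.mean())  # average validation score across the 5 folds
```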
churn
decision rules
Linear Regression - scatterplot
It is good practice to plot a scatterplot of the dependent variable against each independent variable; if the relationship looks linear, that is a good hint that linear regression is a suitable learning algorithm.
Feature Engineering
anytime you use your current features to create new features.
Linear Regression with single feature
y = mx + b
Linear Regression - how to pick the best line
Residual = the distance between our predicted value and the actual value
Find the line that minimizes the total sum of squared residuals.
Linear Regression with multiple features
the linear relationship is:
y = B0 + B1x1 + B2x2 + ... + Bpxp
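A small sketch fitting that multi-feature equation with scikit-learn (synthetic data; the true coefficients are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))  # three features: x1, x2, x3
# true relationship: y = 2.0 + 1.5*x1 - 3.0*x2 + 0*x3 + noise
y = 2.0 + 1.5 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.5, size=100)

model = LinearRegression().fit(X, y)
print(model.intercept_)  # estimate of B0 (close to 2.0)
print(model.coef_)       # estimates of B1, B2, B3 (close to 1.5, -3.0, 0)
```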
Linear Regression - R-squared
gives you a sense of how well your model performs - how much of the variance in the data you’re capturing. It tells you how much better your model is doing than the dumb model (always predicting the mean). Close to 0 = no better than the dumb model; close to 1 = you’re explaining nearly all of the variance (suspiciously high values can be a sign of overfitting).
Something to watch out for: the more features you add, the higher r-squared gets, even if the features are useless. Hence adjusted r-squared is a better measure.
Linear Regression - Adjusted r-squared
r-squared penalized for the number of features. It is a better measure of your model’s fit because plain r-squared only goes up as you add features.
How do you know if a feature and dependent variable has a linear relationship?
You can plot the residuals (y_predicted - y_actual) against that feature. We want to see the residuals normally distributed around 0 with no trend.
studentized residuals
divide each residual by an estimate of the standard deviation of the residuals.
This helps you find outliers. You can remove a suspect point, refit, and plot the residuals again to check whether it was an outlier.
homoscedasticity vs. heteroscedasticity
homoscedasticity = when the variance of your residuals is constant (heteroscedasticity = when it is not).
how to test?
- divide your data into 2 parts; if the residual variance on the left roughly equals the variance on the right, it’s homoscedastic, otherwise heteroscedastic
- sm.stats.diagnostic.het_goldfeldquandt, then look at the p-value. Null hypothesis = homoscedasticity (see the sketch below)
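A minimal sketch of the Goldfeld-Quandt test with statsmodels (synthetic, homoscedastic data for illustration):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_goldfeldquandt

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 3 * x + rng.normal(scale=1.0, size=200)  # noise variance does not depend on x
X = sm.add_constant(x)

# Goldfeld-Quandt test: null hypothesis = homoscedasticity
fval, pval, _ = het_goldfeldquandt(y, X)
print(pval)  # large p-value -> fail to reject homoscedasticity
```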
Normality
residuals are normally distributed
To see if the residuals are normally distributed, look at a QQ plot - if the points follow the reference line, they are approximately normal.
We can also use the Jarque-Bera (JB) test and look at the p-value (see the sketch below).
Null: the residuals follow a normal distribution
Alt: they do not
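A small sketch of both checks (synthetic stand-in residuals; sm.qqplot needs matplotlib):

```python
import matplotlib.pyplot as plt
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(0)
residuals = rng.normal(size=300)  # stand-in for a model's residuals

# QQ plot: points should hug the reference line if the residuals are normal
sm.qqplot(residuals, line="s")
plt.show()

# Jarque-Bera: null = normality, alt = not normal
jb_stat, p_value = stats.jarque_bera(residuals)
print(p_value)  # large p-value -> fail to reject normality
```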
Multicollinearity
When two or more features depend on each other, e.g., height and weight.
- It means you can’t be confident about the coefficient estimates, so you may have to remove one or more of the correlated features.
- Variance inflation factor (VIF). If VIF > 10, it’s a good indication that you have multicollinearity (see the sketch below).
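A rough sketch of computing VIFs with statsmodels (synthetic height/weight data engineered to be collinear):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
height = rng.normal(170, 10, size=200)
weight = 0.9 * height + rng.normal(0, 2, size=200)  # strongly depends on height
age = rng.normal(40, 12, size=200)

X = sm.add_constant(pd.DataFrame({"height": height, "weight": weight, "age": age}))
# VIF for each column (index 0 is the constant, which we can ignore)
vifs = {col: variance_inflation_factor(X.values, i) for i, col in enumerate(X.columns)}
print(vifs)  # height and weight should both come out well above 10
```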
Linear Regression (prediction vs inferential)
- when there’s a linear relation between outcome and features
- Goal: find the coefficients of the features
- what coefficients should I choose to minimize my sum of squared residuals? - the difference between my observations and predictions
When I’m only making predictions and don’t care about inferring the coefficients, I can simply build a linear regression model and predict.
If you want to interpret the model - to see how confident you are in each coefficient - then the following conditions must hold:
- Linearity
- Homoscedascity
- No multicollinearity
- Normality
- All data is i.i.d. (independent and identically distributed)
Linear Regression - Top bottom, bottom top
bottom top (forward selection): try each feature on its own, start with the one giving you the lowest RMSE, then add a second, a third, and so on, checking whether each addition reduces RMSE.
top bottom (backward elimination): start with all features and remove them one by one.
Regularized regression - what does it do?
It introduces a penalty into linear regression that lets us adjust the bias-variance tradeoff.
It adds an alpha-weighted penalty to the loss. The bigger the alpha, the smaller the betas get.
When the betas are small, the model is less sensitive to each individual feature, so it is less likely to overfit.
regularized regression - criteria
All predictors need to be on the same scale.
standardized_feature = (raw_feature - mean(raw_feature))/st_dev ( raw_feature)
ridge regression
penalty added to the sum of squared residuals: alpha * sum(beta**2)
As alpha increases, it pushes all betas proportionately closer to 0.
overfitting and underfitting
When overfitting, we’re capturing a lot of signal but we’re also introducing a lot of noise, so when we feed real-world data into the model it gives us inconsistent results. When underfitting, we miss part of the signal and are consistently off.
lasso regression
penalty added to the sum of squared residuals: alpha * sum(|beta|)
As alpha increases, some betas drop to exactly 0 sooner than others.
Benefit: it allows for feature selection - if I believe some features matter more than others, I want the features that matter more to stay in the model longer as I increase alpha, and the features that matter less to drop out (see the sketch below).
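A small sketch comparing ridge and lasso shrinkage as alpha grows (synthetic scikit-learn data; the exact coefficient values are only illustrative):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=6, n_informative=3,
                       noise=10, random_state=0)
X = StandardScaler().fit_transform(X)  # all predictors on the same scale

for alpha in [0.1, 10, 100]:
    ridge = Ridge(alpha=alpha).fit(X, y)
    lasso = Lasso(alpha=alpha).fit(X, y)
    # ridge: all betas shrink toward 0 as alpha grows, but rarely hit exactly 0
    # lasso: the less important betas drop to exactly 0 first (feature selection)
    print(alpha, np.round(ridge.coef_, 1), np.round(lasso.coef_, 1))
```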
Logistic Regression - decision boundary
setting a threshold on the predicted probability to decide between class 1 and class 2. We set this boundary based on business needs.
Logistic Regression - find the best model
minimize log loss
how do you define which model is better in logistic regression?
ROC curve
it works like this: false positive rate on the x-axis, true positive rate on the y-axis. Then sweep the threshold and see how FPR and TPR change.
The bigger the area under the curve (AUC), the better our model is doing (see the sketch below).
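A minimal sketch of computing the ROC curve and AUC with scikit-learn (synthetic data, logistic regression):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]  # predicted probability of the positive class

fpr, tpr, thresholds = roc_curve(y_test, probs)  # FPR and TPR at each threshold
print(roc_auc_score(y_test, probs))  # area under the curve: closer to 1 is better
```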
logistic regression - performance
depending on which class you’re interested in, we adjust our performance measurement using a confusion matrix. For example, if catching positives is more important, we want to reduce false negatives.
Whenever accuracy is not a good indicator, use a confusion matrix and measures like recall and precision.
linear regression - performance
whichever model with the lowest RMSE is the better model
what does it mean that logistic regression is a soft classifier?
it provides probabilities of an outcome belonging to one class vs. the other, rather than only a hard label.
confusion matrix
it is a table used to describe the performance of a classification model.
Accuracy = how often is the classifier correct?
Recall = when it’s actually yes, how often does it predict yes?
Precision = when it predicts yes, how often is it correct?
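A quick sketch of those three measures with scikit-learn on made-up labels:

```python
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print(confusion_matrix(y_true, y_pred))  # rows = actual class, columns = predicted class
print(accuracy_score(y_true, y_pred))    # how often is the classifier correct?
print(recall_score(y_true, y_pred))      # of the actual yeses, how many did it predict yes?
print(precision_score(y_true, y_pred))   # of the predicted yeses, how many were actually yes?
```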
how do you deal with imbalanced class?
Before you do any of these, make sure to do your train test split first; the test set should reflect the real-world class distribution as accurately as possible, so only resample the training data.
- oversampling - duplicate minority-class rows; one way to do it is to bootstrap (sample with replacement)
- undersampling - appropriate when you have more data than you can process, or when data in the majority class is less important to the result. E.g., with 1 million rows and only 100,000 in the minority class, randomly sample 100,000 rows from the majority class.
- SMOTE - it generates new synthetic minority-class points by interpolating between pairs of existing minority samples, but in practice the results are often not great.
using a confusion matrix in a business case
- build a model and create a confusion matrix
- assign a dollar value to each cell of the confusion matrix based on the business use case
- multiply them together and sum to calculate profit
- test a range of thresholds to see which one gives the highest profit (see the sketch below)
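An illustrative sketch of that threshold sweep (the dollar values, labels, and probabilities are entirely made up):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def profit_at_threshold(y_true, probs, threshold, value_matrix):
    y_pred = (probs >= threshold).astype(int)
    cm = confusion_matrix(y_true, y_pred, labels=[0, 1])  # [[TN, FP], [FN, TP]]
    return np.sum(cm * value_matrix)

# hypothetical dollar values for [[TN, FP], [FN, TP]],
# e.g. cost of a wasted offer vs. value of a correctly targeted customer
value_matrix = np.array([[0, -10],
                         [-100, 50]])

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
probs = np.array([0.1, 0.4, 0.35, 0.8, 0.9, 0.6, 0.7, 0.2])

for threshold in np.arange(0.1, 1.0, 0.1):
    print(round(threshold, 1), profit_at_threshold(y_true, probs, threshold, value_matrix))
```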