Regression Flashcards

1
Q

KNN Classification - training and predicting

A

Training: store all the data
Prediction:
1. calculate the distance from x to all points in your dataset
2. sort the points in your dataset by increasing distance from x
3. predict the majority label of the k closest points (see the sketch below)

Distance
- Euclidean distance, Manhattan distance, cosine distance = 1 - cosine similarity
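
A minimal sketch of the prediction steps above, assuming NumPy arrays X_train and y_train (hypothetical names) and Euclidean distance:

    import numpy as np
    from collections import Counter

    def knn_predict(x, X_train, y_train, k=5):
        # 1. distance from x to every point in the dataset
        distances = np.linalg.norm(X_train - x, axis=1)
        # 2. sort the points by increasing distance and keep the k closest
        nearest_idx = np.argsort(distances)[:k]
        # 3. predict the majority label among those k neighbors
        return Counter(y_train[nearest_idx]).most_common(1)[0][0]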

2
Q

KNN Regression

A

We take the average of the target values of the k nearest neighbors.

3
Q

KNN Hyperparameters

A

K

- the number of neighbors to use. General rule of thumb: start with k = sqrt(n), then grid search from there (see the sketch below).
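
A hedged sketch of that rule of thumb followed by a grid search, using scikit-learn (X_train and y_train are placeholder names):

    import numpy as np
    from sklearn.model_selection import GridSearchCV
    from sklearn.neighbors import KNeighborsClassifier

    start_k = int(np.sqrt(len(X_train)))            # rule of thumb: k ~ sqrt(n)
    param_grid = {"n_neighbors": list(range(max(1, start_k - 10), start_k + 11))}

    search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
    search.fit(X_train, y_train)
    print(search.best_params_)                      # the k that scored best in cross-validation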

4
Q

KNN - noise vs signal

A

lower k tends to overfit

higher k tends to underfit - it captures less noise, but also less signal

5
Q

standardization

A

use standardization when your features are on varying scales
standardized value = (data point - mean) / standard deviation

this rescales each feature to mean 0 and standard deviation 1 (see the sketch below)
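
The same formula, sketched with scikit-learn's StandardScaler (assuming the scaler is fit on training data only; X_train and X_test are placeholder names):

    from sklearn.preprocessing import StandardScaler

    scaler = StandardScaler()                      # (data point - mean) / standard deviation, per column
    X_train_std = scaler.fit_transform(X_train)    # learn the mean and std from the training data
    X_test_std = scaler.transform(X_test)          # reuse the same mean and std on the test data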

6
Q

KNN (pros, cons)

A

pros:
- super simple
- training is trivial
- easy to add more data
- few hyperparameters

cons:
- if you have a lot of features, you need a lot more data, and gathering more data can be costly
- high prediction cost
- bad with high dimensions; anything beyond about 5 features tends to perform poorly
- categorical features don't work well

7
Q

Mean Squared Error (MSE)

A

expected value of the square of the error
MSE = (1/n) * sum((predicted - actual)**2)
- the average squared difference between the estimated values and the actual values, across all points in your data (worked example below)
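
A quick worked example of the formula in NumPy:

    import numpy as np

    actual = np.array([3.0, 5.0, 2.0])
    predicted = np.array([2.5, 5.5, 2.0])
    mse = np.mean((predicted - actual) ** 2)   # (0.25 + 0.25 + 0.0) / 3 ~ 0.167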

8
Q

Irreducible error

A

Error that we can't do anything about. Even if we had all possible data and could build a perfect model, we still couldn't predict values exactly.

9
Q

Bias and variance

A
  • errors that we can control
bias = failing to capture some of the signal (underfit)
variance = error we see when the model meets new real-world data; it reflects how inconsistently we're off from sample to sample (overfit)

When capturing more signal, you naturally capture more noise as well, which increases variance.

10
Q

k fold cross validation

A

train test split - reserve data for the final test set.
then do k-fold cross-validation on the remaining data: each fold takes a turn as the validation set while the other folds are used for training (see the sketch below)
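
A short sketch of that workflow with scikit-learn (the model and the X / y arrays are placeholders):

    from sklearn.model_selection import train_test_split, cross_val_score
    from sklearn.neighbors import KNeighborsRegressor

    # reserve an untouched test set first
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    # k-fold cross-validation on the training data only (5 folds here)
    scores = cross_val_score(KNeighborsRegressor(n_neighbors=5), X_train, y_train, cv=5)
    print(scores.mean())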

11
Q

churn

A

decision rules

12
Q

Linear Regression - scatterplot

A

It is good practice to plot a scatter plot of the dependent variable against each independent variable; if the relationship appears linear, that is a good hint that linear regression is a suitable learning algorithm.

13
Q

Feature Engineering

A

anytime you use your current features to create new features.

14
Q

Linear Regression with single feature

A

y = mx + b

15
Q

Linear Regression - how to pick the best line

A

Residual = the distance between our predicted value and the actual value

Find the line that minimizes the total sum of squared residuals.
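
A minimal sketch of finding that line by least squares on toy data (the numbers are made up for illustration):

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0])
    y = np.array([2.1, 3.9, 6.2, 8.1])

    m, b = np.polyfit(x, y, deg=1)          # slope and intercept minimizing the sum of squared residuals
    residuals = y - (m * x + b)
    print(m, b, np.sum(residuals ** 2))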

16
Q

Linear Regression with multiple features

A

the linear relationship is:

y = B0 + B1x1 + B2x2 + ... + Bpxp

17
Q

Linear Regression - R-squared

A

gives you a sense of how well your model performs - how much of the variance in the data you're capturing. It tells you how much better your model is doing than a dumb (mean-only) model. Close to 0 = no better than the dumb model; close to 1 = capturing nearly all the variance (and possibly overfitting).

Something to watch out for: the more features you have, the higher R-squared gets. Hence adjusted R-squared is a better measure.

18
Q

Linear Regression - Adjusted r-squared

A

normalized for the number of features. It is a better measure of your model's accuracy because plain R-squared rises as the number of features grows (see the sketch below).
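
The usual adjustment, sketched in Python, where n is the number of observations and p the number of features:

    def adjusted_r2(r2, n, p):
        # penalize R-squared for the number of features p relative to the sample size n
        return 1 - (1 - r2) * (n - 1) / (n - p - 1)

    print(adjusted_r2(r2=0.80, n=100, p=10))   # ~ 0.78, lower than the raw R-squared of 0.80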

19
Q

How do you know if a feature and dependent variable has a linear relationship?

A

You can plot the residuals (y_predicted - y_actual). We want to see the residuals distributed roughly normally around 0 and showing no trend (see the sketch below).
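
A minimal residual-plot sketch with matplotlib (y_predicted and y_actual are placeholder arrays):

    import matplotlib.pyplot as plt

    residuals = y_predicted - y_actual
    plt.scatter(y_predicted, residuals)   # residuals against predictions
    plt.axhline(0)
    plt.show()                            # want: points scattered evenly around 0 with no visible trend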

20
Q

studentized residuals

A

divide each residual by an estimate of the standard deviation of the residuals.
It helps you find outliers. You can remove a suspect point from the data and plot the residuals again to confirm (see the sketch below).
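
A rough sketch of that definition (y_actual and y_predicted are placeholder arrays); statsmodels can also compute studentized residuals from a fitted OLS model:

    import numpy as np

    residuals = y_actual - y_predicted
    studentized = residuals / residuals.std(ddof=1)      # divide by the estimated std of the residuals
    outliers = np.where(np.abs(studentized) > 3)[0]      # |value| > 3 is a common outlier rule of thumb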

21
Q

homoscedasticity vs. heteroscedasticity

A

homoscedasticity = when the variance of your residuals is constant.

how to test?

  1. Divide your data into 2 parts; if the residual variance on the left equals the variance on the right, it's homoscedastic, otherwise it's heteroscedastic
  2. sm.stats.diagnostic.het_goldfeldquandt, then look at the p-value. Null hypothesis = homoscedasticity (see the sketch below)
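
A sketch of the second check (X_features and y are placeholder names); the null hypothesis is homoscedasticity, so a small p-value suggests heteroscedasticity:

    import statsmodels.api as sm
    from statsmodels.stats.diagnostic import het_goldfeldquandt

    X = sm.add_constant(X_features)
    fval, pval, _ = het_goldfeldquandt(y, X)
    print(pval)
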
22
Q

Normality

A

residuals are normally distributed

To see if residuals are normally distributed, we look at a QQ plot - if the points follow the reference line, the residuals are normal.

We can also use the Jarque-Bera (JB) test. Look at the p-value.
Null: the residuals follow a normal distribution
Alt: they don't follow a normal distribution (see the sketch below)
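
A sketch of both checks, assuming residuals is an array of residuals from your fitted model:

    import statsmodels.api as sm
    from scipy import stats

    sm.qqplot(residuals, line="s")                     # points hugging the line suggests normality
    jb_stat, jb_pvalue = stats.jarque_bera(residuals)  # null hypothesis: residuals are normal
    print(jb_pvalue)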

23
Q

Multicollinearity

A

When you have two or more features that depend on each other, e.g. height and weight.

  • It means you can't be confident about your coefficient estimates, so you may have to remove one or more features to be confident about the ones that remain
  • Variance inflation factor (VIF). If VIF > 10, it's a good indication that you have multicollinearity (see the sketch below)
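
A sketch of the VIF check with statsmodels (df and the column names are hypothetical):

    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    X = sm.add_constant(df[["height", "weight", "age"]])
    vif = pd.Series(
        [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
        index=X.columns,
    )
    print(vif)   # VIF > 10 on a feature is a common multicollinearity red flag
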
24
Q

Linear Regression (prediction vs inferential)

A
  1. when there’s a linear relation between outcome and features
  2. Goal: find the coefficients of the features
  3. what coefficients should I choose to minimize my sum of squared residuals? - the difference between my observations and predictions

When I'm making predictions and I don't care about inferring the coefficients, I just build a linear regression model and predict.

If you want to interpret the data - if you want to know how confident you are in each coefficient - then the following conditions have to hold:

  1. Linearity
  2. Homoscedasticity
  3. No multicollinearity
  4. Normality
  5. All data is i.i.d. (independent and identically distributed)
25
Q

Linear Regression - Top bottom, bottom top

A

bottom top (forward selection): start with the single feature that gives you the lowest RMSE, then add a second, a third, and so on, keeping each addition only if it reduces RMSE.

top bottom (backward elimination): start with all features and remove them one by one.

26
Q

Regularized regression - what does it do?

A

It introduces a parameter into linear regression that lets us adjust the bias-variance tradeoff.

It adds an alpha (penalty strength) to the linear regression objective. The bigger alpha gets, the smaller the betas get.

When the betas are small, the model is less sensitive to each individual feature, so it is less likely to overfit.

27
Q

regularized regression - criteria

A

All predictors need to be on the same scale.

standardized_feature = (raw_feature - mean(raw_feature))/st_dev ( raw_feature)

28
Q

ridge regression

A

penalty added to the loss: alpha * sum(beta**2)   (the L2 penalty)

As alpha increases, it pushes all betas proportionately closer to 0 (see the sketch below)
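
A sketch of that shrinkage with scikit-learn (X_train_std and y_train are placeholders; features assumed standardized):

    from sklearn.linear_model import Ridge

    for alpha in [0.1, 1.0, 10.0, 100.0]:
        model = Ridge(alpha=alpha).fit(X_train_std, y_train)
        print(alpha, model.coef_)   # all coefficients shrink toward 0 as alpha grows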

29
Q

overfitting and underfitting

A

When overfitting, we're capturing a lot of signal but we're also introducing a lot of noise. So when we feed real-world data into the model, it gives inconsistent results.

30
Q

lasso regression

A

penalty added to the loss: alpha * sum(|beta|)   (the L1 penalty)
As alpha increases, some betas drop to 0 sooner than other betas.

Benefit: it allows for feature selection - I believe some features matter more than others, and I want the features that matter more to stay in the model longer as I increase alpha, while the features that matter less drop out (see the sketch below).
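
A sketch of that feature-selection effect with scikit-learn (X_train_std and y_train are placeholders; features assumed standardized):

    from sklearn.linear_model import Lasso

    for alpha in [0.01, 0.1, 1.0]:
        model = Lasso(alpha=alpha).fit(X_train_std, y_train)
        n_kept = (model.coef_ != 0).sum()
        print(alpha, n_kept)   # fewer non-zero coefficients as alpha grows = features dropping out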

31
Q

Logistic Regression - decision boundary

A

setting a threshold on the predicted probability to determine class 1 vs class 2. We set this boundary based on business needs

32
Q

Logistic Regression - find the best model

A

minimize log loss

33
Q

how do you define which model is better in logistic regression?

A

ROC curve

it works like this: false positive rate on x, true positive rate on y. Then move the threshold and watch how the FPR and TPR change.

The bigger the area under this curve (AUC), the better our model is doing (see the sketch below).
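
A sketch with scikit-learn (model, X_test and y_test are placeholders for a fitted soft classifier and held-out data):

    from sklearn.metrics import roc_curve, roc_auc_score

    probs = model.predict_proba(X_test)[:, 1]        # predicted probability of the positive class
    fpr, tpr, thresholds = roc_curve(y_test, probs)  # one (fpr, tpr) point per threshold
    print(roc_auc_score(y_test, probs))              # bigger area under the curve = better model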

34
Q

logistic regression - performance

A

it depends on which class you're interested in; adjust your performance measurement using a confusion matrix. For example, if the positive class matters more, you want to reduce false negatives, etc.

Whenever accuracy is not a good indicator, use a confusion matrix and measures like recall, precision, etc.

35
Q

linear regression - performance

A

whichever model has the lowest RMSE is the better model

36
Q

what does it mean that logistic regression is a soft classifier?

A

it is able to provide the probability of an outcome belonging to one class vs the other, rather than only a hard label.

37
Q

confusion matrix

A

it is a table used to describe the performance of a classification model.
Accuracy = how often is the classifier correct?
Recall = when it’s actually yes, how often does it predict yes?
Precision = when it predicts yes, how often is it correct?
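
A sketch of those three measures from a binary confusion matrix (y_test and y_pred are placeholders):

    from sklearn.metrics import confusion_matrix

    tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
    accuracy  = (tp + tn) / (tp + tn + fp + fn)   # how often the classifier is correct
    recall    = tp / (tp + fn)                    # of the actual yeses, how many were predicted yes
    precision = tp / (tp + fp)                    # of the predicted yeses, how many were correct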

38
Q

how do you deal with imbalanced class?

A

Before you do any of these, make sure to do the train test split first - the test set should reflect real-world data (including its class balance) as accurately as possible.

  1. oversampling - one way to do it is to bootstrap the minority class
  2. undersampling - appropriate when you have more data than you can process, or when data in the majority class is less important to the result. E.g. with 1 million rows and only 100,000 in the minority class, randomly sample 100,000 rows from the majority class.
  3. SMOTE - it generates new synthetic data points by interpolating between pairs of minority-class samples, but the results are usually not good (see the sketch below)
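
A sketch of options 1 and 2 with scikit-learn's resample (minority_df and majority_df are hypothetical DataFrames already split by class, taken from the training data only):

    from sklearn.utils import resample

    # oversampling: bootstrap the minority class up to the majority-class size
    minority_up = resample(minority_df, replace=True,
                           n_samples=len(majority_df), random_state=0)

    # undersampling: randomly sample the majority class down to the minority-class size
    majority_down = resample(majority_df, replace=False,
                             n_samples=len(minority_df), random_state=0)
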
39
Q

using a confusion matrix in a business case

A
  1. build a model and create a confusion matrix
  2. assign a dollar value to each cell of the confusion matrix based on the business use case
  3. multiply the two matrices together and sum to calculate profit
  4. test a range of thresholds to see which one gives the highest profit (see the sketch below)
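
A sketch of those steps (probs and y_test are placeholders; the dollar values are made up and laid out like sklearn's confusion matrix, [[TN, FP], [FN, TP]]):

    import numpy as np
    from sklearn.metrics import confusion_matrix

    profit_matrix = np.array([[0, -50],
                              [-200, 300]])          # hypothetical dollar value per outcome

    for threshold in np.arange(0.1, 0.95, 0.05):
        y_pred = (probs >= threshold).astype(int)
        profit = (confusion_matrix(y_test, y_pred) * profit_matrix).sum()
        print(round(threshold, 2), profit)           # pick the threshold with the highest profit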