Regression Flashcards

1
Q

KNN Classification - training and predicting

A

Training: store all the data
Prediction:
1. calculate the distance from x to all points in your dataset
2. sort the points in your dataset by increasing distance from x
3. predict the majority label of the k closest points (see the sketch below)

Distance
- Euclidean distance, Manhattan distance, cosine distance = 1 - cosine similarity
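
A minimal sketch of the prediction steps above, assuming NumPy arrays X_train and y_train (hypothetical names) and Euclidean distance:

    import numpy as np
    from collections import Counter

    def knn_predict(x, X_train, y_train, k=5):
        # 1. distance from x to every point in the dataset
        distances = np.linalg.norm(X_train - x, axis=1)
        # 2. sort the points by increasing distance and keep the k closest
        nearest_idx = np.argsort(distances)[:k]
        # 3. predict the majority label among those k neighbors
        return Counter(y_train[nearest_idx]).most_common(1)[0][0]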

2
Q

KNN Regression

A

We take the average of the target values of the k nearest neighbors.

3
Q

KNN Hyperparameters

A

K

- the number of neighbors to use. General rule of thumb: start with k = sqrt(n), then grid search from there (see the sketch below).
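
A hedged sketch of that rule of thumb followed by a grid search, using scikit-learn (X_train and y_train are placeholder names):

    import numpy as np
    from sklearn.model_selection import GridSearchCV
    from sklearn.neighbors import KNeighborsClassifier

    start_k = int(np.sqrt(len(X_train)))            # rule of thumb: k ~ sqrt(n)
    param_grid = {"n_neighbors": list(range(max(1, start_k - 10), start_k + 11))}

    search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
    search.fit(X_train, y_train)
    print(search.best_params_)                      # the k that scored best in cross-validation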

4
Q

KNN - noise vs signal

A

lower k tends to overfit

higher k tends to underfit - it captures less noise, but also less signal

5
Q

standardization

A

use standardization when your features are on varying scales
standardized value = (data point - mean) / standard deviation

this rescales each feature to mean 0 and standard deviation 1 (see the sketch below)
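
The same formula, sketched with scikit-learn's StandardScaler (assuming the scaler is fit on training data only; X_train and X_test are placeholder names):

    from sklearn.preprocessing import StandardScaler

    scaler = StandardScaler()                      # (data point - mean) / standard deviation, per column
    X_train_std = scaler.fit_transform(X_train)    # learn the mean and std from the training data
    X_test_std = scaler.transform(X_test)          # reuse the same mean and std on the test data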

6
Q

KNN (pros, cons)

A

pros:
- super simple
- training is trivial
- easy to add more data
- few hyperparameters

cons:
- if you have a lot of features, you need a lot more data, and gathering more data can be costly
- high prediction cost
- bad with high dimensions; anything beyond about 5 features tends to perform poorly
- categorical features don't work well

7
Q

Mean Squared Error (MSE)

A

expected value of the square of the error
MSE = (1/n) * sum((predicted - actual)**2)
- the average squared difference between the estimated values and the actual values, across all points in your data (worked example below)
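
A quick worked example of the formula in NumPy:

    import numpy as np

    actual = np.array([3.0, 5.0, 2.0])
    predicted = np.array([2.5, 5.5, 2.0])
    mse = np.mean((predicted - actual) ** 2)   # (0.25 + 0.25 + 0.0) / 3 ~ 0.167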

8
Q

Irreducible error

A

Error that we can't do anything about. Even if we had all possible data and could build a perfect model, we still couldn't predict values exactly.

9
Q

Bias and variance

A
  • errors that we can control
bias = failing to capture some of the signal (underfit)
variance = error we see when the model meets new real-world data; it reflects how inconsistently we're off from sample to sample (overfit)

When capturing more signal, you naturally capture more noise as well, which increases variance.

10
Q

k fold cross validation

A

train test split - reserve data for the final test set.
then do k-fold cross-validation on the remaining data: each fold takes a turn as the validation set while the other folds are used for training (see the sketch below)
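
A short sketch of that workflow with scikit-learn (the model and the X / y arrays are placeholders):

    from sklearn.model_selection import train_test_split, cross_val_score
    from sklearn.neighbors import KNeighborsRegressor

    # reserve an untouched test set first
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    # k-fold cross-validation on the training data only (5 folds here)
    scores = cross_val_score(KNeighborsRegressor(n_neighbors=5), X_train, y_train, cv=5)
    print(scores.mean())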

11
Q

churn

A

decision rules

12
Q

Linear Regression - scatterplot

A

It is good practice to plot a scatter plot of the dependent variable against each independent variable; if the relationship appears linear, that is a good hint that linear regression is a suitable learning algorithm.

13
Q

Feature Engineering

A

anytime you use your current features to create new features.

14
Q

Linear Regression with single feature

A

y = mx + b

15
Q

Linear Regression - how to pick the best line

A

Residual = the distance between our predicted value and the actual value

Find the line that minimizes the total sum of squared residuals.
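
A minimal sketch of finding that line by least squares on toy data (the numbers are made up for illustration):

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0])
    y = np.array([2.1, 3.9, 6.2, 8.1])

    m, b = np.polyfit(x, y, deg=1)          # slope and intercept minimizing the sum of squared residuals
    residuals = y - (m * x + b)
    print(m, b, np.sum(residuals ** 2))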

16
Q

Linear Regression with multiple features

A

the linear relationship is:

y = B0 + B1x1 + B2x2 + ... + Bpxp

17
Q

Linear Regression - R-squared

A

gives you a sense of how well your model performs - how much of the variance in the data you're capturing. It tells you how much better your model is doing than a dumb (mean-only) model. Close to 0 = no better than the dumb model; close to 1 = capturing nearly all the variance (and possibly overfitting).

Something to watch out for: the more features you have, the higher R-squared gets. Hence adjusted R-squared is a better measure.

18
Q

Linear Regression - Adjusted r-squared

A

normalized for the number of features. It is a better measure of your model's accuracy because plain R-squared rises as the number of features grows (see the sketch below).
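
The usual adjustment, sketched in Python, where n is the number of observations and p the number of features:

    def adjusted_r2(r2, n, p):
        # penalize R-squared for the number of features p relative to the sample size n
        return 1 - (1 - r2) * (n - 1) / (n - p - 1)

    print(adjusted_r2(r2=0.80, n=100, p=10))   # ~ 0.78, lower than the raw R-squared of 0.80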

19
Q

How do you know if a feature and dependent variable has a linear relationship?

A

You can plot the residuals (y_predicted - y_actual). We want to see the residuals distributed roughly normally around 0 and showing no trend (see the sketch below).
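
A minimal residual-plot sketch with matplotlib (y_predicted and y_actual are placeholder arrays):

    import matplotlib.pyplot as plt

    residuals = y_predicted - y_actual
    plt.scatter(y_predicted, residuals)   # residuals against predictions
    plt.axhline(0)
    plt.show()                            # want: points scattered evenly around 0 with no visible trend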

20
Q

studentized residuals

A

divide each residual by an estimate of the standard deviation of the residuals.
It helps you find outliers. You can remove a suspect point from the data and plot the residuals again to confirm (see the sketch below).
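
A rough sketch of that definition (y_actual and y_predicted are placeholder arrays); statsmodels can also compute studentized residuals from a fitted OLS model:

    import numpy as np

    residuals = y_actual - y_predicted
    studentized = residuals / residuals.std(ddof=1)      # divide by the estimated std of the residuals
    outliers = np.where(np.abs(studentized) > 3)[0]      # |value| > 3 is a common outlier rule of thumb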

21
Q

homoscedasticity vs. heteroscedasticity

A

homoscedasticity = when the variance of your residuals is constant.

how to test?

  1. Divide your data into 2 parts; if the residual variance on the left equals the variance on the right, it's homoscedastic, otherwise it's heteroscedastic
  2. sm.stats.diagnostic.het_goldfeldquandt, then look at the p-value. Null hypothesis = homoscedasticity (see the sketch below)
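
A sketch of the second check (X_features and y are placeholder names); the null hypothesis is homoscedasticity, so a small p-value suggests heteroscedasticity:

    import statsmodels.api as sm
    from statsmodels.stats.diagnostic import het_goldfeldquandt

    X = sm.add_constant(X_features)
    fval, pval, _ = het_goldfeldquandt(y, X)
    print(pval)
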
22
Q

Normality

A

residuals are normally distributed

To see if residuals are normally distributed, we look at a QQ plot - if the points follow the reference line, the residuals are normal.

We can also use the Jarque-Bera (JB) test. Look at the p-value.
Null: the residuals follow a normal distribution
Alt: they don't follow a normal distribution (see the sketch below)
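
A sketch of both checks, assuming residuals is an array of residuals from your fitted model:

    import statsmodels.api as sm
    from scipy import stats

    sm.qqplot(residuals, line="s")                     # points hugging the line suggests normality
    jb_stat, jb_pvalue = stats.jarque_bera(residuals)  # null hypothesis: residuals are normal
    print(jb_pvalue)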

23
Q

Multicollinearity

A

When you have two or more features that depend on each other, e.g. height and weight.

  • It means you can't be confident about your coefficient estimates, so you may have to remove one or more features to be confident about the ones that remain
  • Variance inflation factor (VIF). If VIF > 10, it's a good indication that you have multicollinearity (see the sketch below)
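
A sketch of the VIF check with statsmodels (df and the column names are hypothetical):

    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    X = sm.add_constant(df[["height", "weight", "age"]])
    vif = pd.Series(
        [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
        index=X.columns,
    )
    print(vif)   # VIF > 10 on a feature is a common multicollinearity red flag
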
24
Q

Linear Regression (prediction vs inferential)

A
  1. when there’s a linear relation between outcome and features
  2. Goal: find the coefficients of the features
  3. what coefficients should I choose to minimize my sum of squared residuals? - the difference between my observations and predictions

When I'm making predictions and I don't care about inferring the coefficients, I just build a linear regression model and predict.

If you want to interpret the data - if you want to know how confident you are in each coefficient - then the following conditions have to hold:

  1. Linearity
  2. Homoscedasticity
  3. No multicollinearity
  4. Normality
  5. All data is i.i.d. (independent and identically distributed)
25
Q

Linear Regression - Top bottom, bottom top

A

bottom top (forward selection): start with the single feature that gives you the lowest RMSE, then add a second, a third, and so on, keeping each addition only if it reduces RMSE.

top bottom (backward elimination): start with all features and remove them one by one.

26
Q

Regularized regression - what does it do?

A

It introduces a parameter into linear regression that lets us adjust the bias-variance tradeoff.

It adds an alpha (penalty strength) to the linear regression objective. The bigger alpha gets, the smaller the betas get.

When the betas are small, the model is less sensitive to each individual feature, so it is less likely to overfit.

27
Q

regularized regression - criteria

A

All predictors need to be on the same scale.

standardized_feature = (raw_feature - mean(raw_feature))/st_dev ( raw_feature)

28
Q

ridge regression

A

penalty added to the loss: alpha * sum(beta**2)   (the L2 penalty)

As alpha increases, it pushes all betas proportionately closer to 0 (see the sketch below)
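
A sketch of that shrinkage with scikit-learn (X_train_std and y_train are placeholders; features assumed standardized):

    from sklearn.linear_model import Ridge

    for alpha in [0.1, 1.0, 10.0, 100.0]:
        model = Ridge(alpha=alpha).fit(X_train_std, y_train)
        print(alpha, model.coef_)   # all coefficients shrink toward 0 as alpha grows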

29
Q

overfitting and underfitting

A

When overfitting, we're capturing a lot of signal but we're also introducing a lot of noise. So when we feed real-world data into the model, it gives inconsistent results.

30
Q

lasso regression

A

penalty added to the loss: alpha * sum(|beta|)   (the L1 penalty)
As alpha increases, some betas drop to 0 sooner than other betas.

Benefit: it allows for feature selection - I believe some features matter more than others, and I want the features that matter more to stay in the model longer as I increase alpha, while the features that matter less drop out (see the sketch below).
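
A sketch of that feature-selection effect with scikit-learn (X_train_std and y_train are placeholders; features assumed standardized):

    from sklearn.linear_model import Lasso

    for alpha in [0.01, 0.1, 1.0]:
        model = Lasso(alpha=alpha).fit(X_train_std, y_train)
        n_kept = (model.coef_ != 0).sum()
        print(alpha, n_kept)   # fewer non-zero coefficients as alpha grows = features dropping out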

31
Q

Logistic Regression - decision boundary

A

setting a threshold on the predicted probability to determine class 1 vs class 2. We set this boundary based on business needs

32
Q

Logistic Regression - find the best model

A

minimize log loss

33
Q

how do you define which model is better in logistic regression?

A

ROC curve

it works like this: false positive rate on x, true positive rate on y. Then move the threshold and watch how the FPR and TPR change.

The bigger the area under this curve (AUC), the better our model is doing (see the sketch below).
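
A sketch with scikit-learn (model, X_test and y_test are placeholders for a fitted soft classifier and held-out data):

    from sklearn.metrics import roc_curve, roc_auc_score

    probs = model.predict_proba(X_test)[:, 1]        # predicted probability of the positive class
    fpr, tpr, thresholds = roc_curve(y_test, probs)  # one (fpr, tpr) point per threshold
    print(roc_auc_score(y_test, probs))              # bigger area under the curve = better model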

34
Q

logistic regression - performance

A

it depends on which class you're interested in; adjust your performance measurement using a confusion matrix. For example, if the positive class matters more, you want to reduce false negatives, etc.

Whenever accuracy is not a good indicator, use a confusion matrix and measures like recall, precision, etc.

35
Q

linear regression - performance

A

whichever model has the lowest RMSE is the better model

36
Q

what does it mean that logistic regression is a soft classifier?

A

it is able to provide the probability of an outcome belonging to one class vs the other, rather than only a hard label.

37
Q

confusion matrix

A

it is a table used to describe the performance of a classification model.
Accuracy = how often is the classifier correct?
Recall = when it’s actually yes, how often does it predict yes?
Precision = when it predicts yes, how often is it correct?
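
A sketch of those three measures from a binary confusion matrix (y_test and y_pred are placeholders):

    from sklearn.metrics import confusion_matrix

    tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
    accuracy  = (tp + tn) / (tp + tn + fp + fn)   # how often the classifier is correct
    recall    = tp / (tp + fn)                    # of the actual yeses, how many were predicted yes
    precision = tp / (tp + fp)                    # of the predicted yeses, how many were correct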

38
Q

how do you deal with imbalanced class?

A

Before you do any of these, make sure to do the train test split first - the test set should reflect real-world data (including its class balance) as accurately as possible.

  1. oversampling - one way to do it is to bootstrap the minority class
  2. undersampling - appropriate when you have more data than you can process, or when data in the majority class is less important to the result. E.g. with 1 million rows and only 100,000 in the minority class, randomly sample 100,000 rows from the majority class.
  3. SMOTE - it generates new synthetic data points by interpolating between pairs of minority-class samples, but the results are usually not good (see the sketch below)
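
A sketch of options 1 and 2 with scikit-learn's resample (minority_df and majority_df are hypothetical DataFrames already split by class, taken from the training data only):

    from sklearn.utils import resample

    # oversampling: bootstrap the minority class up to the majority-class size
    minority_up = resample(minority_df, replace=True,
                           n_samples=len(majority_df), random_state=0)

    # undersampling: randomly sample the majority class down to the minority-class size
    majority_down = resample(majority_df, replace=False,
                             n_samples=len(minority_df), random_state=0)
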
39
Q

using a confusion matrix in a business case

A
  1. build a model and create a confusion matrix
  2. assign a dollar value to each cell of the confusion matrix based on the business use case
  3. multiply the two matrices together and sum to calculate profit
  4. test a range of thresholds to see which one gives the highest profit (see the sketch below)
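
A sketch of those steps (probs and y_test are placeholders; the dollar values are made up and laid out like sklearn's confusion matrix, [[TN, FP], [FN, TP]]):

    import numpy as np
    from sklearn.metrics import confusion_matrix

    profit_matrix = np.array([[0, -50],
                              [-200, 300]])          # hypothetical dollar value per outcome

    for threshold in np.arange(0.1, 0.95, 0.05):
        y_pred = (probs >= threshold).astype(int)
        profit = (confusion_matrix(y_test, y_pred) * profit_matrix).sum()
        print(round(threshold, 2), profit)           # pick the threshold with the highest profit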