Machine learning for Regression Flashcards
1) Data Preparation
You can view the data and do preparation steps, e.g. clean up columns (select, rename), clean up values, and update or create new columns.
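A minimal pandas sketch of these clean-up steps (the file name and columns are hypothetical):

import pandas as pd

df = pd.read_csv('data.csv')  # hypothetical file

# normalize column names: lowercase, spaces -> underscores
df.columns = df.columns.str.lower().str.replace(' ', '_')

# do the same for values in string (object) columns
string_columns = list(df.dtypes[df.dtypes == 'object'].index)
for col in string_columns:
    df[col] = df[col].str.lower().str.replace(' ', '_')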
2) Exploratory Data Analysis
Look at each column: its unique values and the number of unique values.
Visualize the column you want to predict (matplotlib and seaborn for visualization).
Look at the distribution of the target variable and transform it if needed to make it suitable for ML models; a roughly normal distribution is ideal. A short EDA sketch follows after this list.
Check for missing values
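A minimal EDA sketch along these lines, assuming the dataframe df from the preparation step and a hypothetical target column called price:

import seaborn as sns
import matplotlib.pyplot as plt

# unique values and their count for each column
for col in df.columns:
    print(col, df[col].nunique())
    print(df[col].unique()[:5])

# distribution of the target variable
sns.histplot(df['price'], bins=50)
plt.show()

# missing values per column
print(df.isnull().sum())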
Long Tail Distribution
The majority of the data is concentrated at lower values, with a small number of observations at much higher values. This type of distribution is not good for machine learning algorithms because the long tail confuses the model.
How to get rid of long tail distribution problem for ML?
Apply a logarithmic transformation to get more compact values: even very large values have only moderately large logarithms.
The log of zero doesn't exist; we can solve this by adding 1 to the data.
In numpy, np.log1p computes log(1 + x), i.e. it adds 1 before taking the log. After the transformation the tail is gone and the shape resembles a normal distribution with a clear centre and approximate symmetry. If the target variable looks like that, the models do a lot better in prediction.
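A minimal sketch of the transformation (again assuming a hypothetical price target column):

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# np.log1p computes log(1 + x), so zero values are handled
price_logs = np.log1p(df['price'])

sns.histplot(price_logs, bins=50)
plt.show()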
Types of Distribution
+ Symmetric Distributions
Mirrored distributions around mean
+ Left Skewed Distributions
Long tail on the left
+ Right Skewed Distributions
Long tail on the right
+ Bimodal distributions
Two separate peaks, e.g. one on the left and one on the right
+ Uniform Distribution
Values occur with roughly equal frequency across the range, i.e. the histogram is approximately flat
Types of Probability Distributions
+ Discrete Distributions
Finite number of outcomes
+ Continuous Distributions
Infinite number of outcomes, e.g. time and distance
Notation: X ~ N(mean, variance), i.e. variable ~ type of distribution(characteristics); the characteristics vary by the type of distribution.
Distributions are usually examined for the outcome (target) variable.
Discrete Distributions:
Uniform distribution
+ Outcomes are equally likely (equiprobable)
Bernoulli Distribution
+ Events with only two types of outcomes e.g. true or false.
+ It applies regardless of whether one outcome is more likely than the other
+ Any event with two outcomes can be transformed into a Bernoulli event
Binomial distribution
+ If we carry out multiple independent trials of a two-outcome event
+ e.g. we flip a coin 3 times and would like to know the likelihood of getting heads twice.
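For a fair coin the binomial formula gives this directly:
P(2 heads in 3 flips) = C(3, 2) · (1/2)^2 · (1/2) = 3 · 1/8 = 3/8 = 0.375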
Poisson Distribution:
Tests how unusual an event frequency is for a given interval.
If the frequency (rate) changes, so does the expected number of occurrences.
Continuous Distribution:
The probability distribution would be a curve as compared to unconnected individual bars.
Normal distributions
+ Often observed in nature
+ Symmetrical distribution around mean
+ Values at the extreme left or right are outliers
3) Setting up Validation Framework
Split data into 3 parts
1) Training dataset (60%)
2) Validation dataset (20%)
3) Test dataset (20%)
Because of rounding, the total number of records may not equal n_train + n_val + n_test when each part is computed as a percentage.
To handle this, compute n_val and n_test first and assign all the remaining records to the training set.
Make sure the data is shuffled before splitting so it is not in its original sequence and all value ranges end up in all three datasets. We can build an index and shuffle it with numpy:
idx = np.arange(n)
np.random.shuffle(idx)
df_shuffle = df.iloc[idx]
df_train = df_shuffle.iloc[:n_train].copy()
Delete the target variable from the feature matrix, i.e. drop it from df_train, df_val and df_test.
Apply the log transformation to the target variable y (full sketch below).
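A minimal end-to-end sketch of the split, assuming the dataframe df from earlier, a hypothetical price target column, and a fixed seed for reproducibility:

import numpy as np

n = len(df)
n_val = int(n * 0.2)
n_test = int(n * 0.2)
n_train = n - n_val - n_test          # remaining records go to training

np.random.seed(2)                     # hypothetical seed, only for reproducibility
idx = np.arange(n)
np.random.shuffle(idx)

df_shuffle = df.iloc[idx]
df_train = df_shuffle.iloc[:n_train].copy()
df_val = df_shuffle.iloc[n_train:n_train + n_val].copy()
df_test = df_shuffle.iloc[n_train + n_val:].copy()

# log-transform the target and remove it from the feature dataframes
y_train = np.log1p(df_train['price'].values)
y_val = np.log1p(df_val['price'].values)
y_test = np.log1p(df_test['price'].values)

del df_train['price']
del df_val['price']
del df_test['price']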
Linear Regression (formula)
Model outputs a number.
g(xi) = w0 + w1·xi1 + w2·xi2 + w3·xi3 + …
We combine the features (the observations' characteristics) in a way that gets as close as possible to the outcome/target variable.
We don't take the features as they are; we multiply each feature by a weight.
w0 is the bias term, i.e. the prediction when we don't know anything about the features.
More compact formula
g(xi) = w0 + Σ(j=1..n) wj·xij
Undo log(y+1) (Getting original prediction values)
Since we applied log(y + 1) in previous steps to make the target distribution more normal, the model still outputs values on that logarithmic scale; we need to undo the transformation to get the real price. The way to undo a log is the exponent, e.g. np.exp(x).
np.expm1(x) computes exp(x) - 1, i.e. it also subtracts the 1 we added earlier.
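A minimal sketch of undoing the transformation (y_pred here is a hypothetical array of model outputs on the log scale):

import numpy as np

y_pred = np.array([9.2, 10.1, 11.5])   # hypothetical predictions on the log scale

# np.expm1 is the inverse of np.log1p: it computes exp(x) - 1
price_pred = np.expm1(y_pred)
print(price_pred)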
Compact Linear Regression Formula
g(xi) = w0 + xi^T·w
More compact
w = [w0, w1, w2, w3, …, wn]
xi = [xi0=1, xi1, xi2, xi3, …, xin]
w^T·xi = xi^T·w
We just prepend 1 at the beginning of xi (so xi0 = 1) and put w0 at the beginning of w; the single dot product then already includes the bias term, so the result is the same.
Linear Regression Formula (matrix form)
Stack the feature vectors xi as rows of a matrix X and multiply by the vector w: each row of X is dotted with w.
X·w = [x1^T·w, x2^T·w, …, xn^T·w]
Each entry is the prediction for one observation.
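A minimal sketch of the vectorized prediction with hypothetical numbers:

import numpy as np

# first column is the prepended 1 for the bias term
X = np.array([[1, 148, 24],
              [1, 132, 25],
              [1, 453, 11]])
w = np.array([7.1, 0.01, 0.04])   # w0 followed by the feature weights

y_pred = X.dot(w)                 # one prediction per row
print(y_pred)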
4) Training Linear Regression: Normal Equation
g(X) = X·w ≈ y
We need to find a way to get w.
X^-1·X·w = X^-1·y would solve for w, but X is generally not square, so its inverse usually does not exist and this approach does not work directly.
X^T·X is called the Gram matrix; it is square, so its inverse usually exists.
X^T·X·w = X^T·y
(X^T·X)^-1·X^T·X·w = (X^T·X)^-1·X^T·y
I·w = (X^T·X)^-1·X^T·y
I·w = w, so w = (X^T·X)^-1·X^T·y
w0 = bias term; the rest of w are the feature weights.
We should add the bias term: it tells the model what the prediction should be when there is no information about the car.
A negative coefficient means the prediction goes down as that feature increases, e.g. the age of the car.
Numpy: np.column_stack()
Use np.column_stack([ones, X]) to prepend a column of ones to an existing matrix (it stacks arrays as columns side by side).
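A minimal training sketch combining the normal equation with np.column_stack; the function name is just for illustration:

import numpy as np

def train_linear_regression(X, y):
    # prepend a column of ones so the bias term w0 is learned with the other weights
    ones = np.ones(X.shape[0])
    X = np.column_stack([ones, X])

    # normal equation: w = (X^T·X)^-1 · X^T · y
    XTX = X.T.dot(X)
    XTX_inv = np.linalg.inv(XTX)
    w_full = XTX_inv.dot(X.T).dot(y)

    return w_full[0], w_full[1:]   # bias term, feature weights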
5) Baseline Model
1) Use the numerical columns that best describe an observation.
2) Create a numpy matrix out of the dataframe, e.g. df.values or df.to_numpy().
3) Check for missing values, e.g. df.isnull().sum(); we can fillna(0). Filling missing values with 0 makes the model ignore that feature for those rows, e.g.
g(xi) = w0 + xi1·w1 + xi2·w2 becomes w0 + xi2·w2 when xi1 = 0. If xi1 is horsepower, a car with 0 horsepower doesn't make sense, but it lets the model run; replacing missing values with the column mean is another option.
4) Use the learned weights for predictions: y_pred = w0 + X_train.dot(w)
5) Plot the predictions against the actual target variable to compare the two distributions (see the sketch below).
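A minimal baseline sketch, continuing from the split and training sketches above (the column names in base are hypothetical):

import seaborn as sns
import matplotlib.pyplot as plt

base = ['engine_hp', 'engine_cylinders', 'highway_mpg']   # hypothetical numerical columns

X_train = df_train[base].fillna(0).values
w0, w = train_linear_regression(X_train, y_train)

y_pred = w0 + X_train.dot(w)

# compare the distributions of predictions and actual (log-transformed) targets
sns.histplot(y_pred, color='red', alpha=0.5, bins=50)
sns.histplot(y_train, color='blue', alpha=0.5, bins=50)
plt.show()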
RMSE
How to objectively assess the performance of the model?
RMSE stands for root mean square error.
g(xi) = prediction for xi
yi = actual value for xi
RMSE = sqrt( (1/m) · Σ(i=1..m) (g(xi) - yi)^2 )
We take the difference between the prediction and the actual target value, square it, sum over all m observations, and divide by m to get the mean squared error; then we take the square root.
The lower the RMSE, the better the model's performance.
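A minimal sketch of RMSE, following the formula above:

import numpy as np

def rmse(y, y_pred):
    error = y_pred - y
    mse = (error ** 2).mean()      # mean squared error
    return np.sqrt(mse)            # root of the mean squared error

# e.g. evaluate the baseline model on the training data
print(rmse(y_train, y_pred))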