Machine learning for Regression Flashcards
1) Data Preparation
You can view the data and do preparation steps, e.g. clean up columns (select, rename), clean up values, and update or create new columns.
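A minimal pandas sketch of these clean-up steps (the file name and columns are hypothetical):

import pandas as pd

df = pd.read_csv('data.csv')  # hypothetical file

# normalize column names: lowercase, spaces -> underscores
df.columns = df.columns.str.lower().str.replace(' ', '_')

# do the same for values in string (object) columns
string_columns = list(df.dtypes[df.dtypes == 'object'].index)
for col in string_columns:
    df[col] = df[col].str.lower().str.replace(' ', '_')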
2) Exploratory Data Analysis
Look at each column: its unique values and the number of unique values.
Visualize the column you want to predict (matplotlib and seaborn for visualization).
Look at the distribution of the target variable and transform it if needed to make it suitable for ML models; a roughly normal distribution is ideal. A short EDA sketch follows after this list.
Check for missing values
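A minimal EDA sketch along these lines, assuming the dataframe df from the preparation step and a hypothetical target column called price:

import seaborn as sns
import matplotlib.pyplot as plt

# unique values and their count for each column
for col in df.columns:
    print(col, df[col].nunique())
    print(df[col].unique()[:5])

# distribution of the target variable
sns.histplot(df['price'], bins=50)
plt.show()

# missing values per column
print(df.isnull().sum())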
Long Tail Distribution
The majority of the data is concentrated at lower values, with a small number of observations at much higher values. This type of distribution is not good for machine learning algorithms because the long tail confuses the model.
How to get rid of long tail distribution problem for ML?
Apply a logarithmic transformation to get more compact values: even very large values have only moderately large logarithms.
The log of zero doesn't exist; we can solve this by adding 1 to the data.
In numpy, np.log1p computes log(1 + x), i.e. it adds 1 before taking the log. After the transformation the tail is gone and the shape resembles a normal distribution with a clear centre and approximate symmetry. If the target variable looks like that, the models do a lot better in prediction.
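A minimal sketch of the transformation (again assuming a hypothetical price target column):

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# np.log1p computes log(1 + x), so zero values are handled
price_logs = np.log1p(df['price'])

sns.histplot(price_logs, bins=50)
plt.show()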
Types of Distribution
+ Symmetric Distributions
Mirrored distributions around mean
+ Left Skewed Distributions
Long tail on the left
+ Right Skewed Distributions
Long tail on the right
+ Bimodal distributions
Two separate peaks, e.g. one on the left and one on the right
+ Uniform Distribution
Values occur with roughly equal frequency across the range, i.e. the histogram is approximately flat
Types of Probability Distributions
+ Discrete Distributions
Finite number of outcomes
+ Continuous Distributions
Infinite number of outcomes, e.g. time and distance
Notation: X ~ N(mean, variance), i.e. variable ~ type of distribution(characteristics); the characteristics vary by the type of distribution.
Distributions are usually examined for the outcome (target) variable.
Discrete Distributions:
Uniform distribution
+ Outcomes are equally likely (equiprobable)
Bernoulli Distribution
+ Events with only two types of outcomes e.g. true or false.
+ It applies regardless of whether one outcome is more likely than the other
+ Any event with two outcomes can be transformed into a Bernoulli event
Binomial distribution
+ If we carry out multiple independent trials of a two-outcome event
+ e.g. we flip a coin 3 times and would like to know the likelihood of getting heads twice.
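For a fair coin the binomial formula gives this directly:
P(2 heads in 3 flips) = C(3, 2) · (1/2)^2 · (1/2) = 3 · 1/8 = 3/8 = 0.375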
Poisson Distribution:
Tests how unusual an event frequency is for a given interval.
If the frequency (rate) changes, so does the expected number of occurrences.
Continuous Distribution:
The probability distribution would be a curve as compared to unconnected individual bars.
Normal distributions
+ Often observed in nature
+ Symmetrical distribution around mean
+ Values at the extreme left or right are outliers
3) Setting up Validation Framework
Split data into 3 parts
1) Training dataset (60%)
2) Validation dataset (20%)
3) Test dataset (20%)
Because of rounding, the total number of records may not equal n_train + n_val + n_test when each part is computed as a percentage.
To handle this, compute n_val and n_test first and assign all the remaining records to the training set.
Make sure the data is shuffled before splitting so it is not in its original sequence and all value ranges end up in all three datasets. We can build an index and shuffle it with numpy:
idx = np.arange(n)
np.random.shuffle(idx)
df_shuffle = df.iloc[idx]
df_train = df_shuffle.iloc[:n_train].copy()
Delete the target variable from the feature matrix, i.e. drop it from df_train, df_val and df_test.
Apply the log transformation to the target variable y (full sketch below).
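A minimal end-to-end sketch of the split, assuming the dataframe df from earlier, a hypothetical price target column, and a fixed seed for reproducibility:

import numpy as np

n = len(df)
n_val = int(n * 0.2)
n_test = int(n * 0.2)
n_train = n - n_val - n_test          # remaining records go to training

np.random.seed(2)                     # hypothetical seed, only for reproducibility
idx = np.arange(n)
np.random.shuffle(idx)

df_shuffle = df.iloc[idx]
df_train = df_shuffle.iloc[:n_train].copy()
df_val = df_shuffle.iloc[n_train:n_train + n_val].copy()
df_test = df_shuffle.iloc[n_train + n_val:].copy()

# log-transform the target and remove it from the feature dataframes
y_train = np.log1p(df_train['price'].values)
y_val = np.log1p(df_val['price'].values)
y_test = np.log1p(df_test['price'].values)

del df_train['price']
del df_val['price']
del df_test['price']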
Linear Regression (formula)
Model outputs a number.
g(xi) = w0 + w1·xi1 + w2·xi2 + w3·xi3 + …
We combine the features (the observations' characteristics) in a way that gets as close as possible to the outcome/target variable.
We don't take the features as they are; we multiply each feature by a weight.
w0 is the bias term, i.e. the prediction when we don't know anything about the features.
More compact formula
g(xi) = w0 + Σ(j=1..n) wj·xij
Undo log(y+1) (Getting original prediction values)
Since we applied log(y + 1) in previous steps to make the target distribution more normal, the model still outputs values on that logarithmic scale; we need to undo the transformation to get the real price. The way to undo a log is the exponent, e.g. np.exp(x).
np.expm1(x) computes exp(x) - 1, i.e. it also subtracts the 1 we added earlier.
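A minimal sketch of undoing the transformation (y_pred here is a hypothetical array of model outputs on the log scale):

import numpy as np

y_pred = np.array([9.2, 10.1, 11.5])   # hypothetical predictions on the log scale

# np.expm1 is the inverse of np.log1p: it computes exp(x) - 1
price_pred = np.expm1(y_pred)
print(price_pred)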
Compact Linear Regression Formula
g(xi) = w0 + xi^T·w
More compact
w = [w0, w1, w2, w3, …, wn]
xi = [xi0=1, xi1, xi2, xi3, …, xin]
w^T·xi = xi^T·w
We just prepend 1 at the beginning of xi (so xi0 = 1) and put w0 at the beginning of w; the single dot product then already includes the bias term, so the result is the same.
Linear Regression Formula (matrix form)
Stack the feature vectors xi as rows of a matrix X and multiply by the vector w: each row of X is dotted with w.
X·w = [x1^T·w, x2^T·w, …, xn^T·w]
Each entry is the prediction for one observation.
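A minimal sketch of the vectorized prediction with hypothetical numbers:

import numpy as np

# first column is the prepended 1 for the bias term
X = np.array([[1, 148, 24],
              [1, 132, 25],
              [1, 453, 11]])
w = np.array([7.1, 0.01, 0.04])   # w0 followed by the feature weights

y_pred = X.dot(w)                 # one prediction per row
print(y_pred)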
4) Training Linear Regression: Normal Equation
g(X) = X·w ≈ y
We need to find a way to get w.
X^-1·X·w = X^-1·y would solve for w, but X is generally not square, so its inverse usually does not exist and this approach does not work directly.
X^T·X is called the Gram matrix; it is square, so its inverse usually exists.
X^T·X·w = X^T·y
(X^T·X)^-1·X^T·X·w = (X^T·X)^-1·X^T·y
I·w = (X^T·X)^-1·X^T·y
I·w = w, so w = (X^T·X)^-1·X^T·y
w0 = bias term; the rest of w are the feature weights.
We should add the bias term: it tells the model what the prediction should be when there is no information about the car.
A negative coefficient means the prediction goes down as that feature increases, e.g. the age of the car.
Numpy: np.column_stack()
Use np.column_stack([ones, X]) to prepend a column of ones to an existing matrix (it stacks arrays as columns side by side).
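A minimal training sketch combining the normal equation with np.column_stack; the function name is just for illustration:

import numpy as np

def train_linear_regression(X, y):
    # prepend a column of ones so the bias term w0 is learned with the other weights
    ones = np.ones(X.shape[0])
    X = np.column_stack([ones, X])

    # normal equation: w = (X^T·X)^-1 · X^T · y
    XTX = X.T.dot(X)
    XTX_inv = np.linalg.inv(XTX)
    w_full = XTX_inv.dot(X.T).dot(y)

    return w_full[0], w_full[1:]   # bias term, feature weights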
5) Baseline Model
1) Use the numerical columns that best describe an observation.
2) Create a numpy matrix out of the dataframe, e.g. df.values or df.to_numpy().
3) Check for missing values, e.g. df.isnull().sum(); we can fillna(0). Filling missing values with 0 makes the model ignore that feature for those rows, e.g.
g(xi) = w0 + xi1·w1 + xi2·w2 becomes w0 + xi2·w2 when xi1 = 0. If xi1 is horsepower, a car with 0 horsepower doesn't make sense, but it lets the model run; replacing missing values with the column mean is another option.
4) Use the learned weights for predictions: y_pred = w0 + X_train.dot(w)
5) Plot the predictions against the actual target variable to compare the two distributions (see the sketch below).
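A minimal baseline sketch, continuing from the split and training sketches above (the column names in base are hypothetical):

import seaborn as sns
import matplotlib.pyplot as plt

base = ['engine_hp', 'engine_cylinders', 'highway_mpg']   # hypothetical numerical columns

X_train = df_train[base].fillna(0).values
w0, w = train_linear_regression(X_train, y_train)

y_pred = w0 + X_train.dot(w)

# compare the distributions of predictions and actual (log-transformed) targets
sns.histplot(y_pred, color='red', alpha=0.5, bins=50)
sns.histplot(y_train, color='blue', alpha=0.5, bins=50)
plt.show()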
RMSE
How to objectively assess the performance of the model?
RMSE stands for root mean square error.
g(xi) = prediction for xi
yi = actual value for xi
RMSE = sqrt( (1/m) · Σ(i=1..m) (g(xi) - yi)^2 )
We take the difference between the prediction and the actual target value, square it, sum over all m observations, and divide by m to get the mean squared error; then we take the square root.
The lower the RMSE, the better the model's performance.
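A minimal sketch of RMSE, following the formula above:

import numpy as np

def rmse(y, y_pred):
    error = y_pred - y
    mse = (error ** 2).mean()      # mean squared error
    return np.sqrt(mse)            # root of the mean squared error

# e.g. evaluate the baseline model on the training data
print(rmse(y_train, y_pred))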