Baby Steps Flashcards
What is supervised machine learning?
Supervised learning is a type of machine learning in which our algorithms are trained using well-labeled training data, and machines predict the output based on that data. Labeled data indicates that the input data has already been tagged with the appropriate output. Basically, it is the task of learning a function that maps the input set and returns an output. Some of its examples are: Linear Regression, Logistic Regression, KNN, etc.
What is regression? Which models can you use to solve a regression problem?
Regression is a part of supervised ML. Regression models investigate the relationship between a dependent (target) and independent variable (s) (predictor). Here are some common regression models
Linear Regression establishes a linear relationship between target and predictor (s). It predicts a numeric value and has a shape of a straight line.
Polynomial Regression has a regression equation with the power of independent variable more than 1. It is a curve that fits into the data points.
Ridge Regression helps when predictors are highly correlated (multicollinearity problem). It penalizes the squares of regression coefficients but doesn’t allow the coefficients to reach zeros (uses L2 regularization).
Lasso Regression penalizes the absolute values of regression coefficients and allows some of the coefficients to reach absolute zero (thereby allowing feature selection).
What is linear regression? When do we use it?
Linear regression is a model that assumes a linear relationship between the input variables (X) and the single output variable (y).
With a simple equation:
y = B0 + B1*x1 + … + Bn * xN
B is regression coefficients, x values are the independent (explanatory) variables and y is dependent variable.
The case of one explanatory variable is called simple linear regression. For more than one explanatory variable, the process is called multiple linear regression.
Simple linear regression:
y = B0 + B1*x1
Multiple linear regression:
y = B0 + B1*x1 + … + Bn * xN
What are the main assumptions of linear regression?
There are several assumptions of linear regression. If any of them is violated, model predictions and interpretation may be worthless or misleading.
Linear relationship between features and target variable.
Additivity means that the effect of changes in one of the features on the target variable does not depend on values of other features. For example, a model for predicting revenue of a company have of two features - the number of items a sold and the number of items b sold. When company sells more items a the revenue increases and this is independent of the number of items b sold. But, if customers who buy a stop buying b, the additivity assumption is violated.
Features are not correlated (no collinearity) since it can be difficult to separate out the individual effects of collinear features on the target variable.
Errors are independently and identically normally distributed (yi = B0 + B1*x1i + … + errori):
No correlation between errors (consecutive errors in the case of time series data).
Constant variance of errors - homoscedasticity. For example, in case of time series, seasonal patterns can increase errors in seasons with higher activity.
Errors are normaly distributed, otherwise some features will have more influence on the target variable than to others. If the error distribution is significantly non-normal, confidence intervals may be too wide or too narrow.
What’s the normal distribution? Why do we care about it?
The normal distribution is a continuous probability distribution whose probability density function takes the following formula:
P(x) = e** (-(x-μ)2) / 2σ**(2) / σ ^2pi
where μ is the mean and σ is the standard deviation of the distribution, and ^ is sqrt
The normal distribution derives its importance from the Central Limit Theorem, which states that if we draw a large enough number of samples, their mean will follow a normal distribution regardless of the initial distribution of the sample, i.e the distribution of the mean of the samples is normal. It is important that each sample is independent from the other.
This is powerful because it helps us study processes whose population distribution is unknown to us.
How do we check if a variable follows the normal distribution?
Plot a histogram out of the sampled data. If you can fit the bell-shaped “normal” curve to the histogram, then the hypothesis that the underlying random variable follows the normal distribution can not be rejected.
Check Skewness and Kurtosis of the sampled data. Skewness = 0 and kurtosis = 3 are typical for a normal distribution, so the farther away they are from these values, the more non-normal the distribution.
Use Kolmogorov-Smirnov or/and Shapiro-Wilk tests for normality. They take into account both Skewness and Kurtosis simultaneously.
Check for Quantile-Quantile plot. It is a scatterplot created by plotting two sets of quantiles against one another. Normal Q-Q plot place the data points in a roughly straight line.
What if we want to build a model for predicting prices? Are prices distributed normally? Do we need to do any pre-processing for prices?
Data is not normal. Specially, real-world datasets or uncleaned datasets always have certain skewness. Same goes for the price prediction. Price of houses or any other thing under consideration depends on a number of factors. So, there’s a great chance of presence of some skewed values i.e outliers if we talk in data science terms.
Yes, you may need to do pre-processing. Most probably, you will need to remove the outliers to make your distribution near-to-normal.
What methods for solving linear regression do you know?
To solve linear regression, you need to find the coefficients which minimize the sum of squared errors.
Matrix Algebra method: Let’s say you have X, a matrix of features, and y, a vector with the values you want to predict. After going through the matrix algebra and minimization problem, you get this solution: .
But solving this requires you to find an inverse, which can be time-consuming, if not impossible. Luckily, there are methods like Singular Value Decomposition (SVD) or QR Decomposition that can reliably calculate this part (called the pseudo-inverse) without actually needing to find an inverse. The popular python ML library sklearn uses SVD to solve least squares.
Alternative method: Gradient Descent.
What is gradient descent? How does it work?
Gradient descent is an algorithm that uses calculus concept of gradient to try and reach local or global minima. It works by taking the negative of the gradient in a point of a given function, and updating that point repeatedly using the calculated negative gradient, until the algorithm reaches a local or global minimum, which will cause future iterations of the algorithm to return values that are equal or too close to the current point. It is widely used in machine learning applications.
What is the normal equation?
Normal equations are equations obtained by setting equal to zero the partial derivatives of the sum of squared errors (least squares); normal equations allow one to estimate the parameters of a multiple linear regression.
What is SGD — stochastic gradient descent? What’s the difference with the usual gradient descent?
In both gradient descent (GD) and stochastic gradient descent (SGD), you update a set of parameters in an iterative manner to minimize an error function.
While in GD, you have to run through ALL the samples in your training set to do a single update for a parameter in a particular iteration, in SGD, on the other hand, you use ONLY ONE or SUBSET of training sample from your training set to do the update for a parameter in a particular iteration. If you use SUBSET, it is called Minibatch Stochastic gradient Descent.
Which metrics for evaluating regression models do you know?
Mean Squared Error(MSE)
Root Mean Squared Error(RMSE)
Mean Absolute Error(MAE)
R² or Coefficient of Determination
Adjusted R²
What are MSE and RMSE?
MSE stands for Mean Square Error while RMSE stands for Root Mean Square Error. They are metrics with which we can evaluate models.
What is the bias-variance trade-off?
Bias is the error introduced by approximating the true underlying function, which can be quite complex, by a simpler model. Variance is a model sensitivity to changes in the training dataset.
Bias-variance trade-off is a relationship between the expected test error and the variance and the bias - both contribute to the level of the test error and ideally should be as small as possible:
ExpectedTestError = Variance + Bias² + IrreducibleError
But as a model complexity increases, the bias decreases and the variance increases which leads to overfitting. And vice versa, model simplification helps to decrease the variance but it increases the bias which leads to underfitting.
What is overfitting?
When your model perform very well on your training set but can’t generalize the test set, because it adjusted a lot to the training set.
How to validate your models?
One of the most common approaches is splitting data into train, validation and test parts. Models are trained on train data, hyperparameters (for example early stopping) are selected based on the validation data, the final measurement is done on test dataset. Another approach is cross-validation: split dataset into K folds and each time train models on training folds and measure the performance on the validation folds. Also you could combine these approaches: make a test/holdout dataset and do cross-validation on the rest of the data. The final quality is measured on test dataset.
Why do we need to split our data into three parts: train, validation, and test?
The training set is used to fit the model, i.e. to train the model with the data. The validation set is then used to provide an unbiased evaluation of a model while fine-tuning hyperparameters. This improves the generalization of the model. Finally, a test data set which the model has never “seen” before should be used for the final evaluation of the model. This allows for an unbiased evaluation of the model. The evaluation should never be performed on the same data that is used for training. Otherwise the model performance would not be representative.
Can you explain how cross-validation works?
Cross-validation is the process to separate your total training set into two subsets: training and validation set, and evaluate your model to choose the hyperparameters. But you do this process iteratively, selecting differents training and validation set, in order to reduce the bias that you would have by selecting only one validation set.
What is K-fold cross-validation?
K fold cross validation is a method of cross validation where we select a hyperparameter k. The dataset is now divided into k parts. Now, we take the 1st part as validation set and remaining k-1 as training set. Then we take the 2nd part as validation set and remaining k-1 parts as training set. Like this, each part is used as validation set once and the remaining k-1 parts are taken together and used as training set. It should not be used in a time series data.
How do we choose K in K-fold cross-validation? What’s your favorite K?
There are two things to consider while deciding K: the number of models we get and the size of validation set. We do not want the number of models to be too less, like 2 or 3. At least 4 models give a less biased decision on the metrics. On the other hand, we would want the dataset to be at least 20-25% of the entire data. So that at least a ratio of 3:1 between training and validation set is maintained.
I tend to use 4 for small datasets and 5 for large ones as K.
What is classification? Which models would you use to solve a classification problem?
Classification problems are problems in which our prediction space is discrete, i.e. there is a finite number of values the output variable can be. Some models which can be used to solve classification problems are: logistic regression, decision tree, random forests, multi-layer perceptron, one-vs-all, amongst others.
What is logistic regression? When do we need to use it?
Logistic regression is a Machine Learning algorithm that is used for binary classification. You should use logistic regression when your Y variable takes only two values, e.g. True and False, “spam” and “not spam”, “churn” and “not churn” and so on. The variable is said to be a “binary” or “dichotomous”.
Is logistic regression a linear model? Why?
Yes, Logistic Regression is considered a generalized linear model because the outcome always depends on the sum of the inputs and parameters. Or in other words, the output cannot depend on the product (or quotient, etc.) of its parameters.
What is sigmoid? What does it do?
A sigmoid function is a type of activation function, and more specifically defined as a squashing function. Squashing functions limit the output to a range between 0 and 1, making these functions useful in the prediction of probabilities.
Sigmod(x) = 1/(1+e^{-x})