Interview PREP Flashcards
What is the difference between supervised machine learning and unsupervised? Give examples.
Unsupervised machine learning is when you have a bunch of input values but no associated output value, so you don't know exactly what you're looking for; the goal is to find structure in unlabeled data. Examples include clustering (e.g. k-means) and dimensionality reduction (e.g. PCA).
Supervised machine learning is when there is an associated response variable Yi and we try to learn the relationship between the predictors and that response variable, e.g. linear regression, XGBoost, etc. When I think of supervised machine learning, I think of inference and prediction.
What is inference and prediction?
Predictive models care mainly about minimizing the prediction error and not so much about how the model arrives at its answer. When I think of the epitome of a predictive model, I think of a neural network.
Inferential models dive deeper. In inferential modeling, you really want to see how the individual predictors affect your prediction and understand the relationships in your model.
An overall example would be modeling housing prices. When you're trying to be as predictive as possible, you really just care about your accuracy. With an inferential methodology, you care about questions like "how does square footage affect the price?"
What is regression? Which models can you solve with regression?
Regression is the part of supervised ML that investigates the relationship between a dependent variable and one or more independent variables. Common types include linear regression, polynomial regression, Ridge regression, and Lasso regression.
What is linear regression? When do we use it?
Linear regression models assume a linear relationship between the dependent variable and the independent variables.
Simple linear regression has one predictor: y = b0 + b1*x
Multiple linear regression has several predictors: y = b0 + b1*x1 + b2*x2 + ...
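A minimal sketch of fitting a multiple linear regression with scikit-learn; the data is randomly generated purely for illustration.

```python
# Minimal sketch: fitting a multiple linear regression with scikit-learn.
# The data is randomly generated purely for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                       # two independent variables
y = 3.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.1, size=100)

model = LinearRegression().fit(X, y)                # estimates b0 and b1, b2
print(model.intercept_, model.coef_)                # roughly 3.0 and [2.0, -1.5]
```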
What are the main assumptions of linear regression?
Linear relationship
Normality of residuals - the residuals are assumed to be normally distributed. This can best be checked with a histogram or a Q-Q plot, or with a goodness-of-fit test such as the Kolmogorov-Smirnov test. When the data is not normally distributed, a non-linear transformation (e.g. a log transformation) might fix this issue.
No or little multicollinearity - the predictors should not be highly correlated with each other.
No autocorrelation of errors - the residuals should be independent of each other.
Homoscedasticity - the variance of the error term should not depend on the values of the independent variables (a quick diagnostics sketch follows this list).
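A rough diagnostics sketch for some of these assumptions on toy data, assuming statsmodels, scipy, and matplotlib are available.

```python
# Rough sketch of checking some linear regression assumptions on toy data.
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.stats.stattools import durbin_watson
from scipy import stats

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = 1.0 + 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=200)

Xc = sm.add_constant(X)
fit = sm.OLS(y, Xc).fit()
resid = fit.resid

# Normality of residuals: points of the Q-Q plot should lie close to the diagonal
stats.probplot(resid, dist="norm", plot=plt)
plt.show()

# Multicollinearity: VIF near 1 is fine, values above ~5-10 are a warning sign
print([variance_inflation_factor(Xc, i) for i in range(1, Xc.shape[1])])

# Autocorrelation of errors: Durbin-Watson close to 2 means little autocorrelation
print(durbin_watson(resid))

# Homoscedasticity: residuals vs. fitted values should show a roughly constant spread
plt.scatter(fit.fittedvalues, resid)
plt.show()
```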
What's the normal distribution and why should we care about it?
The normal distribution is a continuous probability distribution where the mean, median, and mode are all the same. We should care about it because of the central limit theorem, which says that the distribution of sample means approaches a normal distribution as the sample size grows, regardless of how the underlying data is distributed. It also gives handy rules of thumb: roughly 16% of values lie more than one standard deviation above the mean, and roughly 2.5% lie more than two standard deviations above it.
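A quick empirical check of those tail percentages on simulated normal data (simulation for illustration only).

```python
# Quick empirical check of the one- and two-standard-deviation tail percentages.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=1_000_000)

print((x > 1).mean())   # roughly 0.16: ~16% lie more than 1 std above the mean
print((x > 2).mean())   # roughly 0.025: ~2.5% lie more than 2 stds above the mean
```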
What is gradient descent?
Gradient descent is an optimization algorithm for finding a local minimum of a differentiable function. In machine learning it is used to find the values of a model's parameters (coefficients) that minimize a cost function. Imagine a blindfolded man who wants to reach the bottom of a valley in as few steps as possible. He starts by taking big steps in the steepest downhill direction, which works well while he is still far from the bottom. As he gets closer, however, his steps get smaller and smaller to avoid overshooting the lowest point.
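A minimal gradient descent sketch for one-variable linear regression; the data, learning rate, and iteration count are all made up for illustration.

```python
# Minimal gradient descent sketch for one-variable linear regression (toy data).
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 4.0 + 2.5 * x + rng.normal(size=100)

b0, b1 = 0.0, 0.0                  # start from arbitrary coefficients
lr = 0.01                          # learning rate (step size)

for _ in range(5000):
    y_hat = b0 + b1 * x
    error = y_hat - y
    grad_b0 = 2 * error.mean()         # gradient of MSE w.r.t. b0
    grad_b1 = 2 * (error * x).mean()   # gradient of MSE w.r.t. b1
    b0 -= lr * grad_b0                 # step against the gradient
    b1 -= lr * grad_b1

print(b0, b1)                      # should end up near 4.0 and 2.5
```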
What is batch gradient descent?
Batch gradient descent, also called vanilla gradient descent, calculates the error for each example within the training dataset, but only after all training examples have been evaluated does the model get updated. This whole process is like a cycle and it’s called a training epoch.
Some advantages of batch gradient descent are that it is computationally efficient and that it produces a stable error gradient and stable convergence. Some disadvantages are that the stable error gradient can sometimes result in a state of convergence that isn't the best the model can achieve, and that it requires the entire training dataset to be in memory and available to the algorithm.
What is stochastic gradient descent?
By contrast, stochastic gradient descent (SGD) does this for each training example within the dataset, meaning it updates the parameters for each training example one by one. Depending on the problem, this can make SGD faster than batch gradient descent. One advantage is the frequent updates allow us to have a pretty detailed rate of improvement.
The frequent updates, however, are more computationally expensive than the batch gradient descent approach. Additionally, the frequency of those updates can result in noisy gradients, which may cause the error rate to jump around instead of slowly decreasing.
What is mini batch gradient descent?
Mini-batch gradient descent is the go-to method since it’s a combination of the concepts of SGD and batch gradient descent. It simply splits the training dataset into small batches and performs an update for each of those batches. This creates a balance between the robustness of stochastic gradient descent and the efficiency of batch gradient descent.
Common mini-batch sizes range between 50 and 256, but like any other machine learning technique, there is no clear rule because it varies for different applications. This is the go-to algorithm when training a neural network and it is the most common type of gradient descent within deep learning.
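A sketch contrasting the three variants on the same linear-regression gradient; only the amount of data used per parameter update changes. The data, learning rate, and batch size are illustrative.

```python
# Batch vs. stochastic vs. mini-batch updates on the same MSE gradient (toy data).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=1000)

def grad(w, Xb, yb):
    # Gradient of MSE on the given batch
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

w = np.zeros(3)
lr, batch_size = 0.05, 64

for epoch in range(50):
    # Batch GD would do one update per epoch:   w -= lr * grad(w, X, y)
    # SGD would set batch_size = 1 and update once per example
    idx = rng.permutation(len(y))                 # shuffle once per epoch
    for start in range(0, len(y), batch_size):    # mini-batch updates
        b = idx[start:start + batch_size]
        w -= lr * grad(w, X[b], y[b])

print(w)   # close to [1.0, -2.0, 0.5]
```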
Which metrics do you know for evaluating linear regression?
Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), R² (the coefficient of determination), and Adjusted R².
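Most of these are available directly in scikit-learn; adjusted R² is a one-line formula on top of R². The numbers below are toy values.

```python
# Computing the common regression metrics with scikit-learn (toy numbers).
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.8, 5.4, 2.9, 6.6])

mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)

# Adjusted R² penalizes extra predictors: 1 - (1 - R²)(n - 1)/(n - p - 1)
n, p = len(y_true), 2          # p = number of predictors (assumed here)
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(mse, rmse, mae, r2, adj_r2)
```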
What is the bias-variance trade off?
Bias is the error introduced by approximating the true underlying function, which can be quite complex, by a simpler model (high bias leads to underfitting).
Variance is a model's sensitivity to changes in the training dataset (high variance leads to overfitting).
The bias-variance trade-off is the relationship between the expected test error and the model's bias and variance: both contribute to the test error and ideally both should be as small as possible, but reducing one typically increases the other. For squared error loss, expected test error = bias² + variance + irreducible error.
What is over fitting? What is underfitting?
Overfitting is when a model learns the noise and idiosyncrasies of the training data and fails to generalize to new data; underfitting is when a model is too simple to capture the underlying pattern. As model complexity increases, the bias decreases and the variance increases, which leads to overfitting. Vice versa, simplifying the model decreases the variance but increases the bias, which leads to underfitting.
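A small illustration, assuming scikit-learn: polynomial models of increasing degree on synthetic data, where the training error keeps falling while the held-out error eventually rises again.

```python
# Underfitting vs. overfitting: polynomial models of increasing degree on noisy data.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=80).reshape(-1, 1)
y = np.sin(x).ravel() + rng.normal(scale=0.3, size=80)

x_tr, x_te, y_tr, y_te = train_test_split(x, y, test_size=0.5, random_state=0)

for degree in (1, 4, 15):      # too simple, about right, too flexible
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(x_tr, y_tr)
    train_mse = mean_squared_error(y_tr, model.predict(x_tr))   # keeps shrinking
    test_mse = mean_squared_error(y_te, model.predict(x_te))    # eventually grows
    print(degree, train_mse, test_mse)
```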
How to validate your models?
One of the most common approaches is splitting data into train, validation and test parts. Models are trained on train data, hyperparameters (for example early stopping) are selected based on the validation data, the final measurement is done on test dataset. Another approach is cross-validation: split dataset into K folds and each time train models on training folds and measure the performance on the validation folds. Also, you could combine these approaches: make a test/holdout dataset and do cross-validation on the rest of the data. The final quality is measured on the test dataset.
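A minimal k-fold cross-validation sketch with scikit-learn; the data and the Ridge model are arbitrary placeholders.

```python
# Minimal k-fold cross-validation sketch with scikit-learn (synthetic data).
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=cv, scoring="r2")
print(scores.mean(), scores.std())   # average R² across the 5 validation folds
```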
Why do we need to split the data into train,validation,and test?
The training set is used to fit the model, i.e. to train the model with the data. The validation set is then used to provide an unbiased evaluation of a model while fine-tuning hyperparameters. This improves the generalization of the model. Finally, a test data set which the model has never “seen” before should be used for the final evaluation of the model. This allows for an unbiased evaluation of the model. The evaluation should never be performed on the same data that is used for training. Otherwise the model performance would not be representative.
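One common recipe for getting the three splits, assuming scikit-learn, is simply to call train_test_split twice; the split proportions below are arbitrary.

```python
# Train/validation/test split by calling train_test_split twice (toy data).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

# First carve off the test set, then split the rest into train and validation.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))   # 600 / 200 / 200
```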
How do you go about adding and removing variables in your model?
Forward selection - start with just the intercept and add variables one at a time, each time adding the variable that most improves the fit (e.g. gives the lowest RSS), until a stopping rule is reached (a rough code sketch follows this list).
Backward selection - start with all the variables in the model and keep removing the variable with the largest p-value until a stopping rule is reached (e.g. all remaining p-values are below a chosen threshold).
Mixed selection - start with no variables, add variables as in forward selection, but remove any variable whose p-value rises above a certain threshold, continuing until all variables in the model have low p-values and any variable left out would have a large p-value if added.
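A rough sketch of forward selection driven by a cross-validated score on synthetic data; scikit-learn's SequentialFeatureSelector automates the same idea.

```python
# Forward selection: greedily add the variable that most improves the CV score.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=300, n_features=8, n_informative=3,
                       noise=10.0, random_state=0)

selected, remaining = [], list(range(X.shape[1]))
best_score = -np.inf

while remaining:
    # Try adding each remaining variable and keep the one that helps most.
    scores = {j: cross_val_score(LinearRegression(), X[:, selected + [j]], y, cv=5).mean()
              for j in remaining}
    j_best = max(scores, key=scores.get)
    if scores[j_best] <= best_score:     # stop once no variable improves the score
        break
    best_score = scores[j_best]
    selected.append(j_best)
    remaining.remove(j_best)

print(selected)
```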
What are some approaches to validation?
The simple hold-out (validation set) approach, leave-one-out cross-validation (LOOCV), and k-fold cross-validation.
What is logistic regression and when is it used?
Logistic regression is a machine learning algorithm that is used for binary classification. You should use logistic regression when your Y variable takes only two values, e.g. True and False, "spam" and "not spam", "churn" and "not churn", and so on. Such a variable is said to be "binary" or "dichotomous".
What is a sigmoid function? What does it do?
A sigmoid function is a type of activation function, and more specifically defined as a squashing function. Squashing functions limit the output to a range between 0 and 1, making these functions useful in the prediction of probabilities.
sigmoid(x) = 1 / (1 + e^(-x))
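A quick illustration of the sigmoid squashing real numbers into (0, 1), and of logistic regression producing probabilities (toy data, scikit-learn assumed).

```python
# The sigmoid squashes any real number into (0, 1); logistic regression uses it
# to turn a linear score into a probability.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(np.array([-5.0, 0.0, 5.0])))   # ~[0.007, 0.5, 0.993]

X, y = make_classification(n_samples=500, n_features=4, random_state=0)
clf = LogisticRegression().fit(X, y)
print(clf.predict_proba(X[:3]))              # class probabilities for 3 examples
```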
Is accuracy always a good metric?
Accuracy is not a good performance metric when there is imbalance in the dataset. For example, in binary classification with 95% of class A and 5% of class B, a constant prediction of class A would have an accuracy of 95%. In the case of an imbalanced dataset, we need to choose precision, recall, or F1 score depending on the problem we are trying to solve.
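A toy illustration of that 95%/5% example: the constant "class A" prediction gets 95% accuracy but zero precision, recall, and F1 on the minority class.

```python
# Why accuracy misleads on imbalanced data: constant majority-class predictions.
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = np.array([0] * 95 + [1] * 5)    # 95% class A (0), 5% class B (1)
y_pred = np.zeros(100, dtype=int)        # always predict class A

print(accuracy_score(y_true, y_pred))                    # 0.95
print(precision_score(y_true, y_pred, zero_division=0))  # 0.0
print(recall_score(y_true, y_pred, zero_division=0))     # 0.0
print(f1_score(y_true, y_pred, zero_division=0))         # 0.0
```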
What is regularization and why do we need it?
Regularization is used to reduce overfitting in machine learning models. It helps the models to generalize well and make them robust to outliers and noise in the data.
Which regularization techniques do you know?
L1 regularization (Lasso) - adds the sum of the absolute values of the coefficients to the cost function: Cost = RSS + λ * Σ|βj|. It can shrink coefficients all the way to 0.
L2 regularization (Ridge) - adds the sum of the squares of the coefficients to the cost function: Cost = RSS + λ * Σβj². It can shrink coefficients close to 0 but never exactly to 0.
In both cases lambda (λ) determines the amount of regularization.
What does L2 regularization look like in a linear model?
L2 regularization adds a penalty term to our cost function equal to the sum of squares of the model's coefficients multiplied by a lambda hyperparameter. This technique keeps the coefficients small (close to zero) and is widely used when we have many features that might correlate with each other.
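A short Ridge-vs-OLS sketch with scikit-learn on synthetic data; here alpha plays the role of lambda.

```python
# Ridge (L2) vs. plain OLS: alpha pulls the coefficients toward zero
# without zeroing them out entirely.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge

X, y = make_regression(n_samples=100, n_features=20, noise=10.0, random_state=0)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)

print(abs(ols.coef_).sum(), abs(ridge.coef_).sum())   # ridge coefficients are smaller
```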
How do we select the right regularization parameters?
Regularization parameters can be chosen using a grid search. For example, https://scikit-learn.org/stable/modules/linear_model.html gives the regularized cost formulas; the alpha in those formulas can be found by running a RandomSearch or GridSearch over a set of candidate values and selecting the alpha that gives the lowest cross-validation (or validation) error.
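A minimal grid-search sketch for alpha, assuming scikit-learn and a Ridge model; any regularized model would work the same way.

```python
# Minimal grid search over alpha with cross-validation (toy data).
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=200, n_features=30, noise=15.0, random_state=0)

search = GridSearchCV(Ridge(), {"alpha": np.logspace(-3, 3, 13)}, cv=5)
search.fit(X, y)

print(search.best_params_, search.best_score_)   # alpha with the best CV score
```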