Interview Prep Flashcards
What is the difference between supervised machine learning and unsupervised? Give examples.
Unsupervised machine learning is when you have a set of input values but no associated output value, so you don't know in advance exactly what you're looking for; it deals with unlabeled data. Examples include clustering (e.g., k-means) and dimensionality reduction (e.g., PCA).
Supervised machine learning is when there is an associated response variable Yi and we try to find the relationship between the predictors and the response. Examples include linear regression, XGBoost, etc. When I think of supervised machine learning, I think of inference and prediction.
What is inference and prediction?
Predictive models care mainly about minimizing the prediction error and not so much about how we got there. When I think of the epitome of predictive models, I think of neural networks.
Inferential models dive deeper. In inferential modeling, you really want to see how the individual predictors affect the response; you're more interested in understanding the relationships in your model.
An overall example is modeling housing prices. When you're trying to be as predictive as possible, you mainly care about accuracy. With an inferential approach, you care about questions like "how does square footage affect the price?"
What is regression? Which models can you solve with regression?
Regression is a part of supervised ML that investigates the relationship between a dependent variable and one or more independent variables. Examples include linear regression, polynomial regression, Ridge regression, and Lasso regression.
What is linear regression? When do we use it?
Linear regression models assume a linear relationship between the dependent variable and the independent variables.
Simple linear regression has one predictor: y = b0 + b1*x
Multiple linear regression has several predictors: y = b0 + b1*x1 + b2*x2 + ...
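A minimal sketch of fitting both forms with scikit-learn; the library choice and the toy data are my own illustration, not from the flashcards:

```python
# Sketch only: simple vs. multiple linear regression with scikit-learn.
# The toy data and coefficients below are made up for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                    # two independent variables
y = 3.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=100)

simple = LinearRegression().fit(X[:, [0]], y)    # y = b0 + b1*x
multiple = LinearRegression().fit(X, y)          # y = b0 + b1*x1 + b2*x2

print(simple.intercept_, simple.coef_)           # estimates of b0, b1
print(multiple.intercept_, multiple.coef_)       # estimates of b0, b1, b2
```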
What are the main assumptions of linear regression?
Linear relationship
Multivariate normality - the residuals are assumed to be normally distributed. This can be checked visually with a histogram or a Q-Q plot, or with a goodness-of-fit test such as the Kolmogorov-Smirnov test. If the data is not normally distributed, a non-linear transformation (e.g., a log transformation) might fix the issue.
No or little multicollinearity - the predictors should not be highly correlated with each other.
No autocorrelation of errors - the residuals should be independent of each other.
Homoscedasticity - the variance of the error term should not depend on the values of the independent variables. (A code sketch of these checks follows below.)
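A hedged sketch of how these assumptions can be checked with statsmodels; the toy data and the specific checks chosen (Q-Q plot, Durbin-Watson, VIF, residuals-vs-fitted) are my own illustration:

```python
# Sketch only: common ways to check the assumptions above using statsmodels.
# The toy data is made up; real checks should use your actual design matrix.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=200)

X_const = sm.add_constant(X)
model = sm.OLS(y, X_const).fit()

# Normality of residuals: visual check with a Q-Q plot
sm.qqplot(model.resid, line="45")

# Autocorrelation of errors: Durbin-Watson statistic (values near 2 suggest independence)
print("Durbin-Watson:", durbin_watson(model.resid))

# Multicollinearity: variance inflation factor per predictor (VIF above ~5-10 is a red flag)
for i in range(1, X_const.shape[1]):             # skip the constant column
    print("VIF:", variance_inflation_factor(X_const, i))

# Homoscedasticity: plot model.fittedvalues against model.resid and look for a funnel shape
```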
what’s the normal distribution and why should we care about it?
The normal distribution is a continuous probability distribution where the mean mode and median are the same. We should care about it because it is very important to the central limit thereom which basically says that if you grab a large sample size, it should mirror a normal distribution. So if you look one std above the mean, you can assume that 16% of the population has a mean above that and then 2.5% 2 standard deviations away
What is gradient descent?
Gradient descent is an optimization algorithm for finding a local minimum of a differentiable function. It is used to find the values of a function's parameters (coefficients) that minimize a cost function as far as possible. Imagine a blindfolded man who wants to reach the bottom of a valley with as few steps as possible. He might start by taking really big steps in the steepest downhill direction, which he can do as long as he is not close to the bottom. As he comes closer to the bottom, however, his steps will get smaller and smaller to avoid overshooting it.
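A minimal numpy sketch of gradient descent minimizing the MSE of a one-feature linear model; the learning rate and iteration count are arbitrary illustration values:

```python
# Sketch only: gradient descent minimizing MSE for y = b0 + b1*x.
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=200)
y = 4.0 + 3.0 * x + rng.normal(scale=0.5, size=200)

b0, b1 = 0.0, 0.0                      # start from an arbitrary point
lr = 0.1                               # learning rate (step size)
for _ in range(500):
    error = (b0 + b1 * x) - y
    grad_b0 = 2 * error.mean()         # gradient of the MSE cost w.r.t. b0
    grad_b1 = 2 * (error * x).mean()   # gradient of the MSE cost w.r.t. b1
    b0 -= lr * grad_b0                 # step in the direction of steepest descent
    b1 -= lr * grad_b1

print(b0, b1)                          # should end up close to 4.0 and 3.0
```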
What is batch gradient descent?
Batch gradient descent, also called vanilla gradient descent, calculates the error for each example within the training dataset, but only after all training examples have been evaluated does the model get updated. This whole process is like a cycle and it’s called a training epoch.
Some advantages of batch gradient descent are that it is computationally efficient and produces a stable error gradient and stable convergence. Some disadvantages are that the stable error gradient can sometimes converge to a state that isn't the best the model can achieve, and that it requires the entire training dataset to be in memory and available to the algorithm.
What is stochastic gradient descent?
By contrast, stochastic gradient descent (SGD) calculates the error and updates the parameters for each training example in the dataset, one by one. Depending on the problem, this can make SGD faster than batch gradient descent. One advantage is that the frequent updates give us a fairly detailed picture of the rate of improvement.
The frequent updates, however, are more computationally expensive than the batch gradient descent approach. Additionally, the frequency of those updates can result in noisy gradients, which may cause the error rate to jump around instead of slowly decreasing.
What is mini batch gradient descent?
Mini-batch gradient descent is the go-to method since it’s a combination of the concepts of SGD and batch gradient descent. It simply splits the training dataset into small batches and performs an update for each of those batches. This creates a balance between the robustness of stochastic gradient descent and the efficiency of batch gradient descent.
Common mini-batch sizes range between 50 and 256, but like any other machine learning technique, there is no clear rule because it varies for different applications. This is the go-to algorithm when training a neural network and it is the most common type of gradient descent within deep learning.
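A hedged numpy sketch of mini-batch gradient descent on the same kind of toy linear model; the batch size of 64 and the 20 epochs are arbitrary choices:

```python
# Sketch only: mini-batch gradient descent on a toy linear model.
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=1000)
y = 4.0 + 3.0 * x + rng.normal(scale=0.5, size=1000)

b0, b1, lr, batch_size = 0.0, 0.0, 0.1, 64
for epoch in range(20):
    order = rng.permutation(len(x))                  # shuffle once per epoch
    for start in range(0, len(x), batch_size):
        idx = order[start:start + batch_size]
        error = (b0 + b1 * x[idx]) - y[idx]
        b0 -= lr * 2 * error.mean()                  # one update per mini-batch
        b1 -= lr * 2 * (error * x[idx]).mean()

print(b0, b1)                                        # should approach 4.0 and 3.0
```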
Which metrics do you know for evaluating linear regression?
Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), R² (the coefficient of determination), and adjusted R².
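A minimal sketch computing these metrics with scikit-learn on made-up predictions; adjusted R² is computed by hand since scikit-learn has no built-in helper for it:

```python
# Sketch only: the listed metrics computed on made-up predictions.
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_true = np.array([3.0, 5.0, 7.5, 10.0, 12.0])
y_pred = np.array([2.8, 5.4, 7.0, 9.5, 12.6])

mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)

n, p = len(y_true), 2                                # p = assumed number of predictors
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)        # adjusted R², computed by hand

print(mse, rmse, mae, r2, adj_r2)
```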
What is the bias-variance trade off?
Bias is the error introduced by approximating the true underlying function, which can be quite complex, with a simpler model (underfitting).
Variance is a model's sensitivity to changes in the training dataset (overfitting).
The bias-variance trade-off is the relationship between the expected test error and the variance and the bias - both contribute to the level of the test error and ideally should be as small as possible: expected test error = bias² + variance + irreducible error.
What is over fitting? What is underfitting?
As model complexity increases, the bias decreases and the variance increases, which leads to overfitting: the model starts fitting noise in the training data and performs poorly on new data. Conversely, simplifying the model decreases the variance but increases the bias, which leads to underfitting: the model is too simple to capture the underlying pattern.
How to validate your models?
One of the most common approaches is splitting the data into train, validation, and test parts. Models are trained on the training data, hyperparameters (for example, early stopping) are selected based on the validation data, and the final measurement is done on the test dataset. Another approach is cross-validation: split the dataset into K folds, and each time train the model on the training folds and measure the performance on the validation fold. You can also combine these approaches: set aside a test/holdout dataset and do cross-validation on the rest of the data; the final quality is then measured on the test dataset.
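A hedged sketch of both approaches with scikit-learn; the toy data, split sizes, and Ridge model are arbitrary illustration choices:

```python
# Sketch only: train/validation/test split plus k-fold CV on the non-test portion.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split, cross_val_score

rng = np.random.default_rng(4)
X = rng.normal(size=(500, 5))
y = X @ rng.normal(size=5) + rng.normal(scale=0.5, size=500)

# Hold out a test set first, then carve a validation set out of the remainder
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=0)

model = Ridge(alpha=1.0).fit(X_tr, y_tr)
print("validation R^2:", model.score(X_val, y_val))        # used for tuning decisions

# Alternative: k-fold cross-validation on everything except the holdout test set
print("5-fold CV R^2:", cross_val_score(Ridge(alpha=1.0), X_train, y_train, cv=5).mean())

# Final, unbiased measurement on the untouched test set
print("test R^2:", Ridge(alpha=1.0).fit(X_train, y_train).score(X_test, y_test))
```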
Why do we need to split the data into train, validation, and test?
The training set is used to fit the model, i.e. to train the model with the data. The validation set is then used to provide an unbiased evaluation of a model while fine-tuning hyperparameters. This improves the generalization of the model. Finally, a test data set which the model has never “seen” before should be used for the final evaluation of the model. This allows for an unbiased evaluation of the model. The evaluation should never be performed on the same data that is used for training. Otherwise the model performance would not be representative.
How do you go about adding and removing variables in your model?
Forward selection - start with just the intercept and keep adding variables, checking the RSS (residual sum of squares) at each step.
Backward selection - start with all the variables in the model and keep removing the variable with the largest p-value until a stopping rule is reached (e.g., every remaining p-value is below a certain threshold).
Mixed selection - start with no variables, add variables as in forward selection, and remove any variable whose p-value rises above a certain threshold, until every variable in the model has a low p-value and any variable outside the model would have a large p-value if added.
What are some approaches to validation?
Leave-one-out cross-validation (LOOCV), k-fold cross-validation, and the validation set (holdout) approach.
What is logistic regression and when is it used?
Logistic regression is a machine learning algorithm that is used for binary classification. You should use logistic regression when your Y variable takes only two values, e.g., true and false, "spam" and "not spam", "churn" and "not churn", and so on. Such a variable is said to be binary or dichotomous.
What is a sigmoid function? What does it do?
A sigmoid function is a type of activation function, and more specifically defined as a squashing function. Squashing functions limit the output to a range between 0 and 1, making these functions useful in the prediction of probabilities.
sigmoid(x) = 1 / (1 + e^(-x))
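A minimal numpy sketch of the sigmoid squashing raw linear scores into probabilities; the example scores are made up:

```python
# Sketch only: the sigmoid squashing raw linear scores (logits) into (0, 1).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

z = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])     # made-up raw scores b0 + b1*x1 + ...
print(sigmoid(z))                             # ~[0.007, 0.269, 0.5, 0.731, 0.993]
# In logistic regression, sigmoid(b0 + b1*x1 + ...) is read as P(y = 1 | x).
```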
Is accuracy always a good metric?
Accuracy is not a good performance metric when there is class imbalance in the dataset. For example, in binary classification with 95% of class A and 5% of class B, a constant prediction of class A would have an accuracy of 95%. With an imbalanced dataset, we need to choose precision, recall, or the F1 score depending on the problem we are trying to solve.
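A hedged sketch reproducing the 95%/5% example with scikit-learn metrics; the constant prediction scores 95% accuracy but zero precision, recall, and F1 on the rare class:

```python
# Sketch only: the 95%/5% imbalance example from above.
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = np.array([0] * 95 + [1] * 5)         # 95% class A (0), 5% class B (1)
y_pred = np.zeros(100, dtype=int)             # constant prediction of class A

print("accuracy :", accuracy_score(y_true, y_pred))                        # 0.95
print("precision:", precision_score(y_true, y_pred, zero_division=0))      # 0.0
print("recall   :", recall_score(y_true, y_pred, zero_division=0))         # 0.0
print("F1       :", f1_score(y_true, y_pred, zero_division=0))             # 0.0
```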
What is regularization and why do we need it?
Regularization is used to reduce overfitting in machine learning models. It helps the models to generalize well and make them robust to outliers and noise in the data.
Which regularization techniques do you know?
L1 Regularization (Lasso regularization) - adds the sum of the absolute values of the coefficients to the cost function; it can shrink coefficients exactly to 0.
L2 Regularization (Ridge regularization) - adds the sum of the squares of the coefficients to the cost function; it can shrink coefficients close to 0 but never exactly to 0.
In both cases, lambda determines the amount of regularization.
What does L2 regularization look like in a linear model?
L2 regularization adds a penalty term to the cost function equal to the sum of squares of the model's coefficients multiplied by a lambda hyperparameter. This technique keeps the coefficients close to zero and is widely used when we have a lot of features that might correlate with each other.
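A hedged scikit-learn sketch of this shrinkage effect; in scikit-learn's Ridge the lambda hyperparameter is called alpha, and the correlated toy features are my own illustration:

```python
# Sketch only: L2 shrinkage on two strongly correlated toy features.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(5)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)            # nearly a copy of x1
X = np.column_stack([x1, x2])
y = 3.0 * x1 + rng.normal(scale=0.5, size=200)

for alpha in [0.01, 1.0, 100.0]:                     # alpha plays the role of lambda
    print(alpha, Ridge(alpha=alpha).fit(X, y).coef_)
# Larger alpha pulls the correlated coefficients toward zero (but not exactly to zero).
```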
How do we select the right regularization parameters?
Regularization parameters can be chosen with a grid search. For example, the linear models in scikit-learn (https://scikit-learn.org/stable/modules/linear_model.html) expose the regularization strength as alpha in their formulas; alpha can be found by running a random search or a grid search over a set of values and selecting the value that gives the lowest cross-validation or validation error.
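A minimal sketch of that grid search over alpha with scikit-learn's GridSearchCV; the toy data and candidate grid are arbitrary:

```python
# Sketch only: picking alpha (the regularization strength) with GridSearchCV.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(6)
X = rng.normal(size=(300, 10))
y = X @ rng.normal(size=10) + rng.normal(size=300)

search = GridSearchCV(
    estimator=Ridge(),
    param_grid={"alpha": np.logspace(-3, 3, 13)},    # candidate lambdas
    scoring="neg_mean_squared_error",
    cv=5,
)
search.fit(X, y)
print(search.best_params_)                           # alpha with the lowest CV error
```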
What's the difference between L2 regularization (Ridge) and L1 (Lasso)?
Penalty terms: L1 regularization uses the sum of the absolute values of the weights, while L2 regularization uses the sum of the weights squared.
Feature selection: L1 performs feature selection by reducing the coefficients of some predictors to 0, while L2 does not.
Computational efficiency: L2 has an analytical solution, while L1 does not.
Multicollinearity: L2 addresses multicollinearity by constraining the coefficient norm.
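A hedged sketch of the feature-selection difference on toy data where only two of six features matter; the alpha values are arbitrary:

```python
# Sketch only: Lasso zeroing out irrelevant coefficients vs. Ridge shrinking them.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(7)
X = rng.normal(size=(300, 6))
y = 4.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=300)

print("Lasso:", Lasso(alpha=0.1).fit(X, y).coef_)    # irrelevant coefficients become exactly 0
print("Ridge:", Ridge(alpha=10.0).fit(X, y).coef_)   # irrelevant coefficients stay small but non-zero
```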
What are decision trees?
This is a type of supervised learning algorithm that is mostly used for classification problems. Surprisingly, it works for both categorical and continuous dependent variables.
In this algorithm, we split the population into two or more homogeneous sets. This is done based on most significant attributes/ independent variables to make as distinct groups as possible.
A decision tree is a flowchart-like tree structure, where each internal node (non-leaf node) denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (or terminal node) holds a value for the target variable.
To quote from The Elements of Statistical Learning: "Trees have one aspect that prevents them from being the ideal tool for predictive learning, namely inaccuracy."
What are main parameters of a decision tree?
maximum tree depth
minimum samples per leaf node
impurity criterion
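A minimal sketch wiring these three parameters into scikit-learn's DecisionTreeClassifier; the dataset and parameter values are arbitrary illustration choices:

```python
# Sketch only: the three parameters above wired into a scikit-learn decision tree.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

tree = DecisionTreeClassifier(
    max_depth=4,              # maximum tree depth
    min_samples_leaf=10,      # minimum samples per leaf node
    criterion="gini",         # impurity criterion ("gini" or "entropy")
    random_state=0,
)
tree.fit(X, y)
print(tree.get_depth(), tree.get_n_leaves())
```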
What is random forest?
Random Forest is a machine learning method for regression and classification which is composed of many decision trees. Random Forest belongs to a larger class of ML algorithms called ensemble methods (in other words, it involves combining several models to solve a single prediction problem).
Explain how RF works
Step 1) Create a bootstrapped dataset -> randomly select samples from the original dataset -> the same sample can be picked more than once.
Step 2) Build a decision tree from the bootstrapped dataset, but use only a random subset of variables at each split.
Step 3) For example, if we consider two variables and good blood circulation turns out to be the best predictor, we make that the split and keep repeating down the tree.
Step 4) Go back to step 1 and repeat -> do this many times to build many trees.
How do we use it to make a prediction?
Step 5) Run the new data point down the first tree -> let's say it predicts heart disease.
Step 6) Run it down the second tree -> let's say it also says yes -> keep going through all the trees.
Step 7) The prediction is whichever option received more votes.
Bootstrapping the data and aggregating the results is called bagging.
Step 8) Typically some of the data doesn't end up in a given bootstrapped dataset -> this is the out-of-bag dataset -> we test each tree on the out-of-bag data it never saw.
Step 9) Accuracy = the proportion of out-of-bag samples that were correctly classified.
Step 10) We can then vary things like the number of variables considered per split and choose the setting which performs best.
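A hedged scikit-learn sketch of these steps; the parameter values are my own illustration, and the out-of-bag score corresponds to steps 8-9:

```python
# Sketch only: the steps above expressed with scikit-learn's RandomForestClassifier.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=12, random_state=0)

rf = RandomForestClassifier(
    n_estimators=200,         # number of bootstrapped trees (step 4)
    bootstrap=True,           # sample with replacement (step 1)
    max_features="sqrt",      # random subset of variables at each split (step 2)
    oob_score=True,           # evaluate on the out-of-bag samples (steps 8-9)
    random_state=0,
)
rf.fit(X, y)
print("OOB accuracy:", rf.oob_score_)
print("majority vote for one sample:", rf.predict(X[:1]))    # step 7
```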
Explain gradient boosted trees
Gradient boosting starts with an initial guess (for example, the average of the target), then builds trees of a fixed size off the errors (residuals) of the previous prediction. It is similar to AdaBoost in that it scales the trees, but unlike AdaBoost it scales all trees by the same amount (the learning rate). Each new tree is built off the residuals left by the trees before it, and trees keep being added until the requested number of trees is reached or the fit stops improving. Because we are essentially predicting the residuals, the model has low bias but can have high variance if we overfit; the learning rate fights this, since taking small steps tends to give better predictions on the test dataset. The final prediction is the initial prediction plus the scaled prediction of the first tree, plus the scaled prediction of the second tree, and so on, adding one tree's contribution at a time.
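A hedged sketch of the core idea in plain code: start from the mean, repeatedly fit a small tree to the current residuals, and add its learning-rate-scaled predictions. The tree depth, learning rate, and number of trees are arbitrary illustration values:

```python
# Sketch only: gradient boosting by hand - fit small trees to the residuals and
# add their learning-rate-scaled predictions to a running prediction.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(8)
X = rng.uniform(-3, 3, size=(400, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=400)

prediction = np.full_like(y, y.mean())               # initial guess
learning_rate = 0.1
trees = []
for _ in range(100):
    residuals = y - prediction                       # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    prediction += learning_rate * tree.predict(X)    # small step toward the residuals
    trees.append(tree)

print("training MSE:", np.mean((y - prediction) ** 2))
```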
Explain XGBoost trees
XGBoost is designed to be used with large, complicated datasets.
It starts out with an initial prediction.
XGBoost then fits a regression tree to the residuals, like gradient boosting, but it uses its own unique kind of regression tree.
Each tree starts out as a single leaf, and all the residuals go to that leaf. We calculate a similarity score for the leaf: the sum of the residuals, squared, divided by the number of residuals plus lambda (a regularization parameter).
Then we ask: can we do a better job clustering the residuals by splitting the leaf?
We calculate a similarity score for each of the candidate leaves produced by the split.
When the residuals in a leaf are similar to each other (or there is just one), the similarity score is large.
To compare the new leaves with the unsplit leaf, we compute the gain: the similarity score of the left leaf plus the similarity score of the right leaf minus the similarity score of the original leaf.
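A minimal numpy sketch of the similarity-score and gain arithmetic described above (regression case); the residual values, the candidate split, and lambda are made up:

```python
# Sketch only: the similarity-score and gain arithmetic for the regression case.
import numpy as np

def similarity(residuals, lam=1.0):
    # (sum of residuals)^2 / (number of residuals + lambda)
    return residuals.sum() ** 2 / (len(residuals) + lam)

residuals = np.array([-10.5, 6.5, 7.5, -7.5])        # all residuals start in one leaf
root = similarity(residuals)

left, right = residuals[:1], residuals[1:]           # one candidate split of the residuals
gain = similarity(left) + similarity(right) - root   # left + right - original leaf
print(root, gain)      # a larger gain means the split clusters similar residuals better
```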
What hyperparameter tuning methods do you know?
Grid Search is an exhaustive approach: for each hyperparameter, the user manually gives a list of values for the algorithm to try. Grid search then evaluates the algorithm using each and every combination of hyperparameters and returns the combination that gives the optimal result (i.e., the lowest MAE). Because grid search evaluates the given algorithm using all combinations, it can be quite computationally expensive and can lead to sub-optimal results, since the user needs to specify specific values for these hyperparameters, which is prone to error and requires domain knowledge.
Random Search is similar to grid search but differs in the sense that rather than specifying which values to try for each hyper-parameter, an upper and lower bound of values for each hyper-parameter is given instead. With uniform probability, random values within these bounds are then chosen and similarly, the best combination is returned to the user. Although this seems less intuitive, no domain knowledge is necessary and theoretically much more of the parameter space can be explored.
Bayesian optimization - builds a probabilistic (surrogate) model of the validation score as a function of the hyperparameters and uses it to decide which values are most promising to evaluate next, so it typically needs fewer evaluations than grid or random search.
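A hedged sketch contrasting grid search and random search on the same model with scikit-learn; the parameter lists/ranges are arbitrary illustration values:

```python
# Sketch only: grid search (explicit value lists) vs. random search (ranges/distributions).
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [100, 200], "max_depth": [3, 5, None]},   # every combination tried
    cv=3,
).fit(X, y)

rand = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={"n_estimators": randint(50, 300), "max_depth": randint(2, 10)},
    n_iter=10,                # only 10 random combinations sampled from the ranges
    cv=3,
    random_state=0,
).fit(X, y)

print(grid.best_params_, rand.best_params_)
```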
What are the problems with sigmoid as an activation function?
The derivative of the sigmoid function is almost zero for large positive or negative inputs. From this comes the problem of the vanishing gradient: during backpropagation our net will not learn (or will learn extremely slowly). One possible way to solve this problem is to use the ReLU activation function.
What is ReLU? How is it better than sigmoid?
ReLU is an abbreviation for Rectified Linear Unit. It is an activation function which has the value 0 for all negative values and the value f(x) = x for all positive values. The ReLU has a simple activation function which makes it fast to compute and while the sigmoid and tanh activation functions saturate at higher values, the ReLU has a potentially infinite activation, which addresses the problem of vanishing gradients.
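A minimal numpy sketch of why this matters for gradients: the sigmoid's derivative collapses toward zero for large |x|, while ReLU's derivative is 1 for every positive input (the sample inputs are made up):

```python
# Sketch only: the sigmoid's gradient vanishes for large |x|, ReLU's does not.
import numpy as np

x = np.array([-10.0, -1.0, 0.5, 1.0, 10.0])

sig = 1.0 / (1.0 + np.exp(-x))
sig_grad = sig * (1.0 - sig)              # ~4.5e-5 at x = 10 -> vanishing gradient
relu_grad = (x > 0).astype(float)         # 1 for every positive input, 0 otherwise

print(sig_grad)
print(relu_grad)
```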
What regularization techniques do you know for neural networks?
L1 Regularization - Defined as the sum of absolute values of the individual parameters. The L1 penalty causes a subset of the weights to become zero, suggesting that the corresponding features may safely be discarded.
L2 Regularization - Defined as the sum of square of individual parameters. Often supported by regularization hyperparameter alpha. It results in weight decay.
Data Augmentation - creates additional, synthetically generated or transformed examples as part of the training set.
Dropout - one of the most effective regularization techniques for neural nets. A few random nodes in each layer are deactivated in the forward pass, so the algorithm trains on a different set of nodes in each iteration (see the sketch below).
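A hedged numpy sketch of (inverted) dropout applied by hand during a forward pass; the layer size and dropout rate are arbitrary, and real frameworks apply this only at training time:

```python
# Sketch only: inverted dropout applied by hand during a forward pass.
import numpy as np

rng = np.random.default_rng(9)
activations = rng.normal(size=(4, 8))     # one mini-batch of hidden activations
keep_prob = 0.8                           # i.e. a dropout rate of 0.2

mask = rng.random(activations.shape) < keep_prob
dropped = activations * mask / keep_prob  # random nodes zeroed, survivors rescaled

print(mask.astype(int))                   # a different random mask each forward pass
print(dropped)
```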
What is the difference between Data Mining and Data Profiling?
Data mining is the process of finding relevant information that has not been found before - the way in which raw data is turned into valuable information. It can involve anything from web scraping to census data.
Data profiling is usually done to assess a dataset for its uniqueness, consistency, and logic - looking at the data and asking "is it related to what I'm working with?"
Define data wrangling in terms of data analytics
Data wrangling is the process of cleaning, structuring, and enriching raw data into a desirable, usable format for better decision making.
What are the various steps involved in any analytics project?
Understand the problem, collect the data, clean the data, explore and analyze the data, and interpret the results.
What are the best practices for data cleaning?
In most analyses, around 80% of the work is in the cleaning.
Make a data cleaning plan by understanding where common errors take place, and keep communication open.
Identify and remove duplicates before working with the data
Focus on the accuracy of the data and maintain the correct data types.
Standardize the data at point of entry
How do you subset or filter data in SQL?
With the WHERE and HAVING clauses: WHERE filters individual rows before any grouping, while HAVING filters groups after a GROUP BY aggregation.