Machine Learning Questions Flashcards

1
Q

What is overfitting?

A

Overfitting happens when your model is too tailored to the training data (low bias, high variance). In other words, it has learned the detail and noise in the training data, so when the model is applied to the test dataset, which does not share that same detail and noise, it performs poorly.

2
Q

What is bias vs. variance?

A

Bias measures how far off, on average, the model's predictions are from the correct values; variance measures how much the predictions vary between different realizations of the model (e.g., models trained on different samples of the data). Linear models often have high bias but low variance; nonlinear models often have low bias but high variance. As an analogy, a student's score on a test reflects his or her average ability (bias in this case) as well as his or her performance fluctuation on that particular test (variance). High variance may indicate overfitting.

3
Q

How do you overcome overfitting? Please list 3-5 practical approaches. / What is the 'curse of dimensionality'? How do you prevent it?

A

Bagging, cross-validation (tuning hyperparameters using k folds), reducing the number of features, pruning, and regularization (which adds a penalty as model complexity increases). The curse of dimensionality refers to the fact that as dimensionality increases, the volume of the space grows so fast that the available data become sparse. This sparsity is problematic for any method that requires statistical significance. A sketch of one remedy is shown below.
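
A minimal sketch of one of these remedies, tuning a regularization penalty with 5-fold cross-validation; the synthetic dataset and alpha grid are illustrative assumptions, not a prescribed recipe:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Illustrative data: many features relative to samples invites overfitting.
X, y = make_regression(n_samples=100, n_features=50, noise=10.0, random_state=0)

# Tune the L2 penalty strength via 5-fold cross-validation.
for alpha in [0.01, 0.1, 1.0, 10.0, 100.0]:
    scores = cross_val_score(Ridge(alpha=alpha), X, y, cv=5, scoring="r2")
    print(f"alpha={alpha:>6}: mean CV R^2 = {scores.mean():.3f}")
```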

4
Q

Please briefly describe the Random Forest classifier. How does it work? Any pros and cons in practical implementation?

A

A Random Forest classifier is an ensemble of decision trees used for classification (a random forest can also be used for regression). The model works by constructing many decision trees, each trained on a random sample of the training data and a random subset of the features, and aggregating their votes. PRO: less overfitting than a single tree, parallel training, handles large datasets with high dimensionality. CON: low interpretability, and tree-based predictions cannot extrapolate beyond the range of the training data.
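
A minimal scikit-learn sketch; the synthetic dataset and hyperparameters are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Illustrative synthetic classification data.
X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each tree sees a bootstrap sample of rows and sqrt(n_features) candidate features per split.
clf = RandomForestClassifier(n_estimators=200, max_features="sqrt", n_jobs=-1, random_state=0)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```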

5
Q

What is Bagging? Why is it a popular method?

A

Bagging (bootstrap aggregating) is an ensemble learning method that trains each learner on a different subset of the data (sampled with replacement) and aggregates their predictions. It reduces the variance of the algorithm and can be applied to decision trees or other algorithms. It is popular because it boosts performance by reducing variance, and because the individual learners can be trained in parallel, so it scales well.
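
A minimal sketch comparing a single tree to a bagged ensemble; the synthetic data and estimator counts are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# 50 trees, each fit on a bootstrap sample (sampling with replacement) of the rows.
single = DecisionTreeClassifier(random_state=0)
bagged = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                           bootstrap=True, n_jobs=-1, random_state=0)

print("single tree CV accuracy:", cross_val_score(single, X, y, cv=5).mean())
print("bagged trees CV accuracy:", cross_val_score(bagged, X, y, cv=5).mean())
```

Bagging typically narrows the gap between training and validation accuracy because averaging many high-variance trees cancels out much of their individual noise.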

6
Q

What is Boosting? How is it different from bagging?

A

Boosting and bagging are both ensemble learning methods in which several weak learners are combined into a strong learner. Boosting uses all the data and trains the learners sequentially: instances misclassified by the previous learners are given more weight, so subsequent learners focus more on those instances during training. PRO: reduces bias as well as variance; a fine-tuned boosting model often outperforms a bagging model. CON: slow to train and hard to parallelize, since the learners must be fit one after another.
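
A minimal gradient-boosting sketch; the synthetic data and hyperparameters are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Trees are fit sequentially; each new tree corrects the errors of the current ensemble.
clf = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                 max_depth=3, random_state=0)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```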

7
Q

What is SVM (support vector machine)?

A

A supervised learning algorithm that can perform linear or nonlinear classification, regression, or outlier detection. The basic idea is to maximize the minimum margin between the different groups of the categorical variable. In linear classification, SVM finds the separating line (hyperplane) that is furthest from the closest points of each group (the support vectors).
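
A minimal scikit-learn sketch; the synthetic data and the RBF kernel choice are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# SVMs are scale-sensitive, so standardize first; the RBF kernel handles nonlinear boundaries.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
print("CV accuracy:", cross_val_score(model, X, y, cv=5).mean())
```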

8
Q

Briefly rephrase PCA in your own way. How does it work? Pros and Cons of PCA?

A

PCA is a way to reduce dimensionality: given a d-dimensional space, PCA finds a lower-dimensional projection that captures most of the "variability" in the original data. PRO: removes multicollinearity, reduces overfitting, improves visualization, and can ultimately improve algorithm performance. CON: the resulting features are less interpretable, data standardization is necessary, and some information is lost.
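
A minimal sketch on data that is genuinely low-dimensional; the latent structure and the 95%-variance threshold are illustrative assumptions:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))                          # 3 true underlying factors
X = latent @ rng.normal(size=(3, 10)) + 0.1 * rng.normal(size=(200, 10))

# Standardize first (PCA is variance-based), then keep enough components for 95% of the variance.
X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_std)
print(X_reduced.shape, pca.explained_variance_ratio_.sum())
```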

9
Q

Why doesn’t logistic regression use R^2?

A

R^2 equals explained variation / total variation for linear regression models, but if you use R^2 to evaluate the performance of a logistic regression, it can be high for both good and bad models, so it is not a reliable indicator of model performance. Instead, we can evaluate with classification metrics derived from the confusion matrix, such as accuracy.

10
Q

When will you use L1 regularization compared to L2?

A

Both L1 and L2 regularization prevent overfitting by shrinking the coefficients through a penalty term. L2 (ridge) shrinks all coefficients proportionally but eliminates none, while L1 (lasso) can shrink some coefficients to exactly zero, performing variable selection. If all the features are correlated with the label, ridge tends to outperform lasso; if only a subset of features are correlated with the label, lasso tends to outperform ridge. L1 can also be used for feature selection; L2 can be used to address multicollinearity.
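
A minimal sketch contrasting the two penalties; the synthetic data (only 5 of 30 features informative) and alpha value are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Illustrative data where only 5 of 30 features actually matter.
X, y = make_regression(n_samples=200, n_features=30, n_informative=5,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# Lasso zeros out irrelevant coefficients; ridge only shrinks them.
print("lasso nonzero coefs:", np.sum(lasso.coef_ != 0))
print("ridge nonzero coefs:", np.sum(ridge.coef_ != 0))
```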

11
Q

What is R-Squared?

A

R-squared measures how close the data are to the fitted regression line; it equals explained variation / total variation, i.e. R^2 = 1 - SS_res / SS_tot.
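
A tiny worked sketch, computing R^2 by hand and checking it against scikit-learn; the numbers are made up for illustration:

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.3, 6.9, 9.2])

ss_res = np.sum((y_true - y_pred) ** 2)          # residual (unexplained) variation
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation
print(1 - ss_res / ss_tot, r2_score(y_true, y_pred))  # both give the same value
```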

12
Q

What is KNN (K nearest neighbor)? Pros and Cons of KNN?

A

Given a data point, compute its K nearest data points using some distance metric. For classification, take the majority label of the neighbors; for regression, take the mean of the neighbors' label values. KNN has no training phase; all computation happens at inference time, so it is a "lazy" learner. A high / low value of K may lead to underfitting / overfitting. A sketch is shown after this list.
PRO: Very simple implementation. Robust with regard to the search space; for instance, classes don't have to be linearly separable. The classifier can be updated online at very little cost as new instances with known classes arrive. Few parameters to tune: the distance metric and K.
CON: Sensitive to noisy or irrelevant attributes. Sensitive to very unbalanced datasets. Computationally expensive, since each test data point must be compared with every training data point to compute distances.
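
A minimal sketch sweeping K; the synthetic data and K values are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Standardize so no single feature dominates the distance metric; sweep K.
for k in (1, 5, 25, 100):
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    print(f"k={k:>3}: CV accuracy = {cross_val_score(model, X, y, cv=5).mean():.3f}")
```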

13
Q

What is cross validation?

A

Cross-validation is a technique for evaluating predictive models by partitioning the original sample into a training set to train the model and a validation set to evaluate it. K-fold cross-validation divides the data into k partitions, trains on k-1 folds, and evaluates on the remaining fold. This results in k models/evaluations, which can be averaged to get an overall measure of model performance.
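
A minimal 5-fold sketch; the synthetic data and logistic-regression estimator are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Each fold serves once as the validation set while the other 4 folds train the model.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print(scores, scores.mean())
```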

14
Q

What is Linear Regression?

A

A linear approach to modeling the relationship between the independent variables and the dependent variable, i.e. it assumes a linear relationship between the features and the label. The parameters are learned by minimizing a cost function, typically the sum of squared residuals.
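
A minimal sketch fitting ordinary least squares in closed form with NumPy; the coefficients and noise level are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 3.0 + 0.1 * rng.normal(size=100)

# Ordinary least squares: minimize ||Xb - y||^2 (bias column appended for the intercept).
Xb = np.column_stack([X, np.ones(len(X))])
coef, *_ = np.linalg.lstsq(Xb, y, rcond=None)
print(coef)  # should recover roughly [2, -1, 0.5, 3]
```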

15
Q

Difference between supervised learning and unsupervised learning?

A

In supervised learning, the algorithm learns on a labeled dataset; in unsupervised learning, the model learns on unlabeled data, which the algorithm tries to make sense of by extracting features and patterns on its own. Unsupervised learning examples include clustering and dimensionality reduction (e.g., PCA).

16
Q

What is accuracy, precision, and recall?

A

These metrics are based on the confusion matrix, whose cells count true/false positives and true/false negatives (predicted vs. actual labels). Accuracy = (TP + TN) / total; Precision = TP / (TP + FP), the percentage of predicted positives that are truly positive; Recall = TP / (TP + FN), the percentage of actual positives that are found.
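
A tiny worked sketch verifying the formulas against scikit-learn; the labels are made up for illustration:

```python
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Unpack the binary confusion matrix and recompute each metric by hand.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("accuracy :", accuracy_score(y_true, y_pred), "=", (tp + tn) / len(y_true))
print("precision:", precision_score(y_true, y_pred), "=", tp / (tp + fp))
print("recall   :", recall_score(y_true, y_pred), "=", tp / (tp + fn))
```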

17
Q

What is P-value? How do you use it in hypothesis testing?

A

The p-value is defined as the probability of observing data as extreme as, or more extreme than, the observed data if the null hypothesis were true. In hypothesis testing, we reject the null hypothesis if the p-value is less than the significance level (e.g., 0.05).
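
A minimal sketch with a one-sample t-test; the sample and null hypothesis are illustrative assumptions:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=0.3, scale=1.0, size=50)  # illustrative sample

# One-sample t-test of H0: the population mean is 0.
t_stat, p_value = stats.ttest_1samp(sample, popmean=0.0)
print(p_value, "reject H0" if p_value < 0.05 else "fail to reject H0")
```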

18
Q

What is logistic regression? What is the sigmoid function?

A

Logistic regression uses the sigmoid function to map a linear combination of the features to a value in (0, 1), interpreted as a probability. It is a form of binary classification and can be used to predict binary events such as win or lose. The sigmoid function is sigma(x) = 1 / (1 + e^(-x)).
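
A minimal sketch of the sigmoid itself:

```python
import numpy as np

def sigmoid(x):
    """Map any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(np.array([-5.0, 0.0, 5.0])))  # ~[0.007, 0.5, 0.993]
```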

19
Q

What is standardization Vs. Normalization? Why are they necessary? When to use which?

A

Standardization transforms data to have a mean of 0 and a standard deviation of 1 (using the z-score), while normalization scales data to between 0 and 1. Standardization lets us compare data measured in different units; otherwise, variables measured at different scales might bias the model. Normalization is used for variables with different ranges of values. Use standardization when the data are approximately normally distributed; otherwise use normalization.
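
A minimal sketch of both transforms side by side; the two-feature array is an illustrative assumption:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 500.0]])  # features on very different scales

X_std = StandardScaler().fit_transform(X)   # each column -> mean 0, std 1
X_norm = MinMaxScaler().fit_transform(X)    # each column -> range [0, 1]
print(X_std.mean(axis=0), X_std.std(axis=0))
print(X_norm.min(axis=0), X_norm.max(axis=0))
```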

20
Q

Please describe the difference between GBM (Gradient Boosting Machine) tree model and Random Forest. When to use which?

A

Random forests train a large number of decision trees at the same time (in parallel), and the prediction is the majority vote (classification) or average (regression) of the trees.
Gradient boosting also trains a number of decision trees, but one tree at a time: each new tree focuses on the weaknesses (errors) of the previous trees, and the prediction is made by summing up the predictions of all trees. A well-tuned boosting model can outperform a random forest, but it is not a good choice when the data have a lot of noise, since that will lead to overfitting.

21
Q

What is feature selection and why is it important?

A

Selecting the attributes that improve prediction accuracy and eliminating attributes that are irrelevant or decrease accuracy. It is important because fewer features mean simpler, faster models that are less prone to overfitting.

22
Q

What is data correlation? Why is it useful?

A

Data correlation occurs when one or more attributes depend on, or are associated with, other attributes. Correlation can help predict missing values; it can also hint at (but does not by itself establish) a causal relationship.

23
Q

What is multicollinearity? Which models are impacted? How to deal with it?

A

Multicollinearity occurs when one predictor variable in a multiple regression model can be linearly predicted from the others with a high degree of accuracy. It can be detected using a correlation matrix or variance inflation factors (VIF), and it can lead to misleading results (unstable coefficient estimates). Decision trees and boosted trees are not impacted, but regression models are. You can deal with multicollinearity by deleting one of the correlated features or by using PCA. A VIF sketch is shown below.
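
A minimal VIF sketch, assuming statsmodels is available; the nearly collinear columns are an illustrative assumption:

```python
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 * 2.0 + 0.1 * rng.normal(size=100)   # nearly collinear with x1
x3 = rng.normal(size=100)
X = np.column_stack([x1, x2, x3])

# VIF >> 10 is a common rule of thumb for problematic multicollinearity.
for i in range(X.shape[1]):
    print(f"VIF(x{i + 1}) = {variance_inflation_factor(X, i):.1f}")
```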