Machine Learning Questions Flashcards
What is overfitting?
Overfitting happens when your model is too closely tailored to the training data (low bias, high variance). In other words, it has learned the detail and noise in the training data, so when the model is applied to a test dataset, which may not share that detail and noise, it performs poorly.
What is bias vs. variance?
Bias measures how far off, on average, the model's predictions are from the correct values; variance measures how much the predictions vary between different realizations of the model (e.g., models trained on different samples of the data). Linear models often have high bias but low variance; nonlinear models often have low bias but high variance. As an analogy, a student's score on a test reflects his or her average ability (bias) plus the fluctuation in performance on that particular test (variance). High variance may indicate overfitting.
How do you overcome overfitting? Please list 3-5 practical techniques. / What is the 'curse of dimensionality'? How can you prevent it?
Bagging, cross-validation (tuning hyperparameters using k folds), reducing the number of features, pruning, and regularization (which adds a penalty as model complexity increases). The curse of dimensionality refers to the fact that as dimensionality increases, the volume of the space grows so fast that the available data become sparse. This sparsity is problematic for any method that requires statistical significance.
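A minimal sketch of two of these remedies, regularization and k-fold cross-validation, using scikit-learn; the synthetic dataset and the alpha grid are illustrative assumptions, not part of the original answer:

```python
# Ridge (L2) regularization tuned with 5-fold cross-validation.
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=50, noise=10.0, random_state=0)

# Larger alpha = stronger penalty on coefficient size = less overfitting.
for alpha in [0.01, 1.0, 100.0]:
    scores = cross_val_score(Ridge(alpha=alpha), X, y, cv=5, scoring="r2")
    print(f"alpha={alpha:>6}: mean CV R^2 = {scores.mean():.3f}")
```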
Please briefly describe the Random Forest classifier. How did it work? Any pros and cons in practical implementation?
A Random Forest classifier is an ensemble of decision trees used for classification (a random forest can also be used for regression). The model works by constructing many decision trees, each trained on a bootstrap sample of the training data and considering only a random subset of features at each split. PRO: less overfitting than a single tree, trees can be trained in parallel, handles large datasets with high dimensionality. CON: low interpretability; for regression, predictions are averages of training labels, so it cannot extrapolate beyond the range seen in training.
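A minimal sketch of the classifier in scikit-learn; the dataset choice and hyperparameter values are illustrative assumptions:

```python
# Random Forest: many trees on bootstrap samples with random feature subsets.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# n_estimators trees, each fit on a bootstrap sample; max_features controls
# the random subset of features considered at each split.
clf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```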
What is Bagging? Why is it a popular method?
Bagging (bootstrap aggregating) is an ensemble learning method that trains each learner on a different subset of the data (sampled with replacement). It reduces the variance of the algorithm and can be applied to decision trees or other algorithms. It is popular because it boosts performance by reducing variance, and because the individual learners can be trained in parallel and scale well.
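A minimal sketch of bagging decision trees in scikit-learn, assuming scikit-learn >= 1.2 (where the base learner parameter is named estimator); the settings are illustrative:

```python
# Bagging: many decision trees, each trained on a bootstrap sample.
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

bag = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=50,
    bootstrap=True,   # sample with replacement
    n_jobs=-1,        # learners are independent, so they train in parallel
    random_state=0,
)
print("mean CV accuracy:", cross_val_score(bag, X, y, cv=5).mean())
```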
What is Boosting? How is it different from bagging?
Boosting and bagging are both ensemble learning methods in which several weak learners are combined to create a strong learner. Boosting trains learners sequentially on all of the data; instances that were misclassified by previous learners are given more weight so that subsequent learners focus on them. PRO: reduces bias (and often variance); well-tuned boosting models frequently outperform bagging models. CON: slower to train and harder to parallelize, because the learners must be built sequentially.
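A minimal sketch using AdaBoost, which matches the reweighting description above; the dataset, number of estimators, and learning rate are illustrative assumptions:

```python
# AdaBoost: weak learners built sequentially; misclassified points
# receive more weight in the next round.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# The default weak learner is a depth-1 decision tree ("stump").
boost = AdaBoostClassifier(n_estimators=200, learning_rate=0.5, random_state=0)
print("mean CV accuracy:", cross_val_score(boost, X, y, cv=5).mean())
```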
What is SVM (support vector machine)?
SVM is a supervised learning method used for linear or nonlinear classification, regression, and outlier detection. The basic idea is to maximize the minimum margin between the different classes. In the linearly separable case, SVM finds the separating hyperplane that is furthest from the closest points of each class (the support vectors).
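A minimal sketch in scikit-learn; the RBF kernel handles the nonlinear case (use kernel="linear" for the linear one), and the dataset and C value are illustrative assumptions:

```python
# SVM classification; C trades margin width against training errors
# (smaller C = wider margin, more tolerance for misclassified points).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# SVMs are scale-sensitive, so standardize features first.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```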
Briefly rephrase PCA in your own way. How does it work? Pros and Cons of PCA?
PCA is a way to reduce dimensionality: given a d-dimensional space, PCA finds a lower-dimensional projection that captures most of the "variability" in the original data. PRO: removes multicollinearity, reduces overfitting, improves visualization, and can ultimately improve algorithm performance. CON: features become less interpretable, data standardization is necessary, and some information is lost.
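A minimal sketch in scikit-learn; the 2-component choice and the dataset are illustrative assumptions:

```python
# PCA: standardize, then project onto the directions of greatest variance.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)

# Standardization matters: PCA is sensitive to the scale of each feature.
X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_std)
print("variance captured by 2 components:", pca.explained_variance_ratio_.sum())
```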
Why doesn’t logistic regression use R^2?
R² is explained variation / total variation for linear regression models, but if you use R² to evaluate the performance of a logistic regression, it can be high for both good and bad models, so it is not a reliable indicator of model performance. Instead, we can evaluate with classification metrics such as accuracy or a confusion matrix.
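A minimal sketch of the alternative evaluation in scikit-learn; the dataset choice is an illustrative assumption:

```python
# Evaluating logistic regression with accuracy and a confusion matrix.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
print("accuracy:", accuracy_score(y_test, pred))
print("confusion matrix:\n", confusion_matrix(y_test, pred))
```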
When will you use L1 regularization compared to L2?
Both L1 and L2 regularization prevent overfitting by shrinking the coefficients through a penalty term. L2 (ridge) shrinks all coefficients but eliminates none, while L1 (lasso) can shrink some coefficients to exactly zero, performing variable selection. If all the features are correlated with the label, ridge outperforms lasso; if only a subset of features are correlated with the label, lasso outperforms ridge. L1 can therefore be used for feature selection, and L2 can be used to address multicollinearity.
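A minimal sketch contrasting the two penalties in scikit-learn; the synthetic data (only 5 of 30 features carry signal) and the alpha values are illustrative assumptions:

```python
# L1 (lasso) zeroes out coefficients; L2 (ridge) only shrinks them.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=100, n_features=30, n_informative=5,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)
print("lasso zero coefficients:", np.sum(lasso.coef_ == 0))
print("ridge zero coefficients:", np.sum(ridge.coef_ == 0))  # typically 0
```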
What is R-Squared?
R-squared measures how close the data are to the fitted regression line: R² = explained variation / total variation = 1 − (residual sum of squares / total sum of squares).
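A minimal worked example computing R² from its definition and checking it against scikit-learn; the numbers are made up for illustration:

```python
# R^2 from the definition: 1 - SS_res / SS_tot.
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.3, 6.9, 9.2])

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
print(1 - ss_res / ss_tot, r2_score(y_true, y_pred))  # identical values
```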
What is KNN (K nearest neighbor)? Pros and Cons of KNN?
Given a data point, compute its K nearest data points using some distance metric. For classification, take the majority label of the neighbors; for regression, take the mean of the neighbors' label values. KNN has no training phase; all computation happens at inference time, so querying can be expensive. A high / low value of K may lead to underfitting / overfitting.
PRO: very simple implementation; robust with respect to the search space (e.g., classes don't have to be linearly separable); the classifier can be updated online at very little cost as new instances with known labels arrive; few parameters to tune (the distance metric and K).
CON: sensitive to noisy or irrelevant attributes; sensitive to very unbalanced datasets; computationally expensive at inference time, since each test point must be compared with every training point to compute distances.
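A minimal sketch in scikit-learn; the K values are illustrative and chosen to show the underfitting/overfitting trade-off mentioned above:

```python
# KNN: small K tends toward overfitting, large K toward underfitting.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

for k in [1, 5, 50]:
    knn = KNeighborsClassifier(n_neighbors=k, metric="euclidean")
    acc = cross_val_score(knn, X, y, cv=5).mean()
    print(f"K={k:>2}: mean CV accuracy = {acc:.3f}")
```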
What is cross validation?
Cross validation is a technique for evaluating predictive models by partitioning the original sample into a training set to train the model and a validation set to evaluate it. K-fold cross validation divides the data into k partitions, trains on k−1 folds, and evaluates on the remaining fold, rotating through all k folds. This results in k models/evaluations, which can be averaged to get an overall estimate of model performance.
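A minimal sketch doing 5-fold cross validation by hand with scikit-learn's KFold, to show the train-on-(k−1)-folds / evaluate-on-1-fold rotation; the dataset and model are illustrative assumptions:

```python
# Manual k-fold cross validation: k models, k evaluations, then average.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)
scores = []
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[val_idx], y[val_idx]))
print("per-fold accuracy:", np.round(scores, 3), "mean:", np.mean(scores))
```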
What is Linear Regression?
A linear approach to modeling the relationship between independent variables and a dependent variable, i.e., it assumes a linear relationship between the features and the label. The parameters are learned by minimizing a cost function, typically the mean squared error.
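A minimal sketch in scikit-learn; the synthetic data (true slope 2, intercept 1) is an illustrative assumption:

```python
# Linear regression: fit parameters by minimizing squared error.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 2.0 * X[:, 0] + 1.0 + rng.normal(0, 0.5, size=100)

model = LinearRegression().fit(X, y)  # ordinary least squares under the hood
print("learned slope:", model.coef_[0], "intercept:", model.intercept_)
```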
Difference between supervised learning and unsupervised learning?
In supervised learning, the algorithm learns from a labeled dataset; in unsupervised learning, the model learns from unlabeled data that the algorithm tries to make sense of by extracting features and patterns on its own. Unsupervised learning examples include clustering and dimensionality reduction.