Machine Learning Flashcards
Technical interview study
What is Machine Learning?
ML is the field of science that studies algorithms that approximate functions increasingly well as they are given more observations.
What are some common applications of Machine Learning?
ML algorithms are used to learn and automate human processes, optimize outcomes, predict outcomes, model complex relationships, and to learn patterns in data (among many other uses).
What is labeled data and what is it used for?
Labeled data is data that has the information about a target variable for each instance.
Labeled data allows us to train supervised ML algorithms.
What are the most common types of algorithms that use supervised learning?
Most common types of supervised learning algorithms:
regression
classification
What are the most common types of algorithms that use unsupervised learning?
Most common unsupervised learning algorithms:
clustering, dimensionality reduction (PCA), and association-rule mining.
What is the difference between online and offline learning?
Online learning refers to the updating of models as they gain more information.
Offline learning refers to learning by batch processing data. If new data comes in, an entire new batch (including all the old and new data) must be fed into the algorithm to learn from the new data.
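A minimal sketch of the distinction, using a running mean as a stand-in for model parameters (illustrative only, not a full learning algorithm):

```python
# Online learning: update an estimate one observation at a time,
# without revisiting old data.
class OnlineMean:
    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def update(self, x):
        # Incremental mean update: only the current estimate is kept.
        self.n += 1
        self.mean += (x - self.mean) / self.n

# Offline learning: recompute from the entire batch every time
# new data arrives (old + new data processed together).
def offline_mean(batch):
    return sum(batch) / len(batch)

online = OnlineMean()
data = [2.0, 4.0, 6.0]
for x in data:
    online.update(x)

# Both arrive at the same estimate; the difference is how data is processed.
assert online.mean == offline_mean(data)
```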
What is reinforcement learning?
Reinforcement learning describes a set of algorithms that learn by trial and error: an agent takes actions and adjusts its behavior based on the reward or penalty that follows each decision.
e.g., a robot could use reinforcement learning to learn that walking forward into a wall is bad, but turning away from a wall and walking is good.
What is the difference between a model parameter and a learning hyperparameter?
A model parameter describes the final model itself; e.g. slope of a linear regression fit.
A learning hyperparameter describes a way in which a model parameter is learned; e.g. learning rate, penalty terms, number of features to include in a weak predictor.
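A small sketch of the distinction (plain Python, gradient descent on a one-parameter line; the slope is the model parameter, the learning rate is the hyperparameter):

```python
# w (the slope) is a MODEL PARAMETER: it is learned from the data and
# describes the final fitted model.
# lr (the learning rate) is a HYPERPARAMETER: it is chosen before
# training and controls HOW w is learned, not what the model says.
def fit_slope(xs, ys, lr=0.01, epochs=200):
    w = 0.0  # model parameter, learned below
    for _ in range(epochs):
        # gradient of mean squared error w.r.t. w, for the model y = w * x
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= lr * grad
    return w

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]  # generated with true slope 2
w = fit_slope(xs, ys)
```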
What is overfitting?
Overfitting is when a model makes much better predictions on known training data than on unseen (validation, test) data.
How can we combat overfitting?
Ways to combat overfitting:
a. reduce the flexibility of the model (by changing the hyperparameters)
b. select a different model
c. use more training data
d. gather better quality data
What is training data and what is it used for?
Training data is data which will be used to train the ML model.
For supervised learning, this training data must have a labeled target, i.e. what we are trying to predict must be defined.
For unsupervised learning, the training data will contain only features and will use no labeled targets; i.e. what we are trying to predict is not defined.
What is a validation set and why do we use one?
A validation set is a set of data that is used to evaluate a model’s performance during training/model selection. After models are trained, they are evaluated on the validation set to select the best possible model.
Information from the validation set must never be used to train the model.
It must also not be used as the test data set because we’ve biased our model selection toward working well with this data, even though the model was not directly trained on it.
What is a test set and why use one?
A test set is a data set not used during ML training or validation.
The model’s performance is evaluated on the test set to predict how well it will generalize to new data.
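The train/validation/test split described in the last few cards can be sketched in plain Python (fractions here are illustrative, not prescribed):

```python
import random

def train_val_test_split(data, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle, then carve off held-out validation and test portions."""
    data = list(data)
    random.Random(seed).shuffle(data)
    n = len(data)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test = data[:n_test]                # final generalization estimate only
    val = data[n_test:n_test + n_val]   # model selection / tuning
    train = data[n_test + n_val:]       # fitting the model
    return train, val, test

# 100 examples -> 70 train, 15 validation, 15 test
train, val, test = train_val_test_split(range(100))
```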
What is cross validation and why is it useful?
Cross validation is a technique for more accurately training and validating models. It rotates what data is held out from model training to be used as the validation data.
Several models are trained and evaluated, with every piece of data held out from exactly one model's training. The average performance across all models is then calculated.
It is a more reliable way to validate models but is more computationally expensive, e.g. 5-fold CV requires training and validating 5 models instead of 1.
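The rotation of held-out data can be sketched as follows (plain Python; the validation score is stubbed, since the point is the fold rotation):

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k folds; each fold is held out once."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        val_idx = folds[i]
        train_idx = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train_idx, val_idx

# Every index appears in exactly one validation fold across the k splits.
scores = []
for train_idx, val_idx in k_fold_indices(10, k=5):
    # Here you would train on train_idx and evaluate on val_idx;
    # a placeholder value stands in for a real validation score.
    scores.append(len(val_idx))
avg = sum(scores) / len(scores)  # average performance over all k models
```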
What does a confusion matrix look like?
                 Predicted values
                 yhat=1    yhat=0
True    y=1       TP        FN       recall: TP/(TP+FN)
values  y=0       FP        TN       specificity: TN/(TN+FP)
        precision: TP/(TP+FP)        accuracy: (TP+TN)/total
precision: measures the accuracy of a predicted-positive outcome, TP/(TP+FP)
recall (sensitivity): measures the model's ability to identify real-positive outcomes, i.e. the proportion of true 1s identified, TP/(TP+FN)
specificity: measures the model's ability to identify real-negative outcomes, i.e. the proportion of true 0s identified, TN/(TN+FP)
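The cells and metrics above can be computed directly from labels and predictions (plain Python sketch; assumes both classes appear, so no zero denominators):

```python
def confusion_metrics(y_true, y_pred):
    """Count confusion-matrix cells and derive the standard metrics."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return {
        "precision": tp / (tp + fp),       # accuracy of predicted positives
        "recall": tp / (tp + fn),          # proportion of true 1s identified
        "specificity": tn / (tn + fp),     # proportion of true 0s identified
        "accuracy": (tp + tn) / len(y_true),
    }

# TP=2, FN=1, FP=1, TN=2 for this toy example
m = confusion_metrics([1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0])
```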