Machine Learning Flashcards
Explain K-means.
Notes
- Unsupervised clustering algorithm.
- Model features should be normalized
- Does not always converge to a global minimum
- Convergence depends on initial cluster centroids
- Initialization Methods: Random, Forgy, Kmeans++
- If the number of clusters is not known, use the elbow method (increase the number of clusters until the improvement in loss is minimal)
Steps
- Determine K (# clusters)
- Initialize K cluster centroids
- Assign each point to its nearest centroid
- Take the mean of all points in each cluster and set that as the new cluster centroid
- Repeat the assignment and update steps until the centroids stop moving (see the sketch after this card)
Pros
- fast to train, scalable, guaranteed to converge (though possibly to a local minimum)
Cons
- have to choose K, dependent on initial centroids, susceptible to outliers
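A minimal sketch of the steps above, assuming scikit-learn is available (the toy data and cluster count are placeholders). It normalizes the features, uses k-means++ initialization, and exposes the inertia (loss) used for the elbow method:

```python
# Minimal K-means sketch (assumes scikit-learn; data is a random placeholder).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

X = np.random.rand(200, 2)                      # toy data; replace with real features
X_scaled = StandardScaler().fit_transform(X)    # normalize features first

# k-means++ initialization; n_init restarts guard against bad initial centroids
km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0)
labels = km.fit_predict(X_scaled)

print(km.cluster_centers_)   # final centroids
print(km.inertia_)           # loss value used for the elbow method
```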
Explain supervised, unsupervised, semi-supervised and reinforcement learning.
Supervised: Data with labels. Continuous/Discrete.
Ex: Linear Regression, Decision Tree, Forecasting Temperature
Unsupervised: Data without labels.
Ex: K-means, Hierarchical Clustering, Customer Segmentation
Semi-supervised: Data that is not labelled but can be manipulated to produce labels. Think of how word2vec updates word embeddings using the words within a sliding window
Ex: Word2Vec
Reinforcement: An agent takes actions and each action gets a response/reward as feedback
Ex: dreamerv2
What is overfitting? What are some strategies to prevent it?
When a model does not generalize well to new data because it has fit the noise in the training data.
Strategies:
Regularization (L1/L2)
Reduce model complexity
Use a validation dataset
Cross-validation
Early-stopping
Use more data
Remove features
Ensemble Learning
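A short sketch of two of the strategies above, L2 regularization and a validation split, assuming scikit-learn (Ridge is ordinary linear regression with an L2 penalty; the data is synthetic):

```python
# Sketch: L2 regularization plus a held-out validation set (assumes scikit-learn).
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

plain = LinearRegression().fit(X_train, y_train)
ridge = Ridge(alpha=1.0).fit(X_train, y_train)   # alpha controls the L2 penalty strength

# Compare validation scores to check whether regularization reduced overfitting
print(plain.score(X_val, y_val), ridge.score(X_val, y_val))
```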
What is the training, validation and test data? What percentage of the data would you allocate to each?
Training: Used to tune the model parameters
Validation: Used during training to ensure that the model is not overfitting
Test: Used to estimate real-world model performance. Once the test data has been used for evaluation, it cannot be reused as test data.
80-10-10: Typical
60-20-20: Small dataset
90-5-5: Large dataset (if each dataset contains a good representation of true population)
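A small sketch of the typical 80-10-10 split using two calls to scikit-learn's train_test_split (the iris data is just a stand-in for a real dataset):

```python
# 80-10-10 split sketch (assumes scikit-learn; iris is a toy placeholder dataset).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# First split off 20%, then split that 20% half/half into validation and test
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.2, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)
# -> roughly 80% train, 10% validation, 10% test
```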
How would you handle missing/corrupted data?
Mean - No outliers
Median - There are outliers
Forward/Backward Fill - If there is an order to the data
Impute a value - NaN values may themselves indicate something
Remove row/column - Might not be worth keeping
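A pandas sketch of these options on a hypothetical DataFrame (the column names and values are made up for illustration):

```python
# Missing-data handling sketch (assumes pandas; df is a toy placeholder).
import numpy as np
import pandas as pd

df = pd.DataFrame({"temp": [21.0, np.nan, 23.5, np.nan, 22.0],
                   "city": ["A", "B", None, "A", "B"]})

df["temp_mean"]   = df["temp"].fillna(df["temp"].mean())    # mean: no outliers
df["temp_median"] = df["temp"].fillna(df["temp"].median())  # median: robust to outliers
df["temp_ffill"]  = df["temp"].ffill()                      # forward fill: ordered data
df["temp_flag"]   = df["temp"].isna()                       # keep a "was missing" signal
df_dropped = df.dropna(subset=["city"])                     # drop rows not worth keeping
```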
How to choose which ML model to use for a classification problem?
Strategies
Cross Validation (if computationally viable)
Train-valid-test (if cross-validation not viable)
Model size limitations
Model inference speed
Little data (use model with lower variance)
Big data (use model with lower bias)
Whether the model needs to handle missing values
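A brief sketch of the first strategy, comparing candidate classifiers by cross-validated accuracy when that is computationally viable (scikit-learn assumed; the models and data are placeholders):

```python
# Cross-validation model comparison sketch (assumes scikit-learn; toy data).
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)   # 5-fold cross-validated accuracy
    print(name, scores.mean())
```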
Explain the bias/variance trade off.
Bias: How well the model fits the training data. The lower the bias, the better the fit.
Variance: How much the model parameters and predictions change with a different training sample
Tradeoff: Low bias and low variance is the sweet spot. Lowering the bias further tends to increase the variance and vice versa. Sometimes you may accept a little more bias in exchange for lower variance and more robust predictions.
What is a confusion matrix?
A confusion matrix plots the predicted values against the actual values for classification problems. It also shows the TPs, TNs, FPs and FNs.
What are TPs, TNs, FPs and FNs?
Think in this format: “Correct? Prediction?”.
True positives - Correct positive prediction
True negatives - Correct negative prediction
False positives - Incorrect positive prediction (the label is negative)
False negatives - Incorrect negative prediction (the label is positive)
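A quick check of these definitions on toy binary predictions, assuming scikit-learn; confusion_matrix(...).ravel() returns the four counts in the order TN, FP, FN, TP:

```python
# Reading TP/TN/FP/FN off a binary confusion matrix (assumes scikit-learn; toy labels).
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are actual labels, columns are predictions
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)
```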
Stages of ML Model
- Understanding problem
  - past work, privacy, ethics, do we need ML?
- Data Collection
  - existing datasets, get creative here
- Data preparation
  - ELT/ETL, feature engineering
- Model Development/Model Testing
  - Cross validation, hyper-parameter tuning
- Model Deployment
  - Inference speed, REST API or on device, data drift
Explain Backpropagation.
Backpropagation:
Process to update neural network parameters
Forward Pass:
Pass data through and make predictions
Backward Pass:
Calculates the chained partial derivative of the loss function with respect to a specific weight/bias. Do this for every parameter. The result of the chained derivative is the direction of steepest ascent, so we take the negative to get the steepest descent. Multiply this by the learning rate and add it to the parameter to update it.
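To make the chained-derivative description concrete, here is a hedged numpy-only sketch for a single linear neuron with a mean-squared-error loss (the toy data, learning rate, and iteration count are arbitrary choices, not part of the card):

```python
# Forward/backward pass sketch for one linear neuron with MSE loss (numpy only).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.3           # toy targets

w, b, lr = np.zeros(3), 0.0, 0.1
for _ in range(200):
    y_hat = X @ w + b                              # forward pass: predictions
    err = y_hat - y
    grad_w = 2 * X.T @ err / len(y)                # dLoss/dw via the chain rule
    grad_b = 2 * err.mean()                        # dLoss/db
    w -= lr * grad_w                               # move opposite the gradient
    b -= lr * grad_b                               # (steepest descent), scaled by lr
print(w, b)
```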
What are some examples of Supervised, Unsupervised, and Semi-supervised Learning?
Supervised Learning:
Forecasting temperature
Predicting type of disease on plants using image data
Predicting the cost of housing expenses
Forecasting energy demand
Unsupervised:
Customer segmentation
Anomaly detection
Identifying patterns in DNA
Semi-supervised
Training embeddings using text corpora
Labelling unlabelled data
What are K-Means and KNN? Compare and contrast.
KMeans:
Unsupervised clustering
Scalable, fast inference
Centroid Initialization: Random, Forgy, Kmeans++
KNN:
Supervised classification
Lazy Learner (No training)
Not scalable, long inference time
Prediction based on K closest points
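A side-by-side usage sketch, assuming scikit-learn (iris is just a convenient toy dataset): K-means groups the rows without looking at the labels, while KNN simply stores the labelled data and classifies a query by a vote among its K nearest neighbours:

```python
# K-means vs. KNN usage sketch (assumes scikit-learn; iris is a toy placeholder).
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Unsupervised: cluster assignments found without using y
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Supervised and "lazy": fit just stores the labelled points
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
print(knn.predict(X[:1]))   # vote among the 5 closest stored points
```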
How could you train a model to play Checkers?
Use a reinforcement learning model such as dreamerv2. Make an agent play the game and reinforce positive moves (i.e. capturing checkers) and penalize negative moves (i.e. losing checkers).
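A full model-based agent like dreamerv2 is too large to sketch here, so below is a hedged stand-in: a tabular Q-learning update that captures only the reward idea (e.g. +1 for capturing a piece, -1 for losing one). The board/state encoding and legal-move generation are hypothetical placeholders:

```python
# Tabular Q-learning sketch for a board game (states and moves are placeholders).
import random
from collections import defaultdict

Q = defaultdict(float)            # Q[(state, move)] -> estimated value
alpha, gamma, eps = 0.1, 0.9, 0.1 # learning rate, discount, exploration rate

def choose_move(state, legal_moves):
    if random.random() < eps:                               # explore
        return random.choice(legal_moves)
    return max(legal_moves, key=lambda m: Q[(state, m)])    # exploit best known move

def update(state, move, reward, next_state, next_moves):
    # reward: e.g. +1 when a piece is captured, -1 when a piece is lost
    best_next = max((Q[(next_state, m)] for m in next_moves), default=0.0)
    Q[(state, move)] += alpha * (reward + gamma * best_next - Q[(state, move)])
```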
How could you build a recommendation engine? What are its benefits?
Strategies:
Customer segmentation
Product segmentation
Cosine similarity (customers or products)
Benefits:
Customer retention, Customer lifetime value, Improved search results
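One way to sketch the cosine-similarity strategy is item-based collaborative filtering on a toy user-item matrix (numpy only; the matrix values are made up for illustration):

```python
# Cosine-similarity recommendation sketch (numpy only; toy purchase matrix).
import numpy as np

# rows = customers, columns = products, values = purchase counts
ratings = np.array([[2, 0, 1, 0],
                    [0, 1, 0, 3],
                    [1, 0, 2, 0]], dtype=float)

norms = np.linalg.norm(ratings, axis=0, keepdims=True)
item_sim = (ratings.T @ ratings) / (norms.T @ norms)   # product-product cosine similarity

# Recommend for customer 0: score products by similarity to what they already bought
scores = item_sim @ ratings[0]
scores[ratings[0] > 0] = -np.inf                       # mask products already bought
print(int(scores.argmax()))                            # index of the recommended product
```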
Classification vs. Regression
Classification - Discrete Labels
Regression - Continuous Labels
Hyperparameters vs. Parameters
Hyperparameters
Tuned by the practitioner (learning rate, optimizer, weight decay, hidden layers, etc.)
Parameters
Model learns these from the training data (weights + biases)
Random Forest vs. Gradient boosted decision tree
Random Forest:
Takes the mean/mode/median of the predictions from a group of decision trees
Each tree is trained on a bootstrap sample of the data, and each split considers a random subset of the features
More generalizable
Ensemble Learning method
Can train in parallel
GBDT:
Trees are built sequentially, each on top of the previous one
Fits each new decision tree to the residual errors of the previous trees
Predicts the error of the previous trees rather than the target directly
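A rough comparison sketch using scikit-learn's implementations of the two ensembles (the dataset is a toy stand-in and the hyperparameters are defaults, not recommendations):

```python
# Random forest vs. gradient boosting comparison sketch (assumes scikit-learn).
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

rf = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0)  # trees fit in parallel
gb = GradientBoostingClassifier(n_estimators=200, random_state=0)         # trees fit sequentially on residuals

print("random forest:", cross_val_score(rf, X, y, cv=5).mean())
print("gradient boosting:", cross_val_score(gb, X, y, cv=5).mean())
```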
Considerations when choosing an ML model?
Label presence
Model Size
Training Time, Inference Time
Prediction Accuracy
Implications of FP and FN
Model explainability
Size of training data
Precision vs Recall. Define these with TP, TN, FP and FNs.
Precision:
How many of your positive predictions are actually positive
TP / (TP + FP)
Recall:
How many of the actual positives were identified
TP / (TP + FN)
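A quick numeric check of the two formulas against scikit-learn's metrics on toy predictions:

```python
# Precision and recall from the confusion counts (assumes scikit-learn; toy labels).
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp / (tp + fp), precision_score(y_true, y_pred))   # precision
print(tp / (tp + fn), recall_score(y_true, y_pred))      # recall
```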
Correlation vs. Covariance
Correlation:
Strength of relationship between variables
Ranges from -1 to 1
Covariance:
Direction of relationship between variables
Magnitude is dependent on the scale of the variables
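A small numpy illustration of the distinction: rescaling one variable changes the covariance's magnitude but leaves the correlation (bounded between -1 and 1) essentially unchanged:

```python
# Covariance vs. correlation under rescaling (numpy only; toy data).
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 2 * x + rng.normal(scale=0.5, size=500)

print(np.cov(x, y)[0, 1], np.corrcoef(x, y)[0, 1])              # original scale
print(np.cov(x, 100 * y)[0, 1], np.corrcoef(x, 100 * y)[0, 1])  # covariance blows up, correlation does not
```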