Machine Learning Flashcards

1
Q

Explain K-means.

A

Notes
- Unsupervised clustering algorithm.
- Features should be normalized before clustering
- Does not always converge to a global minimum
- Convergence depends on initial cluster centroids
- Initialization Methods: Random, Forgy, Kmeans++
- If the number of clusters is not known, use the elbow method (increase K until the reduction in loss becomes minimal)

Steps
- Determine K (# clusters)
- Initialize K cluster centroids
- Assign points to each cluster
- Take the mean value of all points in cluster. Set that as cluster centroid
- Repeat until the assignments no longer change

Pros
- fast to train, scalable, guaranteed to converge (though only to a local minimum)

Cons
- have to choose K, dependent on initial centroids, susceptible to outliers
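
A minimal sketch with scikit-learn's KMeans, assuming a numeric feature matrix X (placeholder data, not from the card):

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import numpy as np

X = np.random.rand(200, 4)                    # placeholder feature matrix
X_scaled = StandardScaler().fit_transform(X)  # normalize features first

# k-means++ initialization helps converge to a better local minimum
km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0)
labels = km.fit_predict(X_scaled)
print(km.inertia_)                            # within-cluster sum of squares (useful for the elbow method)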

2
Q

Explain supervised, unsupervised, semi-supervised and reinforcement learning.

A

Supervised: Data with labels. Continuous/Discrete.
Ex: Linear Regression, Decision Tree, Forecasting Temperature

Unsupervised: Data without labels.
Ex: K-means, Hierarchical Clustering, Customer Segmentation

Semi-supervised: Data that is not labelled but from which labels can be derived. Think of how word2vec updates word embeddings using words that fall within a sliding window
Ex: Word2Vec

Reinforcement: Each action/data point gets a response/feedback
Ex: DreamerV2

3
Q

What is overfitting? What are some strategies to prevent it?

A

When a model does not generalize well to new data because it has fit the noise in the training data.

Strategies:
Regularization (L1/L2)
Reduce model complexity
Use a validation dataset
Cross-validation
Early-stopping
Use more data
Remove features
Ensemble Learning
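
A minimal sketch of two of these strategies (L2 regularization plus a validation set) with scikit-learn; the data and alpha value are placeholders:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
import numpy as np

X, y = np.random.rand(500, 10), np.random.rand(500)   # placeholder data
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# L2 regularization: larger alpha shrinks weights and reduces overfitting
model = Ridge(alpha=1.0).fit(X_tr, y_tr)
print(model.score(X_tr, y_tr), model.score(X_val, y_val))  # large gap -> likely overfitting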

4
Q

What are the training, validation and test datasets? What percentage of the data would you allocate to each?

A

Training: Used to tune the model parameters

Validation: Used during training to ensure that the model is not overfitting

Test: Gives an estimate of real-world model performance. Once the test data has been used, it cannot be used as test data again.

80-10-10: Typical
60-20-20: Small dataset
90-5-5: Large dataset (if each dataset contains a good representation of true population)
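
A sketch of an 80-10-10 split using two passes of scikit-learn's train_test_split (X and y are placeholders):

from sklearn.model_selection import train_test_split
import numpy as np

X, y = np.random.rand(1000, 5), np.random.rand(1000)   # placeholder data

# first hold out 20%, then split that 20% in half -> 80/10/10
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.2, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)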

5
Q

How would you handle missing/corrupted data?

A

Mean - No outliers
Median - There are outliers
Forward/Backward Fill - If there is an order to the data
Impute a value / add a missing-value indicator - NaN values may themselves be informative
Remove row/column - Might not be worth keeping
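
A sketch of these options with pandas on a toy DataFrame with a numeric "price" column (names illustrative):

import pandas as pd
import numpy as np

df = pd.DataFrame({"price": [10.0, np.nan, 12.0, 200.0, np.nan]})      # toy data

df["price_mean"]   = df["price"].fillna(df["price"].mean())    # mean: no outliers
df["price_median"] = df["price"].fillna(df["price"].median())  # median: robust to outliers
df["price_ffill"]  = df["price"].ffill()                       # forward fill: ordered data
df["price_flag"]   = df["price"].isna().astype(int)            # keep the "was missing" signal
df_dropped = df.dropna(subset=["price"])                       # or remove the rows entirely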

6
Q

How to choose which ML model to use for a classification problem?

A

Strategies
Cross Validation (if computationally viable)
Train-valid-test (if cross-validation not viable)
Model size limitations
Model inference speed
Little data (use model with lower variance)
Big data (use model with lower bias)
Whether the model needs to handle missing values

7
Q

Explain the bias/variance trade off.

A

Bias: How well the model fits the training data (the lower the bias, the better the fit)

Variance: How much the model parameters and predictions change with a different training sample

Tradeoff: Low bias and low variance is the sweet spot. Lowering bias further tends to increase variance and vice versa. Sometimes you may accept slightly more bias in exchange for lower variance to get more robust predictions.

8
Q

What is a confusion matrix?

A

A confusion matrix plots the predicted values against the actual values for classification problems. It also shows the TP, TN, FP, FN’s.
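
A minimal sketch with scikit-learn (the labels are made up):

from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]

# rows = actual class, columns = predicted class
# for binary labels {0, 1} the layout is [[TN, FP], [FN, TP]]
print(confusion_matrix(y_true, y_pred))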

9
Q

What are TPs, TNs, FPs and FNs?

A

Think in this format. “Correct? Prediction?”.

True positives - Correct positive prediction
True negatives - Correct negative prediction
False positives - Positive prediction when the label is negative
False negatives - Negative prediction when the label is positive

10
Q

Stages of ML Model

A
  1. Understanding problem
    - past work, privacy, ethics, do we need ML?
  2. Data Collection
    - existing datasets, get creative here
  3. Data preparation
    - ELT/ETL, feature engineering
  4. Model Development/Model Testing
    - Cross validation, hyper-parameter tuning
  5. Model Deployment
    - Inference speed, REST API or on device, data drift
11
Q

Explain Backpropagation.

A

Backpropagation:
Process to update neural network parameters

Forward Pass:
Pass data through and make predictions

Backward Pass:
Calculates the chained partial derivative (chain rule) of the loss function with respect to each weight/bias. Do this for every parameter. The resulting gradient points in the direction of steepest ascent, so we take its negative to get the direction of steepest descent, multiply by the learning rate, and add that to the parameter to update it.
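
A toy sketch of one forward/backward pass and update for a single linear neuron with squared-error loss (pure NumPy, values illustrative):

import numpy as np

x, y_true = np.array([1.0, 2.0]), 3.0        # one sample
w, b, lr = np.array([0.5, -0.2]), 0.1, 0.01

# forward pass
y_pred = w @ x + b
loss = (y_pred - y_true) ** 2

# backward pass: chain rule dL/dw = dL/dy_pred * dy_pred/dw
dL_dy = 2 * (y_pred - y_true)
dL_dw = dL_dy * x
dL_db = dL_dy

# step in the negative gradient direction (steepest descent)
w -= lr * dL_dw
b -= lr * dL_db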

12
Q

What are some examples of Supervised, Unsupervised, and Semi-supervised Learning?

A

Supervised Learning:
Forecasting temperature
Predicting type of disease on plants using image data
Predicting the cost of housing expenses
Forecasting energy demand

Unsupervised:
Customer segmentation
Anomaly detection
Identifying patterns in DNA

Semi-supervised
Training embeddings using text corpora
Labelling unlabelled data

13
Q

What are K-Means and KNN? Compare and contrast.

A

KMeans:
Unsupervised clustering
Scalable, fast inference
Centroid Initialization: Random, Forgy, Kmeans++

KNN:
Supervised classification
Lazy Learner (No training)
Not scalable, long inference time
Prediction based on K closest points

14
Q

How could you train a model to play Checkers?

A

Use a reinforcement learning model such as DreamerV2. Have an agent play the game, reward positive moves (e.g. capturing pieces), and penalize negative moves (e.g. losing pieces).

15
Q

How could you build a recommendation engine? What are its benefits?

A

Strategies:
Customer segmentation
Product segmentation
Cosine similarity (customers or products)

Benefits:
Customer retention, Customer lifetime value, Improved search results

16
Q

Classification vs. Regression

A

Classification - Discrete Labels
Regression - Continuous Labels

17
Q

Hyperparameters vs. Parameters

A

Hyperparameters
Set by the practitioner (learning rate, optimizer, weight decay, number of hidden layers, etc.)

Parameters
Model learns these from the training data (weights + biases)

18
Q

Random Forest vs. Gradient boosted decision tree

A

Random Forest:
Takes the mean/mode/median of the predictions from a group of decision trees
Each tree is trained on a bootstrap sample of the data and a random subset of the features
More generalizable
Ensemble Learning method
Can train in parallel

GBDT:
Each tree is built on top of each other
Fits each new tree on the residual errors of the previous trees
Predicts the error of the previous trees rather than the target directly

19
Q

Considerations when choosing an ML model?

A

Label presence
Model Size
Training Time, Inference Time
Prediction Accuracy
Implications of FP and FN
Model explainability
Size of training data

20
Q

Precision vs Recall. Define these with TP, TN, FP and FNs.

A

Precision:
How many of your positive predictions are actually positive
TP / (TP + FP)

Recall:
How many of the actual positives were identified
TP / (TP + FN)
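
Computed from illustrative counts:

TP, FP, FN = 40, 10, 20

precision = TP / (TP + FP)   # 0.8   -> of the positive predictions, how many were correct
recall    = TP / (TP + FN)   # ~0.67 -> of the actual positives, how many were found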

21
Q

Correlation vs. Covariance

A

Correlation:
Strength of relationship between variables
Ranges from -1 to 1 (scale-free)

Covariance:
Direction of relationship between variables
Magnitude is dependent on the scale of the variables

22
Q

How are splits determined in a Decision Tree?

A

Gini Impurity / Information Gain:
Gini impurity is faster to compute
A value of 0 means the split is pure

23
Q

How do decision trees prune? LGBM vs. XGBoost.

A

Pruning removes redundant splits (those with very little information gain)
Reduces model complexity and decreases variance

LGBM - grows trees leaf-wise / best-first (fast but greedy)
XGB - builds out to max depth level-wise and then prunes back (less greedy but slower)

24
Q

What is Logistic Regression?

A

Linear regression passed through a sigmoid/logistic function for classification
Squashes the model output to a probability between 0 and 1

25
Q

Normalization vs. Standardization

A

Normalization:
Scales values [0,1] (bounded)
Affected by outliers
f(x) = (x - xmin) / (xmax - xmin)

Standardization:
Assigns Z-score to points (unbounded)
More robust to outliers
f(x) = (x - mu) / sigma
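
Both transforms in NumPy on a toy array:

import numpy as np

x = np.array([2.0, 4.0, 6.0, 100.0])

x_norm = (x - x.min()) / (x.max() - x.min())   # min-max normalization: bounded to [0, 1]
x_std  = (x - x.mean()) / x.std()              # z-score standardization: unbounded, unit variance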

26
Q

What is SVM? Min or Max the margin? Hard vs. Soft margin?

A

SVM (support vector machine):
Supervised classification
support vectors (closest points to decision boundary)
maximize margin (distance from decision boundary to support vectors)
Hard-margin = must perfectly classify
Soft-margin = allows for slight errors (smoothed boundary)
kernel trick (if points are not linearly separable, you can map them to a higher dimension so that they can be)

27
Q

You have a very large dataset that can not fit on one machine. What do you do?

A

Ensure optimal data types are used
Load a subset of the data for each batch
Remove non-informative features
Use PCA for dimensionality reduction

28
Q

What is PCA? What determines the PCs?

A

PCA (principal component analysis):
Dimensionality reduction technique
Reduces computation, can visualize high dimensional data
PCs are ranked by how much of the data variance they explain (highest first)
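
A minimal scikit-learn sketch on placeholder data:

from sklearn.decomposition import PCA
import numpy as np

X = np.random.rand(100, 20)                 # placeholder high-dimensional data
pca = PCA(n_components=2).fit(X)
X_2d = pca.transform(X)                     # data projected onto the first 2 PCs
print(pca.explained_variance_ratio_)        # variance explained by each PC (highest first)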

29
Q

Type I vs Type II Error

A

Type I (false positive):
Rejecting the null hypothesis when it is actually true in the population

Type II (false negative):
Failing to reject the null hypothesis when it is actually false in the population

30
Q

Explain Ensemble Learning. (Pros and Cons)

A

Use multiple models in prediction
Can be the Mean/Median/Mode of a number of predictions
Can be a linear regression model fit to the individual predictions to get a weighted linear combination of them
Best to use models with “different perspectives”

Pros: More generalizable, Increased accuracy
Cons: More computationally expensive (training and inference)

31
Q

Explain cross-validation.

A

K-fold cross validation:
Train K models; each data point appears in the validation set exactly once
Gives more accurate measure of model performance than train-valid-test split
Computationally expensive
Good for limited datasets
Note: There is still an unseen test set, the rest of the data is used for K-fold CV
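
A sketch of 5-fold cross-validation with scikit-learn (the model and toy data are illustrative):

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=300, random_state=0)   # toy data
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean(), scores.std())          # average performance across the 5 folds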

32
Q

L1 vs L2

A

L2 (Ridge):
Penalizes the squared weight values in the loss function
Shrinks weights towards 0 but never exactly to 0 (non-sparse)
Ridge regression has a closed-form least-squares solution
Gaussian Prior

L1 (Lasso):
Penalizes the absolute value of the weights in the loss function
Can shrink weights exactly to 0 (sparse)
Lasso regression has no closed-form solution; it is solved iteratively (e.g. coordinate or gradient descent)
Laplacian Prior
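
A sketch comparing the two in scikit-learn (alpha values and data are illustrative); note how Lasso drives some coefficients exactly to zero:

from sklearn.linear_model import Ridge, Lasso
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=200, n_features=10, n_informative=3, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)    # L2: small but non-zero weights
lasso = Lasso(alpha=1.0).fit(X, y)    # L1: sparse weights (some exactly 0)
print(ridge.coef_)
print(lasso.coef_)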

33
Q

What does the ROC curve show?

A

It plots the true positive rate (sensitivity) against the false positive rate (1 - specificity) as the classification threshold varies.

A diagonal line from bottom-left to top-right corresponds to a model predicting at random.

34
Q

How does Bayes Theorem apply to ML?

A

Naive Bayes Classifier:
Classifies based on probability of each class from training data

Bayesian Hyperparameter Optimization:
Uses information from past combinations to determine where to search next. Picks the next combination based on space likely to minimize loss function.
“smarter” search than random/grid search

35
Q

Why is Naive Bayes naive?

A

Assumes all features are independent (i.e. does not consider the effect of multiple variables combined)

36
Q

What is the F1 score? When would you use it?

A

The F1 score is a classification metric that balances precision and recall. It can be useful when you have an imbalanced dataset.

Harmonic mean between precision and recall.
F1 = 2(pr)/(p+r)

37
Q

Which is more important: model accuracy or model performance?

A

Differs by use case:

Accuracy: Sales Forecast, Annual GDP calculation, Image Generation
Performance: Robotics, TSA XRAY screening, On-device ML, Sentence completion

38
Q

What are some strategies to handle dataset imbalance?

A

Collect more data (best)
Choose a different metric (F1-score, Precision, Recall, avg per-class accuracy)
Oversample (Increase likelihood of overfit)
Undersample (Increase likelihood of underfit)
Give more weight to the minority class in the loss / weight update

39
Q

MAE vs MSE

A

MAE:
Not sensitive to outliers
An error reduction on any point counts the same as an equal reduction on any other point

MSE:
Sensitive to outliers
Prioritizes improving outliers

40
Q

Assumptions of Linear Regression?

A

Constant variance across range (Homoscedasticity)
Normally distributed residuals
Independent observations
Linear relationship between variables and target

Downside: cannot capture non-linear relationships

41
Q

What is collinearity? Why is it bad in a linear model?

A

When there are multiple variables that are highly correlated
Gives misleading feature weights

Ex.
Think of if we added the same feature twice
The weights could be (5,5),(10,0),(-20,30) although the effect is always 10

42
Q

Explain Bagging vs. Boosting

A

Bagging:
Taking the mean/median/mode from a set of predictions to make final prediction

Boosting:
Sequential process of fitting the next model on the error of the previous model
This is seen in gradient boosted decision trees

43
Q

What is an Outlier? How could you screen these points?

A

Z-score: (x - mean) / (std dev) - see the sketch after this list
Anything outside of 3 standard deviations is probably an outlier

Clustering:
Fit a k-means model. If there is a cluster with very few points these are likely outliers.

Binary Classification:
If you have labelled set of points, build a classification model to identify
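
A sketch of the z-score screen in NumPy (toy data with one extreme point):

import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(10, 1, 100), [50.0]])   # toy data with one extreme value

z = (x - x.mean()) / x.std()
outliers = x[np.abs(z) > 3]     # flag points more than 3 standard deviations from the mean
print(outliers)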

44
Q

How do you identify causation versus correlation?

A

Hypothesis Testing, A/B testing where all variables other than the independent variable are controlled.

45
Q

Vanishing vs Exploding Gradients. What can you do to stop these?

A

VG:
A long chain of partial derivatives with magnitudes below one in the weight-update calculation
Results in tiny update steps
More common with sigmoid/TanH activations
Model never converges

EG:
A long chain of partial derivatives with magnitudes above one in the weight-update calculation
Results in overshooting the global minimum
Not as bad as VG (gradient clipping is a simple fix)

Strategies to mitigate (both):
- Swish/ReLU Activations, Gradient Clipping, Residual/Skip connections
- Batch Normalization Layers
- He Initialization
- (initialize weights with a sample from gaussian distribution)
- (mu = 0 and sig = sqrt(2/(# of inputs to the node)))

Just EG:
L1/L2 regularization, Lower LR, Maybe change optimizer?

46
Q

Define the curse of dimensionality

A

Exponential increase in the data and computation needed for every added dimension; data becomes increasingly sparse as dimensionality grows

47
Q

What are some metrics for classification and regression?

A

Classification:
Accuracy, Precision, Recall, F1-Score, Cross Entropy

Regression:
RMSE, MSE, MAE, MAPE, information criteria (e.g. AIC/BIC), R^2, L1/L2 loss

48
Q

What is data drift? How do you detect it?

A

When the distribution of predictions or features changes.

Detect By:
Significance Testing: Compare the current distribution of predictions/features with the historical distribution (KS test, t-test)
Model-Based Approach: Train a classifier to distinguish historical from real-time data (if it separates them easily, the data has likely drifted)

49
Q

What is a long-tailed distribution? Give 3 examples in the real world.

A

When a large share of the values lie in a long tail far from the peak, so the distribution is not centred around the mean.

Ex 1: Number of coin flips until you see heads
Ex 2: If you generate a random number “B” between 0-100, and then a random number between 0-“B”.
Ex 3: Time before a battery runs out of charge

50
Q

Batch vs Mini-batch vs Stochastic Gradient Descent.

A

Batch
- Update weights after evaluation of all data points

Mini-batch
- Update weights after evaluation of a subset of the datapoints

Stochastic
- Update weights after every single data point
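
A sketch of the mini-batch update loop for linear regression in NumPy (toy data); batch GD is the special case batch_size = n, and SGD is batch_size = 1:

import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(256, 3)), rng.normal(size=256)    # toy data
w, lr, batch_size = np.zeros(3), 0.01, 32

for epoch in range(10):
    idx = rng.permutation(len(X))                         # shuffle each epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        Xb, yb = X[batch], y[batch]
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(batch)      # MSE gradient on the mini-batch
        w -= lr * grad                                    # update after each mini-batch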

51
Q

Why do we use activation functions? What are some activation functions?

A

Activation functions make it so that we can map non-linear patterns in the data. They are usually easily differentiable due to the number of derivatives calculated for every weight update.

Examples:
- ReLU, Sigmoid/Logistic, TanH, Swish, Mish, Leaky ReLU, APT-X

52
Q

What are some common optimizers for neural networks? Explain.

A

Momentum (Ball rolling down hill)

NAG (Nesterov's accelerated gradient)
- Uses momentum plus lookahead (calculates the update with respect to the future, momentum-stepped parameters)
- Smart ball rolling down a hill

Adadelta
- Adapts the learning rate per parameter on the fly using a decaying average of past squared gradients
- w's with consistently large gradients get a smaller effective LR
- w's with consistently small or infrequent gradients get a larger effective LR

ADAM
- Combines momentum (decaying average of past gradients) with per-parameter adaptive LRs (decaying average of past squared gradients)
- The NAdam variant adds NAG-style lookahead

53
Q

2 Criteria for splitting decision trees?

A

Continuous (regression):
SSE: sum of squared errors within each node

Categorical (classification):
Gini Impurity: 1 - sum(p_i^2) over the classes in a node (0 = pure)
Computationally less expensive than information gain

Information Gain: the reduction in entropy after the split
Entropy = -sum(p_i * log(p_i)) over the classes in a node
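
A quick sketch of both impurity measures for one node's class proportions (values are illustrative):

import numpy as np

p = np.array([0.7, 0.3])                   # class proportions in one node

gini = 1 - np.sum(p ** 2)                  # 0 when the node is pure
entropy = -np.sum(p * np.log2(p))          # information gain = parent entropy - weighted child entropies
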
54
Q

SVM: What is a kernel function? Name 2 kernel functions and their formulas.

A

A kernel function for SVM implicitly maps the data into a higher-dimensional space (the kernel trick) so that it can become linearly separable.

Polynomial Kernel:
(x * y + b)^d, e.g. degree 2: (x * y + b)^2

Radial Basis Function
e^(-gamma * ||a - b||^2)
- Behaves roughly like a weighted average of surrounding points; equivalent to a polynomial kernel with an infinite number of dimensions
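
A sketch of both kernels in NumPy (gamma, b and d are illustrative hyperparameters):

import numpy as np

def polynomial_kernel(x, y, b=1.0, d=2):
    # similarity in a polynomial feature space of degree d
    return (x @ y + b) ** d

def rbf_kernel(x, y, gamma=1.0):
    # similarity decays with squared distance between the points
    return np.exp(-gamma * np.sum((x - y) ** 2))

x, y = np.array([1.0, 2.0]), np.array([2.0, 0.5])
print(polynomial_kernel(x, y), rbf_kernel(x, y))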