Machine Learning Flashcards

1
Q

Explain K-means.

A

Notes
- Unsupervised clustering algorithm.
- Features should be normalized before clustering
- Does not always converge to a global minimum
- Convergence depends on initial cluster centroids
- Initialization Methods: Random, Forgy, Kmeans++
- If the number of clusters is not known, use the elbow method (increase K until the reduction in loss becomes minimal)

Steps
- Determine K (# clusters)
- Initialize K cluster centroids
- Assign points to each cluster
- Take the mean value of all points in cluster. Set that as cluster centroid
- Repeat until the assignments no longer change

Pros
- fast to train, scalable, guaranteed to converge (though only to a local minimum)

Cons
- have to choose K, dependent on initial centroids, susceptible to outliers
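
A minimal sketch with scikit-learn's KMeans, assuming a numeric feature matrix X (placeholder data, not from the card):

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import numpy as np

X = np.random.rand(200, 4)                    # placeholder feature matrix
X_scaled = StandardScaler().fit_transform(X)  # normalize features first

# k-means++ initialization helps converge to a better local minimum
km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0)
labels = km.fit_predict(X_scaled)
print(km.inertia_)                            # within-cluster sum of squares (useful for the elbow method)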

2
Q

Explain supervised, unsupervised, semi-supervised and reinforcement learning.

A

Supervised: Data with labels. Continuous/Discrete.
Ex: Linear Regression, Decision Tree, Forecasting Temperature

Unsupervised: Data without labels.
Ex: K-means, Hierarchical Clustering, Customer Segmentation

Semi-supervised: Data that is not labelled but from which labels can be derived. Think of how word2vec updates word embeddings using words that fall within a sliding window
Ex: Word2Vec

Reinforcement: Each action/data point gets a response/feedback
Ex: DreamerV2

3
Q

What is overfitting? What are some strategies to prevent it?

A

When a model does not generalize well to new data because it has fit the noise in the training data.

Strategies:
Regularization (L1/L2)
Reduce model complexity
Use a validation dataset
Cross-validation
Early-stopping
Use more data
Remove features
Ensemble Learning
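
A minimal sketch of two of these strategies (L2 regularization plus a validation set) with scikit-learn; the data and alpha value are placeholders:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
import numpy as np

X, y = np.random.rand(500, 10), np.random.rand(500)   # placeholder data
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# L2 regularization: larger alpha shrinks weights and reduces overfitting
model = Ridge(alpha=1.0).fit(X_tr, y_tr)
print(model.score(X_tr, y_tr), model.score(X_val, y_val))  # large gap -> likely overfitting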

4
Q

What are the training, validation and test datasets? What percentage of the data would you allocate to each?

A

Training: Used to tune the model parameters

Validation: Used during training to ensure that the model is not overfitting

Test: Gives an estimate of real-world model performance. Once the test data has been used, it cannot be used as test data again.

80-10-10: Typical
60-20-20: Small dataset
90-5-5: Large dataset (if each dataset contains a good representation of true population)
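
A sketch of an 80-10-10 split using two passes of scikit-learn's train_test_split (X and y are placeholders):

from sklearn.model_selection import train_test_split
import numpy as np

X, y = np.random.rand(1000, 5), np.random.rand(1000)   # placeholder data

# first hold out 20%, then split that 20% in half -> 80/10/10
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.2, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)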

5
Q

How would you handle missing/corrupted data?

A

Mean - No outliers
Median - There are outliers
Forward/Backward Fill - If there is an order to the data
Impute a value / add a missing-value indicator - NaN values may themselves be informative
Remove row/column - Might not be worth keeping
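
A sketch of these options with pandas on a toy DataFrame with a numeric "price" column (names illustrative):

import pandas as pd
import numpy as np

df = pd.DataFrame({"price": [10.0, np.nan, 12.0, 200.0, np.nan]})      # toy data

df["price_mean"]   = df["price"].fillna(df["price"].mean())    # mean: no outliers
df["price_median"] = df["price"].fillna(df["price"].median())  # median: robust to outliers
df["price_ffill"]  = df["price"].ffill()                       # forward fill: ordered data
df["price_flag"]   = df["price"].isna().astype(int)            # keep the "was missing" signal
df_dropped = df.dropna(subset=["price"])                       # or remove the rows entirely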

6
Q

How to choose which ML model to use for a classification problem?

A

Strategies
Cross Validation (if computationally viable)
Train-valid-test (if cross-validation not viable)
Model size limitations
Model inference speed
Little data (use model with lower variance)
Big data (use model with lower bias)
Whether the model needs to handle missing values

7
Q

Explain the bias/variance trade off.

A

Bias: How well the model fits the training data (the lower the bias, the better the fit)

Variance: How much the model parameters and predictions change with a different training sample

Tradeoff: Low bias and low variance is the sweet spot. Lowering bias further tends to increase variance and vice versa. Sometimes you may accept slightly more bias in exchange for lower variance to get more robust predictions.

8
Q

What is a confusion matrix?

A

A confusion matrix plots the predicted values against the actual values for classification problems. It also shows the TP, TN, FP, FN’s.
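
A minimal sketch with scikit-learn (the labels are made up):

from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]

# rows = actual class, columns = predicted class
# for binary labels {0, 1} the layout is [[TN, FP], [FN, TP]]
print(confusion_matrix(y_true, y_pred))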

9
Q

What are TPs, TNs, FPs and FNs?

A

Think in this format. “Correct? Prediction?”.

True positives - Correct positive prediction
True negatives - Correct negative prediction
False positives - Positive prediction when the label is negative
False negatives - Negative prediction when the label is positive

10
Q

Stages of ML Model

A
  1. Understanding problem
    - past work, privacy, ethics, do we need ML?
  2. Data Collection
    - existing datasets, get creative here
  3. Data preparation
    - ELT/ETL, feature engineering
  4. Model Development/Model Testing
    - Cross validation, hyper-parameter tuning
  5. Model Deployment
    - Inference speed, REST API or on device, data drift
11
Q

Explain Backpropagation.

A

Backpropagation:
Process to update neural network parameters

Forward Pass:
Pass data through and make predictions

Backward Pass:
Calculates the chained partial derivative (chain rule) of the loss function with respect to each weight/bias. Do this for every parameter. The resulting gradient points in the direction of steepest ascent, so we take its negative to get the direction of steepest descent, multiply by the learning rate, and add that to the parameter to update it.
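
A toy sketch of one forward/backward pass and update for a single linear neuron with squared-error loss (pure NumPy, values illustrative):

import numpy as np

x, y_true = np.array([1.0, 2.0]), 3.0        # one sample
w, b, lr = np.array([0.5, -0.2]), 0.1, 0.01

# forward pass
y_pred = w @ x + b
loss = (y_pred - y_true) ** 2

# backward pass: chain rule dL/dw = dL/dy_pred * dy_pred/dw
dL_dy = 2 * (y_pred - y_true)
dL_dw = dL_dy * x
dL_db = dL_dy

# step in the negative gradient direction (steepest descent)
w -= lr * dL_dw
b -= lr * dL_db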

12
Q

What are some examples of Supervised, Unsupervised, and Semi-supervised Learning?

A

Supervised Learning:
Forecasting temperature
Predicting type of disease on plants using image data
Predicting the cost of housing expenses
Forecasting energy demand

Unsupervised:
Customer segmentation
Anomaly detection
Identifying patterns in DNA

Semi-supervised
Training embeddings using text corpora
Labelling unlabelled data

13
Q

What are K-Means and KNN? Compare and contrast.

A

KMeans:
Unsupervised clustering
Scalable, fast inference
Centroid Initialization: Random, Forgy, Kmeans++

KNN:
Supervised classification
Lazy Learner (No training)
Not scalable, long inference time
Prediction based on K closest points

14
Q

How could you train a model to play Checkers?

A

Use a reinforcement learning model such as DreamerV2. Have an agent play the game, reward positive moves (e.g. capturing pieces), and penalize negative moves (e.g. losing pieces).

15
Q

How could you build a recommendation engine? What are its benefits?

A

Strategies:
Customer segmentation
Product segmentation
Cosine similarity (customers or products)

Benefits:
Customer retention, Customer lifetime value, Improved search results

16
Q

Classification vs. Regression

A

Classification - Discrete Labels
Regression - Continuous Labels

17
Q

Hyperparameters vs. Parameters

A

Hyperparameters
Set by the practitioner (learning rate, optimizer, weight decay, number of hidden layers, etc.)

Parameters
Model learns these from the training data (weights + biases)

18
Q

Random Forest vs. Gradient boosted decision tree

A

Random Forest:
Takes the mean/mode/median of the predictions from a group of decision trees
Each tree is trained on a bootstrap sample of the data and a random subset of the features
More generalizable
Ensemble Learning method
Can train in parallel

GBDT:
Each tree is built on top of each other
Fits each new tree on the residual errors of the previous trees
Predicts the error of the previous trees rather than the target directly

19
Q

Considerations when choosing an ML model?

A

Label presence
Model Size
Training Time, Inference Time
Prediction Accuracy
Implications of FP and FN
Model explainability
Size of training data

20
Q

Precision vs Recall. Define these with TP, TN, FP and FNs.

A

Precision:
How many of your positive predictions are actually positive
TP / (TP + FP)

Recall:
How many of the actual positives were identified
TP / (TP + FN)
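
Computed from illustrative counts:

TP, FP, FN = 40, 10, 20

precision = TP / (TP + FP)   # 0.8   -> of the positive predictions, how many were correct
recall    = TP / (TP + FN)   # ~0.67 -> of the actual positives, how many were found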

21
Q

Correlation vs. Covariance

A

Correlation:
Strength of relationship between variables
Ranges from -1 to 1 (scale-free)

Covariance:
Direction of relationship between variables
Magnitude is dependent on the scale of the variables

22
Q

How are splits determined in a Decision Tree?

A

Gini Impurity / Information Gain:
Gini impurity is faster to compute
A value of 0 means the split is pure

23
Q

How do decision trees prune? LGBM vs. XGBoost.

A

Pruning removes redundant splits (those with very little information gain)
Reduces model complexity and decreases variance

LGBM - grows trees leaf-wise / best-first (fast but greedy)
XGB - builds out to max depth level-wise and then prunes back (less greedy but slower)

24
Q

What is Logistic Regression?

A

Linear regression passed through a sigmoid/logistic function for classification
Squashes the model output to a probability between 0 and 1

25
Q

Normalization vs. Standardization

A

Normalization:
Scales values [0,1] (bounded)
Affected by outliers
f(x) = (x - xmin) / (xmax - xmin)

Standardization:
Assigns Z-score to points (unbounded)
More robust to outliers
f(x) = (x - mu) / sigma
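
Both transforms in NumPy on a toy array:

import numpy as np

x = np.array([2.0, 4.0, 6.0, 100.0])

x_norm = (x - x.min()) / (x.max() - x.min())   # min-max normalization: bounded to [0, 1]
x_std  = (x - x.mean()) / x.std()              # z-score standardization: unbounded, unit variance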

26
Q

What is SVM? Min or Max the margin? Hard vs. Soft margin?

A

SVM (support vector machine):
Supervised classification
support vectors (closest points to decision boundary)
maximize margin (distance from decision boundary to support vectors)
Hard-margin = must perfectly classify
Soft-margin = allows for slight errors (smoothed boundary)
kernel trick (if points are not linearly separable, you can map them to a higher dimension so that they can be)

27
Q

You have a very large dataset that can not fit on one machine. What do you do?

A

Ensure optimal data types are used
Load a subset of the data for each batch
Remove non-informative features
Use PCA for dimensionality reduction

28
Q

What is PCA? What determines the PCs?

A

PCA (principal component analysis):
Dimensionality reduction technique
Reduces computation, can visualize high dimensional data
PCs are ranked by how much of the data variance they explain (highest first)
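
A minimal scikit-learn sketch on placeholder data:

from sklearn.decomposition import PCA
import numpy as np

X = np.random.rand(100, 20)                 # placeholder high-dimensional data
pca = PCA(n_components=2).fit(X)
X_2d = pca.transform(X)                     # data projected onto the first 2 PCs
print(pca.explained_variance_ratio_)        # variance explained by each PC (highest first)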

29
Q

Type I vs Type II Error

A

Type I (false positive):
Rejecting the null hypothesis when it is actually true in the population

Type II (false negative):
Failing to reject the null hypothesis when it is actually false in the population

30
Q

Explain Ensemble Learning. (Pros and Cons)

A

Use multiple models in prediction
Can be the Mean/Median/Mode of a number of predictions
Can be a linear regression model fit to the individual predictions to get a weighted linear combination of them
Best to use models with “different perspectives”

Pros: More generalizable, Increased accuracy
Cons: More computationally expensive (training and inference)

31
Q

Explain cross-validation.

A

K-fold cross validation:
Train K models; each data point appears in the validation set exactly once
Gives more accurate measure of model performance than train-valid-test split
Computationally expensive
Good for limited datasets
Note: There is still an unseen test set, the rest of the data is used for K-fold CV
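
A sketch of 5-fold cross-validation with scikit-learn (the model and toy data are illustrative):

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=300, random_state=0)   # toy data
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean(), scores.std())          # average performance across the 5 folds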

32
Q

L1 vs L2

A

L2 (Ridge):
Penalizes the squared weight values in the loss function
Shrinks weights towards 0 but never exactly to 0 (non-sparse)
Ridge regression has a closed-form least-squares solution
Gaussian Prior

L1 (Lasso):
Penalizes the absolute value of the weights in the loss function
Can shrink weights exactly to 0 (sparse)
Lasso regression has no closed-form solution; it is solved iteratively (e.g. coordinate or gradient descent)
Laplacian Prior
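
A sketch comparing the two in scikit-learn (alpha values and data are illustrative); note how Lasso drives some coefficients exactly to zero:

from sklearn.linear_model import Ridge, Lasso
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=200, n_features=10, n_informative=3, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)    # L2: small but non-zero weights
lasso = Lasso(alpha=1.0).fit(X, y)    # L1: sparse weights (some exactly 0)
print(ridge.coef_)
print(lasso.coef_)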

33
Q

What does the ROC curve show?

A

It plots the true positive rate (sensitivity) against the false positive rate (1 - specificity) as the classification threshold varies.

A diagonal line from bottom-left to top-right corresponds to a model predicting at random.

34
Q

How does Bayes Theorem apply to ML?

A

Naive Bayes Classifier:
Classifies based on probability of each class from training data

Bayesian Hyperparameter Optimization:
Uses information from past combinations to determine where to search next. Picks the next combination based on space likely to minimize loss function.
“smarter” search than random/grid search

35
Q

Why is Naive Bayes naive?

A

Assumes all features are independent (i.e. does not consider the effect of multiple variables combined)

36
Q

What is the F1 score? When would you use it?

A

The F1 score is a classification metric that balances precision and recall. It can be useful when you have an imbalanced dataset.

Harmonic mean between precision and recall.
F1 = 2(pr)/(p+r)

37
Q

Which is more important: model accuracy or model performance?

A

Differs by use case:

Accuracy: Sales Forecast, Annual GDP calculation, Image Generation
Performance: Robotics, TSA XRAY screening, On-device ML, Sentence completion

38
Q

What are some strategies to handle dataset imbalance?

A

Collect more data (best)
Choose a different metric (F1-score, Precision, Recall, avg per-class accuracy)
Oversample (Increase likelihood of overfit)
Undersample (Increase likelihood of underfit)
Give more weight to the minority class in the loss / weight update

39
Q

MAE vs MSE

A

MAE:
Not sensitive to outliers
An error reduction on any point counts the same as an equal reduction on any other point

MSE:
Sensitive to outliers
Prioritizes improving outliers

40
Q

Assumptions of Linear Regression?

A

Constant variance across range (Homoscedasticity)
Normally distributed residuals
Independent observations
Linear relationship between variables and target

Downside: cannot capture non-linear relationships

41
Q

What is collinearity? Why is it bad in a linear model?

A

When there are multiple variables that are highly correlated
Gives misleading feature weights

Ex.
Think of if we added the same feature twice
The weights could be (5,5),(10,0),(-20,30) although the effect is always 10

42
Q

Explain Bagging vs. Boosting

A

Bagging:
Taking the mean/median/mode from a set of predictions to make final prediction

Boosting:
Sequential process of fitting the next model on the error of the previous model
This is seen in gradient boosted decision trees

43
Q

What is an Outlier? How could you screen these points?

A

Z-score: (x - mean) / (std dev) - see the sketch after this list
Anything outside of 3 standard deviations is probably an outlier

Clustering:
Fit a k-means model. If there is a cluster with very few points these are likely outliers.

Binary Classification:
If you have labelled set of points, build a classification model to identify
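
A sketch of the z-score screen in NumPy (toy data with one extreme point):

import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(10, 1, 100), [50.0]])   # toy data with one extreme value

z = (x - x.mean()) / x.std()
outliers = x[np.abs(z) > 3]     # flag points more than 3 standard deviations from the mean
print(outliers)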

44
Q

How do you identify causation versus correlation?

A

Hypothesis Testing, A/B testing where all variables other than the independent variable are controlled.

45
Q

Vanishing vs Exploding Gradients. What can you do to stop these?

A

VG:
A long chain of partial derivatives with magnitudes below one in the weight-update calculation
Results in tiny update steps
More common with sigmoid/TanH activations
Model never converges

EG:
A long chain of partial derivatives with magnitudes above one in the weight-update calculation
Results in overshooting the global minimum
Not as bad as VG (gradient clipping is a simple fix)

Strategies to mitigate (both):
- Swish/ReLU Activations, Gradient Clipping, Residual/Skip connections
- Batch Normalization Layers
- He Initialization
- (initialize weights with a sample from gaussian distribution)
- (mu = 0 and sig = sqrt(2/(# of inputs to the node)))

Just EG:
L1/L2 regularization, Lower LR, Maybe change optimizer?

46
Q

Define the curse of dimensionality

A

Exponential increase in the data and computation needed for every added dimension; data becomes increasingly sparse as dimensionality grows

47
Q

What are some metrics for classification and regression?

A

Classification:
Accuracy, Precision, Recall, F1-Score, Cross Entropy

Regression:
RMSE, MSE, MAE, MAPE, information criteria (e.g. AIC/BIC), R^2, L1/L2 loss

48
Q

What is data drift? How do you detect it?

A

When the distribution of predictions or features changes.

Detect By:
Significance Testing: Compare the current distribution of predictions/features with the historical distribution (KS test, t-test)
Model-Based Approach: Train a classifier to distinguish historical from real-time data (if it separates them easily, the data has likely drifted)

49
Q

What is a long-tailed distribution? Give 3 examples in the real world.

A

When a large share of the values lie in a long tail far from the peak, so the distribution is not centred around the mean.

Ex 1: Number of coin flips until you see heads
Ex 2: If you generate a random number “B” between 0-100, and then a random number between 0-“B”.
Ex 3: Time before a battery runs out of charge

50
Q

Batch vs Mini-batch vs Stochastic Gradient Descent.

A

Batch
- Update weights after evaluation of all data points

Mini-batch
- Update weights after evaluation of a subset of the datapoints

Stochastic
- Update weights after every single data point
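
A sketch of the mini-batch update loop for linear regression in NumPy (toy data); batch GD is the special case batch_size = n, and SGD is batch_size = 1:

import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(256, 3)), rng.normal(size=256)    # toy data
w, lr, batch_size = np.zeros(3), 0.01, 32

for epoch in range(10):
    idx = rng.permutation(len(X))                         # shuffle each epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        Xb, yb = X[batch], y[batch]
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(batch)      # MSE gradient on the mini-batch
        w -= lr * grad                                    # update after each mini-batch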

51
Q

Why do we use activation functions? What are some activation functions?

A

Activation functions make it so that we can map non-linear patterns in the data. They are usually easily differentiable due to the number of derivatives calculated for every weight update.

Examples:
- ReLU, Sigmoid/Logistic, TanH, Swish, Mish, Leaky ReLU, APT-X

52
Q

What are some common optimizers for neural networks? Explain.

A

Momentum (Ball rolling down hill)

NAG (Nesterov's accelerated gradient)
- Uses momentum plus lookahead (calculates the update with respect to the future, momentum-stepped parameters)
- Smart ball rolling down a hill

Adadelta
- Adapts the learning rate per parameter on the fly using a decaying average of past squared gradients
- w's with consistently large gradients get a smaller effective LR
- w's with consistently small or infrequent gradients get a larger effective LR

ADAM
- Combines momentum (decaying average of past gradients) with per-parameter adaptive LRs (decaying average of past squared gradients)
- The NAdam variant adds NAG-style lookahead

53
Q

2 Criteria for splitting decision trees?

A

Continuous (regression):
SSE: sum of squared errors within each node

Categorical (classification):
Gini Impurity: 1 - sum(p_i^2) over the classes in a node (0 = pure)
Computationally less expensive than information gain

Information Gain: the reduction in entropy after the split
Entropy = -sum(p_i * log(p_i)) over the classes in a node
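
A quick sketch of both impurity measures for one node's class proportions (values are illustrative):

import numpy as np

p = np.array([0.7, 0.3])                   # class proportions in one node

gini = 1 - np.sum(p ** 2)                  # 0 when the node is pure
entropy = -np.sum(p * np.log2(p))          # information gain = parent entropy - weighted child entropies
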
54
Q

SVM: What is a kernel function? Name 2 kernel functions and their formulas.

A

A kernel function for SVM implicitly maps the data into a higher-dimensional space (the kernel trick) so that it can become linearly separable.

Polynomial Kernel:
(x * y + b)^d, e.g. degree 2: (x * y + b)^2

Radial Basis Function
e^(-gamma * ||a - b||^2)
- Behaves roughly like a weighted average of surrounding points; equivalent to a polynomial kernel with an infinite number of dimensions
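
A sketch of both kernels in NumPy (gamma, b and d are illustrative hyperparameters):

import numpy as np

def polynomial_kernel(x, y, b=1.0, d=2):
    # similarity in a polynomial feature space of degree d
    return (x @ y + b) ** d

def rbf_kernel(x, y, gamma=1.0):
    # similarity decays with squared distance between the points
    return np.exp(-gamma * np.sum((x - y) ** 2))

x, y = np.array([1.0, 2.0]), np.array([2.0, 0.5])
print(polynomial_kernel(x, y), rbf_kernel(x, y))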