Machine Learning Fundamentals Flashcards

1
Q

What is the difference between classical momentum and Nesterov Momentum in an Optimizer?

A

Classical momentum makes an update that is the sum of two terms: the decayed average of previous gradients and the newly computed gradient. Nesterov momentum recognizes that the gradient should be computed after the momentum contribution has moved the parameters: why make a decision about which way to go while standing at point A, then walk 100 feet east to point B and act on the decision made at A?
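
A minimal numpy sketch of the two update rules (the toy quadratic gradient, learning rate, and momentum coefficient are illustrative assumptions):

```python
import numpy as np

def grad(w):                  # hypothetical gradient of f(w) = ||w||^2 / 2
    return w

w = np.array([5.0, -3.0])
v = np.zeros_like(w)          # velocity: the decayed gradient average
mu, lr = 0.9, 0.1

# Classical momentum: gradient evaluated at the current point.
v = mu * v - lr * grad(w)
w = w + v

# Nesterov momentum: gradient evaluated at the look-ahead point w + mu * v,
# i.e., after the momentum contribution has moved the parameters.
v = mu * v - lr * grad(w + mu * v)
w = w + v
```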

2
Q

Describe a Bayesian Network.

A

Bayesian Networks are DAGs whose nodes represent variables; edges are drawn between nodes with conditional dependencies (nodes with no connecting edge are conditionally independent). Each node can be thought of as a probability function taking its parents' values as inputs and outputting a probability or distribution. Importantly, they can be used to derive the probability of a cause given an effect.

3
Q

Explain Boosting vs Bagging.

A

Both are methods that use ensembles of weak learners to create a strong learner. Bagging uses bootstrapped samples to create the individual learners, and averages their predictions (or uses voting or a similar procedure). Boosting trains a new learner on the mistakes of the previous one, and uses a weighted combination of the learner outputs to generate the final prediction.
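
A quick, hedged illustration using scikit-learn's stock implementations of each idea (dataset and hyperparameters are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier

X, y = make_classification(n_samples=500, random_state=0)

# Bagging: each learner is trained on a bootstrap sample; predictions are aggregated.
bag = BaggingClassifier(n_estimators=50, random_state=0).fit(X, y)

# Boosting: each learner focuses on the previous learners' mistakes;
# outputs are combined as a weighted sum.
boost = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)

print(bag.score(X, y), boost.score(X, y))
```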

4
Q

What are the assumptions that are made when using a linear model?

A

There are four:

1) The true relationship between the independent and dependent variables is linear.
2) Error is Gaussian.
3) Noise is homoskedastic.
4) Data samples are independent.

5
Q

Describe what is meant by an over- or under-determined system.

A

These terms refer to the shape of the (data points × features) data matrix. In modern ML, we are almost always working in high-dimensional systems that have either a huge number of data points relative to the number of features (e.g., internet advertising transaction data) or a huge number of features relative to the number of data points (e.g., biology, DNA). The former is overdetermined (more equations than unknowns, so no exact fit exists); the latter is underdetermined (infinitely many solutions).

6
Q

On model fitting: to what extent does the choice of model depend on how much data you have?

A

Ideally, these are independent. The model choice should be based on the characteristics of the system, not a function of the number of data points that you have. If a linear model can perform well on 50 data points from a highly complex system, why should we expect it to generalize?

7
Q

Name a few reasons a dataset might have outliers.

A
  • Data entry errors
  • Measurement errors (Instrument Errors)
  • Intentional Errors (to test error detection systems)
  • Data processing errors (unintended mutations, maybe buggy preprocessing code)
  • Sampling errors (data from an unintended distribution made it into the dataset)
  • Natural (not an error, just a rare / novel data point)
8
Q

What are some types of outliers?

A
  • Point outliers: data points far from the distribution
  • Contextual outliers: for example, an exclamation point in the middle of a word in NLP. Data that might belong in the domain, but not where it’s found.
  • Collective outliers: subsets of outliers that may themselves be sending some signal that the underlying system is poorly understood or that the model is poorly formulated.
9
Q

What are some methods to detect outliers?

A
  • Z scores
  • DBSCAN (Density-Based Spatial Clustering of Applications with Noise), which splits points into core points, border points, and outliers based on neighborhoods around the points (the radius is a hyperparameter) and the number of neighbors (minpts) that must be in a point's neighborhood for it to be considered a core point. A point with no other points in its neighborhood is an outlier. The process is O(N log N), good for medium-sized datasets; see the sketch after this list.
  • Isolation Forest. Creates random splits between min_val and max_val on different features.
  • Robust Random Cut Forest (Amazon)
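A minimal sketch of DBSCAN-based outlier detection with scikit-learn (the data and hyperparameters are toy assumptions; points labeled -1 are the outliers):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (100, 2)),   # one dense cluster
               [[5.0, 5.0]]])                  # one far-away point

# eps = neighborhood radius, min_samples = minpts for a core point
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
print(X[labels == -1])                         # expect the point near (5, 5)
```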
10
Q

Can you explain Big O, Big Omega, Big Theta?

A
  • Big O: an algorithm is O(F(N)) if there exist constant c and integer n0 such that for all input problem sizes greater than n0, the runtime of the algorithm will be less than cF(N).
  • Big Omega: lower bound. The runtime of the algorithm will always be greater than cF(N).
  • Big Theta: tight bound; the runtime is bounded both above and below by constant multiples of F(N), i.e., the algorithm is both O(F(N)) and Omega(F(N)).
11
Q

Can you explain the difference between Generative and Discriminative models?

A
  • A generative model models the full joint distribution of the data, P(X, Y), while a discriminative model models the conditional probability of the target given the data, P(Y|X).
12
Q

What is Bayes risk?

A

The Bayes risk is the lowest achievable risk (expected loss function value) with respect to a data generating process by a measurable function, f. The Bayes Classifier is the model that achieves that risk.

13
Q

What are advantages of Decision Trees?

A
  • Simple to understand and interpret
  • Can handle numerical and categorical data
  • Require little data preprocessing
  • White-box
  • Possible to validate using statistical tests.
  • Non-parametric: makes no assumptions about training data or residuals
  • Performs well on large datasets
  • Mirrors human decision-making
  • Robust against collinearity
  • Built-in feature selection / importance
  • Can approximate any boolean function
14
Q

What are disadvantages of Decision Trees?

A
  • Can be non-robust to changes in the training data. A small change in the training data can result in a large change in the tree
  • Optimal decision tree learning is NP-Complete.
15
Q

What are nuisance variables?

A

A nuisance variable is a random variable that is necessary for properly modeling a system but is of no interest itself (or is an intermediate variable, or unobserved).

16
Q

Give the time complexity (average and worst cases are the same in each case) of the following list operations in python:

Copy, Append, Pop last, Pop intermediate, insert, get item, set item, delete item, iterate list, get slice, delete slice, set slice, extend, sort, multiply, x in s, min, max, length.

A

Copy O(N); Append O(1); Pop last O(1); Pop intermediate O(N); Insert O(N); Get item O(1); Set item O(1); Delete item O(N); Iterate list O(N); Get slice O(k); Delete slice O(N); Set slice O(k + N); Extend O(k); Sort O(N log N); Multiply O(N·k); x in s O(N); Min O(N); Max O(N); Length O(1).

17
Q

Give the time complexity (average and worst cases are the same in each case) of the following collections.deque operations in python:

Copy, Append, AppendLeft, Pop, PopLeft, Extend, ExtendLeft, Rotate, Remove.

A

Copy O(N), Append O(1), AppendLeft O(1), Pop O(1), PopLeft O(1), Extend O(k), ExtendLeft O(k), Rotate O(k), Remove O(N).

18
Q

Give the time complexity of the following set operations in python:

x in S, Union S|T, Intersection S & T, Multiple Intersection S1&S2…SN, Difference S\T, Symmetric Difference

A

(average-case / worst-case)

x in S: O(1) / O(N); Union S|T: O(len(S) + len(T)); Intersection S & T: O(min(len(S), len(T))) / O(len(S) * len(T)); Multiple Intersection S1&S2…SN: (N-1) * O(max(len(Si))); Difference S\T: O(len(S)); Symmetric Difference: O(len(S)) / O(len(S) * len(T)).

19
Q

Give the time complexity of the following dict operations in python:

k in d; Copy; Get Item; Set Item; Delete Item; Iteration

A

(Avg. Case / Amortized Worst Case)

k in d (O(1) / O(N)); Copy O(N); Get Item (O(1) / O(N)); Set Item (O(1) / O(N)); Delete Item (O(1) / O(N)); Iteration O(N)

20
Q

Describe Generalized Linear Models.

A

A generalized linear model is a linear model with an additional component, a link function g with g(mu) = XB, that specifies the relationship between the expected value of the target, mu = E[y|X], and the linear predictor XB. So E[y|X] = g^-1(XB).

21
Q

What is an exponential response (or log-linear) model? What is the link function involved in modeling them?

A

This is a generalized linear model where the logarithm of the expected output varies linearly with the input, i.e., the link function is the logarithm: XB = ln(mu), so mu = exp(XB). (For an exponential or Gamma response, the canonical link is instead the negative inverse, XB = -mu^-1.)

22
Q

What is the link function that specifies a logistic regression model?

A

The link function is the log-odds (logit), log(p / (1 - p)). The inverse function (mean function) that specifies the model is then the sigmoid, 1 / (1 + exp(-XB)).
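
A small numpy sketch verifying that the logit link and the sigmoid mean function are inverses:

```python
import numpy as np

def logit(p):                      # link: log-odds, maps (0, 1) -> (-inf, +inf)
    return np.log(p / (1 - p))

def sigmoid(z):                    # inverse link (mean function), maps (-inf, +inf) -> (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

p = 0.8
assert np.isclose(sigmoid(logit(p)), p)
```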

23
Q

Describe Rabin-Karp Substring Search.

A

RKSS uses a rolling hash function to find occurrences of a given substring in a larger string in linear time, by computing a single hash based on each length-k window rather than doing string comparisons index-by-index.
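
A standard textbook-style sketch of the algorithm (the base and modulus are arbitrary choices; hash collisions are resolved by a direct comparison):

```python
def rabin_karp(text, pattern, base=256, mod=10**9 + 7):
    """Return the start indices where `pattern` occurs in `text`."""
    n, k = len(text), len(pattern)
    if k == 0 or k > n:
        return []
    high = pow(base, k - 1, mod)              # weight of the character leaving the window
    p_hash = w_hash = 0
    for i in range(k):                        # hash the pattern and the first window
        p_hash = (p_hash * base + ord(pattern[i])) % mod
        w_hash = (w_hash * base + ord(text[i])) % mod
    hits = []
    for i in range(n - k + 1):
        # Only compare strings directly on a hash match (collisions are possible).
        if w_hash == p_hash and text[i:i + k] == pattern:
            hits.append(i)
        if i < n - k:                         # roll the hash: drop text[i], add text[i + k]
            w_hash = ((w_hash - ord(text[i]) * high) * base + ord(text[i + k])) % mod
    return hits

print(rabin_karp("abracadabra", "abra"))      # [0, 7]
```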

24
Q

What are some models of missing data?

A

MCAR: Missing completely at random. Values in the dataset are missing with no dependence on the data itself.

MAR: Missing at random. Values in the data are missing at random within subgroups of data points (for example, maybe data collected from managers is more likely to be missing than data from other employees).

MNAR: Missing not at random. The probability of data being missing depends on the data itself (e.g., people are embarrassed about their information and don't want to report it).

25
Q

What are some simple strategies for handling missing data?

A
  • Ignore data with missing values
  • Drop missing values
  • Let the algorithm handle it (sometimes there is information in which values are missing). XGBoost has options for this.
26
Q

What are some more advanced options for handling missing data?

A

Imputation and interpolation: replace missing values with the mean, median, or most common value (mode), or interpolate them from neighboring observations.

27
Q

What is hot deck imputation?

A

Imputing missing values by identifying a set of other data points that are similar in other features and randomly selecting an imputed value from among the values in the similar set.

28
Q

Regression imputation?

A

Regress missing variable against other variables, then use the regression prediction as the imputed value. Stochastic version adds in a random residual component.

29
Q

KNN imputation?

A

KNN imputation can be used to fill missing values. Can be much better than other methods, but is computationally expensive on large datasets and can be sensitive to outliers.

30
Q

Explain SMOTE.

A

SMOTE (Synthetic Minority Over-sampling Technique) oversamples minority classes by interpolating between observed data points and including those interpolated points in training. It selects a point in the minority class, finds its k nearest minority-class neighbors for some k, then interpolates between the point and a randomly chosen neighbor to create a new synthetic point.
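
A minimal numpy sketch of the interpolation step (k, the random choices, and the toy data are assumptions, not the reference implementation):

```python
import numpy as np

def smote_sample(X_min, k=5, rng=None):
    rng = rng if rng is not None else np.random.default_rng(0)
    i = rng.integers(len(X_min))                   # pick a random minority-class point
    d = np.linalg.norm(X_min - X_min[i], axis=1)
    neighbors = np.argsort(d)[1:k + 1]             # its k nearest neighbors (excluding itself)
    j = rng.choice(neighbors)
    lam = rng.random()                             # interpolation weight in [0, 1)
    return X_min[i] + lam * (X_min[j] - X_min[i])  # synthetic point on the connecting segment

X_min = np.random.default_rng(1).normal(size=(20, 2))  # toy minority class
print(smote_sample(X_min))
```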

31
Q

Explain model bias and variance.

A

Expected model error can be decomposed into bias and variance terms (as bias^2 + variance + irreducible error). Bias measures the difference between the model's expected prediction and the true value; high bias is underfitting - bad on train, bad on test. Variance measures the expected squared difference between the predicted value and the mean predicted value; high variance and low bias means a model is overfit.

32
Q

Explain the difference between model-based and instance-based learning

A

Model-based learning learns a model over the feature space to make predictions. Instance-based learning memorizes the training set and uses a similarity function to compare new data to memorized instances (KNN).

33
Q

What is a diffusion model?

A

Random noise can be iteratively added to an image until the result is indistinguishable from the noise distribution. Diffusion models learn the inverse transform to create an image out of noise.

34
Q

Describe the ADAM optimizer.

A

Adam is a method that uses per-parameter learning rates governed by four hyperparameters: alpha, beta1, beta2, and epsilon, which is included to avoid division by zero. alpha is the maximum step size for any parameter. beta1 (usually 0.9) controls the decay of the first-moment (mean) estimate, and beta2 (usually 0.999) controls the decay of the second-moment (uncentered variance) estimate.
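
A hedged single-step sketch of the update (defaults follow the values above; t counts steps from 1):

```python
import numpy as np

def adam_step(w, g, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * g          # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * g ** 2     # second-moment (uncentered variance) estimate
    m_hat = m / (1 - beta1 ** t)             # bias corrections for zero initialization
    v_hat = v / (1 - beta2 ** t)
    w = w - alpha * m_hat / (np.sqrt(v_hat) + eps)   # eps avoids division by zero
    return w, m, v
```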

35
Q

What is independence of irrelevant alternatives? What models assume it?

A

This states that the relative probability of two alternatives does not depend on the introduction or removal of other alternatives (the relative probability of taking a bus or a car does not change when a bike option is added). It is an assumption built into multinomial logistic regression.

36
Q

Optimization: What are the two major approaches?

A

Line search (subsumes gradient descent methods) and trust region.

37
Q

What is Newton’s method?

A

A second-order method for finding the zeros of a function (gradient descent is a first-order method). To minimize or maximize a function, it is applied to the derivative, finding where f' = 0. It makes updates as follows: x_{t+1} = x_t - f'(x_t) / f''(x_t).
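
A tiny sketch minimizing f(x) = x^4 by running the update on f'(x) = 4x^3 (the function and step count are arbitrary choices):

```python
def newton_minimize(df, d2f, x, steps=20):
    for _ in range(steps):
        x = x - df(x) / d2f(x)     # x_{t+1} = x_t - f'(x_t) / f''(x_t)
    return x

# f(x) = x^4, f'(x) = 4x^3, f''(x) = 12x^2; converges toward the minimum at 0
print(newton_minimize(lambda x: 4 * x**3, lambda x: 12 * x**2, x=3.0))
```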

38
Q

Give major points about gradient boosting.

A
  • Allows for control of bias-variance tradeoff (by fitting as hard as desired).
  • More trees allows for harder fits
  • Depth of trees allows for harder fits
  • Learning rate scales down the weight of the trees
39
Q

How does gradient boosting differ from AdaBoost?

A
  • AdaBoost uses decision stumps - decision trees that make a single cut. Gradient Boosting uses full trees.
  • Gradient boosting initializes a mean value (or distribution over classes)
  • Gradient boosting scales the trees identically (by the learning rate)
  • AdaBoost scales subsequent trees by their performance.
40
Q

How does gradient boosting combine trees (regression)?

A

The values in the leaves of each decision tree are the mean residuals of the data points sorted into that leaf. A new prediction is the initial guess (the mean target value) plus the sum of the trees' predicted residuals, each scaled down by the learning rate.
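
A hedged numpy/scikit-learn sketch of this combination rule (data, tree depth, and learning rate are toy choices):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.random.default_rng(0).uniform(-3, 3, (200, 1))
y = np.sin(X).ravel()

lr = 0.1
pred = np.full_like(y, y.mean())        # initial guess: the mean target value
for _ in range(100):
    tree = DecisionTreeRegressor(max_depth=2).fit(X, y - pred)  # fit the residuals
    pred += lr * tree.predict(X)        # every tree's contribution is scaled identically

print(np.mean((y - pred) ** 2))         # training MSE shrinks as trees are added
```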

41
Q

Does gradient boosting try to prevent the trees from being the same?

A

No. There’s no reason to, as each tree represents taking a small step determined by the learning rate. If the steps need to go “in the same direction”, that’s fine.

42
Q

How are feature importances computed for GBM?

A

In each tree, the splits made on each feature are found and the quality of those splits is averaged (either by gini impurity or information gain).

43
Q

What is Gini Impurity?

A

It's the probability of incorrectly classifying a randomly chosen sample if it were labeled at random according to the class distribution of the dataset: Gini = Sum_{c=1..C} p_c * (1 - p_c), where C is the number of classes indexed by c.
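
A direct translation of the formula:

```python
import numpy as np

def gini_impurity(labels):
    # sum over classes of p_c * (1 - p_c); 0 for a pure node
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(np.sum(p * (1 - p)))

print(gini_impurity([0, 0, 1, 1]))   # 0.5: maximally mixed two-class set
print(gini_impurity([0, 0, 0, 0]))   # 0.0: pure set
```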

44
Q

What is the Jaccard Index?

A

A similarity measure between two sets, defined as the cardinality of their intersection over the cardinality of their union: J(A, B) = |A ∩ B| / |A ∪ B|. The corresponding Jaccard distance is 1 - J.
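
As a one-liner on Python sets:

```python
def jaccard_index(a: set, b: set) -> float:
    return len(a & b) / len(a | b)      # |intersection| / |union|

print(jaccard_index({1, 2, 3}, {2, 3, 4}))   # 2 / 4 = 0.5; distance = 1 - 0.5
```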

45
Q

What is Cook's Distance?

A

Cook’s Distance is an estimate of the influence of a data point in linear regression. It takes into account both the leverage and residual of each observation. Cook’s Distance is a summary of how much a regression model changes when the ith observation is removed.

46
Q

What is RMSProp? How does it improve on AdaGrad?

A

AdaGrad divides each parameter's update by the l2 norm of all of that parameter's previous gradients. The intent is to promote learning in parameters that have received few updates, but in practice the ever-growing norm grinds training to a halt. RMSProp fixes this by replacing the sum with an exponentially decaying average - a convex combination of the new squared gradient and the historical average - so the denominator cannot grow without bound.
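
A hedged sketch contrasting the two accumulators for a single parameter (learning rate and decay are assumed values):

```python
import numpy as np

def adagrad_step(w, g, s, lr=0.01, eps=1e-8):
    s = s + g ** 2                       # sum of squared gradients only ever grows
    return w - lr * g / (np.sqrt(s) + eps), s

def rmsprop_step(w, g, s, lr=0.01, rho=0.9, eps=1e-8):
    s = rho * s + (1 - rho) * g ** 2     # convex combination keeps the norm bounded
    return w - lr * g / (np.sqrt(s) + eps), s
```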

47
Q

What is a Trie?

A

A trie is a tree data structure built for prefix lookup. It is commonly used to store a dictionary of English words for prefix search. It has a placeholder root node whose children represent characters; special marker nodes indicate the ends of words. For example, a path might be Root -> M -> A -> N -> Y.
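
A minimal dict-of-dicts sketch (the "$" end-of-word marker is an arbitrary convention):

```python
class Trie:
    def __init__(self):
        self.root = {}                       # placeholder root node

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})   # one child per character
        node["$"] = True                     # special marker: a word ends here

    def has_prefix(self, prefix):
        node = self.root
        for ch in prefix:
            if ch not in node:
                return False
            node = node[ch]
        return True

t = Trie()
t.insert("many")
print(t.has_prefix("man"), t.has_prefix("max"))   # True False
```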

48
Q

Explain Global Average (or Max) Pooling.

A

The idea behind Global Average Pooling is to create a feature map for each output class and average each map down to a single value that is used as the input to the softmax layer, avoiding a fully-connected layer between the convolutional features and the classifier.
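
In tensor terms, GAP is just a mean over the spatial dimensions (the shapes below are toy assumptions):

```python
import numpy as np

feature_maps = np.random.default_rng(0).normal(size=(8, 10, 7, 7))  # N, C, H, W
gap = feature_maps.mean(axis=(2, 3))                                # -> (N, C)
print(gap.shape)   # (8, 10): one value per feature map, fed directly to the softmax
```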

49
Q

What are the advantages of using Global Average Pooling?

A
  • More native to convolutional structure because it directly associates output categories with feature maps.
  • Allows interpretability, because the feature maps can be interpreted as confidence maps for a given output class
  • Prevents overfitting, as there are no learned parameters in the GAP layer.
  • Sums out spatial information, so it is more robust to translational changes in the input.
50
Q

Describe Batch Normalization.

A

Batch normalization is applied to a layer's output (usually before the nonlinearity), centering it and scaling it to unit variance across all samples in the batch. It maintains running (moving-average) estimates of the mean and variance for use at inference, and has learned parameters gamma and beta, so the post-normalization output H is transformed as gamma*H + beta instead of used directly. This allows the mean and spread to take any value instead of 0 and 1, and reparameterizes the mean as the learned beta instead of a function of everything else in the network, which can supposedly improve training dynamics.

51
Q

Describe Layer Normalization.

A

Layer Normalization normalizes the mean and variance of a hidden layer's output neurons (activations). Importantly, it works on a per-training-example basis: it normalizes over the C, H, W dimensions of an N, C, H, W tensor.

52
Q

Describe Instance Normalization.

A

Instance Normalization normalizes per training example and per channel, along the H, W dimensions of N, C, H, W.
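
A hedged numpy sketch of which axes each scheme reduces over for an N, C, H, W tensor (ignoring the learned gamma/beta and running statistics):

```python
import numpy as np

x = np.random.default_rng(0).normal(size=(4, 3, 8, 8))   # N, C, H, W

def normalize(x, axes):
    mu = x.mean(axis=axes, keepdims=True)
    sd = x.std(axis=axes, keepdims=True)
    return (x - mu) / sd

batch_norm    = normalize(x, (0, 2, 3))   # per channel, across the batch
layer_norm    = normalize(x, (1, 2, 3))   # per example, across C, H, W
instance_norm = normalize(x, (2, 3))      # per example and per channel, across H, W
```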

53
Q

What are depthwise convolutions?

A

Depthwise convolutions use separate filters for each input channel, instead of the same filters over all input channels.

54
Q

What is duck typing?

A

A style of typing in which an object's suitability is determined by the methods and attributes it actually provides rather than by its declared type, inspired by the phrase "if it walks like a duck and quacks like a duck, it's a duck". An example is using a JAX array in place of a NumPy array.

55
Q

What does ACID stand for in databases?

A

Atomicity (each transaction either completes entirely or has no effect), Consistency (the database is always in a correct state with respect to constraints, cascades, triggers, etc.), Isolation (parallel transactions leave the database in the state that would result had they been executed serially), and Durability (once a transaction has been committed, it is robust to failures like power outages).

56
Q

What is Equivariance?

A

A function F is equivariant to a transform T when F(T(x)) = T(F(x)) for all points x in the domain.

57
Q

What are two foundational formulations in Model Fairness?

A

Group and Individual Fairness. Informally, group fairness protects groups defined by a protected attribute (e.g. race, gender) by equalizing model performance across them, while Individual Fairness states that "similar individuals should be treated similarly".

58
Q

What is Demographic Parity in Model Fairness?

A

If Y = 1 is considered a “Positive” outcome, then demographic parity states that P(Y=1 | A = a) should be the same for all values of a. That is, the probability of a positive outcome should be independent of the protected attribute.

59
Q

What is “Equal Odds Fairness” in Model Fairness?

A

P(Y_hat = 1 | A = 0, Y = y) = P(Y_hat = 1 | A = 1, Y = y). This has the effect of equalizing true positive rates when the true label is 1 and false positive rates when the true label is 0. This aligns with the goal of building good classifiers, but enforces that error rates are the same across demographics, so it can penalize models that perform better on the majority group, for example.

60
Q

What is a metric in math?

A

A function giving the distance between two objects in a set, defined by three properties:

d(x,y) = 0 <=> x = y (identity of indiscernibles)
d(x,y) = d(y,x) (symmetry)
d(x,z) <= d(x,y) + d(y,z) (triangle inequality)

Non-negativity, d(x,y) >= 0, follows from these.

61
Q

What are some issues with Demographic Parity?

A

1) It doesn’t actually do anything to ensure fairness for individuals, just equalizes probabilities of positive outcomes.
2) In the event that the true labels Y are correlated with the protected attribute, the ideal predictor is not admissible under DP, as it is unfair, so there will be significant loss of utility.

62
Q

What is Equality of Opportunity?

A

A relaxation of Equal Odds. Instead of requiring P(Y_hat = 1 | A = 0, Y = y) = P(Y_hat = 1 | A = 1, Y = y), equalizing true positive rates when the true label is 1 and false positive rates when the true label is 0, we only constrain the positive, or advantaged, class:
P(Y_hat = 1 | A = 0, Y = 1) = P(Y_hat = 1 | A = 1, Y = 1).

63
Q

What is the class-conditional model of label uncertainty?

A

The assumption that label uncertainty does not depend on the data itself: the probability of mislabeling depends only on the true label value and the mislabeled value. For example, leopard and jaguar are likely to have label mistakes between them; leopard and bathtub are not.

64
Q

What is the Bessel Correction in Statistics?

A

The use of n - 1 instead of n in the denominator when computing the sample variance, which makes it an unbiased estimator of the population variance.
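
numpy exposes the correction through the ddof argument:

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(x.var(ddof=0))   # divides by n: biased sample variance
print(x.var(ddof=1))   # divides by n - 1: Bessel-corrected, unbiased estimator
```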

65
Q

What is the difference between Generative and Discriminative models?

A

Generative models model the full joint distribution P(y, x), then compute P(y | x) using Bayes' rule and pick the most likely class. Discriminative classifiers model the posterior P(y | x) directly, or learn a map directly from x to the class labels.

66
Q

What is the Isolation Forest Method? What is it used for?

A

Isolation forests exist for anomaly detection. As anomalies are "few and different", they are easy to isolate with trees that cut up the feature space. This method creates an ensemble of trees that isolate all the points in the dataset; the anomaly score can be given by s(x,n) = 2^[-E(h(x))/c(n)], where E(h(x)) is the point's average path length over the isolation trees and c(n) is a normalizing constant: the average path length for a dataset of n points.
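
A hedged usage sketch with scikit-learn's implementation (the planted anomaly is a toy assumption; -1 marks predicted anomalies):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 2)), [[8.0, 8.0]]])   # one planted anomaly

iso = IsolationForest(random_state=0).fit(X)
print(iso.predict([[8.0, 8.0], [0.0, 0.0]]))    # e.g. [-1, 1]
print(iso.score_samples([[8.0, 8.0]]))          # lower score = more anomalous
```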

67
Q

State the No Free Lunch theorem (in words).

A

For every learner, there exists a task on which it fails, even though that task can be successfully learned by another learner. The proof leans on the idea that the training set size must be less than half the size of the feature space (X), which is essentially always true in practice. Importantly, the statement of the theorem also includes that the original learner reaches 0 empirical risk on its task - it fails to generalize, not to fit the training data.

68
Q

What does end-to-end learning mean in a deep learning context?

A

It is a loosely-defined term referring to a model that maps directly from the raw input X all the way to the final output y, with no hand-engineered intermediate stages.

69
Q

Explain probability calibration and the reliability plot

A

Probability calibration is the process of aligning a model's predicted probabilities with observed data frequencies. The reliability chart plots predicted probability against the observed fraction of positives in each bin (i.e., of the samples that received around a 40% predicted probability of being positive, do roughly 40% have positive labels?). A calibration curve is one way of post-hoc adjusting predicted probabilities to improve calibration.
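
A hedged sketch of computing the reliability curve's points with scikit-learn (data and model are arbitrary):

```python
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

probs = LogisticRegression().fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
frac_pos, mean_pred = calibration_curve(y_te, probs, n_bins=10)
# Perfect calibration puts every (mean_pred, frac_pos) pair on the diagonal.
print(list(zip(mean_pred.round(2), frac_pos.round(2))))
```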

70
Q

What does a sigmoid-shaped calibration reliability curve communicate? An inverse-sigmoid?

A

A sigmoid-shaped curve represents an under-confident classifier. An inverse sigmoid (logit) shaped curve represents an over-confident classifier.

71
Q

What is the purpose of the logit / sigmoid functions?

A

Logit takes values in (0, 1) and maps them to (-inf, +inf). Sigmoid takes values in (-inf, +inf) and maps them to (0, 1); each is the inverse of the other.

72
Q

What is a Metaclass?

A

A class whose instances are classes.
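
A minimal sketch (the registry pattern is just one illustrative use):

```python
registry = {}

class RegisteredMeta(type):
    def __new__(mcls, name, bases, namespace):
        cls = super().__new__(mcls, name, bases, namespace)
        registry[name] = cls                 # record every class built by this metaclass
        return cls

class Widget(metaclass=RegisteredMeta):
    pass

print(registry)        # {'Widget': <class '__main__.Widget'>}
print(type(Widget))    # RegisteredMeta: the class Widget is itself an instance of it
```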