Machine Learning Fundamentals Flashcards

1
Q

What is the difference between classical momentum and Nesterov Momentum in an Optimizer?

A

Classical momentum makes an update that is the sum of two terms: the decayed average of previous gradients and the newly computed gradient. Nesterov momentum recognizes that the gradient should be computed after the momentum contribution has moved the parameters: why make a decision about which way to go while standing at point A, then walk 100 feet east to point B and act on the decision made at A?
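
A minimal numpy sketch of the two update rules (the toy quadratic gradient, learning rate, and momentum coefficient are illustrative assumptions):

```python
import numpy as np

def grad(w):                  # hypothetical gradient of f(w) = ||w||^2 / 2
    return w

w = np.array([5.0, -3.0])
v = np.zeros_like(w)          # velocity: the decayed gradient average
mu, lr = 0.9, 0.1

# Classical momentum: gradient evaluated at the current point.
v = mu * v - lr * grad(w)
w = w + v

# Nesterov momentum: gradient evaluated at the look-ahead point w + mu * v,
# i.e., after the momentum contribution has moved the parameters.
v = mu * v - lr * grad(w + mu * v)
w = w + v
```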

2
Q

Describe a Bayesian Network.

A

Bayesian Networks are DAGs whose nodes represent variables; edges are drawn between nodes with conditional dependencies (nodes with no connecting edge are conditionally independent). Each node can be thought of as a probability function taking its parents' values as inputs and outputting a probability or distribution. Importantly, they can be used to derive the probability of a cause given an effect.

3
Q

Explain Boosting vs Bagging.

A

Both are methods that use ensembles of weak learners to create a strong learner. Bagging uses bootstrapped samples to create the individual learners, and averages their predictions (or uses voting or a similar procedure). Boosting trains a new learner on the mistakes of the previous one, and uses a weighted combination of the learner outputs to generate the final prediction.
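
A quick, hedged illustration using scikit-learn's stock implementations of each idea (dataset and hyperparameters are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier

X, y = make_classification(n_samples=500, random_state=0)

# Bagging: each learner is trained on a bootstrap sample; predictions are aggregated.
bag = BaggingClassifier(n_estimators=50, random_state=0).fit(X, y)

# Boosting: each learner focuses on the previous learners' mistakes;
# outputs are combined as a weighted sum.
boost = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)

print(bag.score(X, y), boost.score(X, y))
```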

4
Q

What are the assumptions that are made when using a linear model?

A

There are four:

1) The true relationship between the independent and dependent variables is linear.
2) Error is Gaussian.
3) Noise is homoskedastic.
4) Data samples are independent.

5
Q

Describe what is meant by an over- or under-determined system.

A

These terms refer to the shape of the (data points × features) data matrix. In modern ML, we are almost always working in high-dimensional systems that have either a huge number of data points relative to the number of features (e.g., internet advertising transaction data) or a huge number of features relative to the number of data points (e.g., biology, DNA). The former is overdetermined (more equations than unknowns, so no exact fit exists); the latter is underdetermined (infinitely many solutions).

6
Q

On model fitting: to what extent does the choice of model depend on how much data you have?

A

Ideally, these are independent. The model choice should be based on the characteristics of the system, not a function of the number of data points that you have. If a linear model can perform well on 50 data points from a highly complex system, why should we expect it to generalize?

7
Q

Name a few reasons a dataset might have outliers.

A
  • Data entry errors
  • Measurement errors (Instrument Errors)
  • Intentional Errors (to test error detection systems)
  • Data processing errors (unintended mutations, maybe buggy preprocessing code)
  • Sampling errors (data from an unintended distribution made it into the dataset)
  • Natural (not an error, just a rare / novel data point)
8
Q

What are some types of outliers?

A
  • Point outliers: data points far from the distribution
  • Contextual outliers: for example, an exclamation point in the middle of a word in NLP. Data that might belong in the domain, but not where it’s found.
  • Collective outliers: subsets of outliers that may themselves be sending some signal that the underlying system is poorly understood or that the model is poorly formulated.
9
Q

What are some methods to detect outliers?

A
  • Z scores
  • DBSCAN (Density-Based Spatial Clustering of Applications with Noise), which splits points into core points, border points, and outliers based on neighborhoods around the points (the radius is a hyperparameter) and the number of neighbors (minpts) that must be in a point's neighborhood for it to be considered a core point. A point with no other points in its neighborhood is an outlier. The process is O(N log N), good for medium-sized datasets; see the sketch after this list.
  • Isolation Forest. Creates random splits between min_val and max_val on different features.
  • Robust Random Cut Forest (Amazon)
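A minimal sketch of DBSCAN-based outlier detection with scikit-learn (the data and hyperparameters are toy assumptions; points labeled -1 are the outliers):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (100, 2)),   # one dense cluster
               [[5.0, 5.0]]])                  # one far-away point

# eps = neighborhood radius, min_samples = minpts for a core point
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
print(X[labels == -1])                         # expect the point near (5, 5)
```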
10
Q

Can you explain Big O, Big Omega, Big Theta?

A
  • Big O: an algorithm is O(F(N)) if there exist constant c and integer n0 such that for all input problem sizes greater than n0, the runtime of the algorithm will be less than cF(N).
  • Big Omega: lower bound. The runtime of the algorithm will always be greater than cF(N).
  • Big Theta: tight bound; the runtime is bounded both above and below by constant multiples of F(N), i.e., the algorithm is both O(F(N)) and Omega(F(N)).
11
Q

Can you explain the difference between Generative and Discriminative models?

A
  • A generative model models the full joint distribution of the data, P(X, Y), while a discriminative model models the conditional probability of the target given the data, P(Y|X).
12
Q

What is Bayes risk?

A

The Bayes risk is the lowest achievable risk (expected loss function value) with respect to a data generating process by a measurable function, f. The Bayes Classifier is the model that achieves that risk.

13
Q

What are advantages of Decision Trees?

A
  • Simple to understand and interpret
  • Can handle numerical and categorical data
  • Require little data preprocessing
  • White-box
  • Possible to validate using statistical tests.
  • Non-parametric: makes no assumptions about training data or residuals
  • Performs well on large datasets
  • Mirrors human decision-making
  • Robust against collinearity
  • Built-in feature selection / importance
  • Can approximate any boolean function
14
Q

What are disadvantages of Decision Trees?

A
  • Can be non-robust to changes in the training data. A small change in the training data can result in a large change in the tree
  • Optimal decision tree learning is NP-Complete.
15
Q

What are nuisance variables?

A

A nuisance variable is a random variable that is necessary for properly modeling a system but is of no interest itself (or is an intermediate variable, or unobserved).

16
Q

Give the time complexity (average and worst cases are the same in each case) of the following list operations in python:

Copy, Append, Pop last, Pop intermediate, insert, get item, set item, delete item, iterate list, get slice, delete slice, set slice, extend, sort, multiply, x in s, min, max, length.

A

Copy O(N); Append O(1); Pop last O(1); Pop intermediate O(N); Insert O(N); Get item O(1); Set item O(1); Delete item O(N); Iterate list O(N); Get slice O(k); Delete slice O(N); Set slice O(k + N); Extend O(k); Sort O(N log N); Multiply O(N·k); x in s O(N); Min O(N); Max O(N); Length O(1).

17
Q

Give the time complexity (average and worst cases are the same in each case) of the following collections.deque operations in python:

Copy, Append, AppendLeft, Pop, PopLeft, Extend, ExtendLeft, Rotate, Remove.

A

Copy O(N), Append O(1), AppendLeft O(1), Pop O(1), PopLeft O(1), Extend O(k), ExtendLeft O(k), Rotate O(k), Remove O(N).

18
Q

Give the time complexity of the following set operations in python:

x in S, Union S|T, Intersection S & T, Multiple Intersection S1&S2…SN, Difference S\T, Symmetric Difference

A

(average-case / worst-case)

x in S: O(1) / O(N); Union S|T: O(len(S) + len(T)); Intersection S & T: O(min(len(S), len(T))) / O(len(S) * len(T)); Multiple Intersection S1&S2…SN: (N-1) * O(max(len(Si))); Difference S\T: O(len(S)); Symmetric Difference: O(len(S)) / O(len(S) * len(T)).

19
Q

Give the time complexity of the following dict operations in python:

k in d; Copy; Get Item; Set Item; Delete Item; Iteration

A

(Avg. Case / Amortized Worst Case)

k in d (O(1) / O(N)); Copy O(N); Get Item (O(1) / O(N)); Set Item (O(1) / O(N)); Delete Item (O(1) / O(N)); Iteration O(N)

20
Q

Describe Generalized Linear Models.

A

A generalized linear model is a linear model with an additional component, a link function g with g(mu) = XB, that specifies the relationship between the expected value of the target, mu = E[y|X], and the linear predictor XB. So E[y|X] = g^-1(XB).

21
Q

What is an exponential response (or log-linear) model? What is the link function involved in modeling them?

A

This is a generalized linear model where the logarithm of the expected output varies linearly with the input, i.e., the link function is the logarithm: XB = ln(mu), so mu = exp(XB). (For an exponential or Gamma response, the canonical link is instead the negative inverse, XB = -mu^-1.)

22
Q

What is the link function that specifies a logistic regression model?

A

The link function is the log-odds (logit), log(p / (1 - p)). The inverse function (mean function) that specifies the model is then the sigmoid, 1 / (1 + exp(-XB)).
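
A small numpy sketch verifying that the logit link and the sigmoid mean function are inverses:

```python
import numpy as np

def logit(p):                      # link: log-odds, maps (0, 1) -> (-inf, +inf)
    return np.log(p / (1 - p))

def sigmoid(z):                    # inverse link (mean function), maps (-inf, +inf) -> (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

p = 0.8
assert np.isclose(sigmoid(logit(p)), p)
```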

23
Q

Describe Rabin-Karp Substring Search.

A

RKSS uses a rolling hash function to find occurrences of a given substring in a larger string in linear time, by computing a single hash based on each length-k window rather than doing string comparisons index-by-index.
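
A standard textbook-style sketch of the algorithm (the base and modulus are arbitrary choices; hash collisions are resolved by a direct comparison):

```python
def rabin_karp(text, pattern, base=256, mod=10**9 + 7):
    """Return the start indices where `pattern` occurs in `text`."""
    n, k = len(text), len(pattern)
    if k == 0 or k > n:
        return []
    high = pow(base, k - 1, mod)              # weight of the character leaving the window
    p_hash = w_hash = 0
    for i in range(k):                        # hash the pattern and the first window
        p_hash = (p_hash * base + ord(pattern[i])) % mod
        w_hash = (w_hash * base + ord(text[i])) % mod
    hits = []
    for i in range(n - k + 1):
        # Only compare strings directly on a hash match (collisions are possible).
        if w_hash == p_hash and text[i:i + k] == pattern:
            hits.append(i)
        if i < n - k:                         # roll the hash: drop text[i], add text[i + k]
            w_hash = ((w_hash - ord(text[i]) * high) * base + ord(text[i + k])) % mod
    return hits

print(rabin_karp("abracadabra", "abra"))      # [0, 7]
```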

24
Q

What are some models of missing data?

A

MCAR: Missing completely at random. Values in the dataset are missing with no dependence on the data itself.

MAR: Missing at random. Values in the data are missing at random within subgroups of data points (for example, maybe data collected from managers is more likely to be missing than data from other employees).

MNAR: Missing not at random. The probability of data being missing depends on the data itself (e.g., people are embarrassed about their information and don't want to report it).

25
Q

What are some simple strategies for handling missing data?

A
  • Ignore data with missing values
  • Drop missing values
  • Let the algorithm handle it (sometimes there is information in which values are missing). XGBoost has options for this.
26
Q

What are some more advanced options for handling missing data?

A

Imputation and interpolation: replace missing values with the mean, median, or most common value (mode), or interpolate them from neighboring observations.

27
Q

What is hot deck imputation?

A

Imputing missing values by identifying a set of other data points that are similar in other features and randomly selecting an imputed value from among the values in the similar set.

28
Q

Regression imputation?

A

Regress missing variable against other variables, then use the regression prediction as the imputed value. Stochastic version adds in a random residual component.

29
Q

KNN imputation?

A

KNN imputation can be used to fill missing values. Can be much better than other methods, but is computationally expensive on large datasets and can be sensitive to outliers.

30
Q

Explain SMOTE.

A

SMOTE (Synthetic Minority Over-sampling Technique) oversamples minority classes by interpolating between observed data points and including those interpolated points in training. It selects a point in the minority class, finds its k nearest minority-class neighbors for some k, then interpolates between the point and a randomly chosen neighbor to create a new synthetic point.
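
A minimal numpy sketch of the interpolation step (k, the random choices, and the toy data are assumptions, not the reference implementation):

```python
import numpy as np

def smote_sample(X_min, k=5, rng=None):
    rng = rng if rng is not None else np.random.default_rng(0)
    i = rng.integers(len(X_min))                   # pick a random minority-class point
    d = np.linalg.norm(X_min - X_min[i], axis=1)
    neighbors = np.argsort(d)[1:k + 1]             # its k nearest neighbors (excluding itself)
    j = rng.choice(neighbors)
    lam = rng.random()                             # interpolation weight in [0, 1)
    return X_min[i] + lam * (X_min[j] - X_min[i])  # synthetic point on the connecting segment

X_min = np.random.default_rng(1).normal(size=(20, 2))  # toy minority class
print(smote_sample(X_min))
```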

31
Q

Explain model bias and variance.

A

Expected model error can be decomposed into bias and variance terms (as bias^2 + variance + irreducible error). Bias measures the difference between the model's expected prediction and the true value; high bias is underfitting - bad on train, bad on test. Variance measures the expected squared difference between the predicted value and the mean predicted value; high variance and low bias means a model is overfit.

32
Q

Explain the difference between model-based and instance-based learning

A

Model-based learning learns a model over the feature space to make predictions. Instance-based learning memorizes the training set and uses a similarity function to compare new data to memorized instances (KNN).

33
Q

What is a diffusion model?

A

Random noise can be iteratively added to an image until the result is indistinguishable from the noise distribution. Diffusion models learn the inverse transform to create an image out of noise.

34
Q

Describe the ADAM optimizer.

A

Adam is a method that uses per-parameter learning rates governed by four hyperparameters: alpha, beta1, beta2, and epsilon, which is included to avoid division by zero. alpha is the maximum step size for any parameter. beta1 (usually 0.9) controls the decay of the first-moment (mean) estimate, and beta2 (usually 0.999) controls the decay of the second-moment (uncentered variance) estimate.
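
A hedged single-step sketch of the update (defaults follow the values above; t counts steps from 1):

```python
import numpy as np

def adam_step(w, g, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * g          # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * g ** 2     # second-moment (uncentered variance) estimate
    m_hat = m / (1 - beta1 ** t)             # bias corrections for zero initialization
    v_hat = v / (1 - beta2 ** t)
    w = w - alpha * m_hat / (np.sqrt(v_hat) + eps)   # eps avoids division by zero
    return w, m, v
```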

35
Q

What is independence of irrelevant alternatives? What models assume it?

A

This states that the relative probability of two alternatives does not depend on the introduction or removal of other alternatives (the relative probability of taking a bus or a car does not change when a bike option is added). It is an assumption built into multinomial logistic regression.

36
Q

Optimization: What are the two major approaches?

A

Line search (subsumes gradient descent methods) and trust region.

37
Q

What is Newton’s method?

A

A second-order method for finding the zeros of a function (gradient descent is a first-order method). To minimize or maximize a function, it is applied to the derivative, finding where f' = 0. It makes updates as follows: x_{t+1} = x_t - f'(x_t) / f''(x_t).
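
A tiny sketch minimizing f(x) = x^4 by running the update on f'(x) = 4x^3 (the function and step count are arbitrary choices):

```python
def newton_minimize(df, d2f, x, steps=20):
    for _ in range(steps):
        x = x - df(x) / d2f(x)     # x_{t+1} = x_t - f'(x_t) / f''(x_t)
    return x

# f(x) = x^4, f'(x) = 4x^3, f''(x) = 12x^2; converges toward the minimum at 0
print(newton_minimize(lambda x: 4 * x**3, lambda x: 12 * x**2, x=3.0))
```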

38
Q

Give major points about gradient boosting.

A
  • Allows for control of bias-variance tradeoff (by fitting as hard as desired).
  • More trees allows for harder fits
  • Depth of trees allows for harder fits
  • Learning rate scales down the weight of the trees
39
Q

How does gradient boosting differ from AdaBoost?

A
  • AdaBoost uses decision stumps - decision trees that make a single cut. Gradient Boosting uses full trees.
  • Gradient boosting initializes a mean value (or distribution over classes)
  • Gradient boosting scales the trees identically (by the learning rate)
  • AdaBoost scales subsequent trees by their performance.
40
Q

How does gradient boosting combine trees (regression)?

A

The values in the leaves of each decision tree are the mean residuals of the data points sorted into that leaf. A new prediction is the initial guess (the mean target value) plus the sum of the trees' predicted residuals, each scaled down by the learning rate.
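
A hedged numpy/scikit-learn sketch of this combination rule (data, tree depth, and learning rate are toy choices):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.random.default_rng(0).uniform(-3, 3, (200, 1))
y = np.sin(X).ravel()

lr = 0.1
pred = np.full_like(y, y.mean())        # initial guess: the mean target value
for _ in range(100):
    tree = DecisionTreeRegressor(max_depth=2).fit(X, y - pred)  # fit the residuals
    pred += lr * tree.predict(X)        # every tree's contribution is scaled identically

print(np.mean((y - pred) ** 2))         # training MSE shrinks as trees are added
```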

41
Q

Does gradient boosting try to prevent the trees from being the same?

A

No. There’s no reason to, as each tree represents taking a small step determined by the learning rate. If the steps need to go “in the same direction”, that’s fine.

42
Q

How are feature importances computed for GBM?

A

In each tree, the splits made on each feature are found and the quality of those splits is averaged (either by gini impurity or information gain).

43
Q

What is Gini Impurity?

A

It's the probability of incorrectly classifying a randomly chosen sample if it were labeled at random according to the class distribution of the dataset: Gini = Sum_{c=1..C} p_c * (1 - p_c), where C is the number of classes indexed by c.
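
A direct translation of the formula:

```python
import numpy as np

def gini_impurity(labels):
    # sum over classes of p_c * (1 - p_c); 0 for a pure node
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(np.sum(p * (1 - p)))

print(gini_impurity([0, 0, 1, 1]))   # 0.5: maximally mixed two-class set
print(gini_impurity([0, 0, 0, 0]))   # 0.0: pure set
```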

44
Q

What is the Jaccard Index?

A

A similarity measure between two sets, defined as the cardinality of their intersection over the cardinality of their union: J(A, B) = |A ∩ B| / |A ∪ B|. The corresponding Jaccard distance is 1 - J.
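
As a one-liner on Python sets:

```python
def jaccard_index(a: set, b: set) -> float:
    return len(a & b) / len(a | b)      # |intersection| / |union|

print(jaccard_index({1, 2, 3}, {2, 3, 4}))   # 2 / 4 = 0.5; distance = 1 - 0.5
```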

45
Q

What is Cook's Distance?

A

Cook’s Distance is an estimate of the influence of a data point in linear regression. It takes into account both the leverage and residual of each observation. Cook’s Distance is a summary of how much a regression model changes when the ith observation is removed.

46
Q

What is RMSProp? How does it improve on AdaGrad?

A

AdaGrad divides each parameter's update by the l2 norm of all of that parameter's previous gradients. The intent is to promote learning in parameters that have received few updates, but in practice the ever-growing norm grinds training to a halt. RMSProp fixes this by replacing the sum with an exponentially decaying average - a convex combination of the new squared gradient and the historical average - so the denominator cannot grow without bound.
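
A hedged sketch contrasting the two accumulators for a single parameter (learning rate and decay are assumed values):

```python
import numpy as np

def adagrad_step(w, g, s, lr=0.01, eps=1e-8):
    s = s + g ** 2                       # sum of squared gradients only ever grows
    return w - lr * g / (np.sqrt(s) + eps), s

def rmsprop_step(w, g, s, lr=0.01, rho=0.9, eps=1e-8):
    s = rho * s + (1 - rho) * g ** 2     # convex combination keeps the norm bounded
    return w - lr * g / (np.sqrt(s) + eps), s
```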

47
Q

What is a Trie?

A

A trie is a tree data structure built for prefix lookup. It is commonly used to store a dictionary of English words for prefix search. It has a placeholder root node whose children represent characters; special marker nodes indicate the ends of words. For example, a path might be Root -> M -> A -> N -> Y.
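
A minimal dict-of-dicts sketch (the "$" end-of-word marker is an arbitrary convention):

```python
class Trie:
    def __init__(self):
        self.root = {}                       # placeholder root node

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})   # one child per character
        node["$"] = True                     # special marker: a word ends here

    def has_prefix(self, prefix):
        node = self.root
        for ch in prefix:
            if ch not in node:
                return False
            node = node[ch]
        return True

t = Trie()
t.insert("many")
print(t.has_prefix("man"), t.has_prefix("max"))   # True False
```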

48
Q

Explain Global Average (or Max) Pooling.

A

The idea behind Global Average Pooling is to create a feature map for each output class and average each map down to a single value that is used as the input to the softmax layer, avoiding a fully-connected layer between the convolutional features and the classifier.
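
In tensor terms, GAP is just a mean over the spatial dimensions (the shapes below are toy assumptions):

```python
import numpy as np

feature_maps = np.random.default_rng(0).normal(size=(8, 10, 7, 7))  # N, C, H, W
gap = feature_maps.mean(axis=(2, 3))                                # -> (N, C)
print(gap.shape)   # (8, 10): one value per feature map, fed directly to the softmax
```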

49
Q

What are the advantages of using Global Average Pooling?

A
  • More native to convolutional structure because it directly associates output categories with feature maps.
  • Allows interpretability, because the feature maps can be interpreted as confidence maps for a given output class
  • Prevents overfitting, as there are no learned parameters in the GAP layer.
  • Sums out spatial information, so it is more robust to translational changes in the input.
50
Q

Describe Batch Normalization.

A

Batch normalization is applied to a layer's output (usually before the nonlinearity), centering it and scaling it to unit variance across all samples in the batch. It maintains running (moving-average) estimates of the mean and variance for use at inference, and has learned parameters gamma and beta, so the post-normalization output H is transformed as gamma*H + beta instead of used directly. This allows the mean and spread to take any value instead of 0 and 1, and reparameterizes the mean as the learned beta instead of a function of everything else in the network, which can supposedly improve training dynamics.

51
Q

Describe Layer Normalization.

A

Layer Normalization normalizes the mean and variance of a hidden layer's output neurons (activations). Importantly, it works on a per-training-example basis: it normalizes over the C, H, W dimensions of an N, C, H, W tensor.

52
Q

Describe Instance Normalization.

A

Instance Normalization normalizes per training example and per channel, along the H, W dimensions of N, C, H, W.
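
A hedged numpy sketch of which axes each scheme reduces over for an N, C, H, W tensor (ignoring the learned gamma/beta and running statistics):

```python
import numpy as np

x = np.random.default_rng(0).normal(size=(4, 3, 8, 8))   # N, C, H, W

def normalize(x, axes):
    mu = x.mean(axis=axes, keepdims=True)
    sd = x.std(axis=axes, keepdims=True)
    return (x - mu) / sd

batch_norm    = normalize(x, (0, 2, 3))   # per channel, across the batch
layer_norm    = normalize(x, (1, 2, 3))   # per example, across C, H, W
instance_norm = normalize(x, (2, 3))      # per example and per channel, across H, W
```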

53
Q

What are depthwise convolutions?

A

Depthwise convolutions use separate filters for each input channel, instead of the same filters over all input channels.

54
Q

What is duck typing?

A

A style of typing in which an object's suitability is determined by the methods and attributes it actually provides rather than by its declared type, inspired by the phrase "if it walks like a duck and quacks like a duck, it's a duck". An example is using a JAX array in place of a NumPy array.

55
Q

What does ACID stand for in databases?

A

Atomicity (each transaction either completes entirely or has no effect), Consistency (the database is always in a correct state with respect to constraints, cascades, triggers, etc.), Isolation (parallel transactions leave the database in the state that would result had they been executed serially), and Durability (once a transaction has been committed, it is robust to failures like power outages).

56
Q

What is Equivariance?

A

A function F is equivariant to a transform T when F(T(x)) = T(F(x)) for all points x in the domain.

57
Q

What are two foundational formulations in Model Fairness?

A

Group and Individual Fairness. Informally, group fairness protects groups defined by a protected attribute (e.g. race, gender) by equalizing model performance across them, while Individual Fairness states that "similar individuals should be treated similarly".

58
Q

What is Demographic Parity in Model Fairness?

A

If Y = 1 is considered a “Positive” outcome, then demographic parity states that P(Y=1 | A = a) should be the same for all values of a. That is, the probability of a positive outcome should be independent of the protected attribute.

59
Q

What is “Equal Odds Fairness” in Model Fairness?

A

P(Y_hat = 1 | A = 0, Y = y) = P(Y_hat = 1 | A = 1, Y = y). This has the effect of equalizing true positive rates when the true label is 1 and false positive rates when the true label is 0. This aligns with the goal of building good classifiers, but enforces that error rates are the same across demographics, so it can penalize models that perform better on the majority group, for example.

60
Q

What is a metric in math?

A

A function giving the distance between two objects in a set, defined by three properties:

d(x,y) = 0 <=> x = y (identity of indiscernibles)
d(x,y) = d(y,x) (symmetry)
d(x,z) <= d(x,y) + d(y,z) (triangle inequality)

Non-negativity, d(x,y) >= 0, follows from these.

61
Q

What are some issues with Demographic Parity?

A

1) It doesn’t actually do anything to ensure fairness for individuals, just equalizes probabilities of positive outcomes.
2) In the event that the true labels Y are correlated with the protected attribute, the ideal predictor is not admissible under DP, as it is unfair, so there will be significant loss of utility.

62
Q

What is Equality of Opportunity?

A

A relaxation of Equal Odds. Instead of requiring P(Y_hat = 1 | A = 0, Y = y) = P(Y_hat = 1 | A = 1, Y = y), equalizing true positive rates when the true label is 1 and false positive rates when the true label is 0, we only constrain the positive, or advantaged, class:
P(Y_hat = 1 | A = 0, Y = 1) = P(Y_hat = 1 | A = 1, Y = 1).

63
Q

What is the class-conditional model of label uncertainty?

A

The assumption that label uncertainty does not depend on the data itself: the probability of mislabeling depends only on the true label value and the mislabeled value. For example, leopard and jaguar are likely to have label mistakes between them; leopard and bathtub are not.

64
Q

What is the Bessel Correction in Statistics?

A

The use of n - 1 instead of n in the denominator when computing the sample variance, which makes it an unbiased estimator of the population variance.
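
numpy exposes the correction through the ddof argument:

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(x.var(ddof=0))   # divides by n: biased sample variance
print(x.var(ddof=1))   # divides by n - 1: Bessel-corrected, unbiased estimator
```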

65
Q

What is the difference between Generative and Discriminative models?

A

Generative models model the full joint distribution P(y, x), then compute P(y | x) using Bayes' rule and pick the most likely class. Discriminative classifiers model the posterior P(y | x) directly, or learn a map directly from x to the class labels.

66
Q

What is the Isolation Forest Method? What is it used for?

A

Isolation forests exist for anomaly detection. As anomalies are "few and different", they are easy to isolate with trees that cut up the feature space. This method creates an ensemble of trees that isolate all the points in the dataset; the anomaly score can be given by s(x,n) = 2^[-E(h(x))/c(n)], where E(h(x)) is the point's average path length over the isolation trees and c(n) is a normalizing constant: the average path length for a dataset of n points.
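
A hedged usage sketch with scikit-learn's implementation (the planted anomaly is a toy assumption; -1 marks predicted anomalies):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 2)), [[8.0, 8.0]]])   # one planted anomaly

iso = IsolationForest(random_state=0).fit(X)
print(iso.predict([[8.0, 8.0], [0.0, 0.0]]))    # e.g. [-1, 1]
print(iso.score_samples([[8.0, 8.0]]))          # lower score = more anomalous
```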

67
Q

State the No Free Lunch theorem (in words).

A

For every learner, there exists a task on which it fails, even though that task can be successfully learned by another learner. The proof leans on the idea that the training set size must be less than half the size of the feature space (X), which is essentially always true in practice. Importantly, the statement of the theorem also includes that the original learner reaches 0 empirical risk on its task - it fails to generalize, not to fit the training data.

68
Q

What does end-to-end learning mean in a deep learning context?

A

It is a loosely-defined term referring to a model that maps directly from the raw input X all the way to the final output y, with no hand-engineered intermediate stages.

69
Q

Explain probability calibration and the reliability plot

A

Probability calibration is the process of aligning a model's predicted probabilities with observed data frequencies. The reliability chart plots predicted probability against the observed fraction of positives in each bin (i.e., of the samples that received around a 40% predicted probability of being positive, do roughly 40% have positive labels?). A calibration curve is one way of post-hoc adjusting predicted probabilities to improve calibration.
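
A hedged sketch of computing the reliability curve's points with scikit-learn (data and model are arbitrary):

```python
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

probs = LogisticRegression().fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
frac_pos, mean_pred = calibration_curve(y_te, probs, n_bins=10)
# Perfect calibration puts every (mean_pred, frac_pos) pair on the diagonal.
print(list(zip(mean_pred.round(2), frac_pos.round(2))))
```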

70
Q

What does a sigmoid-shaped calibration reliability curve communicate? An inverse-sigmoid?

A

A sigmoid-shaped curve represents an under-confident classifier. An inverse sigmoid (logit) shaped curve represents an over-confident classifier.

71
Q

What is the purpose of the logit / sigmoid functions?

A

Logit takes values in (0, 1) and maps them to (-inf, +inf). Sigmoid takes values in (-inf, +inf) and maps them to (0, 1); each is the inverse of the other.

72
Q

What is a Metaclass?

A

A class whose instances are classes.
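
A minimal sketch (the registry pattern is just one illustrative use):

```python
registry = {}

class RegisteredMeta(type):
    def __new__(mcls, name, bases, namespace):
        cls = super().__new__(mcls, name, bases, namespace)
        registry[name] = cls                 # record every class built by this metaclass
        return cls

class Widget(metaclass=RegisteredMeta):
    pass

print(registry)        # {'Widget': <class '__main__.Widget'>}
print(type(Widget))    # RegisteredMeta: the class Widget is itself an instance of it
```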