Machine Learning Fundamentals Flashcards
What is the difference between classical momentum and Nesterov Momentum in an Optimizer?
Classical momentum makes an update that is the sum of two terms: the decayed running average of past gradients (the velocity) and the gradient computed at the current parameters. Nesterov momentum instead evaluates the gradient after the momentum contribution has already moved the parameters: why decide which way to go while standing at point A, walk 100 feet east to point B, and only then act on the decision made at A?
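A minimal NumPy sketch of the two updates (the function and variable names, e.g. `grad_fn`, are hypothetical):

```python
import numpy as np

def classical_momentum_step(params, grad_fn, velocity, lr=0.01, mu=0.9):
    # Gradient is evaluated at the current parameters (point A),
    # then combined with the decayed velocity.
    v = mu * velocity - lr * grad_fn(params)
    return params + v, v

def nesterov_momentum_step(params, grad_fn, velocity, lr=0.01, mu=0.9):
    # Gradient is evaluated at the "look-ahead" point (point B),
    # i.e. after the momentum contribution has moved the parameters.
    lookahead = params + mu * velocity
    v = mu * velocity - lr * grad_fn(lookahead)
    return params + v, v
```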
Describe a Bayesian Network.
Bayesian networks are DAGs whose nodes represent random variables and whose edges encode conditional dependencies (a node is conditionally independent of its non-descendants given its parents). Each node can be thought of as a probability function that takes its parents' values as inputs and outputs a probability or distribution. Importantly, they can be used to derive the probability of a cause given an effect.
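A minimal sketch of the classic Rain / Sprinkler / GrassWet network (the CPT values are illustrative assumptions), showing how a Bayesian network yields the probability of a cause given an effect by enumeration:

```python
P_rain = {True: 0.2, False: 0.8}
P_sprinkler = {True: {True: 0.01, False: 0.99},    # P(Sprinkler | Rain)
               False: {True: 0.4, False: 0.6}}
P_wet = {(True, True): 0.99, (True, False): 0.9,   # P(Wet=True | Sprinkler, Rain)
         (False, True): 0.8, (False, False): 0.0}

def joint(rain, sprinkler, wet):
    # The joint factorizes over the DAG: P(R, S, W) = P(R) P(S | R) P(W | S, R)
    p = P_rain[rain] * P_sprinkler[rain][sprinkler]
    p_wet_true = P_wet[(sprinkler, rain)]
    return p * (p_wet_true if wet else 1 - p_wet_true)

# P(Rain=True | GrassWet=True): sum the joint over the unobserved variable.
num = sum(joint(True, s, True) for s in (True, False))
den = sum(joint(r, s, True) for r in (True, False) for s in (True, False))
print(num / den)  # probability of the cause (rain) given the effect (wet grass)
```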
Explain Boosting vs Bagging.
Both are methods that use ensembles of weak learners to create a strong learner. Bagging uses bootstrapped samples to create the individual learners, and averages their predictions (or uses voting or a similar procedure). Boosting trains a new learner on the mistakes of the previous one, and uses a weighted combination of the learner outputs to generate the final prediction.
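A minimal scikit-learn sketch on synthetic data contrasting the two ensemble styles (the hyperparameter choices are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier

X, y = make_classification(n_samples=500, random_state=0)

# Bagging: each weak learner sees a bootstrap sample; predictions are averaged/voted.
bagging = BaggingClassifier(n_estimators=50, random_state=0).fit(X, y)

# Boosting: each new learner focuses on the examples the previous ones got wrong.
boosting = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)

print(bagging.score(X, y), boosting.score(X, y))
```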
What are the assumptions that are made when using a linear model?
There are four:
1) The true relationship between the independent and dependent variables is linear.
2) Error is Gaussian.
3) Noise is homoskedastic.
4) Data samples are independent.
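A minimal NumPy/SciPy sketch (synthetic data) of quick diagnostics for assumptions (2) and (3) above:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=1.0, size=200)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta

# (2) A rough normality check on the residuals.
print(stats.shapiro(residuals))
# (3) A large |correlation| between fitted values and |residuals| hints at heteroskedasticity.
print(np.corrcoef(X @ beta, np.abs(residuals))[0, 1])
```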
Describe what is meant by an over- or under-determined system.
These terms refer to the shape of the (data points x features) data matrix. In modern ML, we are almost always working in high-dimensional systems that have either a huge number of data points relative to the number of features (e.g., internet advertising transaction data) or a huge number of features relative to the number of data points (e.g., biology, DNA). The former is overdetermined (no exact solution exists, so we fit by minimizing error), the latter underdetermined (infinitely many exact solutions).
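A minimal NumPy sketch of both regimes:

```python
import numpy as np

rng = np.random.default_rng(0)

# Overdetermined: many rows (samples), few columns (features); no exact solution,
# so lstsq returns the least-squares fit.
A_over = rng.normal(size=(100, 3))
b_over = rng.normal(size=100)
x_over, *_ = np.linalg.lstsq(A_over, b_over, rcond=None)

# Underdetermined: few rows, many columns; infinitely many exact solutions,
# and lstsq returns the minimum-norm one.
A_under = rng.normal(size=(3, 100))
b_under = rng.normal(size=3)
x_under, *_ = np.linalg.lstsq(A_under, b_under, rcond=None)

print(x_over.shape, x_under.shape)
```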
On model fitting: to what extent does the choice of model depend on how much data you have?
Ideally, these are independent. The model choice should be based on the characteristics of the system, not a function of the number of data points that you have. Even if a linear model performs well on 50 data points from a highly complex system, there is little reason to expect it to generalize.
Name a few reasons a dataset might have outliers.
- Data entry errors
- Measurement errors (Instrument Errors)
- Intentional Errors (to test error detection systems)
- Data processing errors (unintended mutations, maybe buggy preprocessing code)
- Sampling errors (data from an unintended distribution made it into the dataset)
- Natural (not an error, just a rare / novel data point)
What are some types of outliers?
- Point outliers: data points far from the distribution
- Contextual outliers: for example, an exclamation point in the middle of a word in NLP. Data that might belong in the domain, but not where it’s found.
- Collective outliers: subsets of outliers that may themselves be sending some signal that the underlying system is poorly understood or that the model is poorly formulated.
What are some methods to detect outliers?
- Z scores
- DBSCAN (Density-Based Spatial Clustering of Applications with Noise): splits points into core points, border points, and outliers based on a neighborhood radius around each point (a hyperparameter) and the number of neighbors (minPts) that must be in a point's neighborhood for it to count as a core point. A point with no other points in its neighborhood is an outlier. Roughly O(N log N) with a spatial index, good for medium-sized datasets.
- Isolation Forest. Builds trees from random splits between a feature's min and max values; outliers are isolated by fewer splits, so a short average path length marks a point as anomalous.
- Robust Random Cut Forest (Amazon)
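A minimal NumPy / scikit-learn sketch (synthetic data) of two of the methods above:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(200, 2)), [[8.0, 8.0]]])  # one obvious point outlier

# Z-scores: flag points more than 3 standard deviations from the mean on any feature.
z = np.abs((X - X.mean(axis=0)) / X.std(axis=0))
print(np.where((z > 3).any(axis=1))[0])

# Isolation Forest: anomalies are isolated by fewer random splits and get label -1.
labels = IsolationForest(random_state=0).fit_predict(X)
print(np.where(labels == -1)[0])
```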
Can you explain Big O, Big Omega, Big Theta?
- Big O (upper bound): an algorithm is O(F(N)) if there exist a constant c and an integer n0 such that for all problem sizes N greater than n0, the runtime is at most c·F(N).
- Big Omega (lower bound): for all N greater than n0, the runtime is at least c·F(N).
- Big Theta (tight bound): the runtime is both O(F(N)) and Omega(F(N)).
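Stated formally (writing T(N) for the algorithm's runtime):

```latex
T(N) = O(F(N))      \iff \exists\, c > 0,\ n_0 : \forall N > n_0,\ T(N) \le c\,F(N)
T(N) = \Omega(F(N)) \iff \exists\, c > 0,\ n_0 : \forall N > n_0,\ T(N) \ge c\,F(N)
T(N) = \Theta(F(N)) \iff T(N) = O(F(N)) \ \text{and} \ T(N) = \Omega(F(N))
```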
Can you explain the difference between Generative and Discriminative models?
- A generative model models the full joint distribution of the data, P(X, Y), while a discriminative model models the conditional probability of the target given the data, P(Y | X).
What is Bayes risk?
The Bayes risk is the lowest risk (expected value of the loss function) achievable by any measurable function f with respect to a given data-generating process. The Bayes classifier is the model that achieves that risk.
What are advantages of Decision Trees?
- Simple to understand and interpret
- Can handle numerical and categorical data
- Require little data preprocessing
- White-box
- Possible to validate using statistical tests.
- Non-parametric: makes no assumptions about training data or residuals
- Performs well on large datasets
- Mirrors human decision-making
- Robust against collinearity
- Built-in feature selection / importance
- Can approximate any boolean function
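A minimal scikit-learn sketch (iris data) illustrating the white-box and feature-importance points above:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

print(export_text(tree))          # the learned rules are directly readable
print(tree.feature_importances_)  # built-in feature importance
```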
What are disadvantages of Decision Trees?
- Can be non-robust to changes in the training data. A small change in the training data can result in a large change in the tree
- Optimal decision tree learning is NP-complete, so practical algorithms (e.g., CART, ID3) rely on greedy heuristics.
What are nuisance variables?
A nuisance variable is a random variable that must be included to model a system properly but is of no interest in itself (it may be an intermediate or unobserved variable).
Give the time complexity (average and worst cases are the same in each case) of the following list operations in Python:
Copy, Append, Pop last, Pop intermediate, insert, get item, set item, delete item, iterate list, get slice, delete slice, set slice, extend, sort, multiply, x in s, min, max, length.
Copy O(N); Append O(1); Pop last O(1); Pop intermediate O(N); Insert O(N); Get item O(1); Set item O(1); Delete item O(N); Iterate list O(N); Get slice O(k); Delete slice O(N); Set slice O(k + N); Extend O(k); Sort O(N log N); Multiply O(N * k); x in s O(N); min O(N); max O(N); Length O(1)
Give the time complexity (average and worst cases are the same in each case) of the following collections.deque operations in Python:
Copy, Append, AppendLeft, Pop, PopLeft, Extend, ExtendLeft, Rotate, Remove.
Copy O(N), Append O(1), AppendLeft O(1), Pop O(1), PopLeft O(1), Extend O(k), ExtendLeft O(k), Rotate O(k), Remove O(N).
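A minimal sketch of why the difference matters in practice: popping from the left of a list is O(N) per pop (every element shifts), while deque.popleft() is O(1).

```python
from collections import deque
from timeit import timeit

def drain_list(n=30_000):
    xs = list(range(n))
    while xs:
        xs.pop(0)        # O(N) per pop: every remaining element shifts left

def drain_deque(n=30_000):
    xs = deque(range(n))
    while xs:
        xs.popleft()     # O(1) per pop

print(timeit(drain_list, number=1), timeit(drain_deque, number=1))
```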
Give the time complexity of the following set operations in Python:
x in S, Union S|T, Intersection S & T, Multiple Intersection S1&S2…SN, Difference S\T, Symmetric Difference
(average-case / worst-case)
x in S (O(1) / O(N)); Union S|T O(len(S) + len(T)); Intersection S & T (O(min(len(S), len(T))) / O(len(S) * len(T))); Multiple Intersection S1&S2…SN (n-1) * O(max(len(Si))); Difference S\T O(len(S)); Symmetric Difference (O(len(S)) / O(len(S) * len(T)))
Give the time complexity of the following dict operations in Python:
k in d; Copy; Get Item; Set Item; Delete Item; Iteration
(Avg. Case / Amortized Worst Case)
k in d (O(1) / O(N)); Copy O(N); Get Item (O(1) / O(N)); Set Item (O(1) / O(N)); Delete Item (O(1) / O(N)); Iteration O(N)
Describe Generalized Linear Models.
A generalized linear model is a linear model with an additional component, a link function g, that relates the expected value of the target, mu = E[y | X], to the linear predictor: g(mu) = XB. The model is then mu = g^-1(XB), with the target assumed to follow an exponential-family distribution.
What is an exponential response (or log-linear) model? What is the link function involved in modeling them?
This is a generalized linear model in which the logarithm of the expected response varies linearly with the inputs. The link function is the log: log(mu) = XB, so mu = exp(XB). (If instead the response is assumed to follow an exponential distribution, the canonical link is the negative inverse, XB = -1/mu.)
What is the link function that specifies a logistic regression model?
The link function is the log-odds (logit): g(p) = log(p / (1 - p)). The inverse (mean) function that specifies the model is then p = 1 / (1 + exp(-XB)).
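A minimal NumPy sketch of the logit link and its inverse (the logistic/mean function):

```python
import numpy as np

def logit(p):
    return np.log(p / (1 - p))        # link: maps a probability to the linear predictor XB

def sigmoid(xb):
    return 1.0 / (1.0 + np.exp(-xb))  # mean function: maps XB back to a probability

p = 0.8
print(sigmoid(logit(p)))  # recovers 0.8
```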
Describe Rabin-Karp Substring Search.
RKSS uses a rolling hash to find occurrences of a length-k pattern in a larger string in expected linear time: instead of comparing strings index-by-index at every offset, it computes a single hash for each length-k window (updated in O(1) as the window slides) and only does a direct character comparison when the hashes match.
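A minimal sketch with a polynomial rolling hash (the base and modulus values are illustrative choices):

```python
def rabin_karp(text, pattern, base=256, mod=1_000_000_007):
    n, k = len(text), len(pattern)
    if k == 0 or k > n:
        return []
    high = pow(base, k - 1, mod)          # weight of the outgoing character
    p_hash = w_hash = 0
    for i in range(k):
        p_hash = (p_hash * base + ord(pattern[i])) % mod
        w_hash = (w_hash * base + ord(text[i])) % mod
    matches = []
    for i in range(n - k + 1):
        # On a hash match, verify character-by-character to rule out collisions.
        if w_hash == p_hash and text[i:i + k] == pattern:
            matches.append(i)
        if i < n - k:
            # Roll the hash: drop text[i], append text[i + k].
            w_hash = ((w_hash - ord(text[i]) * high) * base + ord(text[i + k])) % mod
    return matches

print(rabin_karp("abracadabra", "abra"))  # [0, 7]
```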
What are some models of missing data?
- MCAR (missing completely at random): values are missing with no dependence on the data at all.
- MAR (missing at random): the probability of a value being missing depends only on observed variables, i.e. missingness is random within subgroups of data points (for example, data collected from managers may be more likely to be missing than data from other employees).
- MNAR (missing not at random): the probability of a value being missing depends on the value itself (e.g., people embarrassed about their information don't report it).
What are some simple strategies for handling missing data?
- Drop rows (data points) that contain missing values
- Drop features (columns) with many missing values
- Let the algorithm handle it (sometimes there is information in which values are missing). XGBoost has options for this.
What are some more advanced options for handling missing data?
Imputation and interpolation: replace missing values with the mean, median, or most common value (mode), or interpolate from neighboring values.
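A minimal pandas / scikit-learn sketch of the options above (the toy DataFrame is illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"a": [1.0, np.nan, 3.0, 4.0], "b": [10.0, 20.0, np.nan, 40.0]})

print(df.fillna(df.median()))   # median imputation
print(df.interpolate())         # linear interpolation along each column
print(SimpleImputer(strategy="most_frequent").fit_transform(df))  # mode imputation
```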
What is hot deck imputation?
Imputing missing values by identifying a set of other data points that are similar in other features and randomly selecting an imputed value from among the values in the similar set.
Regressed Imputation?
Regress the missing variable against the other variables, then use the regression prediction as the imputed value. The stochastic version adds a random residual component.
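A minimal scikit-learn sketch on synthetic data of regression imputation for one column, with the stochastic residual variant:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_other = rng.normal(size=(100, 3))                 # fully observed features
target = X_other @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=100)
missing = rng.random(100) < 0.2                     # mask: ~20% of the column is missing

reg = LinearRegression().fit(X_other[~missing], target[~missing])
imputed = reg.predict(X_other[missing])             # deterministic regression imputation

# Stochastic version: add a randomly sampled observed residual to each imputed value.
resid = target[~missing] - reg.predict(X_other[~missing])
stochastic = imputed + rng.choice(resid, size=missing.sum())
print(imputed[:3], stochastic[:3])
```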