General Flashcards
Define bias
The bias is an error from erroneous assumptions in the learning algorithm. High bias can cause an algorithm to miss the relevant relations between features and target outputs (underfitting).
Define variance
The variance is an error from sensitivity to small fluctuations in the training set. High variance can cause an algorithm to model the random noise in the training data, rather than the intended outputs (overfitting).
Define bias-variance tradeoff
It is the compromise in choosing a model that both accurately captures the regularities in its training data and also generalises well to unseen data. High-variance learning methods represent their training set well but overfit to noisy or unrepresentative training data. In contrast, algorithms with high bias typically produce simpler models that don’t tend to overfit but may underfit their training data, failing to capture important regularities.
Models with low bias are usually more complex (e.g. higher-order regression polynomials), enabling them to represent the training set more accurately. In the process, however, they may also represent a large noise component in the training set, making their predictions less accurate - despite their added complexity. In contrast, models with higher bias tend to be relatively simple (low-order or even linear regression polynomials) but may produce lower variance predictions when applied beyond the training set.
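A minimal sketch of this with scikit-learn (the sine target, noise level, and polynomial degrees are illustrative assumptions, not from the cards): a degree-1 fit underfits (high bias), while a degree-15 fit chases the noise (high variance).
```python
# Bias-variance sketch: training fit improves with polynomial degree,
# but the high-degree model is fitting noise, not signal.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 30))
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 30)  # noisy sine
X = x.reshape(-1, 1)

for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X, y)
    print(degree, round(model.score(X, y), 3))  # training R^2 rises with complexity
```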
How to overcome overfitting
- Reduce the model complexity (fewer features)
- Regularization (features contribute less)
What is a vector norm
A way of measuring the length of a vector
Give examples of vector norms
- L1
- L2
Define length of L2 norm ||B||_2
√(B_0^2 + B_1^2)
Define length of L1 norm ||B||_1
|B_0|+|B_1|
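Both definitions can be checked with numpy (the vector [3, 4] is just an example):
```python
# L2 and L1 norms; the ord argument selects which norm to compute.
import numpy as np

B = np.array([3.0, 4.0])
print(np.linalg.norm(B, ord=2))  # sqrt(3^2 + 4^2) = 5.0
print(np.linalg.norm(B, ord=1))  # |3| + |4| = 7.0
```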
Sketch ||B||_2 = 2 and ||B||_1 = 2
https://en.wikipedia.org/wiki/File:L1_and_L2_balls.jpg (the L2 circle and the L1 diamond both cross the axes at 2)
Describe Ordinary Least Squares
OLS chooses the parameters of a linear function of a set of explanatory variables by minimizing the sum of the squares of the differences between the observed dependent variable (values of the variable being predicted) in the given dataset and those predicted by the linear function. Geometrically, this is seen as the sum of the squared distances, parallel to the axis of the dependent variable, between each data point in the set and the corresponding point on the regression line
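A minimal numpy sketch of OLS via the normal equations, beta = (X^T X)^{-1} X^T y (the design matrix and true coefficients are made up for illustration):
```python
# OLS via the normal equations; np.linalg.lstsq does the same more stably.
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=100)])  # intercept + one feature
y = X @ np.array([2.0, -1.5]) + rng.normal(0, 0.1, 100)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)  # minimises sum of squared residuals
print(beta_hat)  # close to [2.0, -1.5]
```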
What is iid?
A sequence or other collection of random variables is independent and identically distributed if each random variable has the same probability distribution as the others and all are mutually independent.
The assumption is important in the classical form of the central limit theorem, which states that the probability distribution of the sum (or average) of IID variables with finite variance approaches a normal distribution.
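A quick numpy illustration of that CLT statement (the sample sizes are arbitrary): means of iid uniform draws concentrate around the true mean with approximately normal spread.
```python
# CLT sketch: 10,000 means of 50 iid uniform draws each.
import numpy as np

rng = np.random.default_rng(0)
means = rng.uniform(0, 1, size=(10000, 50)).mean(axis=1)
print(means.mean(), means.std())  # ~0.5 and ~1/sqrt(12*50) ≈ 0.041
```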
What is the problem with highly correlated explanatory variables in OLS?
The coefficient estimates have very high variance between different samples (multicollinearity), so feature weights can become abnormally big
What is C in Ridge Regression (L2)?
C is the radius of the CIRCLE (the L2 ball) in parameter space:
the coefficients are constrained by ||B||^2_2 <= C^2, i.e. ||B||_2 <= C
What is the main difference in outcome between using L1 and L2 space for regularisation?
Given the L1 diamond shape as opposed to the L2 circle, you’re more likely to hit a corner which zeros coefficients.
Which regularisation gives a sparse response?
L1, as it zeros some coefficients
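A scikit-learn sketch of that contrast (the data, the near-duplicate feature, and the alpha values are illustrative assumptions):
```python
# On correlated features, L1 (Lasso) zeros coefficients while
# L2 (Ridge) only shrinks them towards zero.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
X[:, 1] = X[:, 0] + rng.normal(0, 0.01, 200)  # nearly duplicate feature
y = 3 * X[:, 0] + rng.normal(0, 0.5, 200)

print(Lasso(alpha=0.1).fit(X, y).coef_)  # many exact zeros
print(Ridge(alpha=0.1).fit(X, y).coef_)  # small but non-zero
```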
What is a generative model?
A generative model describes how data is generated, in terms of a probabilistic model.
In the scenario of supervised learning, a generative model estimates the joint probability distribution P(X, Y) of the observed data X and the corresponding labels Y.
Give examples of generative models
- Naive Bayes
- Hidden Markov Models
- Latent Dirichlet Allocation
- Boltzmann Machines
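Taking Naive Bayes from the list above, scikit-learn’s GaussianNB exposes the two generative ingredients it estimates (the iris dataset is just a convenient example):
```python
# Naive Bayes as a generative model: it estimates p(x|y) (per-class
# Gaussians) and p(y) (class priors), then classifies via Bayes' rule.
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
clf = GaussianNB().fit(X, y)
print(clf.class_prior_)  # estimated p(y)
print(clf.theta_[0])     # per-feature means of p(x|y=0)
```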
Why would you choose a discriminative model?
Because you don’t have enough data to reliably estimate the class-conditional density f(x|y), so the variance of a generative model would be massive.
Generative factorisation
p(x, y) = f(x|y)p(y), where f(x|y) is the class-conditional density and p(y) is the class prior
Generative versus discriminative, discuss
Discriminative is probability of class given observation P(C|x), generative is probability of observation given class P(x|C). For generative, given data, you model whole distribution. For discriminative, given data, you model decision boundary. https://www.youtube.com/watch?v=OWJ8xVGRyFA
Pros and cons of discriminative model
Pros: simpler to fit and needs fewer observations
Cons: can classify, but cannot generate the data/observations back
Pros and cons of generative model
Pros: get the underlying idea of what the classifier is built on
Cons: Very expensive - lots of parameters
Need lots of data
Define SVM
A non-probabilistic binary linear classifier that separates the categories with a hyperplane (or set of hyperplanes) giving a clear gap that is as wide as possible, i.e. chosen so that the distance between the hyperplane and the nearest point x_i from either group is maximised
How do SVM perform non-linear classification?
Via the kernel trick
Describe the kernel trick
The idea is that data that isn’t linearly separable in n-dimensional space may be linearly separable in a higher-dimensional space. And because the SVM’s Lagrangian dual depends on the data only through inner products, we need not compute the exact transformation of our data; we just need the inner product of our data in that higher-dimensional space (the kernel).
https://towardsdatascience.com/understanding-the-kernel-trick-e0bc6112ef78
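A minimal scikit-learn illustration (the concentric-circles dataset and kernel choices are assumptions for the demo): a linear SVM fails on circles, while an RBF-kernel SVM separates them without ever computing the high-dimensional mapping explicitly.
```python
# Kernel trick sketch: concentric circles are not linearly separable in 2-D.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)
print(SVC(kernel="linear").fit(X, y).score(X, y))  # poor
print(SVC(kernel="rbf").fit(X, y).score(X, y))     # near-perfect
```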
Advantages and disadvantages of SVM
Advantages
- it has a regularisation parameter, which makes the user think about avoiding over-fitting.
- it uses the kernel trick, so you can build in expert knowledge about the problem by engineering a kernel in whose feature space the data becomes linearly separable.
- an SVM is defined by a convex optimisation problem (no local minima) for which there are efficient methods (e.g. SMO).
- maximising the margin approximately minimises a bound on the test error rate, and there is a substantial body of theory behind it which suggests it should be a good idea.
Disadvantages
- you still have to determine the regularisation parameter, the kernel, and the kernel’s parameters (usually by cross-validation). In a way the SVM moves the problem of over-fitting from optimising the parameters to model selection
- not great with multiclass
Describe Gradient boosting
- Have data
- Fit a simple depth-one decision tree regressor, i.e. a stump (a step function with one transition point)
- plot out error residuals from first fit
- fit single-layer decision tree regressor two to error residuals
- combine models one and two for marginally more complex fit (two transition points) for model three
- plot out error residuals from fit three
- fit single-layer decision tree regressor four to error residuals from fit three
- combine
- etc. (a minimal sketch of this loop follows below)
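A sketch of the loop above using scikit-learn stumps (the sine target, 20 rounds, and 0.5 learning rate are arbitrary choices for the demo):
```python
# Minimal boosting loop: each depth-one tree (stump) is fit to the
# residuals of the ensemble built so far.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 10, 200)).reshape(-1, 1)
y = np.sin(X.ravel()) + rng.normal(0, 0.2, 200)

prediction = np.zeros_like(y)
learning_rate = 0.5
for _ in range(20):
    residuals = y - prediction
    stump = DecisionTreeRegressor(max_depth=1).fit(X, residuals)
    prediction += learning_rate * stump.predict(X)  # add the correction

print(np.mean((y - prediction) ** 2))  # MSE shrinks as stumps accumulate
```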
Relative pros and cons random forest versus gradient boosting
Random Forests train each tree independently, using a random sample of the data. This randomness helps to make the model more robust than a single decision tree, and less likely to overfit on the training data. There are typically two parameters in RF: the number of trees and the number of features to be selected at each node.
GBTs build trees one at a time, where each new tree helps to correct errors made by the previously trained trees. With each tree added, the model becomes more expressive. There are typically three parameters: the number of trees, the depth of the trees, and the learning rate; each tree built is generally shallow.
GBT training generally takes longer because the trees are built sequentially. Benchmark results have often shown GBTs to be better learners than Random Forests, but GBTs are prone to overfitting if not handled carefully.
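Those hyperparameters as scikit-learn exposes them (the values are illustrative, not recommendations):
```python
# The parameters named above: two for RF, three for GBT.
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

rf = RandomForestRegressor(n_estimators=200, max_features="sqrt")  # trees, features per split
gbt = GradientBoostingRegressor(n_estimators=200, max_depth=3, learning_rate=0.1)
```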
Bayes’ theorem
P(A|B)= P(B|A)P(A) / P(B)
Marginal probability P(B) via the law of total probability (in terms of P(A))
P(B) = P(B|A)P(A) + P(B|¬A)P(¬A)
What is a prior?
Probability before you run a test
What is posterior?
It is the probability of the outcome given the prior and the evidence from the test.
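A worked posterior combining the last few cards (all the rates here are made-up numbers for illustration):
```python
# A test with 99% sensitivity and a 5% false-positive rate, for a
# condition with a 1% prior (hypothetical numbers).
prior = 0.01                  # P(A)
p_pos_given_a = 0.99          # P(B|A), sensitivity
p_pos_given_not_a = 0.05      # P(B|not A), false-positive rate

p_pos = p_pos_given_a * prior + p_pos_given_not_a * (1 - prior)  # total probability
posterior = p_pos_given_a * prior / p_pos                        # Bayes' theorem
print(posterior)  # ≈ 0.167: a positive test still leaves only ~17% probability
```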
What’s the difference between a prob density function and a prob mass function?
Density is for continuous distributions, mass is for discrete
Is the Dirichlet distribution discrete or continuous?
Continuous
Describe the process of calculating the ROC AUC
The ROC curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various classification thresholds. The larger the area under the ROC curve, the better the classifier is at separating the classes.
https://www.youtube.com/watch?v=OAl6eAyP-yo
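A minimal scikit-learn computation (the labels and scores are a toy example):
```python
# roc_curve sweeps the threshold to get (FPR, TPR) pairs; auc integrates.
from sklearn.metrics import roc_curve, auc

y_true = [0, 0, 1, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(auc(fpr, tpr))  # area under the ROC curve
```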