General Flashcards

1
Q

Define bias

A

The bias is an error from erroneous assumptions in the learning algorithm. High bias can cause an algorithm to miss the relevant relations between features and target outputs (underfitting)

2
Q

Define variance

A

The variance is an error from sensitivity to small fluctuations in the training set. High variance can cause an algorithm to model the random noise in the training data, rather than the intended outputs (overfitting).

3
Q

Define bias-variance tradeoff

A

It is the compromise of choosing a model that both accurately captures the regularities in its training data and generalises well to unseen data. High-variance learning methods represent their training set well but overfit to noisy or unrepresentative training data. In contrast, algorithms with high bias typically produce simpler models that don’t tend to overfit but may underfit their training data, failing to capture important regularities.

Models with low bias are usually more complex (e.g. higher-order regression polynomials), enabling them to represent the training set more accurately. In the process, however, they may also represent a large noise component in the training set, making their predictions less accurate - despite their added complexity. In contrast, models with higher bias tend to be relatively simple (low-order or even linear regression polynomials) but may produce lower variance predictions when applied beyond the training set.
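A minimal numpy sketch of this (all data made up): fit polynomials of increasing degree to noisy samples of a sine curve and compare train and test error.

    import numpy as np

    rng = np.random.default_rng(0)
    x_train = np.sort(rng.uniform(0, 1, 20))
    y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, 20)
    x_test = np.sort(rng.uniform(0, 1, 200))
    y_test = np.sin(2 * np.pi * x_test) + rng.normal(0, 0.2, 200)

    for degree in (1, 3, 9):
        coeffs = np.polyfit(x_train, y_train, degree)  # least-squares polynomial fit
        train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
        test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
        print(degree, round(train_mse, 3), round(test_mse, 3))
    # degree 1 underfits (high bias): both errors are high.
    # degree 9 overfits (high variance): tiny train error, larger test error.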

4
Q

How to overcome overfitting

A
  • Reduce the model complexity (fewer features)
  • Regularisation (features contribute less; see the sketch below)
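A minimal sketch of the regularisation route, assuming the ridge (L2) closed form B = (X^T X + lambda I)^{-1} X^T y and made-up data: increasing lambda shrinks the weights, reducing sensitivity to noise.

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(50, 5))
    y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(0, 0.5, 50)

    def ridge(X, y, lam):
        n_features = X.shape[1]
        return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

    print(ridge(X, y, lam=0.0))   # plain OLS
    print(ridge(X, y, lam=10.0))  # coefficients shrink towards zero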

5
Q

What is a vector norm

A

A way of measuring the length of a vector

6
Q

Give examples of vector norms

A
  • L1
  • L2

7
Q

Define length of L2 norm ||B||_2

A

||B||_2 = √(B_0^2 + B_1^2)

8
Q

Define length of L1 norm ||B||_1

A

||B||_1 = |B_0| + |B_1|
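Both definitions in numpy, just to make them concrete:

    import numpy as np

    B = np.array([3.0, -4.0])
    print(np.linalg.norm(B, ord=2))   # sqrt(3^2 + 4^2) = 5.0
    print(np.linalg.norm(B, ord=1))   # |3| + |-4|      = 7.0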

9
Q

Sketch ||B||_2 = 2 and ||B||_1 = 2

A

See https://en.wikipedia.org/wiki/File:L1_and_L2_balls.jpg (the L2 ball is a circle and the L1 ball a diamond, each crossing the axes at 2)

10
Q

Describe Ordinary Least Squares

A

OLS chooses the parameters of a linear function of a set of explanatory variables by minimizing the sum of the squares of the differences between the observed dependent variable (values of the variable being predicted) in the given dataset and those predicted by the linear function. Geometrically, this is seen as the sum of the squared distances, parallel to the axis of the dependent variable, between each data point in the set and the corresponding point on the regression line
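A minimal numpy sketch (made-up data): solve the normal equations B = (X^T X)^{-1} X^T y and check against numpy's least-squares routine.

    import numpy as np

    rng = np.random.default_rng(2)
    x = rng.uniform(0, 10, 30)
    X = np.column_stack([np.ones_like(x), x])        # intercept + slope
    y = 2.0 + 0.7 * x + rng.normal(0, 1.0, 30)

    beta_normal = np.linalg.solve(X.T @ X, X.T @ y)  # normal equations
    beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(beta_normal, beta_lstsq)                   # both ~ [2.0, 0.7]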

11
Q

What is iid?

A

A sequence or other collection of random variables is independent and identically distributed (iid) if each random variable has the same probability distribution as the others and all are mutually independent.
The assumption is important in the classical form of the central limit theorem, which states that the probability distribution of the sum (or average) of IID variables with finite variance approaches a normal distribution.

12
Q

What is the problem with highly correlated explanatory variables in OLS?

A

The coefficient estimates have very high variance between different samples, so feature weights can become abnormally large (see the sketch below)
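An illustrative sketch (made-up data): two nearly identical columns produce large, unstable, nearly cancelling weights across repeated fits.

    import numpy as np

    rng = np.random.default_rng(3)
    x = rng.normal(size=100)
    for trial in range(3):
        X = np.column_stack([x, x + rng.normal(0, 1e-3, 100)])  # correlation ~ 1
        y = x + rng.normal(0, 0.1, 100)           # the true weight on the sum is 1
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        print(beta)  # e.g. [ 7.3, -6.3]: huge, unstable, nearly cancelling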

13
Q

What is C in Ridge Regression (L2)?

A

C is the radius of the circle (the L2 ball) within which the coefficients must lie,

where you define ||B||^2_2 <= C^2, i.e. ||B||_2 <= C

14
Q

What is the main difference in outcome between using L1 and L2 space for regularisation?

A

Given the L1 diamond shape as opposed to the L2 circle, you’re more likely to hit a corner which zeros coefficients.

15
Q

Which regularisation gives a sparse response?

A

L1, as it zeros some coefficients
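A quick illustrative comparison, assuming scikit-learn is installed; the data and alpha values are arbitrary.

    import numpy as np
    from sklearn.linear_model import Lasso, Ridge

    rng = np.random.default_rng(4)
    X = rng.normal(size=(100, 10))
    y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.5, 100)  # only 2 features matter

    print(Lasso(alpha=0.1).fit(X, y).coef_)  # most coefficients exactly 0.0
    print(Ridge(alpha=0.1).fit(X, y).coef_)  # small but non-zero everywhere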

16
Q

What is a generative model?

A

A generative model describes how data is generated, in terms of a probabilistic model.

In the scenario of supervised learning, a generative model estimates the joint probability distribution of data P(X, Y) between the observed data X and corresponding labels Y
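A tiny discrete illustration (the count table is made up): estimate the joint P(X, Y) from counts, then recover the marginal P(Y) and the posterior P(Y|X) from it.

    import numpy as np

    # rows: feature value x in {0, 1}; columns: label y in {0, 1}
    counts = np.array([[30, 10],
                       [5, 55]])
    joint = counts / counts.sum()                           # P(X, Y)
    p_y = joint.sum(axis=0)                                 # marginal P(Y)
    p_y_given_x = joint / joint.sum(axis=1, keepdims=True)  # P(Y|X), one row per x
    print(joint)
    print(p_y)
    print(p_y_given_x)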

17
Q

Give examples of generative models

A
  • Naive Bayes
  • Hidden Markov Models
  • Latent Dirichlet Allocation
  • Boltzmann Machines
18
Q

Why would you choose a discriminative model?

A

Because you don’t have enough data to estimate the density f reliably, so the variance would be massive.
A generative model must estimate the full joint:
p(x, y) = f(x|y) p(y)

19
Q

Generative versus discriminative, discuss

A
A discriminative model gives the probability of a class given an observation, P(C|x); a generative model gives the probability of an observation given a class, P(x|C). Given data, a generative model models the whole distribution, whereas a discriminative model models only the decision boundary.
https://www.youtube.com/watch?v=OWJ8xVGRyFA
20
Q

Pros and cons of discriminative model

A

Pros: easy to fit and needs fewer observations.
Cons: can classify, but cannot generate the data/observations back.

21
Q

Pros and cons of generative model

A

Pros: you get the underlying idea of what the classifier is built on.
Cons: very expensive (lots of parameters) and needs lots of data.

22
Q

Define SVM

A

A non-probabilistic binary linear classifier that separates the categories by a clear gap that is as wide as possible, using a hyperplane or set of hyperplanes defined so that the distance between the hyperplane and the nearest point x_i from either group is maximised

23
Q

How do SVM perform non-linear classification?

A

the kernel trick

24
Q

Describe the kernel trick

A

The idea is that data that isn’t linearly separable in n-dimensional space may be linearly separable in a higher-dimensional space. But, because of Lagrangian duality, we need not compute the exact transformation of our data; we just need the inner product of our data in that higher-dimensional space.
https://towardsdatascience.com/understanding-the-kernel-trick-e0bc6112ef78
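A small numeric check of the trick for the polynomial kernel K(x, z) = (x · z)^2, whose explicit feature map in 2-d is phi(x) = (x1^2, sqrt(2) x1 x2, x2^2); the vectors are made up.

    import numpy as np

    def phi(x):
        # explicit map into the higher-dimensional space
        return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

    x = np.array([1.0, 2.0])
    z = np.array([3.0, -1.0])
    print(np.dot(x, z) ** 2)       # kernel: (1*3 + 2*(-1))^2 = 1.0
    print(np.dot(phi(x), phi(z)))  # same value, via the explicit map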

25
Q

Advantages and disadvantages of SVM

A

Advantages

  • it has a regularisation parameter, which makes the user think about avoiding over-fitting.
  • it uses the kernel trick, so you can build in expert knowledge about the problem by engineering the kernel, under which the data becomes linearly separable.
  • an SVM is defined by a convex optimisation problem (no local minima) for which there are efficient methods (e.g. SMO).
  • it is an approximation to a bound on the test error rate, and there is a substantial body of theory behind it which suggests it should be a good idea.

Disadvantages

  • the regularisation and kernel parameters (and the choice of kernel itself) must be determined, typically by costly model selection. In a way the SVM moves the problem of over-fitting from optimising the parameters to model selection
  • not great with multiclass
26
Q

Describe Gradient boosting

A
  • Have data
  • Fit simple single-layer decision tree regressor (simple step function - one transition point)
  • plot out error residuals from first fit
  • fit single-layer decision tree regressor two to error residuals
  • combine models one and two for marginally more complex fit (two transition points) for model three
  • plot out error residuals from fit three
  • fit single-layer decision tree regressor four to error residuals from fit three
  • combine
  • etc. (a code sketch follows below)
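A minimal sketch of that loop, assuming scikit-learn is installed; depth-1 trees (stumps), squared-error residuals, and no learning rate, for simplicity.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(5)
    X = np.sort(rng.uniform(0, 10, 100)).reshape(-1, 1)
    y = np.sin(X.ravel()) + rng.normal(0, 0.1, 100)

    prediction = np.zeros_like(y)
    stumps = []
    for _ in range(50):
        residuals = y - prediction                 # errors of the combined model so far
        stump = DecisionTreeRegressor(max_depth=1).fit(X, residuals)
        stumps.append(stump)
        prediction += stump.predict(X)             # combine into a more complex fit

    print(np.mean((y - prediction) ** 2))          # training MSE shrinks with each stump
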
27
Q

Relative pros and cons random forest versus gradient boosting

A

Random forests train each tree independently, using a random sample of the data. This randomness helps to make the model more robust than a single decision tree, and less likely to overfit the training data. There are typically two parameters in RF: the number of trees and the number of features to select at each node.
GBTs build trees one at a time, where each new tree helps to correct errors made by the previously trained trees. With each tree added, the model becomes more expressive. There are typically three parameters: the number of trees, the depth of trees, and the learning rate; each tree built is generally shallow.
GBT training generally takes longer because the trees are built sequentially. Benchmark results have shown GBTs to be better learners than random forests, but GBTs are prone to overfitting if not handled carefully.

28
Q

Bayes’ theorem

A

P(A|B)= P(B|A)P(A) / P(B)
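A worked numeric example (all numbers made up): a disease test with a 1% prior, 99% sensitivity, and a 5% false-positive rate.

    p_a = 0.01              # P(A): prior probability of disease
    p_b_given_a = 0.99      # P(B|A): positive test given disease
    p_b_given_not_a = 0.05  # P(B|~A): false positive rate

    p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)  # total probability
    p_a_given_b = p_b_given_a * p_a / p_b                  # Bayes' theorem
    print(p_a_given_b)      # ~0.167: a positive test still leaves the disease unlikely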

29
Q

Conditional probability of P(B) (in terms of P(A))

A

P(B) = P(B|A)P(A) + P(B|¬A)P(¬A)

30
Q

What is a prior?

A

Probability before you run a test

31
Q

What is posterior?

A

It is the probability of the outcome given the prior and the evidence from the test.

32
Q

What’s the difference between a prob density function and a prob mass function?

A

Density is for continuous distributions, mass is for discrete

33
Q

Is the Dirichlet distribution discrete or continuous?

A

Continuous

35
Q

Describe the process of calculating the ROC AUC

A

The ROC curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various classification thresholds. The larger the area under the ROC curve, the better the classifier is at separating the classes.
https://www.youtube.com/watch?v=OAl6eAyP-yo
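A minimal rank-based sketch (scores and labels made up, ties ignored): the AUC equals the probability that a randomly chosen positive scores higher than a randomly chosen negative.

    import numpy as np

    labels = np.array([0, 0, 1, 0, 1, 1, 0, 1])
    scores = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.9, 0.2, 0.6])

    pos = scores[labels == 1]
    neg = scores[labels == 0]
    auc = np.mean(pos[:, None] > neg[None, :])  # fraction of correctly ranked pairs
    print(auc)                                  # 0.75 here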

36
Q

Why is ROC AUC useful for imbalanced classes?

A

As long as the ordering of observations by predicted probability remains the same, the ROC curve and AUC are identical regardless of the scale of the scores. Since all the AUC cares about is how well you’ve separated your classes, it is only sensitive to rank ordering, and is therefore robust to imbalanced classes.

37
Q

What’s Gauss–Markov theorem?

A

The Gauss–Markov theorem states that in a linear regression model in which the errors have expectation zero, are uncorrelated, and have equal variances, the best linear unbiased estimator (BLUE) of the coefficients is given by the ordinary least squares (OLS) estimator, provided it exists.

38
Q

Cost function

A

Describes how well a response surface fits the available data

39
Q

What is the major downside to a least squares cost function?

A

It’s dominated by outliers

40
Q

Describe a cost function where some regularisation exists

A

cost function = some data reconstruction error + regularisation penalty

41
Q

Define autocorrelation

A

Autocorrelation, also known as serial correlation, is the correlation of a signal with a delayed copy of itself as a function of delay. Informally, it is the similarity between observations as a function of the time lag between them.

42
Q

Explain Cohen’s Kappa

A

The Kappa statistic (or value) is a metric that compares an Observed Accuracy with an Expected Accuracy (random chance)
k = (p_o - p_e) / (1 - p_e)
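A small worked example of the formula (the confusion counts are made up): two raters labelling 100 items.

    import numpy as np

    confusion = np.array([[45, 5],    # rows: rater 1's labels
                          [10, 40]])  # columns: rater 2's labels
    n = confusion.sum()
    p_o = np.trace(confusion) / n                          # observed agreement
    p_e = (confusion.sum(0) / n) @ (confusion.sum(1) / n)  # agreement expected by chance
    kappa = (p_o - p_e) / (1 - p_e)
    print(kappa)                                           # 0.7 for these counts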

43
Q

Explain Mutual information

A

A measure of the mutual dependence between two variables; it determines how similar the joint distribution p(X,Y) is to the product of the factored marginal distributions p(X)p(Y).
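The definition as a small numpy computation (the joint table is made up): compare the joint against the product of the marginals.

    import numpy as np

    joint = np.array([[0.4, 0.1],
                      [0.1, 0.4]])                # p(X, Y)
    px = joint.sum(axis=1, keepdims=True)          # p(X)
    py = joint.sum(axis=0, keepdims=True)          # p(Y)
    mi = np.sum(joint * np.log2(joint / (px * py)))
    print(mi)  # ~0.28 bits; it would be 0 if X and Y were independent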

44
Q

Explain entropy (information theory)

A

The average amount of information produced by a stochastic source of data. The measure of information entropy associated with each possible data value is the negative logarithm of the probability mass function for the value. Thus, when the data source has a lower-probability value (i.e., when a low-probability event occurs), the event carries more “information” (“surprisal”) than when the source data has a higher-probability value. The amount of information conveyed by each event defined in this way becomes a random variable whose expected value is the information entropy.
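A worked version of the definition, H = -Σ p log2 p in bits (the distributions are made up):

    import numpy as np

    def entropy(p):
        p = np.asarray(p, dtype=float)
        p = p[p > 0]                   # treat 0 * log(0) as 0
        return -np.sum(p * np.log2(p))

    print(entropy([0.5, 0.5]))         # 1.0 bit: a fair coin
    print(entropy([0.99, 0.01]))       # ~0.08 bits: a biased coin, little surprise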

45
Q

Online learning setting

A

a method of machine learning in which data becomes available in a sequential order and is used to update our best predictor for future data at each step

46
Q

Why would you use stochastic gradient descent over batch gradient descent?

A

Batch uses every data point to calculate each step, so is computationally expensive

48
Q

Downside of user-user collaborative filtering

A
  • systems performed poorly when they had many items but comparatively few ratings (sparsity)
  • computing similarities between all pairs of users was expensive
  • user profiles changed quickly and the entire system model had to be recomputed
49
Q

limitations of pca

A
  • Scaling features arbitrarily (for example converting height in feet to inches) skews the PCs considerably
  • Depending on task, the strongly predictive information may lie in directions of small variance, which gets removed by PCA
  • Assumes underlying Gaussian distribution of data
50
Q

Bhattacharyya distance

A

measures the similarity of two discrete or continuous probability distributions. It is closely related to the Bhattacharyya coefficient which is a measure of the amount of overlap between two statistical samples or populations.

51
Q

Mean Absolute Error

A

MAE measures the average magnitude of the errors in a set of predictions, without considering their direction. It’s the average over the test sample of the absolute differences between prediction and actual observation where all individual differences have equal weight. (L1 loss)

52
Q

Root Mean Square Error

A

RMSE is the square root of the average of squared differences between prediction and actual observation.

53
Q

Difference (in outcome) between RMSE and MAE

A

Since the errors are squared before they are averaged, the RMSE gives a relatively high weight to large errors. This means the RMSE should be more useful when large errors are particularly undesirable.
RMSE does not necessarily increase with the variance of the errors. RMSE increases with the variance of the frequency distribution of error magnitudes.
RMSE has a tendency to be increasingly larger than MAE as the test sample size increases.
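An illustrative check (made-up errors): two error sets with the same MAE, where RMSE flags the one containing a single large error.

    import numpy as np

    def mae(e):  return np.mean(np.abs(e))
    def rmse(e): return np.sqrt(np.mean(e ** 2))

    even = np.array([2.0, 2.0, 2.0, 2.0])   # errors spread evenly
    spiky = np.array([0.0, 0.0, 0.0, 8.0])  # same total error, one outlier
    print(mae(even), rmse(even))            # 2.0, 2.0
    print(mae(spiky), rmse(spiky))          # 2.0, 4.0: RMSE flags the outlier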

54
Q

Why is least squares so commonly used for regression?

A

When solving the statistical linear regression problem, a very common modeling assumption is that for every possible value of “x”, the quantity “y” is normally distributed with a mean that is linear in “x”. Therefore, the likelihood function is essentially a product of PDFs of the normal distribution. As stated above, you estimate the unknown parameters (and therefore find the best fitting line) by maximizing the likelihood function. If you look at what the product of normal PDFs looks like, you will notice that maximizing this expression happens to be equivalent to minimizing the sum of squared errors.

55
Q

R-squared (regression metric)

A

the percentage of the response variable variation that is explained by a linear model.
R-squared = Explained variation / Total variation

R-squared is always between 0 and 100%:

0% indicates that the model explains none of the variability of the response data around its mean.
100% indicates that the model explains all the variability of the response data around its mean.
In general, the higher the R-squared, the better the model fits your data.

56
Q

User-User collaborative filtering

A

S(u, i) = rhat_u + SUM((r_vi - rhat_v) · w_uv) / SUM(w_uv)

S -> predicted score
u -> user
v -> a different user
i -> item
r -> rating (normalised from user v's range to user u's)
w -> weighting describing the similarity between users u & v; commonly Pearson's correlation or cosine similarity
hat -> average
Note: here SUM is over (v ∈ U). A worked example follows below.
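A direct transcription of the formula with toy numbers (all made up): predict user u's score for item i from two neighbours.

    import numpy as np

    r_hat_u = 3.5                    # user u's average rating
    r_vi = np.array([4.0, 2.0])      # each neighbour v's rating of item i
    r_hat_v = np.array([3.0, 2.5])   # each neighbour's average rating
    w_uv = np.array([0.9, 0.4])      # similarity of each neighbour to u

    s_ui = r_hat_u + np.sum((r_vi - r_hat_v) * w_uv) / np.sum(w_uv)
    print(s_ui)                      # 3.5 + 0.7 / 1.3 ~ 4.04
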
57
Q

Common modifications to generic user-user

A

v ≠ u, obviously
Limit the size of the neighbourhood (top n neighbours)
Limit the minimum similarity between people, i.e. SUM is over (v ∈ V)

58
Q

Benefits of item-item CF

A
  • works quite well
  • efficient implementation in the common case where you have many more users than items
  • the relatively static relationships between items lend themselves to precomputation
59
Q

Assumptions/limitations of item-item CF

A

item-item relationships need to be stable (most issues are temporal)
resulting recommendations have lower serendipity

60
Q

Item-Item collaborative filtering

A

S(u, i) = SUM(w_ij · r_uj) / SUM(|w_ij|)

S -> predicted score
u -> user
i -> item
j -> another item
r -> user u's rating of item j
w -> weighting describing the similarity between items i & j; commonly Pearson's correlation or cosine similarity
Note: here SUM is over (j ∈ N)
N -> neighbourhood (items similar to i)
61
Q

Gradient boosting parameters; learning_rate

A

a weight between 0 & 1 that controls the contribution of each tree
For example, if the current prediction for a particular example is 0.2 and the next tree predicts that it should actually be 0.8, the correction would be +0.6. At a learning rate of 1, the updated prediction would be the full 0.2+1(0.6)=0.8, while a learning rate of 0.1 would update the prediction to be 0.2+0.1(0.6)=0.26.

62
Q

Gradient boosting parameters; max_tree_depth

A

The maximum number of edges from the node to the tree’s root node

63
Q

What is an embedding?

A

An embedding is a mapping from discrete objects, such as words, to vectors of real numbers.

64
Q

The fundamental theorem of linear programming

A

If the max/min exists for the linear programming problem, it occurs at a vertex of the feasible region

65
Q

Steps to solving a Linear Programming Problem

A
  • Define the variables
  • Write the objective function and state whether the goal is to minimise or maximise the function
  • Write the constraints (which gives you the system of inequalities)
  • Graph the constraints to determine the feasible region (the solution to the system of inequalities)
  • Identify the vertices of the feasible region
  • Test the vertices in the objective function to determine the maximum or minimum function value (a code sketch follows below).
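The steps above in code for a made-up problem, assuming scipy is installed: maximise 3x + 2y subject to x + y <= 4, x + 3y <= 6, x, y >= 0 (linprog minimises, so the objective is negated).

    from scipy.optimize import linprog

    result = linprog(c=[-3, -2],                # negated: maximise 3x + 2y
                     A_ub=[[1, 1], [1, 3]],     # x + y <= 4, x + 3y <= 6
                     b_ub=[4, 6],
                     bounds=[(0, None), (0, None)])
    print(result.x, -result.fun)                # optimum at the vertex (4, 0), value 12
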
66
Q

A downside of binning and one hot encoding continuous values

A

You lose the sense of order, e.g. if you bin values 1-10 into 1-5 and 6-10, you lose the concept that the values 1-5 are smaller than 6-10.

67
Q

Grep for string

A

grep -rn "string" ~/location/

68
Q

Curl

A

Curl is “a command line tool for getting or sending files using URL syntax.”

69
Q

Wget

A

GNU Wget is a computer program that retrieves content from web servers. It is part of the GNU Project

70
Q

Adam optimization algorithm

A

Adam is an optimization algorithm that can be used instead of the classical stochastic gradient descent procedure to update network weights iteratively based on the training data. Stochastic gradient descent maintains a single learning rate (termed alpha) for all weight updates, and the learning rate does not change during training.
Adam adapts per-parameter learning rates using the average of the second moments of the gradients (the uncentered variance), as in RMSProp, and additionally makes use of the average of the first moments (the mean).

Specifically, the algorithm calculates an exponential moving average of the gradient and the squared gradient, and the parameters beta1 and beta2 control the decay rates of these moving averages.

Because the moving averages are initialised at zero and beta1 and beta2 are close to 1.0 (as recommended), the moment estimates are biased towards zero. This bias is overcome by first calculating the biased estimates and then calculating bias-corrected estimates.

71
Q

Adaptive Gradient Algorithm (AdaGrad)

A

AdaGrad maintains a per-parameter learning rate, which improves performance on problems with sparse gradients (e.g. natural language and computer vision problems).

72
Q

Root Mean Square Propagation (RMSProp)

A

RMSProp also maintains per-parameter learning rates, adapted based on the average of recent magnitudes of the gradients for each weight (i.e. how quickly it is changing). This means the algorithm does well on online and non-stationary problems (e.g. noisy ones).

73
Q

Likelihood function

A

In frequentist inference, a likelihood function (often simply the likelihood) is a function of the parameters of a statistical model, given specific observed data. Likelihood functions play a key role in frequentist inference, especially methods of estimating a parameter from a set of statistics. Probability in this mathematical context describes the plausibility of a random outcome, given a model parameter value, without reference to any observed data. Likelihood describes the plausibility of a model parameter value, given specific observed data.

Let X be a discrete random variable with probability mass function p depending on a parameter θ. Then the function

L(θ|x) = p_θ(x) = P_θ(X = x),

considered as a function of θ, is the likelihood function (of θ), given the outcome x of the random variable X. Sometimes the probability of “the value x of X for the parameter value θ” is written as P(X = x | θ); it is often written as P(X = x; θ), to emphasise that it is not a conditional probability.

74
Q

Why choose a recurrent neural network?

A

A recurrent neural network (RNN) is a class of artificial neural network where connections between nodes form a directed graph along a sequence. This allows it to exhibit temporal dynamic behavior for a time sequence. Unlike feedforward neural networks, RNNs can use their internal state (memory) to process sequences of inputs.

75
Q

Boltzmann machine

A

A Boltzmann machine is a network of symmetrically connected, neuron-like units that make stochastic decisions about whether to be on or off. Boltzmann machines have a simple learning algorithm (Hinton & Sejnowski, 1983) that allows them to discover interesting features that represent complex regularities in the training data. The learning algorithm is very slow in networks with many layers of feature detectors, but it is fast in “restricted Boltzmann machines” that have a single layer of feature detectors. Many hidden layers can be learned efficiently by composing restricted Boltzmann machines, using the feature activations of one as the training data for the next.
Boltzmann machines are used to solve two quite different computational problems. For a search problem, the weights on the connections are fixed and are used to represent a cost function. The stochastic dynamics of a Boltzmann machine then allow it to sample binary state vectors that have low values of the cost function.
For a learning problem, the Boltzmann machine is shown a set of binary data vectors and it must learn to generate these vectors with high probability. To do this, it must find weights on the connections so that, relative to other possible binary vectors, the data vectors have low values of the cost function. To solve a learning problem, Boltzmann machines make many small updates to their weights, and each update requires them to solve many different search problems.

76
Q

CNN

A

In machine learning, a convolutional neural network (CNN, or ConvNet) is a class of deep, feed-forward artificial neural networks, most commonly applied to analyzing visual imagery.

CNNs use a variation of multilayer perceptrons designed to require minimal preprocessing.
Each convolutional neuron processes data only for its receptive field. Although fully connected feedforward neural networks can be used to learn features as well as classify data, it is not practical to apply this architecture to images. A very high number of neurons would be necessary, even in a shallow (opposite of deep) architecture, due to the very large input sizes associated with images, where each pixel is a relevant variable. For instance, a fully connected layer for a (small) image of size 100 x 100 has 10000 weights for each neuron in the second layer. The convolution operation brings a solution to this problem as it reduces the number of free parameters, allowing the network to be deeper with fewer parameters.

77
Q

Multi layer perceptron

A

a class of feedforward artificial neural network. An MLP consists of, at least, three layers of nodes: an input layer, a hidden layer and an output layer. Except for the input nodes, each node is a neuron that uses a nonlinear activation function. MLP utilizes a supervised learning technique called backpropagation for training. Its multiple layers and non-linear activation distinguish MLP from a linear perceptron. It can distinguish data that is not linearly separable.

78
Q

Backpropagation

A

a method used in artificial neural networks to calculate a gradient that is needed in the calculation of the weights to be used in the network. It is commonly used to train deep neural networks, a term referring to neural networks with more than one hidden layer.

Backpropagation is a special case of a more general technique called automatic differentiation. In the context of learning, backpropagation is commonly used by the gradient descent optimization algorithm to adjust the weight of neurons by calculating the gradient of the loss function. This technique is also sometimes called backward propagation of errors, because the error is calculated at the output and distributed back through the network layers.

Backpropagation requires the derivative of the loss function with respect to the network output to be known, which typically (but not necessarily) means that a desired target value is known. For this reason it is considered to be a supervised learning method, although it is used in some unsupervised networks such as autoencoders. Backpropagation is also a generalization of the delta rule to multi-layered feedforward networks, made possible by using the chain rule to iteratively compute gradients for each layer. It is closely related to the Gauss–Newton algorithm, and is part of continuing research in neural backpropagation. Backpropagation can be used with any gradient-based optimizer, such as L-BFGS or truncated Newton

79
Q

Automatic differentiation

A

In mathematics and computer algebra, automatic differentiation (AD), also called algorithmic differentiation or computational differentiation, is a set of techniques to numerically evaluate the derivative of a function specified by a computer program. AD exploits the fact that every computer program, no matter how complicated, executes a sequence of elementary arithmetic operations (addition, subtraction, multiplication, division, etc.) and elementary functions (exp, log, sin, cos, etc.). By applying the chain rule repeatedly to these operations, derivatives of arbitrary order can be computed automatically, accurately to working precision, and using at most a small constant factor more arithmetic operations than the original program.

80
Q

Expectation value

A

In probability theory, the expected value of a random variable, intuitively, is the long-run average value of repetitions of the experiment it represents. For example, the expected value in rolling a six-sided die is 3.5, because the average of all the numbers that come up in an extremely large number of rolls is close to 3.5. Less roughly, the law of large numbers states that the arithmetic mean of the values almost surely converges to the expected value as the number of repetitions approaches infinity.

More practically, the expected value of a discrete random variable is the probability-weighted average of all possible values. In other words, each possible value the random variable can assume is multiplied by its probability of occurring, and the resulting products are summed to produce the expected value. The same principle applies to an absolutely continuous random variable, except that an integral of the variable with respect to its probability density replaces the sum
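The die example above as a probability-weighted average:

    import numpy as np

    values = np.arange(1, 7)       # faces of the die
    probs = np.full(6, 1 / 6)      # each equally likely
    print(np.sum(values * probs))  # 3.5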

The expected value is a key aspect of how one characterizes a probability distribution; it is one type of location parameter. By contrast, the variance is a measure of dispersion of the possible values of the random variable around the expected value. The variance itself is defined in terms of two expectations: it is the expected value of the squared deviation of the variable’s value from the variable’s expected value.

81
Q

cross entropy

A

In information theory, the cross entropy between two probability distributions p and q over the same underlying set of events measures the average number of bits needed to identify an event drawn from the set, if a coding scheme is used that is optimized for an “unnatural” probability distribution q, rather than the “true” distribution p.
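A small numeric illustration (distributions made up): cross entropy is smallest, and equals the entropy of p, when q = p.

    import numpy as np

    p = np.array([0.5, 0.25, 0.25])   # the "true" distribution
    q = np.array([0.25, 0.5, 0.25])   # a mismatched coding distribution

    def cross_entropy(p, q):
        return -np.sum(p * np.log2(q))

    print(cross_entropy(p, p))        # 1.5 bits: the entropy of p
    print(cross_entropy(p, q))        # 1.75 bits: the cost of coding with the wrong q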

83
Q

Gaussian process

A

A distribution over functions: every finite collection of function values has a joint Gaussian distribution.

84
Q

Joint probability distribution

A

Given random variables X, Y, …, that are defined on a probability space, the joint probability distribution for X, Y, … is a probability distribution that gives the probability that each of X, Y, … falls in any particular range or discrete set of values specified for that variable. In the case of only two random variables, this is called a bivariate distribution, but the concept generalizes to any number of random variables, giving a multivariate distribution.

85
Q

Cumulative distribution function

A

In probability theory and statistics, the cumulative distribution function (CDF) of a real-valued random variable X, or just the distribution function of X, evaluated at x, is the probability that X will take a value less than or equal to x.

In the case of a continuous distribution, it gives the area under the probability density function from minus infinity to x. Cumulative distribution functions are also used to specify the distribution of multivariate random variables.

86
Q

Surrogate loss function

A

In the context of learning, say you have a classification problem with data set {(X1,Y1),…,(Xn,Yn)}, where Xn are your features and Yn are your true labels.

Given a hypothesis function h(x), the loss function l: (h(Xn), Yn) → ℝ takes your hypothesis function’s prediction, i.e. h(Xn), as well as the true label for that particular input, and returns a penalty. Now, a general goal is to find a hypothesis that minimizes the empirical risk (the chances of being wrong!):

R_l(h) = E_empirical[l(h(X), Y)] = (1/m) ∑_{i=1}^m l(h(X_i), Y_i).

In the case of binary classification, a common loss function that is used is the 0−1 loss function:

l(h(X), Y) = 0 if Y = h(X), and 1 otherwise.
In general the loss function that we care about cannot be optimized efficiently. For example, the 0−1 loss function is discontinuous. So, we consider another loss function that will make our life easier, which we call the surrogate loss function.

An example of a surrogate loss function could be ψ(h(x)) = max(1 − h(x), 0) (the hinge loss in SVMs), which is convex and easy to optimize using conventional methods. This function acts as a proxy for the actual loss we wanted to minimize in the first place. Obviously, it has its disadvantages, but in some cases a surrogate loss function actually results in being able to learn more. By this I mean that, once your classifier achieves optimal risk (i.e. highest accuracy), you can still see the loss decreasing, which means that it is trying to push the different classes even further apart to improve its robustness.
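A small numeric comparison of the 0−1 loss and the hinge surrogate at a few margins y·h(x) (values made up; labels assumed in {−1, +1}):

    import numpy as np

    margins = np.array([-1.5, -0.2, 0.3, 0.9, 2.0])  # y * h(x)
    zero_one = (margins <= 0).astype(float)          # 1 on a mistake, discontinuous at 0
    hinge = np.maximum(1 - margins, 0)               # convex upper bound on the 0-1 loss
    for m, z, h in zip(margins, zero_one, hinge):
        print(m, z, h)
    # note hinge is still positive at margin 0.9: the surrogate keeps pushing
    # correctly classified points further from the boundary.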

87
Q

GLM vs linear models; distribution of labels

A

Linear models: Y_i ~ N(mu_i, sigma^2)

GLMs: Y_i ~ exponential family

88
Q

GLM vs linear models; linear predictor, function of the covariates

A

Linear models: nu_i = alpha + beta x_i

GLMs, e.g.: nu_i = alpha + beta x_i + gamma x_i^2

89
Q

GLM vs linear models; link function, connection between linear predictor and mu_i = E(Y_i)

A

Linear models: nu_i = mu_i -> mu_i = alpha + beta x_i

GLMs, e.g.: nu_i = ln(mu_i) -> mu_i = e^{alpha + beta x_i + gamma x_i^2}

90
Q

Link function

A

provides the relationship between the linear predictor and the mean of the distribution function
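A minimal sketch of a log link with made-up numbers: the linear predictor nu_i ranges over the whole real line, while the mean mu_i of, say, a Poisson response must be positive, so the link nu_i = ln(mu_i) maps between them.

    import numpy as np

    alpha, beta = -1.0, 0.8
    x = np.array([0.0, 1.0, 2.0, 3.0])
    nu = alpha + beta * x   # linear predictor: may be negative
    mu = np.exp(nu)         # inverse of the log link: always a valid (positive) mean
    print(nu)
    print(mu)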