ML Flashcards
Unsupervised Learning and how is the model trained
Only input data is provided; the model learns to extract patterns from the data
Supervised Learning and how is the model trained
Each input is paired with an output (target) and the model is trained to minimise the error between its predictions and the targets
Regression
Predicting a continuous (real-valued) output from the inputs, e.g. fitting a line in the linear case
Classification
Predicting category labels, e.g. dog or bagel
Underfitting
A model that is too simple to capture the structure of the data
Overfitting
A model that fits minor variations or noise in the training data rather than the underlying trend
Model Selection and how does it work
Selecting the most appropriate model for the task
Split the dataset into training and validation (and test) sets and compare candidate models on the validation set
Training dataset
Used to train/optimise the model
Validation dataset
Used to compare candidate models/hyperparameters and select the best one
Test dataset
Used for a final, unbiased assessment of how well the chosen model generalises
Cross-validation
Split the data into S groups; train on S-1 groups (so (S-1)/S of the data is used for training) and validate on the held-out group, rotating so each group serves as the validation set once, then average the S scores
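A minimal numpy sketch of S-fold splitting (the helper name cross_val_splits and the toy sizes are illustrative):

```python
import numpy as np

def cross_val_splits(n, S, seed=0):
    """Yield (train_idx, val_idx) pairs for S-fold cross-validation:
    each fold validates on 1/S of the data and trains on the other (S-1)/S."""
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, S)
    for s in range(S):
        val = folds[s]
        train = np.concatenate([folds[j] for j in range(S) if j != s])
        yield train, val

for train, val in cross_val_splits(10, 5):
    print(len(train), len(val))   # 8 train, 2 validation in each of the 5 folds
```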
No free lunch theorem
Averaged over all possible problems, no learning algorithm performs better than any other; there is no single best model for every task
Model parameters
Values learned from training data
Parametric model
# of parameters stays the same as the quantity of data increases
Non-parametric model
# of parameters grows with the quantity of data
Likelihood function
Probability of data given model parameters
Maximum likelihood estimation
Method for estimating the parameters of a probabilistic model by choosing the values that maximise the likelihood of the observed data
Linear regression model formula
y = w^T x + e, so that p(y|x,w) = N(y | w^T x, σ²)
What is the distribution of e in the linear regression model and what is the bias parameter in the linear regression model formula and what is it for
Gaussian distribution with mean 0 and variance σ²: e ~ N(0, σ²)
The bias lives within w: a dummy variable which always has value 1 is appended to the input, and its weight shifts the fit, giving extra flexibility to fit the data
Linear regression to a feature vector
y = w^T ϕ(x) + e, so that p(y|ϕ(x),w) = N(y | w^T ϕ(x), σ²)
Least-squares problem for linear regression formula
w = (X^T X)^-1 X^T y
Two key points about least-squares solutions
The solution has a closed-form
The solution is also the maximum likelihood solution
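A minimal numpy sketch of the closed-form solution on toy data (the true weights and noise level are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 50)
y = 2 * x + 1 + rng.normal(0, 0.1, 50)      # toy data: y = 2x + 1 + noise

X = np.column_stack([np.ones_like(x), x])   # design matrix with dummy 1s column

# Closed-form least-squares / maximum likelihood solution w = (X^T X)^-1 X^T y
w = np.linalg.solve(X.T @ X, X.T @ y)       # solve() is more stable than inv()
print(w)                                    # approximately [1, 2]
```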
What does the Linear Discriminant compute (formula) and what class is x assigned to according to y
y=w^T x
y>=0 means C1
Otherwise C2
What assumptions are made for parameters to be learnt by applying MLE?
- Data for each class have a Gaussian distribution
- These two Gaussian distributions have the same covariance matrix
Logistic sigmoid function
sigmoid(a) = 1/(1+e^-a)
Logistic regression model for two classes
p(C1|x) = sigmoid(w^T x)
p(C2|x) = 1 - p(C1|x)
How can the MLE parameters for logistic regression be found?
Using an iterative method such as gradient descent; there is no closed-form solution
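A minimal sketch of logistic regression trained by gradient descent on made-up separable data (the learning rate and iteration count are arbitrary choices):

```python
import numpy as np

def sigmoid(a):
    return 1 / (1 + np.exp(-a))

rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(0, 1, (100, 2)), np.ones(100)])  # dummy 1s for bias
t = (X[:, 0] + X[:, 1] > 0).astype(float)                        # labels in {0, 1}

w = np.zeros(3)
for _ in range(1000):
    p = sigmoid(X @ w)              # p(C1|x) for every point
    grad = X.T @ (p - t) / len(t)   # gradient of the negative log-likelihood
    w -= 0.1 * grad                 # gradient descent step
```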
Weights
Parameters that transform input data within each neuron
Activation function
Determines whether a neuron should be activated and how to transform the input signal into an output signal
Input layer and how many neurons
Consists of neurons that receive the input data and pass it to the next layer; # of neurons = # of features in the input dataset
Hidden Layer
Layer of neurons between input and output which processes input data using weights and activation functions
Output layer
Final layer that produces result/prediction
Cost/Loss function
Measures difference between predicted and true output
Forward pass process
1) Compute the pre-activations of the hidden layer, a
2) Pass the result of step 1 through a nonlinear function, e.g. sigmoid, to get the hidden activations h
3) Use h to compute the pre-activations of the output layer, o
4) Compute the predictions, e.g. as the sigmoid of o
Backpropagation / backward pass and formula and what do the symbols represent
Algorithm that computes the gradient of the loss function with respect to each weight by propagating error signals backwards through the layers, so that weights across multiple layers can be updated using the derivative of the error wrt each weight
formula is δj = h′(aj) Σk wkj δk
δj = error signal for the jth hidden unit
wkj = weight connecting hidden unit j to output unit k
h′(aj) = derivative of the activation function at the pre-activation aj
δk = error signal at the kth output unit
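A minimal numpy sketch of one forward and one backward pass through a single hidden layer, matching the forward-pass steps and the δ formula above (the layer sizes, sigmoid activations and squared-error loss are illustrative choices):

```python
import numpy as np

def sigmoid(a):
    return 1 / (1 + np.exp(-a))

rng = np.random.default_rng(0)
x = rng.normal(size=3)            # input
t = np.array([1.0])               # target
W1 = rng.normal(size=(4, 3))      # input -> hidden weights
W2 = rng.normal(size=(1, 4))      # hidden -> output weights

# Forward pass
a = W1 @ x                        # 1) hidden pre-activations
h = sigmoid(a)                    # 2) nonlinearity
o = W2 @ h                        # 3) output pre-activations
y = sigmoid(o)                    # 4) predictions

# Backward pass (squared-error loss)
delta_out = (y - t) * y * (1 - y)              # δk at the output unit
delta_hid = h * (1 - h) * (W2.T @ delta_out)   # δj = h'(aj) Σk wkj δk
grad_W2 = np.outer(delta_out, h)               # gradient of the loss wrt W2
grad_W1 = np.outer(delta_hid, x)               # gradient of the loss wrt W1
```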
Vanishing gradient problem
When gradients used to update the weights during backpropagation become too small which slows down/stops learning process
Exploding gradient problem
When gradients grow too large during backpropagation, causing weights to become large and degrading model performance
Gradient clipping and formula and what do the variables mean
Used to cap the magnitude of the gradient:
g′ = min(1, c/||g||) g
g = gradient vector, ||g|| = its norm, c = constant value to limit the size of the gradient
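A minimal sketch of the clipping formula (the function name and threshold are illustrative):

```python
import numpy as np

def clip_gradient(g, c):
    """g' = min(1, c/||g||) * g: rescales g so its norm never exceeds c."""
    norm = np.linalg.norm(g)
    return min(1.0, c / norm) * g if norm > 0 else g

g = np.array([3.0, 4.0])        # ||g|| = 5
print(clip_gradient(g, 1.0))    # rescaled to norm 1: [0.6 0.8]
```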
Residual network and formula
Each layer is a residual layer: an identity shortcut lets gradients flow directly from the layer's output back to its input, defined as:
F1′ = F1(x) + x, where F1(x) is the standard mapping (linear transformation and then activation)
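A minimal sketch of a residual layer (the ReLU activation and square weight matrix are assumptions so that F1(x) and x have matching shapes):

```python
import numpy as np

def residual_layer(x, W):
    """F1'(x) = F1(x) + x: the identity shortcut lets gradients flow past F1."""
    return np.maximum(0, W @ x) + x   # F1 = linear transformation, then activation

x = np.ones(4)
W = np.eye(4) * 0.1
print(residual_layer(x, W))           # F1(x) plus the shortcut copy of x
```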
Non-saturating activation function
Activation functions that don’t compress inputs into a limited range, avoiding vanishing gradients
Parameter initialisation significance
Necessary to choose good initial parameters to avoid issues with gradient descent
Early stopping
Process of creating a validation set and stopping training once the error on the validation set starts to increase
Weight decay formula
Adds a term λ w^T w to the loss function, where the user chooses a value λ > 0, to penalise large weights
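A minimal sketch of adding the penalty to a squared-error loss (the function name is illustrative):

```python
import numpy as np

def loss_with_weight_decay(w, X, y, lam):
    residual = X @ w - y
    return residual @ residual + lam * (w @ w)   # data term + lambda * w^T w
```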
Dropout and pro
Turning off all outgoing connections from a unit with some given probability during training, so that each unit learns to perform well without depending on other units
This can drastically reduce overfitting
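A minimal sketch of "inverted" dropout, a common variant that rescales the surviving units so no change is needed at test time:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(h, p_drop, training=True):
    if not training:
        return h                          # dropout is only applied during training
    mask = rng.random(h.shape) >= p_drop  # keep each unit with prob 1 - p_drop
    return h * mask / (1 - p_drop)        # rescale so the expected output is unchanged

h = np.ones(8)
print(dropout(h, 0.5))   # roughly half the units zeroed, survivors scaled by 2
```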
Classification tree
Decision tree where each internal node represents decision based on a feature and each leaf represents a class label
Regression tree
Decision tree where each internal node represents a decision based on a feature and each leaf node represents a predicted value
Are trees parametric or non-parametric?
Non-parametric
CART (Classification And Regression Trees) algorithm
Greedy algorithm for learning the tree structure:
1) Start from root node
2) Run an exhaustive search over each possible variable and threshold for a new node (see the sketch after this card):
For each variable and threshold:
a. Compute average of target variable for each leaf of the proposed node
b. Compute error if we stop adding nodes here
3) Choose variable and threshold that minimise the error
4) Add a new node for the chosen variable and threshold
5) Repeat from step 2 until each leaf node has at most n data points associated with it
6) Prune back the tree to remove branches that do not reduce the error by more than a small tolerance value, ε
Pruning:
1) Start with a tree T0
2) Consider pruning each node in T0 by combining branches to obtain tree T
3) Compute C(T) = (Sum from t=1 to |T|) et(T) + lambda |T| where
- |T| is # of leaves
- et(T) is error associated with t’th leaf of tree
- lambda is a penalty added for the # of leaves in the tree
4) If C(T) <= C(T0) keep the pruned tree, otherwise reinstate the pruned node
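A minimal sketch of step 2's exhaustive split search for a single variable in a regression tree (the toy data are made up; a real implementation loops over all variables too):

```python
import numpy as np

def best_split(x, y):
    """For each candidate threshold, predict each leaf's mean target value
    and measure the squared error; return the threshold minimising it."""
    best_threshold, best_err = None, np.inf
    for threshold in np.unique(x):
        left, right = y[x <= threshold], y[x > threshold]
        if len(left) == 0 or len(right) == 0:
            continue                      # skip splits that leave a leaf empty
        err = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if err < best_err:
            best_threshold, best_err = threshold, err
    return best_threshold, best_err

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([1.1, 0.9, 1.0, 5.0, 5.2, 4.8])
print(best_split(x, y))   # splits at x <= 3, separating the two groups
```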
Kernel function and example
Used to compute a dot product between two data points, e.g.
For two vectors x and x′, a possible kernel function is K(x,x′) = ϕ(x)·ϕ(x′), where ϕ is a mapping to a higher-dimensional space
Kernel trick
Predictions only require kernel function values, so we evaluate k(x1, x) directly without first computing ϕ(x1) and ϕ(x) and then taking their scalar product
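A small check of the trick for the quadratic kernel k(x, x′) = (x·x′)² in 2-D, whose explicit feature map ϕ is known (the example vectors are arbitrary):

```python
import numpy as np

def phi(x):
    """Explicit feature map for the quadratic kernel in 2-D."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def k(x, xp):
    """Kernel trick: the same dot product without ever computing phi."""
    return (x @ xp) ** 2

x, xp = np.array([1.0, 2.0]), np.array([3.0, 0.5])
print(phi(x) @ phi(xp))   # map to feature space, then dot product: 16.0
print(k(x, xp))           # kernel evaluated directly on the inputs: 16.0
```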
Dual parameter
Indicates how much each training sample contributes to the decision boundary
Gram matrix
Defined to be K = ΦΦ^T, where Φ is the design matrix of feature vectors, so Knm = ϕ(xn)^T ϕ(xm), the similarity between the nth and mth datapoints
Margin
Distance from decision boundary (hyperplane) to closest training datapoint
Support vectors
Data points that lie closest to decision boundary (margin)
Soft margin
Allows points to lie on wrong side of hyperplane in exchange for a penalty which is controlled by a regularisation parameter
Slack variables
Measure how far points that violate the margin (including misclassified points) lie from their margin boundary
How can SVMs be extended to more than two classes? (for k classes)
One-versus-one = train k(k-1)/2 SVM classifiers to distinguish between each pair of classes and take a vote for the predicted class
One-versus-the-rest = train k SVM classifiers to distinguish between each class and all other classes
Bayesian network
Graphical model that represents a set of random variables and their conditional dependencies via a DAG
What is needed to construct a Bayesian network to represent a given joint distribution?
1) The DAG (structure of the BN)
2) Conditional probability distributions p(xk|pak) (parameters of the BN)
Collider
A node is a collider on a path if both arrows point to it on the path
Blocked path
A path is blocked by S if at least one of the following holds:
1) There is a collider on the path that is not in S and none of its descendants are in S
2) There is a non-collider on the path that is in S
How can you represent the joint probability distribution for a Bayesian network?
If the network has nodes X1, X2, …, Xn the joint probability distribution is: P(X1, X2, …, Xn) = ∏ (i=1 to n) P(Xi | Parents(Xi))
What is plate notation
A plate is drawn around a subset of variables, and the number of repetitions is indicated on the plate.
How would you translate a ML model to a BN?
Identify the random variables in the model and their dependencies, draw an edge into each variable from the variables it directly depends on, and attach the corresponding conditional distributions.
What does d-separated mean? What is its significance?
If all paths from node x to node y are blocked given nodes S then x and y are d-separated by S. This means x and y are conditionally independent given S in the DAG.
Prior distribution
Represents our beliefs about the parameters of a model before observing any data.
Posterior distribution
Combines the prior distribution and the likelihood to provide an updated belief about the parameters after observing the data.
What do we know about parameters and data in the Bayesian approach?
Parameters (unknown quantities we try to estimate) and latent variables (unobserved variables that influence the model) are treated as random variables
Data (observed variables) are also considered random but are known quantities
Ancestral sampling and example for: P(A,B,C,D,E) = p(A)p(B)p(C|A,B)p(D|C)p(E|B,C)
Generates samples from a joint probability distribution by sampling each node after its parents, from parents to children
e.g. We first sample values for A and B; suppose we get A=0, B=1. Then we sample C from the conditional distribution P(C|A=0, B=1), and so on
Rejection sampling example for P(A,B,C,D,E) = p(A)p(B)p(C|A,B)p(D|C)p(E|B,C)
If we wanted to sample from P(B,D|E=1) we would draw samples of (B,D,E) from the joint distribution and throw away those samples where E != 1
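A minimal sketch covering both this card and ancestral sampling above, for this network with made-up Bernoulli conditional distributions:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample():
    """Ancestral sampling: each node is sampled after its parents."""
    a = rng.random() < 0.3                        # p(A)
    b = rng.random() < 0.6                        # p(B)
    c = rng.random() < (0.9 if a or b else 0.1)   # p(C|A,B)
    d = rng.random() < (0.7 if c else 0.2)        # p(D|C)
    e = rng.random() < (0.8 if b and c else 0.3)  # p(E|B,C)
    return a, b, c, d, e

# Rejection sampling for P(B,D|E=1): keep only the samples where E is true
kept = [(b, d) for a, b, c, d, e in (sample() for _ in range(10000)) if e]
print(len(kept), kept[:3])
```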
Markov chain
Series of random variables z1,…,zM such that the following holds for m ∈ {1,…,M-1}:
p(zm+1|z1,…,zm) = p(zm+1|zm)
Homogeneous Markov chain
If p(zm+1|zm) is the same for all m
Initial distribution in Markov chain
p(z1)
Transition distribution in Markov chain
It captures how likely it is to transition from the current state to any possible next state.
Markov chain Monte Carlo and significance of samples
Construct a sequence of distributions that eventually gets close to the desired distribution:
1) Draw a sample from each distribution in the sequence
2) Only keep the samples drawn once we are close enough to the desired distribution
*These samples are not independent
Target probability distribution
The distribution the Markov chain is constructed to converge to
For Bayesian ML this will be P(theta|D=d), the posterior distribution of the model parameters given the observed data
Metropolis-Hastings Algorithm
Define a single transition probability distribution for a homogeneous Markov chain.
Let current state be z^(T).
Generate a value z* by sampling from a proposal distribution q(z|z^(T))
Accept z* as the new state with a certain acceptance probability, in which case z^(T+1)=z*. If we don't accept z* then stay where you are, so z^(T+1)=z^(T)
Metropolis algorithm
Special case of the Metropolis-Hastings algorithm where the proposal distribution is symmetric (i.e., the probability of proposing a move to a state is the same as proposing a move back)
Proposal distribution
Distribution from which samples are drawn to generate candidate states in the Markov chain
Acceptance probability for MH and M algorithm
MH: α(x,y) = min(1, π(y)q(x|y) / (π(x)q(y|x)) )
where π is target distribution and q is proposal distribution
M: α(x,y) = min(1, π(y)/π(x))
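A minimal sketch of the Metropolis special case: with a symmetric Gaussian proposal the q terms cancel, leaving α = min(1, π(z*)/π(z)); the target here is an unnormalised standard Gaussian chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def target(z):
    return np.exp(-0.5 * z ** 2)   # unnormalised density is enough for MCMC

z = 0.0
samples = []
for _ in range(10000):
    z_star = z + rng.normal(0, 1)                 # symmetric proposal q(z|z^(T))
    alpha = min(1.0, target(z_star) / target(z))  # acceptance probability
    if rng.random() < alpha:
        z = z_star                                # accept: z^(T+1) = z*
    samples.append(z)                             # otherwise stay: z^(T+1) = z^(T)

print(np.mean(samples), np.std(samples))          # near 0 and 1 after burn-in
```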
Burn-in and what for
Initial period of MCMC simulation during which samples are discarded because early samples might not represent the target distribution well due to influence of initial state
Convergence in MCMC
Process by which distribution of samples generated by Markov chain approaches the target distribution
Why do we run several chains during MCMC?
Helps to diagnose convergence and explore parameter space
What can a sample from a distribution be used for
Can be used to approximate an expected value defined by that distribution
What is R hat used for?
Value computed from an MCMC run used to check for convergence.
If the run is successful, i.e. the chain has converged, it will be close to 1
Clustering
Unsupervised learning technique that groups similar data points into clusters
Soft clustering
Assigns data points to clusters probabilistically (e.g. via the responsibilities of a Gaussian mixture model fitted by MLE), rather than strictly assigning each point to a single cluster
Gaussian mixture model
Probabilistic model that assumes data is generated from a mixture of several Gaussian distributions (type of soft clustering model)
Mixing coefficient
Represents the proportion of the kth component in the overall distribution as a probability between 0 and 1
Responsibility in mixture model
The probability that a specific data point was generated by a particular cluster in the model
K-means algorithm
- Randomly initialise K cluster centroids
- Assign each data point to the closest cluster centroid based on Euclidean distance
- Recompute the centroids by calculating the mean of all data points assigned to each cluster
- Repeat the assignment and update steps until convergence, i.e. when the cluster assignments no longer change significantly
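A minimal numpy sketch of these steps (the toy data and k are illustrative; it assumes no cluster goes empty):

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans(X, k, n_iters=100):
    centroids = X[rng.choice(len(X), k, replace=False)]   # random initialisation
    for _ in range(n_iters):
        # Assignment step: closest centroid by Euclidean distance
        dists = np.linalg.norm(X[:, None] - centroids[None], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centroids):                   # assignments stable
            break
        centroids = new
    return centroids, labels

X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
print(kmeans(X, 2)[0])   # centroids near (0, 0) and (5, 5)
```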
EM algorithm purpose and steps
MLE for Gaussian mixtures:
Initialise by choosing starting values for μ (means), Σ (covariances) and π (mixing coefficients)
Then compute iterations of:
E step: Compute values for the responsibilities given the current parameter values
M step: Re-estimate the parameters using the current responsibilities
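A minimal sketch of EM for a 1-D Gaussian mixture (the initialisation strategy and toy data are illustrative; a real implementation would also monitor the log-likelihood for convergence):

```python
import numpy as np

rng = np.random.default_rng(0)

def em_gmm(x, K, n_iters=100):
    mu = rng.choice(x, K, replace=False)   # initialise means from the data
    var = np.full(K, x.var())              # initialise variances
    pi = np.full(K, 1 / K)                 # initialise mixing coefficients
    for _ in range(n_iters):
        # E step: responsibilities gamma[n, k] given the current parameters
        dens = pi * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var)
        dens = dens / np.sqrt(2 * np.pi * var)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M step: re-estimate the parameters using the current responsibilities
        Nk = gamma.sum(axis=0)
        mu = (gamma * x[:, None]).sum(axis=0) / Nk
        var = (gamma * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
        pi = Nk / len(x)
    return mu, var, pi

x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 0.5, 200)])
print(em_gmm(x, 2))   # recovers means near -2 and 3
```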
Hindrance of EM algorithm
There is no guarantee that it will succeed in maximising the log-likelihood. It may converge to a local maximum which is not a global maximum of the log-likelihood function.
3 EM equations
(Log-likelihood of observed data X given parameters θ) ln p(X|θ) = L(q,θ) + KL(q||p)
(Lower bound on the log-likelihood) L(q,θ) = (sum over Z) q(Z) ln( p(X,Z|θ) / q(Z) )
q(Z) = distribution over latent variables Z
p(X,Z|θ) = joint probability of observed data X and latent variables Z given the parameters
(Kullback-Leibler divergence between q and the posterior p(Z|X,θ)) KL(q||p) = -(sum over Z) q(Z) ln( p(Z|X,θ) / q(Z) )
Key fact about Kullback-Leibler
KL(q||p) ≥ 0 for any choice of q, so L(q, θ) ≤ ln p(X|θ).
What is increased in E-step?
We increase L(q, θ) by updating q (and leaving θ fixed)
What is increased in M step?
We increase L(q, θ) by updating θ (and leaving q fixed)
Why does the EM algorithm work?
After the E-step we have L(q, θ) = ln p(X|θ) (and so KL(q||p) = 0), so that in the following M-step increasing L(q, θ) will also increase ln p(X|θ).
What is the ReLU function and where is it not differentiable (what do we do here instead)?
ReLU(x) = max(0, x): a straight line with slope 1 for positive values of x and a horizontal line at 0 for negative values of x.
It is not differentiable at x = 0, but in practice we just impose a gradient of 0 there.
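A minimal sketch of ReLU and the gradient-0-at-the-kink convention:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    return (x > 0).astype(float)   # gradient of 0 imposed at the kink x = 0

x = np.array([-2.0, 0.0, 3.0])
print(relu(x), relu_grad(x))       # [0. 0. 3.] [0. 0. 1.]
```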