Machine Learning Flashcards

1
Q

Hidden Markov Model

A

1) A statistical model that assumes the system being modelled is a Markov process with unobservable (hidden) states.
2) A 5-tuple of S (a set of possible latent states), Σ (the set of all possible observations), T (a transition matrix between hidden states), E (emission probabilities: the probability of observing each element of Σ given a state s ∈ S), π (an initial distribution over states)
3) Example of use: speech recognition
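Below is a minimal Python sketch (not from the course) of the 5-tuple, using an invented weather example; the state names, observation names, and probabilities are illustrative assumptions.

```python
import numpy as np

# Hypothetical HMM: hidden weather states, observed activities.
S = ["rainy", "sunny"]                 # hidden states
Sigma = ["walk", "shop", "clean"]      # possible observations

# T[i, j] = probability of moving from hidden state i to hidden state j
T = np.array([[0.7, 0.3],
              [0.4, 0.6]])

# E[i, k] = probability of emitting observation k while in hidden state i
E = np.array([[0.1, 0.4, 0.5],
              [0.6, 0.3, 0.1]])

# pi[i] = probability of starting in hidden state i
pi = np.array([0.6, 0.4])

# Each row of T and E, and pi itself, must be a valid distribution.
assert np.allclose(T.sum(axis=1), 1.0)
assert np.allclose(E.sum(axis=1), 1.0)
assert np.isclose(pi.sum(), 1.0)
```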

2
Q

Markov Process

A
  • A continuous-time process
  • A sequence that obeys the Markov property.
3
Q

Markov Chain

A
  • A discrete-time process
  • A set of parameters that allow the representation of a Markov process
  • A triple of Σ (set of possible states), E (transition matrix), π (initial distribution over the states)
4
Q

First-order Markov Chain

A

p(xn+1 | xn, xn-1, …, x1, x0) = p(xn+1 | xn), i.e. the distribution of the next state depends only on the current state xn

5
Q

Markov Property

A

A sequence has the Markov property if the distribution of xn+1 depends only on the current state xn, not on the earlier history of the sequence

6
Q

Latent variable

A

A variable that can’t be seen or measured in the data

7
Q

Probability

A

Number between 0 and 1 which measures the chance of an event occurring

8
Q

Conditional probability

A
  • Chance of observing an event y given that an event x has occurred
  • p(y|x) = p(x,y)/p(x)
9
Q

Joint probability

A
  • Chance that a collection of events occur together
  • p(x = X, y = Y, z = Z) = p(x,y,z)
10
Q

Marginal probability

A
  • The chance of a random variable taking a particular value irrespective of the values of the other variables, obtained by summing (or integrating) the joint distribution over those other variables: p(x) = Σy p(x, y)
11
Q

Random variable

A

A variable whose value is subject to uncertainty or chance

12
Q

Discrete random variable

A

A random variable that can only take a countable number of values. For example, a coin toss

13
Q

Continuous random variable

A

A random variable where the data can take infinitely many values. For example, measuring computation time

14
Q

Bayes Theorem

A
  • P(A|B) = P(B|A) P(A)/P(B)
  • P(A|B) = posterior probability of A given B
  • P(B|A) = likelihood of B given A
  • P(A) = prior
  • P(B) = evidence
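A small worked sketch in Python of the formula above; the numbers (prior, likelihood of a positive test, false-positive rate) are invented for illustration.

```python
# Hypothetical numbers: P(A) = prior probability of a condition,
# P(B|A) = probability of a positive test given the condition,
# P(B|not A) = false positive rate.
p_a = 0.01
p_b_given_a = 0.95
p_b_given_not_a = 0.05

# Evidence P(B), by marginalising over A and not-A.
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Bayes theorem: posterior P(A|B).
p_a_given_b = p_b_given_a * p_a / p_b
print(f"P(A|B) = {p_a_given_b:.3f}")   # about 0.161
```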
15
Q

Supervised classification

A
  • Determining a class of unlabelled data using a model that is learned by examining labelled data
  • For example: spam detection
16
Q

Unsupervised classification

A
  • Determining a class of unlabelled data using a model that is learned by examining unlabelled data
  • For example: clustering with k-means
17
Q

Closed form solution

A
  • A formula that can be derived directly, giving the optimum values of the parameters directly from the data
  • In the context of lines: when fitting a straight line y = mx + c, the optimal values of m and c can be computed directly rather than searched for (see the sketch below)
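A minimal Python sketch of such a closed-form fit, assuming numpy; the data points are invented, and np.linalg.lstsq solves the least-squares problem directly rather than searching.

```python
import numpy as np

# Invented data: y is roughly 2x + 1 plus noise.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# Closed-form least squares: solve for [m, c] by stacking a column
# of ones next to x and solving the resulting linear system.
A = np.column_stack([x, np.ones_like(x)])
(m, c), *_ = np.linalg.lstsq(A, y, rcond=None)
print(f"m = {m:.3f}, c = {c:.3f}")   # no iterative search needed
```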
18
Q

Regression

A
  • Fitting a model to a set of data, in order to explore the relationship between dependent and independent variables.
  • For example, tuning parameters of a simulation to match real measurements
19
Q

Uniform distribution

A
  • P(X = x|N) = 1/N, x = 1,2,3,…,N
  • Where N is an integer
  • Equal chance of each outcome
  • For example, rolling a die
20
Q

Binomial coefficient

A
  • The number of ways of choosing k items from n, ignoring order
  • C(n, k) = n! / (k!(n - k)!)
21
Q

Binomial distribution

A
  • The distribution of the number of successes in a fixed number n of independent Bernoulli trials

  • A single Bernoulli trial:
    > X = 1 with probability p
    > X = 0 with probability 1 - p
    > 0 <= p <= 1
    > EX = 1p + 0(1 - p) = p
    > VarX = p(1 - p)

  • n Bernoulli trials:
    > Ai = {X = 1 on the ith trial}, i = 1,2,..,n
    > If Y is the number of successes, P(Y = k) = C(n, k) p^k (1 - p)^(n-k), k = 0, 1, …, n
    > EY = np, VarY = np(1 - p)
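A short Python sketch (standard library only) of the binomial pmf given above, with an invented coin-tossing example.

```python
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Example: probability of exactly 3 heads in 10 fair coin tosses.
print(binomial_pmf(3, 10, 0.5))                            # ≈ 0.117
# The pmf sums to 1 over k = 0..n.
print(sum(binomial_pmf(k, 10, 0.5) for k in range(11)))    # 1.0
```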
22
Q

Negative binomial distribution

A

The distribution of the number of trials needed to get a fixed number of successes

  • x = number of trials
  • p = chance of success
  • r = successes
23
Q

Poisson distribution

A
  • Often used to model the number of events occurring while waiting over a fixed interval
  • Single parameter λ, referred to as the intensity
  • P(X = x) = (e^(-λ) λ^x) / x!, x = 0, 1, 2, …
  • P(X ≥ x) = 1 - P(X ≤ x - 1)
  • P(X ≥ 1) = 1 - P(X = 0)
  • Example:
    > Website accessed on average 5 times every 3 minutes
    > Probability of no accesses in the next minute?
    > Random variable X = number of accesses in a minute
    > Poisson distribution, λ = 5/3
    > P(no accesses in the next minute) = P(X = 0)
    > P(X = 0) = (e^(-5/3) (5/3)^0) / 0! = e^(-5/3) ≈ 0.189
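A quick Python check of the worked example above; poisson_pmf is a small helper implementing the pmf given in this card.

```python
from math import exp, factorial

def poisson_pmf(x: int, lam: float) -> float:
    """P(X = x) for X ~ Poisson(lam)."""
    return exp(-lam) * lam**x / factorial(x)

lam = 5 / 3                   # 5 accesses per 3 minutes -> 5/3 per minute
print(poisson_pmf(0, lam))    # ≈ 0.189
print(exp(-lam))              # same value: e**(-5/3)
```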
24
Q

Hypergeometric distribution

A
  • No replacement, unlike binomial distribution
  • N = Total population
  • M = Number of successes
  • K = Sample size
  • x = Number of successes in sample
  • P(X > x) = 1 - P(X ≤ x)
25
Q

Geometric distribution

A
  • The simplest of the waiting time distributions
  • A special case of the negative binomial, it’s the distribution for the number of trials needed to get the first success
  • n = number of trials
  • p = probability of success
26
Q

Feature vector

A

An n-dimensional vector of numerical features that represent an object (for example, occurrence of a word).

27
Q

Confusion Matrix

A

A matrix that visualises the performance of an algorithm.

  • One dimension is the correct label
  • The other is the label that was predicted
  • Values in the diagonal indicate a correct prediction. Values elsewhere indicate a wrong prediction.
28
Q

Support Vector Machine

A
  • Find a hyperplane in an n-dimensional space that divides the two classes with the largest possible margin
  • The hyperplane can be written as w∙x - b = 0
  • Select two parallel hyperplanes that separate the two classes with as large a distance between them as possible
    > w∙x - b = 1
    > w∙x - b = -1
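A minimal Python sketch of the decision rule implied by the hyperplane above: with an already-trained (here invented) w and b, a point is classified by the sign of w∙x - b. Training the SVM itself is not shown.

```python
import numpy as np

# Invented, already-trained parameters of the separating hyperplane.
w = np.array([1.0, -2.0])
b = 0.5

def classify(x: np.ndarray) -> int:
    """Return +1 or -1 depending on which side of w.x - b = 0 the point lies."""
    return 1 if np.dot(w, x) - b >= 0 else -1

print(classify(np.array([3.0, 0.0])))   # +1 side
print(classify(np.array([0.0, 3.0])))   # -1 side
```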
29
Q

Multivariate Normal Distribution

A
  • A generalisation of the one-dimensional Gaussian to d dimensions
  • Parameterised by a mean vector µ and a covariance matrix Σ
  • p(x) = exp(-(1/2)(x - µ)^T Σ^(-1) (x - µ)) / sqrt((2π)^d |Σ|)
30
Q

Gaussian Mixture Model

A
  • A weighted sum of multivariate Gaussians
  • Can be used for supervised and unsupervised classification
  • Can be created if the training data is partitioned by:
    > Labelling the data
    > Finding the number of Gaussians, and the mean and covariance of each
    > Assuming the number of Gaussians is equal to the number of classes
31
Q

Gaussian Distribution

A
  • Defined by two parameters: mean (µ) and variance (σ²)
  • For a set X1,…,Xn:
    > Mean µ = (sum of the values) / (number of values)
    > Variance σ² = (Σ(Xi - µ)²) / n
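A small plain-Python sketch computing the two parameters from an invented data set, following the formulas above.

```python
xs = [2.1, 1.9, 2.4, 2.0, 1.6]

n = len(xs)
mu = sum(xs) / n                          # mean: sum of values / number of values
var = sum((x - mu) ** 2 for x in xs) / n  # variance: sum of squared deviations / n
print(f"mean = {mu:.3f}, variance = {var:.3f}")
```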
32
Q

Maximum likelihood approach

A
  • Choose the parameter values that maximise the likelihood (or, equivalently, the log-likelihood) of the observed data under the model
33
Q

Statistical independence

A
  • Events that can occur together but are unrelated: knowing one gives no information about the other
  • p(x|y) = p(x), or equivalently p(x,y) = p(x)p(y)
34
Q

Conditional independence

A
  • p(x|y,z) = p(x|z)
  • This means if z is known, knowing y gives no new information about x
  • It is still possible that p(x|y) ≠ p(x), since x and y may both depend on z
35
Q

Central Limit Theorem

A
  • When independent random variables are summed, their normalised sum tends toward a normal distribution, even if the original variables themselves are not normally distributed.
  • This means statistical and probability methods that work for normal distributions can be applicable to a range of problems that use other distributions
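A tiny simulation sketch (assuming numpy) illustrating the statement above: normalised sums of uniform random variables end up with mean ≈ 0 and standard deviation ≈ 1, as a standard normal would.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50              # number of uniform variables per sum
trials = 100_000    # number of sums

# Uniform(0, 1) has mean 1/2 and variance 1/12.
sums = rng.random((trials, n)).sum(axis=1)
z = (sums - n * 0.5) / np.sqrt(n / 12.0)   # normalise each sum

# Mean close to 0 and std close to 1, even though each term is uniform.
print(z.mean(), z.std())
```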
36
Q

Overfitting

A
  • Fitting the training data too closely, e.g. choosing a polynomial that fits every data point with low error but does not fit the points in between
37
Q

Regularisation

A
  • Introducing a penalty term into the objective function
  • Prevents over-fitting
38
Q

Objective function

A

A function that we wish to maximise or minimise

39
Q

Simulated annealing

A
  • A stochastic search algorithm
  • Example: travelling salesman problem, find the route with the lowest mileage that connects all points
  • Preferable to gradient descent because:
    > It is far less likely to get stuck in a local minimum; with a slow enough cooling schedule it can find the global minimum rather than a local one

Technique (a sketch follows below):
1) Start at a random initial placement, with a high ‘temperature’
2) Generate a random move; a higher temperature means a bigger move
3) Calculate the change in the score caused by the move
4) Depending on the change in score, accept or reject the move. Improvements are always accepted; worse moves are accepted with a probability that depends on the change in score and the current temperature
5) Lower the temperature and loop from step 2
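A generic Python sketch of the loop above, minimising an invented 1-D function; the cooling schedule (×0.999) and Gaussian move size are illustrative choices, not prescribed values.

```python
import math
import random

def f(x: float) -> float:
    """Toy objective with several local minima."""
    return x * x + 10 * math.sin(x)

random.seed(0)
x = random.uniform(-10, 10)      # 1) random initial placement
temperature = 10.0

while temperature > 1e-3:
    # 2) random move; a higher temperature allows a bigger move
    candidate = x + random.gauss(0, temperature)
    # 3) change in the score caused by the move
    delta = f(candidate) - f(x)
    # 4) always accept improvements; accept worse moves with a
    #    probability that depends on delta and the temperature
    if delta < 0 or random.random() < math.exp(-delta / temperature):
        x = candidate
    # 5) lower the temperature and repeat
    temperature *= 0.999

print(x, f(x))
```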

40
Q

Gradient descent

A
  • General concept: take steps proportional to the negative of the gradient of the function at the current point
  • Step size is determined by the Taylor expansion
  • Direction: the negative of the gradient, -∇f(x), at the current point (see the sketch below)
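A minimal Python sketch of the idea above on a simple quadratic, using a fixed step size rather than one derived from the Taylor expansion.

```python
def grad(x: float) -> float:
    """Gradient of f(x) = (x - 3)**2."""
    return 2 * (x - 3)

x = 0.0
step_size = 0.1   # fixed for simplicity; too large overshoots, too small is slow

for _ in range(100):
    x = x - step_size * grad(x)   # step in the direction of the negative gradient

print(x)   # converges towards the minimum at x = 3
```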
41
Q

Taylor Expansion

A
  • Used to derive the step size for gradient descent
  • Note: if the step size is too large then it could overshoot and miss the minimum
  • Note: if the step size is too small then convergence is very slow
42
Q

Brute Force

A
  • Looks at every point in the feature space
  • The space is often continuous, so it has to be quantised

Advantages:

  • Conceptually simple
  • Doesn’t get stuck in local extrema

Disadvantages:

  • May miss a solution
  • Very expensive, especially in high dimensions
43
Q

Parametric density function

A
  • Use functions with a fixed form and parameters that are found using the data
44
Q

Mixture Model

A
  • A parametric density function
  • Does not have to be all Gaussians
45
Q

Non-parametric density function

A
  • Does not assume an underlying functional form (in other words, the number of parameters is not fixed in advance)
46
Q

Kernel density estimation

A
  • A non-parametric density function
    > Each data point contributes towards the overall density according to some smoothing function, called a kernel
    > To form a density estimate at a point, simply sum the contributions from kernels placed at every data point
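A small Python sketch of the sum-of-kernels idea above, using a Gaussian kernel and an invented 1-D data set.

```python
import math

data = [1.0, 1.5, 2.0, 5.0, 5.2]
bandwidth = 0.5   # width of each kernel (see the next card)

def gaussian_kernel(u: float) -> float:
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)

def kde(x: float) -> float:
    """Density estimate at x: sum of a kernel placed at every data point."""
    return sum(gaussian_kernel((x - xi) / bandwidth) for xi in data) / (len(data) * bandwidth)

print(kde(1.5))   # high: near a cluster of points
print(kde(3.5))   # low: between the two clusters
```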
47
Q

Bandwidth

A
  • The width of the kernel gives a trade-off between the smoothness and accuracy of the result
    > Use a ‘bandwidth’ parameter to tune the technique
    > A good selection of bandwidth depends on the data
    > The choice of bandwidth is more important than the choice of kernel
48
Q

Mean Shift

A
  • A non-parametric algorithm which uses the kernel density technique to find local maxima
  • Most common application is clustering
  • Algorithm steps:
    For each point:
    1) Estimate the density at the point and at several points in a region nearby (using KDE)
    2) If the nearby density is all lower, this is a local maximum
    3) Otherwise, move to the highest point and repeat
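A hedged Python sketch of mean shift as it is commonly implemented: rather than probing several nearby points as described above, each iteration moves the current point to the kernel-weighted mean of the data, which is an uphill step on the kernel density estimate. The data and bandwidth are invented.

```python
import math

data = [1.0, 1.5, 2.0, 5.0, 5.2]
bandwidth = 1.0

def weight(x: float, xi: float) -> float:
    """Gaussian kernel weight of data point xi as seen from x."""
    return math.exp(-0.5 * ((x - xi) / bandwidth) ** 2)

def mean_shift(x: float, iterations: int = 50) -> float:
    """Climb towards a local maximum of the kernel density estimate."""
    for _ in range(iterations):
        weights = [weight(x, xi) for xi in data]
        x = sum(w * xi for w, xi in zip(weights, data)) / sum(weights)
    return x

# Points starting near each cluster converge to that cluster's mode.
print(mean_shift(1.2))   # ≈ mode of the cluster around 1.5
print(mean_shift(4.8))   # ≈ mode of the cluster around 5.1
```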
49
Q

Axioms of probability theory

A

1) All probabilities are non-negative: for all x, p(x) ≥ 0
2) The probabilities of mutually exclusive events add
3) The probability that at least one of all the possible outcomes of a process will occur is 1

50
Q

Probability functions

A
  • Probabilities for all possible events as a function

1) Probability mass function
> Discrete data
> Must sum to 1
> Can never exceed 1
> Probability is given directly by the function

2) Probability density function
> Continuous data
> Must integrate to 1
> Can exceed 1 locally
> Probability is given by integrating over an interval

51
Q

Optimisation

A

If the optimal values cannot be computed directly, then search for them while excluding all outliers

52
Q

Expectation maximisation

A
  • A meta-algorithm used for unsupervised GMM classification
  • Used for latent variables, when the model is in the exponential family (e.g. Gaussian)
  • Two steps
    > Expectation: find the expected value of the log-likelihood
    > Maximisation: find the parameters that maximise this
  • Problems
    > Performs poorly if the data isn’t Gaussian distributed or the clusters overlap
    > May not give the global maximum
    > The number of clusters must be specified
  • For the last point, a measure of model complexity can be incorporated into the algorithm to pick the number of clusters automatically
53
Q

Deterministic optimisation

A
  • The output is fully determined by the parameter values and initial conditions
54
Q

Stochastic optimization

A

The output is affected by inherent randomness in the process (in other words, the same set of parameters will generally lead to different results on different runs)

55
Q

Closed form optimisation

A
  • Is deterministic and can be derived mathematically
  • However, a closed-form solution takes a fixed number of steps, whereas a general deterministic solution may not
56
Q

Population mean

A
  • μ = (ΣX)/ N
  • μ = mean
  • ΣX = sum of all values in the group
  • N = number of items in the group
57
Q

Sample mean

A
  • x̄ = (Σx)/n
  • x̄ = sample mean
  • Σx = sum of the values in the sample
  • n = number of items in the sample
  • An estimate of the population mean μ
58
Q

Least squares objective function

A

A measure of how well a given line (m, c) fits the data D: the sum of squared errors, E(m, c) = Σ(yi - (m·xi + c))²

59
Q

Classification problem

A
  • Data required:
    > Feature vectors
    > Labels corresponding to each item in the feature vector
  • Processing data:
    > Splitting into two sets: training data (the data that will be used to build the model) and test data (the data that will be used to test said model)
  • Model used:
    > Only depends on the number of features and the number of classes
    > Doesn’t get larger as more training data is added
    – p(c|x) ∝ p(c)p(x|c)
    – p(c) = probability of the class (the number of times the class occurs divided by the total number of training data items)
    – p(x|c) = the probability that the word xi appears in class (sentiment) cj (this can be computed by counting occurrences)
    – p(x) is dropped, since it does not change which class maximises the value (a MAP approach)
  • Testing the system:
    > Keep a portion of the labelled data separate
    > Plot the results as a 2D confusion matrix
    > Values on the diagonal indicate a correct prediction; values elsewhere indicate a wrong prediction
60
Q

Curse of dimensionality

A
  • As the number of features grows, the amount of data needed to generalise accurately grows exponentially
  • This is a problem for any method that requires statistical significance
61
Q

K-Means

A
  • Example of an algorithm that doesn’t require a human supervisor to partition data into classes
  • Iteratively assigns each point to the nearest of k cluster centres, then recomputes each centre as the mean of the points assigned to it
62
Q

Viterbi Algorithm

A
  • A dynamic programming algorithm of complexity O(n·|S|²) that calculates the most probable path through the hidden states given the observations
  • pk(i, t) = ek(i) · maxl( pl(j, t-1) · plk ), where ek(i) is the probability of emitting observation i from state k and plk is the transition probability from state l to state k
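A compact Python sketch of the recurrence above, reusing the invented weather HMM from the Hidden Markov Model card; it returns the most probable hidden-state sequence for a list of observations.

```python
import numpy as np

S = ["rainy", "sunny"]
Sigma = ["walk", "shop", "clean"]
T = np.array([[0.7, 0.3], [0.4, 0.6]])            # transitions p_lk
E = np.array([[0.1, 0.4, 0.5], [0.6, 0.3, 0.1]])  # emissions e_k(i)
pi = np.array([0.6, 0.4])

def viterbi(observations):
    obs = [Sigma.index(o) for o in observations]
    n_states = len(S)
    p = np.zeros((len(obs), n_states))        # p[t, k]: best path prob ending in state k at t
    back = np.zeros((len(obs), n_states), dtype=int)

    p[0] = pi * E[:, obs[0]]
    for t in range(1, len(obs)):
        for k in range(n_states):
            scores = p[t - 1] * T[:, k]        # p_l(t-1) * p_lk for every previous state l
            back[t, k] = np.argmax(scores)
            p[t, k] = E[k, obs[t]] * scores.max()   # e_k(i_t) * max_l(...)

    # Trace the most probable path back from the final time step.
    path = [int(np.argmax(p[-1]))]
    for t in range(len(obs) - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return [S[k] for k in reversed(path)]

print(viterbi(["walk", "shop", "clean"]))
```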
63
Q

Baum-Welch Algorithm

A
  • Finds a local maximum (not necessarily the global one)
  • maxv p(v | data), where v are the HMM parameters

Steps:

1) Choose a random HMM
2) Use the forward-backward algorithm to calculate the α’s and β’s
3) Use Bayes’ theorem to update the parameters

Problems:
> There are lots of parameters, which can cause overfitting
– To counter this the algorithm may have to be run several times, which requires either longer sequences or many short ones

> Can’t learn on long sequences
– Multiplying many small numbers together underflows to 0; use log probabilities rather than the probabilities themselves, or generate many short sequences

64
Q

Hessian

A

A square matrix of second order partial derivatives

65
Q

k Nearest Neighbours

A
  • Requires a feature vector (objects to be classified)
  • Requires a training set (A set of samples with known classification)
  • For a given test object, compute the distance d between the test and each element of the training set
  • Look at the k elements with the smallest d and take the class with the most votes among them
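A minimal Python sketch of the procedure above, with an invented 2-D training set and Euclidean distance as d.

```python
from collections import Counter
import math

# Invented training set: (feature vector, label) pairs.
training = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"),
            ((4.0, 4.2), "B"), ((3.8, 4.0), "B"), ((4.1, 3.9), "B")]

def knn_classify(x, k=3):
    # Distance d between the test object and every element of the training set.
    distances = sorted((math.dist(x, xi), label) for xi, label in training)
    # The k nearest elements vote; take the class with the most votes.
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

print(knn_classify((1.1, 0.9)))   # "A"
print(knn_classify((4.0, 4.0)))   # "B"
```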
66
Q

Covariance Matrix

A
  • A matrix whose element in the i, j position is the covariance between the i-th and j-th elements of a random vector
  • Determines the correlation between features
  • It’s ideal to use features that are not correlated and use as few features as possible, in order to avoid the curse of dimensionality
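A short sketch (assuming numpy) that computes a covariance matrix for an invented feature matrix with rows as samples and columns as features, and inspects the correlations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented data: feature 1 is random, feature 2 is strongly related to feature 1,
# feature 3 is independent noise.
f1 = rng.normal(size=200)
f2 = 2.0 * f1 + rng.normal(scale=0.1, size=200)
f3 = rng.normal(size=200)
X = np.column_stack([f1, f2, f3])

# np.cov expects variables in rows by default, hence rowvar=False for this layout.
C = np.cov(X, rowvar=False)
print(C)                              # element (i, j) is the covariance of features i and j
print(np.corrcoef(X, rowvar=False))   # correlations: f1 and f2 are highly correlated
```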