Machine Learning Flashcards

1
Q

Hidden Markov Model

A

1) A statistical model that assumes the system being modelled is a Markov process with unobservable (hidden) states.
2) A 5-tuple of S (a set of possible latent states), Σ (the set of all possible observations), T (a transition matrix between hidden states), E (emission probabilities: the probability of observing each element of Σ given a state s ∈ S), π (an initial distribution over states)
3) Example of use: speech recognition
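Below is a minimal Python sketch (not from the course) of the 5-tuple, using an invented weather example; the state names, observation names, and probabilities are illustrative assumptions.

```python
import numpy as np

# Hypothetical HMM: hidden weather states, observed activities.
S = ["rainy", "sunny"]                 # hidden states
Sigma = ["walk", "shop", "clean"]      # possible observations

# T[i, j] = probability of moving from hidden state i to hidden state j
T = np.array([[0.7, 0.3],
              [0.4, 0.6]])

# E[i, k] = probability of emitting observation k while in hidden state i
E = np.array([[0.1, 0.4, 0.5],
              [0.6, 0.3, 0.1]])

# pi[i] = probability of starting in hidden state i
pi = np.array([0.6, 0.4])

# Each row of T and E, and pi itself, must be a valid distribution.
assert np.allclose(T.sum(axis=1), 1.0)
assert np.allclose(E.sum(axis=1), 1.0)
assert np.isclose(pi.sum(), 1.0)
```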

2
Q

Markov Process

A
  • A continuous-time process
  • A sequence that obeys the Markov property.
3
Q

Markov Chain

A
  • A discrete-time process
  • A set of parameters that allow the representation of a Markov process
  • A triple of Σ (set of possible states), E (transition matrix), π (initial distribution over the states)
4
Q

First-order Markov Chain

A

p(xn+1 | xn, xn-1, …, x1, x0) = p(xn+1 | xn), i.e. the distribution of the next state depends only on the current state xn

5
Q

Markov Property

A

A sequence has the Markov property if the distribution of xn+1 depends only on the current state xn, not on the earlier history of the sequence

6
Q

Latent variable

A

A variable that can’t be seen or measured in the data

7
Q

Probability

A

Number between 0 and 1 which measures the chance of an event occurring

8
Q

Conditional probability

A
  • Chance of observing an event y given that an event x has occurred
  • p(y|x) = p(x,y)/p(x)
9
Q

Joint probability

A
  • Chance that a collection of events occur together
  • p(x = X, y = Y, z = Z) = p(x,y,z)
10
Q

Marginal probability

A
  • The chance of a random variable taking a particular value irrespective of the values of the other variables, obtained by summing (or integrating) the joint distribution over those other variables: p(x) = Σy p(x, y)
11
Q

Random variable

A

A variable whose value is subject to uncertainty or chance

12
Q

Discrete random variable

A

A random variable that can only take a countable number of values. For example, a coin toss

13
Q

Continuous random variable

A

A random variable where the data can take infinitely many values. For example, measuring computation time

14
Q

Bayes Theorem

A
  • P(A|B) = P(B|A) P(A)/P(B)
  • P(A|B) = posterior probability of A given B
  • P(B|A) = likelihood of B given A
  • P(A) = prior
  • P(B) = evidence
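A small worked sketch in Python of the formula above; the numbers (prior, likelihood of a positive test, false-positive rate) are invented for illustration.

```python
# Hypothetical numbers: P(A) = prior probability of a condition,
# P(B|A) = probability of a positive test given the condition,
# P(B|not A) = false positive rate.
p_a = 0.01
p_b_given_a = 0.95
p_b_given_not_a = 0.05

# Evidence P(B), by marginalising over A and not-A.
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Bayes theorem: posterior P(A|B).
p_a_given_b = p_b_given_a * p_a / p_b
print(f"P(A|B) = {p_a_given_b:.3f}")   # about 0.161
```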
15
Q

Supervised classification

A
  • Determining a class of unlabelled data using a model that is learned by examining labelled data
  • For example: spam detection
16
Q

Unsupervised classification

A
  • Determining a class of unlabelled data using a model that is learned by examining unlabelled data
  • For example: clustering with k-means
17
Q

Closed form solution

A
  • A formula that can be derived directly, giving the optimum values of the parameters directly from the data
  • In the context of lines: when fitting a straight line y = mx + c, the optimal values of m and c can be computed directly rather than searched for (see the sketch below)
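A minimal Python sketch of such a closed-form fit, assuming numpy; the data points are invented, and np.linalg.lstsq solves the least-squares problem directly rather than searching.

```python
import numpy as np

# Invented data: y is roughly 2x + 1 plus noise.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# Closed-form least squares: solve for [m, c] by stacking a column
# of ones next to x and solving the resulting linear system.
A = np.column_stack([x, np.ones_like(x)])
(m, c), *_ = np.linalg.lstsq(A, y, rcond=None)
print(f"m = {m:.3f}, c = {c:.3f}")   # no iterative search needed
```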
18
Q

Regression

A
  • Fitting a model to a set of data, in order to explore the relationship between dependent and independent variables.
  • For example, tuning parameters of a simulation to match real measurements
19
Q

Uniform distribution

A
  • P(X = x|N) = 1/N, x = 1,2,3,…,N
  • Where N is an integer
  • Equal chance of each outcome
  • For example, rolling a die
20
Q

Binomial coefficient

A
  • The number of ways of choosing k items from n, ignoring order
  • C(n, k) = n! / (k!(n - k)!)
21
Q

Binomial distribution

A
  • The distribution of the number of successes in a fixed number n of independent Bernoulli trials

  • A single Bernoulli trial:
    > X = 1 with probability p
    > X = 0 with probability 1 - p
    > 0 <= p <= 1
    > EX = 1p + 0(1 - p) = p
    > VarX = p(1 - p)

  • n Bernoulli trials:
    > Ai = {X = 1 on the ith trial}, i = 1,2,..,n
    > If Y is the number of successes, P(Y = k) = C(n, k) p^k (1 - p)^(n-k), k = 0, 1, …, n
    > EY = np, VarY = np(1 - p)
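A short Python sketch (standard library only) of the binomial pmf given above, with an invented coin-tossing example.

```python
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Example: probability of exactly 3 heads in 10 fair coin tosses.
print(binomial_pmf(3, 10, 0.5))                            # ≈ 0.117
# The pmf sums to 1 over k = 0..n.
print(sum(binomial_pmf(k, 10, 0.5) for k in range(11)))    # 1.0
```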
22
Q

Negative binomial distribution

A

The distribution of the number of trials needed to get a fixed number of successes

  • x = number of trials
  • p = chance of success
  • r = successes
23
Q

Poisson distribution

A
  • Often used to model the number of events occurring while waiting over a fixed interval
  • Single parameter λ, referred to as the intensity
  • P(X = x) = (e^(-λ) λ^x) / x!, x = 0, 1, 2, …
  • P(X ≥ x) = 1 - P(X ≤ x - 1)
  • P(X ≥ 1) = 1 - P(X = 0)
  • Example:
    > Website accessed on average 5 times every 3 minutes
    > Probability of no accesses in the next minute?
    > Random variable X = number of accesses in a minute
    > Poisson distribution, λ = 5/3
    > P(no accesses in the next minute) = P(X = 0)
    > P(X = 0) = (e^(-5/3) (5/3)^0) / 0! = e^(-5/3) ≈ 0.189
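A quick Python check of the worked example above; poisson_pmf is a small helper implementing the pmf given in this card.

```python
from math import exp, factorial

def poisson_pmf(x: int, lam: float) -> float:
    """P(X = x) for X ~ Poisson(lam)."""
    return exp(-lam) * lam**x / factorial(x)

lam = 5 / 3                   # 5 accesses per 3 minutes -> 5/3 per minute
print(poisson_pmf(0, lam))    # ≈ 0.189
print(exp(-lam))              # same value: e**(-5/3)
```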
24
Q

Hypergeometric distribution

A
  • No replacement, unlike binomial distribution
  • N = Total population
  • M = Number of successes
  • K = Sample size
  • x = Number of successes in sample
  • P(X > x) = 1 - P(X ≤ x)
25
Q

Geometric distribution

A
  • The simplest of the waiting time distributions
  • A special case of the negative binomial, it’s the distribution for the number of trials needed to get the first success
  • n = number of trials
  • p = probability of success
26
Q

Feature vector

A

An n-dimensional vector of numerical features that represent an object (for example, occurrence of a word).

27
Q

Confusion Matrix

A

A matrix that visualises the performance of an algorithm.

  • One dimension is the correct label
  • The other is the label that was predicted
  • Values in the diagonal indicate a correct prediction. Values elsewhere indicate a wrong prediction.
28
Q

Support Vector Machine

A
  • Find a hyperplane in an n-dimensional space that divides the two classes with the largest possible margin
  • The hyperplane can be written as w∙x - b = 0
  • Select two parallel hyperplanes that separate the two classes with as large a distance between them as possible
    > w∙x - b = 1
    > w∙x - b = -1
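A minimal Python sketch of the decision rule implied by the hyperplane above: with an already-trained (here invented) w and b, a point is classified by the sign of w∙x - b. Training the SVM itself is not shown.

```python
import numpy as np

# Invented, already-trained parameters of the separating hyperplane.
w = np.array([1.0, -2.0])
b = 0.5

def classify(x: np.ndarray) -> int:
    """Return +1 or -1 depending on which side of w.x - b = 0 the point lies."""
    return 1 if np.dot(w, x) - b >= 0 else -1

print(classify(np.array([3.0, 0.0])))   # +1 side
print(classify(np.array([0.0, 3.0])))   # -1 side
```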
29
Q

Multivariate Normal Distribution

A
  • A generalisation of the one-dimensional Gaussian to d dimensions
  • Parameterised by a mean vector µ and a covariance matrix Σ
  • p(x) = exp(-(1/2)(x - µ)^T Σ^(-1) (x - µ)) / sqrt((2π)^d |Σ|)
30
Q

Gaussian Mixture Model

A
  • A weighted sum of multivariate Gaussians
  • Can be used for supervised and unsupervised classification
  • Can be created if the training data is partitioned by:
    > Labelling the data
    > Finding the number of Gaussians, and the mean and covariance of each
    > Assuming the number of Gaussians is equal to the number of classes
31
Q

Gaussian Distribution

A
  • Defined by two parameters: mean (µ) and variance (σ²)
  • For a set X1,…,Xn:
    > Mean µ = (sum of the values) / (number of values)
    > Variance σ² = (Σ(Xi - µ)²) / n
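A small plain-Python sketch computing the two parameters from an invented data set, following the formulas above.

```python
xs = [2.1, 1.9, 2.4, 2.0, 1.6]

n = len(xs)
mu = sum(xs) / n                          # mean: sum of values / number of values
var = sum((x - mu) ** 2 for x in xs) / n  # variance: sum of squared deviations / n
print(f"mean = {mu:.3f}, variance = {var:.3f}")
```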
32
Q

Maximum likelihood approach

A
  • Choose the parameter values that maximise the likelihood (or, equivalently, the log-likelihood) of the observed data under the model
33
Q

Statistical independence

A
  • Events that can occur together but are unrelated: knowing one gives no information about the other
  • p(x|y) = p(x), or equivalently p(x,y) = p(x)p(y)
34
Q

Conditional independence

A
  • p(x|y,z) = p(x|z)
  • This means if z is known, knowing y gives no new information about x
  • It is still possible that p(x|y) ≠ p(x), since x and y may both depend on z
35
Q

Central Limit Theorem

A
  • When independent random variables are summed, their normalised sum tends toward a normal distribution, even if the original variables themselves are not normally distributed.
  • This means statistical and probability methods that work for normal distributions can be applicable to a range of problems that use other distributions
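A tiny simulation sketch (assuming numpy) illustrating the statement above: normalised sums of uniform random variables end up with mean ≈ 0 and standard deviation ≈ 1, as a standard normal would.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50              # number of uniform variables per sum
trials = 100_000    # number of sums

# Uniform(0, 1) has mean 1/2 and variance 1/12.
sums = rng.random((trials, n)).sum(axis=1)
z = (sums - n * 0.5) / np.sqrt(n / 12.0)   # normalise each sum

# Mean close to 0 and std close to 1, even though each term is uniform.
print(z.mean(), z.std())
```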
36
Q

Overfitting

A
  • Fitting the training data too closely, e.g. choosing a polynomial that fits every data point with low error but does not fit the points in between
37
Q

Regularisation

A
  • Introducing a penalty term into the objective function
  • Prevents over-fitting
38
Q

Objective function

A

A function that we wish to maximise or minimise

39
Q

Simulated annealing

A
  • A stochastic search algorithm
  • Example: travelling salesman problem, find the route with the lowest mileage that connects all points
  • Preferable to gradient descent because:
    > It is far less likely to get stuck in a local minimum; with a slow enough cooling schedule it can find the global minimum rather than a local one

Technique (a sketch follows below):
1) Start at a random initial placement, with a high ‘temperature’
2) Generate a random move; a higher temperature means a bigger move
3) Calculate the change in the score caused by the move
4) Depending on the change in score, accept or reject the move. Improvements are always accepted; worse moves are accepted with a probability that depends on the change in score and the current temperature
5) Lower the temperature and loop from step 2
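A generic Python sketch of the loop above, minimising an invented 1-D function; the cooling schedule (×0.999) and Gaussian move size are illustrative choices, not prescribed values.

```python
import math
import random

def f(x: float) -> float:
    """Toy objective with several local minima."""
    return x * x + 10 * math.sin(x)

random.seed(0)
x = random.uniform(-10, 10)      # 1) random initial placement
temperature = 10.0

while temperature > 1e-3:
    # 2) random move; a higher temperature allows a bigger move
    candidate = x + random.gauss(0, temperature)
    # 3) change in the score caused by the move
    delta = f(candidate) - f(x)
    # 4) always accept improvements; accept worse moves with a
    #    probability that depends on delta and the temperature
    if delta < 0 or random.random() < math.exp(-delta / temperature):
        x = candidate
    # 5) lower the temperature and repeat
    temperature *= 0.999

print(x, f(x))
```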

40
Q

Gradient descent

A
  • General concept: take steps proportional to the negative of the gradient of the function at the current point
  • Step size is determined by the Taylor expansion
  • Direction: the negative of the gradient, -∇f(x), at the current point (see the sketch below)
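A minimal Python sketch of the idea above on a simple quadratic, using a fixed step size rather than one derived from the Taylor expansion.

```python
def grad(x: float) -> float:
    """Gradient of f(x) = (x - 3)**2."""
    return 2 * (x - 3)

x = 0.0
step_size = 0.1   # fixed for simplicity; too large overshoots, too small is slow

for _ in range(100):
    x = x - step_size * grad(x)   # step in the direction of the negative gradient

print(x)   # converges towards the minimum at x = 3
```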
41
Q

Taylor Expansion

A
  • Used to derive the step size for gradient descent
  • Note: if the step size is too large then it could overshoot and miss the minimum
  • Note: if the step size is too small then convergence is very slow
42
Q

Brute Force

A
  • Looks at every point in the feature space
  • The space is often continuous, so it has to be quantised

Advantages:

  • Conceptually simple
  • Doesn’t get stuck in local extrema

Disadvantages:

  • May miss a solution
  • Very expensive, especially in high dimensions
43
Q

Parametric density function

A
  • Use functions with a fixed form and parameters that are found using the data
44
Q

Mixture Model

A
  • A parametric density function
  • Does not have to be all Gaussians
45
Q

Non-parametric density function

A
  • Does not assume an underlying functional form (in other words, the number of parameters is not fixed in advance)
46
Q

Kernel density estimation

A
  • A non-parametric density function
    > Each data point contributes towards the overall density according to some smoothing function, called a kernel
    > To form a density estimate at a point, simply sum the contributions from kernels placed at every data point
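A small Python sketch of the sum-of-kernels idea above, using a Gaussian kernel and an invented 1-D data set.

```python
import math

data = [1.0, 1.5, 2.0, 5.0, 5.2]
bandwidth = 0.5   # width of each kernel (see the next card)

def gaussian_kernel(u: float) -> float:
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)

def kde(x: float) -> float:
    """Density estimate at x: sum of a kernel placed at every data point."""
    return sum(gaussian_kernel((x - xi) / bandwidth) for xi in data) / (len(data) * bandwidth)

print(kde(1.5))   # high: near a cluster of points
print(kde(3.5))   # low: between the two clusters
```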
47
Q

Bandwidth

A
  • The width of the kernel gives a trade-off between the smoothness and accuracy of the result
    > Use a ‘bandwidth’ parameter to tune the technique
    > A good selection of bandwidth depends on the data
    > The choice of bandwidth is more important than the choice of kernel
48
Q

Mean Shift

A
  • A non-parametric algorithm which uses the kernel density technique to find local maxima
  • Most common application is clustering
  • Algorithm steps:
    For each point:
    1) Estimate the density at the point and at several points in a region nearby (using KDE)
    2) If the nearby density is all lower, this is a local maximum
    3) Otherwise, move to the highest point and repeat
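A hedged Python sketch of mean shift as it is commonly implemented: rather than probing several nearby points as described above, each iteration moves the current point to the kernel-weighted mean of the data, which is an uphill step on the kernel density estimate. The data and bandwidth are invented.

```python
import math

data = [1.0, 1.5, 2.0, 5.0, 5.2]
bandwidth = 1.0

def weight(x: float, xi: float) -> float:
    """Gaussian kernel weight of data point xi as seen from x."""
    return math.exp(-0.5 * ((x - xi) / bandwidth) ** 2)

def mean_shift(x: float, iterations: int = 50) -> float:
    """Climb towards a local maximum of the kernel density estimate."""
    for _ in range(iterations):
        weights = [weight(x, xi) for xi in data]
        x = sum(w * xi for w, xi in zip(weights, data)) / sum(weights)
    return x

# Points starting near each cluster converge to that cluster's mode.
print(mean_shift(1.2))   # ≈ mode of the cluster around 1.5
print(mean_shift(4.8))   # ≈ mode of the cluster around 5.1
```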
49
Q

Axioms of probability theory

A

1) All probabilities are non-negative: for all x, p(x) ≥ 0
2) The probabilities of mutually exclusive events add
3) The probability that at least one of all the possible outcomes of a process will occur is 1

50
Q

Probability functions

A
  • Probabilities for all possible events as a function

1) Probability mass function
> Discrete data
> Must sum to 1
> Can never exceed 1
> Probability is given directly by the function

2) Probability density function
> Continuous data
> Must integrate to 1
> Can exceed 1 locally
> Probability is given by integrating over an interval

51
Q

Optimisation

A

If the optimal values cannot be computed directly, then search for them while excluding all outliers

52
Q

Expectation maximisation

A
  • A meta-algorithm used for unsupervised GMM classification
  • Used for latent variables, when the model is in the exponential family (e.g. Gaussian)
  • Two steps
    > Expectation: find the expected value of the log-likelihood
    > Maximisation: find the parameters that maximise this
  • Problems
    > Performs poorly if the data isn’t Gaussian distributed or the clusters overlap
    > May not give the global maximum
    > The number of clusters must be specified
  • For the last point, a measure of model complexity can be incorporated into the algorithm to pick the number of clusters automatically
53
Q

Deterministic optimisation

A
  • The output is fully determined by the parameter values and initial conditions
54
Q

Stochastic optimization

A

The output is affected by inherent randomness in the process (in other words, the same set of parameters will generally lead to different results on different runs)

55
Q

Closed form optimisation

A
  • Is deterministic and can be derived mathematically
  • However, a closed-form solution takes a fixed number of steps, whereas a general deterministic solution may not
56
Q

Population mean

A
  • μ = (ΣX)/ N
  • μ = mean
  • ΣX = sum of all values in the group
  • N = number of items in the group
57
Q

Sample mean

A
  • x̄ = (Σx)/n
  • x̄ = sample mean
  • Σx = sum of the values in the sample
  • n = number of items in the sample
  • An estimate of the population mean μ
58
Q

Least squares objective function

A

A measure of how well a given line (m, c) fits the data D: the sum of squared errors, E(m, c) = Σ(yi - (m·xi + c))²

59
Q

Classification problem

A
  • Data required:
    > Feature vectors
    > Labels corresponding to each item in the feature vector
  • Processing data:
    > Splitting into two sets: training data (the data that will be used to build the model) and test data (the data that will be used to test said model)
  • Model used:
    > Only depends on the number of features and the number of classes
    > Doesn’t get larger as more training data is added
    – p(c|x) ∝ p(c)p(x|c)
    – p(c) = probability of the class (the number of times the class occurs divided by the total number of training data items)
    – p(x|c) = the probability that the word xi appears in class (sentiment) cj (this can be computed by counting occurrences)
    – p(x) is dropped, since it does not change which class maximises the value (a MAP approach)
  • Testing the system:
    > Keep a portion of the labelled data separate
    > Plot the results as a 2D confusion matrix
    > Values on the diagonal indicate a correct prediction; values elsewhere indicate a wrong prediction
60
Q

Curse of dimensionality

A
  • As the number of features grows, the amount of data needed to generalise accurately grows exponentially
  • This is a problem for any method that requires statistical significance
61
Q

K-Means

A
  • Example of an algorithm that doesn’t require a human supervisor to partition data into classes
  • Iteratively assigns each point to the nearest of k cluster centres, then recomputes each centre as the mean of the points assigned to it
62
Q

Viterbi Algorithm

A
  • A dynamic programming algorithm of complexity O(n·|S|²) that calculates the most probable path through the hidden states given the observations
  • pk(i, t) = ek(i) · maxl( pl(j, t-1) · plk ), where ek(i) is the probability of emitting observation i from state k and plk is the transition probability from state l to state k
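A compact Python sketch of the recurrence above, reusing the invented weather HMM from the Hidden Markov Model card; it returns the most probable hidden-state sequence for a list of observations.

```python
import numpy as np

S = ["rainy", "sunny"]
Sigma = ["walk", "shop", "clean"]
T = np.array([[0.7, 0.3], [0.4, 0.6]])            # transitions p_lk
E = np.array([[0.1, 0.4, 0.5], [0.6, 0.3, 0.1]])  # emissions e_k(i)
pi = np.array([0.6, 0.4])

def viterbi(observations):
    obs = [Sigma.index(o) for o in observations]
    n_states = len(S)
    p = np.zeros((len(obs), n_states))        # p[t, k]: best path prob ending in state k at t
    back = np.zeros((len(obs), n_states), dtype=int)

    p[0] = pi * E[:, obs[0]]
    for t in range(1, len(obs)):
        for k in range(n_states):
            scores = p[t - 1] * T[:, k]        # p_l(t-1) * p_lk for every previous state l
            back[t, k] = np.argmax(scores)
            p[t, k] = E[k, obs[t]] * scores.max()   # e_k(i_t) * max_l(...)

    # Trace the most probable path back from the final time step.
    path = [int(np.argmax(p[-1]))]
    for t in range(len(obs) - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return [S[k] for k in reversed(path)]

print(viterbi(["walk", "shop", "clean"]))
```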
63
Q

Baum-Welch Algorithm

A
  • Finds a local maximum (not necessarily the global one)
  • maxv p(v | data), where v are the HMM parameters

Steps:

1) Choose a random HMM
2) Use the forward-backward algorithm to calculate the α’s and β’s
3) Use Bayes’ theorem to update the parameters

Problems:
> There are lots of parameters, which can cause overfitting
– To counter this the algorithm may have to be run several times, which requires either longer sequences or many short ones

> Can’t learn on long sequences
– Multiplying many small numbers together underflows to 0; use log probabilities rather than the probabilities themselves, or generate many short sequences

64
Q

Hessian

A

A square matrix of second order partial derivatives

65
Q

k Nearest Neighbours

A
  • Requires a feature vector (objects to be classified)
  • Requires a training set (A set of samples with known classification)
  • For a given test object, compute the distance d between the test and each element of the training set
  • Look at the k elements with the smallest d and take the class with the most votes among them
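A minimal Python sketch of the procedure above, with an invented 2-D training set and Euclidean distance as d.

```python
from collections import Counter
import math

# Invented training set: (feature vector, label) pairs.
training = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"),
            ((4.0, 4.2), "B"), ((3.8, 4.0), "B"), ((4.1, 3.9), "B")]

def knn_classify(x, k=3):
    # Distance d between the test object and every element of the training set.
    distances = sorted((math.dist(x, xi), label) for xi, label in training)
    # The k nearest elements vote; take the class with the most votes.
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

print(knn_classify((1.1, 0.9)))   # "A"
print(knn_classify((4.0, 4.0)))   # "B"
```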
66
Q

Covariance Matrix

A
  • A matrix whose element in the i, j position is the covariance between the i-th and j-th elements of a random vector
  • Determines the correlation between features
  • It’s ideal to use features that are not correlated and use as few features as possible, in order to avoid the curse of dimensionality
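A short sketch (assuming numpy) that computes a covariance matrix for an invented feature matrix with rows as samples and columns as features, and inspects the correlations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented data: feature 1 is random, feature 2 is strongly related to feature 1,
# feature 3 is independent noise.
f1 = rng.normal(size=200)
f2 = 2.0 * f1 + rng.normal(scale=0.1, size=200)
f3 = rng.normal(size=200)
X = np.column_stack([f1, f2, f3])

# np.cov expects variables in rows by default, hence rowvar=False for this layout.
C = np.cov(X, rowvar=False)
print(C)                              # element (i, j) is the covariance of features i and j
print(np.corrcoef(X, rowvar=False))   # correlations: f1 and f2 are highly correlated
```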