Machine Learning Flashcards
Hidden Markov Model
1) A statistical model that assumes that the system it’s modelling is a Markov process with unobservable (hidden) states.
2) A 5-tuple of S (a set of possible hidden states), Σ (set of all possible observations), T (a transition matrix between hidden states), E (emission probabilities: the probability of observing each element of Σ given a state s ∈ S), π (initial distribution over states)
3) Example of use: speech recognition
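The 5-tuple above can be made concrete by sampling from a toy HMM. This is a minimal sketch; the weather/activity states and all probabilities are made up for illustration, not from the source:

```python
import random

random.seed(0)

# Hypothetical HMM: hidden weather states, observed activities.
states = ["Rainy", "Sunny"]          # S: hidden states
observations = ["walk", "shop"]      # Sigma: possible observations
T = {"Rainy": {"Rainy": 0.7, "Sunny": 0.3},   # transition matrix
     "Sunny": {"Rainy": 0.4, "Sunny": 0.6}}
E = {"Rainy": {"walk": 0.1, "shop": 0.9},     # emission probabilities
     "Sunny": {"walk": 0.8, "shop": 0.2}}
pi = {"Rainy": 0.5, "Sunny": 0.5}             # initial distribution

def sample(dist):
    """Draw one key from a {outcome: probability} dict."""
    r, acc = random.random(), 0.0
    for k, p in dist.items():
        acc += p
        if r < acc:
            return k
    return k  # guard against floating-point rounding

def generate(n):
    """Sample n (hidden state, observation) pairs from the HMM."""
    s = sample(pi)
    seq = []
    for _ in range(n):
        seq.append((s, sample(E[s])))  # emit, then transition
        s = sample(T[s])
    return seq

seq = generate(5)
```

Only the observations in `seq` would be visible to a learner; the states are latent.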
Markov Process
- A continuous-time stochastic process
- Obeys the Markov property: the future depends only on the present state
Markov Chain
- A discrete-time process
- A set of parameters that represents a Markov process
- A triple of Σ (set of possible states), T (transition matrix), π (initial distribution over the states)
First-order Markov Chain
p(xn+1 | xn, xn-1, …, x1, x0) = p(xn+1 | xn)
Markov Property
A sequence in which the distribution for xn+1 depends only on the current value xn, not on the earlier history
Latent variable
A variable that can’t be seen or measured in the data
Probability
Number between 0 and 1 which measures the chance of an event occurring
Conditional probability
- Chance of observing an event y given that an event x has occurred
- p(y|x) = p(x,y)/p(x)
Joint probability
- Chance that a collection of events occur together
- p(x = X, y = Y, z = Z) = p(x,y,z)
Marginal probability
- Chance of observing a random variable in a particular state, irrespective of the values of the other variables
- Obtained by summing (or integrating) the joint distribution over the other variables: p(x) = Σy p(x,y)
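A minimal sketch tying the joint, marginal, and conditional cards together; the joint distribution over the two binary variables is made up for illustration:

```python
# Hypothetical joint distribution p(x, y) over two binary variables.
joint = {("x0", "y0"): 0.1, ("x0", "y1"): 0.3,
         ("x1", "y0"): 0.2, ("x1", "y1"): 0.4}

# Marginal probability: sum the joint over the other variable.
p_x = {}
for (x, _), p in joint.items():
    p_x[x] = p_x.get(x, 0.0) + p

# Conditional probability: p(y|x) = p(x, y) / p(x)
p_y1_given_x0 = joint[("x0", "y1")] / p_x["x0"]
```

Here p(x0) = 0.1 + 0.3 = 0.4, so p(y1|x0) = 0.3 / 0.4 = 0.75.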
Random variable
A variable whose value is subject to uncertainty or chance
Discrete random variable
A random variable that can only take a countable number of values. For example, a coin toss
Continuous random variable
A random variable where the data can take infinitely many values. For example, measuring computation time
Bayes Theorem
- P(A|B) = P(B|A) P(A)/P(B)
- P(A|B) = likelihood of A given B
- P(B|A) = likelihood of B given A
- P(A) = prior
- P(B) = evidence
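A worked numeric instance of the theorem. The diagnostic-test numbers below are hypothetical, chosen only to make the prior/likelihood/evidence roles concrete:

```python
# Hypothetical diagnostic test (all numbers made up).
p_disease = 0.01                  # prior P(A)
p_pos_given_disease = 0.95        # likelihood P(B|A)
p_pos_given_healthy = 0.05        # false-positive rate

# Evidence P(B) via the law of total probability.
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Posterior P(A|B) = P(B|A) P(A) / P(B)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
```

Despite the accurate test, the posterior is only about 0.16 because the prior is so small.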
Supervised classification
- Determining a class of unlabelled data using a model that is learned by examining labelled data
- For example: spam detection
Unsupervised classification
- Determining a class of unlabelled data using a model that is learned by examining unlabelled data
- For example: clustering (e.g. k-means)
Closed form solution
- A formula that gives the optimal parameter values directly from the data, rather than by iterative search
- In the context of line fitting: the optimal slope m and intercept c of a straight line can be computed directly rather than searched for
Regression
- Fitting a model to a set of data, in order to explore the relationship between dependent and independent variables.
- For example, tuning parameters of a simulation to match real measurements
Uniform distribution
- P(X = x|N) = 1/N, x = 1,2,3,…,N
- Where N is an integer
- Equal chance of each outcome
- For example, rolling a die
Binomial coefficient
The number of ways to choose x items from a set of n, ignoring order: C(n, x) = n!/(x!(n - x)!)
Binomial distribution
- The distribution of the number of successes in a fixed number of independent Bernoulli trials
> X = 1 with probability p
> X = 0 with probability 1 - p
> 0 ≤ p ≤ 1
> E[X] = 1·p + 0·(1 - p) = p
> Var[X] = p(1 - p)
- n independent Bernoulli trials
> Ai = {X = 1 on the ith trial}, i = 1,2,…,n
- P(X = x) = C(n, x) p^x (1 - p)^(n - x), x = 0,1,…,n
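The binomial PMF, P(X = x) = C(n, x) p^x (1 - p)^(n - x), can be computed directly with the standard-library binomial coefficient; the n = 10, p = 0.3 values are just example inputs:

```python
from math import comb

def binom_pmf(x, n, p):
    """P(X = x) for X ~ Binomial(n, p): x successes in n trials."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

# Sanity check: the PMF over x = 0..n sums to 1.
total = sum(binom_pmf(x, 10, 0.3) for x in range(11))
```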
Negative binomial distribution
The distribution of the number of trials needed to get a fixed number of successes
- x = number of trials
- p = chance of success on each trial
- r = required number of successes
- P(X = x) = C(x - 1, r - 1) p^r (1 - p)^(x - r), x = r, r+1, …
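A sketch of the negative binomial PMF, P(X = x) = C(x - 1, r - 1) p^r (1 - p)^(x - r): the r-th success arrives on trial x, so the preceding x - 1 trials must contain exactly r - 1 successes. The r = 2, p = 0.3 values below are example inputs:

```python
from math import comb

def negbinom_pmf(x, r, p):
    """P(X = x): probability the r-th success occurs on trial x (x >= r)."""
    return comb(x - 1, r - 1) * p**r * (1 - p)**(x - r)
```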
Poisson distribution
- Often used to model waiting for an event
- Single parameter λ, referred to as the intensity (the average number of events per unit time)
- P(X = x) = e^(-λ) λ^x / x!, x = 0,1,2,…
- P(X ≥ x) = 1 - P(X ≤ x - 1)
- P(X ≥ 1) = 1 - P(X = 0)
- Example:
> Website accessed on average 5 times every 3 minutes
> Probability of no accesses in the next minute?
> Random variable X = number of accesses in a minute
> Poisson distribution, λ = 5/3
> P (no accesses in the next minute) = P(X=0)
> P(X = 0) = e^(-5/3) (5/3)^0 / 0! = e^(-5/3) ≈ 0.19
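The website example above can be checked numerically with the Poisson PMF:

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """P(X = x) for X ~ Poisson(lam)."""
    return exp(-lam) * lam**x / factorial(x)

# 5 accesses per 3 minutes -> lam = 5/3 accesses per minute.
p_no_access = poisson_pmf(0, 5 / 3)   # equals e^(-5/3)
```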
Hypergeometric distribution
- No replacement, unlike binomial distribution
- N = Total population
- M = Number of successes
- K = Sample size
- x = Number of successes in sample
- P(X = x) = C(M, x) C(N - M, K - x) / C(N, K)
- P(X > x) = 1 - P(X ≤ x)
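A sketch of the hypergeometric PMF using the parameters named above; the N = 20, M = 7, K = 5 values are example inputs:

```python
from math import comb

def hypergeom_pmf(x, N, M, K):
    """P(X = x): x successes in a sample of K drawn without replacement
    from a population of N that contains M successes."""
    return comb(M, x) * comb(N - M, K - x) / comb(N, K)

# Upper tail via the complement: P(X > 1) = 1 - P(X <= 1).
p_gt_1 = 1 - sum(hypergeom_pmf(x, 20, 7, 5) for x in range(2))
```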
Geometric distribution
- The simplest of the waiting time distributions
- A special case of the negative binomial: the distribution of the number of trials needed to get the first success
- n = number of the trial on which the first success occurs
- p = probability of success on each trial
- P(X = n) = (1 - p)^(n - 1) p, n = 1,2,3,…
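The geometric PMF is short enough to write inline, P(X = n) = (1 - p)^(n - 1) p: n - 1 failures followed by one success. The p = 0.3 value below is an example input:

```python
def geom_pmf(n, p):
    """P(first success occurs on trial n) for a geometric distribution."""
    return (1 - p)**(n - 1) * p
```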
Feature vector
An n-dimensional vector of numerical features that represent an object (for example, occurrence of a word).
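As a sketch of the word-occurrence example: a bag-of-words feature vector with one dimension per vocabulary word. The vocabulary and document below are made up:

```python
# Hypothetical vocabulary and document.
vocab = ["free", "money", "meeting", "report"]
doc = "free money free offer"

# Feature vector: count of each vocabulary word in the document.
features = [doc.split().count(w) for w in vocab]
```

Words outside the vocabulary (here "offer") contribute nothing to the vector.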