Machine Learning Flashcards
Supervised ML
learn a prediction rule from a set of labelled samples
classification
set of possible labels is finite; output is discrete and categorical
binary classification
has two possible labels (+1/-1)
multi-class classification
has more than two (but finitely many) labels
regression
set of possible labels is infinite and output is continuous
batch learning
given a training set of labelled samples, predict labels for the samples in a test set
online learning
see a sample, work out a predicted label, check the true label, carry on
training/exploration stage
analyse training set
exploitation stage
apply hypothesis to test data
deduction induction transduction triangle
- data -> hypothesis: induction
- hypothesis -> unknown: deduction
- data -> unknown: transduction
IID assumption
independent identically distributed
labelled samples (xi, yi) are assumed to be generated independently from the same probability measure
feature
attribute / component of the dataset which represents a sample
label
categorises the sample into a certain class, the thing we’re trying to predict
conformal prediction
given a training set and test sample, try in turn each potential label for the test sample
for each label we look at how plausible the augmented training set is under IID assumption
use p-value for this
conformity measure
evaluates how well the new test sample conforms to the observed training data, giving an equivariant conformity score
p-value
evaluates the implausibility of the augmented training set; a small p-value indicates an implausible postulated label
non conformity measure
similar to a conformity measure, but measures how strange (non-conforming) the test data is compared to the training data
average false p-value
the average of the p-values of all postulated labels for the test set except the true labels; smaller indicates a more efficient predictor
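the conformal prediction cards above can be sketched in code. a toy example (not from the cards: the 1-nearest-neighbour conformity score and the toy data are illustrative assumptions), where the conformity score is the negative distance to the nearest training sample with the same postulated label, and the p-value is the fraction of augmented-set scores at most the test sample's score:

```python
# Toy conformal prediction sketch (illustrative, assumed setup).
# Conformity score: negative distance to the nearest other sample
# carrying the same label (higher = more conforming).
def conformity(sample, label, others):
    same = [x for x, y in others if y == label]
    if not same:
        return 0.0
    return -min(abs(x - sample) for x in same)

def p_value(train, test_x, postulated_y):
    # Augment the training set with the test sample plus postulated label,
    # score every sample against the rest, and return the fraction of
    # scores that are at most the test sample's score.
    augmented = train + [(test_x, postulated_y)]
    scores = []
    for i, (x, y) in enumerate(augmented):
        others = augmented[:i] + augmented[i + 1:]
        scores.append(conformity(x, y, others))
    test_score = scores[-1]
    return sum(s <= test_score for s in scores) / len(augmented)

train = [(1.0, 'a'), (1.2, 'a'), (5.0, 'b'), (5.3, 'b')]
print(p_value(train, 1.1, 'a'))  # label 'a' looks plausible (large p-value)
print(p_value(train, 1.1, 'b'))  # label 'b' looks implausible (small p-value)
```

trying each potential label in turn and keeping those with large p-values gives the conformal prediction set for the test sample.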
training accuracy
accuracy on training set
generalisation accuracy
accuracy on the test set, i.e. how well the model predicts unseen data after training on the training set
overfitting
when the model learns the details and noise of the training set so closely that it fails to generalise, negatively impacting performance on the test set
high training accuracy low generalisation accuracy
underfitting
when the model is too simple to capture the underlying patterns of the training data, so it cannot make accurate predictions on the test data
low training accuracy low generalisation accuracy
learning curve
plot of accuracy vs the size n of the training set
use: understanding cross-validation (CV)
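a learning curve can be traced by retraining on growing prefixes of the training set and measuring test accuracy at each size. a minimal sketch (the 1-nearest-neighbour classifier and the toy data are illustrative assumptions, not from the cards):

```python
# Hypothetical learning-curve sketch: accuracy of a 1-nearest-neighbour
# classifier as the training set grows.
def nn_predict(train, x):
    # Predict the label of the nearest training sample.
    return min(train, key=lambda s: abs(s[0] - x))[1]

def accuracy(train, test):
    return sum(nn_predict(train, x) == y for x, y in test) / len(test)

data = [(0.1, 'a'), (0.2, 'a'), (0.3, 'a'), (0.9, 'b'), (1.0, 'b'), (1.1, 'b')]
test = [(0.15, 'a'), (0.95, 'b')]

# One (n, accuracy) point per training-set size n.
curve = [(n, accuracy(data[:n], test)) for n in range(1, len(data) + 1)]
print(curve)
```

accuracy typically rises with n; plotting `curve` gives the learning curve.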
RSS
residual sum of squares
aims to minimise the sum of squared residuals
a residual is the difference between the true and predicted label
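the RSS card amounts to a one-line computation. a minimal sketch (the toy labels and predictions are illustrative):

```python
# Residual sum of squares: each residual is true minus predicted label,
# and RSS is the sum of the squared residuals.
def rss(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred))

print(rss([3.0, 5.0, 7.0], [2.5, 5.0, 8.0]))  # 0.25 + 0.0 + 1.0 = 1.25
```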