Machine Learning Flashcards
Supervised ML
learn from a set of labelled samples in order to provide labels for new samples
classification
the set of possible labels is finite; the output is discrete and categorical
binary classification
has two possible labels (+1/-1)
multi-class classification
has more than two (but finitely many) labels
regression
set of possible labels is infinite and output is continuous
batch learning
given a training set of labelled samples, work out the labels of the samples in a test set
online learning
see a sample, work out a predicted label, check the true label, carry on
training/exploration stage
analyse training set
exploitation stage
apply hypothesis to test data
deduction induction transduction triangle
- data -> hypothesis: induction
- hypothesis -> unknown: deduction
- data -> unknown: transduction
IID assumption
independent identically distributed
labelled samples (xi, yi) are assumed to be generated independently from the same probability measure
feature
attribute / component of the dataset which represents a sample
label
categorises the sample into a certain class, the thing we’re trying to predict
conformal prediction
given a training set and test sample, try in turn each potential label for the test sample
for each label, we look at how plausible the augmented training set is under the IID assumption
use p-value for this
conformity measure
evaluates how well the newly observed test sample conforms to the existing training data; the conformity scores must be equivariant (permuting the samples permutes the scores in the same way)
p-value
evaluates the implausibility of the augmented training set: the fraction of samples whose conformity score is at most that of the test sample; a small p-value makes the postulated label implausible
non conformity measure
similar to a conformity measure, but measures how strange (non-conforming) the test sample is compared to the training data
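A minimal sketch of how these pieces fit together, assuming a toy nonconformity measure (distance to the nearest other sample with the same label); the function name and the measure are illustrative choices, not a fixed API:

```python
import numpy as np

def conformal_p_values(X_train, y_train, x_test, labels):
    """Full conformal prediction: try each postulated label in turn and
    compute the p-value of the augmented training set."""
    p_values = {}
    for y in labels:
        X = np.vstack([X_train, x_test])   # augmented training set
        ys = np.append(y_train, y)
        n = len(ys)
        alphas = np.empty(n)               # nonconformity scores
        for i in range(n):
            dists = np.linalg.norm(X - X[i], axis=1)
            dists[i] = np.inf              # exclude the sample itself
            dists[ys != ys[i]] = np.inf    # keep same-label samples only
            alphas[i] = dists.min()        # nearest same-label neighbour
        # fraction of samples at least as strange as the test sample
        p_values[y] = np.mean(alphas >= alphas[-1])
    return p_values
```

A label with a small p-value is implausible under the IID assumption.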
average false p-value
the average of the p-values of all postulated labels over the test set, excluding the true labels
training accuracy
accuracy on training set
generalisation accuracy
how well the model is able to accurately predict on the test set after training on the training set
overfitting
when the model learns so much of the detail and noise of the training set that it cannot generalise to the test set, negatively impacting test performance
high training accuracy low generalisation accuracy
underfitting
when the model is too simple to capture the underlying patterns of the training data, so it cannot make accurate predictions on the test data
low training accuracy low generalisation accuracy
learning curve
plot of the accuracy vs size of training set n
use: understanding cross-validation
RSS
residual sum of squares: RSS = ∑i (yi − ŷi)²
a residual yi − ŷi is the difference between the true and predicted label
aim to minimise the residuals
feature engineering
adding derived features to the training set at will, for a more accurate model
TSS
total sum of squares
total sum of squares: TSS = ∑i (yi − ȳ)², the sum of squared differences between the true labels and their mean
R^2
1 - RSS / TSS
the % of variability in the label explained by the features
a high value on the training set is still compatible with poor performance on the test set, due to possible overfitting
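A direct translation of the formulas above into NumPy, as a sketch:

```python
import numpy as np

def r_squared(y_true, y_pred):
    rss = np.sum((y_true - y_pred) ** 2)           # residual sum of squares
    tss = np.sum((y_true - np.mean(y_true)) ** 2)  # total sum of squares
    return 1 - rss / tss                           # R^2
```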
regularisation
shrink the coefficients so that each feature has as little effect on the outcome as possible, to avoid overfitting
α regularisation parameter
α = 0: ridge regression coincides with least squares; no penalty applied, coefficient sizes unconstrained, model stays complex -> overfitting
α -> ∞: coefficient estimates forced to shrink towards 0; the shrinkage penalty dominates and RSS matters less; model too simple, may lead to underfitting
Lasso
Least absolute shrinkage and selection operator
-L1 penalty instead of the L2/Euclidean norm
-minimises RSS
sets many w[j] coefficients to 0
-LASSO performs model selection: a sparse model involving only some of the features
- use when only a few of the features are expected to be important
-useful when interpretability of the model is important
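A small comparison of the two penalties in scikit-learn; the dataset is synthetic and the alpha values are arbitrary choices for illustration:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# 20 features, but only 3 actually carry signal
X, y = make_regression(n_samples=100, n_features=20, n_informative=3,
                       noise=5.0, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)  # L2 penalty: shrinks all coefficients
lasso = Lasso(alpha=1.0).fit(X, y)  # L1 penalty: sets many to exactly 0

print((ridge.coef_ != 0).sum())  # typically all 20 non-zero
print((lasso.coef_ != 0).sum())  # typically only a few non-zero (sparse)
```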
method chaining
concatenation of method calls
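For example (the matrix below is a made-up stand-in for real training data):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])

# fit() returns the fitted estimator itself, so calls can be chained
X_scaled = StandardScaler().fit(X_train).transform(X_train)
# equivalent shortcut combining both calls
X_scaled = StandardScaler().fit_transform(X_train)
```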
data normalisation
measuring all features of the dataset on the same scale to ensure compatibility with the model
normalisation - least squares
not essential
if first feature x[0] measured in metres, w^[0] will be the corresponding Least Squares estimate
if instead x[0] is measured in km, all xi[0] will decrease 1000 fold
if you run Least Squares on the new dataset, w^[0] will increase 1000 fold, so predictions don't change
normalisation - ridge/lasso
essential
due to the presence of penalty terms that are the same for all variables, so predictions will change
normalising features prevents larger features from being unfairly penalised by the penalty terms
StandardScaler
for each feature mean 0 standard deviation 1
1) shift each feature down by its mean
2) divide each feature by its SD
RobustScaler
for each feature median 0 IQR 1
1) shift each feature down by its median
2) divide each feature by its IQR
MinMaxScaler
shift and scale each feature so that its values range between 0 and 1
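A quick way to see the three scalers side by side, on a tiny made-up matrix:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 1000.0]])

print(StandardScaler().fit_transform(X))  # per feature: mean 0, SD 1
print(RobustScaler().fit_transform(X))    # per feature: median 0, IQR 1
print(MinMaxScaler().fit_transform(X))    # per feature: values in [0, 1]
```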
data snooping
when the test set is used for developing the model:
the test set leaks into the model
inaccurate normalisation
leads to data snooping
affects the transformation of the data
can lead to overfitting/underfitting
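A sketch of the trap, with made-up arrays standing in for real data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[10.0]])

# WRONG: statistics computed on training + test data; the test set
# leaks into the transformation (data snooping)
bad = StandardScaler().fit(np.vstack([X_train, X_test]))

# RIGHT: fit on the training set only, then apply to both parts
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)
```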
Normalizer
each sample divided by its Euclidean norm
parameter selection
split the training set further into a smaller training set and a validation set (used for model checking); select the best parameters on the validation set, then evaluate on the test set
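A minimal sketch of this split, assuming an SVM whose parameter C is being selected; the synthetic dataset and candidate grid are arbitrary:

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X_train, y_train = make_blobs(n_samples=80, centers=2, random_state=0)

# carve a validation set out of the training set; test set stays untouched
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, random_state=0)

best_score, best_C = -1.0, None
for C in [0.01, 0.1, 1, 10, 100]:
    score = SVC(C=C).fit(X_tr, y_tr).score(X_val, y_val)
    if score > best_score:
        best_score, best_C = score, C

# retrain on the full training set with the selected parameter
final_model = SVC(C=best_C).fit(X_train, y_train)
```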
inductive conformity measure
A : Z* x Z -> R
A(C~, z) says how well z conforms to C~
no analogue of the equivariance requirement
kernel
a function that lets linear algorithms solve non-linear problems
take a feature mapping F : X -> H of X = sample space into H = feature space equipped with a dot product; this feature mapping turns into the kernel K(x, x') = F(x) . F(x')
kernel trick
- write the algorithm so that all xs appear only in dot products
- replace the dot products with kernels
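A worked example for the degree-2 polynomial kernel in two dimensions, where the feature mapping can still be written out explicitly:

```python
import numpy as np

def phi(x):
    """Explicit feature map for K(x, z) = (x . z)^2 in 2 dimensions:
    (a, b) -> (a^2, b^2, sqrt(2)*a*b)."""
    a, b = x
    return np.array([a * a, b * b, np.sqrt(2) * a * b])

x, z = np.array([1.0, 2.0]), np.array([3.0, 4.0])

print(phi(x) @ phi(z))  # dot product in feature space: 121.0
print((x @ z) ** 2)     # kernel computed directly: 121.0, no mapping needed
```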
kernel features
-symmetric: K(x, x') = K(x', x)
-positive definite: ∑i ∑j ai aj K(xi, xj) ≥ 0 for all a1, …, an
decay factor lambda
used to weight gaps between the substrings
each substring corresponds to a coordinate of the feature space; the value of the coordinate depends on how frequently and compactly the substring is embedded in the text
c
length of subsequences taken into account
activation function
np.tanh
a function nicely mapping the real line R to (-1, 1)
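For instance:

```python
import numpy as np

# tanh squashes any real input into the open interval (-1, 1);
# extreme inputs saturate towards -1 and +1
print(np.tanh(np.array([-100.0, -1.0, 0.0, 1.0, 100.0])))
```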
separating hyperplane
in the p-dimensional space R^p, a flat affine subspace of dimension p-1
separates two classes
linear scoring function
a linear function that models the hyperplane for samples in a p-dimensional space; it separates the 2 classes
score less than 0 = predict negative
score greater than 0 = predict positive
margin
the smallest perpendicular distance from the training samples to the separating hyperplane
maximum margin hyperplane / optimal separating hyperplane
the separating hyperplane that is farthest from the training samples, i.e. the one with the largest margin.
maximum margin classifier
classifying a test sample based on which side of the maximum margin hyperplane it lies on
support vectors
vectors in the p-dimensional space R^p that lie closest to the maximum margin hyperplane
if they are moved slightly, the maximum margin hyperplane moves as well
they are equidistant from the hyperplane; the region between them is the slab, and a larger slab means greater confidence
in the soft-margin case they lie directly on the margin or on its wrong side
soft margin classifier
the hyperplane that almost separates the classes using a soft margin
sacrifices perfect separation for greater robustness to individual samples and better classification of most of the training samples
soft because it allows violations by some of the training observations
solution to optimisation problem slide 8:46
slack variables
allow individual training samples to be on the wrong side of the margin or hyperplane
tuning parameter C
determines number and severity of violations to the margin and hyperplane that we tolerate
C = ∞ -> no violations tolerated, slack variables must be 0; coincides with the old maximal margin classifier
C = 0 -> prioritise maximising margin, tolerate all violations
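A sketch of the effect in scikit-learn, whose C parameter follows the same convention (large C penalises violations heavily); the dataset is synthetic:

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=60, centers=2, random_state=0)

# large C: violations expensive, narrow margin, close to a hard-margin SVM
# small C: violations cheap, wide margin, more tolerant classifier
for C in [100.0, 0.01]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(C, clf.n_support_)  # typically more support vectors for small C
```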
Pipeline
glues multiple processing steps into a single scikit-learn estimator
fit = train the model on the training data: transform the data, then fit the SVM
score = evaluate on the test data
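A minimal sketch, with a synthetic dataset standing in for real data:

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([("scaler", StandardScaler()), ("svm", SVC())])
pipe.fit(X_train, y_train)         # fits the scaler, transforms, fits the SVM
print(pipe.score(X_test, y_test))  # transforms the test data, then evaluates
```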
cross-conformal predictor
≈ a full conformal predictor, but computed over folds:
p(y) = (sum of ranks over all folds + 1)/(n + 1)
calculate conformity scores
rank the scores
subtract 1 from each rank
repeat for all folds
add up all the ranks
add 1
divide by n + 1 to get the p-value
repeat for all postulated labels
point prediction = label with the highest p-value
confidence = 1 - second-highest p-value
credibility = highest p-value
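A sketch of the recipe above for one postulated label; here `conformity` is a hypothetical inductive conformity measure with the assumed signature conformity(rest_X, rest_y, x, y):

```python
import numpy as np
from sklearn.model_selection import KFold

def cross_conformal_p(conformity, X, y, x_test, label, n_folds=5):
    """Cross-conformal p-value for one postulated label."""
    total_rank = 0
    for rest_idx, fold_idx in KFold(n_folds).split(X):
        # score the fold's samples and the test sample against the rest
        fold_scores = [conformity(X[rest_idx], y[rest_idx], X[i], y[i])
                       for i in fold_idx]
        test_score = conformity(X[rest_idx], y[rest_idx], x_test, label)
        # rank = number of fold scores at most the test score
        total_rank += sum(s <= test_score for s in fold_scores)
    # sum of ranks over all folds, plus 1, divided by n + 1
    return (total_rank + 1) / (len(y) + 1)
```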
set predictor
a predictor that outputs prediction sets rather than point predictions; takes a significance level as a parameter
calibration curve
error rate vs significance level
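Computing the curve from the p-values of the true test labels (the values below are made up); for a valid conformal predictor the curve stays close to the diagonal:

```python
import numpy as np

# p-values of the true labels on a test set (hypothetical numbers)
p_true = np.array([0.42, 0.03, 0.77, 0.15, 0.61, 0.08, 0.90, 0.25])

# an error occurs at significance level eps when the true label is
# excluded from the prediction set, i.e. its p-value is <= eps
eps = np.linspace(0, 1, 11)
error_rate = np.array([(p_true <= e).mean() for e in eps])
print(np.column_stack([eps, error_rate]))
```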