Machine Learning Flashcards

1
Q

Supervised ML

A

learn from a set of labelled samples; each sample is provided with a label, and the goal is to predict labels for new samples

2
Q

classification

A

set of possible labels is finite; the output is discrete and categorical

3
Q

binary classification

A

has two possible labels (+1/-1)

4
Q

multi-class classification

A

has more than two (but finitely many) labels

5
Q

regression

A

set of possible labels is infinite and output is continuous

6
Q

batch learning

A

given a training set of labelled samples, work out labels for the samples in a test set

7
Q

online learning

A

see a sample, work out a predicted label, check the true label, carry on

8
Q

training/exploration stage

A

analyse training set

9
Q

exploitation stage

A

apply hypothesis to test data

10
Q

deduction induction transduction triangle

A
  • data -> hypothesis: induction
  • hypothesis -> unknown: deduction
  • data -> unknown: transduction
11
Q

IID assumption

A

independent identically distributed

labelled samples (xi, yi) are assumed to be generated independently from the same probability measure

12
Q

feature

A

attribute / component of the dataset which represents a sample

13
Q

label

A

categorises the sample into a certain class, the thing we’re trying to predict

14
Q

conformal prediction

A

given a training set and a test sample, try in turn each potential label for the test sample

for each label, look at how plausible the augmented training set is under the IID assumption

use the p-value for this

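The p-value idea on this card can be sketched numerically. This is an illustrative sketch, not course code; it assumes a generic conformity score where higher means more conforming, and `conformal_p_value` is a hypothetical helper name:

```python
import numpy as np

def conformal_p_value(train_scores, test_score):
    # p-value of a postulated label: fraction of conformity scores in the
    # augmented training set that are at most the test sample's score
    scores = np.append(train_scores, test_score)  # augmented training set
    return np.sum(scores <= test_score) / len(scores)

# a poorly conforming test sample gets a small p-value -> label implausible
p_low = conformal_p_value(np.array([5.0, 6.0, 7.0]), 1.0)   # 1/4
p_high = conformal_p_value(np.array([1.0, 2.0, 3.0]), 4.0)  # 4/4
```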
15
Q

conformity measure

A

evaluates how well the newly observed test data conforms to the existing training data, giving an equivariant conformity score

16
Q

p-value

A

evaluates the plausibility of the augmented training set (a small p-value means implausible)

17
Q

non conformity measure

A

similar to a conformity measure, but measures how strange (non-conforming) the test data is compared to the training data

18
Q

average false p-value

A

the average of the p-values of all postulated labels over the test set, excluding the true labels

19
Q

training accuracy

A

accuracy on training set

20
Q

generalisation accuracy

A

how well the model is able to accurately predict on the test set after training on the training set

21
Q

overfitting

A

when the model learns so much of the detail and noise of the training set that it is not able to generalise, negatively impacting performance on the test set.

high training accuracy low generalisation accuracy

22
Q

underfitting

A

when the model does not capture enough of the underlying patterns of the training data because it is too simple, so it cannot make accurate predictions on test data

low training accuracy low generalisation accuracy

23
Q

learning curve

A

plot of accuracy vs the size n of the training set
use: understanding cross-validation (CV)

24
Q

RSS

A

residual sum of squares
aim to minimise the residuals
a residual is the difference between the true and predicted label

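As a quick numeric check of the definition (values chosen purely for illustration):

```python
import numpy as np

y = np.array([3.0, 5.0, 7.0])      # true labels
y_hat = np.array([2.5, 5.0, 8.0])  # predicted labels
residuals = y - y_hat              # differences between true and predicted labels
rss = np.sum(residuals ** 2)       # 0.5^2 + 0^2 + 1^2 = 1.25
```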
25
Q

feature engineering

A

adding derived features to the training set at will, for more accurate predictions from the model

26
Q

TSS

A

total sum of squares
sum of squared differences between each true label and the mean of the true labels

27
Q

R^2

A

1 - RSS / TSS

% of variability in the label explained by data

a high value is still compatible with poor performance on test data, due to possible overfitting
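The formula 1 - RSS / TSS translates directly to code (a small sketch; `r_squared` is an illustrative helper name, not a course function):

```python
import numpy as np

def r_squared(y, y_hat):
    rss = np.sum((y - y_hat) ** 2)       # residual sum of squares
    tss = np.sum((y - np.mean(y)) ** 2)  # total sum of squares
    return 1 - rss / tss

# perfect predictions give R^2 = 1; always predicting the mean gives R^2 = 0
y = np.array([1.0, 2.0, 3.0])
r2_perfect = r_squared(y, y)
r2_mean = r_squared(y, np.full(3, 2.0))
```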

28
Q

regularisation

A

shrink the coefficients so that each feature has as little effect on the outcome as possible, to avoid overfitting

29
Q

α regularisation parameter

A

α = 0: ridge regression coincides with least squares; no penalty applied, the size of the coefficients doesn't change, the model stays complex -> overfitting

α -> ∞: coefficient estimates are forced to shrink towards 0; the shrinkage penalty dominates and the RSS becomes less important; the model may become too simple -> underfitting
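The shrinkage effect of α can be seen with the closed-form ridge solution w = (XᵀX + αI)⁻¹Xᵀy (a sketch without an intercept, for simplicity; `ridge_coef` is an illustrative name):

```python
import numpy as np

def ridge_coef(X, y, alpha):
    # closed-form ridge regression solution (no intercept, for simplicity)
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T @ y)

X = np.array([[1.0], [2.0], [3.0]])
y = np.array([1.0, 2.0, 3.0])
w0 = ridge_coef(X, y, alpha=0.0)       # alpha = 0 coincides with least squares: w = 1
w_big = ridge_coef(X, y, alpha=100.0)  # large alpha shrinks the coefficient towards 0
```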

30
Q

Lasso

A

Least Absolute Shrinkage and Selection Operator
- minimises RSS with an L1 penalty instead of the L2/Euclidean norm
- sets many w[j] coefficients exactly to 0
- performs model selection: a sparse model involving only some of the features
- use when only a few of the features are important
- useful when interpretability of the model matters
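The reason lasso sets coefficients exactly to 0 is the soft-thresholding operator used in its coordinate-wise updates (a sketch; `soft_threshold` is an illustrative name, not from the course):

```python
import numpy as np

def soft_threshold(w, t):
    # proximal operator of the L1 penalty: shrinks every coefficient by t
    # and sets those with |w| <= t exactly to 0, producing sparsity
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

out = soft_threshold(np.array([3.0, -0.5, 1.2]), 1.0)  # -> [2.0, 0.0, 0.2]
```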

31
Q

method chaining

A

concatenation of method calls
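A tiny illustration of the idea (each call returns an object on which the next method is called):

```python
text = "  Machine Learning  "
# strip whitespace, lowercase, then replace the space, all in one expression
result = text.strip().lower().replace(" ", "-")  # "machine-learning"
```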

32
Q

data normalisation

A

putting all features of the dataset on the same scale so they are comparable within the model

33
Q

normalisation - least squares

A

not essential
if the first feature x[0] is measured in metres, w^[0] will be the corresponding least squares estimate
if x[0] is instead measured in km, all xi[0] decrease 1000-fold
running least squares on the new dataset, w^[0] increases 1000-fold, so the predictions do not change

34
Q

normalisation - ridge/lasso

A

essential
due to the presence of penalty terms that are the same for all variables, so predictions will change

normalising features prevents larger features from being unfairly penalised by the penalty terms

35
Q

StandardScaler

A

for each feature mean 0 standard deviation 1
1) shift each feature down by its mean
2) divide each feature by its SD
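The two steps above can be mirrored in plain NumPy (a sketch of what StandardScaler computes, not its implementation):

```python
import numpy as np

X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])
# 1) shift each feature down by its mean; 2) divide by its standard deviation
X_std = (X - X.mean(axis=0)) / X.std(axis=0)  # per feature: mean 0, SD 1
```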

36
Q

RobustScaler

A

for each feature median 0 IQR 1
1) shift each feature down by its median
2) divide each feature by its IQR
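The same two steps with median and IQR, which makes the scaling robust to outliers (a NumPy sketch of what RobustScaler computes):

```python
import numpy as np

X = np.array([[1.0], [2.0], [3.0], [100.0]])  # 100 is an outlier
med = np.median(X, axis=0)
iqr = np.percentile(X, 75, axis=0) - np.percentile(X, 25, axis=0)
# 1) shift each feature down by its median; 2) divide by its IQR
X_rob = (X - med) / iqr  # per feature: median 0, IQR 1; the outlier barely affects the scaling
```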

37
Q

MinMaxScaler

A

rescale each feature so its values lie between 0 and 1:
1) shift each feature down by its minimum
2) divide each feature by its range (max - min)
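A NumPy sketch of the min-max rescaling (not the scikit-learn implementation itself):

```python
import numpy as np

X = np.array([[1.0, 5.0],
              [2.0, 7.0],
              [4.0, 9.0]])
# subtract each feature's minimum, then divide by its range
X_mm = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))  # each feature in [0, 1]
```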

38
Q

data snooping

A

when the test set is used for developing the model.

the test set leaks into the model
e.g. normalising with statistics computed on the full dataset (inaccurate normalisation) is data snooping
it affects the transformation of the data
and can lead to overfitting/underfitting

39
Q

Normalizer

A

each sample divided by its Euclidean norm
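Note the difference from the scalers above: Normalizer works per sample (row), not per feature. A NumPy sketch:

```python
import numpy as np

X = np.array([[3.0, 4.0],
              [1.0, 0.0]])
# divide each row by its Euclidean norm, so every sample has norm 1
X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)  # first row -> [0.6, 0.8]
```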

40
Q

parameter selection

A

split the training set further, holding out a validation set on which the best parameters are selected; the chosen model is then evaluated on the test set

41
Q

inductive conformity measure

A

A : Z* × Z -> R
A(C~, z) says how well z conforms to C~
there is no analogue of the equivariance requirement

42
Q

kernel

A

a function that lets linear methods solve non-linear problems
take a feature mapping F : X -> H from the sample space X into a feature space H equipped with a dot product; the kernel computes K(x, x') = F(x) · F(x')

43
Q

kernel trick

A
  • write the algorithms so that all xs can only appear in dot products
  • replace the dot products with kernels
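The two steps can be checked numerically: for the polynomial kernel K(x, x') = (x · x')², the explicit degree-2 feature map gives the same value (illustrative example):

```python
import numpy as np

def F(v):
    # explicit feature map for K(x, x') = (x . x')^2 in 2 dimensions
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

x, z = np.array([1.0, 2.0]), np.array([3.0, 1.0])
explicit = F(x) @ F(z)  # dot product computed in feature space
kernel = (x @ z) ** 2   # same value, without ever building F
```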
43
Q

kernel features

A

- symmetric: K(x, x') = K(x', x)
- positive definite: ∑i ∑j ai aj K(xi, xj) ≥ 0

44
Q

decay factor lambda

A

used to weight the gaps between the substrings

each substring is a dimension of the feature space; the value of that coordinate depends on how frequently and compactly the substring is embedded in the text

44
Q

c

A

length of subsequences taken into account

45
Q

activation function

A

np.tanh

function nicely mapping the real line R to (-1,1)
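A quick look at the mapping (input values chosen for illustration):

```python
import numpy as np

x = np.array([-10.0, 0.0, 10.0])
y = np.tanh(x)  # squashes R into (-1, 1): large inputs saturate near +/-1
```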

46
Q

separating hyperplane

A

in the p-dimensional space R^p, a flat affine subspace of dimension p-1

separates two classes

47
Q

linear scoring function

A

a function that models the hyperplane for samples in a p-dimensional space; it separates 2 classes
if the score is less than 0: negative class
if more than 0: positive class

48
Q

margin

A

the shortest perpendicular distance from each of the training samples to the separating hyperplane

49
Q

maximum margin hyperplane / optimal separating hyperplane

A

the separating hyperplane that is farthest from the training samples, i.e. the one with the largest margin.

50
Q

maximum margin classifier

A

classifying a test sample based on which side of the maximum margin hyperplane it lies in

51
Q

support vectors

A

vectors in the p-dimensional space R^p that lie closest to the maximum margin hyperplane

if they moved slightly, the maximum margin hyperplane would move as well

they are equidistant from the hyperplane

the region between the support vectors and the hyperplane = slab; a larger slab = greater confidence

in the soft-margin case they lie directly on the margin or on the wrong side of it

52
Q

soft margin classifier

A

the hyperplane that almost separates the classes, using a soft margin

narrower margin for some points, but greater robustness and better classification of most training samples

soft because it allows violations by some of the training observations

solution to the optimisation problem on slide 8:46

53
Q

slack variables

A

allow individual training samples to be on the wrong side of the margin or hyperplane

54
Q

tuning parameter C

A

determines the number and severity of violations to the margin and hyperplane that we tolerate

C = ∞ -> no violations tolerated, slack variables must be 0, reduces to the maximal margin classifier

C = 0 -> prioritise maximising the margin, tolerate all violations

55
Q

Pipeline

A

glues multiple processing steps into a single scikit-learn estimator

fit = train the model on the training data, by transforming the data and then fitting the SVM
score = evaluate on the test data
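A minimal sketch of the idea behind Pipeline, with toy steps (TinyPipeline, Center and MajorityClass are invented here for illustration; the real class is sklearn.pipeline.Pipeline):

```python
import numpy as np

class Center:
    """Toy transformer: shift each feature down by its training mean."""
    def fit(self, X, y=None):
        self.mean_ = X.mean(axis=0)
        return self
    def transform(self, X):
        return X - self.mean_

class MajorityClass:
    """Toy estimator: always predict the most frequent training label."""
    def fit(self, X, y):
        labels, counts = np.unique(y, return_counts=True)
        self.label_ = labels[np.argmax(counts)]
        return self
    def score(self, X, y):
        return np.mean(np.asarray(y) == self.label_)

class TinyPipeline:
    """Chain transformers and end with an estimator, like sklearn's Pipeline."""
    def __init__(self, steps):
        self.steps = steps  # list of (name, step) pairs
    def fit(self, X, y):
        for _, step in self.steps[:-1]:
            X = step.fit(X, y).transform(X)  # fit and apply each preprocessing step
        self.steps[-1][1].fit(X, y)          # fit the final estimator
        return self
    def score(self, X, y):
        for _, step in self.steps[:-1]:
            X = step.transform(X)            # reuse the fitted transformations
        return self.steps[-1][1].score(X, y)

pipe = TinyPipeline([("center", Center()), ("clf", MajorityClass())])
pipe.fit(np.array([[1.0], [2.0], [3.0]]), np.array([0, 0, 1]))
```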

56
Q

cross-conformal predictor

A

≈ same as a full conformal predictor, but
p(y) = (sum of all ranks + 1) / (n + 1)

calculate the conformity scores
rank the scores, subtract 1 from each rank
repeat for all folds
add all the ranks
add 1
divide by n + 1 for the p-value
repeat for all postulated labels

point prediction = label with the highest p-value
confidence = 1 - second-highest p-value
credibility = highest p-value

57
Q

set predictor

A

predictor that outputs prediction sets rather than point predictions and takes a significance level as a parameter

58
Q

calibration curve

A

error rate vs significance level