Machine Learning Flashcards
Supervised ML
learn from a set of labelled samples in order to provide labels for new samples
classification
the set of possible labels is finite; the output is discrete and categorical
binary classification
has two possible labels (+1/-1)
multi-class classification
has more than two (but finitely many) labels
regression
set of possible labels is infinite and output is continuous
batch learning
given a training set of labelled samples, work out the labels of the samples in a test set
online learning
see a sample, work out a predicted label, check the true label, carry on
training/exploration stage
analyse training set
exploitation stage
apply hypothesis to test data
deduction induction transduction triangle
- data -> hypothesis: induction
- hypothesis -> unknown: deduction
- data -> unknown: transduction
IID assumption
independent identically distributed
labelled samples (xi, yi) are assumed to be generated independently from the same probability measure
feature
attribute / component of the dataset which represents a sample
label
categorises the sample into a certain class, the thing we’re trying to predict
conformal prediction
given a training set and test sample, try in turn each potential label for the test sample
for each label, we look at how plausible the augmented training set is under the IID assumption
use p-value for this
conformity measure
evaluates how well the newly observed test sample conforms to the existing training data; the conformity scores must be equivariant (permuting the samples permutes the scores in the same way)
p-value
evaluates the implausibility of the augmented training set: the fraction of samples whose conformity score is at most that of the test sample; a small p-value makes the postulated label implausible
non conformity measure
similar to a conformity measure, but measures how strange (non-conforming) the test sample is compared to the training data
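A minimal sketch of how these pieces fit together, assuming a toy nonconformity measure (distance to the nearest other sample with the same label); the function name and the measure are illustrative choices, not a fixed API:

```python
import numpy as np

def conformal_p_values(X_train, y_train, x_test, labels):
    """Full conformal prediction: try each postulated label in turn and
    compute the p-value of the augmented training set."""
    p_values = {}
    for y in labels:
        X = np.vstack([X_train, x_test])   # augmented training set
        ys = np.append(y_train, y)
        n = len(ys)
        alphas = np.empty(n)               # nonconformity scores
        for i in range(n):
            dists = np.linalg.norm(X - X[i], axis=1)
            dists[i] = np.inf              # exclude the sample itself
            dists[ys != ys[i]] = np.inf    # keep same-label samples only
            alphas[i] = dists.min()        # nearest same-label neighbour
        # fraction of samples at least as strange as the test sample
        p_values[y] = np.mean(alphas >= alphas[-1])
    return p_values
```

A label with a small p-value is implausible under the IID assumption.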
average false p-value
the average of the p-values of all postulated labels over the test set, excluding the true labels
training accuracy
accuracy on training set
generalisation accuracy
how well the model is able to accurately predict on the test set after training on the training set
overfitting
when the model learns so much of the detail and noise of the training set that it cannot generalise to the test set, negatively impacting test performance
high training accuracy low generalisation accuracy
underfitting
when the model is too simple to capture the underlying patterns of the training data, so it cannot make accurate predictions on the test data
low training accuracy low generalisation accuracy
learning curve
plot of the accuracy vs size of training set n
use: understanding cross-validation
RSS
residual sum of squares: RSS = ∑i (yi − ŷi)²
a residual yi − ŷi is the difference between the true and predicted label
aim to minimise the residuals
feature engineering
adding derived features to the training set at will, for a more accurate model
TSS
total sum of squares
total sum of squares: TSS = ∑i (yi − ȳ)², the sum of squared differences between the true labels and their mean
R^2
1 - RSS / TSS
the % of variability in the label explained by the features
a high value on the training set is still compatible with poor performance on the test set, due to possible overfitting
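A direct translation of the formulas above into NumPy, as a sketch:

```python
import numpy as np

def r_squared(y_true, y_pred):
    rss = np.sum((y_true - y_pred) ** 2)           # residual sum of squares
    tss = np.sum((y_true - np.mean(y_true)) ** 2)  # total sum of squares
    return 1 - rss / tss                           # R^2
```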
regularisation
shrink the coefficients so that each feature has as little effect on the outcome as possible, to avoid overfitting
α regularisation parameter
α = 0: ridge regression coincides with least squares; no penalty applied, coefficient sizes unconstrained, model stays complex -> overfitting
α -> ∞: coefficient estimates forced to shrink towards 0; the shrinkage penalty dominates and RSS matters less; model too simple, may lead to underfitting
Lasso
Least absolute shrinkage and selection operator
-L1 penalty instead of the L2/Euclidean norm
-minimises RSS
sets many w[j] coefficients to 0
-LASSO performs model selection: a sparse model involving only some of the features
- use when only a few of the features are expected to be important
-useful when interpretability of the model is important
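A small comparison of the two penalties in scikit-learn; the dataset is synthetic and the alpha values are arbitrary choices for illustration:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# 20 features, but only 3 actually carry signal
X, y = make_regression(n_samples=100, n_features=20, n_informative=3,
                       noise=5.0, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)  # L2 penalty: shrinks all coefficients
lasso = Lasso(alpha=1.0).fit(X, y)  # L1 penalty: sets many to exactly 0

print((ridge.coef_ != 0).sum())  # typically all 20 non-zero
print((lasso.coef_ != 0).sum())  # typically only a few non-zero (sparse)
```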
method chaining
concatenation of method calls
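For example (the matrix below is a made-up stand-in for real training data):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])

# fit() returns the fitted estimator itself, so calls can be chained
X_scaled = StandardScaler().fit(X_train).transform(X_train)
# equivalent shortcut combining both calls
X_scaled = StandardScaler().fit_transform(X_train)
```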
data normalisation
measuring all features of the dataset on the same scale to ensure compatibility with the model
normalisation - least squares
not essential
if first feature x[0] measured in metres, w^[0] will be the corresponding Least Squares estimate
if instead x[0] is measured in km, all xi[0] will decrease 1000 fold
if you run Least Squares on the new dataset, w^[0] will increase 1000 fold, so predictions don't change
normalisation - ridge/lasso
essential
due to the presence of penalty terms that are the same for all variables, so predictions will change
normalising features prevents larger features from being unfairly penalised by the penalty terms
StandardScaler
for each feature mean 0 standard deviation 1
1) shift each feature down by its mean
2) divide each feature by its SD
RobustScaler
for each feature median 0 IQR 1
1) shift each feature down by its median
2) divide each feature by its IQR
MinMaxScaler
shift and scale each feature so that its values range between 0 and 1
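A quick way to see the three scalers side by side, on a tiny made-up matrix:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 1000.0]])

print(StandardScaler().fit_transform(X))  # per feature: mean 0, SD 1
print(RobustScaler().fit_transform(X))    # per feature: median 0, IQR 1
print(MinMaxScaler().fit_transform(X))    # per feature: values in [0, 1]
```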
data snooping
when the test set is used for developing the model:
the test set leaks into the model
inaccurate normalisation
leads to data snooping
affects the transformation of the data
can lead to overfitting/underfitting
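A sketch of the trap, with made-up arrays standing in for real data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[10.0]])

# WRONG: statistics computed on training + test data; the test set
# leaks into the transformation (data snooping)
bad = StandardScaler().fit(np.vstack([X_train, X_test]))

# RIGHT: fit on the training set only, then apply to both parts
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)
```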
Normalizer
each sample divided by its Euclidean norm
parameter selection
split the training set further into a smaller training set and a validation set (used for model checking); select the best parameters on the validation set, then evaluate on the test set
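A minimal sketch of this split, assuming an SVM whose parameter C is being selected; the synthetic dataset and candidate grid are arbitrary:

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X_train, y_train = make_blobs(n_samples=80, centers=2, random_state=0)

# carve a validation set out of the training set; test set stays untouched
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, random_state=0)

best_score, best_C = -1.0, None
for C in [0.01, 0.1, 1, 10, 100]:
    score = SVC(C=C).fit(X_tr, y_tr).score(X_val, y_val)
    if score > best_score:
        best_score, best_C = score, C

# retrain on the full training set with the selected parameter
final_model = SVC(C=best_C).fit(X_train, y_train)
```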
inductive conformity measure
A : Z* x Z -> R
A(C~, z) says how well z conforms to C~
no analogue of the equivariance requirement
kernel
a function that lets linear algorithms solve non-linear problems
take a feature mapping F : X -> H of X = sample space into H = feature space equipped with a dot product; this feature mapping turns into the kernel K(x, x') = F(x) . F(x')
kernel trick
- write the algorithm so that all xs appear only in dot products
- replace the dot products with kernels
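A worked example for the degree-2 polynomial kernel in two dimensions, where the feature mapping can still be written out explicitly:

```python
import numpy as np

def phi(x):
    """Explicit feature map for K(x, z) = (x . z)^2 in 2 dimensions:
    (a, b) -> (a^2, b^2, sqrt(2)*a*b)."""
    a, b = x
    return np.array([a * a, b * b, np.sqrt(2) * a * b])

x, z = np.array([1.0, 2.0]), np.array([3.0, 4.0])

print(phi(x) @ phi(z))  # dot product in feature space: 121.0
print((x @ z) ** 2)     # kernel computed directly: 121.0, no mapping needed
```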
kernel features
-symmetric: K(x, x') = K(x', x)
-positive definite: ∑i ∑j ai aj K(xi, xj) ≥ 0 for all a1, …, an
decay factor lambda
used to weight gaps between the substrings
each substring corresponds to a coordinate of the feature space; the value of the coordinate depends on how frequently and compactly the substring is embedded in the text
c
length of subsequences taken into account
activation function
np.tanh
a function nicely mapping the real line R to (-1, 1)
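For instance:

```python
import numpy as np

# tanh squashes any real input into the open interval (-1, 1);
# extreme inputs saturate towards -1 and +1
print(np.tanh(np.array([-100.0, -1.0, 0.0, 1.0, 100.0])))
```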
separating hyperplane
in the p-dimensional space R^p, a flat affine subspace of dimension p-1
separates two classes
linear scoring function
a linear function that models the hyperplane for samples in a p-dimensional space; it separates the 2 classes
score less than 0 = predict negative
score greater than 0 = predict positive
margin
the smallest perpendicular distance from the training samples to the separating hyperplane
maximum margin hyperplane / optimal separating hyperplane
the separating hyperplane that is farthest from the training samples, i.e. the one with the largest margin.
maximum margin classifier
classifying a test sample based on which side of the maximum margin hyperplane it lies on
support vectors
vectors in the p-dimensional space R^p that lie closest to the maximum margin hyperplane
if they are moved slightly, the maximum margin hyperplane moves as well
they are equidistant from the hyperplane; the region between them is the slab, and a larger slab means greater confidence
in the soft-margin case they lie directly on the margin or on its wrong side
soft margin classifier
the hyperplane that almost separates the classes using a soft margin
sacrifices perfect separation for greater robustness to individual samples and better classification of most of the training samples
soft because it allows violations by some of the training observations
solution to optimisation problem slide 8:46
slack variables
allow individual training samples to be on the wrong side of the margin or hyperplane
tuning parameter C
determines number and severity of violations to the margin and hyperplane that we tolerate
C = ∞ -> no violations tolerated, slack variables must be 0; coincides with the old maximal margin classifier
C = 0 -> prioritise maximising margin, tolerate all violations
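A sketch of the effect in scikit-learn, whose C parameter follows the same convention (large C penalises violations heavily); the dataset is synthetic:

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=60, centers=2, random_state=0)

# large C: violations expensive, narrow margin, close to a hard-margin SVM
# small C: violations cheap, wide margin, more tolerant classifier
for C in [100.0, 0.01]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(C, clf.n_support_)  # typically more support vectors for small C
```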
Pipeline
glues multiple processing steps into a single scikit-learn estimator
fit = train the model on the training data: transform the data, then fit the SVM
score = evaluate on the test data
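A minimal sketch, with a synthetic dataset standing in for real data:

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([("scaler", StandardScaler()), ("svm", SVC())])
pipe.fit(X_train, y_train)         # fits the scaler, transforms, fits the SVM
print(pipe.score(X_test, y_test))  # transforms the test data, then evaluates
```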
cross-conformal predictor
≈ a full conformal predictor, but computed over folds:
p(y) = (sum of ranks over all folds + 1)/(n + 1)
calculate conformity scores
rank the scores
subtract 1 from each rank
repeat for all folds
add up all the ranks
add 1
divide by n + 1 to get the p-value
repeat for all postulated labels
point prediction = label with the highest p-value
confidence = 1 - second-highest p-value
credibility = highest p-value
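A sketch of the recipe above for one postulated label; here `conformity` is a hypothetical inductive conformity measure with the assumed signature conformity(rest_X, rest_y, x, y):

```python
import numpy as np
from sklearn.model_selection import KFold

def cross_conformal_p(conformity, X, y, x_test, label, n_folds=5):
    """Cross-conformal p-value for one postulated label."""
    total_rank = 0
    for rest_idx, fold_idx in KFold(n_folds).split(X):
        # score the fold's samples and the test sample against the rest
        fold_scores = [conformity(X[rest_idx], y[rest_idx], X[i], y[i])
                       for i in fold_idx]
        test_score = conformity(X[rest_idx], y[rest_idx], x_test, label)
        # rank = number of fold scores at most the test score
        total_rank += sum(s <= test_score for s in fold_scores)
    # sum of ranks over all folds, plus 1, divided by n + 1
    return (total_rank + 1) / (len(y) + 1)
```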
set predictor
a predictor that outputs prediction sets rather than point predictions; takes a significance level as a parameter
calibration curve
error rate vs significance level
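Computing the curve from the p-values of the true test labels (the values below are made up); for a valid conformal predictor the curve stays close to the diagonal:

```python
import numpy as np

# p-values of the true labels on a test set (hypothetical numbers)
p_true = np.array([0.42, 0.03, 0.77, 0.15, 0.61, 0.08, 0.90, 0.25])

# an error occurs at significance level eps when the true label is
# excluded from the prediction set, i.e. its p-value is <= eps
eps = np.linspace(0, 1, 11)
error_rate = np.array([(p_true <= e).mean() for e in eps])
print(np.column_stack([eps, error_rate]))
```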