Machine Learning Technologies Flashcards
What are the 4 types of ML techniques?
Supervised
Semi-Supervised
Unsupervised
Reinforcement
What is error rate?
The proportion of incorrectly classified samples to total no. samples
What is empirical error?
Error calculated on training set
What is generalisation error?
Error calculated on unseen samples
What are the 4 reasons for underfitting happening?
Model too simple
Insufficient training
Uninformative dataset
Over-regularised
What are the 4 reasons for overfitting happening?
Too complex
Excessive training
Small dataset
Lacking regularisation
How to fix overfitting?
Change model and/or change data
How to fix underfitting?
Update model and/or add more data
Why is overfitting unavoidable?
Because P≠NP - there are some problems for which we can verify a solution quickly but finding that solution efficiently is computationally infeasible. If overfitting could be avoided entirely, minimising the empirical error would yield the optimal model efficiently, which would imply P=NP
What’s the hold-out method?
Where dataset is split into two disjoint subsets (training set & testing set)
Why do we use stratified sampling?
To prevent biased error
What are the 2 difficulties in choosing the data split?
More data in training set -> better model approximation but less reliable evaluation
More data in testing set -> better evaluation but weaker approximation
What is LOO (Leave-One-Out)?
A case of k-fold cross-validation where k = n. So each test set contains exactly 1 sample and the training set is the remaining n-1
Close to ideal evaluation, but the computational cost is prohibitive for large datasets
What are the 5 steps of bootstrapping?
For dataset D containing n samples
1) Randomly pick a sample from D
2) Copy to D’
3) Put it back in D
4) Repeat n times
5) Use D’ as training set and D\D’ as testing set
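A minimal Python sketch of this procedure (the dataset is represented as a plain list; names are illustrative):

```python
import random

def bootstrap_split(D):
    # Steps 1-4: pick n samples from D with replacement, copying each into D'
    n = len(D)
    D_prime = [random.choice(D) for _ in range(n)]
    # Step 5: D' is the training set; D \ D' (never-picked samples) is the test set
    chosen = set(D_prime)
    oob = [x for x in D if x not in chosen]
    return D_prime, oob

D = list(range(1000))
train, test = bootstrap_split(D)
print(f"OOB fraction: {len(test) / len(D):.3f}")  # tends toward ~0.368 for large n
```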
What proportion of the data ends up in the testing set in bootstrapping?
Chance of a sample not being picked in n rounds: (1 - 1/n)^n
As n -> infinity, chance -> 1/e ≈ 0.368
So ~36.8% of original samples don’t appear in D’ (this remaining data is called OOB (out-of-bag) data)
What is out-of-bag estimate?
The evaluation result obtained by bootstrapping
Parameters vs hyperparameters
Parameters are internal variables learned automatically from the data (large models can have more than 10 billion)
Hyperparameters are external variables defined by the user (typically fewer than 10)
What is accuracy?
Correctly predicted instances / all instances
What is error?
Incorrectly predicted instances / all instances
What is precision?
Correctly predicted positives / predicted positives
What is recall?
Correctly predicted positives / actual positives
What is specificity?
Correctly predicted negatives / actual negatives
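These definitions are just ratios over confusion-matrix counts; a quick Python sketch with made-up counts:

```python
TP, FP, TN, FN = 40, 10, 45, 5  # illustrative counts, not from any real dataset

accuracy    = (TP + TN) / (TP + TN + FP + FN)
error       = (FP + FN) / (TP + TN + FP + FN)
precision   = TP / (TP + FP)   # correctly predicted positives / predicted positives
recall      = TP / (TP + FN)   # correctly predicted positives / actual positives
specificity = TN / (TN + FP)   # correctly predicted negatives / actual negatives
```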
What is a P-R curve?
Precision-recall curve
A tool for evaluating effectiveness of a classification model
What 3 solutions are there to intersecting lines in a P-R curve?
- Compare areas under curves - not easy to compute
- Break-even point - measure the point on the curves where precision & recall are equal
- F1-Measure - harmonic mean of P & R:
2 x (P * R) / (P + R)
= 2 x TP / (N + TP - TN), where N is the total no. samples
In what situations are precision & recall more important?
Precision more important in recommender systems
Recall more important in information retrieval systems
In F_beta, for what values of beta are precision & recall more important?
Precision: beta < 1
Recall: beta > 1
Discuss the use of multiple confusion matrices
1) Precision & recall calculated for each round of training & testing -> n binary confusion matrices
2) Take averages for macro-P, macro-R, macro-F1 (using mP & mR)
3) Calculate element-wise averages (TP etc) and use them to obtain micro-P, micro-R, micro-F1
What type of learning technique is clustering?
Unsupervised
What is prototype clustering?
Starts with initial prototype clusters
Iteratively updates & optimises the prototypes
Define Occam’s Razor
Prefer the simplest hypothesis that adequately explains the data - here, choose the smallest number of clusters that does so
What are the 4 steps in updating centroids in K-Means clustering?
1) Initialise K random centroids (from existing data points)
2) Expectation Maximisation (E-Step): determine which cluster each data point is closest to (Euclidean distance) and assign it
3) Expectation Maximisation (M-Step): recompute centroids based on assigned points
4) Repeat 2 & 3 until convergence
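A compact NumPy sketch of these four steps (assumes no cluster goes empty during iteration):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1) initialise K centroids from existing data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # 2) E-step: assign each point to its nearest centroid (Euclidean)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3) M-step: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # 4) repeat until convergence
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```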
What are the 2 advantages of K-means clustering?
Simple & efficient
Interpretable clusters
What are the 5 disadvantages of K-means clustering?
Sensitive to initial centroids
Assumes clusters are equally sized
Requires the no. clusters (K) to be chosen in advance
Outliers skew centroids
Not suitable for non-linear data
Intra-cluster vs inter-cluster similarity
Intra-cluster: items within a cluster should be similar
Inter-cluster: clusters themselves should be dissimilar
What are the 2 types of validity indices?
External index: compares clustering results against a reference model
Internal index: evaluates clustering results without reference model
Name 3 commonly used external validity indices
(Take values in range [0,1])
- Jaccard Coefficient (JC)
- Fowlkes & Mallows Index (FMI)
- Rand Index (RI)
Name 2 commonly used internal validity indices
- Davies-Bouldin Index (DBI)
- Dunn Index (DI)
What are the 4 distance axioms?
Non-negativity: dist(a,b)>=0
Identity of indiscernibles: if dist(a,b)=0, a=b
Symmetry: dist(a,b)=dist(b,a)
Subadditivity (triangle inequality): dist(a,b)<=dist(a,c)+dist(c,b)
What distances don’t satisfy the subadditivity condition?
Non-metric distances
What are ordinal attributes?
Categorical attributes that have a natural/inherent order e.g. {low, medium, high} can be represented as {1, 2, 3}
What are non-ordinal attributes?
Categorical attributes that DON’T have a natural/inherent order e.g. {aircraft, train, ship}
Describe the Minkowski Distance (MD)
Satisfies all axioms
dist(x, y) = (sum_u |x_u - y_u|^p)^(1/p)
Only applicable to ordinal attributes
When p=1, becomes Manhattan distance
When p=2, becomes Euclidean distance
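A one-function Python sketch showing the p=1 and p=2 special cases:

```python
import numpy as np

def minkowski(x, y, p=2):
    # dist(x, y) = (sum_u |x_u - y_u|^p)^(1/p)
    return float(np.sum(np.abs(np.asarray(x) - np.asarray(y)) ** p) ** (1 / p))

print(minkowski([0, 0], [3, 4], p=2))  # 5.0 -> Euclidean
print(minkowski([0, 0], [3, 4], p=1))  # 7.0 -> Manhattan
```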
Describe the Value Difference Metric (VDM)
Can be applied to non-ordinal attributes
m_u,a denotes the no. samples taking value a on attribute u (e.g. the no. samples in the dataset where colour is red); m_u,a,i denotes the no. samples within the ith cluster taking value a on attribute u; k is the no. clusters
VDM_p(a, b) = sum_{i=1..k} |m_u,a,i / m_u,a - m_u,b,i / m_u,b|^p
How can MD & VDM be combined?
1) Arrange ordinal attributes in front of non-ordinal attributes
2) n_c denotes the no. ordinal attributes and n - n_c denotes the no. non-ordinal attributes
3) Compute MinkovDM_p: Minkowski distance over the first n_c (ordinal) attributes plus VDM over the remaining non-ordinal attributes
What is Hamming distance?
The number of bits which need to be changed to turn one string into the other
What is Jaccard index?
The size of the intersection / the size of the union of the sample sets
Doesn’t work well for nominal data
What is cosine index?
The cosine of the angle between 2 vectors of n dimensions
Doesn’t work well for nominal data
What is bagging?
Bootstrap aggregating
Combines predictions from multiple models (base learners)
Decision trees recursively iterate until one of what 3 conditions is met?
- All samples in current node belong to same class (e.g. all “yes”)
- No samples in current node (e.g. splitting on Age > 60 leaves no samples in one branch)
- No features left to split on (split on all features) or all samples have same feature values (same no. legs & same size)
What is the Gini Impurity Index?
A criterion that measures the impurity of a node in a decision tree (0 = pure; higher = more impure)
G = 1 - sum_k(p_k^2)
What are the 5 steps in sorting data into sets of least impurity?
1) Split the tree by feature x (e.g. age); this results in 2 nodes (younger than 30 & older than 30)
2) p_i,k represents the proportion of instances of class k in node i (e.g. the proportion of “yes” instances in the younger-than-30 node)
3) Calculate the Gini index
4) Select the feature (age, gender, etc.) that produces the lowest weighted sum of the Gini scores for the child nodes
5) Repeat until leaf node reached or Gini score becomes very small (indicating minimal impurity)
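A small Python sketch of scoring one candidate split this way (labels are illustrative):

```python
def gini(labels):
    # G = 1 - sum_k(p_k^2)
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def weighted_gini(left, right):
    # weighted sum of child-node Gini scores for a candidate split
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# e.g. splitting on Age > 30: the feature with the lowest score wins
print(weighted_gini(["yes", "yes", "no"], ["no", "no", "no"]))  # ~0.222
```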
What is entropy?
A measure of uncertainty/randomness in the data: H = -sum_k(p_k log2 p_k)
What is gain ratio?
Information gain criterion is biased toward features with more possible values, so we reduce bias with gain ratio
Gain_ratio(D,a) = Gain(D,a)/IV(a)
Where IV is the intrinsic value of feature a - it’s large when a has many possible values
What is the drawback of gain ratio?
It is biased toward features with fewer possible values
What are the 3 advantages of decision trees?
Can achieve 0% error rate if each training example is assigned to a unique leaf node
Easy to prepare data
Highly interpretable - white-box model (can understand prediction reasoning)
What are the 3 disadvantages of decision trees?
High training time
High variance leads to overfitting
Sensitive to variation in dataset e.g. rotation, change in data etc.
What are 3 regularisation hyperparameters for decision trees and why are they necessary?
- Max tree depth
- Min samples a node must have before splitting
- Min samples a leaf node must have
Necessary to avoid overfitting
What are the 3 advantages of bagging?
Reduces overfitting
Handles missing values
Can perform parallel processing
What are the 2 disadvantages of bagging?
Doesn’t address bias
Low interpretability
What is random forest?
An extension of bagging
Instead of creating one big decision tree, create multiple smaller decision trees
Instead of selecting the optimal split feature from all features, select it from a subset randomly sampled from the node’s feature set
Train hundreds/thousands of trees on bootstrapped datasets and aggregate predictions
What are the 2 ways random forests aggregate predictions?
Classification: each tree votes on class of new data point; majority vote wins
Regression: each tree predicts a value; average of predictions taken as final prediction
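For reference, scikit-learn packages all of this up; a minimal sketch (assuming scikit-learn and its bundled iris dataset):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
# hundreds of trees, each fit on a bootstrap sample using random feature subsets
clf = RandomForestClassifier(n_estimators=300, max_features="sqrt").fit(X, y)
print(clf.predict(X[:3]))  # classification: majority vote across the trees
```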
What is boosting?
A family of algorithms that converts weak learners to strong learners
Each model is trained to correct predecessor errors by giving more weight to misclassified examples
What are the 4 steps of boosting?
1) Train a base learner
2) Adjust distribution of training samples according to results of base learner so incorrectly classified samples receive more attention
3) Train the next base learner with adjusted training samples; result is used to adjust training sample distribution again
4) Repeat 2 & 3 until no. base learners reaches a defined value
What 2 properties should individual learners have in boosting?
Should be accurate and diverse
What are the 5 steps of AdaBoost (adaptive boosting)?
1) Initialise weights for all training samples
2) Train weak learners on weighted dataset
3) Increase weights of misclassified examples so next weak learner focuses on them
4) Assign weight to each weak learner based on accuracy and combine predictions using classification/regression
5) Repeat process iteratively until a defined no. iterations or the model achieves desired accuracy
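A minimal sketch using scikit-learn’s AdaBoostClassifier (its default weak learner is a depth-1 decision tree, i.e. a decision stump); the toy dataset is illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=500, random_state=0)  # toy dataset
clf = AdaBoostClassifier(n_estimators=50).fit(X, y)        # 50 boosting rounds
print(clf.score(X, y))  # training accuracy of the combined ensemble
```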
What are 3 advantages of boosting?
Reduces bias
High accuracy
Adaptive
What are 3 disadvantages of boosting?
Sensitive to outliers
Less parallelisable
Overfitting if boosting rounds are high
What vectors are used for non-ordinal regression?
For k possible values (watermelon, pumpkin, cucumber), use one-hot k-dimensional vectors: (1,0,0), (0,1,0), (0,0,1)
What is the linear regression model?
x = (x1, x2, … , xn)
w = (w1, w2, … , wn)
b is bias (y-intercept)
f(x) = w^T x + b
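A NumPy sketch that fits w and b by least squares (toy data, roughly y = 2x):

```python
import numpy as np

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.1, 4.0, 6.2, 7.9])
Xb = np.hstack([X, np.ones((len(X), 1))])     # extra all-ones column absorbs b
w_b, *_ = np.linalg.lstsq(Xb, y, rcond=None)  # minimises ||Xb @ w_b - y||^2
w, b = w_b[:-1], w_b[-1]
print(w, b)  # f(x) = w^T x + b
```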
What is the hyperplane in supervised ML?
The line/surface that separates data into different groups
You want to maximise the margin
What is the decision boundary / hyperplane in supervised ML?
The line equidistant from the margin boundaries
Lies midway between the margin boundaries, on which the support vectors lie
What is the margin in supervised ML?
The distance (both ways) between the decision boundary and the closest data points from any class - essentially the distance between the two margin boundaries
We want to maximise this
What are support vectors in supervised ML?
The data points (of different classes) that lie on the margin boundaries
What does generative learning do?
Trains model to learn to generate new data instances by learning the underlying data distribution
What are 4 advantages of generative learning?
Can generate new data
Handles missing data (by learning underlying distribution)
Good at low-resource learning
Provides deep insights into underlying structure
What are 3 disadvantages of generative learning?
Computationally expensive
Hard to train - especially high-dimensional data
Low classification performance
What does discriminative learning do?
Focuses on learning the decision boundary between input features and target classes, rather than modeling the data distribution
What are 4 advantages of discriminative learning?
Easy to train - learns boundaries between classes
High classification accuracy
Fast inference
Efficient with large datasets
What are 3 disadvantages of discriminative learning?
Limited understanding of data structure
Struggles with missing data
Requires large labelled datasets
What are the 2 approaches to pruning (not when)?
Cost-complexity pruning: set a threshold for cost of a subtree & remove subtrees that exceed it
Error-based pruning: evaluate performance of tree on validation set & remove nodes that don’t improve accuracy
What is pre-pruning?
Evaluates the generalisation ability of each split and cancels the split if the improvement is small (or worse) - includes root node
What is a decision stump?
A decision tree with only one split
What is post-pruning?
Allows the tree to grow into a complete tree
Re-examines non-leaf nodes and replaces them with leaf nodes if the replacement improves generalisation ability
What is the pro and con of post-pruning?
Pro: less prone to underfitting - better generalisation ability
Con: longer training time as examines every non-leaf node
How is generalisation ability measured for pruning?
Performance evaluation methods such as hold-out method
How can you split continuous values in a decision tree?
Bi-partitioning:
1) Sort data by value
2) Evaluate split points by information gain (n-1 potential split points as n-1 midpoints of adjacent values)
3) Select optimal split as t and split at t
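A tiny Python sketch of generating the candidate thresholds (step 2), using distinct sorted values:

```python
def candidate_splits(values):
    # midpoints of adjacent sorted (distinct) values
    v = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(v, v[1:])]

print(candidate_splits([0.3, 0.7, 0.5, 0.9]))  # [0.4, 0.6, 0.8]
```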
What are the 3 ways of handling missing values?
Imputation: replace with estimated values (mean/median/mode)
Ignoring: remove instances
Treat as a unique category
What is the problem with using singular feature check decision trees for defining decision boundaries?
Decision boundaries are axis-parallel, so many segments are needed for good approximations of oblique boundaries -> slow
What is logistic regression and when is it necessary?
Binary classifier that predicts probability of outcome by applying logistic function to linear model
For when data can’t be classified by a linear equation
What is RBF (Radial Basis Function)?
A non-linear data transformation (increase dimensionality) to make the data linearly separable
What is RBFN (RBF Network)?
A type of multilayer perceptron
Has strictly 1 hidden layer with more RBF neurons than the no. inputs (to increase dimensionality)
How does RBFN work (4 points)?
- Each RBF neuron stores prototype vector chosen from training data
- The neuron computes the distance/similarity score (between 0 and 1) between its input and its prototype
- Response decreases exponentially as distance between input & prototype increases (weight of neurons in deciding output decreases)
- Output value called the ‘activation’ of the neuron
Name 2 popular RBFs
Gaussian
Multi-quadric
Name 2 non-radial basis functions
Linear
Thin plate splines
What are 3 limitations of RBFNs?
- Performance determined by centres, radiuses & RBF chosen
- RBFs must cover the input space well to work effectively
- Faster training, but slower classifications than MLPs
How are perceptrons trained?
1) Set weights to small random numbers
2) For T iterations (or until convergence):
3) For each input vector, compute the activation of each neuron
4) Update weights: w_i = w_i - lr * (actual output - target output) * (value of input feature i for sample j)
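A minimal NumPy sketch of this loop for a single step-activation perceptron (the AND-gate data is illustrative):

```python
import numpy as np

def train_perceptron(X, targets, lr=0.1, epochs=20, seed=0):
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=0.01, size=X.shape[1])  # small random weights
    b = 0.0
    for _ in range(epochs):                      # T iterations
        for x, t in zip(X, targets):             # each input vector
            y = 1.0 if w @ x + b > 0 else 0.0    # compute activation
            w -= lr * (y - t) * x                # w_i -= lr*(actual - target)*x_i
            b -= lr * (y - t)
    return w, b

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([0, 0, 0, 1], dtype=float)  # AND gate
w, b = train_perceptron(X, t)
```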
What’s the purpose of activation functions?
To introduce non-linearity to each neuron’s output
Name 4 common activation functions & their formulae
Sigmoid: 1 / (1+e^-x)
ReLU: max(0,x)
Softmax: forces outputs to sum to 1 for probability distribution
Tanh: (1-e^-2x) / (1+e^-2x)
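NumPy versions of the four, for concreteness:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def relu(x):
    return np.maximum(0, x)

def softmax(x):
    e = np.exp(x - np.max(x))  # subtract max for numerical stability
    return e / e.sum()         # outputs sum to 1

def tanh(x):
    return (1 - np.exp(-2 * x)) / (1 + np.exp(-2 * x))  # same as np.tanh(x)
```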
What is gradient descent in backpropagation?
An optimisation algorithm
1) Backpropagation calculates gradients of the loss function with respect to the parameters
2) Gradient descent uses those gradients to adjust parameters (weights & bias) in direction that minimises loss
Name 3 cost functions
MSE (mean squared error)
KL divergence
Hellinger distance
What is a cost/loss function?
A function that measures the discrepancy between predicted output and actual output
Give an example of a transformation of data for logistic regression
Instead of using position vectors, use distance vectors from some point
What are cascading logistic regression models? Give an example
A technique where multiple logistic regression models are applied sequentially on subsets of the problem to refine predictions
Example (classifying images): first model classifies if image is animal or vehicle, second model distinguishes animal between cat or dog, third model distinguishes vehicle as car or bike
Describe a fully connected feedforward network
x=input, f(x)=function applied to input, y=output
y = f(x) = σ(W^n … σ(W^2 * σ(W^1 * x + b^1) + b^2) … + b^n)
What is multi-class classification?
Classifies data with more than 2 possible outcomes
Probabilities determine predicted class - calculated with softmax
How do you find a good learning rate?
Plot loss over each iteration for varying learning rates
In what 2 ways are deeper networks better than shallower networks?
Use parameters more efficiently
Have various levels of abstraction (better generalisation & a nested hierarchy of concepts)
What is modularisation?
The process of breaking down a task into smaller, manageable units called modules
What is universal approximation theorem?
States that a feedforward neural network with a single hidden layer containing a sufficient number of neurons can approximate any continuous function, given appropriate activation functions
What is the reality of Universal Approximation Theorem?
In practice, deeper networks are more efficient & require fewer neurons
What is a convolution layer?
Instead of each neuron connecting to every pixel, each neuron connects only to pixels in a nearby region (its receptive field)
Give 3 reasons for using CNN (convolutional neural network) for image recognition
- Patterns may be smaller than whole image
- Patterns may appear in different regions - leads to multiple detectors for different regions doing the same thing
- Decreasing the resolution shouldn’t affect detection
What 4 steps are there in CNNs (convolutional neural networks)?
1) Convolution
2) Max Pooling
3) Repeat 1 & 2 as many times as you want
4) Flatten
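A minimal PyTorch sketch of this pipeline (layer sizes assume a 28x28 single-channel image and 10 classes; both are illustrative):

```python
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1),   # 1) convolution
    nn.ReLU(),
    nn.MaxPool2d(2),                              # 2) max pooling (28x28 -> 14x14)
    nn.Conv2d(16, 32, kernel_size=3, padding=1),  # 3) repeat 1 & 2
    nn.ReLU(),
    nn.MaxPool2d(2),                              #    (14x14 -> 7x7)
    nn.Flatten(),                                 # 4) flatten to a 1-D vector
    nn.Linear(32 * 7 * 7, 10),                    # classifier head
)
```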
What is convolution?
Using filters on an image to detect small patterns
What is max pooling?
Cut the new image (from convolution) into pieces (pooling) and take the max value from each piece
What is flattening in CNNs?
Converting the multi-dimensional feature map output from convolutional layers into a one-dimensional vector
What are exploding & vanishing gradients?
Exploding: when weights/gradients are greater than 1, repeated multiplication over many layers/iterations makes the weight updates very large
Vanishing: when weights/gradients are below 1, repeated multiplication drives them toward 0, so weights stop updating
How can exploding gradients be addressed?
Use clipping: gradient is capped, preventing excessively large updates to weights
What is a gated recurrent unit?
A variant of LSTM
Instead of using a separate forget gate to determine what fraction of past information to retain, how much past information to retain/discard is based on an update gate vector
What are the 5 aspects of data characterisation?
1) What data is available? Can i get more? How?
2) What format is the data in? Transform? Cost?
3) Assess data quality
4) Identify test/training datasets
5) Assess bias, legal, privacy, ethics considerations
In what 4 ways is data quality assessed?
Completeness - enough labelled data?
Accuracy
Believability - can method be trusted?
Timeliness - from right timeframe?
Where is bias introduced (2)?
Features - e.g. introducing gender as a feature
Training set
What is smoothing?
A post-processing step
e.g. if a journey with 5 min splits goes car, car, bike, car, car, that bike should be smoothed out
What is fault analysis?
A framework to help debug the model
The further down the fault tree you go, the closer to the root cause you get
What is undersegmentation & oversegmentation?
Undersegmentation: distinct regions incorrectly grouped as single segment
Oversegmentation: single region divided into too many segments
What are GMMs (gaussian mixture models)?
A more complicated method of clustering data than k-means
Represents a dataset as a mixture of several Gaussian distributions
What property should the data points have in mixture gaussian density?
The mixing coefficients (pi_i) sum to 1
What are the 2 steps of the EM algorithm?
Parameters: means, variances, mixing coefficients
Expectation step: use current parameter estimates to calculate the expected probability (responsibility) that each data point belongs to each cluster
Maximisation step: Update parameters by maximising likelihood of data
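In practice the EM loop is usually delegated to a library; a minimal scikit-learn sketch on toy two-blob data:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])

gmm = GaussianMixture(n_components=2).fit(X)  # EM runs inside fit()
print(gmm.means_)                 # learned means
print(gmm.weights_)               # mixing coefficients (sum to 1)
print(gmm.predict_proba(X[:3]))   # E-step responsibilities per cluster
```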
What are 5 advantages of GMMs?
- Flexibility
- Interpretability
- Speed
- Handling of missing data
- Robustness to outliers
What are 5 disadvantages of GMMs?
- Sensitive to initialisation
- Choosing number of components
- Assumes normal distribution
- Limited expressive power
- Expensive when high D
What are autoencoders?
An unsupervised learning technique
Network learns to compress (encode) and reconstruct (decode) data; a bottleneck (hidden) layer forces the network to find meaningful representations of the data
Network is trained to minimise reconstruction error
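A minimal PyTorch sketch of the architecture (the 784-dimensional input and 32-unit bottleneck are illustrative, e.g. flattened 28x28 images):

```python
import torch.nn as nn

autoencoder = nn.Sequential(
    nn.Linear(784, 128), nn.ReLU(),
    nn.Linear(128, 32),  nn.ReLU(),    # encoder -> 32-dim bottleneck (latent space)
    nn.Linear(32, 128),  nn.ReLU(),    # decoder
    nn.Linear(128, 784), nn.Sigmoid(), # reconstruction of the input
)
loss_fn = nn.MSELoss()  # train to minimise reconstruction error
```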
What is latent space?
A lower-dimensional representation of data that captures its essential features & underlying structure
Helps represent hidden relationships within the data
What is SVD (singular value decomposition)?
Factorises a matrix into three matrices: rotation -> rescaling -> rotation
Used as a data reduction tool by determining key features
How does PCA preserve information?
By finding the direction of data with most variance (most information)
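A NumPy sketch of PCA via SVD (random data stands in for a real matrix):

```python
import numpy as np

def pca(X, n_components):
    Xc = X - X.mean(axis=0)            # centre the data first
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T    # project onto top-variance directions

X = np.random.default_rng(0).normal(size=(200, 5))
Z = pca(X, n_components=2)
print(Z.shape)  # (200, 2)
```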
What are support vector machines?
Powerful ML models capable of performing linear & non-linear classification & regression tasks
Are margins sensitive to feature scaling?
Yes
What is a hard margin?
A margin that strictly imposes that all instances be “off the street” (not in the margin)
What is a soft margin?
A margin that balances keeping the margin as large as possible whilst limiting “margin violations”
C hyperparameter controls amount of violation allowed (small C allows more violations)
Violation scales with distance
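A quick scikit-learn sketch of how C trades margin width against violations (toy blobs; exact support-vector counts will vary):

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0)
soft = SVC(kernel="linear", C=0.1).fit(X, y)    # small C: more violations allowed
hard = SVC(kernel="linear", C=1000).fit(X, y)   # large C: stricter margin
print(len(soft.support_vectors_), len(hard.support_vectors_))
```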