Machine Learning Technologies Flashcards

1
Q

What are the 4 types of ML techniques?

A

Supervised
Semi-Supervised
Unsupervised
Reinforcement

2
Q

What is error rate?

A

The proportion of incorrectly classified samples to total no. samples

3
Q

What is empirical error?

A

Error calculated on training set

4
Q

What is generalisation error?

A

Error calculated on unseen samples

5
Q

What are the 4 reasons underfitting happens?

A

Model too simple
Insufficient training
Uninformative dataset
Over-regularised

6
Q

What are the 4 reasons overfitting happens?

A

Too complex
Excessive training
Small dataset
Lacking regularisation

7
Q

How to fix overfitting?

A

Change model and/or change data

8
Q

How to fix underfitting?

A

Update model and/or add more data

9
Q

Why is overfitting unavoidable?

A

Because typical learning problems are NP-hard: if overfitting could be avoided entirely, we could find the optimal model just by minimising empirical error, which would imply P = NP. Since we believe P ≠ NP (some solutions can be verified quickly but not found quickly), overfitting is unavoidable

10
Q

What’s the hold-out method?

A

Where dataset is split into two disjoint subsets (training set & testing set)

11
Q

Why do we use stratified sampling?

A

To preserve the class proportions of the full dataset in each subset, preventing a biased error estimate

12
Q

What are the 2 difficulties in choosing the data split?

A

More data in training set -> better model approximation but less reliable evaluation
More data in testing set -> better evaluation but weaker approximation

13
Q

What is LOO (Leave-One-Out)?

A

A case of k-fold cross-validation where k = n. So each test set contains a single sample and the training set is the remaining n-1
Close to an ideal evaluation, but the computational cost is prohibitive for large datasets

14
Q

What are the 5 steps of bootstrapping?

A

For dataset D containing n samples
1) Randomly pick a sample from D
2) Copy to D’
3) Put it back in D
4) Repeat n times
5) Use D’ as training set and D\D’ as testing set
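
A minimal Python sketch of these five steps (the name `bootstrap_split` is illustrative, not from the deck):

```python
import random

def bootstrap_split(D):
    """Steps 1-4: draw n samples from D with replacement to form D'."""
    n = len(D)
    D_prime = [random.choice(D) for _ in range(n)]   # pick, copy to D', put back, repeat n times
    testing = [x for x in D if x not in D_prime]     # step 5: D \ D' is the testing set
    return D_prime, testing

training, testing = bootstrap_split(list(range(10)))
```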

15
Q

What proportion of the data ends up in the testing set in bootstrapping?

A

Chance of a sample not being picked in n rounds: (1 - 1/n)^n
As n -> infinity, chance -> 1/e ≈ 0.368
So about 36.8% of original samples don't appear in D' (this remaining data is called OOB (out-of-bag) data)
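
A quick numeric check of this limit (illustrative Python):

```python
import math

for n in (10, 100, 10_000):
    print(n, (1 - 1 / n) ** n)   # chance a given sample is never picked
print(1 / math.e)                # limit ~0.368 -> ~36.8% of samples are out-of-bag
```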

16
Q

What is out-of-bag estimate?

A

The evaluation result obtained by bootstrapping

17
Q

Parameters vs hyperparameters

A

Parameters are internal variables, learned automatically from the data (large models can have >10 billion)
Hyperparameters are external variables defined by the user (typically <10)

18
Q

What is accuracy?

A

Correctly predicted instances / all instances

19
Q

What is error?

A

Incorrectly predicted instances / all instances

20
Q

What is precision?

A

Correctly predicted positives / predicted positives

21
Q

What is recall?

A

Correctly predicted positives / actual positives

22
Q

What is specificity?

A

Correctly predicted negatives / actual negatives

23
Q

What is a P-R curve?

A

Precision-recall curve
A tool for evaluating effectiveness of a classification model

24
Q

What 3 solutions are there to intersecting lines in a P-R curve?

A

  • Compare areas under curves - not easy to compute
  • Break-even point - measure the point on the curves where precision & recall are equal
  • F1-Measure - harmonic mean of P & R:
    2 x (P * R) / (P + R)
    = 2 x TP / (N + TP - TN), where N is the total no. samples
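
A hedged Python sketch tying cards 18-24 together, computing each metric from raw confusion-matrix counts (the function name `metrics` is illustrative):

```python
def metrics(TP, FP, TN, FN):
    N = TP + FP + TN + FN                      # total no. instances
    accuracy    = (TP + TN) / N
    error       = (FP + FN) / N
    precision   = TP / (TP + FP)               # correctly predicted positives / predicted positives
    recall      = TP / (TP + FN)               # correctly predicted positives / actual positives
    specificity = TN / (TN + FP)               # correctly predicted negatives / actual negatives
    f1 = 2 * precision * recall / (precision + recall)   # == 2 * TP / (N + TP - TN)
    return accuracy, error, precision, recall, specificity, f1
```
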
25

Q

In what situations are precision & recall more important?

A

Precision more important in recommender systems
Recall more important in information retrieval systems

26

Q

In F_beta, for what values of beta are precision & recall more important?

A

Precision: beta < 1
Recall: beta > 1

27

Q

Discuss the use of multiple confusion matrices

A

1) Precision & recall calculated for each round of training & testing -> n binary confusion matrices
2) Take averages for macro-P, macro-R, macro-F1 (using macro-P & macro-R)
3) Calculate element-wise averages (TP etc.) and use them to obtain micro-P, micro-R, micro-F1

28

Q

What type of learning technique is clustering?

A

Unsupervised

29

Q

What is prototype clustering?

A

Starts with initial prototype clusters
Iteratively updates & optimises the prototypes

30

Q

Define Occam's Razor

A

Prefer the simplest hypothesis that fits the data - in clustering, choose the smallest number of clusters that adequately explains the data

31

Q

What are the 4 steps in updating centroids in K-Means clustering?

A

1) Initialise K random centroids (from existing data points)
2) Expectation Maximisation (E-Step): determine which cluster centroid each data point is closest to (Euclidean distance) and assign it
3) Expectation Maximisation (M-Step): recompute centroids based on assigned points
4) Repeat 2 & 3 until convergence

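A minimal NumPy sketch of these four steps (illustrative; empty-cluster handling omitted):

```python
import numpy as np

def kmeans(X, K, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), K, replace=False)]   # 1) K random existing points
    for _ in range(iters):                                # 4) repeat until convergence
        dists = np.linalg.norm(X[:, None] - centroids[None], axis=2)
        labels = dists.argmin(axis=1)                     # 2) E-step: assign to nearest centroid
        new = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        if np.allclose(new, centroids):
            break
        centroids = new                                   # 3) M-step: recompute centroids
    return labels, centroids
```
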
32

Q

What are the 2 advantages of K-means clustering?

A

Simple & efficient
Interpretable clusters

33

Q

What are the 5 disadvantages of K-means clustering?

A

Sensitive to initial centroids
Assumes clusters are equally sized
Depends on the no. clusters
Outliers skew centroids
Not suitable for non-linear data

34

Q

Intra-cluster vs inter-cluster similarity

A

Intra-cluster: items within a cluster should be similar
Inter-cluster: clusters themselves should be dissimilar

35

Q

What are the 2 types of validity indices?

A

External index: compares clustering results against a reference model
Internal index: evaluates clustering results without a reference model

36

Q

Name 3 commonly used external validity indices

A

(Take values in range [0,1])
* Jaccard Coefficient (JC)
* Fowlkes & Mallows Index (FMI)
* Rand Index (RI)

37

Q

Name 2 commonly used internal validity indices

A

* Davies-Bouldin Index (DBI)
* Dunn Index (DI)

38

Q

What are the 4 distance axioms?

A

Non-negativity: dist(a,b) >= 0
Identity of indiscernibles: if dist(a,b) = 0, a = b
Symmetry: dist(a,b) = dist(b,a)
Subadditivity (triangle inequality): dist(a,b) <= dist(a,c) + dist(c,b)

39

Q

What distances don't satisfy the subadditivity condition?

A

Non-metric distances

40

Q

What are ordinal attributes?

A

Categorical attributes that have a natural/inherent order
e.g. {low, medium, high} can be represented as {1, 2, 3}

41

Q

What are non-ordinal attributes?

A

Categorical attributes that DON'T have a natural/inherent order
e.g. {aircraft, train, ship}

42

Q

Describe the Minkowski Distance (MD)

A

Satisfies all axioms
Only applicable to ordinal attributes
When p=1, becomes Manhattan distance
When p=2, becomes Euclidean distance

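A small sketch of the metric (plain Python, illustrative):

```python
def minkowski(x, y, p):
    """Minkowski distance; p=1 is Manhattan, p=2 is Euclidean."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

print(minkowski((0, 0), (3, 4), 1))   # 7.0 (Manhattan)
print(minkowski((0, 0), (3, 4), 2))   # 5.0 (Euclidean)
```
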
43

Q

Describe the Value Difference Metric (VDM)

A

Can be applied to non-ordinal attributes
VDM_p(a,b) = sum_{i=1..k} | m_u,a,i / m_u,a - m_u,b,i / m_u,b |^p
m_u,a denotes the number of samples taking value a (e.g. red) on the attribute u (e.g. colour)
m_u,a,i denotes the number of samples within the ith cluster taking value a on the attribute u
k is the no. clusters

44

Q

How can MD & VDM be combined?

A

1) Arrange ordinal attributes in front of non-ordinal attributes
2) n_c denotes the no. ordinal attributes and n - n_c denotes the no. non-ordinal attributes
3) Apply MinkovDM(): Minkowski distance over the ordinal attributes plus VDM over the non-ordinal attributes

45

Q

What is Hamming distance?

A

The number of bits which need to be changed to turn one string into the other

46

Q

What is Jaccard index?

A

The size of the intersection / the size of the union of the sample sets
Doesn't work well for nominal data

47

Q

What is cosine index?

A

The angle between 2 vectors of n dimensions
Doesn't work well for nominal data

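Minimal sketches of the three measures from cards 45-47 (illustrative Python):

```python
import math

def hamming(s, t):
    return sum(a != b for a, b in zip(s, t))   # positions that differ

def jaccard(A, B):
    return len(A & B) / len(A | B)             # |intersection| / |union|

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)                     # cos of the angle between the vectors

print(hamming("10110", "10011"))        # 2
print(jaccard({1, 2, 3}, {2, 3, 4}))    # 0.5
print(cosine((1, 0), (1, 1)))           # ~0.707 (cos 45°)
```
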
48

Q

What is bagging?

A

Bootstrap aggregating
Combines predictions from multiple models (base learners)

49

Q

Decision trees recursively iterate until one of what 3 conditions is met?

A

* All samples in current node belong to same class (e.g. all "yes")
* No samples in current node (e.g. splitting on Age > 60 leaves no samples in one branch)
* No features left to split on (split on all features) or all samples have same feature values (same no. legs & same size)

50

Q

What is the Gini Impurity Index?

A

A criterion that measures the impurity of a node in a decision tree (0 = pure, values near 1 = impure)
G = 1 - sum_k(p_k^2)

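A one-function sketch of the formula (illustrative):

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

print(gini(["yes"] * 10))               # 0.0 -> pure node
print(gini(["yes"] * 5 + ["no"] * 5))   # 0.5 -> maximally impure for 2 classes
```
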
51

Q

What are the 5 steps in sorting data into sets of least impurity?

A

1) Split tree by feature x (e.g. age); it results in 2 nodes (younger than 30 & older than 30)
2) p_i,k represents the proportion of instances of class k in node i (e.g. the "younger than 30" node)
3) Calculate the Gini index
4) Select the feature (age, gender, etc.) that produces the lowest weighted sum of the Gini scores for the child nodes
5) Repeat until leaf node reached or Gini score becomes very small (indicating minimal impurity)

52

Q

What is entropy?

A

A measure of uncertainty/randomness in the data

53

Q

What is gain ratio?

A

Information gain criterion is biased toward features with more possible values, so we reduce bias with gain ratio
Gain_ratio(D,a) = Gain(D,a) / IV(a)
Where IV is the intrinsic value of feature a - it's large when a has many possible values

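A sketch of entropy, information gain and gain ratio under the definitions above (illustrative; `splits` is assumed to be the partition of D's labels induced by feature a):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(D_labels, splits):
    n = len(D_labels)
    gain = entropy(D_labels) - sum(len(s) / n * entropy(s) for s in splits)
    iv = -sum(len(s) / n * math.log2(len(s) / n) for s in splits)   # intrinsic value IV(a)
    return gain / iv
```
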
54

Q

What is the drawback of gain ratio?

A

It is biased toward features with fewer possible values

55

Q

What are the 3 advantages of decision trees?

A

Can achieve 0% error rate if each training example is assigned to a unique leaf node
Easy to prepare data
Highly interpretable - white-box model (can understand prediction reasoning)

56

Q

What are the 3 disadvantages of decision trees?

A

High training time
High variance leads to overfitting
Sensitive to variation in dataset e.g. rotation, change in data etc.

57

Q

What are 3 regularisation hyperparameters for decision trees and why are they necessary?

A

* Max tree depth
* Min samples a node must have before splitting
* Min samples a leaf node must have
Necessary to avoid overfitting

58

Q

What are the 3 advantages of bagging?

A

Reduces overfitting
Handles missing values
Can perform parallel processing

59

Q

What are the 2 disadvantages of bagging?

A

Doesn't address bias
Low interpretability

60

Q

What is random forest?

A

An extension of bagging
Instead of creating a big decision tree, create multiple smaller decision trees
Instead of selecting optimal split features, select from a subset of features randomly generated from the feature set of the node
Train hundreds/thousands of trees on bootstrapped datasets and aggregate predictions

61

Q

What are the 2 ways random forests aggregate predictions?

A

Classification: each tree votes on class of new data point; majority vote wins
Regression: each tree predicts a value; average of predictions taken as final prediction

62

Q

What is boosting?

A

A family of algorithms that converts weak learners to strong learners
Each model is trained to correct predecessor errors by giving more weight to misclassified examples

63

Q

What are the 4 steps of boosting?

A

1) Train a base learner
2) Adjust distribution of training samples according to results of base learner so incorrectly classified samples receive more attention
3) Train the next base learner with adjusted training samples; result is used to adjust training sample distribution again
4) Repeat 2 & 3 until no. base learners reaches a defined value

64

Q

What 2 properties should individual learners have in boosting?

A

Should be accurate and diverse

65

Q

What are the 5 steps of AdaBoost (adaptive boosting)?

A

1) Initialise weights for all training samples
2) Train weak learners on weighted dataset
3) Increase weights of misclassified examples so next weak learner focuses on them
4) Assign weight to each weak learner based on accuracy and combine predictions using classification/regression
5) Repeat process iteratively until defined no. iterations or model achieves desired accuracy

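A sketch of one AdaBoost reweighting round (steps 3 & 4), assuming binary labels in {-1, +1} (illustrative, not the deck's notation):

```python
import numpy as np

def adaboost_round(weights, y_true, y_pred):
    err = weights[y_true != y_pred].sum()                  # weighted error of this weak learner
    alpha = 0.5 * np.log((1 - err) / err)                  # learner weight: more accurate -> larger alpha
    weights = weights * np.exp(-alpha * y_true * y_pred)   # misclassified samples get heavier
    return weights / weights.sum(), alpha                  # renormalise for the next round
```
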
66

Q

What are 3 advantages of boosting?

A

Reduces bias
High accuracy
Adaptive

67

Q

What are 3 disadvantages of boosting?

A

Sensitive to outliers
Less parallelisable
Overfitting if boosting rounds are high

68

Q

What vectors are used for non-ordinal attributes in regression?

A

For k possible values (watermelon, pumpkin, cucumber), k-dimensional one-hot vectors: (1,0,0), (0,1,0), (0,0,1)

69

Q

What is the linear regression model?

A

x = (x1, x2, ..., xn)
w = (w1, w2, ..., wn)
b is bias (y-intercept)
f(x) = w^T x + b

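The model as a couple of lines of NumPy (illustrative values):

```python
import numpy as np

w = np.array([0.2, -0.5, 1.0])   # one weight per feature
b = 0.1                          # bias (y-intercept)
x = np.array([3.0, 1.0, 2.0])

print(w @ x + b)                 # f(x) = w^T x + b -> 2.2
```
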
70

Q

What is the hyperplane in supervised ML?

A

The line/surface that separates data into different groups
You want to maximise the margin

71

Q

What is the decision boundary / hyperplane in supervised ML?

A

The line equidistant from the margin boundaries (the margin boundaries themselves pass through the support vectors)

72

Q

What is the margin in supervised ML?

A

The distance (both ways) between the decision boundary and the closest data points from any class (the distance both ways, so essentially the distance between classes)
We want to maximise this

73

Q

What are support vectors in supervised ML?

A

The data points (of different classes) that lie on the margin boundaries

74

Q

What does generative learning do?

A

Trains model to learn to generate new data instances by learning the underlying data distribution

75

Q

What are 4 advantages of generative learning?

A

Can generate new data
Handles missing data (by learning underlying distribution)
Good at low-resource learning
Provides deep insights into underlying structure

76

Q

What are 3 disadvantages of generative learning?

A

Computationally expensive
Hard to train - especially high-dimensional data
Low classification performance

77

Q

What does discriminative learning do?

A

Focuses on learning the decision boundary between input features and target classes, rather than modelling the data distribution

78

Q

What are 4 advantages of discriminative learning?

A

Easy to train - learns boundaries between classes
High classification accuracy
Fast inference
Efficient with large datasets

79

Q

What are 3 disadvantages of discriminative learning?

A

Limited understanding of data structure
Struggles with missing data
Requires large labelled datasets

80

Q

What are the 2 approaches to pruning *(not when)*?

A

Cost-complexity pruning: set a threshold for cost of a subtree & remove subtrees that exceed it
Error-based pruning: evaluate performance of tree on validation set & remove nodes that don't improve accuracy

81

Q

What is pre-pruning?

A

Evaluates the generalisation ability of each split and cancels the split if improvement is small (or worse) - includes root node

82

Q

What is a decision stump?

A

A decision tree with only one split

83

Q

What is post-pruning?

A

Allows the tree to grow into a complete tree
Re-examines non-leaf nodes and replaces them with leaf nodes if replacement improves generalisation ability

84

Q

What is the pro and con of post-pruning?

A

Pro: less prone to underfitting - better generalisation ability
Con: longer training time as it examines every non-leaf node

85

Q

How is generalisation ability measured for pruning?

A

Performance evaluation methods such as the hold-out method

86

Q

How can you split continuous values in a decision tree?

A

Bi-partitioning:
1) Sort data by value
2) Evaluate split points by information gain (n-1 potential split points as n-1 midpoints of adjacent values)
3) Select optimal split as t and split at t

87

Q

What are the 3 ways of handling missing values?

A

Imputation: replace with estimated values (mean/median/mode)
Ignoring: remove instances
Treat as a unique category

88

Q

What is the problem with using single-feature checks in decision trees for defining decision boundaries?

A

Splits are axis-parallel, so many segments are needed for good approximations of oblique boundaries -> slow

89

Q

What is logistic regression and when is it necessary?

A

Binary classifier that predicts probability of outcome by applying logistic function to linear model
For when data can't be classified by a linear equation

90

Q

What is RBF (Radial Basis Function)?

A

A non-linear data transformation (increases dimensionality) to make the data linearly separable

91

Q

What is RBFN (RBF Network)?

A

A type of multilayer perceptron
Has strictly 1 hidden layer with more RBF neurons than the no. inputs (to increase dimensionality)

92

Q

How does RBFN work (4 points)?

A

* Each RBF neuron **stores a prototype vector** chosen from training data
* The neuron **finds the distance** / similarity score (low, 0-1, high) between its input and its prototype
* **Response decreases** exponentially **as distance** between input & prototype **increases** (weight of the neuron in deciding output decreases)
* Output value called the '**activation**' of the neuron

93

Q

Name 2 popular RBFs

A

Gaussian
Multi-quadric

94

Q

Name 2 non-radial basis functions

A

Linear
Thin plate splines

95

Q

What are 3 limitations of RBFNs?

A

* Performance determined by centres, radiuses & RBF chosen
* RBFs must cover the input space well to work effectively
* Faster training, but slower classifications than MLPs

96

Q

How are perceptrons trained?

A

Set weights to small random numbers, then (a runnable Python version of the deck's pseudocode):

```python
def train_perceptron(X, y, w, b, lr=0.1, T=100):
    for _ in range(T):                        # T iterations / until convergence
        for xj, tj in zip(X, y):              # for each input vector
            out = 1.0 if sum(wi * xi for wi, xi in zip(w, xj)) + b > 0 else 0.0
            w = [wi - lr * (out - tj) * xi for wi, xi in zip(w, xj)]
            b = b - lr * (out - tj)           # update: old weight - lr * (actual - target) * input
    return w, b
```

97

Q

What's the purpose of activation functions?

A

To introduce non-linearity to each neuron's output

98

Q

Name 4 common activation functions & their formulae

A

Sigmoid: 1 / (1+e^-x)
ReLU: max(0,x)
Softmax: forces outputs to sum to 1 for probability distribution
Tanh: (1-e^-2x) / (1+e^-2x)

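The four functions in NumPy (illustrative; softmax shown with the usual max-shift for numerical stability):

```python
import numpy as np

def sigmoid(x): return 1 / (1 + np.exp(-x))
def relu(x):    return np.maximum(0, x)
def tanh(x):    return (1 - np.exp(-2 * x)) / (1 + np.exp(-2 * x))

def softmax(x):
    e = np.exp(x - x.max())   # subtracting the max doesn't change the result
    return e / e.sum()        # outputs form a probability distribution

print(softmax(np.array([1.0, 2.0, 3.0])).sum())   # ~1.0
```
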
99

Q

What is gradient descent in backpropagation?

A

An optimisation algorithm
1) Backpropagation **calculates gradients of loss function** with respect to parameters
2) Gradient descent uses those gradients to **adjust parameters** (weights & bias) in direction that **minimises loss**

100

Q

Name 3 cost functions

A

MSE (mean squared error)
KL divergence
Hellinger distance

101

Q

What is a cost/loss function?

A

A function that measures the discrepancy between predicted output and actual output

102

Q

Give an example of a transformation of data for logistic regression

A

Instead of using position vectors, use distance vectors from some point

103

Q

What are cascading logistic regression models? Give an example

A

A technique where multiple logistic regression models are applied sequentially on subsets of the problem to refine predictions
Example (classifying images): first model classifies if image is animal or vehicle, second model distinguishes animal between cat or dog, third model distinguishes vehicle as car or bike

104

Q

Describe a fully connected feedforward network

A

x = input, f(x) = function applied to input, y = output
y = f(x) = σ(W^n ... σ(W^2 * σ(W^1 * x + b^1) + b^2) ... + b^n)

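The formula as a loop (illustrative sketch; one (W, b) pair per layer):

```python
import numpy as np

def forward(x, weights, biases, sigma=np.tanh):
    a = x
    for W, b in zip(weights, biases):   # innermost layer first
        a = sigma(W @ a + b)            # y = sigma(W^n ... sigma(W^1 x + b^1) ... + b^n)
    return a
```
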
105

Q

What is multi-class classification?

A

Classifies data with more than 2 possible outcomes
Probabilities determine predicted class - calculated with softmax

106

Q

How do you find a good learning rate?

A

Plot loss over each iteration for varying learning rates

107

Q

In what 2 ways are deeper networks better than shallower networks?

A

Uses parameters more efficiently
Has various levels of abstraction (better generalisation & nested hierarchy of concepts)

108

Q

What is modularisation?

A

The process of breaking down a task into smaller, manageable units called modules

109

Q

What is universal approximation theorem?

A

States that a feedforward neural network with a single hidden layer containing a sufficient number of neurons can approximate any continuous function, given appropriate activation functions

110

Q

What is the reality of Universal Approximation Theorem?

A

In practice, deeper networks are more efficient & require fewer neurons

111

Q

What is a convolution layer?

A

Instead of each neuron connecting to every pixel of the image, it connects only to pixels in a nearby region (its receptive field)

112

Q

Give 3 reasons for using CNN (convolutional neural network) for image recognition

A

* Patterns may be smaller than whole image
* Patterns may appear in different regions - leads to multiple detectors for different regions doing the same thing
* Decreasing the resolution shouldn't affect detection

113

Q

What 4 steps are there in CNNs (convolutional neural networks)?

A

1) Convolution
2) Max Pooling
3) Repeat 1 & 2 as many times as you want
4) Flatten

114

Q

What is convolution?

A

Using filters on an image to detect small patterns

115

Q

What is max pooling?

A

Cut new image (from convolution) into pieces (pooling) and take the max value from each piece

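Minimal NumPy sketches of both operations (illustrative; no padding or stride options):

```python
import numpy as np

def convolve2d(img, kernel):
    kh, kw = kernel.shape
    h, w = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    out = np.empty((h, w))
    for i in range(h):                  # slide the filter over the image
        for j in range(w):
            out[i, j] = (img[i:i+kh, j:j+kw] * kernel).sum()
    return out

def max_pool(fmap, size=2):
    h, w = fmap.shape[0] // size, fmap.shape[1] // size
    return fmap[:h*size, :w*size].reshape(h, size, w, size).max(axis=(1, 3))
```
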
116

Q

What is flattening in CNNs?

A

Converting the multi-dimensional feature map output from convolutional layers into a one-dimensional vector

117

Q

What are exploding & vanishing gradients?

A

Exploding: when weights are initialised above 1, repeated multiplication over a large no. iterations makes gradients (and weights) very big
Vanishing: when weights are initialised below 1, repeated multiplication over a large no. iterations shrinks gradients toward 0

118

Q

How can exploding gradients be addressed?

A

Use clipping: gradient is capped, preventing excessively large updates to weights

119

Q

What is a gated recurrent unit?

A

A variant of LSTM
Instead of using a forget gate to determine what fraction of information to retain, how much past information to retain/discard is based on the update gate vector

120

Q

What are the 5 aspects of data characterisation?

A

1) What data is **available**? Can I get more? How?
2) What **format** is the data in? Transform? Cost?
3) Assess data **quality**
4) **Identify test/training** datasets
5) Assess **bias, legal, privacy, ethics** considerations

121

Q

In what 4 ways is data quality assessed?

A

**Completeness** - enough labelled data?
**Accuracy**
**Believability** - can method be trusted?
**Timeliness** - from right timeframe?

122

Q

Where is bias introduced (2)?

A

Features - e.g. introducing gender as a feature
Training set

123

Q

What is smoothing?

A

A post-processing step
e.g. if a journey with 5 min splits goes car, car, bike, car, car, that bike should be smoothed out

124

Q

What is fault analysis?

A

A framework to help debug the model
The further down the fault tree you go, the closer to the root cause you get

125

Q

What is undersegmentation & oversegmentation?

A

Undersegmentation: distinct regions incorrectly grouped as single segment
Oversegmentation: single region divided into too many segments

126

Q

What are GMMs (Gaussian mixture models)?

A

A more complicated method of clustering data than k-means
Represents a dataset as a mixture of several Gaussian distributions

127

Q

What property should the mixing coefficients have in a Gaussian mixture density?

A

The mixing coefficients (π_i) sum to 1

128

Q

What are the 2 steps of the EM algorithm?

A

Parameters: means, variances, mixing coefficients
**Expectation step**: use parameter estimates to calculate expected values of the latent variables, i.e. how likely each data point is to belong to each cluster
**Maximisation step**: update parameters by maximising likelihood of data

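A short usage sketch with scikit-learn's GaussianMixture, which runs EM internally (synthetic data; illustrative):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.vstack([np.random.randn(100, 2),        # two synthetic clusters
               np.random.randn(100, 2) + 5])

gmm = GaussianMixture(n_components=2).fit(X)   # alternates E- and M-steps until convergence
print(gmm.means_)                              # learned means
print(gmm.predict_proba(X[:3]))                # E-step's soft cluster assignments
```
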
129

Q

What are 5 advantages of GMMs?

A

* **F**lexibility
* **I**nterpretability
* **S**peed
* **H**andling of missing data
* **R**obustness to outliers

130

Q

What are 5 disadvantages of GMMs?

A

* **S**ensitive to initialisation
* **C**hoosing number of components
* **A**ssumes normal distribution
* **L**imited expressive power
* **E**xpensive when high D

131

Q

What are autoencoders?

A

An unsupervised learning technique
Network learns to compress (encode) and reconstruct (decode) data by creating a bottleneck (hidden) layer forcing the network to find meaningful representations of the data
Network is trained to minimise reconstruction error

132

Q

What is latent space?

A

A lower-dimensional representation of data that captures its essential features & underlying structure
Helps represent hidden relationships within the data

133

Q

What is SVD (singular value decomposition)?

A

Factorises a matrix into three matrices: rotation -> rescaling -> rotation
Used as a data reduction tool by determining key features

134

Q

How does PCA preserve information?

A

By finding the direction of data with most variance (most information)

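A sketch of PCA via SVD on centred data (illustrative):

```python
import numpy as np

X = np.random.randn(100, 5)
Xc = X - X.mean(axis=0)                             # centre the data
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)   # rotation -> rescaling -> rotation
X_reduced = Xc @ Vt[:2].T                           # project onto 2 highest-variance directions
print(X_reduced.shape)                              # (100, 2)
```
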
135

Q

What are support vector machines?

A

Powerful ML models capable of performing **linear & non-linear classification & regression** tasks

136

Q

Are margins sensitive to feature scaling?

A

Yes

137

Q

What is a hard margin?

A

A margin that strictly imposes that all instances be "off the street" (not in the margin)

138

Q

What is a soft margin?

A

A margin that balances keeping the margin as large as possible whilst limiting "margin violations"
C hyperparameter controls amount of violation allowed (small C allows more violations)
Violation scales with distance