Machine Learning Technologies Flashcards

1
Q

What are the 4 types of ML techniques?

A

Supervised
Semi-Supervised
Unsupervised
Reinforcement

2
Q

What is error rate?

A

The proportion of incorrectly classified samples to the total number of samples

3
Q

What is empirical error?

A

Error calculated on training set

4
Q

What is generalisation error?

A

Error calculated on unseen samples

5
Q

What are the 4 reasons for underfitting happening?

A

Model too simple
Insufficient training
Uninformative dataset
Over-regularised

6
Q

What are the 4 reasons for overfitting happening?

A

Too complex
Excessive training
Small dataset
Lacking regularisation

7
Q

How to fix overfitting?

A

Change model and/or change data

8
Q

How to fix underfitting?

A

Update model and/or add more data

9
Q

Why is overfitting unavoidable?

A

Because P≠NP - there are problems for which we can verify a solution quickly but finding that solution efficiently is computationally infeasible. If overfitting could be avoided entirely, we could obtain the optimal model simply by minimising empirical error, which would imply P=NP

10
Q

What’s the hold-out method?

A

Where dataset is split into two disjoint subsets (training set & testing set)

11
Q

Why do we use stratified sampling?

A

To preserve the class proportions of the full dataset in each subset, preventing a biased error estimate

12
Q

What are the 2 difficulties in choosing the data split?

A

More data in training set -> better model approximation but less reliable evaluation
More data in testing set -> better evaluation but weaker approximation

13
Q

What is LOO (Leave-One-Out)?

A

A case of k-fold cross-validation where k = n. Each test set contains a single sample and the training set contains the remaining n-1
Close to an ideal evaluation, but the computational cost is prohibitive for large datasets

14
Q

What are the 5 steps of bootstrapping?

A

For dataset D containing n samples
1) Randomly pick a sample from D
2) Copy it to D'
3) Put it back in D
4) Repeat steps 1-3 n times
5) Use D' as training set and D\D' as testing set
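A minimal Python sketch of this procedure (function and variable names are illustrative):

    import random

    def bootstrap_split(D):
        """Build D' by sampling n times with replacement; leftovers (out-of-bag) form the test set."""
        n = len(D)
        picked = [random.randrange(n) for _ in range(n)]          # steps 1-4
        train = [D[i] for i in picked]                            # D'
        test = [D[i] for i in range(n) if i not in set(picked)]   # out-of-bag samples
        return train, test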

15
Q

What proportion of the data ends up in the testing set in bootstrapping?

A

Chance of a given sample not being picked in n rounds: (1 - 1/n)^n
As n -> infinity, this -> 1/e ≈ 0.368
So ~36.8% of the original samples don't appear in D' (this remaining data is called OOB (out-of-bag) data)
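A quick numerical check of the limit:

    n = 1_000_000
    print((1 - 1/n) ** n)   # ~0.36788, approaching 1/e ≈ 0.368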

16
Q

What is out-of-bag estimate?

A

The evaluation result obtained by bootstrapping

17
Q

Parameters vs hyperparameters

A

Parameters are internal variables learned automatically from the data (modern models can have billions)
Hyperparameters are external variables set by the user (typically fewer than ten)

18
Q

What is accuracy?

A

Correctly predicted instances / all instances

19
Q

What is error?

A

Incorrectly predicted instances / all instances

20
Q

What is precision?

A

Correctly predicted positives / predicted positives

21
Q

What is recall?

A

Correctly predicted positives / actual positives

22
Q

What is specificity?

A

Correctly predicted negatives / actual negatives

23
Q

What is a P-R curve?

A

Precision-recall curve
A tool for evaluating effectiveness of a classification model

24
Q

What 3 solutions are there to intersecting lines in a P-R curve?

A
  • Compare areas under curves - not easy to compute
  • Break-even point - measure the point on the curves where precision & recall are equal
  • F1-Measure - harmonic mean of P & R:
    F1 = 2 x (P x R) / (P + R)
    = 2 x TP / (N + TP - TN), where N is the total number of samples
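A small worked check of the identity above (the counts are made up purely for illustration):

    # Hypothetical confusion-matrix counts, purely illustrative
    TP, FP, FN, TN = 30, 10, 5, 55
    N = TP + FP + FN + TN                            # total samples
    P = TP / (TP + FP)                               # precision
    R = TP / (TP + FN)                               # recall
    F1 = 2 * P * R / (P + R)
    assert abs(F1 - 2 * TP / (N + TP - TN)) < 1e-12  # the identity from this card
    print(P, R, F1)                                  # 0.75 0.857... 0.8
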
25
Q

In what situations are precision & recall more important?

A

Precision more important in recommender systems
Recall more important in information retrieval systems

26
Q

In F_beta, for what values of beta are precision & recall more important?

A

Precision: beta < 1
Recall: beta > 1
(F_beta = (1 + beta^2) x P x R / ((beta^2 x P) + R))

27
Q

Discuss the use of multiple confusion matrices

A

1) Precision & recall calculated for each round of training & testing -> n binary confusion matrices
2) Take averages for macro-P and macro-R, then compute macro-F1 from them
3) Calculate element-wise averages (TP, FP, TN, FN) and use them to obtain micro-P, micro-R, micro-F1

28
Q

What type of learning technique is clustering?

A

Unsupervised

29
Q

What is prototype clustering?

A

Starts with initial prototype clusters
Iteratively updates & optimises the prototypes

30
Q

Define Occam’s Razor

A

Choose the smallest number of clusters that adequately explains the data

31
Q

What are the 4 steps in updating centroids in K-Means clustering?

A

1) Initialise K random centroids (from existing data points)
2) Expectation Maximisation (E-Step): determine which cluster each data point is closest to (Euclidean distance) and assign it
3) Expectation Maximisation (M-Step): recompute centroids based on assigned points
4) Repeat 2 & 3 until convergence

32
Q

What are the 3 advantages of K-means clustering?

A

Simple & efficient
Interpretable clusters
Can help ML models learn & make predictions more easily

33
Q

What are the 5 disadvantages of K-means clustering?

A

Sensitive to initial centroids
Assumes clusters are equally sized
Results depend on the chosen no. clusters
Outliers skew centroids
Not suitable for non-linear data

34
Q

Intra-cluster vs inter-cluster similarity

A

Intra-cluster: items within a cluster should be similar
Inter-cluster: clusters themselves should be dissimilar

35
Q

What are the 2 types of validity indices?

A

External index: compares clustering results against a reference model
Internal index: evaluates clustering results without reference model

36
Q

Name 3 commonly used external validity indices

A

(Take values in range [0,1])

  • Jaccard Coefficient (JC)
  • Fowlkes & Mallows Index (FMI)
  • Rand Index (RI)
37
Q

Name 2 commonly used internal validity indices

A
  • Davies-Bouldin Index (DBI)
  • Dunn Index (DI)
38
Q

What are the 4 distance axioms?

A

Non-negativity: dist(a,b)>=0
Identity of indiscernibles: if dist(a,b)=0, a=b
Symmetry: dist(a,b)=dist(b,a)
Subadditivity: dist(a,b)<=dist(a,c)+dist(c,b)

39
Q

What distances don’t satisfy the subadditivity condition?

A

Non-metric distances

40
Q

What are ordinal attributes?

A

Categorical attributes that have a natural/inherent order e.g. {low, medium, high} can be represented as {1, 2, 3}

41
Q

What are non-ordinal attributes?

A

Categorical attributes that DON’T have a natural/inherent order e.g. {aircraft, train, ship}

42
Q

Describe the Minkowski Distance (MD)

A

Satisfies all axioms
Only applicable to ordinal attributes
dist(x_i, x_j) = (Σ_u |x_iu - x_ju|^p)^(1/p)
When p=1, becomes Manhattan distance
When p=2, becomes Euclidean distance
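A minimal numpy sketch (the helper name is illustrative):

    import numpy as np

    def minkowski(x, y, p):
        """(sum_u |x_u - y_u|^p)^(1/p)"""
        return np.sum(np.abs(np.asarray(x) - np.asarray(y)) ** p) ** (1 / p)

    a, b = [1.0, 2.0, 3.0], [4.0, 0.0, 3.0]
    print(minkowski(a, b, 1))   # p=1, Manhattan distance: 5.0
    print(minkowski(a, b, 2))   # p=2, Euclidean distance: ~3.606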

43
Q

Describe the Value Difference Metric (VDM)

A

Can be applied to non-ordinal attributes
m_u,a denotes the number of samples taking value a (e.g. red) on attribute u (e.g. colour); m_u,a,i denotes the number of samples within the ith cluster taking value a on attribute u; k is the no. clusters
VDM_p(a,b) = Σ_{i=1 to k} |m_u,a,i / m_u,a - m_u,b,i / m_u,b|^p

44
Q

How can MD & VDM be combined?

A

1) Arrange ordinal attributes in front of non-ordinal attributes
2) n_c denotes the no. ordinal attributes and n - n_c denotes the no. non-ordinal attributes
3) Apply MinkovDM(): Minkowski distance over the n_c ordinal attributes plus VDM over the remaining non-ordinal ones

45
Q

What is Hamming distance?

A

The number of positions at which two equal-length strings differ, i.e. the number of symbols that must be changed to turn one string into the other

46
Q

What is Jaccard index?

A

The size of the intersection / the size of the union of the sample sets
Doesn’t work well for nominal data

47
Q

What is cosine index?

A

The cosine of the angle between 2 vectors of n dimensions
Doesn't work well for nominal data

48
Q

What is bagging?

A

Bootstrap aggregating
Combines predictions from multiple models (base learners)

49
Q

Decision trees recursively iterate until one of what 3 conditions is met?

A
  • All samples in current node belong to same class
  • No samples in current node
  • No features left to split on or all samples have same feature values
50
Q

What is the Gini Impurity Index?

A

A measure of the impurity of a node in a decision tree (0 = pure, values near 1 = impure)
G = 1 - Σ_k p_k^2

51
Q

What are the 5 steps in sorting data into sets of least impurity?

A

1) Split the tree by a feature x (e.g. age), resulting in 2 nodes (younger than 30 & older than 30)
2) p_i,k represents the proportion of instances of class k in node i
3) Calculate the Gini index of each child node
4) Select the feature (age, gender, etc.) that produces the lowest weighted sum of the Gini scores for the child nodes
5) Repeat until a leaf node is reached or the Gini score becomes very small (indicating minimal impurity)
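A short sketch of steps 3-4 in numpy (helper names are illustrative):

    import numpy as np

    def gini(labels):
        """Gini impurity: 1 - sum of squared class proportions."""
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return 1.0 - np.sum(p ** 2)

    def split_score(left_labels, right_labels):
        """Weighted sum of child-node Gini scores for a candidate split; lower is better."""
        n = len(left_labels) + len(right_labels)
        return (len(left_labels) / n * gini(left_labels)
                + len(right_labels) / n * gini(right_labels))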

52
Q

What is entropy?

A

The average number of (yes/no) questions you need to ask to identify a sample: H = -Σ_k p_k log2(p_k)

53
Q

What is gain ratio?

A

Information gain criterion is biased toward features with more possible values, so we reduce bias with gain ratio
Gain_ratio(D,a) = Gain(D,a)/IV(a)
IV is the intrinsic value of feature a - it’s large when a has many possible values

54
Q

What is the drawback of gain ratio?

A

It is biased toward features with fewer possible values

55
Q

What are the 3 advantages of decision trees?

A

Can achieve 0% training error rate if each training example is assigned to a unique leaf node
Easy to prepare data
Highly interpretable - white-box model (prediction reasoning can be understood)

56
Q

What are the 3 disadvantages of decision trees?

A

High training time
High variance leads to overfitting
Sensitive to variation in dataset e.g. rotation, change in data etc.

57
Q

What are 3 regularisation hyperparameters for decision trees and why are they necessary?

A

Max tree depth
Min samples a node must have before splitting
Min samples a leaf node must have

Necessary to avoid overfitting

58
Q

What are the 3 advantages of bagging?

A

Reduces overfitting
Handles missing values
Can perform parallel processing

59
Q

What are the 2 disadvantages of bagging?

A

Doesn’t address bias
Low interpretability

60
Q

What is random forest?

A

An extension of bagging

Instead of creating a big decision tree, create multiple smaller decision trees
Instead of selecting optimal split features, select from a subset of features randomly drawn from the node's feature set
Train hundreds/thousands of trees on bootstrapped datasets and aggregate predictions

61
Q

What are the 2 ways random forests aggregate predictions?

A

Classification: each tree votes on class of new data point; majority vote wins
Regression: each tree predicts a value; average of predictions taken as final prediction

62
Q

What is boosting?

A

A family of algorithms that converts weak learners to strong learners
Each model is trained to correct predecessor errors by giving more weight to misclassified examples

63
Q

What are the 4 steps of boosting?

A

1) Train a base learner
2) Adjust distribution of training samples according to results of base learner so incorrectly classified samples receive more attention
3) Train the next base learner with adjusted training samples; result is used to adjust training sample distribution again
4) Repeat 2 & 3 until no. base learners reaches a defined value

64
Q

What 2 properties should individual learners have in boosting?

A

Should be accurate and diverse

65
Q

What are the 5 steps of AdaBoost (adaptive boosting)?

A

1) Initialise equal weights for all training samples
2) Train a weak learner on the weighted dataset
3) Increase weights of misclassified examples so the next weak learner focuses on them
4) Assign a weight to each weak learner based on its accuracy and combine predictions (weighted vote for classification, weighted average for regression)
5) Repeat iteratively until a defined no. iterations is reached or the model achieves the desired accuracy
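A minimal usage sketch with scikit-learn's AdaBoostClassifier on synthetic data (in scikit-learn versions before 1.2 the parameter is base_estimator rather than estimator):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=500, random_state=0)
    # Decision stumps (max_depth=1) are the classic AdaBoost weak learner
    clf = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=1),
                             n_estimators=50)
    clf.fit(X, y)
    print(clf.score(X, y))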

66
Q

What are 3 advantages of boosting?

A

Reduces bias
High accuracy
Adaptive

67
Q

What are 3 disadvantages of boosting?

A

Sensitive to outliers
Less parallelisable
Overfitting if boosting rounds are too high

68
Q

What vectors are used for non-ordinal regression?

A

For k possible values (e.g. watermelon, pumpkin, cucumber), k-dimensional one-hot vectors: (1,0,0), (0,1,0), (0,0,1)

69
Q

What is the linear regression model?

A

x = (x1, x2, … , xn)
w = (w1, w2, … , wn)
b is bias (y-intercept)

f(x) = w^T x + b

70
Q

What is the hyperplane in supervised ML?

A

The line/surface that separates data into different groups
Maximises the margin (its distances to the nearest points of each class are equal)

71
Q

What are decision boundaries in supervised ML?

A

The lines parallel to and equidistant from the hyperplane
Where the support vectors lie

72
Q

What is the margin in supervised ML?

A

The distance between the decision boundary and the closest data points of any class (measured both ways, so essentially the separation between the classes)

73
Q

What are support vectors in supervised ML?

A

The data points (of different classes) that determine the decision boundaries

74
Q

What does generative learning do?

A

Trains model to learn to generate new data instances by learning the underlying data distribution

75
Q

What are 4 advantages of generative learning?

A

Can generate new data
Handles missing data (by learning underlying distribution)
Provides deep insights into underlying structure
Good at low-resource learning

76
Q

What are 3 disadvantages of generative learning?

A

Computationally expensive
Hard to train - especially high-dimensional data
Low classification performance

77
Q

What does discriminative learning do?

A

Trains model to predict a target variable given input features by discovering patterns in data

78
Q

What are 4 advantages of discriminative learning?

A

High accuracy for classification
Easy to train - learns boundaries between classes
Fast inference
Works well with large data

79
Q

What are 3 disadvantages of discriminative learning?

A

Limited understanding of data structure
Struggles with missing data
Requires large labelled datasets

80
Q

What are the 2 approaches to pruning?

A

Cost-complexity pruning: set a threshold for cost of a subtree & remove subtrees that exceed it
Error-based pruning: evaluate performance of tree on validation set & remove nodes that don’t improve accuracy

81
Q

What is pre-pruning?

A

Evaluates the generalisation ability of each split and cancels the split if the improvement is small (or worse) - including the root node

82
Q

What is a decision stump?

A

A decision tree with only one split

83
Q

What is post-pruning?

A

Allows the tree to grow into a complete tree
Re-examines non-leaf nodes and replaces them with leaf nodes if the replacement improves generalisation ability

84
Q

What is the pro and con of post-pruning?

A

Pro: less prone to underfitting - better generalisation ability
Con: longer training time as examines every non-leaf node

85
Q

How is generalisation ability measured for pruning?

A

Performance evaluation methods such as hold-out method

86
Q

How can you split continuous values in a decision tree?

A

Bi-partitioning:
1) Sort data by value
2) Evaluate the n-1 candidate split points (the midpoints of adjacent sorted values) by information gain
3) Select the optimal split point t and split at t

87
Q

What are the 3 ways of handling missing values?

A

Imputation: replace with estimated values (mean/median/mode)
Ignoring: remove instances
Treat as a unique category

88
Q

What is the problem with using singular feature check decision trees for defining decision boundaries?

A

Decision boundaries are axis-parallel, so many segments are needed for a good approximation of oblique boundaries -> slow

89
Q

What is logistic regression and when is it necessary?

A

Binary classifier that predicts probability of outcome by applying logistic function to linear model
For when data can’t be classified by a linear equation

90
Q

What is RBF (Radial Basis Function)?

A

A non-linear data transformation (increase dimensionality) to make the data linearly separable

91
Q

What is RBFN (RBF Network)?

A

A type of multilayer perceptron
Has strictly 1 hidden layer with more RBF neurons than the no. inputs (to increase dimensionality)

92
Q

How does RBFN work (4 points)?

A

Each RBF neuron stores a prototype vector chosen from the training data
The neuron computes the similarity score (from 0 = low to 1 = high) between the input and its prototype
The response decreases exponentially as the distance between input & prototype increases (so the neuron's weight in deciding the output decreases)
The output value is called the 'activation'

93
Q

Name 2 popular RBFs

A

Gaussian
Multi-quadric

94
Q

Name 2 non-radial basis functions

A

Linear
Thin plate splines

95
Q

What are 4 limitations of RBFNs?

A

Performance determined by the centres, radii & RBF chosen
RBFs must cover the input space well to work effectively
Centres chosen based on input data distribution, not prediction task
Faster training, but slower classifications than MLPs

96
Q

How are perceptrons trained?

A

Set weights to small random numbers
For T iterations (or until convergence):
For each input vector:
Compute activation of each neuron
Update weights:
w_i = w_i - lr x (actual output - target output) x [value of input feature i for sample j]
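A minimal numpy sketch of this loop for a single-neuron perceptron (names are illustrative):

    import numpy as np

    def train_perceptron(X, t, lr=0.1, T=100, seed=0):
        """X is (n, d); t holds 0/1 targets. Returns weights and bias."""
        rng = np.random.default_rng(seed)
        w = rng.normal(scale=0.01, size=X.shape[1])   # small random weights
        b = 0.0
        for _ in range(T):
            for x_j, t_j in zip(X, t):
                y_j = float(w @ x_j + b > 0)          # threshold activation
                w -= lr * (y_j - t_j) * x_j           # w <- w - lr(actual - target) * input
                b -= lr * (y_j - t_j)
        return w, b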

97
Q

What’s the purpose of activation functions?

A

To introduce non-linearity to each neuron's output

98
Q

Name 4 common activation functions & their formulae

A

Sigmoid: 1 / (1+e^-x)
ReLU: max(0,x)
Softmax: forces outputs to sum to 1 for probability distribution
Tanh: (1-e^-2x) / (1+e^-2x)
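These translate directly into numpy (a minimal sketch):

    import numpy as np

    def sigmoid(x): return 1 / (1 + np.exp(-x))
    def relu(x):    return np.maximum(0, x)
    def softmax(x):
        e = np.exp(x - np.max(x))   # shift for numerical stability
        return e / e.sum()          # outputs sum to 1
    def tanh(x):    return (1 - np.exp(-2 * x)) / (1 + np.exp(-2 * x))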

99
Q

What is gradient descent in backpropagation?

A

An optimisation algorithm used to update weights & biases
After backpropagation calculates the gradients of the loss function with respect to the parameters, gradient descent uses those gradients to adjust the parameters in the direction that minimises the loss

100
Q

Name 3 cost functions

A

MSE
KL divergence
Hellinger distance

101
Q

What is a cost/loss function?

A

A function that measures the discrepancy between predicted output and actual output

102
Q

Give an example of a transformation of data for logistic regression

A

Instead of using position vectors, use distance vectors from some point

103
Q

What are cascading logistic regression models? Give an example

A

A technique where multiple logistic regression models are applied sequentially on subsets of the problem to refine predictions

Example (classifying images): first model classifies if image is animal or vehicle, second model distinguishes animal between cat or dog, third model distinguishes vehicle as car or bike

104
Q

Describe a fully connected feedforward network

A

x=input, f(x)=function applied to input, y=output
y = f(x) = σ(W^n … σ(W^2 * σ(W^1 * x + b^1) + b^2) … + b^n)
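A minimal numpy sketch of this forward pass (the layer sizes are arbitrary):

    import numpy as np

    def sigmoid(x): return 1 / (1 + np.exp(-x))

    def forward(x, weights, biases):
        """y = σ(W^n … σ(W^2 σ(W^1 x + b^1) + b^2) … + b^n)"""
        a = x
        for W, b in zip(weights, biases):
            a = sigmoid(W @ a + b)
        return a

    rng = np.random.default_rng(0)
    Ws = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]   # 3 -> 4 -> 2 units
    bs = [np.zeros(4), np.zeros(2)]
    print(forward(rng.normal(size=3), Ws, bs))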

105
Q

What is multi-class classification?

A

Classifies data with more than 2 possible outcomes
Probabilities determine predicted class - calculated with softmax

106
Q

How do you find a good learning rate?

A

Plot loss over each iteration for varying learning rates

107
Q

How do you calculate loss?

A
  • Cross-entropy: L = -Σ yi * log(pi)
    where p is the predicted probabilities in the multi-class classification (e.g. [0.2 0.7 0.1]) and y is the actual one-hot labels (e.g. [0 1 0])
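For the example values above:

    import numpy as np

    p = np.array([0.2, 0.7, 0.1])   # predicted probabilities
    y = np.array([0, 1, 0])         # one-hot actual label
    print(-np.sum(y * np.log(p)))   # = -log(0.7) ≈ 0.357
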
108
Q

In what 2 ways are deeper networks better than shallower networks?

A

Uses parameters more efficiently
Has various levels of abstraction (better generalisation & a nested hierarchy of concepts)

109
Q

What is modularisation?

A

The process of breaking down a task into smaller, manageable units called modules

110
Q

What is universal approximation theorem?

A

States that neural networks with one hidden layer and a non-linear activation function can approximate any continuous function on a closed interval, given enough neurons

111
Q

What 3 implications of universal approximation theorem are there?

A

NNs can model complex & arbitrary functions
Highlights network capacity, but doesn't guarantee efficient training or generalisation
Practical application requires fine-tuning, ample data & computational resources

112
Q

What is a convolution layer?

A

Instead of each pixel connecting to all other pixels, connect to pixels in a nearby region

113
Q

Give 3 reasons for using CNN (convolutional neural network) for image recognition

A
  • Patterns may be smaller than whole image
  • Patterns may appear in different regions - leads to multiple detectors for different regions doing the same thing
  • Decreasing the resolution shouldn’t affect detection
114
Q

What 4 steps are there in CNNs (convolutional neural networks)?

A

1) Convolution
2) Max Pooling
3) Repeat 1 & 2 as many times as you want
4) Flatten

115
Q

What is convolution?

A

Using filters on an image to detect small patterns

116
Q

What is max pooling?

A

Cut the new image into pieces (pooling) and take the max value from each piece

117
Q

What is flattening in CNNs?

A

Converting the multi-dimensional feature map output from the convolutional layers into a one-dimensional vector

118
Q

What are exploding & vanishing gradients?

A

Exploding: when weights are greater than 1, repeated multiplication over many layers/iterations makes the gradients very large
Vanishing: when weights are less than 1, repeated multiplication over many layers/iterations shrinks the gradients towards 0

119
Q

How can exploding gradients be addressed?

A

Use clipping: gradient is capped, preventing excessively large updates to weights

120
Q

What is a gated recurrent unit?

A

A variant of LSTM
Instead of using a forget gate to determine what fraction of information to retain, how much past information to retain/discard is based on input gate vector

121
Q

What are the 6 steps in the problem characterisation process?

A

1) Frame the problem
2) What is the input/output of the ML model(s)?
3) What business process are the model(s) supporting?
4) Is my model standalone or part of a pipeline?
5) What are my model's performance measures?
6) What constraints is it subject to?

122
Q

What are the 5 aspects of data characterisation?

A

1) What data is available? Can I get more? How?
2) What format is the data in? Transform? Cost?
3) Assess data quality
4) Identify test/training datasets
5) Assess bias, legal, privacy, ethics considerations

123
Q

In what 4 ways is data quality assessed?

A

Completeness - enough labelled data?
Accuracy
Believability - can method be trusted?
Timeliness - from the right timeframe?

124
Q

Where is bias introduced (2)?

A

Features - e.g. introducing gender as a feature
Training set

125
Q

What is smoothing?

A

A post-processing step
e.g. if a journey with 5 min splits goes car, car, bike, car, car, that bike should be smoothed out

126
Q

What is fault analysis?

A

A framework to help debug the model
The further down the fault tree you go, the closer to the root cause you get

127
Q

What is undersegmentation & oversegmentation?

A

Undersegmentation: distinct regions incorrectly grouped as single segment
Oversegmentation: single region divided into too many segments

128
Q

What are GMMs (gaussian mixture models)?

A

A more sophisticated method of clustering data than k-means

129
Q

What are the 3 parameters of multivariate gaussian distribution?

A

x ∈ R^D (let's say D (dimensions) = 2)
μ = mean vector (for D=2, μ = [μ1, μ2], the means of x1 and x2)
Σ = DxD covariance matrix (contains the variances and covariances of the variables)

For Σ = [[a, b], [c, d]]: a is the variance of x1, d is the variance of x2, and b & c are the covariance between x1 & x2

130
Q

What property should the data points have in mixture gaussian density?

A

The probabilities (pi) sum to 1

131
Q

What are the 2 steps of the EM algorithm?

A

Parameters: means, variances, mixing coefficients

Expectation step: use current parameter estimates to calculate the expected values of the latent variables, i.e. how likely each data point is to belong to each cluster
Maximisation step: update the parameters by maximising the likelihood of the data
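A minimal usage sketch with scikit-learn's GaussianMixture, which runs EM under the hood (the data is synthetic and illustrative):

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (100, 2)),    # two synthetic 2-D clusters
                   rng.normal(5, 1, (100, 2))])

    gmm = GaussianMixture(n_components=2, random_state=0).fit(X)  # EM happens here
    print(gmm.means_)                # fitted means
    print(gmm.weights_)              # mixing coefficients (sum to 1)
    print(gmm.predict_proba(X[:3]))  # E-step responsibilities per point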

132
Q

What are 5 advantages of GMMs?

A
  • Flexible
  • Robust to outliers
  • Speed
  • Handling missing data
  • Interpretability
133
Q

What are 5 disadvantages of GMMs?

A
  • Sensitive to initialisation
  • Assumes normal distribution
  • Choosing number of components
  • Expensive when high D
  • Limited expressive power
134
Q

What are autoencoders?

A

An unsupervised learning technique
Network learns to compress (encode) and reconstruct (decode) data by creating a bottleneck (hidden) layer forcing network to find meaningful representations of the data
Network is trained to minimise reconstruction error

135
Q

What is latent space?

A

A lower-dimensional representation of data that captures its essential features & underlying structure
Helps represent hidden relationships within the data

136
Q

What is SVD (singular value decomposition)?

A

Factorises a matrix into three matrices: rotation -> rescaling -> rotation
Used as a data reduction tool by determining key features

137
Q

What is PCA (principal component analysis)?

A

A dimensionality reduction technique
Transforms large sets of variables into smaller set while preserving as much info as possible

138
Q

How does PCA preserve information?

A

By finding the direction of data with most variance (most information)
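A minimal usage sketch with scikit-learn's PCA (synthetic data, illustrative):

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))            # illustrative data
    pca = PCA(n_components=2).fit(X)
    print(pca.explained_variance_ratio_)     # variance captured per component
    X_reduced = pca.transform(X)             # project 5-D data to 2-D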

139
Q

What are support vector machines?

A

Powerful ML models capable of performing linear & nonlinear classification & regression tasks

140
Q

Are margins sensitive to feature scaling?

A

Yes

141
Q

What is a hard margin?

A

A margin that strictly imposes that all instances be “off the street” (not in the margin)

142
Q

What is a soft margin?

A

A margin that balances keeping the margin as large as possible whilst limiting “margin violations”
C hyperparameter controls amount of violation allowed (small C allows more violations)
Violation scales with distance