Machine Learning Technologies Flashcards
What are the 4 types of ML techniques?
Supervised
Semi-Supervised
Unsupervised
Reinforcement
What is error rate?
The proportion of incorrectly classified samples to total no. samples
What is empirical error?
Error calculated on training set
What is generalisation error?
Error calculated on unseen samples
What are the 4 reasons for underfitting happening?
Model too simple
Insufficient training
Uninformative dataset
Over-regularised
What are the 4 reasons for overfitting happening?
Too complex
Excessive training
Small dataset
Lacking regularisation
How to fix overfitting?
Change model and/or change data
How to fix underfitting?
Update model and/or add more data
Why is overfitting unavoidable?
Because P≠NP - there are some problems for which we can verify a solution quickly but finding that solution efficiently is computationally infeasible. If overfitting could be avoided entirely, minimising the empirical error would yield the optimal model efficiently, which would imply P=NP
What’s the hold-out method?
Where dataset is split into two disjoint subsets (training set & testing set)
Why do we use stratified sampling?
To prevent biased error
What are the 2 difficulties in choosing the data split?
More data in training set -> better model approximation but less reliable evaluation
More data in testing set -> better evaluation but weaker approximation
What is LOO (Leave-One-Out)?
A case of k-fold cross-validation where k = n. So each test set contains exactly 1 sample and the training set is the remaining n-1
Close to ideal evaluation, but the computational cost is prohibitive for large datasets
What are the 5 steps of bootstrapping?
For dataset D containing n samples
1) Randomly pick a sample from D
2) Copy to D’
3) Put it back in D
4) Repeat n times
5) Use D’ as training set and D\D’ as testing set
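A minimal Python sketch of this procedure (the dataset is represented as a plain list; names are illustrative):

```python
import random

def bootstrap_split(D):
    # Steps 1-4: pick n samples from D with replacement, copying each into D'
    n = len(D)
    D_prime = [random.choice(D) for _ in range(n)]
    # Step 5: D' is the training set; D \ D' (never-picked samples) is the test set
    chosen = set(D_prime)
    oob = [x for x in D if x not in chosen]
    return D_prime, oob

D = list(range(1000))
train, test = bootstrap_split(D)
print(f"OOB fraction: {len(test) / len(D):.3f}")  # tends toward ~0.368 for large n
```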
What proportion of the data ends up in the testing set in bootstrapping?
Chance of a sample not being picked in n rounds: (1 - 1/n)^n
As n -> infinity, chance -> 1/e ≈ 0.368
So ~36.8% of original samples don’t appear in D’ (this remaining data is called OOB (out-of-bag) data)
What is out-of-bag estimate?
The evaluation result obtained by bootstrapping
Parameters vs hyperparameters
Parameters are internal variables learned automatically from the data (large models can have more than 10 billion)
Hyperparameters are external variables defined by the user (typically fewer than 10)
What is accuracy?
Correctly predicted instances / all instances
What is error?
Incorrectly predicted instances / all instances
What is precision?
Correctly predicted positives / predicted positives
What is recall?
Correctly predicted positives / actual positives
What is specificity?
Correctly predicted negatives / actual negatives
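These definitions are just ratios over confusion-matrix counts; a quick Python sketch with made-up counts:

```python
TP, FP, TN, FN = 40, 10, 45, 5  # illustrative counts, not from any real dataset

accuracy    = (TP + TN) / (TP + TN + FP + FN)
error       = (FP + FN) / (TP + TN + FP + FN)
precision   = TP / (TP + FP)   # correctly predicted positives / predicted positives
recall      = TP / (TP + FN)   # correctly predicted positives / actual positives
specificity = TN / (TN + FP)   # correctly predicted negatives / actual negatives
```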
What is a P-R curve?
Precision-recall curve
A tool for evaluating effectiveness of a classification model
What 3 solutions are there to intersecting lines in a P-R curve?
- Compare areas under curves - not easy to compute
- Break-even point - measure the point on the curves where precision & recall are equal
- F1-Measure - harmonic mean of P & R:
2 x (P * R) / (P + R)
= 2 x TP / (N + TP - TN), where N is the total no. samples
In what situations are precision & recall more important?
Precision more important in recommender systems
Recall more important in information retrieval systems
In F_beta, for what values of beta are precision & recall more important?
Precision: beta < 1
Recall: beta > 1
Discuss the use of multiple confusion matrices
1) Precision & recall calculated for each round of training & testing -> n binary confusion matrices
2) Take averages for macro-P, macro-R, macro-F1 (using mP & mR)
3) Calculate element-wise averages (TP etc) and use them to obtain micro-P, micro-R, micro-F1
What type of learning technique is clustering?
Unsupervised
What is prototype clustering?
Starts with initial prototype clusters
Iteratively updates & optimises the prototypes
Define Occam’s Razor
Prefer the simplest hypothesis that adequately explains the data - here, choose the smallest number of clusters that does so
What are the 4 steps in updating centroids in K-Means clustering?
1) Initialise K random centroids (from existing data points)
2) Expectation Maximisation (E-Step): determine which cluster each data point is closest to (Euclidean distance) and assign it
3) Expectation Maximisation (M-Step): recompute centroids based on assigned points
4) Repeat 2 & 3 until convergence
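A compact NumPy sketch of these four steps (assumes no cluster goes empty during iteration):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1) initialise K centroids from existing data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # 2) E-step: assign each point to its nearest centroid (Euclidean)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3) M-step: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # 4) repeat until convergence
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```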
What are the 2 advantages of K-means clustering?
Simple & efficient
Interpretable clusters
What are the 5 disadvantages of K-means clustering?
Sensitive to initial centroids
Assumes clusters are equally sized
Requires the no. clusters (K) to be chosen in advance
Outliers skew centroids
Not suitable for non-linear data
Intra-cluster vs inter-cluster similarity
Intra-cluster: items within a cluster should be similar
Inter-cluster: clusters themselves should be dissimilar
What are the 2 types of validity indices?
External index: compares clustering results against a reference model
Internal index: evaluates clustering results without reference model
Name 3 commonly used external validity indices
(Take values in range [0,1])
- Jaccard Coefficient (JC)
- Fowlkes & Mallows Index (FMI)
- Rand Index (RI)
Name 2 commonly used internal validity indices
- Davies-Bouldin Index (DBI)
- Dunn Index (DI)
What are the 4 distance axioms?
Non-negativity: dist(a,b)>=0
Identity of indiscernibles: if dist(a,b)=0, a=b
Symmetry: dist(a,b)=dist(b,a)
Subadditivity (triangle inequality): dist(a,b)<=dist(a,c)+dist(c,b)
What distances don’t satisfy the subadditivity condition?
Non-metric distances
What are ordinal attributes?
Categorical attributes that have a natural/inherent order e.g. {low, medium, high} can be represented as {1, 2, 3}
What are non-ordinal attributes?
Categorical attributes that DON’T have a natural/inherent order e.g. {aircraft, train, ship}
Describe the Minkowski Distance (MD)
Satisfies all axioms
dist(x, y) = (sum_u |x_u - y_u|^p)^(1/p)
Only applicable to ordinal attributes
When p=1, becomes Manhattan distance
When p=2, becomes Euclidean distance
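A one-function Python sketch showing the p=1 and p=2 special cases:

```python
import numpy as np

def minkowski(x, y, p=2):
    # dist(x, y) = (sum_u |x_u - y_u|^p)^(1/p)
    return float(np.sum(np.abs(np.asarray(x) - np.asarray(y)) ** p) ** (1 / p))

print(minkowski([0, 0], [3, 4], p=2))  # 5.0 -> Euclidean
print(minkowski([0, 0], [3, 4], p=1))  # 7.0 -> Manhattan
```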
Describe the Value Difference Metric (VDM)
Can be applied to non-ordinal attributes
m_u,a denotes the no. samples taking value a on attribute u (e.g. the no. samples in the dataset where colour is red); m_u,a,i denotes the no. samples within the ith cluster taking value a on attribute u; k is the no. clusters
VDM_p(a, b) = sum_{i=1..k} |m_u,a,i / m_u,a - m_u,b,i / m_u,b|^p
How can MD & VDM be combined?
1) Arrange ordinal attributes in front of non-ordinal attributes
2) n_c denotes the no. ordinal attributes and n - n_c denotes the no. non-ordinal attributes
3) Compute MinkovDM_p: Minkowski distance over the first n_c (ordinal) attributes plus VDM over the remaining non-ordinal attributes
What is Hamming distance?
The number of bits which need to be changed to turn one string into the other
What is Jaccard index?
The size of the intersection / the size of the union of the sample sets
Doesn’t work well for nominal data
What is cosine index?
The cosine of the angle between 2 vectors of n dimensions
Doesn’t work well for nominal data
What is bagging?
Bootstrap aggregating
Combines predictions from multiple models (base learners)
Decision trees recursively iterate until one of what 3 conditions is met?
- All samples in current node belong to same class (e.g. all “yes”)
- No samples in current node (e.g. splitting on Age > 60 leaves no samples in one branch)
- No features left to split on (split on all features) or all samples have same feature values (same no. legs & same size)
What is the Gini Impurity Index?
A criterion that measures the impurity of a node in a decision tree (0 = pure; higher = more impure)
G = 1 - sum_k(p_k^2)
What are the 5 steps in sorting data into sets of least impurity?
1) Split the tree by feature x (e.g. age); this results in 2 nodes (younger than 30 & older than 30)
2) p_i,k represents the proportion of instances of class k in node i (e.g. the proportion of “yes” instances in the younger-than-30 node)
3) Calculate the Gini index
4) Select the feature (age, gender, etc.) that produces the lowest weighted sum of the Gini scores for the child nodes
5) Repeat until leaf node reached or Gini score becomes very small (indicating minimal impurity)
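A small Python sketch of scoring one candidate split this way (labels are illustrative):

```python
def gini(labels):
    # G = 1 - sum_k(p_k^2)
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def weighted_gini(left, right):
    # weighted sum of child-node Gini scores for a candidate split
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# e.g. splitting on Age > 30: the feature with the lowest score wins
print(weighted_gini(["yes", "yes", "no"], ["no", "no", "no"]))  # ~0.222
```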
What is entropy?
A measure of uncertainty/randomness in the data: H = -sum_k(p_k log2 p_k)
What is gain ratio?
Information gain criterion is biased toward features with more possible values, so we reduce bias with gain ratio
Gain_ratio(D,a) = Gain(D,a)/IV(a)
Where IV is the intrinsic value of feature a - it’s large when a has many possible values
What is the drawback of gain ratio?
It is biased toward features with fewer possible values
What are the 3 advantages of decision trees?
Can achieve 0% error rate if each training example is assigned to a unique leaf node
Easy to prepare data
Highly interpretable - white-box model (can understand prediction reasoning)
What are the 3 disadvantages of decision trees?
High training time
High variance leads to overfitting
Sensitive to variation in dataset e.g. rotation, change in data etc.
What are 3 regularisation hyperparameters for decision trees and why are they necessary?
- Max tree depth
- Min samples a node must have before splitting
- Min samples a leaf node must have
Necessary to avoid overfitting
What are the 3 advantages of bagging?
Reduces overfitting
Handles missing values
Can perform parallel processing
What are the 2 disadvantages of bagging?
Doesn’t address bias
Low interpretability
What is random forest?
An extension of bagging
Instead of creating one big decision tree, create multiple smaller decision trees
Instead of selecting the optimal split feature from all features, select it from a subset randomly sampled from the node’s feature set
Train hundreds/thousands of trees on bootstrapped datasets and aggregate predictions
What are the 2 ways random forests aggregate predictions?
Classification: each tree votes on class of new data point; majority vote wins
Regression: each tree predicts a value; average of predictions taken as final prediction
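For reference, scikit-learn packages all of this up; a minimal sketch (assuming scikit-learn and its bundled iris dataset):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
# hundreds of trees, each fit on a bootstrap sample using random feature subsets
clf = RandomForestClassifier(n_estimators=300, max_features="sqrt").fit(X, y)
print(clf.predict(X[:3]))  # classification: majority vote across the trees
```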
What is boosting?
A family of algorithms that converts weak learners to strong learners
Each model is trained to correct predecessor errors by giving more weight to misclassified examples
What are the 4 steps of boosting?
1) Train a base learner
2) Adjust distribution of training samples according to results of base learner so incorrectly classified samples receive more attention
3) Train the next base learner with adjusted training samples; result is used to adjust training sample distribution again
4) Repeat 2 & 3 until no. base learners reaches a defined value
What 2 properties should individual learners have in boosting?
Should be accurate and diverse
What are the 5 steps of AdaBoost (adaptive boosting)?
1) Initialise weights for all training samples
2) Train weak learners on weighted dataset
3) Increase weights of misclassified examples so next weak learner focuses on them
4) Assign weight to each weak learner based on accuracy and combine predictions using classification/regression
5) Repeat process iteratively until a defined no. iterations or the model achieves desired accuracy
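A minimal sketch using scikit-learn’s AdaBoostClassifier (its default weak learner is a depth-1 decision tree, i.e. a decision stump); the toy dataset is illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=500, random_state=0)  # toy dataset
clf = AdaBoostClassifier(n_estimators=50).fit(X, y)        # 50 boosting rounds
print(clf.score(X, y))  # training accuracy of the combined ensemble
```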
What are 3 advantages of boosting?
Reduces bias
High accuracy
Adaptive
What are 3 disadvantages of boosting?
Sensitive to outliers
Less parallelisable
Overfitting if boosting rounds are high
What vectors are used for non-ordinal regression?
For k possible values (watermelon, pumpkin, cucumber), use one-hot k-dimensional vectors: (1,0,0), (0,1,0), (0,0,1)
What is the linear regression model?
x = (x1, x2, … , xn)
w = (w1, w2, … , wn)
b is bias (y-intercept)
f(x) = w^T x + b
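A NumPy sketch that fits w and b by least squares (toy data, roughly y = 2x):

```python
import numpy as np

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.1, 4.0, 6.2, 7.9])
Xb = np.hstack([X, np.ones((len(X), 1))])     # extra all-ones column absorbs b
w_b, *_ = np.linalg.lstsq(Xb, y, rcond=None)  # minimises ||Xb @ w_b - y||^2
w, b = w_b[:-1], w_b[-1]
print(w, b)  # f(x) = w^T x + b
```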
What is the hyperplane in supervised ML?
The line/surface that separates data into different groups
You want to maximise the margin
What is the decision boundary / hyperplane in supervised ML?
The line equidistant from the margin boundaries
Lies midway between the margin boundaries, on which the support vectors lie
What is the margin in supervised ML?
The distance (both ways) between the decision boundary and the closest data points from any class - essentially the distance between the two margin boundaries
We want to maximise this
What are support vectors in supervised ML?
The data points (of different classes) that lie on the margin boundaries
What does generative learning do?
Trains model to learn to generate new data instances by learning the underlying data distribution
What are 4 advantages of generative learning?
Can generate new data
Handles missing data (by learning underlying distribution)
Good at low-resource learning
Provides deep insights into underlying structure
What are 3 disadvantages of generative learning?
Computationally expensive
Hard to train - especially high-dimensional data
Low classification performance
What does discriminative learning do?
Focuses on learning the decision boundary between input features and target classes, rather than modeling the data distribution
What are 4 advantages of discriminative learning?
Easy to train - learns boundaries between classes
High classification accuracy
Fast inference
Efficient with large datasets
What are 3 disadvantages of discriminative learning?
Limited understanding of data structure
Struggles with missing data
Requires large labelled datasets
What are the 2 approaches to pruning (not when)?
Cost-complexity pruning: set a threshold for cost of a subtree & remove subtrees that exceed it
Error-based pruning: evaluate performance of tree on validation set & remove nodes that don’t improve accuracy
What is pre-pruning?
Evaluates the generalisation ability of each split and cancels the split if the improvement is small (or worse) - includes root node
What is a decision stump?
A decision tree with only one split
What is post-pruning?
Allows the tree to grow into a complete tree
Re-examines non-leaf nodes and replaces them with leaf nodes if the replacement improves generalisation ability
What is the pro and con of post-pruning?
Pro: less prone to underfitting - better generalisation ability
Con: longer training time as examines every non-leaf node
How is generalisation ability measured for pruning?
Performance evaluation methods such as hold-out method
How can you split continuous values in a decision tree?
Bi-partitioning:
1) Sort data by value
2) Evaluate split points by information gain (n-1 potential split points as n-1 midpoints of adjacent values)
3) Select optimal split as t and split at t
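A tiny Python sketch of generating the candidate thresholds (step 2), using distinct sorted values:

```python
def candidate_splits(values):
    # midpoints of adjacent sorted (distinct) values
    v = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(v, v[1:])]

print(candidate_splits([0.3, 0.7, 0.5, 0.9]))  # [0.4, 0.6, 0.8]
```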
What are the 3 ways of handling missing values?
Imputation: replace with estimated values (mean/median/mode)
Ignoring: remove instances
Treat as a unique category
What is the problem with using singular feature check decision trees for defining decision boundaries?
Decision boundaries are axis-parallel, so many segments are needed for good approximations of oblique boundaries -> slow
What is logistic regression and when is it necessary?
Binary classifier that predicts probability of outcome by applying logistic function to linear model
For when data can’t be classified by a linear equation
What is RBF (Radial Basis Function)?
A non-linear data transformation (increase dimensionality) to make the data linearly separable
What is RBFN (RBF Network)?
A type of multilayer perceptron
Has strictly 1 hidden layer with more RBF neurons than the no. inputs (to increase dimensionality)
How does RBFN work (4 points)?
- Each RBF neuron stores prototype vector chosen from training data
- The neuron computes the distance/similarity score (between 0 and 1) between its input and its prototype
- Response decreases exponentially as distance between input & prototype increases (weight of neurons in deciding output decreases)
- Output value called the ‘activation’ of the neuron
Name 2 popular RBFs
Gaussian
Multi-quadric
Name 2 non-radial basis functions
Linear
Thin plate splines
What are 3 limitations of RBFNs?
- Performance determined by centres, radiuses & RBF chosen
- RBFs must cover the input space well to work effectively
- Faster training, but slower classifications than MLPs
How are perceptrons trained?
1) Set weights to small random numbers
2) For T iterations (or until convergence):
3) For each input vector, compute the activation of each neuron
4) Update weights: w_i = w_i - lr * (actual output - target output) * (value of input feature i for sample j)
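A minimal NumPy sketch of this loop for a single step-activation perceptron (the AND-gate data is illustrative):

```python
import numpy as np

def train_perceptron(X, targets, lr=0.1, epochs=20, seed=0):
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=0.01, size=X.shape[1])  # small random weights
    b = 0.0
    for _ in range(epochs):                      # T iterations
        for x, t in zip(X, targets):             # each input vector
            y = 1.0 if w @ x + b > 0 else 0.0    # compute activation
            w -= lr * (y - t) * x                # w_i -= lr*(actual - target)*x_i
            b -= lr * (y - t)
    return w, b

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([0, 0, 0, 1], dtype=float)  # AND gate
w, b = train_perceptron(X, t)
```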
What’s the purpose of activation functions?
To introduce non-linearity to each neuron’s output
Name 4 common activation functions & their formulae
Sigmoid: 1 / (1+e^-x)
ReLU: max(0,x)
Softmax: forces outputs to sum to 1 for probability distribution
Tanh: (1-e^-2x) / (1+e^-2x)
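NumPy versions of the four, for concreteness:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def relu(x):
    return np.maximum(0, x)

def softmax(x):
    e = np.exp(x - np.max(x))  # subtract max for numerical stability
    return e / e.sum()         # outputs sum to 1

def tanh(x):
    return (1 - np.exp(-2 * x)) / (1 + np.exp(-2 * x))  # same as np.tanh(x)
```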
What is gradient descent in backpropagation?
An optimisation algorithm
1) Backpropagation calculates gradients of the loss function with respect to the parameters
2) Gradient descent uses those gradients to adjust parameters (weights & bias) in direction that minimises loss
Name 3 cost functions
MSE (mean squared error)
KL divergence
Hellinger distance
What is a cost/loss function?
A function that measures the discrepancy between predicted output and actual output
Give an example of a transformation of data for logistic regression
Instead of using position vectors, use distance vectors from some point
What are cascading logistic regression models? Give an example
A technique where multiple logistic regression models are applied sequentially on subsets of the problem to refine predictions
Example (classifying images): first model classifies if image is animal or vehicle, second model distinguishes animal between cat or dog, third model distinguishes vehicle as car or bike
Describe a fully connected feedforward network
x=input, f(x)=function applied to input, y=output
y = f(x) = σ(W^n … σ(W^2 * σ(W^1 * x + b^1) + b^2) … + b^n)
What is multi-class classification?
Classifies data with more than 2 possible outcomes
Probabilities determine predicted class - calculated with softmax
How do you find a good learning rate?
Plot loss over each iteration for varying learning rates
In what 2 ways are deeper networks better than shallower networks?
Use parameters more efficiently
Have various levels of abstraction (better generalisation & a nested hierarchy of concepts)
What is modularisation?
The process of breaking down a task into smaller, manageable units called modules
What is universal approximation theorem?
States that a feedforward neural network with a single hidden layer containing a sufficient number of neurons can approximate any continuous function, given appropriate activation functions
What is the reality of Universal Approximation Theorem?
In practice, deeper networks are more efficient & require fewer neurons
What is a convolution layer?
Instead of each neuron connecting to every pixel, each neuron connects only to pixels in a nearby region (its receptive field)
Give 3 reasons for using CNN (convolutional neural network) for image recognition
- Patterns may be smaller than whole image
- Patterns may appear in different regions - leads to multiple detectors for different regions doing the same thing
- Decreasing the resolution shouldn’t affect detection
What 4 steps are there in CNNs (convolutional neural networks)?
1) Convolution
2) Max Pooling
3) Repeat 1 & 2 as many times as you want
4) Flatten
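A minimal PyTorch sketch of this pipeline (layer sizes assume a 28x28 single-channel image and 10 classes; both are illustrative):

```python
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1),   # 1) convolution
    nn.ReLU(),
    nn.MaxPool2d(2),                              # 2) max pooling (28x28 -> 14x14)
    nn.Conv2d(16, 32, kernel_size=3, padding=1),  # 3) repeat 1 & 2
    nn.ReLU(),
    nn.MaxPool2d(2),                              #    (14x14 -> 7x7)
    nn.Flatten(),                                 # 4) flatten to a 1-D vector
    nn.Linear(32 * 7 * 7, 10),                    # classifier head
)
```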
What is convolution?
Using filters on an image to detect small patterns
What is max pooling?
Cut the new image (from convolution) into pieces (pooling) and take the max value from each piece
What is flattening in CNNs?
Converting the multi-dimensional feature map output from convolutional layers into a one-dimensional vector
What are exploding & vanishing gradients?
Exploding: when weights/gradients are greater than 1, repeated multiplication over many layers/iterations makes the weight updates very large
Vanishing: when weights/gradients are below 1, repeated multiplication drives them toward 0, so weights stop updating
How can exploding gradients be addressed?
Use clipping: gradient is capped, preventing excessively large updates to weights
What is a gated recurrent unit?
A variant of LSTM
Instead of using a separate forget gate to determine what fraction of past information to retain, how much past information to retain/discard is based on an update gate vector
What are the 5 aspects of data characterisation?
1) What data is available? Can i get more? How?
2) What format is the data in? Transform? Cost?
3) Assess data quality
4) Identify test/training datasets
5) Assess bias, legal, privacy, ethics considerations
In what 4 ways is data quality assessed?
Completeness - enough labelled data?
Accuracy
Believability - can method be trusted?
Timeliness - from right timeframe?
Where is bias introduced (2)?
Features - e.g. introducing gender as a feature
Training set
What is smoothing?
A post-processing step
e.g. if a journey with 5 min splits goes car, car, bike, car, car, that bike should be smoothed out
What is fault analysis?
A framework to help debug the model
The further down the fault tree you go, the closer to the root cause you get
What is undersegmentation & oversegmentation?
Undersegmentation: distinct regions incorrectly grouped as single segment
Oversegmentation: single region divided into too many segments
What are GMMs (gaussian mixture models)?
A more complicated method of clustering data than k-means
Represents a dataset as a mixture of several Gaussian distributions
What property should the data points have in mixture gaussian density?
The mixing coefficients (pi_i) sum to 1
What are the 2 steps of the EM algorithm?
Parameters: means, variances, mixing coefficients
Expectation step: use current parameter estimates to calculate the expected probability (responsibility) that each data point belongs to each cluster
Maximisation step: Update parameters by maximising likelihood of data
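In practice the EM loop is usually delegated to a library; a minimal scikit-learn sketch on toy two-blob data:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])

gmm = GaussianMixture(n_components=2).fit(X)  # EM runs inside fit()
print(gmm.means_)                 # learned means
print(gmm.weights_)               # mixing coefficients (sum to 1)
print(gmm.predict_proba(X[:3]))   # E-step responsibilities per cluster
```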
What are 5 advantages of GMMs?
- Flexibility
- Interpretability
- Speed
- Handling of missing data
- Robustness to outliers
What are 5 disadvantages of GMMs?
- Sensitive to initialisation
- Choosing number of components
- Assumes normal distribution
- Limited expressive power
- Expensive when high D
What are autoencoders?
An unsupervised learning technique
Network learns to compress (encode) and reconstruct (decode) data; a bottleneck (hidden) layer forces the network to find meaningful representations of the data
Network is trained to minimise reconstruction error
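A minimal PyTorch sketch of the architecture (the 784-dimensional input and 32-unit bottleneck are illustrative, e.g. flattened 28x28 images):

```python
import torch.nn as nn

autoencoder = nn.Sequential(
    nn.Linear(784, 128), nn.ReLU(),
    nn.Linear(128, 32),  nn.ReLU(),    # encoder -> 32-dim bottleneck (latent space)
    nn.Linear(32, 128),  nn.ReLU(),    # decoder
    nn.Linear(128, 784), nn.Sigmoid(), # reconstruction of the input
)
loss_fn = nn.MSELoss()  # train to minimise reconstruction error
```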
What is latent space?
A lower-dimensional representation of data that captures its essential features & underlying structure
Helps represent hidden relationships within the data
What is SVD (singular value decomposition)?
Factorises a matrix into three matrices: rotation -> rescaling -> rotation
Used as a data reduction tool by determining key features
How does PCA preserve information?
By finding the direction of data with most variance (most information)
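A NumPy sketch of PCA via SVD (random data stands in for a real matrix):

```python
import numpy as np

def pca(X, n_components):
    Xc = X - X.mean(axis=0)            # centre the data first
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T    # project onto top-variance directions

X = np.random.default_rng(0).normal(size=(200, 5))
Z = pca(X, n_components=2)
print(Z.shape)  # (200, 2)
```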
What are support vector machines?
Powerful ML models capable of performing linear & non-linear classification & regression tasks
Are margins sensitive to feature scaling?
Yes
What is a hard margin?
A margin that strictly imposes that all instances be “off the street” (not in the margin)
What is a soft margin?
A margin that balances keeping the margin as large as possible whilst limiting “margin violations”
C hyperparameter controls amount of violation allowed (small C allows more violations)
Violation scales with distance
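A quick scikit-learn sketch of how C trades margin width against violations (toy blobs; exact support-vector counts will vary):

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0)
soft = SVC(kernel="linear", C=0.1).fit(X, y)    # small C: more violations allowed
hard = SVC(kernel="linear", C=1000).fit(X, y)   # large C: stricter margin
print(len(soft.support_vectors_), len(hard.support_vectors_))
```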