Machine Learning Technologies Flashcards
What are the 4 types of ML techniques?
Supervised
Semi-Supervised
Unsupervised
Reinforcement
What is error rate?
The proportion of incorrectly classified samples to total no. samples
What is empirical error?
Error calculated on training set
What is generalisation error?
Error calculated on unseen samples
What are the 4 reasons for underfitting?
Model too simple
Insufficient training
Uninformative dataset
Over-regularised
What are the 4 reasons for overfitting?
Model too complex
Excessive training
Small dataset
Lacking regularisation
How to fix overfitting?
Change model and/or change data
How to fix underfitting?
Update model and/or add more data
Why is overfitting unavoidable?
Because P≠NP: there are problems for which a solution can be verified quickly but cannot be found efficiently. If overfitting could be completely avoided, minimising the empirical error would yield the optimal model in polynomial time, which would constructively prove P=NP.
What’s the hold-out method?
Where dataset is split into two disjoint subsets (training set & testing set)
Why do we use stratified sampling?
To keep the class proportions of the original dataset in both subsets, preventing a biased error estimate
What are the 2 difficulties in choosing the data split?
More data in training set -> better model approximation but less reliable evaluation
More data in testing set -> better evaluation but weaker model approximation
What is LOO (Leave-One-Out)?
A case of k-fold cross-validation where k = n, so each test set contains exactly one sample and the training set is the remaining n-1 samples
Close to an ideal evaluation, but the computational cost is prohibitive for large datasets
What are the 5 steps of bootstrapping?
For dataset D containing n samples
1) Randomly pick a sample from D
2) Copy to D’
3) Put it back in D
4) Repeat n times
5) Use D’ as training set and D\D’ as testing set
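A minimal Python sketch of these five steps, using indices to stand in for samples (`bootstrap_split` is an illustrative name, not from the source):

```python
import random

def bootstrap_split(n):
    """Steps 1-4: draw n indices from D with replacement to form D'.
    Step 5: the indices never picked form the test set D \\ D'."""
    picked = [random.randrange(n) for _ in range(n)]  # D'
    in_bag = set(picked)
    oob = [i for i in range(n) if i not in in_bag]    # D \ D'
    return picked, oob

train_idx, test_idx = bootstrap_split(1000)
print(len(test_idx) / 1000)  # roughly 0.368 for large n
```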
What proportion of the data ends up in the testing set in bootstrapping?
Chance of a sample not being picked in m rounds: (1 - 1/m)^m
As m -> infinity, this chance -> 1/e ≈ 0.368
So about 36.8% of the original samples never appear in D' (this remaining data, D\D', is called OOB (out-of-bag) data)
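A quick numeric check of this limit:

```python
# (1 - 1/m)^m tends to 1/e ≈ 0.3679 as m grows
for m in (10, 100, 10_000):
    print(m, (1 - 1/m) ** m)  # 0.3487..., 0.3660..., 0.3679...
```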
What is out-of-bag estimate?
The evaluation result obtained by bootstrapping
Parameters vs hyperparameters
Parameters are internal variables learned automatically from the data (modern models can have upwards of 10 billion)
Hyperparameters are external variables set by the user (typically fewer than 10)
What is accuracy?
Correctly predicted instances / all instances
What is error?
Incorrectly predicted instances / all instances
What is precision?
Correctly predicted positives / predicted positives
What is recall?
Correctly predicted positives / actual positives
What is specificity?
Correctly predicted negatives / actual negatives
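These definitions, computed from raw confusion-matrix counts; a minimal sketch with illustrative numbers:

```python
def metrics(tp, fp, tn, fn):
    """Binary-classification metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy":    (tp + tn) / total,
        "error":       (fp + fn) / total,
        "precision":   tp / (tp + fp),
        "recall":      tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

print(metrics(tp=40, fp=10, tn=45, fn=5))
```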
What is a P-R curve?
Precision-recall curve: a plot of precision against recall as the classification threshold varies
A tool for evaluating the effectiveness of a classification model
What 3 solutions are there to intersecting lines in a P-R curve?
- Compare areas under curves - not easy to compute
- Break-even point - measure the point on the curves where precision & recall are equal
- F1-Measure - harmonic mean of P & R:
2 x (P * R) / (P + R)
= 2 x TP / (N + TP - TN), where N is the total no. samples
In what situations are precision & recall more important?
Precision more important in recommender systems
Recall more important in information retrieval systems
In F_beta, for what values of beta are precision & recall more important?
Precision: beta < 1
Recall: beta > 1
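A small sketch of the general F_beta measure, (1 + beta^2) x P x R / (beta^2 x P + R), which reduces to F1 when beta = 1 (the P and R values below are illustrative):

```python
def f_beta(p, r, beta=1.0):
    """F_beta = (1 + beta^2) * P * R / (beta^2 * P + R)."""
    return (1 + beta**2) * p * r / (beta**2 * p + r)

print(f_beta(0.8, 0.6, beta=1.0))  # F1: plain harmonic mean
print(f_beta(0.8, 0.6, beta=0.5))  # beta < 1: precision weighted more
print(f_beta(0.8, 0.6, beta=2.0))  # beta > 1: recall weighted more
```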
Discuss the use of multiple confusion matrices
1) Precision & recall calculated for each round of training & testing -> n binary confusion matrices
2) Take averages for macro-P, macro-R, macro-F1 (using macro-P & macro-R)
3) Calculate element-wise averages (TP etc) and use them to obtain micro-P, micro-R, micro-F1
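A hedged sketch of both averaging schemes, assuming each round's results are given as (TP, FP, FN) counts (`macro_micro` is an illustrative helper, not from the source):

```python
def macro_micro(confusions):
    """confusions: one (TP, FP, FN) tuple per round of training & testing."""
    n = len(confusions)
    # Macro: average the per-round precision & recall, then take their F1
    macro_p = sum(tp / (tp + fp) for tp, fp, fn in confusions) / n
    macro_r = sum(tp / (tp + fn) for tp, fp, fn in confusions) / n
    macro_f1 = 2 * macro_p * macro_r / (macro_p + macro_r)
    # Micro: average the matrix elements first, then compute P, R, F1
    tp, fp, fn = (sum(c[i] for c in confusions) / n for i in range(3))
    micro_p, micro_r = tp / (tp + fp), tp / (tp + fn)
    micro_f1 = 2 * micro_p * micro_r / (micro_p + micro_r)
    return (macro_p, macro_r, macro_f1), (micro_p, micro_r, micro_f1)
```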
What type of learning technique is clustering?
Unsupervised
What is prototype clustering?
Starts with initial prototype clusters
Iteratively updates & optimises the prototypes
Define Occam’s Razor
Prefer the simplest hypothesis that adequately explains the data; in clustering, choose the smallest number of clusters that does so
What are the 4 steps in updating centroids in K-Means clustering?
1) Initialise K random centroids (from existing data points)
2) Expectation Maximisation (E-Step): assign each data point to its nearest centroid (by Euclidean distance)
3) Expectation Maximisation (M-Step): recompute centroids based on assigned points
4) Repeat 2 & 3 until convergence
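A bare-bones Python sketch of these four steps (illustrative only; real implementations handle empty clusters and convergence tolerances more carefully):

```python
import random

def k_means(points, k, iters=100):
    """points: a list of equal-length numeric tuples."""
    dist2 = lambda p, q: sum((a - b) ** 2 for a, b in zip(p, q))
    centroids = random.sample(points, k)                # step 1
    for _ in range(iters):
        # Step 2 (E-step): assign each point to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: dist2(p, centroids[j]))
            clusters[nearest].append(p)
        # Step 3 (M-step): recompute each centroid as the mean of its points
        new = [tuple(sum(d) / len(c) for d in zip(*c)) if c else centroids[j]
               for j, c in enumerate(clusters)]
        if new == centroids:                            # step 4: converged
            break
        centroids = new
    return centroids
```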
What are the 3 advantages of K-means clustering?
Simple & efficient
Interpretable clusters
Cluster assignments can serve as features that make downstream learning & prediction easier
What are the 5 disadvantages of K-means clustering?
Sensitive to initial centroids
Assumes clusters are equally sized
Requires the no. clusters K to be chosen in advance, and results depend on that choice
Outliers skew centroids
Not suitable for non-linear data
Intra-cluster vs inter-cluster similarity
Intra-cluster: items within a cluster should be similar
Inter-cluster: clusters themselves should be dissimilar
What are the 2 types of validity indices?
External index: compares clustering results against a reference model
Internal index: evaluates clustering results without reference model
Name 3 commonly used external validity indices
(Take values in range [0,1])
- Jaccard Coefficient (JC)
- Fowlkes & Mallows Index (FMI)
- Rand Index (RI)
Name 2 commonly used internal validity indices
- Davies-Bouldin Index (DBI)
- Dunn Index (DI)
What are the 4 distance axioms?
Non-negativity: dist(a,b) >= 0
Identity of indiscernibles: dist(a,b) = 0 if and only if a = b
Symmetry: dist(a,b) = dist(b,a)
Subadditivity (triangle inequality): dist(a,b) <= dist(a,c) + dist(c,b)
What distances don’t satisfy the subadditivity condition?
Non-metric distances
What are ordinal attributes?
Categorical attributes that have a natural/inherent order e.g. {low, medium, high} can be represented as {1, 2, 3}
What are non-ordinal attributes?
Categorical attributes that DON’T have a natural/inherent order e.g. {aircraft, train, ship}
Describe the Minkowski Distance (MD)
Satisfies all axioms
Only applicable to ordinal attributes
When p=1, becomes Manhattan distance
When p=2, becomes Euclidean distance
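A minimal sketch of the formula, dist_p(x, y) = (sum_u |x_u - y_u|^p)^(1/p):

```python
def minkowski(x, y, p=2):
    """Minkowski distance: (sum_u |x_u - y_u|^p)^(1/p)."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

print(minkowski([0, 0], [3, 4], p=2))  # 5.0  (Euclidean)
print(minkowski([0, 0], [3, 4], p=1))  # 7.0  (Manhattan)
```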
Describe the Value Difference Metric (VDM)
Can be applied to non-ordinal attributes
VDM_p(a,b) = sum over clusters i = 1..k of |m_u,a,i / m_u,a - m_u,b,i / m_u,b|^p
where m_u,a denotes the no. samples taking value a (e.g. red) on attribute u (e.g. colour), m_u,a,i denotes the no. samples within the ith cluster taking value a on attribute u, and k is the no. clusters
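A sketch of VDM, assuming one attribute's data is given as (value, cluster) pairs (`vdm` and its input format are illustrative assumptions):

```python
def vdm(samples, a, b, p=2):
    """samples: (value, cluster) pairs for a single attribute u.
    VDM_p(a, b) = sum_i |m_u,a,i / m_u,a - m_u,b,i / m_u,b|^p."""
    m_a = sum(1 for v, _ in samples if v == a)   # m_u,a
    m_b = sum(1 for v, _ in samples if v == b)   # m_u,b
    clusters = {c for _, c in samples}           # i = 1..k
    return sum(
        abs(sum(1 for v, c in samples if v == a and c == i) / m_a
          - sum(1 for v, c in samples if v == b and c == i) / m_b) ** p
        for i in clusters
    )

# e.g. a colour attribute over two clusters
data = [("red", 1), ("red", 1), ("blue", 1), ("red", 2), ("blue", 2), ("blue", 2)]
print(vdm(data, "red", "blue"))  # ≈0.222
```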
How can MD & VDM be combined?
1) Arrange ordinal attributes in front of non-ordinal attributes
2) n_c denotes the no. ordinal attributes and n - n_c denotes the no. non-ordinal attributes
3) MinkovDM_p(x_i, x_j) = (sum of |x_i,u - x_j,u|^p over the n_c ordinal attributes + sum of VDM_p(x_i,u, x_j,u) over the n - n_c non-ordinal attributes)^(1/p)
What is Hamming distance?
The number of positions at which two equal-length strings differ, i.e. the number of bits/symbols that need to be changed to turn one string into the other
What is Jaccard index?
The size of the intersection / the size of the union of the sample sets
Doesn’t work well for nominal data
What is cosine index?
The cosine of the angle between 2 vectors of n dimensions
Doesn’t work well for nominal data
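Minimal sketches of the three measures above (the example inputs are illustrative):

```python
import math

def hamming(s, t):
    """Number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(s, t))

def jaccard(A, B):
    """|A intersect B| / |A union B| for two sets."""
    return len(A & B) / len(A | B)

def cosine(x, y):
    """Cosine of the angle between two n-dimensional vectors."""
    return sum(a * b for a, b in zip(x, y)) / (math.hypot(*x) * math.hypot(*y))

print(hamming("10110", "10011"))       # 2
print(jaccard({1, 2, 3}, {2, 3, 4}))   # 0.5
print(cosine([1, 0], [1, 1]))          # ≈0.707
```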
What is bagging?
Bootstrap aggregating
Trains multiple base learners, each on its own bootstrap sample, and combines their predictions (e.g. by majority vote or averaging)
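A hedged sketch of the idea, assuming `fit` trains a base learner on a dataset and returns a callable model (all names here are illustrative):

```python
from collections import Counter
import random

def train_bagging(D, fit, n_learners=10):
    """Fit each base learner on its own bootstrap sample of D."""
    n = len(D)
    return [fit([random.choice(D) for _ in range(n)]) for _ in range(n_learners)]

def bagging_predict(learners, x):
    """Combine the base learners' predictions by majority vote."""
    return Counter(model(x) for model in learners).most_common(1)[0][0]
```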
Decision trees recursively iterate until one of what 3 conditions is met?
- All samples in current node belong to same class
- No samples in current node
- No features left to split on or all samples have same feature values
What is the Gini Impurity Index?
A criterion that measures the impurity of a node in a decision tree (0 = pure; higher values = more impure)
G = 1 - sum_k (p_k^2)
What are the 5 steps in sorting data into sets of least impurity?
1) Split tree by feature x (age), it results in 2 nodes (younger than 30 & older than 30)
2) p_i,k represents the proportion of instances of class k (e.g. 'will buy') in node i (e.g. 'younger than 30')
3) Calculate the Gini index
4) Select the feature (age, gender, etc.) that produces the lowest weighted sum of the Gini scores for the child nodes
5) Repeat until leaf node reached or Gini score becomes very small (indicating minimal impurity)
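A sketch of steps 3-4, scoring one candidate split (the labels are illustrative):

```python
from collections import Counter

def gini(labels):
    """G = 1 - sum_k (p_k^2) for the class proportions p_k in a node."""
    n = len(labels)
    return 1 - sum((count / n) ** 2 for count in Counter(labels).values())

def weighted_gini(child_nodes):
    """Weighted sum of child-node Gini scores for one candidate split."""
    n = sum(len(node) for node in child_nodes)
    return sum(len(node) / n * gini(node) for node in child_nodes)

# e.g. splitting on age: 'younger than 30' node vs 'older than 30' node
younger = ["buy", "buy", "no", "buy"]
older = ["no", "no", "buy", "no"]
print(weighted_gini([younger, older]))  # lower is better; compare across features
```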
What is entropy?
The average no. (yes/no) questions you need to ask to identify a sample: H(D) = -sum_k p_k log2(p_k)
What is gain ratio?
Information gain criterion is biased toward features with more possible values, so we reduce bias with gain ratio
Gain_ratio(D,a) = Gain(D,a)/IV(a)
IV is the intrinsic value of feature a - it’s large when a has many possible values
What is the drawback of gain ratio?
It is biased toward features with fewer possible values
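A sketch of entropy, information gain, and gain ratio, assuming the labels of D and of each partition D_v are given as lists (illustrative helper names; assumes at least two non-empty splits):

```python
import math
from collections import Counter

def entropy(labels):
    """H(D) = -sum_k p_k * log2(p_k)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(labels, splits):
    """splits: the labels of D partitioned by the values of feature a.
    Gain(D,a) = H(D) - sum_v |D_v|/|D| * H(D_v);
    IV(a) grows with the number of values of a, penalising many-valued features."""
    n = len(labels)
    gain = entropy(labels) - sum(len(s) / n * entropy(s) for s in splits)
    iv = -sum((len(s) / n) * math.log2(len(s) / n) for s in splits)
    return gain / iv

D = ["yes"] * 5 + ["no"] * 5
print(gain_ratio(D, [["yes"] * 4 + ["no"], ["yes"] + ["no"] * 4]))  # ≈0.278
```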
What are the 3 advantages of decision trees?
Can achieve 0% training error if each training example is assigned to a unique leaf node
Easy to prepare data
Highly interpretable - a white-box model (the reasoning behind a prediction can be understood)
What are the 3 disadvantages of decision trees?
High training time
High variance leads to overfitting
Sensitive to small variations in the dataset (e.g. rotation of the feature space or small changes in the training data)