Classification Flashcards
What is a feature space
a coordinate space used to represent the input examples for a given problem, with one coordinate for each descriptive feature
Eager Learning Classification Strategy
- Classifier builds a full model during an initial training phase, to use later when new query examples arrive
- more offline setup work, less work at run-time
- generalise before seeing the query example
Lazy Learning Classification Strategy
- Classifier keeps all the training examples for later use.
- Little work is done offline, wait for new query examples.
- Focuses on the local space around each query example
What learning strategy does KNN Classifier use and how?
Lazy. k-NN uses a distance function to identify the k training examples most similar to the query (their labels are already known) and predicts the query's class from those labels
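A minimal sketch of the idea in Python (not any particular library's API), assuming numeric features, Euclidean distance and an unweighted majority vote; the data and k value are made up for illustration:

```python
import math
from collections import Counter

def euclidean(p, q):
    """Straight-line distance between two numeric feature vectors."""
    return math.sqrt(sum((qf - pf) ** 2 for pf, qf in zip(p, q)))

def knn_predict(train, query, k=3):
    """train: list of (features, label) pairs.
    Returns the majority label among the k training examples closest to the query."""
    neighbours = sorted(train, key=lambda ex: euclidean(ex[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Toy, made-up data: two clusters labelled "A" and "B"
train = [((1.0, 1.0), "A"), ((1.2, 0.9), "A"), ((5.0, 5.1), "B"), ((4.8, 5.3), "B")]
print(knn_predict(train, (1.1, 1.0)))  # "A" - most of its 3 nearest neighbours are "A"
```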
How does Weighted kNN differ from the regular model
Weighted voting, closer neighbours get higher votes.
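One common weighting scheme (an assumption here, not the only option) is inverse-distance voting, which only changes the voting step of the sketch above:

```python
import math

def weighted_knn_predict(train, query, k=3, eps=1e-9):
    """Each of the k nearest neighbours votes with weight 1 / distance,
    so closer neighbours have more influence than distant ones."""
    dist = lambda p, q: math.sqrt(sum((qf - pf) ** 2 for pf, qf in zip(p, q)))
    neighbours = sorted(train, key=lambda ex: dist(ex[0], query))[:k]
    votes = {}
    for features, label in neighbours:
        votes[label] = votes.get(label, 0.0) + 1.0 / (dist(features, query) + eps)
    return max(votes, key=votes.get)
```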
Is there a “best” distance measure
No, the choice of distance measure is highly problem-dependent
What is the difference between a local distance function (LDF) and a global distance function (GDF)
LDFs measure the distance between two examples based on a single feature, whereas GDFs are based on the combination of the local distances across all features
Define the overlap function (measuring distance)
Returns 0 if the two values for a feature are equal and 1 otherwise
Define Hamming Distance (measuring distance)
GDF which is the sum of the overlap distances across all features
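A small sketch of both functions over categorical feature vectors (the toy values are made up):

```python
def overlap(a, b):
    """Local distance for a single categorical feature: 0 if equal, 1 otherwise."""
    return 0 if a == b else 1

def hamming(p, q):
    """Global distance: sum of the overlap distances across all features."""
    return sum(overlap(a, b) for a, b in zip(p, q))

print(hamming(("red", "small", "round"), ("red", "large", "round")))  # 1: they differ on one feature
```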
Define Absolute Difference (measuring distance)
Absolute value of the difference between values for a feature or several features
Define Absolute Difference for ordinal features (measuring distance)
Calculate the absolute value of the difference between the two positions in the ordered list of possible values
Define Euclidean Distance, and give the formula
- “Straight line” distance between two points in a feature space
- calculated as the square root of the sum, over every feature f, of the squared difference between the two examples p and q
ED(p, q) = SQRT( SUM_f (q_f - p_f)^2 )
What are Heterogeneous (Diverse) Distance Functions
GDF created from different local distance functions, using an appropriate function for each feature
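A minimal sketch of a heterogeneous GDF, which also illustrates the absolute-difference cards above; the feature layout (one categorical, one ordinal, one numeric) and the ordinal scale are assumptions for illustration:

```python
SIZE_ORDER = ["small", "medium", "large"]  # assumed ordinal scale

def overlap(a, b):
    """Categorical feature: 0 if equal, 1 otherwise."""
    return 0 if a == b else 1

def ordinal_abs_diff(a, b, order=SIZE_ORDER):
    """Ordinal feature: absolute difference of positions in the ordered list."""
    return abs(order.index(a) - order.index(b))

def abs_diff(a, b):
    """Numeric feature: absolute value of the difference."""
    return abs(a - b)

def heterogeneous_distance(p, q):
    """Global distance built by applying an appropriate local function per feature."""
    local_fns = [overlap, ordinal_abs_diff, abs_diff]
    return sum(fn(a, b) for fn, a, b in zip(local_fns, p, q))

print(heterogeneous_distance(("red", "small", 2.0), ("blue", "large", 3.5)))  # 1 + 2 + 1.5 = 4.5
```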
Min-max normalization formula
z_i = (x_i - min(x)) / (max(x) - min(x))
Standard Normalisation Formula
z_i = (x_i - μ) / σ
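A small sketch of both normalisation formulas over a plain list of toy values:

```python
import statistics

def min_max_normalise(xs):
    """Rescale values to [0, 1]: z_i = (x_i - min(x)) / (max(x) - min(x))."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def standard_normalise(xs):
    """Z-score normalisation: z_i = (x_i - mean) / standard deviation."""
    mu = statistics.mean(xs)
    sigma = statistics.pstdev(xs)  # population standard deviation
    return [(x - mu) / sigma for x in xs]

values = [2.0, 4.0, 6.0, 8.0]
print(min_max_normalise(values))   # [0.0, 0.333..., 0.666..., 1.0]
print(standard_normalise(values))  # symmetric values with zero mean and unit variance
```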
Advantages of kNN
- little training time
- interpretability and explainability
- transparency
Disadvantages of kNN
- need to carefully customise the distance function
- query time can be high: it grows with the size of the training set and the complexity of the distance function
Decision Tree Algorithm
- All training examples in root node
- Examples are split by one feature into child nodes
- The process repeats for each child node
- Continue until each leaf node contains examples of a single class
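A minimal recursive sketch of that procedure for categorical features; the split-selection step here is a naive stand-in (take the next available feature) rather than the usual information-gain choice covered by the cards below, and the data are made up:

```python
from collections import Counter

def build_tree(examples, features):
    """examples: list of (feature_dict, label); features: list of feature names.
    Returns a nested dict tree, or a class label at a leaf."""
    labels = [label for _, label in examples]
    if len(set(labels)) == 1:                 # all examples share one class: leaf
        return labels[0]
    if not features:                          # no features left: majority-class leaf
        return Counter(labels).most_common(1)[0][0]
    split = features[0]                       # naive stand-in for split selection
    tree = {split: {}}
    for value in {ex[split] for ex, _ in examples}:
        subset = [(ex, lab) for ex, lab in examples if ex[split] == value]
        remaining = [f for f in features if f != split]
        tree[split][value] = build_tree(subset, remaining)
    return tree

data = [({"outlook": "sunny", "windy": "no"}, "play"),
        ({"outlook": "sunny", "windy": "yes"}, "stay"),
        ({"outlook": "rain", "windy": "no"}, "stay")]
# Nested dict: splits on 'outlook' first, then on 'windy' for the sunny branch
print(build_tree(data, ["outlook", "windy"]))
```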
Entropy formula
H(X) = - SUM_x p(x) log p(x)
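A quick numeric check of the formula, assuming log base 2 (a common but not mandatory choice):

```python
import math
from collections import Counter

def entropy(labels):
    """H(X) = -SUM_x p(x) log2 p(x), computed from a list of class labels."""
    counts, total = Counter(labels), len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

print(entropy(["yes", "yes", "no", "no"]))    # 1.0: maximum entropy for two classes
print(entropy(["yes", "yes", "yes", "yes"]))  # 0.0 for a pure set (Python may display -0.0)
```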
What does Information Gain measure and give the formula
measures the reduction in entropy when a feature is used to split a set into subsets
IG for feature A that splits a set of examples S into {S1,…Sm}:
IG(S, A) = (original entropy) - (entropy after split)
IG(S, A) = H(S) - SUM_{i=1..m} (|S_i| / |S|) H(S_i)
Steps of computing IG for each feature in a dataset
- calculate overall dataset entropy
- calculate the weighted entropy of the subsets produced by splitting on each feature
- calculate IG for each feature
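A sketch of those three steps on a made-up toy dataset; the entropy() helper repeats the formula above so the block stands alone:

```python
import math
from collections import Counter

def entropy(labels):
    counts, total = Counter(labels), len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(examples, feature):
    """IG(S, A) = H(S) - SUM_i (|S_i| / |S|) H(S_i), where the S_i are the
    subsets produced by splitting the examples on the given feature."""
    labels = [label for _, label in examples]
    overall = entropy(labels)                          # step 1: overall dataset entropy
    remainder = 0.0
    for value in {ex[feature] for ex, _ in examples}:  # step 2: weighted entropy after the split
        subset = [lab for ex, lab in examples if ex[feature] == value]
        remainder += (len(subset) / len(examples)) * entropy(subset)
    return overall - remainder                         # step 3: the reduction in entropy

data = [({"outlook": "sunny"}, "stay"), ({"outlook": "sunny"}, "stay"),
        ({"outlook": "rain"}, "play"), ({"outlook": "rain"}, "play")]
print(information_gain(data, "outlook"))  # 1.0: this split separates the classes perfectly
```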
Why do ensembles work?
When the average probability of an individual being correct is > 50%, the chance of the ensemble of them reaching the correct decision increases as more members are added.
This holds only if the diversity of the ensemble continues to grow as well.
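A small check of that claim under the usual simplifying assumption that the members vote independently, each correct with probability p (the values below are made up):

```python
from math import comb

def majority_correct(p, n):
    """Probability that a majority of n independent members (each correct with
    probability p) votes for the correct class; n is assumed odd to avoid ties."""
    k_needed = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k_needed, n + 1))

for n in (1, 5, 11, 51):
    print(n, round(majority_correct(0.6, n), 3))  # the probability rises towards 1 as n grows
```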
What is the key idea of Bagging
Train classifiers on different subsets of the training data
What is Bootstrap aggregation, what classifiers does it work better for
Bagging technique in which each classifier's training subset is drawn by randomly sampling the training data with replacement
Works better for "unstable" classifiers, e.g. decision trees and neural networks
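A sketch of bagging with bootstrap samples; the base learner (a 1-NN on a single numeric feature) and the data are illustrative stand-ins:

```python
import random
from collections import Counter

def bootstrap_sample(examples, rng):
    """Sample len(examples) items uniformly with replacement."""
    return [rng.choice(examples) for _ in range(len(examples))]

def bagging_predict(examples, train_fn, query, n_models=5, seed=0):
    """Train n_models classifiers on different bootstrap samples and
    combine their predictions by majority vote."""
    rng = random.Random(seed)
    models = [train_fn(bootstrap_sample(examples, rng)) for _ in range(n_models)]
    votes = Counter(model(query) for model in models)
    return votes.most_common(1)[0][0]

# Toy base learner: 1-NN on a single numeric feature
def train_1nn(sample):
    return lambda q: min(sample, key=lambda ex: abs(ex[0] - q))[1]

data = [(1.0, "A"), (1.5, "A"), (5.0, "B"), (5.5, "B")]
print(bagging_predict(data, train_1nn, 1.2))  # most likely "A"
```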
What is the key idea of Random Subspacing, and what does it encourage?
Train n base classifiers, each on a different subset of features.
Encourages diversity in the ensemble
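A sketch of drawing random feature subspaces; the feature names, subset size and number of base classifiers are made up:

```python
import random

def random_subspace(feature_names, n_keep, rng):
    """Pick a random subset of the features for one base classifier."""
    return rng.sample(feature_names, n_keep)

def project(example, kept):
    """Keep only the selected features of an example."""
    return {f: example[f] for f in kept}

rng = random.Random(0)
features = ["outlook", "humidity", "windy", "temp"]
subspaces = [random_subspace(features, 2, rng) for _ in range(3)]
print(subspaces)  # three randomly chosen 2-feature subspaces
# Each base classifier is trained on project(example, subspace) versions of the
# training data, which encourages diversity across the ensemble.
```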
When would you choose weighted voting for your ensembles
When the individual classifiers do not give equal performance, we should give more influence to the better classifiers
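A sketch of weighted voting in which each classifier's vote is scaled by a performance estimate such as validation accuracy (the weights below are made up):

```python
def weighted_vote(predictions_with_weights):
    """predictions_with_weights: list of (predicted_label, weight) pairs.
    Returns the label with the largest total weight."""
    totals = {}
    for label, weight in predictions_with_weights:
        totals[label] = totals.get(label, 0.0) + weight
    return max(totals, key=totals.get)

# Two weaker classifiers vote "A", one stronger classifier votes "B"
print(weighted_vote([("A", 0.55), ("A", 0.60), ("B", 0.90)]))  # "A": 1.15 total vs 0.90
```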
Discuss the Accuracy vs Diversity Trade-off in ensemble classification
An ideal ensemble consists of highly accurate members that nevertheless disagree with one another; we therefore face a trade-off between diversity and accuracy when constructing an ensemble of classifiers
What is the key idea in Boosting
Train classifiers sequentially, so that later classifiers are trained to better predict the examples that earlier ones performed poorly on
Give the basic approach to Boosting
- Assign an equal weight to all training examples
- Get a random sample from the training examples based on the weights.
- Train a classifier on the sample
- Increase the weights of misclassified examples, decrease the weights of correctly classified examples
- Output final model based on all classifiers (e.g. majority voting model)
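A minimal sketch of that loop; the multiplicative reweighting rule is a simple stand-in rather than a specific algorithm such as AdaBoost, and the base learner and data are toy choices:

```python
import random
from collections import Counter

def boost(examples, train_fn, n_rounds=3, seed=0):
    """Sequentially train classifiers on weighted samples, increasing the
    weights of misclassified examples after each round."""
    rng = random.Random(seed)
    weights = [1.0] * len(examples)               # start with equal weights
    models = []
    for _ in range(n_rounds):
        sample = rng.choices(examples, weights=weights, k=len(examples))
        model = train_fn(sample)
        models.append(model)
        for i, (features, label) in enumerate(examples):
            if model(features) != label:
                weights[i] *= 2.0                 # focus later rounds on this example
            else:
                weights[i] *= 0.5
    return lambda q: Counter(m(q) for m in models).most_common(1)[0][0]  # majority vote

# Toy base learner: always predict the majority class of its (weighted) sample
def train_majority(sample):
    majority = Counter(label for _, label in sample).most_common(1)[0][0]
    return lambda features: majority

data = [((1.0,), "A"), ((1.2,), "A"), ((5.0,), "B")]
ensemble = boost(data, train_majority)
print(ensemble((1.1,)))  # ensemble prediction for a new query
```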
Explain the bias-variance trade off
Bias is the error from how far the classifier's predictions are from the correct values, and variance is the error from sensitivity to small changes in the training set. There is often a trade-off between minimising the two
Discuss how ensemble generation methods affect the bias-variance tradeoff
- Bagging can often reduce the variance part of error
- Boosting can often reduce variance and bias, because it focuses on misclassified examples
- Boosting can sometimes increase error, as it is susceptible to noise, which can lead to overfitting
Which classifiers generally suffer from overfitting and which classifiers generally suffer underfitting?
- low bias but high variance classifiers tend to overfit
- high bias but low variance classifiers tend to underfit