Oldie Flashcards
Shortcomings of Information Gain?
- IG tends to prefer highly branching attributes
- Subsets are more likely to be pure if an attribute has a large number of values
- May result in overfitting
Solve with Gain Ratio
What aspect of Information Gain is “Gain Ratio” intended to remedy? Explain with equation how to achieve this.
- Fixes IG’s shortcoming of preferring highly branching attributes (e.g. an ID attribute yields one pure, zero-entropy subset per instance, so it gets maximal IG despite being useless for prediction)
- GR reduces the bias for IG toward highly branching attributes by normalising relative to the Split Information (SI).
GR(Ra | R) = IG(Ra | R) / SI(Ra | R)
= [H(R) - Σi P(xi) H(xi)] / [-Σi P(xi) log P(xi)]
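A minimal sketch of the calculation in Python; the function names are my own, and the example uses the classic weather counts that also appear in the One-R card below:

```python
import math

def entropy(counts):
    """Entropy of a class distribution given as raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def gain_ratio(parent_counts, child_counts):
    """parent_counts: class counts at the root R.
    child_counts: one class-count list per value of attribute Ra."""
    n = sum(parent_counts)
    weights = [sum(c) / n for c in child_counts]
    ig = entropy(parent_counts) - sum(w * entropy(c) for w, c in zip(weights, child_counts))
    si = -sum(w * math.log2(w) for w in weights if w > 0)
    return ig / si

# Outlook: Sunny [2 yes, 3 no], Overcast [4, 0], Rainy [3, 2]
print(gain_ratio([9, 5], [[2, 3], [4, 0], [3, 2]]))  # ~0.156
```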
What’s the difference between classification and regression?
Classification = Predicting which class a data point belongs to
- The dependent variable is categorical
Regression = Predicting continuous values
- The dependent variable is numerical
Outline the nature of “hill climbing” & provide example of hill climbing algorithm.
- Iteratively moves to a neighbouring solution whenever it improves the objective; aims for the global maximum but can get stuck in a local maximum, unless the objective has a single (global) maximum, e.g. it is concave
e.g. EM algorithm - each iteration is guaranteed not to decrease the (log) likelihood
What basic approach does hill climbing contrast with?
Exhaustive (brute-force) search: hill climbing only explores neighbours of the current solution, so it is far cheaper, but unlike exhaustive search it is not guaranteed to find the global optimum.
Advantages of “bagging”
- Decrease variance
- Highly effective over noisy data
- Performance is generally better & never substantially worse
- Possibility to parallelise computation of individual base classifiers
What is sum of squared errors? Name one thing it’s applied to.
Method to evaluate the quality of clusters: SSE = Σc Σ(x in c) ||x - μc||², where μc is the centroid of cluster c
Applied to K-Means (which directly minimises it)
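A small sketch of the SSE computation, assuming the clusters are already given as lists of points:

```python
def sse(clusters):
    """Sum of squared distances from each point to its cluster centroid.
    clusters: list of clusters, each a list of equal-length point tuples."""
    total = 0.0
    for points in clusters:
        centroid = [sum(dim) / len(points) for dim in zip(*points)]
        for p in points:
            total += sum((a - b) ** 2 for a, b in zip(p, centroid))
    return total

print(sse([[(1, 1), (2, 2)], [(8, 8), (9, 9), (10, 10)]]))  # 5.0
```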
How does stratified cross validation contrast with non-stratified cross validation?
Stratification generally gives a better estimate than regular cross-validation
- Both in terms of bias and variance
It achieves this by rearranging the data so that each fold is a good representation of the class distribution of the whole data set.
Outline the difference between supervised and unsupervised ML methods
Unsupervised:
- No knowledge of labelled data
- Needs to find clusters, patterns, etc by itself
Supervised:
- Has knowledge of labelled data
- Learning involves comparing its current predicted output with the correct outputs
- Able to know what the error is
Define “discretisation” with an example
Process of converting continuous attributes to nominal or ordinal attributes.
e.g. binning a continuous “age” attribute into {young, middle-aged, old}.
Some learners, such as Decision Trees, generally work better with nominal attributes.
Briefly describe what “overfitting” is
Overfitting is when the model learns the errors and/or noise in the training data rather than the underlying pattern (often made worse by a lack of training data)
- High variance in classifier
- Causes the training accuracy and test accuracy to not be similar - big gap between test and training curves
- Occurs when the model is excessively complex
- Struggles to generalise
- Overreacts to minor fluctuations in the training data
Briefly outline the “inductive learning hypothesis”
Any hypothesis that approximates the target function well over a sufficiently large set of training examples will also approximate the target function well over unobserved examples
Categorise each as either “parametric” or “non-parametric”
i) SVM
ii) Max Entropy Model
iii) C4.5-style DT
SVM - Depends on the kernel (a linear kernel keeps it parametric; kernels like RBF make it effectively non-parametric)
Max Entropy - Parametric (a fixed set of feature weights is learned)
DT - Non-parametric
What is a “consistent” classifier?
A classifier which is able to flawlessly predict the class of all TRAINING instances
Outline the difference between “Lazy” and “eager” classifiers
Lazy (instance-based) -> e.g. k-NN
- Stores the data and waits until a query is made to do anything
Eager -> e.g. decision tree, SVM
- Constructs a classification model (generalising the training data) before receiving queries (new data to classify)
What is Zero-R?
Classifies every instance as the most frequent class in the training data (the majority-class baseline).
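A minimal Zero-R sketch; the function name and counts are illustrative:

```python
from collections import Counter

def zero_r(train_labels):
    """Return a classifier that always predicts the majority training class."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return lambda instance: majority

classify = zero_r(["yes"] * 9 + ["no"] * 5)
print(classify({"outlook": "sunny"}))  # "yes", regardless of the instance
```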
Briefly outline the nature of conditional independence assumption in the context of NB
The attributes are assumed to be conditionally independent of one another given the class:
P(x1, ..., xn | c) = P(x1 | c) × ... × P(xn | c)
This lets NB estimate each P(xi | c) separately from simple counts; the assumption is rarely true in practice but works surprisingly well.
What is “hidden” in a HMM?
The sequence of states. We only observe the emissions (outputs) produced at each step and must infer the underlying state sequence from them.
What is the role of “log likelihood” in EM algorithm?
- It is the objective EM climbs: each E-step/M-step pair is guaranteed not to decrease the log likelihood of the data
- It also serves as the convergence criterion: stop when the improvement between iterations falls below a threshold
“Smoothing” is considered important when dealing with probability
i) why?
ii) name one smoothing method - explain how it works
i) Events unseen in training otherwise get probability 0, and in NB a single zero factor wipes out the entire product
ii) Laplace (add-one) smoothing: add 1 to every count, i.e. P(xi | c) = (count(xi, c) + 1) / (count(c) + V), where V is the number of possible values of the attribute
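A sketch of add-one smoothing for a single attribute value, with hypothetical counts:

```python
def laplace(value_and_class_count, class_count, n_values):
    """P(xi | c) with add-one smoothing over n_values possible attribute values."""
    return (value_and_class_count + 1) / (class_count + n_values)

# Hypothetical: "overcast" never seen with class "no" (0 of 5), 3 possible outlooks
print(laplace(0, 5, 3))  # 0.125 rather than 0.0
```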
What are 2 parameters required to generate a Gaussian probability density function?
The mean μ and the standard deviation σ (or variance σ²):
p(x) = 1 / (σ √(2π)) × exp(-(x - μ)² / (2σ²))
How are missing values in test instances dealt with in NB?
If missing in training -> Exclude that value from the attribute’s counts (the instance’s other attributes are still used)
If missing in a test instance -> Omit that attribute’s factor from the product
i.e. missing values are simply dropped, not imputed
Briefly explain how to generate a training & test learning curve
- Train the classifier on increasing fractions of the training data (e.g. 10%, 20%, ..., 100%)
- At each size, record accuracy on the training data itself (training curve) and on a held-out test set (test curve)
- Plot both accuracies against training-set size; a large gap between the curves indicates overfitting
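One way to produce both curves, sketched with scikit-learn's learning_curve; the dataset and classifier here are arbitrary placeholders:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
# Evaluate at 10 training-set sizes (10% to 100% of the CV training split)
sizes, train_scores, test_scores = learning_curve(
    DecisionTreeClassifier(), X, y,
    train_sizes=np.linspace(0.1, 1.0, 10), cv=10)

# Average over the CV folds; plotting these against `sizes` gives the two curves
print(sizes, train_scores.mean(axis=1), test_scores.mean(axis=1))
```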
Explain with equations what PRECISION and RECALL are
Use Precision and Recall when we care about what we have positively identified, not just what we correctly identified overall
Precision = positive predictive value - proportion of positive predictions that are correct
= TP / (TP + FP)
Recall = sensitivity - accuracy with respect to the actual positive cases
= TP / (TP + FN)
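A direct translation into Python, with hypothetical confusion-matrix counts:

```python
def precision(tp, fp):
    """Proportion of positive predictions that are correct."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Proportion of actual positives that were found."""
    return tp / (tp + fn)

# Hypothetical counts: TP = 40, FP = 10, FN = 20
print(precision(40, 10))  # 0.8
print(recall(40, 20))     # ~0.67
```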
What sort of task is “root relative squared error” used to evaluate?
Regression (numeric prediction). It is the square root of the model’s total squared error divided by the total squared error of always predicting the mean target value.
Describe with equation how the “overlap” metric is used to calculate distance between 2 continuous feature values in instance based learning
The overlap metric returns 0 if the two values are identical and 1 otherwise: d(a, b) = 0 if a = b, else 1. Applied to continuous values, anything short of an exact match counts as maximally distant.
Advantage of “overlap” metric over Euclidean Distance?
It needs no scaling or normalisation, is insensitive to the magnitude of differences (and hence to outliers), and applies equally to nominal values, where Euclidean distance is undefined.
Describe how instance based learning combines distances for discrete and continuous features
Compute a per-feature distance: overlap (0/1) for discrete features and the normalised absolute difference |a - b| / range for continuous features, then combine them, e.g. d(x, y) = sqrt(Σi di(xi, yi)²).
Describe with use of an equation how “inverse distance” is used in weighting voting for instance based learning
Each of the k nearest neighbours votes with a weight inversely proportional to its distance from the query, e.g. wj = 1 / (dj + ε), so that closer neighbours count for more; the class with the greatest total weight wins.
Outline each step you would go through in calculating classification accuracy based on “10-fold stratified cross-validation” for a given dataset and supervised algo.
- Partition the dataset into 10 folds of (near-)equal size, preserving the overall class distribution in each fold (stratification)
- For each fold i = 1..10: train the supervised algorithm on the other 9 folds, classify the instances in fold i, and record the accuracy
- Report the accuracy averaged over the 10 folds
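The same procedure sketched with scikit-learn's StratifiedKFold; the dataset and classifier are placeholders:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
accs = []
for train_idx, test_idx in folds.split(X, y):  # each fold preserves class ratios
    clf = DecisionTreeClassifier().fit(X[train_idx], y[train_idx])
    accs.append(clf.score(X[test_idx], y[test_idx]))
print(np.mean(accs))  # accuracy averaged over the 10 folds
```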
Why is stratified cross-validation superior to the “holdout” method?
- Every instance is used for testing exactly once and for training M-1 times, so no data is wasted on a single split
- Averaging over M train/test splits gives a lower-variance estimate than one holdout split
- Stratification keeps each fold representative of the overall class distribution
Outline steps involved in the “bagging” algorithm
- From a training set of N instances, draw M bootstrap samples (sample N instances with replacement, so duplicates are allowed)
- Train one base classifier on each bootstrap sample
- Classify a new instance by running all M base classifiers and taking the majority vote
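A compact sketch of those steps, using decision trees as the (unstable) base learner; the dataset and ensemble size are placeholders:

```python
import random
from collections import Counter
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
n, m = len(X), 10

# One bootstrap sample (drawn with replacement) per base classifier
models = []
for _ in range(m):
    idx = [random.randrange(n) for _ in range(n)]
    models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

def bagged_predict(instance):
    """Majority vote over the ensemble."""
    votes = [model.predict([instance])[0] for model in models]
    return Counter(votes).most_common(1)[0][0]

print(bagged_predict(X[0]))
```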
Give an example of an “unstable” algorithm
Decision trees (e.g. C4.5): small changes in the training data can produce a very different tree. That instability is exactly what bagging exploits.
Classify as either supervised or unsupervised
i) C4.5 DT
ii) K-means
iii) boosting
iv) zero-R
i) Supervised
ii) Unsupervised
iii) Supervised (it combines base classifiers trained on labelled data)
iv) Supervised
What is an “incremental” algo? When is this a desirable property?
An algorithm that can update its model with new training instances without retraining from scratch (e.g. k-NN: just store the new instance).
Desirable when data arrives as a stream, or when the dataset is too large to retrain on repeatedly.
What is the relationship between similarity and distance?
They are inversely related: the more similar two instances, the smaller the distance between them, and one can be converted into the other, e.g. s = 1 / (1 + d).
What is a nearest neighbour?
Closest point: Maximum similarity or Minimum distance
x = argmin(d(z, y) | z belonging to C), i.e. the candidate z in C minimising the distance to y
Big O of brute force nearest neighbour?
N samples, D dimensions
- O(DN) per query, O(DN^2) to find the nearest neighbour of every point
Good for small datasets, but quickly becomes infeasible as N grows
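A brute-force 1-NN sketch; the single pass over all N candidates, each comparison costing O(D), is where the O(DN) per-query cost comes from:

```python
def nearest_neighbour(query, candidates):
    """Return the candidate minimising (squared Euclidean) distance to the query."""
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))  # O(D) per pair
    return min(candidates, key=lambda z: dist(z, query))         # O(DN) overall

print(nearest_neighbour((0, 0), [(1, 1), (3, 4), (-1, 0.5)]))  # (-1, 0.5)
```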
PROS and CONS of Nearest Neighbours
PROS
- Simple
- Can handle arbitrarily many classes
- Incremental (can add extra data to the classifier on the fly)
CONS
- Prone to bias
- Arbitrary k-value
- Need a useful distance function
- Need an averaging function
- Expensive
- Lazy learner
What is the specificity equation
Accuracy with respect to negative cases
Specificity = TN / (TN + FP)
What is the (training) bias of a classifier?
The difference between correct and predicted value, averaged over training sets
- Bias is large if the learning method produces classifiers that are consistently wrong (underfitting)
- Bias is small if the classifier is consistently correct, or if different training sets cause errors on different test instances (the errors are not systematic)
What is the (test) variance of a classifier?
The variation in the predictions of learned classifiers across different training sets - measures inconsistency, not accuracy
- Variance is large if different training sets give rise to very different classifiers (overfitting)
- Variance is small if different training sets have a minor effect on the classification decisions made, don’t care if correct or incorrect
Pros and Cons of Holdout Strategy?
Pros
- Simple to work with
- High reproducibility
Cons
- Tradeoff between more training and more test data (variance vs bias)
- Representativeness of the training and test data
Pros and Cons of Random Sampling?
PROS
- Reduction in variance and bias over “holdout” strategy
Cons
- Lack of reproducibility
Pros and Cons of leave-one-out?
Pros
- No sampling bias in evaluating the system; the results are unique and repeatable
- Makes maximal use of the training data, so generally gives high accuracy
Cons
- Very expensive if working with a large dataset
Pros and Cons of M-Fold Cross Validation
Pros
- Only need to train the model M times (rather than N times as in leave-one-out), where M is the number of partitions and N the number of instances
- Can measure stability of the system across different training/test combinations
Cons
- How the data is distributed among the M partitions can introduce sampling bias (training data differs from test data)
- The results will not be unique unless we always partition the data identically
- Slightly lower accuracy because (M-1) / M of data is used for training
What is M-Fold CV similar to?
Similar to leave-one-out: instead of iterating over all N instances, you partition the data into M folds and iterate over those
What is a baseline and what is a benchmark?
Baseline
- Naïve method which we would expect any reasonably well-developed classifier to outperform (the bare minimum)
Benchmark
- Established rival technique which we are pitching our method against (goal)
Discuss the importance of baselines, give examples
- Establishing whether any proposed method is doing better than “dumb and simple”
- Valuable in getting a sense for the intrinsic difficulty of a given task
e.g. Random Baseline, Zero-R, One-R
What is One-R? How does it work?
Baseline method
- Creates one rule for each attribute in the training data
- Then selects the rule with the smallest error rate as its “one rule”
How it works
- Create a decision stump for each attribute, with one branch per attribute value
- Populate each leaf with the majority class of the training instances reaching it (i.e. predict that class for every instance at the leaf - if the majority is YES, predict YES)
- Select decision stump which leads to lowest error rate over training
Weather (Outlook): 9 Yes / 5 No
- Sunny: 2 Yes / 3 No -> predict NO (2 errors)
- Overcast: 4 Yes / 0 No -> predict YES (0 errors)
- Rainy: 3 Yes / 2 No -> predict YES (2 errors)
Total errors = 4 / 14
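A sketch of One-R for nominal attributes; the data layout (a list of dicts plus a parallel label list) is my own choice, and the example reproduces the Outlook stump above:

```python
from collections import Counter, defaultdict

def one_r(instances, labels):
    """instances: dicts mapping attribute -> value. Returns (best attribute, rule)."""
    best = None
    for attr in instances[0]:
        # Class counts for each value of this attribute
        by_value = defaultdict(Counter)
        for inst, label in zip(instances, labels):
            by_value[inst[attr]][label] += 1
        # Each leaf predicts its majority class; the minority counts are errors
        rule = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
        errors = sum(sum(c.values()) - c.most_common(1)[0][1] for c in by_value.values())
        if best is None or errors < best[0]:
            best = (errors, attr, rule)
    return best[1], best[2]

data = [{"outlook": o} for o in ["sunny"] * 5 + ["overcast"] * 4 + ["rainy"] * 5]
y = ["yes", "yes", "no", "no", "no"] + ["yes"] * 4 + ["yes", "yes", "yes", "no", "no"]
print(one_r(data, y))  # ('outlook', {'sunny': 'no', 'overcast': 'yes', 'rainy': 'yes'})
```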
What to be aware of / sensitive of when formulating a baseline?
Be sensitive to the relative importance of positives and negatives in the classification task (e.g. under class imbalance, a majority-class baseline can look deceptively strong).
Pros & cons of One-R?
Pros
- Simple to understand and implement
- Simple to comprehend results
- Good results
Cons
- Unable to capture attribute interactions
- Bias towards high-arity attributes
What is the “convergence” criterion in k-means?
No change in the cluster assignments (or centroids) between iterations, or the change falls below a specified threshold
=> Stable state
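A minimal k-means loop showing the convergence check, on 1-D data for brevity:

```python
def k_means(points, centroids, max_iter=100):
    """1-D k-means; stops once the cluster assignments no longer change."""
    assignment = None
    for _ in range(max_iter):
        # Assign each point to its nearest centroid
        new = [min(range(len(centroids)), key=lambda c: abs(p - centroids[c]))
               for p in points]
        if new == assignment:  # convergence: no change between iterations
            break
        assignment = new
        # Recompute each centroid as the mean of its assigned points
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return centroids, assignment

print(k_means([1.0, 2.0, 9.0, 10.0, 11.0], [0.0, 5.0]))  # ([1.5, 10.0], [0, 0, 1, 1, 1])
```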
What is the name of the process of converting nominal -> continuous values? explain how this works with e.g.
One-hot encoding (binarisation): each value of the nominal attribute becomes its own binary (0/1) attribute.
e.g. Outlook in {Sunny, Overcast, Rainy} becomes three attributes, Outlook=Sunny, Outlook=Overcast, Outlook=Rainy, exactly one of which is 1.
(Alternatively, skip the conversion and define a distance directly over nominal values, e.g. the Hamming/overlap distance.)
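A tiny one-hot sketch, reusing the Outlook values from the weather example; this illustrates the conversion itself, not any particular library's encoder:

```python
def one_hot(value, all_values):
    """Map one nominal value to a 0/1 vector over the attribute's possible values."""
    return [1 if value == v else 0 for v in all_values]

outlooks = ["sunny", "overcast", "rainy"]
print(one_hot("overcast", outlooks))  # [0, 1, 0]
```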