Week 1 - 4 Flashcards
Briefly explain how Nearest Prototype classifier works
- Calculate the centroid (prototype) of each class by averaging the numeric values along each axis
- Classify each test instance according to the class of the centroid it is nearest to (e.g. by Euclidean distance)
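A minimal sketch of the idea in Python (function names and the toy data are illustrative, not from the lectures):

```python
import numpy as np

def fit_prototypes(X, y):
    """One centroid per class: average the feature values along each axis."""
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def predict(prototypes, x):
    """Assign x to the class whose centroid is nearest (Euclidean distance)."""
    return min(prototypes, key=lambda c: np.linalg.norm(x - prototypes[c]))

# toy usage
X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [4.8, 5.2]])
y = np.array([0, 0, 1, 1])
protos = fit_prototypes(X, y)
print(predict(protos, np.array([4.5, 4.9])))  # -> 1
```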
What are the theoretical properties of Nearest Prototype?
- Multiclass Classification model
- Parametric: Prototypes capture everything that is needed for prediction
- Incremental: Easy to update the prototypes with extra training data on the fly
- Handles both nominal & continuous data
- Decision boundary between two classes is linear
What are some examples of Parametric models?
- Nearest Prototype
- Naïve Bayes
- Linear or Logistic Regression
What are some examples of Non-Parametric models?
- K-NN
- Decision Trees
Is SVM parametric or non-parametric?
Depends on the kernel used (e.g. a linear kernel gives a parametric model; an RBF kernel gives a non-parametric one)
Quirks of parametric models?
- Simpler, since they have a clear functional form
- Once you have learnt the model coefficients, data is no longer needed
- TEND TOWARDS UNDERFITTING
Quirks of non-parametric models?
- Require more data, since no assumptions are made about the form of the function
- TEND TOWARDS OVERFITTING
Briefly explain Bayesian Methods
- Learning & Classification methods based on probability theory
- Build a generative model that approximates how data is produced
- Categorisation produces a posterior probability distribution P(C | X) over the possible categories
Formula for Bayes Rule?
P(C | X) = [ P(X | C) * P(C) ] / P(X)
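A toy numeric check of the rule (all three numbers are made up for illustration):

```python
# Hypothetical values: P(X | C) = 0.8, P(C) = 0.3, P(X) = 0.5
p_x_given_c, p_c, p_x = 0.8, 0.3, 0.5
p_c_given_x = (p_x_given_c * p_c) / p_x
print(p_c_given_x)  # ≈ 0.48
```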
Formula for Naïve Bayes?
argmax over Cj ∈ C of [ P(Cj) * Πi P(Xi | Cj) ]
What to do if prob for Naïve Bayes is 0?
Make it a tiny number ε (ε less than 1/n) so the whole product doesn’t come out as 0
How to do Naïve Bayes?
Choose the class C that maximises P(C) * P(A | C) * P(B | C) * … * P(X | C)
How to handle missing values in Naïve Bayes?
If a value is missing in a TEST instance, it is possible to simply ignore that feature for the purpose of classification.
If a value is missing in a TRAINING instance, it is possible to simply have it not contribute to the attribute-value counts / probability estimates for that feature.
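A minimal counting-based sketch of the above for nominal features (function names and the ε value are illustrative): a missing test value (here None) just skips that feature, missing training values are not counted, and a zero probability is replaced with a small ε as in the earlier card. For simplicity the class count is still used as the likelihood denominator even when some training values are missing.

```python
from collections import Counter, defaultdict

def train_nb(instances, labels):
    """Collect class counts and per-(class, feature) attribute-value counts."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)
    for x, c in zip(instances, labels):
        for i, v in enumerate(x):
            if v is not None:                   # missing training value: don't count it
                value_counts[(c, i)][v] += 1
    return class_counts, value_counts

def predict_nb(class_counts, value_counts, x, eps=1e-6):
    """argmax over classes C of P(C) * prod_i P(x_i | C)."""
    n = sum(class_counts.values())
    best_c, best_score = None, -1.0
    for c, cc in class_counts.items():
        score = cc / n                          # P(C)
        for i, v in enumerate(x):
            if v is None:                       # missing test value: ignore that feature
                continue
            p = value_counts[(c, i)][v] / cc    # P(x_i | C) from counts
            score *= p if p > 0 else eps        # epsilon smoothing for zero counts
        if score > best_score:
            best_c, best_score = c, score
    return best_c

# toy usage: two nominal features, two classes
data = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"), ("rainy", "cool")]
labels = ["no", "no", "yes", "yes"]
cc, vc = train_nb(data, labels)
print(predict_nb(cc, vc, ("rainy", "mild")))  # -> "yes"
```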
How to do Naïve Bayes for continuous features?
1) Discretise the feature into nominal features
2) Use probability density estimation to estimate P(Xi | Cj)
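For option 2, one common choice is a Gaussian (normal) density per class and feature; assuming that is what the lectures intend, a sketch:

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of N(mean, std^2) at x, used as P(Xi | Cj) for a continuous feature."""
    return math.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (math.sqrt(2 * math.pi) * std)

# mean and std are estimated from the training values of feature Xi within class Cj
print(gaussian_pdf(1.5, mean=1.0, std=0.5))  # ≈ 0.484
```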
Theoretical properties of Naïve Bayes models?
- Multiclass classification method
- Parametric: Only have to store attribute-value counts / probabilities for each class, not the actual instances
- Incremental: Easy to update the counts with new data (has implications for weakly supervised learning)
- Handles both nominal & continuous features
- Simple -> Fast
WTF is Noise?
- Noise refers to corruptions in the values of attributes
- Erroneous values
- Missing values
- Incomplete values
- Simply uninformative or unpredictive values
To fix, do some combination of feature selection and feature weighting
WTF is Feature Weighting?
- Weighting the distance calculation for each feature
- Feature selection: By thresholding based on weight, we can select only the features we want to keep
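A tiny sketch of both ideas (the weights and threshold are illustrative):

```python
import numpy as np

def weighted_distance(x, y, w):
    """Euclidean distance with a per-feature weight."""
    return np.sqrt(np.sum(w * (x - y) ** 2))

def select_features(w, threshold):
    """Feature selection: keep only the indices whose weight meets the threshold."""
    return [i for i, wi in enumerate(w) if wi >= threshold]

w = np.array([0.9, 0.05, 0.6])
print(weighted_distance(np.array([1.0, 2.0, 3.0]), np.array([2.0, 2.0, 1.0]), w))
print(select_features(w, threshold=0.5))  # -> [0, 2]
```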
What are some common methods for feature weighting?
- Pointwise Mutual Information & Mutual Information
- Chi Squared
- Information Gain
WTF is Feature Filtering?
- Intuition is that it is possible to evaluate the “goodness” of each feature separately from the other features
- Consider each feature separately: Linear time in number of attributes
- Possible (but difficult) to control inter-dependence of features
- Typically most popular strategy
What makes a feature set good?
One that leads to better models (i.e. better classification performance)
What makes a single feature good?
Well correlated with the class: knowing the attribute value a lets us predict the class c with more confidence
How to calc Pointwise Mutual Info?
PMI(A,C) = log2 ( P(A,C) / (P(A) * P(C)) )
Attributes with the greatest PMI = best attributes
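A quick sketch of the PMI computation from co-occurrence counts (the counts are made up for illustration):

```python
import math

def pmi(n_ac, n_a, n_c, n_total):
    """PMI(A,C) = log2( P(A,C) / (P(A) * P(C)) ), with probabilities estimated from counts."""
    return math.log2((n_ac / n_total) / ((n_a / n_total) * (n_c / n_total)))

# e.g. attribute value a in 30 instances, class c in 40, both together in 25, out of 100
print(pmi(25, 30, 40, 100))  # log2(0.25 / 0.12) ≈ 1.06
```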
Refer to Toy Example in Lecture 3B Page ~39