Classification - Part 3 Flashcards
What is Naive Bayes classification? Name the method and its goal
- Probabilistic classification technique that considers each attribute and class label as random variables
Goal: Find the class C that maximizes the conditional probability
P(C|A) -> Probability of class C given Attribute A
When is the application of the Bayes Theorem useful?
Bayes Theorem: P(C|A) = (P(A|C)*P(C)) / P(A)
Useful situations:
- P(C|A) is unknown
- P(A|C), P(A) and P(C) are known or easy to estimate
What's the difference between prior and posterior probability (Bayes Theorem)?
- Prior probability describes the probability of an event before evidence is seen
- Posterior probability describes the probability of an event after evidence is seen
How do you apply Bayes Theorem to the classification task?
- Compute the probability P(C|A) for all values of C using Bayes Theorem
- Optionally: normalize the likelihoods of the classes
- Choose value of C that maximizes P(C|A)
- P(A) is the same for all classes (so it can be neglected when comparing the class probabilities)
- Only need to estimate P(C) and P(A|C)
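A minimal sketch of this decision rule in Python (the class labels, attribute values, and probabilities below are made up for illustration; in practice they come from the estimation steps in the following cards):

```python
# Naive Bayes decision rule: choose the class C that maximizes
# P(C) * product of P(Ai|C); P(A) is dropped because it is the same for all classes.

# Toy estimates (made up for illustration)
priors = {"yes": 0.6, "no": 0.4}
cond = {  # cond[class][attribute value] = P(attribute value | class)
    "yes": {"outlook=sunny": 0.2, "wind=strong": 0.5},
    "no":  {"outlook=sunny": 0.6, "wind=strong": 0.4},
}

def classify(attribute_values):
    scores = {}
    for c, prior in priors.items():
        score = prior
        for a in attribute_values:
            score *= cond[c][a]
        scores[c] = score
    # the class with the highest unnormalized score wins
    return max(scores, key=scores.get), scores

label, scores = classify(["outlook=sunny", "wind=strong"])
print(label, scores)  # ('no', {'yes': 0.06, 'no': 0.096})
```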
How to estimate the prior probability P(C)?
- Count the records in the training set that are labeled with class C
- Divide this count by the overall number of records in the training data
Explain the independence assumption and its implications for estimating P(A|C) for Naive Bayes
- Naive Bayes assumes that all attributes are statistically independent
- This assumption is almost never correct
-> This assumption allows the joint probability P(A|C) to be reformulated as the product of the individual probabilities: P(A1, ..., An | Cj) = P(A1|Cj) * P(A2|Cj) * ... * P(An|Cj)
-> The individual probabilities P(Ai|Cj) can then be estimated directly from the training data
How to estimate the probabilities P(Ai|Cj)?
- Count how often an attribute value appears together with class Cj
- Divide the count by the overall number of records belonging to class Cj
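A minimal counting sketch, assuming a tiny made-up training set; the same counts also give the prior P(C) from the previous card:

```python
from collections import Counter, defaultdict

# Toy training data: ({attribute: value}, class label); values are made up.
training_data = [
    ({"outlook": "sunny", "wind": "strong"}, "no"),
    ({"outlook": "sunny", "wind": "weak"},   "yes"),
    ({"outlook": "rain",  "wind": "strong"}, "no"),
    ({"outlook": "rain",  "wind": "weak"},   "yes"),
]

class_counts = Counter(label for _, label in training_data)
pair_counts = defaultdict(Counter)  # pair_counts[label][(attribute, value)]
for attrs, label in training_data:
    for attr, value in attrs.items():
        pair_counts[label][(attr, value)] += 1

n = len(training_data)
prior = {c: count / n for c, count in class_counts.items()}  # P(C)
cond = {c: {pair: cnt / class_counts[c] for pair, cnt in pairs.items()}  # P(Ai|Cj)
        for c, pairs in pair_counts.items()}

print(prior["yes"])                    # 0.5
print(cond["no"][("wind", "strong")])  # 1.0
```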
What are the names of the parts of the Bayes Theorem?
- P(A|C) Class conditional probability of evidence
- P(C) Prior probability of class
- P(A) Prior probability of evidence
- P(C|A) Posterior probability of class C
How to normalize the likelihoods of the two classes P(C|A)?
- Divide each class's probability by the sum of the probabilities of all classes
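A small worked example, reusing the made-up scores 0.06 and 0.096 from the sketch above: P(yes|A) = 0.06 / (0.06 + 0.096) ≈ 0.38 and P(no|A) = 0.096 / (0.06 + 0.096) ≈ 0.62, so the normalized values sum to 1.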
How should you handle numerical attributes when applying Naive Bayes?
Option 1) Discretize the numerical attributes (map the numerical values to categories)
Option 2) Assume that the numerical attributes follow a normal distribution given the class
- estimate the distribution parameters from the training data
- use the resulting probability density to estimate the conditional probability P(Ai|C)
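A minimal sketch of option 2, assuming the per-class sample mean and standard deviation have already been estimated from the training data (the numbers are made up):

```python
import math

def gaussian_density(x, mean, std):
    """Normal density used as an estimate of P(Ai = x | C)."""
    return math.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (math.sqrt(2 * math.pi) * std)

# Made-up per-class parameters for a numerical attribute "temperature"
params = {"yes": (21.0, 3.0), "no": (27.0, 4.0)}  # (sample mean, standard deviation)

for c, (mean, std) in params.items():
    # density value is plugged into the Naive Bayes product like any other P(Ai|C)
    print(c, gaussian_density(24.0, mean, std))
```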
Which distribution parameters can be estimated from the training data?
- Sample mean
- Standard deviation
How to handle missing values in the training data?
- Don't include the record in the frequency counts for the attribute value-class combination (just pretend that this record does not exist)
How to handle missing values in the test data?
The attribute is omitted from the calculation
Explain the zero-frequency problem
- If an attribute value never occurs together with a class value in the training data, the estimated P(Ai|C) is zero and therefore the posterior probability for that class is also zero!
Solution: Laplace estimator -> add 1 to the count of every attribute value-class combination
Laplace: P(Ai|C) = (Nic + 1) / (Nc + |Vi|)
Nic = number of records with attribute value Ai and class C, Nc = number of records with class C, |Vi| = number of distinct values of attribute Ai in the training set
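A quick worked example with made-up counts: if Nic = 0, Nc = 7, and attribute Ai has 3 distinct values (|Vi| = 3), the smoothed estimate is P(Ai|C) = (0 + 1) / (7 + 3) = 0.1 instead of 0, so a single unseen combination no longer rules out the class.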
What are the characteristics of Naive Bayes?
- Works well because classification only requires that the maximum probability is assigned to the correct class (even if the violated independence assumption leads to inaccurate probability estimates)
- Robust to isolated noise points (averaged out)
- Robust to irrelevant attributes (P(Ai|C) distributed uniformly for Ai)
- Redundant attributes can cause problems -> use subset of attributes
What is the technical advantage of Naive Bayes?
- Learning is computationally cheap because the probabilities can be calculated by one pass over the training data
- Storing the probabilities does not require a lot of memory
For which problems can you use Support Vector Machines?
- Two-class problems
- Examples described by continuous attributes
When do SVMs achieve good results?
- For high dimensional data
How do SVMs work?
- They find a linear hyperplane (decision boundary) that separates the data
How does an SVM find the best hyperplane?
- To avoid overfitting and to generalize better to unseen data, the hyperplane that maximizes the margin to the closest points (the support vectors) is chosen
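In standard textbook notation: for a hyperplane w·x + b = 0 with labels y_i in {-1, +1}, the margin is 2/||w||, so maximizing the margin amounts to solving: minimize (1/2)*||w||^2 subject to y_i*(w·x_i + b) >= 1 for every training point i.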
How to deal with noise points in SVMs?
- Use slack variables for margin computation
- Slack variables indicate whether a record is used or ignored; they result in a penalty for each data point that violates the decision boundary
Goal: Have a large margin without ignoring too many data points
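In the usual soft-margin formulation, each training point gets a slack variable ξ_i and the penalty is controlled by a parameter C: minimize (1/2)*||w||^2 + C * sum_i ξ_i subject to y_i*(w·x_i + b) >= 1 - ξ_i and ξ_i >= 0. A large C punishes violations heavily (fewer ignored points, smaller margin); a small C tolerates more violations (larger margin).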
How to handle decision boundaries that are not linear with SVMs?
- Transform the data into a higher-dimensional space in which the classes are linearly separable
- Different kernel functions can be used for this transformation
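A minimal sketch using scikit-learn (the library and the toy dataset are assumptions for illustration, not prescribed by the cards): it compares a linear kernel with an RBF kernel on two-class data that is not linearly separable, with C controlling the slack penalty.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two-class data that is not linearly separable in the original space
X, y = make_circles(n_samples=400, factor=0.3, noise=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for kernel in ("linear", "rbf"):
    clf = SVC(kernel=kernel, C=1.0)  # C = penalty for margin violations (slack)
    clf.fit(X_train, y_train)
    print(kernel, clf.score(X_test, y_test))

# The RBF kernel implicitly maps the data into a higher-dimensional space
# where a linear separation exists, so it scores much higher here.
```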
What are the characteristics of SVMs?
- Most successful classification technique for high dimensional data before DNNs appeared
- Hyperparameter selection often has a high impact on the performance of SVMs
What are the application areas for SVMs?
- Text classification
- Computer vision
- Handwritten digit recognition
- SPAM detection
- Bioinformatics