Classification - Part 3 Flashcards
What is Naive Bayes classification? Name the method and goal
- Probabilistic classification technique that considers each attribute and class label as random variables
Goal: Find the class C that maximizes the conditional probability
P(C|A) -> probability of class C given the attribute values A
When is the application of the Bayes Theorem useful?
Bayes Theorem: P(C|A) = (P(A|C) * P(C)) / P(A)
Useful situations:
- P(C|A) is unknown
- P(A|C), P(A) and P(C) are known or easy to estimate
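A minimal sketch in Python of how the theorem combines the three known quantities into the unknown posterior (all numbers are invented for illustration):

```python
# Hypothetical worked example: recover the unknown P(C|A) from
# P(A|C), P(C) and P(A) via the Bayes Theorem.
p_a_given_c = 0.5   # P(A|C): evidence A is seen in half of the class-C records
p_c = 0.02          # P(C):   prior probability of class C
p_a = 0.10          # P(A):   prior probability of evidence A

p_c_given_a = p_a_given_c * p_c / p_a  # Bayes Theorem
print(p_c_given_a)  # ~0.1 -> posterior probability of C after seeing A
```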
What's the difference between prior and posterior probability (Bayes Theorem)?
- Prior probability describes the probability of an event before evidence is seen
- Posterior probability describes the probability of an event after evidence is seen
How do you apply Bayes Theorem to the classification task?
- Compute the probability P(C|A) for all values of C using the Bayes Theorem
- (Optionally normalize the resulting class likelihoods so they sum to 1)
- Choose the value of C that maximizes P(C|A)
- P(A) is the same for all classes, so it can be ignored when comparing the classes
- Therefore only P(C) and P(A|C) need to be estimated
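A minimal sketch of the resulting decision rule, with hypothetical priors and likelihoods:

```python
# Pick the class C that maximizes P(C) * P(A|C); P(A) is skipped
# because it is identical for every class (values are hypothetical).
priors = {"yes": 0.7, "no": 0.3}         # P(C)
likelihoods = {"yes": 0.02, "no": 0.10}  # P(A|C)

scores = {c: priors[c] * likelihoods[c] for c in priors}  # proportional to P(C|A)
prediction = max(scores, key=scores.get)
print(prediction)  # "no" (0.3 * 0.10 = 0.030 > 0.7 * 0.02 = 0.014)
```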
How to estimate the prior probability P(C)?
- Count the records in the training set that are labeled with class C
- Divide this count by the overall number of records in the training data
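A sketch of this estimate on a small hypothetical list of training labels:

```python
# Estimate the prior P(C) by counting class labels in the training data.
from collections import Counter

labels = ["yes", "yes", "no", "yes", "no"]  # hypothetical class labels
counts = Counter(labels)
priors = {c: n / len(labels) for c, n in counts.items()}
print(priors)  # {'yes': 0.6, 'no': 0.4}
```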
Explain the independence assumption and its implications for estimating P(A|C) for the Naive Bayes
- Naive Bayes assumes that all attributes are statistically independent
- This assumption is almost never correct
-> This assumption allows the joint probability P(A|C) to be reformulated as the product of the individual probabilities P(Ai|Cj)
-> These individual probabilities P(Ai|Cj) can then be estimated directly from the training data
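Under the independence assumption, P(A|C) is just a product of the per-attribute probabilities; a sketch with hypothetical values:

```python
# P(A|C) = P(A1|C) * P(A2|C) * P(A3|C) under the independence assumption
# (the individual probabilities are hypothetical).
import math

p_ai_given_c = [0.8, 0.5, 0.1]          # P(A1|C), P(A2|C), P(A3|C)
p_a_given_c = math.prod(p_ai_given_c)   # joint probability P(A|C)
print(p_a_given_c)                      # ~0.04
```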
How to estimate the probabilities P(Ai|Cj)?
- Count how often an attribute value appears together with class Cj
- Divide the count by the overall number of records belonging to class Cj
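A sketch of this count-based estimate on hypothetical (attribute value, class) pairs:

```python
# Estimate P(Ai|Cj) = Nic / Nc from hypothetical training records.
records = [("sunny", "yes"), ("sunny", "no"), ("rainy", "yes"), ("sunny", "yes")]
n_c = sum(1 for _, c in records if c == "yes")                    # Nc: records of class Cj
n_ic = sum(1 for a, c in records if a == "sunny" and c == "yes")  # Nic: value with class Cj
print(n_ic / n_c)  # P(sunny|yes) = 2/3
```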
What are the names of the parts of the Bayes Theorem?
- P(A|C) Class conditional probability of evidence
- P(C) Prior probability of class
- P(A) Prior probability of evidence
- P(C|A) Posterior probability of class C
How to normalize the likelihoods P(C|A) of the classes?
- Divide each class's likelihood by the sum of the likelihoods of all classes
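A sketch of the normalization, reusing the hypothetical likelihoods from the decision-rule example above:

```python
# Normalize the class likelihoods so they sum to 1.
likelihoods = {"yes": 0.014, "no": 0.030}
total = sum(likelihoods.values())
posteriors = {c: v / total for c, v in likelihoods.items()}
print(posteriors)  # {'yes': ~0.32, 'no': ~0.68}
```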
How should you handle numerical attributes when applying the Naive Bayes?
Option 1) Discretize the numerical attributes (map the numerical values to categories/bins)
Option 2) Assume that the numerical attributes follow a normal distribution given the class
- estimate the distribution parameters from the training data
- with the fitted probability distribution you can estimate the conditional probability P(Ai|C)
Which distribution parameters can be estimated from the training data?
- Sample mean
- Standard deviation
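A sketch of Option 2 on hypothetical attribute values of one class: estimate the sample mean and standard deviation, then plug the test value into the normal density to obtain P(Ai|C):

```python
import statistics
from math import exp, pi, sqrt

values_for_class = [110, 95, 120, 100, 125]  # hypothetical attribute values of class C
mu = statistics.mean(values_for_class)       # sample mean
sigma = statistics.stdev(values_for_class)   # sample standard deviation

def normal_density(x, mu, sigma):
    """Density of the normal distribution N(mu, sigma^2) at x."""
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sqrt(2 * pi) * sigma)

print(normal_density(105, mu, sigma))  # used as the estimate of P(Ai=105|C)
```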
How to handle missing values in the training data?
- Don't include the record in the frequency counts for the affected attribute value-class combination (just pretend that this record does not exist)
How to handle missing values in the test data?
- The attribute is omitted from the calculation
Explain the zero-frequency problem
- If an attribute value never occurs together with some class value, its estimated conditional probability is zero, and the whole posterior probability for that class becomes zero as well!
Solution: Laplace estimator -> add 1 to the count of every attribute value-class combination
Laplace: P(Ai|C) = (Nic + 1) / (Nc + |Vi|)
Nic = number of class-C records with attribute value Ai; Nc = number of records of class C; |Vi| = number of distinct values of attribute Ai in the training set
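A sketch of the Laplace estimator on hypothetical counts:

```python
# Laplace estimator: P(Ai|C) = (Nic + 1) / (Nc + |Vi|).
n_ic = 0   # attribute value Ai never occurs with class C in the training set
n_c = 10   # number of training records of class C
v_i = 3    # number of distinct values of attribute Ai

p = (n_ic + 1) / (n_c + v_i)  # 1/13 instead of 0 -> the posterior no longer collapses
print(p)   # ~0.077
```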
What are the characteristics of Naive Bayes?
- Works well because classification only requires that the maximum probability is assigned to the correct class (so even the violated independence assumption, which can lead to inaccurate probability estimates, often does not hurt the classification)
- Robust to isolated noise points (averaged out)
- Robust to irrelevant attributes (P(Ai|C) distributed uniformly for Ai)
- Redundant attributes can cause problems -> use a subset of the attributes
What is the technical advantage of Naive Bayes?
- Learning is computationally cheap because the probabilities can be calculated in one pass over the training data
- Storing the probabilities does not require a lot of memory
For which problems can you use Support Vector Machines?
- Two class problems
- Examples described by continuous attributes
When do SVMs achieve good results?
- For high dimensional data
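A minimal sketch of such a classifier; the use of scikit-learn and the synthetic two-class dataset are illustration choices, not prescribed by the cards:

```python
# SVM on a two-class problem whose examples are described by continuous attributes.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=20, random_state=0)  # 20-dimensional data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = SVC(kernel="rbf")  # support vector classifier
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # accuracy on the held-out test set
```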