L3 - Naïve Bayes Flashcards
What are the Naïve Bayes learning outcomes?
Describe the basic principles of Bayes Theorem
Apply specialised methods and data structures needed to analyse text data in R
What is Bayes theorem?
A statistical principle for combining prior knowledge of the classes with new evidence gathered from data
What is the formula for Bayes' theorem?
P(A|B) = P(A∩B) / P(B) = P(B|A) * P(A) / P(B)
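A minimal worked sketch of the formula in R; all three input probabilities are made up for illustration:

```r
p_A   <- 0.20  # P(A): a priori probability (made up)
p_B_A <- 0.90  # P(B|A): likelihood (made up)
p_B   <- 0.30  # P(B): marginal likelihood (made up)

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_A_B <- p_B_A * p_A / p_B
p_A_B  # 0.6
```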
How do you test whether two events are independent using conditional probability?
A and B are independent if P(A|B) = P(A), i.e. knowing B tells you nothing about A
Rewrite the formulas for P(A|B) and P(B|A) in terms of the intersection.
P(A∩B)= P(A|B) * P(B)
P(B∩A)= P(B|A) * P(A)
Give the proof for (1)
(1) P(A|B) = P(A∩B) / P(B)
Similarly: (2) P(B|A) = P(B∩A) / P(A)
So we have: (3) P(A∩B) = P(A|B) * P(B) and (4) P(B∩A) = P(B|A) * P(A)
And we know: (5) P(A∩B) = P(B∩A)
So, substituting (4) into (1) using (5), we get: (6) P(A|B) = P(B|A) * P(A) / P(B)
Explain the components of Bayes' theorem?
Posterior probability - P(A|B): the probability of the class given the observed evidence
Likelihood - P(B|A): the probability of the evidence given the class
A priori probability - P(A): the probability of the class before seeing any evidence
Marginal likelihood - P(B): the overall probability of the evidence, used for scaling
Why is it called Naïve Bayes?
It makes the strong assumption that the features are independent:
the features do not affect one another
Explain the steps involved when we observe a message that contains 'Viagra' and 'Unsubscribe' but neither 'Money' nor 'Groceries'. What is the probability that this message is spam?
Assumption - naïve Bayes assumes independence amongst features (i.e. P(A|B) = P(A))
Just focus on the numerator to start with:
(1) For independent events we have P(A∩B) = P(A) * P(B)
(2) Rewrite P(B|A) in the numerator as a product over the observed features: P(Viagra|Spam) * P(¬Money|Spam) * P(¬Groceries|Spam) * P(Unsubscribe|Spam) * P(Spam), where ¬ means the word is absent
Look up each feature inside the spam class (present or absent)
(3) Calculate the ham numerator using the same formula but with the ham class
(4) Compare the spam and ham scores
(5) Denominator (normalisation/scaling): divide each numerator by the total probability of the observed features, i.e. the spam numerator plus the ham numerator (worked through in the R sketch below)
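The whole calculation can be sketched in R; every likelihood-table value and prior below is made up purely for illustration:

```r
# Hypothetical likelihood tables P(word | class) and class priors (all made up)
p_spam   <- 0.2
p_ham    <- 0.8
lik_spam <- c(viagra = 0.20, money = 0.50, groceries = 0.05, unsub = 0.60)
lik_ham  <- c(viagra = 0.01, money = 0.10, groceries = 0.30, unsub = 0.20)

# Message contains 'Viagra' and 'Unsubscribe' but not 'Money' or 'Groceries',
# so absent words contribute (1 - P(word | class))
num_spam <- lik_spam["viagra"] * (1 - lik_spam["money"]) *
  (1 - lik_spam["groceries"]) * lik_spam["unsub"] * p_spam
num_ham  <- lik_ham["viagra"] * (1 - lik_ham["money"]) *
  (1 - lik_ham["groceries"]) * lik_ham["unsub"] * p_ham

# Normalisation: divide by the total probability of the observed features
unname(num_spam / (num_spam + num_ham))  # ~0.92 with these made-up numbers
```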
What is normalisation/scaling for naïve Bayes classifiers?
Dividing each class's unnormalised score by the sum of the scores across all classes, so the resulting posteriors sum to 1:
P(A|features) = score(A) / (score(A) + score(B))
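The scaling step itself is one line in R, reusing the made-up numerators from the spam example above:

```r
scores <- c(spam = 0.0114, ham = 0.001008)  # unnormalised numerators (made up)

# Divide by the sum so the posteriors add up to 1
scores / sum(scores)  # spam ~0.92, ham ~0.08
```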
What is the naïve Bayes classification algorithm? What are its components?
P(C_L | F_1…F_n) = 1/Z * P(C_L) * P(F_1|C_L) * … * P(F_n|C_L)
C_L = class label
F_1…F_n = the n features
1/Z = scaling factor
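A minimal R sketch of the algorithm; the function names and the vector of likelihoods are assumptions for illustration:

```r
# Unnormalised score for one class: P(CL) * P(F1|CL) * ... * P(Fn|CL)
nb_score <- function(prior, likelihoods) prior * prod(likelihoods)

# The 1/Z scaling: normalise the per-class scores so they sum to 1
nb_posterior <- function(scores) scores / sum(scores)

# Made-up priors and per-feature likelihoods for two classes
scores <- c(spam = nb_score(0.2, c(0.20, 0.50, 0.60)),
            ham  = nb_score(0.8, c(0.01, 0.10, 0.20)))
nb_posterior(scores)
```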
How can we find P(B|features) when we have calculated P(A|features)?
Use the same equation as for P(A|features), but replace the values with those from the likelihood table for the alternative class label
How does classification work for the naïve Bayes algorithm?
Training - calculate likelihood tables
Testing - given new unseen data
(1) Find the probability of it belonging to each class using the likelihood tables
(2) Pick the most probable class (i.e. whichever class is more likely after normalisation); the training step is sketched below
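A sketch of the training step in R, building the likelihood table and priors from a made-up toy dataset:

```r
# Toy training data (made up): one row per message
train <- data.frame(
  class  = c("spam", "spam", "ham", "ham", "ham"),
  viagra = c(TRUE, TRUE, FALSE, FALSE, TRUE)
)

# Training: likelihood table P(viagra | class) and the class priors
lik_table <- prop.table(table(train$class, train$viagra), margin = 1)
priors    <- prop.table(table(train$class))

lik_table  # each row is a class and sums to 1
priors     # P(ham), P(spam)
```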
How does Naïve Bayes classify?
Picks the most probable class given the observed features (i.e. whichever is more likely after normalisation)
(1) Calculate the posterior probability for each class.
(2) The class with the highest posterior probability is the outcome of prediction.
Note: you calculate the posterior for both classes and pick the one with the higher probability
Difference between classifier and normalisation?
The classifier makes the decision: it picks the most probable class. Normalisation takes the unnormalised class scores and converts them into percentages (probabilities that sum to 100%), which the classifier then compares
Why is KNN lazy?
KNN does no real training; it just compares the features of a new observation with stored examples at classification time
Naïve Bayes, in contrast, learns during training by building the likelihood tables it uses later
Name one problem with naïve Bayes?
If one of the features has a zero count for a class (e.g. 0/20), its likelihood is 0 and the whole product becomes 0
A single zero-probability feature overrides the evidence from all the other features
How do you overcome this problem?
Use the Laplace estimator (smoothing)
It guarantees non-zero probabilities by adding a small number to each of the feature counts
Explain how to overcome the problem in more detail?
Numerator - add 1 (or another small count) to each feature count
Denominator - balance it by increasing the denominator by a corresponding amount so the probabilities still sum to 1 (the amount doesn't have to be 1)
Do not change the prior knowledge (i.e. the a priori probabilities of A or B); see the sketch below
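A small sketch in R with made-up counts, assuming a present/absent feature so the denominator is balanced by 2:

```r
# Made-up word counts from the 20 spam messages in a training set
counts <- c(viagra = 0, money = 4, unsub = 12)
n_spam <- 20

# Without smoothing, 'viagra' gets a likelihood of 0 and zeroes the product
counts / n_spam

# Laplace smoothing: add 1 to each count and balance the denominator
# (+2 here, one for each value of a present/absent feature)
(counts + 1) / (n_spam + 2)
```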
How do you use naïve Bayes with continuous features?
Discretisation (binning)
Categorises the numeric data into different bins
How to set bins?
Prior knowledge (e.g. spam is more likely in the daytime, so set up bins for day and night)
Or simply use quantiles (see the sketch below)
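Both approaches are one call to cut() in R, using a made-up hour-of-arrival feature:

```r
# Made-up continuous feature: hour of day each message arrived
hours <- c(1, 3, 9, 10, 11, 14, 15, 20, 22, 23)

# Bins from prior knowledge: night (midnight-6am), day, evening
cut(hours, breaks = c(0, 6, 18, 24),
    labels = c("night", "day", "evening"))

# Bins from quantiles: four equal-sized groups
cut(hours, breaks = quantile(hours, probs = seq(0, 1, 0.25)),
    include.lowest = TRUE)
```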
Name some strengths of naïve Bayes? Explain.
(1) Robust to irrelevant features (some algorithms affected by unusual features)
For example, you might include eye colour as a feature when classifying gender. Eye colour is completely irrelevant to gender, but because the classifier works on probabilities this bears out in the evidence: males and females will have nearly identical likelihoods of blue eyes, so the feature contributes almost nothing to either class
(2) Robust to missing data - some classifiers discard the whole observation when data is missing; naïve Bayes just drops the one missing feature (e.g. the effective sample size for that single attribute is reduced by 1, rather than the whole observation being discarded)
Name and explain some weaknesses of naïve Bayes?
(1) Strong assumption that all features are independent - a faulty assumption, since such independence rarely exists in the real world
(In practice, though, it is not important to obtain precise probabilities so long as the predictions are accurate)
(2) Strong predictions but weak probability estimates - because of the independence assumption, some other algorithms are better at providing probability estimates
What are Bayesian methods, and what makes one of them 'naïve'?
Bayes classifiers are classification methods based on Bayes' theorem
The naïve Bayes classifier is the simplest among them; it assumes the features are independent