Chapter 5: Naive Bayes Flashcards
1
Q
Formal Definition of Classification
A
- Naive Bayes = a form of classification
- Classification: Given a database D = {x1, x2, …, xn} of tuples (items, records) and a set of classes C = {C1, C2, …, Cm}, the classification problem is to define a mapping f: D → C where each xi is assigned to one class. A class Cj contains precisely those tuples mapped to it; that is,
Cj = {xi | f(xi) = Cj, 1 ≤ i ≤ n, and xi ∈ D}.
- Logistic regression can be used for classification.
- Prediction is similar, but usually implies a mapping to numeric values instead of a class Cj
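A tiny sketch of such a mapping f: D → C, with a made-up credit-risk rule just to make the notation concrete (attribute names and thresholds are invented, not from the slides):

```python
# D: database of tuples, C: set of classes (toy values for illustration only)
D = [{"income": 52000, "late_payments": 0},
     {"income": 18000, "late_payments": 4}]
C = {"low risk", "medium risk", "high risk"}

def f(x):
    """The classifier: assigns each tuple x in D to exactly one class in C."""
    if x["late_payments"] >= 3:
        return "high risk"
    return "low risk" if x["income"] > 40000 else "medium risk"

assignments = [f(x) for x in D]   # ['low risk', 'high risk']
```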
2
Q
Example Applications
A
- Determine whether a bank customer applying for a loan is a low, medium, or high credit risk.
- Churn prediction (typically a classification task): does the customer leave or not?
- Determine if a sequence of credit card purchases indicates questionable behavior.
- Identify customers that may be willing to purchase particular insurance policies.
- Identify patterns of gene expression that indicate the patient has cancer.
- Identify spam mail etc.
3
Q
Algorithms for Classification
A
- (Logistic) Regression
- Rudimentary Rules (e.g., 1R)
- Statistical Modeling (e.g., Naïve Bayes)
- Decision Trees: Divide and Conquer
- Classification Rules (e.g. PRISM)
- Instance-Based Learning (e.g. kNN)
- Support Vector Machines
4
Q
1-Rule (1R)
A
- Generate a one-level decision tree
- One attribute -> easy to explain
- Basic idea:
- Rules "testing" a single attribute
- Classify according to frequency in training data
- Evaluate error rate for each attribute
- Choose the best attribute
- That's all!
- Performs quite well, even compared to much more sophisticated algorithms!
- Often the underlying structure of data is quite simple
"Very simple classification rules perform well on most commonly used datasets" (Holte, 1993)
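A minimal Python sketch of this procedure, assuming nominal attributes and instances stored as dicts (function and variable names are mine, not from the slides):

```python
from collections import Counter, defaultdict

def one_r(instances, attributes, target):
    """1R: for each attribute, build one rule per value (predict that value's most
    frequent class), count the training errors, and keep the best attribute."""
    best = None
    for attr in attributes:
        counts = defaultdict(Counter)                # value -> class frequencies
        for row in instances:
            counts[row[attr]][row[target]] += 1
        rule = {v: c.most_common(1)[0][0] for v, c in counts.items()}
        errors = sum(sum(c.values()) - c.most_common(1)[0][1] for c in counts.values())
        if best is None or errors < best[2]:
            best = (attr, rule, errors)
    return best                                      # (attribute, {value: class}, errors)
```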
5
Q
Apply 1R on weather data
A
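The slide's worked example is not reproduced on this card. As a sketch, here is the `one_r` function above applied to the standard 14-instance nominal weather ("play tennis") data from Witten & Frank, assuming that is the dataset meant:

```python
weather = [
    ("sunny", "hot", "high", "false", "no"),       ("sunny", "hot", "high", "true", "no"),
    ("overcast", "hot", "high", "false", "yes"),   ("rainy", "mild", "high", "false", "yes"),
    ("rainy", "cool", "normal", "false", "yes"),   ("rainy", "cool", "normal", "true", "no"),
    ("overcast", "cool", "normal", "true", "yes"), ("sunny", "mild", "high", "false", "no"),
    ("sunny", "cool", "normal", "false", "yes"),   ("rainy", "mild", "normal", "false", "yes"),
    ("sunny", "mild", "normal", "true", "yes"),    ("overcast", "mild", "high", "true", "yes"),
    ("overcast", "hot", "normal", "false", "yes"), ("rainy", "mild", "high", "true", "no"),
]
cols = ["outlook", "temperature", "humidity", "windy", "play"]
rows = [dict(zip(cols, r)) for r in weather]

attr, rule, errors = one_r(rows, cols[:-1], "play")
print(attr, rule, errors)
# outlook {'sunny': 'no', 'overcast': 'yes', 'rainy': 'yes'} 4   (4/14 training errors)
```

On this data humidity also makes 4/14 errors; outlook wins only because it is examined first.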
6
Q
Other Features of 1R
A
- Missing Values
- Include a "dummy" attribute value "missing"
- Numeric Values
- Discretization :
- Sort training data by attribute value
- Split the range of numeric temperature values into categories
- Threshold values 64.5, 66.5, 70.5, 72, 77.5, 80.5, 84 between eight categories
- The breakpoint at 72 can be removed and replaced by 73.5 to get a bigger partition
- Problem of too many temperature classes -> define a minimum class size
- Discretization :
- Merge adjacent partitions having the same majority class
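A simplified sketch of this discretization idea (not necessarily the exact procedure from the slides): sort the (value, class) pairs, cut a partition once it contains a minimum number of instances of its majority class, and finally merge adjacent partitions with the same majority class. The `min_majority` parameter and all names are assumptions.

```python
from collections import Counter

def discretize(pairs, min_majority=3):
    """Simplified 1R-style discretization of one numeric attribute."""
    pairs = sorted(pairs)                          # sort training data by attribute value
    partitions, current = [], []
    for value, cls in pairs:
        current.append((value, cls))
        majority, count = Counter(c for _, c in current).most_common(1)[0]
        if count >= min_majority:                  # minimum class size reached -> cut here
            partitions.append((current, majority))
            current = []
    if current:                                    # leftover instances join the last partition
        last = (partitions.pop()[0] if partitions else []) + current
        partitions.append((last, Counter(c for _, c in last).most_common(1)[0][0]))
    merged = [partitions[0]]                       # merge neighbours with same majority class
    for part, maj in partitions[1:]:
        if maj == merged[-1][1]:
            merged[-1] = (merged[-1][0] + part, maj)
        else:
            merged.append((part, maj))
    return merged                                  # list of (instances, majority class)
```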
7
Q
Naïve Bayes Classifier
A
- 1R uses only a single attribute for classification
- Naive Bayes classifier allows all attributes to contribute equally
- Assumes
- All attributes equally important
- All attributes independent
- This means that knowledge about the value of a particular attribute doesn't tell us anything about the value of another attribute
- Although based on assumptions that are almost never correct, this scheme works well in practice!
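Concretely, the independence assumption is what lets the class-conditional probability of a whole instance factor into per-attribute terms:
P(e1, e2, …, en | h) = P(e1 | h) · P(e2 | h) · … · P(en | h)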
8
Q
Bayes Theorem: Some Notation
A
- Let π(π) represent the prior or unconditional probability that proposition π is true.
- Example: Let π represent that a customer is a high credit risk. π(π) = 0.1 means that there is a 10% chance a given customer is a high credit risk.
- Probabilities of events change when we know something about the world
- The notation P(e | h) is used to represent the conditional or posterior probability of e
- Read "the probability of e given that all we know is h."
- P(e = high risk | h = unemployed) = 0.60
- The notation P(E) is used to represent the probability distribution of all possible values of a random variable
- P(Risk) = <0.7, 0.2, 0.1>
9
Q
Conditional Probability and Bayes Rule
A
- The product rule for conditional probabilities
- P(e | h) = P(e ∩ h) / P(h)
- P(e ∩ h) = P(e | h) P(h) = P(h | e) P(e) (product rule)
- P(e ∩ h) = P(e) P(h) (for independent random variables)
- Bayes' rule relates conditional probabilities
- P(e ∩ h) = P(e | h) P(h)
- P(e ∩ h) = P(h | e) P(e)
- P(h | e) = P(e | h) P(h) / P(e)
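A quick numeric check with hypothetical values: if P(h) = 0.01, P(e | h) = 0.9 and P(e) = 0.1, then P(h | e) = 0.9 · 0.01 / 0.1 = 0.09, so even strong evidence e only lifts a rare hypothesis h to 9%.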
10
Q
Bayes theorem
A
11
Q
Bayes cancer example
A
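The slide's figures are not reproduced on this card. A minimal sketch of the classic textbook version of the example, with the prior and test characteristics assumed purely for illustration:

```python
# Assumed illustrative values, not necessarily the ones on the slide
p_cancer = 0.008              # prior P(cancer)
p_pos_given_cancer = 0.98     # sensitivity, P(+ | cancer)
p_pos_given_healthy = 0.03    # false-positive rate, P(+ | no cancer)

# Total probability of a positive test: P(+) = P(+|cancer)P(cancer) + P(+|no cancer)P(no cancer)
p_pos = p_pos_given_cancer * p_cancer + p_pos_given_healthy * (1 - p_cancer)

# Bayes' rule: P(cancer | +) = P(+ | cancer) P(cancer) / P(+)
print(round(p_pos_given_cancer * p_cancer / p_pos, 3))   # 0.209 -> only ~21% despite a positive test
```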
12
Q
Frequency Tables
A
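The slide's tables are not reproduced here. A sketch of how such per-class frequency tables can be built, assuming the `rows`/`cols` weather data from the 1R card above is in scope:

```python
from collections import Counter, defaultdict

counts = defaultdict(Counter)                     # counts[attr][(value, class)]
class_counts = Counter(r["play"] for r in rows)   # Counter({'yes': 9, 'no': 5})
for r in rows:
    for attr in cols[:-1]:
        counts[attr][(r[attr], r["play"])] += 1

print(counts["outlook"][("sunny", "yes")], counts["outlook"][("sunny", "no")])  # 2 3
```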
13
Q
Naive Bayes – Probabilities
A
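The slide's table is not reproduced here; continuing from the frequency-table sketch above, the counts turn into the probabilities Naive Bayes multiplies, e.g.:

```python
p_sunny_given_yes = counts["outlook"][("sunny", "yes")] / class_counts["yes"]  # 2/9 ≈ 0.222
p_yes = class_counts["yes"] / sum(class_counts.values())                       # 9/14 ≈ 0.643
```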
14
Q
Predicting a New Day
A
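The slide's worked example is not reproduced on this card. A sketch of the standard calculation for the new day outlook=sunny, temperature=cool, humidity=high, windy=true, reusing `counts` and `class_counts` from the frequency-table sketch above:

```python
def naive_bayes_scores(instance):
    """Unnormalised scores P(class) * prod P(attribute value | class).
    (No smoothing: a zero count would zero out the whole product.)"""
    n = sum(class_counts.values())
    scores = {}
    for cls, cls_n in class_counts.items():
        score = cls_n / n                                     # prior P(class)
        for attr, value in instance.items():
            score *= counts[attr][(value, cls)] / cls_n       # P(value | class)
        scores[cls] = score
    return scores

new_day = {"outlook": "sunny", "temperature": "cool", "humidity": "high", "windy": "true"}
scores = naive_bayes_scores(new_day)
total = sum(scores.values())
print({cls: round(s / total, 3) for cls, s in scores.items()})
# roughly {'no': 0.795, 'yes': 0.205} -> the new day is classified as "no" (don't play)
```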
15
Q
Naive Bayes - Summary
A
- Want to classify a new instance (e1, e2, …, en) into a finite number of categories from the set H.
- Choose the most likely classification using Bayes' theorem: the MAP (maximum a posteriori) classification
- How strong are the assumptions? All attributes are treated equally, so be careful which attributes you include: an attribute with no relation to the class only adds noise and corrupts the result.
- Assign the most probable category hMAP given (e1, e2, …, en)
- "Naive Bayes", because the attributes are treated as independent: only then can you multiply the per-attribute probabilities
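Written out, the decision rule is: hMAP = argmax over h in H of P(h | e1, …, en) = argmax over h in H of P(e1 | h) · … · P(en | h) · P(h). The evidence term P(e1, …, en) can be dropped because it is the same for every candidate class h.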