Chapter 5: Naive Bayes Flashcards
Formal Definition of Classification
- Naive Bayes = a form of classification
- Classification: Given a database D = {x1, x2, …, xn} of tuples (items, records) and a set of classes C = {C1, C2, …, Cm}, the classification problem is to define a mapping f: D → C where each xi is assigned to one class. A class Cj contains precisely those tuples mapped to it; that is,
Cj = {xi | f(xi) = Cj, 1 ≤ i ≤ n, and xi ∈ D}.
- Logistic regression can be used for classification.
- Prediction is similar, but usually implies a mapping to numeric values instead of a class Cj
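As a minimal illustration of the mapping f: D → C, here is a toy sketch in Python; the tuples, classes, and the rule inside f are made up for illustration, not from the lecture:

```python
# Minimal sketch of classification as a mapping f: D -> C.
# The tuples and classes below are illustrative only.
D = [(25, "low income"), (52, "high income"), (40, "medium income")]
C = {"low risk", "medium risk", "high risk"}

def f(x):
    """Toy classifier: assign each tuple to exactly one class in C."""
    age, income = x
    if income == "high income":
        return "low risk"
    if income == "medium income":
        return "medium risk"
    return "high risk"

# Each class Cj contains exactly the tuples mapped to it.
classes = {c: [x for x in D if f(x) == c] for c in C}
print(classes)
```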
Example Applications
- Determine if a bank customer applying for a loan is a low, medium, or high risk customer.
- Churn prediction (typically a classification task): does the customer leave or not?
- Determine if a sequence of credit card purchases indicates questionable behavior.
- Identify customers that may be willing to purchase particular insurance policies.
- Identify patterns of gene expression that indicate the patient has cancer.
- Identify spam mail etc.
Algorithms for Classification
- (Logistic) Regression
- Rudimentary Rules (e.g., 1R)
- Statistical Modeling (e.g., Naïve Bayes)
- Decision Trees: Divide and Conquer
- Classification Rules (e.g. PRISM)
- Instance-Based Learning (e.g. kNN)
- Support Vector Machines
1-Rule (1R)
- Generate a one-level decision tree
- One attribute → easy to explain
- Basic idea:
- Rules "testing" a single attribute
- Classify according to frequency in training data
- Evaluate error rate for each attribute
- Choose the best attribute
- That's all!
- Performs quite well, even compared to much more sophisticated algorithms!
- Often the underlying structure of data is quite simple
"Very simple classification rules perform well on most commonly used datasets" (Holte, 1993)
Apply 1R on weather data

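A minimal sketch of 1R applied to the weather data, assuming the standard 14-instance weather toy dataset (outlook, temperature, humidity, windy → play); the data listing and the helper `one_r` are illustrative, not taken from the slides:

```python
from collections import Counter, defaultdict

# Standard weather toy data; attribute order: outlook, temperature,
# humidity, windy; last element is the class (play yes/no).
data = [
    ("sunny", "hot", "high", "false", "no"),
    ("sunny", "hot", "high", "true", "no"),
    ("overcast", "hot", "high", "false", "yes"),
    ("rainy", "mild", "high", "false", "yes"),
    ("rainy", "cool", "normal", "false", "yes"),
    ("rainy", "cool", "normal", "true", "no"),
    ("overcast", "cool", "normal", "true", "yes"),
    ("sunny", "mild", "high", "false", "no"),
    ("sunny", "cool", "normal", "false", "yes"),
    ("rainy", "mild", "normal", "false", "yes"),
    ("sunny", "mild", "normal", "true", "yes"),
    ("overcast", "mild", "high", "true", "yes"),
    ("overcast", "hot", "normal", "false", "yes"),
    ("rainy", "mild", "high", "true", "no"),
]
attributes = ["outlook", "temperature", "humidity", "windy"]

def one_r(data, n_attrs):
    """For each attribute, build one rule per value (majority class),
    count the errors on the training data, and keep the best attribute."""
    best = None
    for a in range(n_attrs):
        counts = defaultdict(Counter)
        for row in data:
            counts[row[a]][row[-1]] += 1
        rules = {v: c.most_common(1)[0][0] for v, c in counts.items()}
        errors = sum(n for v, c in counts.items()
                     for cls, n in c.items() if cls != rules[v])
        if best is None or errors < best[2]:
            best = (a, rules, errors)
    return best

attr, rules, errors = one_r(data, len(attributes))
print(attributes[attr], rules, f"{errors}/{len(data)} errors")
```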
Other Features of 1R
- Missing Values
- Include a "dummy" attribute value for "missing"
- Numeric Values
- Discretization:
- Sort training data by attribute value
- Split the range of numeric temperature values into categories
- Threshold values 64.5, 66.5, 70.5, 72, 77.5, 80.5, 84 give eight categories
- The threshold at 72 can be removed and replaced by 73.5 to get a bigger partition
- Problem of too many temperature categories → define a minimum partition size
- Merge adjacent partitions having the same majority class (see the sketch below)
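A sketch of the 1R-style discretization described above, assuming the temperature values (with their play classes) of the weather toy data; the merging strategy below is a simplified illustration, not the exact slide algorithm:

```python
from collections import Counter

# Temperature values with class play = yes/no (weather toy data).
pairs = sorted([(64, "yes"), (65, "no"), (68, "yes"), (69, "yes"),
                (70, "yes"), (71, "no"), (72, "no"), (72, "yes"),
                (75, "yes"), (75, "yes"), (80, "no"), (81, "yes"),
                (83, "yes"), (85, "no")])

def discretize(pairs, min_size=3):
    """Merge adjacent partitions that share a majority class and
    enforce a minimum partition size; return the split thresholds."""
    def majority(part):
        return Counter(c for _, c in part).most_common(1)[0][0]
    merged = [[pairs[0]]]
    for p in pairs[1:]:
        if majority(merged[-1]) == majority([p]) or len(merged[-1]) < min_size:
            merged[-1].append(p)          # merge into current partition
        else:
            merged.append([p])            # start a new partition
    # Thresholds halfway between the last value of one partition
    # and the first value of the next.
    return [(merged[i][-1][0] + merged[i + 1][0][0]) / 2
            for i in range(len(merged) - 1)]

print(discretize(pairs))
```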

Naïve Bayes Classifier
- 1R uses only a single attribute for classification
- Naive Bayes classifier allows all attributes to contribute equally
- Assumes
- All attributes equally important
- All attributes independent
- This means that knowledge about the value of a particular attribute doesnβt tell us anything about the value of another attribute
- Although based on assumptions that are almost never correct, this scheme works well in practice!
Bayes Theorem: Some Notation
- Let P(e) represent the prior or unconditional probability that proposition e is true.
- Example: Let e represent that a customer is a high credit risk. P(e) = 0.1 means that there is a 10% chance a given customer is a high credit risk.
- Probabilities of events change when we know something about the world
- The notation P(e | h) is used to represent the conditional or posterior probability of e
- Read "the probability of e given that all we know is h."
- P(e = high risk | h = unemployed) = 0.60
- The notation P(E) is used to represent the probability distribution of all possible values of a random variable
- P(Risk) = <0.7, 0.2, 0.1>
Conditional Probability and Bayes Rule
- The product rule for conditional probabilities
- P(e | h) = P(e ∩ h) / P(h)
- P(e ∩ h) = P(e | h) P(h) = P(h | e) P(e) (product rule)
- P(e ∩ h) = P(e) P(h) (for independent random variables)
- Bayes' rule relates conditional probabilities
- P(e ∩ h) = P(e | h) P(h)
- P(e ∩ h) = P(h | e) P(e)
- P(h | e) = P(e | h) P(h) / P(e)
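A quick numeric check of the product rule and Bayes' rule in Python; the joint probabilities are made-up illustrative numbers (chosen to match the 10% prior and the 0.60 conditional probability used above):

```python
# Illustrative joint distribution over e (high risk) and h (unemployed);
# the numbers are assumptions for the sake of the check.
p_joint = {("e", "h"): 0.06, ("e", "not_h"): 0.04,
           ("not_e", "h"): 0.04, ("not_e", "not_h"): 0.86}

p_h = p_joint[("e", "h")] + p_joint[("not_e", "h")]   # P(h)   = 0.10
p_e = p_joint[("e", "h")] + p_joint[("e", "not_h")]   # P(e)   = 0.10
p_e_given_h = p_joint[("e", "h")] / p_h               # P(e|h) = 0.60
p_h_given_e = p_joint[("e", "h")] / p_e               # P(h|e) = 0.60

# Bayes' rule: P(h|e) = P(e|h) P(h) / P(e)
assert abs(p_h_given_e - p_e_given_h * p_h / p_e) < 1e-12
print(p_e_given_h, p_h_given_e)
```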

Bayes theorem

Bayes cancer example

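The worked example itself is not included in these notes; a commonly used version of it (numbers assumed here, following Mitchell's textbook formulation, not necessarily the slide's figures) can be computed as follows:

```python
# Assumed numbers for the cancer-test example (illustrative only):
p_cancer = 0.008             # prior P(cancer)
p_pos_given_cancer = 0.98    # sensitivity P(+ | cancer)
p_pos_given_healthy = 0.03   # false positive rate P(+ | not cancer)

# Bayes' rule: P(cancer | +) = P(+ | cancer) P(cancer) / P(+)
p_pos = p_pos_given_cancer * p_cancer + p_pos_given_healthy * (1 - p_cancer)
p_cancer_given_pos = p_pos_given_cancer * p_cancer / p_pos
print(round(p_cancer_given_pos, 3))  # about 0.21: a positive test is still most likely a false alarm
```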
Frequency Tables

Naive Bayes β Probabilities

Predicting a New Day

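A combined sketch of the three flashcards above (frequency tables, conditional probabilities, and predicting a new day), again assuming the standard weather toy data; the new instance (sunny, cool, high, true) is the one typically used in this example, and all names below are illustrative:

```python
from collections import Counter, defaultdict

# Standard weather toy data: outlook, temperature, humidity, windy -> play.
rows = """sunny hot high false no
sunny hot high true no
overcast hot high false yes
rainy mild high false yes
rainy cool normal false yes
rainy cool normal true no
overcast cool normal true yes
sunny mild high false no
sunny cool normal false yes
rainy mild normal false yes
sunny mild normal true yes
overcast mild high true yes
overcast hot normal false yes
rainy mild high true no"""
data = [tuple(line.split()) for line in rows.splitlines()]

# Frequency tables: freq[attribute index][class][attribute value] = count
class_counts = Counter(row[-1] for row in data)
freq = defaultdict(lambda: defaultdict(Counter))
for row in data:
    for a, value in enumerate(row[:-1]):
        freq[a][row[-1]][value] += 1

def naive_bayes_scores(instance):
    """Score each class as P(class) * prod_a P(value_a | class)."""
    scores = {}
    for cls, n_cls in class_counts.items():
        p = n_cls / len(data)                 # prior P(class)
        for a, value in enumerate(instance):
            p *= freq[a][cls][value] / n_cls  # conditional P(value | class)
        scores[cls] = p
    return scores

# Predicting a new day: outlook=sunny, temperature=cool, humidity=high, windy=true
scores = naive_bayes_scores(("sunny", "cool", "high", "true"))
total = sum(scores.values())
print({cls: round(s / total, 3) for cls, s in scores.items()})  # roughly no: 0.795, yes: 0.205
```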
Naive Bayes - Summary
- Want to classify a new instance (a1, a2, …, an) into a finite number of categories from the set H.
- Choose the most likely classification using
- Bayes' theorem: MAP (maximum a posteriori) classification
- All attributes are treated equally, so be careful which attributes to include: an attribute without any relation to the class can corrupt the result.
- Assign the most probable category h_MAP given (a1, a2, …, an), i.e. h_MAP = argmax_{h ∈ H} P(h | a1, …, an) = argmax_{h ∈ H} P(a1, …, an | h) P(h)
- "Naive Bayes" because the attributes are treated as independent; only then can the conditional probabilities P(ai | h) be multiplied.
Bayesian Classifier β The Zero Frequency Problem

Bayesian Classifier β The Zero Frequency Problem - Solution
- What if an attribute value doesn't occur with every class value, e.g. P(Outlook = overcast | no) = 0?
- Remedy: add 1 to the numerator for every attribute value-class combination (adjusting the denominator accordingly); then the probability can never be zero

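A minimal sketch of this add-one (Laplace) correction; the function and variable names are illustrative:

```python
# Zero-frequency fix: add 1 to every attribute value-class count and add the
# number of distinct attribute values to the denominator, so no conditional
# probability can become exactly zero.
def smoothed_prob(count_value_and_class, count_class, n_values, k=1):
    return (count_value_and_class + k) / (count_class + k * n_values)

# Example: Outlook = overcast never occurs with class "no" (0 of 5 instances);
# outlook has 3 possible values (sunny, overcast, rainy).
print(smoothed_prob(0, 5, 3))  # 1/8 = 0.125 instead of 0
```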
Missing Values
- Training: the instance is not included in the frequency count for the attribute value-class combination
- Classification: the attribute is omitted from the calculation (see the sketch below)
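A small sketch of how a missing attribute is simply left out of the likelihood product during classification; the conditional probabilities are hypothetical:

```python
# During classification, skip attributes whose value is missing (None):
# the likelihood product just has one factor fewer.
def score(prior, cond_probs, instance):
    p = prior
    for attr, value in instance.items():
        if value is None:          # missing value: omit this attribute
            continue
        p *= cond_probs[attr][value]
    return p

# Hypothetical conditional probabilities P(value | class = yes):
cond_yes = {"outlook": {"sunny": 2/9}, "humidity": {"high": 3/9}, "windy": {"true": 3/9}}
print(score(9/14, cond_yes, {"outlook": None, "humidity": "high", "windy": "true"}))
```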

Dealing with Numeric Attributes
- Usual assumption: attributes have a normal or Gaussian probability distribution (given the class)
- The probability density function for the normal distribution is defined by two parameters:
- The sample mean: μ = (1/n) Σᵢ xᵢ
- The standard deviation: σ = √( (1/(n−1)) Σᵢ (xᵢ − μ)² )
- The density function: f(x) = (1 / (√(2π) σ)) · exp(−(x − μ)² / (2σ²))
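A small sketch of using the class-conditional Gaussian density for a numeric attribute; the temperature values per class are those of the weather toy data, and the helper name is illustrative:

```python
import math

def gaussian_pdf(x, values):
    """Normal density using the sample mean and standard deviation of `values`."""
    mu = sum(values) / len(values)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / (len(values) - 1))
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

# Temperature values per class (weather toy data).
temp_yes = [83, 70, 68, 64, 69, 75, 75, 72, 81]   # mean 73
temp_no = [85, 80, 65, 72, 71]                    # mean 74.6
print(gaussian_pdf(66, temp_yes), gaussian_pdf(66, temp_no))  # density of temperature=66 given yes/no
```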

Numeric Data: Unknown Distribution
- What if the data does not follow a known distribution?
- In this case we need a mechanism to estimate the density distribution.
- A simple and intuitive approach is based on kernel density estimation (modelling an irregular probability density function)
- Consider a random variable X whose distribution f(X) is unknown, but from which a sample S with a non-uniform distribution is available
Kernel Density Estimation
- We want to derive a function f(x) such that
- (1) f(x) is a probability density function, i.e. ∫ f(x) dx = 1
- (2) f(x) is a smooth approximation of the data points in the sample S
- (3) f(x) can be used to estimate values x* which are not in S
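A minimal sketch of a Gaussian kernel density estimate that satisfies these three requirements; the sample values and bandwidth are illustrative:

```python
import math

def kde(x, sample, bandwidth=1.0):
    """Gaussian kernel density estimate: average of kernels centred at the sample points."""
    def kernel(u):
        return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)
    return sum(kernel((x - xi) / bandwidth) for xi in sample) / (len(sample) * bandwidth)

# Illustrative non-uniform (bimodal) sample.
sample = [1.1, 1.3, 1.4, 1.8, 5.9, 6.2, 6.4]
for x in (1.5, 4.0, 6.0):   # estimate the density at points not in the sample
    print(x, round(kde(x, sample), 3))
```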

Discussion of Naïve Bayes
- Naïve Bayes works surprisingly well (even if the independence assumption is clearly violated)
- Domingos, P., Pazzani, M.: On the Optimality of the Simple Bayesian Classifier under Zero-One Loss. Machine Learning 29 (1997).
- However: adding too many redundant attributes will cause problems (e.g. identical attributes)
- Note also: many numeric attributes are not normally distributed
Time Complexity
- Calculating the conditional probabilities: time O(n), where n is the number of instances
- Calculating the class: time O(c·a), where c is the number of classes and a the number of attributes
Bayesian (Belief) Networks: Multiple Variables with Dependency
- The Naïve Bayes assumption of conditional independence is often too restrictive
- A Bayesian belief network (Bayesian net) describes conditional independence among subsets of attributes: it combines prior knowledge about dependencies among variables with observed training data
- Graphical representation: directed acyclic graph (DAG), one node for each attribute
- Overall probability distribution factorized into component distributions
- Graphβs nodes hold component distributions (conditional distributions)
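The factorization mentioned above is a product of one conditional distribution per node; a generic form (notation assumed, not copied from the slides):

```latex
P(x_1, \dots, x_n) = \prod_{i=1}^{n} P\bigl(x_i \mid \mathrm{Parents}(x_i)\bigr)
```

For example, for a small made-up chain Outlook → Humidity → Play, the joint distribution factorizes as P(Outlook) · P(Humidity | Outlook) · P(Play | Humidity).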
Probability Laws

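The body of this card is missing; presumably it lists the standard laws, which (as a hedged reconstruction, not copied from the slide) are:

```latex
0 \le P(a) \le 1, \qquad P(\mathrm{true}) = 1, \quad P(\mathrm{false}) = 0
P(a \lor b) = P(a) + P(b) - P(a \land b)
P(a \land b) = P(a \mid b)\,P(b) = P(b \mid a)\,P(a) \quad \text{(product rule)}
P(b \mid a) = \frac{P(a \mid b)\,P(b)}{P(a)} \quad \text{(Bayes' rule)}
```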