Probabilistic Learning – Classification Using Naive Bayes Flashcards
Bayesian methods
The techniques descended from the work of the 18th-century
mathematician Thomas Bayes, who developed foundational principles to describe
the probability of events, and how probabilities should be revised in the light of
additional information. These principles formed the foundation for what are now
known as Bayesian methods.
What is probability?
A probability is a number between 0 and 1 (that is, between 0 percent and
100 percent), which captures the chance that an event will occur in the light of the
available evidence.
The lower the probability, the less likely the event is to occur. A
probability of 0 indicates that the event will definitely not occur, while a probability
of 1 indicates that the event will occur with 100 percent certainty.
Classifiers based on Bayesian methods
Classifiers based on Bayesian methods utilize training data to calculate an observed
probability of each outcome based on the evidence provided by feature values. When
the classifier is later applied to unlabeled data, it uses the observed probabilities to
predict the most likely class for the new features. It’s a simple idea, but it results in
a method that often has results on par with more sophisticated algorithms. In fact,
Bayesian classifiers have been used for:
• Text classification, such as junk e-mail (spam) filtering
• Intrusion or anomaly detection in computer networks
• Diagnosing medical conditions given a set of observed symptoms
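As a rough illustration of the first use case, here is a minimal sketch of a Bayesian text classifier built with scikit-learn's MultinomialNB; the four training messages and their labels are invented for the example.

```python
# A minimal Bayesian spam-filter sketch using scikit-learn.
# The four training messages and their labels are invented for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

messages = [
    "cheap viagra buy now",        # spam
    "meeting rescheduled to 3pm",  # ham
    "win the lottery today",       # spam
    "lunch tomorrow?",             # ham
]
labels = ["spam", "ham", "spam", "ham"]

# Turn each message into word-count features.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(messages)

# Fitting estimates the observed probability of each word given each class.
model = MultinomialNB()
model.fit(X, labels)

# Predict the most likely class for a new, unlabeled message.
new_message = vectorizer.transform(["buy cheap viagra"])
print(model.predict(new_message))  # expected: ['spam']
```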
how are Bayesian classifiers typically applied?
Typically, Bayesian classifiers are best applied to problems in which the information
from numerous attributes should be considered simultaneously in order to estimate
the overall probability of an outcome. While many machine learning algorithms
ignore features that have weak effects, Bayesian methods utilize all the available
evidence to subtly change the predictions. If a large number of features have
relatively minor effects, their combined impact can nonetheless be quite large,
as the sketch below illustrates.
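To see how many weak pieces of evidence can add up, consider a hypothetical sketch in which each of 20 features shifts the odds of spam by a factor of only 1.2; together they multiply the odds by roughly 38 (1.2^20). All of the numbers below are invented for illustration.

```python
# Hypothetical sketch: many individually weak features can combine into
# strong evidence. All of these numbers are invented for illustration.
prior_odds = 0.25        # e.g., prior odds of spam: 0.20 / 0.80
likelihood_ratio = 1.2   # each feature nudges the odds only slightly
n_features = 20

posterior_odds = prior_odds * likelihood_ratio ** n_features
posterior_prob = posterior_odds / (1 + posterior_odds)
print(f"odds: {posterior_odds:.1f}")         # about 9.6
print(f"probability: {posterior_prob:.2f}")  # about 0.91
```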
Bayesian probability theory
Bayesian probability theory is rooted in the idea that the estimated
likelihood of an event, or a potential outcome, should be based on the evidence at
hand across multiple trials, or opportunities for the event to occur.
examples of real-world events and trials
Event                          Trial
Heads result                   Coin flip
Rainy weather                  A single day
Message is spam                Incoming e-mail message
Candidate becomes president    Presidential election
Win the lottery                Lottery ticket
mutually exclusive and exhaustive events
For example, given the value P(spam) = 0.20, we can calculate P(ham)
= 1 – 0.20 = 0.80. This works because spam and ham are mutually exclusive and
exhaustive events, meaning that they cannot occur at the same time and are the
only possible outcomes.
Because an event cannot simultaneously happen and not happen, an event is always
mutually exclusive and exhaustive with its complement, the event comprising
the outcomes in which the event of interest does not happen.
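Using the P(spam) = 0.20 figure from above, the complement rule is a one-line calculation:

```python
# Complement rule: P(ham) = 1 - P(spam), using the figure from the text.
p_spam = 0.20
p_ham = 1 - p_spam  # spam and ham are mutually exclusive and exhaustive
print(p_ham)  # 0.8
```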
Venn Diagram
A Venn diagram, first used in the late 19th century by John Venn, uses circles
to illustrate the overlap between sets of items. In most Venn diagrams, the
size of the circles and the degree of overlap are not meaningful. Instead, the
diagram serves as a reminder to allocate probability to all possible
combinations of events.
probability of both together
In other words, we want to
estimate the probability that spam and Viagra occur together, which can be written
as P(spam ∩ Viagra). The upside down ‘U’ symbol signifies the intersection of the two
events; the notation A ∩ B refers to the event in which both A and B occur.
joint probability vs independent events
Calculating P(spam ∩ Viagra) depends on the joint probability of the two events or
how the probability of one event is related to the probability of the other. If the two
events are totally unrelated, they are called independent events. This is not to say
that independent events cannot occur at the same time; event independence simply
implies that knowing the outcome of one event does not provide any information
about the outcome of the other. For instance, the outcome of a heads result on a coin
flip is independent from whether the weather is rainy or sunny on any given day.
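A quick simulation makes this concrete: if heads and rainy weather are sampled independently (the 30 percent chance of rain is an invented figure), the frequency of heads among rainy days comes out the same as the frequency of heads overall.

```python
# Simulate two independent events: heads on a coin flip and rain on a day.
# The 30 percent chance of rain is an invented figure for this sketch.
import random

random.seed(1)
n = 100_000
heads = [random.random() < 0.5 for _ in range(n)]
rainy = [random.random() < 0.3 for _ in range(n)]

p_heads = sum(heads) / n
p_heads_given_rain = sum(h and r for h, r in zip(heads, rainy)) / sum(rainy)

# Independence: knowing it rained tells us nothing about the coin.
print(round(p_heads, 3), round(p_heads_given_rain, 3))  # both near 0.5
```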
dependent events
If all events were independent, it would be impossible to predict one event by
observing another. In other words, dependent events are the basis of predictive
modeling. Just as the presence of clouds is predictive of a rainy day, the appearance
of the word Viagra is predictive of a spam e-mail.
calculating the probability of independent events
Calculating the probability of dependent events is a bit more complex than for
independent events. If P(spam) and P(Viagra) were independent, we could easily
calculate P(spam ∩ Viagra), the probability of both events happening at the same
time. Because 20 percent of all messages are spam, and 5 percent of all
e-mails contain the word Viagra, we could estimate that 1 percent of all messages
are spam messages containing the term Viagra, since 0.05 * 0.20 = 0.01. More generally, for
independent events A and B, the probability of both happening can be expressed as
P(A ∩ B) = P(A) * P(B).
This said, we know that P(spam) and P(Viagra) are likely to be highly dependent,
which means that this calculation is incorrect. To obtain a reasonable estimate, we
need to use a more careful formulation of the relationship between these two events,
which is based on advanced Bayesian methods.
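Written out as a quick sketch, the independence calculation from this passage looks like this:

```python
# Product rule for independent events: P(A ∩ B) = P(A) * P(B).
p_spam = 0.20    # 20 percent of all messages are spam
p_viagra = 0.05  # 5 percent of all messages contain the word Viagra

# Valid only if the events are independent; for spam and the
# word Viagra, they almost certainly are not.
p_both_if_independent = p_spam * p_viagra
print(round(p_both_if_independent, 2))  # 0.01
```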
Bayes' theorem of probability
The notation P(A|B) is read as the probability of event A, given that event B
occurred. This is known as conditional probability, since the probability of A is
dependent (that is, conditional) on what happened with event B. Bayes’ theorem
tells us that our estimate of P(A|B) should be based on P(A ∩ B), a measure of how
often A and B are observed to occur together, and P(B), a measure of how often B is
observed to occur in general. Formally, P(A|B) = P(A ∩ B) / P(B).
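As a small numeric check of that definition, the sketch below assumes a joint probability P(spam ∩ Viagra) = 0.04, a figure not given above, alongside the P(Viagra) = 0.05 from earlier:

```python
# Conditional probability from a joint and a marginal probability:
# P(spam | Viagra) = P(spam ∩ Viagra) / P(Viagra)
p_viagra = 0.05           # from the text: 5 percent of messages contain Viagra
p_spam_and_viagra = 0.04  # assumed joint probability for this sketch

p_spam_given_viagra = p_spam_and_viagra / p_viagra
print(round(p_spam_given_viagra, 2))  # 0.8
```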
prior probability
Without knowledge of an incoming message’s content, the best estimate of its
spam status would be P(spam), the probability that any prior message was spam,
which we calculated previously to be 20 percent. This estimate is known as the
prior probability.
likelihood and marginal likelihood
Suppose that you obtained additional evidence by looking more carefully at
the set of previously received messages to examine the frequency that the term
Viagra appeared. The probability that the word Viagra was used in previous spam
messages, or P(Viagra|spam), is called the likelihood. The probability that Viagra
appeared in any message at all, or P(Viagra), is known as the marginal likelihood.
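Combining the prior, likelihood, and marginal likelihood gives the posterior probability P(spam|Viagra); the likelihood value of 0.20 below is an assumed figure for the sketch, not one stated above:

```python
# Posterior via Bayes' theorem:
# P(spam | Viagra) = P(Viagra | spam) * P(spam) / P(Viagra)
p_spam = 0.20               # prior: 20 percent of past messages were spam
p_viagra = 0.05             # marginal likelihood: Viagra appears in 5 percent
p_viagra_given_spam = 0.20  # likelihood: an assumed figure for this sketch

posterior = p_viagra_given_spam * p_spam / p_viagra
# Seeing the word raises the spam estimate from the 0.20 prior to 0.80.
print(round(posterior, 2))  # 0.8
```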