Probabilistic Learning – Classification Using Naive Bayes Flashcards

1
Q

Bayesian methods

A

Bayesian methods descend from the work of the 18th-century
mathematician Thomas Bayes, who developed foundational principles describing
the probability of events and how probabilities should be revised in light of
additional information. These principles form the foundation for what are now
known as Bayesian methods.

2
Q

What is probability?

A

Probability is a number between 0 and 1 (that is, between 0 percent and
100 percent) that captures the chance that an event will occur in light of the
available evidence. The lower the probability, the less likely the event is to occur. A
probability of 0 indicates that the event will definitely not occur, while a probability
of 1 indicates that the event will occur with 100 percent certainty.

3
Q

Classifiers based on Bayesian methods

A

Classifiers based on Bayesian methods utilize training data to calculate an observed
probability of each outcome based on the evidence provided by feature values. When
the classifier is later applied to unlabeled data, it uses the observed probabilities to
predict the most likely class for the new features. It’s a simple idea, but it results in
a method whose performance is often on par with more sophisticated algorithms. In fact,
Bayesian classifiers have been used for:
• Text classification, such as junk e-mail (spam) filtering
• Intrusion or anomaly detection in computer networks
• Diagnosing medical conditions given a set of observed symptoms

4
Q

How are Bayesian classifiers applied? Are features considered simultaneously?

A

Typically, Bayesian classifiers are best applied to problems in which the information
from numerous attributes should be considered simultaneously in order to estimate
the overall probability of an outcome. While many machine learning algorithms
ignore features that have weak effects, Bayesian methods utilize all the available
evidence to subtly change the predictions. If a large number of features have
relatively minor effects, taken together their combined impact could be quite large.

5
Q

Bayesian probability theory

A

Bayesian probability theory is rooted in the idea that the estimated
likelihood of an event, or a potential outcome, should be based on the evidence at
hand across multiple trials, or opportunities for the event to occur.

6
Q

Examples of real-world events and their trials

A

Event                          Trial
Heads result                   Coin flip
Rainy weather                  A single day
Message is spam                Incoming e-mail message
Candidate becomes president    Presidential election
Win the lottery                Lottery ticket

7
Q

mutually exclusive and exhaustive events

A

For example, given the value P(spam) = 0.20, we can calculate P(ham)
= 1 – 0.20 = 0.80. This works because spam and ham are mutually exclusive and
exhaustive events, which means that they cannot occur at the same time and are the
only possible outcomes.
Because an event cannot simultaneously happen and not happen, an event is always
mutually exclusive and exhaustive with its complement, the event comprising
the outcomes in which the event of interest does not happen.

8
Q

Venn Diagram

A

A Venn diagram, first used in the late 19th century
by John Venn, uses circles to illustrate the overlap between sets of items.
In most Venn diagrams, the size of the circles and the degree of the overlap are not
meaningful. Instead, the diagram is used as a reminder to allocate probability to all
possible combinations of events.

9
Q

probability of both together

A

We hope to estimate the probability that the spam and Viagra events occur
together, which can be written as P(spam ∩ Viagra). The upside-down ‘U’ symbol
signifies the intersection of the two events; the notation A ∩ B refers to the event
in which both A and B occur.

10
Q

joint probability vs independent events

A

Calculating P(spam ∩ Viagra) depends on the joint probability of the two events or
how the probability of one event is related to the probability of the other. If the two
events are totally unrelated, they are called independent events. This is not to say
that independent events cannot occur at the same time; event independence simply
implies that knowing the outcome of one event does not provide any information
about the outcome of the other. For instance, the outcome of a heads result on a coin
flip is independent from whether the weather is rainy or sunny on any given day.

11
Q

dependent events

A

If all events were independent, it would be impossible to predict one event by
observing another. In other words, dependent events are the basis of predictive
modeling. Just as the presence of clouds is predictive of a rainy day, the appearance
of the word Viagra is predictive of a spam e-mail.

12
Q

Calculating the probability of independent events

A

Calculating the probability of dependent events is a bit more complex than for
independent events. If P(spam) and P(Viagra) were independent, we could easily
calculate P(spam ∩ Viagra), the probability of both events happening at the same
time. Because 20 percent of all the messages are spam, and 5 percent of all the
e-mails contain the word Viagra, we could assume that 1 percent of all messages
are spam and contain the term Viagra, since 0.05 * 0.20 = 0.01. More generally, for
independent events A and B, the probability of both happening can be expressed as
P(A ∩ B) = P(A) * P(B).
This said, we know that P(spam) and P(Viagra) are likely to be highly dependent,
which means that this calculation is incorrect. To obtain a reasonable estimate, we
need to use a more careful formulation of the relationship between these two events,
which is based on advanced Bayesian methods.
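
A minimal Python sketch of the independence calculation above (the variable names are illustrative):

# Under independence, the joint probability is the product of the marginals:
#   P(A ∩ B) = P(A) * P(B)
p_spam = 0.20    # 20 percent of messages are spam
p_viagra = 0.05  # 5 percent of messages contain the word Viagra
p_joint_if_independent = p_spam * p_viagra   # 0.01, i.e. 1 percent of messages
# For dependent events this product is not a valid estimate, which is why a
# more careful formulation (Bayes' theorem) is needed.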

13
Q

Bayes' theorem of probability

A

The notation P(A|B) is read as the probability of event A, given that event B
occurred. This is known as conditional probability, since the probability of A is
dependent (that is, conditional) on what happened with event B. Bayes’ theorem
tells us that our estimate of P(A|B) should be based on P(A ∩ B), a measure of how
often A and B are observed to occur together, and P(B), a measure of how often B is
observed to occur in general.
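
Written out, using the notation from the cards above (a sketch of the relationship, not a quotation from the text):
P(A|B) = P(A ∩ B) / P(B)
and, since P(A ∩ B) = P(B|A) * P(A), Bayes' theorem follows as
P(A|B) = P(B|A) * P(A) / P(B)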

14
Q

prior probability

A

Without knowledge of an incoming message’s content, the best estimate of its
spam status would be P(spam), the probability that any prior message was spam,
which we calculated previously to be 20 percent. This estimate is known as the
prior probability.

15
Q

likelihood and marginal likelihood

A

Suppose that you obtained additional evidence by looking more carefully at
the set of previously received messages to examine the frequency that the term
Viagra appeared. The probability that the word Viagra was used in previous spam
messages, or P(Viagra|spam), is called the likelihood. The probability that Viagra
appeared in any message at all, or P(Viagra), is known as the marginal likelihood.

16
Q

posterior probability

A

Posterior probability is the probability that an event will happen after all evidence or background information has been taken into account. It is closely related to the prior probability, which is the probability that an event will happen before you take any new evidence into account. You can think of the posterior probability as the prior probability updated in light of the new evidence (the likelihood):
posterior probability = the prior probability adjusted by the likelihood of the new evidence.
For example, historical data suggests that around 60% of students who start college will graduate within 6 years. This is the prior probability. However, you think that figure is actually much lower, so you set out to collect new data. The evidence you collect suggests that the true figure is actually closer to 50%; this is the posterior probability.
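
A small numeric sketch tying prior, likelihood, marginal likelihood, and posterior together for the spam example; the value assumed for P(Viagra|spam) is illustrative, not taken from the text:

# Bayes' theorem: posterior = likelihood * prior / marginal likelihood
#   P(spam|Viagra) = P(Viagra|spam) * P(spam) / P(Viagra)
p_spam = 0.20                 # prior probability
p_viagra = 0.05               # marginal likelihood
p_viagra_given_spam = 0.20    # likelihood (assumed value for illustration)
p_spam_given_viagra = p_viagra_given_spam * p_spam / p_viagra   # 0.80
# Seeing the word Viagra raises the estimated probability of spam from 20% to 80%.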

17
Q

Strengths and weaknesses of the Naive Bayes algorithm

A
Strengths
• Simple, fast, and very effective
• Does well with noisy and missing data
• Requires relatively few examples for training, but also works well with very large numbers of examples
• Easy to obtain the estimated probability for a prediction
Weaknesses
• Relies on an often-faulty assumption of equally important and independent features
• Not ideal for datasets with many numeric features
• Estimated probabilities are less reliable than the predicted classes
18
Q

Why is it called Naive?

A

The Naive Bayes algorithm is named as such because it makes some “naive”
assumptions about the data. In particular, Naive Bayes assumes that all of the
features in the dataset are equally important and independent. These assumptions
are rarely true in most real-world applications.

19
Q

Naive Bayes uses categorical values; how are numeric features handled?

A

One easy and effective solution is to discretize numeric features, which simply means
that the numbers are put into categories known as bins. For this reason, discretization
is also sometimes called binning. This method is ideal when there are large amounts
of training data, a common condition while working with Naive Bayes.
There are several different ways to discretize a numeric feature. Perhaps the most
common is to explore the data for natural categories or cut points in the distribution of
data. For example, suppose that you added a feature to the spam dataset that recorded
the time of night or day the e-mail was sent, from 0 to 24 hours past midnight.
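
A minimal Python sketch of binning such a time-of-day feature; the cut points and labels are illustrative assumptions:

import numpy as np

hours = np.array([1.5, 8.0, 13.2, 19.5, 23.9])   # hours past midnight
cut_points = [6, 12, 18, 24]                      # illustrative bin edges
labels = ["night", "morning", "afternoon", "evening"]
bin_index = np.digitize(hours, cut_points)        # index of the bin for each value
categories = [labels[i] for i in bin_index]
# ['night', 'morning', 'afternoon', 'evening', 'evening']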

21
Q

Summarize Naive Bayes

A

The Naive Bayes algorithm
constructs tables of probabilities that are used to estimate the likelihood that new
examples belong to various classes. The probabilities are calculated using a formula
known as Bayes’ theorem, which specifies how dependent events are related.
Although Bayes’ theorem can be computationally expensive, a simplified version that
makes so-called “naive” assumptions about the independence of features is capable
of handling extremely large datasets.
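
A minimal sketch of this idea using scikit-learn's Bernoulli Naive Bayes on binary word-presence features; the tiny dataset is made up for illustration:

import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Columns: message contains "Viagra", message contains "meeting"
X = np.array([[1, 0], [1, 0], [0, 1], [0, 1], [0, 0]])
y = np.array(["spam", "spam", "ham", "ham", "ham"])

model = BernoulliNB()
model.fit(X, y)                       # builds the per-class probability tables

new_msg = np.array([[1, 0]])          # contains "Viagra", no "meeting"
print(model.predict(new_msg))         # most likely class
print(model.predict_proba(new_msg))   # estimated class probabilities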