Naive Bayes and Sentiment Classification Flashcards
Text categorization
The task of assigning a label or category to an entire text or document.
Sentiment analysis
The extraction of sentiment - the positive or negative orientation that a writer expresses toward some object.
Spam detection
The binary classification task of assigning an email to one of the two classes: spam or not-spam.
Naive Bayes
A probabilistic classifier. For a given document d, out of all classes c ∈ C, the classifier returns the class c^ which has the maximum posterior probability given the document:
c^ = argmax over c ∈ C of P(c | d) = argmax over c ∈ C of P(d | c) P(c)
(the second form follows from Bayes’ rule, dropping the denominator P(d), which is the same for every class).
Why is Naive Bayes considered a generative model?
Because we can read its equation as expressing an implicit assumption about how a document is generated: first a class is sampled from P(c), and then the words are generated by sampling from P(d | c).
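A minimal sketch of this generative story in Python, using made-up class priors and per-class word probabilities (all names and numbers below are illustrative assumptions, not from the text):

```python
import random

# Illustrative toy parameters (not from the text): class prior P(c)
# and per-class unigram distributions P(w | c).
prior = {"positive": 0.6, "negative": 0.4}
word_probs = {
    "positive": {"great": 0.5, "fun": 0.3, "boring": 0.2},
    "negative": {"great": 0.1, "fun": 0.2, "boring": 0.7},
}

def generate_document(length=5):
    # First sample a class from P(c) ...
    c = random.choices(list(prior), weights=prior.values())[0]
    # ... then sample each word independently from P(w | c).
    words = random.choices(list(word_probs[c]), weights=word_probs[c].values(), k=length)
    return c, words

print(generate_document())
```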
2 Simplifying Assumptions of Naive Bayes classifiers
- Bag-of-words assumption: the position of a word doesn’t matter.
- Naive Bayes assumption: the feature probabilities are conditionally independent given the class (see the next card).
Naive Bayes Assumption
c^ = argmax over c ∈ C of P(d | c) P(c), where a document d is represented as a set of features, so the likelihood P(d | c) = P(f₁, f₂, ..., fₙ | c).
Estimating the probability of every possible combination of features (e.g. every possible set of words and positions) would require a huge number of parameters.
The naive Bayes assumption is a conditional independence assumption: the features fᵢ are assumed independent of each other given the class c, so their probabilities P(fᵢ | c) can be ‘naively’ multiplied:
P(f₁, f₂, ..., fₙ | c) = P(f₁ | c) P(f₂ | c) ... P(fₙ | c)
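For example (illustrative numbers, not from the text): under this assumption, P(“great”, “movie” | positive) is simply P(“great” | positive) · P(“movie” | positive), e.g. 0.01 × 0.02 = 0.0002, rather than a jointly estimated probability for the word pair.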
Naive Bayes Applied
positions ← all word positions in the test document
c_NB = argmax over c ∈ C of { P(c) ∏ᵢ P(wᵢ | c) }, where i ∈ positions
To avoid underflow and increase speed, calculations are done in log space:
c_NB = argmax over c ∈ C of { log P(c) + Σᵢ log P(wᵢ | c) }, where i ∈ positions
Because the classifier works in log space, the predicted class is a linear function of the input features, which makes naive Bayes a linear classifier.
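A minimal sketch of this test-time procedure in Python; the parameter format (dictionaries of log probabilities) is an illustrative assumption, not from the text:

```python
import math

def classify(doc_words, log_prior, log_likelihood, vocab):
    """Return the most probable class for a tokenized document.

    log_prior[c]          -- log P(c)
    log_likelihood[c][w]  -- log P(w | c)
    Unknown words (not in vocab) are simply ignored.
    """
    best_class, best_score = None, -math.inf
    for c in log_prior:
        # log P(c) + sum over word positions of log P(w_i | c)
        score = log_prior[c]
        for w in doc_words:
            if w in vocab:
                score += log_likelihood[c][w]
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```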
Naive Bayes:
How to learn the probability P(c)?
For the class prior P(c), we ask what percentage of documents in our training set are in each class c.
Let N_c be the number of documents in our training data with class c and N_doc be the total number of documents. Then
^P(c) = N_c / N_doc
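For example (illustrative numbers, not from the text): if 3 of the 5 documents in a tiny training set are labelled positive, then ^P(positive) = 3/5 = 0.6.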
Naive Bayes:
How to learn the probability P(fᵢ | c)?
To learn the probability P(fᵢ | c), we’ll assume a feature is just the existence of a word in the document’s bag of words. We’ll want P(wᵢ | c), which we compute as the fraction of times the word wᵢ appears among all words in all documents of topic c.
We first concatenate all documents with category c into one big “category c” text. Then we use the frequency of wᵢ in this concatenated document to give a maximum likelihood estimate of the probability:
^P(wᵢ | c) = count(wᵢ, c) / Σ count(w, c), summed over all words w in the vocabulary
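For instance (illustrative counts, not from the text): if “great” occurs 10 times in the concatenated positive-class text, which contains 1,000 word tokens in total, then ^P(great | positive) = 10 / 1000 = 0.01.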
Laplace Smoothing in Naive Bayes
To prevent zero probabilities (a single unseen word would otherwise zero out the entire product of likelihoods), 1 is added to every count:
^P(wᵢ | c) = [ count(wᵢ, c) + 1 ] / [ Σ count(w, c) over all words w + |V| ],
where |V| is the size of the vocabulary.
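A minimal training sketch in Python, assuming the training data arrives as (tokens, class) pairs; the function name and the dictionary-based return format are illustrative assumptions that match the classify() sketch above:

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(documents):
    """documents: iterable of (list_of_tokens, class_label) pairs.

    Returns log_prior, log_likelihood, vocab in the format
    expected by the classify() sketch above.
    """
    class_doc_counts = Counter()              # N_c
    word_counts = defaultdict(Counter)        # count(w, c)
    vocab = set()

    for tokens, c in documents:
        class_doc_counts[c] += 1
        word_counts[c].update(tokens)
        vocab.update(tokens)

    n_docs = sum(class_doc_counts.values())   # N_doc
    log_prior, log_likelihood = {}, {}

    for c in class_doc_counts:
        log_prior[c] = math.log(class_doc_counts[c] / n_docs)
        total = sum(word_counts[c].values())  # Σ count(w, c) over all w
        # Add-1 (Laplace) smoothing: (count(w, c) + 1) / (total + |V|)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + 1) / (total + len(vocab)))
            for w in vocab
        }

    return log_prior, log_likelihood, vocab
```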
Naive Bayes
What to do with unknown words?
Ignore them.
Remove them from the test document and do not include any probability for them at all.
Naive Bayes
Stop words
Very frequent words like ‘the’ and ‘a’ are often ignored: they are removed from both the training and test documents.
Binary Multinomial Naive Bayes
Used for sentiment analysis.
For sentiment classification, whether a word occurs or not seems to matter more than its frequency.
So word counts are clipped at 1 for each document.
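A minimal sketch of this clipping step, assuming each document is a list of tokens (the helper name binarize is an illustrative choice, not from the text):

```python
def binarize(tokens):
    # Clip word counts at 1 per document: keep each word type at most once,
    # so each document contributes a count of 0 or 1 for every word.
    return list(set(tokens))

# e.g. binarize(["great", "great", "fun"]) -> ["fun", "great"] (order may vary)
```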
How to deal with negation for Naive Bayes?
During text normalisation, prepend the prefix NOT_ to every word after a token of logical negation until the next punctuation mark.
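A minimal sketch of this normalisation step in Python; the tokenising regex and the small set of negation cues are illustrative assumptions, not from the text:

```python
import re

# Illustrative negation cues; a real system might use a longer list.
NEGATION = {"not", "no", "never", "didn't", "isn't", "won't"}

def mark_negation(text):
    """Prepend NOT_ to every token after a negation word,
    up to the next punctuation mark."""
    tokens = re.findall(r"[\w']+|[.,!?;]", text.lower())
    out, negating = [], False
    for tok in tokens:
        if tok in ".,!?;":
            negating = False
            out.append(tok)
        elif negating:
            out.append("NOT_" + tok)
        else:
            out.append(tok)
            if tok in NEGATION:
                negating = True
    return out

# e.g. mark_negation("I didn't like this movie, but ...")
# -> ['i', "didn't", 'NOT_like', 'NOT_this', 'NOT_movie', ',', 'but', ...]
```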