Naive Bayes and Sentiment Classification Flashcards
Text categorization
The task of assigning a label or category to an entire text or document.
Sentiment analysis
The extraction of sentiment - the positive or negative orientation that a writer expresses toward some object.
Spam detection
The binary classification task of assigning an email to one of the two classes spam or not-spam.
Naive Bayes
A probabilistic classifier.
For a given document d, out of all classes c ∈ C, the classifier returns the class ĉ which has the maximum posterior probability given the document:
ĉ = argmax over c ∈ C of P(c | d) = argmax over c ∈ C of P(d | c) P(c)
Why is Naive Bayes considered a generative model?
Because we can read its equation as stating an implicit assumption about how a document is generated: first a class is sampled from P(c), and then words are generated by sampling from P(d | c).
2 Simplifying Assumptions of Naive Bayes classifiers
- Bag of words assumption - position of a word doesn’t matter.
- Naive Bayes assumption: the feature probabilities are conditionally independent given the class (see next card).
Naive Bayes Assumption
ĉ = argmax over c ∈ C of P(d | c) P(c), where a document d is a set of features f₁, f₂, ..., fₙ. So the likelihood P(d | c) = P(f₁, f₂, ..., fₙ | c).
Estimating the probability of every possible combination of features (e.g. every possible set of words and positions) would require a huge number of parameters.
The naive Bayes assumption is a conditional independence assumption: the probabilities P(fᵢ | c) are independent given the class c, and thus can be ‘naively’ multiplied:
P(f₁, f₂, ..., fₙ | c) = P(f₁ | c) · P(f₂ | c) · ... · P(fₙ | c)
Naive Bayes Applied
positions ← all word positions in the test document
c_NB = argmax over c ∈ C of [ P(c) · ∏ᵢ P(wᵢ | c) ], where i ∈ positions
To avoid underflow and increase speed, calculations are done in log space:
c_NB = argmax over c ∈ C of [ log P(c) + Σᵢ log P(wᵢ | c) ], where i ∈ positions
By considering features in log space, the predicted class is a linear function of the input features, which makes naive Bayes a linear classifier.
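As an illustration, here is a minimal Python sketch of this log-space decision rule, assuming log_prior[c] and log_likelihood[c][w] are dictionaries of estimated log probabilities (hypothetical names; a matching training sketch appears under Laplace smoothing below):

```python
import math

def classify(test_words, classes, log_prior, log_likelihood):
    """Pick the class maximizing log P(c) + sum_i log P(w_i | c)."""
    best_class, best_score = None, -math.inf
    for c in classes:
        score = log_prior[c]
        for w in test_words:
            if w in log_likelihood[c]:   # unknown words are simply skipped
                score += log_likelihood[c][w]
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```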
Naive Bayes:
How to learn the prior probability P(c)?
For the class prior P(c), we ask what percentage of the documents in our training set are in each class c. Let N_c be the number of documents in our training data with class c and N_doc be the total number of documents. Then:
P̂(c) = N_c / N_doc
Naive Bayes:
How to learn the probability P(fᵢ | c)?
To learn the probability P(fᵢ | c), we’ll assume a feature is just the existence of a word in the document’s bag of words, so we want P(wᵢ | c): the fraction of times the word wᵢ appears among all words in all documents of topic c.
We first concatenate all documents with category c into one big “category c” text, then use the frequency of wᵢ in this concatenated document to give a maximum likelihood estimate of the probability:
P̂(wᵢ | c) = count(wᵢ, c) / Σ_w count(w, c), where the sum is over all words w in the vocabulary.
Laplace Smoothing in Naive Bayes
To prevent zero probabilities, 1 is added to the counts:
P̂(wᵢ | c) = [ count(wᵢ, c) + 1 ] / [ Σ_w count(w, c) + |V| ], where |V| is the size of the vocabulary.
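A minimal Python training sketch combining the last three cards (priors, word likelihoods, and add-1 smoothing); the function name train_naive_bayes and the (docs, labels) input format are illustrative assumptions:

```python
import math
from collections import Counter

def train_naive_bayes(docs, labels):
    """Estimate log P(c) and add-1-smoothed log P(w | c) from
    (document, label) pairs, where each document is a list of tokens."""
    classes = set(labels)
    n_doc = len(docs)
    vocab = {w for doc in docs for w in doc}

    log_prior, log_likelihood = {}, {}
    for c in classes:
        class_docs = [doc for doc, lab in zip(docs, labels) if lab == c]
        log_prior[c] = math.log(len(class_docs) / n_doc)  # log(N_c / N_doc)

        # Concatenate all documents of class c and count word frequencies.
        counts = Counter(w for doc in class_docs for w in doc)
        denom = sum(counts.values()) + len(vocab)         # Σ count(w, c) + |V|
        log_likelihood[c] = {
            w: math.log((counts[w] + 1) / denom) for w in vocab
        }
    return log_prior, log_likelihood
```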
Naive Bayes
What to do with unknown words?
Ignore them: remove them from the test document and do not include any probability for them at all.
Naive Bayes
Stop words
Very frequent words like ‘the’ and ‘a’ are often ignored.
Removed from both the training and test documents.
Binary Multinomial Naive Bayes
Used for sentiment analysis.
For sentiment classification, whether a word occurs or not seems to matter more than its frequency.
So word counts are clipped at 1 for each document.
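The clipping step amounts to deduplicating tokens within each document before counting. A one-line sketch, reusing the hypothetical train_naive_bayes from the training sketch above:

```python
# Each word now contributes at most 1 count per document.
binarized_docs = [list(set(doc)) for doc in docs]
log_prior, log_likelihood = train_naive_bayes(binarized_docs, labels)
```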
How to deal with negation for Naive Bayes.
During text normalisation, prepend the prefix NOT_ to every word after a token of logical negation (n’t, not, no, never) until the next punctuation mark, e.g. “didn’t like this movie” becomes “didn’t NOT_like NOT_this NOT_movie”.
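A rough Python sketch of this normalisation step; the exact negation-word list and punctuation set are simplifying assumptions:

```python
import re

NEGATION = re.compile(r"^(?:not|no|never|.*n't)$", re.IGNORECASE)
PUNCT = re.compile(r"^[.,;:!?]$")

def mark_negation(tokens):
    """Prepend NOT_ to every token after a negation word,
    up to the next punctuation mark."""
    out, negating = [], False
    for tok in tokens:
        if PUNCT.match(tok):
            negating = False
            out.append(tok)
        elif negating:
            out.append("NOT_" + tok)
        else:
            out.append(tok)
            if NEGATION.match(tok):
                negating = True
    return out

# mark_negation(["didn't", "like", "this", "movie", ",", "but"])
# -> ["didn't", "NOT_like", "NOT_this", "NOT_movie", ",", "but"]
```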
Sentiment lexicons
Lists of words that are pre-annotated with positive or negative sentiment.
Language id
The task of determining what language a given piece of text is written in.
Gold labels
Human-defined labels for each document.
Confusion matrix
A table for visualising how an algorithm performs with respect to the human gold labels, using two dimensions (system output and gold labels), with each cell labelling a set of possible outcomes (e.g. true positives, false negatives).
Accuracy
The percentage of all observations our system labeled correctly.
Accuracy = (TP + TN) / (TP + FP + TN + FN)
Precision
Measures the percentage of the items that the system detected (i.e. labeled as positive) that are in fact positive.
Precision = TP / (TP + FP)
Recall
Measures the percentage of items actually present in the input that were correctly identified by the system.
Recall = TP / (TP + FN)
F-measure
A weighted harmonic mean of the precision and recall.
Fᵦ = (β² + 1) · P · R / (β² · P + R), where P is precision and R is recall.
β > 1 favors recall, while β < 1 favors precision.
F1 Score
An F-measure where β=1 and the weights of precision and recall are equally balanced.
F₁ = 2PR / (P + R)
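A small Python sketch computing precision, recall and Fᵦ from raw confusion-matrix counts (the function name and example counts are made up for illustration):

```python
def prf(tp, fp, fn, beta=1.0):
    """Precision, recall and F_beta from true/false positive and
    false negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_beta = (beta**2 + 1) * precision * recall / (beta**2 * precision + recall)
    return precision, recall, f_beta

# e.g. 30 true positives, 10 false positives, 60 false negatives:
# prf(30, 10, 60) -> (0.75, 0.333..., 0.4615...)
```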
Harmonic mean
The harmonic mean of a set of numbers is the reciprocal of the arithmetic mean of reciprocals.
HM(a₁, a₂, …, aₙ) = n / ( 1/a₁ + 1/a₂ + … + 1/aₙ )
Why is a Harmonic mean considered a conservative metric?
The harmonic mean of two values is closer to the minimum of the two values than the arithmetic mean is.
I.e. it weighs the lower of the two numbers more heavily.
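A quick numeric illustration in Python of how the harmonic mean tracks the smaller value:

```python
def harmonic_mean(*values):
    """n divided by the sum of reciprocals."""
    return len(values) / sum(1 / v for v in values)

# The harmonic mean stays close to the smaller input:
# harmonic_mean(0.9, 0.1) -> 0.18, while the arithmetic mean is 0.5.
```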
Classification evaluation
Macroaveraging
We compute the performance for each class and then average over classes.
More appropriate when performance on all classes is equally important.
Classification evaluation
Microaveraging
We collect the decisions for each class into a single confusion matrix, and then compute precision and recall from that table.
More appropriate when performance on larger classes is more important.
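A sketch contrasting the two averages for precision, assuming per-class (TP, FP) counts are available; the class names and counts below are invented for illustration:

```python
def macro_micro_precision(per_class_counts):
    """per_class_counts: dict mapping class -> (tp, fp).

    Macroaverage: mean of each class's precision (classes weighted equally).
    Microaverage: precision of the pooled counts (large classes dominate).
    """
    precisions = [tp / (tp + fp) for tp, fp in per_class_counts.values()]
    macro = sum(precisions) / len(precisions)

    total_tp = sum(tp for tp, _ in per_class_counts.values())
    total_fp = sum(fp for _, fp in per_class_counts.values())
    micro = total_tp / (total_tp + total_fp)
    return macro, micro

# e.g. {"urgent": (8, 11), "normal": (60, 55), "spam": (200, 33)}
# macro ≈ (0.42 + 0.52 + 0.86) / 3 ≈ 0.60, micro = 268 / 367 ≈ 0.73
```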
Cross-validation
In cross-validation, we choose a number k and partition our data into k disjoint subsets called folds. We choose one of those k folds as a test set, train our classifier on the remaining k−1 folds, and then compute the error rate on the test set. Then we repeat with another fold as the test set, again training on the other k−1 folds. We do this sampling process k times and average the test set error rate from these k runs to get an average error rate.
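A minimal Python sketch of k-fold cross-validation; train_fn and error_fn are hypothetical callbacks standing in for the actual training and evaluation code:

```python
import random

def cross_validation_error(docs, labels, k, train_fn, error_fn):
    """Average test-set error rate over k folds.

    train_fn(train_docs, train_labels) returns a fitted model;
    error_fn(model, test_docs, test_labels) returns an error rate.
    """
    indices = list(range(len(docs)))
    random.shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k disjoint subsets

    errors = []
    for i in range(k):
        test_idx = set(folds[i])
        train_idx = [j for j in indices if j not in test_idx]
        model = train_fn([docs[j] for j in train_idx],
                         [labels[j] for j in train_idx])
        errors.append(error_fn(model,
                               [docs[j] for j in folds[i]],
                               [labels[j] for j in folds[i]]))
    return sum(errors) / k
```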