Naive Bayes and Sentiment Classification Flashcards
Text categorization
The task of assigning a label or category to an entire text or document.
Sentiment analysis
The extraction of sentiment - the positive or negative orientation that a writer expresses toward some object.
Spam detection
The binary classification task of assigning an email to one of the two classes spam or not-spam.
Naive Bayes
A probabilistic classifier.
For a given document d, out of all classes c ∈ C, the classifier returns the class ĉ which has the maximum posterior probability given the document:
ĉ = argmax over c ∈ C of P(c | d) = argmax over c ∈ C of P(d | c) P(c)
Why is Naive Bayes considered a generative model?
Because we can read its equation as stating an implicit assumption about how a document is generated: first a class is sampled from P(c), and then words are generated by sampling from P(d | c).
2 Simplifying Assumptions of Naive Bayes classifiers
- Bag of words assumption - position of a word doesn’t matter.
- Naive Bayes assumption: the feature probabilities are conditionally independent given the class (see next card).
Naive Bayes Assumption
ĉ = argmax over c ∈ C of P(d | c) P(c), where a document d is a set of features f₁, f₂, ..., fₙ. So the likelihood P(d | c) = P(f₁, f₂, ..., fₙ | c).
Estimating the probability of every possible combination of features (e.g. every possible set of words and positions) would require a huge number of parameters.
The naive Bayes assumption is a conditional independence assumption: the probabilities P(fᵢ | c) are independent given the class c, and thus can be ‘naively’ multiplied:
P(f₁, f₂, ..., fₙ | c) = P(f₁ | c) · P(f₂ | c) · ... · P(fₙ | c)
Naive Bayes Applied
positions ← all word positions in the test document
c_NB = argmax over c ∈ C of [ P(c) · ∏ᵢ P(wᵢ | c) ], where i ∈ positions
To avoid underflow and increase speed, calculations are done in log space:
c_NB = argmax over c ∈ C of [ log P(c) + Σᵢ log P(wᵢ | c) ], where i ∈ positions
By considering features in log space, the predicted class is a linear function of the input features, which makes naive Bayes a linear classifier.
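As an illustration, here is a minimal Python sketch of this log-space decision rule, assuming log_prior[c] and log_likelihood[c][w] are dictionaries of estimated log probabilities (hypothetical names; a matching training sketch appears under Laplace smoothing below):

```python
import math

def classify(test_words, classes, log_prior, log_likelihood):
    """Pick the class maximizing log P(c) + sum_i log P(w_i | c)."""
    best_class, best_score = None, -math.inf
    for c in classes:
        score = log_prior[c]
        for w in test_words:
            if w in log_likelihood[c]:   # unknown words are simply skipped
                score += log_likelihood[c][w]
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```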
Naive Bayes:
How to learn the prior probability P(c)?
For the class prior P(c), we ask what percentage of the documents in our training set are in each class c. Let N_c be the number of documents in our training data with class c and N_doc be the total number of documents. Then:
P̂(c) = N_c / N_doc
Naive Bayes:
How to learn the probability P(fᵢ | c)?
To learn the probability P(fᵢ | c), we’ll assume a feature is just the existence of a word in the document’s bag of words, so we want P(wᵢ | c): the fraction of times the word wᵢ appears among all words in all documents of topic c.
We first concatenate all documents with category c into one big “category c” text, then use the frequency of wᵢ in this concatenated document to give a maximum likelihood estimate of the probability:
P̂(wᵢ | c) = count(wᵢ, c) / Σ_w count(w, c), where the sum is over all words w in the vocabulary.
Laplace Smoothing in Naive Bayes
To prevent zero probabilities, 1 is added to the counts:
P̂(wᵢ | c) = [ count(wᵢ, c) + 1 ] / [ Σ_w count(w, c) + |V| ], where |V| is the size of the vocabulary.
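A minimal Python training sketch combining the last three cards (priors, word likelihoods, and add-1 smoothing); the function name train_naive_bayes and the (docs, labels) input format are illustrative assumptions:

```python
import math
from collections import Counter

def train_naive_bayes(docs, labels):
    """Estimate log P(c) and add-1-smoothed log P(w | c) from
    (document, label) pairs, where each document is a list of tokens."""
    classes = set(labels)
    n_doc = len(docs)
    vocab = {w for doc in docs for w in doc}

    log_prior, log_likelihood = {}, {}
    for c in classes:
        class_docs = [doc for doc, lab in zip(docs, labels) if lab == c]
        log_prior[c] = math.log(len(class_docs) / n_doc)  # log(N_c / N_doc)

        # Concatenate all documents of class c and count word frequencies.
        counts = Counter(w for doc in class_docs for w in doc)
        denom = sum(counts.values()) + len(vocab)         # Σ count(w, c) + |V|
        log_likelihood[c] = {
            w: math.log((counts[w] + 1) / denom) for w in vocab
        }
    return log_prior, log_likelihood
```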
Naive Bayes
What to do with unknown words?
Ignore them: remove them from the test document and do not include any probability for them at all.
Naive Bayes
Stop words
Very frequent words like ‘the’ and ‘a’ are often ignored.
Removed from both the training and test documents.
Binary Multinomial Naive Bayes
Used for sentiment analysis.
For sentiment classification, whether a word occurs or not seems to matter more than its frequency.
So word counts are clipped at 1 for each document.
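The clipping step amounts to deduplicating tokens within each document before counting. A one-line sketch, reusing the hypothetical train_naive_bayes from the training sketch above:

```python
# Each word now contributes at most 1 count per document.
binarized_docs = [list(set(doc)) for doc in docs]
log_prior, log_likelihood = train_naive_bayes(binarized_docs, labels)
```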
How to deal with negation for Naive Bayes.
During text normalisation, prepend the prefix NOT_ to every word after a token of logical negation (n’t, not, no, never) until the next punctuation mark, e.g. “didn’t like this movie” becomes “didn’t NOT_like NOT_this NOT_movie”.
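A rough Python sketch of this normalisation step; the exact negation-word list and punctuation set are simplifying assumptions:

```python
import re

NEGATION = re.compile(r"^(?:not|no|never|.*n't)$", re.IGNORECASE)
PUNCT = re.compile(r"^[.,;:!?]$")

def mark_negation(tokens):
    """Prepend NOT_ to every token after a negation word,
    up to the next punctuation mark."""
    out, negating = [], False
    for tok in tokens:
        if PUNCT.match(tok):
            negating = False
            out.append(tok)
        elif negating:
            out.append("NOT_" + tok)
        else:
            out.append(tok)
            if NEGATION.match(tok):
                negating = True
    return out

# mark_negation(["didn't", "like", "this", "movie", ",", "but"])
# -> ["didn't", "NOT_like", "NOT_this", "NOT_movie", ",", "but"]
```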
Sentiment lexicons
Lists of words that are pre-annotated with positive or negative sentiment.
Language id
The task of determining what language a given piece of text is written in.
Gold labels
Human-defined labels for each document.
Confusion matrix
A table for visualising how an algorithm performs with respect to the human gold labels, using two dimensions (system output and gold labels), with each cell labelling a set of possible outcomes (e.g. true positives, false negatives).
Accuracy
The percentage of all observations our system labeled correctly.
Accuracy = (TP + TN) / (TP + FP + TN + FN)
Precision
Measures the percentage of the items that the system detected (i.e. labeled as positive) that are in fact positive.
Precision = TP / (TP + FP)
Recall
Measures the percentage of items actually present in the input that were correctly identified by the system.
Recall = TP / (TP + FN)
F-measure
A weighted harmonic mean of the precision and recall.
Fᵦ = (β² + 1) · P · R / (β² · P + R), where P is precision and R is recall.
β > 1 favors recall, while β < 1 favors precision.
F1 Score
An F-measure where β=1 and the weights of precision and recall are equally balanced.
F₁ = 2PR / (P + R)
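A small Python sketch computing precision, recall and Fᵦ from raw confusion-matrix counts (the function name and example counts are made up for illustration):

```python
def prf(tp, fp, fn, beta=1.0):
    """Precision, recall and F_beta from true/false positive and
    false negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_beta = (beta**2 + 1) * precision * recall / (beta**2 * precision + recall)
    return precision, recall, f_beta

# e.g. 30 true positives, 10 false positives, 60 false negatives:
# prf(30, 10, 60) -> (0.75, 0.333..., 0.4615...)
```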
Harmonic mean
The harmonic mean of a set of numbers is the reciprocal of the arithmetic mean of reciprocals.
HM(a₁, a₂, …, aₙ) = n / ( 1/a₁ + 1/a₂ + … + 1/aₙ )
Why is a Harmonic mean considered a conservative metric?
The harmonic mean of two values is closer to the minimum of the two values than the arithmetic mean is.
I.e. it weighs the lower of the two numbers more heavily.
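A quick numeric illustration in Python of how the harmonic mean tracks the smaller value:

```python
def harmonic_mean(*values):
    """n divided by the sum of reciprocals."""
    return len(values) / sum(1 / v for v in values)

# The harmonic mean stays close to the smaller input:
# harmonic_mean(0.9, 0.1) -> 0.18, while the arithmetic mean is 0.5.
```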
Classification evaluation
Macroaveraging
We compute the performance for each class and then average over classes.
More appropriate when performance on all classes is equally important.
Classification evaluation
Microaveraging
We collect the decisions for each class into a single confusion matrix, and then compute precision and recall from that table.
More appropriate when performance on larger classes is more important.
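A sketch contrasting the two averages for precision, assuming per-class (TP, FP) counts are available; the class names and counts below are invented for illustration:

```python
def macro_micro_precision(per_class_counts):
    """per_class_counts: dict mapping class -> (tp, fp).

    Macroaverage: mean of each class's precision (classes weighted equally).
    Microaverage: precision of the pooled counts (large classes dominate).
    """
    precisions = [tp / (tp + fp) for tp, fp in per_class_counts.values()]
    macro = sum(precisions) / len(precisions)

    total_tp = sum(tp for tp, _ in per_class_counts.values())
    total_fp = sum(fp for _, fp in per_class_counts.values())
    micro = total_tp / (total_tp + total_fp)
    return macro, micro

# e.g. {"urgent": (8, 11), "normal": (60, 55), "spam": (200, 33)}
# macro ≈ (0.42 + 0.52 + 0.86) / 3 ≈ 0.60, micro = 268 / 367 ≈ 0.73
```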
Cross-validation
In cross-validation, we choose a number k and partition our data into k disjoint subsets called folds. We choose one of those k folds as a test set, train our classifier on the remaining k−1 folds, and then compute the error rate on the test set. Then we repeat with another fold as the test set, again training on the other k−1 folds. We do this sampling process k times and average the test set error rate from these k runs to get an average error rate.
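A minimal Python sketch of k-fold cross-validation; train_fn and error_fn are hypothetical callbacks standing in for the actual training and evaluation code:

```python
import random

def cross_validation_error(docs, labels, k, train_fn, error_fn):
    """Average test-set error rate over k folds.

    train_fn(train_docs, train_labels) returns a fitted model;
    error_fn(model, test_docs, test_labels) returns an error rate.
    """
    indices = list(range(len(docs)))
    random.shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k disjoint subsets

    errors = []
    for i in range(k):
        test_idx = set(folds[i])
        train_idx = [j for j in indices if j not in test_idx]
        model = train_fn([docs[j] for j in train_idx],
                         [labels[j] for j in train_idx])
        errors.append(error_fn(model,
                               [docs[j] for j in folds[i]],
                               [labels[j] for j in folds[i]]))
    return sum(errors) / k
```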