Lecture 3: Text Classification & Naive Bayes Flashcards

1
Q

Text Classification takes two things as input. What are they?

A

A document d

A fixed set of classes C = {c_1, c_2, c_3, …, c_J}

2
Q

Text Classification produces one output. What is that?

A

A predicted class c in C

3
Q

Name some classification methods

A
1. Hand-coded rules
2. Supervised machine learning

4
Q

In this lecture, we primarily work with one supervised machine learning algorithm. What is that?

A

Naive Bayes

5
Q

What is Naive Bayes based on? And into what representation does it transform a document?

A

A simple ("naive") classification method based on Bayes' rule

Relies on a very simple representation of the document -> Bag of Words

6
Q

Explain the concept of a “Bag of Words”

A

Think of it as a dictionary of words, where each key is a word and each value is the number of occurrences of that word in the given text document.

Example:

(1) John likes to watch movies. Mary likes movies too.
(2) Mary also likes to watch football games.

BoW1 = {"John": 1, "likes": 2, "to": 1, "watch": 1, "movies": 2, "Mary": 1, "too": 1}

BoW2 = {"Mary": 1, "also": 1, "likes": 1, "to": 1, "watch": 1, "football": 1, "games": 1}
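
A minimal sketch of building such a dictionary, assuming simple whitespace tokenization after stripping punctuation (the lecture does not prescribe a particular tokenizer):

```python
from collections import Counter
import string

def bag_of_words(text):
    # Strip punctuation, split on whitespace, count occurrences of each word.
    tokens = text.translate(str.maketrans("", "", string.punctuation)).split()
    return Counter(tokens)

print(bag_of_words("John likes to watch movies. Mary likes movies too."))
# -> Counter({'likes': 2, 'movies': 2, 'John': 1, 'to': 1, 'watch': 1, 'Mary': 1, 'too': 1})
```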

7
Q

What is the formula of the Naive Bayes algorithm?

A

P(c|d) = P(d|c)P(c)/P(d)

You then pick the class with the largest probability given the document: c_MAP = argmax_c P(d|c)P(c). (P(d) is the same for every class, so it can be dropped from the comparison.)
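
A minimal sketch of that decision rule with made-up, already-estimated probabilities (the priors and likelihoods below are hypothetical, not from the lecture):

```python
import math

priors = {"pos": 0.6, "neg": 0.4}          # P(c)
likelihoods = {                            # P(word | c)
    "pos": {"great": 0.05, "boring": 0.01},
    "neg": {"great": 0.01, "boring": 0.06},
}

def predict(words):
    # P(d) is the same for every class, so comparing P(c) * prod P(w|c) is enough.
    scores = {c: priors[c] * math.prod(likelihoods[c][w] for w in words)
              for c in priors}
    return max(scores, key=scores.get)

print(predict(["great", "great", "boring"]))  # -> 'pos'
```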

8
Q

What are some assumptions of Multinomial Naive Bayes?

A
Bag of words assumption: assumes the position of a word does not matter
Conditional independence: assumes the feature probabilities P(x_i|c_j) are independent given the class c
Together, these give P(x_1, …, x_n | c) = P(x_1|c) · P(x_2|c) · … · P(x_n|c)
9
Q

What is Laplace Smoothing?

A

Laplace smoothing, or "add-1 smoothing", is a technique that tackles the zero-probability problem in the Naive Bayes algorithm: every count is incremented by 1, so a word that never occurs with a class still gets a small non-zero probability.
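
A minimal sketch of the add-1 estimate, with made-up counts and vocabulary size:

```python
# Add-1 (Laplace) smoothed estimate of P(w | c):
#   P(w | c) = (count(w, c) + 1) / (total words in c + |V|)

word_counts = {"great": 3, "movie": 2}   # hypothetical counts for one class
total = sum(word_counts.values())
vocab_size = 6                           # hypothetical vocabulary size |V|

def p_word_given_class(word):
    return (word_counts.get(word, 0) + 1) / (total + vocab_size)

print(p_word_given_class("great"))   # (3 + 1) / (5 + 6) ≈ 0.36
print(p_word_given_class("boring"))  # (0 + 1) / (5 + 6) ≈ 0.09, no longer zero
```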

10
Q

What are the benefits of the Naive Bayes Classifier?

A
  • Very fast, low storage requirements
  • Robust to irrelevant features
    • Irrelevant features cancel each other without affecting results
  • Very good in domains with many equally important features
    • Decision trees suffer from fragmentation in such cases, especially with little data
  • Optimal if the independence assumptions hold
  • A good dependable baseline for text classification
    • But we will see other classifiers that give better accuracy
11
Q

How can you measure the performance of text classifiers?

A
  • Recall
  • Precision
  • Accuracy
  • F-Score
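
A minimal sketch of these measures for a single class, computed from made-up true/false positive and negative counts:

```python
tp, fp, fn, tn = 40, 10, 20, 30   # hypothetical confusion-matrix counts

precision = tp / (tp + fp)                   # of the predicted positives, how many were right
recall    = tp / (tp + fn)                   # of the true positives, how many were found
accuracy  = (tp + tn) / (tp + fp + fn + tn)  # fraction of all decisions that were correct
f_score   = 2 * precision * recall / (precision + recall)  # F1: harmonic mean of P and R

print(precision, recall, accuracy, f_score)  # 0.8  0.666...  0.7  0.727...
```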
12
Q

If we have more than one class, how can we combine multiple performance measures into one quantity?

A

Macroaveraging: Compute performance for each class, then average

Microaveraging: Collect decisions for all classes, compute contingency table, evaluate
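
A minimal two-class sketch of the difference, with made-up counts: macroaveraging treats every class equally, while microaveraging pools the counts so frequent classes dominate.

```python
# Hypothetical per-class (true positives, false positives).
classes = {"sports": (90, 10), "chess": (1, 9)}

# Macroaverage: precision per class, then average over classes.
macro = sum(tp / (tp + fp) for tp, fp in classes.values()) / len(classes)

# Microaverage: pool all decisions into one contingency table, then compute precision once.
tp_all = sum(tp for tp, _ in classes.values())
fp_all = sum(fp for _, fp in classes.values())
micro = tp_all / (tp_all + fp_all)

print(macro)  # (0.9 + 0.1) / 2 = 0.5
print(micro)  # 91 / 110 ≈ 0.83, dominated by the large "sports" class
```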

13
Q

What is cross-validation?

A

Split the data into k folds. Train on k-1 folds and test on the remaining fold, repeating so that every fold is used as the test set exactly once, then average the results. This gives a more reliable performance estimate and avoids overfitting to a single train/test split.
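
A minimal sketch of k-fold cross-validation; `train` and `evaluate` are hypothetical placeholders for whatever classifier and metric you use:

```python
def cross_validate(examples, k, train, evaluate):
    # Deal the examples into k folds; each fold is the test set exactly once.
    folds = [examples[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        test = folds[i]
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        model = train(training)
        scores.append(evaluate(model, test))
    # Average performance over the k splits.
    return sum(scores) / k
```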

14
Q

If you have very little data, you should use… (Which classifier?)

A

Naive Bayes

15
Q

If you have a reasonable amount of data you should use… (Which classifier?)

A
This is where the more expressive ("clever") classifiers shine:
- SVM
- Logistic Regression
- Decision Trees
etc.
16
Q

If you have a huge amount of data you should use… (Which classifier?)

A

Logistic Regression can work
Naive Bayes can work (because it is fast)

At a cost:
SVMs (slow to train) or kNN (slow at test time) can be too slow

17
Q

Classifier may not matter if…

A

You have enough data (almost all of the classifiers end up close in accuracy as the amount of data increases)

18
Q

What is meant by “Underflow”?

A

Underflow is a phenomenon that can occur when multiplying many probabilities.

Essentially, by multiplying many probabilities together you may get a number so small that it can no longer be represented as a floating-point number, so it is rounded down to zero.

19
Q

How can you deal with Underflow?

A

Sum the logs of the probabilities rather than multiplying the probabilities themselves, since log(ab) = log a + log b and the log of a tiny product is just a moderately large negative number.
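
A minimal sketch of why this works: the product of many small probabilities underflows to 0.0, while the sum of their logs stays in a comfortable range (the probabilities are made up).

```python
import math

probs = [1e-5] * 100   # 100 hypothetical word probabilities

product = math.prod(probs)                    # 1e-500 cannot be stored in a float: underflows to 0.0
log_score = sum(math.log(p) for p in probs)   # 100 * log(1e-5) ≈ -1151.3, easily representable

print(product)    # 0.0
print(log_score)  # -1151.29...
```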