Text Classification Flashcards

1
Q

What are the possible uses of text classification? (3)

A

Spam Detection

Topic Classification

Sentiment Analysis

2
Q

Formalise the text classification objective.

A

Given a document d and a set of classes C, assign the document d to the most appropriate class c ∈ C.
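
As an equation (assuming a probabilistic classifier, with ĉ denoting the chosen class):

ĉ = argmax_{c ∈ C} P(c | d)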

3
Q

What is the Naive Bayes assumption?

A

Features are conditionally independent given the class.
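
Written out for a document d with features f1, …, fn (notation assumed here), the class-conditional likelihood factorises:

P(f1, …, fn | c) ≈ P(f1 | c) · P(f2 | c) · … · P(fn | c)

so each class is scored with P(c) · Π_i P(fi | c).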

4
Q

What are the features of a document for topic classification?

A

Usually the counts of the most frequent words.

5
Q

What are the features of a document for sentiment analysis?

A

Usually the counts of words from a sentiment lexicon. Some words are attributed to certain classes. For example, “good”, “adorable”, “brave” would be associated with a positive class, while “bad”, “ugly”, “cowardly” would be associated with a negative class.
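
A minimal sketch of lexicon-count features in Python (the tiny lexicons here are only illustrative, not an actual sentiment lexicon):

POSITIVE = {"good", "adorable", "brave"}
NEGATIVE = {"bad", "ugly", "cowardly"}
def sentiment_features(tokens):
    # The features are simply how many tokens hit each lexicon.
    pos = sum(1 for t in tokens if t.lower() in POSITIVE)
    neg = sum(1 for t in tokens if t.lower() in NEGATIVE)
    return {"pos_count": pos, "neg_count": neg}
sentiment_features("A brave and adorable but ugly film".split())
# -> {"pos_count": 2, "neg_count": 1}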

6
Q

How is the prior estimated for Naive Bayes?

A

Normally by maximum-likelihood estimation: the prior P(c) for a class c is the number of training documents that belong to c divided by the total number of documents.
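
In symbols (with N_c the number of training documents labelled c and N the total number of training documents):

P(c) = N_c / N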

7
Q

How are the conditional probabilities of the features estimated in text classification with Naive Bayes?

A

Normally by maximum-likelihood estimation with smoothing applied.
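
For example, with add-one (Laplace) smoothing over a vocabulary V (one common choice of smoothing; count(w, c) is the number of times word w occurs in documents of class c):

P(w | c) = (count(w, c) + 1) / (Σ_{w′ ∈ V} count(w′, c) + |V|)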

8
Q

Alternative features for text classification

A

Use binary features (did word x occur in the document: yes or no). Use only a subset of the vocabulary. Use more complex features (morphological features, bigrams, syntactic features).
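
A quick sketch of the binary-feature variant, restricted to a chosen subset of the vocabulary (the words listed are placeholders):

VOCAB_SUBSET = {"goal", "election", "market", "film"}
def binary_features(tokens):
    # 1 if the word occurs anywhere in the document, 0 otherwise (counts are ignored).
    present = set(tokens)
    return {w: int(w in present) for w in VOCAB_SUBSET}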

9
Q

Advantages of Naive Bayes

A

1) Fast and easy to train/test. 2) A simple model, so it is easy to implement. 3) Doesn’t require as much training data. 4) Usually works well.
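
As a rough illustration of points 1 and 2, a multinomial Naive Bayes trainer and classifier fit in a few lines of Python (a sketch assuming bag-of-words token lists and add-one smoothing, not an exact course formulation):

import math
from collections import Counter, defaultdict
def train_nb(docs):
    # docs: list of (tokens, label) pairs
    class_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        word_counts[label].update(tokens)
        vocab.update(tokens)
    # Log-prior: fraction of documents in each class (see card 6).
    priors = {c: math.log(n / len(docs)) for c, n in class_counts.items()}
    # Log-likelihoods with add-one smoothing (see card 7).
    loglik = {}
    for c in class_counts:
        total = sum(word_counts[c].values()) + len(vocab)
        loglik[c] = {w: math.log((word_counts[c][w] + 1) / total) for w in vocab}
    return priors, loglik, vocab
def classify(tokens, priors, loglik, vocab):
    # Score each class by log P(c) + sum of log P(w | c); unknown words are skipped.
    score = lambda c: priors[c] + sum(loglik[c][w] for w in tokens if w in vocab)
    return max(priors, key=score)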

10
Q

Disadvantage of Naive Bayes

A

The naive independence assumption is unrealistic: words tend to be correlated, so the features are not really independent.

11
Q

What is intrinsic evaluation?

A

An evaluation measure inherent to the task

12
Q

Give an example of an intrinsic evaluation measure for language modelling

A

Perplexity

13
Q

Give an example of an intrinsic evaluation measure for POS tagging

A

accuracy (% of tags correct)

14
Q

Give an example of an intrinsic evaluation measure for categorization

A

F-score

15
Q

What is extrinsic evaluation? Give an example for language modelling

A

Measure the effect on a downstream task. For language modelling: does the language model improve my ASR/MT system?

16
Q

How to deal with unbalanced classes?

A

1) Collect more data 2) Augment some of the data you do have 3) Create copies of training samples.
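
A minimal sketch of option 3 (duplicating examples from the smaller classes, i.e. random oversampling); the (features, label) pair format is just an assumption:

import random
from collections import Counter
def oversample(examples, seed=0):
    # Duplicate randomly chosen members of each minority class until
    # every class is as large as the biggest one.
    random.seed(seed)
    sizes = Counter(label for _, label in examples)
    target = max(sizes.values())
    balanced = list(examples)
    for label, n in sizes.items():
        members = [ex for ex in examples if ex[1] == label]
        balanced.extend(random.choice(members) for _ in range(target - n))
    return balanced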

17
Q

What is the precision?

A

Items the system detected that were right/items the system detected

In other words (true positives)/(false positives + true positives)

18
Q

What is the recall?

A

Items the system detected that were right/items the system should have detected

(true positives)/(true positives + false negatives)

19
Q

What is the F-measure (equation)?

A

Fβ = ((β² + 1)PR) / (β²P + R)

20
Q

What is the harmonic mean of the precision and recall? (F1)

A

The F-measure with β set to 1: F1 = 2PR/(P + R)

21
Q

How does the F-measure behave when precision and recall are close together vs. far apart?

A

Because the F-measure is a harmonic mean, it is dominated by the smaller of the two values: when precision and recall are close to each other, the F-score is close to their average, but when they are far apart it is much smaller, being pulled towards the lower value.
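
A quick check with made-up numbers: for P = R = 0.5, F1 = 2(0.5)(0.5)/(0.5 + 0.5) = 0.5, but for P = 0.9 and R = 0.1 (the same arithmetic mean), F1 = 2(0.9)(0.1)/(0.9 + 0.1) = 0.18.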