Text Classification Flashcards
What are the possible uses of text classification? (3)
Spam Detection
Topic Classification
Sentiment Analysis
Formalise the text classification objective.
Given a document d and a set of classes C, we want to assign d to the most appropriate class c in C.
What is the Naive Bayes assumption?
The features are conditionally independent of each other given the class.
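Written out, the assumption and the decision rule it leads to look roughly like this. The symbols f_1 ... f_n (the features of document d) and \hat{c} (the chosen class) are notation assumed here, not taken from the cards:

```latex
% Naive Bayes assumption: the joint likelihood factorises over features
P(f_1, \dots, f_n \mid c) = \prod_{i=1}^{n} P(f_i \mid c)

% Resulting decision rule: pick the class with the highest posterior score
\hat{c} = \operatorname*{arg\,max}_{c \in C} \; P(c) \prod_{i=1}^{n} P(f_i \mid c)
```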
What are the features of a document for topic classification?
Usually the count of the most frequent words
What are the features of a document for sentiment analysis?
Usually the count of words from a sentiment lexicon. Some words are attributed to certain classes. For example, “good”, “adorable”, “brave” would be associated with a positive class while “bad”, “ugly”, “cowardly” would be associated with a negative class.
How is the prior estimated for Naive Bayes?
Normally by maximum-likelihood estimation. The prior P(c) for a class c is the number of training documents that belong to c divided by the total number of training documents.
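A minimal sketch of this estimate, assuming the training labels are available as a plain list. The function name and the toy labels are made up for illustration:

```python
from collections import Counter

def estimate_priors(labels):
    """MLE prior: P(c) = (documents labelled c) / (total documents)."""
    counts = Counter(labels)
    total = len(labels)
    return {c: n / total for c, n in counts.items()}

# Hypothetical toy training labels, one per document
labels = ["spam", "ham", "ham", "spam", "ham"]
priors = estimate_priors(labels)
print(priors)
```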
How are the conditional probabilities of the features in text classification estimated for Naive Bayes?
Normally by maximum-likelihood estimation with smoothing, so that a word never seen with a class does not receive zero probability.
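One common choice is add-alpha (Laplace) smoothing over word counts. The card does not say which smoothing method is meant, so treat this as an assumed illustration; the function name and toy documents are hypothetical:

```python
from collections import Counter

def estimate_likelihoods(docs, labels, alpha=1.0):
    """Add-alpha smoothed estimate of P(word | class) from tokenised documents."""
    vocab = {w for doc in docs for w in doc}
    # Count word occurrences per class
    counts = {}
    for doc, c in zip(docs, labels):
        counts.setdefault(c, Counter()).update(doc)
    likelihoods = {}
    for c, word_counts in counts.items():
        # Denominator: all tokens in the class plus alpha for every vocabulary word
        denom = sum(word_counts.values()) + alpha * len(vocab)
        likelihoods[c] = {w: (word_counts[w] + alpha) / denom for w in vocab}
    return likelihoods

# Hypothetical toy data
docs = [["good", "film"], ["bad", "film"], ["good", "good", "plot"]]
labels = ["pos", "neg", "pos"]
likelihoods = estimate_likelihoods(docs, labels)
print(likelihoods["pos"]["bad"])  # non-zero even though "bad" never occurs in a "pos" document
```

With alpha set to 0 this reduces to the unsmoothed maximum-likelihood estimate.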
Alternative features for text classification
Use binary features (did word x occur in the document, yes or no). Use only a subset of the vocabulary. Use more complex features (morphological features, bigrams, syntactic features).
Advantages of Naive Bayes
1) Fast and easy to train/test 2) Simple model, so easy to implement 3) Doesn’t require as much training data as more complex models 4) Usually works well
Disadvantage of Naive Bayes
The naive independence assumption is unrealistic: words tend to be correlated, so the features are not really independent.
What is intrinsic evaluation?
An evaluation measure inherent to the task
Give an example of an intrinsic evaluation measure for language modelling
Perplexity
Give an example of an intrinsic evaluation measure for POS tagging
Accuracy (% of tags correct)
Give an example of an intrinsic evaluation measure for categorization
F-score
What is extrinsic evaluation? Give an example for language modelling
Measure the effect on a downstream task. For language modelling: does the language model improve my ASR/MT system?
How to deal with unbalanced classes?
1) Collect more data 2) Augment some of the data you do have 3) Create copies of training samples.
What is the precision?
Items the system detected that were right / items the system detected
In other words, (true positives) / (true positives + false positives)
What is the recall?
Items the system detected that were right/items the system should have detected
(true positives)/(true positives + false negatives)
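Both definitions can be computed from gold and predicted labels, as in the sketch below. The function name, the choice to return 0.0 when a denominator is empty, and the toy labels are assumptions made for illustration:

```python
def precision_recall(gold, predicted, positive):
    """Precision and recall for one class, treated as the positive class."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if p == positive and g == positive)
    fp = sum(1 for g, p in pairs if p == positive and g != positive)
    fn = sum(1 for g, p in pairs if p != positive and g == positive)
    # Convention assumed here: report 0.0 when nothing was detected or nothing should have been
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical toy labels
gold = ["spam", "spam", "ham", "ham", "spam"]
predicted = ["spam", "ham", "spam", "ham", "spam"]
p, r = precision_recall(gold, predicted, "spam")
print(p, r)
```

In this toy example there are 2 true positives, 1 false positive and 1 false negative, so precision and recall are both 2/3.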
What is the F-measure? (equation)
Fβ = ((β² + 1)PR)/(β²P + R)
What is the harmonic mean of the precision and recall? (F-1)
The F-measure with β set to 1: F1 = 2PR/(P + R)
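A small sketch of the F-measure as a function of precision, recall and β. The function name and the zero-denominator convention are assumptions made for illustration:

```python
def f_beta(precision, recall, beta=1.0):
    """F-measure: ((beta^2 + 1) * P * R) / (beta^2 * P + R)."""
    b2 = beta ** 2
    denom = b2 * precision + recall
    if denom == 0:
        return 0.0  # assumed convention when both P and R are zero
    return (b2 + 1) * precision * recall / denom

f1 = f_beta(0.5, 1.0)            # beta = 1 gives 2PR / (P + R)
f2 = f_beta(0.5, 1.0, beta=2.0)  # beta > 1 gives more weight to recall
print(f1, f2)
```

Here recall is the higher of the two values, so raising β from 1 to 2 raises the score.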
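How is the F-measure decreased/increased?
The F-measure is a harmonic mean, so it is pulled towards the smaller of the two values. For the same arithmetic mean, precision and recall values that are close to each other give a higher F-measure, and values that are far apart give a much lower one. For example, P = R = 0.5 gives F1 = 0.5, while P = 0.9 and R = 0.1 gives F1 = 0.18.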