Text Classification Flashcards
What are the possible uses of text classification? (3)
Spam Detection
Topic Classification
Sentiment Analysis
Formalise the text classification objective.
Given a document d and a set of classes C, we want to assign d to the most appropriate class c in C.
What is the Naive Bayes assumption?
Features are conditionally independent given the class.
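A worked form of the objective under this assumption may help. The feature symbols f_1, ..., f_n are notation assumed here for the features of d; the middle step uses Bayes' rule and drops P(d), which is the same for every class.

```latex
\hat{c} = \arg\max_{c \in C} P(c \mid d)
        = \arg\max_{c \in C} P(c)\, P(f_1, \dots, f_n \mid c)
        \approx \arg\max_{c \in C} P(c) \prod_{i=1}^{n} P(f_i \mid c)
```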
What are the features of a document for topic classification?
Usually the count of the most frequent words
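A minimal sketch of such features, assuming documents are already tokenised into lists of words. The function names and the cut-off k are illustrative, not part of the cards.

```python
from collections import Counter

def top_k_vocabulary(corpus, k):
    """Return the k most frequent words across all training documents."""
    totals = Counter(tok for doc in corpus for tok in doc)
    return [w for w, _ in totals.most_common(k)]

def count_features(tokens, vocabulary):
    """Represent one document as counts of each vocabulary word."""
    counts = Counter(tokens)
    return [counts[w] for w in vocabulary]
```

For example, a vocabulary built from a sports/politics corpus would turn each new document into a short vector of counts, one entry per frequent word.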
What are the features of a document for sentiment analysis?
Usually the count of words from a sentiment lexicon. Some words are attributed to certain classes. For example “good”, “adorable”, “brave” would be associated with a positive class while “bad”, “ugly”, “cowardly” would be associated with a negative class.
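A minimal sketch of lexicon counting, assuming whitespace tokenisation and a toy lexicon made only of the example words above. A real sentiment lexicon would be much larger.

```python
# Toy lexicon built from the example words; purely illustrative.
POSITIVE = {"good", "adorable", "brave"}
NEGATIVE = {"bad", "ugly", "cowardly"}

def lexicon_features(text):
    """Count how many tokens of the text appear in each word list."""
    counts = {"positive": 0, "negative": 0}
    for tok in text.lower().split():
        word = tok.strip(".,!?;:\"'")  # drop surrounding punctuation
        if word in POSITIVE:
            counts["positive"] += 1
        elif word in NEGATIVE:
            counts["negative"] += 1
    return counts
```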
How is the prior estimated for Naive Bayes?
Normally done using maximum-likelihood estimation. The prior P(c) for a class c is the number of training documents that belong to c divided by the total number of training documents.
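The estimate is a relative frequency, which can be sketched as follows. The function name and the label list format are assumptions for illustration.

```python
from collections import Counter

def estimate_priors(labels):
    """Maximum-likelihood prior: documents in the class / all documents."""
    counts = Counter(labels)
    total = len(labels)
    return {c: n / total for c, n in counts.items()}
```

With one spam document and three ham documents in training, the priors come out as 0.25 and 0.75.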
How are the conditional probabilities of the features in text classification for Naive Bayes estimated?
Normally done using maximum-likelihood estimation with smoothing (for example add-one smoothing), so that a word never seen with a class does not get zero probability.
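A minimal sketch of one possible estimator, assuming word-count features and add-alpha (Laplace) smoothing. The card does not say which smoothing method is meant, so the choice of add-alpha and the names below are assumptions.

```python
from collections import Counter, defaultdict

def estimate_likelihoods(docs, labels, alpha=1.0):
    """Return P(w | c) with add-alpha smoothing over the training vocabulary."""
    word_counts = defaultdict(Counter)
    for tokens, c in zip(docs, labels):
        word_counts[c].update(tokens)
    vocab = {w for counter in word_counts.values() for w in counter}
    likelihoods = {}
    for c, counter in word_counts.items():
        # Denominator: all tokens seen in class c, plus alpha per vocabulary word.
        denom = sum(counter.values()) + alpha * len(vocab)
        likelihoods[c] = {w: (counter[w] + alpha) / denom for w in vocab}
    return likelihoods
```

Without the alpha term, any word unseen in a class would zero out the whole product of probabilities for that class.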
Alternative features for text classification
Use binary features (did this word x occur in the document, yes or no). Use only a subset of the vocabulary. Use more complex features (morphological features, bigrams, syntactic features).
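Two of these alternatives can be sketched briefly, assuming tokenised documents. The function names are illustrative.

```python
def binary_features(tokens, vocabulary):
    """Map each vocabulary word to True if it occurs in the document."""
    present = set(tokens)
    return {w: w in present for w in vocabulary}

def bigrams(tokens):
    """Return adjacent word pairs, which keep some local word order."""
    return list(zip(tokens, tokens[1:]))
```

Passing a reduced vocabulary to binary_features also covers the "subset of the vocabulary" option.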
Advantages of Naive Bayes
1) Fast and easy to train/test 2) Simple model so is easy to implement 3) Doesn’t require as much training data as more complex models 4) Usually works well
Disadvantage of Naive Bayes
The naive independence assumption is unrealistic: words tend to be correlated, so the features are not really independent.
What is intrinsic evaluation?
An evaluation measure inherent to the task
Give an example of an intrinsic evaluation measure for language modelling
Perplexity
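Perplexity can be sketched as the inverse geometric mean of the probabilities a model assigns to the test tokens. The input format below (one probability per token) is an assumption for illustration.

```python
import math

def perplexity(token_probs):
    """Compute 2 ** (-average log2 probability) over the test tokens."""
    n = len(token_probs)
    log_sum = sum(math.log2(p) for p in token_probs)
    return 2 ** (-log_sum / n)
```

A model that gives every token probability 0.25 has perplexity 4: it is as uncertain as a uniform choice among four words. Lower is better.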
Give an example of an intrinsic evaluation measure for POS tagging
Accuracy (% of tags correct)
Give an example of an intrinsic evaluation measure for categorization
F-score
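A minimal sketch of the F-score for one class, assuming parallel lists of gold and predicted labels. The argument names are illustrative, and beta=1 gives the usual F1.

```python
def f_score(gold, predicted, positive, beta=1.0):
    """Weighted harmonic mean of precision and recall for one class."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    if tp == 0:
        return 0.0  # no correct positive predictions
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```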
What is extrinsic evaluation? Give an example for language modelling
Measure the effect on a downstream task. Language modelling: does it improve my ASR/MT task?