Supervised Classification (Lecture 7) Flashcards
Give examples of Text Classification. Three or more.
Think about how you would classify a piece of text or a book
- Assigning subject categories, topics or genres
- Spam detection
- Authorship identification
- Age/gender identification
- Language identification
- Sentiment analysis
What is Rule-Based Classification?
Rules based on combinations of words or other features (a toy Python sketch follows after this list)
• Spam: whether it mentions my name, whether it mentions money, phrases like ‘you are selected’, ‘this is not a spam’
• POS tagging: prefixes (inconvenient, irregular), suffixes (friendly, quickly), upper-case letters, tokens like 35-year
• Sentiment analysis: ‘rock’, ‘appreciation’, ‘marvel’, ‘masterpiece’
• Gender identification from names: number of vowels, the ending letter
• Accuracy can be high if the features are well designed and selected by experts
• However, building and maintaining these rules is expensive!
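As a rough illustration, here is a minimal Python sketch of a rule-based spam check; the phrase list, the name check, and the threshold are made-up assumptions, not rules from the course.

```python
# Toy rule-based spam classifier: hand-crafted features combined into a score.
# The phrases, the name check, and the threshold are illustrative assumptions.
SPAM_PHRASES = ["you are selected", "this is not a spam"]

def is_spam(text: str, my_name: str = "alice") -> bool:
    text = text.lower()
    score = 0
    if my_name in text:          # legitimate mail is more likely to mention my name
        score -= 1
    if "money" in text:          # mentioning money raises suspicion
        score += 1
    for phrase in SPAM_PHRASES:  # known spam phrases raise suspicion
        if phrase in text:
            score += 1
    return score > 0

print(is_spam("Congratulations, you are selected to receive money!"))  # True
print(is_spam("Hi Alice, the report you asked for is attached."))      # False
```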
What is Text Classification?
Input
• A document d
• A fixed set of classes C = {C1, C2, … , Cm}
Output
• The predicted class c ∈ C for d
What is Supervised Classification?
Input
• A document d
• A fixed set of classes C = {C1, C2, … , Cm}
• A training set of N hand-labeled documents {(d1, c1), … , (dN, cN)}
Output
• A classifier such that for any document, it predicts its class
When is a classifier supervised?
• A classifier is called supervised if it is built based on training corpora containing the correct label for each input
What is a dev-test set for?
Analyze errors, select features, optimize hyper-parameters
What is a test set for?
Test on held-out data (the model should not be optimized to this data!)
What is the training set for?
Train the model
What is multinomial logistic regression?
It’s a classification algorithm
• Predicts the probability of the input falling into each category (see the sketch below)
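A minimal numpy sketch, with made-up weights and features, of how multinomial logistic regression maps a document’s feature scores to a probability for each class via the softmax function:

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

x = np.array([1.0, 0.0, 2.0])           # feature vector for one document (made up)
W = np.array([[ 0.5, -0.2,  0.1],       # one weight row per class (made up)
              [-0.3,  0.8,  0.0],
              [ 0.1,  0.1,  0.4]])
b = np.array([0.0, 0.1, -0.1])          # one bias per class

probs = softmax(W @ x + b)              # P(class | document), sums to 1
print(probs, probs.argmax())            # predict the argmax class
```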
How do you calculate the similarity between the prediction and the truth labels in multinomial logistic regression?
Loss function, e.g. cross-entropy loss: CE(P1, P2) = −∑i P1(xi) log P2(xi)
Cross-entropy loss is not symmetric: CE(P1, P2) ≠ CE(P2, P1) (a small numeric sketch follows below)
See slide 7 pg. 31
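A small numeric sketch of the cross-entropy loss above, with made-up distributions, also showing that swapping the arguments changes the value:

```python
import numpy as np

def cross_entropy(p1, p2, eps=1e-12):
    # CE(P1, P2) = -sum_i P1(x_i) * log P2(x_i); eps guards against log(0)
    p1, p2 = np.asarray(p1), np.asarray(p2)
    return -np.sum(p1 * np.log(p2 + eps))

truth = [1.0, 0.0, 0.0]   # one-hot truth label
pred  = [0.7, 0.2, 0.1]   # predicted class probabilities

print(cross_entropy(truth, pred))   # the training loss
print(cross_entropy(pred, truth))   # a different value: CE is not symmetric
```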
What is the effect of using more features in ML algorithms?
- More features usually allow for more flexibility, hence better performance at training time
- But using more features also usually increases the chance of overfitting
What are the issues with vocabulary size in TF and TF-IDF?
Why a large vocabulary harms:
• Over-parameterization, overfitting
• Increases both computational and representational expenses
• Introduces many ‘noisy features’, which may harm performance (especially when raw TF/IDF values are used)
What are some methods to reduce the vocabulary size in TF(-IDF)?
• Remove extremely common words, e.g. stop words and punctuations
• Remove extremely uncommon words, i.e. words that only appear in very few documents
Among the rest, you may (see the sketch after this list):
• Select the top TF words, because they are more representative
• Select the top IDF words, because they are more informative
• Select the top TF-IDF words, to strike a balance
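A minimal scikit-learn sketch of these pruning options; the corpus and thresholds are made up, and the library (with its min_df, max_df, max_features parameters) is an assumed choice rather than the one from the course:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat",
        "the dog sat on the log",
        "cats and dogs are pets"]        # toy corpus

vectorizer = TfidfVectorizer(
    stop_words="english",   # drop extremely common (stop) words
    min_df=1,               # drop words appearing in fewer than min_df documents
    max_df=0.9,             # drop words appearing in more than 90% of documents
    max_features=1000,      # keep only the top words by corpus term frequency
)
X = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())   # the pruned vocabulary
```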
Why do you decide your vocabulary at training time, and keep it fixed at test time?
Because your model does not understand what each feature means; it relies on the position of each feature to learn that feature’s importance/weight (see the sketch below)
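A minimal sketch, assuming scikit-learn, of fixing the vocabulary (and hence the feature positions) at training time and reusing it unchanged at test time:

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["good movie", "bad movie", "great plot"]   # made-up training data
test_docs  = ["good plot", "totally unseen words"]       # made-up test data

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_docs)   # vocabulary is decided here
X_test  = vectorizer.transform(test_docs)        # same columns; unseen words are ignored

print(vectorizer.get_feature_names_out())        # fixed feature positions
print(X_test.toarray())
```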
How do you calculate accuracy in a model performance evaluation?
Accuracy
• Of all predictions, how many are correct
• Acc = (TP + TN) / (TP + FP + FN + TN) (worked example below)
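A tiny worked example with made-up confusion-matrix counts:

```python
# Made-up counts: 40 true positives, 10 false positives, 5 false negatives, 45 true negatives
TP, FP, FN, TN = 40, 10, 5, 45

accuracy = (TP + TN) / (TP + FP + FN + TN)
print(accuracy)   # (40 + 45) / 100 = 0.85
```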