Supervised Classification 7 Flashcards
Give examples of Text Classification. Three or more.
Think about how you would classify a piece of text or a book
- Assigning subject categories, topics or genres
- Spam detection
- Authorship identification
- Age/gender identification
- Language identification
- Sentiment analysis
What is Rule-Based Classification?
Rules based on combinations of words or other features
• Spam: whether it mentions my name, whether it mentions money, feature phrases like ‘you are selected’, ‘this is not a spam’
• POS tagging: prefixes (inconvenient, irregular), suffixes (friendly, quickly), upper-case letters, patterns like 35-year
• Sentiment analysis: ‘rock’, ‘appreciation’, ‘marvel’, ‘masterpiece’
• Gender identification from names: number of vowels, the ending letter
• Accuracy can be high if the features are well designed and selected by experts
• However, building and maintaining these rules is expensive!
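The rules themselves can be as simple as keyword checks. A minimal sketch in Python, with made-up spam rules (the word lists and the checking logic are illustrative assumptions, not rules from the slides):

```python
# Minimal rule-based spam check; the rules below are hypothetical examples.
SPAM_PHRASES = ["you are selected", "this is not a spam"]
SPAM_WORDS = {"money", "prize", "winner"}

def is_spam(text, my_name="alice"):
    t = text.lower()
    if my_name in t:                        # mentions my name -> likely legitimate
        return False
    if any(p in t for p in SPAM_PHRASES):   # the feature phrases above
        return True
    return any(w in t.split() for w in SPAM_WORDS)

print(is_spam("You are selected to win money!"))  # True
```

Every rule here is hand-written, which is exactly why such systems are expensive to build and maintain.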
What is Text Classification?
Input
• A document d
• A fixed set of classes C = {C1, C2, … , Cm}
Output
• The predicted class c ∈ C for d
What is Supervised Classification?
Input
• A document d
• A fixed set of classes C = {C1, C2, … , Cm}
• A training set of N hand-labeled documents {(d1 , c1), … , (dN , cN)}
Output
• A classifier such that for any document, it predicts its class
When is a classifier supervised?
• A classifier is called supervised if it is built based on training corpora containing the correct label for each input
What is a dev-test set for?
Analyzing errors, selecting features, and optimizing hyper-parameters
What is a test set for?
Testing on held-out data (the model should not be optimized to this data!)
What is the training set for?
Training the model
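A common way to carve out the three sets, sketched with scikit-learn (the 80/10/10 ratios and placeholder corpus are assumptions for illustration):

```python
from sklearn.model_selection import train_test_split

docs = [f"document {i}" for i in range(100)]   # placeholder corpus

# Hold out 10% as the final test set; never tune on it.
train_dev, test = train_test_split(docs, test_size=0.10, random_state=0)
# Split the rest into ~80% train / ~10% dev (for error analysis,
# feature selection, and hyper-parameter tuning).
train, dev = train_test_split(train_dev, test_size=1/9, random_state=0)

print(len(train), len(dev), len(test))  # 80 10 10
```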
What is multinomial logistic regression?
It’s a classification algorithm
• It predicts the probability of the input falling into each of the candidate categories
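At its core it turns one score per class into a probability distribution with the softmax function; a minimal sketch (the scores are made-up values):

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; the result sums to 1.
    e = np.exp(z - z.max())
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])   # one score per class (illustrative)
probs = softmax(scores)
print(probs, probs.argmax())          # highest score -> highest probability
```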
How do you calculate the similarity between the prediction and the truth labels in multinomial logistic regression?
With a loss function, e.g. cross-entropy loss:
CE(P1, P2) = −∑i P1(xi) log P2(xi)
Cross-entropy loss is not symmetric: CE(P1, P2) ≠ CE(P2, P1)
See slide 7 pg. 31
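A minimal sketch of the formula in NumPy, with made-up distributions, showing that swapping the arguments changes the value:

```python
import numpy as np

def cross_entropy(p1, p2):
    # CE(P1, P2) = -sum_i P1(x_i) * log P2(x_i)
    return -np.sum(p1 * np.log(p2))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.4, 0.4, 0.2])
print(cross_entropy(p, q))  # ~0.986
print(cross_entropy(q, p))  # ~1.247 -- not symmetric
```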
What do more features do in ML algorithms?
- More features usually allow for more flexibility, hence better performance at training time
- But using more features also usually increases the chance of overfitting
What are the issues with vocabulary size in TF and TF-IDF?
Why a large vocabulary harms:
• Over-parameterization, overfitting
• Increases both computational and representational expenses
• Introduces many ‘noisy features’, which may harm performance (especially when raw TF/IDF values are used)
What are some methods to reduce the vocabulary size in TF(-IDF)?
• Remove extremely common words, e.g. stop words and punctuation
• Remove extremely uncommon words, i.e. words that only appear in very few documents
Among the rest, you may:
• Select the top TF words, because they are more representative
• Select the top IDF words, because they are more informative
• Select the top TF-IDF words, to strike a balance
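All of these pruning options exist as parameters of scikit-learn’s vectorizers; a sketch with illustrative thresholds (the exact values are assumptions):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vec = TfidfVectorizer(
    stop_words="english",   # remove extremely common function words
    min_df=2,               # drop words appearing in fewer than 2 documents
    max_df=0.9,             # drop words appearing in over 90% of documents
    max_features=5000,      # among the rest, keep the top terms by corpus TF
)
```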
Why do you decide your vocabulary at training time and keep it fixed at test time?
Because your model does not understand what each feature means; it relies on the position of each feature to learn the importance/weight of each feature
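In scikit-learn terms this is the fit/transform split: the vocabulary is learned once on the training data and then only reused. A small sketch with a toy corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["the film was a masterpiece", "buy now and win money"]
test_docs  = ["a new masterpiece arrives"]

vec = CountVectorizer()
X_train = vec.fit_transform(train_docs)  # vocabulary is decided here...
X_test  = vec.transform(test_docs)       # ...and only reused here: same
                                         # feature positions, unseen words dropped
print(X_train.shape[1] == X_test.shape[1])  # True
```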
How do you calculate accuracy in a model performance evaluation?
Accuracy
• Of all predictions, how many are correct
• Acc=(TP+TN)/(TP+FP+FN+TN)
When the label distribution is highly unbalanced, you can be easily fooled by the accuracy. How do you avoid being fooled?
• Report the accuracy of the ‘simple majority’ baseline
• Check the label distribution of the training data
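The majority baseline is one line with scikit-learn’s DummyClassifier; with the made-up 90/10 labels below it reaches 90% accuracy without learning anything:

```python
from sklearn.dummy import DummyClassifier

X = [[0]] * 100              # dummy features
y = [1] * 90 + [0] * 10      # highly unbalanced labels
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
print(baseline.score(X, y))  # 0.9 -- beat this before trusting your accuracy
```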
How do you calculate precision in a model performance evaluation?
Precision:
• % of labelled items that are correct
• TP/(TP+FP)
If the denominator (TP+FP) is 0, precision is N/A (undefined), not 0
How do you calculate recall in a model performance evaluation?
Recall:
• % of correct items that have been labelled
• TP/(TP+FN)
What are the qualities of an aggressive classifier?
• Tends to label more items
• High recall, low precision
• Useful when you don’t want to miss any spam; suitable for first-round filtering (shortlisting)
What are the qualities of a conservative classifier?
• Tends to label fewer items; only labels the very certain ones
• High precision, low recall
• Useful when you don’t want any false alarms; suitable for second-round selection
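One assumed way to make the same probabilistic classifier aggressive or conservative is to move the decision threshold rather than retrain; a sketch with made-up spam probabilities:

```python
import numpy as np

probs = np.array([0.2, 0.45, 0.6, 0.9])  # P(spam | d) for four documents

aggressive   = probs >= 0.3   # low threshold: more items labelled, high recall
conservative = probs >= 0.8   # high threshold: fewer items, high precision
print(aggressive.sum(), conservative.sum())  # 3 1
```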
What is F1 Score/F Measure and how is it calculated?
It is the weighted harmonic mean of precision and recall.
With weight 1/2 (equally weighting P and R): F1 = 2PR/(P+R)
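A worked example with made-up counts, showing that the harmonic mean sits below the arithmetic mean:

```python
TP, FP, FN = 8, 2, 4                  # illustrative confusion-matrix counts
precision = TP / (TP + FP)            # 0.80
recall    = TP / (TP + FN)            # ~0.667
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))                   # 0.727, vs arithmetic mean 0.733
```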
What are ways to deal with imbalanced data?
• Class weights: assign higher weights to the minority class, i.e., incur a higher loss when a minority item is misclassified as majority
• Down-sampling: sample the majority class to bring its frequency closer to the rarest class, and train the model on the sampled subset
• Pros: easy to implement; allows many different sampling methods
• Cons: smaller training data size; sometimes poor performance on real data (with real class distributions)
• Up-sampling: the minority class is resampled to increase the corresponding frequencies • In NLP, it means you need to create some new text of the minority class. This is also known as data augmentation.
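Sketches of all three remedies with scikit-learn (the label counts and model choice are illustrative assumptions):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample

# Class weights: penalize minority-class mistakes more heavily.
clf = LogisticRegression(class_weight="balanced")

majority = [("some text", 0)] * 90   # placeholder (document, label) pairs
minority = [("some text", 1)] * 10

# Down-sampling: shrink the majority class to the minority class size.
majority_down = resample(majority, n_samples=len(minority),
                         replace=False, random_state=0)

# Up-sampling: repeat minority items (in NLP, often replaced by data
# augmentation, i.e. generating new minority-class text).
minority_up = resample(minority, n_samples=len(majority),
                       replace=True, random_state=0)
```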
To select the most likely class for a given document, what do you do?
Given an input document d, a classifier assigns a probability to each class, P(c|d), and selects the most likely one. Since P(c|d) ∝ P(d|c)P(c) by Bayes’ rule:
c = arg max(ci) P(ci|d) = arg max(ci) P(d|ci)P(ci)
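A tiny numeric sketch of the argmax with hypothetical priors and likelihoods for a two-class spam example:

```python
import numpy as np

classes     = ["ham", "spam"]
priors      = np.array([0.7, 0.3])       # P(c), hypothetical values
likelihoods = np.array([0.001, 0.008])   # P(d | c) for this document

print(classes[np.argmax(likelihoods * priors)])  # spam (0.0024 > 0.0007)
```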
Go over chapter 7 slide 19 to 24
watch logistic regression video chapter 4 vid 3
What is F1 score?
The F-score, also called the F1-score, is a measure of a model’s accuracy on a dataset.
The F-score is a way of combining the precision and recall of the model, and it is defined as the harmonic mean of the model’s precision and recall.
What tool in Python is used for F1, accuracy, precision, and recall?
sklearn (scikit-learn), e.g. its sklearn.metrics module
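A usage sketch with sklearn.metrics on made-up labels (counts by hand: TP=3, FP=1, FN=1, TN=1):

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]

print(accuracy_score(y_true, y_pred))   # 4/6 ~ 0.667
print(precision_score(y_true, y_pred))  # 3/4 = 0.75
print(recall_score(y_true, y_pred))     # 3/4 = 0.75
print(f1_score(y_true, y_pred))         # 0.75
```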