Supervised Classification (Chapter 7) Flashcards

1
Q

Give examples of Text Classification (three or more).

Think about how you would classify a piece of text or a book

A
  • Assigning subject categories, topics or genres
  • Spam detection
  • Authorship identification
  • Age/gender identification
  • Language identification
  • Sentiment analysis
2
Q

What is Rule-Based Classification?

A

Rules based on combinations of words or other features (see the sketch below):
• Spam: whether the message mentions my name, whether it mentions money, phrases like ‘you are selected’, ‘this is not a spam’
• POS tagging: prefixes (inconvenient, irregular), suffixes (friendly, quickly), upper-case letters, patterns like 35-year
• Sentiment analysis: ‘rock’, ‘appreciation’, ‘marvel’, ‘masterpiece’
• Gender identification from names: number of vowels, the ending letter
• Accuracy can be high if the features are well designed and selected by experts
• However, building and maintaining these rules is expensive!
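A minimal sketch of what such hand-written rules look like in Python; the phrase list, keywords, and the name 'alice' are illustrative assumptions, not rules from the slides:

SPAM_PHRASES = ["you are selected", "this is not a spam"]  # illustrative phrases

def is_spam(text, my_name="alice"):  # 'alice' is a made-up placeholder
    t = text.lower()
    mentions_money = any(w in t for w in ["$", "money", "cash"])  # assumed keywords
    mentions_my_name = my_name in t
    has_spam_phrase = any(p in t for p in SPAM_PHRASES)
    # Expert-designed rule: money talk or a known spam phrase is suspicious,
    # unless the message addresses me by name (assumed to signal a real contact).
    return (mentions_money or has_spam_phrase) and not mentions_my_name

print(is_spam("You are selected to receive money!"))     # True
print(is_spam("Hi alice, can you send me the report?"))  # False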

3
Q

What is Text Classification?

A
Input
• A document d
• A fixed set of classes C = {C1, C2, …, Cm}
Output
• The predicted class c ∈ C for d
4
Q

What is Supervised Classification?

A

Input
• A document d
• A fixed set of classes C = {C1, C2, …, Cm}
• A training set of N hand-labeled documents {(d1, c1), …, (dN, cN)}
Output
• A learned classifier that predicts a class c ∈ C for any document (see the sketch below)
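A minimal scikit-learn sketch of this input/output contract; the four-document toy training set is an assumption for illustration:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# A tiny hand-labeled training set {(d1, c1), ..., (dN, cN)}.
train_docs = ["win money now", "meeting at noon", "cheap money offer", "lunch tomorrow?"]
train_labels = ["spam", "ham", "spam", "ham"]

# Represent each document as a bag-of-words feature vector.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_docs)

# Learn a classifier from the hand-labeled examples.
clf = LogisticRegression()
clf.fit(X_train, train_labels)

# Output: a classifier that predicts a class for any new document.
print(clf.predict(vectorizer.transform(["free money!!!"])))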

5
Q

When is a classifier supervised?

A

• A classifier is called supervised if it is built based on training corpora containing the correct label for each input

6
Q

What is a dev-test set for?

A

Analyze errors, select features, and optimize hyper-parameters (see the split sketch below)
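A sketch of carving the labeled data into train / dev-test / test splits with scikit-learn; the 80/10/10 ratio and toy data are assumptions:

from sklearn.model_selection import train_test_split

docs = ["doc%d" % i for i in range(10)]  # stand-ins for real documents
labels = [0, 1] * 5                      # stand-in labels

# Hold out 20% of the data, then halve it into dev-test and test (80/10/10).
train_docs, rest_docs, train_y, rest_y = train_test_split(
    docs, labels, test_size=0.2, random_state=0)
dev_docs, test_docs, dev_y, test_y = train_test_split(
    rest_docs, rest_y, test_size=0.5, random_state=0)

# Analyze errors, select features, and tune hyper-parameters on the dev-test split;
# touch the test split only once, at the very end.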

7
Q

What is a test set for?

A

Test on held-out data (the model should not be optimized to this data!)

8
Q

What is the training set for?

A

Train the model

9
Q

What is a multinomial logistic regression?

A

It’s a classification algorithm

• Predicts the probability of the input falling into each category, with the probabilities normalized across categories (sketched below)
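A numpy sketch of the softmax step that produces those per-class probabilities; the class scores are made-up values, not learned ones:

import numpy as np

z = np.array([2.0, 0.5, -1.0])  # per-class scores (illustrative, e.g. z = Wx + b)

# Softmax maps the scores to a probability distribution over the classes.
probs = np.exp(z - z.max()) / np.exp(z - z.max()).sum()
print(probs)           # about [0.79, 0.18, 0.04] -- one probability per class
print(probs.sum())     # 1.0
print(probs.argmax())  # 0, the index of the most likely class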

10
Q

How do you calculate the similarity between the prediction and the truth labels in multinomial logistic regression?

A
Loss function
• e.g. cross-entropy loss:
CE(P1, P2) = −∑i P1(xi) log P2(xi)
• Cross-entropy loss is not symmetric: in general CE(P1, P2) ≠ CE(P2, P1) (see the numeric sketch below)

See slide 7 pg. 31
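A quick numeric sketch of that asymmetry; the two distributions are made up:

import numpy as np

def cross_entropy(p1, p2):
    # CE(P1, P2) = -sum_i P1(xi) * log P2(xi)
    return -np.sum(p1 * np.log(p2))

p = np.array([0.9, 0.1])    # e.g. the target distribution
q = np.array([0.6, 0.4])    # e.g. the predicted distribution
print(cross_entropy(p, q))  # ~0.55
print(cross_entropy(q, p))  # ~0.98 -- swapping the arguments changes the value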

11
Q

What does using more features in ML algorithms do?

A
  • More features usually allow for more flexibility, hence better performance at training time
  • But using more features also usually increases the chance of overfitting
12
Q

What are the issues with vocabulary size in TF and TF-IDF?

A
Why a large vocabulary harms:
• Over-parameterization, overfitting
• Increases both computational and representational expenses
• Introduces many ‘noisy features’, which may harm performance (especially when raw TF/IDF values are used)
13
Q

What are some methods to reduce the vocabulary size in TF(-IDF)?

A

• Remove extremely common words, e.g. stop words and punctuation
• Remove extremely uncommon words, i.e. words that only appear in very few documents
Among the rest, you may (see the sketch below):
• Select the top-TF words, because they are more representative
• Select the top-IDF words, because they are more informative
• Select the top TF-IDF words, to strike a balance
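These options map directly onto scikit-learn's TfidfVectorizer; the specific thresholds below are illustrative assumptions:

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    stop_words="english",  # drop extremely common words
    min_df=2,              # drop words appearing in fewer than 2 documents
    max_df=0.9,            # drop words appearing in over 90% of documents
    max_features=10000,    # keep only the top words by corpus frequency
)
# X_train = vectorizer.fit_transform(train_docs)  # fit the vocabulary on training data only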

14
Q

Why do you decide your vocabulary at training time, and keep it fixed at test time?

A

Because your model does not understand what each feature means; it relies on the position of each feature to learn the importance/weight of each feature
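A brief sketch of this train-time/test-time discipline with a scikit-learn vectorizer; the toy documents are assumptions:

from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["win money now", "meeting at noon"]  # illustrative
test_docs = ["free money prize"]                   # 'free'/'prize' unseen in training

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_docs)  # vocabulary (feature positions) fixed here
X_test = vectorizer.transform(test_docs)        # same columns; unseen words are ignored
print(vectorizer.get_feature_names_out())       # ['at' 'meeting' 'money' 'noon' 'now' 'win']
print(X_test.toarray())                         # only the 'money' column is non-zero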

15
Q

How do you calculate accuracy in a model performance evaluation?

A

Accuracy
• Of all predictions, how many are correct
• Acc=(TP+TN)/(TP+FP+FN+TN)

16
Q

When the label distribution is highly unbalanced, you can be easily fooled by the accuracy. How do you avoid being fooled?

A

• Report the accuracy of the ‘simple majority’ baseline (see the sketch below)
• Check the label distribution of the training data
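A sketch of the simple-majority baseline using scikit-learn's DummyClassifier; the 80/20 toy labels are an assumption:

import numpy as np
from sklearn.dummy import DummyClassifier

X = np.zeros((10, 1))  # features are irrelevant to this baseline
y = [1] * 8 + [0] * 2  # highly unbalanced: 80% majority class

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X, y)
print(baseline.score(X, y))  # 0.8 -- a real model has to beat this accuracy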

17
Q

How do you calculate precision in a model performance evaluation?

A

Precision:
• % of labelled items that are correct
• TP/(TP+FP)
• If TP+FP = 0, precision is undefined (N/A), not 0

18
Q

How do you calculate recall in a model performance evaluation?

A

Recall:
• % of correct items that have been labelled
• TP/(TP+FN)

19
Q

What are the qualities of an aggressive classifier?

A
• Tends to label more items
• High recall, low precision
• Use when you don’t want to miss any spam; suitable for first-round filtering (shortlisting)
20
Q

What are the qualities of a conservative classifier?

A
• Tends to label fewer items; only labels the very certain ones
• High precision, low recall
• Use when you don’t want any false alarms; suitable for second-round selection
21
Q

What is F1 Score/F Measure and how is it calculated?

A

It is the weighted harmonic mean of precision and recall (see the worked example below).

With weight α = 1/2 (equally weighting P and R): F1 = 2PR/(P+R)
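A worked computation from made-up confusion counts:

tp, fp, fn = 8, 2, 4        # illustrative counts

precision = tp / (tp + fp)  # 0.8
recall = tp / (tp + fn)     # ~0.667
f1 = 2 * precision * recall / (precision + recall)
print(f1)                   # ~0.727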

22
Q

What are ways to deal with imbalanced data?

A
• Class weights: assign higher weights to the minority class, i.e., use a higher loss when a minority item is misclassified as majority (see the sketch below)
• Down-sampling: sample the majority class to make its frequency closer to the rarest class, and use the sampled subsets of data to train a model
• Pros: easy to implement; allows many different sampling methods
• Cons: smaller training data size; sometimes poor performance on real data (with real class distributions)
• Up-sampling: resample the minority class to increase the corresponding frequencies
• In NLP, this means you need to create some new text of the minority class; this is also known as data augmentation
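A minimal sketch of the class-weights option in scikit-learn; 'balanced' reweights each class inversely to its training frequency:

from sklearn.linear_model import LogisticRegression

# Misclassifying a minority-class item now incurs a proportionally higher loss.
clf = LogisticRegression(class_weight="balanced")
# Or set weights explicitly, e.g. class_weight={0: 1.0, 1: 5.0} (illustrative values).
# clf.fit(X_train, y_train)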
23
Q

How do you select the most likely class for a given document?

A

Given an input document d, a classifier assigns a probability to each class, P(c|d), and selects the most likely one. Since Bayes’ rule gives P(c|d) ∝ P(d|c)P(c), this is:
c = arg max over ci of P(d|ci) P(ci)
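A tiny numeric sketch of this argmax decision; the priors and likelihoods are made-up values:

import numpy as np

classes = ["spam", "ham"]
prior = np.array([0.4, 0.6])          # P(ci), illustrative
likelihood = np.array([1e-6, 1e-8])   # P(d | ci) for this document, illustrative

scores = likelihood * prior           # proportional to P(ci | d)
print(classes[int(scores.argmax())])  # 'spam'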

24
Q

Go over Chapter 7, slides 19 to 24

A

Go over Chapter 7, slides 19 to 24

25
Q

Watch the logistic regression video (Chapter 4, video 3)

A

Watch the logistic regression video (Chapter 4, video 3)

26
Q

What is F1 score?

A

The F-score, also called the F1-score, is a measure of a model’s accuracy on a dataset.

The F-score is a way of combining the precision and recall of the model, and it is defined as the harmonic mean of the model’s precision and recall.

27
Q

What tool in Python is used for F1, accuracy, precision, and recall?

A

sklearn (scikit-learn), via its sklearn.metrics module (see the sketch below)
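For example, a minimal sketch with made-up label vectors:

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1]  # illustrative gold labels
y_pred = [1, 0, 0, 1, 1, 1]  # illustrative predictions

print(accuracy_score(y_true, y_pred))   # ~0.67
print(precision_score(y_true, y_pred))  # 0.75
print(recall_score(y_true, y_pred))     # 0.75
print(f1_score(y_true, y_pred))         # 0.75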