C4 Flashcards

1
Q

types of classification tasks

A
  • binary classification
  • multi-class classification
  • multi-label classification
2
Q

pre-processing steps for text mining

A
  • tokenization or character k-grams
  • lower-casing and removal of punctuation
  • decide on vocabulary size and feature selection

and maybe:
- remove stop-words
- lemmatization or stemming
- add phrases as features
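
A minimal sketch of these steps in plain Python (the stop-word list and example documents are made up for illustration; a real pipeline would typically use a library such as NLTK or spaCy):

import re

# toy stop-word list and documents (invented for illustration)
STOP_WORDS = {"the", "a", "is", "of", "and", "on"}
docs = ["The cat sat on the mat.", "A dog chased the cat!"]

def preprocess(text):
    # lower-casing and removal of punctuation
    text = re.sub(r"[^\w\s]", " ", text.lower())
    # tokenization (simple whitespace split)
    tokens = text.split()
    # optional: remove stop-words
    return [t for t in tokens if t not in STOP_WORDS]

tokenized = [preprocess(d) for d in docs]
# vocabulary = the set of remaining terms (feature selection would prune it further)
vocabulary = sorted({t for doc in tokenized for t in doc})
print(tokenized)    # [['cat', 'sat', 'mat'], ['dog', 'chased', 'cat']]
print(vocabulary)   # ['cat', 'chased', 'dog', 'mat', 'sat']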

3
Q

feature selection for document classification

A

goals: dimensionality reduction and reducing overfitting

global term selection: overall term frequency is used as cut-off (remove rare terms)

local term selection: each term is scored by a function that captures its degree of correlation with each class it occurs in

only top n terms are used for classifier training
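
A hedged sketch of both strategies with scikit-learn (the toy corpus and labels are invented; chi-squared is used here as one possible local scoring function):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

# toy corpus and binary labels (invented for illustration)
docs = ["good movie", "great great film", "bad movie", "terrible film"]
labels = [1, 1, 0, 0]

# global term selection: min_df removes terms below an overall frequency cut-off
vectorizer = CountVectorizer(min_df=1)
X = vectorizer.fit_transform(docs)

# local term selection: score each term by its correlation with the classes
# (here chi-squared) and keep only the top-n terms for classifier training
selector = SelectKBest(chi2, k=3)
X_top = selector.fit_transform(X, labels)

kept = vectorizer.get_feature_names_out()[selector.get_support()]
print(kept)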

4
Q

term weighting: three different options

A

compute term weights (see the sketch below):
- binary: occurrence of term (yes / no)
- integer: term count
- real-valued: more advanced weighting (tf-idf)
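
A minimal sketch of the three options with scikit-learn (the two toy documents are made up; TfidfVectorizer uses a smoothed variant of the tf-idf formula from the later cards):

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the cat sat on the mat", "the dog sat"]

# binary: does the term occur in the document? (yes / no)
X_binary = CountVectorizer(binary=True).fit_transform(docs)

# integer: raw term counts
X_counts = CountVectorizer().fit_transform(docs)

# real-valued: tf-idf weights
X_tfidf = TfidfVectorizer().fit_transform(docs)

print(X_binary.toarray())
print(X_counts.toarray())
print(X_tfidf.toarray().round(2))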

5
Q

term frequency tf

A

the term count tc_t,d of term t in document d is defined as the number of times that t occurs in d

we don’t want raw term count, because relevance does not increase proportionally to term frequency
=> use log frequency: tf_t,d = 1 + log_10(tc_t,d) if tc_t,d > 0, and 0 otherwise
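
A tiny numeric sketch of the log frequency (the counts are arbitrary):

import math

def log_tf(term_count):
    # tf_t,d = 1 + log10(tc_t,d) if the term occurs, else 0
    return 1 + math.log10(term_count) if term_count > 0 else 0

for tc in [0, 1, 10, 100, 1000]:
    print(tc, log_tf(tc))   # 0->0, 1->1, 10->2, 100->3, 1000->4: weight grows sub-linearly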

6
Q

inverse document frequency idf

A

the most frequent terms are not very informative
df_t is the number of documents that t occurs in -> an inverse measure of the informativeness of t

idf_t = log_10(N / df_t), where N is the total number of documents in the collection

tf-idf_t,d = tf_t,d * idf_t
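
A small sketch computing idf and tf-idf directly from these definitions (the three toy documents are invented):

import math

# toy collection of tokenized documents (invented for illustration)
docs = [["cat", "sat", "mat"], ["cat", "dog", "sat"], ["cat", "dog", "dog"]]
N = len(docs)

def idf(term):
    df = sum(1 for d in docs if term in d)    # document frequency df_t
    return math.log10(N / df)                 # idf_t = log10(N / df_t)

def tf_idf(term, doc):
    tc = doc.count(term)
    tf = 1 + math.log10(tc) if tc > 0 else 0  # log term frequency from the previous card
    return tf * idf(term)

print(idf("cat"), idf("mat"))   # term in every document -> idf 0; rare term -> higher idf
print(tf_idf("dog", docs[2]))   # (1 + log10(2)) * log10(3/2)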

7
Q

classification methods: three options

A
  1. if we use word features, we need an estimator that is well-suited for high-dimensional sparse data (Naive Bayes, SVM, Random Forest); see the sketch below
  2. if text is represented as dense embedding vectors, we can use neural network architectures to train classifiers
  3. alternatively, transfer learning from pre-trained contextual embeddings with transformers
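
A minimal sketch of option 1 with scikit-learn (training documents and labels are invented; Naive Bayes stands in for any estimator that handles sparse features well):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# toy training data (invented for illustration)
train_docs = ["great movie", "wonderful film", "awful movie", "terrible plot"]
train_labels = ["pos", "pos", "neg", "neg"]

# sparse word features (tf-idf) + an estimator suited to high-dimensional sparse data
clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
clf.fit(train_docs, train_labels)
print(clf.predict(["wonderful movie", "terrible film"]))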
8
Q

Naive Bayes classification

A
  • learning and classification based on probability theory
  • uses prior probability of each category given no information about an item
  • classification produces a posterior probability distribution over the possible categories given a description of an item

P(c|d) = P(d|c)P(c)/P(d)

P(c) = nr of documents for class c / total nr of documents

P(d|c) = P(t_1, t_2, …, t_k | c) = P(t_1|c) * P(t_2|c) * … * P(t_k|c)
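
A tiny numeric sketch of these quantities (the three training documents are invented; the estimates are plain maximum-likelihood values, which leads into the next card):

from collections import Counter

# toy training data: (tokens, class), invented for illustration
train = [(["chinese", "beijing", "chinese"], "c"),
         (["chinese", "chinese", "shanghai"], "c"),
         (["tokyo", "japan", "chinese"], "j")]

labels = [c for _, c in train]
prior = {c: labels.count(c) / len(train) for c in set(labels)}   # P(c)

def cond_prob(term, cls):
    # maximum-likelihood estimate of P(t|c): count of t in class c / all tokens in c
    tokens = [t for doc, c in train if c == cls for t in doc]
    return Counter(tokens)[term] / len(tokens)

def score(doc, cls):
    # P(c|d) is proportional to P(c) * P(t_1|c) * ... * P(t_k|c); P(d) cancels out
    result = prior[cls]
    for t in doc:
        result *= cond_prob(t, cls)
    return result

print(score(["chinese", "tokyo"], "c"), score(["chinese", "tokyo"], "j"))
# the unseen term "tokyo" forces the score for class "c" to zero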

9
Q

problem with Maximum Likelihood Estimate

A

the MLE estimate is zero for a term-class combination that did not occur in the training data => multiplying over all terms gives a posterior probability of zero

solution: add-one (Laplace) smoothing = assume that each term occurs one additional time in each class

P(t|c) = (T_c,t + 1) / ∑_t’∈V (T_c,t’ + 1) = (T_c,t + 1) / ((∑_t’∈V T_c,t’) + |V|)
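
A sketch of the smoothed estimate (the per-class token lists and vocabulary are made up):

from collections import Counter

# token lists per class and the shared vocabulary V (toy data, made up)
class_tokens = {"c": ["chinese", "beijing", "chinese", "chinese", "chinese", "shanghai"],
                "j": ["tokyo", "japan", "chinese"]}
vocabulary = {"chinese", "beijing", "shanghai", "tokyo", "japan"}

def smoothed_cond_prob(term, cls):
    counts = Counter(class_tokens[cls])
    # P(t|c) = (T_c,t + 1) / ((sum over t' of T_c,t') + |V|)
    return (counts[term] + 1) / (len(class_tokens[cls]) + len(vocabulary))

print(smoothed_cond_prob("tokyo", "c"))     # (0 + 1) / (6 + 5): no longer zero
print(smoothed_cond_prob("chinese", "c"))   # (4 + 1) / (6 + 5)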

10
Q

assumptions of Naive Bayes

A
  1. Conditional Independence Assumption: features are independent of each other given the class (so you can multiply the probabilities)
  2. Positional Independence Assumption: the conditional probabilities of a term are the same, independent of the position in the document

Despite these simplifying assumptions, Naive Bayes is a good baseline for text classification.

11
Q

why is accuracy often not suitable as an evaluation metric?

A
  • the two classes are often imbalanced: high accuracy on one class might mean low accuracy on the other class (illustrated below)
  • we might be more interested in correctness of the labels than in completeness of the labels, or vice versa
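
A quick illustration of the imbalance problem (the labels are invented): always predicting the majority class gives high accuracy but never finds the minority class.

# 95 negatives, 5 positives; a classifier that always predicts "neg"
true = ["neg"] * 95 + ["pos"] * 5
pred = ["neg"] * 100
accuracy = sum(t == p for t, p in zip(true, pred)) / len(true)
print(accuracy)   # 0.95, yet recall for the "pos" class is 0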
12
Q

F1 score

A

F1 = 2 * (precision * recall) / (precision + recall)
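
A small worked example (the confusion-matrix counts are invented):

# invented counts: 8 true positives, 2 false positives, 4 false negatives
tp, fp, fn = 8, 2, 4
precision = tp / (tp + fp)                             # 0.8
recall = tp / (tp + fn)                                # ~0.667
f1 = 2 * precision * recall / (precision + recall)     # ~0.727
print(round(precision, 3), round(recall, 3), round(f1, 3))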
