C4 Flashcards
types of classification tasks
- binary classification
- multi-class classification
- multi-label classification
pre-processing steps for text mining
- tokenization or character k-grams
- lower-casing and removal of punctuation
- decide on vocabulary size and feature selection
and maybe:
- remove stop-words
- lemmatization or stemming
- add phrases as features
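a minimal sketch of these pre-processing steps in plain Python (the stop-word list, corpus, and vocabulary size are made up for illustration; a real pipeline would typically use NLTK or spaCy for tokenization, stop-words, and lemmatization):

```python
import re
from collections import Counter

# hypothetical tiny stop-word list; a real pipeline would use a full list (e.g. from NLTK)
STOP_WORDS = {"the", "a", "an", "of", "and", "or", "in", "to", "is", "on"}

def preprocess(text, vocabulary=None, remove_stop_words=True):
    """Tokenize, lower-case, strip punctuation, and optionally filter to a fixed vocabulary."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())      # tokenization + lower-casing + punctuation removal
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    if vocabulary is not None:
        tokens = [t for t in tokens if t in vocabulary]  # restrict to the chosen vocabulary
    return tokens

# decide on the vocabulary size: keep the n most frequent terms over a (toy) corpus
corpus = ["The cat sat on the mat.", "A dog chased the cat!"]
term_counts = Counter(t for doc in corpus for t in preprocess(doc))
vocabulary = {t for t, _ in term_counts.most_common(1000)}
```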
feature selection for document classification
goals: reduce dimensionality and reduce the risk of overfitting
global term selection: overall term frequency is used as cut-off (remove rare terms)
local term selection: each term is scored by a function that captures its degree of correlation with each class it occurs in
only top n terms are used for classifier training
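a sketch of both selection steps with scikit-learn (assumed to be available; the toy documents and the value of n are invented): min_df acts as a global document-frequency cut-off, and chi-squared scoring against the class labels implements local term selection, keeping only the top n terms.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

docs   = ["good movie", "great film", "terrible movie", "awful film"]
labels = [1, 1, 0, 0]

# global term selection: min_df removes terms below an overall document-frequency cut-off
vectorizer = CountVectorizer(min_df=1)
X = vectorizer.fit_transform(docs)

# local term selection: score each term against the class labels, keep only the top n
selector = SelectKBest(chi2, k=3)
X_top = selector.fit_transform(X, labels)
kept = [t for t, keep in zip(vectorizer.get_feature_names_out(), selector.get_support()) if keep]
```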
term weighting three different options
compute term weights
- binary: occurrence of term (yes/no)
- integer: term count
- real-valued: more advanced weighting (tf-idf)
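the three options in scikit-learn terms (assumed library; toy documents invented; note that sklearn's TfidfVectorizer uses a smoothed idf and L2 normalization, so it does not exactly match the formulas on the next cards):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the cat sat on the mat", "the dog sat"]

X_binary = CountVectorizer(binary=True).fit_transform(docs)  # binary: term occurs yes/no
X_counts = CountVectorizer().fit_transform(docs)             # integer: raw term counts
X_tfidf  = TfidfVectorizer().fit_transform(docs)             # real-valued: tf-idf weights
```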
term frequency tf
the term count tc_t,d of term t in document d is defined as the number of times that t occurs in d
we don’t want the raw term count, because relevance does not increase proportionally with term frequency
=> use log frequency: tf_t,d = 1 + log_10(tc_t,d), and tf_t,d = 0 if tc_t,d = 0
inverse document frequency idf
the most frequent terms are not very informative
df_t is the number of documents that t occurs in -> inverse measure of the informativeness of t
idf_t = log_10(N / df_t), where N is the total number of documents
tf-idf_t,d = tf_t,d * idf_t
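a direct translation of these formulas into Python (the example numbers are made up):

```python
import math

def tf(tc):
    """Log-scaled term frequency: 1 + log10(tc) for tc > 0, else 0."""
    return 1 + math.log10(tc) if tc > 0 else 0.0

def idf(df_t, n_docs):
    """Inverse document frequency: log10(N / df_t)."""
    return math.log10(n_docs / df_t)

def tf_idf(tc, df_t, n_docs):
    return tf(tc) * idf(df_t, n_docs)

# a term occurring 3 times in a document and appearing in 10 of 1000 documents:
# tf = 1 + log10(3) ≈ 1.48, idf = log10(1000/10) = 2, tf-idf ≈ 2.95
print(tf_idf(3, 10, 1000))
```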
classification methods three options
- if we use word features, we need an estimator that is well-suited for high-dimensional sparse data (Naive Bayes, SVM, Random Forest)
- if text is represented as dense embedding vectors, we can use neural network architectures to train classifiers
- or transfer learning from pre-trained contextual embeddings with transformers
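a minimal sketch of the first option (sparse tf-idf word features plus a linear SVM) with scikit-learn; the toy documents and labels are invented:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

docs   = ["good movie", "great film", "terrible movie", "awful film"]
labels = [1, 1, 0, 0]

# sparse word features + a linear SVM, an estimator well suited to high-dimensional sparse data
clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(docs, labels)
print(clf.predict(["great movie"]))
```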
Naive Bayes classification
- learning and classification based on probability theory
- uses prior probability of each category given no information about an item
- classification produces a posterior probability distribution over the possible categories given a description of an item
P(c|d) = P(d|c)P(c)/P(d)
P(c) = nr of documents for class c / total nr of documents
P(d|c) = P(t_1, t_2, …, t_k | c) = P(t_1|c) * P(t_2|c) * … * P(t_k|c)
problem with Maximum Likelihood Estimate
the estimate of P(t|c) is zero for a term-class combination that did not occur in the training data => the product over all terms gives a posterior probability of zero
solution: add-one smoothing = assumption that each term occurs one additional time for each class
P(t|c) = (T_c,t + 1) / ∑_t’∈V (T_c,t’ + 1) = (T_c,t + 1) / ((∑_t’∈V T_c,t’) + |V|)
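a from-scratch sketch of multinomial Naive Bayes with add-one smoothing (toy data invented; log probabilities are summed instead of multiplying raw probabilities, which avoids numerical underflow):

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Train multinomial Naive Bayes with add-one (Laplace) smoothing.
    docs: list of token lists, labels: list of class labels."""
    classes = set(labels)
    vocab = {t for d in docs for t in d}
    prior = {}                 # log P(c)
    cond = defaultdict(dict)   # log P(t|c)
    for c in classes:
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        prior[c] = math.log(len(class_docs) / len(docs))
        counts = Counter(t for d in class_docs for t in d)
        total = sum(counts.values())
        for t in vocab:
            # add-one smoothing: (T_c,t + 1) / ((sum_t' T_c,t') + |V|)
            cond[c][t] = math.log((counts[t] + 1) / (total + len(vocab)))
    return prior, cond, vocab

def predict_nb(doc, prior, cond, vocab):
    """Return the class with the highest log posterior."""
    scores = {c: prior[c] + sum(cond[c][t] for t in doc if t in vocab) for c in prior}
    return max(scores, key=scores.get)

docs   = [["good", "movie"], ["great", "film"], ["terrible", "movie"], ["awful", "film"]]
labels = ["pos", "pos", "neg", "neg"]
prior, cond, vocab = train_nb(docs, labels)
print(predict_nb(["great", "movie"], prior, cond, vocab))
```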
assumptions of Naive Bayes
- Conditional Independence Assumption: features are independent of each other given the class (so you can multiply the probabilities)
- Positional Independence Assumption: the conditional probabilities of a term are the same, independent of the position in the document
But despite these simplifying assumptions, Naive Bayes is a good baseline for text classification
why is the accuracy often not suitable as evaluation metric?
- the classes are often unbalanced: high overall accuracy can go together with low accuracy on the minority class
- we might be more interested in correctness of the labels than in completeness of the labels, or vice versa
F1 score
F1 = 2 * (precision * recall) / (precision + recall)
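computing F1 from the confusion-matrix counts of the positive class (example numbers invented):

```python
def f1_score(tp, fp, fn):
    """F1 from true positives, false positives, and false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall    = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# e.g. 8 true positives, 2 false positives, 4 false negatives:
# precision = 0.8, recall ≈ 0.667, F1 ≈ 0.727
print(f1_score(8, 2, 4))
```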