C4 Flashcards
types of classification tasks
- binary classification
- multi-class classification
- multi-label classification
pre-processing steps for text mining
- tokenization or character k-grams
- lower-casing and removal of punctuation
- decide on vocabulary size and feature selection
and maybe:
- remove stop-words
- lemmatization or stemming
- add phrases as features
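a minimal sketch of these pre-processing steps in plain Python (the stop-word list, corpus, and vocabulary size are made up for illustration; a real pipeline would typically use NLTK or spaCy for tokenization, stop-words, and lemmatization):

```python
import re
from collections import Counter

# hypothetical tiny stop-word list; a real pipeline would use a full list (e.g. from NLTK)
STOP_WORDS = {"the", "a", "an", "of", "and", "or", "in", "to", "is", "on"}

def preprocess(text, vocabulary=None, remove_stop_words=True):
    """Tokenize, lower-case, strip punctuation, and optionally filter to a fixed vocabulary."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())      # tokenization + lower-casing + punctuation removal
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    if vocabulary is not None:
        tokens = [t for t in tokens if t in vocabulary]  # restrict to the chosen vocabulary
    return tokens

# decide on the vocabulary size: keep the n most frequent terms over a (toy) corpus
corpus = ["The cat sat on the mat.", "A dog chased the cat!"]
term_counts = Counter(t for doc in corpus for t in preprocess(doc))
vocabulary = {t for t, _ in term_counts.most_common(1000)}
```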
feature selection for document classification
goals: reduce dimensionality and reduce the risk of overfitting
global term selection: overall term frequency is used as cut-off (remove rare terms)
local term selection: each term is scored by a function that captures its degree of correlation with each class it occurs in
only top n terms are used for classifier training
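a sketch of both selection steps with scikit-learn (assumed to be available; the toy documents and the value of n are invented): min_df acts as a global document-frequency cut-off, and chi-squared scoring against the class labels implements local term selection, keeping only the top n terms.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

docs   = ["good movie", "great film", "terrible movie", "awful film"]
labels = [1, 1, 0, 0]

# global term selection: min_df removes terms below an overall document-frequency cut-off
vectorizer = CountVectorizer(min_df=1)
X = vectorizer.fit_transform(docs)

# local term selection: score each term against the class labels, keep only the top n
selector = SelectKBest(chi2, k=3)
X_top = selector.fit_transform(X, labels)
kept = [t for t, keep in zip(vectorizer.get_feature_names_out(), selector.get_support()) if keep]
```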
term weighting three different options
compute term weights
- binary: occurrence of term (yes/no)
- integer: term count
- real-valued: more advanced weighting (tf-idf)
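the three options in scikit-learn terms (assumed library; toy documents invented; note that sklearn's TfidfVectorizer uses a smoothed idf and L2 normalization, so it does not exactly match the formulas on the next cards):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the cat sat on the mat", "the dog sat"]

X_binary = CountVectorizer(binary=True).fit_transform(docs)  # binary: term occurs yes/no
X_counts = CountVectorizer().fit_transform(docs)             # integer: raw term counts
X_tfidf  = TfidfVectorizer().fit_transform(docs)             # real-valued: tf-idf weights
```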
term frequency tf
the term count tc_t,d of term t in document d is defined as the number of times that t occurs in d
we don’t want the raw term count, because relevance does not increase proportionally with term frequency
=> use log frequency: tf_t,d = 1 + log_10(tc_t,d), and tf_t,d = 0 if tc_t,d = 0
inverse document frequency idf
the most frequent terms are not very informative
df_t is the number of documents that t occurs in -> inverse measure of the informativeness of t
idf_t = log_10(N / df_t), where N is the total number of documents
tf-idf_t,d = tf_t,d * idf_t
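a direct translation of these formulas into Python (the example numbers are made up):

```python
import math

def tf(tc):
    """Log-scaled term frequency: 1 + log10(tc) for tc > 0, else 0."""
    return 1 + math.log10(tc) if tc > 0 else 0.0

def idf(df_t, n_docs):
    """Inverse document frequency: log10(N / df_t)."""
    return math.log10(n_docs / df_t)

def tf_idf(tc, df_t, n_docs):
    return tf(tc) * idf(df_t, n_docs)

# a term occurring 3 times in a document and appearing in 10 of 1000 documents:
# tf = 1 + log10(3) ≈ 1.48, idf = log10(1000/10) = 2, tf-idf ≈ 2.95
print(tf_idf(3, 10, 1000))
```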
classification methods three options
- if we use word features, we need an estimator that is well-suited for high-dimensional sparse data (Naive Bayes, SVM, Random Forest)
- if text is represented as dense embedding vectors, we can use neural network architectures to train classifiers
- or transfer learning from pre-trained contextual embeddings with transformers
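a minimal sketch of the first option (sparse tf-idf word features plus a linear SVM) with scikit-learn; the toy documents and labels are invented:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

docs   = ["good movie", "great film", "terrible movie", "awful film"]
labels = [1, 1, 0, 0]

# sparse word features + a linear SVM, an estimator well suited to high-dimensional sparse data
clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(docs, labels)
print(clf.predict(["great movie"]))
```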
Naive Bayes classification
- learning and classification based on probability theory
- uses prior probability of each category given no information about an item
- classification produces a posterior probability distribution over the possible categories given a description of an item
P(c|d) = P(d|c)P(c)/P(d)
P(c) = nr of documents for class c / total nr of documents
P(d|c) = P(t_1, t_2, …, t_k | c) = P(t_1|c) * P(t_2|c) * … * P(t_k|c)
problem with Maximum Likelihood Estimate
the estimate of P(t|c) is zero for a term-class combination that did not occur in the training data => the product over all terms gives a posterior probability of zero
solution: add-one smoothing = assumption that each term occurs one additional time for each class
P(t|c) = (T_c,t + 1) / ∑_t’∈V (T_c,t’ + 1) = (T_c,t + 1) / ((∑_t’∈V T_c,t’) + |V|)
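a from-scratch sketch of multinomial Naive Bayes with add-one smoothing (toy data invented; log probabilities are summed instead of multiplying raw probabilities, which avoids numerical underflow):

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Train multinomial Naive Bayes with add-one (Laplace) smoothing.
    docs: list of token lists, labels: list of class labels."""
    classes = set(labels)
    vocab = {t for d in docs for t in d}
    prior = {}                 # log P(c)
    cond = defaultdict(dict)   # log P(t|c)
    for c in classes:
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        prior[c] = math.log(len(class_docs) / len(docs))
        counts = Counter(t for d in class_docs for t in d)
        total = sum(counts.values())
        for t in vocab:
            # add-one smoothing: (T_c,t + 1) / ((sum_t' T_c,t') + |V|)
            cond[c][t] = math.log((counts[t] + 1) / (total + len(vocab)))
    return prior, cond, vocab

def predict_nb(doc, prior, cond, vocab):
    """Return the class with the highest log posterior."""
    scores = {c: prior[c] + sum(cond[c][t] for t in doc if t in vocab) for c in prior}
    return max(scores, key=scores.get)

docs   = [["good", "movie"], ["great", "film"], ["terrible", "movie"], ["awful", "film"]]
labels = ["pos", "pos", "neg", "neg"]
prior, cond, vocab = train_nb(docs, labels)
print(predict_nb(["great", "movie"], prior, cond, vocab))
```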
assumptions of Naive Bayes
- Conditional Independence Assumption: features are independent of each other given the class (so you can multiply the probabilities)
- Positional Independence Assumption: the conditional probabilities of a term are the same, independent of the position in the document
But despite these simplifying assumptions, Naive Bayes is a good baseline for text classification
why is the accuracy often not suitable as evaluation metric?
- the classes are often unbalanced: high overall accuracy can go together with low accuracy on the minority class
- we might be more interested in correctness of the labels than in completeness of the labels, or vice versa
F1 score
F1 = 2 * (precision * recall) / (precision + recall)
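computing F1 from the confusion-matrix counts of the positive class (example numbers invented):

```python
def f1_score(tp, fp, fn):
    """F1 from true positives, false positives, and false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall    = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# e.g. 8 true positives, 2 false positives, 4 false negatives:
# precision = 0.8, recall ≈ 0.667, F1 ≈ 0.727
print(f1_score(8, 2, 4))
```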