Text Classification Flashcards
How can we use machine learning for text classification? ⭐️
Machine learning classification algorithms predict a class based on a numerical feature representation. This means that in order to use machine learning for text classification, we need to extract numerical features from our text data first before we can apply machine learning algorithms. Common approaches to extract numerical features from text data are bag of words, N-grams or word embeddings.
What is bag of words? How can we use it for text classification? ⭐️
Bag of Words is a representation of text that describes the occurrence of words within a document. The order or structure of the words is not considered. For text classification, we look at the histogram of the words within the text and consider each word count as a feature.
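A minimal sketch of the idea, using only the Python standard library. The whitespace tokenizer, the lowercasing step, and the helper names `build_vocabulary` and `bag_of_words` are assumptions made for illustration; real pipelines usually rely on a library vectorizer with more careful tokenization.

```python
from collections import Counter

def build_vocabulary(documents):
    """Return a sorted list of every distinct lowercase token in the corpus."""
    return sorted({token for doc in documents for token in doc.lower().split()})

def bag_of_words(document, vocabulary):
    """Map a document to a list of word counts, one entry per vocabulary word."""
    counts = Counter(document.lower().split())
    return [counts[word] for word in vocabulary]

docs = ["the cat sat on the mat", "the dog sat"]
vocab = build_vocabulary(docs)
vectors = [bag_of_words(doc, vocab) for doc in docs]

print(vocab)    # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(vectors)  # [[1, 0, 1, 1, 1, 2], [0, 1, 0, 0, 1, 1]]
```

Each row of `vectors` is the feature vector for one document, and each column is the count of one vocabulary word, so the rows can be passed to any classifier that accepts numerical features.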
What are the advantages and disadvantages of bag of words? ⭐️
Advantages:
Simple to understand and implement.
Disadvantages:
The vocabulary requires careful design, most specifically in order to manage the size, which impacts the sparsity of the document representations.
Sparse representations are harder to model both for computational reasons (space and time complexity) and for information reasons, since the model has to learn from very little information spread over a very large representational space.
Discarding word order ignores the context, and in turn the meaning, of words in the document. Context and meaning can offer a lot to the model: if captured, they would let it tell apart the same words arranged differently (“this is interesting” vs “is this interesting”) and recognize synonyms (“old bike” vs “used bike”).
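The word-order limitation can be checked directly with the two examples above. This is a small sketch; the helper `word_counts` and its whitespace tokenization are assumptions for illustration.

```python
from collections import Counter

def word_counts(text):
    """Bag-of-words counts for a text, using lowercase whitespace tokens."""
    return Counter(text.lower().split())

# Same words, different order: the two representations are identical.
same_words = word_counts("this is interesting") == word_counts("is this interesting")

# Synonyms: only the literal token "bike" is shared between the two phrases.
shared = set(word_counts("old bike")) & set(word_counts("used bike"))

print(same_words)  # True
print(shared)      # {'bike'}
```

A statement and a question end up with the same features, while “old” and “used” are treated as unrelated columns.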
What are N-grams? How can we use them? ⭐️
An n-gram is a contiguous sequence of N words (or other tokens) taken from a text. Tokenizing a text into n-grams preserves some local word order, and counting the n-grams shows which word sequences occur most often (for example, how often word X is followed by word Y) in a given text.
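A short sketch of extracting and counting n-grams with the standard library. The function name `ngrams` and the whitespace tokenization are assumptions for illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-token sequences as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "to be or not to be".split()
bigrams = ngrams(tokens, 2)
bigram_counts = Counter(bigrams)

print(bigrams)
print(bigram_counts.most_common(1))  # [(('to', 'be'), 2)]
```

For text classification, the n-gram counts can be used as features in the same way as single-word counts in bag of words.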
What is TF-IDF? How is it useful for text classification? ⭐️
Term Frequency (TF) scores how frequently a word occurs in the current document. Inverse Document Frequency (IDF) scores how rare the word is across all documents. TF-IDF is used in scenarios where highly recurring words may not carry as much information as domain-specific words. For example, words like “the” are frequent across all documents and therefore need to be weighted less. The TF-IDF score highlights words that are distinctive (contain useful information) in a given document.
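A minimal sketch of one common formulation, assuming TF is the word count divided by document length and IDF is log(N / df), where N is the number of documents and df is the number of documents containing the word. Libraries often use smoothed or normalized variants, so their exact scores may differ. The function name `tf_idf` is hypothetical.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {word: score} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each word.
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores

docs = ["the cat sat", "the dog barked", "the cat purred"]
scores = tf_idf(docs)
print(scores[0])
```

In this toy corpus “the” appears in every document, so its score is zero, while “sat” appears in only one document and scores higher than “cat”, which appears in two.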
Which model would you use for text classification with bag of words features? ⭐️
Bag Of Words model
Word2Vec Embeddings
fastText Embeddings
Convolutional Neural Networks (CNN)
Long Short-Term Memory (LSTM)
Bidirectional Encoder Representations from Transformers (BERT)
Would you prefer a gradient boosting trees model or logistic regression when doing text classification with bag of words? ⭐️
Usually logistic regression is better because bag of words creates a matrix with a large number of columns. With a huge number of columns, logistic regression is usually faster than gradient boosting trees.
What are word embeddings? Why are they useful? Do you know Word2Vec? ⭐️
Word embeddings are vector representations of words. Each word is mapped to one vector that tries to capture some characteristics of the word, so that similar words have similar vector representations.
Word embeddings help capture inter-word semantics and represent them as real-valued vectors.
Word2Vec is a method to construct such an embedding. It takes a text corpus as input and outputs a set of vectors that represent the words in that corpus.
It can be generated using two methods:
Continuous Bag of Words (CBOW)
Skip-Gram
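The two methods differ in what the model is asked to predict: CBOW predicts a center word from its surrounding context words, while Skip-Gram predicts each context word from the center word. The sketch below only builds the training examples for each setup and does not train a network; the function names and the `window` parameter are assumptions for illustration.

```python
def skipgram_pairs(tokens, window):
    """Skip-Gram examples: (center word, one context word) pairs."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_pairs(tokens, window):
    """CBOW examples: (list of context words, center word) pairs."""
    pairs = []
    for i, center in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        pairs.append((context, center))
    return pairs

tokens = "the quick brown fox".split()
sg = skipgram_pairs(tokens, window=1)
cb = cbow_pairs(tokens, window=1)

print(sg)
print(cb)
```

With a window of 1, Skip-Gram yields pairs such as `('quick', 'the')` and `('quick', 'brown')`, while CBOW yields a single example `(['the', 'brown'], 'quick')` for the same position.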
Do you know any other ways to get word embeddings? 🚀
TF-IDF
GloVe
BERT
If you have a sentence with multiple words, you may need to combine multiple word embeddings into one. How would you do it? ⭐️
Approaches ranked from simple to more complex:
Take an average over all words
Take a weighted average over all words. Weighting can be done by inverse document frequency (idf part of tf-idf).
Use an ML model such as an LSTM or a Transformer.
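The first two approaches can be sketched as a single function. The toy 2-dimensional embeddings and IDF weights below are made up for illustration, and the function name `weighted_average` is hypothetical; in practice the vectors would come from a trained embedding model and the weights from a corpus.

```python
def weighted_average(tokens, embeddings, weights=None):
    """Average the vectors of known tokens; every weight defaults to 1.0."""
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for token in tokens:
        if token not in embeddings:
            continue  # skip out-of-vocabulary words
        w = 1.0 if weights is None else weights.get(token, 1.0)
        total = [t + w * x for t, x in zip(total, embeddings[token])]
        weight_sum += w
    if weight_sum == 0.0:
        return total  # no usable words: fall back to the zero vector
    return [t / weight_sum for t in total]

# Toy values, assumed for illustration only.
toy_embeddings = {"good": [1.0, 0.0], "movie": [0.0, 1.0], "the": [0.5, 0.5]}
toy_idf = {"good": 2.0, "movie": 1.0, "the": 0.0}

sentence = "the good movie".split()
plain = weighted_average(sentence, toy_embeddings)
weighted = weighted_average(sentence, toy_embeddings, toy_idf)

print(plain)
print(weighted)
```

With IDF weighting, the frequent word “the” contributes nothing and the sentence vector moves toward the rarer word “good”.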