Session 6.1 Flashcards
Bag of words
Treats every word as a term in the document, ignoring word order.
Bag of N-grams
Treats every sequence of N adjacent words as a term in the document.
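As a minimal sketch of the two representations, the snippet below counts unigrams (bag of words, N = 1) and bigrams (N = 2) for one sentence. The helper name `ngrams`, the example sentence, and the whitespace tokenizer are all illustrative assumptions, not part of the session material.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n adjacent tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical example sentence, split on whitespace for simplicity.
tokens = "the cat sat on the mat".split()

# N = 1: each word is a term (bag of words).
bag_of_words = Counter(ngrams(tokens, 1))

# N = 2: each pair of adjacent words is a term (bag of bigrams).
bag_of_bigrams = Counter(ngrams(tokens, 2))

print(bag_of_words)
print(bag_of_bigrams)
```

Even in this tiny sentence the bigram vocabulary is as large as the word vocabulary, and it usually grows much faster on real text.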
Term frequency (TF)
Measures how often a term occurs in a document, e.g., as a raw count:
TF(t, d) = number of times term t appears in document d
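A raw-count TF can be sketched in a few lines. The function name `term_frequency` and the example document are hypothetical, and tokens are assumed to be already split and lower-cased.

```python
from collections import Counter

def term_frequency(term, document_tokens):
    """Raw-count TF: how many times `term` occurs in the document."""
    return Counter(document_tokens)[term]

# Hypothetical document, tokenized on whitespace.
doc = "to be or not to be".split()

print(term_frequency("to", doc))
print(term_frequency("question", doc))  # a term absent from the document
```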
TFIDF
TFIDF(t, d) = TF(t, d) × IDF(t), the product of the term frequency TF(t, d) and the inverse document frequency IDF(t)
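The product can be sketched as follows. The card does not define IDF, so this example assumes one common form, IDF(t) = log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing t; other variants add smoothing. The function names and the three-document corpus are hypothetical.

```python
import math
from collections import Counter

def tf(term, doc_tokens):
    """Raw-count term frequency."""
    return Counter(doc_tokens)[term]

def idf(term, corpus):
    """Assumed IDF form: log(N / df). Returns 0.0 for unseen terms."""
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / df) if df else 0.0

def tfidf(term, doc_tokens, corpus):
    """TFIDF(t, d) = TF(t, d) * IDF(t)."""
    return tf(term, doc_tokens) * idf(term, corpus)

# Hypothetical corpus of three tokenized documents.
corpus = [
    "the cat sat".split(),
    "the dog barked".split(),
    "the cat purred".split(),
]

print(tfidf("the", corpus[0], corpus))  # term found in every document
print(tfidf("dog", corpus[1], corpus))  # term found in a single document
```

Under this assumed IDF, a term that appears in every document gets a weight of zero, while a term confined to few documents is weighted up.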
Document Term Matrix (DTM)
Each document is a row and each term is a column
Term Document Matrix (TDM)
Each term is a row and each document is a column
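The two layouts are transposes of each other, which a small sketch can make concrete. The two example documents are hypothetical, and raw counts are assumed as the cell values.

```python
from collections import Counter

# Hypothetical corpus of two tokenized documents.
docs = ["the cat sat".split(), "the dog sat down".split()]

# Columns of the DTM: the sorted vocabulary across all documents.
vocabulary = sorted({term for doc in docs for term in doc})

# DTM: one row per document, one column per term.
dtm = [[Counter(doc)[term] for term in vocabulary] for doc in docs]

# TDM: the transpose, one row per term, one column per document.
tdm = [list(row) for row in zip(*dtm)]

print(vocabulary)
print(dtm)
print(tdm)
```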
Disadvantage of bag of words/N-grams & solutions
Massive number of features
To decrease the number of words/terms/features in documents:
➢ Cleaning and preprocessing text
  • Case normalization
  • Removing punctuation
  • Removing numbers
  • Removing stopwords
  • Word stemming and stem completion
➢ Feature selection
➢ Special consideration for computational storage space
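The cleaning steps listed above can be sketched as a small pipeline. This is only an illustration: the stopword list is a tiny hypothetical one, and `naive_stem` is a crude suffix stripper standing in for a real stemmer. Stem completion is not shown.

```python
import re

# Hypothetical, deliberately tiny stopword list.
STOPWORDS = {"the", "a", "an", "and", "of", "in", "to", "is", "are"}

def naive_stem(word):
    """Crude stand-in for a real stemmer: strip a few common suffixes."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    text = text.lower()                    # case normalization
    text = re.sub(r"[^\w\s]", " ", text)   # removing punctuation
    text = re.sub(r"\d+", " ", text)       # removing numbers
    tokens = [t for t in text.split() if t not in STOPWORDS]  # removing stopwords
    return [naive_stem(t) for t in tokens]  # word stemming

tokens = preprocess("The 3 cats are chasing 2 dogs, and the dogs barked!")
print(tokens)
```

Note that stems such as "chas" are not always real words, which is the gap that stem completion is meant to fill.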
Sentiment analysis technique
Detecting the sentiment of the text, e.g.,
• positive/negative/neutral
• urgent/not urgent
Three main levels of sentiment analysis
• Document-level
• Sentence-level
• Aspect-level
How is topic modelling technique different from clustering?
Topic modelling:
A document can be associated with more than one topic
Clustering:
A document appears in only one of the clusters
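The difference in output shape can be sketched as follows. The per-document topic weights are hypothetical numbers made up for illustration; in practice they would come from a fitted topic model, which is not shown here. The function names and the 0.2 threshold are also assumptions.

```python
# Hypothetical per-document topic weights (each row sums to 1).
doc_topic_weights = {
    "doc_a": {"sports": 0.5, "finance": 0.375, "health": 0.125},
    "doc_b": {"sports": 0.125, "finance": 0.75, "health": 0.125},
}

def hard_assignment(weights):
    """Clustering-style output: exactly one label per document."""
    return max(weights, key=weights.get)

def soft_assignment(weights, threshold=0.2):
    """Topic-modelling-style output: every topic at or above a threshold."""
    return sorted(t for t, w in weights.items() if w >= threshold)

for name, weights in doc_topic_weights.items():
    print(name, hard_assignment(weights), soft_assignment(weights))
```

Here `doc_a` receives a single label under the clustering-style view but is associated with two topics under the topic-modelling-style view.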