Session 6.4 Flashcards
Why text mining
➢ Text is everywhere
➢ It takes too much time to read a million customer reviews or tweets.
➢ Text mining helps us to reduce the information and draw out the important features.
A token/term
e.g. a word or a group of words
A document
one piece of text
A corpus
A collection of documents
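A minimal sketch of how these three terms relate. The sample reviews, the whitespace tokenizer and the helper names are assumptions for illustration only.

```python
# A corpus is a collection of documents; each document is one piece of text.
corpus = [
    "The phone has a great camera",
    "Battery life is great",
]

def tokenize(document):
    """Split one document into single-word tokens (simple whitespace split)."""
    return document.split()

def bigrams(tokens):
    """Group adjacent words into two-word tokens."""
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

tokens = tokenize(corpus[1])
print(tokens)           # ['Battery', 'life', 'is', 'great']
print(bigrams(tokens))  # ['Battery life', 'life is', 'is great']
```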
Inverse document frequency (IDF)
A measure of how sparse (rare) term t is across the documents in a corpus
A common term in the corpus has
low IDF
A rare term in the corpus has
high IDF
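The cards do not give a formula for IDF. The sketch below assumes one common variant, log(N / df), where N is the number of documents and df is the number of documents containing the term. The toy corpus is made up.

```python
import math

# Each document is already tokenized (toy data, for illustration only).
corpus = [
    ["great", "phone", "camera"],
    ["great", "battery"],
    ["great", "screen", "phone"],
    ["slow", "charger"],
]

def idf(term, corpus):
    """Assumed variant: log(N / df). Fails if the term appears in no document."""
    n_docs = len(corpus)
    doc_freq = sum(1 for doc in corpus if term in doc)
    return math.log(n_docs / doc_freq)

common = idf("great", corpus)   # appears in 3 of 4 documents
rare = idf("charger", corpus)   # appears in 1 of 4 documents
print(common < rare)            # True: the rare term has the higher IDF
```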
TFIDF is high when
both TF and IDF values are high
i.e., the word is rare in the corpus but frequent in a single document.
TFIDF
Product of Term Frequency (TF) and Inverse Document Frequency (IDF)
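A small sketch of the product, assuming TF is the raw count of the term in one document and IDF is log(N / df). Other weightings exist, and the corpus is invented.

```python
import math
from collections import Counter

corpus = [
    ["camera", "camera", "camera", "great"],
    ["great", "battery"],
    ["great", "screen"],
]

def tf(term, doc):
    """Term frequency: raw count of the term in one document (an assumption)."""
    return Counter(doc)[term]

def idf(term, corpus):
    """Inverse document frequency: log(N / df) (an assumed variant)."""
    doc_freq = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / doc_freq)

def tfidf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

doc = corpus[0]
scores = {term: tfidf(term, doc, corpus) for term in set(doc)}
print(scores)
```

Here "camera" is frequent in the first document and absent elsewhere, so it scores high. "great" appears in every document, so its IDF, and therefore its TFIDF, is zero.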
Is there any disadvantage of bag of words/N-grams approach?
Yes, there could be massive numbers of features, requiring a lot of memory and computational resources.
Possible Solutions:
- Feature selection
- Giving special consideration to memory and storage space
- Cleaning and preprocessing text
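To illustrate how quickly the number of features grows, the sketch below counts distinct features for three tiny made-up documents as longer N-grams are added. The function names are hypothetical.

```python
corpus = [
    "the camera is great",
    "the battery is great",
    "the screen is not great",
]

def ngrams(tokens, n):
    """All runs of n adjacent tokens, joined into one feature string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def vocabulary(corpus, max_n):
    """Distinct features when using all N-grams of length 1 up to max_n."""
    vocab = set()
    for document in corpus:
        tokens = document.split()
        for n in range(1, max_n + 1):
            vocab.update(ngrams(tokens, n))
    return vocab

for max_n in (1, 2, 3):
    print(max_n, len(vocabulary(corpus, max_n)))
```

Even with three short documents, the feature count more than triples from single words to N-grams of up to three words.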
Case normalization
➢ Computers often treat capitalized words as being different to their lowercase counterparts.
➢ Case normalization converts every word to lowercase.
➢ Can be helpful
➢ Can be harmful when capital letters help us to identify different things, e.g., “US” (the country) and “us” (the pronoun).
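A short sketch of both effects, using a made-up sentence and Python's built-in `str.lower`.

```python
from collections import Counter

tokens = "Music is great and music is everywhere".split()

raw_counts = Counter(tokens)
lower_counts = Counter(token.lower() for token in tokens)

print(raw_counts["Music"], raw_counts["music"])  # 1 1: treated as two words
print(lower_counts["music"])                     # 2: unified after lowercasing

# The harmful case: two different things collapse into one token.
print("US".lower() == "us")                      # True
```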
Removing punctuation
➢ Can be helpful
e.g., “music” and “music.” will be correctly identified as the same word.
➢ Can be harmful when we are interested in how certain punctuation is used
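One possible way to strip punctuation, assuming only the ASCII punctuation characters in `string.punctuation` matter. The function name is hypothetical.

```python
import string

def strip_punctuation(text):
    """Delete every ASCII punctuation character from the text."""
    return text.translate(str.maketrans("", "", string.punctuation))

print(strip_punctuation("music.") == strip_punctuation("music"))  # True
print(strip_punctuation("Really?!"))  # the "?!" we might care about is lost
print(strip_punctuation("don't"))     # apostrophes are removed as well
```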
Removing numbers
➢ Depending on the purpose of the analysis, we may want to remove numbers.
➢ Don’t do this if we want to text mine quantities.
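A minimal sketch of number removal with a regular expression. It assumes a number is any run of digits, and the sample sentences are invented.

```python
import re

def remove_numbers(text):
    """Delete runs of digits, then collapse the leftover whitespace."""
    without_digits = re.sub(r"\d+", "", text)
    return " ".join(without_digits.split())

print(remove_numbers("Bought 2 phones for 499 dollars"))
# The quantities are gone, which is a problem if we wanted to mine them.
print(remove_numbers("4G network"))  # digits inside a token are removed too
```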
Removing stopwords
➢ Stopwords are words that are used frequently in the corpus but don’t offer much insight into the documents.
e.g., common stopwords in English: “the”, “and”, “of”, “is”, etc.
➢ Don’t remove stopwords that we are interested in and want to text mine.
e.g., if we want to look at tense in English, then we shouldn’t remove words such as “is” or “was”.
➢ Remove stopwords that we are not interested in.
e.g., suppose we are studying a corpus about customer reviews on a phone and almost every customer review in
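
A hedged sketch of stopword removal. The stopword set contains only the four examples from the card, and the sentence and function name are made up.

```python
# Only the four example stopwords from the card; real lists are much longer.
STOPWORDS = {"the", "and", "of", "is"}

def remove_stopwords(tokens, stopwords):
    """Keep the tokens whose lowercase form is not in the stopword set."""
    return [token for token in tokens if token.lower() not in stopwords]

tokens = "the sound of the music is great".split()

print(remove_stopwords(tokens, STOPWORDS))
# If we are studying tense, keep "is" by taking it out of the stopword set.
print(remove_stopwords(tokens, STOPWORDS - {"is"}))
```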
Word stemming and stem completion
➢ Word stemming reduces words to their word stem or root, so that different versions of the same word are unified across documents.
e.g., “announces”, “announced” and “announcing” are all reduced to “announc”
➢ May get some word stems that are not real words! e.g., “announc”
➢ Can choose to do stem completion.
i.e., reconstructing each word stem into a known word
e.g., “announc” -> “announce”
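A toy sketch of both steps. The suffix list is a deliberately crude assumption and not a real stemming algorithm such as Porter's. Stem completion is approximated here by picking the shortest known word that starts with the stem. The word list and function names are hypothetical.

```python
# Toy suffix list, checked in order (not a real stemming algorithm).
SUFFIXES = ("ing", "ed", "es", "e", "s")

# Hypothetical dictionary of known words used for stem completion.
KNOWN_WORDS = {"announce", "announcement", "music"}

def stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def complete_stem(word_stem, known_words):
    """Return the shortest known word starting with the stem, else the stem."""
    candidates = [word for word in known_words if word.startswith(word_stem)]
    return min(candidates, key=len) if candidates else word_stem

words = ["announces", "announced", "announcing"]
stems = [stem(word) for word in words]
print(stems)                                  # ['announc', 'announc', 'announc']
print(complete_stem(stems[0], KNOWN_WORDS))   # announce
```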