Text Mining Flashcards
What methods does text mining use?
Information Retrieval. Pre-processing of text documents
What tasks does text mining perform?
Text Classification, Text Clustering and Text Summarization
What is an issue with text mining vs traditional data mining?
Traditional data mining works on structured data. Text often has no real structure.
What is a Vector Space Model?
Each document is represented as a vector of term weights, treating the document as a “bag” of words (word order is ignored).
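A minimal sketch of the "bag" idea, assuming whitespace tokenization and lowercasing (both are illustrative choices, and `bag_of_words` is a hypothetical helper name):

```python
from collections import Counter

def bag_of_words(text):
    """Return term counts for a document; word order is discarded."""
    return Counter(text.lower().split())

bag = bag_of_words("the cat sat on the mat")
# Reordering the words gives the same bag, because only counts are kept.
same_bag = bag_of_words("mat the on sat cat the")
```

Two documents with the same words in a different order end up with identical representations, which is what "bag" means here.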
What is a problem with the Vector Space Model?
There are many words in the English language, so the vectors have a very large number of dimensions and most entries are zero.
How do you fix the limitations of the Vector Space Model?
Removing the stop words (“a, the, this, that …”)
Stemming (e.g., combining similar verb forms such as past and present tense)
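Both fixes can be sketched in a few lines. The stop-word list below is a tiny illustrative sample, and `naive_stem` is a toy suffix stripper written for this card, not a real stemming algorithm such as Porter's:

```python
STOP_WORDS = {"a", "the", "this", "that"}  # illustrative sample, not a full list

def naive_stem(word):
    """Strip a few common suffixes (toy stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Lowercase, drop stop words, then stem what remains."""
    tokens = text.lower().split()
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]

tokens = preprocess("The dog walked and the dogs are walking")
# tokens == ["dog", "walk", "and", "dog", "are", "walk"]
```

After this step "walked" and "walking" share one dimension, and "the" no longer takes up a dimension at all.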
How do you assign the weight (importance) of a term in text-mining?
Use TF-IDF
Weight = TF * IDF
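TF = Term Frequency (how many times the term occurs in the document)
IDF = Inverse Document Frequency = log (total documents / document frequency), where document frequency is the number of documents that contain the term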
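A small worked sketch of the weight formula. The card does not state the base of the logarithm, so the natural log is an assumption here, and `tf_idf` is a hypothetical helper name:

```python
import math

def tf_idf(term_count, total_docs, doc_freq):
    """Weight = TF * IDF, with IDF = log(total_docs / doc_freq)."""
    idf = math.log(total_docs / doc_freq)  # natural log (assumed base)
    return term_count * idf

# A term that occurs 3 times in a document and appears in 10 of 1000 documents.
rare_weight = tf_idf(3, 1000, 10)
# A term that appears in every document has IDF = log(1) = 0, so its weight is 0.
common_weight = tf_idf(3, 1000, 1000)
```

The second case shows why IDF matters: a word found in every document carries no weight, however often it occurs.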
What are the steps involved in text mining?
- Get the text
- Remove the stop words
- Convert all the words to lowercase (optional step)
- Stem related words to a common root (interesting, interested -> interest)
- Count the term frequency
- Create an index file, which has all the terms and their frequencies. Sort it alphabetically.
- Create the Vector Space Model: in each document's vector, put the term's count in that term's position (occurs once, put 1; occurs 3 times, put 3).
- Compute the IDF: log (how many documents there are / how many documents this word appears in).
- Compute the weight (tf * idf)
- Normalize so that every weight is at most 1. Within each document, each term's weight is divided by the square root of the sum of that document's squared weights.
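The later steps (index, count vectors, IDF, weights, normalization) might be sketched as follows. This assumes the documents have already been stop-word filtered and stemmed into token lists, uses the natural log for IDF, and uses the hypothetical name `tfidf_vectors`:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns (sorted index, normalized vectors)."""
    index = sorted({t for doc in docs for t in doc})  # alphabetical term index
    counts = [Counter(doc) for doc in docs]           # term frequency per document
    n = len(docs)
    # IDF = log(total documents / documents containing the term)
    idf = {t: math.log(n / sum(1 for c in counts if t in c)) for t in index}
    vectors = []
    for c in counts:
        weights = [c[t] * idf[t] for t in index]      # weight = tf * idf
        norm = math.sqrt(sum(w * w for w in weights)) # sqrt of sum of squares
        vectors.append([w / norm if norm else 0.0 for w in weights])
    return index, vectors

docs = [["text", "mining", "text"], ["data", "mining"], ["text", "data", "index"]]
index, vectors = tfidf_vectors(docs)
```

After normalization every document vector has length 1, so long and short documents become comparable.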
How to measure the similarity between two documents?
Use cosine similarity (the cosine of the angle between the two document vectors).
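A minimal sketch of the cosine measure between two term-weight vectors of equal length. Returning 0.0 for an all-zero vector is a convention chosen here, and `cosine_similarity` is a hypothetical helper name:

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumed convention for empty documents
    return dot / (norm_u * norm_v)

same_direction = cosine_similarity([1, 2], [2, 4])  # proportional vectors
no_overlap = cosine_similarity([1, 0], [0, 1])      # no shared terms
```

Documents with proportional term weights score close to 1, and documents that share no terms score 0. If the vectors are already normalized to length 1, the dot product alone gives the same result.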