Text Mining Flashcards

Question 1

Q

What methods does text mining use?

Answer

A

Information retrieval and pre-processing of text documents.

Question 2

Q

What tasks does text mining perform?

Answer

A

Text classification, text clustering, and text summarization.

Question 3

Q

What is an issue with text mining vs traditional data mining?

Answer

A

Traditional data mining works on structured data. Text often has no real structure.

Question 4

Q

What is a Vector Space Model?

Answer

A

Each document is represented as a “bag” of words: a vector of term counts or weights in which word order is ignored.
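
As a rough illustration of the bag-of-words idea, the sketch below counts words and lays the counts out in a shared vocabulary order. The helper names `bag_of_words` and `to_vector` and the two sample sentences are made up for this example, and splitting on whitespace is a simplifying assumption.

```python
from collections import Counter

def bag_of_words(text):
    """Count how often each whitespace-separated token occurs; order is discarded."""
    return Counter(text.lower().split())

def to_vector(bag, vocabulary):
    """Lay the counts out in a fixed vocabulary order so documents are comparable."""
    return [bag[term] for term in vocabulary]

# Hypothetical sample documents: same words, different order.
doc_a = "the cat sat on the mat"
doc_b = "the mat sat on the cat"

bag_a = bag_of_words(doc_a)
bag_b = bag_of_words(doc_b)
vocabulary = sorted(set(bag_a) | set(bag_b))

print(vocabulary)
print(to_vector(bag_a, vocabulary))
print(bag_a == bag_b)  # word order is lost, so the two bags compare equal
```

The two sentences produce the same bag, which is exactly the information the model gives up.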

Question 5

Q

What is a problem with the Vector Space Model?

Answer

A

There are many words in the English language, so document vectors have a very large number of dimensions and most of their entries are zero.

Question 6

Q

How do you fix the limitations of the Vector Space Model?

Answer

A

Remove the stop words (“a”, “the”, “this”, “that”, …).

Stemming (e.g. combining the past- and present-tense forms of the same verb).
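
A minimal sketch of both fixes follows. The stop-word set is a tiny illustrative list and `toy_stem` is a hypothetical suffix stripper written for this card, not a real stemming algorithm such as Porter's; a production pipeline would normally rely on an established library.

```python
STOP_WORDS = {"a", "the", "this", "that"}  # tiny illustrative list, not a standard one
SUFFIXES = ("ing", "ed", "s")              # hypothetical toy rules, not a real stemmer

def toy_stem(word):
    """Strip the first matching suffix if at least three characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Lowercase, drop stop words, then stem what is left."""
    tokens = text.lower().split()
    return [toy_stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("The interested reader walked to this interesting talk"))
# ['interest', 'reader', 'walk', 'to', 'interest', 'talk']
```

Both steps shrink the vocabulary: stop words disappear entirely, and related forms collapse onto one term.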

Question 7

Q

How do you assign the weight (importance) of a term in text mining?

Answer

A

Use TF-IDF

Weight = TF * IDF

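TF = Term Frequency (how many times the term occurs in the document)

IDF = Inverse Document Frequency = log (total documents / document frequency), where document frequency is the number of documents that contain the term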
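
The formula can be checked on a small example. This sketch assumes the natural logarithm (the card does not fix a base) and uses four made-up token lists; the function names `idf` and `tf_idf` are chosen for illustration.

```python
import math

def idf(term, documents):
    """log(total documents / number of documents containing the term).

    Assumes the term occurs in at least one document; otherwise this divides by zero.
    """
    document_frequency = sum(1 for doc in documents if term in doc)
    return math.log(len(documents) / document_frequency)

def tf_idf(term, doc, documents):
    """Weight = TF * IDF, with TF taken as the raw count in the document."""
    return doc.count(term) * idf(term, documents)

# Hypothetical, already tokenized documents.
documents = [
    ["text", "mining", "text"],
    ["data", "mining"],
    ["text", "retrieval"],
    ["data", "base"],
]

print(tf_idf("text", documents[0], documents))       # 2 * log(4 / 2)
print(tf_idf("retrieval", documents[2], documents))  # 1 * log(4 / 1)
```

A term that appears in every document gets an IDF of log(1) = 0, so it carries no weight no matter how often it occurs.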

Question 8

Q

What are the steps involved in text mining?

Answer

A

Get the text.
Remove the stop words.
Convert all the words to lowercase (optional step).
Stem related words to a common root (e.g. “interesting” and “interested” both become “interest”).
Count the term frequency.
Create an index file that lists every term with its frequency, sorted alphabetically.
Create the Vector Space Model: each document's vector holds, for every indexed term, the number of times the term occurs in that document (1 if it occurs once, 3 if it occurs three times).
Compute the IDF for each term: log (total documents / number of documents the term appears in).
Compute the weight (TF * IDF).
Normalize so that no weight exceeds 1: divide each term's weight by the square root of the sum of the squared weights in the document.
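
The later steps (index, vectors, IDF, weighting, normalization) can be sketched as follows. This is a minimal illustration, assuming the documents have already been lowercased, stripped of stop words, and stemmed. The sample documents, the variable names, and the choice of the natural logarithm are all assumptions.

```python
import math
from collections import Counter

# Assumed input: three made-up documents that are already preprocessed.
docs = [
    ["text", "mine", "text"],
    ["data", "mine"],
    ["text", "retrieval"],
]

# Index: every distinct term, sorted alphabetically.
index = sorted({term for doc in docs for term in doc})

# Vector Space Model: one row per document, one column per indexed term,
# holding the number of times the term occurs in that document.
counts = [Counter(doc) for doc in docs]
tf_vectors = [[c[term] for term in index] for c in counts]

# IDF per term: log(total documents / documents containing the term).
n_docs = len(docs)
idf = [math.log(n_docs / sum(1 for c in counts if term in c)) for term in index]

# Weight = TF * IDF.
weights = [[tf * w for tf, w in zip(row, idf)] for row in tf_vectors]

def normalize(row):
    """Divide each weight by the square root of the sum of squared weights."""
    norm = math.sqrt(sum(w * w for w in row))
    return [w / norm for w in row] if norm else row

normalized = [normalize(row) for row in weights]

print(index)
for row in normalized:
    print([round(w, 3) for w in row])
```

After normalization each document vector has length 1, so long and short documents become directly comparable.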

Question 9

Q

How do you measure the similarity between two documents?

Answer

A

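Use cosine similarity (the cosine of the angle between the two document vectors); cosine distance is 1 minus this value.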
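
A short sketch of the cosine measure on two term-weight vectors is shown below. The function names and the sample vectors are illustrative, and returning 0.0 for an all-zero vector is a convention chosen here rather than part of the definition.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for empty documents
    return dot / (norm_u * norm_v)

def cosine_distance(u, v):
    """Distance form of the same measure: 1 minus the similarity."""
    return 1.0 - cosine_similarity(u, v)

same_direction = cosine_similarity([1, 2, 0], [2, 4, 0])  # same term proportions
no_overlap = cosine_similarity([1, 0, 0], [0, 1, 0])      # no shared terms

print(same_direction, no_overlap)
```

Because only the angle matters, a document and a copy of it repeated twice score as (almost exactly) identical.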

(9 cards)