Text Mining Flashcards

1
Q

What methods does text mining use?

A

Information retrieval and pre-processing of text documents.

2
Q

What tasks does text mining perform?

A

Text classification, text clustering, and text summarization.

3
Q

What is an issue with text mining vs traditional data mining?

A

Traditional data mining works on structured data. Text often has no real structure.

4
Q

What is a Vector Space Model?

A

A document is represented as a “bag” of words: a vector with one entry per term, where word order is ignored.
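
As a rough illustration of the bag-of-words idea, the sketch below counts words and lays the counts out in a shared term order. The function names `bag_of_words` and `to_vector` and the sample sentences are made up for this example, and splitting on whitespace is a simplifying assumption.

```python
from collections import Counter

def bag_of_words(text):
    """Count how often each lowercase word occurs; word order is discarded."""
    return Counter(text.lower().split())

def to_vector(bag, vocabulary):
    """Lay the counts out in a fixed term order so documents are comparable."""
    return [bag[term] for term in vocabulary]

doc_a = bag_of_words("the cat sat on the mat")
doc_b = bag_of_words("the dog sat")

# The shared vocabulary defines the dimensions of the vector space.
vocabulary = sorted(set(doc_a) | set(doc_b))
vec_a = to_vector(doc_a, vocabulary)
vec_b = to_vector(doc_b, vocabulary)
```

Each position in `vec_a` and `vec_b` refers to the same term, which is what lets two documents be compared as vectors.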

5
Q

What is a problem with the Vector Space Model?

A

There are many words in the English language, so each document vector has a very large number of dimensions, most of them zero.

6
Q

How do you fix the limitations of the Vector Space Model?

A

Removing the stop words (“A, the, this, that …”)

Stemming (e.g. combining similar verb forms, such as past and present tense)
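
A minimal sketch of both fixes, under stated assumptions: the stop-word list is a tiny illustrative one, and `toy_stem` is a hypothetical suffix stripper written for this card, not a real stemming algorithm such as Porter's.

```python
STOP_WORDS = {"a", "the", "this", "that"}  # tiny illustrative list, not a standard one
SUFFIXES = ("ing", "ed", "s")              # crude rule set, assumed for illustration

def remove_stop_words(tokens):
    """Drop tokens that carry little meaning on their own."""
    return [t for t in tokens if t not in STOP_WORDS]

def toy_stem(token):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

tokens = "the player walked and this player walks".split()
stems = [toy_stem(t) for t in remove_stop_words(tokens)]
```

Here "walked" and "walks" both reduce to "walk", so they share one dimension instead of two.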

7
Q

How do you assign the weight (importance) of a term in text mining?

A

Use TF-IDF

Weight = TF * IDF

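TF = Term Frequency (how many times the term occurs in the document)

IDF = Inverse Document Frequency = log (total documents / document frequency), where document frequency is the number of documents that contain the term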
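
The formula can be tried out on a toy corpus. This is a sketch only: the function name `tf_idf`, the three sample documents, and the use of the natural logarithm are assumptions, since the card does not fix a log base.

```python
import math

def tf_idf(term, doc_tokens, corpus):
    """Weight = TF * IDF, with IDF = log(total documents / document frequency)."""
    tf = doc_tokens.count(term)
    df = sum(1 for tokens in corpus if term in tokens)
    if df == 0:
        return 0.0  # term appears nowhere, so it gets no weight
    return tf * math.log(len(corpus) / df)

corpus = [
    ["text", "mining", "text"],
    ["data", "mining"],
    ["data", "cleaning"],
]
w_text = tf_idf("text", corpus[0], corpus)      # tf = 2, df = 1
w_mining = tf_idf("mining", corpus[0], corpus)  # tf = 1, df = 2
```

"text" outweighs "mining" in the first document because it occurs more often there and appears in fewer documents overall.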

8
Q

What are the steps involved in text mining?

A
  1. Get the text
  2. Remove the stop words
  3. Convert all the words to lowercase (optional step)
  4. Stem related words to a common root (e.g. interesting, interested -> interest)
  5. Count the term frequency
  6. Create an index file, which has all the terms and their frequencies. Sort it alphabetically.
  7. Create the Vector Space Model: for each term, put its count in the document's vector (occurs once, put 1; occurs 3 times, put 3).
  8. Compute the IDF: log (how many documents there are / how many documents this word appears in).
  9. Compute the weight (tf * idf)
  10. Normalize so that no weight exceeds 1. For each term, the weight is divided by the square root of the sum of all the squared weights in that document.
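
The steps above can be sketched end to end as follows. This is one possible reading, not a reference implementation: `preprocess` and `normalized_tf_idf` are hypothetical names, the stop-word list is illustrative, stemming (step 4) is left out for brevity, and the natural logarithm is assumed.

```python
import math
from collections import Counter

STOP_WORDS = {"a", "the", "this", "that", "is"}  # illustrative list only

def preprocess(text):
    """Steps 2-3: lowercase and drop stop words (stemming, step 4, is omitted)."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def normalized_tf_idf(texts):
    docs = [Counter(preprocess(t)) for t in texts]       # step 5: term frequencies
    index = sorted(set().union(*docs))                   # step 6: alphabetical index
    n = len(docs)
    # Step 8: IDF = log(total documents / documents containing the term).
    idf = {t: math.log(n / sum(1 for d in docs if t in d)) for t in index}
    vectors = []
    for d in docs:
        weights = [d[t] * idf[t] for t in index]         # steps 7 and 9: tf * idf
        length = math.sqrt(sum(w * w for w in weights))  # step 10: vector length
        vectors.append([w / length if length else 0.0 for w in weights])
    return index, vectors

index, vectors = normalized_tf_idf([
    "The text is mined",
    "The data is mined",
    "This data is cleaned",
])
```

After step 10 every document vector has length 1, so long and short documents can be compared on equal terms.
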
9
Q

How do you measure the similarity between two documents?

A

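Use cosine similarity: the cosine of the angle between the two document vectors (cosine distance is 1 minus this value).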
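
A small sketch of the cosine measure on two term vectors. The function name `cosine_similarity` and the sample vectors are made up for illustration, and returning 0.0 for an all-zero vector is a convention assumed here.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention assumed here for empty documents
    return dot / (norm_u * norm_v)

same = cosine_similarity([1, 2, 0], [2, 4, 0])       # same direction
unrelated = cosine_similarity([1, 0, 0], [0, 3, 0])  # no shared terms
```

Documents with the same mix of terms score close to 1 regardless of length, and documents with no terms in common score 0.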
