Session 6.4 Flashcards

1
Q

Why text mining?

A

➢ Text is everywhere

➢ It takes too much time to read a million customer reviews or tweets.

➢ Text mining helps us to reduce the information and draw out the important features.

2
Q

A token/term

A

A single unit of text, e.g., a word or a group of words

3
Q

A document

A

one piece of text

4
Q

A corpus

A

A collection of documents

5
Q

Inverse document frequency (IDF)

A

A measure of how sparse (rare) a term t is across the corpus
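The usual log formulation is idf(t) = log(N / df(t)), where N is the number of documents and df(t) counts documents containing t. A minimal sketch (the function name and the toy corpus are my own; real implementations often add smoothing):

```python
import math

def idf(term, corpus):
    """IDF of a term: log(N / df), where df is the number of
    documents in the corpus that contain the term."""
    n_docs = len(corpus)
    df = sum(1 for doc in corpus if term in doc.split())
    return math.log(n_docs / df)

corpus = ["the cat sat", "the dog ran", "a rare aardvark"]
print(idf("the", corpus))       # common term -> low IDF (log 3/2)
print(idf("aardvark", corpus))  # rare term -> high IDF (log 3/1)
```

This makes the next two cards concrete: a term in every document gets idf = log(1) = 0, while a term in a single document gets the maximum value log(N).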

6
Q

A common term in the corpus has

A

low IDF

7
Q

A rare term in the corpus has

A

high IDF

8
Q

TFIDF is high when

A

both TF and IDF values are high

i.e., the word is rare in the corpus but frequent in a single document.

9
Q

TFIDF

A

Product of Term Frequency (TF) and Inverse Document Frequency (IDF)
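The product can be sketched directly. Raw-count TF with log IDF is one common weighting choice among several; the function name and the review corpus are illustrative:

```python
import math
from collections import Counter

def tfidf(term, doc, corpus):
    """TF-IDF = term frequency in the document x inverse
    document frequency across the corpus."""
    tf = Counter(doc.split())[term]
    df = sum(1 for d in corpus if term in d.split())
    idf = math.log(len(corpus) / df)
    return tf * idf

corpus = ["phone battery battery great",
          "phone screen cracked",
          "phone camera blurry"]
# "battery" is frequent in doc 0 but rare in the corpus -> high TF-IDF
# "phone" appears in every document -> IDF = log(1) = 0 -> TF-IDF = 0
print(tfidf("battery", corpus[0], corpus))
print(tfidf("phone", corpus[0], corpus))
```

This matches the previous card: TF-IDF is high exactly when the word is frequent in one document (high TF) but rare in the corpus (high IDF).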

10
Q

Is there any disadvantage of the bag-of-words/N-grams approach?

A

Yes, there could be massive numbers of features, requiring a lot of memory and computational resources.

Possible Solutions:

  1. Feature selection
  2. Special consideration of computation and storage requirements
  3. Cleaning and preprocessing text
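A small sketch (names are my own) showing why the feature count explodes: every distinct word in the corpus becomes one feature, so vocabulary size, not document length, drives memory use.

```python
from collections import Counter
from itertools import chain

def bag_of_words(corpus):
    """Map each document to a vector of raw token counts over the
    shared vocabulary. Vocabulary size = number of features."""
    vocab = sorted(set(chain.from_iterable(d.split() for d in corpus)))
    vectors = [[Counter(doc.split())[w] for w in vocab] for doc in corpus]
    return vocab, vectors

vocab, vectors = bag_of_words(["the cat sat", "the cat ran fast"])
print(vocab)    # every distinct word is a feature
print(vectors)  # one count vector per document, mostly zeros at scale
```

With a million reviews the vocabulary easily reaches hundreds of thousands of words, which is why sparse representations and the cleaning steps below matter.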
11
Q

Case normalization

A

➢ Computers often treat capitalized words as different from their lowercase counterparts.
➢ Converting every word to lowercase
➢ Can be helpful
➢ Can be harmful when capital letters help us identify different things (e.g., “US” vs. “us”)

12
Q

Removing punctuation

A

➢ Can be helpful
e.g., “music” and “music.” will be correctly identified as the same word.
➢ Can be harmful when we are interested in how certain punctuation is used

13
Q

Removing numbers

A

➢ Depending on the purpose of the analysis, we may want to remove numbers.
➢ Don’t do this if we want to text-mine quantities.

14
Q

Removing stopwords

A

➢ Stopwords are frequently used words in the corpus but don’t offer much insight into the documents.
e.g., common stopwords in English: “the”, “and”, “of”, “is”, etc.
➢ Don’t remove stopwords that we are interested in and want to text mine.
e.g., if we want to look at tense in English, then we shouldn’t remove the word “is” or “was”, etc.
➢ Remove stopwords that we are not interested in.
e.g., suppose we are studying a corpus of customer reviews of a phone and almost every review contains the word “phone”; the word then behaves like a stopword and can be removed.

15
Q

Word stemming and stem completion

A

➢ Word stemming reduces words to their word stem or root, so that different versions of the same word are unified across documents.
e.g., “announces”, “announced” and “announcing” are all reduced to “announc”

➢ May get some word stems that are not real words! e.g., “announc”

➢ Can choose to do stem completion.

i.e., reconstructing the word stems into a known word
e.g., “announc” -> “announce”
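A crude suffix-stripping sketch, a toy stand-in for a real algorithm such as Porter’s (the function name and suffix list are my own), reproducing the card’s “announc” example:

```python
def toy_stem(word, suffixes=("ing", "ed", "es", "s")):
    """Toy stemmer: drop the first matching suffix, keeping a
    stem of at least three characters."""
    for suf in suffixes:
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: -len(suf)]
    return word

for w in ["announces", "announced", "announcing"]:
    print(toy_stem(w))  # all three reduce to "announc"
```

Note the output “announc” is not a real word, which is exactly why stem completion (mapping it back to “announce”) can be a useful follow-up step.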

16
Q

Cleaning and preprocessing text

A
  1. Case normalization
  2. Removing punctuation
  3. Removing numbers
  4. Removing stopwords
  5. Word stemming and stem completion
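The five steps above can be sketched as one pipeline (stopword list and function names are illustrative; the stemmer is the same toy suffix stripper used earlier, not a production algorithm):

```python
import string

STOPWORDS = {"the", "and", "of", "is", "a", "to"}

def preprocess(text):
    """Apply the five cleaning steps from the cards, in order."""
    # 1. Case normalization
    text = text.lower()
    # 2. Removing punctuation
    text = text.translate(str.maketrans("", "", string.punctuation))
    # 3. Removing numbers
    text = "".join(ch for ch in text if not ch.isdigit())
    tokens = text.split()
    # 4. Removing stopwords
    tokens = [t for t in tokens if t not in STOPWORDS]
    # 5. Word stemming (toy suffix stripper)
    def stem(w):
        for suf in ("ing", "ed", "es", "s"):
            if w.endswith(suf) and len(w) - len(suf) >= 3:
                return w[: -len(suf)]
        return w
    return [stem(t) for t in tokens]

print(preprocess("The band announced 2 new albums!"))
```

Order matters in practice: lowercasing before stopword removal ensures “The” matches the stopword “the”, and stemming runs last so it sees cleaned tokens.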