Session 6.4 Flashcards

Question 1

Q

Why text mining?

Answer

A

➢ Text is everywhere

➢ It takes too much time to read a million customer reviews or tweets.

➢ Text mining helps us to reduce the information and draw out the important features.

Question 2

Q

A token/term

Answer

A

A unit of text, e.g., a word or a group of words

Question 3

Q

A document

Answer

A

one piece of text

Question 4

Q

A corpus

Answer

A

A collection of documents
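The three terms above (token, document, corpus) can be sketched in a few lines. This is only an illustration: the sample sentences and the `tokenize` helper are made up for this card, and splitting on whitespace is the simplest possible tokenizer, not the only one.

```python
# A corpus is a collection of documents; each document is one piece of text.
corpus = [
    "the phone has a great camera",
    "the battery life is poor",
]

def tokenize(document, n=1):
    """Split a document into word n-grams (n=1 gives single-word tokens)."""
    words = document.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

unigrams = tokenize(corpus[0])       # tokens that are single words
bigrams = tokenize(corpus[0], n=2)   # tokens that are groups of two words

print(unigrams)  # ['the', 'phone', 'has', 'a', 'great', 'camera']
print(bigrams)   # ['the phone', 'phone has', 'has a', 'a great', 'great camera']
```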

Question 5

Q

Inverse document frequency (IDF)

Answer

A

A measure of how sparse (rare) a term t is across the documents in a corpus
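A minimal sketch of IDF, assuming the common textbook form log(N / df), where N is the number of documents and df is the number of documents containing the term. The cards do not state a formula, so this variant (and the toy corpus) is an assumption; other versions add smoothing or use a different log base.

```python
import math

corpus = [
    "the phone has a great camera",
    "the battery drains fast",
    "the screen is bright",
    "the camera app crashes",
]

def idf(term, documents):
    """log(N / df): N documents in total, df of them contain the term."""
    df = sum(1 for doc in documents if term in doc.split())
    if df == 0:
        raise ValueError(f"{term!r} does not occur in the corpus")
    return math.log(len(documents) / df)

idf_the = idf("the", corpus)          # in all 4 documents: log(4/4) = 0
idf_camera = idf("camera", corpus)    # in 2 of 4 documents: log(2)
idf_battery = idf("battery", corpus)  # in 1 of 4 documents: log(4)
```

The rarer the term is across documents, the larger its IDF; a term that appears in every document scores zero under this variant.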

Question 6

Q

A common term in the corpus has

Answer

A

a low IDF value

Question 7

Q

A rare term in the corpus has

Answer

A

a high IDF value

Question 8

Q

TFIDF is high when

Answer

A

both TF and IDF values are high

i.e., the word is rare in the corpus but frequent in a single document.

Question 9

Q

TFIDF

Answer

A

Product of Term Frequency (TF) and Inverse Document Frequency (IDF)
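The product can be sketched as follows. The definitions used here are assumptions, since the cards do not fix them: TF is the term's count divided by the document length, and IDF is log(N / df). The corpus and function names are illustrative only.

```python
import math
from collections import Counter

corpus = [
    "the camera is great the camera is sharp",
    "the battery is poor",
    "the screen is bright",
]
tokenized = [doc.split() for doc in corpus]

def tf(term, tokens):
    """Share of the document's tokens that are this term."""
    return Counter(tokens)[term] / len(tokens)

def idf(term, docs):
    """log(N / df); assumes the term occurs in at least one document."""
    df = sum(1 for tokens in docs if term in tokens)
    return math.log(len(docs) / df)

def tfidf(term, tokens, docs):
    return tf(term, tokens) * idf(term, docs)

# "camera" is frequent in document 0 but rare in the corpus: high score.
score_camera = tfidf("camera", tokenized[0], tokenized)
# "the" is just as frequent in document 0 but appears everywhere: score 0.
score_the = tfidf("the", tokenized[0], tokenized)
```

Both words have the same TF in the first document (2 of 8 tokens), so the difference in score comes entirely from IDF.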

Question 10

Q

Is there any disadvantage to the bag-of-words/N-grams approach?

Answer

A

Yes, there could be massive numbers of features, requiring a lot of memory and computational resources.

Possible solutions:

➢ Feature selection
➢ Special consideration of memory and storage space
➢ Cleaning and preprocessing text
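To see how quickly the feature count grows, the sketch below counts distinct features for unigrams versus unigrams plus bigrams on a tiny made-up corpus, then applies one simple kind of feature selection. The threshold (keep features found in at least 2 documents) is an arbitrary assumption chosen for illustration.

```python
from collections import Counter

corpus = [
    "the camera is great",
    "the battery is poor",
    "the camera app is slow",
]

def ngrams(document, n):
    words = document.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def vocabulary(documents, max_n):
    """All distinct n-grams for n = 1..max_n; each one becomes a feature."""
    return {g for doc in documents
            for n in range(1, max_n + 1)
            for g in ngrams(doc, n)}

vocab_1 = vocabulary(corpus, 1)  # 8 features
vocab_2 = vocabulary(corpus, 2)  # 17 features: 8 unigrams + 9 bigrams

# Feature selection: keep only features found in at least 2 documents.
doc_freq = Counter(g for doc in corpus
                   for g in set(ngrams(doc, 1) + ngrams(doc, 2)))
selected = {g for g, df in doc_freq.items() if df >= 2}
```

Even with three short documents, adding bigrams roughly doubles the number of features; on a real corpus the growth is far larger.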

Question 11

Q

Case normalization

Answer

A

➢ Computers often treat capitalized words as different from their lowercase counterparts.
➢ Case normalization converts every word to lowercase.
➢ Can be helpful
➢ Can be harmful when capital letters help us to tell different things apart
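A short sketch of both sides of case normalization, using Python's built-in `str.lower()`. The example words are invented for illustration.

```python
tokens = ["Music", "music", "MUSIC"]
normalized = [t.lower() for t in tokens]

distinct_before = len(set(tokens))     # 3: each spelling counts as its own word
distinct_after = len(set(normalized))  # 1: all three collapse to "music"

# The harmful case: the country "US" and the pronoun "us" become identical.
collision = "US".lower() == "us"       # True
```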

Question 12

Q

Removing punctuation

Answer

A

➢ Can be helpful
e.g., “music” and “music.” will be correctly identified as the same word.
➢ Can be harmful when we are interested in how certain punctuation is used
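One possible way to strip punctuation in Python is `str.translate` with `string.punctuation`. Note that `string.punctuation` covers ASCII punctuation only, so characters such as curly quotes would need to be handled separately; the sample sentence is made up.

```python
import string

def remove_punctuation(text):
    """Delete every ASCII punctuation character from the text."""
    return text.translate(str.maketrans("", "", string.punctuation))

cleaned = remove_punctuation("I love music. Music, to me, is life!")
print(cleaned)  # I love music Music to me is life

# "music." and "music" now match ...
same_word = remove_punctuation("music.") == "music"
# ... but the emphasis carried by "?!" is lost.
lost_emphasis = remove_punctuation("Really?!")  # "Really"
```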

Question 13

Q

Removing numbers

Answer

A

➢ Depending on the purpose of the analysis, we may want to remove numbers.
➢ Don’t do this if we want to text mine quantities.
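A minimal sketch of number removal with a regular expression. It deletes runs of digits and tidies the leftover spaces; mixed tokens such as "5G" or decimals such as "3.5" would leave fragments behind, so a real pipeline may need a more careful rule.

```python
import re

def remove_numbers(text):
    """Drop runs of digits, then collapse the whitespace left behind."""
    without_digits = re.sub(r"\d+", "", text)
    return re.sub(r"\s+", " ", without_digits).strip()

cleaned = remove_numbers("Ordered 2 phones on day 14 of the sale")
print(cleaned)  # Ordered phones on day of the sale
```

The quantity "2" is gone from the output, which is exactly why this step should be skipped when quantities are what we want to mine.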

Question 14

Q

Removing stopwords

Answer

A

➢ Stopwords are words that occur frequently in the corpus but offer little insight into the documents.
e.g., common stopwords in English: “the”, “and”, “of”, “is”, etc.
➢ Don’t remove stopwords that we are interested in and want to text mine.
e.g., if we want to look at tense in English, then we shouldn’t remove the words “is”, “was”, etc.
➢ Remove stopwords that we are not interested in.
e.g., suppose we are studying a corpus about customer reviews on a phone and almost every customer review in
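Stopword removal can be sketched as a simple filter. The stopword set below contains only the four examples listed on this card and is an assumption for illustration; real stopword lists are much longer and should be adjusted to the analysis.

```python
STOPWORDS = {"the", "and", "of", "is"}  # illustrative, taken from the card

def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Keep only the tokens that are not in the stopword set."""
    return [t for t in tokens if t not in stopwords]

tokens = "the camera of the phone is sharp and bright".split()

kept = remove_stopwords(tokens)
print(kept)  # ['camera', 'phone', 'sharp', 'bright']

# When studying tense, take "is" out of the stopword set so it survives.
kept_for_tense = remove_stopwords(tokens, STOPWORDS - {"is"})
print(kept_for_tense)  # ['camera', 'phone', 'is', 'sharp', 'bright']
```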

Question 15

Q

Word stemming and stem completion

Answer

A

➢ Word stemming reduces words to their word stem or root, so that different versions of the same word are unified across documents.
e.g., “announces”, “announced” and “announcing” are all reduced to “announc”

➢ May get some word stems that are not real words! e.g., “announc”

➢ Can choose to do stem completion.

i.e., reconstructing the word stems into known words
e.g., “announc” -> “announce”
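The idea can be sketched with a toy suffix-stripping stemmer. This is not the Porter algorithm or any library's stemmer: the suffix list, the minimum stem length, and the "shortest matching word" completion rule are all assumptions chosen so that the example on this card works.

```python
SUFFIXES = ("ing", "ed", "es", "s")  # toy rule set, checked in this order

def stem(word):
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def complete_stem(word_stem, dictionary):
    """Return the shortest dictionary word that starts with the stem."""
    candidates = [w for w in dictionary if w.startswith(word_stem)]
    return min(candidates, key=len) if candidates else word_stem

words = ["announces", "announced", "announcing"]
stems = [stem(w) for w in words]
print(stems)  # ['announc', 'announc', 'announc']

completed = complete_stem("announc", ["announce", "announcement", "annual"])
print(completed)  # announce
```

Choosing the shortest match is only one completion strategy; picking the most frequent matching word in the corpus is another reasonable option.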

Question 16

Q

Cleaning and preprocessing text

Answer

A

Case normalization
Removing punctuation
Removing numbers
Removing stopwords
Word stemming and stem completion
