Text Mining Flashcards
What is:
A Token?
A Token or Term is an individual unit of which a Document is composed. Most commonly, Tokens are single words, but they can also be word groups (N-grams) or other units.
What is:
A Corpus
A Corpus is a collection of Documents, each of which is composed of Tokens.
What is:
A Bag of Words Representation?
A Bag of Words Representation is a method in text mining where every document is treated simply as a collection of distinct words. Each possible word is a feature and each document in the corpus is an instance. A feature takes the value 1 if the word is present in the document and 0 if it is not.
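A minimal sketch of a binary bag of words, using scikit-learn's CountVectorizer on a made-up toy corpus (the three sentences are illustrative only):

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "the new iPhone arrived today",
    "the old phone still works",
    "a new phone arrived",
]

# binary=True records only presence (1) or absence (0) of each word per document
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())  # the word features
print(X.toarray())                         # one row (instance) per document
```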
What is:
Term Frequency?
Term Frequency is a method similar to the Bag of Words approach, but it uses the word count for each feature instead of a 0/1 dummy.
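A small sketch of raw term-frequency counts, using only the Python standard library on a made-up example sentence:

```python
from collections import Counter

doc = "the new phone replaced the old phone"
tf = Counter(doc.split())
print(tf)  # e.g. Counter({'the': 2, 'phone': 2, 'new': 1, ...})
```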
What is:
Stemming?
Stemming is the text-preprocessing step of removing word suffixes, for example reducing plural nouns to their singular form.
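A sketch of suffix stripping with NLTK's Porter stemmer (assumes the nltk package is installed; the example words are arbitrary):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["phones", "running", "studies"]:
    print(word, "->", stemmer.stem(word))
# phones -> phone, running -> run, studies -> studi
```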
What is:
Normalizing?
Normalizing is the act of putting Terms or Tokens in lowercase, so that the same Term is not split into separate features because of its capitalization (iPhone and iphone).
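A one-line sketch of normalization by lowercasing (toy token list):

```python
tokens = ["iPhone", "iphone", "IPHONE"]
print({t.lower() for t in tokens})  # all map to the single feature {'iphone'}
```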
What is:
IDF?
IDF or Inverse Document Frequency is a weight that boosts Terms that are rare in the corpus: the fewer documents a Term appears in, the higher its IDF.
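A sketch of the plain IDF formula, idf(t) = log(N / df(t)), on a made-up corpus of three tiny documents; real libraries usually add smoothing (e.g. scikit-learn uses log((1+N)/(1+df)) + 1):

```python
import math

docs = [{"new", "phone"}, {"old", "phone"}, {"rare", "term"}]
N = len(docs)

def idf(term):
    df = sum(term in d for d in docs)  # number of documents containing the term
    return math.log(N / df)

print(idf("phone"))  # appears in 2 of 3 documents -> low IDF
print(idf("rare"))   # appears in 1 of 3 documents -> higher IDF
```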
What is:
TF-IDF?
TF-IDF is a per-document value representation that multiplies the Term Frequency of a Term or Token in that document by the IDF of that Token across all of the corpus’ documents.
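A minimal sketch using scikit-learn's TfidfVectorizer on the same kind of made-up toy corpus as above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the new iPhone arrived today",
    "the old phone still works",
    "a new phone arrived",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

# Terms that occur in few documents get a higher weight than very common terms.
print(vectorizer.get_feature_names_out())
print(X.toarray().round(2))
```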
What is:
An N-gram
An N-gram is a representation tactic that treats groups of N adjacent words as a single feature. A “Bag of N-Grams up to x” representation uses all N-grams with N ≤ x as features (e.g. single words plus adjacent word pairs for x = 2), as the sketch below illustrates.
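A sketch of a bag of N-grams up to 2 with scikit-learn's CountVectorizer (the example sentence is made up):

```python
from sklearn.feature_extraction.text import CountVectorizer

# ngram_range=(1, 2) builds a "bag of n-grams up to 2": unigrams and bigrams
vectorizer = CountVectorizer(ngram_range=(1, 2))
vectorizer.fit(["text mining is fun"])
print(vectorizer.get_feature_names_out())
# ['fun' 'is' 'is fun' 'mining' 'mining is' 'text' 'text mining']
```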
What is:
Named Entity Extraction?
Named Entity Extraction is a tactic that is much more knowledge-intensive than N-Grams or Bag of Words, since the entities to be recognised need to be hardcoded or learned from a very large Corpus.
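A sketch of named entity extraction with spaCy's pretrained pipeline (assumes spaCy and the en_core_web_sm model are installed; the example sentence is made up):

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # pretrained English pipeline
doc = nlp("Apple released the iPhone in California in 2007.")
for ent in doc.ents:
    print(ent.text, ent.label_)
# e.g. Apple ORG, California GPE, 2007 DATE
```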
What are:
Topic Models?
A Topic Model is a type of latent information model where an extra layer is added between the document and the words or tokens. The Topic Layer is a clustering of words: it tells us which terms or tokens are linked to which topic and what weight they should get.
Why is text mining so important?
Text Mining is important because increasing amounts of unstructured data are entering our lives, rather than structured, numeric data. This ‘messy’ data can nevertheless contain important information for very useful applications, e.g. it can recognise the vocabulary a child would use and use that to distinguish adult impostors posing as children online.
What is LDA?
Latent Dirichlet Allocation is a widely used Topic Model that represents each document as a mixture of topics and each topic as a probability distribution over words.
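A minimal sketch of LDA with scikit-learn on a made-up four-document corpus (document counts feed the model, which returns per-document topic mixtures and per-topic word weights):

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "cats and dogs are pets",
    "dogs chase cats",
    "stocks and bonds are investments",
    "investors buy stocks",
]

counts = CountVectorizer().fit_transform(corpus)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)  # per-document topic mixture
print(doc_topics.round(2))
print(lda.components_.shape)            # per-topic word weights (topics x vocabulary)
```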