Text Mining Flashcards

1
Q

What is:

A Token?

A

A Token, or Term, is one of the individual units that make up a Document. Tokens are most commonly words, but they can also be groups of words (N-grams) and so on.
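
As a rough illustration of splitting a Document into word Tokens, here is a minimal sketch. The function name `tokenize` and the regular expression are assumptions made for this example; real tokenizers handle punctuation, hyphens, and languages far more carefully.

```python
import re

def tokenize(text):
    """Hypothetical helper: return runs of letters, digits, or apostrophes as tokens."""
    return re.findall(r"[A-Za-z0-9']+", text)

tokens = tokenize("Text mining turns documents into tokens.")
print(tokens)
```

Each element of `tokens` would then become a candidate feature in the representations described on the later cards.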

2
Q

What is:

A Corpus?

A

A Corpus is a collection of Documents, each of which is made up of Tokens.

3
Q

What is:

A Bag of Words Representation?

A

A Bag of Words Representation is a method in text mining where every document is simply treated as a collection of individual words, ignoring grammar and word order. Each possible word is a feature and each document of the corpus is an instance. In its simplest form, each feature takes a value of 1 (present in document) or 0 (not present in document).
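
The 0/1 encoding described above can be sketched as follows. The two-document corpus and the helper name `binary_bow` are made up for illustration, and splitting on whitespace stands in for a proper tokenizer.

```python
# Toy corpus (an assumption for this example): each string is one document.
corpus = ["the cat sat", "the dog sat down"]

# Vocabulary: every distinct word in the corpus, sorted for a stable column order.
vocabulary = sorted({word for doc in corpus for word in doc.split()})

def binary_bow(doc):
    """Hypothetical helper: 1 if the vocabulary word occurs in doc, else 0."""
    words = set(doc.split())
    return [1 if term in words else 0 for term in vocabulary]

# One row per document (instance), one column per word (feature).
matrix = [binary_bow(doc) for doc in corpus]
print(vocabulary)
print(matrix)
```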

4
Q

What is:

Term Frequency?

A

Term Frequency is similar to the Bag of Words approach, but it uses the number of times the word occurs in the document, instead of a 0/1 dummy, as the value of each feature.
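
A minimal sketch of counting terms per document, using the standard library's `collections.Counter`. The function name `term_frequency` and the sample sentence are assumptions for this example.

```python
from collections import Counter

def term_frequency(doc):
    """Hypothetical helper: map each lowercased word in doc to its count."""
    return Counter(doc.lower().split())

tf = term_frequency("the cat chased the other cat")
print(tf["the"], tf["cat"], tf["chased"])
```

A `Counter` returns 0 for words that never occur, which matches the idea that absent terms have a frequency of zero.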

5
Q

What is:

Stemming?

A

Stemming is a text preprocessing step that removes suffixes from words so that related forms share a common stem; for example, plural nouns are reduced to their singular form.
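
To make the idea concrete, here is a deliberately naive suffix stripper. It is a toy written for this card, not an established algorithm such as the Porter stemmer; the name `naive_stem`, the suffix list, and the minimum stem length of 3 are all assumptions.

```python
def naive_stem(word):
    """Toy stemmer: strip the first matching suffix if at least 3 letters remain."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["jumping", "jumped", "cats", "is"]:
    print(w, "->", naive_stem(w))
```

A rule set this small will mangle many words, which is why practical work relies on a tested stemmer from an NLP library.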

6
Q

What is:

Normalizing?

A

Normalizing is the act of converting Terms or Tokens to lowercase, so that the same Term is not split into separate features because of how it is written (iPhone and iphone).
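
A short sketch of the iPhone example, assuming the tokens are already available as a list of strings:

```python
# Three spellings of the same term (sample data for this example).
tokens = ["iPhone", "iphone", "IPHONE"]

# Lowercasing maps every variant onto a single feature.
normalized = [t.lower() for t in tokens]
distinct = set(normalized)
print(distinct)
```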

7
Q

What is:

IDF?

A

IDF, or Inverse Document Frequency, is a weight that boosts a term because it is rare in the corpus: the fewer documents contain the term, the higher its IDF.
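
One common way to compute this weight is the logarithm of the number of documents divided by the number of documents containing the term. The card does not say which variant is intended, so the formula, the function name `idf`, and the toy corpus below are assumptions; other variants add 1 or apply smoothing.

```python
import math

def idf(term, corpus):
    """Assumed variant: log(N / df), where df counts documents containing term."""
    n_docs = len(corpus)
    doc_freq = sum(1 for doc in corpus if term in doc.split())
    if doc_freq == 0:
        raise ValueError(f"term {term!r} does not occur in the corpus")
    return math.log(n_docs / doc_freq)

corpus = ["the cat sat", "the dog sat", "the bird flew"]
print(idf("the", corpus))   # term in every document
print(idf("bird", corpus))  # term in a single document
```

Under this variant a term that appears in every document gets a weight of zero, while a term that appears in only one document gets the largest weight.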

8
Q

What is:

TFIDF?

A

TF-IDF is a value specific to a document that weights the frequency of a Term or Token in that document by the IDF of that Token across all of the corpus’ documents (typically the product TF × IDF).
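
A minimal sketch of the product form, assuming raw counts for TF and log(N / df) for IDF. Both choices, the function name `tf_idf`, and the sample corpus are assumptions for illustration; the document being scored is assumed to be part of the corpus, so every term has a document frequency of at least 1.

```python
import math
from collections import Counter

def tf_idf(doc, corpus):
    """Hypothetical helper: score each term in doc as count * log(N / df)."""
    n_docs = len(corpus)
    counts = Counter(doc.split())
    scores = {}
    for term, tf in counts.items():
        # Number of documents in the corpus that contain the term.
        doc_freq = sum(1 for d in corpus if term in d.split())
        scores[term] = tf * math.log(n_docs / doc_freq)
    return scores

corpus = ["the cat sat", "the dog sat", "the bird bird flew"]
scores = tf_idf(corpus[2], corpus)
print(scores)
```

In this toy corpus "bird" scores highest for the third document because it is both frequent there and absent elsewhere, while "the" scores zero because it occurs in every document.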

9
Q

What is:

An N-gram?

A

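An N-gram is a general representation tactic that treats sequences of N adjacent words as a single feature.

A “Bag of N-Grams up to x” contains every individual word, every pair of adjacent words, and so on, up to sequences of x adjacent words.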
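
The following sketch builds such a bag from a token list. The helper names `ngrams` and `bag_of_ngrams_up_to` and the sample tokens are assumptions for this example.

```python
def ngrams(tokens, n):
    """Return all runs of n adjacent tokens, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bag_of_ngrams_up_to(tokens, max_n):
    """Collect 1-grams, 2-grams, ... up to max_n-grams as features."""
    features = []
    for n in range(1, max_n + 1):
        features.extend(ngrams(tokens, n))
    return features

tokens = ["exceed", "analyst", "expectation"]
features = bag_of_ngrams_up_to(tokens, 2)
print(features)
```

The feature count grows quickly with x, which is the main cost of this representation.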

10
Q

What is:

Named Entity Extraction?

A

Named Entity Extraction is a tactic that recognizes the names of entities (such as people, companies, or places) in text. It is much more knowledge intensive than N-grams or Bag of Words, since the entities need to be hardcoded or learned from a very large Corpus.

11
Q

What are:

Topic Models?

A

A Topic Model is a type of latent information model in which an extra layer is added between the document and the words or tokens. The Topic Layer is a clustering of words. The model learns which terms or tokens are linked to which topic and what weight they should get.

12
Q

Why is text mining so important?

A

Text Mining is important because increasing amounts of unstructured data are coming into our lives, as opposed to structured, numeric data. This ‘messy’ data, however, can contain important information for very useful applications. For example, it can be used to recognise the vocabulary that a child would use and so detect impostors posing as children online, helping to protect children from online predators.

13
Q

What is LDA?

A

Latent Dirichlet Allocation is a type of Topic Model.
