B03 Text Analytics II Flashcards

1
Q

Text Pre-Processing Steps

A

-Tokenize text
-Normalize spelling
-Remove stop words
-Stem words
-Normalize case
-Tag parts of speech
-Detect sentence boundaries

2
Q

Pre-Processing Definitions: Tokenization

A

-This involves breaking up text data into individual units, or tokens; in other words, creating a bag of words from a document.
-A token is most often a single word, but can also be an n-gram, sentence, or paragraph.
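
A minimal base-R sketch of word tokenization (the sample sentence is made up):

  text <- "The cow jumped over the moon."
  # remove punctuation, lower-case, and split on whitespace
  strsplit(tolower(gsub("[[:punct:]]", "", text)), "\\s+")[[1]]
  # [1] "the" "cow" "jumped" "over" "the" "moon"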

3
Q

Pre-Processing Definitions: n-gram

A

An n-gram is a contiguous sequence of n words within a text. For example, the bigrams (2-grams) of "The cow jumped over the moon" are: "the cow", "cow jumped", "jumped over", "over the", "the moon".
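
One way to produce these bigrams, assuming the tidytext and dplyr packages are installed:

  library(dplyr)
  library(tidytext)
  tibble(text = "The cow jumped over the moon") %>%
    unnest_tokens(bigram, text, token = "ngrams", n = 2)
  # bigram: "the cow", "cow jumped", "jumped over", "over the", "the moon"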

4
Q

Pre-Processing Definitions: Part-Of-Speech (POS) Tagging

A

-This involves labeling a word in text as belonging to a particular part of speech, such as noun, verb, or adjective.
-Part-of-speech tagging is also known as "grammatical tagging" or "word-category disambiguation".
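
A hedged sketch using the udpipe package (assumes the package is installed and a pre-trained model can be downloaded; the sample sentence is made up):

  library(udpipe)
  m <- udpipe_download_model(language = "english")  # fetches a pre-trained model
  model <- udpipe_load_model(m$file_model)
  tags <- udpipe_annotate(model, x = "The cow jumped over the moon")
  as.data.frame(tags)$upos
  # universal POS tags, e.g. DET NOUN VERB ADP DET NOUN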

5
Q

Pre-Processing Definitions: Stop Words

A

-This involves removing words that occur frequently but have very little analytic impact.
-Common examples of stop words include "a", "an", "and", "but", "by", "if", "it", "that", and "the".
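
A minimal sketch with a toy stop list (both vectors are made up; see also the anti_join() card later in this deck):

  tokens <- c("the", "cow", "jumped", "over", "the", "moon")
  stops  <- c("a", "an", "and", "the", "over")
  tokens[!tokens %in% stops]
  # [1] "cow" "jumped" "moon"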

6
Q

Pre-Processing Definitions: Stemming

A

-The process of normalizing related word tokens into a single form.
-This typically includes the identification and removal of prefixes, suffixes, and inappropriate pluralizations.
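
A minimal sketch, assuming the SnowballC package is installed:

  library(SnowballC)
  wordStem(c("connection", "connected", "connecting"), language = "english")
  # [1] "connect" "connect" "connect"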

7
Q

Pre-Processing Definitions: Lemmatization

A

-A more advanced form of stemming that attempts to group words based on their core concept, or lemma.
-It uses both the context surrounding the word and additional grammatical information, such as part of speech, to determine the lemma.
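
A hedged sketch using the textstem package (one of several R packages that offer lemmatization):

  library(textstem)
  lemmatize_words(c("am", "are", "geese", "running"))
  # [1] "be" "be" "goose" "run"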

8
Q

Pre-Processing Definitions: Spelling Normalization

A

This involves resolving spelling mistakes or eliminating spelling variations. Approaches include:
1. Dictionary-based lookup.
2. Fuzzy matching algorithms.
3. Word clustering and concept-expansion techniques.
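
A minimal sketch of dictionary-based fuzzy matching using base R's edit-distance function (the dictionary is made up):

  dictionary <- c("receive", "believe", "separate")
  misspelled <- "recieve"
  # pick the dictionary word with the smallest Levenshtein distance
  dictionary[which.min(adist(misspelled, dictionary))]
  # [1] "receive"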

9
Q

Pre-Processing Definitions: Sentence Boundary Detection

A

-The process of breaking entire documents down into individual grammatical sentences.
-For English text, this is almost as easy as finding every occurrence of punctuation marks like ".", "?", or "!", though abbreviations such as "Dr." complicate matters.
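
A naive base-R sketch that splits on sentence-ending punctuation; note how the abbreviation "Dr." trips it up:

  doc <- "Dr. Smith arrived. The cow jumped over the moon! Did you see it?"
  strsplit(doc, "(?<=[.!?])\\s+", perl = TRUE)[[1]]
  # [1] "Dr." "Smith arrived." "The cow jumped over the moon!" "Did you see it?"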

10
Q

Pre-Processing Definitions: Case Normalization

A

-This involves converting the entire document to either completely lower-case or completely upper-case characters.
-While mixed-case text may help humans differentiate between common nouns and proper nouns, it is not always useful to algorithms.
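
In base R this is a one-liner:

  tolower("The Cow Jumped Over The Moon")
  # [1] "the cow jumped over the moon"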

11
Q

Define Vector-space Model

A

-In the vector-space model, the rows represent documents and the columns represent tokens (vectors).
-The elements of the model represent the occurrence of tokens within the text.
-A commonly used vector-space model in text analytics is the document-term matrix.
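
A minimal sketch that builds a document-term matrix with tidytext (assumes the tidytext, dplyr, and tm packages are installed; the documents are made up):

  library(dplyr)
  library(tidytext)
  docs <- tibble(doc = c(1, 2),
                 text = c("the cow jumped", "the moon"))
  docs %>%
    unnest_tokens(word, text) %>%
    count(doc, word) %>%
    cast_dtm(doc, word, n)   # returns a tm-style DocumentTermMatrix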

12
Q

Define the ‘bag-of-words assumption’

A

The vector-space model makes an implicit assumption that the order of words (or tokens) in a document does not matter. This is known as the bag-of-words assumption.

13
Q

The Vector-space Model can take on 3 forms, these are:

A
  1. Binary representation
  2. Frequency count
  3. Float-valued weighted vector
14
Q

Ex. Binary Representation

A
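
For example, using the made-up documents d1 = "the cow jumped" and d2 = "the moon", the binary document-term matrix is:

        the  cow  jumped  moon
  d1     1    1     1      0
  d2     1    0     0      1

Each element is 1 if the term appears in the document and 0 otherwise.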
15
Q

Ex. Frequency Count

A
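
For example, using the made-up documents d1 = "the cow jumped over the moon" and d2 = "the moon", the frequency-count document-term matrix is:

        the  cow  jumped  over  moon
  d1     2    1     1      1     1
  d2     1    0     0      0     1

Each element is the number of times the term occurs in the document.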
16
Q

Define Weighted Vector in the Context of Vector Space Model

A

-Beyond simply using word frequency, the perceived importance of the word is sometimes also considered and applied as a weight.
-This requires choosing a weighting scheme.
-One of the most popular schemes is the tf-idf weighting approach.

17
Q

Explain TF-IDF

A

-TF stands for term frequency: the proportion of times a term appears in a document.
-IDF stands for inverse document frequency. The document frequency for a term is the proportion of documents in which that term appears; the IDF is the log of the inverse of this proportion.

18
Q

Inverse document frequency is calculated as:

A
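
Using the natural-log convention (as in the tidytext package):

  idf(term) = ln( n_documents / n_documents containing the term )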
19
Q

The TF-IDF for a term is calculated as:

A
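
  tf-idf(term, document) = tf(term, document) × idf(term)

That is, the term's frequency within the document multiplied by its inverse document frequency.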
20
Q

Ex. of Weighted Vector

A
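
A worked illustration with made-up documents d1 = "the cow" and d2 = "the moon":
- "the" appears in both documents, so idf("the") = ln(2/2) = 0 and its tf-idf weight is 0 in both.
- "cow" appears only in d1: tf = 1/2, idf = ln(2/1) ≈ 0.69, so tf-idf ≈ 0.35.
Common words are down-weighted toward 0, while distinctive words receive higher weights.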
21
Q

The Tidy Text Format

The tidy data format requires that:

A
  • Each variable is a column
  • Each observation is a row
  • Each type of observational unit is a table

Applied to text data, this requires that tidy text tables have one token per row.
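
For example, the made-up document "the cow jumped" in tidy text format:

  doc  word
  1    the
  1    cow
  1    jumped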

22
Q

The Tidy Text Format

Tidy data sets allow data to be manipulated with a standard set of “tidy” tools and functions.

For instance:

A
  • dplyr
  • broom
  • ggplot2
  • tidyr
23
Q

The TidyText Package

A
  • Developed by Julia Silge and David Robinson.
  • Provides functions and supporting data sets for text analysis.
  • Provides support to tidy objects from other popular text mining packages such as tm and quanteda.
24
Q

What does unnest_tokens() function do?

A
  • Removes punctuation.
  • Removes whitespace.
  • Converts text to lowercase.
  • Puts a single token per row (default is one word per row).
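
A minimal sketch, assuming tidytext and dplyr are installed (the sample text is made up):

  library(dplyr)
  library(tidytext)
  tibble(doc = 1, text = "The cow JUMPED over the moon!") %>%
    unnest_tokens(word, text)
  # punctuation removed, text lower-cased, one word per row:
  # "the" "cow" "jumped" "over" "the" "moon"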
25
Q

What does the anti_join() function do?

A

-Returns all records from table 1 where there are no matching values in table 2.
-Keeps only columns from table 1.
-Used to remove stop words.
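
A minimal sketch using the stop_words data set that ships with tidytext:

  library(dplyr)
  library(tidytext)
  tidy_words <- tibble(word = c("the", "cow", "jumped", "over", "the", "moon"))
  tidy_words %>%
    anti_join(stop_words, by = "word")
  # rows remaining: "cow", "jumped", "moon"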

26
Q

What does bind_tf_idf() function do?

A

-Binds the tf-idf of a tidytext data set to the original data.
-The data set must have more than one document, and it must have one row per document-term combination.
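
A minimal sketch (the word counts are made up):

  library(dplyr)
  library(tidytext)
  counts <- tibble(doc  = c(1, 1, 2),
                   word = c("cow", "moon", "moon"),
                   n    = c(2, 1, 1))
  counts %>% bind_tf_idf(word, doc, n)
  # adds tf, idf, and tf_idf columns; "moon" appears in both
  # documents, so its idf (and hence tf_idf) is 0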
