B03 Text Analytics II Flashcards
Text Pre-Processing Steps
- Tokenize Text
- Normalize Spelling
- Remove Stop Words
- Stem Words
- Normalize Case
- Tag Parts of Speech
- Detect Sentence Boundaries
Pre-Processing Definitions: Tokenization
- This involves breaking up text data into individual units, or tokens; in other words, creating a bag of words from a document.
- Tokens are most often single words, but can also be n-grams, sentences, or paragraphs.
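A minimal sketch of word-level tokenization using NLTK (an assumed tool choice; the flashcards do not name a library). It requires `nltk` to be installed and the "punkt" tokenizer models downloaded:

```python
import nltk

# One-time setup (assumption: network access to fetch the model):
# nltk.download("punkt")

text = "The cow jumped over the moon."
tokens = nltk.word_tokenize(text)  # splits the string into word tokens
print(tokens)
# ['The', 'cow', 'jumped', 'over', 'the', 'moon', '.']
```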
Pre-Processing Definitions: n-gram
An n-gram is a contiguous sequence of n words within a text. For example, the bigrams (n = 2) of "The cow jumped over the moon" are: "the cow", "cow jumped", "jumped over", "over the", "the moon".
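A small pure-Python sketch of n-gram extraction; the `ngrams` helper is hypothetical, written here just for illustration:

```python
def ngrams(tokens, n):
    """Return every contiguous sequence of n tokens (hypothetical helper)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the cow jumped over the moon".split()
print(ngrams(tokens, 2))
# [('the', 'cow'), ('cow', 'jumped'), ('jumped', 'over'),
#  ('over', 'the'), ('the', 'moon')]
```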
Pre-Processing Definitions: Part-Of-Speech (POS) Tagging
- This involves labeling each word in a text as belonging to a particular part of speech, such as a noun, verb, or adjective.
- Part-of-speech tagging is also known as "grammatical tagging" or "word-category disambiguation".
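A minimal sketch using NLTK's off-the-shelf tagger (again an assumed tool choice); the tag names follow the Penn Treebank convention:

```python
import nltk

# One-time setup: nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")
tokens = nltk.word_tokenize("The cow jumped over the moon.")
print(nltk.pos_tag(tokens))
# [('The', 'DT'), ('cow', 'NN'), ('jumped', 'VBD'), ('over', 'IN'),
#  ('the', 'DT'), ('moon', 'NN'), ('.', '.')]
```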
Pre-Processing Definitions: Stop Words
- This involves removing words that occur frequently but carry very little analytic value.
- Common examples of stop words include "a", "an", "and", "but", "by", "if", "it", "that", and "the".
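A minimal sketch of stop-word removal using NLTK's built-in English stop-word list (an assumption; any curated list would do):

```python
from nltk.corpus import stopwords

# One-time setup: nltk.download("stopwords")
stop_words = set(stopwords.words("english"))

tokens = ["the", "cow", "jumped", "over", "the", "moon"]
kept = [t for t in tokens if t not in stop_words]
print(kept)  # ['cow', 'jumped', 'moon']  ("over" is also on NLTK's list)
```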
Pre-Processing Definitions: Stemming
- The process of normalizing related word tokens into a single form.
- This typically includes the identification and removal of prefixes, suffixes, and inappropriate pluralizations.
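A minimal sketch using the Porter stemmer from NLTK (an assumed choice; the flashcards do not prescribe an algorithm):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["jumped", "jumping", "jumps", "ponies"]:
    print(word, "->", stemmer.stem(word))
# jumped -> jump, jumping -> jump, jumps -> jump, ponies -> poni
# Note the last result: stems need not be dictionary words.
```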
Pre-Processing Definitions: Lemmatization
- A more advanced form of stemming that attempts to group words based on their core concept, or lemma.
- It uses both the context surrounding the word and additional grammatical information, such as part of speech, to determine the lemma.
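A minimal sketch using NLTK's WordNet lemmatizer (assumed); note how the part-of-speech hint changes the result, which a plain stemmer cannot do:

```python
from nltk.stem import WordNetLemmatizer

# One-time setup: nltk.download("wordnet")
lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize("ran", pos="v"))     # run  (treated as a verb)
print(lemmatizer.lemmatize("better", pos="a"))  # good (treated as an adjective)
```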
Pre-Processing Definitions: Spelling Normalization
This involves resolving spelling mistakes or eliminating spelling variations. Approaches include using:
1. A dictionary-based approach.
2. Fuzzy matching algorithms (see the sketch below).
3. Word clustering and concept expansion techniques.
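A minimal sketch of the fuzzy-matching approach (item 2) using Python's standard-library `difflib`; the mini-dictionary of correct spellings is purely illustrative:

```python
from difflib import get_close_matches

# Illustrative dictionary of known-good spellings (an assumption)
vocabulary = ["analytics", "tokenize", "normalize", "sentence"]

for misspelled in ["analitics", "tokenise", "normalze"]:
    match = get_close_matches(misspelled, vocabulary, n=1, cutoff=0.7)
    print(misspelled, "->", match[0] if match else "(no match)")
# analitics -> analytics, tokenise -> tokenize, normalze -> normalize
```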
Pre-Processing Definitions: Sentence Boundary Detection
- The process of breaking an entire document into its individual grammatical sentences.
- For English text, this is almost, but not quite, as easy as finding every occurrence of punctuation marks such as ".", "?", or "!"; abbreviations like "Dr." are the main complication.
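A minimal sketch using NLTK's punkt sentence tokenizer (assumed), which handles common abbreviations better than naive punctuation splitting:

```python
import nltk

# One-time setup: nltk.download("punkt")
text = "Dr. Smith arrived. Was the experiment ready? It was!"
print(nltk.sent_tokenize(text))
# ['Dr. Smith arrived.', 'Was the experiment ready?', 'It was!']
# A naive split on "." would wrongly break after "Dr."
```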
Pre-Processing Definitions: Case Normalization
- This involves converting the entire document to either completely lower-case or completely upper-case characters.
- While mixed-case text may help humans differentiate between common nouns and proper nouns, it is not always useful for algorithms.
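Case normalization is a one-liner in most languages; in Python:

```python
text = "The Cow jumped over The Moon."
print(text.lower())  # "the cow jumped over the moon."
print(text.upper())  # "THE COW JUMPED OVER THE MOON."
```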
Define Vector-space Model
- In the vector-space model, the rows represent documents and the columns represent tokens; each row is therefore a vector describing one document.
- The elements of the model record the occurrence of tokens within the text.
- A commonly used vector-space model in text analytics is the document-term matrix.
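A minimal sketch of building a document-term matrix with scikit-learn (an assumed tool choice); the two documents are purely illustrative:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the cow jumped over the moon",   # document 1
    "the moon is full",               # document 2
]

vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(docs)  # rows = documents, columns = tokens

print(vectorizer.get_feature_names_out())
# ['cow' 'full' 'is' 'jumped' 'moon' 'over' 'the']
print(dtm.toarray())
# [[1 0 0 1 1 1 2]
#  [0 1 1 0 1 0 1]]
```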
Define the ‘bag-of-words assumption’
The vector-space model makes an implicit assumption that the order of words (or tokens) in a document does not matter. This is known as the bag-of-words assumption.
The Vector-space Model can take on 3 forms; these are:
1. Binary representation
2. Frequency count
3. Float-valued weighted vector (e.g., tf-idf weights)
Ex. Binary Representation
Each element is 1 if the token appears anywhere in the document and 0 otherwise, regardless of how many times it occurs.
Ex. Frequency Count
Each element is the number of times the token occurs in the document; for example, "the" scores 2 in "the cow jumped over the moon" rather than 1.
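A sketch of all three forms on the same two illustrative documents, again assuming scikit-learn:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the cow jumped over the moon", "the moon is full"]

# 1. Binary representation: 1 if the token occurs in the document, else 0
print(CountVectorizer(binary=True).fit_transform(docs).toarray())
# [[1 0 0 1 1 1 1]
#  [0 1 1 0 1 0 1]]

# 2. Frequency count: how many times each token occurs
print(CountVectorizer().fit_transform(docs).toarray())
# [[1 0 0 1 1 1 2]
#  [0 1 1 0 1 0 1]]

# 3. Float-valued weighted vector: e.g. tf-idf weights
print(TfidfVectorizer().fit_transform(docs).toarray().round(2))
```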