Text Analysis Flashcards

Question 1

Q

Numbers and strings that are stored as columns in relational databases or dataframes

Answer

A

Structured data

Question 2

Q

Data as is; data doesn’t fit neatly into a database; text and multimedia content

Answer

A

Unstructured data

Question 3

Q

What is text analysis?

Answer

A

Converting textual data into a structured format suitable for analysis

Question 4

Q

Challenges of text analysis

Answer

A

Numerous languages
Variations in usage, grammar, dialects, etc.
Hard to understand context, resolve ambiguity, formally encode the rules of language, etc.

Question 5

Q

Bag of words approach

Answer

A

Each document (bag) is a collection of tokens (words)
The order of words is ignored
Long strings are split into smaller pieces or “tokens”
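
A minimal Python sketch of the bag-of-words idea, assuming a plain whitespace split stands in for tokenization. The function name bag_of_words is illustrative and does not come from any library:

```python
from collections import Counter

def bag_of_words(document: str) -> Counter:
    """Return token counts for a document; word order is discarded."""
    tokens = document.lower().split()  # assumption: tokens are whitespace-separated
    return Counter(tokens)

# Two documents with the same words in a different order produce the same bag.
bag_a = bag_of_words("the dog chased the cat")
bag_b = bag_of_words("the cat chased the dog")
print(bag_a == bag_b)
```

Because only the counts are kept, the two sentences are indistinguishable even though they mean different things, which is the trade-off this approach accepts.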

Question 6

Q

What is a token?

Answer

A

A meaningful unit of text, most often a word, that we are interested in using for further analysis

Question 7

Q

What is tokenization?

Answer

A

The process of splitting text into tokens
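
One possible way to tokenize, sketched in Python with a regular expression. The rule used here (lowercase runs of letters, digits, and apostrophes) is an assumption chosen for illustration; real tokenizers handle many more cases:

```python
import re

def tokenize(text: str) -> list[str]:
    """Split text into lowercase word tokens (letters, digits, apostrophes)."""
    return re.findall(r"[a-z0-9']+", text.lower())

tokens = tokenize("Text analysis isn't hard, is it?")
print(tokens)
```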

Question 8

Q

What are stop words?

Answer

A

Commonly occurring words (such as a, an, and, after, by, why, your, we) that are not informative about the document
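
A short Python sketch of stop-word removal. The stop-word set below contains only the examples listed on this card and is not a complete list; the function name remove_stop_words is hypothetical:

```python
# Only the example words from the card; a real stop-word list is much longer.
STOP_WORDS = {"a", "an", "and", "after", "by", "why", "your", "we"}

def remove_stop_words(tokens: list[str]) -> list[str]:
    """Keep only the tokens that are not in the stop-word set."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

filtered = remove_stop_words(["we", "loved", "your", "pizza", "and", "a", "salad"])
print(filtered)
```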

Question 9

Q

What is stemming?

Answer

A

Reducing words to their “stem”, base, or root form, so that families of related words with similar meanings can be treated as a single unit
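
To make the idea concrete, here is a deliberately naive suffix-stripping sketch in Python. It is not the Porter stemmer or any other published algorithm; the suffix list and the minimum stem length are assumptions made for this example:

```python
SUFFIXES = ("ing", "ed", "es", "s")  # assumption: a tiny illustrative suffix list

def naive_stem(word: str) -> str:
    """Strip the first matching suffix, leaving at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

stems = [naive_stem(w) for w in ["connect", "connected", "connecting", "connects"]]
print(stems)
```

All four words in the example collapse to the same stem, so they would be counted as one unit in later analysis. A rule this simple will mis-stem many English words, which is why production work typically relies on an established stemmer.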

Question 10

Q

Ways of describing structured data

Answer

A

Summary statistics (mean, variance, etc.)
Visualizations such as scatter plots

Question 11

Q

Three ways to describe text

Answer

A

Word count analysis
Word count chart
Word cloud

Question 12

Q

What is word count analysis?

Answer

A

Table with tokens in descending order of frequency
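
A small Python sketch of such a table, using a made-up token list. Counter.most_common() returns the (token, count) pairs sorted from most to least frequent:

```python
from collections import Counter

# Made-up tokens for illustration.
tokens = ["pizza", "great", "pizza", "service", "great", "pizza"]

# most_common() with no argument returns every token, highest count first.
word_counts = Counter(tokens).most_common()

for token, count in word_counts:
    print(f"{token:<10}{count}")
```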

Question 13

Q

What is a word count chart?

Answer

A

Bar chart showing frequency of top N tokens
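
As a rough stand-in for a plotted bar chart, this Python sketch draws text bars for the top N tokens. The helper name top_n_bars and the sample tokens are hypothetical; a plotting library would normally be used instead:

```python
from collections import Counter

def top_n_bars(tokens: list[str], n: int) -> list[str]:
    """Return one text bar per token for the n most frequent tokens."""
    return [f"{tok:<8} {'#' * cnt}" for tok, cnt in Counter(tokens).most_common(n)]

bars = top_n_bars(["pizza", "great", "pizza", "service", "great", "pizza"], n=2)
print("\n".join(bars))
```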

Question 14

Q

What is a word cloud?

Answer

A

Visual representation of frequency (or importance) of words in a corpus

Question 15

Q

What is sentiment analysis?

Answer

A

Also known as opinion mining

The computational study of opinions, sentiments, and emotions expressed in text
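
One simple approach is dictionary-based scoring: look up each token in a sentiment lexicon and add up the scores. The Python sketch below assumes a tiny made-up lexicon; its words and scores are invented for illustration and are not taken from any published dictionary:

```python
# Made-up lexicon: positive numbers mean positive sentiment, negative mean negative.
TOY_LEXICON = {"good": 2, "great": 3, "bad": -2, "terrible": -3}

def sentiment_score(tokens: list[str]) -> int:
    """Sum the lexicon scores of the tokens; unknown words contribute 0."""
    return sum(TOY_LEXICON.get(t.lower(), 0) for t in tokens)

score = sentiment_score("the food was great but the service was bad".split())
print(score)
```

A score above zero suggests the text leans positive overall. Note that this kind of word-by-word lookup ignores negation and context, so "not good" would still count as positive here.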

Question 16

Q

Three built-in dictionaries (lexicons) of tidytext that give sentiment, emotions, etc., for words

Answer

A

AFINN from Finn Årup Nielsen
Bing from Bing Liu and collaborators
NRC from Saif Mohammad and Peter Turney

Question 17

Q

Goal of topic modeling

Answer

A

To discover the latent topics or factors from a large number of text documents

Question 18

Q

Latent Dirichlet Allocation (LDA)

Answer

A

An unsupervised learning method for topic modeling, similar to cluster analysis (where we discover latent groups or clusters)
Can be used, for example, to discover latent quality dimensions from reviews

Question 19

Q

This is the most common algorithm for topic modeling

Answer

A

Latent Dirichlet Allocation (LDA)

Question 20

Q

Two guiding principles of LDA

Answer

A

Every document is a mixture of topics

Every topic is a mixture of words
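
The two principles can be combined in a small Python sketch. All numbers and topic names below are hypothetical, chosen only to show how the mixtures fit together; this illustrates the structure LDA assumes, not how the model is fitted:

```python
# Hypothetical mixture for one document: 70% "food", 30% "service".
doc_topics = {"food": 0.7, "service": 0.3}

# Hypothetical mixture of words for each topic (each row sums to 1).
topic_words = {
    "food": {"pizza": 0.6, "fresh": 0.3, "waiter": 0.1},
    "service": {"pizza": 0.1, "fresh": 0.1, "waiter": 0.8},
}

def word_probability(word: str) -> float:
    """Probability of a word in the document: sum over topics of
    (topic weight in the document) * (word weight in the topic)."""
    return sum(
        weight * topic_words[topic].get(word, 0.0)
        for topic, weight in doc_topics.items()
    )

p_waiter = word_probability("waiter")  # 0.7 * 0.1 + 0.3 * 0.8
print(round(p_waiter, 2))
```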

(20 cards)