Text Analysis Flashcards

1
Q

Numbers and strings that are stored as columns in relational databases or dataframes

A

Structured data

2
Q

Data kept as is that doesn’t fit neatly into a database, such as text and multimedia content

A

Unstructured data

3
Q

What is text analysis?

A

Converting textual data into a structured format suitable for analysis

4
Q

Challenges of text analysis

A

Numerous languages
Variations in usage, grammar, dialects, etc.
Hard to understand context, resolve ambiguity, formally encode the rules of language, etc.

5
Q

Bag of words approach

A

Long strings are split into smaller pieces, or “tokens”
Each document (bag) is a collection of tokens (words)
The order of words is ignored

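The bag-of-words idea above can be sketched in Python. This is a minimal illustration, not a reference implementation: the function name `bag_of_words`, the regex, and the two sample sentences are all assumptions made up for this card.

```python
import re
from collections import Counter

def bag_of_words(document: str) -> Counter:
    """Split a document into lowercase word tokens and count each one."""
    # Assumed tokenization rule: runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", document.lower())
    return Counter(tokens)

doc_a = "the dog chased the cat"
doc_b = "the cat chased the dog"

# Same tokens in a different order give the same bag.
print(bag_of_words(doc_a) == bag_of_words(doc_b))
```

The two sentences mean different things, yet their bags are identical, which is exactly what "the order of words is ignored" implies.
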
6
Q

What is a token?

A

A meaningful unit of text, most often a word, that we are interested in using for further analysis

7
Q

What is tokenization?

A

The process of splitting text into tokens
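A small Python sketch of word-level tokenization. The `tokenize` helper and its regex are assumptions for illustration; real tokenizers handle punctuation, numbers, and other languages with more care.

```python
import re

def tokenize(text: str) -> list[str]:
    """Lowercase the text and return its word tokens in order."""
    # Assumed rule: a token is a run of letters, digits, or apostrophes.
    return re.findall(r"[a-z0-9']+", text.lower())

print(tokenize("Text analysis isn't hard, is it?"))
# ['text', 'analysis', "isn't", 'hard', 'is', 'it']
```
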

8
Q

What are stop words?

A

Commonly occurring words (a, an, and, after, by, why, your, we, etc.) that are not informative about the document
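Stop-word removal can be sketched as a simple filter. The stop list below contains only the example words from this card, and `remove_stop_words` is a hypothetical helper; real stop lists are much longer.

```python
# Tiny stop list built from the example words on this card (an assumption,
# not a standard list).
STOP_WORDS = {"a", "an", "and", "after", "by", "why", "your", "we"}

def remove_stop_words(tokens: list[str]) -> list[str]:
    """Keep only the tokens that are not in the stop list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = ["we", "shipped", "your", "order", "by", "courier"]
print(remove_stop_words(tokens))
# ['shipped', 'order', 'courier']
```
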

9
Q

What is stemming?

A

Reducing words to their “stem”, base, or root form so that families of related words with similar meanings can be treated as a single unit
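A deliberately naive Python sketch of stemming by suffix stripping. The suffix list and the `naive_stem` helper are assumptions for illustration only; established stemmers such as the Porter stemmer apply far more careful rules.

```python
# Assumed suffix list, checked longest first.
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["connect", "connects", "connected", "connecting"]
print({naive_stem(w) for w in words})
# {'connect'}
```

All four related words collapse to one unit, so they would be counted together in later analysis.
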

10
Q

Ways of describing structured data

A

Summary statistics (mean, variance, etc.)
Visualizations such as scatter plots

11
Q

3 ways to describe text

A

Word count analysis
Word count chart
Word cloud

12
Q

What is word count analysis?

A

Table with tokens in descending order of frequency
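Such a table can be sketched in Python with `collections.Counter`. The token list is made-up sample data.

```python
from collections import Counter

# Made-up tokens, e.g. from a handful of restaurant reviews.
tokens = ["good", "food", "good", "service", "slow", "good", "food"]

# most_common() returns (token, count) pairs in descending order of count.
word_counts = Counter(tokens).most_common()

for token, count in word_counts:
    print(f"{token:<10}{count}")
```
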

13
Q

What is a word count chart?

A

Bar chart showing frequency of top N tokens
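A rough text-only sketch of such a chart in Python. The `top_n_bars` helper and the sample tokens are assumptions; in practice a plotting library would draw real bars.

```python
from collections import Counter

def top_n_bars(tokens: list[str], n: int = 3) -> list[str]:
    """Return one text bar per token for the n most frequent tokens."""
    return [
        f"{token:<8} {'#' * count} ({count})"
        for token, count in Counter(tokens).most_common(n)
    ]

# Made-up sample tokens.
sample = ["good", "food", "good", "service", "slow", "good", "food"]
for line in top_n_bars(sample, n=2):
    print(line)
```
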

14
Q

What is a word cloud?

A

Visual representation of frequency (or importance) of words in a corpus

15
Q

What is sentiment analysis?

A

Also known as opinion mining

The computational study of opinions, sentiments, and emotions expressed in text

16
Q

3 built-in dictionaries (lexicons) of tidytext that give sentiment, emotions, etc., for words

A

AFINN from Finn Årup Nielsen
Bing from Bing Liu and collaborators
NRC from Saif Mohammad and Peter Turney
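Dictionary-based scoring can be sketched in Python. tidytext itself is an R package, so this is only a stand-in: `TOY_LEXICON` and its scores are made up for illustration (not values from AFINN, Bing, or NRC), and `sentiment_score` is a hypothetical helper.

```python
# Made-up word scores, loosely in the style of a numeric lexicon.
TOY_LEXICON = {"good": 3, "great": 3, "bad": -3, "slow": -2}

def sentiment_score(tokens: list[str]) -> int:
    """Sum the lexicon scores of the tokens; unknown words count as 0."""
    return sum(TOY_LEXICON.get(t.lower(), 0) for t in tokens)

review = ["great", "food", "but", "slow", "service"]
print(sentiment_score(review))  # 3 + (-2) = 1
```
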

17
Q

Goal of topic modeling

A

To discover the latent topics or factors from a large number of text documents

18
Q

Latent Dirichlet Allocation (LDA)

A

An unsupervised learning method similar to cluster analysis (where we discover latent groups or clusters)
Used, for example, to discover latent quality dimensions from reviews

19
Q

This is the most common algorithm for topic modeling

A

Latent Dirichlet Allocation (LDA)

20
Q

Two guiding principles of LDA

A

Every document is a mixture of topics

Every topic is a mixture of words
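The two principles can be illustrated with a small Python sketch. It does not fit an LDA model; it only shows how word probabilities for one document follow from the two mixtures. The topic names, words, and all probabilities are made up.

```python
# Every topic is a mixture of words (made-up probabilities, each row sums to 1).
topics = {
    "food":    {"pizza": 0.5, "tasty": 0.4, "waiter": 0.1},
    "service": {"pizza": 0.1, "tasty": 0.1, "waiter": 0.8},
}

# Every document is a mixture of topics (made-up weights, sum to 1).
doc_mix = {"food": 0.7, "service": 0.3}

def word_probability(word: str, doc_mix: dict, topics: dict) -> float:
    """P(word | document) = sum over topics of P(topic | doc) * P(word | topic)."""
    return sum(weight * topics[topic].get(word, 0.0)
               for topic, weight in doc_mix.items())

for word in ["pizza", "tasty", "waiter"]:
    print(word, round(word_probability(word, doc_mix, topics), 2))
```

Because this document leans toward the "food" topic, "pizza" ends up more likely than "waiter" even though "waiter" dominates the "service" topic.
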