Text Analysis Flashcards
Numbers and strings that are stored as columns in relational databases or dataframes
Structured data
Data kept as is; it doesn’t fit neatly into a database (for example, text and multimedia content)
Unstructured data
What is text analysis
Converting textual data into a structured format suitable for analysis
Challenges of text analysis
Numerous languages
Variations in usage, grammar, dialects, etc.
Hard to understand context, resolve ambiguity, formally encode rules of language, etc.
Bag of words approach
Each document (bag) is a collection of tokens (words). The order of words is ignored. Long strings are split into smaller pieces, or “tokens”.
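A minimal sketch of the bag-of-words idea in Python, assuming naive whitespace splitting as the tokenizer (the function name and the example sentences are illustrative, not from the course material). It shows that two documents with the same words in a different order produce the same bag.

```python
from collections import Counter

def bag_of_words(text):
    """Return token counts for a document, discarding word order."""
    tokens = text.lower().split()  # assumption: whitespace tokenization
    return Counter(tokens)

doc_a = "the dog chased the cat"
doc_b = "the cat chased the dog"

bag_a = bag_of_words(doc_a)
bag_b = bag_of_words(doc_b)

# The two sentences mean different things, but their bags are identical
# because word order is not part of the representation.
same_bag = bag_a == bag_b
```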
What is a token
A meaningful unit of text, most often a word, that we are interested in using for further analysis
What is tokenization
The process of splitting text into tokens
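One possible way to tokenize, sketched in Python with a regular expression. The rule used here (a token is a run of lowercase letters, digits, or straight apostrophes) is an assumption made for illustration; real tokenizers handle many more cases.

```python
import re

def tokenize(text):
    """Split text into lowercase word tokens, dropping punctuation."""
    # Assumed rule: a token is a run of letters, digits, or apostrophes.
    return re.findall(r"[a-z0-9']+", text.lower())

tokens = tokenize("Text analysis isn't hard, is it?")
# tokens == ['text', 'analysis', "isn't", 'hard', 'is', 'it']
```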
What are stop words
Commonly occurring words (such as a, an, and, after, by, why, your, we, etc.) that are not informative about the document
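A short Python sketch of stop-word removal. The stop list below contains only the example words from this card; lists used in practice are much longer, and the function name is hypothetical.

```python
# Assumption: a tiny stop list built from the examples on the card.
STOP_WORDS = {"a", "an", "and", "after", "by", "why", "your", "we"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop list."""
    return [t for t in tokens if t.lower() not in stop_words]

tokens = ["we", "shipped", "your", "order", "after", "a", "delay"]
informative = remove_stop_words(tokens)
# informative == ['shipped', 'order', 'delay']
```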
What is stemming
Families of related words with similar meanings can be considered as a single unit by reducing words to their “stem”, base, or root form
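To illustrate the idea, here is a deliberately naive suffix-stripping stemmer in Python. It is not the Porter algorithm or any standard stemmer; the suffix list and the minimum stem length are assumptions chosen to keep the sketch short.

```python
# Assumption: a toy suffix list, checked in this order.
SUFFIXES = ("ing", "ed", "s")

def naive_stem(word):
    """Strip one common suffix, keeping a stem of at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

family = ["connect", "connected", "connecting", "connects"]
stems = [naive_stem(w) for w in family]
# Every word in the family reduces to the same stem, "connect".
```

A rule this simple makes mistakes (for example, it turns "running" into "runn"), which is why established stemmers apply many more rules.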
Ways of describing structured data
Summary statistics (mean, variance, etc.). Visualizations such as scatter plots.
3 ways to describe text
Word count analysis
Word count chart
Word cloud
What is word count analysis
Table with tokens in descending order of frequency
What is word count chart
Bar chart showing frequency of top N tokens
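The word count table, and the top-N selection that a word count chart would plot, can be sketched in Python as follows. The token list is made up for illustration, and plotting is left out so the example needs only the standard library.

```python
from collections import Counter

def word_counts(tokens, top_n=None):
    """Return (token, count) pairs in descending order of frequency."""
    return Counter(tokens).most_common(top_n)

tokens = "good food good service slow service good".split()

table = word_counts(tokens)          # full word count table
top_two = word_counts(tokens, 2)     # data for a top-2 bar chart

for token, count in table:
    print(f"{token:<10}{count}")
```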
What is word cloud
Visual representation of frequency (or importance) of words in a corpus
What is sentiment analysis
Aka opinion mining
The computational study of opinions, sentiments, and emotions expressed in text
3 built-in dictionaries of tidytext that give sentiment, emotions, etc., for words
AFINN from Finn Årup Nielsen
Bing from Bing Liu and collaborators
NRC from Saif Mohammad and Peter Turney
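A rough sketch of dictionary-based sentiment scoring in Python. The lexicon below is a toy with invented scores, written in the style of a numeric word-score dictionary; it does not reproduce entries from AFINN, Bing, or NRC.

```python
# Hypothetical scores for illustration only; not taken from any real lexicon.
TOY_LEXICON = {"good": 2, "great": 3, "bad": -2, "terrible": -3}

def sentiment_score(tokens, lexicon=TOY_LEXICON):
    """Sum the lexicon scores of the tokens; unknown words contribute 0."""
    return sum(lexicon.get(t.lower(), 0) for t in tokens)

review = "great food but terrible parking and bad service".split()
score = sentiment_score(review)
# 3 (great) - 3 (terrible) - 2 (bad) = -2, so the review leans negative.
```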
Goal of topic modeling
To discover the latent topics or factors from a large number of text documents
Latent Dirichlet Allocation (LDA)
An unsupervised learning method similar to cluster analysis (where we discover latent groups or clusters)
Can be used, for example, to discover latent quality dimensions from reviews
This is the most common algorithm for topic modeling
Latent Dirichlet Allocation (LDA)
Two guiding principles of LDA
Every document is a mixture of topics
Every topic is a mixture of words
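The two principles can be made concrete with a small Python sketch. It does not fit an LDA model; it only shows how, given topic proportions for a document and word probabilities for each topic, the probability of a word in that document is a weighted sum over topics. All topic names and numbers are hypothetical.

```python
# Hypothetical values chosen only for illustration.
# Every topic is a mixture of words (each row sums to 1).
topic_word = {
    "food":    {"pizza": 0.5, "tasty": 0.4, "waiter": 0.1},
    "service": {"pizza": 0.1, "tasty": 0.1, "waiter": 0.8},
}
# Every document is a mixture of topics (weights sum to 1).
doc_topic = {"food": 0.7, "service": 0.3}

def word_probability(word, doc_topic, topic_word):
    """P(word | document) = sum over topics of P(topic | doc) * P(word | topic)."""
    return sum(
        weight * topic_word[topic].get(word, 0.0)
        for topic, weight in doc_topic.items()
    )

p_pizza = word_probability("pizza", doc_topic, topic_word)
# 0.7 * 0.5 + 0.3 * 0.1 = 0.38
```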