Text Analysis Flashcards
Numbers and strings that are stored as columns in relational databases or dataframes
Structured data
Data kept in its original form that doesn’t fit neatly into a database, such as text and multimedia content
Unstructured data
What is text analysis
Converting textual data into a structured format suitable for analysis
Challenges of text analysis
Numerous languages
Variations in usage, grammar, dialects, etc.
Hard to understand context, resolve ambiguity, formally encode rules of language, etc.
Bag of words approach
Long strings are split into smaller pieces or “tokens”
Each document (bag) is a collection of tokens (words)
The order of words is ignored
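A minimal sketch of the bag of words idea, assuming a naive tokenizer that lowercases and splits on whitespace. The function name bag_of_words and the two sample sentences are made up for illustration. Because word order is ignored, the two sentences end up with identical bags even though they mean different things.

```python
from collections import Counter

def bag_of_words(document):
    # Naive tokenizer (an assumption): lowercase, then split on whitespace.
    tokens = document.lower().split()
    # A Counter maps each token to its count and stores no word order.
    return Counter(tokens)

bag_a = bag_of_words("the cat chased the dog")
bag_b = bag_of_words("the dog chased the cat")

# Same tokens with the same counts, so the bags compare equal.
print(bag_a == bag_b)
```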
What is a token
A meaningful unit of text, most often a word, that we are interested in using for further analysis
What is tokenization
The process of splitting text into tokens
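One possible way to tokenize, sketched with a regular expression from the Python standard library. The pattern below is an assumption chosen for illustration: it keeps runs of letters, digits, and apostrophes and drops everything else. Real tokenizers handle many more cases (hyphens, URLs, non-English scripts).

```python
import re

def tokenize(text):
    # Lowercase first, then keep runs of letters, digits, and apostrophes.
    # Punctuation and whitespace act as separators and are discarded.
    return re.findall(r"[a-z0-9']+", text.lower())

tokens = tokenize("Don't panic: text analysis is fun!")
print(tokens)
```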
What are stop words
Commonly occurring words that are not informative about the document, such as:
a, an, and, after, by, why, your, we, etc.
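A small sketch of stop word removal. The STOP_WORDS set holds only the example words from this card; an actual stop word list would be much longer, and the function name remove_stop_words is hypothetical.

```python
# Only the example words from the card; real lists are far larger.
STOP_WORDS = {"a", "an", "and", "after", "by", "why", "your", "we"}

def remove_stop_words(tokens):
    # Compare in lowercase so "We" and "we" are treated the same.
    return [t for t in tokens if t.lower() not in STOP_WORDS]

filtered = remove_stop_words(["We", "read", "a", "book", "by", "your", "friend"])
print(filtered)
```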
What is stemming
Reducing words to their “stem”, base, or root form so that families of related words with similar meanings can be treated as a single unit
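To make the idea concrete, here is a deliberately naive suffix-stripping sketch. It is not the Porter stemmer or any other established algorithm; the suffix list and the minimum stem length of three characters are assumptions picked for this example. It shows how several word forms can collapse to one stem.

```python
# Hypothetical suffix list, checked in this order.
SUFFIXES = ("ing", "ed", "s")

def naive_stem(word):
    word = word.lower()
    for suffix in SUFFIXES:
        # Strip the suffix only if at least three characters remain,
        # so short words such as "is" or "red" are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

stems = [naive_stem(w) for w in ["walk", "walks", "walked", "walking"]]
print(stems)
```

A rule set this small makes mistakes on many words, which is why established stemmers use longer, ordered rule lists.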
Ways of describing structured data
Summary statistics (mean, variance, etc.)
Visualizations such as scatter plots
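For comparison with the text methods on the other cards, a short sketch of summary statistics for one numeric column, using Python's standard statistics module. The sample values are made up, and the population variance is used here as one possible choice.

```python
import statistics

# Made-up values standing in for one numeric column.
values = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(values)
# Population variance: average squared deviation from the mean.
variance = statistics.pvariance(values)

print(mean, variance)
```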
3 ways to describe text
Word count analysis
Word count chart
Word cloud
What is word count analysis
Table with tokens in descending order of frequency
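A sketch of a word count table built with collections.Counter from the standard library. The sample sentence and whitespace tokenization are assumptions for illustration; most_common() returns (token, count) pairs sorted by descending count.

```python
from collections import Counter

def word_counts(tokens):
    # (token, count) pairs, most frequent first.
    return Counter(tokens).most_common()

tokens = "to be or not to be that is the question".split()
table = word_counts(tokens)

# Print a simple two-column table.
for token, count in table:
    print(f"{token:<10}{count}")
```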
What is a word count chart
A bar chart showing the frequencies of the top N tokens
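A text-only sketch of a top N bar chart, so that it runs without a plotting library. The function name top_n_bar_chart and the sample tokens are hypothetical; in practice a plotting library would draw real bars, but the data preparation is the same.

```python
from collections import Counter

def top_n_bar_chart(tokens, n=3):
    lines = []
    # Take the n most frequent tokens and draw one '#' per occurrence.
    for token, count in Counter(tokens).most_common(n):
        lines.append(f"{token:<8} {'#' * count} {count}")
    return lines

chart = top_n_bar_chart("a b a c a b d".split(), n=2)
print("\n".join(chart))
```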
What is a word cloud
Visual representation of frequency (or importance) of words in a corpus
What is sentiment analysis
Also known as opinion mining
The computational study of opinions, sentiments, and emotions expressed in text
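One simple way to approach sentiment analysis is to count words from positive and negative word lists. The sketch below assumes two tiny, made-up lexicons and whitespace tokenization; it is only an illustration of the lexicon idea, not a description of how any particular tool works.

```python
# Tiny hypothetical lexicons; real ones contain thousands of entries.
POSITIVE = {"good", "great", "love", "excellent"}
NEGATIVE = {"bad", "poor", "hate", "terrible"}

def sentiment_score(text):
    # Whitespace tokenization: words with attached punctuation
    # (for example "great!") will not match the lexicons.
    tokens = text.lower().split()
    positive_hits = sum(t in POSITIVE for t in tokens)
    negative_hits = sum(t in NEGATIVE for t in tokens)
    # Above zero leans positive, below zero leans negative.
    return positive_hits - negative_hits

print(sentiment_score("I love this great phone"))
print(sentiment_score("terrible battery and poor screen"))
```

A score like this ignores negation ("not good") and context, which ties back to the challenges of text analysis listed earlier.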