Book - Chapter 9 Analytical Theory Text Analysis Flashcards
What is text analysis
Representation and processing of text
Why is text analysis high-dimensional
Every distinct term is a dimension
Is the data structured or unstructured
Unstructured
What are the three important steps in the text analysis process
Parsing. Search/retrieval. Text mining
What is parsing
Imposing structure on the unstructured/semistructured text for downstream analysis
What is search/retrieval
Finding which documents contain a given word or phrase, and which documents are about a given topic or entity
What is text mining
Understanding the content, for example through clustering or classification
What are regular expressions
A means of finding words, strings, or particular patterns in text
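Example (a minimal sketch, not from the book): finding patterns with Python's standard `re` module. The sample text and the two patterns are hypothetical and deliberately simple; real email or date matching usually needs stricter patterns.

```python
import re

# Hypothetical sample text for illustration.
text = "Contact sales@example.com or support@example.org before 2024-03-15."

# Simplified patterns (assumptions, not production-grade validators).
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")

# findall returns every non-overlapping match, left to right.
emails = EMAIL.findall(text)
dates = DATE.findall(text)

print(emails)
print(dates)
```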
What does bag of words mean
The most common representation of a document's structure. A bag of words is a vector with one dimension for every unique term in the term space
What is term frequency
The number of times a term occurs in a document; it is the value stored in that term's dimension of the document's vector
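Example (a minimal sketch, assuming a tiny two-document corpus and a naive lowercase tokenizer): each document becomes a vector with one dimension per unique term, and each entry is that term's frequency in the document. The helper names `tokenize` and `bag_of_words` are illustrative, not from the book.

```python
import re
from collections import Counter

def tokenize(text):
    # Naive tokenizer (an assumption): lowercase runs of letters and digits.
    return re.findall(r"[a-z0-9]+", text.lower())

# Hypothetical corpus.
docs = ["the cat sat on the mat", "the dog chased the cat"]

# One dimension for every unique term in the corpus, in a fixed order.
vocabulary = sorted({term for doc in docs for term in tokenize(doc)})

def bag_of_words(text):
    # Term frequency: how often each vocabulary term occurs in this text.
    counts = Counter(tokenize(text))
    return [counts[term] for term in vocabulary]

vectors = [bag_of_words(doc) for doc in docs]

print(vocabulary)
for vec in vectors:
    print(vec)
```

Word order is discarded: only the counts survive, which is why the representation is called a "bag".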
What is a reverse (inverted) index
For every possible feature, a list of all the documents that contain that feature
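Example (a minimal sketch, assuming terms are the features and whitespace splitting is good enough): a reverse index, more commonly called an inverted index, maps each term to the set of document IDs that contain it. The names `build_index` and `search` are hypothetical.

```python
from collections import defaultdict

# Hypothetical corpus keyed by document ID.
docs = {
    0: "the cat sat on the mat",
    1: "the dog chased the cat",
    2: "a bird sang",
}

def build_index(docs):
    # Map every term to the set of documents that contain it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, *terms):
    # Documents containing ALL of the given terms (a simple AND query).
    postings = [index.get(term, set()) for term in terms]
    return sorted(set.intersection(*postings)) if postings else []

index = build_index(docs)
print(search(index, "cat"))
print(search(index, "cat", "dog"))
print(search(index, "zebra"))
```

Answering "which documents have this word" becomes a dictionary lookup rather than a scan of every document.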
What are the corpus metrics
Volume. Corpus-wide term frequencies. Inverse document frequency
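Example (a minimal sketch of the three metrics on a hypothetical corpus): the card does not give a formula for inverse document frequency, so this assumes one common variant, idf(t) = log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing term t. Other variants add smoothing.

```python
import math
from collections import Counter

# Hypothetical corpus.
docs = ["the cat sat on the mat", "the dog chased the cat", "a bird sang"]
tokenized = [doc.lower().split() for doc in docs]

# Volume: the number of documents in the corpus.
volume = len(tokenized)

# Corpus-wide term frequency: total occurrences across all documents.
corpus_tf = Counter(term for doc in tokenized for term in doc)

# Document frequency: the number of documents containing each term.
doc_freq = Counter(term for doc in tokenized for term in set(doc))

# Inverse document frequency (assumed variant): rarer terms score higher.
idf = {term: math.log(volume / df) for term, df in doc_freq.items()}

print(volume, corpus_tf["the"], doc_freq["the"])
print(round(idf["the"], 3), round(idf["bird"], 3))
```

Because every new document changes N and the document frequencies, these values go stale as the corpus grows.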
What is the challenge with a corpus
A corpus is dynamic. The index and metrics must be updated continuously
What are the three things that determine quality of search results
Relevance. Precision. Recall
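Example (a minimal sketch using the standard information-retrieval definitions, which this card does not spell out): precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that were retrieved. The function name and document IDs are hypothetical.

```python
def precision_recall(retrieved, relevant):
    # Compare the set of returned documents with the set of relevant ones.
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    # Guard against empty sets to avoid division by zero.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical query: 4 documents returned, 3 documents actually relevant.
p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5])
print(p, r)
```

Here two of the four returned documents are relevant (precision 0.5), and two of the three relevant documents were found (recall about 0.67).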
What is relevance in the quality of search results
Is the document what I wanted? It is used to rank search results