Book - Chapter 9 Analytical Theory Text Analysis Flashcards
What is text analysis
Representation and processing of text
Why does text analysis have high dimensionality
Every distinct term is a dimension
Is the data structured or unstructured
Unstructured
What are the three important steps in the text analysis process
Parsing. Search/retrieval. Text mining
What is parsing
Imposing structure on the unstructured/semistructured text for downstream analysis
What is search/retrieval
Which documents have this word or phrase. Which documents are about this topic or this entity
What is text mining
Understanding the content. For example clustering, classification
What are regular expressions
A means for finding words, strings, or particular patterns in text
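Example: a minimal sketch of pattern finding with Python's standard `re` module. The sample text and the two patterns are assumptions made for illustration; they are deliberately simple and would miss many real-world email and date formats.

```python
import re

text = "Contact sales@example.com or support@example.org before 2024-05-01."

# Hypothetical, simplified patterns (not full email or date validators).
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
DATE = re.compile(r"\d{4}-\d{2}-\d{2}")

# findall returns every non-overlapping match, in order of appearance.
emails = EMAIL.findall(text)
dates = DATE.findall(text)

print(emails)
print(dates)
```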
What does bag of words mean
Most common representation of the structure. The bag of words is a vector with one dimension for every unique term in the space
What is term frequency
The number of times a term occurs in a document
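Example: a minimal sketch of a bag-of-words representation, where each position in the vector holds the term frequency of one vocabulary term. The two sample documents and the whitespace tokenizer are assumptions for illustration; real parsing usually also handles case, punctuation, and stop words.

```python
from collections import Counter

docs = ["the cat sat on the mat", "the dog sat"]

# Naive whitespace tokenization (an assumption for this sketch).
tokenized = [d.split() for d in docs]

# One dimension for every unique term in the space, in a fixed order.
vocabulary = sorted({term for doc in tokenized for term in doc})

def bag_of_words(tokens):
    """Return the term-frequency vector of one document over `vocabulary`."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]

vectors = [bag_of_words(tokens) for tokens in tokenized]

print(vocabulary)
for v in vectors:
    print(v)
```

Word order is discarded: only the counts survive, which is why the representation is called a bag.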
What is a reverse index
For every possible feature, a list of all the documents that contain that feature
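Example: a minimal sketch of a reverse index built in memory, using terms as the features. The document ids, sample texts, and the helper name `build_reverse_index` are hypothetical.

```python
from collections import defaultdict

docs = {1: "the cat sat", 2: "the dog sat", 3: "a cat ran"}

def build_reverse_index(docs):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

index = build_reverse_index(docs)

# "Which documents have this word?" becomes a dictionary lookup.
print(sorted(index["cat"]))

# Documents containing both words: intersect the two posting sets.
print(sorted(index["cat"] & index["sat"]))
```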
What are the corpus metrics
Volume. Corpus wide term frequencies. Inverse document frequency
What is the challenge with a corpus
A corpus is dynamic. The index and metrics must be updated continuously
What are the three things that determine quality of search results
Relevance. Precision. Recall
What is relevance in the quality of search results
Is the document what I wanted? It is used to rank search results
What is precision in the quality of search results
What percentage of the documents in the results are relevant
What is recall in the quality of search results
Of all the relevant documents in the corpus, what percentage were returned to me
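Example: a minimal sketch of computing precision and recall for one query. The function name and the document ids are hypothetical, and the sketch assumes the set of relevant documents is already known, which in practice requires human judgments.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for one query's result list."""
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant  # relevant documents that were actually returned
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# 4 documents returned, 5 relevant in the corpus, 2 in common.
p, r = precision_recall(returned=[1, 2, 3, 4], relevant=[2, 4, 5, 6, 7])
print(p, r)
```

Here half of the returned documents are relevant (precision), while only two of the five relevant documents were found (recall).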
What does term frequency do
Assigns each term in the document a weight
What does inverse document frequency do
It measures the uniqueness of a term in the corpus
What is tf-idf
It provides a measure that weights the presence of unusual terms in the query as a higher indication of document relevance than the presence of more common terms
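Example: a minimal sketch of tf-idf, assuming the common variant tf x log(N / df), where N is the number of documents and df is the number of documents containing the term. Other weighting and smoothing variants exist, and the tiny corpus here is hypothetical.

```python
import math
from collections import Counter

docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "zebra", "ran"],
]
N = len(docs)

# Document frequency: in how many documents does each term appear?
df = Counter(term for doc in docs for term in set(doc))

def tf_idf(term, doc):
    """Raw term count times log(N / df); one assumed variant among several."""
    tf = doc.count(term)
    idf = math.log(N / df[term]) if df[term] else 0.0
    return tf * idf

common = tf_idf("the", docs[0])     # appears in every document
medium = tf_idf("sat", docs[0])     # appears in two of three documents
rare = tf_idf("zebra", docs[2])     # appears in one document
print(common, medium, rare)
```

A term found in every document scores zero, so it contributes nothing to ranking, while the rarest term scores highest.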
What is authoritativeness
PageRank, as used by Google
What is the recency metric
New documents are more relevant than old ones
Tasks such as reverse indexing and finding the inverse document frequencies and corpus term frequencies are implemented with what
Map and reduce algorithms
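Example: a minimal single-process sketch of the map, shuffle, and reduce steps for building a reverse index and document frequencies. It only imitates the programming model; a real job would run on a distributed framework, and the function names and sample documents here are hypothetical.

```python
from collections import defaultdict
from itertools import chain

docs = {1: "the cat sat", 2: "the dog sat"}

def map_phase(doc_id, text):
    """Emit one (term, doc_id) pair per distinct term in the document."""
    return [(term, doc_id) for term in set(text.split())]

def shuffle(pairs):
    """Group emitted values by key, as the framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(term, doc_ids):
    """Collapse all values for one term into its sorted posting list."""
    return term, sorted(doc_ids)

pairs = chain.from_iterable(map_phase(i, t) for i, t in docs.items())
reverse_index = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

# Document frequency falls out of the posting list lengths.
document_frequency = {t: len(ids) for t, ids in reverse_index.items()}

print(reverse_index)
print(document_frequency)
```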