Exam Deck Flashcards
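Basic measures of a text retrieval (TR) system and their formulas
(a = relevant and retrieved, b = relevant but not retrieved, c = retrieved but not relevant)
- Precision = a / (a + c): are the retrieved results relevant?
- Recall = a / (a + b): have all relevant documents been retrieved?
- F-measure = 2PR / (P + R): combines precision and recall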
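A minimal sketch of the three measures, reading a as relevant and retrieved, b as relevant but not retrieved, and c as retrieved but not relevant (an interpretation consistent with the formulas). The function name and the example counts are made up for illustration.

```python
def precision_recall_f(a, b, c):
    """a = relevant retrieved, b = relevant not retrieved, c = non-relevant retrieved."""
    precision = a / (a + c) if (a + c) else 0.0
    recall = a / (a + b) if (a + b) else 0.0
    # Harmonic mean of precision and recall; defined as 0 when both are 0.
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical counts: 8 relevant retrieved, 2 relevant missed, 4 non-relevant retrieved.
p, r, f = precision_recall_f(8, 2, 4)
print(f"P={p:.3f} R={r:.3f} F={f:.3f}")
```

The guards against empty denominators are a design choice for the sketch; some evaluations leave those cases undefined instead.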
What is the ideal PR curve, what does it characterize
- Horizontal line: precision stays at 1.0 at every recall level
- Characterizes the overall accuracy of a ranked list
What is average precision
- Standard measure for comparing two ranking methods
- Combines Precision and recall
- Sensitive to rank of every relevant document
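A short sketch of average precision for one query, assuming binary relevance judgments given in rank order. The function name, the input format, and the example ranking are illustrative assumptions.

```python
def average_precision(ranked_relevance, total_relevant):
    """ranked_relevance: 1/0 flags in rank order; total_relevant: relevant docs in the collection."""
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            # Precision at the rank where this relevant document appears.
            precision_sum += hits / rank
    # Relevant documents that were never retrieved contribute 0 to the sum.
    return precision_sum / total_relevant if total_relevant else 0.0


# Hypothetical ranking: relevant docs at ranks 1 and 3, one relevant doc never retrieved.
print(average_precision([1, 0, 1, 0, 0], total_relevant=3))
```

Dividing by the total number of relevant documents (not only those retrieved) is what brings recall into the score, and moving any relevant document up or down the list changes the result.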
What is nDCG
- measures the utility of the top k documents
- utility of lowly ranked documents is discounted
- normalized across queries
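A sketch of nDCG at cutoff k, assuming graded gains and a log2(rank + 1) discount. This is one common variant; other formulations use a slightly different discount. The function names and example gains are hypothetical.

```python
import math


def dcg(gains, k):
    """Discounted cumulative gain over the top k gains (rank 1 is undiscounted)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains[:k], start=1))


def ndcg(ranked_gains, all_judged_gains, k):
    """Normalize by the DCG of the ideal ordering of all judged documents for the query."""
    ideal = dcg(sorted(all_judged_gains, reverse=True), k)
    return dcg(ranked_gains, k) / ideal if ideal else 0.0


# Hypothetical graded judgments (0-3) for the top 5 results of one query.
gains = [3, 2, 3, 0, 1]
print(round(ndcg(gains, gains, k=5), 4))
```

Normalizing by the ideal DCG is what makes scores comparable across queries with different numbers of relevant documents.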
Describe all types of feedback
Relevance feedback - Reliable judgement, but requires effort
Pseudo Feedback - Not reliable; assumes the top k ranked docs are relevant, with no user effort
Implicit Feedback - Uses clickthroughs as evidence of relevance
What is Latent semantic indexing
- find a way to represent the term-document space in a lower-dimensional latent space
- reduces storage and improves search when terms are ambiguous
LSI steps
- Factor the term-document matrix (via SVD) into: word assignment to topics, topic importance, topic distribution across documents
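The steps above correspond to the factors of a truncated singular value decomposition. The symbols below (A, U_k, Σ_k, V_k, m, n, k) are notation chosen for this sketch, not taken from the deck.

```latex
A \;\approx\; A_k \;=\; U_k \, \Sigma_k \, V_k^{\top},
\qquad
A \in \mathbb{R}^{m \times n},\;
U_k \in \mathbb{R}^{m \times k},\;
\Sigma_k \in \mathbb{R}^{k \times k},\;
V_k^{\top} \in \mathbb{R}^{k \times n}
```

Here A is the term-document matrix (m terms, n documents), U_k assigns words to the k topics, the diagonal of Σ_k holds the topic importances (the k largest singular values), and V_k^T gives the topic distribution across documents.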
Pros of using VSM
- Automatic selection of index terms
- Partial Matching of queries and documents
- Ranking according to the similarity score
- Term weighting schemes
- Various extensions
Problems with Lexical Semantics
- Synonymy = different terms may have identical or similar meanings; true similarity is high even though the cosine is small
- Polysemy = words often have many meanings and the VSM is unable to discriminate between them; the cosine is large but the true similarity is small
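Both problems can be seen with a plain bag-of-words cosine. The sketch below assumes raw term counts and whitespace tokenization; the example texts are made up.

```python
import math
from collections import Counter


def cosine(text_a, text_b):
    """Cosine similarity between raw term-count vectors of two texts."""
    va, vb = Counter(text_a.split()), Counter(text_b.split())
    dot = sum(count * vb[term] for term, count in va.items())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Synonymy: same meaning, no shared terms, so the cosine is 0.
print(cosine("car repair", "automobile maintenance"))
# Polysemy: "bank" matches although the senses differ, so the cosine is inflated.
print(cosine("bank river", "bank loan"))
```

The first pair scores zero despite being about the same thing, and the second pair scores well above zero purely because of the shared ambiguous word.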
Advantages of Lexical
Main idea of LSI
Perform a low-rank approximation of the document-term matrix
General Idea
- map documents to a low-dimensional space
- represent semantic associations
- compute similarity based on the inner product in semantic space
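A small sketch of this idea using NumPy's SVD on a toy term-document matrix. The terms, counts, the choice k = 2, and the helper names are all assumptions made for illustration.

```python
import numpy as np

# Toy term-document matrix: rows are terms, columns are documents (made-up counts).
# Docs: d0 "car", d1 "automobile", d2 "car engine", d3 "automobile engine",
#       d4 "flower petal garden".
A = np.array([
    [1, 0, 1, 0, 0],  # car
    [0, 1, 0, 1, 0],  # automobile
    [0, 0, 1, 1, 0],  # engine
    [0, 0, 0, 0, 1],  # flower
    [0, 0, 0, 0, 1],  # petal
    [0, 0, 0, 0, 1],  # garden
], dtype=float)


def cosine(x, y):
    """Normalized inner product of two vectors."""
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))


def lsi_doc_vectors(A, k):
    """Return one k-dimensional latent vector per document (rows of V_k scaled by Sigma_k)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (np.diag(s[:k]) @ Vt[:k, :]).T


docs_k = lsi_doc_vectors(A, k=2)
print("raw cosine d0 vs d1:   ", cosine(A[:, 0], A[:, 1]))
print("latent cosine d0 vs d1:", round(cosine(docs_k[0], docs_k[1]), 3))
```

In the raw space, "car" and "automobile" share no term, so their cosine is 0. In the two-dimensional latent space both land on the same topic axis because they co-occur with "engine", which is the semantic association the low-rank map is meant to capture.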
Web search challenges and opportunities
Challenges
- Scalability: addressed with parallel indexing and searching
- Low-quality information and spam
- Dynamics of web
Opportunities
- many additional heuristics can be leveraged to improve accuracy
What is a web crawler
- an essential component of web search
- Breadth-first search (BFS) is a common crawling strategy
- complete vs focused crawling
- Incremental crawling focuses on using crawling resources efficiently
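A minimal sketch of BFS crawling over an in-memory link graph. The `get_links` callback stands in for fetching and parsing a page; it, the toy graph, and the `max_pages` limit are hypothetical.

```python
from collections import deque


def bfs_crawl(seed, get_links, max_pages=100):
    """Visit pages breadth-first from seed; get_links(url) returns outgoing links."""
    visited = []          # pages in the order they were crawled
    seen = {seed}         # pages already queued, to avoid re-crawling
    queue = deque([seed])
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# Toy link graph used in place of real HTTP fetches.
toy_web = {"a": ["b", "c"], "b": ["d"], "c": ["d", "a"], "d": []}
print(bfs_crawl("a", lambda url: toy_web.get(url, [])))
```

A focused crawler would filter or prioritize the links before queueing them, and an incremental crawler would decide which already-seen pages are worth revisiting; neither is shown here.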
What is MapReduce
- Minimizes the programmer's effort for simple parallel programming tasks
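A single-process sketch of the MapReduce programming model, using word counting as the task. The programmer writes only the map and reduce functions; the `run_mapreduce` driver here is a stand-in for what a real framework would do in parallel across machines. All names are illustrative.

```python
from collections import defaultdict


def map_fn(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce: combine all values emitted for one key."""
    return word, sum(counts)


def run_mapreduce(documents, map_fn, reduce_fn):
    """Driver: run map, group values by key (the shuffle step), then reduce."""
    groups = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            groups[key].append(value)
    return dict(reduce_fn(key, values) for key, values in groups.items())


print(run_mapreduce(["a b a", "b c"], map_fn, reduce_fn))
```

The same pattern applies to building an inverted index: map emits (term, document id) pairs and reduce collects the postings for each term.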
What is the PageRank algorithm
- captures page popularity
- models a random surfer who can reach every page; the probability of visiting a page measures its popularity
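A sketch of the random-surfer computation as power iteration. The damping factor of 0.85, the fixed iteration count, the handling of pages without outlinks, and the toy graph are assumptions for illustration; every link target is assumed to appear as a key in `links`.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # With probability (1 - damping) the surfer jumps to a random page.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                # Otherwise the surfer follows one outgoing link uniformly at random.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # Page without outlinks: spread its rank over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Toy graph: "c" is linked from three pages, "d" from none.
toy_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
scores = pagerank(toy_links)
print(max(scores, key=scores.get), min(scores, key=scores.get))
```

The random jump is what lets the surfer reach every page, including ones with no inlinks such as "d" in the toy graph.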