C1 Flashcards
information retrieval
finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections
how do IR systems support the search process?
- analyzing queries and documents
- retrieving documents by computing a relevance score for each document given a query
- ranking the documents by that relevance score (true relevance is typically binary)
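A minimal sketch of that scoring-and-ranking step, assuming a toy in-memory collection and plain term overlap as the relevance score (the document ids and texts are invented for illustration):

```python
# Minimal sketch: score documents by term overlap with the query, then rank them.
def tokenize(text):
    return text.lower().split()

def term_overlap(query, document):
    # relevance score = number of distinct query terms that occur in the document
    return len(set(tokenize(query)) & set(tokenize(document)))

collection = {
    "d1": "information retrieval finds documents in large collections",
    "d2": "a recommender system suggests items based on a user profile",
    "d3": "ranking documents by a relevance score given a query",
}
query = "ranking documents by relevance"
scores = {doc_id: term_overlap(query, text) for doc_id, text in collection.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # ['d3', 'd1', 'd2']
```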
how does a search engine work? (basic)
- user has an information need
- user types a query
- system returns items sorted by relevance to that query, based on the match with the query and on popularity (a small scoring sketch follows below)
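A hedged sketch of combining the query match with popularity into one ranking score; the click counts and the 0.5 weight are invented for illustration:

```python
import math

# Sketch: final score = query-match score + weighted log of past clicks (popularity).
def combined_score(match_score, clicks, popularity_weight=0.5):
    return match_score + popularity_weight * math.log1p(clicks)

items = {"page_a": (2, 100), "page_b": (2, 3), "page_c": (1, 500)}  # (match, clicks)
ranking = sorted(items, key=lambda i: combined_score(*items[i]), reverse=True)
print(ranking)  # popularity breaks the tie between page_a and page_b
```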
how do recommender systems work? (basic)
- user has an interest
- user goes to a service/app/website
- system provides items that are relevant to the user based on user history/profile and popularity
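A minimal recommendation sketch of the same idea, assuming the user profile and item features are small hand-made genre vectors and popularity is a view count (all values invented):

```python
# Sketch: rank items by similarity to the user profile plus a small popularity bonus.
user_profile = {"news": 0.8, "sports": 0.1, "music": 0.1}  # derived from user history

items = {
    "item_1": ({"news": 1.0, "sports": 0.0, "music": 0.0}, 120),  # (features, views)
    "item_2": ({"news": 0.0, "sports": 1.0, "music": 0.0}, 900),
    "item_3": ({"news": 0.5, "sports": 0.0, "music": 0.5}, 300),
}

def score(features, views, popularity_weight=0.0005):
    profile_match = sum(user_profile[g] * features.get(g, 0.0) for g in user_profile)
    return profile_match + popularity_weight * views

recommendations = sorted(items, key=lambda i: score(*items[i]), reverse=True)
print(recommendations)  # ['item_1', 'item_3', 'item_2']
```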
4 principles of IR
principles of
- relevance
- ranking
- text processing
- user interaction
principles of relevance
- term overlap: return documents that contain the query terms (or related terms)
- document importance (PageRank; sketched below)
- result popularity: clicks by previous users on the document
- diversification: different types of results for different interpretations of the query
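Document importance via PageRank can be sketched with a few lines of power iteration; the link graph and the damping factor 0.85 are the usual textbook toy setup, not taken from these notes:

```python
# Sketch: PageRank by power iteration on a tiny made-up link graph.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}  # page -> pages it links to
pages = list(links)
damping = 0.85
rank = {p: 1.0 / len(pages) for p in pages}

for _ in range(50):  # iterate until (approximately) converged
    new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
    for page, outlinks in links.items():
        share = damping * rank[page] / len(outlinks)
        for target in outlinks:
            new_rank[target] += share
    rank = new_rank

print(rank)  # "c" ends up most important: it is linked to by both "a" and "b"
```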
evaluation of relevance
needed:
- a set of queries
- a document collection
- relevance assessments: for each query, a set of documents labelled as relevant or non-relevant
the retrieval system returns a ranking of all documents
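Given such relevance assessments, one common way to evaluate the returned ranking is precision at a cutoff k; a minimal sketch with an invented ranking and label set:

```python
# Sketch: precision@k for one query, given binary relevance assessments.
def precision_at_k(ranking, relevant, k):
    return sum(1 for doc in ranking[:k] if doc in relevant) / k

ranking = ["d3", "d7", "d1", "d9", "d2"]   # ranking returned by the system
relevant = {"d1", "d3", "d4"}              # assessor-labelled relevant documents
print(precision_at_k(ranking, relevant, 3))  # ~0.67: two of the top-3 are relevant
```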
principles of ranking
- estimate relevance for each document
- we need a score
- term weighting (basic notion; sketched below)
- PageRank
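Term weighting in its basic TF-IDF form, as a sketch over a toy collection (the texts are invented; the idf uses one common log variant):

```python
import math

# Sketch: TF-IDF term weighting and a simple sum-of-weights relevance score.
collection = {
    "d1": "ranking documents by relevance score",
    "d2": "documents in large collections",
    "d3": "relevance of documents to a query",
}
docs = {d: text.lower().split() for d, text in collection.items()}
N = len(docs)

def idf(term):
    df = sum(1 for terms in docs.values() if term in terms)  # document frequency
    return math.log(N / df) if df else 0.0

def tfidf_score(query, doc_id):
    return sum(docs[doc_id].count(t) * idf(t) for t in query.lower().split())

query = "relevance ranking"
print(sorted(docs, key=lambda d: tfidf_score(query, d), reverse=True))  # ['d1', 'd3', 'd2']
```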
machine learning for ranking
Idea: learn the relevance of a document based on human-labelled training data (relevance assessments)
Why is machine learning for ranking different from machine learning for classification?
Relevance depends on the query, so we cannot train a global classifier over all relevant and irrelevant documents in a labelled dataset
=> need a machine learning paradigm that takes the query into account (a minimal sketch follows below)
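One minimal way to include the query is pointwise learning to rank: each training example is a (query, document) pair described by query-dependent features and labelled by an assessor. A sketch with scikit-learn; the features (term overlap, document popularity) and all numbers are invented:

```python
from sklearn.linear_model import LogisticRegression

# Pointwise learning-to-rank sketch: features are computed per (query, document) pair,
# so the query is part of every training example.
X_train = [[3, 0.9], [0, 0.8], [2, 0.1], [1, 0.2], [0, 0.05], [4, 0.5]]  # [overlap, popularity]
y_train = [1, 0, 1, 0, 0, 1]  # human relevance assessments

model = LogisticRegression().fit(X_train, y_train)

# At query time: compute the same features for each candidate document of the new query,
# then rank the candidates by the predicted probability of relevance.
candidates = {"d1": [2, 0.3], "d2": [0, 0.95], "d3": [3, 0.4]}
scores = {d: model.predict_proba([f])[0][1] for d, f in candidates.items()}
print(sorted(scores, key=scores.get, reverse=True))
```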
two-stage retrieval: first stage
- from large collection
- unsupervised
- often term-based (sparse)
- priority: recall
two-stage retrieval: second stage
- re-ranking the top-n documents from the first stage (both stages are sketched after this card)
- supervised
- often based on embeddings (dense)
- priority: precision at high ranks
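A sketch of both stages together: the first stage does recall-oriented term matching over the whole (toy) collection, the second re-ranks the top-n candidates by dense-vector similarity. The `embed` function here is only a crude hashing stand-in for a real neural encoder; the collection and query are invented:

```python
import numpy as np

collection = {
    "d1": "information retrieval from large text collections",
    "d2": "neural ranking models re-rank candidate documents",
    "d3": "cooking recipes for large dinner parties",
    "d4": "dense retrieval with learned text representations",
}

def tokenize(text):
    return text.lower().split()

def embed(text, dim=16):
    # Stand-in for a learned dense encoder: hash tokens into a small normalized vector.
    vec = np.zeros(dim)
    for token in tokenize(text):
        vec[hash(token) % dim] += 1.0
    return vec / (np.linalg.norm(vec) or 1.0)

def first_stage(query, n=3):
    # Unsupervised, term-based, recall-oriented shortlist from the full collection.
    q_terms = set(tokenize(query))
    scores = {d: len(q_terms & set(tokenize(t))) for d, t in collection.items()}
    return sorted(scores, key=scores.get, reverse=True)[:n]

def second_stage(query, candidates):
    # Precision-oriented re-ranking of the shortlist by embedding similarity.
    q_vec = embed(query)
    return sorted(candidates, key=lambda d: float(embed(collection[d]) @ q_vec), reverse=True)

query = "dense neural retrieval"
print(second_stage(query, first_stage(query)))
```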
2 different relevance models for similarity
- term overlap: find documents that contain the query words
- semantic similarity: find documents whose semantic representation is close to that of the query (as in the dense second stage sketched above)
index time
- collect (new) documents
- pre-process the documents
- create the document representations
- store the documents in the index
- indexing can take time
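The index itself is typically an inverted index that maps each term to the documents containing it; a minimal index-time sketch with an invented two-document collection:

```python
from collections import defaultdict

# Index-time sketch: pre-process documents and build an inverted index
# (term -> list of document ids).
def preprocess(text):
    return text.lower().split()

def build_index(collection):
    index = defaultdict(list)
    for doc_id, text in collection.items():
        for term in set(preprocess(text)):
            index[term].append(doc_id)
    return index

collection = {
    "d1": "ranking documents by relevance",
    "d2": "documents in large collections",
}
print(build_index(collection)["documents"])  # ['d1', 'd2']
```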
query time
- process the user query
- match the query to the index
- retrieve the documents that are potentially relevant
- rank the documents by relevance score
- retrieval must be fast (< 1 sec)
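The matching query-time sketch: only documents that share at least one term with the query are looked up and scored, which is what keeps retrieval fast (the small inverted index is hard-coded here so the snippet runs on its own):

```python
from collections import Counter

# Query-time sketch: look up query terms in the inverted index, score the candidate
# documents found there, and rank them by score.
index = {
    "ranking":   ["d1"],
    "documents": ["d1", "d2"],
    "relevance": ["d1"],
    "large":     ["d2"],
}

def search(query, index):
    scores = Counter()
    for term in query.lower().split():       # process the user query
        for doc_id in index.get(term, []):   # match the query terms to the index
            scores[doc_id] += 1              # simple term-overlap relevance score
    return [doc for doc, _ in scores.most_common()]  # rank by score

print(search("relevance ranking of documents", index))  # ['d1', 'd2']
```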