Information retrieval for dummies Flashcards

1
Q

In Unicode, what are the names for the encoding space and for the number assigned to a character?

A

The range of numerical values available for encoding is called the codespace, and the number assigned to a character is called a code point.

2
Q

What is a query in the context of IR?

A

A query is a formalized request for information from a document collection. It usually consists of keywords or phrases entered by the user to describe their information need.

3
Q

What is an Information Retrieval (IR) system?

A

An IR system is a system that stores, retrieves, and organizes information from a collection of documents based on user queries. Its goal is to return the most relevant results to the user.

4
Q

What are the key components of an IR system?

A

Document Collection: A repository of documents (text, images, etc.).
Indexing: Processes and organizes documents for efficient retrieval.
Query Processing: Parses and interprets the user query.
Retrieval Model: Matches the query against the indexed documents.
Ranking: Orders documents based on their relevance to the query.

5
Q

What is the main goal of an IR system?

A

To retrieve documents that are relevant to the user’s query while minimizing irrelevant results.

6
Q

What does relevance mean in IR systems?

A

Relevance refers to how well a document satisfies the user’s information need as expressed by the query.

7
Q

What is the difference between precision and recall in IR?

A

Precision: The proportion of retrieved documents that are relevant.

Recall: The proportion of relevant documents in the collection that are retrieved.
Together, precision and recall measure retrieval effectiveness, that is, the ability of a system to retrieve relevant documents while holding back non-relevant ones.

8
Q

What is indexing in IR systems?

A

Indexing is the process of organizing and structuring data in the document collection to allow for fast and efficient retrieval. Common techniques include inverted indexes and term frequency-based structures.

9
Q

What is an inverted index?

A

An inverted index maps each term in the collection to the list of documents (and positions) where it appears. It’s the backbone of most IR systems.
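A minimal sketch of such a term-to-postings mapping, assuming a plain Python dictionary and naive whitespace tokenization (both illustrative choices, not how a production index is built):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of document IDs it appears in."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():   # naive whitespace tokenization
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = ["the cat sat on the mat", "the dog chased the cat", "dogs and cats"]
index = build_inverted_index(docs)
print(index["cat"])   # [0, 1] -> documents containing "cat"
print(index["dog"])   # [1]
```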

10
Q

What happens during query processing?

A

Tokenization: Breaking the query into terms or tokens.
Normalization: Converting terms to a standard form (e.g., lowercase).
Stopword Removal: Removing common words (e.g., “the”, “and”).
Stemming/Lemmatization: Reducing terms to their root form.
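A sketch of the four steps above in plain Python. The tiny stopword list and the crude suffix-stripping "stemmer" are toy stand-ins for real resources (e.g., a full stopword list and the Porter stemmer), used only to illustrate the pipeline:

```python
import re

STOPWORDS = {"the", "and", "a", "of", "to", "in"}   # tiny illustrative list

def toy_stem(term):
    """Very crude suffix stripping, just to illustrate the stemming step."""
    for suffix in ("ing", "ed", "es", "s"):
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            return term[: -len(suffix)]
    return term

def process_query(query):
    tokens = re.findall(r"[a-z0-9]+", query.lower())      # tokenization + lowercasing
    tokens = [t for t in tokens if t not in STOPWORDS]    # stopword removal
    return [toy_stem(t) for t in tokens]                  # stemming

print(process_query("The cats were chasing the dogs"))    # ['cat', 'were', 'chas', 'dog']
```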

11
Q

What is ranking in IR systems?

A

Ranking orders retrieved documents based on their relevance to the user query, using scoring functions such as TF-IDF, BM25, or neural embeddings.

12
Q

What is TF-IDF?

A

Term Frequency-Inverse Document Frequency (TF-IDF) is a scoring method that reflects how important a term is in a document relative to the entire collection.

TF-IDF(t,d)=TF(t,d)×IDF(t)
Where:

TF: Term frequency in the document.
IDF: Logarithmic measure of how rare the term is across the collection.
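A minimal sketch of this scoring scheme over a made-up toy collection, using raw term counts for TF and log(N/df) for IDF to match the definitions on this card (many systems use a log-scaled TF as well):

```python
import math
from collections import Counter

docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "dogs chase cats".split(),
]
N = len(docs)
df = Counter(term for doc in docs for term in set(doc))   # document frequency

def tf_idf(term, doc):
    tf = doc.count(term)                                   # raw term frequency
    if tf == 0 or df[term] == 0:
        return 0.0
    idf = math.log(N / df[term])                           # inverse document frequency
    return tf * idf

print(round(tf_idf("cat", docs[0]), 3))   # "cat" appears in 2 of 3 docs -> modest weight
print(round(tf_idf("mat", docs[0]), 3))   # "mat" appears in 1 of 3 docs -> higher idf
```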

13
Q

What are the types of IR systems?

A

Boolean IR: Documents are retrieved based on strict matching of query terms (AND, OR, NOT).
Vector Space Models: Documents and queries are represented as vectors in a multi-dimensional space, with similarity measured (e.g., cosine similarity).
Probabilistic IR: Uses probability theory to estimate the likelihood of a document being relevant.
Neural IR: Uses deep learning models for query and document representation and ranking.

14
Q

What is the difference between exact-match and ranked retrieval?

A

Exact-Match Retrieval: Returns only documents that match the query exactly (e.g., Boolean retrieval).
Ranked Retrieval: Returns documents ranked by their relevance to the query (e.g., TF-IDF, BM25).

15
Q

What are user relevance feedback techniques in IR?

A

Explicit Feedback: Users label documents as relevant or not.
Implicit Feedback: Inferred from user interactions (e.g., click-through rates).
Query Expansion: Modifies the query based on user feedback to improve retrieval.

16
Q

What is an index in Information Retrieval?

A

An index is a data structure that enables fast and efficient retrieval of documents in response to a query. It maps terms to the documents in which they appear.

17
Q

What is a dictionary in an IR index?

A

The dictionary (or vocabulary) is a component of an inverted index that stores all unique terms in the document collection, along with metadata such as the term's document frequency and a pointer to its posting list.

18
Q

What is a posting list?

A

A posting list is a list of document identifiers (and possibly additional data like term frequency or positions) that represents all the documents in which a specific term occurs.

19
Q

What is a lexicon?

A

A lexicon is another name for the dictionary in an IR system, containing all unique terms in the collection.

20
Q

What is a term?

A

A term is a distinct unit of text (usually a word or token) used for indexing and querying in an IR system.

21
Q

What is an inverted index?

A

An inverted index maps each term in the dictionary to its posting list, enabling efficient retrieval of documents that contain the term.

22
Q

What is the difference between a forward index and an inverted index?

A

Forward Index: Maps each document to the terms it contains.
Inverted Index: Maps each term to the documents it appears in. The inverted index is more efficient for retrieval.

23
Q

What metadata might an inverted index store?

A

Document Frequency (df): The number of documents containing a term.
Term Frequency (tf): The number of times a term appears in a document.
Positions: The positions of a term within documents (useful for phrase queries).

24
Q

What is the difference between positional and non-positional indexes?

A

Non-Positional Index: Only stores document IDs where a term appears.
Positional Index: Stores term positions within each document, allowing for phrase and proximity queries.

25
What is term frequency (tf)?
Term frequency is the number of times a term appears in a document. It is often normalized to adjust for document length.
26
What is a skip pointer?
Skip pointers are additional pointers added to posting lists that allow skipping over parts of the list during query processing, improving efficiency.
27
What are stemming and lemmatization in the context of indexing?
Stemming reduces words to their root form (e.g., "running" → "run") to group similar terms together in the index. Lemmatization reduces words to their base or dictionary form (e.g., "better" → "good"), considering the word’s context.
28
What does Zipf's Law describe?
The frequency distribution of terms in a corpus, where the frequency of a term is inversely proportional to its rank. A few terms are very frequent, while most terms appear rarely, creating a long tail distribution. In a corpus, the most frequent word ("the") might appear 10,000 times, the second most frequent word ("and") 5,000 times, and so on.
29
What does Heaps’ Law describe?
The growth of vocabulary size as more text is processed: V(N) = k · N^β, where N is the number of tokens seen. The growth rate slows down, as most common terms are already included in the vocabulary. For example, if N = 1,000,000, k = 100 and β = 0.5, the vocabulary size is V(N) = 100 × 1,000,000^0.5 = 100,000.
30
What are DAAT and TAAT?
Document-at-a-Time (DAAT) and Term-at-a-Time (TAAT) are two fundamental query processing methods used in Information Retrieval (IR) systems for evaluating Boolean or ranked queries using inverted indexes.
31
How does DAAT work?
In DAAT, the system processes one document at a time across all terms in the query. The algorithm maintains a cursor for each term's posting list and evaluates documents in ascending order of document ID: start at the first document in all posting lists; identify the smallest document ID among the current cursors; check whether that document satisfies the query (e.g., all terms appear, for Boolean queries) or compute its score (for ranked queries); advance the cursors as needed; repeat until all posting lists are exhausted. See the conjunctive sketch below.
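A minimal DAAT sketch for a conjunctive (AND) Boolean query over sorted posting lists; the posting lists are made-up toy data:

```python
def daat_and(posting_lists):
    """Return doc IDs present in every posting list, advancing one document at a time."""
    cursors = [0] * len(posting_lists)
    results = []
    while all(c < len(pl) for c, pl in zip(cursors, posting_lists)):
        current = [pl[c] for c, pl in zip(cursors, posting_lists)]
        candidate = max(current)
        if all(doc == candidate for doc in current):       # all cursors agree -> match
            results.append(candidate)
            cursors = [c + 1 for c in cursors]
        else:                                              # advance the cursors lagging behind
            cursors = [c + 1 if pl[c] < candidate else c
                       for c, pl in zip(cursors, posting_lists)]
    return results

print(daat_and([[1, 3, 5, 8], [2, 3, 8, 9], [3, 4, 8]]))   # [3, 8]
```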
32
How does TAAT work?
In TAAT, the system processes one term at a time across all documents. It iterates through each term's posting list sequentially and updates scores for all documents containing that term: initialize a score accumulator for all documents (or for the subset relevant to the query terms); process each term's posting list and, for each document in it, update the document's score with the term's contribution (e.g., TF-IDF, BM25); after processing all terms, sort the documents by score for ranked queries, or apply Boolean logic for exact-match queries. See the sketch below.
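A minimal TAAT sketch with score accumulators. The per-term weights stand in for whatever scoring function (TF-IDF, BM25, ...) the system uses, and the posting data are made up for illustration:

```python
from collections import defaultdict

# Each term maps to (doc_id, term_weight) postings -- toy data for illustration.
postings = {
    "cat": [(1, 1.2), (3, 0.8), (7, 2.0)],
    "dog": [(3, 1.5), (5, 0.9)],
}

def taat_rank(query_terms, postings, k=3):
    accumulators = defaultdict(float)
    for term in query_terms:                    # one term (posting list) at a time
        for doc_id, weight in postings.get(term, []):
            accumulators[doc_id] += weight      # accumulate the term's contribution
    # Sort accumulated scores to produce the final top-k ranking.
    return sorted(accumulators.items(), key=lambda x: x[1], reverse=True)[:k]

print(taat_rank(["cat", "dog"], postings))      # [(3, 2.3), (7, 2.0), (1, 1.2)]
```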
33
What is the Jaccard coefficient?
The Jaccard coefficient measures the similarity between two sets by comparing their overlap to their union: J(A, B) = |A ∩ B| / |A ∪ B|. In the context of information retrieval, each document and query is represented as a set of terms, and the Jaccard coefficient gives the proportion of shared terms between a query and a document (see the sketch below).
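A tiny sketch of this set-overlap computation on a made-up query and document:

```python
def jaccard(query_terms, doc_terms):
    q, d = set(query_terms), set(doc_terms)
    return len(q & d) / len(q | d) if q | d else 0.0

print(jaccard("cheap flights rome".split(),
              "cheap hotels and flights in rome".split()))   # 3 shared / 6 total = 0.5
```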
34
What is the Bag of Words Model?
The Bag of Words (BoW) model is a simple vector representation of text that treats each document as a collection of terms (a "bag"), ignores grammar, word order, and sentence structure, and focuses only on the frequency of terms. Construction: (1) tokenization: break the text into individual terms (e.g., words); (2) vocabulary creation: identify all unique terms in the corpus; (3) frequency count: count how many times each term appears in each document. See the sketch below.
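A minimal sketch of these three construction steps; whitespace tokenization and a sorted vocabulary are illustrative choices:

```python
from collections import Counter

def bag_of_words(docs):
    tokenized = [doc.lower().split() for doc in docs]          # 1. tokenization
    vocab = sorted({t for doc in tokenized for t in doc})      # 2. vocabulary creation
    # 3. frequency count: one count vector per document, ordered by the vocabulary
    vectors = [[Counter(doc)[term] for term in vocab] for doc in tokenized]
    return vocab, vectors

vocab, vectors = bag_of_words(["the cat sat", "the cat chased the dog"])
print(vocab)     # ['cat', 'chased', 'dog', 'sat', 'the']
print(vectors)   # [[1, 0, 0, 1, 1], [1, 1, 1, 0, 2]]
```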
35
Is relevance proportional to term frequency?
Relevance does not increase proportionally with term frequency, so we use the logarithm of the term frequency instead: w(t, d) = 1 + log(tf(t, d)) if tf(t, d) > 0, and 0 otherwise.
36
Are rare terms informative?
Rare terms are more informative than frequent terms, so we want to give rare terms a high weight (this is what IDF does).
37
What is the formula for IDF?
The inverse document frequency of a term t is idf(t) = log(N / df(t)), where N is the total number of documents in the collection and df(t) is the number of documents the term appears in.
38
What is a common way to represent documents and queries?
A vector space, where terms define the dimensions (axes) and documents and queries are high-dimensional, typically very sparse, vectors.
39
How do we compute vector proximity?
Using cosine similarity: cos(q, d) = (q · d) / (|q| · |d|), i.e., the dot product of the length-normalized query and document vectors (see the sketch below).
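A minimal sketch of cosine similarity between two term-weight vectors (plain Python, no NumPy; the vectors are made up):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

query_vec = [1.0, 0.0, 2.0]      # e.g. tf-idf weights over a 3-term vocabulary
doc_vec   = [0.5, 1.0, 1.0]
print(round(cosine(query_vec, doc_vec), 3))   # ~0.745
```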
40
How do you measure user satisfaction for a search engine?
Using relevance assessment. Relevance is measured with: a benchmark document collection; a benchmark suite of queries; an assessment of either Relevant or Non-Relevant for each query and each document.
41
What is Precision at Cutoff k?
Precision at cutoff k (P@k) measures the proportion of relevant documents in the top k retrieved documents.
42
What is Recall at Cutoff k?
Recall at cutoff k (R@k) measures the proportion of all relevant documents retrieved in the top k.
43
What is the F-Measure?
The F-Measure is the harmonic mean of precision and recall, balancing their importance: F_b = (1 + b^2) · (P · R) / (b^2 · P + R), where b = 1 gives equal importance to precision and recall.
44
What is Discounted Cumulative Gain (DCG)?
DCG evaluates the graded relevance of results, assigning higher importance to results at the top of the ranked list: DCG = Σ_i rel_i / max(1, log_b(i)), where rel_i is the relevance score of the i-th document and the base b of the logarithm indicates the patience of the user to go through the result list (b = 2 is an impatient user, b = 10 a patient one).
45
What is Rank-Biased Precision (RBP)?
RBP is a metric that assigns a geometrically decaying weight to lower-ranked documents, emphasizing the importance of top-ranked results. RBP = (1-p)*SUM(p^(i-1)*rel_i) where p is the persistence parameter, controlling how quickly the weight decays.
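A sketch tying together the last few cards: P@k, DCG with the log-base-b patience parameter, and RBP, evaluated on a made-up ranked list of relevance labels:

```python
import math

def precision_at_k(rels, k):
    """rels: relevance labels of the ranked results, in rank order (>0 = relevant)."""
    return sum(1 for r in rels[:k] if r > 0) / k

def dcg(rels, b=2):
    """Discounted cumulative gain; the log base b models user patience."""
    return sum(rel / max(1.0, math.log(i, b)) for i, rel in enumerate(rels, start=1))

def rbp(rels, p=0.8):
    """Rank-biased precision with persistence parameter p."""
    return (1 - p) * sum(rel * p ** (i - 1) for i, rel in enumerate(rels, start=1))

ranked_rels = [3, 2, 0, 1, 0]          # graded relevance of the top-5 results
print(precision_at_k(ranked_rels, 3))  # 2/3 of the top 3 are relevant
print(round(dcg(ranked_rels), 3))      # 3 + 2 + 0 + 1/2 + 0 = 5.5
print(round(rbp([1, 1, 0, 1, 0]), 3))  # binary labels for RBP
```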
46
What is ranked retrieval?
Ranked retrieval sorts documents in descending order of relevance scores, calculated based on how well each document matches the query.
47
What is the Probability Ranking Principle (PRP)?
PRP states that documents should be ranked in decreasing order of their probability of relevance to the query, in order to maximize user satisfaction. The quantity ranked on is P(R = r | q, d), the probability that document d has relevance r to query q; this relevance probability can be expressed using Bayes’ Rule.
48
How are odds used in information retrieval?
Odds express the likelihood of a document being relevant versus not relevant to a query: O(R | q, d) = P(R = 1 | q, d) / P(R = 0 | q, d).
49
What is the Binary Independence Model (BIM)?
BIM is a probabilistic model that assumes: terms are binary (present or absent in documents); terms occur independently in documents; the relevance of one document is independent of that of others. It ranks documents by summing, over the query terms present in the document, the log of the odds ratio: S(d, q) = Σ_{t ∈ q ∩ d} log[ P(t | R=1) · (1 − P(t | R=0)) / ( (1 − P(t | R=1)) · P(t | R=0) ) ].
50
What is the eliteness hypothesis in IR?
The eliteness hypothesis assumes that a term’s occurrence in a document depends on whether the document belongs to an elite set for that term, i.e., a set of documents for which the term is considered important or useful for relevance. It introduces a hidden eliteness variable E into the BIM, so that term occurrence is modelled through it: P(t | R) = Σ_E P(t | E) · P(E | R), marginalizing over the eliteness states. It models the relationship between term occurrence and relevance through elite sets.
51
What are the advantages of ranked retrieval?
Supports partial matching. Provides graded relevance (not just binary). Prioritizes the most relevant documents for better user satisfaction.
52
What is BM25, and how does it improve upon earlier models like BM15 and BM11?
BM25 (Best Matching 25) is a probabilistic ranking function designed for ranked retrieval. It builds on earlier models like BM15 and BM11:
BM15: does not normalize for document length, which biases rankings toward longer documents.
BM11: normalizes term frequency linearly by document length, but over-penalizes long documents.
BM25 introduces non-linear document length normalization and term frequency saturation to balance the impact of term frequency and document length. It uses two adjustable parameters: k1, which controls term frequency saturation, and b, which adjusts the degree of document length normalization. These enhancements make BM25 highly effective for ranking documents based on relevance.
53
What is the BM25 formula, and what are its key components?
In its standard formulation:
score(q, d) = Σ_{t ∈ q} IDF(t) · tf(t, d) · (k1 + 1) / ( tf(t, d) + k1 · (1 - b + b · |d| / avgdl) )
k1 ensures diminishing returns for very high term frequencies, preventing overemphasis on repetitive terms.
1 - b + b · (|d| / avgdl) balances the influence of document length, avoiding biases toward short or long documents. See the sketch below.
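A minimal sketch of this scoring function over a toy in-memory collection. The values k1 = 1.2 and b = 0.75 are conventional defaults, and the IDF used here is the plain log(N/df) from the IDF card rather than the smoothed BM25 variant:

```python
import math
from collections import Counter

docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat around the garden".split(),
    "cats and dogs".split(),
]
N = len(docs)
avgdl = sum(len(d) for d in docs) / N
df = Counter(t for d in docs for t in set(d))

def bm25(query, doc, k1=1.2, b=0.75):
    score = 0.0
    dl = len(doc)
    for t in query:
        tf = doc.count(t)
        if tf == 0 or df[t] == 0:
            continue
        idf = math.log(N / df[t])
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl))
    return score

for i, d in enumerate(docs):
    print(i, round(bm25("cat mat".split(), d), 3))   # doc 0 should score highest
```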
54
What is MaxScore?
MaxScore is a dynamic pruning technique. The Core Idea: Avoid fully evaluating documents unlikely to make it into the top-k results. How It Works: Compute an upper bound on the score for each term. Use the sum of these upper bounds to skip documents whose maximum possible score is less than the lowest score in the top-k results (the threshold). Process documents in decreasing order of score contributions.
55
what is WAND?
WAND is a dynamic pruning technique. The Core Idea: Efficiently combine term scores by skipping unnecessary evaluations while maintaining correctness. How It Works: - Maintain an upper bound score for each term. - Use a pivot document (a candidate with a high potential score). - Skip terms or documents that cannot improve the ranking of the pivot. Advantages: WAND is faster and more effective in minimizing evaluations than MaxScore in many cases.
56
What is caching and why is it useful?
Caching stores precomputed results to reduce redundant computations during query processing. It can be applied at multiple levels. There are two types of caching strategy: * Search Result Caching: stores the final ranked list of documents for a query * Term Caching: stores the posting lists for each of the query terms in memory
57
What is index compression?
Index compression reduces the size of the inverted index, improving storage efficiency and query processing speed. Common techniques:
Vocabulary compression: reduces the size of the lexicon using techniques like front coding or hash tables.
Posting list compression: compresses posting lists using delta encoding (store differences between consecutive document IDs) and variable-byte encoding (VByte: represent numbers with variable-length bytes). A sketch of both follows.
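A minimal sketch of the two posting-list compression steps mentioned above: delta encoding followed by variable-byte encoding, using one common VByte convention (7 payload bits per byte, high bit marking the last byte of each number):

```python
def delta_encode(doc_ids):
    """Store the gap from the previous doc ID instead of the ID itself."""
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def vbyte_encode(numbers):
    """7 bits of payload per byte; the high bit marks the final byte of a number."""
    out = bytearray()
    for n in numbers:
        chunk = []
        while True:
            chunk.append(n % 128)
            n //= 128
            if n == 0:
                break
        chunk.reverse()
        chunk[-1] |= 128          # flag the terminating byte
        out.extend(chunk)
    return bytes(out)

postings = [824, 829, 215406]
gaps = delta_encode(postings)                   # [824, 5, 214577]
print(gaps, len(vbyte_encode(gaps)), "bytes")   # 2 + 1 + 3 = 6 bytes total
```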
58
What is skipping?
Skipping enhances query processing by allowing the system to jump over irrelevant postings in a list. Without skipping, moving the posting iterator cursor to a specific docid requires reading and decompressing all postings in between. So we add skip pointers to the posting lists to enable fast traversal. How it works: skip pointers divide the list into blocks; for a given query term, blocks unlikely to contain relevant documents are skipped. Advantages: reduces the number of comparisons and improves query speed. Trade-off: skip pointers slightly increase index size.
59
What is static pruning?
Static pruning reduces the size of the index before query processing by removing less useful data. Techniques include:
Term-based pruning: remove terms with low document frequency or low contribution to relevance.
Document-based pruning: discards terms from a document that are less representative of its content; can be applied on the fly during indexing if reasonable collection statistics are available.
Impact-based pruning: remove low-impact postings (e.g., terms contributing little to scores).
Challenges: risk of losing potentially relevant information, and balancing index size against retrieval effectiveness.
60
What is relevance feedback?
Relevance feedback is a key aspect of effective retrieval: users cannot change the ranking algorithm, but they can change the results through interaction, which helps refine the description of their information need. It is a powerful technique in information retrieval (IR) where the system improves query results by incorporating user-provided feedback about the relevance of retrieved documents. It is an iterative process that refines the query representation, leading to better retrieval performance. The basic idea is to:
1. Retrieve initial results: the system retrieves a ranked list of documents based on the original query.
2. Gather feedback: users indicate which documents in the retrieved set are relevant or non-relevant.
3. Refine the query: the system adjusts the query to emphasize terms from relevant documents and de-emphasize terms from non-relevant ones.
4. Retrieve improved results: a new ranked list of documents is retrieved using the refined query.
61
What is the Rocchio algorithm?
The Rocchio algorithm is one of the most common relevance feedback methods, especially in vector space models. It updates the query vector by incorporating terms from relevant and non-relevant documents:
q_new = α · q_0 + β · (1/|D_r|) · Σ_{d ∈ D_r} d − γ · (1/|D_nr|) · Σ_{d ∈ D_nr} d
Intuition: boost relevant terms by adding terms from relevant documents (β component); suppress non-relevant terms by subtracting terms from non-relevant documents (γ component); retain the original query's focus using α. The optimal query maximises the difference between the average vector representing the relevant documents and the average vector representing the non-relevant documents. See the sketch below.
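A minimal sketch of the Rocchio update on toy term-weight vectors; α = 1.0, β = 0.75, γ = 0.15 are common textbook defaults, and the vectors themselves are made up:

```python
def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query toward the centroid of relevant docs, away from non-relevant ones."""
    dims = len(query)
    centroid_rel = [sum(d[i] for d in relevant) / len(relevant) for i in range(dims)]
    centroid_non = [sum(d[i] for d in non_relevant) / len(non_relevant) for i in range(dims)]
    new_q = [alpha * query[i] + beta * centroid_rel[i] - gamma * centroid_non[i]
             for i in range(dims)]
    return [max(0.0, w) for w in new_q]        # negative weights are usually clipped to 0

q   = [1.0, 0.0, 0.5]                          # original query over a 3-term vocabulary
rel = [[0.8, 0.6, 0.0], [1.0, 0.4, 0.2]]       # judged relevant documents
non = [[0.0, 0.0, 1.0]]                        # judged non-relevant document
print([round(w, 3) for w in rocchio(q, rel, non)])   # [1.675, 0.375, 0.425]
```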
62
What is Learning to Rank?
Learning to Rank (LTR) is a technique that applies machine learning to rank documents in response to a query. Instead of relying solely on traditional hand-crafted scoring functions like BM25, LTR uses labeled training data to learn how to combine various relevance signals (features) into a ranking model.
63
What is Cascading?
Cascading is a multi-stage retrieval and ranking process used to improve efficiency and effectiveness in large-scale information retrieval systems. The idea is to filter documents through successive stages, each refining the ranked list using increasingly complex and computationally expensive methods.
1. Initial retrieval (recall-oriented): use simple and efficient methods like BM25 or TF-IDF to retrieve a broad set of documents. The goal is high recall (i.e., retrieve all potentially relevant documents).
2. Feature extraction: for the retrieved subset, compute features describing the query-document relationship. Examples: query-dependent (BM25 score, term overlap), query-independent (PageRank, document length), behavioral (click-through rates, user preferences).
3. Intermediate/advanced ranking (precision-oriented): apply machine learning models (e.g., logistic regression, random forests) using the computed features; narrow down to a smaller set of highly relevant documents and use computationally expensive methods like neural networks to refine the rankings, optimizing for advanced metrics like NDCG or MAP.
64
What are the general steps in Learning to Rank?
Step 1: Sample identification. Goal: select a set of query-document pairs for training. Approach: use BM25 to retrieve a candidate set of documents for each query, ensuring the sample contains enough relevant documents to make learning effective. Evaluation: use metrics like recall or precision at cutoff k to check whether the sample is sufficiently rich.
Step 2: Feature computation. Goal: compute meaningful features describing each query-document pair. Types of features: query-dependent (BM25 or TF-IDF score; term overlap, i.e., the number of query terms that appear in the document); query-independent (document popularity such as PageRank or click-through rate; document length or age); cross-features (term frequency in specific fields like the title or abstract).
Step 3: Train ranking models. Goal: use the labeled data and features to train a machine learning model that predicts document relevance. Pointwise approach: treat ranking as a regression problem, e.g., predict relevance scores (0 to 3) for each query-document pair. Pairwise approach: learn relative preferences between document pairs, e.g., given two documents, decide which one is more relevant for the query. Listwise approach: directly optimize a ranking metric (e.g., NDCG) over the entire ranked list. A minimal pointwise sketch follows this card.
Step 4: Evaluation. Split the data into training, validation, and test sets; evaluate the model using metrics like MAP, NDCG, or Precision@k; perform cross-validation to tune hyperparameters.
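A hedged pointwise sketch of Step 3, treating ranking as regression with scikit-learn; the two features and the graded labels are made up, and a real pipeline would use many more features and queries:

```python
from sklearn.ensemble import GradientBoostingRegressor

# Toy training data: one row per query-document pair.
# Features: [BM25 score, fraction of query terms present in the document]
X_train = [[12.3, 1.0], [8.1, 0.5], [2.4, 0.5], [0.9, 0.0], [10.5, 1.0], [3.3, 0.0]]
y_train = [3, 2, 1, 0, 3, 0]          # graded relevance labels (0 = not relevant, 3 = perfect)

model = GradientBoostingRegressor(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

# Rank the candidate documents of a new query by predicted relevance.
candidates = {"docA": [11.0, 1.0], "docB": [4.0, 0.5], "docC": [1.0, 0.0]}
scores = {doc: model.predict([feats])[0] for doc, feats in candidates.items()}
print(sorted(scores, key=scores.get, reverse=True))   # likely ['docA', 'docB', 'docC']
```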
65
What are the approaches to Learning to Rank?
Pointwise techniques suffer from limitations for ranking tasks: they consider each document independently, instead of trying to get the relevant documents ranked above the non-relevant ones; they ignore the fact that some documents are associated with the same query and some are not; if the number of documents varies largely across queries, the overall loss function is dominated by the queries with many documents; and the position of documents in the ranked list is not factored into the loss function(s).
Pairwise approaches: take two documents with some preference (e.g., different relevance labels 0 and 1, fewer and more clicks) and try to minimise a loss function based on the ranking error between pairs of documents. Examples: RankSVM, RankNet. Problems: they do not always approximate a real evaluation measure, and the number of document pairs per query is quadratic in the number of documents for that query.
Listwise approaches: consider the entire ranking of documents and encapsulate a standard IR evaluation measure within the loss function (not all measures are appropriate for listwise learning, e.g., P@3 vs NDCG@1000). Examples: AFS, LambdaMART. Generally more effective than pointwise or pairwise approaches.
66
What are the 3 main classes of queries in web search?
Navigational queries: to reach a particular site that the user has in mind (a.k.a. known-item search), i.e., reach a particular webpage/URL.
Informational queries: to acquire some information assumed to be present on one or more webpages; reading/bookmarking/printing are the only further user actions.
Transactional queries: to reach a site where further interaction will happen (e.g., shopping, downloading, gaming); the user further engages/interacts with the websites in the results list.
67
Talk about web search.
Too much material, I don't feel like writing the answer out, but it is very cool.
68
What is PageRank?
PageRank is an algorithm developed by Larry Page and Sergey Brin (the founders of Google) to rank web pages based on their importance. It uses the structure of hyperlinks on the web to evaluate a page's relevance, assuming that links from one page to another act as votes of confidence.
69
How does PageRank work?
1. Initial setup: assign an initial PageRank value to all pages (e.g., 1/N, where N is the total number of pages).
2. Iterative calculation: update PageRank scores iteratively until convergence (i.e., until the scores stop changing significantly), using the standard formulation PR(p) = (1 - d)/N + d · Σ_{q → p} PR(q)/L(q), where L(q) is the number of outgoing links of page q.
3. Damping factor: the damping factor d accounts for random web-surfing behaviour; 1 - d is the probability of jumping to a random page, d the probability of following a hyperlink.
4. Convergence: the algorithm stops when the PageRank values stabilize (usually within a small number of iterations for most graphs).
Basically: imagine a user doing a random walk on web pages. Start at a random page; at each step, leave the current page along one of its links, chosen equiprobably. In the long run each page has a long-term visit rate, and that rate is used as the page's score. A minimal power-iteration sketch follows.
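A minimal power-iteration sketch of the procedure above on a tiny made-up link graph; the damping factor d = 0.85 is the conventional choice, and dangling pages are handled here by spreading their mass uniformly:

```python
def pagerank(links, d=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}                      # initial setup: 1/N each
    for _ in range(iterations):                           # iterate until (roughly) converged
        new_pr = {p: (1 - d) / n for p in pages}          # random-jump component
        for p, outlinks in links.items():
            if outlinks:
                share = d * pr[p] / len(outlinks)
                for q in outlinks:
                    new_pr[q] += share                    # follow-a-link component
            else:                                         # dangling page: spread mass uniformly
                for q in pages:
                    new_pr[q] += d * pr[p] / n
        pr = new_pr
    return pr

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
print({p: round(s, 3) for p, s in pagerank(graph).items()})   # C collects the most rank
```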
70
What is web crawling?
Web crawling is the process of locating, fetching, storing, and maintaining the pages available in the Web.