Information retrieval for dummies Flashcards

1
Q

In Unicode, what are the names for the encoding space and for the number assigned to a character?

A

The range of numerical values available for encoding is called the codespace, and the number assigned to a character is called a code point.

2
Q

What is a query in the context of IR?

A

A query is a formalized request for information from a document collection. It usually consists of keywords or phrases entered by the user to describe their information need.

3
Q

What is an Information Retrieval (IR) system?

A

An IR system is a system that stores, retrieves, and organizes information from a collection of documents based on user queries. Its goal is to return the most relevant results to the user.

4
Q

What are the key components of an IR system?

A

Document Collection: A repository of documents (text, images, etc.).
Indexing: Processes and organizes documents for efficient retrieval.
Query Processing: Parses and interprets the user query.
Retrieval Model: Matches the query against the indexed documents.
Ranking: Orders documents based on their relevance to the query.

5
Q

What is the main goal of an IR system?

A

To retrieve documents that are relevant to the user’s query while minimizing irrelevant results.

6
Q

What does relevance mean in IR systems?

A

Relevance refers to how well a document satisfies the user’s information need as expressed by the query.

7
Q

What is the difference between precision and recall in IR?

A

Precision: The proportion of retrieved documents that are relevant.

Recall: The proportion of relevant documents in the collection that are retrieved.
Together, precision and recall measure retrieval effectiveness, that is, the ability of a system to retrieve relevant documents while holding back non-relevant ones.

8
Q

What is indexing in IR systems?

A

Indexing is the process of organizing and structuring data in the document collection to allow for fast and efficient retrieval. Common techniques include inverted indexes and term frequency-based structures.

9
Q

What is an inverted index?

A

An inverted index maps each term in the collection to the list of documents (and positions) where it appears. It’s the backbone of most IR systems.
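A minimal sketch of such a term-to-postings mapping, assuming a plain Python dictionary and naive whitespace tokenization (both illustrative choices, not how a production index is built):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of document IDs it appears in."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():   # naive whitespace tokenization
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = ["the cat sat on the mat", "the dog chased the cat", "dogs and cats"]
index = build_inverted_index(docs)
print(index["cat"])   # [0, 1] -> documents containing "cat"
print(index["dog"])   # [1]
```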

10
Q

What happens during query processing?

A

Tokenization: Breaking the query into terms or tokens.
Normalization: Converting terms to a standard form (e.g., lowercase).
Stopword Removal: Removing common words (e.g., “the”, “and”).
Stemming/Lemmatization: Reducing terms to their root form.
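A sketch of the four steps above in plain Python. The tiny stopword list and the crude suffix-stripping "stemmer" are toy stand-ins for real resources (e.g., a full stopword list and the Porter stemmer), used only to illustrate the pipeline:

```python
import re

STOPWORDS = {"the", "and", "a", "of", "to", "in"}   # tiny illustrative list

def toy_stem(term):
    """Very crude suffix stripping, just to illustrate the stemming step."""
    for suffix in ("ing", "ed", "es", "s"):
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            return term[: -len(suffix)]
    return term

def process_query(query):
    tokens = re.findall(r"[a-z0-9]+", query.lower())      # tokenization + lowercasing
    tokens = [t for t in tokens if t not in STOPWORDS]    # stopword removal
    return [toy_stem(t) for t in tokens]                  # stemming

print(process_query("The cats were chasing the dogs"))    # ['cat', 'were', 'chas', 'dog']
```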

11
Q

What is ranking in IR systems?

A

Ranking orders retrieved documents based on their relevance to the user query, using scoring functions such as TF-IDF, BM25, or neural embeddings.

12
Q

What is TF-IDF?

A

Term Frequency-Inverse Document Frequency (TF-IDF) is a scoring method that reflects how important a term is in a document relative to the entire collection.

TF-IDF(t,d)=TF(t,d)×IDF(t)
Where:

TF: Term frequency in the document.
IDF: Logarithmic measure of how rare the term is across the collection.
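A minimal sketch of this scoring scheme over a made-up toy collection, using raw term counts for TF and log(N/df) for IDF to match the definitions on this card (many systems use a log-scaled TF as well):

```python
import math
from collections import Counter

docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "dogs chase cats".split(),
]
N = len(docs)
df = Counter(term for doc in docs for term in set(doc))   # document frequency

def tf_idf(term, doc):
    tf = doc.count(term)                                   # raw term frequency
    if tf == 0 or df[term] == 0:
        return 0.0
    idf = math.log(N / df[term])                           # inverse document frequency
    return tf * idf

print(round(tf_idf("cat", docs[0]), 3))   # "cat" appears in 2 of 3 docs -> modest weight
print(round(tf_idf("mat", docs[0]), 3))   # "mat" appears in 1 of 3 docs -> higher idf
```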

13
Q

What are the types of IR systems?

A

Boolean IR: Documents are retrieved based on strict matching of query terms (AND, OR, NOT).
Vector Space Models: Documents and queries are represented as vectors in a multi-dimensional space, with similarity measured (e.g., cosine similarity).
Probabilistic IR: Uses probability theory to estimate the likelihood of a document being relevant.
Neural IR: Uses deep learning models for query and document representation and ranking.

14
Q

What is the difference between exact-match and ranked retrieval?

A

Exact-Match Retrieval: Returns only documents that match the query exactly (e.g., Boolean retrieval).
Ranked Retrieval: Returns documents ranked by their relevance to the query (e.g., TF-IDF, BM25).

15
Q

What are user relevance feedback techniques in IR?

A

Explicit Feedback: Users label documents as relevant or not.
Implicit Feedback: Inferred from user interactions (e.g., click-through rates).
Query Expansion: Modifies the query based on user feedback to improve retrieval.

16
Q

What is an index in Information Retrieval?

A

An index is a data structure that enables fast and efficient retrieval of documents in response to a query. It maps terms to the documents in which they appear.

17
Q

What is a dictionary in an IR index?

A

The dictionary (or vocabulary) is a component of an inverted index that stores all unique terms in the document collection, along with metadata such as the term's document frequency and a pointer to its posting list.

18
Q

What is a posting list?

A

A posting list is a list of document identifiers (and possibly additional data like term frequency or positions) that represents all the documents in which a specific term occurs.

19
Q

What is a lexicon?

A

A lexicon is another name for the dictionary in an IR system, containing all unique terms in the collection.

20
Q

What is a term?

A

A term is a distinct unit of text (usually a word or token) used for indexing and querying in an IR system.

21
Q

What is an inverted index?

A

An inverted index maps each term in the dictionary to its posting list, enabling efficient retrieval of documents that contain the term.

22
Q

What is the difference between a forward index and an inverted index?

A

Forward Index: Maps each document to the terms it contains.
Inverted Index: Maps each term to the documents it appears in. The inverted index is more efficient for retrieval.

23
Q

What metadata might an inverted index store?

A

Document Frequency (df): The number of documents containing a term.
Term Frequency (tf): The number of times a term appears in a document.
Positions: The positions of a term within documents (useful for phrase queries).

24
Q

What is the difference between positional and non-positional indexes?

A

Non-Positional Index: Only stores document IDs where a term appears.
Positional Index: Stores term positions within each document, allowing for phrase and proximity queries.

25
What is term frequency (tf)?
Term frequency is the number of times a term appears in a document. It is often normalized to adjust for document length.
26
What is a skip pointer?
Skip pointers are additional pointers added to posting lists that allow skipping over parts of the list during query processing, improving efficiency.
27
What are stemming and lemmatization in the context of indexing?
Stemming reduces words to their root form (e.g., "running" → "run") to group similar terms together in the index. Lemmatization reduces words to their base or dictionary form (e.g., "better" → "good"), considering the word’s context.
28
What does Zipf's Law describe?
The frequency distribution of terms in a corpus, where the frequency of a term is inversely proportional to its rank. A few terms are very frequent, while most terms appear rarely, creating a long tail distribution. In a corpus, the most frequent word ("the") might appear 10,000 times, the second most frequent word ("and") 5,000 times, and so on.
29
What does Heaps’ Law describe?
The growth of vocabulary size as more text is processed: V(N) = k · N^β, where N is the number of tokens seen. The growth rate slows down, as most common terms are already included in the vocabulary. For example, if N = 1,000,000, k = 100 and β = 0.5, the vocabulary size is V(N) = 100 × 1,000,000^0.5 = 100,000.
30
What are DAAT and TAAT?
Document-at-a-Time (DAAT) and Term-at-a-Time (TAAT) are two fundamental query processing methods used in Information Retrieval (IR) systems for evaluating Boolean or ranked queries using inverted indexes.
31
How does DAAT work?
In DAAT, the system processes one document at a time across all terms in the query. The algorithm maintains a cursor for each term's posting list and evaluates documents in ascending order of document ID: start at the first document in all posting lists; identify the smallest document ID among the current cursors; check whether that document satisfies the query (e.g., all terms appear, for Boolean queries) or compute its score (for ranked queries); advance the cursors as needed; repeat until all posting lists are exhausted. See the conjunctive sketch below.
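A minimal DAAT sketch for a conjunctive (AND) Boolean query over sorted posting lists; the posting lists are made-up toy data:

```python
def daat_and(posting_lists):
    """Return doc IDs present in every posting list, advancing one document at a time."""
    cursors = [0] * len(posting_lists)
    results = []
    while all(c < len(pl) for c, pl in zip(cursors, posting_lists)):
        current = [pl[c] for c, pl in zip(cursors, posting_lists)]
        candidate = max(current)
        if all(doc == candidate for doc in current):       # all cursors agree -> match
            results.append(candidate)
            cursors = [c + 1 for c in cursors]
        else:                                              # advance the cursors lagging behind
            cursors = [c + 1 if pl[c] < candidate else c
                       for c, pl in zip(cursors, posting_lists)]
    return results

print(daat_and([[1, 3, 5, 8], [2, 3, 8, 9], [3, 4, 8]]))   # [3, 8]
```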
32
How does TAAT work?
In TAAT, the system processes one term at a time across all documents. It iterates through each term's posting list sequentially and updates scores for all documents containing that term: initialize a score accumulator for all documents (or for the subset relevant to the query terms); process each term's posting list and, for each document in it, update the document's score with the term's contribution (e.g., TF-IDF, BM25); after processing all terms, sort the documents by score for ranked queries, or apply Boolean logic for exact-match queries. See the sketch below.
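A minimal TAAT sketch with score accumulators. The per-term weights stand in for whatever scoring function (TF-IDF, BM25, ...) the system uses, and the posting data are made up for illustration:

```python
from collections import defaultdict

# Each term maps to (doc_id, term_weight) postings -- toy data for illustration.
postings = {
    "cat": [(1, 1.2), (3, 0.8), (7, 2.0)],
    "dog": [(3, 1.5), (5, 0.9)],
}

def taat_rank(query_terms, postings, k=3):
    accumulators = defaultdict(float)
    for term in query_terms:                    # one term (posting list) at a time
        for doc_id, weight in postings.get(term, []):
            accumulators[doc_id] += weight      # accumulate the term's contribution
    # Sort accumulated scores to produce the final top-k ranking.
    return sorted(accumulators.items(), key=lambda x: x[1], reverse=True)[:k]

print(taat_rank(["cat", "dog"], postings))      # [(3, 2.3), (7, 2.0), (1, 1.2)]
```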
33
What is the Jaccard coefficient?
The Jaccard coefficient measures the similarity between two sets by comparing their overlap to their union: J(A, B) = |A ∩ B| / |A ∪ B|. In the context of information retrieval, each document and query is represented as a set of terms, and the Jaccard coefficient gives the proportion of shared terms between a query and a document (see the sketch below).
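A tiny sketch of this set-overlap computation on a made-up query and document:

```python
def jaccard(query_terms, doc_terms):
    q, d = set(query_terms), set(doc_terms)
    return len(q & d) / len(q | d) if q | d else 0.0

print(jaccard("cheap flights rome".split(),
              "cheap hotels and flights in rome".split()))   # 3 shared / 6 total = 0.5
```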
34
What is the Bag of Words Model?
The Bag of Words (BoW) model is a simple vector representation of text that treats each document as a collection of terms (a "bag"), ignores grammar, word order, and sentence structure, and focuses only on the frequency of terms. Construction: (1) tokenization: break the text into individual terms (e.g., words); (2) vocabulary creation: identify all unique terms in the corpus; (3) frequency count: count how many times each term appears in each document. See the sketch below.
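A minimal sketch of these three construction steps; whitespace tokenization and a sorted vocabulary are illustrative choices:

```python
from collections import Counter

def bag_of_words(docs):
    tokenized = [doc.lower().split() for doc in docs]          # 1. tokenization
    vocab = sorted({t for doc in tokenized for t in doc})      # 2. vocabulary creation
    # 3. frequency count: one count vector per document, ordered by the vocabulary
    vectors = [[Counter(doc)[term] for term in vocab] for doc in tokenized]
    return vocab, vectors

vocab, vectors = bag_of_words(["the cat sat", "the cat chased the dog"])
print(vocab)     # ['cat', 'chased', 'dog', 'sat', 'the']
print(vectors)   # [[1, 0, 0, 1, 1], [1, 1, 1, 0, 2]]
```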
35
Is relevance proportional to term frequency?
Relevance does not increase proportionally with term frequency, so we use the logarithm of the term frequency instead: w(t, d) = 1 + log(tf(t, d)) if tf(t, d) > 0, and 0 otherwise.
36
Are rare terms informative?
Rare terms are more informative than frequent terms, so we want to give rare terms a high weight (this is what IDF does).
37
What is the formula for IDF?
The inverse document frequency of a term t is idf(t) = log(N / df(t)), where N is the total number of documents in the collection and df(t) is the number of documents the term appears in.
38
What is a common way to represent documents and queries?
A vector space, where terms define the dimensions (axes) and documents and queries are high-dimensional, typically very sparse, vectors.
39
How do we compute vector proximity?
Using cosine similarity: cos(q, d) = (q · d) / (|q| · |d|), i.e., the dot product of the length-normalized query and document vectors (see the sketch below).
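A minimal sketch of cosine similarity between two term-weight vectors (plain Python, no NumPy; the vectors are made up):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

query_vec = [1.0, 0.0, 2.0]      # e.g. tf-idf weights over a 3-term vocabulary
doc_vec   = [0.5, 1.0, 1.0]
print(round(cosine(query_vec, doc_vec), 3))   # ~0.745
```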
40
How do you measure user satisfaction for a search engine?
Using relevance assessment. Relevance is measured with: a benchmark document collection; a benchmark suite of queries; an assessment of either Relevant or Non-Relevant for each query and each document.
41
What is Precision at Cutoff k?
Precision at cutoff k (P@k) measures the proportion of relevant documents in the top k retrieved documents.
42
What is Recall at Cutoff k?
Recall at cutoff k (R@k) measures the proportion of all relevant documents retrieved in the top k.
43
What is the F-Measure?
The F-Measure is the harmonic mean of precision and recall, balancing their importance: F_b = (1 + b^2) · (P · R) / (b^2 · P + R), where b = 1 gives equal importance to precision and recall.
44
What is Discounted Cumulative Gain (DCG)?
DCG evaluates the graded relevance of results, assigning higher importance to results at the top of the ranked list: DCG = Σ_i rel_i / max(1, log_b(i)), where rel_i is the relevance score of the i-th document and the base b of the logarithm indicates the patience of the user to go through the result list (b = 2 is an impatient user, b = 10 a patient one).
45
What is Rank-Biased Precision (RBP)?
RBP is a metric that assigns a geometrically decaying weight to lower-ranked documents, emphasizing the importance of top-ranked results. RBP = (1-p)*SUM(p^(i-1)*rel_i) where p is the persistence parameter, controlling how quickly the weight decays.
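A sketch tying together the last few cards: P@k, DCG with the log-base-b patience parameter, and RBP, evaluated on a made-up ranked list of relevance labels:

```python
import math

def precision_at_k(rels, k):
    """rels: relevance labels of the ranked results, in rank order (>0 = relevant)."""
    return sum(1 for r in rels[:k] if r > 0) / k

def dcg(rels, b=2):
    """Discounted cumulative gain; the log base b models user patience."""
    return sum(rel / max(1.0, math.log(i, b)) for i, rel in enumerate(rels, start=1))

def rbp(rels, p=0.8):
    """Rank-biased precision with persistence parameter p."""
    return (1 - p) * sum(rel * p ** (i - 1) for i, rel in enumerate(rels, start=1))

ranked_rels = [3, 2, 0, 1, 0]          # graded relevance of the top-5 results
print(precision_at_k(ranked_rels, 3))  # 2/3 of the top 3 are relevant
print(round(dcg(ranked_rels), 3))      # 3 + 2 + 0 + 1/2 + 0 = 5.5
print(round(rbp([1, 1, 0, 1, 0]), 3))  # binary labels for RBP
```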
46
What is ranked retrieval?
Ranked retrieval sorts documents in descending order of relevance scores, calculated based on how well each document matches the query.
47
What is the Probability Ranking Principle (PRP)?
PRP states that documents should be ranked in decreasing order of their probability of relevance to the query, in order to maximize user satisfaction. The quantity ranked on is P(R = r | q, d), the probability that document d has relevance r to query q; this relevance probability can be expressed using Bayes’ Rule.
48
How are odds used in information retrieval?
Odds express the likelihood of a document being relevant versus not relevant to a query: O(R | q, d) = P(R = 1 | q, d) / P(R = 0 | q, d).
49
What is the Binary Independence Model (BIM)?
BIM is a probabilistic model that assumes: terms are binary (present or absent in documents); terms occur independently in documents; the relevance of one document is independent of that of others. It ranks documents by summing, over the query terms present in the document, the log of the odds ratio: S(d, q) = Σ_{t ∈ q ∩ d} log[ P(t | R=1) · (1 − P(t | R=0)) / ( (1 − P(t | R=1)) · P(t | R=0) ) ].
50
What is the eliteness hypothesis in IR?
The eliteness hypothesis assumes that a term’s occurrence in a document depends on whether the document belongs to an elite set for that term, i.e., a set of documents for which the term is considered important or useful for relevance. It introduces a hidden eliteness variable E into the BIM, so that term occurrence is modelled through it: P(t | R) = Σ_E P(t | E) · P(E | R), marginalizing over the eliteness states. It models the relationship between term occurrence and relevance through elite sets.
51
What are the advantages of ranked retrieval?
Supports partial matching. Provides graded relevance (not just binary). Prioritizes the most relevant documents for better user satisfaction.
52
What is BM25, and how does it improve upon earlier models like BM15 and BM11?
BM25 (Best Matching 25) is a probabilistic ranking function designed for ranked retrieval. It builds on earlier models like BM15 and BM11:
BM15: does not normalize for document length, which biases rankings toward longer documents.
BM11: normalizes term frequency linearly by document length, but over-penalizes long documents.
BM25 introduces non-linear document length normalization and term frequency saturation to balance the impact of term frequency and document length. It uses two adjustable parameters: k1, which controls term frequency saturation, and b, which adjusts the degree of document length normalization. These enhancements make BM25 highly effective for ranking documents based on relevance.
53
What is the BM25 formula, and what are its key components?
In its standard formulation:
score(q, d) = Σ_{t ∈ q} IDF(t) · tf(t, d) · (k1 + 1) / ( tf(t, d) + k1 · (1 - b + b · |d| / avgdl) )
k1 ensures diminishing returns for very high term frequencies, preventing overemphasis on repetitive terms.
1 - b + b · (|d| / avgdl) balances the influence of document length, avoiding biases toward short or long documents. See the sketch below.
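A minimal sketch of this scoring function over a toy in-memory collection. The values k1 = 1.2 and b = 0.75 are conventional defaults, and the IDF used here is the plain log(N/df) from the IDF card rather than the smoothed BM25 variant:

```python
import math
from collections import Counter

docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat around the garden".split(),
    "cats and dogs".split(),
]
N = len(docs)
avgdl = sum(len(d) for d in docs) / N
df = Counter(t for d in docs for t in set(d))

def bm25(query, doc, k1=1.2, b=0.75):
    score = 0.0
    dl = len(doc)
    for t in query:
        tf = doc.count(t)
        if tf == 0 or df[t] == 0:
            continue
        idf = math.log(N / df[t])
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl))
    return score

for i, d in enumerate(docs):
    print(i, round(bm25("cat mat".split(), d), 3))   # doc 0 should score highest
```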
54
What is MaxScore?
MaxScore is a dynamic pruning technique. The Core Idea: Avoid fully evaluating documents unlikely to make it into the top-k results. How It Works: Compute an upper bound on the score for each term. Use the sum of these upper bounds to skip documents whose maximum possible score is less than the lowest score in the top-k results (the threshold). Process documents in decreasing order of score contributions.
55
what is WAND?
WAND is a dynamic pruning technique. The Core Idea: Efficiently combine term scores by skipping unnecessary evaluations while maintaining correctness. How It Works: - Maintain an upper bound score for each term. - Use a pivot document (a candidate with a high potential score). - Skip terms or documents that cannot improve the ranking of the pivot. Advantages: WAND is faster and more effective in minimizing evaluations than MaxScore in many cases.
56
What is caching and why is it useful?
Caching stores precomputed results to reduce redundant computations during query processing. It can be applied at multiple levels. There are two types of caching strategy: * Search Result Caching: stores the final ranked list of documents for a query * Term Caching: stores the posting lists for each of the query terms in memory
57
What is index compression?
Index compression reduces the size of the inverted index, improving storage efficiency and query processing speed. Common techniques:
Vocabulary compression: reduces the size of the lexicon using techniques like front coding or hash tables.
Posting list compression: compresses posting lists using delta encoding (store differences between consecutive document IDs) and variable-byte encoding (VByte: represent numbers with variable-length bytes). A sketch of both follows.
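A minimal sketch of the two posting-list compression steps mentioned above: delta encoding followed by variable-byte encoding, using one common VByte convention (7 payload bits per byte, high bit marking the last byte of each number):

```python
def delta_encode(doc_ids):
    """Store the gap from the previous doc ID instead of the ID itself."""
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def vbyte_encode(numbers):
    """7 bits of payload per byte; the high bit marks the final byte of a number."""
    out = bytearray()
    for n in numbers:
        chunk = []
        while True:
            chunk.append(n % 128)
            n //= 128
            if n == 0:
                break
        chunk.reverse()
        chunk[-1] |= 128          # flag the terminating byte
        out.extend(chunk)
    return bytes(out)

postings = [824, 829, 215406]
gaps = delta_encode(postings)                   # [824, 5, 214577]
print(gaps, len(vbyte_encode(gaps)), "bytes")   # 2 + 1 + 3 = 6 bytes total
```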
58
What is skipping?
Skipping enhances query processing by allowing the system to jump over irrelevant postings in a list. Without skipping, moving the posting iterator cursor to a specific docid requires reading and decompressing all postings in between. So we add skip pointers to the posting lists to enable fast traversal. How it works: skip pointers divide the list into blocks; for a given query term, blocks unlikely to contain relevant documents are skipped. Advantages: reduces the number of comparisons and improves query speed. Trade-off: skip pointers slightly increase index size.
59
What is static pruning?
Static pruning reduces the size of the index before query processing by removing less useful data. Techniques include:
Term-based pruning: remove terms with low document frequency or low contribution to relevance.
Document-based pruning: discards terms from a document that are less representative of its content; can be applied on the fly during indexing if reasonable collection statistics are available.
Impact-based pruning: remove low-impact postings (e.g., terms contributing little to scores).
Challenges: risk of losing potentially relevant information, and balancing index size against retrieval effectiveness.
60
What is relevance feedback?
Relevance feedback is a key aspect of effective retrieval: users cannot change the ranking algorithm, but they can change the results through interaction, which helps refine the description of their information need. It is a powerful technique in information retrieval (IR) where the system improves query results by incorporating user-provided feedback about the relevance of retrieved documents. It is an iterative process that refines the query representation, leading to better retrieval performance. The basic idea is to:
1. Retrieve initial results: the system retrieves a ranked list of documents based on the original query.
2. Gather feedback: users indicate which documents in the retrieved set are relevant or non-relevant.
3. Refine the query: the system adjusts the query to emphasize terms from relevant documents and de-emphasize terms from non-relevant ones.
4. Retrieve improved results: a new ranked list of documents is retrieved using the refined query.
61
What is the Rocchio algorithm?
The Rocchio algorithm is one of the most common relevance feedback methods, especially in vector space models. It updates the query vector by incorporating terms from relevant and non-relevant documents:
q_new = α · q_0 + β · (1/|D_r|) · Σ_{d ∈ D_r} d − γ · (1/|D_nr|) · Σ_{d ∈ D_nr} d
Intuition: boost relevant terms by adding terms from relevant documents (β component); suppress non-relevant terms by subtracting terms from non-relevant documents (γ component); retain the original query's focus using α. The optimal query maximises the difference between the average vector representing the relevant documents and the average vector representing the non-relevant documents. See the sketch below.
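A minimal sketch of the Rocchio update on toy term-weight vectors; α = 1.0, β = 0.75, γ = 0.15 are common textbook defaults, and the vectors themselves are made up:

```python
def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query toward the centroid of relevant docs, away from non-relevant ones."""
    dims = len(query)
    centroid_rel = [sum(d[i] for d in relevant) / len(relevant) for i in range(dims)]
    centroid_non = [sum(d[i] for d in non_relevant) / len(non_relevant) for i in range(dims)]
    new_q = [alpha * query[i] + beta * centroid_rel[i] - gamma * centroid_non[i]
             for i in range(dims)]
    return [max(0.0, w) for w in new_q]        # negative weights are usually clipped to 0

q   = [1.0, 0.0, 0.5]                          # original query over a 3-term vocabulary
rel = [[0.8, 0.6, 0.0], [1.0, 0.4, 0.2]]       # judged relevant documents
non = [[0.0, 0.0, 1.0]]                        # judged non-relevant document
print([round(w, 3) for w in rocchio(q, rel, non)])   # [1.675, 0.375, 0.425]
```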
62
What is Learning to Rank?
Learning to Rank (LTR) is a technique that applies machine learning to rank documents in response to a query. Instead of relying solely on traditional hand-crafted scoring functions like BM25, LTR uses labeled training data to learn how to combine various relevance signals (features) into a ranking model.
63
What is Cascading?
Cascading is a multi-stage retrieval and ranking process used to improve efficiency and effectiveness in large-scale information retrieval systems. The idea is to filter documents through successive stages, each refining the ranked list using increasingly complex and computationally expensive methods.
1. Initial retrieval (recall-oriented): use simple and efficient methods like BM25 or TF-IDF to retrieve a broad set of documents. The goal is high recall (i.e., retrieve all potentially relevant documents).
2. Feature extraction: for the retrieved subset, compute features describing the query-document relationship. Examples: query-dependent (BM25 score, term overlap), query-independent (PageRank, document length), behavioral (click-through rates, user preferences).
3. Intermediate/advanced ranking (precision-oriented): apply machine learning models (e.g., logistic regression, random forests) using the computed features; narrow down to a smaller set of highly relevant documents and use computationally expensive methods like neural networks to refine the rankings, optimizing for advanced metrics like NDCG or MAP.
64
What are the general steps in Learning to Rank?
Step 1: Sample identification. Goal: select a set of query-document pairs for training. Approach: use BM25 to retrieve a candidate set of documents for each query, ensuring the sample contains enough relevant documents to make learning effective. Evaluation: use metrics like recall or precision at cutoff k to check whether the sample is sufficiently rich.
Step 2: Feature computation. Goal: compute meaningful features describing each query-document pair. Types of features: query-dependent (BM25 or TF-IDF score; term overlap, i.e., the number of query terms that appear in the document); query-independent (document popularity such as PageRank or click-through rate; document length or age); cross-features (term frequency in specific fields like the title or abstract).
Step 3: Train ranking models. Goal: use the labeled data and features to train a machine learning model that predicts document relevance. Pointwise approach: treat ranking as a regression problem, e.g., predict relevance scores (0 to 3) for each query-document pair. Pairwise approach: learn relative preferences between document pairs, e.g., given two documents, decide which one is more relevant for the query. Listwise approach: directly optimize a ranking metric (e.g., NDCG) over the entire ranked list. A minimal pointwise sketch follows this card.
Step 4: Evaluation. Split the data into training, validation, and test sets; evaluate the model using metrics like MAP, NDCG, or Precision@k; perform cross-validation to tune hyperparameters.
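A hedged pointwise sketch of Step 3, treating ranking as regression with scikit-learn; the two features and the graded labels are made up, and a real pipeline would use many more features and queries:

```python
from sklearn.ensemble import GradientBoostingRegressor

# Toy training data: one row per query-document pair.
# Features: [BM25 score, fraction of query terms present in the document]
X_train = [[12.3, 1.0], [8.1, 0.5], [2.4, 0.5], [0.9, 0.0], [10.5, 1.0], [3.3, 0.0]]
y_train = [3, 2, 1, 0, 3, 0]          # graded relevance labels (0 = not relevant, 3 = perfect)

model = GradientBoostingRegressor(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

# Rank the candidate documents of a new query by predicted relevance.
candidates = {"docA": [11.0, 1.0], "docB": [4.0, 0.5], "docC": [1.0, 0.0]}
scores = {doc: model.predict([feats])[0] for doc, feats in candidates.items()}
print(sorted(scores, key=scores.get, reverse=True))   # likely ['docA', 'docB', 'docC']
```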
65
What are the approaches to Learning to Rank?
Pointwise techniques suffer from limitations for ranking tasks: they consider each document independently, instead of trying to get the relevant documents ranked above the non-relevant ones; they ignore the fact that some documents are associated with the same query and some are not; if the number of documents varies largely across queries, the overall loss function is dominated by the queries with many documents; and the position of documents in the ranked list is not factored into the loss function(s).
Pairwise approaches: take two documents with some preference (e.g., different relevance labels 0 and 1, fewer and more clicks) and try to minimise a loss function based on the ranking error between pairs of documents. Examples: RankSVM, RankNet. Problems: they do not always approximate a real evaluation measure, and the number of document pairs per query is quadratic in the number of documents for that query.
Listwise approaches: consider the entire ranking of documents and encapsulate a standard IR evaluation measure within the loss function (not all measures are appropriate for listwise learning, e.g., P@3 vs NDCG@1000). Examples: AFS, LambdaMART. Generally more effective than pointwise or pairwise approaches.
66
What are the 3 main classes of queries in web search?
Navigational queries: to reach a particular site that the user has in mind (a.k.a. known-item search), i.e., reach a particular webpage/URL.
Informational queries: to acquire some information assumed to be present on one or more webpages; reading/bookmarking/printing are the only further user actions.
Transactional queries: to reach a site where further interaction will happen (e.g., shopping, downloading, gaming); the user further engages/interacts with the websites in the results list.
67
Talk about web search.
Too much material, I don't feel like writing the answer out, but it is very cool.
68
What is PageRank?
PageRank is an algorithm developed by Larry Page and Sergey Brin (the founders of Google) to rank web pages based on their importance. It uses the structure of hyperlinks on the web to evaluate a page's relevance, assuming that links from one page to another act as votes of confidence.
69
How does PageRank work?
1. Initial setup: assign an initial PageRank value to all pages (e.g., 1/N, where N is the total number of pages).
2. Iterative calculation: update PageRank scores iteratively until convergence (i.e., until the scores stop changing significantly), using the standard formulation PR(p) = (1 - d)/N + d · Σ_{q → p} PR(q)/L(q), where L(q) is the number of outgoing links of page q.
3. Damping factor: the damping factor d accounts for random web-surfing behaviour; 1 - d is the probability of jumping to a random page, d the probability of following a hyperlink.
4. Convergence: the algorithm stops when the PageRank values stabilize (usually within a small number of iterations for most graphs).
Basically: imagine a user doing a random walk on web pages. Start at a random page; at each step, leave the current page along one of its links, chosen equiprobably. In the long run each page has a long-term visit rate, and that rate is used as the page's score. A minimal power-iteration sketch follows.
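A minimal power-iteration sketch of the procedure above on a tiny made-up link graph; the damping factor d = 0.85 is the conventional choice, and dangling pages are handled here by spreading their mass uniformly:

```python
def pagerank(links, d=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}                      # initial setup: 1/N each
    for _ in range(iterations):                           # iterate until (roughly) converged
        new_pr = {p: (1 - d) / n for p in pages}          # random-jump component
        for p, outlinks in links.items():
            if outlinks:
                share = d * pr[p] / len(outlinks)
                for q in outlinks:
                    new_pr[q] += share                    # follow-a-link component
            else:                                         # dangling page: spread mass uniformly
                for q in pages:
                    new_pr[q] += d * pr[p] / n
        pr = new_pr
    return pr

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
print({p: round(s, 3) for p, s in pagerank(graph).items()})   # C collects the most rank
```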
70
What is web crawling?
Web crawling is the process of locating, fetching, storing, and maintaining the pages available in the Web.