Information retrieval for dummies Flashcards
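What are the Unicode terms for the range of encoded values and for the number assigned to a character?
The range of integers available for encoding characters is called the codespace, and the number assigned to a character is called a code point.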
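A small sketch of code points, using Python's built-in ord and chr. The sample characters and variable names are arbitrary choices for illustration.

```python
# Each character is assigned one code point, an integer in the Unicode codespace.
cp_a = ord("A")        # code point of LATIN CAPITAL LETTER A
cp_euro = ord("€")     # code point of EURO SIGN

# Code points are conventionally written as "U+" followed by hex digits.
label_a = f"U+{cp_a:04X}"
label_euro = f"U+{cp_euro:04X}"

# chr() goes the other way: code point -> character.
euro = chr(0x20AC)

# The Unicode codespace runs from U+0000 to U+10FFFF.
MAX_CODE_POINT = 0x10FFFF
last_char = chr(MAX_CODE_POINT)
```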
What is a query in the context of IR?
A query is a formalized request for information from a document collection. It usually consists of keywords or phrases entered by the user to describe their information need.
What is an Information Retrieval (IR) system?
An IR system is a system that stores, retrieves, and organizes information from a collection of documents based on user queries. Its goal is to return the most relevant results to the user.
What are the key components of an IR system?
Document Collection: A repository of documents (text, images, etc.).
Indexing: Processes and organizes documents for efficient retrieval.
Query Processing: Parses and interprets the user query.
Retrieval Model: Matches the query against the indexed documents.
Ranking: Orders documents based on their relevance to the query.
What is the main goal of an IR system?
To retrieve documents that are relevant to the user’s query while minimizing irrelevant results.
What does relevance mean in IR systems?
Relevance refers to how well a document satisfies the user’s information need as expressed by the query.
What is the difference between precision and recall in IR?
Precision: The proportion of retrieved documents that are relevant.
Recall: The proportion of relevant documents in the collection that are retrieved.
Together, precision and recall measure retrieval effectiveness: the ability of a system to retrieve relevant documents while holding back non-relevant ones.
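The two definitions above can be sketched over sets of document IDs. The function name and the sample IDs are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# 4 documents retrieved, 5 relevant in the collection, 2 in common.
p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5, 6, 7])
```

Here precision is 2/4 and recall is 2/5: half of what was returned is relevant, but most relevant documents were missed.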
What is indexing in IR systems?
Indexing is the process of organizing and structuring data in the document collection to allow for fast and efficient retrieval. Common techniques include inverted indexes and term frequency-based structures.
What is an inverted index?
An inverted index maps each term in the collection to the list of documents (and positions) where it appears. It’s the backbone of most IR systems.
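A minimal sketch of building such an index, assuming whitespace tokenization and lowercasing. The toy documents and the function name are hypothetical.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to {doc_id: [positions]} for a dict of doc_id -> text."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index

docs = {1: "new home sales", 2: "home prices rise", 3: "new home prices"}
index = build_inverted_index(docs)
home_postings = index["home"]   # documents (and positions) containing "home"
```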
What happens during query processing?
Tokenization: Breaking the query into terms or tokens.
Normalization: Converting terms to a standard form (e.g., lowercase).
Stopword Removal: Removing common words (e.g., “the”, “and”).
Stemming/Lemmatization: Reducing terms to a stem or dictionary base form (e.g., “running” becomes “run”).
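The four steps above can be sketched as a small pipeline. The stopword list is a tiny illustrative one, and naive_stem is a crude suffix stripper used as a stand-in for a real stemmer such as Porter's; both are assumptions, not a standard implementation.

```python
import re

STOPWORDS = {"the", "and", "a", "of", "in"}  # tiny illustrative list

def naive_stem(term):
    """Strip a few common suffixes; a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "es", "s"):
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: -len(suffix)]
    return term

def process_query(query):
    tokens = re.findall(r"\w+", query)                  # tokenization
    tokens = [t.lower() for t in tokens]                # normalization
    tokens = [t for t in tokens if t not in STOPWORDS]  # stopword removal
    return [naive_stem(t) for t in tokens]              # stemming

terms = process_query("The Indexing of Documents and Queries")
# terms == ["index", "document", "queri"]
```

The same pipeline is normally applied to documents at indexing time, so that query terms and index terms end up in the same form.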
What is ranking in IR systems?
Ranking orders retrieved documents based on their relevance to the user query, using scoring functions such as TF-IDF, BM25, or neural embeddings.
What is TF-IDF?
Term Frequency-Inverse Document Frequency (TF-IDF) is a scoring method that reflects how important a term is in a document relative to the entire collection.
TF-IDF(t,d)=TF(t,d)×IDF(t)
Where:
TF: Term frequency in the document.
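IDF: Logarithmic measure of how rare the term is across the collection, commonly log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing t.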
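A worked sketch of the formula above, assuming one common variant: TF as the raw count and IDF as the natural log of N / df. Many other weightings exist; the corpus and function name are made up.

```python
import math
from collections import Counter

def tf_idf(term, doc_tokens, corpus):
    """Score `term` in one tokenized document against a list of tokenized documents."""
    tf = Counter(doc_tokens)[term]                 # raw term frequency
    df = sum(1 for doc in corpus if term in doc)   # document frequency
    if df == 0:
        return 0.0
    idf = math.log(len(corpus) / df)               # rarer terms get larger IDF
    return tf * idf

corpus = [
    ["new", "home", "sales"],
    ["home", "prices", "rise"],
    ["new", "home", "prices"],
]
score_home = tf_idf("home", corpus[0], corpus)     # in every document: log(3/3) = 0
score_sales = tf_idf("sales", corpus[0], corpus)   # in one document: 1 * log(3)
```

A term that appears in every document scores zero, which is why very common words contribute little to ranking.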
What are the types of IR systems?
Boolean IR: Documents are retrieved based on strict matching of query terms (AND, OR, NOT).
Vector Space Models: Documents and queries are represented as vectors in a multi-dimensional space, with similarity measured (e.g., cosine similarity).
Probabilistic IR: Uses probability theory to estimate the likelihood of a document being relevant.
Neural IR: Uses deep learning models for query and document representation and ranking.
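For the vector space model, cosine similarity can be sketched as follows. The three-dimensional vectors are hypothetical term weights chosen for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)

query_vec = [1, 1, 0]
doc_a = [2, 2, 0]   # same direction as the query, different length
doc_b = [0, 0, 3]   # shares no terms with the query

sim_a = cosine_similarity(query_vec, doc_a)  # close to 1.0
sim_b = cosine_similarity(query_vec, doc_b)  # 0.0
```

Because cosine similarity ignores vector length, a long document is not favored merely for repeating terms more often.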
What is the difference between exact-match and ranked retrieval?
Exact-Match Retrieval: Returns only documents that satisfy the query conditions exactly, as an unordered set (e.g., Boolean retrieval).
Ranked Retrieval: Returns documents ranked by their relevance to the query (e.g., TF-IDF, BM25).
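The contrast can be sketched on a toy collection. The ranked scorer below simply counts query-term occurrences, a deliberately simple stand-in for TF-IDF or BM25; all names and documents are hypothetical.

```python
docs = {
    1: "cheap flights to rome",
    2: "rome travel guide",
    3: "cheap hotels and cheap flights",
}

def boolean_and(query_terms, docs):
    """Exact match: a document qualifies only if it contains every query term."""
    return sorted(
        doc_id for doc_id, text in docs.items()
        if all(term in text.split() for term in query_terms)
    )

def ranked(query_terms, docs):
    """Ranked: score by total query-term occurrences, best first."""
    scores = {
        doc_id: sum(text.split().count(term) for term in query_terms)
        for doc_id, text in docs.items()
    }
    hits = [(doc_id, score) for doc_id, score in scores.items() if score > 0]
    return sorted(hits, key=lambda pair: (-pair[1], pair[0]))

exact = boolean_and(["cheap", "rome"], docs)   # [1]
ordered = ranked(["cheap", "rome"], docs)      # [(1, 2), (3, 2), (2, 1)]
```

Boolean AND returns only document 1, while ranked retrieval also surfaces partial matches in score order.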
What are user relevance feedback techniques in IR?
Explicit Feedback: Users label documents as relevant or not.
Implicit Feedback: Inferred from user interactions (e.g., click-through rates).
Query Expansion: Modifies the query based on user feedback to improve retrieval.
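One simple way to picture query expansion from explicit feedback: add the most frequent new terms from documents the user marked relevant. This is an illustrative heuristic, not a specific published algorithm, and the names are hypothetical.

```python
from collections import Counter

def expand_query(query_terms, relevant_docs, k=2):
    """Append the k most frequent terms from relevant docs that are not already in the query."""
    counts = Counter(
        term
        for text in relevant_docs
        for term in text.lower().split()
        if term not in query_terms
    )
    # Sort by descending count, breaking ties alphabetically for determinism.
    best = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
    return list(query_terms) + [term for term, _ in best]

expanded = expand_query(
    ["jaguar"],
    ["jaguar big cat habitat", "jaguar cat diet"],
    k=1,
)
# expanded == ["jaguar", "cat"]
```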
What is an index in Information Retrieval?
An index is a data structure that enables fast and efficient retrieval of documents in response to a query. It maps terms to the documents in which they appear.
What is a dictionary in an IR index?
The dictionary (or vocabulary) is the component of an inverted index that stores all unique terms in the document collection, along with metadata such as document frequency and a pointer to each term's posting list.
What is a posting list?
A posting list is a list of document identifiers (and possibly additional data like term frequency or positions) that represents all the documents in which a specific term occurs.
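When posting lists are kept sorted by document ID, an AND query over two terms can be answered by a linear merge. A minimal sketch, with made-up posting lists:

```python
def intersect(p1, p2):
    """Intersect two posting lists that are sorted by document ID."""
    i = j = 0
    result = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])  # document contains both terms
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                # advance the list with the smaller ID
        else:
            j += 1
    return result

common = intersect([1, 3, 5, 8, 13], [2, 3, 8, 21])   # [3, 8]
```

The merge visits each posting at most once, so its cost is proportional to the combined length of the two lists.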
What is a lexicon?
A lexicon is another name for the dictionary in an IR system, containing all unique terms in the collection.
What is a term?
A term is a distinct unit of text (usually a word or token) used for indexing and querying in an IR system.
What is an inverted index?
An inverted index maps each term in the dictionary to its posting list, enabling efficient retrieval of documents that contain the term.
What is the difference between a forward index and an inverted index?
Forward Index: Maps each document to the terms it contains.
Inverted Index: Maps each term to the documents it appears in. The inverted index is more efficient for retrieval because a query term leads directly to its documents, with no need to scan every document.
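The relationship between the two structures can be sketched by inverting a small forward index. The document IDs and terms are hypothetical.

```python
forward_index = {
    "d1": ["new", "home", "sales"],
    "d2": ["home", "prices"],
}

def invert(forward):
    """Turn doc -> terms into term -> docs, keeping each doc ID once per term."""
    inverted = {}
    for doc_id, terms in forward.items():
        for term in terms:
            postings = inverted.setdefault(term, [])
            if doc_id not in postings:
                postings.append(doc_id)
    return inverted

inverted_index = invert(forward_index)
docs_with_home = inverted_index["home"]   # ["d1", "d2"]
```

With only the forward index, finding documents that contain a term would mean checking every document's term list.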
What metadata might an inverted index store?
Document Frequency (df): The number of documents containing a term.
Term Frequency (tf): The number of times a term appears in a document.
Positions: The positions of a term within documents (useful for phrase queries).
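A sketch of an index that records df per term and tf per posting, assuming whitespace tokenization. The entry layout (a dict with "df" and "postings" keys) is just one possible representation.

```python
from collections import Counter

docs = {1: "to be or not to be", 2: "to do is to be", 3: "do be do"}

def build_stats(docs):
    """Return {term: {"df": int, "postings": [(doc_id, tf), ...]}}."""
    index = {}
    for doc_id in sorted(docs):
        for term, tf in Counter(docs[doc_id].split()).items():
            entry = index.setdefault(term, {"df": 0, "postings": []})
            entry["df"] += 1                        # one more document contains the term
            entry["postings"].append((doc_id, tf))  # tf is specific to this document
    return index

stats = build_stats(docs)
be_entry = stats["be"]   # {"df": 3, "postings": [(1, 2), (2, 1), (3, 1)]}
```

Note that df belongs to the term as a whole, while tf is stored once per (term, document) pair.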
What is the difference between positional and non-positional indexes?
Non-Positional Index: Only stores document IDs where a term appears.
Positional Index: Stores term positions within each document, allowing for phrase and proximity queries.
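A sketch of how positions enable phrase queries. The hand-written index below is for three hypothetical documents: 1 = "new york city", 2 = "york new city", 3 = "visit new york".

```python
# term -> {doc_id: [positions]}
positional_index = {
    "new":   {1: [0], 2: [1], 3: [1]},
    "york":  {1: [1], 2: [0], 3: [2]},
    "city":  {1: [2], 2: [2]},
    "visit": {3: [0]},
}

def phrase_query(index, phrase):
    """Return sorted doc IDs where the phrase terms occur at consecutive positions."""
    terms = phrase.lower().split()
    if not terms or any(term not in index for term in terms):
        return []
    matches = []
    for doc_id, starts in index[terms[0]].items():
        for start in starts:
            # The k-th phrase term must appear exactly k positions after the first.
            if all(start + k in index[t].get(doc_id, []) for k, t in enumerate(terms)):
                matches.append(doc_id)
                break
    return sorted(matches)

result = phrase_query(positional_index, "new york")   # [1, 3]
```

A non-positional index could only report that documents 1, 2, and 3 all contain both words; the positions rule out document 2, where the words appear in the wrong order.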