Information Retrieval Flashcards
What is Information Retrieval (IR)?
Information Retrieval (IR) refers to the process of obtaining relevant information from a large collection of unstructured data, typically in the form of text documents.
What is IR used for?
The primary goal of information retrieval is to find relevant information in response to a user’s query, usually within large collections such as those behind search engines, document databases, or online repositories.
Draw the IR architecture.
Explain User:
The process starts with a user who has an information need and formulates a query.
Explain Query:
The query could be a word, phrase or question.
Explain Corpus:
The collection of documents or data that the IR system searches through to retrieve relevant information. The corpus could consist of web pages, documents, or database records.
Explain Indexing:
The corpus undergoes indexing, where each document is processed and indexed to allow fast information retrieval.
Explain Indexed Data Structure:
The indexing process produces an indexed data structure, which contains an organised representation of the corpus.
Explain IR System:
When the user submits a query, the IR system searches through the indexed data structures to find documents that match the query terms. It retrieves relevant documents and passes them back to the user.
Explain Output:
An output is generated based on the user’s query and could be a list of documents, ranked search results, or relevant snippets.
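The components above can be tied together in a small sketch. This Python is only an illustration under stated assumptions: the three-document corpus, the whitespace tokenizer, and the names tokenize, build_index and search are hypothetical and do not come from any particular IR system.

```python
# Hypothetical toy corpus: document ID -> text.
corpus = {
    1: "information retrieval finds relevant documents",
    2: "search engines index web pages",
    3: "an index allows fast retrieval",
}

def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()

def build_index(corpus):
    """Indexing: map each term to the set of document IDs that contain it."""
    index = {}
    for doc_id, text in corpus.items():
        for term in tokenize(text):
            index.setdefault(term, set()).add(doc_id)
    return index

def search(index, query):
    """IR system: return sorted IDs of documents containing every query term."""
    terms = tokenize(query)
    if not terms:
        return []
    matches = set(index.get(terms[0], set()))
    for term in terms[1:]:
        matches &= index.get(term, set())
    return sorted(matches)

# The user's query goes to the IR system, which consults the index.
index = build_index(corpus)
print(search(index, "index"))           # documents 2 and 3 contain "index"
print(search(index, "fast retrieval"))  # only document 3 contains both terms
```

Here the output is an unranked list of matching document IDs. A real system would usually also score and rank the matches.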
What data structures are used in IR?
Lists
Dictionaries (also called hash maps or hash tables)
What is a list and its usage?
A list is an ordered collection of elements where each element can be accessed by its index.
Usage in IR:
Document lists: Lists are used to store collections of documents or results. For example, a list of documents that match a query.
Posting lists: In inverted indexing (common in IR systems), each word (term) is associated with a posting list, which is a list of documents where the term appears.
Ordered result lists: The ranked list of search results presented to the user is often a list, where elements are ranked by relevance.
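As a rough illustration of these list uses, the sketch below intersects two sorted posting lists and then orders the matching documents by score. The posting lists, the scores, and the function name intersect are made up for the example.

```python
# Hypothetical posting lists: sorted document IDs for two terms.
postings_cat = [1, 3, 5, 8]
postings_dog = [2, 3, 8, 9]

def intersect(a, b):
    """Walk two sorted posting lists in step, keeping IDs present in both."""
    i, j, result = 0, 0, []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result

# Document list: documents that match both terms.
matching_docs = intersect(postings_cat, postings_dog)

# Hypothetical relevance scores for the matching documents.
scores = {3: 0.42, 8: 0.91}

# Ordered result list: highest score first.
ranked = sorted(matching_docs, key=lambda doc_id: scores[doc_id], reverse=True)
print(matching_docs)
print(ranked)
```

Keeping posting lists sorted is what lets the intersection run in a single pass over both lists.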
What is a dictionary and its usage?
A dictionary is a collection of key-value pairs where each key is unique and values can be accessed via their keys.
Usage in IR:
Inverted index: A dictionary is often used to store the inverted index, where each key is a term (word) and the value is a posting list of documents containing that term.
Term frequency counts: Dictionaries can store term frequencies, where keys are terms and values are counts of how often a term appears in a document.
Document metadata: Dictionaries are used to store metadata about documents, such as document IDs and their corresponding attributes (e.g., title, URL, length).
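The three dictionary uses above might look like this in Python. It is a minimal sketch: the two documents, their titles and URLs, and all variable names are invented for illustration.

```python
from collections import Counter

# Hypothetical documents: document ID -> raw text.
documents = {
    "d1": "cats chase mice and cats sleep",
    "d2": "dogs chase cats",
}

# Document metadata: document ID -> attributes (titles and URLs are made up).
metadata = {
    "d1": {"title": "Cats", "url": "https://example.com/cats", "length": 6},
    "d2": {"title": "Dogs", "url": "https://example.com/dogs", "length": 3},
}

# Term frequency counts: document ID -> (term -> count).
term_freqs = {doc_id: Counter(text.split()) for doc_id, text in documents.items()}

# Inverted index: term -> posting list of document IDs containing the term.
inverted_index = {}
for doc_id, counts in term_freqs.items():
    for term in counts:
        inverted_index.setdefault(term, []).append(doc_id)

print(term_freqs["d1"]["cats"])   # 2
print(inverted_index["chase"])    # ['d1', 'd2']
print(metadata["d2"]["title"])    # Dogs
```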
Pros and Cons of a List?
Pros:
Elements are stored in the order they are inserted
Has dynamic sizing
Easy to iterate through
Insertion at the end is fast, O(1) amortized
Cons:
Slow search: finding an element by value means scanning the list sequentially, O(n)
Expensive insertions and deletions at the start or middle, as they require shifting the other elements, O(n)
Not ideal for key based access
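The trade-offs above can be seen in a few lines of Python. This is only a sketch; the document names and the helper linear_search are hypothetical.

```python
results = []

# Appending at the end is cheap and preserves insertion order.
results.append("doc1")
results.append("doc2")
results.append("doc3")

# Inserting at the front shifts every existing element one place right: O(n).
results.insert(0, "doc0")

def linear_search(items, target):
    """Scan the list from the start; return the index of target or -1."""
    for position, item in enumerate(items):
        if item == target:
            return position
    return -1

print(results[2])                        # access by index is direct
print(linear_search(results, "doc3"))    # found only after checking earlier items
print(linear_search(results, "doc9"))    # -1: the whole list was scanned
```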
Pros and Cons of a Dictionary?
Pros:
Fast lookup by key, O(1) on average
Efficient insertions and deletions, O(1) on average, because the key is hashed directly to a storage location
Ideal for large data sets
No duplicate keys
Dynamic size
Cons:
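Unordered: keys are not kept in sorted order, and many implementations do not guarantee any iteration order
Memory overhead from the underlying hash table
Hash collisions must be handled, and can degrade lookups to O(n) in the worst case
More complex to implement than a list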
Not suitable for sequential data
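A short Python sketch of these dictionary properties follows. The document IDs and lengths are invented. Python's built-in dict stands in for a generic hash map here, and it keeps insertion order, which not every hash map implementation does.

```python
# Hypothetical mapping: document ID -> document length in words.
doc_lengths = {}
doc_lengths["d3"] = 120
doc_lengths["d1"] = 45
doc_lengths["d2"] = 300

# No duplicate keys: assigning to an existing key overwrites its value.
doc_lengths["d1"] = 50

# Deletion by key does not require shifting other entries.
removed = doc_lengths.pop("d2")

# Lookup by key, with a default for keys that are absent.
missing_length = doc_lengths.get("d9", 0)

# Keys are not stored in sorted order; sequential access needs an explicit sort.
ordered_ids = sorted(doc_lengths)

print(doc_lengths["d1"])
print(ordered_ids)
```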