Introduction to IR Flashcards
What is data mining?
Extracting knowledge from large amounts of data
What are the 4 main parts of information retrieval?
- The corpus
- An information need
- A metric of relevance
- A query
What is a corpus?
A large repository of documents
What is an information need?
The topic about which you desire to get information
What is relevance?
A measure of whether a document contains information that satisfies the information need
What is a query?
How the information need is expressed to the computer
What is structured data?
Data that conforms to a predefined schema. Tends to refer to information in tables with clear structure
What is unstructured data?
Any data without a clear structure
What type of system does each type of data require?
Structured: database systems
Unstructured: Information retrieval systems
What is semi-structured data?
Data that has some sort of structure but not a strict one. Almost no data is truly unstructured
Ex: A document has a title, subtitle, references, etc
What is information retrieval?
Finding material of an unstructured nature that satisfies an information need from within large collections
What is the goal of information retrieval?
To retrieve documents containing information relevant to the user’s information need, helping the user complete a task
What are 2 metrics to measure the relevance of retrieved documents?
Precision and recall
What is precision?
The fraction of retrieved docs that are relevant to the user’s information need
TP/(TP + FP)
Number of good ones retrieved out of everything retrieved
What is recall?
Fraction of relevant docs in the collection that are retrieved
TP/(TP + FN)
Number of good ones retrieved out of all good ones
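A minimal sketch of both formulas, assuming the retrieved and relevant documents are given as sets of doc IDs. The function name precision_recall and the example IDs are illustrative, not from the cards.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for retrieved vs. relevant doc IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)  # relevant docs that were retrieved
    fp = len(retrieved - relevant)  # retrieved docs that are not relevant
    fn = len(relevant - retrieved)  # relevant docs that were missed
    # Guard against division by zero when either set is empty.
    precision = tp / (tp + fp) if retrieved else 0.0
    recall = tp / (tp + fn) if relevant else 0.0
    return precision, recall


# Hypothetical query: 4 docs retrieved, 3 of them relevant, 6 relevant overall.
p, r = precision_recall({1, 2, 3, 4}, {2, 3, 4, 5, 6, 7})
print(p, r)  # 0.75 0.5
```

Returning 0.0 for an empty set is one possible convention; some evaluations leave the value undefined instead.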
Why is a linear scan of documents to perform a term search not a good solution?
- It has to be repeated for every query
- Only works for a small corpus; it is far too slow for large collections
- Operations like proximity searching are not possible
- Doesn’t allow for ranked retrieval
What is a term-document incidence matrix?
A matrix built by preprocessing the corpus in advance in order to speed up query processing.
Each row is a term and each column is a document. An entry is 1 if the term appears in the document and 0 otherwise
How is a compound boolean query performed on a term-document incidence matrix?
Take the bit string (matrix row) for each term and combine them with bitwise AND, OR, and NOT. The 1 bits in the result mark the matching documents
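The bitwise approach can be sketched as follows, assuming doc IDs run from 0 to n-1 and each term's bit string is stored as a Python int whose bit d is set when the term occurs in doc d. The toy documents and the helper names build_incidence and bits_to_doc_ids are made up for illustration.

```python
# Toy corpus; doc IDs are assumed to be 0..n-1.
docs = {
    0: "brutus killed caesar",
    1: "caesar praised brutus and calpurnia",
    2: "calpurnia warned caesar",
}


def build_incidence(docs):
    """Map each term to an int whose bit d is 1 if the term occurs in doc d."""
    matrix = {}
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            matrix[term] = matrix.get(term, 0) | (1 << doc_id)
    return matrix


def bits_to_doc_ids(bits):
    """List the positions of the 1 bits, i.e. the matching doc IDs."""
    return [d for d in range(bits.bit_length()) if (bits >> d) & 1]


matrix = build_incidence(docs)
all_docs = (1 << len(docs)) - 1  # mask with one bit per document

# Query: brutus AND caesar AND NOT calpurnia
# NOT is masked so that only bits for real documents can be set.
result = matrix["brutus"] & matrix["caesar"] & (~matrix["calpurnia"] & all_docs)
matching = bits_to_doc_ids(result)
print(matching)  # [0]
```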
What is the issue with a term-document incidence matrix?
For very large corpora with many terms, storing the matrix takes up a lot of memory. The matrix is also sparse: most of its entries are 0
How do we solve the issue with a term-document incidence matrix?
Only record the places in the matrix where 1’s would appear. Use an inverted index: a dictionary that maps each term to a postings list of the doc IDs in which that term appears.
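A minimal sketch of building such an index, assuming whitespace tokenization and lowercasing only. The toy documents and the name build_inverted_index are illustrative. Each postings list is kept as a sorted Python list of doc IDs.

```python
from collections import defaultdict

# Toy corpus, made up for illustration.
docs = {
    0: "brutus killed caesar",
    1: "caesar praised brutus and calpurnia",
    2: "calpurnia warned caesar",
}


def build_inverted_index(docs):
    """Map each term to a sorted list of the doc IDs that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)  # a set ignores repeated occurrences
    return {term: sorted(ids) for term, ids in postings.items()}


index = build_inverted_index(docs)
print(index["brutus"])     # [0, 1]
print(index["calpurnia"])  # [1, 2]
```

Only the (term, doc) pairs that actually occur are stored, so memory grows with the number of 1 entries rather than with terms times documents.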
What are the methods for storing a postings list in an inverted index?
Linked lists and variable-length arrays
How do you perform a boolean retrieval query on an inverted index?
Retrieve the postings list for each term and merge them using set operations (intersection for AND, union for OR)
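One common way to do the AND merge is a two-pointer walk over postings lists that are kept sorted by doc ID. This is a sketch under that sorting assumption; the function names intersect and union and the example lists are illustrative.

```python
def intersect(p1, p2):
    """AND: doc IDs present in both sorted postings lists."""
    i = j = 0
    result = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # advance the pointer behind the smaller doc ID
        else:
            j += 1
    return result


def union(p1, p2):
    """OR: doc IDs present in either list, returned sorted."""
    return sorted(set(p1) | set(p2))


# Hypothetical postings lists for two terms.
a = [1, 2, 4, 11, 31, 45, 173, 174]
b = [2, 31, 54, 101]
print(intersect(a, b))  # [2, 31]
```

Because both lists are sorted, the intersection takes time proportional to the sum of their lengths instead of comparing every pair of entries.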