Introduction to IR Flashcards
What is data mining?
Extracting knowledge from large amounts of data
What are the 4 main parts of information retrieval?
- The corpus
- An information need
- A metric of relevance
- A query
What is a corpus?
A large repository of documents
What is an information need?
The topic about which the user wants information
What is relevance?
A measure of whether a document contains information that satisfies the information need
What is a query?
How the information need is expressed to the computer
What is structured data?
Data that conforms to a predefined schema. Usually refers to information stored in tables with a clear structure
What is unstructured data?
Any data without a clear structure
What type of system does each type of data require?
Structured: database systems
Unstructured: information retrieval systems
What is semi-structured data?
Data that has some sort of structure but not a strict one. Almost no data is truly unstructured
Ex: A document has a title, subtitle, references, etc.
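As an illustration, a semi-structured document can be sketched as a record whose fields are named but whose main content is still free text. The field names and values below are hypothetical, chosen only to mirror the example above.

```python
# Hypothetical semi-structured document: named fields, but no strict schema.
document = {
    "title": "Introduction to IR",
    "subtitle": "Flashcards",
    "body": "Free-form text with no fixed internal structure.",
    "references": ["Reference A", "Reference B"],
}

# The named fields give partial structure; the body remains unstructured text.
structured_fields = [name for name in document if name != "body"]
print(structured_fields)
```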
What is information retrieval?
Finding material of an unstructured nature that satisfies an information need from within large collections
What is the goal of information retrieval?
To retrieve documents containing information relevant to the user’s information need, helping the user complete a task
What are 2 metrics to measure the relevance of retrieved documents?
Precision and recall
What is precision?
The fraction of retrieved docs relevant to the user’s information need
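TP/(TP + FP), where TP = true positives (relevant docs retrieved) and FP = false positives (irrelevant docs retrieved)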
Number of good ones out of all ones retrieved
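A minimal sketch of the precision calculation, assuming the retrieved and relevant documents are available as sets of document IDs. The IDs, the function name, and the choice to return 0 when nothing is retrieved are assumptions made for illustration.

```python
def precision(retrieved: set, relevant: set) -> float:
    """Fraction of retrieved documents that are relevant: TP / (TP + FP)."""
    if not retrieved:
        # Assumption: treat precision as 0 when no documents are retrieved.
        return 0.0
    true_positives = len(retrieved & relevant)  # relevant docs that were retrieved
    return true_positives / len(retrieved)      # len(retrieved) == TP + FP


# Hypothetical document IDs.
retrieved_docs = {"d1", "d2", "d3", "d4"}
relevant_docs = {"d1", "d3", "d5"}
print(precision(retrieved_docs, relevant_docs))  # 2 relevant out of 4 retrieved
```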
What is recall?
Fraction of relevant docs in the collection that are retrieved
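TP/(TP + FN), where TP = true positives (relevant docs retrieved) and FN = false negatives (relevant docs not retrieved)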
Number of good ones retrieved out of all good ones
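A minimal sketch of the recall calculation, under the same assumption that retrieved and relevant documents are given as sets of document IDs. The IDs, the function name, and the choice to return 0 when the collection has no relevant documents are assumptions made for illustration.

```python
def recall(retrieved: set, relevant: set) -> float:
    """Fraction of relevant documents that were retrieved: TP / (TP + FN)."""
    if not relevant:
        # Assumption: treat recall as 0 when no relevant documents exist.
        return 0.0
    true_positives = len(retrieved & relevant)  # relevant docs that were retrieved
    return true_positives / len(relevant)       # len(relevant) == TP + FN


# Hypothetical document IDs.
retrieved_docs = {"d1", "d2", "d3", "d4"}
relevant_docs = {"d1", "d3", "d5"}
print(recall(retrieved_docs, relevant_docs))  # 2 retrieved out of 3 relevant
```

Retrieving every document in the collection would push recall to 1 while dragging precision down, which is why the two metrics are reported together.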