Introduction to IR Flashcards

Question 1

Q

What is data mining?

Answer

A

Extracting knowledge from large amounts of data

Question 2

Q

What are the 4 main parts of information retrieval?

Answer

A

The corpus
An information need
A metric of relevance
A query

Question 3

Q

What is a corpus?

Answer

A

A large repository of documents

Question 4

Q

What is an information need?

Answer

A

The topic about which you desire to get information

Question 5

Q

What is relevance?

Answer

A

A measure of whether a document contains information that satisfies the information need

Question 6

Q

What is a query?

Answer

A

How the information need is expressed to the computer

Question 7

Q

What is structured data?

Answer

A

Data that conforms to a predefined schema. Usually refers to information stored in tables with a clear structure

Question 8

Q

What is unstructured data?

Answer

A

Any data without a clear structure

Question 9

Q

What type of system does each type of data require?

Answer

A

Structured: database systems
Unstructured: Information retrieval systems

Question 10

Q

What is semi-structured data?

Answer

A

Data that has some structure, but not a strict one. Almost no data is truly unstructured
Ex: A document has a title, subtitle, references, etc.

Question 11

Q

What is information retrieval?

Answer

A

Finding material of an unstructured nature that satisfies an information need from within large collections

Question 12

Q

What is the goal of information retrieval?

Answer

A

To retrieve documents with information relevant to the user’s information need and help the user complete a task

Question 13

Q

What are 2 metrics to measure the relevance of retrieved documents?

Answer

A

Precision and recall

Question 14

Q

What is precision?

Answer

A

The fraction of retrieved docs relevant to the user’s information need
TP/(TP + FP)
Number of good ones out of all ones retrieved
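
A minimal sketch of the formula above, assuming retrieved and relevant documents are given as collections of doc IDs. The function name and the sample IDs are hypothetical, and returning 0.0 for an empty result set is a convention chosen here, since precision is undefined in that case.

```python
def precision(retrieved, relevant):
    """Fraction of retrieved documents that are relevant: TP / (TP + FP)."""
    retrieved = set(retrieved)
    if not retrieved:
        # Assumption: treat the undefined case (nothing retrieved) as 0.0.
        return 0.0
    true_positives = len(retrieved & set(relevant))
    return true_positives / len(retrieved)


# Hypothetical doc IDs: 4 documents retrieved, 3 of them relevant.
p = precision(retrieved=[1, 2, 3, 4], relevant=[2, 3, 4, 5, 6])
print(p)  # 0.75
```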

Question 15

Q

What is recall?

Answer

A

Fraction of relevant docs in the collection that are retrieved
TP/(TP + FN)
Number of good ones retrieved out of all good ones
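
A minimal sketch of the formula above, assuming retrieved and relevant documents are given as collections of doc IDs. The function name and the sample IDs are hypothetical, and returning 0.0 when the collection holds no relevant documents is a convention chosen here.

```python
def recall(retrieved, relevant):
    """Fraction of relevant documents that were retrieved: TP / (TP + FN)."""
    relevant = set(relevant)
    if not relevant:
        # Assumption: treat the undefined case (no relevant docs) as 0.0.
        return 0.0
    true_positives = len(set(retrieved) & relevant)
    return true_positives / len(relevant)


# Hypothetical doc IDs: 5 relevant documents exist, 3 of them were retrieved.
r = recall(retrieved=[1, 2, 3, 4], relevant=[2, 3, 4, 5, 6])
print(r)  # 0.6
```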

Question 16

Q

Why is a linear scan of documents to perform a term search not a good solution?

Answer

A

It has to be repeated for every query
Only practical for a small corpus; far too slow otherwise
Operations like proximity searching are not possible
Doesn’t allow for ranked retrieval
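
For comparison, a linear scan might look like the sketch below. The tiny corpus and the function name are hypothetical. Note that every call reads every document again, which is the cost the points above describe.

```python
def linear_scan(docs, term):
    """Return the IDs of documents containing `term` by reading every document."""
    term = term.lower()
    return [doc_id for doc_id, text in docs.items()
            if term in text.lower().split()]


# Hypothetical three-document corpus keyed by doc ID.
docs = {
    1: "Brutus killed Caesar",
    2: "Caesar was ambitious",
    3: "Calpurnia had a dream",
}
hits = linear_scan(docs, "caesar")
print(hits)  # [1, 2]
```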

Question 17

Q

What is a term-document incidence matrix?

Answer

A

A way to speed up query processing by preprocessing the corpus in advance.
It creates a matrix of terms and documents. When a term appears in a document, the corresponding matrix entry is set to 1
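
A minimal sketch of building such a matrix, assuming whitespace tokenization and lowercasing (both simplifications). The corpus and the function name are hypothetical. Each term maps to one row with a 0/1 entry per document.

```python
def build_incidence_matrix(docs):
    """Return (doc_ids, matrix) where matrix[term] is a 0/1 row over doc_ids."""
    doc_ids = sorted(docs)
    tokens = {d: set(docs[d].lower().split()) for d in doc_ids}
    vocabulary = sorted(set().union(*tokens.values()))
    matrix = {
        term: [1 if term in tokens[d] else 0 for d in doc_ids]
        for term in vocabulary
    }
    return doc_ids, matrix


# Hypothetical three-document corpus keyed by doc ID.
docs = {
    1: "Brutus killed Caesar",
    2: "Caesar was ambitious",
    3: "Calpurnia had a dream",
}
doc_ids, matrix = build_incidence_matrix(docs)
print(matrix["caesar"])  # [1, 1, 0]
```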

Question 18

Q

How is a compound boolean query performed on a term-document incidence matrix?

Answer

A

Apply bitwise operations (AND, OR, NOT) to the terms’ bit strings (their rows in the matrix); the 1 bits of the result mark the matching documents
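
As an illustration, the query "brutus AND caesar AND NOT calpurnia" can be sketched with Python integers standing in for bit strings. The rows below are hypothetical: the leftmost bit corresponds to document 1, and the collection is assumed to hold 6 documents.

```python
# Hypothetical term rows over 6 documents (leftmost bit = document 1).
rows = {
    "brutus":    0b110100,
    "caesar":    0b110111,
    "calpurnia": 0b010000,
}
n_docs = 6
mask = (1 << n_docs) - 1  # keeps NOT within the 6 document bits

# brutus AND caesar AND NOT calpurnia
result = rows["brutus"] & rows["caesar"] & (~rows["calpurnia"] & mask)

bits = format(result, f"0{n_docs}b")
matching_docs = [i + 1 for i, bit in enumerate(bits) if bit == "1"]
print(bits, matching_docs)  # 100100 [1, 4]
```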

Question 19

Q

What is the issue with a term-document incidence matrix?

Answer

A

For very large corpora with many terms, storing the matrix takes up a lot of memory, and the matrix is sparse: most of its entries are 0

Question 20

Q

How do we solve the issue with a term-document incidence matrix?

Answer

A

Record only the places in the matrix where 1s would appear. Use an inverted index: a dictionary that maps each term to the list of doc IDs where the term appears (its postings list).
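
A minimal sketch of an inverted index, assuming whitespace tokenization and lowercasing. The corpus and the function name are hypothetical. Documents are processed in doc ID order, so each postings list comes out sorted.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of doc IDs that contain it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        # set() so a term repeated within one document is recorded once.
        for term in set(docs[doc_id].lower().split()):
            index[term].append(doc_id)
    return dict(index)


# Hypothetical three-document corpus keyed by doc ID.
docs = {
    1: "Brutus killed Caesar",
    2: "Caesar was ambitious",
    3: "Calpurnia had a dream",
}
index = build_inverted_index(docs)
print(index["caesar"])  # [1, 2]
```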

Question 21

Q

What are the methods for storing a postings list in an inverted index?

Answer

A

Linked lists and variable length arrays

Question 22

Q

How do you perform a boolean retrieval query on an inverted index?

Answer

A

Retrieve the postings list for each term and merge the lists using set operations (for example, intersection for AND and union for OR)
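
For an AND query, the merge can be sketched as a two-pointer walk over two postings lists, assuming both are sorted by doc ID. The function name and the postings below are hypothetical.

```python
def intersect(postings_a, postings_b):
    """Intersect two postings lists that are sorted by doc ID."""
    i = j = 0
    answer = []
    while i < len(postings_a) and j < len(postings_b):
        if postings_a[i] == postings_b[j]:
            answer.append(postings_a[i])
            i += 1
            j += 1
        elif postings_a[i] < postings_b[j]:
            i += 1  # advance the pointer that sits on the smaller doc ID
        else:
            j += 1
    return answer


# Hypothetical postings lists for two terms.
brutus = [1, 2, 4, 11, 31, 45, 173, 174]
calpurnia = [2, 31, 54, 101]
common = intersect(brutus, calpurnia)
print(common)  # [2, 31]
```

Because both lists are sorted, the walk visits each posting at most once, so the cost grows with the combined length of the two lists.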

(22 cards)