Introduction to IR Flashcards

1
Q

What is data mining?

A

Extracting knowledge from large amounts of data

2
Q

What are the 4 main parts of information retrieval?

A
  1. The corpus
  2. An information need
  3. A metric of relevance
  4. A query
3
Q

What is a corpus?

A

A large repository of documents

4
Q

What is an information need?

A

The topic about which the user wants information

5
Q

What is relevance?

A

A measure of whether a document contains information that satisfies the information need

6
Q

What is a query?

A

How the information need is expressed to the computer

7
Q

What is structured data?

A

Data that conforms to a predefined schema. Usually refers to information stored in tables with a clear structure

8
Q

What is unstructured data?

A

Any data without a clear structure

9
Q

What type of system does each type of data require?

A

Structured: database systems
Unstructured: information retrieval systems

10
Q

What is semi-structured data?

A

Data that has some structure, but not a strict one. Almost no data is truly unstructured
Ex: A document has a title, subtitle, references, etc.

11
Q

What is information retrieval?

A

Finding material of an unstructured nature that satisfies an information need from within large collections

12
Q

What is the goal of information retrieval?

A

To retrieve documents containing information relevant to the user’s information need, helping the user complete a task

13
Q

What are 2 metrics for measuring how well the retrieved documents match the information need?

A

Precision and recall

14
Q

What is precision?

A

The fraction of retrieved docs that are relevant to the user’s information need
TP/(TP + FP)
Number of good ones out of all ones retrieved

15
Q

What is recall?

A

The fraction of relevant docs in the collection that are retrieved
TP/(TP + FN)
Number of good ones retrieved out of all good ones
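
The two formulas, TP/(TP + FP) and TP/(TP + FN), can be sketched in a few lines of Python. This is only an illustration: the function name `precision_recall` and the doc IDs are made up for the example, and returning 0.0 for an empty set is an assumed convention.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a collection of retrieved doc IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)   # relevant docs that were retrieved
    fp = len(retrieved - relevant)   # retrieved docs that are not relevant
    fn = len(relevant - retrieved)   # relevant docs that were missed
    # Assumed convention: report 0.0 when a denominator would be zero.
    precision = tp / (tp + fp) if retrieved else 0.0
    recall = tp / (tp + fn) if relevant else 0.0
    return precision, recall

# Hypothetical query: 4 docs retrieved, 6 docs relevant in the whole collection.
precision, recall = precision_recall({1, 2, 3, 4}, {2, 3, 4, 5, 6, 7})
```

Here 3 of the 4 retrieved docs are relevant (precision 3/4), but only 3 of the 6 relevant docs were found (recall 3/6).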

16
Q

Why is a linear scan of documents to perform a term search not a good solution?

A
  1. It has to be repeated for every query
  2. It only works for a small corpus; on a large one it is far too slow
  3. More flexible matching operations, like proximity searching, are impractical
  4. Doesn’t allow for ranked retrieval
17
Q

What is a term-document incidence matrix?

A

A way to speed up query processing by preprocessing the corpus in advance.
It creates a matrix with one row per term and one column per document. When a term appears in a document, the corresponding matrix entry is set to 1; otherwise it is 0

18
Q

How is a compound boolean query performed on a term-document incidence matrix?

A

Take the bit string (matrix row) for each term in the query and combine them with bitwise AND, OR, and NOT. The 1 bits in the result mark the matching documents
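
A minimal sketch of this idea, assuming each term's row is stored as a Python integer whose bit d is set when the term occurs in doc d. The four tiny documents and the helper names `build_incidence` and `matching_docs` are made up for illustration.

```python
# Hypothetical toy corpus: doc ID -> text.
docs = {
    0: "new home sales top forecasts",
    1: "home sales rise in july",
    2: "increase in home sales in july",
    3: "july new home sales rise",
}

def build_incidence(docs):
    """Map each term to an int whose bit d is 1 when the term occurs in doc d."""
    matrix = {}
    for doc_id, text in docs.items():
        for term in set(text.split()):
            matrix[term] = matrix.get(term, 0) | (1 << doc_id)
    return matrix

def matching_docs(bits, n_docs):
    """Decode a bit vector back into a sorted list of doc IDs."""
    return [d for d in range(n_docs) if (bits >> d) & 1]

matrix = build_incidence(docs)
all_docs = (1 << len(docs)) - 1   # mask with one bit per document

# Query: home AND july AND NOT new
# NOT is masked so that only bits for real documents survive.
result = matrix["home"] & matrix["july"] & (~matrix["new"] & all_docs)
hits = matching_docs(result, len(docs))
```

In this toy corpus the query leaves docs 1 and 2: both contain "home" and "july" and neither contains "new".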

19
Q

What is the issue with a term-document incidence matrix?

A

For very large corpora with many terms, storing the matrix takes up a lot of memory. The matrix is also extremely sparse: the vast majority of its entries are 0
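
A back-of-the-envelope calculation makes the problem concrete. All three sizes below are assumed, illustrative figures, not measurements of any real collection.

```python
# Assumed, illustrative collection sizes.
n_docs = 1_000_000        # documents in the corpus
n_terms = 500_000         # distinct terms in the vocabulary
tokens_per_doc = 1_000    # average document length in tokens

cells = n_docs * n_terms               # total entries in the matrix
max_ones = n_docs * tokens_per_doc     # upper bound on entries equal to 1
density = max_ones / cells             # fraction of entries that can be 1
matrix_gb = cells / 8 / 1e9            # size at one bit per entry, in GB
```

Under these assumptions the matrix has 500 billion entries (about 62.5 GB at one bit each), yet at most 0.2% of them can be 1, since a document of 1,000 tokens cannot contain more than 1,000 distinct terms.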

20
Q

How do we solve the issue with a term-document incidence matrix?

A

Only record the places in the matrix where 1’s would appear. Use an inverted index: a dictionary that maps each term to a postings list of the IDs of the docs in which it appears.
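
A minimal sketch of building such an index, assuming whitespace tokenization and lowercasing as the only preprocessing. The function name `build_inverted_index` and the sample documents are hypothetical.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to a sorted postings list of the doc IDs containing it."""
    index = defaultdict(list)
    # Visiting doc IDs in ascending order keeps every postings list sorted.
    for doc_id in sorted(docs):
        # set() ensures each doc ID is recorded once per term.
        for term in set(docs[doc_id].lower().split()):
            index[term].append(doc_id)
    return dict(index)

# Hypothetical toy corpus: doc ID -> text.
docs = {1: "new home sales", 2: "home sales rise", 3: "july sales rise"}
index = build_inverted_index(docs)
```

Only the (term, doc ID) pairs that would have been 1's in the matrix are stored; for example, `index["home"]` holds docs 1 and 2 and nothing is stored for doc 3.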

21
Q

What are the methods for storing a postings list in an inverted index?

A

Linked lists and variable-length arrays

22
Q

How do you perform a boolean retrieval query on an inverted index?

A

Retrieve the postings list for each term and merge them using set operations: intersection for AND, union for OR, complement for NOT
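
For AND queries, the merge is commonly done by walking two sorted postings lists in step. The sketch below assumes both lists are sorted by doc ID; the function name `intersect` and the sample postings are illustrative.

```python
def intersect(p1, p2):
    """Intersect two postings lists that are sorted by doc ID."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # Doc ID appears in both lists, so it matches the AND query.
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1   # advance the list with the smaller doc ID
        else:
            j += 1
    return answer

# Hypothetical postings lists for two terms.
brutus = [1, 2, 4, 11, 31, 45, 173, 174]
calpurnia = [2, 31, 54, 101]
both = intersect(brutus, calpurnia)
```

Because each step advances at least one pointer, the merge takes time proportional to the combined length of the two lists rather than their product.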