01 Boolean retrieval Flashcards

1
Q

Information Retrieval

A

Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).

2
Q

Index

A

Indexing the documents in advance avoids having to scan the text linearly for each query.

3
Q

Terms

A

Terms are the indexed units, usually words.

4
Q

Boolean retrieval model

A

A model for information retrieval in which we can pose any query that is in the form of a Boolean expression of terms, that is, in which terms are combined with the operators AND, OR, and NOT. The model views each document as just a set of words.
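
A minimal sketch of the idea, assuming a tiny made-up collection in which each document is already reduced to a set of words. The names `docs`, `matching`, and `all_docs` are illustrative only. AND, OR, and NOT map onto set intersection, union, and complement.

```python
# Toy collection (hypothetical data): each document is just a set of words.
docs = {
    1: {"brutus", "caesar", "calpurnia"},
    2: {"brutus", "caesar"},
    3: {"caesar", "antony"},
}
all_docs = set(docs)


def matching(term):
    """Return the set of docIDs whose word set contains the term."""
    return {doc_id for doc_id, words in docs.items() if term in words}


# brutus AND caesar AND NOT calpurnia
and_not = (matching("brutus") & matching("caesar")) - matching("calpurnia")

# antony OR calpurnia
either = matching("antony") | matching("calpurnia")

# NOT brutus (complement with respect to the whole collection)
not_brutus = all_docs - matching("brutus")

print(and_not, either, not_brutus)
```

This version scans every document for each term, which is exactly the linear scan that an index is meant to avoid.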

5
Q

Collection (Corpus)

A

The group of documents over which we perform retrieval.

6
Q

Ad hoc retrieval

A

An IR task in which a system aims to provide documents from within the collection that are relevant to an arbitrary user information need.

7
Q

Effectiveness

A

Two key factors:

Precision
What fraction of the returned results are relevant to the information need?

Recall
What fraction of the relevant documents in the collection were returned by the system?
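
The two fractions can be computed directly from sets of docIDs. This is a small sketch; the function name `precision_recall` and the example docIDs are made up for illustration, and returning 0.0 for an empty set is an assumption rather than a standard convention.

```python
def precision_recall(returned, relevant):
    """Compute (precision, recall) from two collections of docIDs."""
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant  # relevant documents that were actually returned
    # Assumption: report 0.0 when a denominator would be zero.
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical run: 4 results returned, 5 documents are truly relevant,
# and 2 of the returned results are among them.
p, r = precision_recall(returned=[1, 2, 3, 4], relevant=[2, 4, 5, 6, 7])
print(p, r)  # 0.5 0.4
```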

8
Q

Inverted index (II, inverted file)

A

We keep a dictionary of terms. For each term, we have a list that records which documents the term occurs in
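
A rough sketch of such a structure, assuming naive lowercase whitespace tokenization and a made-up three-document collection. The function name `build_inverted_index` is hypothetical.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of docIDs it occurs in."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Assumption: lowercase + whitespace split stands in for real tokenization.
        for term in text.lower().split():
            index[term].add(doc_id)
    # Terms in alphabetical order; each list of docIDs sorted ascending.
    return {term: sorted(ids) for term, ids in sorted(index.items())}


docs = {1: "new home sales", 2: "home sales rise", 3: "new sales forecast"}
index = build_inverted_index(docs)
print(index["home"])   # [1, 2]
print(index["sales"])  # [1, 2, 3]
```

The dictionary keys play the role of the dictionary of terms, and each value is the list of documents for that term.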

9
Q

Posting

A

Each item in a posting list. A posting records that a term appeared in a document, typically by storing that document's docID.

10
Q

Posting lists

A

The list of postings for a single term is its posting list (also called a postings list).

11
Q

Postings

A

All posting lists put together.

12
Q

Dictionary

A

All terms in the II put together

13
Q

Sorting

A

The core step in building an II. The terms must be put in alphabetical order, and the postings for each term are kept sorted by docID.

14
Q

Posting list intersection (merging posting lists)

A

To find the docs that contain both terms, we must intersect their posting lists.
Common intersection algorithm:

  1. Maintain a pointer into each PL, starting at the beginning
  2. Compare the docIDs pointed to by the two pointers
  3. If they are the same, put that posting in the result list and advance both pointers
  4. Otherwise, advance the pointer pointing to the smaller docID

Repeat from step 2 until one of the lists is exhausted.

This takes O(x + y) operations, where x and y are the lengths of the two posting lists. Formally, the complexity is O(N), where N is the number of documents in the corpus, because a posting list holds at most N docIDs.

Important! To use this algorithm, the PLs must be sorted by a single global ordering (for example, by docID).
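
The steps above can be sketched as follows, assuming each posting list is a Python list of integer docIDs sorted in ascending order. The function name `intersect` and the example lists are illustrative.

```python
def intersect(p1, p2):
    """Intersect two posting lists that are both sorted by docID."""
    answer = []
    i = j = 0  # one pointer per list, both starting at the beginning
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # Same docID in both lists: keep it and advance both pointers.
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # advance the pointer at the smaller docID
        else:
            j += 1
    return answer


brutus = [1, 2, 4, 11, 31, 45, 173, 174]
calpurnia = [2, 31, 54, 101]
print(intersect(brutus, calpurnia))  # [2, 31]
```

Each loop iteration advances at least one pointer, so the loop runs at most x + y times. If the lists were not sorted by the same ordering, advancing past the smaller docID could skip a match.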

15
Q

Query optimization

A

The process of selecting how to organize the work of answering a query so that the least total amount of work needs to be done by the system.
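
One common heuristic for AND queries, not stated on this card, is to process terms in order of increasing posting list length, so that intermediate results stay small. The sketch below assumes an index that maps terms to sorted docID lists; `intersect_many` and the sample index are hypothetical, and Python sets stand in for the pointer-based merge.

```python
def intersect_many(index, terms):
    """AND together several terms, starting with the shortest posting list."""
    # Assumed heuristic: shortest list first keeps intermediate results small.
    lists = sorted((index[t] for t in terms), key=len)
    if not lists:
        return []
    result = set(lists[0])
    for postings in lists[1:]:
        result &= set(postings)
        if not result:
            break  # nothing can match any more, so stop early
    return sorted(result)


index = {
    "brutus": [1, 2, 4, 11, 31, 45, 173, 174],
    "caesar": [1, 2, 4, 5, 6, 16, 57, 132],
    "calpurnia": [2, 31, 54, 101],
}
print(intersect_many(index, ["brutus", "caesar", "calpurnia"]))  # [2]
```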

16
Q

Ranked retrieval models

A

The Boolean retrieval model contrasts with ranked retrieval models such as the vector space model, in which the system decides which documents best satisfy the query.

17
Q

Proximity operators

A

A way of specifying that two terms in a query must occur close to each other in a document. More info in Lecture 2: Positional index

18
Q

Term frequency

A

The number of times a term occurs in a document.
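
As a small illustration, term frequencies for one document can be counted with `collections.Counter`. The function name `term_frequencies` and the lowercase whitespace tokenization are assumptions made for this sketch.

```python
from collections import Counter


def term_frequencies(text):
    """Count how many times each term occurs in a single document."""
    # Assumption: lowercase + whitespace split stands in for real tokenization.
    return Counter(text.lower().split())


tf = term_frequencies("to be or not to be")
print(tf["to"], tf["be"], tf["or"])  # 2 2 1
```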