Lecture 7 - Information Retrieval Flashcards

1
Q

The contrast between an IR system and a database can also be phrased as…

A

Unstructured vs Structured data

2
Q

Structured data tends to refer to information in…

A

Tables

3
Q

Describe the Boolean Retrieval Model

A

The Boolean retrieval model answers a query by checking whether the query terms are present in a document.
It is absolute: a document either matches the query condition or it does not.

4
Q

What is the issue with the Boolean Retrieval Model?

A

With bigger collections, this becomes computationally heavy

Example:

  • N = 1M documents, each with about 1000 words → around 6 GB of data
    • If there are 500K distinct terms in these documents, then a matrix of size 500K * 1M will have half a trillion 0’s and 1’s
    • But no more than one billion 1’s
      • Matrix is extremely sparse
    • What’s a better representation?
      • We only store the ‘1’ positions → Inverted Index

Also:
Boolean queries often result in either too few (=0) or too many (1000s) results
It takes a lot of skill to come up with a query that produces a manageable number of hits
- AND gives too few; OR gives too many
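
The inverted-index idea above can be sketched in code (toy documents invented for illustration): store only the ‘1’ positions of the term-document matrix, so a Boolean AND becomes a postings intersection instead of a matrix scan.

```python
from collections import defaultdict

# Hypothetical toy collection, for illustration only
docs = {
    1: "new home sales top forecasts",
    2: "home sales rise in july",
    3: "increase in home sales in july",
}

# Inverted index: for each term, store only the doc IDs that contain it,
# i.e. only the '1' entries of the term-document incidence matrix
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

def boolean_and(t1, t2):
    # Boolean AND = intersection of the two postings lists
    return sorted(index[t1] & index[t2])
```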

5
Q

What is an Inverted Index?

A

An inverted index keeps a dictionary of terms and, for each term t, stores a list (the postings list) of all documents that contain t

I.e., you can say that the Boolean incidence matrix is organized from the perspective of the document, whereas the inverted index is organized from the perspective of the term

6
Q

What are the steps of constructing an inverted index?

A

Documents to be indexed -> Tokenizer -> Linguistic Modules -> Indexer = Inverted Index
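
A minimal sketch of this pipeline, with lowercasing and punctuation-stripping standing in for the linguistic modules (real systems also apply stemming, stop-word removal, etc.):

```python
from collections import defaultdict

def tokenize(text):
    # Tokenizer: split raw text into a token stream
    return text.split()

def normalize(tokens):
    # Linguistic modules (simplified): lowercase and strip punctuation
    return [t.lower().strip(".,;:") for t in tokens]

def build_index(docs):
    # Indexer: map each term to a sorted postings list of doc IDs
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in normalize(tokenize(text)):
            index[term].add(doc_id)
    return {term: sorted(postings) for term, postings in index.items()}

index = build_index({1: "Friends, Romans, countrymen.",
                     2: "So let it be with Caesar."})
```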

7
Q

What is meant by phrase queries?

A

We want to be able to answer queries such as “Stanford university” as a phrase.

8
Q

For phrase queries, is it sufficient to store <term : docs> entries?

A

No. A term → documents index tells us which documents contain each word, but not whether the words occur next to each other as a phrase.

9
Q

What is meant by biword indexes

A

Instead of storing just single words, we now store biwords

Example:
“Friends, Romans, Countrymen” would now store:
- friends romans
- romans countrymen

This allows us to do two-word phrase query-processing
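
A sketch of how the example above could be indexed (tokenization simplified; each biword becomes a dictionary term with its own postings list):

```python
from collections import defaultdict

def biwords(text):
    # Normalize tokens, then pair each token with its successor
    tokens = [t.lower().strip(",.") for t in text.split()]
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

biword_index = defaultdict(set)
for doc_id, text in {1: "Friends, Romans, Countrymen"}.items():
    for bw in biwords(text):
        biword_index[bw].add(doc_id)
```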

10
Q

How can we do longer phrase queries using biwords?

A

Longer phrases can be processed by breaking them down

Example:
“Stanford university palo alto” can be broken into the boolean query on biwords:
- stanford university AND university palo AND palo alto

We cannot verify that the docs matching the above Boolean query do contain the phrase
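
The caveat can be demonstrated with a hypothetical document that contains all three query biwords but never the phrase itself; it still matches the Boolean biword query:

```python
def biwords(text):
    tokens = text.lower().split()
    return set(f"{a} {b}" for a, b in zip(tokens, tokens[1:]))

query = "stanford university palo alto"
# Contains "stanford university", "university palo", and "palo alto"
# in scattered positions, but not the four-word phrase
doc = "stanford university is near university palo verde and palo alto"

matches = all(bw in biwords(doc) for bw in biwords(query))
contains_phrase = query in doc.lower()
```

This is exactly the false-positive problem: the Boolean query on biwords succeeds even though the phrase is absent.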

11
Q

What are the issues for biwords indexes?

A

False Positives

Index blowup due to bigger dictionary
- Infeasible for more than biwords, big even for them

Biword indexes are not the standard solution (for all biwords) but can be part of a compound strategy

12
Q

Name an alternative to biword indexes

A

Positional Indexes

In the postings, store for each term the positions at which its occurrences appear in each document, e.g. <term: doc1: (pos1, pos2, …); doc2: (…); …>
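
A simplified sketch of a positional index (term → {doc → [positions]}) and phrase matching via consecutive positions; real indexes use compressed postings:

```python
from collections import defaultdict

def build_positional_index(docs):
    # term -> {doc_id -> sorted list of positions within the document}
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term][doc_id].append(pos)
    return index

def phrase_query(index, phrase):
    # A doc matches if the terms occur at consecutive positions
    terms = phrase.lower().split()
    candidate_docs = set(index[terms[0]])
    for t in terms[1:]:
        candidate_docs &= set(index[t])
    result = []
    for doc_id in sorted(candidate_docs):
        for start in index[terms[0]][doc_id]:
            if all(start + i in index[t][doc_id] for i, t in enumerate(terms)):
                result.append(doc_id)
                break
    return result

idx = build_positional_index({
    1: "stanford university palo alto",
    2: "palo alto is near stanford",
})
```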

13
Q

Is a positional index larger or smaller than a non-positional index?

A

Substantially larger: it stores an entry per term occurrence, not per document. A rule of thumb is 2–4 times the size of a non-positional index (roughly 35–50% of the volume of the original text for English)

14
Q

Can you combine Biword Indexes and Positional Indexes?

A

Yes. A common compound strategy is to index selected frequent phrases (e.g. popular queries) directly as biwords and fall back to the positional index for everything else

15
Q

What is the difference between Boolean Retrieval Models and Ranked Retrieval Models?

A

Rather than a set of documents satisfying a query expression, in ranked retrieval, the system returns an ordering over the (top) documents in the collection for a query

16
Q

What is meant by “Free text queries”?

A

Rather than a query language of operators and expressions (e.g. SQL), the user’s query is just one or more words in a human language

17
Q

What are some ways of scoring as the basis of ranked retrieval?

A

Term Frequency
-> Log-frequency weighting

IDF weight

TF-IDF weighting

18
Q

Explain Term Frequency

A

Term frequency is the number of occurrences of a term within a document.

The term frequency tf_t,d of term t in document d is defined as the number of times that t occurs in d

19
Q

Is raw term frequency sufficient?

A

No: A document with 10 occurrences of the term is more relevant than a document with 1 occurrence of the term. But not 10 times more relevant -> Relevance does not increase proportionally with term frequency

Therefore, we use log-frequency weighting
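
The usual log-frequency scheme sets w = 1 + log10(tf) for tf > 0 and w = 0 otherwise, so weight grows, but sub-linearly, with raw frequency. A sketch:

```python
import math

def log_tf_weight(tf):
    # Dampen raw term frequency: tf of 1 -> 1, 10 -> 2, 100 -> 3, 0 -> 0
    return 1 + math.log10(tf) if tf > 0 else 0
```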

20
Q

Explain Document Frequency

A

Document frequency is the number of documents containing a particular term

21
Q

What is idf an acronym for?

A

Inverse document frequency

22
Q

Explain idf

A

The inverse document frequency (idf) is a statistical weight measuring the importance of a term in a text document collection. With N documents in the collection and df_t the number of documents in which term t appears, idf_t = log (N / df_t). Rare terms receive high idf; frequent terms receive low idf.
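
With N documents and df_t the document frequency of term t, the standard weight is idf_t = log10(N / df_t); a sketch:

```python
import math

def idf(N, df):
    # idf_t = log10(N / df_t): rare terms get high weight,
    # a term occurring in every document gets weight 0
    return math.log10(N / df)
```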

23
Q

Why is document frequency insufficient?

A

We want high weight for rare terms

Rare terms are more informative than frequent terms

  • Recall stop words (the, are, of)
  • Also frequent terms like high, increase, line are not so relevant

Consider a term in the query that is rare in the collection (e.g., arachnocentric)
- A document containing this term is very likely to be relevant to the query arachnocentric

24
Q

What is the effect of idf on ranking for one-term queries, like “iphone”?

A

idf has no effect on ranking one-term queries

  • idf affects the ranking of documents for queries with at least two terms
  • For the query “capricious person”, idf weighting makes occurrences of “capricious” count for much more in the final document ranking than occurrences of “person”

25
Q

Explain tf-idf weighting

A

The tf-idf weight of a term is the product of its tf weight and its idf weight

-> This is the best known weighting scheme in information retrieval
(the hyphen in tf-idf is a hyphen, NOT a minus sign)

  • Increases with number of occurrences within a document
  • Increases with the rarity of the term in the collection
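
Combining the two weights (using the log-frequency tf variant and log10 idf, one common choice among several tf-idf variants):

```python
import math

def tf_idf(tf, df, N):
    # Product of the log-frequency tf weight and the idf weight
    tf_weight = 1 + math.log10(tf) if tf > 0 else 0
    idf_weight = math.log10(N / df)
    return tf_weight * idf_weight
```

Both properties from the card hold: the weight increases with occurrences within a document and with the rarity of the term in the collection.
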

26
Q

When processing documents, we typically store these in…

A

Vectors

  • |V| dimensional vector space
    • Terms are axes of the space
    • Documents are points or vectors in this space

-> Very high dimensional (Tens of millions of dimensions when you apply this to a web search engine)

27
Q

Could we store queries as vectors as well?

A

Yes!
There are two key ideas:
1. Do the same for queries: represent them as vectors in the space
2. Rank documents according to their proximity to the query in this space

28
Q

Why would we like to store queries as vectors?

A

We do this because we want to get away from the “You’re either in or out” boolean model
Instead: Rank more relevant documents higher than less relevant documents

29
Q

Why is distance a bad idea to estimate proximity between two vectors?

A

Euclidean distance is large for vectors of different lengths: a document d and the document d appended to itself have identical term distributions, yet a large Euclidean distance between them. (See notion for a graph)

30
Q

What should we use instead of euclidean distance to measure proximity?

A

We use the angle instead

We should rank documents according to their angle with the query instead. Thought experiment: take a document d and append it to itself; the angle between the two documents is 0, corresponding to maximal similarity.

31
Q

How are cosines related to text similarity?

A

You can use cosines to compute similarity between two words or documents or documents and queries etc

32
Q

How does the cosine function for a query and document work?

A

cos(q, d) = (q · d) / (|q| |d|): the dot product of the two vectors divided by the product of their lengths, i.e. the dot product of the length-normalized vectors. (See notion for the full formula)
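
A sketch of cosine similarity between two already-weighted vectors:

```python
import math

def cosine(q, d):
    # cos(q, d) = dot(q, d) / (|q| * |d|)
    dot = sum(qi * di for qi, di in zip(q, d))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    norm_d = math.sqrt(sum(di * di for di in d))
    return dot / (norm_q * norm_d)
```

Note that scaling a vector does not change the cosine: a document and the same document repeated twice get similarity 1.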

33
Q

How can you evaluate information retrieval?

A

Precision: Fraction of retrieved docs that are relevant = P(relevant | retrieved)

Recall: Fraction of relevant docs that are retrieved = P(retrieved | relevant)
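
Treating the retrieved and relevant results as sets of doc IDs, both measures are simple ratios; a sketch:

```python
def precision_recall(retrieved, relevant):
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)          # relevant docs we retrieved
    precision = tp / len(retrieved)         # P(relevant | retrieved)
    recall = tp / len(relevant)             # P(retrieved | relevant)
    return precision, recall
```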

34
Q

Why shouldn’t you rely on accuracy alone when evaluating information retrieval?

A

In almost all circumstances, the data is extremely skewed: normally over 99.9% of the documents are in the nonrelevant category

35
Q

What is a combined measure that you can use to assess precision/recall?

A

F score (weighted harmonic mean of precision and recall)

People usually use the balanced F1 measure (β = 1, i.e. α = 1/2)
β < 1 emphasizes precision; β > 1 emphasizes recall
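
The general weighted form is F_beta = (1 + beta^2) · P · R / (beta^2 · P + R), which reduces to the familiar harmonic mean 2PR / (P + R) at beta = 1; a sketch:

```python
def f_beta(precision, recall, beta=1.0):
    # Weighted harmonic mean: beta < 1 favors precision, beta > 1 favors recall
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```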

36
Q

What can you use a precision-recall curve for?

A

Seeing how to balance precision vs recall

-> However, you need to see the precision-recall curve for a whole bunch of queries