Lecture 7 - Information Retrieval Flashcards

1
Q

The distinction between an IR system and a database can also be phrased as

A

Unstructured vs Structured data

2
Q

Structured data tends to refer to information in…

A

Tables

3
Q

Describe the Boolean Retrieval Model

A

The Boolean retrieval model answers queries that are Boolean expressions of terms (AND, OR, NOT): for each document, each query term is either present or absent.
It is absolute (a document either matches the condition or it does not)
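The absolute, in-or-out nature of Boolean matching can be sketched in Python; the documents below are made up for illustration:

```python
# Minimal sketch of Boolean retrieval over a toy collection.
# Document IDs and texts are hypothetical.
docs = {
    1: "new home sales top forecasts",
    2: "home sales rise in july",
    3: "increase in home sales in july",
}

# Each document is reduced to its set of terms: matching is absolute.
term_sets = {doc_id: set(text.split()) for doc_id, text in docs.items()}

def boolean_and(*terms):
    """Return the IDs of documents containing ALL the given terms."""
    return sorted(d for d, terms_in_d in term_sets.items()
                  if all(t in terms_in_d for t in terms))

print(boolean_and("home", "sales"))          # [1, 2, 3]
print(boolean_and("home", "sales", "july"))  # [2, 3] - stricter AND, fewer hits
```

Note how adding a term to the AND query can only shrink the result set; there is no notion of one match being better than another.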

4
Q

What is the issue with Boolean Retrieval Model?

A

With bigger collections, the naive term-document incidence matrix becomes infeasible

Example:

  • N = 1M documents, each with about 1000 words → around 6 GB of data
    • If there are 500K distinct terms in these documents, a 500K × 1M incidence matrix has half a trillion 0’s and 1’s
    • But no more than one billion 1’s
      • The matrix is extremely sparse
    • What’s a better representation?
      • We only store the ‘1’ positions → Inverted Index

Also:
Boolean queries often result in either too few (=0) or too many (1000s) results.
It takes a lot of skill to come up with a query that produces a manageable number of hits
- AND gives too few; OR gives too many

5
Q

What is an Inverted Index?

A

An inverted index keeps a dictionary of terms and, for each term t, stores a list of all documents that contain t

I.e.: you can say that the Boolean retrieval model’s incidence matrix is organized from the perspective of the document (which terms it contains), whereas the inverted index is organized from the perspective of the term (which documents contain it)

6
Q

What are the steps of constructing an inverted index?

A

Documents to be indexed -> Tokenizer -> Linguistic modules -> Indexer -> Inverted index
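The pipeline above can be sketched in Python; the documents are made up, and the "linguistic module" here is just lowercasing (a real system would also stem or lemmatize):

```python
# Sketch of the indexing pipeline:
# documents -> tokenizer -> linguistic modules -> indexer -> inverted index.
from collections import defaultdict

docs = {1: "Friends, Romans, countrymen.", 2: "So let it be with Caesar"}

def tokenize(text):
    # Tokenizer: split on whitespace, strip surrounding punctuation.
    return [tok.strip(".,;:!?") for tok in text.split()]

def normalize(tokens):
    # Linguistic module (simplified): lowercase; drop empty tokens.
    return [tok.lower() for tok in tokens if tok]

def build_index(docs):
    # Indexer: for each term, a sorted postings list of doc IDs.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in normalize(tokenize(text)):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

index = build_index(docs)
print(index["caesar"])   # [2]
print(index["friends"])  # [1]
```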

7
Q

What is meant by phrase queries?

A

We want to be able to answer queries such as “Stanford university” as a phrase.

8
Q

For phrase queries, is it sufficient to store term → docs entries?

A

No. A standard inverted index records only which documents contain a term, not where it occurs, so it cannot tell whether the query terms appear next to each other as a phrase.

9
Q

What is meant by biword indexes

A

Instead of storing just single words, we now store biwords

Example:
“Friends, Romans, Countrymen” would now store:
- friends romans
- romans countrymen

This allows us to do two-word phrase query processing
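A biword index can be sketched in Python, using the "Friends, Romans, Countrymen" example from the card:

```python
# Minimal sketch of a biword index: index consecutive word pairs
# instead of single terms. The document is the card's example.
from collections import defaultdict

docs = {1: "friends romans countrymen"}

def biwords(tokens):
    # Every pair of consecutive tokens becomes one dictionary entry.
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

def build_biword_index(docs):
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for bw in biwords(text.split()):
            index[bw].add(doc_id)
    return index

idx = build_biword_index(docs)
print(sorted(idx))  # ['friends romans', 'romans countrymen']
```

A two-word phrase query is then just a single dictionary lookup in this index.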

10
Q

How can we do longer phrase queries using biwords?

A

Longer phrases can be processed by breaking them down

Example:
“Stanford university palo alto” can be broken into the boolean query on biwords:
- stanford university AND university palo AND palo alto

Without inspecting the documents themselves, we cannot verify that the docs matching the above Boolean query actually contain the phrase (false positives are possible)

11
Q

What are the issues for biwords indexes?

A

False positives: all the biwords of a phrase can match without the full phrase being present

Index blowup due to a bigger dictionary
- Infeasible for more than biwords, and big even for them

Biword indexes are not the standard solution (for all biwords) but can be part of a compound strategy

12
Q

Name an alternative to biword indexes

A

Positional indexes: in the postings, store, for each occurrence of a term, its position within the document
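A positional index can be sketched in Python (toy documents made up for illustration); because the postings record positions, it supports phrase queries directly:

```python
# Sketch of a positional index: term -> {doc_id: [positions]}.
from collections import defaultdict

docs = {1: "to be or not to be", 2: "be not afraid"}

def build_positional_index(docs):
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.split()):
            index[term][doc_id].append(pos)
    return index

def phrase_query(index, phrase):
    """Return docs where the phrase's terms occur at consecutive positions."""
    terms = phrase.split()
    # Candidate docs must contain every term of the phrase.
    candidates = set.intersection(*(set(index[t]) for t in terms))
    hits = []
    for doc_id in candidates:
        for p in index[terms[0]][doc_id]:
            # Check that term i occurs at position p + i for all i.
            if all(p + i in index[t][doc_id] for i, t in enumerate(terms)):
                hits.append(doc_id)
                break
    return sorted(hits)

idx = build_positional_index(docs)
print(phrase_query(idx, "to be"))   # [1]
print(phrase_query(idx, "not to"))  # [1]
```

Note that doc 2 contains both "be" and "not" but never as the consecutive phrase "not to", so it is correctly excluded.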

13
Q

Is a positional index larger or smaller than a non-positional index

A

Substantially larger: it must store an entry for every occurrence of a term, not just one per document containing it

14
Q

Can you combine Biword Indexes and Positional Indexes?

A

Yes

15
Q

What is the difference between Boolean Retrieval Models and Ranked Retrieval Models?

A

Rather than a set of documents satisfying a query expression, in ranked retrieval, the system returns an ordering over the (top) documents in the collection for a query

16
Q

What is meant by “Free text queries”?

A

Rather than a query language of operators and expressions (e.g. SQL), the user’s query is just one or more words in a human language

17
Q

What are some ways of scoring as the basis of ranked retrieval?

A

Term Frequency
-> Log-frequency weighting

IDF weight

TF-IDF weighting

18
Q

Explain Term Frequency

A

Term frequency is the number of occurrences of a term within a document.

The term frequency tf_t,d of term t in document d is defined as the number of times that t occurs in d

19
Q

Is raw term frequency sufficient?

A

No: a document with 10 occurrences of the term is more relevant than a document with 1 occurrence of the term, but not 10 times more relevant -> relevance does not increase proportionally with term frequency

Therefore, we use log-frequency weighting
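The log-frequency weighting mentioned above can be sketched as follows (using the conventional base-10 logarithm):

```python
# Log-frequency weighting: w = 1 + log10(tf) if tf > 0, else 0.
# Relevance grows with tf, but sublinearly.
import math

def log_tf_weight(tf):
    """Log-frequency weight of a raw term frequency tf."""
    return 1 + math.log10(tf) if tf > 0 else 0

for tf in [0, 1, 2, 10, 1000]:
    print(tf, round(log_tf_weight(tf), 3))
# 0 -> 0, 1 -> 1, 2 -> 1.301, 10 -> 2, 1000 -> 4
```

So 10 occurrences score 2 rather than 10 times the score of a single occurrence, matching the intuition on this card.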

20
Q

Explain Document Frequency

A

Document frequency is the number of documents containing a particular term

21
Q

What is idf an acronym for?

A

Inverse document frequency

22
Q

Explain idf

A

The inverse document frequency (idf) is a statistical weight measuring how informative a term is across a collection: idf_t = log10(N / df_t), where N is the number of documents in the collection and df_t is the document frequency of t (the number of documents in which t appears). Rare terms get a high idf; frequent terms get a low idf.

23
Q

Why is document frequency insufficient?

A

We want high weight for rare terms

Rare terms are more informative than frequent terms

  • Recall stop words (the, are, of)
  • Also terms like high, increase, line are frequent and hence not so informative

Consider a term in the query that is rare in the collection (e.g., arachnocentric)
- A document containing this term is very likely to be relevant to the query arachnocentric

24
Q

What is the effect of idf on ranking for one-term queries, like “iphone”?

A

idf has no effect on ranking one-term queries

  • idf affects the ranking of documents for queries with at least two terms
  • For the query “capricious person”, idf weighting makes occurrences of “capricious” count for much more in the final document ranking than occurrences of “person”
25

Q

Explain tf-idf weighting

A

The tf-idf weight of a term is the product of its tf weight and its idf weight
-> This is the best known weighting scheme in information retrieval (the hyphen in tf-idf is a hyphen, NOT a minus sign)
- Increases with the number of occurrences within a document
- Increases with the rarity of the term in the collection
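The tf and idf pieces combine as sketched below; N and the document frequencies are made-up collection statistics for illustration:

```python
# tf-idf weight = log-frequency tf weight * idf weight.
# N and df values are hypothetical collection statistics.
import math

N = 1_000_000  # documents in the collection
df = {"the": 1_000_000, "under": 100_000,
      "insurance": 1_000, "arachnocentric": 1}

def idf(term):
    """Inverse document frequency: log10(N / df_t)."""
    return math.log10(N / df[term])

def tf_idf(tf, term):
    w_tf = 1 + math.log10(tf) if tf > 0 else 0
    return w_tf * idf(term)

print(idf("the"))             # 0.0 -> terms in every doc carry no weight
print(idf("arachnocentric"))  # 6.0 -> rare terms weigh heavily
print(tf_idf(10, "insurance"))
```

A term occurring in every document gets idf 0, so it contributes nothing to the ranking, while a very rare term dominates the score, exactly as cards 23-24 describe.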
26

Q

When processing documents, we typically store these in…

A

Vectors
- A |V|-dimensional vector space
- Terms are axes of the space
- Documents are points or vectors in this space
-> Very high dimensional (tens of millions of dimensions when you apply this to a web search engine)
27

Q

Could we store queries as vectors as well?

A

Yes! There are two key ideas:
1. Do the same for queries: represent them as vectors in the space
2. Rank documents according to their proximity to the query in this space
28

Q

Why would we like to store queries as vectors?

A

We do this because we want to get away from the “you’re either in or out” Boolean model.
Instead: rank more relevant documents higher than less relevant documents.
29

Q

Why is distance a bad idea to estimate proximity between two vectors?

A

Euclidean distance is large for vectors of different lengths: a document d and the document d appended to itself have very similar term distributions, yet a large Euclidean distance between them.
(See notion for a graph)
30

Q

What should we use instead of Euclidean distance to measure proximity?

A

We use the angle instead: rank documents according to their angle with the query
-> E.g. the angle between a document and that document appended to itself is 0, corresponding to maximal similarity
31

Q

How are cosines related to text similarity?

A

You can use cosines to compute similarity between two words, two documents, a document and a query, etc.
32

Q

How does the cosine function for a query and document work?

A

cos(q, d) = (q · d) / (|q| |d|) = Σ q_i d_i / (√(Σ q_i²) · √(Σ d_i²))
-> The dot product of the query and document vectors, normalized by their lengths
(See notion for the formula)
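A minimal sketch of cosine similarity over sparse term-weight vectors (the query and document weights below are made up):

```python
# Cosine similarity between sparse vectors represented as
# {term: weight} dicts: dot product over the product of lengths.
import math

def cosine(q, d):
    """cos(q, d) = (q . d) / (|q| * |d|)."""
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    if norm_q == 0 or norm_d == 0:
        return 0.0
    return dot / (norm_q * norm_d)

q  = {"best": 1.0, "car": 1.0, "insurance": 1.0}   # hypothetical query
d1 = {"car": 1.0, "insurance": 1.0, "auto": 1.0}   # hypothetical docs
d2 = {"best": 1.0, "car": 1.0, "insurance": 1.0}

print(round(cosine(q, d1), 3))  # 0.667
print(round(cosine(q, d2), 3))  # 1.0 -> same direction, maximal similarity
```

In a real system the weights would be tf-idf values rather than raw 1.0s, but the ranking mechanics are the same.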
33

Q

How can you evaluate information retrieval?

A

Precision: fraction of retrieved docs that are relevant = P(relevant | retrieved)
Recall: fraction of relevant docs that are retrieved = P(retrieved | relevant)
34

Q

Why shouldn’t you rely on accuracy alone when evaluating information retrieval?

A

In almost all circumstances, the data is extremely skewed: normally over 99.9% of the documents are in the nonrelevant category.
35

Q

What is a combined measure that you can use to assess precision/recall?

A

The F score (weighted harmonic mean of precision and recall)
People usually use the balanced F1 measure (β = 1, i.e. α = 1/2)
- β < 1 emphasizes precision; β > 1 emphasizes recall
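Precision, recall, and the F score can be computed as sketched below; the retrieved/relevant sets are made up for illustration:

```python
# Precision, recall, and F_beta over sets of document IDs.
def precision_recall_f(retrieved, relevant, beta=1.0):
    tp = len(retrieved & relevant)                    # true positives
    p = tp / len(retrieved) if retrieved else 0.0     # P(relevant|retrieved)
    r = tp / len(relevant) if relevant else 0.0       # P(retrieved|relevant)
    if p == 0 and r == 0:
        return p, r, 0.0
    b2 = beta * beta
    f = (1 + b2) * p * r / (b2 * p + r)               # weighted harmonic mean
    return p, r, f

retrieved = {1, 2, 3, 4}   # hypothetical system output
relevant  = {2, 4, 5}      # hypothetical gold judgments
p, r, f1 = precision_recall_f(retrieved, relevant)
print(p, r, round(f1, 3))  # 0.5 0.6666666666666666 0.571
```

With beta = 1 this is the balanced F1; passing beta < 1 shifts weight toward precision, beta > 1 toward recall.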
36

Q

What can you use a precision-recall curve for?

A

Seeing how to balance precision vs recall
-> However, you need to see the precision-recall curve for a whole bunch of queries