Lecture 7 - Information Retrieval Flashcards

1
Q

The distinction between an IR system and a database can also be phrased as

A

Unstructured vs Structured data

2
Q

Structured data tends to refer to information in…

A

Tables

3
Q

Describe the Boolean Retrieval Model

A

The Boolean retrieval model answers queries that are Boolean expressions of terms (AND, OR, NOT): for each document, each query term is either present or absent.
It is absolute (a document either matches the condition or it does not)
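The absolute, in-or-out nature of Boolean matching can be sketched in Python; the documents below are made up for illustration:

```python
# Minimal sketch of Boolean retrieval over a toy collection.
# Document IDs and texts are hypothetical.
docs = {
    1: "new home sales top forecasts",
    2: "home sales rise in july",
    3: "increase in home sales in july",
}

# Each document is reduced to its set of terms: matching is absolute.
term_sets = {doc_id: set(text.split()) for doc_id, text in docs.items()}

def boolean_and(*terms):
    """Return the IDs of documents containing ALL the given terms."""
    return sorted(d for d, terms_in_d in term_sets.items()
                  if all(t in terms_in_d for t in terms))

print(boolean_and("home", "sales"))          # [1, 2, 3]
print(boolean_and("home", "sales", "july"))  # [2, 3] - stricter AND, fewer hits
```

Note how adding a term to the AND query can only shrink the result set; there is no notion of one match being better than another.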

4
Q

What is the issue with Boolean Retrieval Model?

A

With bigger collections, the naive term-document incidence matrix becomes infeasible

Example:

  • N = 1M documents, each with about 1000 words → around 6 GB of data
    • If there are 500K distinct terms in these documents, a 500K × 1M incidence matrix has half a trillion 0’s and 1’s
    • But no more than one billion 1’s
      • The matrix is extremely sparse
    • What’s a better representation?
      • We only store the ‘1’ positions → Inverted Index

Also:
Boolean queries often result in either too few (=0) or too many (1000s) results.
It takes a lot of skill to come up with a query that produces a manageable number of hits
- AND gives too few; OR gives too many

5
Q

What is an Inverted Index?

A

An inverted index keeps a dictionary of terms and, for each term t, stores a list of all documents that contain t

I.e.: you can say that the Boolean retrieval model’s incidence matrix is organized from the perspective of the document (which terms it contains), whereas the inverted index is organized from the perspective of the term (which documents contain it)

6
Q

What are the steps of constructing an inverted index?

A

Documents to be indexed -> Tokenizer -> Linguistic modules -> Indexer -> Inverted index
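The pipeline above can be sketched in Python; the documents are made up, and the "linguistic module" here is just lowercasing (a real system would also stem or lemmatize):

```python
# Sketch of the indexing pipeline:
# documents -> tokenizer -> linguistic modules -> indexer -> inverted index.
from collections import defaultdict

docs = {1: "Friends, Romans, countrymen.", 2: "So let it be with Caesar"}

def tokenize(text):
    # Tokenizer: split on whitespace, strip surrounding punctuation.
    return [tok.strip(".,;:!?") for tok in text.split()]

def normalize(tokens):
    # Linguistic module (simplified): lowercase; drop empty tokens.
    return [tok.lower() for tok in tokens if tok]

def build_index(docs):
    # Indexer: for each term, a sorted postings list of doc IDs.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in normalize(tokenize(text)):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

index = build_index(docs)
print(index["caesar"])   # [2]
print(index["friends"])  # [1]
```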

7
Q

What is meant by phrase queries?

A

We want to be able to answer queries such as “Stanford university” as a phrase.

8
Q

For phrase queries, is it sufficient to store term → docs entries?

A

No. A standard inverted index records only which documents contain a term, not where it occurs, so it cannot tell whether the query terms appear next to each other as a phrase.

9
Q

What is meant by biword indexes

A

Instead of storing just single words, we now store biwords

Example:
“Friends, Romans, Countrymen” would now store:
- friends romans
- romans countrymen

This allows us to do two-word phrase query processing
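A biword index can be sketched in Python, using the "Friends, Romans, Countrymen" example from the card:

```python
# Minimal sketch of a biword index: index consecutive word pairs
# instead of single terms. The document is the card's example.
from collections import defaultdict

docs = {1: "friends romans countrymen"}

def biwords(tokens):
    # Every pair of consecutive tokens becomes one dictionary entry.
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

def build_biword_index(docs):
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for bw in biwords(text.split()):
            index[bw].add(doc_id)
    return index

idx = build_biword_index(docs)
print(sorted(idx))  # ['friends romans', 'romans countrymen']
```

A two-word phrase query is then just a single dictionary lookup in this index.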

10
Q

How can we do longer phrase queries using biwords?

A

Longer phrases can be processed by breaking them down

Example:
“Stanford university palo alto” can be broken into the boolean query on biwords:
- stanford university AND university palo AND palo alto

Without inspecting the documents themselves, we cannot verify that the docs matching the above Boolean query actually contain the phrase (false positives are possible)

11
Q

What are the issues for biwords indexes?

A

False positives: all the biwords of a phrase can match without the full phrase being present

Index blowup due to a bigger dictionary
- Infeasible for more than biwords, and big even for them

Biword indexes are not the standard solution (for all biwords) but can be part of a compound strategy

12
Q

Name an alternative to biword indexes

A

Positional indexes: in the postings, store, for each occurrence of a term, its position within the document
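A positional index can be sketched in Python (toy documents made up for illustration); because the postings record positions, it supports phrase queries directly:

```python
# Sketch of a positional index: term -> {doc_id: [positions]}.
from collections import defaultdict

docs = {1: "to be or not to be", 2: "be not afraid"}

def build_positional_index(docs):
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.split()):
            index[term][doc_id].append(pos)
    return index

def phrase_query(index, phrase):
    """Return docs where the phrase's terms occur at consecutive positions."""
    terms = phrase.split()
    # Candidate docs must contain every term of the phrase.
    candidates = set.intersection(*(set(index[t]) for t in terms))
    hits = []
    for doc_id in candidates:
        for p in index[terms[0]][doc_id]:
            # Check that term i occurs at position p + i for all i.
            if all(p + i in index[t][doc_id] for i, t in enumerate(terms)):
                hits.append(doc_id)
                break
    return sorted(hits)

idx = build_positional_index(docs)
print(phrase_query(idx, "to be"))   # [1]
print(phrase_query(idx, "not to"))  # [1]
```

Note that doc 2 contains both "be" and "not" but never as the consecutive phrase "not to", so it is correctly excluded.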

13
Q

Is a positional index larger or smaller than a non-positional index

A

Substantially larger: it must store an entry for every occurrence of a term, not just one per document containing it

14
Q

Can you combine Biword Indexes and Positional Indexes?

A

Yes

15
Q

What is the difference between Boolean Retrieval Models and Ranked Retrieval Models?

A

Rather than a set of documents satisfying a query expression, in ranked retrieval, the system returns an ordering over the (top) documents in the collection for a query

16
Q

What is meant by “Free text queries”?

A

Rather than a query language of operators and expressions (e.g. SQL), the user’s query is just one or more words in a human language

17
Q

What are some ways of scoring as the basis of ranked retrieval?

A

Term Frequency
-> Log-frequency weighting

IDF weight

TF-IDF weighting

18
Q

Explain Term Frequency

A

Term frequency is the number of occurrences of a term within a document.

The term frequency tf_t,d of term t in document d is defined as the number of times that t occurs in d

19
Q

Is raw term frequency sufficient?

A

No: a document with 10 occurrences of the term is more relevant than a document with 1 occurrence of the term, but not 10 times more relevant -> relevance does not increase proportionally with term frequency

Therefore, we use log-frequency weighting
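The log-frequency weighting mentioned above can be sketched as follows (using the conventional base-10 logarithm):

```python
# Log-frequency weighting: w = 1 + log10(tf) if tf > 0, else 0.
# Relevance grows with tf, but sublinearly.
import math

def log_tf_weight(tf):
    """Log-frequency weight of a raw term frequency tf."""
    return 1 + math.log10(tf) if tf > 0 else 0

for tf in [0, 1, 2, 10, 1000]:
    print(tf, round(log_tf_weight(tf), 3))
# 0 -> 0, 1 -> 1, 2 -> 1.301, 10 -> 2, 1000 -> 4
```

So 10 occurrences score 2 rather than 10 times the score of a single occurrence, matching the intuition on this card.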

20
Q

Explain Document Frequency

A

Document frequency is the number of documents containing a particular term

21
Q

What is idf an acronym for?

A

Inverse document frequency

22
Q

Explain idf

A

The inverse document frequency (idf) is a statistical weight measuring how informative a term is across a collection: idf_t = log10(N / df_t), where N is the number of documents in the collection and df_t is the document frequency of t (the number of documents in which t appears). Rare terms get a high idf; frequent terms get a low idf.

23
Q

Why is document frequency insufficient?

A

We want high weight for rare terms

Rare terms are more informative than frequent terms

  • Recall stop words (the, are, of)
  • Also terms like high, increase, line are frequent and hence not so informative

Consider a term in the query that is rare in the collection (e.g., arachnocentric)
- A document containing this term is very likely to be relevant to the query arachnocentric

24
Q

What is the effect of idf on ranking for one-term queries, like “iphone”?

A

idf has no effect on ranking one-term queries

  • idf affects the ranking of documents for queries with at least two terms
  • For the query “capricious person”, idf weighting makes occurrences of “capricious” count for much more in the final document ranking than occurrences of “person”
25

Q

Explain tf-idf weighting

A

The tf-idf weight of a term is the product of its tf weight and its idf weight
-> This is the best known weighting scheme in information retrieval (the hyphen in tf-idf is a hyphen, NOT a minus sign)
- Increases with the number of occurrences within a document
- Increases with the rarity of the term in the collection
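The tf and idf pieces combine as sketched below; N and the document frequencies are made-up collection statistics for illustration:

```python
# tf-idf weight = log-frequency tf weight * idf weight.
# N and df values are hypothetical collection statistics.
import math

N = 1_000_000  # documents in the collection
df = {"the": 1_000_000, "under": 100_000,
      "insurance": 1_000, "arachnocentric": 1}

def idf(term):
    """Inverse document frequency: log10(N / df_t)."""
    return math.log10(N / df[term])

def tf_idf(tf, term):
    w_tf = 1 + math.log10(tf) if tf > 0 else 0
    return w_tf * idf(term)

print(idf("the"))             # 0.0 -> terms in every doc carry no weight
print(idf("arachnocentric"))  # 6.0 -> rare terms weigh heavily
print(tf_idf(10, "insurance"))
```

A term occurring in every document gets idf 0, so it contributes nothing to the ranking, while a very rare term dominates the score, exactly as cards 23-24 describe.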
26

Q

When processing documents, we typically store these in…

A

Vectors
- A |V|-dimensional vector space
- Terms are axes of the space
- Documents are points or vectors in this space
-> Very high dimensional (tens of millions of dimensions when you apply this to a web search engine)
27

Q

Could we store queries as vectors as well?

A

Yes! There are two key ideas:
1. Do the same for queries: represent them as vectors in the space
2. Rank documents according to their proximity to the query in this space
28

Q

Why would we like to store queries as vectors?

A

We do this because we want to get away from the “you’re either in or out” Boolean model.
Instead: rank more relevant documents higher than less relevant documents.
29

Q

Why is distance a bad idea to estimate proximity between two vectors?

A

Euclidean distance is large for vectors of different lengths: a document d and the document d appended to itself have very similar term distributions, yet a large Euclidean distance between them.
(See notion for a graph)
30

Q

What should we use instead of Euclidean distance to measure proximity?

A

We use the angle instead: rank documents according to their angle with the query
-> E.g. the angle between a document and that document appended to itself is 0, corresponding to maximal similarity
31

Q

How are cosines related to text similarity?

A

You can use cosines to compute similarity between two words, two documents, a document and a query, etc.
32

Q

How does the cosine function for a query and document work?

A

cos(q, d) = (q · d) / (|q| |d|) = Σ q_i d_i / (√(Σ q_i²) · √(Σ d_i²))
-> The dot product of the query and document vectors, normalized by their lengths
(See notion for the formula)
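A minimal sketch of cosine similarity over sparse term-weight vectors (the query and document weights below are made up):

```python
# Cosine similarity between sparse vectors represented as
# {term: weight} dicts: dot product over the product of lengths.
import math

def cosine(q, d):
    """cos(q, d) = (q . d) / (|q| * |d|)."""
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    if norm_q == 0 or norm_d == 0:
        return 0.0
    return dot / (norm_q * norm_d)

q  = {"best": 1.0, "car": 1.0, "insurance": 1.0}   # hypothetical query
d1 = {"car": 1.0, "insurance": 1.0, "auto": 1.0}   # hypothetical docs
d2 = {"best": 1.0, "car": 1.0, "insurance": 1.0}

print(round(cosine(q, d1), 3))  # 0.667
print(round(cosine(q, d2), 3))  # 1.0 -> same direction, maximal similarity
```

In a real system the weights would be tf-idf values rather than raw 1.0s, but the ranking mechanics are the same.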
33

Q

How can you evaluate information retrieval?

A

Precision: fraction of retrieved docs that are relevant = P(relevant | retrieved)
Recall: fraction of relevant docs that are retrieved = P(retrieved | relevant)
34

Q

Why shouldn’t you rely on accuracy alone when evaluating information retrieval?

A

In almost all circumstances, the data is extremely skewed: normally over 99.9% of the documents are in the nonrelevant category.
35

Q

What is a combined measure that you can use to assess precision/recall?

A

The F score (weighted harmonic mean of precision and recall)
People usually use the balanced F1 measure (β = 1, i.e. α = 1/2)
- β < 1 emphasizes precision; β > 1 emphasizes recall
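Precision, recall, and the F score can be computed as sketched below; the retrieved/relevant sets are made up for illustration:

```python
# Precision, recall, and F_beta over sets of document IDs.
def precision_recall_f(retrieved, relevant, beta=1.0):
    tp = len(retrieved & relevant)                    # true positives
    p = tp / len(retrieved) if retrieved else 0.0     # P(relevant|retrieved)
    r = tp / len(relevant) if relevant else 0.0       # P(retrieved|relevant)
    if p == 0 and r == 0:
        return p, r, 0.0
    b2 = beta * beta
    f = (1 + b2) * p * r / (b2 * p + r)               # weighted harmonic mean
    return p, r, f

retrieved = {1, 2, 3, 4}   # hypothetical system output
relevant  = {2, 4, 5}      # hypothetical gold judgments
p, r, f1 = precision_recall_f(retrieved, relevant)
print(p, r, round(f1, 3))  # 0.5 0.6666666666666666 0.571
```

With beta = 1 this is the balanced F1; passing beta < 1 shifts weight toward precision, beta > 1 toward recall.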
36

Q

What can you use a precision-recall curve for?

A

Seeing how to balance precision vs recall
-> However, you need to see the precision-recall curve for a whole bunch of queries