Week 3 - Information Retrieval Flashcards by Hannah Loo

What is information retrieval?

It is finding material (usually documents) of an unstructured nature that satisfies an information need from within large collections.

What is the definition of a search?

It is a conversation between the user and the search engine.

When your search result is not good, what does this tell you?

You have entered an ambiguous query, or the query does not reflect your actual information need.

Why is user intent important?

There are billions of users, and each has a different idea of what they are looking for when they type the same search term. Search engines use localisation to get better at inferring INTENT.

How can we abstract the IR of a search problem?

Input -> Process -> Output
Input is a query: a string of characters.
Output is a list of results (article name, link, article ID).

Why is a search problem considered ad hoc retrieval?

Because we cannot anticipate all the queries upfront.

Why is relevance also ad hoc?

Because it is tied to the information need and the need is very specific

Why is it called a TERM-DOCUMENT matrix?

Because terms do not always correspond to words. A term can also be a number, a symbol, an abbreviation, etc.

A passage from Hamlet has been converted into 1s and 0s. What sort of IR can you carry out?

Simple queries, like searching for the presence of a word
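
A minimal sketch of the idea, assuming a toy collection of three short lines (the document IDs and the helper name `contains` are made up for illustration). Each term gets a row of 1s and 0s, one entry per document, and a presence query is just a lookup:

```python
# Toy collection chosen for illustration; IDs are hypothetical.
docs = {
    "d1": "to be or not to be",
    "d2": "the rest is silence",
    "d3": "to sleep perchance to dream",
}

# Vocabulary: every distinct term in the collection, sorted.
vocab = sorted({term for text in docs.values() for term in text.split()})

# incidence[term][doc_id] is 1 if the term occurs in that document, else 0.
incidence = {
    term: {doc_id: int(term in text.split()) for doc_id, text in docs.items()}
    for term in vocab
}

def contains(term, doc_id):
    """Presence check: does this document contain this term?"""
    return incidence.get(term, {}).get(doc_id, 0) == 1
```

With only 1s and 0s stored, the matrix can say whether a word is present but not how often or where it occurs.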

What is Boolean retrieval?

The query is posed as a Boolean expression of terms, combined with operators such as AND, OR and NOT.
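
As a rough illustration, a Boolean query can be evaluated with set operations over document IDs. The three documents and the helper `docs_with` are assumptions for this sketch, not part of the course material:

```python
# Toy collection; texts and IDs are made up for illustration.
docs = {
    1: "brutus killed caesar",
    2: "caesar praised brutus and calpurnia",
    3: "calpurnia warned caesar",
}

def docs_with(term):
    """Set of IDs of the documents that contain the term."""
    return {doc_id for doc_id, text in docs.items() if term in text.split()}

all_ids = set(docs)

# Query: brutus AND caesar AND NOT calpurnia
# AND is set intersection, NOT is the complement against all document IDs.
result = docs_with("brutus") & docs_with("caesar") & (all_ids - docs_with("calpurnia"))

# Query: brutus OR calpurnia (OR is set union)
either = docs_with("brutus") | docs_with("calpurnia")
```

Here `result` holds only document 1, because document 2 also mentions calpurnia.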

What is tokenization?

Tokenization is the process of splitting text into tokens (roughly, words).
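
A very simple tokenizer can be sketched with a regular expression. The pattern below is an assumption for illustration; real tokenizers handle hyphens, URLs, numbers with separators and so on:

```python
import re

def tokenize(text):
    """Split text into tokens: runs of letters, digits or apostrophes."""
    return re.findall(r"[A-Za-z0-9']+", text)

tokens = tokenize("To be, or not to be: that is the question.")
```

Punctuation is dropped and case is left untouched at this stage.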

What is normalisation?

Mapping different surface forms of a term to one common form, so that people can match plural/singular versions of a word, spelling variants (US/UK), sentence capitalisation, etc.
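
A crude sketch of normalisation, assuming a tiny hypothetical UK-to-US spelling table and a deliberately naive plural rule. It is only meant to show the kind of mapping involved:

```python
# Hypothetical UK -> US spelling table; a real one would be far larger.
SPELLING = {"colour": "color", "organise": "organize"}

def normalise(token):
    """Map a token to one canonical form (illustrative rules only)."""
    token = token.lower()  # fold capitalisation
    # Naive plural rule: drop a trailing "s" but keep "ss" endings.
    # This over-strips words such as "this"; it is a sketch, not a real rule set.
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        token = token[:-1]
    return SPELLING.get(token, token)  # unify spelling variants
```

With these rules, "Colours", "colour" and "COLOR" all end up as the same term, so a query in any of those forms can match a document using any other.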

What is stemming?

Chopping off the end of a word using simple rules: “jumping” -> “jump”
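
A toy suffix stripper shows the idea. This is not the Porter stemmer or any standard algorithm; the suffix list and the minimum stem length of three are assumptions for illustration:

```python
SUFFIXES = ("ing", "ed", "s")  # illustrative list, checked in this order

def naive_stem(word):
    """Chop a known suffix off the end of the word, if enough stem remains."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word
```

Because it only chops characters, it turns "running" into "runn" and leaves "ran" unchanged. Dictionary-based lemmatisation is meant to cover cases like these.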

What is lemmatisation?

Reducing a word to its dictionary form (lemma) by looking it up in a dictionary that you maintain: “ran” -> “run”
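
Dictionary matching can be sketched as a plain lookup table. The entries in `LEMMAS` are a hypothetical hand-made dictionary; real lemmatisers use much larger resources and often part-of-speech information:

```python
# Hypothetical hand-maintained dictionary: word form -> lemma.
LEMMAS = {
    "ran": "run",
    "running": "run",
    "runs": "run",
    "mice": "mouse",
}

def lemmatise(word):
    """Return the dictionary form of the word, or the word itself if unknown."""
    word = word.lower()
    return LEMMAS.get(word, word)
```

A word that is not in the dictionary passes through unchanged, which is why the dictionary has to be updated when new words appear.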

What is the purpose behind pre-processing?

To make querying easier for the user. Without it, the user would have to search for the EXACT word form.

When do “double quotes” feature?

When you are looking for one specific variant of a word rather than all of its normalised forms. For example, “RUN”

Which is better? Stemming or lemmatisation?

Lemmatisation is often slower than stemming, because you need to maintain a dictionary. And if new words come up, you need to update your lemmatisation dictionary.

What are stop words?

Commonly used words (a, an, by, or) that are filtered out. They go beyond conjunctions and cover words that carry no real meaning on their own. They are removed from the query first.
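
Stop-word removal can be sketched as a filter over the query tokens. The `STOP_WORDS` set here is a small made-up sample, not a standard list:

```python
# Small illustrative stop list; real lists are longer and language-specific.
STOP_WORDS = {"a", "an", "by", "or", "the", "of", "to", "in"}

def remove_stop_words(tokens):
    """Keep only the tokens that are not stop words (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

query = "the history of search engines".split()
filtered = remove_stop_words(query)
```

After filtering, only "history", "search" and "engines" are left to match against the collection.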

What does TFIDF mean

Term frequency - inverse document frequency

How does TFIDF work?

It prioritises terms that are rare across the documents in the collection. A term has a high TF-IDF in a document when it occurs frequently in that document (high TF) but in few documents overall (high IDF).
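
One common variant of the weighting is tf × log(N / df), where N is the number of documents and df is the number of documents containing the term. The card does not fix a formula, so this choice (raw counts, natural log) and the toy documents are assumptions:

```python
import math

# Toy collection, already tokenized; made up for illustration.
docs = {
    "d1": "cat sat on the mat".split(),
    "d2": "the dog sat on the log".split(),
    "d3": "the cat chased the dog".split(),
}

def tf(term, doc_id):
    """Term frequency: raw count of the term in the document."""
    return docs[doc_id].count(term)

def idf(term):
    """Inverse document frequency: log(N / df); 0.0 for unseen terms."""
    df = sum(1 for tokens in docs.values() if term in tokens)
    return math.log(len(docs) / df) if df else 0.0

def tf_idf(term, doc_id):
    return tf(term, doc_id) * idf(term)
```

In this collection "the" appears in every document, so its IDF and TF-IDF are zero however often it occurs, while "mat" appears in only one document and gets the highest weight there.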

Vector space model, how is it used in IR?

Each document is represented as a vector, and the query is represented as a vector as well. Retrieval then asks which document vectors are most similar to the query vector, by calculating the similarity between vectors (for example, cosine similarity).
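
A small sketch of ranking by vector similarity, assuming raw term counts as vector components and cosine similarity as the measure (both are choices made here for illustration; TF-IDF weights are a common alternative):

```python
import math
from collections import Counter

def to_vector(text):
    """Represent text as a sparse vector of term counts."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Toy documents, made up for illustration.
docs = {
    "d1": "information retrieval with search engines",
    "d2": "cooking pasta with tomato sauce",
}
query = to_vector("search engines")

# Rank document IDs by similarity to the query, most similar first.
ranked = sorted(docs, key=lambda d: cosine(query, to_vector(docs[d])), reverse=True)
```

The query shares two terms with d1 and none with d2, so d1 is ranked first.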

Precision definition

What fraction of the returned results are relevant to the information need

Recall definition

What fraction of the relevant documents in the collection were returned by the system
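
Precision and recall can both be computed from two sets: what the system returned and what is actually relevant. The document IDs below are made up for illustration:

```python
def precision(returned, relevant):
    """Fraction of the returned documents that are relevant."""
    returned, relevant = set(returned), set(relevant)
    return len(returned & relevant) / len(returned) if returned else 0.0

def recall(returned, relevant):
    """Fraction of the relevant documents that were returned."""
    returned, relevant = set(returned), set(relevant)
    return len(returned & relevant) / len(relevant) if relevant else 0.0

returned = ["d1", "d2", "d3", "d4"]
relevant = ["d1", "d3", "d5", "d6", "d7"]

p = precision(returned, relevant)  # 2 of the 4 returned are relevant
r = recall(returned, relevant)     # 2 of the 5 relevant were returned
```

The two measures share a numerator (relevant documents returned) and differ only in what they divide by.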

When should you focus on precision or recall?

Recall: if you want to do an exhaustive search, covering all relevant ground.
Precision: if you care most that the results you do get back are relevant.

There is always a trade off between precision and recall. What is the trade off?

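The trade-off comes from the threshold (cut-off) applied to the scores. A lower cut-off returns more documents, which raises recall but lets some irrelevant documents slip through and lowers precision. A higher cut-off does the reverse.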
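
The effect of the cut-off can be seen by evaluating the same scored list at two thresholds. The scores, document IDs and relevance judgements are hypothetical:

```python
# Hypothetical (doc_id, score) pairs from some scoring function.
scored = [("d1", 0.9), ("d2", 0.8), ("d3", 0.6), ("d4", 0.4), ("d5", 0.2)]
relevant = {"d1", "d3", "d5"}

def evaluate(threshold):
    """Return (precision, recall) when keeping documents scoring >= threshold."""
    returned = {doc for doc, score in scored if score >= threshold}
    hits = len(returned & relevant)
    p = hits / len(returned) if returned else 0.0
    r = hits / len(relevant)
    return p, r

strict = evaluate(0.85)   # keeps only d1
lenient = evaluate(0.1)   # keeps all five documents
```

The strict cut-off gives perfect precision but finds only one of the three relevant documents. The lenient one finds all three, but two of its five results are irrelevant.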

When do you have perfect precision and recall

If you perfectly capture user intent

Search engine data structures. What is an index?

An index has all your key terms and pages in which the terms occur

What is an inverted index?

For each term t, store a list of the unique docIDs of the documents that contain t

What is a term dictionary

The set of terms held in the index. This set of terms forms the vocabulary.

How do you form an inverted index?

Dictionary + postings = inverted index
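
A minimal sketch of building one, assuming whitespace tokenization and a toy collection (the function name `build_inverted_index` is made up for illustration). The keys of the result play the role of the dictionary, and each value is the postings list for that term:

```python
from collections import defaultdict

# Toy collection; texts and docIDs are made up for illustration.
docs = {1: "new home sales", 2: "home sales rise", 3: "new car sales"}

def build_inverted_index(docs):
    """Map each term to a sorted list of the unique docIDs that contain it."""
    postings = defaultdict(set)  # a set keeps docIDs unique per term
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    # Sort the terms (dictionary) and each docID list (postings).
    return {term: sorted(ids) for term, ids in sorted(postings.items())}

index = build_inverted_index(docs)

# A two-term AND query is the intersection of two postings lists.
new_and_home = sorted(set(index["new"]) & set(index["home"]))
```

Looking up a term now touches only its own postings list instead of scanning every document.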

(30 cards)