Week 3 - Information Retrieval Flashcards
What is information retrieval?
It is finding material (usually documents) of an unstructured nature that satisfies an information need from within large collections
What is the definition of a search?
It is a conversation between user and search engine
When your search result is not good, what does this tell you?
You have entered an ambiguous query or the query is irrelevant
Why is user intent important?
There are billions of users, and each has a different idea of what they are looking for from the same search term. Search engines use signals such as localisation to get better at inferring intent
How can we abstract the IR of a search problem?
Input -> Process -> Output
Input is a query: a string of characters
Output is a list of results (article name, link, article ID)
Why is a search problem considered ad hoc retrieval?
Because we cannot anticipate all the queries upfront
Why is relevance also ad hoc?
Because it is tied to the information need and the need is very specific
Why is it called a TERM-DOCUMENT matrix?
Because terms do not always correspond to words. A term can be a number, symbol, abbreviation, etc.
A passage from Hamlet has been converted into 1s and 0s. What sort of IR can you carry out?
Simple queries, like searching for the presence of a word
Boolean retrieval
Query is posed in the form of a Boolean expression of terms
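A minimal sketch of Boolean retrieval over a term-document incidence matrix of 1s and 0s. The three toy documents and the helper name `boolean_and_not` are made up for illustration.

```python
# Toy collection; the document names and texts are made up for illustration.
docs = {
    "hamlet": "brutus killed caesar in the capitol",
    "macbeth": "macbeth killed duncan",
    "tempest": "prospero forgave his brother",
}

# Vocabulary: every distinct term in the collection.
vocab = sorted({term for text in docs.values() for term in text.split()})

# Term-document incidence matrix: 1 if the term occurs in the document, else 0.
incidence = {
    term: {name: int(term in text.split()) for name, text in docs.items()}
    for term in vocab
}

def boolean_and_not(must_have, must_not_have):
    """Return documents that contain must_have AND NOT must_not_have."""
    return [
        name for name in docs
        if incidence[must_have][name] == 1 and incidence[must_not_have][name] == 0
    ]

# Query: killed AND NOT caesar
result = boolean_and_not("killed", "caesar")
print(result)
```

Only the presence or absence of a term matters here; how often it occurs plays no part in the answer.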
What is tokenization?
Tokenization is the process of splitting text into tokens (typically words)
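A rough sketch of a tokenizer, assuming that a token is simply a run of letters or digits. Real tokenizers handle hyphens, apostrophes and other languages with more care; the function name `tokenize` is illustrative.

```python
import re

def tokenize(text):
    """Split text into tokens, dropping punctuation and whitespace."""
    # Assumption: a token is any unbroken run of ASCII letters or digits.
    return re.findall(r"[A-Za-z0-9]+", text)

tokens = tokenize("To be, or not to be: that is the question.")
print(tokens)
```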
What is normalisation?
Mapping different forms of a term to one canonical form, so that people can search across plural and singular versions of a word, different spellings (US/UK), sentence capitalisation, etc.
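A small sketch of normalisation, assuming only two steps: case folding and a hand-written US/UK spelling table. The table `SPELLING_VARIANTS` is hypothetical and far from complete.

```python
# Hypothetical UK -> US spelling table, for illustration only.
SPELLING_VARIANTS = {"colour": "color", "normalisation": "normalization"}

def normalise(token):
    """Lowercase the token, then map known spelling variants to one form."""
    token = token.lower()
    return SPELLING_VARIANTS.get(token, token)

normalised = [normalise(t) for t in ["Colour", "COLOR", "color"]]
print(normalised)
```

After this step all three inputs map to the same term, so a query in any of those forms matches the same documents.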
What is stemming?
Chopping off word endings with simple rules: “jumping” -> “jump”
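A deliberately crude sketch of stemming by suffix stripping. The suffix list and the minimum stem length of three letters are assumptions made for this example; established stemmers such as the Porter stemmer use more careful rules.

```python
# Assumed suffix list, checked in this order.
SUFFIXES = ("ing", "ed", "s")

def naive_stem(word):
    """Chop off the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

stems = [naive_stem(w) for w in ["jumping", "jumped", "jumps", "running", "ran"]]
print(stems)
```

The crude rule turns “running” into “runn” and leaves “ran” untouched, which is the kind of case a dictionary-based approach is meant to handle.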
What is lemmatisation?
Reducing a word to its dictionary form (lemma) by looking it up in a dictionary that you maintain: “ran” -> “run”
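A sketch of dictionary-based lemmatisation. The tiny `LEMMA_DICTIONARY` is hypothetical; a real one would hold many thousands of entries and would need updating as new words appear.

```python
# Hypothetical lemma dictionary: inflected form -> dictionary form.
LEMMA_DICTIONARY = {"ran": "run", "running": "run", "mice": "mouse"}

def lemmatise(word):
    """Look the word up; fall back to the lowercased word if it is unknown."""
    word = word.lower()
    return LEMMA_DICTIONARY.get(word, word)

lemmas = [lemmatise(w) for w in ["Ran", "running", "mice", "jump"]]
print(lemmas)
```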
What is the purpose behind pre-processing?
To make querying easier for the user. Without it, the user must search for the EXACT word form
When do “double quotes” feature?
When you are looking for a specific variant, for example “RUN”
Which is better? Stemming or lemmatisation?
Lemmatisation is often slower than stemming, because you need to maintain a dictionary. If new words come up, you need to update your lemmatisation dictionary
What are stop words?
Commonly used words (a, an, by, or) that carry little meaning on their own. They go beyond conjunctions and cover any word that has no real meaning for the search. They are removed from the query first.
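A short sketch of stop word removal. The set `STOP_WORDS` below is a small assumed list, not a standard one.

```python
# Assumed stop word list, for illustration only.
STOP_WORDS = {"a", "an", "by", "or", "the", "of", "in"}

def remove_stop_words(tokens):
    """Keep only the tokens that are not in the stop word list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

filtered = remove_stop_words(["The", "history", "of", "an", "empire"])
print(filtered)
```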
What does TF-IDF mean?
Term frequency - inverse document frequency
How does TF-IDF work?
It prioritises terms that are rare across the documents in the collection. A term has a high TF-IDF score in a document when it occurs often in that document but in few other documents
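A minimal sketch of one common TF-IDF variant, assuming raw counts for term frequency and a natural-log inverse document frequency, log(N / df). Other weighting schemes exist; the toy collection is made up.

```python
import math

# Toy collection of already-tokenized documents (made up for illustration).
docs = [
    ["caesar", "killed", "brutus"],
    ["caesar", "caesar", "rome"],
    ["rome", "empire"],
]

def tf_idf(term, doc, collection):
    """Score = (count of term in doc) * log(N / number of docs containing term)."""
    tf = doc.count(term)
    df = sum(1 for d in collection if term in d)
    if tf == 0 or df == 0:
        return 0.0
    idf = math.log(len(collection) / df)
    return tf * idf

rare = tf_idf("brutus", docs[0], docs)      # occurs in 1 of 3 documents
common = tf_idf("caesar", docs[0], docs)    # occurs in 2 of 3 documents
repeated = tf_idf("caesar", docs[1], docs)  # same term, but counted twice
print(rare, common, repeated)
```

The rarer term “brutus” outscores “caesar” within the same document, and “caesar” scores higher in the document where it is repeated.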
Vector space model, how is it used in IR?
Each document is represented as a vector, and the query is represented as a vector as well. Calculate the similarity between the query vector and each document vector, then return the documents whose vectors are most similar
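A sketch of ranking in the vector space model, assuming raw term counts as vector weights and cosine similarity as the similarity measure. The documents and the query are made up.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Each document, and the query, becomes a vector of term counts.
documents = {
    "d1": Counter("brutus killed caesar".split()),
    "d2": Counter("caesar ruled rome".split()),
    "d3": Counter("prospero forgave brother".split()),
}
query = Counter("caesar rome".split())

# Rank documents from most to least similar to the query.
ranked = sorted(
    documents,
    key=lambda name: cosine_similarity(query, documents[name]),
    reverse=True,
)
print(ranked)
```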
Precision definition
What fraction of the returned results are relevant to the information need
Recall definition
What fraction of the relevant documents in the collection were returned by the system
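A small worked sketch of both measures, assuming the sets of returned and relevant docIDs are known. The docIDs are made up.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for a set of returned docIDs."""
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant  # returned documents that are relevant
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# 4 documents returned, 5 relevant in the collection, 2 in common.
precision, recall = precision_recall(returned=[1, 2, 3, 4], relevant=[2, 4, 6, 8, 10])
print(precision, recall)
```

Here 2 of the 4 returned documents are relevant (precision 0.5), and 2 of the 5 relevant documents were found (recall 0.4).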
When should you focus on precision or recall?
Recall: if you want to do an exhaustive search, covering all relevant ground.
Precision: when you care most that the results you do return are relevant
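There is always a trade-off between precision and recall. What is the trade-off?
The trade-off comes from the threshold or cut-off applied to the scores. A lower threshold returns more documents, which raises recall but lets irrelevant documents slip through; a higher threshold does the opposite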
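A sketch of how the score threshold moves precision and recall in opposite directions. The scores and relevance labels are invented for this example.

```python
# Invented retrieval scores and relevance judgements.
scores = {"d1": 0.9, "d2": 0.7, "d3": 0.4, "d4": 0.2}
relevant = {"d1", "d3"}

def evaluate(threshold):
    """Return (precision, recall) when only docs scoring >= threshold are returned."""
    returned = {d for d, s in scores.items() if s >= threshold}
    hits = returned & relevant
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant)
    return precision, recall

strict = evaluate(0.8)  # returns only d1
loose = evaluate(0.3)   # returns d1, d2 and d3
print(strict, loose)
```

The strict cut-off gives perfect precision but misses d3; the loose cut-off finds every relevant document but also returns the irrelevant d2.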
When do you have perfect precision and recall?
If you perfectly capture user intent
Search engine data structures. What is an index?
An index lists all your key terms and the pages on which those terms occur
What is an inverted index?
For each term t, store a list of the unique docIDs of the documents that contain t
What is a term dictionary?
The set of all terms in the index. Together, the terms in the dictionary form the vocabulary
What makes up an inverted index?
Dictionary + postings = inverted index
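A minimal sketch of building an inverted index, assuming whitespace tokenization and lowercasing only. The docIDs and texts are made up, and `build_inverted_index` is an illustrative name.

```python
from collections import defaultdict

# Toy collection: docID -> text (made up for illustration).
docs = {1: "Brutus killed Caesar", 2: "Caesar ruled Rome", 3: "Rome fell"}

def build_inverted_index(collection):
    """Map each term (the dictionary) to a sorted list of docIDs (its postings)."""
    postings = defaultdict(set)  # a set keeps each docID unique per term
    for doc_id, text in collection.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in sorted(postings.items())}

index = build_inverted_index(docs)

# A Boolean AND query becomes an intersection of two postings lists.
both = sorted(set(index["caesar"]) & set(index["rome"]))
print(index["caesar"], index["rome"], both)
```

The keys of `index` play the role of the dictionary and the lists play the role of the postings.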