Week 3 - Information Retrieval Flashcards
What is information retrieval?
It is finding material (usually documents) of an unstructured nature that satisfies an information need from within large collections
What is the definition of a search?
It is a conversation between user and search engine
When your search result is not good, what does this tell you?
You have entered an ambiguous query or the query is irrelevant
Why is user intent important?
With billions of users, each person can mean something different by the same search term. Search engines use localisation to get better at inferring INTENT
How can we abstract the IR of a search problem?
Input -> Process -> Output
Input is a query: a string of characters
Output is a list of results (article name, link, article id)
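The input/process/output abstraction can be sketched as a single function. This is only an illustration, not how a real search engine works: the `Result` type, the `search` function, the example.org links and the two-article toy collection are all hypothetical, and the "process" step is a naive substring match.

```python
from typing import NamedTuple


class Result(NamedTuple):
    """One entry in the output list (field names are illustrative)."""
    article_id: int
    name: str
    link: str


# Hypothetical toy collection: article id -> (name, link, body text).
COLLECTION = {
    1: ("Hamlet", "https://example.org/hamlet", "to be or not to be"),
    2: ("Macbeth", "https://example.org/macbeth", "double double toil and trouble"),
}


def search(query: str) -> list[Result]:
    """Input: a query string. Output: a list of matching results."""
    needle = query.lower()
    # Process: a naive scan that keeps articles whose text contains the query.
    return [
        Result(article_id, name, link)
        for article_id, (name, link, text) in COLLECTION.items()
        if needle in text
    ]


print(search("toil"))
```

The point is the shape of the problem: a string goes in, and a list of (article id, name, link) entries comes out.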
Why is a search problem considered ad hoc retrieval?
Because we cannot anticipate all the queries upfront
Why is relevance also ad hoc?
Because it is tied to the information need and the need is very specific
Why is it called a TERM-DOCUMENT matrix?
Because terms do not always correspond to words. They can be numbers, symbols, abbreviations etc.
A passage from Hamlet has been converted into 1s and 0s. What sort of IR can you carry out?
Simple queries, like searching for the presence of a word
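A minimal sketch of a 1s-and-0s term-document incidence matrix and a presence lookup. It assumes three invented one-line "documents" (not actual Shakespeare text) and plain whitespace splitting. The names `docs`, `incidence` and `contains` are illustrative.

```python
# Hypothetical toy documents; the text is invented for illustration.
docs = {
    "doc1": "brutus killed caesar",
    "doc2": "caesar was ambitious",
    "doc3": "hamlet mourned his father",
}

doc_ids = sorted(docs)
vocabulary = sorted({term for text in docs.values() for term in text.split()})

# incidence[term] is a row of 1s and 0s, one entry per document in doc_ids.
incidence = {
    term: [1 if term in docs[d].split() else 0 for d in doc_ids]
    for term in vocabulary
}


def contains(term: str) -> list[str]:
    """Return the ids of documents whose entry for `term` is 1."""
    row = incidence.get(term, [0] * len(doc_ids))
    return [d for d, bit in zip(doc_ids, row) if bit]


print(incidence["caesar"])  # [1, 1, 0]
print(contains("hamlet"))   # ['doc3']
```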
What is Boolean retrieval?
The query is posed as a Boolean expression of terms, combined with AND, OR and NOT
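A minimal sketch of answering a Boolean query by combining rows of 1s and 0s bit by bit. The incidence rows and document ids below are made up for illustration, and `bit_and`, `bit_or` and `bit_not` are hypothetical helper names.

```python
# Hypothetical incidence rows: one bit per document, in the order of doc_ids.
doc_ids = ["doc1", "doc2", "doc3", "doc4"]
incidence = {
    "brutus":    [1, 1, 0, 1],
    "caesar":    [1, 1, 0, 0],
    "calpurnia": [0, 1, 0, 0],
}


def bit_and(a, b):
    return [x & y for x, y in zip(a, b)]


def bit_or(a, b):
    return [x | y for x, y in zip(a, b)]


def bit_not(a):
    return [1 - x for x in a]


# Query: brutus AND caesar AND NOT calpurnia
bits = bit_and(
    bit_and(incidence["brutus"], incidence["caesar"]),
    bit_not(incidence["calpurnia"]),
)
matches = [d for d, bit in zip(doc_ids, bits) if bit]
print(matches)  # ['doc1']
```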
What is tokenization?
Tokenization is the process of splitting text, such as a sentence, into tokens (roughly, words)
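A minimal tokenization sketch using Python's standard `re` module. The rule used here (a token is a run of letters, digits or apostrophes) is an assumption chosen for illustration; real tokenizers make many more decisions, for example about hyphens and numbers. `tokenize` is a hypothetical helper name.

```python
import re


def tokenize(text: str) -> list[str]:
    """Split text into tokens: runs of letters, digits, or apostrophes."""
    return re.findall(r"[A-Za-z0-9']+", text)


print(tokenize("To be, or not to be: that is the question."))
# ['To', 'be', 'or', 'not', 'to', 'be', 'that', 'is', 'the', 'question']
```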
What is normalisation?
Mapping different forms of a word to one standard form, so that people can search for plural or singular versions of the word, different spellings (US/UK), sentence capitalisation etc.
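A minimal normalisation sketch covering two of the cases above: capitalisation and US/UK spelling. The `UK_TO_US` table and the `normalise` helper are hypothetical, and the table holds only two entries for illustration. Plural and singular forms are usually handled by stemming or lemmatisation and are not covered here.

```python
# Hypothetical spelling table; a real system would need a much larger one.
UK_TO_US = {"colour": "color", "normalisation": "normalization"}


def normalise(token: str) -> str:
    """Map a token to a canonical form: lower-case, then unify spelling."""
    token = token.lower()
    return UK_TO_US.get(token, token)


print([normalise(t) for t in ["Colour", "COLOR", "color"]])
# ['color', 'color', 'color']
```

The same function would be applied to both the documents and the query, so that the two meet on the same canonical form.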
What is stemming?
Chopping off the end of a word: “jumping” -> “jump”
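A deliberately naive stemming sketch that only strips a few suffixes. It is not the Porter stemmer or any other published algorithm. The suffix list and the `naive_stem` name are assumptions made for illustration.

```python
SUFFIXES = ("ing", "ed", "s")  # illustrative; checked in this order


def naive_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for w in ["jumping", "jumped", "jumps", "running", "ran"]:
    print(w, "->", naive_stem(w))
```

Here "jumping", "jumped" and "jumps" all become "jump", but "running" becomes "runn" and "ran" is left unchanged, which shows the limits of simply chopping.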
What is lemmatisation?
Reducing a word to its dictionary form by looking it up in a dictionary that you maintain: “ran” -> “run”
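A minimal lemmatisation sketch based on dictionary lookup. The `LEMMAS` table is a hypothetical, hand-written dictionary with only a few entries, and `lemmatise` is an illustrative helper name. Real lemmatisers rely on much larger lexicons and often on part-of-speech information.

```python
# Hypothetical lemma dictionary: inflected form -> dictionary headword.
LEMMAS = {"ran": "run", "running": "run", "runs": "run", "mice": "mouse"}


def lemmatise(word: str) -> str:
    """Return the dictionary form if known; otherwise the lower-cased word."""
    word = word.lower()
    return LEMMAS.get(word, word)


print(lemmatise("ran"))   # run
print(lemmatise("Mice"))  # mouse
```

Unlike suffix chopping, the lookup handles irregular forms such as "ran", but only for words that are in the dictionary.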
What is the purpose behind pre-processing?
To make querying easier for the user. Otherwise, the user must search for the EXACT word form