Information retrieval Flashcards
What is the task of IR systems
Finding results that are similar to the query
What is the difference between searching in an IR system and in a database
A database query always returns exact matches, whereas an IR system retrieves documents that are similar to the query
Different retrieval techniques depending on the search query
non-textual objects = Meta description
Content - Bag of words
Semantic tagging = what the pieces of text mean
Link analysis - indicates importance of document by incoming links (authority)
Which types of retrieval models are there
Boolean and vector space
Explain the boolean retrieval model
Each document is a bag of words. The user writes a Boolean query that tells the search system in more detail what to search for and how. The query contains Boolean operators: AND, OR, NOT. It can only filter documents, not sort them
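A minimal sketch of Boolean filtering, assuming a made-up three-document collection and a hypothetical helper `containing`. Each document is reduced to a set of words, and AND, OR, NOT map onto set intersection, union and difference. The result is an unordered set of matches, which illustrates why the model filters but does not sort.

```python
# Hypothetical toy collection: document id -> text.
docs = {
    1: "information retrieval finds similar documents",
    2: "database retrieval returns exact matches",
    3: "retrieval from a database of documents",
}

# Reduce each document to the set of words it contains.
index = {doc_id: set(text.split()) for doc_id, text in docs.items()}

def containing(term):
    """Return the ids of all documents that contain the term."""
    return {doc_id for doc_id, words in index.items() if term in words}

# Query: retrieval AND (documents OR database) AND NOT exact
hits = (
    containing("retrieval")
    & (containing("documents") | containing("database"))
) - containing("exact")

print(sorted(hits))  # [1, 3]
```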
Explain the extended boolean retrieval model
The searcher has more control over the search process: the model considers text structure and the distance between words when it matches the query to a piece of text. No ranking, just sorting
Explain vector space retrieval model
A vector is defined by its length and direction. Only the coordinates are needed to compute a vector's length, thanks to the Pythagorean theorem
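As a small illustration of the Pythagorean step, the sketch below computes a vector's length from its coordinates alone and then divides by that length to isolate the direction. The function names `vector_length` and `direction` are chosen here for illustration only.

```python
import math

def vector_length(coordinates):
    """Length via the Pythagorean theorem: sqrt(x1^2 + x2^2 + ...)."""
    return math.sqrt(sum(c * c for c in coordinates))

def direction(coordinates):
    """Unit vector pointing the same way (assumes a non-zero vector)."""
    length = vector_length(coordinates)
    return [c / length for c in coordinates]

print(vector_length([3, 4]))  # 5.0
print(direction([3, 4]))
```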
What is Bag of words
An unordered collection of the words in a document, where the frequency of each word is recorded
The structure of the text is lost
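A short sketch of the idea, assuming whitespace tokenization and lowercasing (both simplifications chosen here): two sentences with the same words in a different order produce the same bag, which shows how the structure of the text is lost.

```python
from collections import Counter

def bag_of_words(text):
    """Map each word to its frequency; word order is discarded."""
    return Counter(text.lower().split())

a = bag_of_words("the dog bites the man")
b = bag_of_words("the man bites the dog")

print(a["the"])  # 2
print(a == b)    # True
```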
common retrieval models
- similarity between document vectors
- term weight (measuring importance of word)
- evaluation of retrieval
What is the purpose of term vector similarity
finding similarity between document vectors, where the query is more than a few words.
Explain term vector similarity between documents
Term-document vector space:
documents are represented as vectors in an n-dimensional space where each dimension/axis is a term/word and each vector coordinate is the weight of that term in the document. If a word is present in a document it gets the value 1 (with binary weights) and so sits at position 1 on that axis. The direction is determined by the words in the text, and the length depends on the number of words in the document (not important)
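The binary case described above can be sketched as follows, assuming two made-up documents. The vocabulary defines the axes, and each document gets a 1 on the axes of the words it contains.

```python
# Hypothetical toy collection.
docs = {
    "d1": "cats chase mice",
    "d2": "dogs chase cats and cats",
}

# One axis per distinct term, in a fixed (alphabetical) order.
vocabulary = sorted({word for text in docs.values() for word in text.split()})

def binary_vector(text):
    """1 on an axis if the term occurs in the text, otherwise 0."""
    words = set(text.split())
    return [1 if term in words else 0 for term in vocabulary]

vectors = {name: binary_vector(text) for name, text in docs.items()}

print(vocabulary)     # ['and', 'cats', 'chase', 'dogs', 'mice']
print(vectors["d1"])  # [0, 1, 1, 0, 1]
print(vectors["d2"])  # [1, 1, 1, 1, 0]
```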
How is similarity measured?
- common terms = straightforward, count the number of terms that q and d have in common
- scalar product = multiply the corresponding coordinates of the two vectors and add the products (x1x2 + y1y2); the result can be normalized by the vector lengths so that the number of words in a document does not dominate
- cosine similarity = similarity between 2 documents calculated as a function of the angle between the term vectors of these documents
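The three measures can be compared side by side in a small sketch. The vectors `q` and `d` are invented for illustration; `d` points in the same direction as `q` but is twice as long, so the scalar product grows while the cosine stays close to 1.

```python
import math

def common_terms(q, d):
    """Number of axes on which both vectors are non-zero."""
    return sum(1 for a, b in zip(q, d) if a > 0 and b > 0)

def scalar_product(q, d):
    """Sum of the products of corresponding coordinates."""
    return sum(a * b for a, b in zip(q, d))

def cosine_similarity(q, d):
    """Scalar product divided by the product of the vector lengths."""
    norm = math.sqrt(scalar_product(q, q)) * math.sqrt(scalar_product(d, d))
    return scalar_product(q, d) / norm if norm else 0.0

q = [1, 1, 0]
d = [2, 2, 0]

print(common_terms(q, d))       # 2
print(scalar_product(q, d))     # 4
print(cosine_similarity(q, d))  # approximately 1, same direction
```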
how is term weight used
Measuring the importance of a word: instead of a binary value, a higher number means a more important term
term frequency - how often the term appears in a document
inverse document frequency - how unique the term is in the collection of documents. Low IDF = not unique, high IDF = unique. Total number of documents / number of documents containing the term (usually log-scaled).
Term weight = term frequency * inverse document frequency
High if the term is frequent in a document and occurs in only a small subset of the collection. The weight is specific to both the document and the collection
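A minimal sketch of the weighting, assuming a made-up three-document collection. Taking the logarithm of the document ratio is a common convention adopted here as an assumption; the helper names `idf` and `term_weight` are illustrative.

```python
import math
from collections import Counter

# Hypothetical toy collection.
docs = {
    "d1": "apple banana apple",
    "d2": "banana cherry",
    "d3": "banana apple cherry cherry",
}
bags = {name: Counter(text.split()) for name, text in docs.items()}

def idf(term):
    """log(total documents / documents containing the term); 0.0 if unseen."""
    containing = sum(1 for bag in bags.values() if term in bag)
    if containing == 0:
        return 0.0
    return math.log(len(bags) / containing)

def term_weight(term, name):
    """Term frequency in the document times inverse document frequency."""
    return bags[name][term] * idf(term)

print(idf("banana"))               # 0.0, the term is in every document
print(term_weight("apple", "d1"))  # frequent in d1 and not in every document
```

Here "banana" gets weight 0 everywhere because it cannot distinguish one document from another, while "apple" gets its highest weight in d1, where it occurs twice.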
How is retrieval evaluated
Precision and recall
What is precision
Fraction of retrieved documents that are relevant. Out of all the documents we retrieved, how many are relevant?
What is recall
Fraction of relevant documents that are retrieved. Out of all the relevant documents, how many did we manage to retrieve?
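Both measures follow directly from two sets, as this sketch shows. The document ids are invented: four documents are retrieved, five are relevant, and two appear in both.

```python
def precision(retrieved, relevant):
    """Share of the retrieved documents that are relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0

def recall(retrieved, relevant):
    """Share of the relevant documents that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0

retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d1", "d3", "d5", "d6", "d7"]

print(precision(retrieved, relevant))  # 0.5
print(recall(retrieved, relevant))     # 0.4
```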
Precision and recall curve (interpolated)
Works like an averaged precision-recall curve. For each standard recall level, take the largest measured precision at any recall equal to or larger than (further to the right of) that level and plot it on the y-axis (precision); this is the interpolated precision. This lets us summarize many curves in one. The interpolated curve never rises as recall increases
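The interpolation rule can be sketched as below. The measured (recall, precision) points are made up, and the eleven standard recall levels 0.0, 0.1, ..., 1.0 are an assumption about which levels are used.

```python
def interpolated_precision(points, recall_levels):
    """For each level, the largest precision measured at recall >= level."""
    curve = []
    for level in recall_levels:
        candidates = [p for r, p in points if r >= level]
        curve.append(max(candidates) if candidates else 0.0)
    return curve

# Hypothetical measured (recall, precision) points for one query.
measured = [(0.2, 1.0), (0.4, 0.67), (0.4, 0.5),
            (0.6, 0.6), (0.8, 0.5), (1.0, 0.45)]
levels = [i / 10 for i in range(11)]

curve = interpolated_precision(measured, levels)
print(curve)
# [1.0, 1.0, 1.0, 0.67, 0.67, 0.6, 0.6, 0.5, 0.5, 0.45, 0.45]
```

Note that the measured precision rises from 0.5 to 0.6 between recall 0.4 and 0.6, but the interpolated values never increase from one level to the next.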