03 Architecture of retrieval system Flashcards
relevance
task statement
build a system that retrieves documents that users are likely to find relevant to their queries
Saracevic's relevance
relevance is the measure of a correspondence existing between a document and a query as determined by the user
relevance in practice
- most models use statistical properties of text rather than linguistic ones
- focus on topical relevance
text representation
bags of words
- treat all words in a document as index terms
- assign a weight to each term based on importance
- disregard structure and word meaning
assumptions
- term occurrence is independent
- document relevance is independent
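The bag-of-words representation above can be sketched as follows. This is a minimal illustration, not a prescribed method: the helper names `bag_of_words` and `tf_weights` are hypothetical, splitting on whitespace is a simplification, and relative term frequency is only one possible weighting (the notes do not say which weight is used).

```python
from collections import Counter

def bag_of_words(text):
    """Count every word as an index term, ignoring order and structure."""
    return Counter(text.lower().split())

def tf_weights(bag):
    """Assumed weighting: relative frequency of each term in the document."""
    total = sum(bag.values())
    return {term: count / total for term, count in bag.items()}

doc = "the cat sat on the mat"
bag = bag_of_words(doc)
weights = tf_weights(bag)

# Word order is discarded: a shuffled document yields the same bag.
same_bag = bag_of_words("mat the on sat cat the")
```

Because only counts are kept, two documents with the same words in a different order are indistinguishable, which is exactly the structure the model disregards.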
document acquisition
accumulate text by web crawls
convert html, pdf to plain text
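As a rough sketch of the HTML-to-plain-text step, the standard library's `html.parser` can collect text nodes while skipping script and style content. The class name `TextExtractor` and the function `html_to_text` are hypothetical; crawling and PDF conversion are not covered here.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # depth inside script/style elements

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())

def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)

page = ("<html><head><style>p{}</style></head>"
        "<body><p>Hello <b>world</b></p><script>x=1</script></body></html>")
text = html_to_text(page)
```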
lexical analysis (tokenisation)
the process of converting a stream of characters into a stream of words
- identify words
- recognise spaces
- decide how to treat digits, hyphens, punctuation and letter case
e.g.
1999 vs 510B.C
state-of-the-art
list.id
Bank vs bank
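One way to make these decisions explicit is a small regex tokeniser with switches for case folding and hyphen splitting. This is a sketch under assumptions: the function `tokenise`, its parameters, and the choice to keep internal hyphens and periods inside a token are illustrative, not the policy the notes prescribe.

```python
import re

# A token is a run of letters/digits, optionally joined by internal "-" or ".".
TOKEN = re.compile(r"[A-Za-z0-9]+(?:[-.][A-Za-z0-9]+)*")

def tokenise(text, fold_case=True, split_hyphens=False):
    tokens = TOKEN.findall(text)
    if split_hyphens:
        # Break "state-of-the-art" into its component words.
        tokens = [part for tok in tokens for part in tok.split("-")]
    if fold_case:
        # Treat "Bank" and "bank" as the same term.
        tokens = [tok.lower() for tok in tokens]
    return tokens

kept = tokenise("state-of-the-art Bank", fold_case=False)
split = tokenise("state-of-the-art", split_hyphens=True)
mixed = tokenise("list.id in 510B.C, 1999")
```

Each switch trades precision for recall: folding case merges "Bank" (perhaps a proper noun) with "bank", and splitting hyphens loses the compound term.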
elimination of stop words
words that are too frequent across documents in the collection are not good discriminators
very low discrimination values
can be important in combinations
- to be or not to be
strategies for stopword removal
- list look up: stop word list
- usage of frequency: information from other documents
- frequency analysis: terms occurring in 80% of documents
reduces size of indexing structure
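The frequency-analysis strategy can be sketched by computing document frequency and flagging terms that appear in a large share of documents. The function names and the toy collection are hypothetical, and reading "80% of documents" as "at least 80%" is an assumption.

```python
from collections import Counter

def frequent_terms(docs, threshold=0.8):
    """Return terms that occur in at least `threshold` of the documents."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each term once per document
    n = len(docs)
    return {term for term, count in df.items() if count / n >= threshold}

def remove_stopwords(tokens, stopwords):
    return [t for t in tokens if t not in stopwords]

docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
    ["the", "cat", "ran"],
    ["the", "bird", "flew"],
    ["a", "cat", "slept"],
]
stop = frequent_terms(docs)
filtered = remove_stopwords(docs[0], stop)
```

Note that removing stop words this way would also strip every term from a phrase query such as "to be or not to be", which is the combination problem mentioned above.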
conflation
the system is expected to be robust: plural forms should not affect retrieval
reduces word variants into a single form
stemming is a specific conflation technique
stemming
reduces all words with the same root to a single root
SESS -> 1SS
(AEIOU)ED -> 1
(AEIOU)Y -> 1
increases retrieval of all possibly relevant documents
reduces index size
problem:
prevents interpretation of meaning (gravitation vs gravity)
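Suffix-stripping rules of this kind can be sketched as below. The three rules used here are taken from the first step of Porter's algorithm (SSES -> SS, vowel-containing stem + ED -> stem, vowel-containing stem + Y -> I); treating them as the rules the notes refer to is an assumption, and the function `stem` is a toy, not a full stemmer.

```python
VOWELS = set("aeiou")

def has_vowel(stem_part):
    return any(ch in VOWELS for ch in stem_part)

def stem(word):
    """Apply the first matching suffix rule; leave the word alone otherwise."""
    word = word.lower()
    if word.endswith("sses"):
        return word[:-2]              # caresses -> caress
    if word.endswith("ed") and has_vowel(word[:-2]):
        return word[:-2]              # plastered -> plaster
    if word.endswith("y") and has_vowel(word[:-1]):
        return word[:-1] + "i"        # happy -> happi
    return word                       # e.g. bled, sky: no vowel in the stem

stems = [stem(w) for w in ["caresses", "plastered", "happy", "bled", "sky"]]
```

The vowel condition stops short words such as "bled" from being cut down to a meaningless stem; a fuller stemmer adds many more rules and, with them, more risk of merging words whose meanings differ.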