04 Architecture of retrieval system 2 Flashcards
indexing
create a bag-of-words representation of each document and store it in a fast lookup structure
inverted index
primary data structure generated by the indexing process
make a dictionary of all words in the collection
for each word, list all the docs it occurred in
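A minimal sketch of the idea above, assuming a toy three-document collection and plain whitespace splitting (both hypothetical, not from the source). The function name `build_inverted_index` is also just an illustrative choice.

```python
from collections import defaultdict

# Hypothetical toy collection: doc id -> text (assumption for illustration).
docs = {
    1: "the cat sat on the mat",
    2: "the dog sat on the log",
    3: "cats and dogs",
}

def build_inverted_index(docs):
    """Map each word to the sorted list of doc ids it occurs in."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Whitespace splitting is a simplification of real tokenisation.
        for word in text.lower().split():
            index[word].add(doc_id)
    # Sorted lists make the postings easy to read and to merge later.
    return {word: sorted(ids) for word, ids in index.items()}

index = build_inverted_index(docs)
print(index["sat"])   # [1, 2]
print(index["cats"])  # [3]
```

The dictionary keys play the role of the vocabulary, and each value is that word's list of documents.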
indexing steps
- lexical analysis (tokenisation)
- stop word removal
- stemming
- index structure creation
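The four steps can be chained roughly as follows. This is only a sketch: the stop-word list is a tiny hypothetical one, and `stem` is a crude suffix stripper standing in for a real stemmer, not an algorithm named in the source.

```python
import re
from collections import defaultdict

# Tiny illustrative stop-word list (assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "of", "on", "and", "is", "in"}

def tokenise(text):
    """Lexical analysis: lowercase and keep runs of letters or digits."""
    return re.findall(r"[a-z0-9]+", text.lower())

def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]

def stem(token):
    """Crude suffix stripping; a stand-in for a proper stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def build_index(docs):
    """Index structure creation: stemmed term -> sorted doc ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in remove_stop_words(tokenise(text)):
            index[stem(token)].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

# Hypothetical two-document collection.
docs = {1: "The cats sat on the mat.", 2: "A cat is sitting in the sun."}
index = build_index(docs)
print(index["cat"])  # [1, 2]
```

Because "cats" is stemmed to "cat", both documents end up under the same index term.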
relevance estimation
compute the relevance of a document for a query
- term weighting scheme: allocates a numeric weight to each term reflecting its importance
- similarity coefficient: uses the term weights to compute an overall degree of similarity between document and query
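One possible pairing of the two parts, shown only as an example: TF-IDF as the term weighting scheme and cosine as the similarity coefficient. The source does not name either; both are assumptions here, as are the toy documents and the function names.

```python
import math
from collections import Counter

# Hypothetical pre-tokenised collection: doc id -> list of terms.
docs = {
    1: ["cat", "sat", "mat"],
    2: ["dog", "sat", "log"],
    3: ["cat", "cat", "dog"],
}

def idf(term, docs):
    """Inverse document frequency: rarer terms get larger weights."""
    df = sum(1 for tokens in docs.values() if term in tokens)
    return math.log(len(docs) / df) if df else 0.0

def weight_vector(tokens, docs):
    """Term weighting: term frequency times idf for each distinct term."""
    tf = Counter(tokens)
    return {term: count * idf(term, docs) for term, count in tf.items()}

def cosine(u, v):
    """Similarity coefficient: cosine of the angle between weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

query = ["cat", "mat"]
q_vec = weight_vector(query, docs)
scores = {
    doc_id: cosine(q_vec, weight_vector(tokens, docs))
    for doc_id, tokens in docs.items()
}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # [1, 3, 2]
```

Document 1 ranks first because it contains both query terms, including the rare term "mat"; document 2 shares no term with the query and scores 0.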
search algorithm
binary AND search
best match algorithm
- for each document, score=0
- for each query term, look it up in the vocabulary and pull out its postings list
- for each document in the list, score += 1
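A small sketch of the best match steps, with binary AND search alongside for contrast. The index contents and the function names `best_match` and `binary_and` are hypothetical. Documents that never appear in a postings list are simply left out of the scores, which is equivalent to leaving their score at 0.

```python
# Hypothetical inverted index: term -> postings list of doc ids.
index = {
    "cat": [1, 3],
    "sat": [1, 2],
    "mat": [1],
    "dog": [2, 3],
}

def best_match(query_terms, index):
    """Score each document by the number of query terms it contains."""
    scores = {}
    for term in query_terms:
        # Unknown terms have an empty postings list.
        for doc_id in index.get(term, []):
            scores[doc_id] = scores.get(doc_id, 0) + 1
    # Highest score first; ties broken by doc id for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

def binary_and(query_terms, index):
    """Return only the documents that contain every query term."""
    postings = [set(index.get(term, [])) for term in query_terms]
    return sorted(set.intersection(*postings)) if postings else []

print(best_match(["cat", "sat", "mat"], index))  # [(1, 3), (2, 1), (3, 1)]
print(binary_and(["cat", "sat"], index))         # [1]
```

Best match still returns documents 2 and 3 with partial matches, whereas binary AND drops any document missing a query term.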