03 Architecture of retrieval system Flashcards

Question 1

Q

relevance

Answer

A

task statement
build a system that retrieves documents that users are likely to find relevant to their queries

Question 2

Q

Saracevic's relevance

Answer

A

relevance is the measure of a correspondence existing between a document and a query as determined by the user

Question 3

Q

relevance in practice

Answer

A

most models use statistical properties of text rather than linguistic ones
the focus is on topical relevance

Question 4

Q

text representation

Answer

A

bag of words
- treat all words in a document as index terms
- assign a weight to each term based on its importance
- disregard structure and word meaning

assumptions
- term occurrence is independent
- document relevance is independent
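
A minimal sketch of the bag-of-words idea, assuming whitespace tokenisation and raw term counts as the weights. Both are simplifications chosen for illustration, and `bag_of_words` is a hypothetical helper name.

```python
from collections import Counter

def bag_of_words(text: str) -> Counter:
    """Map each lowercased, whitespace-separated token to its raw count."""
    return Counter(text.lower().split())

doc = "the cat sat on the mat"
bow = bag_of_words(doc)
print(bow["the"])  # 2

# Word order (structure) is disregarded: a shuffled document gives the same bag.
print(bag_of_words("mat the on sat cat the") == bow)  # True
```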

Question 5

Q

document acquisition

Answer

A

accumulate text by web crawls
convert html, pdf to plain text
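
A rough sketch of the HTML-to-plain-text step, using only Python's standard `html.parser`. The `TextExtractor` class and `html_to_text` function are hypothetical names; a real crawler would also need to handle encodings, broken markup and PDF input, which are not covered here.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())

def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)

page = ("<html><head><style>p {color: red}</style></head>"
        "<body><h1>Title</h1><p>Some <b>bold</b> text.</p></body></html>")
print(html_to_text(page))  # Title Some bold text.
```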

Question 6

Q

lexical analysis (tokenisation)

Answer

A

the process of converting a stream of characters into a stream of words
- identify words
- recognise spaces
- decide how to treat digits, hyphens, punctuation and letter case

e.g.
1999 vs 510 B.C.
state-of-the-art
list.id
Bank vs bank
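
One possible tokenisation policy, sketched with a regular expression: keep digits, keep internal hyphens and periods inside a token, and fold case. These choices are assumptions made for illustration (other systems split on hyphens or keep case), and `tokenise` is a hypothetical helper.

```python
import re

# A token is a run of letters/digits, optionally joined by internal
# hyphens or periods, so "state-of-the-art" and "list.id" stay whole.
TOKEN = re.compile(r"[A-Za-z0-9]+(?:[-.][A-Za-z0-9]+)*")

def tokenise(text: str, lowercase: bool = True) -> list[str]:
    tokens = TOKEN.findall(text)
    return [t.lower() for t in tokens] if lowercase else tokens

print(tokenise("In 1999 the Bank used state-of-the-art list.id fields."))
# ['in', '1999', 'the', 'bank', 'used', 'state-of-the-art', 'list.id', 'fields']
```

With `lowercase=False`, "Bank" and "bank" remain distinct index terms.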

Question 7

Q

elimination of stop words

Answer

A

words which are too frequent among documents in the collection are not good discriminators

they have very low discrimination values

but they can be important in combination
- to be or not to be

Question 8

Q

strategies for stopword removal

Answer

A

list lookup: check terms against a stop word list
use of frequency: frequency information from other documents
frequency analysis: remove terms occurring in 80% of documents

removing stop words reduces the size of the indexing structure
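
A small sketch of the frequency-analysis strategy followed by list lookup, assuming documents are already tokenised and using the 80% document-frequency cut-off mentioned in this card. `frequent_terms` and `remove_stop_words` are hypothetical helpers, and the five toy documents are made up for the example.

```python
def frequent_terms(docs: list[list[str]], threshold: float = 0.8) -> set[str]:
    """Return terms that occur in at least `threshold` of the documents."""
    doc_freq: dict[str, int] = {}
    for tokens in docs:
        for term in set(tokens):  # count each term once per document
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {t for t, df in doc_freq.items() if df / len(docs) >= threshold}

def remove_stop_words(tokens: list[str], stop_words: set[str]) -> list[str]:
    """List lookup: drop every token found in the stop word set."""
    return [t for t in tokens if t not in stop_words]

docs = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "ate", "the", "bone"],
    ["the", "bird", "sat", "on", "the", "fence"],
    ["a", "fish", "swam", "in", "the", "pond"],
    ["the", "cat", "chased", "the", "bird"],
]
stop_words = frequent_terms(docs)
print(stop_words)                              # {'the'}
print(remove_stop_words(docs[0], stop_words))  # ['cat', 'sat', 'on', 'mat']
```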

Question 9

Q

conflation

Answer

A

a robust system is expected to be unaffected by word variants such as plural forms
conflation reduces word variants to a single form
stemming is a specific conflation technique

Question 10

Q

stemming

Answer

A

reduces all words with the same root to a single root form

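example suffix rules (1 = the part of the word before the suffix, (AEIOU) = that part contains a vowel)
SSES -> 1SS
(AEIOU)ED -> 1
(AEIOU)Y -> 1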

increases recall: more of the possibly relevant documents are retrieved
reduces index size
problem:
can conflate words with different meanings (gravitation vs gravity)
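
A toy sketch of suffix-stripping rules of the kind listed in this card: SSES becomes SS, and a trailing ED or Y is dropped when the remaining stem contains a vowel. It is not a full stemmer; for comparison, the Porter stemmer rewrites a final Y to I rather than dropping it. `toy_stem` and `has_vowel` are hypothetical names.

```python
VOWELS = set("aeiou")

def has_vowel(stem: str) -> bool:
    return any(ch in VOWELS for ch in stem)

def toy_stem(word: str) -> str:
    """Apply the first matching suffix rule; otherwise return the word unchanged."""
    word = word.lower()
    if word.endswith("sses"):
        return word[:-2]  # SSES -> SS
    if word.endswith("ed") and has_vowel(word[:-2]):
        return word[:-2]  # drop ED when the stem contains a vowel
    if word.endswith("y") and has_vowel(word[:-1]):
        return word[:-1]  # drop Y when the stem contains a vowel
    return word

for w in ["caresses", "plastered", "bled", "happy", "sky"]:
    print(w, "->", toy_stem(w))
# caresses -> caress
# plastered -> plaster
# bled -> bled
# happy -> happ
# sky -> sky
```

The vowel condition is what keeps short words such as "bled" and "sky" intact.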