08 Evaluation Flashcards
why do we need to evaluate
- economic reasons
- how effective is the solution
- scientific progress
- is the method better than competitors'
- verification
- verify performance
what do we need to evaluate
- efficiency - how fast is it
- coverage - how many pages are indexed
- presentation - how much effort is required of the user
- effectiveness - how correct are the results
what is the IR experimental set up
maintain a test collection of documents, queries, and relevance assessments (the ground truth)
- measures of performance, e.g. precision and recall
- systems to compare, e.g. TF vs TF-IDF ranking
- experimental design
what are the assumptions for the evaluation
system provides a ranked list after searching the query
- a better system will provide a better ranked list
- a better ranked list generally satisfies the users
what is precision
retrieved docs that are relevant / all retrieved docs
what is recall
retrieved docs that are relevant / all relevant docs
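The two definitions above can be sketched directly as set operations. A minimal Python sketch (the document IDs below are made-up examples, not from any real collection):

```python
def precision(retrieved, relevant):
    """Fraction of retrieved documents that are relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    return len(retrieved & relevant) / len(retrieved)

def recall(retrieved, relevant):
    """Fraction of all relevant documents that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    return len(retrieved & relevant) / len(relevant)

retrieved = ["d1", "d2", "d3", "d4"]   # hypothetical result set
relevant = ["d1", "d3", "d5"]          # hypothetical ground truth
print(precision(retrieved, relevant))  # 2/4 = 0.5
print(recall(retrieved, relevant))     # 2/3 ≈ 0.667
```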
ranking effectiveness
- how many to rank? eg. top 1, 3, 5?
- at a fixed rank R, higher precision implies higher recall (both count the same relevant documents in the top R)
what are the 3 methods of summarising ranking
- calculate recall and precision at fixed rank positions
- calculate precision at standard recall levels from 0.0 to 1.0 (requires interpolation)
- average precision - averaging the precision values from the rank positions where a relevant document was retrieved
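The third method (average precision) can be sketched as follows; the ranking and relevance judgments are made-up examples:

```python
def average_precision(ranking, relevant):
    """Average the precision values at the rank positions where
    a relevant document appears; unretrieved relevant documents
    contribute a precision of zero."""
    relevant = set(relevant)
    hits, precision_sum = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)

ranking = ["d3", "d7", "d1", "d9", "d5"]  # hypothetical ranked list
relevant = ["d1", "d3", "d5"]             # hypothetical ground truth
# relevant docs at ranks 1, 3, 5 -> precisions 1/1, 2/3, 3/5
print(average_precision(ranking, relevant))  # (1 + 2/3 + 3/5) / 3 ≈ 0.756
```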
what is mean average precision (MAP)
summarise rankings from multiple queries by taking the mean of the per-query average precision values
- assume user is interested in finding many relevant documents for each query
- requires many relevance judgments in test collection
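MAP is just the mean of the per-query average precision values. A minimal sketch, with two hypothetical queries as input:

```python
def mean_average_precision(results):
    """MAP: mean of per-query average precision.
    `results` is a list of (ranking, relevant_docs) pairs."""
    average_precisions = []
    for ranking, relevant in results:
        relevant = set(relevant)
        hits, precision_sum = 0, 0.0
        for rank, doc in enumerate(ranking, start=1):
            if doc in relevant:
                hits += 1
                precision_sum += hits / rank
        average_precisions.append(precision_sum / len(relevant))
    return sum(average_precisions) / len(average_precisions)

# two made-up queries: one perfect ranking, one with the single
# relevant document at rank 2
results = [(["d1", "d2"], ["d1"]), (["d2", "d1"], ["d1"])]
print(mean_average_precision(results))  # (1.0 + 0.5) / 2 = 0.75
```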
recall precision graphs
- raw recall-precision points vary too much across queries to show a clear pattern; the curves must be interpolated and averaged before systems can be compared
interpolation
defines precision at any standard recall level R as the maximum precision observed in any recall-precision point at that or a higher recall level
- turns the curve into a step function
- average curves are drawn by joining the average precision values at the standard recall levels
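The interpolation rule above (take the maximum precision at any equal-or-higher recall) can be sketched like this; the recall-precision points are made-up examples:

```python
def interpolated_precision(points, recall_level):
    """Interpolated P(R): the maximum precision observed at any
    recall level >= R, giving a non-increasing step function."""
    return max((p for r, p in points if r >= recall_level), default=0.0)

# hypothetical (recall, precision) points from one query's ranking
points = [(0.2, 1.0), (0.4, 0.67), (0.6, 0.5), (0.8, 0.44), (1.0, 0.5)]
levels = [i / 10 for i in range(11)]  # standard recall levels 0.0 .. 1.0
curve = [interpolated_precision(points, lv) for lv in levels]
print(curve[0])   # at recall 0.0: max of all precisions = 1.0
print(curve[9])   # at recall 0.9: only the (1.0, 0.5) point counts = 0.5
```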
…