07 Term Scoring Flashcards
Ranked retrieval
What are the problems with Boolean retrieval?
- most users are not capable of writing Boolean queries
- users don’t want to wade through huge numbers of results
- Boolean queries often return either too few or too many results
Ranked retrieval
What are the main ideas of scoring in ranked retrieval?
- assign a score to each query-document pair
- the score measures how well the document and the query match
Ranked retrieval
What is the query-document match score that the Jaccard
coefficient computes for:
Query: “ides of March”
Document: “Caesar died in March”
jaccard(q,d) = |q ∩ d| / |q ∪ d| = 1/6 (one shared term, March, out of six distinct terms)
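A minimal sketch of this computation, assuming lowercased whitespace tokenization (the card does not specify a tokenizer). The helper name `jaccard` and the zero score for two empty inputs are choices made here for illustration:

```python
def jaccard(query: str, document: str) -> float:
    """Jaccard coefficient of the two term sets: |A & B| / |A | B|."""
    q_terms = set(query.lower().split())
    d_terms = set(document.lower().split())
    union = q_terms | d_terms
    if not union:
        return 0.0  # assumption: two empty inputs score 0
    return len(q_terms & d_terms) / len(union)


# One shared term ("march") out of six distinct terms.
score = jaccard("ides of March", "Caesar died in March")
```

Because the inputs are reduced to sets, repeating a term in the document leaves the score unchanged.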
Ranked retrieval
What’s wrong with Jaccard?
- no weighting: it ignores how often a term occurs in the document
- it treats all terms alike, although rare terms are more informative than frequent terms
Term frequency
What is the tf weight (log frequency weight) for the following term frequencies:
a) tf = 1, b) tf = 10, c) tf = 1000
tf(t,d) = number of times that t occurs in d
a) 1
b) 2
c) 4
we need log frequency instead of raw term frequency
because relevance does not increase proportionally with term frequency
w(t,d) = 1 + log10(tf(t,d)) if tf(t,d) > 0, otherwise 0
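The log frequency weight can be checked with a short sketch. The function name `log_tf_weight` is hypothetical, and tf = 0 is included only to exercise the "otherwise 0" branch:

```python
import math


def log_tf_weight(tf: int) -> float:
    """Log frequency weight: 1 + log10(tf) if tf > 0, otherwise 0."""
    return 1.0 + math.log10(tf) if tf > 0 else 0.0


# Raw counts 0, 1, 10, 1000 map to weights of about 0, 1, 2, 4.
weights = [log_tf_weight(tf) for tf in (0, 1, 10, 1000)]
```

A thousandfold increase in raw frequency raises the weight only from 1 to 4, which is the dampening the card describes.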
Document frequency
Compute the idf weight for the following document frequencies:
a) df = 1, b) df = 100, c) df = 1000 (given N = 1,000,000)
df(t) = number of documents in the whole collection that contain t
a) 6 (rarest term, most informative)
b) 4
c) 3
idf(t) = log10(N / df(t))
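A small sketch that reproduces these three values. The function name `idf` and the choice to reject df = 0 are assumptions made here, since the card does not cover terms that are absent from the collection:

```python
import math


def idf(df: int, n_docs: int) -> float:
    """Inverse document frequency: log10(N / df)."""
    if df <= 0:
        raise ValueError("df must be positive")  # assumption: absent terms are not scored
    return math.log10(n_docs / df)


N = 1_000_000
# df = 1, 100, 1000 give idf of about 6, 4, 3.
idf_values = [idf(df, N) for df in (1, 100, 1000)]
```

The rarer the term, the higher its idf; a term that occurs in every document gets log10(1) = 0.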
How do we calculate the tf-idf weight?
multiply the tf weight by the idf weight: w(t,d) = (1 + log10 tf(t,d)) * log10(N / df(t))
tf-idf is the best known weighting scheme in IR
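As a sketch, the product of the log frequency weight and the idf weight can be written as below. The function name `tf_idf` and the example numbers (tf = 10, df = 1000, N = 1,000,000) are illustrative assumptions, not values from a real collection:

```python
import math


def tf_idf(tf: int, df: int, n_docs: int) -> float:
    """tf-idf weight: (1 + log10(tf)) * log10(N / df), or 0 if the term is absent."""
    if tf <= 0 or df <= 0:
        return 0.0
    return (1.0 + math.log10(tf)) * math.log10(n_docs / df)


# Hypothetical term: occurs 10 times in the document and in 1000 of 1,000,000 documents.
# tf weight is about 2 and idf is about 3, so the product is about 6.
weight = tf_idf(tf=10, df=1000, n_docs=1_000_000)
```

The weight grows with the number of occurrences in the document and with the rarity of the term in the collection.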
Vector space model
What are the key ideas of treating queries as vectors?
- represent queries as vectors in the same space as documents
- rank documents according to their proximity (=similarity) to the query
proximity = negative distance
Why is Euclidean distance not a good proximity measure for these vectors?
it is large for vectors of different lengths, even when their term distributions are similar
a query is a very short vector, so we rank by the angle between the vectors instead of by distance
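The effect can be illustrated with a toy two-term example. The vectors are made up for illustration, and cosine similarity is used here as the angle-based measure (an assumption, since the card only says "angle"):

```python
import math


def euclidean(u: list[float], v: list[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two vectors (0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


query = [1.0, 1.0]       # short query mentioning both terms once
doc_long = [10.0, 10.0]  # long document with the same term distribution
doc_short = [1.0, 0.0]   # short document containing only one of the terms

# Euclidean distance ranks doc_short closer to the query,
# while cosine ranks doc_long higher because it points in the same direction.
dist_long, dist_short = euclidean(query, doc_long), euclidean(query, doc_short)
cos_long, cos_short = cosine(query, doc_long), cosine(query, doc_short)
```

Here the long document has exactly the query's term distribution, yet its distance is much larger simply because the vector is longer. The angle is unaffected by length.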