05 Term weighting Flashcards
what is zipf’s distribution
- a few words occur very frequently
- words of medium frequency occur moderately often (these are the most useful and descriptive)
- many words occur only once
what assumptions of co-ordinated search does the best-match approach contrast with
- whether a term is present in a document is treated as binary
- the degree of association between the term and the document ("relevancy") is not considered
what is the basis of term weighting
determine a word's semantic utility from its statistical properties
- count the word's frequency within a document and across all documents in the collection
what is zipf’s law
rank (R) of a word * its frequency (F) is approximately a constant (K): R * F ≈ K
rank (R) * probability of the word's occurrence is approximately an empirical constant (A ≈ 0.1 for English)
quite accurate except at very high or very low ranks
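A minimal sketch of the rank-frequency relation, using hypothetical word counts constructed to follow Zipf's law (not a real corpus):

```python
# Sketch: check that rank * frequency stays roughly constant (≈ K)
# for counts that follow Zipf's law. Counts below are synthetic,
# built as f ≈ C / rank with C = 1000.
from collections import Counter

counts = Counter({"the": 1000, "of": 500, "and": 333, "to": 250, "a": 200})

ranked = counts.most_common()  # sorted by frequency, highest first
for rank, (word, freq) in enumerate(ranked, start=1):
    # each product clusters near the constant K ≈ 1000
    print(word, rank * freq)
```

Real corpora deviate at the extremes of the ranking, as the card notes, but the product is remarkably stable in the middle range.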
what are the consequences of zipf’s law
some very frequent words are poor discriminators; these are called stop words
removing them reduces the storage cost of the inverted index
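A minimal sketch of stop-word removal before indexing; the stop list and tokenizer here are illustrative simplifications, not a standard list or a production tokenizer:

```python
# Sketch: drop stop words before terms reach the inverted index,
# shrinking its size. STOP_WORDS is a small illustrative sample.
STOP_WORDS = {"the", "a", "of", "and", "to", "is", "in"}

def tokenize(text):
    """Lowercase and split on whitespace (simplified tokenizer)."""
    return text.lower().split()

def index_terms(text):
    """Return only the tokens worth indexing, with stop words dropped."""
    return [t for t in tokenize(text) if t not in STOP_WORDS]

print(index_terms("The weight of a term in the document"))
```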
what is heaps law
V = size of vocabulary (number of unique words)
n = total number of words
V = K * n**beta
typically
- K is 10 to 100
- beta ≈ 0.5
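A minimal sketch of Heaps' law as a function; K and beta below are typical illustrative values from the card, not fitted to any corpus:

```python
# Sketch: Heaps' law V = K * n**beta predicts vocabulary size V
# (unique words) from total word count n. K=10, beta=0.5 are
# illustrative values within the typical ranges given above.
def heaps_vocabulary(n, K=10, beta=0.5):
    """Predicted number of unique words in a text of n total words."""
    return K * n ** beta

for n in (1_000, 1_000_000):
    print(n, heaps_vocabulary(n))
```

Note the sublinear growth: scaling the text from a thousand to a million words grows the predicted vocabulary far more slowly than the text itself.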
what is resolving power
a measure of a term's ability to discriminate relevant documents
2 critical factors
- the word's frequency within a document
- the word's frequency across the collection
what is term weighting
an effective approach to score documents by
- how many query terms each document contains
- how discriminative those terms are
not all terms are equally useful, so each is given a weight
output: W = weight of the kth word in the document
inputs:
f = number of occurrences of the kth word in the document (term frequency)
N = number of documents in the collection
D = number of documents containing the kth word
inverse document frequency: idf = log(N/D)
calculate TF-IDF
W = f * log(N/D), i.e. term frequency multiplied by inverse document frequency
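The definitions above can be sketched directly; the example numbers are hypothetical:

```python
# Sketch: TF-IDF weight of the kth word in a document,
# using the definitions above: W = f * log(N / D).
import math

def tf_idf(f, N, D):
    """f: term frequency in the document,
    N: number of documents in the collection,
    D: number of documents containing the term."""
    return f * math.log(N / D)

# Hypothetical case: a term appearing 3 times in a document,
# found in 10 of 1000 documents in the collection.
print(tf_idf(3, 1000, 10))
```

A term that appears in every document (D = N) gets weight 0 regardless of its frequency, which is exactly the stop-word behaviour described earlier.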