05 Term weighting Flashcards
what is zipf’s distribution
- a few words occur very frequently
- words of medium frequency occur moderately often (these are the most useful and descriptive)
- many words occur only once
what assumptions of co-ordinated search does the best-match approach contrast with
- whether a term is present in a document is treated as binary
- the degree of association between the term and the document ("relevancy") is not considered
what is the basis of term weighting
determine a word's semantic utility from its statistical properties
- count the word's frequency within a document and across all documents in the collection
what is zipf’s law
rank (R) of a word * its frequency (F) is approximately a constant (K): R * F ≈ K
rank (R) * probability of the word's occurrence is approximately an empirical constant (A ≈ 0.1 for English)
quite accurate except at very high or very low ranks
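A minimal sketch of the rank-frequency relation, using hypothetical word counts constructed to follow Zipf's law (not a real corpus):

```python
# Sketch: check that rank * frequency stays roughly constant (≈ K)
# for counts that follow Zipf's law. Counts below are synthetic,
# built as f ≈ C / rank with C = 1000.
from collections import Counter

counts = Counter({"the": 1000, "of": 500, "and": 333, "to": 250, "a": 200})

ranked = counts.most_common()  # sorted by frequency, highest first
for rank, (word, freq) in enumerate(ranked, start=1):
    # each product clusters near the constant K ≈ 1000
    print(word, rank * freq)
```

Real corpora deviate at the extremes of the ranking, as the card notes, but the product is remarkably stable in the middle range.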
what are the consequences of zipf’s law
some very frequent words are poor discriminators; these are called stop words
removing them reduces the storage cost of the inverted index
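A minimal sketch of stop-word removal before indexing; the stop list and tokenizer here are illustrative simplifications, not a standard list or a production tokenizer:

```python
# Sketch: drop stop words before terms reach the inverted index,
# shrinking its size. STOP_WORDS is a small illustrative sample.
STOP_WORDS = {"the", "a", "of", "and", "to", "is", "in"}

def tokenize(text):
    """Lowercase and split on whitespace (simplified tokenizer)."""
    return text.lower().split()

def index_terms(text):
    """Return only the tokens worth indexing, with stop words dropped."""
    return [t for t in tokenize(text) if t not in STOP_WORDS]

print(index_terms("The weight of a term in the document"))
```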
what is heaps law
V = size of vocabulary (number of unique words)
n = total number of words
V = K * n**beta
typically
- K is 10 to 100
- beta ≈ 0.5
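A minimal sketch of Heaps' law as a function; K and beta below are typical illustrative values from the card, not fitted to any corpus:

```python
# Sketch: Heaps' law V = K * n**beta predicts vocabulary size V
# (unique words) from total word count n. K=10, beta=0.5 are
# illustrative values within the typical ranges given above.
def heaps_vocabulary(n, K=10, beta=0.5):
    """Predicted number of unique words in a text of n total words."""
    return K * n ** beta

for n in (1_000, 1_000_000):
    print(n, heaps_vocabulary(n))
```

Note the sublinear growth: scaling the text from a thousand to a million words grows the predicted vocabulary far more slowly than the text itself.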
what is resolving power
a measure of a term's ability to discriminate relevant documents
2 critical factors
- the word's frequency within a document
- the word's frequency across the collection
what is term weighting
an effective approach to score documents by
- how many query terms each document contains
- how discriminative those terms are
not all terms are equally useful, so each is given a weight
output: W = weight of the kth word in the document
inputs:
f = number of occurrences of the kth word in the document (term frequency)
N = number of documents in the collection
D = number of documents containing the kth word
inverse document frequency: idf = log(N/D)
calculate TF-IDF
W = f * log(N/D), i.e. term frequency multiplied by inverse document frequency
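The definitions above can be sketched directly; the example numbers are hypothetical:

```python
# Sketch: TF-IDF weight of the kth word in a document,
# using the definitions above: W = f * log(N / D).
import math

def tf_idf(f, N, D):
    """f: term frequency in the document,
    N: number of documents in the collection,
    D: number of documents containing the term."""
    return f * math.log(N / D)

# Hypothetical case: a term appearing 3 times in a document,
# found in 10 of 1000 documents in the collection.
print(tf_idf(3, 1000, 10))
```

A term that appears in every document (D = N) gets weight 0 regardless of its frequency, which is exactly the stop-word behaviour described earlier.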