w2 L2 information retrieval Flashcards
what is term frequency and why is it important
the more frequently a key term is used in the doc, the more relevant the doc is likely to be
TF(term) = 1_term_in_doc * number_of_occurrences
1_term_in_doc is an indicator function: 1 if the term appears in the doc, 0 otherwise
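a minimal sketch of the TF formula above, assuming the doc is already split into lowercase tokens (the name term_frequency and the sample doc are made up for illustration):

```python
from collections import Counter

def term_frequency(term, doc_tokens):
    """Raw count of `term` in the doc; 0 if the term is absent."""
    counts = Counter(doc_tokens)
    # indicator: 1 if the term is in the doc, else 0
    indicator = 1 if term in counts else 0
    return indicator * counts[term]

doc = "the cat sat on the mat".split()
print(term_frequency("the", doc))  # 2
print(term_frequency("dog", doc))  # 0
```

the indicator is redundant in code (a missing term already has count 0), but it is kept here to mirror the formula.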
if all of your documents contain relevant terms, how do you find the actual relevant documents
downweight the terms that are too frequent in the collection and upweight the rarer terms
if every document contains a relevant word A, the word becomes like a stop word, so the rarer relevant words need to be prioritized
what is inverse document frequency and why do we need it
if a word is super common we need to weight it less, and vice versa, so we use the inverse of its document frequency
IDF of a term = N/(document frequency of term)
N = total number of documents in the collection
how to calculate idf
IDF(term) = log(N / (df(term) + 1))
N is the total number of documents
df(term) is the number of documents that contain the term; the + 1 is for smoothing (it avoids dividing by zero when no document contains the term)
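a rough sketch of the smoothed IDF formula above. assumptions not in the cards: each document is a list of tokens, the log is the natural log, and the function name and toy collection are made up for illustration:

```python
import math

def inverse_document_frequency(term, docs):
    """IDF(term) = log(N / (df(term) + 1)), with each doc a list of tokens."""
    n_docs = len(docs)
    # df = number of documents that contain the term at least once
    df = sum(1 for doc in docs if term in doc)
    return math.log(n_docs / (df + 1))

docs = [
    ["cat", "sat", "mat"],
    ["cat", "ate", "fish"],
    ["dog", "sat", "log"],
    ["cat", "dog", "play"],
]
print(inverse_document_frequency("fish", docs))  # log(4/2), about 0.693
print(inverse_document_frequency("cat", docs))   # log(4/4) = 0.0
```

with the + 1 in the denominator, a term that appears in every document ends up with a slightly negative IDF, since log(N / (N + 1)) < 0.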
how to calculate tf-idf
tf-idf = TF(term) * IDF(term)
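putting the two parts together, a small sketch under the same assumptions as the formulas above (tokenized docs, natural log, + 1 smoothing; the name tf_idf and the toy collection are hypothetical):

```python
import math

def tf_idf(term, doc, docs):
    """tf-idf of `term` in `doc`, relative to the collection `docs`."""
    tf = doc.count(term)                      # raw number of occurrences
    df = sum(1 for d in docs if term in d)    # documents containing the term
    idf = math.log(len(docs) / (df + 1))      # smoothed IDF
    return tf * idf

docs = [
    ["cat", "sat", "mat"],
    ["cat", "ate", "fish", "fish"],
    ["dog", "sat", "log"],
    ["cat", "dog", "play"],
]
doc = docs[1]
print(tf_idf("fish", doc, docs))  # 2 * log(4/2), about 1.386
print(tf_idf("cat", doc, docs))   # 1 * log(4/4) = 0.0
```

the rare word "fish" dominates the score for this doc, while the common word "cat" contributes nothing.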
how do we measure the success of the algorithm
by whether it shows the most relevant results first
precision@k
what is precision at k
you can rank the documents by similarity to your query and cut off this list at a certain point
let's call this point k
if you have access to the list of all documents relevant to the query, you can measure what fraction of the top-k documents returned by the algorithm are relevant
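a minimal sketch of P@k = (relevant docs in the top k) / k. the doc ids, the function name, and the choice to always divide by k (even if fewer than k docs are returned) are assumptions for illustration:

```python
def precision_at_k(ranked_doc_ids, relevant_doc_ids, k):
    """Fraction of the top-k ranked documents that are relevant."""
    top_k = ranked_doc_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_doc_ids)
    return hits / k

ranked = ["d3", "d1", "d7", "d2", "d5"]   # algorithm output, best first
relevant = {"d1", "d2", "d9"}             # known relevant docs for the query

print(precision_at_k(ranked, relevant, 3))  # 1 hit in top 3, about 0.333
print(precision_at_k(ranked, relevant, 5))  # 2 hits in top 5 = 0.4
```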
what is mean precision @k
you are not interested in the results of a single query but in all of them, so you need the average P@k
mean P@k = sum of P@k over all queries / number of queries
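the averaging step can be sketched like this, assuming each query comes as a (ranked list, set of relevant ids) pair; the names and toy data are hypothetical:

```python
def precision_at_k(ranked_doc_ids, relevant_doc_ids, k):
    """Fraction of the top-k ranked documents that are relevant."""
    hits = sum(1 for doc_id in ranked_doc_ids[:k] if doc_id in relevant_doc_ids)
    return hits / k

def mean_precision_at_k(runs, k):
    """runs: list of (ranked_doc_ids, relevant_doc_ids), one pair per query."""
    scores = [precision_at_k(ranked, relevant, k) for ranked, relevant in runs]
    return sum(scores) / len(scores)

runs = [
    (["d3", "d1", "d7"], {"d1", "d2"}),  # P@2 = 1/2
    (["d4", "d8", "d6"], {"d4", "d8"}),  # P@2 = 2/2
]
print(mean_precision_at_k(runs, 2))  # (0.5 + 1.0) / 2 = 0.75
```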
what is the mean reciprocal rank
measures how high, on average, the algorithm places the first relevant document that it returns
how often will you be happy with the first result
what is the formula of mean reciprocal rank
RR = 1 / rank of the first relevant document in the ranked list
MRR = sum of RR over all queries / number of queries
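a small sketch of RR and MRR as defined above. one assumption that the cards do not cover: a query with no relevant document in its ranked list gets RR = 0. names and toy data are made up for illustration:

```python
def reciprocal_rank(ranked_doc_ids, relevant_doc_ids):
    """1 / rank of the first relevant document (ranks start at 1)."""
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant_doc_ids:
            return 1 / rank
    return 0.0  # assumption: no relevant doc returned counts as 0

def mean_reciprocal_rank(runs):
    """runs: list of (ranked_doc_ids, relevant_doc_ids), one pair per query."""
    rrs = [reciprocal_rank(ranked, relevant) for ranked, relevant in runs]
    return sum(rrs) / len(rrs)

runs = [
    (["d3", "d1", "d7"], {"d1"}),  # first relevant at rank 2, RR = 1/2
    (["d4", "d8", "d6"], {"d4"}),  # first relevant at rank 1, RR = 1
    (["d5", "d2", "d9"], {"d9"}),  # first relevant at rank 3, RR = 1/3
]
print(mean_reciprocal_rank(runs))  # (1/2 + 1 + 1/3) / 3, about 0.611
```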