adv-db Flashcards
Boolean retrieval model
connectives: and, or, not, m of n
fuzzy matching: “database” matches “databases”
Thesaurus expansion: expand query into synonyms
stop-word-elimination: eliminate “the”, “is”, “with” (filler words)
Zipf’s law
Zipf’s Law is a statistical distribution in certain data sets, such as words in a linguistic corpus, in which the frequencies of certain words are inversely proportional to their ranks.
he word in the position n appears 1/n^(omega) times as often as the most frequent one. Where omega ~ 1.5 or 2
Precision
fraction of retrieved docs that are relevant
Recall
Fraction of relevant docs that are retrieved
f1 measure
2P * R / P + R
Jaccard coefficient
Measure of overlap between two sets
jaccard(A, B) = |A intersection B | / |A union B|; if A=B, then jaccard(A,B) = 1
Issues:
* Overaccount for long texts
* Does not account for term frequency
* Does not consider that rare terms may be more informative than frequent terms
Binary term-document “Incidence Matrix”
Assign 0 or 1 if the term appears in the given document and create a matrix of form:
Doc1 Doc2 term 1 1 0 term 2 0 1
Term-document “Count Matrix”
Assign count of term (term frequency) in the document and create a matrix of form:
Doc1 Doc2 term 1 73 3 term 2 0 44
Bag of words model
Vector representation doesn’t consider ordering of words in documents. Such vectors, are non positional representations.
Term frequency (tf)
frequency of term t in document d, defined as the number of times that t occurs in d
- Relevance does not grow linearly with term frequency
Log-term-frequency weighting
w = {1 + log_10(tf)} if tf > 0 otherwise 0
Then the score for a document query pair is just the sum of the weights for each term in the query
Inverse document frequency (idf)
df_t: number of documents in the database that contain term t
idf_t: inverse document frequency of t in the database -> idf_t = log_10(N/df_t); where N is the number of documents in the database
We use the log function to ‘dampen’ the effect of idf
Each idf is database-dependent: each term has one value for the whole DB
No effect on queries with only 1 term
Collection frequency
Number of occurrences of t in the collection, counting multiple occurrences within documents
tf-idf weighting
tf-idf = tf * idf
tf-idf_t,d = (1 + log_10(tf_t,d)) * log_10(N / df_t); where N is the number
For a query, sum the tf-idf of each term
Vector length normalization
Divide vector by it’s L2 norm
L2 norm –> ||x||_2 = sqrt(sum(x^2))