adv-db Flashcards
Boolean retrieval model
Connectives: AND, OR, NOT, m of n
fuzzy matching: “database” matches “databases”
Thesaurus expansion: expand query into synonyms
Stop-word elimination: remove filler words such as “the”, “is”, “with”
Zipf’s law
Zipf’s Law is a statistical distribution seen in certain data sets, such as words in a linguistic corpus, in which the frequency of an item is inversely proportional to its rank.
The word at rank n appears 1/n^omega times as often as the most frequent one, where omega ~ 1.5 or 2.
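A tiny sketch of that prediction (the omega default is just an assumed value matching the range quoted above):

```python
# Zipf's law: the word at rank n occurs about 1/n**omega times
# as often as the most frequent word (omega value is an assumption)
def zipf_relative_frequency(rank: int, omega: float = 1.5) -> float:
    return 1.0 / rank ** omega

for rank in (1, 2, 3, 10):
    print(rank, round(zipf_relative_frequency(rank), 3))
# 1 -> 1.0, 2 -> 0.354, 3 -> 0.192, 10 -> 0.032
```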
Precision
fraction of retrieved docs that are relevant
Recall
Fraction of relevant docs that are retrieved
F1 measure
F1 = 2 * P * R / (P + R) (the harmonic mean of precision and recall)
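A minimal sketch computing all three measures from sets of document IDs (the IDs are illustrative):

```python
def precision_recall_f1(retrieved: set, relevant: set):
    tp = len(retrieved & relevant)                   # relevant docs we retrieved
    p = tp / len(retrieved) if retrieved else 0.0    # precision
    r = tp / len(relevant) if relevant else 0.0      # recall
    f1 = 2 * p * r / (p + r) if p + r else 0.0       # harmonic mean
    return p, r, f1

# e.g. 2 of 3 retrieved docs are relevant; 2 of 4 relevant docs retrieved
print(precision_recall_f1({"d1", "d2", "d3"}, {"d2", "d3", "d4", "d5"}))
# -> (0.667, 0.5, 0.571) approximately
```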
Jaccard coefficient
Measure of overlap between two sets
jaccard(A, B) = |A intersection B| / |A union B|; if A = B, then jaccard(A, B) = 1
Issues:
* Does not normalize properly for document length (long documents enlarge the union and drag the score down)
* Does not account for term frequency
* Does not consider that rare terms may be more informative than frequent terms
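A minimal sketch of the coefficient on sets of terms:

```python
def jaccard(a: set, b: set) -> float:
    # |A intersection B| / |A union B|; 1.0 for identical sets
    if not a and not b:
        return 1.0   # convention for two empty sets (assumption)
    return len(a & b) / len(a | b)

print(jaccard({"database", "query"}, {"database", "index"}))  # 1/3 ~ 0.333
```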
Binary term-document “Incidence Matrix”
Assign 0 or 1 if the term appears in the given document and create a matrix of form:
        Doc1  Doc2
term 1     1     0
term 2     0     1
Term-document “Count Matrix”
Assign count of term (term frequency) in the document and create a matrix of form:
        Doc1  Doc2
term 1    73     3
term 2     0    44
Bag of words model
The vector representation does not consider the ordering of words in a document. Such vectors are non-positional representations.
Term frequency (tf)
frequency of term t in document d, defined as the number of times that t occurs in d
- Relevance does not grow linearly with term frequency
Log-term-frequency weighting
w_t,d = 1 + log_10(tf_t,d) if tf_t,d > 0, otherwise 0
Then the score for a query-document pair is the sum of the weights of the query terms that occur in the document
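A small sketch of this scoring, assuming raw term counts per document are already available:

```python
import math

def log_tf_weight(tf: int) -> float:
    # 1 + log10(tf) if tf > 0, otherwise 0
    return 1 + math.log10(tf) if tf > 0 else 0.0

def log_tf_score(query_terms, doc_counts):
    # doc_counts maps term -> raw count in the document (assumed input)
    return sum(log_tf_weight(doc_counts.get(t, 0)) for t in query_terms)

print(log_tf_score(["database", "index"], {"database": 10, "query": 3}))
# "database": 1 + log10(10) = 2; "index" absent: 0 -> score 2.0
```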
Inverse document frequency (idf)
df_t: number of documents in the database that contain term t
idf_t: inverse document frequency of t in the database -> idf_t = log_10(N/df_t); where N is the number of documents in the database
We use the log function to ‘dampen’ the effect of idf
Each idf is database-dependent: each term has one value for the whole DB
No effect on the ranking for queries with only 1 term (idf scales every document’s score equally)
Collection frequency
Number of occurrences of t in the collection, counting multiple occurrences within documents
tf-idf weighting
tf-idf = tf * idf
tf-idf_t,d = (1 + log_10(tf_t,d)) * log_10(N / df_t); where N is the number of documents in the database
For a query, sum the tf-idf of each term
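A minimal sketch combining both weights; `doc_freq` (df per term) and `n_docs` (N) are assumed to be precomputed over the database:

```python
import math

def tf_idf(tf: int, df: int, n_docs: int) -> float:
    # (1 + log10(tf_t,d)) * log10(N / df_t); 0 if the term is absent
    if tf == 0 or df == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(n_docs / df)

def tf_idf_score(query_terms, doc_counts, doc_freq, n_docs):
    # query-document score = sum of tf-idf over the query terms
    return sum(tf_idf(doc_counts.get(t, 0), doc_freq.get(t, 0), n_docs)
               for t in query_terms)
```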
Vector length normalization
Divide the vector by its L2 norm
L2 norm -> ||x||_2 = sqrt(sum(x_i^2))
Cosine similarity
cos(q, d) = Sum(q_i * d_i) / sqrt( sum(q_i^2) * sum(d_i^2) )
Where:
q_i = tf-idf of term i in query q
d_i = tf-idf of term i in document d
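A sketch using plain dicts as sparse tf-idf vectors; note that for length-normalized vectors this reduces to a dot product:

```python
import math

def cosine(q: dict, d: dict) -> float:
    # q and d map term -> tf-idf weight (sparse vectors)
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    nq = math.sqrt(sum(w * w for w in q.values()))   # ||q||_2
    nd = math.sqrt(sum(w * w for w in d.values()))   # ||d||_2
    return dot / (nq * nd) if nq and nd else 0.0
```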
Rocchio’s algorithm
Given a query q, let q_0 be the initial vector representation of q
q_t+1 = alpha * q_t + beta * sum(d_i for d_i in R) / |R| - gamma * sum(d_i for d_i in NR) / |NR|
Where:
d_i = vector representation of the document
R = relevant documents for q
NR: non relevant documents for q
Essentially, we take the original query vector scaled by alpha, add beta times the average vector (e.g., average tf-idf per term) of the relevant documents, and subtract gamma times the average vector of the non-relevant documents.
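A sketch of one update step with numpy; the alpha/beta/gamma defaults are illustrative assumptions, and clipping negative weights to 0 is a common convention rather than part of the formula above:

```python
import numpy as np

def rocchio(q, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    # q: query vector; relevant / non_relevant: 2-D arrays of doc vectors
    q_new = alpha * q
    if len(relevant):
        q_new = q_new + beta * relevant.mean(axis=0)        # centroid of R
    if len(non_relevant):
        q_new = q_new - gamma * non_relevant.mean(axis=0)   # centroid of NR
    return np.maximum(q_new, 0.0)  # negative term weights clipped to 0

q = np.array([1.0, 0.0, 0.5])
R = np.array([[0.9, 0.1, 0.4], [0.8, 0.0, 0.6]])
NR = np.array([[0.0, 0.9, 0.1]])
print(rocchio(q, R, NR))
```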
Inverted files
Term_1: d1, d2, d3
Use hash maps to store efficiently (i.e., key is term, value is list of documents)
- Sort by the terms
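A minimal sketch with a hash map (dict) from term to a sorted posting list; the whitespace tokenization is a simplifying assumption:

```python
from collections import defaultdict

def build_inverted_index(docs: dict) -> dict:
    # docs maps doc_id -> text; result maps term -> sorted list of doc_ids
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    # sort by term, and keep each posting list sorted
    return {term: sorted(ids) for term, ids in sorted(postings.items())}

index = build_inverted_index({"d1": "database systems", "d2": "query the database"})
print(index["database"])  # ['d1', 'd2']
```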
Champion list
Offline step: for each term t, identify the r documents with the highest weights for t.
This set of r documents is t’s ‘champion list’.
At query time, restrict scoring to the union of the champion lists of the query terms.
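A sketch of both steps; `weights` (term -> {doc_id: weight}) is an assumed precomputed structure:

```python
def build_champion_lists(weights: dict, r: int) -> dict:
    # offline: keep only the r highest-weighted docs per term
    return {t: sorted(docs, key=docs.get, reverse=True)[:r]
            for t, docs in weights.items()}

def candidate_docs(query_terms, champions) -> set:
    # at query time, score only the union of the champion lists
    return set().union(*(set(champions.get(t, ())) for t in query_terms))
```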
PageRank
Transition probability matrix:
P_ij = alpha/N + (1 - alpha) * 1/L_i if there is an edge from i to j, otherwise alpha/N
Where:
N = Number of nodes
L_i = number of outgoing edges from Node i to another node
P_ij is the probability of j being the next state, given that the random surfer is currently in state i
Query-independent -> a static quality score
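A power-iteration sketch with numpy; treating dangling nodes (no out-links) as jumping uniformly is an assumed convention:

```python
import numpy as np

def pagerank(adj, alpha=0.15, iters=100):
    # adj[i][j] = 1 if there is a link i -> j; alpha = teleport probability
    a = np.asarray(adj, dtype=float)
    n = a.shape[0]
    out = a.sum(axis=1, keepdims=True)             # L_i per row
    a = np.where(out > 0, a / np.where(out > 0, out, 1), 1.0 / n)
    p = alpha / n + (1 - alpha) * a                # P_ij as defined above
    x = np.full(n, 1.0 / n)                        # uniform start
    for _ in range(iters):
        x = x @ p                                  # one random-surfer step
    return x

print(pagerank([[0, 1, 1], [1, 0, 0], [0, 1, 0]]))
```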
Hubs & authorities
Query-dependent alternative to PageRank
Authorities for query q are pages that have highly relevant content for q and are authoritative
Hubs for q are pages that are good directories of resources for q (i.e., pages that point to many good authorities for q)
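A sketch of the HITS mutual-reinforcement iteration; building the query-specific base set of pages is assumed done beforehand:

```python
import numpy as np

def hits(adj, iters=50):
    # adj[i][j] = 1 if page i links to page j (within the query's base set)
    a = np.asarray(adj, dtype=float)
    hubs = np.ones(a.shape[0])
    auths = np.ones(a.shape[0])
    for _ in range(iters):
        auths = a.T @ hubs                     # authorities: pointed to by good hubs
        auths /= np.linalg.norm(auths) or 1.0  # guard against an edgeless graph
        hubs = a @ auths                       # hubs: point to good authorities
        hubs /= np.linalg.norm(hubs) or 1.0
    return hubs, auths

hubs, auths = hits([[0, 1, 1], [0, 0, 1], [0, 0, 0]])
```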
Snowball