adv-db Flashcards

Question 1

Q

Boolean retrieval model

Answer

A

connectives: and, or, not, m of n
fuzzy matching: “database” matches “databases”
Thesaurus expansion: expand query into synonyms
stop-word-elimination: eliminate “the”, “is”, “with” (filler words)

Question 2

Q

Zipf’s law

Answer

A

Zipf’s Law is a statistical distribution in certain data sets, such as words in a linguistic corpus, in which the frequencies of certain words are inversely proportional to their ranks.

he word in the position n appears 1/n^(omega) times as often as the most frequent one. Where omega ~ 1.5 or 2

Question 3

Q

Precision

Answer

A

fraction of retrieved docs that are relevant

Question 4

Q

Recall

Answer

A

Fraction of relevant docs that are retrieved

Question 5

Q

f1 measure

Answer

A

2P * R / P + R

Question 6

Q

Jaccard coefficient

Answer

A

Measure of overlap between two sets
jaccard(A, B) = |A intersection B | / |A union B|; if A=B, then jaccard(A,B) = 1
Issues:
* Overaccount for long texts
* Does not account for term frequency
* Does not consider that rare terms may be more informative than frequent terms

Question 7

Q

Binary term-document “Incidence Matrix”

Answer

A

Assign 0 or 1 if the term appears in the given document and create a matrix of form:

                Doc1     Doc2 term 1         1             0 term 2         0            1

Question 8

Q

Term-document “Count Matrix”

Answer

A

Assign count of term (term frequency) in the document and create a matrix of form:

                Doc1     Doc2 term 1         73             3 term 2         0            44

Question 9

Q

Bag of words model

Answer

A

Vector representation doesn’t consider ordering of words in documents. Such vectors, are non positional representations.

Question 10

Q

Term frequency (tf)

Answer

A

frequency of term t in document d, defined as the number of times that t occurs in d
- Relevance does not grow linearly with term frequency

Question 11

Q

Log-term-frequency weighting

Answer

A

w = {1 + log_10(tf)} if tf > 0 otherwise 0
Then the score for a document query pair is just the sum of the weights for each term in the query

Question 12

Q

Inverse document frequency (idf)

Answer

A

df_t: number of documents in the database that contain term t
idf_t: inverse document frequency of t in the database -> idf_t = log_10(N/df_t); where N is the number of documents in the database

We use the log function to ‘dampen’ the effect of idf

Each idf is database-dependent: each term has one value for the whole DB

No effect on queries with only 1 term

Question 13

Q

Collection frequency

Answer

A

Number of occurrences of t in the collection, counting multiple occurrences within documents

Question 14

Q

tf-idf weighting

Answer

A

tf-idf = tf * idf
tf-idf_t,d = (1 + log_10(tf_t,d)) * log_10(N / df_t); where N is the number

For a query, sum the tf-idf of each term

Question 15

Q

Vector length normalization

Answer

A

Divide vector by it’s L2 norm
L2 norm –> ||x||_2 = sqrt(sum(x^2))

Question 16

Q

Cosine similarity

Answer

Study These Flashcards

A

cos(q, d) = Sum(q_i * d_i) / sqrt( sum(q_i^2) * sum(d_i^2) )

Where:
q_i = tf-idf of term i in query q
d_i = tf-idf of term i in document d

Question 17

Q

Cosine similarity

Answer

Study These Flashcards

A

cos(q, d) = Sum(q_i * d_i) / sqrt( sum(q_i^2) * sum(d_i^2) )

Where:
q_i = tf-idf of term i in query q
d_i = tf-idf of term i in document d

Question 18

Q

Rocchio’s algorithm

Answer

Study These Flashcards

A

Given a query q, q_0 initial vector representation for q

q_t+1 = alpha * q_t + beta * sum(d_i for d_i in R) / |R| - gamma * sum(d_i for d_i in NR) / |NR|

Where:
d_i = vector representation of the document
R = relevant documents for q
NR: non relevant documents for q

Essentially we have a query vector of the form:
alpha. 1
beta 1
gamma 1

And we add the average tf-idf of each term given the relevant documents and then subtract the average tf-idf of each term of the non relevant documents

Question 19

Q

Inverted files

Answer

Study These Flashcards

A

Term_1: d1, d2, d3
Use hash maps to store efficiently (i.e., key is term, value is list of documents)
- Sort by the terms

Question 20

Q

Champion list

Answer

Study These Flashcards

A

identify, for each term t, the r documents with the highest weights for t (offline step).
The set of r documents is the ‘champion list’
When the query comes, make the collection the union of the r documents for each query term

Question 21

Q

Page rank

Answer

Study These Flashcards

A

Probability matrix:
P = (alpha/N + (1 - alpha) * 1/L_i for L_i in Nodes)

Where:
N = Number of nodes
L_i = number of outgoing edges from Node i to another node

Pij is the probability of j being the next state given that random surfer is currently in state i

Query independent -> static quality score

Question 22

Q

Hubs & authorities

Answer

Study These Flashcards

A

Query dependent alternative to PageRank

Authorities for query q are pages that have highly relevant content for q and are authoritative

Hubs for q are pages that are good directories of resources for q (i.e., pages that point to many good authorities for q)

Question 23

Q

Snowball

Answer

Study These Flashcards

A

adv-db Flashcards

(23 cards)