06 Scoring, term weighting, and vector space model Flashcards

Question 1

Q

Metadata

Answer

A

Digital docs generally encode, in machine-recognizable form, certain metadata associated with each document, such as author(s), title, and date of publication.

Question 2

Q

Field

Answer

A

Examples of fields: Date of creation, format, author, title.

Question 3

Q

Parametric indexes

Answer

A

There is one parametric index for each field (for example, date of creation). It allows us to select only the docs matching the field value, such as a date, specified in the query.

Question 4

Q

Zones

Answer

A

Similar to fields, except that the content of a zone can be arbitrary free text.

Question 5

Q

Weighted zone scoring

Answer

A

Assigns to the pair (query, doc) a score in the interval [0, 1] by computing a linear combination of zone scores.
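A minimal sketch of the linear combination described on this card. It assumes each zone score is Boolean (1 if every query term appears in the zone, 0 otherwise) and that the zone weights sum to 1, which is what keeps the result in [0, 1]. The function name, the zone names, and the weights are hypothetical.

```python
def weighted_zone_score(query_terms, doc_zones, weights):
    """Score a (query, doc) pair as a weighted sum of Boolean zone matches.

    doc_zones maps zone name -> zone text; weights maps zone name -> weight.
    Assumption: the weights sum to 1, so the score stays in [0, 1].
    """
    score = 0.0
    for zone, weight in weights.items():
        zone_tokens = set(doc_zones.get(zone, "").lower().split())
        # Assumed zone score: 1 if every query term occurs in the zone, else 0.
        matches = all(term.lower() in zone_tokens for term in query_terms)
        score += weight * (1.0 if matches else 0.0)
    return score


# Hypothetical document with two zones and illustrative weights.
doc = {"title": "vector space model", "body": "scoring documents in a vector space"}
weights = {"title": 0.3, "body": 0.7}

both_zones = weighted_zone_score(["vector", "space"], doc, weights)  # matches title and body
title_only = weighted_zone_score(["model"], doc, weights)            # matches title only
```

Here a query that matches only the title receives just the title weight, while a query that matches both zones receives the full score.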

Question 6

Q

Term frequency

Answer

A

Number of occurrences of a term in a doc.

Question 7

Q

Bag of words

Answer

A

A doc representation where the exact ordering of the terms in a document is ignored. We only retain information on the number of occurrences of each term. Therefore, "Mary is quicker than John" gives the same document representation as "John is quicker than Mary".
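The Mary/John example can be checked with a short sketch. It assumes whitespace tokenization and lowercasing, which are simplifications chosen here, not part of the card.

```python
from collections import Counter


def bag_of_words(text):
    """Map each term to its number of occurrences, ignoring word order."""
    return Counter(text.lower().split())


a = bag_of_words("Mary is quicker than John")
b = bag_of_words("John is quicker than Mary")
same_representation = (a == b)  # True: only the term counts are kept
```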

Question 8

Q

Document frequency

Answer

A

Defined to be the number of docs in the collection that contain a term t.

Question 9

Q

Inverse document frequency

Answer

A

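idf = log (N/df), where N is the total number of docs in the collection and df is the document frequency of the term. Thus, the idf of a rare term is high, and the idf of a frequent term is low.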
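As a quick numeric illustration, here is a sketch of the formula. The card does not state the base of the logarithm; base 10 is assumed here, and the collection size is made up.

```python
import math


def idf(num_docs, doc_freq):
    """Inverse document frequency, log10(N / df). The base 10 is an assumption."""
    return math.log10(num_docs / doc_freq)


# Hypothetical collection of 1,000,000 docs.
rare = idf(1_000_000, 10)            # term found in only 10 docs: high idf
common = idf(1_000_000, 1_000_000)   # term found in every doc: idf of 0
```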

Question 10

Q

Tf-idf

Answer

A

tf-idf = tf x idf

Properties

Highest when a term occurs many times within a small number of docs.
Lower when the term occurs fewer times in a document, or occurs in many documents.
Lowest when the term occurs in virtually all documents.
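The three properties can be seen on a toy collection. This is a sketch assuming raw term counts for tf and a base-10 logarithm for idf; the documents and the function name are invented for illustration.

```python
import math
from collections import Counter


def tf_idf_weights(docs):
    """Return one {term: tf-idf} dict per document; docs are lists of tokens."""
    n = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in docs:
        tf = Counter(tokens)
        weights.append({t: tf[t] * math.log10(n / df[t]) for t in tf})
    return weights


docs = [
    "the car the car insurance".split(),
    "the best insurance".split(),
    "the auto".split(),
]
w = tf_idf_weights(docs)
# In the first doc: "car" (frequent here, rare elsewhere) gets the highest weight,
# "insurance" (once here, in two docs) gets a lower one,
# and "the" (in every doc) gets 0.
```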

Question 11

Q

Document vector

Answer

A

We can view each document as a vector with one component for each term in the dictionary, where the weight of each component is given by tf-idf.

Question 12

Q

Vector space model

Answer

A

The representation of a set of docs as vectors in a common vector space is known as the vector space model.

Question 13

Q

Cosine similarity

Answer

A

The standard way of quantifying the similarity between two documents, or between a query and a document, is to compute the cosine similarity, where the numerator is the dot product of the two vectors and the denominator is the product of their Euclidean lengths. This measure is the cosine of the angle θ between the two vectors.
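A minimal sketch of this computation on plain Python lists. Returning 0.0 for a zero-length vector is a convention chosen here, not something the card specifies.

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their Euclidean lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    length_u = math.sqrt(sum(a * a for a in u))
    length_v = math.sqrt(sum(b * b for b in v))
    if length_u == 0 or length_v == 0:
        return 0.0  # assumed convention for an all-zero vector
    return dot / (length_u * length_v)


parallel = cosine_similarity([1, 2], [2, 4])    # same direction: cosine near 1
orthogonal = cosine_similarity([1, 0], [0, 1])  # no shared terms: cosine of 0
```

Because the denominator divides out vector length, a document and a copy of it repeated twice score as identical.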

Question 14

Q

Term-document matrix

Answer

A

Viewing a collection of N docs as a collection of vectors leads to a natural view of the collection as a term-document matrix.
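A small sketch of that view, assuming rows correspond to terms, columns correspond to docs, and entries are raw term counts (any other weighting, such as tf-idf, could be substituted). The function name and the sample docs are hypothetical.

```python
def term_document_matrix(docs):
    """Build a matrix with one row per term and one column per document."""
    vocab = sorted({term for tokens in docs for term in tokens})
    matrix = [[tokens.count(term) for tokens in docs] for term in vocab]
    return vocab, matrix


docs = [["a", "b", "a"], ["b", "c"]]
vocab, matrix = term_document_matrix(docs)
# vocab is ["a", "b", "c"]; column j of matrix is the vector for docs[j].
```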

Question 15

Q

Term-at-a-time

Answer

A

The process of adding contributions one query term at a time is known as term-at-a-time scoring or accumulation, and the N elements of the array "Scores" are therefore known as accumulators.
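The accumulation can be sketched as follows. For brevity this sketch uses a dictionary of accumulators instead of an array of N elements, and it assumes each posting already carries the term's weight for that doc; the postings data is invented.

```python
from collections import defaultdict


def term_at_a_time(query_terms, postings):
    """Add score contributions one query term at a time.

    postings maps term -> list of (doc_id, weight) pairs (assumed layout).
    """
    scores = defaultdict(float)  # the accumulators, one per doc seen so far
    for term in query_terms:
        # Finish the whole postings list of this term before moving on.
        for doc_id, weight in postings.get(term, []):
            scores[doc_id] += weight
    return dict(scores)


postings = {
    "car": [(1, 0.5), (3, 1.0)],
    "insurance": [(1, 0.25), (2, 2.0)],
}
scores = term_at_a_time(["car", "insurance"], postings)
```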

Question 16

Q

Document-at-a-time

Answer

A

In a postings traversal, if we compute the scores of one document at a time, it's called document-at-a-time scoring. Basically, if you use the merging/intersection algorithm from chapter 1, you are using document-at-a-time scoring.
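A sketch of document-at-a-time scoring, assuming every postings list is sorted by doc ID and holds (doc_id, weight) pairs. It walks all lists in parallel with `heapq.merge` from the standard library, which is one convenient way to do the merge and not necessarily the algorithm the card refers to.

```python
import heapq


def document_at_a_time(query_terms, postings):
    """Compute the complete score of each document before moving to the next.

    postings maps term -> list of (doc_id, weight), sorted by doc_id (assumed).
    """
    lists = [postings[t] for t in query_terms if t in postings]
    results = []
    current_doc, current_score = None, 0.0
    # heapq.merge yields postings from all lists in increasing doc_id order.
    for doc_id, weight in heapq.merge(*lists, key=lambda p: p[0]):
        if doc_id != current_doc:
            if current_doc is not None:
                results.append((current_doc, current_score))
            current_doc, current_score = doc_id, 0.0
        current_score += weight
    if current_doc is not None:
        results.append((current_doc, current_score))
    return results


postings = {
    "car": [(1, 0.5), (3, 1.0)],
    "insurance": [(1, 0.25), (2, 2.0)],
}
results = document_at_a_time(["car", "insurance"], postings)
```

Unlike the term-at-a-time approach, only one running score is held at any moment, and each document's score is final as soon as it is emitted.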

Question 17

Q

Sublinear tf-scaling

Answer

A

Use the logarithm of the term frequency, because a term that occurs twenty times is (often) not twenty times more important than a term that occurs once. Therefore:

tf-idf = wf x idf, where
wf = 1 + log tf if tf > 0
wf = 0 otherwise
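To see how strongly the logarithm dampens raw counts, here is a sketch of wf. The base of the logarithm is not given on the card; base 10 is assumed.

```python
import math


def sublinear_tf(tf):
    """wf = 1 + log10(tf) if tf > 0, else 0. The base 10 is an assumption."""
    return 1 + math.log10(tf) if tf > 0 else 0.0


once = sublinear_tf(1)     # 1.0
twenty = sublinear_tf(20)  # about 2.3, far below 20
absent = sublinear_tf(0)   # 0.0
```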

Question 18

Q

Maximum tf normalization

Answer

A

Normalize the tf weights of all terms occurring in a document by the maximum tf in that doc.
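A minimal sketch of the normalization. The optional smoothing parameter `a` is an addition that is not on the card: with the default `a = 0` the result is simply tf divided by the maximum tf in the doc.

```python
from collections import Counter


def max_tf_normalize(tokens, a=0.0):
    """Return {term: a + (1 - a) * tf / tf_max} for one document.

    a is an optional smoothing term (an assumption beyond the card);
    a = 0 gives plain division by the maximum tf.
    """
    tf = Counter(tokens)
    if not tf:
        return {}
    tf_max = max(tf.values())
    return {term: a + (1 - a) * count / tf_max for term, count in tf.items()}


plain = max_tf_normalize(["x", "x", "y"])            # most frequent term gets 1.0
smoothed = max_tf_normalize(["x", "x", "y"], a=0.4)  # weights pulled toward 1.0
```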

(18 cards)