06 Scoring, term weighting, and vector space model Flashcards

1
Q

Metadata

A

Digital documents generally encode, in machine-recognizable form, certain metadata associated with each document, such as author(s), title, and date of publication.

2
Q

Field

A

A piece of metadata that generally takes on a relatively small set of possible values. Examples of fields: date of creation, format, author, title.

3
Q

Parametric indexes

A

There is one parametric index for each field. It allows us to select only the documents matching a value (such as a date) specified in the query.

4
Q

Zones

A

Similar to fields, except that the content of a zone can be arbitrary free text (e.g., document titles and abstracts).

5
Q

Weighted zone scoring

A

Assigns to the pair (query, document) a score in the interval [0, 1] by computing a linear combination of zone scores.
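
A minimal sketch of weighted zone scoring in Python; the zone names, the weights, and the all-query-terms-present rule for the per-zone Boolean score are illustrative assumptions, not from the card:

```python
# Weighted zone scoring sketch: each zone gets a Boolean match score s,
# and the document's score is the weighted sum of the zone scores.
def weighted_zone_score(query_terms, doc_zones, weights):
    """doc_zones maps zone name -> text; the zone weights sum to 1."""
    score = 0.0
    for zone, g in weights.items():
        tokens = doc_zones.get(zone, "").lower().split()
        # Illustrative Boolean rule: a zone matches if it contains
        # every query term.
        s = 1.0 if all(t in tokens for t in query_terms) else 0.0
        score += g * s
    return score

doc = {"title": "Introduction to Information Retrieval",
       "body": "scoring term weighting and the vector space model"}
weights = {"title": 0.6, "body": 0.4}  # illustrative weights, sum to 1
print(weighted_zone_score(["information", "retrieval"], doc, weights))  # 0.6
```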

6
Q

Term frequency

A

The number of occurrences of a term in a document.

7
Q

Bag of words

A

A document representation in which the exact ordering of the terms is ignored; we retain only information on the number of occurrences of each term. Therefore, "Mary is quicker than John" gives the same document representation as "John is quicker than Mary".
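
A quick sketch of the idea using Python's collections.Counter, with the two sentences from the card:

```python
# Bag of words: both sentences collapse to the same multiset of terms,
# so their document representations are identical.
from collections import Counter

def bag_of_words(text):
    return Counter(text.lower().split())

a = bag_of_words("Mary is quicker than John")
b = bag_of_words("John is quicker than Mary")
print(a == b)  # True: word order is discarded
```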

8
Q

Document frequency

A

Defined to be the number of documents in the collection that contain the term t.

9
Q

Inverse document frequency

A

idf = log(N/df), where N is the total number of documents in the collection. Thus, the idf of a rare term is high.
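
A worked sketch, assuming an illustrative collection of N = 1,000,000 documents and base-10 logarithms:

```python
# idf falls as document frequency rises; a term in every document gets 0.
from math import log10

N = 1_000_000  # illustrative collection size
for df in (100, 10_000, 1_000_000):
    print(df, log10(N / df))  # -> 4.0, 2.0, 0.0
```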

10
Q

Tf-idf

A

tf-idf = tf × idf (a code sketch follows the properties below)

Properties

  1. Highest when the term occurs many times within a small number of documents
  2. Lower when the term occurs fewer times in a document, or occurs in many documents
  3. Lowest when the term occurs in virtually all documents.
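
A small tf-idf sketch over a toy two-document collection; the documents, the raw term counts, and the base-10 logarithms are illustrative assumptions:

```python
# tf-idf = tf * idf over a made-up collection.
from collections import Counter
from math import log10

docs = ["the car is driven on the road",
        "the truck is driven on the highway"]
tokenized = [d.split() for d in docs]
N = len(docs)
# Document frequency: number of documents containing each term.
df = Counter(t for toks in tokenized for t in set(toks))

def tf_idf(term, doc_index):
    tf = tokenized[doc_index].count(term)
    return tf * log10(N / df[term])

print(tf_idf("car", 0))  # rare term: positive weight (~0.30)
print(tf_idf("the", 0))  # occurs in every document: idf = 0, so weight 0
```
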
11
Q

Document vector

A

We can view each document as a vector with one component corresponding to each term in the dictionary, where the weight of each component is given by its tf-idf value.

12
Q

Vector space model

A

The representation of a set of docs as vectors in a common vector space is known as the vector space model.

13
Q

Cosine similarity

A

The standard way of quantifying the similarity between two documents (or between a query and a document) is to compute the cosine similarity, where the numerator is the dot product of the two vectors and the denominator is the product of their Euclidean lengths. This measure is the cosine of the angle θ between the two vectors.
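
A sketch of cosine similarity for sparse term-weight vectors stored as dicts; the example vectors and their weights are illustrative:

```python
# cosine(u, v) = (u . v) / (|u| |v|)
from math import sqrt

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())  # numerator
    norm_u = sqrt(sum(w * w for w in u.values()))       # Euclidean lengths
    norm_v = sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v)

d1 = {"car": 0.8, "insurance": 0.6}
d2 = {"car": 0.4, "insurance": 0.9}
print(cosine(d1, d2))  # ~0.87: the vectors point in similar directions
```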

14
Q

Term-document matrix

A

Viewing a collection of N documents as a collection of vectors leads to a natural view of the collection as a term-document matrix: an M × N matrix whose M rows represent the terms of the dictionary and whose N columns represent the documents.
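
A toy term-document matrix built from a made-up three-document collection; raw term counts are used for the entries here, though tf-idf weights are the more common choice:

```python
# Rows are the M dictionary terms, columns are the N documents.
docs = ["new home sales", "home sales rise", "rise in new sales"]
tokenized = [d.split() for d in docs]
terms = sorted({t for toks in tokenized for t in toks})

matrix = [[toks.count(term) for toks in tokenized] for term in terms]
for term, row in zip(terms, matrix):
    print(f"{term:>6}", row)
```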

15
Q

Term-at-a-time

A

The process of adding in contributions one query term at a time is known as term-at-a-time scoring or accumulation, and the N elements of the array Scores are therefore known as accumulators.
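
A term-at-a-time sketch; the postings (document ID -> precomputed term weight) and the collection size are illustrative:

```python
# Walk one query term's postings list at a time, adding each partial
# contribution into the Scores array (the accumulators).
postings = {
    "car":       {0: 0.30, 2: 0.15},
    "insurance": {0: 0.25, 1: 0.40},
}

def term_at_a_time(query_terms, n_docs):
    scores = [0.0] * n_docs          # the N accumulators
    for term in query_terms:         # one term's postings at a time
        for doc_id, weight in postings.get(term, {}).items():
            scores[doc_id] += weight
    return scores

print(term_at_a_time(["car", "insurance"], 3))  # doc 0 scores highest
```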

16
Q

Document-at-a-time

A

In a postings traversal, if we compute the scores of one document at a time, it is called document-at-a-time scoring. Essentially, if you use the merging/intersection algorithm from Chapter 1, you are using document-at-a-time scoring.
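
A document-at-a-time sketch: because postings lists are sorted by document ID, merging them brings all of a document's entries together, so each document's score is finished before the next one starts. The postings are illustrative:

```python
# Merge postings sorted by doc ID and finish one document at a time.
from heapq import merge
from itertools import groupby
from operator import itemgetter

postings = {
    "car":       [(0, 0.30), (2, 0.15)],  # (doc_id, weight), sorted
    "insurance": [(0, 0.25), (1, 0.40)],
}

def document_at_a_time(query_terms):
    merged = merge(*(postings.get(t, []) for t in query_terms))
    # groupby yields all postings for one doc ID together.
    for doc_id, group in groupby(merged, key=itemgetter(0)):
        yield doc_id, sum(weight for _, weight in group)

print(dict(document_at_a_time(["car", "insurance"])))
# e.g. {0: 0.55, 1: 0.4, 2: 0.15}
```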

17
Q

Sublinear tf-scaling

A

Use the logarithm of the term frequency, because a term that occurs twenty times in a document is (often) not twenty times as important as a term that occurs once. Therefore:

tf-idf = wf × idf, where
wf = 1 + log tf, if tf > 0
wf = 0, otherwise
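
A sketch of the scaled weight, assuming base-10 logarithms:

```python
# wf = 1 + log(tf) for tf > 0, else 0: twenty occurrences weigh about
# 2.3x a single occurrence rather than 20x.
from math import log10

def wf(tf):
    return 1 + log10(tf) if tf > 0 else 0

for tf in (0, 1, 20):
    print(tf, wf(tf))  # -> 0, 1.0, ~2.3
```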

18
Q

Maximum tf normalization

A

Normalize the tf weights of all terms occurring in a document by the maximum tf in that document. A common smoothed form is ntf = a + (1 − a) × tf/tf_max, where a is a smoothing term between 0 and 1 (often around 0.4).
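
A sketch using the smoothed form mentioned above; the term counts and the smoothing value a = 0.4 are illustrative:

```python
# ntf = a + (1 - a) * tf / tf_max: every term's weight is scaled by the
# largest term frequency in the document, with smoothing term a.
def ntf(tf, tf_max, a=0.4):
    return a + (1 - a) * tf / tf_max

counts = {"car": 5, "insurance": 2, "best": 1}  # illustrative tf counts
tf_max = max(counts.values())
for term, tf in counts.items():
    print(term, round(ntf(tf, tf_max), 2))  # car -> 1.0 (the max-tf term)
```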