06 Scoring, term weighting, and vector space model Flashcards
Metadata
Digital docs generally encode, in machine-recognizable form, certain metadata associated with each document, such as author(s), title, and date of publication.
Field
Examples of fields: date of creation, format, author, title.
Parametric indexes
There is one parametric index for each field. It allows us to select only the docs matching a field value (for example, a date) specified in the query.
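A minimal sketch of the idea, assuming each doc is a dict of field values and that exact-match lookup is enough (a real system might use a structure that also supports range queries). The doc data and the helper name `build_parametric_index` are hypothetical:

```python
from collections import defaultdict

# Hypothetical toy collection; field names and values are made up.
docs = [
    {"id": 1, "author": "shakespeare", "year": 1601},
    {"id": 2, "author": "shakespeare", "year": 1606},
    {"id": 3, "author": "marlowe", "year": 1601},
]

def build_parametric_index(docs, field):
    """Map each value of `field` to the set of doc ids that carry it."""
    index = defaultdict(set)
    for doc in docs:
        index[doc[field]].add(doc["id"])
    return dict(index)

# One parametric index per field.
indexes = {field: build_parametric_index(docs, field) for field in ("author", "year")}

# Select only the docs whose year field matches the query value.
matches = indexes["year"].get(1601, set())
```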
Zones
Similar to fields, except the content of a zone can be arbitrary free text.
Weighted zone scoring
Assigns to the pair (query, doc) a score in the interval [0, 1] by computing a linear combination of zone scores.
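A small sketch of the linear combination, assuming Boolean zone scores (1 if every query term occurs in the zone, else 0) and zone weights that sum to 1 so the result stays in [0, 1]. The weights, the example doc, and the function name are illustrative assumptions:

```python
def weighted_zone_score(query_terms, doc_zones, weights):
    """Sum g_i * s_i over zones, where s_i is a Boolean zone score."""
    score = 0.0
    for zone, g in weights.items():
        zone_tokens = set(doc_zones.get(zone, "").lower().split())
        # s_i = 1 only if all query terms appear in this zone (assumed matching rule)
        s = 1 if all(t.lower() in zone_tokens for t in query_terms) else 0
        score += g * s
    return score

# Assumed weights; they must sum to 1 for the score to lie in [0, 1].
weights = {"author": 0.2, "title": 0.3, "body": 0.5}
doc = {
    "author": "william shakespeare",
    "title": "hamlet",
    "body": "a tragedy by shakespeare",
}
score = weighted_zone_score(["shakespeare"], doc, weights)
```

Here the query term matches the author and body zones but not the title, so the score is 0.2 + 0.5 = 0.7.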
Term frequency
Number of occurrences of a term in a doc.
Bag of words
A doc representation where the exact ordering of the terms in a document is ignored. We only retain information on the number of occurrences of each term. Therefore, "Mary is quicker than John" gives the same document representation as "John is quicker than Mary".
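The Mary/John example can be checked with a short sketch. It assumes whitespace tokenization and lowercasing, which are simplifications chosen for illustration:

```python
from collections import Counter

def bag_of_words(text):
    """Count term occurrences, discarding word order."""
    return Counter(text.lower().split())

a = bag_of_words("Mary is quicker than John")
b = bag_of_words("John is quicker than Mary")

# Both sentences contain the same terms the same number of times.
same_representation = (a == b)
```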
Document frequency
Defined to be the number of docs in the collection that contain a term t.
Inverse document frequency
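idf = log(N/df), where N is the total number of docs in the collection and df is the document frequency of the term. Thus, the idf of a rare term is high, whereas the idf of a frequent term is low.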
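A quick numeric sketch of the formula, assuming N is the number of docs in the collection and a base-10 logarithm (the card does not fix a base; any base only rescales the values). The collection size and df values are made up:

```python
import math

def idf(n_docs, df):
    """Inverse document frequency: log10(N / df)."""
    return math.log10(n_docs / df)

n_docs = 1_000_000                    # assumed collection size
rare = idf(n_docs, 10)                # term appearing in only 10 docs
common = idf(n_docs, 100_000)         # term appearing in 100,000 docs
everywhere = idf(n_docs, n_docs)      # term appearing in every doc
```

With these inputs the rare term gets an idf near 5, the common term near 1, and a term found in every doc gets 0.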
Tf-idf
tf-idf = tf × idf
Properties
- Highest when a term occurs many times within a small number of docs
- Lower when the term occurs fewer times in a document, or occurs in many documents
- Lowest when the term occurs in virtually all documents.
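The three properties can be seen on a toy collection. This is a sketch assuming raw term counts for tf, a base-10 log for idf, and whitespace tokenization; the docs themselves are made up:

```python
import math
from collections import Counter

# Hypothetical three-doc collection.
docs = {
    "d1": "car car car insurance the",
    "d2": "auto insurance the",
    "d3": "best auto the",
}

def tf_idf(term, doc_id):
    """tf-idf = tf * log10(N / df), with tf as a raw count."""
    tf = Counter(docs[doc_id].split())[term]
    df = sum(1 for text in docs.values() if term in text.split())
    if df == 0:
        return 0.0
    return tf * math.log10(len(docs) / df)

high = tf_idf("car", "d1")          # many occurrences, found in one doc
lower = tf_idf("insurance", "d1")   # one occurrence, found in two docs
lowest = tf_idf("the", "d1")        # found in every doc, so idf is 0
```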
Document vector
Each document can be viewed as a vector with one component corresponding to each term in the dictionary, with the weight of each component given by tf-idf.
Vector space model
The representation of a set of docs as vectors in a common vector space is known as the vector space model.
Cosine similarity
The standard way of quantifying the similarity between two documents, or between a query and a document, is to compute the cosine similarity, where the numerator is the dot product of the two vectors and the denominator is the product of their Euclidean lengths. This measure is the cosine of the angle θ between the two vectors.
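A plain-Python sketch of the computation, assuming both vectors are equal-length lists of weights. Returning 0.0 for an all-zero vector is a convention assumed here, not something the card specifies:

```python
import math

def cosine_similarity(u, v):
    """Dot product divided by the product of Euclidean lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    len_u = math.sqrt(sum(a * a for a in u))
    len_v = math.sqrt(sum(b * b for b in v))
    if len_u == 0 or len_v == 0:
        return 0.0  # assumed convention for all-zero vectors
    return dot / (len_u * len_v)

same_direction = cosine_similarity([1, 2, 0], [2, 4, 0])   # angle of 0 degrees
orthogonal = cosine_similarity([1, 0, 0], [0, 1, 0])       # angle of 90 degrees
```

Because only the angle matters, a vector and a scaled copy of it score close to 1, which is why the measure compensates for document length.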
Term-document matrix
Viewing a collection of N docs as a collection of vectors leads to a natural view of the collection as a term-document matrix.
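A sketch of building such a matrix, assuming rows correspond to terms, columns to docs, and entries are raw term counts (tf-idf weights could be used instead). The three docs are made up:

```python
from collections import Counter

# Hypothetical collection of N = 3 docs.
docs = ["new home sales", "home sales rise", "rise in new home prices"]

# Dictionary of M distinct terms, sorted for a stable row order.
vocab = sorted({term for doc in docs for term in doc.split()})
counts = [Counter(doc.split()) for doc in docs]

# M x N matrix: one row per term, one column per doc.
matrix = [[c[term] for c in counts] for term in vocab]

home_row = matrix[vocab.index("home")]    # counts of "home" across the docs
doc0_vector = [row[0] for row in matrix]  # column 0 is the vector for docs[0]
```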
Term-at-a-time
The process of adding contributions one query term at a time is known as term-at-a-time scoring or accumulation, and the N elements of the array "Scores" are therefore known as accumulators.
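A minimal sketch of accumulation, assuming an inverted index that maps each term to a postings list of (doc_id, weight) pairs with doc ids 0..N-1. The postings and weights are made up, and length normalization and top-K selection are left out:

```python
def term_at_a_time(query_terms, postings, n_docs):
    """Add each query term's contributions into per-doc accumulators."""
    scores = [0.0] * n_docs  # the N accumulators
    for term in query_terms:
        # Walk this term's whole postings list before moving to the next term.
        for doc_id, weight in postings.get(term, []):
            scores[doc_id] += weight
    return scores

# Hypothetical postings lists: term -> [(doc_id, weight), ...]
postings = {
    "best": [(0, 1.5), (2, 0.5)],
    "car": [(0, 1.0), (1, 2.0)],
}
scores = term_at_a_time(["best", "car"], postings, n_docs=3)
```

After both terms are processed, doc 0 has accumulated 2.5, doc 1 has 2.0, and doc 2 has 0.5.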