Vector Space Model Flashcards
Vector Space
A space spanned by a linearly independent set of basis vectors.
Orthogonal basis vectors
Two vectors v and w are orthogonal if v.w = 0. If a set of nonzero vectors is pairwise orthogonal then it is linearly independent.
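A minimal sketch of the orthogonality check, assuming vectors are plain Python tuples of numbers. The helper names `dot` and `pairwise_orthogonal` are illustrative, not from the source.

```python
def dot(v, w):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(v, w))


def pairwise_orthogonal(vectors):
    """True if every distinct pair of vectors has a zero dot product."""
    return all(
        dot(vectors[i], vectors[j]) == 0
        for i in range(len(vectors))
        for j in range(i + 1, len(vectors))
    )


# The standard basis of a 3-dimensional space is pairwise orthogonal.
basis = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]
print(pairwise_orthogonal(basis))             # True
print(pairwise_orthogonal([(1, 1), (1, 0)]))  # False
```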
Terms as Basis Vectors
Terms are chosen as the basis vectors and treated as orthogonal, but they are clearly not orthogonal in practice because of polysemy and synonymy.
Zipf’s Law
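The frequency of a word is inversely proportional to its rank in the frequency ordering:
freq(word_i) = freq(word_1) / (i^theta), where i is the rank of word_i and theta is a corpus-dependent exponent (roughly 1 for natural-language text)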
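The formula above can be sketched as a small function. The name `zipf_frequency`, the default `theta=1.0`, and the top frequency of 1000 are assumptions chosen for illustration only.

```python
def zipf_frequency(rank, top_frequency, theta=1.0):
    """Predicted frequency of the word at a given rank (rank 1 = most frequent)."""
    return top_frequency / (rank ** theta)


# With theta = 1, the second-ranked word is predicted to occur half as
# often as the top word, and the tenth-ranked word a tenth as often.
for rank in (1, 2, 10):
    print(rank, zipf_frequency(rank, 1000))
```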
Term Importance and Zipf’s Law
Zone 1 - High-frequency words are mostly function words, so they say little about content.
Zone 2 - Mid-frequency words are the best indicators of document content.
Zone 3 - Low-frequency words are generally typos or overly specific words.
Term Frequency
Monotonically increasing function of the number of times a term appears in a document (enhances recall).
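As a rough illustration, the sketch below shows two monotonically increasing choices: the raw count and a log-scaled count. The log scaling `1 + log(count)` is a common variant assumed here, not something the card specifies, and the function names are hypothetical.

```python
import math
from collections import Counter


def term_frequencies(tokens):
    """Raw count of each term in one tokenized document."""
    return Counter(tokens)


def log_tf(count):
    """Log-scaled term frequency; still increasing in count, but dampened."""
    return 1 + math.log(count) if count > 0 else 0.0


tf = term_frequencies("the cat sat on the mat".split())
print(tf["the"], tf["cat"])  # 2 1
print(log_tf(tf["cat"]))     # 1.0
```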
Inverse document frequency
Monotonically decreasing function of the number of documents in which a term appears. Preferred over inverse collection frequency, because a term can have a high collection frequency simply by being concentrated in a small set of documents.
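The difference between collection frequency and document frequency can be seen on a toy corpus. This is a sketch under stated assumptions: the four documents are invented, and `log(N / df)` is one common form of inverse document frequency rather than the only one.

```python
import math

# Invented toy corpus: "zebra" is concentrated in a single document.
docs = [
    ["zebra", "zebra", "zebra", "zebra", "stripes"],
    ["horse", "runs"],
    ["horse", "eats"],
    ["horse", "sleeps"],
]


def collection_frequency(term, docs):
    """Total number of occurrences of the term across all documents."""
    return sum(doc.count(term) for doc in docs)


def document_frequency(term, docs):
    """Number of documents that contain the term at least once."""
    return sum(1 for doc in docs if term in doc)


def idf(term, docs):
    """Inverse document frequency as log(N / df); 0.0 for unseen terms."""
    df = document_frequency(term, docs)
    return math.log(len(docs) / df) if df else 0.0


for term in ("zebra", "horse"):
    print(term, collection_frequency(term, docs),
          document_frequency(term, docs), idf(term, docs))
```

Here "zebra" has the higher collection frequency (4 against 3) yet appears in only one document, so it receives the higher IDF.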
Term similarity metrics
Terms can be compared using a string similarity metric such as edit distance.
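A minimal sketch of edit distance, assuming the Levenshtein variant in which insertions, deletions, and substitutions each cost 1. The function name `edit_distance` is illustrative.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (unit costs)."""
    # prev[j] is the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute or match
            ))
        prev = curr
    return prev[-1]


print(edit_distance("kitten", "sitting"))  # 3
print(edit_distance("colour", "color"))    # 1
```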
Determining multi-word terms
Observe word combinations in a large corpus of text, then extract multi-word terms based on their frequency of occurrence.
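One simple way to sketch this is to count adjacent word pairs and keep those above a frequency threshold. The threshold `min_count=2`, the restriction to two-word terms, and the sample text are assumptions for illustration; a realistic pipeline would likely also filter out function words.

```python
from collections import Counter


def frequent_bigrams(tokens, min_count=2):
    """Adjacent word pairs that occur at least min_count times."""
    counts = Counter(zip(tokens, tokens[1:]))
    return {bigram: n for bigram, n in counts.items() if n >= min_count}


text = "vector space model uses a vector space and the vector space has terms"
print(frequent_bigrams(text.split()))  # {('vector', 'space'): 3}
```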