06 Vector Space Model Flashcards
what is an N-Gram
a sequence of n consecutive items (words or characters) taken from a text
eg. "John is quicker than Mary"
n = 1 » "John", "is", "quicker", "than", "Mary" » word order is lost, so it does not tell us who is quicker
n = 2 » "John is", "is quicker", "quicker than", "than Mary"
used to estimate the probability of a sequence of words
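A minimal sketch of word n-gram extraction for the example sentence. The helper name word_ngrams and the whitespace tokenisation are assumptions made for illustration only.

```python
def word_ngrams(text, n):
    """Return the word n-grams of `text` as a list of strings."""
    tokens = text.split()  # naive whitespace tokenisation (assumption)
    # Slide a window of n tokens over the sentence, one token at a time.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "John is quicker than Mary"
unigrams = word_ngrams(sentence, 1)
bigrams = word_ngrams(sentence, 2)
print(unigrams)
print(bigrams)
```

With n = 1 the result is the same for "Mary is quicker than John" up to ordering, which is why unigrams alone cannot say who is quicker.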
what is TF-IDF
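a term weighting scheme: term frequency (TF, how often a term occurs in a document) multiplied by inverse document frequency (IDF, how rare the term is across the collection)
applied to a whole collection it gives a matrix of term weights (one weight per term per document)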
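A minimal sketch of one common TF-IDF variant, assuming raw-count TF and idf = log(N / df). Other variants (log-scaled TF, smoothed IDF) exist; the function name tf_idf and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document.

    Assumes raw-count TF and idf = log(N / df); this is one variant of many.
    """
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    weights = []
    for tokens in tokenised:
        tf = Counter(tokens)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

docs = ["the cat sat", "the dog sat", "the cat ran"]  # toy collection
weights = tf_idf(docs)
for row in weights:
    print(row)
```

A term that occurs in every document ("the") gets weight 0, while a term found in only one document ("ran") gets the highest weight.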
what is an IR model
model for
- document representation
- query representation
- estimating the relevance of a document given the query
what is the vector space model
representation of documents and queries as vectors
relevance is estimated from the similarity between the query vector and each document vector (eg. cosine similarity)
documents with similar vectors are assumed to be about similar topics
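A minimal sketch of ranking documents against a query by cosine similarity. It assumes raw term counts as weights and whitespace tokenisation; the names cosine and rank and the toy documents are hypothetical.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_u * norm_v)

def rank(query, docs):
    """Return (score, document) pairs sorted by decreasing similarity."""
    q = Counter(query.lower().split())
    scored = [(cosine(q, Counter(d.lower().split())), d) for d in docs]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

docs = [
    "cheap flights to london",
    "london hotel deals",
    "gardening tips for spring",
]
ranking = rank("cheap london flights", docs)
for score, doc in ranking:
    print(round(score, 3), doc)
```

The document sharing the most query terms ranks first, and a document with no terms in common scores 0.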
what does it mean if the cosine similarity is negative
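the two vectors point in broadly opposite directions
with non-negative weights (eg. term counts or TF-IDF) cosine similarity is always between 0 and 1
a negative value is only possible when vectors have negative components, eg. some weighting schemes penalise non-matching terms with a negative weight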
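A small sketch of how the sign arises. The second pair of vectors uses a hypothetical scheme that gives non-matching terms a weight of -1; it is for illustration only and does not correspond to any specific algorithm.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Non-negative weights: every product in the dot product is >= 0.
non_negative = cosine([1.0, 2.0, 0.0], [0.0, 1.0, 3.0])

# Hypothetical penalised weights: mismatched terms contribute negative products.
mixed_sign = cosine([1.0, -1.0, 0.5], [-1.0, 1.0, 0.5])

print(non_negative, mixed_sign)
```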
what are the advantages of the vector space model
- simple geometric interpretation
- easy to compute and measure
- easy to adapt to various weighting schemes
- provides ranked output
what are the disadvantages of the vector space model
- high dimensionality
- term independence assumption
- unclear which similarity metric to use
- no guidance on when to stop ranking