02 The term vocabulary and postings lists Flashcards
Indexing granularity
For a given collection, what should count as one document unit? For a collection of books, it would usually be a bad idea to index an entire book as a document.
Token
An instance of a sequence of characters in some particular document that are grouped together as a useful semantic unit for processing.
Type
The class of all tokens containing the same character sequence.
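The difference between tokens and types can be checked on a short string. This is only a sketch: the `tokenize` helper and its "split on anything that is not a letter or digit" rule are assumptions for illustration, not a rule from these notes.

```python
import re

def tokenize(text):
    """Return the runs of letters/digits in text, in order (toy rule, assumed)."""
    return re.findall(r"[A-Za-z0-9]+", text)

text = "to sleep perchance to dream"
tokens = tokenize(text)   # every occurrence counts as a token
types = set(tokens)       # identical character sequences collapse into one type

print(len(tokens), len(types))
```

Here "to" contributes two tokens but only one type.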
Stop words
Extremely common words that would appear to be of little value in helping select the documents matching a user need.
Collection frequency
Total number of times each term appears in the document collection
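Collection frequency can be computed by counting every occurrence of a term across all documents. A minimal sketch follows; the toy `docs` collection, whitespace tokenization, and the idea of taking the top two terms as stop-word candidates are assumptions for illustration.

```python
from collections import Counter

# Toy collection (made up for this example).
docs = {
    1: "the cat sat on the mat",
    2: "the dog sat on the log",
    3: "a cat and a dog",
}

def collection_frequency(docs):
    """Count total occurrences of each term over the whole collection."""
    cf = Counter()
    for text in docs.values():
        cf.update(text.split())
    return cf

cf = collection_frequency(docs)

# Highest collection frequency first; ties are broken alphabetically.
stop_candidates = sorted(cf, key=lambda t: (-cf[t], t))[:2]
print(cf["the"], cf["cat"], stop_candidates)
```

Sorting terms by collection frequency is one common way to pick candidates for a stop list; the candidates are usually filtered by hand afterwards.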
Case folding
Reducing all letters to lower case
Truecasing
Using a machine learning model to decide when to case-fold a token, instead of lowercasing everything.
Stemming
Refers to a crude heuristic process that chops off the ends of words.
Ex: Saw => s
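To show how crude suffix chopping behaves, here is a toy stemmer. It is not the Porter stemmer or any real algorithm: the suffix list, the `min_stem` length guard, and the function name are all hypothetical choices made for this sketch.

```python
# Hypothetical suffix rules, tried in this order.
SUFFIXES = ("ing", "ed", "es", "s")

def crude_stem(word, min_stem=3):
    """Chop the first matching suffix if at least min_stem characters remain."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

for w in ["jumping", "jumped", "cats", "caring", "bus"]:
    print(w, "=>", crude_stem(w))
```

Note that "caring" is reduced to "car", which is the kind of error a heuristic without a vocabulary cannot avoid.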
Lemmatization
Refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.
Ex: Saw => see/saw (depending on whether the word is used as a verb or as a noun)
Lemmatizer
A tool that performs morphological analysis to find the lemma of each word.
Skip lists
A technique to merge postings lists in sublinear time. Augment the postings lists with skip pointers that function as shortcuts, allowing us to avoid processing parts of a postings list that will not figure in the search result.
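The merge with skip pointers can be sketched as below. Assumptions in this sketch: postings are plain sorted Python lists of unique docIDs, and skip pointers are not stored but implied at every position that is a multiple of roughly sqrt(list length), a commonly used spacing heuristic.

```python
import math

def intersect_with_skips(p1, p2):
    """Intersect two sorted docID lists, following skip pointers when safe."""
    skip1 = max(1, math.isqrt(len(p1)))  # assumed skip spacing for p1
    skip2 = max(1, math.isqrt(len(p2)))  # assumed skip spacing for p2
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # Follow the skip only if its target does not overshoot p2[j].
            if i % skip1 == 0 and i + skip1 < len(p1) and p1[i + skip1] <= p2[j]:
                i += skip1
            else:
                i += 1
        else:
            if j % skip2 == 0 and j + skip2 < len(p2) and p2[j + skip2] <= p1[i]:
                j += skip2
            else:
                j += 1
    return answer

a = [2, 4, 8, 16, 19, 23, 28, 43]
b = [1, 2, 3, 5, 8, 41, 51, 60, 71]
print(intersect_with_skips(a, b))
```

A skip is taken only when the skipped-to docID is still less than or equal to the current docID on the other list, so no match can be jumped over.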
Biword index
Technique for handling phrase queries. Consider every pair of consecutive terms in a document as a phrase. Ex: Friends, Romans, Countrymen becomes the biwords "friends romans" and "romans countrymen". Each biword is treated as a vocabulary term, and every biword is indexed. Longer phrase queries are broken down into biwords with AND as the connector. Can cause false positives. Consider the example: Query: Stanford University Palo Alto -> "stanford university" AND "university palo" AND "palo alto". If this returns a match, there is still no guarantee that the document contains the phrase "stanford university palo alto".
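The false-positive problem can be reproduced with a small biword index. This is a sketch under assumptions: the two toy documents are made up, tokenization is lowercase plus whitespace splitting, and the helper names are hypothetical.

```python
def biwords(text):
    """Return every pair of consecutive terms as a single 'a b' string."""
    terms = text.lower().split()
    return [f"{a} {b}" for a, b in zip(terms, terms[1:])]

def build_biword_index(docs):
    """Map each biword to the set of docIDs that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for bw in biwords(text):
            index.setdefault(bw, set()).add(doc_id)
    return index

def phrase_query(index, phrase):
    """AND together the postings of all biwords in the phrase."""
    postings = [index.get(bw, set()) for bw in biwords(phrase)]
    return set.intersection(*postings) if postings else set()

# Toy documents (made up). Doc 2 holds all three biwords but not the phrase.
docs = {
    1: "stanford university palo alto",
    2: "palo alto is near stanford university and the university palo verde campus",
}
index = build_biword_index(docs)
result = phrase_query(index, "stanford university palo alto")
print(result)
```

Document 2 is returned even though the four words never appear there consecutively, so results from a biword index need a verification step if exact phrase matches are required.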
Phrase index
Extension of the biword index to phrases of more than two words.
Positional index
Technique for handling phrase queries. A posting will be of the form: docID: <pos1, pos2, ...> To process a phrase query, you still need to access the inverted index entries for each term. You start with the least frequent term and then work to further restrict the list of possible candidates. In the merge algorithm, you cannot just check whether each term is in the same doc; you also need to check that their positions of appearance in the document are compatible with the phrase query being evaluated.
The same general method is applied for within k word proximity searches (proximity operators).
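Both phrase queries and within-k-words queries over a positional index can be sketched as follows. Assumptions: the index is a nested dict `term -> {docID: [positions]}`, tokenization is lowercase plus whitespace splitting, the toy documents are made up, and the function names are hypothetical. The position checks are written for clarity, not speed.

```python
def build_positional_index(docs):
    """Build term -> {docID: [positions]} from a dict of docID -> text."""
    index = {}
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index.setdefault(term, {}).setdefault(doc_id, []).append(pos)
    return index

def phrase_search(index, phrase):
    """Return docIDs where the phrase terms occur at consecutive positions."""
    terms = phrase.lower().split()
    if not terms or any(t not in index for t in terms):
        return set()
    # Start with the least frequent term, then restrict the candidates.
    by_rarity = sorted(terms, key=lambda t: len(index[t]))
    candidates = set(index[by_rarity[0]])
    for t in by_rarity[1:]:
        candidates &= set(index[t])
    hits = set()
    for doc_id in candidates:
        for start in index[terms[0]][doc_id]:
            # Term number k of the phrase must sit at position start + k.
            if all(start + k in index[t][doc_id] for k, t in enumerate(terms)):
                hits.add(doc_id)
                break
    return hits

def within_k(index, t1, t2, k):
    """Return docIDs where t1 and t2 occur at most k positions apart."""
    common = set(index.get(t1, {})) & set(index.get(t2, {}))
    return {
        d for d in common
        if any(abs(a - b) <= k for a in index[t1][d] for b in index[t2][d])
    }

docs = {
    1: "stanford university palo alto",
    2: "palo alto is near stanford university and the university palo verde campus",
}
index = build_positional_index(docs)
print(phrase_search(index, "stanford university palo alto"))
print(within_k(index, "stanford", "alto", 3))
```

Unlike a biword index, the position check rejects document 2 for the four-word phrase, because its terms do not occur at consecutive positions there.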