02 The term vocabulary and postings lists Flashcards

Question 1

Q

Indexing granularity

Answer

A

The choice of what unit counts as a document in a given collection. For a collection of books, it would usually be a bad idea to index an entire book as a single document.

Question 2

Q

Token

Answer

A

An instance of a sequence of characters in some particular document that are grouped together as a useful semantic unit for processing.

Question 3

Q

Type

Answer

A

The class of all tokens containing the same character sequence.
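
The token/type distinction can be illustrated with a minimal Python sketch. It assumes a naive whitespace tokenizer, and the helper name `tokens_and_types` is hypothetical, chosen only for this example:

```python
def tokens_and_types(text):
    # Assumption: tokens are simply whitespace-separated strings;
    # a real tokenizer handles punctuation, hyphens, and so on.
    tokens = text.split()
    # One type per distinct character sequence among the tokens.
    types = set(tokens)
    return tokens, types


tokens, types = tokens_and_types("to sleep perchance to dream")
print(len(tokens), "tokens,", len(types), "types")
```

Here "to" occurs twice, so it contributes two tokens but only one type.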

Question 4

Q

Stop words

Answer

A

Extremely common words that would appear to be of little value in helping select the documents matching a user need.

Question 5

Q

Collection frequency

Answer

A

The total number of times a term appears in the document collection.
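
A rough Python sketch of computing collection frequency, and of using it to pick stop word candidates by taking the most frequent terms. The function name, the toy documents, and the lowercase-and-split tokenization are all assumptions made for illustration:

```python
from collections import Counter


def collection_frequency(docs):
    # Sum each term's occurrences over every document in the collection.
    cf = Counter()
    for doc in docs:
        # Assumption: lowercase and split on whitespace as a stand-in tokenizer.
        cf.update(doc.lower().split())
    return cf


docs = ["the cat sat on the mat", "the dog chased the cat"]
cf = collection_frequency(docs)

# Stop word candidates: the most frequent terms (here only the top one).
stop_candidates = [term for term, _ in cf.most_common(1)]
print(cf["the"], cf["cat"], stop_candidates)
```

In practice such a candidate list is usually filtered by hand before being used as a stop list.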

Question 6

Q

Case folding

Answer

A

Reducing all letters to lower case
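
As a small illustration, case folding can be sketched in Python with `str.lower`. The function name `case_fold` is hypothetical:

```python
def case_fold(tokens):
    # Reduce every token to lower case so that "CAT", "Cat" and "cat"
    # all map to the same type.
    return [token.lower() for token in tokens]


folded = case_fold(["Automobile", "CAT", "cat"])
print(folded, len(set(folded)))
```

After folding, the three tokens collapse into two types.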

Question 7

Q

Truecasing

Answer

A

Using a machine-learned model to decide when to case-fold a token and when to keep its capitalization.

Question 8

Q

Stemming

Answer

A

Refers to a crude heuristic process that chops off the ends of words.

Ex: Saw => s
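
To make "chops off the ends of words" concrete, here is a deliberately crude suffix stripper in Python. It is a toy, not the Porter stemmer or any other published algorithm; the suffix list and the minimum stem length of three characters are assumptions made for this sketch:

```python
# Hypothetical suffix list, checked in this order (longer suffixes first).
SUFFIXES = ("ing", "ed", "es", "s")


def crude_stem(word):
    # Chop off the first matching suffix, as long as at least
    # three characters of the word remain.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for word in ["jumping", "jumped", "cats", "horses"]:
    print(word, "=>", crude_stem(word))
```

The result for "horses" is "hors", which is not a dictionary word; a stemmer only needs to map related forms to the same string.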

Question 9

Q

Lemmatization

Answer

A

Refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.

Ex: Saw => see or saw (depending on whether the word is used as a verb or as a noun)

Question 10

Q

Lemmatizer

Answer

A

A tool that finds the lemma of each word.

Question 11

Q

Skip lists

Answer

A

A technique for merging postings lists in sublinear time. The postings lists are augmented with skip pointers: shortcuts that let us avoid processing parts of a postings list that will not figure in the search results.
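
The idea can be sketched in Python as follows. This is an illustrative version under stated assumptions: postings lists are sorted Python lists of docIDs, and skip pointers are simulated by placing one at every position that is a multiple of roughly sqrt(length), a common heuristic. The function names are hypothetical:

```python
import math


def advance(postings, pos, skip, target):
    # Skip pointers exist only at positions that are multiples of `skip`.
    # Follow them while the skip target does not overshoot `target`.
    if pos % skip == 0 and pos + skip < len(postings) and postings[pos + skip] <= target:
        while pos + skip < len(postings) and postings[pos + skip] <= target:
            pos += skip
        return pos
    # No usable skip pointer: move one posting forward.
    return pos + 1


def intersect_with_skips(p1, p2):
    # Assumption: one skip pointer every ~sqrt(len) postings.
    skip1 = max(1, int(math.sqrt(len(p1))))
    skip2 = max(1, int(math.sqrt(len(p2))))
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i = advance(p1, i, skip1, p2[j])
        else:
            j = advance(p2, j, skip2, p1[i])
    return answer


p1 = [2, 4, 8, 16, 19, 23, 28, 43]
p2 = [1, 2, 3, 5, 8, 41, 51, 60, 71]
print(intersect_with_skips(p1, p2))
```

A skip is taken only when the docID at the skip target is still less than or equal to the docID currently being compared against, so no matching posting can be jumped over.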

Question 12

Q

Biword index

Answer

A

Technique for handling phrase queries. Consider every pair of consecutive terms in a document as a phrase. Ex: Friends, Romans, Countrymen becomes the biwords "friends romans" and "romans countrymen". Each biword is treated as a vocabulary term, and we index every biword. Longer phrase queries are broken down into biwords joined with AND. This can cause false positives. Consider the example: Query: Stanford University Palo Alto -> "stanford university" AND "university palo" AND "palo alto". If this returns a match, there is still no guarantee that the document contains the phrase stanford university palo alto.
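
A small Python sketch of a biword index and of the false positive problem. The function names and the two toy documents are made up for illustration, and tokenization is assumed to be lowercase-and-split:

```python
from collections import defaultdict


def biwords(text):
    # Every pair of consecutive terms becomes one biword.
    terms = text.lower().split()
    return [f"{a} {b}" for a, b in zip(terms, terms[1:])]


def build_biword_index(docs):
    # biword -> set of docIDs containing it
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for biword in biwords(text):
            index[biword].add(doc_id)
    return index


def phrase_query(index, phrase):
    # AND together the postings of every biword in the phrase.
    # Assumption: the phrase has at least two terms.
    postings = [index.get(biword, set()) for biword in biwords(phrase)]
    return set.intersection(*postings) if postings else set()


docs = {
    1: "stanford university palo alto",
    2: "palo alto is near stanford university and university palo is a typo",
}
index = build_biword_index(docs)
print(phrase_query(index, "stanford university palo alto"))
```

Document 2 contains all three biwords but never the full phrase, yet the query returns both documents: a false positive.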

Question 13

Q

Phrase index

Answer

A

Extension of the biword index to phrases of more than two words.

Question 14

Q

Positional index

Answer

A

Technique for handling phrase queries. Each posting has the form docID: <pos1, pos2, ...>. To process a phrase query, you still need to access the inverted index entries for each term. You start with the least frequent term and then work to further restrict the list of possible candidates. In the merge algorithm, you cannot just check that each term is in the same document; you also need to check that their positions of appearance in the document are compatible with the phrase query being evaluated.

The same general method applies to within-k-word proximity searches (proximity operators).
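
A minimal Python sketch of a positional index with phrase and proximity matching. It is a simplified illustration rather than the linear-time merge algorithm itself: the index is a plain dictionary, positions are looked up directly instead of merged, and all function names and toy documents are assumptions:

```python
from collections import defaultdict


def build_positional_index(docs):
    # term -> {docID: [positions of the term in that document]}
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def phrase_match(index, phrase):
    terms = phrase.lower().split()
    if not terms or any(term not in index for term in terms):
        return set()
    # Start with the least frequent term, then restrict the candidates.
    by_frequency = sorted(terms, key=lambda term: len(index[term]))
    candidates = set(index[by_frequency[0]])
    for term in by_frequency[1:]:
        candidates &= set(index[term])
    # Positions must be compatible: term k sits k positions after the first term.
    hits = set()
    for doc_id in candidates:
        for start in index[terms[0]][doc_id]:
            if all(start + k in index[term][doc_id] for k, term in enumerate(terms)):
                hits.add(doc_id)
                break
    return hits


def within_k(index, term1, term2, k):
    # Documents where term1 and term2 occur within k positions of each other.
    hits = set()
    for doc_id in set(index.get(term1, {})) & set(index.get(term2, {})):
        if any(abs(p1 - p2) <= k
               for p1 in index[term1][doc_id]
               for p2 in index[term2][doc_id]):
            hits.add(doc_id)
    return hits


docs = {
    1: "stanford university palo alto",
    2: "palo alto is near stanford university and university palo is a typo",
}
index = build_positional_index(docs)
print(phrase_match(index, "stanford university palo alto"))
print(within_k(index, "university", "alto", 2))
```

Because positions are checked, document 2 no longer matches the four-word phrase, even though it contains every individual term.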