02 The term vocabulary and postings lists Flashcards

1
Q

Indexing granularity

A

The choice of document unit for indexing: for a given collection, how large should each indexed document be? For a collection of books, it would usually be a bad idea to index an entire book as a single document.

2
Q

Token

A

An instance of a sequence of characters in some particular document that are grouped together as a useful semantic unit for processing.

3
Q

Type

A

The class of all tokens containing the same character sequence.
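The token/type distinction can be illustrated with a minimal sketch. It assumes naive whitespace tokenization, which is a simplification and not part of the card; the example sentence is made up.

```python
# Tokens are individual occurrences; types are the distinct character sequences.
text = "to sleep perchance to dream"

# Assumption: split on whitespace only (real tokenizers are more involved).
tokens = text.split()
types = set(tokens)

num_tokens = len(tokens)  # every occurrence counts, so "to" counts twice
num_types = len(types)    # "to" counts once
```

Here the two occurrences of "to" are two tokens of a single type.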

4
Q

Stop words

A

Extremely common words that appear to be of little value in helping select the documents matching a user need.

5
Q

Collection frequency

A

The total number of times a term appears in the document collection, counting every occurrence.
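A minimal sketch of computing collection frequency, and of using it to pick stop-word candidates by taking the most frequent terms. The function name, the toy documents, and the lowercase-and-split tokenization are all assumptions made for illustration.

```python
from collections import Counter


def collection_frequency(docs):
    """Count total occurrences of each term across all documents."""
    counts = Counter()
    for text in docs:
        # Assumption: lowercase and split on whitespace as a stand-in tokenizer.
        counts.update(text.lower().split())
    return counts


docs = ["the cat sat on the mat", "the dog sat"]
cf = collection_frequency(docs)

# The most frequent terms are the usual candidates for a stop list.
stop_candidates = [term for term, _ in cf.most_common(2)]
```

In practice a stop list built this way is usually filtered by hand, since a frequent term can still matter for some queries.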

6
Q

Case folding

A

Reducing all letters to lower case

7
Q

Truecasing

A

Using a machine-learned model to decide when to case-fold, that is, which tokens to lowercase and which to leave capitalized.

8
Q

Stemming

A

A crude heuristic process that chops off the ends of words.

Ex: Saw => s

9
Q

Lemmatization

A

Doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.

Ex: Saw => see or saw (depending on whether the word is used as a verb or as a noun)

10
Q

Lemmatizer

A

A tool that finds the lemma of each word.

11
Q

Skip lists

A

A technique to merge postings lists in sublinear time. Augment the postings lists with skip pointers, which function as shortcuts that let us avoid processing parts of a postings list that will not figure in the search result.
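A minimal sketch of intersecting two postings lists with skip pointers. It assumes sorted lists of unique docIDs held in Python lists, and it places skip pointers at evenly spaced positions roughly sqrt(length) apart; that spacing is a common heuristic, used here as an assumption. The function name and sample lists are made up for illustration.

```python
import math


def intersect_with_skips(p1, p2):
    """Intersect two sorted docID lists, following skip pointers when safe."""
    # Assumption: a skip pointer sits at every multiple of the skip length.
    skip1 = max(1, int(math.sqrt(len(p1))))
    skip2 = max(1, int(math.sqrt(len(p2))))
    i = j = 0
    answer = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # Follow the skip only if its target does not overshoot p2[j].
            if i % skip1 == 0 and i + skip1 < len(p1) and p1[i + skip1] <= p2[j]:
                i += skip1
            else:
                i += 1
        else:
            if j % skip2 == 0 and j + skip2 < len(p2) and p2[j + skip2] <= p1[i]:
                j += skip2
            else:
                j += 1
    return answer


result = intersect_with_skips(
    [2, 4, 8, 16, 19, 23, 28, 43],
    [1, 2, 3, 5, 8, 41, 51, 60, 71],
)
```

A skip is taken only when its target is still less than or equal to the current docID in the other list, so no matching docID can be jumped over.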

12
Q

Biword index

A

Technique for handling phrase queries. Consider every pair of consecutive terms in a document as a phrase. Ex: Friends, Romans, Countrymen becomes the biwords "friends romans" and "romans countrymen". Each biword is treated as a vocabulary term, and we index every biword. Longer phrase queries are broken down into biwords joined with AND. This can cause false positives. Consider the example: Query: Stanford University Palo Alto -> "stanford university" AND "university palo" AND "palo alto". If this returns a match, there is still no guarantee that the document contains the phrase stanford university palo alto.
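A minimal sketch of a biword index and of the false-positive problem described above. The function names, the toy documents, and the lowercase-and-split tokenization are assumptions made for illustration.

```python
from collections import defaultdict


def build_biword_index(docs):
    """Map each pair of consecutive tokens to the set of docIDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        tokens = text.lower().split()  # assumption: whitespace tokenization
        for first, second in zip(tokens, tokens[1:]):
            index[f"{first} {second}"].add(doc_id)
    return index


def phrase_query(index, phrase):
    """Break the phrase into biwords and AND their postings together."""
    tokens = phrase.lower().split()
    biwords = [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]
    if not biwords:
        return set()
    result = set(index.get(biwords[0], set()))
    for biword in biwords[1:]:
        result &= index.get(biword, set())
    return result


docs = {
    1: "stanford university palo alto",
    # Contains all three biwords, but never the full four-word phrase.
    2: "palo alto is near stanford university and university palo is a typo",
}
index = build_biword_index(docs)
hits = phrase_query(index, "stanford university palo alto")
```

Document 2 is returned even though it never contains the full phrase, which is the false positive the card warns about.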

13
Q

Phrase index

A

Extension of the biword index to phrases of more than two words.

14
Q

Positional index

A

Technique for handling phrase queries. A posting has the form docID: <pos1, pos2, ...>. To process a phrase query, you still need to access the inverted index entries for each term. You start with the least frequent term and then work to further restrict the list of possible candidates. In the merge algorithm, you cannot just check whether each term is in the same document; you also need to check that their positions of appearance in the document are compatible with the phrase query being evaluated.

The same general method is applied for within-k-word proximity searches (proximity operators).
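A minimal sketch of a positional index with a phrase check and a within-k-words check. The function names, the toy documents, and the whitespace tokenization are assumptions; positions are held in plain Python lists, so this is an illustration and not an efficient merge.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map term -> {docID: [positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, token in enumerate(text.lower().split()):
            index[token].setdefault(doc_id, []).append(pos)
    return index


def phrase_match(index, phrase):
    """Return docIDs where the terms occur at consecutive positions."""
    terms = phrase.lower().split()
    if not terms or any(t not in index for t in terms):
        return set()
    # Start from the rarest term to shrink the candidate set early.
    candidates = None
    for t in sorted(set(terms), key=lambda term: len(index[term])):
        doc_ids = set(index[t])
        candidates = doc_ids if candidates is None else candidates & doc_ids
    matches = set()
    for doc_id in candidates:
        for start in index[terms[0]][doc_id]:
            if all(start + k in index[t][doc_id] for k, t in enumerate(terms)):
                matches.add(doc_id)
                break
    return matches


def within_k(index, term1, term2, k):
    """Return docIDs where term1 and term2 occur within k positions of each other."""
    shared = set(index.get(term1, {})) & set(index.get(term2, {}))
    return {
        d for d in shared
        if any(abs(a - b) <= k for a in index[term1][d] for b in index[term2][d])
    }


docs = {
    1: "stanford university palo alto",
    2: "palo alto is near stanford university and university palo is a typo",
}
index = build_positional_index(docs)
phrase_hits = phrase_match(index, "stanford university palo alto")
near_hits = within_k(index, "stanford", "alto", 3)
```

Document 2 contains every term of the phrase but not at consecutive positions, so the position check rejects it, unlike a biword lookup over the same documents.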
