Introduction - Text Representation and Boolean Model Flashcards

Question 1

Q

Primary task of an IR system

Answer

A

Retrieve documents with content that is relevant to a user’s information need.

Question 2

Q

What IR is not

Answer

A

An IR system is not a database management system. A DBMS stores and processes well-defined data. A search within it is exact and deterministic, while search in an IR system is probabilistic.

Question 3

Q

Bag of Words

Answer

A

A document is represented as a collection of words treated as independent units, with word order ignored.
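
As a minimal sketch of this idea, assuming simple whitespace splitting (the helper name `bag_of_words` is made up for illustration), two texts that differ only in word order get the same representation:

```python
from collections import Counter

def bag_of_words(text):
    """Represent a text as word counts, discarding word order."""
    return Counter(text.split())

a = bag_of_words("dog bites man")
b = bag_of_words("man bites dog")

# Both texts contain the same words, so their bags are identical
# even though the meanings differ.
same_bag = (a == b)
```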

Question 4

Q

Coordinate Matching

Answer

A

Document relevance is measured by the number of query terms appearing in a document. Terms provide the dimensions, with the length along each dimension being either 0 or 1. The similarity measure is the dot product of the query and document vectors. This does not consider the frequency of query terms in documents.
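
A small sketch of coordinate matching, assuming documents are already split into terms. With 0/1 vectors the dot product is just the number of distinct query terms the document contains, so a set intersection is enough. The toy documents and the name `coordinate_match` are illustrative only.

```python
def coordinate_match(query_terms, doc_terms):
    """Dot product of binary vectors: count of distinct shared terms."""
    return len(set(query_terms) & set(doc_terms))

docs = {
    "d1": "the cat sat on the mat".split(),
    "d2": "the dog chased the cat".split(),
    "d3": "dogs and cats".split(),
}
query = "cat mat".split()

# Score every document against the query.
scores = {doc_id: coordinate_match(query, terms) for doc_id, terms in docs.items()}
```

Note that "the" occurring twice in d1 makes no difference here, which is exactly the limitation the card mentions.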

Question 5

Q

Term Frequency

Answer

A

Weighting by the frequency of terms in the document.

Question 6

Q

Inverse Document Frequency

Answer

A

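Weight terms inversely to the number of documents they appear in, so that rare terms count for more than common ones. A common form is log(N / df), where N is the number of documents and df is the number of documents containing the term.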
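
The sketch below combines term frequency and inverse document frequency into a single weight. It assumes the common variant tf × log(N / df); other weighting variants exist, and the function names and toy collection are made up for illustration.

```python
import math
from collections import Counter

docs = {
    "d1": "the cat sat on the mat".split(),
    "d2": "the dog chased the cat".split(),
    "d3": "the bird sang".split(),
}

def document_frequency(docs):
    """Number of documents each term appears in."""
    df = Counter()
    for terms in docs.values():
        df.update(set(terms))  # count each term once per document
    return df

def tf_idf(terms, df, n_docs):
    """Weight = raw term frequency * log(N / df) (one common variant)."""
    tf = Counter(terms)
    return {t: count * math.log(n_docs / df[t]) for t, count in tf.items()}

df = document_frequency(docs)
weights = tf_idf(docs["d1"], df, len(docs))
```

Here "the" appears in every document, so its weight is zero despite occurring twice in d1, while "mat" (one document) outweighs "cat" (two documents).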

Question 7

Q

Document Length

Answer

A

The similarity measure should be normalised to prevent a document from getting a high score simply because it is long.

Question 8

Q

Vector Space Similarity

Answer

A

Cosine of the angle between the query and document vectors.
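
A minimal sketch of cosine similarity over sparse vectors, assuming each vector is a dict from term to weight (the representation and names are illustrative). Dividing by the vector lengths is what normalises away document length.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse term -> weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty vector is treated as matching nothing
    return dot / (norm_u * norm_v)

query = {"cat": 1.0, "mat": 1.0}
short_doc = {"cat": 1.0, "mat": 1.0}
long_doc = {"cat": 3.0, "mat": 3.0}  # same proportions, three times longer
other_doc = {"dog": 2.0}

sim_short = cosine_similarity(query, short_doc)
sim_long = cosine_similarity(query, long_doc)
sim_other = cosine_similarity(query, other_doc)
```

The short and long documents point in the same direction, so they score the same; the document with no shared terms scores zero.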

Question 9

Q

Document

Answer

A

An item which may satisfy the user’s information need.

Question 10

Q

Query

Answer

A

Representation of user’s information need.

Question 11

Q

Term

Answer

A

Any word or phrase that can serve as a link to a document.

Question 12

Q

Inverted File

Answer

A

Keep the following information for each term:

IDs of the documents in which the term occurs
Frequency of occurrence of the term in each document
Possibly: offsets of the term within each document
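
One possible layout for such an index, sketched under the assumption that offsets are token positions rather than character positions: each term maps to a dict of document ID to offset list, and the frequency is the length of that list. The function name `build_inverted_file` is hypothetical.

```python
from collections import defaultdict

def build_inverted_file(docs):
    """Map each term to {doc_id: [token offsets]}."""
    index = defaultdict(dict)
    for doc_id, terms in docs.items():
        for offset, term in enumerate(terms):
            index[term].setdefault(doc_id, []).append(offset)
    return index

docs = {
    1: "the cat sat on the mat".split(),
    2: "the dog chased the cat".split(),
}
index = build_inverted_file(docs)

# Frequency of "the" in document 1 is the number of recorded offsets.
the_freq_doc1 = len(index["the"][1])
```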

Question 13

Q

Tokenisation

Answer

A

Dividing a character stream into a sequence of distinct word forms (tokens). Separate on white-space, end of sentence punctuation, bracketing, hyphenation, apostrophes & slashes.
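
A rough sketch of such a tokeniser using a single regular expression. It splits on every separator listed above without exception, which is an oversimplification (real tokenisers usually treat hyphens and apostrophes more carefully); the function name is illustrative.

```python
import re

# White-space, sentence punctuation, brackets, apostrophes, slashes, hyphens.
SEPARATORS = re.compile(r"[\s.,;:!?()\[\]{}'/-]+")

def tokenise(text):
    """Split a character stream into tokens, dropping empty pieces."""
    return [tok for tok in SEPARATORS.split(text) if tok]

tokens = tokenise("The state-of-the-art (IR) system's input/output. Done!")
```

Splitting "system's" into "system" and "s" shows why apostrophes are a judgment call in practice.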

Question 14

Q

Stop Word

Answer

A

High-frequency word which is not useful for distinguishing between documents.

Question 15

Q

Equivalence Classes

Answer

A

It can be useful to group tokens into equivalence classes and treat each group as a single term. This reduces the size of the index, may improve retrieval, and the combined frequencies may reflect content better than the individual frequencies.

Question 16

Q

Stem

Answer

A

The core of a word (its main morpheme) to which inflectional and derivational morphology applies.

Question 17

Q

Stemming

Answer

A

Attempts to remove inflectional (and some derivational) morphology.

Question 18

Q

Lemmatisation

Answer

A

Attempts to remove only inflectional morphology.

Question 19

Q

Porter Stemmer

Answer

A

Removes suffixes without a stem dictionary. Conflates terms and doesn’t deal with root changes.

Question 20

Q

Porter Stemmer - Word Representation

Answer

A

[C](VC){m}[V]

C - one or more adjacent consonants
V - one or more adjacent vowels

[ ] - optionality
( ) - group operator
{x} - repetition x times
m - the “measure” of a word

m is calculated on the word excluding the suffix of the rule under consideration.
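
The measure can be computed by counting how often a run of vowels is followed by a run of consonants, which is the m in Porter's form [C](VC){m}[V]. The sketch below follows Porter's convention that "y" counts as a vowel when it follows a consonant; the function names are illustrative, not taken from any particular implementation.

```python
def is_consonant(word, i):
    """True if word[i] is a consonant under Porter's definition."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # "y" is a vowel when preceded by a consonant, otherwise a consonant.
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """Count vowel-run -> consonant-run transitions (the measure m)."""
    stem = stem.lower()
    m = 0
    prev_vowel = False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_vowel:
            m += 1  # a consonant directly after a vowel closes one VC pair
        prev_vowel = not cons
    return m

# For a rule such as "(m > 0) EED -> EE", m is taken on the part before "eed".
m_agreed = measure("agr")  # stem of "agreed"
m_feed = measure("f")      # stem of "feed"
```

With these values the rule would rewrite "agreed" to "agree" but leave "feed" unchanged, since its stem has m = 0.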

Question 21

Q

Porter Stemmer - Possible Conditions

Answer

A

Constraining the measure

Constraining the shape of the word piece e.g. stem ends with certain letter, stem contains a vowel, stem ends with a double consonant.

Boolean expressions combining conditions with and, or and not.

Question 22

Q

Boolean Model

Answer

A

Queries consist of terms connected by AND, OR & NOT.

Assumptions:

Terms are either present or absent in a document.
Terms are all equally informative when determining relevance.
A document is either relevant or not.
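
A minimal sketch of Boolean retrieval, assuming each term is mapped to the set of document IDs that contain it. AND, OR and NOT then become set intersection, union and complement. The toy collection and the helper name `term` are made up for illustration.

```python
docs = {
    1: "the cat sat on the mat",
    2: "the dog chased the cat",
    3: "the bird sang",
}

# term -> set of IDs of documents containing it (presence only, no counts)
postings = {}
for doc_id, text in docs.items():
    for t in set(text.split()):
        postings.setdefault(t, set()).add(doc_id)

all_docs = set(docs)

def term(t):
    """Documents containing t; an unseen term matches nothing."""
    return postings.get(t, set())

cat_and_not_dog = term("cat") & (all_docs - term("dog"))  # cat AND NOT dog
dog_or_bird = term("dog") | term("bird")                  # dog OR bird
```

The result is an unordered set: every document either matches or does not, with no notion of one match being better than another.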

Question 23

Q

Boolean Model - Pros and Cons

Answer

A

Pros:

Simple framework
Well-defined query semantics

Cons:

Difficult to formulate complex queries
Difficult to control output volume
No ranking facility

(23 cards)