Introduction - Text Representation and Boolean Model Flashcards

1
Q

Primary task of an IR system

A

Retrieve documents with content that is relevant to a user’s information need.

2
Q

What IR is not

A

IR is not a database management system (DBMS). A DBMS stores and processes well-defined data. A search within a DBMS is exact and deterministic, while search in an IR system is probabilistic.

3
Q

Bag of Words

A

A document is represented as a collection of words treated as independent units, with word order ignored.

4
Q

Coordinate Matching

A

Document relevance is measured by the number of query terms appearing in a document. Terms provide the dimensions, with the length along each dimension being either 0 or 1. The similarity measure is the dot product of the query and document vectors. This does not consider the frequency of query terms in documents.
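
As a minimal sketch of the idea above, the following builds binary (0/1) vectors over a shared vocabulary and takes their dot product. The function names `binary_vector` and `coordinate_match` are illustrative, not taken from the course material.

```python
def binary_vector(terms, vocabulary):
    """Return a 0/1 vector: 1 where the vocabulary term is present in `terms`."""
    present = set(terms)
    return [1 if term in present else 0 for term in vocabulary]


def coordinate_match(query_terms, document_terms):
    """Score a document by the number of distinct query terms it contains."""
    vocabulary = sorted(set(query_terms) | set(document_terms))
    q = binary_vector(query_terms, vocabulary)
    d = binary_vector(document_terms, vocabulary)
    # Dot product of two binary vectors counts the shared terms.
    return sum(qi * di for qi, di in zip(q, d))


# "model" occurs twice in the document but still contributes only 1.
score = coordinate_match(["boolean", "model"], ["the", "boolean", "model", "model"])
```

Because the vectors are binary, repeating a term in the document does not change the score, which is the limitation the card notes.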

5
Q

Term Frequency

A

Weighting by the frequency of terms in the document.

6
Q

Inverse Document Frequency

A

Weight terms proportionally to the reciprocal of the number of documents they appear in.
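
Term frequency and inverse document frequency are commonly multiplied together. The sketch below is one way to do that. Taking the logarithm of N/df is a common dampening choice and an assumption here; the card itself only states a reciprocal relationship. The name `tf_idf` is illustrative.

```python
import math
from collections import Counter


def tf_idf(documents):
    """documents: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(documents)
    # Document frequency: number of documents each term appears in.
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in documents:
        tf = Counter(tokens)  # raw term frequency within this document
        weights.append(
            {term: tf[term] * math.log(n_docs / doc_freq[term]) for term in tf}
        )
    return weights


docs = [["boolean", "model", "model"], ["vector", "model"]]
weights = tf_idf(docs)
```

With this weighting, a term that occurs in every document (here "model") gets weight 0, however often it appears.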

7
Q

Document Length

A

The similarity measure should be normalised to prevent a document from getting a high score simply due to its length.

8
Q

Vector Space Similarity

A

Cosine of the angle between the query and document vectors.
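
A minimal sketch of the cosine measure, assuming each vector is stored sparsely as a `{term: weight}` dictionary (a representation chosen here for illustration):

```python
import math


def cosine_similarity(query_vec, doc_vec):
    """Cosine of the angle between two sparse {term: weight} vectors."""
    dot = sum(query_vec.get(term, 0.0) * weight for term, weight in doc_vec.items())
    q_norm = math.sqrt(sum(w * w for w in query_vec.values()))
    d_norm = math.sqrt(sum(w * w for w in doc_vec.values()))
    if q_norm == 0 or d_norm == 0:
        return 0.0  # an empty vector has no direction; treat as no similarity
    return dot / (q_norm * d_norm)


short_doc = {"model": 1.0}
long_doc = {"model": 10.0}
query = {"model": 1.0}
```

Dividing by the vector lengths means `short_doc` and `long_doc` score the same against `query`, so a document is not favoured merely for being long.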

9
Q

Document

A

An item which may satisfy the user’s information need.

10
Q

Query

A

Representation of user’s information need.

11
Q

Term

A

Any word or phrase that can serve as a link to a document.

12
Q

Inverted File

A

Keep the following information for each term:

  • IDs of the documents in which the term occurs
  • Frequency of occurrence of the term in each document
  • Possibly: offsets of the term within each document
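
The three pieces of information listed above can be sketched as a nested dictionary. This is an in-memory illustration only; the name `build_inverted_index` and the layout `term -> {doc_id: [offsets]}` are assumptions, and the frequency is recovered as the length of the offset list.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """documents: {doc_id: [tokens]}. Returns {term: {doc_id: [token offsets]}}."""
    index = defaultdict(dict)
    for doc_id, tokens in documents.items():
        for offset, term in enumerate(tokens):
            index[term].setdefault(doc_id, []).append(offset)
    return dict(index)


index = build_inverted_index({1: ["boolean", "model"], 2: ["model", "of", "model"]})

doc_ids_for_model = sorted(index["model"])   # documents containing "model"
freq_in_doc_2 = len(index["model"][2])       # occurrences of "model" in document 2
```
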
13
Q

Tokenisation

A

Dividing a character stream into a sequence of distinct word forms (tokens). Separate on white-space, end of sentence punctuation, bracketing, hyphenation, apostrophes & slashes.
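
A deliberately crude sketch of such a tokeniser: it keeps runs of ASCII letters and digits and treats everything else (white-space, punctuation, brackets, hyphens, apostrophes, slashes) as a separator. Restricting tokens to ASCII alphanumerics is an assumption made for brevity.

```python
import re


def tokenise(text):
    """Split text into tokens, separating on any non-alphanumeric character."""
    return re.findall(r"[A-Za-z0-9]+", text)


tokens = tokenise("State-of-the-art IR: don't stop/start (now).")
```

Note that splitting on apostrophes turns "don't" into the two tokens "don" and "t", which shows why real tokenisers often treat such cases specially.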

14
Q

Stop Word

A

High-frequency word which is not useful for distinguishing between documents.
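
Stop words are typically filtered out after tokenisation. In the sketch below, the contents of `STOP_WORDS` are a tiny illustrative list, not a standard one.

```python
# Illustrative stop list only; real lists are much longer.
STOP_WORDS = frozenset({"the", "of", "a", "and", "to", "in"})


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens whose lower-cased form is in the stop list."""
    return [token for token in tokens if token.lower() not in stop_words]


kept = remove_stop_words(["The", "model", "of", "retrieval"])
```
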

15
Q

Equivalence Classes

A

It can be useful to put tokens into equivalence classes and treat a group of terms as the same term. This reduces the size of the index, may improve retrieval, and the combined frequencies may reflect content better than the individual frequencies.

16
Q

Stem

A

The core of a word (its main morpheme) to which inflectional and derivational morphology applies.

17
Q

Stemming

A

Attempts to remove inflectional (and some derivational) morphology.

18
Q

Lemmatisation

A

Attempts to remove only inflectional morphology.

19
Q

Porter Stemmer

A

Removes suffixes without a stem dictionary. Conflates terms and doesn’t deal with root changes.

20
Q

Porter Stemmer - Word Representation

A

[C](VC){m}[V]

C - one or more adjacent consonants
V - one or more adjacent vowels

[ ] - optionality
( ) - group operator
{x} - repetition x times
m - the “measure” of a word

m is calculated on the word excluding the suffix of the rule under consideration.
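
A minimal sketch of computing the measure m, i.e. the number of VC groups in the form [C](VC){m}[V]. It assumes lower-case input and follows Porter's convention that "y" counts as a vowel when it follows a consonant. The function names are illustrative.

```python
def is_consonant(word, i):
    """True if word[i] is a consonant under Porter's definition."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # "y" is a consonant at the start of a word or after a vowel,
        # and a vowel after a consonant (as in "by").
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Count the VC groups in the stem."""
    pattern = "".join(
        "c" if is_consonant(stem, i) else "v" for i in range(len(stem))
    )
    # Each vowel-to-consonant transition closes one VC group.
    return sum(
        1 for prev, cur in zip(pattern, pattern[1:]) if prev == "v" and cur == "c"
    )
```

For example, `measure("tree")` is 0, `measure("trouble")` is 1 and `measure("private")` is 2.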

21
Q

Porter Stemmer - Possible Conditions

A

Constraining the measure

Constraining the shape of the word piece e.g. stem ends with certain letter, stem contains a vowel, stem ends with a double consonant.

Boolean expressions

22
Q

Boolean Model

A

Queries consist of terms connected by AND, OR & NOT.

Assumptions:

  • Terms are either present or absent in a document.
  • Terms are all equally informative when determining relevance.
  • A document is either relevant or not.
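
Under these assumptions, evaluating a Boolean query reduces to set operations on the sets of documents containing each term. The sketch below uses a small hand-made `postings` table; the data and the helper name `docs_with` are hypothetical.

```python
# Hypothetical postings: term -> set of IDs of documents containing the term.
postings = {
    "boolean": {1, 2},
    "model": {1, 3},
    "vector": {3},
}
all_docs = {1, 2, 3}


def docs_with(term):
    """Documents containing the term (empty set for unknown terms)."""
    return postings.get(term, set())


and_result = docs_with("boolean") & docs_with("model")    # boolean AND model
or_result = docs_with("boolean") | docs_with("vector")    # boolean OR vector
not_result = all_docs - docs_with("model")                # NOT model
and_not_result = docs_with("model") - docs_with("vector") # model AND NOT vector
```

Each result is an unordered set: a document either matches or it does not, and nothing in the result says which match is better.
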
23
Q

Boolean Model - Pros and Cons

A

Pros:

  • Simple framework
  • Well-defined query semantics

Cons:

  • Difficult to formulate complex queries
  • Difficult to control output volume
  • No ranking facility