Information Retrieval Flashcards

(28 cards)

1
Q

indexing

A

task of finding terms that describe the documents well

2
Q

manual indexing

A

done by using a predefined set of index terms and fixed vocabularies
the indexing is done by humans
labour intensive
gives high-precision searches and works well for closed collections; however, searchers need to know the index terms to achieve that precision, and labellers need to be trained to achieve consistency

3
Q

text retrieval

A

find documents that are relevant to a user's query, given a large, static document collection and an information need

4
Q

automatic indexing

A

uses natural language as the indexing language; the indices are implemented as inverted files. It also involves term manipulation and term weighting

5
Q

inverted file index

A

it can be used to record in which document a term occurs, how many occurrences, and the position of those occurrences
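A minimal sketch of such an index, assuming whitespace tokenisation and lower-casing; the function name and the toy documents are hypothetical, chosen only for illustration:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to {doc_id: [positions]} for a dict of doc_id -> text."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        # Assumption: tokens are whitespace-separated and lower-cased.
        for position, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(position)
    return index

docs = {
    1: "pease porridge hot pease porridge cold",
    2: "pease porridge in the pot",
}
index = build_inverted_index(docs)

postings = index["pease"]           # which documents the term occurs in
count_in_doc1 = len(postings[1])    # how many occurrences in document 1
positions_in_doc1 = postings[1]     # where those occurrences are
```

Storing a position list per document covers all three pieces of information on the card: the keys give the documents, the list length gives the count, and the list itself gives the positions.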

6
Q

bag-of-words approach

A

records only which terms are present and how often they occur. Ignores the relationships between words, e.g. their ordering

7
Q

boolean model

A

a Boolean query builds complex search commands by combining basic search terms with Boolean operators (AND, OR, NOT).
gives a precise, simple, logical basis for deciding whether a document should be returned, based on whether the basic terms of the query do or do not appear in the document and on the meaning of the logical operators.
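One common way to evaluate Boolean queries is with set operations over the documents that contain each term. A small sketch, assuming a hypothetical toy index that maps terms to sets of document ids:

```python
# Hypothetical index: term -> set of ids of documents containing that term.
index = {
    "pease": {1, 2, 3},
    "porridge": {1, 2},
    "hot": {1},
    "cold": {3},
}

def docs_with(term):
    """Return the set of documents containing the term (empty if unseen)."""
    return index.get(term, set())

# pease AND porridge: documents containing both terms
and_result = docs_with("pease") & docs_with("porridge")

# hot OR cold: documents containing at least one of the terms
or_result = docs_with("hot") | docs_with("cold")

# pease AND NOT hot: documents containing the first term but not the second
and_not_result = docs_with("pease") - docs_with("hot")
```

A document is either in the result set or not; there is no ranking among the returned documents.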

8
Q

vector space model

A

uses the bag-of-words approach where documents = points in a high-dimensional vector space
each dimension = a term in the index
values = frequencies of the terms in the document
queries are represented as vectors in the same space

9
Q

method to perform vector space model

A
  1. select the document(s) with the highest document-query similarity (similarity is the model for relevance => ranking)
  2. the number of documents returned is less important => users start at the top of the ranking and stop when satisfied
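The two steps can be sketched as follows, assuming raw term-frequency vectors and cosine similarity as the similarity measure; the function names and toy documents are illustrative, not part of the card:

```python
import math
from collections import Counter

def tf_vector(text):
    """Bag-of-words vector: term -> frequency (whitespace tokens, lower-cased)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank(query, docs):
    """Return document ids ordered from most to least similar to the query."""
    q = tf_vector(query)
    scored = [(cosine(q, tf_vector(text)), doc_id) for doc_id, text in docs.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored]

docs = {
    "d1": "pease porridge hot",
    "d2": "pease porridge in the pot",
    "d3": "nine days old",
}
ranking = rank("pease porridge", docs)
```

Every document receives a score, so the whole collection is ranked; the user reads from the top and decides where to stop.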
10
Q

normalized correlation coefficient

A

the cosine of the angle between the two vectors

  • vectors pointing in the same direction: 1
  • orthogonal vectors: 0
  • vectors pointing in opposite directions: -1

This computes how well occurrences of each term i correlate in query and document, then scales for the magnitude of the overall vectors
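A short sketch of the computation for dense vectors of equal length, assuming neither vector is all zeros; the three example pairs correspond to the three cases listed on this card:

```python
import math

def cosine(q, d):
    """Dot product of q and d, scaled by the product of their magnitudes."""
    # Correlation of term i in query and document, summed over all terms.
    dot = sum(qi * di for qi, di in zip(q, d))
    # Scaling for the overall length of each vector (assumed non-zero).
    magnitude = math.sqrt(sum(qi * qi for qi in q)) * math.sqrt(sum(di * di for di in d))
    return dot / magnitude

same_direction = cosine([1, 2], [2, 4])     # close to 1
orthogonal = cosine([1, 0], [0, 1])         # 0
opposite = cosine([1, 2], [-1, -2])         # close to -1
```

Term-frequency vectors have no negative components, so in retrieval the value normally falls between 0 and 1.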

11
Q

term manipulation

A

pre-processing of terms so that variant forms are generalised to the same index term

12
Q

tokenization

A

process of splitting text into tokens (words), typically at whitespace and punctuation, with the punctuation removed
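A minimal sketch using a regular expression, assuming tokens are runs of ASCII letters and digits; real tokenisers handle hyphens, apostrophes, and non-English text more carefully:

```python
import re

def tokenize(text):
    """Return the word tokens in text, dropping punctuation and whitespace."""
    # Assumption: a token is a maximal run of ASCII letters or digits.
    return re.findall(r"[A-Za-z0-9]+", text)

tokens = tokenize("Pease porridge hot, pease porridge cold!")
```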

13
Q

capitalisation

A

normalise all words to lower/upper case

14
Q

lemmatisation

A

conflate the different inflected forms of a word to its base form (singular, present tense, 1st person)

15
Q

stemming

A

conflate morphological variants by chopping off their affixes (connected, connection -> connect)
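A deliberately naive suffix-stripping sketch to illustrate the idea. This is not a real stemming algorithm such as Porter's; the suffix list and the minimum stem length are assumptions made up for this example:

```python
# Hypothetical suffix list, checked in the order given (longer forms first).
SUFFIXES = ("ions", "ion", "ing", "ed", "s")

def naive_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

stems = [naive_stem(w) for w in ["connected", "connection", "connecting", "connections"]]
```

All four variants reduce to the same string, so they would share one entry in the index. A crude rule set like this also over-strips many words, which is why practical stemmers use more careful rules.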

16
Q

normalisation

A

heuristics to conflate variants due to spelling, hyphenation, spaces, etc.

17
Q

stop list

A

removes non-content words (the most frequent and least useful for retrieval)
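A small sketch of stop-word filtering, assuming a tiny hand-picked stop list; real stop lists are much longer and vary between systems:

```python
# Illustrative stop list only; not taken from any particular system.
STOP_WORDS = {"a", "and", "in", "of", "the"}

def remove_stop_words(tokens):
    """Keep only the tokens that are not on the stop list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

content_terms = remove_stop_words("pease porridge in the pot".split())
```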

18
Q

bigram indexing

A

store each bigram as a term in index

e.g. pease porridge in the pot => pease porridge, porridge in, in the, the pot
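The bigrams of a token sequence can be generated by pairing each token with its successor. A short sketch (the function name is illustrative):

```python
def bigrams(tokens):
    """Return each pair of adjacent tokens joined into a single index term."""
    return [f"{first} {second}" for first, second in zip(tokens, tokens[1:])]

terms = bigrams("pease porridge in the pot".split())
```

Each bigram then gets its own entry in the index, so a two-word phrase query becomes a single term lookup, at the cost of a much larger index.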

19
Q

position indexing

A

identifies multi-word phrases during retrieval by storing the positions of terms in documents
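A sketch of phrase matching over a positional index, assuming the index maps each term to {doc_id: [positions]}; the index contents and the function name are hypothetical:

```python
def phrase_docs(index, phrase):
    """Return ids of documents where the phrase's terms occur at consecutive positions."""
    terms = phrase.lower().split()
    if not terms or any(term not in index for term in terms):
        return set()
    matches = set()
    for doc_id, starts in index[terms[0]].items():
        for start in starts:
            # Term number `offset` of the phrase must sit at position start + offset.
            if all(start + offset in index[term].get(doc_id, [])
                   for offset, term in enumerate(terms)):
                matches.add(doc_id)
                break
    return matches

# Positions for: 1 = "pease porridge hot pease porridge cold",
#                2 = "pease porridge in the pot"
index = {
    "pease": {1: [0, 3], 2: [0]},
    "porridge": {1: [1, 4], 2: [1]},
    "hot": {1: [2]},
    "cold": {1: [5]},
    "in": {2: [2]},
    "the": {2: [3]},
    "pot": {2: [4]},
}
hot_phrase = phrase_docs(index, "porridge hot")
```

Unlike bigram indexing, nothing extra is stored per phrase; phrases of any length are checked at query time from the stored positions.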

20
Q

term weighting

A
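  1. document collection: the set of documents being indexed
  2. size of collection: the total number of documents in the collection
  3. term frequency: the number of times a term occurs in a given document
  4. collection frequency: the number of times a term occurs in the whole collection
  5. document frequency: the number of documents that contain a term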
21
Q

inverse document frequency

A

log(size of collection / number of documents containing the term)
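A direct sketch of the formula, assuming the natural logarithm (the card does not state a base) and a document frequency of at least 1:

```python
import math

def idf(collection_size, document_frequency):
    """Inverse document frequency: log(|D| / df). Assumes document_frequency >= 1."""
    return math.log(collection_size / document_frequency)

rare_term = idf(1000, 10)        # term found in few documents: high idf
common_term = idf(1000, 1000)    # term found in every document: idf of 0
```

The rarer a term is across the collection, the higher its idf, so rare terms count for more when documents are scored.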

22
Q

tf.idf

A

common weighting method which is the product of the term frequency and the idf
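tf.idf is conventionally computed as the product tf × idf. A minimal sketch, assuming raw counts for tf, the natural logarithm for idf, and a hypothetical toy collection of already tokenised documents:

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return {doc_id: {term: tf * idf}} for a dict of doc_id -> list of tokens."""
    n_docs = len(docs)
    # Document frequency: number of documents each term appears in.
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))
    weights = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)  # raw term frequency within this document
        weights[doc_id] = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
    return weights

docs = {
    1: ["pease", "porridge", "hot"],
    2: ["pease", "porridge", "cold"],
    3: ["pease", "pot"],
}
weights = tf_idf(docs)
```

In this toy collection "pease" occurs in every document, so its weight is 0 everywhere, while "hot" occurs in only one document and gets the highest weight in document 1.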

23
Q

PageRank Algorithm

A

exploits the link structure of the web
- a link from page A to page B confers authority on B, by an amount that depends on the PageRank score of A and on A's number of outgoing links (recursively defined)
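The recursive definition can be approximated by repeated iteration. A sketch under stated assumptions: the damping factor of 0.85, the fixed iteration count, and the handling of pages with no outgoing links are choices made for this example and are not given on the card; every linked-to page is assumed to appear as a key in `links`.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively estimate PageRank. links maps each page to the pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            if outlinks:
                # A page passes its score on in equal shares to the pages it links to.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # Assumption: a page with no outlinks spreads its score over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical four-page web.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
scores = pagerank(links)
```

Page C is linked to by three pages and ends up with the highest score; page D has no incoming links and keeps only the baseline share.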

24
Q

recall

A

the proportion of all relevant documents that are returned

relevant & retrieved documents / all relevant documents

25
Q

precision

A

the proportion of retrieved documents that are relevant

relevant & retrieved documents / all retrieved documents

26
Q

f measure (f1)

A

combines precision and recall into a single figure, giving equal weight to both

F1 = 2PR / (P + R), the harmonic mean of precision P and recall R

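Precision, recall, and F1 can be computed together from the set of retrieved documents and the set of relevant documents. A short sketch with made-up document ids; returning 0.0 when a denominator is zero is a convention assumed here:

```python
def precision_recall_f1(retrieved, relevant):
    """Return (precision, recall, F1) for two collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant & retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# 4 documents retrieved, 6 relevant in total, 2 in common.
p, r, f1 = precision_recall_f1(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5, 6, 7, 8])
```

Here precision is 2/4 and recall is 2/6; as a harmonic mean, F1 lies between the two and closer to the lower value.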
27
Q

precision at cut off

A

precision computed over only the top k documents of a ranking (the cut-off); measures how well a method ranks relevant documents ahead of non-relevant documents

28
Q

average precision

A

precision is computed at each point in the ranking where a relevant document is found, and these figures are averaged
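Both ranking measures can be sketched for a ranked list of document ids. The ranking and relevance judgements below are made up; the average is taken over the relevant documents that appear in the ranking, as this card describes, whereas some definitions divide by the total number of relevant documents instead:

```python
def precision_at_k(ranking, relevant, k):
    """Precision over only the top k documents of the ranking."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k

def average_precision(ranking, relevant):
    """Mean of the precision values at each rank where a relevant document is found."""
    hits = 0
    precisions = []
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

ranking = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d3", "d2", "d5"}

p_at_3 = precision_at_k(ranking, relevant, 3)   # 1 relevant document in the top 3
ap = average_precision(ranking, relevant)       # mean of 1/1 and 2/4
```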