Topic 6: Text Retrieval Flashcards
Conceptual model for IR
The blocks involved are:
documents, document representation, information needs, query, retrieved documents
indexing
documents -> document representation
formulation
information needs -> query
Definition of IR
Information retrieval is a field concerned with the
structure, analysis, organization, storage, searching, and
retrieval of information
Information retrieval (IR) is finding material (usually
documents) of an unstructured nature (usually text) that
satisfies an information need from within large collections
(usually stored on computers)
Document / Retrieval Unit
example documents
web pages, emails, books, scholarly papers, text messages, Word documents, PDFs, forum posts, patents, etc.
a retrieval unit can be
- a part of a document (paragraph, page, slide)
- in a different structural form (HTML, XML, plain text, etc.)
- of different sizes or lengths
Document Representation: 2 types
Full Text Representation
- complete
- requires huge resources
Reduced (partial) Content Representation
- removes unimportant content, e.g. stopwords
- standardizes to reduce overlapping content, e.g. stemming
- retains only important content, e.g. noun phrases, headers, etc.
Bag of Words Model
Represent a text as the bag (multiset) of its words, disregarding grammar
and even word order
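A minimal sketch of the model, using Python's collections.Counter as the bag (the text and names are illustrative):

```python
from collections import Counter

def bag_of_words(text):
    # Lowercase and split on whitespace; a real system would tokenize properly
    return Counter(text.lower().split())

doc = "to be or not to be"
print(bag_of_words(doc))  # Counter({'to': 2, 'be': 2, 'or': 1, 'not': 1})
```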
Information Needs
The things you want a search engine (e.g. Google) to answer are
information needs:
what you want to know or search for.
Normally, you are required to formulate your information needs into
some keywords, known as a query.
Information Needs: Query
Simple Query
◦ a few keywords or more
Boolean Query
◦ e.g. ‘neural network AND speech recognition’
Special Query
◦ e.g. ‘400 myr in usd’
Simple Term Matching Approach
- Compare the terms in a document and the query.
- Compute a “similarity” between each document in the collection and
the query based on the terms they have in common.
- Sort the documents in order of decreasing similarity with the query.
- The output is a ranked list displayed to the user; the top documents
are the most relevant as judged by the system.
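A toy sketch of this approach, scoring each document by the number of distinct terms it shares with the query (the documents and the overlap score are illustrative choices):

```python
def term_overlap(doc, query):
    # Similarity = number of distinct terms the document and query share
    return len(set(doc.lower().split()) & set(query.lower().split()))

docs = ["neural network models",
        "speech recognition with neural networks",
        "cooking recipes"]
query = "neural network speech recognition"

# Rank documents by decreasing similarity with the query
for d in sorted(docs, key=lambda d: term_overlap(d, query), reverse=True):
    print(term_overlap(d, query), d)
```

Note that "networks" does not match "network" here, which is one motivation for stemming (below).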
Indexing
Convert documents into a
representation or data structure to
improve the efficiency of retrieval.
Why?
◦ A wide variety of words is used in texts, but not all are important.
◦ Among the important words, some are more contextually relevant than others.
Indexing: Some basic processes involved
◦ Tokenization
◦ Stop Words Removal
◦ Stemming
◦ Phrases
◦ Inverted File
Indexing (Tokenization)
Convert a sequence of characters
into a sequence of tokens with
some basic meaning
A token can be a single term or multiple terms, e.g. ‘york’ versus the two-word token ‘new york’.
common issues in tokenization
- Capitalized words can have different meanings from lowercase words
- Apostrophes can be a part of a word, a part of a possessive, or just a mistake
- Numbers can be important, including decimals
- Periods can occur in numbers, abbreviations, URLs, ends of sentences, and
other situations
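A simple tokenizer sketch that lowercases and keeps decimals and internal apostrophes intact (the regex is one illustrative choice, not a standard):

```python
import re

# Keep numbers with decimals (3.50) and words with internal apostrophes (o'brien)
TOKEN_RE = re.compile(r"\d+(?:\.\d+)?|[a-z]+(?:'[a-z]+)?")

def tokenize(text):
    return TOKEN_RE.findall(text.lower())

print(tokenize("O'Brien paid $3.50 for the U.S. edition."))
# ["o'brien", 'paid', '3.50', 'for', 'the', 'u', 's', 'edition']
```

Note how the abbreviation "U.S." still splits into single letters, illustrating the period problem above.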
Indexing (Stopping)
A stopword list can be created from high-frequency words or based on
a standard list.
Lists are customized for applications, domains, and even parts of
documents
◦ e.g., “click” is a good stopword for anchor text
The best policy is to index all words in documents and make decisions about
which words to use at query time.
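A sketch of index-time stopping against a small hand-picked list (real systems use standard or application-specific lists; this one is illustrative):

```python
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is"}

def remove_stopwords(tokens):
    # Drop any token that appears in the stopword list
    return [t for t in tokens if t not in STOPWORDS]

print(remove_stopwords("the structure and analysis of information".split()))
# ['structure', 'analysis', 'information']
```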
Indexing (Stemming)
Many morphological variations of words
◦ inflectional (plurals, tenses)
◦ derivational (making verbs into nouns, etc.)
In most cases, the variants have the same meaning.
Stemmers attempt to reduce morphological variations of words to a
common stem; this usually involves removing suffixes.
Stemming can be done at indexing time or as part of query processing.
Porter Stemmer
consists of a series of rules, each designed to strip the longest possible suffix.
It produces stems, not words.
step 1a:
- replace sses by ss
- delete s if the preceding word part contains a vowel not immediately before the s
- replace ied or ies by i if preceded by more than one letter, otherwise by ie
- if the suffix is us or ss, do nothing
step 1b:
- replace eed or eedly by ee if it is in the part of the word after the first non-vowel following a vowel
- delete ed, edly, ing, and ingly if the preceding word part contains a vowel; after the deletion, if the word ends in at, bl, or iz, add e; if the word ends with a double letter that is not ll, ss, or zz, remove the last letter
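These rules appear to be from the updated (Porter2) version of the stemmer. Rather than reimplement them, a quick way to see them in action is NLTK's Snowball English stemmer, which implements Porter2 (assuming the nltk package is installed):

```python
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")  # Porter2 ("Snowball") English stemmer
for word in ["stresses", "ties", "agreed", "fishing", "falling"]:
    print(word, "->", stemmer.stem(word))  # outputs are stems, not words
```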
Indexing (Phrases)
Recall from tokenization: meaningful tokens make better index terms, e.g.
phrases.
Three possible approaches:
◦ Identify syntactic phrases using a part-of-speech (POS) tagger
◦ Use word n-grams
◦ Store word positions in indexes and use proximity operators in
queries
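A sketch of the word n-gram approach, generating bigrams from a token sequence:

```python
def word_ngrams(tokens, n=2):
    # Slide a window of size n over the token sequence
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(word_ngrams("information retrieval in practice".split()))
# ['information retrieval', 'retrieval in', 'in practice']
```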
POS taggers
use statistical models of text to predict the
syntactic tags of words
Indexing (Inverted Index)
◦ Contains lists of documents, or lists of word occurrences in documents,
and other information.
◦ Each entry is called a posting.
◦ The part of the posting that refers to a specific document or
location is called a pointer
◦ Each document in the collection is given a unique number
◦ Lists are usually document-ordered (sorted by document number)
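A minimal inverted index sketch with document-ordered posting lists (postings here store only document numbers; real indexes also store counts or positions):

```python
from collections import defaultdict

def build_inverted_index(docs):
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):      # each document gets a unique number
        for term in sorted(set(text.lower().split())):
            index[term].append(doc_id)        # posting: a pointer to the document
    return index

docs = ["new york city", "york minster", "new car"]
index = build_inverted_index(docs)
print(index["new"])   # [0, 2] -- document-ordered posting list
print(index["york"])  # [0, 1]
```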
Retrieval Function
Documents are retrieved in sorted order according to a score computed using
the document representation, the query, and a ranking algorithm.
Retrieval Function (Boolean Retrieval)
Advantages
◦ Results are predictable, relatively easy to explain
◦ Many different features can be incorporated
◦ Efficient processing since many documents can be eliminated from search
Disadvantages
◦ Effectiveness depends entirely on user
◦ Simple queries usually don’t work well
◦ Complex queries are difficult
◦ Query development is often a sequence of queries driven by the number of retrieved documents
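A sketch of answering a Boolean AND query by intersecting posting lists from an inverted index (the toy index is illustrative):

```python
def boolean_and(index, terms):
    # Intersect the posting lists of all query terms
    postings = [set(index.get(t, [])) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []

index = {"neural": [0, 1, 3], "network": [1, 2, 3], "speech": [3, 4]}
print(boolean_and(index, ["neural", "network"]))            # [1, 3]
print(boolean_and(index, ["neural", "network", "speech"]))  # [3]
```

This is also why Boolean processing is efficient: any document absent from one of the posting lists is eliminated without further work.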
Retrieval Function (Vector Space Model)
A ranking-based method.
Documents and the query are represented by vectors of term weights.
The collection is represented by a matrix of term weights.
Term weights are typically tf.idf values (sketched below).
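One common variant of the tf.idf weight (a sketch; several weighting schemes exist, and the exact form is not specified here):

```latex
w_{t,d} = \mathrm{tf}_{t,d} \cdot \log\frac{N}{\mathrm{df}_t}
```

where tf_{t,d} is the frequency of term t in document d, df_t is the number of documents containing t, and N is the total number of documents in the collection.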
Retrieval Function (Vector Space Model)
Documents ranked by distance between points representing
query and documents
◦ similarity measure, e.g. cosine correlation
Worked example: cosine similarity computation (see the sketch below).
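A worked sketch of the cosine computation, using raw term-frequency vectors to keep the arithmetic visible (a real system would use tf.idf weights):

```python
import math

def cosine(u, v):
    # Cosine similarity = dot product divided by the product of vector lengths
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Term order: [neural, network, speech]
doc   = [2, 1, 0]   # term frequencies in the document
query = [1, 1, 1]   # term frequencies in the query
print(round(cosine(doc, query), 3))  # 3 / (sqrt(5) * sqrt(3)) = 0.775
```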
Evaluation
Standard Collection
◦ Task specific
◦ Human experts are used to judge relevant results
Performance Metric
◦ Precision
◦ Recall
Evaluation (Collection)
Test collections consist of documents, queries, and relevance
judgments, e.g. the TREC collections.
Obtaining relevance judgments is an expensive, time-consuming process:
◦ who does it?
◦ what are the instructions?
◦ what is the level of agreement?
Evaluation (Collection): pooling technique
Exhaustive judgments for all documents in a collection are not practical.
Pooling technique is used in TREC
◦ top k results (for TREC, k varied between 50 and 200) from the rankings
obtained by different search engines (or retrieval algorithms) are merged into
a pool
◦ duplicates are removed
◦ documents are presented in some random order to the relevance judges
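A sketch of the pooling steps described above: merge the top-k from each ranking, deduplicate, and shuffle before judging (k and the rankings are illustrative):

```python
import random

def build_pool(rankings, k=50):
    pool = set()
    for ranking in rankings:
        pool.update(ranking[:k])  # top-k results from each system
    pool = list(pool)             # duplicates already removed by the set
    random.shuffle(pool)          # random order for the relevance judges
    return pool

rankings = [["d3", "d1", "d7"], ["d1", "d2", "d3"]]
print(sorted(build_pool(rankings, k=2)))  # ['d1', 'd2', 'd3']
```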
Evaluation (Effectiveness Measures)
Confusion matrix: true/false positives and negatives over retrieved vs. relevant documents.
Precision = TP / (TP + FP): the fraction of retrieved documents that are relevant.
Recall = TP / (TP + FN): the fraction of relevant documents that are retrieved.
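A small sketch computing both measures from sets of retrieved and relevant documents (the document IDs are illustrative):

```python
def precision_recall(retrieved, relevant):
    hits = len(retrieved & relevant)   # true positives
    precision = hits / len(retrieved)  # fraction of retrieved docs that are relevant
    recall = hits / len(relevant)      # fraction of relevant docs that were retrieved
    return precision, recall

retrieved = {1, 2, 3, 4}
relevant  = {2, 4, 5}
print(precision_recall(retrieved, relevant))  # (0.5, 0.667 approx.)
```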
Database Records
Database records (or tuples in relational databases) are typically made up of well-defined fields (or attributes)
Easy to compare fields with well-defined semantics to queries in
order to find matches
Text is more difficult
Search Query vs DB Query
database query: matches are easily found by comparison with the field values of records
search engine query: the query text must be compared to the text of entire news stories
Dimensions of IR
IR works with different media, different types
of search applications, and different tasks.
New applications increasingly involve new media
◦ e.g., video, photos, music, speech
Like text, content is difficult to describe and compare
◦ text may be used to represent them (e.g. tags)
See the tables of IR dimensions: content, applications, and tasks.
Comparison of IR and search engines: a search engine is the practical application of IR techniques to large-scale text collections.
Search Engine Issues
performance
- measure and improve the efficiency of search (response time, query throughput, indexing speed)
- indexes are data structures designed to improve search efficiency (their implementation is a major issue for search engines)
dynamic data
- collections are constantly changing (updates, additions, deletions)
- acquiring/crawling documents is a major task
- measures such as coverage and freshness are used
scalability
- millions of users, terabytes of documents.
- distributed processing is essential
adaptability
- changing and tuning search engine components (ranking algorithms, indexing strategies, interfaces) for different applications
Challenges in IR
- cross-lingual IR
- big data
- personalization
- domain-specific IR
- multimodal IR