Topic 6: Text Retrieval Flashcards

1
Q

Conceptual model for IR

A

blocks involved are
documents, document representation, information needs, query, retrieved documents

indexing
documents -> document representation

formulation
information needs -> query

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

Definition of IR

A

Information retrieval is a field concerned with the
structure, analysis, organization, storage, searching, and
retrieval of information

Information retrieval (IR) is finding material (usually
documents) of an unstructured nature (usually text) that
satisfies an information need from within large collections
(usually stored on computers)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

Document / Retrieval Unit

A

example document
web page, email, books, scholar papers, text msgs, words, PDF, forums, patents, etc

retrieval unit can be

  • part of document (paragraph, page, slide)
  • in a form of different structure (html, xml, text, etc)
  • in different sizes or length
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

Document Representation: 2 types

A

Full Text Representation

  • complete
  • need huge resources

Reduced (partial) Content Representation
- remove unimportant contents. eg stopwords
- standardization to reduce overlapped contents. eg stemming
retain only important content. eg noun phrases, header, etc

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

Bag of Words Model

A

Store the words as the bag (multiset) of its words, disregarding grammar
and even word order

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

Information Needs

A

Those things that you want Google to give you answer are
information needs.

what you want to know/search

Normally, you are required to formulate your information needs into
some keywords, known as query

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

Information Needs: Query

A
Simple Query
◦ Few keywords or more.
Boolean Query
◦ ‘neural network AND speech recognition’
Special Query
◦ 400 myr in usd
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

Simple Term Matching Approach

A
  1. Compare the terms in a document and query.
  2. Compute “similarity” between each document in the collection and
    the query based on the terms they have in common.
  3. Sorting the document in order of decreasing similarity with the query.
  4. The outputs are a ranked list and displayed to the user - the top ones
    are more relevant as judged by the system
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

Indexing

A

Convert documents into
representation or data structure to
improve the efficiency of retrieval

why? 
◦ Many variety of words used in texts,
but not all are important.
◦ Among the important words, some
are more contextually relevant.
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

Indexing: Some basic processes involved

A
◦ Tokenization
◦ Stop Words Removal
◦ Stemming
◦ Phrases
◦ Inverted File
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

Indexing (Tokenization)

A

Convert a sequence of characters
into a sequence of tokens with
some basic meaning

Token can be single or multiple terms.. give example

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

common issues in tokenization

A
  1. Capitalized words can have different meaning from lower case words
  2. Apostrophes can be a part of a word, a part of a possessive, or just a
    mistake
  3. Numbers can be important, including decimals
  4. Periods can occur in numbers, abbreviations, URLs, ends of sentences, and
    other situations
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

Indexing (Stopping)

A

Stopword list can be created from high-frequency words or based on
a standard list

Lists are customized for applications, domains, and even parts of
documents
◦ e.g., “click” is a good stopword for anchor text

Best policy is to index all words in documents, make decisions about
which words to use at query time

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

Indexing (Stemming)

A

Many morphological variations of words
◦ inflectional (plurals, tenses)
◦ derivational (making verbs nouns etc.)

in most cases, have same meanings.

Stemmers attempt to reduce morphological variations of words to a
common stem. usually involves removing suffixes

can be done at stemming or part of query processing

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

Porter Stemmer

A

consists of a series of rules designed to the longest possible suffix
produces stems not words.

step1a:

  • replace ssess by ss
  • delete s if the preceeding word part contains vowel
  • replace ied or ies
  • if suffix is us or ss, do nothing

step1b:

  • replace eed, eedly by ee if it is in the part of the word after the first non-vowel following vowel.
  • delete ed, edly, ing, ingly if preceeding word part contains a vowel and if ends in at, bl, iz, add e. if words ends with double letter that is not ll,ss,zz, remove last letter.
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

Indexing (Phrases)

A

Recall, token, meaningful tokens are better indexes, e.g.
phrases

Three possible approaches:
◦ Identify syntactic phrases using a part-of-speech (POS) tagger
◦ Use word n-grams
◦ Store word positions in indexes and use proximity operators in
queries

17
Q

POS taggers

A

statistical models of text to predict

syntactic tags of words

18
Q

Indexing (Inverted Index)

A

◦ Contains lists of documents, or lists of word occurrences in documents,
and other information.
◦ Each entry is called a posting.
◦ The part of the posting that refers to a specific document or
location is called a pointer
◦ Each document in the collection is given a unique number
◦ Lists are usually document-ordered (sorted by document number)

19
Q

Retrieval Function

A

Documents are retrieved in sorted order according to a score computing using
the document representation, the query, and a ranking algorithm

20
Q

Retrieval Function (Boolean Retrieval)

A

Advantages
◦ Results are predictable, relatively easy to explain
◦ Many different features can be incorporated
◦ Efficient processing since many documents can be eliminated from search
Disadvantages
◦ Effectiveness depends entirely on user
◦ Simple queries usually don’t work well
◦ Complex queries are difficult

Sequence of queries driven by number of retrieved documents

21
Q

Retrieval Function (Vector Space Model)

A

Ranked based method.
Documents and query represented by a vector of term weights
Collection represented by a matrix of term weights

make sure know tf.idf

22
Q

Retrieval Function (Vector Space Model)

A

Documents ranked by distance between points representing
query and documents
◦ Similarity measure e.g. Cosine correlation

exercise on cosine similarity computation

23
Q

Evaluation

A

Standard Collection
◦ Task specific
◦ Human experts are used to judge relevant results

Performance Metric
◦ Precision
◦ Recall

24
Q

Evaluation (Collection)

A

Test collections consisting of documents, queries, and relevance
judgments, e.g.,

Obtaining relevance judgments is an expensive, time-consuming
process
◦ who does it?
◦ what are the instructions?
◦ what is the level of agreement?
25
Q

Evaluation (Collection): pooling technique

A

xhaustive judgments for all documents in a collection is not practical

Pooling technique is used in TREC
◦ top k results (for TREC, k varied between 50 and 200) from the rankings
obtained by different search engines (or retrieval algorithms) are merged into
a pool
◦ duplicates are removed
◦ documents are presented in some random order to the relevance judges

26
Q

Evaluation (Effectiveness Measures)

A

confusion matrix

calculation of recall and precision

27
Q

Database Records

A
Database records (or tuples in relational databases) are typically
made up of well-defined fields (or attributes)

Easy to compare fields with well-defined semantics to queries in
order to find matches
Text is more difficult

28
Q

Search Query vs DB Query

A

database query Matches easily found by comparison with field values of records

search engine query text must be compared to the text of entire news stories

29
Q

Dimensions of IR

A

IR work with different media, different types
of search applications, and different tasks

New applications increasingly involve new media
◦ e.g., video, photos, music, speech

Like text, content is difficult to describe and compare
◦ text may be used to represent them (e.g. tags)

tables..contents/applications/task

30
Q

comparison of IR and search engine

A

Search Engine Issues

performance

  • measure and improve efficiency of search (response time, query throughput, indexing speed)
  • indexes are data structures designed to improve search efficiency (implementation are major issues for search engines)
  • dynamic data (collection are constantly changing in terms of update, addition, deletion..acquiring/crawling documents is a major task..measures such as coverage / freshness is used
31
Q

Search Engine Issues

A

performance

  • measure and improve efficiency of search (response time, query throughput, indexing speed)
  • indexes are data structures designed to improve search efficiency (implementation are major issues for search engines)

dynamic data

  • (collection are constantly changing in terms of update, addition, deletion..
  • acquiring/crawling documents is a major task..
  • measures such as coverage / freshness is used

scalability

  • millions of users, terabytes of documents.
  • distributed processing is essential

adaptability
- changing and tuning search engine components such as ranking algo, index strats, interface for diff applications

32
Q

Challenges in IR

A
  1. cross lingual IR
  2. big data
  3. personalization
  4. domain specific
  5. multi modal IR
    6 . . ..