Topic 6: Text Retrieval Flashcards
Conceptual model for IR
The blocks involved are:
documents, document representation, information needs, query, retrieved documents
indexing
documents -> document representation
formulation
information needs -> query
Definition of IR
Information retrieval is a field concerned with the
structure, analysis, organization, storage, searching, and
retrieval of information
Information retrieval (IR) is finding material (usually
documents) of an unstructured nature (usually text) that
satisfies an information need from within large collections
(usually stored on computers)
Document / Retrieval Unit
example documents
web pages, emails, books, scholarly papers, text messages, Word documents, PDFs, forum posts, patents, etc.
a retrieval unit can be
- a part of a document (paragraph, page, slide)
- in a different structural form (HTML, XML, plain text, etc.)
- of different sizes or lengths
Document Representation: 2 types
Full Text Representation
- complete
- requires huge resources
Reduced (partial) Content Representation
- removes unimportant content, e.g. stopwords
- standardizes to reduce overlapping content, e.g. stemming
- retains only important content, e.g. noun phrases, headers, etc.
Bag of Words Model
Represent a text as the bag (multiset) of its words, disregarding grammar
and even word order
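A minimal sketch of the model, using Python's collections.Counter as the bag (the text and names are illustrative):

```python
from collections import Counter

def bag_of_words(text):
    # Lowercase and split on whitespace; a real system would tokenize properly
    return Counter(text.lower().split())

doc = "to be or not to be"
print(bag_of_words(doc))  # Counter({'to': 2, 'be': 2, 'or': 1, 'not': 1})
```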
Information Needs
The things you want a search engine (e.g. Google) to answer are
information needs:
what you want to know or search for.
Normally, you are required to formulate your information needs into
some keywords, known as a query.
Information Needs: Query
Simple Query
◦ a few keywords or more
Boolean Query
◦ e.g. ‘neural network AND speech recognition’
Special Query
◦ e.g. ‘400 myr in usd’
Simple Term Matching Approach
- Compare the terms in a document and the query.
- Compute a “similarity” between each document in the collection and
the query based on the terms they have in common.
- Sort the documents in order of decreasing similarity with the query.
- The output is a ranked list displayed to the user; the top documents
are the most relevant as judged by the system.
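A toy sketch of this approach, scoring each document by the number of distinct terms it shares with the query (the documents and the overlap score are illustrative choices):

```python
def term_overlap(doc, query):
    # Similarity = number of distinct terms the document and query share
    return len(set(doc.lower().split()) & set(query.lower().split()))

docs = ["neural network models",
        "speech recognition with neural networks",
        "cooking recipes"]
query = "neural network speech recognition"

# Rank documents by decreasing similarity with the query
for d in sorted(docs, key=lambda d: term_overlap(d, query), reverse=True):
    print(term_overlap(d, query), d)
```

Note that "networks" does not match "network" here, which is one motivation for stemming (below).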
Indexing
Convert documents into a
representation or data structure to
improve the efficiency of retrieval.
Why?
◦ A wide variety of words is used in texts, but not all are important.
◦ Among the important words, some are more contextually relevant than others.
Indexing: Some basic processes involved
◦ Tokenization
◦ Stop Words Removal
◦ Stemming
◦ Phrases
◦ Inverted File
Indexing (Tokenization)
Convert a sequence of characters
into a sequence of tokens with
some basic meaning
A token can be a single term or multiple terms, e.g. ‘york’ versus the two-word token ‘new york’.
common issues in tokenization
- Capitalized words can have different meanings from lowercase words
- Apostrophes can be a part of a word, a part of a possessive, or just a mistake
- Numbers can be important, including decimals
- Periods can occur in numbers, abbreviations, URLs, ends of sentences, and
other situations
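A simple tokenizer sketch that lowercases and keeps decimals and internal apostrophes intact (the regex is one illustrative choice, not a standard):

```python
import re

# Keep numbers with decimals (3.50) and words with internal apostrophes (o'brien)
TOKEN_RE = re.compile(r"\d+(?:\.\d+)?|[a-z]+(?:'[a-z]+)?")

def tokenize(text):
    return TOKEN_RE.findall(text.lower())

print(tokenize("O'Brien paid $3.50 for the U.S. edition."))
# ["o'brien", 'paid', '3.50', 'for', 'the', 'u', 's', 'edition']
```

Note how the abbreviation "U.S." still splits into single letters, illustrating the period problem above.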
Indexing (Stopping)
A stopword list can be created from high-frequency words or based on
a standard list.
Lists are customized for applications, domains, and even parts of
documents
◦ e.g., “click” is a good stopword for anchor text
The best policy is to index all words in documents and make decisions about
which words to use at query time.
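A sketch of index-time stopping against a small hand-picked list (real systems use standard or application-specific lists; this one is illustrative):

```python
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is"}

def remove_stopwords(tokens):
    # Drop any token that appears in the stopword list
    return [t for t in tokens if t not in STOPWORDS]

print(remove_stopwords("the structure and analysis of information".split()))
# ['structure', 'analysis', 'information']
```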
Indexing (Stemming)
Many morphological variations of words
◦ inflectional (plurals, tenses)
◦ derivational (making verbs into nouns, etc.)
In most cases, the variants have the same meaning.
Stemmers attempt to reduce morphological variations of words to a
common stem; this usually involves removing suffixes.
Stemming can be done at indexing time or as part of query processing.
Porter Stemmer
consists of a series of rules, each designed to strip the longest possible suffix.
It produces stems, not words.
step 1a:
- replace sses by ss
- delete s if the preceding word part contains a vowel not immediately before the s
- replace ied or ies by i if preceded by more than one letter, otherwise by ie
- if the suffix is us or ss, do nothing
step 1b:
- replace eed or eedly by ee if it is in the part of the word after the first non-vowel following a vowel
- delete ed, edly, ing, and ingly if the preceding word part contains a vowel; after the deletion, if the word ends in at, bl, or iz, add e; if the word ends with a double letter that is not ll, ss, or zz, remove the last letter
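These rules appear to be from the updated (Porter2) version of the stemmer. Rather than reimplement them, a quick way to see them in action is NLTK's Snowball English stemmer, which implements Porter2 (assuming the nltk package is installed):

```python
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")  # Porter2 ("Snowball") English stemmer
for word in ["stresses", "ties", "agreed", "fishing", "falling"]:
    print(word, "->", stemmer.stem(word))  # outputs are stems, not words
```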
Indexing (Phrases)
Recall from tokenization: meaningful tokens make better index terms, e.g.
phrases.
Three possible approaches:
◦ Identify syntactic phrases using a part-of-speech (POS) tagger
◦ Use word n-grams
◦ Store word positions in indexes and use proximity operators in
queries
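A sketch of the word n-gram approach, generating bigrams from a token sequence:

```python
def word_ngrams(tokens, n=2):
    # Slide a window of size n over the token sequence
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(word_ngrams("information retrieval in practice".split()))
# ['information retrieval', 'retrieval in', 'in practice']
```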
POS taggers
use statistical models of text to predict the
syntactic tags of words
Indexing (Inverted Index)
◦ Contains lists of documents, or lists of word occurrences in documents,
and other information.
◦ Each entry is called a posting.
◦ The part of the posting that refers to a specific document or
location is called a pointer
◦ Each document in the collection is given a unique number
◦ Lists are usually document-ordered (sorted by document number)
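A minimal inverted index sketch with document-ordered posting lists (postings here store only document numbers; real indexes also store counts or positions):

```python
from collections import defaultdict

def build_inverted_index(docs):
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):      # each document gets a unique number
        for term in sorted(set(text.lower().split())):
            index[term].append(doc_id)        # posting: a pointer to the document
    return index

docs = ["new york city", "york minster", "new car"]
index = build_inverted_index(docs)
print(index["new"])   # [0, 2] -- document-ordered posting list
print(index["york"])  # [0, 1]
```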
Retrieval Function
Documents are retrieved in sorted order according to a score computed using
the document representation, the query, and a ranking algorithm.
Retrieval Function (Boolean Retrieval)
Advantages
◦ Results are predictable, relatively easy to explain
◦ Many different features can be incorporated
◦ Efficient processing since many documents can be eliminated from search
Disadvantages
◦ Effectiveness depends entirely on user
◦ Simple queries usually don’t work well
◦ Complex queries are difficult
◦ Query development is often a sequence of queries driven by the number of retrieved documents
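A sketch of answering a Boolean AND query by intersecting posting lists from an inverted index (the toy index is illustrative):

```python
def boolean_and(index, terms):
    # Intersect the posting lists of all query terms
    postings = [set(index.get(t, [])) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []

index = {"neural": [0, 1, 3], "network": [1, 2, 3], "speech": [3, 4]}
print(boolean_and(index, ["neural", "network"]))            # [1, 3]
print(boolean_and(index, ["neural", "network", "speech"]))  # [3]
```

This is also why Boolean processing is efficient: any document absent from one of the posting lists is eliminated without further work.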
Retrieval Function (Vector Space Model)
A ranking-based method.
Documents and the query are represented by vectors of term weights.
The collection is represented by a matrix of term weights.
Term weights are typically tf.idf values (sketched below).
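One common variant of the tf.idf weight (a sketch; several weighting schemes exist, and the exact form is not specified here):

```latex
w_{t,d} = \mathrm{tf}_{t,d} \cdot \log\frac{N}{\mathrm{df}_t}
```

where tf_{t,d} is the frequency of term t in document d, df_t is the number of documents containing t, and N is the total number of documents in the collection.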
Retrieval Function (Vector Space Model)
Documents ranked by distance between points representing
query and documents
◦ similarity measure, e.g. cosine correlation
Worked example: cosine similarity computation (see the sketch below).
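A worked sketch of the cosine computation, using raw term-frequency vectors to keep the arithmetic visible (a real system would use tf.idf weights):

```python
import math

def cosine(u, v):
    # Cosine similarity = dot product divided by the product of vector lengths
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Term order: [neural, network, speech]
doc   = [2, 1, 0]   # term frequencies in the document
query = [1, 1, 1]   # term frequencies in the query
print(round(cosine(doc, query), 3))  # 3 / (sqrt(5) * sqrt(3)) = 0.775
```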
Evaluation
Standard Collection
◦ Task specific
◦ Human experts are used to judge relevant results
Performance Metric
◦ Precision
◦ Recall
Evaluation (Collection)
Test collections consist of documents, queries, and relevance
judgments, e.g. the TREC collections.
Obtaining relevance judgments is an expensive, time-consuming process:
◦ who does it?
◦ what are the instructions?
◦ what is the level of agreement?
Evaluation (Collection): pooling technique
Exhaustive judgments for all documents in a collection are not practical.
Pooling technique is used in TREC
◦ top k results (for TREC, k varied between 50 and 200) from the rankings
obtained by different search engines (or retrieval algorithms) are merged into
a pool
◦ duplicates are removed
◦ documents are presented in some random order to the relevance judges
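A sketch of the pooling steps described above: merge the top-k from each ranking, deduplicate, and shuffle before judging (k and the rankings are illustrative):

```python
import random

def build_pool(rankings, k=50):
    pool = set()
    for ranking in rankings:
        pool.update(ranking[:k])  # top-k results from each system
    pool = list(pool)             # duplicates already removed by the set
    random.shuffle(pool)          # random order for the relevance judges
    return pool

rankings = [["d3", "d1", "d7"], ["d1", "d2", "d3"]]
print(sorted(build_pool(rankings, k=2)))  # ['d1', 'd2', 'd3']
```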
Evaluation (Effectiveness Measures)
Confusion matrix: true/false positives and negatives over retrieved vs. relevant documents.
Precision = TP / (TP + FP): the fraction of retrieved documents that are relevant.
Recall = TP / (TP + FN): the fraction of relevant documents that are retrieved.
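A small sketch computing both measures from sets of retrieved and relevant documents (the document IDs are illustrative):

```python
def precision_recall(retrieved, relevant):
    hits = len(retrieved & relevant)   # true positives
    precision = hits / len(retrieved)  # fraction of retrieved docs that are relevant
    recall = hits / len(relevant)      # fraction of relevant docs that were retrieved
    return precision, recall

retrieved = {1, 2, 3, 4}
relevant  = {2, 4, 5}
print(precision_recall(retrieved, relevant))  # (0.5, 0.667 approx.)
```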
Database Records
Database records (or tuples in relational databases) are typically made up of well-defined fields (or attributes)
Easy to compare fields with well-defined semantics to queries in
order to find matches
Text is more difficult
Search Query vs DB Query
database query: matches are easily found by comparison with the field values of records
search engine query: the query text must be compared to the text of entire news stories
Dimensions of IR
IR works with different media, different types
of search applications, and different tasks.
New applications increasingly involve new media
◦ e.g., video, photos, music, speech
Like text, content is difficult to describe and compare
◦ text may be used to represent them (e.g. tags)
See the tables of IR dimensions: content, applications, and tasks.
Comparison of IR and search engines: a search engine is the practical application of IR techniques to large-scale text collections.
Search Engine Issues
performance
- measure and improve the efficiency of search (response time, query throughput, indexing speed)
- indexes are data structures designed to improve search efficiency (their implementation is a major issue for search engines)
dynamic data
- collections are constantly changing (updates, additions, deletions)
- acquiring/crawling documents is a major task
- measures such as coverage and freshness are used
scalability
- millions of users, terabytes of documents.
- distributed processing is essential
adaptability
- changing and tuning search engine components (ranking algorithms, indexing strategies, interfaces) for different applications
Challenges in IR
- cross-lingual IR
- big data
- personalization
- domain-specific IR
- multimodal IR