Topic 6: Text Retrieval Flashcards
Conceptual model for IR
blocks involved are
documents, document representation, information needs, query, retrieved documents
indexing
documents -> document representation
formulation
information needs -> query
Definition of IR
Information retrieval is a field concerned with the
structure, analysis, organization, storage, searching, and
retrieval of information
Information retrieval (IR) is finding material (usually
documents) of an unstructured nature (usually text) that
satisfies an information need from within large collections
(usually stored on computers)
Document / Retrieval Unit
example document
web page, email, books, scholar papers, text msgs, words, PDF, forums, patents, etc
retrieval unit can be
- part of document (paragraph, page, slide)
- in a form of different structure (html, xml, text, etc)
- in different sizes or length
Document Representation: 2 types
Full Text Representation
- complete
- need huge resources
Reduced (partial) Content Representation
- remove unimportant contents. eg stopwords
- standardization to reduce overlapped contents. eg stemming
retain only important content. eg noun phrases, header, etc
Bag of Words Model
Store the words as the bag (multiset) of its words, disregarding grammar
and even word order
Information Needs
Those things that you want Google to give you answer are
information needs.
what you want to know/search
Normally, you are required to formulate your information needs into
some keywords, known as query
Information Needs: Query
Simple Query ◦ Few keywords or more. Boolean Query ◦ ‘neural network AND speech recognition’ Special Query ◦ 400 myr in usd
Simple Term Matching Approach
- Compare the terms in a document and query.
- Compute “similarity” between each document in the collection and
the query based on the terms they have in common. - Sorting the document in order of decreasing similarity with the query.
- The outputs are a ranked list and displayed to the user - the top ones
are more relevant as judged by the system
Indexing
Convert documents into
representation or data structure to
improve the efficiency of retrieval
why? ◦ Many variety of words used in texts, but not all are important. ◦ Among the important words, some are more contextually relevant.
Indexing: Some basic processes involved
◦ Tokenization ◦ Stop Words Removal ◦ Stemming ◦ Phrases ◦ Inverted File
Indexing (Tokenization)
Convert a sequence of characters
into a sequence of tokens with
some basic meaning
Token can be single or multiple terms.. give example
common issues in tokenization
- Capitalized words can have different meaning from lower case words
- Apostrophes can be a part of a word, a part of a possessive, or just a
mistake - Numbers can be important, including decimals
- Periods can occur in numbers, abbreviations, URLs, ends of sentences, and
other situations
Indexing (Stopping)
Stopword list can be created from high-frequency words or based on
a standard list
Lists are customized for applications, domains, and even parts of
documents
◦ e.g., “click” is a good stopword for anchor text
Best policy is to index all words in documents, make decisions about
which words to use at query time
Indexing (Stemming)
Many morphological variations of words
◦ inflectional (plurals, tenses)
◦ derivational (making verbs nouns etc.)
in most cases, have same meanings.
Stemmers attempt to reduce morphological variations of words to a
common stem. usually involves removing suffixes
can be done at stemming or part of query processing
Porter Stemmer
consists of a series of rules designed to the longest possible suffix
produces stems not words.
step1a:
- replace ssess by ss
- delete s if the preceeding word part contains vowel
- replace ied or ies
- if suffix is us or ss, do nothing
step1b:
- replace eed, eedly by ee if it is in the part of the word after the first non-vowel following vowel.
- delete ed, edly, ing, ingly if preceeding word part contains a vowel and if ends in at, bl, iz, add e. if words ends with double letter that is not ll,ss,zz, remove last letter.