Information Retrieval Flashcards
Data vs Information
Data is raw facts
Information is data that has been processed, organized and structured in some way.
Information is represented by a set of symbols, has some structure, and can be read and, to some extent, understood by its users.
What is IR ?
Information Retrieval is the scientific discipline that deals with the analysis, design and implementation of computerized systems that address access to information items.
It deals with retrieval of unstructured records: free-form natural language text, images, sound, video.
What is a document and collection ?
DOCUMENT: A record that IR systems process
COLLECTION: An organised repository of docs that the IR system retrieves from
RELEVANT: Docs that satisfy a query in the judgement of the user
What is ranking and the IR problem
RANKING: the established order of the docs retrieved
IR problem: retrieve all items relevant to the user query, while retrieving as few non-relevant items as possible
What are the types of data ?
STRUCTURED RECORDS: Named components that are organised according to some well-defined syntax
Each component has some meaning and a specific type. Ex: relational DB table records
UNSTRUCTURED DATA: Doesn't have a well-defined syntax
No well-defined meaning is attributed to each syntactical component
Ex: Emails, chapters from books
SEMI-STRUCTURED DATA: follows a general standard form.
Ex: JSON and XML data have tagged fields but not necessarily a fixed schema
Write a short note on IR background
Early goals: indexing text and searching for useful docs
Modern research: modelling, web search, user interfaces
Libraries were among the first to adopt IR systems
(initially based on subject headings and keywords)
What are the logical views of docs ?
Docs are often represented through a set of index terms (which can be extracted from the text)
IR systems can adopt different logical views of docs:
Full text
Representative keywords: obtained by removing stop words, the words that occur very frequently (a, at, an, the)
Stemming: reducing words to their grammatical roots
Identification of noun groups
These allow moving the logical view from full text to a set of index terms
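A minimal sketch of moving from full text to a set of index terms. The stop-word list and the toy_stem function are hypothetical, made up for this card; toy_stem only strips a few suffixes and is not a real stemming algorithm such as Porter's.

```python
import re

# Hypothetical stop-word list for illustration; real lists are much longer.
STOP_WORDS = {"a", "an", "at", "the", "of", "in", "and", "to", "is"}

def toy_stem(word):
    # Crude suffix stripping, NOT a real stemmer (assumption for this sketch).
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def index_terms(text):
    # Full text -> lowercase tokens -> drop stop words -> stem -> unique terms.
    tokens = re.findall(r"[a-z]+", text.lower())
    kept = [t for t in tokens if t not in STOP_WORDS]
    return sorted({toy_stem(t) for t in kept})

terms = index_terms("The crawler is fetching documents and indexing the fetched documents")
print(terms)  # ['crawler', 'document', 'fetch', 'index']
```

Ten words of full text collapse to four index terms, which is the reduced logical view.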
What is retrieval and filtering
RETRIEVAL: Matching process between document keywords and the words in queries.
Ad-hoc retrieval is the application of arbitrary queries to a fixed collection of docs: static docs, new queries
FILTERING: A fixed number of queries applied to a stream of changing docs: static queries, new docs.
What is Modelling in IR?
Modelling is a complex process aimed at producing a ranking function.
RANKING FUNCTION: a function that assigns scores to documents with regard to a given query
There are two main tasks:
Conception of a logical framework for representing docs and queries
Definition of a ranking function that allows quantifying similarities among docs and queries
What is an index term ?
A keyword that has some meaning of its own and plays the role of a noun
Can be implemented efficiently and is simple to refer to in a query
What is ranking in Models ?
An ordering of docs that reflects their relevance to a user query
Any IR system has to deal with the problem of predicting which docs the user will find relevant
This embodies a degree of uncertainty and vagueness
What are the three classic models ?
BOOLEAN: docs and queries are sets of index terms. Set theoretic
VECTOR: docs and queries are vectors in an N-DIMENSIONAL SPACE. Algebraic
PROBABILISTIC: based on probability theory.
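To make the vector model concrete, here is a minimal sketch that scores docs against a query by cosine similarity. It assumes raw term counts as vector components (practical systems usually weight terms, for example with tf-idf); the three docs and the query are invented for illustration.

```python
import math
from collections import Counter

def cosine(a, b):
    # a, b: Counter mapping term -> raw term frequency (assumed weighting).
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Toy collection and query, made up for this card.
docs = {
    "d1": "web crawler fetches web pages",
    "d2": "boolean model uses set theory",
    "d3": "crawler politeness policies",
}
query = "web crawler"

q_vec = Counter(query.split())
scores = {d: cosine(q_vec, Counter(text.split())) for d, text in docs.items()}

# The ranking orders docs by decreasing score.
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # ['d1', 'd3', 'd2']
```

Unlike the Boolean model, d3 still gets a non-zero score even though it matches only one query term, so partial matches can be ranked.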
What is the Boolean model ?
Can pose any query in the form of a Boolean expression of terms
Terms are combined with AND/OR/NOT.
Based on set theory
Views each doc as a set of words
What is an example of the Boolean model ?
- Which Shakespeare plays contain Brutus AND Caesar AND NOT Calpurnia? The naive approach is a LINEAR SCAN THROUGH THE DOCS
To avoid a linear scan, index the docs in advance. The result is a BINARY TERM-DOCUMENT INCIDENCE MATRIX
Matrix element (t, d) is 1 if the play in column d contains the word in row t, and 0 otherwise
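A small sketch of answering the Brutus/Caesar/Calpurnia query with an incidence matrix. The 0/1 rows below are illustrative values typed in by hand for this card, and boolean_and_not is a hypothetical helper name.

```python
# Columns of the matrix (one per play); rows are terms.
plays = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]

# Illustrative incidence rows: 1 means the term occurs in that play.
incidence = {
    "brutus":    [1, 1, 0, 1, 0, 0],
    "caesar":    [1, 1, 0, 1, 1, 1],
    "calpurnia": [0, 1, 0, 0, 0, 0],
}

def boolean_and_not(matrix, must_have, must_not_have):
    # AND together the rows of required terms and the complements
    # of the rows of excluded terms, column by column.
    n = len(next(iter(matrix.values())))
    result = [1] * n
    for term in must_have:
        result = [r & x for r, x in zip(result, matrix[term])]
    for term in must_not_have:
        result = [r & (1 - x) for r, x in zip(result, matrix[term])]
    return result

answer_bits = boolean_and_not(incidence, ["brutus", "caesar"], ["calpurnia"])
matching_plays = [p for p, bit in zip(plays, answer_bits) if bit]
print(matching_plays)  # ['Antony and Cleopatra', 'Hamlet']
```

The answer is a plain set of plays with no ordering, which is the "no ranking" property of the Boolean model.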
Write a note on the Boolean model
Terms are either present or absent
Retrieval is based on binary decision criteria with no notion of partial matching
No ranking of docs is provided
Tends to return either too few or too many docs in response to a user query.
What is an inverted index ?
Keep a dictionary of terms
For each term, keep a list that records which docs the term occurs in
Each item in the list, which records that the term appeared in a doc, is called a POSTING
The list is called a POSTINGS LIST, and all the postings lists taken together are called the postings
How to create an inverted index ?
Collect the docs to be indexed
Tokenize the text, turning each doc into a list of tokens. Also remove the stop words.
Do linguistic pre-processing, such as stemming
Index the docs that each term occurs in by creating an inverted index.
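The steps above can be sketched roughly as follows. This is a simplified illustration: tokenizing is a plain whitespace split, stemming is skipped, and the stop-word list, doc IDs and function names are assumptions made up for this card. The intersect function shows how an AND query can merge two sorted postings lists.

```python
from collections import defaultdict

# Hypothetical stop-word list for illustration.
STOP_WORDS = {"a", "an", "at", "the"}

def build_inverted_index(docs):
    # docs: doc ID -> text. Returns term -> sorted postings list of doc IDs.
    index = defaultdict(list)
    for doc_id in sorted(docs):
        tokens = docs[doc_id].lower().split()   # naive tokenization
        for term in sorted(set(tokens) - STOP_WORDS):
            index[term].append(doc_id)          # one posting per (term, doc)
    return dict(index)

def intersect(p1, p2):
    # Merge two sorted postings lists, keeping doc IDs present in both.
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out

docs = {1: "the web crawler", 2: "a web index", 3: "the crawler fetches a web page"}
index = build_inverted_index(docs)
both = intersect(index["web"], index["crawler"])
print(index["web"], index["crawler"], both)  # [1, 2, 3] [1, 3] [1, 3]
```

Because docs are processed in increasing ID order, each postings list comes out sorted, which is what lets the merge run in a single pass over both lists.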
What is web crawling ?
Gathering pages from the web in order to index them and support a search engine. The goal is to gather pages quickly and efficiently, together with the link structure that interconnects them.
Web crawlers are also known as spiders
A crawler must have these features:
ROBUSTNESS: Must not get caught in spider traps (pages that mislead crawlers into fetching an infinite number of pages from some domain)
POLITENESS: Web servers have policies regulating the rate at which a crawler can visit them, and these must be respected.
What are the features a crawler should provide ?
DISTRIBUTED: have the capability to execute in a distributed fashion
SCALABLE : Architecture should permit scaling up the crawl rate by adding extra machines and bandwidth
PERFORMANCE: make efficient use of resources like processors and storage
QUALITY: should be biased towards fetching more useful info
FRESHNESS: operate in a continuous mode and fetch fresh copies of previously fetched pages.
EXTENSIBLE: cope with new data formats and new fetch protocols
What is the web crawling operation ?
Begin with one or more URLs that constitute a SEED SET
The crawler picks a URL from the seed set and fetches the web page at that URL
The fetched page is parsed, and its text and links are extracted.
Extracted text is fed to a text indexer
Extracted links are added to the URL FRONTIER, which holds URLs that are yet to be fetched
CONTINUOUS CRAWLING: the URL of a fetched page is not deleted from the URL frontier but is fetched again in the future.
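A minimal sketch of the basic crawl loop. To stay self-contained it "fetches" from a hypothetical in-memory dictionary (FAKE_WEB) instead of the real web, and it deliberately leaves out politeness delays, robots.txt checks and continuous re-crawling. All URLs and names are made up for this card.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
FAKE_WEB = {
    "http://a.example/": ("home page", ["http://a.example/x", "http://b.example/"]),
    "http://a.example/x": ("page x", ["http://a.example/"]),
    "http://b.example/": ("site b", []),
}

def crawl(seed_set, fetch, max_pages=100):
    frontier = deque(seed_set)   # URL frontier: URLs yet to be fetched
    seen = set(seed_set)         # avoids adding the same URL twice
    indexed = {}                 # stands in for the text indexer
    while frontier and len(indexed) < max_pages:
        url = frontier.popleft()
        page = fetch(url)
        if page is None:         # fetch failed or URL unknown
            continue
        text, links = page       # "parsing" is trivial in this sketch
        indexed[url] = text      # feed extracted text to the indexer
        for link in links:       # add extracted links to the frontier
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return indexed

indexed = crawl(["http://a.example/"], FAKE_WEB.get)
print(list(indexed))
```

The max_pages cap and the seen set are simple guards against unbounded crawling; they are not a full defence against spider traps.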
What is Robots Exclusion Protocol ?
Many hosts place portions of their websites off-limits to crawling.
This is done by placing a robots.txt file at the root of the site's URL hierarchy.
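A short sketch of how a crawler might check robots.txt rules, using Python's standard urllib.robotparser module. The robots.txt content, the crawler name "example-crawler" and the URLs are invented for illustration; a real crawler would download the file from the host instead of parsing a string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: all crawlers are kept out of /private/.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Ask whether our (made-up) crawler may fetch each URL.
allowed = parser.can_fetch("example-crawler", "http://www.example.com/index.html")
blocked = parser.can_fetch("example-crawler", "http://www.example.com/private/data.html")
print(allowed, blocked)  # True False
```

A polite crawler would run this check before fetching any URL taken from the frontier.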