Information Retrieval Flashcards
Data vs Information
Data is raw facts
Information is data processed, organized and structured in some way.
Information is represented by a set of symbols, has some structure, and can be read and, to some extent, understood by its users.
What is IR?
Information Retrieval is the scientific discipline that deals with the analysis, design and implementation of computerized systems that address access to info items.
IR concerns the retrieval of unstructured records: free-form natural-language text, images, sound, video.
What are a document and a collection?
DOCUMENT: Records that IR systems often process
COLLECTION: An organised repository used by IR to retrieve docs
RELEVANT: Docs that satisfy a query in the judgement of the user
What are ranking and the IR problem?
RANKING: An established ordering of the docs retrieved
IR PROBLEM: Retrieve all items relevant to the user query, while retrieving as few non-relevant items as possible
What are the types of data?
STRUCTURED RECORDS: Named components organised according to some well-defined syntax
Each component has a meaning and a specific type. Ex: relational DB table records
UNSTRUCTURED DATA: Does not have a well-defined syntax
No well-defined meaning is attributed to each syntactic element
Ex: emails, chapters from books
SEMI-STRUCTURED DATA: Follows a general standard form.
Ex: JSON and XML docs have tagged fields but need not follow a particular schema
Write a short note on IR background
Early goals: indexing text and searching for useful docs
Modern research: modelling, web search, user interfaces
Libraries were among the first to adopt IR systems
(searching was by subject headings and keywords)
What are the logical views of docs?
Docs are often represented through a set of index terms (which can be extracted from the text)
IR systems can adopt different logical views of docs:
full text
Representative keywords: obtained by removing stop words, the words that occur very frequently (a, at, an, the)
Stemming: reducing words to their grammatical roots
Identification of noun groups
These allow moving the logical view from full text to a set of index terms
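The move from full text to index terms can be sketched in a few lines of Python. This is only an illustration: the stop-word list, the suffix list and the names naive_stem and index_terms are assumptions made for this card, and the suffix stripping is a crude stand-in for a real stemmer, not the Porter algorithm.

```python
import re

# Assumed, tiny stop-word list for illustration; real lists are much longer.
STOP_WORDS = {"a", "an", "at", "the", "of", "in", "and", "to", "is"}

# Assumed suffix list for a naive stemmer; this is NOT the Porter algorithm.
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text: str) -> set:
    """Move from the full text to a set of index terms."""
    tokens = re.findall(r"[a-z]+", text.lower())         # full text
    kept = [t for t in tokens if t not in STOP_WORDS]    # drop stop words
    return {naive_stem(t) for t in kept}                 # stem what is left


terms = index_terms("The system is indexing the documents at the library")
print(sorted(terms))  # ['document', 'index', 'library', 'system']
```

Noun-group identification is left out of the sketch because it needs a part-of-speech tagger.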
What are retrieval and filtering?
RETRIEVAL: Matching process between document keywords and words in queries.
Ad-hoc retrieval is the application of arbitrary queries to a fixed collection of docs: static docs, new queries.
FILTERING: A fixed set of queries applied to a stream of changing docs: static queries, new docs.
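The difference between the two settings can be shown with a small sketch. The matching rule (every query keyword must occur in the doc) and all names and data below are assumptions chosen for illustration.

```python
def matches(query: set, doc: set) -> bool:
    """Assumed matching rule: every query keyword occurs in the doc."""
    return query <= doc


def ad_hoc(query, collection):
    """Static docs, new query: run one query over a fixed collection."""
    return [doc_id for doc_id, doc in collection.items() if matches(query, doc)]


def filter_doc(standing_queries, doc):
    """Static queries, new doc: test one incoming doc against stored queries."""
    return [name for name, q in standing_queries.items() if matches(q, doc)]


# Hypothetical fixed collection, queried with arbitrary new queries.
collection = {"d1": {"boolean", "model"}, "d2": {"vector", "model"}}
print(ad_hoc({"model"}, collection))            # ['d1', 'd2']
print(ad_hoc({"vector", "model"}, collection))  # ['d2']

# Hypothetical standing queries, applied to a doc arriving on a stream.
standing_queries = {"q_bool": {"boolean"}, "q_model": {"model"}}
print(filter_doc(standing_queries, {"probabilistic", "model"}))  # ['q_model']
```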
What is Modelling in IR?
Modelling is a complex process aimed at producing a ranking function.
RANKING FUNCTION: A function that assigns scores to documents with regard to a given query
There are two main tasks:
Conception of a logical framework for representing docs and queries
definition of a ranking function that allows quantifying similarities among docs and queries
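A minimal sketch of the two tasks, assuming docs and queries are represented as sets of index terms (the framework) and that similarity is simply the number of shared terms (the ranking function). Both choices, and the names score and rank, are illustrative assumptions rather than a specific IR model.

```python
def score(query: set, doc: set) -> int:
    """Assumed similarity: number of index terms shared by query and doc."""
    return len(query & doc)


def rank(query, collection):
    """Return doc ids by decreasing score; ties are broken by doc id."""
    return sorted(collection, key=lambda d: (-score(query, collection[d]), d))


# Hypothetical collection of docs, each a set of index terms.
collection = {
    "d1": {"information", "retrieval", "model"},
    "d2": {"vector", "model"},
    "d3": {"library", "catalogue"},
}
query = {"retrieval", "model"}

print([score(query, collection[d]) for d in ("d1", "d2", "d3")])  # [2, 1, 0]
print(rank(query, collection))  # ['d1', 'd2', 'd3']
```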
What is an index term?
A keyword that has some meaning of its own and plays the role of a noun
Can be implemented efficiently and is simple to refer to in a query
What is ranking in models?
An ordering of docs that reflects their relevance to a user query
Any IR system has to deal with the problem of predicting which docs the user will find relevant
This embodies a degree of uncertainty and vagueness
What are the three classic models?
BOOLEAN: Docs and queries are sets of index terms. Set-theoretic
VECTOR: Docs and queries are vectors in an N-dimensional space. Algebraic
PROBABILISTIC: based on probability theory.
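To make the algebraic view concrete, here is a sketch that places a doc and a query in the same N-dimensional space, one dimension per vocabulary term. Raw term counts as weights and cosine similarity as the comparison are assumptions for illustration; the cards do not say which weighting or similarity the vector model uses.

```python
import math
from collections import Counter


def to_vector(tokens, vocabulary):
    """One coordinate per vocabulary term; the weight is the raw term count."""
    counts = Counter(tokens)
    return [counts[t] for t in vocabulary]


def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical three-term vocabulary, so N = 3.
vocabulary = ["boolean", "model", "vector"]
doc = to_vector(["vector", "model", "vector"], vocabulary)
query = to_vector(["vector"], vocabulary)

print(doc, query)                   # [0, 1, 2] [0, 0, 1]
print(round(cosine(doc, query), 3)) # 0.894
```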
What is the Boolean model?
The user can pose any query in the form of a Boolean expression of terms
Terms are combined with AND/OR/NOT.
Based on set theory
Views each doc as a set of words
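Because each doc is a set of words, AND/OR/NOT map directly onto set intersection, union and complement. A small sketch with made-up docs (the helper name containing is an assumption):

```python
# Hypothetical docs, each viewed as a set of words.
docs = {
    "d1": {"web", "search", "library"},
    "d2": {"web", "search"},
    "d3": {"search"},
}
all_docs = set(docs)


def containing(term):
    """Set of doc ids whose word set contains the term."""
    return {d for d, words in docs.items() if term in words}


and_result = containing("web") & containing("search")   # web AND search
or_result = containing("web") | containing("library")   # web OR library
not_result = all_docs - containing("library")           # NOT library

print(sorted(and_result))  # ['d1', 'd2']
print(sorted(or_result))   # ['d1', 'd2']
print(sorted(not_result))  # ['d2', 'd3']
```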
What is an example of the Boolean model?
Which Shakespeare plays contain Brutus AND Caesar AND NOT Calpurnia? The naive approach is a LINEAR SCAN THROUGH THE DOCS
To avoid a linear scan, index the docs in advance. The result is a BINARY TERM-DOCUMENT INCIDENCE MATRIX
Matrix element (t, d) is 1 if the play in column d contains the word in row t, and 0 otherwise
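A sketch of building the incidence matrix and answering the query with bitwise operations on its rows. The three "plays" below are made-up word sets, not the real texts, and the function name boolean_query is an assumption.

```python
# Made-up stand-ins for plays; NOT the actual contents of any Shakespeare play.
plays = {
    "play_a": {"brutus", "caesar", "calpurnia"},
    "play_b": {"brutus", "caesar"},
    "play_c": {"caesar", "mercy"},
}
play_ids = list(plays)                             # column order
vocabulary = sorted(set().union(*plays.values()))  # row order

# matrix[t][d] is 1 if the play in column d contains the word in row t, else 0.
matrix = {t: [1 if t in plays[p] else 0 for p in play_ids] for t in vocabulary}


def boolean_query(must, must_not):
    """AND together the rows in `must` and the complemented rows in `must_not`."""
    zeros = [0] * len(play_ids)
    row = [1] * len(play_ids)
    for t in must:
        row = [a & b for a, b in zip(row, matrix.get(t, zeros))]
    for t in must_not:
        row = [a & (1 - b) for a, b in zip(row, matrix.get(t, zeros))]
    return [p for p, bit in zip(play_ids, row) if bit]


print(matrix["brutus"])     # [1, 1, 0]
print(matrix["calpurnia"])  # [1, 0, 0]
print(boolean_query(["brutus", "caesar"], ["calpurnia"]))  # ['play_b']
```

The query touches only three rows of the matrix, instead of scanning the text of every play.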
Write a note on the Boolean model
Terms are either present or absent
Retrieval is based on a binary decision criterion with no notion of partial matching
No ranking of docs is provided
Often returns either too few or too many docs in response to a user query.
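The "too few or too many" behaviour is easy to reproduce on a toy collection. The docs and query terms below are made up for illustration.

```python
# Hypothetical docs as sets of index terms.
docs = {
    "d1": {"information", "retrieval", "model"},
    "d2": {"information", "retrieval"},
    "d3": {"information"},
}
query_terms = {"information", "retrieval", "ranking"}

# AND of all terms: a doc matching 2 of 3 terms is rejected outright.
and_hits = [d for d, words in docs.items() if query_terms <= words]

# OR of all terms: every doc sharing any term is returned, in no ranked order.
or_hits = [d for d, words in docs.items() if query_terms & words]

print(and_hits)  # []
print(or_hits)   # ['d1', 'd2', 'd3']
```

Neither result tells the user that d1 and d2 are closer to the query than d3, which is what a ranking would provide.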