Information Retrieval Flashcards

1
Q

Data vs Information

A

Data consists of raw facts.

Information is data that has been processed, organized and structured in some way.

Information is represented by a set of symbols, has some structure, and can be read and, to some extent, understood by its users.

2
Q

What is IR?

A

Information Retrieval is the scientific discipline that deals with the analysis, design and implementation of computerized systems that provide access to information items.

It concerns the retrieval of unstructured records: free-form natural-language text, images, sound and video.

3
Q

What are a document and a collection?

A

DOCUMENT: a record that an IR system processes and retrieves.

COLLECTION: an organised repository from which an IR system retrieves documents.

Documents that satisfy a query in the judgement of the user are called RELEVANT.

4
Q

What are ranking and the IR problem?

A

RANKING: the order in which retrieved documents are presented.

IR PROBLEM: retrieve all items relevant to a user query while retrieving as few non-relevant items as possible.

5
Q

What are the types of data?

A

STRUCTURED RECORDS: named components organised according to a well-defined syntax.
Each component has a specific meaning and type. Ex: relational DB table records

UNSTRUCTURED DATA: has no well-defined syntax;
no well-defined meaning is attributed to each syntactical component.
Ex: emails, chapters from books

SEMI-STRUCTURED DATA: follows a general standard form.
Ex: JSON and XML documents have tagged fields but no fixed schema (see the sketch below)
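A minimal sketch of a semi-structured record, shown here as JSON parsed from Python; the record and its field names are invented for illustration:

```python
import json

# Hypothetical semi-structured record: the fields are tagged,
# but no schema dictates which fields must appear or in what order.
record = '{"title": "Hamlet", "author": "Shakespeare", "tags": ["tragedy", "play"]}'

doc = json.loads(record)            # parse the JSON text into a Python dict
print(doc["title"], doc["tags"])    # fields are accessed by their tag names
```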

6
Q

Write a short note on IR background

A

Early goals: indexing text and searching for useful documents.

Modern research: modelling, web search, user interfaces.
Libraries were among the first to adopt IR systems
(based on subject headings and keywords).

7
Q

What are the logical views of documents?

A

Documents are often represented through a set of index terms (which can be extracted).

IR systems can adopt different logical views of documents:
Full text
Elimination of stop words: words that occur too frequently to be useful (a, at, an, the)
Stemming: reducing words to their grammatical roots
Identification of noun groups

These operations move the logical view from the full text to a set of index terms, as sketched below.
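A toy Python sketch of moving from full text to index terms; the stop-word list and the suffix-stripping "stemmer" are simplified stand-ins for real components:

```python
# Hypothetical, simplified pipeline: full text -> index terms.
STOP_WORDS = {"a", "an", "at", "the", "of", "and", "to", "in", "were"}  # toy list

def crude_stem(word: str) -> str:
    """Very rough stand-in for a real stemmer (e.g. Porter)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def index_terms(text: str) -> list[str]:
    tokens = text.lower().split()                               # full text -> tokens
    tokens = [t.strip(".,;:!?") for t in tokens]                # strip punctuation
    tokens = [t for t in tokens if t and t not in STOP_WORDS]   # drop stop words
    return [crude_stem(t) for t in tokens]                      # reduce to "roots"

print(index_terms("The crawlers were indexing the documents at night"))
# -> ['crawler', 'index', 'document', 'night']
```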

8
Q

What are retrieval and filtering?

A

RETRIEVAL: the matching process between document keywords and words in queries.
Ad-hoc retrieval is the application of arbitrary queries to a fixed collection of documents: static documents, new queries.

FILTERING: a fixed set of queries applied to a stream of changing documents: static queries, new documents.

9
Q

What is Modelling in IR?

A

Modelling is a complex process aimed at producing a ranking function.

RANKING FUNCTION: a function that assigns scores to documents with regard to a given query (a toy example is sketched below).

There are two main tasks:
conception of a logical framework for representing documents and queries
definition of a ranking function that quantifies the similarity between documents and queries
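A toy sketch of a ranking function, not any particular classic model: it scores each document by how many query terms it contains and orders documents by that score. The documents, query and scoring rule are invented for illustration:

```python
# Hypothetical, minimal ranking function: score = number of shared terms.
def rank(docs: dict[str, set[str]], query: set[str]) -> list[tuple[str, int]]:
    scores = {doc_id: len(terms & query) for doc_id, terms in docs.items()}
    # Sorting by decreasing score produces the ranking.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

docs = {
    "d1": {"information", "retrieval", "ranking"},
    "d2": {"database", "records"},
}
print(rank(docs, {"ranking", "retrieval"}))  # -> [('d1', 2), ('d2', 0)]
```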

10
Q

What is an index term?

A

A keyword that has some meaning of its own and plays the role of a noun

Index terms can be implemented efficiently and are simple to refer to in a query.

11
Q

What is ranking in IR models?

A

An ordering of the documents that reflects their relevance to a user query.

Any IR system has to deal with the problem of predicting which documents the user will find relevant.
This prediction embodies a degree of uncertainty and vagueness.

12
Q

What are the three classic models?

A

BOOLEAN: documents and queries are sets of index terms; set-theoretic.

VECTOR: documents and queries are vectors in an N-DIMENSIONAL SPACE; algebraic.

PROBABILISTIC: based on probability theory.

13
Q

What is the Boolean model?

A

Any query can be posed in the form of a Boolean expression of terms.

Terms are combined with AND/OR/NOT.

Based on set theory.

Views each document as a set of words (see the sketch below).
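A minimal sketch, assuming each document is represented as a Python set of words; the documents and the query are invented for illustration:

```python
# Hypothetical toy collection: each document is just a set of words.
docs = {
    "d1": {"brutus", "caesar", "calpurnia"},
    "d2": {"brutus", "caesar"},
    "d3": {"caesar", "cleopatra"},
}

# Query: brutus AND caesar AND NOT calpurnia
hits = {d for d, words in docs.items()
        if "brutus" in words and "caesar" in words and "calpurnia" not in words}
print(hits)  # -> {'d2'}
```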

14
Q

What is an example of the Boolean model?

A
  1. Which Shakespeare plays contain Brutus AND Caesar AND NOT Calpurnia? The naive answer is a LINEAR SCAN through the documents.

To avoid a linear scan, index the documents in advance. The result is a BINARY TERM-DOCUMENT INCIDENCE MATRIX.

Matrix element (t, d) is 1 if the play in column d contains the word in row t, and 0 otherwise (see the sketch below).
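A sketch of the incidence-matrix idea, with made-up 0/1 rows for three terms over four plays; the query is answered by combining rows bitwise instead of scanning the texts:

```python
# Hypothetical incidence rows: one bit per play (columns = 4 plays).
rows = {
    "brutus":    [1, 1, 0, 0],
    "caesar":    [1, 1, 1, 0],
    "calpurnia": [1, 0, 0, 0],
}

# brutus AND caesar AND NOT calpurnia, evaluated column by column.
answer = [b & c & (1 - p) for b, c, p in
          zip(rows["brutus"], rows["caesar"], rows["calpurnia"])]
print(answer)  # -> [0, 1, 0, 0]: only the second play matches
```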

15
Q

Write a note on the Boolean model

A

Terms are either present or absent.

Retrieval is based on a binary decision criterion, with no notion of partial matching.

No ranking of the retrieved documents is provided.

It tends to return either too few or too many documents in response to a user query.

16
Q

What is an inverted index?

A

Keep a dictionary of terms.

For each term, keep a list that records which documents the term occurs in.

Each item in the list, recording that the term appeared in a document, is called a POSTING.

The list is called a POSTINGS LIST, and all the lists together are called the postings (see the sketch below).
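A minimal sketch of the structure, with an invented dictionary of three terms; each term maps to a sorted postings list of document IDs:

```python
# Hypothetical inverted index: term -> sorted postings list of doc IDs.
inverted_index = {
    "brutus":    [1, 2, 4],   # each doc ID here is one posting
    "caesar":    [1, 2, 3, 4],
    "calpurnia": [2],
}

# An AND-query intersects the two postings lists.
hits = set(inverted_index["brutus"]) & set(inverted_index["calpurnia"])
print(sorted(hits))  # -> [2]
```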

17
Q

How do you create an inverted index?

A

Collect the documents to be indexed.

Tokenize the text, turning each document into a list of tokens; also remove the stop words.

Do linguistic pre-processing, such as stemming.

Record the documents that each term occurs in, creating the inverted index (a sketch follows below).
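A toy construction sketch in Python, assuming a per-document term extraction like the earlier pipeline (here simplified to lowercasing, splitting and a tiny stop-word list); the documents are invented:

```python
from collections import defaultdict

STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in"}  # toy list

def build_inverted_index(docs: dict[int, str]) -> dict[str, list[int]]:
    """Map each term to the sorted list of doc IDs it occurs in."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():               # tokenize
            token = token.strip(".,;:!?")
            if token and token not in STOP_WORDS:         # drop stop words
                postings[token].add(doc_id)               # record the posting
    return {term: sorted(ids) for term, ids in postings.items()}

docs = {1: "Caesar and Brutus", 2: "Brutus killed Caesar", 3: "The life of Caesar"}
print(build_inverted_index(docs)["brutus"])  # -> [1, 2]
```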

18
Q

What is web crawling?

A

Gathering pages from the web in order to index them and support a search engine. It should be quick and efficient, exploiting the link structure that interconnects the pages.

Crawlers are also known as spiders.

A crawler must have these features:

ROBUSTNESS: it must not get caught in spider traps (pages that mislead crawlers into fetching an infinite number of pages from some domain).

POLITENESS: websites have policies regulating the rate at which a crawler can visit them, and these must be respected.

19
Q

What features should a crawler provide?

A

DISTRIBUTED: have the capability to execute in a distributed fashion.

SCALABLE: the architecture should permit scaling up the crawl rate by adding extra machines and bandwidth.

PERFORMANCE: make efficient use of resources such as processors and storage.

QUALITY: be biased towards fetching the more useful pages first.

FRESHNESS: operate in continuous mode and fetch fresh copies of previously fetched pages.

EXTENSIBLE: cope with new data formats and new fetch protocols.

20
Q

How does a web crawler operate?

A

Begin with one or more URLs that constitute a SEED SET.

The crawler picks a URL from the seed set and fetches the corresponding web page.

Fetched pages are parsed, and their text and links are extracted.

The extracted text is fed to a text indexer.

The extracted links are added to the URL FRONTIER, which holds the URLs that are yet to be fetched (a sketch of the loop follows below).

CONTINUOUS CRAWLING: the URL of a fetched page is not deleted from the URL frontier but is fetched again in the future.
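A minimal, single-threaded sketch of the basic loop using only the Python standard library; the seed URL is hypothetical, link extraction is a crude regex, and a real crawler would add robots.txt checks, per-host politeness policies and robust HTML parsing:

```python
import re
import time
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen

seed_set = ["https://example.com/"]        # hypothetical seed URL
frontier = deque(seed_set)                 # URL frontier: URLs yet to be fetched
seen = set(seed_set)

while frontier and len(seen) < 20:         # small cap, just for the sketch
    url = frontier.popleft()
    try:
        html = urlopen(url, timeout=5).read().decode("utf-8", errors="ignore")
    except OSError:
        continue                           # robustness: skip pages that fail
    # The page text would be fed to the indexer here; links go into the frontier.
    for href in re.findall(r'href="([^"]+)"', html):
        link = urljoin(url, href)
        if link.startswith("http") and link not in seen:
            seen.add(link)
            frontier.append(link)
    time.sleep(1)                          # crude politeness delay
```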

21
Q

What is the Robots Exclusion Protocol?

A

Many hosts place portions of their websites off-limits to crawling.

This is done by placing a robots.txt file at the root of the URL hierarchy (see the sketch below).
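A brief sketch of checking robots.txt before fetching, using Python's standard urllib.robotparser; the host and path are hypothetical:

```python
from urllib import robotparser

# A robots.txt file might contain, for example:
#   User-agent: *
#   Disallow: /private/
rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")   # hypothetical host
rp.read()                                      # fetch and parse robots.txt

# A polite crawler checks permission before fetching a page.
print(rp.can_fetch("*", "https://example.com/private/page.html"))
```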