information retrieval: search engines, spiders and crawlers Flashcards
information retrieval
information retrieval is a field concerned with the
structure, analysis, organisation, storage, searching, and
retrieval of information.
* Distinction between a document and a database record
– most information is in the form of unstructured text.
* Need to design algorithms that can compare text – not
easy.
* Understanding how people compare texts, and designing algorithms to do this accurately, is the core of information retrieval.
big issues
Relevance
* Topical relevance and
user relevance
* Retrieval models
* Ranking algorithms
Evaluation
* Precision and recall
* Test collections
* Clickthrough and log
data
Information Needs
* Query suggestion
* Query expansion
* Relevance feedback
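As an illustration of precision and recall, a minimal sketch of how they might be computed for a single query against a test collection (the document ids below are made up):

```python
def precision_recall(retrieved, relevant):
    """Precision = relevant retrieved / retrieved; recall = relevant retrieved / relevant."""
    retrieved_set = set(retrieved)
    hits = len(retrieved_set & relevant)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# 3 of the 4 retrieved documents are relevant, but 2 relevant documents were missed
print(precision_recall(["d1", "d2", "d3", "d7"], {"d1", "d2", "d3", "d5", "d9"}))
# -> (0.75, 0.6)
```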
search engine architecture
Two primary goals:
1. Effectiveness – (quality) – want to retrieve the
most relevant set of documents possible for a
query
2. Efficiency – (speed) – want to process queries
from users as quickly as possible
UIMA – Unstructured Information Management
Architecture.
text acquisition
Crawler
* General web crawler most common
* Follows links to web pages to discover and download
new pages
* Challenges with handling huge volume of new pages
and ensuring that pages that have changed are kept
fresh for search engines
* Can be restricted to a single site for site search
* Focused or Topical crawlers – use classification
techniques to restrict pages that are visited
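A minimal sketch of the link-following idea; fetch() and extract_links() are hypothetical helpers, and a real crawler would also handle politeness, robots.txt and page freshness:

```python
from collections import deque

def crawl(seed_urls, max_pages=100):
    """Breadth-first crawl: download pages and follow their links to discover new ones."""
    frontier = deque(seed_urls)        # URLs waiting to be visited
    seen = set(seed_urls)
    pages = {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        html = fetch(url)                    # hypothetical: download the page
        pages[url] = html
        for link in extract_links(html):     # hypothetical: pull out <a href> targets
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages
```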
Feeds
* Document Feeds – access real time stream of
documents e.g. news feed
* Search engine acquires new documents by simply
monitoring the feed
* Common Standard RSS (Really Simple Syndication)
* Reader monitors feed and provides new content
when it arrives. Can be audio, video streams as well
as documents
Conversion
* Documents in a feed are rarely in plain text
* Search engines require them to be converted into a
consistent text plus metadata format.
* Text transformation component
* Encoding – ASCII
Document Data Store
* A database to manage large numbers of
documents and the structured data associated
with them.
* Contents typically stored in compressed form
* Structured data consists of metadata and other
information such as links, anchor text
* Can use a relational database but some
applications use simpler more efficient storage
system to provide fast retrieval times.
text transformation
Parser
* Responsible for processing the sequence of text
tokens
* In many cases tokens are the same as words
* Tokenising can be non-trivial
* Uses knowledge of the syntax of the markup
language to identify the structure
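A small illustration of why tokenising is non-trivial: even a naive split on non-alphanumeric characters forces decisions about case, hyphens and apostrophes (this is only one possible policy):

```python
import re

def tokenize(text):
    """Lowercase the text and split on runs of non-alphanumeric characters."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

print(tokenize("O'Brien's web-page at www.example.com!"))
# -> ['o', 'brien', 's', 'web', 'page', 'at', 'www', 'example', 'com']
```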
Stopping
* Removes common words from the stream of tokens
that become index terms
* Most common words are function words that help
form sentence structure e.g. ‘the’, ‘of’ , ‘to’ and ‘for’
* Removal reduces the size of the index
* Doesn’t affect the search engine’s effectiveness
* Can be difficult to decide what words to have on the
stop-word list
Stemming
* Group words together that are derived from a
common stem e.g. ‘fish’, ‘fishes’ and ‘fishing’
* Replace each member of a group with one
designated word – increase the likelihood that
words in queries and documents will match
* May not be effective for some languages
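A sketch of stopping and stemming applied to a token stream; the stop-word list is deliberately tiny and the suffix-stripping stemmer is a crude stand-in for a real one such as Porter's:

```python
STOP_WORDS = {"the", "of", "to", "for", "and", "a", "in"}   # illustrative list only

def stop(tokens):
    """Remove common function words from the token stream."""
    return [t for t in tokens if t not in STOP_WORDS]

def stem(token):
    """Crude stemmer: strip a few common suffixes so related words share one form."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

tokens = ["the", "fishes", "and", "fishing", "of", "fish"]
print([stem(t) for t in stop(tokens)])
# -> ['fish', 'fish', 'fish']
```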
Link extraction and
analysis
* Analysis – popularity and
authority of a page
* Anchor text – enhances the text content of the page that the link points to.
Information extraction
* Identify complex index
terms
* Syntactic analysis
Classifier
* Identifies class-related
metadata
* Assigns labels to documents representing topic categories, e.g. sport, spam or advertising
* Uses clustering techniques to group documents without predefined categories
index creation
Document Statistics
- Gathers and records statistical
information about words,
features and documents.
- Information is then used by
Ranking component
- Actual data is determined by
the retrieval model and ranking
algorithm.
- Document statistics are stored
in lookup tables – data
structures designed for fast
retrieval
Weighting
* Calculates the weights using
document statistics and
stores in lookup tables
* Could be calculated as part
of query process but better
during indexing process –
makes query process more
efficient
* tf.idf weighting – gives high weights to terms that occur frequently in a document but in very few documents overall
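One common form of the weight is tf.idf = tf × log(N / df), where tf is the term's frequency in the document, df the number of documents containing the term, and N the collection size. A sketch (other variants exist):

```python
import math

def tf_idf(term_freq, doc_freq, num_docs):
    """tf.idf = tf * log(N / df): frequent in this document, rare in the collection."""
    if term_freq == 0 or doc_freq == 0:
        return 0.0
    return term_freq * math.log(num_docs / doc_freq)

# A term in only 10 of 1,000,000 documents gets a far higher weight
# than one that appears in 900,000 of them.
print(tf_idf(3, 10, 1_000_000))        # ~34.5
print(tf_idf(3, 900_000, 1_000_000))   # ~0.3
```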
Inversion – core of the indexing process.
* Changes the stream of document-term information into term-document information, for the creation of inverted indices
* Challenge is to do this efficiently
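A minimal sketch of inversion: per-document term lists become a term → posting-list mapping (document statistics and weights omitted for brevity):

```python
from collections import defaultdict

def invert(documents):
    """Turn {doc_id: [terms]} into an inverted index: term -> sorted list of doc ids."""
    index = defaultdict(set)
    for doc_id, terms in documents.items():
        for term in terms:
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {
    1: ["fish", "tropical", "fish"],
    2: ["tropical", "aquarium"],
    3: ["fish", "aquarium"],
}
index = invert(docs)
print(index["fish"])      # [1, 3]
print(index["aquarium"])  # [2, 3]
```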
- Index distribution – distributes indexes across multiple computers/sites on a network
user interaction
Query Input
* Provides an interface and parser for a query language
* Simple query language for most search engines –
only a small number of operators
* One example – use of quotation marks (" ") to indicate that the enclosed words should occur as a phrase in the document.
* A typical web query consists of a small number of
keywords with no operators
* Boolean query languages: AND, OR and NOT.
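As a sketch of how a Boolean AND query could be answered with an inverted index of the kind built above, the posting lists for the query terms are simply intersected:

```python
def boolean_and(index, query_terms):
    """Return the ids of documents that contain every query term."""
    posting_lists = [set(index.get(term, [])) for term in query_terms]
    if not posting_lists:
        return []
    return sorted(set.intersection(*posting_lists))

index = {"fish": [1, 3], "tropical": [1, 2], "aquarium": [2, 3]}
print(boolean_and(index, ["fish", "aquarium"]))  # -> [3]
```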
Query Transformation
* Tokenisation, stopping and
stemming must be done on
the query text to produce
index terms that are
comparable to the
document terms
* Spell-checking and query
suggestion
* Query expansion techniques
also suggest additional
terms.
Results output
* Component is responsible
for constructing display of
ranked documents
* Generates snippets to summarise retrieved documents, clusters output to identify related groups of documents, and highlights important words and passages in documents.
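A sketch of snippet generation using a simple heuristic (pick the sentence containing the most query terms); real engines use more elaborate passage scoring and highlighting:

```python
import re

def snippet(document_text, query_terms):
    """Return the sentence of the document that mentions the most query terms."""
    sentences = re.split(r"(?<=[.!?])\s+", document_text)
    def overlap(sentence):
        words = set(re.findall(r"[a-z0-9]+", sentence.lower()))
        return sum(1 for term in query_terms if term in words)
    return max(sentences, key=overlap)

doc = "Tropical fish need warm water. Aquarium care is simple. The weather was cold."
print(snippet(doc, ["tropical", "fish"]))
# -> "Tropical fish need warm water."
```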
ranking
Scoring (query
processing) calculates
scores for documents
using the ranking
algorithm.
Performance Optimisation
* Design of ranking
algorithms to decrease
response time and
increase query
throughput
* Number of different ways
to calculate scoring:
* Term-at-a-time scoring
* Document-at-a-time
scoring
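A sketch of term-at-a-time scoring: each query term's posting list is processed in full, accumulating partial scores per document (simple additive weights here; the actual weights come from the retrieval model):

```python
from collections import defaultdict

def term_at_a_time(query_terms, postings):
    """Score documents term by term; postings maps term -> [(doc_id, weight), ...]."""
    accumulators = defaultdict(float)
    for term in query_terms:
        for doc_id, weight in postings.get(term, []):
            accumulators[doc_id] += weight
    # Rank documents by descending accumulated score
    return sorted(accumulators.items(), key=lambda item: item[1], reverse=True)

postings = {
    "tropical": [(1, 1.2), (2, 0.8)],
    "fish":     [(1, 2.0), (3, 1.5)],
}
print(term_at_a_time(["tropical", "fish"], postings))
# -> [(1, 3.2), (3, 1.5), (2, 0.8)]
```

Document-at-a-time scoring instead walks all the query terms' posting lists in parallel, finishing one document's score before moving on to the next document.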
Distribution
* Ranking can also be
distributed
* Query Broker
* Caching
evaluation
Logging – one of the
most valuable sources of
info for tuning and
improving search engines
- Ranking Analysis – uses log data/relevance judgement pairs to measure and compare ranking algorithm effectiveness
- Performance Analysis component – monitoring and improving overall performance: response time, throughput, network usage
- Simulations
crawl and feeds
Crawling – finding and downloading web pages
automatically. Occasionally known as spidering and
a crawler is sometimes called a spider.
* Sheer scale
* Web pages usually not under the control of those
building the search engine database
retrieving web pages
- Client program connects to a domain name system (DNS) server
- DNS server translates the hostname into an internet protocol (IP) address
- The program then attempts to connect to a server computer with that IP address
- Once the connection is established, the client program sends an HTTP request to the web server to request a page
- GET request
- POST request
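A sketch of those steps in Python: socket.gethostbyname makes the DNS translation explicit, and urlopen then opens the connection and sends the GET request (the hostname is just an example):

```python
import socket
from urllib.request import urlopen

hostname = "www.example.com"

# DNS: translate the hostname into an IP address
ip_address = socket.gethostbyname(hostname)
print(hostname, "->", ip_address)

# Connect to the web server and send an HTTP GET request for a page
with urlopen("http://" + hostname + "/") as response:
    page = response.read()
print(len(page), "bytes downloaded")
```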
web crawler
If a web server is not very powerful, it may spend all its time handling requests from the crawler
* Tends to make web server
administrators angry
* Politeness Policies
* Request queue is split into
a single queue per web
server
* Politeness Window
* e.g. a crawler can fetch 100 pages per second
* Politeness policy = one page every 30 seconds per server
* Would need URLs from at least 100 × 30 = 3,000 different web servers in its request queue in order to achieve that throughput
* Many URLs will come from the same servers – so the request queue would need tens of thousands of URLs on it before the crawler can reach peak throughput
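A sketch of a polite request queue, split into one queue per web server with an assumed 30-second politeness window (the class and constant names are illustrative):

```python
import time
from collections import defaultdict, deque
from urllib.parse import urlparse

POLITENESS_WINDOW = 30  # seconds between requests to the same server (assumed policy)

class PoliteFrontier:
    """Request queue split into a single queue per web server."""

    def __init__(self):
        self.queues = defaultdict(deque)   # hostname -> URLs waiting to be fetched
        self.last_fetch = {}               # hostname -> time of the last request

    def add(self, url):
        self.queues[urlparse(url).hostname].append(url)

    def next_url(self):
        """Return a URL whose server has not been contacted inside the window, if any."""
        now = time.time()
        for host, queue in self.queues.items():
            if queue and now - self.last_fetch.get(host, 0.0) >= POLITENESS_WINDOW:
                self.last_fetch[host] = now
                return queue.popleft()
        return None  # every queued server is still inside its politeness window
```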
focused crawling
Focused or topical crawling –
less expensive approach
* Relies on the fact that web
pages tend to have links to
other pages on the same topic
* In practice – use a number of
popular pages as seeds
* Use text classifiers to
determine what page is about
* If a page is on topic, the crawler keeps the page and uses links from the page to find other related sites.
* Anchor text is important
* Crawler keeps track of
topicality of downloaded
pages
* Uses this to determine
whether to download
similar pages
* Anchor text data and page
link topicality data can be
combined together to
determine which pages
should be crawled next
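A sketch of the topicality check, with a hypothetical keyword-overlap score standing in for a trained text classifier:

```python
def on_topic(page_text, topic_terms, threshold=0.02):
    """Hypothetical test: keep a page if enough of its tokens are topic terms."""
    tokens = page_text.lower().split()
    if not tokens:
        return False
    hits = sum(1 for token in tokens if token in topic_terms)
    return hits / len(tokens) >= threshold

topic = {"coral", "reef", "fish", "aquarium"}
print(on_topic("Tropical fish thrive in a well kept aquarium", topic))  # True
```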
deep web
Difficult for a crawler to find
* Three categories:
* Private Sites – need an account
* Form results – flight timetables
* Scripted pages - JavaScript
document feeds
Many documents are published – created at a fixed time and rarely updated again
* Since they have an associated time – can be ordered in
a sequence called a document feed.
* Particularly interesting to crawlers – easy to examine for new documents by looking at just the end of the feed.
* Two kinds of feeds – Push and Pull
* Push alerts the subscriber to new documents.
* A Pull feed requires the subscriber to check periodically
* Most common format for pull feeds is called RSS
* Really Simple Syndication
* RDF Site Summary
* Rich Site Summary
* Each entry contains a time of publication; the feed also has a ttl tag – time to live – measured in minutes
* Advantages over traditional pages (for crawlers):
* Give a natural structure to the data
* Easy to parse and contain detailed time information
* Single location to look for new data
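A sketch of pulling an RSS feed with the standard library and reading each entry's publication time plus the feed's ttl (the URL is only illustrative, and real feeds vary in which tags they provide):

```python
import xml.etree.ElementTree as ET
from urllib.request import urlopen

FEED_URL = "http://example.com/news/feed.rss"   # illustrative URL

with urlopen(FEED_URL) as response:
    channel = ET.parse(response).getroot().find("channel")

print("ttl (minutes):", channel.findtext("ttl"))   # how long the feed can be cached

for item in channel.findall("item"):
    print(item.findtext("pubDate"), "-", item.findtext("title"))
```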