information retrieval: search engines, spiders and crawlers Flashcards
information retrieval
information retrieval is a field concerned with the
structure, analysis, organisation, storage, searching, and
retrieval of information.
* Distinction between a document and a database record
– most information is in the form of unstructured text.
* Need to design algorithms that can compare text – not
easy.
* Understanding how people compare texts, and designing algorithms to do this accurately, is the core of information retrieval.
big issues
Relevance
* Topical relevance and
user relevance
* Retrieval models
* Ranking algorithms
Evaluation
* Precision and recall
* Test collections
* Clickthrough and log
data
Information Needs
* Query suggestion
* Query expansion
* Relevance feedback
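As an illustration of precision and recall, a minimal sketch of how they might be computed for a single query against a test collection (the document ids below are made up):

```python
def precision_recall(retrieved, relevant):
    """Precision = relevant retrieved / retrieved; recall = relevant retrieved / relevant."""
    retrieved_set = set(retrieved)
    hits = len(retrieved_set & relevant)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# 3 of the 4 retrieved documents are relevant, but 2 relevant documents were missed
print(precision_recall(["d1", "d2", "d3", "d7"], {"d1", "d2", "d3", "d5", "d9"}))
# -> (0.75, 0.6)
```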
search engine architecture
Two primary goals:
1. Effectiveness – (quality) – want to retrieve the
most relevant set of documents possible for a
query
2. Efficiency – (speed) – want to process queries
from users as quickly as possible
UIMA – Unstructured Information Management
Architecture.
text acquisition
Crawler
* General web crawler most common
* Follows links to web pages to discover and download
new pages
* Challenges with handling huge volume of new pages
and ensuring that pages that have changed are kept
fresh for search engines
* Can be restricted to a single site for site search
* Focused or Topical crawlers – use classification
techniques to restrict pages that are visited
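A minimal sketch of the link-following idea; fetch() and extract_links() are hypothetical helpers, and a real crawler would also handle politeness, robots.txt and page freshness:

```python
from collections import deque

def crawl(seed_urls, max_pages=100):
    """Breadth-first crawl: download pages and follow their links to discover new ones."""
    frontier = deque(seed_urls)        # URLs waiting to be visited
    seen = set(seed_urls)
    pages = {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        html = fetch(url)                    # hypothetical: download the page
        pages[url] = html
        for link in extract_links(html):     # hypothetical: pull out <a href> targets
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages
```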
Feeds
* Document Feeds – access real time stream of
documents e.g. news feed
* Search engine acquires new documents by simply
monitoring the feed
* Common Standard RSS (Really Simple Syndication)
* Reader monitors feed and provides new content
when it arrives. Can be audio, video streams as well
as documents
Conversion
* Documents in a feed are rarely in plain text
* Search engines require them to be converted into a
consistent text plus metadata format.
* Text transformation component
* Encoding – ASCII
Document Data Store
* A database to manage large numbers of
documents and the structured data associated
with them.
* Contents typically stored in compressed form
* Structured data consists of metadata and other
information such as links, anchor text
* Can use a relational database but some
applications use simpler more efficient storage
system to provide fast retrieval times.
text transformation
Parser
* Responsible for processing the sequence of text
tokens
* In many cases tokens are the same as words
* Tokenising can be non-trivial
* Uses knowledge of the syntax of the markup
language to identify the structure
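A small illustration of why tokenising is non-trivial: even a naive split on non-alphanumeric characters forces decisions about case, hyphens and apostrophes (this is only one possible policy):

```python
import re

def tokenize(text):
    """Lowercase the text and split on runs of non-alphanumeric characters."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

print(tokenize("O'Brien's web-page at www.example.com!"))
# -> ['o', 'brien', 's', 'web', 'page', 'at', 'www', 'example', 'com']
```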
Stopping
* Removes common words from the stream of tokens
that become index terms
* Most common words are function words that help
form sentence structure e.g. ‘the’, ‘of’ , ‘to’ and ‘for’
* Removal reduces the size of the index
* Doesn’t affect the search engine’s effectiveness
* Can be difficult to decide what words to have on the
stop-word list
Stemming
* Group words together that are derived from a
common stem e.g. ‘fish’, ‘fishes’ and ‘fishing’
* Replace each member of a group with one
designated word – increase the likelihood that
words in queries and documents will match
* May not be effective for some languages
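A sketch of stopping and stemming applied to a token stream; the stop-word list is deliberately tiny and the suffix-stripping stemmer is a crude stand-in for a real one such as Porter's:

```python
STOP_WORDS = {"the", "of", "to", "for", "and", "a", "in"}   # illustrative list only

def stop(tokens):
    """Remove common function words from the token stream."""
    return [t for t in tokens if t not in STOP_WORDS]

def stem(token):
    """Crude stemmer: strip a few common suffixes so related words share one form."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

tokens = ["the", "fishes", "and", "fishing", "of", "fish"]
print([stem(t) for t in stop(tokens)])
# -> ['fish', 'fish', 'fish']
```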
Link extraction and
analysis
* Analysis – popularity and
authority of a page
* Anchor text – enhances the text content of the page that the link points to.
Information extraction
* Identify complex index
terms
* Syntactic analysis
Classifier
* Identifies class-related
metadata
* Assigns labels to documents representing topic categories, e.g. sport, spam or advertising
* Uses clustering techniques to group documents without predefined categories
index creation
Document Statistics
- Gathers and records statistical
information about words,
features and documents.
- Information is then used by
Ranking component
- Actual data is determined by
the retrieval model and ranking
algorithm.
- Document statistics are stored
in lookup tables – data
structures designed for fast
retrieval
Weighting
* Calculates the weights using
document statistics and
stores in lookup tables
* Could be calculated as part
of query process but better
during indexing process –
makes query process more
efficient
* tf.idf weighting – gives high weights to terms that occur frequently in a document but in very few documents overall
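One common form of the weight is tf.idf = tf × log(N / df), where tf is the term's frequency in the document, df the number of documents containing the term, and N the collection size. A sketch (other variants exist):

```python
import math

def tf_idf(term_freq, doc_freq, num_docs):
    """tf.idf = tf * log(N / df): frequent in this document, rare in the collection."""
    if term_freq == 0 or doc_freq == 0:
        return 0.0
    return term_freq * math.log(num_docs / doc_freq)

# A term in only 10 of 1,000,000 documents gets a far higher weight
# than one that appears in 900,000 of them.
print(tf_idf(3, 10, 1_000_000))        # ~34.5
print(tf_idf(3, 900_000, 1_000_000))   # ~0.3
```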
Inversion – core of the indexing process.
* Changes the stream of document-term information into term-document information, for the creation of inverted indices
* Challenge is to do this efficiently
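A minimal sketch of inversion: per-document term lists become a term → posting-list mapping (document statistics and weights omitted for brevity):

```python
from collections import defaultdict

def invert(documents):
    """Turn {doc_id: [terms]} into an inverted index: term -> sorted list of doc ids."""
    index = defaultdict(set)
    for doc_id, terms in documents.items():
        for term in terms:
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {
    1: ["fish", "tropical", "fish"],
    2: ["tropical", "aquarium"],
    3: ["fish", "aquarium"],
}
index = invert(docs)
print(index["fish"])      # [1, 3]
print(index["aquarium"])  # [2, 3]
```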
- Index distribution – distributes indexes across multiple computers/sites on a network
user interaction
Query Input
* Provides an interface and parser for a query language
* Simple query language for most search engines –
only a small number of operators
* One example – use of quotation marks (" ") to indicate that the enclosed words should occur as a phrase in the document.
* A typical web query consists of a small number of
keywords with no operators
* Boolean query languages: AND, OR and NOT.
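As a sketch of how a Boolean AND query could be answered with an inverted index of the kind built above, the posting lists for the query terms are simply intersected:

```python
def boolean_and(index, query_terms):
    """Return the ids of documents that contain every query term."""
    posting_lists = [set(index.get(term, [])) for term in query_terms]
    if not posting_lists:
        return []
    return sorted(set.intersection(*posting_lists))

index = {"fish": [1, 3], "tropical": [1, 2], "aquarium": [2, 3]}
print(boolean_and(index, ["fish", "aquarium"]))  # -> [3]
```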
Query Transformation
* Tokenisation, stopping and
stemming must be done on
the query text to produce
index terms that are
comparable to the
document terms
* Spell-checking and query
suggestion
* Query expansion techniques
also suggest additional
terms.
Results output
* Component is responsible
for constructing display of
ranked documents
* Generates snippets to summarise retrieved documents, clusters output to identify related groups of documents, and highlights important words and passages in documents.
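A sketch of snippet generation using a simple heuristic (pick the sentence containing the most query terms); real engines use more elaborate passage scoring and highlighting:

```python
import re

def snippet(document_text, query_terms):
    """Return the sentence of the document that mentions the most query terms."""
    sentences = re.split(r"(?<=[.!?])\s+", document_text)
    def overlap(sentence):
        words = set(re.findall(r"[a-z0-9]+", sentence.lower()))
        return sum(1 for term in query_terms if term in words)
    return max(sentences, key=overlap)

doc = "Tropical fish need warm water. Aquarium care is simple. The weather was cold."
print(snippet(doc, ["tropical", "fish"]))
# -> "Tropical fish need warm water."
```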
ranking
Scoring (query
processing) calculates
scores for documents
using the ranking
algorithm.
Performance Optimisation
* Design of ranking
algorithms to decrease
response time and
increase query
throughput
* Number of different ways
to calculate scoring:
* Term-at-a-time scoring
* Document-at-a-time
scoring
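A sketch of term-at-a-time scoring: each query term's posting list is processed in full, accumulating partial scores per document (simple additive weights here; the actual weights come from the retrieval model):

```python
from collections import defaultdict

def term_at_a_time(query_terms, postings):
    """Score documents term by term; postings maps term -> [(doc_id, weight), ...]."""
    accumulators = defaultdict(float)
    for term in query_terms:
        for doc_id, weight in postings.get(term, []):
            accumulators[doc_id] += weight
    # Rank documents by descending accumulated score
    return sorted(accumulators.items(), key=lambda item: item[1], reverse=True)

postings = {
    "tropical": [(1, 1.2), (2, 0.8)],
    "fish":     [(1, 2.0), (3, 1.5)],
}
print(term_at_a_time(["tropical", "fish"], postings))
# -> [(1, 3.2), (3, 1.5), (2, 0.8)]
```

Document-at-a-time scoring instead walks all the query terms' posting lists in parallel, finishing one document's score before moving on to the next document.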
Distribution
* Ranking can also be
distributed
* Query Broker
* Caching
evaluation
Logging – one of the
most valuable sources of
info for tuning and
improving search engines
- Ranking Analysis – uses log data/relevance judgement pairs to measure and compare ranking algorithm effectiveness
- Performance Analysis component – monitoring and improving overall performance: response time, throughput, network usage
- Simulations
crawl and feeds
Crawling – finding and downloading web pages
automatically. Occasionally known as spidering and
a crawler is sometimes called a spider.
* Sheer scale
* Web pages usually not under the control of those
building the search engine database
retrieving web pages
- Client program connects to a domain name system (DNS) server
- DNS server translates the hostname into an internet protocol (IP) address
- The program then attempts to connect to a server computer with that IP address
- Once the connection is established, the client program sends an HTTP request to the web server to request a page
- GET request
- POST request
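A sketch of those steps in Python: socket.gethostbyname makes the DNS translation explicit, and urlopen then opens the connection and sends the GET request (the hostname is just an example):

```python
import socket
from urllib.request import urlopen

hostname = "www.example.com"

# DNS: translate the hostname into an IP address
ip_address = socket.gethostbyname(hostname)
print(hostname, "->", ip_address)

# Connect to the web server and send an HTTP GET request for a page
with urlopen("http://" + hostname + "/") as response:
    page = response.read()
print(len(page), "bytes downloaded")
```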
web crawler
If a web server is not very powerful, it may spend all its time handling requests from the crawler
* Tends to make web server
administrators angry
* Politeness Policies
* Request queue is split into
a single queue per web
server
* Politeness Window
* e.g. a crawler can fetch 100 pages per second
* Politeness policy = one page every 30 seconds per server
* Would need URLs from at least 100 × 30 = 3,000 different web servers in its request queue in order to achieve that throughput
* Many URLs will come from the same servers – so the request queue would need tens of thousands of URLs on it before the crawler can reach peak throughput
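A sketch of a polite request queue, split into one queue per web server with an assumed 30-second politeness window (the class and constant names are illustrative):

```python
import time
from collections import defaultdict, deque
from urllib.parse import urlparse

POLITENESS_WINDOW = 30  # seconds between requests to the same server (assumed policy)

class PoliteFrontier:
    """Request queue split into a single queue per web server."""

    def __init__(self):
        self.queues = defaultdict(deque)   # hostname -> URLs waiting to be fetched
        self.last_fetch = {}               # hostname -> time of the last request

    def add(self, url):
        self.queues[urlparse(url).hostname].append(url)

    def next_url(self):
        """Return a URL whose server has not been contacted inside the window, if any."""
        now = time.time()
        for host, queue in self.queues.items():
            if queue and now - self.last_fetch.get(host, 0.0) >= POLITENESS_WINDOW:
                self.last_fetch[host] = now
                return queue.popleft()
        return None  # every queued server is still inside its politeness window
```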
focused crawling
Focused or topical crawling –
less expensive approach
* Relies on the fact that web
pages tend to have links to
other pages on the same topic
* In practice – use a number of
popular pages as seeds
* Use text classifiers to
determine what page is about
* If a page is on topic, the crawler keeps the page and uses links from the page to find other related sites.
* Anchor text is important
* Crawler keeps track of
topicality of downloaded
pages
* Uses this to determine
whether to download
similar pages
* Anchor text data and page
link topicality data can be
combined together to
determine which pages
should be crawled next
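A sketch of the topicality check, with a hypothetical keyword-overlap score standing in for a trained text classifier:

```python
def on_topic(page_text, topic_terms, threshold=0.02):
    """Hypothetical test: keep a page if enough of its tokens are topic terms."""
    tokens = page_text.lower().split()
    if not tokens:
        return False
    hits = sum(1 for token in tokens if token in topic_terms)
    return hits / len(tokens) >= threshold

topic = {"coral", "reef", "fish", "aquarium"}
print(on_topic("Tropical fish thrive in a well kept aquarium", topic))  # True
```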
deep web
Difficult for a crawler to find
* Three categories:
* Private Sites – need an account
* Form results – flight timetables
* Scripted pages - JavaScript
document feeds
Many documents are published – created at a fixed time and rarely updated again
* Since they have an associated time – can be ordered in
a sequence called a document feed.
* Particularly interesting to crawlers – easy to examine for new documents by looking at just the end of the feed.
* Two kinds of feeds – Push and Pull
* Push alerts the subscriber to new documents.
* A Pull feed requires the subscriber to check periodically
* Most common format for pull feeds is called RSS
* Really Simple Syndication
* RDF Site Summary
* Rich Site Summary
* Each entry contains a time of publication; the feed also has a ttl tag – time to live – measured in minutes
* Advantages over traditional pages (for crawlers):
* Give a natural structure to the data
* Easy to parse and contain detailed time information
* Single location to look for new data
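A sketch of pulling an RSS feed with the standard library and reading each entry's publication time plus the feed's ttl (the URL is only illustrative, and real feeds vary in which tags they provide):

```python
import xml.etree.ElementTree as ET
from urllib.request import urlopen

FEED_URL = "http://example.com/news/feed.rss"   # illustrative URL

with urlopen(FEED_URL) as response:
    channel = ET.parse(response).getroot().find("channel")

print("ttl (minutes):", channel.findtext("ttl"))   # how long the feed can be cached

for item in channel.findall("item"):
    print(item.findtext("pubDate"), "-", item.findtext("title"))
```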