information retrieval: search engines, spiders and crawlers Flashcards

1
Q

information retrieval

A

Information retrieval is a field concerned with the
structure, analysis, organisation, storage, searching, and
retrieval of information.
* Distinction between a document and a database record
– most information is in the form of unstructured text.
* Need to design algorithms that can compare text – not
easy.
* Understanding how people compare texts, and
designing algorithms to do this, is the core of
information retrieval.

2
Q

big issues

A

Relevance
* Topical relevance and
user relevance
* Retrieval models
* Ranking algorithms
Evaluation
* Precision and recall
* Test collections
* Clickthrough and log
data
Information Needs
* Query suggestion
* Query expansion
* Relevance feedback
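Precision and recall can be computed directly from a retrieved set and a set of relevance judgements. A minimal sketch, using hypothetical document IDs rather than a real test collection:

```python
# Precision and recall for a single query, given relevance judgements.
# The document IDs below are illustrative only.
def precision_recall(retrieved, relevant):
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d7"])
# Two of the four retrieved documents are relevant (precision 2/4),
# and two of the three relevant documents were found (recall 2/3).
print(p, r)
```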

3
Q

Search Engine -
Architecture

A

Two primary goals:
1. Effectiveness – (quality) – want to retrieve the
most relevant set of documents possible for a
query
2. Efficiency – (speed) – want to process queries
from users as quickly as possible
UIMA – Unstructured Information Management
Architecture.

4
Q

text acquisition

A

Crawler
* General web crawler most common
* Follows links to web pages to discover and download
new pages
* Challenges with handling huge volume of new pages
and ensuring that pages that have changed are kept
fresh for search engines
* Can be restricted to a single site for site search
* Focused or Topical crawlers – use classification
techniques to restrict pages that are visited

Feeds
* Document Feeds – access real time stream of
documents e.g. news feed
* Search engine acquires new documents by simply
monitoring the feed
* Common standard: RSS (Really Simple Syndication)
* Reader monitors feed and provides new content
when it arrives. Can be audio, video streams as well
as documents

Conversion
* Documents in a feed are rarely in plain text
* Search engines require them to be converted into a
consistent text-plus-metadata format.
* Text transformation component
* Encoding – e.g. ASCII

Document Data Store
* A database to manage large numbers of
documents and the structured data associated
with them.
* Contents typically stored in compressed form
* Structured data consists of metadata and other
information such as links, anchor text
* Can use a relational database but some
applications use simpler more efficient storage
system to provide fast retrieval times.

5
Q

text transformation

A

Parser
* Responsible for processing the sequence of text
tokens
* In many cases tokens are the same as words
* Tokenising can be non-trivial
* Uses knowledge of the syntax of the markup
language to identify the structure

Stopping
* Removes common words from the stream of tokens
that become index terms
* Most common words are function words that help
form sentence structure, e.g. ‘the’, ‘of’, ‘to’ and ‘for’
* Removal reduces the size of the index
* Doesn’t affect the search engine’s effectiveness
* Can be difficult to decide what words to have on the
stop-word list

Stemming
* Group words together that are derived from a
common stem e.g. ‘fish’, ‘fishes’ and ‘fishing’
* Replace each member of a group with one
designated word – increase the likelihood that
words in queries and documents will match
* May not be effective for some languages

Link extraction and
analysis
* Analysis – popularity and
authority of a page
* Anchor text – enhance
the text content of a page
that the link points to.

Information extraction
* Identify complex index
terms
* Syntactic analysis

Classifier
* Identifies class-related
metadata
* Assign labels to
documents representing
topic categories e.g. sport
or spam or advertising
* Use clustering techniques
to group documents
without predefined
categories
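The tokenising, stopping, and stemming steps above can be sketched as a small pipeline. This is a toy version: the stop-word list is abbreviated and the suffix-stripping stemmer is a naive stand-in for a real algorithm such as Porter stemming.

```python
# Minimal text-transformation pipeline: tokenise, stop, stem.
import re

STOP_WORDS = {"the", "of", "to", "for", "a", "and", "in"}  # abbreviated list

def tokenise(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def stem(token):
    # Toy stemmer: strips a few common English suffixes only.
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def index_terms(text):
    return [stem(t) for t in tokenise(text) if t not in STOP_WORDS]

print(index_terms("The fishes of the sea and fishing for food"))
# 'fishes' and 'fishing' both reduce to the stem 'fish'
```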

6
Q

index creation

A

Document Statistics
- Gathers and records statistical
information about words,
features and documents.
- Information is then used by
Ranking component
- Actual data is determined by
the retrieval model and ranking
algorithm.
- Document statistics are stored
in lookup tables – data
structures designed for fast
retrieval

Weighting
* Calculates the weights using
document statistics and
stores in lookup tables
* Could be calculated as part
of query process but better
during indexing process –
makes query process more
efficient
* tf.idf weighting – gives high
weights to terms that occur
in very few documents

Inversion – the core of the indexing process.
* Changes the stream of document–term information
into term–document information, for the creation
of inverted indices
* Challenge is to do this efficiently
  • Index distribution -
    distributes indexes
    across multiple
    computers/sites on a
    network
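Inversion and tf.idf weighting can be sketched together: walk each document's term stream, then emit a term-to-document index with weights computed at indexing time. The three-document collection below is hypothetical.

```python
# Inversion: document->term streams become a term->document inverted
# index, with tf.idf weights precomputed during indexing.
import math
from collections import Counter, defaultdict

docs = {  # hypothetical toy collection
    "d1": ["fish", "tropical", "fish"],
    "d2": ["tropical", "aquarium"],
    "d3": ["fish", "food"],
}

df = Counter()                         # document frequency per term
for terms in docs.values():
    df.update(set(terms))

N = len(docs)
inverted = defaultdict(list)           # term -> [(doc, tf.idf weight), ...]
for doc_id, terms in docs.items():
    for term, freq in Counter(terms).items():
        weight = freq * math.log(N / df[term])   # rare terms get high idf
        inverted[term].append((doc_id, weight))

# 'aquarium' occurs in only one document, so it carries the highest idf
print(dict(inverted))
```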
7
Q

user interaction

A

Query Input
* Provides an interface and parser for a query language
* Simple query language for most search engines –
only a small number of operators
* One example – use of quotation marks (“ ”) to indicate
that the enclosed words should occur as a phrase in the
document.
* A typical web query consists of a small number of
keywords with no operators
* Boolean query languages: AND, OR and NOT.

Query Transformation
* Tokenisation, stopping and
stemming must be done on
the query text to produce
index terms that are
comparable to the
document terms
* Spell-checking and query
suggestion
* Query expansion techniques
also suggest additional
terms.

Results output
* Component is responsible
for constructing display of
ranked documents
* Generate snippets to
summarise retrieved
documents, cluster output
to identify related group of
documents, highlight
important words and
passages in documents.

8
Q

ranking

A

Scoring (query
processing) calculates
scores for documents
using the ranking
algorithm.

Performance Optimisation
* Design of ranking
algorithms to decrease
response time and
increase query
throughput
* Number of different ways
to calculate scoring:
* Term-at-a-time scoring
* Document-at-a-time
scoring

Distribution
* Ranking can also be
distributed
* Query Broker
* Caching

9
Q

evaluation

A

Logging – one of the
most valuable sources of
info for tuning and
improving search engines

  • Ranking Analysis – uses
    log data/relevance
    judgement pairs to
    measure and compare
    ranking algorithm
    effectiveness
  • Performance Analysis
    component –
    monitoring and
    improving overall
    effectiveness
  • Response time,
    throughput, network
    usage
  • Simulations
10
Q

crawl and feeds

A

Crawling – finding and downloading web pages
automatically. Occasionally known as spidering and
a crawler is sometimes called a spider.
* Sheer scale
* Web pages usually not under the control of those
building the search engine database

11
Q

retrieving web pages

A
  • Client program connects to a domain name system
    (DNS) server.
  • DNS server translates the hostname into an
    Internet Protocol (IP) address
  • The program then attempts to connect to a server
    computer with that IP address
  • Once the connection is established, the client
    program sends an HTTP request to the web server to
    request a page
  • GET request
  • POST request
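The steps above can be sketched at the socket level: resolve the hostname via DNS, connect to the resulting IP address, and send a GET request. This is a bare-bones illustration (`example.com` is a placeholder host; real crawlers also handle redirects, timeouts, and robots.txt).

```python
# Hostname -> IP -> TCP connection -> HTTP GET, using only the stdlib.
import socket

def build_get_request(host, path="/"):
    # Raw bytes of a minimal HTTP/1.1 GET request
    return (f"GET {path} HTTP/1.1\r\n"
            f"Host: {host}\r\n"
            f"Connection: close\r\n\r\n").encode("ascii")

def fetch(host, path="/"):
    ip = socket.gethostbyname(host)            # DNS: hostname -> IP address
    with socket.create_connection((ip, 80)) as s:
        s.sendall(build_get_request(host, path))
        return b"".join(iter(lambda: s.recv(4096), b""))

print(build_get_request("example.com"))
```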
12
Q

web crawler

A

If a web server is not very
powerful, it may spend all
its time handling requests
from the crawler
* Tends to make web server
administrators angry
* Politeness Policies
* Request queue is split into
a single queue per web
server
* Politeness Window
* Example: a crawler can fetch 100 pages per second
* Politeness policy = one page
every 30 seconds
* Would need URLs from at least
3,000 different web servers in
its request queue in order to
achieve high throughput!
* Many URLs will come from the
same servers – so the request
queue would need to have
tens of thousands of URLs on it
before the crawler can reach
peak throughput
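The per-server queue and politeness window can be sketched as a small frontier data structure: a URL is only eligible when its server has not been contacted within the window. The 30-second window matches the policy above; the URLs are placeholders.

```python
# Per-server request queues with a 30-second politeness window.
import time
from collections import defaultdict, deque
from urllib.parse import urlparse

POLITENESS_WINDOW = 30.0  # seconds between requests to one server

class PoliteFrontier:
    def __init__(self):
        self.queues = defaultdict(deque)   # host -> pending URLs
        self.last_fetch = {}               # host -> time of last request

    def add(self, url):
        self.queues[urlparse(url).netloc].append(url)

    def next_url(self, now=None):
        now = time.time() if now is None else now
        for host, queue in self.queues.items():
            if queue and now - self.last_fetch.get(host, -1e9) >= POLITENESS_WINDOW:
                self.last_fetch[host] = now
                return queue.popleft()
        return None  # every queued host is inside its politeness window

f = PoliteFrontier()
f.add("http://a.example/1"); f.add("http://a.example/2"); f.add("http://b.example/1")
print(f.next_url(now=0.0), f.next_url(now=0.0), f.next_url(now=0.0))
# a.example and b.example each yield one URL; a.example/2 must wait 30 s
```

This structure makes the throughput arithmetic concrete: at one page per server per 30 seconds, sustaining 100 pages/second requires URLs from 100 × 30 = 3,000 distinct servers.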

13
Q

focused crawling

A

Focused or topical crawling –
less expensive approach
* Relies on the fact that web
pages tend to have links to
other pages on the same topic
* In practice – use a number of
popular pages as seeds
* Use text classifiers to
determine what page is about
* If page is on topic, keeps the
page and uses links from the
page to find other related sites.
* Anchor text is important
* Crawler keeps track of
topicality of downloaded
pages
* Uses this to determine
whether to download
similar pages
* Anchor text data and page
link topicality data can be
combined together to
determine which pages
should be crawled next
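The ideas above can be sketched as a priority-queue crawl where a topic classifier scores each page and only links from on-topic pages enter the frontier. The word-overlap "classifier" and the in-memory page table are toy stand-ins for a real text classifier and real fetching.

```python
# Focused-crawl sketch: score pages for topicality, keep on-topic pages,
# and enqueue their outlinks with priority from the parent's score.
import heapq

TOPIC_WORDS = {"fish", "aquarium", "tropical"}  # toy topic model

def topic_score(text):
    words = text.lower().split()
    return sum(w in TOPIC_WORDS for w in words) / max(len(words), 1)

def focused_crawl(seeds, pages, threshold=0.2, limit=10):
    # pages: url -> (text, outlinks); a stand-in for fetching real pages
    frontier = [(-1.0, url) for url in seeds]   # max-heap via negated scores
    heapq.heapify(frontier)
    visited, kept = set(), []
    while frontier and len(kept) < limit:
        _, url = heapq.heappop(frontier)
        if url in visited or url not in pages:
            continue
        visited.add(url)
        text, outlinks = pages[url]
        score = topic_score(text)
        if score >= threshold:                  # on topic: keep, follow links
            kept.append(url)
            for link in outlinks:
                heapq.heappush(frontier, (-score, link))
    return kept

pages = {
    "seed": ("tropical fish aquarium guide", ["a", "b"]),
    "a": ("tropical fish food", []),
    "b": ("stock market news report", []),
}
print(focused_crawl(["seed"], pages))  # 'b' is off topic and discarded
```

A fuller version would also score anchor text of each outlink and combine it with the parent page's topicality when setting priorities, as described above.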

14
Q

deep web

A

Difficult for a crawler to find
* Three categories:
* Private Sites – need an account
* Form results – flight timetables
* Scripted pages - JavaScript

15
Q

document feeds

A

Many published documents are created at a fixed time
and rarely updated again
* Since they have an associated time – can be ordered in
a sequence called a document feed.
* Particularly interesting to crawlers – easily examine for
new documents by looking at just the end of the feed.
* Two kinds of feeds – Push and Pull
* Push alerts the subscriber to new documents.
* A Pull feed requires the subscriber to check periodically
* Most common format for pull feeds is called RSS
* Really Simple Syndication
* RDF Site Summary
* Rich Site Summary
* Each entry contains a time of publication, and the feed
carries a ttl tag – time to live – measured in minutes
* Advantages over traditional pages (for crawlers):
* Give a natural structure to the data
* Easy to parse and contain detailed time information
* Single location to look for new data
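The structure described above (easy to parse, per-entry timestamps, a ttl refresh hint) can be seen by parsing a small RSS document with the standard library. The feed content is hypothetical.

```python
# Parsing a toy RSS pull feed: per-item pubDate plus channel-level ttl.
import xml.etree.ElementTree as ET

rss = """<rss version="2.0"><channel>
  <title>Example News</title>
  <ttl>60</ttl>
  <item><title>Story one</title><pubDate>Mon, 01 Jan 2024 09:00:00 GMT</pubDate></item>
  <item><title>Story two</title><pubDate>Mon, 01 Jan 2024 12:00:00 GMT</pubDate></item>
</channel></rss>"""

channel = ET.fromstring(rss).find("channel")
ttl_minutes = int(channel.findtext("ttl"))        # re-poll interval hint
items = [(i.findtext("title"), i.findtext("pubDate"))
         for i in channel.findall("item")]
print(ttl_minutes, items)
```

A crawler polling this feed would re-check roughly every `ttl` minutes and only fetch entries whose publication time is newer than its last visit.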

16
Q

the conversion problem

A
  • Text documents stored in incompatible formats –
    PDF, raw text, RTF, HTML, XML and others
  • Other kinds of files such as PPT and Excel
    spreadsheets and sometimes obsolete formats
  • Use conversion tools - tagged format (XML, HTML)
  • Readability not the concern of the search engine
17
Q

document storage

A
  • Documents need to be stored for indexing
  • Crawling for documents can be expensive in terms
    of CPU and network load.
  • It makes sense to keep documents around instead
    of trying to fetch them again the next time you
    want to build an index.
  • Storage systems can be a starting point for
    information extraction
18
Q

document storage: database

A
  • For many applications a database is an excellent
    place to store documents
  • However, few of the major search engines use a
    relational database
  • Sheer volume of data can overwhelm a traditional
    database
  • Database vendors expect that servers will use expensive
    disk systems – impractical given size
  • Use alternatives instead
19
Q

BigTable (Chang et al., 2006)

A
  • Used internally at Google
  • Distributed database system
    built for storing web pages
    originally
  • Really is a big table – over a
    petabyte in size – each
    database only contains one
    table
  • Table is split into small
    pieces called tablets, which
    are served by thousands of
    machines

Any changes to a BigTable tablet are recorded in
the transaction log which is also stored in a shared
file system
* If any tablet server crashes, then another server
can immediately read the tablet data and
transaction log from the file system and take over.
* BigTable stores its data in unchangeable files.

  • Can have a huge number of columns per row
  • Not all rows have the same columns
  • All anchor information in single record
  • Only a single disk read needed to read all of the
    document data.
  • Range-based partitioning
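Range-based partitioning can be sketched as follows: rows are kept sorted by key and split into contiguous key-range "tablets", so locating a row is a binary search over tablet start keys. The row keys and tablet size below are illustrative; real tablets are split by size, not row count.

```python
# Range-based partitioning: sorted row keys split into contiguous tablets.
import bisect

def split_into_tablets(row_keys, rows_per_tablet):
    keys = sorted(row_keys)
    return [keys[i:i + rows_per_tablet]
            for i in range(0, len(keys), rows_per_tablet)]

def find_tablet(tablets, row_key):
    # Binary search on each tablet's first key to locate the row's tablet
    starts = [t[0] for t in tablets]
    return max(bisect.bisect_right(starts, row_key) - 1, 0)

keys = ["com.example/a", "com.example/b", "org.example/a", "org.example/b"]
tablets = split_into_tablets(keys, 2)
print(tablets, find_tablet(tablets, "org.example/a"))
```

Because a whole row (including all its anchor columns) lives in one tablet, reading all of a document's data touches a single contiguous region, matching the single-disk-read point above.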