Spiders and crawlers Flashcards
Information Retrieval
A field concerned with the structure, analysis, organisation, storage, and retrieval of information.
Goals of Search Engines
Effectiveness - quality - retrieve the most relevant set of documents possible
Efficiency - speed - process users' queries and return results as quickly as possible
Indexing Process
Text acquisition, transformation and index creation
Text acquisition
Finds and acquires documents (e.g. via a crawler or feed) and stores them in a document data store
Text transformation
Remove duplicates from the store, classify information and organise data
Index creation
Create an index so information can be located quickly - stored in an index database that can be queried by the user
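As a rough illustration (not part of the flashcards), a minimal inverted index might look like the following Python sketch; the documents and whitespace tokenisation are invented for the example.

```python
from collections import defaultdict

def build_index(docs):
    """Build a minimal inverted index: term -> set of document IDs."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

docs = {1: "fish and chips", 2: "fishing boats", 3: "chips and salsa"}
index = build_index(docs)
print(index["chips"])   # {1, 3} -- the documents containing the term
```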
Query Process
Interaction, evaluation and log data
Interaction
The user's query is run against the index by a ranking algorithm, which retrieves the most relevant documents from the document data store for display
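A toy ranking sketch, assuming relevance is simply the number of query terms a document contains; real ranking algorithms are far more sophisticated, and the documents here are invented.

```python
def rank(query_terms, docs):
    """Toy ranking: score each document by how many query terms it contains."""
    scores = {}
    for doc_id, text in docs.items():
        tokens = set(text.lower().split())
        scores[doc_id] = sum(1 for t in query_terms if t in tokens)
    return sorted(scores, key=scores.get, reverse=True)

docs = {1: "fresh fish market", 2: "fishing boats for hire", 3: "city bus timetable"}
print(rank(["fish", "market"], docs))   # [1, 2, 3] -- doc 1 matches both terms
```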
Evaluation
Measure how relevant the results are, e.g. by asking the user to rate result relevance, so the ranking algorithm can be improved
Log data
Log user interactions (e.g. queries and clicks) and use the data to update the algorithm
Types of text acquisition
Crawler and feed
Crawler
The web crawler is the most common method of text acquisition: it fetches a web page, extracts its links, and follows them to discover further pages
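A minimal crawler sketch using only the Python standard library; the seed URL and page limit are placeholders, and it ignores robots.txt and politeness for brevity (see the Politeness Policy card below).

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkParser(HTMLParser):
    """Collects href values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, max_pages=10):
    """Breadth-first crawl: fetch a page, extract its links, queue new ones."""
    seen, queue, fetched = {seed}, deque([seed]), 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", errors="ignore")
        except (OSError, ValueError):
            continue                     # skip pages that fail to download
        fetched += 1
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return seen
```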
Challenges faced by crawler
Pages are constantly updated, so crawlers must be run very frequently to keep up; they may not be able to handle the huge volume of new pages
Can only operate on a single website
Focused crawler
Uses a classification technique to determine whether a page is relevant or not; will not visit pages deemed irrelevant
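A crude stand-in for a focused crawler's relevance test, using a hand-picked topic vocabulary instead of a trained classifier; the terms and threshold are invented for the example.

```python
RELEVANT_TERMS = {"fishing", "angling", "tackle"}   # hypothetical topic vocabulary

def is_relevant(page_text, threshold=2):
    """Crude stand-in for a trained classifier: count topic terms on the page."""
    tokens = page_text.lower().split()
    hits = sum(1 for t in tokens if t in RELEVANT_TERMS)
    return hits >= threshold

# A focused crawler would only queue links found on pages where is_relevant(...) is True.
```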
Feed
Real time stream of documents e.g. a news feed
Search engine acquires new documents simply by monitoring the feed
Feed Conversion
Documents in a feed are rarely plain text
Search engines require them to be converted into consistent text + metadata
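A sketch of feed conversion, assuming a tiny RSS-like feed; the fields kept as text and metadata are chosen for illustration only.

```python
import xml.etree.ElementTree as ET

RSS = """<rss><channel><item>
  <title>Example headline</title>
  <pubDate>Mon, 01 Jan 2024 09:00:00 GMT</pubDate>
  <description>Body text of the news item.</description>
</item></channel></rss>"""

def feed_to_documents(rss_xml):
    """Convert RSS items into a consistent text + metadata representation."""
    docs = []
    for item in ET.fromstring(rss_xml).iter("item"):
        docs.append({
            "text": item.findtext("description", default=""),
            "metadata": {
                "title": item.findtext("title", default=""),
                "published": item.findtext("pubDate", default=""),
            },
        })
    return docs

print(feed_to_documents(RSS))
```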
Document Data Store
Database to manage large numbers of documents and structured data (metadata) associated with them
Types of text transformation
Parser, stopping and stemming
Parser
Processes a sequence of text tokens
Uses knowledge of the document's syntax (e.g. markup tags) to identify the structure of the text/information
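A very crude parsing sketch, assuming HTML-like markup and regular expressions; a real parser would use proper markup handling, and the tag names here are just examples.

```python
import re

def parse(document):
    """Tokenise a document and use simple markup syntax to tag its structure."""
    tokens = []
    # Treat anything inside <h1>...</h1> as heading text, <p>...</p> as body text.
    for field, pattern in [("heading", r"<h1>(.*?)</h1>"), ("body", r"<p>(.*?)</p>")]:
        for match in re.findall(pattern, document, flags=re.S):
            tokens.extend((field, word) for word in match.lower().split())
    return tokens

print(parse("<h1>Fishing Guide</h1><p>How to catch fish.</p>"))
```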
Stopping
Removes common words from the stream of tokens e.g. ‘the’, ‘of’, ‘to’, ‘for’
Reduces the size of the index without significantly affecting result quality
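A minimal stopping sketch; the stopword list is a tiny example, not a real one.

```python
STOPWORDS = {"the", "of", "to", "for", "and", "a"}   # tiny example list

def remove_stopwords(tokens):
    """Drop very common words from a stream of tokens."""
    return [t for t in tokens if t not in STOPWORDS]

print(remove_stopwords(["the", "history", "of", "fishing"]))   # ['history', 'fishing']
```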
Stemming
Groups together words that derive from a common stem, e.g. 'fish', 'fishes' and 'fishing'
May not be effective for all languages
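A deliberately crude suffix-stripping stemmer to illustrate the idea; real systems typically use something like the Porter stemmer, and the suffix list here is invented.

```python
def crude_stem(word):
    """Very rough suffix-stripping stemmer, just to illustrate the idea."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print([crude_stem(w) for w in ["fish", "fishes", "fishing"]])   # ['fish', 'fish', 'fish']
```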
User interaction
Query input, query transformation and results output
Query input
A small number of keywords used to query the document collection, e.g. a web search query
Query transformation
Tokenisation, stopping and stemming must be applied to the query so it can be compared with the transformed document text
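A sketch of query transformation, reusing the same crude stopping and stemming ideas as above; the stopword list and suffixes are placeholders.

```python
STOPWORDS = {"the", "of", "to", "for"}

def transform_query(query):
    """Apply the same tokenisation, stopping and crude stemming used on documents."""
    tokens = query.lower().split()                      # tokenisation
    tokens = [t for t in tokens if t not in STOPWORDS]  # stopping
    stemmed = []
    for t in tokens:                                    # crude stemming
        for suffix in ("ing", "es", "s"):
            if t.endswith(suffix) and len(t) > len(suffix) + 2:
                t = t[: -len(suffix)]
                break
        stemmed.append(t)
    return stemmed

print(transform_query("the history of fishing"))   # ['history', 'fish']
```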
Results output
Construct display of ranked documents - snippets of documents, important words/passages etc
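A toy snippet generator, assuming a snippet is just a few words of context around the first query-term match; the window size is arbitrary.

```python
def snippet(text, query_term, window=4):
    """Show a few words of context around the first occurrence of a query term."""
    words = text.split()
    for i, w in enumerate(words):
        if w.lower().strip(".,") == query_term.lower():
            start = max(0, i - window)
            return "... " + " ".join(words[start:i + window + 1]) + " ..."
    return ""

print(snippet("A short guide to catching fish in cold rivers.", "fish"))
```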
Logging
One of the most valuable sources for tuning and improving search engines
Ranking analysis uses log data to compare effectiveness of algorithm
Simulations
Log data can also be used to simulate user interactions, so changes to the engine can be evaluated without live users
Politeness Policy
Instead of fetching at maximum speed, a crawler may limit itself to, say, one page every x seconds from the same site
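A sketch of a per-host crawl delay; the 5-second delay is a placeholder, not a recommended value.

```python
import time
from urllib.parse import urlparse

CRAWL_DELAY = 5.0    # seconds between requests to the same host (placeholder value)
last_fetch = {}      # host -> time of the last request

def polite_wait(url):
    """Sleep if the same host was fetched less than CRAWL_DELAY seconds ago."""
    host = urlparse(url).netloc
    elapsed = time.time() - last_fetch.get(host, 0.0)
    if elapsed < CRAWL_DELAY:
        time.sleep(CRAWL_DELAY - elapsed)
    last_fetch[host] = time.time()
```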
Rate limits
A limit on the number of times a single IP address is allowed to access a server within a given time window
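A sliding-window rate limiter sketch from the server's point of view; the window length and request limit are invented values.

```python
import time
from collections import defaultdict, deque

WINDOW = 60.0                  # seconds (placeholder)
MAX_REQUESTS = 100             # per IP per window (placeholder)
history = defaultdict(deque)   # ip -> timestamps of recent requests

def allow_request(ip):
    """Sliding-window rate limit: reject an IP that exceeds MAX_REQUESTS per WINDOW."""
    now = time.time()
    timestamps = history[ip]
    while timestamps and now - timestamps[0] > WINDOW:
        timestamps.popleft()
    if len(timestamps) >= MAX_REQUESTS:
        return False
    timestamps.append(now)
    return True
```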
Deep Web
Private sites (accounts required), form results (bus/flight timetables), scripted pages (JavaScript)
Difficult for a crawler to find
Conversion Problem
Text documents are stored in many incompatible formats: PDF, raw text, RTF, HTML, XML and others
Sometimes in PowerPoint/Excel documents or obsolete formats
Big Table
Used internally at Google
A distributed database built for storing web pages; logically one very big table
Split into tablets served by thousands of machines; any changes are recorded in a transaction log
If a tablet server crashes, another server can read the data from the transaction log and take over
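This is not Google's implementation, just a toy in-memory sketch of the idea that a replacement server can rebuild a tablet's state by replaying its transaction log.

```python
# In-memory stand-in: the "tablet" is a dict, the "transaction log" a list of writes.
log = []

def write(tablet, key, value):
    """Record the change in the transaction log, then apply it to the tablet."""
    log.append((key, value))
    tablet[key] = value

def recover():
    """A replacement server rebuilds the tablet by replaying the transaction log."""
    tablet = {}
    for key, value in log:
        tablet[key] = value
    return tablet

primary = {}
write(primary, "example.com/index.html", "<html>...</html>")
backup = recover()           # after a crash, another server replays the log
print(backup == primary)     # True
```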