Spiders and crawlers Flashcards
Information Retrieval
A field concerned with the structure, analysis, organisation, storage, and retrieval of information.
Goals of Search Engines
Effectiveness - quality - retrieve the most relevant set of documents possible
Efficiency - speed - process queries from users and return results as quickly as possible
Indexing Process
Text acquisition, transformation and index creation
Text acquisition
Identify and acquire documents (e.g. via a crawler or feed) and store the parsed data in a document data store
Text transformation
Remove duplicates from the store, classify information and organise data
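One part of text transformation, duplicate removal, can be sketched by hashing each document's content and keeping only the first copy; the function name and list-based store here are illustrative, not from the source:

```python
import hashlib

def dedupe(documents):
    """Drop exact-duplicate documents by comparing content hashes."""
    seen = set()
    unique = []
    for doc in documents:
        # Hash the raw text; identical documents produce identical digests.
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique
```

Real systems also detect near-duplicates (e.g. with shingling or fingerprints), which a plain content hash cannot catch.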
Index creation
Create an index to quickly locate information - stored in an index database that can be queried by the user
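The standard structure here is an inverted index, mapping each term to the documents that contain it. A minimal sketch, assuming documents are given as an ID-to-text dictionary:

```python
from collections import defaultdict

def build_index(docs):
    """Build an inverted index: term -> set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index
```

Looking up a term is then a fast dictionary access rather than a scan over every document.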
Query Process
Interaction, evaluation and log data
Interaction
User submits a query; the engine runs a ranking algorithm over the index and retrieves the most relevant documents from the document data store to display
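As a toy illustration of ranking (not the algorithm any real engine uses), documents can be scored by how many query terms they contain and returned in descending score order:

```python
def rank(query, docs):
    """Rank document IDs by query-term overlap (a deliberately simple score)."""
    q_terms = set(query.lower().split())
    scores = {doc_id: len(q_terms & set(text.lower().split()))
              for doc_id, text in docs.items()}
    # Highest-scoring documents first.
    return sorted(scores, key=scores.get, reverse=True)
```

Production ranking functions weight terms (e.g. by frequency and rarity) rather than counting raw overlap.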
Evaluation
Find out how relevant the results are e.g. ask the user to rate the relevance of results so the ranking algorithm can be updated
Log data
Log user data (e.g. queries and clicks) and use it to update the ranking algorithm
Types of text acquisition
Crawler and feed
Crawler
Web crawler is the most common method of text acquisition: downloads a web page, extracts its text, and follows its links to discover further pages
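The crawl loop above can be sketched as a breadth-first traversal. This version uses Python's standard-library HTML parser to extract links; the `fetch` callable (URL in, HTML string out) is a placeholder for real network access:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, fetch, limit=100):
    """Breadth-first crawl: fetch a page, queue its links, repeat."""
    frontier = [start_url]
    seen = set()
    while frontier and len(seen) < limit:
        url = frontier.pop(0)
        if url in seen:
            continue
        seen.add(url)
        extractor = LinkExtractor()
        extractor.feed(fetch(url))  # fetch(url) returns the page's HTML
        frontier.extend(extractor.links)
    return seen
```

A real crawler would also resolve relative URLs, respect robots.txt, and throttle requests per host.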
Challenges faced by crawler
Pages are constantly updated so crawlers must be run very frequently to keep up; may not be able to handle huge volumes of new pages
A crawler used for site search operates only on a single website
Focused crawler
Classification technique to determine whether a page is relevant or not; will not access pages that are deemed irrelevant
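A focused crawler can be sketched by adding a relevance check to the crawl loop: pages the classifier rejects are neither kept nor followed. The `fetch`, `extract_links`, and `is_relevant` callables are illustrative stand-ins for a real downloader, link parser, and classifier:

```python
def focused_crawl(start_url, fetch, extract_links, is_relevant, limit=100):
    """Crawl as usual, but skip pages the classifier deems irrelevant."""
    frontier = [start_url]
    seen, kept = set(), []
    while frontier and len(kept) < limit:
        url = frontier.pop(0)
        if url in seen:
            continue
        seen.add(url)
        page = fetch(url)
        if not is_relevant(page):
            continue  # irrelevant: do not index it or follow its links
        kept.append(url)
        frontier.extend(extract_links(page))
    return kept
```

The key difference from a plain crawler is that irrelevant pages prune whole branches of the link graph, saving bandwidth and index space.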
Feed
Real time stream of documents e.g. a news feed
Search engine acquires new documents simply by monitoring the feed
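Monitoring a feed amounts to polling it and keeping only items not seen before. A minimal sketch, assuming `poll` returns `(item_id, document)` pairs and `seen` carries state between polling passes (both names are illustrative):

```python
def monitor_feed(poll, seen):
    """One polling pass: return feed documents not processed before."""
    new_docs = []
    for item_id, doc in poll():
        if item_id not in seen:
            seen.add(item_id)   # remember the item so later passes skip it
            new_docs.append(doc)
    return new_docs
```

Because the feed pushes a known stream of new documents, this is far simpler than crawling: there is no link discovery, only deduplication against what was already acquired.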