Spiders and crawlers Flashcards
Information Retrieval
A field concerned with the structure, analysis, organisation, storage, and retrieval of information.
Goals of Search Engines
Effectiveness - quality - retrieve the most relevant set of documents possible
Efficiency - speed - process users' queries and return results as quickly as possible
Indexing Process
Text acquisition, transformation and index creation
Text acquisition
Finds and acquires documents (e.g. via a crawler or feed) and stores them in a document data store
Text transformation
Remove duplicates from the store, classify information and organise data
Index creation
Create an index so information can be located quickly - stored in an index database that can be queried by the user
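As a rough illustration (not part of the flashcards), a minimal inverted index might look like the following Python sketch; the documents and whitespace tokenisation are invented for the example.

```python
from collections import defaultdict

def build_index(docs):
    """Build a minimal inverted index: term -> set of document IDs."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

docs = {1: "fish and chips", 2: "fishing boats", 3: "chips and salsa"}
index = build_index(docs)
print(index["chips"])   # {1, 3} -- the documents containing the term
```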
Query Process
Interaction, evaluation and log data
Interaction
The user's query is run against the index by a ranking algorithm, which retrieves the most relevant documents from the document data store for display
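A toy ranking sketch, assuming relevance is simply the number of query terms a document contains; real ranking algorithms are far more sophisticated, and the documents here are invented.

```python
def rank(query_terms, docs):
    """Toy ranking: score each document by how many query terms it contains."""
    scores = {}
    for doc_id, text in docs.items():
        tokens = set(text.lower().split())
        scores[doc_id] = sum(1 for t in query_terms if t in tokens)
    return sorted(scores, key=scores.get, reverse=True)

docs = {1: "fresh fish market", 2: "fishing boats for hire", 3: "city bus timetable"}
print(rank(["fish", "market"], docs))   # [1, 2, 3] -- doc 1 matches both terms
```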
Evaluation
Measure how relevant the results are, e.g. by asking the user to rate result relevance, so the ranking algorithm can be improved
Log data
Log user interactions (e.g. queries and clicks) and use the data to update the algorithm
Types of text acquisition
Crawler and feed
Crawler
The web crawler is the most common method of text acquisition: it fetches a web page, extracts its links, and follows them to discover further pages
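A minimal crawler sketch using only the Python standard library; the seed URL and page limit are placeholders, and it ignores robots.txt and politeness for brevity (see the Politeness Policy card below).

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkParser(HTMLParser):
    """Collects href values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, max_pages=10):
    """Breadth-first crawl: fetch a page, extract its links, queue new ones."""
    seen, queue, fetched = {seed}, deque([seed]), 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", errors="ignore")
        except (OSError, ValueError):
            continue                     # skip pages that fail to download
        fetched += 1
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return seen
```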
Challenges faced by crawler
Pages are constantly updated, so crawlers must be run very frequently to keep up; they may not be able to handle the huge volume of new pages
Can only operate on a single website
Focused crawler
Uses a classification technique to determine whether a page is relevant or not; will not visit pages deemed irrelevant
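A crude stand-in for a focused crawler's relevance test, using a hand-picked topic vocabulary instead of a trained classifier; the terms and threshold are invented for the example.

```python
RELEVANT_TERMS = {"fishing", "angling", "tackle"}   # hypothetical topic vocabulary

def is_relevant(page_text, threshold=2):
    """Crude stand-in for a trained classifier: count topic terms on the page."""
    tokens = page_text.lower().split()
    hits = sum(1 for t in tokens if t in RELEVANT_TERMS)
    return hits >= threshold

# A focused crawler would only queue links found on pages where is_relevant(...) is True.
```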
Feed
Real time stream of documents e.g. a news feed
Search engine acquires new documents simply by monitoring the feed
Feed Conversion
Documents in a feed are rarely plain text
Search engines require them to be converted into consistent text + metadata
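A sketch of feed conversion, assuming a tiny RSS-like feed; the fields kept as text and metadata are chosen for illustration only.

```python
import xml.etree.ElementTree as ET

RSS = """<rss><channel><item>
  <title>Example headline</title>
  <pubDate>Mon, 01 Jan 2024 09:00:00 GMT</pubDate>
  <description>Body text of the news item.</description>
</item></channel></rss>"""

def feed_to_documents(rss_xml):
    """Convert RSS items into a consistent text + metadata representation."""
    docs = []
    for item in ET.fromstring(rss_xml).iter("item"):
        docs.append({
            "text": item.findtext("description", default=""),
            "metadata": {
                "title": item.findtext("title", default=""),
                "published": item.findtext("pubDate", default=""),
            },
        })
    return docs

print(feed_to_documents(RSS))
```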
Document Data Store
Database to manage large numbers of documents and structured data (metadata) associated with them
Types of text transformation
Parser, stopping and stemming
Parser
Processes a sequence of text tokens
Uses knowledge of the document's syntax (e.g. markup tags) to identify the structure of the text/information
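A very crude parsing sketch, assuming HTML-like markup and regular expressions; a real parser would use proper markup handling, and the tag names here are just examples.

```python
import re

def parse(document):
    """Tokenise a document and use simple markup syntax to tag its structure."""
    tokens = []
    # Treat anything inside <h1>...</h1> as heading text, <p>...</p> as body text.
    for field, pattern in [("heading", r"<h1>(.*?)</h1>"), ("body", r"<p>(.*?)</p>")]:
        for match in re.findall(pattern, document, flags=re.S):
            tokens.extend((field, word) for word in match.lower().split())
    return tokens

print(parse("<h1>Fishing Guide</h1><p>How to catch fish.</p>"))
```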
Stopping
Removes common words from the stream of tokens e.g. ‘the’, ‘of’, ‘to’, ‘for’
Reduces the size of the index without significantly affecting result quality
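A minimal stopping sketch; the stopword list is a tiny example, not a real one.

```python
STOPWORDS = {"the", "of", "to", "for", "and", "a"}   # tiny example list

def remove_stopwords(tokens):
    """Drop very common words from a stream of tokens."""
    return [t for t in tokens if t not in STOPWORDS]

print(remove_stopwords(["the", "history", "of", "fishing"]))   # ['history', 'fishing']
```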
Stemming
Groups together words that derive from a common stem, e.g. 'fish', 'fishes' and 'fishing'
May not be effective for all languages
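A deliberately crude suffix-stripping stemmer to illustrate the idea; real systems typically use something like the Porter stemmer, and the suffix list here is invented.

```python
def crude_stem(word):
    """Very rough suffix-stripping stemmer, just to illustrate the idea."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print([crude_stem(w) for w in ["fish", "fishes", "fishing"]])   # ['fish', 'fish', 'fish']
```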
User interaction
Query input, query transformation and results output
Query input
A small number of keywords used to query the document collection, e.g. a web search query
Query transformation
Tokenisation, stopping and stemming must be applied to the query so it can be compared with the transformed document text
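A sketch of query transformation, reusing the same crude stopping and stemming ideas as above; the stopword list and suffixes are placeholders.

```python
STOPWORDS = {"the", "of", "to", "for"}

def transform_query(query):
    """Apply the same tokenisation, stopping and crude stemming used on documents."""
    tokens = query.lower().split()                      # tokenisation
    tokens = [t for t in tokens if t not in STOPWORDS]  # stopping
    stemmed = []
    for t in tokens:                                    # crude stemming
        for suffix in ("ing", "es", "s"):
            if t.endswith(suffix) and len(t) > len(suffix) + 2:
                t = t[: -len(suffix)]
                break
        stemmed.append(t)
    return stemmed

print(transform_query("the history of fishing"))   # ['history', 'fish']
```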
Results output
Construct display of ranked documents - snippets of documents, important words/passages etc
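A toy snippet generator, assuming a snippet is just a few words of context around the first query-term match; the window size is arbitrary.

```python
def snippet(text, query_term, window=4):
    """Show a few words of context around the first occurrence of a query term."""
    words = text.split()
    for i, w in enumerate(words):
        if w.lower().strip(".,") == query_term.lower():
            start = max(0, i - window)
            return "... " + " ".join(words[start:i + window + 1]) + " ..."
    return ""

print(snippet("A short guide to catching fish in cold rivers.", "fish"))
```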
Logging
One of the most valuable sources for tuning and improving search engines
Ranking analysis uses log data to compare effectiveness of algorithm
Simulations
Log data can also be used to simulate user interactions, so changes to the engine can be evaluated without live users
Politeness Policy
Instead of fetching at maximum speed, a crawler may limit itself to, say, one page every x seconds from the same site
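A sketch of a per-host crawl delay; the 5-second delay is a placeholder, not a recommended value.

```python
import time
from urllib.parse import urlparse

CRAWL_DELAY = 5.0    # seconds between requests to the same host (placeholder value)
last_fetch = {}      # host -> time of the last request

def polite_wait(url):
    """Sleep if the same host was fetched less than CRAWL_DELAY seconds ago."""
    host = urlparse(url).netloc
    elapsed = time.time() - last_fetch.get(host, 0.0)
    if elapsed < CRAWL_DELAY:
        time.sleep(CRAWL_DELAY - elapsed)
    last_fetch[host] = time.time()
```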
Rate limits
A limit on the number of times a single IP address is allowed to access a server within a given time window
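A sliding-window rate limiter sketch from the server's point of view; the window length and request limit are invented values.

```python
import time
from collections import defaultdict, deque

WINDOW = 60.0                  # seconds (placeholder)
MAX_REQUESTS = 100             # per IP per window (placeholder)
history = defaultdict(deque)   # ip -> timestamps of recent requests

def allow_request(ip):
    """Sliding-window rate limit: reject an IP that exceeds MAX_REQUESTS per WINDOW."""
    now = time.time()
    timestamps = history[ip]
    while timestamps and now - timestamps[0] > WINDOW:
        timestamps.popleft()
    if len(timestamps) >= MAX_REQUESTS:
        return False
    timestamps.append(now)
    return True
```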
Deep Web
Private sites (accounts required), form results (bus/flight timetables), scripted pages (JavaScript)
Difficult for a crawler to find
Conversion Problem
Text documents are stored in many incompatible formats: PDF, raw text, RTF, HTML, XML and others
Sometimes in PowerPoint/Excel documents or obsolete formats
Big Table
Used internally at Google
A distributed database built for storing web pages; logically one very big table
Split into tablets served by thousands of machines; any changes are recorded in a transaction log
If a tablet server crashes, another server can read the data from the transaction log and take over
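This is not Google's implementation, just a toy in-memory sketch of the idea that a replacement server can rebuild a tablet's state by replaying its transaction log.

```python
# In-memory stand-in: the "tablet" is a dict, the "transaction log" a list of writes.
log = []

def write(tablet, key, value):
    """Record the change in the transaction log, then apply it to the tablet."""
    log.append((key, value))
    tablet[key] = value

def recover():
    """A replacement server rebuilds the tablet by replaying the transaction log."""
    tablet = {}
    for key, value in log:
        tablet[key] = value
    return tablet

primary = {}
write(primary, "example.com/index.html", "<html>...</html>")
backup = recover()           # after a crash, another server replays the log
print(backup == primary)     # True
```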