01 Web search crawler Flashcards
what is web crawling
the process of locating, fetching, and storing pages available on the web
computer programs that perform this task are referred to as:
- crawler
- spider
- harvester
- robot
what is a web crawler repository
- a cache for online content
- provides quick access to local copies of pages
- speeds up the indexing process
what is the fundamental assumption
- the web is well linked
- crawlers exploit the hyperlink structure
basic web crawling process
- initialise the URL download queue (URL frontier) with seed URLs
- repeat:
  - fetch the content of a URL from the queue
  - store the fetched content in the repository
  - extract hyperlinks from the content
  - add new links to the download queue
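A minimal sketch of this loop in Python, assuming the `requests` and `beautifulsoup4` packages are available; the frontier is a plain FIFO queue and the repository is an in-memory dict, both simplifications of real crawler components.

```python
# Minimal crawl loop: FIFO frontier, in-memory repository (illustrative only).
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawl(seed_urls, max_pages=100):
    frontier = deque(seed_urls)      # URL frontier (download queue)
    seen = set(seed_urls)            # URLs already queued or fetched
    repository = {}                  # url -> page content

    while frontier and len(repository) < max_pages:
        url = frontier.popleft()
        try:
            response = requests.get(url, timeout=5)
        except requests.RequestException:
            continue                 # skip unreachable URLs
        repository[url] = response.text

        # extract hyperlinks and add unseen ones to the frontier
        soup = BeautifulSoup(response.text, "html.parser")
        for anchor in soup.find_all("a", href=True):
            link = urljoin(url, anchor["href"])
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return repository
```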
what are the crawler requirements
scalability
- distribute and increase crawl rate by adding machines
robustness
- avoid spam and spider traps
selection
- cannot index everything, how to select
duplicates
- integrate duplication detection
politeness
- avoid overloading crawled sites
freshness
- refresh crawled content
what are the crawling challenges
- how to distribute crawling
- how to make best use of resources
- how deep should the site be crawled
- how often should we crawl
during a crawl, pages can be divided into 3 sets
- downloaded
- discovered
- undiscovered
what is robots.txt
explicit politeness: advises web crawlers which parts of the site are accessible
implicit politeness
even without a specification, avoid hitting a site too often
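A small sketch of an explicit politeness check using Python's standard `urllib.robotparser`; the crawler name and URLs are placeholders.

```python
# Check robots.txt before fetching a URL (explicit politeness).
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")   # placeholder site
rp.read()                                      # fetch and parse robots.txt

if rp.can_fetch("MyCrawler", "https://example.com/private/page.html"):
    print("allowed to fetch")
else:
    print("disallowed by robots.txt")

# robots.txt may also suggest a delay between requests (implicit politeness hint)
print(rp.crawl_delay("MyCrawler"))             # None if no Crawl-delay directive
```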
spider traps
crawlers need to avoid
- ill-formed HTML
- misleading or hostile sites
- spam
solutions to spider traps
- no automatic technique can be foolproof
- check URL length
- trap guards
- prepare crawl statistics
- add a blacklist to the guard module
- eliminate URLs with non-textual data types
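A hedged sketch of a simple URL guard combining these heuristics; the length limit, blacklist entries, and extension list are illustrative values, not prescribed ones.

```python
# Simple trap-guard heuristics: reject overly long, blacklisted, or non-textual URLs.
from urllib.parse import urlparse

MAX_URL_LENGTH = 256                              # illustrative limit
BLACKLISTED_HOSTS = {"spam.example.com"}          # filled in from crawl statistics
NON_TEXTUAL_EXTENSIONS = (".jpg", ".png", ".gif", ".zip", ".exe", ".mp4")

def is_crawlable(url: str) -> bool:
    if len(url) > MAX_URL_LENGTH:                 # suspiciously long URLs often signal traps
        return False
    parsed = urlparse(url)
    if parsed.netloc in BLACKLISTED_HOSTS:        # known misleading/hostile sites
        return False
    if parsed.path.lower().endswith(NON_TEXTUAL_EXTENSIONS):
        return False                              # non-textual content
    return True
```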
duplication detection
if a page is already in the index, avoid wasting resources on it
exact duplicates:
easy to eliminate using hashing
near duplicates:
difficult to eliminate
identified using document fingerprints or shingles
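A rough illustration of both ideas, assuming word-level shingles and Jaccard similarity as the near-duplicate measure; a production crawler would typically fingerprint the shingles (e.g. with minhash) rather than compare raw sets.

```python
# Exact duplicates: compare content hashes. Near duplicates: compare shingle sets.
import hashlib

def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def shingles(text: str, k: int = 4) -> set:
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 1.0

page1 = "the quick brown fox jumps over the lazy dog"
page2 = "the quick brown fox leaps over the lazy dog"

print(content_hash(page1) == content_hash(page2))   # False: not exact duplicates
print(jaccard(shingles(page1), shingles(page2)))    # shingle overlap score (closer to 1.0 means near-duplicate)
```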
key crawling components
- URL frontier: queue of URLs to be crawled
- seen URLs: set of URLs already encountered
- fetcher: downloads the URL's content
- parser: extracts outgoing links
- URL filtering: filters out URLs that point to images and other unwanted content
- content-seen filtering: eliminates duplicate pages
URL prioritisation
2 queues for URLs
1. discovery queue
   - random
   - breadth first
   - in-degree (pages with more incoming links first)
   - PageRank
2. refreshing queue
   - random
   - age (older pages first)
   - PageRank
   - user feedback
   - longevity (how often the page is updated)
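A small sketch of a prioritised frontier using Python's `heapq`, where the score could be in-degree, age, or an estimated PageRank; the scores passed in below are placeholder values.

```python
# Prioritised URL frontier: pop the URL with the highest priority score first.
import heapq
import itertools

class PriorityFrontier:
    def __init__(self):
        self._heap = []
        self._counter = itertools.count()   # tie-breaker so equal scores stay orderable

    def push(self, url: str, score: float) -> None:
        # heapq is a min-heap, so negate the score to pop high scores first
        heapq.heappush(self._heap, (-score, next(self._counter), url))

    def pop(self) -> str:
        return heapq.heappop(self._heap)[2]

frontier = PriorityFrontier()
frontier.push("https://example.com/a", score=3)    # e.g. in-degree of 3
frontier.push("https://example.com/b", score=10)   # e.g. in-degree of 10
print(frontier.pop())                              # /b is crawled first
```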
breadth first search
append new URLs to the end of the queue
FIFO
requires keeping all nodes of the previous level in memory
depth first search
append new URLs to the start of the queue
LIFO
requires memory of only depth × branching factor, but may go too deep
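A tiny illustration of the difference using `collections.deque` as the frontier: appending to the right gives FIFO (breadth-first) order, appending to the left gives LIFO (depth-first) order.

```python
# BFS vs DFS ordering of the frontier, using the same deque type.
from collections import deque

links_found = ["a", "b", "c"]

# breadth-first: new URLs go to the end, oldest URL is fetched next (FIFO)
bfs_frontier = deque(["seed"])
for link in links_found:
    bfs_frontier.append(link)
print(bfs_frontier.popleft())   # -> "seed"

# depth-first: new URLs go to the front, newest URL is fetched next (LIFO)
dfs_frontier = deque(["seed"])
for link in links_found:
    dfs_frontier.appendleft(link)
print(dfs_frontier.popleft())   # -> "c", the most recently discovered link
```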
what are the crawling metrics
Quality metrics
1. coverage
2. freshness
3. page importance
Performance metrics
1. throughput: content download rate
mirror sites
replicas of existing sites
can lead to redundant crawling
can be detected using:
- url similarity
- link structure
- content similarity
geographically distributed web crawling
- higher crawling throughput due to proximity (lower crawling latency)
- improved politeness
- less overhead on routers because of fewer hops
- better coverage
- increased availability
why are data structures important
efficiency of the web crawler
- the seen-URL table keeps growing as more URLs are discovered
- high space requirements
- frequently seen URLs are cached in memory
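A sketch of one way to keep the seen-URL table compact, assuming URLs are stored as fixed-size digests and an in-memory LRU cache fronts a larger on-disk store (simulated here with a plain set); the cache size is an illustrative value.

```python
# Seen-URL table: store fixed-size digests instead of full URLs to save space,
# and keep recently/frequently seen digests in a small in-memory cache.
import hashlib
from collections import OrderedDict

class SeenUrls:
    def __init__(self, cache_size: int = 10_000):
        self._disk = set()            # stand-in for an on-disk store
        self._cache = OrderedDict()   # LRU cache of hot URL digests
        self._cache_size = cache_size

    @staticmethod
    def _digest(url: str) -> bytes:
        return hashlib.sha1(url.encode("utf-8")).digest()   # 20 bytes per URL

    def add(self, url: str) -> None:
        d = self._digest(url)
        self._disk.add(d)
        self._cache[d] = True
        self._cache.move_to_end(d)
        if len(self._cache) > self._cache_size:
            self._cache.popitem(last=False)                  # evict least recently used

    def __contains__(self, url: str) -> bool:
        d = self._digest(url)
        if d in self._cache:                                 # fast path: memory hit
            self._cache.move_to_end(d)
            return True
        return d in self._disk                               # slow path: "disk" lookup
```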