web search engines Flashcards
web spider
a piece of software that systematically visits pages on the web so they can be indexed
when a user submits a query, the search engine answers it from the spider's indexed results, not from the live web pages themselves
web crawling
start with a list of seed URLs making up the initial queue
each URL is visited; its text is parsed and its hyperlinks extracted
any unseen URLs are added to the queue (known as the URL Frontier)
repeat until all URLs have been visited (see the sketch below)
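As a rough sketch, the crawl loop might look like the following in Python; fetch_page() is a hypothetical helper that downloads a URL, parses its text, and returns the hyperlinks it found:

```python
# Minimal crawl-loop sketch; fetch_page() is a hypothetical helper.
from collections import deque

def crawl(seed_urls):
    frontier = deque(seed_urls)   # the URL Frontier, seeded with initial URLs
    seen = set(seed_urls)
    index = {}
    while frontier:               # repeat until the frontier is empty
        url = frontier.popleft()
        text, links = fetch_page(url)   # fetch page, parse text, extract links
        index[url] = text
        for link in links:
            if link not in seen:        # only unseen URLs join the frontier
                seen.add(link)
                frontier.append(link)
    return index
```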
issues with crawling
massive task - requires a massive server farm - and a continuous one, as pages are constantly updated
needs to be robust - avoid crawler loops and malicious software
avoid junk - duplicates and mirror sites
dynamic content
robots.txt
file on a web server specifying who may crawl each page on the site and what is allowed to be indexed
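A minimal sketch of honouring robots.txt using Python's standard urllib.robotparser; the URL and user-agent string below are just examples:

```python
# Check robots.txt before crawling a page (stdlib only).
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()   # fetch and parse the file
if rp.can_fetch("MyCrawler", "https://example.com/private/page.html"):
    pass    # this user agent is allowed to crawl the page
```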
index
associating words and themes with a page
allows queries to return relevant pages
inverted index
associating each word in a dictionary with all the webpages it occurs in - massive - often replicated on multiple machines behind load balancers - can contain sub-indexes for efficient searching
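A toy inverted index in Python, mapping each dictionary word to the set of DocIDs it occurs in; the three documents are made up for illustration:

```python
# Build an inverted index: word -> posting list of DocIDs.
from collections import defaultdict

docs = {1: "the cat sat", 2: "a cat ran", 3: "dogs ran fast"}

inverted = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.split():
        inverted[word].add(doc_id)

print(sorted(inverted["cat"]))   # [1, 2] -- pages containing "cat"
print(sorted(inverted["ran"]))   # [2, 3]
```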
stemming
each word has a stem; only the stem of the word is indexed
e.g. cats would be indexed as cat
and running as run
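A deliberately naive suffix-stripping stemmer, purely to illustrate the idea; production engines use full rule sets such as the Porter stemmer:

```python
# Toy stemmer: strip a known suffix, longest-first (real stemmers use
# carefully ordered rule sets, not this crude list).
def stem(word):
    for suffix in ("ning", "ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print(stem("cats"))     # cat
print(stem("running"))  # run
```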
offsets
recording where in the web page each word occurred; allows phrases to be queried
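A sketch of a positional index: each posting becomes a (DocID, offset) pair, so a phrase query like "new york" can check that its words appear at consecutive offsets (the two documents are invented):

```python
# Positional index: word -> list of (DocID, offset) pairs.
from collections import defaultdict

docs = {1: "new york city", 2: "york has a new mayor"}

positions = defaultdict(list)
for doc_id, text in docs.items():
    for offset, word in enumerate(text.split()):
        positions[word].append((doc_id, offset))

# Phrase query "new york": "york" must sit one offset after "new".
matches = {(d, o) for d, o in positions["new"]} & \
          {(d, o - 1) for d, o in positions["york"]}
print(matches)   # {(1, 0)} -- only doc 1 contains the phrase
```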
dictionary
set of words, sorted alphabetically
posting list
list of DocIDs - the documents a given word occurs in
constructing the index
document to be indexed -> tokeniser identifies words -> linguistic module applies stemming to create a modified token list -> indexed
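The same pipeline as a short sketch; the one-line stem() here is a stand-in for a real linguistic module such as the Porter stemmer:

```python
# Pipeline sketch: tokenise -> stem -> index.
from collections import defaultdict

def stem(word):
    # toy rule: strip a plural "s" (stand-in for a real stemmer)
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

def build_index(docs):
    index = defaultdict(set)
    for doc_id, text in docs.items():
        tokens = text.lower().split()     # tokeniser identifies words
        for token in map(stem, tokens):   # linguistic module stems the tokens
            index[token].add(doc_id)      # modified token list is indexed
    return index

print(sorted(build_index({1: "Cats sat", 2: "a cat ran"})["cat"]))  # [1, 2]
```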
result rankings
many different metrics exist for ranking each page - the ranking determines the order in which pages are displayed in query results
page rank
Google's algorithm for ranking pages
PR(A), the PageRank of page A, is the probability that a random surfer would visit that page
considers the web to be a graph: nodes are pages, edges are hyperlinks
a surfer at node A will visit each linked page with probability 1/(number of outgoing links)
a page with more incoming links is more likely to be visited
a page with more outgoing links gives less voting power to each page it links to
teleporting
if a node has no outgoing links, the surfer is 'teleported' to a random node, each with probability 1/(number of nodes on the web)
for nodes with outgoing links a teleport probability is also included: with damping factor d (typically 0.85), the surfer follows each outgoing link with probability d/(number of links) and teleports to any given node with probability (1-d)/(number of nodes)
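Putting the two cases together gives the standard damped PageRank formula, where d is the damping factor, N the total number of pages, B(A) the set of pages linking to A, and L(T) the number of outgoing links on page T:

```latex
PR(A) = \frac{1 - d}{N} + d \sum_{T \in B(A)} \frac{PR(T)}{L(T)}
```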
calculating page rank
calculated iteratively until the values stabilise - each page is given the same starting value for the first iteration, e.g. 100 points - each iteration redistributes each page's points along its outgoing links (see the sketch below)
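A minimal iterative PageRank sketch with teleporting; d = 0.85 is the commonly quoted damping factor, and the three-page graph is invented for illustration:

```python
# Iterative PageRank with teleporting; links maps page -> outgoing links.
def pagerank(links, d=0.85, iters=50):
    pages = list(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}           # uniform starting value
    for _ in range(iters):                     # iterate until values stabilise
        new = {p: (1 - d) / n for p in pages}  # teleport share for every page
        for p, outs in links.items():
            if outs:                           # split this page's score evenly
                for q in outs:                 # among the pages it links to
                    new[q] += d * pr[p] / len(outs)
            else:                              # dangling page: its score
                for q in pages:                # teleports uniformly
                    new[q] += d * pr[p] / n
        pr = new
    return pr

print(pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]}))
```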