web search engines Flashcards
web spider
a piece of software that systematically visits pages on the web so they can be indexed
when a user submits a query, the search engine answers it from the spider's indexed results, not from the live web pages themselves
web crawling
start with a list of seed URLs making up the initial queue
each URL is visited; its text is parsed and its hyperlinks extracted
any unseen URLs are added to the queue (known as the URL Frontier)
repeat until all URLs have been visited (see the sketch below)
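As a rough sketch, the crawl loop might look like the following in Python; fetch_page() is a hypothetical helper that downloads a URL, parses its text, and returns the hyperlinks it found:

```python
# Minimal crawl-loop sketch; fetch_page() is a hypothetical helper.
from collections import deque

def crawl(seed_urls):
    frontier = deque(seed_urls)   # the URL Frontier, seeded with initial URLs
    seen = set(seed_urls)
    index = {}
    while frontier:               # repeat until the frontier is empty
        url = frontier.popleft()
        text, links = fetch_page(url)   # fetch page, parse text, extract links
        index[url] = text
        for link in links:
            if link not in seen:        # only unseen URLs join the frontier
                seen.add(link)
                frontier.append(link)
    return index
```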
issues with crawling
massive task - requires a massive server farm - and a continuous one, as pages are constantly updated
needs to be robust - avoid crawler loops and malicious software
avoid junk - duplicates and mirror sites
dynamic content
robots.txt
file on a web server specifying who may crawl each page on the site and what is allowed to be indexed
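A minimal sketch of honouring robots.txt using Python's standard urllib.robotparser; the URL and user-agent string below are just examples:

```python
# Check robots.txt before crawling a page (stdlib only).
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()   # fetch and parse the file
if rp.can_fetch("MyCrawler", "https://example.com/private/page.html"):
    pass    # this user agent is allowed to crawl the page
```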
index
associating words and themes with a page
allows queries to return relevant pages
inverted index
associating each word in a dictionary with all the webpages it occurs in - massive - often replicated on multiple machines behind load balancers - can contain sub-indexes for efficient searching
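A toy inverted index in Python, mapping each dictionary word to the set of DocIDs it occurs in; the three documents are made up for illustration:

```python
# Build an inverted index: word -> posting list of DocIDs.
from collections import defaultdict

docs = {1: "the cat sat", 2: "a cat ran", 3: "dogs ran fast"}

inverted = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.split():
        inverted[word].add(doc_id)

print(sorted(inverted["cat"]))   # [1, 2] -- pages containing "cat"
print(sorted(inverted["ran"]))   # [2, 3]
```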
stemming
each word has a stem; only the stem of the word is indexed
e.g. cats would be indexed as cat
and running as run
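A deliberately naive suffix-stripping stemmer, purely to illustrate the idea; production engines use full rule sets such as the Porter stemmer:

```python
# Toy stemmer: strip a known suffix, longest-first (real stemmers use
# carefully ordered rule sets, not this crude list).
def stem(word):
    for suffix in ("ning", "ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print(stem("cats"))     # cat
print(stem("running"))  # run
```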
offsets
recording where in the web page each word occurred; allows phrases to be queried
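A sketch of a positional index: each posting becomes a (DocID, offset) pair, so a phrase query like "new york" can check that its words appear at consecutive offsets (the two documents are invented):

```python
# Positional index: word -> list of (DocID, offset) pairs.
from collections import defaultdict

docs = {1: "new york city", 2: "york has a new mayor"}

positions = defaultdict(list)
for doc_id, text in docs.items():
    for offset, word in enumerate(text.split()):
        positions[word].append((doc_id, offset))

# Phrase query "new york": "york" must sit one offset after "new".
matches = {(d, o) for d, o in positions["new"]} & \
          {(d, o - 1) for d, o in positions["york"]}
print(matches)   # {(1, 0)} -- only doc 1 contains the phrase
```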
dictionary
set of words, sorted alphabetically
posting list
list of DocIDs - the documents a given word occurs in
constructing the index
document to be indexed -> tokeniser identifies words -> linguistic module applies stemming to create a modified token list -> indexed
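The same pipeline as a short sketch; the one-line stem() here is a stand-in for a real linguistic module such as the Porter stemmer:

```python
# Pipeline sketch: tokenise -> stem -> index.
from collections import defaultdict

def stem(word):
    # toy rule: strip a plural "s" (stand-in for a real stemmer)
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

def build_index(docs):
    index = defaultdict(set)
    for doc_id, text in docs.items():
        tokens = text.lower().split()     # tokeniser identifies words
        for token in map(stem, tokens):   # linguistic module stems the tokens
            index[token].add(doc_id)      # modified token list is indexed
    return index

print(sorted(build_index({1: "Cats sat", 2: "a cat ran"})["cat"]))  # [1, 2]
```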
result rankings
many different metrics exist for ranking each page - the ranking determines the order in which pages are displayed in query results
page rank
Google's algorithm for ranking pages
PR(A), the PageRank of page A, is the probability that a random surfer would visit that page
considers the web to be a graph: nodes are pages, edges are hyperlinks
a surfer at node A will visit each linked page with probability 1/(number of outgoing links)
a page with more incoming links is more likely to be visited
a page with more outgoing links gives less voting power to each page it links to
teleporting
if a node has no outgoing links, the surfer is 'teleported' to a random node, each with probability 1/(number of nodes on the web)
for nodes with outgoing links a teleport probability is also included: with damping factor d (typically 0.85), the surfer follows each outgoing link with probability d/(number of links) and teleports to any given node with probability (1-d)/(number of nodes)
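Putting the two cases together gives the standard damped PageRank formula, where d is the damping factor, N the total number of pages, B(A) the set of pages linking to A, and L(T) the number of outgoing links on page T:

```latex
PR(A) = \frac{1 - d}{N} + d \sum_{T \in B(A)} \frac{PR(T)}{L(T)}
```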
calculating page rank
calculated iteratively until the values stabilise - each page is given the same starting value for the first iteration, e.g. 100 points - each iteration redistributes each page's points along its outgoing links (see the sketch below)
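A minimal iterative PageRank sketch with teleporting; d = 0.85 is the commonly quoted damping factor, and the three-page graph is invented for illustration:

```python
# Iterative PageRank with teleporting; links maps page -> outgoing links.
def pagerank(links, d=0.85, iters=50):
    pages = list(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}           # uniform starting value
    for _ in range(iters):                     # iterate until values stabilise
        new = {p: (1 - d) / n for p in pages}  # teleport share for every page
        for p, outs in links.items():
            if outs:                           # split this page's score evenly
                for q in outs:                 # among the pages it links to
                    new[q] += d * pr[p] / len(outs)
            else:                              # dangling page: its score
                for q in pages:                # teleports uniformly
                    new[q] += d * pr[p] / n
        pr = new
    return pr

print(pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]}))
```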