web search engines Flashcards

1
Q

web spider

A

piece of software that systematically visits pages on the web so that they can be indexed
when a user submits a query, the search engine looks it up in the index built from the spider's results, not in the live web pages themselves

2
Q

web crawling

A

start with a list of seed URLs, which make up the initial queue
each URL in the queue is visited: its text is parsed and its hyperlinks are extracted
any unseen URLs are added to the queue (known as the URL frontier)
repeat until all URLs have been visited
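The steps above can be sketched as a breadth-first traversal. This is only an illustration, not how any real spider is built: `TOY_WEB` is a made-up in-memory stand-in for the web, and `crawl` and `fetch_links` are hypothetical names, so no network access is involved.

```python
from collections import deque

# Hypothetical stand-in for the web: each URL maps to the hyperlinks on that page.
TOY_WEB = {
    "a.example/": ["a.example/about", "b.example/"],
    "a.example/about": ["a.example/"],
    "b.example/": ["a.example/", "c.example/"],
    "c.example/": [],
}

def crawl(seeds, fetch_links):
    """Visit every URL reachable from the seed URLs, breadth-first."""
    frontier = deque(seeds)   # the URL frontier: URLs waiting to be visited
    seen = set(seeds)         # URLs already queued, so loops are not followed forever
    visited = []
    while frontier:
        url = frontier.popleft()
        visited.append(url)   # a real spider would parse and index the page text here
        for link in fetch_links(url):
            if link not in seen:   # only unseen URLs join the frontier
                seen.add(link)
                frontier.append(link)
    return visited

order = crawl(["a.example/"], lambda url: TOY_WEB.get(url, []))
```

The `seen` set is what stops the crawler from looping between pages that link to each other.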

3
Q

issues with crawling

A

massive task - requires a massive server farm - never finished, because pages are continually updated
needs to be robust - avoid loops and malicious software
avoid junk - duplicates and mirror sites
dynamic content - pages generated on request are hard to crawl

4
Q

robots.txt

A

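file placed at the root of a website (optional, so not present on every server) telling crawlers which of them may crawl which pages on the site and what is allowed to be indexed - compliance is voluntary, so it only restrains well-behaved spiders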
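As a small illustration, Python's standard library can interpret these rules through `urllib.robotparser`. The rules in `ROBOTS_TXT`, the crawler name `ExampleBot`, and the example.com URLs are invented for this sketch.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt contents: every crawler is asked to stay out of /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())   # parse rules from text instead of fetching them

# A polite spider asks before visiting each URL.
allowed_home = parser.can_fetch("ExampleBot", "https://example.com/index.html")
allowed_private = parser.can_fetch("ExampleBot", "https://example.com/private/report.html")
```

Here `allowed_home` is true and `allowed_private` is false, so the spider would skip the second URL.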

5
Q

index

A

associating words and themes with a page
allows queries to return relevant pages

6
Q

inverted index

A

associates each word in a dictionary with all the web pages it occurs in - massive - often replicated on multiple machines behind load balancers - can contain sub-indexes for efficient searching
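A minimal sketch of the idea, assuming documents are plain strings keyed by an integer ID. The names `build_inverted_index`, `search_all`, and `DOCS` are illustrative only, and a real index would be far larger and spread across machines.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each word to the sorted list of document IDs it occurs in."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            postings[word].add(doc_id)
    # Keep the words in alphabetical order and each list of IDs sorted.
    return {word: sorted(ids) for word, ids in sorted(postings.items())}

def search_all(index, query):
    """Return the IDs of documents containing every word of the query."""
    id_sets = [set(index.get(word, [])) for word in query.lower().split()]
    return sorted(set.intersection(*id_sets)) if id_sets else []

DOCS = {1: "the cat sat", 2: "the dog sat", 3: "a cat and a dog"}
index = build_inverted_index(DOCS)
cat_and_dog = search_all(index, "cat dog")
```

Answering a multi-word query is then a matter of intersecting the lists for each word, without touching the documents themselves.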

7
Q

stemming

A

each word has a stem, and only the stem of the word is indexed
e.g. cats would be indexed as cat
running as run
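To make the examples concrete, here is a deliberately naive suffix stripper. It is a toy written for this card, not the algorithm any real search engine uses (established stemmers apply many more rules), and `naive_stem` is a hypothetical name.

```python
def naive_stem(word):
    """Strip a plural 's' or an 'ing' ending; a toy rule set, not a real stemmer."""
    word = word.lower()
    if word.endswith("ing") and len(word) > 5:
        word = word[:-3]
        # Undo consonant doubling, e.g. "runn" -> "run".
        if len(word) >= 2 and word[-1] == word[-2] and word[-1] not in "aeiou":
            word = word[:-1]
    elif word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        word = word[:-1]
    return word

stems = [naive_stem(w) for w in ["cats", "running", "cat", "glass"]]
```

With these rules the four words become cat, run, cat, and glass. Rules this simple will mis-stem many other words, which is why real stemmers are more elaborate.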

8
Q

offsets

A

recording the positions at which a word occurs in a web page, which allows phrases to be queried
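One way to picture this is an index that stores, for every word, the positions at which it appears in each document. The sketch below assumes whitespace-separated words. `build_positional_index`, `phrase_search`, and `DOCS` are names made up for the example.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map word -> document ID -> list of word offsets within that document."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for offset, word in enumerate(text.lower().split()):
            index[word][doc_id].append(offset)
    return index

def phrase_search(index, phrase):
    """Return IDs of documents where the words of the phrase occur consecutively."""
    words = phrase.lower().split()
    if not words:
        return []
    matches = []
    for doc_id, starts in index.get(words[0], {}).items():
        for start in starts:
            # Word i of the phrase must sit exactly i positions after the first word.
            if all(start + i in index.get(w, {}).get(doc_id, [])
                   for i, w in enumerate(words)):
                matches.append(doc_id)
                break
    return sorted(matches)

DOCS = {1: "new york city", 2: "york is new", 3: "a new york minute"}
pos_index = build_positional_index(DOCS)
hits = phrase_search(pos_index, "new york")
```

Document 2 contains both words but not next to each other in that order, so only documents 1 and 3 match the phrase.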

9
Q

dictionary

A

set of all distinct words in the index, sorted alphabetically

10
Q

posting list

A

list of DocIDs (document identifiers) of the pages in which a given dictionary word occurs

11
Q

constructing the index

A

document to be indexed -> tokeniser identifies words -> linguistic model applies stemming to create modified tokenised list -> indexed
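The pipeline can be sketched end to end as below. The regular-expression tokeniser and the plural-stripping `strip_plural` step are simplifying assumptions standing in for a real tokeniser and linguistic model, and all function names are hypothetical.

```python
import re
from collections import defaultdict

def tokenise(text):
    """Split a document into lowercase word tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())

def strip_plural(token):
    """Placeholder for the linguistic step: only removes a plural 's'."""
    if token.endswith("s") and not token.endswith("ss") and len(token) > 3:
        return token[:-1]
    return token

def add_document(index, doc_id, text, normalise=strip_plural):
    """Tokenise, normalise, then record the document under each resulting term."""
    for token in tokenise(text):
        term = normalise(token)
        if doc_id not in index[term]:
            index[term].append(doc_id)

doc_index = defaultdict(list)
add_document(doc_index, 1, "Cats chase dogs.")
add_document(doc_index, 2, "A dog chased the cat!")
```

After both documents are added, the terms cat and dog each point to documents 1 and 2. The terms chase and chased stay separate because the placeholder step does not handle past tense.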

12
Q

result rankings

A

many different metrics for ranking each page - determines the order of the page displayed in query results

13
Q

page rank

A

Google's algorithm for ranking pages - PR(A), the PageRank value of page A, is the probability that a random surfer would visit the page - considers the web to be a graph: nodes are pages, edges are hyperlinks - a surfer at node A follows each of A's outgoing links with a probability of 1/(number of outgoing links on A) - a page with more incoming links is more likely to be visited - a page with more outgoing links has less voting power per link

14
Q

teleporting

A

if a node has no outgoing links, the surfer gets ‘teleported’ to a random node, each with a probability of 1/(number of nodes on web) - for nodes with outgoing links, the surfer also teleports to a random node with some small probability and otherwise follows a link, so the probability of following any one particular link becomes (1 - probability of teleporting)/(number of links)

15
Q

calculating page rank

A

calculated iteratively until the values stabilize - each page is given the same starting value for the first iteration, e.g. 100 points - each iteration redistributes the points along the links

16
Q

page rank algorithm

A

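PR(A) = (1-p) + p(PR(t1)/C(t1) + PR(t2)/C(t2) + … + PR(tn)/C(tn)) where t1 … tn are the pages that link to A, PR(t) is the PageRank value of page t, C(t) is the number of outgoing links on page t, and p is the damping factor: the probability that the surfer follows a link, so that 1-p is the probability of teleportation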
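A small sketch of applying this formula repeatedly until the values settle. It assumes p = 0.85 as the weight on the link-following term, a starting value of 1.0 per page, and a graph in which every page has at least one outgoing link. The function name `pagerank`, the tolerance, and the three-page graph `LINKS` are invented for illustration.

```python
def pagerank(links, p=0.85, tol=1e-9, max_iter=200):
    """Iterate PR(A) = (1-p) + p * sum(PR(t)/C(t)) until the values stop changing.

    `links` maps each page to the list of pages it links to; every page is
    assumed to have at least one outgoing link.
    """
    pages = list(links)
    ranks = {page: 1.0 for page in pages}   # same starting value for every page
    for _ in range(max_iter):
        new_ranks = {}
        for page in pages:
            # Each page t linking here passes on an equal share of its own rank.
            incoming = sum(ranks[t] / len(links[t]) for t in pages if page in links[t])
            new_ranks[page] = (1 - p) + p * incoming
        change = max(abs(new_ranks[x] - ranks[x]) for x in pages)
        ranks = new_ranks
        if change < tol:
            break
    return ranks

# Hypothetical three-page web: A links to B and C, B links to C, C links to A.
LINKS = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(LINKS)
```

In this graph C ends up ranked highest because both other pages link to it, A comes next because it receives all of C's vote, and B is lowest because it receives only half of A's.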

17
Q

search engine manipulation

A

techniques used to unfairly increase a page's ranking in search results
cloaking - a version of the page optimized for high ranking is served to spiders, while ordinary visitors get the regular page
doorway pages - pages optimized for a single word that immediately redirect the user to the real page
keyword spam - excessive repetition of keywords, engineered anchor text, hidden text, and misleading meta tags
link spamming - having many web pages link to a page without justification, to inflate its PageRank