Unit 3: Page rank, greedy algorithms, divide and conquer Flashcards
Occam’s razor
The simplest explanation is usually the right one.
The architecture of search engines
3 components of a search engine
- Web-crawler
- Indexing
- Assessing the webpages
3 components of a search engine
Web-Crawler
Computer programmes that search the internet, in order to identify new or changes webpages.
The information that is found by crawlers is processed and stored.
3 components of a search engine
Indexing
The information is stored in a data structure, which supports real-time access to all webpages that contain a given keyword.
3 components of a search engine
Assessing the webpages
The information content of the chosen webpages is assessed regarding possible keywords and regarding their general importance in the internet.
Input & Goal of a Search Engine
Input: A query, consisting of one or several keywords.
Goal: A list of webpages that contain information on the keywords, sorted by relevance.
Search Engines
2 criteria for ‘relevance’ of keywords
- The frequency and positioning of the search keywords on the respective webpage, and the labelling of the links to the webpage.
- The fundamental importance of a webpage.
‘Fundamental importance’ of a webpage via the link structure
If a webpage i
contains a link to a webpage j
, then (hopefully), the following is true:
- There is a relationship between the content of both webpages.
- The author of webpage
i
considers the information on webpagej
to be relevant.
2 Methods that yield a measure of the fundamental importance of a webpage.
- The page rank method, invented by Sergei Brin and Larry Page, Google founders.
- The HITS method (Hypertext Induced Topic Search) by Jon Kleinberg.
Modelling the webgraph
- Use a directed graph
G=(V, E)
to denote the webgraph. - Assume websited are numbered
1, ..., n
and thatV={1, ..., n}
. - Every vertex of
G
represents a webpage, and every edge(i, j) ∈ E
models a link from webpagei
toj
. - For a webpage
j ∈ V
we writeN⁻(j)
to denote the set of all webpages that
contain a link toj
, i. e.N⁻(j)
:=
{i ∈ V | (i, j) ∈ E}
The elements of N⁻(j)
are called predecessors of j
.
Page rank idea
Model the fundamental importance of webpage i
by a number PRᵢ
, the page rank of i
.
-
PRᵢ
is meant to represent the quality of webpagei
. - The larger
PRᵢ
, the bigger the reputation ofi
. -
PRⱼ
of webpagej
is high, if many pagesi
with high page rankPRᵢ
point toj
.
The values PRᵢ
associated to all webpages i∈V
are chosen such that:
Webpage i
with outᵢ
outgoing links passes the fraction PRᵢ
/outᵢ
of its page rank on to webpage j
with (i, j) ∈ E
.
For each j ∈ V
with N⁻(j) ≠ ⦰
we would obtain:
PRⱼ
= ∑ PRᵢ
/outᵢ
over all i ∈ N⁻(j)
Page rank idea
Issue 1: Sinks
With the proposed definition, viertices of out-degree 0
cause a problem: they do not pass on their page rank to other vertices, and they can lead to values PRᵢ
that do not make sense as a measure of importance of a webpage.
Page rank idea
Define a sink
A vertex of out-degree 0
2 Options for removing sinks
In order to determine the page rank, we only consider graphs without sinks. I.e. every vertex has out-degree ≥ 1.
2 Options:
- For every sink i
, add additional edges (i, j)
to all j ∈ V
.
- Delete all sinks and repeat this until a graph without sinks is obtained.
Irreducible Markov Chain
The stochastic matrix P
for a Markov chain is irreducible, if the graph corresponding to P
is strongly connected.
Aperiodic Markov Chain
The stochastic matrix P
for a Markov chain is aperiodic, if every vertex i
of the graph corresponding to P
satisfies:
the greatest common divisor of the length of all paths from i
to i
is 1.