Unit 3: Page rank, greedy algorithms, divide and conquer Flashcards
Occam’s razor
The simplest explanation is usually the right one.
The architecture of search engines
3 components of a search engine
- Web-crawler
- Indexing
- Assessing the webpages
3 components of a search engine
Web-Crawler
Computer programs that search the internet in order to identify new or changed webpages.
The information that is found by crawlers is processed and stored.
3 components of a search engine
Indexing
The information is stored in a data structure, which supports real-time access to all webpages that contain a given keyword.
3 components of a search engine
Assessing the webpages
The information content of the chosen webpages is assessed regarding possible keywords and regarding their general importance in the internet.
Input & Goal of a Search Engine
Input: A query, consisting of one or several keywords.
Goal: A list of webpages that contain information on the keywords, sorted by relevance.
Search Engines
2 criteria for ‘relevance’ of keywords
- The frequency and positioning of the search keywords on the respective webpage, and the labelling of the links to the webpage.
- The fundamental importance of a webpage.
‘Fundamental importance’ of a webpage via the link structure
If a webpage i contains a link to a webpage j, then (hopefully) the following is true:
- There is a relationship between the content of both webpages.
- The author of webpage i considers the information on webpage j to be relevant.
2 Methods that yield a measure of the fundamental importance of a webpage.
- The page rank method, invented by Sergei Brin and Larry Page, Google founders.
- The HITS method (Hypertext Induced Topic Search) by Jon Kleinberg.
Modelling the webgraph
- Use a directed graph G = (V, E) to denote the webgraph.
- Assume websites are numbered 1, ..., n and that V = {1, ..., n}.
- Every vertex of G represents a webpage, and every edge (i, j) ∈ E models a link from webpage i to j.
- For a webpage j ∈ V we write N⁻(j) to denote the set of all webpages that contain a link to j, i.e. N⁻(j) := {i ∈ V | (i, j) ∈ E}.
The elements of N⁻(j) are called the predecessors of j.
Page rank idea
Model the fundamental importance of webpage i by a number PRᵢ, the page rank of i.
- PRᵢ is meant to represent the quality of webpage i.
- The larger PRᵢ, the bigger the reputation of i.
- PRⱼ of webpage j is high if many pages i with high page rank PRᵢ point to j.
The values PRᵢ associated to all webpages i ∈ V are chosen such that:
Webpage i with outᵢ outgoing links passes the fraction PRᵢ/outᵢ of its page rank on to every webpage j with (i, j) ∈ E.
For each j ∈ V with N⁻(j) ≠ ∅ we would obtain:
PRⱼ = ∑_{i ∈ N⁻(j)} PRᵢ/outᵢ
Page rank idea
Issue 1: Sinks
With the proposed definition, vertices of out-degree 0 cause a problem: they do not pass on their page rank to other vertices, and they can lead to values PRᵢ that do not make sense as a measure of the importance of a webpage.
Page rank idea
Define a sink
A vertex of out-degree 0
2 Options for removing sinks
In order to determine the page rank, we only consider graphs without sinks. I.e. every vertex has out-degree ≥ 1.
2 Options:
- For every sink i, add additional edges (i, j) to all j ∈ V.
- Delete all sinks and repeat this until a graph without sinks is obtained.
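The page rank equation above can be sketched as a power iteration, handling sinks via the first option (linking every sink to all vertices); function and variable names are illustrative, not from the source:

```python
def page_rank(n, edges, iterations=100):
    """Iterate PR_j = sum over i in N^-(j) of PR_i / out_i."""
    out = [0] * n
    succ = [[] for _ in range(n)]
    for i, j in edges:
        succ[i].append(j)
        out[i] += 1
    # Option 1 for sinks: add edges (i, j) to all j in V.
    for i in range(n):
        if out[i] == 0:
            succ[i] = list(range(n))
            out[i] = n
    pr = [1.0 / n] * n
    for _ in range(iterations):
        new = [0.0] * n
        for i in range(n):
            share = pr[i] / out[i]   # fraction passed along each link
            for j in succ[i]:
                new[j] += share
        pr = new
    return pr
```

Note this plain iteration converges only when the corresponding Markov chain is irreducible and aperiodic, which is why the cards below introduce those notions.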
Irreducible Markov Chain
The stochastic matrix P
for a Markov chain is irreducible, if the graph corresponding to P
is strongly connected.
Aperiodic Markov Chain
The stochastic matrix P
for a Markov chain is aperiodic, if every vertex i
of the graph corresponding to P
satisfies:
the greatest common divisor of the lengths of all paths from i to i is 1.
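Irreducibility can be tested directly from the definition: the graph corresponding to P must be strongly connected. A minimal sketch (assuming an adjacency-list representation) checks that vertex 0 reaches all vertices in the graph and in its reverse:

```python
def strongly_connected(n, succ):
    """succ[u] lists the successors of vertex u."""
    def reaches_all(adj):
        seen, stack = {0}, [0]
        while stack:                  # iterative DFS from vertex 0
            u = stack.pop()
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        return len(seen) == n
    rev = [[] for _ in range(n)]      # reverse all edges
    for u in range(n):
        for v in succ[u]:
            rev[v].append(u)
    return reaches_all(succ) and reaches_all(rev)
```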
Greedy Paradigm
- Build a solution incrementally by choosing the next solution component using a local, not forward-looking, criterion.
- Greedily take a solution component that looks locally optimal.
- Works for problems where a non-optimal solution can be improved to an optimal solution by incrementally replacing locally non-optimal solution components.
- Leads to very efficient algorithms.
Interval Scheduling Problem
Input: jobs 1, ..., n
Job j
starts at time sⱼ
and finishes at time fⱼ
Two jobs are compatible if they do not overlap.
Goal: find the maximum cardinality subset of mutually compatible jobs.
Interval Scheduling - Greedy Algorithm
Implementation: runtime in O(n log n).
- Jobs are sorted by finish time and renumbered accordingly: if fᵢ ≤ fⱼ, then i < j. The sorting runs in O(n log n).
- Jobs are chosen in order of ascending fᵢ.
- Let t be the finish time of the current job.
  - Take the first job j in order such that sⱼ ≥ t.
  - This job becomes the new current job and the search is continued with this job.
- The algorithm requires only one run through the jobs: O(n).
- Together with the sort, the required time is therefore O(n log n).
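The steps above fit in a few lines: sort by finish time, then a single linear pass keeping every job that starts no earlier than the current finish time (a sketch; the tuple representation of jobs is an assumption):

```python
def interval_schedule(jobs):
    """jobs: list of (start, finish) pairs; returns a maximum compatible subset."""
    chosen, t = [], float("-inf")
    for s, f in sorted(jobs, key=lambda job: job[1]):  # O(n log n) sort by f
        if s >= t:               # compatible with the current job
            chosen.append((s, f))
            t = f                # this job becomes the new current job
    return chosen
```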
Minimum spanning tree (MST)
Given a connected (undirected) graph G = (V, E)
with positive rational edge weights w
, an MST T = (V, F)
is a spanning tree, whose sum of edge weights is minimized.
Cayley’s theorem
The complete graph Kₙ on n vertices has n^{n−2} spanning trees.
Kruskal’s algorithm
Start with E(T) = ∅. Consider edges in ascending order of weight. Insert edge e into T unless doing so would create a cycle.
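A minimal Kruskal sketch, using union-find with path compression to detect cycles (names and the edge format are illustrative):

```python
def kruskal(n, edges):
    """edges: list of (weight, u, v); returns the MST as (u, v, weight) triples."""
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path compression
            x = parent[x]
        return x
    mst = []
    for w, u, v in sorted(edges):           # ascending order of weight
        ru, rv = find(u), find(v)
        if ru != rv:                        # no cycle is created
            parent[ru] = rv
            mst.append((u, v, w))
    return mst
```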
Prim’s algorithm
Start with some root node s and greedily grow a tree T from s outward.
At each step, add a cheapest edge e to T that has exactly one endpoint in T.
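Prim's greedy growth can be sketched with a minimum heap (Python's heapq), matching the O(m log n) bound mentioned below; the adjacency-list format is an assumption:

```python
import heapq

def prim(adj, s=0):
    """adj: list of lists of (weight, neighbour); returns the total MST weight."""
    in_tree, total = {s}, 0
    heap = list(adj[s])
    heapq.heapify(heap)
    while heap and len(in_tree) < len(adj):
        w, v = heapq.heappop(heap)    # cheapest edge leaving the tree
        if v in in_tree:
            continue                  # both endpoints already in T: skip
        in_tree.add(v)
        total += w
        for edge in adj[v]:
            heapq.heappush(heap, edge)
    return total
```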
Runtime of Kruskal
- Union-find operations are possible in practically constant time. Therefore, the second part of the algorithm is essentially linear (in m).
- The runtime is therefore dominated by sorting the edges, i.e., O(m log n).
Runtime of Prim
- If one uses a classical heap as a priority queue, the total runtime is O(m log n).
- Can be improved to O(m + n log n) if one uses a so-called Fibonacci heap.
Prim vs Kruskal
Application in practice
- For dense graphs (m = Θ(n²)), Prim's algorithm is more efficient.
- For sparse graphs (m = Θ(n)), Kruskal's algorithm is preferred.
Cutset
A cut is a subset of nodes S.
The corresponding cutset D is the subset of edges with exactly one endpoint in S.
Shortest path problem
Given:
- a directed graph G = (V, E),
- a length (weight/cost) function w, and
- a source (initial) vertex s.
Problem: Find a shortest directed path from s to every other vertex in G.
Dijkstra’s algorithm: main idea
Works similar to BFS or Prim’s:
- Grow a set S of nodes for which a shortest path from s has already been computed.
- Always pick next a vertex that is closest to s among all nodes outside of S.
Dijkstra’s algorithm: running time
- Linked list: O(n²)
- Priority queue:
  - Minimum heap: O((n + m) log n)
  - Fibonacci heap: O(n log n + m)
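A sketch of Dijkstra's algorithm with a minimum heap and lazy deletion of stale entries, matching the O((n + m) log n) bound above; the graph format is an assumption:

```python
import heapq

def dijkstra(adj, s):
    """adj: list of lists of (weight, neighbour); returns distances from s."""
    dist = [float("inf")] * len(adj)
    dist[s] = 0
    heap = [(0, s)]
    while heap:
        d, u = heapq.heappop(heap)   # closest vertex outside S
        if d > dist[u]:
            continue                 # stale heap entry: skip
        for w, v in adj[u]:
            if d + w < dist[v]:
                dist[v] = d + w
                heapq.heappush(heap, (dist[v], v))
    return dist
```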
Heuristic
A speculation, estimation, or educated guess that guides the search for a solution to a problem.
Admissible heuristic:
A heuristic function is admissible, if it never overestimates the distance to the goal.
i.e. all nodes v satisfy: h(v) ≤ length of a shortest v-t-path.
Monotone heuristic
Let u, v ∈ V with (u, v) ∈ E. A heuristic h is called monotone (or consistent) if the following is satisfied: h(u) ≤ w(u, v) + h(v).
Note that this resembles the triangle inequality.
Divide and conquer paradigm
Break up the problem into subproblems. Solve subproblems independently and combine solutions for the subproblems to a solution for the whole problem.
3 Examples of divide and conquer algorithms
- binary search
- merge sort
- quick sort
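Binary search, the first example in the list above, is a minimal instance of the paradigm: each step discards half of the search range (a sketch on a sorted list):

```python
def binary_search(a, x):
    """Return an index i with a[i] == x, or -1 if x is not in sorted list a."""
    lo, hi = 0, len(a) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if a[mid] == x:
            return mid
        if a[mid] < x:
            lo = mid + 1   # discard the left half
        else:
            hi = mid - 1   # discard the right half
    return -1
```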
Mergesort
- Divide array into two halves.
- Recursively sort each half.
- Merge two halves to make sorted whole.
Mergesort running time
T(n) = O(n log₂ n)
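The three mergesort steps above can be sketched directly (a readable version, not tuned for performance):

```python
def merge_sort(a):
    if len(a) <= 1:
        return a
    mid = len(a) // 2
    left = merge_sort(a[:mid])     # recursively sort each half
    right = merge_sort(a[mid:])
    merged, i, j = [], 0, 0        # merge two sorted halves
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    return merged + left[i:] + right[j:]
```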
Quicksort
Also uses divide and conquer, however, slightly different.
Divide: Choose a pivot element x from the sequence A, e.g. the first/last element.
Split A without x into two subsequences Aₗ and Aᵣ such that:
- Aₗ contains all elements that are at most x (≤ x).
- Aᵣ contains all elements that are larger than x (> x).
Conquer:
Recurse on Aₗ.
Recurse on Aᵣ.
Combine: Obtain the sorted list as Aₗ, x, Aᵣ.
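A sketch following the quicksort card above: last element as pivot, split into Aₗ (≤ x) and Aᵣ (> x), recurse, then concatenate Aₗ, x, Aᵣ (out-of-place for clarity; practical versions partition in place):

```python
def quick_sort(a):
    if len(a) <= 1:
        return a
    x, rest = a[-1], a[:-1]             # pivot: the last element
    a_l = [e for e in rest if e <= x]   # elements at most x
    a_r = [e for e in rest if e > x]    # elements larger than x
    return quick_sort(a_l) + [x] + quick_sort(a_r)
```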
Quicksort: analysis
Worst-case: Θ(n²)
Average-case: Θ(n log n)
Memory usage for Quicksort and Mergesort
Quicksort:
- Worst-case in Θ(n),
- best/average-case in Θ(log n).
Mergesort:
Best/average/worst-case in Θ(n).
Quicksort vs Mergesort
preferred applications
Quicksort:
Often the preferred method for most use cases.
Mergesort:
- Mergesort is mainly used for sorting lists.
- Also used for sorting elements on external memory.*
* Here one uses an iterative (instead of recursive) version of mergesort that only requires O(log n) passes through a file.