Unit 3: Page rank, greedy algorithms, divide and conquer Flashcards
Occam’s razor
The simplest explanation is usually the right one.
The architecture of search engines
3 components of a search engine
- Web-crawler
- Indexing
- Assessing the webpages
3 components of a search engine
Web-Crawler
Computer programs that search the internet in order to identify new or changed webpages.
The information that is found by crawlers is processed and stored.
3 components of a search engine
Indexing
The information is stored in a data structure, which supports real-time access to all webpages that contain a given keyword.
3 components of a search engine
Assessing the webpages
The information content of the chosen webpages is assessed regarding possible keywords and regarding their general importance in the internet.
Input & Goal of a Search Engine
Input: A query, consisting of one or several keywords.
Goal: A list of webpages that contain information on the keywords, sorted by relevance.
Search Engines
2 criteria for ‘relevance’ of keywords
- The frequency and positioning of the search keywords on the respective webpage, and the labelling of the links to the webpage.
- The fundamental importance of a webpage.
‘Fundamental importance’ of a webpage via the link structure
If a webpage i contains a link to a webpage j, then (hopefully) the following is true:
- There is a relationship between the content of both webpages.
- The author of webpage i considers the information on webpage j to be relevant.
2 Methods that yield a measure of the fundamental importance of a webpage.
- The page rank method, invented by Sergei Brin and Larry Page, Google founders.
- The HITS method (Hypertext Induced Topic Search) by Jon Kleinberg.
Modelling the webgraph
- Use a directed graph G = (V, E) to denote the webgraph.
- Assume websites are numbered 1, ..., n and that V = {1, ..., n}.
- Every vertex of G represents a webpage, and every edge (i, j) ∈ E models a link from webpage i to j.
- For a webpage j ∈ V we write N⁻(j) to denote the set of all webpages that contain a link to j, i.e. N⁻(j) := {i ∈ V | (i, j) ∈ E}.
The elements of N⁻(j) are called the predecessors of j.
Page rank idea
Model the fundamental importance of webpage i by a number PRᵢ, the page rank of i.
- PRᵢ is meant to represent the quality of webpage i.
- The larger PRᵢ, the bigger the reputation of i.
- PRⱼ of webpage j is high if many pages i with high page rank PRᵢ point to j.
The values PRᵢ associated to all webpages i ∈ V are chosen such that:
Webpage i with outᵢ outgoing links passes the fraction PRᵢ/outᵢ of its page rank on to every webpage j with (i, j) ∈ E.
For each j ∈ V with N⁻(j) ≠ ∅ we would obtain:
PRⱼ = ∑_{i ∈ N⁻(j)} PRᵢ/outᵢ
Page rank idea
Issue 1: Sinks
With the proposed definition, vertices of out-degree 0 cause a problem: they do not pass on their page rank to other vertices, and they can lead to values PRᵢ that do not make sense as a measure of the importance of a webpage.
Page rank idea
Define a sink
A vertex of out-degree 0
2 Options for removing sinks
In order to determine the page rank, we only consider graphs without sinks. I.e. every vertex has out-degree ≥ 1.
2 Options:
- For every sink i, add additional edges (i, j) to all j ∈ V.
- Delete all sinks and repeat this until a graph without sinks is obtained.
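The page rank equation above can be sketched as a power iteration, handling sinks via the first option (linking every sink to all vertices); function and variable names are illustrative, not from the source:

```python
def page_rank(n, edges, iterations=100):
    """Iterate PR_j = sum over i in N^-(j) of PR_i / out_i."""
    out = [0] * n
    succ = [[] for _ in range(n)]
    for i, j in edges:
        succ[i].append(j)
        out[i] += 1
    # Option 1 for sinks: add edges (i, j) to all j in V.
    for i in range(n):
        if out[i] == 0:
            succ[i] = list(range(n))
            out[i] = n
    pr = [1.0 / n] * n
    for _ in range(iterations):
        new = [0.0] * n
        for i in range(n):
            share = pr[i] / out[i]   # fraction passed along each link
            for j in succ[i]:
                new[j] += share
        pr = new
    return pr
```

Note this plain iteration converges only when the corresponding Markov chain is irreducible and aperiodic, which is why the cards below introduce those notions.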
Irreducible Markov Chain
The stochastic matrix P
for a Markov chain is irreducible, if the graph corresponding to P
is strongly connected.
Aperiodic Markov Chain
The stochastic matrix P
for a Markov chain is aperiodic, if every vertex i
of the graph corresponding to P
satisfies:
the greatest common divisor of the lengths of all paths from i to i is 1.
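Irreducibility can be tested directly from the definition: the graph corresponding to P must be strongly connected. A minimal sketch (assuming an adjacency-list representation) checks that vertex 0 reaches all vertices in the graph and in its reverse:

```python
def strongly_connected(n, succ):
    """succ[u] lists the successors of vertex u."""
    def reaches_all(adj):
        seen, stack = {0}, [0]
        while stack:                  # iterative DFS from vertex 0
            u = stack.pop()
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        return len(seen) == n
    rev = [[] for _ in range(n)]      # reverse all edges
    for u in range(n):
        for v in succ[u]:
            rev[v].append(u)
    return reaches_all(succ) and reaches_all(rev)
```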
Greedy Paradigm
- Build a solution incrementally by choosing the next solution component using a local, not forward-looking, criterion.
- Greedily take a solution component that looks locally optimal.
- Works for problems where a non-optimal solution can be improved to an optimal solution by incrementally replacing locally non-optimal solution components.
- Leads to very efficient algorithms.
Interval Scheduling Problem
Input: jobs 1, ..., n
Job j
starts at time sⱼ
and finishes at time fⱼ
Two jobs are compatible if they do not overlap.
Goal: find the maximum cardinality subset of mutually compatible jobs.
Interval Scheduling - Greedy Algorithm
Implementation: runtime in O(n log n).
- Jobs are sorted by finish time and renumbered accordingly: if fᵢ ≤ fⱼ, then i < j. The sorting runs in O(n log n).
- Jobs are chosen in order of ascending fᵢ.
- Let t be the finish time of the current job.
  - Take the first job j in order such that sⱼ ≥ t.
  - This job becomes the new current job and the search is continued with this job.
- The algorithm requires only one run through the jobs: O(n).
- Together with the sort, the required time is therefore O(n log n).
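The steps above fit in a few lines: sort by finish time, then a single linear pass keeping every job that starts no earlier than the current finish time (a sketch; the tuple representation of jobs is an assumption):

```python
def interval_schedule(jobs):
    """jobs: list of (start, finish) pairs; returns a maximum compatible subset."""
    chosen, t = [], float("-inf")
    for s, f in sorted(jobs, key=lambda job: job[1]):  # O(n log n) sort by f
        if s >= t:               # compatible with the current job
            chosen.append((s, f))
            t = f                # this job becomes the new current job
    return chosen
```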
Minimum spanning tree (MST)
Given a connected (undirected) graph G = (V, E)
with positive rational edge weights w
, an MST T = (V, F)
is a spanning tree, whose sum of edge weights is minimized.
Cayley’s theorem
The complete graph Kₙ on n vertices has n^{n−2} spanning trees.
Kruskal’s algorithm
Start with E(T) = ∅. Consider edges in ascending order of weight. Insert edge e into T unless doing so would create a cycle.
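A minimal Kruskal sketch, using union-find with path compression to detect cycles (names and the edge format are illustrative):

```python
def kruskal(n, edges):
    """edges: list of (weight, u, v); returns the MST as (u, v, weight) triples."""
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path compression
            x = parent[x]
        return x
    mst = []
    for w, u, v in sorted(edges):           # ascending order of weight
        ru, rv = find(u), find(v)
        if ru != rv:                        # no cycle is created
            parent[ru] = rv
            mst.append((u, v, w))
    return mst
```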
Prim’s algorithm
Start with some root node s and greedily grow a tree T from s outward.
At each step, add a cheapest edge e to T that has exactly one endpoint in T.
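Prim's greedy growth can be sketched with a minimum heap (Python's heapq), matching the O(m log n) bound mentioned below; the adjacency-list format is an assumption:

```python
import heapq

def prim(adj, s=0):
    """adj: list of lists of (weight, neighbour); returns the total MST weight."""
    in_tree, total = {s}, 0
    heap = list(adj[s])
    heapq.heapify(heap)
    while heap and len(in_tree) < len(adj):
        w, v = heapq.heappop(heap)    # cheapest edge leaving the tree
        if v in in_tree:
            continue                  # both endpoints already in T: skip
        in_tree.add(v)
        total += w
        for edge in adj[v]:
            heapq.heappush(heap, edge)
    return total
```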
Runtime of Kruskal
- Union-find operations are possible in practically constant time. Therefore, the second part of the algorithm is essentially linear (in m).
- The runtime is therefore dominated by sorting the edges, i.e., O(m log n).
Runtime of Prim
- If one uses a classical heap as a priority queue, the total runtime is O(m log n).
- Can be improved to O(m + n log n) if one uses a so-called Fibonacci heap.
Prim vs Kruskal
Application in practice
- For dense graphs (m = Θ(n²)), Prim's algorithm is more efficient.
- For sparse graphs (m = Θ(n)), Kruskal's algorithm is preferred.
Cutset
A cut is a subset of nodes S.
The corresponding cutset D is the subset of edges with exactly one endpoint in S.
Shortest path problem
Given:
- a directed graph G = (V, E),
- a length (weight/cost) function w, and
- a source (initial) vertex s.
Problem: Find a shortest directed path from s to every other vertex in G.
Dijkstra’s algorithm: main idea
Works similar to BFS or Prim’s:
- Grow a set S of nodes for which a shortest path from s has already been computed.
- Always pick next a vertex that is closest to s among all nodes outside of S.
Dijkstra’s algorithm: running time
- Linked list: O(n²)
- Priority queue:
  - Minimum heap: O((n + m) log n)
  - Fibonacci heap: O(n log n + m)
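A sketch of Dijkstra's algorithm with a minimum heap and lazy deletion of stale entries, matching the O((n + m) log n) bound above; the graph format is an assumption:

```python
import heapq

def dijkstra(adj, s):
    """adj: list of lists of (weight, neighbour); returns distances from s."""
    dist = [float("inf")] * len(adj)
    dist[s] = 0
    heap = [(0, s)]
    while heap:
        d, u = heapq.heappop(heap)   # closest vertex outside S
        if d > dist[u]:
            continue                 # stale heap entry: skip
        for w, v in adj[u]:
            if d + w < dist[v]:
                dist[v] = d + w
                heapq.heappush(heap, (dist[v], v))
    return dist
```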
Heuristic
A speculation, estimation, or educated guess that guides the search for a solution to a problem.
Admissible heuristic:
A heuristic function is admissible, if it never overestimates the distance to the goal.
i.e. all nodes v satisfy: h(v) ≤ length of a shortest v-t-path.
Monotone heuristic
Let u, v ∈ V with (u, v) ∈ E. A heuristic h is called monotone (or consistent) if the following is satisfied: h(u) ≤ w(u, v) + h(v).
Note that this resembles the triangle inequality.
Divide and conquer paradigm
Break up the problem into subproblems. Solve subproblems independently and combine solutions for the subproblems to a solution for the whole problem.
3 Examples of divide and conquer algorithms
- binary search
- merge sort
- quick sort
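Binary search, the first example in the list above, is a minimal instance of the paradigm: each step discards half of the search range (a sketch on a sorted list):

```python
def binary_search(a, x):
    """Return an index i with a[i] == x, or -1 if x is not in sorted list a."""
    lo, hi = 0, len(a) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if a[mid] == x:
            return mid
        if a[mid] < x:
            lo = mid + 1   # discard the left half
        else:
            hi = mid - 1   # discard the right half
    return -1
```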
Mergesort
- Divide array into two halves.
- Recursively sort each half.
- Merge two halves to make sorted whole.
Mergesort running time
T(n) = O(n log₂ n)
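The three mergesort steps above can be sketched directly (a readable version, not tuned for performance):

```python
def merge_sort(a):
    if len(a) <= 1:
        return a
    mid = len(a) // 2
    left = merge_sort(a[:mid])     # recursively sort each half
    right = merge_sort(a[mid:])
    merged, i, j = [], 0, 0        # merge two sorted halves
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    return merged + left[i:] + right[j:]
```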
Quicksort
Also uses divide and conquer, however, slightly different.
Divide: Choose a pivot element x from the sequence A, e.g. the first/last element.
Split A without x into two subsequences Aₗ and Aᵣ such that:
- Aₗ contains all elements that are at most x (≤ x).
- Aᵣ contains all elements that are larger than x (> x).
Conquer:
Recurse on Aₗ.
Recurse on Aᵣ.
Combine: Obtain the sorted list as Aₗ, x, Aᵣ.
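A sketch following the quicksort card above: last element as pivot, split into Aₗ (≤ x) and Aᵣ (> x), recurse, then concatenate Aₗ, x, Aᵣ (out-of-place for clarity; practical versions partition in place):

```python
def quick_sort(a):
    if len(a) <= 1:
        return a
    x, rest = a[-1], a[:-1]             # pivot: the last element
    a_l = [e for e in rest if e <= x]   # elements at most x
    a_r = [e for e in rest if e > x]    # elements larger than x
    return quick_sort(a_l) + [x] + quick_sort(a_r)
```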
Quicksort: analysis
Worst-case: Θ(n²)
Average-case: Θ(n log n)
Memory usage for Quicksort and Mergesort
Quicksort:
- Worst-case in Θ(n),
- best/average-case in Θ(log n).
Mergesort:
Best/average/worst-case in Θ(n).
Quicksort vs Mergesort
preferred applications
Quicksort:
Often the preferred method for most use cases.
Mergesort:
- Mergesort is mainly used for sorting lists.
- Also used for sorting elements on external memory.*
* Here one uses an iterative (instead of recursive) version of mergesort that only requires O(log n) passes through a file.