Unit 3: Page rank, greedy algorithms, divide and conquer Flashcards

1
Q

Occam’s razor

A

The simplest explanation is usually the right one.

2
Q

The architecture of search engines

3 components of a search engine

A
  • Web-crawler
  • Indexing
  • Assessing the webpages
3
Q

3 components of a search engine

Web-Crawler

A

Computer programs that search the internet in order to identify new or changed webpages.

The information that is found by crawlers is processed and stored.

4
Q

3 components of a search engine

Indexing

A

The information is stored in a data structure, which supports real-time access to all webpages that contain a given keyword.

5
Q

3 components of a search engine

Assessing the webpages

A

The information content of the chosen webpages is assessed with respect to possible keywords and with respect to their general importance on the internet.

6
Q

Input & Goal of a Search Engine

A

Input: A query, consisting of one or several keywords.

Goal: A list of webpages that contain information on the keywords, sorted by relevance.

7
Q

Search Engines

2 criteria for ‘relevance’ of keywords

A
  • The frequency and positioning of the search keywords on the respective webpage, and the labelling of the links to the webpage.
  • The fundamental importance of a webpage.
8
Q

‘Fundamental importance’ of a webpage via the link structure

A

If a webpage i contains a link to a webpage j, then (hopefully), the following is true:

  • There is a relationship between the content of both webpages.
  • The author of webpage i considers the information on webpage j to be relevant.
9
Q

2 Methods that yield a measure of the fundamental importance of a webpage.

A
  • The page rank method, invented by the Google founders Sergey Brin and Larry Page.
  • The HITS method (Hypertext Induced Topic Search) by Jon Kleinberg.
10
Q

Modelling the webgraph

A
  • Use a directed graph G=(V, E) to denote the webgraph.
  • Assume webpages are numbered 1, ..., n and that V = {1, ..., n}.
  • Every vertex of G represents a webpage, and every edge (i, j) ∈ E models a link from webpage i to j.
  • For a webpage j ∈ V we write N⁻(j) to denote the set of all webpages that
    contain a link to j, i. e.
    N⁻(j) := {i ∈ V | (i, j) ∈ E}

The elements of N⁻(j) are called predecessors of j.
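
A minimal sketch of this representation in Python (illustrative; webpages numbered 0, ..., n-1 for convenience): the webgraph is stored as a directed edge list, and N⁻(j) is derived from it.

# Webgraph as a directed edge list; (i, j) means: webpage i links to webpage j.
n = 4
E = [(0, 1), (1, 2), (2, 0), (3, 2)]

def predecessors(j, E):
    """N^-(j): all webpages that contain a link to j."""
    return {i for (i, k) in E if k == j}

print(predecessors(2, E))   # {1, 3}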

11
Q

Page rank idea

A

Model the fundamental importance of webpage i by a number PRᵢ, the page rank of i.

  • PRᵢ is meant to represent the quality of webpage i.
  • The larger PRᵢ, the bigger the reputation of i.
  • PRⱼ of webpage j is high, if many pages i with high page rank PRᵢ point to j.

The values PRᵢ associated to all webpages i ∈ V are chosen such that:
Webpage i with outᵢ outgoing links passes the fraction PRᵢ/outᵢ of its page rank on to each webpage j with (i, j) ∈ E.

For each j ∈ V with N⁻(j) ≠ ∅ we would then obtain:

PRⱼ = ∑ PRᵢ/outᵢ, where the sum ranges over all i ∈ N⁻(j)
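
A minimal Python sketch of the iteration this suggests (names are illustrative; assumes a sink-free webgraph given as a directed edge list and repeatedly applies the formula above):

# Illustrative PageRank sketch: repeatedly apply PR_j = sum of PR_i / out_i over i in N^-(j).
def page_rank(n, edges, iterations=100):
    out = [0] * n                      # out_i: number of outgoing links of webpage i
    preds = [[] for _ in range(n)]     # preds[j] corresponds to N^-(j)
    for i, j in edges:
        out[i] += 1
        preds[j].append(i)

    pr = [1.0 / n] * n                 # start with a uniform distribution
    for _ in range(iterations):
        pr = [sum(pr[i] / out[i] for i in preds[j]) for j in range(n)]
    return pr

# Example: 3 webpages with links 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
print(page_rank(3, [(0, 1), (0, 2), (1, 2), (2, 0)]))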

12
Q

Page rank idea

Issue 1: Sinks

A

With the proposed definition, vertices of out-degree 0 cause a problem: they do not pass their page rank on to other vertices, and they can lead to values PRᵢ that do not make sense as a measure of the importance of a webpage.

13
Q

Page rank idea

Define a sink

A

A vertex of out-degree 0

14
Q

2 Options for removing sinks

A

In order to determine the page rank, we only consider graphs without sinks, i.e. every vertex has out-degree ≥ 1.

2 Options:
- For every sink i, add additional edges (i, j) to all j ∈ V.
- Delete all sinks and repeat this until a graph without sinks is obtained.
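
A small Python sketch of the first option (illustrative names; edges given as a list of pairs):

# For every sink i (out-degree 0), add edges (i, j) to all j in V.
def add_sink_edges(n, edges):
    out_degree = [0] * n
    for i, _ in edges:
        out_degree[i] += 1
    new_edges = list(edges)
    for i in range(n):
        if out_degree[i] == 0:                      # i is a sink
            new_edges += [(i, j) for j in range(n)]
    return new_edges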

15
Q

Irreducible Markov Chain

A

The stochastic matrix P for a Markov chain is irreducible, if the graph corresponding to P is strongly connected.

16
Q

Aperiodic Markov Chain

A

The stochastic matrix P for a Markov chain is aperiodic, if every vertex i of the graph corresponding to P satisfies:

the greatest common divisor of the lengths of all paths from i to i is 1.

17
Q

Greedy Paradigm

A
  • Build a solution incrementally by choosing the next solution component using a local, non-forward-looking criterion.
  • Greedily take a solution component that looks locally optimal.
  • Works for problems where a non-optimal solution can be improved to an optimal solution by incrementally replacing locally non-optimal solution components.
  • Leads to very efficient algorithms.
18
Q

Interval Scheduling Problem

A

Input: jobs 1, ..., n

Job j starts at time sⱼ and finishes at time fⱼ

Two jobs are compatible if they do not overlap.

Goal: find the maximum cardinality subset of mutually compatible jobs.

19
Q

Interval Scheduling - Greedy Algorithm

A

Implementation: runtime in O(n log n).

  • Jobs are sorted by finish time and renumbered accordingly, so that i < j implies fᵢ ≤ fⱼ. The sorting runs in O(n log n).
  • Jobs are chosen in order of ascending fᵢ.
  • Let t be the finish time of the current job.
    • Then take the first job j in order such that sⱼ ≥ t.
    • This job becomes the new current job, and the search is continued with it.
  • The algorithm requires only one run through the jobs: O(n).
  • Together with the sorting, the required time is therefore O(n log n) (see the sketch below).
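
A Python sketch of this greedy algorithm (illustrative; jobs are given as (sⱼ, fⱼ) pairs):

# Earliest-finish-time greedy: sort by finish time, then one pass through the jobs.
def interval_scheduling(jobs):
    jobs = sorted(jobs, key=lambda job: job[1])   # sort by finish time: O(n log n)
    selected = []
    t = float('-inf')                             # finish time of the current job
    for s, f in jobs:                             # single pass: O(n)
        if s >= t:                                # job is compatible with the current job
            selected.append((s, f))
            t = f
    return selected

print(interval_scheduling([(1, 4), (3, 5), (0, 6), (5, 7), (3, 9), (5, 9), (6, 10), (8, 11)]))
# [(1, 4), (5, 7), (8, 11)]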
20
Q

Minimum spanning tree (MST)

A

Given a connected (undirected) graph G = (V, E) with positive rational edge weights w, an MST T = (V, F) is a spanning tree whose sum of edge weights is minimum.

21
Q

Cayley’s theorem

A

The complete graph Kₙ on n vertices has nⁿ⁻² spanning trees.

22
Q

Kruskal’s algorithm

A

Start with E(T) = ∅. Consider edges in ascending order of weight. Insert edge e in T unless doing so would create a cycle.
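
A Python sketch of Kruskal's algorithm with a simple union-find structure (illustrative; edges given as (weight, u, v) triples, vertices numbered 0, ..., n-1):

def kruskal(n, edges):
    parent = list(range(n))                 # union-find: parent pointers, one per vertex

    def find(x):                            # representative of x's component
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path compression
            x = parent[x]
        return x

    tree = []
    for w, u, v in sorted(edges):           # consider edges in ascending order of weight
        ru, rv = find(u), find(v)
        if ru != rv:                        # inserting (u, v) creates no cycle
            parent[ru] = rv
            tree.append((u, v, w))
    return tree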

23
Q

Prim’s algorithm

A

Start with some root node s and greedily grow a tree T from s outward.

At each step, add a cheapest edge e to T that has exactly one endpoint in T.
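
A Python sketch of Prim's algorithm using heapq as the priority queue (illustrative; adj[u] is a list of (weight, v) pairs):

import heapq

def prim(adj, s=0):
    visited = {s}
    heap = [(w, v, s) for w, v in adj[s]]       # candidate edges with one endpoint in T
    heapq.heapify(heap)
    tree = []
    while heap and len(visited) < len(adj):
        w, v, u = heapq.heappop(heap)           # cheapest edge leaving the tree
        if v in visited:
            continue
        visited.add(v)
        tree.append((u, v, w))
        for w2, x in adj[v]:
            if x not in visited:
                heapq.heappush(heap, (w2, x, v))
    return tree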

24
Q

Runtime of Kruskal

A
  • Union-find operations are possible in practically constant time. Therefore, the second part of the algorithm is essentially linear (in m).
  • The runtime is therefore dominated by sorting the edges, i. e., O(m log n).
25
Q

Runtime of Prim

A
  • If one uses a classical heap as a priority queue, the total runtime is O(m log n).
  • Can be improved to O(m + n log n) if one uses a so-called Fibonacci-Heap.
26
Q

Prim vs Kruskal
Application in practice

A
  • For dense graphs (m = Θ(n²)) Prim’s algorithm is more efficient.
  • For sparse graphs (m = Θ(n)) Kruskal’s algorithm is preferred.
27
Q

Cutset

A

A cut is a subset of nodes S.

The corresponding cutset D is the subset of edges with exactly one endpoint in S.

28
Q

Shortest path problem

A

Given:
- a directed graph G = (V, E),
- a length (weight/cost) function w and
- a source (initial) vertex s.

Problem: Find a shortest directed path from s to every other vertex in G.

29
Q

Dijkstra’s algorithm: main idea

A

Works similar to BFS or Prim’s:
- Grow a set S of nodes for which you have already computed a shortest path from s.
- Always pick some vertex next that is closest to s among all nodes outside of S.
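
A Python sketch of this idea with a heap-based priority queue (illustrative; adj[u] is a list of (v, w) pairs with non-negative weights w):

import heapq

def dijkstra(adj, s):
    dist = {s: 0}
    settled = set()                        # the set S of vertices with final distances
    heap = [(0, s)]
    while heap:
        d, u = heapq.heappop(heap)         # closest vertex outside of S
        if u in settled:
            continue
        settled.add(u)
        for v, w in adj[u]:
            if v not in dist or d + w < dist[v]:
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist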

30
Q

Dijkstra’s algorithm: running time

A
  • linked list: O(n²)
  • priority queue:
    • Minimum Heap: O((n + m) log n)
    • Fibonacci Heap: O(n log n + m)
31
Q

Heuristic

A

A speculation, estimation, or educated guess that guides the search for a solution to a problem.

32
Q

Admissible heuristic:

A

A heuristic function h is admissible if it never overestimates the distance to the goal,

i.e. all nodes v satisfy: h(v) ≤ length of a shortest v-t path.

For example, in route planning on a road network, the straight-line distance to the goal t is admissible, because no route can be shorter than the straight line.

33
Q

Monotone heuristic

A

Let u, v ∈ V with (u, v) ∈ E. A heuristic h is called monotone (or consistent),

if the following is satisfied:
h(u) ≤ w(u, v) + h(v)

Note that this resembles the triangle inequality.

34
Q

Divide and conquer paradigm

A

Break up the problem into subproblems. Solve the subproblems independently and combine their solutions into a solution for the whole problem.

35
Q

3 Examples of divide and conquer algorithms

A
  • binary search
  • merge sort
  • quick sort
36
Q

Mergesort

A
  • Divide array into two halves.
  • Recursively sort each half.
  • Merge two halves to make sorted whole.
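
A Python sketch of mergesort (illustrative):

def merge_sort(a):
    if len(a) <= 1:                       # base case: already sorted
        return a
    mid = len(a) // 2
    left, right = merge_sort(a[:mid]), merge_sort(a[mid:])   # sort each half
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):                  # merge the sorted halves
        if left[i] <= right[j]:
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    return merged + left[i:] + right[j:]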
37
Q

Mergesort running time

A

T(n) = O(n log₂ n)
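
This follows from the recurrence T(n) = 2·T(n/2) + Θ(n): merging takes linear time on every level of the recursion, and there are log₂ n levels.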

38
Q

Quicksort

A

Also uses divide and conquer, but in a slightly different way.

Divide: Choose a pivot element x from the sequence A, e. g. the first/last element.

Split A without x into two subsequences Aₗ and Aᵣ such that:
- Aₗ contains all elements that are at most x (≤ x).
- Aᵣ contains all elements that are larger than x (> x).

Conquer:
Recurse for Aₗ.
Recurse for Aᵣ.

Combine: Obtain the sorted list as Aₗ, x, Aᵣ
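
A Python sketch following this scheme (illustrative; uses the last element as the pivot):

def quick_sort(a):
    if len(a) <= 1:
        return a
    x = a[-1]                                  # pivot: last element
    a_l = [y for y in a[:-1] if y <= x]        # elements at most x
    a_r = [y for y in a[:-1] if y > x]         # elements larger than x
    return quick_sort(a_l) + [x] + quick_sort(a_r)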

39
Q

Quicksort: analysis

A

Worst-case: Θ(n²)
Average-case: Θ(n log n)

40
Q

Memory usage for Quicksort and Mergesort

A

Quicksort:
- Worst-case in Θ(n),
- best/average-case in Θ(log n).

Mergesort:
Best/average/worst-case in Θ(n)

41
Q

Quicksort vs Mergesort
preferred applications

A

Quicksort:
Often the preferred method for most use cases.

Mergesort:
- Mergesort is mainly used for sorting lists.
- Also used for sorting elements on external memory***

*** Here one uses an iterative (instead of recursive) version of mergesort that only requires O(log n) passes through a file.