How does Google rank webpages? Flashcards
Number of webpages out there
40-60 billion
Directional links
- which way a link leads to
- in-degree
- links point towards webpage
- out-degree
- links points away from webpage
Networks are made of what elements?
- topolgy
- graphs, matrices
- functionality
- what you do on the graph
Calculate Node Importance Scores
- add up importance scores through incoming links
- normalize by the spread of importance
What is generally important in determining a node’s importance score?
- the hyperlink graph of webpages
- the out-degree of all nodes pointing to the node
- the importance scores of other nodes
The hyperlink graph is key, because that gives the adjacency relationships among the nodes. This gives us parameters such as the out-degrees of all nodes pointing to i. But we must also remember that computing importance is recursive: In the end, we must solve a linear system of equations that will comput the importance scores of all nodes simultaneously.
What does Google do? What two scores determine a page’s rank?
- crawling the web
- storing and indexing the pages
- computing two scores to rank pages per search
- relevant scores
- search terms
- importance scores
- relevant scores
Matrix Muliplication
- vector
- matrix with one column
- columns of first matrix must = rows of second matrix
- 1x2 x 2x2
- πTH = CπT
- H = vector
- π2 = eigen vector of G
- C = constant = eigen value
What is the first ROW of the matrix H for the hyperlink graph?
- [0 0.5 0.5 0]
Dangling Node
- node with in-degrees and no out-degrees
What are reasons why dangling nodes pose an issue to computing a set of importance scores?
- Mathematically speaking, each requires division by 0 across one row of the H matrix.
- They make the algorithm less distributed.
- In the equation π’H=π’ (with dangling rows in H filled with zeros), they could cause the only solution to be a vector of zeros.
- They could cause the equation π’H=π’ not to have an equilibrium.
Since a dangling node has an out-degree of 0, this technically requires division by 0 when constructing H. Also, assuming we fill these dangling rows with 0’s, we showed a case where the only solution is a vector of 0’s. Finally, we stated that a dangling node can prevent H from having a dominant eivenvector corresponding to an eigenvalue of 1, which is a requirement for equilibrium. But they have nothing to do with making the PageRank algorithm more or less distributed.
What is true of the Google matrix G?
- G has strictly positive entries (as long as θ is not 1).
- It guarantees that the PageRank algorithm will converge to a unique equilibrium.
- It is composed of a large, sparse matrix and two rank-one matrices.
The randomization ingredient guarantees that G will be strictly positive. Also, G guarantees that the PageRank algorithm will converge to a unique equilibrium. Further, we formed G as the sum of H, a large sparse matrix in reality, and two rank-one matrices. But the rows of G always sum to 1.
Consider the following webgraph. Without performing any computations, what is true?
- Node 2 will have a higher importance score than Node 1.
- Node 3 is a dangling node.
- Node 4 is a dangling node.
- The Google matrix G will have all positive entries (as long as theta is not 1).
Node 2 must have a higher importance score than Node 1, because Node 1’s importance comes entirely from Node 4, while Node 2’s comes from Node 4 and Node 1. Node 3 is a dangling node by definition because it doesn’t point to any other nodes. And the Google matrix always has positive entries. But Node 4 is not a dangling node.