Module 8b - Graph Processing Flashcards
What is the motivation for Google Pregel?
- Many important data sets look like graphs.
- Some graph sets are massive, and computations can be parallelized
A Pregel computation consists of a sequence of iterations called a superstep.
What are the 6 things that occur during a superstep?
- Workers asynchronously execute a user-defined function on each vertex
- Vertices can receive messages sent in the previous superstep
- Vertices can modify their value, modify the values of their edges, as well as add/remove edges
- Vertices can send messages to be received in the next superstep
- Each vertex may also deactivate itself (i.e., vote to halt)
- An inactive vertex is reactivated when it receives a message
The model of computation in pregel is vertex-centric. What does this mean?
“think like a vertex”
During a superstep in pregel, what can a vertex do?
- A vertex can read messages sent to it in superstep S − 1,
- A vertex can send messages to other vertices that will be received at superstep S + 1
- A vertex can modify the state of V and its outgoing edges.
What does pregel maintain in the state for each vertex?
- A problem specific value (ex: PageRank value)
- A list of messages sent to the vertex
- A list of outgoing edges
- A binary active/inactive state
In pregel, supersteps are computed _______, whereas workers compute _______ within each superstep, and communicate in _______ supersteps
synchronously
asynchronously
between
What is a vertex partition in Pregel?
A subset of a directed graph’s vertices. Each worker is responsible for a vertex partition
In Pregel, vertices can be either “active” or “inactive”.
How does a vertex switch from active to inactive? vice versa?
- If an active vertex votes to halt, then it becomes inactive
- If an inactive vertex receives a message, then it becomes active
When does the pregel program stop executing?
A distributed execution stops when all vertices are inactive and where are no more messages to process
Do computing units (vertices) in Pregel use shared memory?
No. They use message passing. Each vertex uses message passing to communicate with other vertices
Upon initialization of a Pregel program, how is vertex ownership determined?
- The vertex ownership is determined by a partitioner, which by default is a simple hash function over vertices.
- This ensures a fairly even distribution of data across workers, but does not always balance the computation
In Pregel, what does a worker do upon initialization?
- Each worker reads its section of the input
- Stores vertices that belong to it
- Forwards the remaining vertices to the appropriate workers
In Pregel, what does the master do upon initialization?
- Assigns a section of the input to each worker, (similarly to the way input splits are determined in Hadoop)
Upon initialization of the Pregel program, can the user override the default partitioning scheme?
Yes, it can be overwritten to exploit locality by co-locating graph components or dense subgraphs
In Pregel, combiners are supported (similar to Hadoop).
What is the benefit of combiners in Pregel? What is the drawback?
- Combiners reduce the amount of data exchanged over the network and the number of messages
- This trades CPI cycles against network I/O