14. Parallel Query Processing Flashcards

1
Q

What is parallel query processing?

A

Running a database, and the queries issued against it, across multiple machines in parallel

2
Q

What is shared memory architecture?

A

every CPU shares the same memory and disk

3
Q

What is a shared disk architecture?

A

each CPU has its own memory, but all of them share the same disk

4
Q

Shared nothing architecture - the main trade-off

A

Shared memory/disk architectures are easy to reason about, but contention for the shared resources holds the system back. It is possible to achieve a much higher level of parallelism if every machine has its own disk and memory, because no machine needs to wait for a shared resource to become available.

In shared-nothing, the machines communicate with each other solely through the network by passing messages to each other.

Trade-off: easy to reason about vs. performance

5
Q

Intra vs inter-query parallelism

A

Intra-query parallelism attempts to make one query run as fast as possible by spreading the work over multiple computers. The other major type of parallelism is inter-query parallelism which gives each machine different queries to work on so that the system can achieve a high throughput and complete as many queries as possible.

6
Q

Types of Intra-query Parallelism

A

We can further divide intra-query parallelism into two classes: intra-operator and inter-operator.

Intra-operator is making one operator run as quickly as possible. An example of intra-operator parallelism is dividing up the data onto several machines and having them sort the data in parallel. This parallelism makes sorting (one operation) as fast as possible.

Inter-operator parallelism is making a query run as fast as possible by running the operators in parallel.

7
Q

Types of Inter-operator Parallelism

A

The first type is pipeline parallelism. In pipeline parallelism, records are passed to the parent operator as soon as the child operator finishes with them, so the parent can work on a record its child has already processed while the child works on a different record.

The second type is bushy tree parallelism, in which different branches of the query plan tree are run in parallel.

8
Q

Sharding and replication - explain

A

Both are strategies for spreading data across multiple machines.

Sharding - each data page is stored on only one machine. Used to achieve better performance.

Replication - each data page appears on multiple machines. Used to achieve better availability (if one machine goes down, another can handle requests)

9
Q

Partitioning scheme for sharding - def.

A

A partitioning scheme is a rule that determines what machine a certain record will end up on.

10
Q

range partitioning scheme - describe, when it’s a good scheme?

A

each machine gets a certain range of values that it will store (i.e. machine 1 will store values 1-5, machine 2 will store values 6-10, and so on).

This scheme is very good for queries that look up a specific key, and especially for range queries (compared to the other schemes we'll talk about), because you only need to request data from the machines whose ranges contain the requested values.
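As a rough sketch (the boundary values below are invented for illustration), a range partitioning scheme can be expressed as a lookup into a sorted list of boundary keys:

```python
import bisect

# Upper bound (inclusive) of the key range each machine stores.
# With boundaries [5, 10], machine 0 holds keys <= 5, machine 1 holds
# keys 6-10, and machine 2 holds everything larger.
BOUNDARIES = [5, 10]

def range_partition(key, boundaries=BOUNDARIES):
    """Return the index of the machine whose range contains `key`."""
    return bisect.bisect_left(boundaries, key)
```

A key lookup only contacts the one machine `range_partition` returns, and a range query only contacts the machines whose ranges overlap the query.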

11
Q

Hash Partitioning scheme - describe, when it’s good/bad?

A

In a hash partitioning scheme, each record is hashed and sent to the machine that corresponds to its hash value. This means that all like values will be assigned to the same machine (i.e. if value 4 goes to machine 1 then all of the 4s must go to that machine), but it makes no guarantees about where close values go.

It will still perform well for key lookups, but not for range queries (because every machine will likely need to be queried). Hash partitioning is the scheme of choice for parallel hashing and parallel hash join.
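A minimal sketch of such a partitioning function, with Python's built-in `hash` standing in for whatever hash function the database uses:

```python
def hash_partition(key, num_machines):
    # All records with the same key go to the same machine, but nearby
    # keys (e.g. 4 and 5) may land on different machines.
    return hash(key) % num_machines
```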

12
Q

Round Robin Partitioning scheme - describe, when good/bad?

A

The last scheme we will talk about is called round robin partitioning. In this scheme we go record by record and assign each record to the next machine: the first record goes to the first machine, the second record to the second machine, and so on. When we reach the final machine, we wrap around and assign the next record to the first machine.

Every machine is guaranteed to get (almost exactly) the same amount of data, so this scheme achieves maximum parallelization. The downside, of course, is that every machine will need to be activated for every query.
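The assignment rule above is just record-index modulo machine count; a toy sketch:

```python
def round_robin_partition(records, num_machines):
    """Assign record i to machine i mod num_machines (wrapping around)."""
    return [i % num_machines for i, _ in enumerate(records)]
```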

13
Q

Parallel sorting - describe

A
  1. Range partition the table
  2. Perform local sort on each machine

We range partition the table because then once each machine sorts its data, the entire table is in sorted order (the data on each machine can be simply concatenated together if needed).
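The two steps can be simulated in a few lines (nested lists stand in for machines, and the boundary values are made up):

```python
import bisect

def parallel_sort(records, boundaries):
    # Step 1: range partition the records across "machines".
    machines = [[] for _ in range(len(boundaries) + 1)]
    for r in records:
        machines[bisect.bisect_left(boundaries, r)].append(r)
    # Step 2: each machine sorts its data locally
    # (in a real system these sorts run in parallel).
    for m in machines:
        m.sort()
    # Concatenating the partitions yields a globally sorted table.
    return [r for m in machines for r in m]
```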

14
Q

Parallel Hashing - describe

A
  1. Hash partition the table
  2. Perform local hashing on each machine

Hash partitioning the table guarantees that like values will be assigned to the same machine. It would also be valid to range partition the data (because it has the same guarantee). However, in practice, range partitioning is a little harder (how do you come up with the ranges efficiently?) than hash-partitioning so use hash partitioning when possible.
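A toy simulation of the two steps (lists stand in for machines, and Python's `hash` stands in for the partitioning hash function):

```python
from collections import defaultdict

def parallel_hash(records, num_machines):
    # Step 1: hash partition so all like values land on the same machine.
    machines = [[] for _ in range(num_machines)]
    for r in records:
        machines[hash(r) % num_machines].append(r)
    # Step 2: each machine builds its own local hash table.
    tables = []
    for m in machines:
        table = defaultdict(list)
        for r in m:
            table[r].append(r)
        tables.append(table)
    return tables
```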

15
Q

Parallel Sort Merge Join - describe

A
  1. Range partition each table on the join column using the same ranges
  2. Perform local sort merge join on each machine

We need to use the same ranges to guarantee that all matches appear on the same machine. If we used different ranges for each table, then a record from table R could land on a different machine than a record from table S even if they have the same value for the join column, which would prevent those records from ever being joined.
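A small sketch of step 1, showing why both tables must be partitioned with the same (here, invented) boundaries:

```python
import bisect

BOUNDARIES = [10]  # the SAME ranges are used for both tables

def partition_on_join_key(table, boundaries=BOUNDARIES):
    """Partition rows of the form (join_key, payload) by join key."""
    parts = [[] for _ in range(len(boundaries) + 1)]
    for row in table:
        parts[bisect.bisect_left(boundaries, row[0])].append(row)
    return parts

# Because both tables use the same boundaries, (5, "r1") from R and
# (5, "s1") from S land in the same partition, so a local sort-merge
# join on that machine will find the match.
```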

16
Q

Parallel Grace Hash Join - describe

A
  1. Hash partition each table using the same hash function on the join column
  2. Perform local grace hash join on each machine

Similarly to parallel sort-merge join, we need to use the same hash function for both tables to guarantee that matching records will be partitioned to the same machines.

17
Q

Broadcast join - describe, the main trade-off

A

A broadcast join will send the entire small relation to every machine, and then each machine will perform a local join. The concatenation of the results from each machine will be the final result of the join. While this algorithm has each machine operate on more data, we make up for this by sending much less data over the network. When we’re joining together one really large table and one small table, a broadcast join will normally be the fastest because of its network cost advantages.
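A toy sketch of the idea, with nested lists standing in for the big table's per-machine shards (names are illustrative):

```python
from collections import defaultdict

def broadcast_join(big_table_partitions, small_table):
    """Join rows of the form (key, payload) from each shard of the big
    table against the whole small table, broadcast to every machine."""
    results = []
    for machine_rows in big_table_partitions:
        # Every machine receives the ENTIRE small table and indexes it.
        index = defaultdict(list)
        for key, payload in small_table:
            index[key].append(payload)
        # Local join on this machine's shard of the big table.
        for key, payload in machine_rows:
            for small_payload in index[key]:
                results.append((key, payload, small_payload))
    return results  # concatenation of the per-machine results
```

Only the small table crosses the network, which is why this wins when one relation is much smaller than the other.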

18
Q

Hierarchical Aggregation - what is it?

Explain for count and avg

A

Hierarchical aggregation is how we parallelize aggregation operations (i.e. SUM, COUNT, AVG).

To parallelize COUNT, each machine individually counts its records. The machines all send their counts to the coordinator machine, which then sums them together to compute the overall count.

AVG is a little harder to parallelize because the average of a bunch of averages isn't necessarily the average of the data set. To parallelize AVG, each machine calculates the sum of its values and their count, and sends both to the coordinator machine. The coordinator adds up the sums to get the overall sum, adds up the counts to get the overall count, and then divides the sum by the count to compute the final average.
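The AVG scheme above can be checked with a tiny worked example (the chunks stand in for each machine's local data):

```python
# Each "machine" reports a partial (sum, count) pair for its local data.
chunks = [[1, 2, 3], [4, 5], [6]]           # data held by three machines
partials = [(sum(c), len(c)) for c in chunks]

# The coordinator combines the partials into the global average.
total = sum(s for s, _ in partials)          # overall sum: 21
count = sum(n for _, n in partials)          # overall count: 6
average = total / count                      # 21 / 6 = 3.5
```

Note that averaging the three local averages (2.0, 4.5, 6.0) would give about 4.17, not 3.5, which is exactly why the sum/count pairs are needed.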