14. Parallel Query Processing Flashcards
What is parallel query processing?
Running a database, and the queries issued to it, on multiple machines in parallel.
What is shared memory architecture?
Every CPU shares the same memory and disk.
What is a shared disk architecture?
Each CPU has its own memory, but all of them share the same disk.
Shared nothing architecture - the main trade-off
Shared memory/disk architectures are easy to reason about, but sharing resources holds back the system. It is possible to achieve a much higher level of parallelism if all the machines have their own disk and memory because they do not need to wait for the resource to become available.
In shared-nothing, the machines communicate with each other solely through the network by passing messages to each other.
Trade-off: easy to reason about vs. performance
Intra vs inter-query parallelism
Intra-query parallelism attempts to make one query run as fast as possible by spreading the work over multiple computers. The other major type of parallelism is inter-query parallelism which gives each machine different queries to work on so that the system can achieve a high throughput and complete as many queries as possible.
Types of Intra-query Parallelism
We can further divide Intra-query parallelism into two classes: intra-operator and inter-operator.
Intra-operator is making one operator run as quickly as possible. An example of intra-operator parallelism is dividing up the data onto several machines and having them sort the data in parallel. This parallelism makes sorting (one operation) as fast as possible.
Inter-operator parallelism is making a query run as fast as possible by running the operators in parallel.
Types of Inter-operator Parallelism
The first type is pipeline parallelism. In pipeline parallelism, records are passed to the parent operator as soon as the child operator finishes processing them. The parent operator can work on a record that its child has already produced while the child operator works on a different record.
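Python generators give a minimal single-process sketch of this record-at-a-time flow (the operator names `scan` and `filter_op` are illustrative, not from the original; on a real system the two operators would run concurrently on different machines):

```python
def scan(records):
    """Child operator: emit records one at a time."""
    for r in records:
        yield r

def filter_op(child, predicate):
    """Parent operator: consumes each record as soon as the child
    produces it, instead of waiting for the whole input."""
    for r in child:
        if predicate(r):
            yield r

# The filter can process record 1 while the scan has not yet
# produced record 2 -- no operator materializes its full output.
results = list(filter_op(scan([1, 2, 3, 4]), lambda x: x % 2 == 0))
# → [2, 4]
```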
Bushy tree parallelism - different branches of the query plan tree are run in parallel.
Sharding and replication - explain
Both are strategies for spreading data across multiple machines.
Sharding - each data page will be stored on only one machine. Used to achieve better performance.
Replication - each data page is stored on multiple machines. Used to achieve better availability (if one machine goes down, another can handle requests).
Partitioning scheme for sharding - def.
A partitioning scheme is a rule that determines what machine a certain record will end up on.
range partitioning scheme - describe, when it’s a good scheme?
Each machine gets a certain range of values that it will store (i.e. machine 1 stores values 1-5, machine 2 stores values 6-10, and so on).
This scheme is very good for queries that lookup on a specific key (especially range queries compared to the other schemes we’ll talk about) because you only need to request data from the machines that the values reside on.
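A minimal sketch of range partitioning (the `boundaries` parameter and list-of-lists return shape are assumptions for illustration; a real system would route records over the network):

```python
def range_partition(records, boundaries):
    """Assign each record to a machine by key range.

    boundaries[i] is the inclusive upper bound of machine i's range;
    the last machine takes everything above the final boundary.
    """
    machines = [[] for _ in range(len(boundaries) + 1)]
    for r in records:
        i = 0
        while i < len(boundaries) and r > boundaries[i]:
            i += 1
        machines[i].append(r)
    return machines

# Machine 0 stores 1-5, machine 1 stores 6-10, machine 2 stores 11+.
parts = range_partition([3, 7, 12, 5, 9], boundaries=[5, 10])
# → [[3, 5], [7, 9], [12]]
```

A range query such as "values between 6 and 10" then only needs to contact machine 1.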
Hash Partitioning scheme - describe, when it’s good/bad?
In a hash partitioning scheme, each record is hashed and sent to the machine corresponding to its hash value. This means that all like values will be assigned to the same machine (i.e. if value 4 goes to machine 1 then all of the 4s must go to that machine), but it makes no guarantees about where close values go.
It will still perform well for key lookup, but not for range queries (because every machine will likely need to be queried). Hash partitioning is the scheme of choice for parallel hashing and parallel hash join.
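A hedged sketch, using Python's built-in `hash()` modulo the machine count as the routing rule (a real system would use a fixed hash function so all nodes agree):

```python
def hash_partition(records, num_machines):
    """Send each record to the machine given by hashing its value."""
    machines = [[] for _ in range(num_machines)]
    for r in records:
        machines[hash(r) % num_machines].append(r)
    return machines

parts = hash_partition([4, 1, 4, 9, 4], num_machines=3)
# All three 4s land on the same machine, but nearby values
# (e.g. 4 and 5) may land anywhere -- hence poor range-query locality.
```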
Round Robin Partitioning scheme - describe, when good/bad?
The last scheme we will talk about is called round robin partitioning. In this scheme we go record by record and assign each record to the next machine. For example the first record will be assigned to the first machine, the second record will be assigned to the second machine and so on. When we reach the final machine we will assign the next record to the first machine.
Every machine is guaranteed to get the same amount of data (to within one record), so this scheme achieves maximum parallelization. The downside, of course, is that every machine will need to be activated for every query.
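A minimal sketch of the dealing-out rule described above (single-process illustration; the list-of-lists return shape is an assumption):

```python
def round_robin_partition(records, num_machines):
    """Deal records out one at a time, cycling through the machines."""
    machines = [[] for _ in range(num_machines)]
    for i, r in enumerate(records):
        machines[i % num_machines].append(r)
    return machines

parts = round_robin_partition([10, 20, 30, 40, 50], num_machines=3)
# → [[10, 40], [20, 50], [30]]
```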
Parallel sorting - describe
- Range partition the table
- Perform local sort on each machine
We range partition the table because then once each machine sorts its data, the entire table is in sorted order (the data on each machine can be simply concatenated together if needed).
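The two steps can be sketched in a single process (on a real system the local sorts run concurrently on separate machines; `boundaries` is an illustrative parameter):

```python
def parallel_sort(records, boundaries):
    """Range partition, sort each partition locally, then concatenate."""
    # Step 1: range partition -- each machine gets one contiguous key range.
    machines = [[] for _ in range(len(boundaries) + 1)]
    for r in records:
        i = 0
        while i < len(boundaries) and r > boundaries[i]:
            i += 1
        machines[i].append(r)
    # Step 2: each machine sorts its own data independently
    # (in parallel on a real system; sequentially in this sketch).
    local_sorted = [sorted(m) for m in machines]
    # Step 3: because the partitions are contiguous ranges, simple
    # concatenation yields the fully sorted table.
    return [r for part in local_sorted for r in part]

parallel_sort([8, 2, 11, 5, 7], boundaries=[5, 10])
# → [2, 5, 7, 8, 11]
```

Note that hash partitioning would not work here: sorted partitions with interleaved key ranges could not simply be concatenated.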
Parallel Hashing - describe
- Hash partition the table
- Perform local hashing on each machine
Hash partitioning the table guarantees that like values will be assigned to the same machine. It would also be valid to range partition the data (because it has the same guarantee). However, in practice, range partitioning is a little harder (how do you come up with the ranges efficiently?) than hash-partitioning so use hash partitioning when possible.
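A sketch of the two steps, assuming an in-memory `dict` as each machine's local hash table (the return shape is illustrative):

```python
def parallel_hash(records, num_machines):
    """Hash partition, then build a local hash table on each machine."""
    # Step 1: hash partition -- like values go to the same machine.
    machines = [[] for _ in range(num_machines)]
    for r in records:
        machines[hash(r) % num_machines].append(r)
    # Step 2: each machine builds its own hash table. Because all
    # copies of a value land on one machine, each local table is
    # complete for the keys it holds.
    tables = []
    for m in machines:
        table = {}
        for r in m:
            table.setdefault(r, []).append(r)
        tables.append(table)
    return tables
```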
Parallel Sort Merge Join - describe
- Range partition each table on the join column using the same ranges
- Perform local sort merge join on each machine
We need to use the same ranges to guarantee that all matches appear on the same machine. If we used different ranges for each table, then it’s possible that a record from table R will appear on a different machine than a record from table S even if they have the same value for the join column which will prevent these records from ever getting joined.
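The algorithm can be sketched on single-column tables as follows (the shared `boundaries` parameter is the key point; duplicate join keys are not handled in this simplified merge, and a real system would run the per-machine joins concurrently):

```python
def parallel_sort_merge_join(r_records, s_records, boundaries):
    """Range partition both tables with the SAME boundaries, then
    sort-merge join locally on each machine."""
    def partition(records):
        parts = [[] for _ in range(len(boundaries) + 1)]
        for rec in records:
            i = 0
            while i < len(boundaries) and rec > boundaries[i]:
                i += 1
            parts[i].append(rec)
        return parts

    # Same boundaries for R and S, so matching keys share a machine.
    r_parts, s_parts = partition(r_records), partition(s_records)
    results = []
    for r_part, s_part in zip(r_parts, s_parts):
        # Local sort-merge join on one "machine".
        r_sorted, s_sorted = sorted(r_part), sorted(s_part)
        i = j = 0
        while i < len(r_sorted) and j < len(s_sorted):
            if r_sorted[i] < s_sorted[j]:
                i += 1
            elif r_sorted[i] > s_sorted[j]:
                j += 1
            else:
                results.append((r_sorted[i], s_sorted[j]))
                i += 1
                j += 1
    return results

parallel_sort_merge_join([3, 7, 12], [7, 3, 99], boundaries=[5, 10])
# → [(3, 3), (7, 7)]
```

If R used boundaries `[5, 10]` but S used `[6, 20]`, the key 7 would sit on machine 1 for R and machine 1 for S only by luck; in general mismatched ranges split matching keys across machines and the local joins miss them.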