Week 4 - Parallel Data Architecture Flashcards
Two types of parallelism in a parallel database system
1) Pipeline Parallelism
2) Partition Parallelism
What is Pipeline Parallelism
Many machines, each doing one step in a multi-step process
What is Partition Parallelism
Many machines doing the same thing to different pieces of data
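A minimal sketch contrasting the two ideas above. The stage names (`scan`, `select_adults`, `project_name`) and the row layout are hypothetical, chosen only for illustration. Python generators stand in for pipeline stages (each row flows through every step without waiting for the whole table), and a thread pool stands in for separate machines that each run the same work on a different piece of the data. A real parallel database would place stages and partitions on separate nodes.

```python
from concurrent.futures import ThreadPoolExecutor

def scan(rows):
    # Pipeline step 1: read rows one at a time.
    for row in rows:
        yield row

def select_adults(rows):
    # Pipeline step 2: filter rows as they arrive from the previous step.
    for row in rows:
        if row["age"] >= 18:
            yield row

def project_name(rows):
    # Pipeline step 3: keep only the name column.
    for row in rows:
        yield row["name"]

def run_pipeline(rows):
    # Pipeline parallelism: each step hands rows to the next step.
    return list(project_name(select_adults(scan(rows))))

def run_partitioned(partitions):
    # Partition parallelism: the same work on different pieces of data.
    # Assumes at least one partition.
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        results = list(pool.map(run_pipeline, partitions))
    return [name for part in results for name in part]

people = [
    {"name": "Ann", "age": 34},
    {"name": "Bob", "age": 12},
    {"name": "Cid", "age": 70},
    {"name": "Dee", "age": 18},
]
print(run_pipeline(people))
print(run_partitioned([people[:2], people[2:]]))
```

Both calls produce the same names; only the way the work is divided differs.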
What is Speed up?
More resources means proportionally less time for a given amount of data (ideal speed-up is linear: a 45 degree line on the graph)
What is scale-up?
If resources are increased in proportion to the increase in data size, time is constant (no diminishing returns)
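A small worked example of the two measures, as a sketch. The function names and the timings are hypothetical; the ratios follow the usual textbook definitions, where ideal speed-up equals the factor by which resources grew and ideal scale-up equals 1.

```python
def speed_up(time_small_system: float, time_big_system: float) -> float:
    """Same data, more resources: how many times faster the bigger system is."""
    return time_small_system / time_big_system

def scale_up(time_small_problem: float, time_big_problem: float) -> float:
    """Data and resources grown by the same factor: 1.0 means time stayed constant."""
    return time_small_problem / time_big_problem

# Hypothetical timings in seconds.
# 1 machine takes 100 s; 4 machines take 25 s on the same data.
print(speed_up(100.0, 25.0))
# 1 machine on 1x data takes 100 s; 4 machines on 4x data also take 100 s.
print(scale_up(100.0, 100.0))
```

A speed-up below 4 in the first case, or a scale-up below 1 in the second, would indicate diminishing returns.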
When is scale up used in parallel databases?
1) To implement parallelism in databases for faster processing.
2) To have the same performance levels when workloads increase.
3) To break the processing in a sequential manner.
2) To have the same performance levels when workloads increase.
Shared Memory (SMP) means
multiple CPUs that can run things in parallel but they share the same memory space.
Shared Disk
In the shared disk architecture, you have multiple CPUs and each one has its own memory space, but they all share the same disk storage.
Shared Nothing
For the shared nothing architecture, multiple CPUs have their own memory space, not only that, they also have their own secondary storage
How do machines communicate in the shared nothing architecture?
The only way the machines communicate with each other is through the network
Advantage of Shared Memory
Easy to program
2 Disadvantages of Shared Memory
1) expensive to build
2) Difficult to scale
2 Advantages of Shared Nothing
1) cheaper to build
2) easier to scale up
Disadvantage of Shared Nothing
Harder to program
Intra-operator Parallelism
Get all machines working to compute a given operation
(e.g., scan, sort, join)
Inter-operator Parallelism
each operator may run concurrently on a different site
exploits pipelining
Inter-query Parallelism
different queries run on different sites
3 Types of data partitioning
1) Range
2) Hash
3) Round Robin
Range Partitioning means
Assigning each tuple to a partition based on which range of a sort attribute its value falls in (for example, by age), so each machine stores and processes one range of the data
Hash Partitioning means
Hash partitioning runs a hash function on the partitioning attribute,
and the hash function decides which tuple,
or record, in the table is assigned to which partition.
Round Robin Partitioning means
The first row in the table is assigned to the first partition.
The second row is assigned to the second partition, and so on, wrapping back to the first partition after the last one.
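The three partitioning schemes can be sketched as below. All function names, the `boundaries` parameter, and the sample rows are assumptions made for illustration. `zlib.crc32` is used only as a convenient deterministic hash, not because any particular database uses it.

```python
from bisect import bisect_right
from zlib import crc32

def range_partition(rows, key, boundaries):
    # boundaries: sorted split points; yields len(boundaries) + 1 partitions.
    parts = [[] for _ in range(len(boundaries) + 1)]
    for row in rows:
        parts[bisect_right(boundaries, row[key])].append(row)
    return parts

def hash_partition(rows, key, n):
    # The hash of the key value decides the partition.
    parts = [[] for _ in range(n)]
    for row in rows:
        h = crc32(str(row[key]).encode("utf-8"))
        parts[h % n].append(row)
    return parts

def round_robin_partition(rows, n):
    # Row i goes to partition i mod n, regardless of its contents.
    parts = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        parts[i % n].append(row)
    return parts

people = [
    {"name": "a", "age": 17},
    {"name": "b", "age": 34},
    {"name": "c", "age": 70},
    {"name": "d", "age": 29},
]
by_age = range_partition(people, "age", [18, 65])   # <18, 18 to 64, 65 and up
by_hash = hash_partition(people, "name", 3)
by_turn = round_robin_partition(people, 3)
print([[r["name"] for r in p] for p in by_age])
print([[r["name"] for r in p] for p in by_turn])
```

Range partitioning keeps nearby values together, hash partitioning sends equal keys to the same partition, and round robin balances row counts without looking at the data.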
3 Steps of Parallel Sorting
1) scan in parallel and range-partition as you go (sort attribute)
2) As tuples come in, begin “local” sorting on each machine
3) Resulting data is sorted and range-partitioned
Parallel Sorting Problem
skew!
Some partitions will have more data than others, unbalanced load
Parallel Sorting Solution:
sample the data at start to determine partition points (find data distribution so data can be sorted evenly in partitions)
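A minimal single-process sketch of the sampling idea, assuming a non-empty list of comparable values. The names `choose_split_points` and `parallel_sort`, the sample size, and the fixed seed are all hypothetical. Each partition would live on its own machine in a real system; here they are plain lists.

```python
import random
from bisect import bisect_right

def choose_split_points(values, n_partitions, sample_size=100, seed=0):
    # Sample the data, sort the sample, and take evenly spaced values
    # from it as partition points, so each partition gets a similar share.
    rng = random.Random(seed)
    sample = sorted(rng.sample(values, min(sample_size, len(values))))
    return [sample[(i * len(sample)) // n_partitions]
            for i in range(1, n_partitions)]

def parallel_sort(values, n_partitions):
    splits = choose_split_points(values, n_partitions)
    partitions = [[] for _ in range(n_partitions)]
    # Step 1: scan and range-partition on the sort attribute.
    for v in values:
        partitions[bisect_right(splits, v)].append(v)
    # Step 2: local sort on each partition (one per machine in practice).
    for p in partitions:
        p.sort()
    # Step 3: the result is sorted and range-partitioned.
    return partitions

# Skewed data: most values are small, a few are very large.
data = [x % 10 for x in range(200)] + [1000, 5000, 9000]
parts = parallel_sort(data, 4)
print([len(p) for p in parts])
```

Reading the partitions in order gives the fully sorted data. With evenly spaced split points over the value range instead of sampled ones, nearly all of this skewed data would land in the first partition.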
2 types of Parallel join
1) Nested loop
2) Sort Merge (plain merge join)
Nested loop
2 items
1) Each outer tuple must be compared with each inner tuple that might join
2) Easy with range partitioning on the join columns, hard otherwise
Sort Merge (plain merge join)
2 items
1) Sorting gives range partitioning
2) Merging partitioned tables is local
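A sketch of why merging is local once both tables are range-partitioned on the join column with the same split points: matching keys always fall in the same partition pair, so each pair can be joined independently. The function names, the `splits` parameter, and the sample tables are assumptions for illustration, and the loop over partition pairs stands in for separate machines.

```python
from bisect import bisect_right

def range_partition(rows, key, splits):
    parts = [[] for _ in range(len(splits) + 1)]
    for row in rows:
        parts[bisect_right(splits, row[key])].append(row)
    return parts

def merge_join(left, right, key):
    # Sort both inputs, then advance two cursors in step.
    left = sorted(left, key=lambda r: r[key])
    right = sorted(right, key=lambda r: r[key])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][key], right[j][key]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of right rows with this key, then pair it
            # with every left row that has the same key.
            j_end = j
            while j_end < len(right) and right[j_end][key] == lk:
                j_end += 1
            while i < len(left) and left[i][key] == lk:
                for r in right[j:j_end]:
                    out.append({**left[i], **r})
                i += 1
            j = j_end
    return out

def parallel_merge_join(left, right, key, splits):
    lparts = range_partition(left, key, splits)
    rparts = range_partition(right, key, splits)
    result = []
    for lp, rp in zip(lparts, rparts):  # each pair could run on its own machine
        result.extend(merge_join(lp, rp, key))
    return result

emp = [{"dept": 1, "name": "x"}, {"dept": 2, "name": "y"}, {"dept": 5, "name": "z"}]
dept = [{"dept": 1, "dname": "A"}, {"dept": 5, "dname": "B"}, {"dept": 7, "dname": "C"}]
joined = parallel_merge_join(emp, dept, "dept", [3])
print([(r["name"], r["dname"]) for r in joined])
```

No rows need to cross partitions during the merge step, which is what makes the join "local".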
Complex Queries: Inter-Operator Parallelism
2 items
1) Pipeline between operators
2) Bushy Trees
What is the high-level query processing language used by database management systems?
1) SQL
2) HTML
3) XML
4) PL
1) SQL
Which of the following cannot be a goal in query processing?
1) Maximizing solution space
2) Minimizing processing time
3) Maximizing throughput
4) Minimizing transfers among distributed sites
1) Maximizing solution space
Which of the following search algorithms takes the longest processing time?
1) Exhaustive search
2) Heuristic algorithm
3) Simulated annealing
4) Genetic algorithm
1) Exhaustive search
What is the correct order of tasks in typical distributed query processing?
1) Decomposition, Localization, Optimization
2) Decomposition, Optimization, Localization
3) Localization, Decomposition, Optimization
4) Optimization, Decomposition, Localization
1) Decomposition, Localization, Optimization
What is the correct order of tasks in the decomposition step of distributed query processing?
1) Normalization, Eliminating Redundancy, Algebraic Rewriting
2) Normalization, Algebraic Rewriting, Eliminating Redundancy
3) Eliminating Redundancy, Normalization, Algebraic Rewriting
4) Eliminating Redundancy, Algebraic Rewriting, Normalization
1) Normalization, Eliminating Redundancy, Algebraic Rewriting