Partitioning - Week 6 Flashcards
Shared Nothing infrastructure
Data storage, update access, and processing take place across a collection of networked machines
What are partitioning and sharding examples of?
Distributing data across available nodes for parallel processing in order to achieve performance benefits
Types of Partitioning
Random / Round-robin
Key Range
Hash
Random / Round-robin partitioning explanation with Pros and Cons
Data is assigned to nodes as it is inserted, either alternating across nodes (round-robin) or at random
Pros:
Uniform initial distribution
Easy to rebalance on update
Cons:
Querying involves accessing all nodes
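A minimal sketch of round-robin assignment, assuming a hypothetical three-node cluster:

```python
# Minimal sketch: round-robin assignment over a hypothetical list of nodes.
nodes = ["node0", "node1", "node2"]

def round_robin_node(insert_counter: int) -> str:
    # Alternate across nodes in insertion order.
    return nodes[insert_counter % len(nodes)]

# The data spreads uniformly, but a query on any attribute must be sent to
# every node, since any node could hold matching records.
```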
Key Range partitioning explanation with Pros and Cons
Each partition is associated with a range of values for the key
Pros:
Focused range queries on key
Focused direct access on key
Cons:
Uniform distribution not intrinsic
Risk of hot spots for popular keys
Rebalancing can be expensive
Only helps for key-based requests
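A minimal sketch of key-range partitioning, assuming hypothetical range boundaries on a string key:

```python
import bisect

# Minimal sketch: key-range partitioning with hypothetical boundaries.
# Partition 0 holds keys < "H", partition 1 holds "H" <= key < "P",
# partition 2 holds keys >= "P".
boundaries = ["H", "P"]

def range_partition(key: str) -> int:
    return bisect.bisect_right(boundaries, key)

# Direct access and range queries on the key touch only the relevant
# partitions, but popular key ranges can become hot spots.
```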
Hash partitioning explanation with Pros and Cons
Partition by hashing the key
Pros:
Roughly uniform distribution, provided the key values are diverse
Focused direct access on keys
Cons:
Rebalancing can be expensive
Only helps for requests that have the key
Risk of hot spots for popular keys
No support for range queries
Very popular: core to many NoSQL systems and also used in MapReduce
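A minimal sketch of hash partitioning, using MD5 here purely as an illustrative stable hash:

```python
import hashlib

# Minimal sketch: assign a key to a partition by hashing it (stable hash, mod n).
def hash_partition(key: str, num_partitions: int) -> int:
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_partitions

# Keys spread roughly uniformly regardless of their values; direct access on
# the key goes to one partition, but range queries must contact all partitions.
```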
What features are desirable in a key to be used for partitioning? In the context of the potential keys (Id, Name, Town, Size)
Used for direct access requests - Name, Town
Diverse Values - Id, Name, Town, Size
Useful for range access requests - Size
Not subject to skew - Id, Name, Town, Size
Repartitioning
Moving data from one partitioning to another (changing the type of partitioning scheme, or increasing/decreasing the number of partitions)
Skew in partitions
When the data is unevenly distributed across partitions
e.g. partitioning on the first letter of Country in the University table
This may mean you need to re-partition the data
Partitioning hot spots
When the load is unevenly balanced
E.g. more people may look up some universities than others
This may mean you need to re-partition the data
Assuming a uniform distribution and mod-based partitioning, what estimates the fraction of data that needs to be re-partitioned when a node is added (where we don't have many partitions per node)?
1 - 1/(final number of nodes)
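A quick check of that estimate: with assignment by key mod n, a key keeps its node after adding one only when key mod n equals key mod (n+1), which holds for roughly 1/(n+1) of uniformly distributed keys. An illustrative sketch, assuming integer keys:

```python
# Illustrative check of the 1 - 1/(final number of nodes) estimate for
# mod-based partitioning over uniformly distributed integer keys.
def moved_fraction(num_keys: int, n: int) -> float:
    moved = sum(1 for k in range(num_keys) if k % n != k % (n + 1))
    return moved / num_keys

print(moved_fraction(100_000, 4))   # 0.8, i.e. 1 - 1/5 when going from 4 to 5 nodes
```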
How can we reduce the cost of repartitioning?
By using many more partitions than nodes, with a level of indirection between the hash and the node location
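A minimal sketch of that indirection, with hypothetical partition and node counts: keys hash to a fixed set of partitions, and a separate partition-to-node map is the only thing that changes when rebalancing.

```python
import hashlib

NUM_PARTITIONS = 64   # fixed, and much larger than the number of nodes
partition_to_node = {p: p % 4 for p in range(NUM_PARTITIONS)}   # 4 nodes to start

def node_for_key(key: str) -> int:
    # The key -> partition hash never changes.
    partition = int(hashlib.md5(key.encode()).hexdigest(), 16) % NUM_PARTITIONS
    # Only this lookup table changes when nodes are added or removed.
    return partition_to_node[partition]

# Adding a node means reassigning some partition_to_node entries and moving
# those whole partitions, rather than rehashing every key.
```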
With many partitions per node, what fraction of the data needs to relocate if we want to add a further node whilst continuing to spread the data uniformly? (In terms of n, the starting number of nodes.)
1/(n+1)
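The reasoning as a tiny sketch: the new node must end up holding an equal 1/(n+1) share of the partitions, and everything it holds has to be moved from the existing n nodes.

```python
# Illustrative check of the 1/(n+1) estimate with many partitions per node.
def relocated_fraction(num_partitions: int, starting_nodes: int) -> float:
    moved_partitions = num_partitions / (starting_nodes + 1)   # new node's share
    return moved_partitions / num_partitions

print(relocated_fraction(1024, 4))   # 0.2, i.e. 1/5 when going from 4 to 5 nodes
```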
Secondary Index
An index on an attribute that has not been used for partitioning.
Maps the key (indexed attribute) to its id locations, e.g.:
Key - Value
UK {2}
USA {4, 6}
(This example is a local secondary index, global secondary indexes also store the node the primary index is on)
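A minimal sketch of building that secondary index, using hypothetical university records whose ids match the example above:

```python
from collections import defaultdict

# Hypothetical records partitioned by id; "country" is not the partitioning key.
records = {
    2: {"name": "Oxford", "country": "UK"},
    4: {"name": "MIT", "country": "USA"},
    6: {"name": "Stanford", "country": "USA"},
}

secondary_index = defaultdict(set)
for record_id, record in records.items():
    secondary_index[record["country"]].add(record_id)

print(dict(secondary_index))   # {'UK': {2}, 'USA': {4, 6}}
```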
Local Secondary Index explanation with Pros and Cons
Storing the secondary index on the same node as the data it is indexing. This gives many local indexes.
Pros:
Updating a document only leads to local index updates (so no distributed transactions)
Cons:
A lookup needs to go to every partition, leading to many (parallel) index lookups
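A minimal sketch of the scatter-gather lookup this implies, with hypothetical local indexes held on two partitions:

```python
# Each partition keeps its own local secondary index over only its own data.
partitions = [
    {"UK": {2}},          # local index on node 0
    {"USA": {4, 6}},      # local index on node 1
]

def lookup(value: str) -> set:
    result = set()
    for local_index in partitions:   # in practice, parallel requests to every node
        result |= local_index.get(value, set())
    return result

print(lookup("USA"))   # {4, 6}
```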