Partitioning - Week 6 Flashcards
Shared Nothing infrastructure
Data storage, update access, and processing take place across a collection of networked machines
What are partitioning and sharding examples of?
Distributing data across available nodes for parallel processing in order to achieve performance benefits
Types of Partitioning
Random / Round-robin
Key Range
Hash
Random / Round-robin partitioning explanation with Pros and Cons
Data is assigned to nodes as it is inserted, either alternating across nodes (round-robin) or at random
Pros:
Uniform initial distribution
Easy to rebalance on update
Cons:
Querying involves accessing all nodes
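A minimal sketch of round-robin assignment, assuming a hypothetical three-node cluster:

```python
# Minimal sketch: round-robin assignment over a hypothetical list of nodes.
nodes = ["node0", "node1", "node2"]

def round_robin_node(insert_counter: int) -> str:
    # Alternate across nodes in insertion order.
    return nodes[insert_counter % len(nodes)]

# The data spreads uniformly, but a query on any attribute must be sent to
# every node, since any node could hold matching records.
```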
Key Range partitioning explanation with Pros and Cons
Each partition is associated with a range of values for the key
Pros:
Focused range queries on key
Focused direct access on key
Cons:
Uniform distribution not intrinsic
Risk of hot spots for popular keys
Rebalancing can be expensive
Only helps for key-based requests
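A minimal sketch of key-range partitioning, assuming hypothetical range boundaries on a string key:

```python
import bisect

# Minimal sketch: key-range partitioning with hypothetical boundaries.
# Partition 0 holds keys < "H", partition 1 holds "H" <= key < "P",
# partition 2 holds keys >= "P".
boundaries = ["H", "P"]

def range_partition(key: str) -> int:
    return bisect.bisect_right(boundaries, key)

# Direct access and range queries on the key touch only the relevant
# partitions, but popular key ranges can become hot spots.
```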
Hash partitioning explanation with Pros and Cons
Partition by hashing the key
Pros:
Roughly uniform distribution, provided the key values are diverse
Focused direct access on keys
Cons:
Rebalancing can be expensive
Only helps for requests that have the key
Risk of hot spots for popular keys
No support for range queries
Very popular: core to many NoSQL systems and also used in MapReduce
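A minimal sketch of hash partitioning, using MD5 here purely as an illustrative stable hash:

```python
import hashlib

# Minimal sketch: assign a key to a partition by hashing it (stable hash, mod n).
def hash_partition(key: str, num_partitions: int) -> int:
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_partitions

# Keys spread roughly uniformly regardless of their values; direct access on
# the key goes to one partition, but range queries must contact all partitions.
```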
What features are desirable in a key to be used for partitioning? In the context of the potential keys (Id, Name, Town, Size)
Used for direct access requests - Name, Town
Diverse Values - Id, Name, Town, Size
Useful for range access requests - Size
Not subject to skew - Id, Name, Town, Size
Repartitioning
Moving data from one partitioning to another (changing the type of partitioning scheme, or increasing/decreasing the number of partitions)
Skew in partitions
When the data is unevenly distributed across partitions
e.g. partitioning on the first letter of Country in the University table
This may mean you need to re-partition the data
Partitioning hot spots
When the load is unevenly balanced
E.g. more people may look up some universities than others
This may mean you need to re-partition the data
Assuming a uniform distribution and mod-based partitioning, what estimates the fraction of data that needs to be re-partitioned when a node is added (where we don't have many partitions per node)?
1 - 1/(final number of nodes)
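A quick check of that estimate: with assignment by key mod n, a key keeps its node after adding one only when key mod n equals key mod (n+1), which holds for roughly 1/(n+1) of uniformly distributed keys. An illustrative sketch, assuming integer keys:

```python
# Illustrative check of the 1 - 1/(final number of nodes) estimate for
# mod-based partitioning over uniformly distributed integer keys.
def moved_fraction(num_keys: int, n: int) -> float:
    moved = sum(1 for k in range(num_keys) if k % n != k % (n + 1))
    return moved / num_keys

print(moved_fraction(100_000, 4))   # 0.8, i.e. 1 - 1/5 when going from 4 to 5 nodes
```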
How can we reduce the cost of repartitioning?
By using many more partitions than nodes, with a level of indirection between the hash and the node location
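A minimal sketch of that indirection, with hypothetical partition and node counts: keys hash to a fixed set of partitions, and a separate partition-to-node map is the only thing that changes when rebalancing.

```python
import hashlib

NUM_PARTITIONS = 64   # fixed, and much larger than the number of nodes
partition_to_node = {p: p % 4 for p in range(NUM_PARTITIONS)}   # 4 nodes to start

def node_for_key(key: str) -> int:
    # The key -> partition hash never changes.
    partition = int(hashlib.md5(key.encode()).hexdigest(), 16) % NUM_PARTITIONS
    # Only this lookup table changes when nodes are added or removed.
    return partition_to_node[partition]

# Adding a node means reassigning some partition_to_node entries and moving
# those whole partitions, rather than rehashing every key.
```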
With many partitions per node, what fraction of the data needs to relocate if we want to add a further node whilst continuing to spread the data uniformly? (In terms of n, the starting number of nodes.)
1/(n+1)
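The reasoning as a tiny sketch: the new node must end up holding an equal 1/(n+1) share of the partitions, and everything it holds has to be moved from the existing n nodes.

```python
# Illustrative check of the 1/(n+1) estimate with many partitions per node.
def relocated_fraction(num_partitions: int, starting_nodes: int) -> float:
    moved_partitions = num_partitions / (starting_nodes + 1)   # new node's share
    return moved_partitions / num_partitions

print(relocated_fraction(1024, 4))   # 0.2, i.e. 1/5 when going from 4 to 5 nodes
```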
Secondary Index
An index on an attribute that has not been used for partitioning.
Maps the key (indexed attribute) to its id locations, e.g.:
Key - Value
UK {2}
USA {4, 6}
(This example is a local secondary index, global secondary indexes also store the node the primary index is on)
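A minimal sketch of building that secondary index, using hypothetical university records whose ids match the example above:

```python
from collections import defaultdict

# Hypothetical records partitioned by id; "country" is not the partitioning key.
records = {
    2: {"name": "Oxford", "country": "UK"},
    4: {"name": "MIT", "country": "USA"},
    6: {"name": "Stanford", "country": "USA"},
}

secondary_index = defaultdict(set)
for record_id, record in records.items():
    secondary_index[record["country"]].add(record_id)

print(dict(secondary_index))   # {'UK': {2}, 'USA': {4, 6}}
```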
Local Secondary Index explanation with Pros and Cons
Storing the secondary index on the same node as the data it is indexing. This gives many local indexes.
Pros:
Updating a document only leads to local index updates (so no distributed transactions)
Cons:
A lookup needs to go to every partition, leading to many (parallel) index lookups
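A minimal sketch of the scatter-gather lookup this implies, with hypothetical local indexes held on two partitions:

```python
# Each partition keeps its own local secondary index over only its own data.
partitions = [
    {"UK": {2}},          # local index on node 0
    {"USA": {4, 6}},      # local index on node 1
]

def lookup(value: str) -> set:
    result = set()
    for local_index in partitions:   # in practice, parallel requests to every node
        result |= local_index.get(value, set())
    return result

print(lookup("USA"))   # {4, 6}
```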