Partitioning - Week 6 Flashcards

1
Q

Shared Nothing infrastructure

A

An architecture in which data storage, update access and processing take place across a collection of networked machines, with no memory or disk shared between them

2
Q

What are partitioning and sharding examples of?

A

Distributing data across available nodes for parallel processing in order to achieve performance benefits

3
Q

Types of Partitioning

A

Random / Round-robin
Key Range
Hash

4
Q

Random / Round-robin partitioning explanation with Pros and Cons

A

Data is assigned to nodes as it is inserted, either in turn (round-robin) or at random

Pros:
Uniform initial distribution
Easy to rebalance on update

Cons:
Querying involves accessing all nodes
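
A minimal sketch of round-robin assignment, assuming records are plain values and nodes are in-memory lists (the function names are illustrative only). It shows why a query has to scan every node: nothing about a record determines where it was placed.

```python
from itertools import cycle

def round_robin_partition(records, num_nodes):
    """Assign records to nodes in turn; returns one list per node."""
    nodes = [[] for _ in range(num_nodes)]
    for record, node_id in zip(records, cycle(range(num_nodes))):
        nodes[node_id].append(record)
    return nodes

def find(nodes, predicate):
    """No node can be ruled out, so every node is scanned."""
    return [r for node in nodes for r in node if predicate(r)]

nodes = round_robin_partition(list(range(10)), 3)
matches = find(nodes, lambda r: r == 7)
```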

5
Q

Key Range partitioning explanation with Pros and Cons

A

Each partition is associated with a range of values for the key

Pros:
Focused range queries on key
Focused direct access on key

Cons:
Uniform distribution not intrinsic
Risk of hot spots for popular keys
Rebalancing can be expensive
Only helps for key-based requests
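
A small sketch of key range lookup, assuming string keys and two hypothetical split points ("h" and "p"). A direct access or a range query on the key only needs the partitions whose ranges overlap the request.

```python
from bisect import bisect_right

# Hypothetical split points: keys below "h" go to partition 0,
# keys from "h" up to (not including) "p" go to 1, the rest go to 2.
BOUNDARIES = ["h", "p"]

def partition_for(key):
    """Index of the partition whose range contains key."""
    return bisect_right(BOUNDARIES, key)

def partitions_for_range(low, high):
    """Partitions that can hold keys in the inclusive range [low, high]."""
    return list(range(partition_for(low), partition_for(high) + 1))
```

With these boundaries, a range query from "k" to "z" touches partitions 1 and 2 only. Nothing in the scheme forces the ranges to hold equal amounts of data, which is where the skew and hot spot risks come from.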

6
Q

Hash partitioning explanation with Pros and Cons

A

Each record is assigned to a partition by hashing its key (e.g. hash of the key mod the number of partitions)

Pros:
Uniform distribution with certain keys
Focused direct access on keys

Cons:
Rebalancing can be expensive
Only helps for requests that have the key
Risk of hot spots for popular keys
No support for range queries

Very popular: core to many NoSQL systems and also used in MapReduce
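
A minimal sketch of mod-based hash partitioning. The choice of MD5 is an assumption made only to get a hash that is stable across runs (Python's built-in hash() for strings is randomised per process); real systems use their own hash functions.

```python
import hashlib

def stable_hash(key):
    """Deterministic 64-bit hash of the key's string form (illustrative choice)."""
    digest = hashlib.md5(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def node_for(key, num_nodes):
    """Mod-based placement: the same key always maps to the same node."""
    return stable_hash(key) % num_nodes

# Count how 1000 hypothetical keys spread over 4 nodes.
counts = [0, 0, 0, 0]
for i in range(1000):
    counts[node_for(f"uni-{i}", 4)] += 1
```

Neighbouring keys land on unrelated nodes, which is why direct access is focused but range queries get no help.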

7
Q

What features are desirable in a key to be used for partitioning? In the context of the potential keys (Id, Name, Town, Size)

A

Used for direct access requests - Name, Town

Diverse Values - Id, Name, Town, Size

Useful for range access requests - Size

Not subject to skew - Id, Name, Town, Size

8
Q

Repartitioning

A

Moving data from one partitioning to another (changing the type of partitioning scheme, or increasing/decreasing the number of partitions)

9
Q

Skew in partitions

A

When the data is spread unevenly across the partitions
e.g. partitioning on the first letter of Country in the University table

This may mean you need to re-partition the data
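
One rough way to quantify skew, sketched under assumptions: the ratio of the largest partition to the mean size of the non-empty partitions (1.0 means perfectly even). The metric and the country list are illustrative, not taken from the course material.

```python
from collections import Counter

def skew_ratio(keys, partition_fn):
    """Largest partition size divided by the mean size of non-empty partitions."""
    sizes = Counter(partition_fn(k) for k in keys)
    mean = sum(sizes.values()) / len(sizes)
    return max(sizes.values()) / mean

# Hypothetical Country values, partitioned on their first letter.
countries = ["UK", "USA", "Uganda", "Ukraine", "France", "Germany"]
ratio = skew_ratio(countries, lambda c: c[0])
```

Here the "U" partition holds four of the six rows, so it is twice the mean size.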

10
Q

Partitioning hot spots

A

When the request load is unevenly balanced across partitions, even if the data itself is evenly spread
E.g. more people may look up some universities than others

This may mean you need to re-partition the data

11
Q

Assuming a uniform distribution and mod-based partitioning, what is a good estimate of the fraction of data that must move when a node is added (where we don’t have many partitions per node)?

A

1 - 1/(final number of nodes)
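
A quick check of this estimate, assuming hash values are uniform and modelled by the integers 0..num_keys-1, and that one node is added. A key stays put only if its hash gives the same remainder under both node counts.

```python
def fraction_moved(num_keys, old_nodes, new_nodes):
    """Fraction of keys whose node changes when the modulus changes."""
    moved = sum(1 for h in range(num_keys) if h % old_nodes != h % new_nodes)
    return moved / num_keys

three_to_four = fraction_moved(12000, 3, 4)   # compare with 1 - 1/4
four_to_five = fraction_moved(12000, 4, 5)    # compare with 1 - 1/5
```

Going from 3 to 4 nodes moves three quarters of the data, so almost everything relocates for a small gain in capacity.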

12
Q

How can we reduce the cost of repartitioning?

A

By using many more partitions than nodes, with a level of indirection between the hash and the node location
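
A sketch of the indirection, assuming 12 fixed partitions over 3 nodes and an MD5-based hash (all illustrative choices). Keys hash to a partition that never changes; only the partition-to-node table is edited when data moves.

```python
import hashlib

NUM_PARTITIONS = 12  # hypothetical; fixed, and larger than the node count

def partition_of(key):
    """Key -> partition id. This mapping never changes."""
    digest = hashlib.md5(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_PARTITIONS

# The level of indirection: partition id -> node id.
assignment = {p: p % 3 for p in range(NUM_PARTITIONS)}

def node_of(key):
    return assignment[partition_of(key)]

keys = [f"uni-{i}" for i in range(200)]
before = {k: node_of(k) for k in keys}
assignment[5] = 3                      # hand partition 5 to a new node 3
after = {k: node_of(k) for k in keys}
```

Only the keys in partition 5 change node; every other key is untouched.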

13
Q

With many partitions per node, what fraction of the data needs to relocate when we add one further node while continuing to spread the data uniformly? Give the answer in terms of n, the starting number of nodes.

A

1/(n+1)
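
A sketch of where 1/(n+1) comes from, assuming equally sized partitions and a simple rebalancing rule (take partitions from whichever node currently holds the most). The rule and names are illustrative; the point is that only the new node's fair share has to move.

```python
def add_node(assignment, new_node):
    """Return a new partition -> node map in which new_node holds a fair share."""
    result = dict(assignment)
    num_nodes = len(set(result.values())) + 1
    target = len(result) // num_nodes          # fair share for the new node
    for _ in range(target):
        counts = {}
        for node in result.values():
            counts[node] = counts.get(node, 0) + 1
        counts.pop(new_node, None)             # never take from the new node
        donor = max(counts, key=counts.get)    # most heavily loaded old node
        victim = next(p for p, n in result.items() if n == donor)
        result[victim] = new_node
    return result

before = {p: p % 3 for p in range(12)}         # 12 partitions on n = 3 nodes
after = add_node(before, 3)
moved = sum(1 for p in before if before[p] != after[p])
```

Here 3 of the 12 partitions move, i.e. 1/(3+1) of the data, and each of the four nodes ends up with three partitions.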

14
Q

Secondary Index

A

An index on an attribute that has not been used for partitioning.

Maps each value of the indexed attribute (the index key) to the ids of the records that contain it, e.g.:

Key - Value
UK {2}
USA {4, 6}

(This example is a local secondary index. A global secondary index also stores the node that each matching record is on.)
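
A minimal sketch of building such an index, assuming records are dicts keyed by id. The university names are hypothetical, chosen so the result matches the UK/USA table above.

```python
from collections import defaultdict

def build_secondary_index(records, attribute):
    """Map each value of `attribute` to the set of record ids holding it."""
    index = defaultdict(set)
    for record_id, record in records.items():
        index[record[attribute]].add(record_id)
    return dict(index)

# Hypothetical records held on one node.
records = {
    2: {"name": "Manchester", "country": "UK"},
    4: {"name": "MIT", "country": "USA"},
    6: {"name": "Stanford", "country": "USA"},
}
country_index = build_secondary_index(records, "country")
```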

15
Q

Local Secondary Index explanation with Pros and Cons

A

Storing the secondary index on the same node as the data it is indexing. This gives many local indexes.

Pros:
Updating a document only leads to local index updates (so no distributed transactions)

Cons:
A lookup needs to go to every partition, leading to many (parallel) index lookups
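
A sketch of a lookup against local secondary indexes, assuming three nodes whose per-node indexes are plain dicts (hypothetical data). The lookup is a scatter-gather: every node is asked, even those with no matches.

```python
# Hypothetical: each node indexes only the records it stores itself.
node_indexes = [
    {"UK": {2}, "USA": {4}},   # node 0
    {"USA": {6}},              # node 1
    {"France": {7}},           # node 2
]

def lookup_local(value):
    """Ask every node's local index and merge the answers."""
    hits = set()
    nodes_contacted = 0
    for index in node_indexes:
        nodes_contacted += 1
        hits |= index.get(value, set())
    return hits, nodes_contacted

usa_hits, contacted = lookup_local("USA")
```

In a real system the per-node lookups would run in parallel; the loop here only counts how many nodes are involved.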

16
Q

Global secondary indexing explanation with Pros and Cons

A

Storing a single distributed secondary index, where the index tells you the key, and the node that it is on.

Pros:
Index lookups no longer need to go to all nodes (note some index lookups will have no hits)

Cons:
Updates to a document on one node now also lead to index updates on other nodes (and thus potentially to distributed transactions)
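
A sketch of a global secondary index, under the assumption that the index itself is partitioned by the indexed value (one common design; the card does not say how it is distributed). The toy hash and all names are illustrative.

```python
NUM_NODES = 3

def index_node_for(value):
    """Node that stores the index entry for this value (toy stable hash)."""
    return sum(value.encode("utf-8")) % NUM_NODES

# One index fragment per node: value -> {(record id, node holding the record)}.
global_index = [dict() for _ in range(NUM_NODES)]

def index_insert(value, record_id, data_node):
    """May touch a different node from data_node: a distributed update."""
    fragment = global_index[index_node_for(value)]
    fragment.setdefault(value, set()).add((record_id, data_node))

def lookup_global(value):
    """A lookup goes to exactly one index node."""
    return global_index[index_node_for(value)].get(value, set())

index_insert("UK", 2, 0)
index_insert("USA", 4, 0)
index_insert("USA", 6, 1)
```

A lookup for "USA" contacts one index node and learns which data nodes to visit; a lookup for a value with no entries returns nothing after that single call.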

17
Q

Can partitioning interact with other functionalities such as for replication and consistency management?

A

Yes (what were the two examples mentioned in the question?)

18
Q

Partitioning for evaluation

A

Partitioning that is done to evaluate requests, e.g. a Join

If we are joining on an attribute that the data is partitioned on, we can join without repartitioning
e.g.:
Join(P1: Uni-sizes, P1: Uni-places) on Node 1
Join(P2: Uni-sizes, P2: Uni-places) on Node 2

If an input has not been partitioned on the join attribute, we need to re-partition it
e.g. (given Uni-sizes is already partitioned on the join attribute, name):
Re-partition Uni-places on name to get P1’, P2’
Join(P1: Uni-sizes, P1’: Uni-places) on Node 1
Join(P2: Uni-sizes, P2’: Uni-places) on Node 2

If neither input is partitioned on the join attribute, we need to re-partition both

These re-partitionings are usually temporary: they exist to evaluate the request and are not likely to persist
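
A sketch of the partitioned join, assuming rows are dicts, two nodes, and a toy stable hash on the join attribute. The table contents are hypothetical. Once both inputs are partitioned on name with the same function, matching rows are guaranteed to sit in the same partition, so each node joins only its own pair.

```python
def partition_by(rows, key, num_nodes):
    """(Re-)partition rows on `key` using a toy stable hash."""
    parts = [[] for _ in range(num_nodes)]
    for row in rows:
        parts[sum(row[key].encode("utf-8")) % num_nodes].append(row)
    return parts

def local_join(left, right, key):
    """Hash join of two partitions held on the same node."""
    by_key = {}
    for row in right:
        by_key.setdefault(row[key], []).append(row)
    return [{**l, **r} for l in left for r in by_key.get(l[key], [])]

def partitioned_join(left_parts, right_parts, key):
    """Join P1 with P1', P2 with P2', and so on; no cross-node pairs."""
    out = []
    for l, r in zip(left_parts, right_parts):
        out.extend(local_join(l, r, key))
    return out

# Hypothetical tables.
uni_sizes = [{"name": "UCL", "size": 40000}, {"name": "MIT", "size": 11000}]
uni_places = [{"name": "UCL", "town": "London"}, {"name": "MIT", "town": "Cambridge"}]

sizes_parts = partition_by(uni_sizes, "name", 2)
places_parts = partition_by(uni_places, "name", 2)   # the re-partitioning step
joined = partitioned_join(sizes_parts, places_parts, "name")
```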