PowerFlex Whitepapers Flashcards
What is the role of the SDR?
to proxy the IO of replicated volumes between the SDC and the SDSs where data is ultimately stored
How does the SDR work?
write IO operations are split - sending one copy to the SDS and another copy to the replication journal volume
Where is the SDR in the architecture?
sits between the SDC and SDS and is deployed alongside the SDS nodes
What does the SDR appear to be from the SDS point of view?
like an SDC sending writes
What does the SDR appear to be from the SDC point of view?
like an SDS to which writes can be sent
What method does PowerFlex utilize instead of snapshotting?
journaling
What is a limitation of snapshot based replication?
identifying block change delta is easy but as RPOs get smaller the number of required snapshots increases dramatically
places hard limit on how small RPOs can be
What is the advantage of journal based replication?
provides possibility of smallest RPO and not constrained by the maximum number of snapshots available in a system/volume
How are journals maintained in PowerFlex?
live as volumes in an SP in the same PD
journal volume does not need to reside in the same SP as the volume being replicated
What is important to know about sizing journal volumes for replication?
journal volume must have enough available capacity to keep ingesting replication data even when the WAN is down and the target site is unavailable to receive it
must consider the maximum cumulative writes that might occur in an outage
What is the minimum requirement for journal capacity?
28GB x # of SDR sessions
SDR sessions = # of SDRs installed + 1
reserve at least 5% of SP for journal volumes
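A minimal sketch of the minimum-capacity rule above, assuming only what the card states (28 GB per SDR session, sessions = SDRs installed + 1). The function and constant names are hypothetical, not part of any PowerFlex tool.

```python
MIN_GB_PER_SESSION = 28  # from the card: 28 GB per SDR session


def min_journal_capacity_gb(sdrs_installed: int) -> int:
    """Minimum journal capacity in GB: 28 GB x (SDRs installed + 1)."""
    if sdrs_installed < 1:
        raise ValueError("at least one SDR is required for replication")
    sessions = sdrs_installed + 1  # SDR sessions = # of SDRs installed + 1
    return MIN_GB_PER_SESSION * sessions


print(min_journal_capacity_gb(4))  # 4 SDRs -> 5 sessions -> 140
```

This is only the floor; the 5% storage pool reservation and the outage-based sizing still apply on top of it.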
How can reserve journal capacity be distributed?
can be split into several volumes across multiple SPs or can reside all in one SP in a PD
What is the performance requirement for journal volumes?
performance of any SP where journal volume resides must match or exceed performance of SP where replicated volumes reside
What is the single most important consideration when sizing journal capacity?
possible WAN outage
How would you assess the journal capacity needed per application?
need to know the maximum application write bandwidth during the busiest hour
minimum outage allowance is 1hr - strongly recommend using 3 hr allowance
What is an example of journal capacity calculation per application?
Calculation example:
Our application generates 1 GB/s of writes during peak hours.
Using 3 hours as the supported outage, we calculate from 10,800 seconds.
The journal capacity reservation needed is 1 GB/s * 10800 s = ~10.547 TB.
Because journal capacity is calculated as a percentage of storage pool capacity, we divide the needed space by the storage pool usable capacity. Let us assume that usable capacity is 200 TB.
100 * 10.547 TB / 200 TB = 5.27%.
As a safety margin, we will round up to 6%.
Repeat the calculation for each application being replicated.
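The calculation above can be sketched as a small helper. It assumes 1 TB = 1024 GB, which is what the ~10.547 TB figure implies. The function name and parameters are hypothetical.

```python
import math


def journal_reserve_percent(write_gb_per_s: float,
                            outage_hours: float,
                            pool_usable_tb: float) -> float:
    """Journal reservation as a percentage of storage pool usable capacity."""
    outage_seconds = outage_hours * 3600
    # Cumulative writes during the outage, converted from GB to TB (1024 GB per TB).
    journal_tb = write_gb_per_s * outage_seconds / 1024
    return 100 * journal_tb / pool_usable_tb


pct = journal_reserve_percent(write_gb_per_s=1.0, outage_hours=3, pool_usable_tb=200)
print(round(pct, 2))   # 5.27
print(math.ceil(pct))  # 6, rounded up as a safety margin
```

Run it once per replicated application with that application's peak write bandwidth and pool capacity.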
How does the SDR organize data as it’s being written?
assembles journal files that contain checkpoints to preserve write order
What happens to duplicate blocks that get sent to the journal volume?
consolidated to minimize volume of data being sent over
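A toy illustration of why consolidation reduces what is sent. It assumes that, within one journal interval, only the latest write to a given block address needs to reach the target. That last-write-wins model and the names below are assumptions made for illustration, not the SDR's actual implementation.

```python
def consolidate(writes):
    """Keep only the latest write per block address within one journal interval.

    `writes` is an ordered list of (block_address, data) tuples.
    """
    latest = {}
    for address, data in writes:
        latest[address] = data  # a later write to the same block replaces the earlier one
    return latest


interval = [(100, b"a"), (200, b"b"), (100, b"c")]
print(consolidate(interval))  # {100: b'c', 200: b'b'}
```

Three writes arrive, but only two blocks need to cross the WAN.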
How is data transferred from source to target on PowerFlex?
SDR sends data over dedicated local subnets or external WAN networks assigned to replication
How is compression affected in replicated volumes?
compressed data is not sent over the WAN
SDS responsible for compressing writes
What is a volume migration limitation related to replication in PowerFlex?
migrating replicated volumes from one PD to another is not possible
since replication journals don’t span PDs
What are the asynchronous replication topologies available on PowerFlex?
one directional
bi-directional
one-to-many
many-to-one
What was necessary before PowerFlex 4.x to perform replication?
manually had to create source and target volume
can now automate it
What are the rules w/ RCGs?
a volume, whether source or target, can be a member of only one RCG
RCGs can only consist of two PowerFlex systems max
What must happen before two PowerFlex systems can talk to one another?
must have exchanged certificates and been peered
What does an RCG do?
establish the attributes and behavior of the replication of one or more volume pairs
What are the rules for volume pairs in an RCG?
must be identical in size
don’t have to reside in same type of SP
do not have to have same properties (compressed, thin/thick provisioned etc.)
What are the RPO limits for RCGs on PowerFlex 4.x?
15 sec - 60 minutes
min on 3.5 was 30 seconds
How many IP Addresses do SDRs have?
each have 2 for redundancy
What is the bandwidth rule for PowerFlex replication?
the rate of writes to replicated volumes cannot exceed the bandwidth of a single network path between clusters
Why do write operations to replicated volumes take up 3x the bandwidth of PowerFlex?
journaling adds two IO operations
SDR first writes to relevant SDS backing the journal volume and the SDS sends another copy to the secondary
SDR makes a second read from the journal volume before sending to remote site
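A rough sketch of the 3x figure, assuming each application write to a replicated volume costs the original write plus the two journaling IOs described above. The names are hypothetical.

```python
JOURNAL_EXTRA_IOS = 2  # from the card: journaling adds two IO operations


def replicated_write_bandwidth(app_write_gbps: float) -> float:
    """Bandwidth consumed inside PowerFlex by writes to replicated volumes."""
    return app_write_gbps * (1 + JOURNAL_EXTRA_IOS)


print(replicated_write_bandwidth(2.0))  # 6.0
```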
What is the networking recommendation if using replication on PowerFlex?
4 x 25GbE
2 x 100GbE
What is the bandwidth recommendation for WAN based replication?
sustained write bandwidth of all replicated volumes doesn’t exceed 80% of total available WAN bandwidth
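A minimal check of the 80% recommendation, assuming the sustained write bandwidth per replicated volume and the total WAN bandwidth are given in the same unit. The function name and sample figures are hypothetical.

```python
WAN_UTILIZATION_LIMIT = 0.80  # from the card: stay at or below 80% of WAN bandwidth


def wan_within_recommendation(volume_write_mbps, wan_mbps):
    """True if the combined sustained writes fit within 80% of the WAN bandwidth."""
    total = sum(volume_write_mbps)
    return total <= WAN_UTILIZATION_LIMIT * wan_mbps


print(wan_within_recommendation([300, 250, 200], 1000))  # True  (750 <= 800)
print(wan_within_recommendation([300, 250, 300], 1000))  # False (850 > 800)
```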
What are the max number of SDRs per system?
128
What is the max replicated volume size?
64 TB
What is the max number of RCGs per system?
32000
How many snapshots can be put on policy based snapshot schedule?
60 out of 126 on a vTree
What sector especially is important for secure snapshots?
financial sector
What is the goal of maintenance modes on PowerFlex?
to avoid a rebuild operation and control offline process without throwing an error
What happens in an unplanned outage event on PowerFlex?
system automatically goes into a rebuild state - redistributes data on remaining nodes until it’s back to being online
What is the overall spare capacity rule on PowerFlex?
spare capacity in the system must be equal to or greater than the capacity of the smallest fault unit (node)
How does a cluster function when a node is put in maintenance mode?
cluster still functions with one less node - less performance/capacity
writes are sent to and mirrored on other nodes in the system (one to many rebalance)
once the node is brought back online, a many-to-one rebalance occurs (slow process)
How does IMM (instant maintenance mode) work?
when a node is put into IMM, its data is not evacuated, but that data is not accessible
application read operations are directed to the other nodes that contain the mirror copy of the data
What happens to the MDM when a node enters IMM?
provides an updated map to the SDCs for IO operations
instructs the SDCs to use another SDS for read/write IO
any changes that would’ve affected the node in IMM are tracked
What happens when a node exits IMM?
don’t need to do full hydration (many to one) like if you took a node off the cluster and added another one after maintenance
only sync back relevant changes that occurred during maintenance
allows fast exit from maintenance and quick return to full capacity/performance
What is the primary disadvantage of IMM?
having a temporary single available copy since all the data on the node in maintenance is unavailable including any secondary copies
if an operational primary node fails while the secondary is in maintenance mode, data loss could result
What happens when a node enters PMM (protective maintenance mode)?
initiates a many to many rebalancing process but data on node is preserved
data is unavailable while in PMM but temporary third copy of data is made on other nodes in the system
What is the main advantage of PMM?
guarantees two copies and thus avoids single data copy risk like IMM
What happens when a node exits PMM?
data does not need full rehydration, just the relevant changes that were tracked by other nodes
third copy is removed once all data has been resynced
How does maintenance of an SDR work?
does not enter PMM - is a manual process
What is the rule for mixing maintenance methodologies?
PMM and IMM can’t occur simultaneously in same PD
IMM can be used in a PD while PMM is used in another
What is the rule for concurrent operations with maintenance modes?
within a PD all SDSs concurrently in/entering PMM must belong to the same fault set
if you don’t use fault sets, only one node in a PD can be in maintenance mode at a time
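The concurrency rule can be sketched as a pre-check. It models each SDS by its fault set name, with `None` meaning fault sets are not in use. The function and its inputs are hypothetical, not a PowerFlex API.

```python
def can_enter_pmm(candidate_fault_set, fault_sets_in_pmm):
    """Decide whether another SDS in the same PD may enter PMM.

    candidate_fault_set: fault set of the SDS, or None if fault sets are not used.
    fault_sets_in_pmm: fault set (or None) of each SDS already in or entering PMM.
    """
    if not fault_sets_in_pmm:
        return True  # nothing else is in maintenance in this PD
    if candidate_fault_set is None:
        return False  # without fault sets, only one node at a time
    # All concurrent PMM nodes must share the candidate's fault set.
    return all(fs == candidate_fault_set for fs in fault_sets_in_pmm)


print(can_enter_pmm("fs1", ["fs1"]))  # True
print(can_enter_pmm("fs2", ["fs1"]))  # False
print(can_enter_pmm(None, [None]))    # False
```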
What types of operations are typically recommended for IMM?
back-end software component upgrades (SDS, MDM, SDC etc.)
What types of operations are typically recommended for PMM?
node maintenance activities (like firmware or driver upgrades)
How much spare capacity do you need to build in if a cluster is going to use PMM?
must be enough spare capacity in a system to handle at least one node failure
Free + Spare - 5% of the Storage Pool >= capacity of PMM node(s)
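The inequality above as a quick check. It assumes "5% of the Storage Pool" means 5% of the pool's total capacity and that all values share one unit. The names and sample numbers are hypothetical.

```python
def pmm_capacity_ok(free_tb, spare_tb, pool_tb, pmm_nodes_tb):
    """Free + Spare - 5% of the storage pool >= capacity of the PMM node(s)."""
    headroom = free_tb + spare_tb - 0.05 * pool_tb
    return headroom >= pmm_nodes_tb


print(pmm_capacity_ok(free_tb=30, spare_tb=20, pool_tb=200, pmm_nodes_tb=25))  # True  (40 >= 25)
print(pmm_capacity_ok(free_tb=10, spare_tb=20, pool_tb=200, pmm_nodes_tb=25))  # False (20 < 25)
```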