- Distributed programming framework for Big Data processing - Based on functional programming

Striping - Stripes data over a large number of disks to improve sequential+random reads&writes - Very bad choice for fault tolerance, only for scratch data that can be regenerated easily

Mirroring - Each data block is written to two hard disks: first one is the master copy, and secondly a mirror slave copy is written after the master - Reads are served by either one of the two hard disks - Loses half the storage space and halves write bandwidth/IOPS compared to using two drives - Data is available if either hard disk is available - Easy repair: Replace the failed hard disk and copy all of the data over to the replacement drive

Concepts Flashcards by Sallamari Sallmen

Scaling up

When the need for parallelism arises, a single powerful computer is added with more CPU cores, more memory, and more hard disks

How well did you know this?

Not at all

Perfectly

Scaling out

When the need for parallelism arises, the task is divided between a large number of less powerful machines with (relatively) slow CPUs, moderate memory amounts, moderate hard disk counts

How well did you know this?

Not at all

Perfectly

Pros and Cons of Scaling Up vs Scaling Out

Scaling up is more expensive than scaling out.
(Big high-end systems have much higher pricing for a given: CPU power, memory, and hard disk space)
Scaling out is more challenging for fault tolerance.
(A large number of loosely coupled systems means more components and thus more failures in hardware and in networking. Solution: Software fault tolerance)
Scaling out is more challenging for software development. (due to larger number of components, larger number of failures both in nodes and networking connecting them, and increased latencies. Solution: Scalable cloud platforms)

How well did you know this?

Not at all

Perfectly

Cloud computing

Cloud computing is a model for enabling convenient,
on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction. This cloud model promotes availability and is composed of five essential characteristics, three service models, and four deployment models.

How well did you know this?

Not at all

Perfectly

Essential charasteristics (5)

On-demand self-service,
Broad network access, 
Resource pooling, 
Rapid elasticity, 
and Measured Service

How well did you know this?

Not at all

Perfectly

Service models (3)

Cloud Software as a Service (SaaS),
Cloud Platform as a Service (PaaS),
and Cloud Infrastructure as a Service (IaaS)

How well did you know this?

Not at all

Perfectly

Deployment models (4)

Private cloud,
Community cloud,
Public cloud,
and Hybrid cloud

How well did you know this?

Not at all

Perfectly

Apache Spark

Distributed programming framework for Big Data processing

- Based on functional programming

How well did you know this?

Not at all

Perfectly

Resilient Distributed Datasets (RDDs)

Resilient Distributed Datasets (RDDs) are Scala
collection-like entities that are distributed over several
computers
The framework stores the RDDs in partitions, where a
separate thread can process each partition
To implement fault tolerance, the RDD partitions record lineage: A recipe to recompute the RDD partition based on the parent RDDs and the operation used to generate the partition
If a server goes down, the lineage information can be used to recompute its partitions on another server

How well did you know this?

Not at all

Perfectly

RAID Redundant Array of Independent Disks

Most commonly used fault tolerance mechanism in small scale

How well did you know this?

Not at all

Perfectly

RAID 0

Striping

Stripes data over a large number of disks to improve sequential+random reads&writes
Very bad choice for fault tolerance, only for scratch data that can be regenerated easily

How well did you know this?

Not at all

Perfectly

RAID 1

Mirroring

Each data block is written to two hard disks: first one is the master copy, and secondly a mirror slave copy is written after the master
Reads are served by either one of the two hard disks
Loses half the storage space and halves write bandwidth/IOPS compared to using two drives
Data is available if either hard disk is available
Easy repair: Replace the failed hard disk and copy all of the data over to the replacement drive

How well did you know this?

Not at all

Perfectly

RAID 5

Block-level striping with distributed parity
- Data is stored on n + 1 hard disks, where a parity checksum block is stored for each n blocks of data. The
parity blocks are distributed over all disks. Tolerates one hard disk failure

How well did you know this?

Not at all

Perfectly

RAID 5 - Properties (reads, writes, storage)

Sequential reads and writes are striped over all disks
Loses only one hard disk worth of storage to parity
Sequential read and write speeds are good, random read IOPS are good
Random small write requires reading one data block +
parity block, and writing back modified data block +
modified parity block, requiring 4 x IOPS in the worst case (battery backed up caches try to minimize this overhead.)

How well did you know this?

Not at all

Perfectly

RAID 5 - Properties (rebuild)

Rebuilding a disk requires reading all of the contents of the other n disks to regenerate the missing disk using parity - this is a slow process
Slow during rebuild: When one disk has failed, each
missing data block read requires reading n blocks of data when rebuild is in progress
Vulnerability to a second hard disk failure during array
rebuild and long rebuild times make most vendors instead recommend RAID 6 (see two slides ahead) with large capacity SATA drives

How well did you know this?

Not at all

Perfectly

RAID 6

Study These Flashcards

Block-level striping with double distributed parity
- Data is stored on n + 2 hard disks, where two parity
checksum blocks are stored for each n
blocks of data. The parity blocks are distributed over all disks. Tolerates two hard disk failures

RAID 6 - Properties (reads, writes, storage)

Study These Flashcards

Sequential reads and writes are striped over all disks
Loses two hard disk worth of storage to parity
Sequential read and write speeds are good, random read IOPS are good
Random small write requires reading one data block + two parity blocks, and writing back modified data block + two modified parity blocks, requiring 6 x IOPS in the worst case (battery backed up caches try to minimize this overhead)
Slow during rebuild: When 1-2 disks have failed, each
missing data block read requires reading n blocks of data when rebuild is in progress

RAID 10

Study These Flashcards

Stripe of Mirrors
- Data is stored on 2n hard disks, each mirror pair consists of two hard disks and the pairs are striped together to a single logical device

RAID 10 -Properties (reads, writes, storage)

Study These Flashcards

Tolerates only one hard disk failure in the worst case (n - one disk from each mirror pair - in the best case)
Loses half of the storage space
Sequential reads and writes are striped over all disks
Sequential read and write speeds are good, random read and write IOPS are good
Each data block is written to two hard disks. Random small write require 2 x IOPS in the worst case
Quicker during rebuild: Missing data is just copied over from the other disk of the mirror. Use of “hot spares” recommended to speed up recovery start

RAID Usage Scenarios

Study These Flashcards

RAID 0: Temporary space for data that can be discarded
RAID 1: Small scale installations, server boot disks, etc.
RAID 5: Some fault tolerance with minimum storage
overhead, small random writes can be problematic, RAID 5 Write Hole problem without special hardware (battery backed up cache / UPS)
RAID 6: Fairly good fault tolerance with reasonable storage overhead, small random writes can be even more problematic than in RAID 5, RAID 6 Write Hole problem without special hardware (battery backed up cache / UPS)
RAID 10: Less fault tolerant than RAID 6 but more fault
tolerant than RAID 5. Loses half of the storage capacity.
Good small random write performance makes this often
the choice for database workloads. Can avoid the “Write hole problem” under write ordering guarantees.

Database ACID guarantees

Study These Flashcards

Atomicity: Database modifications must follow an ”all or nothing” rule. Each transaction is said to be atomic. If one part of the transaction fails, the entire transaction fails and the database state is left unchanged.
Consistency (as defined in databases): Any transaction the database performs will take it from one consistent state to another.
Isolation: No transaction should be able to interfere with another transaction at all.
Durability: Once a transaction has been committed, it will remain so.

CAP properties (in distributed databases)

Study These Flashcards

Consistency: All nodes have a consistent view of the contents of the (distributed) database
Availability: A guarantee that every database request
eventually receives a response about whether it was
successful or whether it failed
Partition Tolerance: The system continues to operate
despite arbitrary message loss

Brewer’s CAP Theorem

Study These Flashcards

It is impossible to create a distributed system that is at the same time satisfies all three CAP properties:

Consistency
Availability
Partition tolerance

Study These Flashcards

A non-distributed (centralized) database system

- Example: Centralized version control - Svn

- A distributed database that can not be modified when network splits to partitions - Example: Google Bigtable

- A distributed database that can become inconsistent when network splits into partitions - Example: Distributed version control system - Git

System Tradeoffs - CA Systems

- Limited scalability - use when expected system load is low - Easiest to implement - Easy to program against - High latency if clients are not geographically close

System Tradeoffs - CP Systems

- Good scalability - Relatively easy to program against - Data consistency achieved in a scalable way - High latency of updates - Quorum algorithms (e.g., Paxos) can make the system available if a majority of the system components are connected to each other - Will eventually result in user visible unavailability for updates after network partition

System Tradeoffs - AP Systems

- Extremely good scalability - Very difficult to program against - There is no single algorithm to merge inconsistent updates. Analogy: In distributed version control systems there will always be some conflicting updates that can not both be done, and there is no single general method to intelligently choose which one of the conflicting updates should be applied, if any.

Consistent hashing

- Consistent hashing is a way to distribute the contents of a hash table over a distributed hash table (DHT) - The main advantage of the method is its elasticity. Namely if hash table contains K keys, and is distributed over n servers, either adding or removing a server will only require the relocation of O(K/n) keys

FLP Theorem

There is no algorithm that will solve the | asynchronous consensus problem!

Paxos Algorithm

- The Paxos algorithm was designed to be a sound algorithm for the asynchronous consensus problem - If the Paxos algorithm terminates, then its output is a correct outcome of a consensus algorithm - It terminates in practice with very high probability

Concepts Flashcards

(32 cards)