Chapter 5 Flashcards
Data wrangling
the process of filtering, cleansing and otherwise preparing data for downstream analysis
Storage is required whenever the following occurs:
• external datasets are acquired, or internal data will be used in a Big Data environment
• data is manipulated to be made amenable for data analysis
• data is processed via an ETL activity, or output is generated as a result of an analytical operation
Clusters
a collection of servers, or nodes, connected together via a network so that they work as a single unit
Each node in the cluster has its own dedicated resources, such as memory, a processor, and a hard drive.
file system
a method of storing and organizing data on a storage device, such as a flash drive, DVD or hard drive
A file
the atomic unit of storage used by the file system to store data
distributed file system
a file system that can store large files spread across the nodes of a cluster
Not-only SQL (NoSQL) database
a non-relational database that is highly scalable, fault-tolerant and specifically designed to house semi-structured and unstructured data
- provides an API-based query interface
- may support query languages other than Structured Query Language (SQL)
Sharding is
the process of horizontally partitioning a large dataset into a collection of smaller, more manageable datasets called shards.
- Each shard is stored on a separate node and each node is responsible for only the data stored on it.
- Each shard shares the same schema, and all shards collectively represent the complete dataset
- used to achieve horizontal scalability
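The partitioning described above can be sketched with a simple hash-based router. The node names and the modulo scheme below are illustrative assumptions, not any specific product's API:

```python
# Minimal sketch of hash-based sharding: each record's key is hashed
# to pick the shard (node) responsible for it.
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical cluster nodes

def shard_for(key: str, nodes=NODES) -> str:
    """Map a record key to the node that stores its shard."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]

# Every key lands on exactly one node; collectively the shards
# hold the complete dataset.
shards = {}
for key in ["user:1", "user:2", "user:3", "user:4"]:
    shards.setdefault(shard_for(key), []).append(key)
```

Because the mapping depends only on the key, any client can compute which node holds a record without consulting a central directory.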
Horizontal scaling
a method for increasing a system's capacity by adding similar or higher-capacity resources alongside existing resources
benefits of sharding
A benefit of sharding is that it provides partial tolerance toward failures. In case of a node failure, only data stored on that node is affected.
Replication
stores multiple copies of a dataset, known as replicas, on multiple nodes
- provides scalability and availability, since the same data is replicated on various nodes
- fault tolerance is also achieved, since data redundancy ensures that data is not lost when an individual node fails
There are two methods used to implement replication:
• master-slave
• peer-to-peer
master-slave replication
all data is written to a master node; write requests, including insert, update and delete, occur on the master node
the data is then replicated to multiple slave nodes, and read requests can be fulfilled by any slave node
Master-slave replication is ideal for
read-intensive loads rather than write-intensive loads
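A minimal sketch of this routing, assuming an in-memory store with made-up class and method names: writes go to the master, which copies its state to the slaves, and any slave serves reads.

```python
# Sketch of master-slave request routing: all writes occur on the
# master node, which replicates its data to the slaves; read requests
# can be fulfilled by any slave node.
import random

class MasterSlaveStore:
    def __init__(self, num_slaves=2):
        self.master = {}                          # master node's data
        self.slaves = [{} for _ in range(num_slaves)]

    def write(self, key, value):
        """All writes (insert/update/delete) occur on the master."""
        self.master[key] = value
        self._replicate()

    def _replicate(self):
        # Copy the master's state to every slave node.
        for slave in self.slaves:
            slave.clear()
            slave.update(self.master)

    def read(self, key):
        """Any slave node can fulfil a read request."""
        return random.choice(self.slaves).get(key)

store = MasterSlaveStore()
store.write("k", 1)
```

Spreading reads across the slaves is what makes this layout suit read-intensive workloads: adding slaves adds read capacity, while write capacity stays bounded by the single master.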
One concern with master-slave replication is
read inconsistency
reads can be inconsistent if a slave node is read before an update to the master has been copied to it
to address this, a voting system can be implemented in which a read is declared consistent if the majority of the slaves contain the same version of the record
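The voting idea can be sketched as a majority check over the replicas; the data layout below (one dict per slave) is an assumption for illustration:

```python
# Sketch of a majority-vote read: the record is fetched from all
# slaves and the read is declared consistent only if a majority of
# the replicas agree on the version.
from collections import Counter

def voted_read(key, slaves):
    """Return (value, consistent): consistent is True when a majority
    of slaves hold the same version of the record."""
    versions = [slave.get(key) for slave in slaves]
    value, count = Counter(versions).most_common(1)[0]
    return value, count > len(slaves) // 2

slaves = [{"x": "v2"}, {"x": "v2"}, {"x": "v1"}]  # one stale replica
value, consistent = voted_read("x", slaves)
# the majority version "v2" wins, so the read is declared consistent
```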
peer-to-peer replication
there is not a master-slave relationship between the nodes. Each node, known as a peer, is equally capable of handling reads and writes.
Peer-to-peer replication is prone to
write inconsistencies that occur as a result of a simultaneous update of the same data across multiple peers
Pessimistic concurrency
a proactive strategy that prevents inconsistency; it uses locking to ensure that only one update to a record can occur at a time
However, this is detrimental to availability since the database record being updated remains unavailable until all locks are released.
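A minimal sketch of pessimistic concurrency using a per-record lock; the in-memory record is hypothetical, not a real database API:

```python
# Sketch of pessimistic concurrency: a lock ensures only one update to
# the record can occur at a time, preventing inconsistency, but the
# record is unavailable to other writers until the lock is released.
import threading

record = {"balance": 100}
record_lock = threading.Lock()

def withdraw(amount):
    with record_lock:                 # blocks until the lock is acquired
        current = record["balance"]
        record["balance"] = current - amount

threads = [threading.Thread(target=withdraw, args=(10,)) for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# with the lock held around read-modify-write, no withdrawal is lost
```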
Optimistic concurrency is
a reactive strategy that does not use locking; instead, it allows inconsistency to occur with the knowledge that consistency will eventually be achieved after all updates have propagated
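Optimistic concurrency is commonly implemented with a version number per record: a write succeeds only if the version is unchanged since the read. A simplified single-process sketch (in a real system the version-check-then-write step would need to be atomic):

```python
# Sketch of optimistic concurrency: no locks are taken; each record
# carries a version number and a writer retries if the record changed
# between its read and its write.
def optimistic_update(store, key, update_fn, max_retries=10):
    """Retry until the record's version is unchanged between read and write."""
    for _ in range(max_retries):
        value, version = store[key]
        new_value = update_fn(value)
        # Compare-and-swap: only write if nobody updated the record meanwhile.
        if store[key][1] == version:
            store[key] = (new_value, version + 1)
            return True
    return False  # conflicting updates kept winning; caller may retry later

store = {"counter": (0, 0)}              # key -> (value, version)
optimistic_update(store, "counter", lambda v: v + 1)
optimistic_update(store, "counter", lambda v: v + 1)
```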
Combining Sharding and Master-Slave Replication
When sharding is combined with master-slave replication, multiple shards become slaves of a single master, and the master itself is a shard.
Combining Sharding and Peer-to-Peer Replication
When combining sharding with peer-to-peer replication, each shard is replicated to multiple peers, and each peer is only responsible for a subset of the overall dataset.
Collectively, this helps achieve increased scalability and fault tolerance.
As there is no master involved, there is no single point of failure, and fault tolerance for both read and write operations is supported.
CAP Theorem
Consistency, Availability, and Partition tolerance
It states that a distributed database system, running on a cluster, can only provide two of the following three properties:
CAP Theorem- Consistency
A read from any node results in the same data across multiple nodes
CAP Theorem- Availability
A read/write request will always be acknowledged in the form of a success or a failure
CAP Theorem - Partition tolerance
The database system can tolerate communication outages that split the cluster into multiple silos and can still service read/write requests
ACID
database design principle related to transaction management. It is an acronym that stands for:
• atomicity
• consistency
• isolation
• durability
ACID: Atomicity
ensures that all operations will always succeed or fail completely; in other words, there are no partial transactions
for example, if two out of three operations in a transaction succeed but one fails, the database rolls back any partial effects of the transaction and returns the system to its prior state
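A toy sketch of atomicity, assuming a dict-backed store: operations run against a working copy, and any failure discards all partial effects.

```python
# Sketch of an atomic transaction: stage changes on a copy, commit all
# of them on success, or roll back (keep the prior state) on failure.
import copy

def run_transaction(db, operations):
    """Apply all operations or none of them."""
    working = copy.deepcopy(db)   # stage changes away from the live data
    try:
        for op in operations:
            op(working)           # any operation may raise to signal failure
    except Exception:
        return db                 # roll back: prior state is untouched
    db.clear()
    db.update(working)            # commit: all effects become visible at once
    return db

def failing_op(d):
    raise ValueError("third operation fails")

db = {"a": 1}
db = run_transaction(db, [
    lambda d: d.update(a=2),      # succeeds on the working copy
    lambda d: d.update(b=3),      # succeeds on the working copy
    failing_op,                   # fails: the whole transaction rolls back
])
# db is unchanged; no partial effects survive
```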
ACID: Consistency
ensures that the database will always remain in a consistent state by ensuring that only data that conforms to the constraints of the database schema can be written to the database
e.g., an attempt to write a value of the wrong data type to a table is rejected
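A toy sketch of such a constraint check, with the schema simplified to a mapping from column name to required Python type (an assumption for illustration):

```python
# Sketch of a consistency check: only data that conforms to the
# schema's constraints can be written to the table.
schema = {"id": int, "name": str}   # hypothetical schema: column -> type

def insert_row(table, row, schema=schema):
    """Reject any row that violates the schema's type constraints."""
    for column, required_type in schema.items():
        if not isinstance(row.get(column), required_type):
            raise TypeError(f"{column!r} must be of type {required_type.__name__}")
    table.append(row)

table = []
insert_row(table, {"id": 1, "name": "alice"})        # conforms: accepted
try:
    insert_row(table, {"id": "two", "name": "bob"})  # wrong type for id
except TypeError:
    pass  # write rejected; the table remains in a consistent state
```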
Isolation
ensures that the results of a transaction are not visible to other operations until it is complete
e.g., only one transaction can update a given record at a time
Durability
ensures that the results of an operation are permanent. In other words, once a transaction has been committed, it cannot be rolled back. This is irrespective of any system failure.
BASE
- basically available
- soft state
- eventual consistency
BASE favors
availability over consistency.
In other words, the database is A+P from a CAP perspective.
In essence, BASE leverages optimistic concurrency by relaxing the strong consistency constraints mandated by the ACID properties.
“basically available,”
the database will always acknowledge a client's request, either in the form of the requested data or a success/failure notification
Soft state
a database may be in an inconsistent state when data is read; thus, the results may change if the same data is requested again
Eventual consistency
the state in which reads by different clients, immediately following a write to the database, may not return consistent results; the database attains consistency only once the changes have propagated to all nodes
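A toy sketch of eventual consistency, using plain dicts as peers: a read immediately after a write may return stale data, and all peers agree once replication catches up.

```python
# Sketch of eventual consistency: a write lands on one peer and
# propagates asynchronously; until propagation finishes, other peers
# serve the old value.
peers = [{"x": "old"}, {"x": "old"}, {"x": "old"}]

def write(key, value, peers):
    peers[0][key] = value         # accepted by one peer only

def propagate(peers):
    # Background replication: copy peer 0's state to the other peers.
    for peer in peers[1:]:
        peer.update(peers[0])

write("x", "new", peers)
stale_read = peers[2]["x"]        # change has not yet propagated
propagate(peers)
consistent_read = peers[2]["x"]   # all peers now agree
```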