Chapter 5 Flashcards by Joohi Rana

Data wrangling

filter, cleanse and otherwise prepare the data for downstream analysis

How well did you know this?

Not at all

Perfectly

storage is required whenever the following

occurs:

external datasets are acquired, or internal data will be used in a Big Data environment
• data is manipulated to be made amenable for data analysis
• data is processed via an ETL activity, or output is generated as a result of an
analytical operation

How well did you know this?

Not at all

Perfectly

Clusters

collection of servers, or nodes that connected together via network to work as a single unit
Each node in the cluster has its own dedicated resources, such as memory, a processor, and a hard drive.

How well did you know this?

Not at all

Perfectly

file system

method of storing and organizing data on a storage device, such as flash drives, DVDs and hard drives

How well did you know this?

Not at all

Perfectly

A file

atomic unit of storage used by the file

system to store data

How well did you know this?

Not at all

Perfectly

distributed file system

a file system that can store large files spread across the nodes
of a cluster

How well did you know this?

Not at all

Perfectly

Not-only SQL (NoSQL) database

is a non-relational database that is highly scalable,
fault-tolerant and specifically designed to house semi-structured and unstructured data.
- provides an API-based query interface
- support query languages other than Structured Query Language (SQL)

How well did you know this?

Not at all

Perfectly

Sharding is

the process of horizontally partitioning a large dataset into a collection of smaller, more manageable datasets called shards.
- Each shard is stored on a separate node and each node is responsible for only the data stored on it.

Each shard shares the same schema, and all shards collectively represent the complete dataset
achieve horizontal scalability

How well did you know this?

Not at all

Perfectly

Horizontal scaling

is a method for increasing a system’s capacity by adding similar or
higher capacity resources alongside existing resources

How well did you know this?

Not at all

Perfectly

benefits of sharding

A benefit of sharding is that it provides partial tolerance toward failures. In case of a node failure, only data stored on that node is affected.

How well did you know this?

Not at all

Perfectly

Replication

stores multiple copies of a dataset, known as replicas, on multiple nodes
-provides scalability and availability due to the fact that the same
data is replicated on various nodes
-nodes. Fault tolerance is also achieved since data redundancy
ensures that data is not lost when an individual node fails.

How well did you know this?

Not at all

Perfectly

There are two different methods

that are used to implement replication

master-slave

• peer-to-peer

How well did you know this?

Not at all

Perfectly

master-slave replication

all data is written to a master node
write requests, including insert, update and delete, occur on the
master node
data is replicated over to multiple
slave nodes
read requests can be fulfilled by any slave node

How well did you know this?

Not at all

Perfectly

Master-slave replication is ideal for

read intensive loads rather than write intensive loads

How well did you know this?

Not at all

Perfectly

One concern with master-slave replication is

read inconsistency
an issue if a slave node is read prior to an update to the master being copied to it
a voting system can be implemented where a read is declared consistent if the majority of the slaves contain the same version of the record

How well did you know this?

Not at all

Perfectly

peer-to-peer replication

Study These Flashcards

there is not a master-slave relationship between the nodes. Each node, known as a peer, is equally capable of handling reads and writes.

Peer-to-peer replication is prone to

Study These Flashcards

write inconsistencies that occur as a result of a

simultaneous update of the same data across multiple peers

Pessimistic concurrency

Study These Flashcards

proactive strategy that prevents inconsistency. It uses
locking to ensure that only one update to a record can occur at a time.
However, this is detrimental to availability since the database record being updated remains unavailable until all locks are released.

Optimistic concurrency is

Study These Flashcards

reactive strategy that does not use locking. Instead, it
allows inconsistency to occur with knowledge that eventually consistency will be achieved after all updates have propagated.

Combining Sharding and Master-Slave Replication

Study These Flashcards

When sharding is combined with master-slave replication, multiple shards become slaves of a single master, and the master itself is a shard.

Combining Sharding and Peer-to-Peer Replication

Study These Flashcards

When combining sharding with peer-to-peer replication, each shard is replicated to multiple peers, and each peer is only responsible for a subset of the overall dataset.

Collectively, this helps achieve increased scalability and fault tolerance.

As there is no master involved, there is no single point of failure and fault-tolerance for both read and
write operations is supported

CAP Theorem

Study These Flashcards

Consistency, Availability, and Partition tolerance
It states that a distributed database system, running on a cluster, can only provide two of the following three properties:

CAP Theorem- Consistency

Study These Flashcards

A read from any node results in the same data across multiple nodes

CAP Theorem- Availability

Study These Flashcards

A read/write request will always be acknowledged in the form of a
success or a failure

CAP Theorem - Partition tolerance

The database system can tolerate communication outages that | split the cluster into multiple silos and can still service read/write requests

ACID

``` database design principle related to transaction management. It is an acronym that stands for: • atomicity • consistency • isolation • durability ```

ACID: Atomicity

ensures that all operations will always succeed or fail completely. In other words, there are no partial transactions. so if 2 out of 3 transactions are successfully but one is not, database roll backs any partial effects of the transaction and puts the system back to its prior state

ACID: Consistency

ensures that the database will always remain in a consistent state by ensuring that only data that conforms to the constraints of the database schema can be written to the database ex: wrong data type being added to table

Isolation

ensures that the results of a transaction are not visible to other operations until it is complete ex one person can update at a time

Durability

ensures that the results of an operation are permanent. In other words, once a transaction has been committed, it cannot be rolled back. This is irrespective of any system failure.

BASE

* basically available * soft state * eventual consistency

BASE favors

availability over consistency. In other words, the database is A+P from a CAP perspective. In essence, BASE leverages optimistic concurrency by relaxing the strong consistency constraints mandated by the ACID properties.

“basically available,”

database will always acknowledge a client’s request, either in the form of the requested data or a success/failure notification.

Soft state

a database may be in an inconsistent state when data is read; thus, the results may change if the same data is requested again.

Eventual consistency

the state in which reads by different clients, immediately | following a write to the database, may not return consistent results

Chapter 5 Flashcards

(35 cards)