Chapter 5 Flashcards

1
Q

Data wrangling

A

the steps taken to filter, cleanse and otherwise prepare data for downstream analysis
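As a rough illustration of the idea, the sketch below filters and cleanses a few made-up records in plain Python. The field names (`name`, `age`) and the `wrangle` helper are hypothetical and exist only for this example.

```python
# Hypothetical raw records; the field names are assumptions for illustration.
raw_records = [
    {"name": " Alice ", "age": "34"},
    {"name": "Bob", "age": ""},        # missing age: filtered out
    {"name": "carol", "age": "29"},
    {"name": " Alice ", "age": "34"},  # duplicate once cleansed
]

def wrangle(records):
    """Filter out incomplete rows, normalize fields, and drop duplicates."""
    seen = set()
    cleaned = []
    for row in records:
        name = row["name"].strip().title()
        age = row["age"].strip()
        if not name or not age.isdigit():
            continue  # filter: skip rows that cannot be analyzed
        key = (name, int(age))
        if key in seen:
            continue  # cleanse: drop exact duplicates
        seen.add(key)
        cleaned.append({"name": name, "age": int(age)})
    return cleaned

print(wrangle(raw_records))
```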

2
Q

Storage is required whenever the following occurs:

A

• external datasets are acquired, or internal data will be used in a Big Data environment
• data is manipulated to be made amenable for data analysis
• data is processed via an ETL activity, or output is generated as a result of an analytical operation

3
Q

Clusters

A

a collection of servers, or nodes, that are connected together via a network to work as a single unit.
Each node in the cluster has its own dedicated resources, such as memory, a processor, and a hard drive.

4
Q

file system

A

method of storing and organizing data on a storage device, such as flash drives, DVDs and hard drives

5
Q

A file

A

atomic unit of storage used by the file system to store data

6
Q

distributed file system

A

a file system that can store large files spread across the nodes of a cluster
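One way to picture this is a minimal sketch that splits a file into fixed-size blocks and spreads them across nodes. The node names, the tiny block size, and the round-robin placement are all assumptions made for illustration, not a description of any particular file system.

```python
def distribute(data: bytes, nodes: list[str], block_size: int = 4):
    """Split data into fixed-size blocks and assign them round-robin to nodes."""
    placement = {node: [] for node in nodes}
    for offset in range(0, len(data), block_size):
        block_index = offset // block_size
        node = nodes[block_index % len(nodes)]
        placement[node].append((block_index, data[offset:offset + block_size]))
    return placement

def reassemble(placement):
    """Rebuild the original file by ordering the blocks by their index."""
    blocks = sorted(b for node_blocks in placement.values() for b in node_blocks)
    return b"".join(chunk for _, chunk in blocks)

# Hypothetical three-node cluster.
placement = distribute(b"big data file contents", ["node-a", "node-b", "node-c"])
```

No single node holds the whole file here, yet `reassemble(placement)` recovers it by collecting blocks from every node.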

7
Q

Not-only SQL (NoSQL) database

A

a non-relational database that is highly scalable, fault-tolerant and specifically designed to house semi-structured and unstructured data.
- often provides an API-based query interface
- may support query languages other than Structured Query Language (SQL)

8
Q

Sharding is

A

the process of horizontally partitioning a large dataset into a collection of smaller, more manageable datasets called shards.
- Each shard is stored on a separate node, and each node is responsible for only the data stored on it.
- Each shard shares the same schema, and all shards collectively represent the complete dataset.
- Sharding helps achieve horizontal scalability.
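The routing step can be sketched as follows, assuming a simple hash-based scheme. The number of shards, the `shard_for`, `put` and `get` helpers, and the use of in-memory dictionaries as stand-ins for nodes are all hypothetical; real systems may shard by range or other rules instead.

```python
import zlib

NUM_SHARDS = 3  # assumed number of nodes, one shard per node
shards = [dict() for _ in range(NUM_SHARDS)]  # each dict stands in for a node

def shard_for(key: str) -> int:
    """Map a record key to a shard using a stable hash."""
    return zlib.crc32(key.encode("utf-8")) % NUM_SHARDS

def put(key: str, record: dict) -> None:
    shards[shard_for(key)][key] = record

def get(key: str):
    # Only the node that owns the key's shard is consulted.
    return shards[shard_for(key)].get(key)

for user_id in ["u1", "u2", "u3", "u4", "u5"]:
    put(user_id, {"id": user_id})
```

Every record lands on exactly one shard, and the shards together hold the complete dataset.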
9
Q

Horizontal scaling

A

a method for increasing a system’s capacity by adding similar or higher capacity resources alongside existing resources

10
Q

benefits of sharding

A

A benefit of sharding is that it provides partial tolerance toward failures. In case of a node failure, only data stored on that node is affected.

11
Q

Replication

A

stores multiple copies of a dataset, known as replicas, on multiple nodes
- provides scalability and availability because the same data is replicated on various nodes
- fault tolerance is also achieved, since data redundancy ensures that data is not lost when an individual node fails

12
Q

There are two different methods that are used to implement replication

A

• master-slave
• peer-to-peer

13
Q

master-slave replication

A

- all data is written to a master node
- write requests, including insert, update and delete, occur on the master node
- data is replicated over to multiple slave nodes
- read requests can be fulfilled by any slave node
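The flow above can be sketched with a small in-memory model. The `MasterSlaveStore` class, its method names, and the explicit `replicate()` step are hypothetical simplifications; in a real system replication typically runs continuously in the background.

```python
import itertools

class MasterSlaveStore:
    def __init__(self, num_slaves: int = 2):
        self.master = {}
        self.slaves = [{} for _ in range(num_slaves)]
        self._next_slave = itertools.cycle(range(num_slaves))

    def write(self, key, value):
        """All writes go to the master; a value of None models a delete."""
        if value is None:
            self.master.pop(key, None)
        else:
            self.master[key] = value

    def replicate(self):
        """Copy the master's current state to every slave."""
        for slave in self.slaves:
            slave.clear()
            slave.update(self.master)

    def read(self, key):
        """Reads are served by any slave; here, chosen round-robin."""
        return self.slaves[next(self._next_slave)].get(key)

store = MasterSlaveStore()
store.write("k", "v1")
stale = store.read("k")   # replication has not run yet, so the slave has no value
store.replicate()
fresh = store.read("k")   # the slave now holds the master's data
```

Because reads never touch the master, adding slaves increases read capacity, while every write still passes through the single master.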

14
Q

Master-slave replication is ideal for

A

read intensive loads rather than write intensive loads

15
Q

One concern with master-slave replication is

A

read inconsistency
- an issue if a slave node is read before an update to the master has been copied to it
- to ensure read consistency, a voting system can be implemented in which a read is declared consistent if the majority of the slaves contain the same version of the record
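A minimal sketch of such a vote, assuming each slave simply reports the version of the record it currently holds. The `majority_read` function and the version labels are hypothetical.

```python
from collections import Counter

def majority_read(slave_values):
    """Return (value, consistent).

    consistent is True only if a strict majority of slaves hold the same version.
    """
    value, votes = Counter(slave_values).most_common(1)[0]
    return value, votes > len(slave_values) / 2

print(majority_read(["v2", "v2", "v1"]))  # two of three slaves agree
print(majority_read(["v1", "v2", "v3"]))  # no majority
```

The cost of this approach is that a read must contact several slaves instead of one.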

16
Q

peer-to-peer replication

A

there is not a master-slave relationship between the nodes. Each node, known as a peer, is equally capable of handling reads and writes.

17
Q

Peer-to-peer replication is prone to

A

write inconsistencies that occur as a result of a simultaneous update of the same data across multiple peers

18
Q

Pessimistic concurrency

A

proactive strategy that prevents inconsistency. It uses
locking to ensure that only one update to a record can occur at a time.
However, this is detrimental to availability since the database record being updated remains unavailable until all locks are released.
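As a small illustration, the sketch below guards a single record with a lock so that only one update runs at a time. The `LockedRecord` class is hypothetical, and a thread lock stands in for a database record lock.

```python
import threading

class LockedRecord:
    def __init__(self, value=0):
        self.value = value
        self._lock = threading.Lock()

    def update(self, fn):
        # Other writers block here until the lock is released,
        # which is the availability cost of this strategy.
        with self._lock:
            self.value = fn(self.value)

record = LockedRecord()

def add_many():
    for _ in range(1000):
        record.update(lambda v: v + 1)

threads = [threading.Thread(target=add_many) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

With the lock in place no increment is lost, so `record.value` ends at 4000.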

19
Q

Optimistic concurrency is

A

reactive strategy that does not use locking. Instead, it allows inconsistency to occur with the knowledge that consistency will eventually be achieved after all updates have propagated.
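A minimal sketch of this behavior, assuming two peers and a last-write-wins rule for reconciling conflicting updates. The `Peer` class, the integer `stamp`, and the `propagate` function are hypothetical; real systems use a variety of conflict-resolution rules.

```python
class Peer:
    def __init__(self):
        self.data = {}  # key -> (stamp, value)

    def write(self, key, value, stamp):
        # No lock is taken; the write is accepted immediately.
        self.data[key] = (stamp, value)

def propagate(peers):
    """Exchange updates; the highest stamp wins (assumed last-write-wins rule)."""
    merged = {}
    for peer in peers:
        for key, versioned in peer.data.items():
            if key not in merged or versioned[0] > merged[key][0]:
                merged[key] = versioned
    for peer in peers:
        peer.data = dict(merged)

a, b = Peer(), Peer()
a.write("x", "from-a", stamp=1)
b.write("x", "from-b", stamp=2)
diverged = a.data["x"] != b.data["x"]  # the peers disagree for a while
propagate([a, b])                      # after propagation they converge
```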

20
Q

Combining Sharding and Master-Slave Replication

A

When sharding is combined with master-slave replication, multiple shards become slaves of a single master, and the master itself is a shard.

21
Q

Combining Sharding and Peer-to-Peer Replication

A

When combining sharding with peer-to-peer replication, each shard is replicated to multiple peers, and each peer is only responsible for a subset of the overall dataset.

Collectively, this helps achieve increased scalability and fault tolerance.

As there is no master involved, there is no single point of failure, and fault tolerance for both read and write operations is supported.
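The placement can be sketched as follows, assuming each shard is copied to two consecutive peers arranged in a ring. The peer names, the replica count, and both helper functions are hypothetical.

```python
def place_shards(num_shards: int, peers: list[str], replicas: int = 2):
    """Assign each shard to `replicas` consecutive peers in a ring."""
    placement = {}
    for shard in range(num_shards):
        placement[shard] = [peers[(shard + r) % len(peers)] for r in range(replicas)]
    return placement

def surviving_shards(placement, failed_peer: str):
    """Shards that still have at least one reachable copy after a peer fails."""
    return [s for s, owners in placement.items()
            if any(p != failed_peer for p in owners)]

placement = place_shards(4, ["p1", "p2", "p3", "p4"])
```

Each peer holds only two of the four shards, yet losing any single peer leaves every shard reachable on another peer.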

22
Q

CAP Theorem

A

states that a distributed database system, running on a cluster, can only provide two of the following three properties:
• consistency
• availability
• partition tolerance

23
Q

CAP Theorem- Consistency

A

A read from any node results in the same data across multiple nodes

24
Q

CAP Theorem- Availability

A

A read/write request will always be acknowledged in the form of a success or a failure

25
Q

CAP Theorem - Partition tolerance

A

The database system can tolerate communication outages that split the cluster into multiple silos and can still service read/write requests

26
Q

ACID

A
database design principle related to transaction management. It is an acronym that stands for:
• atomicity
• consistency
• isolation
• durability
27
Q

ACID: Atomicity

A

ensures that all operations in a transaction will always succeed or fail completely. In other words, there are no partial transactions.

ex: if two of the three operations in a transaction succeed but one fails, the database rolls back any partial effects of the transaction and puts the system back to its prior state
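A small sketch using Python's built-in `sqlite3` module shows the rollback. The `accounts` table and the transfer amounts are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

try:
    with conn:  # commits on success, rolls back if an exception escapes
        conn.execute("UPDATE accounts SET balance = balance + 150 WHERE name = 'bob'")
        # This step violates the CHECK constraint, so the transaction fails.
        conn.execute("UPDATE accounts SET balance = balance - 150 WHERE name = 'alice'")
except sqlite3.IntegrityError:
    pass  # the credit to bob was rolled back along with the failed debit

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

Although the first update succeeded on its own, `balances` still shows the original values because the transaction failed as a whole.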

28
Q

ACID: Consistency

A

ensures that the database will always remain in a consistent state by ensuring that only data that conforms to the constraints of the database schema can be written to the database

ex: an attempt to add a value of the wrong data type to a table is rejected
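This can be sketched with Python's built-in `sqlite3` module. SQLite is loosely typed by default, so the example assumes a `CHECK` constraint with `typeof()` to enforce the column type; the `readings` table is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE readings (sensor TEXT NOT NULL, "
    "value INTEGER CHECK (typeof(value) = 'integer'))"
)
conn.execute("INSERT INTO readings VALUES ('s1', 42)")

try:
    # Text in an integer column violates the constraint.
    conn.execute("INSERT INTO readings VALUES ('s2', 'not-a-number')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM readings").fetchone()[0]
```

The invalid row never reaches the table, so the database stays in a state that conforms to its schema.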

29
Q

ACID: Isolation

A

ensures that the results of a transaction are not visible to other operations until it is complete

ex: only one user can update a given record at a time, and other users do not see the change until the update is complete
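A minimal sketch with Python's built-in `sqlite3` module, assuming two connections to the same database file with SQLite's default settings. The `notes` table and the temporary file path are made up for illustration.

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.db")
writer = sqlite3.connect(path)
reader = sqlite3.connect(path)

writer.execute("CREATE TABLE notes (body TEXT)")
writer.commit()

# The insert opens a transaction on the writer that is not yet committed.
writer.execute("INSERT INTO notes VALUES ('draft')")
before = reader.execute("SELECT COUNT(*) FROM notes").fetchall()[0][0]

writer.commit()
after = reader.execute("SELECT COUNT(*) FROM notes").fetchall()[0][0]
```

The reader sees no rows while the writer's transaction is still open and sees the new row only after the commit.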

30
Q

ACID: Durability

A

ensures that the results of an operation are permanent. In other words, once a transaction has been committed, it cannot be rolled back. This is irrespective of any system failure.

31
Q

BASE

A
an acronym that stands for:
• basically available
• soft state
• eventual consistency
32
Q

BASE favors

A

availability over consistency.
In other words, the database is A+P from a CAP perspective.
In essence, BASE leverages optimistic concurrency by relaxing the strong consistency constraints mandated by the ACID properties.

33
Q

Basically available

A

database will always acknowledge a client’s request, either in the form of the requested data or a success/failure notification.

34
Q

Soft state

A

a database may be in an inconsistent state when data is read; thus, the results may change if the same data is requested again.

35
Q

Eventual consistency

A

the state in which reads by different clients, immediately following a write to the database, may not return consistent results. Consistency is reached only after the update has propagated to all nodes.