Chapter 5 Flashcards
Data wrangling
filter, cleanse and otherwise prepare the data for downstream analysis
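A minimal sketch of the filter and cleanse steps, assuming a small list of hypothetical raw records. The field names and the `wrangle` helper are made up for illustration and are not from the chapter.

```python
# Hypothetical raw records; the field names are assumptions for illustration.
raw_records = [
    {"id": "1", "city": " Boston ", "temp_c": "21.5"},
    {"id": "2", "city": "boston", "temp_c": ""},        # missing reading
    {"id": "3", "city": "CHICAGO", "temp_c": "18.0"},
    {"id": "3", "city": "CHICAGO", "temp_c": "18.0"},   # duplicate id
]


def wrangle(records):
    """Filter out unusable rows and cleanse the remaining ones."""
    seen_ids = set()
    cleaned = []
    for rec in records:
        if not rec["temp_c"]:        # filter: drop rows with no reading
            continue
        if rec["id"] in seen_ids:    # filter: drop repeated ids
            continue
        seen_ids.add(rec["id"])
        cleaned.append({
            "id": int(rec["id"]),                   # cleanse: fix types
            "city": rec["city"].strip().title(),    # cleanse: normalize text
            "temp_c": float(rec["temp_c"]),
        })
    return cleaned


clean_records = wrangle(raw_records)
```

The cleaned list is what would be handed to downstream analysis.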
storage is required whenever the following occurs:
• external datasets are acquired, or internal data will be used in a Big Data environment
• data is manipulated to be made amenable for data analysis
• data is processed via an ETL activity, or output is generated as a result of an analytical operation
Clusters
collection of servers, or nodes, that are connected together via a network to work as a single unit
Each node in the cluster has its own dedicated resources, such as memory, a processor, and a hard drive.
file system
method of storing and organizing data on a storage device, such as flash drives, DVDs and hard drives
A file
atomic unit of storage used by the file system to store data
distributed file system
a file system that can store large files spread across the nodes of a cluster
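A rough sketch of the idea, assuming a file is cut into fixed-size blocks that are placed on nodes in round-robin order. The block size, the node names and both helper functions are hypothetical; real distributed file systems use their own placement policies.

```python
def distribute_file(data, node_names, block_size=4):
    """Split data into blocks and spread them across the given nodes."""
    placement = {name: [] for name in node_names}
    for offset in range(0, len(data), block_size):
        block_index = offset // block_size
        # Round-robin placement (an assumption made for this sketch).
        node = node_names[block_index % len(node_names)]
        placement[node].append((block_index, data[offset:offset + block_size]))
    return placement


def read_file(placement):
    """Collect the blocks from every node and reassemble them in order."""
    blocks = [block for node_blocks in placement.values() for block in node_blocks]
    return b"".join(chunk for _, chunk in sorted(blocks))
```

No single node holds the whole file, yet `read_file` returns the original bytes as if it had been stored in one place.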
Not-only SQL (NoSQL) database
is a non-relational database that is highly scalable, fault-tolerant and specifically designed to house semi-structured and unstructured data.
- provides an API-based query interface
- supports query languages other than Structured Query Language (SQL)
Sharding is
the process of horizontally partitioning a large dataset into a collection of smaller, more manageable datasets called shards.
- Each shard is stored on a separate node and each node is responsible for only the data stored on it.
- Each shard shares the same schema, and all shards collectively represent the complete dataset
- allows a system to achieve horizontal scalability
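The points above can be sketched as follows, assuming each record is assigned to a shard by hashing its key. Hash-based assignment and the `ShardedStore` class are assumptions for illustration; the card does not say how records are mapped to shards.

```python
import zlib


class ShardedStore:
    """Toy store where each dict in self.shards stands in for one node."""

    def __init__(self, num_shards):
        self.shards = [{} for _ in range(num_shards)]

    def _shard_for(self, key):
        # crc32 gives a stable hash, so a key always maps to the same shard.
        return zlib.crc32(key.encode("utf-8")) % len(self.shards)

    def put(self, key, record):
        self.shards[self._shard_for(key)][key] = record

    def get(self, key):
        # Only the shard responsible for the key is consulted.
        return self.shards[self._shard_for(key)].get(key)

    def all_records(self):
        # All shards together represent the complete dataset.
        merged = {}
        for shard in self.shards:
            merged.update(shard)
        return merged
```

Each record lives on exactly one shard, and `all_records` shows that the shards collectively make up the full dataset.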
Horizontal scaling
is a method for increasing a system’s capacity by adding similar or higher capacity resources alongside existing resources
benefits of sharding
A benefit of sharding is that it provides partial tolerance toward failures. In case of a node failure, only data stored on that node is affected.
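A small illustration of partial fault tolerance, assuming three hypothetical nodes that each hold one shard. The node names, keys and the `available_keys` helper are invented for the example.

```python
# Each node holds one shard of the dataset (hypothetical contents).
shards = {
    "node-1": {"a": 1, "b": 2},
    "node-2": {"c": 3},
    "node-3": {"d": 4, "e": 5},
}


def available_keys(shards, failed_nodes):
    """Return the keys that can still be read when some nodes are down."""
    return sorted(
        key
        for node, data in shards.items()
        if node not in failed_nodes
        for key in data
    )


reachable = available_keys(shards, failed_nodes={"node-2"})
```

With `node-2` down, only the key stored on that node is unreachable; the data on the other nodes is unaffected.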
Replication
stores multiple copies of a dataset, known as replicas, on multiple nodes
- provides scalability and availability because the same data is replicated on various nodes
- provides fault tolerance, since data redundancy ensures that data is not lost when an individual node fails
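A minimal sketch of replication, assuming every write is copied to all healthy nodes. The `ReplicatedStore` class and its methods are hypothetical and ignore details such as a failed node catching up later.

```python
class ReplicatedStore:
    """Toy store that keeps a full replica of the data on every node."""

    def __init__(self, node_names):
        self.nodes = {name: {} for name in node_names}
        self.failed = set()

    def write(self, key, value):
        # Copy the value to every node that is currently up.
        for name, data in self.nodes.items():
            if name not in self.failed:
                data[key] = value

    def fail(self, name):
        # Simulate the loss of a single node.
        self.failed.add(name)

    def read(self, key):
        # Any surviving replica can serve the read.
        for name, data in self.nodes.items():
            if name not in self.failed and key in data:
                return data[key]
        raise KeyError(key)
```

Because each node holds a replica, a read still succeeds after one node fails.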
There are two different methods that are used to implement replication
• master-slave
• peer-to-peer
master-slave replication
all data is written to a master node
write requests, including insert, update and delete, occur on the master node
data is replicated over to multiple slave nodes
read requests can be fulfilled by any slave node
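The flow above can be sketched as follows, assuming replication is an explicit step that copies the master's state to every slave and that reads rotate across slaves. The `MasterSlaveStore` class, the `replicate` method and the round-robin choice are assumptions for illustration.

```python
import itertools


class MasterSlaveStore:
    """Toy model: writes go to the master, reads are served by slaves."""

    def __init__(self, num_slaves):
        self.master = {}
        self.slaves = [{} for _ in range(num_slaves)]
        self._next_slave = itertools.cycle(range(num_slaves))

    def write(self, key, value):
        # Inserts and updates occur only on the master.
        self.master[key] = value

    def delete(self, key):
        # Deletes also occur only on the master.
        self.master.pop(key, None)

    def replicate(self):
        # Copy the master's current state over to every slave.
        for slave in self.slaves:
            slave.clear()
            slave.update(self.master)

    def read(self, key):
        # Any slave can fulfill a read; rotate to spread the load.
        return self.slaves[next(self._next_slave)].get(key)
```

Until `replicate` runs, a read does not yet see the latest write to the master.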
Master-slave replication is ideal for
read intensive loads rather than write intensive loads
One concern with master-slave replication is
read inconsistency
- becomes an issue if a slave node is read before an update to the master has been copied to it
- to mitigate this, a voting system can be implemented where a read is declared consistent if the majority of the slaves contain the same version of the record
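A minimal sketch of such a voting read, assuming each slave reports its copy of a record as a hypothetical `(version, value)` pair. The `consistent_read` function and the strict-majority rule (more than half of the slaves) are one possible reading of "majority".

```python
from collections import Counter


def consistent_read(slave_copies):
    """Return the value held by a majority of slaves, or None if no majority.

    slave_copies is a list of (version, value) tuples, one per slave.
    """
    if not slave_copies:
        return None
    counts = Counter(slave_copies)
    (version, value), votes = counts.most_common(1)[0]
    # The read is declared consistent only if more than half of the slaves agree.
    if votes > len(slave_copies) / 2:
        return value
    return None
```

For example, `consistent_read([(2, "new"), (2, "new"), (1, "old")])` returns `"new"`, while a one-to-one split yields `None` and the caller could retry once replication has caught up.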