L7 Flashcards

Question 1

Q

traditional ways of data storage

Answer

A

structured, relational data
rows and columns
very expensive when data is high in volume, velocity, etc.
time consuming
not scalable
a rigid check is made whenever data is inserted or updated, which is a problem for data arriving at high velocity
RDBMS, data warehouse/marts, SQL database
-> only about 20% of data nowadays is traditional structured data

Question 2

Q

RDBMS vs non-RDBMS

data
schema
scalability
language
transaction
best for
examples

Answer

A

stored in tables vs stored according to different models (key-value, wide-column store, multi-model, etc)
supports rigid pre-defined data schema vs supports dynamic data schema
vertically scalable vs horizontally scalable
Structured Query Language (SQL) vs database-specific NoSQL query languages and APIs
ACID-compliant vs CAP theorem, possibly ACID compliant
best for transactions and routine data analysis through complex queries vs best for storing and modelling all types of data
RDBMS ex: Amazon Redshift, Oracle Database, MySQL
non-RDBMS ex: Amazon DynamoDB, Oracle NoSQL Database

Question 3

Q

ACID compliant

Answer

A

set of properties that guarantee data validity despite errors, etc. in an RDBMS
1. atomicity: either all changes made by a transaction are applied or none are
2. consistency: prevents transactions from violating integrity constraints or database rules
3. isolation: concurrent transactions don't affect each other
4. durability: finalized transactions are permanent and will survive system failures
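
A minimal sketch of atomicity and consistency, assuming Python's built-in sqlite3 module. The accounts table, the names and the amounts are hypothetical and only serve as an illustration: a transfer that would break the CHECK rule is rolled back as a whole, so the credit that already ran is undone too.

```python
import sqlite3

# Hypothetical two-account table, used only to illustrate the properties.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"  # integrity rule
)
conn.execute("INSERT INTO accounts VALUES ('alice', 100)")
conn.execute("INSERT INTO accounts VALUES ('bob', 50)")
conn.commit()


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint failed: neither update is kept.
        return False


ok = transfer(conn, "alice", "bob", 500)  # alice only has 100
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

Here `ok` is False and both balances are unchanged, because the failed debit also cancels the credit that preceded it.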

Question 4

Q

CAP theorem

Answer

A

any distributed data store can guarantee only 2 of the following 3:
1. consistency: all users see same view of data
2. availability: all users can find replica in case of node failure
3. partition tolerance: works even in presence of network failure
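
A toy sketch of the trade-off, assuming two replicas and a network partition between them. The "CP" and "AP" labels and the `write` helper are illustrative assumptions, not a real database API: during a partition the store either refuses the write (stays consistent) or accepts it (stays available, replicas diverge).

```python
class Replica:
    """One copy of the data on one node."""

    def __init__(self):
        self.data = {}


def write(replicas, partitioned, mode, key, value):
    """Write through replica 0; `mode` is "CP" or "AP" (illustrative labels)."""
    if partitioned and mode == "CP":
        # The peer is unreachable: refuse the write so replicas stay identical.
        return False
    replicas[0].data[key] = value
    if not partitioned:
        replicas[1].data[key] = value  # normal case: copy to the peer
    return True


cp = [Replica(), Replica()]
cp_accepted = write(cp, partitioned=True, mode="CP", key="x", value=1)

ap = [Replica(), Replica()]
ap_accepted = write(ap, partitioned=True, mode="AP", key="x", value=1)
```

The CP pair rejects the write and both replicas stay equal; the AP pair accepts it, and replica 1 is temporarily out of date.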

Question 5

Q

definition of big data (BD) storage

Answer

A

infrastructure designed specifically to store, manage and retrieve massive amounts of data
compute-and-storage architecture that collects and manages large datasets and enables real-time data analytics

Question 6

Q

5 methods to store BD

Answer

A

Hadoop Distributed File System (HDFS)
NoSQL
NewSQL
Clustered Network Attached Storage (NAS)
Cloud Computing

Question 7

Q

HDFS

Answer

A

one of the components of Apache Hadoop
designed to run on commodity hardware
an HDFS cluster is made up of many racks, each a collection of datanodes (where data is stored), plus one namenode (where indices and block names are stored)
divides data into blocks (e.g. of 128 MB) and distributes them across many datanodes (computers)
each block is replicated in more than one node -> fault-tolerance
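
A rough sketch of block splitting and replication, assuming a replication factor of 3 and simple round-robin placement. Real HDFS placement is rack-aware, so the `place_blocks` helper and the node names are illustrative only.

```python
import math

BLOCK_SIZE_MB = 128  # block size mentioned in the card
REPLICATION = 3      # assumed replication factor for this sketch


def place_blocks(file_size_mb, datanodes):
    """Return {block index: list of datanodes holding a copy}.

    Round-robin placement, chosen for simplicity; it is not the HDFS policy.
    """
    n_blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    placement = {}
    for block in range(n_blocks):
        placement[block] = [
            datanodes[(block + replica) % len(datanodes)]
            for replica in range(REPLICATION)
        ]
    return placement


nodes = ["dn1", "dn2", "dn3", "dn4"]
placement = place_blocks(500, nodes)  # a 500 MB file needs 4 blocks

# Fault tolerance: every block still has a copy if dn1 goes down.
survives_dn1_loss = all(
    any(node != "dn1" for node in copies) for copies in placement.values()
)
```

The `placement` dictionary plays the role of the namenode's index: it records which datanodes hold each block, while the blocks themselves live on the datanodes.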

Question 8

Q

HDFS Advantages

Answer

A

Pros:
- no single point of failure for stored data, since every block is replicated -> fault tolerance
- possible to communicate with closest available node to reduce latency and network traffic
- good at handling extremely large files
- streaming data access (data is read at constant velocity, not in batches) -> read-intensive
- commodity hardware

Question 9

Q

NoSQL database

what is it
features

Answer

A

Not Only SQL
designed to meet big data processing demands
supports structured, semi-structured and unstructured data
better for lots of unstructured data
CAP theorem
BASE model
Schemaless
Horizontal scalability

Question 10

Q

BASE model

Answer

A

Basically Available: always available despite network failure
Soft state: node information could be temporarily inconsistent
Eventual consistency: changes will eventually be propagated to all nodes
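
A toy sketch of soft state and eventual consistency, assuming a hypothetical three-node cluster where a write lands on one node first and is copied to the others later by an explicit `sync` step.

```python
class Cluster:
    """Toy cluster: writes hit node 0 first, other nodes catch up on sync()."""

    def __init__(self, n_nodes):
        self.nodes = [{} for _ in range(n_nodes)]
        self.pending = []  # writes not yet propagated to the other nodes

    def write(self, key, value):
        self.nodes[0][key] = value
        self.pending.append((key, value))

    def read(self, node_index, key):
        # Always answers (basically available), possibly with stale data.
        return self.nodes[node_index].get(key)

    def sync(self):
        """Propagate pending writes to every other node."""
        for key, value in self.pending:
            for node in self.nodes[1:]:
                node[key] = value
        self.pending.clear()


cluster = Cluster(3)
cluster.write("likes", 10)
stale = cluster.read(2, "likes")  # soft state: node 2 has not seen the write
cluster.sync()
fresh = cluster.read(2, "likes")  # eventual consistency: now visible
```

Before `sync` the read from node 2 returns None; afterwards it returns 10.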

Question 11

Q

Schemaless

Answer

A

doesn’t require pre-defined structure to store data
data can be in any format (structured or unstructured)
no prior knowledge necessary about data

Question 12

Q

Horizontal Scalability

Answer

A

scale-out (add more instances) vs scale-up (add more RAM, CPU)
sharding (distribution of datasets into smaller chunks) and replication (for fault tolerance)
capable of growing dynamically
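
A minimal sketch of hash-based sharding, assuming keys are strings and shards are plain lists. The `shard_for` and `distribute` helpers are hypothetical names, and SHA-256 is used only to spread keys evenly.

```python
import hashlib


def shard_for(key, n_shards):
    """Map a key to a shard index using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_shards


def distribute(keys, n_shards):
    """Split the keys into n_shards smaller chunks."""
    shards = [[] for _ in range(n_shards)]
    for key in keys:
        shards[shard_for(key, n_shards)].append(key)
    return shards


keys = [f"user{i}" for i in range(1000)]
three_shards = distribute(keys, 3)
four_shards = distribute(keys, 4)  # scale out: add one more instance
```

With plain modulo hashing, adding a shard moves many keys to a different shard; real systems often use consistent hashing to limit that movement.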

Question 13

Q

4 types of NoSQL

Answer

A

key-value store db
column-store db
document db
graph db

Question 14

Q

Key-value store DB

Answer

A

simplest and most efficient
stores data as key-value pairs where key is unique identifier
value can be a simple object or a complex compound object
key is a string and value is the data
similar to hash tables
OLAP suitable (cubes of multidimensional data)
eg) Amazon DynamoDB
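
A minimal in-memory sketch of the idea, built on a Python dict (a hash table). The `KeyValueStore` class and the key names are hypothetical; this is not the DynamoDB API.

```python
class KeyValueStore:
    """Tiny key-value store: unique string keys mapped to arbitrary values."""

    def __init__(self):
        self._items = {}

    def put(self, key, value):
        self._items[key] = value  # overwrites any existing value for the key

    def get(self, key, default=None):
        return self._items.get(key, default)

    def delete(self, key):
        self._items.pop(key, None)


store = KeyValueStore()
store.put("user:1", "Ada")                                       # simple value
store.put("cart:1", {"items": ["book", "pen"], "total": 12.5})   # compound value
cart_total = store.get("cart:1")["total"]
store.delete("user:1")
```

Every operation goes through the key, which is why lookups are simple and fast but queries on the contents of a value are not directly supported.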

Question 15

Q

Column-store DB

Answer

A

data is stored by column: the values of one field across all records are stored together in one long run, instead of record by record
increases speed of query to access elements
OLTP-suitable
eg) Cassandra
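
A small sketch contrasting a row layout with a column layout, assuming a hypothetical three-record sales table. It only illustrates the storage idea and is not how Cassandra is implemented.

```python
# Row layout: each record keeps all of its fields together.
rows = [
    {"id": 1, "city": "Oslo", "sales": 120},
    {"id": 2, "city": "Rome", "sales": 80},
    {"id": 3, "city": "Oslo", "sales": 50},
]

# Column layout: all values of one field are stored together.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# A query on one field reads a single list instead of every record.
total_sales = sum(columns["sales"])
```

Summing `sales` touches only the `sales` list, which is the reason queries over a few fields can be faster in a column layout.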

Question 16

Q

Document DB

Answer

A

stores data as documents in JSON, XML, PDF or Word format
JSON (JavaScript Object Notation): format that is easily human-readable and machine-readable and uses key-value pairs and arrays to store data
eg) MongoDB, Amazon DocumentDB, etc
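
A minimal sketch of a document collection, assuming Python's built-in json module. The documents and the `find` helper are hypothetical and do not use the MongoDB API; note that the two documents have different fields, since no schema is enforced.

```python
import json

# Two JSON documents with different structure in the same collection.
raw_docs = [
    '{"_id": 1, "name": "Ada", "skills": ["math", "code"]}',
    '{"_id": 2, "name": "Linus", "address": {"city": "Helsinki"}}',
]
collection = [json.loads(doc) for doc in raw_docs]


def find(collection, **criteria):
    """Return the documents whose top-level fields match all criteria."""
    return [
        doc
        for doc in collection
        if all(doc.get(field) == value for field, value in criteria.items())
    ]


matches = find(collection, name="Ada")
```

`matches` holds the single document for Ada, including its array of skills.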

Question 17

Q

Graph DB

Answer

A

consists of nodes and edges (relationship between nodes)
edges have properties and directional significance
good for understanding relationships and connections
eg) Neo4j
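
A minimal sketch of nodes and directed edges with properties, assuming a hypothetical three-person graph stored as a plain list of tuples. It does not use Neo4j or its query language.

```python
# Each edge: (source node, target node, properties of the relationship).
edges = [
    ("alice", "bob", {"type": "FOLLOWS", "since": 2020}),
    ("bob", "carol", {"type": "FOLLOWS", "since": 2021}),
    ("alice", "carol", {"type": "BLOCKS", "since": 2022}),
]


def neighbours(node, edge_type):
    """Nodes reachable from `node` over outgoing edges of the given type."""
    return [
        target
        for source, target, props in edges
        if source == node and props["type"] == edge_type
    ]


alice_follows = neighbours("alice", "FOLLOWS")
carol_follows = neighbours("carol", "FOLLOWS")  # direction matters
```

Carol is followed by Bob but follows nobody herself, so the second query returns an empty list.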

Question 18

Q

NewSQL

Answer

A

bridges gap between SQL and NoSQL
scalable performance of NoSQL (horizontal scaling) combined with OLTP-based transactions while keeping ACID properties
sometimes supports unstructured data
very fast performance for large data
minimal overhead (excess resources required for a task)
limited community support, though
eg) VoltDB, CockroachDB
perfect solution for organizations that handle large amount of high-profile data (require consistency and scalability)

Question 19

Q

Clustered Network Attached Storage (NAS)

Answer

A

file storage that allows users to collaborate and share data more efficiently
data is stored in clusters of centralized disks -> each centralized disk is called a network-attached storage
can be accessed via standard Ethernet connection if they are on a LAN (local area network)
cluster computing means that the system is distributed or parallel as multiple PCs work together as one resource
scale-out solution: flexible, fault tolerant
not good at storing smaller amounts of data

Question 20

Q

Clustered NAS vs HDFS

Answer

A

NAS is a file-level data storage server connected to a computer network, providing access to different groups of clients vs HDFS is a Java-based file system with scalable and reliable data storage designed to span large clusters of commodity hardware
data is stored on a dedicated server vs stored across local drives of machines in the cluster
high-end and expensive vs suitable for cost-effective commodity hardware
HDFS is more scalable and effective due to rack-awareness (the closest datanode will be read) and data locality (computation is brought to the node holding the data, instead of data being brought to the computation)

Question 21

Q

Cloud Computing

Answer

A

provides access to a shared pool of resources, such as computers, storage, applications and services, over a network (usually the internet)
on-demand aka utility computing rental services
accessible and affordable to enterprises of any size, since costs are lower for small firms
scalable, agile and reliable
depends on the network to access data, though
lower level of security ensured
no standardization
eg) AWS, Azure, Google Cloud

Question 22

Q

4 Cloud Deployment Models

Answer

A

Public cloud: providers like AWS and Azure (scalability and good for any size)
Private cloud: within company (very expensive, good for small projects)
Hybrid cloud: keeping data local but outsourcing computation and analysis
Multi-cloud: combination of multiple public, private or hybrid clouds (complex and challenging)

(22 cards)