L7 Flashcards

1
Q

traditional ways of data storage

A
  • structured, relational data
  • rows and columns
  • very expensive when data is high in volume, velocity, etc
  • time consuming
  • not scalable
  • rigid checks run whenever data is inserted or updated, which is a problem for data arriving at high velocity
    eg) RDBMS, data warehouses/marts, SQL databases
    -> only about 20% of data nowadays is traditional structured data
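
The rigid check on insert can be illustrated with a minimal sketch using Python's built-in sqlite3 module. The `readings` table, its columns and the `try_insert` helper are hypothetical names chosen for illustration only.

```python
import sqlite3

# In-memory relational database with a rigid, pre-defined schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE readings ("
    " sensor_id INTEGER NOT NULL,"
    " temperature REAL NOT NULL CHECK (temperature > -273.15)"
    ")"
)

def try_insert(row):
    """Return True if the row passes the schema checks, False otherwise."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO readings VALUES (?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False

accepted = try_insert((1, 21.5))      # valid row
rejected = try_insert((2, None))      # violates NOT NULL
impossible = try_insert((3, -500.0))  # violates the CHECK constraint
row_count = conn.execute("SELECT COUNT(*) FROM readings").fetchone()[0]
```

Every insert pays for these checks, which is part of why this style of storage struggles with high-velocity data.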
2
Q

RDBMS vs non-RDBMS

data
schema
scalability
language
transaction
best for
examples

A
  1. data: stored in tables vs stored according to different models (key-value, wide-column store, multi-model, etc)
  2. schema: rigid pre-defined data schema vs dynamic data schema
  3. scalability: vertically scalable vs horizontally scalable
  4. language: Structured Query Language (SQL) vs database-specific query languages or APIs
  5. transaction: ACID-compliant vs bound by CAP theorem trade-offs, possibly ACID-compliant
  6. best for: transactions and routine data analysis through complex queries vs storing and modelling all types of data
    RDBMS ex: Amazon Redshift, Oracle Database, MySQL
    non-RDBMS ex: Amazon DynamoDB, Oracle NoSQL Database
3
Q

ACID compliant

A

set of properties that guarantee data validity in an RDBMS despite errors, etc
1. atomicity: either all of a transaction's changes are applied or none are
2. consistency: prevents transactions from violating integrity constraints or database rules
3. isolation: concurrent transactions don’t interfere with each other
4. durability: finalized transactions are permanent and will survive system failures
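
Atomicity and consistency can be seen in a small sketch with Python's built-in sqlite3 module. The `accounts` table and the `transfer` and `balances` helpers are hypothetical; the credit is deliberately applied before the debit so that a failed debit leaves a partial change to roll back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    " name TEXT PRIMARY KEY,"
    " balance INTEGER NOT NULL CHECK (balance >= 0)"  # integrity constraint
    ")"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

def transfer(src, dst, amount):
    """Move money in one transaction: both updates apply, or neither does."""
    try:
        with conn:  # commit on success, roll back on any exception
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False  # the credit above has been rolled back as well

def balances():
    return dict(conn.execute("SELECT name, balance FROM accounts"))

ok = transfer("alice", "bob", 30)       # succeeds
failed = transfer("alice", "bob", 500)  # would overdraw alice, so nothing changes
```

After the failed transfer, `balances()` still shows the state left by the first one: the credit to bob was undone together with the rejected debit.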

4
Q

CAP theorem

A

any distributed data store can guarantee only two of the following three properties:
1. consistency: all users see the same view of the data
2. availability: every request gets a response, eg from a replica in case of node failure
3. partition tolerance: the system keeps working even in the presence of network failure
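
The trade-off can be sketched with a toy two-replica store. This is a hypothetical model, not how any real database is implemented: during a partition, a "CP" store refuses writes to stay consistent, while an "AP" store accepts them and lets the replicas diverge.

```python
class TinyStore:
    """Toy store with two replicas; `mode` is "CP" or "AP" (illustrative only)."""

    def __init__(self, mode):
        self.mode = mode
        self.replicas = [None, None]  # value held by each replica
        self.partitioned = False      # True when the replicas cannot talk

    def write(self, replica_index, value):
        if not self.partitioned:
            self.replicas = [value, value]  # both replicas updated together
            return True
        if self.mode == "CP":
            return False  # refuse the write: consistent but not available
        self.replicas[replica_index] = value  # AP: accept locally, replicas diverge
        return True

    def read(self, replica_index):
        return self.replicas[replica_index]

cp, ap = TinyStore("CP"), TinyStore("AP")
for store in (cp, ap):
    store.write(0, "v1")
    store.partitioned = True  # simulate a network failure

cp_accepted = cp.write(0, "v2")
ap_accepted = ap.write(0, "v2")
```

Both stores tolerate the partition, but each gives up one of the other two properties while it lasts.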

5
Q

definition of big data (BD) storage

A

infrastructure designed specifically to store, manage and retrieve massive amounts of data;
compute-and-storage architecture that collects and manages large datasets and enables real-time data analytics

6
Q

5 methods to store BD

A
  1. Hadoop Distributed File System (HDFS)
  2. NoSQL
  3. NewSQL
  4. Clustered Network Attached Storage (NAS)
  5. Cloud Computing
7
Q

HDFS

A
  • one of the components of Apache Hadoop
  • designed to run on commodity hardware
  • an HDFS cluster is made up of many racks, each a collection of datanodes (where data is stored), plus one namenode (where indices and block names are stored)
  • divides data into blocks (eg. of 128MB) and distributes them across many datanodes (computers)
  • each block is replicated in more than one node -> fault-tolerance
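
The block splitting and replication described above can be sketched as follows. The replication factor of 3, the datanode names and the round-robin placement are assumptions for illustration; real HDFS placement is rack-aware and decided by the namenode.

```python
BLOCK_SIZE_MB = 128  # block size from the card
REPLICATION = 3      # assumed replication factor

def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the size in MB of each block; the last block may be smaller."""
    full, remainder = divmod(file_size_mb, block_size_mb)
    return [block_size_mb] * full + ([remainder] if remainder else [])

def place_replicas(n_blocks, datanodes, replication=REPLICATION):
    """Simplified round-robin placement on `replication` distinct datanodes."""
    if replication > len(datanodes):
        raise ValueError("need at least as many datanodes as replicas")
    return {
        block: [datanodes[(block + k) % len(datanodes)] for k in range(replication)]
        for block in range(n_blocks)
    }

def readable_blocks(placement, failed_node):
    """Blocks that still have at least one replica on a healthy datanode."""
    return [b for b, nodes in placement.items() if any(n != failed_node for n in nodes)]

blocks = split_into_blocks(300)  # a 300 MB file
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
still_readable = readable_blocks(placement, failed_node="dn2")
```

Because every block lives on several datanodes, losing one datanode leaves all blocks readable.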
8
Q

HDFS Advantages

A

Pros:
- no single point of failure for stored data due to block replication -> fault tolerance
- possible to communicate with closest available node to reduce latency and network traffic
- good at handling extremely large files
- streaming data access (data is read at constant velocity, not in batches) -> read-intensive
- commodity hardware

9
Q

NoSQL database

what is it
features

A
  • Not Only SQL
  • designed to meet big data processing demands
  • supports structured, semi-structured and unstructured data
  • better for lots of unstructured data
  • CAP theorem
  • BASE model
  • Schemaless
  • Horizontal scalability
10
Q

BASE model

A

Basically Available: always available despite network failure
Soft state: node information could be temporarily inconsistent
Eventual consistency: changes will eventually be propagated to all nodes
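
Soft state and eventual consistency can be sketched with a toy replicated store. The class and its `sync` method are hypothetical; real systems propagate updates in the background rather than on an explicit call.

```python
from collections import deque

class EventuallyConsistentStore:
    """Toy store: a write lands on one replica and reaches the others later."""

    def __init__(self, n_replicas):
        self.replicas = [{} for _ in range(n_replicas)]
        self.pending = deque()  # updates not yet applied to every replica

    def write(self, key, value, replica=0):
        self.replicas[replica][key] = value  # available immediately on one node
        for i in range(len(self.replicas)):
            if i != replica:
                self.pending.append((i, key, value))

    def read(self, key, replica):
        return self.replicas[replica].get(key)

    def sync(self):
        """Apply all pending updates, standing in for background propagation."""
        while self.pending:
            i, key, value = self.pending.popleft()
            self.replicas[i][key] = value

store = EventuallyConsistentStore(3)
store.write("x", 1)
stale = store.read("x", replica=2)  # soft state: replica 2 has not seen the write
store.sync()
fresh = store.read("x", replica=2)  # eventual consistency: now it has
```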

11
Q

Schemaless

A
  • doesn’t require pre-defined structure to store data
  • data can be in any format (structured or unstructured)
  • no prior knowledge necessary about data
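
As a minimal sketch, a plain Python list of dictionaries can stand in for a schemaless collection. The records and field names are made up for illustration.

```python
collection = []  # stands in for a schemaless collection

# Each record brings its own structure; nothing is declared up front.
collection.append({"id": 1, "name": "Ada"})
collection.append({"id": 2, "name": "Lin", "tags": ["admin"], "address": {"city": "Oslo"}})
collection.append({"id": 3, "raw_text": "free-form note"})

# The set of fields is only discovered by looking at the stored data.
fields_seen = sorted({field for record in collection for field in record})
```

All three inserts succeed even though no two records share the same set of fields.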
12
Q

Horizontal Scalability

A
  • scale-out (add more instances) vs scale-up (add more RAM, CPU)
  • sharding (distribution of datasets into smaller chunks) and replication (for fault tolerance)
  • capable of growing dynamically
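
Sharding can be sketched as hashing each key to pick an instance. The helper names and the `user…` keys are hypothetical, and the simple modulo scheme is an assumption chosen for brevity.

```python
import hashlib

def shard_for(key, n_shards):
    """Map a key to a shard number using a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_shards

def distribute(keys, n_shards):
    """Group keys by the shard (instance) that would store them."""
    shards = {i: [] for i in range(n_shards)}
    for key in keys:
        shards[shard_for(key, n_shards)].append(key)
    return shards

keys = [f"user{i}" for i in range(1000)]
three_nodes = distribute(keys, 3)
four_nodes = distribute(keys, 4)  # scale out: add one more instance
```

With modulo hashing, adding an instance moves many keys to a different shard; schemes such as consistent hashing are commonly used to limit that movement.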
13
Q

4 types of NoSQL

A
  1. key-value store db
  2. column-store db
  3. document db
  4. graph db
14
Q

Key-value store DB

A
  • simplest and most efficient
  • stores data as key-value pairs where key is unique identifier
  • can be simple object or complex compound objects
  • key is string and value is data
  • similar to hash tables
  • OLAP suitable (cubes of multidimensional data)
    eg) Amazon DynamoDB
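
The hash-table analogy can be made concrete with a minimal in-memory sketch. `KeyValueStore` is a hypothetical class and does not reflect the API of DynamoDB or any other product.

```python
class KeyValueStore:
    """Toy key-value store backed by a Python dict (a hash table)."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        if not isinstance(key, str):
            raise TypeError("keys are strings in this sketch")
        self._data[key] = value  # the value can be any object

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)

store = KeyValueStore()
store.put("session:42", "logged-in")                          # simple value
store.put("user:7", {"name": "Ada", "roles": ["admin"]})      # compound value
user = store.get("user:7")
missing = store.get("user:8")
```

Lookups go straight from the unique key to the value, which is what makes this model simple and fast.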
15
Q

Column-store DB

A
  • fields of each record are stored in long rows organized by column
  • increases speed of query to access elements
  • OLTP-suitable
    eg) Cassandra
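
One way to picture column-wise storage is to regroup row records by field. This is a simplified sketch with made-up data that assumes every record has the same fields; it is not how Cassandra lays out data internally.

```python
rows = [
    {"id": 1, "city": "Oslo", "sales": 120},
    {"id": 2, "city": "Lima", "sales": 80},
    {"id": 3, "city": "Oslo", "sales": 200},
]

def to_columns(records):
    """Regroup row-oriented records so each field's values sit together."""
    columns = {}
    for record in records:
        for field, value in record.items():
            columns.setdefault(field, []).append(value)
    return columns

columns = to_columns(rows)

# A query on one field reads a single contiguous list instead of every record.
total_sales = sum(columns["sales"])
```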
16
Q

Document DB

A
  • stores data as documents in formats such as JSON, XML, PDF or Word
  • JSON (JavaScript Object Notation): format that is easily human-readable and machine-readable and uses key-value pairs and arrays to store data
    eg) MongoDB, Amazon DocumentDB, etc
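
A small sketch of a JSON document, using Python's built-in json module. The order document and its field names are invented for illustration.

```python
import json

# A document: nested key-value pairs and arrays describing one order.
document = {
    "_id": "order-1001",
    "customer": {"name": "Ada", "email": "ada@example.com"},
    "items": [
        {"sku": "A1", "qty": 2},
        {"sku": "B7", "qty": 1},
    ],
}

encoded = json.dumps(document, indent=2)  # human-readable text form
decoded = json.loads(encoded)             # parsed back into Python objects
total_qty = sum(item["qty"] for item in decoded["items"])
```

The whole order, including its nested items, travels as one self-describing document.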
17
Q

Graph DB

A
  • consists of nodes and edges (relationship between nodes)
  • edges have properties and directional significance
  • good for understanding relationships and connections
    eg) Neo4j
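
Nodes, directed edges and edge properties can be sketched with plain Python tuples. The people, relationship names and helper functions are hypothetical and unrelated to Neo4j's query language.

```python
from collections import deque

# Each edge: (source node, relationship, target node, edge properties).
edges = [
    ("alice", "FOLLOWS", "bob", {"since": 2020}),
    ("bob", "FOLLOWS", "carol", {"since": 2021}),
    ("alice", "WORKS_WITH", "dave", {"team": "data"}),
]

def neighbours(node, relation=None):
    """Nodes reached by one outgoing edge, optionally filtered by relationship."""
    return [
        dst for src, rel, dst, _ in edges
        if src == node and (relation is None or rel == relation)
    ]

def reachable(start, relation=None):
    """All nodes reachable from `start` by following edge directions (BFS)."""
    seen, queue = {start}, deque([start])
    while queue:
        for nxt in neighbours(queue.popleft(), relation):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    seen.discard(start)
    return seen

followed = reachable("alice", "FOLLOWS")
from_carol = reachable("carol")  # edges are directed, so nothing leads out
```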
18
Q

NewSQL

A
  • bridges gap between SQL and NoSQL
  • scalable performance of NoSQL (horizontal scaling) combined with OLTP-based transactions while keeping ACID properties
  • sometimes support for unstructured data
  • very fast performance for large data
  • minimal overhead (excess resources required for a task)
  • low support from community though
  • perfect solution for organizations that handle large amounts of high-profile data (require consistency and scalability)
    eg) VoltDB, CockroachDB
19
Q

Clustered Network Attached Storage (NAS)

A
  • file storage that allows users to collaborate and share data more efficiently
  • data is stored in clusters of centralized disks -> each centralized disk is called a network-attached storage
  • can be accessed via standard Ethernet connection if they are on a LAN (local area network)
  • cluster computing means that the system is distributed or parallel as multiple PCs work together as one resource
  • scale-out solution: flexible, fault tolerant
  • not good at storing smaller amounts of data
20
Q

Clustered NAS vs HDFS

A
  1. clustered NAS is a file-level data storage server connected to a computer network, providing access to different groups of clients vs HDFS is a Java-based file system with scalable and reliable data storage designed to span large clusters of commodity hardware
  2. data is stored on dedicated server vs stored across the local drives of the machines in the cluster
  3. high-end and expensive vs suitable for cost-effective commodity hardware
  4. HDFS is more scalable and effective due to rack-awareness (closest datanode will be read) and data locality (computation brought to the data, instead of data brought to the computation)
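
Rack-awareness can be sketched as choosing the replica with the smallest network distance to the reader. The distance values (0 for the same node, 2 for the same rack, 4 for another rack) and the rack and node names are assumptions for illustration.

```python
def distance(a, b):
    """Network distance between two (rack, node) locations (assumed values)."""
    if a == b:
        return 0  # same node
    if a[0] == b[0]:
        return 2  # same rack, different node
    return 4      # different rack

def closest_replica(reader, replicas):
    """Pick the replica location nearest to the reader."""
    return min(replicas, key=lambda location: distance(reader, location))

reader = ("rack1", "node3")
replicas = [("rack2", "node7"), ("rack1", "node1"), ("rack2", "node8")]
chosen = closest_replica(reader, replicas)
```

Reading from the same rack keeps traffic off the links between racks, which is the latency and bandwidth saving the card refers to.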
21
Q

Cloud Computing

A
  • provides access to shared pool of resources, such as computers, storage, applications and service, over a network (usually internet)
  • on-demand aka utility computing rental services
  • accessible and affordable for enterprises of any size, with low costs for small firms
  • scalable, agile and reliable
  • dependence on network to access data though
  • lower level of security guaranteed
  • no standardization
    eg) AWS, Azure, Google Cloud
22
Q

4 Cloud Deployment Models

A

Public cloud: providers like AWS and Azure (scalability and good for any size)
Private cloud: within company (very expensive, good for small projects)
Hybrid cloud: keeping data local but outsourcing computation and analysis
Multi-cloud: combination of multiple public, private or hybrid clouds (complex and challenging)