Quiz 3 Flashcards
What is Cloud Storage?
Data storage delivered as a service by a cloud provider
What are the three things cloud providers support?
Scalability, Elasticity and Pay as you go
What are the three models of cloud storage?
File System
Blob/Object Storage
Databases
What is the cloud file system?
A system that organizes data into files and directories
What is a file/directory?
A file is a logical unit of data on a storage device
An array of bytes which can be created, read, written and deleted
What type of architecture do cloud file systems have?
Tree architecture
What is the AWS Elastic Block Store good at?
Managing data that is too big for a VM's memory, data processing frameworks that rely on local storage, and databases such as MySQL and MS SQL Server
What is the AWS Elastic Block Store bad at?
Works with EC2 only; no seamless scalability
What is the AWS Elastic File System good at?
It's a good replacement for NFS
What is the AWS Elastic File System bad at?
It's slow
What are the two storage types that Google Compute Cloud has?
Persistent Disks
Local SSD
Advantages of a cloud file system?
Familiarity
Many applications support file systems (without much modification)
Disadvantages of a cloud file system?
Limited scalability
Concurrency is generally supported, but it requires file-level synchronization
What does BLOB stand for?
Binary Large Object
What is BLOB or object storage?
A flat object model for storing data
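To make the "flat object model" concrete, here is a minimal in-memory sketch. The class name `FlatObjectStore` and its methods are hypothetical and do not correspond to any provider's API; the point is only that keys live in one flat namespace and "folders" are a naming convention resolved by prefix matching.

```python
class FlatObjectStore:
    """Toy in-memory object store: one flat namespace of key -> bytes."""

    def __init__(self):
        self._objects = {}

    def put(self, key, data):
        # The whole object is replaced; there is no partial in-place update.
        self._objects[key] = data

    def get(self, key):
        return self._objects[key]

    def list(self, prefix=""):
        # "Directories" are only a naming convention: listing is a prefix scan.
        return sorted(k for k in self._objects if k.startswith(prefix))


store = FlatObjectStore()
store.put("logs/2024/app.log", b"started")
store.put("logs/2024/db.log", b"ready")
store.put("images/cat.png", b"\x89PNG")

print(store.list("logs/"))  # ['logs/2024/app.log', 'logs/2024/db.log']
```

Because there is no real directory tree, anything beyond prefix listing (search by content, structured queries) needs a separate index.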
What are the features of BLOB storage?
Stores unstructured data
Highly scalable
Automatic backup/replica management
Blob/Object Storage Pros?
Simple, Performs well, Reliable, No modification needed, No file-level synchronization
Disadvantages of Blob/Object Storage?
Little support to organize data
No support for search by file content
Requires index mechanism
No mechanism to work with structured data
Cannot be mounted as a file system directly
How would you use Blob/Object Storage as a file system?
By using open-source projects that mount object storage as a file system
What are the two types of databases?
Relational databases and NoSQL databases
What are some features of relational databases?
Designed for structured data
Tables, SQL
Indexing and join operations
Supports ACID semantics
What are some features of NoSQL databases?
Cloud-scale databases that give up ACID semantics
Designed around the trade-offs of the CAP theorem
Eventual consistency
What are some relational databases?
AWS RDS
Azure databases
Google Cloud SQL
What are some types of NoSQL databases?
Key/Value Store
Document DB
Graph DB
In-Memory DB
Time-Series DB
What is big data?
A collection of data sets so large and complex that it becomes difficult to process using traditional relational database management systems
What are the three types of big data sets?
Structured Data
Semi-structured Data
Unstructured Data
What is Structured Data?
Data that can be represented in a table with a schema
What is Semi-Structured Data?
Data that cannot be stored in RDBMS but has organizational properties
What is Unstructured Data?
Data that is not organized in a pre-defined manner or does not have a pre-defined data model
What are the four V's of Big Data?
Volume
Variety
Velocity
Veracity
What is the major challenge of Big Data?
Processing
An iPhone 15 has how many times more computing power than the Beowulf-1?
20,000 times more
What is the magic infrastructure that allows map-reduce to work?
The Google File System
What are the disadvantages of divide and conquer with many machines?
Merging all of the results can be difficult
If the machines or disks fail there can be an issue
What are the cons of Map-Reduce?
Needs magic to address the failures
Performance may still be an issue
What is the Google File System?
A scalable, fault-tolerant distributed file system that stores hundreds of terabytes of data to support MapReduce
What is the workload for the GFS?
Large streaming reads
Small random reads
Many large sequential appends
No random writes that overwrite (update) data
What is the GFS Architecture?
A single master with multiple chunkservers, and multiple clients
What does the master maintain?
All metadata
What does the master’s metadata hold?
Namespace in GFS, Access control, Current location of chunks
Why does the master periodically communicate with the chunkservers?
To perform a health check
To determine chunk locations and evaluate the state of the overall system
What do GFS chunkservers do?
Manage chunks
How can chunkservers identify chunks?
Through immutable and globally unique chunk handles
What are the two types of requests sent by a GFS client?
Control requests to the master and data requests directly to the chunkservers
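The split between control requests and data requests can be sketched as follows. This is a toy model, not the GFS implementation: the classes `Master` and `ChunkServer`, the function `client_read`, and the handle and server names are all hypothetical, and the read is assumed to stay within a single chunk.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the default GFS chunk size


class Master:
    """Holds metadata only: file name -> list of (chunk handle, chunkserver)."""

    def __init__(self):
        self.chunk_table = {}

    def lookup(self, filename, chunk_index):
        # Control request: returns where the chunk lives, never the data itself.
        return self.chunk_table[filename][chunk_index]


class ChunkServer:
    """Stores chunk contents keyed by chunk handle."""

    def __init__(self):
        self.chunks = {}

    def read(self, handle, offset, length):
        # Data request: bytes flow directly between chunkserver and client.
        return self.chunks[handle][offset:offset + length]


def client_read(master, servers, filename, byte_offset, length):
    # Translate the file offset into (chunk index, offset inside that chunk).
    chunk_index, offset_in_chunk = divmod(byte_offset, CHUNK_SIZE)
    handle, server_name = master.lookup(filename, chunk_index)
    return servers[server_name].read(handle, offset_in_chunk, length)


master = Master()
servers = {"cs1": ChunkServer()}
servers["cs1"].chunks["handle-0001"] = b"hello gfs"
master.chunk_table["/data/file"] = [("handle-0001", "cs1")]

print(client_read(master, servers, "/data/file", 6, 3))  # b'gfs'
```

Keeping file data off the master is what lets a single master serve many clients: it only answers small metadata lookups.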
What is the default chunk size?
64 MB
What is the default chunk (block) size in Linux?
4 KB to 256 KB
What are the cons of a 64 MB chunk size?
Wasted storage space due to internal fragmentation
High overhead from many small files
What are the pros of a 64 MB chunk size?
A larger chunk size means a smaller number of chunks, so the master has less metadata to track
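The chunk-size trade-off is easy to check with arithmetic. The sketch below assumes, purely for illustration, that every chunk occupies its full size on disk; the helper names `chunk_count` and `wasted_bytes` are hypothetical.

```python
KB = 1024
MB = 1024 * KB
TB = 1024 * 1024 * MB


def chunk_count(file_size, chunk_size):
    # Ceiling division: a partially filled last chunk still counts as a chunk.
    return -(-file_size // chunk_size)


def wasted_bytes(file_size, chunk_size):
    # Internal fragmentation: allocated space minus the bytes actually used.
    return chunk_count(file_size, chunk_size) * chunk_size - file_size


# Pro: far fewer chunks (and metadata entries) for a large file.
print(chunk_count(1 * TB, 64 * MB))  # 16384
print(chunk_count(1 * TB, 4 * KB))   # 268435456

# Con: a tiny file still occupies a whole chunk.
print(wasted_bytes(1 * KB, 64 * MB))  # 67107840
print(wasted_bytes(1 * KB, 4 * KB))   # 3072
```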
What is a Borg Cell?
A set of machines managed by Borg as one unit
What is a Borg Job?
The unit in which users submit work to Borg
What is a task?
One of the units of work that make up a job
What is a Borg Alloc?
A reserved set of resources on a machine in which the tasks of a job can run
What is a Borg instance?
Instances having jobs
What is the Borgmaster?
The central brain of the system
Holds the cluster state
Uses Paxos for leader election and log replication
Uses Shared State Scheduling
What is a Borglet?
A unit that manages and monitors tasks and resources
What is a Borglet called in Kubernetes?
A Kubelet
What is the MapReduce Data Flow?
Read data from GFS -> mappers -> intermediate local files -> reducers -> write data to GFS
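The data flow above can be imitated on one machine with a word count. This is a minimal single-process sketch, not a distributed implementation: the input list stands in for data read from GFS, the `shuffle` step stands in for the intermediate local files, and all function names are hypothetical.

```python
from collections import defaultdict


def mapper(line):
    # Map: emit (word, 1) for every word in one input record.
    for word in line.split():
        yield word.lower(), 1


def shuffle(pairs):
    # Stand-in for intermediate files: group all values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(key, values):
    # Reduce: combine all values for one key into a single result.
    return key, sum(values)


def map_reduce(lines):
    intermediate = [pair for line in lines for pair in mapper(line)]
    grouped = shuffle(intermediate)
    return dict(reducer(key, values) for key, values in grouped.items())


input_lines = ["the quick brown fox", "the lazy dog", "The fox"]
word_counts = map_reduce(input_lines)
print(word_counts["the"], word_counts["fox"])  # 3 2
```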
What is the role of the Job Tracker in Hadoop?
Coordinates the execution of jobs
What is the role of the Task Tracker in Hadoop?
Controls the execution of map and reduce tasks in slave machines
What is the Name Node in Hadoop?
Manages the file system, keeps metadata
What is the Data Node in Hadoop?
Follows the instructions from the name node, stores, retrieves data
What happens if a task fails in Hadoop?
Task tracker detects the failure
Sends message to Job Tracker
Job Tracker reschedules the task
What happens if a data node fails in Hadoop?
Both the Name Node and the Job Tracker detect the failure
All tasks on the failed node are rescheduled
The Name Node replicates the affected data chunks to another Data Node
What are some benefits of Hadoop?
Highly Scalable
Fault Tolerant
Simple Programming Model
What are some limitations of Hadoop?
64MB block size
Batch processing only
Data Locality
What are some reasons database users do not like map reduce?
It's a giant leap backwards
Sub optimal implementation
Not novel
Missing most of the features in current databases
Incompatible with all of the tools that database users depend on
What are two add-ons for Hadoop?
Hive
HBase
What is the general workflow for Hive?
User sends a HiveQL query to the Hive system -> Hive parses the query and plans its execution -> Query is converted to MapReduce jobs and executed over data in HDFS
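To see why a SQL-like query can be turned into MapReduce, consider a simple aggregate. The sketch below is not Hive's planner; it only shows, under simplified assumptions, how a GROUP BY maps onto a map phase and a reduce phase. The table `access_log`, its columns, and the function names are made up for illustration.

```python
from collections import defaultdict

# Hypothetical query (table and column names are illustrative only):
#   SELECT page, COUNT(*) FROM access_log GROUP BY page;

access_log = [
    {"user": "u1", "page": "/home"},
    {"user": "u2", "page": "/home"},
    {"user": "u1", "page": "/cart"},
]


def map_phase(rows):
    # The GROUP BY column becomes the key emitted by the mapper.
    for row in rows:
        yield row["page"], 1


def reduce_phase(pairs):
    # COUNT(*) becomes a sum over the values collected for each key.
    counts = defaultdict(int)
    for page, one in pairs:
        counts[page] += one
    return dict(counts)


result = reduce_phase(map_phase(access_log))
print(result)  # {'/home': 2, '/cart': 1}
```

Because each query becomes one or more batch jobs, latency is high, which is why Hive suits analytical workloads rather than real-time queries.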
What are the pros of Hive?
Built on top of Hadoop
SQL-like batch jobs over large data sets
Supports a SQL-like language
Similar to RDBMS
Can handle much larger dataset than RDBMS
What are the cons of Hive?
It is designed for OLAP, not OLTP
No real time queries, Latency
Batch Jobs
It is not an RDBMS
What are the two categories of large-scale data?
Web search data
Web access data
What type of large-scale data is important for understanding users' behavior?
Web access logs
What is BigTable or HBase?
A sparse, distributed, persistent, multidimensional sorted map
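The "sparse, multidimensional sorted map" part of this definition can be illustrated with a small in-memory model that maps (row, column, timestamp) to a value. The distributed and persistent aspects are not modeled, and the class `SortedSparseTable`, its methods, and the sample row and column names are hypothetical rather than the BigTable or HBase API.

```python
import bisect


class SortedSparseTable:
    """Toy model: (row, column, timestamp) -> value, with rows kept sorted."""

    def __init__(self):
        self._cells = {}     # only cells that were written exist (sparse)
        self._row_keys = []  # distinct row keys in sorted order

    def put(self, row, column, timestamp, value):
        i = bisect.bisect_left(self._row_keys, row)
        if i == len(self._row_keys) or self._row_keys[i] != row:
            self._row_keys.insert(i, row)
        self._cells[(row, column, timestamp)] = value

    def get(self, row, column):
        # Return the value with the newest timestamp, or None if never written.
        versions = [(ts, v) for (r, c, ts), v in self._cells.items()
                    if r == row and c == column]
        return max(versions, key=lambda tv: tv[0])[1] if versions else None

    def scan(self, start_row, end_row):
        # Sorted row keys make range scans a simple slice [start_row, end_row).
        lo = bisect.bisect_left(self._row_keys, start_row)
        hi = bisect.bisect_left(self._row_keys, end_row)
        return self._row_keys[lo:hi]


table = SortedSparseTable()
table.put("org.apache.www", "contents:", 1, "<html>apache")
table.put("com.cnn.www", "contents:", 1, "<html>v1")
table.put("com.cnn.www", "contents:", 2, "<html>v2")
table.put("com.example.www", "contents:", 1, "<html>example")

print(table.get("com.cnn.www", "contents:"))  # <html>v2
print(table.get("com.cnn.www", "anchor:"))    # None
print(table.scan("com.", "com/"))             # ['com.cnn.www', 'com.example.www']
```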
What are the three components in HBase Architecture?
Master
Region Server
Zookeeper