Quiz 3 Flashcards by James Fullwood

What is Cloud Storage?

Data storage in clouds

What are the three things cloud providers support?

Scalability, Elasticity and Pay as you go

What are the three models of cloud storage?

File System
Blob/Object Storage
Databases

What is the cloud file system?

A system that organizes data into files and directories

What is a file/directory?

A file is a logical unit of data on a storage device
An array of bytes which can be created, read, written and deleted
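
The four operations named above (create, read, write, delete) can be tried on an ordinary local file. This is only a minimal sketch; the scratch directory and the file name `example.bin` are made up for illustration.

```python
import os
import tempfile

# Hypothetical scratch directory and file name, used only for this illustration.
with tempfile.TemporaryDirectory() as scratch_dir:
    path = os.path.join(scratch_dir, "example.bin")

    # Create and write: the file is just a sequence of bytes.
    with open(path, "wb") as f:
        f.write(b"hello cloud")

    # Read the bytes back.
    with open(path, "rb") as f:
        data = f.read()

    # Delete the file and check that it is gone.
    os.remove(path)
    still_exists = os.path.exists(path)
```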

What type of architecture do cloud file systems have?

Tree architecture

What is the AWS Elastic Block Store good at?

Managing data that is too big for a VM's memory, data processing frameworks that rely on local storage, and databases such as MySQL and MS SQL Server

What is the AWS Elastic Block Store bad at?

Works with EC2 only; no seamless scalability

What is the AWS Elastic File System good at?

It's a good replacement for NFS

What is the AWS Elastic File System bad at?

It's slow

What are the two storage types that Google Compute Engine has?

Persistent Disks
Local SSD

Advantages of a cloud file system?

Familiarity
Many applications support file systems (without much modification)

Disadvantages of a cloud file system?

Limited scalability
Generally limited support for concurrency

What does BLOB stand for?

Binary Large Object

What is BLOB or object storage?

A flat object model for storing data
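
To make "flat" concrete, here is a toy in-memory sketch. `FlatObjectStore` and its methods are hypothetical names, not any provider's API. Every object lives under a single key, and anything that looks like a folder is only a key prefix.

```python
class FlatObjectStore:
    """Toy in-memory stand-in for a blob store (hypothetical, not a real cloud API)."""

    def __init__(self):
        self._objects = {}  # one flat key -> bytes mapping, no directories

    def put(self, key, data):
        self._objects[key] = data

    def get(self, key):
        return self._objects[key]

    def list_keys(self, prefix=""):
        # "Folders" are only a naming convention: filter the flat keys by prefix.
        return sorted(k for k in self._objects if k.startswith(prefix))


store = FlatObjectStore()
store.put("photos/2024/cat.jpg", b"\xff\xd8")
store.put("photos/2024/dog.jpg", b"\xff\xd8")
store.put("logs/app.log", b"started")
photo_keys = store.list_keys("photos/")
```

Listing by prefix scans every key in this sketch, which hints at why a real deployment needs a separate index to find objects efficiently.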

What are the features of BLOB storage?

Stores unstructured data
Highly scalable
Automatic backup/replica management

Blob/Object Storage Pros?

Simple, Performs well, Reliable, No modification needed, No file-level synchronization

Disadvantages of Blob/Object Storage?

Little support to organize data
No support for search by file content
Requires index mechanism
No mechanism to work with structured data
Cannot be mounted as a file system directly

If you wanted to use a Blob/Object Storage for a file system how would you do this?

By using open-source projects

What are the two types of databases?

Relational databases and NoSQL databases

What are some features of relational databases?

Designed for structured data
Tables and SQL
Indexing and join operations
Supports ACID semantics
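
As a small illustration of the "A" (atomicity) in ACID, the sketch below uses Python's built-in `sqlite3` module. The `accounts` table and the names in it are invented for the example. A transfer that violates a constraint halfway through is rolled back as a whole.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()

try:
    with conn:  # commits on success, rolls back if an exception escapes
        conn.execute("UPDATE accounts SET balance = balance + 200 WHERE name = 'bob'")
        # This violates the CHECK constraint, so the whole transfer is undone.
        conn.execute("UPDATE accounts SET balance = balance - 200 WHERE name = 'alice'")
except sqlite3.IntegrityError:
    pass

# Both balances are unchanged: neither half of the transfer was applied.
balances = dict(conn.execute("SELECT name, balance FROM accounts ORDER BY name"))
```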

What are some features of NoSQL databases?

Cloud-scale databases that give up ACID semantics
Design trade-offs are governed by the CAP theorem
Eventual consistency

What are some relational databases?

AWS RDS
Azure databases
Google Cloud SQL

What are some NoSQL databases?

Key/Value Store
Document DB
Graph DB
In-Memory DB
Time-Series DB

What is big data?

A collection of data sets so large and complex that it becomes difficult to process using traditional relational database management systems

What are the three types of big data sets?

Structured Data
Semi-structured Data
Unstructured Data

What is Structured Data?

Data that can be represented in a table with a schema

What is Semi-Structured Data?

Data that cannot be stored in an RDBMS but has organizational properties

What is Unstructured Data?

Data that is not organized in a pre-defined manner or does not have a pre-defined data model

What are the four V's of Big Data?

Volume
Variety
Velocity
Veracity

What is the major challenge of Big Data?

Processing

An iPhone 15 has how many times more computing power than the Beowulf-1?

20000 times more

What is the magic infrastructure that allows map-reduce to work?

The Google File System

What are the disadvantages of divide and conquer with many machines?

Merging all of the results can be difficult
If the machines or disks fail there can be an issue

What are the cons of MapReduce?

Needs "magic" to address failures
Performance may still be an issue

What is the Google File System?

A scalable, fault-tolerant distributed file system that stores hundreds of terabytes of data at scale to support MapReduce

What is the workload for the GFS?

Large streaming reads
Small random reads
Many large sequential appends
No random writes that overwrite (update) data

What is the GFS Architecture?

A single master with multiple chunkservers, and multiple clients

What does the master maintain?

All metadata

What does the master's metadata hold?

Namespace in GFS, Access control, Current location of chunks

Why does the master periodically communicate with the chunkservers?

To perform a health check
To determine chunk locations and evaluate the state of the overall system

What do GFS chunkservers do?

Manage chunks

How can chunkservers identify chunks?

Through immutable and globally unique chunk handles

What are the two kinds of requests sent by the GFS client?

Control requests to the master and data requests directly to chunkservers
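
The split between control and data requests can be sketched with two toy classes. Everything here is a simplification with hypothetical names (`Master`, `ChunkServer`, `client_read`). The chunk size is shrunk to 8 bytes so the example is easy to follow, and reads that cross a chunk boundary are not handled.

```python
CHUNK_SIZE = 8  # tiny value for the demo only; real GFS chunks are far larger


class ChunkServer:
    def __init__(self):
        self.chunks = {}  # chunk handle -> bytes

    def read(self, handle, offset, length):
        return self.chunks[handle][offset:offset + length]


class Master:
    """Holds metadata only: file name -> list of (chunk handle, replica servers)."""

    def __init__(self):
        self.files = {}

    def lookup(self, filename, chunk_index):
        return self.files[filename][chunk_index]


def client_read(master, filename, offset, length):
    # Control request: ask the master which chunk holds this offset and where it lives.
    chunk_index = offset // CHUNK_SIZE
    handle, replicas = master.lookup(filename, chunk_index)
    # Data request: read the bytes directly from a chunkserver, bypassing the master.
    return replicas[0].read(handle, offset % CHUNK_SIZE, length)


server = ChunkServer()
server.chunks["h0"] = b"ABCDEFGH"
server.chunks["h1"] = b"IJKLMNOP"
master = Master()
master.files["/logs/a"] = [("h0", [server]), ("h1", [server])]
result = client_read(master, "/logs/a", offset=10, length=3)
```

Because the master only answers the small metadata lookup, the bulk data never flows through it, which is what keeps a single master from becoming the bottleneck.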

What is the default chunk size?

64 MB

What is the default chunk size in Linux?

4 KB to 256 KB

What are the cons of having a 64 MB chunk size?

Wasted storage space due to internal fragmentation
High overhead from many small files

What are the pros of having a 64 MB chunk size?

A larger chunk size means a smaller number of chunks
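
The trade-off can be checked with a little arithmetic. The helper `chunk_stats` below is a hypothetical name, and it assumes, as a simplification, that every chunk occupies its full size on disk.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB


def chunk_stats(file_size_bytes, chunk_size=CHUNK_SIZE):
    """Return (number of chunks, unused bytes in the last chunk).

    Assumes each chunk is allocated at its full size, which is a simplification.
    """
    # Integer ceiling division; even a tiny file needs one chunk.
    num_chunks = max(1, -(-file_size_bytes // chunk_size))
    unused = num_chunks * chunk_size - file_size_bytes
    return num_chunks, unused


one_gb = chunk_stats(1024 ** 3)            # a 1 GB file with 64 MB chunks
small = chunk_stats(1024)                  # a 1 KB file with 64 MB chunks
chunks_at_4kb = chunk_stats(1024 ** 3, chunk_size=4 * 1024)[0]
```

Under these assumptions a 1 GB file needs 16 chunks at 64 MB but 262,144 at 4 KB, while a 1 KB file leaves almost an entire 64 MB chunk unused.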

What is a Borg Cell?

A set of machines managed by Borg as one unit

What is a Borg Job?

The form that users submit work in

What is a task?

The things that jobs do

What is a Borg Alloc?

A reserved set of resources on a machine in which one or more tasks can run

What is a Borg instance?

Instances having jobs

What is a Borgmaster?

The central brain of the system
Holds the cluster state
Uses Paxos for leader election and log replication
Uses shared-state scheduling

What is a Borglet?

A unit that manages and monitors tasks and resources

What is a Borglet called in Kubernetes?

A Kubelet

What is the MapReduce Data Flow?

Read data from GFS -> mappers -> intermediate local files -> reducers -> write data to GFS
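
The same flow can be imitated in a single process with a word count. This is only a sketch: the function names are made up, in-memory lists stand in for GFS and the intermediate local files, and nothing runs in parallel.

```python
from collections import defaultdict


def map_phase(document):
    # Mapper: emit a (word, 1) pair for every word in one input split.
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    # Group intermediate values by key, as the framework does between map and reduce.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    # Reducer: combine all values that share a key.
    return key, sum(values)


splits = ["the quick brown fox", "the lazy dog", "The fox"]
intermediate = [pair for split in splits for pair in map_phase(split)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())
```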

What is the role of the Job Tracker in Hadoop?

Coordinates the execution of jobs

What is the role of the Task Tracker in Hadoop?

Controls the execution of map and reduce tasks in slave machines

What is the Name Node in Hadoop?

Manages the file system, keeps metadata

What is the Data Node in Hadoop?

Follows the instructions from the name node, stores, retrieves data

What happens if a task fails in Hadoop?

The Task Tracker detects the failure
It sends a message to the Job Tracker
The Job Tracker reschedules the task

What happens if a data node fails in Hadoop?

Both the Name Node and the Job Tracker detect the failure
All tasks on the failed node are rescheduled
The Name Node re-replicates the affected data chunks to another node
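
The re-replication step can be sketched as follows. The function name, node names, and block IDs are hypothetical, and a target of three replicas is assumed. A real Name Node also weighs rack placement and load, which this toy version ignores.

```python
REPLICATION_FACTOR = 3  # assumed target number of replicas per block


def handle_node_failure(block_locations, failed_node, live_nodes):
    """Drop the failed node and copy each under-replicated block to a live node."""
    for holders in block_locations.values():
        holders.discard(failed_node)
        # Nodes that are alive and do not yet hold this block, in a stable order.
        candidates = sorted(live_nodes - holders)
        while len(holders) < REPLICATION_FACTOR and candidates:
            holders.add(candidates.pop(0))
    return block_locations


locations = {"blk_1": {"n1", "n2", "n3"}, "blk_2": {"n2", "n3", "n4"}}
updated = handle_node_failure(locations, "n2", live_nodes={"n1", "n3", "n4", "n5"})
```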

What are some benefits of Hadoop?

Highly Scalable
Fault Tolerant
Simple Programming Model

What are some limitations of Hadoop?

64 MB block size
Batch processing only
Data Locality

What are some reasons database users do not like MapReduce?

It's a giant leap backwards
Sub-optimal implementation
Not novel
Missing most of the features in current databases
Incompatible with all of the tools

What are two add-ons for Hadoop?

Hive
HBase

What is the general workflow for Hive?

User sends a HiveQL query to the Hive system -> Hive parses the query and plans its execution -> Query is converted to MapReduce and executed on HDFS

What are the pros of Hive?

Built on top of Hadoop
SQL-like batch jobs over large data sets
Supports a SQL-like language
Similar to an RDBMS
Can handle much larger data sets than an RDBMS

What are the cons of Hive?

It's designed for OLAP, not OLTP
No real-time queries (high latency)
Runs as batch jobs
It is not an RDBMS

What are the two categories of large-scale data?

Web search data
Web access data

What type of large-scale data is important for understanding user behavior?

Web access logs

What is BigTable or HBase?

A sparse, distributed, persistent, multidimensional sorted map
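
Setting aside "distributed" and "persistent", the remaining words describe a data layout that a toy class can mimic. `TinyTable` is a hypothetical name and the row and column keys are illustrative. Cells are keyed by (row, column, timestamp), only cells that exist are stored, and rows are scanned in sorted key order.

```python
class TinyTable:
    """Toy model of the data layout only; persistence and distribution are omitted."""

    def __init__(self):
        # (row, column, timestamp) -> value; absent cells take no space (sparse).
        self._cells = {}

    def put(self, row, column, timestamp, value):
        self._cells[(row, column, timestamp)] = value

    def get_latest(self, row, column):
        versions = [
            (ts, v) for (r, c, ts), v in self._cells.items() if r == row and c == column
        ]
        return max(versions)[1] if versions else None

    def scan_rows(self, start_row, end_row):
        # Rows come back in sorted key order, which makes range scans natural.
        return sorted({r for (r, _, _) in self._cells if start_row <= r < end_row})


table = TinyTable()
table.put("com.cnn.www", "contents:", 1, "<html>v1")
table.put("com.cnn.www", "contents:", 2, "<html>v2")
table.put("com.cnn.www", "anchor:cnnsi.com", 1, "CNN")
table.put("org.example", "contents:", 1, "<html>")
latest = table.get_latest("com.cnn.www", "contents:")
rows = table.scan_rows("com", "org")
```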

What are the three components in HBase Architecture?

Master
Region Server
ZooKeeper

Quiz 3 Flashcards

(75 cards)