Final Exam Flashcards
This is it
What is Kubernetes?
An open-source container orchestration system inspired by Google’s Borg
What is Kubernetes’s job?
To manage container clusters
Can Kubernetes support multiple infrastructures?
Yes
Can Kubernetes support multiple containers running?
Yes
What are the four basic objects in Kubernetes?
Pod
Volume
Service
Namespace
What is the main role of a Pod in Kubernetes?
Basic deployment unit
What is the main role of a Volume in Kubernetes?
Persistent storage
What is the main role of a Service in Kubernetes?
A logical group of pods that work together, exposed through one stable network endpoint
What is the main role of a Namespace in Kubernetes?
Logical slices of the Kubernetes cluster
Does a pod contain one or many (different) containers?
A pod can have one or many containers
What are some interesting features of a pod?
Co-scheduling
Localhost
Persistent Storage
What are the three multi-container models in Pod?
Sidecar, Ambassador, and Adapter
What is pod co-scheduling in Kubernetes?
Containers in a pod must be scheduled together on the same node
What is the sidecar’s role?
To be a helper
What is the ambassador’s role?
To be a proxy
What is the adapter’s role?
To be a common output interface
What is etcd’s responsibility in Kubernetes?
To store all cluster state as key/value pairs; it is the primary target of backups
Why is etcd important for failure recovery?
The cluster can be restored with etcd
What are controllers in Kubernetes?
Control loops that create and manage the four basic objects
What is autoscaling in Kubernetes?
Automatically increasing and decreasing the resources of the cluster and its pods to match the load
What is Prometheus in Kubernetes?
A third party framework that monitors Kubernetes
What are the three directions of autoscaling?
Vertical, Horizontal, and Multidimensional
What does the Kubernetes controller ReplicaSet do?
Maintains a stable number of pod replicas
What does the Kubernetes controller Deployment do?
Supports rolling back to different versions
Supports rolling upgrades of the application running in Kubernetes
What does the Kubernetes controller DaemonSet do?
Runs a daemon pod on every node, typically for monitoring and logging
What does the Kubernetes controller Job do?
Runs pods to completion
What is the work order of the Horizontal Pod Autoscaler?
HPA: 1. Read metrics -> 2. Threshold is reached -> 3. Change the number of replicas -> 4. Scale pods in and out
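The four HPA steps above can be sketched as one pass of a control loop. This is a minimal sketch, not the real controller: the function name, the replica bounds, and the tolerance value are assumptions for illustration. The proportional rule (desired = ceil(current replicas * current metric / target metric)) follows the rule described in the Kubernetes HPA documentation.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Return the replica count an HPA-style loop would request.

    Step 1 (read metrics) is assumed to have produced current_metric.
    All parameter names and defaults here are hypothetical.
    """
    ratio = current_metric / target_metric
    # Step 2: the threshold only counts as reached outside the tolerance band.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    # Step 3: change the number of replicas in proportion to the load.
    wanted = math.ceil(current_replicas * ratio)
    # Step 4: scale out or in, clamped to the configured bounds.
    return max(min_replicas, min(max_replicas, wanted))


print(desired_replicas(4, current_metric=90, target_metric=60))  # scale out
print(desired_replicas(4, current_metric=20, target_metric=60))  # scale in
```

With 4 replicas at 90% CPU against a 60% target, the sketch asks for 6 replicas; at 20% it asks for 2.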
What is the Vertical Pod Autoscaler?
VPA: 1. Read metrics -> 2. Threshold is reached -> 3. Change CPU/memory values -> 4. Adjust resource allocation
What is the Multidimensional Pod Autoscaler?
MPA: scales pods horizontally and vertically at the same time
What is the Cluster Autoscaler?
CA: adds or removes nodes in the cluster based on resource demand
Prometheus’s data model is what?
A time series
What is serverless computing?
A cloud execution model where the provider dynamically manages the infrastructure, allowing developers to focus on writing code that runs in response to events, with automatic scaling and pay-per-use pricing.
Why did people create serverless computing?
To reduce over provisioning
To change from big resource models to smaller ones
Serverless computing is what?
Event based
What is a warm start?
The function is already deployed in a loaded container, so the request skips the cold-start steps
What is the order of big resource models to small ones?
Server -> VM -> Containers -> Functions
What is a container timeout?
The amount of time a container can stay loaded without a running application before it is closed
True or false: Serverless computing is a single language framework
False: it is typically polyglot
What occurs when you have a cold start?
You find a VM or create one
You do not have to pay for what in serverless computing?
Idle time
What is application timeout?
The amount of time an application can stay running
What is a cold start?
The first execution of the program
What are the steps in a cold start?
Find host VM -> Load a container -> Function is loaded -> a Response is given
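The cold-start steps, the warm start, and the container timeout can be tied together in a small simulation. This is a minimal sketch under stated assumptions: the class FunctionHost, its fields, and the step strings are hypothetical and do not model any real provider.

```python
class FunctionHost:
    """Toy model of a serverless host (hypothetical, for illustration)."""

    def __init__(self, container_timeout):
        # How long an idle container stays loaded before it is closed.
        self.container_timeout = container_timeout
        self.last_used = {}  # function name -> time of the last invocation

    def invoke(self, name, now):
        last = self.last_used.get(name)
        if last is not None and now - last <= self.container_timeout:
            # Warm start: the container is still loaded.
            kind = "warm"
            steps = ["run loaded function", "return response"]
        else:
            # Cold start: first call, or the idle container was closed.
            kind = "cold"
            steps = ["find host VM", "load container",
                     "load function", "return response"]
        self.last_used[name] = now
        return kind, steps


host = FunctionHost(container_timeout=10)
for t in (0, 5, 20):
    print(t, host.invoke("resize_image", now=t)[0])
```

A shorter container timeout frees memory sooner but causes more cold starts, which is the trade-off the cards on wasted memory describe.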
What is the relationship between cold starts and wasted memory?
The more cold starts the less wasted memory
What is serverless computing good at?
Avoiding over provisioning
No infrastructure management
Hiding underlying infrastructure
Scalability/Concurrency
True on demand cost
Never pay for idle resource
Near unlimited computing resources
What are the negatives of serverless computing?
Very new
Limited resources and execution duration
Vendor lock in
Stateless
How do you store files in serverless computing?
Functions are stateless, so use external storage such as S3 or Azure Blob Storage
What is Structured Data?
Data that can be represented in a table with schema
What is Unstructured Data?
Data that is not organized in a pre-defined manner
What is Semi-structured Data?
Data that does not fit the table model of an RDBMS but still has organizational properties (e.g., JSON or XML)
What is the BLOB Storage Model?
A Flat object model for storing data
What are the three APIs for BLOB storage?
Put, Get, Delete
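The flat object model and its three operations can be sketched with an in-memory dictionary. This is a minimal sketch, not a real storage client: the class BlobStore and its method names are assumptions for illustration.

```python
class BlobStore:
    """In-memory stand-in for a flat BLOB store (hypothetical)."""

    def __init__(self):
        # Flat namespace: every object lives under one string key.
        self._objects = {}

    def put(self, key, data):
        self._objects[key] = bytes(data)

    def get(self, key):
        return self._objects[key]

    def delete(self, key):
        self._objects.pop(key, None)


store = BlobStore()
# The slashes are part of the key; there are no real directories.
store.put("photos/2024/cat.jpg", b"\xff\xd8raw-bytes")
print(len(store.get("photos/2024/cat.jpg")))
store.delete("photos/2024/cat.jpg")
```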
What does BLOB store?
Unstructured data
What is BLOB good at?
Highly scalable
Automatic backup and replica management
What are the five design assumptions used to design GFS?
System built from many inexpensive commodity machines (prone to failure)
System stores modest number of large files
Supporting three Google specific workloads
Concurrent, atomic append
Stable bandwidth is much more important than low latency
What are the Google specific workloads?
Large stream read
Small random read
Many large sequential appends
No random write
What is a typical example for a large stream read?
Crawled data processing
What is a typical example for a small random read?
Read small pieces from large data
What is a typical example for a large sequential append?
Appending new content to the search index
What is the reason for not supporting random write operations?
Simplicity in FS design
Simplicity in failover and data management
What is the GFS architecture?
One master with many chunk servers and many clients
What do the clients do in GFS?
Run programs that access data in chunk servers
What does the master contain in the GFS architecture?
The main controller and the metadata
What does the chunk server do in the GFS architecture?
Store data
What resource on the GFS master stores the unique handle for each data chunk?
The master’s memory
What is a chunk?
A FS data block
What is the default chunk size in GFS?
64 MB
What are the pros of a 64 MB chunk?
Large chunk size == small number of chunks
Reduces the size of the metadata stored in the memory space of the GFS master
Reduce # of operations between clients and master
Many operations on a given chunk
What are the cons of a 64 MB chunk?
Wastes storage space due to internal fragmentation
High overhead when handling many small files
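The pros and cons of a large chunk size come down to simple arithmetic. The sketch below is illustrative only: the function name is hypothetical, the 64 bytes of master metadata per chunk is an assumed round figure, and it assumes every chunk reserves its full size on disk, which is the worst case for internal fragmentation.

```python
MB = 1024 * 1024


def chunk_stats(file_size, chunk_size=64 * MB, metadata_per_chunk=64):
    """Return chunk count, master metadata, and unused space for one file."""
    # Integer ceiling division; even a tiny file occupies one chunk.
    chunks = max(1, (file_size + chunk_size - 1) // chunk_size)
    return {
        "chunks": chunks,
        "metadata_bytes": chunks * metadata_per_chunk,
        "unused_bytes": chunks * chunk_size - file_size,
    }


# Pro: a 1 GB file needs far fewer chunks (and less metadata) at 64 MB.
print(chunk_stats(1024 * MB))
print(chunk_stats(1024 * MB, chunk_size=4 * MB))
# Con: a 1 MB file still takes a whole chunk under this assumption.
print(chunk_stats(1 * MB))
```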
Why does the GFS client not have client side caching?
Data is too big to cache
What are the two requests that the GFS client handles?
Control request to master
Data access request to Chunkservers
What is HDFS?
Hadoop distributed file system
Open-source implementation of GFS
What is the master in HDFS?
The name node
What represents the chunk server in HDFS?
Data node
What is MapReduce?
Splitting a large dataset into smaller subsets and running the computation over them in parallel
What are the two operations in MapReduce?
Map operation
Reduce operation
What is the Map operation procedure?
Takes a series of key/value pairs and generates intermediate key/value pairs
What is the Reduce operation procedure?
Process key/value pairs from Map operations
Generate new output
What is the MapReduce process?
Read data from GFS -> Mappers -> Intermediate local files -> Reducers -> Write data to GFS
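The Map and Reduce operations can be illustrated with a single-process word count. This is a minimal sketch in plain Python, not Hadoop code: run_mapreduce, map_fn, and reduce_fn are hypothetical names, and an in-memory dictionary stands in for the intermediate local files.

```python
from collections import defaultdict


def map_fn(_key, line):
    # Map: one input pair in, intermediate (word, 1) pairs out.
    for word in line.split():
        yield word.lower(), 1


def reduce_fn(word, counts):
    # Reduce: all values for one intermediate key in, one output pair out.
    return word, sum(counts)


def run_mapreduce(records, mapper, reducer):
    intermediate = defaultdict(list)  # stands in for intermediate files
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            intermediate[out_key].append(out_value)
    # Group by key, then hand each group to the reducer.
    return dict(reducer(k, vs) for k, vs in sorted(intermediate.items()))


records = [(0, "the quick fox"), (1, "the lazy dog"), (2, "The end")]
print(run_mapreduce(records, map_fn, reduce_fn))
```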
What happens if a task fails in Hadoop?
Task tracker detects failure
Sends message to job tracker
Job tracker reschedules the task
What happens if a data node fails in Hadoop?
Implemented based on GFS mechanism
Both name node and job tracker detect the failure
All tasks on the failed node are rescheduled
Name node replicates the data chunk to another one
What happens if the Name Node or the Job Tracker fails?
Before v2.0, the entire cluster fails; from v2.0 onward, YARN handles the failure
What are the things Hadoop is good at?
Highly scalable
Fault tolerant
Simple Programming model
Doesn’t require a distributed processing background
Why does data locality become an issue for Hadoop?
When data must be moved from one group of nodes to the nodes that run the computation, the transfer adds latency
What are two more potential limitations of Hadoop/MapReduce?
Batch processing only
64 MB block size
What are some bad things of Hadoop?
64MB block size
Batch Processing Only
Data Locality
Hadoop’s Ecosystem is what?
Sparse
What is the difference between Hadoop’s computing model and Parallel DBMS computing model?
In Hadoop, jobs are the unit of work, while in a Parallel DBMS, transactions are the unit of work
Hadoop does not have concurrency control, while a Parallel DBMS does
What is the difference between Hadoop’s Data Model and Parallel DBMS Data Model?
Hadoop accepts any data and treats it as read-only, while a Parallel DBMS uses structured data with a schema and supports reads and writes
What is the difference between Hadoop’s cost model and Parallel DBMS cost model?
Hadoop uses cheap commodity machines while Parallel DBMS uses expensive servers
What is the difference between Hadoop’s failure model and Parallel DBMS failure model?
Hadoop has a lot of failures and a simple recovery mechanism, while a Parallel DBMS has very few failures and more intricate recovery mechanisms
What is the difference between Hadoop’s Key Characteristics and Parallel DBMS Key Characteristics?
Hadoop is scalable, flexible and fault tolerant while Parallel DBMS is efficient, optimized, and fine tuned
What is the limitation of MapReduce?
Most of its execution time is spent on I/O
What are its disk I/O operations?
Data is saved to disk after each iteration, and 3 chunk replicas are created by default for fault tolerance
How much faster is RAM compared to HDD?
x12
What is the idea behind Spark?
Creating ROM style RAM disks
What is the benefit behind Spark’s ROM style RAM disk?
Minimizes page update operations
Implemented as the Resilient Distributed Dataset (RDD)
What is Spark’s structure?
RDD + Programming Interface
What is the RDD?
Restricted form of a distributed shared memory
What are the attributes of an RDD?
Immutable, partitioned collection of records
Read only
Distributed over a cluster of many nodes
Two data flows: disk to RDD and RDD to RDD
How is RDD fault tolerant?
Lineage -> history of executions
Disk based check points
Re-execute steps from failures
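Lineage-based recovery can be sketched with a tiny immutable dataset that remembers how it was built. This is a minimal sketch in plain Python, not the Spark API: MiniRDD and lose_partition are hypothetical names, and a single list stands in for partitions spread over a cluster.

```python
class MiniRDD:
    """Toy read-only dataset that records its lineage (hypothetical)."""

    def __init__(self, source, lineage=()):
        self._source = source            # callable that re-reads base data
        self._lineage = tuple(lineage)   # history of transformations
        self._cache = None               # materialized records, if any

    def map(self, fn):
        # RDD to RDD: a new object is returned; this one is never modified.
        return MiniRDD(self._source, self._lineage + (("map", fn),))

    def filter(self, fn):
        return MiniRDD(self._source, self._lineage + (("filter", fn),))

    def collect(self):
        if self._cache is None:
            # Re-execute the recorded steps starting from the base data.
            data = list(self._source())
            for kind, fn in self._lineage:
                if kind == "map":
                    data = [fn(x) for x in data]
                else:
                    data = [x for x in data if fn(x)]
            self._cache = data
        return list(self._cache)

    def lose_partition(self):
        # Simulate a node failure that drops the in-memory copy.
        self._cache = None


base = MiniRDD(lambda: range(1, 6))
odd_squares = base.map(lambda x: x * x).filter(lambda x: x % 2 == 1)
print(odd_squares.collect())
odd_squares.lose_partition()
print(odd_squares.collect())  # rebuilt from lineage, same records
```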
What is the FIFO scheduler?
Tasks are scheduled according to their arrival time
What is the Shortest Task First Scheduler?
Tasks are scheduled according to their duration, shortest first
What is the Round Robin Scheduler?
Each task in turn is given a fixed time slice to run
What are FIFO and Shortest Task First preferable for?
Batch applications
What is Round Robin preferable for?
Interactive applications
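The three basic schedulers can be compared by the time at which each task completes. This is a minimal sketch with assumptions: all tasks arrive at time 0 in list order, durations are positive, and the function names and the quantum of 2 are chosen for illustration.

```python
from collections import deque


def fifo(durations):
    t, done = 0, []
    for d in durations:  # run each task to completion in arrival order
        t += d
        done.append(t)
    return done


def shortest_task_first(durations):
    done = [0] * len(durations)
    t = 0
    for i in sorted(range(len(durations)), key=lambda i: durations[i]):
        t += durations[i]
        done[i] = t
    return done


def round_robin(durations, quantum):
    remaining = list(durations)
    done = [0] * len(durations)
    queue = deque(range(len(durations)))
    t = 0
    while queue:
        i = queue.popleft()
        run = min(quantum, remaining[i])  # one time slice at most
        t += run
        remaining[i] -= run
        if remaining[i] > 0:
            queue.append(i)  # not finished: back of the queue
        else:
            done[i] = t
    return done


tasks = [8, 4, 2]
print(fifo(tasks))
print(shortest_task_first(tasks))
print(round_robin(tasks, quantum=2))
```

Under Round Robin every task starts running within a few time slices, which suits interactive applications, while FIFO and Shortest Task First run each task without interruption, which suits batch applications.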
What are the four types of Hadoop scheduling?
FIFO Scheduler
Capacity Scheduler
Fair Scheduler
Delay Scheduler
What are the pros of FIFO?
Simple
Predictable
Fair
Preserves order
What are the cons of FIFO?
Lack of prioritization
Stalling
Inflexible
What is the Capacity Scheduler?
A scheduler with multiple queues, where each queue has a soft limit that guarantees it a minimum portion of the cluster capacity
What is resource elasticity in capacity scheduler?
The ability to dynamically adjust the allocation of resources to different queues or applications
What are the disadvantages of capacity scheduler?
Complex
Overhead is high
Potential for resource fragmentation
What is the Fair Scheduler?
All jobs get an equal share of resources
How does Fair scheduler handle the cluster?
It divides the cluster into pools, then divides the resources equally among the pools
Each pool in a Fair scheduler has what?
Fair share scheduling
FIFO
Is the Fair Scheduler preemptive or non preemptive?
Preemptive
What happens when a pool does not have a minimum share of resources?
Take resources away from other pools
Currently running tasks will be killed and rescheduled
Victim tasks are selected from those that just started
What is the limitation of the Fair Scheduler?
Doesn’t support data locality
What is Delay Scheduling?
An improved Fair Scheduler with a relaxed queuing policy: jobs wait for a limited time to find idle machines with data locality
What is the benefit of delay scheduling?
Improve performance
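The wait-for-locality idea can be sketched as a function that looks at idle machines in the order they become free. This is a minimal sketch, not the Hadoop implementation: delay_schedule, its parameters, and the use of a skip count as the waiting limit are assumptions for illustration.

```python
def delay_schedule(job_data_nodes, free_node_offers, max_skips):
    """Return (node, is_local) for the accepted offer, or None.

    job_data_nodes: set of nodes that hold the job's input data.
    free_node_offers: idle nodes in the order they become available.
    max_skips: how many non-local offers the job may decline first.
    """
    skips = 0
    for node in free_node_offers:
        if node in job_data_nodes:
            return node, True       # data locality achieved
        if skips >= max_skips:
            return node, False      # waited long enough: run remotely
        skips += 1                  # decline and keep waiting
    return None                     # no offer accepted yet


print(delay_schedule({"n3"}, ["n1", "n2", "n3"], max_skips=2))
print(delay_schedule({"n3"}, ["n1", "n2", "n3"], max_skips=1))
```

With two allowed skips the job lands on the node that holds its data; with one it gives up and runs on a remote node.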
What is a centralized approach to scheduling?
Monolithic Scheduler
What are the decentralized approaches to scheduling?
Statically Partitioned
Two Level
Shared State
What is a Monolithic Scheduler?
A single centralized scheduler
What is the characteristic of a Monolithic Scheduler’s scheduling algorithm?
Applies the same scheduling algorithm to all incoming jobs
What are the pros of the Monolithic Scheduler?
Centralized control, scheduler knows everything, optimal scheduling decision
What are the cons of the Monolithic Scheduler?
Single code base
Difficult to add new scheduling policies
Increase in code complexity
Scheduler becomes bottleneck
Not suitable for large cluster size
What is the Statically Partitioned Scheduler?
Distributed scheduler used for cluster of multiple applications
What are the Pros of the Statically Partitioned Scheduler?
Can handle multiple frameworks
Bottleneck from one application will not affect other applications scheduling
What are the cons of the Statically Partitioned Scheduler?
Resource fragmentation
Sub-optimal resource utilization
What is the Two level scheduler?
Application level scheduler
Resource coordinator
Why is the Two level scheduler two level?
Because it has two levels of scheduling
What does the resource coordinator of the Two Level scheduler do?
Does dynamic resource partitioning
What does the application scheduler of the Two Level scheduler do?
Locks resources that are offered to it
What are the Pros of the Two Level scheduler?
Dynamic resource partitioning
High resource utilization
What are the cons of the Two Level scheduler?
Application schedulers are not omniscient
An app scheduler doesn’t know who uses which resource
It can only accept or reject an offer
What is the Shared State Scheduler?
Application schedulers have a replica of cluster state
What is the Cluster shared state of the shared state scheduler?
A replica
What are the pros of the shared state scheduler?
Better performance
What are the cons of the shared state scheduler?
App schedulers often have stale information
What is Single Resource Fairness?
Each user gets 1/n of the shared resource
What is the max-min Fairness?
The minimum allocation is maximized: users who demand less than an equal share get what they ask for, and the leftover is split among the remaining users
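Max-min fairness is commonly computed by satisfying the smallest demands first and splitting what is left equally. The sketch below shows that procedure for a single resource; the function name and the example numbers are assumptions for illustration.

```python
def max_min_share(capacity, demands):
    """Allocate one resource so that the smallest allocation is maximized."""
    allocation = [0.0] * len(demands)
    # Users still waiting, smallest demand first.
    waiting = sorted(range(len(demands)), key=lambda i: demands[i])
    remaining = float(capacity)
    while waiting:
        fair = remaining / len(waiting)  # equal split of what is left
        i = waiting[0]
        if demands[i] <= fair:
            # Small demand: fully satisfied, the leftover goes to others.
            allocation[i] = float(demands[i])
            remaining -= demands[i]
            waiting.pop(0)
        else:
            # Everyone left wants more than the equal split.
            for j in waiting:
                allocation[j] = fair
            break
    return allocation


print(max_min_share(12, [2, 4, 10]))
```

With 12 units and demands of 2, 4, and 10, the first two users are fully served and the third gets the remaining 6.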
What is weighted Max-Min Fairness?
Max-min fairness in which each user’s share is scaled by a weight assigned according to importance
What is Dominant Resource Fairness?
Allocates resources in a cluster environment by providing fairness to tasks/jobs based on their dominant resource requirement
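Dominant Resource Fairness can be sketched as a loop that always gives the next task to the user with the smallest dominant share. This is a simplified sketch, not a production scheduler: drf_allocate, the user names, and the cluster sizes are illustrative assumptions, and a user is simply dropped once its next task no longer fits.

```python
def drf_allocate(capacity, demands):
    """capacity: resource -> total; demands: user -> per-task resource needs.

    Returns the number of tasks launched per user.
    """
    used = {r: 0 for r in capacity}
    tasks = {u: 0 for u in demands}

    def dominant_share(user):
        # Largest fraction of any single resource this user holds.
        return max(tasks[user] * demands[user][r] / capacity[r]
                   for r in capacity)

    active = set(demands)
    while active:
        # Pick the user with the lowest dominant share (ties by name).
        user = min(sorted(active), key=dominant_share)
        need = demands[user]
        if all(used[r] + need[r] <= capacity[r] for r in capacity):
            for r in capacity:
                used[r] += need[r]
            tasks[user] += 1
        else:
            active.discard(user)  # simplification: stop serving this user
    return tasks


cluster = {"cpu": 9, "mem_gb": 18}
per_task = {"A": {"cpu": 1, "mem_gb": 4},   # memory is A's dominant resource
            "B": {"cpu": 3, "mem_gb": 1}}   # CPU is B's dominant resource
print(drf_allocate(cluster, per_task))
```

In this example A ends with 3 tasks and B with 2, so both hold two thirds of their dominant resource.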
What is NOSQL?
Not Only SQL: databases that are not limited to SQL and the relational model
What is the focus of NoSQL?
Focused on Scalability
BASE instead of ACID
What is the motivation of NOSQL?
Make scalable DBMS for cloud apps
What are the categories of NOSQL?
Document based
Key/Value pair
Column-Based
Graph-based
What is ACID?
Atomicity
Consistency
Isolation
Durability
What are the characteristics of NoSQL?
Scalability
Availability and eventual consistency
Replication models
Sharding of files
Does not require schema
No declarative query language
What are the characteristics of NoSQL as distributed systems?
Scalability
Availability and eventual consistency
Replication models
Sharding of files
What are the characteristics related to data models and query language?
Does not require schema
No declarative query language
Between a Master and a Slave, which can accept writes?
The Master
What is the pro of a Master/Slave split?
Consistency
What is the con of a Master/Slave split?
Master can be a bottleneck or a SPOF
What is the pro of a Master/Master split?
Performance (fast), HA
What is the con of a Master/Master split?
Inconsistency or need coordination
What is Sharding?
Horizontal data distribution over nodes
What are the partitioning strategies?
Hash-based and Range Based
What is a challenge in Multi-shard operations?
Joining and aggregation
What is Hash-based sharding?
The key determines the partition
What is Range-based sharding?
Assigns ranges defined over fields to partition
What is the pro of Hash-based sharding?
Even distribution
What is the con of hash based sharding?
No data locality
What is the pro of Range based sharding?
Enable range scan and sorting
What is the con of Range based sharding?
Repartitioning and Rebalancing
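The two partitioning strategies can be contrasted in a few lines. This is a minimal sketch: hash_shard, range_shard, the shard count, and the boundary values are assumptions for illustration, and MD5 is used only as a stable hash, not for security.

```python
import bisect
import hashlib


def hash_shard(key, num_shards):
    # The key alone determines the partition; neighbors are scattered.
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def range_shard(key, boundaries):
    # boundaries is sorted; shard i holds keys below boundaries[i],
    # and the last shard holds everything from the final boundary up.
    return bisect.bisect_right(boundaries, key)


names = ["alice", "bob", "mia", "zoe"]
print([hash_shard(n, 3) for n in names])
print([range_shard(n, ["h", "p"]) for n in names])
```

Range sharding keeps neighboring keys such as "alice" and "bob" on the same shard, which is what enables range scans; hash sharding spreads keys out but gives up that locality.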
What is the CAP Theorem?
A distributed system that shares data can guarantee only 2 of these 3 properties:
Consistency
Availability
Partition tolerance
What is Consistency?
All replicas hold the same copy of the data
What is Availability?
Reads and writes always succeed
What is Partition tolerance?
The system continues to operate in the presence of network partition
What commonly happens to very large systems?
They will partition at some point
Due to partitioning of very large datasets what is the result?
Relaxed consistency
What are the two consistency models?
ACID and BASE
Which consistency model results in strong consistency?
ACID
Which consistency model results in weak consistency?
BASE
What is required for a database or data processing system to become eventually consistent?
All replicas will gradually become consistent in the absence of updates
What was the first system with eventual consistency?
Amazon Dynamo
What is real time?
Computation with a deadline
What is hard real time?
Missing a job deadline can result in system failure
What is soft real time?
Missing deadlines can result in the degradation of the system’s QoS
What are some attributes of streaming data?
Unbounded
Push model
Concept of time
What are the four key components of the Pub/Sub model?
Publishers
Subscribers
Message broker
Topics
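The four components fit together as in the sketch below. It is a minimal in-memory illustration, not a real messaging system: the class MessageBroker, its method names, and the topic name are assumptions.

```python
from collections import defaultdict


class MessageBroker:
    """Toy broker that routes messages by topic (hypothetical)."""

    def __init__(self):
        self._subscribers = defaultdict(list)  # topic -> callbacks

    def subscribe(self, topic, callback):
        # A subscriber registers interest in a topic, not in a publisher.
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        # A publisher only knows the topic; the broker pushes the message.
        receivers = list(self._subscribers.get(topic, []))
        for callback in receivers:
            callback(message)
        return len(receivers)


broker = MessageBroker()
inbox = []
broker.subscribe("orders", inbox.append)
broker.publish("orders", {"id": 1})
broker.publish("payments", {"id": 2})  # no subscribers: message is dropped
print(inbox)
```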
What are the Pros of the Pub/Sub model?
Simple/flexible, scalable, network efficient
What are the cons of a Pub/Sub model?
Its simplicity and flexibility also make it inherently limited
What is the data scope for Batch Processing?
All or most of the data in the data set
What is the data size for Batch Processing?
Large
What is the data scope for Stream Processing?
Within a rolling time window or most recent data record
What is the performance for Batch Processing?
Latencies in minutes to hours
What is the analysis for Batch Processing?
Complex analytics
What is the data size for Stream Processing?
Very small, individual records or micro batches
What is the performance for Stream Processing?
Latency in the order of seconds or milliseconds
What is the analysis for Stream Processing?
Simple response functions, aggregates, and rolling metrics
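A rolling metric over a time window, the typical stream-processing analysis, can be sketched as follows. This is a minimal sketch: the class RollingAverage, the window length, and the sample readings are assumptions, and events are assumed to arrive in timestamp order.

```python
from collections import deque


class RollingAverage:
    """Average of the values seen in the last window_seconds (hypothetical)."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.events = deque()  # (timestamp, value), oldest first
        self.total = 0.0

    def add(self, timestamp, value):
        # Process one record at a time as it is pushed in.
        self.events.append((timestamp, value))
        self.total += value
        # Drop records that have fallen out of the rolling window.
        while self.events[0][0] <= timestamp - self.window:
            _, old = self.events.popleft()
            self.total -= old
        return self.total / len(self.events)


avg = RollingAverage(window_seconds=10)
for ts, reading in [(0, 10), (5, 20), (12, 30)]:
    print(ts, avg.add(ts, reading))
```

Only the records inside the window are kept, so the state stays small even though the stream itself is unbounded.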
What is Apache Storm?
First production ready, well adopted stream processor
High compatibility
Low level
Super fast
What is the stream processing pipeline?
Data Source -> Message Queue -> Stream Processor -> Batch -> Application
What is a Spout?
The source of data for the topology
Receives data from message queue
Emits tuples to bolts
What is a Bolt?
Core unit of computation
Emits outgoing tuples
What is a Tuple?
A stream message, i.e., a collection of data
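The roles of spout, bolt, and tuple can be mimicked with plain Python generators. This is a sketch of the concepts only and does not use the Apache Storm API: every function name here is hypothetical, and a list stands in for the message queue.

```python
def sentence_spout(message_queue):
    # Spout: receives data from the message queue and emits tuples.
    for sentence in message_queue:
        yield (sentence,)


def split_bolt(tuples):
    # Bolt: consumes incoming tuples and emits outgoing tuples.
    for (sentence,) in tuples:
        for word in sentence.split():
            yield (word,)


def count_bolt(tuples):
    # Bolt with state: emits the running count for each word.
    counts = {}
    for (word,) in tuples:
        counts[word] = counts.get(word, 0) + 1
        yield (word, counts[word])


def run_topology(message_queue):
    # Topology: the spout wired to a chain of bolts.
    return list(count_bolt(split_bolt(sentence_spout(message_queue))))


print(run_topology(["a b", "a"]))
```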