Final Exam Flashcards

This is it

1
What is Kubernetes?
An open source version of Google’s Borg
2
What is Kubernetes' job?
To manage container clusters
3
Can Kubernetes support multiple infrastructures?
Yes
4
Can Kubernetes support multiple containers running?
Yes
5
What are the four basic objects in Kubernetes?
Pod
Volume
Service
Namespace
6
What is the main role of a Pod in Kubernetes?
Basic deployment unit
7
What is the main role of a Volume in Kubernetes?
Persistent storage
7
What is the main role of a Service in Kubernetes?
Group of pods that work together
8
What is the main role of a Namespace in Kubernetes?
Logical slices of the Kubernetes cluster
9
Does a pod contain one or many (different) containers?
A pod can have one or many containers
10
What are some interesting features of a pod?
Co-scheduling
Localhost
Persistent storage
11
What are the three multi-container models in a Pod?
Sidecar, Ambassador, and Adapter
12
What is pod co-scheduling in Kubernetes?
Containers in a pod must be scheduled together on the same node
13
What is the sidecar’s role?
To be a helper
14
What is the ambassador’s role?
To be a proxy
15
What is the adapter’s role?
To be a common output interface
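The three patterns above all place a helper container next to the main one inside a single pod. A sketch of what a two-container (app + sidecar) pod spec might look like, written as a Python dict purely for illustration — in practice this would be YAML fed to `kubectl`, and the names, images, and paths here are made up:

```python
# Hypothetical pod spec: a web server plus a log-shipping sidecar.
# In real use this would be YAML passed to `kubectl apply -f`.
pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "web-with-sidecar"},
    "spec": {
        "containers": [
            # Main application container
            {"name": "web", "image": "nginx",
             "volumeMounts": [{"name": "logs", "mountPath": "/var/log/nginx"}]},
            # Sidecar: a helper that ships the app's logs elsewhere
            {"name": "log-shipper", "image": "fluentd",
             "volumeMounts": [{"name": "logs", "mountPath": "/logs"}]},
        ],
        # Shared volume: both containers in the pod can see it
        "volumes": [{"name": "logs", "emptyDir": {}}],
    },
}

# Co-scheduling: every container listed here lands on the same node
assert len(pod["spec"]["containers"]) == 2
```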
16
What is etcd’s responsibility in Kubernetes?
To store key/value pairs and all cluster state; it is the primary target of backups
17
Why is etcd important for failure recovery?
The cluster can be restored from etcd
18
What are controllers in Kubernetes?
Components that create and manage the four basic objects
19
What is autoscaling in Kubernetes?
The automatic increasing and decreasing of clusters and pods
20
What is Prometheus in Kubernetes?
A third-party framework that monitors Kubernetes
21
What are the three directions of autoscaling?
Vertical, Horizontal, and Multidimensional
22
What does the Kubernetes controller ReplicaSet do?
Maintains a stable number of pods
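A controller such as ReplicaSet can be pictured as a reconciliation loop: compare the desired replica count with the pods that actually exist, and act on the difference. A toy sketch of that loop — not real Kubernetes code:

```python
def reconcile(desired: int, current_pods: list) -> list:
    """Toy ReplicaSet reconciler: add or remove pods until the
    actual count matches the desired replica count."""
    pods = list(current_pods)
    while len(pods) < desired:          # too few -> create pods
        pods.append(f"pod-{len(pods)}")
    while len(pods) > desired:          # too many -> delete pods
        pods.pop()
    return pods

# A pod crashed: 2 remain, but 3 are desired -> one is recreated
assert len(reconcile(3, ["pod-0", "pod-1"])) == 3
```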
23
What does the Kubernetes controller Deployment do?
Supports rolling upgrades
Supports rolling back to different versions
24
What does the Kubernetes controller DaemonSet do?
Ensures a daemon pod (e.g., for monitoring and logging) runs on every node
25
What does the Kubernetes controller Job do?
Runs a task to completion
26
What is the work order of the Horizontal Pod Autoscaler?
HPA: 1. Read metrics -> 2. Threshold is reached -> 3. Change the number of replicas -> 4. Scale pods in/out
27
What is the Vertical Pod Autoscaler?
VPA: 1. Read metrics -> 2. Threshold is reached -> 3. Change CPU/MEM values -> 4. Adjust resource allocation
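Step 3 of the HPA loop above computes a new replica count from the observed metric. A minimal sketch of that calculation, using the ratio formula the Kubernetes HPA documents (desired = ceil(current × metric / target)):

```python
from math import ceil

def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float) -> int:
    """Step 3 of the HPA loop: compute the new replica count
    from the observed vs. target metric (HPA's ratio formula)."""
    return ceil(current_replicas * current_metric / target_metric)

# 4 replicas at 90% average CPU with a 60% target -> scale out to 6
assert desired_replicas(4, 90, 60) == 6
# Load drops to 20% -> scale in to 2
assert desired_replicas(4, 20, 60) == 2
```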
28
What is the Multidimensional Pod Autoscaler?
MPA; scales both horizontally (replicas) and vertically (CPU/MEM)
29
What is the Cluster Autoscaler?
CA; adds and removes nodes in the cluster
30
Prometheus's data model is what?
A time series
31
What is serverless computing?
A cloud execution model where the provider dynamically manages the infrastructure, allowing developers to focus on writing code that runs in response to events, with automatic scaling and pay-per-use pricing.
32
Why was serverless computing created?
To reduce over-provisioning
To move from big resource models to smaller ones
33
Serverless computing is what?
Event based
34
What is a warm start?
A function is already deployed
34
What is the ordering from big resource models to small ones?
Server -> VM -> Containers -> Functions
34
What is a container timeout?
The amount of time a container can stay running without an application before it's shut down
34
True or false: Serverless computing is a single language framework
False: it is typically polyglot
34
What occurs when you have a cold start?
You find a VM or create one
34
You do not have to pay for what in serverless computing?
Idle time
34
What is application timeout?
The amount of time an application can stay running
35
What is a cold start?
The first execution of the program
36
What are the steps in a cold start?
Find a host VM -> Load a container -> Function is loaded -> A response is given
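The cold/warm distinction above can be modeled with a pool of already-deployed functions: invoking anything not in the pool pays the cold-start path. A toy sketch with made-up function names:

```python
# Toy model of cold vs. warm starts: the platform keeps deployed
# functions in a pool; invoking one not in the pool is a cold start.
warm_pool = set()

def invoke(fn: str) -> str:
    if fn in warm_pool:
        return "warm start"          # function already deployed
    # Cold start: find/create a VM, load a container, load the function
    warm_pool.add(fn)
    return "cold start"

assert invoke("resize-image") == "cold start"   # first execution
assert invoke("resize-image") == "warm start"   # container reused
```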
37
What is the relationship between cold starts and wasted memory
The more cold starts, the less wasted memory
38
What is serverless computing good at?
Avoiding over-provisioning
No infrastructure management
Hiding the underlying infrastructure
Scalability/concurrency
True on-demand cost
Never pay for idle resources
Near-unlimited computing resources
39
What are the negatives of serverless computing?
Very new
Limited resources and execution duration
Vendor lock-in
Stateless
40
How do you store files in serverless computing (since there is no local storage)?
Use S3 or Azure Blob storage
41
What is Structured Data?
Data that can be represented in a table with schema
42
What is Unstructured Data?
Data that is not organized in a pre-defined manner
43
What is Semi-structured Data?
Cannot be stored in RDBMS, but has organizational properties
44
What is the BLOB Storage Model?
A Flat object model for storing data
45
What are the three APIs for BLOB storage?
Put, Get, Delete
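The three-call API above is small enough to sketch in full. A minimal in-memory stand-in — real blob stores such as S3 add replication, durability, and auth on top of the same flat model:

```python
# Minimal in-memory sketch of the flat BLOB model: no directories,
# no schema -- just three operations (Put, Get, Delete) on keys.
class BlobStore:
    def __init__(self):
        self._objects = {}            # a single flat key -> bytes namespace

    def put(self, key: str, data: bytes):
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        return self._objects[key]

    def delete(self, key: str):
        del self._objects[key]

store = BlobStore()
store.put("photos/cat.jpg", b"...")           # the key may look like a path,
assert store.get("photos/cat.jpg") == b"..."  # but the namespace is flat
store.delete("photos/cat.jpg")
```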
46
What does BLOB store?
Unstructured data
47
What is BLOB good at?
Highly scalable
Automatic backup/replica management
48
What are the five design assumptions used to design GFS?
The system is built from many inexpensive commodity machines (prone to failure)
The system stores a modest number of large files
It supports three Google-specific workloads
Concurrent, atomic appends
Stable bandwidth is much more important than low latency
49
What are the Google specific workloads?
Large streaming reads
Small random reads
Many large sequential appends
No random writes
50
What is a typical example for a large stream read?
Crawled data processing
51
What is a typical example for a small random read?
Read small pieces from large data
52
What is a typical example for a large sequential append?
Append search index with new context
53
What is the reason for not supporting random write operations?
Simplicity in FS design
Simplicity in failover and data management
54
What is the GFS architecture?
One master with many chunk servers and many clients
55
What do the clients do in GFS?
Run programs that access data in chunk servers
56
What does the master contain in the GFS architecture?
A main controller and metadata
57
What does the chunk server do in the GFS architecture?
Store data
58
What resource at the GFS master stores the unique handle for each data chunk?
The master's memory
59
What is a chunk?
A file-system data block
60
What is the default chunk size in GFS?
64 MB
61
What are the pros of a 64 MB chunk?
Large chunk size == small number of chunks
Reduces the size of metadata stored in the memory of the GFS master
Reduces the number of operations between clients and the master
Many operations on a given chunk
62
What are the cons of a 64 MB chunk?
Wastes storage space due to internal fragmentation
High overhead when handling many small files
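The metadata argument behind the large chunk size is easy to check with arithmetic: for a 1 TB file, a 64 MB chunk size yields thousands of chunks where a typical 4 KB local-FS block size would yield hundreds of millions:

```python
TB = 2**40
file_size = 1 * TB

chunks_64mb = file_size // (64 * 2**20)   # GFS default chunk size
chunks_4kb = file_size // (4 * 2**10)     # typical local-FS block size

assert chunks_64mb == 16_384              # easily fits in master memory
assert chunks_4kb == 268_435_456          # ~268M entries -- far too many
```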
63
Why does the GFS client not have client side caching?
Data is too big to cache
64
What are the two requests that the GFS client handles?
Control requests to the master
Data access requests to the chunkservers
65
What is HDFS?
Hadoop Distributed File System; an open-source implementation of GFS
66
What is the master in HDFS?
The name node
67
What represents the chunk server in HDFS?
Data node
68
What is MapReduce?
Splitting a large dataset into smaller subsets to do computation over them
69
What are the two operations in MapReduce?
The Map operation and the Reduce operation
70
What is the Map operation procedure?
Takes a series of key/value pairs and generates intermediate key/value pairs
71
What is the Reduce operation procedure?
Processes key/value pairs from Map operations and generates new output
72
What is the MapReduce process?
Read data from GFS -> Mappers -> Intermediate local files -> Reducers -> Write data to GFS
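The pipeline above is easiest to see with the canonical word-count example, sketched here in plain Python rather than actual Hadoop code:

```python
from collections import defaultdict

# The canonical MapReduce example: word count.
def map_fn(_key, line):
    # Map: emit an intermediate (word, 1) pair for each word
    for word in line.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # Reduce: sum all counts for one intermediate key
    return (word, sum(counts))

lines = ["the quick fox", "the lazy dog"]
intermediate = defaultdict(list)
for i, line in enumerate(lines):                  # map phase
    for k, v in map_fn(i, line):
        intermediate[k].append(v)                 # shuffle: group by key
results = dict(reduce_fn(k, v) for k, v in intermediate.items())

assert results["the"] == 2
```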
73
What happens if a task fails in Hadoop?
The task tracker detects the failure
It sends a message to the job tracker
The job tracker reschedules the task
74
What happens if a data node fails in Hadoop?
Handled based on the GFS mechanism
Both the name node and the job tracker detect the failure
All tasks on the failed node are rescheduled
The name node replicates the data chunks to another node
75
What happens if a Name Node or a Job Tracker fails?
Before v2.0 the entire cluster fails; afterwards YARN handles the failure
76
What are the things Hadoop is good at?
Highly scalable
Fault tolerant
Simple programming model
Doesn't require a distributed-processing background
77
Why does data locality become an issue for Hadoop?
When data needs to be moved from one cluster of nodes to another, there can be a latency delay
78
What are two more potential limitations of Hadoop/MapReduce?
Batch processing only
64 MB block size
79
What are some drawbacks of Hadoop?
64 MB block size
Batch processing only
Data locality
80
Hadoop's Ecosystem is what?
Sparse
81
What is the difference between Hadoop's computing model and Parallel DBMS computing model?
For Hadoop, jobs are the unit of work; for parallel DBMSs, transactions are the unit of work
Hadoop has no concurrency control, while parallel DBMSs do
82
What is the difference between Hadoop's Data Model and Parallel DBMS Data Model?
Hadoop uses any data and its data is read-only, while a parallel DBMS uses structured data with a schema and supports reads/writes
83
What is the difference between Hadoop's cost model and Parallel DBMS cost model?
Hadoop uses cheap commodity machines while Parallel DBMS uses expensive servers
84
What is the difference between Hadoop's failure model and Parallel DBMS failure model?
Hadoop has a lot of failures and a simple recovery mechanism, while a parallel DBMS has very few failures with more intricate recovery mechanisms
85
What is the difference between Hadoop's Key Characteristics and Parallel DBMS Key Characteristics?
Hadoop is scalable, flexible, and fault tolerant, while a parallel DBMS is efficient, optimized, and fine-tuned
86
What is the limitation of MapReduce?
Most of its execution time is spent on I/O
87
What are MapReduce's disk I/O operations?
Data is saved to disk after each iteration
Three chunk replicas are created (by default)
This provides fault tolerance
88
How much faster is RAM compared to HDD?
12x
89
What is the idea behind Spark?
Keeping intermediate data in read-only (ROM-style) RAM disks
90
What is the benefit behind Spark's ROM style RAM disk?
Minimizes page update operations
Enables the Resilient Distributed Dataset (RDD)
91
What is Spark's structure?
RDD + Programming Interface
92
What is the RDD?
A restricted form of distributed shared memory
93
What are the attributes of an RDD?
Immutable, partitioned collection of records
Read only
Distributed over a cluster of many nodes
Two data flows: disk to RDD and RDD to RDD
94
How is RDD fault tolerant?
Lineage -> a history of executions
Disk-based checkpoints
Re-execute steps from failures
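Lineage-based recovery can be sketched as follows: each RDD remembers its parent and the transformation that produced it, so a lost partition is recomputed from history rather than restored from a replica. A toy model, not Spark's actual API:

```python
# Toy RDD: immutable data plus the lineage (parent + transformation)
# needed to rebuild it after a failure, instead of replicating it.
class RDD:
    def __init__(self, data=None, parent=None, transform=None):
        self.parent, self.transform = parent, transform
        self.data = data                 # None after a "node failure"

    def map(self, fn):
        # Transformations create a new RDD; the old one is untouched
        return RDD(parent=self, transform=lambda d: [fn(x) for x in d])

    def compute(self):
        if self.data is None:            # lost partition: replay lineage
            self.data = self.transform(self.parent.compute())
        return self.data

base = RDD(data=[1, 2, 3])
doubled = base.map(lambda x: x * 2)
doubled.compute()
doubled.data = None                      # simulate losing the partition
assert doubled.compute() == [2, 4, 6]    # rebuilt from lineage
```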
95
What is the FIFO scheduler?
Tasks are scheduled according to their arrival time
96
What is the Shortest Task First Scheduler?
Tasks are scheduled according to their duration
97
What is the Round Robin Scheduler?
Each task is given a certain duration of time to run
98
What are FIFO or shortest-task-first preferable for?
Batch applications
99
What is Round Robin preferable for?
Interactive applications
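The trade-off between the two batch schedulers above can be seen in a small simulation: shortest-task-first reorders the same arrivals and lowers the average completion time. Task names and durations are made up:

```python
# Toy comparison of FIFO vs. shortest-task-first on (name, duration)
# tasks that all arrived in the listed order.
tasks = [("A", 10), ("B", 1), ("C", 5)]

fifo_order = [name for name, _ in tasks]
stf_order = [name for name, _ in sorted(tasks, key=lambda t: t[1])]

assert fifo_order == ["A", "B", "C"]    # arrival time wins
assert stf_order == ["B", "C", "A"]     # short tasks jump ahead

def avg_completion(order, dur=dict(tasks)):
    # Run tasks back to back; average the time each one finishes
    t, total = 0, 0
    for name in order:
        t += dur[name]
        total += t
    return total / len(order)

# Finishing short tasks early lowers the average completion time
assert avg_completion(stf_order) < avg_completion(fifo_order)
```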
100
What are the four types of Hadoop scheduling?
FIFO Scheduler
Capacity Scheduler
Fair Scheduler
Delay Scheduler
101
What are the pros of FIFO?
Simple
Predictable
Fair
Preserves order
102
What are the cons of FIFO?
Lack of prioritization
Stalling
Inflexible
103
What is the Capacity Scheduler?
A scheduler with multiple queues, each queue having a soft limit on its minimum portion of cluster capacity
104
What is resource elasticity in capacity scheduler?
The ability to dynamically adjust the allocation of resources to different queues or applications
105
What are the disadvantages of capacity scheduler?
Complex
High overhead
Potential for resource fragmentation
106
What is Fair scheduler?
All jobs get an equal share of resources
107
How does Fair scheduler handle the cluster?
It divides the cluster into pools, then divides resources equally among the pools
108
Each pool in a Fair scheduler has what?
Its own internal scheduling policy (fair share or FIFO)
109
Is the Fair Scheduler preemptive or non-preemptive?
Preemptive
110
What happens when a pool does not have a minimum share of resources?
It takes resources away from other pools
Currently running tasks are killed and rescheduled
Victim tasks are selected among those that just started
111
What is the limitation of the Fair Scheduler?
Doesn't support data locality
112
What is Delay Scheduling?
An improved Fair Scheduler; a relaxed queuing policy makes jobs wait a limited time to find idle machines with data locality
113
What is the benefit of delay scheduling?
Improved performance
114
What is a centralized approach to scheduling?
Monolithic Scheduler
115
What are the decentralized approach to system schedulers?
Statically Partitioned
Two-Level
Shared State
116
What is a Monolithic Scheduler?
A single centralized scheduler
117
What are the characteristics a Monolithic scheduler scheduling algorithm?
Applies the same scheduling algorithm to all incoming jobs
118
What are the pros of the Monolithic Scheduler?
Centralized control; the scheduler knows everything and can make optimal scheduling decisions
119
What are the cons of the Monolithic Scheduler?
Single code base
Difficult to add new scheduling policies
Increased code complexity
The scheduler becomes a bottleneck
Not suitable for large cluster sizes
120
What is the Statically Partitioned Scheduler?
A distributed scheduler used for a cluster running multiple applications
121
What are the Pros of the Statically Partitioned Scheduler?
Can handle multiple frameworks
A bottleneck in one application will not affect other applications' scheduling
122
What are the cons of the Statically Partitioned Scheduler?
Resource fragmentation
Sub-optimal resource utilization
123
What is the Two level scheduler?
Application-level schedulers plus a resource coordinator
124
Why is the Two level scheduler two level?
Resource allocation (the coordinator) and task placement (the application schedulers) are separate levels
125
What does the resource coordinator of the Two Level scheduler do?
Does dynamic resource partitioning
126
What does the application scheduler of the Two Level scheduler do?
Locks resources that are offered to it
127
What are the Pros of the Two Level scheduler?
Dynamic resource partitioning
High resource utilization
128
What are the cons of the Two Level scheduler?
Application schedulers are not omniscient
An app scheduler doesn't know who uses which resources
It can only select or reject offers
129
What is the Shared State Scheduler?
Application schedulers have a replica of cluster state
130
What is the cluster shared state in the shared state scheduler?
A replica of the cluster state held by each application scheduler
131
What are the pros of the shared state scheduler?
Better performance
132
What are the cons of the shared state scheduler?
App schedulers often have stale information
133
What is Single Resource Fairness?
Each user gets 1/n of the shared resource
134
What is the max-min Fairness?
Maximize the minimum allocation: no user gets more than its demand, and users with unsatisfied demands split the remaining resources equally
135
What is weighted Max-Min Fairness?
Gives weights to users according to importance
136
What is Dominant Resource Fairness?
Allocates resources in a cluster environment by providing fairness to tasks/jobs based on their dominant resource requirement
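DRF is usually illustrated with the example from the DRF paper: a cluster of <9 CPU, 18 GB>, user A tasks needing <1 CPU, 4 GB>, and user B tasks needing <3 CPU, 1 GB>. Progressive filling repeatedly launches a task for whichever user currently has the smallest dominant share:

```python
capacity = {"cpu": 9.0, "mem": 18.0}
demand = {"A": {"cpu": 1, "mem": 4}, "B": {"cpu": 3, "mem": 1}}
tasks = {"A": 0, "B": 0}
used = {"cpu": 0.0, "mem": 0.0}

def dominant_share(user):
    # Largest fraction of any single resource this user holds
    return max(tasks[user] * demand[user][r] / capacity[r]
               for r in capacity)

while True:
    # Give the next task to the user with the smallest dominant share
    user = min(tasks, key=dominant_share)
    if any(used[r] + demand[user][r] > capacity[r] for r in capacity):
        break                         # that user's next task no longer fits
    for r in capacity:
        used[r] += demand[user][r]
    tasks[user] += 1

assert tasks == {"A": 3, "B": 2}      # the allocation from the DRF paper
```

A's dominant resource is memory (4/18 per task), B's is CPU (3/9 per task); equalizing dominant shares yields 3 tasks for A and 2 for B.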
137
What is NOSQL?
Not only using SQL for databases
138
What is the focus of NoSQL?
Focused on scalability
No ACID, but BASE
139
What is the motivation of NOSQL?
Make scalable DBMS for cloud apps
140
What are the categories of NOSQL?
Document-based
Key/value pair
Column-based
Graph-based
141
What is ACID?
Atomicity, Consistency, Isolation, Durability
142
What are NoSQL's characteristics?
Scalability
Availability and eventual consistency
Replication models
Sharding of files
Does not require a schema
No declarative query language
143
What are the characteristics as distributed systems?
Scalability
Availability and eventual consistency
Replication models
Sharding of files
144
What are the characteristics related to data models and query language?
Does not require a schema
No declarative query language
145
Out of a master and a slave, which can accept writes?
The Master
146
What is the pro of a Master/Slave split?
Consistency
147
What is the con of a Master/Slave split?
The master can be a bottleneck or a single point of failure (SPOF)
148
What is the pro of a Master/Master split?
Performance (fast) and high availability (HA)
149
What is the con of a Master/Master split?
Inconsistency, or the need for coordination
150
What is Sharding?
Horizontal data distribution over nodes
151
What are the partitioning strategies?
Hash-based and range-based
152
What is a challenge in Multi-shard operations?
Joining and aggregation
153
What is Hash-based sharding?
The key determines the partition
154
What is Ranged based sharding?
Assigns ranges defined over fields to partitions
155
What is the pro of hash-based sharding?
Even distribution
156
What is the con of hash based sharding?
No data locality
157
What is the pro of Range based sharding?
Enables range scans and sorting
158
What is the con of Range based sharding?
Repartitioning and Rebalancing
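The locality trade-off in the cards above can be sketched with two toy partition functions over four nodes. The range boundaries and the hash are illustrative only, not what real systems use:

```python
# Toy sharding over 4 nodes: hash-based spreads keys evenly but
# scatters neighbors; range-based keeps neighboring keys together.
NODES = 4
ranges = ["a-f", "g-m", "n-s", "t-z"]    # illustrative range boundaries

def hash_shard(key: str) -> int:
    # Stable toy hash; real systems use e.g. MD5 or Murmur
    return sum(map(ord, key)) % NODES

def range_shard(key: str) -> int:
    for i, r in enumerate(ranges):
        lo, hi = r.split("-")
        if lo <= key[0] <= hi:
            return i

# Adjacent keys stay on one node under range sharding...
assert range_shard("apple") == range_shard("avocado") == 0
# ...which makes range scans cheap, at the cost of possible hot spots.
assert 0 <= hash_shard("apple") < NODES
```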
159
What is the CAP Theorem?
When sharing data in a distributed system, you can only have 2 of these 3: Consistency, Availability, Partition tolerance
160
What is Consistency?
All replicas have the same copy
161
What is Availability?
Reads and writes always succeed
162
What is Partition tolerance?
The system continues to operate in the presence of network partition
163
What will commonly happen to very large systems?
They will partition at some point
164
Due to partitioning in very large systems, what is the result?
Relaxed consistency
165
What are the two consistency models?
ACID and BASE
166
Which consistency model results in strong consistency?
ACID
167
Which consistency model results in weak consistency?
BASE
168
What is required for a database or data processing system to become eventually consistent?
All replicas will gradually become consistent in the absence of updates
169
What was the first system with eventual consistency?
Amazon's Dynamo
170
What is real time?
Computation with a deadline
171
What is hard real time?
Missing a job deadline can result in system failure
172
What is soft real time?
Missing deadlines can result in degradation of the system's QoS
173
What are some attributes of streaming data?
Unbounded
Push model
Concept of time
174
What are the four key components of the Pub/Sub model?
Publishers
Subscribers
Message broker
Topics
175
What are the Pros of the Pub/Sub model?
Simple/flexible, scalable, network efficiency
176
What are the cons of a Pub/Sub model?
Its simplicity/flexibility is inherently limiting
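The four components above fit in a few lines: the broker routes messages by topic, so publishers and subscribers never reference each other directly. A minimal sketch:

```python
from collections import defaultdict

# Minimal sketch of the Pub/Sub model: publishers and subscribers
# never talk directly -- the message broker routes by topic.
class Broker:
    def __init__(self):
        self.subscribers = defaultdict(list)   # topic -> callbacks

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, topic, message):
        for callback in self.subscribers[topic]:
            callback(message)

broker = Broker()
received = []
broker.subscribe("orders", received.append)    # a subscriber
broker.publish("orders", "order #1 placed")    # a publisher
broker.publish("payments", "no one is listening")  # dropped: no subscriber

assert received == ["order #1 placed"]
```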
177
What is the data scope for Batch Processing?
All or most of the data in the data set
178
What is the data size for Batch Processing?
Large
179
What is the data scope for Stream Processing?
Within a rolling time window or most recent data record
179
What is the performance for Batch Processing?
Latencies in minutes to hours
180
What is the analysis for Batch Processing?
Complex analytics
181
What is the data size for Stream Processing?
Very small, individual records or micro batches
181
What is the performance for Stream Processing?
Latency in the order of seconds or milliseconds
182
What is the analysis for Stream Processing?
Simple response functions, aggregates, and rolling metrics
183
What is Apache Storm?
The first production-ready, widely adopted stream processor
High compatibility
Low level
Super fast
183
What is the stream processing pipeline?
Data Source -> Message Queue -> Stream Processor -> Batch -> Application
184
What is Spout?
A source of data for a topology
Receives data from the message queue
Emits tuples to bolts
185
What is a Bolt?
The core unit of computation
Emits outgoing tuples
186
What is a Tuple?
A stream message, i.e., a collection of data
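The spout/bolt/tuple vocabulary can be sketched with plain Python generators. This is illustrative only; real Storm topologies are built through its Java/Clojure APIs:

```python
# Toy Storm-style topology: a spout emits tuples, a bolt transforms
# them, and results flow downstream -- all names are illustrative.
def spout(queue):
    # Spout: the source of the stream (a plain list stands in
    # for the message queue here)
    for item in queue:
        yield (item,)                 # emit a tuple

def split_bolt(tuples):
    # Bolt: the core unit of computation -- split sentences into words
    for (sentence,) in tuples:
        for word in sentence.split():
            yield (word,)             # emit outgoing tuples

stream = split_bolt(spout(["hello storm", "hello world"]))
words = [w for (w,) in stream]

assert words == ["hello", "storm", "hello", "world"]
```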