Final Exam Flashcards

This is it

1
Q

What is Kubernetes?

A

An open-source container orchestration system derived from Google's internal Borg

2
Q

What is Kubernetes' job?

A

To manage container clusters

3
Q

Can Kubernetes support multiple infrastructures?

A

Yes

4
Q

Can Kubernetes support multiple containers running?

A

Yes

5
Q

What are the four basic objects in Kubernetes?

A

Pod
Volume
Service
Namespace

6
Q

What is the main role of a Pod in Kubernetes?

A

Basic deployment unit

7
Q

What is the main role of a Volume in Kubernetes?

A

Persistent storage

7
Q

What is the main role of a Service in Kubernetes?

A

Group of pods that work together

8
Q

What is the main role of a Namespace in Kubernetes?

A

Logical slices of the Kubernetes cluster

9
Q

Does a pod contain one or many (different) containers?

A

A pod can have one or many containers

10
Q

What are some interesting features of a pod?

A

Co-scheduling
Localhost
Persistent Storage

11
Q

What are the three multi-container models in Pod?

A

Sidecar, Ambassador, and Adapter

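A minimal sketch of the sidecar pattern, written as a Python dict that mirrors a pod manifest (real manifests are YAML; the image names, volume name, and mount paths below are made up):

```python
# A pod with a main container plus a sidecar helper. Both containers are
# co-scheduled on one node, share localhost, and share the "shared-logs" volume.
pod_with_sidecar = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "web-with-sidecar"},
    "spec": {
        "containers": [
            {   # main application container
                "name": "web",
                "image": "example/web:1.0",
                "volumeMounts": [{"name": "shared-logs", "mountPath": "/var/log/app"}],
            },
            {   # sidecar helper: ships the logs the main container writes
                "name": "log-shipper",
                "image": "example/shipper:1.0",
                "volumeMounts": [{"name": "shared-logs", "mountPath": "/logs"}],
            },
        ],
        "volumes": [{"name": "shared-logs", "emptyDir": {}}],
    },
}
```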
12
Q

What is pod co-scheduling in Kubernetes?

A

Containers in a pod must be scheduled together on the same node

13
Q

What is the sidecar’s role?

A

To be a helper

14
Q

What is the ambassador’s role?

A

To be a proxy

15
Q

What is the adapter’s role?

A

To be a common output interface

16
Q

What is etcd’s responsibility in Kubernetes?

A

To store key/value pairs and all cluster state; it is the primary target of backups

17
Q

Why is etcd important for failure recovery?

A

The cluster can be restored with etcd

18
Q

What are controllers in Kubernetes?

A

Control loops that create and manage the four basic objects

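A hedged sketch of the idea behind a controller: a reconciliation loop that drives the actual state toward the desired state, here with a ReplicaSet-like pod count as the example. The function parameters are illustrative stand-ins, not Kubernetes API calls:

```python
import time

def reconcile(desired_replicas, list_pods, create_pod, delete_pod):
    """One reconciliation pass: drive the actual pod count toward the desired count."""
    pods = list_pods()
    if len(pods) < desired_replicas:
        for _ in range(desired_replicas - len(pods)):
            create_pod()
    elif len(pods) > desired_replicas:
        for pod in pods[desired_replicas:]:
            delete_pod(pod)

def control_loop(desired_replicas, list_pods, create_pod, delete_pod, interval=5):
    """A controller is essentially this loop, run forever against cluster state."""
    while True:
        reconcile(desired_replicas, list_pods, create_pod, delete_pod)
        time.sleep(interval)
```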
19
Q

What is autoscaling in Kubernetes?

A

Automatically increasing and decreasing the number of pods and cluster nodes based on demand

20
Q

What is Prometheus in Kubernetes?

A

A third party framework that monitors Kubernetes

21
Q

What are the three directions of autoscaling?

A

Vertical, Horizontal, and Multidimensional

22
Q

What does the Kubernetes controller ReplicaSet do?

A

Maintains a stable number of pod replicas

23
Q

What does the Kubernetes controller Deployment do?

A

Supports rolling upgrades of an application
Supports rolling back to previous versions

24
Q

What does the Kubernetes controller DaemonSet do?

A

Runs a copy of a pod on every node (e.g., monitoring and logging daemons)

25
Q

What does the Kubernetes controller Job do?

A

Runs pods that execute until completion

26
Q

What is the work order of the Horizontal Pod Autoscaler?

A

HPA: 1) Read metrics -> 2) Threshold is reached -> 3) Change the number of replicas -> 4) Scale pods out or in

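A small sketch of step 3, using the proportional rule commonly cited for the HPA (desired = ceil(current * metric / target)); the example numbers are made up:

```python
import math

def desired_replicas(current_replicas, current_metric, target_metric):
    """HPA-style calculation: scale the replica count in proportion to how far
    the observed metric is from its target."""
    return max(1, math.ceil(current_replicas * current_metric / target_metric))

# Example: 4 replicas at 90% average CPU with a 60% target -> scale out to 6.
print(desired_replicas(4, 90, 60))  # 6
```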
27
Q

What is the Vertical Pod Autoscaler?

A

VPA: 1) Read metrics -> 2) Threshold is reached -> 3) Change CPU/memory values -> 4) Adjust resource allocation

28
Q

What is the Multidimensional Pod Autoscaler?

A

MPA: combines horizontal and vertical autoscaling, adjusting both the number of replicas and their CPU/memory values

29
Q

What is the Cluster Autoscaler?

A

CA: scales the number of nodes in the cluster up or down

30
Q

Prometheus’s data model is what?

A

A time series

31
Q

What is serverless computing?

A

A cloud execution model where the provider dynamically manages the infrastructure, allowing developers to focus on writing code that runs in response to events, with automatic scaling and pay-per-use pricing.

32
Q

Why did people create serverless computing?

A

To reduce over provisioning
To change from big resource models to smaller ones

33
Q

Serverless computing is what?

A

Event based

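An illustrative event-driven function in the AWS Lambda Python style; the event payload shape below is an assumption:

```python
# Each invocation is triggered by one event; the platform provisions and
# scales the underlying infrastructure automatically.
def handler(event, context):
    name = event.get("name", "world")
    return {"statusCode": 200, "body": f"Hello, {name}!"}
```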
34
Q

What is a warm start?

A

A function is already deployed

34
Q

What is the order of big resource models to small ones?

A

Server -> VM -> Containers -> Functions

34
Q

What is a container timeout?

A

The amount of time a container can stay running without serving an application before it is closed

34
Q

True or false: Serverless computing is a single language framework

A

False: it is typically polyglot

34
Q

What occurs when you have a cold start?

A

You find a VM or create one

34
Q

You do not have to pay for what in serverless computing?

A

Idle time

34
Q

What is application timeout?

A

The amount of time an application can stay running

35
Q

What is a cold start?

A

The first execution of the program

36
Q

What are the steps in a cold start?

A

Find a host VM -> Load a container -> Load the function -> Return a response

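A toy Python sketch contrasting cold and warm starts: the first call provisions a container for the function, and later calls reuse it until the idle timeout evicts it. All names and the timeout value are illustrative:

```python
import time

CONTAINER_TIMEOUT = 600   # seconds a warm container may sit idle (illustrative)
_containers = {}          # function name -> (container, last_used_timestamp)

class Container:
    """Stand-in for a provisioned runtime with the function loaded."""
    def __init__(self, fn):
        self.fn = fn
    def run(self, event):
        return self.fn(event)

def invoke(fn_name, fn, event):
    now = time.time()
    entry = _containers.get(fn_name)
    if entry and now - entry[1] < CONTAINER_TIMEOUT:
        container = entry[0]        # warm start: the function is already deployed
    else:
        container = Container(fn)   # cold start: find/create a host, load container + function
    _containers[fn_name] = (container, now)
    return container.run(event)

print(invoke("hello", lambda e: f"hi {e}", "there"))  # first call: cold start
print(invoke("hello", lambda e: f"hi {e}", "again"))  # second call: warm start
```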
37
Q

What is the relationship between cold starts and wasted memory?

A

The more cold starts, the less memory is wasted on idle containers

38
Q

What is serverless computing good at?

A

Avoiding over provisioning
No infrastructure management
Hiding underlying infrastructure
Scalability/Concurrency
True on demand cost
Never pay for idle resource
Near unlimited computing resources

39
Q

What are the negatives of serverless computing?

A

Very new
Limited resources and execution duration
Vendor lock in
Stateless

40
Q

How do you store files in serverless computing, given that functions are stateless?

A

Use object storage such as S3 or Azure Blob Storage

41
Q

What is Structured Data?

A

Data that can be represented in a table with schema

42
Q

What is Unstructured Data?

A

Data that is not organized in a pre-defined manner

43
Q

What is Semi-structured Data?

A

Data that cannot be stored in a relational DBMS but has some organizational properties (e.g., JSON, XML)

44
Q

What is the BLOB Storage Model?

A

A Flat object model for storing data

45
Q

What are the three APIs for BLOB storage?

A

Put, Get, Delete

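A minimal in-memory sketch of the flat Put/Get/Delete interface (illustrative only, not a real cloud SDK; the key and bytes are made up):

```python
class BlobStore:
    """Toy flat blob store: no directory hierarchy, just keys mapped to bytes."""
    def __init__(self):
        self._objects = {}            # key -> bytes

    def put(self, key, data):
        self._objects[key] = data

    def get(self, key):
        return self._objects[key]

    def delete(self, key):
        self._objects.pop(key, None)

store = BlobStore()
store.put("photos/cat.jpg", b"\xff\xd8...")   # unstructured bytes under a flat key
print(store.get("photos/cat.jpg"))
store.delete("photos/cat.jpg")
```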
46
Q

What does BLOB store?

A

Unstructured data

47
Q

What is BLOB good at?

A

Highly scalable
Automatic Backup replica management

48
Q

What are the five design assumptions used to design GFS?

A

System built from many inexpensive commodity machines (prone to failure)
System stores modest number of large files
Supporting three Google specific workloads
Concurrent, atomic append
Sustained bandwidth is much more important than low latency

49
Q

What are the Google specific workloads?

A

Large stream read
Small random read
Many large sequential append
No random write

50
Q

What is a typical example for a large stream read?

A

Crawled data processing

51
Q

What is a typical example for a small random read?

A

Read small pieces from large data

52
Q

What is a typical example for a large sequential append?

A

Append new content to the search index

53
Q

What is the reason for not supporting random write operations?

A

Simplicity in FS design
Simplicity in failover and data management

54
Q

What is the GFS architecture?

A

One master with many chunk servers and many clients

55
Q

What do the clients do in GFS?

A

Run programs that access data in chunk servers

56
Q

What does the master contain in the GFS architecture?

A

The main controller and the metadata

57
Q

What does the chunk server do in the GFS architecture?

A

Store data

58
Q

Which resource on the GFS master stores the unique handle for each data chunk?

A

The master’s memory

59
Q

What is a chunk?

A

A file-system data block

60
Q

What is the default chunk size in GFS?

A

64 MB

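A small arithmetic sketch of how a client maps a byte offset in a file to a chunk index under the 64 MB default, before asking the master for that chunk's handle:

```python
CHUNK_SIZE = 64 * 1024 * 1024   # 64 MB default chunk size

def chunk_index(byte_offset):
    """Which chunk of the file holds this byte offset."""
    return byte_offset // CHUNK_SIZE

def chunk_range(index):
    """Byte range [start, end) covered by a chunk."""
    return index * CHUNK_SIZE, (index + 1) * CHUNK_SIZE

# A read at offset 200 MB falls in chunk 3 (chunks 0-2 cover the first 192 MB).
print(chunk_index(200 * 1024 * 1024))   # 3
```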
61
Q

What are the pros of a 64 MB chunk?

A

Large chunk size == small number of chunks
Reduces the size of metadata stored in the memory of the GFS master
Reduces the number of operations between clients and the master
Many operations on a given chunk

62
Q

What are the cons of a 64 MB chunk?

A

Wastes storage space due to internal fragmentation
High overhead when handling many small files

63
Q

Why does the GFS client not have client side caching?

A

Data is too big to cache

64
Q

What are the two requests that the GFS client handles?

A

Control request to master
Data access request to Chunkservers

65
Q

What is HDFS?

A

Hadoop Distributed File System
An open-source implementation of GFS

66
Q

What is the master in HDFS?

A

The name node

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
67
Q

What represents the chunk server in HDFS?

A

Data node

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
68
Q

What is MapReduce?

A

Splitting a large dataset into smaller subsets and performing computation over them in parallel

69
Q

What are the two operations in MapReduce?

A

Map operation
Reduce operation

70
Q

What is the Map operation procedure?

A

Takes a series of key/value pairs and generates intermediate key/value pairs

71
Q

What is the Reduce operation procedure?

A

Process key/value pairs from Map operations
Generate new output

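A compact word-count sketch of the two operations, with the shuffle step that groups intermediate values by key (toy data, single process):

```python
from collections import defaultdict

def map_fn(_key, line):
    """Map: take one (key, value) record and emit intermediate key/value pairs."""
    for word in line.split():
        yield word, 1

def reduce_fn(word, counts):
    """Reduce: combine all intermediate values for one key into a final output."""
    return word, sum(counts)

lines = ["the quick brown fox", "the lazy dog"]
intermediate = defaultdict(list)
for i, line in enumerate(lines):
    for k, v in map_fn(i, line):       # map phase
        intermediate[k].append(v)      # shuffle: group values by key
results = [reduce_fn(k, vs) for k, vs in intermediate.items()]   # reduce phase
print(dict(results))                   # {'the': 2, 'quick': 1, ...}
```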
72
Q

What is the MapReduce process?

A

Read data from GFS -> Mappers -> Intermediate local files -> Reducers -> Write data to GFS

73
Q

What happens if a task fails in Hadoop?

A

Task tracker detects failure
Sends message to job tracker
Job tracker reschedules the task

74
Q

What happens if a data node fails in Hadoop?

A

Implemented based on GFS mechanism
Both name node and job tracker detect the failure
All tasks on the failed node are rescheduled
The name node replicates the data chunks to another data node

75
Q

What happens if a Name Node or a Job Trackers fails?

A

Before Hadoop 2.0, the entire cluster fails; from 2.0 onward, YARN handles the failure

76
Q

What are the things Hadoop is good at?

A

Highly scalable
Fault tolerant
Simple Programming model
Doesn’t require a distributed processing background

77
Q

Why does data locality become an issue for Hadoop?

A

When data needs to be moved from one cluster of nodes to another, there can be a latency delay

78
Q

What are two more potential limitations of Hadoop/MapReduce?

A

Batch processing only
64 MB block size

79
Q

What are some drawbacks of Hadoop?

A

64MB block size
Batch Processing Only
Data Locality

80
Q

Hadoop’s Ecosystem is what?

A

Sparse

81
Q

What is the difference between Hadoop’s computing model and Parallel DBMS computing model?

A

For Hadoop, jobs are the unit of work, while for Parallel DBMS, transactions are the unit of work
Hadoop does not have concurrency control, while Parallel DBMS has concurrency control

82
Q

What is the difference between Hadoop’s Data Model and Parallel DBMS Data Model?

A

Hadoop accepts any data and its data is read-only, while Parallel DBMS uses structured data with a schema and supports reads/writes

83
Q

What is the difference between Hadoop’s cost model and Parallel DBMS cost model?

A

Hadoop uses cheap commodity machines while Parallel DBMS uses expensive servers

84
Q

What is the difference between Hadoop’s failure model and Parallel DBMS failure model?

A

Hadoop has a lot of failures and a simple recovery mechanism, while Parallel DBMS has very few failures with more intricate recovery mechanisms

85
Q

What is the difference between Hadoop’s Key Characteristics and Parallel DBMS Key Characteristics?

A

Hadoop is scalable, flexible and fault tolerant while Parallel DBMS is efficient, optimized, and fine tuned

86
Q

What is the limitation of map reduce?

A

Most of its execution time is spent on I/O

87
Q

What are its disk I/O operations?

A

Data is saved to disk after each iteration; 3 chunk replicas are created (by default) for fault tolerance

88
Q

How much faster is RAM compared to HDD?

A

x12

89
Q

What is the idea behind Spark?

A

Creating ROM-style (read-only) RAM disks

90
Q

What is the benefit behind Spark’s ROM style RAM disk?

A

Minimizes page update operations
Resilient Distributed Dataset

91
Q

What is Spark’s structure?

A

RDD + Programming Interface

92
Q

What is the RDD?

A

Restricted form of a distributed shared memory

93
Q

What are the attributes of an RDD?

A

Immutable, partitioned collection of records
Read only
Distributed over a cluster of many nodes
Two data flows: disk to RDD and RDD to RDD

94
Q

How is RDD fault tolerant?

A

Lineage -> a history of executions
Disk-based checkpoints
Re-executes steps from the lineage after failures
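A short PySpark-style sketch: each transformation produces a new RDD (the RDD-to-RDD flow), and the resulting chain is the lineage Spark can replay to rebuild a lost partition. Assumes a local Spark installation; the data is made up:

```python
from pyspark import SparkContext

sc = SparkContext("local", "rdd-lineage-sketch")

lines = sc.parallelize(["the quick brown fox", "the lazy dog"])   # driver/disk -> RDD
words = lines.flatMap(lambda line: line.split())                  # RDD -> RDD
pairs = words.map(lambda w: (w, 1))                               # RDD -> RDD
counts = pairs.reduceByKey(lambda a, b: a + b)                    # RDD -> RDD

# RDDs are immutable and read-only; the chain above is the lineage Spark
# would re-execute from a checkpoint or the source to recover a lost partition.
print(counts.collect())
sc.stop()
```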

95
Q

What is the FIFO scheduler?

A

Tasks are scheduled according to their arrival time

96
Q

What is the Shortest Task First Scheduler?

A

Tasks are scheduled according to their Duration

97
Q

What is the Round Robin Scheduler?

A

Each task is given a certain duration of time to run
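A toy comparison of the three policies on the same task list (names, arrival times, durations, and the time quantum are made up):

```python
from collections import deque

tasks = [("A", 0, 8), ("B", 1, 2), ("C", 2, 4)]   # (name, arrival_time, duration)

fifo_order = [t[0] for t in sorted(tasks, key=lambda t: t[1])]   # by arrival time
stf_order  = [t[0] for t in sorted(tasks, key=lambda t: t[2])]   # by duration

def round_robin(tasks, quantum=2):
    """Each task runs for at most `quantum` time units before yielding."""
    queue = deque((name, dur) for name, _, dur in sorted(tasks, key=lambda t: t[1]))
    schedule = []
    while queue:
        name, remaining = queue.popleft()
        schedule.append(name)
        if remaining > quantum:
            queue.append((name, remaining - quantum))
    return schedule

print(fifo_order)            # ['A', 'B', 'C']
print(stf_order)             # ['B', 'C', 'A']
print(round_robin(tasks))    # ['A', 'B', 'C', 'A', 'C', 'A', 'A']
```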

98
Q

What are FIFO or shortest-task-first schedulers preferable for?

A

Batch applications

99
Q

What is Round Robin preferable for?

A

Interactive applications

100
Q

What are the four types of Hadoop scheduling?

A

FIFO Scheduler
Capacity Scheduler
Fair Scheduler
Delay Scheduler

101
Q

What are the pros of FIFO?

A

Simple
Predictable
Fair
Preserves order

102
Q

What are the cons of FIFO?

A

Lack of prioritization
Stalling
Inflexible

103
Q

What is the Capacity Scheduler?

A

A scheduler with multiple queues, where each queue has a soft limit on its minimum portion of cluster capacity

104
Q

What is resource elasticity in capacity scheduler?

A

The ability to dynamically adjust the allocation of resources to different queues or applications

105
Q

What are the disadvantages of capacity scheduler?

A

Complex
Overhead is high
Potential for resource fragmentation

106
Q

What is Fair scheduler?

A

All jobs get an equal share of resources

107
Q

How does Fair scheduler handle the cluster?

A

It divides the cluster into pools and then divides the resources equally among the pools

108
Q

Each pool in a Fair scheduler has what?

A

Fair share scheduling
FIFO

109
Q

Is the Fair Scheduler preemptive or non preemptive?

A

Preemptive

110
Q

What happens when a pool does not have a minimum share of resources?

A

Resources are taken away from other pools
Currently running tasks may be killed and rescheduled
Victim tasks are selected from those that started most recently

111
Q

What are the limitations of the Fair Scheduler?

A

Doesn’t support data locality

112
Q

What is Delay Scheduling?

A

An improved Fair Scheduler with a relaxed queuing policy: jobs wait a limited amount of time to find an idle machine with data locality

113
Q

What is the benefit of delay scheduling?

A

Improve performance

114
Q

What is a centralized approach to scheduling?

A

Monolithic Scheduler

115
Q

What are the decentralized approaches to system scheduling?

A

Statically Partitioned
Two Level
Shared State

116
Q

What is a Monolithic Scheduler?

A

A single centralized scheduler

117
Q

What is characteristic of a Monolithic Scheduler's scheduling algorithm?

A

Applies the same scheduling algorithm to all incoming jobs

118
Q

What are the pros of the Monolithic Scheduler?

A

Centralized control, scheduler knows everything, optimal scheduling decision

119
Q

What are the cons of the Monolithic Scheduler?

A

Single code base
Difficult to add new scheduling policies
Increase in code complexity
Scheduler becomes bottleneck
Not suitable for large cluster size

120
Q

What is the Statically Partitioned Scheduler?

A

A distributed scheduler for a cluster shared by multiple applications, each given a static partition of the cluster

121
Q

What are the Pros of the Statically Partitioned Scheduler?

A

Can handle multiple frameworks
A bottleneck in one application will not affect other applications' scheduling

122
Q

What are the cons of the Statically Partitioned Scheduler?

A

Resource fragmentation
Sub-optimal resource utilization

123
Q

What is the Two level scheduler?

A

Application level scheduler
Resource coordinator

124
Q

Why is the Two level scheduler two level?

A

Because scheduling happens at two levels: a cluster-wide resource coordinator and per-application schedulers

125
Q

What does the resource coordinator of the Two Level scheduler do?

A

Does dynamic resource partitioning

126
Q

What does the application scheduler of the Two Level scheduler do?

A

Locks resources that are offered to it

127
Q

What are the Pros of the Two Level scheduler?

A

Dynamic resource partitioning
High resource utilization

128
Q

What are the cons of the Two Level scheduler?

A

Application schedulers are not omniscient
An application scheduler does not know who uses which resources
It can only accept or reject the offers it receives

129
Q

What is the Shared State Scheduler?

A

Application schedulers have a replica of cluster state

130
Q

What is the Cluster shared state of the shared state scheduler?

A

A replica of the full cluster state, kept by each application scheduler

131
Q

What are the pros of the shared state scheduler?

A

Better performance

132
Q

What are the cons of the shared state scheduler?

A

App schedulers often have stale information

133
Q

What is Single Resource Fairness?

A

Each user gets 1/n of the shared resource

134
Q

What is the max-min Fairness?

A

An allocation that maximizes the minimum share each user receives (no user gets more than it demands)
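A hedged sketch of max-min fairness via progressive filling: capacity is split evenly, users who demand less return their surplus, and the remainder is redistributed among the still-unsatisfied users (the capacity and demands below are made up):

```python
def max_min_fair(capacity, demands):
    """Progressive filling: no user exceeds its demand, and unused share is
    redistributed among the users that are still unsatisfied."""
    allocation = {user: 0.0 for user in demands}
    unsatisfied = set(demands)
    remaining = float(capacity)
    while unsatisfied and remaining > 1e-9:
        share = remaining / len(unsatisfied)
        for user in list(unsatisfied):
            give = min(share, demands[user] - allocation[user])
            allocation[user] += give
            remaining -= give
            if allocation[user] >= demands[user]:
                unsatisfied.remove(user)
    return allocation

# Capacity 10 with demands 2/4/10 -> roughly {A: 2, B: 4, C: 4}:
# A and B are capped at their demand, and C gets everything left over.
print(max_min_fair(10, {"A": 2, "B": 4, "C": 10}))
```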

135
Q

What is weighted Max-Min Fairness?

A

Gives weights to users according to importance

136
Q

What is Dominant Resource Fairness?

A

Allocates resources in a cluster environment by providing fairness to tasks/jobs based on their dominant resource requirement
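A simplified sketch of the DRF idea: a job's dominant share is the largest fraction of any single resource it holds, and the next task goes to the job with the smallest dominant share. The cluster size and per-task demands below are illustrative:

```python
def dominant_share(allocation, capacity):
    """Largest fraction of any one resource that a job currently holds."""
    return max(allocation[r] / capacity[r] for r in capacity)

capacity = {"cpu": 9, "mem_gb": 18}
demands  = {"A": {"cpu": 1, "mem_gb": 4},    # A's dominant resource is memory
            "B": {"cpu": 3, "mem_gb": 1}}    # B's dominant resource is CPU
alloc = {job: {r: 0 for r in capacity} for job in demands}
used  = {r: 0 for r in capacity}

# DRF-style loop (simplified): repeatedly give one task to the job with the
# smallest dominant share, until its next task no longer fits.
while True:
    job = min(demands, key=lambda j: dominant_share(alloc[j], capacity))
    if any(used[r] + demands[job][r] > capacity[r] for r in capacity):
        break
    for r in capacity:
        alloc[job][r] += demands[job][r]
        used[r] += demands[job][r]

print(alloc)   # A ends with (3 CPU, 12 GB), B with (6 CPU, 2 GB): dominant shares of 2/3 each
```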

137
Q

What is NOSQL?

A

"Not Only SQL": databases that are not limited to the relational SQL model

138
Q

What is the focus of NoSQL?

A

Focused on Scalability
No ACID but BASE

139
Q

What is the motivation of NOSQL?

A

Make scalable DBMS for cloud apps

140
Q

What are the categories of NOSQL?

A

Document based
Key/Value pair
Column-Based
Graph-based

141
Q

What is ACID?

A

Atomicity
Consistency
Isolation
Durability

142
Q

What is NOSQL characteristics?

A

Scalability
Availability and eventual consistency
Replication models
Sharding of files
Does not require schema
No declarative query language

143
Q

What are the NoSQL characteristics as distributed systems?

A

Scalability
Availability and eventual consistency
Replication models
Sharding of files

144
Q

What are the characteristics related to data models and query language?

A

Does not require schema
No declarative query language

145
Q

Out of a master and a slave, which can accept writes?

A

The Master

146
Q

What is the pro of a Master/Slave split?

A

Consistency

147
Q

What is the con of a Master/Slave split?

A

The master can be a bottleneck or a single point of failure (SPOF)

148
Q

What are the pros of a Master/Master split?

A

Performance (fast) and high availability (HA)

149
Q

What is the con of a Master/Master split?

A

Inconsistency, or the need for coordination

150
Q

What is Sharding?

A

Horizontal data distribution over nodes

151
Q

What are the partitioning strategies?

A

Hash-based and range-based

152
Q

What is a challenge in Multi-shard operations?

A

Joining and aggregation

153
Q

What is Hash-based sharding?

A

A hash of the key determines the partition

154
Q

What is Ranged based sharding?

A

Assigns ranges defined over fields to partitions
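A small sketch contrasting the two partitioning strategies (shard count, ranges, and keys below are made up):

```python
import hashlib

NUM_SHARDS = 4

def hash_shard(key):
    """Hash-based: a hash of the key picks the partition (even spread, no locality)."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

RANGES = [("a", "g"), ("g", "n"), ("n", "t"), ("t", "{")]   # '{' sorts just after 'z'

def range_shard(key):
    """Range-based: the partition whose key range contains the key, which keeps
    adjacent keys together and enables range scans and sorting."""
    for i, (lo, hi) in enumerate(RANGES):
        if lo <= key[0].lower() < hi:
            return i
    return len(RANGES) - 1

print(hash_shard("alice"), hash_shard("alicia"))    # likely different shards
print(range_shard("alice"), range_shard("alicia"))  # same shard (both start with 'a')
```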

155
Q

What are the pro of Hash based sharding?

A

Even distribution

156
Q

What is the con of hash based sharding?

A

No data locality

157
Q

What is the pro of Range based sharding?

A

Enable range scan and sorting

158
Q

What is the con of Range based sharding?

A

Repartitioning and Rebalancing

159
Q

What is the CAP Theorem?

A

A distributed system that shares data can provide only 2 of the following 3 guarantees:
Consistency
Availability
Partition tolerance

160
Q

What is Consistency?

A

All replicas have the same copy

161
Q

What is Availability?

A

Reads and writes always succeed

162
Q

What is Partition tolerance?

A

The system continues to operate in the presence of network partition

163
Q

What will commonly happen to very large systems?

A

It will partition at some point

164
Q

Due to partitioning of very large datasets what is the result?

A

Relaxed consistency

165
Q

What are the two consistency models?

A

ACID and BASE

166
Q

Which consistency model results in strong consistency?

A

ACID

167
Q

Which consistency model results in weak consistency?

A

BASE

168
Q

What is required for a database or data processing system to become eventually consistent?

A

All replicas will gradually become consistent in the absence of updates

169
Q

What was the first system with eventual consistency?

A

Amazon Dynamo

170
Q

What is real time?

A

Computation with a deadline

171
Q

What is hard real time?

A

Missing a job deadline can result in system failure

172
Q

What is soft real time?

A

Missing deadlines can result in degradation of the system's QoS

173
Q

What are some attributes of streaming data?

A

Unbounded
Push model
Concept of time

174
Q

What are the four key components of the Pub/Sub model?

A

Publishers
Subscribers
MSG Broker
Topics
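A toy in-process sketch of the four components: publishers call publish, subscribers register callbacks on topics, and the broker fans each message out (topic and message contents below are made up):

```python
from collections import defaultdict

class MessageBroker:
    """Toy broker: publishers push to a topic; the broker delivers the message
    to every subscriber callback registered on that topic."""
    def __init__(self):
        self._topics = defaultdict(list)    # topic -> list of subscriber callbacks

    def subscribe(self, topic, callback):
        self._topics[topic].append(callback)

    def publish(self, topic, message):
        for callback in self._topics[topic]:
            callback(message)

broker = MessageBroker()
broker.subscribe("orders", lambda msg: print("billing got:", msg))
broker.subscribe("orders", lambda msg: print("shipping got:", msg))
broker.publish("orders", {"id": 42, "item": "book"})   # publisher never sees subscribers
```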

175
Q

What are the Pros of the Pub/Sub model?

A

Simple/flexible, scalable, network efficiency

176
Q

What are the cons of a Pub/Sub model?

A

The same simplicity/flexibility is also inherently limiting

177
Q

What is the data scope for Batch Processing?

A

All or most of the data in the data set

178
Q

What is the data size for Batch Processing?

A

Large

179
Q

What is the data scope for Stream Processing?

A

Within a rolling time window or most recent data record

179
Q

What is the performance for Batch Processing?

A

Latencies in minutes to hours

180
Q

What is the analysis for Batch Processing?

A

Complex analytics

181
Q

What is the data size for Stream Processing?

A

Very small, individual records or micro batches

181
Q

What is the performance for Stream Processing?

A

Latency in the order of seconds or milliseconds

182
Q

What is the analysis for Stream Processing?

A

Simple response functions, aggregates, and rolling metrics

183
Q

What is Apache Storm?

A

The first production-ready, widely adopted stream processor
High compatibility
Low level
Super fast

183
Q

What is the stream processing pipeline?

A

Data Source -> Message Queue -> Stream Processor -> Batch -> Application

184
Q

What is Spout?

A

Sources of data for topology
Receives data from message queue
Emits tuples to bolts

185
Q

What is a Bolt?

A

Core unit of computation
Emits outgoing tuples

186
Q

What is a Tuple?

A

A stream message, i.e., a collection of data
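A plain-Python sketch of a spout feeding a bolt with tuples (not the actual Storm API; the queue contents are made up):

```python
from collections import deque

message_queue = deque(["the quick fox", "the lazy dog"])   # stand-in for a real message queue

def spout():
    """Source of the stream: pull from the message queue and emit tuples."""
    while message_queue:
        yield ("sentence", message_queue.popleft())

def split_bolt(tup):
    """Bolt: one unit of computation per incoming tuple, emitting outgoing tuples."""
    _, sentence = tup
    for word in sentence.split():
        yield ("word", word)

for t in spout():              # topology: spout -> bolt
    for out in split_bolt(t):
        print(out)             # e.g. ('word', 'the')
```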