Final Exam Flashcards

Question

What does the Kubernetes controller DaemonSet do?

Answer 1

Monitoring and logging Daemon

Answer 2

Runs to complete

Answer 3

HPA: 1: Read metrics -> 2: Threshold is reached -> 3: Change # of replicas -> 4: Scale pods in and out
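
The HPA steps above can be sketched with the replica formula given in the Kubernetes documentation, desired = ceil(current replicas x current metric / target metric). This is a minimal illustration only: the function name, the min/max bounds, and the sample numbers are assumptions, and the real controller also applies a tolerance band and stabilization windows that are omitted here.

```python
import math

def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10):
    """Return the replica count an HPA-style controller would aim for."""
    # Scale proportionally to how far the observed metric is from the target.
    raw = math.ceil(current_replicas * current_metric / target_metric)
    # Clamp to the configured bounds (hypothetical defaults).
    return max(min_replicas, min(max_replicas, raw))

# 4 pods at 90% CPU with a 60% target: scale out.
scale_out = desired_replicas(4, 90, 60)
# 4 pods at 30% CPU with a 60% target: scale in.
scale_in = desired_replicas(4, 30, 60)
print(scale_out, scale_in)
```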

Answer 4

VPA: 1: Read metrics -> 2: Threshold is reached -> 3: Change CPU/MEM values -> 4: Adjust resource allocation

Answer 5

A time series

Answer 6

A cloud execution model where the provider dynamically manages the infrastructure, allowing developers to focus on writing code that runs in response to events, with automatic scaling and pay-per-use pricing.

Answer 7

To reduce over-provisioning; To change from big resource models to smaller ones

Answer 8

Event based

Answer 9

A function is already deployed

Answer 10

Server -> VM -> Containers -> Functions

Answer 11

The amount of time a container can stay running without an application before it is closed

Answer 12

False: it is typically polyglot

Answer 13

You find a VM or create one

Answer 14

The amount of time an application can stay running

Answer 15

The first execution of the program

Answer 16

Find host VM -> Load a container -> Function is loaded -> a Response is given

Answer 17

The more cold starts, the less wasted memory
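
The trade-off between cold starts and wasted memory can be illustrated with a small simulation. This is a sketch under simplifying assumptions: a single function, instantaneous executions, and a hypothetical `keep_alive` value in seconds. Idle time after the final call is not counted.

```python
def simulate_keep_alive(invocation_times, keep_alive):
    """Count cold starts and idle container-seconds for one function."""
    cold_starts = 0
    idle_seconds = 0.0
    last = None
    for t in invocation_times:
        if last is None or t - last > keep_alive:
            # No warm container is available: pay a cold start.
            cold_starts += 1
            if last is not None:
                # The previous container idled until its keep-alive expired.
                idle_seconds += keep_alive
        else:
            # A warm container sat in memory between the two calls.
            idle_seconds += t - last
        last = t
    return cold_starts, idle_seconds

calls = [0, 5, 100, 104]
print(simulate_keep_alive(calls, keep_alive=10))
print(simulate_keep_alive(calls, keep_alive=0))
```

A longer keep-alive lowers the cold-start count but keeps idle containers in memory for longer; a keep-alive of zero does the opposite.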

Answer 18

Avoiding over-provisioning; No infrastructure management; Hiding underlying infrastructure; Scalability/Concurrency; True on-demand cost; Never pay for idle resources; Near-unlimited computing resources

Answer 19

Very new; Limited resources and execution duration; Vendor lock-in; Stateless

Answer 20

Use S3 or Azure Blob store

Answer 21

Data that can be represented in a table with schema

Answer 22

Data that is not organized in a pre-defined manner

Answer 23

Cannot be stored in RDBMS, but has organizational properties

Answer 24

A Flat object model for storing data

Answer 25

Put, Get, Delete
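
A minimal sketch of these three operations on a flat object store. The class and key names are hypothetical and this is an in-memory toy, not the S3 or Azure Blob API. Note that a key such as "photos/2024/cat.jpg" is a plain string; the slashes do not create directories.

```python
class FlatObjectStore:
    """Toy in-memory object store with a flat key namespace."""

    def __init__(self):
        self._objects = {}

    def put(self, key, data):
        # Overwrites any existing object stored under the same key.
        self._objects[key] = data

    def get(self, key):
        # Raises KeyError if the object does not exist.
        return self._objects[key]

    def delete(self, key):
        # Deleting a missing key is treated as a no-op.
        self._objects.pop(key, None)

store = FlatObjectStore()
store.put("photos/2024/cat.jpg", b"\xff\xd8")
print(store.get("photos/2024/cat.jpg"))
store.delete("photos/2024/cat.jpg")
```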

Answer 26

Unstructured data

Answer 27

Highly scalable; Automatic backup; Replica management

Answer 28

System built from many inexpensive commodity machines (prone to failure); System stores a modest number of large files; Supporting three Google-specific workloads; Concurrent, atomic append; Stable bandwidth is much more important than low latency

Answer 29

Large stream read; Small random read; Many large sequential appends; No random write

Answer 30

Crawled data processing

Answer 31

Read small pieces from large data

Answer 32

Append search index with new context

Answer 33

Simplicity in FS design; Simplicity in failover and data management

Answer 34

One master with many chunk servers and many clients

Answer 35

Run programs that access data in chunk servers

Answer 36

Has a main controller and meta data

Answer 37

Store data

Answer 38

The master's memory

Answer 39

A FS data block

Answer 40

Large chunk size == small number of chunks; Reduce size of metadata stored in memory space of GFS master; Reduce # of operations between clients and master; Many operations on a given chunk

Answer 41

Wasted storage space due to internal fragmentation; High overhead when handling many small files

Answer 42

Data is too big to cache

Answer 43

Control requests go to the master; Data access requests go to chunkservers

Answer 44

Hadoop Distributed File System; Open-source implementation of GFS

Answer 45

The name node

Answer 46

Splitting a large dataset into smaller subsets to do computation over it

Answer 47

Map operation; Reduce operation

Answer 48

Takes a series of key/value pairs, generate intermediate key/value pairs

Answer 49

Process key/value pairs from Map operations; Generate new output

Answer 50

Read data from GFS -> Mappers -> Intermediate local files -> Reducers -> Write data to GFS
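
The map, shuffle, and reduce flow above can be sketched as a single-process word count. This is an illustration only: the function names are hypothetical, the in-memory `groups` dictionary stands in for the intermediate local files, and there is no distribution or fault tolerance.

```python
from collections import defaultdict

def map_fn(_key, line):
    # Map: take one input pair, emit intermediate (word, 1) pairs.
    for word in line.split():
        yield word.lower(), 1

def reduce_fn(word, counts):
    # Reduce: combine all values that share a key into one output pair.
    return word, sum(counts)

def run_mapreduce(records, mapper, reducer):
    groups = defaultdict(list)  # stands in for the shuffle/intermediate files
    for key, value in records:
        for k, v in mapper(key, value):
            groups[k].append(v)
    return dict(reducer(k, vs) for k, vs in sorted(groups.items()))

lines = [(0, "the quick fox"), (1, "the lazy dog")]
result = run_mapreduce(lines, map_fn, reduce_fn)
print(result)
```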

Answer 51

Task tracker detects failure; Sends message to job tracker; Job tracker reschedules the task

Answer 52

Implemented based on GFS mechanism; Both name node and job tracker detect the failure; All tasks on the failed node are rescheduled; Name node replicates the data chunk to another node

Answer 53

The entire cluster fails if it is before v2.0; afterwards, YARN handles the failure

Answer 54

Highly scalable; Fault tolerant; Simple programming model; Doesn't require a distributed processing background

Answer 55

When data needs to be moved from one cluster of nodes to another there can be a latency delay

Answer 56

Batch processing only; 64 MB block size

Answer 57

64 MB block size; Batch processing only; Data locality

Answer 58

For Hadoop, jobs are the unit of work, while for parallel DBMS, transactions are the unit of work; Hadoop does not have concurrency control, while parallel DBMS do

Answer 59

Hadoop uses any data and its data is read only while Parallel DBMS uses structured data with schema and uses read/writes

Answer 60

Hadoop uses cheap commodity machines while Parallel DBMS uses expensive servers

Answer 61

Hadoop has a lot of failures and a simple recovery mechanism, while Parallel DBMS has very few failures with more intricate recovery mechanisms

Answer 62

Hadoop is scalable, flexible and fault tolerant while Parallel DBMS is efficient, optimized, and fine tuned

Answer 63

Most of its exec time is I/O

Answer 64

Data saved in disks after each iteration, Creating 3 chunk replicas (3 by default), Fault Tolerant

Answer 65

Creating ROM style RAM disks

Answer 66

Minimizes page update operations Resilient Distributed Dataset

Answer 67

RDD + Programming Interface

Answer 68

Restricted form of a distributed shared memory

Answer 69

Immutable, partitioned collection of records; Read only; Distributed over a cluster of many nodes; Two data flows: disk to RDD and RDD to RDD

Answer 70

Lineage -> history of executions; Disk-based checkpoints; Re-execute steps from failures
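
Lineage-based recovery can be sketched with a toy class that remembers how each dataset was derived and recomputes lost data from its parent. `ToyRDD` and its methods are hypothetical names for illustration, not the Spark API; there are no partitions or disk checkpoints here.

```python
class ToyRDD:
    """Immutable dataset that remembers how it was derived (its lineage)."""

    def __init__(self, source=None, parent=None, fn=None):
        self._source = source  # loader for a root dataset (the "disk" read)
        self._parent = parent  # the dataset this one was derived from
        self._fn = fn          # the transformation applied to the parent
        self._cache = None     # in-memory copy that a failure can wipe

    def map(self, f):
        return ToyRDD(parent=self, fn=lambda rows: [f(r) for r in rows])

    def filter(self, pred):
        return ToyRDD(parent=self, fn=lambda rows: [r for r in rows if pred(r)])

    def collect(self):
        if self._cache is None:
            if self._parent is None:
                self._cache = list(self._source())
            else:
                # Re-execute this step from the parent's data.
                self._cache = self._fn(self._parent.collect())
        return list(self._cache)

    def lose_data(self):
        """Simulate a node failure that drops the in-memory copy."""
        self._cache = None

base = ToyRDD(source=lambda: range(1, 6))
squares = base.map(lambda x: x * x)
evens = squares.filter(lambda x: x % 2 == 0)
first = evens.collect()
evens.lose_data()
squares.lose_data()
again = evens.collect()  # rebuilt by replaying the lineage
print(first, again)
```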

Answer 71

Tasks are scheduled according to their arrival time

Answer 72

Tasks are scheduled according to their Duration

Answer 73

Each task is given a certain duration of time to run
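
The three policies in the preceding cards (by arrival time, by duration, and by fixed time slice) can be compared with a small sketch that returns the order in which tasks finish. The task tuples and the quantum are made-up values, and for simplicity the duration-based and time-slice versions assume every task is already waiting in the queue.

```python
from collections import deque

def fifo(tasks):
    """tasks: list of (name, arrival, duration). Run in arrival order."""
    return [name for name, _, _ in sorted(tasks, key=lambda t: t[1])]

def shortest_job_first(tasks):
    """Run the shortest task first (non-preemptive)."""
    return [name for name, _, _ in sorted(tasks, key=lambda t: t[2])]

def round_robin(tasks, quantum):
    """Give each task `quantum` time units per turn until it finishes."""
    ordered = sorted(tasks, key=lambda t: t[1])
    queue = deque((name, duration) for name, _, duration in ordered)
    finished = []
    while queue:
        name, remaining = queue.popleft()
        if remaining <= quantum:
            finished.append(name)
        else:
            # Not done yet: go to the back of the queue with less work left.
            queue.append((name, remaining - quantum))
    return finished

tasks = [("A", 0, 5), ("B", 1, 2), ("C", 2, 1)]
print(fifo(tasks))
print(shortest_job_first(tasks))
print(round_robin(tasks, quantum=2))
```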

Answer 74

Batch applications

Answer 75

Interactive applications

Answer 76

FIFO Scheduler; Capacity Scheduler; Fair Scheduler; Delay Scheduler

Answer 77

Simple; Predictable; Fair; Preserves order

Answer 78

Lack of prioritization; Stalling; Inflexible

Answer 79

A scheduler with multiple queues, each queue having a soft limit on its minimum portion of cluster capacity

Answer 80

The ability to dynamically adjust the allocation of resources to different queues or applications

Answer 81

Complex; Overhead is high; Potential for resource fragmentation

Answer 82

All jobs get an equal share of resources

Answer 83

It divides clusters into pools, and then divides the resources equally among the pools

Answer 84

Fair share scheduling FIFO

Answer 85

Preemptive

Answer 86

Take resources away from other pools; Currently running tasks will be killed and rescheduled; Victim tasks are selected from those that just started

Answer 87

Doesn't support data locality

Answer 88

Better Fair scheduler, has a relaxed queuing policy that makes jobs wait for a limited time to find idle machines with data locality
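
The "wait for a limited time" idea can be sketched as a skip counter: a job declines non-local machines until it has been skipped a set number of times, then accepts any idle machine. The dictionary layout, the `max_skips` parameter, and the node names are assumptions for illustration, not the Hadoop implementation.

```python
def delay_schedule(job, free_nodes, max_skips):
    """Return (node, is_local), or None if the job should keep waiting.

    job: dict with 'preferred' (set of nodes holding its data) and 'skips'.
    """
    if not free_nodes:
        return None
    for node in free_nodes:
        if node in job["preferred"]:
            job["skips"] = 0
            return node, True          # data-local placement
    if job["skips"] >= max_skips:
        job["skips"] = 0
        return free_nodes[0], False    # waited long enough: run non-locally
    job["skips"] += 1                  # let another job use this slot
    return None

job = {"preferred": {"n1"}, "skips": 0}
print(delay_schedule(job, ["n2"], max_skips=2))
print(delay_schedule(job, ["n2", "n1"], max_skips=2))
```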

Answer 89

Improve performance

Answer 90

Monolithic Scheduler

Answer 91

Statically Partitioned; Two-Level; Shared-State

Answer 92

A single centralized scheduler

Answer 93

Applies the same scheduling algorithm to all incoming jobs

Answer 94

Centralized control, scheduler knows everything, optimal scheduling decision

Answer 95

Single code base; Difficult to add new scheduling policies; Increase in code complexity; Scheduler becomes bottleneck; Not suitable for large cluster size

Answer 96

Distributed scheduler used for cluster of multiple applications

Answer 97

Can handle multiple frameworks; Bottleneck from one application will not affect other applications' scheduling

Answer 98

Resource fragmentation; Sub-optimal resource utilization

Answer 99

Application-level scheduler; Resource coordinator

Answer 100

Because it has two levels of scheduling

Answer 101

Does dynamic resource partitioning

Answer 102

Locks resources that are offered to it

Answer 103

Dynamic resource partitioning; High resource utilization

Answer 104

Application schedulers are not omniscient; An app scheduler doesn't know who uses which resource; It can only select or reject an offer

Answer 105

Application schedulers have a replica of cluster state

Answer 106

Better performance

Answer 107

App schedulers often have stale information

Answer 108

Each user gets 1/n of the shared resource

Answer 109

There is a minimum and max of resources that each user gets

Answer 110

Gives weights to users according to importance
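
Equal and weighted sharing can be sketched with one water-filling routine: users who need less than their share keep only what they ask for, and the leftover is redistributed. The function name and the sample demands are hypothetical, and the per-user minimum and maximum guarantees from the earlier card are not modelled.

```python
def weighted_max_min(capacity, demands, weights=None):
    """Split `capacity` among users by (weighted) max-min fairness."""
    users = list(demands)
    weights = weights or {u: 1.0 for u in users}
    alloc = {u: 0.0 for u in users}
    active = set(users)
    remaining = float(capacity)
    while active and remaining > 1e-9:
        total_w = sum(weights[u] for u in active)
        # Users whose demand fits inside their weighted share of what is left.
        satisfied = {u for u in active
                     if demands[u] <= remaining * weights[u] / total_w}
        if not satisfied:
            # Everyone wants more than their share: split by weight and stop.
            for u in active:
                alloc[u] = remaining * weights[u] / total_w
            break
        for u in satisfied:
            alloc[u] = float(demands[u])
            remaining -= demands[u]
        active -= satisfied
    return alloc

print(weighted_max_min(10, {"A": 2, "B": 4, "C": 10}))
print(weighted_max_min(12, {"A": 10, "B": 10}, weights={"A": 2, "B": 1}))
```

With equal weights and demands that all exceed the share, each of n users ends up with 1/n of the resource.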

Answer 111

Allocates resources in a cluster environment by providing fairness to tasks/jobs based on their dominant resource requirement
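
A minimal sketch of dominant-resource fairness: repeatedly give one more task to the user with the smallest dominant share, where a user's dominant share is the largest fraction of any single resource it holds. The cluster size and per-task demands below are illustrative numbers, and the loop is a simplified variant that drops a user once its next task no longer fits.

```python
def drf(capacity, demands):
    """capacity: {resource: total}; demands: {user: {resource: per-task need}}.

    Returns the number of tasks launched per user.
    """
    used = {r: 0.0 for r in capacity}
    tasks = {u: 0 for u in demands}

    def dominant_share(u):
        return max(tasks[u] * demands[u][r] / capacity[r] for r in capacity)

    candidates = set(demands)
    while candidates:
        # Pick the user with the lowest dominant share (ties broken by name).
        user = min(sorted(candidates), key=dominant_share)
        need = demands[user]
        if all(used[r] + need[r] <= capacity[r] for r in capacity):
            for r in capacity:
                used[r] += need[r]
            tasks[user] += 1
        else:
            candidates.remove(user)
    return tasks

cluster = {"cpu": 9, "mem": 18}
per_task = {"A": {"cpu": 1, "mem": 4}, "B": {"cpu": 3, "mem": 1}}
print(drf(cluster, per_task))
```

Here user A is memory-dominant and user B is CPU-dominant, so the allocation equalizes A's memory share with B's CPU share.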

Answer 112

Not only using SQL for databases

Answer 113

Focused on scalability; No ACID but BASE

Answer 114

Make scalable DBMS for cloud apps

Answer 115

Document-based; Key/Value pair; Column-based; Graph-based

Answer 116

Atomicity; Consistency; Isolation; Durability

Answer 117

Scalability; Availability and eventual consistency; Replication models; Sharding of files; Does not require schema; No declarative query language

Answer 118

Scalability; Availability and eventual consistency; Replication models; Sharding of files

Answer 119

Does not require schema; No declarative query language

Answer 120

The Master

Answer 121

Consistency

Answer 122

Master can be a bottleneck or a SPOF

Answer 123

Performance (fast), HA

Answer 124

Inconsistency or need coordination

Answer 125

Horizontal data distribution over nodes

Answer 126

Hash-based and Range Based

Answer 127

Joining and aggregation

Answer 128

The key determines the partition

Answer 129

Assigns ranges defined over fields to partition
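
The two partitioning schemes can be contrasted with a short sketch. The function names, partition counts, and boundary values are assumptions for illustration; `zlib.crc32` is used as a stable hash because Python's built-in `hash()` for strings varies between processes.

```python
import bisect
import zlib

def hash_partition(key, num_partitions):
    """The key alone determines the partition, via a stable hash."""
    return zlib.crc32(str(key).encode("utf-8")) % num_partitions

def range_partition(key, boundaries):
    """boundaries: sorted split points; partition i holds keys < boundaries[i]."""
    return bisect.bisect_right(boundaries, key)

# Hash-based: spreads keys evenly, but neighbouring keys may land far apart.
print([hash_partition(k, 4) for k in ("user1", "user2", "user3")])

# Range-based: neighbouring keys stay together, which enables range scans.
print([range_partition(k, [100, 200]) for k in (50, 100, 150, 250)])
```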

Answer 130

Even distribution

Answer 131

No data locality

Answer 132

Enable range scan and sorting

Answer 133

Repartitioning and Rebalancing

Answer 134

You can only have 2 of 3 things in a distributed system when sharing data: Consistency, Availability, Partition tolerance

Answer 135

All replicas have the same copy

Answer 136

Reads and writes always succeed

Answer 137

The system continues to operate in the presence of network partition

Answer 138

It will partition at some point

Answer 139

Relaxed consistency

Answer 140

ACID and BASE

Answer 141

All replicas will gradually become consistent in the absence of updates

Answer 142

Amazon DynamoDB

Answer 143

Computation with a deadline

Answer 144

Missing a job deadline can result in system failure

Answer 145

Missing deadlines can result in the degradation of the system's QoS

Answer 146

Unbounded; Push model; Concept of time

Answer 147

Publishers; Subscribers; Message broker; Topics
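
How these four pieces fit together can be sketched with an in-process broker. The `Broker` class and the topic name are hypothetical; a real message broker would add queuing, persistence, and network transport.

```python
from collections import defaultdict

class Broker:
    """Routes each published message to the subscribers of its topic."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        # The publisher knows only the topic, never the subscribers.
        callbacks = self._subscribers.get(topic, [])
        for callback in callbacks:
            callback(message)
        return len(callbacks)

broker = Broker()
received = []
broker.subscribe("sensors/temp", received.append)
delivered = broker.publish("sensors/temp", {"celsius": 21.5})
print(delivered, received)
```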

Answer 148

Simple/flexible, Scalable, Net efficiency

Answer 149

Simple/Flexible, inherently limited

Answer 150

All or most of the data in the data set

Answer 151

Within a rolling time window or most recent data record
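
A rolling time window can be sketched as a queue that evicts records older than the window each time a new record arrives. The class name and the window length are hypothetical, and timestamps are assumed to arrive in non-decreasing order.

```python
from collections import deque

class RollingWindowAverage:
    """Average of the values seen in the last `window_seconds`."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self._events = deque()
        self._total = 0.0

    def add(self, timestamp, value):
        self._events.append((timestamp, value))
        self._total += value
        # Drop records that have fallen out of the window.
        while self._events[0][0] <= timestamp - self.window:
            _, old = self._events.popleft()
            self._total -= old
        return self._total / len(self._events)

avg = RollingWindowAverage(window_seconds=10)
print(avg.add(0, 10))
print(avg.add(5, 20))
print(avg.add(12, 30))  # the record at t=0 is no longer in the window
```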

Answer 152

Latencies in minutes to hours

Answer 153

Complex analytics

Answer 154

Very small, individual records or micro batches

Answer 155

Latency in the order of seconds or milliseconds

Answer 156

Simple response functions, aggregates, and rolling metrics

Answer 157

First production-ready, well-adopted stream processor; High compatibility; Low level; Super fast

Answer 158

Data Source -> Message Queue -> Stream Processor -> Batch -> Application

Answer 159

Sources of data for topology; Receives data from message queue; Emits tuples to bolts

Answer 160

Core unit of computation; Emits outgoing tuples

Answer 161

Stream message, i.e., a collection of data
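
The roles described in the last three cards (a source that emits tuples, processing units that consume and emit tuples) can be sketched with Python generators. This is not the Apache Storm API: the function names are hypothetical and the list `queue` stands in for a message queue.

```python
def sentence_spout(messages):
    """Source of the topology: reads from the queue and emits tuples."""
    for msg in messages:
        yield (msg,)

def split_bolt(tuples):
    """Unit of computation: one sentence tuple in, many word tuples out."""
    for (sentence,) in tuples:
        for word in sentence.split():
            yield (word,)

def count_bolt(tuples):
    """Keeps a running count and emits an updated (word, count) tuple."""
    counts = {}
    for (word,) in tuples:
        counts[word] = counts.get(word, 0) + 1
        yield (word, counts[word])

queue = ["storm is fast", "storm is low level"]
emitted = list(count_bolt(split_bolt(sentence_spout(queue))))
print(emitted)
```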


This is it (196 cards)