ML Data Eng Flashcards
S3 Standard storage class
Frequently accessed data
Low latency and high throughput
Can sustain 2 concurrent facility failures
Use cases:
Big data analytics
Mobile gaming applications
Content distribution
S3 Standard-IA (Infrequent Access)
Less frequently accessed, but requires rapid access when needed
Lower cost than standard
Minimum storage 30 days
99.9% availability
Use case:
Disaster recovery, backups
S3 One Zone-IA
High durability, but within a single AZ
Data lost if AZ is destroyed
Minimum storage 30 days
99.5% availability
Use cases:
Store secondary backups of on-premises data, or data you can re-create
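Where this matters in practice: the storage class is set per object at upload time. A minimal boto3 sketch (bucket and key names are hypothetical):

```python
import boto3

s3 = boto3.client("s3")

# Upload a re-creatable backup straight into One Zone-IA
# ("my-backup-bucket" and the key are placeholders)
with open("daily.tar.gz", "rb") as f:
    s3.put_object(
        Bucket="my-backup-bucket",
        Key="backups/daily.tar.gz",
        Body=f,
        StorageClass="ONEZONE_IA",  # e.g. "STANDARD_IA", "GLACIER_IR", "DEEP_ARCHIVE"
    )
```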
S3 Glacier (3 types)
Low cost: price for storage + object retrieval
3 types:
Instant retrieval:
Millisecond retrieval → great for data accessed once a quarter
Minimum storage duration of 90 days
Flexible retrieval:
Expedited (1 to 5 mins)
Standard (3 to 5 hours)
Bulk (5 to 12 hours)
Minimum storage of 90 days
Deep archive:
Standard - 12 hrs
Bulk - 48 hrs
Min storage duration is 180 days
Use cases:
Archiving and backup
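Objects in Flexible Retrieval and Deep Archive must be restored before they can be read; the restore tier maps to the speeds above. A hedged boto3 sketch (bucket and key names are placeholders):

```python
import boto3

s3 = boto3.client("s3")

# Request a temporary restored copy of an archived object
s3.restore_object(
    Bucket="my-archive-bucket",       # placeholder
    Key="logs/2020/archive.tar.gz",   # placeholder
    RestoreRequest={
        "Days": 7,  # how long the restored copy stays readable
        "GlacierJobParameters": {
            "Tier": "Standard",  # "Expedited" | "Standard" | "Bulk"
        },
    },
)
```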
What is S3 Intelligent-Tiering?
Small monitoring and auto-tiering charges
Moves objects automatically between access tiers based on usage
There are no retrieval charges in S3 Intelligent-Tiering
Tiers:
Frequent Access (automatic): default
Infrequent Access (automatic): objects not accessed for 30 days
Archive Instant Access (automatic): objects not accessed for 90 days
Archive Access tier (optional): configurable from 90 days to 700+ days
Deep Archive Access tier (optional): configurable from 180 days to 700+ days
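The two optional archive tiers have to be opted into per bucket. A sketch using boto3, assuming a hypothetical bucket, config id, and prefix:

```python
import boto3

s3 = boto3.client("s3")

# Opt in to the optional archive tiers for objects under a prefix
s3.put_bucket_intelligent_tiering_configuration(
    Bucket="my-data-bucket",        # placeholder
    Id="archive-old-objects",       # placeholder config id
    IntelligentTieringConfiguration={
        "Id": "archive-old-objects",
        "Filter": {"Prefix": "raw/"},
        "Status": "Enabled",
        "Tierings": [
            {"Days": 90, "AccessTier": "ARCHIVE_ACCESS"},
            {"Days": 180, "AccessTier": "DEEP_ARCHIVE_ACCESS"},
        ],
    },
)
```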
Types of lifecycle rules?
Transition actions
E.g. move objects to IA 60 days after creation
Expiration actions
E.g. delete access logs after 365 days
E.g. Delete old versions of files
E.g. Delete incomplete multi-part uploads
Rules can be scoped to a certain prefix or to certain object tags (see the sketch below)
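A sketch of the examples above combined into one boto3 lifecycle configuration (bucket name and prefix are hypothetical):

```python
import boto3

s3 = boto3.client("s3")

# One rule combining a transition, an expiration, and multipart-upload
# cleanup, scoped to a prefix
s3.put_bucket_lifecycle_configuration(
    Bucket="my-logs-bucket",  # placeholder
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "logs-rule",
                "Filter": {"Prefix": "logs/"},
                "Status": "Enabled",
                # Transition action: move to Standard-IA 60 days after creation
                "Transitions": [{"Days": 60, "StorageClass": "STANDARD_IA"}],
                # Expiration action: delete logs after 365 days
                "Expiration": {"Days": 365},
                # Clean up incomplete multi-part uploads
                "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
            }
        ]
    },
)
```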
What is Amazon S3 Analytics?
Helps you decide when to transition objects to the right storage class
Gives recommendations for Standard and Standard-IA
–> Does NOT work for One-Zone IA or Glacier
Report is updated daily as a CSV
24 to 48 hours to start seeing data analysis
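Enabling it via boto3 might look like the following sketch (bucket names, the report ARN, and the config id are placeholders):

```python
import boto3

s3 = boto3.client("s3")

# Enable storage class analysis, exporting the daily CSV report
# to a second bucket
s3.put_bucket_analytics_configuration(
    Bucket="my-data-bucket",    # placeholder
    Id="transition-analysis",   # placeholder config id
    AnalyticsConfiguration={
        "Id": "transition-analysis",
        "StorageClassAnalysis": {
            "DataExport": {
                "OutputSchemaVersion": "V_1",
                "Destination": {
                    "S3BucketDestination": {
                        "Format": "CSV",
                        "Bucket": "arn:aws:s3:::my-reports-bucket",  # placeholder
                        "Prefix": "analytics/",
                    }
                },
            }
        },
    },
)
```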
S3 4 ways to encrypt?
SSE-S3 - S3 handles the keys
SSE-KMS - using KMS
Additional security (user must have access to KMS key → we can control access to key)
Audit trail for KMS key
SSE-C - you manage your own keys (sent to S3 in HTTPS headers with each request)
Client-side encryption - data is encrypted before upload
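For the server-side options, encryption is requested per upload via request parameters. A minimal boto3 sketch (bucket, key, and KMS alias are hypothetical):

```python
import boto3

s3 = boto3.client("s3")

# SSE-S3: S3 manages the key
s3.put_object(
    Bucket="my-bucket", Key="a.txt", Body=b"data",
    ServerSideEncryption="AES256",
)

# SSE-KMS: encrypt with a specific KMS key (alias is a placeholder)
s3.put_object(
    Bucket="my-bucket", Key="b.txt", Body=b"data",
    ServerSideEncryption="aws:kms",
    SSEKMSKeyId="alias/my-app-key",
)
```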
S3 Security
User based → IAM
Resource based → bucket policy and ACLs (object and bucket level)
Bucket policy → e.g. to force encryption at upload (see the sketch below)
Default encryption
Can set this so AWS encrypts on upload → won’t encrypt existing objects
VPC Gateway Endpoint
So traffic doesn’t go over internet
Logging and audit
Access logs can be stored in another S3 bucket
CloudTrail for API calls
Tags to control access
E.g. classification = PHI data
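A sketch of the "force encryption at upload" bucket policy mentioned above, applied with boto3 (the bucket name is a placeholder):

```python
import json
import boto3

s3 = boto3.client("s3")

# Deny any PutObject request that doesn't declare server-side encryption
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "DenyUnencryptedUploads",
        "Effect": "Deny",
        "Principal": "*",
        "Action": "s3:PutObject",
        "Resource": "arn:aws:s3:::my-bucket/*",  # placeholder
        "Condition": {
            # true when the encryption header is absent
            "Null": {"s3:x-amz-server-side-encryption": "true"}
        },
    }],
}
s3.put_bucket_policy(Bucket="my-bucket", Policy=json.dumps(policy))
```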
Kinesis overview
Managed alternative to Kafka
Application logs, metrics, IoT, clickstreams
“Real-time”
Great for stream processing frameworks (Spark, NiFi, etc)
Data is automatically replicated across 3 AZs
Kinesis Data Streams - what are shards and partitions?
A stream is made of shards; records are distributed to shards by partition key
- Retention is 24 hrs by default, up to 365 days
- Ability to reprocess / replay data
- Multiple applications can consume data
- Once data is inserted into Kinesis, it can’t be deleted (immutability)
- Records can be up to 1MB
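A minimal producer sketch showing the partition key's role (stream name and key are hypothetical); records sharing a partition key hash to the same shard, which preserves their ordering:

```python
import json
import boto3

kinesis = boto3.client("kinesis")

# All records for "device-1234" land on the same shard
kinesis.put_record(
    StreamName="clickstream",  # placeholder
    Data=json.dumps({"event": "click", "page": "/home"}).encode(),
    PartitionKey="device-1234",
)
```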
Kinesis Data Streams use cases?
Application logs
metrics
IoT
clickstreams
Kinesis Data Streams modes?
Provisioned:
You choose the number of shards
Each shard gets 1 MB/s or 1,000 records/s in, and 2 MB/s out (shared across consumers in classic mode, or per consumer with enhanced fan-out)
You pay per shard per hour
On-demand:
No need to provision or manage
Default capacity provided (4 MB/s in or 4,000 records/s)
Scales automatically based on the peak throughput observed over the last 30 days
Pay per stream, per hour and data in/out per GB
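Creating a stream in either mode with boto3 (stream names are placeholders):

```python
import boto3

kinesis = boto3.client("kinesis")

# Provisioned mode: you pick the shard count (and pay per shard-hour)
kinesis.create_stream(StreamName="metrics-provisioned", ShardCount=2)

# On-demand mode: no shard count; capacity scales automatically
kinesis.create_stream(
    StreamName="metrics-on-demand",
    StreamModeDetails={"StreamMode": "ON_DEMAND"},
)
```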
Kinesis Data Stream Producer?
Can send 1 MB/s or 1,000 records/s per shard
→ exceeding this raises “ProvisionedThroughputExceededException” (see the retry sketch below)
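A common producer pattern is to back off and retry on that exception; a sketch, assuming a hypothetical stream name:

```python
import time
import boto3

kinesis = boto3.client("kinesis")

def put_with_backoff(data: bytes, key: str, retries: int = 5):
    """Retry with exponential backoff when a shard's write limit is exceeded."""
    for attempt in range(retries):
        try:
            return kinesis.put_record(
                StreamName="clickstream",  # placeholder
                Data=data,
                PartitionKey=key,
            )
        except kinesis.exceptions.ProvisionedThroughputExceededException:
            time.sleep(2 ** attempt * 0.1)  # back off before retrying
    raise RuntimeError("shard write limit still exceeded after retries")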
Kinesis Data Stream Consumer?
2 MB/s read per shard, shared across all consumers (classic mode; enhanced fan-out gives 2 MB/s per consumer)
5 GetRecords API calls per second per shard, shared across all consumers
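A classic shared-throughput consumer polls each shard with GetRecords; a minimal sketch (stream and shard ids are placeholders) that also respects the 5 calls/s limit:

```python
import time
import boto3

kinesis = boto3.client("kinesis")

# Get an iterator starting at the oldest retained record in one shard
it = kinesis.get_shard_iterator(
    StreamName="clickstream",            # placeholder
    ShardId="shardId-000000000000",      # placeholder
    ShardIteratorType="TRIM_HORIZON",
)["ShardIterator"]

while it:
    resp = kinesis.get_records(ShardIterator=it, Limit=100)
    for record in resp["Records"]:
        print(record["Data"])
    it = resp.get("NextShardIterator")  # None when the shard is closed
    time.sleep(0.2)  # stay under 5 GetRecords calls/s per shard
```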