ML Data Eng Flashcards

1
Q

S3 Standard storage class

A

Frequently accessed data
Low latency and high throughput
Can sustain 2 concurrent facility failures

Use cases:
Big data analytics
Mobile gaming applications
Content distribution

2
Q

S3 Standard IA

A

Less frequently accessed, but requires rapid access when needed
Lower cost than Standard
Minimum storage duration of 30 days
99.9% availability

Use case:
Disaster recovery, backups

3
Q

S3 One Zone-IA

A

High durability, but within a single AZ
Data lost if the AZ is destroyed
Minimum storage duration of 30 days
99.5% availability

Use cases:
Store backups of on-premises data, or data you can re-create

4
Q

S3 Glacier 3 types

A

Low cost: price for storage + object retrieval

3 types:

Instant Retrieval:
Millisecond retrieval → great for data accessed once a quarter
Minimum storage duration of 90 days

Flexible Retrieval:
Expedited (1 to 5 mins)
Standard (3 to 5 hours)
Bulk (5 to 12 hours)
Minimum storage duration of 90 days

Deep Archive:
Standard: 12 hrs
Bulk: 48 hrs
Minimum storage duration of 180 days

Use cases:
Archiving and backup

5
Q

What is Intelligent tiering?

A

Small monitoring and auto-tiering charges

Moves objects automatically between access tiers based on usage

There are no retrieval charges in S3 Intelligent-Tiering

Tiers:
Frequent Access (automatic): default
Infrequent Access (automatic): objects not accessed for 30 days
Archive Instant Access (automatic): objects not accessed for 90 days
Archive Access tier (optional): configurable from 90 days to 700+ days
Deep Archive Access tier (optional): configurable from 180 days to 700+ days
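The two automatic tiers need no setup; the optional archive tiers are opted into per bucket. A minimal boto3 sketch, with bucket name and configuration ID as placeholders:

```python
import boto3

s3 = boto3.client("s3")

# Opt in to the optional archive tiers; the automatic tiers need no config.
s3.put_bucket_intelligent_tiering_configuration(
    Bucket="my-example-bucket",  # placeholder
    Id="archive-config",         # placeholder
    IntelligentTieringConfiguration={
        "Id": "archive-config",
        "Status": "Enabled",
        "Tierings": [
            {"Days": 90, "AccessTier": "ARCHIVE_ACCESS"},
            {"Days": 180, "AccessTier": "DEEP_ARCHIVE_ACCESS"},
        ],
    },
)
```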

6
Q

Types of lifecycle rules?

A

Transition actions
E.g. move objects to IA 60 days after creation

Expiration actions
E.g. delete access logs after 365 days
E.g. delete old versions of files
E.g. delete incomplete multipart uploads

Rules can be specified for a certain prefix or for certain tags
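A minimal boto3 sketch of both rule types, scoped to a prefix; bucket name and prefix are placeholders:

```python
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-example-bucket",  # placeholder
    LifecycleConfiguration={
        "Rules": [
            {   # Transition action: move objects to IA 60 days after creation
                "ID": "to-ia-after-60-days",
                "Filter": {"Prefix": "logs/"},  # rule scoped to a prefix
                "Status": "Enabled",
                "Transitions": [{"Days": 60, "StorageClass": "STANDARD_IA"}],
            },
            {   # Expiration actions: delete old logs, clean up incomplete uploads
                "ID": "expire-old-logs",
                "Filter": {"Prefix": "logs/"},
                "Status": "Enabled",
                "Expiration": {"Days": 365},
                "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
            },
        ]
    },
)
```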

7
Q

What is Amazon S3 Analytics?

A

Helps you decide when to transition objects to the right storage class

Recommended for Standard and Standard IA
–> Does NOT work for One-Zone IA or Glacier

Report is updated daily as a CSV
24 to 48 hours to start seeing data analysis

8
Q

S3 4 ways to encrypt?

A

SSE-S3: S3 manages the encryption keys

SSE-KMS: keys managed in KMS
Additional security (user must have access to the KMS key → we can control access to the key)
Audit trail for KMS key usage

SSE-C: you manage your own keys

Client-side encryption
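A boto3 sketch of the three server-side options at upload time (bucket, object keys, and KMS alias are placeholders; client-side encryption happens before the upload, so it isn't shown):

```python
import boto3

s3 = boto3.client("s3")

# SSE-S3: S3 manages the keys
s3.put_object(Bucket="my-example-bucket", Key="a.txt", Body=b"data",
              ServerSideEncryption="AES256")

# SSE-KMS: encrypt with a KMS key (caller needs access to that key)
s3.put_object(Bucket="my-example-bucket", Key="b.txt", Body=b"data",
              ServerSideEncryption="aws:kms",
              SSEKMSKeyId="alias/my-key")  # placeholder key alias

# SSE-C: you supply the key on every request (HTTPS required)
s3.put_object(Bucket="my-example-bucket", Key="c.txt", Body=b"data",
              SSECustomerAlgorithm="AES256",
              SSECustomerKey=b"0" * 32)  # placeholder 256-bit key
```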

9
Q

S3 Security

A

User-based → IAM policies

Resource-based → bucket policies and ACLs (object and bucket level)

Bucket policy → e.g. to force encryption at upload

Default encryption
Can set this so AWS encrypts on upload → won't encrypt existing objects

VPC Gateway Endpoint
So traffic doesn't go over the internet

Logging and audit
Access logs can be stored in another S3 bucket

CloudTrail for API calls

Tags to control access
E.g. classification = PHI data
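As a sketch of the "force encryption at upload" idea: a bucket policy that denies any PutObject not requesting SSE-KMS, applied with boto3 (bucket name is a placeholder):

```python
import json
import boto3

s3 = boto3.client("s3")

# Deny any upload that doesn't request SSE-KMS encryption
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "DenyUnencryptedUploads",
        "Effect": "Deny",
        "Principal": "*",
        "Action": "s3:PutObject",
        "Resource": "arn:aws:s3:::my-example-bucket/*",  # placeholder
        "Condition": {
            "StringNotEquals": {"s3:x-amz-server-side-encryption": "aws:kms"}
        },
    }],
}

s3.put_bucket_policy(Bucket="my-example-bucket", Policy=json.dumps(policy))
```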

10
Q

Kinesis overview

A

Managed alternative to Kafka

Application logs, metrics, IoT, clickstreams

“Real-time”

Great for stream processing frameworks (Spark, NiFi, etc.)

Data is automatically replicated across 3 AZs

11
Q

Kinesis Data Streams - what are shards and partitions?

A

A stream is made of shards; records are distributed across shards by partition key
- Retention is 24 hrs by default, up to 365 days
- Ability to reprocess / replay data
- Multiple applications can consume the same data
- Once data is inserted into Kinesis, it can't be deleted (immutability)
- Records can be up to 1 MB

12
Q

Kinesis Data Streams use cases?

A

Application logs
Metrics
IoT
Clickstreams

13
Q

Kinesis Data Streams modes?

A

Provisioned:
You choose the number of shards
Each shard gets 1 MB/s or 1,000 records/s in, and 2 MB/s out (shared across consumers in classic mode, per consumer in enhanced fan-out mode)
You pay per shard per hour

On-demand:
No need to provision or manage capacity
Default capacity provided (4 MB/s in or 4,000 records/s)
Scales automatically based on observed throughput over the last 30 days
Pay per stream per hour, plus data in/out per GB
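A boto3 sketch creating one stream in each mode; stream names are placeholders:

```python
import boto3

kinesis = boto3.client("kinesis")

# Provisioned: you choose the shard count and pay per shard per hour
kinesis.create_stream(
    StreamName="clickstream-provisioned",  # placeholder
    ShardCount=2,
    StreamModeDetails={"StreamMode": "PROVISIONED"},
)

# On-demand: no shards to manage; capacity scales with observed throughput
kinesis.create_stream(
    StreamName="clickstream-on-demand",  # placeholder
    StreamModeDetails={"StreamMode": "ON_DEMAND"},
)
```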

14
Q

Kinesis Data Stream Producer?

A

Can send 1 MB/s or 1,000 records/s per shard
–> exceeding this throws a ProvisionedThroughputExceededException
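A producer sketch with a simple backoff when the per-shard limit is hit; stream name and payload are placeholders:

```python
import time
import boto3

kinesis = boto3.client("kinesis")

def send(record: bytes, device_id: str, retries: int = 3) -> None:
    """Send one record; back off if the 1 MB/s / 1,000 records/s limit is hit."""
    for attempt in range(retries):
        try:
            kinesis.put_record(
                StreamName="clickstream",  # placeholder
                Data=record,
                PartitionKey=device_id,    # determines the target shard
            )
            return
        except kinesis.exceptions.ProvisionedThroughputExceededException:
            time.sleep(2 ** attempt)       # exponential backoff, then retry

send(b'{"event": "click"}', device_id="device-42")
```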

15
Q

Kinesis Data Stream Consumer?

A

2 MB/s read per shard, shared across all consumers
5 GetRecords API calls per second per shard, shared across all consumers
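A classic (shared-throughput) consumer sketch for a single shard; stream name and shard ID are placeholders:

```python
import boto3

kinesis = boto3.client("kinesis")

# Start reading one shard from the oldest available record
shard_iterator = kinesis.get_shard_iterator(
    StreamName="clickstream",          # placeholder
    ShardId="shardId-000000000000",    # placeholder shard ID
    ShardIteratorType="TRIM_HORIZON",  # oldest record in the shard
)["ShardIterator"]

while shard_iterator:
    resp = kinesis.get_records(ShardIterator=shard_iterator, Limit=100)
    for record in resp["Records"]:
        print(record["Data"])
    if not resp["Records"]:
        break                          # caught up; a real consumer keeps polling
    shard_iterator = resp.get("NextShardIterator")
```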

16
Q

Kinesis Data Stream retention?

A

24 hrs by default
Max 1 year

17
Q

Kinesis Analytics - what is it?

A

Perform real-time analytics on streams using SQL

Takes data from Kinesis Data Streams or from Firehose

You have:
An Input Stream
Reference table to join to stream
Output Stream
Error Stream
SQL statement

18
Q

Kinesis Analytics output destinations

A

Kinesis Data Streams

Firehose
E.g. to S3 in JSON or CSV

Lambda

19
Q

Kinesis Analytics Use Cases?

A

Streaming ETL→ select columns, make simple transformations on streaming data (e.g. to reduce size)

Continuous metric generation: live leaderboard for a mobile game

Responsive analytics: look for certain criteria and build alerting (filtering)

20
Q

Kinesis Analytics features?

A

Pay for resources consumed (not cheap)

Serverless, scales automatically

Use IAM permissions to access streaming source

SQL or Flink to write computations

Schema discovery

Lambda for preprocessing

21
Q

What ML algos are available in Kinesis Analytics?

A

RANDOM_CUT_FOREST
SQL function used for anomaly detection on numeric columns in a stream
E.g. detect anomalous subway ridership during the NYC marathon
Uses recent history to compute the model

HOTSPOTS
Locates and returns information about relatively dense regions in your data
E.g. a collection of overheated servers in a data centre
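For reference, RANDOM_CUT_FOREST is typically used in a pump that feeds a destination stream. A rough sketch of that SQL, held in a Python string since it would be pasted into the SQL editor; stream and column names are placeholders ("ridership" stands in for the numeric column being scored):

```python
# Roughly the documented RANDOM_CUT_FOREST pattern for anomaly scoring;
# adapt stream/column names to your application before pasting into the editor.
ANOMALY_SQL = """
CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" (
    "ridership"     INTEGER,
    "ANOMALY_SCORE" DOUBLE
);

CREATE OR REPLACE PUMP "STREAM_PUMP" AS
INSERT INTO "DESTINATION_SQL_STREAM"
SELECT STREAM "ridership", "ANOMALY_SCORE"
FROM TABLE(
    RANDOM_CUT_FOREST(
        CURSOR(SELECT STREAM "ridership" FROM "SOURCE_SQL_STREAM_001")
    )
);
"""

print(ANOMALY_SQL)
```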

22
Q

What editor is available in the console for Kinesis Analytics?

A

There is Kinesis Data Analytics Studio, where you can use Flink and create a streaming application

There are also SQL applications, which let you write SQL applied directly to a Kinesis Data Stream or to Firehose:
- It discovers the schema
- You can choose SQL from a template

23
Q

Kinesis Firehose overview

A

Load streams into S3, Redshift, ElasticSearch, or Splunk

Stores data in the target destination

Records up to 1 MB

Lambda for transformation

Batch writes into the target destination → near real time → 60 seconds minimum latency
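A minimal producer sketch against the Firehose API; delivery stream name and payload are placeholders:

```python
import boto3

firehose = boto3.client("firehose")

# Send one record (max 1 MB); Firehose buffers and batch-writes to the destination
firehose.put_record(
    DeliveryStreamName="logs-to-s3",  # placeholder
    Record={"Data": b'{"level": "INFO", "msg": "hello"}\n'},
)
```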

24
Q

Kinesis Firehose Producers

A

Applications
Client
SDK, KPL
Kinesis Agent
Kinesis Data Streams
CloudWatch (logs and events)
IoT

25
Q

Kinesis Firehose Destinations

A

S3
Redshift –> First writes into S3, and then issues a COPY command
ElasticSearch
Splunk

Datadog
New Relic
MongoDB

HTTP Endpoint

26
Q

Kinesis Firehose other features?

A

Can send failed and/or all data to an S3 backup bucket
–> covers transformation and delivery failures

Fully managed

Automatic scaling

Can convert to Parquet or ORC format
–> also supports compression when writing to S3
–> GZIP, ZIP, Snappy, Hadoop-compatible Snappy
–> as well as add partitioning

Records up to 1 MB in

Buffer interval of 60 seconds minimum → buffer size of 1 MB minimum, 128 MB maximum

You pay for the amount of data going through Firehose
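A boto3 sketch of an S3 delivery stream with explicit buffering hints and GZIP compression; ARNs and names are placeholders:

```python
import boto3

firehose = boto3.client("firehose")

firehose.create_delivery_stream(
    DeliveryStreamName="logs-to-s3",  # placeholder
    DeliveryStreamType="DirectPut",
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-role",  # placeholder
        "BucketARN": "arn:aws:s3:::my-example-bucket",              # placeholder
        "Prefix": "logs/",
        # Flush whichever fills first: 1 MB of data or 60 seconds elapsed
        "BufferingHints": {"SizeInMBs": 1, "IntervalInSeconds": 60},
        "CompressionFormat": "GZIP",
    },
)
```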

27
Q

Data Streams vs firehose?

A

Streams:
Custom code (producer and consumer)
Real time
Automatic scaling with On-demand mode
Data storage for 1 to 365 days, replay, multi consumers

Firehose:
Fully managed; sends to S3, Splunk, Redshift, ElasticSearch
Transformation with Lambda
Near real time → lowest buffer time is 1 min
Automated scaling
No data storage

28
Q

What is Kinesis Video Streams for?

A

Meant for streaming video in real time

29
Q

Kinesis Video Streams Producers?

A

Security camera, body-worn camera, AWS DeepLens, smartphone camera, audio feeds, RADAR data, RTSP camera

Producer SDK

→ one producer per video stream, e.g. if 100 cameras, then 100 video streams

30
Q

Kinesis Video Streams features

A

Video playback capability

Data retention from 1hr up to 10 years

31
Q

Kinesis Video Streams Consumers?

A

Rekognition Video
SageMaker
EC2 consumers: TensorFlow, MXNet → i.e. build your own

32
Q

Kinesis Video Streams use cases?

A
  • Kinesis Video Stream
  • Consumed in real time by a Docker container on Fargate
  • Which checkpoints the stream position and processing status into DynamoDB
  • So if consumption is stopped, it can pick up where it left off
  • It sends the decoded frames to SageMaker for ML-based inference
  • And publishes inference results to Kinesis Data Streams
  • Which trigger Lambda for notifications in real time
  • E.g. use this to detect a burglar in your house
33
Q

Example set up

A

Data → Data Streams → Analytics → Firehose → S3 or Redshift

34
Q

Kinesis summary

A

Data streams: create real-time ML application

Firehose: ingest massive amount of data near real time

Analytics: real-time ETL or ML algorithms on streams

Video Streams: real-time video stream to create ML applications

35
Q

Up to Glue section

A