Data Engineering Flashcards

1
Q

S3 key?

A

it’s the object’s full path within the bucket:

everything after the bucket name, all the way to the file extension
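For illustration, splitting a hypothetical S3 URI into its bucket and key:

```python
# Split an S3 URI (bucket and path are made up) into bucket name and key.
# The key is everything after the bucket name, down to the file extension.
def split_s3_uri(uri: str):
    assert uri.startswith("s3://"), "not an S3 URI"
    bucket, _, key = uri[len("s3://"):].partition("/")
    return bucket, key

print(split_s3_uri("s3://my-bucket/logs/2024/app.log"))
# ('my-bucket', 'logs/2024/app.log')
```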

2
Q

max S3 file size?

A

5TB

3
Q

S3 object tag use cases?

A

It’s a key/value pair attached to an object

Lifecycle rules
Data classification
Security (tag-based access policies)

4
Q

S3 covers storage. Name some compute services that work with it?

A
EC2
Amazon Athena
Amazon Redshift Spectrum
Rekognition
AWS Glue
5
Q

Data partitioning on S3: how and why?

A

s3://bucket/dataset/year=…/month=…/day=…/

to speed up range queries: engines scan only the matching partitions
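A minimal sketch of building such a partition prefix in Python (bucket and dataset names are made up):

```python
from datetime import date

# Build a Hive-style partitioned key prefix (year=/month=/day=) for a record date.
def partition_prefix(d: date) -> str:
    return f"year={d.year}/month={d.month:02d}/day={d.day:02d}/"

print("s3://my-bucket/sales/" + partition_prefix(date(2024, 3, 7)))
# s3://my-bucket/sales/year=2024/month=03/day=07/
```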

6
Q

S3 Encryption options

A

SSE-S3
SSE-KMS
SSE-C
CSE

7
Q

S3 Access

A

User-based:
- IAM policies

Resource-based:
- Bucket policies
- ACLs
8
Q

What if we do not want to move the data in S3 over the internet?

A

use a VPC Gateway Endpoint, so traffic stays on the AWS network

9
Q

S3 logs

A

S3 Access logs in another S3 bucket

API calls in CloudTrail

10
Q

Can you do S3 policy based on the tags?

A

Yes.
Add a tag such as classification=PHI
and restrict access to any object carrying that tag
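A sketch of a tag-conditioned bucket policy; the bucket ARN is hypothetical, and the s3:ExistingObjectTag/&lt;key&gt; condition key is what matches the object tag. The blanket Deny is for illustration only; a real policy would scope the Principal:

```python
import json

# Illustrative bucket policy: deny reads of any object tagged classification=PHI.
# Bucket name is hypothetical; in practice you'd carve out authorized principals.
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Deny",
        "Principal": "*",
        "Action": "s3:GetObject",
        "Resource": "arn:aws:s3:::my-bucket/*",
        "Condition": {
            "StringEquals": {"s3:ExistingObjectTag/classification": "PHI"}
        }
    }]
}
print(json.dumps(policy, indent=2))
```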

11
Q

The Apache alternative to Kinesis?

A

Kafka

12
Q

Kinesis use cases

A

Logs
Metrics
IoT
ClickStream

13
Q

Some streaming frameworks

A

Spark
NiFi
etc…

14
Q

The Kinesis services?

A

KDS (Kinesis Data Streams): low-latency streaming at scale

KDA (Kinesis Data Analytics): real-time analytics on streams using SQL

KDF (Kinesis Data Firehose): load streams into S3, Redshift, Elasticsearch, Splunk

KVS (Kinesis Video Streams): stream video in real time

15
Q

KDS Facts:

  • provision
  • retention
  • replay data
  • consumer quantity
  • edit ingested data
  • record size
A
  • provision shards in advance
  • retention from 24 hours up to 7 days
  • ability to reprocess and replay data
  • multiple consumers can read from the same stream
  • ingested data is immutable (cannot be edited)
  • max record size 1 MB
16
Q

KDS Producer limits

A

1 MB/s or 1,000 records/s per shard
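Those per-shard write limits imply a minimum shard count for a given workload; a back-of-the-envelope calculator (my own sketch, not an AWS API):

```python
import math

# Minimum shard count for the KDS write limits above:
# each shard accepts 1 MB/s or 1,000 records/s, whichever binds first.
def shards_needed(records_per_sec: int, bytes_per_sec: int) -> int:
    by_records = math.ceil(records_per_sec / 1_000)
    by_bytes = math.ceil(bytes_per_sec / 1_000_000)
    return max(by_records, by_bytes, 1)

print(shards_needed(records_per_sec=4_500, bytes_per_sec=2_500_000))  # 5
```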

17
Q

Consumer Classic limits

A

2 MB/s per shard

5 GetRecords API calls/s per shard, shared across all consumers

18
Q

KDF min latency

A

near real-time: minimum buffer interval of 60 seconds
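The 60-second floor comes from Firehose-style buffering: data is flushed when either a size hint or a time interval is reached. A toy model of that rule (thresholds illustrative, not the service defaults):

```python
# Toy model of buffer-then-flush delivery: flush when either the size
# threshold or the time interval is hit, whichever comes first.
def should_flush(buffered_bytes, seconds_waited,
                 size_limit=5 * 1024 * 1024, interval=60):
    return buffered_bytes >= size_limit or seconds_waited >= interval

print(should_flush(1024, 10))   # False: below both thresholds
print(should_flush(1024, 60))   # True: interval reached
```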

19
Q

KDF targets

A

Redshift
Amazon S3
Elasticsearch
Splunk

20
Q

KDF scaling

A

Managed Auto-Scaling

21
Q

KDF Data conversion

A

CSV / JSON > Parquet / ORC

only for S3

22
Q

KDF Data Transformation

A

using a Lambda function

e.g. CSV to JSON

23
Q

KDF Compression

A

when target is S3

GZIP, ZIP, SNAPPY

24
Q

KDF Pricing

A

Pay as you go

25
Q

KDF Sources

A
SDK
KPL
Kinesis Agent
KDS
CloudWatch logs and events
IoT rule actions
26
Q

KDS Latency

A

~200 ms (classic)

~70 ms (enhanced fan-out)

27
Q

KDF vs KDS

A

Ingestion/delivery (KDF) vs real-time streaming with custom consumers (KDS)

28
Q

Anomaly detection

A

the RANDOM_CUT_FOREST SQL function

it scores anomalies using recent history

29
Q

Detect Dense Areas

A

HOTSPOTS

locate and return information about relatively dense areas

30
Q

Runtime options on Kinesis Data Analytics

A

SQL

Apache Flink

31
Q

Every Kinesis Video Stream is capable of receiving how many video inputs ?

A

Just one

1,000 cameras? Run 1,000 Kinesis video streams

32
Q

KVS Inputs?

A
Cameras
AWS DeepLens
Smartphone camera
Audio feed
Images
RTSP camera
Producer SDK
33
Q

KVS Targets?

A
SageMaker
Amazon Rekognition Video
EC2 Consumer
- Tensorflow
- MXNet
34
Q

KVS data retention

A

1 hour to 10 years

35
Q

Fargate?

A

A serverless way to run containers, scaling automatically

36
Q

KVS use cases

A

Feed camera video into KVS

Run a consumer container on Fargate
- use DynamoDB for checkpointing

Send decoded frames to SageMaker for ML inference

Publish the inference results to KDS

Fire off Lambda to e.g. send notifications

37
Q

Which AWS services can use the Glue Data Catalog?

A

Amazon Redshift
Amazon Athena > QuickSight
Amazon EMR

38
Q

Does Glue transform data as well?

A
Yes, Glue does:
Transformation
Cleaning
Data enrichment

using ETL code in Python or Scala,
or your own provided Spark/PySpark scripts

39
Q

Glue Targets

A

S3
JDBC (RDS, Redshift)
Glue Data Catalog

40
Q

Glue Cost modeling

A

Pay as you go for the resources consumed

41
Q

Where does Glue run?

A

on a serverless Apache Spark platform

42
Q

Glue Scheduler?

A

it schedules Glue jobs

43
Q

Glue Triggers?

A

it automates job runs based on events

44
Q

Glue Transformations, how?

A

Bundled:

  • DropFields, DropNullFields - remove fields / fields that are null
  • Filter to filter records
  • Join to enrich
  • Map to add fields, delete fields, perform external lookups

Machine Learning Transformations:
- FindMatches: identify duplicates or matching records

Format Conversions:
CSV, JSON, Avro, ORC, Parquet, XML

Apache Spark Transformations:
e.g. K-Means
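The bundled Filter and Map transforms operate per record; a plain-Python analogue (not the awsglue API itself) shows the shape of the logic:

```python
# Plain-Python stand-in for Glue's Filter and Map transforms. Real jobs run
# these over DynamicFrames in Spark, but the record-level logic looks alike.
records = [
    {"id": 1, "country": "US", "amount": 30},
    {"id": 2, "country": "DE", "amount": None},
    {"id": 3, "country": "US", "amount": 55},
]

# Filter: keep only records matching a predicate.
us_sales = [r for r in records if r["country"] == "US"]

# Map: add a derived field to each record (conversion rate is made up).
enriched = [{**r, "amount_eur": r["amount"] * 0.9} for r in us_sales]

print(enriched)
```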

45
Q

Glue Job types

A
Spark
- Python 2
- Python 3
- Scala
Python Shell
46
Q

Name some AWS Storage services

A

Redshift

  • Columnar
  • SQL
  • OLAP
  • Load from S3
  • Redshift Spectrum runs on S3 without loading

RDS

  • OLTP
  • SQL

DynamoDB
- NoSQL

S3
- Object Storage

Elasticsearch

  • Search amongst data points
  • indexing
  • Clickstream analytics

ElastiCache:
- Caching and in-memory

47
Q

AWS Data Pipeline Source/ Destinations

A

include:

  • S3
  • RDS
  • Redshift
  • DynamoDB
  • EMR

Data sources may be on-premises

48
Q

Glue vs Data Pipeline

A

Glue

  • Managed
  • Run Apache Spark, Scala, Python
  • Focus on ETL and not configuration or managing resources

Data Pipeline

  • Orchestration service
  • Gives more control over the environment, code, EC2…
  • allows access to EC2 or EMR instances
49
Q

AWS Batch?

A

Run batch jobs as Docker images

Dynamic provisioning of the instances (EC2 & Spot)

Optimal quantity

No cluster management; serverless from your point of view

You pay for the underlying EC2 instances

50
Q

How to schedule Batch jobs?

A

using CloudWatch Events

51
Q

How to orchestrate Batch jobs?

A

using AWS Step Functions

52
Q

AWS Batch vs Glue

A

Batch creates the resources in your account

You must provide a Docker image

Batch is better for non-ETL work
- e.g. cleaning an S3 bucket

Glue is better for ETL and transformation

53
Q

AWS DMS: does the source remain available during migration?

A

yes

54
Q

DMS vs Glue?

A

Glue: batch, minimum interval of about 5 minutes
DMS (Database Migration Service): real-time

DMS does little transformation

55
Q

How DMS is real-time?

A

it uses continuous data replication (change data capture, CDC)

56
Q

AWS Step Functions?

A

Design workflows
Easy visualization
Error handling and retries
Audit history
Option to wait an arbitrary amount of time
Max execution time of a state machine: 1 year
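The error handling and retry behavior is declared in the state machine definition (Amazon States Language). A sketch with hypothetical state names and Lambda ARN:

```python
import json

# Minimal ASL definition with a Retry block on a Task state.
# The function ARN and account ID are placeholders, not real resources.
definition = {
    "StartAt": "TransformData",
    "States": {
        "TransformData": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:transform",
            "Retry": [{
                "ErrorEquals": ["States.TaskFailed"],
                "IntervalSeconds": 5,
                "MaxAttempts": 3,
                "BackoffRate": 2.0
            }],
            "Next": "WaitABit"
        },
        "WaitABit": {"Type": "Wait", "Seconds": 30, "Next": "Done"},
        "Done": {"Type": "Succeed"}
    }
}
print(json.dumps(definition, indent=2))
```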

57
Q

AWS Services for any sort of ETL?

A

Glue
Batch
Data Pipeline
Step Functions (for orchestration)