Data Engineering Flashcards

1
Q

S3 key?

A

it’s the object’s full path within the bucket:

everything after the bucket name, all the way to the file extension
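For illustration, splitting a hypothetical S3 URI into its bucket and key:

```python
# Split an S3 URI (bucket and path are made up) into bucket name and key.
# The key is everything after the bucket name, down to the file extension.
def split_s3_uri(uri: str):
    assert uri.startswith("s3://"), "not an S3 URI"
    bucket, _, key = uri[len("s3://"):].partition("/")
    return bucket, key

print(split_s3_uri("s3://my-bucket/logs/2024/app.log"))
# ('my-bucket', 'logs/2024/app.log')
```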

2
Q

max S3 file size?

A

5TB

3
Q

S3 object tag use cases?

A

It’s a key/value pair attached to an object

Lifecycle rules
Data classification
Security (tag-based access policies)

4
Q

S3 covers storage. Name some compute services that work with it?

A
EC2
Amazon Athena
Amazon Redshift Spectrum
Rekognition
AWS Glue
5
Q

Data partitioning on S3: how and why?

A

s3://bucket/dataset/year=…/month=…/day=…/

to speed up range queries: engines scan only the matching partitions
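A minimal sketch of building such a partition prefix in Python (bucket and dataset names are made up):

```python
from datetime import date

# Build a Hive-style partitioned key prefix (year=/month=/day=) for a record date.
def partition_prefix(d: date) -> str:
    return f"year={d.year}/month={d.month:02d}/day={d.day:02d}/"

print("s3://my-bucket/sales/" + partition_prefix(date(2024, 3, 7)))
# s3://my-bucket/sales/year=2024/month=03/day=07/
```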

6
Q

S3 Encryption options

A

SSE-S3
SSE-KMS
SSE-C
CSE

7
Q

S3 Access

A

User-based:
- IAM policies

Resource-based:
- Bucket policies
- ACLs
8
Q

What if we do not want to move the data in S3 over the internet?

A

use a VPC Gateway Endpoint, so traffic stays on the AWS network

9
Q

S3 logs

A

S3 Access logs in another S3 bucket

API calls in CloudTrail

10
Q

Can you do S3 policy based on the tags?

A

Yes.
Add a tag such as classification=PHI
and restrict access to any object carrying that tag
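A sketch of a tag-conditioned bucket policy; the bucket ARN is hypothetical, and the s3:ExistingObjectTag/&lt;key&gt; condition key is what matches the object tag. The blanket Deny is for illustration only; a real policy would scope the Principal:

```python
import json

# Illustrative bucket policy: deny reads of any object tagged classification=PHI.
# Bucket name is hypothetical; in practice you'd carve out authorized principals.
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Deny",
        "Principal": "*",
        "Action": "s3:GetObject",
        "Resource": "arn:aws:s3:::my-bucket/*",
        "Condition": {
            "StringEquals": {"s3:ExistingObjectTag/classification": "PHI"}
        }
    }]
}
print(json.dumps(policy, indent=2))
```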

11
Q

The Apache alternative to Kinesis?

A

Kafka

12
Q

Kinesis use cases

A

Logs
Metrics
IoT
ClickStream

13
Q

Some streaming frameworks

A

Spark
NiFi
etc…

14
Q

The Kinesis services?

A

KDS (Kinesis Data Streams): low-latency streaming at scale

KDA (Kinesis Data Analytics): real-time analytics on streams using SQL

KDF (Kinesis Data Firehose): load streams into S3, Redshift, Elasticsearch, Splunk

KVS (Kinesis Video Streams): stream video in real time

15
Q

KDS Facts:

  • provision
  • retention
  • replay data
  • consumer quantity
  • edit ingested data
  • record size
A
  • provision shards in advance
  • retention from 24 hours up to 7 days
  • ability to reprocess and replay data
  • multiple consumers can read from the same stream
  • ingested data is immutable (cannot be edited)
  • max record size 1 MB
16
Q

KDS Producer limits

A

1 MB/s or 1,000 records/s per shard
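Those per-shard write limits imply a minimum shard count for a given workload; a back-of-the-envelope calculator (my own sketch, not an AWS API):

```python
import math

# Minimum shard count for the KDS write limits above:
# each shard accepts 1 MB/s or 1,000 records/s, whichever binds first.
def shards_needed(records_per_sec: int, bytes_per_sec: int) -> int:
    by_records = math.ceil(records_per_sec / 1_000)
    by_bytes = math.ceil(bytes_per_sec / 1_000_000)
    return max(by_records, by_bytes, 1)

print(shards_needed(records_per_sec=4_500, bytes_per_sec=2_500_000))  # 5
```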

17
Q

Consumer Classic limits

A

2 MB/s per shard

5 GetRecords API calls/s per shard, shared across all consumers

18
Q

KDF min latency

A

near real-time: minimum buffer interval of 60 seconds
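The 60-second floor comes from Firehose-style buffering: data is flushed when either a size hint or a time interval is reached. A toy model of that rule (thresholds illustrative, not the service defaults):

```python
# Toy model of buffer-then-flush delivery: flush when either the size
# threshold or the time interval is hit, whichever comes first.
def should_flush(buffered_bytes, seconds_waited,
                 size_limit=5 * 1024 * 1024, interval=60):
    return buffered_bytes >= size_limit or seconds_waited >= interval

print(should_flush(1024, 10))   # False: below both thresholds
print(should_flush(1024, 60))   # True: interval reached
```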

19
Q

KDF targets

A

Redshift
Amazon S3
Elasticsearch
Splunk

20
Q

KDF scaling

A

Managed Auto-Scaling

21
Q

KDF Data conversion

A

CSV / JSON > Parquet / ORC

only for S3

22
Q

KDF Data Transformation

A

using a Lambda function

e.g. CSV to JSON

23
Q

KDF Compression

A

when target is S3

GZIP, ZIP, SNAPPY

24
Q

KDF Pricing

A

Pay as you go

25
Q

KDF Sources

A
SDK
KPL
Kinesis Agent
KDS
CloudWatch logs and events
IoT rule actions
26
Q

KDS Latency

A

~200 ms (classic)

~70 ms (enhanced fan-out)

27
Q

KDF vs KDS

A

Ingestion/delivery (KDF) vs real-time streaming with custom consumers (KDS)

28
Q

Anomaly detection

A

the RANDOM_CUT_FOREST SQL function

it scores anomalies using recent history

29
Q

Detect Dense Areas

A

HOTSPOTS

locate and return information about relatively dense areas

30
Q

Runtime options on Kinesis Data Analytics

A

SQL

Apache Flink

31
Q

Every Kinesis Video Stream is capable of receiving how many video inputs ?

A

Just one

1,000 cameras? Run 1,000 Kinesis video streams

32
Q

KVS Inputs?

A
Cameras
AWS DeepLens
Smartphone camera
Audio feed
Images
RTSP camera
Producer SDK
33
Q

KVS Targets?

A
SageMaker
Amazon Rekognition Video
EC2 Consumer
- Tensorflow
- MXNet
34
Q

KVS data retention

A

1 hour to 10 years

35
Q

Fargate?

A

A serverless way to run containers, scaling automatically

36
Q

KVS use cases

A

Feed camera video into KVS

Run a consumer container on Fargate
- use DynamoDB for checkpointing

Send decoded frames to SageMaker for ML inference

Publish the inference results to KDS

Fire off Lambda to e.g. send notifications

37
Q

Which AWS services can use the Glue Data Catalog?

A

Amazon Redshift
Amazon Athena > QuickSight
Amazon EMR

38
Q

Does Glue transform data as well?

A
Yes, Glue does:
Transformation
Cleaning
Data enrichment

using ETL code in Python or Scala,
or your own provided Spark/PySpark scripts

39
Q

Glue Targets

A

S3
JDBC (RDS, Redshift)
Glue Data Catalog

40
Q

Glue Cost modeling

A

Pay as you go for the resources consumed

41
Q

Where does Glue run?

A

on a serverless Apache Spark platform

42
Q

Glue Scheduler?

A

it schedules Glue jobs

43
Q

Glue Triggers?

A

it automates job runs based on events

44
Q

Glue Transformations, how?

A

Bundled:

  • DropFields, DropNullFields - remove fields / fields that are null
  • Filter to filter records
  • Join to enrich
  • Map to add fields, delete fields, perform external lookups

Machine Learning Transformations:
- FindMatches: identify duplicates or matching records

Format Conversions:
CSV, JSON, Avro, ORC, Parquet, XML

Apache Spark Transformations:
e.g. K-Means
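The bundled Filter and Map transforms operate per record; a plain-Python analogue (not the awsglue API itself) shows the shape of the logic:

```python
# Plain-Python stand-in for Glue's Filter and Map transforms. Real jobs run
# these over DynamicFrames in Spark, but the record-level logic looks alike.
records = [
    {"id": 1, "country": "US", "amount": 30},
    {"id": 2, "country": "DE", "amount": None},
    {"id": 3, "country": "US", "amount": 55},
]

# Filter: keep only records matching a predicate.
us_sales = [r for r in records if r["country"] == "US"]

# Map: add a derived field to each record (conversion rate is made up).
enriched = [{**r, "amount_eur": r["amount"] * 0.9} for r in us_sales]

print(enriched)
```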

45
Q

Glue Job types

A
Spark
- Python 2
- Python 3
- Scala
Python Shell
46
Q

Name some AWS Storage services

A

Redshift

  • Columnar
  • SQL
  • OLAP
  • Load from S3
  • Redshift Spectrum runs on S3 without loading

RDS

  • OLTP
  • SQL

DynamoDB
- NoSQL

S3
- Object Storage

Elasticsearch

  • Search amongst data points
  • indexing
  • Clickstream analytics

ElastiCache:
- Caching and in-memory

47
Q

AWS Data Pipeline Source/ Destinations

A

include:

  • S3
  • RDS
  • Redshift
  • DynamoDB
  • EMR

Data sources may be on-premises

48
Q

Glue vs Data Pipeline

A

Glue

  • Managed
  • Run Apache Spark, Scala, Python
  • Focus on ETL and not configuration or managing resources

Data Pipeline

  • Orchestration service
  • Gives more control over the environment, code, EC2…
  • allows access to EC2 or EMR instances
49
Q

AWS Batch?

A

Run batch jobs as Docker images

Dynamic provisioning of the instances (EC2 & Spot)

Optimal quantity

No cluster management; serverless from your point of view

You pay for the underlying EC2 instances

50
Q

How to schedule Batch jobs?

A

using CloudWatch Events

51
Q

How to orchestrate Batch jobs?

A

using AWS Step Functions

52
Q

AWS Batch vs Glue

A

Batch creates the resources in your account

You must provide a Docker image

Batch is better for non-ETL work
- e.g. cleaning an S3 bucket

Glue is better for ETL and transformation

53
Q

AWS DMS: does the source remain available during migration?

A

yes

54
Q

DMS vs Glue?

A

Glue: batch, minimum interval of about 5 minutes
DMS (Database Migration Service): real-time

DMS does little transformation

55
Q

How DMS is real-time?

A

it uses continuous data replication (change data capture, CDC)

56
Q

AWS Step Functions?

A

Design workflows
Easy visualization
Error handling and retries
Audit history
Option to wait an arbitrary amount of time
Max execution time of a state machine: 1 year
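The error handling and retry behavior is declared in the state machine definition (Amazon States Language). A sketch with hypothetical state names and Lambda ARN:

```python
import json

# Minimal ASL definition with a Retry block on a Task state.
# The function ARN and account ID are placeholders, not real resources.
definition = {
    "StartAt": "TransformData",
    "States": {
        "TransformData": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:transform",
            "Retry": [{
                "ErrorEquals": ["States.TaskFailed"],
                "IntervalSeconds": 5,
                "MaxAttempts": 3,
                "BackoffRate": 2.0
            }],
            "Next": "WaitABit"
        },
        "WaitABit": {"Type": "Wait", "Seconds": 30, "Next": "Done"},
        "Done": {"Type": "Succeed"}
    }
}
print(json.dumps(definition, indent=2))
```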

57
Q

AWS Services for any sort of ETL?

A

Glue
Batch
Data Pipeline
Step Functions (for orchestration)