Data Engineering Flashcards

Question 1

Q

Amazon Kinesis Data Streams

Answer

A

Collect and store streaming data in real time
1. Retention up to 365 days (records can’t be deleted until they expire)
2. Data ordering guarantee
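
The ordering guarantee holds per shard: records that share a partition key land on the same shard and are read back in arrival order. Below is a minimal Python sketch of that idea. It assumes the 128-bit MD5 hash space is split evenly across shards (real hash key ranges can be uneven after splits and merges). The names `shard_for_key` and `route`, and the sample keys, are hypothetical.

```python
import hashlib
from collections import defaultdict

def shard_for_key(partition_key: str, shard_count: int) -> int:
    """Map a partition key to a shard index via its MD5 hash.

    Assumption: shards own equal slices of the 128-bit hash space.
    """
    hash_value = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    range_size = 2**128 // shard_count
    return min(hash_value // range_size, shard_count - 1)

def route(records, shard_count):
    """Append (key, payload) records to per-shard lists in arrival order."""
    shards = defaultdict(list)
    for key, payload in records:
        shards[shard_for_key(key, shard_count)].append((key, payload))
    return shards

# Hypothetical sample data: two producers identified by partition key.
events = [("truck-1", "start"), ("truck-2", "start"), ("truck-1", "stop")]
shards = route(events, shard_count=4)
```

Because both `truck-1` records hash to the same shard, a consumer of that shard sees `start` before `stop`. No order is implied between records on different shards.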

Question 2

Q

Kinesis Data Streams - Capacity Modes

Answer

A

Provisioned mode

Choose number of shards
Each shard accepts up to 1 MB or 1,000 records per second for writes
Scale manually
Pay per shard provisioned per hour

On-demand mode

Default capacity provisioned (4,000 records per second)
Scale based on observed throughput peaks
Pay per stream per hour & data in/out
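
For provisioned mode, the shard count follows from the per-shard write limits. Here is a rough sizing sketch in Python, assuming the limits of 1,000 records per second and 1 MB per second per shard. The function name `shards_needed` and the sample workloads are hypothetical, and real sizing should also consider read throughput.

```python
import math

RECORDS_PER_SHARD_PER_SEC = 1000  # write limit per shard, by record count
MB_PER_SHARD_PER_SEC = 1.0        # write limit per shard, by volume (assumed here)

def shards_needed(records_per_sec: float, avg_record_kb: float) -> int:
    """Return the smallest shard count that satisfies both write limits."""
    by_count = math.ceil(records_per_sec / RECORDS_PER_SHARD_PER_SEC)
    mb_per_sec = records_per_sec * avg_record_kb / 1024
    by_volume = math.ceil(mb_per_sec / MB_PER_SHARD_PER_SEC)
    return max(1, by_count, by_volume)

small_records = shards_needed(records_per_sec=2500, avg_record_kb=0.1)
large_records = shards_needed(records_per_sec=500, avg_record_kb=4)
```

Whichever limit is hit first decides the shard count: many small records are bound by the record limit, fewer large records by the volume limit.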

Question 3

Q

Amazon Data Firehose - AWS Destinations

Answer

A

Amazon S3: Supports compression
Amazon Redshift (copy through S3)
Amazon OpenSearch

Question 4

Q

Amazon Data Firehose

Answer

A

Receives records (up to 1 MB each) from producers
Can transform data with Lambda functions
Writes to destinations in batches based on buffer time and size (near real-time)
All data or only failed data can be backed up to S3

Question 5

Q

Firehose Buffer Sizing

Answer

A

Firehose accumulates records in a buffer, which is flushed when either the size limit or the time limit is reached, whichever comes first
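
The "whichever comes first" rule can be sketched in Python. This is a simplified model, not the service's implementation: the class name `SizeTimeBuffer` is hypothetical, and the time limit is only checked when a record arrives, whereas a real delivery stream flushes on its own timer.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class SizeTimeBuffer:
    max_bytes: int
    max_seconds: float
    records: List[bytes] = field(default_factory=list)
    size: int = 0
    opened_at: Optional[float] = None  # arrival time of the first buffered record

    def add(self, record: bytes, now: float) -> Optional[List[bytes]]:
        """Buffer a record; return the flushed batch if a limit was reached."""
        if self.opened_at is None:
            self.opened_at = now
        self.records.append(record)
        self.size += len(record)
        size_hit = self.size >= self.max_bytes
        time_hit = now - self.opened_at >= self.max_seconds
        return self.flush() if size_hit or time_hit else None

    def flush(self) -> List[bytes]:
        """Hand over the current batch and reset the buffer."""
        batch = self.records
        self.records, self.size, self.opened_at = [], 0, None
        return batch

buf = SizeTimeBuffer(max_bytes=10, max_seconds=60)
first = buf.add(b"12345", now=0)    # below both limits, nothing flushed
second = buf.add(b"67890", now=1)   # size limit reached
third = buf.add(b"a", now=2)        # new buffer opens at t=2
fourth = buf.add(b"b", now=62)      # time limit reached
```

High-throughput streams tend to flush on size, and low-throughput streams flush on time.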

Question 6

Q

Amazon Managed Service for Apache Flink

Answer

A

Framework for processing data streams. Can’t read from Amazon Data Firehose.

Question 7

Q

Amazon Managed Streaming for Apache Kafka (MSK)

Answer

A

Amazon MSK creates and manages Kafka broker nodes.
* Deployed in your VPC, multi-AZ
* Data is stored on EBS volumes

Question 8

Q

Kinesis Data Streams vs Amazon MSK

Answer

A

Kinesis Data Streams:
* 1 MB message size limit
* 12 months maximum retention
* Shard splitting and merging

Amazon MSK:
* Configure for bigger messages
* No retention limit
* Can only add partitions to a topic

Question 9

Q

AWS Batch

Answer

A

Run batch jobs as Docker images

Question 10

Q

AWS Batch - Multi node mode

Answer

A

Leverages multiple EC2 / ECS instances. One main node and multiple child nodes. Doesn’t work with Spot Instances.

Question 11

Q

Amazon Elastic MapReduce

Answer

A

EMR creates Hadoop clusters in a single AZ to analyze and process vast amounts of data.
* Can access data in DynamoDB and S3

Question 12

Q

EMR File System

Answer

A

EMRFS stores persistent data in Amazon S3 while providing data encryption

Question 13

Q

EMR node types

Answer

A

Master node: Manages the cluster
Core node: Runs tasks and stores data
Task node: Only runs tasks; usually uses Spot Instances

Question 14

Q

EMR instance configuration

Answer

A

Uniform instance groups: Select a single instance type and purchasing option for each node type. Supports auto scaling.
Instance fleets: Select a target capacity and mix instance types and purchasing options. No custom auto scaling policies.

Question 15

Q

AWS Glue

Answer

A

Managed extract, transform and load service

Question 16

Q

AWS Glue Data Catalog

Answer

A

Crawlers write metadata from databases and other supported data stores to the Data Catalog

Question 17

Q

Redshift

Answer

A

Online analytical processing
* Columnar storage of data: easier to aggregate
* Massively Parallel Query Execution
* Not all clusters support multi AZ

Question 18

Q

Redshift nodes

Answer

A

Leader node: query planning and results aggregation
Compute node: performs queries and sends results to the leader

Question 19

Q

Redshift Enhanced VPC Routing

Answer

A

COPY and UNLOAD traffic goes through your VPC

Question 20

Q

Redshift snapshots

Answer

A

Point in time backups stored in S3
* Incremental
* Restore into a new cluster
* Automated or manual
* Copy automatically to another region

Question 21

Q

Redshift Spectrum

Answer

A

Query data that is in S3 without loading it
* Must have a Redshift cluster to start the query

Question 22

Q

Redshift resource link

Answer

A

Data Catalog object that links to a database and lets integrated services run queries against it. Those services cannot access the database directly across accounts.

Question 23

Q

Redshift Workload Management

Answer

A

Flexibly manage query priorities within workloads
* Multiple query queues
* Route queries to the appropriate queue
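
Queue routing can be pictured as a first-match lookup. The Python sketch below assumes queues are checked in order and matched on a query group label or the user's groups. The queue names, group names, and the `route` function are all hypothetical, and real WLM configuration also covers concurrency and memory per queue.

```python
# Hypothetical queue definitions, checked in order; unmatched queries fall
# through to the default queue.
QUEUES = [
    {"name": "etl", "query_groups": {"etl"}, "user_groups": set()},
    {"name": "dashboards", "query_groups": set(), "user_groups": {"analysts"}},
]

def route(query_group, user_groups, queues=QUEUES, default="default"):
    """Return the name of the first queue whose rules match the query."""
    for queue in queues:
        if query_group in queue["query_groups"]:
            return queue["name"]
        if user_groups & queue["user_groups"]:
            return queue["name"]
    return default

etl_queue = route("etl", set())
dashboard_queue = route(None, {"analysts"})
fallback_queue = route(None, {"interns"})
```

Separating long-running loads from short interactive queries this way keeps one kind of workload from starving the other.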

Question 24

Q

Redshift EVEN distribution

Answer

A

The leader node distributes the rows across the slices in a round-robin fashion, regardless of their values. Appropriate when a table doesn’t participate in joins.
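
A small Python sketch of round-robin placement, using a hypothetical `distribute_even` helper and integer rows as stand-in data:

```python
def distribute_even(rows, slice_count):
    """Assign rows to slices in turn, ignoring the row contents."""
    slices = [[] for _ in range(slice_count)]
    for position, row in enumerate(rows):
        slices[position % slice_count].append(row)
    return slices

even_slices = distribute_even(list(range(7)), slice_count=3)
```

Slice sizes differ by at most one row, so the work is balanced, but related rows end up scattered, which is why this style suits tables that are not joined.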

Question 25

Q

Redshift KEY distribution

Answer

A

Rows are distributed according to the values in one column. The leader node places matching values on the same node slice, so matching values from the common columns are physically stored together.
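
The effect can be sketched in Python by hashing the distribution column. This is an illustration only: CRC32 stands in for whatever hash function Redshift uses internally, and `distribute_by_key`, the `customer_id` column, and the sample rows are hypothetical.

```python
import zlib

def distribute_by_key(rows, key, slice_count):
    """Place each row on the slice chosen by hashing its key column value."""
    slices = [[] for _ in range(slice_count)]
    for row in rows:
        digest = zlib.crc32(str(row[key]).encode("utf-8"))
        slices[digest % slice_count].append(row)
    return slices

orders = [
    {"customer_id": 1, "total": 20},
    {"customer_id": 2, "total": 35},
    {"customer_id": 1, "total": 15},
]
key_slices = distribute_by_key(orders, key="customer_id", slice_count=4)
```

Both orders for customer 1 sit on the same slice, so a join on `customer_id` against another table distributed the same way needs no data movement between slices.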

Question 26

Q

Redshift ALL distribution

Answer

A

A copy of the entire table is distributed to every node. Ensures that every row is collocated for every join. Appropriate only for relatively slow-moving tables.

Question 27

Q

DocumentDB

Answer

A

Fully managed MongoDB-compatible database used to store, query and index JSON data
* Replication across 3 AZs
* Automatically grows in increments of 10GB

Question 28

Q

DocumentDB Pricing

Answer

A

On-demand instances per second
Database I/O
Database storage per GB
Backup storage per GB

Question 29

Q

Amazon Timestream

Answer

A

Fully managed time series database
* Faster and more cost-effective than relational databases
* SQL compatibility
* Recent data is kept in memory; historical data is kept in cost-optimized storage
* Built-in time series analytics functions
* You can configure scheduled queries

Question 30

Q

Amazon Athena

Answer

A

Serverless query service to analyze data stored in S3
* Commonly used with QuickSight

Question 31

Q

Athena Performance Improvement

Answer

A

Use columnar formats for cost savings (less data scanned). Parquet or ORC is recommended
Compress data for smaller retrievals
Partition datasets in S3 (partition pruning)
Use larger files to minimize overhead
Predicate pushdown to filter data at source
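
Partition pruning is the easiest of these to picture: with a Hive-style `name=value` key layout, a filter on the partition columns rules out whole prefixes before any file is read. Below is a minimal Python sketch; the object keys and the `prune` helper are hypothetical, and Athena itself resolves partitions from table metadata rather than by parsing keys like this.

```python
def parse_partitions(object_key):
    """Extract name=value partition segments from an object key."""
    partitions = {}
    for segment in object_key.split("/"):
        if "=" in segment:
            name, value = segment.split("=", 1)
            partitions[name] = value
    return partitions

def prune(object_keys, **wanted):
    """Keep only the keys whose partition values match every filter."""
    return [
        key for key in object_keys
        if all(parse_partitions(key).get(n) == v for n, v in wanted.items())
    ]

keys = [
    "logs/year=2023/month=12/a.parquet",
    "logs/year=2024/month=01/b.parquet",
    "logs/year=2024/month=02/c.parquet",
]
this_year = prune(keys, year="2024")
january = prune(keys, year="2024", month="01")
```

Fewer objects scanned means less data billed, which is why partitioning on commonly filtered columns pays off.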

Question 32

Q

Amazon QuickSight

Answer

A

Serverless business intelligence service to create interactive dashboards

Question 33

Q

Amazon Forecast

Answer

A

Fully managed service that uses statistical and machine learning algorithms to deliver highly accurate time-series forecasts. To import your data you must store it in an S3 bucket.