Data Engineering Flashcards

1
Q

Amazon Kinesis Data Streams

A

Collect and store streaming data in real time
1. Retention from 24 hours (default) up to 365 days (data can’t be deleted until it expires)
2. Ordering guaranteed for records with the same partition key (they land on the same shard)
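The ordering guarantee holds per shard, and a record's shard is chosen from its partition key. A minimal sketch of that routing, assuming the shards split the 128-bit MD5 hash space evenly (real streams can have uneven hash key ranges after splits and merges). The function name `shard_for_key` is hypothetical, not part of any AWS SDK.

```python
import hashlib


def shard_for_key(partition_key: str, shard_count: int) -> int:
    """Return the index of the shard a partition key would map to.

    Assumption: shard hash key ranges are equal-sized slices of 0..2**128-1.
    """
    # Kinesis hashes the partition key with MD5 into a 128-bit integer.
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_value = int.from_bytes(digest, "big")
    range_size = 2**128 // shard_count
    # Clamp so that the top of the hash space falls in the last shard.
    return min(hash_value // range_size, shard_count - 1)


# The same key always maps to the same shard, which is what preserves order.
first = shard_for_key("device-42", 4)
second = shard_for_key("device-42", 4)
print(first == second)
```

Records with different partition keys may land on different shards, so no ordering is implied between them.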

2
Q

Kinesis Data Streams - Capacity Modes

A

Provisioned mode:

  • Choose the number of shards
  • Each shard accepts 1 MB/s or 1,000 records per second in
  • Scale manually
  • Pay per shard provisioned per hour

On-demand mode:

  • Default capacity provisioned (4 MB/s or 4,000 records per second in)
  • Scales automatically based on observed throughput peaks
  • Pay per stream per hour & data in/out
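For provisioned mode, the shard count can be estimated from expected write traffic. A rough sketch, assuming the per-shard write limits of 1,000 records per second and 1 MB/s (rounded here to 1,000,000 bytes); `shards_needed` is a hypothetical helper, and it ignores read throughput and headroom.

```python
import math

# Per-shard write limits in provisioned mode.
RECORDS_PER_SHARD_PER_SEC = 1000
BYTES_PER_SHARD_PER_SEC = 1_000_000  # 1 MB/s, rounded for simplicity


def shards_needed(records_per_sec: float, avg_record_bytes: float) -> int:
    """Estimate the shards needed to absorb a given write rate."""
    by_count = math.ceil(records_per_sec / RECORDS_PER_SHARD_PER_SEC)
    by_bytes = math.ceil(records_per_sec * avg_record_bytes / BYTES_PER_SHARD_PER_SEC)
    # Whichever limit is hit first decides; a stream has at least one shard.
    return max(1, by_count, by_bytes)


# Many small records: the record-count limit dominates.
print(shards_needed(2500, 200))
# Fewer but larger records: the byte limit dominates.
print(shards_needed(500, 5000))
```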
3
Q

Amazon Data Firehose - AWS Destinations

A
  • Amazon S3: Supports compression
  • Amazon Redshift (copy through S3)
  • Amazon OpenSearch
4
Q

Amazon Data Firehose

A
  1. Receives records (up to 1 MB) from producers
  2. Can transform data with Lambda functions
  3. Writes to destinations in batches based on buffer time and size (near real time)
  4. All data or only failed data can be backed up to S3
5
Q

Firehose Buffer Sizing

A

Firehose accumulates records in a buffer, which is flushed when either the buffer size or the buffer time limit is reached, whichever comes first
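The flush rule can be illustrated with a small simulation. This is only a sketch of the "size or time, whichever comes first" idea, not Firehose's implementation: the class `BufferSketch` and its fields are hypothetical, and the time rule is checked only when a record arrives, whereas a real service would also flush on a timer.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class BufferSketch:
    max_bytes: int      # size rule
    max_seconds: float  # time rule
    records: List[bytes] = field(default_factory=list)
    size: int = 0
    opened_at: Optional[float] = None

    def add(self, record: bytes, now: float) -> Optional[List[bytes]]:
        """Buffer a record; return the flushed batch if a rule fires."""
        if self.opened_at is None:
            self.opened_at = now  # the clock starts with the first record
        self.records.append(record)
        self.size += len(record)
        size_hit = self.size >= self.max_bytes
        time_hit = now - self.opened_at >= self.max_seconds
        if size_hit or time_hit:
            return self.flush()
        return None

    def flush(self) -> List[bytes]:
        """Hand over the current batch and reset the buffer."""
        batch = self.records
        self.records, self.size, self.opened_at = [], 0, None
        return batch


buf = BufferSketch(max_bytes=10, max_seconds=60)
buf.add(b"abcd", now=0)
buf.add(b"efgh", now=1)
print(buf.add(b"ij", now=2))  # size rule fires at 10 bytes
```

With high throughput the size rule tends to fire first; with low throughput the time rule does.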

6
Q

Amazon Managed Service for Apache Flink

A

Managed service for running Apache Flink, a framework for processing data streams. Can read from Kinesis Data Streams and Amazon MSK, but can’t read from Amazon Data Firehose.

7
Q

Amazon Managed Streaming For Apache Kafka

A

Amazon MSK creates & manages Kafka broker nodes.
* Deployed in your VPC, multi-AZ
* Data is stored on EBS volumes

8
Q

Kinesis Data Streams vs Amazon MSK

A

Kinesis Data Streams:
* 1 MB message size limit
* 12 months maximum retention
* Shard splitting and merging

Amazon MSK:
* 1 MB default message size, configurable for bigger messages
* No retention limit
* Can only add partitions to a topic

9
Q

AWS Batch

A

Run batch jobs as Docker images

10
Q

AWS Batch - Multi node mode

A

Leverages multiple EC2 / ECS instances. One main node and multiple child nodes. Doesn’t work with Spot Instances.

11
Q

Amazon Elastic MapReduce

A

EMR creates Hadoop clusters in a single AZ to analyze and process vast amounts of data.
* Can access data in DynamoDB and S3

12
Q

EMR File System

A

EMRFS stores persistent data in Amazon S3 while providing data encryption

13
Q

EMR node types

A
  • Master node: Manages the cluster
  • Core node: Runs tasks and stores data
  • Task node: Only runs tasks, usually uses Spot Instances
14
Q

EMR instance configuration

A
  • Uniform instance groups: Select a single instance type and purchasing option for each node type. Has auto scaling.
  • Instance fleet: Select target capacity, mix instance types and purchasing options. No auto scaling.
15
Q

AWS Glue

A

Managed extract, transform and load service

16
Q

AWS Glue Data Catalog

A

Glue crawlers write metadata from databases and other supported data stores to the Data Catalog

17
Q

Redshift

A

Online analytical processing
* Columnar storage of data: easier to aggregate
* Massively Parallel Query Execution
* Not all cluster types support Multi-AZ

18
Q

Redshift nodes

A

Leader node: query planning and results aggregation
Compute node: performs queries and sends results to the leader node

19
Q

Redshift Enhanced VPC Routing

A

COPY and UNLOAD traffic goes through your VPC

20
Q

Redshift snapshots

A

Point-in-time backups stored in S3
* Incremental
* Restore into a new cluster
* Automated or manual
* Can be copied automatically to another Region

21
Q

Redshift Spectrum

A

Query data that is in S3 without loading it
* Must have a Redshift cluster to start the query

22
Q

Redshift resource link

A

Data Catalog object that is linked to a database and allows integrated services to run queries on that database. Without it, these services cannot access the database directly across accounts.

23
Q

Redshift Workload Management

A

Flexibly manage query priorities within workloads
* Multiple query queues
* Route queries to the appropriate queue

24
Q

Redshift EVEN distribution

A

The leader node distributes the rows across the slices in a round-robin fashion, regardless of the values. Appropriate when a table doesn’t participate in joins.

25
Q

Redshift KEY distribution

A

Rows are distributed according to the values in one column. The leader node places matching values on the same node slice, so matching values from the common columns are physically stored together.

26
Q

Redshift ALL distribution

A

A copy of the entire table is distributed to every node. Ensures that every row is collocated for every join. Appropriate only for relatively slow-moving tables.
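The EVEN, KEY and ALL styles can be compared side by side in a toy model. This is a sketch under stated assumptions: slices and nodes are plain Python lists, the function names are hypothetical, and `zlib.crc32` is only a stand-in for Redshift's internal hash, which is not described in these cards.

```python
import zlib
from typing import Dict, List

Row = Dict[str, object]


def distribute_even(rows: List[Row], slices: int) -> List[List[Row]]:
    """EVEN: round-robin over slices, ignoring the row values."""
    placement: List[List[Row]] = [[] for _ in range(slices)]
    for i, row in enumerate(rows):
        placement[i % slices].append(row)
    return placement


def distribute_key(rows: List[Row], slices: int, dist_key: str) -> List[List[Row]]:
    """KEY: rows with the same value in dist_key land on the same slice."""
    placement: List[List[Row]] = [[] for _ in range(slices)]
    for row in rows:
        # Stand-in hash; any deterministic hash shows the same behavior.
        h = zlib.crc32(str(row[dist_key]).encode("utf-8"))
        placement[h % slices].append(row)
    return placement


def distribute_all(rows: List[Row], nodes: int) -> List[List[Row]]:
    """ALL: every node holds a full copy of the table."""
    return [list(rows) for _ in range(nodes)]


orders = [{"order_id": i, "customer_id": i % 3} for i in range(6)]
for slice_rows in distribute_key(orders, 2, "customer_id"):
    print(sorted({r["customer_id"] for r in slice_rows}))
```

KEY keeps join partners together at the cost of possible skew, EVEN balances storage but may move rows at join time, and ALL multiplies storage by the number of nodes.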

27
Q

DocumentDB

A

Fully managed MongoDB-compatible database used to store, query and index JSON data
* Replication across 3 AZ
* Storage automatically grows in increments of 10 GB

28
Q

DocumentDB Pricing

A
  • On-demand instances per second
  • Database I/O
  • Database storage per GB
  • Backup storage per GB
29
Q

Amazon Timestream

A

Fully managed time series database
* Faster and more cost-effective than relational databases for time series data
* SQL compatibility
* Recent data is stored in memory, historical data in cost-optimized storage
* Built-in time series analytics functions
* You can configure scheduled queries

30
Q

Amazon Athena

A

Serverless query service to analyze data stored in S3
* Commonly used with QuickSight

31
Q

Athena Performance Improvement

A
  • Use columnar data for cost savings (less data scanned). Parquet or ORC is recommended
  • Compress data for smaller retrievals
  • Partition datasets in S3 (partition pruning)
  • Use larger files to minimize overhead
  • Use predicate pushdown to filter data at the source
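Partition pruning from the list above can be pictured as skipping whole S3 prefixes before any file is read. A minimal sketch, assuming Hive-style `name=value` path segments; the object keys and the helpers `parse_partitions` and `prune` are hypothetical and only model the idea, not Athena's planner.

```python
from typing import Dict, List


def parse_partitions(key: str) -> Dict[str, str]:
    """Extract Hive-style partition values from an object key."""
    parts: Dict[str, str] = {}
    for segment in key.split("/"):
        if "=" in segment:
            name, value = segment.split("=", 1)
            parts[name] = value
    return parts


def prune(keys: List[str], wanted: Dict[str, str]) -> List[str]:
    """Keep only the objects whose partitions match the filter."""
    return [
        key
        for key in keys
        if all(parse_partitions(key).get(n) == v for n, v in wanted.items())
    ]


# Hypothetical layout of a partitioned dataset.
keys = [
    "logs/year=2023/month=12/part-0000.parquet",
    "logs/year=2024/month=01/part-0000.parquet",
    "logs/year=2024/month=02/part-0000.parquet",
]
# A filter on year and month leaves a single object to scan.
print(prune(keys, {"year": "2024", "month": "01"}))
```

Since Athena charges by data scanned, every object excluded this way reduces both cost and query time.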
32
Q

Amazon QuickSight

A

Serverless business intelligence service to create interactive dashboards

33
Q

Amazon Forecast

A

Fully managed service that uses statistical and machine learning algorithms to deliver highly accurate time-series forecasts. To import your data you must store it in an S3 bucket.