Data Engineering and Storage Flashcards by Luiz Martins

What is the maximum size of an S3 object?

5 TB

What are the available S3 storage classes?

Standard
Standard-Infrequent Access (Standard-IA)
Intelligent-Tiering
One Zone-IA
Glacier Instant Retrieval
Glacier Flexible Retrieval
Glacier Deep Archive

For S3 Intelligent Tiering, what are the available tiers and when does an object enter each of those tiers?

-Frequent Access Tier - Default tier
-Infrequent Access Tier - Not accessed for 30 days
-Archive Instant Access Tier - Not accessed for 90 days
-Archive Access Tier - Configurable from 90 days to 700+ days
-Deep Archive Access Tier - Configurable from 180 days to 700+ days
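
The thresholds above can be sketched as a small lookup. This is only an illustration of the rules on this card, not how S3 implements them; the function name and parameters are hypothetical, the default tier is labelled "Frequent Access" (the AWS name for it), and passing `None` for an optional tier means it is not enabled.

```python
from typing import Optional

def intelligent_tiering_tier(days_since_access: int,
                             archive_after: Optional[int] = None,
                             deep_archive_after: Optional[int] = None) -> str:
    """Return the tier an object would sit in, per the thresholds on the card.

    archive_after / deep_archive_after are the optional, user-configured
    thresholds in days; None means the optional tier is not enabled.
    """
    if deep_archive_after is not None and days_since_access >= deep_archive_after:
        return "Deep Archive Access"
    if archive_after is not None and days_since_access >= archive_after:
        return "Archive Access"
    if days_since_access >= 90:
        return "Archive Instant Access"
    if days_since_access >= 30:
        return "Infrequent Access"
    return "Frequent Access"  # default tier


print(intelligent_tiering_tier(45))
print(intelligent_tiering_tier(200, archive_after=90, deep_archive_after=180))
```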

For S3 Intelligent Tiering, which tiers are automatically configured and which are optional?

Automatic: Frequent Access (default), Infrequent Access (IA), Archive Instant Access
Optional: Archive Access, Deep Archive Access

What is the minimum object billable size for S3 Glacier Flexible Retrieval and Glacier Deep Archive?

40 KB

What are the available retrieval times for Glacier Flexible Retrieval?

-Expedited: 1-5 min
-Standard: 3-5 hours
-Bulk: 5-12 hours (Free)
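
As a rough illustration of how these windows drive a choice, the sketch below picks the slowest tier whose worst-case time still meets a deadline. It assumes slower tiers cost less (the card only states that Bulk is free), and `pick_retrieval_tier` is a hypothetical helper, not an AWS API.

```python
from typing import Optional

# Upper bound of each retrieval window in hours, taken from the card above.
# Ordered slowest first; the sketch assumes slower tiers are cheaper.
RETRIEVAL_TIERS = [
    ("Bulk", 12.0),
    ("Standard", 5.0),
    ("Expedited", 5 / 60),
]

def pick_retrieval_tier(deadline_hours: float) -> Optional[str]:
    """Return the slowest tier whose worst-case time fits the deadline."""
    for name, worst_case_hours in RETRIEVAL_TIERS:
        if worst_case_hours <= deadline_hours:
            return name
    return None  # even Expedited may not finish in time


print(pick_retrieval_tier(24))
print(pick_retrieval_tier(6))
print(pick_retrieval_tier(0.5))
```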

What is the minimum object billable size for S3 Infrequent Access?

128 KB

What S3 feature can you use to automatically transition objects between S3 Storage Classes?

S3 lifecycle rules

What is a typical use case for S3 One-Zone IA?

Storing historical data that can easily be regenerated

Aside from automatic storage class transition, what else can S3 Lifecycle rules be used for?

Delete old files or old file versions after a predetermined amount of time has passed

What can you use to select which objects an S3 lifecycle rule applies to?

Object prefix and tags
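
The lifecycle cards above (transitions, expiration, and filtering by prefix and tags) can be tied together in one rule. The sketch below only builds the configuration as a Python dict; the rule ID, prefix, tag, and day counts are made-up values for illustration.

```python
import json

def build_lifecycle_configuration(prefix: str, tag_key: str, tag_value: str) -> dict:
    """Build one lifecycle rule scoped to a prefix AND a tag."""
    rule = {
        "ID": "archive-then-expire",  # hypothetical rule name
        "Status": "Enabled",
        # Only objects matching both the prefix and the tag are affected.
        "Filter": {
            "And": {
                "Prefix": prefix,
                "Tags": [{"Key": tag_key, "Value": tag_value}],
            }
        },
        # Automatic storage class transitions (illustrative day counts).
        "Transitions": [
            {"Days": 30, "StorageClass": "STANDARD_IA"},
            {"Days": 90, "StorageClass": "GLACIER"},
        ],
        # Delete current versions after a year, old versions after 30 days.
        "Expiration": {"Days": 365},
        "NoncurrentVersionExpiration": {"NoncurrentDays": 30},
    }
    return {"Rules": [rule]}


config = build_lifecycle_configuration("logs/", "retention", "short")
print(json.dumps(config, indent=2))
```

A dict of this shape is what boto3's `put_bucket_lifecycle_configuration` takes as its `LifecycleConfiguration` argument; the call itself is left out so the sketch runs without AWS credentials.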

What is S3 Analytics?

An S3 feature that analyzes object access patterns and recommends when to transition objects to a different storage class.

Which storage classes can S3 Analytics recommend?

Standard and Standard-IA only.

What are the User-Based security features available for S3?

IAM Policies

What are the resource-based security features available for S3?

-Bucket policies (Allows cross account)
-Object Access Control List (More fine grained)
-Bucket Access Control List (less common)

What are the main uses of Bucket Policies?

-Grant Public Access to files
-Grant Cross-Account access to files
-Enforce object encryption at upload
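
As one example of the last use, a bucket policy can reject uploads that do not request encryption. This is a minimal sketch that assumes SSE-KMS is the required mode; the bucket name and `Sid` are placeholders.

```python
import json

def deny_unencrypted_uploads_policy(bucket_name: str) -> dict:
    """Bucket policy that denies PutObject unless SSE-KMS is requested."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyUploadsWithoutSseKms",  # placeholder statement ID
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:PutObject",
                "Resource": f"arn:aws:s3:::{bucket_name}/*",
                # Matches when the encryption header is missing or not aws:kms.
                "Condition": {
                    "StringNotEquals": {
                        "s3:x-amz-server-side-encryption": "aws:kms"
                    }
                },
            }
        ],
    }


print(json.dumps(deny_unencrypted_uploads_policy("example-bucket"), indent=2))
```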

What are the types of server side encryption available for S3 objects?

-SSE-S3 (enabled by default)
-SSE-KMS
-SSE-C

True or False: SSE-KMS has no impact on KMS usage quotas

False. SSE-KMS calls the KMS GenerateDataKey API on upload and the Decrypt API on download, and those calls count toward the KMS request quotas

True or False: SSE-C accepts both HTTP and HTTPS

False, only HTTPS

True or False: You can use bucket policies to force S3 to accept transfers only through HTTPS

True
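
A common way to do this is a Deny statement keyed on the `aws:SecureTransport` condition. The sketch below builds such a policy as a Python dict; the bucket name and `Sid` are placeholders.

```python
import json

def https_only_policy(bucket_name: str) -> dict:
    """Bucket policy that denies every S3 action sent over plain HTTP."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyInsecureTransport",  # placeholder statement ID
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                # Cover both the bucket itself and the objects inside it.
                "Resource": [
                    f"arn:aws:s3:::{bucket_name}",
                    f"arn:aws:s3:::{bucket_name}/*",
                ],
                # aws:SecureTransport is "false" when the request is not HTTPS.
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            }
        ],
    }


print(json.dumps(https_only_policy("example-bucket"), indent=2))
```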

What are the retrieval types for Glacier Deep Archive data?

Standard (12 hours)
Bulk (48 hours)

What are the possible consumers for Kinesis Data Streams data?

Applications
Lambda
Amazon Kinesis Data Firehose
Managed Service for Apache Flink

How long can data be retained in Kinesis Data Streams?

Up to 365 days (the default is 24 hours)

What is the maximum size of a data record in Kinesis Data Streams?

1 MB

What are the Kinesis Data Streams capacity modes and what is their throughput?

-Provisioned (per shard: 1 MB/s or 1,000 records/s in, 2 MB/s out)
-On-demand (4 MB/s or 4,000 records/s by default, scaling automatically with traffic)
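
In provisioned mode, the per-shard limits translate directly into a shard count. The sketch below sizes a stream from the figures on this card (1 MB/s or 1,000 records/s in, 2 MB/s out per shard); the function name and the sample workload are hypothetical.

```python
import math

def required_shards(ingress_mb_per_s: float,
                    records_per_s: float,
                    egress_mb_per_s: float) -> int:
    """Smallest shard count that satisfies all three per-shard limits."""
    by_ingress = math.ceil(ingress_mb_per_s / 1.0)   # 1 MB/s in per shard
    by_records = math.ceil(records_per_s / 1000.0)   # 1,000 records/s in per shard
    by_egress = math.ceil(egress_mb_per_s / 2.0)     # 2 MB/s out per shard
    return max(1, by_ingress, by_records, by_egress)


# 3.5 MB/s in, 2,500 records/s, 10 MB/s out: the read side dominates.
print(required_shards(3.5, 2500, 10))
```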

Where can Kinesis Data Firehose write to?

-S3
-Redshift
-OpenSearch
-Elasticsearch
-CloudWatch Logs
-Custom HTTP endpoints
-3rd party services

True or False: Kinesis Firehose performs real-time data writing

False, it performs near real-time

What is Amazon Managed Service for Apache Flink?

It is an AWS service that lets you run Apache Flink applications on AWS without managing servers

True or False: Amazon Managed Service for Apache Flink cannot read data from Kinesis Firehose

True

True or False: Amazon Managed Service for Apache Flink supports both Python and Scala

True

What are the main use cases for Kinesis Data Analytics?

-Streaming ETL (only simple transformations)
-Continuous Metric Generation
-Responsive analytics for certain metrics

What is Kinesis Data Analytics?

It's a service for real-time ETL / ML algorithms on streams

What are the possible data producers for Kinesis Video Streams?

-Cameras
-Radar data
-Images
-Audio Feeds
-AWS DeepLens
-RTSP cameras

How many data producers can each video stream have?

One producer per video stream

What are the possible Kinesis Video Streams consumers?

-Custom consumers
-Amazon SageMaker
-Amazon Rekognition Video

How long can Kinesis Video Streams keep your data?

Between 1 day and 10 years

What is the Glue Data Catalog?

It is a central metadata repository that stores the table schemas for your S3-based data.

What AWS services can integrate with Data Catalog schemas?

-EMR
-Athena
-Redshift Spectrum

What is a Glue Crawler?

A Glue Crawler is a feature that scans your data and automatically infers table schemas from it

What input file formats are accepted by Glue Crawlers?

-JSON
-Parquet
-CSV
-Relational Stores

A Glue Crawler can run on data stored on which AWS services?

-S3
-Redshift
-RDS

What is Glue ETL?

It is a Glue feature that lets you transform your data on a serverless Spark platform

True or False: Glue ETL accepts only Spark scripts

False, it accepts both Spark and PySpark

True or False: Glue ETL accepts only scripts written in Scala

False, it accepts both Scala and Python

What targets can Glue ETL accept?

S3, JDBC connector (Redshift, RDS) or Glue Data Catalog

What are the bundled transformations available on Glue ETL?

-DropFields, DropNullFields
-Filter - Specify a function to filter records
-Join - To enrich data
-Map - Add fields, delete fields, perform external lookups
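
To make the four transformations concrete, here is a plain-Python analogy over lists of dicts. It does not use the Glue API: the helper names and sample data are made up, and `drop_null_fields` assumes that "null field" means a field that is null in every record.

```python
def drop_fields(rows, fields):
    """DropFields analogy: remove the named fields from every record."""
    return [{k: v for k, v in row.items() if k not in fields} for row in rows]

def drop_null_fields(rows):
    """DropNullFields analogy: remove fields that are null in every record."""
    all_keys = {k for row in rows for k in row}
    null_everywhere = {k for k in all_keys
                       if all(row.get(k) is None for row in rows)}
    return drop_fields(rows, null_everywhere)

def filter_rows(rows, predicate):
    """Filter analogy: keep the records for which predicate(record) is true."""
    return [row for row in rows if predicate(row)]

def map_rows(rows, fn):
    """Map analogy: apply fn to a copy of each record."""
    return [fn(dict(row)) for row in rows]

def join_rows(left, right, key):
    """Join analogy: inner join on key, enriching left with right's fields."""
    index = {row[key]: row for row in right}
    return [{**row, **index[row[key]]} for row in left if row[key] in index]


clicks = [
    {"user_id": 1, "page": "/home", "debug": None},
    {"user_id": 2, "page": "/cart", "debug": None},
    {"user_id": 1, "page": "/cart", "debug": None},
]
users = [{"user_id": 1, "country": "BR"}, {"user_id": 2, "country": "US"}]

rows = drop_null_fields(clicks)                                 # drops "debug"
rows = filter_rows(rows, lambda r: r["page"] == "/cart")        # keeps cart views
rows = map_rows(rows, lambda r: {**r, "is_checkout": True})     # adds a field
rows = join_rows(rows, users, "user_id")                        # adds "country"
print(rows)
```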

What are the machine learning transformations available on Glue ETL?

FindMatches ML: identify duplicate or matching records in your dataset, even when the records do not have a common unique identifier and no fields match exactly

What formats can Glue ETL convert files to?

-CSV
-JSON
-Parquet
-Avro
-ORC
-XML

True or False: Glue ETL can perform Apache Spark default transformations (ex: K-means)

True

What is Glue DataBrew?

It is a feature that lets you transform your data without writing any code

True or False: Glue Data Catalog Schemas are versioned

True

What is a recommended data store on AWS for performing clickstream analytics?

OpenSearch

What is AWS Data Pipeline?

It is an AWS service that automates the movement and transformation of data

What are possible data targets for Data Pipeline?

S3, DynamoDB, RDS, Redshift and EMR

What is the main difference between Data Pipeline and Glue ETL?

Data Pipeline gives you more control over the environment: it manages the EC2 and EMR instances that run your jobs and lets you access them. Glue ETL is serverless and runs on a managed Spark platform

True or False: AWS Batch is a serverless service, where you only pay for the usage of the underlying EC2 instances

True

What is AWS Batch?

It is an AWS service that allows you to run batch jobs based on Docker images

True or False: The only EC2 purchasing option accepted by AWS Batch is On-Demand

False, it accepts both On-Demand and Spot instances

What is the maximum execution time for a Step Functions state machine?

1 year

What is AWS DataSync?

It is an AWS service that helps migrate files from outside AWS (for example, on-premises storage) into AWS

What are the possible AWS DataSync targets?

-S3
-EFS
-FSx

What is MQTT?

It is a lightweight messaging protocol used in IoT