Data Engineering and Storage Flashcards
What is the maximum size of an S3 object?
5 TB
What are the available S3 storage classes?
- Standard
- Infrequent Access (IA)
- Intelligent Tiering
- One-Zone IA
- Glacier Instant Retrieval
- Glacier Flexible Retrieval
- Glacier Deep Archive
For S3 Intelligent Tiering, what are the available tiers and when does an object enter each of those tiers?
-Frequent Access Tier - Default tier
-Infrequent Access (IA) Tier - Not accessed for 30 days
-Archive Instant Access Tier - Not accessed for 90 days
-Archive Access Tier - Configurable from 90 to 730 days
-Deep Archive Access Tier - Configurable from 180 to 730 days
For S3 Intelligent Tiering, which tiers are automatically configured and which are optional?
Automatic: Frequent Access, Infrequent Access, Archive Instant Access
Optional: Archive Access, Deep Archive Access
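The tiering rules in the two cards above can be sketched as a small function. This is only an illustration of the thresholds, not how S3 implements them; the function name and parameters are hypothetical, and the default tier is called Frequent Access here, as in the AWS documentation.

```python
from typing import Optional


def intelligent_tiering_tier(days_since_access: int,
                             archive_after: Optional[int] = None,
                             deep_archive_after: Optional[int] = None) -> str:
    """Return the Intelligent-Tiering tier for an object (illustrative only)."""
    # The two archive tiers are optional: they apply only when a threshold is set.
    if deep_archive_after is not None and days_since_access >= deep_archive_after:
        return "Deep Archive Access"
    if archive_after is not None and days_since_access >= archive_after:
        return "Archive Access"
    # The remaining tiers are automatic.
    if days_since_access >= 90:
        return "Archive Instant Access"
    if days_since_access >= 30:
        return "Infrequent Access"
    return "Frequent Access"
```

For example, an object untouched for 120 days lands in Archive Instant Access by default, but in Archive Access if that optional tier was configured at 90 days.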
What is the minimum object billable size for S3 Glacier Flexible Retrieval and Glacier Deep Archive?
40 KB
What are the available retrieval times for Glacier Flexible Retrieval?
-Expedited: 1-5 min
-Standard: 3-5 hours
-Bulk: 5-12 hours (Free)
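As a memory aid, the retrieval options above can be turned into a helper that picks the slowest option that still meets a deadline. It assumes that slower options cost less (the card only states that Bulk is free), and the names below are hypothetical.

```python
from typing import Optional

# Worst-case retrieval time in hours, taken from the card above,
# ordered from slowest (assumed cheapest) to fastest.
FLEXIBLE_RETRIEVAL_MAX_HOURS = {
    "Bulk": 12.0,          # 5-12 hours, free
    "Standard": 5.0,       # 3-5 hours
    "Expedited": 5 / 60,   # 1-5 minutes
}


def slowest_option_within(deadline_hours: float) -> Optional[str]:
    """Return the slowest retrieval option whose worst case meets the deadline."""
    for option, max_hours in FLEXIBLE_RETRIEVAL_MAX_HOURS.items():
        if max_hours <= deadline_hours:
            return option
    return None  # No option is guaranteed to finish in time.
```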
What is the minimum object billable size for S3 Infrequent Access?
128 KB
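A minimum billable size means that smaller objects are charged as if they were that size. A minimal sketch of the arithmetic, with hypothetical names:

```python
KB = 1024
IA_MIN_BILLABLE_BYTES = 128 * KB  # minimum stated on the card above


def billable_bytes(object_size: int, minimum: int = IA_MIN_BILLABLE_BYTES) -> int:
    """Return the size an object is billed for under a minimum billable size."""
    return max(object_size, minimum)
```

So a 10 KB object stored in Infrequent Access is billed as 128 KB, while a 500 KB object is billed at its actual size.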
What S3 feature can you use to automatically transition objects between S3 Storage Classes?
S3 lifecycle rules
What is a typical use case for S3 One-Zone IA?
Storing historical data that can easily be regenerated
Aside from automatic storage class transition, what else can S3 Lifecycle rules be used for?
Delete old files or old file versions after a predetermined amount of time has passed
What can you use to discriminate which objects should be affected by an S3 lifecycle rule or not?
Object prefix and tags
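The lifecycle cards above (transitions, deletions, and prefix/tag filters) can be combined in one rule. The sketch below only builds the rule as a Python dict in the shape that boto3's `put_bucket_lifecycle_configuration` expects for its `LifecycleConfiguration` argument; the rule ID, prefix, tag, and day counts are hypothetical values chosen for illustration.

```python
# Hypothetical rule: ID, prefix, tag, and day counts are illustrative only.
lifecycle_configuration = {
    "Rules": [
        {
            "ID": "archive-then-expire-logs",
            "Status": "Enabled",
            # Apply the rule only to objects matching both the prefix and the tag.
            "Filter": {
                "And": {
                    "Prefix": "logs/",
                    "Tags": [{"Key": "retention", "Value": "short"}],
                }
            },
            # Automatic storage class transitions.
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER"},
            ],
            # Delete current objects after a year.
            "Expiration": {"Days": 365},
            # Delete old versions 30 days after they become noncurrent.
            "NoncurrentVersionExpiration": {"NoncurrentDays": 30},
        }
    ]
}
```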
What is S3 Analytics?
An S3 feature that analyses S3 objects and recommends storage classes for them.
Which storage classes can S3 Analytics recommend?
Standard or IA.
What are the User-Based security features available for S3?
IAM Policies
What are the resource-based security features available for S3?
-Bucket policies (Allows cross account)
-Object Access Control List (More fine grained)
-Bucket Access Control List (less common)
What are the main uses of Bucket Policies?
-Grant Public Access to files
-Grant Cross-Account access to files
-Enforce object encryption at upload
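One common way to enforce encryption at upload is a Deny statement that rejects `PutObject` requests lacking the server-side encryption header. The sketch below builds such a policy as JSON; the bucket name is hypothetical, and this is one possible pattern rather than the only one.

```python
import json

ENCRYPTION_BUCKET = "example-bucket"  # hypothetical bucket name

deny_unencrypted_uploads = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyUnencryptedUploads",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:PutObject",
            "Resource": f"arn:aws:s3:::{ENCRYPTION_BUCKET}/*",
            # "Null": "true" matches requests where the header is absent.
            "Condition": {"Null": {"s3:x-amz-server-side-encryption": "true"}},
        }
    ],
}

encryption_policy_json = json.dumps(deny_unencrypted_uploads, indent=2)
```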
What are the types of server side encryption available for S3 objects?
-SSE-S3 (Enabled by default)
-SSE-KMS
-SSE-C
True or False: SSE-KMS has no impact on KMS usage quotas
False, SSE-KMS uses both GenerateDataKey and Decrypt APIs, and those may impact the quotas of the KMS service
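A rough way to reason about the quota impact: assuming one GenerateDataKey call per upload and one Decrypt call per download (that is, without S3 Bucket Keys reducing the call count), the KMS request rate is the sum of both. The function names are hypothetical, and the quota is a parameter because the real value depends on the Region.

```python
def kms_requests_per_second(uploads_per_s: float, downloads_per_s: float) -> float:
    """Estimate KMS calls per second generated by SSE-KMS traffic."""
    # Assumption: each upload triggers GenerateDataKey, each download triggers Decrypt.
    return uploads_per_s + downloads_per_s


def exceeds_quota(uploads_per_s: float, downloads_per_s: float,
                  quota_per_s: float) -> bool:
    """Return True if the estimated KMS request rate is above the given quota."""
    return kms_requests_per_second(uploads_per_s, downloads_per_s) > quota_per_s
```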
True or False: SSE-C accepts both HTTP and HTTPS
False, only HTTPS
True or False: You can use bucket policies to force S3 to accept only transfers made through HTTPS
True
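A sketch of such a policy, using the `aws:SecureTransport` condition key to deny any request that does not arrive over HTTPS. The bucket name is hypothetical.

```python
import json

HTTPS_BUCKET = "example-bucket"  # hypothetical bucket name

https_only_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyInsecureTransport",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            # Cover both the bucket itself and every object in it.
            "Resource": [
                f"arn:aws:s3:::{HTTPS_BUCKET}",
                f"arn:aws:s3:::{HTTPS_BUCKET}/*",
            ],
            # aws:SecureTransport is "false" for plain HTTP requests.
            "Condition": {"Bool": {"aws:SecureTransport": "false"}},
        }
    ],
}

https_policy_json = json.dumps(https_only_policy, indent=2)
```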
What are the retrieval types for Glacier Deep Archive data?
- Standard (12 hours)
- Bulk (48 hours)
What are the possible consumers for Kinesis Data Streams data?
- Applications
- Lambda
- Amazon Kinesis Data Firehose
- Managed Service for Apache Flink
How long can data be retained in Kinesis Data Streams?
Up to 365 days
What is the size of a data shard on Kinesis Data Streams?
1 MB
What are the Kinesis Data Streams capacity modes and what are their I/O speeds?
-Provisioned (1 MB/s or 1,000 records/s in, 2 MB/s out, per shard)
-On-demand (4 MB/s or 4,000 records/s in and out)
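The provisioned-mode figures above apply per shard, so a shard count can be estimated from the expected traffic. A minimal sizing sketch with a hypothetical function name:

```python
import math


def shards_needed(in_mb_per_s: float, in_records_per_s: float,
                  out_mb_per_s: float) -> int:
    """Estimate the shard count for provisioned mode from per-shard limits."""
    return max(
        1,                                    # a stream has at least one shard
        math.ceil(in_mb_per_s / 1.0),         # 1 MB/s write per shard
        math.ceil(in_records_per_s / 1000),   # 1,000 records/s write per shard
        math.ceil(out_mb_per_s / 2.0),        # 2 MB/s read per shard
    )
```

The limit that needs the most shards wins: 3.5 MB/s in, 2,000 records/s in, and 4 MB/s out requires 4 shards because of the write throughput.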
Where can Kinesis Data Firehose write to?
-S3
-Redshift
-OpenSearch
-Custom HTTP endpoints
-3rd party services
True or False: Kinesis Firehose performs real-time data writing
False, it performs near real-time
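Firehose is near real-time because it buffers incoming records and delivers them in batches once a size or time threshold is reached. The toy class below illustrates that idea only; it is not the Firehose API, all names are hypothetical, and unlike a real service it checks the time threshold only when a new record arrives.

```python
from typing import List, Optional


class BufferedDelivery:
    """Toy buffer that flushes when it is full or its oldest record is too old."""

    def __init__(self, max_bytes: int, max_seconds: float) -> None:
        self.max_bytes = max_bytes
        self.max_seconds = max_seconds
        self._records: List[bytes] = []
        self._size = 0
        self._first_arrival: Optional[float] = None

    def put(self, record: bytes, now: float) -> Optional[List[bytes]]:
        """Add a record; return the flushed batch if a threshold was reached."""
        if self._first_arrival is None:
            self._first_arrival = now
        self._records.append(record)
        self._size += len(record)
        full = self._size >= self.max_bytes
        stale = now - self._first_arrival >= self.max_seconds
        if full or stale:
            batch = self._records
            self._records, self._size, self._first_arrival = [], 0, None
            return batch
        return None
```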
What is Amazon Managed Service for Apache Flink?
It is an AWS service that allows you to run Apache Flink serverless on AWS
True or False: Amazon Managed Service for Apache Flink cannot read data from Kinesis Firehose
True
True or False: Amazon Managed Service for Apache Flink supports both Python and Scala
True
What are the main use cases for Kinesis Data Analytics?
-Streaming ETL (only simple transformations)
-Continuous Metric Generation
-Responsive analytics for certain metrics
What is Kinesis Data Analytics?
It’s a service for real-time ETL / ML algorithms on streams
What are the possible data producers for Kinesis Video Streams?
-Cameras
-Radar data
-Images
-Audio Feeds
-AWS DeepLens
-RTSP cameras
How many data producers can each video stream have?
1
What are the possible Kinesis Video Streams consumers?
-Custom consumers
-Amazon SageMaker
-Amazon Rekognition Video
How long can Kinesis Video Streams keep your data?
Between 1 hour and 10 years
What is Glue Data Catalog?
It is a metadata repository for all S3 based tables’ schemas.
What AWS services can integrate with Data Catalog schemas?
-EMR
-Athena
-Redshift Spectrum
What is a Glue Crawler?
A Glue Crawler is a functionality that scans your data to infer table schemas from it automatically
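Schema inference can be pictured as scanning records and collecting the types seen for each field. The sketch below is a plain-Python illustration of that idea, not the Glue Crawler's actual classifiers; the function name is hypothetical.

```python
from typing import Any, Dict, Iterable, List, Set


def infer_schema(records: Iterable[Dict[str, Any]]) -> Dict[str, List[str]]:
    """Collect the Python type names observed for each field across records."""
    seen: Dict[str, Set[str]] = {}
    for record in records:
        for field, value in record.items():
            seen.setdefault(field, set()).add(type(value).__name__)
    # Sort type names so the result is deterministic.
    return {field: sorted(types) for field, types in seen.items()}
```

A field that is missing from some records still appears in the result, and a field holding mixed types lists every type that was observed.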
What input file formats are accepted by Glue Crawlers?
-JSON
-Parquet
-CSV
-Relational Stores
A Glue Crawler can run on data stored on which AWS services?
-S3
-Redshift
-RDS
What is Glue ETL?
It is a Glue feature that allows you to perform transformations on your data using a serverless Spark platform
True or False: Glue ETL accepts only Spark scripts
False, it accepts both Spark and PySpark
True or False: Glue ETL accepts only scripts written in Scala
False, it accepts both Scala and Python
What targets can Glue ETL accept?
S3, JDBC connector (Redshift, RDS) or Glue Data Catalog
What are the bundled transformations available on Glue ETL?
- DropFields, DropNullFields
- Filter - Specify a function to filter records
- Join - To enrich data
- Map - Add fields, delete fields, perform external lookups
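To make the transformations above concrete, here is a plain-Python analogue operating on lists of dicts. It does not use the Glue library (which runs only inside a Glue job); every function name and the sample data are hypothetical.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


def drop_fields(rows: List[Row], fields: List[str]) -> List[Row]:
    """DropFields analogue: remove the named fields from every row."""
    return [{k: v for k, v in row.items() if k not in fields} for row in rows]


def filter_rows(rows: List[Row], keep: Callable[[Row], bool]) -> List[Row]:
    """Filter analogue: keep only rows for which the function returns True."""
    return [row for row in rows if keep(row)]


def join_rows(left: List[Row], right: List[Row], key: str) -> List[Row]:
    """Join analogue: inner join on a shared key to enrich the left rows."""
    index: Dict[Any, List[Row]] = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


def map_rows(rows: List[Row], fn: Callable[[Row], Row]) -> List[Row]:
    """Map analogue: apply a function to every row, e.g. to add a field."""
    return [fn(dict(row)) for row in rows]


orders = [
    {"order_id": 1, "customer_id": 10, "amount": 25.0, "debug": "x"},
    {"order_id": 2, "customer_id": 11, "amount": 5.0, "debug": "y"},
]
customers = [{"customer_id": 10, "name": "Ada"}, {"customer_id": 11, "name": "Lin"}]

result = drop_fields(orders, ["debug"])
result = filter_rows(result, lambda r: r["amount"] >= 10)
result = join_rows(result, customers, "customer_id")
result = map_rows(result, lambda r: {**r, "amount_cents": int(r["amount"] * 100)})
```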
What are the machine learning transformations available on Glue ETL?
FindMatches ML: identify duplicate or matching records in your dataset, even when the records do not have a common unique identifier and no fields match exactly
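FindMatches relies on a trained ML model, which is not reproduced here. As a much simpler stand-in that conveys the goal (finding records that match without being identical), the sketch below uses string similarity from Python's `difflib`; the function name and threshold are hypothetical choices.

```python
from difflib import SequenceMatcher
from itertools import combinations
from typing import List, Tuple


def similar_pairs(names: List[str], threshold: float = 0.8) -> List[Tuple[str, str]]:
    """Return pairs of names whose similarity ratio reaches the threshold."""
    pairs = []
    for a, b in combinations(names, 2):
        # Compare case-insensitively; ratio() is 1.0 for identical strings.
        score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if score >= threshold:
            pairs.append((a, b))
    return pairs
```

With the names "Jon Smith", "John Smith", and "Maria Garcia", only the first two are paired, even though no field matches exactly.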
What formats can Glue ETL convert files to?
-CSV
-JSON
-Parquet
-Avro
-ORC
-XML
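Format conversion amounts to reading records in one format and writing them in another. The standard-library sketch below converts CSV text to JSON Lines to illustrate the idea; it is not Glue code, and the function name is hypothetical.

```python
import csv
import io
import json


def csv_to_json_lines(csv_text: str) -> str:
    """Convert CSV text (with a header row) to one JSON object per line."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Values stay strings: CSV carries no type information.
    return "\n".join(json.dumps(row) for row in reader)
```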
True or False: Glue ETL can perform Apache Spark default transformations (ex: K-means)
True
What is Glue DataBrew?
It is a feature that allows you to perform transformations to your data without writing any code
True or False: Glue Data Catalog Schemas are versioned
True
What is a recommended data store on AWS for performing clickstream analytics?
OpenSearch
What is AWS Data Pipeline?
It is an AWS service that automates the movement and transformation of data
What are possible data targets for Data Pipeline?
S3, DynamoDB, RDS, Redshift and EMR
What is the main difference between Data Pipeline and Glue ETL?
Data Pipeline gives you more control over the environment that runs your jobs and allows access to the underlying EC2 and EMR instances, whereas Glue ETL is serverless and manages the infrastructure for you
True or False: AWS Batch is a serverless service, where you only pay for the usage of the underlying EC2 instances
True
What is AWS Batch?
It is an AWS service that allows you to run batch jobs based on Docker images
True or False: The only EC2 purchasing option accepted by AWS Batch is On-Demand
False, it accepts both On-Demand and Spot Instances
What is the maximum execution time for a Step Functions state machine?
1 year
What is AWS DataSync?
It is an AWS service that helps with the migration of outside files to AWS
What are the possible AWS DataSync targets?
-S3
-EFS
-FSx
What is MQTT?
It is a lightweight publish/subscribe messaging protocol used in IoT