Data Engineering and Storage Flashcards

1
Q

What is the maximum size of an S3 object?

A

5 TB

2
Q

What are the available S3 storage classes?

A
  • Standard
  • Infrequent Access (IA)
  • Intelligent Tiering
  • One-Zone IA
  • Glacier Instant Retrieval
  • Glacier Flexible Retrieval
  • Glacier Deep Archive
3
Q

For S3 Intelligent Tiering, what are the available tiers and when does an object enter each of those tiers?

A

- Frequent Access tier: default tier
- Infrequent Access tier: not accessed for 30 days
- Archive Instant Access tier: not accessed for 90 days
- Archive Access tier: configurable from 90 to 700+ days
- Deep Archive Access tier: configurable from 180 to 700+ days
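As a rough mental model, the transition thresholds above can be sketched as a function of days since last access. This is an illustrative sketch only, not an AWS API; the optional archive tiers are modeled as configurable parameters:

```python
def intelligent_tiering_tier(days_since_access,
                             archive_after=None,
                             deep_archive_after=None):
    """Return the tier an object would sit in after `days_since_access` days.

    `archive_after` / `deep_archive_after` model the optional, configurable
    Archive Access (>= 90 days) and Deep Archive Access (>= 180 days) tiers;
    when left as None, those tiers are disabled.
    """
    if deep_archive_after is not None and days_since_access >= deep_archive_after:
        return "Deep Archive Access"
    if archive_after is not None and days_since_access >= archive_after:
        return "Archive Access"
    if days_since_access >= 90:
        return "Archive Instant Access"
    if days_since_access >= 30:
        return "Infrequent Access"
    return "Frequent Access"

print(intelligent_tiering_tier(10))                           # Frequent Access
print(intelligent_tiering_tier(45))                           # Infrequent Access
print(intelligent_tiering_tier(120))                          # Archive Instant Access
print(intelligent_tiering_tier(200, deep_archive_after=180))  # Deep Archive Access
```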

4
Q

For S3 Intelligent Tiering, which tiers are automatically configured and which are optional?

A

Automatic: Frequent Access, Infrequent Access, Archive Instant Access
Optional: Archive Access, Deep Archive Access

5
Q

What is the minimum object billable size for S3 Glacier Flexible Retrieval and Glacier Deep Archive?

A

40 KB

6
Q

What are the available retrieval times for Glacier Flexible Retrieval?

A

- Expedited: 1-5 minutes
- Standard: 3-5 hours
- Bulk: 5-12 hours (free)

7
Q

What is the minimum object billable size for S3 Infrequent Access?

A

128 KB

8
Q

What S3 feature can you use to automatically transition objects between S3 Storage Classes?

A

S3 lifecycle rules
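A lifecycle rule that transitions and expires objects can be sketched as a configuration payload shaped like what boto3's `put_bucket_lifecycle_configuration` expects. The bucket name, rule ID, and prefix below are hypothetical, and no AWS call is made:

```python
# Sketch of a lifecycle configuration: transition objects under logs/ to
# Standard-IA after 30 days, to Glacier after 90, and delete them after a year.
lifecycle_config = {
    "Rules": [
        {
            "ID": "archive-old-logs",       # hypothetical rule name
            "Status": "Enabled",
            "Filter": {"Prefix": "logs/"},  # only objects under logs/
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER"},
            ],
            "Expiration": {"Days": 365},    # delete after one year
        }
    ]
}

# With AWS credentials configured, this could be applied with:
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="my-example-bucket", LifecycleConfiguration=lifecycle_config)
```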

9
Q

What is a typical use case for S3 One-Zone IA?

A

Storing historical data that can easily be regenerated

10
Q

Aside from automatic storage class transition, what else can S3 Lifecycle rules be used for?

A

Delete old files or old file versions after a predetermined amount of time has passed

11
Q

What can you use to select which objects an S3 lifecycle rule applies to?

A

Object prefix and tags

12
Q

What is S3 Analytics?

A

An S3 feature that analyzes object access patterns and recommends storage classes for your objects.

13
Q

Which storage classes can S3 Analytics recommend?

A

Standard and Standard-IA only (it does not make recommendations for One-Zone IA or Glacier).

14
Q

What are the User-Based security features available for S3?

A

IAM Policies

15
Q

What are the resource-based security features available for S3?

A

- Bucket Policies (allow cross-account access)
- Object Access Control Lists (finer grained)
- Bucket Access Control Lists (less common)

16
Q

What are the main uses of Bucket Policies?

A

- Grant public access to objects
- Grant cross-account access to objects
- Enforce object encryption at upload

17
Q

What are the types of server side encryption available for S3 objects?

A

- SSE-S3 (enabled by default)
- SSE-KMS
- SSE-C

18
Q

True or False: SSE-KMS has no impact on KMS usage quotas

A

False, SSE-KMS calls the GenerateDataKey and Decrypt KMS APIs, and those calls count against KMS request quotas

19
Q

True or False: SSE-C accepts both HTTP and HTTPS

A

False, only HTTPS

20
Q

True or False: You can use bucket policies to force S3 to only accept transfers over HTTPS

A

True
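Such a policy works by denying any request whose `aws:SecureTransport` condition key is false. A sketch of the policy document, built as a Python dict (the bucket name is hypothetical):

```python
import json

# Deny every S3 action on the bucket and its objects when the request
# was not made over HTTPS (aws:SecureTransport == "false").
bucket = "my-example-bucket"  # hypothetical bucket name
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyInsecureTransport",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": [
                f"arn:aws:s3:::{bucket}",
                f"arn:aws:s3:::{bucket}/*",
            ],
            "Condition": {"Bool": {"aws:SecureTransport": "false"}},
        }
    ],
}

print(json.dumps(policy, indent=2))
```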

21
Q

What are the retrieval types for Glacier Deep Archive data?

A
  • Standard (12 hours)
  • Bulk (48 hours)
22
Q

What are the possible consumers for Kinesis Data Streams data?

A
  • Applications
  • Lambda
  • Amazon Kinesis Data Firehose
  • Managed Service for Apache Flink
23
Q

For how long can data be stored in a Kinesis Data Stream?

A

Between 1 day (the default) and 365 days

24
Q

What is the maximum size of a single data record in Kinesis Data Streams?

A

1 MB

25
Q

What are the Kinesis Data Streams capacity modes and what are their I/O limits?

A

- Provisioned: 1 MB/s or 1,000 records/s in, 2 MB/s out (per shard)
- On-demand: 4 MB/s or 4,000 records/s in by default, scaling automatically

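In provisioned mode you size the stream yourself, so the per-shard limits above translate into a simple sizing calculation. A sketch, with hypothetical workload numbers:

```python
import math

def shards_needed(mb_per_sec, records_per_sec):
    """Minimum provisioned shard count for a given ingest workload,
    using the per-shard limits of 1 MB/s and 1,000 records/s."""
    by_throughput = math.ceil(mb_per_sec / 1.0)     # 1 MB/s in per shard
    by_records = math.ceil(records_per_sec / 1000)  # 1,000 records/s per shard
    return max(by_throughput, by_records, 1)

# 5.5 MB/s needs 6 shards by throughput; 3,000 records/s needs only 3.
print(shards_needed(mb_per_sec=5.5, records_per_sec=3000))  # 6
```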
26
Q

Where can Kinesis Data Firehose write to?

A

- S3
- Redshift
- OpenSearch
- Elasticsearch
- CloudWatch Logs
- Custom HTTP endpoints
- 3rd-party services

27
Q

True or False: Kinesis Data Firehose performs real-time data writing

A

False, it is near real-time

28
Q

What is Amazon Managed Service for Apache Flink?

A

It is an AWS service that lets you run Apache Flink applications serverlessly on AWS

29
Q

True or False: Amazon Managed Service for Apache Flink cannot read data from Kinesis Data Firehose

A

True, it reads from Kinesis Data Streams (and Amazon MSK), not from Firehose

30
Q

True or False: Amazon Managed Service for Apache Flink supports both Python and Scala

A

True

31
Q

What are the main use cases for Kinesis Data Analytics?

A

- Streaming ETL (only simple transformations)
- Continuous metric generation
- Responsive analytics on certain metrics

32
Q

What is Kinesis Data Analytics?

A

It is a service for running real-time ETL and ML algorithms on streaming data

33
Q

What are the possible data producers for Kinesis Video Streams?

A

- Cameras (including RTSP cameras)
- Radar data
- Images
- Audio feeds
- AWS DeepLens

34
Q

How many data producers can each video stream have?

A

1

35
Q

What are the possible Kinesis Video Streams consumers?

A

- Custom consumers
- AWS SageMaker
- Amazon Rekognition Video

36
Q

How long can Kinesis Video Streams keep your data?

A

Between 1 day and 10 years

37
Q

What is the Glue Data Catalog?

A

It is a central metadata repository for table definitions and schemas, commonly used for tables backed by S3 data

38
Q

What AWS services can integrate with Glue Data Catalog schemas?

A

- EMR
- Athena
- Redshift Spectrum

39
Q

What is a Glue Crawler?

A

A Glue Crawler scans your data and automatically infers table schemas from it

40
Q

What input file formats are accepted by Glue Crawlers?

A

- JSON
- Parquet
- CSV
- Relational stores

41
Q

A Glue Crawler can run on data stored in which AWS services?

A

- S3
- Redshift
- RDS

42
Q

What is Glue ETL?

A

It is a Glue feature that lets you perform transformations on your data using a serverless Spark platform

43
Q

True or False: Glue ETL accepts only Spark scripts

A

False, it accepts both Spark (Scala) and PySpark scripts

44
Q

True or False: Glue ETL accepts only scripts written in Scala

A

False, it accepts both Scala and Python

45
Q

What targets can Glue ETL write to?

A

S3, JDBC connections (Redshift, RDS), or the Glue Data Catalog

46
Q

What are the bundled transformations available in Glue ETL?

A

- DropFields, DropNullFields: drop fields from records
- Filter: specify a function to filter records
- Join: enrich data
- Map: add fields, delete fields, perform external lookups

47
Q

What are the machine learning transformations available in Glue ETL?

A

FindMatches ML: identifies duplicate or matching records in your dataset, even when the records have no common unique identifier and no fields match exactly

48
Q

What formats can Glue ETL convert files to?

A

- CSV
- JSON
- Parquet
- Avro
- ORC
- XML

49
Q

True or False: Glue ETL can perform Apache Spark's built-in transformations (e.g., K-Means)

A

True

50
Q

What is Glue DataBrew?

A

It is a feature that lets you perform transformations on your data without writing any code

51
Q

True or False: Glue Data Catalog schemas are versioned

A

True

52
Q

What is a recommended data store on AWS for performing clickstream analytics?

A

OpenSearch

53
Q

What is AWS Data Pipeline?

A

It is an AWS service that automates the movement and transformation of data

54
Q

What are possible data targets for Data Pipeline?

A

S3, DynamoDB, RDS, Redshift and EMR

55
Q

What is the main difference between Data Pipeline and Glue ETL?

A

Data Pipeline gives you more control over the underlying infrastructure (it allows access to the EC2 and EMR instances), whereas Glue ETL is fully serverless

56
Q

True or False: AWS Batch is a serverless-style service where you only pay for the usage of the underlying EC2 instances

A

True

57
Q

What is AWS Batch?

A

It is an AWS service that lets you run batch jobs based on Docker images

58
Q

True or False: The only EC2 instance type accepted by AWS Batch is On-Demand

A

False, both On-Demand and Spot instances are supported

59
Q

What is the maximum execution time for a Step Functions state machine?

A

1 year

60
Q

What is AWS DataSync?

A

It is an AWS service that helps migrate files from outside storage systems into AWS

61
Q

What are the possible AWS DataSync targets?

A

- S3
- EFS
- FSx

62
Q

What is MQTT?

A

It is a lightweight messaging protocol commonly used in IoT