Data Engineering and Storage Flashcards

1
Q

What is the maximum size of an S3 object?

A

5 TB

2
Q

What are the available S3 storage classes?

A
  • Standard
  • Infrequent Access (IA)
  • Intelligent Tiering
  • One-Zone IA
  • Glacier Instant Retrieval
  • Glacier Flexible Retrieval
  • Glacier Deep Archive
3
Q

For S3 Intelligent Tiering, what are the available tiers and when does an object enter each of those tiers?

A

-Standard Tier: default tier
-IA Tier: not accessed for 30 days
-Archive Instant Access Tier: not accessed for 90 days
-Archive Access Tier: configurable, from 90 days to 700+ days
-Deep Archive Access Tier: configurable, from 180 days to 700+ days

4
Q

For S3 Intelligent Tiering, which tiers are automatically configured and which are optional?

A

Automatic: Standard, IA, Archive Instant Access
Optional: Archive Access, Deep Archive Access

5
Q

What is the minimum object billable size for S3 Glacier Flexible Retrieval and Glacier Deep Archive?

A

40 KB

6
Q

What are the available retrieval times for Glacier Flexible Retrieval?

A

-Expedited: 1-5 min
-Standard: 3-5 hours
-Bulk: 5-12 hours (Free)

7
Q

What is the minimum object billable size for S3 Infrequent Access?

A

128 KB

8
Q

What S3 feature can you use to automatically transition objects between S3 Storage Classes?

A

S3 lifecycle rules

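
A rule like this can be sketched with boto3; the bucket name, prefix, and day counts below are hypothetical:

```python
# Sketch of an S3 lifecycle rule (hypothetical bucket, prefix, and days):
# transition objects under "logs/" to Standard-IA after 30 days, to Glacier
# Flexible Retrieval after 90 days, and delete them after 365 days.
lifecycle_configuration = {
    "Rules": [
        {
            "ID": "archive-then-expire-logs",
            "Status": "Enabled",
            "Filter": {"Prefix": "logs/"},
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER"},
            ],
            "Expiration": {"Days": 365},
        }
    ]
}

# Applying it would look like this (requires AWS credentials):
# import boto3
# s3 = boto3.client("s3")
# s3.put_bucket_lifecycle_configuration(
#     Bucket="my-example-bucket",
#     LifecycleConfiguration=lifecycle_configuration,
# )
```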
9
Q

What is a default use case for S3 One-Zone IA?

A

Storing historical data that can easily be regenerated

10
Q

Aside from automatic storage class transition, what else can S3 Lifecycle rules be used for?

A

Deleting old objects or old object versions after a predetermined amount of time has passed

11
Q

What can you use to select which objects an S3 lifecycle rule applies to?

A

Object prefix and tags

12
Q

What is S3 Analytics?

A

An S3 feature that analyzes object access patterns and recommends storage classes for your objects.

13
Q

Which storage classes can S3 Analytics recommend?

A

Standard or IA.

14
Q

What are the User-Based security features available for S3?

A

IAM Policies

15
Q

What are the resource-based security features available for S3?

A

-Bucket Policies (allow cross-account access)
-Object Access Control Lists (more fine-grained)
-Bucket Access Control Lists (less common)

16
Q

What are the main uses of Bucket Policies?

A

-Grant Public Access to files
-Grant Cross-Account access to files
-Enforce object encryption at upload

17
Q

What are the types of server-side encryption available for S3 objects?

A

-SSE-S3 (enabled by default)
-SSE-KMS
-SSE-C

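
As a sketch of how the three types surface in the S3 API, each maps to different `put_object` parameters in boto3; the bucket, key, and KMS key alias below are hypothetical:

```python
# Hypothetical request parameters for each server-side encryption type.
# SSE-S3 (default): AES256, keys fully managed by S3.
sse_s3_params = {
    "Bucket": "my-bucket",
    "Key": "file.txt",
    "ServerSideEncryption": "AES256",
}

# SSE-KMS: encryption with a KMS key you control.
sse_kms_params = {
    "Bucket": "my-bucket",
    "Key": "file.txt",
    "ServerSideEncryption": "aws:kms",
    "SSEKMSKeyId": "alias/my-key",  # hypothetical key alias
}

# SSE-C: you supply the key material on every request (HTTPS only).
sse_c_params = {
    "Bucket": "my-bucket",
    "Key": "file.txt",
    "SSECustomerAlgorithm": "AES256",
    "SSECustomerKey": "<32-byte key material>",
}

# Each dict would be passed to boto3, e.g. s3.put_object(**sse_kms_params).
```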
18
Q

True or False: SSE-KMS has no impact on KMS usage quotas

A

False. SSE-KMS uses the GenerateDataKey and Decrypt KMS APIs, and those calls count against your KMS request quotas

19
Q

True or False: SSE-C accepts both HTTP and HTTPS

A

False, only HTTPS

20
Q

True or False: You can use bucket policies to force S3 to only accept transfers through HTTPS

A

True

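
Such a policy relies on the aws:SecureTransport condition key, which is false for plain-HTTP requests. A sketch in Python (the bucket name is hypothetical):

```python
import json

# Sketch of a bucket policy that denies any request not made over HTTPS.
# "my-example-bucket" is a hypothetical bucket name.
enforce_https_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyInsecureTransport",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": [
                "arn:aws:s3:::my-example-bucket",
                "arn:aws:s3:::my-example-bucket/*",
            ],
            # aws:SecureTransport is "false" when the request used plain HTTP
            "Condition": {"Bool": {"aws:SecureTransport": "false"}},
        }
    ],
}

policy_json = json.dumps(enforce_https_policy)
# Applying it (requires AWS credentials):
# import boto3
# boto3.client("s3").put_bucket_policy(
#     Bucket="my-example-bucket", Policy=policy_json
# )
```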
21
Q

What are the retrieval types for Glacier Deep Archive data?

A
  • Standard (12 hours)
  • Bulk (48 hours)
22
Q

What are the possible consumers for Kinesis Data Streams data?

A
  • Applications
  • Lambda
  • Amazon Kinesis Data Firehose
  • Managed Service for Apache Flink
23
Q

For how long can data stay stored on Kinesis Data Streams?

A

Between 1 and 365 days (24 hours by default)

24
Q

What is the maximum size of a data record on Kinesis Data Streams?

A

1 MB

25
Q

What are the Kinesis Data Streams capacity modes and what are their throughput limits?

A

-Provisioned: 1 MB/s or 1,000 records/s in, 2 MB/s out (per shard)
-On-demand: 4 MB/s or 4,000 records/s in and out

26
Q

Where can Kinesis Data Firehose write to?

A

-S3
-Redshift
-OpenSearch
-Custom HTTP endpoints
-3rd-party services

27
Q

True or False: Kinesis Firehose performs real-time data writing

A

False, it is near real-time, because it buffers data before writing it
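
The near-real-time behavior comes from buffering: Firehose accumulates records and delivers a batch only when a size or time threshold is reached. A toy sketch of the size-threshold part (the class name and threshold are invented for illustration):

```python
# Toy illustration of buffered delivery: records are held until a size
# threshold is reached, then flushed as one batch (Firehose also flushes
# on a time threshold, omitted here for brevity).
class BufferingSink:
    def __init__(self, max_records: int, flush_fn):
        self.max_records = max_records
        self.flush_fn = flush_fn
        self.buffer = []

    def put(self, record):
        self.buffer.append(record)
        if len(self.buffer) >= self.max_records:  # size threshold reached
            self.flush_fn(self.buffer)
            self.buffer = []

batches = []
sink = BufferingSink(max_records=3, flush_fn=batches.append)
for i in range(7):
    sink.put(i)

# Two full batches were delivered; the last record is still waiting in the
# buffer (in Firehose, the time threshold would eventually flush it).
```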

28
Q

What is Amazon Managed Service for Apache Flink?

A

It is an AWS service that allows you to run Apache Flink applications on AWS without managing servers

29
Q

True or False: Amazon Managed Service for Apache Flink cannot read data from Kinesis Firehose

A

True. It can read from Kinesis Data Streams and Amazon MSK, but not directly from Kinesis Data Firehose

30
Q

True or False: Amazon Managed Service for Apache Flink supports both Python and Scala

A

True

31
Q

What are the main use cases for Kinesis Data Analytics?

A

-Streaming ETL (only simple transformations)
-Continuous Metric Generation
-Responsive analytics for certain metrics

32
Q

What is Kinesis Data Analytics?

A

It’s a service for real-time ETL / ML algorithms on streams

33
Q

What are the possible data producers for Kinesis Video Streams?

A

-Cameras
-Radar data
-Images
-Audio Feeds
-AWS DeepLens
-RTSP cameras

34
Q

How many data producers can each video stream have?

A

One producer per video stream

35
Q

What are the possible Kinesis Video Streams consumers?

A

-Custom consumers
-AWS SageMaker
-Amazon Rekognition Video

36
Q

How long can Kinesis Video Streams keep your data?

A

Between 1 day and 10 years

37
Q

What is Glue Data Catalog?

A

It is a metadata repository that stores the schemas of your tables (for example, S3-based tables).

38
Q

What AWS services can integrate with Data Catalog schemas?

A

-EMR
-Athena
-Redshift Spectrum

39
Q

What is a Glue Crawler?

A

A Glue Crawler is a functionality that scans your data to infer table schemas from it automatically

40
Q

What input file formats are accepted by Glue Crawlers?

A

-JSON
-Parquet
-CSV
-Relational Stores

41
Q

A Glue Crawler can run on data stored on which AWS services?

A

-S3
-Redshift
-RDS

42
Q

What is Glue ETL?

A

It is a Glue functionality that allows you to perform transformations on your data using a serverless Spark platform

43
Q

True or False: Glue ETL accepts only Spark scripts

A

False, it accepts both Spark (Scala) and PySpark (Python) scripts

44
Q

True or False: Glue ETL accepts only scripts written in Scala

A

False, it accepts both Scala and Python

45
Q

What targets can Glue ETL write to?

A

S3, JDBC connector (Redshift, RDS) or Glue Data Catalog

46
Q

What are the bundled transformations available on Glue ETL?

A
  • DropFields, DropNullFields
  • Filter - Specify a function to filter records
  • Join - To enrich data
  • Map - Add fields, delete fields, perform external lookups
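
As a rough illustration, these bundled transformations behave like ordinary record-level operations. A plain-Python sketch with made-up records (Glue actually applies them to DynamicFrames, not lists):

```python
# Plain-Python analogue of Glue ETL's bundled transformations
# (hypothetical records and field names, for illustration only).
records = [
    {"user": "a", "country": "BR", "score": 10},
    {"user": "b", "country": None, "score": 7},
]

# Filter: keep records matching a predicate
filtered = [r for r in records if r["score"] > 5]

# Map: add or rework fields
mapped = [{**r, "score_pct": r["score"] * 10} for r in filtered]

# DropNullFields: remove fields whose value is null
cleaned = [{k: v for k, v in r.items() if v is not None} for r in mapped]
```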
47
Q

What are the machine learning transformations available on Glue ETL?

A

FindMatches ML: identify duplicate or matching records in your dataset, even when the records do not have a common unique identifier and no fields match exactly

48
Q

What formats can Glue ETL convert files to?

A

-CSV
-JSON
-Parquet
-Avro
-ORC
-XML

49
Q

True or False: Glue ETL can perform Apache Spark default transformations (ex: K-means)

A

True

50
Q

What is Glue Databrew?

A

It is a feature that allows you to perform transformations on your data without writing any code

51
Q

True or False: Glue Data Catalog Schemas are versioned

A

True

52
Q

What is a recommended data store on AWS for performing clickstream analytics?

A

OpenSearch

53
Q

What is AWS Data Pipeline?

A

It is an AWS service that automates the movement and transformation of data

54
Q

What are possible data targets for Data Pipeline?

A

S3, DynamoDB, RDS, Redshift and EMR

55
Q

What is the main difference between Data Pipeline and Glue ETL?

A

Data Pipeline gives you more control over the underlying migration infrastructure and allows access to the EC2 and EMR instances, whereas Glue ETL is serverless

56
Q

True or False: AWS Batch is a serverless service, where you only pay for the usage of the underlying EC2 instances

A

True

57
Q

What is AWS Batch?

A

It is an AWS service that allows you to run batch jobs based on Docker images

58
Q

True or False: The only EC2 instance purchasing option accepted by AWS Batch is On-Demand

A

False, it accepts both On-Demand and Spot instances

59
Q

What is the maximum execution time for a Step Functions state machine?

A

1 year (for Standard workflows)

60
Q

What is AWS DataSync?

A

It is an AWS service that helps you migrate data from outside storage systems into AWS

61
Q

What are the possible AWS DataSync targets?

A

-S3
-EFS
-FSx

62
Q

What is MQTT?

A

It is a lightweight publish/subscribe messaging protocol commonly used for IoT devices