AWS Data Services Flashcards
S3
- Simple Storage Service
- Stores large amounts of data that we want to actively access
- Structured or unstructured data
- Can be used as a data lake:
  - Collection of S3 buckets
  - Structured (CSV, JSON)
  - Unstructured (text files, images)
Advantages and disadvantages of data lakes
Adv: ingests from many sources, no predefined schema required (schema-on-read), lower cost than data warehouse solutions, tolerant of low-quality data
Disadv: unsuitable for transactional systems, needs cataloguing before analysis
Simple ML workflow
- Kinesis Data Firehose ingests data into S3
- Glue “crawls” through the S3 bucket to make a catalogue of the data for Athena
- Athena queries S3 via use of catalogue
- Athena provides data to SageMaker for ML
* for large data sets we could use EMR/Spark instead of, or alongside, Athena
Security in S3
- IAM users and roles with attached policies that govern how those identities use S3
- Bucket policy (resource-level policies)
Encryption:
- SSE-S3: server-side encryption with S3-managed keys (good for enterprises that require encryption at rest)
- SSE-KMS: encryption with keys managed in AWS KMS; create keys within the AWS console, or import our own
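A minimal sketch of how these two encryption modes appear in an S3 upload. The bucket, key and KMS alias are placeholders; the actual upload needs boto3 and AWS credentials, so the client call is shown as a comment.

```python
# Build put_object parameters for SSE-S3, or SSE-KMS when a key ID is given.
# Bucket/key/alias names below are illustrative only.

def sse_put_object_params(bucket, key, body, kms_key_id=None):
    params = {"Bucket": bucket, "Key": key, "Body": body}
    if kms_key_id:
        params["ServerSideEncryption"] = "aws:kms"   # encrypt with a KMS key
        params["SSEKMSKeyId"] = kms_key_id           # our own or imported key
    else:
        params["ServerSideEncryption"] = "AES256"    # SSE-S3, S3-managed keys
    return params

# With real credentials:
# import boto3
# boto3.client("s3").put_object(
#     **sse_put_object_params("my-bucket", "data.csv", b"...", "alias/my-key"))
```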
AWS Glue
- Cataloguing and structuring the bucket
- Stores, annotates and shares metadata
- Creates catalogues of data (schemas)
- “Crawler” which goes through the S3 bucket and makes the catalogue to store in Glue Data Catalogue
- Also crawls other sources, such as DynamoDB
- Produces a single view/endpoint of the data
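A sketch of defining and starting a crawler over an S3 path. The crawler name, IAM role, database and bucket are assumptions; running it for real needs boto3 and a role with Glue and S3 permissions, so the client calls are commented out.

```python
# Build the create_crawler parameters for crawling an S3 path into the
# Glue Data Catalogue. All names below are placeholders.

def crawler_params(name, role_arn, database, s3_path):
    return {
        "Name": name,
        "Role": role_arn,                                # IAM role the crawler assumes
        "DatabaseName": database,                        # target catalogue database
        "Targets": {"S3Targets": [{"Path": s3_path}]},   # bucket/prefix to crawl
    }

# import boto3
# glue = boto3.client("glue")
# glue.create_crawler(**crawler_params(
#     "lake-crawler", "arn:aws:iam::123456789012:role/GlueRole",
#     "lake_db", "s3://my-data-lake/raw/"))
# glue.start_crawler(Name="lake-crawler")
```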
Glue ETL capability
- Extract data out of somewhere, perform an operation, then reload in another location
- We can “glue” together different data sources and perform some transformations
- Can interact with a variety of data sources inside and outside of AWS
- Using the metadata in Data Catalogue, Glue can autogenerate Scala or PySpark scripts with Glue extensions that can be used and modified for ETL operations
Glue jobs system
- Provides managed infrastructure to orchestrate ETL workflows.
- Can be created to automate ETL scripts and transfer data to different locations
- Jobs can be scheduled and chained, or triggered by events such as the arrival of new data
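The job-plus-trigger pattern above can be sketched as two API calls: one to register the ETL script as a job, one to schedule it. The script location, role and cron expression are illustrative; the boto3 calls are commented out since they need real AWS resources.

```python
# Build parameters for a Glue job and a schedule trigger that runs it daily.
# Names, role ARN and script path are placeholders.

def job_params(name, role_arn, script_s3_path):
    return {
        "Name": name,
        "Role": role_arn,
        "Command": {"Name": "glueetl", "ScriptLocation": script_s3_path},
        "GlueVersion": "4.0",
    }

def schedule_trigger_params(name, job_name, cron):
    return {
        "Name": name,
        "Type": "SCHEDULED",          # could also be CONDITIONAL (chained) or ON_DEMAND
        "Schedule": cron,
        "Actions": [{"JobName": job_name}],
        "StartOnCreation": True,
    }

# import boto3
# glue = boto3.client("glue")
# glue.create_job(**job_params("nightly-etl", role_arn, "s3://my-bucket/scripts/etl.py"))
# glue.create_trigger(**schedule_trigger_params("nightly", "nightly-etl",
#                                               "cron(0 2 * * ? *)"))  # 02:00 UTC daily
```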
Glue FindMatches
- Enables you to identify duplicate or matching records in your data set, even when the records have no common unique identifier and no fields match exactly, without writing any code or knowing how ML works
Database Migration Service (DMS)
- Used to migrate relational databases, data warehouses, NoSQL databases and other types of data stores
- You can migrate data to S3 using DMS from any of the supported database sources
Athena
- Query S3 with SQL
- Source data from multiple S3 locations
- Save outputs to S3
- Use for data pre-processing ahead of ML
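A sketch of the query step: run SQL over the catalogued S3 data and write the results back to S3 for SageMaker to pick up. The database, table and bucket names are made up; the boto3 call is shown as a comment.

```python
# Build start_query_execution parameters for an Athena SQL query.
# Database, SQL and output location below are illustrative.

def athena_query_params(sql, database, output_s3):
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_s3},  # results land in S3
    }

# import boto3
# athena = boto3.client("athena")
# resp = athena.start_query_execution(**athena_query_params(
#     "SELECT label, feature1 FROM events WHERE feature1 IS NOT NULL",
#     "lake_db", "s3://my-bucket/athena-results/"))
# resp["QueryExecutionId"] can then be polled with get_query_execution()
```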
Quicksight
- AWS BI tool
- Visualise data from many sources: dashboards, email reports, embedded reports
- End-user application
- Not inside the AWS console, but can be accessed from the console
- Drag-and-drop application in the browser - can plug into many different data sources such as DynamoDB, S3, GitHub, other SQL DBs etc
Kinesis
- Large-scale data ingestion
- E.g. lots of video data from few sources, or small amounts of data from many sources (IoT)
Kinesis video streams
- Securely stream video from connected devices to AWS for analytics, ML, playback and other processing
Kinesis data streams
- General endpoint for ingesting large amounts of data for processing by:
  - Kinesis Data Analytics
  - Spark or EMR
  - EC2
  - Lambda
- Can be used to collect and process large streams of data records in real time. You can create Kinesis data stream applications to process data from streams
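A producer-side sketch: each record goes onto the stream as bytes, and the partition key determines which shard it lands on. The stream and device names are assumptions; the boto3 call is commented out.

```python
import json

# Build put_record parameters for a Kinesis data stream.
# Stream name and record shape below are illustrative.

def stream_record_params(stream, record, partition_key):
    return {
        "StreamName": stream,
        "Data": json.dumps(record).encode("utf-8"),  # payload must be bytes
        "PartitionKey": partition_key,               # routes the record to a shard
    }

# import boto3
# boto3.client("kinesis").put_record(**stream_record_params(
#     "iot-stream", {"device": "sensor-1", "temp": 21.5}, "sensor-1"))
```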
Kinesis Data Firehose
- Simple endpoint to stream data into:
  - S3
  - Redshift
  - Elasticsearch
  - Splunk (3rd-party data analysis and reporting software)
- Can also transform data before delivering
- Is NOT designed for custom data stream processing and real-time metrics (use Kinesis Data Streams)
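A delivery-side sketch for the S3 destination. Firehose concatenates records into files, so appending a newline to each JSON record keeps the resulting S3 objects newline-delimited and easy for Glue/Athena to read later. The stream name and record fields are assumptions; the boto3 call is commented out.

```python
import json

# Build put_record parameters for a Firehose delivery stream feeding S3.
# Stream name and record fields below are illustrative.

def firehose_record_params(delivery_stream, record):
    payload = json.dumps(record) + "\n"       # newline-delimited JSON for S3
    return {
        "DeliveryStreamName": delivery_stream,
        "Record": {"Data": payload.encode("utf-8")},
    }

# import boto3
# boto3.client("firehose").put_record(**firehose_record_params(
#     "clickstream-to-s3", {"user": "u1", "event": "click"}))
```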
Kinesis Data Analytics
- Process streaming data from Kinesis Streams (not a BI tool like Quicksight) or Firehose at scale using SQL or Java libraries
Glue vs Kinesis
- If we have data sitting in buckets/other warehouses, then Glue may be a better option with its ETL capabilities
- If we have lots of data streaming in fast, then Kinesis is better
Sample architecture from IoT device
- Data streams from IoT device
- Ingestion of stream is handled by Kinesis Data Stream
- EMR/Spark handles processing
- Passed onto S3 for storage
Sample architecture from video camera
- Video camera data records and streams data through Kinesis Video Streams to Rekognition Video
- Rekognition makes predictions such as object identification and face recognition
- These predictions flow through to Kinesis Data Streams
- Lambda function takes predictions and triggers AWS SNS to send a message to a mobile device to notify user of a particular security alert in the footage
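The Lambda step above can be sketched as a handler that decodes the Kinesis records, filters for high-confidence alert labels, and publishes to SNS. The event shape follows the standard Kinesis-to-Lambda format, but the prediction fields (`label`, `confidence`), alert labels and topic ARN are all assumptions; the SNS call is commented out.

```python
import base64
import json

# Labels that should trigger a security alert (hypothetical choice).
ALERT_LABELS = {"Person", "Weapon"}

def extract_alerts(event, threshold=0.9):
    """Pull high-confidence alert predictions out of a Kinesis Lambda event."""
    alerts = []
    for rec in event["Records"]:
        # Kinesis record payloads arrive base64-encoded
        payload = json.loads(base64.b64decode(rec["kinesis"]["data"]))
        if payload.get("label") in ALERT_LABELS and payload.get("confidence", 0) >= threshold:
            alerts.append(payload)
    return alerts

def handler(event, context):
    alerts = extract_alerts(event)
    if alerts:
        # import boto3
        # boto3.client("sns").publish(
        #     TopicArn="arn:aws:sns:us-east-1:123456789012:security-alerts",
        #     Message=json.dumps(alerts))
        pass
    return {"alerts": len(alerts)}
```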
EMR
Elastic Map Reduce
- Managed service for hosting massively parallel compute tasks (e.g. Google search)
- Works well in the cloud
- Integrates with S3
- Petabyte scale processing
- Uses big data tools: Spark, Hadoop, HBase
- Task nodes are used to reduce compute costs by processing data but not holding persistent data in HDFS. Terminating a task node does not result in data loss or cause the application to terminate
Apache Spark
- Fast analytics engine
- Massive parallel compute
- Deployed over clusters of resources
- The aws-sagemaker-spark-sdk is installed when using EMR. This installs SageMaker Spark and associated dependencies
- You can use SageMaker Spark to construct Spark machine learning pipelines using SageMaker stages
EC2 for Machine Learning
- Compute instances sitting behind the model
- Instance types targeted at ML tasks include 'Compute Optimised' and 'Accelerated Computing' (GPU)
- AWS have certain types of AMIs (Amazon Machine Images) which are aimed at ML, such as:
- Conda-based deep learning AMIs for TensorFlow/Keras, MXNet/Gluon, PyTorch etc; GPU acceleration
- Deep Learning Base AMIs (low-level; for customised ML setups)
- Limits: with a new AWS account you won't be able to spin up large ML instances until you request an increase to the relevant compute limits (this can take a few days)