AWS Data Services Flashcards

1
Q

S3

A
  • Simple Storage Service
  • Stores large amounts of data that we want to actively access
  • Structured or unstructured data
  • Can be used as a data lake:
    • Collection of S3 buckets
    • Structured (CSV, JSON)
    • Unstructured (text files, images)
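A minimal sketch of landing objects in a data-lake bucket. The bucket name, key layout and file contents are assumptions for illustration, not AWS defaults:

```python
# Minimal sketch: writing a structured (CSV) object into a hypothetical
# "my-data-lake" bucket. Bucket name and key convention are assumptions.
import io

def data_lake_key(zone: str, dataset: str, filename: str) -> str:
    """Build a conventional data-lake key: <zone>/<dataset>/<filename>."""
    return f"{zone}/{dataset}/{filename}"

if __name__ == "__main__":
    import boto3  # AWS call shown for illustration; requires credentials

    s3 = boto3.client("s3")
    s3.put_object(
        Bucket="my-data-lake",
        Key=data_lake_key("raw", "sales", "2024-01-01.csv"),
        Body=io.BytesIO(b"id,amount\n1,9.99\n"),
    )
```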
2
Q

Advantages and disadvantages of data lakes

A

Adv: many sources, no fixed schema required (schema-on-read), lower cost than data warehouse solutions, tolerant of low-quality data
Disadv: unsuitable for transactional systems; data needs cataloguing before analysis

3
Q

Simple ML workflow

A
  1. Kinesis Data Firehose ingests data into S3
  2. Glue “crawls” the S3 bucket to build a catalogue of the data for Athena
  3. Athena queries S3 via the catalogue
  4. Athena provides data to SageMaker for ML
    * for large data sets we could use EMR/Spark instead of, or alongside, Athena
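Step 1 of the workflow can be sketched as a producer pushing records into a Firehose delivery stream bound for S3. The stream name and event shape are hypothetical:

```python
# Sketch of step 1 above: pushing a JSON event into an assumed Firehose
# delivery stream that delivers to S3.
import json

def encode_record(event: dict) -> bytes:
    """Newline-delimited JSON, the usual shape for Firehose-to-S3 delivery."""
    return (json.dumps(event) + "\n").encode("utf-8")

if __name__ == "__main__":
    import boto3  # requires credentials and an existing delivery stream

    firehose = boto3.client("firehose")
    firehose.put_record(
        DeliveryStreamName="clicks-to-s3",  # assumed stream name
        Record={"Data": encode_record({"user": 1, "event": "view"})},
    )
```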
4
Q

Security in S3

A
  • IAM users and roles with attached policies that govern how those identities can use S3
  • Bucket policies (resource-level policies)
    Encryption:
  • SSE-S3: server-side encryption with S3-managed keys (a common enterprise compliance requirement)
  • SSE-KMS: server-side encryption with AWS KMS keys; we can create keys in the AWS console or import our own
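A sketch tying the two halves together: a resource-level bucket policy that denies unencrypted uploads, and an upload that requests SSE-KMS. The bucket name and key alias are assumptions:

```python
# Sketch: bucket policy denying PutObject without server-side encryption,
# plus an upload requesting SSE-KMS. Bucket and key alias are assumptions.
import json

def require_sse_policy(bucket: str) -> str:
    """Bucket policy that denies uploads lacking the SSE header."""
    return json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "DenyUnencryptedUploads",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:PutObject",
            "Resource": f"arn:aws:s3:::{bucket}/*",
            "Condition": {"Null": {"s3:x-amz-server-side-encryption": "true"}},
        }],
    })

if __name__ == "__main__":
    import boto3

    s3 = boto3.client("s3")
    s3.put_bucket_policy(Bucket="my-data-lake",
                         Policy=require_sse_policy("my-data-lake"))
    s3.put_object(
        Bucket="my-data-lake", Key="raw/data.csv", Body=b"id\n1\n",
        ServerSideEncryption="aws:kms",  # SSE-KMS
        SSEKMSKeyId="alias/my-key",      # assumed key alias
    )
```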
5
Q

AWS Glue

A
  • Cataloguing and structuring the bucket
  • Stores, annotates and shares metadata
  • Creates catalogues of data (schemas)
  • A “crawler” goes through the S3 bucket and builds the catalogue, stored in the Glue Data Catalog
  • Also crawls DynamoDB and other data stores
  • Produces a single view/endpoint of the data
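A sketch of creating and starting a crawler over an S3 path so its schema lands in the Data Catalog. The role ARN, database and path names are assumptions:

```python
# Sketch: create a Glue crawler over an S3 path, then start it.
# Role, database, crawler and path names are hypothetical.
def crawler_config(name: str, role_arn: str, database: str, s3_path: str) -> dict:
    """Parameters for glue.create_crawler."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }

if __name__ == "__main__":
    import boto3

    glue = boto3.client("glue")
    glue.create_crawler(**crawler_config(
        "lake-crawler",
        "arn:aws:iam::123456789012:role/GlueCrawlerRole",  # assumed role
        "lake_db",
        "s3://my-data-lake/raw/",
    ))
    glue.start_crawler(Name="lake-crawler")
```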
6
Q

Glue ETL capability

A
  • Extract data from a source, transform it, then load it into another location
  • We can “glue” together different data sources and perform transformations
  • Can interact with a variety of data sources inside and outside of AWS
  • Using the metadata in the Data Catalog, Glue can autogenerate Scala or PySpark scripts with Glue extensions that can be used and modified for ETL operations
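A sketch of the shape such an autogenerated PySpark job takes. The record-level clean-up is kept as a plain function; the `awsglue` boilerplate (catalogue source, S3 sink) only runs inside a Glue job runtime, and the database/table names are assumptions:

```python
# Sketch of a Glue PySpark ETL job. The pure transform is testable anywhere;
# the awsglue parts run only inside the Glue job runtime. Names are assumed.
def clean_record(rec: dict) -> dict:
    """Example transform: normalise string fields and drop nulls."""
    return {k: (v.strip().lower() if isinstance(v, str) else v)
            for k, v in rec.items() if v is not None}

if __name__ == "__main__":
    # Glue-managed imports, available only on Glue workers
    from awsglue.context import GlueContext
    from pyspark.context import SparkContext

    glue_ctx = GlueContext(SparkContext.getOrCreate())
    src = glue_ctx.create_dynamic_frame.from_catalog(
        database="lake_db", table_name="clickstream")  # assumed names
    out = src.map(lambda r: clean_record(r))
    glue_ctx.write_dynamic_frame.from_options(
        out, connection_type="s3",
        connection_options={"path": "s3://my-data-lake/clean/"},
        format="parquet",
    )
```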
7
Q

Glue jobs system

A
  • Provides managed infrastructure to orchestrate ETL workflows.
  • Can be created to automate ETL scripts and transfer data to different locations
  • Jobs can be scheduled and chained, or triggered by events such as the arrival of new data
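Scheduling and chaining can be sketched with Glue triggers: one cron-scheduled trigger, and one conditional trigger that fires when the first job succeeds. Job and trigger names are hypothetical:

```python
# Sketch: schedule a Glue job nightly, then chain a second job on success.
# All job/trigger names are assumptions.
def schedule_trigger(name: str, job: str, cron: str) -> dict:
    """Parameters for glue.create_trigger with a cron schedule."""
    return {
        "Name": name,
        "Type": "SCHEDULED",
        "Schedule": f"cron({cron})",
        "Actions": [{"JobName": job}],
        "StartOnCreation": True,
    }

if __name__ == "__main__":
    import boto3

    glue = boto3.client("glue")
    glue.create_trigger(**schedule_trigger(
        "nightly-etl", "clean-clickstream", "0 2 * * ? *"))

    # Chained trigger: runs a second job when the first succeeds
    glue.create_trigger(
        Name="on-clean-success", Type="CONDITIONAL", StartOnCreation=True,
        Predicate={"Conditions": [{
            "LogicalOperator": "EQUALS",
            "JobName": "clean-clickstream",
            "State": "SUCCEEDED"}]},
        Actions=[{"JobName": "load-features"}],
    )
```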
8
Q

Glue FindMatches

A
  • Enables you to identify duplicate or matching records in your data set, even when the records do not have a common unique identifier and no fields match exactly. Requires no code writing or ML knowledge
9
Q

Database Migration Service (DMS)

A
  • Used to migrate relational databases, data warehouses, NoSQL databases and other types of data stores
  • You can migrate data to S3 using DMS from any of the supported database sources
10
Q

Athena

A
  • Query S3 with SQL
  • Source data from multiple S3 locations
  • Save outputs to S3
  • Use for data pre-processing ahead of ML
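A sketch of the pre-processing use: run a SQL query over catalogued S3 data and write the result set back to S3 for SageMaker to pick up. The database, table, columns and output location are assumptions:

```python
# Sketch: an Athena pre-processing query whose results land in S3.
# Database/table/column names and the output location are assumptions.
def preprocessing_query(table: str, limit: int) -> str:
    """Example feature-selection query ahead of ML."""
    return f"SELECT user_id, amount FROM {table} WHERE amount > 0 LIMIT {limit}"

if __name__ == "__main__":
    import boto3

    athena = boto3.client("athena")
    resp = athena.start_query_execution(
        QueryString=preprocessing_query("lake_db.sales", 1000),
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
    )
    # Poll athena.get_query_execution(...) until the state is SUCCEEDED
    print(resp["QueryExecutionId"])
```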
11
Q

QuickSight

A
  • AWS BI tool
  • Visualise data from many sources: dashboards, email reports, embedded reports
  • End-user application
  • Not inside the AWS console, but can be accessed from the console
  • Drag-and-drop application in the browser; can plug into many different data sources such as DynamoDB, S3, GitHub, other SQL DBs, etc.
12
Q

Kinesis

A
  • Large-scale data ingestion

  • E.g. lots of video data from a few sources, or small amounts of data from many sources (IoT)

13
Q

Kinesis Video Streams

A
  • Securely stream video from connected devices to AWS for analytics, ML, playback and other processing
14
Q

Kinesis Data Streams

A
  • General endpoint for ingesting large amounts of data for processing by:
    • Kinesis Data Analytics
    • Spark on EMR
    • EC2
    • Lambda
  • Used to collect and process large streams of data records in real time; you can build Kinesis Data Streams applications that process data from the streams
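A producer-side sketch: each record carries a partition key that decides which shard it lands on, so a stable per-device key keeps one device's records ordered. The stream name is an assumption:

```python
# Sketch: producing to a Kinesis data stream. A stable partition key keeps
# one device's records in one shard. Stream name is an assumption.
import hashlib
import json

def partition_key(device_id: str) -> str:
    """Stable per-device key so a device's records stay ordered in a shard."""
    return hashlib.md5(device_id.encode()).hexdigest()

if __name__ == "__main__":
    import boto3

    kinesis = boto3.client("kinesis")
    kinesis.put_record(
        StreamName="iot-telemetry",  # assumed stream name
        Data=json.dumps({"device": "d-42", "temp": 21.5}).encode(),
        PartitionKey=partition_key("d-42"),
    )
```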
15
Q

Kinesis Data Firehose

A
  • Simple endpoint to stream data into:
    • S3
    • Redshift
    • Elasticsearch
    • Splunk (3rd party data analysis, reporting software)
  • Can also transform data before delivering
  • Is NOT designed for custom data stream processing and real-time metrics (use Kinesis Data Streams)
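On the producer side Firehose is just an endpoint: records are batched in, and Firehose buffers (and optionally transforms) before delivery. A sketch with an assumed stream name:

```python
# Sketch: batching records into a Firehose delivery stream bound for S3.
# Firehose buffers and optionally transforms before delivery. Stream name
# is an assumption.
import json

def to_batch(events: list) -> list:
    """Shape events as the Records argument of firehose.put_record_batch."""
    return [{"Data": (json.dumps(e) + "\n").encode()} for e in events]

if __name__ == "__main__":
    import boto3

    firehose = boto3.client("firehose")
    firehose.put_record_batch(
        DeliveryStreamName="events-to-s3",  # assumed stream name
        Records=to_batch([{"id": 1}, {"id": 2}]),
    )
```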
16
Q

Kinesis Data Analytics

A
  • Process streaming data from Kinesis Data Streams or Firehose at scale, using SQL or Java libraries (a stream processor, not a BI tool like QuickSight)
17
Q

Glue vs Kinesis

A
  • If we have data sitting in buckets/other warehouses, then Glue may be a better option with its ETL capabilities
  • If we have lots of data streaming in fast, then Kinesis is better
18
Q

Sample architecture from IoT device

A
  • Data streams in from the IoT device
  • Ingestion of the stream is handled by Kinesis Data Streams
  • EMR/Spark handles processing
  • Passed on to S3 for storage
19
Q

Sample architecture from video camera

A
  • The video camera records and streams data through Kinesis Video Streams to Rekognition Video
  • Rekognition makes predictions, e.g. what objects it identifies and face recognition
  • These predictions flow into Kinesis Data Streams
  • A Lambda function takes the predictions and triggers AWS SNS to send a message to a mobile device, notifying the user of a security alert in the footage
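The final Lambda in this pipeline can be sketched as below. The topic ARN, the `label`/`ts` fields of the prediction payload, and the alert condition are all assumptions about what Rekognition output was written to the stream:

```python
# Sketch: Lambda consuming Rekognition predictions from a Kinesis event and
# publishing an SNS alert. Topic ARN, payload fields and the "person" alert
# condition are assumptions.
import base64
import json

def decode_kinesis_record(record: dict) -> dict:
    """Kinesis delivers Lambda event data base64-encoded."""
    return json.loads(base64.b64decode(record["kinesis"]["data"]))

def handler(event, context):
    alerts = []
    for record in event["Records"]:
        prediction = decode_kinesis_record(record)
        if prediction.get("label") == "person":  # assumed alert condition
            alerts.append(f"Person detected at {prediction.get('ts')}")
    if alerts:
        import boto3  # only needed when an alert actually fires
        boto3.client("sns").publish(
            TopicArn="arn:aws:sns:us-east-1:123456789012:security-alerts",
            Message="\n".join(alerts),
        )
    return {"alerts": len(alerts)}
```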
20
Q

EMR

A

Elastic MapReduce

  • Managed service for hosting massively parallel compute tasks (e.g. Google search)
  • Works well in the cloud
  • Integrates with S3
  • Petabyte scale processing
  • Uses big data tools: Spark, Hadoop, HBase
  • Task nodes are used to reduce compute costs by processing data but not holding persistent data in HDFS. Terminating a task node does not result in data loss or cause the application to terminate
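The core/task split above can be sketched in a cluster launch: core nodes hold HDFS, while task nodes (here on spot, a common cost choice) only process. Instance counts, types and role names are illustrative:

```python
# Sketch: launching a transient EMR cluster with core nodes (HDFS) plus
# spot task nodes for cheap extra compute. All values are illustrative.
def instance_groups(core: int, task: int) -> list:
    """Core nodes hold HDFS data; task nodes only process, so they can be
    terminated (e.g. spot interruption) without data loss."""
    return [
        {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
        {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": core},
        {"InstanceRole": "TASK", "InstanceType": "m5.xlarge",
         "InstanceCount": task, "Market": "SPOT"},
    ]

if __name__ == "__main__":
    import boto3

    emr = boto3.client("emr")
    emr.run_job_flow(
        Name="spark-etl",
        ReleaseLabel="emr-6.15.0",           # assumed release
        Applications=[{"Name": "Spark"}],
        Instances={"InstanceGroups": instance_groups(core=2, task=4),
                   "KeepJobFlowAliveWhenNoSteps": False},
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )
```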
21
Q

Apache Spark

A
  • Fast analytics engine
  • Massive parallel compute
  • Deployed over clusters of resources
  • The aws-sagemaker-spark-sdk is installed when using EMR. This installs SageMaker Spark and associated dependencies
  • You can use SageMaker Spark to construct Spark machine learning pipelines using SageMaker stages
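A sketch of the kind of Spark job EMR runs; on a cluster with the aws-sagemaker-spark-sdk installed, the resulting DataFrame could feed a SageMaker pipeline stage. The S3 paths and column layout are assumptions:

```python
# Sketch: a minimal Spark job of the kind EMR runs. The pure parsing helper
# is testable anywhere; the pyspark part runs only on a cluster. Paths and
# column names are assumptions.
def parse_line(line: str) -> tuple:
    """CSV line -> (user_id, amount) for the DataFrame below."""
    user_id, amount = line.split(",")
    return user_id, float(amount)

if __name__ == "__main__":
    from pyspark.sql import SparkSession  # available on EMR cluster nodes

    spark = SparkSession.builder.appName("etl-sketch").getOrCreate()
    rows = (spark.sparkContext
                 .textFile("s3://my-data-lake/raw/sales/")
                 .map(parse_line))
    df = rows.toDF(["user_id", "amount"])
    df.filter(df.amount > 0).write.parquet("s3://my-data-lake/clean/sales/")
```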
22
Q

EC2 for Machine Learning

A
  • Compute instances sitting behind the model
  • Instance types targeted at ML tasks include ‘Compute Optimised’ and ‘Accelerated Computing’ (GPU)
  • AWS offers certain types of AMIs (Amazon Machine Images) aimed at ML, such as:
    • conda-based deep learning AMIs for TensorFlow/Keras, MXNet/Gluon, PyTorch, etc., with GPU acceleration
    • Deep Learning Base AMIs (i.e. low-level, customised ML)
  • Limits: with a new AWS account you won’t be able to automatically spin up large ML instances unless you request an increase to the compute limit (this can take a few days)
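A launch sketch for a GPU instance from a deep learning AMI. The AMI ID is a placeholder (real IDs vary by region and AMI release) and the instance type is an illustrative Accelerated Computing choice:

```python
# Sketch: launching a GPU instance from a deep learning AMI. AMI ID and
# instance type are placeholders; real IDs vary by region and release.
def launch_params(ami_id: str, instance_type: str) -> dict:
    """Parameters for ec2.run_instances."""
    return {
        "ImageId": ami_id,
        "InstanceType": instance_type,  # e.g. a GPU 'Accelerated Computing' type
        "MinCount": 1,
        "MaxCount": 1,
    }

if __name__ == "__main__":
    import boto3

    ec2 = boto3.client("ec2")
    # On a new account this can fail with a limit error until the quota
    # increase mentioned above has been granted
    ec2.run_instances(**launch_params("ami-0123456789abcdef0", "p3.2xlarge"))
```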