AWS Data Services Flashcards
S3
- Simple Storage Service
- Stores large amounts of data that we want to actively access
- Structured or unstructured data
- Can be used as a data lake:
  - Collection of S3 buckets
  - Structured (CSV, JSON)
  - Unstructured (text files, images)
Advantages and disadvantages of data lakes
Adv: ingests from many sources, no predefined schema required (schema-on-read), lower cost than data warehouse solutions, tolerant of low-quality data
Disadv: unsuitable for transactional systems, needs cataloguing before analysis
Simple ML workflow
- Kinesis Data Firehose ingests data into S3
- Glue “crawls” through the S3 bucket to make a catalogue of the data for Athena
- Athena queries S3 via use of catalogue
- Athena provides data to SageMaker for ML
* for large data sets we could use EMR/Spark instead of, or alongside, Athena
Security in S3
- IAM users and roles with attached policies that govern how those identities use S3
- Bucket policy (resource-level policies)
Encryption:
- SSE-S3: server-side encryption with S3-managed keys (good for enterprises that require encryption at rest)
- SSE-KMS: encryption with keys managed in AWS KMS; create keys within the AWS console, or import our own
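A minimal sketch of how these two encryption modes appear in an S3 upload. The bucket, key and KMS alias are placeholders; the actual upload needs boto3 and AWS credentials, so the client call is shown as a comment.

```python
# Build put_object parameters for SSE-S3, or SSE-KMS when a key ID is given.
# Bucket/key/alias names below are illustrative only.

def sse_put_object_params(bucket, key, body, kms_key_id=None):
    params = {"Bucket": bucket, "Key": key, "Body": body}
    if kms_key_id:
        params["ServerSideEncryption"] = "aws:kms"   # encrypt with a KMS key
        params["SSEKMSKeyId"] = kms_key_id           # our own or imported key
    else:
        params["ServerSideEncryption"] = "AES256"    # SSE-S3, S3-managed keys
    return params

# With real credentials:
# import boto3
# boto3.client("s3").put_object(
#     **sse_put_object_params("my-bucket", "data.csv", b"...", "alias/my-key"))
```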
AWS Glue
- Cataloguing and structuring the bucket
- Stores, annotates and shares metadata
- Creates catalogues of data (schemas)
- “Crawler” which goes through the S3 bucket and makes the catalogue to store in Glue Data Catalogue
- Also crawls other sources, such as DynamoDB
- Produces a single view/endpoint of the data
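A sketch of defining and starting a crawler over an S3 path. The crawler name, IAM role, database and bucket are assumptions; running it for real needs boto3 and a role with Glue and S3 permissions, so the client calls are commented out.

```python
# Build the create_crawler parameters for crawling an S3 path into the
# Glue Data Catalogue. All names below are placeholders.

def crawler_params(name, role_arn, database, s3_path):
    return {
        "Name": name,
        "Role": role_arn,                                # IAM role the crawler assumes
        "DatabaseName": database,                        # target catalogue database
        "Targets": {"S3Targets": [{"Path": s3_path}]},   # bucket/prefix to crawl
    }

# import boto3
# glue = boto3.client("glue")
# glue.create_crawler(**crawler_params(
#     "lake-crawler", "arn:aws:iam::123456789012:role/GlueRole",
#     "lake_db", "s3://my-data-lake/raw/"))
# glue.start_crawler(Name="lake-crawler")
```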
Glue ETL capability
- Extract data out of somewhere, perform an operation, then reload in another location
- We can “glue” together different data sources and perform some transformations
- Can interact with a variety of data sources inside and outside of AWS
- Using the metadata in Data Catalogue, Glue can autogenerate Scala or PySpark scripts with Glue extensions that can be used and modified for ETL operations
Glue jobs system
- Provides managed infrastructure to orchestrate ETL workflows.
- Can be created to automate ETL scripts and transfer data to different locations
- Jobs can be scheduled and chained, or triggered by events such as the arrival of new data
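The job-plus-trigger pattern above can be sketched as two API calls: one to register the ETL script as a job, one to schedule it. The script location, role and cron expression are illustrative; the boto3 calls are commented out since they need real AWS resources.

```python
# Build parameters for a Glue job and a schedule trigger that runs it daily.
# Names, role ARN and script path are placeholders.

def job_params(name, role_arn, script_s3_path):
    return {
        "Name": name,
        "Role": role_arn,
        "Command": {"Name": "glueetl", "ScriptLocation": script_s3_path},
        "GlueVersion": "4.0",
    }

def schedule_trigger_params(name, job_name, cron):
    return {
        "Name": name,
        "Type": "SCHEDULED",          # could also be CONDITIONAL (chained) or ON_DEMAND
        "Schedule": cron,
        "Actions": [{"JobName": job_name}],
        "StartOnCreation": True,
    }

# import boto3
# glue = boto3.client("glue")
# glue.create_job(**job_params("nightly-etl", role_arn, "s3://my-bucket/scripts/etl.py"))
# glue.create_trigger(**schedule_trigger_params("nightly", "nightly-etl",
#                                               "cron(0 2 * * ? *)"))  # 02:00 UTC daily
```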
Glue FindMatches
- Enables you to identify duplicate or matching records in your data set, even when the records have no common unique identifier and no fields match exactly, without writing any code or knowing how ML works
Database Migration Service (DMS)
- Used to migrate relational databases, data warehouses, NoSQL databases and other types of data stores
- You can migrate data to S3 using DMS from any of the supported database sources
Athena
- Query S3 with SQL
- Source data from multiple S3 locations
- Save outputs to S3
- Use for data pre-processing ahead of ML
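A sketch of the query step: run SQL over the catalogued S3 data and write the results back to S3 for SageMaker to pick up. The database, table and bucket names are made up; the boto3 call is shown as a comment.

```python
# Build start_query_execution parameters for an Athena SQL query.
# Database, SQL and output location below are illustrative.

def athena_query_params(sql, database, output_s3):
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_s3},  # results land in S3
    }

# import boto3
# athena = boto3.client("athena")
# resp = athena.start_query_execution(**athena_query_params(
#     "SELECT label, feature1 FROM events WHERE feature1 IS NOT NULL",
#     "lake_db", "s3://my-bucket/athena-results/"))
# resp["QueryExecutionId"] can then be polled with get_query_execution()
```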
Quicksight
- AWS BI tool
- Visualise data from many sources: dashboards, email reports, embedded reports
- End-user application
- Not inside the AWS console, but can be accessed from the console
- Drag-and-drop application in the browser - can plug into many different data sources such as DynamoDB, S3, GitHub, other SQL DBs etc
Kinesis
- Large-scale data ingestion
- E.g. lots of video data from few sources, or small amounts of data from many sources (IoT)
Kinesis video streams
- Securely stream video from connected devices to AWS for analytics, ML, playback and other processing
Kinesis data streams
- General endpoint for ingesting large amounts of data for processing by:
  - Kinesis Data Analytics
  - Spark or EMR
  - EC2
  - Lambda
- Can be used to collect and process large streams of data records in real time. You can create Kinesis data stream applications to process data from streams
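A producer-side sketch: each record goes onto the stream as bytes, and the partition key determines which shard it lands on. The stream and device names are assumptions; the boto3 call is commented out.

```python
import json

# Build put_record parameters for a Kinesis data stream.
# Stream name and record shape below are illustrative.

def stream_record_params(stream, record, partition_key):
    return {
        "StreamName": stream,
        "Data": json.dumps(record).encode("utf-8"),  # payload must be bytes
        "PartitionKey": partition_key,               # routes the record to a shard
    }

# import boto3
# boto3.client("kinesis").put_record(**stream_record_params(
#     "iot-stream", {"device": "sensor-1", "temp": 21.5}, "sensor-1"))
```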
Kinesis Data Firehose
- Simple endpoint to stream data into:
  - S3
  - Redshift
  - Elasticsearch
  - Splunk (3rd-party data analysis and reporting software)
- Can also transform data before delivering
- Is NOT designed for custom data stream processing and real-time metrics (use Kinesis Data Streams)
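A delivery-side sketch for the S3 destination. Firehose concatenates records into files, so appending a newline to each JSON record keeps the resulting S3 objects newline-delimited and easy for Glue/Athena to read later. The stream name and record fields are assumptions; the boto3 call is commented out.

```python
import json

# Build put_record parameters for a Firehose delivery stream feeding S3.
# Stream name and record fields below are illustrative.

def firehose_record_params(delivery_stream, record):
    payload = json.dumps(record) + "\n"       # newline-delimited JSON for S3
    return {
        "DeliveryStreamName": delivery_stream,
        "Record": {"Data": payload.encode("utf-8")},
    }

# import boto3
# boto3.client("firehose").put_record(**firehose_record_params(
#     "clickstream-to-s3", {"user": "u1", "event": "click"}))
```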
Kinesis Data Analytics
- Process streaming data from Kinesis Streams (not a BI tool like Quicksight) or Firehose at scale using SQL or Java libraries
Glue vs Kinesis
- If we have data sitting in buckets/other warehouses, then Glue may be a better option with its ETL capabilities
- If we have lots of data streaming in fast, then Kinesis is better
Sample architecture from IoT device
- Data streams from IoT device
- Ingestion of stream is handled by Kinesis Data Stream
- EMR/Spark handles processing
- Passed onto S3 for storage
Sample architecture from video camera
- Video camera data records and streams data through Kinesis Video Streams to Rekognition Video
- Rekognition makes predictions such as object identification and face recognition
- These predictions flow through to Kinesis Data Streams
- Lambda function takes predictions and triggers AWS SNS to send a message to a mobile device to notify user of a particular security alert in the footage
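The Lambda step above can be sketched as a handler that decodes the Kinesis records, filters for high-confidence alert labels, and publishes to SNS. The event shape follows the standard Kinesis-to-Lambda format, but the prediction fields (`label`, `confidence`), alert labels and topic ARN are all assumptions; the SNS call is commented out.

```python
import base64
import json

# Labels that should trigger a security alert (hypothetical choice).
ALERT_LABELS = {"Person", "Weapon"}

def extract_alerts(event, threshold=0.9):
    """Pull high-confidence alert predictions out of a Kinesis Lambda event."""
    alerts = []
    for rec in event["Records"]:
        # Kinesis record payloads arrive base64-encoded
        payload = json.loads(base64.b64decode(rec["kinesis"]["data"]))
        if payload.get("label") in ALERT_LABELS and payload.get("confidence", 0) >= threshold:
            alerts.append(payload)
    return alerts

def handler(event, context):
    alerts = extract_alerts(event)
    if alerts:
        # import boto3
        # boto3.client("sns").publish(
        #     TopicArn="arn:aws:sns:us-east-1:123456789012:security-alerts",
        #     Message=json.dumps(alerts))
        pass
    return {"alerts": len(alerts)}
```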
EMR
Elastic Map Reduce
- Managed service for hosting massively parallel compute tasks (e.g. Google search)
- Works well in the cloud
- Integrates with S3
- Petabyte scale processing
- Uses big data tools: Spark, Hadoop, HBase
- Task nodes are used to reduce compute costs by processing data but not holding persistent data in HDFS. Terminating a task node does not result in data loss or cause the application to terminate
Apache Spark
- Fast analytics engine
- Massive parallel compute
- Deployed over clusters of resources
- The aws-sagemaker-spark-sdk is installed when using EMR. This installs SageMaker Spark and associated dependencies
- You can use SageMaker Spark to construct Spark machine learning pipelines using SageMaker stages
EC2 for Machine Learning
- Compute instances sitting behind the model
- Instance types targeted at ML tasks include 'Compute Optimised' and 'Accelerated Computing' (GPU)
- AWS have certain types of AMIs (Amazon Machine Images) which are aimed at ML, such as:
- Conda-based deep learning AMIs for TensorFlow/Keras, MXNet/Gluon, PyTorch etc; GPU acceleration
- Deep Learning Base AMIs (low-level; for customised ML setups)
- Limits: with a new AWS account you won't be able to spin up large ML instances until you request an increase to the relevant compute limits (this can take a few days)