[2] AWS Data Ecosystem Flashcards
How is security managed with S3?
- Encryption - SSE-S3 and SSE-KMS (server-side encryption with S3-managed or KMS-managed keys)
- IAM
- Bucket policies
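As an illustration of how encryption and bucket policies can work together, here is a minimal sketch of a bucket policy that rejects uploads which do not ask for server-side encryption. The bucket name `example-bucket` and the function name are hypothetical, and the policy is only built as a Python dict here, not applied to any bucket.

```python
import json


def deny_unencrypted_uploads_policy(bucket_name: str) -> dict:
    """Build a bucket policy denying PutObject requests that carry no SSE header."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyUnencryptedUploads",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:PutObject",
                "Resource": f"arn:aws:s3:::{bucket_name}/*",
                # "Null": "true" matches requests where the header is absent
                "Condition": {
                    "Null": {"s3:x-amz-server-side-encryption": "true"}
                },
            }
        ],
    }


# "example-bucket" is a placeholder, not a real bucket
policy = deny_unencrypted_uploads_policy("example-bucket")
policy_json = json.dumps(policy, indent=2)
```

The JSON string could then be attached to the bucket, for example with the S3 client's `put_bucket_policy` call.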
What is Data Pipeline?
A managed service for creating highly available data workflows that move data between services
What is EMR?
A managed service for hosting massively parallel compute tasks with frameworks such as Spark and Hadoop
How are EMR deployments structured?
There is a master node, which runs for the lifetime of the cluster and coordinates it; core nodes, which store the data (HDFS) and also run tasks; and task nodes, which only do computation and hold no data (so these can be spot instances)
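The three node roles map onto the instance groups passed when a cluster is created. A rough sketch, assuming the `InstanceGroups` shape used by the EMR `run_job_flow` API; the instance type `m5.xlarge`, the group names and the function name are placeholders chosen for illustration.

```python
def emr_instance_groups(core_count: int, task_count: int) -> list:
    """Describe one master, some core nodes and some spot task nodes."""
    instance_type = "m5.xlarge"  # placeholder instance type
    return [
        {
            "Name": "Master",
            "InstanceRole": "MASTER",
            "Market": "ON_DEMAND",  # must stay up for the whole cluster lifetime
            "InstanceType": instance_type,
            "InstanceCount": 1,
        },
        {
            "Name": "Core",
            "InstanceRole": "CORE",
            "Market": "ON_DEMAND",  # core nodes hold data, so losing them is costly
            "InstanceType": instance_type,
            "InstanceCount": core_count,
        },
        {
            "Name": "Task",
            "InstanceRole": "TASK",
            "Market": "SPOT",  # compute-only nodes tolerate interruption
            "InstanceType": instance_type,
            "InstanceCount": task_count,
        },
    ]


groups = emr_instance_groups(core_count=2, task_count=4)
```

The resulting list would go under `Instances={"InstanceGroups": groups, ...}` in a `run_job_flow` call.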
How can machine learning on EC2 be streamlined?
With the Deep Learning AMIs, which bundle key libraries, drivers etc.
What is AWS Batch?
A service for processing large amounts of data in parallel, e.g. batch inference
What features does Glue have?
- crawlers to create catalogues of the data
- managed ETL using Python or Scala
- some ML capabilities such as deduplicating records
What import and export locations does Glue support?
It can read data from DynamoDB, S3 and services supporting JDBC
Results can be saved to a database or S3
What are the steps to using Glue?
(1) build a data catalogue using crawlers
(2) define transformations
(3) schedule and run transformations
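The three steps can be sketched against the Glue API as follows. This assumes a boto3 Glue client is passed in as `glue`, and that the crawler and the ETL job (whose script holds the transformations) already exist; all names and the cron expression in the usage comment are hypothetical.

```python
import time


def run_glue_steps(glue, crawler_name, job_name, trigger_name, schedule,
                   poll_seconds=30):
    """Crawl the data, then put an existing ETL job on a schedule."""
    # (1) Build the data catalogue: start the crawler and wait until it is idle.
    glue.start_crawler(Name=crawler_name)
    while glue.get_crawler(Name=crawler_name)["Crawler"]["State"] != "READY":
        time.sleep(poll_seconds)

    # (2) The transformations are defined in the job's Python or Scala script,
    #     which is assumed to have been registered already under job_name.

    # (3) Schedule the job so that it runs on a recurring basis.
    response = glue.create_trigger(
        Name=trigger_name,
        Type="SCHEDULED",
        Schedule=schedule,
        Actions=[{"JobName": job_name}],
        StartOnCreation=True,
    )
    return response["Name"]


# Usage (requires boto3 and AWS credentials; every name is a placeholder):
#   import boto3
#   run_glue_steps(boto3.client("glue"), "sales-crawler", "sales-etl",
#                  "nightly-sales", "cron(0 2 * * ? *)")
```

A one-off run could use `start_job_run(JobName=...)` in place of the trigger.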
What is Athena?
A managed service to perform SQL-like queries on data.
The data must be in S3 and the results are stored in S3
Data can be sourced from multiple S3 locations
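A short sketch of how a query is submitted, showing that the results location in S3 is part of the request. It assumes a boto3 Athena client is passed in as `athena`; the database, table and bucket names in the usage comment are hypothetical.

```python
def start_athena_query(athena, sql, database, output_location):
    """Submit a query and return its execution id; results land in S3."""
    response = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        # Athena writes the result files under this S3 prefix
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]


# Usage (requires boto3 and AWS credentials; every name is a placeholder):
#   import boto3
#   query_id = start_athena_query(
#       boto3.client("athena"),
#       "SELECT COUNT(*) FROM orders",
#       "example_db",
#       "s3://example-results-bucket/athena/",
#   )
```

The call returns immediately; the query's status can then be polled with `get_query_execution`.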
What are the Kinesis services?
Kinesis Video Streams
Kinesis Data Streams
Kinesis Data Firehose
Kinesis Data Analytics
What is Kinesis Video Streams?
A service which streams video from devices to AWS for analytics, machine learning, playback and encoding etc.
What is Kinesis Data Streams?
Provides an endpoint for processing data in real time with Kinesis Data Analytics, Spark on EMR, EC2 or Lambda etc.
When the stream is consumed by Kinesis Data Analytics, static reference data can be sourced from S3. Only one stream can be ingested at a time (same as Firehose)
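On the producer side, writing to the stream's endpoint might look like the sketch below. It assumes a boto3 Kinesis client is passed in as `kinesis`; the stream name, the event fields and the choice of partition key in the usage comment are hypothetical.

```python
import json


def put_event(kinesis, stream_name, event, partition_key):
    """Send one JSON-encoded event to a data stream."""
    response = kinesis.put_record(
        StreamName=stream_name,
        Data=json.dumps(event).encode("utf-8"),
        # records sharing a partition key are routed to the same shard
        PartitionKey=partition_key,
    )
    return response["SequenceNumber"]


# Usage (requires boto3 and AWS credentials; every name is a placeholder):
#   import boto3
#   put_event(boto3.client("kinesis"), "example-stream",
#             {"sensor": "s1", "temp": 21.5}, partition_key="s1")
```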
What is Kinesis Data Firehose?
Data is streamed IN BATCHES into S3, Redshift, Elasticsearch or Splunk etc. for offline processing
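Firehose does the buffering and batched delivery to the destination itself, so a producer only has to put records on the delivery stream. A sketch of that producer side, assuming a boto3 Firehose client is passed in as `firehose`; the helper names are hypothetical, and the limit of 500 records per `put_record_batch` call is stated from memory of the API limits.

```python
import json

MAX_BATCH = 500  # assumed per-call record limit for PutRecordBatch


def chunked(items, size):
    """Yield consecutive slices of items, each at most size long."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def send_events(firehose, delivery_stream, events):
    """Send events as newline-delimited JSON; return the number that failed."""
    failed = 0
    for batch in chunked(events, MAX_BATCH):
        records = [
            {"Data": (json.dumps(event) + "\n").encode("utf-8")}
            for event in batch
        ]
        response = firehose.put_record_batch(
            DeliveryStreamName=delivery_stream,
            Records=records,
        )
        failed += response["FailedPutCount"]
    return failed
```

A production producer would retry the records reported as failed rather than only counting them.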
What is Kinesis Data Analytics?
Streaming data from Kinesis Data Streams or Firehose is processed in real time using SQL or Java libraries