AWS Data Services Flashcards
1
Q
S3
A
- Simple Storage Service
- Storage for large amounts of data that we want to actively access
- Structured or unstructured data
- Can be used as a data lake:
  - Collection of S3 buckets
  - Structured (CSV, JSON)
  - Unstructured (text files, images)
2
Q
Advantages and disadvantages of data lakes
A
Adv: many sources, schema defined at analysis time (schema-on-read), lower cost than data warehouse solutions, tolerant of low-quality data
Disadv: unsuitable for transactional systems, needs cataloguing before analysis
3
Q
Simple ML workflow
A
- Kinesis Data Firehose ingests data into S3
- Glue “crawls” through the S3 bucket to make a catalogue of the data for Athena
- Athena queries S3 via use of catalogue
- Athena provides data to SageMaker for ML
* For large data sets we could use EMR/Spark instead of, or alongside, Athena
4
Q
Security in S3
A
- IAM users and roles, with attached policies that govern how they can use S3
- Bucket policy (resource-level policies)
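- Encryption:
  - SSE-S3: server-side encryption with S3-managed keys (good for enterprises that need to tick the encryption-at-rest box)
  - SSE-KMS: server-side encryption with keys held in KMS; we can create our own keys within the AWS console, or import our own keys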
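Example (a minimal sketch, not a definitive setup): the bucket policy and the two encryption options can be expressed as below. It assumes boto3 is installed and credentials are configured for the upload step; the bucket name, role ARN and key ID a caller passes in are hypothetical placeholders.

```python
import json
from typing import Optional


def read_only_bucket_policy(bucket: str, role_arn: str) -> str:
    """Build a resource-level (bucket) policy that lets one IAM role read objects."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowRoleReadOnly",
                "Effect": "Allow",
                "Principal": {"AWS": role_arn},
                "Action": ["s3:GetObject"],
                "Resource": f"arn:aws:s3:::{bucket}/*",
            }
        ],
    }
    return json.dumps(policy)


def encryption_args(kms_key_id: Optional[str] = None) -> dict:
    """Extra put_object arguments: SSE-S3 by default, SSE-KMS when a key ID is given."""
    if kms_key_id is None:
        return {"ServerSideEncryption": "AES256"}
    return {"ServerSideEncryption": "aws:kms", "SSEKMSKeyId": kms_key_id}


def upload_encrypted(bucket: str, key: str, body: bytes, kms_key_id: Optional[str] = None):
    """Upload one object with server-side encryption (needs boto3 and credentials)."""
    import boto3  # imported lazily so the helpers above work without AWS access

    s3 = boto3.client("s3")
    return s3.put_object(Bucket=bucket, Key=key, Body=body, **encryption_args(kms_key_id))
```

The policy string can be attached with the S3 client's `put_bucket_policy(Bucket=..., Policy=...)` call.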
5
Q
AWS Glue
A
- Cataloguing and structuring the bucket
- Stores, annotates and shares metadata
- Creates catalogues of data (schemas)
- “Crawler” which goes through the S3 bucket and makes the catalogue to store in Glue Data Catalogue
- Also crawls DynamoDB etc.
- Produces a single view/endpoint of the data
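Example (a sketch under assumptions): defining and starting a crawler over one S3 prefix with boto3. The crawler name, role ARN, database name and path are hypothetical values supplied by the caller; the role is assumed to already have Glue and S3 permissions.

```python
from typing import Optional


def crawler_request(name: str, role_arn: str, database: str,
                    s3_path: str, schedule: Optional[str] = None) -> dict:
    """Keyword arguments for glue.create_crawler targeting a single S3 prefix."""
    request = {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,  # catalogue database that receives the tables
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }
    if schedule is not None:
        request["Schedule"] = schedule  # e.g. "cron(0 2 * * ? *)" for 02:00 UTC daily
    return request


def create_and_run_crawler(**kwargs) -> None:
    """Create the crawler, then start a first run (needs boto3 and credentials)."""
    import boto3  # imported lazily so crawler_request works without AWS access

    glue = boto3.client("glue")
    request = crawler_request(**kwargs)
    glue.create_crawler(**request)
    glue.start_crawler(Name=request["Name"])
```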
6
Q
Glue ETL capability
A
- Extract data out of somewhere, perform an operation, then reload in another location
- We can “glue” together different data sources and perform some transformations
- Can interact with a variety of data sources inside and outside of AWS
- Using the metadata in Data Catalogue, Glue can autogenerate Scala or PySpark scripts with Glue extensions that can be used and modified for ETL operations
7
Q
Glue jobs system
A
- Provides managed infrastructure to orchestrate ETL workflows.
- Can be created to automate ETL scripts and transfer data to different locations
- Jobs can be scheduled and chained, or triggered by events such as the arrival of new data
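Example (a sketch, assuming the Glue jobs already exist): scheduling and chaining map onto Glue triggers. The helpers below only build the arguments for the Glue client's `create_trigger` call; all trigger and job names are hypothetical.

```python
def scheduled_trigger(name: str, job_name: str, cron: str) -> dict:
    """Arguments for glue.create_trigger that run a job on a cron schedule."""
    return {
        "Name": name,
        "Type": "SCHEDULED",
        "Schedule": cron,  # e.g. "cron(0 2 * * ? *)"
        "Actions": [{"JobName": job_name}],
        "StartOnCreation": True,
    }


def chained_trigger(name: str, upstream_job: str, downstream_job: str) -> dict:
    """Arguments for glue.create_trigger that start one job after another succeeds."""
    return {
        "Name": name,
        "Type": "CONDITIONAL",
        "Predicate": {
            "Conditions": [
                {
                    "LogicalOperator": "EQUALS",
                    "JobName": upstream_job,
                    "State": "SUCCEEDED",
                }
            ]
        },
        "Actions": [{"JobName": downstream_job}],
        "StartOnCreation": True,
    }


def create_trigger(trigger_args: dict) -> None:
    """Register the trigger with Glue (needs boto3 and credentials)."""
    import boto3  # imported lazily so the builders above work without AWS access

    boto3.client("glue").create_trigger(**trigger_args)
```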
8
Q
Glue FindMatches
A
- Enables you to identify duplicate or matching records in your data set, even when the records do not have a common unique identifier and no fields match exactly. It does not require writing any code or knowing how ML works
9
Q
Database Migration Service (DMS)
A
- Used to migrate relational databases, data warehouses, NoSQL databases and other types of data stores
- You can migrate data to S3 using DMS from any of the supported database sources
10
Q
Athena
A
- Query S3 with SQL
- Source data from multiple S3 locations
- Save outputs to S3
- Use for data pre-processing ahead of ML
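Example (a minimal sketch, assuming a catalogued database and a results bucket already exist): submitting a SQL query with boto3 and polling until it finishes. The database name, SQL text and output location are hypothetical values supplied by the caller.

```python
import time


def athena_request(sql: str, database: str, output_location: str) -> dict:
    """Keyword arguments for athena.start_query_execution."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        # Athena writes the result files to this S3 location
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def run_query(sql: str, database: str, output_location: str, poll_seconds: float = 1.0):
    """Start a query and wait for a terminal state (needs boto3 and credentials)."""
    import boto3  # imported lazily so athena_request works without AWS access

    athena = boto3.client("athena")
    request = athena_request(sql, database, output_location)
    query_id = athena.start_query_execution(**request)["QueryExecutionId"]
    while True:
        execution = athena.get_query_execution(QueryExecutionId=query_id)
        state = execution["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return query_id, state
        time.sleep(poll_seconds)
```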
11
Q
QuickSight
A
- AWS BI tool
- Visualise data from many sources: dashboards, email reports, embedded reports
- End-user application
- Not inside the AWS console, but can be accessed from the console
- Drag-and-drop application in the browser; can plug into many different data sources such as Dynamo, S3, GitHub, other SQL DBs etc.
12
Q
Kinesis
A
- Large-scale data ingestion
- E.g. lots of video data from few sources, or small amounts of data from many sources (IoT)
13
Q
Kinesis Video Streams
A
- Securely stream video from connected devices to AWS for analytics, ML, playback and other processing
14
Q
Kinesis Data Streams
A
- General endpoint for ingesting large amounts of data for processing by:
  - Kinesis Data Analytics
  - Spark on EMR
  - EC2
  - Lambda
- Can be used to collect and process large streams of data records in real time. You can create Kinesis data stream applications to process data from streams
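Example (a sketch under assumptions): a producer that builds `put_record` arguments and a Lambda-style consumer that decodes records from the stream. The stream name and the `device_id` field are hypothetical; Kinesis delivers record data to Lambda base64-encoded, which is why the handler decodes it first.

```python
import base64
import json


def put_record_args(stream_name: str, reading: dict) -> dict:
    """Keyword arguments for kinesis.put_record for one JSON reading."""
    return {
        "StreamName": stream_name,
        "Data": json.dumps(reading).encode("utf-8"),
        # Records with the same partition key are routed to the same shard
        "PartitionKey": str(reading["device_id"]),
    }


def send_reading(stream_name: str, reading: dict):
    """Write one record to the stream (needs boto3 and credentials)."""
    import boto3  # imported lazily so the other helpers work without AWS access

    return boto3.client("kinesis").put_record(**put_record_args(stream_name, reading))


def handler(event, context=None):
    """Lambda-style consumer: decode each Kinesis record and return the parsed payloads."""
    readings = []
    for record in event["Records"]:
        payload = base64.b64decode(record["kinesis"]["data"])
        readings.append(json.loads(payload))
    return readings
```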
15
Q
Kinesis Data Firehose
A
- Simple endpoint to stream data into:
  - S3
  - Redshift
  - Elasticsearch
  - Splunk (3rd party data analysis, reporting software)
- Can also transform data before delivering
- Is NOT designed for custom data stream processing and real-time metrics (use Kinesis Data Streams)
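Example (a minimal sketch, assuming a delivery stream already exists): sending events to Firehose in batches with boto3. The delivery stream name is hypothetical. Each event is written as newline-delimited JSON, a common choice (an assumption here) so that files landing in S3 are easy to query; `PutRecordBatch` accepts at most 500 records per call, hence the chunking.

```python
import json

MAX_BATCH = 500  # PutRecordBatch accepts at most 500 records per call


def to_records(events):
    """Serialise events as newline-delimited JSON records for Firehose."""
    return [{"Data": (json.dumps(event) + "\n").encode("utf-8")} for event in events]


def batches(records, size=MAX_BATCH):
    """Yield consecutive slices of at most `size` records."""
    for start in range(0, len(records), size):
        yield records[start:start + size]


def send(delivery_stream: str, events) -> int:
    """Send all events and return how many records Firehose rejected (needs boto3)."""
    import boto3  # imported lazily so the helpers above work without AWS access

    firehose = boto3.client("firehose")
    failed = 0
    for batch in batches(to_records(events)):
        response = firehose.put_record_batch(
            DeliveryStreamName=delivery_stream, Records=batch
        )
        failed += response["FailedPutCount"]
    return failed
```

A production sender would also retry the rejected records reported in the response; that is left out here to keep the sketch short.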