Processing Flashcards
1
Q
Lambda
A
- Run code snippets in the cloud
- Serverless
- Continuous Scaling
- Supports multiple programming languages
2
Q
Lambda Use Cases
A
- Real-time file processing
- Real-time stream processing
- ETL
- Cron replacement
3
Q
Lambda Supported Languages
A
- Node.js
- Python
- Java
- Go
- Ruby
4
Q
Lambda Triggers
A
- S3
- Kinesis Data Streams (KDS)
- SNS
- SQS
- Amazon CloudWatch
- AWS CloudFormation
5
Q
Lambda Redshift
A
- Best practice for loading data into Redshift is the COPY command
- Use DynamoDB to keep track of what has been loaded
- Lambda can batch up new data and load it with COPY (sketch below)
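A minimal sketch of this pattern, assuming an S3-triggered Lambda, a hypothetical DynamoDB tracking table named loaded_files, and the Redshift Data API; all names are illustrative:

```python
# Hypothetical sketch: S3-triggered Lambda that records each new file in a
# DynamoDB tracking table, then loads it into Redshift with COPY via the
# Redshift Data API. Cluster, table, and role names are illustrative.
import boto3

dynamodb = boto3.resource("dynamodb")
redshift_data = boto3.client("redshift-data")
tracking = dynamodb.Table("loaded_files")  # assumed tracking table

def handler(event, context):
    for record in event["Records"]:  # S3 PUT notifications
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        try:
            # Conditional put: fails if we already loaded this file
            tracking.put_item(
                Item={"s3_key": key},
                ConditionExpression="attribute_not_exists(s3_key)",
            )
        except dynamodb.meta.client.exceptions.ConditionalCheckFailedException:
            continue  # skip duplicates
        redshift_data.execute_statement(
            ClusterIdentifier="my-cluster",  # assumed
            Database="dev",
            DbUser="awsuser",
            Sql=(
                f"COPY sales FROM 's3://{bucket}/{key}' "
                "IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopy' "
                "FORMAT AS PARQUET;"
            ),
        )
```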
6
Q
Lambda + Kinesis
A
- Lambda receives an event with a batch of stream records
- Specify a batch size (up to 10,000 records)
- Batches that exceed Lambda's payload limit (6 MB) are split
- Lambda will retry the batch until it succeeds or the data expires
  - This can stall the shard if you do not handle errors properly
  - Use more shards to ensure processing isn't totally held up by errors
- Lambda processes shard data synchronously (handler sketch below)
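A minimal handler sketch for a Kinesis event source, assuming JSON-encoded payloads; process() is a hypothetical placeholder:

```python
# Minimal handler sketch for a Kinesis event source. Record data arrives
# base64-encoded; an unhandled exception fails the whole batch and makes
# Lambda retry it, which is how a shard can get stalled.
import base64
import json

def handler(event, context):
    for record in event["Records"]:
        payload = base64.b64decode(record["kinesis"]["data"])
        item = json.loads(payload)  # assumes JSON-encoded records
        process(item)

def process(item):  # hypothetical business logic
    print(item)
```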
7
Q
Lambda Cost Model
A
- Generous free tier (1M requests/month, 400K GB-seconds of compute time)
- $0.20 per million requests thereafter
- $0.00001667 per GB-second thereafter (worked example below)
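A worked example with the prices above, assuming 5M requests/month at 1 GB of memory and 200 ms each:

```python
# Worked example with the prices above: 5M requests/month,
# each running 200 ms at 1 GB of memory.
requests = 5_000_000
gb_seconds = requests * 0.200 * 1.0  # duration (s) x memory (GB)

request_cost = max(requests - 1_000_000, 0) / 1_000_000 * 0.20
compute_cost = max(gb_seconds - 400_000, 0) * 0.00001667
print(round(request_cost + compute_cost, 2))  # ~10.80 ($0.80 + $10.00)
```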
8
Q
Lambda Promises
A
- High availability
  - No scheduled downtime
  - Retries failed code 3 times
- Unlimited scalability
  - Safety throttle of 1,000 concurrent executions per region
- High performance
  - New functions callable in seconds
  - Code is cached automatically
  - Max timeout is 15 minutes
9
Q
Lambda Anti-Patterns
A
- Long-running applications
  - Use EC2 instead, or chain Lambda functions
- Dynamic websites
- Stateful applications
10
Q
AWS Glue
A
- Discovery of table schemas
  - S3
  - RDS
  - Redshift
  - DynamoDB
- Fully managed
- Serverless
- Runs on an event trigger or on a schedule
- Use cases
  - Data discovery / schema inference
  - Data Catalog
  - Data transformation (AWS Glue Studio)
  - Data replication (AWS Glue Elastic Views)
  - Data preparation (AWS Glue DataBrew)
11
Q
Glue Crawler / Data Catalog
A
- Glue crawler scans data in S3 and creates a schema
- Run periodically
- Populates the Glue Data Catalog
  - Stores only table definitions
  - Original data stays in S3
- Once cataloged, you can treat your unstructured data as if it were structured (crawler sketch below)
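A sketch of defining and starting a crawler with boto3; the role, database, schedule, and S3 path are illustrative assumptions:

```python
# Sketch: define and start a crawler with boto3. The role, database,
# schedule, and S3 path are assumptions for illustration.
import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="orders-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="orders_db",
    Targets={"S3Targets": [{"Path": "s3://my-bucket/orders/"}]},
    Schedule="cron(0 2 * * ? *)",  # run daily at 02:00 UTC
)
glue.start_crawler(Name="orders-crawler")
```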
12
Q
Glue and S3 Partition
A
- Glue crawler will extract partitions based on how your S3 data is organized
- Organize your paths around your query pattern:
  - Querying primarily by time range: yyyy/mm/dd/device_id
  - Querying primarily by device: device_id/yyyy/mm/dd
13
Q
Glue and Hive
A
- Hive lets you run SQL-like queries from EMR
- The Glue Data Catalog can serve as a Hive “metastore” (config sketch below)
- We can import a Hive metastore into Glue
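A sketch of the EMR configuration block that points Hive at the Glue Data Catalog as its metastore; the rest of the cluster definition (the run_job_flow call) is omitted:

```python
# Sketch: the EMR "hive-site" configuration that makes Hive use the Glue
# Data Catalog as its metastore; pass it in Configurations when creating
# the cluster (rest of the run_job_flow call omitted).
hive_glue_metastore = {
    "Classification": "hive-site",
    "Properties": {
        "hive.metastore.client.factory.class":
            "com.amazonaws.glue.catalog.metastore."
            "AWSGlueDataCatalogHiveClientFactory"
    },
}
```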
14
Q
Glue ETL
A
- Transform, clean, and enrich data before analysis
- Generates ETL code (which we can modify)
- We can provide our own Spark or PySpark scripts
- Target can be S3, JDBC, or the Glue Data Catalog
- Fully managed
- Scala or Python
- Can be event-driven or scheduled
- Can provision additional DPUs to increase performance of underlying Spark jobs
- Batch-oriented; you can schedule ETL jobs at a minimum of 5-minute intervals (job skeleton below)
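A minimal PySpark job skeleton of the kind Glue generates, assuming a cataloged orders table and an illustrative S3 output path:

```python
# Minimal PySpark job skeleton of the kind Glue generates. Database,
# table, and output path are illustrative assumptions.
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext.getOrCreate())

# Read a cataloged table, write it back to S3 as Parquet
dyf = glueContext.create_dynamic_frame.from_catalog(
    database="orders_db", table_name="orders")
glueContext.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/clean/orders/"},
    format="parquet",
)
```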
15
Q
Glue DynamicFrame
A
- DynamicFrame is a collection of DynamicRecords
- DynamicRecords are self-describing and have a schema
- Similar to Spark DataFrame
- Scala and Python APIs
16
Q
Glue ETL Transformations
A
- Bundled transformations (sketch below)
  - DropFields, DropNullFields (remove null fields)
  - Filter, Join, Map
- Machine learning transformations
  - FindMatches ML: identify duplicate or matching records in your dataset
- Format conversions: CSV, JSON, Avro, Parquet, ORC, XML
- Apache Spark transformations (example: K-Means)
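A sketch of bundled transformations applied to the DynamicFrame from the skeleton above; field names are illustrative:

```python
# Sketch: bundled transformations applied to the DynamicFrame from the
# skeleton above; field names are illustrative.
from awsglue.transforms import DropFields, DropNullFields, Filter
from awsglue.dynamicframe import DynamicFrame

dyf = DropFields.apply(frame=dyf, paths=["debug_info"])
dyf = DropNullFields.apply(frame=dyf)
dyf = Filter.apply(frame=dyf, f=lambda row: row["amount"] > 0)

# DynamicFrames convert to/from Spark DataFrames for Spark-native work
df = dyf.toDF()
dyf = DynamicFrame.fromDF(df, glueContext, "from_spark")
```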
17
Q
Glue ETL : Modifying the Data Catalog
A
- ETL scripts can update your schema and partitions if necessary
- Adding new partitions / updating the table schema:
  - Re-run the crawler, or
  - Use enableUpdateCatalog / updateBehavior from the script (sketch below)
- Restrictions
  - S3 only
  - JSON, CSV, Avro, Parquet only
  - Parquet requires special code
  - Nested schemas are not supported
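A sketch of the script-side catalog update, continuing the skeleton above; partition keys and names are assumptions:

```python
# Sketch: write that updates the Data Catalog instead of re-running the
# crawler (S3 sinks only); partition keys are assumed columns.
additional_options = {
    "enableUpdateCatalog": True,
    "updateBehavior": "UPDATE_IN_DATABASE",
    "partitionKeys": ["year", "month"],
}
glueContext.write_dynamic_frame_from_catalog(
    frame=dyf,
    database="orders_db",
    table_name="orders",
    additional_options=additional_options,
)
```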
18
Q
Glue Development Endpoints
A
- Develop ETL scripts using a notebook
- Endpoint is in a VPC controlled by security groups; connect via
  - Apache Zeppelin notebook
  - SageMaker notebook
  - Terminal window
19
Q
Running Glue Jobs
A
- Time-based schedules (cron-style)
- Job bookmarks (sketch below)
  - Persist state from the job run
  - Prevent reprocessing of old data
  - Allow you to process only new data when re-running on a schedule
  - Work with S3 sources in a variety of formats
  - Work with relational databases via JDBC
- CloudWatch Events
  - Fire off a Lambda function or SNS notification when ETL succeeds or fails
  - Invoke an EC2 action, send the event to Kinesis, or activate a Step Function
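A sketch of how bookmarks attach to a job, continuing the skeleton above: bookmarks are switched on per job (the --job-bookmark-option job argument), a Job wrapper persists state, and each source needs a transformation_ctx:

```python
# Sketch: job bookmarks, continuing the skeleton above. Bookmarks are
# switched on per job (--job-bookmark-option job-bookmark-enable); the
# Job wrapper persists state and each source needs a transformation_ctx.
from awsglue.job import Job

job = Job(glueContext)
job.init(args["JOB_NAME"], args)

dyf = glueContext.create_dynamic_frame.from_catalog(
    database="orders_db",
    table_name="orders",
    transformation_ctx="orders_source",  # bookmark key for this source
)
# ... transforms and writes ...
job.commit()  # saves the bookmark; skip it and data gets reprocessed
```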
20
Q
Glue Cost Model
A
- Billed by the second for crawlers and ETL jobs
- First million objects stored and first million accesses are free for the Glue Data Catalog
- Development endpoints for developing ETL code charged by the minute
21
Q
Glue Anti-Patterns
A
- Engines other than Spark (Glue ETL is Spark-based; use EMR for other engines)
22
Q
AWS Glue Studio
A
- Visual interface for ETL workflows
- Visual Job Editor
  - Create DAGs for complex workflows
  - Sources include S3, Kinesis, Kafka, JDBC
  - Transform / sample / join data
  - Target to S3 or the Glue Data Catalog
  - Supports partitioning
- Visual Job Dashboard
23
Q
AWS Glue DataBrew
A
- Visual data preparation tool
- UI for preprocessing large data sets
- Input from S3, a data warehouse, or a database
- Output to S3
- Over 250 ready-made transformations
- Create “recipes” of transformations that can be saved as jobs within a larger project
- Define data quality rules
- Create datasets with custom SQL from Redshift and Snowflake
- Security
  - Can integrate with KMS
  - SSL in transit
  - IAM
24
Q
AWS Glue Elastic Views
A
- Builds materialized views from Aurora, RDS, DynamoDB
- Those views can be used by Redshift, Elasticsearch, S3, DynamoDB, Aurora, RDS
- SQL interface
- Handles copying, combining, and replicating data
- Monitors for changes and continuously updates
- Serverless
25
Q
AWS Lake Formation
A
- “Makes it easy to set up a secure data lake in days”
- Loading data and monitoring data flows
- Setting up partitions
- Encryption and managing keys
- Defining transformation jobs and monitoring them
- Built on top of Glue
- Auditing