Processing Flashcards
Lambda
- Run code snippets in the cloud
- Serverless
- Continuous Scaling
- Support multiple programming languages
Lambda Use Cases
- Real time file processing
- Real time stream processing
- ETL
- Cron replacement
Lambda Supported Languages
- Node.js
- Python
- Java
- Go
- Ruby
Lambda Triggers
- S3
- KDS
- SNS
- SQS
- AWS CloudWatch
- AWS CloudFormation
Lambda Redshift
- Best practice for loading data into Redshift is the COPY command
- Use DynamoDB to keep track of what has been loaded
- Lambda can batch up new data and load them with COPY
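A minimal sketch of this pattern, assuming an S3-triggered Lambda, a hypothetical DynamoDB tracking table, and the Redshift Data API (all names are placeholders):

```python
import boto3

dynamodb = boto3.resource("dynamodb")
redshift_data = boto3.client("redshift-data")

TRACKER_TABLE = "loaded_batches"      # hypothetical DynamoDB tracking table
CLUSTER_ID = "my-redshift-cluster"    # hypothetical cluster identifier

def handler(event, context):
    tracker = dynamodb.Table(TRACKER_TABLE)
    for record in event["Records"]:   # S3 put events
        key = record["s3"]["object"]["key"]
        # Skip batches DynamoDB says were already loaded (idempotency)
        if "Item" in tracker.get_item(Key={"batch_key": key}):
            continue
        redshift_data.execute_statement(
            ClusterIdentifier=CLUSTER_ID,
            Database="dev",
            DbUser="awsuser",
            Sql=f"COPY events FROM 's3://my-bucket/{key}' "
                "IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole' "
                "FORMAT AS JSON 'auto';",
        )
        tracker.put_item(Item={"batch_key": key})
```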
Lambda + Kinesis
- Lambda receives an event with a batch of stream records
- Specify a batch size (up to 10,000 records)
- Batches larger than Lambda’s payload limit (6 MB) will be split
- Lambda will retry the batch until it succeeds or the data expires
- This can stall the shard if you do not handle errors properly
- Use more shards to ensure processing isn’t totally held up by errors
- Lambda processes shard data synchronously
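A minimal handler sketch for the above; Kinesis records arrive base64-encoded, and bad records are caught rather than re-raised so one poison record cannot stall the shard (process() is a stand-in for your logic):

```python
import base64
import json

def handler(event, context):
    for record in event["Records"]:
        payload = base64.b64decode(record["kinesis"]["data"])
        try:
            process(json.loads(payload))
        except Exception as exc:
            # Log (or dead-letter) the bad record instead of failing the
            # whole batch, which Lambda would otherwise keep retrying
            print(f"skipping record {record['kinesis']['sequenceNumber']}: {exc}")

def process(message):
    ...  # your business logic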
Lambda Cost Model
- Generous free tier (1M requests/month, 400K GB-seconds of compute time)
- $0.20 per million requests after that
- $0.00001667 per GB-second of compute after that
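A back-of-the-envelope check of how these numbers combine (illustrative workload; prices can vary by region):

```python
requests = 5_000_000          # invocations per month
duration_s = 0.2              # average duration per invocation
memory_gb = 0.512             # 512 MB allocated

request_cost = max(requests - 1_000_000, 0) / 1_000_000 * 0.20
gb_seconds = requests * duration_s * memory_gb
compute_cost = max(gb_seconds - 400_000, 0) * 0.00001667

print(f"{gb_seconds:,.0f} GB-s -> ${request_cost + compute_cost:.2f}/month")
# 512,000 GB-s -> $2.67/month ($0.80 requests + $1.87 compute)
```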
Lambda Promises
- High availability
- No scheduled downtime
- Retries failed code 3 times
- Unlimited scalability
- Safety throttle of 1,000 concurrent executions per region
- High Performance
- New functions callable in seconds
- Code is cached automatically
- Max Timeout is 15 mins
Lambda Anti-Patterns
- Long-running applications
- Use EC2 instead or chain functions
- Dynamic Websites
- Stateful applications
AWS Glue
- Discovery of table schema
- S3
- RDS
- Redshift
- DynamoDB
- Fully managed
- Serverless
- Event-triggered or on a schedule
- Use Cases
- Data Schema Discovery
- Data Catalog
- Data Transformation (AWS Glue Studio)
- Data Replication (AWS Glue Elastic Views)
- Data Preparation (AWS Glue DataBrew)
Glue Crawler / Data Catalog
- Glue crawler scans data in S3, creates schema
- Run periodically
- Populates the Glue Data Catalog
- Stores only table definition
- Original data stays in S3
- Once cataloged, you can treat your unstructured data like it is structured
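A hedged sketch of setting up and running a crawler with boto3 (the crawler name, IAM role, and bucket are hypothetical; the same setup is commonly done in the console):

```python
import boto3

glue = boto3.client("glue")
glue.create_crawler(
    Name="orders-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="mydb",
    Targets={"S3Targets": [{"Path": "s3://my-bucket/orders/"}]},
    Schedule="cron(0 2 * * ? *)",   # run daily at 02:00 UTC
)
glue.start_crawler(Name="orders-crawler")
```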
Glue and S3 Partition
- Glue crawler will extract partitions based on how your S3 data is organized
- Organize your S3 paths to match your query patterns, e.g.
- yyyy/mm/dd/device_id
- device_id/yyyy/mm/dd
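If the crawler has registered those path components as partition columns (for unkeyed paths it may name them partition_0, partition_1, ...), a Glue job can prune partitions at read time with a push_down_predicate. A sketch with hypothetical names:

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Only the S3 prefixes matching the predicate are listed and read
dyf = glueContext.create_dynamic_frame.from_catalog(
    database="telemetry",
    table_name="device_readings",
    push_down_predicate="yyyy='2023' and mm='06'",
)
```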
Glue and Hive
- Hive lets you run SQL-like queries from EMR
- The Glue Data Catalog can serve as a Hive “metastore”
- We can import a Hive metastore into Glue
Glue ETL
- Transform, Clean and Enrich data before data analysis
- Generate ETL code (we can modify it)
- We can provide our own Spark or PySpark scripts
- Target can be S3, JDBC, or tables in the Glue Data Catalog
- Fully managed
- Scala or Python
- Can be event-driven or scheduled
- Can provision additional DPUs (data processing units) to increase performance of underlying Spark jobs
- Batch-oriented, and you can schedule your ETL jobs at a minimum of 5-minute intervals
Glue DynamicFrame
- DynamicFrame is a collection of DynamicRecords
- DynamicRecords are self-describing and have schema
- Similar to Spark DataFrame
- Scala and Python APIs
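A minimal Python sketch (database/table names hypothetical): load a cataloged table as a DynamicFrame, inspect the inferred schema, and convert to a Spark DataFrame when full Spark functionality is needed:

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())
dyf = glueContext.create_dynamic_frame.from_catalog(
    database="mydb", table_name="orders")

dyf.printSchema()   # each DynamicRecord is self-describing
df = dyf.toDF()     # to a Spark DataFrame; DynamicFrame.fromDF converts back
```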
Glue ETL Transformations
- Bundled Transformations
- DropFields, DropNullFields (remove fields / null fields)
- Filter, Join, Map
- Machine Learning Transformation
- FindMatches ML : Identify duplicate or matching records in your dataset
- Format Conversions : CSV, JSON, Avro, Parquet, ORC, XML
- Apache Spark transformation (example : KMeans)
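A short sketch chaining two of the bundled transformations on the DynamicFrame from the previous snippet (the field name is hypothetical):

```python
from awsglue.transforms import DropNullFields, Filter

# dyf: the DynamicFrame loaded in the previous sketch
recent = Filter.apply(frame=dyf, f=lambda row: row["year"] >= 2020)
clean = DropNullFields.apply(frame=recent)
```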
Glue ETL : Modifying the Data Catalog
- ETL scripts can update your schema and partitions if necessary
- Updating table schema (when there is a new partition)
- Re-run the crawler or
- Use enableUpdateCatalog / updateBehavior from script
- Restrictions
- S3 only
- JSON, CSV, Avro, Parquet only
- Parquet requires special code
- Nested schemas are not supported
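A hedged sketch of the script-driven route (S3 target, hypothetical names), continuing from the earlier snippets’ glueContext:

```python
# Writing through a catalog-aware sink updates the table's schema and
# partitions without re-running the crawler
sink = glueContext.getSink(
    connection_type="s3",
    path="s3://my-bucket/output/",
    enableUpdateCatalog=True,
    updateBehavior="UPDATE_IN_DATABASE",
    partitionKeys=["year"],
)
sink.setFormat("json")
sink.setCatalogInfo(catalogDatabase="mydb", catalogTableName="orders_out")
sink.writeFrame(clean)
```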
Glue Development Endpoints
- Develop ETL scripts using a Notebook
- Endpoint is in a VPC controlled by security groups; connect via
- Apache Zeppelin notebook, SageMaker notebook, or a terminal window
Running Glue Jobs
- Time-based schedules (cron styles)
- Job bookmarks
- Persists state from the job run
- Prevents reprocessing of old data
- Allows you to process new data only when re-running on a schedule
- Works with S3 sources in a variety of formats
- Works with relational databases via JDBC
- CloudWatch Events
- Fire off a Lambda function or SNS notification when ETL succeeds or fails
- Invoke EC2 run, send event to Kinesis, activate a Step Function
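Bookmarks only take effect when the job runs with bookmarks enabled (the --job-bookmark-option job argument) and the script commits its state. A sketch of the standard boilerplate with hypothetical table names:

```python
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext.getOrCreate())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# The transformation_ctx string is the key the bookmark uses to remember
# what this source has already processed
dyf = glueContext.create_dynamic_frame.from_catalog(
    database="mydb", table_name="events",
    transformation_ctx="events_source")
# ... transforms and writes ...
job.commit()   # persist bookmark state so the next run skips old data
```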
Glue Cost Model
- Billed by the second for crawler and ETL jobs
- First million objects stored and first million accesses are free for the Glue Data Catalog
- Development endpoints for developing ETL code charged by the minute
Glue Anti-Patterns
- Non-Spark engines (Glue ETL runs on Apache Spark only)
AWS Glue Studio
- Visual interface for ETL workflows
- Visual Job Editor
- Create DAGs for complex workflows
- Sources include S3, Kinesis, Kafka, JDBC
- Transform / sample / join data
- Target to S3 or Glue Data Catalog
- Supports partitioning
- Visual Job Dashboard
AWS Glue DataBrew
- Visual data preparation tool
- UI for preprocessing large data sets
- Input from S3, data warehouse, or database
- Output to S3
- Over 250 ready-made transformations
- Create “recipes” of transformations that can be saved as jobs within a larger project
- Define data quality rules
- Create datasets with custom SQL from Redshift and Snowflake
- Security
- Can integrate with KMS
- SSL in transit
- IAM
AWS Glue Elastic Views
- Builds materialized views from Aurora, RDS, DynamoDB
- Those views can be used by Redshift, ElasticSearch, S3, DynamoDB, Aurora, RDS
- SQL Interface
- Handles all the copying, combining, and replicating of data
- Monitors for changes and continuously updates
- Serverless
AWS Lake Formation
- “Makes it easy to set up a secure data lake in days”
- Loading data and monitoring data flows
- Setting up partitions
- Encryption and managing keys
- Defining transformation jobs and monitoring them
- Built on top of Glue
- Auditing
AWS Lake Formation Pricing
- No cost for Lake Formation itself
- But underlying services incur charges
- Glue
- S3
- EMR
- Athena
- Redshift
AWS Lake Formation : The Finer Points
- Cross-account Lake Formation permissions
- Lake Formation does not support manifests in Athena or Redshift queries
- IAM needed to create blueprints and workflows
AWS Lake Formation : Governed Tables and Security
- Now supports “Governed Tables” that support ACID transactions across multiple tables
- Storage Optimization with Automatic Compaction
- Granular Access Control with row and cell level security
Elastic MapReduce
- Managed Hadoop framework on EC2 instances
- Includes Spark, HBase, Presto, Flink, Hive, and more
- EMR Notebooks
- Several Integration points with AWS
EMR Cluster
- Master Node : Manages the cluster
- Tracks status of tasks, monitors cluster health
- Single EC2 instance
- Core Node : Hosts HDFS data and runs tasks
- Can be scaled up and down
- Multi-node clusters have at least one
- Task Node : Runs tasks and does not host data
- Optional, with no risk of data loss when removed
- Good use of Spot instances
EMR Usage
- Transient vs Long-Running Clusters
- Transient clusters terminate once all steps are completed
- Long running clusters must be manually terminated
- Basically a data warehouse with periodic processing on large datasets
- Can spin up task nodes using Spot instances for temporary capacity
- Can use reserved instances on long-running clusters to save money
- Termination protection enabled by default
- Frameworks and applications are specified at cluster launch
- Connect directly to master to run jobs directly
- Submit ordered steps via the console
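A hedged boto3 sketch of a transient cluster that runs one Spark step and auto-terminates (names, roles, and paths are hypothetical):

```python
import boto3

emr = boto3.client("emr")
emr.run_job_flow(
    Name="nightly-etl",
    ReleaseLabel="emr-6.10.0",
    Applications=[{"Name": "Spark"}],   # frameworks fixed at launch
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": False,   # transient: auto-terminate
    },
    Steps=[{
        "Name": "spark-etl",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://my-bucket/jobs/etl.py"],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
```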
EMR and AWS Integration
- EC2 for the instances that comprise the nodes in the cluster
- Amazon VPC to configure the virtual network
- S3 as input and output data store
- CloudWatch to monitor cluster performance and configure alarms
- IAM for permissions
- CloudTrail to audit requests made to the service
- AWS Data Pipeline to schedule and start your cluster
EMR Storage
- HDFS
- Multiple copies stored across cluster instances for redundancy
- Files stored as blocks (128MB default size)
- Ephemeral : HDFS data is lost when cluster is terminated
- Useful for caching intermediate results of iterative workloads
- EMRFS : access S3 as if it were HDFS
- Allows persistent storage after cluster termination
- EMRFS Consistent View
- Uses DynamoDB to track consistency
- May need to provision read/write capacity on the DynamoDB table
- Local File System
- Suitable only for temporary data (buffers, caches, etc)
- EBS for HDFS
- Allows use of EMR on EBS-only types (M4, C4)
- Deleted when cluster is terminated
- EBS volumes can only be attached when launching a cluster
- If you manually detach an EBS volume, EMR treats that as a failure and replaces it
EMR Promises
- Charges by the hour, plus EC2 charges
- Provisions new nodes if a core node fails
- Can add and remove task nodes on the fly
- Can resize a running cluster’s core nodes
- Core nodes can also be added or removed
EMR Managed Scaling
- EMR Automatic Scaling
- Custom scaling rules based on CloudWatch metrics
- Supports instance groups only
- EMR Managed Scaling
- Supports instance groups and instance fleets
- Scales Spot, on-demand, and Savings Plan instances within the same cluster
- Available for Spark, Hive, YARN workloads
- Scale-up Strategy
- First adds core nodes, then task nodes, up to the max units specified
- Scale-down Strategy
- First removes task nodes, then core nodes, no lower than the minimum constraints
- Spot nodes always removed before on-demand instances
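A hedged sketch of attaching a managed scaling policy to a running cluster with boto3 (the cluster ID is a placeholder); EMR then scales between these limits using the strategies above:

```python
import boto3

emr = boto3.client("emr")
emr.put_managed_scaling_policy(
    ClusterId="j-XXXXXXXXXXXXX",        # placeholder cluster ID
    ManagedScalingPolicy={
        "ComputeLimits": {
            "UnitType": "Instances",
            "MinimumCapacityUnits": 2,
            "MaximumCapacityUnits": 10,
            "MaximumOnDemandCapacityUnits": 4,   # the rest can be Spot
            "MaximumCoreCapacityUnits": 3,
        }
    },
)
```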
EMR Serverless Pre-Initialized Capacity
- Spark adds 10% overhead to memory requested for drivers and executors
- Be sure initial capacity is at least 10% more than requested by the job
EMR Serverless Security
- EMRFS
- S3 encryption at rest
- TLS encryption in transit
- Local disk encryption
- Spark communication between drivers and executors is encrypted
Hadoop
- MapReduce
- Framework for distributed data processing
- Maps data to key / value pairs
- Reduces intermediate results to final output (toy example after this card)
- Yet Another Resource Negotiator (YARN)
- Manages cluster resources for multiple data processing frameworks
- HDFS
- Distributes data block across cluster in a redundant manner
- Ephemeral in EMR
- Data lost on termination
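The toy example promised above: the MapReduce model in plain Python, with no Hadoop involved, just to show the map / shuffle / reduce phases:

```python
from collections import defaultdict

docs = ["big data on hadoop", "hadoop maps data"]

# Map phase: emit (key, value) pairs
mapped = [(word, 1) for doc in docs for word in doc.split()]

# Shuffle phase: group values by key
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce phase: collapse each group to a final output
counts = {key: sum(values) for key, values in groups.items()}
print(counts)   # {'big': 1, 'data': 2, 'on': 1, 'hadoop': 2, 'maps': 1}
```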
Apache Spark
- Distributed processing framework for big data
- In-memory caching, optimized query execution
- Supports Java, R, Python, etc
- Supports code reuse across
- Batch Processing
- Interactive Queries
- Real-time Analytics
- Machine Learning
- Graph Processing
- Spark Streaming
- Integrated with Kinesis, Kafka, on EMR
- Spark is NOT meant for OLTP
How Spark Works
- The SparkContext (driver program) coordinates Spark applications
- SparkContext works through a Cluster Manager
- Executors run computations and store data
- SparkContext sends application code and tasks to executors
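A tiny PySpark illustration: the driver’s SparkContext splits the job into tasks, executors run them in parallel, and results come back to the driver:

```python
from pyspark import SparkContext

sc = SparkContext(appName="word-lengths")
rdd = sc.parallelize(["spark", "runs", "tasks", "on", "executors"])
lengths = rdd.map(len).collect()   # map runs on executors; driver collects
print(lengths)                     # [5, 4, 5, 2, 9]
sc.stop()
```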
Spark Components
- Spark Streaming
- Real-time streaming analytics and structured streaming
- SparkSQL
- Up to 100x faster than MapReduce
- Supports JDBC, ODBC, JSON, HDFS, ORC, Parquet, HiveQL
- MLLib
- Classification, Regression, Clustering, Collaborative filtering, pattern mining (Read from HDFS, HBase)
- GraphX
- Graph Processing, ETL, analysis, iterative graph computation
- No longer widely used
- Spark Core
- Memory management, fault recovery, scheduling, distributing and monitoring jobs, interacting with storage
- Scala, Python, Java, R
Spark Structured Streaming
- Data stream as an unbounded input Table
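A minimal runnable sketch of that model using Spark’s built-in rate source: rows keep appending to the unbounded input table, and the aggregate is recomputed as they arrive:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("unbounded-table").getOrCreate()

# The rate source emits (timestamp, value) rows continuously
stream = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

counts = stream.groupBy().count()        # aggregate over the growing table
query = (counts.writeStream
         .outputMode("complete")         # re-emit the full result each batch
         .format("console")
         .start())
query.awaitTermination(30)               # run ~30 s for the demo, then return
```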
Hive
- Allows users to run SQL-like queries on EMR
- Scalable
- Easy OLAP queries
- Highly optimized
- Highly extensible
Hive Metastore
- Hive maintains a “metastore” that imparts a structure you define on the unstructured data that is stored on HDFS
External Hive Metastores
- Metastore is stored in MySQL on the master node by default
- External metastores offer better resiliency and integration
- AWS Glue Data Catalog
- Shares schema across EMR and other AWS services
- Tie Glue to EMR using the console, CLI, or API
- Amazon RDS / Aurora
- Need to override default Hive configuration values for external database location
Apache Pig
- Pig introduces a scripting language that lets you use SQL-like syntax to define your map and reduce steps
- Highly extensible with user-defined functions (UDFs)
Apache Pig AWS Integration
- Use multiple file systems
- HDFS, S3, etc
- Load JARs and scripts from S3
HBase
- Non-relational, petabyte-scale database
- Based on Google’s BigTable, built on top of HDFS
- In-memory
- Hive Integration
HBase and AWS Integration
- Can store data (StoreFiles and metadata) on S3 via EMRFS
- Can back up to S3
Presto
- It can connect to many different “big data” databases and data stores at once and query across them
- Interactive queries at petabyte scale
- Familiar SQL syntax
- Optimized for OLAP
- Developed and partially maintained by Facebook
- This is what Amazon Athena uses under the hood
- Exposes JDBC, Command-Line and Tableau interfaces
Apache Zeppelin
- Notebook environment, similar to iPython notebooks
- Can share notebooks with others on your cluster
- Spark, Python, JDBC, HBase, ElasticSearch
Apache Zeppelin + Spark
- Run Spark code interactively
- Speeds up your development cycle
- Allows easy experimentation and exploration of the data
- Can execute SQL queries directly against SparkSQL
- Query results may be visualized in charts and graphs
- Makes Spark feel more like a data science tool
HBase Vs DynamoDB
- DynamoDB
- Fully managed
- More Integration with other AWS services
- Glue Integration
- HBase
- Efficient storage of sparse data
- Appropriate for high frequency counters (consistent reads and writes)
- High write and update throughput
- More integration with Hadoop
EMR Notebook
- Similar to Zeppelin with more AWS integrations
- Notebooks backed up to S3
- Provision clusters from the notebook
- Hosted inside a VPC
- Accessed only via AWS console
Hue
- Hadoop User Experience
- Graphical front-end for applications on your EMR cluster
- IAM Integration : Hue Super-users inherit IAM roles
Splunk
- Splunk makes machine data accessible, usable, and valuable to everyone
- Operational tool : can be used to visualize EMR and S3 data using your EMR Hadoop cluster
- Reserved instances on a 64-bit OS recommended
Flume
- Another way to stream data into your cluster
- Made from the start with Hadoop in mind
- Originally made to handle log aggregation
MXNet
- Like TensorFlow, a library for building and accelerating neural networks
- Included on EMR
S3DistCP
- Tool for copying large amounts of data
- from S3 into HDFS or from HDFS to S3
- Uses MapReduce to copy in a distributed manner
- Suitable for parallel copying of large numbers of objects
- across buckets, across accounts
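A hedged sketch of submitting S3DistCp as a step on an existing cluster via command-runner.jar (cluster ID and paths are placeholders):

```python
import boto3

emr = boto3.client("emr")
emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",   # placeholder cluster ID
    Steps=[{
        "Name": "copy-s3-to-hdfs",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["s3-dist-cp",
                     "--src=s3://my-bucket/input/",
                     "--dest=hdfs:///input/"],
        },
    }],
)
```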
EMR Security
- IAM
- Kerberos
- Secure user authentication
- SSH
- Secure connection to command line
- Tunneling for web interfaces
- Block Public Access
- Easy way to prevent public access to data stored on your EMR cluster
- Can set at the account level before creating the cluster
Kibana + ElasticSearch
- Kibana is an open-source data visualization and exploration tool used for log and time-series analytics, application monitoring, and operational intelligence use cases