Processing Flashcards

1
Q

Lambda

A
  • Run code snippets in the cloud
  • Serverless
  • Continuous scaling
  • Supports multiple programming languages
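For illustration, a minimal Python handler of the shape Lambda invokes for every event (the body here is a hypothetical placeholder):

    # Sketch of a minimal Lambda handler (Python runtime).
    # The event payload depends on the trigger; this one just echoes it.
    import json

    def lambda_handler(event, context):
        print("Received event:", json.dumps(event))
        return {"statusCode": 200, "body": "ok"}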
2
Q

Lambda Use Cases

A
  • Real-time file processing
  • Real-time stream processing
  • ETL
  • Cron replacement
3
Q

Lambda Supported Languages

A
  • Node.js
  • Python
  • Java
  • Go
  • Ruby
4
Q

Lambda Triggers

A
  • S3
  • Kinesis Data Streams (KDS)
  • SNS
  • SQS
  • AWS CloudWatch
  • AWS CloudFormation
5
Q

Lambda Redshift

A
  • Best practice for loading data into Redshift is the COPY command
  • Use DynamoDB to keep track of what has been loaded
  • Lambda can batch up new data and load it with COPY (see the sketch below)
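A rough sketch of the pattern in Python, assuming an S3-triggered function, a hypothetical DynamoDB tracking table named loaded_files, and placeholder Redshift connection details and IAM role:

    # Sketch: load new S3 files into Redshift with COPY, using
    # DynamoDB to remember which files were already loaded.
    # All names/credentials below are placeholders.
    import boto3
    import psycopg2  # bundled in the deployment package

    tracking = boto3.resource("dynamodb").Table("loaded_files")

    def lambda_handler(event, context):
        key = event["Records"][0]["s3"]["object"]["key"]
        if "Item" in tracking.get_item(Key={"s3_key": key}):
            return  # already loaded, skip
        conn = psycopg2.connect(host="my-cluster.example.redshift.amazonaws.com",
                                port=5439, dbname="dev",
                                user="awsuser", password="...")
        with conn, conn.cursor() as cur:
            cur.execute(f"""
                COPY my_table FROM 's3://my-bucket/{key}'
                IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
                FORMAT AS JSON 'auto';
            """)
        tracking.put_item(Item={"s3_key": key})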
6
Q

Lambda + Kinesis

A
  • Lambda receives an event with a batch of stream records
    • Specify a batch size (up to 10,000 records)
    • Batches may be split if they exceed Lambda’s payload limit (6 MB)
  • Lambda will retry the batch until it succeeds or the data expires
    • This can stall the shard if you do not handle errors properly
    • Use more shards to ensure processing isn’t totally held up by errors
  • Lambda processes shard data synchronously
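A sketch of a handler for a Kinesis event source; record data arrives base64-encoded, and an uncaught exception forces a retry of the whole batch (process() is a hypothetical stand-in for your logic):

    # Sketch: process a batch of Kinesis records in Lambda.
    import base64

    def process(payload: bytes) -> None:
        print(payload[:80])  # hypothetical business logic

    def lambda_handler(event, context):
        for record in event["Records"]:
            payload = base64.b64decode(record["kinesis"]["data"])
            try:
                process(payload)
            except ValueError:
                # Skip bad records rather than raising; a raised error
                # retries the whole batch and can stall the shard.
                print("Skipping record", record["kinesis"]["sequenceNumber"])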
7
Q

Lambda Cost Model

A
  • Generous free tier (1M requests/month, 400K GB-seconds of compute time)
  • $0.20 per million requests thereafter
  • $0.00001667 per GB-second of compute time thereafter
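Using the card’s rates, a back-of-the-envelope calculation for a made-up workload:

    # Sketch: 10M invocations/month of a 512 MB function averaging 1 s each.
    requests = 10_000_000
    gb_seconds = requests * 1.0 * (512 / 1024)                 # 5,000,000 GB-s
    request_cost = (requests - 1_000_000) / 1_000_000 * 0.20   # free tier first
    compute_cost = (gb_seconds - 400_000) * 0.00001667
    print(round(request_cost + compute_cost, 2))               # ~78.48 USD/month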
8
Q

Lambda Promises

A
  • High availability
    • No scheduled downtime
    • Retries failed code 3 times
  • Unlimited scalability
    • Safety throttle of 1,000 concurrent executions per region
  • High Performance
    • New functions callable in seconds
    • Code is cached automatically
    • Max Timeout is 15 mins
9
Q

Lambda Anti-Patterns

A
  • Long-running applications
    • Use EC2 instead or chain functions
  • Dynamic Websites
  • Stateful applications
10
Q

AWS Glue

A
  • Discovery of table schema
    • S3
    • RDS
    • Redshift
    • DynamoDB
  • Fully managed
  • Serverless
  • Event-triggered or runs on a schedule
  • Use Cases
    • Data Discovery Schema
    • Data Catalog
    • Data Transformation (AWS Glue Studio)
    • Data Replication (AWS Glue Elastic Views)
    • Data Preparation (AWS Glue DataBrew)
11
Q

Glue Crawler / Data Catalog

A
  • Glue crawler scans data in S3, creates schema
  • Run periodically
  • Populates the Glue Data Catalog
    • Stores only table definition
    • Original data stays in S3
  • Once cataloged, you can treat your unstructured data like it is structured
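A sketch of setting this up with boto3 (crawler name, role, database, path, and schedule are placeholders):

    # Sketch: create a Glue crawler over an S3 path and run it.
    # The crawler infers a schema and writes a table definition to
    # the Glue Data Catalog; the data itself stays in S3.
    import boto3

    glue = boto3.client("glue")
    glue.create_crawler(
        Name="my-crawler",
        Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
        DatabaseName="my_catalog_db",
        Targets={"S3Targets": [{"Path": "s3://my-bucket/raw/"}]},
        Schedule="cron(0 * * * ? *)",  # run hourly
    )
    glue.start_crawler(Name="my-crawler")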
12
Q

Glue and S3 Partition

A
  • Glue crawler will extract partitions based on how your S3 data is organized
  • Organize keys to match your query patterns, e.g.
    • yyyy/mm/dd/device_id
    • device_id/yyyy/mm/dd
13
Q

Glue and Hive

A
  • Hive lets you run SQL-like queries from EMR
  • The Glue Data Catalog can serve as a Hive “metastore”
  • We can import a Hive metastore into Glue
14
Q

Glue ETL

A
  • Transform, clean, and enrich data before analysis
    • Glue can generate ETL code (which we can modify)
    • We can provide our own Spark or PySpark scripts
    • Target can be S3, JDBC, or the Glue Data Catalog
  • Fully managed
  • Scala or Python
  • Can be event-driven or scheduled
  • Can provision additional DPUs to increase the performance of underlying Spark jobs
  • Batch-oriented, and you can schedule your ETL jobs at a minimum of 5-minute intervals
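A skeletal PySpark Glue job, close to the code Glue generates (database, table, and output path are placeholders):

    # Sketch: read from the Data Catalog, write Parquet to S3.
    import sys
    from awsglue.utils import getResolvedOptions
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glueContext = GlueContext(SparkContext.getOrCreate())
    job = Job(glueContext)
    job.init(args["JOB_NAME"], args)

    dyf = glueContext.create_dynamic_frame.from_catalog(
        database="my_catalog_db", table_name="raw_events")
    glueContext.write_dynamic_frame.from_options(
        frame=dyf, connection_type="s3",
        connection_options={"path": "s3://my-bucket/processed/"},
        format="parquet")
    job.commit()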
15
Q

Glue DynamicFrame

A
  • DynamicFrame is a collection of DynamicRecords
    • DynamicRecords are self-describing and have a schema
    • Similar to Spark DataFrame
    • Scala and Python APIs
16
Q

Glue ETL Transformations

A
  • Bundled Transformations
    • DropFields, DropNullFields (remove null fields)
    • Filter, Join, Map
  • Machine Learning Transformation
    • FindMatches ML : Identify duplicate or matching records in your dataset
  • Format Conversions : CSV, JSON, Avro, Parquet, ORC, XML
  • Apache Spark transformation (example : KMeans)
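For example, the bundled transforms apply directly to a DynamicFrame (continuing the hypothetical dyf from the job sketch above; the column name is made up):

    # Sketch: chain bundled Glue transformations on a DynamicFrame.
    from awsglue.transforms import DropNullFields, Filter

    cleaned = DropNullFields.apply(frame=dyf)
    recent = Filter.apply(frame=cleaned,
                          f=lambda row: row["year"] >= 2020)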
17
Q

Glue ETL : Modifying the Data Catalog

A
  • ETL scripts can update your schema and partitions if necessary
  • Updating table schema (when there is a new partition)
    • Re-run the crawler, or
    • Use enableUpdateCatalog / updateBehavior from the script (sketch below)
  • Restrictions
    • S3 only
    • JSON, CSV, Avro, Parquet only
    • Parquet requires special code
    • Nested schemas are not supported
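A sketch of the script-side update, continuing the same hypothetical job (database, table, path, and partition keys are placeholders):

    # Sketch: write a DynamicFrame while registering new partitions
    # in the Data Catalog, instead of re-running the crawler.
    sink = glueContext.getSink(
        connection_type="s3", path="s3://my-bucket/processed/",
        enableUpdateCatalog=True, updateBehavior="UPDATE_IN_DATABASE",
        partitionKeys=["year", "month"])
    sink.setFormat("json")
    sink.setCatalogInfo(catalogDatabase="my_catalog_db",
                        catalogTableName="processed_events")
    sink.writeFrame(dyf)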
18
Q

Glue Development Endpoints

A
  • Develop ETL scripts using a Notebook
  • Endpoint is in a VPC controlled by security groups; connect via
    • Apache Zeppelin notebook, SageMaker notebook, or a terminal window
19
Q

Running Glue Jobs

A
  • Time-based schedules (cron-style)
  • Job bookmarks
    • Persists state from the job run
    • Prevents reprocessing of old data
    • Allows you to process new data only when re-running on a schedule
    • Works with S3 sources in a variety of formats
    • Works with relational databases via JDBC
  • CloudWatch Events
    • Fire off a Lambda function or SNS notification when ETL succeeds or fails
    • Invoke EC2 run, send event to Kinesis, activate a Step Function
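Bookmarks are switched on per job through a job argument; a boto3 sketch with placeholder names, role, script location, and schedule:

    # Sketch: scheduled Glue job with bookmarks enabled, so re-runs
    # only process data not seen by previous runs.
    import boto3

    glue = boto3.client("glue")
    glue.create_job(
        Name="nightly-etl",
        Role="arn:aws:iam::123456789012:role/GlueJobRole",
        Command={"Name": "glueetl",
                 "ScriptLocation": "s3://my-bucket/scripts/etl.py"},
        DefaultArguments={"--job-bookmark-option": "job-bookmark-enable"})
    glue.create_trigger(
        Name="nightly-trigger", Type="SCHEDULED",
        Schedule="cron(0 3 * * ? *)",  # 03:00 UTC daily
        Actions=[{"JobName": "nightly-etl"}],
        StartOnCreation=True)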
20
Q

Glue Cost Model

A
  • Billed by the second for crawler and ETL jobs
  • First million objects stored and first million accesses are free for the Glue Data Catalog
  • Development endpoints for developing ETL code charged by the minute
21
Q

Glue Anti-Patterns

A
  • Workloads that require a non-Spark engine (Glue ETL runs on Apache Spark)
22
Q

AWS Glue Studio

A
  • Visual interface for ETL workflows
  • Visual Job Editor
    • Create DAGs for complex workflows
    • Sources include S3, Kinesis, Kafka, JDBC
    • Transform / sample / join data
    • Target to S3 or Glue Data Catalog
    • Supports partitioning
  • Visual Job Dashboard
23
Q

AWS Glue DataBrew

A
  • Visual data preparation tool
    • UI for preprocessing large data sets
    • Input from S3, data warehouse, or database
    • Output to S3
  • Over 250 ready-made transformations
  • Create “recipes” of transformations that can be saved as jobs within a larger project
  • Define data quality rules
  • Create datasets with custom SQL from Redshift and Snowflake
  • Security
    • Can integrate with KMS
    • SSL in transit
    • IAM
24
Q

AWS Glue Elastic Views

A
  • Builds materialized views from Aurora, RDS, DynamoDB
  • Those views can be used by Redshift, ElasticSearch, S3, DynamoDB, Aurora, RDS
  • SQL Interface
  • Handles copying, combining, and replicating data
  • Monitors for changes and continuously updates
  • Serverless
25
Q

AWS Lake Formation

A
  • “Makes it easy to set up a secure data lake in days”
  • Loading data and monitoring data flows
  • Setting up partitions
  • Encryption and managing keys
  • Defining transformation jobs and monitoring them
  • Built on top of Glue
  • Auditing
26
Q

AWS Lake Formation Pricing

A
  • No cost for Lake Formation itself
  • But underlying services incur charges
    • Glue
    • S3
    • EMR
    • Athena
    • Redshift
27
Q

AWS Lake Formation : The Finer Points

A
  • Cross-account Lake Formation permissions
  • Lake Formation does not support manifests in Athena or Redshift queries
  • IAM needed to create blueprints and workflows
28
Q

AWS Lake Formation : Governed Tables and Security

A
  • Now supports “Governed Tables” that support ACID transactions across multiple tables
  • Storage Optimization with Automatic Compaction
  • Granular Access Control with row and cell level security
29
Q

Elastic MapReduce

A
  • Managed Hadoop framework on EC2 instances
  • Includes Spark, HBase, Presto, Flink, Hive, etc.
  • EMR Notebooks
  • Several Integration points with AWS
30
Q

EMR Cluster

A
  • Master Node : Manages the cluster
    • Tracks status of tasks, monitors cluster health
    • Single EC2 instance
  • Core Node : Hosts HDFS data and runs tasks
    • Can be scaled up and down
    • Multi-node clusters have at least one
  • Task Node : Runs tasks and does not host data
    • Optional, and no risk of data loss when removed
    • Good use of spot instances
31
Q

EMR Usage

A
  • Transient vs Long-Running Clusters
    • Transient clusters terminate once all steps are completed
    • Long running clusters must be manually terminated
      • Basically a data warehouse with periodic processing on large datasets
      • Can spin up task nodes using Spot instances for temporary capacity
      • Can use reserved instances on long-running clusters to save money
      • Termination protection enabled by default
  • Frameworks and applications are specified at cluster launch
  • Connect directly to the master node to run jobs
  • Submit ordered steps via the console or API (see the sketch below)
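A boto3 sketch of a transient cluster; instance types, counts, and the step are placeholders, and KeepJobFlowAliveWhenNoSteps=False is what makes it terminate after the steps complete:

    # Sketch: transient EMR cluster that runs one Spark step, then dies.
    import boto3

    emr = boto3.client("emr")
    emr.run_job_flow(
        Name="transient-spark",
        ReleaseLabel="emr-6.10.0",
        Applications=[{"Name": "Spark"}],
        Instances={
            "MasterInstanceType": "m5.xlarge",
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": 3,
            "KeepJobFlowAliveWhenNoSteps": False,  # transient
        },
        Steps=[{
            "Name": "my-spark-job",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "s3://my-bucket/jobs/job.py"],
            },
        }],
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole")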
32
Q

EMR and AWS Integration

A
  • EC2 for the instances that comprise the nodes in the cluster
  • Amazon VPC to configure the virtual network
  • S3 as the source of input data and destination for output data
  • CloudWatch to monitor cluster performance and configure alarms
  • IAM for permissions
  • CloudTrail to audit requests made to the service
  • AWS Data Pipeline to schedule and start your cluster
33
Q

EMR Storage

A
  • HDFS
    • Multiple copies stored across cluster instances for redundancy
    • Files stored as blocks (128MB default size)
    • Ephemeral : HDFS data is lost when cluster is terminated
    • Useful for caching intermediate results, or for workloads with significant random I/O
  • EMRFS : access S3 as if it were HDFS
    • Allows persistent storage after cluster termination
    • EMRFS Consistent View
      • Uses DynamoDB to track consistency
      • May need to provision read/write capacity on the DynamoDB table
  • Local File System
    • Suitable only for temporary data (buffers, caches, etc)
  • EBS for HDFS
    • Allows use of EMR on EBS-only types (M4, C4)
    • Deleted when cluster is terminated
    • EBS volumes can only be attached when launching a cluster
    • If you manually detach an EBS volume, EMR treats that as a failure and replaces it
34
Q

EMR Promises

A
  • Charged by the hour, plus EC2 charges
  • Provisions new nodes if a core node fails
  • Can add and remove task nodes on the fly
  • Can resize a running cluster’s core nodes
  • Core nodes can also be added or removed
35
Q

EMR Managed Scaling

A
  • EMR Automatic Scaling
    • Custom scaling rules based on CloudWatch metrics
    • Supports instance groups only
  • EMR Managed Scaling
    • Supports instance groups and instance fleets
    • Scales spot, on-demand, and Savings Plan instances within the same cluster
    • Available for Spark, Hive, and YARN workloads
  • Scale-up Strategy
    • First adds core nodes, then task nodes, up to the maximum units specified
  • Scale-down Strategy
    • First removes task nodes, then core nodes, down to the minimum constraints
    • Spot nodes always removed before on-demand instances
36
Q

EMR Serverless Pre-Initialized Capacity

A
  • Spark adds 10% overhead to memory requested for drivers and executors
  • Be sure initial capacity is at least 10% more than requested by the job
37
Q

EMR Serverless Security

A
  • EMRFS
    • S3 encryption at rest
    • TLS encryption in transit
  • Local disk encryption
  • Spark communication between drivers and executors is encrypted
38
Q

Hadoop

A
  • MapReduce
    • Framework for distributed data processing
    • Maps data to key / value pairs
    • Reduces intermediate results to final output
  • Yet Another Resource Negotiator (YARN)
    • Manages cluster resources for multiple data processing frameworks
  • HDFS
    • Distributes data blocks across the cluster in a redundant manner
    • Ephemeral in EMR
    • Data lost on termination
39
Q

Apache Spark

A
  • Distributed processing framework for big data
  • In-memory caching, optimized query execution
  • Supports Java, Scala, Python, R, etc.
  • Supports code reuse across
    • Batch Processing
    • Interactive Queries
    • Real-time Analytics
    • Machine Learning
    • Graph Processing
  • Spark Streaming
    • Integrates with Kinesis and Kafka on EMR
  • Spark is NOT meant for OLTP
40
Q

How Spark Works

A
  • The SparkContext (driver program) coordinates Spark applications
  • SparkContext works through a Cluster Manager
  • Executors run computations and store data
  • SparkContext sends application code and tasks to executors
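A tiny PySpark program to make the roles concrete: the code below runs in the driver, while the mapped lambda executes on the executors (a sketch; any local or EMR Spark install would do):

    # Sketch: driver builds the plan; executors do the work.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("demo").getOrCreate()
    rdd = spark.sparkContext.parallelize(range(1_000_000))
    total = rdd.map(lambda x: x * 2).sum()  # runs on the executors
    print(total)
    spark.stop()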
41
Q

Spark Components

A
  • Spark Streaming
    • Real-time streaming analytics and structured streaming
  • SparkSQL
    • Up to 100x faster than MapReduce; supports JDBC, ODBC, JSON, HDFS, ORC, Parquet, and HiveQL
  • MLLib
    • Classification, regression, clustering, collaborative filtering, pattern mining (reads from HDFS, HBase)
  • GraphX
    • Graph processing, ETL, analysis, iterative graph computation
    • No longer widely used
  • Spark Core
    • Memory management, fault recovery, scheduling; distributes and monitors jobs; interacts with storage
    • Scala, Python, Java, R
42
Q

Spark Structured Streaming

A
  • Models a data stream as an unbounded input table; new records are appended as new rows
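A minimal sketch with PySpark’s built-in rate source (which just generates test rows); each micro-batch appends new rows to the conceptually unbounded table:

    # Sketch: aggregate over a stream treated as an unbounded table.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import window

    spark = SparkSession.builder.appName("stream-demo").getOrCreate()
    stream = spark.readStream.format("rate").option("rowsPerSecond", 10).load()
    counts = stream.groupBy(window(stream.timestamp, "1 minute")).count()
    query = counts.writeStream.outputMode("complete").format("console").start()
    query.awaitTermination()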
43
Q

Hive

A
  • Lets users run SQL-like (HiveQL) queries on EMR
  • Scalable
  • Easy OLAP queries
  • Highly optimized
  • Highly extensible
44
Q

Hive Metastore

A
  • Hive maintains a “metastore” that imparts a structure you define on the unstructured data that is stored on HDFS
45
Q

External Hive Metastores

A
  • Metastore is stored in MySQL on the master node by default
  • External metastores offer better resiliency and integration
    • AWS Glue Data Catalog
      • Shares schema across EMR and other AWS services
      • Tie Glue to EMR using the console, CLI, or API (configuration sketch below)
    • Amazon RDS / Aurora
      • Need to override default Hive configuration values for the external database location
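The Glue tie-in is a hive-site configuration classification supplied at cluster launch; the factory class below is the documented value, the surrounding Python is a sketch:

    # Sketch: point Hive on EMR at the Glue Data Catalog metastore.
    configurations = [{
        "Classification": "hive-site",
        "Properties": {
            "hive.metastore.client.factory.class":
                "com.amazonaws.glue.catalog.metastore."
                "AWSGlueDataCatalogHiveClientFactory",
        },
    }]
    # pass as Configurations=configurations to emr.run_job_flow(...)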
46
Q

Apache Pig

A
  • Pig introduces a scripting language that lets you use SQL-like syntax to define your map and reduce steps
  • Highly extensible with user-defined functions (UDFs)
47
Q

Apache Pig AWS Integration

A
  • Use multiple file systems
    • HDFS, S3, etc
  • Load JARs and scripts from S3
48
Q

HBase

A
  • Non-relational, petabyte-scale database
  • Based on Google’s BigTable, built on top of HDFS
  • In-memory
  • Hive Integration
49
Q

HBase and AWS Integration

A
  • Can store data (StoreFiles and metadata) on S3 via EMRFS
  • Can back up to S3
50
Q

Presto

A
  • It can connect to many different “big data” databases and data stores at once and query across them
  • Interactive queries at petabyte scale
  • Familiar SQL syntax
  • Optimized for OLAP
  • Developed and partially maintained by Facebook
  • This is what Amazon Athena uses under the hood
  • Exposes JDBC, Command-Line and Tableau interfaces
51
Q

Apache Zeppelin

A
  • iPython Notebook-style environment
    • Can share notebooks with others on your cluster
  • Spark, Python, JDBC, HBase, ElasticSearch
52
Q

Apache Zeppelin + Spark

A
  • Run Spark code interactively
    • Speeds up your development cycle
    • Allows easy experimentation and exploration of the data
  • Can execute SQL queries directly against SparkSQL
  • Query results may be visualized in charts and graphs
  • Makes Spark feel more like a data science tool
53
Q

HBase Vs DynamoDB

A
  • DynamoDB
    • Fully managed
    • More Integration with other AWS services
    • Glue Integration
  • HBase
    • Efficient storage of sparse data
    • Appropriate for high frequency counters (consistent reads and writes)
    • High write and update throughput
    • More integration with Hadoop
54
Q

EMR Notebook

A
  • Similar to Zeppelin with more AWS integrations
  • Notebooks backed up to S3
  • Provision clusters from the notebook
  • Hosted inside a VPC
  • Accessed only via AWS console
55
Q

Hue

A
  • Hadoop User Experience
  • Graphical front-end for applications on your EMR cluster
  • IAM Integration : Hue Super-users inherit IAM roles
56
Q

Splunk

A
  • Splunk makes machine data accessible, usable, and valuable to everyone
  • Operational tool : can be used to visualize EMR and S3 data using your EMR Hadoop cluster
  • Reserved instances on a 64-bit OS are recommended
57
Q

Flume

A
  • Another way to stream data into your cluster
  • Made from the start with Hadoop in mind
  • Originally made to handle log aggregation
58
Q

MXNet

A
  • Like TensorFlow, a library for building and accelerating neural networks
  • Included on EMR
59
Q

S3DistCP

A
  • Tool for copying large amounts of data
    • from S3 into HDFS or from HDFS to S3
  • Uses MapReduce to copy in a distributed manner
  • Suitable for parallel copying of large numbers of objects
    • across buckets, across accounts
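S3DistCp runs as an EMR step; a boto3 sketch with a placeholder cluster ID and paths:

    # Sketch: add an S3DistCp step to copy S3 data into HDFS in parallel.
    import boto3

    emr = boto3.client("emr")
    emr.add_job_flow_steps(
        JobFlowId="j-XXXXXXXXXXXXX",
        Steps=[{
            "Name": "copy-to-hdfs",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["s3-dist-cp",
                         "--src", "s3://my-bucket/raw/",
                         "--dest", "hdfs:///data/raw/"],
            },
        }])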
60
Q

EMR Security

A
  • IAM
  • Kerberos
    • Secure user authentication
  • SSH
    • Secure connection to command line
    • Tunneling for web interfaces
  • Block Public Access
    • Easy way to prevent public access to data stored on your EMR cluster
    • Can set at the account level before creating the cluster
61
Q

Kibana + ElasticSearch

A
  • Kibana is an open-source data visualization and exploration tool used for log and time-series analytics, application monitoring, and operational intelligence use cases.