Analytics Services Flashcards
#Glue What are Glue Worker Types?
AWS Glue comes with 3 worker types
Standard - 4 vCPU, 16 GB RAM, 50 GB disk, 2 Spark executors ⇒ 1 DPU
G.1X - 4 vCPU, 16 GB RAM, 64 GB disk, 1 Spark executor ⇒ 1 DPU
G.2X - 8 vCPU, 32 GB RAM, 128 GB disk, 1 Spark executor ⇒ 2 DPU
1 DPU can run 8 Spark executors
G.1X for jobs that are memory intensive
G.2X for jobs that use AWS Glue ML workloads such as ML Transforms
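A minimal boto3 sketch of picking a worker type at job creation (job name, role, and script location are hypothetical):

```python
import boto3  # assumes AWS credentials are configured

glue = boto3.client("glue")

glue.create_job(
    Name="example-etl-job",                  # hypothetical job name
    Role="MyGlueServiceRole",                # hypothetical IAM role
    Command={"Name": "glueetl",
             "ScriptLocation": "s3://my-bucket/scripts/etl.py"},
    GlueVersion="2.0",
    WorkerType="G.1X",                       # memory-intensive job
    NumberOfWorkers=10,                      # 10 G.1X workers => 10 DPUs
)
```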
#EMR What are the three different node types in an EMR cluster?
- Master or Leader node
  - Manages the cluster
  - Single EC2 instance
- Core nodes: host HDFS data and run tasks
- Task nodes: run tasks without hosting data
  - No risk of data loss when removing them
  - Good use of Spot Instances
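A hedged boto3 sketch of the three node types (instance types and counts are illustrative; Spot is used for task nodes only, so no HDFS data is at risk):

```python
import boto3

emr = boto3.client("emr")

emr.run_job_flow(
    Name="example-cluster",                  # hypothetical cluster name
    ReleaseLabel="emr-5.30.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge",
             "InstanceCount": 1, "Market": "ON_DEMAND"},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge",
             "InstanceCount": 2, "Market": "ON_DEMAND"},  # hold HDFS data
            {"InstanceRole": "TASK", "InstanceType": "m5.xlarge",
             "InstanceCount": 4, "Market": "SPOT"},       # safe to lose
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
```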
#EMR What is EMRFS Consistent View?
EMRFS Consistent View is an optional feature that allows an EMR cluster to check for list consistency and read-after-write consistency for S3 objects written by or synced with EMRFS.
Why? S3’s eventual consistency
#EMR How does EMRFS Consistent View work?
EMRFS Consistent View uses an Amazon DynamoDB table to store object metadata and track consistency with S3 (the EMRFS Metadata Store)
- The DynamoDB table defaults to 400 read capacity units and 100 write capacity units - you need to configure it according to the # of objects being tracked
- You can configure the # of retries and the retry period (in seconds) that consistent view will use
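A sketch of the emrfs-site classification you would pass in the Configurations parameter at cluster creation (capacity and retry values are illustrative):

```python
# Pass this list as the Configurations parameter of run_job_flow.
emrfs_consistent_view = [
    {
        "Classification": "emrfs-site",
        "Properties": {
            "fs.s3.consistent": "true",
            "fs.s3.consistent.retryCount": "5",
            "fs.s3.consistent.retryPeriodSeconds": "10",
            # Raise these if you track many objects:
            "fs.s3.consistent.metadata.read.capacity": "600",
            "fs.s3.consistent.metadata.write.capacity": "300",
        },
    }
]
```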
#EMR What are the storage options for an EMR cluster?
Can you add or detach EBS volumes on a running EMR cluster?
- HDFS - distributed local storage
  - Ephemeral
  - Default block size of 128 MB
- EMRFS: access S3 as if it were HDFS
  - EMRFS Consistent View - optional for S3 consistency
    - Uses DynamoDB to track consistency - pay attention to capacity limits
    - Why? S3's eventual consistency
- Local file system
- EBS for HDFS
  - Will be deleted if the cluster is terminated
  - You cannot add or detach EBS volumes on a running cluster, only at creation time
#EMR What does the Spark stack consist of?
- Spark SQL (structured data); Spark Streaming (real-time); MLlib (machine learning); GraphX (graph processing)
- Spark Core
- Standalone Scheduler; YARN; Mesos (cluster managers)
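A tiny PySpark sketch of the top of the stack - Spark SQL running on Spark Core (under YARN on EMR):

```python
from pyspark.sql import SparkSession

# Spark SQL on top of Spark Core; EMR schedules this via YARN.
spark = SparkSession.builder.appName("spark-stack-demo").getOrCreate()

df = spark.createDataFrame([("a", 1), ("b", 2), ("a", 3)], ["key", "value"])
df.createOrReplaceTempView("kv")
spark.sql("SELECT key, SUM(value) AS total FROM kv GROUP BY key").show()
```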
#EMR What is Hive?
Hive - data warehouse and analytics infrastructure built on top of Hadoop
- Hive uses a SQL-like language called HiveQL
#EMR What is Tez?
Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop.
Both Tez and MapReduce are execution engines for Hive.
#EMR What is Presto?
Presto is an open-source, in-memory, distributed, fast SQL query engine designed for interactive queries against petabytes of data from different sources.
- Run interactive analytics queries against a variety of data sources ranging from GB to PB in size, and query across them at once
- Interactive queries at petabyte scale - faster than Hive
- Optimized for OLAP - analytics queries, data warehousing
- This is what Athena uses under the hood
- Exposes JDBC, CLI, and Tableau interfaces
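For instance, one way (among others) to query Presto on an EMR master node from Python is the PyHive client; host and table names here are hypothetical, and 8889 is Presto's default port on EMR:

```python
from pyhive import presto  # pip install 'pyhive[presto]'

conn = presto.connect(host="ec2-master-node.example.com", port=8889)
cur = conn.cursor()
cur.execute("SELECT COUNT(*) FROM hive.default.my_table")  # hypothetical table
print(cur.fetchall())
```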
#EMR What are EMR Notebooks?
EMR Notebooks are similar to Zeppelin but with more AWS integration.
- Notebooks are backed up to S3
- Provision clusters from the notebook
- Hosted inside a VPC
- Accessed ONLY via the AWS Console
#EMR What is HUE?
HUE stands for Hadoop User Experience. It is an open-source web interface for Apache Hadoop and other non-Hadoop applications running on EMR.
- You can browse S3 / HDFS / HBase / ZooKeeper
- You can move data between S3 and HDFS
#EMR What is Flume?
Flume is another way to stream data (e.g. log data) into your EMR cluster.
- Built-in sinks for HDFS and HBase
- Source ==> Channels ==> Sink ==> Target
#EMR What is S3DistCp?
S3DistCp is a tool for copying large amounts of data between S3 and HDFS.
- Uses MapReduce to copy in a distributed manner
- Suitable for parallel copying of large numbers of objects across buckets and accounts
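A hedged boto3 sketch of running S3DistCp as an EMR step (cluster ID and paths are hypothetical; s3-dist-cp ships with EMR):

```python
import boto3

emr = boto3.client("emr")

emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",             # hypothetical cluster ID
    Steps=[{
        "Name": "copy-s3-to-hdfs",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["s3-dist-cp",
                     "--src", "s3://my-bucket/input/",   # hypothetical paths
                     "--dest", "hdfs:///input/"],
        },
    }],
)
```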
#EMR What are the Hive integrations with AWS?
- S3: load data / script / partition information
- DynamoDB as an external table. Hive can process and join data stored in DynamoDB
#Quicksight What is a KPI Chart?
KPI Charts use a key performance indicator (KPI) to visualize a comparison between a key value and its target value.
#DynamoDB What is WCU and RCU for DynamoDB?
1 WCU = 1 write/s of an item up to 1 KB
1 RCU = 2 eventually consistent reads/s of up to 4 KB each; or 1 strongly consistent read/s of up to 4 KB
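A quick worked example of the arithmetic (item size and request rates are made up):

```python
import math

item_size_kb = 6          # hypothetical item size
reads_per_second = 100
writes_per_second = 50

# Reads are billed in 4 KB units; writes in 1 KB units.
rcu_strong = reads_per_second * math.ceil(item_size_kb / 4)    # 100 * 2 = 200 RCU
rcu_eventual = rcu_strong / 2                                  # 100 RCU
wcu = writes_per_second * math.ceil(item_size_kb / 1)          # 50 * 6 = 300 WCU
```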
#S3 What is Glacier Select?
Glacier Select allows you to query Glacier data with simple SQL queries and get results in minutes, without needing to restore the data to S3.
#IoT What are types of identity principals for device or client authentication supported by AWS IoT?
- X.509 cert
- IAM users, groups, and roles
- Amazon Cognito identities
I think you can also implement Federated Identity via Cognito, since Cognito identity pools can federate external identity providers.
#Redshift What is Redshift's Elastic Resize?
Redshift’s Elastic Resize allows you to add / remove nodes and also change node types.
However, Elastic Resize only holds connections open if you change just the number of nodes, not the node type.
If you want to minimize the downtime involved, you might still use the snapshot / restore / resize approach with Classic Resize.
https://aws.amazon.com/blogs/big-data/scale-your-amazon-redshift-clusters-up-and-down-in-minutes-to-get-the-performance-you-need-when-you-need-it/
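A minimal boto3 sketch of an elastic resize (cluster name and node count are hypothetical; Classic=True would force a classic resize instead):

```python
import boto3

redshift = boto3.client("redshift")

# Elastic resize: change only the node count on a hypothetical cluster.
redshift.resize_cluster(
    ClusterIdentifier="my-cluster",
    NumberOfNodes=4,
    Classic=False,
)
```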
#EMR Which compression algorithms are splittable?
bzip2 and LZO are splittable - great for parallel processing
GZIP and Snappy are NOT splittable
#EMR What is HBase Read-replica in S3?
Amazon EMR version 5.7.0+ allows you to maintain read-only copies of data in Amazon S3.
You can access data from the read-replica cluster to perform read operations simultaneously with the primary cluster, and in the event that the primary cluster becomes unavailable.
#EMR What are the 3 ways that HBase can integrate with S3?
- HBase on S3 (S3 storage mode) - storage of HBase StoreFiles and metadata in S3
- Snapshots of HBase to S3 - for HBase clusters that do not use S3 storage mode
- HBase read-replicas in S3
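A sketch of the configuration classifications that enable HBase on S3, passed at cluster creation (bucket name is hypothetical):

```python
# Pass via the Configurations parameter of run_job_flow.
hbase_on_s3 = [
    {"Classification": "hbase",
     "Properties": {"hbase.emr.storageMode": "s3"}},
    {"Classification": "hbase-site",
     "Properties": {"hbase.rootdir": "s3://my-bucket/hbase"}},  # hypothetical bucket
]
```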
#EMR What is Ganglia?
Ganglia is a scalable, distributed system designed to monitor clusters and grids while minimizing the impact on their performance.
Ganglia is installed on the master node and is the operational dashboard provided with EMR.
#QuickSight What is a heat map chart used for?
A heat map is used to highlight outliers and trends using color.
#QuickSight What is a scatter plot used for? What is a tree map used for?
A scatter plot is used for visualizing two or three measures for a dimension.
A tree map is used for visualizing one or two measures for a dimension.
#EMR What are the advantages of HBase on S3?
- The HBase root directory is stored in Amazon S3, including HBase store files and table metadata
- Data persists outside the cluster and is available across AZs; no snapshot / recovery is needed
- Separates compute and storage, so you can scale the EMR cluster to your compute requirements
- You can have a read-replica cluster and perform read operations concurrently, even when the primary cluster is not available
#EMR What is HCatalog?
HCatalog is a tool that allows you to access Hive metastore tables within Pig, Spark SQL, and/or custom MapReduce applications.
#Redshift Does Redshift support multi-AZ deployment?
No. Currently, Amazon Redshift only supports Single-AZ deployments.
- You can create Redshift clusters in each AZ and stream / COPY the same data into them; OR
- You can use Redshift Spectrum to query the same S3 data from Redshift clusters in different AZs
#QuickSight What is a Pivot Table used for?
You can use a pivot table to interactively explore multi-dimensional data, applying statistical functions to different rows and columns and sorting them in different ways.
#EMR Why use external metastore for Hive? What are the choices for Hive external metastore?
An external metastore persists the metastore beyond the cluster's life span.
By default, the Hive metastore is a MySQL database on the master node, but you can choose to use an external metastore:
- AWS Glue Data Catalog
- Amazon RDS or Aurora
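A sketch of the hive-site classification that points Hive at the Glue Data Catalog, passed at cluster creation:

```python
# Pass via the Configurations parameter of run_job_flow.
glue_metastore = [
    {
        "Classification": "hive-site",
        "Properties": {
            "hive.metastore.client.factory.class":
                "com.amazonaws.glue.catalog.metastore."
                "AWSGlueDataCatalogHiveClientFactory"
        },
    }
]
```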
#EMR #S3 What is S3 Select? How does S3 Select work with EMR and Hive?
S3 Select allows applications to retrieve only a subset of data from an S3 object.
For EMR, S3 Select can improve the performance of data processing by reducing data movement between S3 and EMR.
This is also called S3 Select Push Down.
Presto on EMR can use S3 Select Push Down.
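For reference, the underlying S3 Select API via boto3 (bucket, key, and query are hypothetical):

```python
import boto3

s3 = boto3.client("s3")

# Retrieve only matching rows/columns from a hypothetical CSV object.
resp = s3.select_object_content(
    Bucket="my-bucket",
    Key="data/orders.csv",
    ExpressionType="SQL",
    Expression="SELECT s._1, s._3 FROM S3Object s WHERE s._2 = 'shipped'",
    InputSerialization={"CSV": {"FileHeaderInfo": "NONE"}},
    OutputSerialization={"CSV": {}},
)
for event in resp["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode())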
#EMR What is Hudi? Why do we need it?
Hudi is an open-source data management framework used to simplify incremental data processing and data pipeline development by providing record-level insert, update, upsert, and delete capabilities.
- Hudi is integrated with Spark, Hive, Presto, and EMR.
- With Hudi, you can apply changes to a dataset over time.
#EMR What are the two Hudi dataset types?
- Copy on Write (CoW) - default
  - Data is stored in a columnar format (Parquet), and each update creates a new version of the files during a write
  - Better suited for read-heavy workloads on data that changes less frequently
- Merge on Read (MoR)
  - Data is stored using a combination of columnar (Parquet) and row-based (Avro) formats
  - Updates are logged to row-based delta files and are compacted as needed to create new versions of the columnar files
  - Better suited for write- or change-heavy workloads with fewer reads
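A hedged PySpark sketch of writing a Hudi dataset and choosing the type (table name, key fields, and path are hypothetical; `df` is an existing DataFrame):

```python
# On older Hudi/EMR versions use .format("org.apache.hudi") instead.
(df.write.format("hudi")
   .option("hoodie.table.name", "orders")                          # hypothetical
   .option("hoodie.datasource.write.recordkey.field", "id")        # hypothetical
   .option("hoodie.datasource.write.precombine.field", "ts")       # hypothetical
   .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")  # or COPY_ON_WRITE
   .mode("append")
   .save("s3://my-bucket/hudi/orders/"))
```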
#EMR What are the 3 logical views of Hudi datasets?
- Read Optimized View - default; latest data for CoW and latest compacted data for MoR (data may not be the latest for MoR)
- Incremental View - a change stream between two actions on a CoW dataset, to feed downstream jobs and ETL workflows
- Real-time View - latest data from MoR
#EMR What are the 2 options you have using Jupyter notebooks on EMR?
- EMR Notebook - has to be used from the EMR console; stored in S3; can be shared among clusters; a cluster can have multiple notebooks
- JupyterHub - a container running on the EMR master node that can be used by multiple users; you can access it via UI or CLI; notebooks are stored on the master node's file system, BUT you can configure it to persist to S3
#EMR What is Livy?
Livy, which is included in EMR, is used to talk to Spark on EMR through a REST interface. You can submit jobs or scripts and manage Spark contexts.
The JupyterHub container uses Livy.
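A minimal sketch of submitting a batch job through Livy's REST API (master-node host and script path are hypothetical; Livy listens on port 8998):

```python
import requests

resp = requests.post(
    "http://ec2-master-node.example.com:8998/batches",  # hypothetical host
    json={"file": "s3://my-bucket/jobs/wordcount.py"},  # hypothetical script
    headers={"Content-Type": "application/json"},
)
print(resp.json())  # returns the batch id and its state
```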
#EMR What is Mahout?
Mahout is a machine learning framework / libraries / tools for Hadoop and Spark.
Mahout uses Hadoop to distribute compute across the cluster for clustering, classification, recommendation, etc.
#EMR What is MXNet?
Apache MXNet is an acceleration library designed for building neural networks and other deep learning applications; it is an alternative to TensorFlow.
MXNet is included in EMR.
#EMR What is Oozie?
Oozie is a Hadoop job scheduler and is included in EMR.
With EMR, you need to access Oozie via the Hue Oozie application, since the native Oozie web interface is not supported in EMR.
Oozie by default persists user information and query history to a local MySQL database on the master node, but you can configure it to persist to S3 and RDS.
#EMR What is Phoenix?
Apache Phoenix is used for OLTP and operational analytics with SQL over HBase.
You can connect to Phoenix using a JDBC client, or a thin client to the Phoenix Query Server running on the master node.
#EMR What is Pig?
Apache Pig is an open-source library that runs on top of Hadoop for data processing.
- Pig has SQL-like commands written in a language called Pig Latin
- Pig converts those commands into Tez jobs based on DAGs, or into MapReduce programs
- You can run Pig commands interactively or in batch as EMR cluster steps (from S3)
- You can configure Pig to write output to HCatalog
#EMR When to use S3 Select Push Down?
- If your query filters out 50%+ of the original data
- If your query predicates are supported by S3 Select; note timestamp is NOT supported
- If your S3 objects are in CSV format, optionally compressed with gzip or bzip2
- S3 SSE-C and client-side encryption are NOT supported
S3 Select is not a replacement for using columnar (ORC or Parquet) or compressed file formats
#EMR What is Presto Graceful Auto Scaling?
EMR allows you to set a grace period for Presto tasks to keep running before the node terminates because of a scale-in resize action or an automatic scaling policy request.
#EMR Can Spark on EMR work with Amazon SageMaker?
Yes. SageMaker Spark SDK is included with Spark on EMR and can be used to create ML pipelines with SageMaker Steps.
#EMR What is Sqoop?
Apache Sqoop is a tool for transferring data between Amazon S3, Hadoop, HDFS, and RDBMS databases, in parallel. For example, you can use Sqoop to copy data from an RDBMS into EMR for analysis.
Sqoop supports HCatalog and JDBC connections, and is included in EMR.
#EMR #DynamoDB What operations can you do on DynamoDB from an EMR cluster?
You can use EMR with Hive to load (DDB –> HDFS), query (DDB), join (DDB), import (S3 –> DDB), and export (DDB –> S3).
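A sketch of the Hive DDL that maps a DynamoDB table as an external Hive table (table and column names are hypothetical), kept here as a Python string for reference:

```python
# HiveQL you would run on the EMR cluster; Orders and its columns are hypothetical.
hive_ddl = """
CREATE EXTERNAL TABLE ddb_orders (order_id string, total double)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES (
  "dynamodb.table.name" = "Orders",
  "dynamodb.column.mapping" = "order_id:OrderId,total:Total"
);
"""
```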