Analytic Services Flashcards
#Glue What are Glue Worker Types?
AWS Glue comes with 3 worker types
Standard - 4 vCPU, 16GB RAM, 50GB disk, 2 Spark executors ⇒ 1 DPU
G.1X - 4 vCPU, 16GB RAM, 64GB disk, 1 Spark executor ⇒ 1 DPU
G.2X - 8 vCPU, 32GB RAM, 128GB disk, 1 Spark executor ⇒ 2 DPU
1 DPU can run 8 Spark executors
G.1X for jobs that are memory intensive
G.2X for jobs that use AWS Glue ML workloads such as ML Transforms
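The worker-to-DPU mapping above can be turned into a quick capacity estimate. This is a minimal sketch that assumes the figures on this card; the dictionary and function names are hypothetical and not part of any AWS SDK.

```python
# DPUs per worker, copied from the card above (not fetched from AWS).
DPU_PER_WORKER = {"Standard": 1, "G.1X": 1, "G.2X": 2}


def total_dpus(worker_type: str, number_of_workers: int) -> int:
    """Return the DPUs consumed by a job with the given worker type and count."""
    if number_of_workers < 0:
        raise ValueError("number_of_workers must be non-negative")
    return DPU_PER_WORKER[worker_type] * number_of_workers


# 10 G.2X workers at 2 DPU each.
print(total_dpus("G.2X", 10))
```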
#EMR What are the three different node types in an EMR cluster?
- Master or Leader nodes
- manage the cluster
- Single EC2 instance
- Core nodes: host HDFS data and run tasks
- Task nodes: run tasks but do not host data
- No risk of data loss when removing them
- Good use case for Spot Instances
#EMR What is EMRFS Consistent View?
EMRFS Consistent View is an optional feature that allows an EMR cluster to check for list and read-after-write consistency for S3 objects written by or synced with EMRFS.
Why? S3’s eventual consistency
#EMR How does EMRFS Consistent View work?
EMRFS Consistent View uses an Amazon DynamoDB table to store object metadata and track consistency with S3 (EMRFS Metadata Store)
- The DynamoDB table defaults to 400 read capacity units and 100 write capacity units - size these according to the # of objects being tracked
- You can configure the # of retries and the retry period (# seconds) - consistent view retries when it detects an inconsistency
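The retry behaviour described above can be illustrated with a toy model. This is only a sketch of the idea, not EMRFS code: the function name, its parameters, and the default values are all hypothetical, and the "metadata store" is just a Python set standing in for the DynamoDB table.

```python
import time
from typing import Callable, Set


def list_with_consistent_view(
    expected_keys: Set[str],
    list_s3: Callable[[], Set[str]],
    retry_count: int = 5,
    retry_period_seconds: float = 1.0,
) -> Set[str]:
    """Retry the listing until it contains every key the metadata store expects."""
    for attempt in range(retry_count + 1):
        listed = list_s3()
        # Consistent once every tracked key shows up in the listing.
        if expected_keys <= listed:
            return listed
        if attempt < retry_count:
            time.sleep(retry_period_seconds)
    missing = expected_keys - listed
    raise RuntimeError(f"inconsistent listing, missing: {sorted(missing)}")


# Toy listing that lags by one call before showing the second object.
responses = iter([{"a"}, {"a", "b"}])
result = list_with_consistent_view(
    {"a", "b"}, lambda: next(responses), retry_count=2, retry_period_seconds=0
)
print(sorted(result))
```

The first listing is missing key "b", so the function waits and retries; the second listing satisfies the metadata store and is returned.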
#EMR What are the storage options for an EMR cluster?
Can you attach or detach EBS volumes on a running EMR cluster?
- HDFS - distributed local storage
- Ephemeral
- Default block size of 128 MB
- EMRFS: access S3 as if it were HDFS
- EMRFS Consistent View - optional for S3 consistency
- Use DynamoDB to track consistency - pay attention to capacity limits
- Why? S3 eventual consistency
- Local file system
- EBS for HDFS
- Will be deleted if the cluster gets terminated
- Cannot attach or detach EBS volumes on a running cluster, only at creation time
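The 128 MB default block size listed above determines how many HDFS blocks a file occupies. A minimal sketch of that arithmetic, with hypothetical function and constant names:

```python
import math

HDFS_BLOCK_SIZE_MB = 128  # default block size from the card above


def hdfs_block_count(file_size_mb: float, block_size_mb: int = HDFS_BLOCK_SIZE_MB) -> int:
    """Return how many HDFS blocks a file of the given size is split into."""
    if file_size_mb < 0:
        raise ValueError("file_size_mb must be non-negative")
    # The last block may be only partially filled, so round up.
    return math.ceil(file_size_mb / block_size_mb)


# A 1000 MB file: 1000 / 128 = 7.8125, rounded up to whole blocks.
print(hdfs_block_count(1000))
```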
#EMR What does the Spark stack consist of?
- Spark SQL (structured data); Spark Streaming (real-time); MLlib (machine learning); GraphX (graph processing)
- Spark Core
- Standalone Scheduler; YARN; Mesos
#EMR What is Hive?
Hive - data warehouse and analytics infrastructure built on top of Hadoop
- Hive uses a SQL-like language called HiveQL
#EMR What is Tez?
Tez is an extensible framework for building high performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop.
Both Tez and MapReduce are execution engines in Hive
#EMR What is Presto?
Presto is an open source, in-memory, distributed SQL query engine designed for fast interactive queries against PBs of data from different sources
- Run interactive analytics queries against a variety of data sources with size ranging from GB to PB at once and query across them
- Interactive queries at PB scale - faster than Hive
- Optimized for OLAP - analytics queries, data warehousing
- This is what Athena uses under the hood
- Exposes JDBC, CLI, and Tableau interfaces
#EMR What are EMR Notebooks?
EMR Notebooks are similar to Zeppelin but with more AWS integration
- Notebooks backup to S3
- Provision clusters from the notebook
- Host inside a VPC
- Access ONLY via AWS Console
#EMR What is HUE?
HUE stands for Hadoop User Experience. It is an open source web interface for Apache Hadoop and other non-Hadoop applications running on EMR
- You can browse S3 / HDFS / HBase / ZooKeeper
- You can move data between S3 and HDFS
#EMR What is Flume?
Flume is another way to stream data (e.g. log data) into your EMR cluster.
- Built-in sinks for HDFS and HBase
- Source ==> Channels ==> Sink ==> Target
#EMR What is MXNet?
MXNet is an alternative to TensorFlow for building neural networks. MXNet is included in EMR.
#EMR What is S3DistCp?
S3DistCp is a tool for copying large amounts of data between S3 and HDFS
- Uses MapReduce to copy in a distributed manner
- Suitable for parallel copying of large # of objects across buckets and accounts
#EMR What are the Hive integrations with AWS?
- S3: load data / script / partition information
- DynamoDB as an external table. Hive can process and join data stored in DynamoDB
#Quicksight What is a KPI Chart?
KPI Charts use a key performance indicator (KPI) to visualize a comparison between a key value and its target value.
#DynamoDB What is WCU and RCU for DynamoDB?
1 WCU = 1 write per second for an item up to 1KB
1 RCU = 1 strongly consistent read per second for an item up to 4KB, or 2 eventually consistent reads per second for an item up to 4KB
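The capacity rules above translate into a simple calculation: item sizes round up to the next 1KB for writes and the next 4KB for reads, and eventually consistent reads cost half as much. A minimal sketch with hypothetical function names (these are not DynamoDB API calls):

```python
import math


def write_capacity_units(item_size_kb: float, writes_per_second: int) -> int:
    """WCUs needed: each write consumes 1 WCU per 1KB of item size, rounded up."""
    return math.ceil(item_size_kb / 1) * writes_per_second


def read_capacity_units(
    item_size_kb: float, reads_per_second: int, strongly_consistent: bool = True
) -> int:
    """RCUs needed: each read consumes 1 RCU per 4KB, halved if eventually consistent."""
    units = math.ceil(item_size_kb / 4) * reads_per_second
    return units if strongly_consistent else math.ceil(units / 2)


# 10 writes/s of 2.5KB items: each write rounds up to 3KB.
print(write_capacity_units(2.5, 10))
# 80 reads/s of 6KB items: each read rounds up to 8KB, i.e. 2 units.
print(read_capacity_units(6, 80, strongly_consistent=True))
print(read_capacity_units(6, 80, strongly_consistent=False))
```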
#S3 What is Glacier Select?
Glacier Select allows you to query Glacier data with simple SQL queries and get results in minutes, without needing to restore it to S3.
#IoT What are types of identity principals for device or client authentication supported by AWS IoT?
- X.509 cert
- IAM users, groups, and roles
- Amazon Cognito identities
I think you can also implement Federated Identity via Cognito since it can use
#Redshift What is Redshift's Elastic Resize?
Redshift’s Elastic Resize allows you to add / remove nodes and also change node types.
However, Elastic resize only holds connections open if you only change the number of nodes, not the node type.
If you want to minimize the downtime involved, you might still use the snapshot / restore / resize approach with classic resize
https://aws.amazon.com/blogs/big-data/scale-your-amazon-redshift-clusters-up-and-down-in-minutes-to-get-the-performance-you-need-when-you-need-it/
#EMR Which compression algorithms are splittable?
BZIP2 and LZO are splittable (LZO once indexed), great for parallel processing
GZIP and SNAPPY are NOT splittable
#EMR What is HBase Read-replica in S3?
Amazon EMR version 5.7.0+ allows you to maintain read-only copies of data in Amazon S3.
You can access the data from the read-replica cluster to perform read operations simultaneously, and in the event that the primary cluster becomes unavailable.
#EMR What are the 3 ways that HBase can integrate with S3?
- HBase on S3 (S3 storage mode) - Storage of HBase StoreFiles and metadata in S3
- Snapshot of HBase to S3 - this is for HBase clusters that do not use S3 storage mode
- HBase Read-replicas in S3
#EMR What is Ganglia?
Ganglia is a scalable, distributed system designed to monitor clusters and grids while minimizing the impact on their performance.
Ganglia is installed on the Master Node and is the operational dashboard provided with EMR.