Analytic Services Flashcards
#Glue What are Glue Worker Types?
AWS Glue comes with 3 worker types:
Standard - 4 vCPU, 16GB RAM, 50GB disk, 2 Spark executors ⇒ 1 DPU
G.1X - 4 vCPU, 16GB RAM, 64GB disk, 1 Spark executor ⇒ 1 DPU
G.2X - 8 vCPU, 32GB RAM, 128GB disk, 1 Spark executor ⇒ 2 DPU
1 DPU can run 8 Spark executors
G.1X for jobs that are memory intensive
G.2X for jobs that run AWS Glue ML workloads such as ML Transforms
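A minimal sketch of the DPU arithmetic implied by the list above, assuming a job's DPU count is simply the per-worker DPU figure times the number of workers. The function name `total_dpus` and the lookup table are illustrative, not part of any AWS SDK.

```python
# DPUs per worker, copied from the worker-type list on this card.
DPU_PER_WORKER = {"Standard": 1, "G.1X": 1, "G.2X": 2}


def total_dpus(worker_type: str, number_of_workers: int) -> int:
    """Return the DPU count for a job with the given worker type and count."""
    if worker_type not in DPU_PER_WORKER:
        raise ValueError(f"unknown worker type: {worker_type}")
    if number_of_workers < 1:
        raise ValueError("number_of_workers must be at least 1")
    return DPU_PER_WORKER[worker_type] * number_of_workers


print(total_dpus("G.2X", 10))  # 20
```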
#EMR What are the three node types in an EMR cluster?
- Master (leader) node: manages the cluster; a single EC2 instance
- Core nodes: host HDFS data and run tasks
- Task nodes: run tasks but host no data, so removing them carries no risk of data loss; a good fit for Spot Instances
#EMR What is EMRFS Consistent View?
EMRFS Consistent View is an optional feature that allows an EMR cluster to check for list and read-after-write consistency for S3 objects written by or synced with EMRFS.
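Why? S3 was eventually consistent for overwrites and listings when this feature was introduced (S3 has offered strong read-after-write consistency since December 2020).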
#EMR How does EMRFS Consistent View work?
EMRFS Consistent View uses an Amazon DynamoDB table (the EMRFS metadata store) to store object metadata and track consistency with S3.
- The metadata table defaults to 400 read capacity units and 100 write capacity units - size it according to the # of objects being tracked
- You can configure the # of retries and the retry period (in seconds) that Consistent View uses when it detects an inconsistency
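The settings above can be sketched as an `emrfs-site` configuration classification. This is a hedged example: the helper name `emrfs_consistent_view_config` is made up, the property names are written as I recall them from the EMR documentation and should be checked against your EMR release, and the retry values in the usage line are illustrative rather than documented defaults. The 400/100 capacities come from this card.

```python
import json


def emrfs_consistent_view_config(retry_count: int,
                                 retry_period_seconds: int,
                                 read_capacity: int = 400,
                                 write_capacity: int = 100) -> dict:
    """Build an emrfs-site classification that turns on Consistent View.

    Assumption: property names follow the fs.s3.consistent.* scheme;
    verify them against the documentation for your EMR release.
    """
    return {
        "Classification": "emrfs-site",
        "Properties": {
            "fs.s3.consistent": "true",
            "fs.s3.consistent.retryCount": str(retry_count),
            "fs.s3.consistent.retryPeriodSeconds": str(retry_period_seconds),
            "fs.s3.consistent.metadata.read.capacity": str(read_capacity),
            "fs.s3.consistent.metadata.write.capacity": str(write_capacity),
        },
    }


# Illustrative values only: 5 retries, 10 seconds apart.
print(json.dumps(emrfs_consistent_view_config(5, 10), indent=2))
```

A dictionary of this shape is what you would pass in the `Configurations` list when creating a cluster.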
#EMR What are the storage options for an EMR cluster?
Can you attach or detach EBS volumes on a running EMR cluster?
- HDFS - distributed local storage
- Ephemeral
- Default block size of 128 MB
- EMRFS: access S3 as if it were HDFS
- EMRFS Consistent View - optional for S3 consistency
- Uses DynamoDB to track consistency - pay attention to capacity limits
- Why? S3 eventual consistency
- Local file system
- EBS for HDFS
- Deleted when the cluster is terminated
- Cannot attach or detach EBS volumes on a running cluster; volumes can only be attached at creation time
#EMR What does the Spark stack consist of?
- Spark SQL (structured data); Spark Streaming (real-time); MLlib (machine learning); GraphX (graph processing)
- Spark Core
- Standalone Scheduler; YARN; Mesos
#EMR What is Hive?
Hive - data warehouse and analytic infrastructure built on top of Hadoop
- Hive uses a SQL-like language called HiveQL
#EMR What is Tez?
Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop.
Both Tez and MapReduce can serve as the execution engine for Hive
#EMR What is Presto?
Presto is an open-source, distributed, in-memory SQL query engine designed for fast interactive queries against PB of data from different sources
- Run interactive analytics queries against a variety of data sources with size ranging from GB to PB at once and query across them
- Interactive queries at PB scale - faster than Hive
- Optimized for OLAP - analytics queries, data warehousing
- This is what Athena uses under the hood
- Exposes JDBC, CLI, and Tableau interfaces
#EMR What are EMR Notebooks?
EMR Notebooks are similar to Zeppelin but with more AWS integration
- Notebooks are backed up to S3
- Provision clusters from the notebook
- Hosted inside a VPC
- Accessed ONLY via the AWS Console
#EMR What is HUE?
HUE stands for Hadoop User Experience. It is an open-source web interface for Apache Hadoop and other non-Hadoop applications running on EMR
- You can browse S3 / HDFS / HBase / ZooKeeper
- You can move data between S3 and HDFS
#EMR What is Flume?
Flume is another way to stream data (e.g. log data) into your EMR cluster.
- Built-in sinks for HDFS and HBase
- Source ==> Channels ==> Sink ==> Target
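A toy model of the Source ==> Channel ==> Sink ==> Target flow, written in plain Python to show how the channel buffers events between the two sides. This is not Flume's API; `Channel`, `run_source`, and `run_sink` are hypothetical names, and a real Flume agent is set up through its own configuration file.

```python
from collections import deque


class Channel:
    """Bounded buffer that decouples the source from the sink."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._events = deque()

    def put(self, event: str) -> bool:
        """Accept an event unless the channel is full."""
        if len(self._events) >= self.capacity:
            return False
        self._events.append(event)
        return True

    def take(self, batch_size: int) -> list:
        """Remove and return up to batch_size events, oldest first."""
        batch = []
        while self._events and len(batch) < batch_size:
            batch.append(self._events.popleft())
        return batch


def run_source(lines, channel: Channel) -> int:
    """Push each incoming line into the channel; return how many were accepted."""
    return sum(1 for line in lines if channel.put(line))


def run_sink(channel: Channel, target: list, batch_size: int = 2) -> None:
    """Drain the channel in batches and append each batch to the target."""
    while True:
        batch = channel.take(batch_size)
        if not batch:
            break
        target.extend(batch)


channel = Channel(capacity=10)
target = []  # stands in for HDFS or HBase
accepted = run_source(["log line 1", "log line 2", "log line 3"], channel)
run_sink(channel, target)
print(accepted, target)
```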
#EMR What is MXNet?
MXNet is an alternative to TensorFlow for building neural networks. MXNet is included in EMR.
#EMR What is S3DistCp?
S3DistCp is a tool for copying large amounts of data between S3 and HDFS
- Uses MapReduce to copy in a distributed manner
- Suitable for parallel copying of large # of objects across buckets and accounts
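A hedged sketch of how an S3DistCp copy might be described as an EMR step, assuming the common pattern of running `s3-dist-cp` through `command-runner.jar`. The helper name `s3distcp_step`, the bucket, and the paths are hypothetical.

```python
def s3distcp_step(src: str, dest: str, name: str = "S3DistCp copy") -> dict:
    """Build an EMR step definition that runs s3-dist-cp from src to dest."""
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["s3-dist-cp", f"--src={src}", f"--dest={dest}"],
        },
    }


# Hypothetical bucket and paths, for illustration only.
step = s3distcp_step("s3://example-bucket/logs/", "hdfs:///data/logs/")
print(step["HadoopJarStep"]["Args"])
```

A dictionary of this shape can be submitted to a running cluster as a step, for example through the `Steps` parameter of boto3's `add_job_flow_steps`.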
#EMR What are Hive's integrations with AWS?
- S3: load data / script / partition information
- DynamoDB as an external table. Hive can process and join data stored in DynamoDB
#Quicksight What is a KPI Chart?
KPI Charts use a key performance indicator (KPI) to visualize a comparison between a key value and its target value.
#DynamoDB What is WCU and RCU for DynamoDB?
1 WCU = 1 WRITE per second for an item up to 1KB
1 RCU = 1 strongly consistent READ per second, or 2 eventually consistent READs per second, for an item up to 4KB
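A small worked example of the capacity arithmetic, assuming item sizes round up to the next 1KB for writes and the next 4KB for reads. The function names are illustrative.

```python
import math


def wcu_needed(item_size_bytes: int, writes_per_second: int) -> int:
    """WCUs for a write workload: item size rounds up to whole 1KB units."""
    units_per_write = max(1, math.ceil(item_size_bytes / 1024))
    return units_per_write * writes_per_second


def rcu_needed(item_size_bytes: int, reads_per_second: int,
               strongly_consistent: bool = True) -> int:
    """RCUs for a read workload: item size rounds up to whole 4KB units.

    Eventually consistent reads cost half as much as strongly consistent ones.
    """
    units_per_read = max(1, math.ceil(item_size_bytes / 4096))
    total = units_per_read * reads_per_second
    return total if strongly_consistent else math.ceil(total / 2)


print(wcu_needed(2560, 10))                             # 30: a 2.5KB item costs 3 WCU per write
print(rcu_needed(6144, 10))                             # 20: a 6KB item costs 2 RCU per read
print(rcu_needed(6144, 10, strongly_consistent=False))  # 10
```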
#S3 What is Glacier Select?
Glacier Select allows you to query Glacier data with simple SQL queries and get results in minutes, without needing to restore the data to S3.
#IoT What are types of identity principals for device or client authentication supported by AWS IoT?
- X.509 certificates
- IAM users, groups, and roles
- Amazon Cognito identities
I think you can also implement Federated Identity via Cognito since it can use