Analytic Services Flashcards

1
Q
#Glue
What are Glue Worker Types?
A

AWS Glue comes with 3 worker types

Standard - 4 vCPU, 16 GB RAM, 50 GB disk, 2 Spark executors ⇒ 1 DPU

G.1X - 4 vCPU, 16 GB RAM, 64 GB disk, 1 Spark executor ⇒ 1 DPU

G.2X - 8 vCPU, 32 GB RAM, 128 GB disk, 1 Spark executor ⇒ 2 DPU

1 DPU can run 8 Spark executors

G.1X is for jobs that are memory intensive
G.2X is for jobs that use AWS Glue ML workloads such as ML Transforms
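
The specs above can be sketched as a lookup for estimating a job's DPU consumption. This is a minimal sketch based on this card's numbers; `job_dpus` is a hypothetical helper, not a Glue API:

```python
# DPU math for AWS Glue worker types, per the specs on this card.
GLUE_WORKERS = {
    "Standard": {"vcpu": 4, "ram_gb": 16, "disk_gb": 50, "executors": 2, "dpu_per_worker": 1},
    "G.1X":     {"vcpu": 4, "ram_gb": 16, "disk_gb": 64, "executors": 1, "dpu_per_worker": 1},
    "G.2X":     {"vcpu": 8, "ram_gb": 32, "disk_gb": 128, "executors": 1, "dpu_per_worker": 2},
}

def job_dpus(worker_type: str, num_workers: int) -> int:
    """Total DPUs a job consumes: workers x DPU-per-worker."""
    return GLUE_WORKERS[worker_type]["dpu_per_worker"] * num_workers

print(job_dpus("G.2X", 10))  # ten G.2X workers consume 20 DPUs
```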

2
Q
#EMR
What are the three different node types in an EMR cluster?
A
  • Master or Leader nodes
    • manage the cluster
    • Single EC2 instance
  • Core nodes: host HDFS data and run tasks
  • Task nodes: run tasks not hosting data
    • No risk of data loss when removing them
    • Good use of spot instance
3
Q
#EMR
What is EMRFS Consistent View?
A

EMRFS Consistent View is an optional feature that allows an EMR cluster to check for list and read-after-write consistency for S3 objects written or synced with EMRFS.

Why? S3's eventual consistency

4
Q
#EMR 
How does EMRFS Consistent View work?
A

EMRFS Consistent View uses an Amazon DynamoDB table to store object metadata and track consistency with S3 (the EMRFS Metadata Store)

  • The DynamoDB table defaults to 400 read capacity units and 100 write capacity units - you need to configure it according to the number of objects being tracked
  • You can configure the number of retries and the retry period (in seconds) - consistent view retries when it detects an inconsistency
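
As a sketch, consistent view and its DynamoDB capacity/retry settings are passed to the cluster through the `emrfs-site` configuration classification. The property names are from the EMR documentation; the capacity and retry values here are illustrative, not recommendations:

```python
# EMR configuration classification enabling EMRFS Consistent View.
# Capacity/retry values below are illustrative placeholders.
emrfs_site = {
    "Classification": "emrfs-site",
    "Properties": {
        "fs.s3.consistent": "true",
        "fs.s3.consistent.retryCount": "5",                 # number of retries
        "fs.s3.consistent.retryPeriodSeconds": "10",        # seconds between retries
        "fs.s3.consistent.metadata.read.capacity": "600",   # default is 400
        "fs.s3.consistent.metadata.write.capacity": "300",  # default is 100
    },
}
# Passed when creating the cluster, e.g. with boto3:
# emr.run_job_flow(..., Configurations=[emrfs_site])
```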
5
Q
#EMR
What are the storage options for an EMR cluster?

Can you add or detach EBS volumes on a running EMR cluster?

A
  • HDFS - distributed local storage
    • Ephemeral
    • Default block size of 128 MB
  • EMRFS: access S3 as if it were HDFS
    • EMRFS Consistent View - optional, for S3 consistency
    • Uses DynamoDB to track consistency - pay attention to capacity limits
    • Why? S3 eventual consistency
  • Local file system
  • EBS for HDFS
    • Will be deleted if the cluster gets terminated
    • Cannot add or detach EBS volumes on a running cluster, only at creation time
6
Q
#EMR
What does the Spark stack consist of?
A
  • Spark SQL (structure); Spark Streaming (real-time); MLLib; GraphX (graph processing)
  • Spark Core
  • Standalone Scheduler; YARN; Mesos
7
Q
#EMR
What is Hive?
A

Hive is a data warehouse and analytics infrastructure built on top of Hadoop

  • Hive uses a SQL-like language called HiveQL
8
Q
#EMR
What is Tez?
A

Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop.

Both Tez and MapReduce are execution engines in Hive

9
Q
#EMR
What is Presto?
A

Presto is an open-source, in-memory, distributed SQL query engine designed for fast interactive queries against petabytes of data from different sources

  • Run interactive analytic queries against a variety of data sources ranging in size from GB to PB, and query across them at once
  • Interactive queries at PB scale - faster than Hive
  • Optimized for OLAP - analytic queries, data warehousing
  • This is what Athena uses under the hood
  • Exposes JDBC, CLI, and Tableau interfaces
10
Q
#EMR
What are EMR Notebooks?
A

EMR Notebooks are similar to Zeppelin but with more AWS integration

  • Notebooks are backed up to S3
  • Provision clusters from the notebook
  • Hosted inside a VPC
  • Accessible ONLY via the AWS Console
11
Q
#EMR
What is HUE?
A

HUE stands for Hadoop User Experience. It is an open-source web interface for Apache Hadoop and other non-Hadoop applications running on EMR

  • You can browse S3 / HDFS / HBase / ZooKeeper
  • You can move data between S3 and HDFS
12
Q
#EMR
What is Flume?
A

Flume is another way to stream data (e.g. log data) into your EMR cluster.

  • Built-in sinks for HDFS and HBase
  • Source ==> Channels ==> Sink ==> Target
13
Q
#EMR
What is MXNet?
A

MXNet is an alternative to Tensorflow for building neural networks. MXNet is included in EMR.

14
Q
#EMR
What is S3DistCP?
A

S3DistCp is a tool for copying large amounts of data between S3 and HDFS

  • Uses MapReduce to copy in a distributed manner
  • Suitable for parallel copying of large numbers of objects across buckets and accounts
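
For example, S3DistCp typically runs as an EMR step via `command-runner.jar`. A hedged sketch of such a step definition - bucket names and the pattern are hypothetical placeholders:

```python
# An EMR step definition that runs s3-dist-cp to copy gzipped logs
# from S3 into HDFS. Bucket and paths are hypothetical placeholders.
s3distcp_step = {
    "Name": "S3DistCp: S3 -> HDFS",
    "ActionOnFailure": "CONTINUE",
    "HadoopJarStep": {
        "Jar": "command-runner.jar",
        "Args": [
            "s3-dist-cp",
            "--src", "s3://my-source-bucket/logs/",
            "--dest", "hdfs:///input/logs/",
            "--srcPattern", r".*\.gz",
        ],
    },
}
# Submitted with boto3:
# emr.add_job_flow_steps(JobFlowId=cluster_id, Steps=[s3distcp_step])
```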
15
Q
#EMR
What are the Hive integrations with AWS?
A
  • S3: load data / scripts / partition information
  • DynamoDB as an external table - Hive can process and join data stored in DynamoDB
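
A minimal sketch of the DynamoDB integration: the HiveQL below (held in a Python string, as you might pass it to a Hive step) maps a hypothetical DynamoDB table `Orders` as an external Hive table. Table and column names are made-up placeholders; the storage handler class is the one Hive uses on EMR:

```python
# HiveQL mapping a hypothetical DynamoDB table "Orders" into Hive.
# Table and column names are placeholders for illustration.
DDB_EXTERNAL_TABLE_DDL = """
CREATE EXTERNAL TABLE ddb_orders (order_id string, total double)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES (
  "dynamodb.table.name" = "Orders",
  "dynamodb.column.mapping" = "order_id:OrderId,total:Total"
);
"""
```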

16
Q
#Quicksight
What is a KPI Chart?
A

KPI Charts use a key performance indicator (KPI) to visualize a comparison between a key value and its target value.

17
Q
#DynamoDB
What is WCU and RCU for DynamoDB?
A

1 WCU = 1 write of up to 1 KB per second

1 RCU = 2 eventually consistent reads of up to 4 KB per second, or 1 strongly consistent read of up to 4 KB per second
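
The capacity arithmetic can be sketched as follows (item sizes round up to the nearest 1 KB for writes and 4 KB for reads; the helper names are my own, not a DynamoDB API):

```python
import math

def wcus(item_kb: float, writes_per_sec: int) -> int:
    """1 WCU = one write of up to 1 KB per second."""
    return math.ceil(item_kb) * writes_per_sec

def rcus(item_kb: float, reads_per_sec: int, strongly_consistent: bool = True) -> int:
    """1 RCU = one strongly consistent 4 KB read/s, or two eventually consistent ones."""
    units = math.ceil(item_kb / 4) * reads_per_sec
    return units if strongly_consistent else math.ceil(units / 2)

print(wcus(2.5, 10))                           # 30 WCUs for ten 2.5 KB writes/s
print(rcus(6, 10))                             # 20 RCUs, strongly consistent
print(rcus(6, 10, strongly_consistent=False))  # 10 RCUs, eventually consistent
```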

18
Q
#S3
What is Glacier Select?
A

Glacier Select allows you to query Glacier data with simple SQL queries and get results in minutes, without needing to restore the data to S3.

19
Q
#IoT
What are the types of identity principals for device or client authentication supported by AWS IoT?
A
  1. X.509 certificates
  2. IAM users, groups, and roles
  3. Amazon Cognito identities

I think you can also implement federated identity via Cognito, since Cognito identity pools can use external identity providers.

20
Q
#Redshift
What is Redshift's Elastic Resize?
A

Redshift's Elastic Resize allows you to add / remove nodes and also change node types.

However, elastic resize only holds connections open if you change just the number of nodes, not the node type.

If you want to minimize the downtime involved, you might still use the snapshot / restore / resize approach with classic resize

https://aws.amazon.com/blogs/big-data/scale-your-amazon-redshift-clusters-up-and-down-in-minutes-to-get-the-performance-you-need-when-you-need-it/
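
As a sketch with boto3, the choice between elastic and classic resize is the `Classic` flag on `resize_cluster`; the cluster identifier is a hypothetical placeholder:

```python
# Parameters for a Redshift elastic resize that changes the node count.
# The cluster identifier is a hypothetical placeholder.
resize_params = {
    "ClusterIdentifier": "my-analytics-cluster",
    "NumberOfNodes": 4,
    "Classic": False,  # False => elastic resize; True => classic resize
}
# boto3.client("redshift").resize_cluster(**resize_params)
```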

21
Q
#EMR
Which compression algorithms are splittable?
A

BZIP2 and LZO are splittable - great for parallel processing

GZIP and Snappy are NOT splittable
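
A quick lookup based on this card (the file extensions are the usual conventions; `can_parallelize` is a made-up helper for illustration):

```python
# Splittability of common compression codecs, per this card.
SPLITTABLE = {
    ".bz2": True,      # BZIP2 - splittable
    ".lzo": True,      # LZO - splittable (when indexed)
    ".gz": False,      # GZIP - not splittable
    ".snappy": False,  # Snappy - not splittable
}

def can_parallelize(filename: str) -> bool:
    """True if a single file of this type can be split across mappers."""
    return any(filename.endswith(ext) and ok for ext, ok in SPLITTABLE.items())

print(can_parallelize("logs.bz2"))  # True
print(can_parallelize("logs.gz"))   # False
```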

22
Q
#EMR
What is HBase Read-replica in S3?
A

Amazon EMR version 5.7.0+ allows you to maintain read-only copies of HBase data in Amazon S3.

You can access the data from the read-replica cluster to perform read operations simultaneously, and in the event that the primary cluster becomes unavailable.

23
Q
#EMR
What are the 3 ways that HBase can integrate with S3?
A
  1. HBase on S3 (S3 storage mode) - storage of HBase StoreFiles and metadata in S3
  2. Snapshots of HBase to S3 - for HBase clusters that do not use S3 storage mode
  3. HBase read-replicas in S3
24
Q
#EMR
What is Ganglia?
A

Ganglia is a scalable, distributed system designed to monitor clusters and grids while minimizing the impact on their performance.

Ganglia is installed on the master node and is the operational dashboard provided with EMR.

25
Q
#EMR
What is Apache Sqoop?
A

Sqoop is an open-source system for transferring data between Hadoop and relational databases (in parallel).

You can use Sqoop to copy data from an RDBMS to EMR for analysis

26
Q
#QuickSight
What is a heat map chart used for?
A

A heat map is used to highlight outliers and trends using color.

27
Q
#QuickSight
What is a scatter plot used for?
What is a tree map used for?
A

A scatter plot is used for visualizing two or three measures for a dimension.

A tree map is used for visualizing one or two measures for a dimension.

28
Q
#EMR
What are the advantages of HBase on S3?
A
  1. The HBase root directory is stored in Amazon S3, including HBase store files and table metadata.
    The data persists outside the cluster and is available across AZs - no snapshot / recovery is needed
  2. Compute and storage are separated, so you can scale the EMR cluster to your compute requirements
  3. You can have a read-replica cluster and perform read operations concurrently, even when the primary cluster is not available
29
Q
#EMR
What is HCatalog?
A

HCatalog is a tool that allows you to access Hive metastore tables within Pig, Spark SQL, and/or custom MapReduce applications.

30
Q
#Redshift
Does Redshift support multi-AZ deployment?
A

No. Currently, Amazon Redshift only supports Single-AZ deployments.

  • You can create Redshift clusters in multiple AZs and stream / COPY the same data into them; OR
  • You can use Redshift Spectrum to query the same S3 data from Redshift clusters in different AZs
31
Q
#QuickSight
What is a pivot table used for?
A

You can use a pivot table to interactively explore multi-dimensional data, applying statistical functions to different rows and columns and sorting them in different ways.

32
Q
#EMR
Why use an external metastore for Hive? What are the choices for a Hive external metastore?
A

An external metastore is for persisting the metastore beyond the cluster's life span.

By default, the Hive metastore is in a MySQL database on the master node, but you can choose to use an external metastore:

  1. AWS Glue Data Catalog
  2. Amazon RDS or Aurora
33
Q
#EMR #S3
What is S3 Select? How does S3 Select work with EMR and Hive?
A

S3 Select allows applications to retrieve only a subset of data from an S3 object.

For EMR, S3 Select can improve the performance of data processing by reducing the data movement between S3 and EMR

This is also called S3 Select Push Down

Presto on EMR can use S3 Select Push Down

34
Q
#EMR
What is Hudi? Why we need it?
A

Hudi is an open-source data management framework used to simplify incremental data processing and data pipeline development by providing record-level insert, update, upsert, and delete capabilities.

  • Hudi is integrated with Spark, Hive, Presto, and EMR.
  • With Hudi, you can apply changes to a dataset over time.
35
Q
#EMR
What are the two Hudi dataset types?
A
  1. Copy on Write (CoW) - default.
    - Data is stored in a columnar format (Parquet), and each update creates a new version of files during a write;
    - Better suited for read-heavy workloads on data that changes less frequently.
  2. Merge on Read (MoR) - Data is stored using a combination of columnar (Parquet) and row-based (Avro) formats.
    - Updates are logged to row-based delta files and are compacted as needed to create new versions of the columnar files
    - Better suited for write- or change-heavy workloads with fewer reads.
36
Q
#EMR
What are the 3 logical view of Hudi datasets?
A
  1. Read Optimized View - default; the latest data for CoW and the latest compacted data for MoR (so it may not be the most recent data for MoR)
  2. Incremental View - a change stream between actions out of a CoW dataset, used to feed downstream jobs and ETL workflows
  3. Real-time View - the latest data from MoR
37
Q
#EMR
What are the 2 options you have using Jupyter notebooks on EMR?
A
  1. EMR Notebooks - have to be used in the EMR console; stored in S3; can be shared among clusters; a cluster can have multiple notebooks
  2. JupyterHub - a container running on the EMR master node that can be used by multiple users; you can access it via the UI or CLI; notebooks are stored on the master node's file system, BUT you can configure them to persist to S3
38
Q
#EMR
What is Livy?
A

Livy, which is included in EMR, provides a REST interface to Spark on EMR. You can submit jobs or scripts and manage Spark contexts through it.

The JupyterHub container uses Livy.

39
Q
#EMR
What is Mahout?
A

Mahout is a machine learning framework (libraries and tools) for Hadoop and Spark.

Mahout uses Hadoop to distribute compute across the cluster for clustering, classification, recommendation etc.

40
Q
#EMR
What is MXNet?
A

Apache MXNet is an acceleration library designed for building neural networks and other deep learning applications.

MXNet is included in EMR

41
Q
#EMR
What is Oozie?
A

Oozie is a Hadoop job scheduler and is included in EMR.

With EMR, you need to access Oozie via the Hue Oozie application, since the native interface is not supported in EMR

Oozie by default persists user information and query history to a local MySQL database on the master node, but you can configure it to persist to S3 and RDS

42
Q
#EMR
What is Phoenix?
A

Apache Phoenix is used for OLTP and operational analytics with SQL against an HBase store.

You can connect to Phoenix using a JDBC client, or a thin client to the Phoenix Query Server running on the master node

43
Q
#EMR
What is Pig?
A

Apache Pig is an open-source library that runs on top of Hadoop for data processing

  • Pig has SQL-like commands written in a language called Pig Latin
  • Pig converts those commands into Tez jobs based on DAGs, or MapReduce programs
  • You can run Pig commands interactively or in batch as EMR cluster steps (from S3)
  • You can configure Pig to write output to HCatalog
44
Q
#EMR
When to use S3 Select Push Down?
A
  • if your query filters out 50%+ of the original data
  • if your query predicates are supported by S3 Select (note: timestamp is NOT supported)
  • if your S3 object is in CSV format, optionally compressed with gzip or bzip2
  • S3 SSE-C and client-side encryption are NOT supported

S3 Select is not a replacement for columnar (ORC or Parquet) or compressed file formats
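
Outside EMR, the same push-down idea is exposed directly through the S3 API. A hedged sketch of the parameters for `select_object_content`; bucket, key, and column references are hypothetical placeholders:

```python
# Parameters for s3.select_object_content: run SQL against a single
# gzipped CSV object and receive only the matching bytes back.
# Bucket, key, and column positions are hypothetical placeholders.
select_params = {
    "Bucket": "my-bucket",
    "Key": "data/records.csv.gz",
    "ExpressionType": "SQL",
    "Expression": "SELECT s._1, s._3 FROM S3Object s WHERE s._2 = 'US'",
    "InputSerialization": {"CSV": {"FileHeaderInfo": "NONE"}, "CompressionType": "GZIP"},
    "OutputSerialization": {"CSV": {}},
}
# resp = boto3.client("s3").select_object_content(**select_params)
```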

45
Q
#EMR
What is Presto Graceful Auto Scaling?
A

EMR allows you to set a grace period for Presto tasks to keep running before the node terminates because of a scale-in resize action or an automatic scaling policy request.

46
Q
#EMR
Can Spark on EMR work with Amazon SageMaker?
A

Yes. SageMaker Spark SDK is included with Spark on EMR and can be used to create ML pipelines with SageMaker Steps.

47
Q
#EMR
What is Sqoop?
A

Apache Sqoop is a tool for transferring data between Amazon S3, Hadoop, HDFS, and RDBMS databases.

Sqoop supports HCatalog and JDBC connections, and is included in EMR

48
Q
#EMR #DynamoDB
What operations can you do on DynamoDB from an EMR cluster?
A

You can use EMR with Hive to load (DDB –> HDFS), query (DDB), join (DDB), import (S3 –> DDB), and export (DDB –> S3)