Analytic Services Flashcards

1
Q
#Glue
What are Glue Worker Types?
A

AWS Glue comes with 3 worker types

Standard - 4 vCPU, 16 GB RAM, 50 GB disk, 2 Spark executors ⇒ 1 DPU

G.1X - 4 vCPU, 16 GB RAM, 64 GB disk, 1 Spark executor ⇒ 1 DPU

G.2X - 8 vCPU, 32 GB RAM, 128 GB disk, 1 Spark executor ⇒ 2 DPU

1 DPU can run 8 Spark executors

G.1X is for jobs that are memory intensive
G.2X is for jobs that use AWS Glue ML workloads such as ML Transforms
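
The specs above can be sketched as a lookup for estimating a job's DPU consumption. This is a minimal sketch based on this card's numbers; `job_dpus` is a hypothetical helper, not a Glue API:

```python
# DPU math for AWS Glue worker types, per the specs on this card.
GLUE_WORKERS = {
    "Standard": {"vcpu": 4, "ram_gb": 16, "disk_gb": 50, "executors": 2, "dpu_per_worker": 1},
    "G.1X":     {"vcpu": 4, "ram_gb": 16, "disk_gb": 64, "executors": 1, "dpu_per_worker": 1},
    "G.2X":     {"vcpu": 8, "ram_gb": 32, "disk_gb": 128, "executors": 1, "dpu_per_worker": 2},
}

def job_dpus(worker_type: str, num_workers: int) -> int:
    """Total DPUs a job consumes: workers x DPU-per-worker."""
    return GLUE_WORKERS[worker_type]["dpu_per_worker"] * num_workers

print(job_dpus("G.2X", 10))  # ten G.2X workers consume 20 DPUs
```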

2
Q
#EMR
What are the three different node types in an EMR cluster?
A
  • Master or Leader nodes
    • manage the cluster
    • Single EC2 instance
  • Core nodes: host HDFS data and run tasks
  • Task nodes: run tasks not hosting data
    • No risk of data loss when removing them
    • Good use of spot instance
3
Q
#EMR
What is EMRFS Consistent View?
A

EMRFS Consistent View is an optional feature that allows an EMR cluster to check for list and read-after-write consistency for S3 objects written or synced with EMRFS.

Why? S3's eventual consistency

4
Q
#EMR 
How does EMRFS Consistent View work?
A

EMRFS Consistent View uses an Amazon DynamoDB table to store object metadata and track consistency with S3 (the EMRFS Metadata Store)

  • The DynamoDB table defaults to 400 read capacity units and 100 write capacity units - you need to configure it according to the number of objects being tracked
  • You can configure the number of retries and the retry period (in seconds) - consistent view retries when it detects an inconsistency
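
As a sketch, consistent view and its DynamoDB capacity/retry settings are passed to the cluster through the `emrfs-site` configuration classification. The property names are from the EMR documentation; the capacity and retry values here are illustrative, not recommendations:

```python
# EMR configuration classification enabling EMRFS Consistent View.
# Capacity/retry values below are illustrative placeholders.
emrfs_site = {
    "Classification": "emrfs-site",
    "Properties": {
        "fs.s3.consistent": "true",
        "fs.s3.consistent.retryCount": "5",                 # number of retries
        "fs.s3.consistent.retryPeriodSeconds": "10",        # seconds between retries
        "fs.s3.consistent.metadata.read.capacity": "600",   # default is 400
        "fs.s3.consistent.metadata.write.capacity": "300",  # default is 100
    },
}
# Passed when creating the cluster, e.g. with boto3:
# emr.run_job_flow(..., Configurations=[emrfs_site])
```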
5
Q
#EMR
What are the storage options for an EMR cluster?

Can you add or detach EBS volumes on a running EMR cluster?

A
  • HDFS - distributed local storage
    • Ephemeral
    • Default block size of 128 MB
  • EMRFS: access S3 as if it were HDFS
    • EMRFS Consistent View - optional, for S3 consistency
    • Uses DynamoDB to track consistency - pay attention to capacity limits
    • Why? S3 eventual consistency
  • Local file system
  • EBS for HDFS
    • Will be deleted if the cluster gets terminated
    • Cannot add or detach EBS volumes on a running cluster, only at creation time
6
Q
#EMR
What does the Spark stack consist of?
A
  • Spark SQL (structure); Spark Streaming (real-time); MLLib; GraphX (graph processing)
  • Spark Core
  • Standalone Scheduler; YARN; Mesos
7
Q
#EMR
What is Hive?
A

Hive is a data warehouse and analytics infrastructure built on top of Hadoop

  • Hive uses a SQL-like language called HiveQL
8
Q
#EMR
What is Tez?
A

Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop.

Both Tez and MapReduce are execution engines in Hive

9
Q
#EMR
What is Presto?
A

Presto is an open-source, in-memory, distributed SQL query engine designed for fast interactive queries against petabytes of data from different sources

  • Run interactive analytic queries against a variety of data sources ranging in size from GB to PB, and query across them at once
  • Interactive queries at PB scale - faster than Hive
  • Optimized for OLAP - analytic queries, data warehousing
  • This is what Athena uses under the hood
  • Exposes JDBC, CLI, and Tableau interfaces
10
Q
#EMR
What are EMR Notebooks?
A

EMR Notebooks are similar to Zeppelin but with more AWS integration

  • Notebooks are backed up to S3
  • Provision clusters from the notebook
  • Hosted inside a VPC
  • Accessible ONLY via the AWS Console
11
Q
#EMR
What is HUE?
A

HUE stands for Hadoop User Experience. It is an open-source web interface for Apache Hadoop and other non-Hadoop applications running on EMR

  • You can browse S3 / HDFS / HBase / ZooKeeper
  • You can move data between S3 and HDFS
12
Q
#EMR
What is Flume?
A

Flume is another way to stream data (e.g. log data) into your EMR cluster.

  • Built-in sinks for HDFS and HBase
  • Source ==> Channels ==> Sink ==> Target
13
Q
#EMR
What is MXNet?
A

MXNet is an alternative to Tensorflow for building neural networks. MXNet is included in EMR.

14
Q
#EMR
What is S3DistCP?
A

S3DistCp is a tool for copying large amounts of data between S3 and HDFS

  • Uses MapReduce to copy in a distributed manner
  • Suitable for parallel copying of large numbers of objects across buckets and accounts
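
For example, S3DistCp typically runs as an EMR step via `command-runner.jar`. A hedged sketch of such a step definition - bucket names and the pattern are hypothetical placeholders:

```python
# An EMR step definition that runs s3-dist-cp to copy gzipped logs
# from S3 into HDFS. Bucket and paths are hypothetical placeholders.
s3distcp_step = {
    "Name": "S3DistCp: S3 -> HDFS",
    "ActionOnFailure": "CONTINUE",
    "HadoopJarStep": {
        "Jar": "command-runner.jar",
        "Args": [
            "s3-dist-cp",
            "--src", "s3://my-source-bucket/logs/",
            "--dest", "hdfs:///input/logs/",
            "--srcPattern", r".*\.gz",
        ],
    },
}
# Submitted with boto3:
# emr.add_job_flow_steps(JobFlowId=cluster_id, Steps=[s3distcp_step])
```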
15
Q
#EMR
What are the Hive integrations with AWS?
A
  • S3: load data / scripts / partition information
  • DynamoDB as an external table - Hive can process and join data stored in DynamoDB
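
A minimal sketch of the DynamoDB integration: the HiveQL below (held in a Python string, as you might pass it to a Hive step) maps a hypothetical DynamoDB table `Orders` as an external Hive table. Table and column names are made-up placeholders; the storage handler class is the one Hive uses on EMR:

```python
# HiveQL mapping a hypothetical DynamoDB table "Orders" into Hive.
# Table and column names are placeholders for illustration.
DDB_EXTERNAL_TABLE_DDL = """
CREATE EXTERNAL TABLE ddb_orders (order_id string, total double)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES (
  "dynamodb.table.name" = "Orders",
  "dynamodb.column.mapping" = "order_id:OrderId,total:Total"
);
"""
```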

16
Q
#Quicksight
What is a KPI Chart?
A

KPI Charts use a key performance indicator (KPI) to visualize a comparison between a key value and its target value.

17
Q
#DynamoDB
What is WCU and RCU for DynamoDB?
A

1 WCU = 1 write of up to 1 KB per second

1 RCU = 2 eventually consistent reads of up to 4 KB per second, or 1 strongly consistent read of up to 4 KB per second
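
The capacity arithmetic can be sketched as follows (item sizes round up to the nearest 1 KB for writes and 4 KB for reads; the helper names are my own, not a DynamoDB API):

```python
import math

def wcus(item_kb: float, writes_per_sec: int) -> int:
    """1 WCU = one write of up to 1 KB per second."""
    return math.ceil(item_kb) * writes_per_sec

def rcus(item_kb: float, reads_per_sec: int, strongly_consistent: bool = True) -> int:
    """1 RCU = one strongly consistent 4 KB read/s, or two eventually consistent ones."""
    units = math.ceil(item_kb / 4) * reads_per_sec
    return units if strongly_consistent else math.ceil(units / 2)

print(wcus(2.5, 10))                           # 30 WCUs for ten 2.5 KB writes/s
print(rcus(6, 10))                             # 20 RCUs, strongly consistent
print(rcus(6, 10, strongly_consistent=False))  # 10 RCUs, eventually consistent
```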

18
Q
#S3
What is Glacier Select?
A

Glacier Select allows you to query Glacier data with simple SQL queries and get results in minutes, without needing to restore the data to S3.

19
Q
#IoT
What are the types of identity principals for device or client authentication supported by AWS IoT?
A
  1. X.509 certificates
  2. IAM users, groups, and roles
  3. Amazon Cognito identities

I think you can also implement federated identity via Cognito, since Cognito identity pools can use external identity providers.

20
Q
#Redshift
What is Redshift's Elastic Resize?
A

Redshift's Elastic Resize allows you to add / remove nodes and also change node types.

However, elastic resize only holds connections open if you change just the number of nodes, not the node type.

If you want to minimize the downtime involved, you might still use the snapshot / restore / resize approach with classic resize

https://aws.amazon.com/blogs/big-data/scale-your-amazon-redshift-clusters-up-and-down-in-minutes-to-get-the-performance-you-need-when-you-need-it/
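
As a sketch with boto3, the choice between elastic and classic resize is the `Classic` flag on `resize_cluster`; the cluster identifier is a hypothetical placeholder:

```python
# Parameters for a Redshift elastic resize that changes the node count.
# The cluster identifier is a hypothetical placeholder.
resize_params = {
    "ClusterIdentifier": "my-analytics-cluster",
    "NumberOfNodes": 4,
    "Classic": False,  # False => elastic resize; True => classic resize
}
# boto3.client("redshift").resize_cluster(**resize_params)
```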

21
Q
#EMR
Which compression algorithms are splittable?
A

BZIP2 and LZO are splittable - great for parallel processing

GZIP and Snappy are NOT splittable
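
A quick lookup based on this card (the file extensions are the usual conventions; `can_parallelize` is a made-up helper for illustration):

```python
# Splittability of common compression codecs, per this card.
SPLITTABLE = {
    ".bz2": True,      # BZIP2 - splittable
    ".lzo": True,      # LZO - splittable (when indexed)
    ".gz": False,      # GZIP - not splittable
    ".snappy": False,  # Snappy - not splittable
}

def can_parallelize(filename: str) -> bool:
    """True if a single file of this type can be split across mappers."""
    return any(filename.endswith(ext) and ok for ext, ok in SPLITTABLE.items())

print(can_parallelize("logs.bz2"))  # True
print(can_parallelize("logs.gz"))   # False
```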

22
Q
#EMR
What is HBase Read-replica in S3?
A

Amazon EMR version 5.7.0+ allows you to maintain read-only copies of HBase data in Amazon S3.

You can access the data from the read-replica cluster to perform read operations simultaneously, and in the event that the primary cluster becomes unavailable.

23
Q
#EMR
What are the 3 ways that HBase can integrate with S3?
A
  1. HBase on S3 (S3 storage mode) - storage of HBase StoreFiles and metadata in S3
  2. Snapshots of HBase to S3 - for HBase clusters that do not use S3 storage mode
  3. HBase read-replicas in S3
24
Q
#EMR
What is Ganglia?
A

Ganglia is a scalable, distributed system designed to monitor clusters and grids while minimizing the impact on their performance.

Ganglia is installed on the master node and is the operational dashboard provided with EMR.

25
Q
#EMR
What is Apache Sqoop?
A

Sqoop is an open-source system for transferring data between Hadoop and relational databases (in parallel).

You can use Sqoop to copy data from an RDBMS to EMR for analysis

26
Q
#QuickSight
What is a heat map chart used for?
A

A heat map is used to highlight outliers and trends using color.

27
Q
#QuickSight
What is a scatter plot used for?
What is a tree map used for?
A

A scatter plot is used for visualizing two or three measures for a dimension.

A tree map is used for visualizing one or two measures for a dimension.

28
Q
#EMR
What are the advantages of HBase on S3?
A
  1. The HBase root directory is stored in Amazon S3, including HBase store files and table metadata.
    The data persists outside the cluster and is available across AZs - no snapshot / recovery is needed
  2. Compute and storage are separated, so you can scale the EMR cluster to your compute requirements
  3. You can have a read-replica cluster and perform read operations concurrently, even when the primary cluster is not available
29
Q
#EMR
What is HCatalog?
A

HCatalog is a tool that allows you to access Hive metastore tables within Pig, Spark SQL, and/or custom MapReduce applications.

30
Q
#Redshift
Does Redshift support multi-AZ deployment?
A

No. Currently, Amazon Redshift only supports Single-AZ deployments.

  • You can create Redshift clusters in multiple AZs and stream / COPY the same data into them; OR
  • You can use Redshift Spectrum to query the same S3 data from Redshift clusters in different AZs
31
Q
#QuickSight
What is a pivot table used for?
A

You can use a pivot table to interactively explore multi-dimensional data, applying statistical functions to different rows and columns and sorting them in different ways.

32
Q
#EMR
Why use an external metastore for Hive? What are the choices for a Hive external metastore?
A

An external metastore is for persisting the metastore beyond the cluster's life span.

By default, the Hive metastore is in a MySQL database on the master node, but you can choose to use an external metastore:

  1. AWS Glue Data Catalog
  2. Amazon RDS or Aurora
33
Q
#EMR #S3
What is S3 Select? How does S3 Select work with EMR and Hive?
A

S3 Select allows applications to retrieve only a subset of data from an S3 object.

For EMR, S3 Select can improve the performance of data processing by reducing the data movement between S3 and EMR

This is also called S3 Select Push Down

Presto on EMR can use S3 Select Push Down

34
Q
#EMR
What is Hudi? Why we need it?
A

Hudi is an open-source data management framework used to simplify incremental data processing and data pipeline development by providing record-level insert, update, upsert, and delete capabilities.

  • Hudi is integrated with Spark, Hive, Presto, and EMR.
  • With Hudi, you can apply changes to a dataset over time.
35
Q
#EMR
What are the two Hudi dataset types?
A
  1. Copy on Write (CoW) - default.
    - Data is stored in a columnar format (Parquet), and each update creates a new version of files during a write;
    - Better suited for read-heavy workloads on data that changes less frequently.
  2. Merge on Read (MoR) - Data is stored using a combination of columnar (Parquet) and row-based (Avro) formats.
    - Updates are logged to row-based delta files and are compacted as needed to create new versions of the columnar files
    - Better suited for write- or change-heavy workloads with fewer reads.
36
Q
#EMR
What are the 3 logical view of Hudi datasets?
A
  1. Read Optimized View - default; the latest data for CoW and the latest compacted data for MoR (so it may not be the most recent data for MoR)
  2. Incremental View - a change stream between actions out of a CoW dataset, used to feed downstream jobs and ETL workflows
  3. Real-time View - the latest data from MoR
37
Q
#EMR
What are the 2 options you have using Jupyter notebooks on EMR?
A
  1. EMR Notebooks - have to be used in the EMR console; stored in S3; can be shared among clusters; a cluster can have multiple notebooks
  2. JupyterHub - a container running on the EMR master node that can be used by multiple users; you can access it via the UI or CLI; notebooks are stored on the master node's file system, BUT you can configure them to persist to S3
38
Q
#EMR
What is Livy?
A

Livy, which is included in EMR, provides a REST interface to Spark on EMR. You can submit jobs or scripts and manage Spark contexts through it.

The JupyterHub container uses Livy.

39
Q
#EMR
What is Mahout?
A

Mahout is a machine learning framework (libraries and tools) for Hadoop and Spark.

Mahout uses Hadoop to distribute compute across the cluster for clustering, classification, recommendation etc.

40
Q
#EMR
What is MXNet?
A

Apache MXNet is an acceleration library designed for building neural networks and other deep learning applications.

MXNet is included in EMR

41
Q
#EMR
What is Oozie?
A

Oozie is a Hadoop job scheduler and is included in EMR.

With EMR, you need to access Oozie via the Hue Oozie application, since the native interface is not supported in EMR

Oozie by default persists user information and query history to a local MySQL database on the master node, but you can configure it to persist to S3 and RDS

42
Q
#EMR
What is Phoenix?
A

Apache Phoenix is used for OLTP and operational analytics with SQL against an HBase store.

You can connect to Phoenix using a JDBC client, or a thin client to the Phoenix Query Server running on the master node

43
Q
#EMR
What is Pig?
A

Apache Pig is an open-source library that runs on top of Hadoop for data processing

  • Pig has SQL-like commands written in a language called Pig Latin
  • Pig converts those commands into Tez jobs based on DAGs, or MapReduce programs
  • You can run Pig commands interactively or in batch as EMR cluster steps (from S3)
  • You can configure Pig to write output to HCatalog
44
Q
#EMR
When to use S3 Select Push Down?
A
  • if your query filters out 50%+ of the original data
  • if your query predicates are supported by S3 Select (note: timestamp is NOT supported)
  • if your S3 object is in CSV format, optionally compressed with gzip or bzip2
  • S3 SSE-C and client-side encryption are NOT supported

S3 Select is not a replacement for columnar (ORC or Parquet) or compressed file formats
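
Outside EMR, the same push-down idea is exposed directly through the S3 API. A hedged sketch of the parameters for `select_object_content`; bucket, key, and column references are hypothetical placeholders:

```python
# Parameters for s3.select_object_content: run SQL against a single
# gzipped CSV object and receive only the matching bytes back.
# Bucket, key, and column positions are hypothetical placeholders.
select_params = {
    "Bucket": "my-bucket",
    "Key": "data/records.csv.gz",
    "ExpressionType": "SQL",
    "Expression": "SELECT s._1, s._3 FROM S3Object s WHERE s._2 = 'US'",
    "InputSerialization": {"CSV": {"FileHeaderInfo": "NONE"}, "CompressionType": "GZIP"},
    "OutputSerialization": {"CSV": {}},
}
# resp = boto3.client("s3").select_object_content(**select_params)
```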

45
Q
#EMR
What is Presto Graceful Auto Scaling?
A

EMR allows you to set a grace period for Presto tasks to keep running before the node terminates because of a scale-in resize action or an automatic scaling policy request.

46
Q
#EMR
Can Spark on EMR work with Amazon SageMaker?
A

Yes. SageMaker Spark SDK is included with Spark on EMR and can be used to create ML pipelines with SageMaker Steps.

47
Q
#EMR
What is Sqoop?
A

Apache Sqoop is a tool for transferring data between Amazon S3, Hadoop, HDFS, and RDBMS databases.

Sqoop supports HCatalog and JDBC connections, and is included in EMR

48
Q
#EMR #DynamoDB
What operations can you do on DynamoDB from an EMR cluster?
A

You can use EMR with Hive to load (DDB –> HDFS), query (DDB), join (DDB), import (S3 –> DDB), and export (DDB –> S3)