Amazon EMR | Using Pig Flashcards
Does Impala support ODBC and JDBC drivers?
While you can use ODBC drivers, Impala is also a great engine for third-party tools connected through JDBC. You can download and install the Impala client JDBC driver from http://elasticmapreduce.s3.amazonaws.com/libs/impala/1.2.1/impala-jdbc-1.2.1.zip. From the client computer where you have your business intelligence tool installed, connect the JDBC driver to the master node of an Impala cluster using SSH or a VPN on port 21050. For more information, see Open an SSH Tunnel to the Master Node.
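As a rough illustration, the sketch below queries Impala from Python with the jaydebeapi package over a local SSH tunnel. The driver class name, JAR path, host name, key file, and table name are all placeholders, not values from this FAQ; take the actual class name and JAR from the driver bundle you downloaded.

    # Hypothetical sketch: query Impala over JDBC through an SSH tunnel.
    # First, open the tunnel in a separate shell (placeholder host/key names):
    #   ssh -i my-key.pem -N -L 21050:localhost:21050 hadoop@master-public-dns
    import jaydebeapi

    # Driver class and JAR path are assumptions; both vary by driver version.
    conn = jaydebeapi.connect(
        'com.cloudera.impala.jdbc41.Driver',
        'jdbc:impala://localhost:21050',
        jars='/path/to/ImpalaJDBC41.jar',
    )
    cur = conn.cursor()
    cur.execute('SELECT COUNT(*) FROM my_table')  # my_table is a placeholder
    print(cur.fetchall())
    cur.close()
    conn.close()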
What is Apache Pig?
Pig is an open-source analytics package that runs on top of Hadoop. It is driven by a SQL-like language called Pig Latin, which lets users structure, summarize, and query data sources stored in Amazon S3. In addition to SQL-like operations, Pig Latin adds first-class support for map/reduce functions and for complex, extensible user-defined data types. This capability allows processing of complex and even unstructured data sources, such as text documents and log files. Pig can be extended through user-defined functions written in Java and deployed via storage in Amazon S3.
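As a minimal sketch of the kind of script this describes, the Pig Latin below loads a tab-separated file from S3, groups it, and summarizes it; the bucket, paths, and field names are hypothetical. It is shown as a Python string so it can be staged in S3 with boto3 and later referenced by a cluster step.

    # Hypothetical sketch: a small Pig Latin script that structures,
    # summarizes, and stores data in S3 (placeholder bucket and fields).
    import boto3

    PIG_SCRIPT = """
    -- Load a tab-separated file, total visits per country, write back to S3.
    raw     = LOAD 's3://my-bucket/input/visits.tsv'
              AS (name:chararray, country:chararray, visits:int);
    by_ctry = GROUP raw BY country;
    totals  = FOREACH by_ctry GENERATE group AS country,
              SUM(raw.visits) AS total_visits;
    STORE totals INTO 's3://my-bucket/output/visits-by-country';
    """

    # Stage the script in S3 so a cluster step can reference it later.
    boto3.client('s3').put_object(
        Bucket='my-bucket', Key='scripts/visits.pig',
        Body=PIG_SCRIPT.encode('utf-8'))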
What can I do with Pig running on Amazon EMR?
Using Pig with Amazon EMR, you can implement sophisticated data-processing applications with a familiar SQL-like language and the easy-to-use tools available with Amazon EMR. With Amazon EMR, you can turn your Pig applications into a reliable data warehouse to perform tasks such as data analytics, monitoring, and business intelligence.
How can I get started with Pig running on Amazon EMR?
The best place to start is to review the written Amazon EMR documentation.
Are there new features in Pig specific to Amazon EMR?
Yes. There are three new features that make Pig even more powerful when used with Amazon EMR:
a/ Accessing multiple filesystems. By default, a Pig job can only access one remote file system, be it an HDFS store or an S3 bucket, for input, output, and temporary data. EMR has extended Pig so that any job can access as many file systems as it needs. An advantage of this is that temporary intra-job data is always stored on the local HDFS, leading to improved performance.
b/ Loading resources from S3. EMR has extended Pig so that custom JARs and scripts can come from the S3 file system, for example "REGISTER s3://my-bucket/piggybank.jar" (see the sketch after this list).
c/ Additional Piggybank functions for String and DateTime processing.
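To make features a/ and b/ concrete, here is a hedged sketch of one script that registers a JAR from S3, keeps intermediate data on the cluster's local HDFS, and writes final output to a different S3 bucket. The bucket names, paths, and the Piggybank UPPER function usage are illustrative assumptions.

    # Hypothetical sketch: one Pig script touching several filesystems,
    # shown as a Python string that could be staged and run as above.
    PIG_MULTI_FS = """
    REGISTER s3://my-bucket/piggybank.jar;  -- custom JAR loaded from S3 (feature b/)
    DEFINE UP org.apache.pig.piggybank.evaluation.string.UPPER();
    raw    = LOAD 's3://input-bucket/logs/' AS (line:chararray);
    upper  = FOREACH raw GENERATE UP(line);
    -- Intermediate data can sit on the cluster's local HDFS (feature a/) ...
    STORE upper INTO 'hdfs:///tmp/staged';
    staged = LOAD 'hdfs:///tmp/staged' AS (line:chararray);
    -- ... while final output goes back to a different S3 bucket.
    STORE staged INTO 's3://output-bucket/processed/';
    """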
What types of Pig clusters are supported?
There are two types of clusters supported with Pig: interactive and batch. In interactive mode, a customer can start a cluster and run Pig scripts interactively, directly on the master node. Typically, this mode is used for ad hoc data analysis and application development. In batch mode, the Pig script is stored in Amazon S3 and is referenced at the start of the cluster. Typically, batch mode is used for repeatable runs such as report generation.
How can I launch a Pig cluster?
Both batch and interactive clusters can be started from the AWS Management Console, the EMR command line client, or the APIs.
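For the API route, a hedged boto3 sketch follows. The region, release label, instance types and counts, and key name are placeholders, and the IAM roles assume the EMR default roles already exist in the account.

    # Hypothetical sketch: launch a Pig cluster through the EMR API.
    import boto3

    emr = boto3.client('emr', region_name='us-east-1')  # placeholder region
    resp = emr.run_job_flow(
        Name='pig-cluster',
        ReleaseLabel='emr-6.15.0',           # placeholder release
        Applications=[{'Name': 'Pig'}],      # install Pig on the cluster
        Instances={
            'MasterInstanceType': 'm5.xlarge',
            'SlaveInstanceType': 'm5.xlarge',
            'InstanceCount': 3,
            'Ec2KeyName': 'my-key',               # lets you SSH in (interactive mode)
            'KeepJobFlowAliveWhenNoSteps': True,  # False would auto-terminate (batch mode)
        },
        JobFlowRole='EMR_EC2_DefaultRole',   # assumes the default roles exist
        ServiceRole='EMR_DefaultRole',
    )
    print('Cluster started:', resp['JobFlowId'])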
What version of Pig does Amazon EMR support?
Amazon EMR supports multiple versions of Pig; the version available is determined by the release you choose when launching the cluster.
Can I write to an S3 bucket from two clusters concurrently?
Yes, you can write to the same bucket from two concurrent clusters.
Can I share input data in S3 between clusters?
Yes, you can read the same data in S3 from two concurrent clusters.
Can data be shared between multiple AWS users?
Yes. Data can be shared using the standard Amazon S3 sharing mechanisms described at http://docs.amazonwebservices.com/AmazonS3/latest/index.html?S3_ACLs.html.
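As a small example of that mechanism, the boto3 call below grants another AWS account read access to a single object via its canonical user ID; the bucket, key, and ID are placeholders, and a bucket policy is an alternative way to achieve the same thing.

    # Hypothetical sketch: grant another account read access to one object.
    import boto3

    boto3.client('s3').put_object_acl(
        Bucket='my-bucket',               # placeholder bucket
        Key='shared/data/part-00000',     # placeholder key
        GrantRead='id="CANONICAL_USER_ID_OF_OTHER_ACCOUNT"',  # placeholder ID
    )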
Should I run one large cluster shared amongst many users, or many smaller clusters?
Amazon EMR provides a unique capability for you to use both methods. On the one hand, one large cluster may be more efficient for processing regular batch workloads. On the other hand, if you require ad hoc querying or have workloads that vary with time, you may choose to create several separate clusters, each tuned to a specific task, sharing data sources stored in Amazon S3.
Can I access a script or jar resource which is on my local file system?
No. You must upload the script or JAR to Amazon S3 or to the cluster's master node before it can be referenced. For uploading to Amazon S3, you can use tools such as s3cmd, JetS3t, or S3Organizer.
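With current tooling you could equally use the AWS CLI or an SDK; a minimal boto3 sketch (placeholder file and bucket names) is:

    # Hypothetical sketch: stage a local JAR in S3 so Pig can REGISTER it.
    import boto3

    boto3.client('s3').upload_file(
        'piggybank.jar',        # local file (placeholder)
        'my-bucket',            # placeholder bucket
        'libs/piggybank.jar',   # key the Pig script will reference
    )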
Can I run a persistent cluster executing multiple Pig queries?
Yes. You can run a cluster in manual termination mode so that it will not terminate between Pig steps. To reduce the risk of data loss, we recommend periodically persisting all important data to Amazon S3. It is good practice to regularly transfer your work to a new cluster to test your process for recovering from master node failure.
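Assuming a cluster launched with auto-termination disabled (as in the earlier launch sketch), a hedged boto3 sketch for queuing further Pig steps against it follows; the cluster ID and script path are placeholders, and the exact command-runner argument form should be verified against the EMR release you use.

    # Hypothetical sketch: add another Pig step to a long-running cluster.
    import boto3

    emr = boto3.client('emr')
    emr.add_job_flow_steps(
        JobFlowId='j-XXXXXXXXXXXXX',   # placeholder cluster ID
        Steps=[{
            'Name': 'nightly-report',
            'ActionOnFailure': 'CONTINUE',   # keep the cluster alive on failure
            'HadoopJarStep': {
                'Jar': 'command-runner.jar',
                # Argument form varies by EMR release; verify in the docs.
                'Args': ['pig-script', '--run-pig-script', '--args',
                         '-f', 's3://my-bucket/scripts/visits.pig'],
            },
        }],
    )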