Amazon EMR | Using Pig Flashcards
Does Impala support ODBC and JDBC drivers?
While you can use ODBC drivers, Impala is also a great engine for third-party tools connected through JDBC. You can download and install the Impala client JDBC driver from http://elasticmapreduce.s3.amazonaws.com/libs/impala/1.2.1/impala-jdbc-1.2.1.zip. From the client computer where you have your business intelligence tool installed, connect the JDBC driver to the master node of an Impala cluster using SSH or a VPN on port 21050. For more information, see Open an SSH Tunnel to the Master Node.
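As a rough illustration, the sketch below queries Impala from Python with the jaydebeapi package over a local SSH tunnel. The driver class name, JAR path, host name, key file, and table name are all placeholders, not values from this FAQ; take the actual class name and JAR from the driver bundle you downloaded.

    # Hypothetical sketch: query Impala over JDBC through an SSH tunnel.
    # First, open the tunnel in a separate shell (placeholder host/key names):
    #   ssh -i my-key.pem -N -L 21050:localhost:21050 hadoop@master-public-dns
    import jaydebeapi

    # Driver class and JAR path are assumptions; both vary by driver version.
    conn = jaydebeapi.connect(
        'com.cloudera.impala.jdbc41.Driver',
        'jdbc:impala://localhost:21050',
        jars='/path/to/ImpalaJDBC41.jar',
    )
    cur = conn.cursor()
    cur.execute('SELECT COUNT(*) FROM my_table')  # my_table is a placeholder
    print(cur.fetchall())
    cur.close()
    conn.close()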
What is Apache Pig?
Pig is an open-source analytics package that runs on top of Hadoop. It is driven by a SQL-like language called Pig Latin, which lets users structure, summarize, and query data sources stored in Amazon S3. In addition to SQL-like operations, Pig Latin adds first-class support for map/reduce functions and for complex, extensible user-defined data types. This capability allows processing of complex and even unstructured data sources, such as text documents and log files. Pig can be extended through user-defined functions written in Java and deployed via storage in Amazon S3.
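As a minimal sketch of the kind of script this describes, the Pig Latin below loads a tab-separated file from S3, groups it, and summarizes it; the bucket, paths, and field names are hypothetical. It is shown as a Python string so it can be staged in S3 with boto3 and later referenced by a cluster step.

    # Hypothetical sketch: a small Pig Latin script that structures,
    # summarizes, and stores data in S3 (placeholder bucket and fields).
    import boto3

    PIG_SCRIPT = """
    -- Load a tab-separated file, total visits per country, write back to S3.
    raw     = LOAD 's3://my-bucket/input/visits.tsv'
              AS (name:chararray, country:chararray, visits:int);
    by_ctry = GROUP raw BY country;
    totals  = FOREACH by_ctry GENERATE group AS country,
              SUM(raw.visits) AS total_visits;
    STORE totals INTO 's3://my-bucket/output/visits-by-country';
    """

    # Stage the script in S3 so a cluster step can reference it later.
    boto3.client('s3').put_object(
        Bucket='my-bucket', Key='scripts/visits.pig',
        Body=PIG_SCRIPT.encode('utf-8'))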
What can I do with Pig running on Amazon EMR?
Using Pig with Amazon EMR, you can implement sophisticated data-processing applications with a familiar SQL-like language and the easy-to-use tools available with Amazon EMR. With Amazon EMR, you can turn your Pig applications into a reliable data warehouse to perform tasks such as data analytics, monitoring, and business intelligence.
How can I get started with Pig running on Amazon EMR?
The best place to start is to review the written Amazon EMR documentation.
Are there new features in Pig specific to Amazon EMR?
Yes. There are three new features that make Pig even more powerful when used with Amazon EMR:
a/ Accessing multiple filesystems. By default, a Pig job can only access one remote file system, be it an HDFS store or an S3 bucket, for input, output, and temporary data. EMR has extended Pig so that any job can access as many file systems as it needs. An advantage of this is that temporary intra-job data is always stored on the local HDFS, leading to improved performance.
b/ Loading resources from S3. EMR has extended Pig so that custom JARs and scripts can come from the S3 file system, for example "REGISTER s3://my-bucket/piggybank.jar" (see the sketch after this list).
c/ Additional Piggybank functions for String and DateTime processing.
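To make features a/ and b/ concrete, here is a hedged sketch of one script that registers a JAR from S3, keeps intermediate data on the cluster's local HDFS, and writes final output to a different S3 bucket. The bucket names, paths, and the Piggybank UPPER function usage are illustrative assumptions.

    # Hypothetical sketch: one Pig script touching several filesystems,
    # shown as a Python string that could be staged and run as above.
    PIG_MULTI_FS = """
    REGISTER s3://my-bucket/piggybank.jar;  -- custom JAR loaded from S3 (feature b/)
    DEFINE UP org.apache.pig.piggybank.evaluation.string.UPPER();
    raw    = LOAD 's3://input-bucket/logs/' AS (line:chararray);
    upper  = FOREACH raw GENERATE UP(line);
    -- Intermediate data can sit on the cluster's local HDFS (feature a/) ...
    STORE upper INTO 'hdfs:///tmp/staged';
    staged = LOAD 'hdfs:///tmp/staged' AS (line:chararray);
    -- ... while final output goes back to a different S3 bucket.
    STORE staged INTO 's3://output-bucket/processed/';
    """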
What types of Pig clusters are supported?
There are two types of clusters supported with Pig: interactive and batch. In interactive mode, a customer can start a cluster and run Pig scripts interactively, directly on the master node. Typically, this mode is used for ad hoc data analysis and application development. In batch mode, the Pig script is stored in Amazon S3 and is referenced at the start of the cluster. Typically, batch mode is used for repeatable runs such as report generation.
How can I launch a Pig cluster?
Both batch and interactive clusters can be started from the AWS Management Console, the EMR command line client, or the APIs.
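For the API route, a hedged boto3 sketch follows. The region, release label, instance types and counts, and key name are placeholders, and the IAM roles assume the EMR default roles already exist in the account.

    # Hypothetical sketch: launch a Pig cluster through the EMR API.
    import boto3

    emr = boto3.client('emr', region_name='us-east-1')  # placeholder region
    resp = emr.run_job_flow(
        Name='pig-cluster',
        ReleaseLabel='emr-6.15.0',           # placeholder release
        Applications=[{'Name': 'Pig'}],      # install Pig on the cluster
        Instances={
            'MasterInstanceType': 'm5.xlarge',
            'SlaveInstanceType': 'm5.xlarge',
            'InstanceCount': 3,
            'Ec2KeyName': 'my-key',               # lets you SSH in (interactive mode)
            'KeepJobFlowAliveWhenNoSteps': True,  # False would auto-terminate (batch mode)
        },
        JobFlowRole='EMR_EC2_DefaultRole',   # assumes the default roles exist
        ServiceRole='EMR_DefaultRole',
    )
    print('Cluster started:', resp['JobFlowId'])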
What version of Pig does Amazon EMR support?
Amazon EMR supports multiple versions of Pig; the version available is determined by the release you choose when launching the cluster.
Can I write to an S3 bucket from two clusters concurrently?
Yes, you can write to the same bucket from two concurrent clusters.
Can I share input data in S3 between clusters?
Yes, you can read the same data in S3 from two concurrent clusters.
Can data be shared between multiple AWS users?
Yes. Data can be shared using the standard Amazon S3 sharing mechanisms described at http://docs.amazonwebservices.com/AmazonS3/latest/index.html?S3_ACLs.html.
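As a small example of that mechanism, the boto3 call below grants another AWS account read access to a single object via its canonical user ID; the bucket, key, and ID are placeholders, and a bucket policy is an alternative way to achieve the same thing.

    # Hypothetical sketch: grant another account read access to one object.
    import boto3

    boto3.client('s3').put_object_acl(
        Bucket='my-bucket',               # placeholder bucket
        Key='shared/data/part-00000',     # placeholder key
        GrantRead='id="CANONICAL_USER_ID_OF_OTHER_ACCOUNT"',  # placeholder ID
    )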
Should I run one large cluster shared amongst many users, or many smaller clusters?
Amazon EMR provides a unique capability for you to use both methods. On the one hand, one large cluster may be more efficient for processing regular batch workloads. On the other hand, if you require ad hoc querying or have workloads that vary with time, you may choose to create several separate clusters, each tuned to a specific task, sharing data sources stored in Amazon S3.
Can I access a script or jar resource which is on my local file system?
No. You must upload the script or JAR to Amazon S3 or to the cluster's master node before it can be referenced. For uploading to Amazon S3, you can use tools such as s3cmd, JetS3t, or S3Organizer.
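With current tooling you could equally use the AWS CLI or an SDK; a minimal boto3 sketch (placeholder file and bucket names) is:

    # Hypothetical sketch: stage a local JAR in S3 so Pig can REGISTER it.
    import boto3

    boto3.client('s3').upload_file(
        'piggybank.jar',        # local file (placeholder)
        'my-bucket',            # placeholder bucket
        'libs/piggybank.jar',   # key the Pig script will reference
    )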
Can I run a persistent cluster executing multiple Pig queries?
Yes. You can run a cluster in manual termination mode so that it will not terminate between Pig steps. To reduce the risk of data loss, we recommend periodically persisting all important data to Amazon S3. It is good practice to regularly transfer your work to a new cluster to test your process for recovering from master node failure.
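Assuming a cluster launched with auto-termination disabled (as in the earlier launch sketch), a hedged boto3 sketch for queuing further Pig steps against it follows; the cluster ID and script path are placeholders, and the exact command-runner argument form should be verified against the EMR release you use.

    # Hypothetical sketch: add another Pig step to a long-running cluster.
    import boto3

    emr = boto3.client('emr')
    emr.add_job_flow_steps(
        JobFlowId='j-XXXXXXXXXXXXX',   # placeholder cluster ID
        Steps=[{
            'Name': 'nightly-report',
            'ActionOnFailure': 'CONTINUE',   # keep the cluster alive on failure
            'HadoopJarStep': {
                'Jar': 'command-runner.jar',
                # Argument form varies by EMR release; verify in the docs.
                'Args': ['pig-script', '--run-pig-script', '--args',
                         '-f', 's3://my-bucket/scripts/visits.pig'],
            },
        }],
    )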