Cloud Dataproc Flashcards
What is Cloud Dataproc?
A fully managed data processing service for running Apache Spark and Apache Hadoop clusters
What are key points of Cloud Dataproc?
- Compatible with Apache Hadoop, Spark and Hive
- Runs in clusters
- Allows existing projects to be moved without redevelopment
- Fast cluster creation - can create workflow templates (see the example after this list)
- Can scale clusters without stopping jobs
- Can choose from different image versions (Hadoop/Spark versions)
- Can handle streaming and batch data
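For illustration, a workflow template might be created like this (the template name and region are placeholders, not from the source):
gcloud dataproc workflow-templates create my-template --region=us-central1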
When do you choose Cloud Dataproc over Cloud Dataflow?
If you have dependencies on Hadoop or Spark, or if you want more hands-on management and control.
How do you create a Cloud Dataproc cluster from the command line?
gcloud dataproc clusters create [CLUSTER NAME] --zone [ZONE]
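For example, with placeholder values for the cluster name and zone:
gcloud dataproc clusters create my-cluster --zone us-central1-a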
How do you submit a job to Cloud Dataproc via the shell?
gcloud dataproc jobs submit [TYPE] --cluster [CLUSTER NAME] --jar [JAR FILE]
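For example, submitting a Spark job (the cluster name and jar path are placeholders):
gcloud dataproc jobs submit spark --cluster my-cluster --jar gs://my-bucket/my-job.jar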
What cluster modes can you choose when setting up Cloud Dataproc?
- Single Node - master and workers on one VM, for development
- Standard - one master node plus workers
- High Availability - three master nodes
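The mode is typically selected at creation time with flags, for example (cluster names are placeholders):
gcloud dataproc clusters create dev-cluster --single-node
gcloud dataproc clusters create prod-cluster --num-masters 3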
What job types are available for Cloud Dataproc?
- Spark
- PySpark
- SparkR
- Hive
- Spark SQL
- Pig
- Hadoop
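For example, submitting a PySpark job (the script path and cluster name are placeholders):
gcloud dataproc jobs submit pyspark gs://my-bucket/word_count.py --cluster my-cluster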
How do you import or export data to Cloud Dataproc?
You don’t. It’s a data analysis platform, not a database.
You can, however, export and import the cluster configuration:
gcloud beta dataproc clusters export [CLUSTER NAME] --destination=[PATH TO EXPORT FILE]
gcloud beta dataproc clusters import [CLUSTER NAME] --source=[SOURCE FILE]