Cloud Dataproc Flashcards
What is Cloud Dataproc?
A fully managed data processing service for running Apache Spark and Apache Hadoop clusters
What are key points of Cloud Dataproc?
- Compatible with Apache Hadoop, Spark and Hive
- Runs in clusters
- Allows existing projects to be moved without redevelopment
- Fast cluster creation - can create workflow templates (see the example after this list)
- Can scale clusters without stopping jobs
- Can choose from different image versions (Hadoop/Spark versions)
- Can handle streaming and batch data
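For illustration, a workflow template might be created like this (the template name and region are placeholders, not from the source):
gcloud dataproc workflow-templates create my-template --region=us-central1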
When do you choose Cloud Dataproc over Cloud Dataflow?
If you have dependencies on Hadoop or Spark, or if you want more hands-on management and control.
How do you create a Cloud Dataproc cluster from the command line?
gcloud dataproc clusters create [CLUSTER NAME] --zone [ZONE]
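For example, with placeholder values for the cluster name and zone:
gcloud dataproc clusters create my-cluster --zone us-central1-a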
How do you submit a job to Cloud Dataproc via the shell?
gcloud dataproc jobs submit [TYPE] --cluster [CLUSTER NAME] --jar [JAR FILE]
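For example, submitting a Spark job (the cluster name and jar path are placeholders):
gcloud dataproc jobs submit spark --cluster my-cluster --jar gs://my-bucket/my-job.jar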
What cluster modes can you choose when setting up Cloud Dataproc?
- Single Node - master and workers on one VM, for development
- Standard - one master node plus workers
- High Availability - three master nodes
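The mode is typically selected at creation time with flags, for example (cluster names are placeholders):
gcloud dataproc clusters create dev-cluster --single-node
gcloud dataproc clusters create prod-cluster --num-masters 3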
What job types are available for Cloud Dataproc?
- Spark
- PySpark
- SparkR
- Hive
- Spark SQL
- Pig
- Hadoop
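For example, submitting a PySpark job (the script path and cluster name are placeholders):
gcloud dataproc jobs submit pyspark gs://my-bucket/word_count.py --cluster my-cluster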
How do you import or export data to Cloud Dataproc?
You don’t. It’s a data analysis platform, not a database.
You can, however, export and import the cluster configuration:
gcloud beta dataproc clusters export [CLUSTER NAME] --destination=[PATH TO EXPORT FILE]
gcloud beta dataproc clusters import [CLUSTER NAME] --source=[SOURCE FILE]