5_Dataproc Flashcards
Cluster Types
- Single Node Cluster
  - Master, Worker, HDFS all on the same node
- Standard Cluster
  - 1 Master Node, several worker nodes
- High Availability Cluster
  - 3 Master Nodes, several worker nodes
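A minimal sketch of creating each cluster type with gcloud (cluster names and region are placeholders; verify flags such as --single-node and --num-masters against your gcloud version):

# Single node: one VM runs master, worker and HDFS
gcloud dataproc clusters create my-single-node --region=us-central1 --single-node

# Standard: 1 master, N workers
gcloud dataproc clusters create my-standard --region=us-central1 --num-workers=2

# High availability: 3 masters
gcloud dataproc clusters create my-ha --region=us-central1 --num-masters=3 --num-workers=2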
Preemptible VM
- Excellent low-cost worker nodes
- Can be reclaimed and removed from the cluster if they are required by Google Cloud for other tasks.
- Dataproc manages the entire leave/join process:
- No need to configure startup/shutdown scripts
- Just add PVMs and that's it
- No persistent disks assigned for HDFS (local disk is used only for caching)
- Ideally you want a mix of standard + PVM worker nodes
- Number of preemptible workers in your cluster should be less than 50% of the total number of all workers
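A minimal sketch of mixing standard and preemptible workers (names and region are placeholders; newer gcloud releases also accept --num-secondary-workers for the same setting):

# 4 standard + 2 preemptible workers keeps PVMs under 50% of the total
gcloud dataproc clusters create my-cluster --region=us-central1 --num-workers=4 --num-preemptible-workers=2

# Add or remove PVMs later; Dataproc handles the join/leave lifecycle
gcloud dataproc clusters update my-cluster --region=us-central1 --num-preemptible-workers=3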
Updating Clusters
- Can only change the number of workers, the number of preemptible VMs, labels, and toggle graceful decommissioning
- Automatically reshards data for you
- gcloud dataproc clusters update [cluster_name] --num-workers [#] --num-preemptible-workers [#]
Custom Clusters
- Dataproc clusters can be provisioned with a custom image that includes a user’s pre-installed packages.
- You can also customise a cluster that uses the default image using:
- Custom cluster properties: allow you to modify properties in common configuration files like core-site.xml, removing the need to change property files by hand or via an initialization action.
- Custom Java/Scala dependencies (typically for Spark jobs)
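A minimal sketch of both options (cluster/bucket names, the property values and the JAR path are placeholders):

# Cluster properties use the prefix:property=value format (core, hdfs, yarn, spark, ...)
gcloud dataproc clusters create my-cluster --region=us-central1 \
  --properties=spark:spark.executor.memory=4g,core:hadoop.tmp.dir=/mnt/tmp

# Extra Java/Scala dependencies supplied at job submission time
gcloud dataproc jobs submit spark --cluster=my-cluster --region=us-central1 \
  --class=com.example.Main --jars=gs://my-bucket/libs/my-lib.jar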
- When creating a Dataproc cluster, you can specify initialization actions in executables or scripts that Dataproc will run on all nodes in your Dataproc cluster immediately after the cluster is set up. Initialization actions often set up job dependencies, such as installing Python packages, so that jobs can be submitted to the cluster without having to install dependencies when the jobs are run.
- If you create a Dataproc cluster with internal IP addresses only, attempts to access the internet in an initialization action will fail unless you have configured routes to direct the traffic through Cloud NAT or a Cloud VPN. Without access to the internet, you can enable Private Google Access and place job dependencies in Cloud Storage; cluster nodes can download the dependencies from Cloud Storage from internal IPs.
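A minimal sketch of an initialization action, including the internal-IP-only case (bucket, script, subnet and cluster names are placeholders; the script itself must already exist in Cloud Storage):

# Run a script from GCS on every node right after cluster setup
gcloud dataproc clusters create my-cluster --region=us-central1 \
  --initialization-actions=gs://my-bucket/init/install-python-deps.sh

# Internal-IP-only cluster (--no-address): the script should fetch its
# dependencies from Cloud Storage, not the public internet
gcloud dataproc clusters create my-private-cluster --region=us-central1 \
  --subnet=my-subnet --no-address \
  --initialization-actions=gs://my-bucket/init/install-python-deps.sh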
Storage
HDFS storage on Dataproc is built on top of persistent disks (PDs) attached to worker nodes. This means data stored on HDFS is transient (unless it is copied to GCS or other persistent storage) and comes with relatively higher storage costs, so it is recommended to minimize the use of HDFS storage. However, there might be valid scenarios where you need to maintain a small HDFS footprint, specifically for performance reasons. In such cases, you can provision Dataproc clusters with limited HDFS storage and offload all persistent storage needs to GCS.
REF: https://cloud.google.com/blog/topics/developers-practitioners/dataproc-best-practices-guide
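A minimal sketch of keeping HDFS small and pointing jobs at GCS instead (disk size, bucket paths and the PySpark script are placeholders):

# Modest worker persistent disks limit the HDFS footprint
gcloud dataproc clusters create my-cluster --region=us-central1 \
  --num-workers=2 --worker-boot-disk-size=100GB

# Jobs read/write Cloud Storage directly via gs:// paths (the GCS connector is preinstalled)
gcloud dataproc jobs submit pyspark gs://my-bucket/jobs/wordcount.py \
  --cluster=my-cluster --region=us-central1 \
  -- gs://my-bucket/input/ gs://my-bucket/output/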
Autoscaling
Do not use autoscaling with:
- High Availability Clusters
- HDFS
- Spark Structured Streaming
- Idle Clusters
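Where autoscaling does fit, a policy is defined once and attached at cluster creation. A rough sketch, assuming the basic YARN-based algorithm with placeholder bounds (field names per the AutoscalingPolicy API; double-check against current docs):

# autoscaling-policy.yaml
workerConfig:
  minInstances: 2
  maxInstances: 10
secondaryWorkerConfig:
  maxInstances: 20
basicAlgorithm:
  cooldownPeriod: 4m
  yarnConfig:
    scaleUpFactor: 0.5
    scaleDownFactor: 1.0
    gracefulDecommissionTimeout: 1h

gcloud dataproc autoscaling-policies import my-policy --source=autoscaling-policy.yaml --region=us-central1
gcloud dataproc clusters create my-cluster --region=us-central1 --autoscaling-policy=my-policy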
Graceful Decommissioning
When you downscale a cluster, work in progress may stop before completion. You can use Graceful Decommissioning to finish work in progress on a worker before it is removed from the Cloud Dataproc cluster.
The preemptible (secondary) worker group continues to provision or delete workers to reach its expected size even after a cluster scaling operation is marked complete.
If you attempt to gracefully decommission a secondary worker and receive an error message similar to the following:
“Secondary worker group cannot be modified outside of Dataproc. If you recently created or updated this cluster, wait a few minutes before gracefully decommissioning to allow all secondary instances to join or leave the cluster. Expected secondary worker group size: x, actual size: y”,
wait a few minutes then repeat the graceful decommissioning request.
The graceful decommission timeout should ideally be set longer than the longest-running job on the cluster.
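A minimal sketch of downscaling with graceful decommissioning (name, region and timeout are placeholders):

gcloud dataproc clusters update my-cluster --region=us-central1 \
  --num-workers=2 --graceful-decommission-timeout=2h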
Also note:
- You can forcefully decommission preemptible workers at any time.
- You can gracefully decommission primary workers at any time.