5_Dataproc Flashcards
Cluster Types
- Single Node Cluster
  - Master, Worker, HDFS all on the same node
- Standard Cluster
  - 1 Master Node, several worker nodes
- High Availability Cluster
  - 3 Master Nodes, several worker nodes
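A minimal sketch of creating each cluster type with gcloud (cluster names and region are placeholders; verify flags such as --single-node and --num-masters against your gcloud version):

# Single node: one VM runs master, worker and HDFS
gcloud dataproc clusters create my-single-node --region=us-central1 --single-node

# Standard: 1 master, N workers
gcloud dataproc clusters create my-standard --region=us-central1 --num-workers=2

# High availability: 3 masters
gcloud dataproc clusters create my-ha --region=us-central1 --num-masters=3 --num-workers=2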
Preemptible VM
- Excellent low-cost worker nodes
- Can be reclaimed and removed from the cluster if they are required by Google Cloud for other tasks.
- Dataproc manages the entire leave/join process:
- No need to configure startup/shutdown scripts
- Just add PVMs and that's it
- No persistent disks assigned for HDFS (local disk is used only for caching)
- Ideally you want a mix of standard + PVM worker nodes
- Number of preemptible workers in your cluster should be less than 50% of the total number of all workers
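A minimal sketch of mixing standard and preemptible workers (names and region are placeholders; newer gcloud releases also accept --num-secondary-workers for the same setting):

# 4 standard + 2 preemptible workers keeps PVMs under 50% of the total
gcloud dataproc clusters create my-cluster --region=us-central1 --num-workers=4 --num-preemptible-workers=2

# Add or remove PVMs later; Dataproc handles the join/leave lifecycle
gcloud dataproc clusters update my-cluster --region=us-central1 --num-preemptible-workers=3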
Updating Clusters
- Can only change the number of workers, the number of preemptible VMs, labels, and toggle graceful decommissioning
- Automatically reshards data for you
- gcloud dataproc clusters update [cluster_name] --num-workers [#] --num-preemptible-workers [#]
Custom Clusters
- Dataproc clusters can be provisioned with a custom image that includes a user’s pre-installed packages.
- You can also customise a cluster that uses the default image using:
- Custom cluster properties: allow you to modify properties in common configuration files like core-site.xml, removing the need to change property files by hand or via an initialization action.
- Custom Java/Scala dependencies (typically for Spark jobs)
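A minimal sketch of both options (cluster/bucket names, the property values and the JAR path are placeholders):

# Cluster properties use the prefix:property=value format (core, hdfs, yarn, spark, ...)
gcloud dataproc clusters create my-cluster --region=us-central1 \
  --properties=spark:spark.executor.memory=4g,core:hadoop.tmp.dir=/mnt/tmp

# Extra Java/Scala dependencies supplied at job submission time
gcloud dataproc jobs submit spark --cluster=my-cluster --region=us-central1 \
  --class=com.example.Main --jars=gs://my-bucket/libs/my-lib.jar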
- When creating a Dataproc cluster, you can specify initialization actions in executables or scripts that Dataproc will run on all nodes in your Dataproc cluster immediately after the cluster is set up. Initialization actions often set up job dependencies, such as installing Python packages, so that jobs can be submitted to the cluster without having to install dependencies when the jobs are run.
- If you create a Dataproc cluster with internal IP addresses only, attempts to access the internet in an initialization action will fail unless you have configured routes to direct the traffic through Cloud NAT or a Cloud VPN. Without access to the internet, you can enable Private Google Access and place job dependencies in Cloud Storage; cluster nodes can download the dependencies from Cloud Storage from internal IPs.
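A minimal sketch of an initialization action, including the internal-IP-only case (bucket, script, subnet and cluster names are placeholders; the script itself must already exist in Cloud Storage):

# Run a script from GCS on every node right after cluster setup
gcloud dataproc clusters create my-cluster --region=us-central1 \
  --initialization-actions=gs://my-bucket/init/install-python-deps.sh

# Internal-IP-only cluster (--no-address): the script should fetch its
# dependencies from Cloud Storage, not the public internet
gcloud dataproc clusters create my-private-cluster --region=us-central1 \
  --subnet=my-subnet --no-address \
  --initialization-actions=gs://my-bucket/init/install-python-deps.sh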
Storage
HDFS storage on Dataproc is built on top of persistent disks (PDs) attached to worker nodes. This means data stored on HDFS is transient (unless it is copied to GCS or other persistent storage) and comes with relatively higher storage costs, so it is recommended to minimize the use of HDFS storage. However, there might be valid scenarios where you need to maintain a small HDFS footprint, specifically for performance reasons. In such cases, you can provision Dataproc clusters with limited HDFS storage and offload all persistent storage needs to GCS.
REF: https://cloud.google.com/blog/topics/developers-practitioners/dataproc-best-practices-guide
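A minimal sketch of keeping HDFS small and pointing jobs at GCS instead (disk size, bucket paths and the PySpark script are placeholders):

# Modest worker persistent disks limit the HDFS footprint
gcloud dataproc clusters create my-cluster --region=us-central1 \
  --num-workers=2 --worker-boot-disk-size=100GB

# Jobs read/write Cloud Storage directly via gs:// paths (the GCS connector is preinstalled)
gcloud dataproc jobs submit pyspark gs://my-bucket/jobs/wordcount.py \
  --cluster=my-cluster --region=us-central1 \
  -- gs://my-bucket/input/ gs://my-bucket/output/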
Autoscaling
Do not use autoscaling with:
- High Availability Clusters
- HDFS
- Spark Structured Streaming
- Idle Clusters
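Where autoscaling does fit, a policy is defined once and attached at cluster creation. A rough sketch, assuming the basic YARN-based algorithm with placeholder bounds (field names per the AutoscalingPolicy API; double-check against current docs):

# autoscaling-policy.yaml
workerConfig:
  minInstances: 2
  maxInstances: 10
secondaryWorkerConfig:
  maxInstances: 20
basicAlgorithm:
  cooldownPeriod: 4m
  yarnConfig:
    scaleUpFactor: 0.5
    scaleDownFactor: 1.0
    gracefulDecommissionTimeout: 1h

gcloud dataproc autoscaling-policies import my-policy --source=autoscaling-policy.yaml --region=us-central1
gcloud dataproc clusters create my-cluster --region=us-central1 --autoscaling-policy=my-policy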
Graceful Decommissioning
When you downscale a cluster, work in progress may stop before completion. You can use Graceful Decommissioning to finish work in progress on a worker before it is removed from the Cloud Dataproc cluster.
The preemptible (secondary) worker group continues to provision or delete workers to reach its expected size even after a cluster scaling operation is marked complete.
If you attempt to gracefully decommission a secondary worker and receive an error message similar to the following:
“Secondary worker group cannot be modified outside of Dataproc. If you recently created or updated this cluster, wait a few minutes before gracefully decommissioning to allow all secondary instances to join or leave the cluster. Expected secondary worker group size: x, actual size: y”,
wait a few minutes then repeat the graceful decommissioning request.
The graceful decommission timeout should ideally be set longer than the longest-running job on the cluster.
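A minimal sketch of downscaling with graceful decommissioning (name, region and timeout are placeholders):

gcloud dataproc clusters update my-cluster --region=us-central1 \
  --num-workers=2 --graceful-decommission-timeout=2h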
Also note:
- You can forcefully decommission preemptible workers at any time.
- You can gracefully decommission primary workers at any time.