5_Dataproc Flashcards
What is Cloud Dataproc?
- On-demand, managed cluster service for Hadoop and Apache Spark
- Managed but not NoOps
- Must configure the cluster yourself; scaling is not automatic (autoscaling is opt-in)
- Greatly reduces administrative overhead
- Integrates with other Google Cloud services:
- Separate data from the cluster - save costs
- Familiar Hadoop/Spark ecosystem environment
- Easy to move existing projects -> Lift and shift to GCP
- Based on Apache Bigtop distribution:
- Hadoop, Spark, Hive, Pig
- HDFS available (but maybe not optimal)
- Other ecosystem tools can be installed as well via initialization actions
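A minimal sketch of spinning up a cluster with an initialization action that installs extra tools (the cluster name, region, and gs:// script path are placeholders):

    gcloud dataproc clusters create demo-cluster \
        --region=us-central1 \
        --num-workers=2 \
        --initialization-actions=gs://my-init-bucket/install-extra-tools.sh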
Lift & Shift vs. Lift & Leverage
Lift and Shift is simply migrating existing processes or infrastructure to the cloud without redesigning anything. Some companies might take this approach because they don’t want to spend time and money modifying their current infrastructure but still want to use the benefits of the cloud. For example, hosting a MySQL database on compute engines rather than migrating all the data to Cloud SQL.
Lift and Leverage means to move your existing processes over to the cloud and make them better using some of the services the cloud has to offer. For example, you can use Dataproc to run your Hadoop and Spark workloads but store the data in Cloud Storage as opposed to storing it in HDFS. This is more cost-efficient since you only pay for the time the job is running and then you can shut down the cluster when you aren’t using it without losing all its data since it is now stored in Cloud Storage.
Using Dataproc
- Submit Hadoop/Spark Jobs
- Dataproc is interoperable and compatible with these open source tools
- Enable Autoscaling
- If necessary to cope with the load of the job
- Output to GCP services
- E.g. Google Cloud Storage, BigQuery and Bigtable
- Monitor with Stackdriver
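For example, a PySpark job submitted to an existing cluster that reads from and writes to Cloud Storage (bucket, script, and cluster names are placeholders):

    # Arguments after "--" are passed to the job itself
    gcloud dataproc jobs submit pyspark gs://my-bucket/jobs/wordcount.py \
        --cluster=demo-cluster \
        --region=us-central1 \
        -- gs://my-bucket/input/ gs://my-bucket/output/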
Cluster Types
- Single Node Cluster
  - Master, worker, and HDFS on the same node
- Standard Cluster
  - 1 master node, several worker nodes
- High Availability Cluster
  - 3 master nodes (YARN and HDFS configured for high availability)
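Rough create commands for each type (names and region are placeholders):

    # Single node: master, worker, and HDFS on one VM
    gcloud dataproc clusters create single-demo --region=us-central1 --single-node

    # Standard: 1 master, N workers
    gcloud dataproc clusters create std-demo --region=us-central1 --num-workers=4

    # High availability: 3 masters
    gcloud dataproc clusters create ha-demo --region=us-central1 --num-masters=3 --num-workers=4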
Preemptible VM
- Excellent low-cost worker nodes
- Can be reclaimed and removed from the cluster if they are required by Google Cloud for other tasks.
- Dataproc manages the entire leave/join process:
- No need to configure startup/shutdown scripts
- Just add PVMs and that's it
- No assigned disks for HDFS (only disk for caching)
- Ideally you want a mix of standard + PVM worker nodes
- Number of preemptible workers in your cluster should be less than 50% of the total number of all workers
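Sketch of a mixed cluster that keeps preemptibles below 50% of all workers (names are placeholders):

    # 4 standard workers + 3 preemptible workers (3/7 < 50%)
    gcloud dataproc clusters create mixed-demo \
        --region=us-central1 \
        --num-workers=4 \
        --num-preemptible-workers=3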
Submitting Jobs
- gcloud Command line
- GCP Console
- Dataproc API
- SSH to Master Node
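Two of these in practice, assuming a cluster named demo-cluster (the SparkPi class and example jar ship with Spark on Dataproc):

    # Submit via the gcloud command line
    gcloud dataproc jobs submit spark \
        --cluster=demo-cluster \
        --region=us-central1 \
        --class=org.apache.spark.examples.SparkPi \
        --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar \
        -- 1000

    # Or SSH to the master node (named <cluster-name>-m) and run spark-submit directly
    gcloud compute ssh demo-cluster-m --zone=us-central1-a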
IAM
- Project level only (primitive and predefined roles)
- Cloud Dataproc Admin, Editor, Viewer, Worker
- Admin: Full access to create/delete/edit clusters/jobs/workflows, plus get/set Dataproc IAM policies
- Editor: Full access to create/delete/edit clusters/jobs/workflows
- Viewer: View access only
- Worker: Assigned to service accounts only:
  - Read/write GCS, write to Cloud Logging
https://cloud.google.com/dataproc/docs/concepts/iam/iam#roles
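Roles are granted with a normal project-level IAM binding, e.g. the Worker role for a service account (project and account names are placeholders):

    gcloud projects add-iam-policy-binding my-project \
        --member=serviceAccount:dataproc-sa@my-project.iam.gserviceaccount.com \
        --role=roles/dataproc.worker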
Updating Clusters
- Can only change the # of workers / preemptible VMs / labels, or toggle graceful decommission
- Automatically reshards data for you
- gcloud dataproc clusters update [cluster_name] --num-workers [#] --num-preemptible-workers [#]
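Scaling down can also be paired with a graceful decommission window so running work finishes before nodes are removed (names and values are placeholders):

    gcloud dataproc clusters update demo-cluster \
        --region=us-central1 \
        --num-workers=2 \
        --graceful-decommission-timeout=1h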
Custom Clusters
- Dataproc clusters can be provisioned with a custom image that includes a user’s pre-installed packages.
- You can also customize a cluster that uses the default image via:
- Custom cluster properties: let you modify properties in common configuration files like core-site.xml, removing the need to change property files by hand or through an initialization action.
- Initialization actions: optional executable scripts that run during cluster setup. They let you install additional components, stage files, or otherwise customize the nodes.
- Custom Java/Scala dependencies (typically for Spark jobs)
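Rough examples of both approaches; the property value, bucket path, and image name are placeholders:

    # Default image, customized via cluster properties ("core" prefix = core-site.xml)
    # and an initialization action
    gcloud dataproc clusters create custom-demo \
        --region=us-central1 \
        --properties=core:io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec \
        --initialization-actions=gs://my-init-bucket/install-extra-tools.sh

    # Custom image with pre-installed packages
    gcloud dataproc clusters create image-demo \
        --region=us-central1 \
        --image=projects/my-project/global/images/my-dataproc-image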
Autoscaling
Do not use autoscaling with:
- High Availability Clusters
- HDFS
- Spark Structured Streaming
- Idle Clusters
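Where autoscaling does fit (e.g. an ephemeral batch cluster whose data lives in GCS), you define a policy in a YAML file (worker min/max instances plus YARN scale-up/scale-down factors), import it, and attach it to a cluster; the policy ID, file, and region below are placeholders:

    # policy.yaml holds workerConfig min/max instances and basicAlgorithm.yarnConfig thresholds
    gcloud dataproc autoscaling-policies import my-policy \
        --source=policy.yaml --region=us-central1

    # Attach the policy when creating (or updating) a cluster
    gcloud dataproc clusters create autoscale-demo \
        --region=us-central1 --autoscaling-policy=my-policy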
Cloud Storage Connector
- Use GCS instead of HDFS
- Cheaper than persistent disk
- High availability and durability
- Decouple storage from cluster lifecycle
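Because the connector is pre-installed on Dataproc, existing jobs can point at gs:// paths wherever they would have used hdfs:// paths, e.g. the stock Hadoop wordcount example against a bucket (bucket name is a placeholder):

    gcloud dataproc jobs submit hadoop \
        --cluster=demo-cluster \
        --region=us-central1 \
        --jar=file:///usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar \
        -- wordcount gs://my-bucket/input gs://my-bucket/output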
Migrating and Optimizing for Google Cloud
What are we moving/optimizing?
- Data (from HDFS)
- Jobs (pointing to Google Cloud locations)
- Treating clusters as ephemeral (temporary) rather than permanent entities
Migration best practices:
- Move data first, generally to Cloud Storage buckets
- Possible exception:
- Apache HBase data to Bigtable
- Apache Impala to BigQuery
- Can still choose to move to GCS if Bigtable/BigQuery features not needed
- Possible exception:
- Small-scale experimentation (proof of concept)
- Use a subset of data to test
- Think of it in terms of ephemeral clusters
- Separate storage and compute
- Optimize for the Cloud: Lift and Leverage
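A common first step is copying HDFS data into a bucket with DistCp, run from the existing cluster (a non-Dataproc cluster also needs the Cloud Storage connector installed); the paths and bucket name are placeholders:

    hadoop distcp hdfs:///user/data/ gs://my-migration-bucket/data/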
Performance Optimization
- Keep data close to your cluster
- Place Dataproc cluster in the same region as storage bucket
- Larger persistent disks = better performance
- Consider SSD over HDD - slightly higher cost
- Move/keep data on HDFS if workloads involve heavy I/O
- Allocate more VMs
- Use preemptible VMs to save on costs
- If you need more disk throughput, adding more VMs costs more than provisioning larger disks
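Putting several of these together, a rough I/O-oriented cluster (names, sizes, and region are placeholders; the staging bucket should sit in the same region as the cluster):

    gcloud dataproc clusters create perf-demo \
        --region=us-central1 \
        --num-workers=4 \
        --worker-boot-disk-type=pd-ssd \
        --worker-boot-disk-size=500GB \
        --bucket=my-staging-bucket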