# Processing Hadoop Jobs with Dataproc on Google Cloud - Flashcards
- What is Dataproc in Google Cloud?
Dataproc is Google Cloud's managed service for running Apache Spark and Apache Hadoop workloads; it processes Hadoop jobs with open-source data tools for batch processing, querying, streaming, and machine learning.
- What are the benefits of using Dataproc for cluster management?
Dataproc automates cluster creation and management, making clusters easy to manage and cheaper to run, since they can be turned off when not in use.
- How does Dataproc compare to traditional on-premises solutions?
Compared with traditional on-premises deployments and other cloud services, Dataproc stands out for its low per-vCPU cost, fast cluster operations, and built-in management, and these advantages apply to clusters of all sizes.
- Is there a need to learn new tools when switching to Dataproc?
No. Dataproc uses the standard open-source Spark and Hadoop tools and APIs, so existing projects can be moved to it without redevelopment.
- What popular tools are updated and supported on Dataproc?
Popular tools like Spark, Hadoop, Pig, and Hive are frequently updated and supported.
- How is Dataproc priced?
Dataproc adds a fee of $0.01 per virtual CPU in the cluster per hour, billed on top of the other Google Cloud resources the cluster uses (such as Compute Engine VMs and persistent disks). For example, a 24-vCPU cluster that runs for 2 hours adds 24 × 2 × $0.01 = $0.48 in Dataproc fees.
- What are the cost benefits of Dataproc?
Clusters can include preemptible instances with lower compute prices, and billing is based on second-by-second usage with a one-minute minimum billing period.
- How fast can Dataproc clusters start, scale, and shut down?
Dataproc clusters start, scale, and shut down quickly, with each operation taking an average of 90 seconds or less.
- How flexible are the Dataproc clusters?
Clusters can be created and scaled rapidly, offering various virtual machine types, sizes, numbers of nodes, and networking options.
- Does Dataproc allow the use of open-source tools?
Yes. Clusters ship with native, frequently updated versions of Spark, Hadoop, Pig, and Hive, so existing open-source tools, libraries, and documentation continue to apply.
- How does Dataproc integrate with other Google Cloud Services?
Dataproc has built-in integration with Cloud Storage, BigQuery, and Cloud Bigtable, so data can live outside the cluster and is not lost when a cluster is deleted. Cloud Logging and Cloud Monitoring are also available, making Dataproc part of a complete data platform rather than just a Spark or Hadoop cluster.
- How does Dataproc handle ETL processes?
Dataproc makes it easy to ETL raw log data directly into BigQuery for business reporting.
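For example, a minimal PySpark sketch of such a pipeline, assuming the cluster's image bundles the spark-bigquery connector and that the bucket, dataset, and table names below (all placeholders) already exist:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("raw-log-etl").getOrCreate()

# Raw, newline-delimited JSON logs staged in Cloud Storage.
logs = spark.read.json("gs://my-raw-logs-bucket/logs/2024/05/*.json")

# Light transformation: keep a few fields and aggregate per URL.
report = (
    logs.select("url", "status", "bytes_sent")
        .groupBy("url")
        .agg(F.count("*").alias("hits"), F.sum("bytes_sent").alias("total_bytes"))
)

# The connector stages the load through a temporary Cloud Storage bucket.
(report.write.format("bigquery")
    .option("table", "my-project.reports.page_views_daily")
    .option("temporaryGcsBucket", "my-staging-bucket")
    .mode("overwrite")
    .save())
```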
- How does cluster management happen in Dataproc?
Users can interact with clusters and with Spark or Hadoop jobs without an administrator or special software, managing clusters through the Cloud Console, the Cloud SDK (gcloud), or the Dataproc REST API.
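As an illustration, a minimal sketch using the google-cloud-dataproc Python client library to create a small cluster programmatically; the project ID, region, cluster name, and machine types are placeholders:

```python
from google.cloud import dataproc_v1

project_id, region = "my-project", "us-central1"

# The client must target the regional Dataproc endpoint.
client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": project_id,
    "cluster_name": "example-cluster",
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-2"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-2"},
    },
}

# create_cluster returns a long-running operation; result() blocks until done.
operation = client.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
)
print(f"Cluster created: {operation.result().cluster_name}")
```

The same cluster could equally be created with a single gcloud command or a few clicks in the Cloud Console.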
- What is the significance of image versioning in Dataproc?
Dataproc supports image versioning, allowing switching between different versions of Apache Spark, Apache Hadoop, and other tools.
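The image version is a single field in the cluster configuration. A sketch, reusing the create flow from the previous example (the version string is illustrative):

```python
# Pin the image version, i.e. the bundled OS, Spark, Hadoop, and tool versions.
cluster = {
    "project_id": "my-project",
    "cluster_name": "pinned-version-cluster",
    "config": {
        "software_config": {"image_version": "2.1-debian11"},
    },
}
```

Omitting image_version selects the current default image.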
- What measures are there to ensure high availability in Dataproc?
Clusters can run with multiple primary nodes, and jobs can be set to restart on failure, ensuring high availability.
- What developer tools does Dataproc offer?
Dataproc offers multiple ways to manage a cluster, including the Cloud Console, Cloud SDK, RESTful APIs, and SSH access.
- What are initialization actions in Dataproc?
Initialization actions let you install software or customize settings and libraries when a cluster is created: you specify executables or scripts that run on all nodes in the cluster immediately after it is set up.
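A sketch of how an initialization action is attached to a cluster definition with the Python client; the Cloud Storage path to the script and the timeout are illustrative:

```python
# Run a custom install script on every node right after cluster setup.
cluster = {
    "project_id": "my-project",
    "cluster_name": "cluster-with-init-action",
    "config": {
        "initialization_actions": [
            {
                "executable_file": "gs://my-config-bucket/install-libs.sh",
                "execution_timeout": {"seconds": 600},
            }
        ],
    },
}
```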
- What are optional components in Dataproc?
Optional components can be selected when deploying a cluster, including Anaconda, Hive WebHCat, Jupyter Notebook, Zeppelin Notebook, Druid, Presto, and ZooKeeper. These components extend what the cluster can do.
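A sketch showing optional components selected in the cluster's software configuration; enabling the component gateway (shown here) is an extra, assumed step that exposes web UIs such as the Jupyter notebook:

```python
cluster = {
    "project_id": "my-project",
    "cluster_name": "cluster-with-components",
    "config": {
        "software_config": {
            # Names come from the Dataproc Component enum.
            "optional_components": ["JUPYTER", "ZEPPELIN", "ZOOKEEPER"],
        },
        # Optional: the component gateway exposes the notebook web UIs.
        "endpoint_config": {"enable_http_port_access": True},
    },
}
```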
- Can a Dataproc cluster contain both preemptible secondary workers and non-preemptible secondary workers?
No, a Dataproc cluster can contain either preemptible secondary workers or non-preemptible secondary workers, but not both.
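The choice is made once, on the secondary worker pool itself. A sketch of the relevant fragment of a cluster configuration (the instance count is illustrative):

```python
cluster_config = {
    "secondary_worker_config": {
        "num_instances": 4,
        # Applies to the whole pool: "PREEMPTIBLE" or "NON_PREEMPTIBLE", never a mix.
        "preemptibility": "PREEMPTIBLE",
    },
}
```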
- How should a Dataproc cluster be considered in terms of longevity?
Treat a Dataproc cluster as short-lived (ephemeral) rather than long-lived: spin up a cluster when a job needs compute processing and turn it down when the job finishes.
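A sketch of that ephemeral pattern with the google-cloud-dataproc Python client: create a cluster, run one job, delete the cluster. The project, region, cluster name, and job path are placeholders, and the cluster config is trimmed for brevity:

```python
from google.cloud import dataproc_v1

project_id, region, cluster_name = "my-project", "us-central1", "ephemeral-cluster"
endpoint = {"api_endpoint": f"{region}-dataproc.googleapis.com:443"}

clusters = dataproc_v1.ClusterControllerClient(client_options=endpoint)
jobs = dataproc_v1.JobControllerClient(client_options=endpoint)

# 1. Spin up a cluster only when compute is needed (default config for brevity).
clusters.create_cluster(request={
    "project_id": project_id,
    "region": region,
    "cluster": {"project_id": project_id, "cluster_name": cluster_name, "config": {}},
}).result()

# 2. Run the PySpark job that needed the cluster.
jobs.submit_job_as_operation(request={
    "project_id": project_id,
    "region": region,
    "job": {
        "placement": {"cluster_name": cluster_name},
        "pyspark_job": {"main_python_file_uri": "gs://my-bucket/etl_job.py"},
    },
}).result()

# 3. Tear the cluster down as soon as the job finishes, so billing stops.
clusters.delete_cluster(request={
    "project_id": project_id,
    "region": region,
    "cluster_name": cluster_name,
}).result()
```

Dataproc Workflow Templates can automate the same create-run-delete sequence declaratively.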