# Processing Hadoop Jobs with Dataproc on Google Cloud - Flashcards
- What is Dataproc in Google Cloud?
Dataproc is Google Cloud's managed service for running Apache Spark and Apache Hadoop workloads; it processes Hadoop jobs with open-source data tools for batch processing, querying, streaming, and machine learning.
- What are the benefits of using Dataproc for cluster management?
Dataproc automates cluster creation and management, making clusters easy to manage and cheaper to run, since they can be turned off when not in use.
- How does Dataproc compare to traditional on-premises solutions?
Compared with traditional on-premises deployments and other cloud services, Dataproc stands out for its low per-vCPU cost, fast cluster operations, and built-in management, and these advantages apply to clusters of all sizes.
- Is there a need to learn new tools when switching to Dataproc?
No. Dataproc uses the standard open-source Spark and Hadoop tools and APIs, so existing projects can be moved to it without redevelopment.
- What popular tools are updated and supported on Dataproc?
Popular tools like Spark, Hadoop, Pig, and Hive are frequently updated and supported.
- How is Dataproc priced?
Dataproc adds a fee of $0.01 per virtual CPU in the cluster per hour, billed on top of the other Google Cloud resources the cluster uses (such as Compute Engine VMs and persistent disks). For example, a 24-vCPU cluster that runs for 2 hours adds 24 × 2 × $0.01 = $0.48 in Dataproc fees.
- What are the cost benefits of Dataproc?
Clusters can include preemptible instances with lower compute prices, and billing is based on second-by-second usage with a one-minute minimum billing period.
- How fast can Dataproc clusters start, scale, and shut down?
Dataproc clusters start, scale, and shut down quickly, with each operation taking an average of 90 seconds or less.
- How flexible are the Dataproc clusters?
Clusters can be created and scaled rapidly, offering various virtual machine types, sizes, numbers of nodes, and networking options.
- Does Dataproc allow the use of open-source tools?
Yes. Clusters ship with native, frequently updated versions of Spark, Hadoop, Pig, and Hive, so existing open-source tools, libraries, and documentation continue to apply.
- How does Dataproc integrate with other Google Cloud Services?
Dataproc has built-in integration with Cloud Storage, BigQuery, and Cloud Bigtable, so data can live outside the cluster and is not lost when a cluster is deleted. Cloud Logging and Cloud Monitoring are also available, making Dataproc part of a complete data platform rather than just a Spark or Hadoop cluster.
- How does Dataproc handle ETL processes?
Dataproc makes it easy to ETL raw log data directly into BigQuery for business reporting.
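For example, a minimal PySpark sketch of such a pipeline, assuming the cluster's image bundles the spark-bigquery connector and that the bucket, dataset, and table names below (all placeholders) already exist:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("raw-log-etl").getOrCreate()

# Raw, newline-delimited JSON logs staged in Cloud Storage.
logs = spark.read.json("gs://my-raw-logs-bucket/logs/2024/05/*.json")

# Light transformation: keep a few fields and aggregate per URL.
report = (
    logs.select("url", "status", "bytes_sent")
        .groupBy("url")
        .agg(F.count("*").alias("hits"), F.sum("bytes_sent").alias("total_bytes"))
)

# The connector stages the load through a temporary Cloud Storage bucket.
(report.write.format("bigquery")
    .option("table", "my-project.reports.page_views_daily")
    .option("temporaryGcsBucket", "my-staging-bucket")
    .mode("overwrite")
    .save())
```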
- How does cluster management happen in Dataproc?
Users can interact with clusters and with Spark or Hadoop jobs without an administrator or special software, managing clusters through the Cloud Console, the Cloud SDK (gcloud), or the Dataproc REST API.
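As an illustration, a minimal sketch using the google-cloud-dataproc Python client library to create a small cluster programmatically; the project ID, region, cluster name, and machine types are placeholders:

```python
from google.cloud import dataproc_v1

project_id, region = "my-project", "us-central1"

# The client must target the regional Dataproc endpoint.
client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": project_id,
    "cluster_name": "example-cluster",
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-2"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-2"},
    },
}

# create_cluster returns a long-running operation; result() blocks until done.
operation = client.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
)
print(f"Cluster created: {operation.result().cluster_name}")
```

The same cluster could equally be created with a single gcloud command or a few clicks in the Cloud Console.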
- What is the significance of image versioning in Dataproc?
Dataproc supports image versioning, allowing switching between different versions of Apache Spark, Apache Hadoop, and other tools.
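The image version is a single field in the cluster configuration. A sketch, reusing the create flow from the previous example (the version string is illustrative):

```python
# Pin the image version, i.e. the bundled OS, Spark, Hadoop, and tool versions.
cluster = {
    "project_id": "my-project",
    "cluster_name": "pinned-version-cluster",
    "config": {
        "software_config": {"image_version": "2.1-debian11"},
    },
}
```

Omitting image_version selects the current default image.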
- What measures are there to ensure high availability in Dataproc?
Clusters can run with multiple primary nodes, and jobs can be set to restart on failure, ensuring high availability.
- What developer tools does Dataproc offer?
Dataproc offers multiple ways to manage a cluster, including the Cloud Console, Cloud SDK, RESTful APIs, and SSH access.
- What are initialization actions in Dataproc?
Initialization actions let you install software or customize settings and libraries when a cluster is created: you specify executables or scripts that run on all nodes in the cluster immediately after it is set up.
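A sketch of how an initialization action is attached to a cluster definition with the Python client; the Cloud Storage path to the script and the timeout are illustrative:

```python
# Run a custom install script on every node right after cluster setup.
cluster = {
    "project_id": "my-project",
    "cluster_name": "cluster-with-init-action",
    "config": {
        "initialization_actions": [
            {
                "executable_file": "gs://my-config-bucket/install-libs.sh",
                "execution_timeout": {"seconds": 600},
            }
        ],
    },
}
```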
- What are optional components in Dataproc?
Optional components can be selected when deploying a cluster, including Anaconda, Hive WebHCat, Jupyter Notebook, Zeppelin Notebook, Druid, Presto, and ZooKeeper. These components extend what the cluster can do.
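A sketch showing optional components selected in the cluster's software configuration; enabling the component gateway (shown here) is an extra, assumed step that exposes web UIs such as the Jupyter notebook:

```python
cluster = {
    "project_id": "my-project",
    "cluster_name": "cluster-with-components",
    "config": {
        "software_config": {
            # Names come from the Dataproc Component enum.
            "optional_components": ["JUPYTER", "ZEPPELIN", "ZOOKEEPER"],
        },
        # Optional: the component gateway exposes the notebook web UIs.
        "endpoint_config": {"enable_http_port_access": True},
    },
}
```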
- Can a Dataproc cluster contain both preemptible secondary workers and non-preemptible secondary workers?
No, a Dataproc cluster can contain either preemptible secondary workers or non-preemptible secondary workers, but not both.
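The choice is made once, on the secondary worker pool itself. A sketch of the relevant fragment of a cluster configuration (the instance count is illustrative):

```python
cluster_config = {
    "secondary_worker_config": {
        "num_instances": 4,
        # Applies to the whole pool: "PREEMPTIBLE" or "NON_PREEMPTIBLE", never a mix.
        "preemptibility": "PREEMPTIBLE",
    },
}
```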
- How should a Dataproc cluster be considered in terms of longevity?
Treat a Dataproc cluster as short-lived (ephemeral) rather than long-lived: spin up a cluster when a job needs compute processing and turn it down when the job finishes.
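A sketch of that ephemeral pattern with the google-cloud-dataproc Python client: create a cluster, run one job, delete the cluster. The project, region, cluster name, and job path are placeholders, and the cluster config is trimmed for brevity:

```python
from google.cloud import dataproc_v1

project_id, region, cluster_name = "my-project", "us-central1", "ephemeral-cluster"
endpoint = {"api_endpoint": f"{region}-dataproc.googleapis.com:443"}

clusters = dataproc_v1.ClusterControllerClient(client_options=endpoint)
jobs = dataproc_v1.JobControllerClient(client_options=endpoint)

# 1. Spin up a cluster only when compute is needed (default config for brevity).
clusters.create_cluster(request={
    "project_id": project_id,
    "region": region,
    "cluster": {"project_id": project_id, "cluster_name": cluster_name, "config": {}},
}).result()

# 2. Run the PySpark job that needed the cluster.
jobs.submit_job_as_operation(request={
    "project_id": project_id,
    "region": region,
    "job": {
        "placement": {"cluster_name": cluster_name},
        "pyspark_job": {"main_python_file_uri": "gs://my-bucket/etl_job.py"},
    },
}).result()

# 3. Tear the cluster down as soon as the job finishes, so billing stops.
clusters.delete_cluster(request={
    "project_id": project_id,
    "region": region,
    "cluster_name": cluster_name,
}).result()
```

Dataproc Workflow Templates can automate the same create-run-delete sequence declaratively.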