Exam Topics 5 Flashcards
Question #: 16
Topic #: 2
Why do you need to split a machine learning dataset into training data and test data?
A. So you can try two different sets of features
B. To make sure your model is generalized for more than just the training data
C. To allow you to create unit tests in your code
D. So you can use one dataset for a wide model and one for a deep model
B. To make sure your model is generalized for more than just the training data
Reason: Holding out a test set lets you measure how well the model generalizes to data it did not see during training; the other options do not describe the purpose of a train/test split.
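For illustration, a minimal scikit-learn sketch (the dataset and model are arbitrary placeholders): the score on the held-out test set estimates generalization rather than memorization of the training data.

```python
# Illustrative sketch: holding out a test set with scikit-learn.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# Hold out 20% of the rows so the model can be evaluated on data it never saw.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Accuracy on the held-out test set estimates how well the model generalizes.
print("test accuracy:", model.score(X_test, y_test))
```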
Question #: 23
Topic #: 2
If you want to create a machine learning model that predicts the price of a particular stock based on its recent price history, what type of estimator should you use?
A. Unsupervised learning
B. Regressor
C. Classifier
D. Clustering estimator
B. Regressor
Reason: A stock price is a continuous value, which calls for a regressor.
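A hedged sketch of the idea with made-up prices: a regressor is trained on lagged price history and outputs a continuous prediction.

```python
# Hypothetical sketch: a regressor predicts a continuous value (the next price)
# from recent price history, using the previous three prices as features.
import numpy as np
from sklearn.linear_model import LinearRegression

prices = np.array([10.0, 10.2, 10.1, 10.4, 10.6, 10.5, 10.9, 11.0, 11.2, 11.1])

# Build (previous 3 prices) -> (next price) training pairs.
X = np.array([prices[i:i + 3] for i in range(len(prices) - 3)])
y = prices[3:]

regressor = LinearRegression().fit(X, y)
print("predicted next price:", regressor.predict([prices[-3:]])[0])
```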
Question #: 34
Topic #: 2
What Dataflow concept determines when a Window's contents should be output based on certain criteria being met?
A. Sessions
B. OutputCriteria
C. Windows
D. Triggers
D. Triggers
Triggers control when the elements for a specific key and window are output. As elements arrive, they are put into one or more windows by a Window transform and its associated WindowFn, and then passed to the associated Trigger to determine if the window's contents should be output.
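A minimal Apache Beam (Python SDK) sketch of that relationship; the data, timestamps, and 60-second window size are arbitrary. The Window transform assigns elements to windows, and the trigger decides when each window's aggregated contents are emitted.

```python
import apache_beam as beam
from apache_beam.transforms.window import FixedWindows, TimestampedValue
from apache_beam.transforms.trigger import AfterWatermark, AccumulationMode

with beam.Pipeline() as p:
    _ = (
        p
        | beam.Create([("user", 1), ("user", 2), ("user", 3)])
        # Assign event timestamps so the elements fall into a 60-second window.
        | beam.Map(lambda kv: TimestampedValue(kv, 30))
        | beam.WindowInto(
            FixedWindows(60),
            trigger=AfterWatermark(),              # fire when the watermark passes the window end
            accumulation_mode=AccumulationMode.DISCARDING)
        | beam.CombinePerKey(sum)                  # window contents are emitted when the trigger fires
        | beam.Map(print)
    )
```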
Question #: 35
Topic #: 2
Which of the following is NOT one of the three main types of triggers that Dataflow supports?
A. Trigger based on element size in bytes
B. Trigger that is a combination of other triggers
C. Trigger based on element count
D. Trigger based on time
A. Trigger based on element size in bytes
Reason: There are three major kinds of triggers that Dataflow supports:
1. Time-based triggers.
2. Data-driven triggers. You can set a trigger to emit results from a window when that window has received a certain number of data elements.
3. Composite triggers. These triggers combine multiple time-based or data-driven triggers in some logical way.
Reference: https://cloud.google.com/dataflow/model/triggers
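For reference, a hedged Beam (Python SDK) snippet showing the three families together: a data-driven trigger (AfterCount) and a time-based trigger (AfterProcessingTime) combined into a composite trigger. The element count and duration are arbitrary.

```python
import apache_beam as beam
from apache_beam.transforms.window import GlobalWindows
from apache_beam.transforms.trigger import (
    AfterAny, AfterCount, AfterProcessingTime, Repeatedly, AccumulationMode)

composite_windowing = beam.WindowInto(
    GlobalWindows(),
    # Fire whenever 100 elements arrive OR 60 seconds of processing time pass,
    # whichever happens first, and keep firing repeatedly.
    trigger=Repeatedly(AfterAny(AfterCount(100), AfterProcessingTime(60))),
    accumulation_mode=AccumulationMode.DISCARDING)
```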
Question #: 40
Topic #: 2
You are planning to use Google’s Dataflow SDK to analyze customer data such as displayed below. Your project requirement is to extract only the customer name from the data source and then write to an output PCollection.
Tom,555 X street
Tim,553 Y street
Sam, 111 Z street
Which operation is best suited for the above data processing requirement?
A. ParDo
B. Sink API
C. Source API
D. Data extraction
A. ParDo
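A small Beam (Python SDK) sketch of the same idea; the exam scenario assumes the Dataflow/Java SDK, but the ParDo concept is identical: a DoFn is applied to every element to extract the customer name into an output PCollection.

```python
import apache_beam as beam

class ExtractNameFn(beam.DoFn):
    def process(self, element):
        # "Tom,555 X street" -> "Tom"
        yield element.split(",")[0].strip()

with beam.Pipeline() as p:
    names = (
        p
        | beam.Create(["Tom,555 X street", "Tim,553 Y street", "Sam, 111 Z street"])
        | beam.ParDo(ExtractNameFn())   # output PCollection of customer names
        | beam.Map(print)
    )
```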
Question #: 46
Topic #: 2
By default, which of the following windowing behaviors does Dataflow apply to unbounded data sets?
A. Windows at every 100 MB of data
B. Single, Global Window
C. Windows at every 1 minute
D. Windows at every 10 minutes
B. Single, Global Window
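An illustrative one-liner, assuming a Beam Python pipeline: unless you apply WindowInto yourself, every element stays in the single global window, so streaming pipelines typically assign windows explicitly.

```python
# Minimal sketch (Beam Python SDK); the 60-second window size is arbitrary.
import apache_beam as beam
from apache_beam.transforms.window import FixedWindows, GlobalWindows

# Default behavior for an unbounded PCollection: one global window.
default_windowing = beam.WindowInto(GlobalWindows())

# Explicitly overriding the default with fixed one-minute windows.
fixed_windowing = beam.WindowInto(FixedWindows(60))
```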
Question #: 52
Topic #: 2
Which of these rules apply when you add preemptible workers to a Dataproc cluster (select 2 answers)?
A. Preemptible workers cannot use persistent disk.
B. Preemptible workers cannot store data.
C. If a preemptible worker is reclaimed, then a replacement worker must be added manually.
D. A Dataproc cluster cannot have only preemptible workers.
B. Preemptible workers cannot store data.
D. A Dataproc cluster cannot have only preemptible workers.
Reason: The following rules apply when you use preemptible workers with a Cloud Dataproc cluster:
- Processing only. Since preemptibles can be reclaimed at any time, preemptible workers do not store data. Preemptibles added to a Cloud Dataproc cluster only function as processing nodes.
- No preemptible-only clusters. To ensure clusters do not lose all workers, Cloud Dataproc cannot create preemptible-only clusters.
- Persistent disk size. By default, all preemptible workers are created with the smaller of 100 GB or the primary worker boot disk size. This disk space is used for local caching of data and is not available through HDFS.
The managed group automatically re-adds workers lost due to reclamation as capacity permits.
Reference: https://cloud.google.com/dataproc/docs/concepts/preemptible-vms
Question #: 57
Topic #: 2
Scaling a Cloud Dataproc cluster typically involves ____.
A. increasing or decreasing the number of worker nodes
B. increasing or decreasing the number of master nodes
C. moving memory to run more applications on a single node
D. deleting applications from unused nodes periodically
A. increasing or decreasing the number of worker nodes
Reason: After creating a Cloud Dataproc cluster, you can scale the cluster by increasing or decreasing the number of worker nodes in the cluster at any time, even when jobs are running on the cluster. Cloud Dataproc clusters are typically scaled to:
1) increase the number of workers to make a job run faster
2) decrease the number of workers to save money
3) increase the number of nodes to expand available Hadoop Distributed Filesystem (HDFS) storage
Reference: https://cloud.google.com/dataproc/docs/concepts/scaling-clusters
Question #: 59
Topic #: 2
The YARN ResourceManager and the HDFS NameNode interfaces are available on a Cloud Dataproc cluster ____.
A. application node
B. conditional node
C. master node
D. worker node
C. master node
Reason: The YARN ResourceManager and the HDFS NameNode interfaces are available on a Cloud Dataproc cluster master node. The cluster master-host-name is the name of your Cloud Dataproc cluster followed by an -m suffix. For example, if your cluster is named "my-cluster", the master-host-name would be "my-cluster-m".
Question #: 60
Topic #: 2
Which of these is NOT a way to customize the software on Dataproc cluster instances?
A. Set initialization actions
B. Modify configuration files using cluster properties
C. Configure the cluster using Cloud Deployment Manager
D. Log into the master node and make changes from there
C. Configure the cluster using Cloud Deployment Manager
You can access the master node of the cluster by clicking the SSH button next to it in the Cloud Console.
You can easily use the --properties option of the dataproc command in the Google Cloud SDK to modify many common configuration files when creating a cluster.
When creating a Cloud Dataproc cluster, you can specify initialization actions in executables and/or scripts that Cloud Dataproc will run on all nodes in your Cloud Dataproc cluster immediately after the cluster is set up.
Question #: 61
Topic #: 2
In order to securely transfer web traffic data from your computer's web browser to the Cloud Dataproc cluster you should use a(n) _____.
A. VPN connection
B. Special browser
C. SSH tunnel
D. FTP connection
C. SSH tunnel
To connect to the web interfaces, it is recommended to use an SSH tunnel to create a secure connection to the master node.
Question #: 72
Topic #: 2
Cloud Bigtable is Google's ______ Big Data database service.
A. Relational
B. mySQL
C. NoSQL
D. SQL Server
C. NoSQL
Question #: 75
Topic #: 2
Cloud Bigtable is a recommended option for storing very large amounts of ____________________________?
A. multi-keyed data with very high latency
B. multi-keyed data with very low latency
C. single-keyed data with very low latency
D. single-keyed data with very high latency
C. single-keyed data with very low latency
Cloud Bigtable is a sparsely populated table that can scale to billions of rows and thousands of columns, allowing you to store terabytes or even petabytes of data.
A single value in each row is indexed; this value is known as the row key. Cloud Bigtable is ideal for storing very large amounts of single-keyed data with very low latency. It supports high read and write throughput at low latency, and it is an ideal data source for MapReduce operations.
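A hedged sketch with the google-cloud-bigtable Python client (project, instance, table, and row key are placeholders) showing the single-row-key lookup path this answer refers to.

```python
from google.cloud import bigtable

client = bigtable.Client(project="my-project")          # placeholder project
table = client.instance("my-instance").table("user-events")

# Bigtable indexes exactly one value per row: the row key. Point lookups by
# row key are the low-latency access pattern Bigtable is designed for.
row = table.read_row(b"user#1234#2024-01-01")
if row is not None:
    for family, columns in row.cells.items():
        for qualifier, cells in columns.items():
            print(family, qualifier, cells[0].value)
```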
Question #: 76
Topic #: 2
Google Cloud Bigtable indexes a single value in each row. This value is called the _______.
A. primary key
B. unique key
C. row key
D. master key
C. row key
Question #: 77
Topic #: 2
What is the HBase Shell for Cloud Bigtable?
A. The HBase shell is a GUI based interface that performs administrative tasks, such as creating and deleting tables.
B. The HBase shell is a command-line tool that performs administrative tasks, such as creating and deleting tables.
C. The HBase shell is a hypervisor based shell that performs administrative tasks, such as creating and deleting new virtualized instances.
D. The HBase shell is a command-line tool that performs only user account management functions to grant access to Cloud Bigtable instances.
B. The HBase shell is a command-line tool that performs administrative tasks, such as creating and deleting tables.
The HBase shell is a command-line tool that performs administrative tasks, such as creating and deleting tables. The Cloud Bigtable HBase client for Java makes it possible to use the HBase shell to connect to Cloud Bigtable.
Question #: 39
Topic #: 2
Does Dataflow process batch data pipelines or streaming data pipelines?
A. Only Batch Data Pipelines
B. Both Batch and Streaming Data Pipelines
C. Only Streaming Data Pipelines
D. None of the above
B. Both Batch and Streaming Data Pipelines
Question #: 4
Topic #: 2
What are all of the BigQuery operations that Google charges for?
A. Storage, queries, and streaming inserts
B. Storage, queries, and loading data from a file
C. Storage, queries, and exporting data
D. Queries and streaming inserts
A. Storage, queries, and streaming inserts
Google charges for storage, queries, and streaming inserts. Loading data from a file and exporting data are free operations.
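A hedged google-cloud-bigquery sketch contrasting the two ingestion paths (project, dataset, table, and bucket names are placeholders): the streaming insert is a billed operation, while the load job from a file is free.

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")            # placeholder project
table_id = "my-project.my_dataset.events"                  # placeholder table

# Streaming insert (billed per ingested data).
errors = client.insert_rows_json(table_id, [{"user": "tom", "clicks": 3}])
print("streaming insert errors:", errors)

# Batch load from Cloud Storage (the load operation itself is not billed).
load_job = client.load_table_from_uri(
    "gs://my-bucket/events.json",                          # placeholder URI
    table_id,
    job_config=bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON),
)
load_job.result()
```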
Question #: 53
Topic #: 2
When using Cloud Dataproc clusters, you can access the YARN web interface by configuring a browser to connect through a ____ proxy.
A. HTTPS
B. VPN
C. SOCKS
D. HTTP
C. SOCKS
When using Cloud Dataproc clusters, configure your browser to use the SOCKS proxy. The SOCKS proxy routes data intended for the Cloud Dataproc cluster through an SSH tunnel.
Question #: 12
Topic #: 2
What are two methods that can be used to denormalize tables in BigQuery?
A. 1) Split table into multiple tables; 2) Use a partitioned table
B. 1) Join tables into one table; 2) Use nested repeated fields
C. 1) Use a partitioned table; 2) Join tables into one table
D. 1) Use nested repeated fields; 2) Use a partitioned table
B. 1) Join tables into one table; 2) Use nested repeated fields
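A hedged sketch with the google-cloud-bigquery client (all names are placeholders): a denormalized orders table that folds what would otherwise be a separate, joined line-items table into a nested, repeated RECORD field.

```python
from google.cloud import bigquery

schema = [
    bigquery.SchemaField("order_id", "STRING", mode="REQUIRED"),
    bigquery.SchemaField("customer", "STRING"),
    bigquery.SchemaField(
        "line_items", "RECORD", mode="REPEATED",   # nested repeated field
        fields=[
            bigquery.SchemaField("sku", "STRING"),
            bigquery.SchemaField("quantity", "INTEGER"),
            bigquery.SchemaField("price", "NUMERIC"),
        ],
    ),
]

client = bigquery.Client(project="my-project")             # placeholder project
client.create_table(bigquery.Table("my-project.sales.orders", schema=schema))
```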
Question #: 56
Topic #: 2
Dataproc clusters contain many configuration files. To update these files, you will need to use the --properties option. The format for the option is: file_prefix:property=_____.
A. details
B. value
C. null
D. id
B. value
To make updating files and properties easy, the --properties option uses a special format to specify the configuration file and the property and value within the file that should be updated. The formatting is as follows: file_prefix:property=value.
Question #: 49
Topic #: 2
Which role must be assigned to a service account used by the virtual machines in a Dataproc cluster so they can execute jobs?
A. Dataproc Worker
B. Dataproc Viewer
C. Dataproc Runner
D. Dataproc Editor
A. Dataproc Worker
Service accounts used with Cloud Dataproc must have the Dataproc Worker role (or have all the permissions granted by the Dataproc Worker role).
Question #: 45
Topic #: 2
Which of the following is not true about Dataflow pipelines?
A. Pipelines are a set of operations
B. Pipelines represent a data processing job
C. Pipelines represent a directed graph of steps
D. Pipelines can share data between instances
D. Pipelines can share data between instances
The data and transforms in a pipeline are unique to, and owned by, that pipeline. While your program can create multiple pipelines, pipelines cannot share data or transforms.
Question #: 42
Topic #: 2
Which of the following is NOT true about Dataflow pipelines?
A. Dataflow pipelines are tied to Dataflow, and cannot be run on any other runner
B. Dataflow pipelines can consume data from other Google Cloud services
C. Dataflow pipelines can be programmed in Java
D. Dataflow pipelines use a unified programming model, so can work both with streaming and batch data sources
A. Dataflow pipelines are tied to Dataflow, and cannot be run on any other runner
Dataflow pipelines can also run on alternate runners such as Spark and Flink, because they are built using the Apache Beam SDKs.
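A minimal Beam (Python SDK) sketch of that portability: the runner is just a pipeline option, so the same pipeline code can target Dataflow, Flink, Spark, or the local DirectRunner.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Swap the runner to retarget the same pipeline, e.g. "DataflowRunner",
# "FlinkRunner", or "SparkRunner" (those runners need extra options such as
# project, region, and temp_location).
options = PipelineOptions(runner="DirectRunner")

with beam.Pipeline(options=options) as p:
    _ = p | beam.Create([1, 2, 3]) | beam.Map(print)
```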
Question #: 41
Topic #: 2
Which Cloud Dataflow / Beam feature should you use to aggregate data in an unbounded data source every hour based on the time when the data entered the pipeline?
A. An hourly watermark
B. An event time trigger
C. The withAllowedLateness method
D. A processing time trigger
D. A processing time trigger
Reason: “when the data entered the pipeline”
When collecting and grouping data into windows, Beam uses triggers to determine when to emit the aggregated results of each window.
Processing time triggers. These triggers operate on the processing time, that is, the time when the data element is processed at any given stage in the pipeline.
Event time triggers. These triggers operate on the event time, as indicated by the timestamp on each data element. Beam's default trigger is event time-based.
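A hedged Beam (Python SDK) sketch of the chosen answer: a repeating processing-time trigger on the global window emits an aggregate roughly every hour of pipeline (wall-clock) time, i.e. based on when the data entered the pipeline, not on its event timestamps.

```python
import apache_beam as beam
from apache_beam.transforms.window import GlobalWindows
from apache_beam.transforms.trigger import (
    AfterProcessingTime, Repeatedly, AccumulationMode)

hourly_by_processing_time = beam.WindowInto(
    GlobalWindows(),
    trigger=Repeatedly(AfterProcessingTime(60 * 60)),   # fire every hour of processing time
    accumulation_mode=AccumulationMode.DISCARDING)
```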