Exam Topics 5 Flashcards

1
Q

Question #: 16
Topic #: 2

Why do you need to split a machine learning dataset into training data and test data?
A. So you can try two different sets of features
B. To make sure your model is generalized for more than just the training data
C. To allow you to create unit tests in your code
D. So you can use one dataset for a wide model and one for a deep model

A

B. To make sure your model is generalized for more than just the training data
Reason: Holding out a test set lets you measure how well the model generalizes to data it did not see during training; the other options describe unrelated practices.
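A minimal scikit-learn sketch of the idea (the data here is synthetic and purely illustrative): the model is fit on the training split only, and the held-out test split estimates how well it generalizes.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Synthetic data purely for illustration: 100 rows, 3 features, binary label.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Hold out 20% of the rows so the model can be evaluated on data it never saw.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression().fit(X_train, y_train)
print("train accuracy:", model.score(X_train, y_train))
print("test accuracy:", model.score(X_test, y_test))  # estimate of generalization
```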

2
Q

Question #: 23
Topic #: 2

If you want to create a machine learning model that predicts the price of a particular stock based on its recent price history, what type of estimator should you use?
A. Unsupervised learning
B. Regressor
C. Classifier
D. Clustering estimator
A

B. Regressor

Reason: A stock price is a continuous value, so predicting it is a regression problem.
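A toy regression sketch (synthetic prices and illustrative lag features): because the target is a continuous number, a regressor such as scikit-learn's LinearRegression is the appropriate estimator.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy example: predict the next price from the last 3 prices.
prices = np.cumsum(np.random.default_rng(1).normal(0, 1, 200)) + 100
X = np.column_stack([prices[i:len(prices) - 3 + i] for i in range(3)])  # lag features
y = prices[3:]  # next price (a continuous target, hence a regressor)

model = LinearRegression().fit(X, y)
print(model.predict(X[-1:]))  # predicted next price
```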

3
Q

Question #: 34
Topic #: 2

What Dataflow concept determines when a Window's contents should be output based on certain criteria being met?
A. Sessions
B. OutputCriteria
C. Windows
D. Triggers
A

D. Triggers
Triggers control when the elements for a specific key and window are output. As elements arrive, they are put into one or more windows by a Window transform and its associated WindowFn, and then passed to the associated Trigger to determine whether the Window's contents should be output.

4
Q

Question #: 35
Topic #: 2

Which of the following is NOT one of the three main types of triggers that Dataflow supports?
A. Trigger based on element size in bytes
B. Trigger that is a combination of other triggers
C. Trigger based on element count
D. Trigger based on time

A

A. Trigger based on element size in bytes
Reason: There are three major kinds of triggers that Dataflow supports:
1. Time-based triggers
2. Data-driven triggers. You can set a trigger to emit results from a window when that window has received a certain number of data elements.
3. Composite triggers. These triggers combine multiple time-based or data-driven triggers in some logical way.
Reference: https://cloud.google.com/dataflow/model/triggers
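A minimal Apache Beam (Python) sketch of the three trigger families named above; the 60-second fixed window, the counts, and the bounded Create source are illustrative assumptions (a real unbounded pipeline would read from a streaming source instead).

```python
import apache_beam as beam
from apache_beam.transforms import trigger, window

with beam.Pipeline() as p:
    _ = (
        p
        | beam.Create(list(range(10)))
        | beam.Map(lambda i: window.TimestampedValue(i, i))  # attach event timestamps
        | beam.WindowInto(
            window.FixedWindows(60),  # 1-minute windows
            # Composite trigger: fire whenever either the data-driven trigger
            # (5 elements) or the time-based trigger (30 s of processing time) fires.
            trigger=trigger.Repeatedly(
                trigger.AfterAny(trigger.AfterCount(5), trigger.AfterProcessingTime(30))
            ),
            accumulation_mode=trigger.AccumulationMode.DISCARDING,
        )
        | beam.CombineGlobally(sum).without_defaults()
        | beam.Map(print)
    )
```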

5
Q

Question #: 40
Topic #: 2

You are planning to use Google’s Dataflow SDK to analyze customer data such as displayed below. Your project requirement is to extract only the customer name from the data source and then write to an output PCollection.

Tom,555 X street -

Tim,553 Y street -

Sam, 111 Z street -
Which operation is best suited for the above data processing requirement?
A. ParDo
B. Sink API
C. Source API
D. Data extraction
A

A. ParDo
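ParDo applies a user-defined function to each element, which is what extracting the name field requires. A minimal Apache Beam (Python) sketch, assuming the records are plain comma-separated strings as shown above:

```python
import apache_beam as beam

class ExtractName(beam.DoFn):
    def process(self, element):
        # "Tom,555 X street" -> "Tom": keep only the part before the first comma.
        yield element.split(",", 1)[0].strip()

with beam.Pipeline() as p:
    _ = (
        p
        | "Read" >> beam.Create(["Tom,555 X street", "Tim,553 Y street", "Sam, 111 Z street"])
        | "ExtractNames" >> beam.ParDo(ExtractName())
        | "Print" >> beam.Map(print)
    )
```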

6
Q

Question #: 46
Topic #: 2

By default, which of the following windowing behavior does Dataflow apply to unbounded data sets?
A. Windows at every 100 MB of data
B. Single, Global Window
C. Windows at every 1 minute
D. Windows at every 10 minutes
A

B. Single, Global Window

7
Q

Question #: 52
Topic #: 2

Which of these rules apply when you add preemptible workers to a Dataproc cluster (select 2 answers)?
A. Preemptible workers cannot use persistent disk.
B. Preemptible workers cannot store data.
C. If a preemptible worker is reclaimed, then a replacement worker must be added manually.
D. A Dataproc cluster cannot have only preemptible workers.

A

B. Preemptible workers cannot store data.
D. A Dataproc cluster cannot have only preemptible workers.
Reason: The following rules apply when you use preemptible workers with a Cloud Dataproc cluster:
- Processing only. Since preemptibles can be reclaimed at any time, preemptible workers do not store data. Preemptibles added to a Cloud Dataproc cluster only function as processing nodes.
- No preemptible-only clusters. To ensure clusters do not lose all workers, Cloud Dataproc cannot create preemptible-only clusters.
- Persistent disk size. As a default, all preemptible workers are created with the smaller of 100 GB or the primary worker boot disk size. This disk space is used for local caching of data and is not available through HDFS.
The managed group automatically re-adds workers lost due to reclamation as capacity permits.
Reference: https://cloud.google.com/dataproc/docs/concepts/preemptible-vms

8
Q

Question #: 57
Topic #: 2

Scaling a Cloud Dataproc cluster typically involves ____.
A. increasing or decreasing the number of worker nodes
B. increasing or decreasing the number of master nodes
C. moving memory to run more applications on a single node
D. deleting applications from unused nodes periodically

A

A. increasing or decreasing the number of worker nodes
Reason: After creating a Cloud Dataproc cluster, you can scale the cluster by increasing or decreasing the number of worker nodes in the cluster at any time, even when jobs are running on the cluster. Cloud Dataproc clusters are typically scaled to:
1) increase the number of workers to make a job run faster
2) decrease the number of workers to save money
3) increase the number of nodes to expand available Hadoop Distributed Filesystem (HDFS) storage
Reference: https://cloud.google.com/dataproc/docs/concepts/scaling-clusters
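A hedged gcloud sketch (the cluster name and worker count are illustrative): scaling is done by changing the number of workers on an existing cluster, even while jobs are running.

```
# Resize an existing cluster to 5 primary workers (illustrative values).
gcloud dataproc clusters update example-cluster --num-workers=5
```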

9
Q

Question #: 59
Topic #: 2

The YARN ResourceManager and the HDFS NameNode interfaces are available on a Cloud Dataproc cluster ____.
A. application node
B. conditional node
C. master node
D. worker node
A

C. master node
Reason: The YARN ResourceManager and the HDFS NameNode interfaces are available on a Cloud Dataproc cluster master node. The cluster master-host-name is the name of your Cloud Dataproc cluster followed by an -m suffix. For example, if your cluster is named "my-cluster", the master-host-name would be "my-cluster-m".

10
Q

Question #: 60
Topic #: 2

Which of these is NOT a way to customize the software on Dataproc cluster instances?
A. Set initialization actions
B. Modify configuration files using cluster properties
C. Configure the cluster using Cloud Deployment Manager
D. Log into the master node and make changes from there

A

C. Configure the cluster using Cloud Deployment Manager
You can access the master node of the cluster by clicking the SSH button next to it in the Cloud Console.
You can easily use the --properties option of the dataproc command in the Google Cloud SDK to modify many common configuration files when creating a cluster.
When creating a Cloud Dataproc cluster, you can specify initialization actions in executables and/or scripts that Cloud Dataproc will run on all nodes in your Cloud Dataproc cluster immediately after the cluster is set up.

11
Q

Question #: 61
Topic #: 2

In order to securely transfer web traffic data from your computer's web browser to the Cloud Dataproc cluster you should use a(n) _____.
A. VPN connection
B. Special browser
C. SSH tunnel
D. FTP connection
A

C. SSH tunnel
To connect to the web interfaces, it is recommended to use an SSH tunnel to create a secure connection to the master node.

12
Q

Question #: 72
Topic #: 2

Cloud Bigtable is Google's ______ Big Data database service.
A. Relational
B. mySQL
C. NoSQL
D. SQL Server
A

C. NoSQL

13
Q

Question #: 75
Topic #: 2

Cloud Bigtable is a recommended option for storing very large amounts of ____________________________?
A. multi-keyed data with very high latency
B. multi-keyed data with very low latency
C. single-keyed data with very low latency
D. single-keyed data with very high latency

A

C. single-keyed data with very low latency
Cloud Bigtable is a sparsely populated table that can scale to billions of rows and thousands of columns, allowing you to store terabytes or even petabytes of data.
A single value in each row is indexed; this value is known as the row key. Cloud Bigtable is ideal for storing very large amounts of single-keyed data with very low latency. It supports high read and write throughput at low latency, and it is an ideal data source for MapReduce operations.

14
Q

Question #: 76
Topic #: 2

Google Cloud Bigtable indexes a single value in each row. This value is called the _______.
A. primary key
B. unique key
C. row key
D. master key
A

C. row key

15
Q

Question #: 77
Topic #: 2

What is the HBase Shell for Cloud Bigtable?
A. The HBase shell is a GUI based interface that performs administrative tasks, such as creating and deleting tables.
B. The HBase shell is a command-line tool that performs administrative tasks, such as creating and deleting tables.
C. The HBase shell is a hypervisor based shell that performs administrative tasks, such as creating and deleting new virtualized instances.
D. The HBase shell is a command-line tool that performs only user account management functions to grant access to Cloud Bigtable instances.

A

B. The HBase shell is a command-line tool that performs administrative tasks, such as creating and deleting tables.
The HBase shell is a command-line tool that performs administrative tasks, such as creating and deleting tables. The Cloud Bigtable HBase client for Java makes it possible to use the HBase shell to connect to Cloud Bigtable.

16
Q

Question #: 39
Topic #: 2

Does Dataflow process batch data pipelines or streaming data pipelines?
A. Only Batch Data Pipelines
B. Both Batch and Streaming Data Pipelines
C. Only Streaming Data Pipelines
D. None of the above

A

B. Both Batch and Streaming Data Pipelines

17
Q

Question #: 4
Topic #: 2

What are all of the BigQuery operations that Google charges for?
A. Storage, queries, and streaming inserts
B. Storage, queries, and loading data from a file
C. Storage, queries, and exporting data
D. Queries and streaming inserts

A

A. Storage, queries, and streaming inserts
Google charges for storage, queries, and streaming inserts. Loading data from a file and exporting data are free operations.

18
Q

Question #: 53
Topic #: 2

When using Cloud Dataproc clusters, you can access the YARN web interface by configuring a browser to connect through a ____ proxy.
A. HTTPS
B. VPN
C. SOCKS
D. HTTP
A

C. SOCKS
When using Cloud Dataproc clusters, configure your browser to use the SOCKS proxy. The SOCKS proxy routes data intended for the Cloud Dataproc cluster through an SSH tunnel.

19
Q

Question #: 12
Topic #: 2

What are two methods that can be used to denormalize tables in BigQuery?
A. 1) Split table into multiple tables; 2) Use a partitioned table
B. 1) Join tables into one table; 2) Use nested repeated fields
C. 1) Use a partitioned table; 2) Join tables into one table
D. 1) Use nested repeated fields; 2) Use a partitioned table

A

B. 1) Join tables into one table; 2) Use nested repeated fields

20
Q

Question #: 56
Topic #: 2

Dataproc clusters contain many configuration files. To update these files, you will need to use the --properties option. The format for the option is: file_prefix:property=_____.
A. details
B. value
C. null
D. id
A

B. value
To make updating files and properties easy, the --properties flag uses a special format to specify the configuration file and the property and value within the file that should be updated. The formatting is as follows: file_prefix:property=value.
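A hedged example of the format (the cluster name and the Spark property are illustrative):

```
# Set a Spark property at cluster-creation time using file_prefix:property=value.
gcloud dataproc clusters create example-cluster \
    --properties=spark:spark.executor.memory=4g
```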

21
Q

Question #: 49
Topic #: 2

Which role must be assigned to a service account used by the virtual machines in a Dataproc cluster so they can execute jobs?
A. Dataproc Worker
B. Dataproc Viewer
C. Dataproc Runner
D. Dataproc Editor
A

A. Dataproc Worker
Service accounts used with Cloud Dataproc must have the Dataproc Worker role (or have all the permissions granted by the Dataproc Worker role).

22
Q

Question #: 45
Topic #: 2

Which of the following is not true about Dataflow pipelines?
A. Pipelines are a set of operations
B. Pipelines represent a data processing job
C. Pipelines represent a directed graph of steps
D. Pipelines can share data between instances

A

D. Pipelines can share data between instances
The data and transforms in a pipeline are unique to, and owned by, that pipeline. While your program can create multiple pipelines, pipelines cannot share data or transforms.

23
Q

Question #: 42
Topic #: 2

Which of the following is NOT true about Dataflow pipelines?
A. Dataflow pipelines are tied to Dataflow, and cannot be run on any other runner
B. Dataflow pipelines can consume data from other Google Cloud services
C. Dataflow pipelines can be programmed in Java
D. Dataflow pipelines use a unified programming model, so can work both with streaming and batch data sources

A

A. Dataflow pipelines are tied to Dataflow, and cannot be run on any other runner

Dataflow pipelines can also run on alternate runtimes such as Spark and Flink, because they are built using the Apache Beam SDKs.

24
Q

Question #: 41
Topic #: 2

Which Cloud Dataflow / Beam feature should you use to aggregate data in an unbounded data source every hour based on the time when the data entered the pipeline?
A. An hourly watermark
B. An event time trigger
C. The with Allowed Lateness method
D. A processing time trigger
A

D. A processing time trigger
Reason: “when the data entered the pipeline”
When collecting and grouping data into windows, Beam uses triggers to determine when to emit the aggregated results of each window.
Processing time triggers. These triggers operate on the processing time, i.e. the time when the data element is processed at any given stage in the pipeline.
Event time triggers. These triggers operate on the event time, as indicated by the timestamp on each data element. Beam's default trigger is event time-based.
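A minimal Apache Beam (Python) sketch of a repeated processing-time trigger; the global window, the bounded Create source, and the discarding accumulation mode are illustrative assumptions (a real pipeline would read from an unbounded source such as Pub/Sub).

```python
import apache_beam as beam
from apache_beam.transforms import trigger, window

with beam.Pipeline() as p:
    _ = (
        p
        | beam.Create(list(range(10)))  # stands in for an unbounded source
        | beam.WindowInto(
            window.GlobalWindows(),
            # Emit the aggregate every hour of processing time, i.e. based on
            # when the data entered the pipeline rather than on event timestamps.
            trigger=trigger.Repeatedly(trigger.AfterProcessingTime(60 * 60)),
            accumulation_mode=trigger.AccumulationMode.DISCARDING,
        )
        | beam.CombineGlobally(sum).without_defaults()
        | beam.Map(print)
    )
```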

25
Q

Question #: 32
Topic #: 2

You have a job that you want to cancel. It is a streaming pipeline, and you want to ensure that any data that is in-flight is processed and written to the output.
Which of the following commands can you use on the Dataflow monitoring console to stop the pipeline job?
A. Cancel
B. Drain
C. Stop
D. Finish

A

B. Drain
Using the Drain option to stop your job tells the Dataflow service to finish your job in its current state. Your job will immediately stop ingesting new data from input sources, but the Dataflow service will preserve any existing resources (such as worker instances) to finish processing and writing any buffered data in your pipeline.

26
Q

Question #: 31
Topic #: 2

You want to use a BigQuery table as a data sink. In which writing mode(s) can you use BigQuery as a sink?
A. Both batch and streaming
B. BigQuery cannot be used as a sink
C. Only batch
D. Only streaming
A

A. Both batch and streaming

27
Q

Question #: 30
Topic #: 2

Which of the following are feature engineering techniques? (Select 2 answers)
A. Hidden feature layers
B. Feature prioritization
C. Crossed feature columns
D. Bucketization of a continuous feature
A

C. Crossed feature columns
D. Bucketization of a continuous feature

Selecting and crafting the right set of feature columns is key to learning an effective model.
Bucketization is a process of dividing the entire range of a continuous feature into a set of consecutive bins/buckets, and then converting the original numerical feature into a bucket ID (as a categorical feature) depending on which bucket that value falls into.
Using each base feature column separately may not be enough to explain the data. To learn the differences between different feature combinations, we can add crossed feature columns to the model.
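A short sketch using the legacy tf.feature_column API these questions are based on (available in TensorFlow 1.x and earlier 2.x releases); the column names, boundaries, and bucket sizes are illustrative.

```python
import tensorflow as tf

age = tf.feature_column.numeric_column("age")
# Bucketization: turn the continuous "age" feature into categorical buckets.
age_buckets = tf.feature_column.bucketized_column(age, boundaries=[18, 25, 35, 50, 65])
education = tf.feature_column.categorical_column_with_hash_bucket("education", hash_bucket_size=100)
# Crossed feature column: learn interactions between age bucket and education.
age_x_education = tf.feature_column.crossed_column([age_buckets, education], hash_bucket_size=1000)
```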

28
Q

Question #: 29
Topic #: 2

Which of the following are examples of hyperparameters? (Select 2 answers.)
A. Number of hidden layers
B. Number of nodes in each hidden layer
C. Biases
D. Weights
A

A. Number of hidden layers
B. Number of nodes in each hidden layer

If model parameters are variables that get adjusted by training with existing data, your hyperparameters are the variables about the training process itself. For example, part of setting up a deep neural network is deciding how many “hidden” layers of nodes to use between the input layer and the output layer, as well as how many nodes each layer should use. These variables are not directly related to the training data at all. They are configuration variables. Another difference is that parameters change during a training job, while the hyperparameters are usually constant during a job.
Weights and biases are variables that get adjusted during the training process, so they are not hyperparameters.

29
Q

Question #: 27
Topic #: 2

How can you get a neural network to learn about relationships between categories in a categorical feature?
A. Create a multi-hot column
B. Create a one-hot column
C. Create a hash bucket
D. Create an embedding column
A

D. Create an embedding column

There are two problems with one-hot encoding. First, it has high dimensionality, meaning that instead of having just one value, like a continuous feature, it has many values, or dimensions. This makes computation more time-consuming, especially if a feature has a very large number of categories. The second problem is that it doesn't encode any relationships between the categories. They are completely independent from each other, so the network has no way of knowing which ones are similar to each other.
Both of these problems can be solved by representing a categorical feature with an embedding column. The idea is that each category has a smaller vector with, let's say, 5 values in it. But unlike a one-hot vector, the values are not usually 0. The values are weights, similar to the weights that are used for basic features in a neural network. The difference is that each category has a set of weights (5 of them in this case).
You can think of each value in the embedding vector as a feature of the category. So, if two categories are very similar to each other, then their embedding vectors should be very similar too.
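A short sketch with the legacy tf.feature_column API; the column name, vocabulary, and embedding dimension are illustrative. Each category is mapped to a dense 5-dimensional vector whose values are learned during training, so similar categories end up with similar vectors.

```python
import tensorflow as tf

occupation = tf.feature_column.categorical_column_with_vocabulary_list(
    "occupation", ["engineer", "doctor", "teacher", "artist"]
)
occupation_embedding = tf.feature_column.embedding_column(occupation, dimension=5)
```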

30
Q

Question #: 2
Topic #: 2

What are two of the benefits of using denormalized data structures in BigQuery?
A. Reduces the amount of data processed, reduces the amount of storage required
B. Increases query speed, makes queries simpler
C. Reduces the amount of storage required, increases query speed
D. Reduces the amount of data processed, increases query speed

A

B. Increases query speed, makes queries simpler

Denormalization increases query speed for tables with billions of rows because BigQuery’s performance degrades when doing JOINs on large tables, but with a denormalized data structure, you don’t have to use JOINs, since all of the data has been combined into one table. Denormalization also makes queries simpler because you do not have to use JOIN clauses.
Denormalization increases the amount of data processed and the amount of storage required because it creates redundant data.

31
Q

Question #: 25
Topic #: 2

What are two of the characteristics of using online prediction rather than batch prediction?
A. It is optimized to handle a high volume of data instances in a job and to run more complex models.
B. Predictions are returned in the response message.
C. Predictions are written to output files in a Cloud Storage location that you specify.
D. It is optimized to minimize the latency of serving predictions.

A

B. Predictions are returned in the response message.
D. It is optimized to minimize the latency of serving predictions.

Online prediction -
.Optimized to minimize the latency of serving predictions.
.Predictions returned in the response message.

Batch prediction -
.Optimized to handle a high volume of instances in a job and to run more complex models.
.Predictions written to output files in a Cloud Storage location that you specify.

32
Q

Question #: 22
Topic #: 2

To run a TensorFlow training job on your own computer using Cloud Machine Learning Engine, what would your command start with?
A. gcloud ml-engine local train
B. gcloud ml-engine jobs submit training
C. gcloud ml-engine jobs submit training local
D. You can’t run a TensorFlow program on your own computer using Cloud ML Engine .

A

A. gcloud ml-engine local train

gcloud ml-engine local train - run a Cloud ML Engine training job locally
This command runs the specified module in an environment similar to that of a live Cloud ML Engine Training Job.
This is especially useful in the case of testing distributed models, as it allows you to validate that you are properly interacting with the Cloud ML Engine cluster configuration.
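A hedged example of a full invocation (the module name, package path, and trailing user flag are illustrative placeholders):

```
# Run the trainer module locally, in an environment similar to a live training job.
gcloud ml-engine local train \
    --module-name=trainer.task \
    --package-path=trainer/ \
    -- \
    --train-steps=100
```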
33
Q

Question #: 20
Topic #: 2

Which TensorFlow function can you use to configure a categorical column if you don’t know all of the possible values for that column?
A. categorical_column_with_vocabulary_list
B. categorical_column_with_hash_bucket
C. categorical_column_with_unknown_values
D. sparse_column_with_keys

A

B. categorical_column_with_hash_bucket

If you know the set of all possible feature values of a column and there are only a few of them, you can use categorical_column_with_vocabulary_list. Each key in the list will get assigned an auto-incremental ID starting from 0.
What if we don’t know the set of possible values in advance? Not a problem. We can use categorical_column_with_hash_bucket instead. What will happen is that each possible value in the feature column occupation will be hashed to an integer ID as we encounter them in training.
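A short sketch with the legacy tf.feature_column API (the column name and bucket size are illustrative): values are hashed to integer IDs as they are encountered, so no vocabulary is needed up front.

```python
import tensorflow as tf

occupation = tf.feature_column.categorical_column_with_hash_bucket(
    "occupation", hash_bucket_size=1000
)
```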

34
Q

Question #: 19
Topic #: 2

Which software libraries are supported by Cloud Machine Learning Engine?
A. Theano and TensorFlow
B. Theano and Torch
C. TensorFlow
D. TensorFlow and Torch
A

C. TensorFlow

Note: Cloud ML Engine originally supported TensorFlow only; it later added support for scikit-learn and XGBoost as well.

35
Q

Question #: 18
Topic #: 2

The CUSTOM tier for Cloud Machine Learning Engine allows you to specify the number of which types of cluster nodes?
A. Workers
B. Masters, workers, and parameter servers
C. Workers and parameter servers
D. Parameter servers

A

C. Workers and parameter servers

The CUSTOM tier is not a set tier, but rather enables you to use your own cluster specification. When you use this tier, set values to configure your processing cluster according to these guidelines:
You must set TrainingInput.masterType to specify the type of machine to use for your master node.
You may set TrainingInput.workerCount to specify the number of workers to use.
You may set TrainingInput.parameterServerCount to specify the number of parameter servers to use.
You can specify the type of machine for the master node, but you can’t specify more than one master node.

36
Q

Question #: 62
Topic #: 2

All Google Cloud Bigtable client requests go through a front-end server ______ they are sent to a Cloud Bigtable node.
A. before
B. after
C. only if
D. once
A

A. before
In a Cloud Bigtable architecture all client requests go through a front-end server before they are sent to a Cloud Bigtable node.
The nodes are organized into a Cloud Bigtable cluster, which belongs to a Cloud Bigtable instance, which is a container for the cluster. Each node in the cluster handles a subset of the requests to the cluster.
When additional nodes are added to a cluster, you can increase the number of simultaneous requests that the cluster can handle, as well as the maximum throughput for the entire cluster.

37
Q

Question #: 73
Topic #: 2

When you store data in Cloud Bigtable, what is the recommended minimum amount of stored data?
A. 500 TB
B. 1 GB
C. 1 TB
D. 500 GB
A

C. 1 TB

1TB minimum, and 2TB per node

38
Q

Question #: 68
Topic #: 2

Which is not a valid reason for poor Cloud Bigtable performance?
A. The workload isn’t appropriate for Cloud Bigtable.
B. The table’s schema is not designed correctly.
C. The Cloud Bigtable cluster has too many nodes.
D. There are issues with the network connection.

A

C. The Cloud Bigtable cluster has too many nodes.

39
Q

Question #: 55
Topic #: 2

Which action can a Cloud Dataproc Viewer perform?
A. Submit a job.
B. Create a cluster.
C. Delete a cluster.
D. List the jobs.
A

D. List the jobs.
A Cloud Dataproc Viewer is limited in its actions based on its role. A viewer can only list clusters, get cluster details, list jobs, get job details, list operations, and get operation details.

40
Q

Question #: 10
Topic #: 2

Which SQL keyword can be used to reduce the number of columns processed by BigQuery?
A. BETWEEN
B. WHERE
C. SELECT
D. LIMIT
A

C. SELECT
SELECT allows you to query specific columns rather than the whole table.
LIMIT, BETWEEN, and WHERE clauses will not reduce the number of columns processed by BigQuery.

41
Q

Question #: 38
Topic #: 2

The _________ for Cloud Bigtable makes it possible to use Cloud Bigtable in a Cloud Dataflow pipeline.
A. Cloud Dataflow connector
B. DataFlow SDK
C. BiqQuery API
D. BigQuery Data Transfer Service
A

A. Cloud Dataflow connector

42
Q

Question #: 70
Topic #: 2

When you design a Google Cloud Bigtable schema it is recommended that you _________.
A. Avoid schema designs that are based on NoSQL concepts
B. Create schema designs that are based on a relational database design
C. Avoid schema designs that require atomicity across rows
D. Create schema designs that require atomicity across rows

A

C. Avoid schema designs that require atomicity across rows
All operations are atomic at the row level. For example, if you update two rows in a table, it’s possible that one row will be updated successfully and the other update will fail. Avoid schema designs that require atomicity across rows.

43
Q

Question #: 33
Topic #: 2

When running a pipeline that has a BigQuery source, on your local machine, you continue to get permission denied errors. What could be the reason for that?
A. Your gcloud does not have access to the BigQuery resources
B. BigQuery cannot be accessed from local machines
C. You are missing gcloud on your machine
D. Pipelines cannot be run locally

A

A. Your gcloud does not have access to the BigQuery resources
When reading from a Dataflow source or writing to a Dataflow sink using DirectPipelineRunner, the Cloud Platform account that you configured with the gcloud executable will need access to the corresponding source/sink

44
Q

Question #: 21
Topic #: 2

Which of the following statements about the Wide & Deep Learning model are true? (Select 2 answers.)
A. The wide model is used for memorization, while the deep model is used for generalization.
B. A good use for the wide and deep model is a recommender system.
C. The wide model is used for generalization, while the deep model is used for memorization.
D. A good use for the wide and deep model is a small-scale linear regression problem.

A

A. The wide model is used for memorization, while the deep model is used for generalization.
B. A good use for the wide and deep model is a recommender system.

Can we teach computers to learn like humans do, by combining the power of memorization and generalization? It’s not an easy question to answer, but by jointly training a wide linear model (for memorization) alongside a deep neural network (for generalization), one can combine the strengths of both to bring us one step closer. At Google, we call it Wide & Deep Learning. It’s useful for generic large-scale regression and classification problems with sparse inputs (categorical features with a large number of possible feature values), such as recommender systems, search, and ranking problems.
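A condensed sketch using the legacy tf.estimator API; all feature columns and sizes here are illustrative placeholders. The linear part handles memorization of sparse feature crosses, while the deep part generalizes through embeddings.

```python
import tensorflow as tf

wide_columns = [
    tf.feature_column.categorical_column_with_hash_bucket("user_id", hash_bucket_size=10000)
]
deep_columns = [
    tf.feature_column.numeric_column("age"),
    tf.feature_column.embedding_column(
        tf.feature_column.categorical_column_with_hash_bucket("item_id", hash_bucket_size=10000),
        dimension=8,
    ),
]
model = tf.estimator.DNNLinearCombinedClassifier(
    linear_feature_columns=wide_columns,   # wide part: memorization
    dnn_feature_columns=deep_columns,      # deep part: generalization
    dnn_hidden_units=[64, 32],
)
```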

45
Q

Question #: 9
Topic #: 2

How would you query specific partitions in a BigQuery table?
A. Use the DAY column in the WHERE clause
B. Use the EXTRACT(DAY) clause
C. Use the __PARTITIONTIME pseudo-column in the WHERE clause
D. Use DATE BETWEEN in the WHERE clause

A

C. Use the __PARTITIONTIME pseudo-column in the WHERE clause

Partitioned tables include a pseudo column named _PARTITIONTIME that contains a date-based timestamp for data loaded into the table. To limit a query to particular partitions (such as Jan 1st and 2nd of 2017), use a clause similar to this:
WHERE _PARTITIONTIME BETWEEN TIMESTAMP('2017-01-01') AND TIMESTAMP('2017-01-02')

46
Q

Question #: 48
Topic #: 2

What are the minimum permissions needed for a service account used with Google Dataproc?
A. Execute to Google Cloud Storage; write to Google Cloud Logging
B. Write to Google Cloud Storage; read to Google Cloud Logging
C. Execute to Google Cloud Storage; execute to Google Cloud Logging
D. Read and write to Google Cloud Storage; write to Google Cloud Logging

A

D. Read and write to Google Cloud Storage; write to Google Cloud Logging

Service accounts authenticate applications running on your virtual machine instances to other Google Cloud Platform services. For example, if you write an application that reads and writes files on Google Cloud Storage, it must first authenticate to the Google Cloud Storage API. At a minimum, service accounts used with Cloud Dataproc need permissions to read and write to Google Cloud Storage, and to write to Google Cloud Logging.

47
Q

Question #: 67
Topic #: 2

When a Cloud Bigtable node fails, ____ is lost.
A. all data
B. no data
C. the last transaction
D. the time dimension
A

B. no data
A Cloud Bigtable table is sharded into blocks of contiguous rows, called tablets, to help balance the workload of queries. Tablets are stored on Colossus, Google’s file system, in SSTable format. Each tablet is associated with a specific Cloud Bigtable node.
Data is never stored in Cloud Bigtable nodes themselves; each node has pointers to a set of tablets that are stored on Colossus. As a result:
Rebalancing tablets from one node to another is very fast, because the actual data is not copied. Cloud Bigtable simply updates the pointers for each node.
Recovery from the failure of a Cloud Bigtable node is very fast, because only metadata needs to be migrated to the replacement node.
When a Cloud Bigtable node fails, no data is lost

48
Q

Question #: 64
Topic #: 2

Which of the following statements is NOT true regarding Bigtable access roles?
A. Using IAM roles, you cannot give a user access to only one table in a project, rather than all tables in a project.
B. To give a user access to only one table in a project, grant the user the Bigtable Editor role for that table.
C. You can configure access control only at the project level.
D. To give a user access to only one table in a project, you must configure access through your application.

A

B. To give a user access to only one table in a project, grant the user the Bigtable Editor role for that table.
Reason: Bigtable IAM access control is configured at the project level, not per table, and there is no per-table Bigtable Editor role.

49
Q

Question #: 66
Topic #: 2

Which row keys are likely to cause a disproportionate number of reads and/or writes on a particular node in a Bigtable cluster (select 2 answers)?
A. A sequential numeric ID
B. A timestamp followed by a stock symbol
C. A non-sequential numeric ID
D. A stock symbol followed by a timestamp

A

A. A sequential numeric ID
B. A timestamp followed by a stock symbol
Reason: Sequential IDs and keys that start with a timestamp concentrate writes on a narrow key range at any given moment, so a single node receives a disproportionate share of the traffic.

50
Q

Question #: 69
Topic #: 2

Which is the preferred method to use to avoid hotspotting in time series data in Bigtable?
A. Field promotion
B. Randomization
C. Salting
D. Hashing
A

A. Field promotion

Reason: Field promotion moves a field such as the user ID into the row key, which spreads writes across the key space while keeping the data easy to query. Randomization (B) also avoids hotspots, but it makes range queries harder.
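A plain-Python sketch of field promotion (the device ID field and the "#" separator are illustrative): the identifying field is promoted into the row key ahead of the timestamp, so concurrent writes spread across the key space instead of all landing on "now".

```python
# Illustrative row-key construction for Bigtable time-series data.
def make_row_key(device_id: str, event_ts: int) -> bytes:
    # e.g. b"device42#1700000000"
    return f"{device_id}#{event_ts}".encode()

print(make_row_key("device42", 1700000000))
```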

51
Q

Question #: 8
Topic #: 2

Which of the following statements about Legacy SQL and Standard SQL is not true?
A. Standard SQL is the preferred query language for BigQuery.
B. If you write a query in Legacy SQL, it might generate an error if you try to run it with Standard SQL.
C. One difference between the two query languages is how you specify fully-qualified table names (i.e. table names that include their associated project name).
D. You need to set a query language for each dataset and the default is Standard SQL.

A

D. You need to set a query language for each dataset and the default is Standard SQL.

You do not set a query language for each dataset. It is set each time you run a query and the default query language is Legacy SQL.
Standard SQL has been the preferred query language since BigQuery 2.0 was released.
In legacy SQL, to query a table with a project-qualified name, you use a colon, :, as a separator. In standard SQL, you use a period, ., instead.
Due to the differences in syntax between the two query languages (such as with project-qualified table names), if you write a query in Legacy SQL, it might generate an error if you try to run it with Standard SQL.

52
Q

Question #: 71
Topic #: 2

Which of the following is NOT a valid use case to select HDD (hard disk drives) as the storage for Google Cloud Bigtable?
A. You expect to store at least 10 TB of data.
B. You will mostly run batch workloads with scans and writes, rather than frequently executing random reads of a small number of rows.
C. You need to integrate with Google BigQuery.
D. You will not use the data to back a user-facing or latency-sensitive application.

A

C. You need to integrate with Google BigQuery.

For example, if you plan to store extensive historical data for a large number of remote-sensing devices and then use the data to generate daily reports, the cost savings for HDD storage may justify the performance tradeoff. On the other hand, if you plan to use the data to display a real-time dashboard, it probably would not make sense to use HDD storage; reads would be much more frequent in this case, and reads are much slower with HDD storage.

53
Q

Question #: 43
Topic #: 2

You are developing a software application using Google's Dataflow SDK, and want to use conditional, for loops and other complex programming structures to create a branching pipeline. Which component will be used for the data processing operation?
A. PCollection
B. Transform
C. Pipeline
D. Sink API
A

B. Transform
In Google Cloud, the Dataflow SDK provides a transform component. It is responsible for the data processing operation. You can use conditional, for loops, and other complex programming structure to create a branching pipeline.

54
Q

Question #: 36
Topic #: 2

Which Java SDK class can you use to run your Dataflow programs locally?
A. LocalRunner
B. DirectPipelineRunner
C. MachineRunner
D. LocalPipelineRunner
A

B. DirectPipelineRunner
DirectPipelineRunner allows you to execute operations in the pipeline directly, without any optimization. It is useful for small local executions and tests.
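The question refers to the class from the original Dataflow Java SDK 1.x; in the current Apache Beam SDKs the equivalent local runner is called DirectRunner. A minimal, purely illustrative Python sketch:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Run a trivial pipeline locally with the DirectRunner.
with beam.Pipeline(options=PipelineOptions(runner="DirectRunner")) as p:
    _ = p | beam.Create([1, 2, 3]) | beam.Map(lambda x: x * 2) | beam.Map(print)
```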

55
Q

Question #: 6
Topic #: 2

Which of these statements about BigQuery caching is true?
A. By default, a query’s results are not cached.
B. BigQuery caches query results for 48 hours.
C. Query results are cached even if you specify a destination table.
D. There is no charge for a query that retrieves its results from cache.

A

D. There is no charge for a query that retrieves its results from cache.

When query results are retrieved from a cached results table, you are not charged for the query.
BigQuery caches query results for 24 hours, not 48 hours.
Query results are not cached if you specify a destination table.
A query’s results are always cached except under certain conditions, such as if you specify a destination table.

56
Q

Question #: 14
Topic #: 2

Which of these operations can you perform from the BigQuery Web UI?
A. Upload a file in SQL format.
B. Load data with nested and repeated fields.
C. Upload a 20 MB file.
D. Upload multiple files using a wildcard.

A

B. Load data with nested and repeated fields.

You can load data with nested and repeated fields using the Web UI.
You cannot use the Web UI to:
- Upload a file greater than 10 MB in size
- Upload multiple files at the same time
- Upload a file in SQL format
All three of the above operations can be performed using the “bq” command.

57
Q

Question #: 63
Topic #: 2

What is the general recommendation when designing your row keys for a Cloud Bigtable schema?
A. Include multiple time series values within the row key
B. Keep the row key as an 8-bit integer
C. Keep your row key reasonably short
D. Keep your row key as long as the field permits

A

C. Keep your row key reasonably short

A general guide is to keep your row keys reasonably short. Long row keys take up additional memory and storage and increase the time it takes to get responses from the Cloud Bigtable server.

58
Q

Question #: 78
Topic #: 2

What is the recommended action to do in order to switch between SSD and HDD storage for your Google Cloud Bigtable instance?
A. create a third instance and sync the data from the two storage types via batch jobs
B. export the data from the existing instance and import the data into a new instance
C. run parallel instances where one is HDD and the other is SSD
D. the selection is final and you must resume using the same storage type

A

B. export the data from the existing instance and import the data into a new instance

When you create a Cloud Bigtable instance and cluster, your choice of SSD or HDD storage for the cluster is permanent. You cannot use the Google Cloud Platform Console to change the type of storage that is used for the cluster.
If you need to convert an existing HDD cluster to SSD, or vice versa, you can export the data from the existing instance and import the data into a new instance. Alternatively, you can write a Cloud Dataflow or Hadoop MapReduce job that copies the data from one instance to another.

59
Q

Question #: 58
Topic #: 2

Cloud Dataproc charges you only for what you really use with _____ billing.
A. month-by-month
B. minute-by-minute
C. week-by-week
D. hour-by-hour
A

B. minute-by-minute
One of the advantages of Cloud Dataproc is its low cost. Dataproc charges for what you really use with minute-by-minute billing and a low, ten-minute-minimum billing period.

60
Q

Question #: 47
Topic #: 2

Which of the following job types are supported by Cloud Dataproc (select 3 answers)?
A. Hive
B. Pig
C. YARN
D. Spark
A

A. Hive
B. Pig
D. Spark

Cloud Dataproc provides out-of-the box and end-to-end support for many of the most popular job types, including Spark, Spark SQL, PySpark, MapReduce, Hive, and Pig jobs.

61
Q

Question #: 44
Topic #: 2

Which of the following IAM roles does your Compute Engine account require to be able to run pipeline jobs?
A. dataflow.worker
B. dataflow.compute
C. dataflow.developer
D. dataflow.viewer
A

A. dataflow.worker

62
Q

Question #: 3
Topic #: 2

Which of these statements about exporting data from BigQuery is false?
A. To export more than 1 GB of data, you need to put a wildcard in the destination filename.
B. The only supported export destination is Google Cloud Storage.
C. Data can only be exported in JSON or Avro format.
D. The only compression option available is GZIP.

A

C. Data can only be exported in JSON or Avro format.

You cannot export table data to a local file, to Google Sheets, or to Google Drive. The only supported export location is Cloud Storage. For information on saving query results, see Downloading and saving query results.
You can export up to 1 GB of table data to a single file. If you are exporting more than 1 GB of data, use a wildcard to export the data into multiple files. When you export data to multiple files, the size of the files will vary.
You cannot export nested and repeated data in CSV format. Nested and repeated data is supported for Avro and JSON exports.
When you export data in JSON format, INT64 (integer) data types are encoded as JSON strings to preserve 64-bit precision when the data is read by other systems.
You cannot export data from multiple tables in a single export job.
You cannot choose a compression type other than GZIP when you export data using the Cloud Console or the classic BigQuery web UI.

63
Q

Question #: 17
Topic #: 2

Which of these numbers are adjusted by a neural network as it learns from a training dataset (select 2 answers)?
A. Weights
B. Biases
C. Continuous features
D. Input values
A

A. Weights
B. Biases
A neural network is a simple mechanism that's implemented with basic math. The only difference between the traditional programming model and a neural network is that you let the computer determine the parameters (weights and bias) by learning from training datasets.
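A toy, framework-free sketch of the point (synthetic data, a single "neuron"): training adjusts the weight and bias, while the input values themselves are never changed.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 3.0 * x + 1.0                 # "true" relationship to learn
w, b = 0.0, 0.0                   # parameters the network adjusts
lr = 0.1
for _ in range(200):
    pred = w * x + b
    grad_w = np.mean(2 * (pred - y) * x)  # gradient of mean squared error w.r.t. w
    grad_b = np.mean(2 * (pred - y))      # gradient w.r.t. b
    w -= lr * grad_w
    b -= lr * grad_b
print(w, b)  # approaches 3.0 and 1.0
```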

64
Q

Question #: 13
Topic #: 2

Which of these is not a supported method of putting data into a partitioned table?
A. If you have existing data in a separate file for each day, then create a partitioned table and upload each file into the appropriate partition.
B. Run a query to get the records for a specific day from an existing table and for the destination table, specify a partitioned table ending with the day in the format “$YYYYMMDD”.
C. Create a partitioned table and stream new records to it every day.
D. Use ORDER BY to put a table’s rows into chronological order and then change the table’s type to “Partitioned”.

A

D. Use ORDER BY to put a table’s rows into chronological order and then change the table’s type to “Partitioned”.

You cannot change an existing table into a partitioned table. You must create a partitioned table from scratch. Then you can either stream data into it every day and the data will automatically be put in the right partition, or you can load data into a specific partition by using “$YYYYMMDD” at the end of the table name.

65
Q

Question #: 51
Topic #: 2

Which Google Cloud Platform service is an alternative to Hadoop with Hive?
A. Cloud Dataflow
B. Cloud Bigtable
C. BigQuery
D. Cloud Datastore
A

C. BigQuery
Apache Hive is a data warehouse software project built on top of Apache Hadoop for providing data summarization, query, and analysis.
Google BigQuery is an enterprise data warehouse.

66
Q

Question #: 65
Topic #: 2

For the best possible performance, what is the recommended zone for your Compute Engine instance and Cloud Bigtable instance?
A. Have the Compute Engine instance in the furthest zone from the Cloud Bigtable instance.
B. Have both the Compute Engine instance and the Cloud Bigtable instance to be in different zones.
C. Have both the Compute Engine instance and the Cloud Bigtable instance to be in the same zone.
D. Have the Cloud Bigtable instance to be in the same zone as all of the consumers of your data.

A

C. Have both the Compute Engine instance and the Cloud Bigtable instance to be in the same zone.

It is recommended to create your Compute Engine instance in the same zone as your Cloud Bigtable instance for the best possible performance.
If it's not possible to create an instance in the same zone, you should create your instance in another zone within the same region. For example, if your Cloud Bigtable instance is located in us-central1-b, you could create your instance in us-central1-f. This change may result in several milliseconds of additional latency for each Cloud Bigtable request.
It is recommended to avoid creating your Compute Engine instance in a different region from your Cloud Bigtable instance, which can add hundreds of milliseconds of latency to each Cloud Bigtable request.

67
Q

Question #: 15
Topic #: 2

Which methods can be used to reduce the number of rows processed by BigQuery?
A. Splitting tables into multiple tables; putting data in partitions
B. Splitting tables into multiple tables; putting data in partitions; using the LIMIT clause
C. Putting data in partitions; using the LIMIT clause
D. Splitting tables into multiple tables; using the LIMIT clause

A

A. Splitting tables into multiple tables; putting data in partitions

68
Q

Question #: 37
Topic #: 2

The Dataflow SDKs have been recently transitioned into which Apache service?
A. Apache Spark
B. Apache Hadoop
C. Apache Kafka
D. Apache Beam
A

D. Apache Beam

69
Q

Question #: 54
Topic #: 2

Cloud Dataproc is a managed Apache Hadoop and Apache _____ service.
A. Blaze
B. Spark
C. Fire
D. Ignite
A

B. Spark
Cloud Dataproc is a managed Apache Spark and Apache Hadoop service that lets you use open source data tools for batch processing, querying, streaming, and machine learning.

70
Q

Question #: 50
Topic #: 2

When creating a new Cloud Dataproc cluster with the projects.regions.clusters.create operation, these four values are required: project, region, name, and ____.
A. zone
B. node
C. label
D. type
A

A. zone
At a minimum, you must specify four values when creating a new cluster with the projects.regions.clusters.create operation:
- The project in which the cluster will be created
- The region to use
- The name of the cluster
- The zone in which the cluster will be created
You can specify many more details beyond these minimum requirements. For example, you can also specify the number of workers, whether preemptible compute should be used, and the network settings.

71
Q

Question #: 5
Topic #: 2

Which of the following is not possible using primitive roles?
A. Give a user viewer access to BigQuery and owner access to Google Compute Engine instances.
B. Give UserA owner access and UserB editor access for all datasets in a project.
C. Give a user access to view all datasets in a project, but not run queries on them.
D. Give GroupA owner access and GroupB editor access for all datasets in a project.

A

C. Give a user access to view all datasets in a project, but not run queries on them.

Primitive roles can be used to give owner, editor, or viewer access to a user or group, but they can’t be used to separate data access permissions from job-running permissions.

72
Q

Question #: 74
Topic #: 2

If you’re running a performance test that depends upon Cloud Bigtable, all the choices except one below are recommended steps. Which is NOT a recommended step to follow?
A. Do not use a production instance.
B. Run your test for at least 10 minutes.
C. Before you test, run a heavy pre-test for several minutes.
D. Use at least 300 GB of data.

A

A. Do not use a production instance.
If you’re running a performance test that depends upon Cloud Bigtable, be sure to follow these steps as you plan and execute your test:
Use a production instance. A development instance will not give you an accurate sense of how a production instance performs under load.
Use at least 300 GB of data. Cloud Bigtable performs best with 1 TB or more of data. However, 300 GB of data is enough to provide reasonable results in a performance test on a 3-node cluster. On larger clusters, use 100 GB of data per node.
Before you test, run a heavy pre-test for several minutes. This step gives Cloud Bigtable a chance to balance data across your nodes based on the access patterns it observes.
Run your test for at least 10 minutes. This step lets Cloud Bigtable further optimize your data, and it helps ensure that you will test reads from disk as well as cached reads from memory.

73
Q

Question #: 1
Topic #: 2

Suppose you have a table that includes a nested column called “city” inside a column called “person”, but when you try to submit the following query in BigQuery, it gives you an error.
SELECT person FROM project1.example.table1 WHERE city = “London”
How would you correct the error?
A. Add “, UNNEST(person)” before the WHERE clause.
B. Change “person” to “person.city”.
C. Change “person” to “city.person”.
D. Add “, UNNEST(city)” before the WHERE clause.

A

A. Add “, UNNEST(person)” before the WHERE clause.

To access the person.city column, you need to UNNEST(person) and join it to table1 using a comma. The corrected query is: SELECT person FROM project1.example.table1, UNNEST(person) WHERE city = "London"

74
Q

Question #: 26
Topic #: 2

Which of these are examples of a value in a sparse vector? (Select 2 answers.)
A. [0, 5, 0, 0, 0, 0]
B. [0, 0, 0, 1, 0, 0, 1]
C. [0, 1]
D. [1, 0, 0, 0, 0, 0, 0]
A

C. [0, 1]
D. [1, 0, 0, 0, 0, 0, 0]

In this context a sparse vector is the one-hot representation of a single categorical value, so it contains exactly one 1 and the rest 0s.
[0, 0, 0, 1, 0, 0, 1] is not such a vector because it has two 1s in it.
[0, 5, 0, 0, 0, 0] is not such a vector because it contains a value other than 0 and 1.

75
Q

Question #: 11
Topic #: 2

To give a user read permission for only the first three columns of a table, which access control method would you use?
A. Primitive role
B. Predefined role
C. Authorized view
D. It’s not possible to give access to only the first three columns of a table.

A

C. Authorized view
An authorized view allows you to share query results with particular users and groups without giving them read access to the underlying tables. Authorized views can only be created in a dataset that does not contain the tables queried by the view.
When you create an authorized view, you use the view’s SQL query to restrict access to only the rows and columns you want the users to see.

76
Q

Question #: 24
Topic #: 2

Suppose you have a dataset of images that are each labeled as to whether or not they contain a human face. To create a neural network that recognizes human faces in images using this labeled dataset, what approach would likely be the most effective?
A. Use K-means Clustering to detect faces in the pixels.
B. Use feature engineering to add features for eyes, noses, and mouths to the input data.
C. Use deep learning by creating a neural network with multiple hidden layers to automatically detect features of faces.
D. Build a neural network with an input layer of pixels, a hidden layer, and an output layer with two categories.

A

C. Use deep learning by creating a neural network with multiple hidden layers to automatically detect features of faces.

77
Q

Question #: 28
Topic #: 2

If a dataset contains rows with individual people and columns for year of birth, country, and income, how many of the columns are continuous and how many are categorical?
A. 1 continuous and 2 categorical
B. 3 categorical
C. 3 continuous
D. 2 continuous and 1 categorical
A

D. 2 continuous and 1 categorical

78
Q

Question #: 7
Topic #: 2

Which of these sources can you not load data into BigQuery from?
A. File upload
B. Google Drive
C. Google Cloud Storage
D. Google Cloud SQL
A

D. Google Cloud SQL
You can load data into BigQuery from a file upload, Google Cloud Storage, Google Drive, or Google Cloud Bigtable. It is not possible to load data into BigQuery directly from Google Cloud SQL. One way to get data from Cloud SQL to BigQuery would be to export data from Cloud SQL to Cloud Storage and then load it from there.