UDEMY Flashcards

1
Q

The data mining project manager meets with the production line supervisor to discuss the implementation of changes and improvements. Which stage of CRISP-DM does this scenario refer to?

Deployment Phase

​Data Preparation
​
Modeling Phase
​
Data Understanding
A

Deployment Phase

Correct

2
Q

Which model within the Cloud Speech-to-Text API is best for audio that originated from video or includes multiple speakers (typically recorded at a 16 kHz or higher sampling rate)?

video

​phone_call
​
command_and_search
​
default
A

video
(Correct)

Explanation
Please refer GCP ML Services Documentation https://cloud.google.com/products/ai/
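A minimal sketch of selecting the video model with the google-cloud-speech Python client; the bucket URI and language code below are placeholders:

```python
# Minimal sketch: transcribe 16 kHz audio that came from video using the
# "video" model of the Cloud Speech-to-Text API (google-cloud-speech client).
from google.cloud import speech

client = speech.SpeechClient()

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
    model="video",  # model tuned for audio from video or with multiple speakers
)
audio = speech.RecognitionAudio(uri="gs://my-bucket/meeting-audio.wav")  # placeholder URI

response = client.recognize(config=config, audio=audio)
for result in response.results:
    print(result.alternatives[0].transcript)
```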

3
Q
Which GCP ML Service can help identify entities and labels by types such as a person, organization, location, events, products, and media from within a text?
​
Cloud Translation
​
Cloud Vision
​
Cloud Natural Language

​Cloud Video Intelligence

A

Cloud Natural Language
(Correct)

Explanation
Please refer GCP ML Services Documentation https://cloud.google.com/products/ai/
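A minimal sketch of entity analysis with the google-cloud-language Python client; the sample sentence is illustrative:

```python
# Minimal sketch: label entities (person, organization, location, ...) in text
# with the Cloud Natural Language API (google-cloud-language client).
from google.cloud import language_v1

client = language_v1.LanguageServiceClient()
document = language_v1.Document(
    content="Sundar Pichai announced new products at Google I/O in Mountain View.",
    type_=language_v1.Document.Type.PLAIN_TEXT,
)

response = client.analyze_entities(document=document)
for entity in response.entities:
    print(entity.name, language_v1.Entity.Type(entity.type_).name, entity.salience)
```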

4
Q
________ is a scalable, fully-managed NoSQL document database for your web and mobile applications.
​
Google Cloud Datastore
​
Google Cloud Bigtable
​
Google Cloud Storage
​
Cloud Storage for Firebase
A

Google Cloud Datastore
(Correct)

Explanation
Please refer https://cloud.google.com/storage-options/
https://cloud.google.com/datastore

Highly scalable NoSQL database
Firestore is the next generation of Datastore. Learn more about upgrading to Firestore.

Datastore is a highly scalable NoSQL database for your applications. Datastore automatically handles sharding and replication, providing you with a highly available and durable database that scales automatically to handle your applications’ load. Datastore provides a myriad of capabilities such as ACID transactions, SQL-like queries, indexes, and much more.
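A minimal sketch of writing and reading an entity with the google-cloud-datastore Python client; the kind and key names are illustrative:

```python
# Minimal sketch: store and read back an entity in Cloud Datastore
# (google-cloud-datastore client).
from google.cloud import datastore

client = datastore.Client()

key = client.key("Task", "sample-task")          # kind "Task", key name "sample-task"
task = datastore.Entity(key=key)
task.update({"description": "Ship the report", "done": False})
client.put(task)                                 # transactional, durable write

fetched = client.get(key)                        # lookup by key
print(fetched["description"], fetched["done"])
```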

5
Q

In order to use categorical variables in a regression problem, which data pre-processing step is needed?

Use the categorical variable as is

Convert the categorical variable into binary dummy variables

Use a classification algorithm instead

Assign numeric values to different levels within a categorical variable and use the numeric values instead

A

Convert the categorical variable into binary dummy variables
(Correct)

Explanation
Binary dummy variables convert a categorical variable into multiple 0/1 numerical variables, one per category level, which can then be used in a regression model.
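A minimal sketch of this step with pandas; column names and values are illustrative:

```python
# Minimal sketch: turn a categorical column into binary dummy variables
# with pandas before fitting a regression model.
import pandas as pd

df = pd.DataFrame({
    "color": ["red", "green", "blue", "green"],
    "price": [10.0, 12.5, 9.0, 13.0],
})

# One 0/1 column per category level (drop_first avoids the dummy-variable trap).
dummies = pd.get_dummies(df["color"], prefix="color", drop_first=True)
X = pd.concat([df.drop(columns=["color"]), dummies], axis=1)
print(X)
```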

6
Q
Which GCP ML Service can help to extract text and identify the language from within an image?
​
Cloud Translation
​
Cloud Natural Language
​
Cloud Vision
​
Cloud Video Intelligence
A
Cloud Vision
(Correct)
​
Explanation
Please refer GCP ML Services Documentation https://cloud.google.com/products/ai/
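A minimal sketch of text detection with the google-cloud-vision Python client; the GCS image URI is a placeholder:

```python
# Minimal sketch: extract text (and its detected language) from an image
# with the Cloud Vision API (google-cloud-vision client).
from google.cloud import vision

client = vision.ImageAnnotatorClient()
image = vision.Image(source=vision.ImageSource(image_uri="gs://my-bucket/receipt.png"))

response = client.text_detection(image=image)
annotation = response.full_text_annotation
print(annotation.text)
for page in annotation.pages:
    for lang in page.property.detected_languages:
        print("detected language:", lang.language_code)
```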
7
Q
A startup wishes to use a data processing platform that supports both batch and streaming applications, and it would prefer a hands-off, serverless data processing platform. Which GCP service is suited for this?
​
Shell Scripts on Compute Instances
​
Dataproc
​
Cloud SQL
​
Dataflow
A

Dataflow
(Correct)

Explanation
Please refer https://cloud.google.com/dataflow/

Dataflow
Unified stream and batch data processing that’s serverless, fast, and cost-effective.

Flexible scheduling and pricing for batch processing
For processing with flexibility in job scheduling time, such as overnight jobs, flexible resource scheduling (FlexRS) offers a lower price for batch processing. These flexible jobs are placed into a queue with a guarantee that they will be retrieved for execution within a six-hour window.

Ready-to-use real-time AI patterns
Enabled through ready-to-use patterns, Dataflow’s real-time AI capabilities allow for real-time reactions with near-human intelligence to large torrents of events. Customers can build intelligent solutions ranging from predictive analytics and anomaly detection to real-time personalization and other advanced analytics use cases.

8
Q
Which of the following is not a category of Data Access audit logs for Cloud Spanner?
​
Data Access (DATA_WRITE)
​
Data Access (DATA_READ)
​
Data Access (ADMIN_READ)
​
User Access (USER_ACCESS)
A
User Access (USER_ACCESS)
(Correct)

Explanation
Please refer https://cloud.google.com/spanner/docs/audit-logging

Cloud Spanner audit logging information
This page describes the audit logs created by Cloud Spanner as part of Cloud Audit Logs.

Overview
Google Cloud services write audit logs to help you answer the questions, “Who did what, where, and when?” Your Cloud projects contain only the audit logs for resources that are directly within the project. Other entities, such as folders, organizations, and Cloud Billing accounts, contain the audit logs for the entity itself.

For a general overview of Cloud Audit Logs, see Cloud Audit Logs. For a deeper understanding of Cloud Audit Logs, review Understanding audit logs.

Cloud Audit Logs maintains three audit logs for each Cloud project, folder, and organization:

Admin Activity audit logs
Data Access audit logs
System Event audit logs

9
Q
A user wishes to store images, videos, objects, and blob data in a scalable, fully managed, highly reliable, and cost-efficient object/blob store. Which GCP storage option is appropriate for this use case?
​
Cloud Storage for Firebase
​
Google Cloud Storage
​
Google Cloud Bigtable
​
Google Cloud Datastore
A

Google Cloud Storage
(Correct)

Explanation
Please refer https://cloud.google.com/storage-options/

Object Lifecycle Management: Define conditions that trigger data deletion or transition to a cheaper storage class.
Object Versioning: Continue to store old copies of objects when they are deleted or overwritten.
Retention policies: Define minimum retention periods that objects must be stored for before they’re deletable.
Object holds: Place a hold on an object to prevent its deletion.
Customer-managed encryption keys: Encrypt object data with encryption keys stored by the Cloud Key Management Service and managed by you.
Customer-supplied encryption keys: Encrypt object data with encryption keys created and managed by you.
Uniform bucket-level access: Uniformly control access to your Cloud Storage resources by disabling object ACLs.
Requester Pays: Require accessors of your data to include a project ID to bill for network charges, operation charges, and retrieval fees.
Bucket Lock: Configure a data retention policy for a Cloud Storage bucket that governs how long objects in the bucket must be retained.
Pub/Sub notifications for Cloud Storage: Send notifications to Pub/Sub when objects are created, updated, or deleted.
Cloud Audit Logs with Cloud Storage: Maintain admin activity logs and data access logs for your Cloud Storage resources.
Object- and bucket-level permissions: Cloud Identity and Access Management (IAM) allows you to control who has access to your buckets and objects.
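A minimal sketch of storing an object with the google-cloud-storage Python client; bucket and object names are placeholders:

```python
# Minimal sketch: upload a video file as an object to a Cloud Storage bucket
# (google-cloud-storage client).
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-media-bucket")
blob = bucket.blob("videos/clip-001.mp4")

blob.upload_from_filename("clip-001.mp4")    # stores the local file as an object
blob.reload()                                # refresh metadata from the service
print(blob.name, blob.size, blob.storage_class)
```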

10
Q
With regard to instance startup time, the App Engine Standard environment is faster than the App Engine Flexible environment (True/False)
​
TRUE
​
FALSE
A

TRUE
(Correct)

Explanation
Please refer https://cloud.google.com/appengine/docs/the-appengine-environments

Comparing high-level features
The following table summarizes the differences between the two environments:

Feature | Standard environment | Flexible environment
Instance startup time | Seconds | Minutes
Maximum request timeout | Depends on the runtime and type of scaling | 60 minutes
Background threads | Yes, with restrictions | Yes
Background processes | No | Yes
SSH debugging | No | Yes
Scaling | Manual, Basic, Automatic | Manual, Automatic
Scale to zero | Yes | No, minimum 1 instance
Writing to local disk | Java 8, Java 11, Node.js, Python 3, PHP 7, Ruby, Go 1.11, and Go 1.12+ have read and write access to the /tmp directory; Python 2.7 and PHP 5.5 don’t have write access to the disk | Yes, ephemeral (disk initialized on each VM startup)
Modifying the runtime | No | Yes (through Dockerfile)
Deployment time | Seconds | Minutes
Automatic in-place security patches | Yes | Yes (excludes container image runtime)
Access to Google Cloud APIs & Services such as Cloud Storage, Cloud SQL, Memorystore, Tasks, and others | Yes | Yes
WebSockets | No (Java 8, Python 2, and PHP 5 provide a proprietary Sockets API (beta), but the API is not available in newer standard runtimes) | Yes
Supports installing third-party binaries | Yes for Java 8, Java 11, Node.js, Python 3, PHP 7, Ruby 2.5 (beta), Go 1.11, and Go 1.12+; No for Python 2.7 and PHP 5.5 | Yes
Location | North America, Asia Pacific, or Europe | North America, Asia Pacific, or Europe
Pricing | Based on instance hours | Based on usage of vCPU, memory, and persistent disks
For an in-depth comparison of the environments, see the guide for your language: Python, Java, Go, or PHP.

11
Q
Which of the following is not a machine learning type?
​
Dimensionality Reduction Techniques
​
Supervised Machine Learning
​
Reinforcement Learning
​
Unsupervised Machine Learning
A

Dimensionality Reduction Techniques
(Correct)

Explanation
Dimensionality reduction is a technique for reducing the number of features; it is not a type of machine learning.

Dimensionality reduction, or dimension reduction, is the transformation of data from a high-dimensional space into a low-dimensional space so that the low-dimensional representation retains some meaningful properties of the original data, ideally close to its intrinsic dimension. Working in high-dimensional spaces can be undesirable for many reasons; raw data are often sparse as a consequence of the curse of dimensionality, and analyzing the data is usually computationally intractable. Dimensionality reduction is common in fields that deal with large numbers of observations and/or large numbers of variables, such as signal processing, speech recognition, neuroinformatics, and bioinformatics.[1]

Methods are commonly divided into linear and non-linear approaches.[1] Approaches can also be divided into feature selection and feature extraction.[2] Dimensionality reduction can be used for noise reduction, data visualization, cluster analysis, or as an intermediate step to facilitate other analyses.

12
Q
Cloud Datastore supports ACID transactions, SQL-like queries, and indexes (True/False)
​
FALSE
​
TRUE
A

TRUE
(Correct)

Explanation
Please refer https://cloud.google.com/datastore/

Firestore is the next generation of Datastore. Learn more about upgrading to Firestore.

Datastore is a highly scalable NoSQL database for your applications. Datastore automatically handles sharding and replication, providing you with a highly available and durable database that scales automatically to handle your applications’ load. Datastore provides a myriad of capabilities such as ACID transactions, SQL-like queries, indexes, and much more.

13
Q
Which GCP ML Service can help extract tokens and sentences, identify parts of speech (PoS), and create dependency parse trees for each sentence within a text?
​
Cloud Vision
​
Cloud Video Intelligence
​
Cloud Translation
​
Cloud Natural Language
A

Cloud Natural Language
(Correct)

Explanation
Please refer GCP ML Services Documentation https://cloud.google.com/products/ai/

Natural Language API features

Natural Language is accessible via our REST API. Text can be uploaded in the request or integrated with Cloud Storage.

Syntax analysis

Extract tokens and sentences, identify parts of speech and create dependency parse trees for each sentence.

Entity analysis

Identify entities within documents — including receipts, invoices, and contracts — and label them by types such as date, person, contact information, organization, location, events, products, and media.

Custom entity extraction

Identify entities within documents and label them based on your own domain-specific keywords or phrases.

Sentiment analysis

Understand the overall opinion, feeling, or attitude sentiment expressed in a block of text.

Custom sentiment analysis

Understand the overall opinion, feeling, or attitude expressed in a block of text tuned to your own domain-specific sentiment scores.

Content classification

Classify documents in 700+ predefined categories.

Custom content classification

Create labels to customize models for unique use cases, using your own training data.

Multi-language

Enables you to easily analyze text in multiple languages including English, Spanish, Japanese, Chinese (simplified and traditional), French, German, Italian, Korean, Portuguese, and Russian.

Custom models

Train custom machine learning models with minimum effort and machine learning expertise.

Powered by Google’s AutoML models

Leverages Google state-of-the-art AutoML technology to produce high-quality models.

Spatial structure understanding

Use the structure and layout information in PDFs to improve custom entity extraction performance.

Large dataset support

Unlock complex use cases with support for 5,000 classification labels, 1 million documents, and 10 MB document size.

14
Q

To convert a continuous variable into a categorical variable, which of the following techniques can we use?

Bin the numerical variable into different categories

use min-max normalization

treat the numerical value directly as a categorical variable

use mean as a representative value

A

Bin the numerical variable into different categories
(Correct)

Dividing a Continuous Variable into Categories
This is also known by other names such as “discretizing,” “chopping data,” or “binning.”[1] Specific methods sometimes used include “median split” or “extreme third tails.”

Whatever it is called, it is usually[2] a bad idea. Instead, use a technique (such as regression) that can work with the continuous variable. The basic reason is intuitive: you are tossing away information. This can occur in various ways with various consequences. Here are some:

  1. When doing hypothesis tests, the loss of information when dividing continuous variables into categories typically translates into losing power.[3]
  2. The loss of information involved in choosing bins to make a histogram can result in a misleading histogram.[4]
  3. Collecting continuous data by categories can also cause headaches later on. Good and Hardin[5] give an example of a long-term study in which incomes were relevant. The data were collected in categories of ten thousand dollars. Because of inflation, purchasing power decreased noticeably from the beginning to the end of the study. The categorization of income made it virtually impossible to correct for inflation.
  4. Wainer, Gessaroli, and Verdi[6] argue that if a large enough sample is drawn from two uncorrelated variables, it is possible to group the variables one way so that the binned means show an increasing trend, and another way so that they show a decreasing trend. They conclude that if the original data are available, one should look at the scatterplot rather than at binned data. Moral: if there is a good justification for binning data in an analysis, it should be “before the fact” – you could otherwise be accused of manipulating the data to get the results you want!
  5. There are times when continuous data must be dichotomized, for example in deciding a cut-off for diagnostic criteria. When this is the case, it is important to choose the cut-off carefully, and to consider the sensitivity, specificity, and positive predictive value.[7]

Notes:

  1. “Binning” is also used to refer to processes used in data mining and analytics. In those fields, which usually deal with large data sets and aim to discover patterns, carefully developed algorithms and validating with holdout subsamples can create a more rigorous process than the types of discretizing discussed on this web page.
  2. One situation in which it may be necessary is when comparing new data with existing data where only the categories are known, not the values of the continuous variable. Categorizing may also sometimes be appropriate for explaining an idea to an audience that lacks the sophistication for the full analysis. However, this should only be done when the full analysis has been done and justifies the result that is illustrated by the simpler technique using categorizing. For an example, see Gelman and Park (2008), Splitting a predictor at the upper quarter or third, American Statistician 62, No. 4, pp. 1-8. See also footnote 1 above.
  3. See http://psych.colorado.edu/~mcclella/MedianSplit/ for a demo illustrating this in the case when a continuous predictor in regression is dichotomized using a median split. Also see Van Belle (2008) Statistical Rules of Thumb, pp. 139 - 140 for more discussion and references.
  4. Some software has a “kernel density” feature that can give an estimate of the distribution of data. This is usually better than a histogram. The problem with bins in a histogram is the reason why histograms are not good for checking model assumptions.
  5. Good and Hardin (2006) Common Errors in Statistics, pp. 28 - 29.
  6. Wainer, Gessaroli, and Verdi (2006). Finding What Is Not There through the Unfortunate Binning of Results: The Mendel Effect, Chance Magazine, Vol 19, No. 1, pp. 49-52. Essentially the same article appears as Chapter 14 in Wainer (2009) Picturing the Uncertain World, Princeton University Press.
  7. In addition to the references listed at the end of the linked page, see also Susan Ott’s Bone Density page for a graphical discussion of the cut-offs for osteoporosis and osteopenia.

https://web.ma.utexas.edu/users/mks/statmistakes/dividingcontinuousintocategories.html
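If binning is genuinely required, a minimal pandas sketch (bin edges and labels are illustrative):

```python
# Minimal sketch: bin a continuous variable into categories with pandas
# (only when categorization is genuinely needed, per the caveats above).
import pandas as pd

ages = pd.Series([3, 17, 25, 42, 67, 80])

# Explicit bin edges and labels; intervals are right-closed by default.
age_group = pd.cut(ages, bins=[0, 12, 18, 65, 120],
                   labels=["child", "teen", "adult", "senior"])
print(age_group)
print(age_group.value_counts())
```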

15
Q
BigQuery ML supports which of the following ML models?
​
Decision Trees
​
Naive Bayes
​
Binary logistic regression
​
Random Forest
A

Binary logistic regression
(Correct)

Explanation
Please refer https://cloud.google.com/bigquery/docs/bigqueryml-intro

Supported models in BigQuery ML
A model in BigQuery ML represents what an ML system has learned from the training data.

BigQuery ML supports the following types of models:

Linear regression for forecasting; for example, the sales of an item on a given day. Labels are real-valued (they cannot be +/- infinity or NaN).
Binary logistic regression for classification; for example, determining whether a customer will make a purchase. Labels must only have two possible values.
Multiclass logistic regression for classification. These models can be used to predict multiple possible values such as whether an input is "low-value," "medium-value," or "high-value." Labels can have up to 50 unique values. In BigQuery ML, multiclass logistic regression training uses a multinomial classifier with a cross-entropy loss function.
K-means clustering for data segmentation; for example, identifying customer segments. K-means is an unsupervised learning technique, so model training does not require labels nor split data for training or evaluation.
Matrix Factorization for creating product recommendation systems. You can create product recommendations using historical customer behavior, transactions, and product ratings and then use those recommendations for personalized customer experiences.
Time series for performing time-series forecasts. You can use this feature to create millions of time series models and use them for forecasting. The model automatically handles anomalies, seasonality, and holidays.
Boosted Tree for creating XGBoost based classification and regression models.
Deep Neural Network (DNN) for creating TensorFlow based Deep Neural Networks for classification and regression models.
AutoML Tables to create best-in-class models without feature engineering or model selection. AutoML Tables searches through a variety of model architectures to decide the best model.
TensorFlow model importing. This feature lets you create BigQuery ML models from previously trained TensorFlow models, then perform prediction in BigQuery ML.
In BigQuery ML, you can use a model with data from multiple BigQuery datasets for training and for prediction.
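A minimal sketch of training a binary logistic regression model in BigQuery ML from the Python client; the dataset, table, and column names are placeholders:

```python
# Minimal sketch: train a binary logistic regression model in BigQuery ML.
from google.cloud import bigquery

client = bigquery.Client()
sql = """
CREATE OR REPLACE MODEL `my_dataset.purchase_model`
OPTIONS (model_type = 'logistic_reg') AS
SELECT
  did_purchase AS label,   -- must have exactly two values for binary logistic regression
  country,
  pageviews
FROM `my_dataset.sessions`
"""
client.query(sql).result()   # waits for the training job to finish
```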
16
Q
BigQuery ML supports which of the following ML models?
​
Random Forest
​
Multiclass logistic regression for classification
​
Principal Component Analysis
​
K means Algorithm
A
Multiclass logistic regression for classification
(Correct)

Explanation
Please refer https://cloud.google.com/bigquery/docs/bigqueryml-intro

Supported models in BigQuery ML
A model in BigQuery ML represents what an ML system has learned from the training data.

BigQuery ML supports the following types of models:

Linear regression for forecasting; for example, the sales of an item on a given day. Labels are real-valued (they cannot be +/- infinity or NaN).
Binary logistic regression for classification; for example, determining whether a customer will make a purchase. Labels must only have two possible values.
Multiclass logistic regression for classification. These models can be used to predict multiple possible values such as whether an input is "low-value," "medium-value," or "high-value." Labels can have up to 50 unique values. In BigQuery ML, multiclass logistic regression training uses a multinomial classifier with a cross-entropy loss function.
K-means clustering for data segmentation; for example, identifying customer segments. K-means is an unsupervised learning technique, so model training does not require labels nor split data for training or evaluation.
Matrix Factorization for creating product recommendation systems. You can create product recommendations using historical customer behavior, transactions, and product ratings and then use those recommendations for personalized customer experiences.
Time series for performing time-series forecasts. You can use this feature to create millions of time series models and use them for forecasting. The model automatically handles anomalies, seasonality, and holidays.
Boosted Tree for creating XGBoost based classification and regression models.
Deep Neural Network (DNN) for creating TensorFlow based Deep Neural Networks for classification and regression models.
AutoML Tables to create best-in-class models without feature engineering or model selection. AutoML Tables searches through a variety of model architectures to decide the best model.
TensorFlow model importing. This feature lets you create BigQuery ML models from previously trained TensorFlow models, then perform prediction in BigQuery ML.
In BigQuery ML, you can use a model with data from multiple BigQuery datasets for training and for prediction.
17
Q
________ Google Cloud Storage is optimized for fast, highly durable storage for data accessed less than once a month
​
Regional
​
Nearline
​
Multi Regional
​
Coldline
A

Nearline
(Correct)

Explanation
Please refer https://cloud.google.com/storage/

Storage classes for any workload
Storage classes determine the availability and pricing model that apply to the data you store in Cloud Storage.

Standard - Optimized for performance and high frequency access.

Nearline - Fast, highly durable, for data accessed less than once a month.

Coldline - Fast, highly durable, for data accessed less than once a quarter.

Archive - Most cost-effective, for data accessed less than once a year.

18
Q
Which GCP ML Service can help detect and translate a document’s language?
​
Cloud Translation
​
Cloud Video Intelligence
​
Cloud Vision
​
Cloud Natural Language
A

Cloud Translation
(Correct)

Explanation
Please refer GCP ML Services Documentation https://cloud.google.com/products/ai/

Translation
Fast, dynamic translation tailored to your content

With Translation, you can quickly shift between languages using the best model for your content needs. Translation API delivers fast and dynamic results with our pre-trained models—instantly porting texts directly to your website and apps. AutoML Translation empowers developers and localization experts with limited machine learning expertise to quickly create high-quality, production-ready custom models. And if you want to translate directly from your audio data, Media Translation API allows you to add low-latency and real-time audio translations to your content and systems.

19
Q
Cross-validation is a technique for ensuring that the results uncovered in an analysis are generalizable to an independent, unseen data set (True/False)
​
TRUE
​
FALSE
A

TRUE
(Correct)

Cross-validation[1][2][3], sometimes called rotation estimation[4][5][6] or out-of-sample testing, is any of various similar model validation techniques for assessing how the results of a statistical analysis will generalize to an independent data set. It is mainly used in settings where the goal is prediction, and one wants to estimate how accurately a predictive model will perform in practice. In a prediction problem, a model is usually given a dataset of known data on which training is run (training dataset), and a dataset of unknown data (or first seen data) against which the model is tested (called the validation dataset or testing set).[7][8] The goal of cross-validation is to test the model’s ability to predict new data that was not used in estimating it, in order to flag problems like overfitting or selection bias[9] and to give an insight on how the model will generalize to an independent dataset (i.e., an unknown dataset, for instance from a real problem).
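A minimal scikit-learn sketch of k-fold cross-validation; the dataset and model choice are illustrative:

```python
# Minimal sketch: 5-fold cross-validation with scikit-learn to estimate
# how well a model generalizes to unseen data.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

scores = cross_val_score(model, X, y, cv=5)   # accuracy on each held-out fold
print(scores, scores.mean())
```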

20
Q
A professor is conducting research to examine the proportion of children whose parents read to them and who are themselves good readers. Which Machine learning algorithm can he/she apply?
​
Principal Component Analysis
​
Classification Algorithms
​
Association Rules
​
Clustering Algorithms
A

Association Rules
(Correct)

Association rule learning is a rule-based machine learning method for discovering interesting relations between variables in large databases. It is intended to identify strong rules discovered in databases using some measures of interestingness.[1]

Based on the concept of strong rules, Rakesh Agrawal, Tomasz Imieliński and Arun Swami[2] introduced association rules for discovering regularities between products in large-scale transaction data recorded by point-of-sale (POS) systems in supermarkets. For example, the rule {onions, potatoes} ⇒ {burger} found in the sales data of a supermarket would indicate that if a customer buys onions and potatoes together, they are likely to also buy hamburger meat. Such information can be used as the basis for decisions about marketing activities such as, e.g., promotional pricing or product placements.

In addition to the above example from market basket analysis association rules are employed today in many application areas including Web usage mining, intrusion detection, continuous production, and bioinformatics. In contrast with sequence mining, association rule learning typically does not consider the order of items either within a transaction or across transactions.

21
Q
Interquartile range (IQR) can be used to identify outliers in numerical data (True/False)
​
TRUE
​
FALSE
A

TRUE
(Correct)

In descriptive statistics, the interquartile range (IQR), also called the midspread, middle 50%, or H‑spread, is a measure of statistical dispersion, being equal to the difference between 75th and 25th percentiles, or between upper and lower quartiles,[1][2] IQR = Q3 − Q1. In other words, the IQR is the first quartile subtracted from the third quartile; these quartiles can be clearly seen on a box plot on the data. It is a trimmed estimator, defined as the 25% trimmed range, and is a commonly used robust measure of scale.

The IQR is a measure of variability, based on dividing a data set into quartiles. Quartiles divide a rank-ordered data set into four equal parts. The values that separate parts are called the first, second, and third quartiles; and they are denoted by Q1, Q2, and Q3, respectively.

Use
Unlike total range, the interquartile range has a breakdown point of 25%,[3] and is thus often preferred to the total range.

The IQR is used to build box plots, simple graphical representations of a probability distribution.

The IQR is used in businesses as a marker for their income rates.

For a symmetric distribution (where the median equals the midhinge, the average of the first and third quartiles), half the IQR equals the median absolute deviation (MAD).

The median is the corresponding measure of central tendency.

The IQR can be used to identify outliers (see below).

The quartile deviation or semi-interquartile range is defined as half the IQR.[4][5]
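A minimal pandas sketch of the common 1.5 × IQR outlier rule; the sample values are illustrative:

```python
# Minimal sketch: flag outliers with the 1.5 * IQR rule using pandas.
import pandas as pd

values = pd.Series([10, 12, 12, 13, 12, 11, 14, 13, 15, 102, 12, 14, 17])

q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = values[(values < lower) | (values > upper)]
print(outliers)   # 102 is flagged as an outlier
```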

22
Q
Which of the below measures is not used to evaluate regression models
​
Coefficient of Determination (R Square)
​
Root Mean Squared Error (RMSE)
​
Mean Absolute Error (MAE)
​
Confusion Matrix
A

Confusion Matrix
(Correct)

Explanation
Confusion Matrix is used to evaluate classification models

A confusion matrix is a table that is often used to describe the performance of a classification model (or “classifier”) on a set of test data for which the true values are known. The confusion matrix itself is relatively simple to understand, but the related terminology can be confusing.

I wanted to create a “quick reference guide” for confusion matrix terminology because I couldn’t find an existing resource that suited my requirements: compact in presentation, using numbers instead of arbitrary variables, and explained both in terms of formulas and sentences.

Let’s start with an example confusion matrix for a binary classifier (though it can easily be extended to the case of more than two classes):

Example confusion matrix for a binary classifier

What can we learn from this matrix?

There are two possible predicted classes: “yes” and “no”. If we were predicting the presence of a disease, for example, “yes” would mean they have the disease, and “no” would mean they don’t have the disease.
The classifier made a total of 165 predictions (e.g., 165 patients were being tested for the presence of that disease).
Out of those 165 cases, the classifier predicted “yes” 110 times, and “no” 55 times.
In reality, 105 patients in the sample have the disease, and 60 patients do not.
Let’s now define the most basic terms, which are whole numbers (not rates):

true positives (TP): These are cases in which we predicted yes (they have the disease), and they do have the disease.
true negatives (TN): We predicted no, and they don't have the disease.
false positives (FP): We predicted yes, but they don't actually have the disease. (Also known as a "Type I error.")
false negatives (FN): We predicted no, but they actually do have the disease. (Also known as a "Type II error.")

In the field of machine learning and specifically the problem of statistical classification, a confusion matrix, also known as an error matrix,[8] is a specific table layout that allows visualization of the performance of an algorithm, typically a supervised learning one (in unsupervised learning it is usually called a matching matrix). Each row of the matrix represents the instances in a predicted class while each column represents the instances in an actual class (or vice versa).[9] The name stems from the fact that it makes it easy to see if the system is confusing two classes (i.e. commonly mislabeling one as another).

It is a special kind of contingency table, with two dimensions (“actual” and “predicted”), and identical sets of “classes” in both dimensions (each combination of dimension and class is a variable in the contingency table).
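A minimal scikit-learn sketch that computes these four counts for a toy binary classifier; the labels are illustrative:

```python
# Minimal sketch: build and read a confusion matrix for a binary classifier
# with scikit-learn.
from sklearn.metrics import confusion_matrix

y_true = ["yes", "no", "yes", "yes", "no", "no", "yes", "no"]
y_pred = ["yes", "no", "no",  "yes", "no", "yes", "yes", "no"]

# With labels=["no", "yes"], ravel() returns TN, FP, FN, TP in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=["no", "yes"]).ravel()
print("TP:", tp, "TN:", tn, "FP:", fp, "FN:", fn)
```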

23
Q
Which GCP ML Service can help detect adult content within a video?
​
Cloud Translation
​
Cloud Video Intelligence
​
Cloud Vision
​
Cloud Natural Language
A

Cloud Video Intelligence
(Correct)

Explanation
Please refer GCP ML Services Documentation https://cloud.google.com/products/ai/

Precise video analysis
Video Intelligence API automatically recognizes more than 20,000 objects, places, and actions in stored and streaming video. It also distinguishes scene changes and extracts rich metadata at the video, shot, or frame level. Use in combination with AutoML Video Intelligence to create your own custom entity labels to categorize content.

24
Q
An organization wishes to automate data movement from Software as a Service (SaaS) applications such as Google Ads and Google Ad Manager on a scheduled, managed basis. This data is further needed to generate reports. Which GCP service fits this requirement?
​
Google Big Data Service
​
Google Transfer Appliance
​
BigQuery Data Transfer Service
​
Storage Transfer Service
A
BigQuery Data Transfer Service
(Correct)
​
Explanation
Please refer https://cloud.google.com/bigquery/docs/transfer-service-overview

Simplified data imports from SaaS applications to BigQuery
The BigQuery Data Transfer Service automates data movement from SaaS applications to Google BigQuery on a scheduled, managed basis. Your analytics team can lay the foundation for a data warehouse without writing a single line of code. BigQuery Data Transfer Service initially supports Google application sources like Google Ads, Campaign Manager, Google Ad Manager and YouTube. Through BigQuery Data Transfer Service, users also gain access to data connectors that allow you to easily transfer data from Teradata and Amazon S3 to BigQuery.

25
Q
Which service in combination with Cloud Pub/Sub can be used to implement exactly-once processing of incoming stream messages?
​
Google BigTable
​
Dataflow
​
Cloud SQL
​
Google BigQuery
A

Dataflow
(Correct)

Explanation
Please refer https://cloud.google.com/pubsub/

Stream analytics and connectors
Native Dataflow integration enables reliable, expressive, exactly-once processing and integration of event streams in Java, Python, and SQL.

26
Q
A financial organization wishes to develop a global application to store transactions happening from different parts of the world. The storage system must provide low latency transaction support and horizontal scaling. Which GCP service is appropriate for this use case?
​
Google Cloud Spanner

Google Cloud Bigtable
​
Google Cloud Storage
​
Google Cloud Datastore
A

Google Cloud Spanner
(Correct)

Explanation
Please refer https://cloud.google.com/storage-options/

All features
Relational database, built for scale: Everything you would expect from a relational database (schemas, SQL queries, and ACID transactions), battle-tested and ready to scale globally.
99.999% availability: For global businesses, reliability is expected, but maintaining that reliability while also rapidly scaling can be a challenge. Cloud Spanner delivers industry-leading 99.999% availability for multi-regional instances and provides transparent, synchronous replication across region and multi-region configurations.
Automatic sharding: Cloud Spanner optimizes performance by automatically sharding the data based on request load and size of the data. As a result, you can spend less time worrying about how to scale your database, and instead focus on scaling your business.
Fully managed: Easy deployment at every stage and for any size database. Synchronous replication and maintenance are automatic and built-in.
Strong transactional consistency: Purpose-built for external, strong, global transactional consistency.
Regional and multi-regional configurations: No matter where your users may be, apps backed by Cloud Spanner can read and write up-to-date, strongly consistent data globally. Additionally, when running a multi-region instance, your database is able to survive a regional failure and offers industry-leading 99.999% availability.
Online schema changes with no downtime: Cloud Spanner users can make a schema change, whether it’s adding a column or adding an index, while serving traffic with zero downtime. Hence you have the flexibility to adapt your database to your business needs without compromising on the availability of your application.
Built on Google Cloud Network: Cloud Spanner is built on Google’s dedicated network that provides low latency, security, and reliability for serving users across the globe.
Enterprise-grade security: Data-layer encryption, IAM integration for access and controls, and comprehensive audit logging.
Backup and restore: On-demand backup and restore for data protection.
Multi-language support: Client libraries in C#, C++, Go, Java, Node.js, PHP, Python, and Ruby. JDBC drivers for connectivity with popular third-party tools.

27
Q
Which Google service is recommended for iterative processing and notebook workloads?
​
Cloud SQL
​
Dataflow
​
Dataproc
​
Shell Scripts on Compute Instances
A

Dataproc
(Correct)

Explanation
Please refer https://cloud.google.com/dataproc/

Dataproc offers optional components such as Jupyter notebooks, which makes it the recommended service for iterative processing and notebook workloads. See the blog post “New GA Dataproc features extend data science and ML capabilities” for more detail.

28
Q
App Engine Standard environment supports SSH debugging (True/False)
​
TRUE
​
FALSE
A

FALSE
(Correct)
Explanation
Please refer https://cloud.google.com/appengine/docs/the-appengine-environments

Comparing high-level features
The following table summarizes the differences between the two environments:

Feature | Standard environment | Flexible environment
Instance startup time | Seconds | Minutes
Maximum request timeout | Depends on the runtime and type of scaling | 60 minutes
Background threads | Yes, with restrictions | Yes
Background processes | No | Yes
SSH debugging | No | Yes
Scaling | Manual, Basic, Automatic | Manual, Automatic
Scale to zero | Yes | No, minimum 1 instance
Writing to local disk | Java 8, Java 11, Node.js, Python 3, PHP 7, Ruby, Go 1.11, and Go 1.12+ have read and write access to the /tmp directory; Python 2.7 and PHP 5.5 don’t have write access to the disk | Yes, ephemeral (disk initialized on each VM startup)
Modifying the runtime | No | Yes (through Dockerfile)
Deployment time | Seconds | Minutes
Automatic in-place security patches | Yes | Yes (excludes container image runtime)
Access to Google Cloud APIs & Services such as Cloud Storage, Cloud SQL, Memorystore, Tasks, and others | Yes | Yes
WebSockets | No (Java 8, Python 2, and PHP 5 provide a proprietary Sockets API (beta), but the API is not available in newer standard runtimes) | Yes
Supports installing third-party binaries | Yes for Java 8, Java 11, Node.js, Python 3, PHP 7, Ruby 2.5 (beta), Go 1.11, and Go 1.12+; No for Python 2.7 and PHP 5.5 | Yes
Location | North America, Asia Pacific, or Europe | North America, Asia Pacific, or Europe
Pricing | Based on instance hours | Based on usage of vCPU, memory, and persistent disks
For an in-depth comparison of the environments, see the guide for your language: Python, Java, Go, or PHP.

29
Q
Which of the following is a Dimensionality Reduction Method?
​
Neural Networks
​
Factor Analysis
​
Linear Regression
​
Association Rules
A

Factor Analysis
(Correct)

The most commonly used techniques for dimensionality reduction include:

Missing Value Ratio
Low Variance Filter
High Correlation Filter
Principal component analysis (PCA)
Random Forest
Backward Feature Elimination
Forward Feature Selection
​Linear discriminant analysis (LDA)
Neural auto-encoder
t-distributed stochastic neighbour embedding (t-SNE)
Factor Analysis
Independent Component Analysis
Methods Based on Projections
UMAP
Multidimensional Scaling (MDS)
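A minimal scikit-learn sketch of factor analysis as a dimensionality reduction step; the dataset and number of factors are illustrative:

```python
# Minimal sketch: reduce a feature matrix to a handful of latent factors
# with scikit-learn's FactorAnalysis.
from sklearn.datasets import load_iris
from sklearn.decomposition import FactorAnalysis

X, _ = load_iris(return_X_y=True)          # 150 samples x 4 features

fa = FactorAnalysis(n_components=2, random_state=0)
X_reduced = fa.fit_transform(X)            # 150 samples x 2 factors
print(X_reduced.shape)
print(fa.components_)                      # factor loadings per original feature
```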


30
Q
Google Cloud Storage supports strongly consistent listing (True/False)
​
TRUE
​
FALSE
A
TRUE
(Correct)
​
Explanation
Please refer https://cloud.google.com/storage/

How Google Cloud Storage offers strongly consistent object listing thanks to Spanner

Here at Google Cloud, we’re proud of the fact that all of our listing operations are consistent across Google Cloud Storage. They’re consistent across all Cloud Storage bucket locations, including regional and multi-regional buckets. They’re consistent whether you’re listing buckets in a project or listing objects within a bucket. If you create a Cloud Storage bucket or object, and then you follow that up with a request to fetch a list of resources, your resource will be in that response.

Why is this important? Strong list consistency is a big deal when you run data and analytics workloads. Here’s an explanation from Twan Wolthof, Engineer at Spotify on why strongly consistent listing operations are so important to his business:

When you do not have consistent listings, there is a possibility of missing files. You cannot rely on the consistency of the data being read as you develop your products. Even worse, inconsistent listings lead to unforeseen issues. For example, our processing tooling will succeed reading partial data and may potentially produce seemingly valid outputs. Problems like these have a tendency to quickly propagate throughout the dependency tree. When that happens, in the best-case we notice the failure and recompute all datasets produced within the dependency tree. In the worst-case, the failure goes unnoticed and we create invalid reports and statistics. Considering the large amount of data pipelines we run, even with a low probability of that happening, a lack of list-consistency in cloud storage offerings was a major blocker for data-processing at Spotify.
Not all cloud storage services provide list-after-write consistency, which can cause challenges for some common use cases. Typically, when a user uploads an object to a cloud storage bucket, an unpredictable and unbounded amount of time passes before that object shows up in that bucket’s list of contents. This is a very weak consistency model called “eventual consistency.” In practice, if a user uploads a new object and then tries to find it from a browser on another computer, they might not see the object that they just uploaded. Similar issues impact workloads distributed across multiple compute nodes. By offering strong list consistency across all Google Cloud Storage objects, you avoid having to wrangle with these sorts of problems. Again, here’s Spotify’s Twan Wolthof:

We considered multiple workarounds, such as using a global consistency cache based on NFS, porting Netflix’s s3mper as well as persisting listings in a manifest file stored alongside the data. All of the considered solutions were suboptimal as they either introduced a single point of failure or required us to put significant resources into developing our own solution and adjusting our tooling. Strong list consistency in Cloud Storage means we can continue using our existing data-processing stack without modifications and without worrying that data may be corrupted.

31
Q
App Engine Flexible environment supports SSH debugging (True/False)
​
FALSE
​
TRUE
A

TRUE
(Correct)
Explanation
Please refer https://cloud.google.com/appengine/docs/the-appengine-environments

Features
Customizable infrastructure - App Engine flexible environment instances are Compute Engine virtual machines, which means that you can take advantage of custom libraries, use SSH for debugging, and deploy your own Docker containers.

Performance options - Take advantage of a wide array of CPU and memory configurations. You can specify how much CPU and memory each instance of your application needs, and the flexible environment will provision the necessary infrastructure for you.

Native feature support - Features such as microservices, authorization, SQL and NoSQL databases, traffic splitting, logging, versioning, security scanning, and content delivery networks are natively supported.

Managed virtual machines - App Engine manages your virtual machines, ensuring that:

Instances are health-checked, healed as necessary, and co-located with other services within the project.
Critical, backwards compatible updates are automatically applied to the underlying operating system.
VM instances are automatically located by geographical region according to the settings in your project. Google’s management services ensure that all of a project’s VM instances are co-located for optimal performance.
VM instances are restarted on a weekly basis. During restarts Google’s management services will apply any necessary operating system and security updates.
You always have root access to Compute Engine VM instances. SSH access to VM instances in the flexible environment is disabled by default. If you choose, you can enable root access to your app’s VM instances.

32
Q
________ gives you the ability to connect your pipeline through a single orchestration tool, whether your workflow lives on-premises, in multiple clouds, or fully within GCP
​
Apache Crunch
​
Apache Beam
​
Cloud Composer
​
Apache Nifi
A
Cloud Composer
(Correct)
​
Explanation
Please refer https://cloud.google.com/composer/

Key features
Hybrid and multi-cloud
Ease your transition to the cloud or maintain a hybrid data environment by orchestrating workflows that cross between on-premises and the public cloud. Create workflows that connect data, processing, and services across clouds to give you a unified data environment.

Open source
Cloud Composer is built upon Apache Airflow, giving users freedom from lock-in and portability. This open source project, which Google is contributing back into, provides freedom from lock-in for customers as well as integration with a broad number of platforms, which will only expand as the Airflow community grows.

Easy orchestration
Cloud Composer pipelines are configured as directed acyclic graphs (DAGs) using Python, making it easy for any user. One-click deployment yields instant access to a rich library of connectors and multiple graphical representations of your workflow in action, making troubleshooting easy. Automatic synchronization of your directed acyclic graphs ensures your jobs stay on schedule.
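Since Cloud Composer runs Apache Airflow, workflows are ordinary Python DAG files. A minimal sketch of such a DAG; task names, schedule, and operators are illustrative, and the import paths assume Airflow 2.x:

```python
# Minimal sketch of an Airflow DAG that Cloud Composer could orchestrate.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_export_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extracting")
    load = BashOperator(task_id="load", bash_command="echo loading")

    extract >> load   # run "load" only after "extract" succeeds
```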

33
Q
Z-score standardization is one of the techniques to normalize numeric variables before applying a machine learning model (True/False)
​
TRUE
​
FALSE
A

TRUE
(Correct)

https://en.wikipedia.org/wiki/Feature_scaling

Standardization (Z-score Normalization)
See also: Standard score
In machine learning, we can handle various types of data, e.g. audio signals and pixel values for image data, and this data can include multiple dimensions. Feature standardization makes the values of each feature in the data have zero-mean (when subtracting the mean in the numerator) and unit-variance. This method is widely used for normalization in many machine learning algorithms (e.g., support vector machines, logistic regression, and artificial neural networks).[2][citation needed] The general method of calculation is to determine the distribution mean and standard deviation for each feature. Next we subtract the mean from each feature. Then we divide the values (mean is already subtracted) of each feature by its standard deviation.

x' = (x − x̄) / σ

where x is the original feature vector, x̄ = average(x) is the mean of that feature vector, and σ is its standard deviation.
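A minimal scikit-learn sketch of z-score standardization; the feature matrix is illustrative:

```python
# Minimal sketch: z-score standardization of numeric features with scikit-learn.
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)   # each column now has mean 0 and unit variance
print(X_scaled)
print(scaler.mean_, scaler.scale_)
```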

34
Q

Which of the following is not a goal of Dimensionality Reduction techniques?

To help ensure that the predictor items are independent

To provide a framework for interpretability of the results

To reduce the number of predictor items

To predict a value of a target variable

A

To predict a value of a target variable
(Correct)

Dimensionality reduction, or dimension reduction, is the transformation of data from a high-dimensional space into a low-dimensional space so that the low-dimensional representation retains some meaningful properties of the original data, ideally close to its intrinsic dimension. Working in high-dimensional spaces can be undesirable for many reasons; raw data are often sparse as a consequence of the curse of dimensionality, and analyzing the data is usually computationally intractable. Dimensionality reduction is common in fields that deal with large numbers of observations and/or large numbers of variables, such as signal processing, speech recognition, neuroinformatics, and bioinformatics.[1]

Methods are commonly divided into linear and non-linear approaches.[1] Approaches can also be divided into feature selection and feature extraction.[2] Dimensionality reduction can be used for noise reduction, data visualization, cluster analysis, or as an intermediate step to facilitate other analyses.
Dimension reduction
For high-dimensional datasets (i.e. with number of dimensions more than 10), dimension reduction is usually performed prior to applying a K-nearest neighbours algorithm (k-NN) in order to avoid the effects of the curse of dimensionality.[19]

Feature extraction and dimension reduction can be combined in one step using principal component analysis (PCA), linear discriminant analysis (LDA), canonical correlation analysis (CCA), or non-negative matrix factorization (NMF) techniques as a pre-processing step followed by clustering by K-NN on feature vectors in reduced-dimension space. In machine learning this process is also called low-dimensional embedding.[20]

For very-high-dimensional datasets (e.g. when performing similarity search on live video streams, DNA data or high-dimensional time series) running a fast approximate K-NN search using locality sensitive hashing, random projection,[21] “sketches” [22] or other high-dimensional similarity search techniques from the VLDB toolbox might be the only feasible option.

Applications
A dimensionality reduction technique that is sometimes used in neuroscience is maximally informative dimensions,[citation needed] which finds a lower-dimensional representation of a dataset such that as much information as possible about the original data is preserved.

35
Q
A user wishes to generate reports on petabyte-scale data using a Business Intelligence (BI) tools. Which storage option provides integration with BI tools and supports OLAP workloads up to the petabyte-scale?
​
Google Cloud Datastore
​
Google Cloud Storage
​
Google BigQuery
​
Google Cloud Bigtable
A

Google BigQuery
(Correct)

Explanation
Please refer https://cloud.google.com/storage-options/

BigQuery BI Engine
BigQuery BI Engine is a blazing-fast in-memory analysis service for BigQuery that allows users to analyze large and complex datasets interactively with sub-second query response time and high concurrency. BigQuery BI Engine seamlessly integrates with familiar tools like Data Studio and will help accelerate data exploration and analysis for Looker, Sheets, and our BI partners in the coming months.

36
Q
BigQuery ML supports which of the following ML models?
​
Decision Trees
​
Neural Networks
​
Linear Regression


Random Forest

A

Linear Regression
(Correct)

Explanation
Please refer https://cloud.google.com/bigquery/docs/bigqueryml-intro

Supported models in BigQuery ML
A model in BigQuery ML represents what an ML system has learned from the training data.

BigQuery ML supports the following types of models:

Linear regression for forecasting; for example, the sales of an item on a given day. Labels are real-valued (they cannot be +/- infinity or NaN).
Binary logistic regression for classification; for example, determining whether a customer will make a purchase. Labels must only have two possible values.
Multiclass logistic regression for classification. These models can be used to predict multiple possible values such as whether an input is "low-value," "medium-value," or "high-value." Labels can have up to 50 unique values. In BigQuery ML, multiclass logistic regression training uses a multinomial classifier with a cross-entropy loss function.
K-means clustering for data segmentation; for example, identifying customer segments. K-means is an unsupervised learning technique, so model training does not require labels nor split data for training or evaluation.
Matrix Factorization for creating product recommendation systems. You can create product recommendations using historical customer behavior, transactions, and product ratings and then use those recommendations for personalized customer experiences.
Time series for performing time-series forecasts. You can use this feature to create millions of time series models and use them for forecasting. The model automatically handles anomalies, seasonality, and holidays.
Boosted Tree for creating XGBoost based classification and regression models.
Deep Neural Network (DNN) for creating TensorFlow based Deep Neural Networks for classification and regression models.
AutoML Tables to create best-in-class models without feature engineering or model selection. AutoML Tables searches through a variety of model architectures to decide the best model.
TensorFlow model importing. This feature lets you create BigQuery ML models from previously trained TensorFlow models, then perform prediction in BigQuery ML.
In BigQuery ML, you can use a model with data from multiple BigQuery datasets for training and for prediction.
37
Q
Cloud Dataflow uses __________, which is a unified programming model that enables one to develop both batch and streaming pipelines
​
Apache Nifi
​
Apache Beam


Apache Airflow

Apache Crunch

A

Apache Beam
(Correct)

Explanation
Please refer https://cloud.google.com/dataflow/docs/

Dataflow documentation
Dataflow is a managed service for executing a wide variety of data processing patterns. The documentation on this site shows you how to deploy your batch and streaming data processing pipelines using Dataflow, including directions for using service features.

The Apache Beam SDK is an open source programming model that enables you to develop both batch and streaming pipelines. You create your pipelines with an Apache Beam program and then run them on the Dataflow service. The Apache Beam documentation provides in-depth conceptual information and reference material for the Apache Beam programming model, SDKs, and other runners.
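A minimal Apache Beam sketch in Python; the runner choice and element values are illustrative:

```python
# Minimal sketch of an Apache Beam pipeline; the same code can run locally
# (DirectRunner) or on Dataflow by switching the runner option.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(runner="DirectRunner")  # e.g. "DataflowRunner" in production

with beam.Pipeline(options=options) as p:
    (
        p
        | "Create" >> beam.Create(["spanner", "bigquery", "dataflow", "bigquery"])
        | "PairWithOne" >> beam.Map(lambda word: (word, 1))
        | "CountPerKey" >> beam.CombinePerKey(sum)
        | "Print" >> beam.Map(print)
    )
```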

38
Q

Which of the following is not a best practice for controlling cost in BigQuery?

Query only the columns that you need.

Don’t run queries to explore or preview table data. Sample data using preview options.

Use a LIMIT clause as a method of cost control

Before running queries, preview them to estimate costs.

A

Use a LIMIT clause as a method of cost control
(Correct)

Explanation
Please refer https://cloud.google.com/bigquery/docs/best-practices-costs

LIMIT doesn’t affect cost
Best practice: Do not use a LIMIT clause as a method of cost control.

Applying a LIMIT clause to a query does not affect the amount of data that is read. It merely limits the results set output. You are billed for reading all bytes in the entire table as indicated by the query.

The amount of data read by the query counts against your free tier quota despite the presence of a LIMIT clause.
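A minimal sketch of previewing query cost with a dry run using the google-cloud-bigquery Python client; the public table in the query is only an example:

```python
# Minimal sketch: estimate query cost with a dry run before executing it.
from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

sql = "SELECT name, number FROM `bigquery-public-data.usa_names.usa_1910_current`"
job = client.query(sql, job_config=job_config)     # validates the query, nothing is billed

print(f"This query would process {job.total_bytes_processed} bytes")
```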

39
Q
Which GCP ML Service can help understand the overall sentiment expressed in a block of text?
​
Cloud Translation
​
Cloud Video Intelligence
​
Cloud Vision
​
Cloud Natural Language
A

Cloud Natural Language
(Correct)
Explanation
Please refer GCP ML Services Documentation https://cloud.google.com/products/ai/

Analyzing Sentiment
Sentiment Analysis inspects the given text and identifies the prevailing emotional opinion within the text, especially to determine a writer’s attitude as positive, negative, or neutral. Sentiment analysis is performed through the analyzeSentiment method. For information on which languages are supported by the Natural Language, see Language Support. For information on how to interpret the score and magnitude sentiment values included in the analysis, see Interpreting sentiment analysis values.

40
Q
Cloud Pub/Sub is a HIPAA-compliant service, offering fine-grained access controls and end-to-end encryption. (True/False)
​
TRUE
​
FALSE
A

TRUE
(Correct)

Explanation
Please refer https://cloud.google.com/pubsub/

All features
At-least-once delivery: Synchronous, cross-zone message replication and per-message receipt tracking ensures at-least-once delivery at any scale.
Open: Open APIs and client libraries in seven languages support cross-cloud and hybrid deployments.
Exactly-once processing: Dataflow supports reliable, expressive, exactly-once processing of Pub/Sub streams.
No provisioning, auto-everything: Pub/Sub does not have shards or partitions. Just set your quota, publish, and consume.
Compliance and security: Pub/Sub is a HIPAA-compliant service, offering fine-grained access controls and end-to-end encryption.
Google Cloud–native integrations: Take advantage of integrations with multiple services, such as Cloud Storage and Gmail update events and Cloud Functions for serverless event-driven computing.
Third-party and OSS integrations: Pub/Sub provides third-party integrations with Splunk and Datadog for logs along with Striim and Informatica for data integration. Additionally, OSS integrations are available through Confluent Cloud for Apache Kafka and Knative Eventing for Kubernetes-based serverless workloads.
Seek and replay: Rewind your backlog to any point in time or a snapshot, giving the ability to reprocess the messages. Fast forward to discard outdated data.
Dead letter topics: Dead letter topics allow for messages unable to be processed by subscriber applications to be put aside for offline examination and debugging so that other messages can be processed without delay.
Filtering: Pub/Sub can filter messages based upon attributes in order to reduce delivery volumes to subscribers.
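
For context, a minimal publish-and-pull sketch with the Python client library might look like the following; the project, topic, and subscription IDs are placeholders.

```python
# Sketch: publish a message to a topic, then pull and acknowledge it.
from google.cloud import pubsub_v1

project_id = "my-project"  # placeholder

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(project_id, "my-topic")
future = publisher.publish(topic_path, b"order received", origin="checkout-service")
print("Published message id:", future.result())

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(project_id, "my-subscription")
response = subscriber.pull(request={"subscription": subscription_path, "max_messages": 10})
for msg in response.received_messages:
    print(msg.message.data)
    subscriber.acknowledge(
        request={"subscription": subscription_path, "ack_ids": [msg.ack_id]}
    )
```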

41
Q
Can we attach a persistent disk to more than one GCP compute instance?
​
Yes if the disk is in read-write mode
​
Yes if the disk is in read-only mode
​
Both 1 and 2
​
No, we cannot attach it to more than one instance
A

Yes if the disk is in read-only mode
(Correct)

Explanation
Please refer https://cloud.google.com/compute/docs/faq
https://cloud.google.com/compute/docs/disks/add-persistent-disk

Share a zonal persistent disk between multiple instances
You can attach a non-boot persistent disk to more than one virtual machine instance in read-only mode, which lets you share static data between multiple instances. Sharing static data between multiple instances from one persistent disk is cheaper than replicating your data to unique disks for individual instances.

If you attach a persistent disk to multiple instances, all of those instances must attach the persistent disk in read-only mode. It is not possible to attach the persistent disk to multiple instances in read-write mode. If you need to share dynamic storage space between multiple instances, you can use one of the following options:

Connect your instances to Cloud Storage
Connect your instances to Filestore
Create a network file server on Compute Engine
If you have a persistent disk with data that you want to share between multiple instances, detach it from any read-write instances and attach it to one or more instances in read-only mode.
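
A hedged sketch of attaching an existing disk in read-only mode through the Compute Engine API (using google-api-python-client); the project, zone, instance, and disk names are placeholders.

```python
# Sketch: attach a non-boot persistent disk to an instance in read-only mode.
# All resource names below are placeholders.
from googleapiclient import discovery

compute = discovery.build("compute", "v1")

project, zone = "my-project", "us-central1-a"
disk_url = f"projects/{project}/zones/{zone}/disks/shared-static-data"

body = {"source": disk_url, "mode": "READ_ONLY"}  # READ_WRITE is only allowed on a single instance
compute.instances().attachDisk(
    project=project, zone=zone, instance="worker-1", body=body
).execute()
```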

42
Q
Which Google service is recommended for implementing machine learning workloads that use Spark ML?
​
Shell Scripts on Compute Instances
​
Dataflow
​
Cloud SQL
​
Dataproc
A

Dataproc
(Correct)
Explanation
Please refer https://cloud.google.com/dataproc/

Use Dataproc, BigQuery, and Apache Spark ML for Machine Learning
The BigQuery Connector for Apache Spark allows Data Scientists to blend the power of BigQuery’s seamlessly scalable SQL engine with Apache Spark’s Machine Learning capabilities. In this tutorial, we show how to use Dataproc, BigQuery and Apache Spark ML to perform machine learning on a dataset.

Objectives
Use linear regression to build a model of birth weight as a function of five factors:
gestation weeks
mother’s age
father’s age
mother’s weight gain during pregnancy
Apgar score
BigQuery is used to prepare the linear regression input table, which is written to your Google Cloud Platform project. Python is used to query and manage data in BigQuery. The resulting linear regression table is accessed in Apache Spark, and Spark ML is used to build and evaluate the model. A Dataproc PySpark job is used to invoke Spark ML functions.
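
A condensed sketch of that pattern, assuming the spark-bigquery connector is available on the Dataproc cluster; the table and column names are illustrative rather than the exact ones from the tutorial.

```python
# Sketch: read a BigQuery table from PySpark and fit a linear regression with Spark ML.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName("natality-regression").getOrCreate()

# Requires the spark-bigquery connector on the cluster; the table name is illustrative.
df = (
    spark.read.format("bigquery")
    .option("table", "my-project.my_dataset.natality_features")
    .load()
    .dropna()
)

assembler = VectorAssembler(
    inputCols=["gestation_weeks", "mother_age", "father_age", "weight_gain", "apgar"],
    outputCol="features",
)
train_df = assembler.transform(df).select("features", "weight_pounds")

model = LinearRegression(featuresCol="features", labelCol="weight_pounds").fit(train_df)
print("R^2 on training data:", model.summary.r2)
```
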
43
Q

A retailer wishes to identify the products that are bought together, given a dataset containing a customer ID, a receipt ID, and the products bought. Which type of machine learning algorithm is suited to achieve this?

Association Rules

​
Random Forests
​
Principal Component Analysis
​
Logistic Regression
A
Association Rules
(Correct)
​
Explanation
Association rules can help create rules on which items are bought together

Association rule learning is a rule-based machine learning method for discovering interesting relations between variables in large databases. It is intended to identify strong rules discovered in databases using some measures of interestingness.[1]

Based on the concept of strong rules, Rakesh Agrawal, Tomasz Imieliński and Arun Swami[2] introduced association rules for discovering regularities between products in large-scale transaction data recorded by point-of-sale (POS) systems in supermarkets. For example, the rule {onions, potatoes} ⇒ {burger} found in the sales data of a supermarket would indicate that if a customer buys onions and potatoes together, they are likely to also buy hamburger meat. Such information can be used as the basis for decisions about marketing activities such as, e.g., promotional pricing or product placements.

In addition to the above example from market basket analysis, association rules are employed today in many application areas including Web usage mining, intrusion detection, continuous production, and bioinformatics. In contrast with sequence mining, association rule learning typically does not consider the order of items either within a transaction or across transactions.
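
To make the idea concrete, the sketch below mines rules with the Apriori algorithm using mlxtend, a third-party library chosen here only for illustration; the receipts are made up.

```python
# Sketch: market-basket analysis with Apriori (mlxtend). Receipts are invented.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

receipts = [
    ["onions", "potatoes", "burger"],
    ["onions", "potatoes", "burger", "beer"],
    ["milk", "bread"],
    ["onions", "potatoes"],
]

# One-hot encode the receipts, mine frequent itemsets, then derive rules.
te = TransactionEncoder()
basket = pd.DataFrame(te.fit_transform(receipts), columns=te.columns_)
itemsets = apriori(basket, min_support=0.5, use_colnames=True)
rules = association_rules(itemsets, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])
```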

44
Q
Selecting suitable principal components from Principal Component Analysis can help avoid multicollinearity in regression problems (True / False)
​
FALSE
​
TRUE
A

TRUE
(Correct)
Explanation
Principal components are orthogonal and uncorrelated with one another, so using a subset of them as regression inputs avoids multicollinearity.
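
A small scikit-learn sketch on synthetic data illustrates the point: two highly correlated predictors are replaced by orthogonal principal components before fitting the regression.

```python
# Sketch: PCA before regression removes multicollinearity (synthetic data).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 0.95 * x1 + rng.normal(scale=0.1, size=200)  # nearly collinear with x1
x3 = rng.normal(size=200)
X = np.column_stack([x1, x2, x3])
y = 2 * x1 + 0.5 * x3 + rng.normal(scale=0.2, size=200)

# The retained principal components are orthogonal, so the regression inputs are uncorrelated.
model = make_pipeline(StandardScaler(), PCA(n_components=2), LinearRegression())
model.fit(X, y)
print("R^2:", model.score(X, y))
```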

45
Q

Which option below represents all the stages of the CRISP-DM methodology?

Data Understanding, Data Preparation, Modeling Phase, Evaluation Phase, Deployment Phase

Modeling Phase, Evaluation Phase, Deployment Phase

Business Understanding, Data Understanding, Data Preparation, Modeling Phase, Evaluation Phase

Business Understanding, Data Understanding, Data Preparation, Modeling Phase, Evaluation Phase, Deployment Phase

A

Business Understanding, Data Understanding, Data Preparation, Modeling Phase, Evaluation Phase, Deployment Phase
(Correct)

Cross-industry standard process for data mining, known as CRISP-DM,[1] is an open standard process model that describes common approaches used by data mining experts. It is the most widely-used analytics model.[2]

Major phases

Process diagram showing the relationship between the different phases of CRISP-DM
CRISP-DM breaks the process of data mining into six major phases[14]:

Business Understanding
Data Understanding
Data Preparation
Modeling
Evaluation
Deployment
The sequence of the phases is not strict, and moving back and forth between different phases is often required. The arrows in the process diagram indicate the most important and frequent dependencies between phases. The outer circle in the diagram symbolizes the cyclic nature of data mining itself. A data mining process continues after a solution has been deployed. The lessons learned during the process can trigger new, often more focused business questions, and subsequent data mining processes will benefit from the experiences of previous ones.
46
Q
Which GCP ML Service can help extract text in a video, including when in the video the text is detected (timestamp) and the location of the text within the frame (bounding box)?
​
Cloud Vision
​
Cloud Translation
​
Cloud Video Intelligence


Cloud Natural Language

A
Cloud Video Intelligence
(Correct)
​
Explanation
Please refer GCP ML Services Documentation https://cloud.google.com/products/ai/

Recognizing text
Text Detection performs Optical Character Recognition (OCR), which detects and extracts text within an input video.

Text detection is available for all the languages supported by the Cloud Vision API.

Request Text Detection for a Video on Google Cloud Storage
The following samples demonstrate text detection on a file located in Cloud Storage.
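
Those samples are not reproduced here, but a minimal Python sketch of a TEXT_DETECTION request might look like this (the bucket path is a placeholder):

```python
# Sketch: request TEXT_DETECTION for a video in Cloud Storage and print results.
from google.cloud import videointelligence

client = videointelligence.VideoIntelligenceServiceClient()
operation = client.annotate_video(
    request={
        "features": [videointelligence.Feature.TEXT_DETECTION],
        "input_uri": "gs://my-bucket/my-video.mp4",  # placeholder
    }
)
result = operation.result(timeout=600)  # long-running operation

for annotation in result.annotation_results[0].text_annotations:
    print("Text:", annotation.text)
    segment = annotation.segments[0]
    print("  first seen at:", segment.segment.start_time_offset)
```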

47
Q
Which GCP ML Service can help track objects within a video?
​
Cloud Translation
​
Cloud Video Intelligence
​
Cloud Vision
​
Cloud Natural Language
A
Cloud Video Intelligence
(Correct)
​
Explanation
Please refer GCP ML Services Documentation https://cloud.google.com/products/ai/

Tracking objects
Object tracking tracks multiple objects detected in an input video. To make an object tracking request, call the annotate method and specify OBJECT_TRACKING in the features field.

An object tracking request annotates a video with labels for entities and spatial locations for entities that are detected in the video or video segments provided. For example, a video of vehicles crossing a traffic signal might produce labels such as “car”, “truck”, “bike,” “tires”, “lights”, “window” and so on. Each label can include a series of bounding boxes, with each bounding box having an associated time segment containing a time offset that indicates the duration offset from the beginning of the video. The annotation also contains additional entity information including an entity id that you can use to find more information about the entity in the Google Knowledge Graph Search API.
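
A minimal sketch of an OBJECT_TRACKING request and of reading the entity, time offset, and bounding box fields from the response (the input URI is a placeholder):

```python
# Sketch: request OBJECT_TRACKING and inspect tracked objects in the response.
from google.cloud import videointelligence

client = videointelligence.VideoIntelligenceServiceClient()
operation = client.annotate_video(
    request={
        "features": [videointelligence.Feature.OBJECT_TRACKING],
        "input_uri": "gs://my-bucket/traffic.mp4",  # placeholder
    }
)
result = operation.result(timeout=600)

for obj in result.annotation_results[0].object_annotations:
    print(obj.entity.description, f"(confidence {obj.confidence:.2f})")
    frame = obj.frames[0]
    box = frame.normalized_bounding_box
    print("  at", frame.time_offset, "box:", box.left, box.top, box.right, box.bottom)
```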

48
Q
App Engine Flexible environment supports running background processes. (True/False)
​
FALSE
​
TRUE
A

TRUE
(Correct)

Explanation
Please refer https://cloud.google.com/appengine/docs/the-appengine-environments

Comparing high-level features
The following table summarizes the differences between the two environments:

| Feature | Standard environment | Flexible environment |
| --- | --- | --- |
| Instance startup time | Seconds | Minutes |
| Maximum request timeout | Depends on the runtime and type of scaling | 60 minutes |
| Background threads | Yes, with restrictions | Yes |
| Background processes | No | Yes |
| SSH debugging | No | Yes |
| Scaling | Manual, Basic, Automatic | Manual, Automatic |
| Scale to zero | Yes | No, minimum 1 instance |
| Writing to local disk | Java 8, Java 11, Node.js, Python 3, PHP 7, Ruby, Go 1.11, and Go 1.12+ have read and write access to the /tmp directory. Python 2.7 and PHP 5.5 don’t have write access to the disk. | Yes, ephemeral (disk initialized on each VM startup) |
| Modifying the runtime | No | Yes (through Dockerfile) |
| Deployment time | Seconds | Minutes |
| Automatic in-place security patches | Yes | Yes (excludes container image runtime) |
| Access to Google Cloud APIs & Services such as Cloud Storage, Cloud SQL, Memorystore, Tasks and others | Yes | Yes |
| WebSockets | No. Java 8, Python 2, and PHP 5 provide a proprietary Sockets API (beta), but the API is not available in newer standard runtimes. | Yes |
| Supports installing third-party binaries | Yes for Java 8, Java 11, Node.js, Python 3, PHP 7, Ruby 2.5 (beta), Go 1.11, and Go 1.12+. No for Python 2.7 and PHP 5.5. | Yes |
| Location | North America, Asia Pacific, or Europe | North America, Asia Pacific, or Europe |
| Pricing | Based on instance hours | Based on usage of vCPU, memory, and persistent disks |
For an in-depth comparison of the environments, see the guide for your language: Python, Java, Go, or PHP.

49
Q
\_\_\_\_\_\_\_ is a fully-managed in-memory data store service for Redis
​
Google Cloud Datastore
​
Cloud Memorystore


Cloud Storage for Firebase

Google Cloud Storage

A

Cloud Memorystore
(Correct)

Explanation
Please refer https://cloud.google.com/memorystore/

Memorystore
Reduce latency with scalable, secure, and highly available in-memory service for Redis and Memcached.

Build application caches that provide sub-millisecond data access
100% compatible with open source Redis and Memcached
Migrate your caching layer to cloud with zero code change
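
Because Memorystore is protocol-compatible with open source Redis, the standard redis-py client works unchanged; in the sketch below the host IP is a placeholder for the instance's private address.

```python
# Sketch: use Memorystore for Redis as an application cache via redis-py.
import redis

r = redis.Redis(host="10.0.0.3", port=6379)  # placeholder: Memorystore private IP
r.set("session:42", "cached-profile-json", ex=300)  # cache entry with a 5-minute TTL
print(r.get("session:42"))
```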

50
Q
Which of the following is not a supported data source for BigQuery Data Transfer Service?
​
Campaign Manager
​
Google Ads
​
On Premise Oracle Database
​
Google Ad Manager
A

On Premise Oracle Database
(Correct)

Explanation
Please refer https://cloud.google.com/bigquery/docs/transfer-service-overview

Next steps
After enabling the BigQuery Data Transfer Service, create a transfer for your data source.

Google Software as a Service (SaaS) apps
Campaign Manager
Cloud Storage
Google Ad Manager
Google Ads
Google Merchant Center (beta)
Google Play
Search Ads 360 (beta)
YouTube Channel reports
YouTube Content Owner reports
External cloud storage providers
Amazon S3
Data warehouses
Teradata
Amazon Redshift
In addition, several third-party transfers are available in the Google Cloud Marketplace.
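
As a sketch, the Python client for the BigQuery Data Transfer Service can list the data sources available to a project (an on-premises Oracle database will not appear among them); the project ID is a placeholder.

```python
# Sketch: list the data sources supported by the BigQuery Data Transfer Service.
from google.cloud import bigquery_datatransfer

client = bigquery_datatransfer.DataTransferServiceClient()
parent = "projects/my-project"  # placeholder

for source in client.list_data_sources(parent=parent):
    print(source.data_source_id, "-", source.display_name)
```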