UDEMY Flashcards
The data mining project manager meets with the production line supervisor to discuss the implementation of changes and improvements. Which stage in CRISP-DM does this scenario refer to?
Deployment Phase
Data Preparation
Modeling Phase
Data Understanding
Deployment Phase
(Correct)
Which model within Cloud Speech-to-Text API is best for audio that originated from video or includes multiple speakers (typically recorded at a 16 kHz or higher sampling rate)?
video
phone_call
command_and_search
default
video
(Correct)
Explanation
Please refer to the GCP ML Services documentation: https://cloud.google.com/products/ai/
Which GCP ML Service can help identify entities and label them by types such as person, organization, location, events, products, and media from within a text?
Cloud Translation
Cloud Vision
Cloud Natural Language
Cloud Video Intelligence
Cloud Natural Language
(Correct)
Explanation
Please refer to the GCP ML Services documentation: https://cloud.google.com/products/ai/
\_\_\_\_\_\_\_\_ is a scalable, fully-managed NoSQL document database for your web and mobile applications.
Google Cloud Datastore
Google Cloud Bigtable
Google Cloud Storage
Cloud Storage for Firebase
Google Cloud Datastore
(Correct)
Explanation
Please refer to https://cloud.google.com/storage-options/
https://cloud.google.com/datastore
Highly scalable NoSQL database
Firestore is the next generation of Datastore.
Datastore is a highly scalable NoSQL database for your applications. Datastore automatically handles sharding and replication, providing you with a highly available and durable database that scales automatically to handle your applications’ load. Datastore provides a myriad of capabilities such as ACID transactions, SQL-like queries, indexes, and much more.
In order to use categorical variables in a regression problem, which data pre-processing step is needed?
Use the categorical variable as is
Convert the categorical variable into binary dummy variables
Use a classification algorithm instead
Assign numeric values to different levels within a categorical variable and use the numeric values instead
Convert the categorical variable into binary dummy variables
(Correct)
Explanation
Binary dummy variables convert a categorical variable into multiple numerical indicator variables, one per category level, which a regression model can then use directly.
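As an illustration, dummy (one-hot) encoding can be sketched in plain Python; the `colors` data is invented for the example, and pandas users would typically reach for `pd.get_dummies` instead:

```python
# One-hot ("binary dummy") encoding of a categorical variable, pure-Python sketch.
colors = ["red", "green", "blue", "green", "red"]
levels = sorted(set(colors))          # ['blue', 'green', 'red']

# One indicator column per level; for regression you would usually drop one
# level (the "dummy variable trap") to avoid perfect multicollinearity.
dummies = [[1 if value == level else 0 for level in levels] for value in colors]
```

Each row now contains exactly one 1, marking that observation's category, so the regression sees purely numeric inputs.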
Which GCP ML Service can help to extract text and identify the language from within an image?
Cloud Translation
Cloud Natural Language
Cloud Vision
Cloud Video Intelligence
Cloud Vision
(Correct)
Explanation
Please refer to the GCP ML Services documentation: https://cloud.google.com/products/ai/
A startup wishes to use a data processing platform that supports both batch and streaming applications, and they would prefer a hands-off/serverless data processing platform. Which GCP service is suited for this?
Shell Scripts on Compute Instances
Dataproc
Cloud SQL
Dataflow
Dataflow
(Correct)
Explanation
Please refer to https://cloud.google.com/dataflow/
Dataflow
Unified stream and batch data processing that’s serverless, fast, and cost-effective.
Flexible scheduling and pricing for batch processing
For processing with flexibility in job scheduling time, such as overnight jobs, flexible resource scheduling (FlexRS) offers a lower price for batch processing. These flexible jobs are placed into a queue with a guarantee that they will be retrieved for execution within a six-hour window.
Ready-to-use real-time AI patterns
Enabled through ready-to-use patterns, Dataflow’s real-time AI capabilities allow for real-time reactions with near-human intelligence to large torrents of events. Customers can build intelligent solutions ranging from predictive analytics and anomaly detection to real-time personalization and other advanced analytics use cases.
Which among the below is not a category of Data Access audit logs for Cloud Spanner?
Data Access (DATA_WRITE)
Data Access (DATA_READ)
Data Access (ADMIN_READ)
User Access (USER_ACCESS)
User Access (USER_ACCESS)
(Correct)
Explanation
Please refer to https://cloud.google.com/spanner/docs/audit-logging
Cloud Spanner audit logging information
This page describes the audit logs created by Cloud Spanner as part of Cloud Audit Logs.
Overview
Google Cloud services write audit logs to help you answer the questions, “Who did what, where, and when?” Your Cloud projects contain only the audit logs for resources that are directly within the project. Other entities, such as folders, organizations, and Cloud Billing accounts, contain the audit logs for the entity itself.
For a general overview of Cloud Audit Logs, see Cloud Audit Logs. For a deeper understanding of Cloud Audit Logs, review Understanding audit logs.
Cloud Audit Logs maintains three audit logs for each Cloud project, folder, and organization:
Admin Activity audit logs
Data Access audit logs
System Event audit logs
A user wishes to store images, videos, objects, and blob data in a scalable, fully managed, highly reliable, and cost-efficient object/blob store. Which GCP storage option is appropriate for this use case?
Cloud Storage for Firebase
Google Cloud Storage
Google Cloud Bigtable
Google Cloud Datastore
Google Cloud Storage
(Correct)
Explanation
Please refer to https://cloud.google.com/storage-options/
Object Lifecycle Management: Define conditions that trigger data deletion or transition to a cheaper storage class.
Object Versioning: Continue to store old copies of objects when they are deleted or overwritten.
Retention policies: Define minimum retention periods that objects must be stored for before they’re deletable.
Object holds: Place a hold on an object to prevent its deletion.
Customer-managed encryption keys: Encrypt object data with encryption keys stored by the Cloud Key Management Service and managed by you.
Customer-supplied encryption keys: Encrypt object data with encryption keys created and managed by you.
Uniform bucket-level access: Uniformly control access to your Cloud Storage resources by disabling object ACLs.
Requester Pays: Require accessors of your data to include a project ID to bill for network charges, operation charges, and retrieval fees.
Bucket Lock: Configure a data retention policy for a Cloud Storage bucket that governs how long objects in the bucket must be retained.
Pub/Sub notifications for Cloud Storage: Send notifications to Pub/Sub when objects are created, updated, or deleted.
Cloud Audit Logs with Cloud Storage: Maintain admin activity logs and data access logs for your Cloud Storage resources.
Object- and bucket-level permissions: Cloud Identity and Access Management (IAM) allows you to control who has access to your buckets and objects.
With regard to instance startup time, the App Engine standard environment is faster than the App Engine flexible environment (True/False)
TRUE
FALSE
TRUE
(Correct)
Explanation
Please refer to https://cloud.google.com/appengine/docs/the-appengine-environments
Comparing high-level features
The following table summarizes the differences between the two environments:
| Feature | Standard environment | Flexible environment |
| --- | --- | --- |
| Instance startup time | Seconds | Minutes |
| Maximum request timeout | Depends on the runtime and type of scaling | 60 minutes |
| Background threads | Yes, with restrictions | Yes |
| Background processes | No | Yes |
| SSH debugging | No | Yes |
| Scaling | Manual, Basic, Automatic | Manual, Automatic |
| Scale to zero | Yes | No, minimum 1 instance |
| Writing to local disk | Java 8, Java 11, Node.js, Python 3, PHP 7, Ruby, Go 1.11, and Go 1.12+ have read and write access to the /tmp directory. Python 2.7 and PHP 5.5 don’t have write access to the disk. | Yes, ephemeral (disk initialized on each VM startup) |
| Modifying the runtime | No | Yes (through Dockerfile) |
| Deployment time | Seconds | Minutes |
| Automatic in-place security patches | Yes | Yes (excludes container image runtime) |
| Access to Google Cloud APIs & services such as Cloud Storage, Cloud SQL, Memorystore, Tasks, and others | Yes | Yes |
| WebSockets | No. Java 8, Python 2, and PHP 5 provide a proprietary Sockets API (beta), but the API is not available in newer standard runtimes. | Yes |
| Supports installing third-party binaries | Yes for Java 8, Java 11, Node.js, Python 3, PHP 7, Ruby 2.5 (beta), Go 1.11, and Go 1.12+. No for Python 2.7 and PHP 5.5. | Yes |
| Location | North America, Asia Pacific, or Europe | North America, Asia Pacific, or Europe |
| Pricing | Based on instance hours | Based on usage of vCPU, memory, and persistent disks |
For an in-depth comparison of the environments, see the guide for your language: Python, Java, Go, or PHP.
Which among the below is not a machine learning type?
Dimensionality Reduction Techniques
Supervised Machine Learning
Reinforcement Learning
Unsupervised Machine Learning
Dimensionality Reduction Techniques
(Correct)
Explanation
Dimensionality reduction is a technique for reducing the number of features; it is not a type of machine learning.
Dimensionality reduction, or dimension reduction, is the transformation of data from a high-dimensional space into a low-dimensional space so that the low-dimensional representation retains some meaningful properties of the original data, ideally close to its intrinsic dimension. Working in high-dimensional spaces can be undesirable for many reasons; raw data are often sparse as a consequence of the curse of dimensionality, and analyzing the data is usually computationally intractable. Dimensionality reduction is common in fields that deal with large numbers of observations and/or large numbers of variables, such as signal processing, speech recognition, neuroinformatics, and bioinformatics.[1]
Methods are commonly divided into linear and non-linear approaches.[1] Approaches can also be divided into feature selection and feature extraction.[2] Dimensionality reduction can be used for noise reduction, data visualization, cluster analysis, or as an intermediate step to facilitate other analyses.
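As a concrete sketch of the transformation described above, here is a minimal PCA-style projection with NumPy; the 2-D data points are invented for illustration:

```python
import numpy as np

# Minimal PCA sketch: center the data, take the SVD, and project the
# samples onto the top principal component (10 points in 2-D -> 1-D).
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
              [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])
Xc = X - X.mean(axis=0)                      # center each feature
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
X_reduced = Xc @ Vt[:1].T                    # keep only the first component
```

The one retained coordinate is the direction of greatest variance, so most of the structure in the original two features survives in the low-dimensional representation.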
Cloud Datastore supports ACID transactions, SQL-like queries, and indexes (True/False)
FALSE
TRUE
TRUE
(Correct)
Explanation
Please refer to https://cloud.google.com/datastore/
Firestore is the next generation of Datastore.
Datastore is a highly scalable NoSQL database for your applications. Datastore automatically handles sharding and replication, providing you with a highly available and durable database that scales automatically to handle your applications’ load. Datastore provides a myriad of capabilities such as ACID transactions, SQL-like queries, indexes, and much more.
Which GCP ML Service can help extract tokens and sentences, identify parts of speech (PoS), and create dependency parse trees for each sentence within a text?
Cloud Vision
Cloud Video Intelligence
Cloud Translation
Cloud Natural Language
Cloud Natural Language
(Correct)
Explanation
Please refer to the GCP ML Services documentation: https://cloud.google.com/products/ai/
Integrated REST API
Natural Language is accessible via our REST API. Text can be uploaded in the request or integrated with Cloud Storage.
Syntax analysis
Extract tokens and sentences, identify parts of speech and create dependency parse trees for each sentence.
Entity analysis
Identify entities within documents — including receipts, invoices, and contracts — and label them by types such as date, person, contact information, organization, location, events, products, and media.
Custom entity extraction
Identify entities within documents and label them based on your own domain-specific keywords or phrases.
Sentiment analysis
Understand the overall opinion, feeling, or attitude sentiment expressed in a block of text.
Custom sentiment analysis
Understand the overall opinion, feeling, or attitude expressed in a block of text tuned to your own domain-specific sentiment scores.
Content classification
Classify documents in 700+ predefined categories.
Custom content classification
Create labels to customize models for unique use cases, using your own training data.
Multi-language
Enables you to easily analyze text in multiple languages including English, Spanish, Japanese, Chinese (simplified and traditional), French, German, Italian, Korean, Portuguese, and Russian.
Custom models
Train custom machine learning models with minimum effort and machine learning expertise.
Powered by Google’s AutoML models
Leverages Google state-of-the-art AutoML technology to produce high-quality models.
Spatial structure understanding
Use the structure and layout information in PDFs to improve custom entity extraction performance.
Large dataset support
Unlock complex use cases with support for 5,000 classification labels, 1 million documents, and 10 MB document size.
To convert a continuous variable into a categorical variable, which of the following techniques can we use?
Bin the numerical variable into different categories
Use min-max normalization
Treat the numerical value directly as a categorical variable
Use the mean as a representative value
Bin the numerical variable into different categories
(Correct)
Dividing a Continuous Variable into Categories
This is also known by other names such as “discretizing,” “chopping data,” or “binning.” [1] Specific methods sometimes used include “median split” or “extreme third tails.”
Whatever it is called, it is usually [2] a bad idea. Instead, use a technique (such as regression) that can work with the continuous variable. The basic reason is intuitive: you are tossing away information. This can occur in various ways with various consequences. Here are some:
- When doing hypothesis tests, the loss of information when dividing continuous variables into categories typically translates into losing power. [3]
- The loss of information involved in choosing bins to make a histogram can result in a misleading histogram. [4]
- Collecting continuous data by categories can also cause headaches later on. Good and Hardin [5] give an example of a long-term study in which incomes were relevant. The data were collected in categories of ten thousand dollars. Because of inflation, purchasing power decreased noticeably from the beginning to the end of the study. The categorization of income made it virtually impossible to correct for inflation.
- Wainer, Gessaroli, and Verdi [6] argue that if a large enough sample is drawn from two uncorrelated variables, it is possible to group the variables one way so that the binned means show an increasing trend, and another way so that they show a decreasing trend. They conclude that if the original data are available, one should look at the scatterplot rather than at binned data. Moral: If there is a good justification for binning data in an analysis, it should be “before the fact” – you could otherwise be accused of manipulating the data to get the results you want!
- There are times when continuous data must be dichotomized, for example in deciding a cut-off for diagnostic criteria. When this is the case, it is important to choose the cut-off carefully, and to consider the sensitivity, specificity, and positive predictive value. [7]
Notes:
1. “Binning” is also used to refer to processes used in data mining and analytics. In those fields, which usually deal with large data sets and aim to discover patterns, carefully developed algorithms and validating with holdout subsamples can create a more rigorous process than the types of discretizing discussed on this web page.
2. One situation in which it may be necessary is when comparing new data with existing data where only the categories are known, not the values of the continuous variable. Categorizing may also sometimes be appropriate for explaining an idea to an audience that lacks the sophistication for the full analysis. However, this should only be done when the full analysis has been done and justifies the result that is illustrated by the simpler technique using categorizing. For an example, see Gelman and Park (2008), Splitting a predictor at the upper quarter or third, American Statistician 62, No. 4, pp. 1-8. See also note 1 above.
3. See http://psych.colorado.edu/~mcclella/MedianSplit/ for a demo illustrating this in the case when a continuous predictor in regression is dichotomized using a median split. Also see Van Belle (2008) Statistical Rules of Thumb, pp. 139-140 for more discussion and references.
4. Some software has a “kernel density” feature that can give an estimate of the distribution of data. This is usually better than a histogram. The problem with bins in a histogram is the reason why histograms are not good for checking model assumptions.
5. Good and Hardin (2006) Common Errors in Statistics, pp. 28-29.
6. Wainer, Gessaroli, and Verdi (2006). Finding What Is Not There through the Unfortunate Binning of Results: The Mendel Effect, Chance Magazine, Vol. 19, No. 1, pp. 49-52. Essentially the same article appears as Chapter 14 in Wainer (2009) Picturing the Uncertain World, Princeton University Press.
7. In addition to the references listed at the end of the linked page, see also Susan Ott’s Bone Density page for a graphical discussion of the cut-offs for osteoporosis and osteopenia.
https://web.ma.utexas.edu/users/mks/statmistakes/dividingcontinuousintocategories.html
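When binning is justified, the mechanics are straightforward; here is a minimal stdlib-only sketch (the age bin edges and labels are invented for the example, and pandas users would typically use `pd.cut` instead):

```python
import bisect

# Bin a continuous variable (age) into ordered categories:
# age < 18 -> child, 18-39 -> young adult, 40-64 -> middle-aged, 65+ -> senior.
edges = [18, 40, 65]                       # assumed boundaries, upper-exclusive
labels = ["child", "young adult", "middle-aged", "senior"]

ages = [3, 17, 25, 35, 60, 82]
categories = [labels[bisect.bisect_right(edges, age)] for age in ages]
```

Note that, per the discussion above, the choice of `edges` discards information and should be justified before the analysis, not tuned to the results.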
BigQuery ML supports which of the following ML models?
Decision Trees
Naive Bayes
Binary logistic regression
Random Forest
Binary logistic regression
(Correct)
Explanation
Please refer to https://cloud.google.com/bigquery/docs/bigqueryml-intro
Supported models in BigQuery ML
A model in BigQuery ML represents what an ML system has learned from the training data.
BigQuery ML supports the following types of models:
- Linear regression for forecasting; for example, the sales of an item on a given day. Labels are real-valued (they cannot be +/- infinity or NaN).
- Binary logistic regression for classification; for example, determining whether a customer will make a purchase. Labels must only have two possible values.
- Multiclass logistic regression for classification. These models can be used to predict multiple possible values such as whether an input is "low-value," "medium-value," or "high-value." Labels can have up to 50 unique values. In BigQuery ML, multiclass logistic regression training uses a multinomial classifier with a cross-entropy loss function.
- K-means clustering for data segmentation; for example, identifying customer segments. K-means is an unsupervised learning technique, so model training does not require labels nor split data for training or evaluation.
- Matrix Factorization for creating product recommendation systems. You can create product recommendations using historical customer behavior, transactions, and product ratings and then use those recommendations for personalized customer experiences.
- Time series for performing time-series forecasts. You can use this feature to create millions of time series models and use them for forecasting. The model automatically handles anomalies, seasonality, and holidays.
- Boosted Tree for creating XGBoost based classification and regression models.
- Deep Neural Network (DNN) for creating TensorFlow based Deep Neural Networks for classification and regression models.
- AutoML Tables to create best-in-class models without feature engineering or model selection. AutoML Tables searches through a variety of model architectures to decide the best model.
- TensorFlow model importing. This feature lets you create BigQuery ML models from previously trained TensorFlow models, then perform prediction in BigQuery ML.

In BigQuery ML, you can use a model with data from multiple BigQuery datasets for training and for prediction.
BigQuery ML supports which of the following ML models?
Random Forest
Multiclass logistic regression for classification
Principal Component Analysis
K means Algorithm
Multiclass logistic regression for classification
(Correct)
Explanation
Please refer to https://cloud.google.com/bigquery/docs/bigqueryml-intro
\_\_\_\_\_\_\_\_ Google Cloud Storage is optimized for fast, highly durable storage for data accessed less than once a month.
Regional
Nearline
Multi Regional
Coldline
Nearline
(Correct)
Explanation
Please refer to https://cloud.google.com/storage/
Storage classes for any workload
Storage classes determine the availability and pricing model that apply to the data you store in Cloud Storage.
Standard - Optimized for performance and high frequency access.
Nearline - Fast, highly durable, for data accessed less than once a month.
Coldline - Fast, highly durable, for data accessed less than once a quarter.
Archive - Most cost-effective, for data accessed less than once a year.
Which GCP ML Service can help detect and translate a document’s language?
Cloud Translation
Cloud Video Intelligence
Cloud Vision
Cloud Natural Language
Cloud Translation
(Correct)
Explanation
Please refer to the GCP ML Services documentation: https://cloud.google.com/products/ai/
Translation
Fast, dynamic translation tailored to your content
With Translation, you can quickly shift between languages using the best model for your content needs. Translation API delivers fast and dynamic results with our pre-trained models—instantly porting texts directly to your website and apps. AutoML Translation empowers developers and localization experts with limited machine learning expertise to quickly create high-quality, production-ready custom models. And if you want to translate directly from your audio data, Media Translation API allows you to add low-latency and real-time audio translations to your content and systems.
Cross-validation is a technique for ensuring that the results uncovered in an analysis are generalizable to an independent, unseen data set (True/False)
TRUE
FALSE
TRUE
(Correct)
Cross-validation[1][2][3], sometimes called rotation estimation[4][5][6] or out-of-sample testing, is any of various similar model validation techniques for assessing how the results of a statistical analysis will generalize to an independent data set. It is mainly used in settings where the goal is prediction, and one wants to estimate how accurately a predictive model will perform in practice. In a prediction problem, a model is usually given a dataset of known data on which training is run (training dataset), and a dataset of unknown data (or first seen data) against which the model is tested (called the validation dataset or testing set).[7][8] The goal of cross-validation is to test the model’s ability to predict new data that was not used in estimating it, in order to flag problems like overfitting or selection bias[9] and to give an insight on how the model will generalize to an independent dataset (i.e., an unknown dataset, for instance from a real problem).
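The train/validation splitting described above can be sketched with a minimal k-fold index generator (stdlib-only; scikit-learn's `KFold` provides the production version):

```python
# Split n sample indices into k roughly equal folds. Each fold serves once as
# the held-out test set while the remaining indices form the training set.
def kfold_indices(n, k):
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size
```

A model is trained k times, once per `(train, test)` pair, and the k test scores are averaged to estimate how the model will perform on unseen data.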
A professor is conducting research to examine the proportion of children whose parents read to them and who are themselves good readers. Which machine learning algorithm can he/she apply?
Principal Component Analysis
Classification Algorithms
Association Rules
Clustering Algorithms
Association Rules
(Correct)
Association rule learning is a rule-based machine learning method for discovering interesting relations between variables in large databases. It is intended to identify strong rules discovered in databases using some measures of interestingness.[1]
Based on the concept of strong rules, Rakesh Agrawal, Tomasz Imieliński and Arun Swami[2] introduced association rules for discovering regularities between products in large-scale transaction data recorded by point-of-sale (POS) systems in supermarkets. For example, the rule {onions, potatoes} ⇒ {burger} found in the sales data of a supermarket would indicate that if a customer buys onions and potatoes together, they are likely to also buy hamburger meat. Such information can be used as the basis for decisions about marketing activities such as, e.g., promotional pricing or product placements.
In addition to the above example from market basket analysis, association rules are employed today in many application areas including Web usage mining, intrusion detection, continuous production, and bioinformatics. In contrast with sequence mining, association rule learning typically does not consider the order of items either within a transaction or across transactions.
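The onions-and-potatoes rule above can be quantified with the two standard measures, support and confidence; the toy transactions below are invented for illustration:

```python
# Support and confidence for the rule {onions, potatoes} => {burger}.
transactions = [
    {"onions", "potatoes", "burger"},
    {"onions", "potatoes", "burger"},
    {"onions", "potatoes"},
    {"milk", "bread"},
]

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

antecedent, consequent = {"onions", "potatoes"}, {"burger"}
rule_support = support(antecedent | consequent)   # 2 of 4 transactions -> 0.5
confidence = rule_support / support(antecedent)   # 0.5 / 0.75 = 2/3
```

A rule is considered "strong" when both its support and its confidence clear analyst-chosen thresholds; algorithms such as Apriori search the itemset lattice for exactly these rules.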