Google ML Engineer Flashcards
What kinds of problems benefit from ML?
- Identification (Classification)
- Prediction
- Grouping (Clustering)
What problems does ML solve?
- Identification (Classification)
- Prediction
- Grouping (Clustering)
What are technical success metrics?
- Accuracy (correct predictions / total predictions)
- Precision (true positives / (true positives + false positives))
- Recall (true positives / (true positives + false negatives))
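The three metrics above can be computed directly from the four confusion-matrix counts. The following is a minimal sketch for binary labels; the function name `classification_metrics` and the sample labels are made up for illustration.

```python
def classification_metrics(y_true, y_pred):
    """Return (accuracy, precision, recall) for binary labels 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Hypothetical labels: 4 real positives, 4 real negatives.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 0]
accuracy, precision, recall = classification_metrics(y_true, y_pred)
print(accuracy, precision, recall)
```

Here the model finds only half of the real positives (recall 0.5) even though two of its three positive predictions are correct, which is why accuracy alone can be misleading.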
What are the different types of models?
- Decision Tree
- Deep Learning
- Regression
What are the steps in an MLOps pipeline?
- Data extraction
- Data validation
- Data preparation
- Model training
- Model evaluation
- Model validation
Which of the following comes first in a machine-learning pipeline?
- Model evaluation
- Data extraction
- Data preparation
- Model training
Data extraction
Which of the following is not a kind of data preparation task?
- Addressing missing data
- Removing unwanted data
- Integrating data
- Evaluating model performance
Evaluating model performance
Which of the following isn’t a kind of feature in a labeled dataset?
- Attributes of instances
- Structured tables
- Images
- Random data
Random data
What are the target values of a labeled dataset?
- Class or category
- Value to predict
What are the classification algorithms?
- Logistic Regression
- Decision Trees (the leaves of a decision tree are the classifications), e.g. Random forest
- Naive Bayes (uses statistics and probability, classifying by how often feature values co-occur with each class)
- Neural Networks (deep learning networks)
- Nearest Neighbor (classifies by how near or far points are in space)
- Ensemble methods (combine the predictions of several models)
Is logistic regression a classification algorithm or a regression algorithm?
Classification algorithm
Give an example of a decision-tree-based algorithm
Random forest
Which algorithm is based on how near or far points are in space, and what type of algorithm is it?
- Nearest Neighbor
- Classification algorithm
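The nearest-neighbor idea can be shown with a small k-nearest-neighbors sketch: a new point takes the majority label of the k closest training points. The function name `knn_classify`, the choice k=3, and the sample points are assumptions made for illustration.

```python
import math
from collections import Counter


def knn_classify(train, query, k=3):
    """train is a list of ((x, y), label); return the majority label of the k nearest points."""
    # Sort training points by Euclidean distance to the query and keep the k closest.
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical training data: two well-separated groups.
train = [
    ((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.0), "A"),
    ((6.0, 6.0), "B"), ((7.0, 5.5), "B"), ((6.5, 7.0), "B"),
]
label_near_a = knn_classify(train, (1.2, 1.4))
label_near_b = knn_classify(train, (6.4, 6.1))
print(label_near_a, label_near_b)
```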
What are the three regression algorithms?
- Linear Regression (learns the formula of a straight line from a series of values and predicts using that formula)
- Decision Tree Regression (uses the structure of decision trees)
- Polynomial Regression (also learns a function, but the function can be a curve instead of a straight line)
You have data on the price of cars sold over the past two years. You have data on the sale price, age of the car, mileage, interior features, gas mileage, and several other features. You want to use this data to predict the sale prices of other cars. What kind of ML problem is this?
- Classification
- Regression
- Reinforcement learning
- Unsupervised learning
This is a regression problem because we are trying to predict a continuous value.
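To make "predicting a continuous value" concrete, here is a minimal least-squares sketch that fits a straight line to price against age and predicts the price of an older car. Only one feature is used, and the numbers are made up (and perfectly linear) for illustration; `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept with one feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: car age in years and sale price.
ages = [1, 2, 3, 4, 5]
prices = [18000, 16000, 14000, 12000, 10000]
slope, intercept = fit_line(ages, prices)

# Predict the sale price of a 6-year-old car.
predicted = slope * 6 + intercept
print(slope, intercept, predicted)
```

The output is a number on a continuous scale rather than a class label, which is what separates regression from classification.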
Which of the following is a process risk to successfully deploying a machine learning model? (One choice)
- Insufficiently agreed upon objectives
- Insufficient data
- Biased data
- High F1 score
Insufficiently agreed upon objectives
What are the three types of unsupervised learning algorithms?
- Unsupervised learning does not use labeled datasets
- Clustering (e.g. K-means clustering)
- Association rules (e.g. the Apriori algorithm)
- Dimensionality reduction (e.g. principal component analysis, and autoencoders for data compression)
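As an illustration of clustering without labels, the sketch below runs K-means on one-dimensional points: assign each point to its nearest centroid, then move each centroid to the mean of its points, and repeat. The function name `kmeans_1d`, the fixed starting centroids, and the data are assumptions chosen to keep the example deterministic.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Simple K-means on 1-D data, starting from the given centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        # Assignment step: each point joins the cluster of its nearest centroid.
        for p in points:
            idx = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[idx].append(p)
        # Update step: each centroid moves to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical unlabeled data with two obvious groups.
points = [1.0, 1.5, 2.0, 9.0, 10.0, 11.0]
centroids, clusters = kmeans_1d(points, centroids=[0.0, 5.0])
print(centroids, clusters)
```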
Which of the following isn’t a use case for unsupervised learning? (One choice)
- Grouping and segmentation
- Data compression
- Working in game-like scenarios
- Anomaly detection
Working in game-like scenarios. This is a use case for reinforcement learning, not unsupervised learning.
What are the characteristics of reinforcement learning? (Multiple choice)
- Agent makes a series of choices in an environment
- Environment provides positive or negative feedback
- Trial and error, learn from feedback
- Used in dimensionality reduction
- Agent makes a series of choices in an environment
- Environment provides positive or negative feedback
- Trial and error, learn from feedback
Dimensionality reduction is an unsupervised learning problem, not a characteristic of reinforcement learning.
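The agent, environment, and feedback loop can be sketched with a tiny epsilon-greedy agent choosing between two actions. This is an illustrative toy rather than a full reinforcement learning algorithm: the environment (fixed rewards of -1 and +1), the exploration rate, and the function name `run_agent` are all assumptions.

```python
import random


def run_agent(steps=200, epsilon=0.1, seed=0):
    """Learn by trial and error which of two actions earns positive feedback."""
    rng = random.Random(seed)
    rewards = [-1.0, 1.0]  # hypothetical environment: action 1 is the good one
    values = [0.0, 0.0]    # the agent's running estimate of each action's reward
    counts = [0, 0]
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(2)  # explore: try a random action
        else:
            action = max(range(2), key=lambda a: values[a])  # exploit the best estimate
        reward = rewards[action]       # feedback from the environment
        counts[action] += 1
        # Incremental average of the rewards seen for this action.
        values[action] += (reward - values[action]) / counts[action]
    return values, counts


values, counts = run_agent()
best_action = max(range(2), key=lambda a: values[a])
print(values, counts, best_action)
```

The agent is never told which action is correct; it only sees the reward after acting and shifts its choices accordingly.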
Which one is a security risk in ML model development? (One choice)
- Insufficient data
- Data quality issues (found during data exploration: miscategorized data, missing data)
- Biased data
- Data poisoning
Data poisoning. It is a security risk in which an attacker deliberately feeds biased or incomplete data to the model in order to skew the output of the ML algorithm.
What Google Cloud service would you use to store 3 TB of raw data files in Parquet format that will be processed and then used for machine learning training?
- Cloud Storage
- Cloud SQL
- Bigtable
- Cloud Dataproc
Cloud Storage (it stores objects such as raw data files)
What Vertex AI feature supports managed and user managed Jupyter Notebooks?
- Vertex AI Training
- Vertex AI FeatureStore
- Vertex AI Workbench
- Vertex AI Labeling
Vertex AI Workbench (it supports Jupyter notebooks)
You are building a deep learning network and need to perform large volumes of low-precision calculations. What accelerator would you choose?
- TPU
- GPU
- Edge devices
- Kubernetes Pod
TPU
Which of the following is not an example of sensitive data?
- Faces in an image
- Government issued ID number
- Notes in electronic patient records
- Tracking number for an e-commerce shipment
Tracking number for an e-commerce shipment
What is a statistical object that describes central tendency and spread of values called?
- Mean
- Distribution
- Variance
- Mode
Distribution
What are valid sources of data for use with Vertex AI Datasets?
- CSV files and BigQuery tables/views only
- CSV files and Parquet files only
- BigQuery tables/views only
- CSV files, BigQuery, and Bigtable
CSV files and BigQuery tables/views only
When is deleting rows with missing data not a reasonable option for handling missing data?
- When there are many rows with no missing data
- When a significant portion of datasets has missing values for some feature
- When the data is stored in Parquet format
- When the data is not in a relational database
When a significant portion of datasets has missing values for some feature
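The sketch below contrasts deleting rows with filling in (imputing) the missing values, here with the column mean. Mean imputation is just one possible strategy, and the helper name `impute_mean` and the rows are made up for illustration.

```python
def impute_mean(rows, column):
    """Return copies of rows with missing values in `column` replaced by the column mean."""
    present = [r[column] for r in rows if r[column] is not None]
    mean = sum(present) / len(present)
    return [
        dict(r, **{column: mean if r[column] is None else r[column]})
        for r in rows
    ]


# Hypothetical dataset: half of the rows are missing "mileage".
rows = [
    {"age": 3, "mileage": 30000},
    {"age": 5, "mileage": None},
    {"age": 2, "mileage": None},
    {"age": 7, "mileage": 70000},
]

dropped = [r for r in rows if r["mileage"] is not None]  # deletion loses half the data
imputed = impute_mean(rows, "mileage")                   # imputation keeps every row
print(len(dropped), len(imputed), imputed[1]["mileage"])
```

When half the rows lack a value, deletion discards half the dataset, including the `age` values that were present.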
What is the role of feature attributions in explaining predictions?
- Attribution prevents overfitting
- Attributions are a way of handling missing data
- Attributions are a measure of how much a feature contributes to a prediction
- Attributions are a type of data augmentation
Attributions are a measure of how much a feature contributes to a prediction
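For a linear model, each feature's contribution to a prediction can be computed exactly, which makes it a convenient way to see what an attribution is. In the sketch below the contribution of a feature is its weight times its distance from a baseline input. The weights, baseline, and feature names are hypothetical, and this is not meant to represent how any particular service computes attributions.

```python
# Hypothetical linear price model: price = bias + sum(weight * feature).
weights = {"age": -2000.0, "mileage_10k": -500.0}
bias = 25000.0


def predict(x):
    return bias + sum(weights[k] * x[k] for k in weights)


def attributions(x, baseline):
    """Contribution of each feature to predict(x) - predict(baseline)."""
    return {k: weights[k] * (x[k] - baseline[k]) for k in weights}


baseline = {"age": 0.0, "mileage_10k": 0.0}
car = {"age": 3.0, "mileage_10k": 4.0}
attr = attributions(car, baseline)
print(attr, predict(car) - predict(baseline))
```

The attributions add up to the difference between the prediction for the car and the prediction for the baseline, so each value says how much that feature moved the prediction.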
Which of the following are types of data augmentation used with images?
- Crop and flip
- Crop and synonym substitution
- Feature attribution
- Imputing data and Crop
Crop and flip
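Crop and flip can be sketched on a tiny "image" stored as a nested list of pixel values. Real pipelines would normally use an image library; the helper names `horizontal_flip` and `crop` are made up for this example.

```python
def horizontal_flip(image):
    """Mirror each row left to right."""
    return [list(reversed(row)) for row in image]


def crop(image, top, left, height, width):
    """Keep a height x width window whose top-left corner is (top, left)."""
    return [row[left:left + width] for row in image[top:top + height]]


# Hypothetical 3x3 grayscale image.
image = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
flipped = horizontal_flip(image)
cropped = crop(image, top=0, left=1, height=2, width=2)
print(flipped)
print(cropped)
```

Each transformed copy is a new training example with the same label as the original, which is how augmentation enlarges an image dataset.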
What type of hyperparameter tuning algorithms use sequential processing and use the results from prior evaluations to inform evaluations of new hyperparameter values?
- Bayesian Search
- Random Search
- Grid Search
- Data Augmentation
Bayesian Search. It starts each evaluation from priors (the information known at that point) and updates them with the results of earlier evaluations.
When using distributed training in Vertex AI, what kind of node is responsible for communicating gradients between nodes?
- Worker Nodes
- Primary Replicas
- Reduction Server
- Backup Nodes
The purpose of the reduction server is to increase throughput by communicating gradients among worker nodes.
You want to run TensorFlow models most efficiently in Google Cloud. What serving option would you choose?
- Kubeflow
- TensorFlow open source
- Optimized TensorFlow Runtime
- XGBoost
Optimized TensorFlow Runtime contains optimizations used internally at Google Cloud.
You want to use Vertex AI monitoring to detect when newly arriving data is significantly different than recent production data. What metric would you use?
- Skew
- Drift
- Precision
- F1 Score
Drift measures difference in distribution with recent production data.
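One simple way to quantify a difference between two distributions of a categorical feature is the L-infinity distance: the largest gap in the share of any single category. The sketch below is an illustration under assumptions; the card does not say which distance measure Vertex AI uses, and the threshold and data are hypothetical.

```python
from collections import Counter


def distribution(values):
    """Share of each category in a list of categorical values."""
    counts = Counter(values)
    total = len(values)
    return {k: c / total for k, c in counts.items()}


def l_infinity_distance(p, q):
    """Largest absolute difference in share across all categories."""
    keys = set(p) | set(q)
    return max(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


# Hypothetical feature values: recent production data vs. newly arriving data.
recent = ["sedan"] * 6 + ["suv"] * 4
new = ["sedan"] * 3 + ["suv"] * 7

drift = l_infinity_distance(distribution(recent), distribution(new))
threshold = 0.2  # hypothetical alerting threshold
drift_detected = drift > threshold
print(drift, drift_detected)
```

Comparing new data against training data with the same function would measure skew instead of drift; only the reference dataset changes.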