Exam Preparation Flashcards
What are the four stages of data lifecycle?
Ingest, store, process and analyze, and explore and visualize.
What is streaming data?
Streaming data is a set of data that is sent in small messages that are transmitted continuously from the data source. Streaming data may be telemetry data, which is data generated at regular intervals, or event data, which is data generated in response to a particular event. Stream ingestion services need to deal with potentially late and missing data. Streaming data is often ingested using Cloud Pub/Sub.
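As a quick sketch of streaming ingestion (the project ID, topic name, and attribute are hypothetical), a publisher can push small telemetry messages to Cloud Pub/Sub with the Python client library:

    from google.cloud import pubsub_v1

    publisher = pubsub_v1.PublisherClient()
    # Hypothetical project and topic; replace with your own.
    topic_path = publisher.topic_path("my-project", "sensor-telemetry")

    # Each reading is published as a small message; attributes carry metadata.
    future = publisher.publish(topic_path, b'{"temp": 21.3}', sensor_id="sensor-42")
    print("Published message ID:", future.result())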
What is bulk data?
Batch data is ingested in bulk, typically in files. Examples of batch data ingestion include uploading files of data exported from one application to be processed by another. Both batch and streaming data can be transformed and processed using Cloud Dataflow.
What are the technical considerations to consider when choosing a data store?
These factors include the volume and velocity of data, the type of structure of the data, access control requirements, and data access patterns.
Know the three levels of structure of data.
Unstructured, semi-structured and structured.
What products store structured data in GCP?
CloudSQL and CloudSpanner for transactional
BigQuery for analytical
What products store semi-structured data in GCP?
Cloud Datastore if data access requires full indexing; otherwise, Bigtable.
What products store unstructured data in GCP?
Cloud Storage
What are the four types of NoSQL databases?
Four types of NoSQL databases are key-value, document, wide-column, and graph databases
What are some concerns about streaming data?
Stream ingestion services need to deal with potentially late and missing data
What tool can transform batch and streaming data?
Both batch and streaming data can be transformed and processed using Cloud Dataflow.
Which database engines does Cloud SQL support?
Cloud SQL supports MySQL, PostgreSQL, and SQL Server (beta).
How is Cloud SQL initially set up for availability?
Cloud SQL instances are created in a single zone by default, but they can be configured for high availability using instances in multiple zones.
How can you improve reads in Cloud SQL?
Use read replicas.
What is Cloud Spanner?
Cloud Spanner is a horizontally scalable relational database that automatically replicates data.
What are the three types of replicas in Cloud Spanner?
Three types of replicas are read-write replicas, read-only replicas, and witness replicas.
How can you avoid hot-spotting in Cloud Spanner?
Avoid hotspots by not using consecutive values for primary keys.
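For example, a random UUID spreads writes across the key space instead of concentrating them on one split. A minimal sketch with the Python Spanner client (the instance, database, and table names are made up):

    import uuid
    from google.cloud import spanner

    client = spanner.Client()
    database = client.instance("my-instance").database("orders-db")

    # A UUIDv4 primary key avoids the hotspot a sequential order number would create.
    with database.batch() as batch:
        batch.insert(
            table="Orders",
            columns=("OrderId", "CustomerId", "Total"),
            values=[(str(uuid.uuid4()), "cust-123", 42.50)],
        )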
What kind of configuration does Cloud Spanner have?
Cloud Spanner instances are configured as regional or multi-regional.
What is Bigtable?
Cloud Bigtable is a wide-column NoSQL database used for high-volume databases that require sub-10 ms latency (fast writes).
What use cases are there for Bigtable?
Cloud Bigtable is used for IoT, time-series, finance, and similar applications.
How do you make Bigtable highly available?
For multi-regional high availability, you can create a replicated cluster in another region. All data is replicated between clusters.
How is data stored in Bigtable?
Data is stored in Bigtable lexicographically by row-key, which is the one indexed column in a Bigtable table.
How do you improve reads in Bigtable?
Keeping related data in adjacent rows can help make reads more efficient.
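As an illustration (the instance, table, and column family names are hypothetical), prefixing the row key with the entity ID keeps one sensor's readings in adjacent rows, so a bounded range scan over that prefix is efficient:

    from google.cloud import bigtable

    client = bigtable.Client(project="my-project")
    table = client.instance("my-instance").table("sensor-data")

    # The row key groups all of sensor-42's readings together in lexicographic order.
    row = table.direct_row(b"sensor-42#2024-01-01T12:00:00Z")
    row.set_cell("metrics", "temp", b"21.3")
    row.commit()

    # Related rows share the prefix, so this scan reads only sensor-42's data.
    for r in table.read_rows(start_key=b"sensor-42#", end_key=b"sensor-42$"):
        print(r.row_key)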
What is Cloud Firestore?
Cloud Firestore is a document database that is replacing Cloud Datastore as the managed document database.
What is BigQuery?
BigQuery is an analytics database that uses SQL as a query language
What are datasets in BigQuery?
Datasets are the basic unit of organization for sharing data in BigQuery. A dataset can have multiple tables.
Which SQL dialects does BigQuery support?
BigQuery supports two dialects of SQL: legacy and standard.
What are streaming inserts in BigQuery?
Streaming inserts allow adding one row at a time.
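A minimal sketch of a streaming insert with the Python BigQuery client (the table ID and row schema are hypothetical):

    from google.cloud import bigquery

    client = bigquery.Client()
    table_id = "my-project.my_dataset.events"  # hypothetical table

    # Rows become available for querying within seconds of the insert.
    rows = [{"user_id": "u1", "event": "click", "ts": "2024-01-01T12:00:00Z"}]
    errors = client.insert_rows_json(table_id, rows)
    if errors:
        print("Insert errors:", errors)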
What does Stackdriver do in BigQuery?
Stackdriver is used for monitoring and logging in BigQuery. Stackdriver Monitoring provides performance metrics, such as query counts and query execution times, and Stackdriver Logging tracks events, such as running jobs or creating tables.
How are BigQuery costs managed?
BigQuery costs are based on the amount of data stored, the amount of data streamed, and the workload required to execute queries.
What is Cloud Memorystore?
Cloud Memorystore is a managed Redis service. Redis instances can be created using the Cloud Console or gcloud commands
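Because Memorystore exposes the standard Redis protocol, a client in the same VPC connects with an ordinary Redis library; the host IP below is a placeholder for the instance's internal address:

    import redis

    # Placeholder internal IP of the Memorystore instance; 6379 is the Redis default port.
    r = redis.Redis(host="10.0.0.3", port=6379)
    r.set("session:123", "active", ex=3600)  # expire after one hour
    print(r.get("session:123"))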
When is Cloud Memorystore under memory pressure?
When the memory used by Redis exceeds 80 percent of system memory, the instance is considered under memory pressure.
What is Google Cloud Storage?
It's an object storage service, similar to Amazon S3.
Where are objects stored in Google Cloud Storage?
In buckets, which share access controls at the bucket level.
What are the four storage tiers of Google Cloud Storage?
The four storage tiers are Regional, Multi-regional, Nearline, and Coldline.
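A small sketch using the google-cloud-storage client to create a bucket with a specific storage class and upload an object (the bucket name and file are made up):

    from google.cloud import storage

    client = storage.Client()

    # Hypothetical bucket for infrequently accessed backups.
    bucket = client.bucket("my-archive-bucket")
    bucket.storage_class = "NEARLINE"
    client.create_bucket(bucket, location="us-central1")

    blob = bucket.blob("backups/2024-01-01.tar.gz")
    blob.upload_from_filename("2024-01-01.tar.gz")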
How are data pipelines modeled?
Data pipelines are modeled as directed acyclic graphs (DAGs)
What are the four stages of the data pipeline?
Ingestion - bringing data into the GCP environment.
Transformation - mapping data from the structure used in the source system to the structure used in the storage and analysis stages of the pipeline.
Storage - Cloud Storage can be used both as the staging area for data immediately after ingestion and as a long-term store for transformed data. BigQuery can treat files in Cloud Storage as external tables and query them, and Cloud Dataproc can use Cloud Storage as HDFS-compatible storage.
Analysis - can take several forms, from simple SQL querying and report generation to machine learning model training and data science analysis.
What are the common patterns in data warehousing pipelines?
ETL (extract, transform, load)
ELT (extract, load, transform)
CDC (change data capture)
EL (extract and load)
What are the unique considerations for streaming data?
The difference between event time and processing time, sliding and tumbling windows, late-arriving data and watermarks, and missing data.
What are the components of a typical ML pipeline?
This includes data ingestion, data preprocessing, feature engineering, model training and evaluation, and deployment.
What is Cloud Pub/Sub?
Cloud Pub/Sub is a managed message queue service.
Does Cloud Pub/Sub scale as needed?
Cloud Pub/Sub will automatically scale as needed.
When are messaging queues used?
Messaging queues are used in distributed systems to decouple services in a pipeline. This allows one service to produce more output than the consuming service can process without adversely affecting the consuming service. This is especially helpful when one process is subject to spikes.
I.e., lots of messages = no worries, just add them to the queue.
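To sketch the consumer side of that decoupling (the project and subscription names are hypothetical), a subscriber pulls messages at its own pace while the subscription's backlog absorbs any spike:

    from concurrent.futures import TimeoutError
    from google.cloud import pubsub_v1

    subscriber = pubsub_v1.SubscriberClient()
    subscription_path = subscriber.subscription_path("my-project", "sensor-telemetry-sub")

    def callback(message):
        # Process the message, then ack so Pub/Sub does not redeliver it.
        print("Received:", message.data)
        message.ack()

    # Messages queue up in the subscription if this consumer falls behind.
    streaming_pull = subscriber.subscribe(subscription_path, callback=callback)
    try:
        streaming_pull.result(timeout=30)
    except TimeoutError:
        streaming_pull.cancel()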
What is Cloud Dataflow?
Cloud Dataflow is a managed stream and batch processing service.
How does Cloud Dataflow work to help?
In the past, developers would typically create a stream processing pipeline (hot path) and a separate batch processing pipeline (cold path). Cloud Dataflow combines the two.
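Dataflow pipelines are written with the Apache Beam SDK; a minimal batch word count looks like the sketch below (the bucket paths are placeholders), and swapping the text source for a Pub/Sub source turns the same transforms into a streaming pipeline:

    import apache_beam as beam

    # Runs locally by default; pass --runner=DataflowRunner plus project and region
    # options to execute the same code on Cloud Dataflow.
    with beam.Pipeline() as p:
        (p
         | "Read"  >> beam.io.ReadFromText("gs://my-bucket/input.txt")
         | "Split" >> beam.FlatMap(lambda line: line.split())
         | "Pair"  >> beam.Map(lambda word: (word, 1))
         | "Count" >> beam.CombinePerKey(sum)
         | "Write" >> beam.io.WriteToText("gs://my-bucket/wordcount"))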
What is Cloud Dataproc?
Cloud Dataproc is a managed Hadoop and Spark service.
How does Cloud Dataproc support on-prem migrations?
You can move your on-prem Hadoop to Dataproc.
What is Cloud Composer?
Cloud Composer is a managed service implementing Apache Airflow.
What type of nodes does Dataproc support?
Cloud Dataproc clusters consist of two types of nodes: master nodes and worker nodes.
What does Cloud Composer do?
Cloud Composer automates the scheduling and monitoring of workflows.
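Workflows are defined as Airflow DAGs in Python; a minimal sketch (the DAG ID and task commands are made up) of the kind of file Cloud Composer picks up from the environment's DAGs folder:

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="daily_export",
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        extract = BashOperator(task_id="extract", bash_command="echo extracting")
        load = BashOperator(task_id="load", bash_command="echo loading")
        extract >> load  # load runs only after extract succeeds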
What do you need to do when migrating from on-premises Hadoop and Spark to GCP?
Do it incrementally.
Migrate HBase to Bigtable.
Manage the synchronization between on-premises and cloud data during the migration.
What is Compute Engine?
It's like EC2; you have complete control over the virtual machine instances.
What is GKE?
Kubernetes is a container orchestration system, and Kubernetes Engine is a managed Kubernetes service. With Kubernetes Engine, Google maintains the cluster and assumes responsibility for installing and configuring the Kubernetes platform on the cluster. Kubernetes Engine deploys Kubernetes on managed instance groups.