Data Engineering Foundations Flashcards

1
Q

The data lifecycle consists of four stages:

A

Ingest
Store
Process and analyze
Explore and visualize

Ingestion is the first stage in the data lifecycle, and it entails acquiring data and bringing data into the Google Cloud Platform (GCP).

The storage stage is about persisting data to a storage system from which it can be accessed for later stages of the data lifecycle.

The process and analyze stage begins with transforming data into a usable format for analysis applications.

Explore and visualize is the final stage, in which insights are derived from analysis and presented in tables, charts, and other visualizations for use by others.

2
Q

The three broad ingestion modes with which data engineers typically work are as follows:

A

Application data
Streaming data
Batch data

3
Q

Application Data what are examples and where does it come from?

A

Examples of application data include the following:

Transactions from an online retail application
Clickstream data from users reading articles on a news site
Log data from a server running computer-aided design software
User registration data from an online service
Application data can be ingested by services running in Compute Engine, Kubernetes Engine, or App Engine, for example. Application data can also be written to Stackdriver Logging or one of the managed databases, such as Cloud SQL or Cloud Datastore.

Where from?
Application data is generated by applications, including mobile apps, and pushed to backend services.

This data includes user-generated data, like a name and shipping address collected as part of a sales transaction.

It also includes data generated by the application, such as log data.

Event data, like clickstream data, is also a type of application-generated data.
The volume of this kind of data depends on the number of users of the application, the types of data the application generates, and the duration of time the application is in use. The size of application data sent in a single operation can vary widely. A clickstream event may have less than 1 KB of data, whereas an image upload could be multiple megabytes.

4
Q

What is streaming data and what are examples?

A

Streaming data is a set of data that is typically sent in small messages that are transmitted continuously from the data source. Streaming data may be sensor data, which is data generated at regular intervals, and event data, which is data generated in response to a particular event. Examples of streaming data include the following:

Virtual machine monitoring data, such as CPU utilization rates and memory consumption data
An IoT device that sends temperature, humidity, and pressure data every minute
A customer adding an item to an online shopping cart, which then generates an event with data about the customer and the item

Time-series data may require some additional processing early in the ingestion process. If a stream of data needs to be in time order for processing, then late arriving data will need to be inserted in the correct position in the stream. This can require buffering of data for a short period of time in case the data arrives out of order. Of course, there is a maximum amount of time to wait before processing data.
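
As a rough illustration of the buffering idea, the sketch below (plain Python, with a hypothetical 60-second wait window) holds out-of-order events in a min-heap keyed by event time and releases them in event-time order once they have aged past the maximum wait.

```python
import heapq
import itertools
import time

MAX_WAIT_SECONDS = 60
_counter = itertools.count()   # tie-breaker so payloads are never compared
buffer = []                    # min-heap ordered by (event_time, arrival order)

def ingest(event_time, payload):
    """Buffer an event, keyed by its event timestamp."""
    heapq.heappush(buffer, (event_time, next(_counter), payload))

def flush(now=None):
    """Emit buffered events whose event time has aged past the wait window, in order."""
    now = now if now is not None else time.time()
    ready = []
    while buffer and buffer[0][0] <= now - MAX_WAIT_SECONDS:
        event_time, _, payload = heapq.heappop(buffer)
        ready.append((event_time, payload))
    return ready
```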

5
Q

What is the event time?

A

Streaming data often includes a timestamp indicating the time that the data was generated. This is known as the event time.

6
Q

What is known as the process time?

A

Some applications will also track the time that data arrives at the beginning of the ingestion pipeline. This is known as the process time.

7
Q

Streaming Data

A

Streaming data is well suited for Cloud Pub/Sub ingestion, which can buffer data while applications process the data. During spikes in data ingestion in which application instances cannot keep up with the rate at which data is arriving, the data can be preserved in a Cloud Pub/Sub topic and processed later after application instances have a chance to catch up. Cloud Pub/Sub has global endpoints and uses GCP’s global frontend load balancer to support ingestion. The messaging service scales automatically to meet the demands of the current workload.
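
A minimal publishing sketch using the google-cloud-pubsub Python client is shown below; the project ID, topic name, and message attributes are placeholders.

```python
from google.cloud import pubsub_v1

# Sketch: publish a small monitoring message to a Cloud Pub/Sub topic.
# "my-project" and "telemetry" are placeholder names.
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "telemetry")

future = publisher.publish(
    topic_path,
    data=b'{"cpu_utilization": 0.72}',  # payload must be bytes
    source="vm-monitoring",             # optional attributes as keyword arguments
)
print(future.result())  # blocks until the server returns a message ID
```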

8
Q

What is Batch Data?

A

Batch data is ingested in bulk, typically in files. Examples of batch data ingestion include uploading files of data exported from one application to be processed by another.

9
Q

Examples of batch data include the following:

A

Transaction data that is collected from applications may be stored in a relational database and later exported for use by a machine learning pipeline
Archiving data in long-term storage to comply with data retention regulations
Migrating an application from on premises to the cloud by uploading files of exported data

10
Q

The focus of the storage stage of the data lifecycle is to make data available for transformation and analysis. What factors influence the choice of storage system?

A

Several factors influence the choice of storage system, including

How the data is accessed—by individual record (row) or by an aggregation of columns across many records (rows)
The way access controls need to be implemented, at the schema or database level or finer-grained level
How long the data will be stored
These three characteristics are the minimum that should be considered when choosing a storage system; there may be additional criteria for some use cases. (Structure is another factor and is discussed later in this chapter.)

11
Q

What are Data Access Patterns?

A

Data is accessed in different ways. Online transaction processing systems often query for specific records using a set of filtering parameters. For example, an e-commerce application may need to look up a customer shipping address from a data store table that holds tens of thousands of addresses. Databases, like Cloud SQL and Cloud Datastore, provide that kind of query functionality.

In another example, a machine learning pipeline might begin by accessing files with thousands of rows of data that is used for training the model. Since machine learning models are often trained in batch mode, all of the training data is needed. Cloud Storage is a good option for storing data that is accessed in bulk.

12
Q

Time to Store is required for?

A

Consider how long data will be stored when choosing a data store. Some data is transient. For example, data that is needed only temporarily by an application running on a Compute Engine instance could be stored on a local solid-state drive (SSD) on the instance. As long as it is acceptable for the data to be lost when the instance shuts down, this could be a reasonable option.

Data is often needed longer than the lifetime of a virtual machine instance, so other options are better fits for those cases. Cloud Storage is a good option for long-term storage, especially if you can make use of storage lifecycle policies to migrate older data to Nearline or Coldline storage. For long-lived analytics data, Cloud Storage or BigQuery are good options, since the costs are similar.

Nearline storage is used for data that is accessed less than once per 30 days. Coldline storage is used for data that is accessed less than once per year.

13
Q

Statistical techniques are often used with numeric data to do the following:

A

Describe characteristics of a dataset, such as a mean and standard deviation of the dataset.
Generate histograms to understand the distribution of values of an attribute.
Find correlations between variables, such as customer type and average revenue per sales order.
Make predictions using regression models, which allow you to estimate one attribute based on the value of another. In statistical terms, regression models generate predictions of a dependent variable based on the value of an independent variable.
Cluster subsets of a dataset into groups of similar entities. For example, a retail sales dataset may yield groups of customers who purchase similar types of products and spend similar amounts over time.
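
To make this concrete, here is a small pandas sketch over a hypothetical retail sales table; the column names are made up for illustration.

```python
import pandas as pd

# Hypothetical retail sales data
df = pd.DataFrame({
    "customer_type": ["retail", "wholesale", "retail", "wholesale"],
    "order_revenue": [120.0, 980.0, 85.5, 1100.0],
    "items_per_order": [3, 40, 2, 52],
})

print(df["order_revenue"].mean(), df["order_revenue"].std())  # central tendency and spread
print(df["order_revenue"].corr(df["items_per_order"]))        # correlation between two attributes
print(df.groupby("customer_type")["order_revenue"].mean())    # average revenue per customer type
```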

14
Q

Some storage services are designed to store large volumes of data, including petabyte scales, whereas others are limited to smaller volumes.

A

Cloud Storage is an example of the former. An individual item in Cloud Storage can be up to 5 TB, and there is no limit to the number of read or write operations. Cloud Bigtable, which is used for telemetry data and large-volume analytic applications, can store up to 8 TB per node when using hard disk drives, and it can store up to 2.5 TB per node when using SSDs. Each Bigtable instance can have up to 1,000 tables. BigQuery, the managed data warehouse and analytics database, has no limit on the number of tables in a dataset, and it may have up to 4,000 partitions per table. Persistent disks, which can be attached to Compute Engine instances, can store up to 64 TB.

15
Q

What is Velocity

A

Velocity of data is the rate at which it is sent to and processed by an application. Web applications and mobile apps that collect and store human-entered data are typically low velocity, at least when measured by individual user.

16
Q

There are three widely recognized categories:

Structured
Semi-structured
Unstructured
What do they mean?

A

These categories are particularly helpful when choosing a database.

Structured Data  Structured data has a fixed set of attributes that can be modeled in a table of rows and columns.

Semi-Structured Data  Semi-structured data has attributes like structured data, but the set of attributes can vary from one instance to another. For example, a product description of an appliance might include length, width, height, weight, and power consumption. A chair in the same catalog might have length, width, height, color, and style as attributes. Semi-structured data may be organized using arrays or sets of key-value pairs.

Unstructured Data  Unstructured data does not fit into a tabular structure. Images and audio files are good examples of unstructured data. In between these two extremes lies semi-structured data, which has characteristics of both structured and unstructured.

17
Q

What is Row Key Access?

A

Wide-column databases usually take a different approach to querying. Rather than using indexes to allow efficient lookup of rows with needed data, wide-column databases organize data so that rows with similar row keys are close together. Queries use a row key, which is analogous to a primary key in relational databases, to retrieve data. This has two implications.

18
Q

Google has developed a decision tree for choosing a storage system that starts with distinguishing structured, semi-structured, and unstructured data.

A
19
Q

Schema Design Considerations

A

Structured and semi-structured data has a schema associated with it. Structured data is usually stored in relational databases whereas semi-structured data is often stored in NoSQL databases. The schema influences how data is stored and accessed, so once you have determined which kind of storage technology to use, you may then need to design a schema that will support optimal storage and retrieval.

 The distinction between relational and NoSQL databases is becoming less pronounced as each type adopts features of the other. Some relational databases support storing and querying JavaScript Object Notation (JSON) structures, similar to the way that document databases do. Similarly, some NoSQL databases now support ACID (atomicity, consistency, isolation, durability) transactions, which are a staple feature of relational databases.

20
Q

Relational Database Design is centered around?

A

Data modeling for relational databases begins with determining which type of relational database you are developing: an online transaction processing (OLTP) database or an online analytical processing (OLAP) database.
Online transaction processing (OLTP) databases are designed for transaction processing and typically follow data normalization rules.
Denormalization—that is, intentionally violating one of the rules of normalization—is often used to improve query performance. For example, repeating customer names in both the customer table and an order table could avoid having to join the two tables when printing invoices.

Online analytical processing (OLAP) data models are often used for data warehouse and data mart applications. OLAP models are also called dimensional models because data is organized around several dimensions.

21
Q

NoSQL Database Design is centered around on?

A

NoSQL databases are less structured than relational databases, and there is no formal model, like relational algebra and forms of normalization, that applies to all NoSQL databases. The four types of NoSQL databases available in GCP are

Key-value
Document
Wide column
Graph

22
Q

Key-value data stores

A

Key-value data stores are databases that use associative arrays or dictionaries as the basic datatype. Keys are data used to look up values.
Key-value data stores are simple, but it is possible to have more complex data structures as values. For example, a JSON object could be stored as a value. This would be reasonable use of a key-value data store if the JSON object was only looked up by the key, and there was no need to search on items within the JSON structure. In situations where items in the JSON structure should be searchable, a document database would be a better option.
Cloud Memorystore is a fully managed key-value data store based on Redis, a popular open source key-value datastore. As of this writing, Cloud Memorystore does not support persistence, so it should not be used for applications that need to save data to persistent storage. Open source Redis does support persistence.
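
Because Cloud Memorystore exposes the standard Redis protocol, a regular Redis client can be used against it. The sketch below assumes the redis-py library and a placeholder private IP address for the instance.

```python
import json
import redis

# Connect to a Memorystore for Redis instance (placeholder host IP).
r = redis.Redis(host="10.0.0.3", port=6379)

# Simple key-value lookups
r.set("session:1234", "active", ex=3600)      # value expires after one hour
print(r.get("session:1234"))

# A JSON document stored as an opaque value, looked up only by its key
r.set("user:42", json.dumps({"name": "Ada", "plan": "pro"}))
profile = json.loads(r.get("user:42"))
print(profile["plan"])
```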

23
Q

Document Databases

A

Document stores allow complex data structures, called documents, to be used as values and accessed in more ways than simple key lookup. When designing a data model for document databases, documents should be designed to group data that is read together.

24
Q

Wide-column databases are used for use cases with the following?

A

High volumes of data
Need for low-latency writes
More write operations than read operations
Limited range of queries—in other words, no ad hoc queries
Lookup by a single key
Wide-column databases have a data model similar to the tabular structure of relational tables, but there are significant differences. Wide-column databases are often sparse, with the exception of IoT and other time-series databases that have few columns that are almost always used.

25
Q

Graph Databases

A

Another type of NoSQL database is the graph database, which is based on modeling entities and relationships as nodes and links in a graph or network. Social networks are a good example of a use case for graph databases. People could be modeled as nodes in the graph, and relationships between people are links, also called edges.

26
Q

How does Cloud SQL support data engineers?

A

Cloud SQL supports MySQL, PostgreSQL, and SQL Server (beta).  Cloud SQL instances are created in a single zone by default, but they can be created for high availability and use instances in multiple zones. Use read replicas to improve read performance. Importing and exporting are implemented via the RDBMS-specific tool.

27
Q

How does Cloud Spanner support Data Engineers?
Replica Types?

A

Cloud Spanner is configured as regional or multi-regional instances.  Cloud Spanner is a horizontally scalable relational database that automatically replicates data. Three types of replicas are read-write replicas, read-only replicas, and witness replicas. Avoid hotspots by not using consecutive values for primary keys.

28
Q

How does Cloud Bigtable support data engineers?

A

Cloud Bigtable is a wide-column NoSQL database used for high-volume databases that require sub-10 ms latency.  Cloud Bigtable is used for IoT, time-series, finance, and similar applications. For multi-regional high availability, you can create a replicated cluster in another region. All data is replicated between clusters. Designing tables for Bigtable is fundamentally different from designing them for relational databases. Bigtable tables are denormalized, and they can have thousands of columns. There is no support for joins in Bigtable or for secondary indexes. Data is stored in Bigtable lexicographically by row-key, which is the one indexed column in a Bigtable table. Keeping related data in adjacent rows can help make reads more efficient.
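
The sketch below, using the google-cloud-bigtable Python client with placeholder project, instance, and table IDs, shows one common row-key pattern: combining a device ID with a fixed-width timestamp so that a device's readings sort next to each other.

```python
from google.cloud import bigtable

# Placeholder project, instance, and table identifiers
client = bigtable.Client(project="my-project")
table = client.instance("iot-instance").table("sensor-readings")

device_id = "thermostat-0042"
timestamp = "20240101T120000"                  # fixed-width so lexicographic order matches time order
row_key = f"{device_id}#{timestamp}".encode()  # row keys are bytes

row = table.direct_row(row_key)
row.set_cell("metrics", b"temperature", b"21.5")  # column family, column, value
row.commit()
```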

29
Q

The Cloud Firestore data model consists of?

A

The Cloud Firestore data model consists of entities, entity groups, properties, and keys.

Entities have properties that can be atomic values, arrays, or entities.

Keys can be used to look up entities and their properties.

Alternatively, entities can be retrieved using queries that specify properties and values, much like using a WHERE clause in SQL.

However, to query using property values, properties need to be indexed.
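
A brief sketch with the google-cloud-datastore Python client (Firestore in Datastore mode) is shown below; the kind, key, and property names are hypothetical.

```python
from google.cloud import datastore

client = datastore.Client(project="my-project")

# Lookup by key
key = client.key("Customer", 5001)
entity = client.get(key)

# Query on an indexed property, similar to a WHERE clause in SQL
query = client.query(kind="Customer")
query.add_filter("region", "=", "us-west1")
for customer in query.fetch(limit=10):
    print(customer["name"])
```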

30
Q

BigQuery has what major components?

A

BigQuery is an analytics database that uses SQL as a query language.  
Datasets are the basic unit of organization for sharing data in BigQuery.
A dataset can have multiple tables.

BigQuery supports two dialects of SQL: legacy and standard. Standard SQL supports advanced SQL features such as correlated subqueries, ARRAY and STRUCT data types, and complex join expressions.
BigQuery uses the concepts of slots for allocating computing resources to execute queries.
BigQuery also supports streaming inserts, which load one row at a time.

Data is generally available for analysis within a few seconds, but it may be up to 90 minutes before data is available for copy and export operations.
Streaming inserts provide for best effort de-duplication. Stackdriver is used for monitoring and logging in BigQuery.
Stackdriver Monitoring provides performance metrics, such as query counts and time to run queries. Stackdriver Logging is used to track events, such as running jobs or creating tables. BigQuery costs are based on the amount of data stored, the amount of data streamed, and the workload required to execute queries.
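
The sketch below uses the google-cloud-bigquery Python client to run a standard SQL query and perform a streaming insert; the project, dataset, and table names are placeholders, and the table is assumed to already exist with a matching schema.

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

# Standard SQL query
query = """
    SELECT page, COUNT(*) AS views
    FROM `my-project.analytics.clickstream`
    GROUP BY page
    ORDER BY views DESC
    LIMIT 10
"""
for row in client.query(query).result():
    print(row.page, row.views)

# Streaming insert: one row at a time, generally queryable within seconds
errors = client.insert_rows_json(
    "my-project.analytics.clickstream",
    [{"page": "/home", "user_id": "u-123", "event_time": "2024-01-01T12:00:00"}],
)
print(errors)  # an empty list means the insert succeeded
```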

31
Q

How does Cloud Storage support data engineers?

A

Google Cloud Storage is an object storage system. It is designed for persisting unstructured data, such as data files, images, videos, backup files, and any other data. It is unstructured in the sense that objects—that is, files stored in Cloud Storage—are treated as atomic units with no internal structure that the service interprets. Cloud Storage uses buckets to group objects; a bucket is a group of objects that share access controls at the bucket level. The four storage tiers are Regional, Multi-regional, Nearline, and Coldline.
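
A short sketch with the google-cloud-storage Python client is shown below; the bucket and object names are placeholders.

```python
from google.cloud import storage

client = storage.Client(project="my-project")
bucket = client.bucket("my-archive-bucket")

# Upload a file as an object in the bucket
blob = bucket.blob("exports/2024-01-01/transactions.csv")
blob.upload_from_filename("transactions.csv")

# Move an individual object to colder, cheaper storage
blob.update_storage_class("COLDLINE")
```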

32
Q

When you manage your own databases, you will be responsible for an array of database and system administration tasks.
How do you manage monitoring?

A

  The two Stackdriver components that are used with unmanaged databases are Stackdriver Monitoring and Stackdriver Logging. Instances have built-in monitoring and logging. Monitoring includes CPU, memory, and I/O metrics. Audit logs, which have information about who created an instance, are also available by default. Once the Stackdriver Logging agent is installed, it can collect application logs, including database logs. Stackdriver Logging is configured with Fluentd, an open source data collector for logs. Once the Stackdriver Monitoring agent is installed, it can collect application performance metrics.

33
Q

Understand the definitions of availability, reliability, and scalability. 

A

 Availability is defined as the ability of a user to access a resource at a specific time. Availability is usually measured as the percentage of time a system is operational. Reliability is defined as the probability that a system will meet service-level objectives for some duration of time. Reliability is often measured as the mean time between failures. Scalability is the ability of a system to meet the demands of workloads as they vary over time.

34
Q

Know when to use hybrid clouds and edge computing.  

A

The analytics hybrid cloud is used when transaction processing systems continue to run on premises and data is extracted and transferred to the cloud for analytic processing. A variation of hybrid clouds is an edge cloud, which uses local computation resources in addition to cloud platforms. This architecture pattern is used when a network may not be reliable or have sufficient bandwidth to transfer data to the cloud. It is also used when low-latency processing is required.

35
Q

Understand messaging with data processing solutions.

A

Message brokers are services that provide three kinds of functionality: message validation, message transformation, and routing. Message validation is the process of ensuring that messages received are correctly formatted. Message transformation is the process of mapping data to structures that can be used by other services. Message brokers can receive a message and use data in the message to determine where the message should be sent. Routing is used when hub-and-spoke message brokers are used.

36
Q

Know distributed processing architectures for data processing solutions?

A

 SOA is a distributed architecture that is driven by business operations and delivering business value. Typically, an SOA system serves a discrete business activity. SOAs are self-contained sets of services. Microservices are a variation on SOA architecture. Like other SOA systems, microservice architectures use multiple, independent components and common communication protocols to provide higher-level business services. Serverless functions extend the principles of microservices by removing concerns for containers and managing runtime environments.

37
Q

Know the steps to migrate a data warehouse. 

A

 At a high level, the process of migrating a data warehouse involves four stages:

Assessing the current state of the data warehouse
Designing the future state
Migrating data, jobs, and access controls to the cloud
Validating the cloud data warehouse

38
Q

Know that Compute Engine supports provisioning single instances or groups of instances, known as instance groups. 

A

 Instance groups are either managed or unmanaged instance groups. Managed instance groups (MIGs) consist of identically configured VMs; unmanaged instance groups allow for heterogeneous VMs, but they should be used only when migrating legacy clusters from on-premises data centers.

39
Q

Understand the benefits of MIGs. These benefits include the following:

A

Autohealing based on application-specific health checks, which replace nonfunctioning instances
Support for multizone groups that provide for availability in spite of zone-level failures
Load balancing to distribute workload across all instances in the group
Autoscaling, which adds or removes instances in the group to accommodate increases and decreases in workloads
Automatic, incremental updates to reduce disruptions to workload processing

40
Q

Know that Kubernetes Engine is a managed Kubernetes service that provides ?  

A

Container orchestration
Containers are increasingly used to process workloads because they have less overhead than VMs and allow for finer-grained allocation of resources than VMs. A Kubernetes cluster has two types of instances: cluster masters and nodes.

41
Q

Understand Kubernetes abstractions. 

A

Pods are the smallest computation unit managed by Kubernetes. Pods contain one or more containers.

ReplicaSet is a controller that manages the number of pods running for a deployment. A deployment is a higher-level concept that manages ReplicaSets and provides declarative updates.

PersistentVolumes is Kubernetes’ way of representing storage allocated or provisioned for use by a pod. Pods acquire access to persistent volumes by creating a PersistentVolumeClaim, which is a logical way to link a pod to persistent storage.

StatefulSets are used to designate pods as stateful and assign a unique identifier to them. Kubernetes uses them to track which clients are using which pods and to keep them paired. An Ingress is an object that controls external access to services running in a Kubernetes cluster.

42
Q

Know how to provision Bigtable instances. 

A

 Cloud Bigtable is a managed wide-column NoSQL database used for applications that require high-volume, low-latency writes. Bigtable has an HBase interface, so it is also a good alternative to using Hadoop HBase on a Hadoop cluster. Bigtable instances can be provisioned using the cloud console, the command-line SDK, and the REST API. When creating an instance, you provide an instance name, an instance ID, an instance type, a storage type, and cluster specifications.

43
Q

Know how to provision Cloud Dataproc.  

A

When provisioning Cloud Dataproc resources, you will specify the configuration of a cluster using the cloud console, the command-line SDK, or the REST API. When you create a cluster, you will specify a name, a region, a zone, a cluster mode, machine types, and an autoscaling policy. The cluster mode determines the number of master nodes and possible worker nodes. Master nodes and worker nodes are configured separately. For each type of node, you can specify a machine type, disk size, and disk type.

44
Q

Understand that serverless services do not require conventional infrastructure provisioning but can be configured. 

A

You can configure App Engine using the app.yaml, cron.yaml, dispatch.yaml, or queue.yaml file. Cloud Functions can be configured using parameters to specify memory, region, timeout, and max instances. Cloud Dataflow parameters include job name, project ID, runner, staging location, and the default and maximum number of worker nodes.

45
Q

Understand the structure and function of policies.

A

A policy consists of bindings, metadata, and an audit configuration. Bindings specify how access is granted to a resource. Bindings are made up of members, roles, and conditions. The metadata of a policy includes attributes called etag and version. Audit configurations describe which permission types are logged and which identities are exempt from logging. Policies can be defined at different levels of the resource hierarchy, including organizations, folders, projects, and individual resources. Only one policy at a time can be assigned to an organization, folder, project, or individual resource.

46
Q

Understand key management.  

A

Cloud KMS is a hosted key management service in the Google Cloud. It enables customers to generate and store keys in GCP. It is used when customers want control over key management. Customer-supplied keys are used when an organization needs complete control over key management, including storage.

50
Q

Understand the purpose of Stackdriver Monitoring, Stackdriver Logging, and Stackdriver Trace.  

A

Stackdriver Monitoring collects metrics on the performance of infrastructure resources and applications. Stackdriver Logging is a service for storing and searching log data about events in infrastructure and applications. Stackdriver Trace is a distributed tracing system designed to collect data on how long it takes to process requests to services.

51
Q

Know that Cloud Dataprep is an interactive tool for ?

A

preparing data for analysis and machine learning.  
Cloud Dataprep is used to cleanse, enrich, import, export, discover, structure, and validate data. The main cleansing operations in Cloud Dataprep center around altering column names, reformatting strings, and working with numeric values. Cloud Dataprep supports this process by providing for filtering data, locating outliers, deriving aggregates, calculating values across columns, and comparing strings.

52
Q

Understand that Cloud Datalab is an interactive tool for?

A

exploring and transforming data.  
Cloud Datalab runs as an instance of a container. Users of Cloud Datalab create a Compute Engine instance, run the container, and then connect from a browser to a Cloud Datalab notebook, which is a Jupyter Notebook.

Many of the commonly used packages are available in Cloud Datalab, but when users need to add others, they can do so by using either the conda install command or the pip install command.

53
Q

Know the stages of ML pipelines. 

A

 Data ingestion, data preparation, data segregation, model training, model evaluation, model deployment, and model monitoring are the stages of ML pipelines. Although the stages are listed in a linear manner, ML pipelines are more cyclic than linear, especially relating to training and evaluation.

54
Q

Understand batch and streaming ingestion.

A

Batch data ingestion should use a dedicated process for ingesting each distinct data source. Batch ingestion often occurs on a relatively fixed schedule, much like many data warehouse ETL processes. It is important to be able to track which batch each record comes from, so include a batch identifier with each record that is ingested. Cloud Pub/Sub is designed for scalable messaging, including ingesting streaming data. Cloud Pub/Sub is a good option for ingesting streaming data that will be stored in a database, such as Bigtable or Cloud Firestore, or immediately consumed by machine learning processes running in Cloud Dataflow, Cloud Dataproc, Kubernetes Engine, or Compute Engine. When using BigQuery, you have the option of using streaming inserts.

55
Q

Know the three kinds of data preparation. 

A

 The three kinds of data preparation are data exploration, data transformation, and feature engineering. Data exploration is the first step in working with a new data source or a data source that has had significant changes. The goal of this stage is to understand the distribution of data and the overall quality of data. Data transformation is the process of mapping data from its raw form into data structures and formats that allow for machine learning. Transformations can include replacing missing values with a default value, changing the format of numeric values, and deduplicating records. Feature engineering is the process of adding or modifying the representation of features to make implicit patterns more explicit. For example, if a ratio of two numeric features is important to classifying an instance, then calculating that ratio and including it as a feature may improve the model quality. Feature engineering includes the understanding of key attributes (features) that are meaningful for machine learning objectives at hand. This includes dimensional reduction.
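
The pandas sketch below illustrates a few of these steps (deduplication, filling missing values, and an engineered ratio feature) on made-up data.

```python
import pandas as pd

df = pd.DataFrame({
    "order_total": [120.0, None, 85.5, 300.0, 300.0],
    "items":       [3, 5, 2, 10, 10],
})

df = df.drop_duplicates()                                                  # remove duplicate records
df["order_total"] = df["order_total"].fillna(df["order_total"].median())  # replace missing values
df["avg_item_price"] = df["order_total"] / df["items"]                    # engineered ratio feature
print(df)
```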

56
Q

Know that data segregation is the process of splitting a dataset into three segments:

A

training, validation, and test data.  Training data is used to build machine learning models. Validation data is used during hyperparameter tuning. Test data is used to evaluate model performance. The main criteria for deciding how to split data are to ensure that the test and validation datasets are large enough to produce statistically meaningful results, that test and validation datasets are representative of the data as a whole, and that the training dataset is large enough for the model to learn to make accurate predictions with reasonable precision and recall.
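
A common way to produce the three segments is two successive splits, as in the scikit-learn sketch below on synthetic data (roughly 70/15/15).

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 5)          # synthetic feature matrix
y = np.random.randint(0, 2, 1000)    # synthetic binary labels

X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.30, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_hold, y_hold, test_size=0.50, random_state=42)
print(len(X_train), len(X_val), len(X_test))  # 700 150 150
```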

57
Q

Understand the process of training a model. 

A

 Know that feature selection is the process of evaluating how a particular attribute or feature contributes to the predictiveness of a model. The goal is to have features of a dataset that allow a model to learn to make accurate predictions. Know that underfitting creates a model that is not able to predict values of training data correctly or new data that was not used during training.

58
Q

Understand underfitting, overfitting, and regularization. 

A

The problem of underfitting may be corrected by increasing the amount of training data, using a different machine learning algorithm, or modifying hyperparameters. Understand that overfitting occurs when a model fits the training data too well. One way to compensate for the impact of noise in the data and reduce the risk of overfitting is to introduce a penalty for data points that make the model more complicated. This process is called regularization. Two kinds of regularization are L1 regularization, which is also known as Lasso Regularization, for Least Absolute Shrinkage and Selection Operator, and L2 or Ridge Regression.
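
The scikit-learn sketch below fits L1 (Lasso) and L2 (Ridge) regularized linear models to synthetic data; alpha is the strength of the complexity penalty.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] + rng.normal(scale=0.5, size=200)  # only the first feature matters

lasso = Lasso(alpha=0.1).fit(X, y)   # L1: tends to drive irrelevant coefficients to zero
ridge = Ridge(alpha=1.0).fit(X, y)   # L2: shrinks coefficients without zeroing them
print(lasso.coef_)
print(ridge.coef_)
```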

59
Q

Know ways to evaluate a model. 

A

Methods for evaluating a model include individual evaluation metrics, such as accuracy, precision, recall, and the F measure; k-fold cross-validation; confusion matrices; and bias and variance. K-fold cross-validation is a technique for evaluating model performance by splitting a dataset into k segments, where k is an integer. Confusion matrices are used with classification models to show the relative performance of a model. In the case of a binary classifier, a confusion matrix would be 2×2, with one column and one row for each value.
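
The scikit-learn sketch below shows 5-fold cross-validation and a 2×2 confusion matrix for a binary classifier trained on synthetic data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # synthetic binary labels

model = LogisticRegression()
print(cross_val_score(model, X, y, cv=5))  # accuracy for each of the 5 folds

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model.fit(X_train, y_train)
print(confusion_matrix(y_test, model.predict(X_test)))  # 2x2 matrix for a binary classifier
```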

60
Q

Know options for deploying machine learning workloads on GCP.

A

These options include Cloud AutoML, BigQuery ML, Kubeflow, and Spark MLlib. Cloud AutoML is a machine learning service designed for developers who want to incorporate machine learning in their applications without having to learn many of the details of ML. BigQuery ML enables users of the analytical database to build machine learning models using SQL and data in BigQuery datasets. Kubeflow is an open source project for developing, orchestrating, and deploying scalable and portable machine learning workloads. Kubeflow is designed for the Kubernetes platform. Cloud Dataproc is a managed Spark and Hadoop service. Included with Spark is a machine learning library called MLlib, and it is a good option for machine learning workloads if you are already using Spark or need one of the more specialized algorithms included in Spark MLlib.

61
Q

Understand bias and variance.  

A

Bias is the difference between the average prediction of a model and the correct prediction of a model. Models with high bias tend to have oversimplified models; this is underfitting the model. Variance is the variability in model predictions. Models with high variance tend to overfit training data so that the model works well when making predictions on the training data but does not generalize to data that the model has not seen before.

62
Q

Understand that single machines are useful for training ? models

A

small.

This includes when you are developing machine learning applications or exploring data using Jupyter Notebooks or related tools. Cloud Datalab, for example, runs instances in Compute Engine virtual machines.

63
Q

Know that you also have the option of offloading some of the training load from CPUs to ?

A

GPUs.  GPUs have high-bandwidth memory and typically outperform CPUs on floating-point operations. GCP uses NVIDIA GPUs, and NVIDIA is the creator of CUDA, a parallel computing platform that facilitates the use of GPUs.

64
Q

Know that distributing model training over a group of servers provides for scalability and improved availability.  

A

There are a variety of ways to use distributed infrastructure, and the best choice for you will depend on your specific requirements and development practices. One way to distribute training is to use machine learning frameworks that are designed to run in a distributed environment, such as TensorFlow.

65
Q

Understand that serving a machine learning model is the process of making the model available to make ?????

A

predictions for other services. 

When serving models, you need to consider latency, scalability, and version management. Serving models from a centralized location, such as a data center, can introduce latency because input data and results are sent over the network. If an application needs real-time results, it is better to serve the model closer to where it is needed, such as an edge or IoT device.

66
Q

Know that edge computing is the practice of moving compute and storage resources closer to

A

the location at which they are needed.  
Edge computing devices can be relatively simple IoT devices, such as sensors with a small amount of memory and limited processing power. This type of device could be useful when the data processing load is light. Edge computing is used when low-latency data processing is needed—for example, to control machinery such as autonomous vehicles or manufacturing equipment. To enable edge computing, the system architecture has to be designed to provide compute, storage, and networking capabilities at the edge while services run in the cloud or in an on-premises data center for the centralized management of devices and centrally stored data.

67
Q

Know that an Edge TPU is a hardware device available from Google for implementing ???

A

edge computing.  

This device is an application-specific integrated circuit (ASIC) designed for running AI services at the edge. Edge TPU is designed to work with Cloud TPU and Google Cloud services. In addition to the hardware, Edge TPU includes software and AI algorithms.

68
Q

Know that Cloud IoT is Google’s managed service for IoT services. 

A

 This platform provides services for integrating edge computing with centralized processing services. Device data is captured by the Cloud IoT Core service, which can then publish data to Cloud Pub/Sub for streaming analytics. Data can also be stored in BigQuery for analysis or used for training new machine learning models in Cloud ML. Data provided through Cloud IoT can also be used to trigger Cloud Functions and associated workflows.

69
Q

Know the three types of machine learning algorithms: supervised, unsupervised, and reinforcement learning.  

A

Supervised algorithms learn from labeled examples. Unsupervised learning starts with unlabeled data and identifies salient features, such as groups or clusters, and anomalies in a data stream. Reinforcement learning is a third type of machine learning algorithm that is distinct from supervised and unsupervised learning. It trains a model by interacting with its environment and receiving feedback on the decisions that it makes.

70
Q

Know that supervised learning is used for classification and regression.  

A

Classification models assign discrete values to instances. The simplest form is a binary classifier that assigns one of two values, such as fraudulent/not fraudulent, or has malignant tumor/does not have malignant tumor. Multiclass classification models assign more than two values. Regression models, by contrast, predict a continuous value based on input features.

71
Q

Understand how unsupervised learning differs from supervised learning.  

A

Unsupervised learning algorithms find patterns in data without using predefined labels. Three types of unsupervised learning are clustering, anomaly detection, and collaborative filtering. Clustering, or cluster analysis, is the process of grouping instances together based on common features. Anomaly detection is the process of identifying unexpected patterns in data.
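
As a small illustration of clustering, the scikit-learn sketch below groups unlabeled synthetic points into three clusters with k-means.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, size=(100, 2)),
    rng.normal(loc=5.0, size=(100, 2)),
    rng.normal(loc=10.0, size=(100, 2)),
])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:10])        # cluster assignment for each instance
print(kmeans.cluster_centers_)    # learned group centers
```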

72
Q

Understand how reinforcement learning differs from supervised and unsupervised techniques.

A

  
Reinforcement learning is an approach to learning that uses agents interacting with an environment and adapting behavior based on rewards from the environment. This form of learning does not depend on labels. Reinforcement learning is modeled as an environment, a set of agents, a set of actions, and a set of probabilities of transitioning from one state to another after a particular action is taken. A reward is given after the transition from one state to another following an action.

73
Q

Understand the structure of neural networks, particularly deep learning networks.  

A

Neural networks are systems roughly modeled after neurons in animal brains and consist of sets of connected artificial neurons or nodes. The network is composed of artificial neurons that are linked together into a network. The links between artificial neurons are called connections. A single neuron is limited in what it can learn. A multilayer network, however, is able to learn more functions. A multilayer neural network consists of a set of input nodes, hidden nodes, and an output layer.
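
A minimal sketch of such a multilayer network, written with the Keras API in TensorFlow and using hypothetical layer sizes, is shown below.

```python
import tensorflow as tf

# Input layer of 20 features, two hidden layers, and a single output node
model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(64, activation="relu"),     # hidden layer
    tf.keras.layers.Dense(32, activation="relu"),     # hidden layer
    tf.keras.layers.Dense(1, activation="sigmoid"),   # output layer for binary classification
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```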

74
Q

Know machine learning terminology.

A

  This includes general machine learning terminology, such as baseline and batches; feature terminology, such as feature engineering and bucketing; training terminology, such as gradient descent and backpropagation; and neural network and deep learning terms, such as activation function and dropout. Finally, know model evaluation terminology such as precision and recall.

75
Q

Know common sources of errors, including data-quality errors, unbalanced training sets, and bias. 

A

 Poor-quality data leads to poor models. Some common data-quality problems are missing data, invalid values, inconsistent use of codes and categories, and data that is not representative of the population at large. Unbalanced datasets are ones that have significantly more instances of some categories than of others. There are several forms of bias, including automation bias, reporting bias, and group attribution.

76
Q

Understand the functionality of the Vision AI API?

A

 The Vision AI API is designed to analyze images and identify text, enable the search of images, and filter explicit images. Images are sent to the Vision AI API by specifying a URI path to an image or by sending the image data as Base64-encoded text. There are three options for calling the Vision AI API: Google-supported client libraries, REST, and gRPC.
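
A brief label-detection sketch with the google-cloud-vision Python client library is shown below; the Cloud Storage URI is a placeholder.

```python
from google.cloud import vision

client = vision.ImageAnnotatorClient()

# Reference an image by its Cloud Storage URI (placeholder path)
image = vision.Image()
image.source.image_uri = "gs://my-bucket/photo.jpg"

response = client.label_detection(image=image)
for label in response.label_annotations:
    print(label.description, label.score)
```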

77
Q

Understand the functionality of the Video Intelligence API. 

A

 The Video Intelligence API provides models that can extract metadata; identify key persons, places, and things; and annotate video content. This service has pretrained models that automatically recognize objects in videos. Specifically, this API can be used to identify objects, locations, activities, animal species, products, and so on, and detect shot changes, detect explicit content, track objects, detect text, and transcribe videos.

78
Q

Understand the functionality of Dialogflow.

A

  Dialogflow is used for chatbots, interactive voice response (IVR), and other dialogue-based interactions with human speech. The service is based on natural language–understanding technology that is used to identify entities in a conversation and extract numbers, dates, and time, as well as custom entities that can be trained using examples. Dialogflow also provides prebuilt agents that can be used as templates.

79
Q

Understand the functionality of the Cloud Text-to-Speech API.  

A

GCP’s Cloud Text-to-Speech API maps natural language texts to human-like speech. The API works with more than 30 languages and has more than 180 humanlike voices. The API works with plain-text or Speech Synthesis Markup Language (SSML) and audio files, including MP3 and WAV files. To generate speech, you call a synthesize function of the API.

80
Q

Understand the functionality of the Cloud Speech-to-Text API.  

A

The Cloud Speech-to-Text API is used to convert audio to text. This service is based on deep learning technology and supports 120 languages and variants. The service can be used for transcribing audio as well as for supporting voice-activated interfaces. Cloud Speech-to-Text automatically detects the language being spoken. Generated text can be returned as a stream of text or in batches as a text file.

81
Q

Understand the functionality of the Cloud Translation API. 

A

 Google’s translation technology is available for use through the Cloud Translation API. The basic version of this service, Translation API Basic, enables the translation of texts between more than 100 languages. There is also an advanced API, Translation API Advanced, which supports customization for domain-specific and context-specific terms and phrases.

82
Q

Understand the functionality of the Natural Language API. 

A

 The Natural Language API uses machine learning–derived models to analyze texts. With this API, developers can extract information about people, places, events, addresses, and numbers, as well as other types of entities. The service can be used to find and label fields within semi-structured documents, such as emails. It also supports sentiment analysis. The Natural Language API has a set of more than 700 general categories, such as sports and entertainment, for document classification. For more advanced users, the service performs syntactic analysis that provides parts of speech labels and creates parse trees for each sentence. Users of the API can specify domain-specific keywords and phrases for entity extraction and custom labels for content classification.

83
Q

Understand the functionality of the Recommendations AI API.

A

The Recommendations AI API is a service for suggesting products to customers based on their behavior on the user’s website and the product catalog of that website. The service builds a recommendation model specific to the site. The product catalog contains information on products that are sold to customers, such as names of products, prices, and availability. End-user behavior is captured in logged events, such as information about what customers search for, which products they view, and which products they have purchased. There are two primary functions of the Recommendations AI API: ingesting data and making predictions.