Data Engineering Solutions Flashcards

1
Q

Data pipelines are sequences of operations that?

A

Data pipelines are sequences of operations that copy, transform, load, and analyze data. There are common high-level design patterns that you see repeatedly in batch, streaming, and machine learning pipelines.

2
Q

Understand the model of data pipelines. 

A

A data pipeline is an abstract concept that captures the idea that data flows from one stage of processing to another. Data pipelines are modeled as directed acyclic graphs (DAGs). A graph is a set of nodes linked by edges. A directed graph has edges that flow in one direction, from one node to another, and an acyclic graph has no path that leads from a node back to itself.

3
Q

Know the four stages in a data pipeline. 

A

Ingestion is the process of bringing data into the GCP environment.

Transformation is the process of mapping data from the structure used in the source system to the structure used in the storage and analysis stages of the data pipeline.

Storage is the stage in which ingested and transformed data is persisted. Cloud Storage can be used both as the staging area for storing data immediately after ingestion and as a long-term store for transformed data. BigQuery can treat data in Cloud Storage as external tables and query it, and Cloud Dataproc can use Cloud Storage as HDFS-compatible storage.
Analysis can take on several forms, from simple SQL querying and report generation to machine learning model training and data science analysis.

4
Q

Know that the structure and function of data pipelines will vary according to the use case to which they are applied.  

A

Three common types of pipelines are data warehousing pipelines, stream processing pipelines, and machine learning pipelines.

5
Q

Know the common patterns in data warehousing pipelines. 

A

Extraction, transformation, and load (ETL) pipelines begin with extracting data from one or more data sources. When multiple data sources are used, the extraction processes need to be coordinated. This is because extractions are often time based, so it is important that extracts from different sources cover the same time period.

Extraction, load, and transformation (ELT) processes are slightly different from ETL processes. In an ELT process, data is loaded into a database before transforming the data.

Extraction and load procedures do not transform data. This kind of process is appropriate when data does not require changes from the source format.

In a change data capture approach, each change in a source system is captured and recorded in a data store. This is helpful in cases where it is important to know all changes over time and not just the state of the database at the time of data extraction.

6
Q

Understand the unique processing characteristics of stream processing. 

A

This includes the difference between event time and processing time, sliding and tumbling windows, late-arriving data and watermarks, and missing data.
Event time is the time that something occurred at the place where the data is generated.
Processing time is the time that data arrives at the endpoint where data is ingested.
Sliding windows are used when you want to show how an aggregate, such as the average of the last three values, changes over time, and you want to update that stream of averages each time a new value arrives in the stream.
Tumbling windows are used when you want to aggregate data over a fixed period of time—for example, for the last one minute.
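
The distinction is easy to see in the Apache Beam Python SDK, which Cloud Dataflow executes. A minimal sketch, assuming an existing PCollection named events whose elements already carry event-time timestamps; the window sizes are illustrative:

import apache_beam as beam
from apache_beam import window
from apache_beam.transforms import combiners

# Tumbling (fixed) windows: non-overlapping 60-second buckets,
# producing one aggregate per minute.
tumbling = events | "FixedWin" >> beam.WindowInto(window.FixedWindows(60))

# Sliding windows: 60-second windows that start every 10 seconds, so each
# element falls into several overlapping windows and the aggregate is
# refreshed as new values arrive.
sliding = events | "SlidingWin" >> beam.WindowInto(
    window.SlidingWindows(size=60, period=10))

# A per-window mean over the sliding windows.
averages = (
    sliding
    | beam.Map(lambda value: ("all", value))   # single key -> one mean per window
    | beam.CombinePerKey(combiners.MeanCombineFn()))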

7
Q

Know the components of a typical machine learning pipeline.  
This includes data ingestion, data preprocessing, feature engineering, model training and evaluation, and deployment.

A

Data ingestion uses the same tools and services as data warehousing and streaming data pipelines. Cloud Storage is used for batch storage of datasets, whereas Cloud Pub/Sub can be used for the ingestion of streaming data. Feature engineering is a machine learning practice in which new attributes are introduced into a dataset. The new attributes are derived from one or more existing attributes.

8
Q

Know that Cloud Pub/Sub is a managed message queue service. 

A

 Cloud Pub/Sub is a real-time messaging service that supports both push and pull subscription models.
It is a managed service, and it requires no provisioning of servers or clusters.

Cloud Pub/Sub will automatically scale as needed. Messaging queues are used in distributed systems to decouple services in a pipeline.
This allows one service to produce more output than the consuming service can process without adversely affecting the consuming service. This is especially helpful when one process is subject to spikes.
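
A minimal sketch of publishing and of a pull subscription using the google-cloud-pubsub Python client; the project, topic, and subscription names are placeholders:

from google.cloud import pubsub_v1

project_id = "my-project"               # placeholder
topic_id = "sales-events"               # placeholder
subscription_id = "sales-events-sub"    # placeholder

# Publish: message payloads are byte strings; attributes are optional.
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(project_id, topic_id)
future = publisher.publish(topic_path, b'{"order_id": 123}', source="web")
print(future.result())                  # message ID once the publish succeeds

# Pull subscription: the client invokes the callback as messages arrive.
subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(project_id, subscription_id)

def callback(message):
    print(message.data)
    message.ack()                       # acknowledge so Pub/Sub does not redeliver

streaming_pull = subscriber.subscribe(subscription_path, callback=callback)
# streaming_pull.result(timeout=30)     # block the main thread while messages flow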

9
Q

Know that Cloud Dataflow is a managed stream and batch processing service.  

A

Cloud Dataflow is a core component for running pipelines that collect, transform, and output data. In the past, developers would typically create a stream processing pipeline (hot path) and a separate batch processing pipeline (cold path). Cloud Dataflow is based on Apache Beam, which is a model for combined stream and batch processing. Understand these key Cloud Dataflow concepts:

Pipelines
PCollection
Transforms
ParDo
Pipeline I/O
Aggregation
User-defined functions
Runner
Triggers
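
A minimal Beam sketch (Python SDK) that touches most of these concepts; the bucket paths are placeholders, and to execute on Cloud Dataflow the options would name the DataflowRunner along with a project and region:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

class ParseLine(beam.DoFn):            # a ParDo applies a DoFn to each element
    def process(self, line):
        word, count = line.split(",")
        yield (word, int(count))

options = PipelineOptions()            # runner, project, region, etc. go here

with beam.Pipeline(options=options) as p:                             # the Pipeline
    counts = (
        p
        | "Read" >> beam.io.ReadFromText("gs://my-bucket/input.csv")  # Pipeline I/O
        | "Parse" >> beam.ParDo(ParseLine())                          # Transform / ParDo
        | "Sum" >> beam.CombinePerKey(sum)                            # Aggregation
    )                                   # each step produces a PCollection
    counts | "Write" >> beam.io.WriteToText("gs://my-bucket/output")  # Pipeline I/O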

10
Q

Know that Cloud Dataproc is a managed Hadoop and Spark service.  

A

Cloud Dataproc makes it easy to create and destroy ephemeral clusters. Cloud Dataproc makes it easy to migrate from on-premises Hadoop clusters to GCP. A typical Cloud Dataproc cluster is configured with commonly used components of the Hadoop ecosystem, including Hadoop, Spark, Pig, and Hive. Cloud Dataproc clusters consist of two types of nodes: master nodes and worker nodes. The master node is responsible for distributing and managing workloads across the worker nodes.

11
Q

Know that Cloud Composer is a managed service implementing Apache Airflow.

A

  Cloud Composer is used for scheduling and managing workflows. As pipelines become more complex and have to be resilient when errors occur, it becomes more important to have a framework for managing workflows so that you are not reinventing code for handling errors and other exceptional cases. Cloud Composer automates the scheduling and monitoring of workflows. Before you can run workflows with Cloud Composer, you will need to create an environment in GCP.

12
Q

Understand what to consider when migrating from on-premises Hadoop and Spark to GCP. 

A

 
Factors include migrating data, migrating jobs, and migrating HBase to Bigtable. Hadoop and Spark migrations can happen incrementally, especially since you will be using ephemeral clusters configured for specific jobs. There may be cases where you will have to keep an on-premises cluster while migrating some jobs and data to GCP. In those cases, you will have to keep data synchronized between environments. It is a good practice to migrate HBase databases to Bigtable, which provides consistent, scalable performance.

13
Q

What is ingestion?

A

Ingestion (see Figure 3.3) is the process of bringing data into the GCP environment. This can occur in either batch or streaming mode.

In batch mode, datasets made up of one or more files are copied to GCP. Often these files will be copied to Cloud Storage first. There are several ways to get data into Cloud Storage, including copying with gsutil, the Storage Transfer Service, and the Transfer Appliance.

Streaming ingestion receives data in increments, typically a single record or small batches of records, that continuously flow into an ingestion endpoint, typically a Cloud Pub/Sub topic.
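
For example, a batch file can be staged in Cloud Storage with the google-cloud-storage Python client; the bucket and object names below are placeholders, and gsutil or the transfer services accomplish the same thing at larger scale:

from google.cloud import storage

client = storage.Client()                         # uses default credentials
bucket = client.bucket("my-ingest-bucket")        # placeholder bucket name
blob = bucket.blob("raw/2021-06-01/sales.csv")    # object path in the staging area
blob.upload_from_filename("sales.csv")            # copy the local batch file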

14
Q

Transformation

A

Transformation is the process of mapping data from the structure used in the source system to the structure used in the storage and analysis stages of the data pipeline. There are many kinds of transformations, including the following:

Converting data types, such as converting a text representation of a date to a datetime data type
Substituting missing data with default or imputed values
Aggregating data; for example, averaging all CPU utilization metrics for an instance over the course of one minute
Filtering records that violate business logic rules, such as an audit log transaction with a date in the future
Augmenting data by joining records from distinct sources, such as joining data from an employee table with data from a sales table that includes the employee identifier of the person who made the sale
Dropping columns or attributes from a dataset when they will not be needed
Adding columns or attributes derived from input data; for example, the average of the previous three reported sales prices of a stock might be added to a row of data about the latest price for that stock
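
A small sketch of several of these transformations in Python using pandas; the input file and column names are hypothetical:

import pandas as pd

df = pd.read_csv("sales.csv")                       # hypothetical raw extract

df["sale_date"] = pd.to_datetime(df["sale_date"])   # convert text to a datetime type
df["discount"] = df["discount"].fillna(0.0)         # substitute missing data with a default
df = df[df["sale_date"] <= pd.Timestamp.now()]      # filter records violating business rules
df = df.drop(columns=["internal_notes"])            # drop attributes that are not needed
df["rolling_avg_price"] = (                         # add a derived column
    df["price"].rolling(window=3).mean())

# Aggregate: for example, the average price per one-minute interval.
per_minute = df.groupby(pd.Grouper(key="sale_date", freq="1min"))["price"].mean()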

15
Q

What are the options for storage in a pipeline?

A

Storage
After data is ingested and transformed, it is often stored. Chapter 2, “Building and Operationalizing Storage Systems,” describes GCP storage systems in detail, but key points related to data pipelines will be reviewed here as well.

Cloud Storage can be used as both the staging area for storing data immediately after ingestion and also as a long-term store for transformed data. BigQuery can treat Cloud Storage data as external tables and query them. Cloud Dataproc can use Cloud Storage as HDFS-compatible storage.

BigQuery is an analytical database that uses a columnar storage model that is highly efficient for data warehousing and analytic use cases.

Bigtable is a low-latency, wide-column NoSQL database used for time-series, IoT, and other high-volume write applications. Bigtable also supports the HBase API, making it a good storage option when migrating an on-premises HBase database on Hadoop (see Figure 3.5).

16
Q

Types of Data Pipelines

A

The structure and function of data pipelines will vary according to the use case to which they are applied, but three common types of pipelines are as follows:

Data warehousing pipelines
Stream processing pipelines
Machine learning pipelines

17
Q

Data Warehousing Pipelines

A

Data warehouses are databases for storing data from multiple data sources, typically organized in a dimensional data model. Dimensional data models are denormalized; that is, they do not adhere to the rules of normalization used in transaction processing systems. This is done intentionally because the purpose of a data warehouse is to answer analytic queries efficiently, and highly normalized data models can require complex joins and significant amounts of I/O operations. Denormalized dimensional models keep related data together in a minimal number of tables so that few joins are required.

Collecting and restructuring data from online transaction processing systems is often a multistep process. Some common patterns in data warehousing pipelines are as follows:

Extraction, transformation, and load (ETL)
Extraction, load, and transformation (ELT)
Extraction and load
Change data capture

18
Q

What is the difference between event time and processing time?

A

Event Time and Processing Time
Data in time-series streams is ordered by time. If a set of data A arrives before data B, then presumably the event described by A occurred before the event described by B. There is a subtle but important issue implied in the previous sentence, which is that you are actually dealing with two points in time in stream processing:

Event time is the time that something occurred at the place where the data is generated.
Processing time is the time that data arrives at the endpoint where data is ingested. Processing time could be defined as some other point in the data pipeline, such as the time that transformation starts.

19
Q

What is a watermark?

A

To help stream processing applications, you can use the concept of a watermark, which is basically a timestamp indicating that no data older than that timestamp is expected to appear in the stream; data that does arrive behind the watermark is treated as late-arriving data.

20
Q

What is the difference between hot path and cold path?

A

Hot Path and Cold Path Ingestion
We have been considering a streaming-only ingestion process. This is sometimes called hot path ingestion. It reflects the latest data available and makes it available as soon as possible. You improve the timeliness of reporting data at the potential risk of a loss of accuracy. Cold path ingestion, by contrast, processes data in batch and favors complete, accurate results over low latency.

There are many use cases where this tradeoff is acceptable. For example, an online retailer having a flash sale would want to know sales figures in real time, even if they might be slightly off. Sales professionals running the flash sale need that data to adjust the parameters of the sale, and approximate, but not necessarily accurate, data meets their needs.

21
Q

GCP Pipeline Components
GCP has several services that are commonly used components of pipelines, including?

A

Cloud Pub/Sub
Cloud Dataflow
Cloud Dataproc
Cloud Composer

22
Q

A job is an executing pipeline in Cloud Dataflow. There are two ways to execute jobs: the traditional method and the template method.

A

With the traditional method, developers create a pipeline in a development environment and run the job from that environment. The template method separates development from staging and execution. With the template method, developers still create pipelines in a development environment, but they also create a template, which is a configured job specification. The specification can have parameters that are specified when a user runs the template. Google provides a number of templates, and you can create your own as well. See Figure 3.9 for examples of templates provided by Google.

23
Q

What are the four main GCP compute products?

A

  Compute Engine is GCP’s infrastructure-as-a-service (IaaS) product.

With Compute Engine, you have the greatest amount of control over your infrastructure relative to the other GCP compute services.
Kubernetes is a container orchestration system, and Kubernetes Engine is a managed Kubernetes service. With Kubernetes Engine, Google maintains the cluster and assumes responsibility for installing and configuring the Kubernetes platform on the cluster. Kubernetes Engine deploys Kubernetes on managed instance groups.
App Engine is GCP’s original platform-as-a-service (PaaS) offering. App Engine is designed to allow developers to focus on application development while minimizing their need to support the infrastructure that runs their applications. App Engine has two versions: App Engine Standard and App Engine Flexible.
Cloud Functions is a serverless, managed compute service for running code in response to events that occur in the cloud. Events are supported for Cloud Pub/Sub, Cloud Storage, HTTP events, Firebase, and Stackdriver Logging.

24
Q

Understand the definitions of availability, reliability, and scalability. 

A

 Availability is defined as the ability of a user to access a resource at a specific time. Availability is usually measured as the percentage of time a system is operational.
Reliability is defined as the probability that a system will meet service-level objectives for some duration of time. Reliability is often measured as the mean time between failures.

Scalability is the ability of a system to meet the demands of workloads as they vary over time.

25
Q

Know when to use hybrid clouds and edge computing. 

A

 The analytics hybrid cloud is used when transaction processing systems continue to run on premises and data is extracted and transferred to the cloud for analytic processing. A variation of hybrid clouds is an edge cloud, which uses local computation resources in addition to cloud platforms. This architecture pattern is used when a network may not be reliable or have sufficient bandwidth to transfer data to the cloud. It is also used when low-latency processing is required.

26
Q

Understand messaging. 

A

 Message brokers are services that provide three kinds of functionality: message validation, message transformation, and routing. Message validation is the process of ensuring that messages received are correctly formatted. Message transformation is the process of mapping data to structures that can be used by other services. Message brokers can receive a message and use data in the message to determine where the message should be sent. Routing is used when hub-and-spoke message brokers are used.

27
Q

Know distributed processing architectures. 

A

SOA is a distributed architecture that is driven by business operations and the delivery of business value. Typically, an SOA system serves a discrete business activity. SOAs are self-contained sets of services. Microservices are a variation on SOA architecture. Like other SOA systems, microservice architectures use multiple, independent components and common communication protocols to provide higher-level business services. Serverless functions extend the principles of microservices by removing concerns for containers and managing runtime environments.

28
Q

Know the steps to migrate a data warehouse. 

A

 At a high level, the process of migrating a data warehouse involves four stages:

Assessing the current state of the data warehouse
Designing the future state
Migrating data, jobs, and access controls to the cloud
Validating the cloud data warehouse

29
Q

Making Compute Resources Available, Reliable, and Scalable

A

Highly available and scalable compute resources typically employ clusters of machines or virtual machines with load balancers and autoscalers to distribute workload and adjust the size of the cluster to meet demand.

30
Q

Making Storage Resources Available, Reliable, and Scalable

A

GCP provides a range of storage systems, from in-memory caches to archival storage. Here are some examples.

Memorystore is an in-memory Redis cache. Standard Tier is automatically configured to maintain a replica in a different zone. The replica is used only for high availability, not scalability. The replica is used only when Redis detects a failure and triggers a failover to the replica.

Persistent disks are used with Compute Engine and Kubernetes Engine to provide network-based disk storage to VMs and containers. Persistent disks have built-in redundancy for high availability and reliability. Also, users can create snapshots of disks and store them in Cloud Storage for additional risk mitigation.

Cloud SQL is a managed relational database that can operate in high-availability mode by maintaining a primary instance in one zone and a standby instance in another zone within the same region. Synchronous replication keeps the data up to date in both instances. If you require multi-regional redundancy in your relational database, you should consider Cloud Spanner.

Cloud Storage stores replicas of objects within a region when using standard storage and across regions when using multi-regional storage.

31
Q

Making Network Resources Available, Reliable, and Scalable
Networking resources require advance planning for availability, reliability, and scalability.

A

You have the option of using Standard Tier or Premium Tier networking. Standard Tier uses the public Internet network to transfer data between Google data centers, whereas Premium Tier routes traffic only over Google’s global network. When using the Standard Tier, your data is subject to the reliability of the public Internet.

Network interconnects between on-premises data centers and Google Cloud are not rapidly scaled up or down. At the low end of the bandwidth spectrum, VPNs are used when up to 3 Gbps is sufficient. It is common practice to use two VPNs to connect an enterprise data center to the GCP for redundancy. HA VPN is an option for high-availability VPNs that uses two IP addresses and provides a 99.99 percent service availability, in contrast to the standard VPN, which has a 99.9 percent service level agreement.

For high-throughput use cases, enterprises can use Cloud Interconnect. Cloud Interconnect is available as a dedicated interconnect in which an enterprise directly connects to a Google endpoint and traffic flows directly between the two networks. The other option is to use a partner interconnect, in which case data flows through a third-party network but not over the Internet. Architects may choose Cloud Interconnect for better security, higher speed, and entry into protected networks. In this case, availability, reliability, and scalability are all addressed by redundancy in network infrastructure.

32
Q

Distributed processing presents challenges not found when processing is performed on a single server.

A

For starters, you need mechanisms for sharing data across servers. These include message brokers and message queues, collectively known as middleware. There is more than one way to do distributed processing. Some common architecture patterns are service-oriented architectures, microservices, and serverless functions. Distributed systems also have to contend with the possibility of duplicated processing and data arriving out of order. Depending on requirements, distributed processing can use different event processing models for handling duplicated and out-of-order processing.

33
Q

Message Brokers are?

A

Message brokers are services that provide three kinds of functionality: message validation, message transformation, and routing.

Message validation is the process of ensuring that messages received are correctly formatted. For example, a message may be specified in a Thrift or Protobuf format. Both Thrift and Protobuf, which is short for Protocol Buffers, are designed to make it easy to share data across applications and languages. For example, Java might store structured data types one way, whereas Python would store the same logical structure in a different way. Instead of sharing data using language-specific structures, software developers can map their data to a common format, a process known as serialization. A serialized message can then be placed on a message broker and routed to another service that can read the message without having to have information about the language or data structure used in the source system.

Message transformation is the process of mapping data to structures that can be used by other services. This is especially important when source and consumer services can change independently. For example, an accounting system may change the definition of a sales order. Other systems, like data warehouses, which use sales order data, would need to update ETL processes each time the source system changes unless the message broker between the accounting system and data warehouse implemented necessary transformations. The advantage of applying these transformations in the message broker is that other systems in addition to the data warehouse can use the transformed data without having to implement their own transformation.

Message brokers can receive a message and use data in the message to determine where the message should be sent. Routing is used when hub-and-spoke message brokers are used. With a hub-and-spoke model, messages are sent to a central processing node and from there routed to the correct destination node. See Figure 4.3 for an example of a hub-and-spoke model.

34
Q

Know that Compute Engine supports provisioning single instances or groups of instances, known as ?

A

Know that Compute Engine supports provisioning single instances or groups of instances, known as instance groups.  Instance groups are either managed or unmanaged instance groups. Managed instance groups (MIGs) consist of identically configured VMs; unmanaged instance groups allow for heterogeneous VMs, but they should be used only when migrating legacy clusters from on-premises data centers.

35
Q

Understand the benefits of MIGs.

A

These benefits include the following:

Autohealing based on application-specific health checks, which replace nonfunctioning instances
Support for multizone groups that provide for availability in spite of zone-level failures
Load balancing to distribute workload across all instances in the group
Autoscaling, which adds or removes instances in the group to accommodate increases and decreases in workloads
Automatic, incremental updates to reduce disruptions to workload processing

36
Q

What service provides container orchestration?

A

Kubernetes Engine is GCP's managed Kubernetes service that provides container orchestration. Containers are increasingly used to process workloads because they have less overhead than VMs and allow for finer-grained allocation of resources than VMs. A Kubernetes cluster has two types of instances: cluster masters and nodes.

37
Q

Understand Kubernetes abstractions. 

A

 Pods are the smallest computation unit managed by Kubernetes. Pods contain one or more containers.

A ReplicaSet is a controller that manages the number of pods running for a deployment.

A deployment is a higher-level concept that manages ReplicaSets and provides declarative updates.

A PersistentVolume is Kubernetes’ way of representing storage allocated or provisioned for use by a pod.

Pods acquire access to persistent volumes by creating a PersistentVolumeClaim, which is a logical way to link a pod to persistent storage. StatefulSets are used to designate pods as stateful and assign a unique identifier to them.

Kubernetes uses them to track which clients are using which pods and to keep them paired.
An Ingress is an object that controls external access to services running in a Kubernetes cluster.

38
Q

Know how to provision Bigtable instances. 

A

 Cloud Bigtable is a managed wide-column NoSQL database used for applications that require high-volume, low-latency writes. Bigtable has an HBase interface, so it is also a good alternative to using Hadoop HBase on a Hadoop cluster.

Bigtable instances can be provisioned using the cloud console, the command-line SDK, and the REST API.
When creating an instance, you provide an instance name, an instance ID, an instance type, a storage type, and cluster specifications.
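
A sketch of provisioning with the google-cloud-bigtable admin client; the instance, cluster, and zone names are placeholders, and the same settings map directly to the console and command-line options:

from google.cloud import bigtable
from google.cloud.bigtable import enums

client = bigtable.Client(project="my-project", admin=True)   # admin API access

instance = client.instance(
    "iot-metrics",                                  # instance ID (placeholder)
    display_name="IoT metrics",                     # instance name
    instance_type=enums.Instance.Type.PRODUCTION,   # instance type
)
cluster = instance.cluster(
    "iot-metrics-c1",                               # cluster ID
    location_id="us-central1-b",                    # zone
    serve_nodes=3,                                  # cluster size
    default_storage_type=enums.StorageType.SSD,     # storage type
)
operation = instance.create(clusters=[cluster])     # long-running operation
operation.result(timeout=120)                       # wait for provisioning to finish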

39
Q

Know how to provision Cloud Dataproc.  

A

When provisioning Cloud Dataproc resources, you will specify the configuration of a cluster using the cloud console, the command-line SDK, or the REST API.

When you create a cluster, you will specify a name, a region, a zone, a cluster mode, machine types, and an autoscaling policy.

The cluster mode determines the number of master nodes and possible worker nodes. Master nodes and worker nodes are configured separately.

For each type of node, you can specify a machine type, disk size, and disk type.
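
A sketch using the google.cloud.dataproc_v1 Python client; the project, region, cluster name, and machine types are placeholders, and the same configuration can be expressed through the console or the command-line SDK:

from google.cloud import dataproc_v1

project_id, region = "my-project", "us-central1"      # placeholders

client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"})

cluster = {
    "project_id": project_id,
    "cluster_name": "ephemeral-etl",                   # cluster name
    "config": {
        "master_config": {"num_instances": 1,          # standard mode: one master
                          "machine_type_uri": "n1-standard-4"},
        "worker_config": {"num_instances": 2,          # worker nodes
                          "machine_type_uri": "n1-standard-4"},
    },
}

operation = client.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster})
operation.result()    # wait for the cluster to be provisioned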

40
Q

Understand that serverless services do not require conventional infrastructure provisioning but can be configured.

A

You can configure App Engine using the app.yaml, cron.yaml, dispatch.yaml, or queue.yaml file. Cloud Functions can be configured using parameters to specify memory, region, timeout, and max instances. Cloud Dataflow parameters include job name, project ID, runner, staging location, and the default and maximum number of worker nodes.

41
Q

Understand the purpose of Stackdriver Monitoring, Stackdriver Logging, and Stackdriver Trace. 

A

Stackdriver Monitoring collects metrics on the performance of infrastructure resources and applications.

Stackdriver Logging is a service for storing and searching log data about events in infrastructure and applications.

Stackdriver Trace is a distributed tracing system designed to collect data on how long it takes to process requests to services.

42
Q

Understand the benefits of MIGs. These benefits include the following:

A

Autohealing based on application-specific health checks, which replace nonfunctioning instances
Support for multizone groups that provide for availability in spite of zone-level failures
Load balancing to distribute workload across all instances in the group
Autoscaling, which adds or removes instances in the group to accommodate increases and decreases in workloads
Automatic, incremental updates to reduce disruptions to workload processing

43
Q

Know that Kubernetes Engine is a managed Kubernetes service that provides container orchestration. 

A

 Containers are increasingly used to process workloads because they have less overhead than VMs and allow for finer-grained allocation of resources than VMs. A Kubernetes cluster has two types of instances: cluster masters and nodes.

44
Q

Understand the purpose of service accounts. 

A

 Service accounts are a type of identity that are used with VM instances and applications, which are able to make API calls authorized by roles assigned to the service account. A service account is identified by a unique email address. These accounts are authenticated by two sets of public/private keys. One set is managed by Google, and the other set is managed by users. Public keys are provided to API calls to authenticate the service account.

45
Q

Understand the structure and function of policies. 

A

 A policy consists of binding, metadata, and an audit configuration. Bindings specify how access is granted to a resource. Bindings are made up of members, roles, and conditions. The metadata of a policy includes an attribute called etag and versions. Audit configurations describe which permission types are logged and which identities are exempt from logging. Policies can be defined at different levels of the resource hierarchy, including organizations, folders, projects, and individual resources. Only one policy at a time can be assigned to an organization, folder, project, or individual resource.

47
Q

Know the basic requirements of major regulations. 

A

 The Health Insurance Portability and Accountability Act (HIPAA) is a federal law in the United States that protects individuals’ healthcare information. The Children’s Online Privacy Protection Act (COPPA) is primarily focused on children under the age of 13, and it applies to websites and online services that collect information about children. The Federal Risk and Authorization Management Program (FedRAMP) is a U.S. federal government program that promotes a standard approach to assessment, authorization, and monitoring of cloud resources. The European Union’s (EU) General Data Protection Regulation (GDPR) is designed to standardize privacy protections across the EU, grant controls to individuals over their private information, and specify security practices required for organizations holding private information of EU citizens.

48
Q

Understand Cloud Bigtable is a nonrelational database based on a sparse three-dimensional map?

A

The three dimensions are rows, columns, and cells. When you create a Cloud Bigtable instance, you specify a number of nodes. These nodes manage metadata about the data stored in the Cloud Bigtable database, whereas the actual data is stored outside of the nodes on the Colossus filesystem. Within the Colossus filesystem, data is organized into sorted string tables, or SSTables, which are called tablets.

49
Q

Understand how to design row-keys in Cloud Bigtable.  

A

In general, it is best to avoid monotonically increasing values or lexicographically close strings at the beginning of keys. When using a multitenant Cloud Bigtable database, it is a good practice to use a tenant prefix in the row-key. String identifiers, such as a customer ID or a sensor ID, are good candidates for a row-key. Timestamps may be used as part of a row-key, but they should not be the entire row-key or the start of the row-key. Moving timestamps from the front of a row-key so that another attribute is the first part of the row-key is an example of field promotion. In general, it is a good practice to promote, or move toward the front of the key, values that are highly varied. Another way to avoid hotspots is to use salting.
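
A small illustration of these practices; the tenant and sensor identifiers are hypothetical, and the reversed timestamp keeps keys from increasing monotonically while still supporting range scans over recent data:

import sys

def row_key(tenant_id: str, sensor_id: str, event_ts_ms: int) -> bytes:
    # Field promotion: high-cardinality identifiers come first, and the
    # timestamp comes last (never at the front of the key).
    reverse_ts = sys.maxsize - event_ts_ms       # newest rows sort first
    return f"{tenant_id}#{sensor_id}#{reverse_ts}".encode()

print(row_key("tenant42", "sensor-0017", 1622540000000))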

50
Q

Know how to use tall and narrow tables for time-series databases.

A

  Keep names short; this reduces the size of metadata since names are stored along with data values. Store few events within each row, ideally only one event per row; this makes querying easier. Also, storing multiple events increases the chance of exceeding maximum recommended row sizes. Design row-keys for looking up a single value or a range of values. Range scans are common in time-series analysis. Keep in mind that there is only one index on Cloud Bigtable tables.

51
Q

Know when to use interleaved tables in Cloud Spanner

A

Use interleaved tables with a parent-child relationship in which parent data is stored with child data.

This makes retrieving data from both tables simultaneously more efficient than if the data were stored separately and is especially helpful when performing joins.

Since the data from both tables is co-located, the database has to perform fewer seeks to get all the needed data.
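
A sketch of interleaved-table DDL applied through the google-cloud-spanner client; the project, instance, database, and schema are hypothetical:

from google.cloud import spanner

client = spanner.Client(project="my-project")            # placeholder project
instance = client.instance("orders-instance")            # placeholder instance
database = instance.database("orders-db")                # placeholder database

operation = database.update_ddl([
    """CREATE TABLE Customers (
         CustomerId INT64 NOT NULL,
         Name       STRING(1024)
       ) PRIMARY KEY (CustomerId)""",
    # Child rows are physically stored with their parent Customers row.
    """CREATE TABLE Orders (
         CustomerId INT64 NOT NULL,
         OrderId    INT64 NOT NULL,
         Total      FLOAT64
       ) PRIMARY KEY (CustomerId, OrderId),
       INTERLEAVE IN PARENT Customers ON DELETE CASCADE""",
])
operation.result(timeout=120)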

52
Q

Know how to avoid hotspots by designing primary keys properly. 

A

Monotonically increasing keys can cause read and write operations to happen on a few servers simultaneously instead of being evenly distributed across all servers. Options for keys include using the hash of a natural key; swapping the order of columns in keys to promote higher-cardinality attributes; using a universally unique identifier (UUID), specifically version 4 or later; and using bit-reverse sequential values.
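
Two of these options sketched in Python; the natural key value is hypothetical:

import hashlib
import uuid

natural_key = "customer-000123"                     # monotonic or low-cardinality value

hashed_key = hashlib.sha256(natural_key.encode()).hexdigest()  # hash of a natural key
uuid_key = str(uuid.uuid4())                        # version 4 UUID, effectively random

print(hashed_key, uuid_key)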

53
Q

Know the differences between primary and secondary indexes. 

A

Primary indexes are created automatically on the primary key. Secondary indexes are explicitly created using the CREATE INDEX command. Secondary indexes are useful when filtering in a query using a WHERE clause. If the column referenced in the WHERE clause is indexed, the index can be used for filtering rather than scanning the full table and then filtering. Secondary indexes are also useful when you need to return rows in a sort order other than the primary key order. When a secondary index is created, the index will store all primary key columns from the base table, all columns included in the index, and any additional columns specified in a STORING clause.

54
Q

Understand the organizational structure of BigQuery databases. 

A

Projects are the high-level structure used to organize the use of GCP services and resources. Datasets exist within a project and are containers for tables and views. Access to tables and views is defined at the dataset level. Tables are collections of rows and columns stored in a columnar format, known as Capacitor format, which is designed to support compression and execution optimizations.

55
Q

Understand how to denormalize data in BigQuery using nested and repeated fields. 

A

 
Denormalizing in BigQuery can be done with nested and repeated columns. A column that contains nested and repeated data is defined as a RECORD datatype and is accessed as a STRUCT in SQL. BigQuery supports up to 15 levels of nested STRUCTs.
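
A sketch of a denormalized table definition with the google-cloud-bigquery client; the project, dataset, and field names are hypothetical:

from google.cloud import bigquery

client = bigquery.Client()
schema = [
    bigquery.SchemaField("order_id", "STRING"),
    # RECORD + REPEATED: each order row carries an array of line-item STRUCTs.
    bigquery.SchemaField(
        "line_items", "RECORD", mode="REPEATED",
        fields=[
            bigquery.SchemaField("sku", "STRING"),
            bigquery.SchemaField("quantity", "INTEGER"),
            bigquery.SchemaField("price", "NUMERIC"),
        ],
    ),
]
table = bigquery.Table("my-project.sales.orders", schema=schema)  # placeholder IDs
client.create_table(table)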

56
Q

Know when and why to use partitioning and clustering in BigQuery.

A

  Partitioning is the process of dividing tables into segments called partitions. BigQuery has three partition types: ingestion time partitioned tables, timestamp partitioned tables, and integer range partitioned tables. In BigQuery, clustering is the ordering of data in its stored format. Clustering is supported only on partitioned tables and is used when filters or aggregations are frequently used.
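
A sketch of creating a timestamp-partitioned, clustered table with the Python client; the project, dataset, and column names are placeholders:

from google.cloud import bigquery

client = bigquery.Client()
schema = [
    bigquery.SchemaField("order_ts", "TIMESTAMP"),
    bigquery.SchemaField("customer_id", "STRING"),
    bigquery.SchemaField("amount", "NUMERIC"),
]
table = bigquery.Table("my-project.sales.orders_partitioned", schema=schema)
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY, field="order_ts")  # timestamp partitioning
table.clustering_fields = ["customer_id"]   # cluster the data within each partition
client.create_table(table)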

57
Q

Understand the different kinds of queries in BigQuery.

A

  BigQuery supports two types of queries: interactive and batch queries. Interactive queries are executed immediately, whereas batch queries are queued and run when resources are available. The advantage of using these batch queries is that resources are drawn from a shared resource pool and batch queries do not count toward the concurrent rate limit, which is 100 concurrent queries. Queries are run as jobs, similar to jobs run to load and export data.

58
Q

Know that BigQuery can access external data without you having to import it into BigQuery first.

A

  BigQuery can access data in external sources, known as federated sources. Instead of first loading data into BigQuery, you can create a reference to an external source. External sources can be Cloud Bigtable, Cloud Storage, and Google Drive. When accessing external data, you can create either permanent or temporary external tables. Permanent tables are those created in a dataset and linked to an external source. Temporary tables are useful for one-time operations, such as loading data into a data warehouse.

59
Q

Know that BigQuery ML supports machine learning in BigQuery using SQL. 

A

 BigQuery extends standard SQL with the addition of machine learning functionality. This allows BigQuery users to build machine learning models in BigQuery rather than programming models in Python, R, Java, or other programming languages outside of BigQuery.
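
A minimal BigQuery ML example submitted through the Python client; the dataset, tables, feature columns, and label column are hypothetical:

from google.cloud import bigquery

client = bigquery.Client()
sql = """
CREATE OR REPLACE MODEL `my_dataset.churn_model`
OPTIONS (model_type = 'logistic_reg', input_label_cols = ['churned']) AS
SELECT tenure_months, monthly_spend, support_tickets, churned
FROM `my_dataset.customers`
"""
client.query(sql).result()   # training runs as a BigQuery job

predictions = client.query(
    "SELECT * FROM ML.PREDICT(MODEL `my_dataset.churn_model`, "
    "(SELECT tenure_months, monthly_spend, support_tickets "
    "FROM `my_dataset.new_customers`))").result()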

60
Q

Know that Data Catalog is a metadata service for ?

A

Know that Data Catalog is a metadata service for data management.  Data Catalog is fully managed, so there are no servers to provision or configure. Its primary function is to provide a single, consolidated view of enterprise data. Metadata is collected automatically during ingest operations to BigQuery and Cloud Pub/Sub, as well through APIs and third-party tools.

61
Q

Data Catalog will collect metadata automatically from several GCP sources.

A

Understand that Data Catalog will collect metadata automatically from several GCP sources.  These sources include Cloud Storage, Cloud Bigtable, Google Sheets, BigQuery, and Cloud Pub/Sub. In addition to native metadata, Data Catalog can collect custom metadata through the use of tags.

62
Q

Know that Cloud Dataprep is an interactive tool for

A

preparing data for analysis and machine learning

Cloud Dataprep is used to cleanse, enrich, import, export, discover, structure, and validate data.
The main cleansing operations in Cloud Dataprep center around altering column names, reformatting strings, and working with numeric values. Cloud Dataprep supports this process by providing for filtering data, locating outliers, deriving aggregates, calculating values across columns, and comparing strings.

63
Q

Data Studio as a ? and ? tool

A

Be familiar with Data Studio as a reporting and visualization tool.  The Data Studio tool is organized around reports, and it reads data from data sources and formats the data into tables and charts. Data Studio uses the concept of a connector for working with datasets. Datasets can come in a variety of forms, including a relational database table, a Google Sheet, or a BigQuery table. Connectors provide access to all or to a subset of columns in a data source. Data Studio provides components that can be deployed in a drag-and-drop manner to create reports. Reports are collections of tables and visualizations.

64
Q

Understand that Cloud Datalab is an interactive tool for exploring and transforming data?

A


Cloud Datalab runs as an instance of a container. Users of Cloud Datalab create a Compute Engine instance, run the container, and then connect from a browser to a Cloud Datalab notebook, which is a Jupyter Notebook. Many of the commonly used packages are available in Cloud Datalab, but when users need to add others, they can do so by using either the conda install command or the pip install command.

65
Q

Know that Cloud Composer is a fully managed ? 

A

Know that Cloud Composer is a fully managed workflow orchestration service based on Apache Airflow.  Workflows are defined as directed acyclic graphs, which are specified in Python. Elements of workflows can run on premises and in other clouds as well as in GCP. Airflow DAGs are defined in Python as a set of operators and operator relationships. An operator specifies a single task in a workflow. Common operators include BashOperator and PythonOperator.
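
A minimal Airflow DAG of the kind Cloud Composer runs; the task logic is hypothetical, and operator import paths vary slightly between Airflow 1.x and 2.x:

from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator      # Airflow 1.x path
from airflow.operators.python_operator import PythonOperator  # Airflow 1.x path

def load_to_bigquery():
    print("load step goes here")        # hypothetical task body

with DAG(
    dag_id="daily_etl",
    schedule_interval="@daily",
    start_date=datetime(2021, 1, 1),
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extracting")
    load = PythonOperator(task_id="load", python_callable=load_to_bigquery)

    extract >> load    # operator relationships define the DAG edges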

66
Q

stages of a machine learning pipeline are?

A

stages of a machine learning pipeline are as follows:

Data ingestion
Data preparation
Data segregation
Model training
Model evaluation
Model deployment
Model monitoring
Although the stages are listed in a linear manner, ML pipelines are more cyclic than linear, as shown in Figure 9.1. This differs from data pipelines, like those used to ingest, transform, and store data, which are predominantly linear.

67
Q

Batch Data Ingestion

A

Batch data ingestion should use a dedicated process for ingesting each distinct data source. For example, one process may ingest sales transactions from an e-commerce site, whereas another process ingests data about customers from another source. Batch ingestion is often done on a relatively fixed schedule, much like many data warehouse extraction, load, and transformation (ELT) processes. It is important to be able to track which batch data comes from, so you must include a batch identifier with each record that is ingested. This is considered a best practice, and it allows you to compare results across datasets more easily.

68
Q

Data preparation is the process of transforming data from its raw form into a structure and format that is amenable to analysis by machine learning algorithms.

A

Data preparation is the process of transforming data from its raw form into a structure and format that is amenable to analysis by machine learning algorithms. There are three steps to data preparation:

Data exploration
Data transformation
Feature engineering

69
Q

Understand batch and streaming ingestion.  

A

Batch data ingestion should use a dedicated process for ingesting each distinct data source. Batch ingestion often occurs on a relatively fixed schedule, much like many data warehouse ETL processes. It is important to be able to track which batch data comes from, so include a batch identifier with each record that is ingested. Cloud Pub/Sub is designed for scalable messaging, including ingesting streaming data. Cloud Pub/Sub is a good option for ingesting streaming data that will be stored in a database, such as Bigtable or Cloud Firebase, or immediately consumed by machine learning processes running in Cloud Dataflow, Cloud Dataproc, Kubernetes Engine, or Compute Engine. When using BigQuery, you have the option of using streaming inserts.

70
Q

Know the three kinds of data preparation.

A

  The three kinds of data preparation are data exploration, data transformation, and feature engineering. Data exploration is the first step in working with a new data source or a data source that has had significant changes. The goal of this stage is to understand the distribution of data and the overall quality of data. Data transformation is the process of mapping data from its raw form into data structures and formats that allow for machine learning. Transformations can include replacing missing values with a default value, changing the format of numeric values, and deduplicating records. Feature engineering is the process of adding or modifying the representation of features to make implicit patterns more explicit. For example, if a ratio of two numeric features is important to classifying an instance, then calculating that ratio and including it as a feature may improve the model quality. Feature engineering includes the understanding of key attributes (features) that are meaningful for machine learning objectives at hand. This includes dimensional reduction.

71
Q

Know that data segregation is the process of ?

A

Know that data segregation is the process of splitting a dataset into three segments: training, validation, and test data.  Training data is used to build machine learning models. Validation data is used during hyperparameter tuning. Test data is used to evaluate model performance. The main criteria for deciding how to split data are to ensure that the test and validation datasets are large enough to produce statistically meaningful results, that test and validation datasets are representative of the data as a whole, and that the training dataset is large enough for the model to learn to make accurate predictions with reasonable precision and recall.
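
A common way to produce the three segments, sketched with scikit-learn; the 70/15/15 split is illustrative, and a feature matrix X and label vector y are assumed to exist already:

from sklearn.model_selection import train_test_split

# First carve off the training set, then split the remainder into
# validation and test sets of equal size (70% / 15% / 15% overall).
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.30, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.50, random_state=42)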

72
Q

What is the process of training a model?

A

Understand the process of training a model.  Know that feature selection is the process of evaluating how a particular attribute or feature contributes to the predictiveness of a model. The goal is to have features of a dataset that allow a model to learn to make accurate predictions. Know that underfitting creates a model that is not able to predict values of training data correctly or new data that was not used during training.

73
Q

Understand underfitting, overfitting, and regularization. 

A

 The problem of underfitting may be corrected by increasing the amount of training data, using a different machine learning algorithm, or modifying hyperparameters.
Understand that overfitting occurs when a model fits the training data too well.

One way to compensate for the impact of noise in the data and reduce the risk of overfitting is to introduce a penalty for data points that make the model more complicated. This process is called regularization. Two kinds of regularization are L1 regularization, also known as Lasso regularization (for Least Absolute Shrinkage and Selection Operator), and L2 regularization, also known as Ridge regression.
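
In scikit-learn, for example, the two penalties appear as separate estimators; alpha controls the penalty strength, and X_train and y_train are assumed to exist already:

from sklearn.linear_model import Lasso, Ridge

l1_model = Lasso(alpha=0.1)   # L1: can shrink some coefficients to exactly zero
l2_model = Ridge(alpha=0.1)   # L2: shrinks coefficients but keeps them nonzero

l1_model.fit(X_train, y_train)
l2_model.fit(X_train, y_train)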

74
Q

Know ways to evaluate a model. 

A

Methods for evaluating a model include individual evaluation metrics, such as accuracy, precision, recall, and the F measure; k-fold cross-validation; confusion matrices; and bias and variance. K-fold cross-validation is a technique for evaluating model performance by splitting a dataset into k segments, where k is an integer. Confusion matrices are used with classification models to show the relative performance of a model. In the case of a binary classifier, a confusion matrix would be 2×2, with one column and one row for each value.

75
Q

Understand bias and variance. 

A

 Bias is the difference between the average prediction of a model and the correct prediction of a model. Models with high bias tend to have oversimplified models; this is underfitting the model. Variance is the variability in model predictions. Models with high variance tend to overfit training data so that the model works well when making predictions on the training data but does not generalize to data that the model has not seen before.

76
Q

Know options for deploying machine learning workloads on GCP.

A

Know options for deploying machine learning workloads on GCP.  These options include Cloud AutoML, BigQuery ML, Kubeflow, and Spark MLlib. Cloud AutoML is a machine learning service designed for developers who want to incorporate machine learning in their applications without having to learn many of the details of ML. BigQuery ML enables users of the analytical database to build machine learning models using SQL and data in BigQuery datasets. Kubeflow is an open source project for developing, orchestrating, and deploying scalable and portable machine learning workloads. Kubeflow is designed for the Kubernetes platform. Cloud Dataproc is a managed Spark and Hadoop service. Included with Spark is a machine learning library called MLlib, and it is a good option for machine learning workloads if you are already using Spark or need one of the more specialized algorithms included in Spark MLlib.

77
Q

Understand that single machines are useful for training small models.  

A

This includes when you are developing machine learning applications or exploring data using Jupyter Notebooks or related tools. Cloud Datalab, for example, runs instances in Compute Engine virtual machines.

78
Q

Know that you also have the option of offloading some of the training load from CPUs to GPUs.

A

  GPUs have high-bandwidth memory and typically outperform CPUs on floating-point operations. GCP uses NVIDIA GPUs, and NVIDIA is the creator of CUDA, a parallel computing platform that facilitates the use of GPUs.

79
Q

Know that distributing model training over a group of servers provides for ?

A

Know that distributing model training over a group of servers provides for scalability and improved availability.  There are a variety of ways to use distributed infrastructure, and the best choice for you will depend on your specific requirements and development practices. One way to distribute training is to use machine learning frameworks that are designed to run in a distributed environment, such as TensorFlow.

80
Q

Understand that serving a machine learning model is the process of making the model available to?

A

Understand that serving a machine learning model is the process of making the model available to make predictions for other services.  When serving models, you need to consider latency, scalability, and version management. Serving models from a centralized location, such as a data center, can introduce latency because input data and results are sent over the network. If an application needs real-time results, it is better to serve the model closer to where it is needed, such as an edge or IoT device.

81
Q

Know that edge computing is the practice of moving compute and storage resources closer to the location at which they are needed.

A

  Edge computing devices can be relatively simple IoT devices, such as sensors with a small amount of memory and limited processing power. This type of device could be useful when the data processing load is light. Edge computing is used when low-latency data processing is needed—for example, to control machinery such as autonomous vehicles or manufacturing equipment. To enable edge computing, the system architecture has to be designed to provide compute, storage, and networking capabilities at the edge while services run in the cloud or in an on-premises data center for the centralized management of devices and centrally stored data.

82
Q

Be able to list the three basic components of edge computing.

A

  Edge computing consists of edge devices, gateway devices, and the cloud platform. Edge devices provide three kinds of data: metadata about the device, state information about the device, and telemetry data. Before a device is incorporated into an IoT processing system, it must be provisioned. After a device is provisioned and it starts collecting data, the data is then processed on the device. After local processing, data is transmitted to a gateway. Gateways can manage network traffic across protocols. Data sent to the cloud is ingested by one of a few different kinds of services in GCP, including Cloud Pub/Sub, IoT Core MQTT, and Stackdriver Monitoring and Logging.

83
Q

Know that an Edge TPU is a hardware device available from Google for implementing?

A

Know that an Edge TPU is a hardware device available from Google for implementing edge computing.
This device is an application-specific integrated circuit (ASIC) designed for running AI services at the edge.
Edge TPU is designed to work with Cloud TPU and Google Cloud services. In addition to the hardware, Edge TPU includes software and AI algorithms.

84
Q

Know that Cloud IoT is Google’s managed service for IoT services.

A

This platform provides services for integrating edge computing with centralized processing services.
Device data is captured by the Cloud IoT Core service, which can then publish data to Cloud Pub/Sub for streaming analytics.
Data can also be stored in BigQuery for analysis or used for training new machine learning models in Cloud ML.
Data provided through Cloud IoT can also be used to trigger Cloud Functions and associated workflows.

85
Q

Understand GPUs and TPUs

A

Graphics processing units are accelerators that have multiple arithmetic logic units (ALUs) that implement adders and multipliers. This architecture is well suited to workloads that benefit from massive parallelization, such as training deep learning models. GPUs and CPUs are both subject to the von Neumann bottleneck, which is the limited data rate between a processor and memory, and slow processing. TPUs are specialized accelerators based on ASICs and created by Google to improve training of deep neural networks. These accelerators are designed for the TensorFlow framework. TPUs reduce the impact of the von Neumann bottleneck by implementing matrix multiplication in the processor. Know the criteria for choosing between CPUs, GPUs, and TPUs.

86
Q

The three types of machine learning algorithms?

A

supervised, unsupervised, and reinforcement learning.  Supervised algorithms learn from labeled examples. Unsupervised learning starts with unlabeled data and identifies salient features, such as groups or clusters, and anomalies in a data stream. Reinforcement learning is a third type of machine learning algorithm that is distinct from supervised and unsupervised learning. It trains a model by interacting with its environment and receiving feedback on the decisions that it makes.

87
Q

Know that supervised learning is used for classification and regression.

A

Know that supervised learning is used for classification and regression.  Classification models assign discrete values to instances. The simplest form is a binary classifier that assigns one of two values, such as fraudulent/not fraudulent, or has malignant tumor/does not have malignant tumor. Multiclass classification models assign more than two values. Regression models map continuous variables to other continuous variables.

88
Q

Understand how unsupervised learning differs from supervised learning. 

A

 Unsupervised learning algorithms find patterns in data without using predefined labels. Three types of unsupervised learning are clustering, anomaly detection, and collaborative filtering. Clustering, or cluster analysis, is the process of grouping instances together based on common features. Anomaly detection is the process of identifying unexpected patterns in data.

89
Q

Understand how reinforcement learning differs from supervised and unsupervised techniques. 

A

 
Reinforcement learning is an approach to learning that uses agents interacting with an environment and adapting behavior based on rewards from the environment. This form of learning does not depend on labels. Reinforcement learning is modeled as an environment, a set of agents, a set of actions, and a set of probabilities of transitioning from one state to another after a particular action is taken. A reward is given after the transition from one state to another following an action.

90
Q

Understand the structure of neural networks, particularly deep learning networks. 

A

Neural networks are systems roughly modeled after neurons in animal brains and consist of sets of connected artificial neurons, or nodes.
The links between artificial neurons are called connections. A single neuron is limited in what it can learn.
A multilayer network, however, is able to learn more functions.

A multilayer neural network consists of an input layer, one or more hidden layers, and an output layer.
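
A hedged sketch of such a network in Keras (a framework choice of mine, not mandated by the source), with an input layer, two hidden layers, and an output layer:

    import tensorflow as tf

    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(10,)),              # 10 input features
        tf.keras.layers.Dense(16, activation="relu"),    # hidden layer
        tf.keras.layers.Dense(8, activation="relu"),     # hidden layer
        tf.keras.layers.Dense(1, activation="sigmoid"),  # output layer (binary classifier)
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    model.summary()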

91
Q

Know machine learning terminology.

A

  This includes general machine learning terminology, such as baseline and batches; feature terminology, such as feature engineering and bucketing; training terminology, such as gradient descent and backpropagation; and neural network and deep learning terms, such as activation function and dropout. Finally, know model evaluation terminology such as precision and recall.

92
Q

Know common sources of errors, including data-quality errors, unbalanced training sets, and bias. 

A

Poor-quality data leads to poor models. Some common data-quality problems are missing data, invalid values, inconsistent use of codes and categories, and data that is not representative of the population at large. Unbalanced datasets are ones that have significantly more instances of some categories than of others. There are several forms of bias, including automation bias, reporting bias, and group attribution bias.

93
Q

What are common sources of error in machine learning models?

A

Machine learning engineers face a number of challenges when building effective models. Problems like overfitting, underfitting, and vanishing gradient can be addressed by adjusting the way that a model is trained. In other cases, the data used to train can be a source of error. Three common problems are as follows:

Data quality
Unbalanced training sets
Bias in training data

94
Q

Understand the functionality of the Vision AI API. 

A

 The Vision AI API is designed to analyze images and identify text, enable the search of images, and filter explicit images. Images are sent to the Vision AI API by specifying a URI path to an image or by sending the image data as Base64-encoded text. There are three options for calling the Vision AI API: Google-supported client libraries, REST, and gRPC.
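
A minimal sketch of calling the Vision AI API through the Python client library (the bucket path is hypothetical); this request runs label detection, and other methods cover text detection and explicit-content (safe search) filtering.

    from google.cloud import vision

    client = vision.ImageAnnotatorClient()

    # Reference an image by URI; Base64-encoded image bytes also work.
    image = vision.Image()
    image.source.image_uri = "gs://my-bucket/sample.jpg"   # hypothetical path

    response = client.label_detection(image=image)
    for label in response.label_annotations:
        print(label.description, label.score)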

95
Q

Understand the functionality of the Video Intelligence API.

A

The Video Intelligence API provides models that can extract metadata; identify key persons, places, and things; and annotate video content. This service has pretrained models that automatically recognize objects in videos. Specifically, this API can be used to identify objects, locations, activities, animal species, products, and so on; detect shot changes; detect explicit content; track objects; detect text; and transcribe videos.
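
As a rough sketch using the Python client library (the video URI is hypothetical), a label-detection request is submitted as a long-running operation and the annotations are read from the result:

    from google.cloud import videointelligence

    client = videointelligence.VideoIntelligenceServiceClient()

    operation = client.annotate_video(
        request={
            "features": [videointelligence.Feature.LABEL_DETECTION],
            "input_uri": "gs://my-bucket/sample-video.mp4",   # hypothetical path
        }
    )
    result = operation.result(timeout=300)   # wait for the long-running operation

    for annotation in result.annotation_results[0].segment_label_annotations:
        print(annotation.entity.description)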

96
Q

What is Dialogflow?

A

Dialogflow is used for chatbots, interactive voice response (IVR), and other dialogue-based interactions with human speech. The service is based on natural language–understanding technology that is used to identify entities in a conversation and extract numbers, dates, and times, as well as custom entities that can be trained using examples. Dialogflow also provides prebuilt agents that can be used as templates.
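
A minimal sketch of sending one user utterance to a Dialogflow agent with the Python client library (the project and session identifiers are hypothetical):

    from google.cloud import dialogflow

    sessions = dialogflow.SessionsClient()
    session = sessions.session_path("my-project", "session-123")   # hypothetical IDs

    text_input = dialogflow.TextInput(text="What time do you open?", language_code="en-US")
    query_input = dialogflow.QueryInput(text=text_input)

    response = sessions.detect_intent(
        request={"session": session, "query_input": query_input}
    )
    print(response.query_result.intent.display_name)   # matched intent
    print(response.query_result.fulfillment_text)      # agent's reply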

97
Q

Understand the functionality of the Cloud Text-to-Speech API.  

A

GCP’s Cloud Text-to-Speech API maps natural language text to humanlike speech. The API works with more than 30 languages and has more than 180 humanlike voices. It accepts plain text or Speech Synthesis Markup Language (SSML) input and produces audio output in formats such as MP3 and WAV. To generate speech, you call a synthesize function of the API.
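
A small sketch of the synthesize call with the Python client library, converting a plain-text string into an MP3 file:

    from google.cloud import texttospeech

    client = texttospeech.TextToSpeechClient()

    synthesis_input = texttospeech.SynthesisInput(text="Hello from Cloud Text-to-Speech")
    voice = texttospeech.VoiceSelectionParams(language_code="en-US")
    audio_config = texttospeech.AudioConfig(audio_encoding=texttospeech.AudioEncoding.MP3)

    response = client.synthesize_speech(
        input=synthesis_input, voice=voice, audio_config=audio_config
    )
    with open("output.mp3", "wb") as out:
        out.write(response.audio_content)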

98
Q

Understand the functionality of the Cloud Speech-to-Text API.

A

The Cloud Speech-to-Text API is used to convert audio to text. This service is based on deep learning technology and supports 120 languages and variants. The service can be used for transcribing audio as well as for supporting voice-activated interfaces. Cloud Speech-to-Text automatically detects the language being spoken. Generated text can be returned as a stream of text or in batches as a text file.
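
A minimal sketch of a synchronous transcription request with the Python client library (the audio URI is hypothetical):

    from google.cloud import speech

    client = speech.SpeechClient()

    audio = speech.RecognitionAudio(uri="gs://my-bucket/recording.flac")  # hypothetical path
    config = speech.RecognitionConfig(language_code="en-US")

    response = client.recognize(config=config, audio=audio)
    for result in response.results:
        print(result.alternatives[0].transcript)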

99
Q

Understand the functionality of the Cloud Translation API.

A

Google’s translation technology is available for use through the Cloud Translation API. The basic version of this service, Translation API Basic, enables the translation of texts between more than 100 languages. There is also an advanced API, Translation API Advanced, which supports customization for domain-specific and context-specific terms and phrases.
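
A small sketch of Translation API Basic through the Python client library; the source language is detected automatically:

    from google.cloud import translate_v2 as translate

    client = translate.Client()

    result = client.translate(
        "Data pipelines move data between systems.", target_language="es"
    )
    print(result["translatedText"])
    print(result["detectedSourceLanguage"])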

100
Q

Understand the functionality of the Natural Language API.

A

  The Natural Language API uses machine learning–derived models to analyze texts. With this API, developers can extract information about people, places, events, addresses, and numbers, as well as other types of entities. The service can be used to find and label fields within semi-structured documents, such as emails. It also supports sentiment analysis. The Natural Language API has a set of more than 700 general categories, such as sports and entertainment, for document classification. For more advanced users, the service performs syntactic analysis that provides parts of speech labels and creates parse trees for each sentence. Users of the API can specify domain-specific keywords and phrases for entity extraction and custom labels for content classification.
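
A minimal sketch of entity extraction and sentiment analysis with the Python client library:

    from google.cloud import language_v1

    client = language_v1.LanguageServiceClient()
    document = language_v1.Document(
        content="Google Cloud makes data engineering easier.",
        type_=language_v1.Document.Type.PLAIN_TEXT,
    )

    entities = client.analyze_entities(request={"document": document})
    sentiment = client.analyze_sentiment(request={"document": document})

    for entity in entities.entities:
        print(entity.name, entity.salience)
    print("sentiment score:", sentiment.document_sentiment.score)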

101
Q

Understand the functionality of the Recommendations AI API.

A

The Recommendations AI API is a service for suggesting products to customers based on their behavior on a site and that site's product catalog. The service builds a recommendation model specific to the site. The product catalog contains information on products that are sold to customers, such as names of products, prices, and availability. End-user behavior is captured in logged events, such as information about what customers search for, which products they view, and which products they have purchased. There are two primary functions of the Recommendations AI API: ingesting data and making predictions.

102
Q

Understand the functionality of the Cloud Inference API.

A

The Cloud Inference API provides real-time analysis of time-series data. It supports processing time-series datasets, including ingesting data from JSON formats, removing data, and listing active datasets. It also supports inference queries over datasets, including correlation queries, variation in frequency over time, and the probability of events given evidence of those events in the dataset.

103
Q

So, what is the difference between a data lake and a data warehouse?

A

Both share the same basic idea: storing data in centralized storage. Is it simply that a data lake stores unstructured data and a data warehouse doesn’t?

What if I say some data warehouse products can now store and process unstructured data? Does the data warehouse become a data lake? The answer is no.

One of the key differences from a technical perspective is that data lake technologies separate most of the building blocks, in particular storage and computation, but also the other blocks, such as schema, streaming, the SQL interface, and machine learning. This evolves the concept of a monolithic platform into a modern, modular platform consisting of separated components.

104
Q

Understanding the need for a data warehouse

A

The data warehouse is not a new concept; you’ve probably at least heard of it. In fact, the term is no longer particularly appealing. In my experience, no one gets excited when talking about data warehouses in the 2020s, especially when compared to terms such as big data, cloud computing, and artificial intelligence.

So, why do we need to know about data warehouses? Because almost every data engineering challenge, from the old times to the present day, is conceptually the same. The challenges are always about moving data from the data source to other environments so that the business can use it to get information. What changes over time is only the how, as newer technologies emerge. If we understand why people needed data warehouses in the past, we will have a better foundation for understanding the data engineering space and, more specifically, the data life cycle.

105
Q

How does BigQuery process and store data?

A

BigQuery stores data in a distributed filesystem called Google Colossus, in a columnar storage format. Colossus is the successor to Google File System, which was the inspiration for the Hadoop Distributed File System (HDFS). As users, we can’t access Google Colossus directly. We access the data using tables (metadata) and the SQL interface to process the data.

BigQuery processes data in a distributed SQL execution engine inspired by Dremel SQL. Dremel SQL is a Google-internal SQL analytics tool whose main purpose is to interactively query large datasets. But BigQuery is a product in its own right: many improvements and adjustments have been made to BigQuery compared to Dremel, largely to serve the broader requirements of GCP customers around the world. As a simple example, the SQL dialect differs between Dremel (legacy SQL) and BigQuery (standard SQL).
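
As a quick illustration of the SQL interface over Colossus-backed tables, here is a standard SQL query against a BigQuery public dataset using the Python client library (assuming default credentials and project):

    from google.cloud import bigquery

    client = bigquery.Client()

    query = """
        SELECT name, SUM(number) AS total
        FROM `bigquery-public-data.usa_names.usa_1910_2013`
        GROUP BY name
        ORDER BY total DESC
        LIMIT 5
    """
    # The query runs in BigQuery's distributed execution engine;
    # only the results are returned to the client.
    for row in client.query(query).result():
        print(row.name, row.total)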