Data Engineering Solutions Flashcards
Data pipelines are sequences of operations that?
Data pipelines are sequences of operations that copy, transform, load, and analyze data. There are common high-level design patterns that you see repeatedly in batch, streaming, and machine learning pipelines.
Understand the model of data pipelines.
A data pipeline is an abstract concept that captures the idea that data flows from one stage of processing to another. Data pipelines are modeled as directed acyclic graphs (DAGs). A graph is a set of nodes linked by edges. A directed graph has edges that flow from one node to another, and an acyclic graph contains no cycles, so data never flows back into an earlier stage.
Know the four stages in a data pipeline.
Ingestion is the process of bringing data into the GCP environment.
Transformation is the process of mapping data from the structure used in the source system to the structure used in the storage and analysis stages of the data pipeline.
Cloud Storage can be used as both the staging area for storing data immediately after ingestion and also as a long-term store for transformed data.
BigQuery can treat data in Cloud Storage as external tables and query it.
Cloud Dataproc can use Cloud Storage as HDFS-compatible storage.
Analysis can take on several forms, from simple SQL querying and report generation to machine learning model training and data science analysis.
Know that the structure and function of data pipelines will vary according to the use case to which they are applied.
Three common types of pipelines are data warehousing pipelines, stream processing pipelines, and machine learning pipelines.
Know the common patterns in data warehousing pipelines.
Extraction, transformation, and load (ETL) pipelines begin with extracting data from one or more data sources.
When multiple data sources are used, the extraction processes need to be coordinated.
This is because extractions are often time based, so it is important that extracts from different sources cover the same time period. Extraction, load, and transformation (ELT) processes are slightly different from ETL processes.
In an ELT process, data is loaded into a database before transforming the data. Extraction and load procedures do not transform data. This kind of process is appropriate when data does not require changes from the source format. In a change data capture approach, each change in a source system is captured and recorded in a data store. This is helpful in cases where it is important to know all changes over time and not just the state of the database at the time of data extraction.
Understand the unique processing characteristics of stream processing.
This includes the difference between event time and processing time, sliding and tumbling windows, late-arriving data and watermarks, and missing data.
Event time is the time that something occurred at the place where the data is generated.
Processing time is the time that data arrives at the endpoint where data is ingested.
Sliding windows are used when you want to show how an aggregate, such as the average of the last three values, changes over time, and you want to update that stream of averages each time a new value arrives in the stream.
Tumbling windows are used when you want to aggregate data over a fixed period of time—for example, for the last one minute.
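To make the distinction concrete, here is a minimal Apache Beam sketch (Beam is the SDK behind Cloud Dataflow, covered later) that applies both window types to the same small, made-up set of timestamped CPU readings. The values, timestamps, and window sizes are purely illustrative.

    import apache_beam as beam
    from apache_beam import window
    from apache_beam.transforms.combiners import MeanCombineFn

    # Hypothetical (event_time_seconds, cpu_utilization) readings.
    readings = [(0, 10.0), (30, 20.0), (65, 30.0), (95, 40.0)]

    with beam.Pipeline() as p:
        values = (
            p
            | beam.Create(readings)
            # Attach event-time timestamps so that windowing uses event time.
            | beam.Map(lambda r: window.TimestampedValue(r[1], r[0]))
        )

        # Tumbling (fixed) windows: one average per non-overlapping 60-second interval.
        (values
         | "Tumble60s" >> beam.WindowInto(window.FixedWindows(60))
         | "MeanPerTumble" >> beam.CombineGlobally(MeanCombineFn()).without_defaults()
         | "PrintTumble" >> beam.Map(print))

        # Sliding windows: 60-second averages recomputed every 30 seconds, so windows overlap.
        (values
         | "Slide60s" >> beam.WindowInto(window.SlidingWindows(size=60, period=30))
         | "MeanPerSlide" >> beam.CombineGlobally(MeanCombineFn()).without_defaults()
         | "PrintSlide" >> beam.Map(print))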
Know the components of a typical machine learning pipeline.
This includes data ingestion, data preprocessing, feature engineering, model training and evaluation, and deployment.
Data ingestion uses the same tools and services as data warehousing and streaming data pipelines. Cloud Storage is used for batch storage of datasets, whereas Cloud Pub/Sub can be used for the ingestion of streaming data. Feature engineering is a machine learning practice in which new attributes are introduced into a dataset. The new attributes are derived from one or more existing attributes.
Know that Cloud Pub/Sub is a managed message queue service.
Cloud Pub/Sub is a real-time messaging service that supports both push and pull subscription models.
It is a managed service, and it requires no provisioning of servers or clusters.
Cloud Pub/Sub will automatically scale as needed. Messaging queues are used in distributed systems to decouple services in a pipeline.
This allows one service to produce more output than the consuming service can process without adversely affecting the consuming service. This is especially helpful when one process is subject to spikes.
Know that Cloud Dataflow is a managed stream and batch processing service.
Cloud Dataflow is a core component for running pipelines that collect, transform, and output data. In the past, developers would typically create a stream processing pipeline (hot path) and a separate batch processing pipeline (cold path). Cloud Dataflow is based on Apache Beam, which is a model for combined stream and batch processing. Understand these key Cloud Dataflow concepts:
Pipelines
PCollection
Transforms
ParDo
Pipeline I/O
Aggregation
User-defined functions
Runner
Triggers
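A minimal sketch tying several of these concepts together; it assumes a hypothetical sales.csv file of product,amount lines. The DirectRunner executes the pipeline locally, and switching the runner to DataflowRunner (with the appropriate GCP options) would run the same code on Cloud Dataflow.

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions


    class ParseCsvLine(beam.DoFn):
        """A ParDo transform: parse one CSV line into a (product, amount) pair."""
        def process(self, line):
            product, amount = line.split(",")
            yield product, float(amount)


    def run():
        # The runner is what executes the pipeline: DirectRunner locally,
        # DataflowRunner on the Cloud Dataflow service.
        options = PipelineOptions(runner="DirectRunner")

        with beam.Pipeline(options=options) as p:                   # Pipeline
            (
                p
                | "Read" >> beam.io.ReadFromText("sales.csv")       # Pipeline I/O; produces a PCollection
                | "Parse" >> beam.ParDo(ParseCsvLine())             # ParDo applied to every element
                | "SumPerProduct" >> beam.CombinePerKey(sum)        # Aggregation
                | "Format" >> beam.Map(lambda kv: f"{kv[0]},{kv[1]}")
                | "Write" >> beam.io.WriteToText("sales_totals")    # Pipeline I/O
            )


    if __name__ == "__main__":
        run()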
Know that Cloud Dataproc is a managed Hadoop and Spark service.
Cloud Dataproc makes it easy to create and destroy ephemeral clusters. Cloud Dataproc makes it easy to migrate from on-premises Hadoop clusters to GCP. A typical Cloud Dataproc cluster is configured with commonly used components of the Hadoop ecosystem, including Hadoop, Spark, Pig, and Hive. Cloud Dataproc clusters consist of two types of nodes: master nodes and worker nodes. The master node is responsible for distributing and managing workloads across the worker nodes.
Know that Cloud Composer is a managed service implementing Apache Airflow.
Cloud Composer is used for scheduling and managing workflows. As pipelines become more complex and have to be resilient when errors occur, it becomes more important to have a framework for managing workflows so that you are not reinventing code for handling errors and other exceptional cases. Cloud Composer automates the scheduling and monitoring of workflows. Before you can run workflows with Cloud Composer, you will need to create an environment in GCP.
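As a rough sketch of what a Cloud Composer workflow looks like, the following Airflow DAG defines a hypothetical two-step daily job with built-in retries, which is the kind of error handling you would otherwise have to write yourself. Names and commands are made up, and the BashOperator import path shown is the Airflow 1.x style used by many Composer environments.

    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator  # Airflow 1.x import path

    default_args = {
        "start_date": datetime(2021, 1, 1),
        "retries": 2,                        # retry failed tasks automatically
        "retry_delay": timedelta(minutes=5),
    }

    with DAG("daily_sales_load",             # hypothetical workflow name
             schedule_interval="@daily",
             default_args=default_args,
             catchup=False) as dag:

        extract = BashOperator(task_id="extract", bash_command="echo extracting")
        load = BashOperator(task_id="load", bash_command="echo loading")

        extract >> load                      # load runs only after extract succeeds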
Understand what to consider when migrating from on-premises Hadoop and Spark to GCP.
Factors include migrating data, migrating jobs, and migrating HBase to Bigtable. Hadoop and Spark migrations can happen incrementally, especially since you will be using ephemeral clusters configured for specific jobs. There may be cases where you will have to keep an on-premises cluster while migrating some jobs and data to GCP. In those cases, you will have to keep data synchronized between environments. It is a good practice to migrate HBase databases to Bigtable, which provides consistent, scalable performance.
What is ingestion?
Ingestion (see Figure 3.3) is the process of bringing data into the GCP environment. This can occur in either batch or streaming mode.
In batch mode, data sets made up of one or more files are copied to GCP. Often these files will be copied to Cloud Storage first. There are several ways to get data into Cloud Storage, including copying with gsutil, Transfer Service, and Transfer Appliance.
Streaming ingestion receives data in increments, typically a single record or small batches of records, that continuously flow into an ingestion endpoint, typically a Cloud Pub/Sub topic.
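For example, a client might publish individual records to a Cloud Pub/Sub topic for streaming ingestion. This sketch uses the google-cloud-pubsub client library with hypothetical project, topic, and record values.

    import json

    from google.cloud import pubsub_v1

    publisher = pubsub_v1.PublisherClient()
    # Hypothetical project and topic names.
    topic_path = publisher.topic_path("my-project", "sensor-readings")

    record = {"sensor_id": "s-42", "temp_c": 21.7}
    future = publisher.publish(topic_path, json.dumps(record).encode("utf-8"))
    print(future.result())  # message ID returned once Pub/Sub acknowledges the publish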
Transformation
Transformation is the process of mapping data from the structure used in the source system to the structure used in the storage and analysis stages of the data pipeline. There are many kinds of transformations, including the following:
Converting data types, such as converting a text representation of a date to a datetime data type
Substituting missing data with default or imputed values
Aggregating data; for example, averaging all CPU utilization metrics for an instance over the course of one minute
Filtering records that violate business logic rules, such as an audit log transaction with a date in the future
Augmenting data by joining records from distinct sources, such as joining data from an employee table with data from a sales table that includes the employee identifier of the person who made the sale
Dropping columns or attributes from a dataset when they will not be needed
Adding columns or attributes derived from input data; for example, the average of the previous three reported sales prices of a stock might be added to a row of data about the latest price for that stock
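A minimal pandas sketch of a few of these transformations, using a made-up three-row extract; the column names and rules are illustrative only.

    import pandas as pd

    # Hypothetical raw extract: a text date, a missing price, and a future-dated row.
    raw = pd.DataFrame({
        "sale_date": ["2021-03-01", "2021-03-02", "2031-01-01"],
        "price": [101.5, None, 99.0],
    })

    transformed = (
        raw
        .assign(sale_date=lambda df: pd.to_datetime(df["sale_date"]))      # convert data types
        .assign(price=lambda df: df["price"].fillna(df["price"].mean()))   # impute missing values
        .loc[lambda df: df["sale_date"] <= pd.Timestamp.today()]           # filter rows violating business rules
        .assign(price_3_avg=lambda df: df["price"].rolling(3, min_periods=1).mean())  # add a derived column
    )
    print(transformed)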
What are options for storage in a pipeline?
Storage
After data is ingested and transformed, it is often stored. Chapter 2, “Building and Operationalizing Storage Systems,” describes GCP storage systems in detail, but key points related to data pipelines will be reviewed here as well.
Cloud Storage can be used as both the staging area for storing data immediately after ingestion and also as a long-term store for transformed data. BigQuery can treat Cloud Storage data as external tables and query them. Cloud Dataproc can use Cloud Storage as HDFS-compatible storage.
BigQuery is an analytical database that uses a columnar storage model that is highly efficient for data warehousing and analytic use cases.
Bigtable is a low-latency, wide-column NoSQL database used for time-series, IoT, and other high-volume write applications. Bigtable also supports the HBase API, making it a good storage option when migrating an on-premises HBase database on Hadoop (see Figure 3.5).
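For instance, BigQuery can query files sitting in Cloud Storage as a temporary external table. This sketch uses the google-cloud-bigquery client library; the bucket, file pattern, and query are hypothetical, and it assumes credentials and a default project are already configured.

    from google.cloud import bigquery

    client = bigquery.Client()

    # Treat newline-delimited JSON files in Cloud Storage as an external table.
    external_config = bigquery.ExternalConfig("NEWLINE_DELIMITED_JSON")
    external_config.source_uris = ["gs://my-bucket/events/*.json"]
    external_config.autodetect = True

    job_config = bigquery.QueryJobConfig(table_definitions={"events_ext": external_config})
    query = "SELECT COUNT(*) AS n FROM events_ext"
    for row in client.query(query, job_config=job_config):
        print(row.n)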
Types of Data Pipelines
The structure and function of data pipelines will vary according to the use case to which they are applied, but three common types of pipelines are as follows:
Data warehousing pipelines
Stream processing pipelines
Machine learning pipelines
Data Warehousing Pipelines
Data warehouses are databases for storing data from multiple data sources, typically organized in a dimensional data model. Dimensional data models are denormalized; that is, they do not adhere to the rules of normalization used in transaction processing systems. This is done intentionally because the purpose of a data warehouse is to answer analytic queries efficiently, and highly normalized data models can require complex joins and significant amounts of I/O operations. Denormalized dimensional models keep related data together in a minimal number of tables so that few joins are required.
Collecting and restructuring data from online transaction processing systems is often a multistep process. Some common patterns in data warehousing pipelines are as follows:
Extraction, transformation, and load (ETL)
Extraction, load, and transformation (ELT)
Extraction and load
Change data capture
What is the difference between event time and processing time?
Event Time and Processing Time
Data in time-series streams is ordered by time. If a set of data A arrives before data B, then presumably the event described by A occurred before the event described by B. There is a subtle but important issue implied in the previous sentence, which is that you are actually dealing with two points in time in stream processing:
Event time is the time that something occurred at the place where the data is generated.
Processing time is the time that data arrives at the endpoint where data is ingested. Processing time could be defined as some other point in the data pipeline, such as the time that transformation starts.
What is a watermark?
To help stream processing applications, you can use the concept of a watermark, which is basically a timestamp indicating that no data older than that timestamp will ever appear in the stream.
What is the difference between hotpath and cold path?
Hot Path and Cold Path Ingestion
We have been considering a streaming-only ingestion process. This is sometimes called a hot path ingestion. It reflects the latest data available and makes it available as soon as possible. You improve the timeliness of reporting data at the potential risk of a loss of accuracy.
There are many use cases where this tradeoff is acceptable. For example, an online retailer having a flash sale would want to know sales figures in real time, even if they might be slightly off. Sales professionals running the flash sale need that data to adjust the parameters of the sale, and approximate, but not necessarily accurate, data meets their needs.
GCP Pipeline Components
GCP has several services that are commonly used components of pipelines, including?
Cloud Pub/Sub
Cloud Dataflow
Cloud Dataproc
Cloud Composer
A job is an executing pipeline in Cloud Dataflow. There are two ways to execute jobs: the traditional method and the template method.
With the traditional method, developers create a pipeline in a development environment and run the job from that environment. The template method separates development from staging and execution. With the template method, developers still create pipelines in a development environment, but they also create a template, which is a configured job specification. The specification can have parameters that are specified when a user runs the template. Google provides a number of templates, and you can create your own as well. See Figure 3.9 for examples of templates provided by Google.
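As an illustration of the template method, the following sketch launches the Google-provided Word_Count template through the Dataflow REST API using the google-api-python-client library. The project, job name, and output bucket are hypothetical.

    from googleapiclient.discovery import build

    dataflow = build("dataflow", "v1b3")
    request = dataflow.projects().templates().launch(
        projectId="my-project",
        gcsPath="gs://dataflow-templates/latest/Word_Count",  # Google-provided template
        body={
            "jobName": "wordcount-from-template",
            "parameters": {
                "inputFile": "gs://dataflow-samples/shakespeare/kinglear.txt",
                "output": "gs://my-bucket/wordcount/output",
            },
        },
    )
    response = request.execute()
    print(response["job"]["id"])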
What are the four main GCP compute products?
Compute Engine is GCP’s infrastructure-as-a-service (IaaS) product.
With Compute Engine, you have the greatest amount of control over your infrastructure relative to the other GCP compute services.
Kubernetes is a container orchestration system, and Kubernetes Engine is a managed Kubernetes service. With Kubernetes Engine, Google maintains the cluster and assumes responsibility for installing and configuring the Kubernetes platform on the cluster. Kubernetes Engine deploys Kubernetes on managed instance groups.
App Engine is GCP’s original platform-as-a-service (PaaS) offering. App Engine is designed to allow developers to focus on application development while minimizing their need to support the infrastructure that runs their applications. App Engine has two versions: App Engine Standard and App Engine Flexible.
Cloud Functions is a serverless, managed compute service for running code in response to events that occur in the cloud. Events are supported for Cloud Pub/Sub, Cloud Storage, HTTP events, Firebase, and Stackdriver Logging.
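For example, a background Cloud Function written in Python receives a Pub/Sub-triggered event through a two-argument handler. The function name and payload here are hypothetical.

    import base64
    import json


    def handle_event(event, context):
        """Background Cloud Function triggered by a Cloud Pub/Sub message."""
        # event carries the Pub/Sub message; context carries event metadata.
        payload = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
        print(f"Received {payload} at {context.timestamp}")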
Understand the definitions of availability, reliability, and scalability.
Availability is defined as the ability of a user to access a resource at a specific time. Availability is usually measured as the percentage of time a system is operational.
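For example, a system with 99.99 percent availability may be unavailable for roughly 53 minutes per year (0.0001 × 525,600 minutes ≈ 53 minutes).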
Reliability is defined as the probability that a system will meet service-level objectives for some duration of time. Reliability is often measured as the mean time between failures.
Scalability is the ability of a system to meet the demands of workloads as they vary over time.
Know when to use hybrid clouds and edge computing.
The analytics hybrid cloud is used when transaction processing systems continue to run on premises and data is extracted and transferred to the cloud for analytic processing. A variation of hybrid clouds is an edge cloud, which uses local computation resources in addition to cloud platforms. This architecture pattern is used when a network may not be reliable or have sufficient bandwidth to transfer data to the cloud. It is also used when low-latency processing is required.
Understand messaging.
Message brokers are services that provide three kinds of functionality: message validation, message transformation, and routing. Message validation is the process of ensuring that messages received are correctly formatted. Message transformation is the process of mapping data to structures that can be used by other services. Message brokers can receive a message and use data in the message to determine where the message should be sent. Routing is used when hub-and-spoke message brokers are used.
Know distributed processing architectures.
Service-oriented architecture (SOA) is a distributed architecture driven by business operations and the delivery of business value. Typically, an SOA system serves a discrete business activity. SOAs are self-contained sets of services. Microservices are a variation on SOA architecture. Like other SOA systems, microservice architectures use multiple, independent components and common communication protocols to provide higher-level business services. Serverless functions extend the principles of microservices by removing concerns for containers and managing runtime environments.
Know the steps to migrate a data warehouse.
At a high level, the process of migrating a data warehouse involves four stages:
Assessing the current state of the data warehouse
Designing the future state
Migrating data, jobs, and access controls to the cloud
Validating the cloud data warehouse
Making Compute Resources Available, Reliable, and Scalable
Highly available and scalable compute resources typically employ clusters of machines or virtual machines with load balancers and autoscalers to distribute workload and adjust the size of the cluster to meet demand.
Making Storage Resources Available, Reliable, and Scalable
GCP provides a range of storage systems, from in-memory caches to archival storage. Here are some examples.
Memorystore is an in-memory Redis cache. Standard Tier is automatically configured to maintain a replica in a different zone. The replica is used only for high availability, not scalability. The replica is used only when Redis detects a failure and triggers a failover to the replica.
Persistent disks are used with Compute Engine and Kubernetes Engine to provide network-based disk storage to VMs and containers. Persistent disks have built-in redundancy for high availability and reliability. Also, users can create snapshots of disks and store them in Cloud Storage for additional risk mitigation.
Cloud SQL is a managed relational database that can operate in high-availability mode by maintaining a primary instance in one zone and a standby instance in another zone within the same region. Synchronous replication keeps the data up to date in both instances. If you require multi-regional redundancy in your relational database, you should consider Cloud Spanner.
Cloud Storage stores replicas of objects within a region when using standard storage and across regions when using multi-regional storage.
Making Network Resources Available, Reliable, and Scalable
Network resources require advance planning for availability, reliability, and scalability.
You have the option of using Standard Tier or Premium Tier networking. Standard Tier uses the public Internet network to transfer data between Google data centers, whereas Premium Tier routes traffic only over Google’s global network. When using the Standard Tier, your data is subject to the reliability of the public Internet.
Network interconnects between on-premises data centers and Google Cloud are not rapidly scaled up or down. At the low end of the bandwidth spectrum, VPNs are used when up to 3 Gbps is sufficient. It is common practice to use two VPNs to connect an enterprise data center to the GCP for redundancy. HA VPN is an option for high-availability VPNs that uses two IP addresses and provides a 99.99 percent service availability, in contrast to the standard VPN, which has a 99.9 percent service level agreement.
For high-throughput use cases, enterprises can use Cloud Interconnect. Cloud Interconnect is available as a dedicated interconnect in which an enterprise directly connects to a Google endpoint and traffic flows directly between the two networks. The other option is to use a partner interconnect, in which case data flows through a third-party network but not over the Internet. Architects may choose Cloud Interconnect for better security, higher speed, and entry into protected networks. In this case, availability, reliability, and scalability are all addressed by redundancy in network infrastructure.
Distributed processing presents challenges not found when processing is performed on a single server.
For starters, you need mechanisms for sharing data across servers. These include message brokers and message queues, collectively known as middleware. There is more than one way to do distributed processing. Some common architecture patterns are service-oriented architectures, microservices, and serverless functions. Distributed systems also have to contend with the possibility of duplicated processing and data arriving out of order. Depending on requirements, distributed processing can use different event processing models for handling duplicated and out-of-order processing.
Message Brokers are?
Message brokers are services that provide three kinds of functionality: message validation, message transformation, and routing.
Message validation is the process of ensuring that messages received are correctly formatted. For example, a message may be specified in a Thrift or Protobuf format. Both Thrift and Protobuf, which is short for Protocol Buffers, are designed to make it easy to share data across applications and languages. For example, Java might store structured data types one way, whereas Python would store the same logical structure in a different way. Instead of sharing data using language-specific structures, software developers can map their data to a common format, a process known as serialization. A serialized message can then be placed on a message broker and routed to another service that can read the message without having to have information about the language or data structure used in the source system.
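A small sketch of serialization with Protocol Buffers, assuming a hypothetical sales_order.proto has already been compiled with protoc into a sales_order_pb2 module containing a SalesOrder message with order_id and amount fields.

    # Hypothetical generated module from sales_order.proto.
    from sales_order_pb2 import SalesOrder

    order = SalesOrder(order_id="o-1001", amount=250.0)

    # Serialize to a compact, language-neutral byte string that can be placed
    # on a message broker and parsed by a consumer written in any language.
    payload = order.SerializeToString()

    received = SalesOrder()
    received.ParseFromString(payload)
    print(received.order_id, received.amount)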
Message transformation is the process of mapping data to structures that can be used by other services. This is especially important when source and consumer services can change independently. For example, an accounting system may change the definition of a sales order. Other systems, like data warehouses, which use sales order data, would need to update ETL processes each time the source system changes unless the message broker between the accounting system and data warehouse implemented necessary transformations. The advantage of applying these transformations in the message broker is that other systems in addition to the data warehouse can use the transformed data without having to implement their own transformation.
Message brokers can receive a message and use data in the message to determine where the message should be sent. Routing is used when hub-and-spoke message brokers are used. With a hub-and-spoke model, messages are sent to a central processing node and from there routed to the correct destination node. See Figure 4.3 for an example of a hub-and-spoke model.
Know that Compute Engine supports provisioning single instances or groups of instances, known as?
Know that Compute Engine supports provisioning single instances or groups of instances, known as instance groups. Instance groups are either managed or unmanaged instance groups. Managed instance groups (MIGs) consist of identically configured VMs; unmanaged instance groups allow for heterogeneous VMs, but they should be used only when migrating legacy clusters from on-premises data centers.
Understand the benefits of MIGs.
These benefits include the following:
Autohealing based on application-specific health checks, which replace nonfunctioning instances
Support for multizone groups that provide for availability in spite of zone-level failures
Load balancing to distribute workload across all instances in the group
Autoscaling, which adds or removes instances in the group to accommodate increases and decreases in workloads
Automatic, incremental updates to reduce disruptions to workload processing
What service provides container orchestration?
Containers are increasingly used to process workloads because they have less overhead than VMs and allow for finer-grained allocation of resources than VMs. A Kubernetes cluster has two types of instances: cluster masters and nodes.
Understand Kubernetes abstractions.
Pods are the smallest computation unit managed by Kubernetes. Pods contain one or more containers.
A ReplicaSet is a controller that manages the number of pods running for a deployment.
A deployment is a higher-level concept that manages ReplicaSets and provides declarative updates.
PersistentVolumes is Kubernetes’ way of representing storage allocated or provisioned for use by a pod.
Pods acquire access to persistent volumes by creating a PersistentVolumeClaim, which is a logical way to link a pod to persistent storage. StatefulSets are used to designate pods as stateful and assign a unique identifier to them.
Kubernetes uses them to track which clients are using which pods and to keep them paired.
An Ingress is an object that controls external access to services running in a Kubernetes cluster.
Know how to provision Bigtable instances.
Cloud Bigtable is a managed wide-column NoSQL database used for applications that require high-volume, low-latency writes. Bigtable has an HBase interface, so it is also a good alternative to using Hadoop HBase on a Hadoop cluster.
Bigtable instances can be provisioned using the cloud console, the command-line SDK, and the REST API.
When creating an instance, you provide an instance name, an instance ID, an instance type, a storage type, and cluster specifications.
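A sketch of provisioning an instance with the google-cloud-bigtable client library; the project, instance, cluster, and zone names are hypothetical.

    from google.cloud import bigtable
    from google.cloud.bigtable import enums

    client = bigtable.Client(project="my-project", admin=True)  # admin=True for instance administration

    instance = client.instance(
        "iot-metrics",                                   # instance ID
        display_name="IoT metrics",                      # instance name
        instance_type=enums.Instance.Type.PRODUCTION,    # instance type
    )
    cluster = instance.cluster(
        "iot-metrics-c1",
        location_id="us-central1-b",
        serve_nodes=3,
        default_storage_type=enums.StorageType.SSD,      # storage type
    )
    operation = instance.create(clusters=[cluster])
    operation.result(timeout=300)                        # block until provisioning completes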
Know how to provision Cloud Dataproc.
When provisioning Cloud Dataproc resources, you will specify the configuration of a cluster using the cloud console, the command-line SDK, or the REST API.
When you create a cluster, you will specify a name, a region, a zone, a cluster mode, machine types, and an autoscaling policy.
The cluster mode determines the number of master nodes and possible worker nodes. Master nodes and worker nodes are configured separately.
For each type of node, you can specify a machine type, disk size, and disk type.
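A sketch of creating a standard-mode cluster (one master node, two worker nodes) with the google-cloud-dataproc client library; the project, region, cluster name, and machine types are hypothetical.

    from google.cloud import dataproc_v1

    region = "us-central1"
    cluster_client = dataproc_v1.ClusterControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )

    cluster = {
        "project_id": "my-project",
        "cluster_name": "ephemeral-etl",
        "config": {
            "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
            "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
        },
    }

    operation = cluster_client.create_cluster(
        request={"project_id": "my-project", "region": region, "cluster": cluster}
    )
    print(operation.result().cluster_name)  # block until the cluster is ready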
Understand that serverless services do not require conventional infrastructure provisioning but can be configured.
You can configure App Engine using the app.yaml, cron.yaml, dispatch.yaml, or queue.yaml file. Cloud Functions can be configured using parameters to specify memory, region, timeout, and max instances. Cloud Dataflow parameters include job name, project ID, runner, staging location, and the default and maximum number of worker nodes.
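For example, Cloud Dataflow's parameters can be supplied through PipelineOptions in the Beam SDK; the values below are hypothetical.

    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(
        job_name="nightly-etl",
        project="my-project",
        region="us-central1",
        runner="DataflowRunner",
        staging_location="gs://my-bucket/staging",
        temp_location="gs://my-bucket/temp",
        max_num_workers=10,                 # cap on autoscaling
    )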
Understand the purpose of Stackdriver Monitoring, Stackdriver Logging, and Stackdriver Trace.
Stackdriver Monitoring collects metrics on the performance of infrastructure resources and applications.
Stackdriver Logging is a service for storing and searching log data about events in infrastructure and applications.
Stackdriver Trace is a distributed tracing system designed to collect data on how long it takes to process requests to services.