Data Engineering Foundations Flashcards
The data lifecycle consists of four stages:
Ingest
Store
Process and analyze
Explore and visualize
Ingestion is the first stage in the data lifecycle, and it entails acquiring data and bringing it into Google Cloud Platform (GCP).
The storage stage is about persisting data to a storage system from which it can be accessed for later stages of the data lifecycle.
The process and analyze stage begins with transforming data into a usable format for analysis applications.
Explore and visualize is the final stage, in which insights are derived from analysis and presented in tables, charts, and other visualizations for use by others.
The three broad ingestion modes with which data engineers typically work are as follows:
Application data
Streaming data
Batch data
What are examples of application data, and where does it come from?
Examples of application data include the following:
Transactions from an online retail application
Clickstream data from users reading articles on a news site
Log data from a server running computer-aided design software
User registration data from an online service
Application data can be ingested by services running in Compute Engine, Kubernetes Engine, or App Engine, for example. Application data can also be written to Stackdriver Logging or one of the managed databases, such as Cloud SQL or Cloud Datastore.
Where does application data come from?
Application data is generated by applications, including mobile apps, and pushed to backend services.
This data includes user-generated data, like a name and shipping address collected as part of a sales transaction.
It also includes data generated by the application, such as log data.
Event data, like clickstream data, is also a type of application-generated data.
The volume of this kind of data depends on the number of users of the application, the types of data the application generates, and the duration of time the application is in use. The size of application data sent in a single operation can also vary widely: a clickstream event may carry less than 1 KB of data, whereas an image upload could be multiple megabytes.
What is streaming data and what are examples?
Streaming data is a set of data that is typically sent in small messages that are transmitted continuously from the data source. Streaming data may be sensor data, which is data generated at regular intervals, and event data, which is data generated in response to a particular event. Examples of streaming data include the following:
Virtual machine monitoring data, such as CPU utilization rates and memory consumption data
An IoT device that sends temperature, humidity, and pressure data every minute
A customer adding an item to an online shopping cart, which then generates an event with data about the customer and the item
Time-series data may require some additional processing early in the ingestion process. If a stream of data needs to be in time order for processing, then late-arriving data will need to be inserted in the correct position in the stream. This can require buffering data for a short period in case it arrives out of order; of course, there is a maximum amount of time to wait before processing the data.
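As a rough illustration of that buffering step, the sketch below holds events for a fixed wait window and then releases them in event-time order. The window length, event times, and payloads are made up for the example; this is not a GCP setting or a production streaming framework.

```python
import heapq

# Hypothetical tuning parameter: how long to wait for late-arriving events.
MAX_WAIT_SECONDS = 5

buffer = []  # min-heap ordered by event time

def ingest(event_time, payload):
    """Add an event to the buffer; arrival order may differ from event order."""
    heapq.heappush(buffer, (event_time, payload))

def flush(now):
    """Emit, in event-time order, every event older than the wait window."""
    ordered = []
    while buffer and buffer[0][0] <= now - MAX_WAIT_SECONDS:
        ordered.append(heapq.heappop(buffer))
    return ordered

# Events arrive out of order but are emitted sorted by event time.
ingest(100.0, "reading-a")
ingest(98.5, "reading-b")   # late arrival
print(flush(now=110.0))     # [(98.5, 'reading-b'), (100.0, 'reading-a')]
```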
What is the event time?
Streaming data often includes a timestamp indicating the time that the data was generated. This is known as the event time.
What is the process time?
Some applications will also track the time that data arrives at the beginning of the ingestion pipeline; this is known as the process time.
Streaming Data
Streaming data is well suited for Cloud Pub/Sub ingestion, which can buffer data while applications process it. During spikes in ingestion, when application instances cannot keep up with the rate at which data is arriving, the data can be preserved in a Cloud Pub/Sub topic and processed later, after application instances have a chance to catch up. Cloud Pub/Sub has global endpoints and uses GCP’s global frontend load balancer to support ingestion. The messaging service scales automatically to meet the demands of the current workload.
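A minimal publishing sketch, assuming the google-cloud-pubsub Python client with placeholder project and topic names; it only shows how a streaming event lands in a topic, not a complete ingestion pipeline.

```python
from google.cloud import pubsub_v1

# Placeholder project and topic; assumes application default credentials.
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "sensor-events")

# Messages are bytes; metadata can be attached as string attributes.
future = publisher.publish(
    topic_path,
    data=b'{"device_id": "sensor-42", "temp_c": 21.7}',
    event_type="temperature",
)
print(future.result())  # message ID once the publish is acknowledged
```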
What is Batch Data?
Batch data is ingested in bulk, typically in files. Examples of batch data ingestion include uploading files of data exported from one application to be processed by another.
Examples of batch data include the following:
Transaction data that is collected from applications may be stored in a relational database and later exported for use by a machine learning pipeline
Archiving data in long-term storage to comply with data retention regulations
Migrating an application from on premises to the cloud by uploading files of exported data
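As a sketch of file-based batch ingestion, the following assumes the google-cloud-storage Python client with placeholder bucket and file names; an exported file is simply uploaded as an object for later processing.

```python
from google.cloud import storage

# Placeholder bucket and file names; assumes credentials are configured.
client = storage.Client()
bucket = client.bucket("example-ingest-bucket")

# Upload a locally exported file as an object in the bucket.
blob = bucket.blob("exports/transactions-2021-01-01.csv")
blob.upload_from_filename("transactions-2021-01-01.csv")
print("Uploaded", blob.name)
```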
The focus of the storage stage of the data lifecycle is to make data available for transformation and analysis.
Several factors influence the choice of storage system, including
How the data is accessed—by individual record (row) or by an aggregation of columns across many records (rows)
The way access controls need to be implemented, at the schema or database level or finer-grained level
How long the data will be stored
These three characteristics are the minimum that should be considered when choosing a storage system; there may be additional criteria for some use cases. (Structure is another factor and is discussed later in this chapter.)
What are Data Access Patterns?
Data is accessed in different ways. Online transaction processing systems often query for specific records using a set of filtering parameters. For example, an e-commerce application may need to look up a customer shipping address from a data store table that holds tens of thousands of addresses. Databases, like Cloud SQL and Cloud Datastore, provide that kind of query functionality.
In another example, a machine learning pipeline might begin by accessing files with thousands of rows of data that is used for training the model. Since machine learning models are often trained in batch mode, all of the training data is needed. Cloud Storage is a good option for storing data that is accessed in bulk.
Why consider time to store when choosing a data store?
Consider how long data will be stored when choosing a data store. Some data is transient. For example, data that is needed only temporarily by an application running on a Compute Engine instance could be stored on a local solid-state drive (SSD) on the instance. As long as the data can be lost when the instance shuts down, this could be a reasonable option.
Data is often needed longer than the lifetime of a virtual machine instance, so other options are better fits for those cases. Cloud Storage is a good option for long-term storage, especially if you can make use of storage lifecycle policies to migrate older data to Nearline or Coldline storage. For long-lived analytics data, either Cloud Storage or BigQuery is a good option, since the costs are similar.
Nearline storage is used for data that is accessed less than once per 30 days. Coldline storage is used to store data accessed less than once per year.
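A lifecycle policy of that kind can be set in a few lines. The sketch below assumes the google-cloud-storage Python client and a placeholder bucket name, moving objects to Nearline after 30 days and Coldline after a year.

```python
from google.cloud import storage

# Placeholder bucket name; assumes suitable permissions and credentials.
client = storage.Client()
bucket = client.get_bucket("example-analytics-archive")

# Move objects to colder storage classes as they age.
bucket.add_lifecycle_set_storage_class_rule("NEARLINE", age=30)
bucket.add_lifecycle_set_storage_class_rule("COLDLINE", age=365)
bucket.patch()  # apply the updated lifecycle configuration
```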
Statistical techniques are often used with numeric data to do the following:
Describe characteristics of a dataset, such as a mean and standard deviation of the dataset.
Generate histograms to understand the distribution of values of an attribute.
Find correlations between variables, such as customer type and average revenue per sales order.
Make predictions using regression models, which allow you to estimate one attribute based on the value of another. In statistical terms, regression models generate predictions of a dependent variable based on the value of an independent variable.
Cluster subsets of a dataset into groups of similar entities. For example, a retail sales dataset may yield groups of customers who purchase similar types of products and spend similar amounts over time.
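The sketch below runs several of these techniques on a tiny, made-up sales dataset; it assumes pandas and scikit-learn (and matplotlib for the histogram) rather than any GCP service.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Small, made-up dataset of orders and revenue per customer.
df = pd.DataFrame({
    "orders": [3, 7, 2, 9, 5, 6],
    "revenue": [120.0, 310.0, 80.0, 405.0, 220.0, 260.0],
})

print(df.describe())        # mean, standard deviation, quartiles
print(df.corr())            # correlation between orders and revenue
df.hist(column="revenue")   # histogram of the revenue distribution

# Regression: predict revenue (dependent) from order count (independent).
model = LinearRegression().fit(df[["orders"]], df["revenue"])
print(model.predict(pd.DataFrame({"orders": [8]})))  # estimate for 8 orders
```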
Some storage services are designed to store large volumes of data, including petabyte scales, whereas others are limited to smaller volumes.
Cloud Storage is an example of the former. An individual item in Cloud Storage can be up to 5 TB, and there is no limit to the number of read or write operations. Cloud Bigtable, which is used for telemetry data and large-volume analytic applications, can store up to 8 TB per node when using hard disk drives, and it can store up to 2.5 TB per node when using SSDs. Each Bigtable instance can have up to 1,000 tables. BigQuery, the managed data warehouse and analytics database, has no limit on the number of tables in a dataset, and it may have up to 4,000 partitions per table. Persistent disks, which can be attached to Compute Engine instances, can store up to 64 TB.
What is Velocity?
Velocity of data is the rate at which it is sent to and processed by an application. Web applications and mobile apps that collect and store human-entered data are typically low velocity, at least when measured by individual user.
There are three widely recognized categories of data structure:
Structured
Semi-structured
Unstructured
What do they mean?
These categories are particularly helpful when choosing a database.
Structured Data Structured data has a fixed set of attributes that can be modeled in a table of rows and columns.
Semi-Structured Data Semi-structured data has attributes like structured data, but the set of attributes can vary from one instance to another. For example, a product description of an appliance might include length, width, height, weight, and power consumption. A chair in the same catalog might have length, width, height, color, and style as attributes. Semi-structured data may be organized using arrays or sets of key-value pairs.
Unstructured Data Unstructured data does not fit into a tabular structure. Images and audio files are good examples of unstructured data. In between these two extremes lies semi-structured data, which has characteristics of both structured and unstructured.
What is Row Key Access?
Wide-column databases usually take a different approach to querying. Rather than using indexes to allow efficient lookup of rows with needed data, wide-column databases organize data so that rows with similar row keys are close together. Queries use a row key, which is analogous to a primary key in relational databases, to retrieve data. This has two implications.
Google has developed a decision tree for choosing a storage system that starts with distinguishing structured, semi-structured, and unstructured data.
Schema Design Considerations
Structured and semi-structured data has a schema associated with it. Structured data is usually stored in relational databases whereas semi-structured data is often stored in NoSQL databases. The schema influences how data is stored and accessed, so once you have determined which kind of storage technology to use, you may then need to design a schema that will support optimal storage and retrieval.
The distinction between relational and NoSQL databases is becoming less pronounced as each type adopts features of the other. Some relational databases support storing and querying JavaScript Object Notation (JSON) structures, similar to the way that document databases do. Similarly, some NoSQL databases now support ACID (atomicity, consistency, isolation, durability) transactions, which are a staple feature of relational databases.
What is relational database design centered on?
Data modeling for relational databases begins with determining which type of relational database you are developing: an online transaction processing (OLTP) database or an online analytical processing (OLAP) database.
Online transaction processing (OLTP) databases are designed for transaction processing and typically follow data normalization rules.
Denormalization—that is, intentionally violating one of the rules of normalization—is often used to improve query performance. For example, repeating customer names in both the customer table and an order table could avoid having to join the two tables when printing invoices.
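The following sketch makes that trade-off concrete, using SQLite as a stand-in for a relational database; the table and column names are illustrative only.

```python
import sqlite3

# Denormalization example: orders repeats customer_name so invoices can be
# printed without joining to the customers table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (
    customer_id   INTEGER PRIMARY KEY,
    customer_name TEXT
);
CREATE TABLE orders (
    order_id      INTEGER PRIMARY KEY,
    customer_id   INTEGER REFERENCES customers(customer_id),
    customer_name TEXT,      -- denormalized copy to avoid a join
    order_total   REAL
);
""")
conn.execute("INSERT INTO customers VALUES (1, 'Ada Lovelace')")
conn.execute("INSERT INTO orders VALUES (100, 1, 'Ada Lovelace', 42.50)")

# The invoice query reads a single table instead of joining two.
print(conn.execute(
    "SELECT order_id, customer_name, order_total FROM orders").fetchall())
```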
Online analytical processing (OLAP) data models are often used for data warehouse and data mart applications. OLAP models are also called dimensional models because data is organized around several dimensions.
What is NoSQL database design centered on?
NoSQL databases are less structured than relational databases, and there is no formal model, like relational algebra and the forms of normalization, that applies to all NoSQL databases. The four types of NoSQL databases available in GCP are
Key-value
Document
Wide column
Graph
Key-value data stores
Key-value data stores are databases that use associative arrays or dictionaries as the basic datatype. Keys are data used to look up values.
Key-value data stores are simple, but it is possible to have more complex data structures as values. For example, a JSON object could be stored as a value. This would be reasonable use of a key-value data store if the JSON object was only looked up by the key, and there was no need to search on items within the JSON structure. In situations where items in the JSON structure should be searchable, a document database would be a better option.
Cloud Memorystore is a fully managed key-value data store based on Redis, a popular open source key-value datastore. As of this writing, Cloud Memorystore does not support persistence, so it should not be used for applications that need to save data to persistent storage. Open source Redis does support persistence.
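The pattern described above might look like the following sketch, using the open source redis-py client (Cloud Memorystore exposes the Redis protocol); the host address, key, and value are placeholders.

```python
import json
import redis  # redis-py client; Cloud Memorystore speaks the Redis protocol

# Placeholder host; a Memorystore instance exposes a private IP like this.
client = redis.Redis(host="10.0.0.3", port=6379)

# Store a JSON document as an opaque value, looked up only by its key.
profile = {"name": "Ada", "tier": "gold", "cart_items": 3}
client.set("user:1001", json.dumps(profile))

# Fields inside the JSON are not individually searchable here, which is why
# a document database is the better fit when they need to be.
print(json.loads(client.get("user:1001")))
```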
Document Databases
Document stores allow complex data structures, called documents, to be used as values and accessed in more ways than simple key lookup. When designing a data model for document databases, documents should be designed to group data that is read together.
Wide-column databases are used for use cases with the following characteristics:
High volumes of data
Need for low-latency writes
More write operations than read operations
Limited range of queries—in other words, no ad hoc queries
Lookup by a single key
Wide-column databases have a data model similar to the tabular structure of relational tables, but there are significant differences. Wide-column databases are often sparse, with the exception of IoT and other time-series databases that have few columns that are almost always used.
Graph Databases
Another type of NoSQL database is the graph database, which is based on modeling entities and relationships as nodes and links in a graph or network. Social networks are a good example of a use case for graph databases. People could be modeled as nodes in the graph, and relationships between people are links, also called edges.
How does Cloud SQL support data engineers?
Cloud SQL supports MySQL, PostgreSQL, and SQL Server (beta). Cloud SQL instances are created in a single zone by default, but they can be created for high availability and use instances in multiple zones. Use read replicas to improve read performance. Importing and exporting are implemented via the RDBMS-specific tool.
How does Cloud Spanner support data engineers?
Replica Types?
Cloud Spanner is configured as regional or multi-regional instances. Cloud Spanner is a horizontally scalable relational database that automatically replicates data. Three types of replicas are read-write replicas, read-only replicas, and witness replicas. Avoid hotspots by not using consecutive values for primary keys.
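One common way to avoid such hotspots is a randomly distributed key, as in the sketch below; it assumes the google-cloud-spanner Python client and placeholder instance, database, table, and column names.

```python
import uuid
from google.cloud import spanner

# Placeholder instance and database; assumes credentials are configured.
client = spanner.Client()
database = client.instance("example-instance").database("orders-db")

# A UUIDv4 primary key is not monotonically increasing, so writes spread
# across the key space instead of piling onto one split.
with database.batch() as batch:
    batch.insert(
        table="orders",
        columns=("order_id", "customer", "total"),
        values=[(str(uuid.uuid4()), "Ada Lovelace", 42.50)],
    )
```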
How does Cloud Bigtable support data engineers?
Cloud Bigtable is a wide-column NoSQL database used for high-volume databases that require sub-10 ms latency. Cloud Bigtable is used for IoT, time-series, finance, and similar applications. For multi-regional high availability, you can create a replicated cluster in another region. All data is replicated between clusters. Designing tables for Bigtable is fundamentally different from designing them for relational databases. Bigtable tables are denormalized, and they can have thousands of columns. There is no support for joins in Bigtable or for secondary indexes. Data is stored in Bigtable lexicographically by row-key, which is the one indexed column in a Bigtable table. Keeping related data in adjacent rows can help make reads more efficient.
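A row-key lookup of the kind described above might look like the sketch below, assuming the google-cloud-bigtable Python client with placeholder instance, table, and key names; the "device#timestamp" key format is just one illustrative convention for keeping related rows adjacent.

```python
from google.cloud import bigtable

# Placeholder project, instance, and table names.
client = bigtable.Client(project="my-project")
table = client.instance("iot-instance").table("sensor-readings")

# Point lookup by row key -- the one indexed column in a Bigtable table.
row = table.read_row(b"sensor-42#20210101T000000")
if row is not None:
    for family, columns in row.cells.items():
        for qualifier, cells in columns.items():
            print(family, qualifier, cells[0].value)
```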
The Cloud Firestore data model consists of?
The Cloud Firestore data model consists of entities, entity groups, properties, and keys.
Entities have properties that can be atomic values, arrays, or entities.
Keys can be used to look up entities and their properties.
Alternatively, entities can be retrieved using queries that specify properties and values, much like using a WHERE clause in SQL.
However, to query using property values, properties need to be indexed.
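Both access paths might look like the following sketch, which assumes the google-cloud-datastore Python client (Cloud Firestore in Datastore mode) and placeholder kind, key, and property names.

```python
from google.cloud import datastore

client = datastore.Client()

# Lookup by key.
key = client.key("Customer", 1001)
entity = client.get(key)

# Query by property value -- the property must be indexed for this to work.
query = client.query(kind="Customer")
query.add_filter("city", "=", "Berlin")
for result in query.fetch(limit=10):
    print(result.key, dict(result))
```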
What are the major components of BigQuery?
BigQuery is an analytics database that uses SQL as a query language.
Datasets are the basic unit of organization for sharing data in BigQuery.
A dataset can have multiple tables.
BigQuery supports two dialects of SQL: legacy and standard. Standard SQL supports advanced SQL features such as correlated subqueries, ARRAY and STRUCT data types, and complex join expressions.
BigQuery uses the concepts of slots for allocating computing resources to execute queries.
BigQuery also supports streaming inserts, which load one row at a time.
Data is generally available for analysis within a few seconds, but it may be up to 90 minutes before data is available for copy and export operations.
Streaming inserts provide best-effort de-duplication. Stackdriver is used for monitoring and logging in BigQuery.
Stackdriver Monitoring provides performance metrics, such as query counts and time to run queries. Stackdriver Logging is used to track events, such as running jobs or creating tables. BigQuery costs are based on the amount of data stored, the amount of data streamed, and the workload required to execute queries.
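The sketch below shows a standard SQL query and a streaming insert through the google-cloud-bigquery Python client; the project, dataset, table, and column names are placeholders.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Standard SQL query; the result iterator pages through rows transparently.
query = """
    SELECT customer_type, AVG(order_total) AS avg_revenue
    FROM `my-project.sales.orders`
    GROUP BY customer_type
"""
for row in client.query(query).result():
    print(row.customer_type, row.avg_revenue)

# Streaming insert: rows are generally queryable within seconds.
errors = client.insert_rows_json(
    "my-project.sales.orders",
    [{"customer_type": "retail", "order_total": 42.50}],
)
print(errors or "row streamed successfully")
```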
How does Cloud Storage support data engineers?
Google Cloud Storage is an object storage system. It is designed for persisting unstructured data, such as data files, images, videos, backup files, and any other data; the service imposes no schema on the contents of an object. Objects—that is, files stored in Cloud Storage—are grouped into buckets, and a bucket is a group of objects that share access controls at the bucket level. The four storage tiers are Regional, Multi-regional, Nearline, and Coldline.
When you manage your own databases, you will be responsible for an array of database and system administration tasks.
How do you manage monitoring?
The two Stackdriver components that are used with unmanaged databases are Stackdriver Monitoring and Stackdriver Logging. Instances have built-in monitoring and logging. Monitoring includes CPU, memory, and I/O metrics. Audit logs, which have information about who created an instance, are also available by default. Once the Stackdriver Logging agent is installed, it can collect application logs, including database logs. Stackdriver Logging is configured with Fluentd, an open source data collector for logs. Once the Stackdriver Monitoring agent is installed, it can collect application performance metrics.
Understand the definitions of availability, reliability, and scalability.
Availability is defined as the ability of a user to access a resource at a specific time. Availability is usually measured as the percentage of time a system is operational. Reliability is defined as the probability that a system will meet service-level objectives for some duration of time. Reliability is often measured as the mean time between failures. Scalability is the ability of a system to meet the demands of workloads as they vary over time.
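As a small worked example of how these metrics relate (with made-up numbers), availability can be estimated from mean time between failures and mean time to repair:

```python
# Illustrative numbers only.
mtbf_hours = 990.0   # mean time between failures
mttr_hours = 10.0    # mean time to repair

# Availability is the fraction of time the system is operational.
availability = mtbf_hours / (mtbf_hours + mttr_hours)
print(f"Availability: {availability:.1%}")                        # 99.0%

# Downtime implied per 30-day month at that availability level.
print(f"Monthly downtime: {(1 - availability) * 30 * 24:.1f} hours")  # 7.2 hours
```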