GCP Architecture Framework Flashcards

1
Q

What is the foundational category of the Google Cloud Architecture Framework that provides design recommendations and describes best practices and principles to help you define the architecture, components, modules, interfaces, and data on a cloud platform to satisfy your system requirements?

A

System design is the foundational category of the Google Cloud Architecture Framework. This category provides design recommendations and describes best practices and principles to help you define the architecture, components, modules, interfaces, and data on a cloud platform to satisfy your system requirements. You also learn about Google Cloud products and features that support system design.

https://cloud.google.com/architecture/framework

2
Q

What does the Google Cloud Architecture Framework provide and describe?

A

Best practices to help architects, developers, administrators, and other cloud practitioners design and operate a cloud topology that’s secure, efficient, resilient, high-performing, and cost-effective.

https://cloud.google.com/architecture/framework

3
Q

What are the Google Cloud Architecture Framework categories (also known as pillars)?

A

The design guidance in the Architecture Framework applies to applications built for the cloud and for workloads migrated from on-premises to Google Cloud, hybrid cloud deployments, and multi-cloud environments.

The Google Cloud Architecture Framework is organized into six categories (also known as pillars):

System design
This category is the foundation of the Google Cloud Architecture Framework. Define the architecture, components, modules, interfaces, and data needed to satisfy cloud system requirements, and learn about Google Cloud products and features that support system design.
Operational excellence
Efficiently deploy, operate, monitor, and manage your cloud workloads.
Security, privacy, and compliance
Maximize the security of your data and workloads in the cloud, design for privacy, and align with regulatory requirements and standards.
Reliability
Design and operate resilient and highly available workloads in the cloud.
Cost optimization
Maximize the business value of your investment in Google Cloud.
Performance optimization
Design and tune your cloud resources for optimal performance.

https://cloud.google.com/architecture/framework

4
Q

Which category is the foundational category of the Google Cloud Architecture Framework? This category provides design recommendations and describes best practices and principles to help you define the architecture, components, modules, interfaces, and data on a cloud platform to satisfy your system requirements.

A

System Design is this category.

You also learn about Google Cloud products and features that support system design.

https://cloud.google.com/architecture/framework
https://cloud.google.com/architecture/framework/system-design

5
Q

What are the core principles of system design?

A

Document everything
When you start to move your workloads to the cloud or build your applications, a major blocker to success is lack of documentation of the system. Documentation is especially important for correctly visualizing the architecture of your current deployments.

Simplify your design and use fully managed services
Simplicity is crucial for system design. If your architecture is too complex to understand, it will be difficult to implement the design and manage it over time. Where feasible, use fully managed services to minimize the risks, time, and effort associated with managing and maintaining baseline systems.

Decouple your architecture
Decoupling is a technique that’s used to separate your applications and service components into smaller components that can operate independently. For example, you might break up a monolithic application stack into separate service components. In a decoupled architecture, an application can run its functions independently, regardless of the various dependencies.

Use a stateless architecture
A stateless architecture can increase both the reliability and scalability of your applications.

Stateful applications rely on various dependencies to perform tasks, such as locally cached data. Stateful applications often require additional mechanisms to capture progress and restart gracefully. Stateless applications can perform tasks without significant local dependencies by using shared storage or cached services. A stateless architecture enables your applications to scale up quickly with minimum boot dependencies. The applications can withstand hard restarts, have lower downtime, and provide better performance for end users.
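The stateful/stateless contrast above can be sketched in a few lines of Python. This is illustrative only: `SharedStore` stands in for a shared service such as Memorystore or Cloud Storage, and the worker names are hypothetical.

```python
# Illustrative sketch: the same task written statefully and statelessly.
# The stateless version keeps all progress in a shared store, so any
# replica can resume after a hard restart.

class SharedStore:
    """Stands in for a shared service such as Memorystore or Cloud Storage."""
    def __init__(self):
        self._data = {}
    def get(self, key, default=None):
        return self._data.get(key, default)
    def put(self, key, value):
        self._data[key] = value

class StatefulWorker:
    """Progress lives in local memory; a hard restart loses it."""
    def __init__(self):
        self.processed = 0
    def handle(self, item):
        self.processed += 1
        return item.upper()

class StatelessWorker:
    """Progress lives in the shared store; any instance can continue."""
    def __init__(self, store):
        self.store = store
    def handle(self, item):
        self.store.put("processed", self.store.get("processed", 0) + 1)
        return item.upper()

store = SharedStore()
w1 = StatelessWorker(store)
w1.handle("a")
w2 = StatelessWorker(store)    # simulate a hard restart: a fresh instance
w2.handle("b")
print(store.get("processed"))  # the count survived the "restart": 2
```

Because no progress lives on the instance, replacing `w1` with `w2` costs nothing, which is exactly why stateless services scale up quickly and tolerate hard restarts.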

https://cloud.google.com/architecture/framework/system-design

6
Q

What deployment best practices should you consider when deploying a GCP solution (system design best practice)?

A

Deploy over multiple regions
Select regions based on geographic proximity
Use Cloud Load Balancing to serve global users
Use the Cloud Region Picker to support sustainability
Compare pricing of major resources

https://cloud.google.com/architecture/framework/system-design

7
Q

What are 2 ways to design resilient services?

A

Standardize deployments and incorporate automation wherever possible. Using architectural standards and deploying with automation helps you standardize your builds, tests, and deployments, which helps to eliminate human-induced errors for repeated processes like code updates and provides confidence in testing before deployment.

Understand your operational tools portfolio.

https://cloud.google.com/architecture/framework/system-design

8
Q

What is the high-level overview of the design process?

A
  • Understand the overarching goals.
  • Identify and itemize objectives.
  • Categorize and prioritize objectives.
  • Analyze objectives.
  • Determine preliminary options.
  • Refine options.
  • Identify your final solution.

https://cloud.google.com/architecture/framework/system-design

9
Q

How do you identify the architecture?

A

Identify goals.
Use existing patterns or designs.
Google provides resources at the Cloud Architecture Center that can form the basis for your solution design.
For example, there’s a reference architecture diagram for migrating an on-premises data center to Google Cloud.

Take each sentence in your case study and rewrite it as an objective. An objective could be an existing architectural component or a task that needs to be completed.

https://cloud.google.com/architecture/framework/system-design/principles

10
Q

When you categorize requirements, what are the categories?

A

business requirements
technical requirements
As we discussed in module 1, we can use the business and technical requirements as watchpoints to use when deciding what we can use to implement a solution.

solution requirements
The solution components may act as equivalents to the existing infrastructure or may replace the existing infrastructure with new components and functionality to achieve Dress4Win’s goals.

11
Q

After you develop the solution approach, how do you work with the product information?

A

This is the job of the Professional Cloud Architect, because they determine how to best use the resources available in Google Cloud.
It often helps to start with a broad approach and decide which products to use, as we did in the last module.

Next, focus on the different products, services, and practices and decide how best to implement them in Google Cloud.

12
Q

What are 5 generic steps for an architect to follow when designing a cloud solution?

A

Designing a solution infrastructure that meets business requirements
Designing a solution infrastructure that meets technical requirements
Designing network, storage, compute, and other resources
Creating a migration plan
Envisioning future solution improvements

13
Q

What are core principles that an architect must consider in analyzing data for a solution during the system design aspect?

A

Identify the pros/cons of the different alternatives:
Dataflow lets you write complex transformations in a serverless approach, but you must rely on an opinionated version of configurations for compute and processing needs.
Alternatively, Dataproc lets you run the same transformations, but you manage the clusters and fine-tune the jobs yourself.

Processing strategy
In your system design, think about which processing strategy your teams use, such as extract, transform, load (ETL) or extract, load, transform (ELT). Your system design should also consider whether you need to process batch analytics or streaming analytics. Google Cloud provides a unified data platform, and it lets you build a data lake or a data warehouse to meet your business needs.
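The ETL/ELT distinction is about where the transform runs relative to the load. A toy Python sketch of that ordering (the "warehouse" is a plain list standing in for BigQuery; all names are mine):

```python
# Illustrative only: ETL transforms data before the warehouse sees it;
# ELT loads raw data first and transforms it inside the warehouse.

def etl(rows, transform, warehouse):
    """Extract -> Transform -> Load."""
    warehouse.extend(transform(r) for r in rows)

def elt(rows, transform, warehouse):
    """Extract -> Load -> Transform (in place, where the warehouse would)."""
    warehouse.extend(rows)
    warehouse[:] = [transform(r) for r in warehouse]

clean = lambda r: r.strip().lower()
w1, w2 = [], []
etl([" A ", " B "], clean, w1)
elt([" A ", " B "], clean, w2)
print(w1 == w2 == ["a", "b"])  # same end state, different pipeline shape
```

The end state is identical; the choice affects where compute happens, which is why a warehouse like BigQuery makes ELT attractive while a processing engine like Dataflow suits ETL.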

https://cloud.google.com/architecture/framework/system-design/data-analytics

14
Q

When you create your system design, you can group the Google Cloud data analytics services around the general data movement in any system, or around the data lifecycle. What services can an architect use during each stage of the data lifecycle?

A

The data lifecycle includes the following stages and example services:

Ingestion includes services such as Pub/Sub, Storage Transfer Service, Transfer Appliance, Cloud IoT Core, and BigQuery.
Storage includes services such as Cloud Storage, Bigtable, Memorystore, and BigQuery.
Processing and transformation includes services such as Dataflow, Dataproc, Dataprep, Cloud Data Loss Prevention (Cloud DLP), and BigQuery.
Analysis and warehousing includes services such as BigQuery.
Reporting and visualization includes services such as Looker Studio and Looker.
The following stages and services run across the entire data lifecycle:

Data integration includes services such as Data Fusion.
Metadata management and governance includes services such as Data Catalog.
Workflow management includes services such as Cloud Composer.
https://cloud.google.com/architecture/framework/system-design/data-analytics

15
Q

What are the data ingestion best practices an architect should consider during the system design phase?

A

Determine the data source for ingestion
Data typically comes from another cloud provider or service, or from an on-premises location:

To ingest data from other cloud providers, you typically use Cloud Data Fusion, Storage Transfer Service, or BigQuery Data Transfer Service.

For on-premises data ingestion, consider the volume of data to ingest and your team’s skill set. If your team prefers a low-code, graphical user interface (GUI) approach, use Cloud Data Fusion with a suitable connector, such as Java Database Connectivity (JDBC). For large volumes of data, you can use Transfer Appliance or Storage Transfer Service.

Consider how you want to process your data after you ingest it. For example, Storage Transfer Service only writes data to a Cloud Storage bucket, and BigQuery Data Transfer Service only writes data to a BigQuery dataset. Cloud Data Fusion supports multiple destinations.
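One way to internalize these rules is to encode them as a small decision helper. This is a sketch of the card's guidance, not an official flowchart; the product names are real, but the function, its parameters, and the 1 TB cutoff for gsutil (taken from the related ingestion cards) are my framing.

```python
# Hedged sketch: the data-source decision rules above as a lookup.

def pick_ingestion_service(source, volume_tb=0, low_code=False):
    """Return candidate Google Cloud services for an ingestion scenario."""
    if source == "other_cloud":
        return ["Cloud Data Fusion", "Storage Transfer Service",
                "BigQuery Data Transfer Service"]
    if source == "on_premises":
        if low_code:
            return ["Cloud Data Fusion"]  # GUI + connectors such as JDBC
        if volume_tb > 1:
            return ["Transfer Appliance", "Storage Transfer Service"]
        return ["gsutil"]
    raise ValueError(f"unknown source: {source}")

print(pick_ingestion_service("on_premises", volume_tb=50))
# ['Transfer Appliance', 'Storage Transfer Service']
```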

https://cloud.google.com/architecture/framework/system-design/data-analytics

16
Q

How do architects determine which services to use for data ingestion during the system design phase?

A

Identify streaming or batch data services to use. Consider how you need to use your data and identify where you have streaming or batch use cases:

Pub/Sub: used for a global streaming service that has low-latency requirements.
BigQuery: used when you need analytics and reporting.
Kafka to BigQuery Dataflow template: used if you need to stream data from a system like Apache Kafka in an on-premises or other cloud environment.
Storage Transfer Service: used for batch workloads; the first step is usually to ingest data into Cloud Storage with the gsutil tool.
BigQuery Data Transfer Service: used for scheduled transfers into BigQuery.

Apart from the tools above, you also have the following data pipeline options to load data into BigQuery:

Cloud Dataflow

Dataflow is a fully managed service on GCP built on the open source Apache Beam API, with support for various data sources: files, databases, message-based sources, and more. With Dataflow you can transform and enrich data in both batch and streaming modes with the same code. Google provides prebuilt Dataflow templates for batch jobs.

Cloud Dataproc

Dataproc is a fully managed service on GCP for Apache Spark and Apache Hadoop. Dataproc provides a BigQuery connector that enables Spark and Hadoop applications to process data from BigQuery and write data to BigQuery using its native terminology.

Cloud Logging

This is not a data pipeline option, but Cloud Logging (previously known as Stackdriver) provides an option to export log files into BigQuery. See Exporting with the Logs Viewer for more information, and the reference guide on exporting logs to BigQuery for security and access analytics.

https://cloud.google.com/architecture/framework/system-design/data-analytics

17
Q

When can you query data in BigQuery without loading it first?

A

You don’t need to load data into BigQuery before running queries in the following situations:

Public Datasets: Public datasets are datasets stored in BigQuery and shared with the public. For more information, see BigQuery public datasets.

Shared Datasets: You can share datasets stored in BigQuery. If someone has shared a dataset with you, you can run queries on that dataset without loading the data.

External data sources (Federated): You can skip the data loading process by creating a table based on an external data source.

https://cloud.google.com/blog/topics/developers-practitioners/bigquery-explained-data-ingestion

18
Q

What services allow you to ingest data using automation?

A

Ingest data with automated tools
Manually moving data from other systems into the cloud can be a challenge. Where possible, use tools that let you automate the data ingestion process.

Cloud Data Fusion
Provides connectors and plugins to bring in data from external sources with a drag-and-drop GUI.

Dataflow or BigQuery
If your teams want to write some code, these services let you automate data ingestion.

Pub/Sub
Supports both a low-code and a code-first approach.

Storage Transfer Service
To ingest data into storage buckets, use gsutil for data sizes of up to 1 TB. To ingest amounts of data larger than 1 TB, use Storage Transfer Service.

19
Q

What services do you use to regularly ingest data on a schedule?

A

Storage Transfer Service and BigQuery Data Transfer Service both let you schedule ingestion jobs. For fine-grain control of the timing of ingestion or the source and destination system, use a workflow-management system like Cloud Composer. If you want a more manual approach, you can use Cloud Scheduler and Pub/Sub to trigger a Cloud Function.
If you want to manage the compute infrastructure, you can use the gsutil command with cron for data transfers of up to 1 TB. If you use this manual approach instead of Cloud Composer, follow the best practices to script production transfers.
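The scheduling options above can be sketched as a decision helper. The branch order is my reading of the card, not an official rubric; the 1 TB cap on gsutil-with-cron comes from the text, and the function and flag names are hypothetical.

```python
# Hedged sketch: which scheduling approach fits, per this card's guidance.

def pick_scheduling_approach(fine_grained_control=False, manual=False,
                             manage_own_compute=False, volume_tb=0):
    if manage_own_compute:
        if volume_tb > 1:
            raise ValueError("gsutil with cron is only suggested up to 1 TB")
        return "gsutil with cron"
    if fine_grained_control:
        return "Cloud Composer"                              # workflow management
    if manual:
        return "Cloud Scheduler + Pub/Sub triggering a Cloud Function"
    return "Storage Transfer Service or BigQuery Data Transfer Service schedules"

print(pick_scheduling_approach(fine_grained_control=True))  # Cloud Composer
```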

https://cloud.google.com/architecture/framework/system-design/data-analytics

20
Q

How do architects pick the right datastore based on their needs?

A

Identify which of the following common use cases for your data to pick which Google Cloud product to use:

Data use case: Product recommendation
File-based: Filestore
Object-based: Cloud Storage
Low latency: Bigtable
Time series: Bigtable
Online cache: Memorystore
Transaction processing: Cloud SQL
Business intelligence (BI) & analytics: BigQuery
Batch processing: Cloud Storage

Bigtable if incoming data is time series and you need low latency access to it.

BigQuery if you use SQL.
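The table above reduces to a simple lookup. The products are real Google Cloud services; the dictionary name and use-case keys are my labels.

```python
# The datastore-selection table as a lookup (sketch; keys are my labels).

DATASTORE_BY_USE_CASE = {
    "file_based": "Filestore",
    "object_based": "Cloud Storage",
    "low_latency": "Bigtable",
    "time_series": "Bigtable",
    "online_cache": "Memorystore",
    "transaction_processing": "Cloud SQL",
    "bi_and_analytics": "BigQuery",
    "batch_processing": "Cloud Storage",
}

print(DATASTORE_BY_USE_CASE["time_series"])  # Bigtable
```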

https://cloud.google.com/architecture/framework/system-design/data-analytics

21
Q

What would an architect use when you need to ingest data from multiple sources?

A

Use Dataflow to ingest data from multiple sources
To ingest data from multiple sources, such as Pub/Sub, Cloud Storage, HDFS, S3, or Kafka, use Dataflow. Dataflow is a managed serverless service that supports Dataflow templates, which let your teams run templates from different tools.

Dataflow Prime provides horizontal and vertical autoscaling of machines that are used in the execution process of a pipeline. It also provides smart diagnostics and recommendations that identify problems and suggest how to fix them.

https://cloud.google.com/architecture/framework/system-design/data-analytics

22
Q

What is the System design category of the architecture framework?

A

This category is the foundation of the Google Cloud Architecture Framework. Define the architecture, components, modules, interfaces, and data needed to satisfy cloud system requirements, and learn about Google Cloud products and features that support system design.

https://cloud.google.com/architecture/framework

23
Q

What is the Operational excellence category?

A

Efficiently deploy, operate, monitor, and manage your cloud workloads.

https://cloud.google.com/architecture/framework

24
Q

What is the Security, privacy, and compliance category of the architecture framework?

A

Maximize the security of your data and workloads in the cloud, design for privacy, and align with regulatory requirements and standards.

https://cloud.google.com/architecture/framework

25
Q

What is the Reliability category of the architecture framework?

A

Design and operate resilient and highly available workloads in the cloud.

https://cloud.google.com/architecture/framework

26
Q

What is the Cost optimization category of the architecture framework?

A

Maximize the business value of your investment in Google Cloud.

https://cloud.google.com/architecture/framework

27
Q

What is the Performance optimization category of the architecture framework?

A

Design and tune your cloud resources for optimal performance.

https://cloud.google.com/architecture/framework

28
Q

How do you simplify your design and use fully managed services?

A

Simplicity is crucial for system design. If your architecture is too complex to understand, it will be difficult to implement the design and manage it over time. Where feasible, use fully managed services to minimize the risks, time, and effort associated with managing and maintaining baseline systems.

If you’re already running your workloads in production, test with managed services to see how they might help to reduce operational complexities. If you’re developing new workloads, then start simple, establish a minimal viable product (MVP), and resist the urge to over-engineer. You can identify exceptional use cases, iterate, and improve your systems incrementally over time.

https://cloud.google.com/architecture/framework/system-design/principles

29
Q

How do you decouple your architecture, for example a monolithic application?

A

Decoupling is a technique that’s used to separate your applications and service components into smaller components that can operate independently.
For example, you might break up a monolithic application stack into separate service components. In a decoupled architecture, an application can run its functions independently, regardless of the various dependencies.

You can start decoupling early in your design phase or incorporate it as part of your system upgrades as you scale.
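A minimal sketch of decoupling, assuming two hypothetical services that share only a queue. `queue.Queue` stands in for a message broker such as Pub/Sub; the service names are invented for illustration.

```python
# Illustrative sketch: the two components share only a queue, so either
# side can be upgraded, scaled, or restarted independently of the other.
import queue

def order_service(out_q):
    """Producer: publishes events; knows nothing about the consumer."""
    out_q.put({"order_id": 1, "total": 30})

def billing_service(in_q):
    """Consumer: processes events; knows nothing about the producer."""
    event = in_q.get()
    return f"charged order {event['order_id']} for {event['total']}"

q = queue.Queue()          # stands in for a broker such as Pub/Sub
order_service(q)
print(billing_service(q))  # charged order 1 for 30
```

Because the only contract is the event shape on the queue, each side can meet its own reliability goals, security controls, and upgrade schedule, which is the flexibility the next card lists.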

https://cloud.google.com/architecture/framework/system-design/principles

30
Q

What are the benefits of a decoupled architecture?

A

A decoupled architecture gives you increased flexibility to do the following:

Apply independent upgrades.
Enforce specific security controls.
Establish reliability goals for each subsystem.
Monitor health.
Granularly control performance and cost parameters.

https://cloud.google.com/architecture/framework/system-design/principles

31
Q

Why use a stateless architecture?

A

A stateless architecture can increase both the reliability and scalability of your applications.

Stateful applications rely on various dependencies to perform tasks, such as locally cached data. Stateful applications often require additional mechanisms to capture progress and restart gracefully.

Stateless applications can perform tasks without significant local dependencies by using shared storage or cached services. A stateless architecture enables your applications to scale up quickly with minimum boot dependencies. The applications can withstand hard restarts, have lower downtime, and provide better performance for end users.

https://cloud.google.com/architecture/framework/system-design/principles

32
Q

When you select a region or multiple regions for your business applications, what criteria might you consider for your decision?

A

When you select a region or multiple regions for your business applications, you consider criteria including:
service availability
end-user latency
application latency
cost
regulatory or sustainability requirements.

Decision Making Process:
To support your business priorities and policies, balance these requirements and identify the best tradeoffs.

For example, the most compliant region might not be the most cost-efficient region or it might not have the lowest carbon footprint.
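One way to balance these criteria is a weighted score per candidate region. This is a sketch with made-up scores and weights, not Google guidance; the region names are real, everything else is illustrative.

```python
# Hedged sketch: weigh the selection criteria per region and pick the
# best tradeoff for a given set of business priorities.

def best_region(regions, weights):
    """regions: {name: {criterion: score 0-10}}; weights: {criterion: float}."""
    def total(scores):
        return sum(weights[c] * scores.get(c, 0) for c in weights)
    return max(regions, key=lambda name: total(regions[name]))

regions = {
    "europe-west1": {"latency": 9, "cost": 6, "compliance": 9, "carbon": 8},
    "us-central1":  {"latency": 4, "cost": 9, "compliance": 5, "carbon": 7},
}
# A GDPR-bound workload weights compliance heavily:
print(best_region(regions, {"latency": 1, "cost": 1, "compliance": 3, "carbon": 1}))
# europe-west1
```

Changing the weights changes the winner, which is the point of the card: the most compliant region may lose on cost, and the tradeoff is a business decision, not a fixed answer.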

https://cloud.google.com/architecture/framework/system-design/geographic-zones-regions

33
Q

Why would you deploy a customer solution over multiple regions?

A

Deploy over multiple regions

Why choose this strategy?

To help protect against expected downtime (including maintenance) and help protect against unexpected downtime like incidents, we recommend that you deploy fault-tolerant applications that have high availability and deploy your applications across multiple zones in one or more regions. For more information, see Geography and regions, Application deployment considerations, and Best practices for Compute Engine regions selection.

Multi-zonal deployments can provide resiliency if multi-region deployments are limited due to cost or other considerations.

This approach is especially helpful in preventing zonal or regional outages and in addressing disaster recovery and business continuity concerns.

For more information, see Design for scale and high availability.

https://cloud.google.com/architecture/framework/system-design/geographic-zones-regions

34
Q

Why would an architect select regions based on geographic proximity?

A

Why choose regions in closer proximity?

Latency impacts the user experience and affects costs associated with serving external users.
To minimize latency when serving traffic to external users, select a region or set of regions that are geographically close to your users and where your services run in a compliant way.
For more information, see Cloud locations and the Compliance resource center.

https://cloud.google.com/architecture/framework/system-design/geographic-zones-regions

35
Q

Why and how would you select regions based on available services?

A

Not all services are available in every region, so you must verify their availability during the design process.

Select a region based on the available services that your business requires. Most services are available across all regions. Some enterprise-specific services might be available in a subset of regions with their initial release.

To verify region selection, see Cloud locations.

https://cloud.google.com/about/locations
https://cloud.google.com/architecture/framework/system-design/geographic-zones-regions

36
Q

Why would you need to choose regions to support compliance?

A

Select a specific region or set of regions to meet regulatory or compliance requirements that mandate the use of certain geographies, for example the General Data Protection Regulation (GDPR) or data residency.
To learn more about designing secure systems, see Compliance offerings and Data residency, operational transparency, and privacy for European customers on Google Cloud.

https://cloud.google.com/architecture/framework/system-design/geographic-zones-regions

37
Q

Why and how must you compare pricing of major resources for a customer solution?

A

Compare prices across the different options for your regions.

Regions have different cost rates for the same services. To identify a cost-efficient region, compare pricing of the major resources that you plan to use. Cost considerations differ depending on backup requirements and resources like compute, networking, and data storage.

To learn more, see the Cost optimization category.
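A sketch of the comparison: price the same planned usage in each candidate region and take the cheaper one. All rates below are invented for illustration; real figures come from the Google Cloud pricing pages.

```python
# Hedged sketch: the same bill priced in two regions (rates are made up).

def monthly_cost(usage, rates):
    """usage: {resource: units}; rates: {resource: price per unit}."""
    return sum(units * rates[r] for r, units in usage.items())

usage = {"vcpu_hours": 2000, "storage_gb": 500, "egress_gb": 100}
rates = {
    "us-central1":  {"vcpu_hours": 0.031, "storage_gb": 0.020, "egress_gb": 0.12},
    "europe-west2": {"vcpu_hours": 0.037, "storage_gb": 0.023, "egress_gb": 0.12},
}
cheapest = min(rates, key=lambda r: monthly_cost(usage, rates[r]))
print(cheapest)  # us-central1
```

Note the comparison depends on the usage mix: a storage-heavy workload and a compute-heavy workload can rank the same regions differently, which is why the card says to compare the major resources you actually plan to use.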

https://cloud.google.com/architecture/framework/system-design/geographic-zones-regions

38
Q

How do you pick a region to support sustainability?

A

Use the Cloud Region Picker to support sustainability
Google has been carbon neutral since 2007 and is committed to being carbon-free by 2030. To select a region by its carbon footprint, use the Google Cloud Region Picker. To learn more about designing for sustainability, see Cloud sustainability.

https://cloud.google.com/architecture/framework/system-design/geographic-zones-regions

39
Q

How would you create a solution that serves global users?

A

Use Cloud Load Balancing to serve global users
To improve the user experience when you serve global users, use Cloud Load Balancing to help provide a single IP address that is routed to your application. To learn more about designing reliable systems, see Google Cloud Architecture Framework: Reliability.

https://cloud.google.com/architecture/framework/system-design/geographic-zones-regions

40
Q

How should an architect create a strategy for folder structure?

A

Use a simple folder structure
Folders let you group any combination of projects and subfolders. Create a simple folder structure to organize your Google Cloud resources. You can add more levels as needed to define your resource hierarchy so that it supports your business needs.

The folder structure is flexible and extensible. To learn more, see Creating and managing folders.

A common situation is to create folders that in turn contain additional folders or projects. This structure is referred to as a folder hierarchy. When creating a folder hierarchy, keep in mind the following:

You can nest folders up to 10 (ten) levels deep.
A parent folder cannot contain more than 300 folders. This refers to direct child folders only. Those child folders can, in turn, contain additional folders or projects.
Folder display names must be unique within the same level of the hierarchy.
You can use folder-level IAM policies to control access to the resources the folder contains. For example, if a user is granted the Compute Instance Admin role on a folder, that user has the Compute Instance Admin role for all of the projects in the folder.
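The limits in this list can be encoded as a validator for a planned hierarchy. This is a sketch: the limits (10 levels, 300 direct child folders, unique sibling names) are from the card, while the function name and tree layout are hypothetical.

```python
# Hedged sketch: check a planned folder tree against the documented limits.

def validate_folders(folder, depth=1):
    """folder: {"name": str, "children": [folder, ...]}; raises on violations."""
    if depth > 10:
        raise ValueError("folders can nest at most 10 levels deep")
    children = folder.get("children", [])
    if len(children) > 300:
        raise ValueError("a parent folder can hold at most 300 direct child folders")
    names = [c["name"] for c in children]
    if len(names) != len(set(names)):
        raise ValueError("display names must be unique within the same level")
    for child in children:
        validate_folders(child, depth + 1)
    return True

tree = {"name": "org", "children": [
    {"name": "dev",  "children": []},
    {"name": "prod", "children": []},
]}
print(validate_folders(tree))  # True
```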

Before you begin
Folder functionality is only available to Google Workspace and Cloud Identity customers that have an organization resource. For more information about acquiring an organization resource, see Creating and managing organizations.

If you’re exploring how to best use folders, we recommend that you:

Review Access Control for Folders Using IAM. The topic describes how you can control who has what access to folders and the resources they contain.
Understand how to set folder permissions. Folders support a number of different IAM roles. If you want to broadly set up permissions so users can see the structure of their projects, grant the entire domain the Organization Viewer and Folder Viewer roles at the organization level. To restrict visibility to branches of your folder hierarchy, grant the Folder Viewer role on the folder or folders you want users to see.
Create folders. As you plan how to organize your Cloud resources, we recommend that you start with a single folder as a sandbox where you can experiment with which hierarchy makes the most sense for your organization. Think of folders in terms of isolation boundaries between resources and attach points for access and configuration policies. You may choose to create folders to contain resources that belong to different departments and assign Admin roles on folders to delegate administrator privilege. Folders can also be used to group resources that belong to applications or different environments, such as development, production, test. Use nested folders to model these different scenarios.

https://cloud.google.com/architecture/framework/system-design/resource-management

https://cloud.google.com/resource-manager/docs/creating-managing-folders

41
Q

What are the different options for organizing resources in the customer’s IAM resource hierarchy?

A

Option 1: Hierarchy based on application environments
In many organizations, you define different policies and access controls for different application environments, such as development, production, and testing. Having separate policies that are standardized across each environment eases management and configuration. For example, you might have security policies that are more stringent in production environments than in testing environments.

Use a hierarchy based on application environments if the following is true:

You have separate application environments that have different policy and administration requirements.
You have use cases that have highly customized security or audit requirements.
You require different Identity and Access Management (IAM) roles to access your Google Cloud resources in different environments.
Avoid this hierarchy if the following is true:

You don’t run multiple application environments.
You don’t have varying application dependencies and business processes across environments.
You have strong policy differences for different regions, teams, or applications.

Option 2: Hierarchy based on regions or subsidiaries
Some organizations operate across many regions, have subsidiaries doing business in different geographies, or have grown through mergers and acquisitions. These organizations require a resource hierarchy that uses the scalability and management options in Google Cloud and maintains the independence of the different processes and policies that exist between the regions or subsidiaries. This hierarchy uses subsidiaries or regions as the highest folder level in the resource hierarchy. Deployment procedures are typically focused around the regions.

Use this hierarchy if the following is true:

Different regions or subsidiaries operate independently.
Regions or subsidiaries have different operational backbones, digital platform offerings, and processes.
Your business has different regulatory and compliance standards for regions or subsidiaries.

Option 3: Hierarchy based on an accountability framework
A hierarchy based on an accountability framework works best when your products are run independently or organizational units have clearly defined teams who own the lifecycle of the products. In these organizations, the product owners are responsible for the entire product lifecycle, including its processes, support, policies, and access rights. Your products are quite different from each other, so only a few organization-wide guidelines exist.

Use this hierarchy when the following is true:

You run an organization that has clear ownership and accountability for each product.
Your workloads are independent and don’t share many common policies.
Your processes and external developer platforms are offered as service or product offerings.

https://cloud.google.com/architecture/framework/system-design/resource-management
https://cloud.google.com/architecture/landing-zones/decide-resource-hierarchy

42
Q

What are best practices for designing a resource hierarchy for your organization?

A

Use folders and projects to reflect data governance policies
Use folders, subfolders, and projects to separate resources from each other to reflect data governance policies within your organization. For example, you can use a combination of folders and projects to separate the financial, human resources, and engineering departments.

Use projects to group resources that share the same trust boundary. For example, resources for the same product or microservice can belong to the same project. For more information, see Decide a resource hierarchy for your Google Cloud landing zone.

Use a single organization node
To avoid management overhead, use a single organization node whenever possible. However, consider using multiple organization nodes to address the following use cases:

You want to test major changes to your IAM policies or resource hierarchy.
You want to experiment in a sandbox environment that doesn’t have the same organization policies.
Your organization includes sub-companies that are likely to be sold off or run as completely separate entities in the future.
Use standardized naming conventions
Use a standardized naming convention throughout your organization. The security foundations blueprint has a sample naming convention that you can adapt to your requirements.

Understand resource interactions throughout the hierarchy
Understand which resources interact with the resource hierarchy and how the folder structure works for them.

Organization policies are inherited by descendants in the resource hierarchy, but can be superseded by policies defined at a lower level. For more information, see understanding hierarchy evaluation. You use organization policy constraints to set guidelines around the whole organization or significant parts of it and still allow for exceptions.

IAM policies are inherited by descendants and cannot be superseded. However, you can add more access controls at lower levels of the hierarchy. See Using resource hierarchy for access control for details.
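The additive inheritance described above can be sketched in a few lines: inherited role bindings cannot be removed at lower levels, so the effective policy is the union of the bindings on the resource and all of its ancestors. This is a hypothetical illustration of the rule, not the real IAM API; the names and policy shapes are assumptions.

```python
# Sketch of IAM policy inheritance: bindings accumulate down the
# hierarchy -- nothing inherited from an ancestor can be overridden,
# only added to. Hypothetical data shapes; not a Google Cloud API.

def effective_bindings(ancestry):
    """ancestry: list of {role: set_of_members} dicts, ordered from the
    organization node down to the resource. Returns the union."""
    effective = {}
    for policy in ancestry:
        for role, members in policy.items():
            effective.setdefault(role, set()).update(members)
    return effective

org = {"roles/viewer": {"group:everyone@example.com"}}
project = {"roles/editor": {"user:dev@example.com"}}
# The org-level viewer binding is still in effect on the project,
# alongside the project-level editor binding.
print(effective_bindings([org, project]))
```

Contrast this with tags, where a lower-level tag with the same key replaces the inherited value.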

You also need to consider the following:

Cloud Logging includes aggregated sinks that you can use to aggregate logs at the folder or organization level.
Billing is not directly linked to the resource hierarchy, but assigned at the project level. However, to get aggregated information at the folder level, you can analyze your costs by project hierarchy using billing reports.
Hierarchical firewall policies let you implement consistent firewall policies throughout the organization or in specific folders. Inheritance is implicit, which means that you can allow or deny traffic at any level or you can delegate the decision to a lower level.
Keep bootstrapping resources and common services separate
Keep separate folders for bootstrapping the Google Cloud environment using infrastructure-as-code (IaC) and for common services that are shared between environments or applications. Place the bootstrap folder right below the organization node in the resource hierarchy.

Place the folders for common services at different levels of the hierarchy, depending on the structure that you choose. Place the folder for common services right below the organization node when the following is true:

Your hierarchy uses application environments at the highest level and teams or applications at the second layer.
You have shared services such as monitoring that are common between environments.
Place the folder for common services at a lower level, below the application folders, when you have services that are shared between applications but are deployed separately for each deployment environment, for example shared microservices that are used by multiple applications but are updated regularly and require development and testing.

https://cloud.google.com/architecture/framework/system-design/resource-management

https://cloud.google.com/architecture/landing-zones/decide-resource-hierarchy

43
Q

When should you start using tags and labels, and what is each one for?

A

When?
Use tags and labels at the outset of your project

Use labels and tags when you start to use Google Cloud products, even if you don’t need them immediately. Adding labels and tags later on can require manual effort that can be error prone and difficult to complete.

A tag provides a way to conditionally allow or deny policies based on whether a resource has a specific tag. A label is a key-value pair that helps you organize your Google Cloud instances. For more information on labels, see requirements for labels, a list of services that support labels, and label formats.
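As a rough illustration of those label format rules, here is a validator sketch. It encodes the commonly documented constraints for label keys (up to 63 characters; lowercase letters, digits, hyphens, and underscores; must start with a lowercase letter) — treat the linked requirements page as authoritative, since these limits are stated here from memory.

```python
import re

# Sketch of a label-key validator based on the commonly documented
# Cloud label format: <= 63 characters, lowercase letters, digits,
# hyphens, and underscores, starting with a lowercase letter.
# Verify against the official "requirements for labels" page.
LABEL_KEY_RE = re.compile(r"^[a-z][a-z0-9_-]{0,62}$")

def is_valid_label_key(key: str) -> bool:
    return bool(LABEL_KEY_RE.match(key))

print(is_valid_label_key("cost-center"))  # True
print(is_valid_label_key("CostCenter"))   # False: uppercase not allowed
```

Validating keys like this at resource-creation time catches naming drift before labels reach billing reports.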

Resource Manager provides labels and tags to help you manage resources, allocate and report on cost, and assign policies to different resources for granular access controls.
For example, you can use labels and tags to apply granular access and management principles to different tenant resources and services. For information about VM labels and network tags, see Relationship between VM labels and network tags.

You can use labels for multiple purposes, including the following:

Managing resource billing: Labels are available in the billing system, which lets you separate cost by labels. For example, you can label different cost centers or budgets.
Grouping resources by similar characteristics or by relation: You can use labels to separate different application lifecycle stages or environments. For example, you can label production, development, and testing environments.

Tag inheritance
When a tag key-value pair is attached to a resource, all descendants of the resource inherit the tag. You can override an inherited tag on a descendant resource. To override an inherited tag, apply a tag using the same key as the inherited tag, but use a different value.
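The inheritance-with-override rule above can be sketched as a small resolver — a minimal illustration of the behavior, not a Google Cloud API; the hierarchy levels and tag values are hypothetical.

```python
# Minimal sketch of tag inheritance: descendants inherit ancestor tags,
# and a tag applied lower in the hierarchy with the same key overrides
# the inherited value. Hypothetical names; not a Google Cloud API.

def effective_tags(ancestry):
    """ancestry: list of {key: value} dicts, ordered from the organization
    node down to the resource. Lower levels override inherited keys."""
    tags = {}
    for level in ancestry:
        tags.update(level)  # same key at a lower level replaces the value
    return tags

hierarchy = [
    {"environment": "production"},   # organization
    {},                              # folder (no tags of its own)
    {"environment": "development"},  # project overrides the inherited value
]
print(effective_tags(hierarchy))  # {'environment': 'development'}
```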

https://cloud.google.com/architecture/framework/system-design/resource-management
https://cloud.google.com/resource-manager/docs/tags/tags-overview

44
Q

What are best practice uses for tags and labels?

A

Assign labels to support cost and billing reporting
To support granular cost and billing reporting based on attributes outside of your integrated reporting structures (like per-project or per-product type), assign labels to resources. Labels can help you allocate consumption to cost centers, departments, specific projects, or internal recharge mechanisms. For more information, see the Cost optimization category.

Avoid creating large numbers of labels
Avoid creating large numbers of labels. We recommend that you create labels primarily at the project level, and that you avoid creating labels at the sub-team level. If you create overly granular labels, it can add noise to your analytics. To learn about common use cases for labels, see Common uses of labels.

Avoid adding sensitive information to labels
Labels aren’t designed to handle sensitive information. Don’t include sensitive information in labels, including information that might be personally identifiable, like an individual’s name or title.

Apply tags to model business dimensions
You can apply tags to model additional business dimensions like organization structure, regions, workload types, or cost centers. To learn more about tags, see Tags overview, Tag inheritance, and Creating and managing tags. To learn how to use tags with policies, see Policies and tags. To learn how to use tags to manage access control, see Tags and access control.

https://cloud.google.com/architecture/framework/system-design/resource-management

45
Q

What are best practices for project names?

A

Establish project naming conventions
Establish a standardized project naming convention, for example, SYSTEM_NAME-ENVIRONMENT (dev, test, uat, stage, prod).

Project names have a 30-character limit.

Although you can apply a prefix like COMPANY_TAG-SUB_GROUP/SUBSIDIARY_TAG, project names can become out of date when companies go through reorganizations. Consider moving identifiable names from project names to project labels.
Anonymize information in project names
Follow a project naming pattern like COMPANY_INITIAL_IDENTIFIER-ENVIRONMENT-APP_NAME, where the placeholders are unique and don’t reveal company or application names. Don’t include attributes that can change in the future, for example, a team name or technology.
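The naming pattern above can be sketched as a small helper that also enforces the 30-character project name limit mentioned earlier. The placeholder values are hypothetical; this is an illustrative convention, not an official tool.

```python
# Sketch of a project-name builder following the pattern
# COMPANY_INITIAL_IDENTIFIER-ENVIRONMENT-APP_NAME, enforcing the
# 30-character project name limit. Example values are hypothetical.

MAX_PROJECT_NAME_LEN = 30

def project_name(company_id: str, environment: str, app_name: str) -> str:
    name = f"{company_id}-{environment}-{app_name}"
    if len(name) > MAX_PROJECT_NAME_LEN:
        raise ValueError(
            f"project name {name!r} exceeds {MAX_PROJECT_NAME_LEN} characters"
        )
    return name

print(project_name("acme01", "prod", "billing"))  # acme01-prod-billing
```

Centralizing the convention in one helper (or an IaC module) keeps names consistent across teams and prevents over-limit names from reaching deployment.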

https://cloud.google.com/architecture/framework/system-design/resource-management

46
Q

What factors should an architect consider when creating an organization policy?

A

Use the Organization Policy Service to control resources
The Organization Policy Service gives policy administrators centralized and programmatic control over your organization’s cloud resources so that they can configure constraints across the resource hierarchy. For more information, see Add an organization policy administrator.

Use the Organization Policy Service to comply with regulatory policies
To meet compliance requirements, use the Organization Policy Service to enforce compliance requirements for resource sharing and access. For example, you can limit sharing with external parties or determine where to deploy cloud resources geographically. Other compliance scenarios include the following:

Centralizing control to configure restrictions that define how your organization’s resources can be used.
Defining and establishing policies to help your development teams remain within compliance boundaries.
Helping project owners and their teams make system changes while maintaining regulatory compliance and minimizing concerns about breaking compliance rules.

https://cloud.google.com/architecture/framework/system-design/resource-management

47
Q

What best practices should you keep in mind when you organize the hierarchy for your resources?

A

Google Cloud resources are arranged hierarchically in organizations, folders, and projects. This hierarchy lets you manage common aspects of your resources like access control, configuration settings, and policies.
For best practices to design the hierarchy of your cloud resources, see

Decide a resource hierarchy for your Google Cloud landing zone based on the flowchart and resources.

https://cloud.google.com/resource-manager/docs/cloud-platform-resource-hierarchy

https://cloud.google.com/architecture/landing-zones/decide-resource-hierarchy

https://cloud.google.com/architecture/framework/system-design/resource-management

48
Q

Why is selecting compute options so important for an architect to consider?

A

Computation is at the core of many workloads, whether it refers to the execution of custom business logic or the application of complex computational algorithms against datasets. Most solutions use compute resources in some form, and it’s critical that you select the right compute resources for your application needs.

Google Cloud provides several options for using time on a CPU. Options are based on CPU types, performance, and how your code is scheduled to run, including usage billing.

Google Cloud compute options include the following:

Virtual machines (VM) with cloud-specific benefits like live migration.
Bin-packing of containers on cluster-machines that can share CPUs.
Functions and serverless approaches, where your use of CPU time can be metered to the work performed during a single HTTP request.

https://cloud.google.com/architecture/framework/system-design/compute

49
Q

When you choose a compute platform for your workload, what should you consider?

A

The technical requirements of the workload, lifecycle automation processes, regionalization, and security.

Evaluate the nature of CPU usage by your app and the entire supporting system, including how your code is packaged and deployed, distributed, and invoked. While some scenarios might be compatible with multiple platform options, a portable workload should be capable and performant on a range of compute options.

Choose a compute migration approach
If you’re migrating your existing applications from another cloud or from on-premises, use one of the following Google Cloud products to help you optimize for performance, scale, cost, and security.

https://cloud.google.com/architecture/framework/system-design/compute

50
Q

How does an architect design workloads for their solutions?

A

This section provides best practices for designing workloads to support your system.

Evaluate serverless options for simple logic
Simple logic is a type of compute that doesn’t require specialized hardware or machine types like CPU-optimized machines. Before you invest in Google Kubernetes Engine (GKE) or Compute Engine implementations to abstract operational overhead and optimize for cost and performance, evaluate serverless options for lightweight logic.

Decouple your applications to be stateless
Where possible, decouple your applications to be stateless to maximize use of serverless computing options. This approach lets you use managed compute offerings, scale applications based on demand, and optimize for cost and performance. For more information about decoupling your application to design for scale and high availability, see Design for scale and high availability.

Use caching logic when you decouple architectures
If your application is designed to be stateful, use caching logic to decouple and make your workload scalable. For more information, see Database best practices.

Use live migrations to facilitate upgrades
To facilitate Google maintenance upgrades, use live migration by setting instance availability policies. For more information, see Set VM host maintenance policy.

https://cloud.google.com/architecture/framework/system-design/compute

51
Q

What are the best practices an architect uses for scaling workloads for a solution?

A

This section provides best practices for scaling workloads to support your system.

Use startup and shutdown scripts
For stateful applications, use startup and shutdown scripts where possible to start and stop your application state gracefully. A graceful startup is when a computer is turned on by a software function and the operating system is allowed to perform its tasks of safely starting processes and opening connections.

Graceful startups and shutdowns are important because stateful applications depend on immediate availability of the data that sits close to the compute, usually on local or persistent disks, or in RAM. To avoid reprocessing application data from scratch on each startup, use a startup script to reload the last saved data and resume the process from where it stopped at shutdown. To preserve the application's memory state and avoid losing progress on shutdown, use a shutdown script. For example, use a shutdown script when a VM is scheduled to shut down due to downscaling or Google maintenance events.
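The save-and-reload pattern above can be sketched at the application level — a minimal, hypothetical checkpoint scheme. The file path and state shape are illustrative assumptions; this is application code that a startup or shutdown script would invoke, not a Compute Engine API.

```python
import json
from pathlib import Path

# Minimal sketch of the checkpoint pattern: a shutdown hook persists
# in-memory state, and the startup path reloads it so work resumes
# where it stopped. Path and state shape are hypothetical.
CHECKPOINT = Path("/tmp/app-state.json")

def save_state(state: dict) -> None:
    """Called from a shutdown script/handler before the VM stops."""
    CHECKPOINT.write_text(json.dumps(state))

def load_state() -> dict:
    """Called from a startup script; returns the last saved state, if any."""
    if CHECKPOINT.exists():
        return json.loads(CHECKPOINT.read_text())
    return {"next_item": 0}  # fresh start

save_state({"next_item": 1042})
print(load_state())  # {'next_item': 1042}
```

In practice the checkpoint would go to a persistent disk or Cloud Storage rather than /tmp, so it survives the VM itself.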

Use MIGs to support VM management
When you use Compute Engine VMs, managed instance groups (MIGs) support features like autohealing, load balancing, autoscaling, auto updating, and stateful workloads. You can create zonal or regional MIGs based on your availability goals. You can use MIGs for stateless serving or batch workloads and for stateful applications that need to preserve each VM’s unique state.

Use pod autoscalers to scale your GKE workloads
Use horizontal and vertical Pod autoscalers to scale your workloads, and use node auto-provisioning to scale underlying compute resources.

Distribute application traffic
To scale your applications globally, use Cloud Load Balancing to distribute your application instances across more than one region or zone. Load balancers optimize packet routing from Google Cloud edge networks to the nearest zone, which increases serving traffic efficiency and minimizes serving costs. To optimize for end-user latency, use Cloud CDN to cache static content where possible.

Automate compute creation and management
Minimize human-induced errors in your production environment by automating compute creation and management.

https://cloud.google.com/architecture/framework/system-design/compute

52
Q

What should an architect consider when managing compute operations to support your system?

A

This section provides best practices for managing operations to support your system.

Use Google-supplied public images
Use public images supplied by Google Cloud. The Google Cloud public images are regularly updated. For more information, see List of public images available on Compute Engine.

You can also create your own images with specific configurations and settings. Where possible, automate and centralize image creation in a separate project that you can share with authorized users within your organization. Creating and curating a custom image in a separate project lets you update, patch, and create a VM using your own configurations. You can then share the curated VM image with relevant projects.

Use snapshots for instance backups
Snapshots let you create backups for your instances. Snapshots are especially useful for stateful applications, which aren’t flexible enough to maintain state or save progress when they experience abrupt shutdowns. If you frequently use snapshots to create new instances, you can optimize your backup process by creating a base image from that snapshot.

Use a machine image to enable VM instance creation
Although a snapshot only captures an image of the data inside a machine, a machine image captures machine configurations and settings, in addition to the data. Use a machine image to store all of the configurations, metadata, permissions, and data from one or more disks that are needed to create a VM instance.

When you create a machine from a snapshot, you must configure instance settings on the new VM instances, which requires a lot of work. Using machine images lets you copy those known settings to new machines, reducing overhead. For more information, see When to use a machine image.

https://cloud.google.com/architecture/framework/system-design/compute

53
Q

What are best practices for managing capacity, reservations, and isolation to support your system design?

A

Capacity, reservations, and isolation
This section provides best practices for managing capacity, reservations, and isolation to support your system.

Use committed-use discounts to reduce costs
You can reduce your operational expenditure (OPEX) cost for workloads that are always on by using committed use discounts. For more information, see the Cost optimization category.

Choose machine types to support cost and performance
Google Cloud offers machine types that let you choose compute based on cost and performance parameters. You can choose a low-performance offering to optimize for cost or choose a high-performance compute option at higher cost. For more information, see the Cost optimization category.

Use sole-tenant nodes to support compliance needs
Sole-tenant nodes are physical Compute Engine servers that are dedicated to hosting only your project’s VMs. Sole-tenant nodes can help you to meet compliance requirements for physical isolation, including the following:

Keep your VMs physically separated from VMs in other projects.
Group your VMs together on the same host hardware.
Isolate payments processing workloads.
For more information, see Sole-tenant nodes.

Use reservations to ensure resource availability
Google Cloud lets you define reservations for your workloads to ensure those resources are always available. There is no additional charge to create reservations, but you pay for the reserved resources even if you don’t use them. For more information, see Consuming and managing reservations.

https://cloud.google.com/architecture/framework/system-design/compute

54
Q

What are the best practices an architect must consider to support VM Migration for a solution?

A

Evaluate built-in migration tools
Evaluate built-in migration tools to move your workloads from another cloud or from on-premises. For more information, see Migration to Google Cloud. Google Cloud offers tools and services to help you migrate your workloads and optimize for cost and performance. To receive a free migration cost assessment based on your current IT landscape, see Google Cloud Rapid Assessment & Migration Program.

Use virtual disk import for customized operating systems
To import customized supported operating systems, see Importing virtual disks. Sole-tenant nodes can help you meet your hardware bring-your-own-license requirements for per-core or per-processor licenses. For more information, see Bringing your own licenses.

https://cloud.google.com/architecture/framework/system-design/compute

55
Q

What are the core principles of VPC network design an architect will consider?

A

Developing your cloud networking design includes the following steps:

Design the workload VPC architecture. Start by identifying how many Google Cloud projects and VPC networks you require.
Add inter-VPC connectivity. Design how your workloads connect to other workloads in different VPC networks.
Design hybrid network connectivity. Design how your workload VPCs connect to on-premises and other cloud environments.
When you design your Google Cloud network, consider the following:

A VPC provides a private networking environment in the cloud for interconnecting services that are built on Compute Engine, Google Kubernetes Engine (GKE), and Serverless Computing Solutions. You can also use a VPC to privately access Google-managed services such as Cloud Storage, BigQuery, and Cloud SQL.
VPC networks, including their associated routes and firewall rules, are global resources; they aren’t associated with any particular region or zone.
Subnets are regional resources. Compute Engine VM instances that are deployed in different zones in the same cloud region can use IP addresses from the same subnet.
Traffic to and from instances can be controlled by using VPC firewall rules.
Network administration can be secured by using Identity and Access Management (IAM) roles.
VPC networks can be securely connected in hybrid environments by using Cloud VPN or Cloud Interconnect.

https://cloud.google.com/architecture/framework/system-design/networking

56
Q

What properties of VPC networks should you keep in mind when designing your network infrastructure?

A

VPC networks have the following properties:

VPC networks, including their associated routes and firewall rules, are global resources. They are not associated with any particular region or zone.

Subnets are regional resources.

Each subnet defines a range of IPv4 addresses. Subnets in custom mode VPC networks can also have a range of IPv6 addresses.

Traffic to and from instances can be controlled with network firewall rules. Rules are implemented on the VMs themselves, so traffic can only be controlled and logged as it leaves or arrives at a VM.

Resources within a VPC network can communicate with one another by using internal IPv4 addresses, internal IPv6 addresses, or external IPv6 addresses, subject to applicable network firewall rules. For more information, see communication within the network.

Instances with internal IPv4 or IPv6 addresses can communicate with Google APIs and services. For more information, see Private access options for services.

Network administration can be secured by using Identity and Access Management (IAM) roles.

An organization can use Shared VPC to keep a VPC network in a common host project. Authorized IAM principals from other projects in the same organization can create resources that use subnets of the Shared VPC network.

VPC networks can be connected to other VPC networks in different projects or organizations by using VPC Network Peering.

VPC networks can be securely connected in hybrid environments by using Cloud VPN or Cloud Interconnect.

VPC networks support GRE traffic, including traffic on Cloud VPN and Cloud Interconnect. VPC networks do not support GRE for Cloud NAT or for forwarding rules for load balancing and protocol forwarding. Support for GRE allows you to terminate GRE traffic on a VM from the internet (external IP address) and Cloud VPN or Cloud Interconnect (internal IP address). The decapsulated traffic can then be forwarded to a reachable destination. GRE enables you to use services such as Secure Access Service Edge (SASE) and SD-WAN.

https://cloud.google.com/architecture/framework/system-design/networking

https://cloud.google.com/vpc/docs/vpc#specifications

57
Q

What are best practices an architect uses for workload VPC architecture?

A

This section provides best practices for designing workload VPC architectures to support your system.

Consider VPC network design early
Make VPC network design an early part of designing your organizational setup in Google Cloud. Organizational-level design choices can’t be easily reversed later in the process. For more information, see Best practices and reference architectures for VPC design and Decide the network design for your Google Cloud landing zone.

Start with a single VPC network
For many use cases that include resources with common requirements, a single VPC network provides the features that you need. Single VPC networks are simple to create, maintain, and understand. For more information, see VPC Network Specifications.

Keep VPC network topology simple
To ensure a manageable, reliable, and well-understood architecture, keep the design of your VPC network topology as simple as possible.

Use VPC networks in custom mode
To ensure that Google Cloud networking integrates seamlessly with your existing networking systems, we recommend that you use custom mode when you create VPC networks. Using custom mode helps you integrate Google Cloud networking into existing IP address management schemes and it lets you control which cloud regions are included in the VPC. For more information, see VPC.

https://cloud.google.com/architecture/framework/system-design/networking

58
Q

What are best practices an architect considers for designing inter-VPC connectivity to support your system?

A

This section provides best practices for designing inter-VPC connectivity to support your system.

Choose a VPC connection method
If you decide to implement multiple VPC networks, you need to connect those networks. VPC networks are isolated tenant spaces within Google’s Andromeda software-defined network (SDN). There are several ways that VPC networks can communicate with each other. Choose how you connect your network based on your bandwidth, latency, and service level agreement (SLA) requirements. To learn more about the connection options, see Choose the VPC connection method that meets your cost, performance, and security needs.

Use Shared VPC to administer multiple working groups
For organizations with multiple teams, Shared VPC provides an effective tool to extend the architectural simplicity of a single VPC network across multiple working groups.

Use simple naming conventions
Choose simple, intuitive, and consistent naming conventions. Doing so helps administrators and users to understand the purpose of each resource, where it’s located, and how it’s differentiated from other resources.

Use connectivity tests to verify network security
In the context of network security, you can use connectivity tests to verify that traffic you intend to prevent between two endpoints is blocked. To verify that traffic is blocked and why it’s blocked, define a test between two endpoints and evaluate the results. For example, you might test a VPC feature that lets you define rules that support blocking traffic. For more information, see Connectivity Tests overview.

Use Private Service Connect to create private endpoints
To create private endpoints that let you access Google services with your own IP address scheme, use Private Service Connect. You can access the private endpoints from within your VPC and through hybrid connectivity that terminates in your VPC.

Secure and limit external connectivity
Limit internet access only to those resources that need it. Resources with only a private, internal IP address can still access many Google APIs and services through Private Google Access.

Use Network Intelligence Center to monitor your cloud networks
Network Intelligence Center provides a comprehensive view of your Google Cloud networks across all regions. It helps you to identify traffic and access patterns that can cause operational or security risks.

https://cloud.google.com/architecture/framework/system-design/networking

59
Q

What are the key principles to consider when designing a storage solution?

A

To facilitate data exchange and securely back up and store data, organizations need to choose a storage plan based on workload, input/output operations per second (IOPS), latency, retrieval frequency, location, capacity, and format (block, file, and object).

Cloud Storage provides reliable, secure object storage services, including the following:

Built-in redundancy options to protect your data against equipment failure and to ensure data availability during data center maintenance.
Data transfer options, including the following:
Storage Transfer Service
Transfer Appliance
BigQuery Data Transfer Service
Migration to Google Cloud: Transferring your large datasets
Storage classes to support your workloads.
Calculated checksums for all Cloud Storage operations that enable Google to verify reads and writes.
In Google Cloud, IOPS scales according to your provisioned storage space. Storage types like Persistent Disk require manual replication and backup because they are zonal or regional. By contrast, object storage is highly available and it automatically replicates data across a single region or across multiple regions.
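As a rough illustration of how IOPS scales with provisioned space, the sketch below uses assumed per-GB read rates; the actual per-type limits and caps are in the Persistent Disk performance documentation:

```python
# Sketch: how zonal Persistent Disk read IOPS scale with provisioned size.
# The per-GB rates below are illustrative assumptions; check the current
# Persistent Disk performance documentation for exact limits and caps.
READ_IOPS_PER_GB = {
    "pd-standard": 0.75,
    "pd-balanced": 6,
    "pd-ssd": 30,
}

def estimated_read_iops(disk_type: str, size_gb: int) -> float:
    """Estimate baseline read IOPS for a provisioned disk size."""
    return READ_IOPS_PER_GB[disk_type] * size_gb

# Provisioning a larger disk raises the IOPS baseline:
print(estimated_read_iops("pd-ssd", 100))
```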

https://cloud.google.com/architecture/framework/system-design/storage

60
Q

How do you choose an archival storage system based on the user needs?

A

Choose active or archival storage based on storage access needs
A storage class is a piece of metadata that is used by every object. For data that is served at a high rate with high availability, use the Standard Storage class. For data that is infrequently accessed and can tolerate slightly lower availability, use the Nearline Storage, Coldline Storage, or Archive Storage class. For more information about cost considerations for choosing a storage class, see Cloud Storage pricing.
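As an illustration, a hypothetical helper could map expected access frequency to a class, using thresholds that mirror the 30-, 90-, and 365-day minimum storage durations of Nearline, Coldline, and Archive:

```python
# Hypothetical helper: pick a Cloud Storage class from how often the data
# is expected to be read. Thresholds mirror the minimum storage durations
# of Nearline (30 days), Coldline (90 days), and Archive (365 days).
def choose_storage_class(days_between_accesses: float) -> str:
    if days_between_accesses < 30:
        return "STANDARD"   # served at a high rate, high availability
    if days_between_accesses < 90:
        return "NEARLINE"   # roughly monthly access
    if days_between_accesses < 365:
        return "COLDLINE"   # roughly quarterly access
    return "ARCHIVE"        # yearly or rarer access
```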

https://cloud.google.com/architecture/framework/system-design/storage

61
Q

What can an architect do to minimize access latency for specific objects?

A

Use Cloud CDN to improve static object delivery
To optimize the cost to retrieve objects and minimize access latency, use Cloud CDN. Cloud CDN uses the Cloud Load Balancing external HTTP(S) load balancer to provide routing, health checking, and anycast IP address support. For more information, see Setting up Cloud CDN with cloud buckets.

https://cloud.google.com/architecture/framework/system-design/storage

62
Q

What should an architect consider when choosing between disk and Cloud Storage options?

A

Use Persistent Disk to support high-performance storage access
Data access patterns depend on how you design system performance. Cloud Storage provides scalable storage, but it isn’t an ideal choice when you run heavy compute workloads that need high throughput access to large amounts of data. For high-performance storage access, use Persistent Disk.

Use exponential backoff when implementing retry logic
Use exponential backoff when implementing retry logic to handle 5XX, 408, and 429 errors. Each Cloud Storage bucket is provisioned with initial I/O capacity. For more information, see Request rate and access distribution guidelines. Plan a gradual ramp-up for retry requests.
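A minimal sketch of truncated exponential backoff with jitter; the callable-based interface is an illustration, not a Cloud Storage client API:

```python
import random
import time

# Sketch: truncated exponential backoff with jitter for retryable HTTP
# statuses (5XX, 408, 429), as recommended for Cloud Storage requests.
RETRYABLE = {408, 429, 500, 502, 503, 504}

def call_with_backoff(request, max_retries=5, base_delay=1.0, max_delay=32.0):
    """Call `request` (a callable returning an HTTP status code), retrying
    retryable statuses with exponentially growing, capped, jittered delays."""
    for attempt in range(max_retries + 1):
        status = request()
        if status not in RETRYABLE or attempt == max_retries:
            return status
        # Double the wait each attempt, cap it, and add random jitter so
        # many clients don't retry in lockstep.
        delay = min(max_delay, base_delay * 2 ** attempt) + random.uniform(0, base_delay)
        time.sleep(delay)
```

For example, a request that fails with 503 and then 429 before succeeding returns 200 after two backoff waits.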

https://cloud.google.com/architecture/framework/system-design/storage

63
Q

What are best practices for an architect to consider when developing the cloud storage design?

A

This section provides best practices for storage management to support your system.

Assign unique names to every bucket
Make every bucket name unique across the Cloud Storage namespace. Don’t include sensitive information in a bucket name. Choose bucket and object names that are difficult to guess. For more information, see the bucket naming guidelines and Object naming guidelines.

Keep Cloud Storage buckets private
Unless there is a business-related reason, ensure that your Cloud Storage bucket isn’t anonymously or publicly accessible. For more information, see Overview of access control.

Assign random object names to distribute load evenly
Assign random object names to facilitate performance and avoid hotspotting. Use a randomized prefix for objects where possible. For more information, see Use a naming convention that distributes load evenly across key ranges.
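For example, a short hash prefix spreads otherwise-sequential names across key ranges; the helper below is a sketch with a hypothetical name:

```python
import hashlib

# Sketch: prepend a short, deterministic hash so sequential object names
# (timestamps, counters) distribute evenly across key ranges instead of
# hotspotting one range.
def distributed_name(object_name: str, prefix_len: int = 6) -> str:
    digest = hashlib.md5(object_name.encode()).hexdigest()[:prefix_len]
    return f"{digest}_{object_name}"

# Uploads named by date no longer sort into one contiguous range:
print(distributed_name("logs/2024-05-01/0001.txt"))
```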

Use public access prevention
To prevent access at the organization, folder, project, or bucket level, use public access prevention. For more information, see Using public access prevention.

https://cloud.google.com/architecture/framework/system-design/storage

64
Q

Selecting the appropriate target database is one of the keys to a successful migration. What are the migration options for some use cases?

A

Choose a migration strategy
Selecting the appropriate target database is one of the keys to a successful migration. The following table provides migration options for some use cases:

Use case Recommendation
New development in Google Cloud. Select one of the managed databases that’s built for the cloud—Cloud SQL, Cloud Spanner, Bigtable, or Firestore—to meet your use-case requirements.
Lift-and-shift migration. Choose a compatible managed database service like Cloud SQL for MySQL, Cloud SQL for PostgreSQL, or Cloud SQL for SQL Server.
Your application requires granular access to a database that Cloud SQL doesn’t support. Run your database on Compute Engine VMs.

https://cloud.google.com/architecture/framework/system-design/databases

65
Q

What are Memorystore best practices?

A

Memorystore is a fully managed Redis and Memcached database that supports sub-millisecond latency.
Memorystore is fully compatible with open source Redis and Memcached.

If you use these caching databases in your applications, you can use Memorystore without making application-level changes in your code.

https://cloud.google.com/architecture/framework/system-design/databases

66
Q

What database best practices should an architect follow?

A

Use Bare Metal Solution to run an Oracle database
If your workloads require an Oracle database, use the Bare Metal Solution provided by Google Cloud. This approach fits into a lift-and-shift migration strategy.

If you want to move your workload to Google Cloud and modernize after your baseline workload is working, consider using managed database options like Spanner, Bigtable, and Firestore.

Databases built for the cloud are modern managed databases which are built from the bottom up on the cloud infrastructure. These databases provide unique default capabilities like scalability and high availability, which are difficult to achieve if you run your own database.

Modernize your database
Plan your database strategy early in the system design process, whether you’re designing a new application in the cloud or you’re migrating an existing database to the cloud. Google Cloud provides managed database options for open source databases such as Cloud SQL for MySQL and Cloud SQL for PostgreSQL. We recommend that you use the migration as an opportunity to modernize your database and prepare it to support future business needs.

Use fixed databases with off-the-shelf applications
Commercial off-the-shelf (COTS) applications require a fixed type of database and fixed configuration. Lift and shift is usually the most appropriate migration approach for COTS applications.

Verify your team’s database migration skill set
Choose a cloud database-migration approach based on your team’s database migration capabilities and skill sets. Use Google Cloud Partner Advantage to find a partner to support you throughout your migration journey.

Design your database to meet HA and DR requirements
When you design your databases to meet high availability (HA) and disaster recovery (DR) requirements, evaluate the tradeoffs between reliability and cost. Database services that are built for the cloud create multiple copies of your data within a region or in multiple regions, depending upon the database and configuration.

Some Google Cloud services have multi-regional variants, such as BigQuery and Cloud Spanner. To be resilient against regional failures, use these multi-regional services in your design where possible.

If you design your database on Compute Engine VMs instead of using managed databases on Google Cloud, ensure that you run multiple copies of your databases. For more information, see Design for scale and high availability in the Reliability category.

https://cloud.google.com/architecture/framework/system-design/databases

67
Q

What are the best practices for designing and scaling a database to support your system?

A

Database design and scaling
This section provides best practices for designing and scaling a database to support your system.

Use monitoring metrics to assess scaling needs
Use metrics from existing monitoring tools and environments to establish a baseline understanding of database size and scaling requirements—for example, right-sizing and designing scaling strategies for your database instance.

For new database designs, determine scaling numbers based on expected load and traffic patterns from the serving application. For more information, see Monitoring Cloud SQL instances, Monitoring with Cloud Monitoring, and Monitoring an instance.

https://cloud.google.com/architecture/framework/system-design/databases

68
Q

What best practices are defined for networking and access with databases to support the system?

A

This section provides best practices for managing networking and access to support your system.

Run databases inside a private network
Run your databases inside your private network and grant restricted access only from the clients who need to interact with the database. You can create Cloud SQL instances inside a VPC. Google Cloud also provides VPC Service Controls for Cloud SQL, Spanner, and Bigtable databases to ensure that access to these resources is restricted only to clients within authorized VPC networks.

Grant minimum privileges to users
Identity and Access Management (IAM) controls access to Google Cloud services, including database services. To minimize the risk of unauthorized access, grant the least number of privileges to your users. For application-level access to your databases, use service accounts with the least number of privileges.

https://cloud.google.com/architecture/framework/system-design/databases

70
Q

What are the best practices for defining automation and right-sizing to support databases in your system?

A

This section provides best practices for defining automation and right-sizing to support your system.

Define database instances as code
One of the benefits of migrating to Google Cloud is the ability to automate your infrastructure and other aspects of your workload like compute and database layers. Google Deployment Manager and third-party tools like Terraform Cloud let you define your database instances as code, which lets you apply a consistent and repeatable approach to creating and updating your databases.
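As a minimal sketch of the define-as-code idea, a Deployment Manager Python template returns resources as plain data. The resource type string and property names below are illustrative assumptions; verify them against the Deployment Manager and Cloud SQL Admin API references:

```python
# Illustrative Deployment Manager-style Python template for a Cloud SQL
# instance. The type string and property names are assumptions; check the
# Deployment Manager documentation for the exact schema.
def GenerateConfig(context):
    """Return the resources for this deployment as a dict."""
    return {
        "resources": [{
            "name": context.env["name"],
            "type": "sqladmin.v1beta4.instance",
            "properties": {
                "region": context.properties["region"],
                "settings": {"tier": context.properties["tier"]},
            },
        }]
    }
```

Because the template is ordinary code under version control, every database change is reviewable and repeatable.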

Use Liquibase to version control your database
Google database services like Cloud SQL and Cloud Spanner support Liquibase, an open source version control tool for databases. Liquibase helps you to track your database schema changes, roll back schema changes, and perform repeatable migrations.

Test and tune your database to support scaling
Perform load tests on your database instance and tune it based on the test results to meet your application’s requirements. Determine the initial scale of your database by load testing key performance indicators (KPIs) or by using monitoring KPIs derived from your current database.

When you create database instances, start with a size that is based on the testing results or historical monitoring metrics. Test your database instances with the expected load in the cloud. Then fine-tune the instances until you get the desired results for the expected load on your database instances.

Choose the right database for your scaling requirements
Scaling databases is different from scaling compute layer components. Databases have state; when one instance of your database isn’t able to handle the load, consider the appropriate strategy to scale your database instances. Scaling strategies vary depending on the database type.

https://cloud.google.com/architecture/framework/system-design/databases

71
Q

What are design principles when designing for Machine Learning?

A

Model development and training
Apply the following model development and training best practices to your own environment.

Choose managed or custom-trained model development
When you build your model, consider the highest level of abstraction possible. Use AutoML when possible so that the development and training tasks are handled for you. For custom-trained models, choose managed options for scalability and flexibility, instead of self-managed options. To learn more about model development options, see Use recommended tools and products.

Consider the Vertex AI training service instead of self-managed training on Compute Engine VMs or Deep Learning VM containers. For a JupyterLab environment, consider Vertex AI Workbench, which provides both managed and user-managed JupyterLab environments. For more information, see Machine learning development and Operationalized training.

Use pre-built or custom containers for custom-trained models
For custom-trained models on Vertex AI, you can use pre-built or custom containers depending on your machine learning framework and framework version. Pre-built containers are available for Python training applications that are created for specific TensorFlow, scikit-learn, PyTorch, and XGBoost versions.

Otherwise, you can choose to build a custom container for your training job. For example, use a custom container if you want to train your model using a Python ML framework that isn’t available in a pre-built container, or if you want to train using a programming language other than Python. In your custom container, pre-install your training application and all its dependencies onto an image that runs your training job.

Consider distributed training requirements
Consider your distributed training requirements. Some ML frameworks, like TensorFlow and PyTorch, let you run identical training code on multiple machines. These frameworks automatically coordinate division of work based on environment variables that are set on each machine. Other frameworks might require additional customization.
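For instance, TensorFlow's multi-worker strategies read a TF_CONFIG environment variable on each machine to coordinate the division of work. A sketch of constructing it, with hypothetical worker host names:

```python
import json
import os

# Sketch: build the TF_CONFIG value that TensorFlow multi-worker training
# reads on each machine. The host names are hypothetical placeholders.
def make_tf_config(workers, index):
    """Describe the cluster and this machine's role within it."""
    return {
        "cluster": {"worker": workers},
        "task": {"type": "worker", "index": index},
    }

workers = ["worker-0.example:2222", "worker-1.example:2222"]
# On worker 0, before starting training:
os.environ["TF_CONFIG"] = json.dumps(make_tf_config(workers, index=0))
```

Each machine gets the same cluster description but a different task index, which is how the framework assigns it a share of the work.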

https://cloud.google.com/architecture/framework/system-design/ai-ml

72
Q

How can you design for environmental sustainability?

A

Design for environmental sustainability

Understand your carbon footprint
To understand the carbon footprint from your Google Cloud usage, use the Carbon Footprint dashboard. The Carbon Footprint dashboard attributes emissions to the Google Cloud projects that you own and the cloud services that you use.

For more information, see Understand your carbon footprint in “Reduce your Google Cloud carbon footprint.”

Choose the most suitable cloud regions
One simple and effective way to reduce carbon emissions is to choose cloud regions with lower carbon emissions. To help you make this choice, Google publishes carbon data for all Google Cloud regions.

When you choose a region, you might need to balance lowering emissions with other requirements, such as pricing and network latency. To help select a region, use the Google Cloud Region Picker.
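One way to reason about that balance is a simple weighted score. The regions, carbon intensities, and latencies below are made-up illustrations, not Google's published figures:

```python
# Hypothetical data: rank candidate regions by a weighted score of grid
# carbon intensity (gCO2eq/kWh) and measured network latency (ms). Use
# Google's published carbon data and your own latency measurements.
regions = {
    "us-central1":  {"carbon": 450, "latency_ms": 90},
    "europe-west1": {"carbon": 110, "latency_ms": 35},
    "asia-east1":   {"carbon": 540, "latency_ms": 180},
}

def score(metrics, carbon_weight=0.5):
    """Lower is better; carbon_weight trades emissions against latency."""
    return carbon_weight * metrics["carbon"] + (1 - carbon_weight) * metrics["latency_ms"]

best = min(regions, key=lambda r: score(regions[r]))
print(best)
```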

For more information, see Choose the most suitable cloud regions in “Reduce your Google Cloud carbon footprint.”

Choose the most suitable cloud services
To help reduce your existing carbon footprint, consider migrating your on-premises VM workloads to Compute Engine.

Also consider that many workloads don’t require VMs. Often you can utilize a serverless offering instead. These managed services can optimize cloud resource usage, often automatically, which simultaneously reduces cloud costs and carbon footprint.

For more information, see Choose the most suitable cloud services in “Reduce your Google Cloud carbon footprint.”

Minimize idle cloud resources
Idle resources incur unnecessary costs and emissions. Some common causes of idle resources include the following:

Unused, active cloud resources, such as idle VM instances.
Over-provisioned resources, such as larger VM machine types than necessary for a workload.
Non-optimal architectures, such as lift-and-shift migrations that aren’t always optimized for efficiency. Consider making incremental improvements to these architectures.
The following are some general strategies to help minimize wasted cloud resources:

Identify idle or overprovisioned resources and either delete them or rightsize them.
Refactor your architecture to incorporate a more optimal design.
Migrate workloads to managed services.
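A sketch of the first strategy, using hypothetical utilization data; in practice this would come from Cloud Monitoring metrics:

```python
# Sketch: flag idle or overprovisioned VMs from monitoring data. The
# instance records and thresholds are hypothetical examples.
instances = [
    {"name": "batch-1", "avg_cpu": 0.02, "vcpus": 4},
    {"name": "web-1",   "avg_cpu": 0.55, "vcpus": 8},
    {"name": "dev-vm",  "avg_cpu": 0.08, "vcpus": 16},
]

def classify(inst, idle_cpu=0.05, oversized_cpu=0.20):
    """Bucket an instance by average CPU utilization."""
    if inst["avg_cpu"] < idle_cpu:
        return "idle: consider deleting"
    if inst["avg_cpu"] < oversized_cpu:
        return "overprovisioned: consider rightsizing"
    return "ok"

for inst in instances:
    print(inst["name"], classify(inst))
```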

https://cloud.google.com/architecture/framework/system-design/sustainability

73
Q

Automate your deployments

A

Automation helps you standardize your builds, tests, and deployments by eliminating human-induced errors for repeated processes like code updates. This section describes how to use various checks and guards as you automate. A standardized machine-controlled process helps ensure that your deployments are applied safely. It also provides a mechanism to restore previous deployments as needed without significantly affecting your user’s experience.

Store your code in central code repositories

Use continuous integration and continuous deployment (CI/CD)
Automate your deployments using a continuous integration and continuous deployment (CI/CD) approach. A CI/CD approach is a combination of pipelines that you configure and processes that your development team follows.

A CI/CD approach increases deployment velocity by making your software development team more productive. This approach lets developers make smaller and more frequent changes that are thoroughly tested while reducing the time needed to deploy those changes.

Provision and manage your infrastructure using infrastructure as code
Infrastructure as code is the use of a descriptive model to manage infrastructure, such as VMs, and configurations, such as firewall rules. Infrastructure as code lets you do the following:

Create your cloud resources automatically, including the deployment or test environments for your CI/CD pipeline.
Treat infrastructure changes like you treat application changes. For example, ensure changes to the configuration are reviewed, tested, and can be audited.
Have a single version of the truth for your cloud infrastructure.
Replicate your cloud environment as needed.
Roll back to a previous configuration if necessary.

Incorporate testing throughout the software delivery lifecycle
Testing is critical to successfully launching your software. Continuous testing helps teams create high-quality software faster and enhance software stability.

Launch deployments gradually
Choose your deployment strategy based on important parameters, like minimum disruption to end users, rolling updates, rollback strategies, and A/B testing strategies. For each workload, evaluate these requirements and pick a deployment strategy from proven techniques, such as rolling updates, blue/green deployments, and canary deployments.

Restore previous releases seamlessly
Define your restoration strategy as part of your deployment strategy. Ensure that you can roll back a deployment, or an infrastructure configuration, to a previous version of the source code. Restoring a previous stable deployment is an important step in incident management for both reliability and security incidents.

Monitor your CI/CD pipelines
To keep your automated build, test, and deploy process running smoothly, monitor your CI/CD pipelines. Set alerts that indicate when anything in any pipeline fails. Each step of your pipeline should write suitable log statements so that your team can perform root cause analysis if a pipeline fails.

https://cloud.google.com/architecture/framework/operational-excellence/automate-your-deployments

74
Q

Set up monitoring, alerting, and logging

A

Use the following four golden signals to monitor your system:

Latency. The time it takes to service a request.
Traffic. How much demand is being placed on your system.
Errors. The rate of requests that fail. Failure can be explicit (for example, HTTP 500s), implicit (for example, an HTTP 200 success response coupled with the wrong content), or by policy (for example, if you commit to one-second response times, any request over one second is an error).
Saturation. How full your service is. Saturation is a measure of your system’s utilization, emphasizing the resources that are most constrained (that is, in a memory-constrained system, show memory; in an I/O-constrained system, show I/O).
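A minimal sketch of computing these signals from a batch of request records; the records, window, and one-second latency target are illustrative:

```python
# Sketch: derive the four golden signals from sample request records.
# Records, the 60 s window, and the 1 s latency target are illustrative.
requests = [
    {"latency_s": 0.12, "status": 200},
    {"latency_s": 1.40, "status": 200},  # over the 1 s target: an error by policy
    {"latency_s": 0.30, "status": 500},  # explicit failure
    {"latency_s": 0.08, "status": 200},
]
window_s = 60.0

latency_avg = sum(r["latency_s"] for r in requests) / len(requests)
traffic_qps = len(requests) / window_s
errors = sum(1 for r in requests if r["status"] >= 500 or r["latency_s"] > 1.0)
error_rate = errors / len(requests)
saturation = 0.72  # e.g., fraction of memory in use in a memory-bound system

print(f"latency={latency_avg:.3f}s traffic={traffic_qps:.3f}qps errors={error_rate:.0%}")
```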

Create a monitoring plan

Include the following details in your monitoring plan:

Include all your systems, including on-premises resources and cloud resources.
Include monitoring of your cloud costs to help make sure that scaling events don’t cause usage to cross your budget thresholds.
Build different monitoring strategies for measuring infrastructure performance, user experience, and business key performance indicators (KPIs). For example, static thresholds might work well to measure infrastructure performance but don’t truly reflect the user’s experience.
Update the plan as your monitoring strategies mature. Iterate on the plan to improve the health of your systems.

Define metrics that measure all aspects of your organization
Use these metrics to create service level indicators (SLIs) for your applications. For more information, see Choose appropriate SLIs.

Choose a monitoring solution that:

Is platform independent
Provides uniform capabilities for monitoring of on-premises, hybrid, and multi-cloud environments
Using a single platform to consolidate the monitoring data that comes in from different sources lets you build uniform metrics and visualization dashboards.

As you set up monitoring, automate monitoring tasks where possible.

Monitoring with Google Cloud
Using a monitoring service, such as Cloud Monitoring, is easier than building a monitoring service yourself. Monitoring a complex application is a substantial engineering endeavor by itself. Even with existing infrastructure for instrumentation, data collection and display, and alerting in place, it is a full-time job for someone to build and maintain.

Cloud Monitoring is a managed service that is part of Google Cloud’s operations suite. You can use Cloud Monitoring to monitor Google Cloud services and custom metrics. Cloud Monitoring provides an API for integration with third-party monitoring tools.

Cloud Monitoring aggregates metrics, logs, and events from your system’s cloud-based infrastructure. That data gives developers and operators a rich set of observable signals that can speed root-cause analysis and reduce mean time to resolution. You can use Cloud Monitoring to define alerts and custom metrics that meet your business objectives and help you aggregate, visualize, and monitor system health.

Cloud Monitoring provides default dashboards for cloud and open source application services. Using the metrics model, you can define custom dashboards with powerful visualization tools and configure charts in Metrics Explorer.

Set up alerting
As you set up alerting, map alerts directly to critical metrics. These critical metrics include:

The four golden signals:
Latency
Traffic
Errors
Saturation
System health
Service usage
Security events
User experience
Make alerts actionable to minimize the time to resolution. To do so, for each alert:

Include a clear description, including stating what is monitored and its business impact.
Provide all the information necessary to act immediately. If it takes a few clicks and navigation to understand alerts, it is challenging for the on-call person to act.
Define priority levels for various alerts.
Clearly identify the person or team responsible for responding to the alert.
For critical applications and services, build self-healing actions into the alerts triggered due to common fault conditions such as service health failure, configuration change, or throughput spikes.

As you set up alerts, try to eliminate toil.

Build monitoring and alerting dashboards
Once monitoring is in place, build relevant, uncomplicated dashboards that include information from your monitoring and alerting systems.

Choose a visualization approach for your dashboards that ties into your reliability goals. Create dashboards to visualize both:

Short-term and real-time analysis
Long-term analysis

Logging the data your systems generate helps you ensure an effective security posture. For more information about logging and security, see Implement logging and detective controls in the security category of the Architecture Framework.

Cloud Logging is an integrated logging service you can use to store, search, analyze, monitor, and alert on log data and events. Logging automatically collects logs from the services of Google Cloud and other cloud providers. You can use these logs to build metrics for monitoring and to create logging exports to external services such as Cloud Storage, BigQuery, and Pub/Sub.

Set up an audit trail
To help answer questions like “who did what, where, and when” in your Google Cloud projects, use Cloud Audit Logs.

Cloud Audit Logs captures several types of activity, such as the following:

Admin Activity logs contain log entries for API calls or other administrative actions that modify the configuration or metadata of resources. Admin Activity logs are always enabled.

https://cloud.google.com/architecture/framework/operational-excellence/set-up-monitoring-alerting-logging

75
Q

Establish cloud support and escalation processes

A

Establish support from your providers
Purchase a support contract from your cloud provider or other third-party service providers. Support is critical to ensure the prompt response and resolution of various operational issues.
To work with Google Cloud Customer Care, consider purchasing a Customer Care offering that includes Standard, Enhanced, or Premium Support. Consider using Enhanced or Premium Support for your major production environments.
Define your escalation process
A well-defined escalation process is key to reducing the effort and time that it takes to identify and address any issues in your systems. This includes issues that require support for Google Cloud products or for other cloud providers or third-party services.
Ensure you receive communication from support
Ensure that your administrators are receiving communication from your cloud providers and third-party services. This information allows admins to make informed decisions and fix issues before they cause larger problems. Ensure that the following are true:
Establish review processes
Establish a review or postmortem process. Follow these processes after you raise a new support ticket or escalate an existing support ticket.
Build centers of excellence
It can be valuable to capture your organization’s information, experience, and patterns in an internal knowledge base, such as a wiki, Google site, or intranet site. As new products and features are continually being rolled out in Google Cloud, this knowledge can help track why you chose a particular design for your applications and services. For more information, see Architecture decision records.

https://cloud.google.com/architecture/framework/operational-excellence/establish-cloud-support-and-escalation-processes

76
Q

Plan for peak traffic and launch events

A

Peak and launch events include three stages:
Planning and preparation for the launch or peak traffic event
Launching the event
Reviewing event performance
The practices described in this document can help each of these stages run smoothly.
Create a general playbook for launch and peak events
Build a general playbook with a long-term view of current and future peak events. Keep adding lessons learned to the document, so it can be a reference for future peak events.
Plan for your launch and for peak events
Plan ahead. Create business projections for upcoming launches and for expected (and unexpected) peak events. Preparing your system for scale spikes depends on understanding your business projections. The more you know about prior forecasts, the more accurate you can make your new business forecasts. Those new forecasts are critical inputs into projecting expected demand on your system.
Establish review processes
When the peak traffic event or launch event is over, review the event to document the lessons you learned. Then, update your playbook with those lessons. Finally, apply what you learned to the next major event. Learning from prior events is important, especially when they highlight constraints to the system while under stress.
Retrospective reviews, also called postmortems, for peak traffic events or launch events are a useful technique for capturing data and understanding the incidents that occurred. Do this review for peak traffic and launch events that went as expected, and for any incidents that caused problems. As you do this review foster a blameless culture.

https://cloud.google.com/architecture/framework/operational-excellence/plan-for-peak-traffic-and-launch-events

77
Q

Create a culture of automation

A

Create a culture of automation
Toil is manual and repetitive work with no enduring value, and it increases as a service grows. Continually aim to reduce or eliminate toil. Otherwise, operational work can eventually overwhelm operators, and any growth in product use or complexity can require additional staffing.
Automation is a key way to minimize toil. Automation also improves release velocity and helps minimize human-induced errors.
Create an inventory and assess the cost of toil
Start by creating an inventory and assessing the cost of toil on the teams managing your systems. Make this a continuous process, followed by investing in customized automation to extend what’s already provided by Google Cloud services and partners. You can often modify Google Cloud’s own automation—for example, Compute Engine’s autoscaler.
Prioritize eliminating toil
Automation is useful but isn’t a solution to all operational problems. As a first step in addressing known toil, we recommend reviewing your inventory of existing toil and prioritize eliminating as much toil as you can. Then, you can focus on automation.
Automate necessary toil
Some toil in your systems cannot be eliminated. As a second step in addressing known toil, automate this toil using the solutions that Google Cloud provides through configurable automation.

Build or buy solutions for high-cost toil
The third step, which can be completed in parallel with the first and second steps, entails evaluating building or buying other solutions if your toil cost stays high—for example, if toil takes a significant amount of time for any team managing your production systems.
When building or buying solutions, consider integration, security, privacy, and compliance costs. Designing and implementing your own automation comes with maintenance costs and risks to reliability beyond its initial development and setup costs, so consider this option as a last resort.

https://sre.google/workbook/eliminating-toil/

78
Q

What are common metrics for various components?

A

Metrics are generated at all levels of your service, from infrastructure and networking to business logic. For example:

Infrastructure metrics:
Virtual machine statistics, including instances, CPU, memory, utilization, and counts
Container-based statistics, including cluster utilization, cluster capacity, pod level utilization, and counts
Networking statistics, including ingress/egress, bandwidth between components, latency, and throughput
Requests per second, as measured by the load balancer
Total disk blocks read, per disk
Packets sent over a given network interface
Memory heap size for a given process
Distribution of response latencies
Number of invalid queries rejected by a database instance
Application metrics:
Application-specific behavior, including queries per second, writes per second, and messages sent per second
Managed services statistics metrics:
QPS, throughput, latency, utilization for Google-managed services (APIs or products such as BigQuery, App Engine, and Cloud Bigtable)
Network connectivity statistics metrics:
VPN/interconnect-related statistics about connecting to on-premises systems or systems that are external to Google Cloud.
SLIs
Metrics associated with the overall health of the system.
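SLIs like the ones listed above are typically ratios computed from lower-level metrics. A minimal sketch, with hypothetical request counts and a hypothetical 300 ms latency threshold:

```python
def availability_sli(good_requests, total_requests):
    """Availability SLI: fraction of requests served successfully."""
    if total_requests == 0:
        return 1.0  # no traffic means no failures to count
    return good_requests / total_requests

def latency_sli(latencies_ms, threshold_ms=300):
    """Latency SLI: fraction of requests faster than a threshold."""
    fast = sum(1 for latency in latencies_ms if latency < threshold_ms)
    return fast / len(latencies_ms)

print(availability_sli(9_990, 10_000))        # 0.999
print(latency_sli([120, 250, 310, 90, 480]))  # 0.6
```

In practice these counts come from the infrastructure and application metrics described above, aggregated by your monitoring system.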
Set up monitoring
Set up monitoring for both your on-premises resources and your cloud resources.

https://cloud.google.com/architecture/framework/operational-excellence/set-up-monitoring-alerting-logging

79
Q

How do you create an Escalation Process?

A

To create your escalation path:
Define when and how to escalate issues internally.
Define when and how to create support cases with your cloud provider or other third-party service provider.
Learn how to work with the teams that provide you support. For Google Cloud, you and your operations teams should review the Best practices for working with Customer Care. Incorporate these practices into your escalation path.
Find or create documents that describe your architecture. Ensure these documents include information that is helpful for support engineers.
Define how your teams communicate during an outage.
Ensure that people who need support have appropriate levels of support permissions to access the Google Cloud Support Center, or to communicate with other support providers. To learn about using the Google Cloud Support Center, visit Support procedures.
Set up monitoring, alerting, and logging so that you have the information needed to act on when issues arise.
Create templates for incident reporting. For information to include in your incident reports, see Best practices for working with Customer Care.
Document your organization’s escalation process. Ensure that you have clear, well-defined actions to address escalations.
Include a plan to teach new team members how to interact with support.
Regularly test your escalation process internally. Test your escalation process before major events, such as migrations, new product launches, and peak traffic events. If you have Google Cloud Customer Care Premium Support, your Technical Account Manager can help review your escalation process and coordinate your tests with Google Cloud Customer Care.

https://cloud.google.com/architecture/framework/operational-excellence/establish-cloud-support-and-escalation-processes

80
Q

How do you create a process for capacity planning?

A

As you create this plan do the following:
Run load tests to determine how much load the system can handle while meeting its latency targets, given a fixed amount of resources. Load tests should use a mix of request types that matches production traffic profiles from live users. Don’t use a uniform or random mix of operations. Include spikes in usage in your traffic profile.
Create a capacity model. A capacity model is a set of formulas for calculating incremental resources needed per unit increase in service load, as determined from load testing.
Forecast future traffic and account for growth. See the article Measure Future Load for a summary of how Google builds traffic forecasts.
Apply the capacity model to the forecast to determine future resource needs.
Estimate the cost of resources your organization needs. Then, get budget approval from your Finance organization. This step is essential because the business can choose to make cost versus risk tradeoffs across a range of products. Those tradeoffs can mean acquiring capacity that’s lower or higher than the predicted need for a given product based on business priorities.
Work with your cloud provider to get the correct amount of resources at the correct time with quotas and reservations. Involve infrastructure teams for capacity planning and have operations create capacity plans with confidence intervals.
Repeat the previous steps every quarter or two.
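The capacity-model and forecast steps above can be sketched as follows. The base VM count, VMs-per-1000-QPS slope, and 15% quarterly growth rate are hypothetical values that would in reality come from your load tests and traffic forecasts:

```python
import math

def capacity_model(qps, base_vms=4, vms_per_1000_qps=2.5):
    """Incremental resources per unit of load, as fitted from load testing.

    The coefficients here are placeholders; derive yours from load tests
    that mirror production traffic profiles.
    """
    return base_vms + math.ceil(qps / 1000 * vms_per_1000_qps)

# Forecast quarterly peak QPS under assumed growth, then apply the model
# to estimate future resource needs for budgeting and quota requests.
current_peak_qps = 8000
quarterly_growth = 1.15
for quarter in range(1, 5):
    forecast = current_peak_qps * quarterly_growth ** quarter
    print(f"Q{quarter}: ~{forecast:,.0f} QPS -> {capacity_model(forecast)} VMs")
```

Applying the model to a forecast rather than to current load is what turns load-test data into a capacity plan.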

https://www.usenix.org/publications/login/feb15/capacity-planning

81
Q

The following are some areas where configurable automation or customized automation can assist your organization in eliminating toil:

A

Identity management—for example, Cloud Identity and Identity and Access Management.
Google Cloud hosted solutions, as opposed to self-designed solutions—for example, cluster management (Google Kubernetes Engine (GKE)), relational database management (Cloud SQL), data warehouse management (BigQuery), and API management (Apigee).
Google Cloud services and tenant provisioning—for example, Terraform and the Cloud Foundation Toolkit.
Automated workflow orchestration for multi-step operations—for example, Cloud Composer.
Additional capacity provisioning—for example, several Google Cloud products, like Compute Engine and GKE, offer configurable autoscaling. Evaluate the Google Cloud services you are using to determine if they include configurable autoscaling.
CI/CD pipelines with automated deployment—for example, Cloud Build.
Canary analysis to validate deployments.
Automated model training (for machine learning)—for example, AutoML.
If a Google Cloud product or service only partially satisfies your technical needs when automating or eliminating manual workflows, consider filing a feature request through your Google Cloud account representative. Your issue might be a priority for other customers or already a part of our roadmap. If so, knowing the feature’s priority and timeline helps you to better assess the trade-offs of building your own solution versus waiting to use a Google Cloud feature.

82
Q

What is shared responsibility and shared fate on Google Cloud?

A

Understanding the shared responsibility model is important when determining how to best protect your data and workloads on Google Cloud. The shared responsibility model describes the tasks that you have when it comes to security in the cloud and how these tasks are different for cloud providers.
Understanding shared responsibility, however, can be challenging. The model requires an in-depth understanding of each service you utilize, the configuration options that each service provides, and what Google Cloud does to secure the service. Every service has a different configuration profile, and it can be difficult to determine the best security configuration. Google believes that the shared responsibility model stops short of helping cloud customers achieve better security outcomes. Instead of shared responsibility, we believe in shared fate.
Shared fate includes us building and operating a trusted cloud platform for your workloads. We provide best practice guidance and secured, attested infrastructure code that you can use to deploy your workloads in a secure way. We release solutions that combine various Google Cloud services to solve complex security problems and we offer innovative insurance options to help you measure and mitigate the risks that you must accept. Shared fate involves us more closely interacting with you as you secure your resources on Google Cloud.

https://cloud.google.com/architecture/framework/security/shared-responsibility-shared-fate

83
Q

What are the best practices for security principles?

A

Build a layered security approach
Implement security at each level in your application and infrastructure by applying a defense-in-depth approach. Use the features in each product to limit access and configure encryption where appropriate.
Design for secured decoupled systems
Simplify system design to accommodate flexibility where possible, and document security requirements for each component. Incorporate a robust secured mechanism to account for resiliency and recovery.
Automate deployment of sensitive tasks
Take humans out of the workstream by automating deployment and other admin tasks.
Automate security monitoring
Use automated tools to monitor your application and infrastructure. To scan your infrastructure for vulnerabilities and detect security incidents, use automated scanning in your continuous integration and continuous deployment (CI/CD) pipelines.
Meet the compliance requirements for your regions
Be mindful that you might need to obfuscate or redact personally identifiable information (PII) to meet your regulatory requirements. Where possible, automate your compliance efforts. For example, use Cloud Data Loss Prevention (Cloud DLP) and Dataflow to automate the PII redaction job before new data is stored in the system.
Comply with data residency and sovereignty requirements
You might have internal (or external) requirements that require you to control the locations of data storage and processing. These requirements vary based on systems design objectives, industry regulatory concerns, national law, tax implications, and culture.
Shift security left
DevOps and deployment automation let your organization increase the velocity of delivering products. To help ensure that your products remain secure, incorporate security processes from the start of the development process.

https://cloud.google.com/architecture/framework/security/security-principles

84
Q

How do you manage risks with controls?

A

Manage risk with controls
You should complete risk analysis before you deploy workloads on Google Cloud, and regularly afterwards as your business needs, regulatory requirements, and the threats relevant to your organization change.
Identify risks to your organization
Before you create and deploy resources on Google Cloud, complete a risk assessment to determine what security features you need in order to meet your internal security requirements and external regulatory requirements. Your risk assessment provides you with a catalog of risks that are relevant to you, and tells you how capable your organization is in detecting and counteracting security threats.
Your risks in a cloud environment differ from your risks in an on-premises environment due to the shared responsibility arrangement that you enter with your cloud provider. For example, in an on-premises environment you need to mitigate vulnerabilities to the hardware stack. In contrast, in a cloud environment these risks are borne by the cloud provider.
In addition, your risks differ depending on how you plan on using Google Cloud. Are you transferring some of your workloads to Google Cloud, or all of them? Are you using Google Cloud only for disaster recovery purposes? Are you setting up a hybrid cloud environment?
We recommend that you use an industry-standard risk assessment framework that applies to cloud environments and to your regulatory requirements. For example, the Cloud Security Alliance (CSA) provides the Cloud Controls Matrix (CCM). In addition, there are threat models such as OWASP application threat modeling that provide you with a list of potential gaps, and that suggest actions to remediate any gaps that are found. You can check our partner directory for a list of experts in conducting risk assessments for Google Cloud.
To help catalog your risks, consider Risk Manager, which is part of the Risk Protection Program. (This program is currently in preview.) Risk Manager scans your workloads to help you understand your business risks. Its detailed reports provide you with a security baseline. In addition, you can use Risk Manager reports to compare your risks against the risks outlined in the Center for Internet Security (CIS) Benchmark.
After you catalog your risks, you must determine how to address them—that is, whether you want to accept, avoid, transfer, or mitigate them. The following section describes mitigation controls.
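A cataloged risk with an accept/avoid/transfer/mitigate decision can be represented as a simple risk register. The risks, scores, and responses below are hypothetical examples, not guidance from the framework:

```python
from dataclasses import dataclass

# The four standard responses named above.
RESPONSES = {"accept", "avoid", "transfer", "mitigate"}

@dataclass
class Risk:
    name: str
    likelihood: int  # 1 (rare) .. 5 (almost certain)
    impact: int      # 1 (minor) .. 5 (severe)
    response: str

    @property
    def score(self):
        """A common likelihood x impact scoring scheme."""
        return self.likelihood * self.impact

# Hypothetical register entries.
register = [
    Risk("public bucket misconfiguration", 4, 5, "mitigate"),
    Risk("dependence on a single zone", 2, 4, "transfer"),
    Risk("legacy batch job failure", 3, 1, "accept"),
]

assert all(r.response in RESPONSES for r in register)
for r in sorted(register, key=lambda r: r.score, reverse=True):
    print(f"{r.score:>2}  {r.name} -> {r.response}")
```

Sorting by score surfaces the risks that most need mitigation controls.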

Mitigate your risks
You can mitigate risks using technical controls, contractual protections, and third-party verifications or attestations. The following list describes how you can use these mitigations when you adopt new public cloud services.

Technical controls: the features and technologies that you use to protect your environment. These include built-in cloud security controls, such as firewalls and logging. Technical controls can also include using third-party tools to reinforce or support your security strategy.

Contractual protections: the legal commitments made by us regarding Google Cloud services. Google is committed to maintaining and expanding our compliance portfolio. The Data Processing and Security Terms (DPST) document defines our commitment to maintaining our ISO 27001, 27017, and 27018 certifications and to updating our SOC 2 and SOC 3 reports every 12 months.

Third-party verifications or attestations: having a third-party vendor audit the cloud provider to ensure that the provider meets compliance requirements. For example, Google was audited by a third party for ISO 27017 compliance.

https://cloud.google.com/architecture/framework/security/risk-management

85
Q

Manage your assets

A

Manage your assets
Asset management is an important part of your business requirements analysis. You must know what assets you have, and you must have a good understanding of all your assets, their value, and any critical paths or processes related to them. You must have an accurate asset inventory before you can design any sort of security controls to protect your assets.
To manage security incidents and meet your organization’s regulatory requirements, you need an accurate and up-to-date asset inventory that includes a way to analyze historical data. You must be able to track your assets, including how their risk exposure might change over time.
Use cloud asset management tools
Google Cloud asset management tools are tailored specifically to our environment and to top customer use cases.
Automate asset management
Automation lets you quickly create and manage assets based on the security requirements that you specify. You can automate aspects of the asset lifecycle in the following ways:
Deploy your cloud infrastructure using automation tools such as Terraform. Google Cloud provides the security foundations blueprint, which helps you set up infrastructure resources that meet security best practices. In addition, it configures asset changes and policy compliance notifications in Cloud Asset Inventory.
Deploy your applications using automation tools such as Cloud Run and the Artifact Registry.
Monitor for deviations from your compliance policies
Deviations from policies can occur during all phases of the asset lifecycle. For example, assets might be created without the proper security controls, or their privileges might be escalated. Similarly, assets might be abandoned without the appropriate end-of-life procedures being followed.
Integrate with your existing asset management monitoring systems
If you already use a SIEM system or other monitoring system, integrate your Google Cloud assets with that system. Integration ensures that your organization has a single, comprehensive view into all resources, regardless of environment. For more information, see Export Google Cloud security data to your SIEM system and Scenarios for exporting Cloud Logging data: Splunk.
Use data analysis to enrich your monitoring
You can export your inventory to a BigQuery table or Cloud Storage bucket for additional analysis. For an example, see Tracking assets with IoT devices: Pycom, Sigfox, and Google Cloud.

https://cloud.google.com/architecture/framework/security/asset-management

86
Q

Manage identity and access

A

Manage identity and access
The practice of identity and access management (generally referred to as IAM) helps you ensure that the right people can access the right resources. IAM addresses the following aspects of authentication and authorization:
Account management, including provisioning
Identity governance
Authentication
Access control (authorization)
Identity federation
Managing IAM can be challenging when you have different environments or you use multiple identity providers. However, it’s critical that you set up a system that can meet your business requirements while mitigating risks.
The recommendations in this document help you review your current IAM policies and procedures and determine which of those you might need to modify for your workloads in Google Cloud. For example, you must review the following:
Whether you can use existing groups to manage access or whether you need to create new ones.
Your authentication requirements (such as multi-factor authentication (MFA) using a token).
The impact of service accounts on your current policies.
If you’re using Google Cloud for disaster recovery, maintaining appropriate separation of duties.
Within Google Cloud, you use Cloud Identity to authenticate your users and resources and Google’s Identity and Access Management (IAM) product to dictate resource access. Administrators can restrict access at the organization, folder, project, and resource level. Google IAM policies dictate who can do what on which resources. Correctly configured IAM policies help secure your environment by preventing unauthorized access to resources.
For more information, see Overview of identity and access management.
Use a single identity provider
Protect the super admin account
Plan your use of service accounts
A service account is a Google account that applications can use to call the Google API of a service.
Unlike your user accounts, service accounts are created and managed within Google Cloud. Service accounts also authenticate differently than user accounts:
Update your identity processes for the cloud
Set up SSO and MFA
Implement least privilege and separation of duties
You must ensure that the right individuals get access only to the resources and services that they need in order to perform their jobs. That is, you should follow the principle of least privilege. In addition, you must ensure there is an appropriate separation of duties.
Overprovisioning user access can increase the risk of insider threat, misconfigured resources, and non-compliance with audits. Underprovisioning permissions can prevent users from being able to access the resources they need in order to complete their tasks.
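One way to spot overprovisioning is to compare the permissions granted to each member against the permissions they actually exercise (for example, as observed in audit logs). The members and permission names below are hypothetical:

```python
# Hypothetical IAM grants vs. permissions actually used, e.g. from audit logs.
granted = {
    "alice@example.com": {"storage.objects.get", "storage.objects.delete",
                          "compute.instances.start"},
    "bob@example.com": {"bigquery.jobs.create"},
}
used = {
    "alice@example.com": {"storage.objects.get"},
    "bob@example.com": {"bigquery.jobs.create"},
}

def excess_permissions(granted, used):
    """Permissions granted but never exercised: candidates for removal
    under the principle of least privilege."""
    return {member: perms - used.get(member, set())
            for member, perms in granted.items()
            if perms - used.get(member, set())}

print(excess_permissions(granted, used))
```

An empty result for a member means their grants match their usage; nonempty results are review candidates, balanced against the underprovisioning risk noted above.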
Audit access
To monitor the activities of privileged accounts for deviations from approved conditions, use Cloud Audit Logs. Cloud Audit Logs records the actions that members in your Google Cloud organization have taken in your Google Cloud resources. You can work with various audit log types across Google services. For more information, see Using Cloud Audit Logs to Help Manage Insider Risk (video).
Automate your policy controls
Set access permissions programmatically whenever possible. For best practices, see Organization policy setup. The Terraform scripts for this example foundation are in the example foundation repository.
Set restrictions on resources
Google IAM focuses on who, and it lets you authorize who can act on specific resources based on permissions. The Organization Policy Service focuses on what, and it lets you set restrictions on resources to specify how they can be configured. For example, you can use an organization policy to do the following:
Limit resource sharing based on domain.
Limit the use of service accounts.
Restrict the physical location of newly created resources.
In addition to using organizational policies for these tasks, you can restrict access to resources using one of the following methods:
Use tags to manage access to your resources without defining the access permissions on each resource. Instead, you add the tag and then set the access definition for the tag itself.
Use IAM Conditions for conditional, attribute-based control of access to resources.
Implement defense-in-depth using VPC Service Controls to further restrict access to resources.

https://cloud.google.com/architecture/framework/security/identity-access

87
Q

How do you implement compute and container security?

A

Implement compute and container security
Google Cloud includes controls to protect your compute resources and Google Kubernetes Engine (GKE) container resources. This document in the Google Cloud Architecture Framework describes key controls and best practices for using them.
Use hardened and curated VM images
Google Cloud includes Shielded VM, which allows you to harden your VM instances. Shielded VM is designed to prevent malicious code from being loaded during the boot cycle. It provides boot security, monitors integrity, and uses the Virtual Trusted Platform Module (vTPM). Use Shielded VM for sensitive workloads.
Use Confidential Computing for processing sensitive data
By default, Google Cloud encrypts data at rest and in transit across the network, but data isn’t encrypted while it’s in use in memory. If your organization handles confidential data, you need to mitigate against threats that undermine the confidentiality and integrity of either the application or the data in system memory. Confidential data includes personally identifiable information (PII), financial data, and health information.
Confidential Computing builds on Shielded VM. It protects data in use by performing computation in a hardware-based trusted execution environment.
In Google Cloud, you can enable Confidential Computing by running Confidential VMs or Confidential GKE nodes.
Protect VMs and containers
OS Login lets your employees connect to your VMs using Identity and Access Management (IAM) permissions as the source of truth instead of relying on SSH keys. In the App Engine flexible environment, application instances run within Docker containers. To enable a defined risk profile and to restrict employees from making changes to containers, ensure that your containers are stateless and immutable. The principle of immutability means that your employees do not modify the container or access it interactively. If it must be changed, you build a new image and redeploy. Enable SSH access to the underlying containers only in specific debugging scenarios.
Disable external IP addresses unless they’re necessary
To disable external IP address allocation (video) for your production VMs and to prevent the use of external load balancers, you can use organization policies. If you require your VMs to reach the internet or your on-premises data center, you can enable a Cloud NAT gateway.
You can deploy private clusters in GKE. In a private cluster, nodes have only internal IP addresses, which means that nodes and Pods are isolated from the internet by default. You can also define a network policy to manage Pod-to-Pod communication in the cluster.
Monitor your compute instance and GKE usage
Cloud Audit Logs are automatically enabled for Compute Engine and GKE. Audit logs let you automatically capture all activities with your cluster and monitor for any suspicious activity.
Keep your images and clusters up to date
Control access to your images and clusters
Isolate containers in a sandbox
Use GKE Sandbox to deploy multi-tenant applications that need an extra layer of security and isolation from their host kernel. For example, use GKE Sandbox when you are executing unknown or untrusted code. GKE Sandbox is a container isolation solution that provides a second layer of defense between containerized workloads on GKE.

https://cloud.google.com/architecture/framework/security/compute-container-security

88
Q

How do architects secure a network?

A

Secure your network
Extending your existing network to include cloud environments has many implications for security. Your on-premises approach to multi-layered defenses likely involves a distinct perimeter between the internet and your internal network. You probably protect the perimeter by using physical firewalls, routers, intrusion detection systems, and so on. Because the boundary is clearly defined, you can easily monitor for intrusions and respond accordingly.
When you move to the cloud (either completely or in a hybrid approach), you move beyond your on-premises perimeter. This document describes ways that you can continue to secure your organization’s data and workloads on Google Cloud. As mentioned in Manage risks with controls, how you set up and secure your Google Cloud network depends on your business requirements and risk appetite.
Deploy zero trust networks
Secure connections to your on-premises or multi-cloud environments
Disable default networks
When you create a new Google Cloud project, a default Google Cloud VPC network with auto mode IP addresses and pre-populated firewall rules is automatically provisioned. For production deployments, we recommend that you delete the default networks in existing projects, and disable the creation of default networks in new projects.
Virtual Private Cloud networks let you use any internal IP address. To avoid IP address conflicts, we recommend that you first plan your network and IP address allocation across your connected deployments and across your projects. A project allows multiple VPC networks, but it’s usually a best practice to limit these networks to one per project in order to enforce access control effectively.
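Planning IP address allocation up front can be as simple as checking your intended CIDR ranges for conflicts before provisioning. The range names and CIDRs below are hypothetical, and Python's standard `ipaddress` module does the overlap check:

```python
import ipaddress

# Hypothetical planned allocations across projects plus the on-premises range.
allocations = {
    "on-prem": "10.0.0.0/16",
    "prod-vpc-subnet-us": "10.1.0.0/20",
    "prod-vpc-subnet-eu": "10.1.16.0/20",
    "dev-vpc-subnet": "10.1.8.0/21",   # deliberately overlaps prod-us
}

def find_overlaps(allocations):
    """Return every pair of named ranges whose addresses conflict."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in allocations.items()}
    names = sorted(nets)
    return [(a, b) for i, a in enumerate(names) for b in names[i + 1:]
            if nets[a].overlaps(nets[b])]

print(find_overlaps(allocations))
```

Running a check like this across all connected deployments before creating VPC networks avoids the conflicts the paragraph above warns about.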
Secure your perimeter
In Google Cloud, you can use various methods to segment and secure your cloud perimeter, including firewalls and VPC Service Controls.
Use Shared VPC to build a production deployment that gives you a single shared network and that isolates workloads into individual projects that can be managed by different teams. Shared VPC provides centralized deployment, management, and control of the network and network security resources across multiple projects. Shared VPC consists of host and service projects that perform the following functions:
A host project contains the networking and network security-related resources, such as VPC networks, subnets, firewall rules, and hybrid connectivity.
A service project attaches to a host project. It lets you isolate workloads and users at the project level by using Identity and Access Management (IAM), while it shares the networking resources from the centrally managed host project.
Define firewall rules and policies at the organization, folder, and VPC network level. You can configure firewall rules to permit or deny traffic to or from VM instances. For more information and examples, see Using firewall rules. In addition to defining rules based on IP addresses, protocols, and ports, you can manage traffic and apply firewall rules based on the service account that’s used by a VM instance. Use service accounts in your firewall rules to simplify your configuration and enforce isolation without relying on an IP address as the sole identifier of a workload.
Use hierarchical firewall policies to define rules that apply to all networks in your organization, regardless of what the network-level firewall rules permit. You can also define rules at the folder level to cover only portions of your organization.
To control the movement of data in Google services and to set up context-based perimeter security, consider VPC Service Controls. VPC Service Controls provides an extra layer of security for Google Cloud services that’s independent of IAM and VPC firewall rules and policies. For example, VPC Service Controls lets you set up perimeters between confidential and non-confidential data so that you can apply controls that help prevent data exfiltration.
Inspect your network traffic
You can use Cloud IDS and Packet Mirroring to help you ensure the security and compliance of workloads running in Compute Engine and Google Kubernetes Engine (GKE).
Use a web application firewall
For external web applications and services, you can enable Google Cloud Armor to provide distributed denial-of-service (DDoS) protection and web application firewall (WAF) capabilities. Google Cloud Armor supports Google Cloud workloads that are exposed using external HTTP(S) load balancing, TCP Proxy load balancing, or SSL Proxy load balancing.
Automate infrastructure provisioning
Automation lets you create immutable infrastructure, which means that it can’t be changed after provisioning. This measure gives your operations team a known good state, fast rollback, and troubleshooting capabilities. For automation, you can use tools such as Terraform, Jenkins, and Cloud Build.
Monitor your network
Monitor your network and your traffic using telemetry.

https://cloud.google.com/architecture/framework/security/network-security

89
Q

Implement data security

A

Implement data security
As part of your deployment architecture, you must consider what data you plan to process and store in Google Cloud, and the sensitivity of the data.
Design your controls to help secure the data during its lifecycle, to identify data ownership and classification, and to help protect data from unauthorized use.
Automatically classify your data
Perform data classification as early in the data management lifecycle as possible, ideally when the data is created. Usually, data classification efforts require only a few categories.
Use Cloud DLP to discover and classify data across your Google Cloud environment.
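To illustrate the idea of infoType-style detection, here is a toy classifier built on regular expressions. It is a simplified stand-in for Cloud DLP, not its API, and the patterns are deliberately naive:

```python
import re

# Toy stand-in for Cloud DLP infoType detection; patterns are simplified
# illustrations, not production-grade PII detectors.
PATTERNS = {
    "EMAIL_ADDRESS": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "US_SOCIAL_SECURITY_NUMBER": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE_NUMBER": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def classify(text):
    """Return the set of infoType-style labels detected in a text sample."""
    return {label for label, pattern in PATTERNS.items() if pattern.search(text)}

sample = "Contact jane.doe@example.com or 555-867-5309; SSN 123-45-6789."
print(sorted(classify(sample)))
```

In a real deployment, Cloud DLP provides maintained detectors for these and many other infoTypes, and the detected labels drive the lifecycle-phase protections described below.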
Manage data governance using metadata
Data governance is a combination of processes that ensure that data is secure, private, accurate, available, and usable.
Use Dataproc Metastore or Hive metastore to manage metadata for workloads. Data Catalog has a hive connector that allows the service to discover metadata that’s inside a hive metastore.
Use Dataprep by Trifacta to define and enforce data quality rules through a console.
Protect data according to its lifecycle phase and classification
Encrypt your data
You can control access by Google support and engineering personnel to your environment on Google Cloud.

You can control the network locations from which users can access data by using VPC Service Controls.
Manage secrets using Secret Manager
Monitor your data

https://cloud.google.com/architecture/framework/security/data-security

90
Q

How do you deploy applications securely?

A

Deploy applications securely
To deploy secure applications, you must have a well-defined software development lifecycle, with appropriate security checks during the design, development, testing, and deployment stages.
When you design an application, we recommend a layered system architecture that uses standardized frameworks for identity, authorization, and access control.
Automate secure releases
Without automated tools, it can be hard to deploy, update, and patch complex application environments to meet consistent security requirements. Therefore, we recommend that you build a CI/CD pipeline for these tasks, which can solve many of these issues.
You can use automation to scan for security vulnerabilities when artifacts are created. You can also define policies for different environments (development, test, production, and so on) so that only verified artifacts are deployed.
Scan for known vulnerabilities before deployment
Use Container Analysis to automatically scan for vulnerabilities for containers that are stored in Artifact Registry and Container Registry.
Monitor your application code for known vulnerabilities
Control movement of data across perimeters
To control the movement of data across a perimeter, you can configure security perimeters around the resources of your Google-managed services.
Use VPC Service Controls to place all components and services in your CI/CD pipeline (for example, Container Registry, Artifact Registry, Container Analysis, and Binary Authorization) inside a security perimeter.
VPC Service Controls improves your ability to mitigate the risk of unauthorized copying or transfer of data (data exfiltration) from Google-managed services.

https://cloud.google.com/architecture/framework/security/app-security

91
Q

Manage compliance obligations

A

Manage compliance obligations
Your cloud regulatory requirements depend on a combination of factors, including the following:
The laws and regulations that apply to your organization’s physical locations.
The laws and regulations that apply to your customers’ physical locations.
Your industry’s regulatory requirements.
A typical compliance journey goes through three stages: assessment, gap remediation, and continual monitoring. This section addresses the best practices that you can use during each stage.
Assess your compliance needs
Compliance assessment starts with a thorough review of all of your regulatory obligations and how your business is implementing them. To help you with your assessment of Google Cloud services, use the Compliance resource center. This site provides you with details on the following:

Deploy Assured Workloads
Assured Workloads is the Google Cloud tool that builds on the controls within Google Cloud to help you meet your compliance obligations.
Review blueprints for templates and best practices that apply to your compliance regime
Google has published blueprints and solutions guides that describe best practices and that provide Terraform modules to let you roll out an environment that helps you achieve compliance. The following table lists a selection of blueprints that address security and alignment with compliance requirements.
Monitor your compliance
Most regulations require you to monitor particular activities, including access controls. To help with your monitoring, you can use the following:
Access Transparency, which provides near real-time logs when Google Cloud admins access your content.
Firewall Rules Logging to record TCP and UDP connections inside a VPC network for any rules that you create yourself.
VPC Flow Logs to record network traffic flows that are sent or received by VM instances.
Set up automatic remediation in response to particular notifications. For more information, see Cloud Functions code.

https://cloud.google.com/architecture/framework/security/compliance

92
Q

Implement data residency and sovereignty requirements

A

Data residency and sovereignty requirements are based on your regional and industry-specific regulations, and different organizations might have different data sovereignty requirements. For example, you might have the following requirements:
Control over all access to your data by Google Cloud, including what type of personnel can access the data and from which region they can access it.
Inspectability of changes to cloud infrastructure and services, which can have an impact on access to your data or the security of your data. Insight into these types of changes helps ensure that Google Cloud is unable to circumvent controls or move your data out of the region.
Survivability of your workloads for an extended time when you are unable to receive software updates from Google Cloud.
Manage your data sovereignty
Store and manage encryption keys outside the cloud.
Only grant access to these keys based on detailed access justifications.
Protect data in use.
Manage your operational sovereignty
Restrict the deployment of new resources to specific provider regions.
Limit Google personnel access based on predefined attributes such as their citizenship or geographic location.
Manage software sovereignty
Software sovereignty provides you with assurances that you can control the availability of your workloads and run them wherever you want, without depending on (or being locked in to) a single cloud provider. Software sovereignty includes the ability to survive events that require you to quickly change where your workloads are deployed and what level of outside connection is allowed.
For example, Google Cloud supports hybrid and multi-cloud deployments. In addition, Anthos lets you manage and deploy your applications in both cloud environments and on-premises environments.
Control data residency
Understanding the type of your data and its location.
Determining what risks exist to your data, and what laws and regulations apply.
Controlling where data is or where it goes.

https://cloud.google.com/architecture/framework/security/data-residency-sovereignty

93
Q

Implement privacy requirements

A

Implement privacy requirements

Privacy regulations help define how you can obtain, process, store, and manage your users’ data. Many privacy controls (for example, controls for cookies, session management, and obtaining user permission) are your responsibility because you own your data (including the data that you receive from your users).

Google Cloud includes the following controls that promote privacy:

Default encryption of all data when it’s at rest, when it’s in transit, and while it’s being processed.
Safeguards against insider access.
Support for numerous privacy regulations.
For more information, see Google Cloud Privacy Commitments.

Classify your confidential data
You must define what data is confidential and then ensure that the confidential data is properly protected. Confidential data can include credit card numbers, addresses, phone numbers, and other personal identifiable information (PII).

Using Cloud DLP, you can set up appropriate classifications. You can then tag and tokenize your data before you store it in Google Cloud. For more information, see Automatically classify your data.

Lock down access to sensitive data
Place sensitive data in its own service perimeter using VPC Service Controls, and set Google Identity and Access Management (IAM) access controls for that data. Configure multi-factor authentication (MFA) for all users who require access to sensitive data.

Set up SSO and MFA.

Monitor for phishing attacks
Ensure that your email system is configured to protect against phishing attacks, which are often used for fraud and malware attacks.

If your organization uses Gmail, you can use advanced phishing and malware protection. This collection of settings provides controls to quarantine emails, defends against anomalous attachment types, and helps protect against inbound spoofing emails. Security Sandbox detects malware in attachments. Gmail is continually and automatically updated with the latest security improvements and protections to help keep your organization’s email safe.

Extend zero trust security to your hybrid workforce
A zero trust security model means that no one is trusted implicitly, whether they are inside or outside of your organization’s network. When your IAM systems verify access requests, a zero trust security posture means that the user’s identity and context (for example, their IP address or location) are considered. Unlike a VPN, zero trust security shifts access controls from the network perimeter to users and their devices. Zero trust security allows users to work more securely from any location. For example, users can access your organization’s resources from their laptops or mobile devices while at home.

On Google Cloud, you can configure BeyondCorp Enterprise and Identity-Aware Proxy (IAP) to enable zero trust for your Google Cloud resources. If your users use Google Chrome and you enable BeyondCorp Enterprise, you can integrate zero-trust security into your users’ browsers.

https://cloud.google.com/architecture/framework/security/privacy

94
Q

Implement logging and detective controls

A

Implement logging and detective controls
Detective controls use telemetry to detect misconfigurations, vulnerabilities, and potentially malicious activity in a cloud environment. Google Cloud lets you create tailored monitoring and detective controls for your environment. This section describes these additional features and recommendations for their use.
Monitor network performance
Network Intelligence Center gives you visibility into how your network topology and architecture are performing. You can get detailed insights into network performance and then use that information to optimize your deployment by eliminating bottlenecks on your services. Connectivity Tests provides you with insights into the firewall rules and policies that are applied to the network path.
Monitor and prevent data exfiltration
Data exfiltration is a key concern for organizations. Typically, it occurs when an authorized person extracts data from a secured system and then shares that data with an unauthorized party or moves it to an insecure system.
Google Cloud provides several features and tools that help you detect and prevent data exfiltration. For more information, see Preventing data exfiltration.
Centralize your monitoring
Security Command Center provides visibility into the resources that you have in Google Cloud and into their security state. Security Command Center helps you prevent, detect, and respond to threats. It provides a centralized dashboard that you can use to help identify security misconfigurations in virtual machines, in networks, in applications, and in storage buckets. You can address these issues before they result in business damage or loss. The built-in capabilities of Security Command Center can reveal suspicious activity in your Cloud Logging security logs or indicate compromised virtual machines.

Enable the services that you need for your workloads, and then only monitor and analyze important data.
Monitor for threats
Event Threat Detection is an optional managed service of Security Command Center Premium that detects threats in your log stream. By using Event Threat Detection, you can detect high-risk and costly threats such as malware, cryptomining, unauthorized access to Google Cloud resources, DDoS attacks, and brute-force SSH attacks. Using the tool’s features to distill volumes of log data, your security teams can quickly identify high-risk incidents and focus on remediation.
To help detect potentially compromised user accounts in your organization, use the Sensitive Actions Cloud Platform logs to identify when sensitive actions are taken and to confirm that valid users took those actions for valid purposes. A sensitive action is an action, such as the addition of a highly privileged role, that could be damaging to your business if a malicious actor took the action. Use Cloud Logging to view, monitor, and query the Sensitive Actions Cloud Platform logs. You can also view the sensitive action log entries with the Sensitive Actions Service, a built-in service of Security Command Center Premium.
Chronicle can store and analyze all of your security data centrally. Using Chronicle, you can create detection rules, set up indicators of compromise (IoC) matching, and perform threat-hunting activities. To help you see the entire span of an attack, Chronicle can map logs into a common model, enrich them, and then link them together into timelines. Chronicle also supports threat detection using extended YARA, an open standard for malware-detection rule writing.

95
Q

How do you put shared responsibility and shared fate into practice

A

As part of your planning process, consider the following actions to help you understand and implement appropriate security controls:

Create a list of the types of workloads that you will host in Google Cloud, and whether they require IaaS, PaaS, or SaaS services. You can use the shared responsibility diagram as a checklist to ensure that you know the security controls that you need to consider.
Create a list of regulatory requirements that you must comply with, and access resources in the Compliance resource center that relate to those requirements.
Review the list of available blueprints and architectures in the Architecture Center for the security controls that you require for your particular workloads. The blueprints provide a list of recommended controls and the IaC code that you require to deploy that architecture.
Use the landing zone documentation and the recommendations in the security foundations guide to design a resource hierarchy and network architecture that meets your requirements. You can use the opinionated workload blueprints, like the secured data warehouse, to accelerate your development process.
After you deploy your workloads, verify that you’re meeting your security responsibilities using services such as the Risk Manager, Assured Workloads, Policy Intelligence tools, and Security Command Center Premium.

https://cloud.google.com/architecture/framework/security/shared-responsibility-shared-fate

96
Q

What does the Risk Protection Program, put in place as part of shared fate, do?

A

Shared fate also includes the Risk Protection Program (currently in preview), which helps you use the power of Google Cloud as a platform to manage risk, rather than just seeing cloud workloads as another source of risk that you need to manage. The Risk Protection Program is a collaboration between Google Cloud and two leading cyber insurance companies, Munich Re and Allianz Global Corporate & Specialty.

The Risk Protection Program includes Risk Manager, which provides data-driven insights that you can use to better understand your cloud security posture. If you’re looking for cyber insurance coverage, you can share these insights from Risk Manager directly with our insurance partners to obtain a quote. For more information, see Google Cloud Risk Protection Program now in Preview.

https://cloud.google.com/architecture/framework/security/shared-responsibility-shared-fate

97
Q

What are the challenges of the shared responsibility model?

A

Though shared responsibility helps define the security roles that you or the cloud provider has, relying on shared responsibility can still create challenges. Consider the following scenarios:

Most cloud security breaches are the direct result of misconfiguration (listed as number 3 in the Cloud Security Alliance’s Pandemic 11 Report), and this trend is expected to increase. Cloud products are constantly changing, and new ones are constantly being launched. Keeping up with constant change can seem overwhelming. Customers need cloud providers to provide them with opinionated best practices to help keep up with the change, starting with best practices by default and a baseline secure configuration.
Though dividing items by cloud services is helpful, many enterprises have workloads that require multiple cloud services types. In this circumstance, you must consider how various security controls for these services interact, including whether they overlap between and across services. For example, you might have an on-premises application that you’re migrating to Compute Engine, use Google Workspace for corporate email, and also run BigQuery to analyze data to improve your products.
Your business and markets are constantly changing, whether because regulations change, you enter new markets, or you acquire other companies. Your new markets might have different requirements, and a new acquisition might host its workloads on another cloud. To manage the constant changes, you must continually re-assess your risk profile and be able to implement new controls quickly.
How and where to manage your data encryption keys is an important decision that ties with your responsibilities to protect your data. The option that you choose depends on your regulatory requirements, whether you’re running a hybrid cloud environment or still have an on-premises environment, and the sensitivity of the data that you’re processing and storing.
Incident management is an important, and often overlooked, area where your responsibilities and the cloud provider responsibilities aren’t easily defined. Many incidents require close collaboration and support from the cloud provider to help investigate and mitigate them. Other incidents can result from poorly configured cloud resources or stolen credentials, and ensuring that you meet the best practices for securing your resources and accounts can be quite challenging.
Advanced persistent threats (APTs) and new vulnerabilities can impact your workloads in ways that you might not consider when you start your cloud transformation. Ensuring that you remain up-to-date on the changing landscape, and who is responsible for threat mitigation is difficult, particularly if your business doesn’t have a large security team.

https://cloud.google.com/architecture/framework/security/shared-responsibility-shared-fate

98
Q

How does an architect build reliability into a cloud solution?

A

To run a reliable service, your architecture must include the following:

Measurable reliability goals, with deviations that you promptly correct.

Design patterns for scalability, high availability, disaster recovery, and automated change management.

Components that self-heal where possible, and code that includes instrumentation for observability.

Operational procedures that run the service with minimal manual work and cognitive load on operators, and that let you rapidly detect and mitigate failures.

https://cloud.google.com/architecture/framework/reliability

99
Q

What are the key principles for running operations for a service?

A

The following are covered in this section of the Architecture Framework:
Assign clear service ownership.
Reduce time to detect (TTD) with well-tuned alerts.
Reduce time to mitigate (TTM) with incident management plans and training.
Design dashboard layouts and content to minimize TTM.
Document diagnostic procedures and mitigation for known outage scenarios.
Use blameless postmortems to learn from outages and prevent recurrences.

100
Q

Google’s approach to reliability is based on the following core principles:

A

Reliability core principles

Reliability is your top feature
Reliability is defined by the user
100% reliability is the wrong target
Reliability and rapid innovation are complementary
Design and operational principles
Define your reliability goals
Build observability into your infrastructure and applications
Design for scale and high availability
Create reliable operational processes and tools
Build efficient alerts
Build a collaborative incident management process

https://cloud.google.com/architecture/framework/reliability/principles

101
Q

Service level indicator (SLI)

A

A service level indicator (SLI) is a carefully defined quantitative measure of some aspect of the level of service that is being provided. It is a metric, not a target.

https://cloud.google.com/architecture/framework/reliability/principles

102
Q

Service level objective (SLO)

A

A service level objective (SLO) specifies a target level for the reliability of your service. The SLO is a target value for an SLI. When the SLI is at or better than this value, the service is considered to be “reliable enough.” Because SLOs are key to making data-driven decisions about reliability, they are the focal point of site reliability engineering (SRE) practices.

103
Q

Error budget

A

An error budget is calculated as 100% – SLO over a period of time. Error budgets tell you if your system has been more or less reliable than is needed over a certain time window, and how many minutes of downtime are allowed during that period.

For example, if your availability SLO is 99.9%, your error budget over a 30-day period is (1 - 0.999) ✕ 30 days ✕ 24 hours ✕ 60 minutes = 43.2 minutes. The error budget for a system is consumed, or burned, whenever the system is unavailable. Using the previous example, if the system has had 10 minutes of downtime in the past 30 days and started the 30-day period with the full budget of 43.2 minutes unutilized, then the remaining error budget is reduced to 33.2 minutes.

We recommend using a rolling window of 30 days when computing your total error budget and the error budget burn rate.
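As a sketch, the arithmetic in the worked example above can be expressed directly in Python (function and parameter names are illustrative, not part of any Google Cloud API):

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Total error budget, in minutes, for an availability SLO over a window."""
    return (1 - slo) * window_days * 24 * 60

def remaining_budget_minutes(slo: float, downtime_minutes: float,
                             window_days: int = 30) -> float:
    """Error budget left after the downtime observed in the window."""
    return error_budget_minutes(slo, window_days) - downtime_minutes

# The worked example above: a 99.9% SLO over a rolling 30-day window.
total = error_budget_minutes(0.999)          # ~43.2 minutes
left = remaining_budget_minutes(0.999, 10)   # ~33.2 minutes after 10 min down
```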

104
Q

Service level agreement (SLA)

A

A service level agreement (SLA) is an explicit or implicit contract with your users that includes consequences if you meet, or miss, the SLOs referenced in the contract.

105
Q

How do you setup and manage reliability goals?

A

Define and measure customer-centric SLIs, such as the availability or latency of the service.
Define a customer-centric error budget that’s stricter than your external SLA. Include consequences for violations, such as production freezes.
Set up latency SLIs to capture outlier values, such as 90th or 99th percentile, to detect the slowest responses.
Review SLOs at least annually and confirm that they correlate well with user happiness and service outages.

https://cloud.google.com/architecture/framework/reliability/define-goals

106
Q

What SLIs are typical in systems that serve data?

A

Availability tells you the fraction of the time that a service is usable. It’s often defined in terms of the fraction of well-formed requests that succeed, such as 99%.
Latency tells you how quickly a certain percentage of requests can be fulfilled. It’s often defined in terms of a percentile other than 50th, such as “99th percentile at 300 ms”.
Quality tells you how good a certain response is. The definition of quality is often service-specific, and indicates the extent to which the content of the response to a request varies from the ideal response content. The response quality could be binary (good or bad) or expressed on a scale from 0% to 100%.
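As a sketch, the availability and latency SLIs described above can be computed from raw request data. This is an illustrative implementation (nearest-rank percentile; names are assumptions, not a Google Cloud API):

```python
import math

def availability_sli(total_requests: int, successful_requests: int) -> float:
    """Availability: the fraction of well-formed requests that succeed."""
    return successful_requests / total_requests

def percentile_latency(latencies_ms: list[float], pct: float) -> float:
    """Latency at the given percentile (nearest-rank method), so that
    outlier slow requests aren't hidden by a good average."""
    ordered = sorted(latencies_ms)
    rank = max(0, math.ceil(pct / 100 * len(ordered)) - 1)
    return ordered[rank]

samples_ms = [120, 130, 150, 180, 200, 220, 250, 280, 300, 900]
availability = availability_sli(1000, 992)        # 0.992, i.e. 99.2%
p99 = percentile_latency(samples_ms, 99)          # 900 ms: the slow outlier
```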

https://cloud.google.com/architecture/framework/reliability/define-goals

107
Q

What SLIs are typical in systems that process data?

A

Coverage tells you the fraction of data that has been processed, such as 99.9%.
Correctness tells you the fraction of output data deemed to be correct, such as 99.99%.
Freshness tells you how fresh the source data or the aggregated output data is. Typically the more recently updated, the better, such as 20 minutes.
Throughput tells you how much data is being processed, such as 500 MiB/sec or even 1000 requests per second (RPS).

https://cloud.google.com/architecture/framework/reliability/define-goals

108
Q

What SLIs are typical in systems that store data?

A

Storage systems

Durability tells you how likely it is that data written to the system can be retrieved in the future, such as 99.9999%. Any permanent data-loss incident reduces the durability metric.
Throughput and latency are also common SLIs for storage systems.

https://cloud.google.com/architecture/framework/reliability/define-goals

109
Q

What are the best practices for adding observability into your services so that you can better understand service performance and quickly identify issues?

Observability includes monitoring, logging, tracing, profiling, debugging, and similar systems.

A

Implement monitoring early, such as before you initiate a migration or before you deploy a new application to a production environment.
Disambiguate between application issues and underlying cloud issues. Use the Monitoring API, or other Cloud Monitoring products and the Google Cloud Status Dashboard.
Define an observability strategy beyond monitoring that includes tracing, profiling, and debugging.
Regularly clean up observability artifacts that you don’t use or that don’t provide value, such as unactionable alerts.
If you generate large amounts of observability data, send application events to a data warehouse system such as BigQuery.

Monitoring is at the base of the service reliability hierarchy in the Google SRE Handbook. Without proper monitoring, you can’t tell whether an application works correctly.

https://cloud.google.com/architecture/framework/reliability/observability-infrastructure-applications

110
Q

What would you recommend to a customer so that they can architect their services to tolerate failures and scale in response to customer demand?

What is a reliable service?
A reliable service continues to respond to customer requests when there’s a high demand on the service or when there’s a maintenance event.

A

The following reliability design principles and best practices should be part of your system architecture and deployment plan.

Follow these recommendations:
Implement exponential backoff with randomization in the error retry logic of client applications.
Implement a multi-region architecture with automatic failover for high availability.
Use load balancing to distribute user requests across shards and regions.

Design the application to degrade gracefully under overload.
Serve partial responses or provide limited functionality rather than failing completely.

Establish a data-driven process for capacity planning, and use load tests and traffic forecasts to determine when to provision resources.

Establish disaster recovery procedures and test them periodically.
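A minimal sketch of the first recommendation, exponential backoff with randomization in client retry logic. Names, the retried exception type, and the defaults are illustrative assumptions:

```python
import random
import time

def call_with_retries(request, max_retries: int = 5,
                      base_s: float = 0.5, cap_s: float = 32.0):
    """Retry a flaky call with exponential backoff and full-jitter
    randomization, so many clients don't retry in lockstep."""
    for attempt in range(max_retries):
        try:
            return request()
        except ConnectionError:           # retry only transient errors
            if attempt == max_retries - 1:
                raise                     # retry budget exhausted
            # Delay grows as base * 2^attempt, capped, with full jitter.
            delay = random.uniform(0, min(cap_s, base_s * 2 ** attempt))
            time.sleep(delay)
```

The same random-delay idea spreads out traffic for timed promotions and launches: clients add a small random wait before their first request instead of all connecting in the same instant.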

https://cloud.google.com/architecture/framework/reliability/design-scale-high-availability

111
Q

How would you render failures or slowness in your service less harmful to other components that depend on it? Consider the following example design techniques and principles:

A

Use prioritized request queues and give higher priority to requests where a user is waiting for a response.
Serve responses out of a cache to reduce latency and load.
Fail safe in a way that preserves function.
Degrade gracefully when there’s a traffic overload.
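The first technique, a prioritized request queue, can be sketched with the standard-library heap (class and priority names are illustrative):

```python
import heapq
import itertools

class PriorityRequestQueue:
    """Serve interactive (user-waiting) requests before background work."""
    INTERACTIVE, BATCH = 0, 1   # lower value is served first

    def __init__(self):
        self._heap = []
        self._order = itertools.count()   # FIFO tie-break within a priority

    def put(self, request, priority):
        heapq.heappush(self._heap, (priority, next(self._order), request))

    def get(self):
        """Pop the highest-priority (then oldest) pending request."""
        return heapq.heappop(self._heap)[2]
```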

https://cloud.google.com/architecture/framework/reliability/design-scale-high-availability

112
Q

Minimize the number of critical dependencies for your service, that is, other components whose failure will inevitably cause outages for your service. To make your service more resilient to failures or slowness in other components it depends on, consider the following example design techniques and principles to convert critical dependencies into non-critical dependencies:

A

Increase the level of redundancy in critical dependencies. Adding more replicas makes it less likely that an entire component will be unavailable.
Use asynchronous requests to other services instead of blocking on a response or use publish/subscribe messaging to decouple requests from responses.
Cache responses from other services to recover from short-term unavailability of dependencies.
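The caching technique above can be sketched as a small fallback cache that serves a stale copy when the dependency fails, so short-term unavailability doesn’t cascade. Names and the error type are illustrative assumptions:

```python
import time

class FallbackCache:
    """Cache a dependency's responses; serve the cached copy if the
    dependency call fails, converting a hard outage into staleness."""

    def __init__(self, fetch, ttl_s: float = 60.0, clock=time.monotonic):
        self._fetch = fetch         # callable that queries the dependency
        self._ttl = ttl_s
        self._clock = clock
        self._cache = {}            # key -> (value, fetched_at)

    def get(self, key):
        cached = self._cache.get(key)
        now = self._clock()
        if cached and now - cached[1] < self._ttl:
            return cached[0]                  # fresh enough: skip the call
        try:
            value = self._fetch(key)
            self._cache[key] = (value, now)
            return value
        except ConnectionError:
            if cached:
                return cached[0]              # dependency down: serve stale
            raise
```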

https://cloud.google.com/architecture/framework/reliability/design-scale-high-availability

113
Q

What principles would you recommend to build reliable operational processes and tools?

Examples - how to deploy updates, run services in production environments, and test for failures.

A

Choose good names for applications and services
Avoid using internal code names in production configuration files, because they can be confusing, particularly to newer employees, potentially increasing time to mitigate (TTM) for outages.
Implement progressive rollouts with canary testing

Instantaneous global changes to service binaries or configuration are inherently risky. Roll out new versions of executables and configuration changes incrementally. Start with a small scope, such as a few VM instances in a zone, and gradually expand the scope. Roll back rapidly if the change doesn’t perform as you expect, or negatively impacts users at any stage of the rollout. Your goal is to identify and address bugs when they only affect a small portion of user traffic, before you roll out the change globally.
Spread out traffic for timed promotions and launches
You might have promotional events, such as sales that start at a precise time and encourage many users to connect to the service simultaneously. If so, design client code to spread the traffic over a few seconds. Use random delays before they initiate requests.

Automate build, test, and deployment
Eliminate manual effort from your release process with the use of continuous integration and continuous delivery (CI/CD) pipelines. Perform automated integration testing and deployment. For example, create a modern CI/CD process with Anthos.

Defend against operator error
Design your operational tools to reject potentially invalid configurations. Detect and alert when a configuration version is empty, partial or truncated, corrupt, logically incorrect or unexpected, or not received within the expected time. Tools should also reject configuration versions that differ too much from the previous version.
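A minimal sketch of the checks above: reject empty or partial configurations, and configurations that differ too much from the previous version. The required keys and thresholds are illustrative assumptions:

```python
def validate_config(new: dict, previous: dict,
                    required_keys=("service", "replicas", "regions"),
                    max_changed_fraction: float = 0.5) -> list[str]:
    """Return reasons to reject the config (an empty list means accept)."""
    problems = []
    if not new:
        problems.append("config is empty")
        return problems
    missing = [k for k in required_keys if k not in new]
    if missing:
        problems.append(f"missing required keys: {missing}")
    if previous:
        keys = set(new) | set(previous)
        changed = sum(1 for k in keys if new.get(k) != previous.get(k))
        if changed / len(keys) > max_changed_fraction:
            problems.append("differs too much from the previous version")
    return problems
```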

Test failure recovery
Regularly test your operational procedures to recover from failures in your service. Without regular tests, your procedures might not work when you need them if there’s a real failure. Items to test periodically include regional failover, how to roll back a release, and how to restore data from backups.

Conduct disaster recovery tests
Like with failure recovery tests, don’t wait for a disaster to strike. Periodically test and verify your disaster recovery procedures and processes.

Practice chaos engineering
Consider the use of chaos engineering in your test practices. Introduce actual failures into different components of production systems under load in a safe environment. This approach helps to ensure that there’s no overall system impact because your service handles failures correctly at each level.

https://cloud.google.com/architecture/framework/reliability/create-operational-processes-tools

114
Q

What are 3 things you can do to implement progressive rollouts with canary testing?

A

Instantaneous global changes to service binaries or configuration are inherently risky. Roll out new versions of executables and configuration changes incrementally. Start with a small scope, such as a few VM instances in a zone, and gradually expand the scope. Roll back rapidly if the change doesn’t perform as you expect, or negatively impacts users at any stage of the rollout. Your goal is to identify and address bugs when they only affect a small portion of user traffic, before you roll out the change globally.

Set up a canary testing system that’s aware of service changes and does A/B comparison of the metrics of the changed servers with the remaining servers. The system should flag unexpected or anomalous behavior. If the change doesn’t perform as you expect, the canary testing system should automatically halt rollouts. Problems can be clear, such as user errors, or subtle, like CPU usage increase or memory bloat.

It’s better to stop and roll back at the first hint of trouble and diagnose issues without the time pressure of an outage. After the change passes canary testing, propagate it to larger scopes gradually, such as to a full zone, then to a second zone. Allow time for the changed system to handle progressively larger volumes of user traffic to expose any latent bugs.
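The A/B comparison above can be sketched as a simple halt check on error rates. Real canary systems use proper statistics and many metrics; this ratio test, its names, and its thresholds are illustrative assumptions:

```python
def canary_should_halt(baseline_errors: int, baseline_total: int,
                       canary_errors: int, canary_total: int,
                       max_ratio: float = 2.0,
                       min_abs_increase: float = 0.001) -> bool:
    """Halt the rollout if the canary's error rate is anomalously higher
    than the baseline servers' error rate."""
    baseline_rate = baseline_errors / baseline_total
    canary_rate = canary_errors / canary_total
    if canary_rate - baseline_rate < min_abs_increase:
        return False                  # no meaningful absolute increase
    return canary_rate > baseline_rate * max_ratio
```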

https://cloud.google.com/architecture/framework/reliability/create-operational-processes-tools

115
Q

What are operational principles to create alerts that help you run reliable services?

A

The more information you have about how your service performs, the more informed your decisions are when there’s an issue. Design your alerts for early and accurate detection of all user-impacting system problems, and minimize false positives.
Optimize the alert delay
There’s a balance between alerts that are sent too soon, which stress the operations team, and alerts that are sent too late, which cause long service outages. Tune the alert delay before the monitoring system notifies humans of a problem to minimize time to detect, while maximizing signal versus noise. Use the error budget consumption rate to derive the optimal alert configuration.
Alert on symptoms rather than causes
Trigger alerts based on the direct impact to user experience. Noncompliance with global or per-customer SLOs indicates a direct impact. Don’t alert on every possible root cause of a failure, especially when the impact is limited to a single replica. A well-designed distributed system recovers seamlessly from single-replica failures.
Alert on outlier values rather than averages
When monitoring latency, define SLOs and set alerts for (pick two out of three) 90th, 95th, or 99th percentile latency, not for average or 50th percentile latency. Good mean or median latency values can hide unacceptably high values at the 90th percentile or above that cause a very bad user experience. Therefore, apply this principle of alerting on outlier values when you monitor latency for any critical operation, such as a request-response interaction with a web server, batch completion in a data processing pipeline, or a read or write operation on a storage service.
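The error-budget-consumption-rate idea can be sketched as a burn-rate alert. The 14.4 default is a commonly cited multiplier from SRE practice (a one-hour window consuming about 2% of a 30-day budget: 0.02 × 30 × 24 = 14.4); the function names and defaults are illustrative, so tune them for your own service:

```python
def burn_rate(errors: int, total: int, slo: float) -> float:
    """How fast the error budget is being consumed: the observed error
    rate divided by the budget (1 - SLO). A sustained burn rate of 1.0
    exhausts the budget exactly at the end of the SLO window."""
    return (errors / total) / (1 - slo)

def should_page(errors: int, total: int, slo: float,
                threshold: float = 14.4) -> bool:
    """Page a human when the short-window burn rate is high enough that
    the monthly budget would be gone in roughly two days."""
    return burn_rate(errors, total, slo) >= threshold
```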

116
Q

What are the best practices to manage services and define processes to respond to incidents? Incidents occur in all services, so you need a well-documented process to efficiently respond to these issues and mitigate them.

A

Establish an incident management plan, and train your teams to use it.
To reduce TTD, implement the recommendations to build observability into your infrastructure and applications.
Build a “What’s changed?” dashboard that you can glance at when there’s an incident.
Document query snippets or build a Looker Studio dashboard with frequent log queries.
Evaluate Firebase Remote Config to mitigate rollout issues for mobile applications.
Test failure recovery, including restoring data from backups, to decrease TTM for a subset of your incidents.
Design for and test configuration and binary rollbacks.
Replicate data across regions for disaster recovery and use disaster recovery tests to decrease TTM after regional outages.
Design a multi-region architecture for resilience to regional outages if the business need for high availability justifies the cost, to increase TBF.

https://cloud.google.com/architecture/framework/reliability/build-incident-management-process

117
Q

How do cloud practitioners optimize the cost of workloads in Google Cloud?

Moving your IT workloads to the cloud can help you to innovate at scale, deliver features faster, and respond to evolving customer needs. To migrate existing workloads or deploy applications built for the cloud, you need a topology that’s optimized for security, resilience, operational excellence, cost, and performance.

A

In the cost optimization category of the Architecture Framework, you:

Adopt and implement FinOps: Strategies to help you encourage employees to consider the cost impact when provisioning and managing resources in Google Cloud.
Monitor and control cost: Best practices, tools, and techniques to track and control the cost of your resources in Google Cloud.

Optimize cost: Compute, containers, and serverless: Service-specific cost-optimization controls for Compute Engine, Google Kubernetes Engine, Cloud Run, Cloud Functions, and App Engine.

Optimize cost: Storage: Cost-optimization controls for Cloud Storage, Persistent Disk, and Filestore.

Optimize cost: Databases and smart analytics: Cost-optimization controls for BigQuery, Cloud Bigtable, Cloud Spanner, Cloud SQL, Dataflow, and Dataproc.

Optimize cost: Networking: Cost-optimization controls for your networking resources in Google Cloud.

Optimize cost: Cloud operations: Recommendations to help you optimize the cost of monitoring and managing your resources in Google Cloud.

118
Q

What is FinOps?
When should it be controlled centrally, and when by the project teams?

A

FinOps is a practice that combines people, processes, and technology to promote financial accountability and the discipline of cost optimization in an organization, regardless of its size or maturity in the cloud.

The guidance in this section is intended for CTOs, CIOs, and executives responsible for controlling their organization’s spend in the cloud. The guidance also helps individual cloud operators understand and adopt FinOps.

Every employee in your organization can help reduce the cost of your resources in Google Cloud, regardless of role (analyst, architect, developer, or administrator). In teams that have not had to track infrastructure costs in the past, you might have to educate employees about the need for collective responsibility.

A common model is for a central FinOps team or Cloud Center of Excellence (CCoE) to standardize the process for optimizing cost across all the cloud workloads. This model assumes that the central team has the required knowledge and expertise to identify high-value opportunities to improve efficiency.

Although centralized cost-control might work well in the initial stages of cloud adoption when usage is low, it doesn’t scale well when cloud adoption and usage increase. The central team might struggle with scaling, and project teams might not accept decisions made by anyone outside their teams.

We recommend that the central team delegate the decision making for resource optimization to the project teams. The central team can drive broader efforts to encourage the adoption of FinOps across the organization. To enable the individual project teams to practice FinOps, the central team must standardize the process, reporting, and tooling for cost optimization. The central team must work closely with teams that aren’t familiar with FinOps practices, and help them consider cost in their decision-making processes. The central team must also act as an intermediary between the finance team and the individual project teams.

https://cloud.google.com/architecture/framework/cost-optimization/finops

119
Q

What are some ways to monitor and control costs?

A

Identify cost-management focus areas
The cost of your resources in Google Cloud depends on the quantity of resources that you use and the rate at which you’re billed for the resources.
Cost visibility
Track how much you spend and how your resources and services are billed, so that you can analyze the effect of cost on business outcomes. We recommend that you follow the FinOps operating model, which suggests the following actions to make cost information visible across your organization:
Allocate: Assign an owner for every cost item.
Report: Make cost data available, consumable, and actionable.
Forecast: Estimate and track future spend.
Resource optimization
Align the number and size of your cloud resources to the requirements of your workload. Where feasible, consider using managed services or re-architecting your applications. Typically, individual engineering teams have more context than the central FinOps (financial operations) team on opportunities and techniques to optimize resource deployment. We recommend that the FinOps team work with the individual engineering teams to identify resource-optimization opportunities that can be applied across the organization.
Rate optimization
The FinOps team often makes rate optimization decisions centrally. We recommend that the individual engineering teams work with the central FinOps team to take advantage of deep discounts for reservations, committed usage, Spot VMs, flat-rate pricing, and volume and contract discounting.
Design recommendations
Consolidate billing and resource management
To manage billing and resources in Google Cloud efficiently, we recommend that you use a single billing account for your organization, and use internal chargeback mechanisms to allocate costs. Use multiple billing accounts for loosely structured conglomerates and organizations with entities that don’t affect each other. For example, resellers might need distinct accounts for each customer. Using separate billing accounts might also help you meet country-specific tax regulations.
Track and allocate cost using labels
Labels are key-value pairs that you can use to tag projects and resources. To categorize cost data at the required granularity, establish a labeling schema that suits your organization’s chargeback mechanism and helps you allocate costs appropriately. Assign cost allocation labels at the project level, and define a set of labels that can be applied by default to all the projects. You can automate the assignment of labels when you create projects.
Configure billing access control
To control access to Cloud Billing, we recommend that you assign the Billing Account Administrator role to only those users who manage billing contact information. For example, employees in finance, accounting, and operations might need this role.
Configure billing reports
Set up billing reports to provide data for the key metrics that you need to track.
Analyze trends and forecast cost
Customize and analyze cost reports using BigQuery Billing Export, and visualize cost data using Looker Studio. Assess the trend of actual costs and how much you might spend by using the forecasting tool.
Optimize resource usage and cost
This section recommends best practices to help you optimize the usage and cost of your resources across Google Cloud services.
Tools and techniques
The on-demand provisioning and pay-per-use characteristics of the cloud help you to optimize your IT spend. This section describes tools that Google Cloud provides and techniques that you can use to track and control the cost of your resources in the cloud. Before you use these tools and techniques, review the basic Cloud Billing concepts.
Billing reports
Google Cloud provides billing reports within the Google Cloud console to help you view your current and forecasted spend. The billing reports enable you to view cost data on a single page, discover and analyze trends, forecast the end-of-period cost, and take corrective action when necessary.
Data export to BigQuery
You can export billing reports to BigQuery, and analyze costs using granular and historical views of data, including data that’s categorized using labels. You can perform more advanced analyses using BigQuery ML. We recommend that you enable export of billing reports to BigQuery when you create the Cloud Billing account. Your BigQuery dataset contains billing data from the date you set up Cloud Billing export. The dataset doesn’t include data for the period before you enabled export.
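Once billing data is exported, label-based cost analysis becomes a simple aggregation. The following is a hedged sketch; the row shape only loosely mirrors the billing export schema (a cost plus a list of label key-value pairs), and all figures are invented:

```python
# Sketch: aggregating exported billing rows by a cost-allocation label.
# Row shape and values are illustrative, not the exact export schema.
from collections import defaultdict

rows = [
    {"cost": 120.0, "labels": [{"key": "team", "value": "search"}]},
    {"cost": 80.0,  "labels": [{"key": "team", "value": "ads"}]},
    {"cost": 40.0,  "labels": [{"key": "team", "value": "search"}]},
    {"cost": 15.0,  "labels": []},  # unlabeled spend surfaces separately
]

def cost_by_label(rows, key):
    totals = defaultdict(float)
    for row in rows:
        value = next((l["value"] for l in row["labels"] if l["key"] == key),
                     "(unlabeled)")
        totals[value] += row["cost"]
    return dict(totals)

print(cost_by_label(rows, "team"))
# {'search': 160.0, 'ads': 80.0, '(unlabeled)': 15.0}
```

In practice you would run the equivalent GROUP BY in BigQuery against the exported dataset rather than in application code.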
Billing access control
You can control access to Cloud Billing for specific resources by defining Identity and Access Management (IAM) policies for the resources. To grant or limit access to Cloud Billing, you can set an IAM policy at the organization level, the billing account level, or the project level.
Budgets, alerts, and quotas
Budgets help you track actual Google Cloud costs against planned spending. When you create a budget, you can configure alert rules to trigger email notifications when the actual or forecasted spend exceeds a defined threshold. You can also use budgets to automate cost-control responses.
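A minimal sketch of automating a cost-control response: parse a budget alert message and decide whether to act. The payload shape is based on the Cloud Billing budget notification format, but treat the exact field names as an assumption:

```python
import json

# Sketch: deciding on a cost-control response from a budget alert.
# Field names mirror the budget notification format (an assumption here).
message = json.dumps({
    "budgetDisplayName": "prod-monthly",
    "costAmount": 920.0,
    "budgetAmount": 1000.0,
})

def should_throttle(payload, threshold=0.9):
    """Return True when actual spend reaches the alert threshold."""
    data = json.loads(payload)
    return data["costAmount"] >= threshold * data["budgetAmount"]

print(should_throttle(message))  # True: 920 >= 0.9 * 1000
```

In a real deployment this logic would run in a Cloud Function subscribed to the budget's Pub/Sub topic.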

120
Q

What sorts of things can you do to optimize costs for compute resources?

A

The following recommendations are applicable to all the compute, containers, and serverless services in Google Cloud that are discussed in this section.

Track usage and cost
Use the following tools and techniques to monitor resource usage and cost:

View and respond to cost-optimization recommendations in the Recommendation Hub.
Get email notifications for potential increases in resource usage and cost by configuring budget alerts.
Manage and respond to alerts programmatically by using the Pub/Sub and Cloud Functions services.
Control resource provisioning
Use the following recommendations to control the quantity of resources provisioned in the cloud and the location where the resources are created:

To help ensure that resource consumption and cost don’t exceed the forecast, use resource quotas.
Provision resources in the lowest-cost region that meets the latency requirements of your workload. To control where resources are provisioned, you can use the organization policy constraint gcp.resourceLocations.
Get discounts for committed use
Committed use discounts (CUDs) are ideal for workloads with predictable resource needs. After migrating your workload to Google Cloud, find the baseline for the resources required, and get deeper discounts for committed usage. For example, purchase a one or three-year commitment, and get a substantial discount on Compute Engine VM pricing.
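A back-of-envelope sketch of the savings; the discount rates below are illustrative assumptions, so check current Google Cloud pricing for the rates that apply to your machine types:

```python
# Sketch: estimating monthly savings from committed use discounts (CUDs).
# Discount rates are illustrative, not current published pricing.
def committed_monthly_cost(on_demand_monthly, discount):
    return on_demand_monthly * (1 - discount)

baseline = 1000.0  # steady-state monthly on-demand spend for baseline VMs
one_year = committed_monthly_cost(baseline, 0.37)    # assumed 1-year rate
three_year = committed_monthly_cost(baseline, 0.55)  # assumed 3-year rate

print(round(one_year, 2), round(three_year, 2))  # 630.0 450.0
```

The key input is the baseline: commit only to the resource level your workload uses predictably, and leave bursty usage on demand or on Spot VMs.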

Automate cost-tracking using labels
Define and assign labels consistently. The following are examples of how you can use labels to automate cost-tracking:

For VMs that only developers use during business hours, assign the label env: development. You can use Cloud Scheduler to set up a serverless Cloud Function to shut down these VMs after business hours, and restart them when necessary.

For an application that has several Cloud Run services and Cloud Functions instances, assign a consistent label to all the Cloud Run and Cloud Functions resources. Identify the high-cost areas, and take action to reduce cost.
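The first example above can be sketched as a simple label-driven decision for a scheduled job; the VM names, labels, and business-hours window are assumptions for illustration:

```python
# Sketch: which VMs should a scheduled job stop outside business hours?
# Names, labels, and the 08:00-17:59 window are illustrative assumptions.
BUSINESS_HOURS = range(8, 18)

vms = [
    {"name": "dev-1",  "labels": {"env": "development"}},
    {"name": "prod-1", "labels": {"env": "production"}},
]

def vms_to_stop(vms, hour):
    if hour in BUSINESS_HOURS:
        return []  # never stop anything during business hours
    return [vm["name"] for vm in vms
            if vm["labels"].get("env") == "development"]

print(vms_to_stop(vms, hour=22))  # ['dev-1']
```

Cloud Scheduler would invoke logic like this on a cron schedule, with the actual stop/start performed through the Compute Engine API.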

Customize billing reports
Configure your Cloud Billing reports by setting up the required filters and grouping the data as necessary (for example, by projects, services, or labels).

Promote a cost-saving culture
Train your developers and operators on your cloud infrastructure. Create and promote learning programs using traditional or online classes, discussion groups, peer reviews, pair programming, and cost-saving games. As shown in Google’s DORA research, organizational culture is a key driver for improving performance, reducing rework and burnout, and optimizing cost. By giving employees visibility into the cost of their resources, you help them align their priorities and activities with business objectives and constraints.


https://cloud.google.com/architecture/framework/cost-optimization/compute

121
Q

What are ways to optimize costs for GKE resources?

A

Use Cloud Monitoring to get real-time information about your GKE clusters (spending, bin-packing, application right-sizing, and scaling).
Use GKE Autopilot to let GKE maximize the efficiency of your cluster’s infrastructure. You don’t need to monitor the health of your nodes, handle bin-packing, or calculate the capacity that your workloads need.
Fine-tune GKE autoscaling by using Horizontal Pod Autoscaler (HPA), Vertical Pod Autoscaler (VPA), Cluster Autoscaler (CA), or node auto-provisioning based on your workload’s requirements.
For batch workloads that aren’t sensitive to startup latency, use the optimization-utilization autoscaling profile to help improve the utilization of the cluster.
Use node auto-provisioning to extend the GKE cluster autoscaler, and efficiently create and delete node pools based on the specifications of pending pods without over-provisioning.
Use separate node pools: a static node pool for static load, and dynamic node pools with cluster autoscaling groups for dynamic loads.
Use Spot VMs for Kubernetes node pools when your pods are fault-tolerant and can terminate gracefully in less than 25 seconds. Combined with the GKE cluster autoscaler, this strategy helps you ensure that the node pool with lower-cost VMs (in this case, the node pool with Spot VMs) scales first.
Choose cost-efficient machine types (for example: E2, N2D, T2D), which provide 20–40% higher performance-to-price.
Use GKE usage metering to analyze your clusters’ usage profiles by namespaces and labels. Identify the team or application that’s spending the most, the environment or component that caused spikes in usage or cost, and the team that’s wasting resources.
Use resource quotas in multi-tenant clusters to prevent any tenant from using more than its assigned share of cluster resources.
Schedule automatic downscaling of development and test environments after business hours.
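The HPA tuning mentioned above follows the Kubernetes-documented scaling rule, desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), sketched here with illustrative utilization numbers:

```python
import math

# Sketch of the Horizontal Pod Autoscaler's documented scaling rule.
def desired_replicas(current_replicas, current_utilization, target_utilization):
    return math.ceil(current_replicas * current_utilization / target_utilization)

# Pods running hot (90% CPU against a 60% target) scale out...
print(desired_replicas(4, current_utilization=90, target_utilization=60))  # 6
# ...and idle pods (30% CPU) scale in, which is where the savings come from.
print(desired_replicas(4, current_utilization=30, target_utilization=60))  # 2
```

Setting the target utilization too low wastes money on headroom; too high risks saturation before the scale-out completes.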

122
Q

What are some ways to optimize cloud run resources for costs?

A

Adjust the concurrency setting (default: 80) to reduce cost. Cloud Run determines the number of requests to be sent to an instance based on CPU and memory usage. By increasing the request concurrency, you can reduce the number of instances required.
Set a limit for the number of instances that can be deployed.
Estimate the number of instances required by using the Billable Instance Time metric. For example, if the metric shows 100s/s, around 100 instances were scheduled. Add a 30% buffer to preserve performance; that is, 130 instances for 100s/s of traffic.
To reduce the impact of cold starts, configure a minimum number of instances. When these instances are idle, they are billed at a tenth of the price.
Track CPU usage, and adjust the CPU limits accordingly.
Use traffic management to determine a cost-optimal configuration.
Consider using Cloud CDN or Firebase Hosting for serving static assets.
For Cloud Run apps that handle requests globally, consider deploying the app to multiple regions, because cross continent egress traffic can be expensive. This design is recommended if you use a load balancer and CDN.
Reduce the startup times for your instances, because the startup time is also billable.
Purchase Committed Use Discounts, and save up to 17% off the on-demand pricing for a one-year commitment.
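The instance estimate above works out as a small piece of arithmetic (a minimal sketch of the 30% buffer rule):

```python
import math

# Sketch: Billable Instance Time of 100 s/s implies ~100 instances;
# add a 30% buffer to preserve performance. Integer percent math
# avoids floating-point rounding surprises in the ceiling.
def estimated_instances(billable_seconds_per_second, buffer_pct=30):
    return math.ceil(billable_seconds_per_second * (100 + buffer_pct) / 100)

print(estimated_instances(100))  # 130
```

The result is a sensible value for the maximum-instances limit mentioned earlier: high enough to absorb normal peaks, low enough to cap runaway cost.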

123
Q

What are some ways to optimize costs for cloud functions?

A

Observe the execution time of your functions. Experiment and benchmark to design the smallest function that still meets your required performance threshold.
If your Cloud Functions workloads run constantly, consider using GKE or Compute Engine to handle the workloads. Containers or VMs might be lower-cost options for always-running workloads.
Limit the number of function instances that can co-exist.
Benchmark the runtime performance of the Cloud Functions programming languages against the workload of your function. Programs in compiled languages have longer cold starts, but run faster. Programs in interpreted languages run slower, but have a lower cold-start overhead. Short, simple functions that run frequently might cost less in an interpreted language.
Delete temporary files written to the local disk, which is an in-memory file system. Temporary files consume memory that’s allocated to your function, and sometimes persist between invocations. If you don’t delete these files, an out-of-memory error might occur and trigger a cold start, which increases the execution time and cost.
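A toy latency model for the compiled-versus-interpreted tradeoff above; all figures (cold-start times, execution times, cold-start rate) are invented for illustration:

```python
# Sketch: expected billable time per invocation, modeling the tradeoff
# between cold-start overhead and steady-state execution speed.
# All numbers are illustrative assumptions, not benchmarks.
def expected_ms(cold_start_ms, exec_ms, cold_start_rate):
    return cold_start_ms * cold_start_rate + exec_ms

# Compiled: slow cold start, fast execution.
compiled = expected_ms(cold_start_ms=800, exec_ms=20, cold_start_rate=0.05)
# Interpreted: fast cold start, slower execution.
interpreted = expected_ms(cold_start_ms=200, exec_ms=60, cold_start_rate=0.05)

print(compiled, interpreted)
```

With frequent invocations the cold-start rate drops and the compiled language wins; for rarely invoked functions the interpreted language's cheap cold start can dominate. Benchmark with your own workload rather than these made-up numbers.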

https://cloud.google.com/architecture/framework/cost-optimization/compute#general_recommendations

124
Q

What are ways to optimize costs of your App Engine Resources?

A

Set maximum instances based on your traffic and request latency. App Engine usually scales capacity based on the traffic that the applications receive. You can control cost by limiting the number of instances that App Engine can create.
To limit the memory or CPU available for your application, set an instance class. For CPU-intensive applications, allocate more CPU. Test a few configurations to determine the optimal size.
Benchmark your App Engine workload in multiple programming languages. For example, a workload implemented in one language may need fewer instances and lower cost to complete tasks on time than the same workload programmed in another language.
Optimize for fewer cold starts. When possible, reduce CPU-intensive or long-running tasks that occur in the global scope. Try to break down the task into smaller operations that can be “lazy loaded” into the context of a request.
If you expect bursty traffic, configure a minimum number of idle instances that are pre-warmed. If you are not expecting traffic, you can configure the minimum idle instances to zero.
To balance performance and cost, run an A/B test by splitting traffic between two versions, each with a different configuration. Monitor the performance and cost of each version, tune as necessary, and decide the configuration to which traffic should be sent.
Configure request concurrency, and set the maximum concurrent requests higher than the default. The more requests each instance can handle concurrently, the more efficiently you can use existing instances to serve traffic.

https://cloud.google.com/architecture/framework/cost-optimization/compute#general_recommendations

125
Q

How can an organization optimize costs of their solutions by using labels?

A

For VMs that only developers use during business hours, assign the label env: development. You can use Cloud Scheduler to set up a serverless Cloud Function to shut down these VMs after business hours, and restart them when necessary.

For an application that has several Cloud Run services and Cloud Functions instances, assign a consistent label to all the Cloud Run and Cloud Functions resources. Identify the high-cost areas, and take action to reduce cost.

https://cloud.google.com/architecture/framework/cost-optimization/compute#general_recommendations

126
Q

What do cloud practitioners do to optimize the cost of workloads in Google Cloud?

A

Adopt and implement FinOps: Strategies to help you encourage employees to consider the cost impact when provisioning and managing resources in Google Cloud.
Monitor and control cost: Best practices, tools, and techniques to track and control the cost of your resources in Google Cloud.
Optimize cost: Compute, containers, and serverless: Service-specific cost-optimization controls for Compute Engine, Google Kubernetes Engine, Cloud Run, Cloud Functions, and App Engine.
Optimize cost: Storage: Cost-optimization controls for Cloud Storage, Persistent Disk, and Filestore.
Optimize cost: Databases and smart analytics: Cost-optimization controls for BigQuery, Cloud Bigtable, Cloud Spanner, Cloud SQL, Dataflow, and Dataproc.
Optimize cost: Networking: Cost-optimization controls for your networking resources in Google Cloud.
Optimize cost: Cloud operations: Recommendations to help you optimize the cost of monitoring and managing your resources in Google Cloud.

https://cloud.google.com/architecture/framework/cost-optimization

127
Q

In a FinOps model, who should own cost data and manage cost-optimization operations: a central team or the individual project teams?

A

Every employee in your organization can help reduce the cost of your resources in Google Cloud, regardless of role (analyst, architect, developer, or administrator). In teams that have not had to track infrastructure costs in the past, you might have to educate employees about the need for collective responsibility.

A common model is for a central FinOps team or Cloud Center of Excellence (CCoE) to standardize the process for optimizing cost across all the cloud workloads. This model assumes that the central team has the required knowledge and expertise to identify high-value opportunities to improve efficiency.

Although centralized cost-control might work well in the initial stages of cloud adoption when usage is low, it doesn’t scale well when cloud adoption and usage increase. The central team might struggle with scaling, and project teams might not accept decisions made by anyone outside their teams.

We recommend that the central team delegate the decision making for resource optimization to the project teams. The central team can drive broader efforts to encourage the adoption of FinOps across the organization. To enable the individual project teams to practice FinOps, the central team must standardize the process, reporting, and tooling for cost optimization. The central team must work closely with teams that aren’t familiar with FinOps practices, and help them consider cost in their decision-making processes. The central team must also act as an intermediary between the finance team and the individual project teams.

https://cloud.google.com/architecture/framework/cost-optimization/finops

128
Q

Describe the design principles that we recommend your central team promote.

A

**Encourage individual accountability**
Any employee who creates and uses cloud resources affects the usage and the cost of those resources. Hold employees accountable for the cost of their resources, and encourage them to implement data-driven cost-optimization actions.
Educate users about cost-optimization opportunities and techniques.
Reward employees who optimize cost, and celebrate success.
Make costs visible across the organization.
Use a single, well-defined method for calculating the fully loaded costs of cloud resources. For example, the method could consider the total cloud spend adjusted for purchased discounts and shared costs, like the cost of shared databases.
Set up dashboards that enable employees to view their cloud spend in near real time.
To motivate individuals in the team to own their costs, allow wide visibility of cloud spending across teams.
**Enable collaborative behavior**
Create a workload-onboarding process that helps ensure cost efficiency in the design stage through peer reviews of proposed architectures by other engineers.
Create a cross-team knowledge base of cost-efficient architectural patterns.
**Establish a blameless culture**
Promote a culture of learning and growth that makes it safe to take risks, make corrections when required, and innovate.

**While FinOps practices are often focused on cost reduction, the focus for a central team must be on enabling project teams to make decisions that maximize the business value of their cloud resources.**
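The "single, well-defined method for calculating fully loaded costs" can be sketched as follows; the specific adjustment rule and figures here are assumptions for illustration, not a prescribed formula:

```python
# Sketch of one fully loaded cost method: total spend adjusted for
# purchased discounts, plus a usage-based share of shared costs
# (for example, a shared database). Rule and figures are assumptions.
def fully_loaded_cost(direct_spend, discounts, shared_cost, usage_share):
    return direct_spend - discounts + shared_cost * usage_share

team_cost = fully_loaded_cost(
    direct_spend=1000.0,  # team's own resource spend
    discounts=150.0,      # team's share of purchased discounts
    shared_cost=400.0,    # e.g. a shared database
    usage_share=0.25,     # this team drives 25% of shared usage
)
print(team_cost)  # 950.0
```

Whatever method you choose, the point in the text stands: use one method consistently, so teams' dashboards are comparable.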

129
Q

To manage the cost of cloud resources what areas should you focus on?

A

Cost visibility
Resource optimization
Rate optimization

Cost visibility
Track how much you spend and how your resources and services are billed, so that you can analyze the effect of cost on business outcomes. We recommend that you follow the FinOps operating model, which suggests the following actions to make cost information visible across your organization:

Allocate: Assign an owner for every cost item.
Report: Make cost data available, consumable, and actionable.
Forecast: Estimate and track future spend.
Resource optimization
Align the number and size of your cloud resources to the requirements of your workload. Where feasible, consider using managed services or re-architecting your applications. Typically, individual engineering teams have more context than the central FinOps (financial operations) team on opportunities and techniques to optimize resource deployment. We recommend that the FinOps team work with the individual engineering teams to identify resource-optimization opportunities that can be applied across the organization.

Rate optimization
The FinOps team often makes rate optimization decisions centrally. We recommend that the individual engineering teams work with the central FinOps team to take advantage of deep discounts for reservations, committed usage, Spot VMs, flat-rate pricing, and volume and contract discounting.
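The "Forecast" action above can be approximated with a simple run-rate projection (illustrative only; the Cloud Billing forecasting tool uses more sophisticated models than a linear extrapolation):

```python
# Sketch: naive month-end spend forecast from month-to-date spend.
# A linear run rate ignores seasonality; figures are illustrative.
def forecast_month_end(spend_to_date, day_of_month, days_in_month):
    daily_run_rate = spend_to_date / day_of_month
    return daily_run_rate * days_in_month

print(forecast_month_end(spend_to_date=300.0, day_of_month=10,
                         days_in_month=30))  # 900.0
```

Even this crude forecast is enough to drive the budget alerts described earlier, which can fire on forecasted as well as actual spend.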

130
Q

Cloud Bigtable
What recommendations would you provide to a customer to optimize the performance of their Bigtable instances?

A

Plan capacity based on performance requirements
You can use Bigtable in a broad spectrum of applications, each with a different optimization goal. For example, for batch data-processing jobs, throughput might be more important than latency. For an online service that serves user requests, you might need to prioritize lower latency over throughput. When you plan capacity for your Bigtable clusters, consider the tradeoffs between throughput and latency. For more information, see Plan your Bigtable capacity.

Follow schema-design best practices
Your tables can scale to billions of rows and thousands of columns, enabling you to store petabytes of data. When you design the schema for your Bigtable tables, consider the schema design best practices.

Monitor performance and make adjustments
Monitor the CPU and disk usage for your instances, analyze the performance of each cluster, and review the sizing recommendations that are shown in the monitoring charts.

https://cloud.google.com/architecture/framework/performance-optimization/databases

131
Q

Cloud Spanner
What recommendations would you provide to help optimize the performance of Spanner instances?

A

Choose a primary key that prevents a hotspot
A hotspot is a single server that is forced to handle many requests.

Follow best practices for SQL coding

Use query options to manage the SQL query optimizer

Visualize and tune the structure of query execution plans

Use operations APIs to manage long-running operations

Follow best practices for bulk loading

Monitor and control CPU utilization
Analyze and solve latency issues

Launch applications after the database reaches the warm state
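One common way to choose a primary key that prevents a hotspot is to avoid monotonically increasing keys: prefix the ID with a deterministic shard derived from a hash, so consecutive writes spread across key ranges instead of piling onto one server. A hedged sketch, with an assumed shard count:

```python
import hashlib

# Sketch: sharded primary key to avoid write hotspots on sequential IDs.
# The shard count and key format are illustrative assumptions.
NUM_SHARDS = 16

def sharded_key(order_id: int) -> str:
    digest = hashlib.md5(str(order_id).encode()).hexdigest()
    shard = int(digest, 16) % NUM_SHARDS
    return f"{shard:02d}#{order_id}"

# Consecutive IDs land on (generally) different shards/key ranges.
keys = [sharded_key(i) for i in (1000, 1001, 1002)]
print(keys)
```

The tradeoff: range scans over the original ID order now require querying all shards, so apply this only to tables whose write pattern actually hotspots.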

https://cloud.google.com/architecture/framework/performance-optimization/databases

132
Q

What recommendations would help you optimize the performance of your Cloud SQL instances running SQL Server, MySQL, and PostgreSQL databases?

A

For SQL Server databases, Google recommends that you modify certain parameters and retain the default values for some parameters.
When you choose the storage type for MySQL or PostgreSQL databases, consider the cost-performance tradeoff between SSD and HDD storage.
To identify and analyze performance issues with PostgreSQL databases, use the Cloud SQL Insights dashboard.
To diagnose poor performance when running SQL queries, use the EXPLAIN statement.

https://cloud.google.com/architecture/framework/performance-optimization/databases

133
Q

What recommendations would help your customers optimize the performance of their analytics workloads in Google Cloud Storage?

A

Reduce latency when using Cloud Storage
To reduce latency when you access data that’s stored in Cloud Storage, we recommend the following:

Create your Cloud Storage bucket in the same region as the Dataproc cluster.
Disable auto.purge for Apache Hive-managed tables stored in Cloud Storage.
When using Spark SQL, consider creating Dataproc clusters with the latest versions of the available images. By using the latest version, you can avoid performance issues that might remain in older versions, such as slow INSERT OVERWRITE performance in Spark 2.x.
To minimize the possibility of writing many files with varying or small sizes to Cloud Storage, you can configure the Spark SQL parameters spark.sql.shuffle.partitions and spark.default.parallelism or the Hadoop parameter mapreduce.job.reduces.

https://cloud.google.com/architecture/framework/performance-optimization/analytics

134
Q

How do you optimize the performance of Dataflow?

A

When you create and deploy pipelines, you can configure execution parameters, like the Compute Engine machine type that should be used for the Dataflow worker VMs. For more information, see Pipeline options.

After you deploy pipelines, Dataflow manages the Compute Engine and Cloud Storage resources that are necessary to run your jobs. In addition, the following features of Dataflow help optimize the performance of the pipelines:

Parallelization: Dataflow automatically partitions your data and distributes your worker code to Compute Engine instances for parallel processing. For more information, see parallelization and distribution.
Optimization: Dataflow uses your pipeline code to create an execution graph that represents PCollection objects and transforms in the pipeline. It then optimizes the graph for the most efficient performance and resource usage. Dataflow also automatically optimizes potentially costly operations, such as data aggregations. For more information, see Fusion optimization and Combine optimization.
Automatic tuning: Dataflow dynamically optimizes jobs while they are running by using Horizontal Autoscaling, Vertical Autoscaling, and Dynamic Work Rebalancing.

https://cloud.google.com/architecture/framework/performance-optimization/analytics

135
Q

How do you optimize the performance of your BigQuery queries?

A

Optimize query design
Query performance depends on factors like the number of bytes that your queries read and write, and the volume of data that’s passed between slots. To optimize the performance of your queries in BigQuery, apply the best practices that are described in the following documentation:

Introduction to optimizing query performance
Managing input data and data sources
Optimizing communication between slots
Optimize query computation
Manage query outputs
Avoiding SQL anti-patterns
Define and use materialized views efficiently
To improve the performance of workloads that use common and repeated queries, you can use materialized views. There are limits to the number of materialized views that you can create. Don’t create a separate materialized view for every permutation of a query. Instead, define materialized views that you can use for multiple patterns of queries.
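As a back-of-envelope illustration of why bytes read matter: on-demand query cost scales with bytes scanned, so partition pruning pays directly. The per-TiB rate below is illustrative, not current pricing:

```python
# Sketch: on-demand query cost scales with bytes scanned.
# The per-TiB rate is an illustrative assumption; check current pricing.
RATE_PER_TIB = 5.00  # assumed USD per TiB scanned

def query_cost(bytes_scanned):
    return bytes_scanned / 2**40 * RATE_PER_TIB

full_scan = query_cost(10 * 2**40)   # 10 TiB table, no partition filter
pruned = query_cost(0.5 * 2**40)     # partition filter prunes 95% of it

print(full_scan, pruned)  # 50.0 2.5
```

The same arithmetic motivates materialized views: a query served from a small precomputed view scans a fraction of the base table's bytes.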

https://cloud.google.com/architecture/framework/performance-optimization/analytics

136
Q

How do you manage capacity and quota in Google Cloud?

A

Manage capacity and quota
Unlike traditional on-premises environments, where you must plan and purchase capacity up front, when you use Google Cloud you cede most capacity planning to Google. Using the cloud means you don’t have to provision and maintain idle resources when they aren’t needed. For example, you can create, scale up, and scale down VM instances as needed. Because you pay for what you use, you can optimize your spending, including excess capacity that you only need at peak traffic times. To help you save, Compute Engine provides machine type recommendations if it detects that you have underutilized VM instances that can be resized or deleted.
Evaluate your cloud capacity requirements
To manage your capacity effectively, you need to know your organization’s capacity requirements.
To evaluate your capacity requirements, start by identifying your top cloud workloads. Evaluate the average and peak utilizations of these workloads, and their current and future capacity needs.
Identify the teams who use these top workloads. Work with them to establish an internal demand-planning process. Use this process to understand their current and forecasted cloud resource needs.
View your infrastructure utilization metrics
To make capacity planning easier, gather and store historical data about your organization’s use of cloud resources.
Ensure you have visibility into infrastructure utilization metrics. For example, for top workloads, evaluate the following:
Average and peak utilization
Spikes in usage patterns
Seasonal spikes based on business requirements, such as holiday periods for retailers
How much over-provisioning is needed to prepare for peak events and rapidly handle potential traffic spikes
Ensure your organization has set up alerts that automatically notify you when you get close to quota and capacity limits.
Use Google’s monitoring tools to get insights on application usage and capacity. For example, you can define custom metrics with Monitoring. Use these custom metrics to define alerting trends. Monitoring also provides flexible dashboards and rich visualization tools to help identify emergent issues.
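The utilization figures above reduce to a simple headroom estimate: how much capacity, relative to average load, you must provision to absorb the observed peak plus a spike margin. A rough sketch; the samples and safety margin are invented, and real values would come from Monitoring:

```python
# Hourly CPU utilization samples for one workload (hypothetical values, 0.0-1.0).
samples = [0.42, 0.55, 0.61, 0.93, 0.58, 0.47, 0.88, 0.52]

average = sum(samples) / len(samples)
peak = max(samples)

# Over-provisioning headroom: cover the observed peak plus a safety margin
# for unanticipated traffic spikes.
safety_margin = 0.20
required_capacity = peak * (1 + safety_margin)
headroom_ratio = required_capacity / average

print(f"average={average:.2f} peak={peak:.2f} "
      f"provision {headroom_ratio:.1f}x the average load")
```

Feeding a calculation like this with historical Monitoring data is one way to turn raw utilization metrics into a concrete demand-planning number.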
Create a process for capacity planning
Ensure your quotas match your capacity requirements
Google Cloud uses quotas to restrict how much of a particular shared Google Cloud resource that you can use. Each quota represents a specific countable resource, such as API calls to a particular service, the number of load balancers used concurrently by your project, or the number of projects that you can create. For example, quotas ensure that a few customers or projects can’t monopolize CPU cores in a particular region or zone.
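The quota alerting described above amounts to a threshold comparison between current usage and each limit. A minimal sketch; the quota names, usage figures, and limits are hypothetical, and real numbers would come from the Service Usage API or Monitoring:

```python
# Hypothetical current usage vs. quota limits for one project.
quotas = {
    "compute.googleapis.com/cpus": {"usage": 92, "limit": 100},
    "compute.googleapis.com/in_use_addresses": {"usage": 12, "limit": 64},
}

ALERT_THRESHOLD = 0.8  # notify when usage exceeds 80% of the limit

def near_limit(quotas, threshold=ALERT_THRESHOLD):
    """Return the names of quotas whose usage crosses the alerting threshold."""
    return [name for name, q in quotas.items()
            if q["usage"] / q["limit"] >= threshold]

print(near_limit(quotas))  # only the CPU quota is close to its limit
```

In practice you would wire this kind of check into an alerting policy rather than polling by hand, so a quota increase can be requested before the limit is hit.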

https://cloud.google.com/architecture/framework/operational-excellence/manage-capacity-and-quota

137
Q

What best practices should an architect consider to support VM migration in a solution?

A

Evaluate built-in migration tools
Evaluate built-in migration tools to move your workloads from another cloud or from on-premises. For more information, see Migration to Google Cloud. Google Cloud offers tools and services to help you migrate your workloads and optimize for cost and performance. To receive a free migration cost assessment based on your current IT landscape, see Google Cloud Rapid Assessment & Migration Program.

Use virtual disk import for customized operating systems
To import customized supported operating systems, see Importing virtual disks. Sole-tenant nodes can help you meet your hardware bring-your-own-license requirements for per-core or per-processor licenses. For more information, see Bringing your own licenses.

https://cloud.google.com/architecture/framework/system-design/compute