GCP Architecture Framework Flashcards

1
Q

What is the foundational category of the Google Cloud Architecture Framework that provides design recommendations and describes best practices and principles to help you define the architecture, components, modules, interfaces, and data on a cloud platform to satisfy your system requirements?

A

System design is the foundational category of the Google Cloud Architecture Framework. This category provides design recommendations and describes best practices and principles to help you define the architecture, components, modules, interfaces, and data on a cloud platform to satisfy your system requirements. You also learn about Google Cloud products and features that support system design.

https://cloud.google.com/architecture/framework

2
Q

What does the Google Cloud Architecture Framework provide and describe?

A

Best practices to help architects, developers, administrators, and other cloud practitioners design and operate a cloud topology that’s secure, efficient, resilient, high-performing, and cost-effective.

https://cloud.google.com/architecture/framework

3
Q

What are the Google Cloud Architecture Framework categories (also known as pillars)?

A

The design guidance in the Architecture Framework applies to applications built for the cloud and for workloads migrated from on-premises to Google Cloud, hybrid cloud deployments, and multi-cloud environments.

The Google Cloud Architecture Framework is organized into six categories (also known as pillars):

System design
This category is the foundation of the Google Cloud Architecture Framework. Define the architecture, components, modules, interfaces, and data needed to satisfy cloud system requirements, and learn about Google Cloud products and features that support system design.
Operational excellence
Efficiently deploy, operate, monitor, and manage your cloud workloads.
Security, privacy, and compliance
Maximize the security of your data and workloads in the cloud, design for privacy, and align with regulatory requirements and standards.
Reliability
Design and operate resilient and highly available workloads in the cloud.
Cost optimization
Maximize the business value of your investment in Google Cloud.
Performance optimization
Design and tune your cloud resources for optimal performance.

https://cloud.google.com/architecture/framework

4
Q

Which category is the foundational category of the Google Cloud Architecture Framework? This category provides design recommendations and describes best practices and principles to help you define the architecture, components, modules, interfaces, and data on a cloud platform to satisfy your system requirements.

A

System Design is this category.

You also learn about Google Cloud products and features that support system design.

https://cloud.google.com/architecture/framework
https://cloud.google.com/architecture/framework/system-design

5
Q

What are the core principles of system design?

A

Document everything
When you start to move your workloads to the cloud or build your applications, a major blocker to success is lack of documentation of the system. Documentation is especially important for correctly visualizing the architecture of your current deployments.

Simplify your design and use fully managed services
Simplicity is crucial for system design. If your architecture is too complex to understand, it will be difficult to implement the design and manage it over time. Where feasible, use fully managed services to minimize the risks, time, and effort associated with managing and maintaining baseline systems.

Decouple your architecture
Decoupling is a technique that’s used to separate your applications and service components into smaller components that can operate independently. For example, you might break up a monolithic application stack into separate service components. In a decoupled architecture, an application can run its functions independently, regardless of the various dependencies.

Use a stateless architecture
A stateless architecture can increase both the reliability and scalability of your applications.

Stateful applications rely on various dependencies to perform tasks, such as locally cached data. Stateful applications often require additional mechanisms to capture progress and restart gracefully. Stateless applications can perform tasks without significant local dependencies by using shared storage or cached services. A stateless architecture enables your applications to scale up quickly with minimum boot dependencies. The applications can withstand hard restarts, have lower downtime, and provide better performance for end users.
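The stateful/stateless contrast above can be sketched in a few lines of Python. This is illustrative only: `SharedStore` stands in for a shared service such as Memorystore or Cloud Storage, and the worker names are hypothetical.

```python
# Illustrative sketch: the same task written statefully and statelessly.
# The stateless version keeps all progress in a shared store, so any
# replica can resume after a hard restart.

class SharedStore:
    """Stands in for a shared service such as Memorystore or Cloud Storage."""
    def __init__(self):
        self._data = {}
    def get(self, key, default=None):
        return self._data.get(key, default)
    def put(self, key, value):
        self._data[key] = value

class StatefulWorker:
    """Progress lives in local memory; a hard restart loses it."""
    def __init__(self):
        self.processed = 0
    def handle(self, item):
        self.processed += 1
        return item.upper()

class StatelessWorker:
    """Progress lives in the shared store; any instance can continue."""
    def __init__(self, store):
        self.store = store
    def handle(self, item):
        self.store.put("processed", self.store.get("processed", 0) + 1)
        return item.upper()

store = SharedStore()
w1 = StatelessWorker(store)
w1.handle("a")
w2 = StatelessWorker(store)    # simulate a hard restart: a fresh instance
w2.handle("b")
print(store.get("processed"))  # the count survived the "restart": 2
```

Because no progress lives on the instance, replacing `w1` with `w2` costs nothing, which is exactly why stateless services scale up quickly and tolerate hard restarts.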

https://cloud.google.com/architecture/framework/system-design

6
Q

What deployment best practices should you consider when deploying a GCP solution (system design best practice)?

A

Deploy over multiple regions
Select regions based on geographic proximity
Use Cloud Load Balancing to serve global users
Use the Cloud Region Picker to support sustainability
Compare pricing of major resources

https://cloud.google.com/architecture/framework/system-design

7
Q

What are 2 ways to design resilient services?

A

Standardize deployments and incorporate automation wherever possible. Using architectural standards and deploying with automation helps you standardize your builds, tests, and deployments, which helps to eliminate human-induced errors for repeated processes like code updates and provides confidence in testing before deployment.

Understand your operational tools portfolio.

https://cloud.google.com/architecture/framework/system-design

8
Q

What is the high-level overview of the design process?

A
  • Understand the overarching goals.
  • Identify and itemize objectives.
  • Categorize and prioritize objectives.
  • Analyze objectives.
  • Determine preliminary options.
  • Refine options.
  • Identify your final solution.

https://cloud.google.com/architecture/framework/system-design

9
Q

How do you identify the architecture?

A

Identify goals.
Use existing patterns or designs.
Google provides resources at the Cloud Architecture Center that can form the basis for your solution design.
For example, there’s a reference architecture diagram for migrating an on-premises data center to Google Cloud.

Take each sentence in your case study and rewrite it as an objective. An objective could be an existing architectural component or a task that needs to be completed.

https://cloud.google.com/architecture/framework/system-design/principles

10
Q

When you categorize requirements, what are the categories?

A

business requirements
technical requirements
As we discussed in module 1, we can use the business and technical requirements as watchpoints to use when deciding what we can use to implement a solution.

solution requirements
The solution components may act as equivalents to the existing infrastructure or may replace the existing infrastructure with new components and functionality to achieve Dress4Win’s goals.

11
Q

After you develop the solution approach, how do you work with the product information?

A

This is the job of the Professional Cloud Architect, because they determine how to best use the resources available in Google Cloud.
It often helps to start with a broad approach and decide which products to use, as we did in the last module.

Next, focus on the different products, services, and practices and decide how best to implement them in Google Cloud.

12
Q

What are 5 generic steps for an architect to follow when designing a cloud solution?

A

Designing a solution infrastructure that meets business requirements
Designing a solution infrastructure that meets technical requirements
Designing network, storage, compute, and other resources
Creating a migration plan
Envisioning future solution improvements

13
Q

What are core principles that an architect must consider in analyzing data for a solution during the system design aspect?

A

Identify the pros/cons of the different alternatives:
Dataflow lets you write complex transformations in a serverless approach, but you must rely on an opinionated version of configurations for compute and processing needs.
Alternatively, Dataproc lets you run the same transformations, but you manage the clusters and fine-tune the jobs yourself.

Processing strategy
In your system design, think about which processing strategy your teams use, such as extract, transform, load (ETL) or extract, load, transform (ELT). Your system design should also consider whether you need to process batch analytics or streaming analytics. Google Cloud provides a unified data platform, and it lets you build a data lake or a data warehouse to meet your business needs.
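The ETL/ELT distinction is about where the transform runs relative to the load. A toy Python sketch of that ordering (the "warehouse" is a plain list standing in for BigQuery; all names are mine):

```python
# Illustrative only: ETL transforms data before the warehouse sees it;
# ELT loads raw data first and transforms it inside the warehouse.

def etl(rows, transform, warehouse):
    """Extract -> Transform -> Load."""
    warehouse.extend(transform(r) for r in rows)

def elt(rows, transform, warehouse):
    """Extract -> Load -> Transform (in place, where the warehouse would)."""
    warehouse.extend(rows)
    warehouse[:] = [transform(r) for r in warehouse]

clean = lambda r: r.strip().lower()
w1, w2 = [], []
etl([" A ", " B "], clean, w1)
elt([" A ", " B "], clean, w2)
print(w1 == w2 == ["a", "b"])  # same end state, different pipeline shape
```

The end state is identical; the choice affects where compute happens, which is why a warehouse like BigQuery makes ELT attractive while a processing engine like Dataflow suits ETL.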

https://cloud.google.com/architecture/framework/system-design/data-analytics

14
Q

When you create your system design, you can group the Google Cloud data analytics services around the general data movement in any system, or around the data lifecycle. What services can an architect use during each stage of the data lifecycle?

A

The data lifecycle includes the following stages and example services:

Ingestion includes services such as Pub/Sub, Storage Transfer Service, Transfer Appliance, Cloud IoT Core, and BigQuery.
Storage includes services such as Cloud Storage, Bigtable, Memorystore, and BigQuery.
Processing and transformation includes services such as Dataflow, Dataproc, Dataprep, Cloud Data Loss Prevention (Cloud DLP), and BigQuery.
Analysis and warehousing includes services such as BigQuery.
Reporting and visualization includes services such as Looker Studio and Looker.
The following stages and services run across the entire data lifecycle:

Data integration includes services such as Data Fusion.
Metadata management and governance includes services such as Data Catalog.
Workflow management includes services such as Cloud Composer.
https://cloud.google.com/architecture/framework/system-design/data-analytics

15
Q

What are the data ingestion best practices an architect should consider during the system design phase?

A

Determine the data source for ingestion
Data typically comes from another cloud provider or service, or from an on-premises location:

To ingest data from other cloud providers, you typically use Cloud Data Fusion, Storage Transfer Service, or BigQuery Data Transfer Service.

For on-premises data ingestion, consider the volume of data to ingest and your team’s skill set. If your team prefers a low-code, graphical user interface (GUI) approach, use Cloud Data Fusion with a suitable connector, such as Java Database Connectivity (JDBC). For large volumes of data, you can use Transfer Appliance or Storage Transfer Service.

Consider how you want to process your data after you ingest it. For example, Storage Transfer Service only writes data to a Cloud Storage bucket, and BigQuery Data Transfer Service only writes data to a BigQuery dataset. Cloud Data Fusion supports multiple destinations.
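One way to internalize these rules is to encode them as a small decision helper. This is a sketch of the card's guidance, not an official flowchart; the product names are real, but the function, its parameters, and the 1 TB cutoff for gsutil (taken from the related ingestion cards) are my framing.

```python
# Hedged sketch: the data-source decision rules above as a lookup.

def pick_ingestion_service(source, volume_tb=0, low_code=False):
    """Return candidate Google Cloud services for an ingestion scenario."""
    if source == "other_cloud":
        return ["Cloud Data Fusion", "Storage Transfer Service",
                "BigQuery Data Transfer Service"]
    if source == "on_premises":
        if low_code:
            return ["Cloud Data Fusion"]  # GUI + connectors such as JDBC
        if volume_tb > 1:
            return ["Transfer Appliance", "Storage Transfer Service"]
        return ["gsutil"]
    raise ValueError(f"unknown source: {source}")

print(pick_ingestion_service("on_premises", volume_tb=50))
# ['Transfer Appliance', 'Storage Transfer Service']
```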

https://cloud.google.com/architecture/framework/system-design/data-analytics

16
Q

How do architects determine which services to use for data ingestion during the system design phase?

A

Identify streaming or batch data services to use. Consider how you need to use your data and identify where you have streaming or batch use cases:

Pub/Sub: used for a global streaming service that has low-latency requirements.
BigQuery: used when you need analytics and reporting.
Kafka to BigQuery Dataflow template: used if you need to stream data from a system like Apache Kafka in an on-premises or other cloud environment.
Storage Transfer Service: used for batch workloads; the first step is usually to ingest data into Cloud Storage with the gsutil tool.
BigQuery Data Transfer Service: used for scheduled transfers into BigQuery.

Apart from the tools above, you also have the following data pipeline options to load data into BigQuery:

Cloud Dataflow

Dataflow is a fully managed service on GCP built on the open source Apache Beam API, with support for various data sources: files, databases, message-based sources, and more. With Dataflow you can transform and enrich data in both batch and streaming modes with the same code. Google provides prebuilt Dataflow templates for batch jobs.

Cloud Dataproc

Dataproc is a fully managed service on GCP for Apache Spark and Apache Hadoop. Dataproc provides a BigQuery connector that enables Spark and Hadoop applications to process data from BigQuery and write data to BigQuery using its native terminology.

Cloud Logging

This is not a data pipeline option, but Cloud Logging (previously known as Stackdriver) provides an option to export log files into BigQuery. See Exporting with the Logs Viewer for more information, and the reference guide on exporting logs to BigQuery for security and access analytics.

https://cloud.google.com/architecture/framework/system-design/data-analytics

17
Q

When can you query data in BigQuery without loading it first?

A

You don’t need to load data into BigQuery before running queries in the following situations:

Public Datasets: Public datasets are datasets stored in BigQuery and shared with the public. For more information, see BigQuery public datasets.

Shared Datasets: You can share datasets stored in BigQuery. If someone has shared a dataset with you, you can run queries on that dataset without loading the data.

External data sources (Federated): You can skip the data loading process by creating a table based on an external data source.

https://cloud.google.com/blog/topics/developers-practitioners/bigquery-explained-data-ingestion

18
Q

What services allow you to ingest data using automation?

A

Ingest data with automated tools
Manually moving data from other systems into the cloud can be a challenge. Where possible, use tools that let you automate the data ingestion process.

Cloud Data Fusion
Provides connectors and plugins to bring in data from external sources with a drag-and-drop GUI.

Dataflow or BigQuery
If your teams want to write some code, these services let you automate data ingestion.

Pub/Sub
Supports both a low-code and a code-first approach.

Storage Transfer Service
To ingest data into storage buckets, use gsutil for data sizes of up to 1 TB. To ingest amounts of data larger than 1 TB, use Storage Transfer Service.

19
Q

What services do you use to regularly ingest data on a schedule?

A

Storage Transfer Service and BigQuery Data Transfer Service both let you schedule ingestion jobs. For fine-grain control of the timing of ingestion or the source and destination system, use a workflow-management system like Cloud Composer. If you want a more manual approach, you can use Cloud Scheduler and Pub/Sub to trigger a Cloud Function.
If you want to manage the compute infrastructure, you can use the gsutil command with cron for data transfers of up to 1 TB. If you use this manual approach instead of Cloud Composer, follow the best practices to script production transfers.
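The scheduling options above can be sketched as a decision helper. The branch order is my reading of the card, not an official rubric; the 1 TB cap on gsutil-with-cron comes from the text, and the function and flag names are hypothetical.

```python
# Hedged sketch: which scheduling approach fits, per this card's guidance.

def pick_scheduling_approach(fine_grained_control=False, manual=False,
                             manage_own_compute=False, volume_tb=0):
    if manage_own_compute:
        if volume_tb > 1:
            raise ValueError("gsutil with cron is only suggested up to 1 TB")
        return "gsutil with cron"
    if fine_grained_control:
        return "Cloud Composer"                              # workflow management
    if manual:
        return "Cloud Scheduler + Pub/Sub triggering a Cloud Function"
    return "Storage Transfer Service or BigQuery Data Transfer Service schedules"

print(pick_scheduling_approach(fine_grained_control=True))  # Cloud Composer
```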

https://cloud.google.com/architecture/framework/system-design/data-analytics

20
Q

How do architects pick the right datastore based on their needs?

A

Identify which of the following common use cases for your data to pick which Google Cloud product to use:

Data use case: Product recommendation
File-based: Filestore
Object-based: Cloud Storage
Low latency: Bigtable
Time series: Bigtable
Online cache: Memorystore
Transaction processing: Cloud SQL
Business intelligence (BI) & analytics: BigQuery
Batch processing: Cloud Storage

Bigtable if incoming data is time series and you need low latency access to it.

BigQuery if you use SQL.
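The table above reduces to a simple lookup. The products are real Google Cloud services; the dictionary name and use-case keys are my labels.

```python
# The datastore-selection table as a lookup (sketch; keys are my labels).

DATASTORE_BY_USE_CASE = {
    "file_based": "Filestore",
    "object_based": "Cloud Storage",
    "low_latency": "Bigtable",
    "time_series": "Bigtable",
    "online_cache": "Memorystore",
    "transaction_processing": "Cloud SQL",
    "bi_and_analytics": "BigQuery",
    "batch_processing": "Cloud Storage",
}

print(DATASTORE_BY_USE_CASE["time_series"])  # Bigtable
```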

https://cloud.google.com/architecture/framework/system-design/data-analytics

21
Q

What would an architect use when you need to ingest data from multiple sources?

A

Use Dataflow to ingest data from multiple sources
To ingest data from multiple sources, such as Pub/Sub, Cloud Storage, HDFS, S3, or Kafka, use Dataflow. Dataflow is a managed serverless service that supports Dataflow templates, which let your teams run templates from different tools.

Dataflow Prime provides horizontal and vertical autoscaling of machines that are used in the execution process of a pipeline. It also provides smart diagnostics and recommendations that identify problems and suggest how to fix them.

https://cloud.google.com/architecture/framework/system-design/data-analytics

22
Q

What is the System design category of the architecture framework?

A

This category is the foundation of the Google Cloud Architecture Framework. Define the architecture, components, modules, interfaces, and data needed to satisfy cloud system requirements, and learn about Google Cloud products and features that support system design.

https://cloud.google.com/architecture/framework

23
Q

What is the Operational excellence category?

A

Efficiently deploy, operate, monitor, and manage your cloud workloads.

https://cloud.google.com/architecture/framework

24
Q

What is the Security, privacy, and compliance category of the architecture framework?

A

Maximize the security of your data and workloads in the cloud, design for privacy, and align with regulatory requirements and standards.

https://cloud.google.com/architecture/framework

25
Q

What is the Reliability category of the architecture framework?

A

Design and operate resilient and highly available workloads in the cloud.

https://cloud.google.com/architecture/framework

26
Q

What is the Cost optimization category of the architecture framework?

A

Maximize the business value of your investment in Google Cloud.

https://cloud.google.com/architecture/framework

27
Q

What is the Performance optimization category of the architecture framework?

A

Design and tune your cloud resources for optimal performance.

https://cloud.google.com/architecture/framework

28
Q

How do you simplify your design and use fully managed services?

A

Simplicity is crucial for system design. If your architecture is too complex to understand, it will be difficult to implement the design and manage it over time. Where feasible, use fully managed services to minimize the risks, time, and effort associated with managing and maintaining baseline systems.

If you’re already running your workloads in production, test with managed services to see how they might help to reduce operational complexities. If you’re developing new workloads, then start simple, establish a minimal viable product (MVP), and resist the urge to over-engineer. You can identify exceptional use cases, iterate, and improve your systems incrementally over time.

https://cloud.google.com/architecture/framework/system-design/principles

29
Q

How do you decouple your architecture, for example a monolithic application?

A

Decoupling is a technique that’s used to separate your applications and service components into smaller components that can operate independently.
For example, you might break up a monolithic application stack into separate service components. In a decoupled architecture, an application can run its functions independently, regardless of the various dependencies.

You can start decoupling early in your design phase or incorporate it as part of your system upgrades as you scale.
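A minimal sketch of decoupling, assuming two hypothetical services that share only a queue. `queue.Queue` stands in for a message broker such as Pub/Sub; the service names are invented for illustration.

```python
# Illustrative sketch: the two components share only a queue, so either
# side can be upgraded, scaled, or restarted independently of the other.
import queue

def order_service(out_q):
    """Producer: publishes events; knows nothing about the consumer."""
    out_q.put({"order_id": 1, "total": 30})

def billing_service(in_q):
    """Consumer: processes events; knows nothing about the producer."""
    event = in_q.get()
    return f"charged order {event['order_id']} for {event['total']}"

q = queue.Queue()          # stands in for a broker such as Pub/Sub
order_service(q)
print(billing_service(q))  # charged order 1 for 30
```

Because the only contract is the event shape on the queue, each side can meet its own reliability goals, security controls, and upgrade schedule, which is the flexibility the next card lists.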

https://cloud.google.com/architecture/framework/system-design/principles

30
Q

What are the benefits of a decoupled architecture?

A

A decoupled architecture gives you increased flexibility to do the following:

Apply independent upgrades.
Enforce specific security controls.
Establish reliability goals for each subsystem.
Monitor health.
Granularly control performance and cost parameters.

https://cloud.google.com/architecture/framework/system-design/principles

31
Q

Why use a stateless architecture?

A

A stateless architecture can increase both the reliability and scalability of your applications.

Stateful applications rely on various dependencies to perform tasks, such as locally cached data. Stateful applications often require additional mechanisms to capture progress and restart gracefully.

Stateless applications can perform tasks without significant local dependencies by using shared storage or cached services. A stateless architecture enables your applications to scale up quickly with minimum boot dependencies. The applications can withstand hard restarts, have lower downtime, and provide better performance for end users.

https://cloud.google.com/architecture/framework/system-design/principles

32
Q

When you select a region or multiple regions for your business applications, what criteria might you consider for your decision?

A

When you select a region or multiple regions for your business applications, you consider criteria including:
service availability
end-user latency
application latency
cost
regulatory or sustainability requirements.

Decision Making Process:
To support your business priorities and policies, balance these requirements and identify the best tradeoffs.

For example, the most compliant region might not be the most cost-efficient region or it might not have the lowest carbon footprint.
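One way to balance these criteria is a weighted score per candidate region. This is a sketch with made-up scores and weights, not Google guidance; the region names are real, everything else is illustrative.

```python
# Hedged sketch: weigh the selection criteria per region and pick the
# best tradeoff for a given set of business priorities.

def best_region(regions, weights):
    """regions: {name: {criterion: score 0-10}}; weights: {criterion: float}."""
    def total(scores):
        return sum(weights[c] * scores.get(c, 0) for c in weights)
    return max(regions, key=lambda name: total(regions[name]))

regions = {
    "europe-west1": {"latency": 9, "cost": 6, "compliance": 9, "carbon": 8},
    "us-central1":  {"latency": 4, "cost": 9, "compliance": 5, "carbon": 7},
}
# A GDPR-bound workload weights compliance heavily:
print(best_region(regions, {"latency": 1, "cost": 1, "compliance": 3, "carbon": 1}))
# europe-west1
```

Changing the weights changes the winner, which is the point of the card: the most compliant region may lose on cost, and the tradeoff is a business decision, not a fixed answer.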

https://cloud.google.com/architecture/framework/system-design/geographic-zones-regions

33
Q

Why would you deploy a customer solution over multiple regions?

A

Deploy over multiple regions

Why choose this strategy?

To help protect against expected downtime (including maintenance) and help protect against unexpected downtime like incidents, we recommend that you deploy fault-tolerant applications that have high availability and deploy your applications across multiple zones in one or more regions. For more information, see Geography and regions, Application deployment considerations, and Best practices for Compute Engine regions selection.

Multi-zonal deployments can provide resiliency if multi-region deployments are limited due to cost or other considerations.

This approach is especially helpful in preventing zonal or regional outages and in addressing disaster recovery and business continuity concerns.

For more information, see Design for scale and high availability.

https://cloud.google.com/architecture/framework/system-design/geographic-zones-regions

34
Q

Why would an architect select regions based on geographic proximity?

A

Why choose regions in closer proximity?

Latency impacts the user experience and affects costs associated with serving external users.
To minimize latency when serving traffic to external users, select a region or set of regions that are geographically close to your users and where your services run in a compliant way.
For more information, see Cloud locations and the Compliance resource center.

https://cloud.google.com/architecture/framework/system-design/geographic-zones-regions

35
Q

Why and how would you select regions based on available services?

A

Not all services are available in every region, so you must verify their availability during the design process.

Select a region based on the available services that your business requires. Most services are available across all regions. Some enterprise-specific services might be available in a subset of regions with their initial release.

To verify region selection, see Cloud locations.

https://cloud.google.com/about/locations
https://cloud.google.com/architecture/framework/system-design/geographic-zones-regions

36
Q

Why would you need to choose regions to support compliance?

A

Select a specific region or set of regions to meet regulatory or compliance requirements that mandate the use of certain geographies, for example the General Data Protection Regulation (GDPR) or data residency.
To learn more about designing secure systems, see Compliance offerings and Data residency, operational transparency, and privacy for European customers on Google Cloud.

https://cloud.google.com/architecture/framework/system-design/geographic-zones-regions

37
Q

Why and how must you compare pricing of major resources for a customer solution?

A

Compare prices across the different options for your regions.

Regions have different cost rates for the same services. To identify a cost-efficient region, compare pricing of the major resources that you plan to use. Cost considerations differ depending on backup requirements and resources like compute, networking, and data storage.

To learn more, see the Cost optimization category.
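A sketch of the comparison: price the same planned usage in each candidate region and take the cheaper one. All rates below are invented for illustration; real figures come from the Google Cloud pricing pages.

```python
# Hedged sketch: the same bill priced in two regions (rates are made up).

def monthly_cost(usage, rates):
    """usage: {resource: units}; rates: {resource: price per unit}."""
    return sum(units * rates[r] for r, units in usage.items())

usage = {"vcpu_hours": 2000, "storage_gb": 500, "egress_gb": 100}
rates = {
    "us-central1":  {"vcpu_hours": 0.031, "storage_gb": 0.020, "egress_gb": 0.12},
    "europe-west2": {"vcpu_hours": 0.037, "storage_gb": 0.023, "egress_gb": 0.12},
}
cheapest = min(rates, key=lambda r: monthly_cost(usage, rates[r]))
print(cheapest)  # us-central1
```

Note the comparison depends on the usage mix: a storage-heavy workload and a compute-heavy workload can rank the same regions differently, which is why the card says to compare the major resources you actually plan to use.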

https://cloud.google.com/architecture/framework/system-design/geographic-zones-regions

38
Q

How do you pick a region to support sustainability?

A

Use the Cloud Region Picker to support sustainability
Google has been carbon neutral since 2007 and is committed to being carbon-free by 2030. To select a region by its carbon footprint, use the Google Cloud Region Picker. To learn more about designing for sustainability, see Cloud sustainability.

https://cloud.google.com/architecture/framework/system-design/geographic-zones-regions

39
Q

How would you create a solution that serves global users?

A

Use Cloud Load Balancing to serve global users
To improve the user experience when you serve global users, use Cloud Load Balancing to help provide a single IP address that is routed to your application. To learn more about designing reliable systems, see Google Cloud Architecture Framework: Reliability.

https://cloud.google.com/architecture/framework/system-design/geographic-zones-regions

40
Q

How should an architect create a strategy for folder structure?

A

Use a simple folder structure
Folders let you group any combination of projects and subfolders. Create a simple folder structure to organize your Google Cloud resources. You can add more levels as needed to define your resource hierarchy so that it supports your business needs.

The folder structure is flexible and extensible. To learn more, see Creating and managing folders.

A common situation is to create folders that in turn contain additional folders or projects. This structure is referred to as a folder hierarchy. When creating a folder hierarchy, keep in mind the following:

You can nest folders up to 10 (ten) levels deep.
A parent folder cannot contain more than 300 folders. This refers to direct child folders only. Those child folders can, in turn, contain additional folders or projects.
Folder display names must be unique within the same level of the hierarchy.
You can use folder-level IAM policies to control access to the resources the folder contains. For example, if a user is granted the Compute Instance Admin role on a folder, that user has the Compute Instance Admin role for all of the projects in the folder.
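The limits in this list can be encoded as a validator for a planned hierarchy. This is a sketch: the limits (10 levels, 300 direct child folders, unique sibling names) are from the card, while the function name and tree layout are hypothetical.

```python
# Hedged sketch: check a planned folder tree against the documented limits.

def validate_folders(folder, depth=1):
    """folder: {"name": str, "children": [folder, ...]}; raises on violations."""
    if depth > 10:
        raise ValueError("folders can nest at most 10 levels deep")
    children = folder.get("children", [])
    if len(children) > 300:
        raise ValueError("a parent folder can hold at most 300 direct child folders")
    names = [c["name"] for c in children]
    if len(names) != len(set(names)):
        raise ValueError("display names must be unique within the same level")
    for child in children:
        validate_folders(child, depth + 1)
    return True

tree = {"name": "org", "children": [
    {"name": "dev",  "children": []},
    {"name": "prod", "children": []},
]}
print(validate_folders(tree))  # True
```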

Before you begin
Folder functionality is only available to Google Workspace and Cloud Identity customers that have an organization resource. For more information about acquiring an organization resource, see Creating and managing organizations.

If you’re exploring how to best use folders, we recommend that you:

Review Access Control for Folders Using IAM. The topic describes how you can control who has what access to folders and the resources they contain.
Understand how to set folder permissions. Folders support a number of different IAM roles. If you want to broadly set up permissions so users can see the structure of their projects, grant the entire domain the Organization Viewer and Folder Viewer roles at the organization level. To restrict visibility to branches of your folder hierarchy, grant the Folder Viewer role on the folder or folders you want users to see.
Create folders. As you plan how to organize your Cloud resources, we recommend that you start with a single folder as a sandbox where you can experiment with which hierarchy makes the most sense for your organization. Think of folders in terms of isolation boundaries between resources and attach points for access and configuration policies. You may choose to create folders to contain resources that belong to different departments and assign Admin roles on folders to delegate administrator privilege. Folders can also be used to group resources that belong to applications or different environments, such as development, production, test. Use nested folders to model these different scenarios.

https://cloud.google.com/architecture/framework/system-design/resource-management

https://cloud.google.com/resource-manager/docs/creating-managing-folders

41
Q

What are the different options for organizing resources in the customer’s IAM resource hierarchy?

A

Option 1: Hierarchy based on application environments
In many organizations, you define different policies and access controls for different application environments, such as development, production, and testing. Having separate policies that are standardized across each environment eases management and configuration. For example, you might have security policies that are more stringent in production environments than in testing environments.

Use a hierarchy based on application environments if the following is true:

You have separate application environments that have different policy and administration requirements.
You have use cases that have highly customized security or audit requirements.
You require different Identity and Access Management (IAM) roles to access your Google Cloud resources in different environments.
Avoid this hierarchy if the following is true:

You don’t run multiple application environments.
You don’t have varying application dependencies and business processes across environments.
You have strong policy differences for different regions, teams, or applications.

Option 2: Hierarchy based on regions or subsidiaries
Some organizations operate across many regions, have subsidiaries doing business in different geographies, or have grown through mergers and acquisitions. These organizations require a resource hierarchy that uses the scalability and management options in Google Cloud and maintains the independence of the different processes and policies that exist between the regions or subsidiaries. This hierarchy uses subsidiaries or regions as the highest folder level in the resource hierarchy. Deployment procedures are typically focused around the regions.

Use this hierarchy if the following is true:

Different regions or subsidiaries operate independently.
Regions or subsidiaries have different operational backbones, digital platform offerings, and processes.
Your business has different regulatory and compliance standards for regions or subsidiaries.

Option 3: Hierarchy based on an accountability framework
A hierarchy based on an accountability framework works best when your products are run independently or organizational units have clearly defined teams who own the lifecycle of the products. In these organizations, the product owners are responsible for the entire product lifecycle, including its processes, support, policies, and access rights. Your products are quite different from each other, so only a few organization-wide guidelines exist.

Use this hierarchy when the following is true:

You run an organization that has clear ownership and accountability for each product.
Your workloads are independent and don’t share many common policies.
Your processes and external developer platforms are offered as service or product offerings.

https://cloud.google.com/architecture/framework/system-design/resource-management
https://cloud.google.com/architecture/landing-zones/decide-resource-hierarchy

42
Q

What are best practices for designing a resource hierarchy for your organization?

A

Use folders and projects to reflect data governance policies
Use folders, subfolders, and projects to separate resources from each other to reflect data governance policies within your organization. For example, you can use a combination of folders and projects to separate the financial, human resources, and engineering departments.

Use projects to group resources that share the same trust boundary. For example, resources for the same product or microservice can belong to the same project. For more information, see Decide a resource hierarchy for your Google Cloud landing zone.

Use a single organization node
To avoid management overhead, use a single organization node whenever possible. However, consider using multiple organization nodes to address the following use cases:

You want to test major changes to your IAM policies or resource hierarchy.
You want to experiment in a sandbox environment that doesn’t have the same organization policies.
Your organization includes sub-companies that are likely to be sold off or run as completely separate entities in the future.
Use standardized naming conventions
Use a standardized naming convention throughout your organization. The security foundations blueprint has a sample naming convention that you can adapt to your requirements.

Understand resource interactions throughout the hierarchy
Understand which resources interact with the resource hierarchy and how the folder structure works for them.

Organization policies are inherited by descendants in the resource hierarchy, but can be superseded by policies defined at a lower level. For more information, see understanding hierarchy evaluation. You use organization policy constraints to set guidelines around the whole organization or significant parts of it and still allow for exceptions.

IAM policies are inherited by descendants and cannot be superseded. However, you can add more access controls at lower levels of the hierarchy. See Using resource hierarchy for access control for details.
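The additive inheritance described above can be sketched in a few lines: inherited role bindings cannot be removed at lower levels, so the effective policy is the union of the bindings on the resource and all of its ancestors. This is a hypothetical illustration of the rule, not the real IAM API; the names and policy shapes are assumptions.

```python
# Sketch of IAM policy inheritance: bindings accumulate down the
# hierarchy -- nothing inherited from an ancestor can be overridden,
# only added to. Hypothetical data shapes; not a Google Cloud API.

def effective_bindings(ancestry):
    """ancestry: list of {role: set_of_members} dicts, ordered from the
    organization node down to the resource. Returns the union."""
    effective = {}
    for policy in ancestry:
        for role, members in policy.items():
            effective.setdefault(role, set()).update(members)
    return effective

org = {"roles/viewer": {"group:everyone@example.com"}}
project = {"roles/editor": {"user:dev@example.com"}}
# The org-level viewer binding is still in effect on the project,
# alongside the project-level editor binding.
print(effective_bindings([org, project]))
```

Contrast this with tags, where a lower-level tag with the same key replaces the inherited value.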

You also need to consider the following:

Cloud Logging includes aggregated sinks that you can use to aggregate logs at the folder or organization level.
Billing is not directly linked to the resource hierarchy, but assigned at the project level. However, to get aggregated information at the folder level, you can analyze your costs by project hierarchy using billing reports.
Hierarchical firewall policies let you implement consistent firewall policies throughout the organization or in specific folders. Inheritance is implicit, which means that you can allow or deny traffic at any level or you can delegate the decision to a lower level.
Keep bootstrapping resources and common services separate
Keep separate folders for bootstrapping the Google Cloud environment using infrastructure-as-code (IaC) and for common services that are shared between environments or applications. Place the bootstrap folder right below the organization node in the resource hierarchy.

Place the folders for common services at different levels of the hierarchy, depending on the structure that you choose. Place the folder for common services right below the organization node when the following is true:

Your hierarchy uses application environments at the highest level and teams or applications at the second layer.
You have shared services such as monitoring that are common between environments.
Place the folder for common services at a lower level, below the application folders, when you have services that are shared between applications but are deployed separately for each deployment environment, for example shared microservices that are used by multiple applications but are updated regularly and require development and testing.

https://cloud.google.com/architecture/framework/system-design/resource-management

https://cloud.google.com/architecture/landing-zones/decide-resource-hierarchy

43
Q

When should you start using tags and labels, and what is each one for?

A

When?
Use tags and labels at the outset of your project

Use labels and tags when you start to use Google Cloud products, even if you don’t need them immediately. Adding labels and tags later on can require manual effort that can be error prone and difficult to complete.

A tag provides a way to conditionally allow or deny policies based on whether a resource has a specific tag. A label is a key-value pair that helps you organize your Google Cloud instances. For more information on labels, see requirements for labels, a list of services that support labels, and label formats.
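As a rough illustration of those label format rules, here is a validator sketch. It encodes the commonly documented constraints for label keys (up to 63 characters; lowercase letters, digits, hyphens, and underscores; must start with a lowercase letter) — treat the linked requirements page as authoritative, since these limits are stated here from memory.

```python
import re

# Sketch of a label-key validator based on the commonly documented
# Cloud label format: <= 63 characters, lowercase letters, digits,
# hyphens, and underscores, starting with a lowercase letter.
# Verify against the official "requirements for labels" page.
LABEL_KEY_RE = re.compile(r"^[a-z][a-z0-9_-]{0,62}$")

def is_valid_label_key(key: str) -> bool:
    return bool(LABEL_KEY_RE.match(key))

print(is_valid_label_key("cost-center"))  # True
print(is_valid_label_key("CostCenter"))   # False: uppercase not allowed
```

Validating keys like this at resource-creation time catches naming drift before labels reach billing reports.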

Resource Manager provides labels and tags to help you manage resources, allocate and report on cost, and assign policies to different resources for granular access controls.
For example, you can use labels and tags to apply granular access and management principles to different tenant resources and services. For information about VM labels and network tags, see Relationship between VM labels and network tags.

You can use labels for multiple purposes, including the following:

Managing resource billing: Labels are available in the billing system, which lets you separate cost by labels. For example, you can label different cost centers or budgets.
Grouping resources by similar characteristics or by relation: You can use labels to separate different application lifecycle stages or environments. For example, you can label production, development, and testing environments.

Tag inheritance
When a tag key-value pair is attached to a resource, all descendants of the resource inherit the tag. You can override an inherited tag on a descendant resource. To override an inherited tag, apply a tag using the same key as the inherited tag, but use a different value.
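The inheritance-with-override rule above can be sketched as a small resolver — a minimal illustration of the behavior, not a Google Cloud API; the hierarchy levels and tag values are hypothetical.

```python
# Minimal sketch of tag inheritance: descendants inherit ancestor tags,
# and a tag applied lower in the hierarchy with the same key overrides
# the inherited value. Hypothetical names; not a Google Cloud API.

def effective_tags(ancestry):
    """ancestry: list of {key: value} dicts, ordered from the organization
    node down to the resource. Lower levels override inherited keys."""
    tags = {}
    for level in ancestry:
        tags.update(level)  # same key at a lower level replaces the value
    return tags

hierarchy = [
    {"environment": "production"},   # organization
    {},                              # folder (no tags of its own)
    {"environment": "development"},  # project overrides the inherited value
]
print(effective_tags(hierarchy))  # {'environment': 'development'}
```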

https://cloud.google.com/architecture/framework/system-design/resource-management
https://cloud.google.com/resource-manager/docs/tags/tags-overview

44
Q

What are best practice uses for tags and labels?

A

Assign labels to support cost and billing reporting
To support granular cost and billing reporting based on attributes outside of your integrated reporting structures (like per-project or per-product type), assign labels to resources. Labels can help you allocate consumption to cost centers, departments, specific projects, or internal recharge mechanisms. For more information, see the Cost optimization category.

Avoid creating large numbers of labels
Avoid creating large numbers of labels. We recommend that you create labels primarily at the project level, and that you avoid creating labels at the sub-team level. If you create overly granular labels, it can add noise to your analytics. To learn about common use cases for labels, see Common uses of labels.

Avoid adding sensitive information to labels
Labels aren’t designed to handle sensitive information. Don’t include sensitive information in labels, including information that might be personally identifiable, like an individual’s name or title.

Apply tags to model business dimensions
You can apply tags to model additional business dimensions like organization structure, regions, workload types, or cost centers. To learn more about tags, see Tags overview, Tag inheritance, and Creating and managing tags. To learn how to use tags with policies, see Policies and tags. To learn how to use tags to manage access control, see Tags and access control.

https://cloud.google.com/architecture/framework/system-design/resource-management

45
Q

What are best practices for project names?

A

Establish project naming conventions
Establish a standardized project naming convention, for example, SYSTEM_NAME-ENVIRONMENT (dev, test, uat, stage, prod).

Project names have a 30-character limit.

Although you can apply a prefix like COMPANY_TAG-SUB_GROUP/SUBSIDIARY_TAG, project names can become out of date when companies go through reorganizations. Consider moving identifiable names from project names to project labels.
Anonymize information in project names
Follow a project naming pattern like COMPANY_INITIAL_IDENTIFIER-ENVIRONMENT-APP_NAME, where the placeholders are unique and don’t reveal company or application names. Don’t include attributes that can change in the future, for example, a team name or technology.
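The naming pattern above can be sketched as a small helper that also enforces the 30-character project name limit mentioned earlier. The placeholder values are hypothetical; this is an illustrative convention, not an official tool.

```python
# Sketch of a project-name builder following the pattern
# COMPANY_INITIAL_IDENTIFIER-ENVIRONMENT-APP_NAME, enforcing the
# 30-character project name limit. Example values are hypothetical.

MAX_PROJECT_NAME_LEN = 30

def project_name(company_id: str, environment: str, app_name: str) -> str:
    name = f"{company_id}-{environment}-{app_name}"
    if len(name) > MAX_PROJECT_NAME_LEN:
        raise ValueError(
            f"project name {name!r} exceeds {MAX_PROJECT_NAME_LEN} characters"
        )
    return name

print(project_name("acme01", "prod", "billing"))  # acme01-prod-billing
```

Centralizing the convention in one helper (or an IaC module) keeps names consistent across teams and prevents over-limit names from reaching deployment.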

https://cloud.google.com/architecture/framework/system-design/resource-management

46
Q

What factors should an architect consider when creating an organization policy?

A

Use the Organization Policy Service to control resources
The Organization Policy Service gives policy administrators centralized and programmatic control over your organization’s cloud resources so that they can configure constraints across the resource hierarchy. For more information, see Add an organization policy administrator.

Use the Organization Policy Service to comply with regulatory policies
To meet compliance requirements, use the Organization Policy Service to enforce compliance requirements for resource sharing and access. For example, you can limit sharing with external parties or determine where to deploy cloud resources geographically. Other compliance scenarios include the following:

Centralizing control to configure restrictions that define how your organization’s resources can be used.
Defining and establishing policies to help your development teams remain within compliance boundaries.
Helping project owners and their teams make system changes while maintaining regulatory compliance and minimizing concerns about breaking compliance rules.

https://cloud.google.com/architecture/framework/system-design/resource-management

47
Q

What best practices should you keep in mind when you organize the hierarchy for your resources?

A

Google Cloud resources are arranged hierarchically in organizations, folders, and projects. This hierarchy lets you manage common aspects of your resources like access control, configuration settings, and policies.
For best practices to design the hierarchy of your cloud resources, see

Decide a resource hierarchy for your Google Cloud landing zone based on the flowchart and resources.

https://cloud.google.com/resource-manager/docs/cloud-platform-resource-hierarchy

https://cloud.google.com/architecture/landing-zones/decide-resource-hierarchy

https://cloud.google.com/architecture/framework/system-design/resource-management

48
Q

Why is selecting compute options so important for an architect to consider?

A

Computation is at the core of many workloads, whether it refers to the execution of custom business logic or the application of complex computational algorithms against datasets. Most solutions use compute resources in some form, and it’s critical that you select the right compute resources for your application needs.

Google Cloud provides several options for using time on a CPU. Options are based on CPU types, performance, and how your code is scheduled to run, including usage billing.

Google Cloud compute options include the following:

Virtual machines (VM) with cloud-specific benefits like live migration.
Bin-packing of containers on cluster-machines that can share CPUs.
Functions and serverless approaches, where your use of CPU time can be metered to the work performed during a single HTTP request.

https://cloud.google.com/architecture/framework/system-design/compute

49
Q

When you choose a compute platform for your workload, what should you consider?

A

The technical requirements of the workload, lifecycle automation processes, regionalization, and security.

Evaluate the nature of CPU usage by your app and the entire supporting system, including how your code is packaged and deployed, distributed, and invoked. While some scenarios might be compatible with multiple platform options, a portable workload should be capable and performant on a range of compute options.

Choose a compute migration approach
If you’re migrating your existing applications from another cloud or from on-premises, use one of the following Google Cloud products to help you optimize for performance, scale, cost, and security.

https://cloud.google.com/architecture/framework/system-design/compute

50
Q

How does an architect design workloads for their solutions?

A

This section provides best practices for designing workloads to support your system.

Evaluate serverless options for simple logic
Simple logic is a type of compute that doesn’t require specialized hardware or machine types like CPU-optimized machines. Before you invest in Google Kubernetes Engine (GKE) or Compute Engine implementations to abstract operational overhead and optimize for cost and performance, evaluate serverless options for lightweight logic.

Decouple your applications to be stateless
Where possible, decouple your applications to be stateless to maximize use of serverless computing options. This approach lets you use managed compute offerings, scale applications based on demand, and optimize for cost and performance. For more information about decoupling your application to design for scale and high availability, see Design for scale and high availability.

Use caching logic when you decouple architectures
If your application is designed to be stateful, use caching logic to decouple and make your workload scalable. For more information, see Database best practices.

Use live migrations to facilitate upgrades
To facilitate Google maintenance upgrades, use live migration by setting instance availability policies. For more information, see Set VM host maintenance policy.

https://cloud.google.com/architecture/framework/system-design/compute

51
Q

What are the best practices an architect uses for scaling workloads for a solution?

A

This section provides best practices for scaling workloads to support your system.

Use startup and shutdown scripts
For stateful applications, use startup and shutdown scripts where possible to start and stop your application state gracefully. A graceful startup is when a computer is turned on by a software function and the operating system is allowed to perform its tasks of safely starting processes and opening connections.

Graceful startups and shutdowns are important because stateful applications depend on immediate availability of the data that sits close to the compute, usually on local or persistent disks, or in RAM. To avoid reprocessing application data from scratch on each startup, use a startup script to reload the last saved data and resume the process from where it stopped at shutdown. To preserve the application's memory state and avoid losing progress on shutdown, use a shutdown script. For example, use a shutdown script when a VM is scheduled to shut down due to downscaling or Google maintenance events.
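The save-and-reload pattern above can be sketched at the application level — a minimal, hypothetical checkpoint scheme. The file path and state shape are illustrative assumptions; this is application code that a startup or shutdown script would invoke, not a Compute Engine API.

```python
import json
from pathlib import Path

# Minimal sketch of the checkpoint pattern: a shutdown hook persists
# in-memory state, and the startup path reloads it so work resumes
# where it stopped. Path and state shape are hypothetical.
CHECKPOINT = Path("/tmp/app-state.json")

def save_state(state: dict) -> None:
    """Called from a shutdown script/handler before the VM stops."""
    CHECKPOINT.write_text(json.dumps(state))

def load_state() -> dict:
    """Called from a startup script; returns the last saved state, if any."""
    if CHECKPOINT.exists():
        return json.loads(CHECKPOINT.read_text())
    return {"next_item": 0}  # fresh start

save_state({"next_item": 1042})
print(load_state())  # {'next_item': 1042}
```

In practice the checkpoint would go to a persistent disk or Cloud Storage rather than /tmp, so it survives the VM itself.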

Use MIGs to support VM management
When you use Compute Engine VMs, managed instance groups (MIGs) support features like autohealing, load balancing, autoscaling, auto updating, and stateful workloads. You can create zonal or regional MIGs based on your availability goals. You can use MIGs for stateless serving or batch workloads and for stateful applications that need to preserve each VM’s unique state.

Use pod autoscalers to scale your GKE workloads
Use horizontal and vertical Pod autoscalers to scale your workloads, and use node auto-provisioning to scale underlying compute resources.

Distribute application traffic
To scale your applications globally, use Cloud Load Balancing to distribute your application instances across more than one region or zone. Load balancers optimize packet routing from Google Cloud edge networks to the nearest zone, which increases serving traffic efficiency and minimizes serving costs. To optimize for end-user latency, use Cloud CDN to cache static content where possible.

Automate compute creation and management
Minimize human-induced errors in your production environment by automating compute creation and management.

https://cloud.google.com/architecture/framework/system-design/compute

52
Q

What should an architect consider when managing compute operations to support your system?

A

This section provides best practices for managing operations to support your system.

Use Google-supplied public images
Use public images supplied by Google Cloud. The Google Cloud public images are regularly updated. For more information, see List of public images available on Compute Engine.

You can also create your own images with specific configurations and settings. Where possible, automate and centralize image creation in a separate project that you can share with authorized users within your organization. Creating and curating a custom image in a separate project lets you update, patch, and create a VM using your own configurations. You can then share the curated VM image with relevant projects.

Use snapshots for instance backups
Snapshots let you create backups for your instances. Snapshots are especially useful for stateful applications, which aren’t flexible enough to maintain state or save progress when they experience abrupt shutdowns. If you frequently use snapshots to create new instances, you can optimize your backup process by creating a base image from that snapshot.

Use a machine image to enable VM instance creation
Although a snapshot only captures an image of the data inside a machine, a machine image captures machine configurations and settings, in addition to the data. Use a machine image to store all of the configurations, metadata, permissions, and data from one or more disks that are needed to create a VM instance.

When you create a machine from a snapshot, you must configure instance settings on the new VM instances, which requires a lot of work. Using machine images lets you copy those known settings to new machines, reducing overhead. For more information, see When to use a machine image.

https://cloud.google.com/architecture/framework/system-design/compute

53
Q

What are best practices for managing capacity, reservations, and isolation to support your system design?

A

Capacity, reservations, and isolation
This section provides best practices for managing capacity, reservations, and isolation to support your system.

Use committed-use discounts to reduce costs
You can reduce your operational expenditure (OPEX) cost for workloads that are always on by using committed use discounts. For more information, see the Cost optimization category.

Choose machine types to support cost and performance
Google Cloud offers machine types that let you choose compute based on cost and performance parameters. You can choose a low-performance offering to optimize for cost or choose a high-performance compute option at higher cost. For more information, see the Cost optimization category.

Use sole-tenant nodes to support compliance needs
Sole-tenant nodes are physical Compute Engine servers that are dedicated to hosting only your project’s VMs. Sole-tenant nodes can help you to meet compliance requirements for physical isolation, including the following:

Keep your VMs physically separated from VMs in other projects.
Group your VMs together on the same host hardware.
Isolate payments processing workloads.
For more information, see Sole-tenant nodes.

Use reservations to ensure resource availability
Google Cloud lets you define reservations for your workloads to ensure those resources are always available. There is no additional charge to create reservations, but you pay for the reserved resources even if you don’t use them. For more information, see Consuming and managing reservations.

https://cloud.google.com/architecture/framework/system-design/compute

54
Q

What are the best practices an architect must consider to support VM Migration for a solution?

A

Evaluate built-in migration tools
Evaluate built-in migration tools to move your workloads from another cloud or from on-premises. For more information, see Migration to Google Cloud. Google Cloud offers tools and services to help you migrate your workloads and optimize for cost and performance. To receive a free migration cost assessment based on your current IT landscape, see Google Cloud Rapid Assessment & Migration Program.

Use virtual disk import for customized operating systems
To import customized supported operating systems, see Importing virtual disks. Sole-tenant nodes can help you meet your hardware bring-your-own-license requirements for per-core or per-processor licenses. For more information, see Bringing your own licenses.

https://cloud.google.com/architecture/framework/system-design/compute

55
Q

What are the core principles of VPC network design an architect will consider?

A

Developing your cloud networking design includes the following steps:

Design the workload VPC architecture. Start by identifying how many Google Cloud projects and VPC networks you require.
Add inter-VPC connectivity. Design how your workloads connect to other workloads in different VPC networks.
Design hybrid network connectivity. Design how your workload VPCs connect to on-premises and other cloud environments.
When you design your Google Cloud network, consider the following:

A VPC provides a private networking environment in the cloud for interconnecting services that are built on Compute Engine, Google Kubernetes Engine (GKE), and Serverless Computing Solutions. You can also use a VPC to privately access Google-managed services such as Cloud Storage, BigQuery, and Cloud SQL.
VPC networks, including their associated routes and firewall rules, are global resources; they aren’t associated with any particular region or zone.
Subnets are regional resources. Compute Engine VM instances that are deployed in different zones in the same cloud region can use IP addresses from the same subnet.
Traffic to and from instances can be controlled by using VPC firewall rules.
Network administration can be secured by using Identity and Access Management (IAM) roles.
VPC networks can be securely connected in hybrid environments by using Cloud VPN or Cloud Interconnect.

https://cloud.google.com/architecture/framework/system-design/networking

56
Q

What properties of VPC networks should you keep in mind when designing your network infrastructure?

A

VPC networks have the following properties:

VPC networks, including their associated routes and firewall rules, are global resources. They are not associated with any particular region or zone.

Subnets are regional resources.

Each subnet defines a range of IPv4 addresses. Subnets in custom mode VPC networks can also have a range of IPv6 addresses.

Traffic to and from instances can be controlled with network firewall rules. Rules are implemented on the VMs themselves, so traffic can only be controlled and logged as it leaves or arrives at a VM.

Resources within a VPC network can communicate with one another by using internal IPv4 addresses, internal IPv6 addresses, or external IPv6 addresses, subject to applicable network firewall rules. For more information, see communication within the network.

Instances with internal IPv4 or IPv6 addresses can communicate with Google APIs and services. For more information, see Private access options for services.

Network administration can be secured by using Identity and Access Management (IAM) roles.

An organization can use Shared VPC to keep a VPC network in a common host project. Authorized IAM principals from other projects in the same organization can create resources that use subnets of the Shared VPC network.

VPC networks can be connected to other VPC networks in different projects or organizations by using VPC Network Peering.

VPC networks can be securely connected in hybrid environments by using Cloud VPN or Cloud Interconnect.

VPC networks support GRE traffic, including traffic on Cloud VPN and Cloud Interconnect. VPC networks do not support GRE for Cloud NAT or for forwarding rules for load balancing and protocol forwarding. Support for GRE allows you to terminate GRE traffic on a VM from the internet (external IP address) and Cloud VPN or Cloud Interconnect (internal IP address). The decapsulated traffic can then be forwarded to a reachable destination. GRE enables you to use services such as Secure Access Service Edge (SASE) and SD-WAN.

https://cloud.google.com/architecture/framework/system-design/networking

https://cloud.google.com/vpc/docs/vpc#specifications

57
Q

What are best practices an architect uses for workload VPC architecture?

A

This section provides best practices for designing workload VPC architectures to support your system.

Consider VPC network design early
Make VPC network design an early part of designing your organizational setup in Google Cloud. Organizational-level design choices can’t be easily reversed later in the process. For more information, see Best practices and reference architectures for VPC design and Decide the network design for your Google Cloud landing zone.

Start with a single VPC network
For many use cases that include resources with common requirements, a single VPC network provides the features that you need. Single VPC networks are simple to create, maintain, and understand. For more information, see VPC Network Specifications.

Keep VPC network topology simple
To ensure a manageable, reliable, and well-understood architecture, keep the design of your VPC network topology as simple as possible.

Use VPC networks in custom mode
To ensure that Google Cloud networking integrates seamlessly with your existing networking systems, we recommend that you use custom mode when you create VPC networks. Using custom mode helps you integrate Google Cloud networking into existing IP address management schemes and it lets you control which cloud regions are included in the VPC. For more information, see VPC.

https://cloud.google.com/architecture/framework/system-design/networking

58
Q

What are best practices an architect considers for designing inter-VPC connectivity to support your system?

A

This section provides best practices for designing inter-VPC connectivity to support your system.

Choose a VPC connection method
If you decide to implement multiple VPC networks, you need to connect those networks. VPC networks are isolated tenant spaces within Google’s Andromeda software-defined network (SDN). There are several ways that VPC networks can communicate with each other. Choose how you connect your network based on your bandwidth, latency, and service level agreement (SLA) requirements. To learn more about the connection options, see Choose the VPC connection method that meets your cost, performance, and security needs.

Use Shared VPC to administer multiple working groups
For organizations with multiple teams, Shared VPC provides an effective tool to extend the architectural simplicity of a single VPC network across multiple working groups.

Use simple naming conventions
Choose simple, intuitive, and consistent naming conventions. Doing so helps administrators and users to understand the purpose of each resource, where it’s located, and how it’s differentiated from other resources.

Use connectivity tests to verify network security
In the context of network security, you can use connectivity tests to verify that traffic you intend to prevent between two endpoints is blocked. To verify that traffic is blocked and why it’s blocked, define a test between two endpoints and evaluate the results. For example, you might test a VPC feature that lets you define rules that support blocking traffic. For more information, see Connectivity Tests overview.

Use Private Service Connect to create private endpoints
To create private endpoints that let you access Google services with your own IP address scheme, use Private Service Connect. You can access the private endpoints from within your VPC and through hybrid connectivity that terminates in your VPC.

Secure and limit external connectivity
Limit internet access only to those resources that need it. Resources with only a private, internal IP address can still access many Google APIs and services through Private Google Access.

Use Network Intelligence Center to monitor your cloud networks
Network Intelligence Center provides a comprehensive view of your Google Cloud networks across all regions. It helps you to identify traffic and access patterns that can cause operational or security risks.

https://cloud.google.com/architecture/framework/system-design/networking

59
Q

What are the key principles to consider when designing a storage solution?

A

To facilitate data exchange and securely back up and store data, organizations need to choose a storage plan based on workload, input/output operations per second (IOPS), latency, retrieval frequency, location, capacity, and format (block, file, and object).

Cloud Storage provides reliable, secure object storage services, including the following:

Built-in redundancy options to protect your data against equipment failure and to ensure data availability during data center maintenance.
Data transfer options, including the following:
Storage Transfer Service
Transfer Appliance
BigQuery Data Transfer Service
Migration to Google Cloud: Transferring your large datasets
Storage classes to support your workloads.
Calculated checksums for all Cloud Storage operations that enable Google to verify reads and writes.
In Google Cloud, IOPS scales according to your provisioned storage space. Storage types like Persistent Disk require manual replication and backup because they are zonal or regional. By contrast, object storage is highly available and it automatically replicates data across a single region or across multiple regions.
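As a rough illustration of how IOPS scales with provisioned space, the sketch below uses assumed per-GB read rates; the actual per-type limits and caps are in the Persistent Disk performance documentation:

```python
# Sketch: how zonal Persistent Disk read IOPS scale with provisioned size.
# The per-GB rates below are illustrative assumptions; check the current
# Persistent Disk performance documentation for exact limits and caps.
READ_IOPS_PER_GB = {
    "pd-standard": 0.75,
    "pd-balanced": 6,
    "pd-ssd": 30,
}

def estimated_read_iops(disk_type: str, size_gb: int) -> float:
    """Estimate baseline read IOPS for a provisioned disk size."""
    return READ_IOPS_PER_GB[disk_type] * size_gb

# Provisioning a larger disk raises the IOPS baseline:
print(estimated_read_iops("pd-ssd", 100))
```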

https://cloud.google.com/architecture/framework/system-design/storage

60
Q

How do you choose an archival storage system based on the user needs?

A

Choose active or archival storage based on storage access needs
A storage class is a piece of metadata that is used by every object. For data that is served at a high rate with high availability, use the Standard Storage class. For data that is infrequently accessed and can tolerate slightly lower availability, use the Nearline Storage, Coldline Storage, or Archive Storage class. For more information about cost considerations for choosing a storage class, see Cloud Storage pricing.
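As an illustration, a hypothetical helper could map expected access frequency to a class, using thresholds that mirror the 30-, 90-, and 365-day minimum storage durations of Nearline, Coldline, and Archive:

```python
# Hypothetical helper: pick a Cloud Storage class from how often the data
# is expected to be read. Thresholds mirror the minimum storage durations
# of Nearline (30 days), Coldline (90 days), and Archive (365 days).
def choose_storage_class(days_between_accesses: float) -> str:
    if days_between_accesses < 30:
        return "STANDARD"   # served at a high rate, high availability
    if days_between_accesses < 90:
        return "NEARLINE"   # roughly monthly access
    if days_between_accesses < 365:
        return "COLDLINE"   # roughly quarterly access
    return "ARCHIVE"        # yearly or rarer access
```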

https://cloud.google.com/architecture/framework/system-design/storage

61
Q

What can an architect do to minimize access latency for specific objects?

A

Use Cloud CDN to improve static object delivery
To optimize the cost to retrieve objects and minimize access latency, use Cloud CDN. Cloud CDN uses the Cloud Load Balancing external HTTP(S) load balancer to provide routing, health checking, and anycast IP address support. For more information, see Setting up Cloud CDN with cloud buckets.

https://cloud.google.com/architecture/framework/system-design/storage

62
Q

What should an architect consider when choosing between disk and Cloud Storage options?

A

Use Persistent Disk to support high-performance storage access
Data access patterns depend on how you design system performance. Cloud Storage provides scalable storage, but it isn’t an ideal choice when you run heavy compute workloads that need high throughput access to large amounts of data. For high-performance storage access, use Persistent Disk.

Use exponential backoff when implementing retry logic
Use exponential backoff when implementing retry logic to handle 5XX, 408, and 429 errors. Each Cloud Storage bucket is provisioned with initial I/O capacity. For more information, see Request rate and access distribution guidelines. Plan a gradual ramp-up for retry requests.
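A minimal sketch of truncated exponential backoff with jitter; the callable-based interface is an illustration, not a Cloud Storage client API:

```python
import random
import time

# Sketch: truncated exponential backoff with jitter for retryable HTTP
# statuses (5XX, 408, 429), as recommended for Cloud Storage requests.
RETRYABLE = {408, 429, 500, 502, 503, 504}

def call_with_backoff(request, max_retries=5, base_delay=1.0, max_delay=32.0):
    """Call `request` (a callable returning an HTTP status code), retrying
    retryable statuses with exponentially growing, capped, jittered delays."""
    for attempt in range(max_retries + 1):
        status = request()
        if status not in RETRYABLE or attempt == max_retries:
            return status
        # Double the wait each attempt, cap it, and add random jitter so
        # many clients don't retry in lockstep.
        delay = min(max_delay, base_delay * 2 ** attempt) + random.uniform(0, base_delay)
        time.sleep(delay)
```

For example, a request that fails with 503 and then 429 before succeeding returns 200 after two backoff waits.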

https://cloud.google.com/architecture/framework/system-design/storage

63
Q

What are best practices for an architect to consider when developing the cloud storage design?

A

This section provides best practices for storage management to support your system.

Assign unique names to every bucket
Make every bucket name unique across the Cloud Storage namespace. Don’t include sensitive information in a bucket name. Choose bucket and object names that are difficult to guess. For more information, see the bucket naming guidelines and Object naming guidelines.

Keep Cloud Storage buckets private
Unless there is a business-related reason, ensure that your Cloud Storage bucket isn’t anonymously or publicly accessible. For more information, see Overview of access control.

Assign random object names to distribute load evenly
Assign random object names to facilitate performance and avoid hotspotting. Use a randomized prefix for objects where possible. For more information, see Use a naming convention that distributes load evenly across key ranges.
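For example, a short hash prefix spreads otherwise-sequential names across key ranges; the helper below is a sketch with a hypothetical name:

```python
import hashlib

# Sketch: prepend a short, deterministic hash so sequential object names
# (timestamps, counters) distribute evenly across key ranges instead of
# hotspotting one range.
def distributed_name(object_name: str, prefix_len: int = 6) -> str:
    digest = hashlib.md5(object_name.encode()).hexdigest()[:prefix_len]
    return f"{digest}_{object_name}"

# Uploads named by date no longer sort into one contiguous range:
print(distributed_name("logs/2024-05-01/0001.txt"))
```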

Use public access prevention
To prevent access at the organization, folder, project, or bucket level, use public access prevention. For more information, see Using public access prevention.

https://cloud.google.com/architecture/framework/system-design/storage

64
Q

Selecting the appropriate target database is one of the keys to a successful migration. What are the migration options for some use cases?

A

Choose a migration strategy
Selecting the appropriate target database is one of the keys to a successful migration. The following table provides migration options for some use cases:

Use case Recommendation
New development in Google Cloud. Select one of the managed databases that’s built for the cloud—Cloud SQL, Cloud Spanner, Bigtable, or Firestore—to meet your use-case requirements.
Lift-and-shift migration. Choose a compatible managed database service like Cloud SQL for MySQL, Cloud SQL for PostgreSQL, or Cloud SQL for SQL Server.
Your application requires granular access to a database that Cloud SQL doesn’t support. Run your database on Compute Engine VMs.

https://cloud.google.com/architecture/framework/system-design/databases

65
Q

What are Memorystore best practices?

A

Memorystore is a fully managed Redis and Memcached database that supports sub-millisecond latency.
Memorystore is fully compatible with open source Redis and Memcached.

If you use these caching databases in your applications, you can use Memorystore without making application-level changes in your code.

https://cloud.google.com/architecture/framework/system-design/databases

66
Q

What database best practices should an architect follow?

A

Use Bare Metal Solution to run an Oracle database
If your workloads require an Oracle database, use the Bare Metal Solution provided by Google Cloud. This approach fits into a lift-and-shift migration strategy.

If you want to move your workload to Google Cloud and modernize after your baseline workload is working, consider using managed database options like Spanner, Bigtable, and Firestore.

Databases built for the cloud are modern managed databases which are built from the bottom up on the cloud infrastructure. These databases provide unique default capabilities like scalability and high availability, which are difficult to achieve if you run your own database.

Modernize your database
Plan your database strategy early in the system design process, whether you’re designing a new application in the cloud or you’re migrating an existing database to the cloud. Google Cloud provides managed database options for open source databases such as Cloud SQL for MySQL and Cloud SQL for PostgreSQL. We recommend that you use the migration as an opportunity to modernize your database and prepare it to support future business needs.

Use fixed databases with off-the-shelf applications
Commercial off-the-shelf (COTS) applications require a fixed type of database and fixed configuration. Lift and shift is usually the most appropriate migration approach for COTS applications.

Verify your team’s database migration skill set
Choose a cloud database-migration approach based on your team’s database migration capabilities and skill sets. Use Google Cloud Partner Advantage to find a partner to support you throughout your migration journey.

Design your database to meet HA and DR requirements
When you design your databases to meet high availability (HA) and disaster recovery (DR) requirements, evaluate the tradeoffs between reliability and cost. Database services that are built for the cloud create multiple copies of your data within a region or in multiple regions, depending upon the database and configuration.

Some Google Cloud services have multi-regional variants, such as BigQuery and Cloud Spanner. To be resilient against regional failures, use these multi-regional services in your design where possible.

If you design your database on Compute Engine VMs instead of using managed databases on Google Cloud, ensure that you run multiple copies of your databases. For more information, see Design for scale and high availability in the Reliability category.

https://cloud.google.com/architecture/framework/system-design/databases

67
Q

What are the best practices for designing and scaling a database to support your system?

A

Database design and scaling
This section provides best practices for designing and scaling a database to support your system.

Use monitoring metrics to assess scaling needs
Use metrics from existing monitoring tools and environments to establish a baseline understanding of database size and scaling requirements—for example, right-sizing and designing scaling strategies for your database instance.

For new database designs, determine scaling numbers based on expected load and traffic patterns from the serving application. For more information, see Monitoring Cloud SQL instances, Monitoring with Cloud Monitoring, and Monitoring an instance.

https://cloud.google.com/architecture/framework/system-design/databases

68
Q

What best practices are defined for networking and access with databases to support the system?

A

This section provides best practices for managing networking and access to support your system.

Run databases inside a private network
Run your databases inside your private network and grant restricted access only from the clients who need to interact with the database. You can create Cloud SQL instances inside a VPC. Google Cloud also provides VPC Service Controls for Cloud SQL, Spanner, and Bigtable databases to ensure that access to these resources is restricted only to clients within authorized VPC networks.

Grant minimum privileges to users
Identity and Access Management (IAM) controls access to Google Cloud services, including database services. To minimize the risk of unauthorized access, grant the least number of privileges to your users. For application-level access to your databases, use service accounts with the least number of privileges.

https://cloud.google.com/architecture/framework/system-design/databases

70
Q

What are the best practices for defining automation and right-sizing to support databases in your system?

A

This section provides best practices for defining automation and right-sizing to support your system.

Define database instances as code
One of the benefits of migrating to Google Cloud is the ability to automate your infrastructure and other aspects of your workload like compute and database layers. Google Deployment Manager and third-party tools like Terraform Cloud let you define your database instances as code, which lets you apply a consistent and repeatable approach to creating and updating your databases.
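As a minimal sketch of the define-as-code idea, a Deployment Manager Python template returns resources as plain data. The resource type string and property names below are illustrative assumptions; verify them against the Deployment Manager and Cloud SQL Admin API references:

```python
# Illustrative Deployment Manager-style Python template for a Cloud SQL
# instance. The type string and property names are assumptions; check the
# Deployment Manager documentation for the exact schema.
def GenerateConfig(context):
    """Return the resources for this deployment as a dict."""
    return {
        "resources": [{
            "name": context.env["name"],
            "type": "sqladmin.v1beta4.instance",
            "properties": {
                "region": context.properties["region"],
                "settings": {"tier": context.properties["tier"]},
            },
        }]
    }
```

Because the template is ordinary code under version control, every database change is reviewable and repeatable.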

Use Liquibase to version control your database
Google database services like Cloud SQL and Cloud Spanner support Liquibase, an open source version control tool for databases. Liquibase helps you to track your database schema changes, roll back schema changes, and perform repeatable migrations.

Test and tune your database to support scaling
Perform load tests on your database instance and tune it based on the test results to meet your application’s requirements. Determine the initial scale of your database by load testing key performance indicators (KPIs) or by using monitoring KPIs derived from your current database.

When you create database instances, start with a size that is based on the testing results or historical monitoring metrics. Test your database instances with the expected load in the cloud. Then fine-tune the instances until you get the desired results for the expected load on your database instances.

Choose the right database for your scaling requirements
Scaling databases is different from scaling compute layer components. Databases have state; when one instance of your database isn’t able to handle the load, consider the appropriate strategy to scale your database instances. Scaling strategies vary depending on the database type.

https://cloud.google.com/architecture/framework/system-design/databases

71
Q

What are design principles when designing for Machine Learning?

A

Model development and training
Apply the following model development and training best practices to your own environment.

Choose managed or custom-trained model development
When you build your model, consider the highest level of abstraction possible. Use AutoML when possible so that the development and training tasks are handled for you. For custom-trained models, choose managed options for scalability and flexibility, instead of self-managed options. To learn more about model development options, see Use recommended tools and products.

Consider the Vertex AI training service instead of self-managed training on Compute Engine VMs or Deep Learning VM containers. For a JupyterLab environment, consider Vertex AI Workbench, which provides both managed and user-managed JupyterLab environments. For more information, see Machine learning development and Operationalized training.

Use pre-built or custom containers for custom-trained models
For custom-trained models on Vertex AI, you can use pre-built or custom containers depending on your machine learning framework and framework version. Pre-built containers are available for Python training applications that are created for specific TensorFlow, scikit-learn, PyTorch, and XGBoost versions.

Otherwise, you can choose to build a custom container for your training job. For example, use a custom container if you want to train your model using a Python ML framework that isn’t available in a pre-built container, or if you want to train using a programming language other than Python. In your custom container, pre-install your training application and all its dependencies onto an image that runs your training job.

Consider distributed training requirements
Consider your distributed training requirements. Some ML frameworks, like TensorFlow and PyTorch, let you run identical training code on multiple machines. These frameworks automatically coordinate division of work based on environment variables that are set on each machine. Other frameworks might require additional customization.
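For instance, TensorFlow's multi-worker strategies read a TF_CONFIG environment variable on each machine to coordinate the division of work. A sketch of constructing it, with hypothetical worker host names:

```python
import json
import os

# Sketch: build the TF_CONFIG value that TensorFlow multi-worker training
# reads on each machine. The host names are hypothetical placeholders.
def make_tf_config(workers, index):
    """Describe the cluster and this machine's role within it."""
    return {
        "cluster": {"worker": workers},
        "task": {"type": "worker", "index": index},
    }

workers = ["worker-0.example:2222", "worker-1.example:2222"]
# On worker 0, before starting training:
os.environ["TF_CONFIG"] = json.dumps(make_tf_config(workers, index=0))
```

Each machine gets the same cluster description but a different task index, which is how the framework assigns it a share of the work.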

https://cloud.google.com/architecture/framework/system-design/ai-ml

72
Q

How can you design for environmental sustainability?

A

Design for environmental sustainability

Understand your carbon footprint
To understand the carbon footprint from your Google Cloud usage, use the Carbon Footprint dashboard. The Carbon Footprint dashboard attributes emissions to the Google Cloud projects that you own and the cloud services that you use.

For more information, see Understand your carbon footprint in “Reduce your Google Cloud carbon footprint.”

Choose the most suitable cloud regions
One simple and effective way to reduce carbon emissions is to choose cloud regions with lower carbon emissions. To help you make this choice, Google publishes carbon data for all Google Cloud regions.

When you choose a region, you might need to balance lowering emissions with other requirements, such as pricing and network latency. To help select a region, use the Google Cloud Region Picker.
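One way to reason about that balance is a simple weighted score. The regions, carbon intensities, and latencies below are made-up illustrations, not Google's published figures:

```python
# Hypothetical data: rank candidate regions by a weighted score of grid
# carbon intensity (gCO2eq/kWh) and measured network latency (ms). Use
# Google's published carbon data and your own latency measurements.
regions = {
    "us-central1":  {"carbon": 450, "latency_ms": 90},
    "europe-west1": {"carbon": 110, "latency_ms": 35},
    "asia-east1":   {"carbon": 540, "latency_ms": 180},
}

def score(metrics, carbon_weight=0.5):
    """Lower is better; carbon_weight trades emissions against latency."""
    return carbon_weight * metrics["carbon"] + (1 - carbon_weight) * metrics["latency_ms"]

best = min(regions, key=lambda r: score(regions[r]))
print(best)
```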

For more information, see Choose the most suitable cloud regions in “Reduce your Google Cloud carbon footprint.”

Choose the most suitable cloud services
To help reduce your existing carbon footprint, consider migrating your on-premises VM workloads to Compute Engine.

Also consider that many workloads don’t require VMs. Often you can utilize a serverless offering instead. These managed services can optimize cloud resource usage, often automatically, which simultaneously reduces cloud costs and carbon footprint.

For more information, see Choose the most suitable cloud services in “Reduce your Google Cloud carbon footprint.”

Minimize idle cloud resources
Idle resources incur unnecessary costs and emissions. Some common causes of idle resources include the following:

Unused, active cloud resources, such as idle VM instances.
Over-provisioned resources, such as larger VM machine types than necessary for a workload.
Non-optimal architectures, such as lift-and-shift migrations that aren’t always optimized for efficiency. Consider making incremental improvements to these architectures.
The following are some general strategies to help minimize wasted cloud resources:

Identify idle or overprovisioned resources and either delete them or rightsize them.
Refactor your architecture to incorporate a more optimal design.
Migrate workloads to managed services.
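A sketch of the first strategy, using hypothetical utilization data; in practice this would come from Cloud Monitoring metrics:

```python
# Sketch: flag idle or overprovisioned VMs from monitoring data. The
# instance records and thresholds are hypothetical examples.
instances = [
    {"name": "batch-1", "avg_cpu": 0.02, "vcpus": 4},
    {"name": "web-1",   "avg_cpu": 0.55, "vcpus": 8},
    {"name": "dev-vm",  "avg_cpu": 0.08, "vcpus": 16},
]

def classify(inst, idle_cpu=0.05, oversized_cpu=0.20):
    """Bucket an instance by average CPU utilization."""
    if inst["avg_cpu"] < idle_cpu:
        return "idle: consider deleting"
    if inst["avg_cpu"] < oversized_cpu:
        return "overprovisioned: consider rightsizing"
    return "ok"

for inst in instances:
    print(inst["name"], classify(inst))
```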

https://cloud.google.com/architecture/framework/system-design/sustainability

73
Q

Automate your deployments

A

Automation helps you standardize your builds, tests, and deployments by eliminating human-induced errors for repeated processes like code updates. This section describes how to use various checks and guards as you automate. A standardized machine-controlled process helps ensure that your deployments are applied safely. It also provides a mechanism to restore previous deployments as needed without significantly affecting your user’s experience.

Store your code in central code repositories

Use continuous integration and continuous deployment (CI/CD)
Automate your deployments using a continuous integration and continuous deployment (CI/CD) approach. A CI/CD approach is a combination of pipelines that you configure and processes that your development team follows.

A CI/CD approach increases deployment velocity by making your software development team more productive. This approach lets developers make smaller and more frequent changes that are thoroughly tested while reducing the time needed to deploy those changes.

Provision and manage your infrastructure using infrastructure as code
Infrastructure as code is the use of a descriptive model to manage infrastructure, such as VMs, and configurations, such as firewall rules. Infrastructure as code lets you do the following:

Create your cloud resources automatically, including the deployment or test environments for your CI/CD pipeline.
Treat infrastructure changes like you treat application changes. For example, ensure changes to the configuration are reviewed, tested, and can be audited.
Have a single version of the truth for your cloud infrastructure.
Replicate your cloud environment as needed.
Roll back to a previous configuration if necessary.

Incorporate testing throughout the software delivery lifecycle
Testing is critical to successfully launching your software. Continuous testing helps teams create high-quality software faster and enhance software stability.

Launch deployments gradually
Choose your deployment strategy based on important parameters, like minimum disruption to end users, rolling updates, rollback strategies, and A/B testing strategies. For each workload, evaluate these requirements and pick a deployment strategy from proven techniques, such as rolling updates, blue/green deployments, and canary deployments.

Restore previous releases seamlessly
Define your restoration strategy as part of your deployment strategy. Ensure that you can roll back a deployment, or an infrastructure configuration, to a previous version of the source code. Restoring a previous stable deployment is an important step in incident management for both reliability and security incidents.

Monitor your CI/CD pipelines
To keep your automated build, test, and deploy process running smoothly, monitor your CI/CD pipelines. Set alerts that indicate when anything in any pipeline fails. Each step of your pipeline should write suitable log statements so that your team can perform root cause analysis if a pipeline fails.

https://cloud.google.com/architecture/framework/operational-excellence/automate-your-deployments

74
Q

Set up monitoring, alerting, and logging

A

Use the following four golden signals to monitor your system:

Latency. The time it takes to service a request.
Traffic. How much demand is being placed on your system.
Errors. The rate of requests that fail. Failure can be explicit (for example, HTTP 500s), implicit (for example, an HTTP 200 success response coupled with the wrong content), or by policy (for example, if you commit to one-second response times, any request over one second is an error).
Saturation. How full your service is. Saturation is a measure of your system’s utilization, emphasizing the resources that are most constrained (that is, in a memory-constrained system, show memory; in an I/O-constrained system, show I/O).
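A minimal sketch of computing these signals from a batch of request records; the records, window, and one-second latency target are illustrative:

```python
# Sketch: derive the four golden signals from sample request records.
# Records, the 60 s window, and the 1 s latency target are illustrative.
requests = [
    {"latency_s": 0.12, "status": 200},
    {"latency_s": 1.40, "status": 200},  # over the 1 s target: an error by policy
    {"latency_s": 0.30, "status": 500},  # explicit failure
    {"latency_s": 0.08, "status": 200},
]
window_s = 60.0

latency_avg = sum(r["latency_s"] for r in requests) / len(requests)
traffic_qps = len(requests) / window_s
errors = sum(1 for r in requests if r["status"] >= 500 or r["latency_s"] > 1.0)
error_rate = errors / len(requests)
saturation = 0.72  # e.g., fraction of memory in use in a memory-bound system

print(f"latency={latency_avg:.3f}s traffic={traffic_qps:.3f}qps errors={error_rate:.0%}")
```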

Create a monitoring plan

Include the following details in your monitoring plan:

Include all your systems, including on-premises resources and cloud resources.
Include monitoring of your cloud costs to help make sure that scaling events don’t cause usage to cross your budget thresholds.
Build different monitoring strategies for measuring infrastructure performance, user experience, and business key performance indicators (KPIs). For example, static thresholds might work well to measure infrastructure performance but don’t truly reflect the user’s experience.
Update the plan as your monitoring strategies mature. Iterate on the plan to improve the health of your systems.

Define metrics that measure all aspects of your organization
Use these metrics to create service level indicators (SLIs) for your applications. For more information, see Choose appropriate SLIs.

Choose a monitoring solution that:

Is platform independent
Provides uniform capabilities for monitoring of on-premises, hybrid, and multi-cloud environments
Using a single platform to consolidate the monitoring data that comes in from different sources lets you build uniform metrics and visualization dashboards.

As you set up monitoring, automate monitoring tasks where possible.

Monitoring with Google Cloud
Using a monitoring service, such as Cloud Monitoring, is easier than building a monitoring service yourself. Monitoring a complex application is a substantial engineering endeavor by itself. Even with existing infrastructure for instrumentation, data collection and display, and alerting in place, it is a full-time job for someone to build and maintain.

Cloud Monitoring is a managed service that is part of Google Cloud’s operations suite. You can use Cloud Monitoring to monitor Google Cloud services and custom metrics. Cloud Monitoring provides an API for integration with third-party monitoring tools.

Cloud Monitoring aggregates metrics, logs, and events from your system’s cloud-based infrastructure. That data gives developers and operators a rich set of observable signals that can speed root-cause analysis and reduce mean time to resolution. You can use Cloud Monitoring to define alerts and custom metrics that meet your business objectives and help you aggregate, visualize, and monitor system health.

Cloud Monitoring provides default dashboards for cloud and open source application services. Using the metrics model, you can define custom dashboards with powerful visualization tools and configure charts in Metrics Explorer.

Set up alerting
As you set up alerting, map alerts directly to critical metrics. These critical metrics include:

The four golden signals:
Latency
Traffic
Errors
Saturation
System health
Service usage
Security events
User experience
Make alerts actionable to minimize the time to resolution. To do so, for each alert:

Include a clear description, including stating what is monitored and its business impact.
Provide all the information necessary to act immediately. If it takes a few clicks and navigation to understand alerts, it is challenging for the on-call person to act.
Define priority levels for various alerts.
Clearly identify the person or team responsible for responding to the alert.
For critical applications and services, build self-healing actions into the alerts triggered due to common fault conditions such as service health failure, configuration change, or throughput spikes.

As you set up alerts, try to eliminate toil.

Build monitoring and alerting dashboards
Once monitoring is in place, build relevant, uncomplicated dashboards that include information from your monitoring and alerting systems.

Choose a visualization approach for your dashboards that ties into your reliability goals. Create dashboards to visualize both:

Short-term and real-time analysis
Long-term analysis

Logging the data your systems generate helps you ensure an effective security posture. For more information about logging and security, see Implement logging and detective controls in the security category of the Architecture Framework.

Cloud Logging is an integrated logging service you can use to store, search, analyze, monitor, and alert on log data and events. Logging automatically collects logs from the services of Google Cloud and other cloud providers. You can use these logs to build metrics for monitoring and to create logging exports to external services such as Cloud Storage, BigQuery, and Pub/Sub.

Set up an audit trail
To help answer questions like “who did what, where, and when” in your Google Cloud projects, use Cloud Audit Logs.

Cloud Audit Logs captures several types of activity, such as the following:

Admin Activity logs contain log entries for API calls or other administrative actions that modify the configuration or metadata of resources. Admin Activity logs are always enabled.

https://cloud.google.com/architecture/framework/operational-excellence/set-up-monitoring-alerting-logging

75
Q

Establish cloud support and escalation processes

A

Establish support from your providers
Purchase a support contract from your cloud provider or other third-party service providers. Support is critical to ensure the prompt response and resolution of various operational issues.
To work with Google Cloud Customer Care, consider purchasing a Customer Care offering that includes Standard, Enhanced, or Premium Support. Consider using Enhanced or Premium Support for your major production environments.
Define your escalation process
A well-defined escalation process is key to reducing the effort and time that it takes to identify and address any issues in your systems. This includes issues that require support for Google Cloud products or for other cloud providers or third-party services.
Ensure you receive communication from support
Ensure that your administrators are receiving communication from your cloud providers and third-party services. This information allows admins to make informed decisions and fix issues before they cause larger problems. Ensure that the following are true:
Establish review processes
Establish a review or postmortem process. Follow these processes after you raise a new support ticket or escalate an existing support ticket.
Build centers of excellence
It can be valuable to capture your organization’s information, experience, and patterns in an internal knowledge base, such as a wiki, Google site, or intranet site. As new products and features are continually being rolled out in Google Cloud, this knowledge can help track why you chose a particular design for your applications and services. For more information, see Architecture decision records.

https://cloud.google.com/architecture/framework/operational-excellence/establish-cloud-support-and-escalation-processes

76
Q

Plan for peak traffic and launch events

A

Peak and launch events include three stages:
Planning and preparation for the launch or peak traffic event
Launching the event
Reviewing event performance
The practices described in this document can help each of these stages run smoothly.
Create a general playbook for launch and peak events
Build a general playbook with a long-term view of current and future peak events. Keep adding lessons learned to the document, so it can be a reference for future peak events.
Plan for your launch and for peak events
Plan ahead. Create business projections for upcoming launches and for expected (and unexpected) peak events. Preparing your system for scale spikes depends on understanding your business projections. The more you know about prior forecasts, the more accurate you can make your new business forecasts. Those new forecasts are critical inputs into projecting expected demand on your system.
Establish review processes
When the peak traffic event or launch event is over, review the event to document the lessons you learned. Then, update your playbook with those lessons. Finally, apply what you learned to the next major event. Learning from prior events is important, especially when they highlight constraints to the system while under stress.
Retrospective reviews, also called postmortems, for peak traffic events or launch events are a useful technique for capturing data and understanding the incidents that occurred. Do this review for peak traffic and launch events that went as expected, and for any incidents that caused problems. As you do this review foster a blameless culture.

https://cloud.google.com/architecture/framework/operational-excellence/plan-for-peak-traffic-and-launch-events

77
Q

Create a culture of automation

A

Create a culture of automation
Toil is manual and repetitive work with no enduring value, and it increases as a service grows. Continually aim to reduce or eliminate toil. Otherwise, operational work can eventually overwhelm operators, and any growth in product use or complexity can require additional staffing.
Automation is a key way to minimize toil. Automation also improves release velocity and helps minimize human-induced errors.
Create an inventory and assess the cost of toil
Start by creating an inventory and assessing the cost of toil on the teams managing your systems. Make this a continuous process, followed by investing in customized automation to extend what’s already provided by Google Cloud services and partners. You can often modify Google Cloud’s own automation—for example, Compute Engine’s autoscaler.
Prioritize eliminating toil
Automation is useful but isn’t a solution to all operational problems. As a first step in addressing known toil, we recommend reviewing your inventory of existing toil and prioritize eliminating as much toil as you can. Then, you can focus on automation.
Automate necessary toil
Some toil in your systems cannot be eliminated. As a second step in addressing known toil, automate this toil using the solutions that Google Cloud provides through configurable automation.

Build or buy solutions for high-cost toil
The third step, which can be completed in parallel with the first and second steps, entails evaluating building or buying other solutions if your toil cost stays high—for example, if toil takes a significant amount of time for any team managing your production systems.
When building or buying solutions, consider integration, security, privacy, and compliance costs. Designing and implementing your own automation comes with maintenance costs and risks to reliability beyond its initial development and setup costs, so consider this option as a last resort.

https://sre.google/workbook/eliminating-toil/

78
Q

What are common metrics for various components?

A

Metrics are generated at all levels of your service, from infrastructure and networking to business logic. For example:

Infrastructure metrics:
Virtual machine statistics, including instances, CPU, memory, utilization, and counts
Container-based statistics, including cluster utilization, cluster capacity, pod level utilization, and counts
Networking statistics, including ingress/egress, bandwidth between components, latency, and throughput
Requests per second, as measured by the load balancer
Total disk blocks read, per disk
Packets sent over a given network interface
Memory heap size for a given process
Distribution of response latencies
Number of invalid queries rejected by a database instance
Application metrics:
Application-specific behavior, including queries per second, writes per second, and messages sent per second
Managed services statistics metrics:
QPS, throughput, latency, utilization for Google-managed services (APIs or products such as BigQuery, App Engine, and Cloud Bigtable)
Network connectivity statistics metrics:
VPN/interconnect-related statistics about connecting to on-premises systems or systems that are external to Google Cloud.
SLIs
Metrics associated with the overall health of the system.
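SLIs like the ones listed above are typically ratios computed from lower-level metrics. A minimal sketch, with hypothetical request counts and a hypothetical 300 ms latency threshold:

```python
def availability_sli(good_requests, total_requests):
    """Availability SLI: fraction of requests served successfully."""
    if total_requests == 0:
        return 1.0  # no traffic means no failures to count
    return good_requests / total_requests

def latency_sli(latencies_ms, threshold_ms=300):
    """Latency SLI: fraction of requests faster than a threshold."""
    fast = sum(1 for latency in latencies_ms if latency < threshold_ms)
    return fast / len(latencies_ms)

print(availability_sli(9_990, 10_000))        # 0.999
print(latency_sli([120, 250, 310, 90, 480]))  # 0.6
```

In practice these counts come from the infrastructure and application metrics described above, aggregated by your monitoring system.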
Set up monitoring
Set up monitoring for both your on-premises resources and your cloud resources.

https://cloud.google.com/architecture/framework/operational-excellence/set-up-monitoring-alerting-logging

79
Q

How do you create an Escalation Process?

A

To create your escalation path:
Define when and how to escalate issues internally.
Define when and how to create support cases with your cloud provider or other third-party service provider.
Learn how to work with the teams that provide you support. For Google Cloud, you and your operations teams should review the Best practices for working with Customer Care. Incorporate these practices into your escalation path.
Find or create documents that describe your architecture. Ensure these documents include information that is helpful for support engineers.
Define how your teams communicate during an outage.
Ensure that people who need support have appropriate levels of support permissions to access the Google Cloud Support Center, or to communicate with other support providers. To learn about using the Google Cloud Support Center, visit Support procedures.
Set up monitoring, alerting, and logging so that you have the information needed to act on when issues arise.
Create templates for incident reporting. For information to include in your incident reports, see Best practices for working with Customer Care.
Document your organization’s escalation process. Ensure that you have clear, well-defined actions to address escalations.
Include a plan to teach new team members how to interact with support.
Regularly test your escalation process internally. Test your escalation process before major events, such as migrations, new product launches, and peak traffic events. If you have Google Cloud Customer Care Premium Support, your Technical Account Manager can help review your escalation process and coordinate your tests with Google Cloud Customer Care.

https://cloud.google.com/architecture/framework/operational-excellence/establish-cloud-support-and-escalation-processes

80
Q

How do you create a process for capacity planning?

A

As you create this plan do the following:
Run load tests to determine how much load the system can handle while meeting its latency targets, given a fixed amount of resources. Load tests should use a mix of request types that matches production traffic profiles from live users. Don’t use a uniform or random mix of operations. Include spikes in usage in your traffic profile.
Create a capacity model. A capacity model is a set of formulas for calculating incremental resources needed per unit increase in service load, as determined from load testing.
Forecast future traffic and account for growth. See the article Measure Future Load for a summary of how Google builds traffic forecasts.
Apply the capacity model to the forecast to determine future resource needs.
Estimate the cost of resources your organization needs. Then, get budget approval from your Finance organization. This step is essential because the business can choose to make cost versus risk tradeoffs across a range of products. Those tradeoffs can mean acquiring capacity that’s lower or higher than the predicted need for a given product based on business priorities.
Work with your cloud provider to get the correct amount of resources at the correct time with quotas and reservations. Involve infrastructure teams for capacity planning and have operations create capacity plans with confidence intervals.
Repeat the previous steps every quarter or two.
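The capacity-model and forecast steps above can be sketched as follows. The base VM count, VMs-per-1000-QPS slope, and 15% quarterly growth rate are hypothetical values that would in reality come from your load tests and traffic forecasts:

```python
import math

def capacity_model(qps, base_vms=4, vms_per_1000_qps=2.5):
    """Incremental resources per unit of load, as fitted from load testing.

    The coefficients here are placeholders; derive yours from load tests
    that mirror production traffic profiles.
    """
    return base_vms + math.ceil(qps / 1000 * vms_per_1000_qps)

# Forecast quarterly peak QPS under assumed growth, then apply the model
# to estimate future resource needs for budgeting and quota requests.
current_peak_qps = 8000
quarterly_growth = 1.15
for quarter in range(1, 5):
    forecast = current_peak_qps * quarterly_growth ** quarter
    print(f"Q{quarter}: ~{forecast:,.0f} QPS -> {capacity_model(forecast)} VMs")
```

Applying the model to a forecast rather than to current load is what turns load-test data into a capacity plan.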

https://www.usenix.org/publications/login/feb15/capacity-planning

81
Q

The following are some areas where configurable automation or customized automation can assist your organization in eliminating toil:

A

Identity management—for example, Cloud Identity and Identity and Access Management.
Google Cloud hosted solutions, as opposed to self-designed solutions—for example, cluster management (Google Kubernetes Engine (GKE)), relational database management (Cloud SQL), data warehouse management (BigQuery), and API management (Apigee).
Google Cloud services and tenant provisioning—for example, Terraform and the Cloud Foundation Toolkit.
Automated workflow orchestration for multi-step operations—for example, Cloud Composer.
Additional capacity provisioning—for example, several Google Cloud products, like Compute Engine and GKE, offer configurable autoscaling. Evaluate the Google Cloud services you are using to determine if they include configurable autoscaling.
CI/CD pipelines with automated deployment—for example, Cloud Build.
Canary analysis to validate deployments.
Automated model training (for machine learning)—for example, AutoML.
If a Google Cloud product or service only partially satisfies your technical needs when automating or eliminating manual workflows, consider filing a feature request through your Google Cloud account representative. Your issue might be a priority for other customers or already a part of our roadmap. If so, knowing the feature’s priority and timeline helps you to better assess the trade-offs of building your own solution versus waiting to use a Google Cloud feature.

82
Q

What is shared responsibility and shared fate on Google Cloud?

A

Understanding the shared responsibility model is important when determining how to best protect your data and workloads on Google Cloud. The shared responsibility model describes the tasks that you have when it comes to security in the cloud and how these tasks are different for cloud providers.
Understanding shared responsibility, however, can be challenging. The model requires an in-depth understanding of each service you utilize, the configuration options that each service provides, and what Google Cloud does to secure the service. Every service has a different configuration profile, and it can be difficult to determine the best security configuration. Google believes that the shared responsibility model stops short of helping cloud customers achieve better security outcomes. Instead of shared responsibility, we believe in shared fate.
Shared fate includes us building and operating a trusted cloud platform for your workloads. We provide best practice guidance and secured, attested infrastructure code that you can use to deploy your workloads in a secure way. We release solutions that combine various Google Cloud services to solve complex security problems and we offer innovative insurance options to help you measure and mitigate the risks that you must accept. Shared fate involves us more closely interacting with you as you secure your resources on Google Cloud.

https://cloud.google.com/architecture/framework/security/shared-responsibility-shared-fate

83
Q

What are the best practices for security principles?

A

Build a layered security approach
Implement security at each level in your application and infrastructure by applying a defense-in-depth approach. Use the features in each product to limit access and configure encryption where appropriate.
Design for secured decoupled systems
Simplify system design to accommodate flexibility where possible, and document security requirements for each component. Incorporate a robust secured mechanism to account for resiliency and recovery.
Automate deployment of sensitive tasks
Take humans out of the workstream by automating deployment and other admin tasks.
Automate security monitoring
Use automated tools to monitor your application and infrastructure. To scan your infrastructure for vulnerabilities and detect security incidents, use automated scanning in your continuous integration and continuous deployment (CI/CD) pipelines.
Meet the compliance requirements for your regions
Be mindful that you might need to obfuscate or redact personally identifiable information (PII) to meet your regulatory requirements. Where possible, automate your compliance efforts. For example, use Cloud Data Loss Prevention (Cloud DLP) and Dataflow to automate the PII redaction job before new data is stored in the system.
Comply with data residency and sovereignty requirements
You might have internal (or external) requirements that require you to control the locations of data storage and processing. These requirements vary based on systems design objectives, industry regulatory concerns, national law, tax implications, and culture.
Shift security left
DevOps and deployment automation let your organization increase the velocity of delivering products. To help ensure that your products remain secure, incorporate security processes from the start of the development process.

https://cloud.google.com/architecture/framework/security/security-principles

84
Q

How do you manage risks with controls?

A

Manage risk with controls
You should complete risk analysis before you deploy workloads on Google Cloud, and regularly afterwards as your business needs, regulatory requirements, and the threats relevant to your organization change.
Identify risks to your organization
Before you create and deploy resources on Google Cloud, complete a risk assessment to determine what security features you need in order to meet your internal security requirements and external regulatory requirements. Your risk assessment provides you with a catalog of risks that are relevant to you, and tells you how capable your organization is in detecting and counteracting security threats.
Your risks in a cloud environment differ from your risks in an on-premises environment due to the shared responsibility arrangement that you enter with your cloud provider. For example, in an on-premises environment you need to mitigate vulnerabilities to the hardware stack. In contrast, in a cloud environment these risks are borne by the cloud provider.
In addition, your risks differ depending on how you plan on using Google Cloud. Are you transferring some of your workloads to Google Cloud, or all of them? Are you using Google Cloud only for disaster recovery purposes? Are you setting up a hybrid cloud environment?
We recommend that you use an industry-standard risk assessment framework that applies to cloud environments and to your regulatory requirements. For example, the Cloud Security Alliance (CSA) provides the Cloud Controls Matrix (CCM). In addition, there are threat models such as OWASP application threat modeling that provide you with a list of potential gaps, and that suggest actions to remediate any gaps that are found. You can check our partner directory for a list of experts in conducting risk assessments for Google Cloud.
To help catalog your risks, consider Risk Manager, which is part of the Risk Protection Program. (This program is currently in preview.) Risk Manager scans your workloads to help you understand your business risks. Its detailed reports provide you with a security baseline. In addition, you can use Risk Manager reports to compare your risks against the risks outlined in the Center for Internet Security (CIS) Benchmark.
After you catalog your risks, you must determine how to address them—that is, whether you want to accept, avoid, transfer, or mitigate them. The following section describes mitigation controls.
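A cataloged risk with an accept/avoid/transfer/mitigate decision can be represented as a simple risk register. The risks, scores, and responses below are hypothetical examples, not guidance from the framework:

```python
from dataclasses import dataclass

# The four standard responses named above.
RESPONSES = {"accept", "avoid", "transfer", "mitigate"}

@dataclass
class Risk:
    name: str
    likelihood: int  # 1 (rare) .. 5 (almost certain)
    impact: int      # 1 (minor) .. 5 (severe)
    response: str

    @property
    def score(self):
        """A common likelihood x impact scoring scheme."""
        return self.likelihood * self.impact

# Hypothetical register entries.
register = [
    Risk("public bucket misconfiguration", 4, 5, "mitigate"),
    Risk("dependence on a single zone", 2, 4, "transfer"),
    Risk("legacy batch job failure", 3, 1, "accept"),
]

assert all(r.response in RESPONSES for r in register)
for r in sorted(register, key=lambda r: r.score, reverse=True):
    print(f"{r.score:>2}  {r.name} -> {r.response}")
```

Sorting by score surfaces the risks that most need mitigation controls.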

Mitigate your risks
You can mitigate risks using technical controls, contractual protections, and third-party verifications or attestations. The following list describes how you can use these mitigations when you adopt new public cloud services.

Technical controls: the features and technologies that you use to protect your environment. These include built-in cloud security controls, such as firewalls and logging. Technical controls can also include using third-party tools to reinforce or support your security strategy.

Contractual protections: the legal commitments made by us regarding Google Cloud services. Google is committed to maintaining and expanding our compliance portfolio. The Data Processing and Security Terms (DPST) document defines our commitment to maintaining our ISO 27001, 27017, and 27018 certifications and to updating our SOC 2 and SOC 3 reports every 12 months.

Third-party verifications or attestations: having a third-party vendor audit the cloud provider to ensure that the provider meets compliance requirements. For example, Google was audited by a third party for ISO 27017 compliance.

https://cloud.google.com/architecture/framework/security/risk-management

85
Q

Manage your assets

A

Manage your assets
Asset management is an important part of your business requirements analysis. You must know what assets you have, and you must have a good understanding of all your assets, their value, and any critical paths or processes related to them. You must have an accurate asset inventory before you can design any sort of security controls to protect your assets.
To manage security incidents and meet your organization’s regulatory requirements, you need an accurate and up-to-date asset inventory that includes a way to analyze historical data. You must be able to track your assets, including how their risk exposure might change over time.
Use cloud asset management tools
Google Cloud asset management tools are tailored specifically to our environment and to top customer use cases.
Automate asset management
Automation lets you quickly create and manage assets based on the security requirements that you specify. You can automate aspects of the asset lifecycle in the following ways:
Deploy your cloud infrastructure using automation tools such as Terraform. Google Cloud provides the security foundations blueprint, which helps you set up infrastructure resources that meet security best practices. In addition, it configures asset changes and policy compliance notifications in Cloud Asset Inventory.
Deploy your applications using automation tools such as Cloud Run and the Artifact Registry.
Monitor for deviations from your compliance policies
Deviations from policies can occur during all phases of the asset lifecycle. For example, assets might be created without the proper security controls, or their privileges might be escalated. Similarly, assets might be abandoned without the appropriate end-of-life procedures being followed.
Integrate with your existing asset management monitoring systems
If you already use a SIEM system or other monitoring system, integrate your Google Cloud assets with that system. Integration ensures that your organization has a single, comprehensive view into all resources, regardless of environment. For more information, see Export Google Cloud security data to your SIEM system and Scenarios for exporting Cloud Logging data: Splunk.
Use data analysis to enrich your monitoring
You can export your inventory to a BigQuery table or Cloud Storage bucket for additional analysis. For an example, see Tracking assets with IoT devices: Pycom, Sigfox, and Google Cloud.

https://cloud.google.com/architecture/framework/security/asset-management

86
Q

Manage identity and access

A

Manage identity and access
The practice of identity and access management (generally referred to as IAM) helps you ensure that the right people can access the right resources. IAM addresses the following aspects of authentication and authorization:
Account management, including provisioning
Identity governance
Authentication
Access control (authorization)
Identity federation
Managing IAM can be challenging when you have different environments or you use multiple identity providers. However, it’s critical that you set up a system that can meet your business requirements while mitigating risks.
The recommendations in this document help you review your current IAM policies and procedures and determine which of those you might need to modify for your workloads in Google Cloud. For example, you must review the following:
Whether you can use existing groups to manage access or whether you need to create new ones.
Your authentication requirements (such as multi-factor authentication (MFA) using a token).
The impact of service accounts on your current policies.
If you’re using Google Cloud for disaster recovery, maintaining appropriate separation of duties.
Within Google Cloud, you use Cloud Identity to authenticate your users and resources and Google’s Identity and Access Management (IAM) product to dictate resource access. Administrators can restrict access at the organization, folder, project, and resource level. Google IAM policies dictate who can do what on which resources. Correctly configured IAM policies help secure your environment by preventing unauthorized access to resources.
For more information, see Overview of identity and access management.
Use a single identity provider
Protect the super admin account
Plan your use of service accounts
A service account is a Google account that applications can use to call the Google API of a service.
Unlike your user accounts, service accounts are created and managed within Google Cloud. Service accounts also authenticate differently than user accounts:
Update your identity processes for the cloud
Set up SSO and MFA
Implement least privilege and separation of duties
You must ensure that the right individuals get access only to the resources and services that they need in order to perform their jobs. That is, you should follow the principle of least privilege. In addition, you must ensure there is an appropriate separation of duties.
Overprovisioning user access can increase the risk of insider threat, misconfigured resources, and non-compliance with audits. Underprovisioning permissions can prevent users from being able to access the resources they need in order to complete their tasks.
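One way to spot overprovisioning is to compare the permissions granted to each member against the permissions they actually exercise (for example, as observed in audit logs). The members and permission names below are hypothetical:

```python
# Hypothetical IAM grants vs. permissions actually used, e.g. from audit logs.
granted = {
    "alice@example.com": {"storage.objects.get", "storage.objects.delete",
                          "compute.instances.start"},
    "bob@example.com": {"bigquery.jobs.create"},
}
used = {
    "alice@example.com": {"storage.objects.get"},
    "bob@example.com": {"bigquery.jobs.create"},
}

def excess_permissions(granted, used):
    """Permissions granted but never exercised: candidates for removal
    under the principle of least privilege."""
    return {member: perms - used.get(member, set())
            for member, perms in granted.items()
            if perms - used.get(member, set())}

print(excess_permissions(granted, used))
```

An empty result for a member means their grants match their usage; nonempty results are review candidates, balanced against the underprovisioning risk noted above.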
Audit access
To monitor the activities of privileged accounts for deviations from approved conditions, use Cloud Audit Logs. Cloud Audit Logs records the actions that members in your Google Cloud organization have taken in your Google Cloud resources. You can work with various audit log types across Google services. For more information, see Using Cloud Audit Logs to Help Manage Insider Risk (video).
Automate your policy controls
Set access permissions programmatically whenever possible. For best practices, see Organization policy setup. The Terraform scripts for this example foundation are in the example foundation repository.
Set restrictions on resources
Google IAM focuses on who, and it lets you authorize who can act on specific resources based on permissions. The Organization Policy Service focuses on what, and it lets you set restrictions on resources to specify how they can be configured. For example, you can use an organization policy to do the following:
Limit resource sharing based on domain.
Limit the use of service accounts.
Restrict the physical location of newly created resources.
In addition to using organizational policies for these tasks, you can restrict access to resources using one of the following methods:
Use tags to manage access to your resources without defining the access permissions on each resource. Instead, you add the tag and then set the access definition for the tag itself.
Use IAM Conditions for conditional, attribute-based control of access to resources.
Implement defense-in-depth using VPC Service Controls to further restrict access to resources.

https://cloud.google.com/architecture/framework/security/identity-access

87
Q

How do you implement compute and container security?

A

Implement compute and container security
Google Cloud includes controls to protect your compute resources and Google Kubernetes Engine (GKE) container resources. This document in the Google Cloud Architecture Framework describes key controls and best practices for using them.
Use hardened and curated VM images
Google Cloud includes Shielded VM, which allows you to harden your VM instances. Shielded VM is designed to prevent malicious code from being loaded during the boot cycle. It provides boot security, monitors integrity, and uses the Virtual Trusted Platform Module (vTPM). Use Shielded VM for sensitive workloads.
Use Confidential Computing for processing sensitive data
By default, Google Cloud encrypts data at rest and in transit across the network, but data isn’t encrypted while it’s in use in memory. If your organization handles confidential data, you need to mitigate against threats that undermine the confidentiality and integrity of either the application or the data in system memory. Confidential data includes personally identifiable information (PII), financial data, and health information.
Confidential Computing builds on Shielded VM. It protects data in use by performing computation in a hardware-based trusted execution environment.
In Google Cloud, you can enable Confidential Computing by running Confidential VMs or Confidential GKE nodes.
Protect VMs and containers
OS Login lets your employees connect to your VMs using Identity and Access Management (IAM) permissions as the source of truth instead of relying on SSH keys. In the App Engine flexible environment, application instances run within Docker containers. To enable a defined risk profile and to restrict employees from making changes to containers, ensure that your containers are stateless and immutable. The principle of immutability means that your employees do not modify the container or access it interactively. If it must be changed, you build a new image and redeploy. Enable SSH access to the underlying containers only in specific debugging scenarios.
Disable external IP addresses unless they’re necessary
To disable external IP address allocation (video) for your production VMs and to prevent the use of external load balancers, you can use organization policies. If you require your VMs to reach the internet or your on-premises data center, you can enable a Cloud NAT gateway.
You can deploy private clusters in GKE. In a private cluster, nodes have only internal IP addresses, which means that nodes and Pods are isolated from the internet by default. You can also define a network policy to manage Pod-to-Pod communication in the cluster.
Monitor your compute instance and GKE usage
Cloud Audit Logs are automatically enabled for Compute Engine and GKE. Audit logs let you automatically capture all activities with your cluster and monitor for any suspicious activity.
Keep your images and clusters up to date
Control access to your images and clusters
Isolate containers in a sandbox
Use GKE Sandbox to deploy multi-tenant applications that need an extra layer of security and isolation from their host kernel. For example, use GKE Sandbox when you are executing unknown or untrusted code. GKE Sandbox is a container isolation solution that provides a second layer of defense between containerized workloads on GKE.

https://cloud.google.com/architecture/framework/security/compute-container-security

88
Q

How do architects secure a network?

A

Secure your network
Extending your existing network to include cloud environments has many implications for security. Your on-premises approach to multi-layered defenses likely involves a distinct perimeter between the internet and your internal network. You probably protect the perimeter by using physical firewalls, routers, intrusion detection systems, and so on. Because the boundary is clearly defined, you can easily monitor for intrusions and respond accordingly.
When you move to the cloud (either completely or in a hybrid approach), you move beyond your on-premises perimeter. This document describes ways that you can continue to secure your organization’s data and workloads on Google Cloud. As mentioned in Manage risks with controls, how you set up and secure your Google Cloud network depends on your business requirements and risk appetite.
Deploy zero trust networks
Secure connections to your on-premises or multi-cloud environments
Disable default networks
When you create a new Google Cloud project, a default Google Cloud VPC network with auto mode IP addresses and pre-populated firewall rules is automatically provisioned. For production deployments, we recommend that you delete the default networks in existing projects, and disable the creation of default networks in new projects.
Virtual Private Cloud networks let you use any internal IP address. To avoid IP address conflicts, we recommend that you first plan your network and IP address allocation across your connected deployments and across your projects. A project allows multiple VPC networks, but it’s usually a best practice to limit these networks to one per project in order to enforce access control effectively.
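Planning IP address allocation up front can be as simple as checking your intended CIDR ranges for conflicts before provisioning. The range names and CIDRs below are hypothetical, and Python's standard `ipaddress` module does the overlap check:

```python
import ipaddress

# Hypothetical planned allocations across projects plus the on-premises range.
allocations = {
    "on-prem": "10.0.0.0/16",
    "prod-vpc-subnet-us": "10.1.0.0/20",
    "prod-vpc-subnet-eu": "10.1.16.0/20",
    "dev-vpc-subnet": "10.1.8.0/21",   # deliberately overlaps prod-us
}

def find_overlaps(allocations):
    """Return every pair of named ranges whose addresses conflict."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in allocations.items()}
    names = sorted(nets)
    return [(a, b) for i, a in enumerate(names) for b in names[i + 1:]
            if nets[a].overlaps(nets[b])]

print(find_overlaps(allocations))
```

Running a check like this across all connected deployments before creating VPC networks avoids the conflicts the paragraph above warns about.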
Secure your perimeter
In Google Cloud, you can use various methods to segment and secure your cloud perimeter, including firewalls and VPC Service Controls.
Use Shared VPC to build a production deployment that gives you a single shared network and that isolates workloads into individual projects that can be managed by different teams. Shared VPC provides centralized deployment, management, and control of the network and network security resources across multiple projects. Shared VPC consists of host and service projects that perform the following functions:
A host project contains the networking and network security-related resources, such as VPC networks, subnets, firewall rules, and hybrid connectivity.
A service project attaches to a host project. It lets you isolate workloads and users at the project level by using Identity and Access Management (IAM), while it shares the networking resources from the centrally managed host project.
Define firewall rules and policies at the organization, folder, and VPC network level. You can configure firewall rules to permit or deny traffic to or from VM instances. For more information and examples, see Using firewall rules. In addition to defining rules based on IP addresses, protocols, and ports, you can manage traffic and apply firewall rules based on the service account that’s used by a VM instance. Use service accounts in your firewall rules to simplify your configuration and enforce isolation without relying on an IP address as the sole identifier of a workload.
Use hierarchical firewall policies to define rules that apply to all networks in your organization, regardless of what the network-level firewall rules permit. You can also define rules at the folder level to cover only portions of your organization.
To control the movement of data in Google services and to set up context-based perimeter security, consider VPC Service Controls. VPC Service Controls provides an extra layer of security for Google Cloud services that’s independent of IAM and VPC firewall rules and policies. For example, VPC Service Controls lets you set up perimeters between confidential and non-confidential data so that you can apply controls that help prevent data exfiltration.
Inspect your network traffic
You can use Cloud IDS and Packet Mirroring to help you ensure the security and compliance of workloads running in Compute Engine and Google Kubernetes Engine (GKE).
Use a web application firewall
For external web applications and services, you can enable Google Cloud Armor to provide distributed denial-of-service (DDoS) protection and web application firewall (WAF) capabilities. Google Cloud Armor supports Google Cloud workloads that are exposed using external HTTP(S) load balancing, TCP Proxy load balancing, or SSL Proxy load balancing.
Automate infrastructure provisioning
Automation lets you create immutable infrastructure, which means that it can’t be changed after provisioning. This measure gives your operations team a known good state, fast rollback, and troubleshooting capabilities. For automation, you can use tools such as Terraform, Jenkins, and Cloud Build.
Monitor your network
Monitor your network and your traffic using telemetry.

https://cloud.google.com/architecture/framework/security/network-security

89
Q

Implement data security

A

Implement data security
As part of your deployment architecture, you must consider what data you plan to process and store in Google Cloud, and the sensitivity of the data.
Design your controls to help secure the data during its lifecycle, to identify data ownership and classification, and to help protect data from unauthorized use.
Automatically classify your data
Perform data classification as early in the data management lifecycle as possible, ideally when the data is created. Usually, data classification efforts require only a few categories.
Use Cloud DLP to discover and classify data across your Google Cloud environment.
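To illustrate the idea of infoType-style detection, here is a toy classifier built on regular expressions. It is a simplified stand-in for Cloud DLP, not its API, and the patterns are deliberately naive:

```python
import re

# Toy stand-in for Cloud DLP infoType detection; patterns are simplified
# illustrations, not production-grade PII detectors.
PATTERNS = {
    "EMAIL_ADDRESS": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "US_SOCIAL_SECURITY_NUMBER": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE_NUMBER": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def classify(text):
    """Return the set of infoType-style labels detected in a text sample."""
    return {label for label, pattern in PATTERNS.items() if pattern.search(text)}

sample = "Contact jane.doe@example.com or 555-867-5309; SSN 123-45-6789."
print(sorted(classify(sample)))
```

In a real deployment, Cloud DLP provides maintained detectors for these and many other infoTypes, and the detected labels drive the lifecycle-phase protections described below.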
Manage data governance using metadata
Data governance is a combination of processes that ensure that data is secure, private, accurate, available, and usable.
Use Dataproc Metastore or Hive metastore to manage metadata for workloads. Data Catalog has a hive connector that allows the service to discover metadata that’s inside a hive metastore.
Use Dataprep by Trifacta to define and enforce data quality rules through a console.
Protect data according to its lifecycle phase and classification
Encrypt your data
You can control access by Google support and engineering personnel to your environment on Google Cloud.

You can control the network locations from which users can access data by using VPC Service Controls.
Manage secrets using Secret Manager
Monitor your data

https://cloud.google.com/architecture/framework/security/data-security

90
Q

How do you deploy applications securely?

A

Deploy applications securely
To deploy secure applications, you must have a well-defined software development lifecycle, with appropriate security checks during the design, development, testing, and deployment stages.
When you design an application, we recommend a layered system architecture that uses standardized frameworks for identity, authorization, and access control.
Automate secure releases
Without automated tools, it can be hard to deploy, update, and patch complex application environments to meet consistent security requirements. Therefore, we recommend that you build a CI/CD pipeline for these tasks, which can solve many of these issues.
You can use automation to scan for security vulnerabilities when artifacts are created. You can also define policies for different environments (development, test, production, and so on) so that only verified artifacts are deployed.
Scan for known vulnerabilities before deployment
Use Container Analysis to automatically scan for vulnerabilities for containers that are stored in Artifact Registry and Container Registry.
Monitor your application code for known vulnerabilities
Control movement of data across perimeters
To control the movement of data across a perimeter, you can configure security perimeters around the resources of your Google-managed services.
Use VPC Service Controls to place all components and services in your CI/CD pipeline (for example, Container Registry, Artifact Registry, Container Analysis, and Binary Authorization) inside a security perimeter.
VPC Service Controls improves your ability to mitigate the risk of unauthorized copying or transfer of data (data exfiltration) from Google-managed services.

https://cloud.google.com/architecture/framework/security/app-security

91
Q

Manage compliance obligations

A

Manage compliance obligations
Your cloud regulatory requirements depend on a combination of factors, including the following:
The laws and regulations that apply to your organization’s physical locations.
The laws and regulations that apply to your customers’ physical locations.
Your industry’s regulatory requirements.
A typical compliance journey goes through three stages: assessment, gap remediation, and continual monitoring. This section addresses the best practices that you can use during each stage.
Assess your compliance needs
Compliance assessment starts with a thorough review of all of your regulatory obligations and how your business is implementing them. To help you with your assessment of Google Cloud services, use the Compliance resource center. This site provides you with details on the following:

Deploy Assured Workloads
Assured Workloads is the Google Cloud tool that builds on the controls within Google Cloud to help you meet your compliance obligations.
Review blueprints for templates and best practices that apply to your compliance regime
Google has published blueprints and solutions guides that describe best practices and that provide Terraform modules to let you roll out an environment that helps you achieve compliance. The following table lists a selection of blueprints that address security and alignment with compliance requirements.
Monitor your compliance
Most regulations require you to monitor particular activities, including access controls. To help with your monitoring, you can use the following:
Access Transparency, which provides near real-time logs when Google Cloud admins access your content.
Firewall Rules Logging to record TCP and UDP connections inside a VPC network for any rules that you create yourself.
VPC Flow Logs to record network traffic flows that are sent or received by VM instances.
Set up automatic remediation in response to particular notifications. For more information, see Cloud Functions code.

https://cloud.google.com/architecture/framework/security/compliance

92
Q

Implement data residency and sovereignty requirements

A

Data residency and sovereignty requirements are based on your regional and industry-specific regulations, and different organizations might have different data sovereignty requirements. For example, you might have the following requirements:
Control over all access to your data by Google Cloud, including what type of personnel can access the data and from which region they can access it.
Inspectability of changes to cloud infrastructure and services, which can have an impact on access to your data or the security of your data. Insight into these types of changes helps ensure that Google Cloud is unable to circumvent controls or move your data out of the region.
Survivability of your workloads for an extended time when you are unable to receive software updates from Google Cloud.
Manage your data sovereignty
Store and manage encryption keys outside the cloud.
Only grant access to these keys based on detailed access justifications.
Protect data in use.
Manage your operational sovereignty
Restrict the deployment of new resources to specific provider regions.
Limit Google personnel access based on predefined attributes such as their citizenship or geographic location.
Manage software sovereignty
Software sovereignty provides you with assurances that you can control the availability of your workloads and run them wherever you want, without depending on (or being locked in to) a single cloud provider. Software sovereignty includes the ability to survive events that require you to quickly change where your workloads are deployed and what level of outside connection is allowed.
For example, Google Cloud supports hybrid and multi-cloud deployments. In addition, Anthos lets you manage and deploy your applications in both cloud environments and on-premises environments.
Control data residency
Understanding the type of your data and its location.
Determining what risks exist to your data, and what laws and regulations apply.
Controlling where data is or where it goes.

https://cloud.google.com/architecture/framework/security/data-residency-sovereignty

93
Q

Implement privacy requirements

A

Implement privacy requirements

Privacy regulations help define how you can obtain, process, store, and manage your users’ data. Many privacy controls (for example, controls for cookies, session management, and obtaining user permission) are your responsibility because you own your data (including the data that you receive from your users).

Google Cloud includes the following controls that promote privacy:

Default encryption of all data when it’s at rest, when it’s in transit, and while it’s being processed.
Safeguards against insider access.
Support for numerous privacy regulations.
For more information, see Google Cloud Privacy Commitments.

Classify your confidential data
You must define what data is confidential and then ensure that the confidential data is properly protected. Confidential data can include credit card numbers, addresses, phone numbers, and other personal identifiable information (PII).

Using Cloud DLP, you can set up appropriate classifications. You can then tag and tokenize your data before you store it in Google Cloud. For more information, see Automatically classify your data.

Lock down access to sensitive data
Place sensitive data in its own service perimeter using VPC Service Controls, and set Google Identity and Access Management (IAM) access controls for that data. Configure multi-factor authentication (MFA) for all users who require access to sensitive data.

Set up SSO and MFA.

Monitor for phishing attacks
Ensure that your email system is configured to protect against phishing attacks, which are often used for fraud and malware attacks.

If your organization uses Gmail, you can use advanced phishing and malware protection. This collection of settings provides controls to quarantine emails, defends against anomalous attachment types, and helps protect against inbound spoofing emails. Security Sandbox detects malware in attachments. Gmail is continually and automatically updated with the latest security improvements and protections to help keep your organization’s email safe.

Extend zero trust security to your hybrid workforce
A zero trust security model means that no one is trusted implicitly, whether they are inside or outside of your organization’s network. When your IAM systems verify access requests, a zero trust security posture means that the user’s identity and context (for example, their IP address or location) are considered. Unlike a VPN, zero trust security shifts access controls from the network perimeter to users and their devices. Zero trust security allows users to work more securely from any location. For example, users can access your organization’s resources from their laptops or mobile devices while at home.

On Google Cloud, you can configure BeyondCorp Enterprise and Identity-Aware Proxy (IAP) to enable zero trust for your Google Cloud resources. If your users use Google Chrome and you enable BeyondCorp Enterprise, you can integrate zero-trust security into your users’ browsers.

https://cloud.google.com/architecture/framework/security/privacy

94
Q

Implement logging and detective controls

A

Implement logging and detective controls
Detective controls use telemetry to detect misconfigurations, vulnerabilities, and potentially malicious activity in a cloud environment. Google Cloud lets you create tailored monitoring and detective controls for your environment. This section describes these additional features and recommendations for their use.
Monitor network performance
Network Intelligence Center gives you visibility into how your network topology and architecture are performing. You can get detailed insights into network performance and then use that information to optimize your deployment by eliminating bottlenecks on your services. Connectivity Tests provides you with insights into the firewall rules and policies that are applied to the network path.
Monitor and prevent data exfiltration
Data exfiltration is a key concern for organizations. Typically, it occurs when an authorized person extracts data from a secured system and then shares that data with an unauthorized party or moves it to an insecure system.
Google Cloud provides several features and tools that help you detect and prevent data exfiltration. For more information, see Preventing data exfiltration.
Centralize your monitoring
Security Command Center provides visibility into the resources that you have in Google Cloud and into their security state. Security Command Center helps you prevent, detect, and respond to threats. It provides a centralized dashboard that you can use to help identify security misconfigurations in virtual machines, in networks, in applications, and in storage buckets. You can address these issues before they result in business damage or loss. The built-in capabilities of Security Command Center can reveal suspicious activity in your Cloud Logging security logs or indicate compromised virtual machines.

Enable the services that you need for your workloads, and then only monitor and analyze important data.
Monitor for threats
Event Threat Detection is an optional managed service of Security Command Center Premium that detects threats in your log stream. By using Event Threat Detection, you can detect high-risk and costly threats such as malware, cryptomining, unauthorized access to Google Cloud resources, DDoS attacks, and brute-force SSH attacks. Using the tool’s features to distill volumes of log data, your security teams can quickly identify high-risk incidents and focus on remediation.
To help detect potentially compromised user accounts in your organization, use the Sensitive Actions Cloud Platform logs to identify when sensitive actions are taken and to confirm that valid users took those actions for valid purposes. A sensitive action is an action, such as the addition of a highly privileged role, that could be damaging to your business if a malicious actor took the action. Use Cloud Logging to view, monitor, and query the Sensitive Actions Cloud Platform logs. You can also view the sensitive action log entries with the Sensitive Actions Service, a built-in service of Security Command Center Premium.
Chronicle can store and analyze all of your security data centrally. Using Chronicle, you can create detection rules, set up indicators of compromise (IoC) matching, and perform threat-hunting activities. To help you see the entire span of an attack, Chronicle can map logs into a common model, enrich them, and then link them together into timelines. Chronicle also supports threat detection using extended YARA, an open standard for malware-detection rule writing.

95
Q

How do you put shared responsibility and shared fate into practice

A

As part of your planning process, consider the following actions to help you understand and implement appropriate security controls:

Create a list of the types of workloads that you will host in Google Cloud, and whether they require IaaS, PaaS, or SaaS services. You can use the shared responsibility diagram as a checklist to ensure that you know the security controls that you need to consider.
Create a list of regulatory requirements that you must comply with, and access resources in the Compliance resource center that relate to those requirements.
Review the list of available blueprints and architectures in the Architecture Center for the security controls that you require for your particular workloads. The blueprints provide a list of recommended controls and the IaC code that you require to deploy that architecture.
Use the landing zone documentation and the recommendations in the security foundations guide to design a resource hierarchy and network architecture that meets your requirements. You can use the opinionated workload blueprints, like the secured data warehouse, to accelerate your development process.
After you deploy your workloads, verify that you’re meeting your security responsibilities using services such as the Risk Manager, Assured Workloads, Policy Intelligence tools, and Security Command Center Premium.

https://cloud.google.com/architecture/framework/security/shared-responsibility-shared-fate

96
Q

What does the Risk Protection Program, put in place as part of shared fate, do?

A

Shared fate also includes the Risk Protection Program (currently in preview), which helps you use the power of Google Cloud as a platform to manage risk, rather than just seeing cloud workloads as another source of risk that you need to manage. The Risk Protection Program is a collaboration between Google Cloud and two leading cyber insurance companies, Munich Re and Allianz Global Corporate & Specialty.

The Risk Protection Program includes Risk Manager, which provides data-driven insights that you can use to better understand your cloud security posture. If you’re looking for cyber insurance coverage, you can share these insights from Risk Manager directly with our insurance partners to obtain a quote. For more information, see Google Cloud Risk Protection Program now in Preview.

https://cloud.google.com/architecture/framework/security/shared-responsibility-shared-fate

97
Q

What are the challenges of the shared responsibility model?

A

Though shared responsibility helps define the security roles that you or the cloud provider has, relying on shared responsibility can still create challenges. Consider the following scenarios:

Most cloud security breaches are the direct result of misconfiguration (listed as number 3 in the Cloud Security Alliance’s Pandemic 11 Report), and this trend is expected to increase. Cloud products are constantly changing, and new ones are constantly being launched. Keeping up with constant change can seem overwhelming. Customers need cloud providers to provide them with opinionated best practices to help keep up with the change, starting with best practices by default and a baseline secure configuration.
Though dividing items by cloud services is helpful, many enterprises have workloads that require multiple cloud services types. In this circumstance, you must consider how various security controls for these services interact, including whether they overlap between and across services. For example, you might have an on-premises application that you’re migrating to Compute Engine, use Google Workspace for corporate email, and also run BigQuery to analyze data to improve your products.
Your business and markets are constantly changing, whether because regulations change, you enter new markets, or you acquire other companies. Your new markets might have different requirements, and a new acquisition might host its workloads on another cloud. To manage the constant changes, you must continually re-assess your risk profile and be able to implement new controls quickly.
How and where to manage your data encryption keys is an important decision that ties with your responsibilities to protect your data. The option that you choose depends on your regulatory requirements, whether you’re running a hybrid cloud environment or still have an on-premises environment, and the sensitivity of the data that you’re processing and storing.
Incident management is an important, and often overlooked, area where your responsibilities and the cloud provider responsibilities aren’t easily defined. Many incidents require close collaboration and support from the cloud provider to help investigate and mitigate them. Other incidents can result from poorly configured cloud resources or stolen credentials, and ensuring that you meet the best practices for securing your resources and accounts can be quite challenging.
Advanced persistent threats (APTs) and new vulnerabilities can impact your workloads in ways that you might not consider when you start your cloud transformation. Ensuring that you remain up-to-date on the changing landscape, and who is responsible for threat mitigation is difficult, particularly if your business doesn’t have a large security team.

https://cloud.google.com/architecture/framework/security/shared-responsibility-shared-fate

98
Q

How does an architect build reliability into a cloud solution?

A

To run a reliable service, your architecture must include the following:

Measurable reliability goals, with deviations that you promptly correct.

Design patterns for scalability, high availability, disaster recovery, and automated change management.

Components that self-heal where possible, and code that includes instrumentation for observability.

Operational procedures that run the service with minimal manual work and cognitive load on operators, and that let you rapidly detect and mitigate failures.

https://cloud.google.com/architecture/framework/reliability

99
Q

What are the key principles for running operations for a service?

A

The following are covered in this section of the Architecture Framework:
Assign clear service ownership.
Reduce time to detect (TTD) with well-tuned alerts.
Reduce time to mitigate (TTM) with incident management plans and training.
Design dashboard layouts and content to minimize TTM.
Document diagnostic procedures and mitigation for known outage scenarios.
Use blameless postmortems to learn from outages and prevent recurrences.

100
Q

Google’s approach to reliability is based on the following core principles:

A

Reliability core principles

Reliability is your top feature
Reliability is defined by the user
100% reliability is the wrong target
Reliability and rapid innovation are complementary
Design and operational principles
Define your reliability goals
Build observability into your infrastructure and applications
Design for scale and high availability
Create reliable operational processes and tools
Build efficient alerts
Build a collaborative incident management process

https://cloud.google.com/architecture/framework/reliability/principles

101
Q

Service level indicator (SLI)

A

A service level indicator (SLI) is a carefully defined quantitative measure of some aspect of the level of service that is being provided. It is a metric, not a target.

https://cloud.google.com/architecture/framework/reliability/principles

102
Q

Service level objective (SLO)

A

A service level objective (SLO) specifies a target level for the reliability of your service. The SLO is a target value for an SLI. When the SLI is at or better than this value, the service is considered to be “reliable enough.” Because SLOs are key to making data-driven decisions about reliability, they are the focal point of site reliability engineering (SRE) practices.

103
Q

Error budget

A

An error budget is calculated as 100% – SLO over a period of time. Error budgets tell you if your system has been more or less reliable than is needed over a certain time window, and how many minutes of downtime are allowed during that period.

For example, if your availability SLO is 99.9%, your error budget over a 30-day period is (1 - 0.999) ✕ 30 days ✕ 24 hours ✕ 60 minutes = 43.2 minutes. The error budget for a system is consumed, or burned, whenever the system is unavailable. Using the previous example, if the system has had 10 minutes of downtime in the past 30 days and started the 30-day period with the full budget of 43.2 minutes unutilized, then the remaining error budget is reduced to 33.2 minutes.

We recommend using a rolling window of 30 days when computing your total error budget and the error budget burn rate.
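As a sketch, the arithmetic in the worked example above can be expressed directly in Python (function and parameter names are illustrative, not part of any Google Cloud API):

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Total error budget, in minutes, for an availability SLO over a window."""
    return (1 - slo) * window_days * 24 * 60

def remaining_budget_minutes(slo: float, downtime_minutes: float,
                             window_days: int = 30) -> float:
    """Error budget left after the downtime observed in the window."""
    return error_budget_minutes(slo, window_days) - downtime_minutes

# The worked example above: a 99.9% SLO over a rolling 30-day window.
total = error_budget_minutes(0.999)          # ~43.2 minutes
left = remaining_budget_minutes(0.999, 10)   # ~33.2 minutes after 10 min down
```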

104
Q

Service level agreement (SLA)

A

A service level agreement (SLA) is an explicit or implicit contract with your users that includes consequences if you meet, or miss, the SLOs referenced in the contract.

105
Q

How do you setup and manage reliability goals?

A

Define and measure customer-centric SLIs, such as the availability or latency of the service.
Define a customer-centric error budget that’s stricter than your external SLA. Include consequences for violations, such as production freezes.
Set up latency SLIs to capture outlier values, such as 90th or 99th percentile, to detect the slowest responses.
Review SLOs at least annually and confirm that they correlate well with user happiness and service outages.

https://cloud.google.com/architecture/framework/reliability/define-goals

106
Q

What SLIs are typical in systems that serve data?

A

Availability tells you the fraction of the time that a service is usable. It’s often defined in terms of the fraction of well-formed requests that succeed, such as 99%.
Latency tells you how quickly a certain percentage of requests can be fulfilled. It’s often defined in terms of a percentile other than 50th, such as “99th percentile at 300 ms”.
Quality tells you how good a certain response is. The definition of quality is often service-specific, and indicates the extent to which the content of the response to a request varies from the ideal response content. The response quality could be binary (good or bad) or expressed on a scale from 0% to 100%.
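As a sketch, the availability and latency SLIs described above can be computed from raw request data. This is an illustrative implementation (nearest-rank percentile; names are assumptions, not a Google Cloud API):

```python
import math

def availability_sli(total_requests: int, successful_requests: int) -> float:
    """Availability: the fraction of well-formed requests that succeed."""
    return successful_requests / total_requests

def percentile_latency(latencies_ms: list[float], pct: float) -> float:
    """Latency at the given percentile (nearest-rank method), so that
    outlier slow requests aren't hidden by a good average."""
    ordered = sorted(latencies_ms)
    rank = max(0, math.ceil(pct / 100 * len(ordered)) - 1)
    return ordered[rank]

samples_ms = [120, 130, 150, 180, 200, 220, 250, 280, 300, 900]
availability = availability_sli(1000, 992)        # 0.992, i.e. 99.2%
p99 = percentile_latency(samples_ms, 99)          # 900 ms: the slow outlier
```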

https://cloud.google.com/architecture/framework/reliability/define-goals

107
Q

What SLIs are typical in systems that process data?

A

Coverage tells you the fraction of data that has been processed, such as 99.9%.
Correctness tells you the fraction of output data deemed to be correct, such as 99.99%.
Freshness tells you how fresh the source data or the aggregated output data is. Typically the more recently updated, the better, such as 20 minutes.
Throughput tells you how much data is being processed, such as 500 MiB/sec or even 1000 requests per second (RPS).

https://cloud.google.com/architecture/framework/reliability/define-goals

108
Q

What SLIs are typical in systems that store data?

A

Storage systems

Durability tells you how likely it is that data written to the system can be retrieved in the future, such as 99.9999%. Any permanent data-loss incident reduces the durability metric.
Throughput and latency are also common SLIs for storage systems.

https://cloud.google.com/architecture/framework/reliability/define-goals

109
Q

What are the best practices for adding observability into your services so that you can better understand service performance and quickly identify issues?

Observability includes monitoring, logging, tracing, profiling, debugging, and similar systems.

A

Implement monitoring early, such as before you initiate a migration or before you deploy a new application to a production environment.
Disambiguate between application issues and underlying cloud issues. Use the Monitoring API, or other Cloud Monitoring products and the Google Cloud Status Dashboard.
Define an observability strategy beyond monitoring that includes tracing, profiling, and debugging.
Regularly clean up observability artifacts that you don’t use or that don’t provide value, such as unactionable alerts.
If you generate large amounts of observability data, send application events to a data warehouse system such as BigQuery.

Monitoring is at the base of the service reliability hierarchy in the Google SRE Handbook. Without proper monitoring, you can’t tell whether an application works correctly.

https://cloud.google.com/architecture/framework/reliability/observability-infrastructure-applications

110
Q

What would you recommend to a customer so that they can architect their services to tolerate failures and scale in response to customer demand?

What is a reliable service?
A reliable service continues to respond to customer requests when there’s a high demand on the service or when there’s a maintenance event.

A

The following reliability design principles and best practices should be part of your system architecture and deployment plan.

Follow these recommendations:
Implement exponential backoff with randomization in the error retry logic of client applications.
Implement a multi-region architecture with automatic failover for high availability.
Use load balancing to distribute user requests across shards and regions.

Design the application to degrade gracefully under overload.
Serve partial responses or provide limited functionality rather than failing completely.

Establish a data-driven process for capacity planning, and use load tests and traffic forecasts to determine when to provision resources.

Establish disaster recovery procedures and test them periodically.
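A minimal sketch of the first recommendation, exponential backoff with randomization in client retry logic. Names, the retried exception type, and the defaults are illustrative assumptions:

```python
import random
import time

def call_with_retries(request, max_retries: int = 5,
                      base_s: float = 0.5, cap_s: float = 32.0):
    """Retry a flaky call with exponential backoff and full-jitter
    randomization, so many clients don't retry in lockstep."""
    for attempt in range(max_retries):
        try:
            return request()
        except ConnectionError:           # retry only transient errors
            if attempt == max_retries - 1:
                raise                     # retry budget exhausted
            # Delay grows as base * 2^attempt, capped, with full jitter.
            delay = random.uniform(0, min(cap_s, base_s * 2 ** attempt))
            time.sleep(delay)
```

The same random-delay idea spreads out traffic for timed promotions and launches: clients add a small random wait before their first request instead of all connecting in the same instant.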

https://cloud.google.com/architecture/framework/reliability/design-scale-high-availability

111
Q

How would you render failures or slowness in your service less harmful to other components that depend on it? Consider the following example design techniques and principles:

A

Use prioritized request queues and give higher priority to requests where a user is waiting for a response.
Serve responses out of a cache to reduce latency and load.
Fail safe in a way that preserves function.
Degrade gracefully when there’s a traffic overload.
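The first technique, a prioritized request queue, can be sketched with the standard-library heap (class and priority names are illustrative):

```python
import heapq
import itertools

class PriorityRequestQueue:
    """Serve interactive (user-waiting) requests before background work."""
    INTERACTIVE, BATCH = 0, 1   # lower value is served first

    def __init__(self):
        self._heap = []
        self._order = itertools.count()   # FIFO tie-break within a priority

    def put(self, request, priority):
        heapq.heappush(self._heap, (priority, next(self._order), request))

    def get(self):
        """Pop the highest-priority (then oldest) pending request."""
        return heapq.heappop(self._heap)[2]
```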

https://cloud.google.com/architecture/framework/reliability/design-scale-high-availability

112
Q

Minimize the number of critical dependencies for your service, that is, other components whose failure will inevitably cause outages for your service. To make your service more resilient to failures or slowness in other components it depends on, consider the following example design techniques and principles to convert critical dependencies into non-critical dependencies:

A

Increase the level of redundancy in critical dependencies. Adding more replicas makes it less likely that an entire component will be unavailable.
Use asynchronous requests to other services instead of blocking on a response or use publish/subscribe messaging to decouple requests from responses.
Cache responses from other services to recover from short-term unavailability of dependencies.
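The caching technique above can be sketched as a small fallback cache that serves a stale copy when the dependency fails, so short-term unavailability doesn’t cascade. Names and the error type are illustrative assumptions:

```python
import time

class FallbackCache:
    """Cache a dependency's responses; serve the cached copy if the
    dependency call fails, converting a hard outage into staleness."""

    def __init__(self, fetch, ttl_s: float = 60.0, clock=time.monotonic):
        self._fetch = fetch         # callable that queries the dependency
        self._ttl = ttl_s
        self._clock = clock
        self._cache = {}            # key -> (value, fetched_at)

    def get(self, key):
        cached = self._cache.get(key)
        now = self._clock()
        if cached and now - cached[1] < self._ttl:
            return cached[0]                  # fresh enough: skip the call
        try:
            value = self._fetch(key)
            self._cache[key] = (value, now)
            return value
        except ConnectionError:
            if cached:
                return cached[0]              # dependency down: serve stale
            raise
```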

https://cloud.google.com/architecture/framework/reliability/design-scale-high-availability

113
Q

What principles would you recommend to build reliable operational processes and tools?

Examples - how to deploy updates, run services in production environments, and test for failures.

A

Choose good names for applications and services
Avoid using internal code names in production configuration files, because they can be confusing, particularly to newer employees, potentially increasing time to mitigate (TTM) for outages.
Implement progressive rollouts with canary testing

Instantaneous global changes to service binaries or configuration are inherently risky. Roll out new versions of executables and configuration changes incrementally. Start with a small scope, such as a few VM instances in a zone, and gradually expand the scope. Roll back rapidly if the change doesn’t perform as you expect, or negatively impacts users at any stage of the rollout. Your goal is to identify and address bugs when they only affect a small portion of user traffic, before you roll out the change globally.
Spread out traffic for timed promotions and launches
You might have promotional events, such as sales that start at a precise time and encourage many users to connect to the service simultaneously. If so, design client code to spread the traffic over a few seconds. Use random delays before they initiate requests.

Automate build, test, and deployment
Eliminate manual effort from your release process with the use of continuous integration and continuous delivery (CI/CD) pipelines. Perform automated integration testing and deployment. For example, create a modern CI/CD process with Anthos.

Defend against operator error
Design your operational tools to reject potentially invalid configurations. Detect and alert when a configuration version is empty, partial or truncated, corrupt, logically incorrect or unexpected, or not received within the expected time. Tools should also reject configuration versions that differ too much from the previous version.
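A minimal sketch of the checks above: reject empty or partial configurations, and configurations that differ too much from the previous version. The required keys and thresholds are illustrative assumptions:

```python
def validate_config(new: dict, previous: dict,
                    required_keys=("service", "replicas", "regions"),
                    max_changed_fraction: float = 0.5) -> list[str]:
    """Return reasons to reject the config (an empty list means accept)."""
    problems = []
    if not new:
        problems.append("config is empty")
        return problems
    missing = [k for k in required_keys if k not in new]
    if missing:
        problems.append(f"missing required keys: {missing}")
    if previous:
        keys = set(new) | set(previous)
        changed = sum(1 for k in keys if new.get(k) != previous.get(k))
        if changed / len(keys) > max_changed_fraction:
            problems.append("differs too much from the previous version")
    return problems
```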

Test failure recovery
Regularly test your operational procedures to recover from failures in your service. Without regular tests, your procedures might not work when you need them if there’s a real failure. Items to test periodically include regional failover, how to roll back a release, and how to restore data from backups.

Conduct disaster recovery tests
Like with failure recovery tests, don’t wait for a disaster to strike. Periodically test and verify your disaster recovery procedures and processes.

Practice chaos engineering
Consider the use of chaos engineering in your test practices. Introduce actual failures into different components of production systems under load in a safe environment. This approach helps to ensure that there’s no overall system impact because your service handles failures correctly at each level.

https://cloud.google.com/architecture/framework/reliability/create-operational-processes-tools

114
Q

What are 3 things you can do to implement progressive rollouts with canary testing?

A

Instantaneous global changes to service binaries or configuration are inherently risky. Roll out new versions of executables and configuration changes incrementally. Start with a small scope, such as a few VM instances in a zone, and gradually expand the scope. Roll back rapidly if the change doesn’t perform as you expect, or negatively impacts users at any stage of the rollout. Your goal is to identify and address bugs when they only affect a small portion of user traffic, before you roll out the change globally.

Set up a canary testing system that’s aware of service changes and does A/B comparison of the metrics of the changed servers with the remaining servers. The system should flag unexpected or anomalous behavior. If the change doesn’t perform as you expect, the canary testing system should automatically halt rollouts. Problems can be clear, such as user errors, or subtle, like CPU usage increase or memory bloat.

It’s better to stop and roll back at the first hint of trouble and diagnose issues without the time pressure of an outage. After the change passes canary testing, propagate it to larger scopes gradually, such as to a full zone, then to a second zone. Allow time for the changed system to handle progressively larger volumes of user traffic to expose any latent bugs.
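The A/B comparison above can be sketched as a simple halt check on error rates. Real canary systems use proper statistics and many metrics; this ratio test, its names, and its thresholds are illustrative assumptions:

```python
def canary_should_halt(baseline_errors: int, baseline_total: int,
                       canary_errors: int, canary_total: int,
                       max_ratio: float = 2.0,
                       min_abs_increase: float = 0.001) -> bool:
    """Halt the rollout if the canary's error rate is anomalously higher
    than the baseline servers' error rate."""
    baseline_rate = baseline_errors / baseline_total
    canary_rate = canary_errors / canary_total
    if canary_rate - baseline_rate < min_abs_increase:
        return False                  # no meaningful absolute increase
    return canary_rate > baseline_rate * max_ratio
```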

https://cloud.google.com/architecture/framework/reliability/create-operational-processes-tools

115
Q

What are operational principles to create alerts that help you run reliable services?

A

The more information you have about how your service performs, the more informed your decisions are when there’s an issue. Design your alerts for early and accurate detection of all user-impacting system problems, and minimize false positives.
Optimize the alert delay
There’s a balance between alerts that are sent too soon, which stress the operations team, and alerts that are sent too late, which cause long service outages. Tune the alert delay before the monitoring system notifies humans of a problem to minimize time to detect, while maximizing signal versus noise. Use the error budget consumption rate to derive the optimal alert configuration.
Alert on symptoms rather than causes
Trigger alerts based on the direct impact to user experience. Noncompliance with global or per-customer SLOs indicates a direct impact. Don’t alert on every possible root cause of a failure, especially when the impact is limited to a single replica. A well-designed distributed system recovers seamlessly from single-replica failures.
Alert on outlier values rather than averages
When monitoring latency, define SLOs and set alerts for (pick two out of three) 90th, 95th, or 99th percentile latency, not for average or 50th percentile latency. Good mean or median latency values can hide unacceptably high values at the 90th percentile or above that cause a very bad user experience. Therefore, apply this principle of alerting on outlier values when you monitor latency for any critical operation, such as a request-response interaction with a web server, batch completion in a data processing pipeline, or a read or write operation on a storage service.
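The error-budget-consumption-rate idea can be sketched as a burn-rate alert. The 14.4 default is a commonly cited multiplier from SRE practice (a one-hour window consuming about 2% of a 30-day budget: 0.02 × 30 × 24 = 14.4); the function names and defaults are illustrative, so tune them for your own service:

```python
def burn_rate(errors: int, total: int, slo: float) -> float:
    """How fast the error budget is being consumed: the observed error
    rate divided by the budget (1 - SLO). A sustained burn rate of 1.0
    exhausts the budget exactly at the end of the SLO window."""
    return (errors / total) / (1 - slo)

def should_page(errors: int, total: int, slo: float,
                threshold: float = 14.4) -> bool:
    """Page a human when the short-window burn rate is high enough that
    the monthly budget would be gone in roughly two days."""
    return burn_rate(errors, total, slo) >= threshold
```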

116
Q

What are the best practices to manage services and define processes to respond to incidents? Incidents occur in all services, so you need a well-documented process to efficiently respond to these issues and mitigate them.

A

Establish an incident management plan, and train your teams to use it.
To reduce TTD, implement the recommendations to build observability into your infrastructure and applications.
Build a “What’s changed?” dashboard that you can glance at when there’s an incident.
Document query snippets or build a Looker Studio dashboard with frequent log queries.
Evaluate Firebase Remote Config to mitigate rollout issues for mobile applications.
Test failure recovery, including restoring data from backups, to decrease TTM for a subset of your incidents.
Design for and test configuration and binary rollbacks.
Replicate data across regions for disaster recovery and use disaster recovery tests to decrease TTM after regional outages.
Design a multi-region architecture for resilience to regional outages if the business need for high availability justifies the cost, to increase TBF.

https://cloud.google.com/architecture/framework/reliability/build-incident-management-process

117
Q

How do cloud practitioners optimize the cost of workloads in Google Cloud?

Moving your IT workloads to the cloud can help you to innovate at scale, deliver features faster, and respond to evolving customer needs. To migrate existing workloads or deploy applications built for the cloud, you need a topology that’s optimized for security, resilience, operational excellence, cost, and performance.

A

In the cost optimization category of the Architecture Framework, you:

Adopt and implement FinOps: Strategies to help you encourage employees to consider the cost impact when provisioning and managing resources in Google Cloud.
Monitor and control cost: Best practices, tools, and techniques to track and control the cost of your resources in Google Cloud.

Optimize cost: Compute, containers, and serverless: Service-specific cost-optimization controls for Compute Engine, Google Kubernetes Engine, Cloud Run, Cloud Functions, and App Engine.

Optimize cost: Storage: Cost-optimization controls for Cloud Storage, Persistent Disk, and Filestore.

Optimize cost: Databases and smart analytics: Cost-optimization controls for BigQuery, Cloud Bigtable, Cloud Spanner, Cloud SQL, Dataflow, and Dataproc.

Optimize cost: Networking: Cost-optimization controls for your networking resources in Google Cloud.

Optimize cost: Cloud operations: Recommendations to help you optimize the cost of monitoring and managing your resources in Google Cloud.

118
Q

What is FinOps?
When should it be controlled centrally, and when by the project teams?

A

FinOps is a practice that combines people, processes, and technology to promote financial accountability and the discipline of cost optimization in an organization, regardless of its size or maturity in the cloud.

The guidance in this section is intended for CTOs, CIOs, and executives responsible for controlling their organization’s spend in the cloud. The guidance also helps individual cloud operators understand and adopt FinOps.

Every employee in your organization can help reduce the cost of your resources in Google Cloud, regardless of role (analyst, architect, developer, or administrator). In teams that have not had to track infrastructure costs in the past, you might have to educate employees about the need for collective responsibility.

A common model is for a central FinOps team or Cloud Center of Excellence (CCoE) to standardize the process for optimizing cost across all the cloud workloads. This model assumes that the central team has the required knowledge and expertise to identify high-value opportunities to improve efficiency.

Although centralized cost-control might work well in the initial stages of cloud adoption when usage is low, it doesn’t scale well when cloud adoption and usage increase. The central team might struggle with scaling, and project teams might not accept decisions made by anyone outside their teams.

We recommend that the central team delegate the decision making for resource optimization to the project teams. The central team can drive broader efforts to encourage the adoption of FinOps across the organization. To enable the individual project teams to practice FinOps, the central team must standardize the process, reporting, and tooling for cost optimization. The central team must work closely with teams that aren’t familiar with FinOps practices, and help them consider cost in their decision-making processes. The central team must also act as an intermediary between the finance team and the individual project teams.

https://cloud.google.com/architecture/framework/cost-optimization/finops

119
Q

What are some ways to monitor and control costs?

A

Identify cost-management focus areas
The cost of your resources in Google Cloud depends on the quantity of resources that you use and the rate at which you’re billed for the resources.
Cost visibility
Track how much you spend and how your resources and services are billed, so that you can analyze the effect of cost on business outcomes. We recommend that you follow the FinOps operating model, which suggests the following actions to make cost information visible across your organization:
Allocate: Assign an owner for every cost item.
Report: Make cost data available, consumable, and actionable.
Forecast: Estimate and track future spend.
Resource optimization
Align the number and size of your cloud resources to the requirements of your workload. Where feasible, consider using managed services or re-architecting your applications. Typically, individual engineering teams have more context than the central FinOps (financial operations) team on opportunities and techniques to optimize resource deployment. We recommend that the FinOps team work with the individual engineering teams to identify resource-optimization opportunities that can be applied across the organization.
Rate optimization
The FinOps team often makes rate optimization decisions centrally. We recommend that the individual engineering teams work with the central FinOps team to take advantage of deep discounts for reservations, committed usage, Spot VMs, flat-rate pricing, and volume and contract discounting.
Design recommendations
Consolidate billing and resource management
To manage billing and resources in Google Cloud efficiently, we recommend that you use a single billing account for your organization, and use internal chargeback mechanisms to allocate costs. Use multiple billing accounts for loosely structured conglomerates and organizations with entities that don’t affect each other. For example, resellers might need distinct accounts for each customer. Using separate billing accounts might also help you meet country-specific tax regulations.
Track and allocate cost using labels
Labels are key-value pairs that you can use to tag projects and resources. To categorize cost data at the required granularity, establish a labeling schema that suits your organization’s chargeback mechanism and helps you allocate costs appropriately. Assign cost allocation labels at the project level, and define a set of labels that can be applied by default to all the projects. You can automate the assignment of labels when you create projects.
Configure billing access control
To control access to Cloud Billing, we recommend that you assign the Billing Account Administrator role to only those users who manage billing contact information. For example, employees in finance, accounting, and operations might need this role.
Configure billing reports
Set up billing reports to provide data for the key metrics that you need to track.
Analyze trends and forecast cost
Customize and analyze cost reports using BigQuery Billing Export, and visualize cost data using Looker Studio. Assess the trend of actual costs and how much you might spend by using the forecasting tool.
Optimize resource usage and cost
This section recommends best practices to help you optimize the usage and cost of your resources across Google Cloud services.
Tools and techniques
The on-demand provisioning and pay-per-use characteristics of the cloud help you to optimize your IT spend. This section describes tools that Google Cloud provides and techniques that you can use to track and control the cost of your resources in the cloud. Before you use these tools and techniques, review the basic Cloud Billing concepts.
Billing reports
Google Cloud provides billing reports within the Google Cloud console to help you view your current and forecasted spend. The billing reports enable you to view cost data on a single page, discover and analyze trends, forecast the end-of-period cost, and take corrective action when necessary.
Data export to BigQuery
You can export billing reports to BigQuery, and analyze costs using granular and historical views of data, including data that’s categorized using labels. You can perform more advanced analyses using BigQuery ML. We recommend that you enable export of billing reports to BigQuery when you create the Cloud Billing account. Your BigQuery dataset contains billing data from the date you set up Cloud Billing export. The dataset doesn’t include data for the period before you enabled export.
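Once billing data is exported, label-based cost analysis becomes a simple aggregation. The following is a hedged sketch; the row shape only loosely mirrors the billing export schema (a cost plus a list of label key-value pairs), and all figures are invented:

```python
# Sketch: aggregating exported billing rows by a cost-allocation label.
# Row shape and values are illustrative, not the exact export schema.
from collections import defaultdict

rows = [
    {"cost": 120.0, "labels": [{"key": "team", "value": "search"}]},
    {"cost": 80.0,  "labels": [{"key": "team", "value": "ads"}]},
    {"cost": 40.0,  "labels": [{"key": "team", "value": "search"}]},
    {"cost": 15.0,  "labels": []},  # unlabeled spend surfaces separately
]

def cost_by_label(rows, key):
    totals = defaultdict(float)
    for row in rows:
        value = next((l["value"] for l in row["labels"] if l["key"] == key),
                     "(unlabeled)")
        totals[value] += row["cost"]
    return dict(totals)

print(cost_by_label(rows, "team"))
# {'search': 160.0, 'ads': 80.0, '(unlabeled)': 15.0}
```

In practice you would run the equivalent GROUP BY in BigQuery against the exported dataset rather than in application code.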
Billing access control
You can control access to Cloud Billing for specific resources by defining Identity and Access Management (IAM) policies for the resources. To grant or limit access to Cloud Billing, you can set an IAM policy at the organization level, the billing account level, or the project level.
Budgets, alerts, and quotas
Budgets help you track actual Google Cloud costs against planned spending. When you create a budget, you can configure alert rules to trigger email notifications when the actual or forecasted spend exceeds a defined threshold. You can also use budgets to automate cost-control responses.
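A minimal sketch of automating a cost-control response: parse a budget alert message and decide whether to act. The payload shape is based on the Cloud Billing budget notification format, but treat the exact field names as an assumption:

```python
import json

# Sketch: deciding on a cost-control response from a budget alert.
# Field names mirror the budget notification format (an assumption here).
message = json.dumps({
    "budgetDisplayName": "prod-monthly",
    "costAmount": 920.0,
    "budgetAmount": 1000.0,
})

def should_throttle(payload, threshold=0.9):
    """Return True when actual spend reaches the alert threshold."""
    data = json.loads(payload)
    return data["costAmount"] >= threshold * data["budgetAmount"]

print(should_throttle(message))  # True: 920 >= 0.9 * 1000
```

In a real deployment this logic would run in a Cloud Function subscribed to the budget's Pub/Sub topic.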

120
Q

What sorts of things can you do to optimize costs for compute resources?

A

The following recommendations are applicable to all the compute, containers, and serverless services in Google Cloud that are discussed in this section.

Track usage and cost
Use the following tools and techniques to monitor resource usage and cost:

View and respond to cost-optimization recommendations in the Recommendation Hub.
Get email notifications for potential increases in resource usage and cost by configuring budget alerts.
Manage and respond to alerts programmatically by using the Pub/Sub and Cloud Functions services.
Control resource provisioning
Use the following recommendations to control the quantity of resources provisioned in the cloud and the location where the resources are created:

To help ensure that resource consumption and cost don’t exceed the forecast, use resource quotas.
Provision resources in the lowest-cost region that meets the latency requirements of your workload. To control where resources are provisioned, you can use the organization policy constraint gcp.resourceLocations.
Get discounts for committed use
Committed use discounts (CUDs) are ideal for workloads with predictable resource needs. After migrating your workload to Google Cloud, find the baseline for the resources required, and get deeper discounts for committed usage. For example, purchase a one or three-year commitment, and get a substantial discount on Compute Engine VM pricing.
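A back-of-envelope sketch of the savings; the discount rates below are illustrative assumptions, so check current Google Cloud pricing for the rates that apply to your machine types:

```python
# Sketch: estimating monthly savings from committed use discounts (CUDs).
# Discount rates are illustrative, not current published pricing.
def committed_monthly_cost(on_demand_monthly, discount):
    return on_demand_monthly * (1 - discount)

baseline = 1000.0  # steady-state monthly on-demand spend for baseline VMs
one_year = committed_monthly_cost(baseline, 0.37)    # assumed 1-year rate
three_year = committed_monthly_cost(baseline, 0.55)  # assumed 3-year rate

print(round(one_year, 2), round(three_year, 2))  # 630.0 450.0
```

The key input is the baseline: commit only to the resource level your workload uses predictably, and leave bursty usage on demand or on Spot VMs.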

Automate cost-tracking using labels
Define and assign labels consistently. The following are examples of how you can use labels to automate cost-tracking:

For VMs that only developers use during business hours, assign the label env: development. You can use Cloud Scheduler to set up a serverless Cloud Function to shut down these VMs after business hours, and restart them when necessary.

For an application that has several Cloud Run services and Cloud Functions instances, assign a consistent label to all the Cloud Run and Cloud Functions resources. Identify the high-cost areas, and take action to reduce cost.
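The first example above can be sketched as a simple label-driven decision for a scheduled job; the VM names, labels, and business-hours window are assumptions for illustration:

```python
# Sketch: which VMs should a scheduled job stop outside business hours?
# Names, labels, and the 08:00-17:59 window are illustrative assumptions.
BUSINESS_HOURS = range(8, 18)

vms = [
    {"name": "dev-1",  "labels": {"env": "development"}},
    {"name": "prod-1", "labels": {"env": "production"}},
]

def vms_to_stop(vms, hour):
    if hour in BUSINESS_HOURS:
        return []  # never stop anything during business hours
    return [vm["name"] for vm in vms
            if vm["labels"].get("env") == "development"]

print(vms_to_stop(vms, hour=22))  # ['dev-1']
```

Cloud Scheduler would invoke logic like this on a cron schedule, with the actual stop/start performed through the Compute Engine API.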

Customize billing reports
Configure your Cloud Billing reports by setting up the required filters and grouping the data as necessary (for example, by projects, services, or labels).

Promote a cost-saving culture
Train your developers and operators on your cloud infrastructure. Create and promote learning programs using traditional or online classes, discussion groups, peer reviews, pair programming, and cost-saving games. As shown in Google’s DORA research, organizational culture is a key driver for improving performance, reducing rework and burnout, and optimizing cost. By giving employees visibility into the cost of their resources, you help them align their priorities and activities with business objectives and constraints.


https://cloud.google.com/architecture/framework/cost-optimization/compute

121
Q

What are ways to optimize costs for GKE resources?

A

Use Cloud Monitoring to get real-time information about your GKE clusters (spending, bin-packing, application right-sizing, and scaling).
Use GKE Autopilot to let GKE maximize the efficiency of your cluster’s infrastructure. You don’t need to monitor the health of your nodes, handle bin-packing, or calculate the capacity that your workloads need.
Fine-tune GKE autoscaling by using Horizontal Pod Autoscaler (HPA), Vertical Pod Autoscaler (VPA), Cluster Autoscaler (CA), or node auto-provisioning based on your workload’s requirements.
For batch workloads that aren’t sensitive to startup latency, use the optimization-utilization autoscaling profile to help improve the utilization of the cluster.
Use node auto-provisioning to extend the GKE cluster autoscaler, and efficiently create and delete node pools based on the specifications of pending pods without over-provisioning.
Use separate node pools: a static node pool for static load, and dynamic node pools with cluster autoscaling groups for dynamic loads.
Use Spot VMs for Kubernetes node pools when your pods are fault-tolerant and can terminate gracefully in less than 25 seconds. Combined with the GKE cluster autoscaler, this strategy helps you ensure that the node pool with lower-cost VMs (in this case, the node pool with Spot VMs) scales first.
Choose cost-efficient machine types (for example: E2, N2D, T2D), which provide 20–40% higher performance-to-price.
Use GKE usage metering to analyze your clusters’ usage profiles by namespaces and labels. Identify the team or application that’s spending the most, the environment or component that caused spikes in usage or cost, and the team that’s wasting resources.
Use resource quotas in multi-tenant clusters to prevent any tenant from using more than its assigned share of cluster resources.
Schedule automatic downscaling of development and test environments after business hours.
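The HPA tuning mentioned above follows the Kubernetes-documented scaling rule, desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), sketched here with illustrative utilization numbers:

```python
import math

# Sketch of the Horizontal Pod Autoscaler's documented scaling rule.
def desired_replicas(current_replicas, current_utilization, target_utilization):
    return math.ceil(current_replicas * current_utilization / target_utilization)

# Pods running hot (90% CPU against a 60% target) scale out...
print(desired_replicas(4, current_utilization=90, target_utilization=60))  # 6
# ...and idle pods (30% CPU) scale in, which is where the savings come from.
print(desired_replicas(4, current_utilization=30, target_utilization=60))  # 2
```

Setting the target utilization too low wastes money on headroom; too high risks saturation before the scale-out completes.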

122
Q

What are some ways to optimize cloud run resources for costs?

A

Adjust the concurrency setting (default: 80) to reduce cost. Cloud Run determines the number of requests to be sent to an instance based on CPU and memory usage. By increasing the request concurrency, you can reduce the number of instances required.
Set a limit for the number of instances that can be deployed.
Estimate the number of instances required by using the Billable Instance Time metric. For example, if the metric shows 100s/s, around 100 instances were scheduled. Add a 30% buffer to preserve performance; that is, 130 instances for 100s/s of traffic.
To reduce the impact of cold starts, configure a minimum number of instances. When these instances are idle, they are billed at a tenth of the price.
Track CPU usage, and adjust the CPU limits accordingly.
Use traffic management to determine a cost-optimal configuration.
Consider using Cloud CDN or Firebase Hosting for serving static assets.
For Cloud Run apps that handle requests globally, consider deploying the app to multiple regions, because cross continent egress traffic can be expensive. This design is recommended if you use a load balancer and CDN.
Reduce the startup times for your instances, because the startup time is also billable.
Purchase Committed Use Discounts, and save up to 17% off the on-demand pricing for a one-year commitment.
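The instance estimate above works out as a small piece of arithmetic (a minimal sketch of the 30% buffer rule):

```python
import math

# Sketch: Billable Instance Time of 100 s/s implies ~100 instances;
# add a 30% buffer to preserve performance. Integer percent math
# avoids floating-point rounding surprises in the ceiling.
def estimated_instances(billable_seconds_per_second, buffer_pct=30):
    return math.ceil(billable_seconds_per_second * (100 + buffer_pct) / 100)

print(estimated_instances(100))  # 130
```

The result is a sensible value for the maximum-instances limit mentioned earlier: high enough to absorb normal peaks, low enough to cap runaway cost.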

123
Q

What are some ways to optimize costs for cloud functions?

A

Observe the execution time of your functions. Experiment and benchmark to design the smallest function that still meets your required performance threshold.
If your Cloud Functions workloads run constantly, consider using GKE or Compute Engine to handle the workloads. Containers or VMs might be lower-cost options for always-running workloads.
Limit the number of function instances that can co-exist.
Benchmark the runtime performance of the Cloud Functions programming languages against the workload of your function. Programs in compiled languages have longer cold starts, but run faster. Programs in interpreted languages run slower, but have a lower cold-start overhead. Short, simple functions that run frequently might cost less in an interpreted language.
Delete temporary files written to the local disk, which is an in-memory file system. Temporary files consume memory that’s allocated to your function, and sometimes persist between invocations. If you don’t delete these files, an out-of-memory error might occur and trigger a cold start, which increases the execution time and cost.
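A toy latency model for the compiled-versus-interpreted tradeoff above; all figures (cold-start times, execution times, cold-start rate) are invented for illustration:

```python
# Sketch: expected billable time per invocation, modeling the tradeoff
# between cold-start overhead and steady-state execution speed.
# All numbers are illustrative assumptions, not benchmarks.
def expected_ms(cold_start_ms, exec_ms, cold_start_rate):
    return cold_start_ms * cold_start_rate + exec_ms

# Compiled: slow cold start, fast execution.
compiled = expected_ms(cold_start_ms=800, exec_ms=20, cold_start_rate=0.05)
# Interpreted: fast cold start, slower execution.
interpreted = expected_ms(cold_start_ms=200, exec_ms=60, cold_start_rate=0.05)

print(compiled, interpreted)
```

With frequent invocations the cold-start rate drops and the compiled language wins; for rarely invoked functions the interpreted language's cheap cold start can dominate. Benchmark with your own workload rather than these made-up numbers.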

https://cloud.google.com/architecture/framework/cost-optimization/compute#general_recommendations

124
Q

What are ways to optimize costs of your App Engine Resources?

A

Set maximum instances based on your traffic and request latency. App Engine usually scales capacity based on the traffic that the applications receive. You can control cost by limiting the number of instances that App Engine can create.
To limit the memory or CPU available for your application, set an instance class. For CPU-intensive applications, allocate more CPU. Test a few configurations to determine the optimal size.
Benchmark your App Engine workload in multiple programming languages. For example, a workload implemented in one language may need fewer instances and lower cost to complete tasks on time than the same workload programmed in another language.
Optimize for fewer cold starts. When possible, reduce CPU-intensive or long-running tasks that occur in the global scope. Try to break down the task into smaller operations that can be “lazy loaded” into the context of a request.
If you expect bursty traffic, configure a minimum number of idle instances that are pre-warmed. If you are not expecting traffic, you can configure the minimum idle instances to zero.
To balance performance and cost, run an A/B test by splitting traffic between two versions, each with a different configuration. Monitor the performance and cost of each version, tune as necessary, and decide the configuration to which traffic should be sent.
Configure request concurrency, and set the maximum concurrent requests higher than the default. The more requests each instance can handle concurrently, the more efficiently you can use existing instances to serve traffic.

https://cloud.google.com/architecture/framework/cost-optimization/compute#general_recommendations

125
Q

How can an organization optimize costs of their solutions by using labels?

A

For VMs that only developers use during business hours, assign the label env: development. You can use Cloud Scheduler to set up a serverless Cloud Function to shut down these VMs after business hours, and restart them when necessary.

For an application that has several Cloud Run services and Cloud Functions instances, assign a consistent label to all the Cloud Run and Cloud Functions resources. Identify the high-cost areas, and take action to reduce cost.

https://cloud.google.com/architecture/framework/cost-optimization/compute#general_recommendations

126
Q

What do cloud practitioners do to optimize the cost of workloads in Google Cloud?

A

Adopt and implement FinOps: Strategies to help you encourage employees to consider the cost impact when provisioning and managing resources in Google Cloud.
Monitor and control cost: Best practices, tools, and techniques to track and control the cost of your resources in Google Cloud.
Optimize cost: Compute, containers, and serverless: Service-specific cost-optimization controls for Compute Engine, Google Kubernetes Engine, Cloud Run, Cloud Functions, and App Engine.
Optimize cost: Storage: Cost-optimization controls for Cloud Storage, Persistent Disk, and Filestore.
Optimize cost: Databases and smart analytics: Cost-optimization controls for BigQuery, Cloud Bigtable, Cloud Spanner, Cloud SQL, Dataflow, and Dataproc.
Optimize cost: Networking: Cost-optimization controls for your networking resources in Google Cloud.
Optimize cost: Cloud operations: Recommendations to help you optimize the cost of monitoring and managing your resources in Google Cloud.

https://cloud.google.com/architecture/framework/cost-optimization

127
Q

In a FinOps model, who should own cost data and manage cost-optimization operations: a central team or the individual project teams?

A

Every employee in your organization can help reduce the cost of your resources in Google Cloud, regardless of role (analyst, architect, developer, or administrator). In teams that have not had to track infrastructure costs in the past, you might have to educate employees about the need for collective responsibility.

A common model is for a central FinOps team or Cloud Center of Excellence (CCoE) to standardize the process for optimizing cost across all the cloud workloads. This model assumes that the central team has the required knowledge and expertise to identify high-value opportunities to improve efficiency.

Although centralized cost-control might work well in the initial stages of cloud adoption when usage is low, it doesn’t scale well when cloud adoption and usage increase. The central team might struggle with scaling, and project teams might not accept decisions made by anyone outside their teams.

We recommend that the central team delegate the decision making for resource optimization to the project teams. The central team can drive broader efforts to encourage the adoption of FinOps across the organization. To enable the individual project teams to practice FinOps, the central team must standardize the process, reporting, and tooling for cost optimization. The central team must work closely with teams that aren’t familiar with FinOps practices, and help them consider cost in their decision-making processes. The central team must also act as an intermediary between the finance team and the individual project teams.

https://cloud.google.com/architecture/framework/cost-optimization/finops

128
Q

Describe the design principles that we recommend your central team promote.

A

**Encourage individual accountability**
Any employee who creates and uses cloud resources affects the usage and the cost of those resources. Hold employees accountable for the cost of their resources, and encourage them to implement data-driven cost-optimization actions.
Educate users about cost-optimization opportunities and techniques.
Reward employees who optimize cost, and celebrate success.
Make costs visible across the organization.
Use a single, well-defined method for calculating the fully loaded costs of cloud resources. For example, the method could consider the total cloud spend adjusted for purchased discounts and shared costs, like the cost of shared databases.
Set up dashboards that enable employees to view their cloud spend in near real time.
To motivate individuals in the team to own their costs, allow wide visibility of cloud spending across teams.
**Enable collaborative behavior**
Create a workload-onboarding process that helps ensure cost efficiency in the design stage through peer reviews of proposed architectures by other engineers.
Create a cross-team knowledge base of cost-efficient architectural patterns.
**Establish a blameless culture**
Promote a culture of learning and growth that makes it safe to take risks, make corrections when required, and innovate.

**While FinOps practices are often focused on cost reduction, the focus for a central team must be on enabling project teams to make decisions that maximize the business value of their cloud resources.**
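The "single, well-defined method for calculating fully loaded costs" can be sketched as follows; the specific adjustment rule and figures here are assumptions for illustration, not a prescribed formula:

```python
# Sketch of one fully loaded cost method: total spend adjusted for
# purchased discounts, plus a usage-based share of shared costs
# (for example, a shared database). Rule and figures are assumptions.
def fully_loaded_cost(direct_spend, discounts, shared_cost, usage_share):
    return direct_spend - discounts + shared_cost * usage_share

team_cost = fully_loaded_cost(
    direct_spend=1000.0,  # team's own resource spend
    discounts=150.0,      # team's share of purchased discounts
    shared_cost=400.0,    # e.g. a shared database
    usage_share=0.25,     # this team drives 25% of shared usage
)
print(team_cost)  # 950.0
```

Whatever method you choose, the point in the text stands: use one method consistently, so teams' dashboards are comparable.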

129
Q

To manage the cost of cloud resources what areas should you focus on?

A

Cost visibility
Resource optimization
Rate optimization

Cost visibility
Track how much you spend and how your resources and services are billed, so that you can analyze the effect of cost on business outcomes. We recommend that you follow the FinOps operating model, which suggests the following actions to make cost information visible across your organization:

Allocate: Assign an owner for every cost item.
Report: Make cost data available, consumable, and actionable.
Forecast: Estimate and track future spend.
Resource optimization
Align the number and size of your cloud resources to the requirements of your workload. Where feasible, consider using managed services or re-architecting your applications. Typically, individual engineering teams have more context than the central FinOps (financial operations) team on opportunities and techniques to optimize resource deployment. We recommend that the FinOps team work with the individual engineering teams to identify resource-optimization opportunities that can be applied across the organization.

Rate optimization
The FinOps team often makes rate optimization decisions centrally. We recommend that the individual engineering teams work with the central FinOps team to take advantage of deep discounts for reservations, committed usage, Spot VMs, flat-rate pricing, and volume and contract discounting.
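The "Forecast" action above can be approximated with a simple run-rate projection (illustrative only; the Cloud Billing forecasting tool uses more sophisticated models than a linear extrapolation):

```python
# Sketch: naive month-end spend forecast from month-to-date spend.
# A linear run rate ignores seasonality; figures are illustrative.
def forecast_month_end(spend_to_date, day_of_month, days_in_month):
    daily_run_rate = spend_to_date / day_of_month
    return daily_run_rate * days_in_month

print(forecast_month_end(spend_to_date=300.0, day_of_month=10,
                         days_in_month=30))  # 900.0
```

Even this crude forecast is enough to drive the budget alerts described earlier, which can fire on forecasted as well as actual spend.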

130
Q

Cloud Bigtable
What recommendations would you provide to a customer to optimize the performance of their Bigtable instances?

A

Plan capacity based on performance requirements
You can use Bigtable in a broad spectrum of applications, each with a different optimization goal. For example, for batch data-processing jobs, throughput might be more important than latency. For an online service that serves user requests, you might need to prioritize lower latency over throughput. When you plan capacity for your Bigtable clusters, consider the tradeoffs between throughput and latency. For more information, see Plan your Bigtable capacity.

Follow schema-design best practices
Your tables can scale to billions of rows and thousands of columns, enabling you to store petabytes of data. When you design the schema for your Bigtable tables, consider the schema design best practices.

Monitor performance and make adjustments
Monitor the CPU and disk usage for your instances, analyze the performance of each cluster, and review the sizing recommendations that are shown in the monitoring charts.

https://cloud.google.com/architecture/framework/performance-optimization/databases

131
Q

Cloud Spanner
What recommendations would you provide to help optimize the performance of Spanner instances?

A

Choose a primary key that prevents a hotspot
A hotspot is a single server that is forced to handle many requests.

Follow best practices for SQL coding

Use query options to manage the SQL query optimizer

Visualize and tune the structure of query execution plans

Use operations APIs to manage long-running operations

Follow best practices for bulk loading

Monitor and control CPU utilization
Analyze and solve latency issues

Launch applications after the database reaches the warm state
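One common way to choose a primary key that prevents a hotspot is to avoid monotonically increasing keys: prefix the ID with a deterministic shard derived from a hash, so consecutive writes spread across key ranges instead of piling onto one server. A hedged sketch, with an assumed shard count:

```python
import hashlib

# Sketch: sharded primary key to avoid write hotspots on sequential IDs.
# The shard count and key format are illustrative assumptions.
NUM_SHARDS = 16

def sharded_key(order_id: int) -> str:
    digest = hashlib.md5(str(order_id).encode()).hexdigest()
    shard = int(digest, 16) % NUM_SHARDS
    return f"{shard:02d}#{order_id}"

# Consecutive IDs land on (generally) different shards/key ranges.
keys = [sharded_key(i) for i in (1000, 1001, 1002)]
print(keys)
```

The tradeoff: range scans over the original ID order now require querying all shards, so apply this only to tables whose write pattern actually hotspots.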

https://cloud.google.com/architecture/framework/performance-optimization/databases

132
Q

What recommendations would help you optimize the performance of your Cloud SQL instances running SQL Server, MySQL, and PostgreSQL databases?

A

For SQL Server databases, Google recommends that you modify certain parameters and retain the default values for some parameters.
When you choose the storage type for MySQL or PostgreSQL databases, consider the cost-performance tradeoff between SSD and HDD storage.
To identify and analyze performance issues with PostgreSQL databases, use the Cloud SQL Insights dashboard.
To diagnose poor performance when running SQL queries, use the EXPLAIN statement.

https://cloud.google.com/architecture/framework/performance-optimization/databases

133
Q

What recommendations would help your customers optimize the performance of their analytics workloads in Google Cloud Storage?

A

Reduce latency when using Cloud Storage
To reduce latency when you access data that’s stored in Cloud Storage, we recommend the following:

Create your Cloud Storage bucket in the same region as the Dataproc cluster.
Disable auto.purge for Apache Hive-managed tables stored in Cloud Storage.
When using Spark SQL, consider creating Dataproc clusters with the latest versions of the available images. By using the latest version, you can avoid performance issues that might remain in older versions, such as slow INSERT OVERWRITE performance in Spark 2.x.
To minimize the possibility of writing many files with varying or small sizes to Cloud Storage, you can configure the Spark SQL parameters spark.sql.shuffle.partitions and spark.default.parallelism or the Hadoop parameter mapreduce.job.reduces.

https://cloud.google.com/architecture/framework/performance-optimization/analytics

134
Q

How do you optimize the performance of Dataflow?

A

When you create and deploy pipelines, you can configure execution parameters, like the Compute Engine machine type that should be used for the Dataflow worker VMs. For more information, see Pipeline options.

After you deploy pipelines, Dataflow manages the Compute Engine and Cloud Storage resources that are necessary to run your jobs. In addition, the following features of Dataflow help optimize the performance of the pipelines:

Parallelization: Dataflow automatically partitions your data and distributes your worker code to Compute Engine instances for parallel processing. For more information, see parallelization and distribution.
Optimization: Dataflow uses your pipeline code to create an execution graph that represents PCollection objects and transforms in the pipeline. It then optimizes the graph for the most efficient performance and resource usage. Dataflow also automatically optimizes potentially costly operations, such as data aggregations. For more information, see Fusion optimization and Combine optimization.
Automatic tuning: Dataflow dynamically optimizes jobs while they are running by using Horizontal Autoscaling, Vertical Autoscaling, and Dynamic Work Rebalancing.

https://cloud.google.com/architecture/framework/performance-optimization/analytics

135
Q

How do you optimize the performance of your BigQuery queries?

A

Optimize query design
Query performance depends on factors like the number of bytes that your queries read and write, and the volume of data that’s passed between slots. To optimize the performance of your queries in BigQuery, apply the best practices that are described in the following documentation:

Introduction to optimizing query performance
Managing input data and data sources
Optimizing communication between slots
Optimize query computation
Manage query outputs
Avoiding SQL anti-patterns
Define and use materialized views efficiently
To improve the performance of workloads that use common and repeated queries, you can use materialized views. There are limits to the number of materialized views that you can create. Don’t create a separate materialized view for every permutation of a query. Instead, define materialized views that you can use for multiple patterns of queries.
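As a back-of-envelope illustration of why bytes read matter: on-demand query cost scales with bytes scanned, so partition pruning pays directly. The per-TiB rate below is illustrative, not current pricing:

```python
# Sketch: on-demand query cost scales with bytes scanned.
# The per-TiB rate is an illustrative assumption; check current pricing.
RATE_PER_TIB = 5.00  # assumed USD per TiB scanned

def query_cost(bytes_scanned):
    return bytes_scanned / 2**40 * RATE_PER_TIB

full_scan = query_cost(10 * 2**40)   # 10 TiB table, no partition filter
pruned = query_cost(0.5 * 2**40)     # partition filter prunes 95% of it

print(full_scan, pruned)  # 50.0 2.5
```

The same arithmetic motivates materialized views: a query served from a small precomputed view scans a fraction of the base table's bytes.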

https://cloud.google.com/architecture/framework/performance-optimization/analytics

136
Q

How do you manage capacity and quota in Google Cloud?

A

Manage capacity and quota
Unlike traditional on-premises environments, where you must plan and purchase capacity up front, when you use Google Cloud you cede most capacity planning to Google. Using the cloud means you don’t have to provision and maintain idle resources when they aren’t needed. For example, you can create, scale up, and scale down VM instances as needed. Because you pay for what you use, you can optimize your spending, including excess capacity that you only need at peak traffic times. To help you save, Compute Engine provides machine type recommendations if it detects that you have underutilized VM instances that can be resized or deleted.
Evaluate your cloud capacity requirements
To manage your capacity effectively, you need to know your organization’s capacity requirements.
To evaluate your capacity requirements, start by identifying your top cloud workloads. Evaluate the average and peak utilizations of these workloads, and their current and future capacity needs.
Identify the teams who use these top workloads. Work with them to establish an internal demand-planning process. Use this process to understand their current and forecasted cloud resource needs.
View your infrastructure utilization metrics
To make capacity planning easier, gather and store historical data about your organization’s use of cloud resources.
Ensure you have visibility into infrastructure utilization metrics. For example, for top workloads, evaluate the following:
Average and peak utilization
Spikes in usage patterns
Seasonal spikes based on business requirements, such as holiday periods for retailers
How much over-provisioning is needed to prepare for peak events and rapidly handle potential traffic spikes
Ensure your organization has set up alerts that automatically notify you when you get close to quota and capacity limits.
Use Google’s monitoring tools to get insights on application usage and capacity. For example, you can define custom metrics with Monitoring. Use these custom metrics to define alerting trends. Monitoring also provides flexible dashboards and rich visualization tools to help identify emergent issues.
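The utilization figures above reduce to a simple headroom estimate: how much capacity, relative to average load, you must provision to absorb the observed peak plus a spike margin. A rough sketch; the samples and safety margin are invented, and real values would come from Monitoring:

```python
# Hourly CPU utilization samples for one workload (hypothetical values, 0.0-1.0).
samples = [0.42, 0.55, 0.61, 0.93, 0.58, 0.47, 0.88, 0.52]

average = sum(samples) / len(samples)
peak = max(samples)

# Over-provisioning headroom: cover the observed peak plus a safety margin
# for unanticipated traffic spikes.
safety_margin = 0.20
required_capacity = peak * (1 + safety_margin)
headroom_ratio = required_capacity / average

print(f"average={average:.2f} peak={peak:.2f} "
      f"provision {headroom_ratio:.1f}x the average load")
```

Feeding a calculation like this with historical Monitoring data is one way to turn raw utilization metrics into a concrete demand-planning number.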
Create a process for capacity planning
Ensure your quotas match your capacity requirements
Google Cloud uses quotas to restrict how much of a particular shared Google Cloud resource that you can use. Each quota represents a specific countable resource, such as API calls to a particular service, the number of load balancers used concurrently by your project, or the number of projects that you can create. For example, quotas ensure that a few customers or projects can’t monopolize CPU cores in a particular region or zone.
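The quota alerting described above amounts to a threshold comparison between current usage and each limit. A minimal sketch; the quota names, usage figures, and limits are hypothetical, and real numbers would come from the Service Usage API or Monitoring:

```python
# Hypothetical current usage vs. quota limits for one project.
quotas = {
    "compute.googleapis.com/cpus": {"usage": 92, "limit": 100},
    "compute.googleapis.com/in_use_addresses": {"usage": 12, "limit": 64},
}

ALERT_THRESHOLD = 0.8  # notify when usage exceeds 80% of the limit

def near_limit(quotas, threshold=ALERT_THRESHOLD):
    """Return the names of quotas whose usage crosses the alerting threshold."""
    return [name for name, q in quotas.items()
            if q["usage"] / q["limit"] >= threshold]

print(near_limit(quotas))  # only the CPU quota is close to its limit
```

In practice you would wire this kind of check into an alerting policy rather than polling by hand, so a quota increase can be requested before the limit is hit.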

https://cloud.google.com/architecture/framework/operational-excellence/manage-capacity-and-quota

137
Q

What best practices should an architect consider to support VM migration in a solution?

A

Evaluate built-in migration tools
Evaluate built-in migration tools to move your workloads from another cloud or from on-premises. For more information, see Migration to Google Cloud. Google Cloud offers tools and services to help you migrate your workloads and optimize for cost and performance. To receive a free migration cost assessment based on your current IT landscape, see Google Cloud Rapid Assessment & Migration Program.

Use virtual disk import for customized operating systems
To import customized supported operating systems, see Importing virtual disks. Sole-tenant nodes can help you meet your hardware bring-your-own-license requirements for per-core or per-processor licenses. For more information, see Bringing your own licenses.

https://cloud.google.com/architecture/framework/system-design/compute