Storage & Databases Flashcards

1
Q

What is the difference between Block Storage and Object Storage?

A
  • Block storage is fixed-size raw storage capacity
  • Block storage stores data in volumes that can be shared and mounted; examples include SAN, iSCSI, and local disks
  • Block storage is most common for applications and databases
  • Object storage does not require a guest OS; it is accessible via APIs
  • Object storage grows as needed
  • Object storage is redundant and can be replicated
  • Object storage suits unstructured data like music, images, and video
  • Object storage also suits log files, database dumps, large data sets, and archive files
2
Q

What are all the Google Cloud ‘Storage Options’?

A
  • Cloud Storage - unstructured, no mobile SDK
  • Cloud Storage for Firebase - unstructured, needs mobile SDK
  • BigQuery - structured, analytics, read-only
  • Cloud Bigtable - structured, analytics, low-latency updates
  • Cloud Datastore - structured, not analytics, non-relational, no mobile SDK
  • Cloud Firestore for Firebase - structured, not analytics, non-relational, needs mobile SDK
  • Cloud SQL - structured, not analytics, relational, no horizontal scaling
  • Cloud Spanner - structured, not analytics, relational, scales horizontally
3
Q

What are the three blocks that the Internet Assigned Numbers Authority (IANA) has reserved for private internets?

A
  1. 10.0.0.0 - 10.255.255.255 (10/8 prefix)
  2. 172.16.0.0 - 172.31.255.255 (172.16/12 prefix)
  3. 192.168.0.0 - 192.168.255.255 (192.168/16 prefix)
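The three blocks above can be checked mechanically. A small shell sketch (the function name is made up for this example) that classifies an IPv4 address against the RFC 1918 ranges:

```shell
# Classify an IPv4 address against the three RFC 1918 private blocks.
# Purely illustrative; is_private is a hypothetical helper name.
is_private() {
  oldIFS=$IFS; IFS=.
  set -- $1              # split the address into its four octets
  IFS=$oldIFS
  a=$1; b=$2
  if [ "$a" -eq 10 ]; then
    echo private                                        # 10/8
  elif [ "$a" -eq 172 ] && [ "$b" -ge 16 ] && [ "$b" -le 31 ]; then
    echo private                                        # 172.16/12
  elif [ "$a" -eq 192 ] && [ "$b" -eq 168 ]; then
    echo private                                        # 192.168/16
  else
    echo public
  fi
}
is_private 10.1.2.3     # private
is_private 172.20.0.1   # private
is_private 192.169.0.1  # public (172.16/12 and 192.168/16 don't cover it)
```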
4
Q

What is Persistent Disk, what are its features, and what is it good for / used for?

A
  • Fully managed block storage for VMs and containers
  • Good for Compute Engine and Kubernetes Engine
  • Good for snapshots and data backup
  • Used for VM disks
  • Used for sharing read-only data across multiple VMs

Features:

  • Durable, independent volumes, up to 64 TB each, online resizing
5
Q

What is Cloud Storage and what is it good for?

A
  • A scalable, fully-managed, highly reliable, and cost-efficient object / blob store.
  • Good for: Images, pictures, and videos, Objects and blobs, Unstructured data
  • Workloads: Storing and streaming multimedia, storing data for custom analytics pipelines
  • Archive, backup, and disaster recovery
6
Q

What are the different storage classes for Cloud Storage and what are they good for?

A
  1. Multi-Regional: across geographic regions
  2. Regional: ideal for compute, analytics, and ML workloads in a particular region
  3. Nearline: backups, low-cost, once a month access
  4. Coldline: archive, lowest-cost, once a year access
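These classes pair naturally with Object Lifecycle Management, which can demote objects to colder classes as they age. A hedged sketch (the bucket name is a placeholder) of a lifecycle config that applies the access patterns above:

```shell
# Lifecycle rules: move objects to NEARLINE after 30 days and to
# COLDLINE after 365 days. Bucket name "my-bucket" is hypothetical.
cat > lifecycle.json <<'EOF'
{
  "rule": [
    {"action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
     "condition": {"age": 30}},
    {"action": {"type": "SetStorageClass", "storageClass": "COLDLINE"},
     "condition": {"age": 365}}
  ]
}
EOF
# Apply it with: gsutil lifecycle set lifecycle.json gs://my-bucket
echo "wrote lifecycle.json"
```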
7
Q

What is Bigtable?

A
  • Massively scalable NoSQL
  • Single table that can scale to billions of rows and thousands of columns
  • Stores terabytes or petabytes of data
  • Ideal for single-keyed data with very low latency
  • Ideal data source for MapReduce operations
8
Q

What is Bigtable good for?

A

Cloud Bigtable is ideal for applications that need very high throughput and scalability for non-structured key/value data, where each value is typically no larger than 10 MB. Cloud Bigtable also excels as a storage engine for batch MapReduce operations, stream processing/analytics, and machine-learning applications.

You can use Cloud Bigtable to store and query all of the following types of data:

  • Marketing data such as purchase histories and customer preferences.
  • Financial data such as transaction histories, stock prices, and currency exchange rates.
  • Internet of Things data such as usage reports from energy meters and home appliances.
  • Time-series data such as CPU and memory usage over time for multiple servers.
9
Q

What is Cloud Spanner?

A
  • Fully managed, horizontally distributed relational database service
  • Handles massive transactional loads
  • Uses the Paxos algorithm to shard data across hundreds of data centers
  • Mission-critical relational database service with transactional consistency, global scale, and high availability
  • Cloud Spanner is ideal for relational, structured, and semi-structured data that requires high availability, strong consistency, and transactional reads and writes.
10
Q

What is Cloud Datastore?

A
  • highly scalable NoSQL ‘document’ database for your applications
  • non-relational
  • automatic sharding and replication
  • highly-available and durable, scales automatically to handle load
  • ACID, SQL-like queries, indexes, etc
  • RESTful interfaces
11
Q

What is Dataproc?

A

Dataproc is Google's managed Apache Spark and Hadoop service for running big data processing and analytics workloads.

12
Q

What is Apache Hadoop?

A

Apache Hadoop software is an open source framework that allows for the distributed storage and processing of large datasets across clusters of computers using simple programming models. Hadoop is designed to scale up from a single computer to thousands of clustered computers, with each machine offering local computation and storage. In this way, Hadoop can efficiently store and process large datasets ranging in size from gigabytes to petabytes of data.

13
Q

What are the different storage classes for GCP?

A

Standard, Nearline, Coldline, Archive
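Each class also carries a minimum storage duration that you are billed for even if the object is deleted earlier. A small lookup with the commonly documented figures (a study aid, not a billing reference):

```shell
# Minimum storage duration, in days, per Cloud Storage class.
# min_days is a hypothetical helper name for this example.
min_days() {
  case "$1" in
    standard) echo 0 ;;
    nearline) echo 30 ;;
    coldline) echo 90 ;;
    archive)  echo 365 ;;
    *)        echo "unknown class" >&2; return 1 ;;
  esac
}
min_days nearline   # 30
min_days archive    # 365
```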

14
Q

Why use nearline?

A

Nearline storage is a low-cost, highly durable storage service for storing infrequently accessed data. Nearline storage is a better choice than Standard storage in scenarios where slightly lower availability, a 30-day minimum storage duration, and costs for data access are acceptable trade-offs for lowered at-rest storage costs.

Nearline storage is ideal for data you plan to read or modify on average once per month or less. For example, if you want to continuously add files to Cloud Storage and plan to access those files once a month for analysis, Nearline storage is a great choice.

15
Q

Why use regional over multi regional?

A

Lower cost.
To comply with specific legal restrictions.
Data only needs to be read by VMs in a specific region.
Note the trade-off: multi-regional offers higher availability than regional (99.95% vs 99.9%).

16
Q

You cannot change a bucket to regional from multi-regional (T/F)

A

True - you cannot change a bucket from multi-regional to regional.

You permanently set a geographic location for storing your object data when you create a bucket.

You can select from the following location types:

A region is a specific geographic place, such as São Paulo.

A dual-region is a specific pair of regions, such as Tokyo and Osaka.

A multi-region is a large geographic area, such as the United States, that contains two or more geographic places.

All Cloud Storage data is redundant across at least two zones within at least one geographic place as soon as you upload it.

Additionally, objects stored in a multi-region or dual-region are geo-redundant. Objects that are geo-redundant are stored redundantly in at least two separate geographic places separated by at least 100 miles.

Default replication is designed to provide geo-redundancy for 99.9% of newly written objects within a target of one hour. Newly written objects include uploads, rewrites, copies, and compositions.

Turbo replication provides geo-redundancy for all newly written objects within a target of 15 minutes. Applicable only for dual-region buckets.

Cloud Storage stores object data in the selected location in accordance with the Service Specific Terms.

17
Q

How do you change the storage class of an object?

A

The storage class set for an object affects the object's availability and pricing model. To change it, rewrite the object with the new storage class, for example:

gsutil rewrite -s STORAGE_CLASS gs://BUCKET_NAME/OBJECT_NAME

Object Lifecycle Management can also change an object's storage class automatically based on rules you configure.

18
Q

How do you set the default storage class of a bucket?

What if you don’t set it?

A

When you create a bucket, you can specify a default storage class for the bucket. When you add objects to the bucket, they inherit this storage class unless explicitly set otherwise.

If you don’t specify a default storage class when you create a bucket, that bucket’s default storage class is set to Standard storage.

Changing the default storage class of a bucket does not affect any of the objects that already exist in the bucket.

19
Q

How do you change the default storage class for a bucket? (When you upload an object without specifying a storage class, the object is assigned the bucket's default storage class.)

A

Two ways - gcloud and gsutil.

Use the gcloud storage buckets update command:

gcloud storage buckets update gs://BUCKET_NAME --default-storage-class=STORAGE_CLASS

Or use the gsutil defstorageclass set command:

gsutil defstorageclass set STORAGE_CLASS gs://BUCKET_NAME

Example: gsutil defstorageclass set nearline gs://help_bucket

Where:

  • STORAGE_CLASS is the new storage class you want for your bucket. For example, nearline.
  • BUCKET_NAME is the name of the relevant bucket. For example, my-bucket.

The response looks like the following example:

Setting default storage class to “nearline” for bucket gs://my-bucket

20
Q

Can you share a disk between VMs

A

You can attach an SSD persistent disk in multi-writer mode to up to two N2 virtual machine (VM) instances simultaneously so that both VMs can read and write to the disk.

To enable multi-writer mode for new persistent disks, create a new persistent disk and specify the --multi-writer flag in the gcloud CLI or the multiWriter property in the Compute Engine API.

21
Q

What are some of the different storage options for Compute Engine instances?

A

Options include zonal and regional persistent disks, local SSDs, Cloud Storage buckets, and Filestore. If you are not sure which option to use, the most common solution is to add a persistent disk to your instance.

22
Q

When you configure a persistent disk, you can select one of the following disk types.

A
  • Standard persistent disks (pd-standard) are backed by standard hard disk drives (HDD).
  • Balanced persistent disks (pd-balanced) are backed by solid-state drives (SSD). They are an alternative to SSD persistent disks that balance performance and cost.
  • SSD persistent disks (pd-ssd) are backed by solid-state drives (SSD).
  • Extreme persistent disks (pd-extreme) are backed by solid-state drives (SSD). With consistently high performance for both random access workloads and bulk throughput, extreme persistent disks are designed for high-end database workloads. Unlike other disk types, you can provision your desired IOPS. For more information, see Extreme persistent disks.
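The four types above reduce to a simple mapping of disk type to backing medium, summarized here as a lookup (the helper name is made up; creating a real disk would use something like gcloud compute disks create my-disk --type=pd-ssd --size=100GB, with hypothetical names):

```shell
# Backing medium for each persistent disk type listed above.
# pd_backing is a hypothetical helper name for this example.
pd_backing() {
  case "$1" in
    pd-standard)                    echo HDD ;;
    pd-balanced|pd-ssd|pd-extreme)  echo SSD ;;
    *)                              echo unknown ;;
  esac
}
pd_backing pd-standard   # HDD
pd_backing pd-balanced   # SSD
```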
23
Q

How can you share a persistent disk across VMs?

A

Share a zonal persistent disk between VM instances

  1. Connect your instances to Cloud Storage.
  2. Connect your instances to Filestore.
  3. Create a network file server on Compute Engine.
  4. Create a persistent disk with multi-writer mode enabled and attach it to up to two instances.
24
Q

How do you create an HA file server with two GCE instances and regional disks?

A

Database HA configurations typically have at least two VM instances. Preferably these instances are part of one or more managed instance groups:

  • A primary VM instance in the primary zone
  • A standby VM instance in a secondary zone

A primary VM instance has at least two persistent disks: a boot disk, and a regional persistent disk. The regional persistent disk contains database data and any other mutable data that should be preserved to another zone in case of an outage.

A standby VM instance requires a separate boot disk to be able to recover from configuration-related outages, which could result from an operating system upgrade, for example. You cannot force attach a boot disk to another VM during a failover.

The primary and standby VM instances are configured to use a load balancer with the traffic directed to the primary VM based on health check signals. This configuration is also known as a hot standby.

25
Q

What is the difference between stopping and suspending an instance?

A

Please have a look at the documentation Suspending and resuming an instance:

> Suspending an instance differs from stopping an instance in the following ways:

  • Suspended instances preserve the guest OS memory, device state, and application state.
  • Google charges for the storage necessary to save instance memory.
  • You can only suspend an instance for up to 60 days. After 60 days, the instance is automatically moved to the TERMINATED state.

See also the article Stopping and starting an instance.

26
Q

What are the different states of an instance?

A
  • PROVISIONING: resources are allocated for the VM. The VM is not running yet.
  • STAGING: resources are acquired, and the VM is preparing for first boot.
  • RUNNING: the VM is booting up or running.
  • STOPPING: the VM is being stopped. You requested a stop, or a failure occurred. This is a temporary status after which the VM enters the TERMINATED status.
  • REPAIRING: the VM is being repaired. Repairing occurs when the VM encounters an internal error or the underlying machine is unavailable due to maintenance. During this time, the VM is unusable. If repair succeeds, the VM returns to one of the above states.
  • TERMINATED: the VM is stopped. You stopped the VM, or the VM encountered a failure. You can restart or delete the VM.
  • SUSPENDING: The VM is in the process of being suspended. You suspended the VM.
  • SUSPENDED: The VM is in a suspended state. You can resume the VM or delete it.
27
Q

What are the differences between the stopped, suspended, and reset states?

A

  • Stopped (TERMINATED): the VM is shut down. Memory and device state are discarded, but attached disks and instance configuration are preserved, and you stop paying for vCPUs and memory.
  • Suspended: the guest OS memory, device state, and application state are saved to storage; you pay for that storage instead of compute, and the VM can resume quickly.
  • Reset: the VM is forcibly power-cycled. Memory is wiped, but the VM remains in the RUNNING state and keeps all of its properties.
28
Q

Why would you want to stop a VM?

A

You might want to stop a VM for several reasons:

  • You no longer need the VM but want the resources that are attached to the VM—such as its internal IPs, MAC address, and persistent disk.
  • You don’t need to preserve the guest OS memory, device state, or application state.
  • You want to change certain properties of the VM that require you to first stop the VM.
29
Q

Why would you want to suspend a VM?

A

You might want to suspend a VM for the following reasons:

  • You want to stop paying for the core and memory costs of running a VM and pay the comparatively cheaper cost of storage to preserve the state of your VM instead.
  • You don’t need the VM at this time but want to be able to bring it back up quickly with its OS and application state where you left it.

You can resume a suspended VM when you need to use it again.

30
Q

Here are three things you should consider as you address storage needs:

A

First, consider data replication requirements.

Second, consider that GCP offers replication across zones, even within the same region.

Third, if storing in a single region poses a risk for disaster recovery, you should consider multiregional replication.

31
Q

Where do persistent disks attach to?

A

Persistent disks do not attach directly to a server. Rather, they attach over the network to the server hosting the virtual machine. By contrast, data on a locally attached disk (local SSD) is lost when the virtual machine is terminated, while the data on a persistent disk remains after the instance is terminated.

32
Q

Two types of persistent disks are available:

A

Solid-state drive (SSD) and hard disk drive (HDD). You select an SSD when you require high throughput and consistent performance across an environment.

HDDs have longer latencies but cost less. An HDD is the preferred choice for large data ingest and batch operations, where sensitivity to performance variability is lower.

33
Q

Persistent disks allow for several features, what are they?

A

First, a persistent disk can be mounted on multiple virtual machines in read-only mode, providing shared storage capacity. Second, snapshots of persistent disks can be created quickly, supporting rapid virtual machine duplication. When a disk is mounted to a single virtual machine instance, both read and write operations are permitted.

34
Q

What does Memorystore do?

A

If you are looking for storage that can hold user session data, maintain short-lived web and mobile applications data, or handle gaming data at speed and scale, Cloud Memorystore is the storage option to consider. Cloud Memorystore is a managed Redis service, which is an open source cache solution. Memorystore offers a fully managed in-memory data store with features such as scalability, a well-built security posture, and high availability, all managed by Google.

35
Q

Configuration varies depending on the engine you choose. When you create a Memorystore instance, you have two choices:

A

  • Redis: an in-memory data structure store that can be used as a database, cache, and message broker
  • Memcached: an in-memory key-value store intended exclusively for caching data
36
Q

Object Storage

A

Object storage is a strategy to manage and manipulate data storage as a distinct unit, called an object. Each object can be stored in a single storage unit instead of being embedded into files or folders.

37
Q

Google Cloud Platform has three broad categories of storage:

A

object, relational, and nonrelational. The database platforms vary in size, scale, and capability. Nonrelational databases consist of platforms that support NoSQL as well as alternative solutions developed by Google, such as Cloud Firestore and Firebase. These two platforms are mobile NoSQL solutions.

38
Q

What is the difference between automated and on-demand backups?

A

Backing Up a Database
Backups can be created at any time with GCP. For example, if you are about to complete a risky task, you’ll want to back up your database or storage system. For these occasions, you can utilize on-demand backups, as you do not have to wait for the backup window to arrive to create a copy. Unlike automated backups, on-demand backups do not automatically get deleted. Instead, you need to delete the backups. Failing to delete them yourself results in a hefty billing charge.

39
Q

Dataproc Deployment and Management

A

Dataproc is Google’s managed Apache Spark and Hadoop service. Like BigQuery, Dataproc is designed for big data applications. You should be aware that Spark is intended for analysis and machine learning, whereas Hadoop is appropriate for batching data, with emphasis on big data applications.

For the exam, you need to be familiar with creating Dataproc clusters and storage facilities as well as know how to submit jobs that run in those clusters.

40
Q

What is a dataset?

A

A dataset is contained within a specific project. Datasets are top-level containers that are used to organize and control access to your tables and views.

A table or view must belong to a dataset, so you need to create at least one dataset before loading data into BigQuery.


42
Q

Why would you use Cloud Spanner over Cloud SQL?

A

Cloud Spanner

Cloud Spanner is a good option when you plan to use large amounts of data (more than 10 TB) and need transactional consistency. It is also good if you want to use sharding for higher throughput and accessibility.

If you know or think that you might eventually need to horizontally scale your Google Cloud database, Cloud Spanner is a better option than Cloud SQL. If you start with Cloud SQL and later need to move to Cloud Spanner, be prepared to rewrite your application in addition to migrating your database.

43
Q

Why would you choose Firestore?

A

Cloud Firestore/Datastore

Cloud Firestore or Datastore are good options when you plan to focus on app development and need live synchronization and offline support.

If you need to store unstructured data in JSON documents, Cloud Datastore is the recommended option. This is in comparison to if you need to store structured data, in which case Cloud Spanner is recommended.

An additional factor to consider is whether you need atomicity, consistency, isolation, durability (ACID) compliance. If so, you need to choose Cloud Spanner since Cloud Datastore only offers atomic and durable transactions.

44
Q

When would you choose Bigtable over Cloud Spanner?

A

Cloud Bigtable is a good option if you are using large amounts of single key data. In particular, it is good for low-latency, high throughput workloads.

If you need to perform single-region analytics, Cloud Bigtable is preferred over Cloud Spanner. However, if you need multi-regional operations, Cloud Spanner is the recommended solution. For example, Cloud Bigtable is a good option for a time series app created for DevOps monitoring. Meanwhile, Cloud Spanner is the recommended option for an infrastructure monitoring platform designed for software as a service (SaaS) offering.

45
Q

What are the different types of NoSQL Databases

A
  • Document databases: Store information as documents (in formats such as JSON and XML). For example: Firestore
  • Key-value stores: Group associated data in collections with records that are identified with unique keys for easy retrieval. Key-value stores have just enough structure to mirror the value of relational databases while still preserving the benefits of NoSQL. For example: Bigtable, Memorystore
  • In-memory database: Purpose-built database that relies primarily on memory for data storage. These are designed to attain minimal response time by eliminating the need to access disks. They are ideal for applications that require microsecond response times and can have large spikes in traffic. For example: Memorystore
  • Wide-column databases: Use the tabular format but allow a wide variance in how data is named and formatted in each row, even in the same table. They have some basic structure while preserving a lot of flexibility. For example: Bigtable
  • Graph databases: Use graph structures to define the relationships between stored data points; useful for identifying patterns in unstructured and semi-structured information. For example: JanusGraph
46
Q

What is a node in Spanner, and how does data get stored in Spanner?

A
  • A node is a measure of compute in Spanner.
    • Node servers serve the read and write/commit transaction requests, but they don’t store the data.
      • Each node is replicated across three zones in the region.
  • The database storage is also replicated across the three zones.
    • Nodes in a zone are responsible for reading and writing to the storage in their zone.
    • The data is stored in Google’s underlying Colossus distributed replicated file system.
    • This provides huge advantages when it comes to redistributing load, as the data is not linked to individual nodes.
  • If a node or a zone fails, the database remains available, being served by the remaining nodes.
  • No manual intervention is needed to maintain availability.
47
Q

How is table data organized and stored in Spanner?

A

Each table in the database is stored sorted by primary key. Tables are divided into ranges of the primary key, and these divisions are known as splits. Each split is managed completely independently by different Spanner nodes. The number of splits for a table varies according to the amount of data: empty tables have only a single split. The splits are rebalanced dynamically depending on the amount of data and the load (dynamic resharding). But remember that the table and nodes are replicated across three zones - how does that work?

48
Q

Remember that Spanner tables and nodes are replicated across three zones - how does that work?

A

Everything is replicated across the three zones - the same goes for split management.

Split replicas are associated with a group (Paxos) that spans zones. Using Paxos consensus protocols, one of the zones is determined to be a leader.

The leader is responsible for managing write transactions for that split, while the other replicas can be used for reads. If a leader fails, the consensus is redetermined and a new leader may be chosen. For different splits, different zones can become leaders, thus distributing the leadership roles among all the Cloud Spanner compute nodes.

Nodes will likely be both leaders for some splits and replicas for others. Using this distributed mechanism of splits, leaders, and replicas, Cloud Spanner achieves both high availability and scalability.

49
Q

How does Spanner provide global consistency?

A

TrueTime is essential to make Spanner work as well as it does…so, what is it, and how does it help?

TrueTime is a way to synchronize clocks in all machines across multiple datacenters. The system uses a combination of GPS and atomic clocks, each correcting for the failure modes of the other. Combining the two sources (using multiple redundancy, of course) gives an accurate source of time for all Google applications.

50
Q

What is Bigtable?

A

Cloud Bigtable is a fully managed wide-column NoSQL database that scales to petabyte-scale. It’s optimized for low latency, large numbers of reads and writes, and maintaining performance at scale. It offers really low latency of the order of single-digit milliseconds. It is an ideal data source for time series and MapReduce-style operations. Bigtable supports the open-source HBase API standard to easily integrate with the Apache ecosystem including HBase, Beam, Hadoop and Spark. It also integrates with Google Cloud ecosystem including Memorystore, BigQuery, Dataproc, Dataflow and more.

51
Q

How does Bigtable scale?

A

How big is Bigtable? Bigtable has nearly 10 exabytes of data under management.

It delivers highly predictable performance that is linearly scalable. Throughput can be adjusted by adding or removing nodes - each node provides up to 10,000 operations per second (read and write). You can use Bigtable as the storage engine for large-scale, low-latency applications as well as throughput-intensive data processing and analytics. It offers high availability with an SLA of 99.9% for zonal instances. It's strongly consistent in a single cluster; replication between clusters adds eventual consistency. If you leverage Bigtable's multi-cluster routing across two clusters, the SLA increases to 99.99%, and if that routing policy is used across clusters in 3 different regions you get a 99.999% uptime SLA.
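The linear-scaling claim turns directly into a capacity estimate. Using the per-node figure quoted in this card (roughly 10,000 ops/sec per node; real throughput depends on workload and row sizes):

```shell
# Back-of-envelope Bigtable throughput: nodes x ~10,000 ops/sec each.
nodes=12
ops_per_node=10000
echo "$(( nodes * ops_per_node )) ops/sec"   # 120000 ops/sec
```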

52
Q

What type of database is Bigtable?

A

Bigtable is another NoSQL database, but unlike Datastore, it is a wide-column database, not a document database. Wide-column databases, as the name implies, store tables that can have a large number of columns. Not all rows need to use all columns, so in that way it is like Datastore—neither require a fixed schema to structure the data.

Bigtable is designed for petabyte-scale databases. Both operational databases, like storing IoT data, and analytic processing, like data science applications, can effectively use Bigtable. This database is designed to provide consistent, low-millisecond latency. Bigtable runs in clusters and scales horizontally.

53
Q

Know the four storage classes in Cloud Storage.

A

Regional, multiregional, nearline, and coldline are the four storage classes. Multiregional class replicates data across regions. Regional storage replicates data across zones. Nearline is designed for infrequent access, less than once per month. Coldline storage is designed for archival storage, with files being accessed less than once per year. Both nearline and coldline storage incur retrieval charges in addition to charges based on the size of data.

54
Q

What are the features of Dataproc?

A

  • Run your existing MapReduce jobs on immense amounts of data each day without any overhead worries.
  • Use the built-in monitoring system to transfer cluster data to your applications, get quick reports, and store data in Google's BigQuery.
  • Quickly launch and delete smaller clusters stored in blob storage, as and when required, using Spark (Spark SQL, PySpark, Spark shell).
  • Use Spark Machine Learning Libraries and Data Science tooling to customize and run classification algorithms.

55
Q

You can get your data into Google Cloud using any of four major approaches:

A

Cloud Storage transfer tools - These tools help you upload data directly from your computer into Google Cloud Storage. You would typically use this option for small transfers of up to a few TBs. They include the Google Cloud Console UI, the JSON API, and the gsutil command-line interface.

Storage Transfer Service - This service enables you to quickly import online data into Cloud Storage from other clouds, from on-premises sources, or from one bucket to another within Google Cloud. You can set up recurring transfer jobs to save time and resources, and it can scale to tens of Gbps.

Transfer Appliance - This is a great option if you want to migrate a large dataset and don't have lots of bandwidth to spare. Transfer Appliance enables seamless, secure, and speedy data transfer to Google Cloud. For example, a 1 PB data transfer can be completed in just over 40 days using the Transfer Appliance.

BigQuery Data Transfer Service - With this option your analytics team can lay the foundation for a BigQuery data warehouse without writing a single line of code. It automates data movement into BigQuery on a scheduled, managed basis.
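The Transfer Appliance figure above is easy to sanity-check: moving 1 PB in about 40 days corresponds to the sustained bandwidth computed below (decimal petabyte assumed):

```shell
# Bandwidth needed to move 1 PB in 40 days, in Gbps.
# Integer math truncates; the exact value is about 2.3 Gbps.
bits=8000000000000000   # 1 PB = 10^15 bytes = 8*10^15 bits
secs=3456000            # 40 days * 86400 seconds
echo "$(( bits / secs / 1000000000 )) Gbps"   # 2 Gbps (truncated)
```

This is why an appliance shipped by courier competes with all but the fastest dedicated links for petabyte-scale migrations.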

56
Q

What are the Cloud SQL Features?

A

Fully managed relational databases for MySQL, PostgreSQL, and SQL Server, with automated backups, read replicas, high availability configuration, automatic storage increase, and encryption at rest and in transit.
57
Q

What are Cloud SQL backups?

A

Backups are either automated (taken daily during a configurable backup window and deleted automatically per the retention policy) or on-demand (created at any time and kept until you delete them).
58
Q

What is High Availability SQL?

A

A high-availability Cloud SQL configuration has a primary instance in one zone and a standby instance in a second zone of the same region, kept in sync through synchronous replication. If the primary becomes unavailable, Cloud SQL fails over to the standby.
59
Q

What are Cloud SQL Read Replicas?

A

Read replicas are read-only copies of the primary instance. They offload read traffic from the primary and scale read throughput; they do not by themselves provide failover.
60
Q

What are the different variants of Read Replicas?

A

In-region read replicas, cross-region read replicas, external read replicas (replicating to a server outside Cloud SQL), and cascading read replicas (replicas of replicas).
61
Q

How do you configure IP Addresses for a Cloud SQL Application with no proxy?

A

Without the Cloud SQL Auth Proxy, you either give the instance a public IP and add your clients' addresses to the authorized networks list, or give it a private IP so clients connect over the VPC network.
62
Q

Several factors influence the choice of storage system. What are they?

A

Is the data structured or unstructured?
How frequently will the data be accessed?
What is the read/write pattern?

What is the frequency of reads versus writes?

What are the consistency requirements?

Can Google managed keys be used for encryption, or do you need to deploy customer managed keys?

What are the most common query patterns?

Does your application require mobile support, such as synchronization?

For structured data, is the workload analytic or transactional?

Does your application require low latency writes?
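The checklist above can be sketched as a simple decision function. This is a minimal illustration that mirrors the “Storage Options” flashcard earlier in this deck, not an official decision tree; real choices also depend on cost, consistency requirements, mobile SDK needs, and query patterns:

```python
def pick_storage(structured, analytics=False, relational=False,
                 needs_scaling=False, low_latency_writes=False):
    """Rough mapping of the checklist answers to GCP storage options.

    Follows the option matrix from the 'Storage Options' card;
    a sketch only, not a complete selection procedure.
    """
    if not structured:
        return "Cloud Storage"          # unstructured: objects, media, dumps
    if analytics:
        # Analytic workloads: Bigtable for low-latency updates,
        # BigQuery for read-mostly interactive querying.
        return "Cloud Bigtable" if low_latency_writes else "BigQuery"
    if relational:
        # Transactional relational data: Spanner when horizontal
        # scaling is needed, Cloud SQL otherwise.
        return "Cloud Spanner" if needs_scaling else "Cloud SQL"
    return "Cloud Datastore"            # structured, non-relational

pick_storage(structured=False)                      # 'Cloud Storage'
pick_storage(structured=True, analytics=True)       # 'BigQuery'
pick_storage(structured=True, relational=True,
             needs_scaling=True)                    # 'Cloud Spanner'
```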

63
Q

What does the FUSE project allow?

A

Cloud Storage FUSE

Filesystem in Userspace (FUSE) is a framework for exposing a filesystem to the Linux kernel. FUSE uses a stand-alone application that runs on Linux and provides a filesystem API, along with an adapter for implementing filesystem functions in the underlying storage system. Cloud Storage FUSE (gcsfuse) uses this framework to let you mount Cloud Storage buckets as filesystems on Linux machines.

64
Q

What is the difference between consistency and atomicity?

A

Atomicity
Atomic operations ensure that all steps in a transaction complete or no steps take effect. For example, a sales transaction might include reducing the number of products available in inventory and charging a customer’s credit card. If there isn’t sufficient inventory, the transaction will fail, and the customer’s credit card will not be charged.
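The sales-transaction example can be demonstrated with any transactional database. Here is a minimal sketch using Python’s built-in sqlite3 module; the table and column names are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inventory (product TEXT PRIMARY KEY, qty INTEGER)")
conn.execute("CREATE TABLE charges (product TEXT, amount INTEGER)")
conn.execute("INSERT INTO inventory VALUES ('widget', 1)")
conn.commit()

def buy(product, amount):
    """Decrement inventory and charge the card as one atomic unit."""
    try:
        with conn:  # opens a transaction; rolls back on exception
            cur = conn.execute(
                "UPDATE inventory SET qty = qty - 1 "
                "WHERE product = ? AND qty > 0", (product,))
            if cur.rowcount == 0:
                raise ValueError("insufficient inventory")
            conn.execute("INSERT INTO charges VALUES (?, ?)",
                         (product, amount))
        return True
    except ValueError:
        return False

buy("widget", 100)   # succeeds: inventory 1 -> 0, card charged
buy("widget", 100)   # fails: no stock, and NO charge is recorded
```

The second call illustrates atomicity: because the inventory update fails, the whole transaction rolls back and the charge never takes effect.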

Consistency, specifically transactional consistency, is a property that guarantees that when a transaction executes, the database is left in a state that complies with constraints, such as uniqueness requirements and referential integrity, which ensures foreign keys reference a valid primary key. When a database is distributed, consistency also refers to querying data from different servers in a database cluster and receiving the same data from each.

For example, some NoSQL databases replicate data on multiple servers to improve availability. If there is an update to a record, each copy must be updated. In the time between the first and last copies being updated, it is possible to have two instances of the same query receive different results. This is considered an inconsistent read. Eventually, all replicas will be updated, so this is referred to as eventual consistency.
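The inconsistent-read scenario can be simulated in a few lines. This is a toy model of replication lag, not any particular NoSQL engine:

```python
# Toy model of eventual consistency: two replicas of one record.
replicas = [{"stock": 5}, {"stock": 5}]

def write(key, value):
    """Updates propagate to replicas one at a time, not atomically."""
    for r in replicas:
        r[key] = value
        yield  # replication lag: later replicas still hold the old value

def read(key, replica):
    return replicas[replica][key]

propagation = write("stock", 4)
next(propagation)                    # only replica 0 updated so far
inconsistent = read("stock", 0) != read("stock", 1)  # stale read observed
for _ in propagation:                # replication finishes
    pass
consistent = read("stock", 0) == read("stock", 1)    # eventually consistent
```

Between the first and last replica updates, two clients querying different replicas see different values; once propagation completes, all reads agree.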

65
Q

What is the durability property?

A

Once a transaction commits, its changes are permanently stored.

The durability property ensures that once a transaction is executed, the state of the database will always reflect or account for that change. This property usually requires databases to write data to persistent storage—even when the data is also stored in memory—so that in the event of a crash, the effects of the transactions are not lost.

Google Cloud Platform offers two managed relational database services: Cloud SQL and Cloud Spanner. Each is designed for distinct use cases. In addition to the two managed services, GCP customers can run their own databases on GCP virtual machines.

66
Q

What are the IAM Roles for BigQuery?

A

BigQuery is integrated with Cloud IAM, which has several predefined roles for BigQuery. Access can be granted at the organization, project, dataset, and table/view levels. When access is provided at the organization or project level, that access applies to all of a project’s BigQuery resources. Datasets are children of projects in the resource hierarchy, so access granted at the dataset level applies only to that dataset and its tables and views. You can also assign access at the table and view levels.

roles/bigquery.dataViewer: This role allows a user to list projects and tables and get table data and metadata.
roles/bigquery.dataEditor: This has the same permissions as dataViewer, plus permissions to create and modify tables and datasets.
roles/bigquery.dataOwner: This role is similar to dataEditor, but it can also create, modify, and delete datasets.
roles/bigquery.metadataViewer: This role gives permissions to list tables, projects, and datasets.
roles/bigquery.user: The user role gives permissions to list projects and tables, view metadata, create datasets, and create jobs.
roles/bigquery.jobUser: A jobUser can list projects and create jobs and queries.
roles/bigquery.admin: An admin can perform all operations on BigQuery resources.

67
Q

What is cloud datastore?

A

Cloud Datastore is a managed document database, which is a kind of NoSQL database that uses a flexible JSON-like data structure called a document.

The terminology used to describe the structure of a document is different than that for relational databases. A table in a relational database corresponds to a kind in Cloud Datastore, while a row is referred to as an entity. The equivalent of a relational column is a property, and a primary key in relational databases is simply called the key in Cloud Datastore.

Cloud Datastore is fully managed. GCP manages all data management operations including distributing data to maintain performance. The flexible data structure makes Cloud Datastore a good choice for applications like product catalogs or user profiles.
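The terminology mapping above, plus a sketch of what a Datastore-style entity looks like, can be shown in plain Python. This is illustrative data only, not the actual client library; the kind, key, and property names are invented:

```python
# Relational term -> Cloud Datastore term (from the card above)
TERMS = {"table": "kind", "row": "entity",
         "column": "property", "primary key": "key"}

# A product-catalog entity sketched as plain data: one entity of
# kind 'Product', identified by a key, with schemaless properties.
entity = {
    "kind": "Product",
    "key": "sku-1234",                  # the entity's key
    "properties": {                     # flexible, JSON-like properties
        "name": "Desk lamp",
        "price": 29.99,
        "tags": ["lighting", "office"], # nested/array values are allowed
    },
}
```

The flexible property set is what makes document databases suit catalogs and user profiles: two entities of the same kind need not share the same properties.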

68
Q

What can you adjust with Networking and Latency to help prepare a better storage system?

A

Network latency is a consideration when designing storage systems, particularly when data is transmitted between regions within GCP or outside GCP to globally distributed devices. Three ways of addressing network latency concerns are as follows:

Replicating data in multiple regions and across continents
Distributing data using Cloud CDN
Using Google Cloud Premium Network tier

69
Q

Understand the major types of storage systems available in GCP.

A

These include object storage, persistent local and attached storage, and relational and NoSQL databases. Object storage is often used to store unstructured data, archived data, and files that are treated as atomic units. Persistent local and attached storage provides storage to virtual machines. Relational databases are used for structured data, while NoSQL databases are used when it helps to have flexible schemas.

70
Q

Cloud Filestore is a network-attached storage service that provides a filesystem that is accessible from Compute Engine and Kubernetes Engine.

A

Cloud Filestore is designed to provide low latency and high IOPS, so it can be used for databases and other performance-sensitive services.

71
Q

Cloud Firestore and Cloud Datastore are what?

A

Cloud Firestore and Cloud Datastore are managed document databases, which are a kind of NoSQL database that uses a flexible JSON-like data structure called a document. Cloud Firestore and Cloud Datastore are fully managed. GCP manages all data management operations, including distributing data to maintain performance. They are designed so that the response time to return query results is a function of the size of the data returned and not the size of the dataset that is queried. The flexible data structure makes Cloud Firestore and Cloud Datastore good choices for applications like product catalogs or user profiles. Cloud Firestore is the next generation of GCP-managed document database.

72
Q

What are the benefits of alias IP addressing?

A
73
Q

How is routing in GCP Managed?

A
74
Q

What are the two main types of routes?

A

System-Generated Routes

  • Default route
    Destination 0.0.0.0/0 (IPv4) or ::/0 (IPv6), with the default-internet-gateway next hop
    Applies to the whole VPC network
    Can be removed or replaced
  • Subnet route
    Created automatically for each subnet IP address range in the VPC network
    Forwards packets to VMs and internal load balancers
    Created, updated, and removed automatically by Google Cloud when you create, modify, or delete a subnet or a secondary IP address range of a subnet

Custom Routes

  • Static route
    Supports various destinations and forwards packets to a static route next hop, such as:
    Next hop instances
    Internal TCP/UDP load balancer next hops
    Classic VPN tunnel next hops
  • Dynamic route
    Destinations must not conflict with subnet routes or static routes

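Route selection follows longest-prefix match: the most specific route containing the destination wins, with the 0.0.0.0/0 default route as the fallback. A sketch using Python’s ipaddress module; the route table below is made up for illustration:

```python
import ipaddress

# Hypothetical VPC route table: (destination CIDR, next hop)
routes = [
    ("0.0.0.0/0",   "default-internet-gateway"),  # system default route
    ("10.0.1.0/24", "subnet-route"),              # auto-created subnet route
    ("10.0.0.0/8",  "vpn-tunnel"),                # custom static route
]

def next_hop(dest_ip):
    """Pick the most specific (longest-prefix) matching route."""
    ip = ipaddress.ip_address(dest_ip)
    matches = [(ipaddress.ip_network(cidr), hop) for cidr, hop in routes
               if ip in ipaddress.ip_network(cidr)]
    return max(matches, key=lambda m: m[0].prefixlen)[1]

next_hop("10.0.1.17")   # 'subnet-route'  (/24 beats /8 and /0)
next_hop("10.9.9.9")    # 'vpn-tunnel'    (/8 beats /0)
next_hop("8.8.8.8")     # 'default-internet-gateway'
```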
75
Q

What are the firewall rule components?

A
76
Q

How is Global Load Balancing implemented?

A
77
Q

How is regional load balancing implemented?

A
78
Q

When do you use internal load balancing?

A
79
Q

What is the architecture for a load balancer - External

A
80
Q

What is TCP Proxy Load Balancing?

A
81
Q

What is SSL Proxy Load balancing?

A
82
Q

What is Google Internal TCP/UDP Load balancing?

A
83
Q

How does Autoscaling work?

A
84
Q

What are the autoscaling policy requirements?

A
85
Q

What are the 3 types of accounts for IAM?

A
86
Q

What is the difference between service accounts and user accounts?

A
87
Q

What are the IAM best practices?

A

Common uses of labels

We do not recommend creating large numbers of unique labels, such as for timestamps or individual values for every API call. Here are some common use cases for labels:

Team or cost center labels: Add labels based on team or cost center to distinguish instances owned by different teams (for example, team:research and team:analytics). You can use this type of label for cost accounting or budgeting.

Component labels: For example, component:redis, component:frontend, component:ingest, and component:dashboard.

Environment or stage labels: For example, environment:production and environment:test.

State labels: For example, state:active, state:readytodelete, and state:archive.

Virtual machine labels: A label can be attached to a virtual machine. Virtual machine tags that you defined in the past appear as a label without a value.

Use labels on Compute Engine

You can apply labels to the following Compute Engine resources:

  • Virtual machine (VM) instances
  • Images
  • Persistent disks
  • Persistent disk snapshots

You can also use labels on related Google Cloud components.

88
Q

What is the Big Query Architecture

A
89
Q

What role allows you to manage storage buckets and objects altogether?

A

roles/storage.admin

Grants full control of buckets and objects.

When applied to an individual bucket, control applies only to the specified bucket and objects within the bucket.

90
Q

What are the requirements to estimate pricing on a Flexible GAE?

A
91
Q

What is BigQuery?

A

BigQuery is a fully managed big data tool for companies that need a cloud-based interactive query service for massive datasets.

BigQuery is not a database; it’s a query service.

BigQuery supports SQL queries, which makes it quite user-friendly. It can be accessed from the Console, the CLI, or an SDK. You can query billions of rows; queries take only seconds to write and seconds to return.

You can also use its REST API and get your work done by sending a JSON request.

Let’s understand with the help of an example. Suppose you are a data analyst and you need to analyze tons of data. If you choose a tool like traditional MySQL, you need to have infrastructure ready that can store this huge data. Designing this infrastructure is itself a difficult task, because you have to figure out RAM size, CPU type, and other configurations.

With BigQuery, you can focus on analysis rather than working on infrastructure. The hardware is completely abstracted away.

BigQuery is mainly for big data. You shouldn’t confuse it with an OLTP (Online Transaction Processing) database.

92
Q

What are the main BigQuery components?

A

Datasets: Datasets hold one or more tables of data.

Tables: Tables are row-column structures that hold actual data.

Jobs: Operations that you perform on the data, such as loading data, running queries, or exporting data.

93
Q

What is Cloud Spanner?

A

Cloud Spanner is used to handle large amounts of data. It provides petabytes of capacity. Main use cases include financial and inventory applications.

Cloud Spanner can be considered a replacement for traditional SQL databases. For example, in a traditional database environment, when query response times approach or exceed acceptable thresholds due to an increase in the number of users or queries, you bring response times back down through manual intervention.

Cloud Spanner scales horizontally with minimal intervention: you scale out simply by increasing the number of nodes.

Scaling horizontally means thousands of small machines do the work together for you; scaling vertically means one big machine does all the work for you. Simply put, horizontal scaling means adding more machines to your resource pool, whereas vertical scaling means adding more power to an existing machine.

Example use cases include cloud-based POS solutions for retailers, restaurateurs, and eCommerce merchants around the globe.

Spanner is considerably more expensive than Cloud SQL.

With Cloud SQL you select a machine type, disk type and size, region, and zone; you are restricted to a single server.

Spanner is not for general SQL needs; it is mainly used for massive-scale applications.

94
Q

What is BigTable?

A

Bigtable is a distributed database that runs on clusters and is designed for applications with massive amounts of data. It is mainly designed for unstructured data and scales horizontally.

Cloud Bigtable is not a relational database system. It stores data in key-value pairs.
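A sketch of Bigtable’s data model: each row key maps to column families, each holding qualifier-to-value cells, and rows are kept sorted by key so key-range scans are efficient. This is an in-memory toy with invented names, not the Bigtable client library:

```python
# Toy model of a Bigtable table: row key -> column family -> qualifier -> value.
table = {}

def put(row_key, family, qualifier, value):
    """Write one cell, creating the row and family if needed."""
    table.setdefault(row_key, {}).setdefault(family, {})[qualifier] = value

put("user#1001", "profile", "name", "Ada")
put("user#1001", "profile", "email", "ada@example.com")
put("user#1002", "profile", "name", "Grace")

def scan(prefix):
    """Prefix scan over sorted row keys, the typical Bigtable access pattern."""
    return [k for k in sorted(table) if k.startswith(prefix)]

scan("user#")   # ['user#1001', 'user#1002']
```

Designing a good row key (here, an entity type plus an ID) matters because it determines which rows are adjacent and therefore cheap to scan together.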

95
Q

When to choose BigTable and BigQuery?

A

For interactive querying in an online analytical processing (OLAP) system, use BigQuery.

BigQuery is a data warehouse application, and it stores data in structured tables.

BigQuery supports SQL queries, whereas Bigtable doesn’t.

Bigtable is not a recommended solution for small volumes of data (< 1 TB).

Bigtable is characteristic of a NoSQL system, whereas BigQuery is somewhat of a hybrid: it is mainly used for SQL queries, but it supports NoSQL-style workloads as well.

For example, if you want to do analytics or business intelligence on data collected from different sources into one location, use BigQuery.

Simply put: database workloads - Bigtable; analytics - BigQuery.

96
Q

What are the location and replication types for Google Cloud Storage?

A

You can select from the following location types:

A region is a specific geographic place, such as São Paulo.

A dual-region is a specific pair of regions, such as Tokyo and Osaka.

A multi-region is a large geographic area, such as the United States, that contains two or more geographic places.

All Cloud Storage data is redundant across at least two zones within at least one geographic place as soon as you upload it.

Additionally, objects stored in a multi-region or dual-region are geo-redundant. Objects that are geo-redundant are stored redundantly in at least two separate geographic places separated by at least 100 miles.

Default replication is designed to provide geo-redundancy for 99.9% of newly written objects within a target of one hour. Newly written objects include uploads, rewrites, copies, and compositions.

Turbo replication, available only for dual-region buckets, provides geo-redundancy for all newly written objects within a target of 15 minutes.

Cloud Storage stores object data in the selected location in accordance with the Service Specific Terms.

97
Q

What are the recommendations for the different storage locations?

A
98
Q

When you consider location types for Google Cloud, what should you consider?

A
99
Q

Draw a TCP Proxy - External LB diagram with two regions configured.

A