Storage & Databases Flashcards
What is the difference between Block Storage and Object Storage?
- Block storage is fixed-size raw storage capacity
- Block storage stores data in volumes that can be shared and mounted; SAN, iSCSI and local disks
- Block storage is most common for applications and databases
- Object storage does not require a guest OS; it is accessible via APIs
- Object storage grows as needed
- Object storage is redundant and can be replicated
- Unstructured data like music, images, and video
- Log files and database dumps
- Large data sets
- Archive files
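A minimal CLI sketch of the difference (bucket, disk, and VM names are hypothetical): object storage is reached through the Cloud Storage API with nothing to mount, while block storage is a raw volume you attach to a VM and format inside the guest OS.
# Object storage: no guest OS needed, accessed via the API/CLI
gsutil mb gs://example-media-bucket
gsutil cp song.mp3 gs://example-media-bucket/
# Block storage: a raw volume attached to a VM, then formatted and mounted by the guest OS
gcloud compute disks create example-data-disk --size=200GB --zone=us-central1-a
gcloud compute instances attach-disk example-vm --disk=example-data-disk --zone=us-central1-a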
What are all the Google Cloud ‘Storage Options’?
- Cloud Storage - not structured, no mobile SDK
- Cloud Storage for Firebase - not structured, needs mobile SDK
- BigQuery - structured, analytics, read-only
- Cloud Bigtable - structured, analytics, updates with low latency
- Cloud Datastore - structured, not analytics, non-relational, no mobile SDK
- Cloud Firestore for Firebase - structured, not analytics, non-relational, needs mobile SDK
- Cloud SQL - structured, not analytics, relational, no horizontal scaling
- Cloud Spanner - structured, not analytics, relational, needs horizontal scaling
What are the three blocks that the Internet Assigned Numbers Authority (IANA) has reserved for private internets?
- 10.0.0.0 - 10.255.255.255 (10/8 prefix)
- 172.16.0.0 - 172.31.255.255 (172.16/12 prefix)
- 192.168.0.0 - 192.168.255.255 (192.168/16 prefix)
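For illustration, a subnet can be carved from one of these private blocks when you create a custom-mode VPC (network and subnet names are hypothetical):
gcloud compute networks create example-net --subnet-mode=custom
gcloud compute networks subnets create example-subnet --network=example-net --region=us-central1 --range=10.0.0.0/16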
What is Persistent Disk, its features, and what is it good/used for?
- Fully managed block storage for VMs and containers
- Good for Compute Engine and Kubernetes Engine
- Good for snapshots of data backup
- Used for VM disks
- Used for sharing read-only data across multiple VMs
Features:
- Durable, independent volumes; up to 64 TB per disk; online resizing
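A short sketch of the disk lifecycle with gcloud (disk and snapshot names are hypothetical):
gcloud compute disks create example-disk --size=500GB --type=pd-ssd --zone=us-central1-a
# Online resize: disks can only grow, never shrink
gcloud compute disks resize example-disk --size=1000GB --zone=us-central1-a
# Snapshot for backup
gcloud compute disks snapshot example-disk --snapshot-names=example-snap --zone=us-central1-a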
What is Cloud Storage and what is it good for?
- A scalable, fully-managed, highly reliable, and cost-efficient object / blob store.
- Good for: Images, pictures, and videos, Objects and blobs, Unstructured data
- Workloads: storing and streaming multimedia, storing data for custom analytics pipelines
- Archive, backup, and disaster recovery
What are the different storage classes for Cloud Storage and what are they good for?
- Multi-Regional: geo-redundant storage for frequently accessed data, replicated across geographic regions
- Regional: ideal for compute, analytics, and ML workloads in a particular region
- Nearline: backups, low-cost, once a month access
- Coldline: archive, lowest-cost, once a year access
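As a sketch, the class can be set at bucket creation with gsutil mb (bucket names are hypothetical; note that Multi-Regional/Regional were later folded into the Standard class):
gsutil mb -c regional -l us-central1 gs://example-analytics-bucket
gsutil mb -c nearline -l us gs://example-backup-bucket
gsutil mb -c coldline -l us gs://example-archive-bucket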
What is Bigtable?
- Massively scalable NoSQL
- Single table that can scale to billions of rows and thousands of columns
- Stores terabytes or petabytes of data
- Ideal for single-keyed data with very low latency
- Ideal data source for MapReduce operations
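A minimal sketch with the cbt CLI (project, instance, table, and row key are hypothetical; cbt is installed as a gcloud SDK component):
cbt -project example-project -instance example-instance createtable sensor-data
cbt -project example-project -instance example-instance createfamily sensor-data stats
cbt -project example-project -instance example-instance set sensor-data device1#1700000000 stats:cpu=0.75
cbt -project example-project -instance example-instance read sensor-data prefix=device1
Note the row key design (device ID plus timestamp), which suits Bigtable's single-keyed, time-series access pattern.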
What is Bigtable good for?
Cloud Bigtable is ideal for applications that need very high throughput and scalability for non-structured key/value data, where each value is typically no larger than 10 MB. Cloud Bigtable also excels as a storage engine for batch MapReduce operations, stream processing/analytics, and machine-learning applications.
You can use Cloud Bigtable to store and query all of the following types of data:
- Marketing data such as purchase histories and customer preferences.
- Financial data such as transaction histories, stock prices, and currency exchange rates.
- Internet of Things data such as usage reports from energy meters and home appliances.
- Time-series data such as CPU and memory usage over time for multiple servers.
What is Cloud Spanner?
- Fully managed, horizontally scalable, distributed relational database service
- Handles massive transactional loads
- Uses the Paxos algorithm to synchronously replicate shards of data across data centers
- Mission-critical relational database service with transactional consistency, global scale, and high availability
- Cloud Spanner is ideal for relational, structured, and semi-structured data that requires high availability, strong consistency, and transactional reads and writes.
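A minimal provisioning sketch (instance, database, and table names are hypothetical):
gcloud spanner instances create example-instance --config=regional-us-central1 --description="Example" --nodes=1
gcloud spanner databases create example-db --instance=example-instance
gcloud spanner databases ddl update example-db --instance=example-instance --ddl='CREATE TABLE Users (UserId INT64 NOT NULL, Name STRING(100)) PRIMARY KEY (UserId)'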
What is Cloud Datastore?
- highly scalable NoSQL ‘document’ database for your applications
- non-relational
- automatic sharding and replication
- highly-available and durable, scales automatically to handle load
- ACID transactions, SQL-like queries (GQL), indexes, etc.
- RESTful interfaces
What is Dataproc?
Dataproc is Google's managed service for running Apache Hadoop and Apache Spark workloads.
Apache Hadoop software is an open source framework that allows for the distributed storage and processing of large datasets across clusters of computers using simple programming models. Hadoop is designed to scale from a single computer to thousands of clustered machines, each offering local computation and storage. In this way, Hadoop can efficiently store and process large datasets ranging in size from gigabytes to petabytes.
What are the different storage classes for GCP?
Standard, Nearline, Coldline, Archive
Why use nearline?
Nearline storage is a low-cost, highly durable storage service for storing infrequently accessed data. Nearline storage is a better choice than Standard storage in scenarios where slightly lower availability, a 30-day minimum storage duration, and costs for data access are acceptable trade-offs for lowered at-rest storage costs.
Nearline storage is ideal for data you plan to read or modify on average once per month or less. For example, if you want to continuously add files to Cloud Storage and plan to access those files once a month for analysis, Nearline storage is a great choice.
Why use regional over multi-regional?
- Lower cost.
- To comply with specific legal restrictions.
- Data only needs to be read by a specific VM in a region.
- Trade-off: multi-regional offers higher availability than regional (historically a 99.95% vs. 99.9% availability SLA).
T/F: You cannot change a bucket from multi-regional to regional.
True.
You permanently set a geographic location for storing your object data when you create a bucket.
- You cannot change a bucket’s location after it’s created, but you can move your data to a bucket in a different location.
You can select from the following location types:
A region is a specific geographic place, such as São Paulo.
A dual-region is a specific pair of regions, such as Tokyo and Osaka.
A multi-region is a large geographic area, such as the United States, that contains two or more geographic places.
All Cloud Storage data is redundant across at least two zones within at least one geographic place as soon as you upload it.
Additionally, objects stored in a multi-region or dual-region are geo-redundant. Objects that are geo-redundant are stored redundantly in at least two separate geographic places separated by at least 100 miles.
Default replication is designed to provide geo-redundancy for 99.9% of newly written objects within a target of one hour. Newly written objects include uploads, rewrites, copies, and compositions.
Turbo replication provides geo-redundancy for all newly written objects within a target of 15 minutes. Applicable only for dual-region buckets.
Cloud Storage stores object data in the selected location in accordance with the Service Specific Terms.
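A hedged sketch of creating a dual-region bucket and enabling turbo replication, using newer gsutil versions (bucket name is hypothetical; asia1 is the Tokyo/Osaka dual-region pair):
gsutil mb -l asia1 gs://example-dr-bucket
gsutil rpo set ASYNC_TURBO gs://example-dr-bucket
gsutil rpo get gs://example-dr-bucket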
How do you change the storage class of an object?
The storage class set for an object affects the object’s availability and pricing model.
- You can change the storage class of an existing object either by rewriting the object or by using Object Lifecycle Management.
- gsutil rewrite -s nearline -r gs://bucket
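For the lifecycle route, a minimal sketch (bucket name hypothetical): a JSON rule that moves objects to Nearline after 30 days, applied with gsutil lifecycle set.
cat > lifecycle.json <<'EOF'
{
  "rule": [
    {
      "action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
      "condition": {"age": 30}
    }
  ]
}
EOF
gsutil lifecycle set lifecycle.json gs://example-bucket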
How do you set the default storage class of a bucket?
What if you don’t set it?
When you create a bucket, you can specify a default storage class for the bucket. When you add objects to the bucket, they inherit this storage class unless explicitly set otherwise.
If you don’t specify a default storage class when you create a bucket, that bucket’s default storage class is set to Standard storage.
Changing the default storage class of a bucket does not affect any of the objects that already exist in the bucket.
When you upload an object to the bucket without specifying a storage class, the object is assigned the bucket's default storage class. There are two ways to change a bucket's default storage class, gcloud and gsutil.
Use the gcloud storage buckets update command:
gcloud storage buckets update gs://BUCKET_NAME --default-storage-class=STORAGE_CLASS
Use the gsutil defstorageclass set command:
gsutil defstorageclass set STORAGE_CLASS gs://BUCKET_NAME
Example: gsutil defstorageclass set nearline gs://help_bucket
Where:
- STORAGE_CLASS is the new storage class you want for your bucket. For example, nearline.
- BUCKET_NAME is the name of the relevant bucket. For example, my-bucket.
The response looks like the following example:
Setting default storage class to “nearline” for bucket gs://my-bucket
Can you share a disk between VMs?
You can attach an SSD persistent disk in multi-writer mode to up to two N2 virtual machine (VM) instances simultaneously so that both VMs can read and write to the disk.
To enable multi-writer mode for new persistent disks, create a new persistent disk and specify the --multi-writer flag in the gcloud CLI or the multiWriter property in the Compute Engine API.
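A sketch of the workflow (disk and VM names are hypothetical; depending on your SDK version the flag may require the gcloud beta track):
gcloud compute disks create example-shared-disk --size=100GB --type=pd-ssd --multi-writer --zone=us-central1-a
gcloud compute instances attach-disk example-vm-1 --disk=example-shared-disk --zone=us-central1-a
gcloud compute instances attach-disk example-vm-2 --disk=example-shared-disk --zone=us-central1-a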
What are some of the different storage options compute engine instances?
- Zonal persistent disk: Efficient, reliable block storage.
- Regional persistent disk: Regional block storage replicated in two zones.
- Local SSD: High performance, transient, local block storage.
- Cloud Storage buckets: Affordable object storage.
- Filestore: High performance file storage for Google Cloud users.
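Filestore, the NFS option in this list, can be provisioned as in this hedged sketch (instance and share names are hypothetical):
gcloud filestore instances create example-nfs --zone=us-central1-c --tier=BASIC_HDD --file-share=name=vol1,capacity=1TB --network=name=default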
If you are not sure which option to use, the most common solution is to add a persistent disk to your instance.
When you configure a persistent disk, you can select one of the following disk types.
- Standard persistent disks (pd-standard) are backed by standard hard disk drives (HDD).
- Balanced persistent disks (pd-balanced) are backed by solid-state drives (SSD). They are an alternative to SSD persistent disks that balance performance and cost.
- SSD persistent disks (pd-ssd) are backed by solid-state drives (SSD).
- Extreme persistent disks (pd-extreme) are backed by solid-state drives (SSD). With consistently high performance for both random access workloads and bulk throughput, extreme persistent disks are designed for high-end database workloads. Unlike other disk types, you can provision your desired IOPS. For more information, see Extreme persistent disks.
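As an illustration, the disk type is chosen at creation time; pd-extreme additionally accepts provisioned IOPS (names and values are hypothetical):
gcloud compute disks create example-balanced-disk --type=pd-balanced --size=200GB --zone=us-central1-a
gcloud compute disks create example-db-disk --type=pd-extreme --size=500GB --provisioned-iops=50000 --zone=us-central1-a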
How can you share a persistent disk across VMs?
Share a zonal persistent disk between VM instances
- Connect your instances to Cloud Storage.
- Connect your instances to Filestore.
- Create a network file server on Compute Engine.
- Create a persistent disk with multi-writer mode enabled and attach it to up to two instances.
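For the read-only option, a brief sketch (disk and VM names are hypothetical); a disk can be attached to many VMs as long as every attachment is read-only:
gcloud compute instances attach-disk example-vm-1 --disk=example-data-disk --mode=ro --zone=us-central1-a
gcloud compute instances attach-disk example-vm-2 --disk=example-data-disk --mode=ro --zone=us-central1-a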
How do you create a HA File Server with two GCE Instances and regional disks?
Database HA configurations typically have at least two VM instances. Preferably these instances are part of one or more managed instance groups:
- A primary VM instance in the primary zone
- A standby VM instance in a secondary zone
A primary VM instance has at least two persistent disks: a boot disk, and a regional persistent disk. The regional persistent disk contains database data and any other mutable data that should be preserved in another zone in case of an outage.
A standby VM instance requires a separate boot disk to be able to recover from configuration-related outages, which could result from an operating system upgrade, for example. You cannot force attach a boot disk to another VM during a failover.
The primary and standby VM instances are configured to use a load balancer with the traffic directed to the primary VM based on health check signals. This configuration is also known as a hot standby.
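A hedged sketch of the regional-disk piece of this setup (disk and VM names are hypothetical):
gcloud compute disks create example-regional-disk --region=us-central1 --replica-zones=us-central1-a,us-central1-b --size=200GB --type=pd-ssd
# During failover, force-attach the regional disk to the standby VM
gcloud compute instances attach-disk example-standby-vm --disk=example-regional-disk --disk-scope=regional --force-attach --zone=us-central1-b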
What is the difference between stopping and suspending an instance?
Please have a look at the documentation Suspending and resuming an instance:
> Suspending an instance differs from stopping an instance in the following ways:
- Suspended instances preserve the guest OS memory, device state, and application state.
- Google charges for the storage necessary to save instance memory.
- You can only suspend an instance for up to 60 days. After 60 days, the instance is automatically moved to the TERMINATED state.
See also the article Stopping and starting an instance.
What are the different states of an instance?
- PROVISIONING: resources are allocated for the VM. The VM is not running yet.
- STAGING: resources are acquired, and the VM is preparing for first boot.
- RUNNING: the VM is booting up or running.
- STOPPING: the VM is being stopped. You requested a stop, or a failure occurred. This is a temporary status after which the VM enters the TERMINATED status.
- REPAIRING: the VM is being repaired. Repairing occurs when the VM encounters an internal error or the underlying machine is unavailable due to maintenance. During this time, the VM is unusable. If repair succeeds, the VM returns to one of the above states.
- TERMINATED: the VM is stopped. You stopped the VM, or the VM encountered a failure. You can restart or delete the VM.
- SUSPENDING: The VM is in the process of being suspended. You suspended the VM.
- SUSPENDED: The VM is in a suspended state. You can resume the VM or delete it.
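You can check the current state with a one-liner (VM name is hypothetical):
gcloud compute instances describe example-vm --zone=us-central1-a --format='value(status)'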
What is the difference between the stopped, suspended, and reset states?
Why would you want to stop a VM?
You might want to stop a VM for several reasons:
- You no longer need the VM but want to keep the resources attached to it, such as its internal IPs, MAC address, and persistent disks.
- You don’t need to preserve the guest OS memory, device state, or application state.
- You want to change certain properties of the VM that require you to first stop the VM.
Why would you want to suspend a VM?
You might want to suspend a VM for the following reasons:
- You want to stop paying for the core and memory costs of running a VM and pay the comparatively cheaper cost of storage to preserve the state of your VM instead.
- You don’t need the VM at this time but want to be able to bring it back up quickly with its OS and application state where you left it.
You can resume a suspended VM when you need to use it again.
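The corresponding lifecycle commands, as a sketch (VM name is hypothetical; suspend/resume may require a recent SDK version):
gcloud compute instances stop example-vm --zone=us-central1-a
gcloud compute instances start example-vm --zone=us-central1-a
gcloud compute instances suspend example-vm --zone=us-central1-a
gcloud compute instances resume example-vm --zone=us-central1-a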
Here are three things you should consider as you address storage needs:
First, consider data replication requirements.
Second, consider that GCP offers replication across zones, even within a single region.
Third, if storing in a single region poses a risk for disaster recovery, you should consider multiregional replication.
Where do persistent disks attach to?
Persistent disks do not attach directly to a server. Rather, they attach over the network to the server hosting the virtual machine. Data on a locally attached disk (such as a Local SSD) is lost when the VM is terminated, but data on a persistent disk remains intact when an instance is terminated.
Two types of persistent disks are available:
Solid-state drive (SSD) and hard disk drive (HDD). Select an SSD when you require high throughput and consistent performance.
HDDs have longer latencies and cost less. An HDD is the preferred choice for large data ingest and batch operations, where sensitivity to latency variability is lower.
Persistent disks allow for several features. What are they?
First, a persistent disk can be mounted read-only on multiple virtual machines at once, providing shared storage. Second, snapshots of persistent disks can be created quickly, supporting rapid VM provisioning. Third, when a disk is mounted to a single virtual machine instance, full read/write operations are permitted.
What does Memorystore do?
If you are looking for storage that can hold user session data, maintain short-lived web and mobile applications data, or handle gaming data at speed and scale, Cloud Memorystore is the storage option to consider. Cloud Memorystore is a managed Redis service, which is an open source cache solution. Memorystore offers a fully managed in-memory data store with features such as scalability, a well-built security posture, and high availability, all managed by Google.
Configuration varies upon accessing the Memorystore form. When you access the menu, you have two choices:
- Redis: in-memory data structure store that can be used as a database, cache, and message broker
- Memcached: in-memory key-value store intended exclusively for caching data
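A minimal provisioning sketch for each engine (instance names and sizes are hypothetical):
gcloud redis instances create example-cache --size=1 --region=us-central1 --redis-version=redis_6_x
gcloud memcache instances create example-memcache --node-count=1 --node-cpu=1 --node-memory=1GB --region=us-central1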
Object Storage
Object storage is a strategy to manage and manipulate data storage as a distinct unit, called an object. Each object can be stored in a single storage unit instead of being embedded into files or folders.
Google Cloud Platform has three broad categories of storage:
object, relational, and nonrelational. The database platforms vary in size, scale, and capability. Nonrelational databases consist of platforms that support NoSQL as well as alternative solutions developed by Google, such as Cloud Firestore and Firebase. These two platforms are mobile NoSQL solutions.
What is the problem with automated backups?
Backing Up a Database
Backups can be created at any time with GCP. For example, if you are about to complete a risky task, you'll want to back up your database or storage system. For these occasions, you can use on-demand backups, since you do not have to wait for the backup window to create a copy. Unlike automated backups, on-demand backups are not deleted automatically; you must delete them yourself, or you will keep accruing storage charges.
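For Cloud SQL, for instance, the on-demand flow looks like this sketch (instance name is hypothetical):
gcloud sql backups create --instance=example-instance --description="pre-migration"
gcloud sql backups list --instance=example-instance
# On-demand backups persist until you delete them
gcloud sql backups delete BACKUP_ID --instance=example-instance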
Dataproc Deployment and Management
Dataproc is Google’s managed Apache Spark and Hadoop service. Like BigQuery, Dataproc is designed for big data applications. You should be aware that Spark is intended for analysis and machine learning, whereas Hadoop is appropriate for batching data, with emphasis on big data applications.
For the exam, you need to be familiar with creating Dataproc clusters and storage facilities as well as know how to submit jobs that run in those clusters.
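A minimal cluster-and-job sketch (cluster, bucket, and script names are hypothetical):
gcloud dataproc clusters create example-cluster --region=us-central1 --num-workers=2
gcloud dataproc jobs submit pyspark gs://example-bucket/wordcount.py --cluster=example-cluster --region=us-central1
gcloud dataproc clusters delete example-cluster --region=us-central1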