Chapter 8: Storage, Databases, and Data Analytics Flashcards
- You’re an architect for Memegen, a global meme generating company, designing a real-time analytics platform that is intended to stream millions of events per second. Their platform needs to dynamically scale up and down, process incoming data on the fly, and process data that arrives late because of slow mobile networks. They also want to run SQL queries to access 10TB of historical data. They’d like to stick to managed services only. What services would you leverage?
- Cloud SQL, Cloud Pub/Sub, Kubernetes
- Cloud Functions, Cloud Dataproc, Cloud Bigtable
- Cloud Dataflow, Cloud Storage, Cloud Pub/Sub, BigQuery
- Cloud Dataproc, BigQuery, Google Compute Engine
- You’re an architect for Memegen, a global meme generating company, designing a real-time analytics platform that is intended to stream millions of events per second. Their platform needs to dynamically scale up and down, process incoming data on the fly, and process data that arrives late because of slow mobile networks. They also want to run SQL queries to access 10TB of historical data. They’d like to stick to managed services only. What services would you leverage?
* Cloud Dataflow, Cloud Storage, Cloud Pub/Sub, BigQuery
The requirements for this architecture are dynamic scaling, processing streaming data on the fly, handling data that arrives late, and using SQL to query 10TB of historical data, all through managed services. Dataflow, GCS, Pub/Sub, and BigQuery are the only combination that meets every one of these requirements.
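To make this concrete, here is a minimal sketch of how the pieces wire together, using made-up topic, dataset, and job names and the Google-provided Pub/Sub-to-BigQuery Dataflow template:

```bash
# Ingestion topic for the event stream
gcloud pubsub topics create meme-events

# BigQuery dataset for the 10TB of historical data
bq mk --dataset memegen_analytics

# Streaming pipeline: Pub/Sub -> Dataflow -> BigQuery
gcloud dataflow jobs run meme-stream \
  --gcs-location=gs://dataflow-templates/latest/PubSub_to_BigQuery \
  --region=us-central1 \
  --parameters=inputTopic=projects/my-project/topics/meme-events,outputTableSpec=my-project:memegen_analytics.events
```

Dataflow's windowing and triggers are what absorb the late-arriving mobile events before they land in BigQuery.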
- You need a solution to analyze your data stream and optimize your operations. Your data stream involves both batch and stream processing. Your team wants to leverage a serverless solution. What should you use?
- Cloud Dataflow
- Cloud Dataproc
- Kubernetes with BigQuery
- Compute Engine with BigQuery
- You need a solution to analyze your data stream and optimize your operations. Your data stream involves both batch and stream processing. Your team wants to leverage a serverless solution. What should you use?
* Cloud Dataflow
Dataflow is a serverless solution that can be leveraged for both batch and stream processing. Dataproc is not fully serverless.
- Your team is running many Apache Spark and Hadoop jobs in their on-premises environment and would like to migrate to the cloud with the least amount of change to their tooling. What should they use?
- Cloud Dataflow
- Compute Engine with a Dataflow Connector
- Kubernetes Engine with a Dataflow Connector
- Cloud Dataproc
- Your team is running many Apache Spark and Hadoop jobs in their on-premises environment and would like to migrate to the cloud with the least amount of change to their tooling. What should they use?
* Cloud Dataproc
Dataproc is Google Cloud's managed service for Spark and Hadoop workloads, so existing jobs can be migrated with the least amount of change to their tooling.
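As a rough sketch of that lift-and-shift path (the cluster, bucket, and class names below are hypothetical), an existing Spark jar is submitted essentially unchanged:

```bash
# Create a managed Spark/Hadoop cluster
gcloud dataproc clusters create migration-cluster \
  --region=us-central1 \
  --num-workers=2

# Submit an existing Spark job exactly as you would on-premises
gcloud dataproc jobs submit spark \
  --cluster=migration-cluster \
  --region=us-central1 \
  --class=org.example.WordCount \
  --jars=gs://my-bucket/jobs/wordcount.jar \
  -- gs://my-bucket/input/ gs://my-bucket/output/
```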
- You need to develop a solution that will process data from one of your organization’s APIs in strict chronological order with no repeated data. How would you build this solution?
- Cloud Dataflow
- Cloud Pub/Sub to a Cloud SQL backend
- Cloud Pub/Sub to a Stackdriver backend
- Cloud Pub/Sub
- You need to develop a solution that will process data from one of your organization’s APIs in strict chronological order with no repeated data. How would you build this solution?
* Cloud Pub/Sub to a Cloud SQL backend
Pub/Sub can deliver messages first in, first out (FIFO) when message ordering is enabled, but its delivery is at-least-once, so to guarantee no repeated data the content should be stored in an ACID-compliant system such as Cloud SQL.
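For reference, ordered delivery requires an ordering key on publish and a subscription with ordering enabled; the names below are placeholders:

```bash
gcloud pubsub topics create api-events

# Ordering must be enabled on the subscription side
gcloud pubsub subscriptions create api-events-sub \
  --topic=api-events \
  --enable-message-ordering

# Messages sharing an ordering key are delivered in publish order
gcloud pubsub topics publish api-events \
  --message='{"event_id": 1}' \
  --ordering-key=api-stream
```

Deduplication is the part Pub/Sub does not do for you, which is why the answer pairs it with an ACID backend like Cloud SQL.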
- Memegen just got breached, and the Security Operations team is kicking off their incident response process. They’re investigating a production VM and want to copy the VM as evidence in a secure location so they can conduct their forensics before taking an action. What should they do?
- Create a snapshot of the root disk, create a restricted GCS bucket that is accessible only by the forensics team, and create an image file in GCS from the snapshot.
- Shut down the VM, create a snapshot, create an image file in GCS, and restrict the GCS bucket.
- Use the gcloud copy tool to copy the file directory onto an attached Cloud Filestore network file system.
- Create a clone of the VM, migrate user traffic onto the new VM, and use the old VM for forensics.
- Memegen just got breached, and the Security Operations team is kicking off their incident response process. They’re investigating a production VM and want to copy the VM as evidence in a secure location so they can conduct their forensics before taking an action. What should they do?
* Create a snapshot of the root disk, create a restricted GCS bucket that is accessible only by the forensics team, and create an image file in GCS from the snapshot.
This is the only valid solution here. They’re looking to investigate a production VM, so taking the server down is not a recommended action at this point. They also want to conduct forensics in a secure location to ensure the evidence is not tampered with.
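One possible command sequence for that evidence-capture flow (the VM, bucket, and group names are invented, and gcloud compute images export relies on Cloud Build being enabled in the project):

```bash
# 1. Snapshot the running VM's root disk -- no downtime required
gcloud compute disks snapshot prod-vm \
  --zone=us-central1-a \
  --snapshot-names=ir-evidence-snap

# 2. Create a bucket and grant access only to the forensics team
gcloud storage buckets create gs://memegen-ir-evidence --location=us-central1
gcloud storage buckets add-iam-policy-binding gs://memegen-ir-evidence \
  --member=group:forensics@memegen.example \
  --role=roles/storage.objectAdmin

# 3. Turn the snapshot into an image and export it into the bucket
gcloud compute images create ir-evidence-image \
  --source-snapshot=ir-evidence-snap
gcloud compute images export \
  --image=ir-evidence-image \
  --destination-uri=gs://memegen-ir-evidence/ir-evidence.tar.gz
```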
- You’re planning on migrating 5 petabytes of data to your project. This data requires 24/7 availability, and your data analyst team is familiar with SQL. What tool should you use to surface this data to your analyst team for analytical purposes?
- Cloud Datastore
- Cloud SQL
- Cloud Spanner
- BigQuery
- You’re planning on migrating 5 petabytes of data to your project. This data requires 24/7 availability, and your data analyst team is familiar with SQL. What tool should you use to surface this data to your analyst team for analytical purposes?
* BigQuery
There are a few indicators here as to why BigQuery is the right answer: large-scale migration, requirement to use SQL, and an analytical use case.
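Once the data lands in BigQuery, the analyst workflow is plain SQL; a short sketch with made-up dataset, table, and bucket names:

```bash
bq mk --dataset analytics

# Load files that were staged in Cloud Storage
bq load --source_format=PARQUET \
  analytics.events 'gs://my-bucket/events/*.parquet'

# Analysts query with standard SQL
bq query --use_legacy_sql=false \
  'SELECT user_id, COUNT(*) AS views
   FROM analytics.events
   GROUP BY user_id
   ORDER BY views DESC
   LIMIT 10'
```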
- You’re consulting for an IoT company that has hundreds of thousands of IoT sensors that capture readings every two seconds. You’d like to optimize the performance of this database, so you’re looking to identify a more accurate, time-series database solution. What would you use?
- Cloud Bigtable
- Cloud Storage
- BigQuery
- Cloud Filestore
- You’re consulting for an IoT company that has hundreds of thousands of IoT sensors that capture readings every two seconds. You’d like to optimize the performance of this database, so you’re looking to identify a more accurate, time-series database solution. What would you use?
* Cloud Bigtable
The dead giveaway here is leveraging a time-series database for IoT sensors. This is where Bigtable shines.
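A sketch of the Bigtable side (instance and table names are hypothetical, and the --cluster-config syntax reflects recent gcloud versions); the real design work is the row key, which for time series typically combines the sensor ID with a timestamp:

```bash
# Create the instance and a table with one column family
gcloud bigtable instances create iot-readings \
  --display-name="IoT readings" \
  --cluster-config=id=iot-c1,zone=us-central1-b,nodes=3

cbt -instance=iot-readings createtable sensor-data families=readings

# Row keys like "sensor123#20250101T120000" keep each sensor's
# readings contiguous without hotspotting a single node
cbt -instance=iot-readings set sensor-data 'sensor123#20250101T120000' \
  readings:temp=21.4
```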
- You have a customer who wants to store data for at least ten years that will be accessed infrequently, at most once a year. The customer wants to optimize their cost. What solution should they use?
- Google Cloud Storage
- Google Cloud Storage with a Nearline storage class
- Google Cloud Storage with a Coldline storage class
- Google Cloud Storage with an Archive storage class
- You have a customer who wants to store data for at least ten years that will be accessed infrequently, at most once a year. The customer wants to optimize their cost. What solution should they use?
* Google Cloud Storage with an Archive storage class
Using the Archive storage class is sufficient and the most cost-effective choice here because the data is accessed infrequently, at most once a year.
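The setup is a single command; the bucket name and location below are placeholders:

```bash
# New objects in this bucket default to the Archive class
gcloud storage buckets create gs://ten-year-records \
  --location=us-central1 \
  --default-storage-class=ARCHIVE
```

The trade-off is higher retrieval cost and a 365-day minimum storage duration, which is acceptable for data touched at most once a year.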
- BankyBank wants to build an online transactional processing tool that requires a relational database with petabyte-scale data. What tool would you use?
- BigQuery
- Cloud SQL
- Cloud Spanner
- Cloud Bigtable
- BankyBank wants to build an online transactional processing tool that requires a relational database with petabyte-scale data. What tool would you use?
* Cloud Spanner
Cloud Spanner is the OLTP solution that is relational and offers petabyte scalability. Cloud SQL is not designed for petabyte-scale data.
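A minimal Spanner sketch, with an invented instance, database, and schema:

```bash
gcloud spanner instances create bankybank-prod \
  --config=regional-us-east1 \
  --description="OLTP ledger" \
  --nodes=3

gcloud spanner databases create ledger \
  --instance=bankybank-prod \
  --ddl='CREATE TABLE Transactions (
           TxId STRING(36) NOT NULL,
           AccountId STRING(36) NOT NULL,
           AmountCents INT64 NOT NULL,
           CreatedAt TIMESTAMP NOT NULL OPTIONS (allow_commit_timestamp=true)
         ) PRIMARY KEY (TxId)'
```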
- Memegen wants to introduce a shopping functionality for their users to connect all of their user purchasing history and activities to their user profiles. They need massive scalability with high performance, atomic transactions, and a highly available document database. What should they use?
- Cloud Spanner
- BigQuery
- Cloud Bigtable
- Cloud Firestore
- Memegen wants to introduce a shopping functionality for their users to connect all of their user purchasing history and activities to their user profiles. They need massive scalability with high performance, atomic transactions, and a highly available document database. What should they use?
* Cloud Firestore
Cloud Firestore, the next generation of Cloud Datastore, is a highly available document database with atomic transactions, which makes it a great fit for user profiles and purchasing history.
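Enabling it for a project is a single command; treat the exact flag as a sketch, since it has varied across gcloud versions, and nam5 is one of Firestore's multi-region locations:

```bash
# Create the project's Firestore database in Native mode
gcloud firestore databases create --location=nam5
```

Transactions can then atomically update multiple documents, such as a purchase record and the user profile that references it.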
When you’re taking the exam, knowing which Google Cloud storage technologies map to file, object, and block storage can help you get to a clearer answer.
Be careful, though, and don’t assume a Google-managed service is always the answer. Read through each question very carefully for the requirements.
Persistent Disk
If you need to modify the size of your persistent disk, it’s as easy as increasing the size in the Cloud Console. If you need to resize your mounted file system, you can use the standard resize2fs command in Linux to do online resizing.
PDs are not actually physically attached to the servers that host your VMs, but they are virtually attached. You can only resize up, but not down!
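A sketch of the grow-only resize flow, assuming an ext4 data disk on a Linux VM (the disk and device names are examples):

```bash
# Grow the disk in place -- the size can only increase
gcloud compute disks resize data-disk \
  --zone=us-central1-a \
  --size=500GB

# Then, inside the VM, grow the file system online
sudo resize2fs /dev/sdb
```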
Persistent Disk
The command to modify the auto-delete behavior of a persistent disk attached to a VM is gcloud compute instances set-disk-auto-delete.
Auto-delete is on by default for the boot disk created with an instance, so you’ll need to use this command to turn it off if you don’t want your PD to be deleted when the instance it’s attached to is deleted.
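For example (instance and disk names are made up):

```bash
# Keep data-disk around even if my-instance is deleted
gcloud compute instances set-disk-auto-delete my-instance \
  --zone=us-central1-a \
  --disk=data-disk \
  --no-auto-delete
```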
Local SSD
Local SSDs disappear when you stop an instance, whereas all three types of persistent disks persist when you stop an instance—hence the name, persistent disk.
Each Local SSD is only 375GB, but you can attach up to 24 of them per instance. Because of these benefits and limitations, Local SSDs are a great fit for temporary storage such as caches, processing space, or low-value data.
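A sketch of attaching and mounting Local SSDs as scratch space (the instance name and mount point are examples):

```bash
# Each --local-ssd flag attaches one 375GB device
gcloud compute instances create scratch-vm \
  --zone=us-central1-a \
  --local-ssd=interface=NVME \
  --local-ssd=interface=NVME

# Inside the VM: format and mount one device (data is lost on stop!)
sudo mkfs.ext4 -F /dev/nvme0n1
sudo mkdir -p /mnt/scratch
sudo mount /dev/nvme0n1 /mnt/scratch
```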