Exam questions Flashcards
What is the storage transfer service?
The Storage Transfer Service is a tool for importing online data into GCS.
When would you use the storage transfer service?
When transferring data from an online source (HTTP(S) endpoints, Amazon S3, etc.) to a data sink (e.g. a GCS bucket).
What is GCS?
Google Cloud Storage, Google’s online object storage service.
What is a bucket?
It’s similar to a partition (file system) on a hard drive: a segregated place to store files (objects).
Which access rights do you need to use the storage transfer service?
Editor or owner of the project that creates the transfer job. Viewers can view info/jobs.
When do you use storage transfer service vs. gsutil?
For on-premises data, use gsutil. Use STS when transferring data from another cloud storage provider or an online source.
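For small on-premises transfers you can also use the GCS client libraries instead of gsutil. A minimal sketch with the Python client; the bucket, object and file names are placeholders:

    from google.cloud import storage

    # Upload a local file to a GCS bucket (roughly `gsutil cp data.csv gs://my-bucket/`).
    client = storage.Client()                 # uses Application Default Credentials
    bucket = client.bucket("my-bucket")       # placeholder bucket name
    blob = bucket.blob("backups/data.csv")    # object name inside the bucket
    blob.upload_from_filename("data.csv")     # local file to upload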
What is Google Transfer Appliance?
High-capacity storage servers rented from Google, used to transfer datasets that are too large to send over the network in a reasonable time.
When would you use the Google Transfer Appliance?
When transferring your data over the Internet would take more than a few weeks (evaluate your data volume against your available bandwidth).
How is Google Transfer Appliance priced?
- 100 TB appliance: $300
- 480 TB appliance: $1,800
- Late fees apply if the appliance is kept beyond the included usage period.
How is the storage transfer service priced?
No extra costs for the transfer service itself. Other fees apply, like:
- GCS storage/bandwidth (~$0.01/GB regional or ~$0.10/GB inter-regional)
- Data source’s pricing
- Data insertion costs (PUT operations).
What is nearline storage?
Storage for infrequently accessed data (roughly once a month or less). Compared to Coldline: higher cost per GB stored, lower cost to retrieve.
What is coldline storage?
Storage for rarely accessed data, e.g. backups and archives. Lower cost per GB stored, but higher cost to retrieve.
What’s the price difference between nearline/coldline storage?
Nearline: $0.01/GB/month
Coldline: $0.007/GB/month
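A rough worked example using the per-GB prices above (storage cost only, ignoring retrieval and operation fees):

    # Monthly storage cost for 10 TB at the listed per-GB prices (illustrative only).
    gb = 10 * 1024
    nearline = gb * 0.01     # $0.01 per GB per month
    coldline = gb * 0.007    # $0.007 per GB per month
    print(f"Nearline: ${nearline:.2f}/month, Coldline: ${coldline:.2f}/month")
    # Nearline: $102.40/month, Coldline: $71.68/month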
What is data egress?
Data leaving a location, e.g. moving out of a GCS bucket to another region or to the Internet. Egress is what typically incurs network charges.
When would you use GCS as your storage platform?
If
- your data is not structured
- you don’t need mobile SDKs
When would you use Google Firebase as your storage platform?
When
- your data is structured or unstructured,
- your data is non-relational
- your main workload is not analytics
- you need mobile SDKs.
When would you use Cloud Spanner as your storage platform?
When
- your data is relational,
- you don’t primarily need analytics
- you need horizontal scaling.
What is horizontal scaling?
When you add more machines.
What is vertical scaling?
When you add more power to an existing machine.
When would you use Cloud SQL as your storage platform?
When
- your data is relational
- you don’t primarily need analytics
- you don’t need to scale horizontally.
When would you use Cloud Datastore as your storage platform?
When
- your data is structured
- your data is non-relational
- your primary workload is not analytics
When would you use Cloud Bigtable as your storage platform?
When
- your data is structured
- your data is non-relational
- your workload is analytics
- you need low latency
When would you use BigQuery as your storage platform?
When
- your data is structured
- your data is non-relational
- your workload is analytics
- you don’t need low latency
Describe Google Cloud Storage.
A scalable, fully-managed, highly reliable, and cost-efficient object / blob store.
What is CBT (Cloud Bigtable)?
A scalable, fully-managed NoSQL wide-column database that is suitable for both real-time access and analytics workloads.
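A minimal write with the Python client to show the wide-column model; the project, instance, table and column-family names are placeholders and assume the table and family already exist:

    from google.cloud import bigtable

    client = bigtable.Client(project="my-project", admin=True)
    table = client.instance("my-instance").table("sensor-readings")

    # Rows are addressed by a single row key; cells live under column families.
    row = table.direct_row(b"device-42#2019-01-01T00:00:00")
    row.set_cell("readings", "temperature", b"21.5")   # family "readings", column "temperature"
    row.commit()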
What is Cloud Datastore?
A scalable, fully-managed NoSQL document database for your web and mobile applications.
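A minimal sketch with the Python client to show the entity/document model; the kind and property names are placeholders:

    from google.cloud import datastore

    client = datastore.Client()

    # Entities are schemaless documents grouped by "kind" and addressed by a key.
    key = client.key("Task", "sample-task")
    entity = datastore.Entity(key=key)
    entity.update({"description": "Prepare exam flashcards", "done": False})
    client.put(entity)

    print(client.get(key)["description"])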
What is Cloud SQL?
A fully-managed MySQL and PostgreSQL database service that is built on the strength and reliability of Google’s infrastructure.
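Because Cloud SQL is plain MySQL/PostgreSQL, you connect with standard drivers. A sketch assuming the Cloud SQL Proxy is listening locally and the mysql-connector-python package is installed; the credentials, database and table are placeholders:

    import mysql.connector

    # Connect through the Cloud SQL Proxy running on localhost.
    conn = mysql.connector.connect(
        host="127.0.0.1", user="app", password="secret", database="shop"
    )
    cur = conn.cursor()
    cur.execute("SELECT id, name FROM customers LIMIT 5")
    for row in cur.fetchall():
        print(row)
    conn.close()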
What is Cloud Spanner?
Mission-critical, relational database service with transactional consistency, global scale and high availability.
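A minimal read with the Python client; the instance, database and table names are placeholders:

    from google.cloud import spanner

    client = spanner.Client()
    database = client.instance("my-instance").database("my-database")

    # Reads run against a consistent snapshot of the globally replicated database.
    with database.snapshot() as snapshot:
        for row in snapshot.execute_sql("SELECT SingerId, FirstName FROM Singers"):
            print(row)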
What is BigQuery?
A scalable, fully-managed Enterprise Data Warehouse (EDW) with SQL and fast response times.
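A minimal query with the Python client, run against one of BigQuery’s public sample datasets:

    from google.cloud import bigquery

    client = bigquery.Client()
    query = """
        SELECT name, SUM(number) AS total
        FROM `bigquery-public-data.usa_names.usa_1910_2013`
        GROUP BY name
        ORDER BY total DESC
        LIMIT 10
    """
    # Run the query and iterate over the result rows.
    for row in client.query(query).result():
        print(row.name, row.total)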
What is OLAP short for?
Online analytical processing.
Give some examples of data that is collected but typically goes unanalyzed (“unused” data).
- Google Street View data.
- Emails.
- Parking footage.
- Purchase history.
What are some barriers to big data analysis?
- Unstructured data
- Data volumes that are too large
- Poor data quality
- Data streams that arrive too fast
Big data problems are often called counting problems. What’s the difference between easy and hard counting problems?
Hard problems:
Difficult to quantify what a correct answer (“fitness”) looks like, e.g. vision analysis or natural language processing.
Easy problems:
The counting itself is straightforward, but the data volume is large.
Is one petabyte large?
It depends on the data type and your budget. A petabyte of text is a lot, but a petabyte of images or video is not unusual.
Also, a large volume does not by itself mean long processing time.
Describe how MapReduce works.
Split the data into small, parallelizable chunks, process them independently (map), then aggregate the outputs (reduce).
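A toy word count in plain Python to illustrate the map / shuffle / reduce phases; the input documents are made up:

    from collections import defaultdict

    documents = ["the cat sat", "the dog sat", "the cat ran"]

    # Map: each chunk independently emits (word, 1) pairs -- this step parallelizes.
    mapped = [(word, 1) for doc in documents for word in doc.split()]

    # Shuffle: group the pairs by key so each key ends up at one reducer.
    groups = defaultdict(list)
    for word, count in mapped:
        groups[word].append(count)

    # Reduce: aggregate each group into the final counts.
    counts = {word: sum(values) for word, values in groups.items()}
    print(counts)   # {'the': 3, 'cat': 2, 'sat': 2, 'dog': 1, 'ran': 1}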
What is the difference between typical development with Dataproc and typical Spark/Hadoop?
Dataproc manages the cluster setup that Spark/Hadoop would otherwise require.
Running Spark/Hadoop yourself involves a lot of setup, configuration and optimization work.
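The same word count written as a PySpark job is roughly what you would submit to a Dataproc cluster: Dataproc provisions and manages the cluster, while the job code stays ordinary Spark. The GCS bucket paths are placeholders:

    from pyspark import SparkContext

    sc = SparkContext()
    counts = (
        sc.textFile("gs://my-bucket/input/*.txt")      # read input from GCS
          .flatMap(lambda line: line.split())          # map: emit words
          .map(lambda word: (word, 1))
          .reduceByKey(lambda a, b: a + b)             # reduce: sum per word
    )
    counts.saveAsTextFile("gs://my-bucket/output/")
    sc.stop()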
What are some drawbacks of managing a Hadoop cluster yourself?
- Difficult to scale/add new hardware.
- Less than 100% utilization -> higher cost.
- Downtime when upgrading/redistributing tasks.
What is a cluster?
A setup of master and worker nodes for crunching big data tasks. Data is centralized in master nodes and distributed (mapped) to worker nodes.
Why use nearby zones?
- Lower latency
- Egress (exporting) data might incur costs
What’s the difference between cluster masters and nodes?
Master: Holds and splits the data so workers can process it in chunks (mapping), then aggregates the results later (reducing).
Worker: Compute capacity attached to the cluster; receives chunks of data and processes them. Workers may be configured as preemptible and can disappear from the cluster.
What is a preemptible worker?
Spare compute capacity that Google rents out cheaply (think last-minute airplane tickets). It can be reclaimed (preempted) when Google needs the capacity elsewhere.
What can images do for Dataproc?
Image versions let you create clusters with a specific version of the software stack (OS, Spark, Hadoop, etc.).