Exam Preparation Flashcards
What are the four stages of data lifecycle?
Ingest, store, process and analyze, and explore and visualize.
What is streaming data?
Streaming data is a set of data that is sent in small messages that are transmitted continuously from the data source. Streaming data may be telemetry data, which is data generated at regular intervals, or event data, which is data generated in response to a particular event. Stream ingestion services need to deal with potentially late and missing data. Streaming data is often ingested using Cloud Pub/Sub.
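As a quick sketch of streaming ingestion (the project ID, topic name, and attribute are hypothetical), a publisher can push small telemetry messages to Cloud Pub/Sub with the Python client library:

    from google.cloud import pubsub_v1

    publisher = pubsub_v1.PublisherClient()
    # Hypothetical project and topic; replace with your own.
    topic_path = publisher.topic_path("my-project", "sensor-telemetry")

    # Each reading is published as a small message; attributes carry metadata.
    future = publisher.publish(topic_path, b'{"temp": 21.3}', sensor_id="sensor-42")
    print("Published message ID:", future.result())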
What is bulk data?
Batch data is ingested in bulk, typically in files. Examples of batch data ingestion include uploading files of data exported from one application to be processed by another. Both batch and streaming data can be transformed and processed using Cloud Dataflow.
What are the technical considerations to consider when choosing a data store?
These factors include the volume and velocity of data, the type of structure of the data, access control requirements, and data access patterns.
Know the three levels of structure of data.
Unstructured, semi-structured and structured.
What products store structured data in GCP?
CloudSQL and CloudSpanner for transactional
BigQuery for analytical
What products store semi-structured data in GCP?
Cloud Datastore if data access requires full indexing; otherwise, Bigtable.
What products store unstructured data in GCP?
Cloud Storage
What are the four types of NoSQL databases?
Four types of NoSQL databases are key-value, document, wide-column, and graph databases
What are some concerns about streaming data?
Stream ingestion services need to deal with potentially late and missing data
What tool can transform batch and streaming data?
Both batch and streaming data can be transformed and processed using Cloud Dataflow.
Which database engines does Cloud SQL support?
Cloud SQL supports MySQL, PostgreSQL, and SQL Server (beta).
How is Cloud SQL initially set up for availability?
Cloud SQL instances are created in a single zone by default, but they can be configured for high availability using instances in multiple zones.
How can you improve reads in Cloud SQL?
Use read replicas.
What is Cloud Spanner?
Cloud Spanner is a horizontally scalable relational database that automatically replicates data.
What are the three types of replicas in Cloud Spanner?
Three types of replicas are read-write replicas, read-only replicas, and witness replicas.
How can you avoid hot-spotting in Cloud Spanner?
Avoid hotspots by not using consecutive values for primary keys.
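For example, a random UUID spreads writes across the key space instead of concentrating them on one split. A minimal sketch with the Python Spanner client (the instance, database, and table names are made up):

    import uuid
    from google.cloud import spanner

    client = spanner.Client()
    database = client.instance("my-instance").database("orders-db")

    # A UUIDv4 primary key avoids the hotspot a sequential order number would create.
    with database.batch() as batch:
        batch.insert(
            table="Orders",
            columns=("OrderId", "CustomerId", "Total"),
            values=[(str(uuid.uuid4()), "cust-123", 42.50)],
        )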
What kind of configuration does Cloud Spanner have?
Cloud Spanner instances are configured as regional or multi-regional.
What is Bigtable?
Cloud Bigtable is a wide-column NoSQL database used for high-volume databases that require sub-10 ms latency (fast writes).
What use cases are there for Bigtable?
Cloud Bigtable is used for IoT, time-series, finance, and similar applications.
How do you make Bigtable highly available?
For multi-regional high availability, you can create a replicated cluster in another region. All data is replicated between clusters.
How is data stored in Bigtable?
Data is stored in Bigtable lexicographically by row-key, which is the one indexed column in a Bigtable table.
How do you improve reads in Bigtable?
Keeping related data in adjacent rows can help make reads more efficient.
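As an illustration (the instance, table, and column family names are hypothetical), prefixing the row key with the entity ID keeps one sensor's readings in adjacent rows, so a bounded range scan over that prefix is efficient:

    from google.cloud import bigtable

    client = bigtable.Client(project="my-project")
    table = client.instance("my-instance").table("sensor-data")

    # The row key groups all of sensor-42's readings together in lexicographic order.
    row = table.direct_row(b"sensor-42#2024-01-01T12:00:00Z")
    row.set_cell("metrics", "temp", b"21.3")
    row.commit()

    # Related rows share the prefix, so this scan reads only sensor-42's data.
    for r in table.read_rows(start_key=b"sensor-42#", end_key=b"sensor-42$"):
        print(r.row_key)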
What is Cloud Firestore?
Cloud Firestore is a document database that is replacing Cloud Datastore as the managed document database.
What is BigQuery?
BigQuery is an analytics database that uses SQL as a query language
What are datasets in BigQuery?
Datasets are the basic unit of organization for sharing data in BigQuery. A dataset can have multiple tables.
Which SQL dialects does BigQuery support?
BigQuery supports two dialects of SQL: legacy and standard.
What are streaming inserts in BigQuery?
Streaming inserts allow adding one row at a time.
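A minimal sketch of a streaming insert with the Python BigQuery client (the table ID and row schema are hypothetical):

    from google.cloud import bigquery

    client = bigquery.Client()
    table_id = "my-project.my_dataset.events"  # hypothetical table

    # Rows become available for querying within seconds of the insert.
    rows = [{"user_id": "u1", "event": "click", "ts": "2024-01-01T12:00:00Z"}]
    errors = client.insert_rows_json(table_id, rows)
    if errors:
        print("Insert errors:", errors)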
What does Stackdriver do in BigQuery?
Stackdriver is used for monitoring and logging in BigQuery. Stackdriver Monitoring provides performance metrics, such as query counts and query execution times, and Stackdriver Logging tracks events, such as running jobs or creating tables.
How are BigQuery costs managed?
BigQuery costs are based on the amount of data stored, the amount of data streamed, and the workload required to execute queries.
What is Cloud Memorystore?
Cloud Memorystore is a managed Redis service. Redis instances can be created using the Cloud Console or gcloud commands
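Because Memorystore exposes the standard Redis protocol, a client in the same VPC connects with an ordinary Redis library; the host IP below is a placeholder for the instance's internal address:

    import redis

    # Placeholder internal IP of the Memorystore instance; 6379 is the Redis default port.
    r = redis.Redis(host="10.0.0.3", port=6379)
    r.set("session:123", "active", ex=3600)  # expire after one hour
    print(r.get("session:123"))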
When is Cloud Memorystore under memory pressure?
When the memory used by Redis exceeds 80 percent of system memory, the instance is considered under memory pressure.
What is Google Cloud Storage?
It's an object storage service, similar to Amazon S3.
Where are objects stored in Google Cloud Storage?
In buckets, which share access controls at the bucket level.
What are the four storage tiers of Google Cloud Storage?
The four storage tiers are Regional, Multi-regional, Nearline, and Coldline.
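A small sketch using the google-cloud-storage client to create a bucket with a specific storage class and upload an object (the bucket name and file are made up):

    from google.cloud import storage

    client = storage.Client()

    # Hypothetical bucket for infrequently accessed backups.
    bucket = client.bucket("my-archive-bucket")
    bucket.storage_class = "NEARLINE"
    client.create_bucket(bucket, location="us-central1")

    blob = bucket.blob("backups/2024-01-01.tar.gz")
    blob.upload_from_filename("2024-01-01.tar.gz")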
How are data pipelines modeled?
Data pipelines are modeled as directed acyclic graphs (DAGs)
What are the four stages of the data pipeline?
Ingestion - bringing data into the GCP environment.
Transformation - mapping data from the structure used in the source system to the structure used in the storage and analysis stages of the pipeline.
Storage - Cloud Storage can be used both as the staging area for data immediately after ingestion and as a long-term store for transformed data. BigQuery can treat files in Cloud Storage as external tables and query them, and Cloud Dataproc can use Cloud Storage as HDFS-compatible storage.
Analysis - can take several forms, from simple SQL querying and report generation to machine learning model training and data science analysis.
What are the common patterns in data warehousing pipelines?
ETL (extract, transform, load)
ELT (extract, load, transform)
CDC (change data capture)
EL (extract and load)
What are the unique considerations for streaming data?
The difference between event time and processing time, sliding and tumbling windows, late-arriving data and watermarks, and missing data.
What are the components of a typical ML pipeline?
This includes data ingestion, data preprocessing, feature engineering, model training and evaluation, and deployment.
What is Cloud Pub/Sub?
Cloud Pub/Sub is a managed message queue service.
Does Cloud Pub/Sub scale as needed?
Cloud Pub/Sub will automatically scale as needed.
When are messaging queues used?
Messaging queues are used in distributed systems to decouple services in a pipeline. This allows one service to produce more output than the consuming service can process without adversely affecting the consuming service. This is especially helpful when one process is subject to spikes.
I.e., lots of messages = no worries, just add them to the queue.
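To sketch the consumer side of that decoupling (the project and subscription names are hypothetical), a subscriber pulls messages at its own pace while the subscription's backlog absorbs any spike:

    from concurrent.futures import TimeoutError
    from google.cloud import pubsub_v1

    subscriber = pubsub_v1.SubscriberClient()
    subscription_path = subscriber.subscription_path("my-project", "sensor-telemetry-sub")

    def callback(message):
        # Process the message, then ack so Pub/Sub does not redeliver it.
        print("Received:", message.data)
        message.ack()

    # Messages queue up in the subscription if this consumer falls behind.
    streaming_pull = subscriber.subscribe(subscription_path, callback=callback)
    try:
        streaming_pull.result(timeout=30)
    except TimeoutError:
        streaming_pull.cancel()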
What is Cloud Dataflow?
Cloud Dataflow is a managed stream and batch processing service.
How does Cloud Dataflow work to help?
In the past, developers would typically create a stream processing pipeline (hot path) and a separate batch processing pipeline (cold path). Cloud Dataflow combines the two.
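Dataflow pipelines are written with the Apache Beam SDK; a minimal batch word count looks like the sketch below (the bucket paths are placeholders), and swapping the text source for a Pub/Sub source turns the same transforms into a streaming pipeline:

    import apache_beam as beam

    # Runs locally by default; pass --runner=DataflowRunner plus project and region
    # options to execute the same code on Cloud Dataflow.
    with beam.Pipeline() as p:
        (p
         | "Read"  >> beam.io.ReadFromText("gs://my-bucket/input.txt")
         | "Split" >> beam.FlatMap(lambda line: line.split())
         | "Pair"  >> beam.Map(lambda word: (word, 1))
         | "Count" >> beam.CombinePerKey(sum)
         | "Write" >> beam.io.WriteToText("gs://my-bucket/wordcount"))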
What is Cloud Dataproc?
Cloud Dataproc is a managed Hadoop and Spark service.
How does Cloud Dataproc support on-prem migrations?
You can move your on-prem Hadoop to Dataproc.
What is Cloud Composer?
Cloud Composer is a managed service implementing Apache Airflow.
What type of nodes does Dataproc support?
Cloud Dataproc clusters consist of two types of nodes: master nodes and worker nodes.
What does Cloud Composer do?
Cloud Composer automates the scheduling and monitoring of workflows.
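Workflows are defined as Airflow DAGs in Python; a minimal sketch (the DAG ID and task commands are made up) of the kind of file Cloud Composer picks up from the environment's DAGs folder:

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="daily_export",
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        extract = BashOperator(task_id="extract", bash_command="echo extracting")
        load = BashOperator(task_id="load", bash_command="echo loading")
        extract >> load  # load runs only after extract succeeds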
What do you need to do when migrating from on-premises Hadoop and Spark to GCP?
Do it incrementally.
Migrate HBase to Bigtable.
Manage the synchronization between on-premises and cloud data during the migration.
What is Compute Engine?
It's like EC2; you have complete control over the virtual machine instances.
What is GKE?
Kubernetes is a container orchestration system, and Kubernetes Engine is a managed Kubernetes service. With Kubernetes Engine, Google maintains the cluster and assumes responsibility for installing and configuring the Kubernetes platform on the cluster. Kubernetes Engine deploys Kubernetes on managed instance groups.