Big Data Solutions Flashcards
What is Cloud Pub/Sub?
A fully managed messaging and event service used in data pipelines
Which services align with the Cloud Dataflow pipeline?
Cloud Dataflow is based on Apache Beam
Which services align with the Cloud Dataproc pipeline?
Cloud Dataproc is used for Apache Spark and Hadoop clusters
What is BigQuery?
BigQuery is a fully managed analytics service used to analyze large amounts of data
In what language are BigQuery queries written?
SQL
What services can Cloud Pub/Sub integrate with?
Cloud Logs, Cloud API, Cloud Dataflow, Cloud Storage, and Compute Engine
What is the primary difference between Cloud Dataflow and Cloud Dataproc?
Cloud Dataflow is serverless, whereas in Cloud Dataproc you must provision your own cluster
What types of instances are available for Cloud Dataproc jobs?
Compute Engine instances, preemptible instances
What is Cloud IoT Core?
Cloud IoT Core is a fully managed Google service that offers secure connections, management, and ingestion of data from IoT devices
What type of Pub/Sub protocol does Cloud IoT Core use?
It can use both MQTT and HTTP, although MQTT is typically the more effective of the two
What can you do with Cloud IoT Core?
You can register, configure, update, and control IoT devices
How much data can be loaded into BigQuery?
BigQuery can scale to petabytes of data; every table must belong to a dataset, so at least one dataset must always exist
What is a publisher?
An application that can create and send messages to a topic
What is a topic?
A topic is a resource to which messages are sent by publishers
What is a message?
Data a publisher will send to a topic (data in transit)
What is a subscription?
A subscription is the stream of messages from a single topic that is delivered to the subscribing application
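The four terms above (publisher, topic, message, subscription) can be tied together with a toy in-memory model. This is only a sketch of the concepts, not the Cloud Pub/Sub client library; the class and method names (Topic, Subscription, subscribe, publish, pull) are made up for illustration.

```python
from collections import deque


class Subscription:
    """Stream of messages from one topic for one subscribing application."""

    def __init__(self, name):
        self.name = name
        self._queue = deque()

    def deliver(self, message):
        self._queue.append(message)

    def pull(self):
        # Return and remove every message waiting on this subscription.
        messages = list(self._queue)
        self._queue.clear()
        return messages


class Topic:
    """Named resource that publishers send messages to."""

    def __init__(self, name):
        self.name = name
        self._subscriptions = []

    def subscribe(self, name):
        subscription = Subscription(name)
        self._subscriptions.append(subscription)
        return subscription

    def publish(self, message):
        # Each subscription attached to the topic receives its own copy.
        for subscription in self._subscriptions:
            subscription.deliver(message)


topic = Topic("orders")
billing = topic.subscribe("billing")
shipping = topic.subscribe("shipping")
topic.publish(b"order-1 created")  # the publisher's side
billing_messages = billing.pull()
shipping_messages = shipping.pull()
```

Both subscriptions receive the same message independently, which is what makes a topic useful for distributing one event to several applications.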
What are common uses for Cloud Pub/Sub?
Distributed event notifications, balancing workloads, and logging
What is a data pipeline?
A pipeline is a piece of code that determines how we wish to process our data
What is Cloud Dataflow typically used for?
Cloud Dataflow ingests data from Cloud Pub/Sub and transforms it into the data that we need to use as a part of the data pipeline
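The idea of a pipeline as "code that determines how we process our data" can be sketched in plain Python. This is not Apache Beam or the Cloud Dataflow API; the function names and the "sensor,value" message format are assumptions chosen for the example.

```python
def parse(lines):
    # Transform 1: split "sensor,value" text into (sensor, float) pairs.
    for line in lines:
        sensor, value = line.split(",")
        yield sensor, float(value)


def keep_valid(readings):
    # Transform 2: drop readings outside an assumed plausible range.
    for sensor, value in readings:
        if 0.0 <= value <= 100.0:
            yield sensor, value


def average_per_sensor(readings):
    # Transform 3: aggregate to one mean value per sensor.
    totals = {}
    for sensor, value in readings:
        count, total = totals.get(sensor, (0, 0.0))
        totals[sensor] = (count + 1, total + value)
    return {sensor: total / count for sensor, (count, total) in totals.items()}


def run_pipeline(source):
    # The pipeline is the chain of transforms applied to the ingested data.
    return average_per_sensor(keep_valid(parse(source)))


raw_messages = ["a,10", "a,30", "b,50", "b,999"]  # stand-in for ingested messages
result = run_pipeline(raw_messages)
```

Here the out-of-range reading for sensor "b" is dropped before averaging, so each stage shapes the data a little closer to what the consumer needs.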
What should you do if data being uploaded into BigQuery is only being used temporarily?
Tables in BigQuery can be given a table expiration when they are created, which cuts down on storage costs
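An expiration is simply a timestamp after which the table is removed. The sketch below only computes that timestamp with the standard library; the helper name expiration_time and the 7-day retention are assumptions, and no BigQuery client is involved.

```python
from datetime import datetime, timedelta, timezone


def expiration_time(created_at, keep_days):
    """Return the moment a temporary table should expire."""
    return created_at + timedelta(days=keep_days)


created = datetime(2024, 1, 1, tzinfo=timezone.utc)
expires = expiration_time(created, keep_days=7)  # hypothetical 7-day retention
```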
How long must data be in BigQuery for its storage price to drop?
Data must sit in BigQuery unedited for 90 days before its storage cost drops by about 50%
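The 90-day rule can be worked through as simple arithmetic. The sketch below assumes a made-up price per GB per month purely for illustration; only the 90-day threshold and the roughly 50% reduction come from the card above.

```python
LONG_TERM_AFTER_DAYS = 90   # days unedited before the lower price applies
LONG_TERM_DISCOUNT = 0.5    # storage price drops by about half


def monthly_storage_cost(size_gb, days_unmodified, active_price_per_gb):
    """Estimate one month's storage cost for a single table."""
    price = active_price_per_gb
    if days_unmodified >= LONG_TERM_AFTER_DAYS:
        price *= 1 - LONG_TERM_DISCOUNT
    return size_gb * price


HYPOTHETICAL_PRICE = 0.02  # assumed rate per GB per month, not a quoted price
recent = monthly_storage_cost(1000, 30, HYPOTHETICAL_PRICE)
aged = monthly_storage_cost(1000, 90, HYPOTHETICAL_PRICE)
```

With these assumed numbers, the same 1000 GB table costs half as much per month once it has gone 90 days without edits.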
What is an advantage of using Cloud Dataproc clusters?
Clusters are only used for the job's lifetime and are therefore cost effective
What types of machines are available in a Cloud Dataproc cluster?
Master nodes, worker nodes, and preemptible worker nodes
What type of cluster configuration in Cloud Dataproc has one master node and N worker nodes?
Standard; if there is a Compute Engine failure, in-flight jobs will fail and the file system will be inaccessible until the master node reboots
What type of cluster configuration in Cloud Dataproc has 3 master nodes and N worker nodes?
High Availability; designed to allow uninterrupted operations in the event of a Compute Engine failure
What type of cluster configuration for Cloud Dataproc combines both master and worker nodes?
Single Node; not suitable for large data processing and should be used for PoC or small-scale, non-critical data processing
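The three cluster configurations above differ only in how many master nodes they have and whether separate worker nodes exist. The lookup below is a study aid, not a Cloud Dataproc API; the mode keys and the node_counts helper are hypothetical names.

```python
CLUSTER_MODES = {
    # mode: (master nodes, are separate worker nodes used?)
    "standard": (1, True),
    "high_availability": (3, True),
    "single_node": (1, False),  # one machine acts as both master and worker
}


def node_counts(mode, workers=2):
    """Return (masters, workers) for a mode; `workers` is N from the cards."""
    masters, has_workers = CLUSTER_MODES[mode]
    return masters, (workers if has_workers else 0)


standard = node_counts("standard", workers=4)
high_availability = node_counts("high_availability", workers=4)
single_node = node_counts("single_node")
```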
What components from the Apache Hadoop ecosystem are automatically installed on a Cloud Dataproc cluster?
Apache Spark, Apache Hadoop, Apache Pig, Apache Hive, Python, Java, and the Hadoop Distributed File System
What is MQTT?
MQTT is a Publish/Subscribe protocol that is often used with devices because it is data focused (often considered better for IoT jobs)
What is HTTP in relation to Pub/Sub?
HTTP is a connectionless protocol: devices do not maintain a connection to Cloud IoT Core but instead send requests and receive responses. It is considered document focused.
How do protocols communicate with Cloud IoT Core?
Devices communicate via a protocol bridge, which provides MQTT and HTTP protocol endpoints, automatic load balancing, and global data access via Pub/Sub