Big Data Solutions Flashcards by Kaitlyn Spaeth

What is Cloud Pub/Sub?

A fully managed messaging service for ingesting and delivering events, commonly used in data pipelines

Which services align with the Cloud Dataflow pipeline?

Cloud Dataflow is based on Apache Beam

Which services align with the Cloud Dataproc pipeline?

Cloud Dataproc is used to run Apache Spark and Hadoop clusters

What is BigQuery?

BigQuery is a fully managed analytics service for analyzing large amounts of data

In what language are BigQuery queries written?

SQL
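
For a feel of what such a query looks like, here is a minimal sketch. It runs against Python's built-in sqlite3 module purely as a local stand-in, not against BigQuery itself; the table name `readings` and its columns are hypothetical, and BigQuery's SQL dialect differs from SQLite's in places.

```python
import sqlite3

# In-memory SQLite database used only as a local stand-in for a SQL engine.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (device TEXT, temp_c REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [("a1", 20.0), ("a1", 22.0), ("b2", 30.0)],
)

# A typical analytics-style query: aggregate per group.
query = """
    SELECT device, AVG(temp_c) AS avg_temp_c
    FROM readings
    GROUP BY device
    ORDER BY device
"""
rows = conn.execute(query).fetchall()
conn.close()
```

The same SELECT / GROUP BY / ORDER BY shape is what an analyst would typically write when summarizing a large table.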

What services can Cloud Pub/Sub integrate with?

Cloud Logs, Cloud API, Cloud Dataflow, Cloud Storage, and Compute Engine

What is the primary difference between Cloud Dataflow and Cloud Dataproc?

Cloud Dataflow is serverless, whereas in Cloud Dataproc you must provision your own cluster

What types of instances are available for Cloud Dataproc jobs?

Standard Compute Engine instances and preemptible instances

What is Cloud IoT Core?

Cloud IoT Core is a fully managed Google service that offers secure connections, management, and ingestion of data from IoT devices

What type of Pub/Sub protocol does Cloud IoT Core use?

It supports both MQTT and HTTP; MQTT, a publish/subscribe protocol, is typically the more effective choice

What can you do with Cloud IoT Core?

You can register, configure, update, and control IoT devices

How much data can be loaded into BigQuery?

BigQuery scales to petabytes of data; tables live inside datasets, so at least one dataset must exist before data is loaded

What is a publisher?

An application that can create and send messages to a topic

What is a topic?

A topic is a resource to which messages are sent by publishers

What is a message?

Data a publisher will send to a topic (data in transit)

What is a subscription?

A subscription represents the stream of messages from a single topic to be delivered to the subscribing application
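
To see how publisher, topic, message, and subscription fit together, here is a toy in-memory model. It is only a sketch: `ToyBroker` and its methods are hypothetical names invented for illustration, not the Cloud Pub/Sub client library, and it ignores acknowledgements, retries, and ordering.

```python
from collections import defaultdict, deque


class ToyBroker:
    """In-memory stand-in for a Pub/Sub service (not the Google client library)."""

    def __init__(self):
        # topic name -> {subscription name -> queue of undelivered messages}
        self._topics = defaultdict(dict)

    def create_subscription(self, topic, subscription):
        # A subscription is attached to exactly one topic.
        self._topics[topic][subscription] = deque()

    def publish(self, topic, message):
        # A publisher sends a message to a topic; every subscription
        # on that topic receives its own copy.
        for queue in self._topics[topic].values():
            queue.append(message)

    def pull(self, topic, subscription):
        # A subscriber drains the messages waiting on its subscription.
        queue = self._topics[topic][subscription]
        messages = list(queue)
        queue.clear()
        return messages


broker = ToyBroker()
broker.create_subscription("sensor-readings", "dashboard")
broker.create_subscription("sensor-readings", "archiver")
broker.publish("sensor-readings", b"temp=21.5")

dashboard_messages = broker.pull("sensor-readings", "dashboard")
archiver_messages = broker.pull("sensor-readings", "archiver")
```

Both subscriptions receive the same message independently, which is why one topic can feed several downstream applications.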

What are common uses for Cloud Pub/Sub?

Distributed event notifications, balancing workloads, and logging

What is a data pipeline?

A pipeline is code that defines how data should be processed

What is Cloud Dataflow typically used for?

Cloud Dataflow ingests data from Cloud Pub/Sub and transforms it into the data that we need to use as a part of the data pipeline
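
The idea of "ingest, then transform" can be sketched in plain Python. This is not Apache Beam or Cloud Dataflow code; the function names, the message format, and the `temp_c` field are all assumptions made up for the example, with a list of byte strings standing in for messages arriving from Pub/Sub.

```python
import json


def parse(raw_messages):
    # Decode each raw message (bytes) into a dictionary.
    for raw in raw_messages:
        yield json.loads(raw.decode("utf-8"))


def keep_valid(records):
    # Drop records that lack a temperature reading.
    for record in records:
        if record.get("temp_c") is not None:
            yield record


def to_fahrenheit(records):
    # Add a derived field to each record.
    for record in records:
        yield {**record, "temp_f": record["temp_c"] * 9 / 5 + 32}


def run_pipeline(raw_messages):
    # The pipeline is just the chain of steps applied in order.
    return list(to_fahrenheit(keep_valid(parse(raw_messages))))


incoming = [
    b'{"device": "a1", "temp_c": 20}',
    b'{"device": "b2", "temp_c": null}',
    b'{"device": "c3", "temp_c": 100}',
]
results = run_pipeline(incoming)
```

Each step takes records in and yields records out, so steps can be added, removed, or reordered without touching the others.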

What should you do if data being uploaded into BigQuery is only being used temporarily?

Tables in BigQuery can be given an expiration to cut down on storage costs; a default table expiration can be set when the dataset is created

How long must data be in BigQuery for its storage price to drop?

Data must sit in BigQuery unedited for 90 consecutive days before its storage cost drops by about 50%
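
A quick worked example of the 90-day rule, as a rough sketch. The per-GB price used here is a hypothetical placeholder rather than a real rate, and the function name is invented for illustration.

```python
LONG_TERM_THRESHOLD_DAYS = 90  # from the card above
LONG_TERM_DISCOUNT = 0.5       # storage cost drops by about 50%


def monthly_storage_cost(size_gb, days_since_last_edit, active_price_per_gb):
    """Estimate a table's monthly storage cost.

    active_price_per_gb is a hypothetical placeholder; check current pricing.
    """
    price = active_price_per_gb
    if days_since_last_edit >= LONG_TERM_THRESHOLD_DAYS:
        # Unedited long enough: apply the long-term discount.
        price *= 1 - LONG_TERM_DISCOUNT
    return size_gb * price


recent = monthly_storage_cost(1000, days_since_last_edit=10, active_price_per_gb=0.02)
untouched = monthly_storage_cost(1000, days_since_last_edit=120, active_price_per_gb=0.02)
```

With these made-up numbers, the table that has gone 120 days without an edit costs half as much to store as the recently edited one.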

What is an advantage of using Cloud Dataproc clusters?

Clusters can exist only for a job's lifetime and are therefore cost-effective

What types of machines are available in a Cloud Dataproc cluster?

Master nodes, worker nodes, and preemptible worker nodes

What type of cluster configuration in Cloud Dataproc has one master node and N worker nodes?

Standard; if there is a compute failure, in-flight jobs will fail and the file system will be inaccessible until the master node reboots

What type of cluster configuration in Cloud Dataproc includes 3 master nodes and N worker nodes?

High Availability; designed to allow uninterrupted operations in the event of a Compute Engine failure

What type of cluster configuration for Cloud Dataproc combines both master and worker nodes?

Single Node; not suitable for large-scale data processing and should be used for PoC or small-scale, non-critical data processing
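
The three cluster configurations above can be summarized as node counts. This is only a memory aid: the mode strings, `ClusterShape`, and `cluster_shape` are hypothetical names, not Cloud Dataproc API values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterShape:
    masters: int
    workers: int


def cluster_shape(mode, n_workers=2):
    """Return node counts for the three modes described in the cards above."""
    if mode == "standard":
        return ClusterShape(masters=1, workers=n_workers)
    if mode == "high_availability":
        return ClusterShape(masters=3, workers=n_workers)
    if mode == "single_node":
        # One machine plays both the master and worker roles.
        return ClusterShape(masters=1, workers=0)
    raise ValueError(f"unknown mode: {mode}")


standard = cluster_shape("standard", n_workers=4)
high_availability = cluster_shape("high_availability", n_workers=4)
single_node = cluster_shape("single_node")
```

Only the master count changes between Standard and High Availability, which is what lets the latter keep running when one master fails.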

Which components of the Apache Hadoop ecosystem are automatically installed on a Cloud Dataproc cluster?

Apache Spark, Apache Hadoop, Apache Pig, Apache Hive, Python, Java, and the Hadoop Distributed File System

What is MQTT?

MQTT is a Publish/Subscribe protocol that is often used with devices because it is data focused (often considered better for IoT jobs)

What is HTTP in relation to Pub/Sub?

HTTP is a connectionless request/response protocol: devices do not maintain a connection to Cloud IoT Core but instead send requests and receive responses. It is considered document focused.

How do devices communicate with Cloud IoT Core?

They communicate via a protocol bridge, which provides MQTT and HTTP protocol endpoints, automatic load balancing, and global data access via Pub/Sub

Big Data Solutions Flashcards

(30 cards)