Cloud Big Data Platform Flashcards
What does serverless mean?
you don’t have to worry about provisioning Compute Instances to run your jobs. The services are fully managed, and you pay only for the resources you consume.
What is Cloud Dataproc?
a fast, easy, managed way to run Hadoop, Spark, Hive, and Pig on Google Cloud Platform. All you have to do is request a Hadoop cluster. It will be built for you in 90 seconds or less, on top of Compute Engine virtual machines whose number and type you control. If you need more or less processing power while your cluster is running, you can scale it up or down.
How can you monitor your Hadoop / Dataproc cluster?
Stackdriver
what is a benefit for running Dataproc over on-premise Hadoop?
Running on-premises, Hadoop jobs requires a capital hardware investment.
Running these jobs in Cloud Dataproc, allows you to only pay for hardware resources used during the life of the cluster you create
What tools can you leverage once your data is in Dataproc?
you can use Spark and Spark SQL to do data mining. And you can use MLib, which is Apache Spark’s machine learning libraries to discover patterns through machine learning.
What is Cloud Dataflow?
It’s both a unified programming model and a managed service and it lets you develop and execute a big range of data processing patterns: extract, transform, and load batch computation and continuous computation
Cloud Dataflow is used to build ______.
Pipelines
Bigquery can stream in data a rate of _______
100,000 rows per second
How are you charged for Bigquery?
Compute & storage are separate. Pay for the compute only when queries are running and pay for storage separately.
What does Pub/Sub stand for?
Publishers and Subscribers.
How many messages can Pub/Sub scale to ?
one million messages per second and beyond
What is Cloud Datalab ?
It runs in a Compute Engine virtual machine. To get started, you specify the virtual machine type you want and what GCP region it should run in. When it launches, it presents an interactive Python environment that’s ready to use
Cloud Datalab is integrated with?
. It’s integrated with BigQuery, Compute Engine, and Cloud Storage, so accessing your data doesn’t run into authentication hassles
Cloud Machine Learning Platform provides ______ .
modern machine learning services with pre-trained models and a platform to generate your own tailored models
Cloud machine learning falls into what two categories generally?
Based on structured data, you can use ML for various kinds of classification and regression tasks like customer churn analysis, product diagnostics and forecasting. It can be the heart of a recommendation engine for content personalization and cross-sells and up-sells
Unstructured data, you can use ML for image analytics such as identifying damaged shipment, identifying styles and flagging content. You can do text analytics too, like a call center, blog analysis, language identification, topic classification and sentiment analysis.