GCP Professional Data Engineer Cert Flashcards

1
Q

Relational Databases

A

Tables have defined relationships with one another

Google Cloud SQL: managed SQL instances with little setup required; multiple database engines (e.g. MySQL); scalability and availability with vertical scaling up to 64 cores; secured via the Cloud SQL Proxy, SSL/TLS, or private IPs; maintenance windows, automated backups, and point-in-time recovery
Importing MySQL data: InnoDB mysqldump export/import, CSV import, or external replica promotion (requires binary log retention)
PostgreSQL instances are another option: automated maintenance and high availability, though some PostgreSQL features are unsupported
Importing PostgreSQL data: SQL dump export/import, CSV import

2
Q

Cloud Firestore

A
  1. Fully managed NoSQL document database-serverless with autoscaling
  2. Realtime DB with mobile SDKs, Android and iOS client libraries, and frameworks for popular programming languages
  3. Strong scalability and consistency-horizontal autoscaling
    Multiple documents are bundled into a collection
    Documents can contain subcollections (e.g. messages within a chat document)
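A minimal sketch of writing and reading a document with the Python client library (project, collection, and field names are hypothetical):

```python
# pip install google-cloud-firestore
from google.cloud import firestore

# Assumes default credentials; project ID is hypothetical.
db = firestore.Client(project="my-project")

# A document inside the "users" collection.
doc_ref = db.collection("users").document("alice")
doc_ref.set({"name": "Alice", "signup_year": 2023})

# A "messages" subcollection under that document.
doc_ref.collection("messages").add({"text": "hello"})

# Read the document back.
snapshot = doc_ref.get()
print(snapshot.to_dict())
```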
3
Q

Cloud Spanner

A
  1. Managed, SQL-compliant database-SQL schemas and queries with ACID transactions
  2. Horizontally scalable: strong consistency across rows and regions, from 1 to 1,000s of nodes
  3. Highly available-automatic global replication, no planned downtime, and a 99.999% SLA

High Cost
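A minimal sketch of querying Spanner with the Python client (instance, database, and table names are hypothetical):

```python
# pip install google-cloud-spanner
from google.cloud import spanner

client = spanner.Client(project="my-project")   # hypothetical project
instance = client.instance("test-instance")     # hypothetical instance
database = instance.database("example-db")      # hypothetical database

# Strongly consistent read using a read-only snapshot.
with database.snapshot() as snapshot:
    results = snapshot.execute_sql("SELECT SingerId, FirstName FROM Singers")
    for row in results:
        print(row)
```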

4
Q

CAP Theorem

A

Consistency-every read reflects the most recent write, according to specific rules. Availability-the system is always available to serve queries. Partition Tolerance-the system keeps working despite the loss of network connectivity between parts (partitions) of the system
In practice a distributed system can only guarantee two of the three at once
Spanner is strongly consistent and highly available; in rare failure cases it chooses consistency over availability. It runs on Google's global private network and offers five 9s (99.999%) of availability

5
Q

Cloud Spanner Architecture

A

An instance is an allocation of resources: an instance configuration (regional or multi-regional) and an initial number of nodes
Regional configuration: a region contains multiple zones, with one replica of your data per zone. For each instance you specify a node count (starting at 1); each node provides compute capacity, served by virtual machines, to every replica. Increasing the node count adds machines and therefore computing power; the number of replicas stays the same while the compute serving them grows. In effect, each node spans the replicas across the different zones

6
Q

Cloud Memorystore

A

In memory database
1. Fully managed Redis Instance-provisioning, replication, failover-fully automated
2. Basic tier: an efficient cache, but the application must tolerate a cold restart and a full data flush
3. Standard tier-adds cross-zone replication and automatic failover
Benefits-no need to provision your own VMs, scale instances with minimal impact, private IPs and IAM, automatic replication and failover
Creating an instance: version 3.2 or 4, choose service tier and region, memory capacity 1-300 GB (determines network throughput), add configuration parameters
Connecting to instances: Compute Engine, Kubernetes Engine, App Engine, Cloud Functions (serverless VPC connector)
Import and export: export to RDB backup (BETA); admin operations not permitted during export, may increase latency, RDB file written to Cloud Storage
Import from RDB backup: overwrites all current instance data; instance unavailable during the import process
Use cases: Redis works as a session cache (common uses are logins and shopping carts), as a message queue to enable loosely coupled services, or as a lightweight Pub/Sub message broker
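Memorystore exposes a standard Redis endpoint, so any Redis client works. A sketch using the Python redis package from a client in the same VPC; the host IP is a placeholder:

```python
# pip install redis
import redis

# Host is the Memorystore instance's private IP (placeholder value).
r = redis.Redis(host="10.0.0.3", port=6379)

# Session-cache style usage: store a value with a TTL.
r.set("session:alice", "logged-in", ex=3600)
print(r.get("session:alice"))

# Simple queue-style usage with a Redis list.
r.lpush("task-queue", "resize-image-42")
print(r.rpop("task-queue"))
```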

7
Q

Comparing Storage Options

A
  1. Is the data structured or unstructured? Structured: SQL data, NoSQL data, analytics data, keys and values. Unstructured: binary blobs, videos, images, proprietary files-for unstructured data use Cloud Storage
  2. Will the data be used for analytics? Low latency vs warehouse. Low latency: petabyte scale, single-key rows, time-series or IoT data-choose Cloud Bigtable. Warehouse: petabyte scale, analytics warehouse, SQL queries-choose BigQuery
  3. Is the data relational? Horizontal vs vertical scaling. Horizontal scaling: ANSI SQL, global replication, high availability and consistency; expensive, but worth it if the client can afford it (most financial institutions would use this)-choose Cloud Spanner. Vertical scaling: MySQL or PostgreSQL, managed service, high availability-choose Cloud SQL
  4. Is the data non-relational? NoSQL vs key/value. NoSQL: fully managed document database, strong consistency, mobile SDKs and offline data-choose Cloud Firestore. Key/value: managed Redis instances, does what Redis does-choose Cloud Memorystore
8
Q

Streaming

A

Continuous collection of data, near real time analytics, windows and micro batches

9
Q

Batch

A

Data gathered within a defined time window, large volumes of data, data from legacy systems

10
Q

No-SQL

A

Anything that is not SQL: key/value stores, JSON document stores, and tools like MongoDB and Cassandra

11
Q

SQL

A

Tabular data stored in rows; relational-tables can be connected to other tables in queries

12
Q

On-Line Analytical Processing (OLAP)

A

Low volume of long running queries
Aggregated historical data-purchasing analytics

13
Q

On-Line Transactional Processing (OLTP)

A

High volume of short transactions, high integrity, SQL
Modifies the database

14
Q

Defines Big Data

A
  1. Volume: Scale of information being handled by data processing systems
  2. Velocity: Speed at which data is being processed, ingested, analyzed, and visualized
  3. Variety: The diversity of data sources, formats, and quality.
15
Q

Map Reduce

A

A programming model-Map and Reduce functions
Distributed Implementation
Created at Google to process and generate large datasets across clusters of commodity hardware

16
Q

Map Function

A

Takes an input key/value pair and produces a set of intermediate key/value pairs

17
Q

Reduce Function

A

Merges intermediate values associated with the same intermediate key, forms a smaller set of values
The implementation abstracts away the details of distributed computing: parallelizing and executing-partitioning, scheduling, and fault tolerance
The input is split into small chunks that are processed in parallel
Master and worker cluster model
Failed worker jobs reassigned
Worker files buffered to local disk
Partitioned output files
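A toy illustration of the model (plain Python, not Hadoop code): the map function emits intermediate key/value pairs, and the reduce function merges the values for each key.

```python
from collections import defaultdict

def map_fn(document):
    # Emit an intermediate (word, 1) pair for every word.
    for word in document.split():
        yield word, 1

def reduce_fn(word, counts):
    # Merge all intermediate values for the same key.
    return word, sum(counts)

documents = ["the cat sat", "the dog sat"]

# Shuffle phase: group intermediate values by key.
grouped = defaultdict(list)
for doc in documents:
    for word, count in map_fn(doc):
        grouped[word].append(count)

# Reduce phase: one call per unique key.
print([reduce_fn(w, c) for w, c in grouped.items()])
# [('the', 2), ('cat', 1), ('sat', 2), ('dog', 1)]
```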

18
Q

Hadoop and HDFS

A

Named after a toy elephant; inspired by the Google File System; originated as a sub-project of Apache Nutch that began in 2006

Modules: Hadoop Common-base module with libraries, utilities, and startup scripts; Hadoop Distributed File System (HDFS)-a distributed, fault-tolerant file system that runs on commodity hardware as part of a Hadoop cluster; Hadoop YARN-handles resource management tasks like job scheduling and monitoring for Hadoop jobs; Hadoop MapReduce-Hadoop's own implementation of the MapReduce model, which includes libraries for map and reduce functions, partitioning, reduction, and custom job configuration parameters

19
Q

HDFS Architecture-can help with Cloud Dataproc

A

One server runs the Name Node, which holds the metadata
Other servers run Data Nodes, which store very large files across the cluster as a series of blocks
Racks sit between the servers in the cluster so that the shortest possible network path can be used
The client can make multiple requests to the Name Node and then read data from multiple Data Nodes across racks
Blocks are replicated across servers for fault tolerance
The YARN architecture is similar: one server runs the Resource Manager and workers run a Node Manager. The client sends jobs to the Resource Manager; on individual workers the Node Manager process handles local resources, requests tasks from the master, and returns the results

20
Q

Apache Pig-A high level framework for running MapReduce jobs on Hadoop clusters

A

Platform for analyzing large datasets
Pig Latin defines analytics jobs: merging, filtering, and transformation-high level, with SQL-like simplicity
Good for ETL jobs since it has a procedural data flow
It is an abstraction over MapReduce
Pig compiles your instructions into MapReduce jobs, which are then sent to Hadoop for parallel processing across the cluster

21
Q

Apache Spark

A

MapReduce's linear flow of data was an issue-reading data, mapping across it, reducing the results, and writing to disk on every pass

Apache Spark is a general-purpose cluster-computing framework-it allows concurrent computational jobs to be run across massive datasets
It uses resilient distributed datasets (RDDs) and treats the working set as a form of distributed shared memory

22
Q

Spark Modules

A

Spark SQL-structured data in spark stored in abstraction, programmatic querying-data frames API
Spark Streaming-streaming data ingestion in addition to batch processing-very small batches
MLlib-machine learning library, machine learning algorithms-classification, regression, decision trees
GraphX-iterative graph computation
Supported languages: Python, Java, Scala, R, SQL
MUST have two things: a cluster manager (e.g. YARN or Kubernetes) and a distributed storage system (e.g. HDFS, Apache HBase, Cassandra)
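A minimal PySpark sketch combining the DataFrame API and Spark SQL (file path and column names are hypothetical); on Dataproc this would be submitted as a Spark job.

```python
# pyspark is pre-installed on Dataproc clusters
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("orders-demo").getOrCreate()

# Hypothetical CSV with columns: customer, amount
df = spark.read.csv("gs://my-bucket/orders.csv", header=True, inferSchema=True)

# DataFrame API aggregation...
df.groupBy("customer").sum("amount").show()

# ...or the same thing with Spark SQL.
df.createOrReplaceTempView("orders")
spark.sql("SELECT customer, SUM(amount) FROM orders GROUP BY customer").show()

spark.stop()
```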

23
Q

Hadoop vs Spark

A

Hadoop: Slow disk storage, high latency, slow, reliable batch processing
Spark: Fast memory storage, low latency, stream processing, 100x faster in-memory, 10x faster on disk, more expensive

24
Q

Apache Kafka

A

Publish/subscribe to streams of records
Like a message bus but for data
High throughput and low-latency-ingesting millions events through devices
Ex: Handling >800 Billion messages a day at LinkedIn
Four main APIs in Kafka: Producer-allows an app to stream records to a Kafka topic. Consumer-allows an app to subscribe to one or more topics and process the stream of records contained within. Streams-allows an application to act as a stream processor itself, transforming data and sending it back to Kafka. Connector-reusable producers and consumers that connect Kafka topics to existing applications and data systems

25
Q

Kafka vs Pub/Sub

A

Kafka: Guaranteed message ordering, tunable message retention, polling(Pull) subscriptions only, unmanaged
Pub/Sub: No message ordering guaranteed, 7 day maximum message retention, pull or push subscription, managed

26
Q

Pub Sub Intro

A

Message Bus takes care of all messages between devices
Pub/Sub splits it in different topics-anything can publish a message to a topic or choose to receive a message from a topic.
Information from users/apps are published to a topic
Decoupling producers and consumers through topics introduces resilience-Pub/Sub acts as a shock absorber
Cloud Pub/Sub: global messaging and event ingestion, serverless and fully managed, 500 million messages per second, 1 TB/s of data
Pub/Sub Great Features-Multiple publisher/subscriber patterns, at least once delivery, real time or batch, integrates with Cloud Dataflow

27
Q

Use Case Distributing Workloads Pub/Sub

A

Queue up a large number of tasks in a Pub/Sub topic and distribute them among multiple workers, such as Compute Engine instances

28
Q

Asynchronous Workflows Pub/Sub

A

Controls the order of events: an order can be published to a topic and consumed by a worker system such as invoicing, before being passed into a queue for the next system, such as packaging and posting, to consume

29
Q

Distributing Event Notifications Pub/Sub

A

A system sets up new users when they register with your service: a registration publishes a message, and the system is notified to set the user up
Distributed logging: logs can be sent to a Pub/Sub topic to be consumed by multiple subscribers, such as a monitoring system and an analytics database for later querying

30
Q

Device Data Streaming Pub/Sub

A

Hundreds of thousands or more internet-connected devices can stream their data into Pub/Sub topics so that it can be consumed on demand by your analytics tools, or transformed through Dataflow first

31
Q

One-to One Pub/Sub

A

There is a Publisher, Topic, and then a Subscriber
Publisher sends messages to the topic in Pub/Sub
The subscriber receives the messages and reads them through their own subscription

32
Q

Many to many Pub/Sub

A

Just like the one-to-one pattern but this has multiple topics

33
Q

Publishing Messages

A

Create a message containing your data, JSON payload that’s base64 encoded, size of payload 10MB or less, then send payload as a request to the Pub/Sub API-specify the topic the message should be published on
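A sketch of publishing with the Python client library (project and topic names are hypothetical); the client handles the encoding, so you just pass bytes.

```python
# pip install google-cloud-pubsub
import json
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "my-topic")  # hypothetical names

# Message data must be bytes (10 MB or less).
payload = json.dumps({"user": "alice", "action": "signup"}).encode("utf-8")

# Optional attributes are passed as keyword arguments.
future = publisher.publish(topic_path, payload, origin="web")
print("Published message ID:", future.result())
```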

34
Q

Receiving Messages

A

Create a subscription to a topic; subscriptions are always associated with a single topic. Pull delivery is the default: make ad hoc pull requests to the Pub/Sub API, specifying your subscription, and acknowledge each message you receive or it will be redelivered. Push delivery sends messages to an endpoint-the endpoint must be HTTPS with a valid SSL certificate and accept POST requests
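A sketch of a pull subscriber using the Python client's streaming pull (project and subscription names are hypothetical); each message must be acknowledged or it is redelivered.

```python
# pip install google-cloud-pubsub
from concurrent.futures import TimeoutError
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
sub_path = subscriber.subscription_path("my-project", "my-sub")  # hypothetical

def callback(message):
    print("Received:", message.data)
    message.ack()  # acknowledge, or the message will be redelivered

# Streaming pull runs in the background until cancelled.
streaming_pull = subscriber.subscribe(sub_path, callback=callback)
try:
    streaming_pull.result(timeout=30)
except TimeoutError:
    streaming_pull.cancel()
```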

35
Q

Integrations Pub/Sub

A

Client libraries for popular languages like Python, C#, Go, Java, Node, PHP, and Ruby. Cloud Dataflow is supported, and you can use the Apache Beam SDK to read messages in real time or in batches. Cloud Functions and Cloud Run are also supported. Pub/Sub is the foundation of Cloud IoT Core, which sends and receives messages from connected devices
Developing for Pub/Sub: Local Pub/Sub emulator-Google Cloud SDK and Java Runtime Environment 7+

36
Q

Advanced Pub/Sub Topics

A

At Least Once Delivery: Each message is delivered at least once for every subscription
Undelivered Messages: deleted after the message retention duration-default 7 days-can’t be longer
Messages published before a subscription is created will not be delivered to that subscription
Subscriptions expire after 31 days of inactivity-new subscriptions with same name have no relationship to the previous subscription

37
Q

Other Features

A

Seeking feature: set retain-acked-messages to true so the subscription retains messages that have already been acknowledged-by default messages are retained for a maximum of 7 days. You can then tell the subscription to seek to a specific point in time-basically rewinding the clock to receive past messages. You can also seek to a future timestamp
Snapshots: useful if you are deploying new code. You can save a snapshot ahead of time to capture the current state of the subscription, including all unacknowledged and future messages
Ordering messages: you may not receive messages in the right order-use timestamps when final order matters, or consider an alternative for transactional ordering, perhaps through a SQL query
Resource Locations: Messages stored in nearest region, message storage policies allow you to control this, additional egress fees may apply

38
Q

Access Control Pub/Sub

A

Use service accounts for authorization, grant per-topic or per-subscription permission, grant limited access to publish or consume messages

39
Q

Exam Tips Pub/Sub

A

Think about where you can decouple data-Pub/Sub is a shock absorber, receives data globally and it can be consumed by other components at their own pace
Where can you use Pub/Sub for events. It can add event logic to a stack and it can pass events through one system to another
Be aware of Pub/Sub limitations-message data must be 10MB or less, beware of expired messages and unused subscriptions
Look for Apache Kafka in use cases, if this comes up, Pub/Sub can be a good option
Keep an eye out for Cloud IoT as a solution
Google Cloud Tasks-get familiar with it
Browse the reference architectures-Smart Analytics references

40
Q

What is Dataflow

A

Fully managed, serverless tool; uses the open-source Apache Beam SDK; supports expressive SQL, Java, and Python APIs; realtime and batch processing; stack integration
Beam's unified development model lets you reuse code across streaming and batch pipelines
Sources: Cloud Pub/Sub, BigQuery, and Cloud Storage; sources can also be external to GCP, like Kafka
Common sinks: Cloud Storage, BigQuery, and Bigtable; Cloud Machine Learning can be applied to sink data

41
Q

Dataflow Process

A

You have a Pipeline, a Source, and a Sink
The Pipeline takes data from the Source, processes it, and then places it into the Sink
Apache Beam connectors allow you to connect to the Source and the Sink, so you can read input data and then write your output data into the Sink

42
Q

Common Dataflow Sources

A

Cloud Pub/Sub, BigQuery, and Cloud Storage, it can be external to GCP like Kafka

43
Q

Common Dataflow Sinks

A

Cloud Storage, BigQuery, and Bigtable; Cloud Machine Learning can be applied to sink data

44
Q

Driver

A

A program you write using the Apache Beam SDK (Java or Python). It defines your pipeline
Pipeline: the full set of transformations that your data undergoes from initial ingestion to final output
The driver program is handed to a runner for execution

45
Q

Runner

A

Software that manages the execution of your pipeline; a translator for the backend execution framework. It can also manage local execution of driver programs for testing and debugging

46
Q

PCollections

A

Used in pipelines to represent data as it is transformed within the pipeline; a PCollection represents a multi-element dataset. PCollections can represent both batch and streaming data
Data coming from a fixed source gives a Bounded dataset, treated like a batch
Data from a continuously updating source gives an Unbounded dataset (a stream)
A PCollection is usually created by reading from an external source
A Transform represents a step in your pipeline; transforms use PCollections as inputs and outputs. Each transform takes one or more PCollections as input and generates zero or more output PCollections

47
Q

Pipeline Development Cycle

A
  1. You have to Design your pipeline first-input and output methods, structure and transformations
  2. Then you Create it-instantiating a pipeline object, implementing the transformations that were identified
  3. Testing: debugging a failed pipeline execution on a remote system, try to do local unit testing
48
Q

Considerations

A
  1. Start with the location of your data
  2. Input data structure and format
  3. Transformation objectives
  4. Output data structure and location
49
Q

Pipelines Structures

A

Basic: input flows linearly to output
Branching: two different transforms applied to the same PCollection produce two different output PCollections
Branching can also be applied to the output of a Transform
Pipeline branches can be merged; you need to merge all branches of your pipeline at some point, through a Flatten or Join transform
Pipelines can also have multiple sources, and each source can be transformed independently

50
Q

DAG

A

Dataflow Pipelines represent a Directed Acyclic Graph or DAG-a graph with a finite number of vertices and edges-no directed cycles

51
Q

Pipeline Creation

A
  1. Create an Object
  2. Create a PCollection using read or create transform
  3. Apply multiple transforms as required
  4. Write out final PCollection
  5. Execute the pipeline using the pipeline runner
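A minimal Apache Beam sketch of these steps, runnable locally with the DirectRunner (file paths are hypothetical); switching the runner option to DataflowRunner sends the same pipeline to the managed service.

```python
# pip install apache-beam
import apache_beam as beam

with beam.Pipeline() as p:                                  # 1. create the pipeline object
    (p
     | "Read" >> beam.io.ReadFromText("input.txt")          # 2. read into a PCollection
     | "Words" >> beam.FlatMap(lambda line: line.split())   # 3. apply transforms
     | "Pair" >> beam.Map(lambda word: (word, 1))
     | "Count" >> beam.CombinePerKey(sum)
     | "Format" >> beam.Map(lambda kv: f"{kv[0]}: {kv[1]}")
     | "Write" >> beam.io.WriteToText("counts"))            # 4. write the final PCollection
# 5. leaving the `with` block executes the pipeline on the configured runner
```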
52
Q

ParDo

A

generic parallel processing transform: can take an element from PCollection1 and transform it to PCollection2, can output 1, none, or multiple output elements from a single input element
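A small sketch of a ParDo with a user-defined DoFn that can emit zero, one, or many outputs per input element:

```python
import apache_beam as beam

class ExtractWords(beam.DoFn):
    def process(self, element):
        # Zero or more outputs per input element.
        for word in element.split():
            if word:
                yield word.lower()

with beam.Pipeline() as p:
    (p
     | beam.Create(["Hello World", "", "ParDo example"])
     | beam.ParDo(ExtractWords())
     | beam.Map(print))
```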

53
Q

User-defined function(UDF)

A

user written code that describes the operation to apply to each element of the input PCollection

54
Q

Aggregation Transformation

A

The process of computing a single value from multiple input elements, typically performed across all the elements that fall into a single window

55
Q

Characteristics of PCollections

A
  1. Any data type-but all elements must be of the same type
  2. Don’t support random access
  3. Immutable or unchanging
  4. Boundedness-no limit to the number of elements a PCollection can contain-can be Bounded-finite number of elements or Unbounded-does not have an upper limit
  5. Timestamp is associated with every element of a PCollection-initially assigned by the source that results in the creation of the PCollection
56
Q

Core Beam Transforms

A
  1. ParDo-generic parallel processing transform
  2. GroupByKey-processes collection of key value pairs, collects all values associated with a unique key
  3. CoGroupByKey-used when combining multiple PCollections-performs a relational join of two or more key value PCollections where they have the same key type
  4. Combine-requires you to provide a function that defines the logic for combining elements; the function has to be associative and commutative-e.g. sum, min, max
  5. Flatten- merges multiple input PCollections into a single logical PCollection
  6. Partitioning-provides the logic that determines how the elements of the PCollection are split up.
57
Q

Event time Dataflow

A

Event time is when the data element actually occurred, determined by the timestamp on the data element itself. Processing time refers to the times at which the element is processed during its transit through your pipeline

58
Q

Windowing

A

Assigned to a PCollection, subdivides the elements of a PCollection according to their timestamps, do this to allow grouping or aggregating operations over unbounded collections, it groups elements into finite windows
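A sketch of applying fixed one-minute windows to a timestamped PCollection before aggregating (values and timestamps are made up):

```python
import apache_beam as beam
from apache_beam import window

with beam.Pipeline() as p:
    (p
     | beam.Create([("user1", 10, 0), ("user1", 20, 30), ("user2", 5, 90)])
     # Attach event-time timestamps (normally they come from the source).
     | beam.Map(lambda e: window.TimestampedValue((e[0], e[1]), e[2]))
     # Group elements into fixed, non-overlapping 60-second windows.
     | beam.WindowInto(window.FixedWindows(60))
     # Aggregations now run per key, per window.
     | beam.CombinePerKey(sum)
     | beam.Map(print))
```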

59
Q

Fixed Window

A
  1. Fixed-simplest, constant non overlapping time interval
60
Q

Sliding Window

A
  1. Sliding-represent time intervals- but it can overlap, and an element can belong to more than one window-useful to take running averages of data
61
Q

Per Sessions

A

a different session window is created in a stream when there is an interruption in the flow of events which exceeds a certain time period, apply on a per key basis-useful for irregularly distributed data with respect to time

62
Q

Single global

A

The default-everything falls into a single global window unless another window transform is applied

63
Q

Watermark

A

The system's notion of when all the data for a certain window can be expected to have arrived. Data is late when the watermark has moved past the end of the window and further data elements arrive with a timestamp within that window

64
Q

Triggers

A
  1. Event time, event-time based
  2. Processing time
  3. Data driven-when data in a particular window meets a certain criterion
  4. Composite-combine other triggers in different ways
65
Q

Pipeline Access

A

Run Cloud Dataflow pipelines
1. Can be run locally
2. Submit pipeline to GCP Dataflow managed service
GCP service accounts
1. Cloud Dataflow service-uses Dataflow service account
2. Worker instances-Controller service account

66
Q

Cloud Dataflow Managed Service

A
  1. The pipeline gets submitted to the GCP Dataflow Service
  2. The Dataflow will create a Job
  3. The Job creates managers and workers to carry out various tasks
  4. For the execution, the workers need files/resources from Cloud Storage
  5. The Job can be monitored with the Cloud Dataflow Monitoring Interface or the Cloud Dataflow Command-line Interface
67
Q

Cloud Dataflow Service Account

A
  1. Automatically created when Cloud Dataflow project is created
  2. Manipulates job resources
  3. Assumes the Cloud Dataflow service agent role
  4. Has Read/Write Access to project resources
68
Q

Controller Service Account-used by the workers-uses the Compute Engine

A
  1. Compute Engine instances-execute pipeline operations
  2. Run Metadata operations-don’t run on local clients or compute engine workers-determine size of file in Cloud Storage
  3. User-managed controller service account-use resources with fine-grained access control
69
Q

Security Mechanisms

A
  1. Submission of the pipeline-users have to have the right permissions
  2. Evaluation of the pipeline-data is encrypted and not persisted beyond evaluation of the pipeline; communication between workers happens over a private network, subject to the project's permissions and firewalls; you can specify the region and zone
  3. Accessing telemetry or metrics-encrypted at rest-controlled by project’s permissions
  4. You can also use Cloud Dataflow IAM roles
70
Q

Regional Endpoints in Dataflow

A
  1. Manages metadata about Cloud Dataflow jobs
  2. Controls Cloud Dataflow workers
  3. Automatically selects best zone

Good reasons for regional endpoints
1. Security and compliance
2. Data locality
3. Resiliency

71
Q

Machine Learning with Cloud Dataflow

A
  1. Handles data extraction from Cloud Storage
  2. Data Preprocessing in Apache Beam pipeline through Cloud Dataflow, TensorFlow API used to normalize some values between 0 and 1, the Beam partition transform is used to split the data set into the training data set and the evaluation data set
  3. TensorFlow is used to train a model locally on your machine or through Cloud Machine Learning-doesn’t use Cloud Dataflow
  4. Predictions-a Cloud Dataflow pipeline reads data from Pub/Sub, applies the model for predictions, and writes the results into another Pub/Sub topic
72
Q

Benefits of Dataflow

A

You can use customer-managed encryption keys
Batch pipelines can be processed in a cost-effective manner with Flexible Resource Scheduling (FlexRS), which uses advance scheduling, the Cloud Dataflow Shuffle service, and preemptible VMs
Cloud Dataflow is great for migrating MapReduce jobs-on-premises MapReduce jobs can be rebuilt on Cloud Dataflow
Cloud Dataflow works with Pub/Sub Seek to replay and reprocess previously acknowledged messages, especially in bulk

73
Q

Cloud Dataflow SQL

A
  1. Develop and run Cloud Dataflow jobs from the BigQuery web UI
  2. Cloud Dataflow SQL (ZetaSQL variant) integrates with Apache Beam SQL
    Apache Beam SQL-Query bounded and unbounded PCollections, Query is converted to a SQL transform
    Cloud Dataflow SQL-Utilise existing SQL skills, join streams with BigQuery tables, query streams or static datasets, write output to BigQuery for analysis and visualization
74
Q

Dataflow Exam Tips

A

Beam and Dataflow are the preferred solution for data processing pipelines, especially for streaming data
Pipeline: represents the complete set of stages required to read data perform any transformations and write data
PCollection: represents a multi-element dataset that is processed by the Pipeline
ParDo: core parallel processing function of Apache Beam which can transform elements of an input PCollection into an output PCollection.
DoFn: template you use to create user-defined functions that are referenced by a ParDo
Sources-where data is read from
Sinks-where data is written to
Window: allows streaming data to be grouped into finite collections according to time or session-based windows
Watermark: indicates when Dataflow expects all data in a window to have arrived; data arriving past the watermark is considered late
Dataflow is normally the preferred solution for data ingestion pipelines
Cloud Composer is sometimes used for ad hoc orchestration/provide manual control of Dataflow pipelines themselves

75
Q

What is Dataproc?

A

A managed cluster service for Hadoop and Apache Spark
The managed service is preferable because it is low cost and you can control which clusters to grow and which clusters to turn off

76
Q

Dataproc Architecture

A

Master: Dataproc creates a master node that runs the YARN Resource Manager and the HDFS Name Node
It also runs the worker nodes, which run the YARN Node Managers and HDFS Data Nodes
Clusters come pre-installed with Hadoop, Apache Spark, Zookeeper, Hive, Pig, Tez, and other tools like Jupyter Notebooks and the GCS connector
Storage and configuration are handled by Dataproc

77
Q

Dataproc benefits

A
  1. Cluster actions complete in ~90 seconds
  2. Pay-per-second minimum 1 min
  3. Scale up/down or turn off at will
78
Q

Using Dataproc

A

You can submit Hadoop/Spark jobs, Enable autoscaling-if necessary to cope with the load of the job, Output to GCP Services-like Google Cloud Storage, BigQuery and BigTable, you can also Monitor with Stackdriver-fully integrated logging and monitoring for the job performance and output

79
Q

Cluster Location

A

Regional: isolate resources used for Dataproc into one region, like us-east1 or europe-west1
Global: Resources not isolated to a single region-can place cluster in any zone worldwide

80
Q

Single Node Cluster

A

A single VM that runs both the master and the worker processes-can't autoscale

81
Q

Standard Cluster

A

Has a master VM that runs the YARN Resource Manager and the HDFS Name Node, and two worker nodes that run a YARN Node Manager and an HDFS Data Node-the disk is customizable. There are also preemptible workers, which sometimes help with large jobs but can't provide storage for HDFS

82
Q

High Availability Cluster

A

You have three Masters with YARN and HDFS configured to run in high availability mode-no interruptions

83
Q

Submitting Jobs

A
  1. gcloud command line
  2. GCP Console
  3. Dataproc API
  4. SSH to Master Node
84
Q

Monitoring and Logging

A
  1. Use Stackdriver Monitoring to monitor cluster health
  2. Cluster/yarn/allocated_memory_percentage
  3. Cluster/hdfs/storage_utilization
  4. Cluster/hdfs/unhealthy_blocks
85
Q

Custom Clusters

A

You can customize the Dataproc default image: Google provides a generation script, you apply a customization script for the custom packages you need on top of the default image, and the resulting custom image is stored in Google Cloud

You can also have:
Custom cluster properties-so you can change the values
You can add initialization actions that are custom to the cluster-scripts loaded to a Cloud Storage Bucket-mostly for Staging binaries
You can also add custom Java/Scala dependencies-this saves you from precompiling them

86
Q

Autoscaling in Dataproc

A

Huge Bonus: you can create lightweight clusters and have them automatically scale up to the demands of the job-written in YAML, has configuration numbers for primary workers and secondary workers

87
Q

When to not use Autoscaling

A
  1. When the cluster stores data in HDFS
  2. When running Apache Spark Streaming jobs
  3. When clusters are idle (scale down or delete them instead)
  4. When using YARN node labels
88
Q

Workflow Templates

A

Written in YAML that can specify multiple jobs w/ different configs and parameters that can be run in succession
Workflow Templates have to be created, then instantiated with GCloud-you can send jobs to a new cluster each time or to an existing cluster

89
Q

Advanced Compute Features Dataproc

A
  1. Local SSDs-faster runtimes
  2. GPUs to nodes-for machine learning
90
Q

Cloud Storage Connector

A
  1. Use GCS instead of HDFS
  2. Cheaper than persistent disk
  3. High availability and durability
  4. Decouple storage from cluster lifecycle
90
Q

Exam Tips

A

Know when to choose Dataproc: Quickly migrating Hadoop and Spark workloads into Google Cloud Platform
Understand the benefits of Dataproc: Managed over Hadoop or Spark cluster-Ease of scaling, being able to use Cloud Storage instead of HDFS, and the connectors to other GCP services like BigQuery and Bigtable
Know Cluster Options: When to pick standard vs high availability, autoscaling and ephemeral
Get to know the open-source big data ecosystem-Hadoop, Spark, Zookeeper, Hive, Tez, and Jupyter
Know when to choose Dataflow-sometimes it is the preferred product for big data ingesting, like in streaming workloads and it implements the Apache Beam SDK

91
Q

Bigtable Concepts

A

Managed wide-column NoSQL database-series of key value pairs where the values are split into columns
Has very high throughput-around 10,000 reads per second per node
Also has low latency-around 6 milliseconds
Scales linearly
Out of the box high availability-cross cluster replication
Developed internally by Google and was used for Google Earth, Finance, and Web Indexing
HBase is the open-source implementation of the Bigtable model; it was adopted as a top-level Apache project, and Cloud Bigtable supports the Apache HBase library for Java

92
Q

Cloud Bigtable

A
  1. Has a ROW KEY as the only index
  2. Then it can be attached to columns
  3. The columns can be grouped by families
  4. The empty values don’t take up any space since it’s a sparse db
  5. Scaled to thousands of columns and billions of rows
93
Q

Important Bigtable Features

A
  1. Blocks of contiguous rows are sharded into tablets
  2. Tablets are chunks of sorted rows-put together they form a complete table-managed by nodes in your cluster
  3. Tablet data is stored in Google Colossus-can scale cluster sizes
  4. Splitting, merging, and rebalancing happen automatically
94
Q

Bigtable Scenarios

A

Suited well for financial, marketing data and transactional data
Also good for time series data and data from IoT devices
Good for streaming data and machine learning applications

95
Q

Bigtable Architecture

A

You create an instance
You choose an instance type, a storage type, and app profiles-which describe parameters for incoming connections
To connect, you use an instance ID and an application profile
Inside the instance, you have clusters
Inside the clusters you have nodes which are workhorses of Bigtable
The flexibility of data storage comes from separating compute (cluster nodes) from storage (data in Colossus)
Nodes control tablets and a tablet can’t be shared by more than one node

96
Q

Instance Types

A
  1. Production-1+ clusters, 3+ Nodes per cluster
  2. Development-a single-node cluster for development work; a development instance can't use replication, has no SLA, and is a cheaper option
96
Q

SSD Storage Type

A

Almost always the right choice; the fastest and most predictable option; 6 ms latency for 99% of reads and writes; each node can process up to 2.5 TB of SSD-stored data

97
Q

HDD Storage Type

A

Each node can process up to 8 TB of HDD-stored data; throughput is limited, and individual row reads are only about 5% of the speed of SSD reads. A good fit when storing at least 10 TB of infrequently accessed data with no latency sensitivity-otherwise spend the money on SSD rather than extra cluster nodes

98
Q

Application Profiles

A
  1. Custom application specific settings for handling incoming connections
  2. Single or multi-cluster routing
  3. Single-cluster routing: routes all requests to a single cluster that you define, even if you have multiple clusters in the instance
  4. Multi-cluster routing: routes to the nearest available cluster; if it is unavailable, requests go to the next cluster
  5. Ask whether the data needs single-row transactions-if so, you must use single-cluster routing
99
Q

Bigtable Configuration

A
  1. Instances can run up to four clusters
  2. Clusters exist in a single zone
  3. Up to 30 nodes per project
  4. Maximum of 1,000 tables per instance
100
Q

Bigtable Access Control

A
  1. Cloud IAM roles
  2. Applied at project or instance level to-
  3. Restrict access or administration.
  4. Restrict reads and writes
  5. Restrict development instances or production access
101
Q

Data Storage Model

A

Only the ROW KEY is indexed
Column families allow us to grab only what we need
Column names are called column qualifiers
You can write new data values and the old ones aren't overwritten; previous versions are kept
How much is stored and for how long is configurable with detailed granularity; values are stored as arrays of bytes

101
Q

Alternative Options to Bigtable

A
  1. Need SQL support (OLTP): Cloud SQL
  2. Need interactive queries (OLAP) and cheaper: BigQuery
  3. Need structured NoSQL documents: Cloud Firestore
  4. Need in-memory key/value pairs: Memorystore
  5. Need a realtime database: Firebase
102
Q

Important Bigtable Info

A

Rows are sorted lexicographically by row key-the design of the row key is very important
Atomic operations are by row only-be careful when updating
Sparse table system-it doesn't hurt to have a lot of columns/families even if they don't apply to every entity
Row sizing: keep individual values no larger than 10 MB, and the total row, not including the key, under 100 MB

103
Q

Timestamps and Garbage Collection

A
  1. Each cell has multiple versions
  2. Server recorded timestamps
  3. Sequential numbers
  4. Expiry policies define garbage collection: can expire based on a specific age or specific number of versions
  5. Using the HBase client library sets the policy to retain only the latest version of a cell; any other client library sets the column family to store infinite versions

cbt is a command-line tool that provides an alternate way to connect to Bigtable

104
Q

Bigtable Schema Design

A

Without a good row key you have to scan the entire table and then filter the results based on a regular expression match to the string contained in a column cell-the most expensive way to query Bigtable
You have to plan your queries ahead of time
Field promotion: taking data that you already know and moving it into the row key itself
Then you can write a command like: scan 'vehicles', {ROWPREFIXFILTER => 'NYMT#86#'}
You can also include a timestamp in the ROW KEY design
Never put a timestamp at the front of the row key
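A sketch of writing a row and then reading a row-key prefix range with the current Python client, mirroring the ROWPREFIXFILTER scan above (project, instance, table, column family, and key values are hypothetical, and the column family is assumed to already exist):

```python
# pip install google-cloud-bigtable
from google.cloud import bigtable

client = bigtable.Client(project="my-project")          # hypothetical project
table = client.instance("my-instance").table("vehicles")

# Write one cell; the row key promotes route and vehicle ID into the key.
row = table.direct_row(b"NYMT#86#20210101120000")
row.set_cell("loc", "lat", "40.7128")   # family "loc" is assumed to exist
row.commit()

# Read all rows whose key starts with the prefix NYMT#86#
# ("$" is the next character after "#", so this bounds the prefix range).
for r in table.read_rows(start_key=b"NYMT#86#", end_key=b"NYMT#86$"):
    print(r.row_key, r.cells["loc"][b"lat"][0].value)
```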

104
Q

Designing Row Keys

A
  1. Queries use: A row key, a row prefix
  2. A row range is returned
  3. Reverse domain names
  4. String identifiers-reads and writes evenly spread
  5. Timestamps only as part of a bigger row key design, never as the first element, and ideally reversed
105
Q

Row Keys to Avoid

A
  1. Domain names in order
  2. Sequential numbers
  3. Frequently updated identifiers
  4. Hashed values
106
Q

Design for Performance

A
  1. Lexicographic sorting
  2. Store related entities in adjacent rows
  3. Distribute reads and writes evenly
  4. Balanced access patterns enable linear scaling of performance
107
Q

Avoid Hotspots

A
  1. Use Field Promotion instead
  2. Try Salting-salted hash to your row key artificially distributes the rows, based on total number of nodes
  3. Use Google’s Key Visualizer tool
108
Q

Time Series Data in Bigtable

A
  1. Use tall and narrow tables where each row might contain a key and maybe only a single column
  2. Use rows instead of versioned cells
  3. Logically separate tables
  4. Don't reinvent the wheel-there are already good time-series schemas out there, e.g. the OpenTSDB project
109
Q

Monitoring Bigtable

A
  1. Via GCP Console or Stackdriver
  2. Average CPU utilization of cluster and hottest node
  3. Single cluster instance-aim for average CPU load of 70% and the hottest node not over the CPU of 90%
  4. For 2 clusters and replication instances-multi cluster routing brings in additional overhead where the average CPU load should be 35% and the hottest node CPU load at max 45%
  5. For storage utilization, try to keep it on 70% per node
  6. To monitor this, try to create application profiles for each application
109
Q

Autoscaling Bigtable

A
  1. Stackdriver metrics can be used for programmatic scaling-done on local computer
  2. Client libraries query metrics
  3. Update cluster node counts via API
  4. Rebalancing tablets can take time and the performance might not improve for 20 mins
  5. Adding nodes to a cluster doesn’t solve the problem of a bad schema
110
Q

Replication Bigtable

A
  1. Adding additional clusters automatically starts replication, i.e. data synchronization
  2. Replication is eventually consistent
  3. Used for availability and failover
  4. Application isolation
  5. Global presence
111
Q

Good Performance in Bigtable

A

Replication improves read throughput but does not affect write throughput
Use batch writes for bulk data with rows that are close together lexicographically
Monitor instances and use the Key Visualizer to monitor hotspots and bad row keys
Bigtable rebalances tablets-first they all go in the first node, but then they rebalance or spread out to the other growing nodes-the tablets are also being split, merged, and rebalanced to maintain the sorted order of rows
Hotspots sometimes pop up and consume a lot of CPU; Bigtable moves the other tablets off the hot node to other nodes so the overwhelmed node has less work

112
Q

Good vs Bad Performance in Bigtable

A

Good Performance: Optimized schema and row key design, large datasets, correct row and column sizing
Bad Performance: Datasets short lived or smaller than 300GB

113
Q

Exam tips Bigtable

A

Know when to choose Bigtable: many questions make you choose the right product for the workload; when migrating from an on-premises environment look for HBase, and consider when Bigtable is a better option than BigQuery. Look for time-series data or use cases where latency is an issue
Understand the architecture of Bigtable: Concepts of an instance and a cluster, where Bigtable stores data, and how tablets are re-balanced by the service between nodes
Be aware of causes of bad performance: Like under-resourced clusters, bad schema design, and poorly chosen row keys.
UNDERSTAND ROW KEYS: Linear scale and performance of Bigtable depends on good row keys. Understand row key design, I might have to point out flaws or pick an ideal row key
Understand Tall vs Wide: Wide Table-stores multiple columns for a given row-key where the query pattern is likely to require all the information about a single entity. Tall Table-suit time-series or graph data and often only contain a single column
Remember Organizational Design: Consider when a development instance is appropriate, remember IAM roles that can be used to isolate access to the necessary groups.

114
Q

What is BigQuery?

A

Petabyte-scale, serverless, highly scalable cloud enterprise data warehouse
In memory BI Engine-fast interactive reports
Has machine Learning capabilities (BigQuery ML)-using SQL
Support for geospatial data storage and processing

115
Q

Key Features of BQ

A
  1. High availability
  2. Supports SQL-can do SQL queries
  3. Federated Data-can connect to and process data stored outside of BigQuery
  4. Automatic Backups
  5. Governance and Security support-data encrypted at rest and in transit
  6. Separation of Storage and Compute-cost effective scalable storage and stateless resilient compute
115
Q

Interacting with BQ

A
  1. Web console
  2. Command line tool (bq)
  3. Client libraries like C#, Go, Java, Node.JS, PHP, Python, and Ruby
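A sketch of running a standard SQL query with the Python client library (the public dataset referenced is real; the project is hypothetical):

```python
# pip install google-cloud-bigquery
from google.cloud import bigquery

client = bigquery.Client(project="my-project")   # hypothetical project

sql = """
    SELECT name, SUM(number) AS total
    FROM `bigquery-public-data.usa_names.usa_1910_2013`
    WHERE state = 'TX'
    GROUP BY name
    ORDER BY total DESC
    LIMIT 5
"""

# The query runs as a job; result() streams the rows back.
for row in client.query(sql).result():
    print(row.name, row.total)
```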
116
Q

Managing Data with BQ

A

You have a Project and within each Project, you have a Dataset, and within each Dataset, you can have Native Tables, External Tables, or Views

117
Q

Native Table

A

Data is held within a BigQuery Storage

118
Q

External Tables

A

Backed by storage outside of BigQuery

119
Q

Views

A

created by a SQL query

120
Q

Real Time Events BQ

A

Streaming, common to push events to Cloud Pub/Sub, then use a Cloud Dataflow job to process and push them into BigQuery

121
Q

Batch Sources BQ

A

Comes in a Bulk Load, common to push files to Cloud Storage, then have a cloud Dataflow job pick that data up, process it and then push it into BigQuery

122
Q

Legacy SQL

A
  1. Previously Called BigQuery SQL
  2. Non-standard SQL dialect
  3. Migration to standard SQL is recommended
123
Q

Standard SQL

A
  1. Preferred dialect
  2. Compliant with SQL 2011 standard
  3. Extensions for querying nested and repeated data
124
Q

What you can do with BQ

A

With BigQuery Data you can: Use BI Tools, Use Cloud Datalab, Export to sheets or Cloud Storage, send it to Colleagues, or use it for GCP Big Data Tools like Dataflow or Dataproc

125
Q

Jobs and Operations in BQ

A

Job: action that is run in BigQuery on your behalf
Load Job: Load data onto BQ
Export Job: Export from BQ
Query Job: Queries the data in BQ
Copy Job: Copies existing tables and datasets within BQ
Query Job priorities: Interactive (the default)-results are always saved to a temporary table, or to a permanent table if specified-and Batch

126
Q

Table Storage in BQ

A
  1. Capacitor columnar data format
  2. Tables can be partitioned
  3. Individual records exist as rows
  4. Each record is composed of columns
  5. Table schemas specified at the creation of the table or at a data load
127
Q

Capacitor in BQ

A
  1. The storage system: proprietary columnar data storage that supports semi-structured data (nested and repeated fields); imports such as CSV and JSON are converted to the Capacitor format
  2. Each value is also stored together with a repetition level and a definition level (value, repetition level, definition level)
128
Q

Denormalization

A
  1. BQ performs best when data is denormalized
  2. Nested and repeated columns
  3. Maintain data relationships in an efficient manner
  4. RECORD (STRUCT) data type-nested records or columns. Ex: Address = address.number, address.street, address.city; a single ID can have multiple addresses
129
Q

Data Formats in BQ

A

CSV, JSON (newline delimited), Avro (an open-source format where the schema is stored together with the compressed data), Parquet (encoded, smaller files), ORC (Hive data), Cloud Datastore exports, and Cloud Firestore exports

130
Q

BQ Views

A

A virtual table defined by a SQL query
The SQL query (the view definition) references underlying tables
The view is then saved into a dataset
You can query the view, which has billing implications since you are still running the underlying query

131
Q

Uses of Views

A
  1. Control access to data
  2. Reduce query complexity
  3. Construct logical tables
  4. Ability to create authorized views-share query results with users without giving them access to the underlying tables, including different subsets of rows from the view
132
Q

Limitations of Views

A
  1. Can’t export data since unmaterialized
  2. Can’t use JSON API to retrieve data from a view
  3. Can't combine standard and legacy SQL
  4. No user defined functions
  5. No wildcard table references
  6. Limited to 1,000 authorized views per dataset
133
Q

External Data in BQ

A

You can query directly even though the data is not directly held in BQ
BQ supports Cloud Bigtable, Cloud Storage, and Google Drive
Use Cases for using External Data Source: Load and clean your data in one pass, or you have small, frequently changing data joined with other tables

134
Q

Limitations for External Data Source in BQ

A
  1. No guarantee of consistency
  2. Lower query performance
  3. Can’t use TableDataList API method
  4. Can’t export jobs on external data
  5. Can’t reference wildcard table query
  6. Can't query Parquet or ORC formats
  7. Query results not cached
  8. Limited to 4 concurrent queries
135
Q

Other Data Sources

A
  1. Public Datasets-available to everyone
  2. Shared Datasets-Have been shared with you
  3. Stackdriver log information
136
Q

Data Transfer Service

A

Can easily pull Bulk Data into BQ with the Data Transfer Service
Data Transfer Service has multiple Connectors to Google sources, GCP, AWS services like S3 and Redshift, and other third party services like LinkedIn and Facebook
Can be one off events or scheduled to run repeatedly
DTS supports backfills of historical data and has an uptime and delivery SLA

137
Q

Table Partitioning BQ

A

Table partitioning-break up big table into smaller tables
Partitions stored separately on physical level
Partitions usually based on a single column called the partition key.

Partitioning BQ 2 Ways
1. Ingestion Time partitioned tables
2. Partitioned Tables

138
Q

Ingestion Time Partitioning

A

Partitioned by load or arrival date; data is automatically loaded into date-based partitions (daily); tables include the pseudo-column _PARTITIONTIME-use _PARTITIONTIME in queries to limit the partitions scanned

139
Q

Partitioned Tables in BQ

A

Partitioned based on a certain TIMESTAMP or DATE column, Data partitioned based on value supplied in partitioning column, 2 additional partitions: NULL and UNPARTITIONED , use partitioning column in queries
BQ automatically places data in right partitions, need to say it is a partition table when creating the table

140
Q

Clustering Tables BQ

A

Clustering can be applied to a partitioned table; use it when you have filters or aggregations against specific columns in your queries. When a table is partitioned and clustered together, it is partitioned by the partition key and then clustered based on the cluster key within each partition

In cluster tables: the data associated with a certain cluster key is generally stored together
Ordering is important
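A sketch of the corresponding DDL, creating a table partitioned on a DATE column and clustered on two columns (dataset, table, and column names are hypothetical), submitted through the Python client:

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")   # hypothetical project

ddl = """
    CREATE TABLE `my_dataset.events`
    (
      event_date DATE,
      customer_id STRING,
      country STRING,
      amount NUMERIC
    )
    PARTITION BY event_date          -- partition key
    CLUSTER BY customer_id, country  -- cluster keys (order matters)
"""

client.query(ddl).result()  # DDL runs as a query job
```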

141
Q

Clustering Limitations

A
  1. Only supported for partitioned tables
  2. Standard SQL only for querying clustered tables
  3. Standard SQL only for writing query results to clustered tables
  4. Specify clustering columns only when table is created
  5. Clustering columns can’t be modified after table creation
  6. Clustering columns have to be top-level, non-repeated columns
  7. You can specify one to four clustering columns
142
Q

Querying Guidelines for Clustering Tables

A
  1. Filter clustered columns in the order they were specified
  2. Avoid using clustered columns in complex filter expressions
  3. Avoid comparing clustered columns to other columns

Why partition tables?
  1. Improve query performance
  2. Control costs
143
Q

Benefits of BQ Slots

A

Slots:
Unit of computational capacity required to execute SQL queries-good for pricing and resource allocation
The number of slots a query uses is determined by the query's size and complexity
BQ automatically manages your slots quota
Flat rate pricing available-purchase fixed number of slots
You can see slot usage using Stackdriver

144
Q

Cost Controls of BQ

A
  1. Avoid using SELECT *
  2. Use preview options to sample data
  3. Price queries before executing them
  4. Remember that using LIMIT doesn't affect cost
  5. View costs using a dashboard and query audit logs
  6. Partition by date
  7. Materialize query results in stages
  8. Consider the cost of large result sets
  9. Use streaming inserts with caution
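One way to price a query before executing it is a dry run, which reports the bytes that would be scanned without running the job. A sketch with the Python client (table name is hypothetical):

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")   # hypothetical project

job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
job = client.query(
    "SELECT customer_id, amount FROM `my_dataset.events` "
    "WHERE event_date = '2021-01-01'",
    job_config=job_config,
)

# No data is scanned; only an estimate is returned.
print(f"This query would process {job.total_bytes_processed} bytes")
```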
145
Q

Query Performance Dimensions

A
  1. Input data and data sources
  2. Shuffling
  3. Query computation
  4. Materialisation
  5. SQL anti-patterns
145
Q

Input data and data sources best practices BQ

A

Input data and data sources: Prune partitioned queries, denormalize data whenever possible, use external data sources appropriately, avoid excessive wildcard tables

146
Q

Query Computation Best Practices

A

Avoid repeatedly transforming data via SQL queries, avoid JavaScript user-defined functions, order query operations to maximize performance, optimize JOIN patterns

147
Q

SQL Anti-Patterns

A

Avoid Self-Joins, avoid data skew, avoid unbalanced joins, avoid joins that generate more outputs than inputs (Cartesian product), avoid DML statements that update or insert single rows

148
Q

Optimizing Storage BQ

A
  1. Use expiration settings-Control Storage Costs and Optimize use of storage space
  2. Take advantage of long-term storage-lower monthly charges apply for data stored in tables or partitions that have not been modified in the last 90 days
  3. Use the Google pricing calculator to estimate the storage costs
149
Q

Primitive Roles

A

at the project level, granting access to the related project data sets, individual dataset access will overwrite the primitive access. Three types of these roles-Owner, Editor, Viewer

150
Q

Predefined Roles

A

grant more granular access, defined at the service level, GCP managed

151
Q

Custom Roles

A

User managed

152
Q

Cloud DLP

A

Handling Sensitive Data: credit card numbers, med info, SSN, people names, address info can be protected by the Cloud Data Loss Prevention (Cloud DLP)

Cloud DLP

  1. Fully managed service
  2. Identify and protect sensitive data at scale
  3. Over 100 predefined detectors to identify patterns, formats, and checksums
  4. It also de-identifies the data
153
Q

Encryption in BQ

A

BQ encrypts data through the Data Encryption Key (DEK)
For highest levels of security, the DEK key needs to be encrypted to form the Wrapped DEK-this is done using the Key Encryption Key (KEK)
Wrapped DEK/DEK stored together
KEK is stored in the Cloud Key Management Service
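
By default Google manages both keys; if you want to manage the KEK yourself, you can supply a customer-managed Cloud KMS key (CMEK) when creating a table. A minimal sketch with made-up project, dataset, and key names, assuming the google-cloud-bigquery client:

  from google.cloud import bigquery

  client = bigquery.Client()

  kms_key_name = "projects/my-project/locations/us/keyRings/my-ring/cryptoKeys/my-key"

  table = bigquery.Table(
      "my-project.my_dataset.protected_table",
      schema=[bigquery.SchemaField("name", "STRING")],
  )
  # BigQuery still encrypts data with a DEK; the KEK that wraps it now lives in Cloud KMS.
  table.encryption_configuration = bigquery.EncryptionConfiguration(
      kms_key_name=kms_key_name
  )
  client.create_table(table)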

154
Q

Monitoring/Alerts in BQ

A

Alerts should be created when a monitoring metric crosses a specified threshold
BQ uses Stackdriver for monitoring; BQ sends its logs to Stackdriver
In Stackdriver, you can filter for the BigQuery logs, create dashboards, add charts to the dashboards, and create alerts

155
Q

Cloud Audit Logs

A

Collections of logs that are provided by GCP to allow insights to various services

Log Versions

AuditData (old): maps directly to the individual API calls made against BigQuery
BigQueryAuditMetadata: not strongly coupled to particular API calls; more aligned to the resource itself-describes how the state of BigQuery resources can be changed by API calls, services, and tasks

Stackdriver has three different streams: Admin Activity, System Event, and Data Access-streams are just groupings for different types of logs

156
Q

BQ ML Access

A
  1. Web console (UI)
  2. Bq command line tool
  3. BQ rest API
  4. Jupyter notebooks(Cloud Datalab) and other external BI tools
157
Q

Linear Regression

A

where you have a number of data points and try to fit a line to those data points

158
Q

Binary Logistic Regression

A

We have 2 classes and assign each example to one of the classes

159
Q

Multi-class Logistic Regression

A

We have three or more classes and assign each example to one of them

160
Q

K-Means Clustering

A

We have a number of points and are able to separate them out into different clusters-newest one on BQ ML

161
Q

Benefits of BQ ML

A
  1. Democratizing ML
  2. Models trained and evaluated using SQL
  3. Speed and agility
  4. Simplicity
  5. Avoid regulatory restrictions
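
A minimal sketch of training and evaluating a BQ ML linear regression model entirely in SQL (made-up dataset, table, and column names), run through the Python client:

  from google.cloud import bigquery

  client = bigquery.Client()

  create_model = """
  CREATE OR REPLACE MODEL `my_dataset.price_model`
  OPTIONS (model_type = 'linear_reg', input_label_cols = ['price']) AS
  SELECT size_sqft, num_rooms, price
  FROM `my_dataset.house_sales`
  """
  client.query(create_model).result()

  # Evaluation metrics also come back via SQL.
  evaluate = "SELECT * FROM ML.EVALUATE(MODEL `my_dataset.price_model`)"
  for row in client.query(evaluate).result():
      print(dict(row))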
162
Q

EXAM TIPS for BQ

A

Understand good organizational design: consider how different teams should be granted different types of access to BQ and how the decisions affect cost control
Learn the most common IAM roles: learn how to grant access to teams based on needs and how to use authorized views to share data across projects
Consider costs when designing queries: avoid using SELECT *, use previews to sample data, and price queries before executing them
Partition tables appropriately: partitioned tables can reduce cost-consider clustering to reduce scans of unnecessary data
Optimize query operations and JOINS

163
Q

What is Datalab?

A

What is it?
1. A pre-existing technology, wrapped in some GCP conveniences-Jupyter Notebooks

164
Q

Jupyter Notebooks

A
  1. Interactive web pages that have:
  2. Documentation
  3. Code
  4. Elements which are the results of executed code
165
Q

Datalab functions

A

When typing code, there is a Cloud Datalab VM that runs a Python kernel
The kernel can run code and access GCP services like BigQuery or ML Engine

Good way to collaborate and share code
Also a good way to annotate

Has matplotlib, which is great for statistical data and graphs

When saving your work through the notebook, it is saved to the GCR repo, which is synced to the persistent disk attached to the Datalab instance

166
Q

Why Do We Need Datalab?

A
  1. Manages instance lifecycle
  2. Create Datalab VMs in seconds
  3. Notebooks stored in GCR
  4. Storage can persist after the instance is destroyed
167
Q

Intro to Data Studio

A

Data Sources
Reports and Dashboards
Data sources underneath are Databases or Files
Files-usually CSV files, stored in Cloud Storage
Databases-GCP databases like BigQuery, Cloud SQL, MySQL, Cloud Spanner, PostgreSQL
Google Products-Google Analytics, Sheets, YouTube, Ads, Google Marketing Platform
Third Party-Trello, Quickbooks, Facebook Ads
You can Share your dashboards and reports by Viewing or allowing Users to Edit-Like in Google Drive

168
Q

Chart and Filters in Data Studio

A

Tables-detailed, heat map inclusion, bar chart inclusion, pagination
Scorecards-KPIs, high level
Pie Chart-proportions, %s or absolute values, doughnut or whole, small amounts of data
Time series-time order, trends, curve filtering, forecasting
Bar charts-categorical, vertical or horizontal, single or stacked, reports and dashboards
Geomaps-geographical data, dashboards
Area charts-composition, cumulative totals, reports and dashboards, can combine with time series
Scatter plot-cartesian plane, typically 2 variables-or more through color or size through the points, dashboards and reports

Filter-allows you to select specific values from a category
Date Range-can specify start and end dates or predefined intervals like last week, last month, current year or quarter

169
Q

Cloud Composer Overview

A

Built on Apache Airflow

Google is contributing back to the airflow project

Task orchestration system designed to automate complex interdependent tasks into pipelines or workflows
Each stage of the pipeline is written in code
Each workflow is written in Python
Provides central management and scheduling
Provides an extensive CLI tool and a comprehensive web UI

170
Q

DAG-Directed Acyclic Graph

A

A graph consisting of nodes connected by edges. Edges define how we travel from one node to another; directed means we travel in one direction only; acyclic means the graph never circles back-you can’t reach the same node more than once by traveling along the edges
Possible to represent dependencies between nodes that must be traversed in a specific order
These nodes represent all the tasks in a workflow, organized in a way that shows their relationships and dependencies
DAGs are defined in Python, including when the tasks should be scheduled to execute
Inside the tasks, there are operators that specify what is to be done
DAGs can contain parameters for when they should run, what the dependencies are, and who should be notified once the tasks are completed
Cloud Composer manages resources to make sure the workflow completes successfully
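
A minimal sketch of a DAG definition with two dependent tasks, using Airflow 1.x-style imports (as in earlier Cloud Composer versions); task names and the bucket path are made up:

  from datetime import datetime

  from airflow import DAG
  from airflow.operators.bash_operator import BashOperator

  default_args = {
      "owner": "data-team",
      "start_date": datetime(2020, 1, 1),
  }

  # The DAG object ties the tasks, schedule, and dependencies together.
  with DAG("daily_export", default_args=default_args,
           schedule_interval="@daily", catchup=False) as dag:

      extract = BashOperator(
          task_id="extract",
          bash_command="echo extracting data",
      )
      load = BashOperator(
          task_id="load",
          bash_command="gsutil cp /tmp/export.csv gs://my-bucket/exports/",
      )

      # Directed edge: extract must finish before load runs.
      extract >> load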

171
Q

Composer Architecture

A

A microservices architecture
Uses multiple GCP resources grouped together into a Cloud Composer environment
Can have more than one environment in a GCP project, but each environment is an isolated installation of Airflow and all of its component parts
Some parts are placed in a Tenant Project that you can’t see or control-it holds the Airflow database and the Airflow web server (which provides the web UI on App Engine Flex); Cloud Composer also configures the Identity-Aware Proxy to control access to the web server

171
Q

Why use Cloud Composer rather than Dataflow?

A

Dataflow-Process Batch or Streamed Data-Apache Beam
Cloud Composer-Orchestrate tasks with Python and can use any Python code at any stage of the pipeline-more as a scheduler
Can orchestrate Cloud Composer with Dataflow
Cloud Composer workflow example: Spark Analytics- A workflow runs daily, sets up a Dataproc cluster, performs Spark analytics, writes results to GCS and emails an administrator, and then deletes the Dataproc cluster

Cloud Composer can be used as any scheduled automation task outside of big data

172
Q

Composer in a GCP Environment

A

In the GCP project environment, Composer creates a Kubernetes (GKE) cluster that deploys Redis, the Airflow Scheduler, and the Airflow Workers along with the Cloud SQL Proxy; it also creates 2 Pub/Sub topics for messaging between microservices and a Cloud Storage bucket for logs, plugins, and the DAGs themselves
Cloud Composer configures the Airflow parameters and environment labels; you can also customize some parameters
A DAG can be a single file or multiple files with imports and dependencies-the Python script has the variables, operators, and stages of the workflow tied together with a DAG object and definition

173
Q

Tasks

A

A Task is an instance of an Airflow Operator
The scheduler will find any DAG objects that you have defined in the Python scripts uploaded to the GCS bucket; if all dependencies are met, the workflow will be scheduled
When you delete an environment, it won’t clean up all the resources it created; the Tenant Project and the GKE cluster that was spun up will be removed
You will have to manually delete the Pub/Sub topics and the GCS bucket

174
Q

Advanced Composer Features

A

Custom Airflow parameters get written to the airflow.cfg file that is used to configure services when Airflow is run for the first time
Not all service settings can be changed; some are managed by Cloud Composer
Can create Environment Variables, which will be passed by Cloud Composer to elements of Airflow like the scheduler, web server, and worker processes
Environment Variables are grouped into Sections and defined as key-value pairs
Airflow Connection: a collection of authentication information, which can include hostnames, logins, and keys/secret information
Composer will create connections for BigQuery, Datastore, Cloud Storage, and a generic GCP connection-these have a service account key that authenticates against the GCP API in question
Can make custom connections using the Airflow Web UI; connections to Airflow’s own database use the Cloud SQL Proxy
In the Web UI, you can use the ad hoc query page to run SQL queries against any connected database-some built-in visualizations as well

175
Q

Extending Airflow

A
  1. Local Python Environment by adding additional custom libraries
  2. Airflow Plugins-write own custom operator
  3. PythonVirtualenvOperator to create a custom environment for a task in a workflow without installing dependencies across all of your workers-creates individual envs with their own libraries
  4. KubernetesPodOperator-when you need complete control over an environment, runs a task inside a pod on the Cloud Composer GKE cluster
176
Q

Google Pre-Trained ML Models

A
  1. Cloud Vision API-can detect and label objects within images, facial recognition, can read handwritten info
  2. Cloud Video Intelligence API-can identify huge numbers of objects, places, and actions which are taking place in videos
  3. Cloud Translation API-can translate between more than 100 languages
  4. Cloud Text to Speech API: Convert text to audio/human speaking
  5. Cloud Speech to Text-convert from audio to text
  6. Cloud Natural Language API-can perform sentiment analysis, entity analysis, entity sentiment analysis, content analysis, content classification
177
Q

Reusable Models

A

Model training required, minimal knowledge of ML required, relatively small datasets for training, uses transfer learning and neural architecture search-e.g. on GCP you can search for the best model architecture to solve your problem

178
Q

Re-use models-Through Cloud AutoML:

A

when you need something very specific
AutoML-uses transfer learning and allows you to train your own custom models to solve your own specific problems: AutoML Vision, Video Intelligence, Natural Language, Translation, and Tables
Google AI Platform allows you to train your own models, manage and share models-gives easy access to TensorFlow, TensorFlow Extended (TFX)-end to end platform for deploying machine learning pipelines, gives access to TPUs-speed up process, Kubeflow

179
Q

ML Pipeline

A

ML has a Model that contains Rules from the Data that’s used to train it
ML Pipeline
1. Data preparation-raw data becomes processed data
2. Model Training-processed data is split into Training Data and then Testing Data. Training Data gets inputted into the Model, then it is evaluated against the Test Data to determine how well it inferred rules from the training data
3. Operating- Trained Model is used for Real-time Predictions like predicting what people are going to buy, or for Batch Predictions which are normally offline predictions when many predictions are made in a short period of time

180
Q

Label in ML

A

Label: something that is of interest to us, like a house price-labels are denoted with y
Labelled Example: has a label associated with a set of features-house size, rooms, location… house price = $K; (x1,x2,x3,x4,…) -> y
Unlabelled Examples: have the set of feature values but no label value-size, number of rooms, location… house price = ?; (x1,x2,x3,x4,…)-we want to predict the label value, called y prime (y’)

181
Q

Feature in ML

A

Feature: attribute associated with the label and has a relationship with the label, like the size of a house or the number of rooms in a house, denoted with an x-x1,x2…
Examples: Features together with labels

182
Q

Measuring Loss and Loss Squared

A

Measuring Loss: calculating the difference between the predicted value (x7,y7’) and the actual value (x7,y7) with the equation y7 - y7’, so basically the difference between the actual values and predicted values
Measuring Loss Squared: with one line going through the actual points and another line that doesn’t match the actual points well, you take the differences between the actual and predicted points and calculate loss as Loss = (y1 - y1’)^2 + (y2 - y2’)^2 + … >= 0; you have to square because otherwise the positive and negative differences would cancel out

183
Q

Mean Squared Error

A

Similar to the loss equation but calculating the mean of the squared errors
Optimization uses gradient descent: y = wx -> MSE(w); the gradient gives the direction where the loss function increases, so move in the direction where the loss function decreases; keep doing this until we find the w (minimum) where the loss function has its lowest value, then use that w in the line equation to get the line with the lowest loss; the learning rate controls how far we move toward the minimum on each step
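
A minimal sketch of gradient descent minimizing MSE for y = w * x; the data points and learning rate are made up for illustration:

  import numpy as np

  x = np.array([1.0, 2.0, 3.0, 4.0])
  y = np.array([2.1, 3.9, 6.2, 7.8])   # roughly y = 2x

  w = 0.0               # initial weight
  learning_rate = 0.05

  for step in range(200):
      error = w * x - y
      # d(MSE)/dw = mean(2 * (w*x - y) * x); step against the gradient.
      grad = np.mean(2 * error * x)
      w -= learning_rate * grad

  final_mse = np.mean((w * x - y) ** 2)
  print(f"learned w = {w:.3f}, final MSE = {final_mse:.4f}")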

183
Q

Supervised Learning Type

A

We train the model using data that is labeled, each example has a label and a feature
The training data has a set of features called x and includes the correct labels, the y’s; features with the correct labels are used to train the model; unseen data, which has only the features, is then presented to the model; the model uses its training to predict what the associated label should be for each example, giving us y prime

184
Q

Unsupervised Learning Type

A

uncover structure within the data set itself
We have a set of input data with no labels-features only-and send it to the model; the model then creates outputs which uncover a structure within the input data; this method could be used, for example, to uncover personas within customer data

185
Q

Reinforcement Learning

A

common in gameplaying and other types of ML
You have an agent, which is basically our model; the agent interacts with an environment (the chess board in a game of chess); the state of the environment, state(t), is fed into the agent at a particular moment in time t; the agent proposes an action at time t which affects the environment (e.g. moving a chess piece from one square to another); the state then changes to state(t+1); the agent receives a reward, reward(t), according to how well the action it took at time t affected the environment in terms of the outcome it wants to achieve

186
Q

Regression

A

predict real number (y’)
There is a regression model; features are inputted into the model, and the model associates the set of features with a predicted value y’

187
Q

Classification

A

predict class from specified set{A,B,C,D} with probability
There is a classification model, it will take a set of features, then it will associate it with a particular class and an associated probability, ex: (X1,X2,…,Xn,A,0.9817)

188
Q

Clustering

A

group elements into clusters or groups
There is a clustering model, you input an element set, then you associate the element set with a cluster in the output

189
Q

Transfer Learning

A

Train a model using images to classify them into specific categories; you can then copy the model and turn it into a slightly newer classification model with new classification categories, training it with new images; for this new model, far fewer training images are usually needed

190
Q

Underfitting

A

Like fitting a straight line-it doesn’t capture the underlying structure of the data

190
Q

Balanced

A

A curve that fits the data very well but is still a simple parabola

191
Q

Overfitting

A

Fits the training data very, very well-better than the other two cases-but for a new data point there can be a very large distance from the curve; the problem is it doesn’t generalize to new data

192
Q

L1 Regularization

A

L1 Regularization term: |W1| + |W2| + … + |Wn|-take the sum of the absolute values of all the weights
Penalizes |weight|; drives weights of non-contributing features towards 0-sparse weights

193
Q

L2 Regularization

A

L2 Regularization term: W1^2 + W2^2 + … + Wn^2-take the sum of the squares of all the weights
Penalizes weight squared; drives all weights towards 0-simpler model

194
Q

Avoid Overfitting

A
  1. Regularization
  2. Increase Training Data
  3. Feature Selection
  4. Early Stopping-don’t allow the model to train for too long, only a certain number of iterations
  5. Cross Validation-take the training data and split it into much smaller training sets, known as folds
  6. Dropout Layers-randomly set a fraction of the weights/activations to 0 during training
195
Q

Hyperparameters

A

Hyperparameters: values that need to be selected before the training process can begin

Hyperparameter examples: Batch size, training epochs, number of hidden layers in network-model, regularization type-l1 or l2, regularization rate, learning rate

Hyperparameter Characteristics
1. Selection: Hyperparameter values need to be specified before training begins
2. Model hyperparameters: relate directly to the model that is selected; Algorithm hyperparameters: relate to the training of the model
3. Tuning: the process of finding the optimal or near-optimal values for hyperparameters

196
Q

Feature Engineering

A

One-hot Encoding (categorical data from a fixed set of values): transforms categorical data to numeric values; you have a column for each category and use a binary 1 or 0 to indicate which category applies
Linear Scaling: transforms values in one range into values in another, e.g. -1 to 1 or 0 to 1
Z-Score: values clustered around the mean mu are transformed into values clustered around 0, in units of standard deviations
Log Scaling: when a small number of values have many points while the vast majority have few, transform with x’ = log(x)
Bucketing: for distributions where data points are related but the relationship is not linear, create defined ranges (buckets); data points within a range are mapped to a single value
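
A minimal sketch of these transforms on a made-up table with one categorical and one long-tailed numeric column:

  import numpy as np
  import pandas as pd

  df = pd.DataFrame({
      "color": ["red", "green", "red", "blue"],
      "price": [10.0, 200.0, 35.0, 9000.0],
  })

  # One-hot encoding: one binary column per category.
  one_hot = pd.get_dummies(df["color"], prefix="color")

  # Linear scaling into the range 0 to 1.
  p = df["price"]
  linear = (p - p.min()) / (p.max() - p.min())

  # Z-score: center on the mean, scale by the standard deviation.
  z_score = (p - p.mean()) / p.std()

  # Log scaling for long-tailed values.
  log_scaled = np.log(p)

  # Bucketing into defined ranges.
  buckets = pd.cut(p, bins=[0, 50, 500, np.inf], labels=["low", "mid", "high"])

  print(pd.concat([df, one_hot, linear.rename("scaled"), buckets.rename("bucket")], axis=1))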

197
Q

TensorFlow

A
  1. Google’s open source, end-to-end ML framework
  2. Compatible with a wide range of hardware and devices-models can be trained and deployed across CPU, GPU and TPU
  3. TensorFlow Lite-deploying models to mobile and embedded devices
  4. TensorFlow.js-JavaScript library for training and deploying models on Node.js and in the browser
  5. TensorFlow Extended-deploying ML pipelines
198
Q

Keras

A
  1. Open Source neural network library
  2. Made in Python
  3. Runs on top of other ML frameworks
  4. High level API for fast experimentation-deep neural networks, easy to use extensible
  5. CPUs and GPUs
  6. tf.keras-TensorFlow’s implementation of the Keras API specification for building and training models, with first-class support for TensorFlow-specific functionality; more flexible and makes TensorFlow easier to use (see the sketch below)
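
A minimal sketch of a tf.keras feed-forward classifier that also uses L2 regularization, dropout, and early stopping from the earlier cards; the data shapes are made up:

  import numpy as np
  import tensorflow as tf

  # Fake training data: 1000 examples, 20 features, 3 classes.
  x_train = np.random.rand(1000, 20).astype("float32")
  y_train = np.random.randint(0, 3, size=(1000,))

  model = tf.keras.Sequential([
      tf.keras.layers.Dense(
          64, activation="relu", input_shape=(20,),
          kernel_regularizer=tf.keras.regularizers.l2(0.01)),
      tf.keras.layers.Dropout(0.2),
      tf.keras.layers.Dense(3, activation="softmax"),  # one output neuron per class
  ])

  model.compile(optimizer="adam",
                loss="sparse_categorical_crossentropy",
                metrics=["accuracy"])

  # Early stopping halts training when validation loss stops improving.
  early_stop = tf.keras.callbacks.EarlyStopping(patience=3, restore_best_weights=True)
  model.fit(x_train, y_train, epochs=20, validation_split=0.2,
            callbacks=[early_stop], verbose=0)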
199
Q

Google Colab

A
  1. Free cloud service based on Jupyter Notebooks; notebooks are stored in Google Drive
  2. Provides free GPU support
  3. Supports some BASH commands
  4. Includes pre-installed python libraries
200
Q

Neural Network Layers

A

Input Layer: allows us to feed data into the model
Output Layer: represents the way we want the neural network to provide answers; we have a neuron available for each of the classes
There are Hidden Layers between the Input and Output Layers; the number of hidden layers will vary based on the type of problem you are trying to solve-a model hyperparameter
Input = L0, Hidden = L1,L2,L3,L4,L5, Output = L6
Each layer has a specified number of neurons/nodes
For image inputs, the image is represented as a matrix of pixel values

201
Q

Fully Connected Layer

A

where every neuron of 1 layer is connected to every neuron in the following layer

202
Q

Partially Connected Layer

A

where every neuron in one layer is not connected to every neuron in the adjacent layer

202
Q

Neurons

A
  1. Will have an input vector X and an output value y
  2. Has a weight vector; the number of weights in the weight vector corresponds to the number of inputs in the input vector
  3. X1*W1-each weight is multiplied by the corresponding input value
  4. Then the values are summed-(X1W1)+(X2W2)
  5. A function is applied to this sum-f((X1W1)+(X2W2))-which gives us our output y
  6. The function is called the Activation function
  7. Activation functions: Rectified Linear Unit (ReLU): f(x)=max(0,x), Sigmoid/Logistic, Hyperbolic tangent
    Logits: the raw output values of the final layer (before softmax)
    Softmax function: converts logits to probabilities-SUM(p)=1 (see the sketch below)
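
A minimal sketch of these steps for a single neuron plus a softmax output layer (a bias term is included, which the card doesn’t mention but is standard):

  import numpy as np

  def relu(z):
      return np.maximum(0, z)

  def softmax(logits):
      e = np.exp(logits - np.max(logits))   # subtract the max for numerical stability
      return e / e.sum()

  x = np.array([0.5, -1.2, 3.0])        # input vector X
  w = np.array([0.8, 0.1, -0.4])        # weight vector, same length as X
  b = 0.2                               # bias term

  z = np.dot(w, x) + b                  # weighted sum: (X1*W1) + (X2*W2) + ...
  y = relu(z)                           # activation function gives the output

  logits = np.array([2.0, 0.5, -1.0])   # raw outputs of a 3-class output layer
  probs = softmax(logits)               # probabilities that sum to 1
  print(y, probs, probs.sum())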
203
Q

Feed Forward Neural Network

A
  1. Simplest and most common type of deep neural network
  2. Applications in computer vision and NLP
  3. First type of deep neural network to be used
  4. Information flows in one direction only(input to output)
203
Q

Recurrent Neural Networks

A
  1. Flow of information forms cycles/loops
  2. Directed cycles
  3. RNNs can be difficult to train
  4. RNNs are dynamic with their state continuously changing
  5. There are cycles
204
Q

Convolutional Neural Network

A
  1. Neural network commonly used for visual learning tasks
  2. Common uses: Image and video recognition, image classification, natural language processing
  3. Have an input, output, and many hidden layers
  4. Convolutional layers are paired with pooling layers
  5. The convolutional layers apply small filters that pick out certain features of an image
205
Q

Pooling Layer

A
  1. Simplifies (downsamples) inputs
  2. Usually succeed a non-linear activation function
  3. There is Average Pooling-calculates the average value
  4. There is Max Pooling-Calculates the maximum value
  5. Strides move along the pixels in the image to calculate results for the output
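
A minimal sketch in tf.keras of convolutional layers paired with max-pooling layers; the input shape (28x28 grayscale) and class count are made up:

  import tensorflow as tf

  model = tf.keras.Sequential([
      # Convolutional layer: 16 small 3x3 filters that pick out local features.
      tf.keras.layers.Conv2D(16, (3, 3), activation="relu", input_shape=(28, 28, 1)),
      # Max pooling downsamples each 2x2 region to its maximum value.
      tf.keras.layers.MaxPooling2D(pool_size=(2, 2)),
      tf.keras.layers.Conv2D(32, (3, 3), activation="relu"),
      tf.keras.layers.MaxPooling2D(pool_size=(2, 2)),
      tf.keras.layers.Flatten(),
      tf.keras.layers.Dense(10, activation="softmax"),
  ])
  model.summary()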
206
Q

GANS

A
  1. GANs are deep neural networks composed of two opposing (adversarial) neural networks: a Generator network and a Discriminator network
  2. Allow for the creation of things like: images, music, speech, poetry, deep fakes (face imitation)
  3. The generator network outputs generated images which go into the discriminator network; the discriminator predicts whether each image is real or fake and sends feedback to the generator
207
Q

Vision APIs

A

Optical Character Recognition (OCR): can detect text in images; there is text detection (detects text, returns it in JSON as blocks, and can also read handwriting) and document text detection (also reads text from images but is intended for dense document text)
Cropping Hints: Gives a cropping suggestion on how to crop the image and gives the vertices to crop it
Face Detection: Can detect faces and facial features
Image Property Detection: Sees dominant colors, these colors can group images/object or be used for recommendations
Label Detection: Can detect object, locations, activities, animal species, products, and much more
Landmark Detection: Detects landmark in an image, gives the landmark name, bounding polygon vertices, and the location using latitude and longitude
There is also Logo Detection
Explicit Content Detection-safe search detects if the image is adult, spoof, medical, violence, and racy
Web Entity and Page Detection-where the image is being used, links to web pages, similar images, best guess labels
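
A minimal sketch of calling label detection, assuming the google-cloud-vision client library (class paths vary slightly between library versions) and a made-up Cloud Storage image path:

  from google.cloud import vision

  client = vision.ImageAnnotatorClient()
  image = vision.Image()
  image.source.image_uri = "gs://my-bucket/photos/dog.jpg"

  response = client.label_detection(image=image)
  for label in response.label_annotations:
      print(label.description, round(label.score, 2))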

208
Q

Video Intelligence APIs

A

Detect Labels: annotates videos where entities are detected, list of video segments, frame annotations and shots
Shot change detection: annotates videos based on shots or scenes, entities associated with specific scenes
Detect explicit content: detect adult content, annotates explicit content, and timestamps where detected
Transcribe Speech: transcribes spoken words, profanity filtering, transcription hints, automatic punctuation, handling multiple speakers
Track objects: track multiple objects, provides location of each object with frames, bounding boxes for each object, time segments with offset
Detect Text: OCR on text occurring within videos, text and location of the text
Google Knowledge graph Search API allows you to do searches on Google’s Knowledge Graph-all entities and relationships between them, each object has an identifier
1. Getting ranked list of most notable entities that match criteria
2. Predictively completing entities within a search box
3. Annotating or organizing content using the knowledge graph entities

209
Q

Natural Language API

A

Looks at patterns within language/text, uses sentiment analysis, entity analysis, syntax analysis, entity-sentiment analysis, and content analysis

210
Q

Sentiment Analysis

A
  1. Score: Indicates overall emotion, ranges between -1 and 1(positive)
  2. Magnitude: how much emotional content there is; ranges from 0 to infinity, not normalized (proportional to the length of the document being assessed)
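
A minimal sketch of requesting sentiment analysis, assuming a recent version of the google-cloud-language client library (older versions take the document as a positional argument instead of a request dict):

  from google.cloud import language_v1

  client = language_v1.LanguageServiceClient()
  document = language_v1.Document(
      content="I love this product, but delivery was painfully slow.",
      type_=language_v1.Document.Type.PLAIN_TEXT,
  )

  sentiment = client.analyze_sentiment(request={"document": document}).document_sentiment
  print(f"score={sentiment.score:.2f}, magnitude={sentiment.magnitude:.2f}")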
211
Q

Entity Analysis

A
  1. Identifies entities within text
  2. Provides information on the identified entities
  3. Entities are nouns/things-Proper Nouns(Specific like Albert Einstein) and Common Nouns(mug=any mug)
212
Q

Entity-Sentiment Analysis

A
  1. Combines Entity and Sentiment Analysis
  2. Tries to determine sentiment expressed towards each of the identified entities
  3. Provides numerical score and magnitude values for each entity
213
Q

Syntax Analysis and Content Classification

A

Syntax Analysis: breaks a stream of text into sentences and tokens (tokenization); the sentences and tokens are then used to determine grammatical information

Content Classification: API will return categories that are most specific to the source text

214
Q

Dialogflow

A
  1. Natural Language interaction platform
  2. Mobile and web app, devices bots
  3. Analyzes text or audio inputs
  4. Responds to using text or speech
  5. Intents: categorize an end user’s intention-understand and respond-like a classification/object
  6. Intents have different training phrases that are mapped onto an intent, from which we can extract parameters
  7. Each parameter has a type, called an entity type
  8. The end user gives an input phrase, which goes to an agent; we get an intent through intent classification, and from this we get parameters that can trigger an action; from there we get a response that goes back to the end user
214
Q

Cloud Speech to Text: Has audio files and audio stream

A
  1. Synchronous Recognition: Returns result after all input audio has been processed
  2. Asynchronous Recognition: Initiates long running operation
  3. Streaming Recognition: Audio data is provided within a gRPC bi-directional stream
  4. Models: video, phone call, command and search, default
215
Q

Text-to-Speech API

A
  1. Uses text files or Speech Synthesis Markup Language(SSML)-allows you to control the way text is converted to speech
  2. SSML: Pauses, play sounds, speak cardinals, speak ordinals, speak characters, phrase substitution
216
Q

AutoML

A

Suite of ML Products
Facilitates the training of custom ML models
Highly performant
Speed of Delivery
Human labelling service
When you give it a problem and the kind of result you want, it finds a suitable neural network using neural architecture search (NN search)
When it gets a neural network that works from the NN bank, it uses transfer learning so the AutoML model can be easily trained to handle novel data

216
Q

AutoML Process

A
  1. Prepare and managed images-label training images, create dataset(single or multi-label)
  2. Training models-requires prepared dataset, may take a few hours to complete, training creates a new model
  3. Evaluating models-after training, evaluate on a test set; aggregated and detailed information (area under curve, confidence threshold curve, confusion matrix)
  4. Deploying models-deploy before making online predictions
  5. Making predictions-individual or bulk
  6. Undeploying models-after successful predictions or when you have a better model; there are cost implications of not undeploying models
217
Q

Vision Edge

A
  1. Export custom trained models
  2. Models optimized for edge devices
  3. TensorFlow Lite, Core ML, container export formats
  4. Edge TPUs, ARM and NVIDIA
  5. AutoML Vision Edge in ML Kit
218
Q

AutoML Translation

A

AutoML Translation-lets you build a domain-specific model, for example to translate English phrases to French
AutoML Translation Training-you use source target pairs, source sentences are in the source language, target sentences are in the target language
Translation Considerations
1. Data Coverage-include examples of the vocabulary, usage, and grammatical peculiarities that are specific to your domain; the model needs to be exposed to the language in some form
2. Human Involvement-people who understand both languages should be involved
3. Data Quality- This is VERY IMPORTANT for translation training, source and target documents need to align

218
Q

AutoML Natural Language

A
  1. Create custom models to classify content into custom categories you define
  2. When pre-defined categories are insufficient
  3. When you want to categorize content from free text
  4. When you want to create your own categories for categorization
219
Q

AutoML Table Capabilities

A
  1. Data Support: AutoML Tables provide info on missing data, AutoML tables provide correlations, cardinality, and distributions for each feature
  2. Automatic Feature Engineering: normalizes and bucketizes numerical features, creates one-hot encodings and embeddings for categorical features, performs basic text processing for text fields, and extracts time and date features from timestamp features
  3. Model Training: Parallel testing of multiple model types like Linear and Feed forward deep neural network, selects best model for predictions
220
Q

AutoML vs BQ

A

AutoML Tables vs BigQuery ML
1. BigQuery: Rapid Iteration, still deciding on features to include
2. AutoML: Optimizing model quality, have time available for model optimization, multifarious input features

221
Q

Kubeflow

A
  1. ML Toolkit for Kubernetes
  2. Data modeling with Jupyter Notebooks
  3. Tuning and training with TensorFlow
  4. Model serving and monitoring

Production Phase: Transform Data with pipelines, Train Model with (MPI, MXNET, PyTorch), Serve Model using (TFServing,KFServing,NVIDIA TensorRT), Monitor the model using(TensorBoard, Metadata)

Pipeline: Description of a ML workflow, including all of the components in the workflow and how the components relate to each other in the form of a graph
Pipeline Component: self-contained set of user code, packaged as a container, that performs one step in the pipeline

221
Q

AI Platform

A

Ingest Data: Cloud Storage, Cloud Storage Transfer Service
Prepare and Preprocess Data: Cloud Dataflow, Cloud Dataproc, BigQuery, Cloud Dataprep
AI Platform data labeling service can label training data by applying classification, object detection, and entity extraction
Develop and Train Models: Deep Learning VM, AI Platform Notebooks, AI Platform Training, KubeFlow
Test and Deploy Models: TensorFlow Extended, AI Platform Prediction, Kubeflow
Discovery: Google AI Hub

Quick and Ready to Go ML

222
Q

IAM Best Practices

A

The principle of least privilege: use predefined roles specific to each GCP product or service
Each part of the stack should have its own trust boundary

Policies applied to a parent object will be inherited by a child object
Use Groups in G-Suite or Cloud Identity, grant roles to groups and not individual users

222
Q

Service Accounts

A

Service Accounts: Special type of Google account designed to represent non-human users
1. Virtual Machines-act via SAs that determine the services they can access
2. Programmatic access-should always be achieved using a SA
3. IAM Roles-assigned to SAs in exactly the same way as a human user accounts

223
Q

Human User Accounts

A

Human User Accounts: Passwords and multi factor authentication
Service Accounts have Keys that can be downloaded in JSON format
A key can be used by an application to authenticate against Google APIs; keys should be rotated and carefully protected
Cloud IAM API: request OAuth, OpenID, or JWT credentials
Service Account User role-allows a user to impersonate the SA and access everything its IAM policies allow

224
Q

Data Security

A

Offers encryption in flight and at rest

Can use Cloud Key Management Service (KMS) if you want to make your own keys
Keys can be grouped together in key rings and can be used in multiple GCP services
Limit Blast Radius
VPC Service Controls-define security perimeter, only access services inside the perimeter
GCP Security Command Center: Asset Management Features, Web Security Scanner, Anomaly Detection, Threat Detection-internal

225
Q

Data Privacy

A

Should people have access to all the data?
What is the data? Do we need this to complete the task? Are we allowed to store it?-PII Personally Identifiable Information-personal information to identify a specific individual
GCP Cloud Data Loss Prevention: Text, Images, Pseudo-Anonymization(dummy data), Risk Analysis
DLP API can become expensive

226
Q

Industry Regulation

A

FedRAMP: US federal government (e.g. Department of Defense, Homeland Security)-governs how data is used and stored securely by cloud vendors; high compliance for most GCP services
Children’s Online Privacy Protection Act (COPPA): use of PII for children under the age of 13; incorporate parental consent, a clear privacy policy, and justification for data collection
HIPAA: protects personal health information; requires acceptance of a business associate agreement
PCI DSS: GCP is certified compliant (the infrastructure is secure enough), but your applications must also be compliant
GDPR: Europe; protects the personal data of EU citizens; applies to any region that interacts with the EU

227
Q

Dataprep Overview

A

Explore, clean, and prepare data
Visually define transformations
Export to Cloud Dataflow
Integrated partner service from Trifacta-links to GCP project and datasets

228
Q

Flows

A

Top-level container for bringing together and organizing datasets, recipes, and transformations in one place

229
Q

Datasets

A

Collections of data that we will use in a Dataprep flow-can import datasets from the local machine, Cloud Storage, or BigQuery

230
Q

Recipes

A

Like an instruction manual: a series of steps that perform transformations on the datasets and create new datasets
For recipes, there are visual controls to define these transformations
You can also see a visual preview
Can use the Automator to execute certain recipes at certain times
Flows are then executed on Cloud Dataflow

231
Q

Cloud Storage

A

Unstructured object storage; regional, dual-region, or multi-region; Standard, Nearline, or Coldline storage classes; storage event triggers
Fully managed object storage-can store images and other objects, access via API or SDKs; multiple storage classes plus lifecycle management for objects and buckets; very secure and durable

232
Q

Cloud Bigtable Def

A

Petabyte-scale NoSQL database, High-throughput and scalability-Wide column key/value data, Time-series, transactional, IoT data

233
Q

BQ Def

A

Petabyte scale data warehouse, Fast SQL queries across large datasets, Foundations for BI and AI, and has Public Datasets

234
Q

Cloud SQL

A

Managed MySQL, PostgreSQL, and SQL Server instances; built-in backups, replicas, and failover; vertically scalable

235
Q

Cloud Spanner

A

Global SQL-based relational database, horizontal scalability and high availability, strong consistency, good for financial sector

236
Q

Cloud Firestore

A

Fully managed NoSQL document database, large collections of small JSON documents, provides a real time database SDKs

237
Q

Cloud Memorystore

A

Managed Redis instances, in-memory DB cache or message broker, built in high availability, vertically scalable

238
Q

Data Modeling

A

Structured data needs a consistent model; a model may already be in place; data could require prep or transformation. 3 stages: Conceptual-what are the entities/relationships? Logical-what are the structures of the entities, can the model be normalized? Physical-how will I implement this in the database? What keys or indexes do I need?

239
Q

Relational Data vs Non Relational Data

A

Relational-good schema design, normalization to reduce waste, accuracy and integrity through data types and tables
Non-relational-simple key/value store or document store, or a high-volume columnar database-NoSQL
A pipeline could, for example, move data from one of these databases into BigQuery

240
Q

Bucket

A

Stores objects or files; bucket names are globally unique; buckets exist within projects; regional (lowest cost), dual-region, and multi-region (both geo-redundant)
Storage classes: Standard $0.02 per GB; Nearline $0.01 per GB with a 30-day minimum storage duration and a data retrieval fee; Coldline $0.004 per GB with a 90-day minimum storage duration

241
Q

GCS Info

A

Archive storage: $0.0012 per GB with a 365-day minimum storage duration and a data retrieval fee
Objects are stored as opaque data; objects are immutable; overwrites are atomic; objects can be versioned
Access through the Google Cloud console, HTTP API, SDKs, and the gsutil command in the terminal; parallel uploads, transcoding, integrity checking, requester pays
GCS costs: operation charges (Class A, e.g. uploads, cost more than Class B, e.g. downloads), network charges (like retrieving data from a bucket), and data retrieval charges. Can apply lifecycle rules to a bucket. IAM access for buckets, ACLs for granular access, or signed policy documents-IAM has members and roles
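
A minimal sketch (made-up bucket and file names) of uploading an object and applying lifecycle rules with the google-cloud-storage client library:

  from google.cloud import storage

  client = storage.Client()
  bucket = client.get_bucket("my-example-bucket")

  # Upload a local file as an object.
  blob = bucket.blob("exports/report.csv")
  blob.upload_from_filename("report.csv")

  # Lifecycle rules: move objects to Nearline after 30 days, delete after 365 days.
  bucket.add_lifecycle_set_storage_class_rule("NEARLINE", age=30)
  bucket.add_lifecycle_delete_rule(age=365)
  bucket.patch()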

242
Q

Service Accounts Best Practices

A

IAM for bulk access to buckets, roles assigned to members, ACLs for granular access to buckets, ACLs grant permissions to a scope, IAM is more recommended

243
Q

Data Transfer Service

A

Cloud Storage Transfer Service-moves data from a source to a sink (Cloud Storage bucket); sources can be HTTP/HTTPS, Amazon S3, or another Cloud Storage bucket; can filter on names and dates, schedule transfers, delete objects in the destination bucket, and delete objects in the source bucket
Full access: storagetransfer.admin; Submit transfers: storagetransfer.user; List jobs and operations: storagetransfer.viewer

244
Q

BQ Transfer Service

A

Automates data transfer into BigQuery; data is loaded on a regular basis; backfills can recover gaps; sources include Google marketing products, with more sources in beta

245
Q

Google Transfer Device

A

Transfer Appliance: for very, very large amounts of data; a physical storage device (available in terabyte-scale versions) that is attached to your server; data is encrypted, so security is guaranteed

246
Q

Human Accounts Cont

A

Users are human users who authenticate with their own credentials; should not be used for non-human operations, since passwords could leak

247
Q

Service Accounts Cont

A

Created for a specific non-human task with granular authorization; the identity can be assumed by an application; keys can be easily rotated; there are Google-managed and user-managed service accounts; SAs are managed by keys; user-managed keys are downloadable JSON files-very powerful