GCP Professional Data Engineer Cert Flashcards
Relational Databases
Has relationship between tables
Google Cloud SQL: Managed SQL instances-little setup required, multiple database engines such as MySQL, scalability and availability (vertically scales up to 64 cores), secure connections via the Cloud SQL Proxy, SSL/TLS, or private IPs, plus maintenance windows, automated backups, point-in-time recovery, and instance restores
Importing MySQL Data Commands: InnoDB mysqldump export/import, CSV import, External replica promotion-need binary log retention
PostgreSQL instances are another option-they have automated maintenance and high availability, though some features are unsupported
Import PostgreSQL commands: SQL dump export/import, CSV import
Cloud Firestore
- Fully managed NoSQL document store-serverless with autoscaling
- Realtime DB with mobile SDKs, Android and iOS client libraries, and frameworks for popular programming languages
- Strong scalability and consistency-horizontal autoscaling
Multiple documents bundled together = a collection
Documents can contain sub-collections, e.g. messages within a chat room document
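A minimal sketch of the document/collection model using the google-cloud-firestore Python client; the "chatrooms" collection, "messages" sub-collection, and field names are made-up placeholders, not anything defined in these notes:

```python
from google.cloud import firestore

db = firestore.Client()

# A collection ("chatrooms") bundles multiple documents ("room-a").
room_ref = db.collection("chatrooms").document("room-a")
room_ref.set({"name": "Room A", "members": 12})

# Each document can hold sub-collections, e.g. the room's messages.
msg_ref = room_ref.collection("messages").document()  # auto-generated document ID
msg_ref.set({"from": "alice", "text": "hello", "sent": firestore.SERVER_TIMESTAMP})
```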
Cloud Spanner
- Managed SQL -compliant DB-SQL schemas and queries with ACID transactions
- Horizontally scalable: Strong consistency across rows, regions from 1 to 1,000s of nodes
- Highly available-automatic global replication, no planned downtime and a 99.999% SLA
High Cost
CAP Theorem
Consistency-every read reflects the most recent write, Availability-the database is always available for queries, Partition Tolerance-the system must tolerate failures and the loss of part of the network (a partition)
A distributed system can realistically only guarantee two of the three at once
Spanner is strongly consistent and highly available; in rare cases it will choose consistency over availability. It runs on Google's global private network and offers five 9s of availability
Cloud Spanner Architecture
An instance is an allocation of resources: an instance configuration (regional or multi-regional) and an initial number of nodes
Regional configuration: a region contains one or more zones, each holding a replica of the data. You specify a node count per instance; with node count 1 each replica is powered by a single VM, and increasing the node count adds more machines for more computing power. The replicas stay the same-only the compute nodes serving them change, and each node spans the replicas across the different zones
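A hedged sketch of querying a Spanner instance with the google-cloud-spanner Python client; the instance, database, and table names are placeholders:

```python
from google.cloud import spanner

client = spanner.Client()
instance = client.instance("test-instance")   # node count was chosen when the instance was created
database = instance.database("example-db")

# Reads are served from the replicas; compute comes from the instance's nodes.
with database.snapshot() as snapshot:
    results = snapshot.execute_sql("SELECT SingerId, FirstName FROM Singers LIMIT 10")
    for row in results:
        print(row)
```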
Cloud Memorystore
In memory database
1. Fully managed Redis Instance-provisioning, replication, failover-fully automated
2. Basic tier: an efficient cache that can withstand a cold restart and a full data flush
3. Standard tier-adds a cross-zone replication and automatic failover
Benefits-no need to provision own VMs,scale instances with minimal impact, private IPs and IAM, automatic replication and failover
Creating an Instance: choose Redis version 3.2 or 4, a service tier and region, and memory capacity of 1-300 GB (this determines network throughput); add configuration parameters as needed
Connecting to Instances: Compute Engine, Kubernetes Engine, App Engine, Cloud Functions (via the serverless VPC connector)
Import and Export: Export to RDB backup (BETA)-admin operations are not permitted during export, latency may increase, RDB file is written to Cloud Storage
Import from RDB backup: overwrites all current instance data; the instance is unavailable during the import process
Use Cases: Redis can be used as a session cache (common uses are logins and shopping carts), as a message queue that enables loosely-coupled services, or for advanced Pub/Sub-style messaging
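A minimal sketch of the session-cache use case with the standard redis-py client; the host IP stands in for your Memorystore instance's private IP:

```python
import redis

r = redis.Redis(host="10.0.0.3", port=6379)

# Cache a user's shopping cart for one hour.
r.setex("session:user42", 3600, '{"cart": ["sku-1", "sku-2"]}')

cart = r.get("session:user42")  # returns None once the key has expired
print(cart)
```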
Comparing Storage Options
- Ask yourself: is this structured or unstructured data? Structured: SQL data, NoSQL data, analytics data, keys and values. Unstructured: binary blobs, videos, images, proprietary files-for unstructured data use Cloud Storage
- Is the data going to be used for analytics? Low latency vs warehouse. Low latency: petabyte scale, single-key rows, time series or IoT data-choose Cloud Bigtable. Warehouse: petabyte scale, analytics warehouse, SQL queries-choose BigQuery
- Is this relational data? Horizontal scaling vs vertical scaling. Horizontal scaling: ANSI SQL, global replication, high availability and consistency-it's expensive, but if the client can afford it (most financial institutions probably would), choose Cloud Spanner. Vertical scaling: MySQL or PostgreSQL, managed service, high availability-choose Cloud SQL
- Is the data non-relational? NoSQL vs key/value. NoSQL: fully managed document database, strong consistency, mobile SDKs and offline data-choose Cloud Firestore. Key/Value: managed Redis instances (does what Redis does)-choose Cloud Memorystore
Streaming
Continuous collection of data, near real time analytics, windows and micro batches
Batch
Data gathered with a defined time window, large volumes of data, data from legacy systems
No-SQL
Anything that is not SQL-key/value stores, JSON document stores, tools like MongoDB and Cassandra
SQL
Row-based tabular data, relational-tables connect to other tables in queries
On-Line Analytical Processing (OLAP)
Low volume of long running queries
Aggregated historical data-purchasing analytics
On-Line Transactional Processing (OLTP)
High volume of short transactions, high integrity, sql
Modifies the database
Defines Big Data
- Volume: Scale of information being handled by data processing systems
- Velocity: Speed at which data is being processed, ingested, analyzed, and visualized
- Variety: The diversity of data sources, formats, and quality.
Map Reduce
A programming model-Map and Reduce functions
Distributed Implementation
Created at Google to solve large-scale data processing problems
Map Function
Takes an input pair from the user and produces a set of intermediate key/value pairs
Reduce Function
Merges intermediate values associated with the same intermediate key, forms a smaller set of values
This model standardized the framework; the implementation abstracts away the distributed computing details: parallelizing and executing-partitioning, scheduling, and fault tolerance
Splits all the jobs to small chunks
Master and worker cluster model
Failed worker jobs reassigned
Worker files buffered to local disk
Partitioned output files
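An illustrative word-count sketch of the Map and Reduce functions in plain Python, ignoring the distributed machinery (partitioning, scheduling, fault tolerance) that a real MapReduce implementation handles for you:

```python
from collections import defaultdict

def map_fn(document):
    # Emit an intermediate key/value pair for every word.
    for word in document.split():
        yield (word.lower(), 1)

def reduce_fn(key, values):
    # Merge all intermediate values that share the same key.
    return (key, sum(values))

# Simulate the shuffle step: group intermediate values by key.
intermediate = defaultdict(list)
for doc in ["the quick brown fox", "the lazy dog"]:
    for key, value in map_fn(doc):
        intermediate[key].append(value)

print([reduce_fn(k, v) for k, v in intermediate.items()])
```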
Hadoop and HDFS
Named after a toy elephant-inspired by the Google File System-originated in the Apache Nutch project-became a sub-project in 2006
Modules: Hadoop Common-base module with the shared libraries and startup scripts, Hadoop Distributed File System (HDFS)-a distributed, fault-tolerant file system that runs on commodity hardware as part of a Hadoop cluster, Hadoop YARN-handles resource management tasks like job scheduling and monitoring for Hadoop jobs, Hadoop MapReduce-Hadoop's own implementation of the MapReduce model, which includes libraries for map and reduce functions, partitioning, reduction, and custom job configuration parameters
HDFS Architecture-can help with Cloud Dataproc
There is a server running the Name Node-within the Name Node there is the filesystem metadata
Other servers run Data Nodes, which store very large files across the cluster as a series of blocks
Servers are grouped into racks within the cluster so the shortest possible network path can be chosen
The client can make multiple requests: one to the Name Node for metadata, then requests across racks to get data from multiple Data Nodes
Blocks are replicated across servers and racks for fault tolerance
The YARN architecture is similar, but a server runs a Resource Manager instead-the client sends jobs to the Resource Manager, then on individual workers the Node Manager process runs to handle local resources, request tasks from the master, and return the results
Apache Pig-A high level framework for running MapReduce jobs on Hadoop clusters
Platform for analyzing large datasets
Pig Latin defines analytics jobs: merging, filtering, and transformation-high level, with SQL-like simplicity
Good for ETL jobs since it has a procedural data flow
And it is an abstraction for MapReduce
Apache Pig compiles these instructions into MapReduce jobs, which are then sent to Hadoop for parallel processing across the cluster
Apache Spark
The linear flow of data was an issue-e.g. reading data, mapping across it, reducing the results, and writing to disk
Apache Spark-a general-purpose cluster-computing framework-allows concurrent computational jobs to be run across massive datasets
It uses resilient distributed datasets (RDDs), treating a working set of data as a form of distributed shared memory
Spark Modules
Spark SQL-structured data in spark stored in abstraction, programmatic querying-data frames API
Spark Streaming-streaming data ingestion in addition to batch processing-very small batches
MLLib-machine learning library, machine learning algorithms-classification, regression, decision trees
GraphX-iterative graph computation
Supports languages: Python, Java, Scala, R, SQL
MUST have 2 things: a Cluster Manager (YARN or Kubernetes) and a distributed storage system (HDFS, Apache HBase, Cassandra)
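A hedged PySpark word-count sketch; assumes a cluster (or local mode) is available and the GCS input path is a placeholder:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("word-count").getOrCreate()

lines = spark.read.text("gs://example-bucket/input.txt").rdd.map(lambda r: r[0])
counts = (lines.flatMap(lambda line: line.split())   # one element per word
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))      # aggregate counts per word

print(counts.take(10))
spark.stop()
```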
Hadoop vs Spark
Hadoop: Slow disk storage, high latency, slow, reliable batch processing
Spark: Fast memory storage, low latency, stream processing, 100x faster in-memory, 10x faster on disk, more expensive
Apache Kafka
Publish/subscribe to streams of records
Like a message bus but for data
High throughput and low latency-ingesting millions of events from devices
Ex: Handling >800 Billion messages a day at LinkedIn
Four main APIs in Kafka: Producer-allows an app to stream records to a Kafka topic. Consumer-allows an app to subscribe to one or more topics and process the stream of records contained within. Streams-an API designed to allow an application to be a stream processor itself-transform data, then send it back to Kafka. Connect-for building reusable connectors that link Kafka topics to external applications and data systems
Kafka vs Pub/Sub
Kafka: Guaranteed message ordering, tunable message retention, polling(Pull) subscriptions only, unmanaged
Pub/Sub: No message ordering guaranteed, 7 day maximum message retention, pull or push subscription, managed
Pub Sub Intro
Message Bus takes care of all messages between devices
Pub/Sub splits it in different topics-anything can publish a message to a topic or choose to receive a message from a topic.
Information from users/apps are published to a topic
Topics decouple producers from consumers on the message bus-this introduces resilience-Pub/Sub is a shock absorber
Cloud Pub/Sub: global messaging and event ingestion, serverless and fully managed, 500 million messages per second, 1 TB/s of data
Pub/Sub Great Features-Multiple publisher/subscriber patterns, at least once delivery, real time or batch, integrates with Cloud Dataflow
Use Case Distributing Workloads Pub/Sub
queue up a large number of tasks in a Pub/Sub topic and distribute it amongst multiple workers-like compute engine instances
Asynchronous Workflows Pub/Sub
Controls the order of events: an order can be sent into a topic, consumed by a worker system like invoicing, then passed into a queue for the next system to consume, like packaging and posting
Distributing Event Notifications Pub/Sub
A system sets up new users when they register with your service: a registration could publish a message, and the system could be notified to set the user up
Distributed Logging: logs could be sent to a Pub/Sub topic to be consumed by multiple subscribers, like a monitoring system and an analytics database for later querying
Device Data Streaming Pub/Sub
Hundreds of thousands or more internet-connected devices can stream their data into Pub/Sub topics so that it can be consumed on demand by your analytics streams, or transformed through Dataflow first
One-to One Pub/Sub
There is a Publisher, Topic, and then a Subscriber
Publisher sends messages to the topic in Pub/Sub
The subscriber receives the messages and reads them through their own subscription
Many to many Pub/Sub
Just like the one-to-one pattern but this has multiple topics
Publishing Messages
Create a message containing your data (a JSON payload that's base64 encoded, with a payload size of 10MB or less), then send the payload as a request to the Pub/Sub API, specifying the topic the message should be published on
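A minimal publishing sketch with the google-cloud-pubsub Python client; the project and topic names are placeholders, and the client handles encoding the payload for you:

```python
import json
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "my-topic")

payload = json.dumps({"user": "alice", "action": "signup"}).encode("utf-8")  # bytes, <= 10MB
future = publisher.publish(topic_path, payload, origin="web")  # extra kwargs become message attributes
print(future.result())  # message ID once the publish succeeds
```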
Receiving Messages
Create a subscription to a topic-subscriptions are always associated with a single topic. Pull is the default delivery method: make ad hoc pull requests to the Pub/Sub API specifying your subscription, and acknowledge each message you receive or it will be redelivered. The push delivery method sends messages to an endpoint-the endpoint must be HTTPS with a valid SSL cert and accept POST requests
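A minimal pull-and-acknowledge sketch; the subscription name is a placeholder, and unacknowledged messages will be redelivered:

```python
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
sub_path = subscriber.subscription_path("my-project", "my-subscription")

response = subscriber.pull(request={"subscription": sub_path, "max_messages": 10})
for received in response.received_messages:
    print(received.message.data)

ack_ids = [m.ack_id for m in response.received_messages]
if ack_ids:
    subscriber.acknowledge(request={"subscription": sub_path, "ack_ids": ack_ids})
```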
Integrations Pub/Sub
Client libraries for popular languages like Python, C#, Go, Java, Node, PHP, and Ruby. Cloud Dataflow is supported-you can use the Apache Beam SDK to read messages individually or in batches. Also supported by Cloud Functions and Cloud Run. Foundation of Cloud IoT Core-sends and receives messages from connected devices
Developing for Pub/Sub: Local Pub/Sub emulator-Google Cloud SDK and Java Runtime Environment 7+
Advanced Pub/Sub Topics
At Least Once Delivery: Each message is delivered at least once for every subscription
Undelivered Messages: deleted after the message retention duration-default 7 days-can’t be longer
Messages published before a subscription is created will not be delivered to that subscription
Subscriptions expire after 31 days of inactivity-new subscriptions with same name have no relationship to the previous subscription
Other Features
Seeking Feature: set retain acked messages to True so the subscription retains messages after they are acknowledged-by default messages are retained for a maximum of 7 days. You can then tell the subscription to seek to a specific time in the past-basically rewinding the clock to receive past messages. You can also seek to a future timestamp
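A hedged sketch of seeking a subscription back in time with the Python client (assumes the subscription has retain acked messages enabled; names and the timestamp are placeholders):

```python
import datetime
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
sub_path = subscriber.subscription_path("my-project", "my-subscription")

# Rewind the subscription so messages retained after this time are redelivered.
one_hour_ago = datetime.datetime.now(tz=datetime.timezone.utc) - datetime.timedelta(hours=1)
subscriber.seek(request={"subscription": sub_path, "time": one_hour_ago})
```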
Snapshots: useful if you are deploying new code. You can save a snapshot ahead of time to capture the current state of the subscription, preserving all unacknowledged and future messages
Ordering Messages: you may not receive messages in the right order-use timestamps when final order matters, or consider an alternative for transactional ordering, maybe a SQL database
Resource Locations: Messages stored in nearest region, message storage policies allow you to control this, additional egress fees may apply
Access Control Pub/Sub
Use service accounts for authorization, grant per-topic or per-subscription permission, grant limited access to publish or consume messages
Exam Tips Pub/Sub
Think about where you can decouple data-Pub/Sub is a shock absorber, receives data globally and it can be consumed by other components at their own pace
Where can you use Pub/Sub for events. It can add event logic to a stack and it can pass events through one system to another
Be aware of Pub/Sub limitations-message data must be 10MB or less, beware of expired messages and unused subscriptions
Look for Apache Kafka in use cases, if this comes up, Pub/Sub can be a good option
Keep an eye out for Cloud IoT as a solution
Google Cloud Tasks-get familiar with it
Browse the reference architectures-Smart Analytics references
What is Dataflow
Fully managed, serverless tool, uses the open source Apache Beam SDK, supports expressive SQL, Java, and Python APIs, real-time and batch processing, and integrates with the rest of the GCP stack
Beam's unified development model allows us to reuse code across streaming and batch pipelines
Sources: Cloud Pub/Sub, BigQuery, and Cloud Storage, it can be external to GCP like Kafka
Common Sinks: Cloud Storage, BigQuery, and Bigtable; Cloud Machine Learning can also be applied to the data being written to a sink
Dataflow Process
You have a Pipeline, Source, and Sink
The Pipeline will take data from the Source processes data and then places it into the Sink
Apache Beam connectors allow you to connect to the Source and Sink so you can read and then write your output data into the Sink
Common Dataflow Sources
Cloud Pub/Sub, BigQuery, and Cloud Storage, it can be external to GCP like Kafka
Common Dataflow Sinks
Cloud Storage, BigQuery, and Bigtable; Cloud Machine Learning can also be applied to the data being written to a sink
Driver
The program you write using the Apache Beam SDK (Java or Python). It defines your pipeline
Pipeline: full set of transformations that your data undergoes from initial ingestion to final output
Driver goes to the runner
Runner
Software that manages the execution of your pipeline-a translator for the backend execution framework; it can also manage local execution of driver programs for testing and debugging
PCollections
Used in pipelines to represent data as it is transformed within the pipeline-a multi-element dataset. PCollections can represent both batch and streaming data.
Data coming from a fixed source, the dataset=Bounded, treated like a batch
Continuously updating source, dataset=Unbounded(Stream)
The PCollection is usually from reading an external source
Transform usually represents a step in your pipeline, transforms use PCollections as inputs and outputs, each transform takes one or more PCollections as inputs and generates 0 or more output PCollections
Pipeline Development Cycle
- You have to Design your pipeline first-input and output methods, structure and transformations
- Then you Create it-instantiating a pipeline object and implementing the transformations that were identified
- Testing: debugging a failed pipeline execution on a remote system is hard, so try to do local unit testing first
Considerations
- Start with the location of your data
- Input data structure and format
- Transformation objectives
- Output data structure and location
Pipelines Structures
Basic: Input linear to output
Branching: a single PCollection can branch-two transforms applied to the same PCollection result in two different output PCollections
Branching can also be conducted on a Transform
Pipeline branches can also be merged; you need to merge all branches of your pipeline at some point, through a Flatten or Join transform
Pipelines can also have multiple sources and they can be independently transformed
DAG
Dataflow Pipelines represent a Directed Acyclic Graph or DAG-a graph with a finite number of vertices and edges-no directed cycles
Pipeline Creation
- Create an Object
- Create a PCollection using read or create transform
- Apply multiple transforms as required
- Write out final PCollection
- Execute the pipeline using the pipeline runner
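A minimal sketch of those creation steps with the Apache Beam Python SDK, run locally on the DirectRunner; the file paths are placeholders:

```python
import apache_beam as beam

with beam.Pipeline(runner="DirectRunner") as p:             # 1. create the pipeline object
    lines = p | beam.io.ReadFromText("input.txt")           # 2. initial PCollection from a read transform
    counts = (lines                                         # 3. apply transforms as required
              | beam.FlatMap(lambda line: line.split())
              | beam.Map(lambda word: (word, 1))
              | beam.CombinePerKey(sum))
    formatted = counts | beam.Map(lambda kv: f"{kv[0]}: {kv[1]}")
    formatted | beam.io.WriteToText("output")               # 4. write out the final PCollection
# 5. leaving the with-block executes the pipeline on the runner
```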
ParDo
generic parallel processing transform: can take an element from PCollection1 and transform it to PCollection2, can output 1, none, or multiple output elements from a single input element
User-defined function(UDF)
user written code that describes the operation to apply to each element of the input PCollection
Aggregation Transformation
The process of computing a single value from multiple input elements-either across all elements, or across all elements within a window
Characteristics of PCollections
- Any data type-but all elements must be of the same type
- Don’t support random access
- Immutable or unchanging
- Boundedness-a PCollection can be Bounded (a finite number of elements) or Unbounded (no upper limit on the number of elements it can contain)
- Timestamp is associated with every element of a PCollection-initially assigned by the source that results in the creation of the PCollection
Core Beam Transforms
- ParDo-generic parallel processing transform
- GroupByKey-processes collection of key value pairs, collects all values associated with a unique key
- CoGroupByKey-used when combining multiple PCollections-performs a relational join of two or more key value PCollections where they have the same key type
- Combine-requires you to provide a function that defines the logic for combining elements; it has to be associative and commutative-sum, min, max
- Flatten- merges multiple input PCollections into a single logical PCollection
- Partitioning-provides the logic that determines how the elements of the PCollection are split up.
Event time Dataflow
Event time is when the data element actually occurred, determined by the timestamp on the data element itself; processing time refers to the times the element was processed during its transit through your pipeline
Windowing
Assigned to a PCollection, subdivides the elements of a PCollection according to their timestamps, do this to allow grouping or aggregating operations over unbounded collections, it groups elements into finite windows
Fixed Window
- Fixed-simplest, constant non overlapping time interval
Sliding Window
- Sliding-represent time intervals, but they can overlap and an element can belong to more than one window-useful for taking running averages of data
Per Sessions
A new session window is created in a stream when there is an interruption in the flow of events that exceeds a certain time period; applied on a per-key basis-useful for data that is irregularly distributed with respect to time
Single global
everything else-window transform
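A hedged sketch of applying a fixed one-minute window to an unbounded PCollection; "events" stands in for a streaming source such as Pub/Sub:

```python
import apache_beam as beam
from apache_beam import window

def count_per_minute(events):
    # Subdivide the unbounded collection into 60-second windows, then aggregate per key.
    return (events
            | "FixedWindow" >> beam.WindowInto(window.FixedWindows(60))
            | "CountPerKey" >> beam.combiners.Count.PerKey())
```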
Watermark
The system's notion of when all the data for a certain window can be expected to have arrived. Data is late when the watermark has moved past the end of the window and further data elements arrive with a timestamp within that window
Triggers
- Event time, event-time based
- Processing time
- Data driven-when data in a particular window meets a certain criterion
- Composite-combine other triggers in different ways
Pipeline Access
Run Cloud Dataflow pipelines
1. Can be run locally
2. Submit pipeline to GCP Dataflow managed service
GCP service accounts
1. Cloud Dataflow service-uses Dataflow service account
2. Worker instances-Controller service account
Cloud Dataflow Managed Service
- The pipeline gets submitted to the GCP Dataflow Service
- The Dataflow will create a Job
- The Job creates managers and workers to carry out various tasks
- For the execution, the workers need files/resources from Cloud Storage
- The Job can be monitored with the Cloud Dataflow Monitoring Interface or the Cloud Dataflow Command-line Interface
Cloud Dataflow Service Account
- Automatically created when Cloud Dataflow project is created
- Manipulates job resources
- Assumes the Cloud Dataflow service agent role
- Has Read/Write Access to project resources
Controller Service Account-used by the workers-uses the Compute Engine
- Compute Engine instances-execute pipeline operations
- Run Metadata operations-don’t run on local clients or compute engine workers-determine size of file in Cloud Storage
- User-managed controller service account-use resources with fine-grained access control
Security Mechanisms
- Submission of the pipeline-users have to have the right permissions
- Evaluation of the pipeline-data is encrypted and not persisted beyond the evaluation of the pipeline; communication between workers happens over a private network, subject to the project's permissions and firewalls; you can specify region and zone
- Accessing telemetry or metrics-encrypted at rest-controlled by project’s permissions
- You can also use Cloud Dataflow IAM roles
Regional Endpoints in Dataflow
- Manages metadata about Cloud Dataflow jobs
- Controls Cloud Dataflow workers
- Automatically selects best zone
Good reasons for regional endpoints
1. Security and compliance
2. Data locality
3. Resiliency
Machine Learning with Cloud Dataflow
- Handles data extraction from Cloud Storage
- Data Preprocessing in Apache Beam pipeline through Cloud Dataflow, TensorFlow API used to normalize some values between 0 and 1, the Beam partition transform is used to split the data set into the training data set and the evaluation data set
- TensorFlow is used to train a model locally on your machine or through Cloud Machine Learning-doesn’t use Cloud Dataflow
- Predictions-Cloud Dataflow reads data from Pub/Sub, applies the model, and writes the predictions into another Pub/Sub topic
Benefits of Dataflow
You can use customer-managed encryption keys
Batch pipelines can be processed in a cost-effective manner with Flexible Resource Scheduling (FlexRS)-uses advanced scheduling, the Cloud Dataflow Shuffle service, and preemptible VMs
Cloud Dataflow is great for migrating MapReduce jobs-on-premises MapReduce jobs can be rebuilt on Cloud Dataflow
Cloud Dataflow with Pub/Sub Seek-replay and reprocess previously acknowledged messages-especially in bulk
Cloud Dataflow SQL
- Develop and run Cloud Dataflow jobs from the BigQuery web UI
- Cloud Dataflow SQL (ZetaSQL variant) integrates with Apache Beam SQL
Apache Beam SQL-Query bounded and unbounded PCollections, Query is converted to a SQL transform
Cloud Dataflow SQL-Utilise existing SQL skills, join streams with BigQuery tables, query streams or static datasets, write output to BigQuery for analysis and visualization
Dataflow Exam Tips
Beam and Dataflow are the preferred solution for data processing pipelines, especially for streaming data
Pipeline: represents the complete set of stages required to read data perform any transformations and write data
PCollection: represents a multi-element dataset that is processed by the Pipeline
ParDo: core parallel processing function of Apache Beam which can transform elements of an input PCollection into an output PCollection.
DoFn: template you use to create user-defined functions that are referenced by a ParDo
Sources-where data is read from
Sinks-where data is written to
Window: allows streaming data to be grouped into finite collections according to time or session-based windows
Watermark: indicates when Dataflow expects all data in a window to have arrived; data arriving past the watermark is considered late
Dataflow is normally the preferred solution for data ingestion pipelines
Cloud Composer is sometimes used for ad hoc orchestration/provide manual control of Dataflow pipelines themselves
What is Dataproc?
A managed cluster service for Hadoop and Apache Spark
Managed is preferable because it is low cost and you can control which clusters to grow and which clusters to turn off
Dataproc Architecture
Master: Dataproc creates a master node running the YARN Resource Manager and the HDFS Name Node
It also runs the Worker Nodes
Clusters come pre-installed with Hadoop, Apache Spark, Zookeeper, Hive, Pig, Tez, and other tools like Jupyter Notebooks and the GCS connector
Storage and configuration are handled by Dataproc
Dataproc benefits
- Cluster actions complete in ~90 seconds
- Pay-per-second, with a 1-minute minimum
- Scale up/down or turn off at will
Using Dataproc
You can submit Hadoop/Spark jobs, Enable autoscaling-if necessary to cope with the load of the job, Output to GCP Services-like Google Cloud Storage, BigQuery and BigTable, you can also Monitor with Stackdriver-fully integrated logging and monitoring for the job performance and output
Cluster Location
Regional: isolate resources used for Dataproc to one region, like us-east1 or europe-west1
Global: Resources not isolated to a single region-can place cluster in any zone worldwide
Single Node Cluster
A single VM that runs both the master and worker processes-can't autoscale
Standard Cluster
Has a Master VM that runs the YARN Resource Manager and the HDFS Name Node, plus two Worker Nodes that each run a YARN Node Manager and an HDFS Data Node-the disks are customizable. There are also Pre-emptible Workers-they sometimes help with large jobs, but can't provide storage for HDFS
High Availability Cluster
You have three Masters with YARN and HDFS configured to run in high availability mode-no interruptions
Submitting Jobs
- Gcloud command line
- GCP Console
- Dataproc API
- SSH to Master Node
Monitoring and Logging
- Use Stackdriver Monitoring to monitor cluster health
- Cluster/yarn/allocated_memory_percentage
- Cluster/hdfs/storage_utilization
- Cluster/hdfs/unhealthy_blocks
Custom Clusters
You can customize the Dataproc default image: Google provides a generation script, you apply a customization script that adds your custom packages on top of the default image, and the resulting custom image is then stored in your Google Cloud project
You can also have:
Custom cluster properties-so you can change the values
You can add initialization actions that are custom to the cluster-scripts loaded to a Cloud Storage Bucket-mostly for Staging binaries
You can also add custom Java/Scala dependencies-this saves you from precompiling them yourself
Autoscaling in Dataproc
Huge bonus: you can create lightweight clusters and have them automatically scale up to the demands of the job-autoscaling policies are written in YAML, with configuration numbers for primary workers and secondary workers
When not to use Autoscaling
- When data is stored on HDFS
- When running Apache Spark Streaming
- When clusters are idle (just delete them)
- When using YARN Node Labels
Workflow Templates
Written in YAML that can specify multiple jobs w/ different configs and parameters that can be run in succession
Workflow Templates have to be created, then instantiated with gcloud-you can send jobs to a new cluster each time or to an existing cluster
Advanced Compute Features Dataproc
- Local SSDs-faster runtimes
- GPUs to nodes-for machine learning
Cloud Storage Connector
- Use GCS instead of HDFS
- Cheaper than persistent disk
- High availability and durability
- Decouple storage from cluster lifecycle
Exam Tips
Know when to choose Dataproc: Quickly migrating Hadoop and Spark workloads into Google Cloud Platform
Understand the benefits of Dataproc: Managed over Hadoop or Spark cluster-Ease of scaling, being able to use Cloud Storage instead of HDFS, and the connectors to other GCP services like BigQuery and Bigtable
Know Cluster Options: When to pick standard vs high availability, autoscaling and ephemeral
Get to know the open-source Big Data ecosystem-Hadoop, Spark, Zookeeper, Hive, Tez, and Jupyter
Know when to choose Dataflow-sometimes it is the preferred product for big data ingesting, like in streaming workloads and it implements the Apache Beam SDK
Bigtable Concepts
Managed wide-column NoSQL database-series of key value pairs where the values are split into columns
Has very high throughput-10,000 reads per second per node
Also has low-latency-6 milliseconds per node
Scales linearly
Out of the box high availability-cross cluster replication
Developed internally by Google and was used for Google Earth, Finance, and Web Indexing
HBase was created as the open-source implementation of the Bigtable model and was adopted as a top-level Apache project; Cloud Bigtable supports the Apache HBase library for Java
Cloud Bigtable
- Has a ROW KEY as the only index
- Then it can be attached to columns
- The columns can be grouped by families
- The empty values don’t take up any space since it’s a sparse db
- Scaled to thousands of columns and billions of rows
Important Bigtable Features
- Blocks of contiguous rows are sharded into tablets
- Tablets are chunks of sorted rows-put together they form a complete table-managed by nodes in your cluster
- Tablet data is stored in Google Colossus-can scale cluster sizes
- Splitting, merging, and rebalancing happen automatically
Bigtable Scenarios
Suited well for financial, marketing data and transactional data
Also good for time series data and data from IoT devices
Good for streaming data and machine learning applications
Bigtable Architecture
You create an instance
You have a instance type, storage type and app profiles-describes parameters for incoming connections
To connect, you use an instance ID and an application profile
Inside the instance, you have clusters
Inside the clusters you have nodes which are workhorses of Bigtable
The flexibility of Data Storage comes from separating our cluster nodes and storing data in Colossus
Nodes control tablets and a tablet can’t be shared by more than one node
Instance Types
- Production-1+ clusters, 3+ Nodes per cluster
- Development-a single-node cluster for development work; a development instance can't use replication, doesn't have an SLA, and is a cheaper option
SSD Storage Type
Almost always the right choice-the fastest and most predictable option, 6 ms latency for 99% of reads and writes, and each node can process 2.5 TB of SSD data
HDD Storage Type
Each node can process 8 TB of HDD data, but throughput is limited and row reads are only 5% of the speed of SSD reads; only suitable for storing at least 10 TB of infrequently-accessed data with no latency sensitivity-otherwise you could end up spending more money on extra cluster nodes
Application Profiles
- Custom application specific settings for handling incoming connections
- Single or multi-cluster routing
- Single-cluster routing: routes to a single cluster that you define, even if you have multiple clusters in an instance
- Multi-cluster routing: routes to the nearest available cluster; if it becomes unavailable, traffic goes to the next cluster
- Ask whether the data needs single-row transactions-if so, you have to use single-cluster routing
Bigtable Configuration
- Instances can run up to four clusters
- Clusters exist in a single zone
- Up to 30 nodes per project
- Maximum of 1,000 tables per instance
Bigtable Access Control
- Cloud IAM roles
- Applied at project or instance level to-
- Restrict access or administration.
- Restrict reads and writes
- Restrict development instances or production access
Data Storage Model
The ROW KEY is the only indexed column
Column families allow us to grab only what we need
Column names are called column qualifiers
You can write new data values and the old ones aren’t overwritten
How much data is stored and for how long is configurable with detailed granularity; values are stored as arrays of bytes
Alternative Options to Bigtable
- Need SQL Support OLTP: Cloud SQL
- Need Interactive Queries OLAP and cheaper: BigQuery
- Need structured NoSQL Documents: Cloud Firestore
- Need In-memory Key/Value Pairs: Memorystore
- Need Realtime Database: Firebase
Important Bigtable Info
Rows are sorted alphabetically-design of row key very important
Atomic operations are by row only-be careful when updating
Sparse table system-doesn’t hurt to have a lot of columns/families even if they don’t apply to every entity
Row sizing: individual values no larger than 10MB, and the total row (not including the key) should be under 100MB
Timestamps and Garbage Collection
- Each cell has multiple versions
- Server recorded timestamps
- Sequential numbers
- Expiry policies define garbage collection: can expire based on a specific age or specific number of versions
- Using an HBase client sets the policy to retain only the latest version of a cell; any other client library sets the column family to store infinite versions
The cbt command-line tool is an alternative way to connect to Bigtable
Bigtable Schema Design
Querying on a column value means scanning the entire table and then filtering the results based on a regular expression match to the string contained in that column cell-the most time-expensive way to query Bigtable
You have to plan your queries ahead of time
Field promotion: taking data that you already know and then moving it into the row key itself
Then you can write a command like: scan ‘vehicles’, {ROWPREFIXFILTER => ‘NYMT#86#’}
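An equivalent sketch with the google-cloud-bigtable Python client, scanning the hypothetical 'vehicles' table for row keys starting with 'NYMT#86#'; project and instance names are placeholders:

```python
from google.cloud import bigtable
from google.cloud.bigtable.row_set import RowSet

client = bigtable.Client(project="my-project")
table = client.instance("my-instance").table("vehicles")

row_set = RowSet()
row_set.add_row_range_with_prefix("NYMT#86#")   # only rows whose key starts with the prefix
for row in table.read_rows(row_set=row_set):
    print(row.row_key)
```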
You can also include a timestamp in the ROW KEY design
Never put a timestamp in front of the row key
Designing Row Keys
- Queries use: A row key, a row prefix
- A row range is returned
- Reverse domain names
- String identifiers-reads and writes evenly spread
- Timestamps only as part of a bigger row key design, as long as the timestamp is not first and is reversed
Row Keys to Avoid
- Domain names in order
- Sequential numbers
- Frequently updated identifiers
- Hashed values
Design for Performance
- Lexicographic sorting
- Store related entities in adjacent rows
- Distribute reads and writes evenly
- Balanced access patterns enable linear scaling of performance
Avoid Hotspots
- Use Field Promotion instead
- Try Salting-salted hash to your row key artificially distributes the rows, based on total number of nodes
- Use Google’s Key Visualizer tool
Time Series Data in Bigtable
- Use tall and narrow tables where each row might contain a key and maybe only a single column
- Use rows instead of versioned cells
- Logically separate tables
- Don't reinvent the wheel-there are already good time-series schemas out there, e.g. the OpenTSDB project
Monitoring Bigtable
- Via GCP Console or Stackdriver
- Average CPU utilization of cluster and hottest node
- Single cluster instance-aim for an average CPU load of 70% and the hottest node not over 90% CPU
- For 2 clusters with replication-multi-cluster routing brings additional overhead, so the average CPU load should be 35% and the hottest node at most 45%
- For storage utilization, try to keep it under 70% per node
- To monitor this, try to create application profiles for each application
Autoscaling Bigtable
- Stackdriver metrics can be used for programmatic scaling-done on local computer
- Client libraries query metrics
- Update cluster node counts via API
- Rebalancing tablets can take time and the performance might not improve for 20 mins
- Adding nodes to a cluster doesn’t solve the problem of a bad schema
Replication Bigtable
- Adding additional clusters automatically starts replication, i.e. data synchronization
- Replication is eventually consistent
- Used for availability and failover
- Application isolation
- Global presence
Good Performance in Bigtable
Replication improves read throughput but does not affect write throughput
Use batch writes for bulk data with rows that are close together lexicographically
Monitor instances and use the Key Visualizer to monitor hotspots and bad row keys
Bigtable rebalances tablets-first they all go in the first node, but then they rebalance or spread out to the other growing nodes-the tablets are also being split, merged, and rebalanced to maintain the sorted order of rows
Hotspots sometimes pop up and take a lot of CPU; the other tablets in that node are moved to other nodes so the node serving the hot tablet is less overwhelmed
Good vs Bad Performance in Bigtable
Good Performance: Optimized schema and row key design, large datasets, correct row and column sizing
Bad Performance: Datasets short lived or smaller than 300GB
Exam tips Bigtable
Know when to choose Bigtable: many questions make you choose the right product for the workload; when migrating from an on-premises environment look at HBase, and consider when Bigtable is a better option than BigQuery. Look for time-series data or use cases where latency is an issue
Understand the architecture of Bigtable: Concepts of an instance and a cluster, where Bigtable stores data, and how tablets are re-balanced by the service between nodes
Be aware of causes of bad performance: Like under-resourced clusters, bad schema design, and poorly chosen row keys.
UNDERSTAND ROW KEYS: Linear scaling and performance of Bigtable depend on good row keys. Understand row key design-you might have to point out flaws or pick an ideal row key
Understand Tall vs Wide: Wide Table-stores multiple columns for a given row-key where the query pattern is likely to require all the information about a single entity. Tall Table-suit time-series or graph data and often only contain a single column
Remember Organizational Design: Consider when a development instance is appropriate, remember IAM roles that can be used to isolate access to the necessary groups.
What is BigQuery?
Petabyte-scale, serverless, highly scalable cloud enterprise data warehouse
In memory BI Engine-fast interactive reports
Has machine Learning capabilities (BigQuery ML)-using SQL
Support for geospatial data storage and processing
Key Features of BQ
- High availability
- Supports SQL-can do SQL queries
- Federated Data-can connect to and process data stored outside of BigQuery
- Automatic Backups
- Governance and Security support-data encrypted at rest and in transit
- Separation of Storage and Compute-cost effective scalable storage and stateless resilient compute
Interacting with BQ
- Web console
- Command line tool (bq)
- Client libraries like C#, Go, Java, Node.JS, PHP, Python, and Ruby
Managing Data with BQ
You have a Project and within each Project, you have a Dataset, and within each Dataset, you can have Native Tables, External Tables, or Views
Native Table
Data is held within a BigQuery Storage
External Tables
Backed by storage outside of BigQuery
Views
created by a SQL query
Real Time Events BQ
Streaming, common to push events to Cloud Pub/Sub, then use a Cloud Dataflow job to process and push them into BigQuery
Batch Sources BQ
Comes in a Bulk Load, common to push files to Cloud Storage, then have a cloud Dataflow job pick that data up, process it and then push it into BigQuery
Legacy SQL
- Previously Called BigQuery SQL
- Non-standard SQL dialect
- Migration to standard SQL is recommended
Standard SQL
- Preferred dialect
- Compliant with SQL 2011 standard
- Extensions for querying nested and repeated data
What you can do with BQ
With BigQuery Data you can: Use BI Tools, Use Cloud Datalab, Export to sheets or Cloud Storage, send it to Colleagues, or use it for GCP Big Data Tools like Dataflow or Dataproc
Jobs and Operations in BQ
Job: action that is run in BigQuery on your behalf
Load Job: Load data onto BQ
Export Job: Export from BQ
Query Job: Queries the data in BQ
Copy Job: Copies tables and datasets from BQ
Query Job priorities: Interactive (the default)-results are always saved to a temporary table or to a permanent table-and Batch
Table Storage in BQ
- Capacitor columnar data format
- Tables can be partitioned
- Individual records exist as rows
- Each record is composed of columns
- Table schemas specified at the creation of the table or at a data load
Capacitor in BQ
- The storage system: proprietary columnar data storage that supports semi-structured data (nested and repeated fields); imports convert CSV/JSON into the Capacitor format
- Each value is also stored together with a repetition level and a definition level (value, repetition level, definition level)
Denormalization
- BQ performance best when data is denormalized
- Nested and repeated columns
- Maintain data relationships in an efficient manner
- RECORD (STRUCT) data type-nested records or columns. EX: Address = address.number, address.street, address.city; a single ID can have multiple addresses (see the sketch below)
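A hedged sketch of querying a nested, repeated RECORD column with the BigQuery Python client; the customers table and its addresses field are placeholders:

```python
from google.cloud import bigquery

client = bigquery.Client()
sql = """
    SELECT id, a.street, a.city
    FROM `my_dataset.customers`, UNNEST(addresses) AS a  -- one output row per address per customer
"""
for row in client.query(sql).result():
    print(row.id, row.street, row.city)
```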
Data Formats in BQ
CSV, JSON (newline delimited), Avro (an open source format where the schema is stored together with the compressed data), Parquet (encoded, smaller files), ORC (Hive data), Cloud Datastore exports, and Cloud Firestore exports
BQ Views
A virtual table defined by a SQL query
A SQL query (the view definition) runs against tables
The resulting view is saved into a dataset
You can also query that view, has billing implications since you would also still be running the underlying query
Uses of Views
- Control access to data
- Reduce query complexity
- Construct logical tables
- Ability to create authorized views-give users access to a subset of rows or columns through the view without giving them access to the underlying tables
Limitations of Views
- Can’t export data since unmaterialized
- Can’t use JSON API to retrieve data from a view
- Can't combine standard and legacy SQL
- No user defined functions
- No wildcard table references
- Limited to 1,000 authorized views per dataset
External Data in BQ
You can query directly even though the data is not directly held in BQ
BQ supports Cloud Bigtable, Cloud Storage, and Google Drive
Use Cases for using External Data Source: Load and clean your data in one pass, or you have small, frequently changing data joined with other tables
Limitations for External Data Source in BQ
- No guarantee of consistency
- Lower query performance
- Can’t use TableDataList API method
- Can’t export jobs on external data
- Can’t reference wildcard table query
- Can't query Parquet or ORC formats
- Query results not cached
- Limited to 4 concurrent queries
Other Data Sources
- Public Datasets-available to everyone
- Shared Datasets-Have been shared with you
- Stackdriver log information
Data Transfer Service
Can easily pull Bulk Data into BQ with the Data Transfer Service
Data Transfer Service has multiple Connectors to Google sources, GCP, AWS services like S3 and Redshift, and other third party services like LinkedIn and Facebook
Can be one off events or scheduled to run repeatedly
DTS supports backfills of historical data and has uptime and delivery SLAs
Table Partitioning BQ
Table partitioning-break up big table into smaller tables
Partitions stored separately on physical level
Partitions usually based on a single column called the partition key.
Partitioning BQ 2 Ways
1. Ingestion Time partitioned tables
2. Partitioned Tables
Ingestion Time Partitioning
Partitioned by load or arrival date; data is automatically loaded into date-based partitions (daily); tables include the pseudo-column _PARTITIONTIME-use _PARTITIONTIME in queries to limit the partitions scanned
Partitioned Tables in BQ
Partitioned based on a specific TIMESTAMP or DATE column; data is placed in a partition based on the value in the partitioning column; 2 additional partitions exist: NULL and UNPARTITIONED; use the partitioning column in queries
BQ automatically places data in the right partitions-you need to declare the table as partitioned when creating it
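A hedged sketch of limiting a query to recent partitions with the BigQuery Python client; the dataset, table, and partitioning column (event_date) are placeholders:

```python
from google.cloud import bigquery

client = bigquery.Client()
sql = """
    SELECT user_id, event_type
    FROM `my-project.my_dataset.events`
    WHERE event_date BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY) AND CURRENT_DATE()
"""
# Filtering on the partitioning column means only the last 7 days of partitions are scanned.
for row in client.query(sql).result():
    print(row.user_id, row.event_type)
```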
Clustering Tables BQ
Clustering tables: can be applied to a partitioned table; use clustering when you have filters or aggregations against specific columns in your queries. When partitioning and clustering are used together, data is partitioned by the partition key and then clustered by the cluster key within each partition
In cluster tables: the data associated with a certain cluster key is generally stored together
Ordering is important
Clustering Limitations
- Only supported for partitioned tables
- Standard SQL only for querying clustered tables
- Standard SQL only for writing query results to clustered tables
- Specify clustering columns only when table is created
- Clustering columns can’t be modified after table creation
- Clustering columns have to be top-level, non-repeated columns
- You can specify one to four clustering columns
Querying Guidelines for Clustering Tables
- Filter clustered Columns in the order they were specified
- Avoid using clustered columns in complex filter expressions
- Avoid comparing cluster columns to other columns
Why Partition Tables
- Improve query performance
- Control costs
Benefits of BQ Slots
Slots:
A slot is the unit of computational capacity required to execute SQL queries-used for pricing and resource allocation
The number of slots a query needs is determined by the query size and complexity
BQ automatically manages your slots quota
Flat rate pricing available-purchase fixed number of slots
You can see slot usage using Stackdriver
Cost Controls of BQ
- Avoid using SELECT *
- Use preview options to sample data
- Price queries before executing them (see the dry-run sketch after this list)
- Remember that using LIMIT doesn't affect cost
- View costs using a dashboard and query audit logs
- Partition by date
- Materialize query results in stages
- Consider the cost of large result sets
- Use streaming inserts with caution
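A hedged sketch of pricing a query before running it via a dry run, so no bytes are actually processed; the table name is a placeholder:

```python
from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
job = client.query("SELECT name FROM `my-project.my_dataset.users`", job_config=job_config)
print(f"This query would process {job.total_bytes_processed} bytes")
```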
Query Performance Dimensions
- Input data and data sources
- Shuffling
- Query computation
- Materialisation
- SQL anti-patterns
Input data and data sources best practices BQ
Input data and data sources: Prune partitioned queries, denormalize data whenever possible, use external data sources appropriately, avoid excessive wildcard tables
Query Computation Best Practices
Avoid repeatedly transforming data via SQL queries, avoid JavaScript user-defined functions, order query operations to maximize performance, optimize JOIN patterns
SQL Anti-Patterns
Avoid Self-Joins, avoid data skew, avoid unbalanced joins, avoid joins that generate more outputs than inputs (Cartesian product), avoid DML statements that update or insert single rows
Optimizing Storage BQ
- Use expiration settings-Control Storage Costs and Optimize use of storage space
- Take advantage of long-term storage-lower monthly charges apply for data in tables or partitions that have not been modified in the last 90 days
- Use the Google pricing calculator to estimate the storage costs
Primitive Roles
Granted at the project level, giving access to the project's datasets; individual dataset access settings will override the primitive access. Three types of these roles: Owner, Editor, Viewer
Predefined Roles
grant more granular access, defined at the service level, GCP managed
Custom Roles
User managed
Handling Sensitive Data
Credit card numbers, medical info, SSNs, people's names, and address info can be protected by Cloud Data Loss Prevention (Cloud DLP)
Cloud DLP
- Fully managed service
- Identify and protect sensitive data at scale
- Over 100 predefined detectors to identify patterns, formats, and checksums
- It also de-identifies the data
Encryption in BQ
BQ encrypts data through the Data Encryption Key (DEK)
For highest levels of security, the DEK key needs to be encrypted to form the Wrapped DEK-this is done using the Key Encryption Key (KEK)
Wrapped DEK/DEK stored together
KEK is stored in the Cloud Key Management Service
Monitoring/Alerts in BQ
Alerts should be created when a monitoring metric crosses a specified threshold
BQ uses Stackdriver for monitoring-BQ sends its logs to it
In Stackdriver, you can filter for the BigQuery logs, create dashboards, add charts to the dashboards, and create alerts
Cloud Audit Logs
Collections of logs that are provided by GCP to allow insights to various services
Log Versions
AuditData (old)-maps directly to the individual API calls made against the service
BigQueryAuditMetadata-not strongly coupled to particular API calls; more aligned with the resource itself-it records how the state of BigQuery resources is changed by API calls, services, and tasks
Stackdriver has three different streams: Admin Activity, System Event, and Data Access-streams are just groupings for different types of logs
BQ ML Access
- Web console (UI)
- Bq command line tool
- BQ rest API
- Jupyter notebooks(Cloud Datalab) and other external BI tools
Linear Regression
where you have a number of data points and try to fit a line to those data points
Binary Logistic Regression
There are 2 classes and you assign each example to one of the classes
Multi-class Logistic Regression
We assign each example to one of several classes
K-Means Clustering
We have a number of points and are able to separate them into different clusters-the newest model type in BQ ML
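A hedged sketch of training and evaluating a linear regression model in SQL through the BigQuery Python client; the dataset, model, and source table are placeholders:

```python
from google.cloud import bigquery

client = bigquery.Client()

client.query("""
    CREATE OR REPLACE MODEL `my_dataset.price_model`
    OPTIONS (model_type = 'linear_reg', input_label_cols = ['price']) AS
    SELECT size, rooms, location, price
    FROM `my_dataset.houses`
""").result()

for row in client.query("SELECT * FROM ML.EVALUATE(MODEL `my_dataset.price_model`)").result():
    print(dict(row))
```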
Benefits of BQ ML
- Democratizing ML
- Models trained and evaluated using SQL
- Speed and agility
- Simplicity
- Avoid regulatory restrictions
EXAM TIPS for BQ
Understand good organizational design: consider how different teams should be granted different types of access to BQ and how the decisions affect cost control
Learn the most common IAM roles: learn how to grant access to teams based on needs and how to use authorized views to share data across projects
Consider costs when designing queries: Avoid using SELECT * and previews and price queries before executing them
Partition tables appropriately: partitioned tables can reduce the cost -consider clustering to reduce scans of unnecessary data
Optimize query operations and JOINS
What is Datalab?
What is it?
1. A pre-existing technology, wrapped in some GCP conveniences-Jupyter Notebooks
Jupyter Notebooks
- Interactive web pages that have…
- Documentation
- Code
- Elements which are the results of compiled code
Datalab functions
When typing code, there is a Cloud Datalab VM that has a Python kernel
The kernel can run code and access GCP services like BigQuery or ML Engine
Good way to collaborate and share code
Also a good way to annotate
Has matplotlib, which is great for statistical data and graphs
When saving your work through the notebook, it will be in the GCR repo, which is sent to the persistent disk attached to the Datalab instance
Why Do We Need Datalab?
- Manages instance lifecycle
- Create Datalab VMs in seconds
- Notebooks stored in GCR
- Storage can persist after the instance is destroyed
Intro to Data Studio
Data Sources
Reports and Dashboards
Data sources underneath are Databases or Files
Files-usually CSV files, stored in Cloud Storage
Databases-GCP databases like BigQuery, Cloud SQL, MySQL, Cloud Spanner, PostgreSQL
Google Products-Google analytics, Sheets, Youtube, Ads, Google Marketing Platform
Third Party-Trello, Quickbooks, Facebook Ads
You can Share your dashboards and reports by Viewing or allowing Users to Edit-Like in Google Drive
Chart and Filters in Data Studio
Tables-detailed, heat map inclusion, bar chart inclusion, pagination
Scorecards-KPIs, high level
Pie Chart-proportions, %s or absolute values, doughnut or whole, small amounts of data
Time series-time order, trends, curve filtering, forecasting
Bar charts-categorical, vertical or horizontal, single or stacked, reports and dashboards
Geomaps-geographical data, dashboards
Area charts-composition, cumulative totals, reports and dashboards, can combine with time series
Scatter plot-cartesian plane, typically 2 variables-or more through color or size through the points, dashboards and reports
Filter-allows you to select specific values from a category
Date Range-can specify start and end dates or predefined intervals like last week, last month, current year or quarter
Cloud Composer Overview
Built on Apache Airflow
Google is contributing back to the airflow project
Task orchestration system designed to automate complex interdependent tasks into pipelines or workflows
Each stage of the pipeline is written in code
Each workflow is written in Python
Provides central management and scheduling
Provides an extensive CLI tool and a comprehensive web UI
DAG-Directed Acyclic Graph
A graph consisting of nodes connected by edges; edges define how we travel from one node to another; directed-travel in one direction; acyclic-never circles back, so you can't reach the same node more than once by traveling along the edges
Possible to represent dependencies-between nodes, that must be traversed in a specific order
These nodes represent all the tasks in a workflow, organized in a way that shows their relationships and dependencies
DAGs are written in Python and define the order in which tasks should execute
Inside the tasks, there are operators to specify what is to be done
DAGs can contain parameters for when they should run, what the dependencies are, and who should be notified once the tasks are completed
Cloud Composer manages resources to make sure workflow completes successfully
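A minimal workflow sketch in Airflow 1.x style (as used by Cloud Composer at the time); the DAG id, schedule, and bash commands are placeholders:

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

with DAG(dag_id="example_workflow",
         start_date=datetime(2020, 1, 1),
         schedule_interval="@daily") as dag:

    # Operators specify what each task does.
    extract = BashOperator(task_id="extract", bash_command="echo extracting data")
    load = BashOperator(task_id="load", bash_command="echo loading data")

    extract >> load  # dependency: load runs only after extract succeeds
```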
Composer Architecture
A microservices architecture
Uses multiple GCP resources grouped together into a Cloud Composer environment
Can have more than one environment in a GCP project, but each environment is an isolated installation of Airflow and all of its component parts
Some parts are put in a Tenant Project that you can't see or control-this is where the Airflow database and Airflow web server live; the web server provides its web UI on App Engine Flex, and Cloud Composer configures the Identity-Aware Proxy to control access to the web server
Why use Cloud Composer rather than Dataflow?
Dataflow-Process Batch or Streamed Data-Apache Beam
Cloud Composer-Orchestrate tasks with Python and can use any Python code at any stage of the pipeline-more as a scheduler
Can orchestrate Cloud Composer with Dataflow
Cloud Composer workflow example: Spark Analytics- A workflow runs daily, sets up a Dataproc cluster, performs Spark analytics, writes results to GCS and emails an administrator, and then deletes the Dataproc cluster
Cloud Composer can be used as any scheduled automation task outside of big data
Composer in a GCP Environment
In the GCP project environment, Composer creates a Kubernetes cluster that deploys Redis, the Airflow Scheduler, and the Airflow Workers along with the Cloud SQL Proxy; it also creates 2 Pub/Sub topics for messaging between microservices and a Cloud Storage bucket for logs, plugins, and the DAGs themselves
Cloud Composer configures- Airflow Parameters and Environment Labels, you can also customize some parameters
A DAG can be a single file or multiple files with imports and dependencies-the Python script has the variables, operators, and stages of the workflow tied together with a DAG object and definition
Tasks
A Task is an instance of the Airflow Operator
The scheduler will find any DAG objects that you have defined in the Python scripts uploaded to the GCS bucket; if all dependencies are met, the workflow will be scheduled
When you delete an environment, it won’t clean up all resources it created, but the Tenant Project and the GKE cluster that was spun up will be removed
You will have to manually delete the Pub/Sub topics and the GCS bucket
Advanced Composer Features
Custom Airflow parameters get written to the airflow.cfg file that is used to configure services when Airflow runs for the first time
Not all service settings can be changed-only some can be modified
You can create Environment Variables, which Cloud Composer passes to elements of Airflow like the scheduler, web server, and worker processes
Environment Variables are in a Section which are defined in Key Value Pairs
Airflow Connection: A collection of authentication information which can include hostnames, logins, keys/secret information
Cloud Composer will create connections for BigQuery, Datastore, Cloud Storage, and a generic GCP connection-these have a service account key that authenticates against the GCP API in question
Can make custom connections using the Airflow Web UI, connections to Airflow’s own database use the Cloud SQL Proxy
In the Web UI, you can use the ad hoc query page to run SQL queries against any connected database-some built in visualizations as well
Extending Airflow
- Local Python Environment by adding additional custom libraries
- Airflow Plugins-write own custom operator
- PythonVirtualenvOperator-creates a custom environment for a task in a workflow without installing dependencies across all of your workers; each task gets its own environment with its own libraries (see the sketch after this list)
- KubernetesPodOperator-when you need complete control over an environment, runs a task inside a pod on the Cloud Composer GKE cluster
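A minimal sketch of the PythonVirtualenvOperator approach, assuming Airflow 2.x; the task and library choice (pandas) are examples only:

```python
from airflow.operators.python import PythonVirtualenvOperator

def summarize():
    # pandas is installed only in this task's virtualenv, not on every worker.
    import pandas as pd
    print(pd.DataFrame({"x": [1, 2, 3]}).describe())

# Defined inside a DAG context, e.g. the `with DAG(...)` block shown earlier.
summary_task = PythonVirtualenvOperator(
    task_id="summarize_in_isolated_env",
    python_callable=summarize,
    requirements=["pandas"],
    system_site_packages=False,
)
```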
Google Pre-Trained ML Models
- Cloud Vision API-can detect and label objects within images, facial recognition, can read handwritten info
- Cloud Video Intelligence API-can identify huge numbers of objects, places, and actions which are taking place in videos
- Cloud Translation API-can translate between more than 100 languages
- Cloud Text to Speech API: Convert text to audio/human speaking
- Cloud Speech to Text-convert from audio to text
- Cloud Natural Language API-can perform sentiment analysis, entity analysis, entity sentiment analysis, content analysis, content classification
Reusable Models
Model training is required, but minimal ML knowledge is needed; relatively small datasets suffice for training thanks to transfer learning and neural architecture search-on GCP, the service can search out the best model architecture to solve your problem
Re-use models-Through Cloud AutoML:
when you need something very specific
AutoML-uses transfer learning and allows you to train your own custom models to solve your own specific problems: AutoML Vision, Video Intelligence, Natural Language, Translation, and Tables
Google AI Platform allows you to train, manage, and share your own models; it gives easy access to TensorFlow, TensorFlow Extended (TFX)-an end-to-end platform for deploying machine learning pipelines-TPUs to speed up training, and Kubeflow
ML Pipeline
ML has a Model that contains Rules from the Data that’s used to train it
ML Pipeline
1. Data preparation-raw data becomes processed data
2. Model Training-processed data is split into Training Data and then Testing Data. Training Data gets inputted into the Model, then it is evaluated against the Test Data to determine how well it inferred rules from the training data
3. Operating- Trained Model is used for Real-time Predictions like predicting what people are going to buy, or for Batch Predictions which are normally offline predictions when many predictions are made in a short period of time
Label in ML
Label: Something that is an interest to us like a house price-denote labels with y
Labelled Example: has a label associated with a set of features (house size, number of rooms, location...); house price = $K, written (x1,x2,x3,x4,…) -> y
Unlabelled Examples: have the set of feature values (size, number of rooms, location...) but no label value; house price = ?; for (x1,x2,x3,x4,…) we want to predict the label value, called y prime (y')
Feature in ML
Feature: attribute associated with the label and has a relationship with the label, like the size of a house or the number of rooms in a house, denoted with an x-x1,x2…
Examples: Features together with labels
Measuring Loss and Loss Squared
Measuring Loss: calculating the difference between the predicted value (x7,y7') and the actual value (x7,y7) with the equation y7-y7', so basically the difference between the actual values and the predicted values
Squared Loss: one line goes through the actual points, another line doesn't match the actual points well; you take the differences between the actual and predicted points and calculate Loss=(y1-y1')^2+(y2-y2')^2+… >= 0; you square the differences because otherwise positive and negative errors would cancel out
Mean Squared Error
Similar to the squared-loss equation, but you take the mean: MSE = (1/n) * SUM (yi-yi')^2
Optimization uses gradient descent: for y=wx the loss becomes a function of the weight, MSE(w); the gradient gives the direction in which the loss function increases, so move in the direction where it decreases; keep doing this until you find the w at the minimum, where the loss function has its lowest value; then use that w in the line equation to get the line with the lowest loss; the learning rate determines how far we move toward the minimum on each step
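A small NumPy sketch of this loop for the one-parameter model y = w*x; the data points and learning rate are made-up:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])       # roughly y = 2x

w = 0.0                                   # initial weight
learning_rate = 0.01

for step in range(500):
    y_pred = w * x                        # predicted values y'
    error = y_pred - y                    # y' - y for each example
    mse = np.mean(error ** 2)             # loss: mean of the squared differences
    gradient = np.mean(2 * error * x)     # d(MSE)/dw
    w -= learning_rate * gradient         # step in the direction where loss decreases

print(round(w, 3), round(mse, 3))         # w converges close to 2
```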
Supervised Learning Type
We train the model using data that is labeled, each example has a label and a feature
We have training data with a set of features called x, along with the correct labels, the ys; the features with the correct labels are used to train the model; unseen data, which has only the features, is then presented to the model, and the model uses its training to predict the associated label for each example, giving us y prime
Unsupervised Learning Type
uncover structure within the data set itself
You have a set of input data with no labels-features only-and send it to the model; the model then creates outputs that uncover structure within the input data; you might use this method to uncover personas within customer data
Reinforcement Learning
common in game playing and other types of ML
You have an agent, which is basically the model; the agent interacts with an environment (the chess board in a game of chess); the state of the environment, state(t), is fed into the model at a particular moment in time t; the agent proposes an action at time t that affects the environment (move a chess piece from one square to another); the state then changes to state(t+1); the agent receives a reward, reward(t), according to how well the action it took at time t affected the environment in terms of the outcome it wants to achieve
Regression
predict real number (y’)
There is a regression model; features are inputted into the model, and the model associates the set of features with a predicted value y'
Classification
predict class from specified set{A,B,C,D} with probability
There is a classification model, it will take a set of features, then it will associate it with a particular class and an associated probability, ex: (X1,X2,…,Xn,A,0.9817)
Clustering
group elements into clusters or groups
There is a clustering model, you input an element set, then you associate the element set with a cluster in the output
Transfer Learning
Train a model using images to classify them into specific categories; you can then copy the model and turn it into a new classification model with new categories, training it on new images; for this new model, usually far fewer training images are needed
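A hedged tf.keras sketch of this idea: reuse a pre-trained image model, freeze its weights, and train only a new classification head; the base model choice, input size, and 5-class head are arbitrary examples:

```python
import tensorflow as tf

# Pre-trained feature extractor (weights learned on ImageNet), with its top removed.
base = tf.keras.applications.MobileNetV2(
    input_shape=(160, 160, 3), include_top=False, weights="imagenet")
base.trainable = False                    # freeze: keep the learned features

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(5, activation="softmax"),   # new categories
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(new_images, new_labels, epochs=5)   # far fewer images usually needed
```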
Underfitting
Like a Line, doesn’t capture underlying structure
Balanced
A curve that fits the data very well while remaining simple, e.g. a parabola
Overfitting
Fits the training data extremely well, better than the other two cases, but for a new data point there can be a very large distance to the curve; the problem is it doesn't generalize to new data
L1 Regularization
L1 Regularization term: |W1|+|W2|+…+|Wn|-take the sum of the absolute values of all the weights
Penalizes |weight|; drives the weights of non-contributing features towards 0, producing sparse weights
L2 Regularization
W1^2+W2^2+…+Wn^2-take sum of squares of all the weights
Penalizes weight squared, drives all weights towards 0-simpler model
Avoid Overfitting
- Regularization
- Increase Training Data
- Feature Selection
- Early Stopping-don’t allow data to train for too long, only a certain amount of iterations
- Cross Validation-take training data and split it into much smaller training sets, smaller sets known as folds
- Dropout Layers-where weights are most likely all set to 0
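A minimal tf.keras sketch combining several of these tools (L2 weight regularization, a dropout layer, and early stopping); the layer sizes and rates are arbitrary:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(
        64, activation="relu",
        kernel_regularizer=tf.keras.regularizers.l2(0.01)),  # penalizes weight^2
    tf.keras.layers.Dropout(0.3),   # randomly zeroes 30% of activations during training
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=3)
# model.fit(train_x, train_y, validation_split=0.2, epochs=100, callbacks=[early_stop])
```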
Hyperparameters
Hyperparameters: values that need to be selected before the training process can begin
Hyperparameter examples: Batch size, training epochs, number of hidden layers in network-model, regularization type-l1 or l2, regularization rate, learning rate
Hyperparameter Characteristics
1. Selection: Hyperparameter values need to be specified before training begins
2. Model hyperparameters relate directly to the model that is selected; algorithm hyperparameters relate to the training of the model
3. Training and Tuning: the process of finding the optimal or near optimal values for hyperparameters
Feature Engineering
One-hot Encoding (categorical data): transforms categorical data from a fixed set of values into numeric values-you create a column for each category and use a binary 1 or 0 to indicate which category each example belongs to
Linear Scaling: transforms values from one range into another, e.g. -1 to 1 or 0 to 1
Z-Score: values clustered around the mean (mu) are transformed to cluster around 0, using X'=(X-mu)/sigma
Log Scaling: when a small number of values have many points while the vast majority have few, transform with X'=log(X)
Bucketing: for distributions where the data points are related but the relationship is not linear, create defined ranges; data points within a range are mapped to a single value
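A small pandas/NumPy sketch of these transformations; the column names and values are illustrative only:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "red"],
                   "size_sqft": [900.0, 2500.0, 40000.0]})

one_hot = pd.get_dummies(df["color"])                          # one-hot encoding
linear = (df["size_sqft"] - df["size_sqft"].min()) / (
    df["size_sqft"].max() - df["size_sqft"].min())             # linear scaling to 0..1
z_score = (df["size_sqft"] - df["size_sqft"].mean()) / df["size_sqft"].std()
log_scaled = np.log(df["size_sqft"])                           # log scaling
buckets = pd.cut(df["size_sqft"], bins=[0, 1000, 5000, np.inf],
                 labels=["small", "medium", "large"])          # bucketing into ranges
```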
TensorFlow
- Google’s open source, end to end, ML framework-good for machine learning
- Compatible with a wide range of hardware and devices-models can be trained and deployed across CPUs, GPUs, and TPUs
- TensorFlow Lite-for deploying models to mobile and embedded devices
- TensorFlow.js-JavaScript library for training and deploying models in the browser and on Node.js
- TensorFlow Extended (TFX)-for deploying ML pipelines
Keras
- Open Source neural network library
- Made in Python
- Runs on top of other ML frameworks
- High level API for fast experimentation-deep neural networks, easy to use extensible
- CPUs and GPUs
- tf.keras-TensorFlow's implementation of the Keras API specification; build and train models with first-class support for TensorFlow-specific functionality-more flexible and makes TensorFlow easier to use (see the sketch below)
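A minimal tf.keras sketch: build and compile a small fully connected classifier; the layer sizes and the 10-class output are arbitrary examples:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),  # hidden layer
    tf.keras.layers.Dense(10, activation="softmax"),  # one output neuron per class
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_images, train_labels, epochs=5)
```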
Google Colab
- Free cloud service based on Jupyter Notebooks; notebooks are stored in Google Drive
- Provides free GPU support
- Supports some BASH commands
- Includes pre-installed python libraries
Neural Network Layers
Input Layer: allows us to feed data into the model
Output Layer: represent the way we want the neural network to provide answers, we have a neuron available for each of the classes
There are Hidden Layers between the Input and Output Layers, the number of hidden layers will vary based on the type of problem you are trying to solve-model hyperparameter
Input=L0 Hidden=L1,L2,L3,L4,L5 Output=L6
Each layer has a specified number of neurons/nodes
For image inputs, the pixels of an image form a matrix that is fed into the input layer
Fully Connected Layer
where every neuron of 1 layer is connected to every neuron in the following layer
Partially Connected Layer
where every neuron in one layer is not connected to every neuron in the adjacent layer
Neurons
- Will have an input vector X and an output value Y
- Has a weight vector; the number of weights in the weight vector corresponds to the number of inputs in the input vector
- X1*W1-each weight is multiplied by the corresponding input value
- Then the values are summed-(X1W1)+(X2W2)
- A function is applied to these sum values-f((X1W1)+(X2W2)) which gives us our output y
- The function is called the Activation function
- Activation functions: Rectified Linear Unit (ReLU): f(x)=max(0,x), Sigmoid/Logistic, Hyperbolic tangent (tanh)
Logits: the raw output values of the final layer, before being converted to probabilities
Softmax function: converts Logits to probabilities-SUM(p)=1
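A NumPy sketch of a single neuron's computation and the softmax function described above; the inputs, weights, and logits are made-up numbers:

```python
import numpy as np

x = np.array([0.5, -1.2, 3.0])     # input vector X
w = np.array([0.8, 0.1, -0.4])     # weight vector, one weight per input

z = np.dot(w, x)                   # weighted sum: (X1*W1) + (X2*W2) + (X3*W3)
y = max(0.0, z)                    # ReLU activation: f(x) = max(0, x)

def softmax(logits):
    e = np.exp(logits - np.max(logits))   # shift for numerical stability
    return e / e.sum()                    # probabilities that sum to 1

print(y, softmax(np.array([2.0, 1.0, 0.1])))
```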
Feed Forward Neural Network
- Simplest and most common type of deep neural network
- Applications in computer vision and NLP
- First type of deep neural network to be used
- Information flows in one direction only(input to output)
Recurrent Neural Networks
- Flow of information forms cycles/loops
- Directed cycles
- RNNs can be difficult to train
- RNNs are dynamic with their state continuously changing
- There are cycles
Convolutional Neural Network
- Neural network commonly used for visual learning tasks
- Common uses: Image and video recognition, image classification, natural language processing
- Have an input, output, and many hidden layers
- Convolutional layers are paired with pooling layers
- The convolutional layers apply small filters that pick out certain features of an image
Pooling Layer
- Simplifies (downsamples) inputs
- Usually succeed a non-linear activation function
- There is Average Pooling-Calculated average value
- There is Max Pooling-Calculates the maximum value
- Strides move along the pixels in the image to calculate results for the output
GANS
- GANs are deep neural networks composed of two opposing neural networks-a generator network and a discriminator network-hence "adversarial"
- Allows for the creation of things like: Images, music, speech, poetry, deep fakes(face imitation)
- The generator network outputs generated images, which go into the discriminator network; the discriminator predicts whether each image is real or fake and sends that feedback back to the generator
Vision APIs
Optical Character Recognition (OCR): can detect text in images; there is text detection (detects text, returns it as JSON organized into blocks, and can also read handwriting) and document text detection (also reads text from images, but is intended for dense text in documents/files)
Cropping Hints: Gives a cropping suggestion on how to crop the image and gives the vertices to crop it
Face Detection: Can detect faces and facial features
Image Property Detection: Sees dominant colors, these colors can group images/object or be used for recommendations
Label Detection: Can detect object, locations, activities, animal species, products, and much more
Landmark Detection: Detects landmark in an image, gives the landmark name, bounding polygon vertices, and the location using latitude and longitude
There is also Logo Detection
Explicit Content Detection-safe search detects if the image is adult, spoof, medical, violence, and racy
Web Entity and Page Detection-where the image is being used: links to web pages, similar images, best-guess labels
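A hedged sketch using the google-cloud-vision Python client for label and text detection; the bucket path is a placeholder and credentials are assumed to be configured already:

```python
from google.cloud import vision

client = vision.ImageAnnotatorClient()
image = vision.Image(source=vision.ImageSource(image_uri="gs://my-bucket/photo.jpg"))

labels = client.label_detection(image=image)          # label detection
for label in labels.label_annotations:
    print(label.description, label.score)

text = client.text_detection(image=image)             # OCR / text detection
if text.text_annotations:
    print(text.text_annotations[0].description)       # full detected text block
```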
Video Intelligence APIs
Detect Labels: annotates videos where entities are detected, list of video segments, frame annotations and shots
Shot change detection: annotates videos based on shots or scenes, entities associated with specific scenes
Detect explicit content: detect adult content, annotates explicit content, and timestamps where detected
Transcribe Speech: transcribes spoken words, profanity filtering, transcription hints, automatic punctuation, handling multiple speakers
Track objects: track multiple objects, provides location of each object with frames, bounding boxes for each object, time segments with offset
Detect Text: OCR on text occurring within videos, text and location of the text
Google Knowledge graph Search API allows you to do searches on Google’s Knowledge Graph-all entities and relationships between them, each object has an identifier
1. Getting ranked list of most notable entities that match criteria
2. Predictively completing entities within a search box
3. Annotating or organizing content using the knowledge graph entities
Natural Language API
Looks at patterns within language/text, uses sentiment analysis, entity analysis, syntax analysis, entity-sentiment analysis, and content analysis
Sentiment Analysis
- Score: indicates the overall emotion, ranging from -1 (negative) to 1 (positive)
- Magnitude: how much emotional content there is; ranges from 0 to infinity and is not normalized (proportional to the length of the document being assessed)-see the sketch below
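A hedged sketch using the google-cloud-language Python client to get the score and magnitude for a piece of text; the text is a placeholder and credentials are assumed to be configured:

```python
from google.cloud import language_v1

client = language_v1.LanguageServiceClient()
document = language_v1.Document(
    content="I love this product, but shipping was slow.",
    type_=language_v1.Document.Type.PLAIN_TEXT,
)

response = client.analyze_sentiment(request={"document": document})
sentiment = response.document_sentiment
print(sentiment.score, sentiment.magnitude)   # score in [-1, 1], magnitude >= 0
```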
Entity Analysis
- Identifies entities within text
- Provides information on the identified entities
- Entities are nouns/things-Proper Nouns(Specific like Albert Einstein) and Common Nouns(mug=any mug)
Entity-Sentiment Analysis
- Combines Entity and Sentiment Analysis
- Tries to determine sentiment expressed towards each of the identified entities
- Provides a numerical score and magnitude for each identified entity
Syntax Analysis and Content Classification
Syntax Analysis: breaks streams of text into sentences, then tokenizes them into tokens; the sentences and tokens are used to determine grammatical information
Content Classification: API will return categories that are most specific to the source text
Dialogflow
- Natural Language interaction platform
- Works with mobile and web apps, devices, and bots
- Analyzes text or audio inputs
- Responds to using text or speech
- Intents: categorize an end user's intention so the agent can understand and respond-similar to a classification
- Intents have different training phrases that are mapped onto an intent, then we can have extracted parameters
- Each parameter has a type called an entity type
- The end user gives an input phrase, which goes to an agent; the agent determines the intent through intent classification; from the intent, parameters are extracted, which can trigger an action; the action produces a response that goes back to the end user
Cloud Speech to Text: works with audio files and audio streams (see the sketch after this list)
- Synchronous Recognition: Returns result after all input audio has been processed
- Asynchronous Recognition: Initiates long running operation
- Streaming Recognition: Audio data is provided within a gRPC bi-directional stream
- Models: video, phone call, command and search, default
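A hedged sketch of synchronous recognition with the google-cloud-speech Python client; the GCS URI, encoding, and sample rate are placeholders:

```python
from google.cloud import speech

client = speech.SpeechClient()
audio = speech.RecognitionAudio(uri="gs://my-bucket/audio.flac")
config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.FLAC,
    sample_rate_hertz=16000,
    language_code="en-US",
)

response = client.recognize(config=config, audio=audio)   # returns after processing
for result in response.results:
    print(result.alternatives[0].transcript)
```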
Text-to-Speech API
- Uses text files or Speech Synthesis Markup Language(SSML)-allows you to control the way text is converted to speech
- SSML: Pauses, play sounds, speak cardinals, speak ordinals, speak characters, phrase substitution
AutoML
Suite of ML Products
Facilitates the training of custom ML models
Highly performant
Speed of Delivery
Human labelling service
You give AutoML a problem and the result you want to predict; it finds a suitable neural network using neural architecture search
When it gets a neural network that works from its bank of existing networks, it applies transfer learning so the model can be trained to handle your novel data
AutoML Process
- Prepare and manage images-label training images, create a dataset (single-label or multi-label)
- Training models-requires prepared dataset, may take a few hours to complete, training creates a new model
- Evaluating models-after training, evaluate on a test set; provides aggregated and detailed information (area under curve, confidence threshold curve, confusion matrix)
- Deploying models-deploy before making online predictions
- Making predictions-individual or bulk
- Undeploying models-after successful predictions or once you have a better model; there are cost implications of not undeploying models
Vision Edge
- Export custom trained models
- Models optimized for edge devices
- TensorFlow Lite, Core ML, container export formats
- Edge TPUs, ARM and NVIDIA
- AutoML Vision Edge in ML Kit
AutoML Translation
AutoML Translation-lets you train a domain-specific model, for example to translate English phrases to French
AutoML Translation Training-you use source target pairs, source sentences are in the source language, target sentences are in the target language
Translation Considerations
1. Data Coverage-include examples of the vocabulary, usage, and grammatical peculiarities that are specific to your domain; the model needs to be exposed to the language in some form
2. Human Involvement-people who understand both languages should be involved
3. Data Quality- This is VERY IMPORTANT for translation training, source and target documents need to align
AutoML Natural Language
- Create custom models to classify content into custom categories you define
- When pre-defined categories are insufficient
- When you want to categorize content from free text
- You want to create own categories for categorization
AutoML Table Capabilities
- Data Support: AutoML Tables provide info on missing data, AutoML tables provide correlations, cardinality, and distributions for each feature
- Automatic Feature Engineering: Normalizes and bucketizes numerical features, creates one-hot encoding and embeddings for categorical features, performs basic text processing for text fields, extract time and date features from timestamp features
- Model Training: Parallel testing of multiple model types like Linear and Feed forward deep neural network, selects best model for predictions
AutoML vs BQ
AutoML Tables vs BigQuery ML
1. BigQuery ML: rapid iteration, when you are still deciding on the features to include (see the sketch below)
2. AutoML: Optimizing model quality, have time available for model optimization, multifarious input features
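For the BigQuery ML side, a hedged sketch of training a model with a CREATE MODEL statement issued through the Python client; the dataset, table, and column names are placeholders:

```python
from google.cloud import bigquery

client = bigquery.Client()
query = """
CREATE OR REPLACE MODEL `my_dataset.price_model`
OPTIONS(model_type='linear_reg', input_label_cols=['price']) AS
SELECT size_sqft, num_rooms, price
FROM `my_dataset.house_sales`
"""
client.query(query).result()   # waits for the training job to finish
```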
Kubeflow
- ML Toolkit for Kubernetes
- Data modeling with Jupyter Notebooks
- Tuning and training with TensorFlow
- Model serving and monitoring
Production Phase: Transform Data with pipelines, Train Model with (MPI, MXNET, PyTorch), Serve Model using (TFServing,KFServing,NVIDIA TensorRT), Monitor the model using(TensorBoard, Metadata)
Pipeline: Description of a ML workflow, including all of the components in the workflow and how the components relate to each other in the form of a graph
Pipeline Component: self-contained set of user code, packaged as a container, that performs one step in the pipeline
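A hedged sketch of a two-step pipeline, assuming the KFP v1 SDK; the image names and commands are placeholders:

```python
import kfp
from kfp import dsl

@dsl.pipeline(name="example-pipeline", description="Transform data, then train a model.")
def example_pipeline():
    transform = dsl.ContainerOp(
        name="transform-data",
        image="python:3.9",
        command=["python", "-c", "print('transforming data')"],
    )
    train = dsl.ContainerOp(
        name="train-model",
        image="python:3.9",
        command=["python", "-c", "print('training model')"],
    )
    train.after(transform)   # edge in the workflow graph

kfp.compiler.Compiler().compile(example_pipeline, "example_pipeline.yaml")
```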
AI Platform
Ingest Data: Cloud Storage, Cloud Storage Transfer Service
Prepare and Preprocess Data: Cloud Dataflow, Cloud Dataproc, BigQuery, Cloud Dataprep
AI Platform data labeling service can label training data by applying classification, object detection, and entity extraction
Develop and Train Models: Deep Learning VM, AI Platform Notebooks, AI Platform Training, KubeFlow
Test and Deploy Models: TensorFlow Extended, AI Platform Prediction, Kubeflow
Discovery: Google AI Hub
Quick and Ready to Go ML
IAM Best Practices
The principle of least privilege: use predefined roles specific to each GCP product or service
Each stack should have its own boundary
Policies applied to a parent object will be inherited by a child object
Use Groups in G-Suite or Cloud Identity, grant roles to groups and not individual users
Service Accounts
Service Accounts: Special type of Google account designed to represent non-human users
1. Virtual Machines-act via SAs that determine the services they can access
2. Programmatic access-should always be achieved using a SA
3. IAM Roles-assigned to SAs in exactly the same way as for human user accounts
Human User Accounts
Human User Accounts: Passwords and multi factor authentication
Service Accounts have Keys that can be downloaded in JSON format
A key can be used by an application to authenticate against Google APIs; keys should be rotated and carefully protected
Cloud IAM API: requests can use OAuth 2.0, OpenID Connect, or JWTs
Service Account User role-allows a user to impersonate the SA and access everything the SA's IAM policies allow
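A hedged sketch of using a downloaded JSON key to authenticate a client library call; the key filename, project, and bucket are placeholders (in production, prefer attached service accounts over downloaded keys):

```python
from google.cloud import storage
from google.oauth2 import service_account

creds = service_account.Credentials.from_service_account_file("sa-key.json")
client = storage.Client(project="my-project", credentials=creds)

for blob in client.list_blobs("my-bucket"):   # works for any API the SA's roles permit
    print(blob.name)
```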
Data Security
Offers encryption in flight and at rest
Can use Cloud Key Management Service (KMS) if you want to manage your own encryption keys
Keys can be grouped together in key rings and can be used in multiple GCP services
Limit Blast Radius
VPC Service Controls-define security perimeter, only access services inside the perimeter
GCP Security Command Center: Asset Management Features, Web Security Scanner, Anomaly Detection, Threat Detection-internal
Data Privacy
Should people have access to all the data?
What is the data? Do we need this to complete the task? Are we allowed to store it?-PII Personally Identifiable Information-personal information to identify a specific individual
GCP Cloud Data Loss Prevention: works on text and images; supports pseudonymization (replacing identifiers with dummy data) and risk analysis
DLP API can become expensive
Industry Regulation
FEDRAMP: Department of Defense, Homeland Security, USA-how data is used and stored securely by cloud vendors, high compliance for most GCP Services
Children’s Online Privacy Protection Act: Use of PII data for children under age of 13, incorporate parent consent, clear private policy, justification for data collection
HIPAA:Protects Personal Health information, acceptance of business associate agreement
PCI DSS: GCP is certified compliant (the underlying infrastructure is secure enough), but your applications must also be compliant
GDPR: Europe, protects personal data of EU citizens, any region that interacts with EU
Dataprep Overview
Explore, clean, and prepare data
Visually define transformations
Export to Cloud Dataflow
Integrated partner service from Trifacta-links to GCP project and datasets
Flows
Top-level container for bringing together and organizing datasets, recipes, and transformations in one place
Datasets
Collections of data used within a Dataprep flow-you can import datasets from your local machine, Cloud Storage, or BigQuery
Recipes
Like an instruction manual, series of steps that perform a series of transformations on the datasets-create new data sets
For recipes, there are visual controls to define these transformations
You can also see a visual preview
Can use the Automator to execute certain recipes at certain times
Flows are then executed as Cloud Dataflow jobs
Cloud Storage
Unstructured object storage; regional, dual-region, or multi-region; Standard, Nearline, or Coldline storage classes; storage event triggers
Fully managed object storage-can store images, access via API and SDKs, has multiple storage classes plus lifecycle management for objects and buckets, very secure and durable
Cloud Bigtable Def
Petabyte-scale NoSQL database, High-throughput and scalability-Wide column key/value data, Time-series, transactional, IoT data
BQ Def
Petabyte scale data warehouse, Fast SQL queries across large datasets, Foundations for BI and AI, and has Public Datasets
Cloud SQL
Managed MySQL + PostgreSQL + SQL Server instances, built-in backups, replicas, and failover, vertically scalable
Cloud Spanner
Global SQL-based relational database, horizontal scalability and high availability, strong consistency, good for financial sector
Cloud Firestore
Fully managed NoSQL document database, large collections of small JSON documents, provides real-time database SDKs
Cloud Memorystore
Managed Redis instances, in-memory DB cache or message broker, built in high availability, vertically scalable
Data Modeling
Structured data has a consistent model; a model may already be in place, and the data could require prep or transformation. 3 stages: Conceptual-what are the entities and relationships? Logical-what are the structures of the entities, and can the model be normalized? Physical-how will I implement this in the database, and what keys or indexes do I need?
Relational Data vs Non Relational Data
Relational-good schema design, normalization to reduce waste, accuracy and integrity via data types and tables
Non-relational-simple key/value store or document store, high volume columnar database-NoSQL
A pipeline could, for example, move data from Bigtable to BigQuery
Bucket
Store objects/files; bucket names are globally unique; buckets exist within projects; regional (lowest cost), dual-region, and multi-region buckets (the latter two are geo-redundant)
Storage classes: Standard $0.02 per GB; Nearline $0.01 per GB with 30-day minimum storage and a data retrieval fee; Coldline $0.004 per GB with 90-day minimum storage; Archive $0.0012 per GB with at-least-a-year minimum storage and a data retrieval fee
GCS Info
Objects are stored as opaque data; objects are immutable, overwrites are atomic, and objects can be versioned
Access through the Google Cloud Console, HTTP API, SDKs, and the gsutil command in the terminal; supports parallel uploads, transcoding, integrity checking, and requester pays
GCS costs: operation charges (Class A, e.g. uploads, is more expensive than Class B, e.g. downloads), network charges such as retrieving data from a bucket, and data retrieval charges. You can apply lifecycle rules to a bucket; access control uses IAM for buckets, ACLs for granular access, or signed policy documents-IAM policies have members and roles
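A minimal sketch of common bucket operations with the google-cloud-storage Python client; the bucket and object names are placeholders:

```python
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-unique-bucket-name")

blob = bucket.blob("reports/2024/summary.csv")
blob.upload_from_filename("summary.csv")        # upload (a Class A operation)
blob.download_to_filename("summary_copy.csv")   # download (a Class B operation)

for obj in bucket.list_blobs(prefix="reports/"):
    print(obj.name)
```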
Service Accounts Best Practices
IAM for bulk access to buckets, roles assigned to members, ACLs for granular access to buckets, ACLs grant permissions to a scope, IAM is more recommended
Data Transfer Service
Transfers data into Cloud Storage from a source to a sink; sources include HTTP/HTTPS endpoints, Amazon S3, and other Cloud Storage buckets; can filter by names and dates, schedule transfers, and optionally delete objects in the destination or source bucket
Full access: storagetransfer.admin; Submit transfers: storagetransfer.user; List jobs and operations: storagetransfer.viewer
BQ Transfer Service
Automates data transfer to BigQuery; data is loaded on a regular basis; backfills can recover gaps; sources include Google marketing platforms, with some sources in beta
Google Transfer Appliance
For very large amounts of data; a physical storage device shipped to you and attached to your server (terabyte-scale versions); data is encrypted, so security is guaranteed
Human Accounts Cont
Users are human users who authenticate with their own credentials; should not be used for non-human operations because passwords could leak
Service Accounts Cont
Created for a specific non-human task with granular authorization; the identity can be assumed by an application; keys can be easily rotated; there are Google-managed and user-managed service account keys; SAs are managed by keys, and user-managed keys are downloadable JSON files-very powerful, so protect them