GCP Professional Data Engineer Cert Flashcards
Relational Databases
Has relationship between tables
Google Cloud SQL: Managed SQL instances-little setup required, multiple database engines such as MySQL, scalability and availability (vertically scales up to 64 cores), secure connections via the Cloud SQL Proxy, SSL/TLS, or private IPs, plus maintenance windows, automated backups, point-in-time recovery, and instance restores
Importing MySQL Data Commands: InnoDB mysqldump export/import, CSV import, External replica promotion-need binary log retention
PostgreSQL instances are another option-they have automated maintenance and high availability, though some features are unsupported
Import PostgreSQL commands: SQL dump export/import, CSV import
Cloud Firestore
- Fully managed NoSQL document store-serverless with autoscaling
- Realtime DB with mobile SDKs, Android and iOS client libraries, and frameworks for popular programming languages
- Strong scalability and consistency-horizontal autoscaling
Multiple documents bundled together = a collection
Documents can contain sub-collections, e.g. messages within a chat room document
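A minimal sketch of the document/collection model using the google-cloud-firestore Python client; the "chatrooms" collection, "messages" sub-collection, and field names are made-up placeholders, not anything defined in these notes:

```python
from google.cloud import firestore

db = firestore.Client()

# A collection ("chatrooms") bundles multiple documents ("room-a").
room_ref = db.collection("chatrooms").document("room-a")
room_ref.set({"name": "Room A", "members": 12})

# Each document can hold sub-collections, e.g. the room's messages.
msg_ref = room_ref.collection("messages").document()  # auto-generated document ID
msg_ref.set({"from": "alice", "text": "hello", "sent": firestore.SERVER_TIMESTAMP})
```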
Cloud Spanner
- Managed SQL -compliant DB-SQL schemas and queries with ACID transactions
- Horizontally scalable: Strong consistency across rows, regions from 1 to 1,000s of nodes
- Highly available-automatic global replication, no planned downtime and a 99.999% SLA
High Cost
CAP Theorem
Consistency-every read reflects the most recent write, Availability-the database is always available for queries, Partition Tolerance-the system must tolerate failures and the loss of part of the network (a partition)
A distributed system can realistically only guarantee two of the three at once
Spanner is strongly consistent and highly available; in rare cases it will choose consistency over availability. It runs on Google's global private network and offers five 9s of availability
Cloud Spanner Architecture
An instance is an allocation of resources: an instance configuration (regional or multi-regional) and an initial number of nodes
Regional configuration: a region contains one or more zones, each holding a replica of the data. You specify a node count per instance; with node count 1 each replica is powered by a single VM, and increasing the node count adds more machines for more computing power. The replicas stay the same-only the compute nodes serving them change, and each node spans the replicas across the different zones
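A hedged sketch of querying a Spanner instance with the google-cloud-spanner Python client; the instance, database, and table names are placeholders:

```python
from google.cloud import spanner

client = spanner.Client()
instance = client.instance("test-instance")   # node count was chosen when the instance was created
database = instance.database("example-db")

# Reads are served from the replicas; compute comes from the instance's nodes.
with database.snapshot() as snapshot:
    results = snapshot.execute_sql("SELECT SingerId, FirstName FROM Singers LIMIT 10")
    for row in results:
        print(row)
```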
Cloud Memorystore
In memory database
1. Fully managed Redis Instance-provisioning, replication, failover-fully automated
2. Basic tier: an efficient cache that can withstand a cold restart and a full data flush
3. Standard tier-adds a cross-zone replication and automatic failover
Benefits-no need to provision own VMs,scale instances with minimal impact, private IPs and IAM, automatic replication and failover
Creating an Instance: choose Redis version 3.2 or 4, a service tier and region, and memory capacity of 1-300 GB (this determines network throughput); add configuration parameters as needed
Connecting to Instances: Compute Engine, Kubernetes Engine, App Engine, Cloud Functions (via the serverless VPC connector)
Import and Export: Export to RDB backup (BETA)-admin operations are not permitted during export, latency may increase, RDB file is written to Cloud Storage
Import from RDB backup: overwrites all current instance data; the instance is unavailable during the import process
Use Cases: Redis can be used as a session cache (common uses are logins and shopping carts), as a message queue that enables loosely-coupled services, or for advanced Pub/Sub-style messaging
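A minimal sketch of the session-cache use case with the standard redis-py client; the host IP stands in for your Memorystore instance's private IP:

```python
import redis

r = redis.Redis(host="10.0.0.3", port=6379)

# Cache a user's shopping cart for one hour.
r.setex("session:user42", 3600, '{"cart": ["sku-1", "sku-2"]}')

cart = r.get("session:user42")  # returns None once the key has expired
print(cart)
```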
Comparing Storage Options
- Ask yourself: is this structured or unstructured data? Structured: SQL data, NoSQL data, analytics data, keys and values. Unstructured: binary blobs, videos, images, proprietary files-for unstructured data use Cloud Storage
- Is the data going to be used for analytics? Low latency vs warehouse. Low latency: petabyte scale, single-key rows, time series or IoT data-choose Cloud Bigtable. Warehouse: petabyte scale, analytics warehouse, SQL queries-choose BigQuery
- Is this relational data? Horizontal scaling vs vertical scaling. Horizontal scaling: ANSI SQL, global replication, high availability and consistency-it's expensive, but if the client can afford it (most financial institutions probably would), choose Cloud Spanner. Vertical scaling: MySQL or PostgreSQL, managed service, high availability-choose Cloud SQL
- Is the data non-relational? NoSQL vs key/value. NoSQL: fully managed document database, strong consistency, mobile SDKs and offline data-choose Cloud Firestore. Key/Value: managed Redis instances (does what Redis does)-choose Cloud Memorystore
Streaming
Continuous collection of data, near real time analytics, windows and micro batches
Batch
Data gathered with a defined time window, large volumes of data, data from legacy systems
No-SQL
Anything that is not SQL-key/value stores, JSON document stores, tools like MongoDB and Cassandra
SQL
Row-based tabular data, relational-tables connect to other tables in queries
On-Line Analytical Processing (OLAP)
Low volume of long running queries
Aggregated historical data-purchasing analytics
On-Line Transactional Processing (OLTP)
High volume of short transactions, high integrity, sql
Modifies the database
Defines Big Data
- Volume: Scale of information being handled by data processing systems
- Velocity: Speed at which data is being processed, ingested, analyzed, and visualized
- Variety: The diversity of data sources, formats, and quality.
Map Reduce
A programming model-Map and Reduce functions
Distributed Implementation
Created at Google to solve large-scale data processing problems
Map Function
Takes an input pair from the user and produces a set of intermediate key/value pairs
Reduce Function
Merges intermediate values associated with the same intermediate key, forms a smaller set of values
This model standardized the framework; the implementation abstracts away the distributed computing details: parallelizing and executing-partitioning, scheduling, and fault tolerance
Splits all the jobs to small chunks
Master and worker cluster model
Failed worker jobs reassigned
Worker files buffered to local disk
Partitioned output files
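An illustrative word-count sketch of the Map and Reduce functions in plain Python, ignoring the distributed machinery (partitioning, scheduling, fault tolerance) that a real MapReduce implementation handles for you:

```python
from collections import defaultdict

def map_fn(document):
    # Emit an intermediate key/value pair for every word.
    for word in document.split():
        yield (word.lower(), 1)

def reduce_fn(key, values):
    # Merge all intermediate values that share the same key.
    return (key, sum(values))

# Simulate the shuffle step: group intermediate values by key.
intermediate = defaultdict(list)
for doc in ["the quick brown fox", "the lazy dog"]:
    for key, value in map_fn(doc):
        intermediate[key].append(value)

print([reduce_fn(k, v) for k, v in intermediate.items()])
```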
Hadoop and HDFS
Named after a toy elephant-inspired by the Google File System-originated in the Apache Nutch project-became a sub-project in 2006
Modules: Hadoop Common-base module with the shared libraries and startup scripts, Hadoop Distributed File System (HDFS)-a distributed, fault-tolerant file system that runs on commodity hardware as part of a Hadoop cluster, Hadoop YARN-handles resource management tasks like job scheduling and monitoring for Hadoop jobs, Hadoop MapReduce-Hadoop's own implementation of the MapReduce model, which includes libraries for map and reduce functions, partitioning, reduction, and custom job configuration parameters
HDFS Architecture-can help with Cloud Dataproc
There is a server running the Name Node-within the Name Node there is the filesystem metadata
Other servers run Data Nodes, which store very large files across the cluster as a series of blocks
Servers are grouped into racks within the cluster so the shortest possible network path can be chosen
The client can make multiple requests: one to the Name Node for metadata, then requests across racks to get data from multiple Data Nodes
Blocks are replicated across servers and racks for fault tolerance
The YARN architecture is similar, but a server runs a Resource Manager instead-the client sends jobs to the Resource Manager, then on individual workers the Node Manager process runs to handle local resources, request tasks from the master, and return the results
Apache Pig-A high level framework for running MapReduce jobs on Hadoop clusters
Platform for analyzing large datasets
Pig Latin defines analytics jobs: merging, filtering, and transformation-high level, with SQL-like simplicity
Good for ETL jobs since it has a procedural data flow
And it is an abstraction for MapReduce
Apache Pig compiles these instructions into MapReduce jobs, which are then sent to Hadoop for parallel processing across the cluster
Apache Spark
The linear flow of data was an issue-e.g. reading data, mapping across it, reducing the results, and writing to disk
Apache Spark-a general-purpose cluster-computing framework-allows concurrent computational jobs to be run across massive datasets
It uses resilient distributed datasets (RDDs), treating a working set of data as a form of distributed shared memory
Spark Modules
Spark SQL-structured data in spark stored in abstraction, programmatic querying-data frames API
Spark Streaming-streaming data ingestion in addition to batch processing-very small batches
MLLib-machine learning library, machine learning algorithms-classification, regression, decision trees
GraphX-iterative graph computation
Supports languages: Python, Java, Scala, R, SQL
MUST have 2 things: a Cluster Manager (YARN or Kubernetes) and a distributed storage system (HDFS, Apache HBase, Cassandra)
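A hedged PySpark word-count sketch; assumes a cluster (or local mode) is available and the GCS input path is a placeholder:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("word-count").getOrCreate()

lines = spark.read.text("gs://example-bucket/input.txt").rdd.map(lambda r: r[0])
counts = (lines.flatMap(lambda line: line.split())   # one element per word
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))      # aggregate counts per word

print(counts.take(10))
spark.stop()
```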
Hadoop vs Spark
Hadoop: Slow disk storage, high latency, slow, reliable batch processing
Spark: Fast memory storage, low latency, stream processing, 100x faster in-memory, 10x faster on disk, more expensive
Apache Kafka
Publish/subscribe to streams of records
Like a message bus but for data
High throughput and low latency-ingesting millions of events from devices
Ex: Handling >800 Billion messages a day at LinkedIn
Four main APIs in Kafka: Producer-allows an app to stream records to a Kafka topic. Consumer-allows an app to subscribe to one or more topics and process the stream of records contained within. Streams-an API designed to allow an application to be a stream processor itself-transform data, then send it back to Kafka. Connect-for building reusable connectors that link Kafka topics to external applications and data systems
Kafka vs Pub/Sub
Kafka: Guaranteed message ordering, tunable message retention, polling(Pull) subscriptions only, unmanaged
Pub/Sub: No message ordering guaranteed, 7 day maximum message retention, pull or push subscription, managed
Pub Sub Intro
Message Bus takes care of all messages between devices
Pub/Sub splits it in different topics-anything can publish a message to a topic or choose to receive a message from a topic.
Information from users/apps are published to a topic
Topics decouple producers from consumers on the message bus-this introduces resilience-Pub/Sub is a shock absorber
Cloud Pub/Sub: global messaging and event ingestion, serverless and fully managed, 500 million messages per second, 1 TB/s of data
Pub/Sub Great Features-Multiple publisher/subscriber patterns, at least once delivery, real time or batch, integrates with Cloud Dataflow
Use Case Distributing Workloads Pub/Sub
queue up a large number of tasks in a Pub/Sub topic and distribute it amongst multiple workers-like compute engine instances
Asynchronous Workflows Pub/Sub
Controls the order of events: an order can be sent into a topic, consumed by a worker system like invoicing, then passed into a queue for the next system to consume, like packaging and posting
Distributing Event Notifications Pub/Sub
A system sets up new users when they register with your service: a registration could publish a message, and the system could be notified to set the user up
Distributed Logging: logs could be sent to a Pub/Sub topic to be consumed by multiple subscribers, like a monitoring system and an analytics database for later querying
Device Data Streaming Pub/Sub
Hundreds of thousands or more internet-connected devices can stream their data into Pub/Sub topics so that it can be consumed on demand by your analytics streams, or transformed through Dataflow first
One-to One Pub/Sub
There is a Publisher, Topic, and then a Subscriber
Publisher sends messages to the topic in Pub/Sub
The subscriber receives the messages and reads them through their own subscription
Many to many Pub/Sub
Just like the one-to-one pattern but this has multiple topics
Publishing Messages
Create a message containing your data (a JSON payload that's base64 encoded, with a payload size of 10MB or less), then send the payload as a request to the Pub/Sub API, specifying the topic the message should be published on
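A minimal publishing sketch with the google-cloud-pubsub Python client; the project and topic names are placeholders, and the client handles encoding the payload for you:

```python
import json
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "my-topic")

payload = json.dumps({"user": "alice", "action": "signup"}).encode("utf-8")  # bytes, <= 10MB
future = publisher.publish(topic_path, payload, origin="web")  # extra kwargs become message attributes
print(future.result())  # message ID once the publish succeeds
```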
Receiving Messages
Create a subscription to a topic-subscriptions are always associated with a single topic. Pull is the default delivery method: make ad hoc pull requests to the Pub/Sub API specifying your subscription, and acknowledge each message you receive or it will be redelivered. The push delivery method sends messages to an endpoint-the endpoint must be HTTPS with a valid SSL cert and accept POST requests
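A minimal pull-and-acknowledge sketch; the subscription name is a placeholder, and unacknowledged messages will be redelivered:

```python
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
sub_path = subscriber.subscription_path("my-project", "my-subscription")

response = subscriber.pull(request={"subscription": sub_path, "max_messages": 10})
for received in response.received_messages:
    print(received.message.data)

ack_ids = [m.ack_id for m in response.received_messages]
if ack_ids:
    subscriber.acknowledge(request={"subscription": sub_path, "ack_ids": ack_ids})
```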
Integrations Pub/Sub
Client libraries for popular languages like Python, C#, Go, Java, Node, PHP, and Ruby. Cloud Dataflow is supported-you can use the Apache Beam SDK to read messages individually or in batches. Also supported by Cloud Functions and Cloud Run. Foundation of Cloud IoT Core-sends and receives messages from connected devices
Developing for Pub/Sub: Local Pub/Sub emulator-Google Cloud SDK and Java Runtime Environment 7+
Advanced Pub/Sub Topics
At Least Once Delivery: Each message is delivered at least once for every subscription
Undelivered Messages: deleted after the message retention duration-default 7 days-can’t be longer
Messages published before a subscription is created will not be delivered to that subscription
Subscriptions expire after 31 days of inactivity-new subscriptions with same name have no relationship to the previous subscription
Other Features
Seeking Feature: set retain acked messages to True so the subscription retains messages after they are acknowledged-by default messages are retained for a maximum of 7 days. You can then tell the subscription to seek to a specific time in the past-basically rewinding the clock to receive past messages. You can also seek to a future timestamp
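A hedged sketch of seeking a subscription back in time with the Python client (assumes the subscription has retain acked messages enabled; names and the timestamp are placeholders):

```python
import datetime
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
sub_path = subscriber.subscription_path("my-project", "my-subscription")

# Rewind the subscription so messages retained after this time are redelivered.
one_hour_ago = datetime.datetime.now(tz=datetime.timezone.utc) - datetime.timedelta(hours=1)
subscriber.seek(request={"subscription": sub_path, "time": one_hour_ago})
```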
Snapshots: useful if you are deploying new code. You can save a snapshot ahead of time to capture the current state of the subscription, preserving all unacknowledged and future messages
Ordering Messages: you may not receive messages in the right order-use timestamps when final order matters, or consider an alternative for transactional ordering, maybe a SQL database
Resource Locations: Messages stored in nearest region, message storage policies allow you to control this, additional egress fees may apply
Access Control Pub/Sub
Use service accounts for authorization, grant per-topic or per-subscription permission, grant limited access to publish or consume messages
Exam Tips Pub/Sub
Think about where you can decouple data-Pub/Sub is a shock absorber, receives data globally and it can be consumed by other components at their own pace
Where can you use Pub/Sub for events. It can add event logic to a stack and it can pass events through one system to another
Be aware of Pub/Sub limitations-message data must be 10MB or less, beware of expired messages and unused subscriptions
Look for Apache Kafka in use cases, if this comes up, Pub/Sub can be a good option
Keep an eye out for Cloud IoT as a solution
Google Cloud Tasks-get familiar with it
Browse the reference architectures-Smart Analytics references
What is Dataflow
Fully managed, serverless tool, uses the open source Apache Beam SDK, supports expressive SQL, Java, and Python APIs, real-time and batch processing, and integrates with the rest of the GCP stack
Beam's unified development model allows us to reuse code across streaming and batch pipelines
Sources: Cloud Pub/Sub, BigQuery, and Cloud Storage, it can be external to GCP like Kafka
Common Sinks: Cloud Storage, BigQuery, and Bigtable; Cloud Machine Learning can also be applied to the data being written to a sink
Dataflow Process
You have a Pipeline, Source, and Sink
The Pipeline will take data from the Source processes data and then places it into the Sink
Apache Beam connectors allow you to connect to the Source and Sink so you can read and then write your output data into the Sink
Common Dataflow Sources
Cloud Pub/Sub, BigQuery, and Cloud Storage, it can be external to GCP like Kafka
Common Dataflow Sinks
Cloud Storage, BigQuery, and Bigtable; Cloud Machine Learning can also be applied to the data being written to a sink
Driver
The program you write using the Apache Beam SDK (Java or Python). It defines your pipeline
Pipeline: full set of transformations that your data undergoes from initial ingestion to final output
Driver goes to the runner
Runner
Software that manages the execution of your pipeline-a translator for the backend execution framework; it can also manage local execution of driver programs for testing and debugging
PCollections
Used in pipelines to represent data as it is transformed within the pipeline-a multi-element dataset. PCollections can represent both batch and streaming data.
Data coming from a fixed source, the dataset=Bounded, treated like a batch
Continuously updating source, dataset=Unbounded(Stream)
The PCollection is usually from reading an external source
Transform usually represents a step in your pipeline, transforms use PCollections as inputs and outputs, each transform takes one or more PCollections as inputs and generates 0 or more output PCollections
Pipeline Development Cycle
- You have to Design your pipeline first-input and output methods, structure and transformations
- Then you Create it-instantiating a pipeline object and implementing the transformations that were identified
- Testing: debugging a failed pipeline execution on a remote system is hard, so try to do local unit testing first
Considerations
- Start with the location of your data
- Input data structure and format
- Transformation objectives
- Output data structure and location
Pipelines Structures
Basic: Input linear to output
Branching: a single PCollection can branch-two transforms applied to the same PCollection result in two different output PCollections
Branching can also be conducted on a Transform
Pipeline branches can also be merged; you need to merge all branches of your pipeline at some point, through a Flatten or Join transform
Pipelines can also have multiple sources and they can be independently transformed
DAG
Dataflow Pipelines represent a Directed Acyclic Graph or DAG-a graph with a finite number of vertices and edges-no directed cycles
Pipeline Creation
- Create an Object
- Create a PCollection using read or create transform
- Apply multiple transforms as required
- Write out final PCollection
- Execute the pipeline using the pipeline runner
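A minimal sketch of those creation steps with the Apache Beam Python SDK, run locally on the DirectRunner; the file paths are placeholders:

```python
import apache_beam as beam

with beam.Pipeline(runner="DirectRunner") as p:             # 1. create the pipeline object
    lines = p | beam.io.ReadFromText("input.txt")           # 2. initial PCollection from a read transform
    counts = (lines                                         # 3. apply transforms as required
              | beam.FlatMap(lambda line: line.split())
              | beam.Map(lambda word: (word, 1))
              | beam.CombinePerKey(sum))
    formatted = counts | beam.Map(lambda kv: f"{kv[0]}: {kv[1]}")
    formatted | beam.io.WriteToText("output")               # 4. write out the final PCollection
# 5. leaving the with-block executes the pipeline on the runner
```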
ParDo
generic parallel processing transform: can take an element from PCollection1 and transform it to PCollection2, can output 1, none, or multiple output elements from a single input element
User-defined function(UDF)
user written code that describes the operation to apply to each element of the input PCollection
Aggregation Transformation
The process of computing a single value from multiple input elements-either across all elements, or across all elements within a window
Characteristics of PCollections
- Any data type-but all elements must be of the same type
- Don’t support random access
- Immutable or unchanging
- Boundedness-a PCollection can be Bounded (a finite number of elements) or Unbounded (no upper limit on the number of elements it can contain)
- Timestamp is associated with every element of a PCollection-initially assigned by the source that results in the creation of the PCollection
Core Beam Transforms
- ParDo-generic parallel processing transform
- GroupByKey-processes collection of key value pairs, collects all values associated with a unique key
- CoGroupByKey-used when combining multiple PCollections-performs a relational join of two or more key value PCollections where they have the same key type
- Combine-requires you to provide a function that defines the logic for combining elements; it has to be associative and commutative-sum, min, max
- Flatten- merges multiple input PCollections into a single logical PCollection
- Partitioning-provides the logic that determines how the elements of the PCollection are split up.
Event time Dataflow
Event time is when the data element actually occurred, determined by the timestamp on the data element itself; processing time refers to the times the element was processed during its transit through your pipeline
Windowing
Assigned to a PCollection, subdivides the elements of a PCollection according to their timestamps, do this to allow grouping or aggregating operations over unbounded collections, it groups elements into finite windows
Fixed Window
- Fixed-simplest, constant non overlapping time interval
Sliding Window
- Sliding-represent time intervals, but they can overlap and an element can belong to more than one window-useful for taking running averages of data
Per Sessions
A new session window is created in a stream when there is an interruption in the flow of events that exceeds a certain time period; applied on a per-key basis-useful for data that is irregularly distributed with respect to time
Single global
everything else-window transform
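A hedged sketch of applying a fixed one-minute window to an unbounded PCollection; "events" stands in for a streaming source such as Pub/Sub:

```python
import apache_beam as beam
from apache_beam import window

def count_per_minute(events):
    # Subdivide the unbounded collection into 60-second windows, then aggregate per key.
    return (events
            | "FixedWindow" >> beam.WindowInto(window.FixedWindows(60))
            | "CountPerKey" >> beam.combiners.Count.PerKey())
```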
Watermark
The system's notion of when all the data for a certain window can be expected to have arrived. Data is late when the watermark has moved past the end of the window and further data elements arrive with a timestamp within that window
Triggers
- Event time, event-time based
- Processing time
- Data driven-when data in a particular window meets a certain criterion
- Composite-combine other triggers in different ways
Pipeline Access
Run Cloud Dataflow pipelines
1. Can be run locally
2. Submit pipeline to GCP Dataflow managed service
GCP service accounts
1. Cloud Dataflow service-uses Dataflow service account
2. Worker instances-Controller service account
Cloud Dataflow Managed Service
- The pipeline gets submitted to the GCP Dataflow Service
- The Dataflow will create a Job
- The Job creates managers and workers to carry out various tasks
- For the execution, the workers need files/resources from Cloud Storage
- The Job can be monitored with the Cloud Dataflow Monitoring Interface or the Cloud Dataflow Command-line Interface
Cloud Dataflow Service Account
- Automatically created when Cloud Dataflow project is created
- Manipulates job resources
- Assumes the Cloud Dataflow service agent role
- Has Read/Write Access to project resources
Controller Service Account-used by the workers-uses the Compute Engine
- Compute Engine instances-execute pipeline operations
- Run Metadata operations-don’t run on local clients or compute engine workers-determine size of file in Cloud Storage
- User-managed controller service account-use resources with fine-grained access control
Security Mechanisms
- Submission of the pipeline-users have to have the right permissions
- Evaluation of the pipeline-data is encrypted and not persisted beyond the evaluation of the pipeline; communication between workers happens over a private network, subject to the project's permissions and firewalls; you can specify region and zone
- Accessing telemetry or metrics-encrypted at rest-controlled by project’s permissions
- You can also use Cloud Dataflow IAM roles
Regional Endpoints in Dataflow
- Manages metadata about Cloud Dataflow jobs
- Controls Cloud Dataflow workers
- Automatically selects best zone
Good reasons for regional endpoints
1. Security and compliance
2. Data locality
3. Resiliency
Machine Learning with Cloud Dataflow
- Handles data extraction from Cloud Storage
- Data Preprocessing in Apache Beam pipeline through Cloud Dataflow, TensorFlow API used to normalize some values between 0 and 1, the Beam partition transform is used to split the data set into the training data set and the evaluation data set
- TensorFlow is used to train a model locally on your machine or through Cloud Machine Learning-doesn’t use Cloud Dataflow
- Predictions-Cloud Dataflow reads data from Pub/Sub, applies the model, and writes the predictions into another Pub/Sub topic
Benefits of Dataflow
You can use customer-managed encryption keys
Batch pipelines can be processed in a cost-effective manner with Flexible Resource Scheduling (FlexRS)-uses advanced scheduling, the Cloud Dataflow Shuffle service, and preemptible VMs
Cloud Dataflow is great for migrating MapReduce jobs-on-premises MapReduce jobs can be rebuilt on Cloud Dataflow
Cloud Dataflow with Pub/Sub Seek-replay and reprocess previously acknowledged messages-especially in bulk
Cloud Dataflow SQL
- Develop and run Cloud Dataflow jobs from the BigQuery web UI
- Cloud Dataflow SQL (ZetaSQL variant) integrates with Apache Beam SQL
Apache Beam SQL-Query bounded and unbounded PCollections, Query is converted to a SQL transform
Cloud Dataflow SQL-Utilise existing SQL skills, join streams with BigQuery tables, query streams or static datasets, write output to BigQuery for analysis and visualization
Dataflow Exam Tips
Beam and Dataflow are the preferred solution for data processing pipelines, especially for streaming data
Pipeline: represents the complete set of stages required to read data perform any transformations and write data
PCollection: represents a multi-element dataset that is processed by the Pipeline
ParDo: core parallel processing function of Apache Beam which can transform elements of an input PCollection into an output PCollection.
DoFn: template you use to create user-defined functions that are referenced by a ParDo
Sources-where data is read from
Sinks-where data is written to
Window: allows streaming data to be grouped into finite collections according to time or session-based windows
Watermark: indicates when Dataflow expects all data in a window to have arrived; data arriving past the watermark is considered late
Dataflow is normally the preferred solution for data ingestion pipelines
Cloud Composer is sometimes used for ad hoc orchestration/provide manual control of Dataflow pipelines themselves
What is Dataproc?
A managed cluster service for Hadoop and Apache Spark
Managed is preferable because it is low cost and you can control which clusters to grow and which clusters to turn off
Dataproc Architecture
Master: Dataproc creates a master node running the YARN Resource Manager and the HDFS Name Node
It also runs the Worker Nodes
Clusters come pre-installed with Hadoop, Apache Spark, Zookeeper, Hive, Pig, Tez, and other tools like Jupyter Notebooks and the GCS connector
Storage and configuration are handled by Dataproc
Dataproc benefits
- Cluster actions complete in ~90 seconds
- Pay-per-second, with a 1-minute minimum
- Scale up/down or turn off at will
Using Dataproc
You can submit Hadoop/Spark jobs, Enable autoscaling-if necessary to cope with the load of the job, Output to GCP Services-like Google Cloud Storage, BigQuery and BigTable, you can also Monitor with Stackdriver-fully integrated logging and monitoring for the job performance and output
Cluster Location
Regional: isolate resources used for Dataproc to one region, like us-east1 or europe-west1
Global: Resources not isolated to a single region-can place cluster in any zone worldwide
Single Node Cluster
A single VM that runs both the master and worker processes-can't autoscale
Standard Cluster
Has a Master VM that runs the YARN Resource Manager and the HDFS Name Node, plus two Worker Nodes that each run a YARN Node Manager and an HDFS Data Node-the disks are customizable. There are also Pre-emptible Workers-they sometimes help with large jobs, but can't provide storage for HDFS
High Availability Cluster
You have three Masters with YARN and HDFS configured to run in high availability mode-no interruptions
Submitting Jobs
- Gcloud command line
- GCP Console
- Dataproc API
- SSH to Master Node
Monitoring and Logging
- Use Stackdriver Monitoring to monitor cluster health
- Cluster/yarn/allocated_memory_percentage
- Cluster/hdfs/storage_utilization
- Cluster/hdfs/unhealthy_blocks
Custom Clusters
You can customize the Dataproc default image: Google provides a generation script, you apply a customization script that adds your custom packages on top of the default image, and the resulting custom image is then stored in your Google Cloud project
You can also have:
Custom cluster properties-so you can change the values
You can add initialization actions that are custom to the cluster-scripts loaded to a Cloud Storage Bucket-mostly for Staging binaries
You can also add custom Java/Scala dependencies-this saves you from precompiling them yourself
Autoscaling in Dataproc
Huge bonus: you can create lightweight clusters and have them automatically scale up to the demands of the job-autoscaling policies are written in YAML, with configuration numbers for primary workers and secondary workers
When not to use Autoscaling
- When data is stored on HDFS
- When running Apache Spark Streaming
- When clusters are idle (just delete them)
- When using YARN Node Labels
Workflow Templates
Written in YAML that can specify multiple jobs w/ different configs and parameters that can be run in succession
Workflow Templates have to be created, then instantiated with gcloud-you can send jobs to a new cluster each time or to an existing cluster
Advanced Compute Features Dataproc
- Local SSDs-faster runtimes
- GPUs to nodes-for machine learning
Cloud Storage Connector
- Use GCS instead of HDFS
- Cheaper than persistent disk
- High availability and durability
- Decouple storage from cluster lifecycle
Exam Tips
Know when to choose Dataproc: Quickly migrating Hadoop and Spark workloads into Google Cloud Platform
Understand the benefits of Dataproc: Managed over Hadoop or Spark cluster-Ease of scaling, being able to use Cloud Storage instead of HDFS, and the connectors to other GCP services like BigQuery and Bigtable
Know Cluster Options: When to pick standard vs high availability, autoscaling and ephemeral
Get to know the open-source Big Data ecosystem-Hadoop, Spark, Zookeeper, Hive, Tez, and Jupyter
Know when to choose Dataflow-sometimes it is the preferred product for big data ingesting, like in streaming workloads and it implements the Apache Beam SDK
Bigtable Concepts
Managed wide-column NoSQL database-series of key value pairs where the values are split into columns
Has very high throughput-10,000 reads per second per node
Also has low-latency-6 milliseconds per node
Scales linearly
Out of the box high availability-cross cluster replication
Developed internally by Google and was used for Google Earth, Finance, and Web Indexing
HBase was created as the open-source implementation of the Bigtable model and was adopted as a top-level Apache project; Cloud Bigtable supports the Apache HBase library for Java
Cloud Bigtable
- Has a ROW KEY as the only index
- Then it can be attached to columns
- The columns can be grouped by families
- The empty values don’t take up any space since it’s a sparse db
- Scaled to thousands of columns and billions of rows
Important Bigtable Features
- Blocks of contiguous rows are sharded into tablets
- Tablets are chunks of sorted rows-put together they form a complete table-managed by nodes in your cluster
- Tablet data is stored in Google Colossus-can scale cluster sizes
- Splitting, merging, and rebalancing happen automatically
Bigtable Scenarios
Suited well for financial, marketing data and transactional data
Also good for time series data and data from IoT devices
Good for streaming data and machine learning applications
Bigtable Architecture
You create an instance
You have a instance type, storage type and app profiles-describes parameters for incoming connections
To connect, you use an instance ID and an application profile
Inside the instance, you have clusters
Inside the clusters you have nodes which are workhorses of Bigtable
The flexibility of Data Storage comes from separating our cluster nodes and storing data in Colossus
Nodes control tablets and a tablet can’t be shared by more than one node
Instance Types
- Production-1+ clusters, 3+ Nodes per cluster
- Development-a single-node cluster for development work; a development instance can't use replication, doesn't have an SLA, and is a cheaper option
SSD Storage Type
Almost always the right choice-the fastest and most predictable option, 6 ms latency for 99% of reads and writes, and each node can process 2.5 TB of SSD data
HDD Storage Type
Each node can process 8 TB of HDD data, but throughput is limited and row reads are only 5% of the speed of SSD reads; only suitable for storing at least 10 TB of infrequently-accessed data with no latency sensitivity-otherwise you could end up spending more money on extra cluster nodes
Application Profiles
- Custom application specific settings for handling incoming connections
- Single or multi-cluster routing
- Single-cluster routing: routes to a single cluster that you define, even if you have multiple clusters in an instance
- Multi-cluster routing: routes to the nearest available cluster; if it becomes unavailable, traffic goes to the next cluster
- Ask whether the data needs single-row transactions-if so, you have to use single-cluster routing
Bigtable Configuration
- Instances can run up to four clusters
- Clusters exist in a single zone
- Up to 30 nodes per project
- Maximum of 1,000 tables per instance
Bigtable Access Control
- Cloud IAM roles
- Applied at project or instance level to-
- Restrict access or administration.
- Restrict reads and writes
- Restrict development instances or production access
Data Storage Model
The ROW KEY is the only indexed column
Column families allow us to grab only what we need
Column names are called column qualifiers
You can write new data values and the old ones aren’t overwritten
How much data is stored and for how long is configurable with detailed granularity; values are stored as arrays of bytes
Alternative Options to Bigtable
- Need SQL Support OLTP: Cloud SQL
- Need Interactive Queries OLAP and cheaper: BigQuery
- Need structured NoSQL Documents: Cloud Firestore
- Need In-memory Key/Value Pairs: Memorystore
- Need Realtime Database: Firebase
Important Bigtable Info
Rows are sorted alphabetically-design of row key very important
Atomic operations are by row only-be careful when updating
Sparse table system-doesn’t hurt to have a lot of columns/families even if they don’t apply to every entity
Row sizing: individual values no larger than 10MB, and the total row (not including the key) should be under 100MB
Timestamps and Garbage Collection
- Each cell has multiple versions
- Server recorded timestamps
- Sequential numbers
- Expiry policies define garbage collection: can expire based on a specific age or specific number of versions
- Using an HBase client sets the policy to retain only the latest version of a cell; any other client library sets the column family to store infinite versions
The cbt command-line tool is an alternative way to connect to Bigtable
Bigtable Schema Design
Querying on a column value means scanning the entire table and then filtering the results based on a regular expression match to the string contained in that column cell-the most time-expensive way to query Bigtable
You have to plan your queries ahead of time
Field promotion: taking data that you already know and then moving it into the row key itself
Then you can write a command like: scan ‘vehicles’, {ROWPREFIXFILTER => ‘NYMT#86#’}
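An equivalent sketch with the google-cloud-bigtable Python client, scanning the hypothetical 'vehicles' table for row keys starting with 'NYMT#86#'; project and instance names are placeholders:

```python
from google.cloud import bigtable
from google.cloud.bigtable.row_set import RowSet

client = bigtable.Client(project="my-project")
table = client.instance("my-instance").table("vehicles")

row_set = RowSet()
row_set.add_row_range_with_prefix("NYMT#86#")   # only rows whose key starts with the prefix
for row in table.read_rows(row_set=row_set):
    print(row.row_key)
```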
You can also include a timestamp in the ROW KEY design
Never put a timestamp in front of the row key
Designing Row Keys
- Queries use: A row key, a row prefix
- A row range is returned
- Reverse domain names
- String identifiers-reads and writes evenly spread
- Timestamps only as part of a bigger row key design, as long as the timestamp is not first and is reversed
Row Keys to Avoid
- Domain names in order
- Sequential numbers
- Frequently updated identifiers
- Hashed values
Design for Performance
- Lexicographic sorting
- Store related entities in adjacent rows
- Distribute reads and writes evenly
- Balanced access patterns enable linear scaling of performance
Avoid Hotspots
- Use Field Promotion instead
- Try Salting-salted hash to your row key artificially distributes the rows, based on total number of nodes
- Use Google’s Key Visualizer tool
Time Series Data in Bigtable
- Use tall and narrow tables where each row might contain a key and maybe only a single column
- Use rows instead of versioned cells
- Logically separate tables
- Don't reinvent the wheel-there are already good time-series schemas out there, e.g. the OpenTSDB project
Monitoring Bigtable
- Via GCP Console or Stackdriver
- Average CPU utilization of cluster and hottest node
- Single cluster instance-aim for an average CPU load of 70% and the hottest node not over 90% CPU
- For 2 clusters with replication-multi-cluster routing brings additional overhead, so the average CPU load should be 35% and the hottest node at most 45%
- For storage utilization, try to keep it under 70% per node
- To monitor this, try to create application profiles for each application
Autoscaling Bigtable
- Stackdriver metrics can be used for programmatic scaling-done on local computer
- Client libraries query metrics
- Update cluster node counts via API
- Rebalancing tablets can take time and the performance might not improve for 20 mins
- Adding nodes to a cluster doesn’t solve the problem of a bad schema
Replication Bigtable
- Adding additional clusters automatically starts replication, i.e. data synchronization
- Replication is eventually consistent
- Used for availability and failover
- Application isolation
- Global presence
Good Performance in Bigtable
Replication improves read throughput but does not affect write throughput
Use batch writes for bulk data with rows that are close together lexicographically
Monitor instances and use the Key Visualizer to monitor hotspots and bad row keys
Bigtable rebalances tablets-first they all go in the first node, but then they rebalance or spread out to the other growing nodes-the tablets are also being split, merged, and rebalanced to maintain the sorted order of rows
Hotspots sometimes pop up and take a lot of CPU; the other tablets in that node are moved to other nodes so the node serving the hot tablet is less overwhelmed
Good vs Bad Performance in Bigtable
Good Performance: Optimized schema and row key design, large datasets, correct row and column sizing
Bad Performance: Datasets short lived or smaller than 300GB
Exam tips Bigtable
Know when to choose Bigtable: many questions make you choose the right product for the workload; when migrating from an on-premises environment look at HBase, and consider when Bigtable is a better option than BigQuery. Look for time-series data or use cases where latency is an issue
Understand the architecture of Bigtable: Concepts of an instance and a cluster, where Bigtable stores data, and how tablets are re-balanced by the service between nodes
Be aware of causes of bad performance: Like under-resourced clusters, bad schema design, and poorly chosen row keys.
UNDERSTAND ROW KEYS: Linear scaling and performance of Bigtable depend on good row keys. Understand row key design-you might have to point out flaws or pick an ideal row key
Understand Tall vs Wide: Wide Table-stores multiple columns for a given row-key where the query pattern is likely to require all the information about a single entity. Tall Table-suit time-series or graph data and often only contain a single column
Remember Organizational Design: Consider when a development instance is appropriate, remember IAM roles that can be used to isolate access to the necessary groups.
What is BigQuery?
Petabyte-scale, serverless, highly scalable cloud enterprise data warehouse
In memory BI Engine-fast interactive reports
Has machine Learning capabilities (BigQuery ML)-using SQL
Support for geospatial data storage and processing
Key Features of BQ
- High availability
- Supports SQL-can do SQL queries
- Federated Data-can connect to and process data stored outside of BigQuery
- Automatic Backups
- Governance and Security support-data encrypted at rest and in transit
- Separation of Storage and Compute-cost effective scalable storage and stateless resilient compute
Interacting with BQ
- Web console
- Command line tool (bq)
- Client libraries like C#, Go, Java, Node.JS, PHP, Python, and Ruby
Managing Data with BQ
You have a Project and within each Project, you have a Dataset, and within each Dataset, you can have Native Tables, External Tables, or Views
Native Table
Data is held within a BigQuery Storage
External Tables
Backed by storage outside of BigQuery
Views
created by a SQL query
Real Time Events BQ
Streaming, common to push events to Cloud Pub/Sub, then use a Cloud Dataflow job to process and push them into BigQuery
Batch Sources BQ
Comes in a Bulk Load, common to push files to Cloud Storage, then have a cloud Dataflow job pick that data up, process it and then push it into BigQuery
Legacy SQL
- Previously Called BigQuery SQL
- Non-standard SQL dialect
- Migration to standard SQL is recommended
Standard SQL
- Preferred dialect
- Compliant with SQL 2011 standard
- Extensions for querying nested and repeated data
What you can do with BQ
With BigQuery Data you can: Use BI Tools, Use Cloud Datalab, Export to sheets or Cloud Storage, send it to Colleagues, or use it for GCP Big Data Tools like Dataflow or Dataproc
Jobs and Operations in BQ
Job: action that is run in BigQuery on your behalf
Load Job: Load data onto BQ
Export Job: Export from BQ
Query Job: Queries the data in BQ
Copy Job: Copies tables and datasets from BQ
Query Job priorities: Interactive (the default)-results are always saved to a temporary table or to a permanent table-and Batch
Table Storage in BQ
- Capacitor columnar data format
- Tables can be partitioned
- Individual records exist as rows
- Each record is composed of columns
- Table schemas specified at the creation of the table or at a data load
Capacitor in BQ
- The storage system: proprietary columnar data storage that supports semi-structured data (nested and repeated fields); imports convert CSV/JSON into the Capacitor format
- Each value is also stored together with a repetition level and a definition level (value, repetition level, definition level)
Denormalization
- BQ performance best when data is denormalized
- Nested and repeated columns
- Maintain data relationships in an efficient manner
- RECORD (STRUCT) data type-nested records or columns. EX: Address = address.number, address.street, address.city; a single ID can have multiple addresses (see the sketch below)
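A hedged sketch of querying a nested, repeated RECORD column with the BigQuery Python client; the customers table and its addresses field are placeholders:

```python
from google.cloud import bigquery

client = bigquery.Client()
sql = """
    SELECT id, a.street, a.city
    FROM `my_dataset.customers`, UNNEST(addresses) AS a  -- one output row per address per customer
"""
for row in client.query(sql).result():
    print(row.id, row.street, row.city)
```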
Data Formats in BQ
CSV, JSON (newline delimited), Avro (an open source format where the schema is stored together with the compressed data), Parquet (encoded, smaller files), ORC (Hive data), Cloud Datastore exports, and Cloud Firestore exports
BQ Views
A virtual table defined by a SQL query
A SQL query (the view definition) runs against tables
The resulting view is saved into a dataset
You can also query that view, has billing implications since you would also still be running the underlying query
Uses of Views
- Control access to data
- Reduce query complexity
- Construct logical tables
- Ability to create authorized views-give users access to a subset of rows or columns through the view without giving them access to the underlying tables
Limitations of Views
- Can’t export data since unmaterialized
- Can’t use JSON API to retrieve data from a view
- Can't combine standard and legacy SQL
- No user defined functions
- No wildcard table references
- Limited to 1,000 authorized views per dataset
External Data in BQ
You can query directly even though the data is not directly held in BQ
BQ supports Cloud Bigtable, Cloud Storage, and Google Drive
Use Cases for using External Data Source: Load and clean your data in one pass, or you have small, frequently changing data joined with other tables
Limitations for External Data Source in BQ
- No guarantee of consistency
- Lower query performance
- Can’t use TableDataList API method
- Can’t export jobs on external data
- Can’t reference wildcard table query
- Can't query Parquet or ORC formats
- Query results not cached
- Limited to 4 concurrent queries
Other Data Sources
- Public Datasets-available to everyone
- Shared Datasets-Have been shared with you
- Stackdriver log information
Data Transfer Service
Can easily pull Bulk Data into BQ with the Data Transfer Service
Data Transfer Service has multiple Connectors to Google sources, GCP, AWS services like S3 and Redshift, and other third party services like LinkedIn and Facebook
Can be one off events or scheduled to run repeatedly
DTS supports backfills of historical data and has uptime and delivery SLAs
Table Partitioning BQ
Table partitioning-break up big table into smaller tables
Partitions stored separately on physical level
Partitions usually based on a single column called the partition key.
Partitioning BQ 2 Ways
1. Ingestion Time partitioned tables
2. Partitioned Tables
Ingestion Time Partitioning
Partitioned by load or arrival date; data is automatically loaded into date-based partitions (daily); tables include the pseudo-column _PARTITIONTIME-use _PARTITIONTIME in queries to limit the partitions scanned
Partitioned Tables in BQ
Partitioned based on a specific TIMESTAMP or DATE column; data is placed in a partition based on the value in the partitioning column; 2 additional partitions exist: NULL and UNPARTITIONED; use the partitioning column in queries
BQ automatically places data in the right partitions-you need to declare the table as partitioned when creating it
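A hedged sketch of limiting a query to recent partitions with the BigQuery Python client; the dataset, table, and partitioning column (event_date) are placeholders:

```python
from google.cloud import bigquery

client = bigquery.Client()
sql = """
    SELECT user_id, event_type
    FROM `my-project.my_dataset.events`
    WHERE event_date BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY) AND CURRENT_DATE()
"""
# Filtering on the partitioning column means only the last 7 days of partitions are scanned.
for row in client.query(sql).result():
    print(row.user_id, row.event_type)
```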
Clustering Tables BQ
Clustering tables: can be applied to a partitioned table; use clustering when you have filters or aggregations against specific columns in your queries. When partitioning and clustering are used together, data is partitioned by the partition key and then clustered by the cluster key within each partition
In cluster tables: the data associated with a certain cluster key is generally stored together
Ordering is important
Clustering Limitations
- Only supported for partitioned tables
- Standard SQL only for querying clustered tables
- Standard SQL only for writing query results to clustered tables
- Specify clustering columns only when table is created
- Clustering columns can’t be modified after table creation
- Clustering columns have to be top-level, non-repeated columns
- You can specify one to four clustering columns
Querying Guidelines for Clustering Tables
- Filter clustered Columns in the order they were specified
- Avoid using clustered columns in complex filter expressions
- Avoid comparing cluster columns to other columns
Why Partition Tables
- Improve query performance
- Control costs
Benefits of BQ Slots
Slots:
A slot is the unit of computational capacity required to execute SQL queries-used for pricing and resource allocation
The number of slots a query needs is determined by the query size and complexity
BQ automatically manages your slots quota
Flat rate pricing available-purchase fixed number of slots
You can see slot usage using Stackdriver
Cost Controls of BQ
- Avoid using SELECT *
- Use preview options to sample data
- Price queries before executing them (see the dry-run sketch after this list)
- Remember that using LIMIT doesn't affect cost
- View costs using a dashboard and query audit logs
- Partition by date
- Materialize query results in stages
- Consider the cost of large result sets
- Use streaming inserts with caution
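A hedged sketch of pricing a query before running it via a dry run, so no bytes are actually processed; the table name is a placeholder:

```python
from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
job = client.query("SELECT name FROM `my-project.my_dataset.users`", job_config=job_config)
print(f"This query would process {job.total_bytes_processed} bytes")
```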
Query Performance Dimensions
- Input data and data sources
- Shuffling
- Query computation
- Materialisation
- SQL anti-patterns
Input data and data sources best practices BQ
Input data and data sources: Prune partitioned queries, denormalize data whenever possible, use external data sources appropriately, avoid excessive wildcard tables
Query Computation Best Practices
Avoid repeatedly transforming data via SQL queries, avoid JavaScript user-defined functions, order query operations to maximize performance, optimize JOIN patterns
SQL Anti-Patterns
Avoid Self-Joins, avoid data skew, avoid unbalanced joins, avoid joins that generate more outputs than inputs (Cartesian product), avoid DML statements that update or insert single rows
Optimizing Storage BQ
- Use expiration settings-Control Storage Costs and Optimize use of storage space
- Take advantage of long-term storage-lower monthly charges apply for data in tables or partitions that have not been modified in the last 90 days
- Use the Google pricing calculator to estimate the storage costs
Primitive Roles
Granted at the project level, giving access to the project's datasets; individual dataset access settings will override the primitive access. Three types of these roles: Owner, Editor, Viewer
Predefined Roles
grant more granular access, defined at the service level, GCP managed
Custom Roles
User managed
Handling Sensitive Data
Credit card numbers, medical info, SSNs, people's names, and address info can be protected by Cloud Data Loss Prevention (Cloud DLP)
Cloud DLP
- Fully managed service
- Identify and protect sensitive data at scale
- Over 100 predefined detectors to identify patterns, formats, and checksums
- It also de-identifies the data
Encryption in BQ
BQ encrypts data through the Data Encryption Key (DEK)
For highest levels of security, the DEK key needs to be encrypted to form the Wrapped DEK-this is done using the Key Encryption Key (KEK)
Wrapped DEK/DEK stored together
KEK is stored in the Cloud Key Management Service
Monitoring/Alerts in BQ
Alerts should be created when a monitoring metric crosses a specified threshold
BQ uses Stackdriver for monitoring-BQ sends its logs to it
In Stackdriver, you can filter for the BigQuery logs, create dashboards, add charts to the dashboards, and create alerts
Cloud Audit Logs
Collections of logs that are provided by GCP to allow insights to various services
Log Versions
AuditData (old)-maps directly to the individual API calls made against the service
BigQueryAuditMetadata-not strongly coupled to particular API calls; more aligned with the resource itself-it records how the state of BigQuery resources is changed by API calls, services, and tasks
Stackdriver has three different streams: Admin Activity, System Event, and Data Access-streams are just groupings for different types of logs
BQ ML Access
- Web console (UI)
- Bq command line tool
- BQ rest API
- Jupyter notebooks(Cloud Datalab) and other external BI tools
Linear Regression
where you have a number of data points and try to fit a line to those data points
Binary Logistic Regression
There are 2 classes and you assign each example to one of the classes
Multi-class Logistic Regression
We assign each example to one of several classes
K-Means Clustering
We have a number of points and are able to separate them into different clusters-the newest model type in BQ ML
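A hedged sketch of training and evaluating a linear regression model in SQL through the BigQuery Python client; the dataset, model, and source table are placeholders:

```python
from google.cloud import bigquery

client = bigquery.Client()

client.query("""
    CREATE OR REPLACE MODEL `my_dataset.price_model`
    OPTIONS (model_type = 'linear_reg', input_label_cols = ['price']) AS
    SELECT size, rooms, location, price
    FROM `my_dataset.houses`
""").result()

for row in client.query("SELECT * FROM ML.EVALUATE(MODEL `my_dataset.price_model`)").result():
    print(dict(row))
```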
Benefits of BQ ML
- Democratizing ML
- Models trained and evaluated using SQL
- Speed and agility
- Simplicity
- Avoid regulatory restrictions
EXAM TIPS for BQ
Understand good organizational design: consider how different teams should be granted different types of access to BQ and how the decisions affect cost control
Learn the most common IAM roles: learn how to grant access to teams based on needs and how to use authorized views to share data across projects
Consider costs when designing queries: Avoid using SELECT * and previews and price queries before executing them
Partition tables appropriately: partitioned tables can reduce the cost -consider clustering to reduce scans of unnecessary data
Optimize query operations and JOINS
What is Datalab?
What is it?
1. A pre-existing technology, wrapped in some GCP conveniences-Jupyter Notebooks
Jupyter Notebooks
- Interactive web pages that have…
- Documentation
- Code
- Elements which are the results of compiled code
Datalab functions
When typing code, there is a Cloud Datalab VM that has a Python kernel
The kernel can run code and access GCP services like BigQuery or ML Engine
Good way to collaborate and share code
Also a good way to annotate
Has matplotlib, which is great for statistical data and graphs
When saving your work through the notebook, it will be in the GCR repo, which is sent to the persistent disk attached to the Datalab instance
Why Do We Need Datalab?
- Manages instance lifecycle
- Create Datalab VMs in seconds
- Notebooks stored in GCR
- Storage can persist after the instance is destroyed
Intro to Data Studio
Data Sources
Reports and Dashboards
Data sources underneath are Databases or Files
Files-usually CSV files, stored in Cloud Storage
Databases-GCP databases like BigQuery, Cloud SQL, MySQL, Cloud Spanner, PostgreSQL
Google Products-Google analytics, Sheets, Youtube, Ads, Google Marketing Platform
Third Party-Trello, Quickbooks, Facebook Ads
You can Share your dashboards and reports by Viewing or allowing Users to Edit-Like in Google Drive
Chart and Filters in Data Studio
Tables-detailed, heat map inclusion, bar chart inclusion, pagination
Scorecards-KPIs, high level
Pie Chart-proportions, %s or absolute values, doughnut or whole, small amounts of data
Time series-time order, trends, curve filtering, forecasting
Bar charts-categorical, vertical or horizontal, single or stacked, reports and dashboards
Geomaps-geographical data, dashboards
Area charts-composition, cumulative totals, reports and dashboards, can combine with time series
Scatter plot-cartesian plane, typically 2 variables-or more through color or size through the points, dashboards and reports
Filter-allows you to select specific values from a category
Date Range-can specify start and end dates or predefined intervals like last week, last month, current year or quarter
Cloud Composer Overview
Built on Apache Airflow
Google is contributing back to the airflow project
Task orchestration system designed to automate complex interdependent tasks into pipelines or workflows
Each stage of the pipeline is written in code
Each workflow is written in Python
Provides central management and scheduling
Provides an extensive CLI tool and a comprehensive web UI
DAG-Directed Acyclic Graph
A graph consisting of nodes connected by edges; edges define how we travel from one node to another; directed-travel in one direction; acyclic-never circles back, so you can't reach the same node more than once by traveling along the edges
Possible to represent dependencies-between nodes, that must be traversed in a specific order
These nodes represent all the tasks in a workflow, organized in a way that shows their relationships and dependencies
DAGs are written in Python and define the order in which tasks should execute
Inside the tasks, there are operators to specify what is to be done
DAGs can contain parameters for when they should run, what the dependencies are, and who should be notified once the tasks are completed
Cloud Composer manages resources to make sure workflow completes successfully
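A minimal workflow sketch in Airflow 1.x style (as used by Cloud Composer at the time); the DAG id, schedule, and bash commands are placeholders:

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

with DAG(dag_id="example_workflow",
         start_date=datetime(2020, 1, 1),
         schedule_interval="@daily") as dag:

    # Operators specify what each task does.
    extract = BashOperator(task_id="extract", bash_command="echo extracting data")
    load = BashOperator(task_id="load", bash_command="echo loading data")

    extract >> load  # dependency: load runs only after extract succeeds
```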
Composer Architecture
A microservices architecture
Uses multiple GCP resources grouped together into a Cloud Composer environment
Can have more than one environment in a GCP project, but each environment is an isolated installation of Airflow and all of its component parts
Some parts are put in a Tenant Project that you can't see or control-this is where the Airflow database and Airflow web server live; the web server provides its web UI on App Engine Flex, and Cloud Composer configures the Identity-Aware Proxy to control access to the web server
Why use Cloud Composer rather than Dataflow?
Dataflow-Process Batch or Streamed Data-Apache Beam
Cloud Composer-Orchestrate tasks with Python and can use any Python code at any stage of the pipeline-more as a scheduler
Can orchestrate Cloud Composer with Dataflow
Cloud Composer workflow example: Spark Analytics- A workflow runs daily, sets up a Dataproc cluster, performs Spark analytics, writes results to GCS and emails an administrator, and then deletes the Dataproc cluster
Cloud Composer can be used as any scheduled automation task outside of big data
Composer in a GCP Environment
In the GCP project environment, Composer creates a Kubernetes cluster that deploys Redis, the Airflow Scheduler, and the Airflow Workers along with the Cloud SQL Proxy; it also creates 2 Pub/Sub topics for messaging between microservices and a Cloud Storage bucket for logs, plugins, and the DAGs themselves
Cloud Composer configures- Airflow Parameters and Environment Labels, you can also customize some parameters
A DAG can be a single file or multiple files with imports and dependencies-the Python script has the variables, operators, and stages of the workflow tied together with a DAG object and definition
Tasks
A Task is an instance of the Airflow Operator
The scheduler will find any DAG objects that you have defined in the Python scripts uploaded to the GCS bucket; if all dependencies are met, the workflow will be scheduled
When you delete an environment, it won’t clean up all resources it created, but the Tenant Project and the GKE cluster that was spun up will be removed
You will have to manually delete the Pub/Sub topics and the GCS bucket
Advanced Composer Features
Custom Airflow parameters get written to the airflow.cfg file that is used to configure services when Airflow runs for the first time
Not all service settings can be changed-only some can be modified
You can create Environment Variables, which Cloud Composer passes to elements of Airflow like the scheduler, web server, and worker processes
Environment Variables are in a Section which are defined in Key Value Pairs
Airflow Connection: A collection of authentication information which can include hostnames, logins, keys/secret information
Cloud Composer will create connections for BigQuery, Datastore, Cloud Storage, and a generic GCP connection-these have a service account key that authenticates against the GCP API in question
Can make custom connections using the Airflow Web UI, connections to Airflow’s own database use the Cloud SQL Proxy
In the Web UI, you can use the ad hoc query page to run SQL queries against any connected database-some built in visualizations as well
Extending Airflow
- Local Python Environment by adding additional custom libraries
- Airflow Plugins-write own custom operator
- PythonVirtualenvOperator-creates a custom environment for a task in a workflow without installing dependencies across all of your workers; each task gets its own environment with its own libraries (see the sketch after this list)
- KubernetesPodOperator-when you need complete control over an environment, runs a task inside a pod on the Cloud Composer GKE cluster
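A minimal sketch of the PythonVirtualenvOperator approach, assuming Airflow 2.x; the task and library choice (pandas) are examples only:

```python
from airflow.operators.python import PythonVirtualenvOperator

def summarize():
    # pandas is installed only in this task's virtualenv, not on every worker.
    import pandas as pd
    print(pd.DataFrame({"x": [1, 2, 3]}).describe())

# Defined inside a DAG context, e.g. the `with DAG(...)` block shown earlier.
summary_task = PythonVirtualenvOperator(
    task_id="summarize_in_isolated_env",
    python_callable=summarize,
    requirements=["pandas"],
    system_site_packages=False,
)
```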
Google Pre-Trained ML Models
- Cloud Vision API-can detect and label objects within images, facial recognition, can read handwritten info
- Cloud Video Intelligence API-can identify huge numbers of objects, places, and actions which are taking place in videos
- Cloud Translation API-can translate between more than 100 languages
- Cloud Text to Speech API: Convert text to audio/human speaking
- Cloud Speech to Text-convert from audio to text
- Cloud Natural Language API-can perform sentiment analysis, entity analysis, entity sentiment analysis, content analysis, content classification
Reusable Models
Model training is required, but minimal ML knowledge is needed; relatively small datasets suffice for training thanks to transfer learning and neural architecture search-on GCP, the service can search out the best model architecture to solve your problem
Re-use models-Through Cloud AutoML:
when you need something very specific
AutoML-uses transfer learning and allows you to train your own custom models to solve your own specific problems: AutoML Vision, Video Intelligence, Natural Language, Translation, and Tables
Google AI Platform allows you to train, manage, and share your own models; it gives easy access to TensorFlow, TensorFlow Extended (TFX)-an end-to-end platform for deploying machine learning pipelines-TPUs to speed up training, and Kubeflow
ML Pipeline
ML has a Model that contains Rules from the Data that’s used to train it
ML Pipeline
1. Data preparation-raw data becomes processed data
2. Model Training-processed data is split into Training Data and then Testing Data. Training Data gets inputted into the Model, then it is evaluated against the Test Data to determine how well it inferred rules from the training data
3. Operating- Trained Model is used for Real-time Predictions like predicting what people are going to buy, or for Batch Predictions which are normally offline predictions when many predictions are made in a short period of time
Label in ML
Label: Something that is an interest to us like a house price-denote labels with y
Labelled Example: has a label associated with a set of features (house size, number of rooms, location...); house price = $K, written (x1,x2,x3,x4,…) -> y
Unlabelled Examples: have the set of feature values (size, number of rooms, location...) but no label value; house price = ?; for (x1,x2,x3,x4,…) we want to predict the label value, called y prime (y')
Feature in ML
Feature: attribute associated with the label and has a relationship with the label, like the size of a house or the number of rooms in a house, denoted with an x-x1,x2…
Examples: Features together with labels
Measuring Loss and Loss Squared
Measuring Loss: calculating the difference between the predicted value (x7,y7') and the actual value (x7,y7) with the equation y7-y7', so basically the difference between the actual values and the predicted values
Squared Loss: one line goes through the actual points, another line doesn't match the actual points well; you take the differences between the actual and predicted points and calculate Loss=(y1-y1')^2+(y2-y2')^2+… >= 0; you square the differences because otherwise positive and negative errors would cancel out
Mean Squared Error
Similar to the squared-loss equation, but you take the mean: MSE = (1/n) * SUM (yi-yi')^2
Optimization uses gradient descent: for y=wx the loss becomes a function of the weight, MSE(w); the gradient gives the direction in which the loss function increases, so move in the direction where it decreases; keep doing this until you find the w at the minimum, where the loss function has its lowest value; then use that w in the line equation to get the line with the lowest loss; the learning rate determines how far we move toward the minimum on each step
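A small NumPy sketch of this loop for the one-parameter model y = w*x; the data points and learning rate are made-up:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])       # roughly y = 2x

w = 0.0                                   # initial weight
learning_rate = 0.01

for step in range(500):
    y_pred = w * x                        # predicted values y'
    error = y_pred - y                    # y' - y for each example
    mse = np.mean(error ** 2)             # loss: mean of the squared differences
    gradient = np.mean(2 * error * x)     # d(MSE)/dw
    w -= learning_rate * gradient         # step in the direction where loss decreases

print(round(w, 3), round(mse, 3))         # w converges close to 2
```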
Supervised Learning Type
We train the model using data that is labeled, each example has a label and a feature
We have training data with a set of features called x, along with the correct labels, the ys; the features with the correct labels are used to train the model; unseen data, which has only the features, is then presented to the model, and the model uses its training to predict the associated label for each example, giving us y prime
Unsupervised Learning Type
uncover structure within the data set itself
You have a set of input data with no labels-features only-and send it to the model; the model then creates outputs that uncover structure within the input data; you might use this method to uncover personas within customer data
Reinforcement Learning
common in game playing and other types of ML
You have an agent, which is basically the model; the agent interacts with an environment (the chess board in a game of chess); the state of the environment, state(t), is fed into the model at a particular moment in time t; the agent proposes an action at time t that affects the environment (move a chess piece from one square to another); the state then changes to state(t+1); the agent receives a reward, reward(t), according to how well the action it took at time t affected the environment in terms of the outcome it wants to achieve
Regression
predict real number (y’)
There is a regression model; features are inputted into the model, and the model associates the set of features with a predicted value y'
Classification
predict class from specified set{A,B,C,D} with probability
There is a classification model, it will take a set of features, then it will associate it with a particular class and an associated probability, ex: (X1,X2,…,Xn,A,0.9817)
Clustering
group elements into clusters or groups
There is a clustering model, you input an element set, then you associate the element set with a cluster in the output
Transfer Learning
Train a model using images to classify them into specific categories; you can then copy the model and turn it into a new classification model with new categories, training it on new images; for this new model, usually far fewer training images are needed
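A hedged tf.keras sketch of this idea: reuse a pre-trained image model, freeze its weights, and train only a new classification head; the base model choice, input size, and 5-class head are arbitrary examples:

```python
import tensorflow as tf

# Pre-trained feature extractor (weights learned on ImageNet), with its top removed.
base = tf.keras.applications.MobileNetV2(
    input_shape=(160, 160, 3), include_top=False, weights="imagenet")
base.trainable = False                    # freeze: keep the learned features

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(5, activation="softmax"),   # new categories
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(new_images, new_labels, epochs=5)   # far fewer images usually needed
```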
Underfitting
Like a Line, doesn’t capture underlying structure
Balanced
A curve that fits the data very well while remaining simple, e.g. a parabola
Overfitting
Fits the training data extremely well, better than the other two cases, but for a new data point there can be a very large distance to the curve; the problem is it doesn't generalize to new data
L1 Regularization
L1 Regularization term: |W1|+|W2|+…+|Wn|-take the sum of the absolute values of all the weights
Penalizes |weight|; drives the weights of non-contributing features towards 0, producing sparse weights
L2 Regularization
W1^2+W2^2+…+Wn^2-take sum of squares of all the weights
Penalizes weight squared, drives all weights towards 0-simpler model
Avoid Overfitting
- Regularization
- Increase Training Data
- Feature Selection
- Early Stopping-don’t allow data to train for too long, only a certain amount of iterations
- Cross Validation-take training data and split it into much smaller training sets, smaller sets known as folds
- Dropout Layers-where weights are most likely all set to 0
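A minimal tf.keras sketch combining several of these tools (L2 weight regularization, a dropout layer, and early stopping); the layer sizes and rates are arbitrary:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(
        64, activation="relu",
        kernel_regularizer=tf.keras.regularizers.l2(0.01)),  # penalizes weight^2
    tf.keras.layers.Dropout(0.3),   # randomly zeroes 30% of activations during training
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=3)
# model.fit(train_x, train_y, validation_split=0.2, epochs=100, callbacks=[early_stop])
```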
Hyperparameters
Hyperparameters: values that need to be selected before the training process can begin
Hyperparameter examples: Batch size, training epochs, number of hidden layers in network-model, regularization type-l1 or l2, regularization rate, learning rate
Hyperparameter Characteristics
1. Selection: Hyperparameter values need to be specified before training begins
2. Model hyperparameters relate directly to the model that is selected; algorithm hyperparameters relate to the training of the model
3. Training and Tuning: the process of finding the optimal or near optimal values for hyperparameters
Feature Engineering
One-hot Encoding (categorical data): transforms categorical data from a fixed set of values into numeric values-you create a column for each category and use a binary 1 or 0 to indicate which category each example belongs to
Linear Scaling: transforms values from one range into another, e.g. -1 to 1 or 0 to 1
Z-Score: values clustered around the mean (mu) are transformed to cluster around 0, using X'=(X-mu)/sigma
Log Scaling: when a small number of values have many points while the vast majority have few, transform with X'=log(X)
Bucketing: for distributions where the data points are related but the relationship is not linear, create defined ranges; data points within a range are mapped to a single value
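A small pandas/NumPy sketch of these transformations; the column names and values are illustrative only:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "red"],
                   "size_sqft": [900.0, 2500.0, 40000.0]})

one_hot = pd.get_dummies(df["color"])                          # one-hot encoding
linear = (df["size_sqft"] - df["size_sqft"].min()) / (
    df["size_sqft"].max() - df["size_sqft"].min())             # linear scaling to 0..1
z_score = (df["size_sqft"] - df["size_sqft"].mean()) / df["size_sqft"].std()
log_scaled = np.log(df["size_sqft"])                           # log scaling
buckets = pd.cut(df["size_sqft"], bins=[0, 1000, 5000, np.inf],
                 labels=["small", "medium", "large"])          # bucketing into ranges
```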
TensorFlow
- Google’s open source, end to end, ML framework-good for machine learning
- Compatible with a wide range of hardware and devices-models can be trained and deployed across CPUs, GPUs, and TPUs
- TensorFlow Lite-for deploying models to mobile and embedded devices
- TensorFlow.js-JavaScript library for training and deploying models in the browser and on Node.js
- TensorFlow Extended (TFX)-for deploying ML pipelines
Keras
- Open Source neural network library
- Made in Python
- Runs on top of other ML frameworks
- High level API for fast experimentation-deep neural networks, easy to use extensible
- CPUs and GPUs
- tf.keras-TensorFlow's implementation of the Keras API specification; build and train models with first-class support for TensorFlow-specific functionality-more flexible and makes TensorFlow easier to use (see the sketch below)
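A minimal tf.keras sketch: build and compile a small fully connected classifier; the layer sizes and the 10-class output are arbitrary examples:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),  # hidden layer
    tf.keras.layers.Dense(10, activation="softmax"),  # one output neuron per class
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_images, train_labels, epochs=5)
```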
Google Colab
- Free cloud service based on Jupyter Notebooks; notebooks are stored in Google Drive
- Provides free GPU support
- Supports some BASH commands
- Includes pre-installed python libraries
Neural Network Layers
Input Layer: allows us to feed data into the model
Output Layer: represent the way we want the neural network to provide answers, we have a neuron available for each of the classes
There are Hidden Layers between the Input and Output Layers, the number of hidden layers will vary based on the type of problem you are trying to solve-model hyperparameter
Input=L0 Hidden=L1,L2,L3,L4,L5 Output=L6
Each layer has a specified number of neurons/nodes
For image inputs, the pixels of an image form a matrix that is fed into the input layer
Fully Connected Layer
where every neuron of 1 layer is connected to every neuron in the following layer
Partially Connected Layer
where every neuron in one layer is not connected to every neuron in the adjacent layer
Neurons
- Will have an input vector X and an output value Y
- Has a weight vector; the number of weights in the weight vector corresponds to the number of inputs in the input vector
- X1*W1-each weight is multiplied by the corresponding input value
- Then the values are summed-(X1W1)+(X2W2)
- A function is applied to these sum values-f((X1W1)+(X2W2)) which gives us our output y
- The function is called the Activation function
- Activation functions: Rectified Linear Unit (ReLU): f(x)=max(0,x), Sigmoid/Logistic, Hyperbolic tangent (tanh)
Logits: the raw output values of the final layer, before being converted to probabilities
Softmax function: converts Logits to probabilities-SUM(p)=1
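A NumPy sketch of a single neuron's computation and the softmax function described above; the inputs, weights, and logits are made-up numbers:

```python
import numpy as np

x = np.array([0.5, -1.2, 3.0])     # input vector X
w = np.array([0.8, 0.1, -0.4])     # weight vector, one weight per input

z = np.dot(w, x)                   # weighted sum: (X1*W1) + (X2*W2) + (X3*W3)
y = max(0.0, z)                    # ReLU activation: f(x) = max(0, x)

def softmax(logits):
    e = np.exp(logits - np.max(logits))   # shift for numerical stability
    return e / e.sum()                    # probabilities that sum to 1

print(y, softmax(np.array([2.0, 1.0, 0.1])))
```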
Feed Forward Neural Network
- Simplest and most common type of deep neural network
- Applications in computer vision and NLP
- First type of deep neural network to be used
- Information flows in one direction only(input to output)
Recurrent Neural Networks
- Flow of information forms cycles/loops
- Directed cycles
- RNNs can be difficult to train
- RNNs are dynamic with their state continuously changing
- There are cycles
Convolutional Neural Network
- Neural network commonly used for visual learning tasks
- Common uses: Image and video recognition, image classification, natural language processing
- Have an input, output, and many hidden layers
- Convolutional layers are paired with pooling layers
- The convolutional layers apply small filters that pick out certain features of an image
Pooling Layer
- Simplifies (downsamples) inputs
- Usually succeed a non-linear activation function
- There is Average Pooling-Calculated average value
- There is Max Pooling-Calculates the maximum value
- Strides move along the pixels in the image to calculate results for the output
GANS
- GANs are deep neural networks composed of two opposing neural networks-a generator network and a discriminator network-hence "adversarial"
- Allows for the creation of things like: Images, music, speech, poetry, deep fakes(face imitation)
- The generator network outputs generated images, which go into the discriminator network; the discriminator predicts whether each image is real or fake and sends that feedback back to the generator
Vision APIs
Optical Character Recognition (OCR): can detect text in images; there is text detection (detects text, returns it as JSON organized into blocks, and can also read handwriting) and document text detection (also reads text from images, but is intended for dense text in documents/files)
Cropping Hints: Gives a cropping suggestion on how to crop the image and gives the vertices to crop it
Face Detection: Can detect faces and facial features
Image Property Detection: Sees dominant colors, these colors can group images/object or be used for recommendations
Label Detection: Can detect object, locations, activities, animal species, products, and much more
Landmark Detection: Detects landmark in an image, gives the landmark name, bounding polygon vertices, and the location using latitude and longitude
There is also Logo Detection
Explicit Content Detection-safe search detects if the image is adult, spoof, medical, violence, and racy
Web Entity and Page Detection-where the image is being used: links to web pages, similar images, best-guess labels
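A hedged sketch using the google-cloud-vision Python client for label and text detection; the bucket path is a placeholder and credentials are assumed to be configured already:

```python
from google.cloud import vision

client = vision.ImageAnnotatorClient()
image = vision.Image(source=vision.ImageSource(image_uri="gs://my-bucket/photo.jpg"))

labels = client.label_detection(image=image)          # label detection
for label in labels.label_annotations:
    print(label.description, label.score)

text = client.text_detection(image=image)             # OCR / text detection
if text.text_annotations:
    print(text.text_annotations[0].description)       # full detected text block
```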
Video Intelligence APIs
Detect Labels: annotates videos where entities are detected, list of video segments, frame annotations and shots
Shot change detection: annotates videos based on shots or scenes, entities associated with specific scenes
Detect explicit content: detect adult content, annotates explicit content, and timestamps where detected
Transcribe Speech: transcribes spoken words, profanity filtering, transcription hints, automatic punctuation, handling multiple speakers
Track objects: track multiple objects, provides location of each object with frames, bounding boxes for each object, time segments with offset
Detect Text: OCR on text occurring within videos, text and location of the text
Google Knowledge graph Search API allows you to do searches on Google’s Knowledge Graph-all entities and relationships between them, each object has an identifier
1. Getting ranked list of most notable entities that match criteria
2. Predictively completing entities within a search box
3. Annotating or organizing content using the knowledge graph entities
Natural Language API
Looks at patterns within language/text, uses sentiment analysis, entity analysis, syntax analysis, entity-sentiment analysis, and content analysis
Sentiment Analysis
- Score: indicates the overall emotion, ranging from -1 (negative) to 1 (positive)
- Magnitude: how much emotional content there is; ranges from 0 to infinity and is not normalized (proportional to the length of the document being assessed)-see the sketch below
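A hedged sketch using the google-cloud-language Python client to get the score and magnitude for a piece of text; the text is a placeholder and credentials are assumed to be configured:

```python
from google.cloud import language_v1

client = language_v1.LanguageServiceClient()
document = language_v1.Document(
    content="I love this product, but shipping was slow.",
    type_=language_v1.Document.Type.PLAIN_TEXT,
)

response = client.analyze_sentiment(request={"document": document})
sentiment = response.document_sentiment
print(sentiment.score, sentiment.magnitude)   # score in [-1, 1], magnitude >= 0
```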
Entity Analysis
- Identifies entities within text
- Provides information on the identified entities
- Entities are nouns/things-Proper Nouns(Specific like Albert Einstein) and Common Nouns(mug=any mug)
Entity-Sentiment Analysis
- Combines Entity and Sentiment Analysis
- Tries to determine sentiment expressed towards each of the identified entities
- Provides a numerical score and magnitude for each identified entity
Syntax Analysis and Content Classification
Syntax Analysis: breaks streams of text into sentences, then tokenizes them into tokens; the sentences and tokens are used to determine grammatical information
Content Classification: API will return categories that are most specific to the source text
Dialogflow
- Natural Language interaction platform
- Works with mobile and web apps, devices, and bots
- Analyzes text or audio inputs
- Responds to using text or speech
- Intents: categorize an end user's intention so the agent can understand and respond-similar to a classification
- Intents have different training phrases that are mapped onto an intent, then we can have extracted parameters
- Each parameter has a type called an entity type
- The end user gives an input phrase, which goes to an agent; the agent determines the intent through intent classification; from the intent, parameters are extracted, which can trigger an action; the action produces a response that goes back to the end user
Cloud Speech to Text: works with audio files and audio streams (see the sketch after this list)
- Synchronous Recognition: Returns result after all input audio has been processed
- Asynchronous Recognition: Initiates long running operation
- Streaming Recognition: Audio data is provided within a gRPC bi-directional stream
- Models: video, phone call, command and search, default
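A hedged sketch of synchronous recognition with the google-cloud-speech Python client; the GCS URI, encoding, and sample rate are placeholders:

```python
from google.cloud import speech

client = speech.SpeechClient()
audio = speech.RecognitionAudio(uri="gs://my-bucket/audio.flac")
config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.FLAC,
    sample_rate_hertz=16000,
    language_code="en-US",
)

response = client.recognize(config=config, audio=audio)   # returns after processing
for result in response.results:
    print(result.alternatives[0].transcript)
```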
Text-to-Speech API
- Uses text files or Speech Synthesis Markup Language(SSML)-allows you to control the way text is converted to speech
- SSML: Pauses, play sounds, speak cardinals, speak ordinals, speak characters, phrase substitution
AutoML
Suite of ML Products
Facilitates the training of custom ML models
Highly performant
Speed of Delivery
Human labelling service
You give AutoML a problem and the result you want to predict; it finds a suitable neural network using neural architecture search
When it gets a neural network that works from its bank of existing networks, it applies transfer learning so the model can be trained to handle your novel data
AutoML Process
- Prepare and manage images-label training images, create a dataset (single-label or multi-label)
- Training models-requires prepared dataset, may take a few hours to complete, training creates a new model
- Evaluating models-after training, evaluate on a test set; provides aggregated and detailed information (area under curve, confidence threshold curve, confusion matrix)
- Deploying models-deploy before making online predictions
- Making predictions-individual or bulk
- Undeploying models-after successful predictions or once you have a better model; there are cost implications of not undeploying models
Vision Edge
- Export custom trained models
- Models optimized for edge devices
- TensorFlow Lite, Core ML, container export formats
- Edge TPUs, ARM and NVIDIA
- AutoML Vision Edge in ML Kit
AutoML Translation
AutoML Translation-lets you train a domain-specific model, for example to translate English phrases to French
AutoML Translation Training-you use source target pairs, source sentences are in the source language, target sentences are in the target language
Translation Considerations
1. Data Coverage-include examples of the vocabulary, usage, and grammatical peculiarities that are specific to your domain; the model needs to be exposed to the language in some form
2. Human Involvement-people who understand both languages should be involved
3. Data Quality- This is VERY IMPORTANT for translation training, source and target documents need to align
AutoML Natural Language
- Create custom models to classify content into custom categories you define
- When pre-defined categories are insufficient
- When you want to categorize content from free text
- You want to create own categories for categorization
AutoML Table Capabilities
- Data Support: AutoML Tables provide info on missing data, AutoML tables provide correlations, cardinality, and distributions for each feature
- Automatic Feature Engineering: Normalizes and bucketizes numerical features, creates one-hot encoding and embeddings for categorical features, performs basic text processing for text fields, extract time and date features from timestamp features
- Model Training: Parallel testing of multiple model types like Linear and Feed forward deep neural network, selects best model for predictions
AutoML vs BQ
AutoML Tables vs BigQuery ML
1. BigQuery ML: rapid iteration, when you are still deciding on the features to include (see the sketch below)
2. AutoML: Optimizing model quality, have time available for model optimization, multifarious input features
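For the BigQuery ML side, a hedged sketch of training a model with a CREATE MODEL statement issued through the Python client; the dataset, table, and column names are placeholders:

```python
from google.cloud import bigquery

client = bigquery.Client()
query = """
CREATE OR REPLACE MODEL `my_dataset.price_model`
OPTIONS(model_type='linear_reg', input_label_cols=['price']) AS
SELECT size_sqft, num_rooms, price
FROM `my_dataset.house_sales`
"""
client.query(query).result()   # waits for the training job to finish
```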
Kubeflow
- ML Toolkit for Kubernetes
- Data modeling with Jupyter Notebooks
- Tuning and training with TensorFlow
- Model serving and monitoring
Production Phase: Transform Data with pipelines, Train Model with (MPI, MXNET, PyTorch), Serve Model using (TFServing,KFServing,NVIDIA TensorRT), Monitor the model using(TensorBoard, Metadata)
Pipeline: Description of a ML workflow, including all of the components in the workflow and how the components relate to each other in the form of a graph
Pipeline Component: self-contained set of user code, packaged as a container, that performs one step in the pipeline
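A hedged sketch of a two-step pipeline, assuming the KFP v1 SDK; the image names and commands are placeholders:

```python
import kfp
from kfp import dsl

@dsl.pipeline(name="example-pipeline", description="Transform data, then train a model.")
def example_pipeline():
    transform = dsl.ContainerOp(
        name="transform-data",
        image="python:3.9",
        command=["python", "-c", "print('transforming data')"],
    )
    train = dsl.ContainerOp(
        name="train-model",
        image="python:3.9",
        command=["python", "-c", "print('training model')"],
    )
    train.after(transform)   # edge in the workflow graph

kfp.compiler.Compiler().compile(example_pipeline, "example_pipeline.yaml")
```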
AI Platform
Ingest Data: Cloud Storage, Cloud Storage Transfer Service
Prepare and Preprocess Data: Cloud Dataflow, Cloud Dataproc, BigQuery, Cloud Dataprep
AI Platform data labeling service can label training data by applying classification, object detection, and entity extraction
Develop and Train Models: Deep Learning VM, AI Platform Notebooks, AI Platform Training, KubeFlow
Test and Deploy Models: TensorFlow Extended, AI Platform Prediction, Kubeflow
Discovery: Google AI Hub
Quick and Ready to Go ML
IAM Best Practices
The principle of least privilege: use predefined roles specific to each GCP product or service
Each stack should have its own boundary
Policies applied to a parent object will be inherited by a child object
Use Groups in G-Suite or Cloud Identity, grant roles to groups and not individual users
Service Accounts
Service Accounts: Special type of Google account designed to represent non-human users
1. Virtual Machines-act via SAs that determine the services they can access
2. Programmatic access-should always be achieved using a SA
3. IAM Roles-assigned to SAs in exactly the same way as for human user accounts
Human User Accounts
Human User Accounts: Passwords and multi factor authentication
Service Accounts have Keys that can be downloaded in JSON format
A key can be used by an application to authenticate against Google APIs; keys should be rotated and carefully protected
Cloud IAM API: requests can use OAuth 2.0, OpenID Connect, or JWTs
Service Account User role-allows a user to impersonate the SA and access everything the SA's IAM policies allow
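A hedged sketch of using a downloaded JSON key to authenticate a client library call; the key filename, project, and bucket are placeholders (in production, prefer attached service accounts over downloaded keys):

```python
from google.cloud import storage
from google.oauth2 import service_account

creds = service_account.Credentials.from_service_account_file("sa-key.json")
client = storage.Client(project="my-project", credentials=creds)

for blob in client.list_blobs("my-bucket"):   # works for any API the SA's roles permit
    print(blob.name)
```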
Data Security
Offers encryption in flight and at rest
Can use Cloud Key Management Service (KMS) if you want to manage your own encryption keys
Keys can be grouped together in key rings and can be used in multiple GCP services
Limit Blast Radius
VPC Service Controls-define security perimeter, only access services inside the perimeter
GCP Security Command Center: Asset Management Features, Web Security Scanner, Anomaly Detection, Threat Detection-internal
Data Privacy
Should people have access to all the data?
What is the data? Do we need this to complete the task? Are we allowed to store it?-PII Personally Identifiable Information-personal information to identify a specific individual
GCP Cloud Data Loss Prevention: works on text and images; supports pseudonymization (replacing identifiers with dummy data) and risk analysis
DLP API can become expensive
Industry Regulation
FEDRAMP: Department of Defense, Homeland Security, USA-how data is used and stored securely by cloud vendors, high compliance for most GCP Services
Children’s Online Privacy Protection Act: Use of PII data for children under age of 13, incorporate parent consent, clear private policy, justification for data collection
HIPAA:Protects Personal Health information, acceptance of business associate agreement
PCI DSS: GCP is certified compliant (the underlying infrastructure is secure enough), but your applications must also be compliant
GDPR: Europe, protects personal data of EU citizens, any region that interacts with EU
Dataprep Overview
Explore, clean, and prepare data
Visually define transformations
Export to Cloud Dataflow
Integrated partner service from Trifacta-links to GCP project and datasets
Flows
Top-level container for bringing together and organizing datasets, recipes, and transformations in one place
Datasets
Collections of data used within a Dataprep flow-you can import datasets from your local machine, Cloud Storage, or BigQuery
Recipes
Like an instruction manual, series of steps that perform a series of transformations on the datasets-create new data sets
For recipes, there are visual controls to define these transformations
You can also see a visual preview
Can use the Automator to execute certain recipes at certain times
Flows are then executed as Cloud Dataflow jobs
Cloud Storage
Unstructured object storage; regional, dual-region, or multi-region; Standard, Nearline, or Coldline storage classes; storage event triggers
Fully managed object storage-can store images, access via API and SDKs, has multiple storage classes plus lifecycle management for objects and buckets, very secure and durable
Cloud Bigtable Def
Petabyte-scale NoSQL database, High-throughput and scalability-Wide column key/value data, Time-series, transactional, IoT data
BQ Def
Petabyte scale data warehouse, Fast SQL queries across large datasets, Foundations for BI and AI, and has Public Datasets
Cloud SQL
Managed MySQL + PostgreSQL + SQL Server instances, built-in backups, replicas, and failover, vertically scalable
Cloud Spanner
Global SQL-based relational database, horizontal scalability and high availability, strong consistency, good for financial sector
Cloud Firestore
Fully managed NoSQL document database, large collections of small JSON documents, provides real-time database SDKs
Cloud Memorystore
Managed Redis instances, in-memory DB cache or message broker, built in high availability, vertically scalable
Data Modeling
Structured data has a consistent model; a model may already be in place, and the data could require prep or transformation. 3 stages: Conceptual-what are the entities and relationships? Logical-what are the structures of the entities, and can the model be normalized? Physical-how will I implement this in the database, and what keys or indexes do I need?
Relational Data vs Non Relational Data
Relational-good schema design, normalization to reduce waste, accuracy and integrity via data types and tables
Non-relational-simple key/value store or document store, high volume columnar database-NoSQL
A pipeline could, for example, move data from Bigtable to BigQuery
Bucket
Store objects/files; bucket names are globally unique; buckets exist within projects; regional (lowest cost), dual-region, and multi-region buckets (the latter two are geo-redundant)
Storage classes: Standard $0.02 per GB; Nearline $0.01 per GB with 30-day minimum storage and a data retrieval fee; Coldline $0.004 per GB with 90-day minimum storage; Archive $0.0012 per GB with at-least-a-year minimum storage and a data retrieval fee
GCS Info
Objects are stored as opaque data; objects are immutable, overwrites are atomic, and objects can be versioned
Access through the Google Cloud Console, HTTP API, SDKs, and the gsutil command in the terminal; supports parallel uploads, transcoding, integrity checking, and requester pays
GCS costs: operation charges (Class A, e.g. uploads, is more expensive than Class B, e.g. downloads), network charges such as retrieving data from a bucket, and data retrieval charges. You can apply lifecycle rules to a bucket; access control uses IAM for buckets, ACLs for granular access, or signed policy documents-IAM policies have members and roles
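A minimal sketch of common bucket operations with the google-cloud-storage Python client; the bucket and object names are placeholders:

```python
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-unique-bucket-name")

blob = bucket.blob("reports/2024/summary.csv")
blob.upload_from_filename("summary.csv")        # upload (a Class A operation)
blob.download_to_filename("summary_copy.csv")   # download (a Class B operation)

for obj in bucket.list_blobs(prefix="reports/"):
    print(obj.name)
```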
Service Accounts Best Practices
IAM for bulk access to buckets, roles assigned to members, ACLs for granular access to buckets, ACLs grant permissions to a scope, IAM is more recommended
Data Transfer Service
Transfers data into Cloud Storage from a source to a sink; sources include HTTP/HTTPS endpoints, Amazon S3, and other Cloud Storage buckets; can filter by names and dates, schedule transfers, and optionally delete objects in the destination or source bucket
Full access: storagetransfer.admin; Submit transfers: storagetransfer.user; List jobs and operations: storagetransfer.viewer
BQ Transfer Service
Automates data transfer to BigQuery; data is loaded on a regular basis; backfills can recover gaps; sources include Google marketing platforms, with some sources in beta
Google Transfer Appliance
For very large amounts of data; a physical storage device shipped to you and attached to your server (terabyte-scale versions); data is encrypted, so security is guaranteed
Human Accounts Cont
Users are human users who authenticate with their own credentials; should not be used for non-human operations because passwords could leak
Service Accounts Cont
Created for a specific non-human task with granular authorization; the identity can be assumed by an application; keys can be easily rotated; there are Google-managed and user-managed service account keys; SAs are managed by keys, and user-managed keys are downloadable JSON files-very powerful, so protect them