Data Engineer Flashcards by Kevin McKenzie

Pub/Sub delivery: what order and delivery guarantees?

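Pub/Sub delivers each message at least once to every subscription that EXISTS when the message is published. Ordering is not guaranteed by default. Unacknowledged messages are held for 7 days (the default retention) before being dropped.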
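
Because delivery is at least once, a subscriber can receive the same message more than once, so handlers are usually written to tolerate duplicates. The following is a minimal plain-Python sketch of deduplicating by message ID. It does not use the Pub/Sub client library; the `process_once` function and the `(message_id, payload)` tuples are hypothetical stand-ins for illustration.

```python
# Plain-Python simulation of duplicate deliveries; NOT the Pub/Sub client API.
def process_once(messages):
    """Keep each payload at most once, keyed by its message ID."""
    seen_ids = set()
    processed = []
    for message_id, payload in messages:
        if message_id in seen_ids:
            continue  # duplicate delivery of an already-handled message
        seen_ids.add(message_id)
        processed.append(payload)
    return processed


# "m1" is delivered twice, as at-least-once delivery allows.
deliveries = [("m1", "create"), ("m2", "update"), ("m1", "create")]
result = process_once(deliveries)
print(result)  # ['create', 'update']
```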

What are the two accumulation modes for Dataflow?

accumulatingFiredPanes - Each firing outputs every result collected so far in the window. Elements are therefore output multiple times if the trigger fires more than once in the window.
discardingFiredPanes - Each firing outputs only the results that arrived since the previous firing in the window.
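
The difference between the two modes is easiest to see on a single window that fires several times. This is a plain-Python simulation, not the Beam/Dataflow API; the `fire_panes` function and its `arrivals_per_trigger` argument are hypothetical names used only for this sketch.

```python
def fire_panes(arrivals_per_trigger, accumulating):
    """Return the pane emitted at each trigger firing within one window.

    arrivals_per_trigger: list of lists, the elements that arrived
    between consecutive firings.
    """
    panes = []
    held = []
    for new_elements in arrivals_per_trigger:
        held.extend(new_elements)
        panes.append(list(held))
        if not accumulating:
            held.clear()  # discarding mode: forget what was already emitted
    return panes


arrivals = [[5, 8], [3], [15]]
accumulating_panes = fire_panes(arrivals, accumulating=True)
discarding_panes = fire_panes(arrivals, accumulating=False)
print(accumulating_panes)  # [[5, 8], [5, 8, 3], [5, 8, 3, 15]]
print(discarding_panes)    # [[5, 8], [3], [15]]
```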

Watermark

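The system's estimate of how complete the input is in event time, which shows how far behind processing is. It can be guaranteed when the data comes from Pub/Sub; otherwise it is a heuristic based on the skew between event time and processing time observed over time. Watermarks determine WHEN in processing time results are emitted.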

Windowing

Determines where in event time results are calculated. Windowing subdivides a PCollection according to the timestamps of its individual elements. Dataflow transforms that aggregate multiple elements, such as GroupByKey and Combine, work implicitly on a per-window basis—that is, they process each PCollection as a succession of multiple, finite windows, though the entire collection itself may be of unlimited or infinite size.
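
As a rough illustration of per-window aggregation, the sketch below assigns each element to a fixed event-time window and sums per window. It is plain Python rather than the Beam API; `fixed_window_sums`, the 60-second window size, and the sample events are assumptions made for the example.

```python
from collections import defaultdict


def fixed_window_sums(events, window_size):
    """Sum values per fixed window [start, start + window_size) in event time."""
    sums = defaultdict(int)
    for timestamp, value in events:
        window_start = (timestamp // window_size) * window_size
        sums[window_start] += value
    return dict(sums)


# (event-time seconds, value); note that 59 arrives after 61 but still
# lands in the first window because windowing uses the element timestamp.
events = [(3, 1), (61, 2), (59, 4), (125, 8)]
window_sums = fixed_window_sums(events, window_size=60)
print(window_sums)  # {0: 5, 60: 2, 120: 8}
```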

Triggering

Triggering determines when to “close” each finite window and emit its results as unbounded data arrives. Using a trigger can help to refine the windowing strategy for your PCollection to deal with late-arriving data or to provide early results. The default trigger fires when the watermark passes the end of the window.
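
A simplified model of the default behavior: the window fires once when the watermark passes its end, and anything that arrives afterwards counts as late. This is a plain-Python sketch, not the Beam API; `default_trigger` and the integer watermark values are hypothetical.

```python
def default_trigger(window_end, arrivals):
    """Split one window's elements into the on-time pane and late data.

    arrivals: (watermark_when_element_arrived, value) pairs in
    processing order.
    """
    on_time_pane, late = [], []
    for watermark, value in arrivals:
        if watermark < window_end:
            on_time_pane.append(value)  # arrives before the window fires
        else:
            late.append(value)  # watermark already passed the window end
    return on_time_pane, late


arrivals = [(10, "a"), (40, "b"), (65, "c")]
on_time_pane, late = default_trigger(window_end=60, arrivals=arrivals)
print(on_time_pane, late)  # ['a', 'b'] ['c']
```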

Side input

A side input is an additional input that your DoFn can access each time it processes an element in the input PCollection. When you specify a side input, you create a view of some other data that can be read from within the ParDo transform’s DoFn while processing each element. A side input can be a value computed by a separate branch of your pipeline.
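
The idea can be sketched without Beam: one branch computes a single value, and the per-element step reads that value while processing each element. The names `above_average` and `average_length` are hypothetical; in a real pipeline the value would be passed to the ParDo as a side input view.

```python
def above_average(words, average_length):
    """Per-element step; average_length plays the role of the side input."""
    return [word for word in words if len(word) > average_length]


words = ["a", "bb", "cccc", "ddddd"]
# Separate "branch" of the pipeline that computes the side value.
average_length = sum(len(word) for word in words) / len(words)
long_words = above_average(words, average_length)
print(long_words)  # ['cccc', 'ddddd']
```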

Gradient descent

Iterative optimization method used to find the best parameters by repeatedly adjusting them in the direction that reduces the error
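
A minimal sketch of the loop, fitting a single weight `w` so that `y ≈ w * x` under mean squared error. The function name, learning rate, step count, and data are assumptions chosen for the example.

```python
def fit_slope(xs, ys, learning_rate=0.01, steps=500):
    """Fit y ≈ w * x by gradient descent on mean squared error."""
    w = 0.0
    n = len(xs)
    for _ in range(steps):
        # d/dw of mean((w*x - y)^2) = mean(2 * (w*x - y) * x)
        gradient = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        w -= learning_rate * gradient  # step against the gradient
    return w


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # generated with a true slope of 2
w = fit_slope(xs, ys)
print(round(w, 3))  # 2.0
```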

Epoch

One full traversal through the entire training dataset

Softmax

Function used for multi-class classification: it turns a vector of scores into probabilities that add up to 1, suppressing lower inputs and boosting the max
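
A small standard-library sketch of the computation. Subtracting the maximum score before exponentiating is a common numerical-stability step and does not change the result.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(scores)
    exps = [math.exp(score - highest) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


probs = softmax([1.0, 2.0, 3.0])
print([round(p, 3) for p in probs])
```

The largest input receives the largest share of the probability mass, and equal inputs receive equal shares.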

Neuron

one unit of combining inputs (weighted input + activation function)
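
As a sketch, a single neuron can be written as a weighted sum passed through an activation function. The sigmoid activation and the `bias` term are assumptions here; the card itself only names the weighted input and the activation.

```python
import math


def neuron(inputs, weights, bias=0.0):
    """Weighted sum of inputs followed by a sigmoid activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-z))


output = neuron(inputs=[0.5, -1.0], weights=[2.0, 1.0])
print(output)  # 0.5, because the weighted sum is 2*0.5 + 1*(-1) = 0
```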

Hidden Layer

an additional layer of neuron(s), between the inputs and the output, that combines the outputs from the previous layer

Inputs

data taken into the neuron

Features

transformations of inputs, such as x^2

Feature engineering

Determining the correct features to use so that the machine learning model performs better

Accuracy

# correct predictions / # total predictions, i.e. (TP + TN) / (TP + TN + FP + FN)

Precision

TP / (TP + FP)

Recall

TP / (TP + FN)
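
Accuracy, precision, and recall can all be computed from the four confusion-matrix counts. The sketch below uses made-up counts; `classification_metrics` is a hypothetical helper name.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision and recall from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }


# Made-up counts: 8 true positives, 2 false positives,
# 4 false negatives, 6 true negatives.
metrics = classification_metrics(tp=8, fp=2, fn=4, tn=6)
print(metrics)
```

With these counts accuracy is 14/20, precision is 8/10, and recall is 8/12, which shows that the three metrics can differ noticeably on the same predictions.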

How does TensorFlow evaluate?

Lazy evaluation (TensorFlow 1.x graph mode) - you build the graph first and need to run it to get results. TensorFlow 2.x evaluates eagerly by default.

Estimator API models

LinearRegressor - linear regression
LinearClassifier - linear classification
DNNRegressor - Deep Neural Network regression
DNNClassifier - Deep Neural Network classifier

Logistic vs Linear Regression

Linear regression outputs a continuous value (how much will this house sell for). Logistic regression outputs a probability between 0 and 1 that is used for a binary decision (will this house sell for more than 500k).

How to encode categorical data

Use one-hot encoding if the vocabulary is the same at prediction time as at training time. Cold start is when you can’t make predictions for new values that are not in the vocabulary. If you don’t have the vocabulary of all possible values, use a hash bucket.
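
The two encodings can be sketched in plain Python. This is an illustration, not the TensorFlow feature-column API; `one_hot`, `hash_bucket`, the SHA-256 hash, and the bucket count of 10 are assumptions made for the example.

```python
import hashlib


def one_hot(value, vocabulary):
    """Return a vector with a 1 at the position of value in the vocabulary."""
    return [1 if value == term else 0 for term in vocabulary]


def hash_bucket(value, num_buckets):
    """Map any string to a stable bucket index without a vocabulary."""
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets


vocabulary = ["red", "green", "blue"]
encoded = one_hot("green", vocabulary)
unknown = one_hot("purple", vocabulary)  # cold start: no position for it
bucket = hash_bucket("purple", num_buckets=10)
print(encoded)  # [0, 1, 0]
print(unknown)  # [0, 0, 0]
```

The hash bucket always yields an index, even for unseen values, at the cost of occasional collisions between unrelated values.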

When to bucketize

Bucketize floats (continuous values) into ranges so that nearby values share a category instead of all being different
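
A minimal sketch of bucketizing with the standard library. The age boundaries are made up for the example, and `bucketize` is a hypothetical helper, not a TensorFlow call.

```python
import bisect


def bucketize(value, boundaries):
    """Return the index of the bucket that value falls into.

    boundaries must be sorted; a value equal to a boundary goes
    into the bucket that starts at that boundary.
    """
    return bisect.bisect_right(boundaries, value)


boundaries = [18.0, 35.0, 65.0]  # buckets: <18, [18, 35), [35, 65), >=65
buckets = [bucketize(age, boundaries) for age in [17.9, 18.0, 34.5, 70.2]]
print(buckets)  # [0, 1, 1, 3]
```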

What type of data are DNNs good at

Dense, highly correlated values

What type of data are Linear models good at

Sparse independent features

What model can do a combo of sparse and dense features

Wide-and-deep network in tf.estimator (DNNLinearCombinedClassifier / DNNLinearCombinedRegressor)

Hyperparameter tuning

Searches over hyperparameter values to optimize a target variable (metric) that you specify

TextLineDataset

Used to read data line by line from text files (CSV, etc.), so it works when the data can't fit in memory
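
The same streaming idea can be shown with a plain-Python generator. This is an analogy for line-by-line reading, not the `tf.data` API; `stream_rows` and the temporary CSV file are created only for the example.

```python
import csv
import os
import tempfile


def stream_rows(path):
    """Yield one parsed CSV row at a time instead of loading the whole file."""
    with open(path, newline="") as handle:
        for row in csv.reader(handle):
            yield row


# Write a tiny CSV file to stream from.
with tempfile.NamedTemporaryFile(
    "w", suffix=".csv", delete=False, newline=""
) as tmp:
    tmp.write("id,price\n1,9.5\n2,12.0\n")
    path = tmp.name

rows = list(stream_rows(path))
os.remove(path)
print(rows)  # [['id', 'price'], ['1', '9.5'], ['2', '12.0']]
```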

What is BigQuery schema detection?

Schema auto-detection is available when you load data into BigQuery, and when you query an external data source. When auto-detection is enabled, BigQuery starts the inference process by selecting a random file in the data source and scanning up to 100 rows of data to use as a representative sample. BigQuery then examines each field and attempts to assign a data type to that field based on the values in the sample.

Supported external data sources for BigQuery

Bigtable, Cloud Storage, Google Drive (CSV, JSON, Avro, or Google Sheets)

Writing BigQuery query results

All BigQuery query results are written to either a permanent or a temporary table. Temporary tables are used to cache query results for about 24 hours. Permanent tables can be either new or existing tables.

Supported data locations for Dataproc

HDFS (on cluster), GCS, Bigtable, or BigQuery

AutoML offerings

Vision, Natural Language, Translation

What can the Vision API do?

Detect categories of things in image, extract text, find topical entities (Celebrities, logos, news events, etc.), content moderation

What can the Cloud Video Intelligence API do?

Cloud Video Intelligence API makes videos searchable and discoverable by extracting metadata, identifying key nouns, and annotating the content of the video

Can you create a VM from a snapshot?

Yes

How do you share a snapshot across projects?

Convert to custom image first

GCP equivalent: Apache Kafka

Cloud Pub/Sub

GCP equivalent: Drill

BigQuery

GCP equivalent: Pig

Dataproc

GCP equivalent: Spark

Dataproc (or Dataflow with some rework)

GCP equivalent: Beam

Dataflow

GCP equivalent: Cassandra

Bigtable

GCP equivalent: HBase

Bigtable

GCP equivalent: Redis

Memorystore

What is Spark?

Cluster computing software, second generation after Pig; processes data in memory instead of running MapReduce

What is Hive?

SQL-like data warehouse on Hadoop, runs MapReduce on the backend

What is Pig?

Cluster computing software with its own scripting language (Pig Latin), runs MapReduce on the backend

Data Engineer Flashcards

(47 cards)