Data Engineer Flashcards
Pub/Sub delivery: what order and delivery guarantees?
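Pub/Sub delivers each message at least once to every subscription that EXISTS when the message is published. Ordering is not guaranteed by default. Unacknowledged messages are held for up to 7 days before being dropped.
What are the two accumulation modes for Dataflow?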
accumulatingFiredPanes - Each firing emits the entire set of results accumulated so far in the window. Elements are therefore output multiple times if the trigger fires more than once in the window.
discardingFiredPanes - Each firing emits only the results that arrived since the previous firing in the window.
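The difference between the two modes can be sketched in plain Python. This is a toy simulation of one window, not the Dataflow API; the function name `fire_panes` and the input shape are assumptions made for illustration.

```python
def fire_panes(batches, accumulating):
    """Return the pane emitted at each trigger firing for a single window.

    batches: list of lists; batches[i] holds the elements that arrived
             since the previous firing (hypothetical input shape).
    """
    panes = []
    seen = []  # everything that has arrived in the window so far
    for batch in batches:
        seen.extend(batch)
        if accumulating:
            panes.append(list(seen))   # re-emit all results so far
        else:
            panes.append(list(batch))  # emit only what is new
    return panes


arrivals = [[1, 2], [3], [4, 5]]
accumulated = fire_panes(arrivals, accumulating=True)
discarded = fire_panes(arrivals, accumulating=False)
print(accumulated)  # [[1, 2], [1, 2, 3], [1, 2, 3, 4, 5]]
print(discarded)    # [[1, 2], [3], [4, 5]]
```

In accumulating mode a downstream consumer should overwrite earlier results; in discarding mode it should add the panes together.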
Watermark
Tracks how far behind the system is, that is, the lag (skew) between event time and processing time. Can be guaranteed if coming from Pub/Sub. This averages the skew over time. Watermarks determine WHEN in processing time results are emitted.
Windowing
Determines where in event time results are calculated. Windowing subdivides a PCollection according to the timestamps of its individual elements. Dataflow transforms that aggregate multiple elements, such as GroupByKey and Combine, work implicitly on a per-window basis—that is, they process each PCollection as a succession of multiple, finite windows, though the entire collection itself may be of unlimited or infinite size.
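A minimal sketch of per-window aggregation, assuming fixed 60-second windows and events given as hypothetical `(timestamp_seconds, value)` pairs. It is plain Python rather than the Dataflow SDK, and only illustrates that each element is assigned to a window by its own timestamp, not by its arrival order.

```python
from collections import defaultdict


def fixed_window_sums(events, size):
    """Sum values per fixed window of `size` seconds, keyed by window start."""
    sums = defaultdict(int)
    for timestamp, value in events:
        # The window is chosen from the element's event timestamp.
        window_start = (timestamp // size) * size
        sums[window_start] += value
    return dict(sums)


# Events arrive out of order; the grouping depends only on timestamps.
events = [(3, 10), (61, 5), (59, 1), (125, 7), (60, 2)]
window_sums = fixed_window_sums(events, size=60)
print(window_sums)  # {0: 11, 60: 7, 120: 7}
```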
Triggering
Triggering determines when to “close” each finite window as unbounded data arrives. Using a trigger can help to refine the windowing strategy for your PCollection to deal with late-arriving data or to provide early results. The default trigger fires when the watermark passes the end of the window.
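The default behavior can be pictured with a simplified model: an element counts as on time if it arrives before the watermark reaches the end of its window, and late otherwise. The function name and the `(watermark_at_arrival, value)` input shape are assumptions for illustration; real pipelines have richer semantics, such as allowed lateness.

```python
def classify_arrivals(arrivals, window_end):
    """Split the elements of one window into on-time and late.

    arrivals: list of (watermark_at_arrival, value) pairs in arrival order.
    The window is treated as closed once the watermark reaches window_end.
    """
    on_time, late = [], []
    for watermark, value in arrivals:
        if watermark < window_end:
            on_time.append(value)  # included in the default firing
        else:
            late.append(value)     # arrived after the window closed
    return on_time, late


arrivals = [(10, "a"), (45, "b"), (60, "c"), (75, "d")]
on_time, late = classify_arrivals(arrivals, window_end=60)
print(on_time)  # ['a', 'b']
print(late)     # ['c', 'd']
```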
Side input
A side input is an additional input that your DoFn can access each time it processes an element in the input PCollection. When you specify a side input, you create a view of some other data that can be read from within the ParDo transform’s DoFn while processing each element. A side input can be a value computed by a separate branch of your pipeline.
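As a plain-Python analogy (not the Dataflow API), a side input behaves like an extra argument handed to the per-element function on every call. The helper `apply_with_side_input` and the data are hypothetical.

```python
def apply_with_side_input(elements, process_fn, side_input):
    """Call process_fn(element, side_input) for every main-input element."""
    return [process_fn(element, side_input) for element in elements]


word_lengths = [3, 8, 5, 10]

# Side input: a single value computed by a separate branch of the "pipeline".
mean_length = sum(word_lengths) / len(word_lengths)


def above_mean(length, mean):
    """Per-element logic that reads the side input while processing."""
    return (length, length > mean)


flagged = apply_with_side_input(word_lengths, above_mean, mean_length)
print(flagged)  # [(3, False), (8, True), (5, False), (10, True)]
```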
Gradient descent
Iterative optimization method used to find the best parameters by repeatedly adjusting them in the direction that reduces the error
Epoch
One complete traversal through the entire training dataset
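Gradient descent and epochs can be sketched together by fitting a single weight `w` in the model `y = w * x` with a mean squared error loss. The data, learning rate, and epoch count are arbitrary choices for illustration.

```python
def fit_weight(xs, ys, learning_rate=0.01, epochs=500):
    """Fit y = w * x by gradient descent on the mean squared error."""
    w = 0.0
    n = len(xs)
    for _ in range(epochs):
        # One epoch: the gradient is computed over the entire dataset.
        gradient = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        # Step against the gradient to reduce the error.
        w -= learning_rate * gradient
    return w


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # generated by y = 2 * x
w = fit_weight(xs, ys)
print(round(w, 4))  # 2.0
```

Too large a learning rate makes the updates overshoot and diverge; too small a rate needs many more epochs to converge.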
Softmax
Function that helps deal with multiple classes: it turns a set of scores into probabilities, suppressing lower inputs, boosting the max, and making all outputs add up to 1
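A minimal softmax sketch using only the standard library. Subtracting the maximum score before exponentiating is a common numerical-stability step and does not change the result.

```python
import math


def softmax(scores):
    """Convert a list of scores into probabilities that sum to 1."""
    shift = max(scores)  # avoids overflow in exp for large scores
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([1.0, 2.0, 5.0])
print([round(p, 3) for p in probs])  # [0.017, 0.047, 0.936]
```

The largest input takes most of the probability mass while the smaller ones are pushed toward zero.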
Neuron
one unit that combines inputs (weighted sum of the inputs, then an activation function)
Hidden Layer
another layer of neuron(s) to combine outputs from previous layer
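The neuron and hidden-layer cards can be sketched as a small forward pass. The sigmoid activation and all weights below are arbitrary assumptions chosen for illustration.

```python
import math


def sigmoid(z):
    """Activation function that squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, passed through the activation."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return sigmoid(z)


def layer(inputs, weight_rows, biases):
    """A layer is several neurons that all read the same inputs."""
    return [neuron(inputs, w, b) for w, b in zip(weight_rows, biases)]


x = [0.5, -1.0]
# Hidden layer: two neurons combining the raw inputs.
hidden = layer(x, weight_rows=[[1.0, 2.0], [-1.0, 0.5]], biases=[0.0, 0.1])
# Output neuron: combines the outputs of the hidden layer.
output = neuron(hidden, weights=[1.5, -2.0], bias=0.0)
```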
Inputs
data taken into the neuron
Features
transformations of inputs, such as x^2
Feature engineering
Determining the correct features to use so that the machine learning model performs better
Accuracy
# correct predictions / # total predictions
Precision
TP / (TP + FP)
Recall
TP / (TP + FN)
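The three metrics above can be computed from binary labels as follows. The labels are made-up example data, with 1 as the positive class.

```python
def classification_metrics(y_true, y_pred):
    """Return (accuracy, precision, recall) for binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    correct = sum(1 for t, p in pairs if t == p)
    accuracy = correct / len(pairs)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return accuracy, precision, recall


y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 0, 0, 1, 0, 0, 1]
accuracy, precision, recall = classification_metrics(y_true, y_pred)
print(accuracy)   # 0.625
print(recall)     # 0.5
```

Here TP = 2, FP = 1, and FN = 2, so precision is 2/3. The sketch assumes at least one predicted positive and one actual positive; otherwise the divisions would fail.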
How does TensorFlow evaluate?
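Lazy evaluation - operations only build a graph, and you need to run the graph to get results (this is TensorFlow 1.x graph mode; TensorFlow 2.x executes eagerly by default)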
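The idea of lazy evaluation can be illustrated with a toy graph in plain Python. This is an analogy only, not TensorFlow's API; the `Node` class and helper names are hypothetical.

```python
class Node:
    """A deferred computation: stores what to compute, not the result."""

    def __init__(self, fn, *parents):
        self.fn = fn
        self.parents = parents

    def run(self):
        # Evaluate the parent nodes first, then apply this node's function.
        return self.fn(*(parent.run() for parent in self.parents))


def constant(value):
    return Node(lambda: value)


def add(a, b):
    return Node(lambda x, y: x + y, a, b)


def multiply(a, b):
    return Node(lambda x, y: x * y, a, b)


# Building the graph computes nothing yet.
total = add(constant(2), multiply(constant(3), constant(4)))
is_deferred = isinstance(total, Node)

# Only running the graph produces a value.
result = total.run()
print(result)  # 14
```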