Data Engineer Flashcards

1
Q

Pub/Sub delivery: what order and delivery guarantees?

A

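Pub/Sub delivers each message at least once to every subscription that EXISTS when the message is published. Ordering is not guaranteed by default. Unacknowledged messages are held for up to 7 days before being dropped.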

2
Q

What are the two accumulation modes for Dataflow?

A

accumulatingFiredPanes - Each firing outputs every result seen so far in the window, so some results are output multiple times if the trigger fires more than once in the window.
discardingFiredPanes - Each firing outputs only the new results that arrived since the previous firing in the window.
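
The difference between the two modes can be sketched in plain Python. This is a simulation of a single window with several trigger firings, not the Dataflow or Beam API; the function name `fire_panes` and the sample batches are hypothetical.

```python
def fire_panes(batches, accumulate):
    """Return what is emitted at each trigger firing within one window.

    batches: the new elements that arrived before each firing.
    accumulate: True mimics accumulating mode, False mimics discarding mode.
    """
    outputs = []
    seen = []
    for new_elements in batches:
        if accumulate:
            # Accumulating: re-emit everything seen so far in the window.
            seen = seen + new_elements
            outputs.append(list(seen))
        else:
            # Discarding: emit only what arrived since the last firing.
            outputs.append(list(new_elements))
    return outputs


batches = [[1, 2], [3], [4, 5]]  # three firings in the same window
accumulating = fire_panes(batches, accumulate=True)
discarding = fire_panes(batches, accumulate=False)
print(accumulating)  # [[1, 2], [1, 2, 3], [1, 2, 3, 4, 5]]
print(discarding)    # [[1, 2], [3], [4, 5]]
```

In accumulating mode a downstream consumer should overwrite earlier results; in discarding mode it should combine them.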

3
Q

Watermark

A

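Tracks how far processing lags behind event time: it is the system's estimate of the event time up to which all data should have arrived. The estimate can be guaranteed if the data comes from Pub/Sub. It averages the skew over time. Watermarks determine WHEN, in processing time, results are produced.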

4
Q

Windowing

A

Determines where in event time results are calculated. Windowing subdivides a PCollection according to the timestamps of its individual elements. Dataflow transforms that aggregate multiple elements, such as GroupByKey and Combine, work implicitly on a per-window basis—that is, they process each PCollection as a succession of multiple, finite windows, though the entire collection itself may be of unlimited or infinite size.
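
As a rough illustration of per-window grouping, the sketch below assigns timestamped elements to fixed event-time windows in plain Python. It is not the Dataflow API; `assign_fixed_windows`, the 60-second window size, and the sample events are assumptions made for the example.

```python
from collections import defaultdict


def assign_fixed_windows(events, size_seconds):
    """Group (timestamp, value) pairs into fixed event-time windows."""
    windows = defaultdict(list)
    for timestamp, value in events:
        # The window is chosen from the element's own timestamp,
        # not from the order in which elements arrive.
        start = (timestamp // size_seconds) * size_seconds
        windows[(start, start + size_seconds)].append(value)
    return dict(windows)


# Elements arrive out of order; "c" (t=59) arrives after "b" (t=61).
events = [(3, "a"), (61, "b"), (59, "c"), (125, "d")]
per_window = assign_fixed_windows(events, 60)
counts = {window: len(values) for window, values in per_window.items()}
print(counts)  # {(0, 60): 2, (60, 120): 1, (120, 180): 1}
```

The count here plays the role of an aggregating transform that runs once per window rather than once over the whole collection.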

5
Q

Triggering

A

Triggering determines when to “close” each finite window as unbounded data arrives. Using a trigger can help to refine the windowing strategy for your PCollection to deal with late-arriving data or to provide early results. The default trigger fires when the watermark passes the end of the window.

6
Q

Side input

A

A side input is an additional input that your DoFn can access each time it processes an element in the input PCollection. When you specify a side input, you create a view of some other data that can be read from within the ParDo transform’s DoFn while processing each element. A side input can be a value computed by a separate branch of your pipeline.

7
Q

Gradient descent

A

Iterative method used to find the best parameters by repeatedly adjusting them in the direction that reduces the error
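
A minimal sketch of the idea, assuming a one-parameter error function (w - 3)^2 whose gradient is 2(w - 3). The function name, learning rate, and step count are illustrative choices, not values from any particular library.

```python
def gradient_descent(grad, start, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient to reduce the error."""
    w = start
    for _ in range(steps):
        w -= learning_rate * grad(w)
    return w


# Error is (w - 3)^2, so the best parameter is w = 3.
w_best = gradient_descent(lambda w: 2 * (w - 3), start=0.0)
print(round(w_best, 6))  # 3.0
```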

8
Q

Epoch

A

One full traversal through the entire training dataset

9
Q

Softmax

A

Function that helps deal with multiple labels: it turns raw scores into values that all add to 1, suppressing lower inputs and boosting the max
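
A small pure-Python sketch of softmax; the input scores are made up for the example. Subtracting the maximum before exponentiating is a common numerical-stability step and does not change the result.

```python
import math


def softmax(scores):
    """Map raw scores to positive values that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # shift for stability
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([1.0, 2.0, 5.0])
print([round(p, 3) for p in probs])  # [0.017, 0.047, 0.936]
```

The largest input (5.0) ends up with most of the probability mass, while the smaller inputs are pushed toward zero.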

10
Q

Neuron

A

one unit of combining inputs (weighted input + activation function)
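
A sketch of a single neuron, assuming a sigmoid activation and a bias term (neither is specified on the card; both are common choices). The weights and inputs are hypothetical.

```python
import math


def neuron(inputs, weights, bias):
    """Weighted sum of the inputs followed by a sigmoid activation."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))


# 1.0 * 0.5 + 2.0 * (-0.25) + 0.0 = 0.0, and sigmoid(0) = 0.5
out = neuron([1.0, 2.0], [0.5, -0.25], bias=0.0)
print(out)  # 0.5
```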

11
Q

Hidden Layer

A

another layer of neuron(s) to combine outputs from previous layer

12
Q

Inputs

A

data taken into the neuron

13
Q

Features

A

transformations of inputs, such as x^2

14
Q

Feature engineering

A

Determining the correct features to use so that the machine learning model performs better

15
Q

Accuracy

A

# correct / # total

16
Q

Precision

A

TP / (TP + FP)

17
Q

Recall

A

TP / (TP + FN)
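
A worked example of accuracy, precision, and recall computed from confusion-matrix counts. The counts below are invented for illustration.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)  # correct / total
    precision = tp / (tp + fp)  # of predicted positives, how many were right
    recall = tp / (tp + fn)     # of actual positives, how many were found
    return accuracy, precision, recall


accuracy, precision, recall = classification_metrics(tp=8, fp=2, fn=4, tn=86)
print(round(accuracy, 3), round(precision, 3), round(recall, 3))
# 0.94 0.8 0.667
```

With few actual positives, accuracy stays high (0.94) even though a third of the positives are missed, which is why precision and recall are reported separately.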

18
Q

How does TensorFlow evaluate?

A

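Lazy evaluation - you build a graph first and need to run the graph to get results. This describes TensorFlow 1.x graph mode; TensorFlow 2.x executes eagerly by default.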

19
Q

Estimator API models

A

LinearRegressor - linear regression
LinearClassifier - linear classification
DNNRegressor - Deep Neural Network regression
DNNClassifier - Deep Neural Network classifier

20
Q

Logistic vs Linear Regression

A

Linear regression outputs a continuous value (how much will this house sell for?), while logistic regression outputs a probability that is thresholded into a binary value (will this house sell for more than 500k?).
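
A sketch of the contrast, assuming a single input feature and hand-picked weights (all values are hypothetical). The logistic version applies a sigmoid to the same kind of linear score and thresholds the result at 0.5.

```python
import math


def linear_predict(x, w, b):
    """Continuous output: the linear score itself."""
    return w * x + b


def logistic_predict(x, w, b, threshold=0.5):
    """Binary output: sigmoid of the linear score, then a threshold."""
    probability = 1.0 / (1.0 + math.exp(-(w * x + b)))
    return probability, probability >= threshold


price = linear_predict(2.0, w=3.0, b=1.0)  # 3.0 * 2.0 + 1.0 = 7.0
probability, label = logistic_predict(2.0, w=1.0, b=-1.0)
print(price, round(probability, 3), label)  # 7.0 0.731 True
```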

21
Q

How to encode categorical data

A

Use one-hot encoding if the vocabulary is the same at prediction time as at training time. Cold start is when you can’t make predictions for new values that were not in the training vocabulary. If you don’t have the vocabulary of all possible values, use a hash bucket.
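
The two encodings can be sketched in plain Python. The helper names, the color vocabulary, and the bucket count are assumptions for the example; `zlib.crc32` is used only because it is deterministic across runs, and real feature-column implementations may hash differently.

```python
import zlib


def one_hot(value, vocabulary):
    """One slot per known vocabulary item; unknown values get all zeros."""
    return [1 if value == item else 0 for item in vocabulary]


def hash_bucket(value, num_buckets):
    """Map any string to one of num_buckets ids without a vocabulary."""
    return zlib.crc32(value.encode("utf-8")) % num_buckets


vocabulary = ["red", "green", "blue"]
encoded = one_hot("green", vocabulary)
unseen = one_hot("purple", vocabulary)  # cold start: no slot for "purple"
bucket = hash_bucket("purple", 10)      # still lands in some bucket
print(encoded, unseen)  # [0, 1, 0] [0, 0, 0]
```

Hashing trades accuracy for coverage: unrelated values can collide in the same bucket.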

22
Q

When to bucketize

A

Bucketize floats into ranges so they aren’t all treated as different values
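
A minimal sketch of bucketizing with the standard-library `bisect` module. The boundaries and sample values are made up; a value equal to a boundary goes into the bucket that starts at that boundary.

```python
from bisect import bisect_right


def bucketize(value, boundaries):
    """Return the index of the range that value falls into."""
    return bisect_right(boundaries, value)


boundaries = [0.0, 10.0, 20.0]  # buckets: <0, [0,10), [10,20), >=20
values = [-3.2, 4.71, 4.72, 15.0, 20.0, 99.9]
buckets = [bucketize(v, boundaries) for v in values]
print(buckets)  # [0, 1, 1, 2, 3, 3]
```

Note that 4.71 and 4.72, two distinct floats, end up in the same bucket.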

23
Q

What type of data are DNNs good at

A

Dense, highly correlated values

24
Q

What type of data are Linear models good at

A

Sparse independent features

25
Q

What model can do a combo of sparse and dense features

A

Wide-and-deep network in tf.estimator

26
Q

Hyperparameter tuning

A

optimizes a target variable that you specify

27
Q

TextLineDataset

A

Used to read data from CSV files/etc. when data can’t fit in memory

28
Q

What is BigQuery schema detection?

A

Schema auto-detection is available when you load data into BigQuery, and when you query an external data source.

When auto-detection is enabled, BigQuery starts the inference process by selecting a random file in the data source and scanning up to the first 500 rows of data to use as a representative sample. BigQuery then examines each field and attempts to assign a data type to that field based on the values in the sample.

29
Q

Supported external data sources for BigQuery

A

Bigtable, Cloud Storage, and Google Drive (CSV, JSON, Avro, or Google Sheets)

30
Q

Writing BigQuery query results

A

All BigQuery query results are written to either a permanent or a temporary table. Temporary tables are used to cache query results for about 24 hours. Permanent tables can be either new or existing tables.

31
Q

Supported data locations for Dataproc

A

HDFS (on cluster), GCS, Bigtable, or BigQuery

32
Q

AutoML offerings

A

Vision, Natural Language, Translation

33
Q

What can the Vision API do?

A

Detect categories of things in an image, extract text, find topical entities (celebrities, logos, news events, etc.), and moderate content

34
Q

What can the Cloud Video Intelligence API do?

A

Cloud Video Intelligence API makes videos searchable and discoverable by extracting metadata, identifying key nouns, and annotating the content of the video

35
Q

Can you create a VM from a snapshot?

A

Yes

36
Q

How do you share a snapshot across projects?

A

Convert to custom image first

37
Q

GCP equivalent: Apache Kafka

A

Cloud Pub/Sub

38
Q

GCP equivalent: Drill

A

BigQuery

39
Q

GCP equivalent: Pig

A

Dataproc

40
Q

GCP equivalent: Spark

A

Dataproc (or Dataflow with some rework)

41
Q

GCP equivalent: Beam

A

Dataflow

42
Q

GCP equivalent: Cassandra

A

Bigtable

43
Q

GCP equivalent: HBase

A

Bigtable

44
Q

GCP equivalent: Redis

A

Memorystore

45
Q

What is Spark?

A

Cluster computing software, second generation after Pig

46
Q

What is Hive?

A

SQL-like data warehouse for Hadoop, runs MapReduce on the backend

47
Q

What is Pig?

A

Cluster computing software, runs MapReduce on the backend