Feature Engineering Flashcards

1
Q

What aspects does BigQuery preprocessing include?

A
  • representation transformation
  • feature construction.
2
Q

What does feature representation include?

A
  • converting a numeric feature to a categorical feature (bucketization)
  • converting categorical features to a numeric representation (one-hot encoding, learning with counts, sparse feature embeddings, and so on)
  • Note that some models work only with numeric or only with categorical features, while others can handle mixed-type features.
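The two conversions above can be sketched in plain Python (the helper names here are illustrative, not BigQuery ML functions):

```python
def bucketize(value, split_points):
    """Convert a numeric feature to a categorical bucket index.

    Bucket i holds values below split_points[i]; values at or above
    the last split point fall into the final bucket.
    """
    for i, split in enumerate(split_points):
        if value < split:
            return i
    return len(split_points)


def one_hot(category, vocabulary):
    """Convert a categorical value to a numeric one-hot vector."""
    return [1 if category == v else 0 for v in vocabulary]


# Bucketize an age with splits at 18 and 65 -> three buckets.
print(bucketize(42, [18, 65]))                    # bucket 1 (18 <= 42 < 65)
print(one_hot("red", ["red", "green", "blue"]))   # [1, 0, 0]
```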
3
Q

What does feature construction include?

A
  • creating new features by using typical techniques such as polynomial expansion, univariate mathematical functions, or feature crossing to capture feature interactions
  • creating new features by using business logic from the domain of the ML use case
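A minimal plain-Python sketch of the two generic techniques named above, polynomial expansion and feature crossing (function names are illustrative):

```python
from itertools import combinations_with_replacement


def polynomial_expansion(features, degree=2):
    """Append all products of the input features up to the given degree."""
    expanded = list(features)
    for d in range(2, degree + 1):
        for combo in combinations_with_replacement(range(len(features)), d):
            product = 1.0
            for i in combo:
                product *= features[i]
            expanded.append(product)
    return expanded


def feature_cross(cat_a, cat_b):
    """Combine two categorical features into one crossed feature."""
    return f"{cat_a}_x_{cat_b}"


print(polynomial_expansion([2.0, 3.0]))           # [2.0, 3.0, 4.0, 6.0, 9.0]
print(feature_cross("Tuesday", "rush_hour"))      # Tuesday_x_rush_hour
```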
4
Q

What two types of feature preprocessing does BigQuery ML support?

A

automatic and manual

5
Q

When does BigQuery ML automatic preprocessing occur?

A

Automatic preprocessing occurs during training.

6
Q

BigQuery ML provides the TRANSFORM clause for you to define custom preprocessing using the manual preprocessing functions.

A

True

7
Q

How is BigQuery data processing realized?

A
  • built-in SQL math and data processing functions
  • parsing of common data formats
  • SQL filtering operations to exclude bogus data from your training datasets
8
Q

List some of the advanced feature engineering preprocessing functions in BigQuery.

A
  • ML.FEATURE_CROSS(STRUCT(features)), which does a feature cross of all the combinations of the given features.
  • The TRANSFORM clause, which allows you to specify all preprocessing during model creation. The preprocessing is then automatically applied during the prediction and evaluation phases.
  • ML.BUCKETIZE(f, split_points), where split_points is an array of bucket boundaries.
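The semantics of the two SQL functions above can be emulated in plain Python (this is not BigQuery SQL; ML.FEATURE_CROSS and ML.BUCKETIZE are the real functions these sketches mimic):

```python
from itertools import combinations


def feature_cross(features):
    """Cross all pairwise combinations of categorical features,
    as ML.FEATURE_CROSS(STRUCT(...)) does for a struct of columns."""
    return {f"{a}_{b}": f"{features[a]}_{features[b]}"
            for a, b in combinations(sorted(features), 2)}


def bucketize(value, split_points):
    """Assign a bucket label from an array of split points,
    mirroring ML.BUCKETIZE(f, split_points)."""
    for i, split in enumerate(split_points):
        if value < split:
            return f"bin_{i + 1}"
    return f"bin_{len(split_points) + 1}"


print(feature_cross({"day": "Tue", "hour": "am"}))  # {'day_hour': 'Tue_am'}
print(bucketize(7.5, [5, 10, 20]))                  # bin_2
```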
9
Q

When does Memorization work in ML?

A

Memorization works when you have so much data that, for any single grid cell within your input space, the distribution of data is statistically significant.

10
Q

Feature crosses are about memorization.

A

True

11
Q

What are the benefits of sparse models?

A
  • Sparse models contain fewer features and therefore are easier to train on limited data.
  • Fewer features also means less chance of overfitting.
  • Fewer features also mean it is easier to explain to users because only the most meaningful features remain.
12
Q

What is used for windowed functions in BigQuery ML?

A

For these types of time-windowed features, you use Apache Beam batch and streaming data pipelines.

13
Q

The tf.data API enables you to build complex input pipelines from simple, reusable pieces.

A

True

14
Q

What is a complementary technology to Apache Beam?

A
  • Cloud Dataflow, which is a fully managed platform that allows you to run data processing and feature engineering pipelines at scale.
  • Apache Beam is the API for data pipeline building in Java or Python, and Cloud Dataflow is the implementation and execution framework.
15
Q

What do you need to do to implement a data processing pipeline?

A

To implement a data processing pipeline, you write your code using the Apache Beam APIs and then deploy the code to Cloud Dataflow.

16
Q

The Apache Beam SDK comes with a variety of connectors that enable Dataflow to read from many data sources, including text files in Google Cloud Storage or other file systems, and even from real-time streaming data sources like Google Cloud Pub/Sub or Kafka.

A

True

17
Q

What are contention issues?

A

Contention issues arise when multiple servers try to acquire a lock on the same file concurrently.

18
Q

What is one key advantage of preprocessing your features using Apache Beam?

A

The same code you use to preprocess features in training and evaluation can also be used in serving.

19
Q

What is the purpose of a Cloud Dataflow connector?

A

Connectors allow you to output the results of a pipeline to a specific data sink like Bigtable, Google Cloud Storage, flat file, BigQuery, and more.

20
Q

What are ways you could do feature engineering within TensorFlow?

A

Within TensorFlow itself using feature columns or by wrapping the feature dictionary and adding arbitrary TensorFlow code.
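The "wrapping the feature dictionary" approach can be sketched in plain Python: a function takes the features dictionary, adds engineered features, and returns it. In real TensorFlow this code runs inside the input or model function so it becomes part of the graph; here plain floats stand in for tensors, and the feature names are illustrative (the classic taxi-fare coordinates):

```python
import math


def add_engineered(features):
    """Add derived features to the feature dictionary and return it."""
    features = dict(features)  # avoid mutating the caller's dict
    # Euclidean distance computed from two pairs of coordinate features.
    features["euclidean"] = math.hypot(
        features["dropofflon"] - features["pickuplon"],
        features["dropofflat"] - features["pickuplat"],
    )
    return features


row = {"pickuplon": -73.0, "pickuplat": 40.0,
       "dropofflon": -74.0, "dropofflat": 41.0}
print(add_engineered(row)["euclidean"])  # sqrt(2), about 1.414
```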

21
Q

Why do you need to use TensorFlow code when wrapping the feature dictionary?

A

Because this code needs to execute as part of the model function, that is, as part of the TensorFlow graph.

22
Q

What is the limit of feature pre-processing in TensorFlow?

A

The limitation is that preprocessing can be done on a single input only; it cannot maintain state across examples.

23
Q

TensorFlow models (sequence models are an exception) tend to be stateless.

A

True

24
Q

How does TensorFlow Transform work?

A
  • With tf.Transform, you are limited to TensorFlow methods, but you get the efficiency of TensorFlow.
  • tf.Transform is a hybrid of Apache Beam and TensorFlow.
  • tf.Transform is the component used to analyze and transform training data.
25
Q

Dataflow preprocessing only works in the context of a pipeline.

A

True

26
Q

When do we use TensorFlow vs. Dataflow?

A
  • use Dataflow for preprocessing that needs to maintain state, such as time windows (back-end preprocessing for ML models)
  • use TensorFlow for preprocessing that is based on the provided input only (on-the-fly preprocessing for ML models)
27
Q

What are the problems with the typical ML pipeline?

A
  • you need to keep batch and live processing in sync,
  • all other tooling, such as evaluation, must also be kept in sync with batch processing.
28
Q

Which technology, Beam or TensorFlow, is better suited to doing on-the-fly transformation of the input data?

A
  • analysis in Beam (similar to scikit-learn's fit_transform)
  • transformation in TensorFlow (similar to scikit-learn's transform)
29
Q

What PTransforms are available in tf.Transform?

A
  • AnalyzeAndTransformDataset, which is executed in Beam to create the preprocessed training dataset (similar to scikit-learn's fit_transform)
  • TransformDataset, which is executed in Beam to create the evaluation dataset (similar to scikit-learn's transform)
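The two-phase pattern behind these PTransforms can be sketched in plain Python: an analyze pass over the training data computes constants, and the transform pass applies them, to training and evaluation data alike. The class and method names here are illustrative, not the tf.Transform API:

```python
class MinMaxScaler:
    def analyze(self, values):
        """Full pass over training data to compute statistics
        (the Beam 'analyze' phase)."""
        self.lo, self.hi = min(values), max(values)
        return self

    def transform(self, values):
        """Apply the stored statistics (the TensorFlow-graph phase)."""
        span = self.hi - self.lo
        return [(v - self.lo) / span for v in values]

    def analyze_and_transform(self, values):
        """Both phases at once, like AnalyzeAndTransformDataset."""
        return self.analyze(values).transform(values)


scaler = MinMaxScaler()
train_out = scaler.analyze_and_transform([0.0, 5.0, 10.0])
eval_out = scaler.transform([2.5])   # evaluation reuses training statistics,
                                     # like TransformDataset
print(train_out)  # [0.0, 0.5, 1.0]
print(eval_out)   # [0.25]
```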
30
Q

What does tf.Transform do during the training and serving phase?

A

Provides a TensorFlow graph for preprocessing

31
Q

Fill in the blank:
The ______________ _______________ is the most important concept of tf.Transform. The ______________ _______________ is a logical description of a transformation of the dataset. The ______________ _______________ accepts and returns a dictionary of tensors, where a tensor means Tensor or 2D SparseTensor.

A

Preprocessing function
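The dict-in, dict-out shape of the preprocessing function can be sketched in plain Python, with lists standing in for Tensor / SparseTensor (the feature names are illustrative; the min-max scaling loosely mimics tf.Transform's scale_to_0_1 analyzer):

```python
def preprocessing_fn(inputs):
    """Accepts a dictionary of 'tensors' and returns a dictionary of
    transformed ones: a logical description of the dataset transformation."""
    fare = inputs["fare"]
    lo, hi = min(fare), max(fare)
    return {
        "fare_scaled": [(v - lo) / (hi - lo) for v in fare],
        "passengers": inputs["passengers"],  # passed through unchanged
    }


out = preprocessing_fn({"fare": [5.0, 10.0, 15.0], "passengers": [1, 2, 1]})
print(out["fare_scaled"])  # [0.0, 0.5, 1.0]
```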