Feature Engineering Flashcards

1
Q

What aspects does BigQuery preprocessing include?

A
  • representation transformation
  • feature construction.
2
Q

What does feature representation include?

A
  • converting a numeric feature to a categorical feature (bucketization)
  • converting categorical features to a numeric representation (one-hot encoding, learning with counts, sparse feature embeddings, and so on)
  • Note that some models work only with numeric or only with categorical features, while others can handle mixed-type features.
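The two conversions above can be sketched in plain Python (the helper names here are illustrative, not BigQuery ML functions):

```python
def bucketize(value, split_points):
    """Convert a numeric feature to a categorical bucket index.

    Bucket i holds values below split_points[i]; values at or above
    the last split point fall into the final bucket.
    """
    for i, split in enumerate(split_points):
        if value < split:
            return i
    return len(split_points)


def one_hot(category, vocabulary):
    """Convert a categorical value to a numeric one-hot vector."""
    return [1 if category == v else 0 for v in vocabulary]


# Bucketize an age with splits at 18 and 65 -> three buckets.
print(bucketize(42, [18, 65]))                    # bucket 1 (18 <= 42 < 65)
print(one_hot("red", ["red", "green", "blue"]))   # [1, 0, 0]
```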
3
Q

What does feature construction include?

A
  • creating new features by using typical techniques such as polynomial expansion, univariate mathematical functions, or feature crossing to capture feature interactions
  • creating new features by using business logic from the domain of the ML use case
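A minimal plain-Python sketch of the two generic techniques named above, polynomial expansion and feature crossing (function names are illustrative):

```python
from itertools import combinations_with_replacement


def polynomial_expansion(features, degree=2):
    """Append all products of the input features up to the given degree."""
    expanded = list(features)
    for d in range(2, degree + 1):
        for combo in combinations_with_replacement(range(len(features)), d):
            product = 1.0
            for i in combo:
                product *= features[i]
            expanded.append(product)
    return expanded


def feature_cross(cat_a, cat_b):
    """Combine two categorical features into one crossed feature."""
    return f"{cat_a}_x_{cat_b}"


print(polynomial_expansion([2.0, 3.0]))           # [2.0, 3.0, 4.0, 6.0, 9.0]
print(feature_cross("Tuesday", "rush_hour"))      # Tuesday_x_rush_hour
```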
4
Q

What two types of feature preprocessing does BigQuery ML support?

A

automatic and manual

5
Q

When does BigQuery ML automatic preprocessing occur?

A

Automatic preprocessing occurs during training.

6
Q

BigQuery ML provides the TRANSFORM clause for you to define custom preprocessing using the manual preprocessing functions.

A

True

7
Q

How is BigQuery data processing realized?

A
  • built-in SQL math and data processing functions
  • parsing of common data formats
  • SQL filtering operations to exclude bogus data from your training datasets
8
Q

List some of the advanced feature engineering preprocessing functions in BigQuery.

A
  • ML.FEATURE_CROSS(STRUCT(features)), which does a feature cross of all the combinations of the given features.
  • The TRANSFORM clause, which allows you to specify all preprocessing during model creation. The preprocessing is then automatically applied during the prediction and evaluation phases.
  • ML.BUCKETIZE(f, split_points), where split_points is an array of bucket boundaries.
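The semantics of the two SQL functions above can be emulated in plain Python (this is not BigQuery SQL; ML.FEATURE_CROSS and ML.BUCKETIZE are the real functions these sketches mimic):

```python
from itertools import combinations


def feature_cross(features):
    """Cross all pairwise combinations of categorical features,
    as ML.FEATURE_CROSS(STRUCT(...)) does for a struct of columns."""
    return {f"{a}_{b}": f"{features[a]}_{features[b]}"
            for a, b in combinations(sorted(features), 2)}


def bucketize(value, split_points):
    """Assign a bucket label from an array of split points,
    mirroring ML.BUCKETIZE(f, split_points)."""
    for i, split in enumerate(split_points):
        if value < split:
            return f"bin_{i + 1}"
    return f"bin_{len(split_points) + 1}"


print(feature_cross({"day": "Tue", "hour": "am"}))  # {'day_hour': 'Tue_am'}
print(bucketize(7.5, [5, 10, 20]))                  # bin_2
```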
9
Q

When does Memorization work in ML?

A

Memorization works when you have so much data that, for any single grid cell within your input space, the distribution of data is statistically significant.

10
Q

Feature crosses are about memorization.

A

True

11
Q

What are the benefits of sparse models?

A
  • Sparse models contain fewer features and therefore are easier to train on limited data.
  • Fewer features also means less chance of overfitting.
  • Fewer features also mean it is easier to explain to users because only the most meaningful features remain.
12
Q

What is used for windowed functions in BigQuery ML?

A

For these types of time-windowed features, you use Apache Beam batch and streaming data pipelines.

13
Q

The tf.data API enables you to build complex input pipelines from simple, reusable pieces.

A

True

14
Q

What is a complementary technology to Apache Beam?

A
  • Cloud Dataflow, which is a fully managed platform that allows you to run data processing and feature engineering pipelines at scale.
  • Apache Beam is the API for data pipeline building in Java or Python, and Cloud Dataflow is the implementation and execution framework.
15
Q

What do you need to do to implement a data processing pipeline?

A

To implement a data processing pipeline, you write your code using the Apache Beam APIs and then deploy the code to Cloud Dataflow.

16
Q

The Apache Beam SDK comes with a variety of connectors that enable Dataflow to read from many data sources, including text files in Google Cloud Storage or other file systems, and even from real-time streaming data sources like Google Cloud Pub/Sub or Kafka.

A

True

17
Q

What are contention issues?

A

Contention issues arise when multiple servers try to acquire a lock on the same file concurrently.

18
Q

What is one key advantage of preprocessing your features using Apache Beam?

A

The same code you use to preprocess features in training and evaluation can also be used in serving.

19
Q

What is the purpose of a Cloud Dataflow connector?

A

Connectors allow you to output the results of a pipeline to a specific data sink like Bigtable, Google Cloud Storage, flat file, BigQuery, and more.

20
Q

What are ways you could do feature engineering within TensorFlow?

A

Within TensorFlow itself using feature columns or by wrapping the feature dictionary and adding arbitrary TensorFlow code.
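The "wrapping the feature dictionary" approach can be sketched in plain Python: a function takes the features dictionary, adds engineered features, and returns it. In real TensorFlow this code runs inside the input or model function so it becomes part of the graph; here plain floats stand in for tensors, and the feature names are illustrative (the classic taxi-fare coordinates):

```python
import math


def add_engineered(features):
    """Add derived features to the feature dictionary and return it."""
    features = dict(features)  # avoid mutating the caller's dict
    # Euclidean distance computed from two pairs of coordinate features.
    features["euclidean"] = math.hypot(
        features["dropofflon"] - features["pickuplon"],
        features["dropofflat"] - features["pickuplat"],
    )
    return features


row = {"pickuplon": -73.0, "pickuplat": 40.0,
       "dropofflon": -74.0, "dropofflat": 41.0}
print(add_engineered(row)["euclidean"])  # sqrt(2), about 1.414
```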

21
Q

Why do you need to use TensorFlow code when wrapping the feature dictionary?

A

Because this code needs to execute as part of the model function, that is, as part of the TensorFlow graph.

22
Q

What is the limit of feature pre-processing in TensorFlow?

A

The limitation is that preprocessing can be done on a single input only; it cannot maintain state across examples.

23
Q

TensorFlow models (sequence models are an exception) tend to be stateless.

A

True

24
Q

How does TensorFlow Transform work?

A
  • With tf.Transform, you are limited to TensorFlow methods, but you get the efficiency of TensorFlow.
  • tf.Transform is a hybrid of Apache Beam and TensorFlow.
  • tf.Transform is the component used to analyze and transform training data.
25
Q

Dataflow preprocessing only works in the context of a pipeline.

A

True

26
Q

When do we use TensorFlow vs. Dataflow?

A
  • use Dataflow for preprocessing that needs to maintain state, such as time windows (back-end preprocessing for ML models)
  • use TensorFlow for preprocessing that is based on the provided input only (on-the-fly preprocessing for ML models)
27
Q

What are the problems with the typical ML pipeline?

A
  • you need to keep batch and live processing in sync,
  • all other tooling, such as evaluation, must also be kept in sync with batch processing.
28
Q

Which technology, Beam or TensorFlow, is better suited to doing on-the-fly transformation of the input data?

A
  • analysis in Beam (similar to scikit-learn's fit_transform)
  • transformation in TensorFlow (similar to scikit-learn's transform)
29
Q

What PTransforms are available in tf.Transform?

A
  • AnalyzeAndTransformDataset, which is executed in Beam to create the preprocessed training dataset (similar to scikit-learn's fit_transform)
  • TransformDataset, which is executed in Beam to create the evaluation dataset (similar to scikit-learn's transform)
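The two-phase pattern behind these PTransforms can be sketched in plain Python: an analyze pass over the training data computes constants, and the transform pass applies them, to training and evaluation data alike. The class and method names here are illustrative, not the tf.Transform API:

```python
class MinMaxScaler:
    def analyze(self, values):
        """Full pass over training data to compute statistics
        (the Beam 'analyze' phase)."""
        self.lo, self.hi = min(values), max(values)
        return self

    def transform(self, values):
        """Apply the stored statistics (the TensorFlow-graph phase)."""
        span = self.hi - self.lo
        return [(v - self.lo) / span for v in values]

    def analyze_and_transform(self, values):
        """Both phases at once, like AnalyzeAndTransformDataset."""
        return self.analyze(values).transform(values)


scaler = MinMaxScaler()
train_out = scaler.analyze_and_transform([0.0, 5.0, 10.0])
eval_out = scaler.transform([2.5])   # evaluation reuses training statistics,
                                     # like TransformDataset
print(train_out)  # [0.0, 0.5, 1.0]
print(eval_out)   # [0.25]
```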
30
Q

What does tf.Transform do during the training and serving phase?

A

Provides a TensorFlow graph for preprocessing

31
Q

Fill in the blank:
The ______________ _______________ is the most important concept of tf.Transform. The ______________ _______________ is a logical description of a transformation of the dataset. The ______________ _______________ accepts and returns a dictionary of tensors, where a tensor means Tensor or 2D SparseTensor.

A

Preprocessing function
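The dict-in, dict-out shape of the preprocessing function can be sketched in plain Python, with lists standing in for Tensor / SparseTensor (the feature names are illustrative; the min-max scaling loosely mimics tf.Transform's scale_to_0_1 analyzer):

```python
def preprocessing_fn(inputs):
    """Accepts a dictionary of 'tensors' and returns a dictionary of
    transformed ones: a logical description of the dataset transformation."""
    fare = inputs["fare"]
    lo, hi = min(fare), max(fare)
    return {
        "fare_scaled": [(v - lo) / (hi - lo) for v in fare],
        "passengers": inputs["passengers"],  # passed through unchanged
    }


out = preprocessing_fn({"fare": [5.0, 10.0, 15.0], "passengers": [1, 2, 1]})
print(out["fare_scaled"])  # [0.0, 0.5, 1.0]
```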