Feature Engineering Flashcards
What aspects does feature preprocessing in BigQuery include?
- representation transformation
- feature construction.
What does feature representation include?
- converting a numeric feature to a categorical feature (bucketization)
- converting categorical features to a numeric representation (one-hot encoding, learning with counts, sparse feature embeddings, and so on); see the sketch after this list
- Some models work only with numeric or categorical features, while others can handle mixed-type features.
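A minimal sketch of both directions on a toy pandas DataFrame; the column names, categories, and bucket boundaries are made up for illustration:

```python
# Sketch only: toy data; column names and bucket boundaries are illustrative.
import pandas as pd

df = pd.DataFrame({"city": ["NYC", "SF", "NYC"], "age": [23, 41, 67]})

# Categorical -> numeric: one-hot encode the 'city' column.
one_hot = pd.get_dummies(df["city"], prefix="city")

# Numeric -> categorical: bucketize 'age' into ranges.
age_bucket = pd.cut(df["age"], bins=[0, 30, 50, 100], labels=["young", "mid", "senior"])

print(pd.concat([df, one_hot, age_bucket.rename("age_bucket")], axis=1))
```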
What does feature construction include?
- creating new features by using typical techniques, such as polynomial expansion (using univariate mathematical functions) or feature crossing (to capture feature interactions); see the sketch after this list
- by using business logic from the domain of the ML use case.
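A minimal numpy sketch of both techniques; the feature names (distance, duration) are hypothetical:

```python
# Sketch only: toy columns standing in for real features such as trip distance and duration.
import numpy as np

distance = np.array([1.0, 2.0, 3.0])
duration = np.array([0.5, 1.5, 2.5])

distance_sq = distance ** 2          # polynomial expansion (univariate function)
log_duration = np.log1p(duration)    # another univariate mathematical transform
dist_x_dur = distance * duration     # feature cross capturing the interaction

constructed = np.column_stack([distance, duration, distance_sq, log_duration, dist_x_dur])
print(constructed)
```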
What two types of feature preprocessing does BigQuery ML support?
automatic and manual
When does BigQuery ML automatic preprocessing occur?
Automatic preprocessing occurs during training.
BigQuery ML provides the TRANSFORM clause for you to define custom preprocessing using the manual preprocessing functions.
True
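A minimal sketch of the TRANSFORM clause, submitted from Python with the google.cloud.bigquery client; the dataset, table, and column names are hypothetical placeholders:

```python
# Sketch only: `my_dataset`, `trips`, and the column names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()

create_model_sql = """
CREATE OR REPLACE MODEL `my_dataset.trips_model`
TRANSFORM(
  ML.BUCKETIZE(pickup_hour, [6, 12, 18]) AS hour_bucket,               -- manual preprocessing
  ML.FEATURE_CROSS(STRUCT(day_of_week, payment_type)) AS day_pay_cross,
  label
)
OPTIONS (model_type = 'linear_reg', input_label_cols = ['label'])
AS
SELECT pickup_hour, day_of_week, payment_type, label
FROM `my_dataset.trips`
"""

client.query(create_model_sql).result()  # waits for the CREATE MODEL job to finish
```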
How is BigQuery data processing realized?
- Built-in SQL math and data processing functions
- parsing common data formats
- SQL filtering operations to exclude bogus data from your training datasets (see the sketch below)
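A minimal sketch of this kind of SQL preprocessing run from Python; the table and column names are hypothetical:

```python
# Sketch only: table and column names are hypothetical placeholders.
from google.cloud import bigquery

client = bigquery.Client()

prep_sql = """
SELECT
  SAFE_DIVIDE(tip_amount, fare_amount) AS tip_rate,         -- built-in math function
  EXTRACT(DAYOFWEEK FROM pickup_datetime) AS day_of_week,   -- parse a timestamp column
  trip_distance
FROM `my_dataset.trips`
WHERE fare_amount > 0                                       -- filter out bogus examples
  AND trip_distance IS NOT NULL
"""

for row in client.query(prep_sql).result():
    print(row)
```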
List some of the advanced feature engineering preprocessing functions in BigQuery.
- ML.FEATURE_CROSS(STRUCT(features)) creates a feature cross of all the feature combinations.
- The TRANSFORM clause, which allows you to specify all preprocessing during model creation; that preprocessing is then automatically applied during the prediction and evaluation phases of machine learning.
- ML.BUCKETIZE(f, split_points), where split_points is an array of bucket boundaries (see the sketch below).
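A minimal sketch that previews the output of these functions, assuming they can also be called in a plain SELECT outside of model creation; the table and column names are hypothetical:

```python
# Sketch only: assumes the preprocessing functions can be called in a plain SELECT;
# table and column names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()

preview_sql = """
SELECT
  ML.BUCKETIZE(trip_distance, [1, 5, 10]) AS distance_bucket,           -- split_points is an array
  ML.FEATURE_CROSS(STRUCT(day_of_week, payment_type)) AS day_pay_cross  -- all combinations
FROM `my_dataset.trips`
LIMIT 10
"""

for row in client.query(preview_sql).result():
    print(row)
```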
When does memorization work in ML?
Memorization works when you have so much data that, for any single grid cell within your input space, the distribution of data is statistically significant.
Feature crosses are about memorization.
True
What are the benefits of sparse models?
- Sparse models contain fewer features and therefore are easier to train on limited data.
- Fewer features also means less chance of overfitting.
- Fewer features also mean the model is easier to explain to users, because only the most meaningful features remain.
What is used to compute time-windowed features?
For these types of time-windowed features, you use Apache Beam batch and streaming data pipelines (see the sketch below).
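A minimal Apache Beam sketch of a time-windowed aggregate; the in-memory (event_time_seconds, fare) pairs and the 60-second window size are made up for illustration:

```python
# Sketch only: a tiny in-memory stream of (event_time_seconds, fare) pairs.
import apache_beam as beam

with beam.Pipeline() as pipeline:
    (
        pipeline
        | "Create" >> beam.Create([(0, 4.0), (30, 6.0), (70, 2.0)])
        | "Timestamp" >> beam.Map(lambda e: beam.window.TimestampedValue(e[1], e[0]))
        | "Window" >> beam.WindowInto(beam.window.FixedWindows(60))  # 60-second windows
        | "MeanPerWindow" >> beam.CombineGlobally(beam.combiners.MeanCombineFn()).without_defaults()
        | "Print" >> beam.Map(print)
    )
```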
The tf.data API enables you to build complex input pipelines from simple, reusable pieces.
True
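A minimal tf.data sketch that composes simple, reusable pieces into an input pipeline; the toy tensors stand in for a real data source:

```python
# Sketch only: toy in-memory tensors standing in for a real input source.
import tensorflow as tf

features = tf.constant([[1.0], [2.0], [3.0], [4.0]])
labels = tf.constant([0, 1, 1, 0])

dataset = (
    tf.data.Dataset.from_tensor_slices((features, labels))
    .map(lambda x, y: (x / 4.0, y))   # reusable per-example transformation
    .shuffle(buffer_size=4)
    .batch(2)
    .prefetch(tf.data.AUTOTUNE)       # overlap preprocessing with training
)

for batch_x, batch_y in dataset:
    print(batch_x.numpy(), batch_y.numpy())
```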
What is a complementary technology to Apache Beam?
- Cloud Dataflow, which is a fully managed platform that allows you to run data processing and feature engineering pipelines at scale.
- Apache Beam is the API for building data pipelines in Java or Python, and Cloud Dataflow is the managed implementation and execution framework that runs them.
What do you need to do to implement a data processing pipeline?
To implement a data processing pipeline, you write your code using the Apache Beam APIs and then deploy the code to Cloud Dataflow.
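A minimal sketch of that workflow using the Beam Python SDK with the Dataflow runner; the project, region, and bucket names are hypothetical placeholders:

```python
# Sketch only: project, region, and bucket names are hypothetical placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",          # execute on the managed Cloud Dataflow service
    project="my-gcp-project",
    region="us-central1",
    temp_location="gs://my-bucket/tmp",
)

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "Read" >> beam.io.ReadFromText("gs://my-bucket/raw/*.csv")
        | "Clean" >> beam.Map(lambda line: line.strip().lower())
        | "Write" >> beam.io.WriteToText("gs://my-bucket/clean/part")
    )
```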