2. Exploring Data and Building Data Pipelines Flashcards
What is data visualization?
It is an exploratory data analysis technique for finding trends and outliers in data.
It helps with data cleaning and feature engineering.
What are the two ways to visualize data?
Univariate analysis (range & outliers)
Bivariate analysis (correlation between features)
What does a box plot consist of?
Minimum, 25th percentile (Q1), median (50th percentile), 75th percentile (Q3), maximum
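The five-number summary behind a box plot can be computed directly; a minimal pure-Python sketch (the function name is illustrative, and the percentile uses linear interpolation, matching NumPy's default):

```python
def five_number_summary(xs):
    """Return (min, Q1, median, Q3, max) of a numeric sequence."""
    s = sorted(xs)
    def pct(p):
        # fractional index for percentile p, with linear interpolation
        k = (len(s) - 1) * p / 100
        lo, hi = int(k), min(int(k) + 1, len(s) - 1)
        return s[lo] + (s[hi] - s[lo]) * (k - lo)
    return (s[0], pct(25), pct(50), pct(75), s[-1])

print(five_number_summary([1, 2, 3, 4, 5, 6, 7, 8, 9]))  # -> (1, 3.0, 5.0, 7.0, 9)
```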
What is a line plot for?
It shows the relationship between two variables and is useful for analyzing trends over time.
What is a bar plot for?
It is for analyzing trends and comparing categorical data.
What is a scatterplot for?
It visualizes clusters and shows the relationship between two variables.
What are the three measures of central tendency?
Mean
Median
Mode
Which one of the three measures is affected by outliers?
Mean
What is standard deviation?
It is the square root of the variance.
It is a good way to identify outliers.
What is covariance?
It measures how much two variables vary together.
What is correlation?
It is a normalized form of covariance ranging from -1 to +1.
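Covariance and its normalized form can be sketched in a few lines of pure Python (function names are illustrative; sample covariance with the n-1 denominator is assumed):

```python
def covariance(xs, ys):
    """Sample covariance: how much two variables vary together."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def correlation(xs, ys):
    """Covariance normalized by both standard deviations -> range [-1, +1]."""
    sx = covariance(xs, xs) ** 0.5
    sy = covariance(ys, ys) ** 0.5
    return covariance(xs, ys) / (sx * sy)

print(correlation([1, 2, 3, 4], [2, 4, 6, 8]))  # perfectly linear -> ~1.0
```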
Can correlation be used to detect label leakage?
Yes. For example, hospital name shouldn't be used as a feature, because a name like "cancer hospital" may give away the label.
What are the two elements determining your model quality?
Data quality
Data reliability (label errors, missing values, duplicate values, and bad features)
How do you make sure a dataset is reliable?
Check for label errors
Check for noise in features
Check for outliers and data skew
What is normalization?
It is to transform features to be on a similar scale.
What does data skew mean?
It means the distribution is not symmetric: one tail is longer than the other, often because of outliers.
If skewness is in the target variable, you can use oversampling or undersampling.
What is scaling?
Convert floating-point feature values from their natural range into a standard range.
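A minimal sketch of the most common scaling technique, min-max scaling to [0, 1] (the function name is illustrative):

```python
def min_max_scale(xs, lo=0.0, hi=1.0):
    """Map values from their natural range onto [lo, hi]."""
    mn, mx = min(xs), max(xs)
    return [lo + (x - mn) * (hi - lo) / (mx - mn) for x in xs]

scaled = min_max_scale([10, 20, 30, 40])
# smallest value maps to 0.0, largest to 1.0, the rest proportionally in between
```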
What are the benefits of scaling?
Helps gradient descent converge faster in deep neural networks
Helps avoid NaN traps
Prevents features with wider ranges from dominating the model
What is log scaling?
When data follow a power-law distribution or contain very large values, taking the log brings them into a similar range.
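A quick sketch using `math.log1p` (log of 1 + x, which handles zeros safely; the function name is my own):

```python
import math

def log_scale(xs):
    """Compress values spanning several orders of magnitude."""
    return [math.log1p(x) for x in xs]

out = log_scale([0, 9, 99, 999])
# inputs span three orders of magnitude; outputs land roughly between 0 and 7
```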
What is clipping?
Cap all feature values above (or below) a certain threshold to that threshold.
It can be used before or after normalization.
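Clipping is a one-liner in pure Python (the function name is illustrative; `numpy.clip` does the same thing on arrays):

```python
def clip(xs, lo, hi):
    """Cap every value into the range [lo, hi]."""
    return [max(lo, min(x, hi)) for x in xs]

print(clip([-5, 0, 50, 500], 0, 100))  # -> [0, 0, 50, 100]
```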
What is Z-score?
scaled value = (value - mean) / stddev
The value is calculated as standard deviations away from the mean.
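The formula above can be sketched directly (function name is my own; population standard deviation is assumed):

```python
def z_scores(xs):
    """Scaled value = (value - mean) / stddev for each element."""
    mean = sum(xs) / len(xs)
    std = (sum((x - mean) ** 2 for x in xs) / len(xs)) ** 0.5
    return [(x - mean) / std for x in xs]

zs = z_scores([10, 20, 30, 40, 100])
# 100 sits farthest from the mean, so it gets the largest z-score
```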
What visualization or statistical techniques can be used to detect outliers?
Box plots
Z-score
Interquartile range (IQR)
Once detected, you can remove outliers, clip them, or impute them.
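A minimal sketch of IQR-based outlier detection (function name is illustrative; the conventional 1.5 × IQR fence is assumed):

```python
def iqr_outliers(xs, k=1.5):
    """Return values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    s = sorted(xs)
    def pct(p):
        i = (len(s) - 1) * p / 100
        lo, hi = int(i), min(int(i) + 1, len(s) - 1)
        return s[lo] + (s[hi] - s[lo]) * (i - lo)
    q1, q3 = pct(25), pct(75)
    iqr = q3 - q1
    return [x for x in xs if x < q1 - k * iqr or x > q3 + k * iqr]

print(iqr_outliers([10, 12, 11, 13, 12, 95]))  # -> [95]
```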
What are the purposes for data analysis and exploration?
Lead to key insights
Define a schema
What is TensorFlow Data Validation for?
Understand, validate, and monitor ML data at scale to detect data and schema anomalies.
What are the benefits of having a schema?
Enable metadata-driven preprocessing
Validate new data and catch anomalies
What are the key TFX libraries?
TF Data Validation
TF Transform: data processing and feature engineering
TF Model Analysis: model evaluation and analysis
TF Serving: Serving models
What are the uses of TFDV?
Produce a data schema
Define the baseline to detect skew or drift during training and serving.
What are the characteristics of TFDV?
Built on Apache Beam, which supports batch and streaming pipelines; it can run on Google Cloud Dataflow.
Dataflow is a managed service for data processing
Dataflow integrates with serverless Google Cloud services such as BigQuery, Cloud Storage, and Vertex AI Pipelines.
What is imbalanced data?
The classes in a dataset are not equally represented.
You can perform oversampling or undersampling.
Or, you can downsample the majority class and upweight the downsampled examples; this makes training converge faster.
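Downsampling plus upweighting can be sketched as follows (function name and example weights are my own; the idea is that each kept majority example carries the weight of the examples dropped on its behalf):

```python
import random

def downsample_and_upweight(majority, minority, factor):
    """Keep 1/factor of the majority class; weight kept examples by factor."""
    kept = random.sample(majority, len(majority) // factor)
    # (example, weight) pairs; minority examples keep weight 1
    return [(x, factor) for x in kept] + [(x, 1) for x in minority]

random.seed(0)
data = downsample_and_upweight(majority=list(range(1000)), minority=[-1, -2], factor=10)
# 100 majority examples with weight 10, plus 2 minority examples with weight 1
```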
What is dataset splitting?
Training: train the model
Validation: hyperparameter tuning
Test: evaluate the performance
How do you split a dataset for online systems?
Split the data by time as the training data is older than the serving data.
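A time-based split can be sketched like this (function name and the 80/20 cutoff are my own choices):

```python
def time_split(records, train_frac=0.8):
    """records: (timestamp, features) pairs. Older data trains, newer data evaluates."""
    ordered = sorted(records, key=lambda r: r[0])
    cut = int(len(ordered) * train_frac)
    return ordered[:cut], ordered[cut:]

train, test = time_split([(3, "c"), (1, "a"), (2, "b"), (4, "d"), (5, "e")])
# every training timestamp precedes every test timestamp, mimicking serving
```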
What are the ways to handle missing data?
Delete a row if it has more than one missing feature value
Delete a column if it has more than 50% missing data
Replace missing data with mean, median or mode
Replace missing data with most frequent category
Replace with last observation (last observation carried forward)
Use interpolation in time-series
Some ML algorithms can ignore missing values
Use machine learning to predict missing values
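Two of the strategies above, mean imputation and last observation carried forward, sketched in pure Python (function names are illustrative; `None` marks a missing value):

```python
def impute_mean(column):
    """Replace None entries with the mean of the observed values."""
    observed = [x for x in column if x is not None]
    mean = sum(observed) / len(observed)
    return [mean if x is None else x for x in column]

def locf(column):
    """Last observation carried forward."""
    out, last = [], None
    for x in column:
        last = x if x is not None else last
        out.append(last)
    return out

print(impute_mean([1.0, None, 3.0, None, 5.0]))  # -> [1.0, 3.0, 3.0, 3.0, 5.0]
print(locf([1, None, None, 4]))                  # -> [1, 1, 1, 4]
```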
What is data leakage?
Information from outside the training set (e.g., the test data or the target) is exposed to the model during training.
It leads to overfitting and overly optimistic evaluation.
What are the reasons for data leakage?
Add the target variable as your feature
Include test data in the training data
Use features that leak information about the target and won't be available at serving time
Apply preprocessing techniques (e.g., scaling statistics) to the entire dataset before splitting
What situations indicate data leakage?
The predicted output is suspiciously close to the actual output.
Features are highly correlated with the target.
How do you prevent data leakage?
Remove features that are suspiciously highly correlated with the target
Split data into test, train and validation sets
Preprocess training and test data separately
Use a cutoff value on time for time series.
Use cross-validation when you have limited data
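Preprocessing training and test data separately means fitting statistics on the training set only, then applying them to both; a minimal sketch (function name is my own; scikit-learn's StandardScaler follows the same fit/transform pattern):

```python
def fit_standardizer(train):
    """Learn mean/std from the training data only, to avoid test-set leakage."""
    mean = sum(train) / len(train)
    std = (sum((x - mean) ** 2 for x in train) / len(train)) ** 0.5
    return lambda xs: [(x - mean) / std for x in xs]

train, test = [1.0, 2.0, 3.0, 4.0], [10.0]
scale = fit_standardizer(train)               # statistics come from train only
train_scaled, test_scaled = scale(train), scale(test)
# the test point is scaled with training statistics, never its own
```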