2. Exploring Data and Building Data Pipelines Flashcards
What is data visualization?
It is an exploratory data analysis technique for finding trends and outliers in data.
It helps with data cleaning and feature engineering.
What are the two ways to visualize data?
Univariate analysis (range & outliers)
Bivariate analysis (correlation between features)
What does a box plot consist of?
Minimum, 25th percentile (Q1), median (50th percentile), 75th percentile (Q3), maximum
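The five-number summary behind a box plot can be computed directly; a minimal pure-Python sketch (the function name is illustrative, and the percentile uses linear interpolation, matching NumPy's default):

```python
def five_number_summary(xs):
    """Return (min, Q1, median, Q3, max) of a numeric sequence."""
    s = sorted(xs)
    def pct(p):
        # fractional index for percentile p, with linear interpolation
        k = (len(s) - 1) * p / 100
        lo, hi = int(k), min(int(k) + 1, len(s) - 1)
        return s[lo] + (s[hi] - s[lo]) * (k - lo)
    return (s[0], pct(25), pct(50), pct(75), s[-1])

print(five_number_summary([1, 2, 3, 4, 5, 6, 7, 8, 9]))  # -> (1, 3.0, 5.0, 7.0, 9)
```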
What is a line plot for?
It shows the relationship between two variables and is useful for analyzing trends over time.
What is a bar plot for?
It is for analyzing trends and comparing categorical data.
What is a scatterplot for?
It visualizes clusters and shows the relationship between two variables.
What are the three measures of central tendency?
Mean
Median
Mode
Which one of the three measures is affected by outliers?
Mean
What is standard deviation?
It is the square root of the variance.
It is a good way to identify outliers.
What is covariance?
It measures how much two variables vary together.
What is correlation?
It is a normalized form of covariance ranging from -1 to +1.
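Covariance and its normalized form can be sketched in a few lines of pure Python (function names are illustrative; sample covariance with the n-1 denominator is assumed):

```python
def covariance(xs, ys):
    """Sample covariance: how much two variables vary together."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def correlation(xs, ys):
    """Covariance normalized by both standard deviations -> range [-1, +1]."""
    sx = covariance(xs, xs) ** 0.5
    sy = covariance(ys, ys) ** 0.5
    return covariance(xs, ys) / (sx * sy)

print(correlation([1, 2, 3, 4], [2, 4, 6, 8]))  # perfectly linear -> ~1.0
```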
Can correlation be used to detect label leakage?
Yes. For example, hospital name shouldn't be used as a feature, because a name like "cancer hospital" may give away the label.
What are the two elements determining your model quality?
Data quality
Data reliability (label errors, missing values, duplicate values, and bad features)
How do you make sure a dataset is reliable?
Check for label errors
Check for noise in features
Check for outliers and data skew
What is normalization?
It is to transform features to be on a similar scale.
What does data skew mean?
It means the distribution is not symmetric: one tail is longer than the other, often because of outliers.
If skewness is in the target variable, you can use oversampling or undersampling.
What is scaling?
Convert floating-point feature values from their natural range into a standard range.
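A minimal sketch of the most common scaling technique, min-max scaling to [0, 1] (the function name is illustrative):

```python
def min_max_scale(xs, lo=0.0, hi=1.0):
    """Map values from their natural range onto [lo, hi]."""
    mn, mx = min(xs), max(xs)
    return [lo + (x - mn) * (hi - lo) / (mx - mn) for x in xs]

scaled = min_max_scale([10, 20, 30, 40])
# smallest value maps to 0.0, largest to 1.0, the rest proportionally in between
```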
What are the benefits of scaling?
Helps gradient descent converge faster in deep neural networks
Helps avoid NaN traps
Prevents features with wider ranges from dominating the model
What is log scaling?
When data follow a power-law distribution or contain very large values, taking the log brings them into a similar range.
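A quick sketch using `math.log1p` (log of 1 + x, which handles zeros safely; the function name is my own):

```python
import math

def log_scale(xs):
    """Compress values spanning several orders of magnitude."""
    return [math.log1p(x) for x in xs]

out = log_scale([0, 9, 99, 999])
# inputs span three orders of magnitude; outputs land roughly between 0 and 7
```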
What is clipping?
Cap all feature values above (or below) a certain threshold to that threshold.
It can be used before or after normalization.
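Clipping is a one-liner in pure Python (the function name is illustrative; `numpy.clip` does the same thing on arrays):

```python
def clip(xs, lo, hi):
    """Cap every value into the range [lo, hi]."""
    return [max(lo, min(x, hi)) for x in xs]

print(clip([-5, 0, 50, 500], 0, 100))  # -> [0, 0, 50, 100]
```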
What is Z-score?
scaled value = (value - mean) / stddev
The value is calculated as standard deviations away from the mean.
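The formula above can be sketched directly (function name is my own; population standard deviation is assumed):

```python
def z_scores(xs):
    """Scaled value = (value - mean) / stddev for each element."""
    mean = sum(xs) / len(xs)
    std = (sum((x - mean) ** 2 for x in xs) / len(xs)) ** 0.5
    return [(x - mean) / std for x in xs]

zs = z_scores([10, 20, 30, 40, 100])
# 100 sits farthest from the mean, so it gets the largest z-score
```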
What visualization or statistical techniques can be used to detect outliers?
Box plots
Z-score
Interquartile range (IQR)
Once detected, you can remove outliers, clip them, or impute them.
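A minimal sketch of IQR-based outlier detection (function name is illustrative; the conventional 1.5 × IQR fence is assumed):

```python
def iqr_outliers(xs, k=1.5):
    """Return values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    s = sorted(xs)
    def pct(p):
        i = (len(s) - 1) * p / 100
        lo, hi = int(i), min(int(i) + 1, len(s) - 1)
        return s[lo] + (s[hi] - s[lo]) * (i - lo)
    q1, q3 = pct(25), pct(75)
    iqr = q3 - q1
    return [x for x in xs if x < q1 - k * iqr or x > q3 + k * iqr]

print(iqr_outliers([10, 12, 11, 13, 12, 95]))  # -> [95]
```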
What are the purposes for data analysis and exploration?
Lead to key insights
Define a schema
What is TensorFlow Data Validation for?
Understand, validate, and monitor ML data at scale to detect data and schema anomalies.
What are the benefits of having a schema?
Enable metadata-driven preprocessing
Validate new data and catch anomalies
What are the key TFX libraries?
TF Data Validation
TF Transform: data processing and feature engineering
TF Model Analysis: model evaluation and analysis
TF Serving: Serving models
What are the uses of TFDV?
Produce a data schema
Define the baseline to detect skew or drift during training and serving.
What are the characteristics of TFDV?
Built on Apache Beam, which supports batch and streaming pipelines; it can run on Google Cloud Dataflow.
Dataflow is a managed service for data processing
Dataflow integrates with serverless Google Cloud services such as BigQuery, Cloud Storage, and Vertex AI Pipelines.
What is imbalanced data?
The classes in a dataset are not equally represented.
You can perform oversampling or undersampling.
Or, you can downsample the majority class and upweight the downsampled examples; this makes training converge faster.
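Downsampling plus upweighting can be sketched as follows (function name and example weights are my own; the idea is that each kept majority example carries the weight of the examples dropped on its behalf):

```python
import random

def downsample_and_upweight(majority, minority, factor):
    """Keep 1/factor of the majority class; weight kept examples by factor."""
    kept = random.sample(majority, len(majority) // factor)
    # (example, weight) pairs; minority examples keep weight 1
    return [(x, factor) for x in kept] + [(x, 1) for x in minority]

random.seed(0)
data = downsample_and_upweight(majority=list(range(1000)), minority=[-1, -2], factor=10)
# 100 majority examples with weight 10, plus 2 minority examples with weight 1
```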
What is dataset splitting?
Training: train the model
Validation: hyperparameter tuning
Test: evaluate the performance
How do you split a dataset for online systems?
Split the data by time as the training data is older than the serving data.
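A time-based split can be sketched like this (function name and the 80/20 cutoff are my own choices):

```python
def time_split(records, train_frac=0.8):
    """records: (timestamp, features) pairs. Older data trains, newer data evaluates."""
    ordered = sorted(records, key=lambda r: r[0])
    cut = int(len(ordered) * train_frac)
    return ordered[:cut], ordered[cut:]

train, test = time_split([(3, "c"), (1, "a"), (2, "b"), (4, "d"), (5, "e")])
# every training timestamp precedes every test timestamp, mimicking serving
```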
What are the ways to handle missing data?
Delete a row if it has more than one missing feature value
Delete a column if it has more than 50% missing data
Replace missing data with mean, median or mode
Replace missing data with most frequent category
Replace with last observation (last observation carried forward)
Use interpolation in time-series
Some ML algorithms can ignore missing values
Use machine learning to predict missing values
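Two of the strategies above, mean imputation and last observation carried forward, sketched in pure Python (function names are illustrative; `None` marks a missing value):

```python
def impute_mean(column):
    """Replace None entries with the mean of the observed values."""
    observed = [x for x in column if x is not None]
    mean = sum(observed) / len(observed)
    return [mean if x is None else x for x in column]

def locf(column):
    """Last observation carried forward."""
    out, last = [], None
    for x in column:
        last = x if x is not None else last
        out.append(last)
    return out

print(impute_mean([1.0, None, 3.0, None, 5.0]))  # -> [1.0, 3.0, 3.0, 3.0, 5.0]
print(locf([1, None, None, 4]))                  # -> [1, 1, 1, 4]
```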
What is data leakage?
Information from outside the training set (e.g., the test data or the target) is exposed to the model during training.
It leads to overfitting and overly optimistic evaluation.
What are the reasons for data leakage?
Add the target variable as your feature
Include test data in the training data
Use features that leak information about the target and won't be available at serving time
Apply preprocessing techniques (e.g., scaling statistics) to the entire dataset before splitting
What situations indicate data leakage?
The predicted output is suspiciously close to the actual output.
Features are highly correlated with the target.
How do you prevent data leakage?
Remove features that are suspiciously highly correlated with the target
Split data into test, train and validation sets
Preprocess training and test data separately
Use a cutoff value on time for time series.
Use cross-validation when you have limited data
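Preprocessing training and test data separately means fitting statistics on the training set only, then applying them to both; a minimal sketch (function name is my own; scikit-learn's StandardScaler follows the same fit/transform pattern):

```python
def fit_standardizer(train):
    """Learn mean/std from the training data only, to avoid test-set leakage."""
    mean = sum(train) / len(train)
    std = (sum((x - mean) ** 2 for x in train) / len(train)) ** 0.5
    return lambda xs: [(x - mean) / std for x in xs]

train, test = [1.0, 2.0, 3.0, 4.0], [10.0]
scale = fit_standardizer(train)               # statistics come from train only
train_scaled, test_scaled = scale(train), scale(test)
# the test point is scaled with training statistics, never its own
```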