2. Exploring Data and Building Data Pipelines Flashcards

1
Q

What is data visualization?

A

It is an exploratory data analysis technique for finding trends and outliers in data.
It supports data cleaning and feature engineering.

2
Q

What are the two ways to visualize data?

A

Univariate analysis (range & outliers)
Bivariate analysis (correlation between features)

3
Q

What does a box plot consist of?

A

Minimum, 25th percentile (Q1), 50th percentile (median), 75th percentile (Q3), maximum

4
Q

What is a line plot for?

A

It is for showing the relationship between two variables and analyzing trends.

5
Q

What is a bar plot for?

A

It is for analyzing trends and comparing categorical data.

6
Q

What is a scatterplot for?

A

Visualizing clusters and showing the relationship between two variables.
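
A minimal matplotlib sketch of these plot types, using made-up toy data:

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical toy data for illustration only
df = pd.DataFrame({
    "age": [23, 25, 31, 35, 38, 41, 44, 90],            # 90 is an outlier
    "sales": [10, 12, 15, 18, 20, 22, 25, 27],
    "category": ["a", "b", "a", "c", "b", "a", "c", "b"],
})

fig, axes = plt.subplots(2, 2, figsize=(10, 8))

# Univariate: box plot shows the range, quartiles, and outliers
axes[0, 0].boxplot(df["age"])
axes[0, 0].set_title("Box plot")

# Bivariate: line plot shows a trend across ordered observations
axes[0, 1].plot(range(len(df)), df["sales"])
axes[0, 1].set_title("Line plot")

# Bar plot compares categorical data
counts = df["category"].value_counts()
axes[1, 0].bar(counts.index, counts.values)
axes[1, 0].set_title("Bar plot")

# Scatter plot reveals clusters and the relationship between two variables
axes[1, 1].scatter(df["age"], df["sales"])
axes[1, 1].set_title("Scatter plot")

plt.tight_layout()
plt.show()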

7
Q

What are the three measures of central tendency?

A

Mean
Median
Mode

8
Q

Which one of the three measures is affected by outliers?

A

Mean
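
A quick worked example (values are made up) of why the mean is sensitive to outliers while the median is not:

import numpy as np

values = np.array([10, 11, 12, 12, 13, 1000])  # 1000 is an outlier
print(np.mean(values))    # ~176.3 -- pulled toward the outlier
print(np.median(values))  # 12.0   -- robust to the outlier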

9
Q

What is standard deviation?

A

It is the square root of the variance.
It is a good way to identify outliers.

10
Q

What is covariance?

A

It measures how much two variables vary together.

11
Q

What is correlation?

A

It is a normalized form of covariance ranging from -1 to +1.
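
A short numpy sketch contrasting the two, with made-up values:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

print(np.cov(x, y)[0, 1])       # covariance: scale-dependent, unbounded
print(np.corrcoef(x, y)[0, 1])  # correlation: normalized to [-1, +1]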

12
Q

Can correlation be used to detect label leakage?

A

Yes. For example, hospital name shouldn't be used as a feature, since a name like "cancer hospital" may give away the label.

13
Q

What are the two elements that determine your model quality?

A

Data quality
Data reliability (free of missing values, duplicate values, and bad features)

14
Q

How do you make sure a dataset is reliable?

A

Check for label errors
Check for noise in features
Check for outliers and data skew

15
Q

What is normalization?

A

It is to transform features to be on a similar scale.

16
Q

What does data skew mean?

A

It means the distribution is not symmetric; there are outliers in one tail.
If the skew is in the target variable, you can use oversampling or undersampling.

17
Q

What is scaling?

A

Convert floating-point feature values from their natural range into a standard range.

18
Q

What are the benefits of scaling?

A

Helps gradient descent converge faster in deep neural networks
Avoids NaN traps
Prevents features with wider ranges from getting disproportionate importance

19
Q

What is log scaling?

A

Used when some values follow a power law or are very large; taking the log brings them into a similar range.

20
Q

What is clipping?

A

Cap all feature values above or below a certain threshold.
It can be applied before or after normalization.

21
Q

What is Z-score?

A

scaled value = (value - mean) / stddev
The scaled value expresses how many standard deviations the raw value is from the mean.
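
A minimal numpy sketch of Z-score scaling, log scaling, and clipping together, with made-up values and hypothetical clipping bounds:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 1000.0])  # one extreme value

# Z-score scaling: standard deviations away from the mean
z = (x - x.mean()) / x.std()

# Log scaling: compresses power-law or very large values into a similar range
logged = np.log(x)

# Clipping: cap values outside a chosen range (the bounds 0..10 are arbitrary here)
clipped = np.clip(x, 0.0, 10.0)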

22
Q

What visualization or statistical techniques can be used to detect outliers?

A

Box plots
Z-score
Interquartile range (IQR)
Detected outliers can then be removed, clipped, or imputed (see the IQR sketch below).
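
A small sketch of interquartile-range detection, using the conventional 1.5 * IQR fences and made-up data:

import numpy as np

x = np.array([10, 11, 12, 12, 13, 14, 100])  # 100 is an outlier

q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(x[(x < lower) | (x > upper)])  # [100]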

23
Q

What are the purposes of data analysis and exploration?

A

Lead to key insights
Define a schema

24
Q

What is TensorFlow Data Validation for?

A

Understand, validate, and monitor ML data at scale to detect data and schema anomalies.

25
Q

What are the benefits of having a schema?

A

Enable metadata-driven preprocessing
Validate new data and catch anomalies

26
Q

What are the key TFX libraries?

A

TF Data Validation
TF Transform: data processing and feature engineering
TF Model Analysis: model evaluation and analysis
TF Serving: serving models

27
Q

What are the uses of TFDV?

A

Produce a data schema
Define the baseline to detect skew or drift during training and serving (see the sketch below)

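A minimal TFDV sketch (the CSV paths are hypothetical):

import tensorflow_data_validation as tfdv

# Compute statistics over training data and infer a baseline schema
train_stats = tfdv.generate_statistics_from_csv(data_location="train.csv")
schema = tfdv.infer_schema(statistics=train_stats)

# Validate new (serving) data against the schema to catch anomalies
serving_stats = tfdv.generate_statistics_from_csv(data_location="serving.csv")
anomalies = tfdv.validate_statistics(statistics=serving_stats, schema=schema)
tfdv.display_anomalies(anomalies)
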
28
Q

What are the characteristics of TFDV?

A

Built on Apache Beam, which supports batch and streaming pipelines
Can run on Google Cloud Dataflow, a managed data-processing service
Dataflow integrates with serverless services such as BigQuery, Cloud Storage, and Vertex AI Pipelines

29
Q

What is imbalanced data?

A

The classes in a dataset are not equally represented.
You can oversample the minority class or undersample the majority class.
Alternatively, downsample the majority class and upweight the downsampled examples; this converges faster (see the sketch below).

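A sketch of downsampling and upweighting with pandas, on a made-up imbalanced dataset:

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"feature": rng.normal(size=1000),
                   "label": (rng.random(1000) < 0.05).astype(int)})  # ~5% positives

neg, pos = df[df["label"] == 0], df[df["label"] == 1]

# Downsample the majority class by `factor`, then upweight it by the same factor
factor = 10
neg_down = neg.sample(frac=1 / factor, random_state=42).assign(weight=factor)
pos = pos.assign(weight=1)

balanced = pd.concat([neg_down, pos]).sample(frac=1, random_state=42)
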
30
Q

What is dataset splitting?

A

Training set: train the model
Validation set: hyperparameter tuning
Test set: evaluate performance

31
Q

How do you split a dataset for online systems?

A

Split the data by time, since the training data is older than the serving data.

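A sketch of both splitting styles with pandas and scikit-learn, on made-up data with a hypothetical "timestamp" column:

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({"timestamp": pd.date_range("2024-01-01", periods=100, freq="D"),
                   "x": range(100)})

# Random split into train / validation / test (60 / 20 / 20)
train, rest = train_test_split(df, test_size=0.4, random_state=42)
val, test = train_test_split(rest, test_size=0.5, random_state=42)

# Time-based split for an online system: train on older data, test on newer data
df_sorted = df.sort_values("timestamp")
cutoff = int(len(df_sorted) * 0.8)
train_t, test_t = df_sorted.iloc[:cutoff], df_sorted.iloc[cutoff:]
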
32
Q

What are the ways to handle missing data?

A

Delete a row if it has more than one missing feature value
Delete a column if more than 50% of its values are missing
Replace missing numeric data with the mean, median, or mode
Replace missing categorical data with the most frequent category
Replace with the last observation (last observation carried forward)
Use interpolation for time series
Some ML algorithms can ignore missing values
Use machine learning to predict the missing values
Several of these are shown in the sketch below.

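A pandas sketch of several of these techniques on a tiny made-up DataFrame:

import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25.0, np.nan, 31.0, 40.0],
                   "city": ["NY", "SF", None, "NY"],
                   "sales": [10.0, np.nan, 14.0, np.nan]})

df = df.dropna(thresh=df.shape[1] - 1)                # drop rows missing > 1 value
df["age"] = df["age"].fillna(df["age"].median())      # numeric: impute the median
df["city"] = df["city"].fillna(df["city"].mode()[0])  # categorical: most frequent
df["sales"] = df["sales"].interpolate()               # time series: interpolate
df["sales"] = df["sales"].ffill()                     # last observation carried forward
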
33
Q

What is data leakage?

A

Test data is exposed during training.
It leads to overfitting.

34
Q

What are the reasons for data leakage?

A

Adding the target variable as a feature
Including test data in the training data
Exposing information about the target variable after deployment
Applying preprocessing techniques to the entire dataset (before splitting)

35
Q

What are the situations that indicate data leakage?

A

The predicted output is as good as the actual output.
Features are highly correlated with the target.

36
Q

How do you prevent data leakage?

A

Drop features that are suspiciously correlated with the target
Split data into train, validation, and test sets
Preprocess training and test data separately (as in the sketch below)
Use a cutoff value on time for time series
Use cross-validation when you have limited data

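A leakage-safe preprocessing sketch with scikit-learn: the scaler is fit inside a pipeline on the training data only, so no test-set statistics leak into training:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

# fit() learns scaling statistics from X_train only; X_test is only transformed
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)
print(model.score(X_test, y_test))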