2. Exploring Data and Building Data Pipelines Flashcards
What is data visualization?
It is a data exploration technique for finding trends and outliers in data.
It helps with data cleaning and feature engineering.
What are the two ways to visualize data?
Univariate analysis (range & outliers)
Bivariate analysis (correlation between features)
What does a box plot consist of?
Minimum, 25th percentile (first quartile), 50th percentile (median), 75th percentile (third quartile), and maximum.
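A minimal sketch of the five values a box plot summarizes, using only the Python standard library. The function name `five_number_summary` and the sample data are illustrative assumptions, and quartile values can differ slightly between tools depending on the interpolation method (here, the "inclusive" method).

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    ordered = sorted(values)
    # With n=4, statistics.quantiles returns the three quartile cut points.
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, median, q3, ordered[-1]

sample = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # hypothetical data
summary = five_number_summary(sample)
print(summary)  # (1, 3.0, 5.0, 7.0, 9)
```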
What is a line plot for?
It shows the relationship between two variables and helps analyze trends.
What is a bar plot for?
It is for analyzing trends and comparing categorical data.
What is a scatterplot for?
It visualizes clusters and shows the relationship between two variables.
What are the three measures of central tendency?
Mean
Median
Mode
Which one of the three measures is affected by outliers?
Mean
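To illustrate why the mean is the measure most affected by outliers, the sketch below adds one extreme value to a small list and compares the mean and median before and after. The salary figures are made up for illustration.

```python
import statistics

salaries = [40, 42, 45, 47, 50]   # hypothetical values, in thousands
with_outlier = salaries + [1000]  # one extreme value added

mean_before = statistics.mean(salaries)
mean_after = statistics.mean(with_outlier)
median_before = statistics.median(salaries)
median_after = statistics.median(with_outlier)

# The mean jumps from 44.8 to 204, while the median only moves from 45 to 46.
print(mean_before, mean_after)
print(median_before, median_after)
```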
What is standard deviation?
It is the square root of the variance.
It is a good way to identify outliers.
What is covariance?
It measures how much two variables vary together: positive when they tend to increase together, negative when one tends to decrease as the other increases.
What is correlation?
It is a normalized form of covariance ranging from -1 to +1.
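The relationship between covariance and correlation can be sketched as follows. This assumes the Pearson correlation and the sample (n - 1) form of covariance; the function names and data are illustrative. Rescaling one variable changes the covariance but leaves the correlation unchanged, which is what "normalized" means here.

```python
import math

def covariance(xs, ys):
    """Sample covariance (divides by n - 1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Pearson correlation: covariance divided by both standard deviations."""
    std_x = math.sqrt(covariance(xs, xs))
    std_y = math.sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (std_x * std_y)

hours = [1, 2, 3, 4, 5]                  # hypothetical study hours
score = [52, 55, 61, 64, 70]             # hypothetical test scores
score_scaled = [s * 100 for s in score]  # same scores in different units

print(covariance(hours, score), covariance(hours, score_scaled))
print(correlation(hours, score), correlation(hours, score_scaled))
```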
Can correlation be used to detect label leakage?
Yes. A feature that correlates unusually strongly with the label is a warning sign. For example, a hospital name should not be used as a feature, because the name (e.g., a cancer hospital) may give away the label.
What are the two elements determining your model quality?
Data Quality
Reliability (missing values, duplicate values and bad features)
How do you make sure a dataset is reliable?
Check for label errors
Check for noise in features
Check for outliers and data skew
What is normalization?
It transforms features so that they are on a similar scale.
What does data skew mean?
It means the distribution of the data is not symmetric the way a normal curve is; one tail is longer than the other, often because of outliers.
If skewness is in the target variable, you can use oversampling or undersampling.
What is scaling?
Convert floating-point feature values from their natural range into a standard range.
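One common way to do this is min-max scaling to the range [0, 1]. The sketch below is one possible implementation, not the only form of scaling; `min_max_scale` and the sample ages are illustrative assumptions.

```python
def min_max_scale(values):
    """Linearly map values so the smallest becomes 0.0 and the largest 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical; avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

ages = [18, 30, 45, 60, 78]  # hypothetical feature values
scaled_ages = min_max_scale(ages)
print(scaled_ages)
```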
What are the benefits of scaling?
Helps gradient descent converge better in deep neural networks
Helps avoid NaN traps
Keeps the model from giving too much importance to features with wider ranges
What is log scaling?
It is used when the data follows a power-law distribution or contains some very large values. Taking the logarithm compresses the values into a narrower range.
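A small sketch of log scaling on values that span several orders of magnitude. It uses log(1 + x) so that zero stays defined, which is an assumption of this example rather than a requirement of the technique; the view counts are made up.

```python
import math

def log_scale(values):
    """Apply log(1 + x) to each value; assumes non-negative inputs."""
    return [math.log1p(v) for v in values]

views = [10, 100, 1_000, 1_000_000]  # hypothetical counts
scaled_views = log_scale(views)

# The raw values differ by a factor of 100,000; the scaled ones by less than 10.
print(scaled_views)
```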
What is clipping?
Cap all feature values above or below a certain threshold to that threshold.
It can be used before or after normalization.
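Clipping can be sketched in a few lines. The bounds of 0 and 40 and the temperature readings are arbitrary values chosen for illustration.

```python
def clip(values, lower, upper):
    """Replace values below `lower` with `lower` and above `upper` with `upper`."""
    return [min(max(v, lower), upper) for v in values]

temps = [-40, 12, 18, 25, 30, 95]  # hypothetical readings
clipped = clip(temps, lower=0, upper=40)
print(clipped)  # [0, 12, 18, 25, 30, 40]
```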
What is Z-score?
scaled value = (value - mean) / stddev
The scaled value is the number of standard deviations the original value lies from the mean.
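The formula can be applied directly, as in the sketch below. It assumes the population standard deviation (`statistics.pstdev`); using the sample standard deviation instead would give slightly different scores. The data is chosen so the mean is 5 and the standard deviation is 2.

```python
import statistics

def z_scores(values):
    """Return (value - mean) / stddev for each value."""
    mean = statistics.mean(values)
    stddev = statistics.pstdev(values)  # population standard deviation
    return [(v - mean) / stddev for v in values]

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values: mean 5, stddev 2
scores = z_scores(data)
print(scores)  # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```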
What visualization or statistical techniques can be used to detect outliers?
Box plots
Z-score
Clipping
Interquartile range
You can remove outliers or impute them
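As one illustration of the interquartile range approach, the sketch below flags values that fall more than 1.5 × IQR outside the quartiles. The 1.5 multiplier is a common convention rather than something stated above, and the function name and readings are illustrative.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

readings = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102]  # hypothetical data
print(iqr_outliers(readings))  # [102]
```

The flagged values can then be removed or replaced with an imputed value such as the median.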
What are the purposes for data analysis and exploration?
Lead to key insights
Define a schema
What is TensorFlow Data Validation for?
It is for understanding, validating, and monitoring ML data at scale in order to detect data and schema anomalies.