2. Exploring Data and Building Data Pipelines Flashcards
What is data visualization?
It is a data exploration technique used to find trends and outliers in data.
It helps with data cleaning and feature engineering.
What are the two ways to visualize data?
Univariate analysis (range & outliers)
Bivariate analysis (correlation between features)
What does a box plot consist of?
Minimum, 25th percentile (first quartile), 50th percentile (median), 75th percentile (third quartile), and maximum
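The five values a box plot draws can be computed directly. The sketch below is illustrative only: the function name `five_number_summary` and the sample data are assumptions, and quartile conventions differ between tools, so other libraries may return slightly different Q1 and Q3 values.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    ordered = sorted(values)
    # n=4 splits the data into quartiles; "inclusive" treats the sample
    # minimum and maximum as the 0th and 100th percentiles.
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, median, q3, ordered[-1]

summary = five_number_summary([1, 2, 3, 4, 5, 6, 7, 8, 9])
print(summary)  # (1, 3.0, 5.0, 7.0, 9)
```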
What is a line plot for?
It is for showing the relationship between two variables and analyzing trends.
What is a bar plot for?
It is for analyzing trends and comparing categorical data.
What is a scatterplot for?
It is for visualizing clusters and showing the relationship between two variables.
What are the three measures of central tendency?
Mean
Median
Mode
Which of the three measures is most affected by outliers?
Mean
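A small example shows why the mean is the sensitive one. The numbers below are made up for illustration; adding a single extreme value moves the mean a long way while the median barely changes.

```python
import statistics

# Hypothetical values, e.g. salaries in thousands.
salaries = [40, 42, 45, 47, 50]
with_outlier = salaries + [1000]

mean_before = statistics.mean(salaries)          # 44.8
mean_after = statistics.mean(with_outlier)       # 204
median_before = statistics.median(salaries)      # 45
median_after = statistics.median(with_outlier)   # 46.0

print(mean_before, mean_after)
print(median_before, median_after)
```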
What is standard deviation?
It is the square root of the variance.
It is a good way to identify outliers: values that lie many standard deviations from the mean are candidates.
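As a rough sketch of how standard deviation can flag outliers, the code below computes the population standard deviation and marks values whose distance from the mean exceeds a chosen number of standard deviations. The helper names and the threshold of 2.0 are assumptions for illustration, not a fixed rule.

```python
import math

def std_dev(values):
    """Population standard deviation: square root of the variance."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return math.sqrt(variance)

def find_outliers(values, threshold=2.0):
    """Return values more than `threshold` standard deviations from the mean."""
    mean = sum(values) / len(values)
    sd = std_dev(values)
    if sd == 0:
        return []  # all values identical, nothing can be an outlier
    return [v for v in values if abs(v - mean) / sd > threshold]

data = [10, 12, 11, 13, 12, 11, 10, 12, 11, 50]
outliers = find_outliers(data)
print(outliers)  # [50]
```

Note that an extreme value also inflates the standard deviation itself, so this check can miss outliers in very small samples.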
What is covariance?
It measures how much two variables vary together.
What is correlation?
It is a normalized form of covariance ranging from -1 to +1.
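The relationship between the two can be sketched in a few lines: correlation is covariance divided by the product of the two standard deviations. The function names and sample data here are illustrative assumptions, and the population (divide by n) form is used.

```python
import math

def covariance(xs, ys):
    """Population covariance of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

def correlation(xs, ys):
    """Pearson correlation: covariance normalized by both standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

hours = [1, 2, 3, 4, 5]
scores = [52, 54, 56, 58, 60]   # rises perfectly linearly with hours

cov = covariance(hours, scores)
corr_up = correlation(hours, scores)          # close to +1
corr_down = correlation(hours, scores[::-1])  # close to -1
print(cov, corr_up, corr_down)
```

Covariance depends on the units of the variables, while correlation does not, which is why correlation is easier to compare across feature pairs.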
Can correlation be used to detect label leakage?
Yes. For example, a hospital name should not be used as a feature, because the name (such as a cancer hospital) may give away the label.
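One way this check might look in practice: compute each feature's correlation with the label and flag anything suspiciously close to +1 or -1. The feature names, the data, and the 0.95 threshold below are all hypothetical; a flagged feature is only a candidate for leakage and still needs manual review.

```python
import math

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def suspicious_features(features, labels, threshold=0.95):
    """Return names of features whose |correlation| with the label exceeds threshold."""
    return [name for name, column in features.items()
            if abs(pearson(column, labels)) > threshold]

labels = [1, 0, 1, 0, 1, 0]
features = {
    "age": [63, 45, 58, 61, 40, 52],
    # Hypothetical encoding of the hospital name from the example above.
    "is_cancer_hospital": [1, 0, 1, 0, 1, 0],
}
flagged = suspicious_features(features, labels)
print(flagged)  # ['is_cancer_hospital']
```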
What are the two elements that determine your model's quality?
Data Quality
Reliability (missing values, duplicate values and bad features)
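Two of the reliability problems named above, missing values and duplicate values, are easy to count. This is a minimal sketch that assumes each row is a dictionary and that a missing value is stored as `None`; the function name and sample rows are made up for illustration.

```python
def reliability_report(rows):
    """Count rows that contain a missing value and rows that exactly repeat an earlier row."""
    rows_with_missing = sum(
        1 for row in rows if any(value is None for value in row.values())
    )
    seen = set()
    duplicate_rows = 0
    for row in rows:
        key = tuple(sorted(row.items()))  # order-independent fingerprint of the row
        if key in seen:
            duplicate_rows += 1
        else:
            seen.add(key)
    return {
        "rows": len(rows),
        "rows_with_missing": rows_with_missing,
        "duplicate_rows": duplicate_rows,
    }

rows = [
    {"age": 34, "label": 1},
    {"age": None, "label": 0},   # missing feature value
    {"age": 34, "label": 1},     # exact duplicate of the first row
    {"age": 51, "label": 0},
]
report = reliability_report(rows)
print(report)
```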
How do you make sure a dataset is reliable?
Check for label errors
Check for noise in features
Check for outliers and data skew
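Skew can be checked numerically as well as visually. The sketch below uses the moment-based skewness coefficient (third central moment divided by the variance raised to the power 1.5); it assumes that "data skew" here means asymmetry in a feature's distribution, and the function name and data are illustrative.

```python
def skewness(values):
    """Moment-based skewness: 0 for symmetric data, positive for a long right tail."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # variance
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    if m2 == 0:
        return 0.0  # constant data has no skew
    return m3 / m2 ** 1.5

symmetric = skewness([1, 2, 3, 4, 5])
right_skewed = skewness([1, 1, 1, 1, 10])  # one large outlier pulls the tail right
print(symmetric, right_skewed)
```

A single outlier is enough to produce a strongly positive or negative value, which is why outliers and skew are usually checked together.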
What is normalization?
It transforms features so that they are on a similar scale.
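Min-max scaling is one common way to do this, shown here as a minimal sketch; it is not the only normalization technique, and the function name and sample values are assumptions. Each value is mapped into the range 0 to 1 based on the feature's minimum and maximum.

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0.0 and the maximum to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant feature: avoid division by zero
    return [(v - lo) / (hi - lo) for v in values]

ages = [20, 30, 40, 60]
scaled = min_max_scale(ages)
print(scaled)  # [0.0, 0.25, 0.5, 1.0]
```

After scaling, a feature measured in years and one measured in dollars both fall in the same range, so neither dominates simply because of its units.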