Chapter 3: Data Exploration Flashcards
What is in a data quality report?
Tabular reports that describe the characteristics of each feature in an ABT using standard measures of central tendency and variation
Which data visualizations accompany tabular reports?
- A histogram for each continuous feature (features measured on a quantitative scale)
- A bar plot for each categorical feature (features measured on a qualitative scale)
What are frequency histograms?
They consist of a set of rectangles (bars) whose heights reflect the counts, or frequencies, of the classes present in the given data
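As a rough illustration of where the bar heights come from, the sketch below counts values into equal-width bins. The function name `histogram_counts` and the sample ages are made up for this example, and equal-width binning is only one of several possible binning choices.

```python
def histogram_counts(values, n_bins):
    """Count how many values fall into each of n_bins equal-width bins.

    Assumes the values are not all identical (the bin width would be zero).
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value goes into the last bin instead of opening a new one.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts


ages = [21, 22, 25, 31, 33, 34, 35, 47, 52, 60]  # made-up sample
counts = histogram_counts(ages, 4)

# Text rendering: one row per bin, one '#' per instance in that bin.
for i, c in enumerate(counts):
    print(f"bin {i}: " + "#" * c)
```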
How do you get to know categorical features?
- Examine the mode, 2nd mode, mode % and 2nd mode %
- These tell us the most common levels within the features and will identify if any levels dominate the dataset
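A minimal sketch of how the mode, 2nd mode and their percentages could be computed for one categorical feature. The helper name `mode_summary` and the marital status values are assumptions made for illustration.

```python
from collections import Counter


def mode_summary(values):
    """Return (level, share) pairs for the two most common levels."""
    counts = Counter(values)
    total = len(values)
    return [(level, count / total) for level, count in counts.most_common(2)]


marital_status = ["married", "single", "married", "divorced", "married", "single"]
summary = mode_summary(marital_status)

for rank, (level, share) in enumerate(summary, start=1):
    print(f"mode {rank}: {level} ({share:.0%})")
```

A mode share close to 100% would suggest that one level dominates the feature.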
How do you get to know continuous features?
- Examine the mean and standard deviation
- They give a sense of the central tendency and variation of the feature's values within the dataset
- Examine the minimum and maximum (and other quartiles)
- They will help us understand the range that is possible for each feature
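The summary statistics above can be sketched with Python's standard `statistics` module. The function name `describe` and the income values are illustrative assumptions; note that quartile conventions differ between tools, so other software may report slightly different Q1 and Q3 values.

```python
import statistics


def describe(values):
    """Summarise a continuous feature: range, quartiles, mean and sample sd."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),  # sample standard deviation (n - 1)
    }


incomes = [20, 25, 30, 35, 40, 45, 50]  # made-up values, in thousands
stats = describe(incomes)

for name, value in stats.items():
    print(f"{name}: {value:.2f}")
```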
What are the types of histogram characteristics?
- Uniform
- Normal (unimodal)
- Unimodal (skewed right)
- Unimodal (skewed left)
- Exponential
- Multimodal
What is a uniform distribution?
- It indicates that a feature is equally likely to take a value in any of the ranges present
- Sometimes it shows that a descriptive feature contains an ID rather than a measure of something more interesting
What is a normal (unimodal) distribution?
- It has a strong tendency toward a central value and symmetrical variation on either side of that central value
- It is called unimodal because of the single peak around the central tendency
- Many naturally occurring phenomena approximately follow a normal distribution
What is the skew when the data contains some very high values?
Right skew (positive skew)
What is the skew when the data contains some very low values?
Left skew (negative skew)
What is the mode and median relationship during skews?
- Right skewed: mode < median (and typically median < mean)
- Left skewed: mode > median (and typically median > mean)
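A small worked example of these orderings, using a made-up sample. The ordering is a typical pattern for skewed unimodal data rather than a guarantee for every dataset.

```python
import statistics

# Made-up right-skewed sample: most values are small, a few are very high.
right_skewed = [1, 2, 2, 2, 3, 3, 4, 5, 9, 20]

mode = statistics.mode(right_skewed)
median = statistics.median(right_skewed)
mean = statistics.mean(right_skewed)
print(mode, median, mean)

# Mirroring the sample around zero turns the right skew into a left skew,
# which reverses the ordering.
left_skewed = [-v for v in right_skewed]
left_mode = statistics.mode(left_skewed)
left_median = statistics.median(left_skewed)
left_mean = statistics.mean(left_skewed)
print(left_mode, left_median, left_mean)
```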
What is an exponential distribution?
- The likelihood of low values occurring is very high but diminishes rapidly for higher values
- It is a clear warning sign that outliers are likely
What is a multimodal distribution?
- It has two or more very commonly occurring ranges of values that are clearly separated
- Bi-modal distribution can be thought of as two normal distributions pushed together
- Occurs when a feature contains a measurement made across a number of distinct groups
Why is a multimodal distribution a cause for both caution and optimism?
- Caution because measures of central tendency and variation tend to break down for multimodal data
- Optimism because if we are lucky, the separate peaks in the distribution will be associated with the different target levels we are trying to predict
What happens when a distribution has different means but identical standard deviations?
The curve keeps the same shape but shifts left or right along the horizontal axis, so that it is centred on the new mean
What happens when a distribution has identical means but different standard deviations?
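The curve stays centred on the same value but changes shape: a smaller standard deviation gives a taller, narrower peak, and a larger one gives a lower, wider peak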
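These two effects can be checked numerically for normal distributions with `statistics.NormalDist`. The particular means and standard deviations below are arbitrary values chosen for illustration.

```python
from statistics import NormalDist

base = NormalDist(mu=0, sigma=1)
shifted = NormalDist(mu=3, sigma=1)  # different mean, same sd
wider = NormalDist(mu=0, sigma=2)    # same mean, different sd

# Peak height is the density at the mean.
base_peak = base.pdf(0)
shifted_peak = shifted.pdf(3)
wider_peak = wider.pdf(0)

# Changing only the mean moves the curve sideways; the peak height is unchanged.
print(f"{base_peak:.4f} vs {shifted_peak:.4f}")
# Doubling the sd halves the peak height and spreads the curve out.
print(f"{base_peak:.4f} vs {wider_peak:.4f}")
```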
What does the 68-95-99.7 rule state?
- For normally distributed data, about 68% of observations will be within one standard deviation of the mean (mean ± sd)
- About 95% of observations will be within two standard deviations of the mean (mean ± 2sd)
- About 99.7% of observations will be within three standard deviations of the mean (mean ± 3sd)
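The three percentages can be reproduced from the normal cumulative distribution function. This is a short sketch using `statistics.NormalDist`; the helper name `share_within` is an assumption made for illustration.

```python
from statistics import NormalDist


def share_within(k, mu=0.0, sigma=1.0):
    """Share of a normal distribution lying within k standard deviations of the mean."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)


shares = {k: share_within(k) for k in (1, 2, 3)}
for k, share in shares.items():
    print(f"within {k} sd: {share:.1%}")
```

The result does not depend on the mean or standard deviation chosen, only on the data being normally distributed.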
What is a data quality issue?
It is loosely defined as anything unusual about the data in an ABT
What are the most common data quality issues?
- Missing values
- Irregular cardinality
- Outliers
What are the data quality issues we identify from a data quality report?
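- Issues due to invalid data (errors in how the data was collected or how the ABT was generated)
- Issues due to valid data (unusual but genuine values that arise for domain-specific reasons)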
How do you handle missing values?
- Approach 1: Drop any features that have missing values
- Approach 2: Apply complete case analysis (delete records)
- Approach 3: Derive a missing indicator feature from features with missing value
- Approach 4: Impute the missing values
What are the effects of approach 1?
- It can result in massive, and frequently needless loss of data
- Only features missing in excess of 60% of their values should be considered for removal
- An alternative is to derive a missing indicator feature from them, which could be categorical
What is approach 2?
- We delete instances that are missing one or more feature values
- This results in significant amounts of data loss and can introduce a bias to the dataset
- This should rarely be used and only when a data instance is missing values for multiple features
- It is recommended to remove instances that are missing the value in the target feature
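A minimal sketch of complete case analysis, assuming each instance is a dict and a missing value is stored as `None`. The function names, feature names and records are hypothetical.

```python
def complete_cases(rows):
    """Keep only the instances that have no missing (None) feature values."""
    return [row for row in rows if all(v is not None for v in row.values())]


def drop_missing_target(rows, target):
    """Keep only the instances whose target feature is present."""
    return [row for row in rows if row[target] is not None]


rows = [
    {"age": 34, "income": 52000, "churn": "yes"},
    {"age": None, "income": 48000, "churn": "no"},
    {"age": 41, "income": None, "churn": None},
]

kept = complete_cases(rows)                   # strict: any missing value removes the row
labelled = drop_missing_target(rows, "churn")  # milder: only a missing target removes the row

print(len(rows), len(kept), len(labelled))
```

Even in this tiny example the strict version discards two of three instances, which shows how quickly complete case analysis can lose data.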
Explain approach 3
- This could be a categorical feature that flags the missing data as a new level (e.g. "unknown" for marital status)
- Or a binary feature that flags whether the value was present or missing (T/F)
- When missing indicator features are used the original feature is usually discarded
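Both variants can be sketched as follows, again assuming instances are dicts with `None` marking a missing value. The function names and the `_missing` suffix are illustrative choices, not a fixed convention.

```python
def add_missing_indicator(rows, feature):
    """Add a binary feature recording whether `feature` was missing."""
    return [{**row, feature + "_missing": row[feature] is None} for row in rows]


def fill_unknown(rows, feature, label="unknown"):
    """Replace missing values of a categorical feature with a new level."""
    return [
        {**row, feature: label if row[feature] is None else row[feature]}
        for row in rows
    ]


rows = [{"marital_status": "married"}, {"marital_status": None}]

flagged = add_missing_indicator(rows, "marital_status")
relabelled = fill_unknown(rows, "marital_status")

print(flagged)
print(relabelled)
```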
Explain approach 4
- Imputation replaces missing feature values with a plausible estimated value based on the feature values that are present
- Most commonly you replace missing values with a measure of central tendency of that feature
- We should be reluctant to use imputation on features missing in excess of 30% of their values
- Strongly recommend against using imputation on features missing in excess of 50% of their values
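A minimal sketch of imputation with a measure of central tendency, preceded by a check of how much of the feature is missing. The function names and the age values are assumptions; the median is used here, but the mean (or the mode for a categorical feature) could be passed in instead.

```python
import statistics


def missing_share(values):
    """Fraction of values that are missing (None)."""
    return sum(v is None for v in values) / len(values)


def impute(values, measure=statistics.median):
    """Replace missing values with a central-tendency measure of the present ones."""
    present = [v for v in values if v is not None]
    fill = measure(present)
    return [fill if v is None else v for v in values]


ages = [23, None, 35, 41, None, 29, 52, 38, 44, 31]  # made-up feature values

share = missing_share(ages)
if share <= 0.30:  # below the 30% level at which we become reluctant to impute
    imputed = impute(ages)
else:
    imputed = ages

print(share, imputed)
```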
What is the easiest way to handle outliers?
Use clamp transformation
What does clamp transformation do?
It sets every value above an upper threshold to that upper threshold and every value below a lower threshold to that lower threshold, which removes the offending outliers
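A clamp transformation is short to write. In this sketch the thresholds are passed in by hand and the values are made up; in practice they would come from one of the outlier rules or from domain knowledge.

```python
def clamp(values, lower, upper):
    """Pull values outside [lower, upper] back to the nearest threshold."""
    return [min(max(v, lower), upper) for v in values]


readings = [-5, 3, 8, 120]
clamped = clamp(readings, lower=0, upper=100)
print(clamped)  # [0, 3, 8, 100]
```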
How do you identify outliers?
Three popular rules of thumb:
- Values more than 1.5 × IQR above Q3 or below Q1. Finding the quartiles requires sorting, so this is relatively expensive
- Values more than 2 standard deviations from the mean. Both the mean and the standard deviation can usually be computed easily
- The top 2% and bottom 2% of the ordered data. Trivial to implement, but the cut-off is an arbitrary heuristic that is hard to defend scientifically
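The first two rules of thumb can be sketched with the standard `statistics` module. The function names and the sample are illustrative, and the quartiles use Python's default convention, so other tools may place the IQR bounds slightly differently.

```python
import statistics


def iqr_bounds(values, k=1.5):
    """Lower and upper bounds at k * IQR below Q1 and above Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def sd_bounds(values, k=2.0):
    """Lower and upper bounds at k sample standard deviations from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return mean - k * sd, mean + k * sd


def outside(values, bounds):
    """Values that fall outside the (lower, upper) bounds."""
    lower, upper = bounds
    return [v for v in values if v < lower or v > upper]


values = [10, 12, 12, 13, 13, 14, 15, 15, 16, 18, 95]  # made-up sample

iqr_outliers = outside(values, iqr_bounds(values))
sd_outliers = outside(values, sd_bounds(values))
print(iqr_outliers, sd_outliers)
```

The bounds returned by either function could also serve as the lower and upper thresholds for a clamp transformation.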
What is a scatter plot?
- It is based on 2 axes, horizontal and vertical axis
- Each instance is represented by a point on the plot determined by the values for that instance of the two features involved
What is a scatter plot matrix?
- Scatter plot matrix (SPLOM) shows scatter plots for a whole collection of features arranged into a matrix
- It is useful for exploring the relationship between groups of features
- It can be read as a visual counterpart to the correlation matrix
What else can we do in addition to visually inspecting scatter plots?
Calculate formal measures of the relationship between two continuous features using covariance and correlation
What range does covariance fall into?
- (−∞, +∞): it is unbounded, and its magnitude depends on the units of the two features
- Negative values indicate a negative relationship
- Positive values indicate a positive relationship
- Values near zero indicate that there is little to no linear relationship
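A short sketch of sample covariance and Pearson correlation written out by hand. The function names and the height and weight values are made up for illustration. It also shows why covariance is unbounded: rescaling a feature changes the covariance, while the correlation stays within [-1, 1] and is unaffected.

```python
import math


def covariance(xs, ys):
    """Sample covariance of two equal-length lists (divides by n - 1)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def correlation(xs, ys):
    """Pearson correlation: covariance divided by the product of the two sds."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))


heights_cm = [150, 160, 170, 180, 190]
weights_kg = [50, 58, 69, 77, 91]
heights_m = [h / 100 for h in heights_cm]

cov_cm = covariance(heights_cm, weights_kg)
cov_m = covariance(heights_m, weights_kg)  # same data, different unit
corr_cm = correlation(heights_cm, weights_kg)
corr_m = correlation(heights_m, weights_kg)

print(cov_cm, cov_m)
print(round(corr_cm, 4), round(corr_m, 4))
```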