Chapter 3: Data Exploration Flashcards
What is in a data quality report?
Tabular reports that describe the characteristics of each feature in an analytics base table (ABT) using standard measures of central tendency and variation
Which data visualizations accompany tabular reports?
- A histogram for each continuous feature (one measured on a quantitative scale)
- A bar plot for each categorical feature (one measured on a qualitative scale)
What are frequency histograms?
They consist of a set of rectangles (bars) whose heights reflect the counts, or frequencies, of the classes (bins) present in the data
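A minimal sketch of the counting behind a frequency histogram, assuming integer values and equal-width bins. The `ages` data and the `frequency_histogram` helper are hypothetical, made up for illustration.

```python
from collections import Counter

def frequency_histogram(values, bin_width):
    """Count how many integer values fall into each equal-width bin."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    # Sort by bin start so the bars appear in order.
    return dict(sorted(counts.items()))

# Hypothetical ages, used only for illustration.
BIN_WIDTH = 10
ages = [21, 23, 25, 31, 34, 35, 38, 42, 47, 58]
histogram = frequency_histogram(ages, BIN_WIDTH)

for bin_start, count in histogram.items():
    # Each text bar is as long as the count for its bin.
    print(f"{bin_start:>3}-{bin_start + BIN_WIDTH - 1:<3} {'#' * count}")
```

Each key is the lower edge of a bin and each value is the height of the corresponding bar.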
How do you get to know categorical features?
- Examine the mode, 2nd mode, mode % and 2nd mode %
- These tell us the most common levels within the features and will identify if any levels dominate the dataset
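As a rough sketch, the mode, 2nd mode, and their percentages can be read off a frequency count. The `marital_status` values and the `mode_summary` helper below are assumptions for illustration only.

```python
from collections import Counter

def mode_summary(levels):
    """Return (level, percentage) pairs for the two most frequent levels."""
    total = len(levels)
    top_two = Counter(levels).most_common(2)
    return [(level, 100 * count / total) for level, count in top_two]

# Hypothetical categorical feature, for illustration only.
marital_status = ["married"] * 6 + ["single"] * 3 + ["divorced"]
summary = mode_summary(marital_status)
print(summary)
```

A mode % close to 100 would suggest that a single level dominates the feature.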
How do you get to know continuous features?
- Examine the mean and standard deviation
- They give a sense of the central tendency and variation of the feature's values in the dataset
- Examine the minimum, maximum, and quartiles
- They will help us understand the range that is possible for each feature
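One possible way to collect these measures, using only Python's standard `statistics` module. The `incomes` values and the `describe_continuous` helper are hypothetical, and the quartiles use the module's default (exclusive) method, which may differ slightly from other tools.

```python
import statistics

def describe_continuous(values):
    """Summarise a continuous feature: centre, spread, and range."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return {
        "mean": statistics.mean(values),
        "std_dev": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
    }

# Hypothetical feature values, for illustration only.
incomes = [20, 25, 30, 35, 40, 45, 50]
report = describe_continuous(incomes)
for name, value in report.items():
    print(f"{name:>8}: {value:.2f}")
```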
What are the types of histogram characteristics?
- Uniform
- Normal (unimodal)
- Unimodal (skewed right)
- Unimodal (skewed left)
- Exponential
- Multimodal
What is a uniform distribution?
- It indicates that a feature is equally likely to take a value in any of the ranges present
- Sometimes it shows that a descriptive feature contains an ID rather than a measure of something more interesting
What is a normal (unimodal) distribution?
- They have a strong tendency toward a central value and symmetrical variation on either side of the central tendency
- It is called unimodal because of the single peak around the central tendency
- Many naturally occurring phenomena approximately follow a normal distribution
What is the skew when the data contains some very high values?
Right skew (positive skew)
What is the skew when the data contains some very low values?
Left skew (negative skew)
What is the mode and median relationship during skews?
- Right skewed: mode < median (the mean is typically higher still)
- Left skewed: mode > median (the mean is typically lower still)
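A small check of this ordering on made-up samples. The data below are hypothetical, and the mean is computed as an extra reference point because it is pulled furthest toward the tail.

```python
import statistics

# Hypothetical right-skewed sample: mostly low values, a few very high ones.
right_skewed = [1, 2, 2, 2, 3, 3, 4, 5, 9, 20]
# Mirror the sample to get a left-skewed one.
left_skewed = [21 - v for v in right_skewed]

mode = statistics.mode(right_skewed)
median = statistics.median(right_skewed)
mean = statistics.mean(right_skewed)

left_mode = statistics.mode(left_skewed)
left_median = statistics.median(left_skewed)
left_mean = statistics.mean(left_skewed)

print("right skew:", mode, median, mean)
print("left skew: ", left_mode, left_median, left_mean)
```

This ordering is a rule of thumb for unimodal data rather than a guarantee for every sample.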
What is an exponential distribution?
- The likelihood of low values occurring is very high but diminishes rapidly as values increase
- It is a clear warning sign that outliers are likely
What is a multimodal distribution?
- It has two or more very commonly occurring ranges of values that are clearly separated
- Bi-modal distribution can be thought of as two normal distributions pushed together
- Occurs when a feature contains a measurement made across a number of distinct groups
A multimodal distribution is a cause for which two things?
- Caution because measures of central tendency and variation tend to break down for multimodal data
- Optimism because if we are lucky, the separate peaks in the distribution will be associated with the different target levels we are trying to predict
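What happens when normal distributions have different means but identical standard deviations?
The curve keeps its shape but shifts left or right along the axis
What happens when normal distributions have identical means but different standard deviations?
The curve stays centered on the same value, but its peak moves up or down: a larger standard deviation gives a wider, flatter curve and a smaller one gives a narrower, taller curve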
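A quick numerical illustration with the standard library's `NormalDist`, assuming normal distributions. The specific means and standard deviations are arbitrary choices for the sketch.

```python
from statistics import NormalDist

baseline = NormalDist(mu=0, sigma=1)
shifted = NormalDist(mu=3, sigma=1)   # different mean, same standard deviation
wider = NormalDist(mu=0, sigma=2)     # same mean, larger standard deviation

# The density at the mean is the height of the peak.
peak_baseline = baseline.pdf(baseline.mean)
peak_shifted = shifted.pdf(shifted.mean)
peak_wider = wider.pdf(wider.mean)

print(f"baseline peak at {baseline.mean}: {peak_baseline:.4f}")
print(f"shifted  peak at {shifted.mean}: {peak_shifted:.4f}")
print(f"wider    peak at {wider.mean}: {peak_wider:.4f}")
```

Changing only the mean relocates the peak without changing its height, while doubling the standard deviation halves the peak height because the same total probability is spread over a wider range.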
What does the 68-95-99.7 rule state?
- For normally distributed data, about 68% of observations fall within one standard deviation of the mean (mean ± sd)
- About 95% of observations fall within two standard deviations of the mean (mean ± 2sd)
- About 99.7% of observations fall within three standard deviations of the mean (mean ± 3sd)
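The rule can be checked against the normal cumulative distribution function. This is a sketch using the standard library's `NormalDist`; the `share_within` helper is a hypothetical name.

```python
from statistics import NormalDist

def share_within(k, dist=NormalDist()):
    """Fraction of a normal distribution within k standard deviations of its mean."""
    upper = dist.cdf(dist.mean + k * dist.stdev)
    lower = dist.cdf(dist.mean - k * dist.stdev)
    return upper - lower

for k in (1, 2, 3):
    print(f"within {k} sd: {100 * share_within(k):.2f}%")
```

The figures in the rule are rounded versions of these values, and they apply only when the feature is approximately normally distributed.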
What is a data quality issue?
It is loosely defined as anything unusual about the data in an ABT
What are the most common data quality issues?
- Missing values
- Irregular cardinality
- Outliers
What are the data quality issues we identify from a data quality report?
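- Issues due to invalid data: errors in the process used to generate the ABT, such as syntax or calculation mistakes
- Issues due to valid data: legitimate but unusual values that arise for domain-specific reasons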
How do you handle missing values?
- Approach 1: Drop any features that have missing values
- Approach 2: Apply complete case analysis (delete records)
- Approach 3: Derive a missing indicator feature from features with missing values
- Approach 4: Impute the missing values
What are the effects of approach 1?
- It can result in massive, and frequently needless loss of data
- Only features that are missing in excess of 60% of their values should be considered for removal
- An alternative is to derive a missing indicator feature from them, which could be categorical
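A minimal sketch of applying the 60% rule of thumb, assuming records are dictionaries and a missing value is stored as `None`. The records, feature names, and the `missing_fraction` helper are hypothetical.

```python
def missing_fraction(records, feature):
    """Fraction of records whose value for `feature` is missing (None)."""
    missing = sum(1 for record in records if record.get(feature) is None)
    return missing / len(records)

# Hypothetical records, for illustration only.
records = [
    {"age": 34, "income": None, "marital_status": "married"},
    {"age": None, "income": None, "marital_status": "single"},
    {"age": 51, "income": None, "marital_status": None},
    {"age": 29, "income": 42000, "marital_status": "single"},
    {"age": 45, "income": None, "marital_status": "married"},
]

THRESHOLD = 0.60  # rule-of-thumb cut-off from the card above
removal_candidates = [
    feature for feature in records[0]
    if missing_fraction(records, feature) > THRESHOLD
]
print(removal_candidates)
```

Features below the threshold stay in the dataset and can be handled with one of the other approaches.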
What is approach 2?
- We delete instances that are missing one or more feature values
- This can result in significant data loss and can introduce a bias into the dataset
- This should rarely be used and only when a data instance is missing values for multiple features
- It is recommended to remove instances that are missing the value in the target feature
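The difference between full complete case analysis and removing only instances with a missing target can be sketched as follows. The records, the `churn` target, and both helper names are assumptions for illustration, with `None` standing for a missing value.

```python
def complete_cases(records):
    """Keep only instances with no missing (None) feature values."""
    return [r for r in records if all(v is not None for v in r.values())]

def drop_missing_target(records, target):
    """Keep instances whose target value is present, whatever else is missing."""
    return [r for r in records if r.get(target) is not None]

# Hypothetical records with a hypothetical target feature "churn".
records = [
    {"age": 34, "income": 52000, "churn": True},
    {"age": None, "income": 48000, "churn": False},
    {"age": 45, "income": None, "churn": None},
    {"age": 29, "income": 61000, "churn": None},
]

strict = complete_cases(records)
with_target = drop_missing_target(records, "churn")
print(len(records), len(strict), len(with_target))
```

In this made-up sample, full complete case analysis discards more instances than dropping only those without a target value, which illustrates the data loss the card warns about.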
Explain approach 3
- This could be a categorical feature that flags the missing data as a new level (for example, "unknown" in a marital status feature)
- Or a binary feature that flags whether the value was present or missing (T/F)
- When missing indicator features are used, the original feature is usually discarded
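Both variants of a missing indicator can be sketched in a few lines, assuming records are dictionaries and a missing value is stored as `None`. The helper names, the `_missing` suffix, and the sample records are hypothetical.

```python
def flag_as_level(records, feature, label="unknown"):
    """Categorical variant: replace a missing value with a new level."""
    return [
        {**r, feature: label if r[feature] is None else r[feature]}
        for r in records
    ]

def flag_as_binary(records, feature):
    """Binary variant: add a True/False flag and discard the original feature."""
    flagged = []
    for r in records:
        new_record = {k: v for k, v in r.items() if k != feature}
        new_record[f"{feature}_missing"] = r[feature] is None
        flagged.append(new_record)
    return flagged

# Hypothetical records, for illustration only.
records = [
    {"id": 1, "marital_status": "married"},
    {"id": 2, "marital_status": None},
]

as_level = flag_as_level(records, "marital_status")
as_binary = flag_as_binary(records, "marital_status")
print(as_level)
print(as_binary)
```

The categorical variant suits categorical features, while the binary variant works for any feature type because it records only whether a value was present.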