exam 3 - chapter 8 Flashcards
The Role of Data Understanding
•What it is NOT
[3]
The Role of Data Understanding
•What it is NOT
•Mindless calculation of statistics
•Creating pretty graphs
•Reporting the obvious
The Role of Data Understanding
•What it is
[3]
The Role of Data Understanding
•What it is
•Calculation and interpretation of statistics
•Creating graphs that help you understand your data set
•Reporting the anomalous
The EDA Process
1. When the dataset is more or less ready for analysis, start applying the standard techniques to get a basic understanding of the features.
2. You will begin to form a hypothesis about some aspects of the dataset (from the context of the problem).
3. Apply EDA techniques to begin confirming or rejecting your hypothesis and preconceived ideas.
4. You will start to understand the dataset. New questions will come to mind.
5. Apply EDA techniques to try answering these new questions. You will gain more understanding, and further new questions will come to mind.
6. Repeat Steps 4 and 5 a few times.
7. Stop when you feel comfortable with the understanding you've gained and think you can move on to the modeling stage.
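A minimal sketch of the first few steps of the process above, using only the Python standard library. The dataset, its column names (`age`, `plan`, `visits`), and the hypothesis are all invented for illustration; they are not from the course material.

```python
import statistics
from collections import Counter

# Hypothetical toy dataset; column names and values are invented for illustration.
rows = [
    {"age": 34, "plan": "basic", "visits": 2},
    {"age": 45, "plan": "premium", "visits": 9},
    {"age": 29, "plan": "basic", "visits": 1},
    {"age": None, "plan": "basic", "visits": 3},
    {"age": 52, "plan": "premium", "visits": 12},
]

def summarize(rows):
    """First step: a basic per-feature summary (missing count plus simple statistics)."""
    summary = {}
    for col in rows[0]:
        values = [r[col] for r in rows if r[col] is not None]
        info = {"missing": len(rows) - len(values)}
        if all(isinstance(v, (int, float)) for v in values):
            # Numerical feature: report center of the distribution.
            info["mean"] = statistics.mean(values)
            info["median"] = statistics.median(values)
        else:
            # Categorical feature: report how often each value occurs.
            info["counts"] = dict(Counter(values))
        summary[col] = info
    return summary

summary = summarize(rows)
for col, info in summary.items():
    print(col, info)

# Next steps: form a hypothesis ("premium customers visit more often")
# and check it with a grouped median.
by_plan = {}
for r in rows:
    by_plan.setdefault(r["plan"], []).append(r["visits"])
median_visits = {plan: statistics.median(v) for plan, v in by_plan.items()}
print(median_visits)
```

Each answer tends to raise a new question (for example, why one `age` value is missing), which is what drives the repeat loop in the process.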
•What can we learn from EDA using
[2]
- What can we learn from EDA using
- Numerical Calculations
- Visualizations
Types of Analysis
[5]
Types of Analysis
•Univariate Numerical
•Univariate Categorical (Nominal)
•Bivariate Numerical
•Bivariate Categorical (Nominal)
•Combinations
Never trust summary statistics alone; always ____ your data.
Never trust summary statistics alone; always visualize your data.
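A small illustration of why this matters, as a sketch with invented numbers: the two lists below share the same mean and the same population standard deviation, yet a simple text histogram shows that one is single-peaked and the other has two separate clusters. The helper `text_histogram` is a hypothetical name written for this example.

```python
import statistics
from collections import Counter

# Two invented datasets with identical mean and population standard deviation.
a = [2, 4, 4, 4, 5, 5, 7, 9]
b = [3, 3, 3, 3, 7, 7, 7, 7]

def text_histogram(values):
    """Draw one row per integer value, with one '#' per occurrence."""
    counts = Counter(values)
    lines = []
    for v in range(min(values), max(values) + 1):
        lines.append(f"{v:>2} | {'#' * counts[v]}")
    return "\n".join(lines)

for name, data in (("a", a), ("b", b)):
    print(name, "mean:", statistics.mean(data), "pstdev:", statistics.pstdev(data))
    print(text_histogram(data))
    print()
```

The summary lines printed for `a` and `b` match, while the histograms do not.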
Distributions
- Symmetric
[2]
Distributions
- Symmetric
•Easiest to interpret.
•More likely to be a normal distribution
Distributions
•Skewed
[3]
Distributions
•Skewed
•Averages do not represent typical values
•Common with COUNT variables
•Can be flattened
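A sketch of the skew effect on invented count data. The card does not say how a skewed distribution is "flattened"; the log transform used here (`math.log1p`) is one common choice and is an assumption on my part, not something stated in the source.

```python
import math
import statistics

# Hypothetical count data (e.g. purchases per customer); values are invented.
counts = [0, 1, 1, 1, 2, 2, 3, 3, 4, 50]

mean_count = statistics.mean(counts)      # pulled upward by the single large value
median_count = statistics.median(counts)  # closer to what a typical row looks like

# Assumption: flatten with log1p, i.e. log(1 + x), which also handles zero counts.
logged = [math.log1p(c) for c in counts]
mean_logged = statistics.mean(logged)
median_logged = statistics.median(logged)

print("raw:   ", mean_count, median_count)
print("logged:", round(mean_logged, 2), round(median_logged, 2))
```

After the transform the mean and median sit much closer together, which is one sign that the long tail has been compressed.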
Distributions
•Multimodal
[2]
Distributions
•Multimodal
•Several common values
•Clusters might emerge
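One quick numerical check for several common values, sketched with invented data: `statistics.multimode` returns every value tied for the highest frequency, so more than one result hints at a multimodal feature worth plotting.

```python
import statistics

# Invented values with two equally common peaks.
values = [2, 2, 2, 3, 7, 8, 8, 8]

# multimode lists all most-frequent values, in order of first appearance.
modes = statistics.multimode(values)
print(modes)
```

This only catches exact ties on discrete values; for continuous data a histogram is the more reliable way to spot clusters.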
Distributions
Symmetric
[1]
Distributions
Symmetric
•Little variance means limited insights
Bivariate Feature Analysis •____ •Can be •____ •____ •____
Bivariate Feature Analysis
•Analyzing one feature in terms of another or against another.
•Can be
•2 numerical features
•2 categorical features
•One of each (Combination)
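The three pairings can be sketched as follows. The dataset and the column names (`hours`, `score`, `plan`, `passed`) are invented, and the specific techniques chosen (Pearson correlation, a cross-tabulation of counts, a grouped median) are common options assumed for illustration rather than the ones the course prescribes.

```python
import math
import statistics
from collections import Counter

# Hypothetical dataset; every column name and value is invented.
rows = [
    {"hours": 1, "score": 52, "plan": "basic", "passed": "no"},
    {"hours": 2, "score": 55, "plan": "basic", "passed": "no"},
    {"hours": 3, "score": 61, "plan": "basic", "passed": "yes"},
    {"hours": 4, "score": 70, "plan": "premium", "passed": "yes"},
    {"hours": 5, "score": 74, "plan": "premium", "passed": "yes"},
    {"hours": 6, "score": 80, "plan": "premium", "passed": "yes"},
]

def pearson(xs, ys):
    """Pearson correlation coefficient for two equal-length numeric lists."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(sxx * syy)

# 2 numerical features: strength of the linear relationship.
r = pearson([row["hours"] for row in rows], [row["score"] for row in rows])

# 2 categorical features: counts for each combination of categories.
crosstab = Counter((row["plan"], row["passed"]) for row in rows)

# One of each (combination): a numerical summary per category.
scores_by_plan = {}
for row in rows:
    scores_by_plan.setdefault(row["plan"], []).append(row["score"])
median_score = {plan: statistics.median(s) for plan, s in scores_by_plan.items()}

print(round(r, 3))
print(dict(crosstab))
print(median_score)
```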
Combination analysis - Box and Whisker
[4]
Combination analysis - Box and Whisker
•Based on medians, not means
•Visualize the size of each quartile
•Range from the top to the bottom of the box (first quartile to third quartile) is called the Interquartile Range (IQR)
•Outliers (High and Low)
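The numbers behind a box-and-whisker plot can be computed directly, as in this sketch on invented data. The 1.5 × IQR rule for flagging outliers is a widely used convention and is assumed here; the card itself does not say which rule the course uses. Quartile values also depend on the interpolation method, so other tools may give slightly different cut points.

```python
import statistics

# Invented sample with one unusually large value.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]

# Quartile cut points; "inclusive" treats the data as the full population.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1  # height of the box: third quartile minus first quartile

# Assumed convention: points beyond 1.5 * IQR from the box are outliers.
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < low_fence or x > high_fence]

print("median:", q2)
print("IQR:", iqr)
print("outliers:", outliers)
```

Note that the median (5.5) is barely affected by the value 30, which is why the plot is built on medians and quartiles rather than means.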
Don’t just calculate and draw
[2]
Don’t just calculate and draw
•Try to gain data understanding
•Think about how these data behaviors might impact models
Use the right tool for the right job
•
Use the right tool for the right job
•Different techniques depending on what we want to understand
If an EDA activity is not helping to better understand your data set, why do it?