Exploratory Data Analysis 6.1 + 6.2 Flashcards
What is Exploratory Data Analysis (EDA)?
- It helps us familiarise ourselves with the data and build
a better intuition for it. We do this by:
- Visualising the data through graphs or charts.
- Producing summary statistics about the data, i.e.
summarising some key characteristics about them.
Definition provided by the National Institute of Standards and Technology (NIST)
What are the main goals of EDA?
- Maximize insight into a data set
- Uncover underlying structure
- Extract important variables
- Detect outliers and anomalies
- Test underlying assumptions
- Develop parsimonious models
- Determine optimal factor settings
These goals help in understanding the data better.
What tools are used in EDA?
Visualisations and summary statistics
These tools help us familiarize ourselves with the data without answering a specific research question.
What are univariate approaches?
Approaches that look at only one variable
‘Uni-’ means one; the focus is on the nature of the data within that variable.
What are multivariate approaches?
Approaches that look at how variables relate to or interact with each other
An analysis of exactly two variables is known as a bivariate approach.
What is the difference between univariate and multivariate approaches?
- Univariate approaches focus on one variable and describe the nature of the data within that variable
- Multivariate approaches measure relationships between multiple variables
Univariate describes the nature of data within a single variable; multivariate examines interactions among variables.
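The contrast can be sketched in a few lines of Python. The data values and the names `hours`, `scores`, `mean`, and `pearson_r` are made up for illustration, and Pearson's correlation is only one possible way to measure a relationship between two numeric variables.

```python
import math

# Hypothetical data: hours studied and exam score for six students.
hours = [1, 2, 3, 4, 5, 6]
scores = [52, 55, 61, 64, 70, 74]


def mean(values):
    """Univariate: describes a single variable on its own."""
    return sum(values) / len(values)


def pearson_r(xs, ys):
    """Bivariate: measures how two variables move together (-1 to 1)."""
    mx, my = mean(xs), mean(ys)
    covariation = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mx) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - my) ** 2 for y in ys))
    return covariation / (spread_x * spread_y)


print("Mean hours (one variable):", mean(hours))
print("Correlation (two variables):", round(pearson_r(hours, scores), 3))
```

The mean says nothing about `scores`, while the correlation says nothing about the typical value of either variable; each answers a different kind of question.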
What is the purpose of visualizing data in EDA?
To help humans see patterns visually rather than through numbers
Humans are hardwired to recognize patterns visually.
What is a bivariate approach?
An analysis of two variables
This is a subset of multivariate approaches.
What does it mean to produce summary statistics in EDA?
Summarizing key characteristics about the data
This step aids in understanding the overall trends and features of the dataset.
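As a minimal sketch, summary statistics for one numeric variable can be produced with Python's standard `statistics` module. The `scores` values are invented for illustration, and the choice of statistics shown is an assumption, not a fixed list.

```python
import statistics

# Hypothetical sample: exam scores for ten students (made-up values).
scores = [52, 61, 61, 67, 70, 72, 75, 78, 84, 90]

summary = {
    "count": len(scores),
    "min": min(scores),
    "max": max(scores),
    "mean": statistics.mean(scores),
    "median": statistics.median(scores),
    "stdev": round(statistics.stdev(scores), 2),  # sample standard deviation
}

for name, value in summary.items():
    print(f"{name}: {value}")
```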
True or False: EDA aims to answer specific research questions.
False
EDA is focused on understanding the data better, not answering specific queries.
What are some pitfalls of visualizations?
- Bad visualizations can hide patterns
- They can mislead the viewer
- They can give the wrong message
What are the key aspects to ensure legibility in visualizations?
- Mind your font size and style
- Label axes, title, and legend
- Choose scales appropriately
What is the main difference between histograms and bar charts?
- Histograms show the distribution of a numeric variable
- Bar charts compare the levels of a categorical variable
How do the bars in a histogram differ from those in a bar chart?
- Histogram bars touch because each bar represents a bin of numeric data
- Bar chart bars have spaces between them because the categories are distinct
What does a histogram tell us about the data?
- How values are distributed
- The range of values
- Presence of outliers
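The binning behind a histogram can be sketched without any plotting library. The `ages` values, the helper `bin_counts`, and the bin width of 10 are all assumptions for illustration; a different bin width can change the picture.

```python
# Hypothetical numeric data: ages of survey respondents (made-up values).
ages = [18, 21, 22, 25, 27, 29, 31, 33, 34, 38, 41, 45, 72]


def bin_counts(values, bin_width):
    """Count values per bin; returns (bin_lower_edge, count) pairs,
    including empty bins so that gaps in the data stay visible."""
    counts = {}
    for value in values:
        lower = (value // bin_width) * bin_width
        counts[lower] = counts.get(lower, 0) + 1
    first = (min(values) // bin_width) * bin_width
    last = (max(values) // bin_width) * bin_width
    return [(lower, counts.get(lower, 0))
            for lower in range(first, last + bin_width, bin_width)]


# Text rendering: one row per bin, one '#' per value in that bin.
for lower, n in bin_counts(ages, 10):
    print(f"{lower}-{lower + 9} | {'#' * n}")
```

In this made-up sample most values fall between 20 and 39, the range runs from the 10s to the 70s, and the lone value in the 70-79 bin, separated by two empty bins, stands out as a possible outlier.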
What does a bar chart illustrate regarding a variable?
- Levels of the variable
- How the values are distributed across those levels
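A bar chart is built from counts per category. A minimal sketch using `collections.Counter`, where the `transport` responses are invented for illustration:

```python
from collections import Counter

# Hypothetical categorical data: how respondents travel to work.
transport = ["bus", "car", "bike", "car", "bus", "car", "walk", "car", "bike"]

level_counts = Counter(transport)

# One bar per level; the bar length is the number of observations.
for level, n in level_counts.most_common():
    print(f"{level:<5} {'#' * n} ({n})")
```

The keys of `level_counts` are the levels of the variable, and the counts show how the observations are spread across them.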
What is Bivariate Analysis?
Exploring pairs of variables.
What are the three combinations of variables in Bivariate Analysis?
- Two numeric variables
- Two categorical variables
- A numeric and a categorical variable
What is a scatter plot used for?
To show relationships between two numeric variables.
What is a contingency table?
A table that cross-tabulates two categorical variables: the levels of one variable form the rows, the levels of the other form the columns, and each cell holds the frequency of that combination.
What can a contingency table help determine?
Which combinations of categories are common and which are rare.
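A contingency table can be sketched by counting pairs of categories. The `records` data (pet type and home type) and the variable names are hypothetical.

```python
from collections import Counter

# Hypothetical records: (pet, home type) for eight households.
records = [
    ("cat", "flat"), ("dog", "house"), ("cat", "flat"), ("dog", "house"),
    ("cat", "house"), ("dog", "house"), ("cat", "flat"), ("dog", "flat"),
]

cell_counts = Counter(records)  # frequency of each (row, column) combination
row_levels = sorted({pet for pet, _ in records})
col_levels = sorted({home for _, home in records})

# Header row, then one row per level of the first variable.
print(f"{'':<6}" + "".join(f"{col:>7}" for col in col_levels))
for row in row_levels:
    cells = "".join(f"{cell_counts[(row, col)]:>7}" for col in col_levels)
    print(f"{row:<6}{cells}")
```

In this made-up sample, cats in flats and dogs in houses are the common combinations, while the other two are rare.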
What relationships can boxplots explore?
The relationship between a numeric variable and a categorical variable.
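A boxplot is typically drawn from the quartiles and extremes of each group. As a rough sketch, the same comparison can be made numerically with a five-number summary per category. The `observations` data, the helper `five_number_summary`, and the `inclusive` quantile method are assumptions for illustration; plotting libraries may compute quartiles and whiskers differently.

```python
import statistics
from collections import defaultdict

# Hypothetical data: (clinic, waiting time in minutes).
observations = [
    ("north", 12), ("north", 15), ("north", 18), ("north", 21), ("north", 24),
    ("south", 30), ("south", 32), ("south", 35), ("south", 41), ("south", 60),
]


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (min(values), q1, q2, q3, max(values))


# Split the numeric variable by the levels of the categorical variable.
groups = defaultdict(list)
for clinic, minutes in observations:
    groups[clinic].append(minutes)

summaries = {clinic: five_number_summary(vals) for clinic, vals in groups.items()}
for clinic, stats in summaries.items():
    print(clinic, stats)
```

Comparing the summaries side by side shows how the numeric variable shifts from one category to the next, which is what a set of side-by-side boxplots shows visually.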
What is the purpose of a line chart?
To display trends over time and show multiple trends against each other.
What should be avoided when creating line charts?
Overcrowding the chart with too many lines, and joining unrelated data points.