Exploratory Data Analysis 6.1 + 6.2 Flashcards

1
Q

What is Exploratory Data Analysis (EDA)?

A
  • It helps us familiarise ourselves with the data and build
    a better intuition for it.
  • We do this by:
  • Visualising the data through graphs or charts.
  • Producing summary statistics about the data, i.e.
    summarising some of its key characteristics.

Definition provided by the National Institute of Standards and Technology (NIST)

2
Q

What are the main goals of EDA?

A
  • Maximize insight into a data set
  • Uncover underlying structure
  • Extract important variables
  • Detect outliers and anomalies
  • Test underlying assumptions
  • Develop parsimonious models
  • Determine optimal factor settings

These goals help in understanding the data better.

3
Q

What tools are used in EDA?

A

Visualisations and summary statistics

These tools help us familiarize ourselves with the data without answering a specific research question.

4
Q

What are univariate approaches?

A

Approaches that look at only one variable

‘Uni-’ indicates one, focusing on the nature of the data within that variable.

5
Q

What are multivariate approaches?

A

Approaches that look at how variables relate to or interact with each other

Analysing exactly two variables is known as a bivariate approach.

6
Q

What is the difference between univariate and multivariate approaches?

A
  • Univariate approaches focus on one variable and describe the nature of the data within that variable.
  • Multivariate approaches measure relationships between multiple variables.

Univariate describes the nature of data within a single variable; multivariate examines interactions among variables.

7
Q

What is the purpose of visualizing data in EDA?

A

To help humans see patterns visually rather than through numbers

Humans are hardwired to recognize patterns visually.

8
Q

What is a bivariate approach?

A

An analysis of two variables

This is a subset of multivariate approaches.

9
Q

What does it mean to produce summary statistics in EDA?

A

Summarizing key characteristics about the data

This step aids in understanding the overall trends and features of the dataset.
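
As a rough illustration of what "summary statistics" can look like in practice, here is a minimal sketch using only Python's standard library. The `scores` list is made-up sample data, and the choice of statistics (count, min, max, mean, median, standard deviation) is an assumption for illustration, not a fixed list from the course.

```python
import statistics

# Hypothetical sample: exam scores for ten students (illustrative data only).
scores = [52, 61, 64, 67, 70, 71, 73, 78, 85, 99]

# A few common summary statistics for a single numeric variable.
summary = {
    "count": len(scores),
    "min": min(scores),
    "max": max(scores),
    "mean": statistics.mean(scores),
    "median": statistics.median(scores),
    "stdev": statistics.stdev(scores),  # sample standard deviation
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

Comparing the mean (72) with the median (70.5) is a quick first hint about skew: here the single high score of 99 pulls the mean slightly above the median.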

10
Q

True or False: EDA aims to answer specific research questions.

A

False

EDA is focused on understanding the data better, not answering specific queries.

11
Q

What are some pitfalls of visualizations?

A
  • Bad visualizations can hide patterns.
  • They can mislead the viewer.
  • They can give the wrong message.
12
Q

What are the key aspects to ensure legibility in visualizations?

A
  • Mind your font size and style
  • Label the axes and include a title and legend
  • Choose scales appropriately
13
Q

What is the main difference between histograms and bar charts?

A
  • Histograms are for numeric variables;
  • bar charts are for comparing categorical variables.
14
Q

How do the bars in a histogram differ from those in a bar chart?

A
  • Histogram bars touch to represent bins of data;
  • bar chart bars have spaces between distinct categories.
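
The difference between the two chart types comes down to how the bar heights are computed. The sketch below is one possible illustration using only the standard library: `bin_counts` is a hypothetical helper (not from any plotting library) that groups numeric values into contiguous, equal-width bins, while `collections.Counter` simply counts distinct categories. The `ages` and `colours` data are made up.

```python
from collections import Counter


def bin_counts(values, start, width, n_bins):
    """Count values in n_bins contiguous, equal-width bins.

    Bin i covers [start + i*width, start + (i+1)*width).
    Values that fall outside all bins are ignored.
    """
    counts = [0] * n_bins
    for v in values:
        i = int((v - start) // width)
        if 0 <= i < n_bins:
            counts[i] += 1
    return counts


# Histogram: a numeric variable is cut into adjacent bins, so the bars touch.
ages = [21, 23, 25, 29, 31, 34, 35, 38, 42, 58]
hist = bin_counts(ages, start=20, width=10, n_bins=4)  # hist == [4, 4, 1, 1]
for i, count in enumerate(hist):
    low = 20 + i * 10
    print(f"[{low}, {low + 10}) {'#' * count}")

# Bar chart: a categorical variable has distinct levels, one separate bar each.
colours = ["red", "blue", "red", "green", "blue", "red"]
bars = Counter(colours)
for level, count in bars.most_common():
    print(f"{level:<6} {'#' * count}")
```

Changing `width` changes the shape of the histogram, whereas the bar chart counts depend only on the categories themselves.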
15
Q

What does a histogram tell us about the data?

A
  • How values are distributed
  • The range of values
  • Presence of outliers
16
Q

What does a bar chart illustrate regarding a variable?

A
  • Levels of the variable
  • Distribution of values
17
Q

What is Bivariate Analysis?

A

Exploring pairs of variables.

18
Q

What are the three combinations of variables in Bivariate Analysis?

A
  • Two numeric variables
  • Two categorical variables
  • A numeric and a categorical variable
19
Q

What is a scatter plot used for?

A

To show relationships between two numeric variables.
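
A scatter plot is often paired with a single number that summarises how strongly the two variables move together. The sketch below computes the Pearson correlation coefficient by hand; this is a companion technique chosen for illustration, not something the card itself prescribes, and the `hours` and `marks` values are made up. Note that it only captures linear relationships, so the plot itself is still worth looking at.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: hours studied vs. exam mark for five students.
hours = [1, 2, 3, 4, 5]
marks = [52, 55, 61, 64, 70]

r = pearson_r(hours, marks)
print(f"r = {r:.3f}")  # close to +1: points lie near an upward-sloping line
```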

20
Q

What is a contingency table?

A

A table that cross-tabulates two categorical variables: the levels of one form the rows, the levels of the other form the columns, and each cell holds the frequency of that combination.

21
Q

What can a contingency table help determine?

A

Which combinations of categories are common and which are rare.
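
One way to build a small contingency table without any external library is to count (row, column) pairs. This is only a sketch: the survey `responses` below are invented, and the variable names are illustrative.

```python
from collections import Counter

# Hypothetical survey: (transport mode, age group) for eight respondents.
responses = [
    ("bus", "under 30"), ("bus", "under 30"), ("car", "30 and over"),
    ("car", "30 and over"), ("car", "under 30"), ("bike", "under 30"),
    ("car", "30 and over"), ("bus", "30 and over"),
]

# Count how often each combination of categories occurs.
pair_counts = Counter(responses)

rows = sorted({mode for mode, _ in responses})
cols = sorted({group for _, group in responses})

# table[row][col] is the frequency of that combination (0 if never seen).
table = {r: {c: pair_counts[(r, c)] for c in cols} for r in rows}

# Print the table: one categorical variable in rows, the other in columns.
print(f"{'':<6}" + "".join(f"{c:>14}" for c in cols))
for r in rows:
    print(f"{r:<6}" + "".join(f"{table[r][c]:>14}" for c in cols))
```

In this made-up sample, "car" with "30 and over" is the most common combination (3 respondents), while "bike" with "30 and over" never occurs.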

22
Q

What relationships can boxplots explore?

A

The relationship between a numeric variable and a categorical variable.
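
A boxplot is essentially a drawing of a five-number summary, so comparing groups numerically gives a feel for what side-by-side boxplots would show. The sketch below assumes made-up salary data (in thousands) for two hypothetical departments and uses `statistics.quantiles` with the "inclusive" method; plotting libraries may use a slightly different quartile convention, so exact quartile values can differ.

```python
import statistics


def five_number_summary(values):
    """Return min, lower quartile, median, upper quartile and max."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
    }


# Hypothetical data: a numeric variable (salary) split by a categorical one (department).
salaries_by_dept = {
    "sales": [30, 32, 35, 38, 40],
    "engineering": [45, 50, 52, 58, 90],
}

summaries = {
    dept: five_number_summary(values)
    for dept, values in salaries_by_dept.items()
}

for dept, s in summaries.items():
    print(dept, s)
```

Here the engineering median is higher than the sales median, and the engineering maximum of 90 sits far from the rest of its group, which is the kind of point a boxplot would typically flag as a possible outlier.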

23
Q

What is the purpose of a line chart?

A

To display trends over time and show multiple trends against each other.

24
Q

What should be avoided when creating line charts?

A

Overcrowding the chart with too many lines, and joining unrelated data points.