Data Exploration Flashcards

Question 1

Q

What are the 2 things to achieve with exploratory data analysis?

Answer

A

Learn from the data - what trends/relationships can we observe? What explanations and models can we hypothesise to explain them?
Quality assurance - check the data are not overly contaminated or in error, take steps to improve/clean the data

Question 2

Q

What are some of the types of bad data and what are the explanations for bad data?

Answer

A

Missing values
Outliers that are bad measurements
Duplicated data
Irrelevant data (to the problem you are solving)
Bad data can be a result of unreliable data collection
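A minimal sketch of how some of the bad-data types listed above might be spotted with pandas. The table, its column names, and its values are hypothetical and exist only for illustration.

```python
import pandas as pd

# Hypothetical sensor readings; names and values are made up for this sketch.
df = pd.DataFrame({
    "sensor_id": ["a", "b", "c", "c", "d"],
    "temperature_c": [21.5, None, 22.1, 22.1, 950.0],
})

# Missing values: count the empty (NaN) entries in each column.
missing_per_column = df.isna().sum()

# Duplicated data: flag rows that exactly repeat an earlier row.
duplicate_rows = df.duplicated()

# One simple clean-up: drop exact duplicates, then drop rows with missing values.
cleaned = df.drop_duplicates().dropna()

print(missing_per_column)
print("duplicate rows:", duplicate_rows.sum())
print(cleaned)
```

The 950.0 reading survives this clean-up. Deciding whether it is a bad measurement or a real event needs knowledge of how the data were collected.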

Question 3

Q

What is the difference between univariate vs bivariate data analysis?

Answer

A

Univariate investigates how one variable changes on its own, whereas bivariate investigates how two variables behave together. Bivariate looks at whether there is a correlation between the variables.

Question 4

Q

What is the difference between Pandas Dataframe and Series?

Answer

A

A DataFrame is the entire ‘table’ of data and can be built from a dictionary that maps column names to values. A Series is a single column of that DataFrame, holding the data for one variable.
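A short sketch of the distinction, assuming a small made-up table (the column names and values are hypothetical):

```python
import pandas as pd

# Build a DataFrame from a dictionary: keys become column names.
data = {"height_cm": [170, 165, 180], "weight_kg": [68, 59, 82]}
df = pd.DataFrame(data)

# Selecting one column by name returns a Series.
heights = df["height_cm"]

print(type(df).__name__, df.shape)            # two-dimensional: rows x columns
print(type(heights).__name__, heights.shape)  # one-dimensional: rows only
```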

Question 5

Q

What is the difference between Pandas Index and Series?

Answer

A

An Index holds the row labels that identify each entry in a Series or each row in a DataFrame. A Series holds the data values themselves, paired with an Index.
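As an illustration, the sketch below builds a Series with an explicit Index. The names and ages are hypothetical.

```python
import pandas as pd

# A Series of values labelled by a custom Index (made-up data).
ages = pd.Series([34, 28, 45], index=["alice", "bob", "carol"], name="age")

print(ages.index)        # the row labels
print(ages.to_numpy())   # the data values
print(ages.loc["bob"])   # look up a value by its label
```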

Question 6

Q

What are the different types of variable correlation and what can they imply?

Answer

A

Positive - one can cause the other to increase, or there is an underlying relationship between the two
Negative - one can cause the other to decrease, or there is an underlying relationship between the two
No correlation - changes in one variable do not predict changes in the other; there could still be a relationship, but it is not linear.
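The three cases can be sketched with a hand-written Pearson correlation coefficient. The helper name `pearson_r` and the toy data are assumptions made for this example. The last case shows a perfect but non-linear relationship (y = x squared) that still gives a linear correlation of zero.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

xs = [-3, -2, -1, 0, 1, 2, 3]

positive = pearson_r(xs, [2 * x + 1 for x in xs])   # y rises with x
negative = pearson_r(xs, [-2 * x + 1 for x in xs])  # y falls as x rises
curved = pearson_r(xs, [x ** 2 for x in xs])        # related, but not linearly

print(positive, negative, curved)
```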

Question 7

Q

What motivates our interest in data analysis?

Answer

A

Develop accurate insights, problem solving, decision making, curiosity

Question 8

Q

When is an outlier valid and not?

Answer

A

An outlier is not valid when it is a result of bad data collection. It is valid when it reflects an interesting phenomenon.

Question 9

Q

How can we identify different populations of data in either univariate or bivariate analysis?

Answer

A

For univariate data analysis, a bimodal (or multimodal) distribution suggests that there is more than one population. For bivariate data analysis, distinct clusters of points indicate more than one population.

Question 10

Q

Why should data cleaning be undertaken before data modelling?

Answer

A

Bad data input = bad model output. If the data are not accurate, a model trained on them cannot be accurate either, because it does not represent reality.

Question 11

Q

What are the similarities and differences between Pandas and Excel?

Answer

A

Both allow for data manipulation and have a tabular data representation. Pandas is more suitable for large datasets due to its computational efficiency/scalability. Pandas is also easily automated compared to Excel.

Question 12

Q

What is Anscombe’s Quartet?

Answer

A

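A set of four datasets that have nearly identical means, variances, trend lines, and correlations, yet look very different when plotted. It shows why data should be visualised rather than judged from summary statistics alone.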
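The shared summary statistics can be checked directly. The sketch below uses the commonly published values of the four datasets and only the Python standard library. The helper `pearson_r` and the variable names are choices made for this example.

```python
from math import sqrt
from statistics import mean, variance

# Commonly published values of Anscombe's four datasets.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))

summary = {}
for name, (xs, ys) in quartet.items():
    # Mean and sample variance of x and y, plus their correlation.
    summary[name] = (mean(xs), mean(ys), variance(xs), variance(ys), pearson_r(xs, ys))
    print(name, [round(value, 2) for value in summary[name]])
```

The printed rows agree to about two decimal places, which is why plotting each dataset is needed to see how different they are.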

Question 13

Q

What are three types of suspicious distributions and what causes them?

Answer

A

Regular peaking - data collection error
Bimodality - two populations, possibly interacting
Outliers - bad measurements or interesting phenomena

Question 14

Q

What causes noise in the data?

Answer

A

Underlying physical processes are complicated and the model only describes some of them.
Underlying system is heterogeneous, but the model uses average parameters.

Question 15

Q

What is machine learning?

Answer

A

Machine learning deals with the design of programs that can learn rules from data, adapt to changes, and improve performance with experience.

(15 cards)