Data Exploration Flashcards
What are the two things to achieve with exploratory data analysis?
Learn from the data - what trends/relationships can we observe? What explanations and models can we hypothesise to explain them?
Quality assurance - check that the data are not overly contaminated or in error, and take steps to improve/clean the data
What are some of the types of bad data and what are the explanations for bad data?
- Missing values
- Outliers that are bad measurements
- Duplicated data
- Irrelevant data (to the problem you are solving)
Bad data can be a result of unreliable data collection.
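A minimal sketch of how missing values and duplicated data might be spotted and cleaned in pandas. The table and its column names (sensor_id, reading) are made up for illustration, and dropping rows is only one possible cleaning choice.

```python
import pandas as pd

# Hypothetical measurements: one missing reading and one duplicated row.
df = pd.DataFrame({
    "sensor_id": [1, 2, 2, 3],
    "reading": [20.5, 21.0, 21.0, None],
})

# Count the missing values in each column.
missing_per_column = df.isna().sum()

# Count the rows that exactly repeat an earlier row.
duplicate_rows = int(df.duplicated().sum())

# One possible cleaning step: drop the repeats, then drop rows with gaps.
cleaned = df.drop_duplicates().dropna()

print(missing_per_column)
print("duplicate rows:", duplicate_rows)
print(cleaned)
```

Outliers and irrelevant data usually need judgement about the problem being solved, so they are left out of this sketch.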
What is the difference between univariate vs bivariate data analysis?
Univariate investigates how one variable changes on its own, whereas bivariate investigates how two variables behave together. Bivariate looks at whether there is a correlation between the variables.
What is the difference between Pandas Dataframe and Series?
A DataFrame is the entire ‘table’ of data, with rows and columns. It can be built from a dictionary that maps column names to values. A Series is a single column within this DataFrame that contains the data for one variable.
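A short sketch of the distinction, using made-up column names (height_cm, weight_kg): the dictionary keys become columns of the DataFrame, and selecting one column gives back a Series.

```python
import pandas as pd

# Build a DataFrame from a dictionary: keys become column names.
df = pd.DataFrame({
    "height_cm": [150, 160, 170],
    "weight_kg": [50, 60, 70],
})

# Selecting a single column returns a Series.
heights = df["height_cm"]

print(type(df).__name__, df.shape)            # two-dimensional table
print(type(heights).__name__, heights.shape)  # one-dimensional column
```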
What is the difference between Pandas Index and Series?
The Index holds the row labels that identify rows in a Series or DataFrame. A Series holds the data values themselves, paired with an Index.
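A small sketch separating the labels from the values. The day labels and temperatures are invented for the example.

```python
import pandas as pd

# Hypothetical Series with an explicit Index of string labels.
temps = pd.Series([12.5, 14.0, 9.5], index=["mon", "tue", "wed"], name="temp_c")

labels = list(temps.index)   # the Index: the row labels
values = list(temps.values)  # the data held by the Series
tuesday = temps.loc["tue"]   # look up a value by its label

print(labels)
print(values)
print(tuesday)
```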
What are the different types of variable correlation and what can they imply?
- Positive - the variables tend to increase together; one may cause the other to increase, or there may be an underlying relationship between the two
- Negative - one variable tends to decrease as the other increases; one may cause the other to decrease, or there may be an underlying relationship between the two
- No correlation - changes in one variable do not predict changes in the other; there could still be a relationship, but it is just not linear
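The three cases above can be sketched with a hand-written Pearson correlation coefficient (the standard linear measure, assumed here since the card does not name one). The data are made up: two straight lines and one curve, y = x squared, where y clearly depends on x but the linear correlation comes out as zero.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

xs = [-3, -2, -1, 0, 1, 2, 3]
r_positive = pearson(xs, [2 * x + 1 for x in xs])   # straight line, rising
r_negative = pearson(xs, [-2 * x + 1 for x in xs])  # straight line, falling
r_curved = pearson(xs, [x ** 2 for x in xs])        # related, but not linearly

print(r_positive, r_negative, r_curved)
```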
What motivates our interest in data analysis?
Develop accurate insights, problem solving, decision making, curiosity
When is an outlier valid and not?
An outlier is not valid when it is a result of bad data collection, such as a bad measurement. It is valid when it reflects an interesting real phenomenon.
How can we identify different populations of data in either univariate or bivariate analysis?
For univariate data analysis, the distribution could be bimodal (or have more than two peaks), suggesting that there are multiple populations. For bivariate data analysis, the data could form separate clusters, indicating more than one population.
Why should data cleaning be undertaken before data modelling?
Bad data input = bad model output. If the data is not accurate, a model trained on it can’t be either, because the data doesn’t accurately represent reality.
Describe the similarities and differences between Pandas vs Excel
Both allow for data manipulation and have a tabular data representation. Pandas is more suitable for large datasets due to its computational efficiency/scalability. Pandas is also easily automated compared to Excel.
What is Anscombe’s Quartet?
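A set of four datasets that all have nearly the same mean, variance, trend line, and correlation, yet look very different when plotted. It shows why data should be visualised and not judged on summary statistics alone.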
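A sketch that computes the summary statistics for the four datasets, using the values as commonly published for the quartet. The pearson helper and variable names are illustrative, and only the standard library is used.

```python
from math import sqrt
from statistics import mean, variance

# Datasets I to III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = mean(xs), mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# For each dataset: mean of x, mean of y, sample variance of x and y, correlation.
summary = {}
for name, (xs, ys) in quartet.items():
    summary[name] = (mean(xs), mean(ys), variance(xs), variance(ys), pearson(xs, ys))
    print(name, [round(v, 2) for v in summary[name]])
```

The printed rows agree to about two decimal places, which is why the differences only show up once each dataset is plotted.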
What are three types of suspicious distributions and what causes them?
- Regular peaking - data collection error
- Bimodality - two populations, possibly interacting
- Outliers - bad measurements or interesting phenomena
What causes noise in the data?
Underlying physical processes are complicated and the model only describes some of them.
Underlying system is heterogeneous, but the model uses average parameters.
What is machine learning?
Machine learning deals with the design of programs that can learn rules from data, adapt to changes, and improve performance with experience.