W6 Flashcards
What is the true scope of the course?
- Only use descriptive statistics and visualisation; do not go beyond exploratory data analysis (EDA)
* Do not introduce or use statistical models
* No confirmatory data analysis such as hypothesis testing or confidence intervals
* No prediction models
- Understand the data through initial data analysis and EDA
* This includes checking the quality of the data
- Aim to get meaningful insights from the available data through descriptive statistics and visualisation, working towards answering the motivating questions
What are the objectives of EDA?
- Suggest hypotheses about the causes of observed phenomena
- Assess assumptions on which statistical inference will be based
- Support the selection of appropriate statistical tools and techniques
- Provide a basis for further data collection
What are some tips on effective visualisation?
- Keep it simple
- Avoid chart junk
- Maximise data-ink ratio
- If it can be visualised in 2D, do not visualise it in 3D. If it can be visualised in 1D, do not visualise it in 2D
What is chart junk?
Chart junk is any visual embellishment that is not essential to understanding the data
- It consists of non-data and/or redundant data elements in a graph
- It can be artistic decoration, but more often it takes the form of conventional graphical elements that are unnecessary because they add no value
What is the data-ink ratio?
Data-ink ratio = data ink / total ink used in graphic
You should maximise the data-ink ratio (i.e. remove unnecessary ink in graphs)
HOWEVER, this can be taken to the extreme. Tufte’s version of the box plot has the box removed to achieve a high data-ink ratio; the violin plot can be stripped down in the same way (default vs. unfilled with no inner box)
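As a rough illustration of removing non-data ink, the sketch below draws the same bar chart twice, once with a frame, grid and redundant legend and once without. The labels and values are made up for the example, and it assumes a notebook-style session where the figure is displayed automatically.

```python
import matplotlib.pyplot as plt

# Hypothetical values, used only to have something to draw.
labels = ["A", "B", "C", "D"]
values = [4, 7, 3, 5]

fig, axes = plt.subplots(ncols=2, figsize=(8, 3))

# Left: a cluttered version with a full frame, a grid and a redundant legend.
axes[0].bar(labels, values, label="value")
axes[0].grid(True)
axes[0].legend()
axes[0].set_title("More non-data ink")

# Right: the same data with the top/right frame, grid and legend left out.
axes[1].bar(labels, values)
for side in ("top", "right"):
    axes[1].spines[side].set_visible(False)
axes[1].set_title("Less non-data ink")

fig.tight_layout()
```

Both panels carry the same information; the right one simply spends less ink on elements that are not data.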
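What is the dot plot and how do you create one?
A dot plot is a simple visualisation of a single quantitative variable with labels. Compared with a bar chart, it improves the data-ink ratio and the use of the scale, because each value is shown by the position of a dot rather than the length of a bar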
EXAMPLE:
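import matplotlib.pyplot as plt
import seaborn as sns
# dc is assumed to be an already loaded DataFrame with 'name' and 'APPEARANCES' columns
_, ax = plt.subplots(ncols=2, figsize=(8, 3));
# Left: bar chart; right: dot plot of the same six rows
sns.barplot(dc.iloc[10:16], y='name', x='APPEARANCES', orient='h', ax=ax[0]);
ax[1].grid(axis='y');  # horizontal guide lines linking each dot to its label
sns.scatterplot(dc.iloc[10:16], y='name', x='APPEARANCES', ax=ax[1]);
plt.tight_layout();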
What is a dumbbell plot vs. a side-by-side bar chart?
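A dumbbell plot is similar to a dot plot: each category gets two dots joined by a line, instead of two adjacent bars as in a side-by-side bar chart. See NOTES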
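One common way to draw a dumbbell plot with matplotlib is a thin horizontal line per category plus a dot at each end. This is only a sketch: the category names and the "before"/"after" values are hypothetical, not taken from the notes.

```python
import matplotlib.pyplot as plt

# Hypothetical data: one value per category at two time points.
names = ["A", "B", "C", "D"]
before = [12, 30, 18, 25]
after = [20, 26, 31, 27]
y = list(range(len(names)))

fig, ax = plt.subplots(figsize=(5, 3))
# The "bar" of each dumbbell: a thin line joining the two values.
ax.hlines(y=y, xmin=before, xmax=after, color="grey", linewidth=1)
# The two "weights": one dot per value, drawn on top of the line.
ax.scatter(before, y, label="before", zorder=3)
ax.scatter(after, y, label="after", zorder=3)
ax.set_yticks(y)
ax.set_yticklabels(names)
ax.legend()
fig.tight_layout()
```

The length of each line shows the change directly, which a side-by-side bar chart leaves the reader to estimate from two bar heights.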
Why only show half of the correlation matrix?
With a full square correlation matrix, the same information is repeated.
- You can reduce ink by showing each correlation only once (one triangle of the matrix)
- However, this may not always be helpful (the half matrix can be less accessible to read)
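A minimal sketch of showing only the lower triangle, using the `mask` argument of `sns.heatmap`. The DataFrame here is randomly generated for illustration, and its column names are placeholders.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical numeric data, generated only for illustration.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 4)), columns=["a", "b", "c", "d"])
corr = df.corr()

# True marks the cells to hide: the diagonal and everything above it.
mask = np.triu(np.ones(corr.shape, dtype=bool))

fig, ax = plt.subplots(figsize=(4, 3))
sns.heatmap(corr, mask=mask, annot=True, fmt=".2f",
            vmin=-1, vmax=1, cmap="vlag", ax=ax)
fig.tight_layout()
```

Passing `k=1` to `np.triu` would keep the diagonal visible, which some readers find easier to orient by.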
What are the 2 strategies on keeping it simple?
- Avoid chart junk
- Maximise data-ink ratio
Why should we be careful generalising from a sample to a population?
We are often interested in the properties of the population but what we have is a sample
* Sample: a part of the population being observed
* “Representative” sample: a sample that does not differ from the population in any important way
Possible issues:
1. “Non-representative” sample
a) Selection bias
b) Non-response bias
c) Sample size
2. Measurement error
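To see why a non-representative sample is a problem, the small simulation below compares a random sample with one affected by selection bias. Everything in it is hypothetical: the population is simulated, and the cut-off of 55 is an arbitrary stand-in for "only some members can end up in the sample".

```python
import random
import statistics

random.seed(0)

# Hypothetical population: 10,000 values from a normal distribution (mean 50, sd 10).
population = [random.gauss(50, 10) for _ in range(10_000)]

# Representative sample: every member is equally likely to be chosen.
random_sample = random.sample(population, 500)

# Selection bias: only members with a value above 55 can be sampled.
biased_sample = random.sample([x for x in population if x > 55], 500)

pop_mean = statistics.mean(population)
print(f"population mean:    {pop_mean:.1f}")
print(f"random sample mean: {statistics.mean(random_sample):.1f}")
print(f"biased sample mean: {statistics.mean(biased_sample):.1f}")
```

Both samples have the same size, so the gap between the biased sample mean and the population mean comes from how the sample was selected, not from how many values were drawn.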
What is measurement error and what are some possible causes?
Measurement error is the difference between a measured value and the true value
Some possible causes:
* Wording of questions and choice of answers
* People not answering honestly
* Priming and timing of the questions
What is important regarding data collection and data quality?
Quality of data affects whether valid and accurate conclusions can be drawn.
For the given data, it is important we understand:
* How the data was collected
* Whether the data is reliable and representative enough
* Limitations of the data
- Initial data analysis and EDA may also help us to discover issues with data quality
What are 3 C’s to be wary of in data analysis?
Correlation, causation and confounders
USE DOMAIN KNOWLEDGE AND COMMON SENSE AS WELL AS MATHS/STATS AND COMPUTING
Does correlation imply causation?
No, correlation does not imply causation. An observed association may instead be explained by:
- Reverse causation (the response influences the predictor)
- Factors not considered:
* Confounder: a variable that is associated with both the predictor and the response and may explain their association
* Lurking factor: a variable that is not measured but may be a confounder
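The effect of a confounder can be illustrated with a small simulation. In this sketch `z` drives both `x` and `y`, which have no direct link to each other. All variable names and numbers are hypothetical, and `pearson` is a helper written for this example rather than a library function.

```python
import random

random.seed(1)

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    var_x = sum((a - mx) ** 2 for a in xs)
    var_y = sum((b - my) ** 2 for b in ys)
    return cov / (var_x * var_y) ** 0.5

n = 5_000
# z is the confounder; x and y each depend on z but not on each other.
z = [random.gauss(0, 1) for _ in range(n)]
x = [zi + random.gauss(0, 0.5) for zi in z]
y = [zi + random.gauss(0, 0.5) for zi in z]

# Remove the part explained by z (its effect is 1 by construction here).
x_resid = [xi - zi for xi, zi in zip(x, z)]
y_resid = [yi - zi for yi, zi in zip(y, z)]

r_xy = pearson(x, y)
r_resid = pearson(x_resid, y_resid)
print(f"corr(x, y)                  = {r_xy:.2f}")
print(f"corr(x, y) after removing z = {r_resid:.2f}")
```

The strong correlation between `x` and `y` largely vanishes once `z` is accounted for, even though neither variable causes the other. With real data the confounder's effect is not known in advance, so this is an illustration rather than a recipe.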
How can you explore the relations by considering 3 variables?
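Plot two of the variables against each other and encode the third with colour (hue):
import seaborn as sns
sns.relplot(dummy_df, x='v2', y='v3', hue='v1', height=3);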
SEE NOTES
What is Simpson’s paradox?
Simpson’s paradox (also called Simpson’s reversal) is a phenomenon in probability and statistics in which a trend appears in several groups of data but disappears or reverses when the groups are combined
EXAMPLE:
* In each of the five subjects, women have an equal or better application success rate than men
* However, overall 24% of men are successful but only 23% of women are successful
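The reversal can be reproduced with simple arithmetic. The counts below are hypothetical and are NOT the data behind the example above; they use two subjects only, chosen so that the effect is easy to see.

```python
# Hypothetical counts, NOT the data behind the example above:
# (successful, applicants) by subject and sex.
applications = {
    "Subject A": {"men": (80, 100), "women": (9, 10)},
    "Subject B": {"men": (2, 10), "women": (30, 100)},
}

def rate(successes, total):
    """Success rate as a fraction between 0 and 1."""
    return successes / total

# Within each subject, women have the higher success rate.
for subject, groups in applications.items():
    print(subject, {sex: f"{rate(*counts):.0%}" for sex, counts in groups.items()})

# Combined over subjects, the direction reverses.
overall = {}
for sex in ("men", "women"):
    successes = sum(groups[sex][0] for groups in applications.values())
    total = sum(groups[sex][1] for groups in applications.values())
    overall[sex] = rate(successes, total)
print("Overall", {sex: f"{r:.0%}" for sex, r in overall.items()})
```

In these made-up numbers most women apply to the subject with the low success rate, while most men apply to the one with the high success rate. The subject acts as a confounder, which is why the combined rates point the other way.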