W6 Flashcards

Question 1

Q

What is the true scope of the course?

Answer

A

Only use descriptive statistics and visualisation; do not go beyond exploratory data analysis (EDA)
* Do not introduce / use statistical models
* No confirmatory data analysis such as hypothesis testing or confidence intervals
* No prediction models
Understand the data through initial data analysis and EDA
* This includes checking the quality of the data
Aim to get meaningful insights from the available data, through descriptive statistics and visualisation, towards answering the motivating questions

Question 2

Q

What are the objectives of EDA?

Answer

A

Suggest hypotheses about the causes of observed phenomena
Assess assumptions on which statistical inference will be based
Support the selection of appropriate statistical tools and techniques
Provide a basis for further data collection

Question 3

Q

What are some tips on effective visualisation?

Answer

A

Keep it simple
Avoid chart junk
Maximise data-ink ratio
If it can be visualised in 2D, do not visualise it in 3D. If it can be visualised in 1D, do not visualise it in 2D

Question 4

Q

What is chart junk?

Answer

A

Chart junk is any visual embellishment that is not essential to understanding the data

It consists of non-data and/or redundant data elements in a graph
It can be artistic decoration, but more often it takes the form of conventional graphical elements that are unnecessary because they add no value

Question 5

Q

What is the data-ink ratio?

Answer

A

Data-ink ratio = data ink / total ink used in graphic

You should maximise the data-ink ratio (i.e. remove unnecessary ink in graphs)

However, this can be taken to the extreme. Tufte’s version of the box plot has the box removed and so has a high data-ink ratio. The same applies to the violin plot (default vs. unfilled with no box)

Question 6

Q

What is a dot plot and how do you create one?

Answer

A

A dot plot is a simple visualisation of a single quantitative variable with labels. Compared with a bar chart, it helps to improve the data-ink ratio and the use of the scale

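EXAMPLE:
import matplotlib.pyplot as plt
import seaborn as sns

# dc is assumed to be a pandas DataFrame with 'name' and 'APPEARANCES' columns
_, ax = plt.subplots(ncols=2, figsize=(8, 3));
sns.barplot(dc.iloc[10:16], y='name', x='APPEARANCES', orient='h', ax=ax[0]);  # bar chart
plt.grid(axis='y');
sns.scatterplot(dc.iloc[10:16], y='name', x='APPEARANCES', ax=ax[1]);  # dot plot
plt.tight_layout();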

Question 7

Q

What is a dumbbell plot vs. a side-by-side bar chart?

Answer

A

Similar to a dot plot; see NOTES

Question 8

Q

Why only show half of the correlation matrix?

Answer

A

A full square correlation matrix is symmetric, so the same information is shown twice.
- You can reduce ink by showing each correlation only once
- However, this may not always be useful (it may be less accessible)
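As a minimal sketch of why half the matrix is enough, the snippet below computes a small correlation matrix in plain Python and keeps only the entries below the diagonal. The data and the variable names `v1`, `v2`, `v3` are made up for illustration, and `pearson` is a helper written for this sketch, not a library function.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up illustrative data: three hypothetical variables.
data = {
    "v1": [1, 2, 3, 4, 5],
    "v2": [2, 4, 6, 8, 10],
    "v3": [5, 3, 4, 1, 2],
}
names = list(data)

# Full matrix: corr[i][j] equals corr[j][i], so the upper triangle
# repeats the lower triangle.
corr = [[pearson(data[a], data[b]) for b in names] for a in names]

# Keep only the entries below the diagonal, so each pair appears once.
lower = {
    (names[i], names[j]): corr[i][j]
    for i in range(len(names))
    for j in range(i)
}

for (row, col), r in lower.items():
    print(f"{row} vs {col}: {r:+.2f}")
```

In a seaborn workflow, the same idea is commonly expressed by passing a boolean array to the `mask` parameter of `sns.heatmap` so that the upper triangle is not drawn.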

Question 9

Q

What are the 2 strategies on keeping it simple?

Answer

A

Avoid chart junk
Maximise data-ink ratio

Question 10

Q

Why should we be careful generalising from a sample to a population?

Answer

A

We are often interested in the properties of the population but what we have is a sample
* Sample: a part of the population being observed
* “Representative” sample (the sample does not differ from the population in an important way)

Possible issues
1. “Non-representative” sample
   a) Selection bias
   b) Non-response bias
   c) Sample size
2. Measuring error
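The effect of selection bias can be sketched with a small simulation. Everything here is an assumption made for illustration: the population is synthetic, and the "biased" sampling rule (only members above the median can be selected) is a deliberately crude stand-in for something like surveying only people who are easy to reach.

```python
import random
import statistics

random.seed(0)

# Hypothetical population: 10,000 made-up values.
population = [random.gauss(5, 2) for _ in range(10_000)]

# Simple random sample: every member is equally likely to be chosen.
random_sample = random.sample(population, 500)

# Selection-biased sample: only members above the population median
# can be chosen, so low values are never observed.
median = statistics.median(population)
biased_sample = random.sample([v for v in population if v > median], 500)

pop_mean = statistics.mean(population)
random_mean = statistics.mean(random_sample)
biased_mean = statistics.mean(biased_sample)

print(f"population mean:    {pop_mean:.2f}")
print(f"random sample mean: {random_mean:.2f}")
print(f"biased sample mean: {biased_mean:.2f}")
```

Both samples have the same size, so a larger sample alone would not repair the biased one; the problem is how the members were selected.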

Question 11

Q

What is measuring error and some possible causes?

Answer

A

Measuring error is the difference between a measured value and the actual value

Some possible causes:
* Wording of questions and choice of answers
* People not answering honestly
* Priming and timing of the questions

Question 12

Q

What is important regarding data collection and data quality?

Answer

A

Quality of data affects whether valid and accurate conclusions can be drawn.

For the given data, it is important we understand:
* How the data was collected
* Whether the data is reliable and representative enough
* Limitations of the data

Initial data analysis and EDA may also help us to discover issues with data quality

Question 13

Q

What are 3 C’s to be wary of in data analysis?

Answer

A

Correlation, causation and confounders

Use domain knowledge and common sense as well as maths/stats and the computer

Question 14

Q

Does correlation imply causation?

Answer

A

Correlation does not imply causation

* Reverse causation
* Factors not considered:
  * Confounder: a variable which is associated with both the predictor and the response and may explain their association
  * Lurking factor: a factor that is not measured but may be a confounder
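A confounder can be illustrated with a small simulation. This is only a sketch under stated assumptions: the variables `x`, `y` and `z` are synthetic, `z` is constructed to drive both `x` and `y`, and `pearson` is a helper written for this example.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    var_x = sum((a - mean_x) ** 2 for a in xs)
    var_y = sum((b - mean_y) ** 2 for b in ys)
    return cov / math.sqrt(var_x * var_y)


random.seed(1)
n = 2000

# Hypothetical confounder z drives both x and y.
# By construction, x has no direct effect on y and vice versa.
z = [random.gauss(0, 1) for _ in range(n)]
x = [zi + random.gauss(0, 0.5) for zi in z]
y = [zi + random.gauss(0, 0.5) for zi in z]

# x and y are strongly correlated even though neither causes the other.
r_xy = pearson(x, y)

# Remove the part of x and y that comes from z. Subtracting z directly
# works only because this simulation defines x and y as z plus noise.
r_adjusted = pearson(
    [xi - zi for xi, zi in zip(x, z)],
    [yi - zi for yi, zi in zip(y, z)],
)

print(f"correlation of x and y:           {r_xy:.2f}")
print(f"correlation after accounting for z: {r_adjusted:.2f}")
```

If `z` were not measured, it would be a lurking factor: the strong correlation between `x` and `y` would still be there, but nothing in the data would explain it.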

Question 15

Q

How can you explore the relations by considering 3 variables?

Answer

A

sns.relplot(dummy_df, x='v2', y='v3', hue='v1', height=3);

SEE NOTES

Question 16

Q

What is Simpson’s paradox?

Answer

A

Simpson’s paradox (also called Simpson’s reversal) is a phenomenon in probability and statistics in which a trend appears in several groups of data but disappears or reverses when the groups are combined

EXAMPLE:
* In all five subjects, women have an equal or better success rate in applications than men
* However, overall, 24% of men are successful but only 23% of women are
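The reversal can be reproduced with a few lines of arithmetic. The counts below are made up for illustration and are not the data behind the example above; the subject names and the `rate` helper are likewise hypothetical.

```python
# Made-up counts per subject and group: (successes, applicants).
applications = {
    "subject A": {"women": (9, 10), "men": (80, 100)},
    "subject B": {"women": (30, 100), "men": (2, 10)},
}


def rate(successes, applicants):
    """Success rate as a fraction between 0 and 1."""
    return successes / applicants


# Within each subject, compare the success rates of the two groups.
for subject, groups in applications.items():
    women, men = rate(*groups["women"]), rate(*groups["men"])
    print(f"{subject}: women {women:.0%}, men {men:.0%}")

# Combine the subjects by adding up successes and applicants per group.
totals = {}
for groups in applications.values():
    for group, (successes, applicants) in groups.items():
        total_s, total_n = totals.get(group, (0, 0))
        totals[group] = (total_s + successes, total_n + applicants)

overall_women = rate(*totals["women"])
overall_men = rate(*totals["men"])
print(f"overall: women {overall_women:.0%}, men {overall_men:.0%}")
```

In these made-up numbers, women do better within each subject but worse overall, because most women apply to the subject with the low success rate while most men apply to the one with the high success rate. The subject acts as a confounder.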

(16 cards)