W6 Flashcards

1
Q

What is the true scope of the course?

A
  1. Use only descriptive statistics and visualisation; do not go beyond exploratory data analysis (EDA)
    * Do not introduce or use statistical models
    * No confirmatory data analysis such as hypothesis testing or confidence intervals
    * No prediction models
  2. Understand the data through initial data analysis and EDA
    * This includes checking the quality of the data
  3. Aim to get meaningful insights from the available data through descriptive statistics and visualisation, towards answering motivating questions
2
Q

What are the objectives of EDA?

A
  • Suggest hypotheses about the causes of observed phenomena
  • Assess assumptions on which statistical inference will be based
  • Support the selection of appropriate statistical tools and techniques
  • Provide a basis for further data collection
3
Q

What are some tips on effective visualisation?

A
  1. Keep it simple
  2. Avoid chart junk
  3. Maximise data-ink ratio
  4. If it can be visualised in 2D, do not visualise it in 3D. If it can be visualised in 1D, do not visualise it in 2D
4
Q

What is chart junk?

A

Chart junk consists of visual embellishments that are not essential to understanding the data

  • It is made up of non-data and/or redundant data elements in a graph
  • It can be artistic decoration, but more often it takes the form of conventional graphical elements that are unnecessary because they add no value
5
Q

What is the data-ink ratio?

A

Data-ink ratio = data ink / total ink used in graphic

You should maximise the data-ink ratio (i.e. remove unnecessary ink in graphs)

However, this can be taken to the extreme. Tufte’s version of the box plot removes the box entirely to achieve a high data-ink ratio. The same applies to the violin plot (compare the default with an unfilled version that has no inner box).

6
Q

What is a dot plot and how do you create one?

A

A dot plot is a simple visualisation of a single quantitative variable with labels. Compared with a bar chart, it improves the data-ink ratio and the use of the scale.

EXAMPLE:
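import matplotlib.pyplot as plt
import seaborn as sns

# dc: DataFrame with 'name' and 'APPEARANCES' columns (defined elsewhere)
_, ax = plt.subplots(ncols=2, figsize=(8, 3))
# Left: bar chart; right: dot plot of the same six rows
sns.barplot(data=dc.iloc[10:16], y='name', x='APPEARANCES', orient='h', ax=ax[0])
sns.scatterplot(data=dc.iloc[10:16], y='name', x='APPEARANCES', ax=ax[1])
ax[1].grid(axis='y')  # horizontal guide lines linking each dot to its label
plt.tight_layout()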

7
Q

What is a dumbbell plot vs. a side-by-side bar chart?

A

Similar to a dot plot; see NOTES
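
A minimal sketch of a dumbbell plot, assuming made-up values for two time points per category. The names `labels`, `before` and `after` are hypothetical and do not come from the course notes. Each category gets two dots joined by a line, where a side-by-side bar chart would draw two full bars.

```python
import matplotlib.pyplot as plt

# Hypothetical data: one value per category at two time points
labels = ["A", "B", "C", "D"]
before = [12, 30, 18, 25]
after = [20, 28, 35, 26]

fig, ax = plt.subplots(figsize=(5, 3))
y = list(range(len(labels)))

# The connecting line shows the size of the change for each category
ax.hlines(y=y,
          xmin=[min(b, a) for b, a in zip(before, after)],
          xmax=[max(b, a) for b, a in zip(before, after)],
          color="grey")
# The two dots mark the actual values at each time point
ax.scatter(before, y, label="before", zorder=3)
ax.scatter(after, y, label="after", zorder=3)

ax.set_yticks(y)
ax.set_yticklabels(labels)
ax.legend()
fig.tight_layout()
```

Only the dots and the connecting segments carry ink, so the data-ink ratio is typically higher than in the equivalent bar chart.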

8
Q

Why only show half of the correlation matrix?

A

A full square correlation matrix is symmetric, so the same information is shown twice.
- You can reduce ink by showing each correlation only once (one triangle of the matrix)
- However, this may not always be useful (a half matrix may be less accessible)
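
One possible way to show only half of the matrix is to pass a boolean mask to `sns.heatmap`. This is a sketch under assumptions: the DataFrame `df` and its columns `v1`, `v2`, `v3` are random placeholder data, not the course dataset.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical data: three random numeric columns
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(50, 3)), columns=["v1", "v2", "v3"])
corr = df.corr()

# True marks the cells to hide: the upper triangle including the diagonal
mask = np.triu(np.ones_like(corr, dtype=bool))

fig, ax = plt.subplots(figsize=(4, 3))
sns.heatmap(corr, mask=mask, annot=True, vmin=-1, vmax=1, cmap="vlag", ax=ax)
fig.tight_layout()
```

Hiding the diagonal as well removes the trivial correlations of 1 between each variable and itself.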

9
Q

What are the 2 strategies for keeping it simple?

A
  1. Avoid chart junk
  2. Maximise data-ink ratio
10
Q

Why should we be careful generalising from a sample to a population?

A

We are often interested in the properties of the population, but what we have is a sample
* Sample: a part of the population being observed
* “Representative” sample: the sample does not differ from the population in an important way

Possible issues
1. “Non-representative” sample
   a) Selection bias
   b) Non-response bias
   c) Sample size
2. Measuring error
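
A small simulation can illustrate selection bias. This is a sketch with a made-up population and an arbitrary cut-off of 55 (both are assumptions, chosen only for illustration): a simple random sample stays close to the population mean, while a sample drawn only from units above the cut-off does not.

```python
import random
import statistics

random.seed(0)

# Hypothetical population: 10,000 values centred on 50
population = [random.gauss(50, 10) for _ in range(10_000)]

# Representative sample: every unit has the same chance of selection
random_sample = random.sample(population, 200)

# Selection bias: only units above 55 can ever be selected
biased_sample = random.sample([x for x in population if x > 55], 200)

print(round(statistics.mean(population), 1))
print(round(statistics.mean(random_sample), 1))
print(round(statistics.mean(biased_sample), 1))
```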
11
Q

What is measuring error and some possible causes?

A

Measuring error (also called measurement error) is the difference between a measured value and the actual value

Some possible causes:
* Wording of questions and choice of answers
* People not answering honestly
* Priming and timing of the questions

12
Q

What is important regarding data collection and data quality?

A

Quality of data affects whether valid and accurate conclusions can be drawn.

For the given data, it is important we understand:
* How the data was collected
* Whether the data is reliable and representative enough
* Limitations of the data

  • Initial data analysis and EDA may also help us to discover issues with data quality
13
Q

What are 3 C’s to be wary of in data analysis?

A

Correlation, causation and confounders

USE DOMAIN KNOWLEDGE AND COMMON SENSE AS WELL AS MATHS/STATS AND COMPUTING

14
Q

Does correlation imply causation?

A

Correlation does not imply causation

  • Reverse causation
  • Factors not considered:
    * Confounder: a variable which is associated with both predictor and response and may explain their association
    * Lurking factor: not measured but may be a confounder
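
The effect of a confounder can be sketched with simulated data. Everything here is an assumption made for illustration: `z` plays the role of the confounder and drives both `x` and `y`, while `x` has no effect on `y`. The helper `pearson` is a hand-written correlation function, not a library call.

```python
import random

random.seed(1)

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    var_a = sum((u - ma) ** 2 for u in a)
    var_b = sum((v - mb) ** 2 for v in b)
    return cov / (var_a * var_b) ** 0.5

# z is the confounder: it drives both x and y; x has no effect on y
z = [random.gauss(0, 1) for _ in range(5000)]
x = [zi + random.gauss(0, 1) for zi in z]
y = [zi + random.gauss(0, 1) for zi in z]

overall = pearson(x, y)

# Hold the confounder roughly fixed by keeping only points with z near 0
near = [(xi, yi) for xi, yi, zi in zip(x, y, z) if abs(zi) < 0.1]
within = pearson([p[0] for p in near], [p[1] for p in near])

print(round(overall, 2), round(within, 2))
```

The overall correlation between `x` and `y` is clearly positive, yet it largely vanishes once `z` is held approximately constant, which is the pattern a confounder produces.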
15
Q

How can you explore the relations by considering 3 variables?

A

import seaborn as sns
sns.relplot(data=dummy_df, x='v2', y='v3', hue='v1', height=3)  # hue colours the points by the third variable

SEE NOTES

16
Q

What is Simpson’s paradox?

A

Simpson’s paradox (also called Simpson’s reversal) is a phenomenon in probability and statistics in which a trend appears in several groups of data but disappears or reverses when the groups are combined

EXAMPLE:
* In each of the five subjects, women have an equal or better application success rate than men
* However, across all subjects combined, 24% of men are successful but only 23% of women are successful
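
The reversal can be reproduced with simple arithmetic. The counts below are hypothetical (they are not the course data, and use two subjects rather than five): women have the higher success rate in each subject, but most women apply to the subject with the lower success rate, so the pooled rate favours men.

```python
# Hypothetical counts (not the course data): (successes, applicants)
data = {
    "subject A": {"men": (40, 80), "women": (11, 20)},
    "subject B": {"men": (2, 20), "women": (10, 80)},
}

def rate(successes, applicants):
    """Success rate as a fraction between 0 and 1."""
    return successes / applicants

# Within each subject, women have the higher success rate
for subject, groups in data.items():
    print(subject, {g: rate(*counts) for g, counts in groups.items()})

# Pool the counts across subjects before computing the overall rate
overall = {}
for g in ("men", "women"):
    s = sum(data[subj][g][0] for subj in data)
    n = sum(data[subj][g][1] for subj in data)
    overall[g] = rate(s, n)
print("overall", overall)
```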