Data Analysis in Real Life Flashcards

1
Q

What are some methods for helping reports have a clearer narrative?

A
  1. eliminating jargon and focusing on interpretability
  2. critiquing significant effects by coming up with potential alternate explanations
  3. focusing on simpler models and parsimony
2
Q

What is version control software used for?

A

Keeping track of checked-in versions of code, data, and reports

3
Q

What are some easy things to double-check in reports?

A
  1. verifying that the signs of effects are in the expected direction
  2. checking magnitude of effects by comparison with other known effects
  3. putting units on graphs and coefficients and generally keeping track of units
4
Q

How do reproducible report-writing tools like knitr and IPython help?

A
  1. by automating the report writing process
  2. by organizing one's thinking by blending the code and the narrative into a single document
  3. by documenting the analysis code with the project narrative
  4. by advancing the goal of reproducibility
5
Q

What are two components of good final data products that apply across all settings?

A

making the report reproducible and

making the report and code version controlled.

6
Q

A null result may be due to

A

low power

that the null hypothesis is actually correct

7
Q

A study with a very low sample size will likely have

A

low power

8
Q

Calculating power after the study has been done and analyzed is

A

problematic and should only be done by people well versed in the issues

9
Q

Ideally, a surrogate variable for a variable of interest will

A
  1. have a known or estimable variance around the desired measurement variable
  2. be unbiased
10
Q

What is the idea of power?

A

Power is the probability of rejecting the null hypothesis when it’s false.
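
As a rough illustration of this definition, the sketch below computes the power of a two-sided one-sample z-test with known standard deviation. The function name `z_test_power` and the example effect size, sigma, and sample sizes are assumptions chosen for illustration; they do not come from the cards.

```python
from statistics import NormalDist


def z_test_power(effect, sigma, n, alpha=0.05):
    """Power of a two-sided one-sample z-test with known sigma.

    effect: true difference between the mean and its null value (assumed).
    sigma:  known standard deviation of a single observation (assumed).
    n:      sample size.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # Mean of the test statistic under the alternative.
    shift = effect * n ** 0.5 / sigma
    # Probability of landing in either rejection region.
    return std_normal.cdf(-z_crit + shift) + std_normal.cdf(-z_crit - shift)


# Hypothetical settings: power grows as the sample size grows.
for n in (5, 20, 80):
    print(n, round(z_test_power(effect=0.5, sigma=1.0, n=n), 3))
```

When the true effect is zero the function returns alpha, the false-positive rate. With a fixed nonzero effect, a very small sample gives low power, which matches the card on low sample size.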

11
Q

What can you do in the absence of any calibration data to evaluate your surrogate?

A

Either model the surrogate via assumptions or run sensitivity analyses.

12
Q

What should you do if your surrogate variable is too unreliable an estimate of your actual outcome?

A

One must conclude that it’s better not to conduct the study at all.

13
Q

Potential problems with testing lots of hypotheses until a significant one is found include

A

Declaring effects significant when they only appear significant by chance

Misrepresenting the strength of the findings

14
Q

What is comparing your effects to familiar ones useful for?

A

Mentally calibrating the size of an effect or its significance when a variable under study is not well understood

15
Q

Negative control analyses are useful for

A

evaluating processes to see if spurious effects are obtained

as a validity check of an effect of interest by looking to see if similar effects occur with the same analysis on variables where an effect is known not to be present

16
Q

A good negative control analysis will

A

have a negative control that is known not to have an effect but is otherwise similar to the variable under study

17
Q

What is hypothesis testing

A

In hypothesis testing, we use a statistic to decide between two hypotheses. We set one as the default hypothesis (null hypothesis) and the other as the alternative.

18
Q

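The result of a hypothesis test is summarized with a p-value. What is a p-value?

A

The p-value is the probability, assuming the null hypothesis is true, of observing a test statistic at least as extreme as the one actually observed. A small p-value (close to 0) supports the alternative, while a large one (close to 1) gives no evidence against the null. To control the probability of incorrectly rejecting the null at 5%, we reject the null when the p-value is less than 0.05.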
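
The rejection rule can be sketched for the simple case of a z statistic that is standard normal under the null. The helper names `two_sided_p_value` and `reject_null` and the example z values are hypothetical, and the normal reference distribution is an assumption rather than something the card specifies.

```python
from statistics import NormalDist


def two_sided_p_value(z):
    """Probability, under the null, of a |Z| at least as large as |z|."""
    return 2 * (1 - NormalDist().cdf(abs(z)))


def reject_null(z, alpha=0.05):
    """Reject the null when the p-value falls below alpha."""
    return two_sided_p_value(z) < alpha


# Hypothetical test statistics, from weak to strong evidence.
for z in (0.5, 1.96, 3.0):
    print(z, round(two_sided_p_value(z), 4), reject_null(z))
```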

19
Q

What are potential problems with multiple comparisons

A

The more comparisons we make, the higher the probability that we see apparently significant findings simply by chance, even though no real effect is present.
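
A small sketch of how quickly this probability grows, assuming the tests are independent and every null hypothesis is true. The independence assumption and the function name `familywise_error_rate` are illustrative choices, not part of the card.

```python
def familywise_error_rate(num_tests, alpha=0.05):
    """Chance of at least one false positive across independent tests
    when every null hypothesis is true."""
    return 1 - (1 - alpha) ** num_tests


# Hypothetical numbers of hypotheses tested at the 5% level.
for m in (1, 5, 20, 100):
    print(m, round(familywise_error_rate(m), 3))
```

With a single test the rate equals alpha. With 20 independent tests it is already above 60%.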

20
Q

In A/B testing, randomization of a treatment is used to

A

Make groups as comparable as possible

Attempt to balance potential unobserved confounding variables
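
A minimal sketch of random assignment, assuming a simple 50/50 split of a list of units. The function name `randomize`, the user IDs, and the seed are hypothetical.

```python
import random


def randomize(units, seed=None):
    """Randomly split units into two arms of (nearly) equal size."""
    rng = random.Random(seed)
    shuffled = list(units)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"A": shuffled[:half], "B": shuffled[half:]}


user_ids = list(range(1, 21))  # hypothetical user IDs
arms = randomize(user_ids, seed=42)
print(arms["A"])
print(arms["B"])
```

Because chance alone decides the assignment, characteristics of the users, measured or not, tend to be balanced across the arms on average.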

21
Q

Three strategies to combat sampling bias are

A

Random sampling

Modeling

Weighting
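
To illustrate the weighting strategy, here is a sketch of a post-stratified mean. It assumes the population shares of each group are known, and the group labels, shares, and sample values are all made up for the example.

```python
from statistics import mean

# Assumed known population shares (hypothetical).
population_share = {"young": 0.5, "old": 0.5}

# Hypothetical biased sample: young respondents are over-represented.
sample = [("young", 10.0)] * 80 + [("old", 20.0)] * 20

# The unweighted mean is pulled toward the over-sampled group.
naive_mean = mean(value for _, value in sample)


def poststratified_mean(sample, population_share):
    """Weight each group's sample mean by its population share."""
    total = 0.0
    for group, share in population_share.items():
        values = [v for g, v in sample if g == group]
        total += share * mean(values)
    return total


weighted_mean = poststratified_mean(sample, population_share)
print(naive_mean, weighted_mean)  # 12.0 versus 15.0 for this toy sample
```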

22
Q

It is generally a good idea to consider possible confounders when considering a significant effect

A

TRUE

23
Q

It’s possible for a regression effect to reverse itself after the inclusion of another variable into the model

A

TRUE
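
A toy data set makes the reversal concrete. In the sketch below, all numbers are invented: the slope of y on x is positive when the two groups are pooled, but negative once the group is adjusted for. The adjustment is done by centering x and y within each group, which gives the same slope as adding a group indicator to the regression.

```python
from statistics import mean


def ols_slope(xs, ys):
    """Least-squares slope of ys on xs (with an intercept)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx


def center(values):
    """Subtract the mean from every value."""
    m = mean(values)
    return [v - m for v in values]


# Hypothetical data: within each group, y falls as x rises.
groups = {
    "A": ([1, 2, 3], [10, 9, 8]),
    "B": ([6, 7, 8], [20, 19, 18]),
}

pooled_x = [x for xs, _ in groups.values() for x in xs]
pooled_y = [y for _, ys in groups.values() for y in ys]
unadjusted = ols_slope(pooled_x, pooled_y)

# Adjusting for group: center x and y within each group, then refit.
adj_x = [v for xs, _ in groups.values() for v in center(xs)]
adj_y = [v for _, ys in groups.values() for v in center(ys)]
adjusted = ols_slope(adj_x, adj_y)

print(unadjusted, adjusted)  # positive before adjustment, -1.0 after
```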

24
Q

What are blocking and adjustment used for?

A

Accounting for variables that potentially impact the estimation of the effect of interest.

25
Q

If we see an association between two variables, it would be a good idea to

A

consider the possibility that the association is explained by a confounding third variable.

26
Q

You see an effect of ice cream sales on the number of heat exhaustion cases. The effect is likely due to:

A

The hot weather as a confounder.

27
Q

Associations can imply causality

A

under a set of strict assumptions, often as a result of design choices.

28
Q

What is confounding

A

Confounding occurs when you want to compare two things and a third gets in the way.

29
Q

Define causal effects

A

We define a causal effect as the outcome for a subject observed under a particular treatment minus the outcome for the same subject observed under control.

30
Q

What is causal inference?

A

The study of how to estimate causal effects using data is called causal inference.

31
Q

What are the residuals?

A

the difference between the response and the fitted value
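
As a small sketch of this definition, the code below fits a straight line by least squares and subtracts the fitted values from the responses. The data points and the helper name `fit_line` are hypothetical.

```python
from statistics import mean


def fit_line(xs, ys):
    """Least-squares intercept and slope for ys regressed on xs."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    intercept = y_bar - slope * x_bar
    return intercept, slope


xs = [1.0, 2.0, 3.0, 4.0]  # hypothetical predictor values
ys = [2.1, 3.9, 6.2, 7.8]  # hypothetical responses

intercept, slope = fit_line(xs, ys)
fitted = [intercept + slope * x for x in xs]

# Residual = response minus fitted value.
residuals = [y - f for y, f in zip(ys, fitted)]
print([round(r, 3) for r in residuals])
```

For a least-squares fit with an intercept, the residuals sum to zero up to rounding error, which is a quick sanity check.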

32
Q

Summary tables should include

A

Quantiles

Means

Standard deviations
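
A summary row with these quantities can be sketched with Python's `statistics` module. The function name `summarize` and the measurements are made up for illustration.

```python
from statistics import mean, quantiles, stdev


def summarize(values):
    """Return the quantities a basic summary table usually reports."""
    q1, median, q3 = quantiles(values, n=4)  # quartile cut points
    return {
        "n": len(values),
        "mean": mean(values),
        "sd": stdev(values),  # sample standard deviation
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
    }


measurements = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data
summary = summarize(measurements)
for name, value in summary.items():
    print(f"{name:>6}: {value}")
```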

33
Q

In the leading digits of data that follow Benford’s law, the digits (1-9) are all equally likely

A

FALSE
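
Under Benford's law the leading digit d occurs with probability log10(1 + 1/d), so small digits are much more common than large ones. The sketch below compares those probabilities with the leading digits of the first 1000 powers of 2, a sequence chosen here only as a convenient example known to follow the law closely.

```python
import math
from collections import Counter

# Expected share of each leading digit under Benford's law.
benford = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

# Example data: leading digits of 2**0 through 2**999.
powers = [2 ** k for k in range(1000)]
counts = Counter(int(str(p)[0]) for p in powers)
observed = {d: counts[d] / len(powers) for d in range(1, 10)}

for d in range(1, 10):
    print(d, round(benford[d], 3), round(observed[d], 3))
```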

34
Q

Merging (linking two datasets via a common index) errors can have a strong impact on subsequent analyses

A

TRUE
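
One way to catch merge errors early is to check the keys before linking. The sketch below reports keys that appear in only one table and keys that are duplicated. The table contents and the helper name `merge_report` are hypothetical.

```python
from collections import Counter

# Hypothetical tables linked by "id".
patients = [{"id": 1, "age": 34}, {"id": 2, "age": 51}, {"id": 3, "age": 47}]
outcomes = [
    {"id": 1, "score": 0.8},
    {"id": 3, "score": 0.4},
    {"id": 3, "score": 0.9},  # duplicate key: would duplicate patient 3
    {"id": 4, "score": 0.5},  # no matching patient
]


def merge_report(left, right, key):
    """List unmatched and duplicated keys before merging two tables."""
    left_keys = Counter(row[key] for row in left)
    right_keys = Counter(row[key] for row in right)
    return {
        "only_left": sorted(set(left_keys) - set(right_keys)),
        "only_right": sorted(set(right_keys) - set(left_keys)),
        "duplicated_left": sorted(k for k, c in left_keys.items() if c > 1),
        "duplicated_right": sorted(k for k, c in right_keys.items() if c > 1),
    }


report = merge_report(patients, outcomes, "id")
print(report)
```

Unmatched keys silently drop rows in an inner join, and duplicated keys silently multiply them, so either can distort every analysis that follows.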

35
Q

How do you keep on top of data quality without being in the trenches?

A

The construction of summary tables.
Regression diagnostics.
Residuals -