Data Analysis in Real Life Flashcards
What are some methods for giving reports a clearer narrative?
- eliminating jargon and focusing on interpretability
- critiquing significant effects by coming up with potential alternate explanations
- focusing on simpler models and parsimony
What is version control software used for?
Keeping track of checked-in versions of code, data, and reports
What are some easy things to double-check in reports?
- verifying that the signs of effects are in the expected direction
- checking magnitude of effects by comparison with other known effects
- putting units on graphs and coefficients and generally keeping track of units
How do reproducible report writing tools like knitr and IPython help?
- by automating the report writing process
- by organizing one's thinking by blending the code and the narrative into a single document
- by documenting the analysis code with the project narrative
- by advancing the goal of reproducibility
What are two components of good final data products that apply across all settings?
making the report reproducible and
making the report and code version controlled.
A null result may be due to
low power
that the null hypothesis is actually correct
A study with a very low sample size will likely have
low power
Calculating power after the study has been done and analyzed is
problematic and should only be done by people well versed in the issues
Ideally, a surrogate variable for a variable of interest will
- have a known or estimable variance around the desired measurement variable
- be unbiased
What is the idea of power?
Power is the probability of rejecting the null hypothesis when it’s false.
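To make the link between sample size and power concrete, here is a minimal sketch. It assumes a two-sided one-sample z-test with known unit variance. The function name `z_test_power`, the effect size of 0.3 standard deviations, and the sample sizes are illustrative choices, not taken from the cards.

```python
from statistics import NormalDist


def z_test_power(effect_size, n, alpha=0.05):
    """Approximate power of a two-sided one-sample z-test.

    effect_size is the true mean shift in standard-deviation units
    (a hypothetical input chosen for illustration).
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # Under the alternative, the test statistic is centered at this shift.
    shift = effect_size * n ** 0.5
    # Probability that the statistic lands in either rejection region.
    upper = 1 - std_normal.cdf(z_crit - shift)
    lower = std_normal.cdf(-z_crit - shift)
    return upper + lower


# Power grows with the sample size for a fixed true effect.
for n in (10, 50, 200):
    print(n, round(z_test_power(0.3, n), 3))
```

With a true effect of zero the function returns alpha, which is the false positive rate rather than power in any useful sense.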
What can you do in the absence of any calibration data to evaluate your surrogate?
Either model via assumptions or run sensitivity analyses.
What should you do if your surrogate variable is too unreliable an estimate of your actual outcome?
One must conclude that it’s better not to conduct the study at all.
Potential problems with testing lots of hypotheses until a significant one is found include
Declaring effects significant when they arose only by chance
Misrepresenting the strength of the findings
What is comparing your effects to familiar ones useful for?
Mentally calibrating the size of an effect or its significance when a variable under study is not well understood
Negative control analyses are useful
for evaluating processes to see if spurious effects are obtained
as a validity check of an effect of interest by looking to see if similar effects occur with the same analysis on variables where an effect is known not to be present
A good negative control analysis will
have a negative control that is known not to have an effect but is otherwise similar to the variable under study
What is hypothesis testing?
In hypothesis testing, we use a statistic to decide between two hypotheses. We set one as the default hypothesis (null hypothesis) and the other as the alternative.
The result of a hypothesis test is summarized with a p-value. What is a p-value?
The p-value is the probability, assuming the null hypothesis is true, of seeing a statistic at least as extreme as the one observed. A small p-value (close to 0) supports the alternative, while a large one (close to 1) gives no evidence against the null. To control the probability of incorrectly rejecting the null at 5%, we reject the null when the p-value is less than 0.05.
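One way to see what a p-value measures is a permutation test, sketched below under the assumption that the null hypothesis means the group labels are exchangeable. The permutation test is a stand-in chosen for illustration, and the function name `permutation_p_value` and the two small samples are hypothetical.

```python
import random


def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation p-value for a difference in means."""
    rng = random.Random(seed)
    n_a = len(group_a)
    pooled = list(group_a) + list(group_b)

    def abs_mean_diff(values):
        first, second = values[:n_a], values[n_a:]
        return abs(sum(first) / len(first) - sum(second) / len(second))

    observed = abs_mean_diff(pooled)
    extreme = 0
    for _ in range(n_permutations):
        # Shuffling the labels simulates the null hypothesis of no difference.
        rng.shuffle(pooled)
        if abs_mean_diff(pooled) >= observed:
            extreme += 1
    # Add one to numerator and denominator so the estimate is never exactly 0.
    return (extreme + 1) / (n_permutations + 1)


# Hypothetical measurements from two groups.
treated = [5.1, 4.9, 5.3, 5.0, 5.2, 5.4]
control = [4.1, 3.9, 4.3, 4.0, 4.2, 4.4]
p_value = permutation_p_value(treated, control)
print("reject the null at 5%:", p_value < 0.05)
```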
What are potential problems with multiple comparisons?
The probability of seeing apparently significant findings simply by chance, even though no real effect exists, increases with the number of tests.
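How quickly chance findings accumulate can be sketched with a short calculation. It assumes the tests are independent and every null hypothesis is true. The Bonferroni threshold is one common correction and is not mentioned on the cards. Both function names are hypothetical.

```python
def family_wise_error_rate(n_tests, alpha=0.05):
    """Chance of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test cutoff that keeps the family-wise rate at or below alpha."""
    return alpha / n_tests


for m in (1, 5, 20, 100):
    print(m, round(family_wise_error_rate(m), 3), bonferroni_threshold(m))
```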
In A/B testing, randomization of a treatment is used to
make groups as comparable as possible
attempt to balance potential unobserved confounding variables
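Randomized assignment can be sketched in a few lines. The helper `randomize_assignment`, the fixed seed, and the integer user IDs are assumptions made for illustration; a production A/B system would typically assign users as they arrive.

```python
import random


def randomize_assignment(user_ids, seed=42):
    """Shuffle the users and split them into two equal-sized arms."""
    rng = random.Random(seed)
    shuffled = list(user_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    # Assignment depends only on the shuffle, never on user traits, so
    # observed and unobserved characteristics balance out on average.
    return {"A": shuffled[:half], "B": shuffled[half:]}


groups = randomize_assignment(range(100))
print(len(groups["A"]), len(groups["B"]))
```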
Three strategies to combat sampling bias are
Random sampling
Modeling
Weighting
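As an illustration of weighting, the sketch below reweights stratum means by known population shares (post-stratification). It assumes every stratum appears in the sample and that the population shares are known. The strata, the values, and the function name `weighted_mean` are made up for the example.

```python
def weighted_mean(sample, population_shares):
    """Estimate a population mean from (stratum, value) pairs.

    Each stratum mean is weighted by its share of the population
    instead of its share of the sample.
    """
    by_stratum = {}
    for stratum, value in sample:
        by_stratum.setdefault(stratum, []).append(value)
    return sum(
        population_shares[stratum] * (sum(values) / len(values))
        for stratum, values in by_stratum.items()
    )


# Hypothetical sample that over-represents the "young" stratum.
sample = [("young", 10.0)] * 8 + [("old", 20.0)] * 2
shares = {"young": 0.5, "old": 0.5}

raw_mean = sum(value for _, value in sample) / len(sample)
adjusted_mean = weighted_mean(sample, shares)
print(raw_mean, adjusted_mean)
```

The raw mean is pulled toward the over-sampled stratum, while the weighted estimate reflects the assumed population mix.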
It is generally a good idea to consider possible confounders when considering a significant effect
TRUE
It’s possible for a regression effect to reverse itself after the inclusion of another variable into the model
TRUE
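A small made-up dataset shows such a reversal. The sketch compares the slope from regressing y on x alone with the slope after adjusting for a group variable. The adjustment is done by centering x and y within each group, which gives the same x coefficient as including group indicators in the regression. All data and function names are hypothetical.

```python
def slope(xs, ys):
    """Least-squares slope of ys on xs."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx


def center_within_groups(values, groups):
    """Subtract each group's mean from its members."""
    means = {}
    for g in set(groups):
        members = [v for v, label in zip(values, groups) if label == g]
        means[g] = sum(members) / len(members)
    return [v - means[label] for v, label in zip(values, groups)]


x = [1, 2, 3, 6, 7, 8]
y = [10, 9, 8, 20, 19, 18]
group = [0, 0, 0, 1, 1, 1]

unadjusted = slope(x, y)
adjusted = slope(center_within_groups(x, group), center_within_groups(y, group))
print(unadjusted, adjusted)
```

Here the unadjusted slope is positive because the second group has larger x and larger y, yet within each group y falls as x rises.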
What are blocking and adjustment used for?
Accounting for variables that could affect the estimation of the effect of interest.
If we see an association between two variables, it would be a good idea to
consider the possibility that the association is explained by a confounding third variable.
You see an effect of ice cream sales on the number of heat exhaustion cases. The effect is likely due to:
The hot weather as a confounder.
Associations can imply causality
under a set of strict assumptions, often as a result of design choices.
What is confounding
Confounding occurs when you want to compare two things and a third gets in the way.
Define causal effects
We define a causal effect as the outcome for a subject observed under a particular treatment minus the outcome observed under a control.
What is causal inference?
The study of how to estimate causal effects using data is called causal inference.
What are the residuals?
the difference between the response and the fitted value
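The definition can be sketched for a simple linear regression fitted by least squares. The data points and the function names `fit_line` and `residuals` are illustrative assumptions.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def residuals(xs, ys):
    """Response minus fitted value for every observation."""
    intercept, slope = fit_line(xs, ys)
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
res = residuals(xs, ys)
print([round(r, 3) for r in res])
```

When the model includes an intercept, the residuals sum to zero up to rounding, which is a quick sanity check on a fit.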
Summary tables should include
Quantiles
Means
Standard deviations
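A summary row with these statistics can be built from the standard library alone. This is a minimal sketch: the `summarize` helper and the sample values are hypothetical, and `statistics.quantiles` uses its default "exclusive" method here.

```python
import statistics


def summarize(values):
    """Return count, mean, standard deviation, and quartiles for one variable."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
    }


summary = summarize([1, 2, 3, 4, 5, 6, 7])
for name, value in summary.items():
    print(f"{name:>6}: {value}")
```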
In the leading digits of data that follow Benford’s law, the digits (1-9) are all equally likely
FALSE
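Under Benford's law the probability of leading digit d is log10(1 + 1/d), so small digits are far more common than large ones. The sketch below tabulates these probabilities. The function name `benford_probability` is a hypothetical helper.

```python
import math


def benford_probability(digit):
    """Probability that the leading digit equals `digit` under Benford's law."""
    if digit not in range(1, 10):
        raise ValueError("leading digit must be between 1 and 9")
    return math.log10(1 + 1 / digit)


for d in range(1, 10):
    print(d, round(benford_probability(d), 3))
```

Comparing observed leading-digit frequencies against this table is one way to flag data that may have been altered or mis-entered.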
Merging (linking two datasets via a common index) errors can have a strong impact on subsequent analyses
TRUE
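One way to catch merge errors early is to audit which keys failed to match before analyzing the merged data. The sketch below uses plain dictionaries keyed by a shared ID. The function name `merge_with_audit` and the records are made up for illustration.

```python
def merge_with_audit(left, right):
    """Join two {key: record} dicts and report keys that did not match."""
    matched = {
        key: {**left[key], **right[key]}
        for key in left.keys() & right.keys()
    }
    left_only = sorted(left.keys() - right.keys())
    right_only = sorted(right.keys() - left.keys())
    return matched, left_only, right_only


# Hypothetical datasets sharing a customer ID.
customers = {1: {"name": "Ana"}, 2: {"name": "Ben"}, 3: {"name": "Cy"}}
orders = {2: {"total": 30}, 3: {"total": 12}, 4: {"total": 8}}

merged, missing_orders, missing_customers = merge_with_audit(customers, orders)
print(merged)
print("customers without orders:", missing_orders)
print("orders without customers:", missing_customers)
```

Unmatched keys on either side are silently dropped by an inner join, so reporting them makes the loss visible before it distorts later results.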
How do you keep on top of data quality without being in the trenches?
The construction of summary tables.
Regression diagnostics.
Residuals -