Research Methods & Communication Flashcards
Experimental design: what is a factor?
What you are testing (eg a drug)
What does the Anscombe Quartet show people?
That you should really plot your results before making assumptions.
Conditional probability
P(A|B) is the conditional probability that A is true given that B is true.
What are the assumptions of a linear regression?
* Normal errors - check by looking at histogram of residuals or QQ plot.
* Variance is constant for all values of the independent variable. - check by looking at a plot of residuals vs fitted values
* Straight-line relationship between the variables - check by looking at scatterplots & plots of residuals vs fitted values.
What is multiple regression?
Use more than one independent variable to predict the dependent variable. (eg plant growth is dependent on light AND rainfall)
What is a suggested alternative to the H index?
The M index, which is calculated the same way as the H index and then divided by the number of years since the person's first publication.
Experimental design: what is a unit?
What you’re testing your factor on (number of people or plants or horses…)
Joint probability
P(A∩B) is the joint probability that both A & B are true.
What is the t value equation?
t = (x̄ - μ) / (S / √N)
(there is also another t calculation)
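As a rough illustration of the one-sample t calculation above, here is a minimal Python sketch. The function name `one_sample_t` and the data are made up for this example, and it assumes S is the sample standard deviation (n - 1 denominator).

```python
import math
import statistics

def one_sample_t(sample, mu):
    """Return t = (sample mean - mu) / (S / sqrt(N))."""
    n = len(sample)
    mean = statistics.mean(sample)
    s = statistics.stdev(sample)  # sample SD, uses the n - 1 denominator
    return (mean - mu) / (s / math.sqrt(n))

# Made-up data: five measurements tested against a hypothesised mean of 2
data = [1, 2, 3, 4, 5]
t_value = one_sample_t(data, mu=2)
print(round(t_value, 3))
```

The resulting t would then be compared with Student's t distribution on N - 1 degrees of freedom.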
What symbol is used to represent significance level?
alpha
Why use R?
+ Free
+ Open Source
+ Widely used
- Command line
- Intimidating
When do you use the MULTIPLICATION RULE of probability?
To calculate the joint probability of two or more independent events, eg flipping a head AND then flipping another head
When does confounding occur?
When it is impossible to separate the effects of experimental treatment from other factors that might affect the outcome.
What are the methods of randomisation?
simple, stratified, paired, pairwise, minimisation
What is pseudoreplication?
A special case of inadequate specification of random factors where both random and fixed factors are present.
What is a good M value?
Around 1 is a good M value.
With what data would you use a barplot?
With FREQUENCY data
When to use a Chi-Squared test?
* With nominal data
* “Goodness of fit” tests used to compare observed against theoretical frequencies
* Contingency test used to show whether data are associated or independent
How to calculate the value of cells in a contingency table?
Expected value = (column total × row total) / grand total
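A small Python sketch of that calculation, applied to every cell of a contingency table. The function name `expected_counts` and the observed counts are invented for illustration.

```python
def expected_counts(observed):
    """Expected value per cell = row total * column total / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

# Made-up 2x2 table of observed frequencies
observed = [[10, 20],
            [30, 40]]
expected = expected_counts(observed)
print(expected)  # [[12.0, 18.0], [28.0, 42.0]]
```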
Name measures of central tendency?
* Mean
* Median
* Mode
What is covariance?
Covariance is a measure of how much two random variables change together.
Experimental design: what is a level?
The levels are the values of the factor that you are varying. So if your factor was a certain drug, you could have several levels within this: 10mg; 20mg; 30mg
Why are controls necessary?
Controls help avoid the treatment in question being confounded with experimental procedures associated with treatment. (eg without a placebo, drug effects are confounded with the act of taking the treatment)
What helps to reduce the risk of confounding?
Replication and randomisation
In the standard equation y = ax + b what variables are y and x?
y is the dependent variable
x is the independent variable
Standard deviation equation
S = √(S^2), ie the square root of the variance
Graphics for exploratory analysis : univariate data
* Stem-and-leaf plots
* Histograms
* Boxplots
What are the main points of the scientific method?
1) Logical guess based on other people’s results
2) Predictions tested
3) Results. Agree with hypothesis = win. If not, formulate new hypothesis.
What should you do if you cannot control for some confounding variables at the experimental design stage?
Attempt to control for variation statistically.
- take measurements of variables that might influence the result, and hope we can quantify their influence
- this generally requires replication
- we lose some degrees of freedom in estimating the effect of these variables
What is the correlation coefficient and what does it show?
The correlation coefficient OR Pearson’s Product-Moment Correlation Coefficient OR r.
- falls between 1 and -1.
- 1 = complete positive correlation
- -1 = complete negative correlation
- 0 = no correlation
- Defined as the covariance divided by the product of their standard deviations.
Why are H, M and IF a bit shit?
All of them are strongly affected by discipline.
Common problems with experimental design and interpretation
* Non-independence of data points and pseudoreplication
* Sample size too small
* Confirmation bias & observer expectation
* Researcher degrees of freedom & ‘p-hacking’
* Interpreting a non-significant result as meaning that something is not true
* Interpreting a significant result as meaning that something is true
What are the pros and cons of Bayesian statistics?
+ Allows direct statements about probability (eg the probability that one drug is better than another)
+ Can be used to calculate the probability of future observations.
- It is subjective: because the posterior probability is affected by the prior probability, different people (with different priors) can reach different conclusions from the same data.
However, as more evidence is accumulated the posterior probabilities will converge on the same result, whatever the priors. Advocates of Bayesian statistics argue that since science is based on differences of opinion, methods of analysis should reflect this.
What do stripplots and boxplots show us?
* allow us to identify outliers, errors and patterns in variance
* gives an impression of how the continuous variable is dependent on the categorical variable
* less useful when n is high
What do scatterplots show us?
* see relationships between two variables
* check for non-linearity
* check for outliers and errors
* check for change in variance
* check for structure in the data
How can you achieve a more stringent significance level?
Use lower significance levels (e.g. 0.01 or 0.001)
What does a two-factor ANOVA allow us to test for?
Main effects and interactions. A main effect is the effect of one factor in isolation. An interaction is present when the effect of one factor depends on the level of the other factor.
Why do we randomise?
* to avoid selection bias
* control for temporal effects
* control for regression to the mean
* basis for statistical inference
Classical statistics
In classical statistics, we ask what the probability of seeing our data is, given a particular hypothesis (the null hypothesis)
What are the assumptions of the t test?
* Normal errors
* Independence of data points
* Equal variances - although R's t.test( ) defaults to Welch's version, which is fine with unequal variances
What does ANOVA rely on?
The partitioning of variance in the data into that unexplained by the factor(s) & that which is explained.
What is the equation for the correlation coefficient?
r = [ Σ(x-x̄)(y-ȳ) / (n-1) ] / (Sx × Sy)
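The correlation coefficient can be worked through by hand as the covariance divided by the product of the two standard deviations. The Python sketch below does this with the standard library only; the function names and the data are made up for illustration.

```python
import math

def covariance(xs, ys):
    """Sum of (x - mean x)(y - mean y), divided by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def pearson_r(xs, ys):
    """Covariance divided by the product of the standard deviations."""
    s_x = math.sqrt(covariance(xs, xs))  # covariance of x with itself is its variance
    s_y = math.sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (s_x * s_y)

# Made-up data: y is exactly 2x, so r should be very close to 1
xs = [1, 2, 3, 4]
ys = [2, 4, 6, 8]
print(pearson_r(xs, ys))
```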
What is an H index?
Used to assess the quality of an individual’s scientific output
What are the graphical rules according to that Tufte bloke?
- Data - ink ratio & graphical redesign
- Chartjunk
- Data - ink maximisation
- Multi-functioning graphical elements
- High resolution data graphics
- Aesthetics & technique in graphical design
In a study…
1 in 1000 people have a rare disease.
The test for this rare disease is 99.9% accurate.
You have tested positive. What are the chances that you have the disease?
In every 1000 people tested, about two will test positive:
- one will have the disease
- one is a false positive (0.1% of the 999 healthy people)
So there is roughly a 50:50 chance that you have the disease.
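The same answer falls out of Bayes' rule. This Python sketch assumes that "99.9% accurate" means the test correctly flags 99.9% of people with the disease and correctly clears 99.9% of people without it; the function and parameter names are made up for illustration.

```python
def posterior(prior, sensitivity, specificity):
    """P(disease | positive test) via Bayes' rule."""
    true_positive = prior * sensitivity
    false_positive = (1 - prior) * (1 - specificity)
    return true_positive / (true_positive + false_positive)

# 1 in 1000 people have the disease; the test is assumed 99.9% accurate both ways
p_disease = posterior(prior=0.001, sensitivity=0.999, specificity=0.999)
print(round(p_disease, 3))  # 0.5
```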
Chi-squared equation
χ2 = Σ [ (obs - exp)^2 / exp ]
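A minimal Python sketch of the chi-squared statistic, using made-up counts for a goodness-of-fit test on a coin (60 heads and 40 tails observed, 50 of each expected). The function name `chi_squared` is chosen for this example.

```python
def chi_squared(observed, expected):
    """Sum over all categories of (obs - exp)^2 / exp."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [60, 40]  # made-up counts of heads and tails
expected = [50, 50]  # what a fair coin would give on average
print(chi_squared(observed, expected))  # 4.0
```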
Why is using an estimate of the standard deviation bad?
It causes problems because it leads to systematic underestimation of σ
Should you swap in the Monty Hall Problem?
Always
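To see why swapping is the better strategy, the Python sketch below enumerates every equally likely case rather than simulating. It assumes the standard rules: the host always opens an unchosen door with no prize behind it, picking at random when there are two such doors. The function name is made up for illustration.

```python
from fractions import Fraction
from itertools import product

def win_probability(swap):
    """Exact probability of winning the car when swapping or sticking."""
    doors = range(3)
    cases = list(product(doors, doors))  # (car door, first pick), all equally likely
    wins = Fraction(0)
    for car, pick in cases:
        # Doors the host may open: not the contestant's pick and not the car
        openable = [d for d in doors if d != pick and d != car]
        for opened in openable:
            weight = Fraction(1, len(cases) * len(openable))
            if swap:
                final = next(d for d in doors if d not in (pick, opened))
            else:
                final = pick
            if final == car:
                wins += weight
    return wins

print(win_probability(swap=True))   # 2/3
print(win_probability(swap=False))  # 1/3
```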
When do you use the ADDITION RULE of probability?
When the outcomes of an event are mutually exclusive (cannot happen at the same time), eg the probability of rolling a 2 OR a 5
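The two probability rules can be checked with exact fractions. This Python sketch uses the examples from the cards (two heads in a row, and rolling a 2 or a 5), assuming a fair coin and a fair six-sided die; the variable names are made up for illustration.

```python
from fractions import Fraction

# Multiplication rule: independent events, P(A and B) = P(A) * P(B)
p_head = Fraction(1, 2)
p_two_heads = p_head * p_head

# Addition rule: mutually exclusive outcomes, P(A or B) = P(A) + P(B)
p_roll_2 = Fraction(1, 6)
p_roll_5 = Fraction(1, 6)
p_2_or_5 = p_roll_2 + p_roll_5

print(p_two_heads)  # 1/4
print(p_2_or_5)     # 1/3
```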
What must you have for both correlation and regression?
* Normal errors
* Variances must be similar across the relationship
Yates’ correction equation
χ2 = Σ [ (|obs - exp| - 0.5)^2 / exp ]
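A minimal Python sketch of the corrected statistic, taking the absolute difference between observed and expected before subtracting 0.5. The 2x2 counts are made up and written out cell by cell; the function name is chosen for this example.

```python
def chi_squared_yates(observed, expected):
    """Sum over all cells of (|obs - exp| - 0.5)^2 / exp."""
    return sum((abs(o - e) - 0.5) ** 2 / e for o, e in zip(observed, expected))

# Made-up 2x2 table, flattened: every row and column total is 20,
# so the expected value in every cell is 20 * 20 / 40 = 10
observed = [12, 8, 8, 12]
expected = [10, 10, 10, 10]
print(round(chi_squared_yates(observed, expected), 3))  # 0.9
```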
What is calculated in ANOVA?
ANOVA calculates the between-group variance (the factor variance). This is compared with the within-group variance (the error variance) using an F test.
What is a type 2 error?
Failing to reject a null hypothesis which is actually incorrect. FALSE NEGATIVE
Bayesian Statistics
In Bayesian statistics, we ask what the probabilities of different hypotheses are, given our data: we then pick the most likely hypothesis.
What are the assumptions of an ANOVA?
* Normally distributed errors
* Homoscedasticity
* Observations are independent
Variance equation
S^2 = Σ(x - x̄)^2 / (n - 1)
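A short Python sketch of the variance calculation, with the standard deviation taken as its square root. The function names and the data are made up for illustration.

```python
import math

def sample_variance(xs):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    mean = sum(xs) / n
    sum_of_squares = sum((x - mean) ** 2 for x in xs)
    return sum_of_squares / (n - 1)

def sample_sd(xs):
    """Standard deviation is the square root of the variance."""
    return math.sqrt(sample_variance(xs))

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up data, mean = 5, sum of squares = 32
print(round(sample_variance(data), 3))  # 4.571
print(round(sample_sd(data), 3))        # 2.138
```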
Does 95% PI exceed CI or does 95% CI exceed PI?
The 95% PI is ALWAYS wider than the 95% CI
What are the different types of experimental design?
Single-factor; two-factor; Higher level factorial design; incomplete design; Nested design
With what data would you use a scatterplot?
With two CONTINUOUS variables
What is Bayes’ rule?
P(A|B) = P(B|A) × P(A) / P(B)
When do you use Yates’ correction?
When a contingency table is 2x2
Replication is no use (per se) as you need to replicate the right things. What do you need to replicate?
Replicate the unit that the treatment is applied to
Limitations of Chi-Squared test
* Each set of measurements must be independent
* No sample must be … (use Fisher's exact test instead)
How do we overcome the systematic underestimation of σ?
By comparing our value of t with Student’s t Distribution which takes account of this.
How do you report a t test statistic?
the difference between means was (or was not) statistically significant (t=X.XX, Ydf, P=Z.ZZ)
What are the cons of the H index?
It is strongly affected by the length of a person’s career
What is the coefficient of determination?
r^2. The coefficient of determination is an estimate of the % variability in one variable explained by the other variable.
What are measures of dispersion?
* Variance
* Standard deviation
* IQR
How do you calculate regression in R?
lm(dependent~independent) then use summary( ) to get more information.
When comparing more than two means, use pairwise comparisons. What is the formula for the number of pairwise comparisons?
N(N-1)/2 pairwise comparisons, where N is the number of means
What is the equation for covariance?
COVARIANCE = Σ(x-x̄)(y-ȳ) / (n-1)
What is the R function for calculating the correlation coefficient?
cor.test( )
How do you compare regression lines?
ANCOVA (analysis of covariance) - uses the independent variable as the covariate.
What does correlation show?
The strength and significance of the relationship between two variables.
When do you calculate a t value?
* Compare two means
* Compare a before and after
How do you partition variability?
Use the sum of squares (SS), not the variance (S^2).
SS = Σ(x-x̄)^2 OR SS = S^2 × df
When performing pairwise comparisons, what is the formula for the probability of making at least one type 1 error?
1 - (0.95^number of tests)
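The two pairwise-comparison formulas can be combined to show how quickly errors build up. This Python sketch assumes each test is independent and uses a significance level of 0.05; the function names are made up for illustration.

```python
def n_pairwise(n_groups):
    """Number of pairwise comparisons among N group means: N(N-1)/2."""
    return n_groups * (n_groups - 1) // 2

def chance_of_any_type1_error(n_tests, alpha=0.05):
    """1 - (1 - alpha)^number of tests, assuming independent tests."""
    return 1 - (1 - alpha) ** n_tests

tests = n_pairwise(5)  # five group means
print(tests)  # 10
print(round(chance_of_any_type1_error(tests), 3))  # 0.401
```

With five groups there are ten comparisons, so the chance of at least one false positive is already around 40%.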
What do we use regression analysis for?
To fit a line to allow estimates of the dependent variable to be made from the independent.
What is extrapolation?
Estimating dependent variables from a regression equation outside the range of your data
What are the uses of graphical methods (histograms, stem-and-leaf plots) to demonstrate univariate data?
* tell us about the shape of the frequency distribution
* help to identify outliers
* help to identify possible errors
What is a type 1 error?
Reject a null hypothesis that is actually correct. FALSE POSITIVE
How to calculate Impact Factor
Impact Factor for 2012 = (number of times articles published in 2010 & 2011 were cited in 2012) / (number of citable articles published in 2010 & 2011)
What does P < 0.05 mean?
Your result is SIGNIFICANT, reject the NULL hypothesis
How do you calculate the H index?
A person’s H index is the highest number, h, for which they have h papers each with at least h citations.
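A minimal Python sketch of the H index calculation, together with the M index (H index divided by years since first publication). The function names, citation counts and career length are all made up for illustration.

```python
def h_index(citations):
    """Largest h such that h papers each have at least h citations."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h

def m_index(citations, years_since_first_paper):
    """H index divided by the number of years since the first publication."""
    return h_index(citations) / years_since_first_paper

papers = [10, 8, 5, 4, 3]  # made-up citation counts, one per paper
print(h_index(papers))     # 4
print(m_index(papers, 8))  # 0.5
```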
What is the equation for the mean?
x̄ = (1/n) Σx
What do you need to check when looking at regression?
* Structure in the data
* Error distribution
* Variance structure
* Linearity
How do you get an ANOVA table in R?
Fit the model with lm( ) or aov( ), then use anova( ) on the lm model or summary( ) on the aov model.
How do you get a histogram of residuals in R?
hist(model$residuals)
How is a line fitted?
By the method of least squares
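For a single independent variable, the least squares line y = ax + b has a closed-form solution. The Python sketch below works it out with the standard library only; the function name and the data are made up for illustration, and it is not a substitute for lm( ) in R.

```python
def least_squares(xs, ys):
    """Return (a, b) for the line y = ax + b that minimises squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum_xy / sum_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up data lying exactly on y = 2x + 1
xs = [0, 1, 2, 3]
ys = [1, 3, 5, 7]
a, b = least_squares(xs, ys)
print(a, b)  # 2.0 1.0
```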
How do we test for statistically significant correlation?
Calculate a p value associated with r
How do we know if H1 is one-tailed or two-tailed?
If it is one-tailed, H1 predicts the direction of the difference (eg A is greater than B). If it is two-tailed, H1 only predicts that there is a difference, in either direction.
With what data would you use a stripplot / boxplot?
With one CONTINUOUS variable and one CATEGORICAL variable
What is a prediction interval (PI)?
A 95% prediction interval indicates the region in which we are 95% certain that a new value of the dependent variable will lie.
How do we know if t is significant? (or any other letters that aren’t p)
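They are greater than the critical value for the 0.05 significance level at the relevant degrees of freedom (which is the same as the associated P being less than 0.05).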
When is Pearson’s correlation used?
On 2 continuous variables