Week 2 Flashcards
Write one line of code that would simulate three dice rolls
np.random.choice(np.arange(1, 7), 3)
Write a line of code that would decide whether this is a ‘poker’
if np.std(x) == 0:
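A minimal runnable sketch combining these two cards (assuming five dice for a poker hand; the variable names are illustrative):
import numpy as np

rolls = np.random.choice(np.arange(1, 7), 5)  # five dice rolls
if np.std(rolls) == 0:  # zero spread means all dice are equal
    print('poker!', rolls)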
What do the following lines of code denote?
stats.norm.rvs(0, 1, 100)
stats.binom.rvs(1, .5, size=100)
stats.norm.rvs(0, 1, 100) generates 100 draws from a normal distribution with a mean of 0 and a standard deviation of 1.
stats.binom.rvs(1, .5, size=100) generates 100 draws from a binomial distribution with 1 trial and a success probability of 0.5 (i.e., 0/1 Bernoulli data). Note that size must be passed as a keyword here, because the third positional argument of binom.rvs is loc, not size.
What do the following lines of code denote:
stats.norm.ppf(0.946)
stats.norm.rvs(0, 1, 50)
stats.norm.cdf(2)
stats.norm.pdf(2)
stats.norm.cdf(2) = the probability of falling below a z value of 2 (the area to the left of 2)
stats.norm.pdf(2) = the height (density) of the distribution at a z value of 2
stats.norm.ppf(0.946) = the inverse of the cdf: the z value that has a probability area of 0.946 to its left
stats.norm.rvs(0, 1, 50) = generates 50 random draws from the standard normal distribution
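A quick check of how these functions fit together (ppf is the inverse of cdf; printed values are approximate):
from scipy import stats

print(stats.norm.cdf(2))  # ~0.977: area to the left of z = 2
print(stats.norm.ppf(0.946))  # ~1.61: z value with 94.6% of the area to its left
print(stats.norm.ppf(0.975))  # ~1.96: the familiar two-sided 95% cutoff
print(stats.norm.ppf(stats.norm.cdf(2)))  # 2.0: ppf undoes cdf
print(stats.norm.rvs(0, 1, 5))  # five random draws from N(0, 1)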
How do you get a p-value using a t test in Python?
alpha = 0.05
n = 100
x = stats.norm.rvs(loc=0.0, scale=1.0, size=n)
pg.ttest(x, 0)  # full results table
pg.ttest(x, 0)['p-val'][0]  # extract the p-value
pg.ttest(x, 0)['p-val'][0] < alpha  # reject H0?
Show code that computes the power of the t test by simulation
Add an effect to the values at the start:
n = 100; effect = .3; std = 1; alpha = .05; replications = 500
rejections = np.zeros(replications)
for i in range(replications):
    x = stats.norm.rvs(loc=effect, scale=std, size=n)
    if pg.ttest(x, 0)['p-val'][0] < alpha:
        rejections[i] = 1
print(np.mean(rejections))  # power
Power is essentially the mean of the rejections, with the data generated at loc = effect (the effect size).
How can you compute the power analytically?
df = 99; ncp = effect / (std / np.sqrt(n))  # noncentrality parameter
print(1 - stats.t.cdf(stats.t.ppf(1 - alpha/2, df), df, loc=ncp))  # shifted t as an approximation of the noncentral t
or:
pg.power_ttest(effect, n, alpha=alpha, contrast='one-sample')
What is the power to detect an effect = 0?
Alpha! (0.05)
The Z value for 95% confidence is Z=1.96.
Some students answered qnorm(.95) or stats.norm.ppf(.95). What goes wrong?
We need 2.5% in each tail for a two-sided 95% confidence interval, so use qnorm(.975) in R and stats.norm.ppf(.975) in Python.
What is the relationship between power and assumptions?
The more assumptions you are willing to make (provided they hold), the higher the power.
Compare the wilcox test to the t test
It carries out the same task but is based on ranks (no normal distribution assumed; nonparametric). It only requires that the data are symmetric.
wilcox.test(x,mu=0)
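wilcox.test is R; in Python the same one-sample comparison can be sketched with scipy (my own translation, not course code):
import numpy as np
from scipy import stats

x = stats.norm.rvs(loc=0.3, scale=1, size=100)
print(stats.ttest_1samp(x, 0))  # parametric one-sample t test
print(stats.wilcoxon(x))  # nonparametric signed-rank test of symmetry around 0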
Name a similar technique to the t test and wilcox test and how it differs
proportion test - nominal with no assumptions at all!
prop.test(sum(x > 0), n)
Checks for each value whether it is larger than 0; if so, it counts as a 1 (success), and the test compares the proportion of successes against 0.5.
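A Python counterpart (my assumption; the course used R's prop.test here) is an exact binomial test on the count of positive values:
import numpy as np
from scipy import stats

n = 100
x = stats.norm.rvs(loc=0.3, scale=1, size=n)
successes = int(np.sum(x > 0))  # how many values are larger than 0
print(stats.binomtest(successes, n, p=0.5))  # does the proportion differ from 0.5?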
For assignment 6 we compared the power of the parametric, nonparametric, and nominal tests in a solution for n=100, effect=.3, std=1.
What was concluded?
The power of the parametric (t) test and the nonparametric (Wilcoxon) test is comparable, but the power of the nominal (proportion) test is seriously lower.
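A simulation sketch of that comparison (my reconstruction, not the assignment's solution), using scipy for all three tests:
import numpy as np
from scipy import stats

n, effect, std, alpha, replications = 100, 0.3, 1, 0.05, 500
rejections = np.zeros((replications, 3))
for i in range(replications):
    x = stats.norm.rvs(loc=effect, scale=std, size=n)
    rejections[i, 0] = stats.ttest_1samp(x, 0).pvalue < alpha  # parametric
    rejections[i, 1] = stats.wilcoxon(x).pvalue < alpha  # nonparametric
    rejections[i, 2] = stats.binomtest(int(np.sum(x > 0)), n).pvalue < alpha  # nominal
print(rejections.mean(axis=0))  # power of t, Wilcoxon, and proportion test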
How is the cusp catastrophe related to Jaap’s research?
It models sudden transitions: when people start smoking, relapse into depression, fall asleep, etc.
How does Jaap relate a catastrophe to perception?
Visual illusions, e.g. the (Necker) cube that jumps between two perceptions; he made a mathematical model for it.
What did we learn about power from carrying out regression analysis in Python?
You need a large number of participants to reliably recover the underlying parameters.
What is meant by Occam’s razor?
When faced with two opposing explanations for the same set of evidence, our minds naturally prefer the explanation that makes the fewest assumptions.
How is Occam's razor related to statistics?
The problem of overfitting a model to data: when applying a model, a more complex model with more parameters will always fit the data at hand better, but the question is how much better it must do before we accept the more complex model.
Describe the three basic ways of evaluating this more complex model
Cross validation
Resampling (using simulation)
Information criteria (AIC/BIC)
What does this line of code do when generating data?
y1 = 3 + 1*x + stats.norm.rvs(0, 1, N)
Generates data along the line y = 3 + 1·x, plus noise from a standard normal distribution.
What do these lines of code do in a regression analysis?
design_matrix1 = np.vstack((np.ones(N), x)).T
lm1 = sm.OLS(y1, design_matrix1)
results1 = lm1.fit()
pred1 = results1.predict()
Fits the linear model (intercept and slope).
design_matrix3 = np.vstack((np.ones(N), x, x**2, x**3)).T
lm3 = sm.OLS(y1, design_matrix3)
results3 = lm3.fit()
Fits the cubic model (intercept plus linear, quadratic, and cubic terms).
What did we learn from this overfitting assignment?
Although we generated the data with a linear model, the more complex cubic model fit the data better. We kept half of the generated data aside to cross-check these models; on that held-out half the linear model fit better than the cubic one, demonstrating the problem of overfitting. In statistical terms, this is like a replication.
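A self-contained sketch of that cross-validation (the variable names and the even/odd split are my choices):
import numpy as np
import statsmodels.api as sm
from scipy import stats

N = 50
x = stats.norm.rvs(0, 1, N)
y1 = 3 + 1*x + stats.norm.rvs(0, 1, N)
train, test = np.arange(0, N, 2), np.arange(1, N, 2)  # split the data in half

X1 = np.vstack((np.ones(N), x)).T  # linear design matrix
X3 = np.vstack((np.ones(N), x, x**2, x**3)).T  # cubic design matrix
fit1 = sm.OLS(y1[train], X1[train]).fit()
fit3 = sm.OLS(y1[train], X3[train]).fit()

# out-of-sample mean squared error: the simpler (true) model usually wins
print(np.mean((y1[test] - fit1.predict(X1[test]))**2))
print(np.mean((y1[test] - fit3.predict(X3[test]))**2))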
How do you interpret AIC and BIC?
The lower the score, the better the fit, after punishing for the number of parameters. They are used to compare the fit of two models based on the same data, similar to comparing models with an ANOVA.
AIC and BIC both penalize goodness of fit by the number of parameters used in the model. What is their difference?
BIC takes the sample size into account and AIC does not; the punishment is larger for BIC (each parameter counts log(n) instead of 2).
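With statsmodels, AIC and BIC come directly from the fitted results object; a minimal sketch reusing the regression setup above:
import numpy as np
import statsmodels.api as sm
from scipy import stats

N = 50
x = stats.norm.rvs(0, 1, N)
y1 = 3 + 1*x + stats.norm.rvs(0, 1, N)
results1 = sm.OLS(y1, np.vstack((np.ones(N), x)).T).fit()
results3 = sm.OLS(y1, np.vstack((np.ones(N), x, x**2, x**3)).T).fit()
print(results1.aic, results1.bic)  # linear model
print(results3.aic, results3.bic)  # cubic model: lower is better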
Logistic regression can also be done using the following code:
x = stats.norm.rvs(0, 1, 1000)
logit = 1 + 1*x  # make logit data
y = (stats.uniform.rvs(0, 1, 1000) < stats.logistic.cdf(logit)).astype(int)
g = sm.GLM(y, sm.add_constant(x), family=sm.families.Binomial()).fit()  # add_constant supplies the intercept
print(g.summary())  # results
Why does Han not recommend this?
Because it just 'gives you this GLM thing', and you want to be sure that you are doing something sensible: if you can first generate data under that model, feed it back in, and recover your parameter values, then you are in charge.
How can resampling (using simulation to do statistics) be useful? (3)
Validating: validating models by using random subsets (bootstrapping, cross-validation). Example: the regression cross-validation exercise.
Precision: estimating the precision of sample statistics (medians, variances, percentiles) by drawing randomly with replacement from a set of data points (bootstrapping). Example: a confidence interval of the mean.
Significance tests: exchanging labels on data points when performing significance tests (permutation tests, also called exact tests, randomization tests, or re-randomization tests). Example: the t test permutation example.
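A minimal bootstrap sketch for the confidence-interval example (my own illustration):
import numpy as np
from scipy import stats

x = stats.norm.rvs(0, 1, 100)
boot_means = np.array([np.mean(np.random.choice(x, len(x), replace=True))
                       for _ in range(10000)])  # resample with replacement
print(np.percentile(boot_means, [2.5, 97.5]))  # 95% bootstrap CI for the mean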
When doing statistical tests in R or Python, what assumptions do we rely on regarding the distribution of the data?
Instead of relying on assumptions about the statistical distribution of the data (e.g., normally distributed), we can use simulation to generate distributions for comparison with the data. There are many different options; in the assignment we did a nonparametric (permutation) test of a correlation.
In the significance test, what do the following lines of code do?
for i in range(N):
    rs[i] = np.corrcoef(x, np.random.choice(y, len(y), replace=False))[0, 1]  # correlate x with a permuted y
np.sum(rs >= r) / N
They check how exceptional my observed correlation r is against this distribution of simulated correlations: the proportion of simulated correlations at least as large as r (their count divided by N) is the one-sided p-value.
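Putting the pieces together as a runnable permutation test of a correlation (the data-generating lines are my addition):
import numpy as np
from scipy import stats

n, N = 50, 10000
x = stats.norm.rvs(0, 1, n)
y = 0.3*x + stats.norm.rvs(0, 1, n)
r = np.corrcoef(x, y)[0, 1]  # observed correlation

rs = np.zeros(N)
for i in range(N):
    # correlation of x with a random permutation of y (labels exchanged)
    rs[i] = np.corrcoef(x, np.random.choice(y, len(y), replace=False))[0, 1]
print(np.sum(rs >= r) / N)  # one-sided permutation p-value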