Week 2 Flashcards

1
Q

Write one line of code that would simulate three dice rolls

A

np.random.choice(np.arange(1, 7), 3)

2
Q

Write a line of code that would decide whether this is a ‘poker’

A

if np.std(x) == 0:
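
A slightly fuller sketch (the variable name x and the five-dice hand are illustrative assumptions, not the assignment's exact code):

x = np.random.choice(np.arange(1, 7), 5)   # simulate five dice
if np.std(x) == 0:                         # all dice show the same value
    print("poker!")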

3
Q

What do the following lines of code denote?

stats.norm.rvs(0, 1, 100)
stats.binom.rvs(1, 0.5, size=100)

A

stats.norm.rvs(0, 1, 100) draws 100 random values from a normal distribution with mean 0 and standard deviation 1.
stats.binom.rvs(1, 0.5, size=100) draws 100 values from a binomial distribution with n = 1 trial (i.e. Bernoulli trials) and success probability 0.5.

4
Q

What do the following lines of code denote:

stats.norm.ppf(0.946)
stats.norm.rvs(0, 1, 50)
stats.norm.cdf(2)
stats.norm.pdf(2)

A

stats.norm.cdf(2) = the cumulative probability to the left of z = 2 under the standard normal distribution
stats.norm.pdf(2) = the height (density) of the standard normal distribution at z = 2
stats.norm.ppf(0.946) = the z value such that the area to its left equals 0.946 (the inverse of the cdf)
stats.norm.rvs(0, 1, 50) = draws 50 random values from the standard normal distribution
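
A quick check of how cdf and ppf relate (ppf is the inverse of the cdf):

from scipy import stats
print(stats.norm.cdf(2))                   # ~0.977
print(stats.norm.ppf(stats.norm.cdf(2)))   # 2.0 -- ppf undoes cdf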

5
Q

How do you get a p value from a t test in Python?

A
alpha = 0.05
n = 100
x = stats.norm.rvs(loc=0.0, scale=1.0, size=n)

pg.ttest(x, 0)                       # full results table (pingouin)
pg.ttest(x, 0)['p-val'][0]           # just the p value
pg.ttest(x, 0)['p-val'][0] < alpha   # reject H0 at alpha?

6
Q

Show code that computes the power of the t test by simulation

A

Add effect to the values at the start:
n = 100; effect = .3; std = 1; alpha = .05; replications = 500

rejections = np.zeros(replications)

for i in range(replications):
    x = stats.norm.rvs(loc=effect, scale=std, size=n)
    if pg.ttest(x, 0)['p-val'][0] < alpha:
        rejections[i] = 1

print(np.mean(rejections))  # power

The power is the mean of the rejections: the proportion of replications in which H0 was rejected, with the data generated around the effect size (loc=effect).

7
Q

How can you compute the power analytically?

A

df = 99  # n - 1
ncp = effect / (std / np.sqrt(n))  # noncentrality parameter

print(1 - stats.t.cdf(stats.t.ppf(1 - alpha/2, df), df, loc=ncp))

or:

pg.power_ttest(effect, n, alpha=alpha, contrast='one-sample')

8
Q

What is the power to detect an effect = 0?

A

Alpha (0.05): when the true effect is 0, the probability of rejecting H0 is just the false-positive rate.

9
Q

The Z value for 95% confidence is Z=1.96.

Some students answered qnorm(.95) or stats.norm.ppf(.95). What goes wrong?

A

We need 2.5% in each tail for a two-sided confidence interval, so use qnorm(.975) in R and stats.norm.ppf(.975) in Python.
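
Quick check in Python (scipy):

from scipy import stats
print(stats.norm.ppf(0.975))   # 1.96
print(stats.norm.ppf(0.95))    # 1.64 -- the one-sided value, not what we want here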

10
Q

What is the relationship between power and assumptions?

A

The more assumptions you are willing to make (provided they hold), the higher the power.

11
Q

Compare the Wilcoxon test to the t test

A

It carries out the same task but is based on ranks (non-parametric; no normal distribution assumed). It only requires that the distribution of the data is symmetric.

wilcox.test(x,mu=0)
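
A Python counterpart (the line above is R; here scipy's one-sample Wilcoxon signed-rank test, which tests symmetry around 0):

from scipy import stats
stat, p = stats.wilcoxon(x)   # p value for the test that x is symmetric around 0
print(p)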

12
Q

Name a technique similar to the t test and Wilcoxon test, and how it differs

A

The proportion test: it works at the nominal level, with no distributional assumptions at all.

prop.test(sum(x>0),n)

Each value is recoded as 1 if it is larger than 0 and 0 otherwise; the test then checks whether the proportion of positive values differs from 0.5.
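
A Python analogue (a sketch; the line above is R's prop.test, here replaced by scipy's exact binomial test):

from scipy import stats
res = stats.binomtest(int(np.sum(x > 0)), n, 0.5)
print(res.pvalue)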

13
Q

For assignment 6 we compared the power of the parametric, non-parametric and nominal tests in a solution with n=100, effect=.3, std=1.

What was concluded?

A

The power of the parametric (t test) and non-parametric (Wilcoxon) tests is comparable, but the power of the nominal test (proportion test) is seriously lower.
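
A simulation sketch of that comparison (assumptions: numpy/scipy/pingouin imported as np/stats/pg; the R tests are replaced by scipy equivalents for illustration):

n = 100; effect = .3; std = 1; alpha = .05; replications = 500
rej_t = rej_w = rej_p = 0
for i in range(replications):
    x = stats.norm.rvs(loc=effect, scale=std, size=n)
    rej_t += pg.ttest(x, 0)['p-val'][0] < alpha                       # parametric
    rej_w += stats.wilcoxon(x)[1] < alpha                             # non-parametric
    rej_p += stats.binomtest(int(np.sum(x > 0)), n).pvalue < alpha    # nominal
print(rej_t / replications, rej_w / replications, rej_p / replications)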

14
Q

How is the cusp catastrophe related to Jaap’s research?

A

Sudden transitions, such as when people start smoking, relapse into depression, or fall asleep.

15
Q

How does Jaap relate a catastrophe to perception?

A

Visual illusions, e.g. the ambiguous (Necker) cube, which jumps between two perceived orientations; he made a mathematical (catastrophe) model of this.

16
Q

What did we learn about power by carrying out regression analysis in Python?

A

You need a large number of participants to reliably recover the underlying parameters.

17
Q

What is meant by Occam’s razor?

A

When faced with two opposing explanations for the same set of evidence, our minds naturally prefer the explanation that makes the fewest assumptions.

18
Q

How is Occam's razor related to statistics?

A

The problem of overfitting data to a model. A more complex model with more parameters will always fit the data better; the question is how much better it should fit before we accept the more complex model.

19
Q

Describe the three basic ways of evaluating this more complex model

A

Cross-validation
Resampling (using simulation)
Logistic regression

20
Q

What does this line of code do when generating data?

y1 = 3 + 1*x + stats.norm.rvs(0, 1, N)

A

Generates data according to the linear equation y = 3 + 1*x plus standard-normal noise.
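
For context, a minimal sketch of how N, x and y1 fit together (the value of N and the distribution of x are illustrative assumptions):

N = 100
x = stats.norm.rvs(0, 1, N)               # predictor
y1 = 3 + 1*x + stats.norm.rvs(0, 1, N)    # intercept 3, slope 1, plus noise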

21
Q

What do these lines of code do in a regression analysis?

design_matrix1 = np.vstack((np.ones(N), x)).T
design_matrix3 = np.vstack((np.ones(N), x, x**2, x**3)).T
lm1 = sm.OLS(y1, design_matrix1)
results1 = lm1.fit()
pred1 = results1.predict()
lm3 = sm.OLS(y1, design_matrix3)
results3 = lm3.fit()
A

design_matrix1 = np.vstack((np.ones(N), x)).T
lm1 = sm.OLS(y1, design_matrix1)
results1 = lm1.fit()
Fits the linear model (intercept + x).

design_matrix3 = np.vstack((np.ones(N), x, x**2, x**3)).T
lm3 = sm.OLS(y1, design_matrix3)
results3 = lm3.fit()
Fits the cubic (third-degree polynomial) model (intercept + x + x**2 + x**3).

22
Q

What did we learn from this overfitting assignment

A

Although the data were generated with a linear model, the more complex (cubic) model fit the data better. We kept half of the generated data aside to cross-check the models; on that held-out half the linear model fit better than the complex one, demonstrating the problem of overfitting. Cross-validation is like a replication in statistical terms.
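
A minimal cross-validation sketch along these lines (variable names follow the previous card; the exact split is illustrative, not the assignment's code):

train, test = slice(0, N // 2), slice(N // 2, N)

fit1 = sm.OLS(y1[train], design_matrix1[train]).fit()   # simple (true) model
fit3 = sm.OLS(y1[train], design_matrix3[train]).fit()   # complex model

# prediction error on the held-out half
mse1 = np.mean((y1[test] - fit1.predict(design_matrix1[test])) ** 2)
mse3 = np.mean((y1[test] - fit3.predict(design_matrix3[test])) ** 2)
print(mse1, mse3)   # the simpler model typically predicts the new data better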

23
Q

How do you interpret AIC and BIC?

A

The lower the value, the better: they measure goodness of fit penalized by the number of parameters. They are used to compare models fitted to the same data, similar in purpose to an ANOVA model comparison.
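
For the two models from the previous cards, a minimal comparison sketch (statsmodels OLS results expose aic and bic attributes):

print(results1.aic, results1.bic)   # linear model
print(results3.aic, results3.bic)   # cubic model -- lower values indicate the preferred model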

24
Q

AIC and BIC both penalize goodness of fit with the number of parameters used in the model. What is their difference?

A

BIC takes sample size into account and AIC does not; the penalty is larger for BIC (the number of parameters is multiplied by log(n) rather than 2).

25
Q

Logistic regression can also be run directly with the following code:

x = stats.norm.rvs(0, 1, 1000)
logit = 1 + 1*x  # make logit data
y = (stats.uniform.rvs(0, 1, 1000) < stats.logistic.cdf(logit))
g = sm.GLM(y, x, family=sm.families.Binomial()).fit()
print(g.summary())  # results

Why does Han not recommend this?

A

Because it just "gives this GLM thing": you want to be sure you are doing something sensible. If you first generate data under the model yourself, feed it back in, and recover your parameter values, then you are in charge of what is going on.
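
A sketch of that "be in charge" approach, recovering the generating parameters (note sm.add_constant, which the raw call above omits; adding it is an assumption about how one would complete the example):

X = sm.add_constant(x)                                          # intercept column + x
g = sm.GLM(y.astype(int), X, family=sm.families.Binomial()).fit()
print(g.params)                                                  # should be close to (1, 1), the generating values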

26
Q

How can resampling (using simulation to do statistics) be useful? (3)

A

Resampling = using simulation to do statistics:

Validation: validating models by using random subsets (bootstrapping, cross-validation), e.g. the regression example (cross-validation).

Precision: estimating the precision of sample statistics (medians, variances, percentiles) by drawing randomly with replacement from a set of data points (bootstrapping), e.g. the confidence interval of the mean.

Significance tests: exchanging labels on data points when performing significance tests (permutation tests, also called exact tests, randomization tests, or re-randomization tests), e.g. the t-test example.

27
Q

When doing statistical tests by resampling in R or Python, what assumptions do we rely on regarding the distribution of the data?

A

Instead of relying on assumptions about the statistical distribution of the data (e.g. normality), we use simulation to generate distributions for comparison with the data. There are many different options; in the assignment we did a nonparametric bootstrap test of a correlation.

28
Q

In the significance test, what do the following lines of code do?

for i in range(N):
    rs[i] = np.corrcoef(x, np.random.choice(y, len(y), replace=False))[0, 1]
np.sum(rs >= r) / N

A

They build a null distribution of correlations by repeatedly shuffling y (np.random.choice with replace=False is a permutation) and correlating it with x. The last line counts how often these simulated correlations are at least as large as the observed correlation r, divided by N: the (one-sided) permutation p value, i.e. how exceptional your correlation is against the distribution of simulated correlations.
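
For completeness, a runnable framing of the same permutation test (assumptions: x and y already exist; N permutations; names follow the snippet above):

N = 1000
r = np.corrcoef(x, y)[0, 1]     # observed correlation
rs = np.zeros(N)                # permutation (null) distribution
for i in range(N):
    rs[i] = np.corrcoef(x, np.random.choice(y, len(y), replace=False))[0, 1]
print(np.sum(rs >= r) / N)      # one-sided permutation p value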