Simulations Flashcards

1
Q

random module

A

Python standard-library module used to generate pseudo-random numbers

2
Q

random.seed()

A

If you initialize the random number generator with a specific seed using random.seed(), the sequence of numbers generated will always be the same (no matter how many times you run it)
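
A minimal sketch of this behaviour with the standard-library random module. The seed values 42 and 7 and the helper name draw_three are arbitrary choices made for illustration.

```python
import random

def draw_three(seed):
    """Seed the generator, then draw three floats in [0, 1)."""
    random.seed(seed)
    return [random.random() for _ in range(3)]

first = draw_three(42)
second = draw_three(42)   # same seed -> same sequence
other = draw_three(7)     # different seed -> a different sequence

print(first == second)
print(first == other)
```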

3
Q

What is the meaning of 1 in random.seed(1)?

A

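The 1 is the seed: an arbitrary starting value for the pseudo-random number generator. It has nothing to do with the content of any list and it is not an index; it only fixes the generator's starting state, so the same seed always reproduces the same sequence of numbers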

4
Q

np.random.choice()

A

used to generate a random sample from given array or list

5
Q

np.arange()

A

generates an array of evenly spaced values within a specified range -> so it is similar to range(), but range() returns a lazy range object (a list only in Python 2), whereas np.arange() returns a NumPy array and also accepts non-integer steps

6
Q

x = np.random.choice(np.arange(1, 7), 3)

A

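assigns to x an array of 3 numbers drawn at random from the integers 1 to 6 (with replacement by default, like three rolls of a die); nothing is displayed until x is printed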
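
A short sketch of this call. The seed value 1 and the variable name faces are illustrative assumptions, not part of the card.

```python
import numpy as np

np.random.seed(1)                 # fix the seed so the draw is repeatable
faces = np.arange(1, 7)           # the six faces: 1, 2, 3, 4, 5, 6
x = np.random.choice(faces, 3)    # three draws, with replacement by default

print(x)
```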

7
Q

What is the difference between simulation probability and mathematical probability?

A

Simulation probability is an approximation, whereas mathematical probability is exact. However, mathematical probability is often calculated in ideal conditions, whereas simulations can be adjusted to real-life scenarios
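
As a rough illustration, the sketch below compares the two for one event: two fair dice summing to 7. The event, the number of rolls and the seed are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rolls = 100_000

# rng.integers excludes the upper bound, so this draws from 1..6
die1 = rng.integers(1, 7, size=n_rolls)
die2 = rng.integers(1, 7, size=n_rolls)

simulated = np.mean(die1 + die2 == 7)   # approximation from simulation
exact = 6 / 36                          # 6 favourable outcomes out of 36

print(simulated, exact)
```

The simulated value moves closer to the exact one as n_rolls grows.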

8
Q

one-sample t-test

A

parametric test
examines whether the mean of a population is statistically different from a known or hypothesized value (equivalently, it checks whether the difference between the sample mean and that value differs from 0)

9
Q

What are data requirements for one-sample t-test?

A

1) continuous test variable
2) scores on test variable are independent (there is no relationship between scores)
3) random sample of data from population
4) normal distribution of sample and population on test variable
5) homogeneity of variances (variances are approx equal in both sample and population)
6) no outliers

10
Q

what is linear regression?

A

estimates the coefficients of the linear equation, involving one or more independent variables that best predict the value of the dependent variable. Linear regression fits a straight line or surface that minimizes the discrepancies between predicted and actual output values.
Y = β0 + β1X + ε
Y = dependent variable
β0 = Y-intercept
β1 = slope
X = independent variable
ε = error
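
A minimal sketch of fitting this equation to simulated data with np.polyfit. The true intercept (1.0), slope (2.0) and noise level are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 200)
# Simulated data: assumed intercept 1.0, slope 2.0, noise sd 0.5
y = 1.0 + 2.0 * x + rng.normal(0, 0.5, size=x.size)

# np.polyfit returns coefficients from highest degree to lowest
slope, intercept = np.polyfit(x, y, 1)

# The least-squares slope also equals cov(X, Y) / var(X)
slope_manual = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)

print(round(intercept, 2), round(slope, 2))
```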

11
Q

How to calculate the slope (β1)?

A

change in Y / change in X (rise over run); for a least-squares fit, β1 = cov(X, Y) / var(X)

12
Q

What is polynomial linear regression?

A

It is used when there is nonlinear relationship between the predictor and response variable (when data points form a curve)

Y = β0 + β1X + β2X^2 + … + βhX^h + ε

In this equation, h is referred to as the degree of the polynomial.
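
A small sketch comparing a straight line (h = 1) with a quadratic (h = 2) on curved data. The simulated coefficients, the noise level and the helper name rss are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-3, 3, 100)
# Simulated curved data with an assumed quadratic term
y = 1.0 - 2.0 * x + 0.5 * x**2 + rng.normal(0, 0.3, size=x.size)

def rss(degree):
    """Residual sum of squares of a polynomial fit of the given degree."""
    coefs = np.polyfit(x, y, degree)
    fitted = np.polyval(coefs, x)
    return np.sum((y - fitted) ** 2)

rss_line = rss(1)
rss_quad = rss(2)

print(rss_line, rss_quad)
```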

13
Q

What are advantages of polynomial regression?

A
  • A polynomial can approximate many curved relationships between the dependent and independent variable more closely than a straight line.
  • A broad range of functions can be fit with it.
  • It accommodates a wide range of curvature.
14
Q

What are disadvantages of polynomial regression?

A
  • The presence of one or two outliers in the data can seriously affect the results of the analysis; polynomial fits are very sensitive to outliers.
  • In addition, there are fewer model validation tools for detecting outliers in nonlinear regression than in linear regression.
  • More complex (higher-degree) models are also prone to overfitting.
15
Q

What is a continuous uniform distribution?

A

probability distribution in which every value in an interval [a, b] is equally likely (constant density); it is symmetric about the midpoint of the interval

16
Q

How to draw samples from uniform distribution?

A

stats.uniform.rvs(0, 1, size=100)
generates 100 random samples from a uniform distribution
loc = 0 (sets the lower bound)
scale = 1 (sets the width of the interval)
so samples are drawn from the interval [loc, loc + scale] = [0, 1]
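
To show how loc and scale set the interval, here is a sketch with non-default values. The values loc=2 and scale=3 and the fixed random_state are assumptions for the example.

```python
from scipy import stats

# loc=2, scale=3 -> samples fall in the interval [2, 5]
samples = stats.uniform.rvs(loc=2, scale=3, size=1000, random_state=0)

print(samples.min(), samples.max())
```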

17
Q

What is binomial distribution?

A

distribution of the number of successes in n independent trials, where each trial ends in either success (with probability p) or failure

18
Q

How to generate random variables from binomial distribution?

A

stats.binom.rvs(n=1, p=0.5, size=100)
n - number of trials
p - probability of success
size - number of random variables
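
A brief sketch of how n changes the output. The second call with n=10 and the fixed random_state are assumptions added for contrast.

```python
from scipy import stats

# n=1: each value is a single trial (0 = failure, 1 = success)
flips = stats.binom.rvs(n=1, p=0.5, size=100, random_state=0)

# n=10: each value is the number of successes out of 10 trials
heads = stats.binom.rvs(n=10, p=0.5, size=1000, random_state=0)

print(flips[:10])
print(heads[:10], heads.mean())
```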

19
Q

What is normal distribution?

A

data are symmetrically distributed; bell-shaped - most values clustering around central region
mean = median = mode

20
Q

How to generate random samples from normal distribution in Python?

A

stats.norm.rvs(loc=0, scale=1, size=100)
loc = mean
scale = standard deviation
size = number of random samples to generate

OR
np.random.normal(loc=0, scale=1, size=100)

21
Q

What does the function stats.norm.cdf do?

A

Cumulative Distribution Function
area under the curve
For a normal distribution, the CDF at a point x gives the area under the probability density function (PDF) curve to the left of x. This area corresponds to the probability that a randomly selected value from the distribution is less than or equal to x.

22
Q

What does the function stats.norm.pdf do?

A

Probability Density Function

The PDF of a continuous random variable gives the relative likelihood of the random variable taking on a specific value.

It is the height of the curve at that value

22
Q

What is stats.norm.ppf?

A

Percent Point Function
inverse cdf!
given the probability of numbers smaller than x, what is x?
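
A short sketch relating cdf, pdf and ppf on the standard normal distribution. The points 1.96 and 0 are arbitrary choices for illustration.

```python
from scipy import stats

p_below = stats.norm.cdf(1.96)     # area to the left of 1.96, about 0.975
density_at_0 = stats.norm.pdf(0)   # height of the curve at 0, about 0.399
x_back = stats.norm.ppf(p_below)   # inverse of cdf: recovers 1.96

print(p_below, density_at_0, x_back)
```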

23
Q

How to generate random samples from exponential distribution?

A

stats.expon.rvs

23
Q

How to generate random samples from beta distribution?

A

stats.beta.rvs

23
Q

How to generate random samples from chi-squared distribution?

A

stats.chi2.rvs

24
Q

How to generate random samples from gamma distribution?

A

stats.gamma.rvs

25
Q

What is standard normal distribution?

A

special normal distribution where mean=0, sd = 1

26
Q

What are z-scores?

A

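number of standard deviations by which a value lies above or below the mean; it is the test statistic of a z-test (the t-test uses the analogous t-statistic)
the area under the standard normal curve to the right of a z-score is the one-tailed p-value: the probability of an observation at least that extreme if the null hypothesis is true.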

27
Q

P-value

A

probability of obtaining test results at least as extreme as results actually observed, under assumption that null hypothesis is true

28
Q

Alpha

A

significance level -> the Type 1 error (false positive) rate we are willing to accept
a Type 1 error is when we believe that there is a genuine effect in the population when in fact there isn’t

with real data, we don’t define alpha as part of the code -> instead we look at the p-value and compare it with alpha

28
Q

How to perform one-sample t-test in Python?

A

stats.ttest_1samp(sample_data, population_mean)

OR
pg.ttest(data, population_mean)
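
A minimal sketch of the SciPy call on simulated data, with the t-statistic recomputed by hand as a check. The sample (true mean 0.5, n = 50) and the hypothesized mean 0 are assumptions for the example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample_data = rng.normal(loc=0.5, scale=1.0, size=50)  # simulated sample
population_mean = 0.0                                  # hypothesized value

result = stats.ttest_1samp(sample_data, population_mean)

# Same statistic by hand: (sample mean - hypothesized mean) / standard error
n = len(sample_data)
t_manual = (sample_data.mean() - population_mean) / (
    sample_data.std(ddof=1) / np.sqrt(n)
)

print(result.statistic, result.pvalue)
```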

29
Q

What is power of test?

A

probability that the test correctly rejects a false null hypothesis (H0). It’s the complement of the Type II error rate (i.e., Power=1−β).

30
Q

What is effect size?

A

refers to the magnitude of the difference that you expect to detect

31
Q

How to calculate power of t-test?

A

pg.power_ttest
parameters:
d = effect size (Cohen's d)
n = sample size
power = power of the test
alpha = significance level

exactly one of these four must be None; it is calculated from the other three
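
Power can also be estimated by simulation, which fits the theme of this deck. The sketch below does not use pg.power_ttest: it simulates many one-sample t-tests with SciPy, assuming d = 0.5, n = 30, alpha = 0.05 and 2000 repetitions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
d, n, alpha, n_sims = 0.5, 30, 0.05, 2000

# Each row is one simulated study: n observations with true mean d and sd 1,
# so the true effect size against a null mean of 0 is d
samples = rng.normal(loc=d, scale=1.0, size=(n_sims, n))

# One-sample t-test per row against the null value 0
p_values = stats.ttest_1samp(samples, 0.0, axis=1).pvalue

# Power = share of simulated studies that reject the (false) null hypothesis
power = np.mean(p_values < alpha)

print(power)
```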

32
Q

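How to round numbers to specific number of decimal places?

A

np.round(x, decimals), for example np.round(x, 2) for 2 decimal places; the built-in round(x, 2) works for single numbers
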
33
Q

What are non-parametric tests?

A

tests that don’t require your data to follow a normal distribution (or any other specific parametric distribution)

34
Q

What is the parameter of the model?

A

a parameter of a model is a quantity that can take a range of values; its value determines how the model describes the data (e.g. the slope in linear regression)

35
Q

Why do we choose parameters?

A

to simulate models to then generate simulated data

36
Q

How to report statistical results?

A

1) report overall significance of main model first and compare to other models (F, p-values, BIC, AIC)
2) report findings of individual parameters (effects in linear regression)
3) include post-hoc and additional analyses
4) do not report all non-significant results
5) round to 3 decimal places when reporting p-values/Bayes factors

37
Q

What does sm.OLS do?

A

OLS stands for Ordinary Least Squares,
which is a method for estimating the parameters of a linear regression model
the goal is to find the line that minimizes the sum of squared differences between the observed values and the values predicted by the model

it requires 2 main inputs:
- dependent variable (observed data that you want to model -> goal is to predict them)
- independent variable(s), which can be coded in a design matrix
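
To show what the least-squares step computes, the sketch below builds a design matrix by hand and solves it with np.linalg.lstsq rather than sm.OLS. The tiny noise-free data set is an assumption chosen so the result is easy to check.

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 1.0 + 2.0 * x                      # dependent variable (noise-free here)

# Design matrix: a column of ones for the intercept, then the predictor
X = np.column_stack([np.ones_like(x), x])

# Least-squares solution: minimizes the sum of squared residuals
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

y_hat = X @ beta                       # values predicted by the fitted model

print(beta)
```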

38
Q

How to fit model onto the data?

A

use function .fit()

39
Q

How to simulate uncorrelated variables?

A

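They need to be simulated separately, e.g. each drawn independently from a random distribution
SO you CANNOT use linspace! (linspace values increase steadily, so two linspace variables are perfectly correlated)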
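
A small sketch of the contrast. The sample size, the distributions and the seed are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Two variables drawn independently: correlation close to 0
a = rng.normal(0, 1, size=n)
b = rng.normal(0, 1, size=n)
r_independent = np.corrcoef(a, b)[0, 1]

# Two linspace variables: both increase steadily, so correlation is 1
u = np.linspace(0, 2, n)
v = np.linspace(5, 9, n)
r_linspace = np.corrcoef(u, v)[0, 1]

print(r_independent, r_linspace)
```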

40
Q

What does the function np.linspace do?

A

generates evenly spaced values over a specified interval (both endpoints included by default)
for example:
x = np.linspace(0, 2, N) generates N evenly spaced points between 0 and 2 and stores them in the variable x

41
Q

What is overfitting?

A

when statistical model captures noise in the data -> it fits the data too well

42
Q

Occam’s Razor

A

when faced with 2 opposing explanations for the same set of evidence, preference is for the explanation making the fewest assumptions

43
Q

What fits data better - more complex or simple model?

A

More complex model
-> because it is more flexible, it fits the data at hand more closely, but partly by overfitting (fitting noise)

44
Q

How to generate predicted y values based on model?

A

Use the predict() method on the results object returned by the model fit (which holds the estimated coefficients)

45
Q

What is R-squared?

A

statistical measure in regression analysis that represents the proportion of the variance for a dependent variable explained by an independent variable or variables, with values ranging from 0 to 1

46
Q

What is interpretation of high R-squared?

A

good model fit to the data
however, it doesn’t say anything about causation

47
Q

How to calculate R-squared for linear models?

A

r2 = np.round(1 - (np.var(prediction - y1, ddof=1) / np.var(y1, ddof=1)), 2)

essentially
1 - variance of residuals/variance of original data
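
A runnable sketch of this calculation on simulated data, reusing the names prediction and y1 from the card. The data, the np.polyfit fit and the check against the squared correlation are assumptions added for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = np.linspace(0, 2, 100)
y1 = 1.0 + 3.0 * x1 + rng.normal(0, 0.5, size=x1.size)  # simulated data

# Fit a straight line and compute its predictions
slope, intercept = np.polyfit(x1, y1, 1)
prediction = intercept + slope * x1

# 1 - variance of residuals / variance of original data
r2 = 1 - np.var(prediction - y1, ddof=1) / np.var(y1, ddof=1)

# With one predictor and an intercept, R-squared equals the squared correlation
r = np.corrcoef(x1, y1)[0, 1]

print(np.round(r2, 2), np.round(r**2, 2))
```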

48
Q

How to tackle problem of overfitting?

A

cross-validation!
we can split the initial data into separate training and test subsets

then you train the model on training subset + test it on test subset
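
A minimal sketch of a single train/test split, comparing a simple and a more complex polynomial. The simulated linear data, the 40/20 split, the degrees 1 and 5 and the helper name mse are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=60)
y = 2.0 * x + rng.normal(0, 0.3, size=x.size)   # truly linear data plus noise

# Shuffle the indices, then split into training and test subsets
idx = rng.permutation(x.size)
train_idx, test_idx = idx[:40], idx[40:]

def mse(degree, fit_idx, eval_idx):
    """Fit a polynomial on fit_idx, return mean squared error on eval_idx."""
    coefs = np.polyfit(x[fit_idx], y[fit_idx], degree)
    return np.mean((y[eval_idx] - np.polyval(coefs, x[eval_idx])) ** 2)

for degree in (1, 5):
    print(degree, mse(degree, train_idx, train_idx), mse(degree, train_idx, test_idx))
```

The training error can only go down as the degree increases; the test error is what reveals whether the extra flexibility generalizes.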

49
Q

What is AIC?
Akaike Information Criterion

AIC = 2k - 2ln(L)
k = number of estimated parameters
L = maximum likelihood value

you can get it after fitting a model with the statsmodels library

it is enough to read the attribute results.aic

lower AIC = better model

A

What is BIC?
Bayesian Information Criterion

BIC = k ln(n) - 2ln(L)
n = number of observations

you can get it after fitting a model with the statsmodels library

it is enough to read the attribute results.bic

lower BIC = better model

50
Q

Why are AIC and BIC useful?

A

they are metrics of model comparison that penalize model complexity (having more parameters)

51
Q

What is bootstrapping?

A

statistical procedure that resamples single dataset to create many simulated samples
each of these simulated samples has its own properties - such as mean

52
Q

How to calculate the lower bound of a 95% confidence interval for the mean of a normal sample?

A

x = stats.norm.rvs(0, 1, n)
lower = np.mean(x) - 1.96 * np.std(x, ddof=1) / np.sqrt(n)

53
Q

How to calculate the upper bound of a 95% confidence interval for the mean of a normal sample?

A

x = stats.norm.rvs(0, 1, n)
upper = np.mean(x) + 1.96 * np.std(x, ddof=1) / np.sqrt(n)
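
A runnable sketch computing both bounds, using the standard error sd / sqrt(n). The sample size n = 100, the fixed random_state and the variable name sem are assumptions for the example.

```python
import numpy as np
from scipy import stats

n = 100
x = stats.norm.rvs(loc=0, scale=1, size=n, random_state=0)

# Standard error of the mean: sample sd divided by the square root of n
sem = np.std(x, ddof=1) / np.sqrt(n)

lower = np.mean(x) - 1.96 * sem
upper = np.mean(x) + 1.96 * sem

print(lower, upper)
```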

54
Q

How to derive bootstrap confidence interval?

A

1) draw random samples with replacement from original sample multiple times
2) for each re-sample, calculate mean and store it in means array
3) bootstrap confidence interval is derived by sorting the bootstrap sample means and selecting the 2.5th and 97.5th percentiles as the lower and upper bounds of the CI
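
The three steps above can be sketched as follows. The original sample (200 exponential draws), the 5000 resamples and the seed are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.exponential(scale=2.0, size=200)   # a non-normal original sample

# 1) draw resamples with replacement, each the same size as the original
n_boot = 5000
resamples = rng.choice(sample, size=(n_boot, sample.size), replace=True)

# 2) calculate the mean of each resample
means = resamples.mean(axis=1)

# 3) take the 2.5th and 97.5th percentiles as the bounds of the 95% CI
lower, upper = np.percentile(means, [2.5, 97.5])

print(lower, sample.mean(), upper)
```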

55
Q

What is the difference between traditional confidence interval and bootstrap confidence interval?

A

Bootstrapping is a non-parametric method (you don’t draw from a normal distribution) - it makes no assumptions about the underlying distribution. It relies on resampling the data many times to approximate the sampling distribution of the mean.