Simulations Flashcards
random module
used to generate pseudo-random numbers in Python
random.seed()
If you initialize the random number generator with a specific seed using random.seed(), the sequence of numbers generated will always be the same (no matter how many times you run it)
What is the meaning of 1 in random.seed(1)?
1 is the seed value. It has nothing to do with the content of any list; it is an arbitrary starting value for the random number generator, and reusing the same seed reproduces the same sequence of numbers
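A minimal sketch of the seeding behaviour (variable names are illustrative, not from the original notes):

```python
import random

# Seed the generator, draw three numbers, then reseed and draw again.
random.seed(1)
first_run = [random.random() for _ in range(3)]

random.seed(1)
second_run = [random.random() for _ in range(3)]

# A different seed starts a different sequence.
random.seed(2)
third_run = [random.random() for _ in range(3)]

print(first_run == second_run)  # same seed, same sequence
print(first_run == third_run)   # different seed, different sequence
```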
np.random.choice()
used to generate a random sample from given array or list
np.arange()
generates an array of evenly spaced values within a specified range -> so it is similar to range(), but range() returns a range object (a list only in Python 2) and np.arange() returns a NumPy array
x = np.random.choice(np.arange(1, 7), 3)
draws 3 random numbers (with replacement by default) from the array of integers 1 to 6 and stores them in x
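The same call as a runnable sketch; the fixed seed and the name faces are added here only to make the example reproducible and readable:

```python
import numpy as np

np.random.seed(1)                # fix the seed so the draw is reproducible
faces = np.arange(1, 7)          # integers 1..6 (the stop value 7 is excluded)
x = np.random.choice(faces, 3)   # 3 draws, with replacement by default
print(faces)
print(x)
```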
What is the difference between simulation probability and mathematical probability?
Simulation probability is an approximation, whereas mathematical probability is exact. However, mathematical probability is often calculated in ideal conditions, whereas simulations can be adjusted to real-life scenarios
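As an illustration, the probability of rolling a six can be approximated by simulation and compared with the exact value 1/6. The number of rolls and the seed are arbitrary choices for this sketch:

```python
import numpy as np

np.random.seed(1)
n_rolls = 100_000

# Simulate rolls of a fair six-sided die.
rolls = np.random.choice(np.arange(1, 7), n_rolls)

simulated = np.mean(rolls == 6)  # proportion of sixes in the simulation
exact = 1 / 6                    # mathematical probability

print(simulated, exact)
```

The simulated value gets closer to the exact one as n_rolls grows, but it remains an approximation.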
one-sample t-test
parametric test
examines whether the mean of a population is statistically different from a known or hypothesized value (so it checks whether the difference between the sample mean and the hypothesized value differs from 0)
What are data requirements for one-sample t-test?
1) continuous test variable
2) scores on test variable are independent (there is no relationship between scores)
3) random sample of data from population
4) normal distribution of sample and population on test variable
5) homogeneity of variances (variances are approx equal in both sample and population)
6) no outliers
What is linear regression?
estimates the coefficients of the linear equation, involving one or more independent variables that best predict the value of the dependent variable. Linear regression fits a straight line or surface that minimizes the discrepancies between predicted and actual output values.
Y = β0 + β1X + ε
Y = dependent variable
β0 = Y-intercept
β1 = slope
X = independent variable
ε = error
How to calculate the slope (β1)?
change in Y / change in X
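A small sketch on a noiseless line, comparing the change-in-Y over change-in-X rule with a least-squares fit. The data and the use of np.polyfit are illustrative choices, not from the original notes:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 + 3.0 * x   # noiseless line with intercept 2 and slope 3

# change in Y / change in X between the first and last point
slope_two_points = (y[-1] - y[0]) / (x[-1] - x[0])

# least-squares estimate; for degree 1, np.polyfit returns [slope, intercept]
slope_fit, intercept_fit = np.polyfit(x, y, 1)

print(slope_two_points, slope_fit, intercept_fit)
```

With noisy data the two-point rule depends on which points are picked, which is why the least-squares estimate is used in practice.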
What is polynomial linear regression?
It is used when there is nonlinear relationship between the predictor and response variable (when data points form a curve)
Y = β0 + β1X + β2X^2 + … + βhX^h + ε
In this equation, h is referred to as the degree of the polynomial.
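A minimal sketch of fitting a degree-2 polynomial to simulated curved data. The true coefficients, noise level and use of np.polyfit are assumptions made for this example:

```python
import numpy as np

np.random.seed(1)
x = np.linspace(-3, 3, 200)
# Simulated data: Y = 1 + 0.5 X + 2 X^2 + noise
y = 1.0 + 0.5 * x + 2.0 * x**2 + np.random.normal(0, 0.5, size=x.size)

# np.polyfit returns coefficients from the highest power down: [b2, b1, b0]
b2, b1, b0 = np.polyfit(x, y, deg=2)

print(b0, b1, b2)  # estimates should be close to 1, 0.5 and 2
```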
What are advantages of polynomial regression?
- A polynomial can provide a good approximation of a curved relationship between the dependent and independent variable.
- A broad range of functions can be fit with it.
- It accommodates a wide range of curvature.
What are disadvantages of polynomial regression?
- Polynomial fits are very sensitive to outliers: one or two outliers in the data can seriously affect the results.
- There are fewer model validation tools for detecting outliers in nonlinear regression than there are for linear regression.
- More complex models (higher degree) are prone to overfitting.
What is continuous uniform distribution?
symmetric probability distribution where all outcomes have an equal likelihood of occurring
How to draw samples from uniform distribution?
stats.uniform.rvs(0, 1, size=100)
generates 100 random samples from a uniform distribution
loc = 0 (sets lower bound)
scale = 1 (sets width of the interval, so the upper bound is loc + scale)
so samples are drawn from the interval [0, 1]
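To see how loc and scale set the interval, here is a sketch with other values. loc=2 and scale=3 are arbitrary, and random_state is added only for reproducibility:

```python
from scipy import stats

# loc=2, scale=3 -> samples fall in the interval [2, 5]
samples = stats.uniform.rvs(loc=2, scale=3, size=1000, random_state=1)

print(samples.min(), samples.max())
```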
What is binomial distribution?
distribution of the number of successes in n independent trials, where each trial ends in either success (with probability p) or failure
How to generate random variables from binomial distribution?
stats.binom.rvs(n=1, p=0.5, size=100)
n - number of trials (n=1 gives a single success/failure trial, i.e. a Bernoulli variable)
p - probability of success
size - number of random variables
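A short sketch contrasting n=1 with n=10. The coin-flip framing and variable names are illustrative:

```python
from scipy import stats

# n=1: each value is a single trial, so it is either 0 (failure) or 1 (success)
flips = stats.binom.rvs(n=1, p=0.5, size=100, random_state=1)

# n=10: each value counts the successes in 10 trials, so it lies between 0 and 10
heads = stats.binom.rvs(n=10, p=0.5, size=100, random_state=1)

print(flips[:10])
print(heads[:10])
```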
What is normal distribution?
data are symmetrically distributed; bell-shaped - most values clustering around central region
mean = median = mode
How to generate random samples from normal distribution in Python?
stats.norm.rvs(loc=0, scale=1, size=100)
loc = mean
scale = standard deviation
size = number of random samples to generate
OR
np.random.normal(loc=0, scale=1, size=100)
What does the function stats.norm.cdf do?
Cumulative Distribution Function
area under the curve
For a normal distribution, the CDF at a point x gives the area under the probability density function (PDF) curve to the left of x. This area corresponds to the probability that a randomly selected value from the distribution is less than or equal to x.
What does the function stats.norm.pdf do?
Probability Density Function
The PDF of a continuous random variable gives the relative likelihood of the random variable taking on a specific value.
It is height of the curve
What is stats.norm.ppf?
Percent Point Function
inverse cdf!
given the probability of numbers smaller than x, what is x?
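A sketch showing how cdf, ppf and pdf relate for the standard normal distribution. The value 1.96 is just a familiar example:

```python
from scipy import stats

p = stats.norm.cdf(1.96)       # area to the left of 1.96, about 0.975
x = stats.norm.ppf(p)          # inverse of the cdf: recovers 1.96
height = stats.norm.pdf(0)     # height of the curve at 0, about 0.399

print(p, x, height)
```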
How to generate random samples from exponential distribution?
stats.expon.rvs
How to generate random samples from beta distribution?
stats.beta.rvs
How to generate random samples from chi-squared distribution?
stats.chi2.rvs
How to generate random samples from gamma distribution?
stats.gamma.rvs
What is standard normal distribution?
special normal distribution where mean=0, sd = 1
What are z-scores?
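number of standard deviations by which a value lies above or below the mean: z = (x - mean) / sd; used as the test statistic in a z-test (the t-test uses the analogous t-statistic)
area under the curve to the right of a z-score is the (one-tailed) p-value, and it's the likelihood of an observation at least this extreme occurring if the null hypothesis is true.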
P-value
probability of obtaining test results at least as extreme as results actually observed, under assumption that null hypothesis is true
Alpha
significance level -> the Type I error (false positive) rate we are willing to accept
Type I error: we believe that there is a genuine effect in the population when in fact there isn't
in real data, we don’t define alpha as part of the code -> instead look at p-value and compare it with alpha
How to perform one-sample t-test in Python?
stats.ttest_1samp(sample_data, population_mean)
OR
pg.ttest(data, population_mean)
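A minimal sketch with scipy on simulated data. The sample, the hypothesized mean of 0 and alpha = 0.05 are assumptions made for this example; the t-statistic is also computed by hand for comparison:

```python
import numpy as np
from scipy import stats

# Simulated sample whose true mean (0.5) differs from the hypothesized mean (0)
sample_data = stats.norm.rvs(loc=0.5, scale=1, size=50, random_state=1)
population_mean = 0

t_stat, p_value = stats.ttest_1samp(sample_data, population_mean)

# The same t-statistic by hand: (sample mean - hypothesized mean) / standard error
se = np.std(sample_data, ddof=1) / np.sqrt(len(sample_data))
t_by_hand = (np.mean(sample_data) - population_mean) / se

alpha = 0.05
print(t_stat, t_by_hand, p_value, p_value < alpha)
```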
What is power of test?
probability that the test correctly rejects a false null hypothesis (H0). It’s the complement of the Type II error rate (i.e., Power=1−β).
What is effect size?
refers to the magnitude of the difference that you expect to detect
How to calculate power of t-test?
pg.power_ttest
parameters:
d = effect size
n = sample size
power
alpha
exactly one of these four must be passed as None, and it is calculated from the other three
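Power can also be approximated by simulation instead of with pg.power_ttest. The sketch below is that swapped-in approach, using only numpy and scipy. The values n = 30, d = 0.5, alpha = 0.05 and the number of simulations are arbitrary choices:

```python
import numpy as np
from scipy import stats

np.random.seed(1)
n, d, alpha, n_sims = 30, 0.5, 0.05, 5000

# Each row is one simulated experiment with true effect size d (sd = 1)
samples = np.random.normal(loc=d, scale=1, size=(n_sims, n))

# One-sample t-test of every row against the null value 0
_, p_values = stats.ttest_1samp(samples, 0, axis=1)

# Power = proportion of experiments in which the false H0 is rejected
power = np.mean(p_values < alpha)
print(power)
```

The result should land near the analytic power for these settings (roughly 0.75).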
How to round numbers to specific number of decimal places?
np.round
What are non-parametric tests?
tests which don't require that your data follow a normal distribution
What is the parameter of the model?
parameter of a model is a variable that can take range of values that describe the data
Why do we choose parameters?
to define a model from which we can then generate simulated data
How to report statistical results?
1) report overall significance of main model first and compare to other models (F, p-values, BIC, AIC)
2) report findings of individual parameters (effects in linear regression)
3) include post-hoc and additional analyses
4) do not report all non-significant results
5) round to 3 decimal places when reporting p-values/Bayes factors
What does sm.OLS do?
OLS stands for Ordinary Least Squares
which is method for estimating parameters in linear regression model
goal is to find the line that minimizes the sum of squared differences between observed value and value predicted by the model
it requires 2 main inputs:
- dependent variable (observed data that you want to model -> goal is to predict them)
- independent variable(s), which can be coded in a design matrix
How to fit model onto the data?
use function .fit()
How to simulate uncorrelated variables?
They need to be simulated separately, e.g. each drawn independently from a random distribution
So you CANNOT use linspace! (variables built with linspace are perfectly correlated)
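A sketch of why independent random draws work and linspace does not. The sample size and distributions are arbitrary choices:

```python
import numpy as np

np.random.seed(1)
N = 10_000

# Two variables drawn separately from a normal distribution
a = np.random.normal(0, 1, N)
b = np.random.normal(0, 1, N)
r_random = np.corrcoef(a, b)[0, 1]      # close to 0

# Two variables built with linspace rise together in lockstep
c = np.linspace(0, 1, N)
d = np.linspace(0, 2, N)
r_linspace = np.corrcoef(c, d)[0, 1]    # essentially 1

print(r_random, r_linspace)
```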
What does the function np.linspace do?
generates evenly spaced values over specified interval
for example:
x = np.linspace(0, 2, N) generates N evenly spaced points between 0 and 2 (both endpoints included) and stores them in the variable x
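For instance, with N = 5 (an arbitrary choice):

```python
import numpy as np

N = 5
x = np.linspace(0, 2, N)   # both endpoints are included
print(x)                   # [0.  0.5 1.  1.5 2. ]
```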
What is overfitting?
when statistical model captures noise in the data -> it fits the data too well
Occam’s Razor
when faced with 2 opposing explanations for the same set of evidence, preference is for the explanation making the fewest assumptions
What fits data better - more complex or simple model?
More complex model
-> because it has more flexibility, it fits the data at hand more closely, but part of that fit may be overfitting (fitting noise)
How to generate predicted y values based on model?
Use predict() function and apply it to the results from model fit (calculating coefficients)
What is R-squared?
statistical measure in regression analysis that represents the proportion of the variance for a dependent variable explained by an independent variable or variables, with values ranging from 0 to 1
What is interpretation of high R-squared?
good model fit to the data
however, it doesn’t say anything about causation
How to calculate R-squared for linear models?
r2 = np.round(1 - (np.var(prediction - y1, ddof=1) / np.var(y1, ddof=1)), 2)
essentially
1 - variance of residuals/variance of original data
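The formula as a runnable sketch on simulated data. The data and the np.polyfit fit are assumptions made for the example; the names prediction and y1 follow the card above:

```python
import numpy as np

np.random.seed(1)
x1 = np.random.uniform(0, 10, size=100)
y1 = 1.0 + 2.0 * x1 + np.random.normal(0, 1, size=100)

# Fit a straight line and compute the predicted values
slope, intercept = np.polyfit(x1, y1, 1)
prediction = intercept + slope * x1

# 1 - variance of residuals / variance of original data
r2 = np.round(1 - np.var(prediction - y1, ddof=1) / np.var(y1, ddof=1), 2)

# For a simple linear fit this matches the squared correlation of x1 and y1
r2_check = np.corrcoef(x1, y1)[0, 1] ** 2

print(r2, r2_check)
```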
How to tackle problem of overfitting?
cross-validation!
we can split initial data into separate training and test subsets
then you train the model on training subset + test it on test subset
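A sketch of a single train/test split, comparing a simple and a more complex polynomial. The data, the 40/20 split, the degrees and the helper mse are all hypothetical choices for this example:

```python
import numpy as np

np.random.seed(1)
x = np.random.uniform(-1, 1, size=60)
y = 1.0 + 2.0 * x + np.random.normal(0, 0.5, size=60)  # truly linear data

# Random split into 40 training and 20 test points
idx = np.random.permutation(60)
train, test = idx[:40], idx[40:]

def mse(deg, fit_idx, eval_idx):
    """Fit a polynomial of degree deg on fit_idx, return mean squared error on eval_idx."""
    coefs = np.polyfit(x[fit_idx], y[fit_idx], deg)
    return np.mean((np.polyval(coefs, x[eval_idx]) - y[eval_idx]) ** 2)

for deg in (1, 8):
    print(deg, mse(deg, train, train), mse(deg, train, test))
```

The degree-8 model always has training error at most that of the straight line; its error on the test subset is what reveals whether the extra flexibility was only fitting noise.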
What is AIC?
Akaike Information Criterion
AIC = 2k - 2ln(L)
k = number of estimated parameters
L = maximum likelihood value
you can get it after fitting a model with the statsmodels library
it is enough to access results.aic
lower AIC = better model
What is BIC?
Bayesian Information Criterion
you can get it after fitting a model with the statsmodels library
it is enough to access results.bic
lower BIC = better model
Why are AIC and BIC useful?
they are metrics of model comparison that penalize model complexity (having more parameters)
What is bootstrapping?
statistical procedure that resamples single dataset to create many simulated samples
each of these simulated samples has its own properties - such as mean
How to calculate lower confidence interval (95%) from normal distribution?
x = stats.norm.rvs(0, 1, n)
lower = np.mean(x) - 1.96 * np.std(x, ddof=1) / np.sqrt(n)
How to calculate upper confidence interval (95%) from normal distribution?
x = stats.norm.rvs(0, 1, n)
upper = np.mean(x) + 1.96 * np.std(x, ddof=1) / np.sqrt(n)
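Both bounds in one runnable sketch, built on the standard error of the mean (standard deviation divided by the square root of n). n = 100 and random_state are arbitrary choices:

```python
import numpy as np
from scipy import stats

n = 100
x = stats.norm.rvs(0, 1, n, random_state=1)

# Standard error of the mean
se = np.std(x, ddof=1) / np.sqrt(n)

lower = np.mean(x) - 1.96 * se
upper = np.mean(x) + 1.96 * se

print(lower, upper)
```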
How to derive bootstrap confidence interval?
1) draw random samples with replacement from original sample multiple times
2) for each re-sample, calculate mean and store it in means array
3) bootstrap confidence interval is derived by sorting the bootstrap sample means and selecting the 2.5th and 97.5th percentiles as the lower and upper bounds of the CI
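The three steps above can be sketched as follows. The original sample (drawn from an exponential distribution so that it is clearly non-normal) and the number of resamples are assumptions made for this example:

```python
import numpy as np

np.random.seed(1)
sample = np.random.exponential(scale=2.0, size=200)  # the "original sample"

n_boot = 2000
means = np.empty(n_boot)
for i in range(n_boot):
    # 1) resample with replacement, same size as the original sample
    resample = np.random.choice(sample, size=sample.size, replace=True)
    # 2) store the mean of each resample
    means[i] = np.mean(resample)

# 3) the 2.5th and 97.5th percentiles of the bootstrap means give the 95% CI
lower, upper = np.percentile(means, [2.5, 97.5])

print(lower, np.mean(sample), upper)
```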
What is the difference between traditional confidence interval and bootstrap confidence interval?
Bootstrapping is a non-parametric method (you don't draw from normal distribution) - it makes no assumptions about the underlying distribution. It relies on resampling the data multiple times to approximate the sampling distribution of the mean.