Statistics/Probability Flashcards
sample space
the set of all possible sample points for an experiment, e.g. S={HH,TT,HT,TH} for two flips of a coin (H = heads, T = tails)
dependent events (in probability)
events where the outcome of one changes the probability of the other, e.g. picking marbles out of a bag without replacement
Covariance
- When calculated between two variables, X and Y, it indicates how much the two variables change together.
- Cov(X,Y)=E[(X−EX)(Y−EY)] = E[XY]−(EX)(EY)
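A minimal sketch showing that the two forms of the covariance formula agree. The data and the function name `covariance` are made up for illustration; the population form (divide by n) is assumed.

```python
from statistics import fmean

def covariance(xs, ys):
    """Return the population covariance computed with both forms on the card."""
    mx, my = fmean(xs), fmean(ys)
    # Definition: E[(X - EX)(Y - EY)]
    by_definition = fmean((x - mx) * (y - my) for x, y in zip(xs, ys))
    # Shortcut: E[XY] - (EX)(EY)
    by_shortcut = fmean(x * y for x, y in zip(xs, ys)) - mx * my
    return by_definition, by_shortcut

# Hypothetical data: ys is always twice xs, so the two move together.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
cov_def, cov_short = covariance(xs, ys)
print(cov_def, cov_short)  # both forms give 2.5 for this data
```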
P–P plot
probability–probability plot or percent–percent plot or P value plot: probability plot for assessing how closely two data sets agree, or for assessing how closely a dataset fits a particular model.
It works by plotting the two cumulative distribution functions against each other; if they are similar, the data will appear to be nearly a straight line.
For input z, the output is the pair of numbers giving the percentage of the first distribution (f) and the percentage of the second (g) that fall at or below z.
Q–Q plot
quantile–quantile plot: for comparing two probability distributions by plotting their quantiles against each other. A point (x, y) on the plot corresponds to one of the quantiles of the second distribution (y-coordinate) plotted against the same quantile of the first distribution (x-coordinate).
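A rough sketch of how the points of a Q–Q plot against a Standard Normal could be computed. The function name `qq_points` and the plotting positions (i + 0.5)/n are assumptions; other conventions for the positions exist.

```python
from statistics import NormalDist

def qq_points(sample):
    """Pair each sorted observation with the matching Standard Normal quantile."""
    xs = sorted(sample)
    n = len(xs)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # Assumption: use (i + 0.5) / n as the probability level of the i-th point.
    return [(std_normal.inv_cdf((i + 0.5) / n), x) for i, x in enumerate(xs)]

# Hypothetical data; x-coordinate = theoretical quantile, y-coordinate = observed value.
points = qq_points([3.0, 1.0, 2.0])
for theoretical, observed in points:
    print(round(theoretical, 3), observed)
```

If the data came from a Normal distribution, the points would lie close to a straight line.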
PMF (Probability Mass Function)
A probability mass function (PMF) is a mathematical function that gives the probability that a discrete random variable takes a specific value. It assigns a probability to every possible value of the variable.
Can be written as a table, with each row holding an outcome and its probability.
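A small sketch of a PMF as a table, built from the two-flip sample space S={HH,TT,HT,TH}. It assumes a fair coin, so all four sample points are equally likely; the variable names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Sample space for two coin flips: HH, HT, TH, TT.
sample_space = ["".join(flips) for flips in product("HT", repeat=2)]

# Random variable X = number of heads; count how many sample points give each value.
heads_counts = Counter(outcome.count("H") for outcome in sample_space)

# Assumption: fair coin, so each sample point has probability 1/4.
pmf = {k: Fraction(v, len(sample_space)) for k, v in sorted(heads_counts.items())}

for value, prob in pmf.items():
    print(value, prob)  # one row per outcome: value of X, then its probability
```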
the conditional probability of a cancellation given snow
P(Cancel∣Snow); the ∣ is short for ‘given’
An event happens independently of a condition if
P(event∣condition)=P(event)
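A minimal sketch of the independence check, using P(event∣condition) = P(event and condition) / P(condition). The day counts below are invented purely for illustration.

```python
from fractions import Fraction

# Hypothetical counts of days (made up, not real data).
counts = {
    ("snow", "cancel"): 6,
    ("snow", "no cancel"): 4,
    ("no snow", "cancel"): 9,
    ("no snow", "no cancel"): 81,
}
total = sum(counts.values())

p_snow = Fraction(counts[("snow", "cancel")] + counts[("snow", "no cancel")], total)
p_cancel = Fraction(counts[("snow", "cancel")] + counts[("no snow", "cancel")], total)
p_cancel_and_snow = Fraction(counts[("snow", "cancel")], total)

# Conditional probability: P(Cancel | Snow) = P(Cancel and Snow) / P(Snow)
p_cancel_given_snow = p_cancel_and_snow / p_snow

# Independence holds only if conditioning on snow leaves the probability unchanged.
independent = p_cancel_given_snow == p_cancel
print(p_cancel_given_snow, p_cancel, independent)
```

With these made-up counts, P(Cancel∣Snow) differs from P(Cancel), so the cancellation is not independent of snow.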
Kolmogorov-Smirnov (K-S) Test
non-parametric test that compares the empirical distribution of the data with a theoretical distribution.
It helps determine how well the theoretical distribution fits the data.
K-S Statistic
The K-S statistic measures the maximum distance between the empirical cumulative distribution function (ECDF) of your data and the cumulative distribution function (CDF) of the theoretical distribution.
In simpler terms, it quantifies the biggest difference between what you observed (your data) and what you would expect if the data followed the theoretical distribution.
The K-S statistic ranges from 0 to 1:
A smaller K-S statistic indicates that the empirical distribution is very close to the theoretical distribution; a larger one indicates a poorer fit.
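A sketch of the K-S statistic as the largest gap between the ECDF and a theoretical CDF. The function name `ks_statistic` and the sample values are hypothetical; the Standard Normal is assumed as the theoretical distribution.

```python
from statistics import NormalDist

def ks_statistic(sample, cdf):
    """Largest vertical distance between the sample's ECDF and the given CDF."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The ECDF jumps from i/n to (i + 1)/n at x, so check the gap on both sides.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d

# Hypothetical observations compared against a Standard Normal CDF.
sample = [-1.2, -0.4, 0.1, 0.3, 0.9, 1.5]
d = ks_statistic(sample, NormalDist().cdf)
print(d)
```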
outcome = model + error: what are the two parts called?
model = systematic part, error = unsystematic part
Descriptive Statistics
collect, organize, display, analyze, etc.
Inferential Statistics
- Predict and forecast values of population parameters
- Test hypotheses and draw conclusions about values of population parameters
- Make decisions
Central Tendency
1st moment - mean, median, mode
Spread
2nd moment - MAD, Variance, SD, coefficient of variation (CV = SD/mean), range, IQR
Skewness
3rd moment - measure of asymmetry: positive skew (tail pointing to high values, body of the distribution to the left), negative skew (tail pointing to low values, body of the distribution to the right)
Kurtosis
4th moment - Measure of heaviness of the tails, leptokurtic (heavy tails), platykurtic (light tails)
Which kurtosis has a normal distribution?
3 (mesokurtic)
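A rough sketch of skewness and kurtosis as the 3rd and 4th standardized moments. The function name `standardized_moment` and the data are illustrative, and the population form (divide by n) is assumed.

```python
from statistics import fmean

def standardized_moment(data, k):
    """k-th standardized moment: E[(X - mean)^k] / SD^k (population form)."""
    m = fmean(data)
    var = fmean((x - m) ** 2 for x in data)
    return fmean((x - m) ** k for x in data) / var ** (k / 2)

# Hypothetical data with one large value on the right.
data = [1.0, 2.0, 2.0, 3.0, 3.0, 3.0, 10.0]
skewness = standardized_moment(data, 3)  # positive here: tail points to high values
kurtosis = standardized_moment(data, 4)  # compare against 3, the value for a Normal
print(skewness, kurtosis)
```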
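statistical tests on prices vs returns:
run the tests on returns, not prices: the statistical properties of prices drift over time (non-stationary), while those of returns are roughly stable (they are “stationary”)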
Standard Error calculation & meaning
- SE = SD / n^(1/2), i.e. SD / root(n)
- Standard deviation measures the dispersion of the data around the mean. The standard error can be thought of as the dispersion of the sample mean estimates around the true population mean
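A minimal sketch of the standard error calculation. The data and the function name `standard_error` are made up; the sample standard deviation (n-1 divisor) is assumed for SD.

```python
import math
from statistics import stdev

def standard_error(sample):
    """Standard error of the mean: sample SD divided by the square root of n."""
    return stdev(sample) / math.sqrt(len(sample))

# Hypothetical observations.
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
se = standard_error(sample)
print(stdev(sample), se)  # SE is smaller than SD by a factor of root(n)
```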
Sample standard deviation
𝑠
Population standard deviation
𝜎 (sigma)
Central Limit Theorem
states that: the distribution of the sample mean, X̄, approaches a Normal distribution as the sample size 𝑛 increases (rule of thumb: 𝑛 ≥ 30)
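A small simulation sketch of the theorem. The choice of an Exponential(1) population, the seed, and the number of repetitions are all assumptions made for illustration.

```python
import random
from statistics import fmean, pstdev

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 30                  # sample size, matching the n >= 30 rule of thumb
repetitions = 5000      # number of samples drawn (arbitrary choice)

# Exponential(1) is strongly skewed, with population mean 1 and SD 1.
sample_means = [
    fmean(rng.expovariate(1.0) for _ in range(n))
    for _ in range(repetitions)
]

center = fmean(sample_means)   # expected to sit near the population mean, 1
spread = pstdev(sample_means)  # expected to sit near 1 / root(n)
print(center, spread, 1 / n ** 0.5)
```

A histogram of `sample_means` would look roughly bell-shaped even though the underlying population is not.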
Sample variance - do you use n or n-1?
n-1
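A short sketch contrasting the two divisors, using Python's `statistics` module: `pvariance` divides by n and `variance` divides by n-1. The data is hypothetical.

```python
from statistics import pvariance, variance

# Hypothetical observations.
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

population_style = pvariance(sample)  # divides the squared deviations by n
sample_style = variance(sample)       # divides the squared deviations by n - 1
print(population_style, sample_style)  # the n - 1 version is slightly larger
```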
Random variable:
𝑋
Cumulative Distribution Function of Standard Normal:
Φ (z)
Pivotal distribution
N(0,1)
Population mean - greek letter:
μ (mu)
Confidence interval
sample mean +/- z-value * SE, where SE = sigma / root(n); if sigma is unknown, use s / root(n)
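A sketch of a z-based confidence interval for the mean, assuming sigma is known. The function name `z_confidence_interval`, the data, and sigma = 2 are hypothetical.

```python
import math
from statistics import NormalDist, fmean

def z_confidence_interval(sample, sigma, confidence=0.95):
    """Return (low, high) for the mean, assuming a known population sigma."""
    n = len(sample)
    # z-critical value that leaves (1 - confidence) / 2 in each tail.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * sigma / math.sqrt(n)
    center = fmean(sample)
    return center - margin, center + margin

# Hypothetical data: 16 observations with sample mean 10; sigma assumed known.
sample = [9.0, 10.0, 11.0, 10.0] * 4
low, high = z_confidence_interval(sample, sigma=2.0)
print(low, high)
```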
In the sample, you approximate mu and sigma with…
x̄ (sample mean) and s (sample standard deviation)
Particular observation of a Standard Normal (also known as ‘z-critical value’)
z
Parameter of 𝒕-distribution (also known as ‘degrees of freedom’):
𝜈 (nu)
t-critical value
t
Important: are you given sigma or s?
If you are given sigma, you can use it with the z-distribution even if n is < 30. If you are only given s and n is < 30, use the t-distribution
t-distribution
- Has thicker tails than Normal (i.e. larger chance of extreme events).
- Its shape depends on a single parameter “nu” 𝜈 = 𝑛 – 1, where n is the number of observations.
- Assumption: t-distribution assumes that the data originates from a Normal Distribution.
3 main types of distribution
Gaussian, Poisson, Chi-square
Statistical stationarity:
A stationary time series is one whose statistical properties such as mean, variance, autocorrelation, etc. are all constant over time.
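A rough simulation sketch of the idea, comparing the mean of the first and second half of a series. Everything here is an assumption for illustration: the returns are simulated Normal draws, and "prices" are modeled simply as the running sum of those returns (a random walk).

```python
import random
from itertools import accumulate
from statistics import fmean

rng = random.Random(1)  # fixed seed so the run is repeatable
returns = [rng.gauss(0.0, 1.0) for _ in range(10_000)]  # simulated, not real data
prices = list(accumulate(returns))  # random walk: running sum of the returns

def half_means(series):
    """Mean of the first half and of the second half of a series."""
    mid = len(series) // 2
    return fmean(series[:mid]), fmean(series[mid:])

returns_halves = half_means(returns)  # both halves stay close to 0
prices_halves = half_means(prices)    # a random walk's level can drift between halves
print(returns_halves, prices_halves)
```

Comparing halves is only an informal check; it is not a formal stationarity test.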