Common - Question 5 (Statistics) Flashcards
Data types in statistics
- Categorical (qualitative)
- Numerical (quantitative)
Target population
the complete set of individuals that might
hypothetically be sampled for a particular study; ENTIRE group of individuals or objects to which researchers are interested in
generalizing the conclusions
Exploratory data analysis
summarizes the main characteristics of the data, often with visual methods.
Name measures of location and how to calculate them
- mean (aritmetický průměr)
mean = (1/n) * Σ x_i
- median
median = x_((n + 1)/2) for odd n
median = (x_(n/2) + x_((n/2) + 1))/2 for even n
- quantiles
- mode (modus) - the value which appears most often in a set of data values
- trimmed mean - a fixed percentage of the smallest and largest values is removed before the mean is computed
- winsorized mean - extreme values are not removed but replaced: e.g. all values below the 10th percentile are set equal to the value at the 10th percentile, and similarly from the other end, before the mean is computed
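A minimal sketch of these location measures in Python (assuming NumPy and SciPy are available; the data array is only illustrative):

import numpy as np
from scipy import stats

x = np.array([2.0, 3.5, 4.1, 4.8, 5.0, 5.2, 6.7, 9.9, 25.0])   # illustrative data

print(np.mean(x))                      # arithmetic mean
print(np.median(x))                    # median
vals, counts = np.unique(x, return_counts=True)
print(vals[np.argmax(counts)])         # mode: the most frequent value
print(stats.trim_mean(x, 0.1))         # trimmed mean, 10% cut from each tail
print(np.mean(stats.mstats.winsorize(x, limits=(0.1, 0.1))))   # winsorized mean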
Name measures of variability
- variance (rozptyl)
s^2 = 1/(n-1) * Σ(x_i - mean)^2
- standard deviation (směrodatná odchylka)
s = sqrt(s^2)
- range
R = x_(n) - x_(1) (largest value minus smallest value)
- Interquartile range
IQR = Q_3 - Q_1
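A corresponding sketch for the variability measures (NumPy only; illustrative data):

import numpy as np

x = np.array([2.0, 3.5, 4.1, 4.8, 5.0, 5.2, 6.7, 9.9])

s2 = np.var(x, ddof=1)                 # sample variance, 1/(n-1) normalization
s = np.std(x, ddof=1)                  # sample standard deviation
r = x.max() - x.min()                  # range = largest minus smallest value
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1                          # interquartile range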
Name measures of shape
- skewness (šikmost)
- kurtosis (špičatost)
Skewness formula
b = (1/n) * Σ{ (x_i - avg(x)) / s }^3
s = standard deviation
b > 0: skewed to the right (mode < median < mean)
b < 0: skewed to the left (mean < median < mode)
Kurtosis formula with meaning
b = -3 + (1/n) * Σ{ (x_i - avg(x)) / s }^4
s = standard deviation
b > 0: heavy-tailed (špičaté)
b < 0: light-tailed (zploštělé)
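Both shape measures are available in SciPy; its default (biased) definitions roughly match the formulas above. A sketch, assuming SciPy is installed and using illustrative data:

import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 2.5, 3.0, 3.2, 4.0, 9.0])

print(stats.skew(x))        # sample skewness (the 1/n formula above)
print(stats.kurtosis(x))    # excess kurtosis (Fisher definition, i.e. including the -3)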
Define covariance
Let X and Y be random variables. Covariance of X and Y is defined as cov(X, Y ) = E[(X − EX)(Y − EY )] if the above expectation is finite.
Define correlation coefficient
Let X and Y be random variables with finite nonzero variances varX and varY. Correlation coefficient of X and Y is defined as cor(X, Y ) = cov(X, Y) / sqrt(varX varY)
If X and Y are independent and correlation coefficient exists, then cor(X, Y ) = 0.
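Sample versions of both quantities can be computed with NumPy (a sketch with illustrative data):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.3, 4.1, 4.8])

cov_xy = np.cov(x, y)[0, 1]        # sample covariance (n-1 in the denominator)
cor_xy = np.corrcoef(x, y)[0, 1]   # sample correlation coefficient
print(cov_xy, cor_xy)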
Graph (charts) types and their description
- Boxplot
5 values: max{x_(1), Q_1 - 1.5 IQR}, Q_1, median, Q_3, min{x_(n), Q_3 + 1.5 IQR}
- Scatterplot
a plot that displays values for two variables of the dataset using Cartesian coordinates
- Pie chart
visualization of categorical data
- Histogram
a piecewise constant estimate of the distribution of the data
Random variable (náhodná veličina)
random variable X is a function from a sample space to a set of real numbers
What characterizes a random variable?
X is characterised by a model F from the set of models (probabilistic or statistical)
What is cumulative distribution function
Function F: R -> [0, 1] defined as F_X(x) = P(X <= x) for x in R is the cdf of X.
F is a non-decreasing function.
What is probability mass function
p(x) = P(X = x)
Continuous random variable
Random variable X is (absolutely) continuous, if there exists an integrable and nonnegative function f such that F(x) = ∫_{-∞}^{x} f(t) dt for all x in R.
f is called probability density function
Discrete random variable
Random variable X is discrete, if there exists a finite or countable set {x1, x2, . . .} such that P(X = x_i) > 0 and sum of probabilities of all realisations of X is 1.
Random sampling (náhodný vyběr)
A random sample is an ordered n-tuple of random variables X_1, X_2, …, X_n that are stochastically independent and identically distributed.
x_1, x_2, …, x_n are realizations of the random sample.
List discrete univariate distributions
- Bernoulli
- Binomial
- Poisson
- Geometric
- Uniform
Bernoulli distribution
Random variable X describes events with two outcomes (success, failure) with p the probability of success.
EX = p and varX = p(1 − p).
Binomial distribution
A random variable X has binomial distribution with parameters n ∈ N and p ∈ (0, 1), if the pmf is given by p(x) = (n over x) p^x (1 - p)^(n-x) for x = 0, 1, …, n. Random variable X represents the number of successes in n repeated independent Bernoulli trials with p the probability of success at each individual trial.
EX = np and varX = np(1 − p).
Poisson distribution
A random variable X has Poisson distribution with parameter λ > 0, if the
pmf of X is given by
p(x) = P(X = x) = e^(−λ) * λ^x/x! if x = 0, 1, …
Random variable X represents the number of events occurring in a fixed interval of time if these events occur independently of the time since the last event.
EX = λ and varX = λ.
Geometric distribution
Random variable X represents the number of failures before the first success when repeating independent Bernoulli trials with p the probability of success at each individual trial.
A random variable X has geometric distribution with parameter p ∈ (0, 1), if
the pmf of X is given by
p(x) = P(X = x) = p(1 - p)^x if x=0,1,…
p(x) = P(X = x) = 0 otherwise
EX = (1−p)/p
and varX = (1−p)/p^2
Uniform distribution
All the values of X are equally probable.
A random variable X has discrete uniform distribution on the finite set A, if the pmf of X is given by
p(x) = P(X = x) = 1/|A| if x in A
p(x) = P(X = x) = 0 otherwise
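The pmfs and moments of these distributions can be evaluated with scipy.stats (a sketch; the parameter values are only illustrative, and note that SciPy's geom counts trials by default, so loc=-1 shifts it to count failures):

from scipy import stats

print(stats.binom.pmf(3, n=10, p=0.3))     # P(X = 3) for X ~ Bin(10, 0.3)
print(stats.poisson.pmf(2, mu=1.5))        # P(X = 2) for X ~ Poisson(1.5)
print(stats.geom.pmf(4, p=0.3, loc=-1))    # P(X = 4 failures before first success)
print(stats.binom.mean(10, 0.3), stats.binom.var(10, 0.3))   # np and np(1-p)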
List examples of continuous distributions
- Continuous uniform
- Exponential
- Standard normal
- Normal
- Cauchy
- chi-square (χ²)
- Student's t-distribution
- Fisher-Snedecor F-distribution
Continuous uniform distribution
A random variable X has continuous uniform distribution on the interval (a, b), if the pdf of X is given by
f(x) = 1/(b - a) if x in (a, b)
f(x) = 0 otherwise
Exponential distribution
A random variable X has exponential distribution with parameter λ > 0, if
the pdf of X is given by
f(x) = λe^(−λx) if x > 0
f(x) = 0 otherwise
Random variable X represents time between two events in a Poisson point process.
Normal distribution
A random variable X has normal distribution with parameters μ ∈ R and σ^2 > 0, if the pdf of X is given by
f(x) = (1/sqrt(2πσ^2)) * e^(−(x − μ)^2 / (2σ^2))
Chi-square (χ²) distribution
If Z_1, …, Z_k are independent N(0, 1) random variables, then X = Z_1^2 + … + Z_k^2 has chi-square distribution with k degrees of freedom.
Student's t-distribution
If Z ~ N(0, 1) and V ~ χ²_k are independent, then T = Z / sqrt(V / k) has Student's t-distribution with k degrees of freedom.
Fisher-Snedecor F-distribution
If U ~ χ²_m and V ~ χ²_n are independent, then F = (U / m) / (V / n) has Fisher-Snedecor F-distribution with m and n degrees of freedom.
Central Limit theorem (Centralni limitni veta - CLV)
Assumptions:
* X_n are independent
* X_n are from the same distribution
* X_n have a finite expected value (e.g. the Cauchy distribution does not have one)
Means of samples have an approximately normal distribution,
no matter which distribution they are collected from (e.g. if we collect 100 samples from any distribution, calculate the mean, and repeat this many times to get many mean values, then those mean values will form a normal distribution).
Equation (approximation):
(Σ X_i - n * μ) / (σ * sqrt(n)) ≈ N(0, 1)
μ = expected value
σ = standard deviation
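A small simulation can illustrate the approximation (a sketch using NumPy; the exponential source distribution and sample sizes are arbitrary choices):

import numpy as np

rng = np.random.default_rng(0)
n, reps = 100, 10_000
mu, sigma = 2.0, 2.0                     # mean and sd of the exponential with scale 2

samples = rng.exponential(scale=2.0, size=(reps, n))   # skewed source distribution
z = (samples.sum(axis=1) - n * mu) / (sigma * np.sqrt(n))
print(z.mean(), z.std())                 # close to 0 and 1, i.e. approx. N(0, 1)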
What is the Central Limit Theorem used for
It is needed e.g. for confidence intervals, t-tests, ANOVA.
- we do not need to know which distribution the samples come from; it is enough to know that the sample means form approximately a normal distribution
PCA (principal component analysis) - possibly optional
A transformation of the data (a different view of the data).
The first principal component PCA1 (a line) is drawn through the points so that the projected data have the largest possible variance (the most extreme points on the line are as far apart as possible) and at the same time the smallest squared reconstruction error.
Each further component is orthogonal to the previous ones.
Each component captures a certain amount of the information in the data (that is why we want the largest variance) - as a consequence we can reduce the number of dimensions (attributes that do not influence the result) or reduce the amount of data (for faster processing).
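A minimal sketch of PCA with scikit-learn (assuming it is installed; the random data are only a placeholder):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))            # 200 observations, 5 attributes

pca = PCA(n_components=2)                # keep only the first two components
X_reduced = pca.fit_transform(X)         # data projected onto the components
print(pca.explained_variance_ratio_)     # share of variance captured by each component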
Estimation (odhady) - what is the goal, what types of estimates
The goal is to estimate the unknown parameters of the random variable X.
We have these estimators:
* parametric - the estimator is a function F together with a parameter theta
* nonparametric - the estimator is a function F without a parameter
* semi-parametric
Properties/types of a parametric estimate
- Unbiasedness (unbiased estimator)
  - the expected value of the estimate equals the parameter
  - asymptotically unbiased = the bias goes to zero as n -> inf
- Consistency (consistent estimator)
  - an estimate is consistent if, with a growing sample size, the estimate approaches the true value of the parameter
The accuracy of the estimate of parameter theta is measured by
MSE - mean squared error
MSE(T) = E(T - theta)^2
for an unbiased estimate MSE(T) = var T
Point estimate
The parameter is estimated by a single value - the value of the parameter is approximated.
Commonly used point estimates
for a distribution with mean µ and variance σ^2:
* the sample mean (arithmetic mean) is an unbiased and consistent estimate of µ
* the sample variance S^2 is an unbiased and consistent estimate of σ^2
Interval estimate
The parameter is estimated by an interval.
Construction of point estimates
- Consider our data as realization of random sample X_1,…, X_n.
- Specify the function F (the distribution up to an unknown param.)
- Finally, estimate our unknown parameter (e.g. theta)
* We can use maximum likelihood for example
Maximum likelihood method
L(θ) = Π_{i=1}^{n} f(x_i, θ) is the joint pdf (pmf) of the random sample X = (X_1, …, X_n); Π denotes the product.
L is called the likelihood function of the parameter θ with fixed x_1, …, x_n.
The maximum likelihood estimate is the θ that maximizes the likelihood function.
In practice it is easier to maximize the log-likelihood log L(θ), which has the same maximizer.
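As an illustration, a sketch of maximum likelihood for the rate λ of an exponential distribution, done numerically on the negative log-likelihood (assuming NumPy/SciPy; the closed-form MLE 1/mean(x) is printed for comparison):

import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=500)       # data with true lambda = 0.5

def neg_log_lik(lam):
    # negative log-likelihood of Exp(lambda): -(n*log(lambda) - lambda * sum(x))
    return -(len(x) * np.log(lam) - lam * x.sum())

res = minimize_scalar(neg_log_lik, bounds=(1e-6, 10.0), method="bounded")
print(res.x, 1 / x.mean())                     # numerical MLE vs. closed form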
Confidence interval
Assume that X_1, . . . , X_n is a random sample with cdf F(x, θ), where θ ∈ Θ is an unknown one-dimensional parameter. Interval [L,U] = [L(X1, . . . , Xn),U(X1, . . . , Xn)] is (1 − α)·100% confidence interval for parameter θ, if P(L(X_1, . . . , X_n) ≤ θ ≤ U(X_1, . . . , X_n)) = 1 − α, for all θ ∈ Θ.
Simply speaking, a confidence interval is your estimate plus and minus a margin of error. It is the range of values you expect your estimate to fall between if you redo your test, at a certain level of confidence.
Factors affecting the width of the confidence interval
sample size, the confidence level, and the variability of the data.
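A sketch of a (1 − α)·100% confidence interval for the mean of a normal sample, built from the t distribution (assuming SciPy; the data are illustrative):

import numpy as np
from scipy import stats

x = np.array([4.1, 5.2, 3.8, 4.9, 5.5, 4.4, 5.0, 4.7])
n, alpha = len(x), 0.05

mean, s = x.mean(), x.std(ddof=1)
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)       # critical value of t_{n-1}
half_width = t_crit * s / np.sqrt(n)
print(mean - half_width, mean + half_width)         # 95% CI for the mean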
Hypotheses testing
The goal is to decide between two competing hypotheses about parameter θ:
Null hypothesis
Alternative hypothesis
Our decision:
* We reject H_0
* We do not reject the null hypothesis H_0
Types of errors when testing hypotheses
Type I error - we reject H_0 although H_0 is true
Type II error - we do not reject H_0 although H_1 is true
Hypothesis test construction
- First, we have to find a pivotal statistic for a given model.
- Test statistic T is then the value of the pivotal statistic under H0.
- We have to find the distribution of the test statistic under the null hypothesis.
- Finally, we define the critical region W so that P(type I error) = α.
p-value
- p-value is the probability, under the null hypothesis, of obtaining a test result at least as extreme as the result actually observed.
- p-value is the smallest significance level at which we can reject the null hypothesis.
- If p-value is smaller than α, we reject the null hypothesis.
Two-sample t-test
The two-sample t-test (also known as the independent samples t-test) is a method used to test whether the unknown population means of two groups are equal or not.
- The normality assumption is crucial.
- Both samples have to be mutually independent.
- Both samples must have the same variability - necessary to check.
- If this assumption is not met, we cannot use the two-sample t-test.
- In that case, we may use e.g. Welch’s approximate t-test.
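Both variants are available in scipy.stats (a sketch with made-up groups):

from scipy import stats

group_a = [5.1, 4.9, 5.4, 5.0, 5.2, 4.8]
group_b = [4.5, 4.7, 4.4, 4.9, 4.6, 4.3]

# equal_var=True  -> classical two-sample t-test (equal variances assumed)
# equal_var=False -> Welch's approximate t-test
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=True)
print(t_stat, p_value)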
Paired t-test
Paired samples t-tests typically consist of a sample of matched pairs of similar units, or one group of units that has been tested twice (a “repeated measures” t-test).
Model: Z_1 = X_1 − Y_1, . . . , Z_n = X_n − Y_n is a random sample from normal distribution N (μ, σ^2).
Null hypothesis: H_0 : μ = 0.
One-sample t-test
The one-sample t-test is a statistical hypothesis test used to determine whether an unknown population mean is different from a specific value.
t = (avg(x) − μ_0) / (s / sqrt(n))
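Both the paired and the one-sample test have counterparts in scipy.stats (a sketch with illustrative data):

from scipy import stats

x = [5.1, 4.9, 5.4, 5.0, 5.2, 4.8]
y = [4.8, 4.9, 5.1, 4.7, 5.0, 4.6]

print(stats.ttest_1samp(x, popmean=5.0))   # one-sample t-test against mu_0 = 5
print(stats.ttest_rel(x, y))               # paired t-test, i.e. one-sample test on x - y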
ANOVA test
ANOVA, which stands for Analysis of Variance, is a statistical test used to analyze the difference between the means of more than two groups.
Mathematical model:
* All the samples are from normal distribution
* All samples are mutually independent.
* n_1, . . . , n_k are the sample sizes.
* n = n_1 + . . . + n_k is the overall sample size.
* H_0: μ_1 = μ_2 = . . . = μ_k .
* H_1: μ_i != μ_j for at least one pair i != j.
ANOVA determines whether the groups created by the levels of the independent variable are statistically different by calculating whether the means of the treatment levels are different from the overall mean of the dependent variable.
If any of the group means is significantly different from the overall mean, then the null hypothesis is rejected.
ANOVA uses the F test for statistical significance. This allows for comparison of multiple means at once, because the error is calculated for the whole set of comparisons rather than for each individual pairwise comparison (as would happen with a series of t-tests).
The F test compares the variance between the group means with the variance within the groups. If the variance within groups is small relative to the variance between groups, the F test yields a higher F value, and therefore a higher likelihood that the difference observed is real and not due to chance.
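One-way ANOVA is available as scipy.stats.f_oneway (a sketch with made-up groups):

from scipy import stats

g1 = [5.1, 4.9, 5.4, 5.0, 5.2]
g2 = [4.5, 4.7, 4.4, 4.9, 4.6]
g3 = [5.6, 5.8, 5.5, 5.9, 5.7]

f_stat, p_value = stats.f_oneway(g1, g2, g3)   # F statistic and p-value for H_0: equal means
print(f_stat, p_value)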