Common - Question 5 (Statistics) Flashcards
Data types in statistics
- Categorical (qualitative)
- Numerical (quantitative)
Target population
the complete set of individuals that might
hypothetically be sampled for a particular study; ENTIRE group of individuals or objects to which researchers are interested in
generalizing the conclusions
Exploratory data analysis
summarizes the main characteristics of the data, often with visual methods.
Name measures of location and how to calculate them
- mean (aritmetický průměr)
mean = (1/n) * Σ x_i
- median
median = x_((n + 1)/2) for odd n
median = (x_(n/2) + x_((n/2) + 1))/2 for even n
- quantiles
- mode (modus) - the value which appears most often in a set of data values
- trimmed mean - a fixed percentage of the smallest and largest values is removed before the mean is computed
- winsorized mean - extreme values are not removed but replaced: e.g. all values below the 10th percentile are set equal to the value at the 10th percentile, and similarly from the other end, before the mean is computed
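A minimal sketch of these location measures in Python (assuming NumPy and SciPy are available; the data array is only illustrative):

import numpy as np
from scipy import stats

x = np.array([2.0, 3.5, 4.1, 4.8, 5.0, 5.2, 6.7, 9.9, 25.0])   # illustrative data

print(np.mean(x))                      # arithmetic mean
print(np.median(x))                    # median
vals, counts = np.unique(x, return_counts=True)
print(vals[np.argmax(counts)])         # mode: the most frequent value
print(stats.trim_mean(x, 0.1))         # trimmed mean, 10% cut from each tail
print(np.mean(stats.mstats.winsorize(x, limits=(0.1, 0.1))))   # winsorized mean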
Name measures of variability
- variance (rozptyl)
s^2 = 1/(n-1) * Σ(x_i - mean)^2
- standard deviation (směrodatná odchylka)
s = sqrt(s^2)
- range
R = x_(n) - x_(1) (largest value minus smallest value)
- Interquartile range
IQR = Q_3 - Q_1
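A corresponding sketch for the variability measures (NumPy only; illustrative data):

import numpy as np

x = np.array([2.0, 3.5, 4.1, 4.8, 5.0, 5.2, 6.7, 9.9])

s2 = np.var(x, ddof=1)                 # sample variance, 1/(n-1) normalization
s = np.std(x, ddof=1)                  # sample standard deviation
r = x.max() - x.min()                  # range = largest minus smallest value
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1                          # interquartile range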
Name measures of shape
- skewness (šikmost)
- kurtosis (špičatost)
Skewness formula
b = (1/n) * Σ{ (x_i - avg(x)) / s }^3
s = standard deviation
b > 0: skewed to the right (mode < median < mean)
b < 0: skewed to the left (mean < median < mode)
Kurtosis formula with meaning
b = -3 + (1/n) * Σ{ (x_i - avg(x)) / s }^4
s = standard deviation
b > 0: heavy-tailed (špičaté)
b < 0: light-tailed (zploštělé)
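Both shape measures are available in SciPy; its default (biased) definitions roughly match the formulas above. A sketch, assuming SciPy is installed and using illustrative data:

import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 2.5, 3.0, 3.2, 4.0, 9.0])

print(stats.skew(x))        # sample skewness (the 1/n formula above)
print(stats.kurtosis(x))    # excess kurtosis (Fisher definition, i.e. including the -3)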
Define covariance
Let X and Y be random variables. Covariance of X and Y is defined as cov(X, Y ) = E[(X − EX)(Y − EY )] if the above expectation is finite.
Define correlation coefficient
Let X and Y be random variables with finite nonzero variances varX and varY. Correlation coefficient of X and Y is defined as cor(X, Y ) = cov(X, Y) / sqrt(varX varY)
If X and Y are independent and correlation coefficient exists, then cor(X, Y ) = 0.
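Sample versions of both quantities can be computed with NumPy (a sketch with illustrative data):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.3, 4.1, 4.8])

cov_xy = np.cov(x, y)[0, 1]        # sample covariance (n-1 in the denominator)
cor_xy = np.corrcoef(x, y)[0, 1]   # sample correlation coefficient
print(cov_xy, cor_xy)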
Graph (charts) types and their description
- Boxplot
5 values: max{x_(1), Q_1 - 1.5 IQR}, Q_1, median, Q_3, min{x_(n), Q_3 + 1.5 IQR}
- Scatterplot
a plot that displays values for two variables of the dataset using Cartesian coordinates
- Pie chart
visualization of categorical data
- Histogram
a piecewise constant estimate of the distribution of the data
Random variable (náhodná veličina)
random variable X is a function from a sample space to a set of real numbers
What characterizes a random variable?
X is characterised by a model F from the set of models (probabilistic or statistical)
What is cumulative distribution function
Function F: R -> [0, 1] defined as F_X(x) = P(X <= x) for x in R is the cdf of X.
F is a non-decreasing function.
What is probability mass function
p(x) = P(X = x)
Continuous random variable
Random variable X is (absolutely) continuous, if there exists an integrable and nonnegative function f such that F(x) = ∫_{-∞}^{x} f(t) dt for all x in R.
f is called probability density function
Discrete random variable
Random variable X is discrete, if there exists a finite or countable set {x1, x2, . . .} such that P(X = x_i) > 0 and sum of probabilities of all realisations of X is 1.
Random sampling (náhodný vyběr)
A random sample is an ordered n-tuple of random variables X_1, X_2, …, X_n that are stochastically independent and identically distributed.
x_1, x_2, …, x_n are realizations of the random sample.
List discrete univariate distributions
- Bernoulli
- Binomial
- Poisson
- Geometric
- Uniform
Bernoulli distribution
Random variable X describes events with two outcomes (success, failure) with p the probability of success.
EX = p and varX = p(1 − p).
Binomial distribution
A random variable X has binomial distribution with parameters n ∈ N and p ∈ (0, 1), if the pmf is given by p(x) = (n over x) p^x (1 - p)^(n-x) for x = 0, 1, …, n. Random variable X represents the number of successes in n repeated independent Bernoulli trials with p the probability of success at each individual trial.
EX = np and varX = np(1 − p).
Poisson distribution
A random variable X has Poisson distribution with parameter λ > 0, if the
pmf of X is given by
p(x) = P(X = x) = e^(−λ) * λ^x/x! if x = 0, 1, …
Random variable X represents the number of events occurring in a fixed interval of time if these events occur independently of the time since the last event.
EX = λ and varX = λ.
Geometric distribution
Random variable X represents the number of failures before the first success when repeating independent Bernoulli trials with p the probability of success at each individual trial.
A random variable X has geometric distribution with parameter p ∈ (0, 1), if
the pmf of X is given by
p(x) = P(X = x) = p(1 - p)^x if x=0,1,…
p(x) = P(X = x) = 0 otherwise
EX = (1−p)/p
and varX = (1−p)/p^2
Uniform distribution
All the values of X are equally probable.
A random variable X has discrete uniform distribution on the finite set A, if the pmf of X is given by
p(x) = P(X = x) = 1/|A| if x in A
p(x) = P(X = x) = 0 otherwise
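The pmfs and moments of these distributions can be evaluated with scipy.stats (a sketch; the parameter values are only illustrative, and note that SciPy's geom counts trials by default, so loc=-1 shifts it to count failures):

from scipy import stats

print(stats.binom.pmf(3, n=10, p=0.3))     # P(X = 3) for X ~ Bin(10, 0.3)
print(stats.poisson.pmf(2, mu=1.5))        # P(X = 2) for X ~ Poisson(1.5)
print(stats.geom.pmf(4, p=0.3, loc=-1))    # P(X = 4 failures before first success)
print(stats.binom.mean(10, 0.3), stats.binom.var(10, 0.3))   # np and np(1-p)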
List examples of continuous distributions
- Continuous uniform
- Exponential
- Standard normal
- Normal
- Cauchy
- chi-square (χ²)
- Student's t-distribution
- Fisher-Snedecor F-distribution
Continuous uniform distribution
A random variable X has continuous uniform distribution on the interval (a, b), if the pdf of X is given by
f(x) = 1/(b - a) if x in (a, b)
f(x) = 0 otherwise
Exponential distribution
A random variable X has exponential distribution with parameter λ > 0, if
the pdf of X is given by
f(x) = λe^(−λx) if x > 0
f(x) = 0 otherwise
Random variable X represents time between two events in a Poisson point process.
Normal distribution
A random variable X has normal distribution with parameters μ ∈ R and σ^2 > 0, if the pdf of X is given by
f(x) = (1/sqrt(2πσ^2)) * e^(−(x − μ)^2 / (2σ^2))
Chi-square (χ²) distribution
If Z_1, …, Z_k are independent N(0, 1) random variables, then X = Z_1^2 + … + Z_k^2 has chi-square distribution with k degrees of freedom.
Student's t-distribution
If Z ~ N(0, 1) and V ~ χ²_k are independent, then T = Z / sqrt(V / k) has Student's t-distribution with k degrees of freedom.
Fisher-Snedecor F-distribution
If U ~ χ²_m and V ~ χ²_n are independent, then F = (U / m) / (V / n) has Fisher-Snedecor F-distribution with m and n degrees of freedom.
Central Limit theorem (Centralni limitni veta - CLV)
Assumptions:
* X_n are independent
* X_n are from the same distribution
* X_n have a finite expected value (e.g. the Cauchy distribution does not have one)
Means of samples have an approximately normal distribution,
no matter which distribution they are collected from (e.g. if we collect 100 samples from any distribution, calculate the mean, and repeat this many times to get many mean values, then those mean values will form a normal distribution).
Equation (approximation):
(Σ X_i - n * μ) / (σ * sqrt(n)) ≈ N(0, 1)
μ = expected value
σ = standard deviation
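A small simulation can illustrate the approximation (a sketch using NumPy; the exponential source distribution and sample sizes are arbitrary choices):

import numpy as np

rng = np.random.default_rng(0)
n, reps = 100, 10_000
mu, sigma = 2.0, 2.0                     # mean and sd of the exponential with scale 2

samples = rng.exponential(scale=2.0, size=(reps, n))   # skewed source distribution
z = (samples.sum(axis=1) - n * mu) / (sigma * np.sqrt(n))
print(z.mean(), z.std())                 # close to 0 and 1, i.e. approx. N(0, 1)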
What is the Central Limit Theorem used for
It is needed e.g. for confidence intervals, t-tests, ANOVA.
- we do not need to know which distribution the samples come from; it is enough to know that the sample means form approximately a normal distribution
PCA (principal component analysis) - possibly optional
A transformation of the data (a different view of the data).
The first principal component PCA1 (a line) is drawn through the points so that the projected data have the largest possible variance (the most extreme points on the line are as far apart as possible) and at the same time the smallest squared reconstruction error.
Each further component is orthogonal to the previous ones.
Each component captures a certain amount of the information in the data (that is why we want the largest variance) - as a consequence we can reduce the number of dimensions (attributes that do not influence the result) or reduce the amount of data (for faster processing).
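A minimal sketch of PCA with scikit-learn (assuming it is installed; the random data are only a placeholder):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))            # 200 observations, 5 attributes

pca = PCA(n_components=2)                # keep only the first two components
X_reduced = pca.fit_transform(X)         # data projected onto the components
print(pca.explained_variance_ratio_)     # share of variance captured by each component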
Estimation (odhady) - what is the goal, what types of estimates
The goal is to estimate the unknown parameters of the random variable X.
We have these estimators:
* parametric - the estimator is a function F together with a parameter theta
* nonparametric - the estimator is a function F without a parameter
* semi-parametric
Properties/types of a parametric estimate
- Unbiasedness (unbiased estimator)
  - the expected value of the estimate equals the parameter
  - asymptotically unbiased = the bias goes to zero as n -> inf
- Consistency (consistent estimator)
  - an estimate is consistent if, with a growing sample size, the estimate approaches the true value of the parameter
The accuracy of the estimate of parameter theta is measured by
MSE - mean squared error
MSE(T) = E(T - theta)^2
for an unbiased estimate MSE(T) = var T
Point estimate
The parameter is estimated by a single value - the value of the parameter is approximated.
Commonly used point estimates
for a distribution with mean µ and variance σ^2:
* the sample mean (arithmetic mean) is an unbiased and consistent estimate of µ
* the sample variance S^2 is an unbiased and consistent estimate of σ^2
Interval estimate
The parameter is estimated by an interval.
Construction of point estimates
- Consider our data as realization of random sample X_1,…, X_n.
- Specify the function F (the distribution up to an unknown param.)
- Finally, estimate our unknown parameter (e.g. theta)
* We can use maximum likelihood for example
Maximum likelihood method
L(θ) = Π_{i=1}^{n} f(x_i, θ) is the joint pdf (pmf) of the random sample X = (X_1, …, X_n); Π denotes the product.
L is called the likelihood function of the parameter θ with fixed x_1, …, x_n.
The maximum likelihood estimate is the θ that maximizes the likelihood function.
In practice it is easier to maximize the log-likelihood log L(θ), which has the same maximizer.
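As an illustration, a sketch of maximum likelihood for the rate λ of an exponential distribution, done numerically on the negative log-likelihood (assuming NumPy/SciPy; the closed-form MLE 1/mean(x) is printed for comparison):

import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=500)       # data with true lambda = 0.5

def neg_log_lik(lam):
    # negative log-likelihood of Exp(lambda): -(n*log(lambda) - lambda * sum(x))
    return -(len(x) * np.log(lam) - lam * x.sum())

res = minimize_scalar(neg_log_lik, bounds=(1e-6, 10.0), method="bounded")
print(res.x, 1 / x.mean())                     # numerical MLE vs. closed form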
Confidence interval
Assume that X_1, . . . , X_n is a random sample with cdf F(x, θ), where θ ∈ Θ is an unknown one-dimensional parameter. Interval [L,U] = [L(X1, . . . , Xn),U(X1, . . . , Xn)] is (1 − α)·100% confidence interval for parameter θ, if P(L(X_1, . . . , X_n) ≤ θ ≤ U(X_1, . . . , X_n)) = 1 − α, for all θ ∈ Θ.
Simply speaking, a confidence interval is your estimate plus and minus a margin of error. It is the range of values you expect your estimate to fall between if you redo your test, at a certain level of confidence.
Factors affecting the width of the confidence interval
sample size, the confidence level, and the variability of the data.
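A sketch of a (1 − α)·100% confidence interval for the mean of a normal sample, built from the t distribution (assuming SciPy; the data are illustrative):

import numpy as np
from scipy import stats

x = np.array([4.1, 5.2, 3.8, 4.9, 5.5, 4.4, 5.0, 4.7])
n, alpha = len(x), 0.05

mean, s = x.mean(), x.std(ddof=1)
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)       # critical value of t_{n-1}
half_width = t_crit * s / np.sqrt(n)
print(mean - half_width, mean + half_width)         # 95% CI for the mean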
Hypotheses testing
The goal is to decide between two competing hypotheses about parameter θ:
Null hypothesis
Alternative hypothesis
Our decision:
* We reject H_0
* We do not reject the null hypothesis H_0
Types of errors when testing hypotheses
Type I error - we reject H_0 although H_0 is true
Type II error - we do not reject H_0 although H_1 is true
Hypothesis test construction
- First, we have to find a pivotal statistic for a given model.
- Test statistic T is then the value of the pivotal statistic under H0.
- We have to find the distribution of the test statistic under the null hypothesis.
- Finally, we define the critical region W so that P(type I error) = α.
p-value
- p-value is the probability, under the null hypothesis, of obtaining a test result at least as extreme as the result actually observed.
- p-value is the smallest significance level at which we can reject the null hypothesis.
- If p-value is smaller than α, we reject the null hypothesis.
Two-sample t-test
The two-sample t-test (also known as the independent samples t-test) is a method used to test whether the unknown population means of two groups are equal or not.
- The normality assumption is crucial.
- Both samples have to be mutually independent.
- Both samples must have the same variability - necessary to check.
- If this assumption is not met, we cannot use the two-sample t-test.
- In that case, we may use e.g. Welch’s approximate t-test.
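Both variants are available in scipy.stats (a sketch with made-up groups):

from scipy import stats

group_a = [5.1, 4.9, 5.4, 5.0, 5.2, 4.8]
group_b = [4.5, 4.7, 4.4, 4.9, 4.6, 4.3]

# equal_var=True  -> classical two-sample t-test (equal variances assumed)
# equal_var=False -> Welch's approximate t-test
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=True)
print(t_stat, p_value)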
Paired t-test
Paired samples t-tests typically consist of a sample of matched pairs of similar units, or one group of units that has been tested twice (a “repeated measures” t-test).
Model: Z_1 = X_1 − Y_1, . . . , Z_n = X_n − Y_n is a random sample from normal distribution N (μ, σ^2).
Null hypothesis: H_0 : μ = 0.
One-sample t-test
The one-sample t-test is a statistical hypothesis test used to determine whether an unknown population mean is different from a specific value.
t = (avg(x) − μ_0) / (s / sqrt(n))
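Both the paired and the one-sample test have counterparts in scipy.stats (a sketch with illustrative data):

from scipy import stats

x = [5.1, 4.9, 5.4, 5.0, 5.2, 4.8]
y = [4.8, 4.9, 5.1, 4.7, 5.0, 4.6]

print(stats.ttest_1samp(x, popmean=5.0))   # one-sample t-test against mu_0 = 5
print(stats.ttest_rel(x, y))               # paired t-test, i.e. one-sample test on x - y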
ANOVA test
ANOVA, which stands for Analysis of Variance, is a statistical test used to analyze the difference between the means of more than two groups.
Mathematical model:
* All the samples are from normal distribution
* All samples are mutually independent.
* n_1, . . . , n_k are the sample sizes.
* n = n_1 + . . . + n_k is the overall sample size.
* H_0: μ_1 = μ_2 = . . . = μ_k .
* H_1: μ_i != μ_j for at least one pair i != j.
ANOVA determines whether the groups created by the levels of the independent variable are statistically different by calculating whether the means of the treatment levels are different from the overall mean of the dependent variable.
If any of the group means is significantly different from the overall mean, then the null hypothesis is rejected.
ANOVA uses the F test for statistical significance. This allows for comparison of multiple means at once, because the error is calculated for the whole set of comparisons rather than for each individual pairwise comparison (as would happen with a series of t-tests).
The F test compares the variance between the group means with the variance within the groups. If the variance within groups is small relative to the variance between groups, the F test yields a higher F value, and therefore a higher likelihood that the difference observed is real and not due to chance.
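One-way ANOVA is available as scipy.stats.f_oneway (a sketch with made-up groups):

from scipy import stats

g1 = [5.1, 4.9, 5.4, 5.0, 5.2]
g2 = [4.5, 4.7, 4.4, 4.9, 4.6]
g3 = [5.6, 5.8, 5.5, 5.9, 5.7]

f_stat, p_value = stats.f_oneway(g1, g2, g3)   # F statistic and p-value for H_0: equal means
print(f_stat, p_value)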