Common - Question 5 (Statistics) Flashcards

1
Q

Data types in statistics

A
  • Categorical (qualitative)
  • Numerical (quantitative)
2
Q

Target population

A

the complete set of individuals that might hypothetically be sampled for a particular study; the ENTIRE group of individuals or objects to which researchers want to generalize their conclusions

3
Q

Exploratory data analysis

A

summarizes the main characteristics of the data, often with visual methods.

4
Q

Name measures of location and how to calculate them

A
  • mean (aritmetický průměr)
    mean = (1/n) Σ x_i
  • median
    median = x_((n+1)/2) for odd n
    median = (x_(n/2) + x_(n/2 + 1)) / 2 for even n
  • quantiles
  • mode (modus) - the value that appears most often in the data
  • trimmed mean - a fixed number of the smallest and largest values is removed before averaging
  • winsorized mean - extreme values are replaced rather than removed: e.g. with 10% winsorization, all values below the 10th percentile are set to the value at the 10th percentile, and similarly at the upper end
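A minimal sketch of these location measures with Python's standard `statistics` module (the data values are made up for illustration):

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]   # made-up example data

mean = statistics.mean(data)      # (1/n) * sum of x_i
median = statistics.median(data)  # averages the two middle values for even n
mode = statistics.mode(data)      # the most frequent value

# Trimmed mean: drop the k smallest and k largest values, then average the rest.
def trimmed_mean(xs, k):
    xs = sorted(xs)
    return statistics.mean(xs[k:len(xs) - k])
```

Here `trimmed_mean(data, 1)` averages [4, 4, 4, 5, 5, 7].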
5
Q

Name measures of variability

A
  • variance (rozptyl)
    s^2 = 1/(n-1) * Σ(x_i - mean)^2
  • standard deviation (směrodatná odchylka)
    s = sqrt(s^2)
  • range
    R = x_(n) - x_(1) (maximum minus minimum)
  • Interquartile range
    IQR = Q_3 - Q_1
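The same toy data can illustrate the variability measures; `statistics.quantiles` with `n=4` returns the three quartiles (a sketch; other quantile conventions exist):

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]   # made-up example data

s2 = statistics.variance(data)    # sample variance, divides by n - 1
s = statistics.stdev(data)        # sample standard deviation = sqrt(s2)
r = max(data) - min(data)         # range
q1, q2, q3 = statistics.quantiles(data, n=4)  # quartiles Q_1, median, Q_3
iqr = q3 - q1                     # interquartile range
```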
6
Q

Name measures of shape

A
  • skewness (šikmost)
  • kurtosis (špičatost)
7
Q

Skewness formula

A

b = (1/n) Σ { (x_i − x̄) / s }^3

s = standard deviation, x̄ = sample mean

b > 0 skewed to the right (doprava): typically mode < median < x̄
b < 0 skewed to the left (doleva): typically x̄ < median < mode

8
Q

Kurtosis formula with meaning

A

b = (1/n) Σ { (x_i − x̄) / s }^4 − 3

s = standard deviation, x̄ = sample mean

b > 0 heavy-tailed, peaked (špičaté)
b < 0 light-tailed, flat (zploštělé)
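The two shape measures above can be computed directly from their formulas; a sketch in plain Python (using the sample standard deviation with n − 1):

```python
import math

def sample_sd(xs):
    m = sum(xs) / len(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / (len(xs) - 1))

def skewness(xs):
    m, s, n = sum(xs) / len(xs), sample_sd(xs), len(xs)
    return sum(((x - m) / s) ** 3 for x in xs) / n

def excess_kurtosis(xs):
    m, s, n = sum(xs) / len(xs), sample_sd(xs), len(xs)
    return sum(((x - m) / s) ** 4 for x in xs) / n - 3
```

Symmetric data gives zero skewness; a single large outlier pulls it positive.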

9
Q

Define covariance

A

Let X and Y be random variables. The covariance of X and Y is defined as cov(X, Y) = E[(X − EX)(Y − EY)], provided this expectation is finite.

10
Q

Define correlation coefficient

A

Let X and Y be random variables with finite nonzero variances varX and varY. The correlation coefficient of X and Y is defined as cor(X, Y) = cov(X, Y) / sqrt(varX · varY).

If X and Y are independent and correlation coefficient exists, then cor(X, Y ) = 0.

11
Q

Graph (charts) types and their description

A
  • Boxplot
    5 values: max{x_(1), Q_1 − 1.5·IQR} (lower whisker), Q_1, median, Q_3, min{x_(n), Q_3 + 1.5·IQR} (upper whisker)
  • Scatterplot
    is a plot that displays values for two variables of the dataset using Cartesian coordinates.
  • Pie chart
    shows the relative frequencies of categories as slices of a circle
  • Histogram
    Histogram is a piecewise constant estimate of the distribution of the data.
12
Q

Random variable (náhodná veličina)

A

A random variable X is a function from a sample space to the set of real numbers.

13
Q

What characterizes a random variable?

A

X is characterised by a model F from the set of models (probabilistic or statistical)

14
Q

What is cumulative distribution function

A

Function F: R → [0, 1] defined by F_X(x) = P(X ≤ x) for x ∈ R is the cdf of X.
F is a non-decreasing function.

15
Q

What is probability mass function

A

p(x) = P(X = x)

16
Q

Continuous random variable

A

Random variable X is (absolutely) continuous, if there exists an integrable and nonnegative function f such that F(x) = ∫_{−∞}^{x} f(t) dt for all x ∈ R.

f is called probability density function

17
Q

Discrete random variable

A

Random variable X is discrete, if there exists a finite or countable set {x1, x2, . . .} such that P(X = x_i) > 0 and sum of probabilities of all realisations of X is 1.

18
Q

Random sampling (náhodný vyběr)

A

A random sample is an ordered n-tuple of random variables X_1, X_2, …, X_n that are stochastically independent and identically distributed.

x_1, x_2, …, x_n are the realizations of the random sample.

19
Q

List discrete univariate distributions

A
  • Bernoulli
  • Binomial
  • Poisson
  • Geometric
  • Uniform
20
Q

Bernoulli distribution

A

Random variable X describes events with two outcomes (success, failure) with p the probability of success.

EX = p and varX = p(1 − p).

21
Q

Binomial distribution

A

A random variable X has binomial distribution with parameters n ∈ N and p ∈ (0, 1), if the pmf is given by p(x) = (n over x) p^x (1 − p)^(n−x) for x = 0, 1, …, n. Random variable X represents the number of successes in n repeated independent Bernoulli trials with p the probability of success at each individual trial.

EX = np and varX = np(1 − p).

22
Q

Poisson distribution

A

A random variable X has Poisson distribution with parameter λ > 0, if the
pmf of X is given by

p(x) = P(X = x) = e^(−λ) * λ^x/x! if x = 0, 1, …

Random variable X represents the number of events occurring in a fixed interval of time, if these events occur at a constant mean rate and independently of the time since the last event.

EX = λ and varX = λ.

23
Q

Geometric distribution

A

Random variable X represents the number of failures before the first success when repeating independent Bernoulli trials with p the probability of success at each individual trial.

A random variable X has geometric distribution with parameter p ∈ (0, 1), if
the pmf of X is given by
p(x) = P(X = x) = p(1 - p)^x if x=0,1,…
p(x) = P(X = x) = 0 otherwise

EX = (1−p)/p
and varX = (1−p)/p^2
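EX and varX can be sanity-checked numerically by summing a long truncated pmf (a sketch; p = 0.3 is an arbitrary choice):

```python
p = 0.3

def pmf(x):
    # P(X = x) for the geometric distribution counting failures before success
    return p * (1 - p) ** x

# Truncated sums approximate the moments; the tail beyond 2000 is negligible.
ex = sum(x * pmf(x) for x in range(2000))
ex2 = sum(x * x * pmf(x) for x in range(2000))
var = ex2 - ex ** 2
```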

24
Q

Uniform distribution

A

All the values of X are equally probable.

A random variable X has discrete uniform distribution on the finite set A, if the pmf of X is given by
p(x) = P(X = x) = 1/|A| if x in A
p(x) = P(X = x) = 0 otherwise

25
Q

List examples of continuous distributions

A
  • Continuous uniform
  • Exponential
  • Standard normal
  • Normal
  • Cauchy
  • chi-square (χ²)
  • Student's t-distribution
  • Fisher-Snedecor F-distribution
26
Q

Continuous uniform distribution

A

A random variable X has continuous uniform distribution on the interval (a, b), if the pdf of X is given by
f(x) = 1/(b - a) if x in (a, b)
f(x) = 0 otherwise

27
Q

Exponential distribution

A

A random variable X has exponential distribution with parameter λ > 0, if
the pdf of X is given by
f(x) = λe^(−λx) if x > 0

Random variable X represents time between two events in a Poisson point process.

28
Q

Normal distribution

A

A random variable X has normal distribution with parameters μ ∈ R and σ^2 > 0, if the pdf of X is given by

f(x) = (1/sqrt(2πσ^2)) e^(−(x − μ)^2 / (2σ^2))

29
Q

Chi-square (χ²) distribution

A

The sum of squares of k independent standard normal random variables has χ² distribution with k degrees of freedom.

EX = k and varX = 2k.
30
Q

Student's t-distribution

A

If Z ~ N(0, 1) and V ~ χ²_k are independent, then T = Z / sqrt(V/k) has Student's t-distribution with k degrees of freedom. It is symmetric around 0 and has heavier tails than the normal distribution; for large k it approaches N(0, 1).
31
Q

Fisher-Snedecor F-distribution

A

If U ~ χ²_m and V ~ χ²_n are independent, then F = (U/m) / (V/n) has the F-distribution with m and n degrees of freedom. It arises as the distribution of the test statistic in ANOVA and in tests comparing two variances.
32
Q

Central Limit Theorem (Centrální limitní věta - CLV)

A

Assumptions:
* X_n are independent
* X_n come from the same distribution
* X_n have a finite expected value and variance (the Cauchy distribution, for example, does not)

The means of samples have an approximately normal distribution, no matter which distribution they are drawn from (e.g. if we draw 100 samples from any distribution, calculate the mean, and repeat this many times to get many mean values, then those mean values will form a normal distribution).

Approximation:
(Σ X_i − n·μ) / (σ · sqrt(n)) ≈ N(0, 1)

μ = expected value
σ = standard deviation
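The theorem can be demonstrated by simulation; a sketch that draws from the (very non-normal) exponential distribution, which has μ = σ = 1:

```python
import math
import random
import statistics

random.seed(0)
n, reps = 200, 2000
mu = sigma = 1.0   # exponential(1) has mean 1 and standard deviation 1

zs = []
for _ in range(reps):
    total = sum(random.expovariate(1.0) for _ in range(n))
    zs.append((total - n * mu) / (sigma * math.sqrt(n)))   # the CLT statistic

# The standardized sums should look approximately N(0, 1):
z_mean = statistics.mean(zs)
z_sd = statistics.stdev(zs)
coverage = sum(abs(z) < 1.96 for z in zs) / reps   # should be near 0.95
```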

33
Q

What is the Central Limit Theorem used for?

A

Used e.g. for confidence intervals, t-tests, ANOVA

  • we do not need to know which distribution the samples come from; it is enough to know that the sample means form a normal distribution
34
Q

PCA (principal component analysis) - possibly optional

A

A transformation of the data (a different view of the data).

PC1 is the line fitted through the data so that the projected data have the largest possible variance (the projected points are spread as far apart as possible) and, at the same time, the smallest squared reconstruction error.

Each further principal component is orthogonal to the previous ones.

Each component captures a certain amount of the variance in the data (which is why we want the largest variance) - as a consequence, PCA can reduce the number of dimensions (attributes that do not influence the result) or the amount of data (for faster processing).

35
Q

Estimators - what is the goal, what types of estimators exist

A

The goal is to estimate the unknown parameters of a random variable X.

Types of estimators:
* parametric - the estimator assumes a distribution F with a parameter theta
* nonparametric - the estimator assumes a distribution F without parameters
* semi-parametric

36
Q

Properties/types of parametric estimators

A
  • Unbiasedness (unbiased estimator)
    • the expected value of the estimator equals the parameter
    • asymptotically unbiased = the bias tends to zero as n → ∞
  • Consistency (consistent estimator)
    • an estimator is consistent if, as the number of samples grows, the estimate converges to the true value of the parameter
37
Q

The precision of an estimator of parameter theta is measured by

A

MSE - mean squared error

MSE(T) = E[(T − θ)^2]

For an unbiased estimator, MSE(T) = var T.

38
Q

Point estimate

A

The parameter is estimated by a single value - the value of the parameter is approximated.

39
Q

Commonly used point estimates

A

For a distribution with mean µ and variance σ^2:
* the sample mean (arithmetic mean) is an unbiased and consistent estimate of µ
* the sample variance S^2 is an unbiased estimate of σ^2

40
Q

Interval estimate

A

The parameter is estimated by an interval.

41
Q

Construction of point estimates

A
  1. Consider our data as realization of random sample X_1,…, X_n.
  2. Specify the function F (the distribution up to an unknown param.)
  3. Finally, estimate our unknown parameter (e.g. theta)
    * We can use maximum likelihood for example
42
Q

Maximum likelihood method

A

L(θ) = ∏_{i=1}^{n} f(x_i, θ) is the joint pdf (pmf) of the random sample X = (X_1, …, X_n); ∏ denotes a product.
L is called the likelihood function of parameter θ with fixed x_1, …, x_n.
The maximum likelihood estimate is the θ that maximizes the likelihood function.
In practice it is easier to maximize the log-likelihood.
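For the exponential distribution the MLE has the closed form λ̂ = 1/x̄; a sketch that checks it against the log-likelihood directly (data values are made up):

```python
import math
import statistics

data = [0.5, 1.2, 0.3, 2.0, 0.9]   # made-up observations

def log_likelihood(lam, xs):
    # log L(lambda) = n*log(lambda) - lambda * sum(x_i) for the exponential pdf
    return len(xs) * math.log(lam) - lam * sum(xs)

lam_hat = 1 / statistics.mean(data)   # closed-form MLE for the exponential
```

The closed form indeed beats nearby values of λ on the log-likelihood.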

43
Q

Confidence interval

A

Assume that X_1, . . . , X_n is a random sample with cdf F(x, θ), where θ ∈ Θ is an unknown one-dimensional parameter. Interval [L,U] = [L(X1, . . . , Xn),U(X1, . . . , Xn)] is (1 − α)·100% confidence interval for parameter θ, if P(L(X_1, . . . , X_n) ≤ θ ≤ U(X_1, . . . , X_n)) = 1 − α, for all θ ∈ Θ.

Simply speaking, a confidence interval is a range around the estimate that quantifies its uncertainty: if we repeated the sampling many times and computed the interval each time, (1 − α)·100% of those intervals would contain the true parameter value.
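A sketch of a large-sample (normal-approximation) 95% interval for a mean, using `statistics.NormalDist` for the quantile (made-up data; for small samples a t-quantile would be used instead):

```python
import math
import statistics

data = [4.8, 5.1, 5.0, 4.9, 5.3, 5.2, 4.7, 5.0, 5.1, 4.9]   # made-up sample
n = len(data)
m = statistics.mean(data)
s = statistics.stdev(data)

z = statistics.NormalDist().inv_cdf(0.975)   # ≈ 1.96 for 1 - alpha = 0.95
half = z * s / math.sqrt(n)
ci = (m - half, m + half)   # (L, U)
```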

44
Q

Factors affecting the width of the confidence interval

A

sample size, the confidence level, and the variability of the data.

45
Q

Hypotheses testing

A

The goal is to decide between two competing hypotheses about parameter θ:
Null hypothesis
Alternative hypothesis

Our decision:
* We reject H_0
* We do not reject the null hypothesis H_0

46
Q

Types of errors when testing hypotheses

A

Type I - we reject H_0 although H_0 is true
Type II - we do not reject H_0 although H_1 is true

47
Q

Hypothesis test construction

A
  • First, we have to find a pivotal statistic for a given model.
  • Test statistic T is then the value of the pivotal statistic under H0.
  • We have to find the distribution of the test statistic under the null hypothesis.
  • Finally, we define the critical region W so that P(type I error) = α.
48
Q

p-value

A
  • p-value is the probability under the null hypothesis of obtaining test result at least as extreme as the result actually observed.
  • p-value is the smallest significance level at which we can reject the null hypothesis.
  • If p-value is smaller than α, we reject the null hypothesis.
49
Q

Two-sample t-test

A

The two-sample t-test (also known as the independent samples t-test) is a method used to test whether the unknown population means of two groups are equal or not.

  • The normality assumption is crucial.
  • Both samples have to be mutually independent.
  • Both samples must have the same variance - this needs to be checked.
  • If this assumption is not met, we cannot use the two-sample t-test.
  • In that case, we may use e.g. Welch’s approximate t-test.
50
Q

Paired t-test

A

Paired samples t-tests typically consist of a sample of matched pairs of similar units, or one group of units that has been tested twice (a “repeated measures” t-test).

Model: Z_1 = X_1 − Y_1, . . . , Z_n = X_n − Y_n is a random sample from normal distribution N (μ, σ^2).

Null hypothesis: H_0 : μ = 0.

51
Q

One-sample t-test

A

The one-sample t-test is a statistical hypothesis test used to determine whether an unknown population mean is different from a specific value.

t = (x̄ − μ_0) / (s / sqrt(n))
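The statistic can be computed directly from the formula above (made-up sample, hypothesized mean μ_0 = 5.0):

```python
import math
import statistics

sample = [5.2, 4.9, 5.4, 5.1, 5.3, 5.0]   # made-up observations
mu0 = 5.0                                  # hypothesized mean

n = len(sample)
t = (statistics.mean(sample) - mu0) / (statistics.stdev(sample) / math.sqrt(n))
```

The resulting t is compared against the t-distribution with n − 1 degrees of freedom.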

52
Q

ANOVA test

A

ANOVA, which stands for Analysis of Variance, is a statistical test used to analyze the difference between the means of more than two groups.

Mathematical model:
* All the samples are from normal distribution
* All samples are mutually independent.
* n_1, . . . , n_k are the sample sizes.
* n = n_1 + . . . + n_k is the overall sample size.
* H_0: μ_1 = μ_2 = . . . = μ_k .
* H_1: μ_i != μ_j for at least one pair i != j.

ANOVA determines whether the groups created by the levels of the independent variable are statistically different by calculating whether the means of the treatment levels are different from the overall mean of the dependent variable.

If any of the group means is significantly different from the overall mean, then the null hypothesis is rejected.

ANOVA uses the F test for statistical significance. This allows for comparison of multiple means at once, because the error is calculated for the whole set of comparisons rather than for each individual two-way comparison (which would happen with a t test).

The F test compares the variance between the group means with the variance within the groups. If the variance within groups is smaller than the variance between groups, the F test yields a higher F value, and therefore a higher likelihood that the observed difference is real and not due to chance.
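A sketch of the one-way ANOVA F statistic computed by hand (three made-up groups with clearly different means):

```python
import statistics

groups = [[6.1, 5.8, 6.3], [5.2, 5.0, 5.4], [6.9, 7.1, 7.0]]  # made-up data

k = len(groups)
n = sum(len(g) for g in groups)
grand = statistics.mean([x for g in groups for x in g])
means = [statistics.mean(g) for g in groups]

# Between-group and within-group sums of squares
ssb = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
ssw = sum(sum((x - m) ** 2 for x in g) for g, m in zip(groups, means))

f_stat = (ssb / (k - 1)) / (ssw / (n - k))   # F with (k-1, n-k) degrees of freedom
```

A large F (here far above typical critical values) leads to rejecting H_0.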