Common - Question 5 (Statistics) Flashcards
Data types in statistics
- Categorical (qualitative)
- Numerical (quantitative)
Target population
the complete set of individuals that might hypothetically be sampled for a particular study;
the ENTIRE group of individuals or objects to which researchers want to generalize their conclusions
Exploratory data analysis
summarizes the main characteristics of the data, often with visual methods.
Name measures of location and how to calculate them
- mean (arithmetic mean)
mean = 1/n * Σ x_i
- median (x_(k) denotes the k-th smallest value)
median = x_((n + 1)/2) for odd n
median = (x_(n/2) + x_((n/2) + 1))/2 for even n
- quantiles
- mode - the value that appears most often in a set of data values
- trimmed mean (a fixed number or share of the smallest and largest values is removed before averaging)
- winsorized mean (extreme values are replaced instead of removed: at 10 %, every value below the 10th percentile is set to the value at the 10th percentile, and likewise at the upper end)
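The measures of location above can be tried on a small data set. This is a minimal sketch: the helper names `trimmed_mean` and `winsorized_mean` and the choice to cut a fixed count `k` from each end (rather than a percentage) are assumptions made for illustration.

```python
import statistics

def trimmed_mean(values, k):
    """Mean after dropping the k smallest and k largest values."""
    ordered = sorted(values)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

def winsorized_mean(values, k):
    """Mean after replacing the k smallest/largest values by the nearest kept value."""
    ordered = sorted(values)
    n = len(ordered)
    clipped = [ordered[k]] * k + ordered[k:n - k] + [ordered[n - k - 1]] * k
    return sum(clipped) / n

# Illustrative data with one extreme value (100).
data = [2, 3, 3, 4, 5, 6, 7, 8, 9, 100]

mean = statistics.mean(data)        # pulled upward by the outlier
median = statistics.median(data)    # average of the two middle values (n is even)
mode = statistics.mode(data)        # most frequent value
trimmed = trimmed_mean(data, 1)     # drops 2 and 100
winsorized = winsorized_mean(data, 1)  # 2 becomes 3, 100 becomes 9
```

The median, trimmed mean and winsorized mean stay close to the bulk of the data, while the plain mean is dragged toward the outlier.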
Name measures of variability
- variance
s^2 = 1/(n-1) * Σ(x_i - mean)^2
- standard deviation
s = sqrt(s^2)
- range
R = x_(n) - x_(1) (largest value minus smallest value)
- interquartile range
IQR = Q_3 - Q_1
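A short sketch of the measures of variability, using Python's standard `statistics` module. Quartiles can be computed under several conventions; the default method of `statistics.quantiles` is used here as an assumption, and other software may give slightly different Q_1 and Q_3.

```python
import statistics

# Illustrative data.
data = [2, 4, 4, 4, 5, 5, 7, 9]

variance = statistics.variance(data)   # sample variance, 1/(n-1) denominator
std_dev = statistics.stdev(data)       # square root of the sample variance
value_range = max(data) - min(data)    # largest minus smallest value

# Quartiles: cut points splitting the ordered data into four parts.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1
```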
Name measures of shape
- skewness
- kurtosis
Skewness formula
b = 1/n * Σ ((x_i - mean) / s)^3
s = standard deviation
b > 0: skewed to the right, long right tail (typically mode < median < mean)
b < 0: skewed to the left, long left tail (typically mean < median < mode)
Kurtosis formula with meaning
b = 1/n * Σ ((x_i - mean) / s)^4 - 3
s = standard deviation
b > 0: heavy-tailed (peaked)
b < 0: light-tailed (flattened)
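The skewness and kurtosis formulas can be checked numerically. In this sketch the function names are hypothetical, and s is taken to be the sample standard deviation with the 1/(n-1) denominator, as in the variance card; that choice is an assumption.

```python
import statistics

def skewness(values):
    """b = 1/n * sum(((x_i - mean) / s)^3)."""
    n = len(values)
    mean = statistics.mean(values)
    s = statistics.stdev(values)  # assumption: sample SD, n - 1 denominator
    return sum(((x - mean) / s) ** 3 for x in values) / n

def excess_kurtosis(values):
    """b = 1/n * sum(((x_i - mean) / s)^4) - 3."""
    n = len(values)
    mean = statistics.mean(values)
    s = statistics.stdev(values)  # assumption: sample SD, n - 1 denominator
    return sum(((x - mean) / s) ** 4 for x in values) / n - 3

symmetric = [1, 2, 3, 4, 5]       # no skew expected
right_tailed = [1, 1, 1, 2, 10]   # one large value stretches the right tail

skew_symmetric = skewness(symmetric)
skew_right = skewness(right_tailed)
kurt_symmetric = excess_kurtosis(symmetric)
```

The symmetric sample gives a skewness of zero (up to rounding) and a negative excess kurtosis, while the sample with a long right tail gives a positive skewness.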
Define covariance
Let X and Y be random variables. The covariance of X and Y is defined as cov(X, Y) = E[(X − EX)(Y − EY)], provided this expectation is finite.
Define correlation coefficient
Let X and Y be random variables with finite nonzero variances varX and varY. The correlation coefficient of X and Y is defined as cor(X, Y) = cov(X, Y) / sqrt(varX * varY).
If X and Y are independent and the correlation coefficient exists, then cor(X, Y) = 0. The converse does not hold in general.
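A sketch of the sample analogues of covariance and correlation. The function names are hypothetical, and the 1/(n-1) denominator is an assumption chosen to match the sample variance; it cancels out in the correlation coefficient.

```python
import math

def sample_cov(xs, ys):
    """Sample covariance with the 1/(n-1) denominator."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def sample_cor(xs, ys):
    """cov(x, y) / sqrt(var(x) * var(y)); var(x) is cov(x, x)."""
    return sample_cov(xs, ys) / math.sqrt(sample_cov(xs, xs) * sample_cov(ys, ys))

# Exact linear relationship: y = 2x.
x_lin = [1, 2, 3, 4, 5]
y_lin = [2, 4, 6, 8, 10]
cor_linear = sample_cor(x_lin, y_lin)

# Fully dependent but uncorrelated: y = x^2 on points symmetric around zero.
x_sym = [-2, -1, 0, 1, 2]
y_sq = [x * x for x in x_sym]
cor_parabola = sample_cor(x_sym, y_sq)
```

The second pair shows that zero correlation does not imply independence: y is completely determined by x, yet the linear association is zero.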
Graph (chart) types and their descriptions
- Boxplot
5 values: max{x_(1), Q_1 - 1.5 IQR}, Q_1, median, Q_3, min{x_(n), Q_3 + 1.5 IQR}, where x_(1) and x_(n) are the smallest and largest observations
- Scatterplot
A plot that displays values for two variables of the dataset using Cartesian coordinates.
- Pie chart
A visualization of categorical data as shares of a whole.
- Histogram
A piecewise constant estimate of the distribution of the data.
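The five boxplot values can be computed directly from the formula on the boxplot card. This is a sketch only: `boxplot_values` is a hypothetical helper, and the quartiles come from the default method of `statistics.quantiles`, which is one of several conventions.

```python
import statistics

def boxplot_values(values):
    """Return (lower whisker, Q1, median, Q3, upper whisker)."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower = max(min(values), q1 - 1.5 * iqr)  # max{x_(1), Q1 - 1.5 IQR}
    upper = min(max(values), q3 + 1.5 * iqr)  # min{x_(n), Q3 + 1.5 IQR}
    return lower, q1, median, q3, upper

# Illustrative data; 50 lies beyond the upper whisker and would be drawn as an outlier.
data = [1, 2, 3, 4, 5, 6, 7, 8, 50]
lower, q1, median, q3, upper = boxplot_values(data)
```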
Random variable
A random variable X is a function from a sample space to the set of real numbers.
What characterizes a random variable?
X is characterised by its distribution, i.e. a model F from a set of models (probabilistic or statistical)
What is a cumulative distribution function
The function F_X: R -> [0, 1] defined as F_X(x) = P(X <= x) for x in R is the cdf of X.
F_X is a non-decreasing function.
What is a probability mass function
p(x) = P(X = x)
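A small sketch linking the probability mass function to the cumulative distribution function. The fair six-sided die is an illustrative assumption; exact fractions are used to avoid rounding.

```python
from fractions import Fraction

# pmf of a fair six-sided die: p(x) = P(X = x) = 1/6 for x = 1, ..., 6.
pmf = {face: Fraction(1, 6) for face in range(1, 7)}

def cdf(x):
    """F(x) = P(X <= x): sum of p(k) over all outcomes k <= x."""
    return sum(p for k, p in pmf.items() if k <= x)

total_mass = sum(pmf.values())
# Evaluate F on an increasing grid, including points between outcomes.
grid = [0, 1, 2.5, 3, 6, 6.5]
cdf_on_grid = [cdf(x) for x in grid]
```

On the grid, F starts at 0, never decreases, stays flat between outcomes, and reaches 1 at the largest outcome.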
Continuous random variable
A random variable X is (absolutely) continuous if there exists an integrable and nonnegative function f such that F(x) = ∫_{-inf}^{x} f(t) dt for all x in R.
f is called the probability density function.
Discrete random variable
A random variable X is discrete if there exists a finite or countable set {x_1, x_2, ...} such that P(X = x_i) > 0 for every i and Σ_i P(X = x_i) = 1.
Random sample
A random sample is an ordered n-tuple of random variables X_1, X_2, …, X_n that are stochastically independent and identically distributed.
x_1, x_2, …, x_n are the realisations of the random sample.
List discrete univariate distributions
- Bernoulli
- Binomial
- Poisson
- Geometric
- Uniform
Bernoulli distribution
A random variable X describes an experiment with two outcomes (success = 1, failure = 0), where p is the probability of success: P(X = 1) = p and P(X = 0) = 1 − p.
EX = p and varX = p(1 − p).
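The Bernoulli mean and variance follow directly from the definitions of expectation and variance. The sketch below checks them with exact fractions; the value p = 3/10 is an arbitrary illustrative choice.

```python
from fractions import Fraction

p = Fraction(3, 10)          # illustrative success probability
pmf = {0: 1 - p, 1: p}       # P(X = 0) = 1 - p, P(X = 1) = p

# EX = sum of x * P(X = x)
ex = sum(x * prob for x, prob in pmf.items())

# varX = sum of (x - EX)^2 * P(X = x)
var = sum((x - ex) ** 2 * prob for x, prob in pmf.items())
```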