Statistics Flashcards
Population vs samples and parameters vs statistics
The first step is to find out whether you are dealing with a population or a sample
Population:
All items of interest
Number of items is denoted with N
Numbers obtained are called parameters
Sample:
Subset of population
Number of items is denoted with n (lower case)
Numbers obtained are called statistics
Populations are hard to define and hard to observe in real life
Samples, however, are less time-consuming and less costly to collect
Randomness vs. representativeness
Randomness –> A random sample is one in which each member is chosen from the population strictly by chance
A sample is not random when a large portion of the population had no chance of being chosen
Representativeness –> A representative sample is a subset that accurately reflects the members of the whole population
Which types of data can we define along with their subcategories?
Categorical
- Categories, groups
- Yes/No questions
Numerical –> Represents numbers
- Discrete numbers –> Countable values, usually integers, like the number of children you will have
- Continuous numbers –> Infinitely many possible values, impossible to count –> Weight, for example, which is always recorded as a rounded number
What are the measurement levels of the data type categories?
Qualitative data
- Nominal –> Categories that cannot be put in any order
- Ordinal –> Follow a strict order –> Rating your lunch for example from 1 to 5 stars
Quantitative data
- Interval –> Does not have a true zero, like temperature in Celsius or Fahrenheit (unlike Kelvin, which does have a true zero)
- Ratio –> Has a true zero, like distance or time
What is the histogram relative frequency?
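Relative frequency –> Share of all observations that falls in each interval (interval frequency / total number of observations), usually shown as a percentage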
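A minimal sketch of how the relative frequency per interval could be computed. The data and the interval edges are hypothetical and chosen only for illustration.

```python
# Hypothetical data and interval edges (assumptions, not from the notes).
values = [1, 3, 4, 7, 8, 9, 12, 15, 18, 19]
edges = [0, 5, 10, 15, 20]  # intervals [0,5), [5,10), [10,15), [15,20)

# Count how many values fall in each interval.
counts = [0] * (len(edges) - 1)
for v in values:
    for i in range(len(counts)):
        if edges[i] <= v < edges[i + 1]:
            counts[i] += 1
            break

# Relative frequency = interval frequency / total number of observations.
relative = [c / len(values) for c in counts]

for i, share in enumerate(relative):
    print(f"[{edges[i]}, {edges[i + 1]}): {share:.0%}")
```

The relative frequencies always add up to 1 (100%), which is what makes histograms of datasets of different sizes comparable.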
When are scatter plots used?
Scatter plots
Used when we are representing two numerical variables
Example:
Horizontal axis –> Reading scores
Vertical axis –> Writing scores
Both axes are numerical
What is an outlier?
Data point that goes against the logic of the whole dataset
Define mean
Simple average
Denoted with μ for a population
x̄ for sample
Downside: Easily disturbed by an outlier!
Define median
Middle number
Position in the ordered dataset: (n+1) / 2 (with an even n, take the average of the two middle numbers)
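To see why the mean is easily disturbed by an outlier while the median is not, here is a small sketch using Python's standard statistics module. The numbers are hypothetical.

```python
import statistics

values = [1, 2, 3, 4, 5]          # hypothetical data
with_outlier = values + [100]     # same data plus one extreme value

mean_before = statistics.mean(values)
median_before = statistics.median(values)

mean_after = statistics.mean(with_outlier)
median_after = statistics.median(with_outlier)

print(mean_before, median_before)   # both are 3
print(mean_after, median_after)     # the mean jumps, the median barely moves
```

With six values, position (n+1) / 2 = 3.5 falls between the 3rd and 4th ordered value, so the median is the average of those two.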
Define mode
Value that occurs most often
When each value appears only once –> We say there is NO mode
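A possible way to find the mode, following the convention of this card that a dataset has no mode when every value occurs exactly once. The function name modes is hypothetical and the sketch assumes a non-empty list.

```python
from collections import Counter

def modes(values):
    """Return all most frequent values, or [] when every value occurs once."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []  # no value repeats, so there is no mode
    return [v for v, c in counts.items() if c == top]

print(modes([1, 2, 2, 3]))     # one mode
print(modes([1, 2, 3]))        # no mode
print(modes([1, 1, 2, 2]))     # two modes
```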
What is skewness and what does it indicate?
Skewness indicates whether the data is concentrated on one side
Right skew vs left skew
Right skew:
The mean is bigger than the median –> mean > median
The outliers are to the right
Mode –> Highest point in graph
Left skew:
mean < median
Outliers are to the left
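The relationship between mean and median can be checked on two small, hypothetical datasets: one with an outlier on the right and its mirror image with an outlier on the left.

```python
import statistics

right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]          # outlier to the right
left_skewed = [21 - x for x in right_skewed]      # mirrored: outlier to the left

right_mean = statistics.mean(right_skewed)
right_median = statistics.median(right_skewed)

left_mean = statistics.mean(left_skewed)
left_median = statistics.median(left_skewed)

print(right_mean > right_median)   # right skew: mean > median
print(left_mean < left_median)     # left skew: mean < median
```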
What does variance measure?
Variance measures the dispersion of a set of data points around their mean value
Why are the differences squared when computing variance?
Squaring guarantees non-negative terms, so positive and negative differences do not cancel out
Amplifies effect of large differences
Population variance vs sample variance
Population variance: σ² = ∑(xi - μ)² / N
Sample variance: S² = ∑(xi - x̄)² / (n - 1)
Note: the sample formula uses x̄ and n - 1 instead of μ and N
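A short sketch of both variance formulas, written out by hand and then compared with Python's statistics module. The data is hypothetical; the same five numbers are treated once as a full population and once as a sample.

```python
import statistics

data = [1, 2, 3, 4, 5]  # hypothetical values

mean = sum(data) / len(data)
squared_diffs = [(x - mean) ** 2 for x in data]

# Population variance: divide by N.
population_variance = sum(squared_diffs) / len(data)

# Sample variance: divide by n - 1.
sample_variance = sum(squared_diffs) / (len(data) - 1)

print(population_variance, statistics.pvariance(data))
print(sample_variance, statistics.variance(data))
```

Dividing by n - 1 makes the sample variance slightly larger than the population formula would give for the same numbers.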
Population standard deviation vs sample standard deviation
Population standard deviation –> σ = SQRT(σ²)
Sample standard deviation –> S = SQRT(S²)
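As a quick check that the standard deviation is just the square root of the variance, the sketch below takes the root of each variance and compares it with the ready-made functions in Python's statistics module. The data is hypothetical.

```python
import math
import statistics

data = [1, 2, 3, 4, 5]  # hypothetical values

# Square root of the population variance and of the sample variance.
population_sd = math.sqrt(statistics.pvariance(data))
sample_sd = math.sqrt(statistics.variance(data))

print(population_sd, statistics.pstdev(data))
print(sample_sd, statistics.stdev(data))
```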
What is the coefficient of variation?
Relative standard deviation: Standard deviation / mean
Population: Cv = σ / μ
Sample: Cv = s / x̄
Why use coefficients of variation?
Standard deviation is the most common measure of variability for a single dataset
The coefficient of variation is a much better measure for comparing two datasets, because it has no unit of measurement
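A small illustration of why the coefficient of variation is handy for comparisons: the same hypothetical prices expressed in dollars and in cents have very different standard deviations but the same coefficient of variation. The function name is an assumption for this sketch.

```python
import statistics

def coefficient_of_variation(sample):
    """Sample standard deviation divided by the sample mean."""
    return statistics.stdev(sample) / statistics.mean(sample)

prices_dollars = [1, 2, 3, 4, 5]                    # hypothetical prices
prices_cents = [p * 100 for p in prices_dollars]    # same prices, other unit

sd_dollars = statistics.stdev(prices_dollars)
sd_cents = statistics.stdev(prices_cents)

cv_dollars = coefficient_of_variation(prices_dollars)
cv_cents = coefficient_of_variation(prices_cents)

print(sd_dollars, sd_cents)   # differ by a factor of 100
print(cv_dollars, cv_cents)   # practically identical
```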
What is Covariance?
Two-dimensional: it involves two variables
Unlike the formulas for population variance and sample variance, there is now also a y-component
Otherwise the formula has the same form for population and sample
Notice that σ and S are NOT squared in the notation
Cov(x,y) = σ(xy)
Covariance formula?
Population: σ(xy) = ∑( (xi - μx)(yi - μy) ) / N
Sample: S(xy) = ∑( (xi - x̄)(yi - ȳ) ) / (n - 1)
Covariance meaning?
It gives a sense of direction in which the two variables are heading
> 0 means the two variables move together
<0 means the two variables move in opposite directions
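= 0 means the two variables have no linear relationship (independent variables always have covariance 0, but covariance 0 alone does not prove independence)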
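The sign of the covariance can be tried out with a hand-written sample covariance, which divides the summed products of deviations by n - 1. The function name and the data are hypothetical.

```python
def sample_covariance(x, y):
    """Sample covariance: sum of (xi - mean_x)(yi - mean_y), divided by n - 1."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    products = [(a - mean_x) * (b - mean_y) for a, b in zip(x, y)]
    return sum(products) / (n - 1)

x = [1, 2, 3, 4, 5]
y_up = [2, 4, 6, 8, 10]      # rises when x rises
y_down = [10, 8, 6, 4, 2]    # falls when x rises

cov_up = sample_covariance(x, y_up)
cov_down = sample_covariance(x, y_down)

print(cov_up)     # positive: the variables move together
print(cov_down)   # negative: the variables move in opposite directions
```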
What does correlation do?
Adjusts covariance, so that the relationship between the two variables becomes easy and intuitive to interpret
Use either the sample or the population formula, depending on the data you are working with
How to calculate correlation coefficient?
Cov(x,y) = σ(xy)
Population: σ(xy) / (σ(x)σ(y))
Sample: S(xy) / (SxSy)
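A minimal sketch of the sample correlation coefficient: the sample covariance divided by the product of the two sample standard deviations. The function name and the data (study hours and test scores) are hypothetical.

```python
import math

def sample_correlation(x, y):
    """Sample covariance divided by the product of the sample standard deviations."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    s_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / (n - 1))
    s_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / (n - 1))
    return s_xy / (s_x * s_y)

hours = [1, 2, 3, 4, 5]          # hypothetical study hours
score = [52, 55, 61, 70, 72]     # hypothetical test scores

r = sample_correlation(hours, score)
print(round(r, 3))
```

Swapping the two arguments gives the same result, because the covariance in the numerator and the product in the denominator do not depend on the order of the variables.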
How to interpret correlation?
The correlation coefficient is always between -1 and 1
1 –> Entire variability of one variable is explained by the other
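Almost 1 –> Strong positive relationship between the two variables
0 –> No linear relationship between the two variables
Negative correlation –> When one variable increases, the other tends to decrease (-1 is a perfect negative relationship)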
Is the correlation between X and Y the same as the correlation between Y and X?
Yes.
Hence: σ(xy) / (σ(x)σ(y))
Where σ(xy) is the same as σ(yx)
What is causality?
Causation indicates that one event is the result of the occurrence of the other event; i.e. there is a causal relationship between the two events. This is also referred to as cause and effect.
It is important to understand the direction of causal relationships
When can correlations be disregarded?
It is common practice to disregard correlations below 0.2 (in absolute value)
How to calculate the Z-score
Z = (Y - μ) / σ
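The formula can be applied to a whole dataset at once. In this sketch Y is each original value, μ the population mean and σ the population standard deviation; the function name and the data are hypothetical.

```python
import statistics

def z_scores(values):
    """Standardize each value: subtract the mean, divide by the standard deviation."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    return [(y - mu) / sigma for y in values]

data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical values, mean 5, std dev 2
z = z_scores(data)

print(z)
print(statistics.mean(z), statistics.pstdev(z))   # standardized data: mean 0, std dev 1
```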
What is the central limit theorem?
In probability theory, the central limit theorem (CLT) establishes that, in many situations, for independent and identically distributed random variables, the sampling distribution of the standardized sample mean tends towards the standard normal distribution even if the original variables themselves are not normally distributed.
When do we speak of a sampling distribution?
A sampling distribution is a probability distribution of a statistic obtained from a large number of samples drawn from a specific population. It describes the range of outcomes that could possibly occur for that statistic, and how often each outcome occurs.
How to denote the sampling distribution?
Sampling distribution denoted:
~N(μ, σ²/n)
This leads to the insights:
The bigger the sample size, the smaller the variance of the sample mean and the more accurate the estimate
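A rough simulation of the sampling distribution of the mean, as a sanity check on the σ²/n term. The population here is assumed to be uniform on [0, 1), which has mean 0.5 and variance 1/12 and is clearly not normal. The sample size, number of samples and seed are arbitrary choices.

```python
import random
import statistics

random.seed(0)        # fixed seed so the run is repeatable

n = 30                # size of each sample (assumed)
num_samples = 2000    # number of samples drawn (assumed)

# Draw many samples from the uniform population and record each sample mean.
sample_means = []
for _ in range(num_samples):
    sample = [random.random() for _ in range(n)]
    sample_means.append(sum(sample) / n)

observed_mean = statistics.mean(sample_means)
observed_variance = statistics.pvariance(sample_means)
expected_variance = (1 / 12) / n   # sigma squared divided by n

print(observed_mean)                          # close to 0.5
print(observed_variance, expected_variance)   # close to each other
```

Raising n in this sketch shrinks both the expected and the observed variance of the sample means.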
What does the CLT allow us to do?
Make inferences using the normal distribution, even when the population is not normally distributed
Standard error: Definition and formula
Standard deviation of the distribution formed by the sample means, which is:
√(σ²/n) = σ/√n
Means that:
Error decreases when sample size increases
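The effect of the sample size on the standard error can be shown with a few lines of Python. The population standard deviation of 10 and the sample sizes are hypothetical.

```python
import math

def standard_error(sigma, n):
    """Standard deviation of the sample mean: sigma divided by the square root of n."""
    return sigma / math.sqrt(n)

se_25 = standard_error(10, 25)     # 10 / 5
se_100 = standard_error(10, 100)   # 10 / 10

print(se_25, se_100)
```

Because of the square root, the sample has to become four times as large to cut the standard error in half.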
Why is the standard error important?
Important because it is used in most statistical tests –> It shows how well you approximated the true mean