Statistics Flashcards
Population vs samples and parameters vs statistics
First step is to find out whether you are dealing with a population or a sample
Population:
All items of interest
Denoted with N
Numbers obtained are called parameters
Sample:
Subset of population
Denoted with n (lower case)
Numbers obtained are called statistics
Populations are hard to define and hard to observe in real life
Samples, however, are less time-consuming and less costly
Randomness vs. representativeness
Randomness –> Random sample is collected when each member of the sample is chosen from the population strictly by chance
A group is not random when a large portion of the group did not have the chance to be chosen
Representative –> Sample is a subset of the population that accurately reflects the members of the entire population
Which types of data can we define along with their subcategories?
Categorical
- Categories, groups
- Yes/No questions
Numerical –> Represents numbers
- Discrete numbers –> Can only take certain (integer) values, like the number of children you will have
- Continuous numbers –> Infinitely many possible values, impossible to count, like weight (any recorded value is a rounded number)
What are the measurement levels of the data type categories?
Qualitative data
- Nominal –> Like categorical data
- Ordinal –> Follow a strict order –> Rating your lunch for example from 1 to 5 stars
Quantitative data
- Interval –> Does not have a true zero, like temperature in Celsius or Fahrenheit (unlike Kelvin)
- Ratio –> Have a true zero like distance or time
What is the histogram relative frequency?
Relative frequency –> the percentage of observations that falls in each interval (bin)
When are scatter plots used?
Scatter plots
Used when we are representing two numerical variables
Example:
Horizontal axis –> Reading scores
Vertical axis –> Writing scores
Both axes are numerical
What is an outlier?
A data point that differs markedly from the rest of the dataset
Define mean
Simple average
Denoted with μ for a population
x̄ for sample
Downside: Easily disturbed by an outlier!
Define median
Middle number
Position of the median: (n + 1) / 2
Define mode
Value that occurs most often
When each value appears only once –> We say there is NO mode
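A minimal Python sketch of these three measures on made-up data (the 100 plays the role of an outlier that pulls up the mean):

```python
import statistics

data = [1, 2, 2, 3, 4, 5, 100]  # made-up values; 100 acts as an outlier

print(statistics.mean(data))    # simple average -> pulled up by the outlier
print(statistics.median(data))  # middle value at position (n + 1) / 2 -> 3
print(statistics.mode(data))    # most frequent value -> 2
```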
What is skewness and what does it indicate?
Skewness indicates whether the data is concentrated on one side
Right skew vs left skew
Right skew:
The mean is bigger than the median –> mean > median
The outliers are to the right
Mode –> Highest point in graph
Check video for graph
Left skew:
mean < median
Outliers are to the left
What does variance measure?
Variance measures the dispersion of a set of data points around their mean value
Why square the differences for variance?
We always get non-negative results
Amplifies effect of large differences
Population variance vs sample variance
Population variance: σ² = ∑( (xi - μ)² ) / N
Sample variance: s² = ∑( (xi - x̅)² ) / (n - 1)
Note: x̅ and n - 1 instead of μ and N
Population variance standard deviation vs sample variance standard deviation
Population standard deviation –> σ = SQRT(σ²)
Sample standard deviation –> S = SQRT(S²)
What is the coefficient of variation?
Relative standard deviation: Standard deviation / mean
Population: Cv = σ / μ
Sample: Cv = s / x̄
Why use coefficients of variation?
Standard deviation is the most common measure of variability for a single dataset
Coefficient is much better measure for comparing two datasets
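A minimal sketch of the variance, standard deviation and coefficient of variation formulas above, using Python's statistics module on made-up data:

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up sample

# Population formulas divide by N; sample formulas divide by n - 1
pop_var = statistics.pvariance(data)   # sigma squared
smp_var = statistics.variance(data)    # s squared
pop_sd = statistics.pstdev(data)       # sigma = sqrt(pop_var)
smp_sd = statistics.stdev(data)        # s = sqrt(smp_var)

# Coefficient of variation: standard deviation relative to the mean
cv = smp_sd / statistics.mean(data)

print(pop_var, smp_var, pop_sd, smp_sd, cv)
```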
What is Covariance?
Two-dimensional: it involves two variables
Unlike the formulas for population and sample variance, there is now also a y-component
Otherwise the formula has the same structure for population and sample
Notice the sigma and s are NOT squared in the formula
Cov(x,y) = σ(xy)
Covariance formula?
Population: σ(xy) = ∑( (xi - μx)(yi - μy) ) / N
Sample: s(xy) = ∑( (xi - x̅)(yi - ȳ) ) / (n - 1)
Covariance meaning?
It gives a sense of direction in which the two variables are heading
> 0 means the two variables move together
<0 means the two variables move in opposite directions
=0 means the two variables have no linear relationship (they are uncorrelated)
What does correlation do?
Adjusts covariance, so that the relationship between the two variables becomes easy and intuitive to interpret
Use the sample or population version depending on the data you are working with
How to calculate correlation coefficient?
Cov(x,y) = σ(xy)
Population: σ(xy) / σ(x)σ(y)
Sample: S(xy) / SxSy
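A minimal sketch of the sample covariance and correlation formulas above, computed by hand on made-up paired data:

```python
from math import sqrt

x = [1, 2, 3, 4, 5]   # made-up paired observations
y = [2, 4, 5, 4, 6]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n

# Sample covariance: sum of (xi - x_bar)(yi - y_bar), divided by n - 1
cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (n - 1)

# Sample correlation: cov(x, y) / (s_x * s_y), always between -1 and 1
s_x = sqrt(sum((xi - mean_x) ** 2 for xi in x) / (n - 1))
s_y = sqrt(sum((yi - mean_y) ** 2 for yi in y) / (n - 1))
corr = cov_xy / (s_x * s_y)

print(cov_xy, corr)
```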
How to interpret correlation?
The correlation coefficient is always between -1 and 1
1 –> Entire variability of one variable is explained by the other
Almost 1 –> Strong relationship between the two variables
0 –> No linear relationship (uncorrelated)
Negative correlation –> The variables move in opposite directions
Is the correlation between X and Y the same as the correlation between Y and X?
Yes.
Hence: σ(xy) / σ(x)σ(y)
Where σ(xy) is the same as σ(yx)
What is causality?
Causation indicates that one event is the result of the occurrence of the other event; i.e. there is a causal relationship between the two events. This is also referred to as cause and effect.
It is important to understand the direction of causal relationships
When are correlations disregarded?
It is common practice to disregard correlations below 0.2
How to calculate the Z-score
Z = (Y - μ) / σ
What is the central limit theorem?
In probability theory, the central limit theorem (CLT) establishes that, in many situations, for independent and identically distributed random variables, the sampling distribution of the standardized sample mean tends towards the standard normal distribution even if the original variables themselves are not normally distributed.
When do we speak of a sampling distribution?
A sampling distribution is a probability distribution of a statistic obtained from a larger number of samples drawn from a specific population. The sampling distribution of a given population is the distribution of frequencies of a range of different outcomes that could possibly occur for a statistic of a population.
How to denote the sampling distribution?
Sampling distribution denoted:
~N(μ, σ²/n)
This leads to the insights:
The bigger the sample size the smaller the variance and the more accurate the results are
What does the CLT allow us to do?
Make inferences using the normal distribution, even when the population is not normally distributed
Standard error: Definition and formula
Standard deviation of the distribution formed by the sample means, which is:
√(σ²/n) = σ/√n
Means that:
Error decreases when sample size increases
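A tiny sketch showing the standard error shrinking as the sample size grows (the σ value is made up):

```python
from math import sqrt

sigma = 15_000  # made-up population standard deviation

# Standard error sigma / sqrt(n) shrinks as the sample size grows
for n in (10, 30, 100, 1000):
    print(n, round(sigma / sqrt(n), 1))
```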
Why is the standard error important?
Important because it is used in most statistical tests –> It shows how well you approximated the true mean
What is an estimate?
An approximation based on sample information
Which types of estimates can we distinguish?
Two types of estimates
Point estimates –> Single number
Confidence intervals –> Interval
Relation –> Point estimate is exactly in the middle of the confidence interval
Confidence intervals do provide much more information though
How are x̅ and S² defined as estimates?
The sample mean (x̅) is a point estimate of the population mean μ.
The sample variance (s²) is a point estimate of the population variance (σ²).
Which two properties does an estimate have?
Bias
Efficiency
The goal is always to look for unbiased, efficient estimators
Characteristics of an unbiased estimator?
Expected value = population parameter
x̄ has an expected value of μ
Counter-example (a biased estimator): estimating the average height of Americans by taking a sample and adding a foot to it
x̄ plus 1 ft has expected value μ + 1 ft, not μ, so it is biased
What is the most efficient estimator?
The most efficient estimator is the unbiased estimator with the smallest variance
What is the confidence interval?
Range within which you expect the population parameter to be
How is the confidence level denoted?
Denoted as 1 - α
α is a value between 0 and 1
If the confidence level is 95% then α is 5%
How is the confidence interval denoted?
[ x̅ - Z(α/2) * (σ/√n), x̅ + Z(α/2) * (σ/√n) ]
Case: Calculate the confidence interval (95%) from:
With a x̅ (sample mean) of 100200
And σ = 15000
And n = 30
α is then 0.05 –> Divided by 2 is 0.025
Then you have to look up the Z-score Z(0.025)
In the Z-table you look up the value 1 - 0.025 = 0.975
This sits at row 1.9 and column 0.06
Z(0.025) is therefore 1.9 + 0.06 = 1.96
Substitute the values in the formula:
[94833, 105568]
Interpretation:
We are 95% confident that the average data scientist salary will be in the interval [94833, 105568]
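A minimal sketch reproducing this case in Python; NormalDist().inv_cdf replaces the Z-table lookup:

```python
from math import sqrt
from statistics import NormalDist

x_bar, sigma, n = 100_200, 15_000, 30   # values from the case above
alpha = 0.05

# Z(alpha/2): invert the standard normal CDF instead of reading the Z-table
z = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96

margin = z * sigma / sqrt(n)
print(x_bar - margin, x_bar + margin)     # close to the interval above
```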
How useful are different confidence levels?
100% is useless –> Range is too big
99% –> Same story. Not insightful enough
5% –> Too small to be meaningful
95% is the accepted norm!
Characteristics Student’s T
Small sample size approximation of a Normal Distribution
You use this when there’s not sufficient data for the normal distribution
Graph is also bell shaped but with fatter tails to accommodate the occurrence of values far away from the mean
Another key difference is that apart from mean and variance you must also define degrees of freedom for the distribution
What is the T-statistic
Just as the Z-statistic is related to the normal distribution
The T-statistic is related to the T distribution
How to calculate the T-statistic?
T(n-1),α = (x̅ - µ) / (s / √n)
–> Approximation of the normal distribution
How to find the T-statistic in a T-table?
Hence:
T(n-1), α = (x̅ - µ) / (s / √n)
With a sample of n observations we have n-1 degrees of freedom. So for 20 observations, the degrees of freedom are 19
The T-table:
Vertical axis: degrees of freedom
Horizontal axis: α
Note that after the 30th row the numbers don't differ much from the Z-statistic table
Finding confidence interval for Student’s T distribution for known population variance and unknown population variance?
Unknown variance:
[ x̅ - T(n-1,α/2) * (S/√n), x̅ + T(n-1,α/2) * (S/√n) ]
Known variance:
[ x̅ - Z(α/2) * (σ/√n), x̅ + Z(α/2) * (σ/√n) ]
All we have to do is find the T-statistic in the table
Is the T-statistic related to the Z-statistic?
Yes –> The T-statistic plays the same role for the T distribution as the Z-statistic does for the normal distribution
How will the confidence interval change when we know the population variance?
When we know the population variance we get a narrower confidence interval. When we do not know the population variance there is higher uncertainty.
So: When we don’t know the population variance we can still make predictions though less accurate!
How is Margin of Error defined?
ME = Reliability Factor * (σ/√n)
Meaning:
Higher reliability factor or standard deviation –> Higher margin of error
Bigger margin of error –> Wider confidence interval
Smaller margin of error –> Narrower confidence interval
Higher sample size will decrease the margin of error and vice versa
Margin of Error for known and unknown population variance
Known population variance:
Margin of error –> Z(α/2) * (σ/√n)
Unknown population variance:
Margin of error –> T(n-1,α/2) * (S/√n)
How can you define the confidence intervals with the margin of error?
x̅ ± ME
What happens with a smaller margin of error?
Narrower confidence interval
What is an example of two datasets, with two means, that are dependent samples from each other
Studying a person’s weight loss –> Same person
Habits of husbands and wives –> Coincide with each other
Difference between dependent and independent samples
Dependent:
Besides before-and-after situations, we can also look at cause and effect
Testing with confidence intervals for dependent samples
Use statistical methods like regressions
Independent, can be applied for 3 cases:
When population variance is known
Population variance is unknown but assumed to be equal
Population variance unknown but assumed to be different
How to calculate confidence intervals for dependent samples?
We use đ (the mean difference) instead of x̅
We calculate đ by taking the before/after difference for each pair and then taking the mean of those differences
You can use the T-statistic for applying it to the confidence interval:
[ đ - T(n-1,α/2) * (Sd/√n), đ + T(n-1,α/2) * (Sd/√n) ]
Example of application: 10 patients testing medication leading to before and after results. The differences of these results have a certain mean, which is defined as đ.
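A minimal sketch of a dependent-samples confidence interval on made-up before/after data, assuming scipy is available for the t critical value:

```python
from math import sqrt
from statistics import mean, stdev
from scipy import stats   # assumed available for the t critical value

# Made-up before/after measurements for the same 10 patients
before = [103, 98, 110, 95, 101, 99, 104, 97, 100, 102]
after = [100, 96, 108, 94, 100, 97, 101, 96, 98, 101]

diffs = [a - b for a, b in zip(after, before)]
d_bar = mean(diffs)        # the mean difference (đ)
s_d = stdev(diffs)         # sample standard deviation of the differences
n = len(diffs)
alpha = 0.05

t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)
lower = d_bar - t_crit * s_d / sqrt(n)
upper = d_bar + t_crit * s_d / sqrt(n)
print(d_bar, (lower, upper))
```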
Considerations for using either the Z or T-statistic
Sample size –> Big / Small
Are the population variances known –> Yes / No
Distribution type? –> Normal?
In case of Big sample size, known population variance and normal distribution –> Use the Z statistic
How to calculate the variance of the difference between the means of two INDEPENDENT data sets with variance KNOWN?
σ²(diff) = σ(1)² / n(1) + σ(2)² / n(2)
What is the confidence interval for two INDEPENDENT data sets with variance KNOWN?
(x̅ - ȳ) ± Z(α/2) * √(σ(1)² / n(1) + σ(2)² / n(2))
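A minimal sketch of this interval on made-up summary statistics:

```python
from math import sqrt
from statistics import NormalDist

# Made-up summary statistics for two independent samples with known variances
x_bar, sigma_x, n_x = 121.0, 25.0, 40
y_bar, sigma_y, n_y = 112.0, 20.0, 35
alpha = 0.05

# Standard error of the difference: sqrt(sigma_x^2 / n_x + sigma_y^2 / n_y)
se_diff = sqrt(sigma_x ** 2 / n_x + sigma_y ** 2 / n_y)

z = NormalDist().inv_cdf(1 - alpha / 2)
diff = x_bar - y_bar
print(diff - z * se_diff, diff + z * se_diff)
```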
What is the confidence interval for two INDEPENDENT data sets with variance UNKNOWN but assumed to be equal? And what is an example of a case like this?
In this case you use what is called the Pooled variance formula
S(p)² = ( (n(x) - 1)Sx² + (n(y) - 1)Sy² ) / (n(x) + n(y) - 2)
Calculate the interval using the T-statistic: (x̅ - ȳ) ± T(n(x)+n(y)-2, α/2) * √(S(p)²/n(x) + S(p)²/n(y)) (see the sketch after the next card)
Example: You have 2 datasets but the sample size is not the same.
Explain the usage of the T-statistic for two INDEPENDENT data sets with variance UNKNOWN
The degrees of freedom are equal to the total sample size minus the number of samples
Normally this would be n-1, because you had 1 sample
Because in this case you have 2 samples
Degrees of freedom is then sample size 1 + sample size 2 - 2
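A minimal sketch of the pooled-variance confidence interval from the previous card, on made-up samples of unequal size, assuming scipy is available for the t critical value:

```python
from math import sqrt
from scipy import stats   # assumed available for the t critical value

# Made-up samples of unequal size, variances unknown but assumed equal
x = [10.2, 9.8, 11.1, 10.5, 9.9, 10.7]
y = [9.1, 9.6, 9.3, 9.9, 9.0]

n_x, n_y = len(x), len(y)
x_bar, y_bar = sum(x) / n_x, sum(y) / n_y
s2_x = sum((v - x_bar) ** 2 for v in x) / (n_x - 1)
s2_y = sum((v - y_bar) ** 2 for v in y) / (n_y - 1)

# Pooled variance: ((n_x - 1) * s_x^2 + (n_y - 1) * s_y^2) / (n_x + n_y - 2)
s2_p = ((n_x - 1) * s2_x + (n_y - 1) * s2_y) / (n_x + n_y - 2)

alpha = 0.05
df = n_x + n_y - 2                        # total sample size minus 2
t_crit = stats.t.ppf(1 - alpha / 2, df=df)
se = sqrt(s2_p / n_x + s2_p / n_y)
diff = x_bar - y_bar
print(diff - t_crit * se, diff + t_crit * se)
```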
What is the interpretation of calculating the confidence interval when comparing two datasets?
Interpretation:
We are 95% confident that the difference between the means of set A and set B lies in the interval (a, b)
What are the steps when comparing 2 different groups?
Find out whether the sets are independent or not
Find out whether the population variances are known; if unknown, whether they can be assumed to be equal
If they are assumed to be equal, calculate the pooled variance with the corresponding formula
Calculate the confidence interval for the difference between the means (in the example case, one for every possible shoe size)
Name the two hypotheses types
Null hypothesis –> Denoted with H0 (small 0)
Alternative hypothesis –> Denoted with H1 or Ha
Null hypothesis:
Is like innocent until proven guilty
H0 is true until rejected
The = sign always needs to be in the H0 hypothesis
How is α related to the null hypothesis?
Significance level. Defined as: The probability of rejecting the null hypothesis, if it’s true
Steps for testing a hypothesis?
- Calculate a statistic (like x̅)
- Scale it with Z = (x̅ - µ) / (s / √n)
- Check if Z is in the rejection region. Check whether the test is one- or two-sided –> The value used for α depends on this.
Z is the coordinate of your test value. For α = 0.05, look up the critical values that bound the rejection region (look up the α/2 value in the table and add the row and column headers to get the critical z). Then check whether Z falls inside the rejection region.
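A minimal sketch of these steps for a made-up one-sample, two-sided test:

```python
from math import sqrt
from statistics import NormalDist

# Made-up one-sample, two-sided test of H0: mu = 100
x_bar, mu_0, s, n = 103.5, 100.0, 12.0, 50
alpha = 0.05

z = (x_bar - mu_0) / (s / sqrt(n))            # scale the sample mean
z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value

print(z, z_crit)
print("reject H0" if abs(z) > z_crit else "fail to reject H0")
```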
What is a Type 1 Error and what is a Type 2 Error?
Type I error:
When you reject a true null hypothesis
Also called a false positive
Probability: α
Type II error
Accept a false null hypothesis
False negative
Probability: β –> Depends mainly on the sample size n and the variance σ²
Probability of rejecting a false null hypothesis: 1 - β –> Also called the power of the test
What does the accept/reject quadrant look like
Example: You are in love with a girl, unsure if she likes you back
H0 –> She does not like you back
Fill in the blanks in the quadrants
H0 is true and accept(Do nothing) –> You do nothing and save yourself the embarrassment
H0 is false and accept(Do nothing) –> Missed opportunity
H0 is true and reject(Invite her) –> Embarrassment
H0 is false and reject(Invite her) –> Favourable for all
Describe the P-value
Smallest level of significance at which we can still reject the null hypothesis, given the observed sample statistic
Check whether the p-value falls below the significance level α; if it does, you can reject the hypothesis
What if you can’t find extreme values in Z-table?
Round up to the closest value available
When must the hypothesis be rejected?
When P-value < α
How to find p-value in Z-table?
One-sided: 1 minus the number from the Z-table
Two-sided: (1 minus the number from the Z-table) times 2
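A minimal sketch of the same computation with NormalDist().cdf instead of the Z-table (the test statistic is made up):

```python
from statistics import NormalDist

z = 2.12                                  # made-up test statistic
one_sided = 1 - NormalDist().cdf(z)       # 1 minus the Z-table value
two_sided = 2 * one_sided                 # doubled for a two-sided test
print(one_sided, two_sided)
```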
What statistic to use when population variance unknown
T-statistic
What does D0 stand for?
Hypothesized value difference
Decision rule for accept/reject when using T-score
Accept if: The absolute value of the T-score < critical value t
Reject if: The absolute value of the T-score > critical value t
H0 : D0 >= 0 is the same as writing?
H0: µb - µa >=0
D0 = Hypothesized value difference
Steps for testing hypothesis - 11 steps
Formulating the hypothesis
Calculate sample mean
Standard deviation
Standard error
Determine which statistic to use
Small / Big sample
Assuming which distribution
Variance known / unknown
T score (in this case) is equal to T = (đ-µ0)/standard error
Determine whether you want to fix a level of significance up front; if not, use the p-value approach
In the T-table you can see in which significance range the number lies (e.g. α between 0.025 and 0.01)
Use online software to determine the exact p-value
Decision rule
Accept if: p > α
Reject if: p < α
Then choose the level of significance for the study
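A minimal sketch of these steps for a made-up dependent-samples (paired) test, assuming scipy is available for the exact p-value:

```python
from math import sqrt
from statistics import mean, stdev
from scipy import stats   # assumed available for the exact p-value

# Made-up before/after data for a dependent-samples (paired) test
before = [2.1, 1.9, 2.4, 2.0, 2.3, 2.2, 1.8, 2.5]
after = [1.8, 1.7, 2.2, 1.9, 2.1, 2.0, 1.7, 2.3]

diffs = [a - b for a, b in zip(after, before)]
n = len(diffs)
d_bar = mean(diffs)
se = stdev(diffs) / sqrt(n)           # standard error of the mean difference

t = (d_bar - 0) / se                  # T = (đ - µ0) / standard error, µ0 = 0
p = 2 * stats.t.sf(abs(t), df=n - 1)  # exact two-sided p-value

alpha = 0.05
print(t, p, "reject H0" if p < alpha else "fail to reject H0")
```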
Say these are your hypotheses:
Hypothesis: H0 : µe - µm = -4%
Hypothesis: H1 : µe - µm ≠ -4%
What if you want to know whether it is higher or lower than -4%?
–> The sign of the test statistic can give you that information
Negative sign of statistic means it's smaller than the hypothesized value –> In this case, Z = -2.44, thus the difference can be lower than -4%, like -5% or -6%
Positive sign of statistic means it’s higher than hypothesized value
Independent samples
Case example: On average, management outperforms engineering by 4%
Set up Hypothesis
Hypothesis: H0 : µe - µm = -4%
Hypothesis: H1 : µe - µm ≠ -4%
Look at sample sizes –> Whether they are equal
Determine difference between means
Determine standard error of the difference: √( σe² / ne + σm² / nm )
Determine which statistic to use –> Z statistic
Big samples
Known variances
Find Z-score –> Z statistic formula: ((x̅e - x̅m) - µ0) / standard error (from the previous step)
Notice sometimes there's no µ0, because H0 states that something is smaller/bigger without giving a number –> In that case µ0 is zero
P-value from online software –> 0.015
Interpretation:
At 5% significance we reject the null hypothesis –> 0.015 < 0.05
We say: There is enough statistical evidence that the mean difference is NOT -4%
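A minimal sketch of the same procedure on made-up summary statistics (an absolute hypothesized difference is used here instead of the -4% from the example):

```python
from math import sqrt
from statistics import NormalDist

# Made-up summary statistics; H0: mu_e - mu_m = -4000
x_bar_e, sigma_e, n_e = 58_000.0, 10_000.0, 100
x_bar_m, sigma_m, n_m = 65_000.0, 11_000.0, 120
d_0 = -4_000.0

se = sqrt(sigma_e ** 2 / n_e + sigma_m ** 2 / n_m)
z = ((x_bar_e - x_bar_m) - d_0) / se     # negative sign: difference below d_0

p = 2 * (1 - NormalDist().cdf(abs(z)))   # two-sided p-value
alpha = 0.05
print(z, p, "reject H0" if p < alpha else "fail to reject H0")
```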
What to do with independent samples, variance unknown but assumed to be equal
Use the pooled variance
What to do when the null hypothesis states that the difference between two means is 0, but you still want to know whether there’s a difference at all?
Checking if the T-score is positive or negative
Positive sign of statistic means it’s higher than hypothesized value