Stats Flashcards
What is the Central Limit Theorem?
For a population with finite variance, whatever the shape of its distribution, if you draw k samples of n observations each, the distribution of the k sample means is approximately normal once n is reasonably large. That distribution has the same mean as the population, and its variance is n times smaller than the population variance (σ²/n).
Important when: estimating a population quantity from samples, e.g. the average age of all people.
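A quick way to convince yourself of the theorem is to simulate it. The sketch below uses only the standard library and a synthetic, deliberately skewed (exponential) population; the population, the sizes n and k, and the helper name `sample_means` are all assumptions made for illustration.

```python
import random
import statistics

def sample_means(population, n, k, seed=0):
    """Draw k samples of size n (with replacement) and return their means."""
    rng = random.Random(seed)
    return [statistics.fmean(rng.choices(population, k=n)) for _ in range(k)]

# Synthetic, strongly right-skewed population (not real data).
rng = random.Random(42)
population = [rng.expovariate(1.0) for _ in range(20_000)]

n, k = 50, 2_000
means = sample_means(population, n, k)

pop_mean = statistics.fmean(population)
pop_var = statistics.pvariance(population)

# The two means should be close, and so should the two variances.
print(f"population mean      {pop_mean:.3f}")
print(f"mean of sample means {statistics.fmean(means):.3f}")
print(f"population var / n   {pop_var / n:.4f}")
print(f"var of sample means  {statistics.variance(means):.4f}")
```

Plotting a histogram of `means` should show a roughly bell-shaped curve even though the population itself is far from normal.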
What is sampling?
- Select a subset of the population.
- A good sample is representative of the entire population.
- It should not be a biased sample (e.g. a convenience or voluntary-response sample).
- An unbiased sample gives each member of the population an equal chance of being chosen.
What are some sampling techniques?
- Simple random sampling - unbiased
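- Stratified random sampling - group the data by a shared feature (strata) and draw a random sample from each group, either equal-sized or proportional to the group's size
- Multistage sampling - sample in stages, e.g. randomly pick groups first, then randomly pick units within the chosen groups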
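The first two techniques can be sketched with the standard library as follows. The records, the `age_group` labels, and both helper names are hypothetical; the stratified version draws an equal number from each group, which is only one of the possible allocation rules.

```python
import random
from collections import defaultdict

def simple_random_sample(items, size, rng):
    """Every item has the same chance of being chosen (no replacement)."""
    return rng.sample(items, size)

def stratified_sample(items, key, per_stratum, rng):
    """Group items by key(item), then draw per_stratum items from each group."""
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)
    sample = []
    for group in strata.values():
        # Never ask for more items than the group contains.
        sample.extend(rng.sample(group, min(per_stratum, len(group))))
    return sample

rng = random.Random(0)
# Hypothetical records: (person_id, age_group); 75 "under_30", 25 "30_plus".
people = [(i, "under_30" if i % 4 else "30_plus") for i in range(100)]

srs = simple_random_sample(people, 10, rng)
strat = stratified_sample(people, key=lambda p: p[1], per_stratum=5, rng=rng)
```

With simple random sampling the smaller group may be under-represented by chance; the stratified sample always contains five people from each group.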
What is a null hypothesis?
A null hypothesis is a type of conjecture used in statistics that proposes that there is no difference between certain characteristics of a population or data-generating process.
The alternative hypothesis proposes that there is a difference.
Hypothesis testing provides a method to reject a null hypothesis at a chosen significance level. (A null hypothesis can be rejected or not rejected, but never proven.)
What is a type I error?
The null hypothesis is true but is rejected (false positive).
What is a type II error?
The null hypothesis is false but is not rejected (false negative).
What is a p-value?
The p-value is the probability of observing results at least as extreme as those in the data, assuming the null hypothesis is true. The probability of making a type I error is set by your alpha level (α), the threshold below which a p-value leads you to reject the null hypothesis. An alpha of 0.05 means you accept a 5% chance of rejecting the null hypothesis when it is actually true.
You can reduce your risk of committing a type I error by using a lower alpha. For example, an alpha of 0.01 means there is a 1% chance of committing a type I error when the null hypothesis is true.
However, using a lower value for alpha means that you will be less likely to detect a true difference if one really exists (thus risking a type II error).
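To see that alpha really is the type I error rate, you can simulate many experiments in which the null hypothesis is true and count how often it gets rejected. This is a minimal sketch, assuming normally distributed data with a known standard deviation and a two-sided z-test; the function names and the trial counts are illustrative choices.

```python
import random
from statistics import NormalDist, fmean

def z_test_p_value(sample, mu0, sigma):
    """Two-sided p-value for H0: mean == mu0, with known population sigma."""
    z = (fmean(sample) - mu0) / (sigma / len(sample) ** 0.5)
    return 2 * (1 - NormalDist().cdf(abs(z)))

def false_positive_rate(alpha, trials=5_000, n=30, seed=1):
    """Fraction of experiments that reject H0 even though H0 is true."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        # H0 is true here: the data really do have mean 0.
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if z_test_p_value(sample, mu0=0.0, sigma=1.0) < alpha:
            rejections += 1
    return rejections / trials

rate_05 = false_positive_rate(0.05)
rate_01 = false_positive_rate(0.01)
print(f"alpha=0.05 -> rejected a true H0 in {rate_05:.1%} of trials")
print(f"alpha=0.01 -> rejected a true H0 in {rate_01:.1%} of trials")
```

The observed rejection rates should land near 5% and 1% respectively, which is why lowering alpha lowers the risk of a false positive.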
What is beta (β)?
The probability of making a type II error is called Beta (β), and this is related to the power of the statistical test (power = 1- β). You can decrease your risk of committing a type II error by ensuring your test has enough power.
You can do this by ensuring your sample size is large enough to detect a practical difference when one truly exists.
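The link between sample size, power, and beta can be made concrete for one simple case. The sketch below assumes a one-sided z-test with known standard deviation, where the true mean exceeds the null value by `effect`; the function name and the example numbers are assumptions for illustration, not a general power calculator.

```python
from statistics import NormalDist

def z_test_power(effect, sigma, n, alpha=0.05):
    """Power of a one-sided z-test to detect a true mean shift of `effect`."""
    std_normal = NormalDist()
    # Reject H0 when the z statistic exceeds this critical value.
    critical = std_normal.inv_cdf(1 - alpha)
    # Under the alternative, the z statistic is centred on this shift.
    shift = effect * n ** 0.5 / sigma
    return 1 - std_normal.cdf(critical - shift)

for n in (10, 30, 100):
    power = z_test_power(effect=0.5, sigma=1.0, n=n)
    print(f"n={n:>3}  power={power:.2f}  beta={1 - power:.2f}")
```

Power rises (and beta falls) as n grows, because a larger sample shrinks the standard error and makes the same effect easier to detect.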
Why are type I and type II errors important?
A type I error leads to changes or interventions that are unnecessary, and thus wastes time, resources, etc.
Type II errors typically lead to the preservation of the status quo (i.e. interventions remain the same) when change is needed.
What is bias?
In data science, bias is a systematic deviation of the data from what is expected. In a general sense, it refers to a non-random error in the data or in how the data was collected.
What are the types of bias? Explain each.
- Confirmation bias: data scientists or analysts tend to favor data that aligns with their beliefs, views, and opinions.
- Selection bias: arises when a subset of the data is systematically (i.e., non-randomly) excluded from the analysis.
- Availability bias: data scientists make inferences based on readily available data or recent information alone.
- Survivorship bias: data sets are distorted by focusing on successful examples and ignoring failures.
- Recall bias: a form of information bias where participants do not accurately recall previous events, memories, or details. It is related to recency bias, where we tend to remember more recent events better.
What is statistical significance?
- A determination that a relationship between two or more variables is unlikely to be explained by chance alone.
- Measured by the p-value: the probability of observing results at least as extreme as those in the data, assuming the null hypothesis is true.
- A p-value of 5% or lower is often considered statistically significant.
How do you assess statistical significance?
- Set up a hypothesis test: state the null hypothesis and the alternative hypothesis.
- Choose the significance level (alpha), ideally before looking at the data.
- Calculate the p-value: the probability of results at least as extreme as those observed, assuming the null hypothesis is true.
- If the p-value < alpha, reject the null hypothesis (the result is statistically significant).
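The steps above can be sketched end to end with a permutation test, one of several ways to obtain a p-value. Here the null hypothesis is that both groups come from the same distribution, and the alternative is that their means differ. The measurements are made up for illustration, and `permutation_p_value` is a hypothetical helper, not a library function.

```python
import random
from statistics import fmean

def permutation_p_value(a, b, rounds=10_000, seed=0):
    """Two-sided p-value for H0: both groups come from the same distribution."""
    rng = random.Random(seed)
    observed = abs(fmean(a) - fmean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(rounds):
        # Under H0 the group labels are arbitrary, so reshuffle them.
        rng.shuffle(pooled)
        diff = abs(fmean(pooled[:len(a)]) - fmean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (rounds + 1)

# Made-up measurements for two groups (illustrative only).
control = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 11.7]
treated = [12.9, 13.1, 12.7, 13.4, 12.8, 13.0, 13.3, 12.6]

alpha = 0.05
p_value = permutation_p_value(control, treated)
significant = p_value < alpha
print(f"p-value={p_value:.4f}  reject H0: {significant}")
```

Because the two made-up groups do not overlap at all, almost no reshuffle produces a difference as large as the observed one, so the p-value falls well below alpha and the null hypothesis is rejected.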
What is statistical power?
Statistical power refers to the power of a binary hypothesis test: the probability that the test rejects the null hypothesis given that the alternative hypothesis is true.
What is variation?
Variation gives an idea of how spread out the data is. The simple range (maximum minus minimum) is one measure, but it is susceptible to outliers.
Variance is the average squared deviation from the mean.
Standard deviation is the square root of the variance.
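A small worked example, using made-up numbers, shows the three measures side by side. It computes the population variance (dividing by N, matching the "average squared deviation" definition above); the sample variance, which divides by N - 1, would be slightly larger.

```python
from statistics import fmean, pstdev, pvariance

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up values

mean = fmean(data)                                   # 5.0
manual_variance = sum((x - mean) ** 2 for x in data) / len(data)
manual_std = manual_variance ** 0.5
data_range = max(data) - min(data)

print(mean, manual_variance, manual_std, data_range)  # 5.0 4.0 2.0 7

# The standard library gives the same population figures.
assert pvariance(data) == manual_variance
assert pstdev(data) == manual_std

# A single outlier stretches the range from 7 to 98.
with_outlier = data + [100]
outlier_range = max(with_outlier) - min(with_outlier)
```

The standard deviation is often easier to interpret than the variance because it is in the same units as the data.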