Statistics Flashcards
What is the Central Limit Theorem and why is it important?
States that if we draw sufficiently large random samples from a population, the distribution of the sample means will be approximately normal, regardless of the population's own distribution. The mean of the sample means tends to the population mean, and their variance equals the population variance divided by the sample size.
https://spin.atomicobject.com/2015/02/12/central-limit-theorem-intro/
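A minimal sketch of the theorem in action, using only Python's standard library and a hypothetical skewed (exponential) population: the means of repeated samples center on the population mean, with variance close to the population variance divided by the sample size.

```python
import random
import statistics

random.seed(0)

# A skewed, decidedly non-normal population: exponential with mean 2.0
population = [random.expovariate(1 / 2.0) for _ in range(100_000)]

n = 50  # sample size
# Draw many random samples and record each sample's mean
sample_means = [
    statistics.fmean(random.sample(population, n)) for _ in range(2_000)
]

# The CLT predicts: mean of sample means ~ population mean,
# variance of sample means ~ population variance / n
print(statistics.fmean(sample_means))
print(statistics.pvariance(sample_means))
```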
What is sampling?
Data sampling is a statistical analysis technique used to select, manipulate, and analyze a representative subset of data points to identify patterns and trends in the larger data set being examined.
https://searchbusinessanalytics.techtarget.com/definition/data-sampling
Why is data sampling important?
It enables data scientists and other data professionals to work with a small, manageable amount of data about a statistical population to build and run analytical models more quickly while still producing accurate findings.
How is data sampling useful?
For data sets that are too large to efficiently analyze in full.
Identifying and analyzing a representative sample is more efficient and cost effective than surveying the entirety of the data or population.
Example: in big data analytics applications or surveys.
What should be considered when data sampling and why?
The size of the required data sample and the possibility of introducing a sampling error.
What are the different sampling methods?
- Simple random sampling
- Stratified sampling
- Cluster sampling
- Multistage sampling
- Systematic sampling
What is simple random sampling?
Randomly selecting subjects from the whole population.
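A one-line sketch with a hypothetical population of IDs 1 to 100: every subject has the same chance of being selected.

```python
import random

random.seed(1)
population = list(range(1, 101))          # the whole population
sample = random.sample(population, k=10)  # 10 subjects, each equally likely
print(sample)
```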
What is stratified sampling?
Subsets of the data set or population are created based on a common factor (strata), and a sample is drawn from each stratum using a random sampling method. *Remember to sample proportionally.
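A sketch with a hypothetical population split 70/30 into strata "A" and "B": each stratum is sampled independently, with proportional allocation (10% from each).

```python
import random
from collections import defaultdict

random.seed(2)
# Hypothetical population: (id, stratum) pairs; 70 in "A", 30 in "B"
population = [(i, "A" if i < 70 else "B") for i in range(100)]

# Group the population by stratum
strata = defaultdict(list)
for unit in population:
    strata[unit[1]].append(unit)

# Proportional allocation: draw 10% from each stratum
sample = []
for key, members in strata.items():
    k = max(1, round(0.10 * len(members)))
    sample.extend(random.sample(members, k))

print(len(sample))
```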
What is cluster sampling?
A larger dataset is divided into subsets or clusters based on a defined factor, then a random sample of clusters is analyzed. The sampling unit is the whole cluster: instead of sampling individuals from each group, a researcher studies entire clusters.
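A sketch with hypothetical clusters (say, schools of students): clusters are chosen at random, and every member of a chosen cluster enters the sample.

```python
import random

random.seed(3)
# Hypothetical clusters; the sampling unit is the whole cluster
clusters = {
    "school_1": ["s1", "s2", "s3"],
    "school_2": ["s4", "s5"],
    "school_3": ["s6", "s7", "s8", "s9"],
    "school_4": ["s10"],
}

chosen = random.sample(list(clusters), k=2)  # randomly select 2 clusters
# Take every member of each chosen cluster, not individuals across clusters
sample = [unit for name in chosen for unit in clusters[name]]
print(chosen, sample)
```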
What is multistage sampling?
A more complicated form of cluster sampling: the larger population is divided into a number of clusters; second-stage clusters are then broken out based on a secondary factor, and those clusters are sampled and analyzed.
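A two-stage sketch with a hypothetical districts-to-households design: stage one randomly selects clusters, stage two randomly samples units within each selected cluster.

```python
import random

random.seed(4)
# Hypothetical two-stage design: districts -> households
districts = {
    "d1": ["h1", "h2", "h3", "h4"],
    "d2": ["h5", "h6", "h7"],
    "d3": ["h8", "h9", "h10", "h11", "h12"],
}

# Stage 1: randomly select clusters (districts)
stage1 = random.sample(list(districts), k=2)
# Stage 2: randomly sample units within each selected cluster
sample = [h for d in stage1 for h in random.sample(districts[d], k=2)]
print(stage1, sample)
```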
What is systematic sampling?
Setting an interval at which to extract data from the larger population.
Example: every 10th row in a dataset.
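The every-10th-row example is a single slice in Python, assuming a hypothetical dataset of 100 rows (the start offset could also be randomized within the first interval):

```python
rows = list(range(1, 101))      # a hypothetical dataset of 100 rows
interval = 10
start = 0                       # could be random.randrange(interval)
sample = rows[start::interval]  # every 10th row
print(sample)
```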
What are the non-probability sampling methods?
- Convenience sampling
- Consecutive sampling
- Purposive/judgmental sampling
- Quota sampling
What is the difference between type I vs type II error?
Type I: the null hypothesis is true but is rejected (a false positive)
Type II: the null hypothesis is false but erroneously fails to be rejected (a false negative)
What is linear regression?
Models the relationship between a single dependent variable Y and one or more predictors (X) as a linear function.
What are the assumptions required for linear regression?
- Linearity: The relationship between X and the mean of Y is linear.
- Independence: Observations are independent of each other (minimal collinearity between explanatory variables)
- Normality: The errors or residuals (observed y minus predicted y-hat) are normally distributed.
- Homoscedasticity: The variance of the residuals is the same for any value of X.