Statistics Flashcards

1
Q

What is the Central Limit Theorem and why is it important?

A

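States that if we repeatedly draw random samples of a sufficiently large size from a population, the distribution of the sample means (the sampling distribution) will be approximately normal (assuming true random sampling). The mean of the sample means tends to the mean of the population, and their variance equals the variance of the population divided by the sample size. This holds regardless of the distribution of the population, provided the population has a finite variance. It is important because it justifies using normal-based methods, such as confidence intervals and hypothesis tests on means, even when the underlying data are not normal.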

https://spin.atomicobject.com/2015/02/12/central-limit-theorem-intro/
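
The statement above can be checked with a small simulation. This is a minimal sketch, not a proof: the exponential population, the sample size of 50, and the number of samples are all assumptions chosen for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

SAMPLE_SIZE = 50     # n: observations in each sample (assumed value)
NUM_SAMPLES = 5000   # how many samples are drawn (assumed value)

# Exponential population with rate 1: mean 1, variance 1, strongly right-skewed.
sample_means = [
    statistics.fmean([random.expovariate(1.0) for _ in range(SAMPLE_SIZE)])
    for _ in range(NUM_SAMPLES)
]

mean_of_means = statistics.fmean(sample_means)
variance_of_means = statistics.variance(sample_means)

print(f"mean of sample means:     {mean_of_means:.3f} (population mean is 1.0)")
print(f"variance of sample means: {variance_of_means:.4f} (1 / n = {1 / SAMPLE_SIZE:.4f})")
```

Even though the population is skewed, the sample means should cluster around 1 with variance close to 1/50.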

2
Q

What is sampling?

A

Data sampling is a statistical analysis technique used to select, manipulate, and analyze a representative subset of data points to identify patterns and trends in the larger data set being examined.

https://searchbusinessanalytics.techtarget.com/definition/data-sampling

3
Q

Why is data sampling important?

A

It enables data scientists and other data professionals to work with a small, manageable amount of data about a statistical population to build and run analytical models more quickly while still producing accurate findings.

4
Q

How is data sampling useful?

A

It is useful for data sets that are too large to analyze efficiently in full.
Identifying and analyzing a representative sample is more efficient and cost-effective than surveying the entirety of the data or population.

Example: in big data analytics applications or surveys.

5
Q

What should be considered when data sampling and why?

A

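The size of the required data sample and the possibility of introducing a sampling error. A sample that is too small or not representative can give estimates that differ from the true values in the population.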

6
Q

What are the different sampling methods?

A
  1. Simple random sampling
  2. Stratified sampling
  3. Cluster sampling
  4. Multistage sampling
  5. Systematic sampling
7
Q

What is simple random sampling?

A

Randomly selecting subjects from the whole population, so that each subject has an equal chance of being selected.
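
A minimal sketch of simple random sampling, assuming a hypothetical population of 100 numbered subjects and a sample of 10 drawn without replacement:

```python
import random

random.seed(1)  # fixed seed so the run is repeatable

population = list(range(1, 101))           # hypothetical subject IDs 1..100
sample = random.sample(population, k=10)   # each subject is equally likely to be picked

print(sample)
```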

8
Q

What is stratified sampling?

A

Subsets (strata) of the data set or population are created based on a common factor, and a sample is drawn from each stratum using a random sampling method. Remember to sample proportionally, so that each stratum's share of the sample matches its share of the population.
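
A minimal sketch of proportional stratified sampling. The function name `stratified_sample`, the region labels, and the 10% sampling fraction are assumptions made for illustration.

```python
import random
from collections import defaultdict

def stratified_sample(records, key, fraction, rng):
    """Draw the same fraction at random from every stratum."""
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)   # group records by the common factor

    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))  # proportional allocation
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical population: 60 records in "north", 40 in "south".
records = [("north", i) for i in range(60)] + [("south", i) for i in range(40)]
sample = stratified_sample(records, key=lambda r: r[0], fraction=0.1, rng=random.Random(2))

north_count = sum(1 for region, _ in sample if region == "north")
south_count = sum(1 for region, _ in sample if region == "south")
print(north_count, south_count)  # 6 4
```

The 60/40 split of the population is preserved in the sample.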

9
Q

What is cluster sampling?

A

A larger dataset is divided into subsets or clusters based on a defined factor, then a random sample of clusters is analyzed. The sampling unit is the whole cluster: instead of sampling individuals from each group, a researcher studies whole clusters.
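
A minimal sketch of cluster sampling, assuming hypothetical clusters (8 schools of 5 students each). Clusters are chosen at random and every member of a chosen cluster is kept.

```python
import random

rng = random.Random(3)  # fixed seed so the run is repeatable

# Hypothetical clusters: 8 schools with 5 students each.
clusters = {
    f"school_{i}": [f"school_{i}_student_{j}" for j in range(5)]
    for i in range(8)
}

chosen = rng.sample(sorted(clusters), k=2)                      # sample whole clusters
sample = [member for name in chosen for member in clusters[name]]

print(chosen)
print(len(sample))  # 10
```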

10
Q

What is multistage sampling?

A

A more complicated form of cluster sampling.

The larger population is first divided into a number of clusters.
Second-stage clusters are then broken out based on a secondary factor, and those clusters are then sampled and analyzed.
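
A minimal two-stage sketch of the idea. The regions-then-schools hierarchy, the function name `multistage_sample`, and all sizes are hypothetical.

```python
import random

def multistage_sample(population, n_first, n_second, rng):
    """Sample first-stage clusters, then second-stage clusters inside each."""
    first_stage = rng.sample(sorted(population), n_first)
    selected = {}
    for region in first_stage:
        schools = population[region]
        second_stage = rng.sample(sorted(schools), n_second)
        selected[region] = {name: schools[name] for name in second_stage}
    return selected

# Hypothetical data: 4 regions, each with 5 schools of 3 students.
population = {
    f"region_{r}": {
        f"school_{r}_{s}": [f"student_{r}_{s}_{i}" for i in range(3)]
        for s in range(5)
    }
    for r in range(4)
}

selected = multistage_sample(population, n_first=2, n_second=2, rng=random.Random(6))
students = [
    student
    for schools in selected.values()
    for group in schools.values()
    for student in group
]
print(len(students))  # 2 regions x 2 schools x 3 students = 12
```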

11
Q

What is systematic sampling?

A

Setting a fixed interval at which to extract data from the larger population.

Example: every 10th row in a dataset.
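
A minimal sketch of systematic sampling over a hypothetical dataset of 100 rows. Picking a random starting row within the first interval is a common convention and is an assumption here.

```python
import random

rows = list(range(1, 101))   # hypothetical dataset of 100 rows
interval = 10                # keep every 10th row

start = random.Random(4).randrange(interval)  # random offset inside the first interval
sample = rows[start::interval]

print(sample)
```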

12
Q

What are the non-probability sampling methods?

A
  1. Convenience sampling
  2. Consecutive sampling
  3. Purposive/judgmental sampling
  4. Quota sampling
13
Q

What is the difference between a Type I and a Type II error?

A

Type I: the null hypothesis is true but is rejected (a false positive)

Type II: the null hypothesis is false but erroneously fails to be rejected (a false negative)
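
Both error types can be estimated by simulation. This is a rough sketch under stated assumptions: a two-sided z-test with known standard deviation 1, alpha of 0.05, 30 observations per experiment, and a true mean of 0.5 in the scenario where the null hypothesis is false.

```python
import random
from statistics import NormalDist, fmean

ALPHA = 0.05
N = 30          # observations per experiment (assumed)
TRIALS = 4000   # simulated experiments per scenario (assumed)
Z_CRIT = NormalDist().inv_cdf(1 - ALPHA / 2)  # two-sided critical value, about 1.96

def rejects_null(sample, mu0=0.0, sigma=1.0):
    """Two-sided z-test of H0: mean == mu0, with known sigma."""
    z = (fmean(sample) - mu0) / (sigma / len(sample) ** 0.5)
    return abs(z) > Z_CRIT

rng = random.Random(5)

# Null hypothesis true (true mean is 0): every rejection is a Type I error.
type_1_rate = sum(
    rejects_null([rng.gauss(0.0, 1.0) for _ in range(N)]) for _ in range(TRIALS)
) / TRIALS

# Null hypothesis false (true mean is 0.5): every non-rejection is a Type II error.
type_2_rate = sum(
    not rejects_null([rng.gauss(0.5, 1.0) for _ in range(N)]) for _ in range(TRIALS)
) / TRIALS

print(f"Type I error rate:  {type_1_rate:.3f} (should be near alpha = {ALPHA})")
print(f"Type II error rate: {type_2_rate:.3f}")
```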

14
Q

What is linear regression?

A

A method that models the relationship between a single dependent variable (Y) and one or more predictors (X) as a linear function.
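
A minimal sketch of simple linear regression with one predictor, fitted by ordinary least squares. The helper name `fit_line` and the data are made up for illustration; the points lie exactly on y = 2x + 1 so the result is easy to verify.

```python
from statistics import fmean

def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    x_bar, y_bar = fmean(xs), fmean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return slope, intercept

xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]   # hypothetical data lying exactly on y = 2x + 1

slope, intercept = fit_line(xs, ys)
print(slope, intercept)  # 2.0 1.0
```

Here the slope of 2 means the mean of Y rises by 2 for each 1-unit increase in X.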

15
Q

What are the assumptions required for linear regression?

A
  1. Linearity: The relationship between X and the mean of Y is linear.
  2. Independence: Observations (and therefore their errors) are independent of each other. Separately, in multiple regression there should be minimal collinearity between explanatory variables.
  3. Normality: The errors or residuals (actual y minus predicted y-hat) are normally distributed.
  4. Homoscedasticity: The variance of the residuals is the same for any value of X.
16
Q

Define p-value

A

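The probability, assuming the null hypothesis is true, of obtaining a result at least as extreme as the one observed. Equivalently, the minimum alpha (significance level) at which the null hypothesis would be rejected; for a regression coefficient, the null hypothesis is that the coefficient is zero.
The lower the p-value, the stronger the evidence that the variable is related to the response/dependent variable (Y). A low p-value does not by itself measure how large or important the effect is.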
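
A minimal sketch of computing a two-sided p-value from a test statistic. It assumes the statistic follows a standard normal distribution under the null hypothesis; regression software usually uses a t-distribution for coefficients, so treat this as a simplification. The function name `two_sided_p_value` is made up.

```python
from statistics import NormalDist

def two_sided_p_value(z):
    """Probability of a statistic at least as extreme as z under a standard normal null."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

p = two_sided_p_value(1.96)
print(round(p, 3))   # 0.05

# The null hypothesis is rejected whenever alpha is larger than p.
print(p < 0.05)      # True
print(p < 0.01)      # False
```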

17
Q

Define coefficient

A

The coefficient value signifies how much the mean of the dependent variable changes given a 1-unit shift in the independent variable while holding other variables in the model constant.

18
Q

Define R-squared

A

A statistical measure that represents the proportion of the variance of a dependent variable that is explained by the independent variable or variables in a regression model.
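
R-squared can be computed as 1 minus the ratio of the residual sum of squares to the total sum of squares. A minimal sketch, with made-up actual and predicted values:

```python
def r_squared(actual, predicted):
    """1 - SS_res / SS_tot: the share of variance in `actual` explained by `predicted`."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))  # unexplained variation
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)           # total variation
    return 1 - ss_res / ss_tot

actual = [3, 5, 7, 9]       # hypothetical observed values
predicted = [2, 5, 7, 10]   # hypothetical model predictions

r2 = r_squared(actual, predicted)
print(round(r2, 2))  # 0.9
```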

19
Q

What is statistical interaction?

A

The effect of one independent variable may depend on the level of the other independent variable.

In other words, the effect of one factor (input/independent variable) on the dependent variable (output variable) differs among levels of another factor.
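
A minimal sketch of an interaction, using a hypothetical model with a product term. The coefficient values are made up; the point is that the effect of x1 changes with the level of x2.

```python
def predict(x1, x2, b0=1.0, b1=2.0, b2=0.5, b3=1.5):
    """Hypothetical model with an interaction term b3 * x1 * x2."""
    return b0 + b1 * x1 + b2 * x2 + b3 * x1 * x2

# Effect of raising x1 from 0 to 1 at two different levels of x2.
effect_when_x2_is_0 = predict(1, 0) - predict(0, 0)
effect_when_x2_is_1 = predict(1, 1) - predict(0, 1)

print(effect_when_x2_is_0)  # 2.0
print(effect_when_x2_is_1)  # 3.5
```

Without the interaction term (b3 = 0), both effects would equal b1.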

20
Q

What is selection bias?

A

Also called sampling bias: data that is gathered and prepared for modeling has characteristics that are not representative of the true, future population of cases the model will see.

Active selection bias occurs when a subset of the data is systematically (non-randomly) excluded from analysis.

21
Q

What is an example of a data set with a non-Gaussian distribution?

A
  1. Weibull distribution, found with life data such as survival times of a product
  2. Log-normal distribution, found with length data such as heights
  3. Largest-extreme-value distribution, found with data such as the longest down-time each day
  4. Exponential distribution, found with growth data such as bacterial growth
  5. Poisson distribution, found with rare events such as number of accidents
  6. Binomial distribution, found with "proportion" data such as percent defectives or the possible numbers of successes on n trials for independent events that each have a probability of p occurring.
22
Q

What are some reasons data may not be normally distributed?

A
  1. Extreme values/ Outliers - It is important that outliers are identified as truly special causes before they are eliminated. Extreme values should only be explained and removed from the data if there are more of them than expected under normal conditions.
  2. Overlap of Two or More Processes - If two or more data sets that would be normally distributed on their own are overlapped, data may look bimodal or multimodal – it will have two or more most-frequent values. The remedial action for these situations is to determine which X’s cause bimodal or multimodal distribution and then stratify the data.
  3. Insufficient Data Discrimination - Round-off errors or measurement devices with poor resolution can make truly continuous and normally distributed data look discrete and not normal. Insufficient data discrimination – and therefore an insufficient number of different values – can be overcome by using more accurate measurement systems or by collecting more data.
  4. Sorted Data - Collected data might not be normally distributed if it represents simply a subset of the total output a process produced. This can happen if data is collected and analyzed after sorting.
  5. Values Close to Zero or a Natural Limit - If a process has many values close to zero or a natural limit, the data distribution will skew to the right or left. In this case, a transformation, such as the Box-Cox power transformation, may help make data normal. In this method, all data is raised, or transformed, to a certain exponent, indicated by a Lambda value. When comparing transformed data, everything under comparison must be transformed in the same way.
  6. Data Follows a Different Distribution
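
The Box-Cox power transformation mentioned in item 5 can be sketched as follows. This uses the standard form, (x^lambda - 1) / lambda for a nonzero lambda and log(x) when lambda is 0, and assumes all values are strictly positive. The function name `box_cox` and the sample values are made up; in practice lambda is usually estimated from the data rather than chosen by hand.

```python
import math

def box_cox(values, lam):
    """Apply the Box-Cox power transformation with exponent `lam` to positive values."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return [math.log(v) for v in values]          # limiting case as lambda -> 0
    return [(v ** lam - 1) / lam for v in values]

skewed = [1, 4, 9]              # hypothetical right-skewed data
print(box_cox(skewed, 0.5))     # [0.0, 2.0, 4.0]
print(box_cox([1, math.e], 0))  # [0.0, 1.0]
```

The same lambda has to be applied to every data set being compared.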