Stats Flashcards

Question 1

Q

What is the Central Limit Theorem?

Answer

A

No matter the distribution of the population (as long as its variance is finite), if you take k samples of n observations each, the distribution of the sample means is approximately normal when n is large. That distribution has the same mean as the original data set, and its variance is n times smaller than the variance of the original data set (σ²/n).
Important when estimating a population mean from samples, e.g. finding the average age of all people.
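
A quick simulation can make this concrete. The sketch below is only an illustration: the exponential population, the seed, and the values of k and n are assumptions chosen for the demo, not part of the card.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical, clearly non-normal population: exponential values with mean 1.
population = [random.expovariate(1.0) for _ in range(100_000)]
pop_mean = statistics.fmean(population)
pop_var = statistics.pvariance(population)

k, n = 2_000, 50  # k samples, each with n observations
sample_means = [statistics.fmean(random.sample(population, n)) for _ in range(k)]

mean_of_means = statistics.fmean(sample_means)
var_of_means = statistics.pvariance(sample_means)

print(f"population mean      {pop_mean:.3f}")
print(f"mean of sample means {mean_of_means:.3f}")
print(f"population var / n   {pop_var / n:.4f}")
print(f"var of sample means  {var_of_means:.4f}")
```

The two means should land close together, and the variance of the sample means should be close to the population variance divided by n.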

Question 2

Q

What is sampling?

Answer

A

Obtaining a part of the data population.
A good sample is representative of the entire population.
It should not be a biased sample (e.g. a convenience or voluntary response sample).
An unbiased sample gives each member of the population an equal chance of being chosen.

Question 3

Q

What are some sampling techniques?

Answer

A

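Simple random sampling – every item has the same chance of being chosen (unbiased)
Stratified random sampling – group data with similar features into strata and draw an equal number of random samples from each
Multistage sampling – sample groups first, then sample again within the selected groups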
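
The first two techniques can be sketched in a few lines. This is a minimal illustration, assuming made-up records and hypothetical helper names (`simple_random_sample`, `stratified_sample`); the equal-per-stratum rule follows the card.

```python
import random
from collections import defaultdict

random.seed(1)

# Hypothetical records: (person_id, age_group). The age groups act as strata.
records = [(i, random.choice(["18-29", "30-49", "50+"])) for i in range(1_000)]

def simple_random_sample(items, size):
    """Every item has the same chance of being chosen."""
    return random.sample(items, size)

def stratified_sample(items, key, per_stratum):
    """Group items by key(item), then draw per_stratum random items from each group."""
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)
    sample = []
    for group in strata.values():
        # Guard against strata smaller than the requested size.
        sample.extend(random.sample(group, min(per_stratum, len(group))))
    return sample

srs = simple_random_sample(records, 30)
strat = stratified_sample(records, key=lambda r: r[1], per_stratum=10)
print(len(srs), len(strat))
```

The simple random sample may over- or under-represent a group by chance, while the stratified sample contains the same number of records from every group.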

Question 4

Q

What is a null hypothesis?

Answer

A

A null hypothesis is a type of conjecture used in statistics that proposes that there is no difference between certain characteristics of a population or data-generating process.

The alternative hypothesis proposes that there is a difference.

Hypothesis testing provides a method to reject a null hypothesis within a certain confidence level. (Null hypotheses cannot be proven, though.)

Question 5

Q

What is a type I error?

Answer

A

The null hypothesis is true but is rejected (false positive).

Question 6

Q

What is a type II error?

Answer

A

The null hypothesis is false but we fail to reject it (false negative).

Question 7

Q

What is a p-value?

Answer

A

The p-value is the probability, assuming the null hypothesis is true, of observing results at least as extreme as those in the data. The probability of making a type I error is represented by your alpha level (α), which is the threshold below which a p-value leads you to reject the null hypothesis. An alpha of 0.05 indicates that you are willing to accept a 5% chance of rejecting a null hypothesis that is actually true.

You can reduce your risk of committing a type I error by using a lower value for alpha. For example, an alpha of 0.01 would mean there is a 1% chance of committing a type I error.

However, using a lower value for alpha means that you will be less likely to detect a true difference if one really exists (thus risking a type II error).
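
A small simulation can show how alpha controls the type I error rate. This is a sketch under stated assumptions: a two-sided z-test with known standard deviation, standard normal data, and hypothetical function names.

```python
import random
from statistics import NormalDist, fmean

random.seed(2)

def z_test_p_value(sample, mu0, sigma):
    """Two-sided p-value for H0: mean == mu0, assuming a known population sigma."""
    z = (fmean(sample) - mu0) / (sigma / len(sample) ** 0.5)
    return 2 * (1 - NormalDist().cdf(abs(z)))

def type_i_error_rate(alpha, trials=5_000, n=30):
    """Fraction of experiments that reject H0 even though H0 is true."""
    rejections = 0
    for _ in range(trials):
        sample = [random.gauss(0.0, 1.0) for _ in range(n)]  # H0 really holds here
        if z_test_p_value(sample, mu0=0.0, sigma=1.0) < alpha:
            rejections += 1
    return rejections / trials

rate_05 = type_i_error_rate(0.05)
rate_01 = type_i_error_rate(0.01)
print(f"alpha=0.05 -> false positive rate {rate_05:.3f}")
print(f"alpha=0.01 -> false positive rate {rate_01:.3f}")
```

Each simulated rate should sit close to its alpha, since every rejection here is a false positive by construction.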

Question 8

Q

What is beta (β)?

Answer

A

The probability of making a type II error is called beta (β), and this is related to the power of the statistical test (power = 1 − β). You can decrease your risk of committing a type II error by ensuring your test has enough power.

You can do this by ensuring your sample size is large enough to detect a practical difference when one truly exists.
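
The effect of sample size on power can be estimated by simulation. The sketch below assumes a two-sided z-test with known standard deviation 1, a true mean of 0.5 (so the null hypothesis of mean 0 is false), and hypothetical function names.

```python
import random
from statistics import NormalDist, fmean

random.seed(3)

def rejects_null(sample, mu0, sigma, alpha):
    """True if a two-sided z-test rejects H0: mean == mu0 at level alpha."""
    z = (fmean(sample) - mu0) / (sigma / len(sample) ** 0.5)
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_value < alpha

def estimated_power(n, true_mean=0.5, alpha=0.05, trials=2_000):
    """Share of experiments that detect the real difference (power = 1 - beta)."""
    hits = 0
    for _ in range(trials):
        sample = [random.gauss(true_mean, 1.0) for _ in range(n)]
        if rejects_null(sample, mu0=0.0, sigma=1.0, alpha=alpha):
            hits += 1
    return hits / trials

power_small = estimated_power(10)
power_large = estimated_power(50)
print(f"n=10: power {power_small:.2f}, beta {1 - power_small:.2f}")
print(f"n=50: power {power_large:.2f}, beta {1 - power_large:.2f}")
```

With the larger sample the test detects the same true difference far more often, so beta shrinks.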

Question 9

Q

Why are type I and type II errors important?

Answer

A

A type I error leads to changes or interventions that are unnecessary, and thus waste time, resources, etc.

Type II errors typically lead to the preservation of the status quo (i.e. interventions remain the same) when change is needed.

Question 10

Q

What is bias?

Answer

A

In data science, bias is a systematic deviation from expectation in the data. In a general sense, it refers to a systematic error in the data or in how the data was collected or analyzed.

Question 11

Q

What are the types of bias? Explain each.

Answer

A

Confirmation bias
Data scientists or analysts tend to lean towards data that aligns with their beliefs, views, and opinions.
Selection bias
A subset of the data is systematically (i.e., non-randomly) excluded from the analysis.
Availability bias
Data scientists make inferences based on readily available data or recent information alone.
Survivorship bias
Data sets are distorted by focusing on successful examples and ignoring failures.
Recall bias
An information bias where participants do not accurately recall previous events, memories, or details. This is related to recency bias, where we tend to remember things better when they happened more recently.

Question 12

Q

What is statistical significance?

Answer

A

A determination that a relationship between two or more variables is unlikely to be explained by chance alone.
Measured by the p-value, which is the probability, assuming the null hypothesis is true, of observing results at least as extreme as those in the data.
A p-value of 5% or lower is often considered to be statistically significant.

Question 13

Q

How do you assess the statistical significance?

Answer

A

Do hypothesis testing: state the null hypothesis and the alternative hypothesis.
Calculate the p-value – the probability of observing results at least as extreme as the data, assuming the null hypothesis is true.
Choose the significance level – alpha (α).
If the p-value < alpha, reject the null hypothesis (the result is statistically significant).
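
The steps above can be sketched with a permutation test, which is one of several ways to obtain a p-value and is swapped in here because it needs only the standard library. The two groups are made-up numbers for illustration.

```python
import random
from statistics import fmean

random.seed(4)

# Hypothetical measurements for two groups (invented for this example).
control = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 11.7]
treated = [12.9, 12.6, 13.1, 12.8, 12.5, 13.0, 12.7, 12.4]

# Step 1: H0 = the group means do not differ; H1 = they differ.
observed = abs(fmean(treated) - fmean(control))

# Step 2: p-value = share of random relabellings whose mean difference
# is at least as extreme as the observed one (what we would see if H0 held).
pooled = control + treated
n_perm = 10_000
extreme = 0
for _ in range(n_perm):
    random.shuffle(pooled)
    diff = abs(fmean(pooled[:len(control)]) - fmean(pooled[len(control):]))
    if diff >= observed - 1e-12:  # small tolerance for floating-point ties
        extreme += 1
p_value = (extreme + 1) / (n_perm + 1)

# Step 3: choose the significance level.
alpha = 0.05

# Step 4: compare and decide.
significant = p_value < alpha
print(f"observed difference {observed:.2f}, p-value {p_value:.4f}, significant: {significant}")
```

Because the two invented groups barely overlap, very few relabellings match the observed difference and the null hypothesis is rejected at alpha = 0.05.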

Question 14

Q

What is statistical power?

Answer

A

Statistical power refers to the power of a binary hypothesis test: the probability that the test rejects the null hypothesis given that the alternative hypothesis is true.

Question 15

Q

What is variation?

Answer

A

Gives an idea of how spread out the data is. The simple range of the numbers (max − min) is susceptible to outliers.
Variance is the average squared deviation from the mean.

Standard deviation is just the square root of the variance.
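
A minimal sketch of these three measures, using made-up values and hypothetical helper names. The variance here is the divide-by-N form, matching the "average squared deviation" wording.

```python
import math
import statistics

data = [4, 5, 5, 6, 6, 7]        # made-up values
with_outlier = data + [40]       # same data plus one extreme value

def value_range(values):
    """Spread measured as max - min."""
    return max(values) - min(values)

def variance(values):
    """Average squared deviation from the mean (divide by N)."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / len(values)

range_clean = value_range(data)
range_outlier = value_range(with_outlier)
var = variance(data)
std_dev = math.sqrt(var)

print(f"range without outlier {range_clean}, with outlier {range_outlier}")
print(f"variance {var:.4f}, standard deviation {std_dev:.4f}")
```

A single extreme value changes the range completely, which is why the range alone is a fragile summary of spread.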

Question 16

Q

State the variance types and explain the differences.

Answer


A

$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$ Population variance

$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$ Sample variance (dividing by n − 1 removes the bias)
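
The bias that the n − 1 divisor removes can be seen in a simulation. This is a sketch assuming a made-up normal population and small samples of size 5; `statistics.pvariance` divides by n and `statistics.variance` divides by n − 1.

```python
import random
import statistics

random.seed(5)

# Hypothetical population whose variance we treat as the "truth".
population = [random.gauss(50, 10) for _ in range(50_000)]
true_var = statistics.pvariance(population)

n, trials = 5, 10_000
div_n, div_n_minus_1 = [], []
for _ in range(trials):
    sample = random.sample(population, n)
    div_n.append(statistics.pvariance(sample))         # divides by n
    div_n_minus_1.append(statistics.variance(sample))  # divides by n - 1

avg_div_n = statistics.fmean(div_n)
avg_div_n_minus_1 = statistics.fmean(div_n_minus_1)

print(f"population variance          {true_var:.1f}")
print(f"average estimate, divide n   {avg_div_n:.1f}")
print(f"average estimate, divide n-1 {avg_div_n_minus_1:.1f}")
```

On average the divide-by-n estimate comes out too small by a factor of about (n − 1)/n, while the n − 1 version centers on the population variance.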


(16 cards)