Class 6 & 7 - Diagnostic Analysis With Stat Review Flashcards
What is the difference between a Population and a Sample?
Population - a group of phenomena having something in common
Parameter is a characteristic of a population (μ)
Sample - a subset of members of a population selected to represent that population
Statistic is a characteristic of a sample (x̅)
Describe the Areas Under the Normal Distribution Curve (Within SD of Mean)
68% of the observations are within 1 standard deviation of the mean.
95.4% of the observations are within 2 standard deviations of the mean.
99.7% of the observations are within 3 standard deviations of the mean.
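The three percentages above can be checked numerically. This is a minimal sketch using only Python's standard library; the function name share_within is an illustrative choice, not part of the course material.

```python
import math

def share_within(k):
    """Share of a normal distribution within k standard deviations of the mean."""
    # For a normal variable, P(|Z| < k) = erf(k / sqrt(2)).
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"Within {k} SD of the mean: {share_within(k):.1%}")
```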
Write the Hypothesis Testing Formats. What is the set up for One Tailed and Two Tailed for Sales?
Null Hypothesis (H0): there is no significant difference between two populations, or the hypothesized relationship does not exist.
Alternative Hypothesis (HA): a hypothesis that is opposite of the null hypothesis, or a potential result that is expected.
Two-tailed tests:
H0: Saturday sales are equal to Sunday sales. (The difference is equal to zero.)
HA: Saturday sales are not equal to Sunday sales. (The difference is not equal to zero.)
One-tailed tests:
H0: Saturday sales are less than or equal to Sunday sales. (The difference is less than or equal to zero.)
HA: Saturday sales are greater than Sunday sales. (The difference is greater than zero.)
How would you do hypothesis testing with a t-test?
Use the Student’s t-test to examine how “unusual” an outcome is.
- Compute the t-statistic, t = (M - µM) / SM, where SM = S / √N is the standard error of M. M and S are the sample mean and standard deviation, respectively, N is the sample size, and µM is the mean assumed under the null hypothesis.
- Compare the t-statistic against the critical t-value (for right-tail target area):
If t-statistic < critical t-value, do not reject the null hypothesis
If t-statistic >= critical t-value, reject the null hypothesis
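The two steps above can be sketched in Python as follows. The sample data are hypothetical (made-up Saturday-minus-Sunday sales differences), and the function names are illustrative assumptions. The critical value 2.132 is the standard one-tailed table value for α = 5% with 4 degrees of freedom.

```python
import math
import statistics

def t_statistic(sample, mu0):
    """One-sample t-statistic: t = (M - mu0) / (S / sqrt(N))."""
    n = len(sample)
    m = statistics.mean(sample)
    s = statistics.stdev(sample)  # sample standard deviation (divides by N - 1)
    return (m - mu0) / (s / math.sqrt(n))

def reject_null_right_tail(t_stat, t_crit):
    """Right-tailed rule from the card: reject H0 when t >= critical t-value."""
    return t_stat >= t_crit

# Hypothetical data: Saturday minus Sunday sales for five weeks.
diffs = [2.0, 4.0, 6.0, 8.0, 10.0]
t = t_statistic(diffs, mu0=0.0)

# One-tailed critical value for alpha = 5% and df = N - 1 = 4.
print(round(t, 4), reject_null_right_tail(t, 2.132))
```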
In hypothesis testing, what determines a critical t-value, and what are the common critical t-values? How do we find the critical values in Excel?
The critical t-value depends on:
Sample size (N) (Degrees of freedom = N - 1)
Significance level (α) (e.g., 5% or 1%)
One-tailed test (values shown are for 30 degrees of freedom):
α = 5%: 1.697
α = 1%: 2.457
Two-tailed test (values shown are for 30 degrees of freedom):
α = 5%: 2.042
α = 1%: 2.750
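On Excel:
One-Tailed Test (right tail): =T.INV(1 - α, df)
Two-Tailed Test: =T.INV.2T(α, df)
α (alpha) = Significance level (e.g., 0.05 or 5%).
df = Degrees of freedom (N - 1)
Note: T.INV is left-tailed, so =T.INV(α, df) returns the negative of the right-tail critical value.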
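The critical values listed above (1.697, 2.457, 2.042, 2.750) correspond to 30 degrees of freedom, and they can be reproduced outside Excel. The sketch below is one possible approach using only Python's standard library: it integrates the Student's t density numerically and inverts it by bisection. All function names are illustrative assumptions. In practice a statistics library such as SciPy (scipy.stats.t.ppf) would normally be used instead.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_coef = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_coef - (df + 1) / 2 * math.log1p(x * x / df))

def t_cdf(x, df, steps=2000):
    """P(T <= x), using Simpson's rule on [0, |x|] and symmetry around 0."""
    a = abs(x)
    if a == 0:
        return 0.5
    h = a / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if x > 0 else 0.5 - area

def t_critical(alpha, df, tails=1):
    """Positive critical t-value: alpha in the right tail (tails=1)
    or alpha split across both tails (tails=2)."""
    target = 1 - alpha / tails
    lo, hi = 0.0, 1.0
    while t_cdf(hi, df) < target:  # widen the bracket until it contains the answer
        lo, hi = hi, hi * 2
    for _ in range(60):  # bisection
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

for tails in (1, 2):
    for alpha in (0.05, 0.01):
        print(tails, alpha, round(t_critical(alpha, 30, tails), 3))
```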
Explain how to use the p-value to decide whether to reject the null hypothesis in a t-test. Provide an Example:
Step 1: Compare p-value with α (the significance level)
If p-value > α, do not reject the null hypothesis H0
If p-value ≤ α, reject the null hypothesis H0
Example:
Consider H0 (Saturday sales are less than or equal to Sunday sales) vs. HA (Saturday sales are greater than Sunday sales).
Data Given: Sample size N=100, t-statistic = 1.290, p-value = 0.09.
Since p-value = 0.09 > α = 0.05: We would conclude that “the test cannot reject the null hypothesis at the 5% significance level.”
The confidence level is calculated as 1−α. If α = 5%, the confidence level is 95%. We can now state:
The test cannot reject the null hypothesis (Saturday sales are less than or equal to Sunday sales) at the 95% confidence level.
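The decision rule in this card is simple enough to express as a short function. This is a minimal sketch; the function name decide and its return strings are illustrative assumptions.

```python
def decide(p_value, alpha=0.05):
    """Apply the card's rule: reject H0 when p-value <= alpha."""
    if p_value <= alpha:
        return "reject H0"
    return "do not reject H0"

alpha = 0.05
result = decide(0.09, alpha)       # the example from the card
confidence_level = 1 - alpha       # 95% when alpha is 5%
print(f"{result} at the {confidence_level:.0%} confidence level")
```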
Explain Type I Error and Type II Error. Draw the Chart
CHART:
Decision Made | H0 is True | H0 is False
Reject H0 | Type I Error (α) | Correct (1 - β)
Accept H0 | Correct (1 - α) | Type II Error (β)
Type I Error (α): Rejecting a true null hypothesis (false positive)
Example: Saying a drug works when it does not
Type II Error (β): Accepting a false null hypothesis (false negative)
Example: Saying a drug does not work when it actually does
Correct Decisions:
1−α: Correctly accepting a true null hypothesis
1−β: Correctly rejecting a false null hypothesis
Power of the Test: The ability to detect a false null hypothesis (H0)
Power = 1−β
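The four cells of the chart can be written as a small lookup. This is a sketch for study purposes only; the function names classify and power are illustrative assumptions.

```python
def classify(h0_is_true, rejected_h0):
    """Name the outcome for one cell of the decision chart."""
    if h0_is_true and rejected_h0:
        return "Type I error"        # false positive, probability alpha
    if not h0_is_true and not rejected_h0:
        return "Type II error"       # false negative, probability beta
    return "Correct decision"

def power(beta):
    """Power of the test: probability of rejecting a false H0."""
    return 1 - beta

# Drug example from the card: H0 is "the drug does not work".
print(classify(h0_is_true=True, rejected_h0=True))    # claimed it works, but it does not
print(classify(h0_is_true=False, rejected_h0=False))  # claimed it does not work, but it does
```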
When do you use a one-tailed test or two-tailed test?
One-Tailed Test
When to Use: If the hypothesis involves direction (greater than, less than).
Two-Tailed Test
When to Use: If the hypothesis does not specify a direction. You are testing for a difference in either direction (greater or less).
Keywords: “not equal to”, “different from”, “changed”, “effect” (without specifying direction)
What are the key measures for the regression model that make a good-fit? What about statistically significant?
Goodness-of-fit measures for the regression model:
R²
Adjusted R²
F statistics (and significance)
The estimated coefficient on an independent variable is statistically different from zero at the α significance level, if:
|t-statistic| > critical t-value (given α and N),
or p-value < α
Interpret the Regression Outputs: Interpret R-Square (R²):
R-Square (R²):
What it tells you: R² measures the proportion of variance in the dependent variable explained by the independent variables in the model.
Range: Between 0 and 1.
Closer to 1: A better fit (more variance explained).
Closer to 0: Poor fit (less variance explained).
Key Note: R² does not penalize for adding more variables, so it can be artificially high with many predictors.
Sentence Template:
“The R² value is [value], meaning that [value as a percentage]% of the variation in [dependent variable] is explained by the independent variables in the model.”
Example:
“The R² value is 0.6423, meaning 64.23% of the variation in college completion rates is explained by the predictors in the regression model.”
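For reference, R² can be computed directly from actual and predicted values as 1 minus the ratio of unexplained to total variation. The sketch below uses small made-up numbers, and the function name r_squared is an illustrative assumption.

```python
def r_squared(actual, predicted):
    """R² = 1 - SSE / SST."""
    mean_actual = sum(actual) / len(actual)
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))  # unexplained variation
    sst = sum((a - mean_actual) ** 2 for a in actual)           # total variation
    return 1 - sse / sst

# Hypothetical actual values and model predictions.
y = [1.0, 2.0, 3.0, 4.0]
y_hat = [1.1, 1.9, 3.2, 3.8]
r2 = r_squared(y, y_hat)
print(f"The R² value is {r2:.4f}, meaning {r2:.2%} of the variation is explained.")
```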
Interpret the Regression Outputs: Interpret Adjusted R-Square (Adjusted R²):
Adjusted R-Square (Adjusted R²):
What it tells you: Adjusted R² modifies R² to account for the number of predictors. It penalizes for adding unnecessary variables (overfitting).
When to use: Use Adjusted R² instead of R² when comparing models with different numbers of predictors.
Key Note: Adjusted R² is never higher than R². It rises when an added variable improves the fit by more than would be expected by chance, and falls otherwise.
Sentence Template:
“The Adjusted R² is [value], which accounts for the number of predictors in the model. It indicates that [value as a percentage]% of the variance in [dependent variable] is explained after adjusting for the predictors.”
Example:
“The Adjusted R² is 0.6421, meaning 64.21% of the variation in college completion rates is explained by the predictors, adjusted for the number of variables in the model.”
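The penalty for extra predictors comes from the standard formula Adjusted R² = 1 - (1 - R²)(n - 1)/(n - k - 1), where n is the number of observations and k the number of predictors. In the sketch below, n = 100 and k = 3 are hypothetical values chosen only for illustration; they are not taken from the college completion example.

```python
def adjusted_r_squared(r2, n, k):
    """Adjusted R² for n observations and k predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Hypothetical sample size and predictor count.
adj = adjusted_r_squared(0.6423, n=100, k=3)
print(round(adj, 4))

# Same R² with more predictors gives a lower adjusted value.
print(adjusted_r_squared(0.6423, n=100, k=10) < adj)
```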
Interpret the F-Statistic and Significance F (Overall Model Significance):
What it tells you: The F-statistic tests whether the regression model as a whole is significant.
Significance F (p-value): If p-value < α (e.g., 0.05), the model is statistically significant.
Key Note:
A large F-statistic (relative to its critical value) and small p-value suggest that the predictors, taken together, explain a significant share of the variation in the dependent variable.
Sentence Template:
“The F-statistic is [value] with a p-value of [value]. Since the p-value is [less than/greater than] the significance level of 0.05, we conclude that the overall regression model is [significant/not significant].”
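The F-statistic is linked to R² by the standard identity F = (R²/k) / ((1 - R²)/(n - k - 1)). The sketch below computes it and fills in the sentence template; the inputs are hypothetical, the function names are illustrative assumptions, and the p-value is assumed to be read from the regression output rather than computed here.

```python
def f_statistic(r2, n, k):
    """Overall F-statistic for n observations and k predictors."""
    return (r2 / k) / ((1 - r2) / (n - k - 1))

def describe_model(f_stat, p_value, alpha=0.05):
    """Fill in the card's sentence template for the overall model."""
    significant = p_value < alpha
    relation = "less than" if significant else "greater than"
    verdict = "significant" if significant else "not significant"
    return (f"The F-statistic is {f_stat:.2f} with a p-value of {p_value}. "
            f"Since the p-value is {relation} the significance level of {alpha}, "
            f"we conclude that the overall regression model is {verdict}.")

# Hypothetical regression summary values.
f = f_statistic(r2=0.5, n=23, k=2)
print(describe_model(f, p_value=0.001))
```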
Interpret the t-Statistic and p-Value (Variable Significance):
What it tells you: The t-statistic and p-value determine whether individual predictors are significant.
Steps:
Compare the p-value of each predictor to the chosen significance level (α = 0.05).
If p-value < 0.05, the variable is statistically significant.
Sentence Template:
“The t-statistic for [variable name] is [value], with a p-value of [value]. Since the p-value is [less than/greater than] 0.05, [variable name] is [significant/not significant] in explaining the [dependent variable].”
Example:
“The t-statistic for SAT_AVG is 47.74 with a p-value of 1.26E-25. Since the p-value is less than 0.05, SAT_AVG is a significant predictor of college completion rates.”
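The same comparison can be applied to every predictor in a regression output. The sketch below fills in the sentence template; SAT_AVG and its values come from the example above, while HYPOTHETICAL_VAR and its numbers are made up for illustration, as is the function name.

```python
def describe_predictor(name, t_stat, p_value, dependent, alpha=0.05):
    """Fill in the card's sentence template for one predictor."""
    significant = p_value < alpha
    relation = "less than" if significant else "greater than"
    verdict = "significant" if significant else "not significant"
    return (f"The t-statistic for {name} is {t_stat}, with a p-value of {p_value}. "
            f"Since the p-value is {relation} {alpha}, {name} is {verdict} "
            f"in explaining {dependent}.")

predictors = {
    "SAT_AVG": (47.74, 1.26e-25),          # from the example above
    "HYPOTHETICAL_VAR": (1.02, 0.31),      # made-up values for contrast
}
for name, (t_stat, p_value) in predictors.items():
    print(describe_predictor(name, t_stat, p_value, "college completion rates"))
```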
What is the definition of Diagnostic Analysis?
Descriptive analytics answers the question, “What happened?”
Diagnostic analytics takes it a step further by asking “Why did it happen?” and “What are the reasons for past results?”
Diagnostic analytics are performed to investigate the underlying reasons for past results that cannot be answered by simply looking at the descriptive evidence.
What are the two broad types of Diagnostic Analysis?
- Identifying Anomalies and Outliers
Look for unusual, unexpected results or transactions.
Find out what might have occurred and why they occurred.
Frauds, errors, or just extreme observations?
- Finding Patterns or Relationships among Variables
Performing drill-down analytics
Look for patterns in the data by examining correlations and summarizing data at different levels to understand why something happened.
Performing statistical analyses
Uncover patterns in the data or how data moves together.
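Both types of diagnostic analysis can be illustrated with simple statistics. The sketch below flags outliers by z-score and measures how two variables move together with a Pearson correlation. The data, the 2.5 cutoff, and the function names are all hypothetical choices for illustration; the course may use different tools or thresholds.

```python
import math
import statistics

def find_outliers(values, threshold=2.5):
    """Return values whose z-score exceeds the threshold in absolute value."""
    m = statistics.mean(values)
    s = statistics.stdev(values)
    return [v for v in values if abs((v - m) / s) > threshold]

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical transaction amounts with one unusual value.
amounts = [100, 102, 98, 101, 99, 100, 103, 97, 100, 250]
print(find_outliers(amounts))

# Hypothetical paired variables that move together exactly.
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))
```

A flagged value is only a starting point: as the card notes, it still has to be investigated to decide whether it is fraud, an error, or just an extreme observation.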