Statistics 2 Flashcards
Write out the conditions and assumptions for: A statistician found that, in a random sample of 45 tubes of glue produced by one manufacturer, the mean drying time was x̄₁ = 185 minutes, with a standard deviation of 20 minutes. In a random sample of 40 tubes of glue produced by another manufacturer, the mean drying time was x̄₂ = 201 minutes, with a standard deviation of 57 minutes. Do these data allow us to conclude, at the 10% significance level, that the mean drying times of the two kinds of glue differ?
- independent random samples
- unknown variances as only sample variances are known
- quantitative data
- samples are large enough as n1>30 and n2>30, so normal approximation applies
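As a rough sketch of how the test itself could be carried out once these conditions are checked, the summary statistics from the question can be plugged into an unequal-variances two-sample statistic. Using the standard normal as the reference distribution is an assumption here, based on the large-sample condition above; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

# Summary statistics taken from the question
n1, mean1, sd1 = 45, 185.0, 20.0
n2, mean2, sd2 = 40, 201.0, 57.0

# Standard error of the difference in sample means (variances not pooled)
std_error = sqrt(sd1**2 / n1 + sd2**2 / n2)

# Test statistic for H0: mu1 - mu2 = 0 against H1: mu1 - mu2 != 0
z = (mean1 - mean2) / std_error

# Two-sided p-value under the normal approximation (an assumption, see above)
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

alpha = 0.10
print(f"z = {z:.3f}, p-value = {p_value:.4f}, reject H0: {p_value < alpha}")
```

The p-value is close to the 10% threshold, so a t reference distribution could give a slightly different picture than the normal one used in this sketch.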
Factors that identify the Wilcoxon rank sum test:
- the objective of the problem is to compare 2 populations
- data type: ordinal or interval but non-normal
- two random, independent samples
Why does the Wilcoxon rank sum test call for non-normal data?
Because if the populations were normally distributed, we could instead perform the equal-variances t-test of μ₁ − μ₂.
What distinguishes nonparametric techniques from parametric ones in this context?
Nonparametric techniques test differences in population locations rather than means. They're used for ordinal data and as alternatives when populations aren't normally distributed, unlike parametric tests, which require normally distributed populations.
When might nonparametric techniques be applied to interval data?
Nonparametric techniques can be used with interval data when the assumption of normality isn't met. In such cases, even though the data are interval and the mean is appropriate, these techniques test population locations instead.
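In the Wilcoxon rank sum test, how does the ranking procedure work?
We must remember to rank all the observations from both samples together: pool the two samples, rank from smallest (rank 1) to largest, and give tied observations the average of the ranks they share. T is then the sum of the ranks of the observations from sample 1.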
What does a small value of T indicate when comparing observations in sample 1 and sample 2?
A small value of T suggests that most smaller observations belong to sample 1, while larger observations are predominantly in sample 2, implying that the location of population 1 is to the left of population 2.
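The ranking procedure and the statistic T can be sketched in a few lines. This is a minimal illustration that assumes T is the sum of the ranks of sample 1 and that tied observations share their average rank; the function name and the data are made up for the example.

```python
def rank_sum_T(sample1, sample2):
    """Sum of the ranks of sample1 after ranking both samples together."""
    pooled = sorted(sample1 + sample2)

    # Map each distinct value to its (average) rank in the pooled ordering
    rank_of = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        # Positions i..j-1 (0-based) hold ranks i+1..j; their average is (i+1+j)/2
        rank_of[pooled[i]] = (i + 1 + j) / 2
        i = j

    return sum(rank_of[value] for value in sample1)


# Hypothetical data: three observations per sample
sample1 = [22, 23, 20]
sample2 = [18, 27, 26]
T = rank_sum_T(sample1, sample2)
print(T)
```

A T that is small relative to what is expected under identical locations points to sample 1 holding the smaller observations, as described above.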
In which situations is the sign test employed?
- the problem objective is to compare two populations
- the data are ordinal
- the experimental design is matched pairs
In what scenario is the sign test applicable when dealing with interval data?
While the sign test can be used with interval data, it results in a loss of potentially valuable information, since the size of the differences is meaningful in interval data.
Why might knowing the actual difference in values be more informative than just knowing the direction of difference in certain cases?
In scenarios involving interval data, understanding the actual difference provides more meaningful information than solely knowing which value is greater or lesser.
What statistical method is recommended for interval data that aren't normally distributed?
For interval data that don't follow a normal distribution, the Wilcoxon Signed Rank Sum Test is recommended. This test considers both the sign and the magnitude of differences between values.
Which method remains valid when dealing with ordinal data?
Utilizing the sign of the differences remains the only valid method when handling ordinal data.
How is the difference between "excellent" and "good" represented in a 4-3-2-1 rating system?
In a 4-3-2-1 rating system, the difference between "excellent" and "good" is denoted as +1.
What adjustments are necessary in the Wilcoxon Rank Sum Test when switching the roles between Population 1 and Population 2, particularly concerning sample sizes (n1 and n2)?
When switching the roles of Population 1 and Population 2, the sample sizes must be swapped as well: the old n1 becomes the new n2, and the old n2 becomes the new n1. Every calculation that involves the sample sizes (the test statistic, the critical values, and the rejection region) must then use the relabeled values so that the analysis still reflects the intended comparison between the populations.
Why are the ratings provided for four-star and five-star hotels considered as independent samples rather than matched pairs?
The ratings for four-star and five-star hotels are treated as independent samples because they represent different entities (different hotels) and were provided by different individuals. Each rating corresponds to a different category of hotels, and there is no direct pairing between the ratings of a specific four-star hotel and a specific five-star hotel for the same respondents.
What adjustments are made when calculating probabilities related to a binomial distribution?
When a binomial probability is approximated with the normal distribution, a continuity correction of 0.5 is applied. For a probability of at most k successes, P(X ≤ k), 0.5 is added to k. For a probability of at least k successes, P(X ≥ k), 0.5 is subtracted from k.
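A small numerical check can make the 0.5 adjustment concrete. The sketch below assumes the adjustment is the continuity correction used when a binomial probability is approximated by a normal one; the values of n, p and k are hypothetical.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p, k = 200, 0.40, 70  # hypothetical values for illustration


def binom_pmf(x):
    """Exact binomial probability P(X = x)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


# Exact probabilities, summed directly from the binomial distribution
exact_le = sum(binom_pmf(x) for x in range(k + 1))   # P(X <= k)
exact_ge = 1 - sum(binom_pmf(x) for x in range(k))   # P(X >= k)

# Normal distribution with the same mean and standard deviation
normal = NormalDist(mu=n * p, sigma=sqrt(n * p * (1 - p)))

# "Less than or equal to k": add 0.5 to k
approx_le = normal.cdf(k + 0.5)

# "Greater than or equal to k": subtract 0.5 from k
approx_ge = 1 - normal.cdf(k - 0.5)

print(f"P(X <= {k}): exact {exact_le:.4f}, approx {approx_le:.4f}")
print(f"P(X >= {k}): exact {exact_ge:.4f}, approx {approx_ge:.4f}")
```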
What are the key conditions necessary for conducting a Sign Test?
The key conditions for a Sign Test are:
- The populations being compared are similar in shape and spread.
- If the normal approximation is used, the number of matched pairs (nonzero differences) must exceed 10.
Conditions and assumptions for a sign test:
* The agencies rate the same 20 investments: random sample of matched pairs
* ESG scale is ordinal data, so we can ONLY perform a sign test
* Since n = 20 > 10, we may use the normal approximation here
Hypotheses for sign test (one used for matched pairs of ordinal data)
H0: The two population locations are the same (no difference in preference) ⟹ p+ = 0.50
H1: The location of pop. 1 (monitor 1) is to the right of the location of pop. 2 (monitor 2) ⟹ p+ > 0.50
For the sign test, what does p+ refer to?
The probability that a difference is positive (favoring monitor 1). Under the null hypothesis of no difference, p+ = 0.50.
Sample sizes _________ are often considered sufficient for the CLT to hold
equal to or greater than 30
For the Wilcoxon signed rank sum test you rank the ____
absolute values of the nonzero differences; then, when computing T+, you add up only the ranks of the positive differences
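The ranking step can be sketched as follows. This is only an illustration under stated assumptions: zero differences are dropped, tied absolute differences share their average rank, and the statistic is the sum of the ranks of the positive differences (often written T+). The function name and the data are hypothetical.

```python
def signed_rank_T_plus(differences):
    """Sum of the ranks of the positive differences (ranks based on |d|)."""
    nonzero = [d for d in differences if d != 0]
    abs_sorted = sorted(abs(d) for d in nonzero)

    # Map each distinct absolute difference to its (average) rank
    rank_of = {}
    i = 0
    while i < len(abs_sorted):
        j = i
        while j < len(abs_sorted) and abs_sorted[j] == abs_sorted[i]:
            j += 1
        # Positions i..j-1 (0-based) hold ranks i+1..j; their average is (i+1+j)/2
        rank_of[abs_sorted[i]] = (i + 1 + j) / 2
        i = j

    return sum(rank_of[abs(d)] for d in nonzero if d > 0)


# Hypothetical matched-pair differences
differences = [3, -1, 4, -2, 5]
T_plus = signed_rank_T_plus(differences)
print(T_plus)
```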
When doing an exact sign test, our rejection region is just the: _______
significance level; we compare the p-value to the significance level, and if the p-value is less than the significance level, we reject H0
What does x+ represent in the context of a sign test?
x+ denotes the number of positive differences, that is, the count of matched pairs in which the first observation exceeds the second in the given characteristic or condition.
How is the p-value calculated in statistical tests like the sign test?
The p-value signifies the probability of observing a result as extreme as, or more extreme than, the observed result, assuming that the null hypothesis is true.
Given x+ =15 and a right sided test, how is the p-value computed in a sign test?
The p-value is P(X+ ≥ 15) = 1 − P(X+ ≤ 14), where under the null hypothesis X+ follows a binomial distribution with n equal to the number of nonzero differences and p = 0.5 (or its normal approximation when n is large enough).
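For an exact calculation, the upper-tail probability can be summed directly from the binomial distribution with success probability 0.5. The card does not state the number of pairs, so n = 20 below is purely an assumption for illustration, and the helper name is made up.

```python
from math import comb

n = 20        # assumed number of nonzero differences (not given in the card)
x_plus = 15   # observed number of positive differences


def binom_cdf_half(k, n):
    """P(X+ <= k) when X+ ~ Binomial(n, 0.5)."""
    return sum(comb(n, i) for i in range(k + 1)) / 2**n


# Right-sided test: p-value = P(X+ >= 15) = 1 - P(X+ <= 14)
p_value = 1 - binom_cdf_half(x_plus - 1, n)
print(f"p-value = {p_value:.4f}")  # about 0.0207 for these inputs
```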
What does the p-value represent in hypothesis testing?
The p-value represents the probability of obtaining results as extreme as the observed results under the assumption that the null hypothesis is true.
What is the significance level (α) in hypothesis testing?
The significance level (α) is the threshold used to determine the significance of the results. It's the predetermined acceptable probability of incorrectly rejecting the null hypothesis.
What is the decision rule based on the p-value and significance level?
If the p-value is less than or equal to the significance level (p ≤ α), we reject the null hypothesis. If the p-value is greater than the significance level (p > α), we fail to reject the null hypothesis.
Why is the choice of significance level important?
The significance level determines the tolerance for Type I error (incorrectly rejecting the null hypothesis). A lower significance level requires stronger evidence to reject the null hypothesis but might increase the chances of Type II error (incorrectly failing to reject the null hypothesis).
Alternative hypothesis for the Chi-squared Goodness of Fit Test:
H1: At least one pᵢ is not equal to its specified value
What is the goodness of fit test used to determine?
It is used to determine whether sample data are consistent with a hypothesized distribution. Put simply, it is used for categorical data when you want to see whether your observations fit a theoretical expectation.
What are the essential assumptions for conducting non-parametric goodness of fit analysis, and what criteria must be met regarding sample randomness, chi-squared validity, expected values, and category thresholds in this context?
The assumptions crucial for non-parametric goodness of fit analysis encompass several criteria:
- Random and Independent Samples: The analysis requires the samples to be both random and independent.
- Chi-Squared Validity: The test statistic is only approximately chi-squared distributed, and the approximation is valid only when the expected values are large enough (see the next two criteria).
- Expected Values: No expected value should be less than 1, to maintain accuracy in the analysis.
- Threshold for Expected Values in Categories: It's recommended to have no more than 20% of categories with expected values less than 5.
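The expected-value criteria and the test statistic can be checked with a short script. The counts and hypothesized proportions below are hypothetical, and the 5% critical value for 2 degrees of freedom (5.991) is taken from a standard chi-squared table; this is a sketch, not a full testing routine.

```python
# Hypothetical observed counts and hypothesized category probabilities
observed = [50, 30, 20]
hypothesized_p = [0.4, 0.4, 0.2]

n = sum(observed)
expected = [n * p for p in hypothesized_p]

# Rule-of-thumb checks on the expected values
all_at_least_one = min(expected) >= 1
share_below_five = sum(e < 5 for e in expected) / len(expected)
rules_ok = all_at_least_one and share_below_five <= 0.20

# Chi-squared statistic: sum of (observed - expected)^2 / expected
chi2_stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1

# 5% critical value for df = 2, from a chi-squared table
critical_value = 5.991
reject_h0 = chi2_stat > critical_value

print(f"rules ok: {rules_ok}, chi2 = {chi2_stat:.3f}, df = {df}, reject H0: {reject_h0}")
```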
What are contingency tables used for?
Contingency tables, also known as two-way tables, are used when analyzing categorical data involving more than one variable.
How are expected values calculated in the Chi Square Test for Dependence:
The expected values are calculated as follows:
* Find the sum of each row and of each column
* Find the total sum of all cells
* For each cell, multiply the row sum by the column sum and divide by the total sum of all cells:
Expected value = (Row sum × Column sum) / Total sum
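The row-sum times column-sum rule can be sketched on a small table. The 2×3 table of counts below is hypothetical, and the chi-squared statistic at the end is included only to show where the expected values are used.

```python
# Hypothetical 2x3 contingency table of observed counts
observed = [
    [20, 30, 50],
    [30, 20, 50],
]

row_sums = [sum(row) for row in observed]
col_sums = [sum(col) for col in zip(*observed)]
total = sum(row_sums)

# Expected value for each cell = (row sum * column sum) / total sum
expected = [[r * c / total for c in col_sums] for r in row_sums]

# Chi-squared statistic built from the observed and expected values
chi2_stat = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# Degrees of freedom: (rows - 1) * (columns - 1)
df = (len(row_sums) - 1) * (len(col_sums) - 1)

print(expected, chi2_stat, df)
```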
Z-proportion test deals with _____ data
nominal
Nominal data is data that can only be ______
categorized
Conditions and assumptions for Z test (proportions)
- Random sample
- nominal data
- n ⋅ p = 200 ⋅ 0.40 = 80 ≥ 5 and n(1 − p) = 120 ≥ 5
Hypotheses for Chi Square for Dependence:
H0: the two classifications are independent
H1: the two classifications are dependent
How do you compute conditional probabilities given X = xⱼ?
The formula is P(Y = yᵢ | X = xⱼ) = P(Y = yᵢ, X = xⱼ) / P(X = xⱼ).
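The formula can be applied mechanically to a joint probability table. The joint distribution below is hypothetical, and exact fractions are used only to keep the arithmetic easy to follow; the function names are illustrative.

```python
from fractions import Fraction

# Hypothetical joint distribution P(X = x, Y = y), keyed by (x, y)
joint = {
    (0, 0): Fraction(1, 10),
    (0, 1): Fraction(3, 10),
    (1, 0): Fraction(2, 10),
    (1, 1): Fraction(4, 10),
}


def marginal_x(x):
    """P(X = x): sum the joint probabilities over all values of Y."""
    return sum(p for (xj, _), p in joint.items() if xj == x)


def conditional_y_given_x(y, x):
    """P(Y = y | X = x) = P(Y = y, X = x) / P(X = x)."""
    return joint[(x, y)] / marginal_x(x)


print(conditional_y_given_x(1, 0))  # (3/10) / (4/10) = 3/4
```

For a fixed x, the conditional probabilities over all values of y add up to 1, which is a quick sanity check on any table.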
When you take time-series data can it be i.i.d.?
No, because time-series data belong to consecutive time periods (or moments in time) and are thus not drawn independently. As a result, the consecutive pairs of random variables (X₁, Y₁), …, (Xₙ, Yₙ) are likely to be dependent.
The correlation of G, H is only equal to the correlation of X, Y if ______
G is a linear transformation of X and H is a linear transformation of Y, with slopes of the same sign
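A quick numerical check illustrates this. The data and the transformation coefficients are hypothetical; the sketch also shows that the correlation changes sign when the two slopes have opposite signs.

```python
from math import sqrt


def correlation(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    var_a = sum((ai - mean_a) ** 2 for ai in a)
    var_b = sum((bi - mean_b) ** 2 for bi in b)
    return cov / sqrt(var_a * var_b)


# Hypothetical data
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 6.0]

# Linear transformations with positive slopes
g = [2 * xi + 3 for xi in x]
h = [5 * yi - 1 for yi in y]

# Same transformation of y, but with a negative slope
h_flipped = [-5 * yi - 1 for yi in y]

print(correlation(x, y), correlation(g, h), correlation(g, h_flipped))
```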
How does the residual calculation (least squares estimator) change for a linear regression model with only a constant term so for example Y= a + u (error term)? THIS IS VERY IMPORTANT
In a linear regression model with only a constant term, the estimated value is the same for all observations, so the residual (the difference between the observed value and the estimated value) simplifies to the observed value minus the estimated constant: Yᵢ − α̂. The least squares estimator of the constant is the sample mean, α̂ = Ȳ, so each residual is simply Yᵢ − Ȳ.
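The constant-only case can be checked numerically. The sketch below uses made-up data and relies on the standard result that the sample mean minimizes the sum of squared residuals in this model; it compares the mean against two nearby candidate constants.

```python
# Hypothetical observations of Y
y = [3.0, 5.0, 4.0, 8.0]


def sum_squared_residuals(alpha):
    """Sum of squared residuals for the model Y = alpha + u."""
    return sum((yi - alpha) ** 2 for yi in y)


# Least squares estimate of the constant: the sample mean
alpha_hat = sum(y) / len(y)

# Residuals are the observed values minus the estimated constant
residuals = [yi - alpha_hat for yi in y]

# Any other constant gives a larger sum of squared residuals
candidates = [alpha_hat - 0.5, alpha_hat + 0.5]
for c in candidates:
    print(c, sum_squared_residuals(c), ">", sum_squared_residuals(alpha_hat))
```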
df for TSS
n-1
df for SSR
n-2
df for ESS
1 as there is only one x (one regressor)
Give the interpretation of β̂₁ (beta-one hat)
It is the slope of the sample regression line. For example, for each unit increase in price, the estimated number of products sold decreases by 0.263.
Give an interpretation of SER:
SER is an estimator of the standard deviation of the regression error uᵢ = Yᵢ − β₀ − β₁Xᵢ.
Give an interpretation of R squared:
The regression R² is the coefficient of determination: the fraction of the sample variance of Y that is explained by (or predicted by) X.
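The quantities on the last few cards (TSS, ESS, SSR, SER and R²) can be computed together for a regression with one regressor. The data below are hypothetical, and the naming assumes ESS is the explained sum of squares and SSR the sum of squared residuals, matching the degrees of freedom given above.

```python
from math import sqrt

# Hypothetical data with one regressor
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Least squares estimates of the slope and the intercept
beta1_hat = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
beta0_hat = y_bar - beta1_hat * x_bar

fitted = [beta0_hat + beta1_hat * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

tss = sum((yi - y_bar) ** 2 for yi in y)     # total sum of squares, df = n - 1
ess = sum((fi - y_bar) ** 2 for fi in fitted)  # explained sum of squares, df = 1
ssr = sum(ri ** 2 for ri in residuals)       # sum of squared residuals, df = n - 2

r_squared = ess / tss
ser = sqrt(ssr / (n - 2))  # standard error of the regression

print(f"slope = {beta1_hat:.3f}, R^2 = {r_squared:.3f}, SER = {ser:.3f}")
```

With an intercept in the model, TSS = ESS + SSR, so R² can equivalently be computed as 1 − SSR/TSS.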
How can I explain that two variables are i.i.d. on the exam:
For each i = 1, …, n, we know that (Xᵢ, uᵢ) is drawn randomly from the same joint population distribution, so it is i.i.d. But this means that (Xᵢ, Yᵢ) = (Xᵢ, β₀ + β₁Xᵢ + uᵢ) is also i.i.d. So the i.i.d. assumption is valid.