Statistics 2 Flashcards by Marysia Sztukiewicz

Write out the conditions and assumptions for: A statistician found that, in a random sample of 45 tubes of glue produced by one manufacturer, the mean drying time was 𝑥̅1 = 185 minutes, with a standard deviation of 20 minutes. In a random sample of 40 tubes of glue produced by another manufacturer, the mean drying time was 𝑥̅2 = 201 minutes, with a standard deviation of 57 minutes. Do these data allow us to conclude, at the 10% significance level, that the mean drying times of the two kinds of glue differ?

independent random samples
population variances are unknown; only the sample variances are known
quantitative data
samples are large enough, as n1 > 30 and n2 > 30, so the normal approximation applies
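
Under these conditions, the test statistic for the glue data can be sketched in a few lines. This is only an illustration, assuming the large-sample normal approximation named above; the function name `two_sample_z` is made up for this example.

```python
from math import sqrt
from statistics import NormalDist

def two_sample_z(mean1, sd1, n1, mean2, sd2, n2):
    """Large-sample z statistic and two-sided p-value for H0: mu1 = mu2."""
    # Standard error of the difference between the two sample means
    se = sqrt(sd1**2 / n1 + sd2**2 / n2)
    z = (mean1 - mean2) / se
    # Two-sided p-value from the standard normal distribution
    p_two_sided = 2 * NormalDist().cdf(-abs(z))
    return z, p_two_sided

# Values taken from the card: manufacturer 1 vs manufacturer 2
z, p = two_sample_z(185, 20, 45, 201, 57, 40)
print(round(z, 3), round(p, 3))
```

The p-value is then compared with the 10% significance level stated in the question.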

Factors that identify the Wilcoxon rank sum test:

the objective of the problem is to compare 2 populations
the data are ordinal, or interval but non-normal
two random, independent samples

Why does the Wilcoxon rank sum test require the data to be non-normal?

Because if the data were normally distributed, we would be able to perform the equal-variances t-test of 𝜇1 − 𝜇2 instead.

What distinguishes nonparametric techniques from parametric ones in this context?

Nonparametric techniques test differences in population locations rather than means. They’re used for ordinal data and as alternatives when populations aren’t normally distributed, unlike parametric tests requiring normal distribution.

When might nonparametric techniques be applied to interval data?

Nonparametric techniques can be used with interval data when the assumption of normality isn’t met. In such cases, even if the data are interval and the mean is appropriate, these techniques test population locations instead.

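In the Wilcoxon rank sum test, how does the ranking procedure work?

We must remember to rank all the observations from both samples together: pool the two samples, rank the observations from smallest (rank 1) to largest, and give tied observations the average of the ranks they occupy. T is then the sum of the ranks of sample 1.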
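
The pooled ranking can be sketched as follows. This is a minimal illustration, assuming ties receive the average of the ranks they occupy and that T is the rank sum of sample 1; the helper names `pooled_ranks` and `rank_sum_T` are hypothetical.

```python
def pooled_ranks(values):
    """Ranks from smallest (1) to largest; tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        # Find the end of the run of values tied with the one at position pos
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + 1 + end + 1) / 2  # average of the ranks pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = avg
        pos = end + 1
    return ranks

def rank_sum_T(sample1, sample2):
    """Sum of the pooled ranks that belong to sample 1."""
    ranks = pooled_ranks(list(sample1) + list(sample2))
    return sum(ranks[:len(sample1)])

print(rank_sum_T([1, 3, 5], [2, 4, 6]))
```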

What does a small value of T indicate when comparing observations in sample 1 and sample 2?

A small value of T suggests that most smaller observations belong to sample 1, while larger observations are predominantly in sample 2, implying that the location of population 1 is to the left of population 2.

In which situations is the sign test employed?

the problem objective is to compare two populations
the data are ordinal
the experimental design is matched pairs

In what scenario is the sign test applicable when dealing with interval data?

While the sign test can be used with interval data, doing so loses potentially valuable information, since the size of the differences is meaningful in interval data.

Why might knowing the actual difference in values be more informative than just knowing the direction of difference in certain cases?

In scenarios involving interval data, understanding the actual difference provides more meaningful information than solely knowing which value is greater or lesser.

What statistical method is recommended for interval data that aren’t normally distributed?

For interval data that don’t adhere to a normal distribution, the Wilcoxon Signed Rank Sum Test is recommended. This test considers both the sign and the magnitude of differences between values.

Which method remains valid when dealing with ordinal data?

Utilizing the sign of the differences remains the only valid method when handling ordinal data.

How is the difference between “excellent” and “good” represented in a 4-3-2-1 rating system?

In a 4-3-2-1 rating system, the difference between “excellent” and “good” is denoted as +1.

What adjustments are necessary in the Wilcoxon Rank Sum Test when switching the roles between Population 1 and Population 2, particularly concerning sample sizes (n1 and n2)?

When switching the roles between Population 1 and Population 2 in statistical analysis, it’s crucial to adjust the sample sizes accordingly. For instance, if initially n1 represented the sample size for Population 1 and n2 for Population 2, after the role switch, n1 becomes the sample size for the newly designated Population 2, and n2 becomes the sample size for the newly designated Population 1. All calculations involving these sample sizes—such as computing test statistics, determining critical values, and identifying rejection regions—should be adjusted to maintain accuracy in the analysis reflecting the comparison between the populations.

Why are the ratings provided for four-star and five-star hotels considered as independent samples rather than matched pairs?

The ratings for four-star and five-star hotels are treated as independent samples because they represent different entities (different hotels) and were provided by different individuals. Each rating corresponds to a different category of hotels, and there is no direct pairing between the ratings of a specific four-star hotel and a specific five-star hotel for the same respondents.

What adjustments are made when calculating probabilities related to a binomial distribution?

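When a binomial probability is approximated with the normal distribution, a continuity correction of 0.5 is applied. For the probability of at most a certain number of successes, P(X ≤ k), 0.5 is added to k. For the probability of at least that number, P(X ≥ k), 0.5 is subtracted from k.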
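
The effect of the 0.5 adjustment can be checked against the exact binomial probability. This is a sketch under the assumption that the adjustment is the continuity correction for a normal approximation; the function names are hypothetical.

```python
from math import comb, sqrt
from statistics import NormalDist

def binom_cdf(k, n, p):
    """Exact P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def normal_approx_le(k, n, p):
    """Approximate P(X <= k): add 0.5 to k before using the normal curve."""
    return NormalDist(n * p, sqrt(n * p * (1 - p))).cdf(k + 0.5)

def normal_approx_ge(k, n, p):
    """Approximate P(X >= k): subtract 0.5 from k before using the normal curve."""
    return 1 - NormalDist(n * p, sqrt(n * p * (1 - p))).cdf(k - 0.5)

# Illustrative values only (n = 20, p = 0.5, k = 12)
print(binom_cdf(12, 20, 0.5), normal_approx_le(12, 20, 0.5))
print(1 - binom_cdf(11, 20, 0.5), normal_approx_ge(12, 20, 0.5))
```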

What are the key conditions necessary for conducting a Sign Test?

The key conditions for a Sign Test are:

The populations being compared are similar in shape and spread.
To use the normal approximation, the sample size (the number of matched pairs with a nonzero difference) must exceed 10.

Conditions and assumptions for a sign test:

* The agencies rate the same 20 investments: random sample of matched pairs
* ESG scale is ordinal data, so we can ONLY perform a sign test
* Since 𝑛 = 20 > 10, we may use the normal approximation here

Hypotheses for sign test (one used for matched pairs of ordinal data)

𝐻0 ∶ The two population locations are the same (no difference in preference) ⟹ 𝑝+ = 0.50
𝐻1 ∶ Location of pop.1 (monitor 1) is to the right of location of pop.2 (monitor 2) ⟹ 𝑝+ > 0.50

For the sign test, what does p+ refer to?

𝑝+ is the population proportion of positive differences (pairs favoring monitor 1). Under the null hypothesis of no difference, 𝑝+ = 0.50.

Sample sizes _________ are often considered sufficient for the CLT to hold

equal to or greater than 30

For the Wilcoxon signed rank sum test you rank the ____

absolute differences; then, when computing T+, sum only the ranks of the positive differences
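
A minimal sketch of this ranking step, assuming zero differences are discarded and tied absolute differences share the average rank (the function name `signed_rank_sums` is hypothetical):

```python
def signed_rank_sums(differences):
    """Return (T+, T-): rank sums of the positive and negative differences."""
    nonzero = [d for d in differences if d != 0]  # zero differences are dropped
    abs_sorted = sorted(abs(d) for d in nonzero)

    def avg_rank(a):
        # Average of the ranks occupied by all absolute differences equal to a
        first = abs_sorted.index(a) + 1
        count = abs_sorted.count(a)
        return first + (count - 1) / 2

    t_plus = sum(avg_rank(abs(d)) for d in nonzero if d > 0)
    t_minus = sum(avg_rank(abs(d)) for d in nonzero if d < 0)
    return t_plus, t_minus

print(signed_rank_sums([3, -1, 2, -4, 5]))
```

The two sums always add up to n(n + 1)/2 for n nonzero differences, which is a handy check on the arithmetic.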

When doing an exact sign test, our rejection region is just the _______

significance level; we compare the p-value to the significance level; if our p-value is less than significance level we reject H0

What does x+ represent in the context of a sign test?

x+ denotes the number of matched pairs with a positive difference, that is, pairs in which the first observation exceeds the second.

How is the p-value calculated in statistical tests like the sign test?

The p-value signifies the probability of observing a result as extreme as, or more extreme than, the observed result, assuming that the null hypothesis is true.

Given x+ =15 and a right sided test, how is the p-value computed in a sign test?

The p-value is P(X+ ≥ 15) = 1 − P(X+ ≤ 14), where X+ follows a binomial distribution with 𝑛 equal to the number of nonzero differences and 𝑝 = 0.5 under the null hypothesis (or its normal approximation when 𝑛 is large enough).
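
As an illustration, the exact right-tailed p-value can be computed directly from the binomial distribution. The card does not give the number of pairs, so n = 20 below is an assumption made only for this example, and `sign_test_p_right` is a hypothetical name.

```python
from math import comb

def sign_test_p_right(x_plus, n):
    """Exact right-tailed p-value: P(X+ >= x_plus) with X+ ~ Binomial(n, 0.5)."""
    return sum(comb(n, k) for k in range(x_plus, n + 1)) / 2**n

# ASSUMPTION: n = 20 nonzero differences (not stated on the card)
p_value = sign_test_p_right(15, 20)
print(round(p_value, 4))
```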

What does the p-value represent in hypothesis testing?

The p-value represents the probability of obtaining results at least as extreme as the observed results, under the assumption that the null hypothesis is true.

What is the significance level (𝛼) in hypothesis testing?

The significance level (𝛼) is the threshold used to determine the significance of the results. It's the predetermined acceptable probability of incorrectly rejecting the null hypothesis.

What is the decision rule based on the p-value and significance level?

If the p-value is less than or equal to the significance level (p ≤ 𝛼), we reject the null hypothesis. If the p-value is greater than the significance level (p > 𝛼), we fail to reject the null hypothesis.

Why is the choice of significance level important?

The significance level determines the tolerance for Type I error (incorrectly rejecting the null hypothesis). A lower significance level requires stronger evidence to reject the null hypothesis but might increase the chances of Type II error (incorrectly failing to reject the null hypothesis).

Alternate hypothesis for Chi squared Goodness of Fit Test:

H1: At least one 𝑝𝑖 is not equal to its specified value

What is the goodness of fit test used to determine?

It is used to determine whether sample data are consistent with a hypothesized distribution. Put simply, it is used with categorical data to check whether the observations fit a theoretical expectation.

What are the essential assumptions for conducting non-parametric goodness of fit analysis, and what criteria must be met regarding sample randomness, chi-squared validity, expected values, and category thresholds in this context?

The assumptions for a chi-squared goodness of fit analysis are:

Random and independent samples: the observations must be drawn randomly and independently.
Chi-squared validity: the test statistic only approximately follows a chi-squared distribution, and the approximation is reliable when the expected values are large enough (see the next two rules).
Expected values: no expected value should be less than 1.
Threshold for expected values in categories: no more than 20% of categories should have expected values less than 5.
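
A short sketch of the goodness of fit statistic together with the two expected-value rules above. The observed counts and hypothesized proportions are made-up illustration data, and `goodness_of_fit` is a hypothetical helper.

```python
def goodness_of_fit(observed, hypothesized_probs):
    """Chi-squared statistic, degrees of freedom, expected counts, rule check."""
    n = sum(observed)
    expected = [n * p for p in hypothesized_probs]
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    df = len(observed) - 1
    # Rule check: no expected value below 1, at most 20% of categories below 5
    share_below_5 = sum(e < 5 for e in expected) / len(expected)
    rule_ok = min(expected) >= 1 and share_below_5 <= 0.20
    return statistic, df, expected, rule_ok

# Made-up data: 40 observations in three categories
stat, df, expected, rule_ok = goodness_of_fit([12, 8, 20], [0.25, 0.25, 0.5])
print(stat, df, expected, rule_ok)
```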

What are contingency tables used for?

Contingency tables, also known as two-way tables, are used when analyzing categorical data involving more than one variable.

How are expected values calculated in the Chi Square Test for Dependence:

The expected values are calculated as follows:
* Find the sum of each row and of each column.
* Find the total sum of all cells.
* For each cell, multiply its row sum by its column sum and divide by the total sum: expected value = (row sum × column sum) / total sum.
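
The steps above can be sketched for a small contingency table. The 2×2 counts are made-up illustration data and the function names are hypothetical.

```python
def expected_counts(table):
    """Expected cell counts under independence: row sum * column sum / total."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    total = sum(row_sums)
    return [[r * c / total for c in col_sums] for r in row_sums]

def chi_square_statistic(table):
    """Sum of (observed - expected)^2 / expected over all cells."""
    expected = expected_counts(table)
    return sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, expected)
               for o, e in zip(obs_row, exp_row))

observed = [[10, 20], [30, 40]]  # made-up counts
print(expected_counts(observed))
print(chi_square_statistic(observed))
```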

Z-proportion test deals with _____ data

nominal

Nominal data is data that can only be ______

categorized

Conditions and assumptions for Z test (proportions)

* Random sample
* Nominal data
* 𝑛 ∙ 𝑝 = 200 ∙ 0.40 = 80 ≥ 5 and 𝑛(1 − 𝑝) = 200 ∙ 0.60 = 120 ≥ 5

Hypotheses for Chi Square for Dependence:

𝐻0 ∶ the two classifications are independent
𝐻1 ∶ the two classifications are dependent

How do you compute conditional probabilities given 𝑋=𝑥ⱼ?

The formula is 𝑃(𝑌=𝑦ᵢ | 𝑋=𝑥ⱼ) = 𝑃(𝑌=𝑦ᵢ, 𝑋=𝑥ⱼ) / 𝑃(𝑋=𝑥ⱼ).
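
A small sketch of this formula applied to a joint probability table. The joint distribution below is made up for illustration, and `conditional_given_x` is a hypothetical helper.

```python
def conditional_given_x(joint, x):
    """joint maps (x, y) -> P(X=x, Y=y); returns {y: P(Y=y | X=x)}."""
    # Marginal P(X=x): sum the joint probabilities over all y
    p_x = sum(p for (xj, _), p in joint.items() if xj == x)
    return {y: p / p_x for (xj, y), p in joint.items() if xj == x}

# Made-up joint distribution of (X, Y)
joint = {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.4}
cond = conditional_given_x(joint, 0)
print(cond)
```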

When you take time-series data can it be i.i.d.?

No. Time-series data belong to consecutive time periods (or moments in time), so they are not drawn independently. As a result, the consecutive pairs of random variables (𝑋1, 𝑌1), … , (𝑋𝑛, 𝑌𝑛) are likely to be dependent.

The correlation of G, H is only equal to the correlation of X, Y if ______

G is a linear transformation of X and H is a linear transformation of Y, with slopes of the same sign (slopes of opposite signs flip the sign of the correlation)
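
A quick numerical check of this property, using made-up data and a hand-written Pearson correlation (the function name `correlation` and the transformation constants are chosen only for illustration):

```python
from math import sqrt

def correlation(a, b):
    """Sample Pearson correlation of two equal-length lists."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / sqrt(var_a * var_b)

x = [1.0, 2.0, 4.0, 7.0]
y = [2.0, 1.0, 5.0, 6.0]
g = [3 + 2 * xi for xi in x]       # linear transformation of X, positive slope
h = [-1 + 0.5 * yi for yi in y]    # linear transformation of Y, positive slope
h_neg = [-yi for yi in y]          # negative slope: the correlation changes sign

print(correlation(x, y), correlation(g, h), correlation(g, h_neg))
```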

How does the residual calculation (least squares estimator) change for a linear regression model with only a constant term so for example Y= a + u (error term)? THIS IS VERY IMPORTANT

In a linear regression model with only a constant term, the fitted value is the same constant for every observation, so the residual is the observed value minus the estimated constant: 𝑌𝑖 − α̂. The least squares estimator of the constant is the sample mean, α̂ = 𝑌̅, so each residual is simply 𝑌𝑖 − 𝑌̅.
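
A minimal numerical sketch of the constant-only model, using made-up data. It relies on the standard result that the least squares estimate of the constant is the sample mean; the function names are hypothetical.

```python
def fit_constant_only(y):
    """Least squares fit of Y = a + u: the estimate of a is the sample mean."""
    alpha_hat = sum(y) / len(y)
    residuals = [yi - alpha_hat for yi in y]
    return alpha_hat, residuals

def sum_sq(y, a):
    """Sum of squared deviations of y from a candidate constant a."""
    return sum((yi - a) ** 2 for yi in y)

y = [2.0, 4.0, 9.0]  # made-up observations
alpha_hat, residuals = fit_constant_only(y)
print(alpha_hat, residuals)
# The sample mean gives a smaller sum of squares than nearby constants
print(sum_sq(y, alpha_hat), sum_sq(y, alpha_hat + 0.5), sum_sq(y, alpha_hat - 0.5))
```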

df for TSS

n-1

df for SSR

n-2

df for ESS

1 as there is only one x (one regressor)

Give the interpretation of β̂1 (beta-one hat)

It is the slope of the sample regression line. For example, for each one-unit increase in price, the estimated number of products sold decreases by 0.263.

Give an interpretation of SER:

SER (the standard error of the regression) is an estimator of the standard deviation of the regression error 𝑢𝑖 = 𝑌𝑖 − 𝛽0 − 𝛽1𝑋𝑖.

Give an interpretation of R squared:

The regression 𝑅² is the coefficient of determination: the fraction of the sample variance of 𝑌 that is explained by (or predicted by) 𝑋.
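
The quantities on the last few cards (TSS, ESS, SSR, 𝑅² and SER) can be tied together in one small sketch. The data are made up, the function name `simple_ols_summary` is hypothetical, and SSR here means the sum of squared residuals with n − 2 degrees of freedom, as on the cards above.

```python
from math import sqrt

def simple_ols_summary(x, y):
    """Least squares fit of Y = b0 + b1*X and the usual sums of squares."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
          / sum((xi - x_bar) ** 2 for xi in x))
    b0 = y_bar - b1 * x_bar
    fitted = [b0 + b1 * xi for xi in x]
    tss = sum((yi - y_bar) ** 2 for yi in y)                 # total, df = n - 1
    ess = sum((fi - y_bar) ** 2 for fi in fitted)            # explained, df = 1
    ssr = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))   # residual, df = n - 2
    return {"b0": b0, "b1": b1, "TSS": tss, "ESS": ess, "SSR": ssr,
            "R2": ess / tss, "SER": sqrt(ssr / (n - 2))}

summary = simple_ols_summary([1, 2, 3, 4], [2, 4, 5, 8])  # made-up data
print(summary)
```

With an intercept in the model, TSS = ESS + SSR, which is why 𝑅² can equally be written as 1 − SSR/TSS.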

How can I explain that two variables are i.i.d. on the exam:

For each 𝑖 = 1, … , 𝑛, we know that (𝑋𝑖, 𝑢𝑖) is drawn randomly from the same joint population distribution, so the pairs (𝑋𝑖, 𝑢𝑖) are i.i.d. This means that (𝑋𝑖, 𝑌𝑖) = (𝑋𝑖, 𝛽0 + 𝛽1𝑋𝑖 + 𝑢𝑖) is also i.i.d., so the assumption is valid.

Statistics 2 Flashcards

(51 cards)