Week 3 | COMPARING TWO POPULATION CENTRAL LOCATIONS WITH PARAMETRIC AND NONPARAMETRIC TECHNIQUES Flashcards

Question 1

Q

What kinds of questions are best studied with a matched pairs experiment (paired sample design or repeated measures design)?

Answer

A

Does a certain professional development program improve performance of staff?
(feel free to add more)

Question 2

Q

What does the matched pairs experiment (paired sample design or repeated measures design) involve? How many populations of experimental units, variables of interest and samples of paired observations should there be?

Answer

A

It involves either observing a given set of randomly selected items twice (before and after the treatment), or observing pairs of items that are considered identical (or at least very similar) prior to the experiment, where one item in each pair receives the treatment and the other does not

1 population of experimental units
1 variable of interest and a single random sample of pairs of observations

Question 3

Q

What are the requirements for the matched-pairs Z/t tests?

Answer

A

i. The data are a random sample of independent pairs of observations (the before and after observations within a pair are not independent of each other)

ii. The variable of interest is quantitative and continuous
iii. The measurement scale is interval or ratio
iv. Either (Z test) the population standard deviation of the differences, sigma_D, is known and the sample mean of the differences is at least approximately normally distributed

or (t test) sigma_D is unknown but the population of the differences is normally distributed

Question 4

Q

What is the fourth assumption of the Wilcoxon signed ranks test?

Answer

A

iv. The distribution of the differences is symmetric (it does not have to be normal)

Question 5

Q

Ex 1:
In order to determine the effect of advertising in the Yellow Pages, a researcher took a sample of 40 retail stores that did not advertise in the Yellow Pages last year but did so this year. The annual sales (in thousands of dollars) for each store in both years were recorded.
Write down the hypotheses for testing whether sales improved between the 2 years.
What is the decision rule for rejecting the null hypothesis based on the t-value?

Answer

A

H_0: mu_D = mu_1 - mu_2 = 0
H_A: mu_D = mu_1 - mu_2 > 0
(let X1 equal this year's sales and X2 equal last year's sales)

If the observed t-value is greater than the critical value t_(alpha, n-1) from the t-table (here df = 40 - 1 = 39), we reject the null hypothesis
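
For reference, the statistic behind this decision rule can be written as follows. The symbols \bar{d} and s_D (the sample mean and sample standard deviation of the n = 40 paired differences) are notation introduced here for illustration; they do not appear on the original card.

```latex
t = \frac{\bar{d} - 0}{s_D / \sqrt{n}} \;\sim\; t_{n-1} \quad \text{under } H_0,
\qquad \text{reject } H_0 \text{ if } t > t_{\alpha,\,n-1} = t_{\alpha,\,39}
```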

Question 6

Q

What is the R code for the matched-pairs t-test? What is the R code for checking normality?
When skew.2SE and kurt.2SE are greater than 1, what does that imply?

Answer

A

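T-test:
t.test(X1, X2, conf.level = 0.90, paired = TRUE)

Normality (stat.desc is in the pastecs package):
stat.desc(X1 - X2, basic = FALSE, desc = FALSE, norm = TRUE)

When skew.2SE or kurt.2SE is greater than 1 in absolute value, the skewness or kurtosis differs significantly from zero, which implies non-normality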
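
A minimal sketch of how these calls fit together, assuming made-up sales figures for 8 stores (not the course data set). It uses base R only: shapiro.test stands in for the pastecs normality output, and the manual t-statistic is included just to show that the paired test is a one-sample t-test on the differences.

```r
# Hypothetical sales figures (in $1000s) for 8 stores; illustration only.
X1 <- c(112, 98, 105, 120, 131, 99, 108, 117)   # this year
X2 <- c(105, 99, 100, 112, 125, 97, 104, 110)   # last year
D  <- X1 - X2                                    # paired differences

# Paired t-test and the equivalent one-sample t-test on the differences
paired  <- t.test(X1, X2, paired = TRUE, alternative = "greater")
one_smp <- t.test(D, mu = 0, alternative = "greater")

# Manual t-statistic: mean difference divided by its standard error
t_manual <- mean(D) / (sd(D) / sqrt(length(D)))

# Shapiro-Wilk test on the differences (base R normality check)
sw <- shapiro.test(D)

print(c(t_paired = unname(paired$statistic), t_manual = t_manual))
print(sw$p.value)
```

The normality check is applied to the differences X1 - X2, not to X1 and X2 separately, because the matched-pairs t-test only makes an assumption about the differences.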

Question 7

Q

In order to determine the effect of advertising in the Yellow Pages, a researcher
took a sample of 40 retail stores that did not advertise in the Yellow Pages last
year but did so this year. The annual sales (in thousands of dollars) for each
store in both years were recorded.
Following on from this question, what are some nonparametric tests we can perform to double-check our parametric t-test? (List the code.)

Answer

A

Sign test (SignTest is in the DescTools package):

SignTest(X1, X2, mu = 0, alternative = "greater")
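
To see what the sign test does under the hood, here is a base R sketch on made-up paired data (not the course data set). It counts the positive differences and compares that count with a Binomial(n, 0.5) distribution; binom.test is used here in place of SignTest so that the sketch runs without extra packages.

```r
# Hypothetical paired data; illustration only.
X1 <- c(112, 98, 105, 120, 131, 99, 108, 117)
X2 <- c(105, 99, 100, 112, 125, 97, 104, 110)
D  <- X1 - X2
D  <- D[D != 0]            # zero differences (ties) are dropped

S <- sum(D > 0)            # number of positive differences
n <- length(D)             # number of non-zero differences

# Under H_0 the number of positive signs is Binomial(n, 0.5)
sign_res <- binom.test(S, n, p = 0.5, alternative = "greater")
p_manual <- sum(dbinom(S:n, n, 0.5))   # P(S or more positive signs)

print(c(S = S, n = n, p_value = sign_res$p.value))
```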

Question 8

Q

What questions does the independent samples design study? How many populations of experimental units, variables of interest and independent random samples are there?

Answer

A

Do men spend more on newspapers and magazines than women? (Add more)

There are 2 populations of experimental units (males and females), 1 variable of interest (amount spent on magazines and newspapers) and 2 independent samples

Question 9

Q

What properties does the sampling distribution of the difference between the two sample means have?
Depending on the population variances (sigma1^2 and sigma2^2), what 3 possible cases do we distinguish?

Answer

A

i. E(Xbar1 - Xbar2) = mu1 - mu2 (Xbar1 - Xbar2 is an unbiased estimator of mu1 - mu2)

ii. Var(Xbar1 - Xbar2) = sigma1^2/n1 + sigma2^2/n2
iii. If both populations are normally distributed or the CLT applies:
Z = ((Xbar1 - Xbar2) - (mu1 - mu2)) / sigma_(Xbar1 - Xbar2) ~ N(0, 1)

The 3 cases:
1) sigma1^2 and sigma2^2 are known
2) sigma1^2 and sigma2^2 are unknown but equal
3) sigma1^2 and sigma2^2 are unknown and different
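
Property (ii) follows from the independence of the two samples. A short derivation, written in standard notation (\sigma for the population standard deviations, n_1 and n_2 for the sample sizes):

```latex
\operatorname{Var}(\bar{X}_1 - \bar{X}_2)
  = \operatorname{Var}(\bar{X}_1) + \operatorname{Var}(\bar{X}_2)
    - 2\operatorname{Cov}(\bar{X}_1, \bar{X}_2)
  = \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2},
\qquad \text{since } \operatorname{Cov}(\bar{X}_1, \bar{X}_2) = 0
\text{ for independent samples.}
```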

Question 10

Q

What happens when sigma1^2 and sigma2^2 are known?
What happens when sigma1^2 and sigma2^2 are unknown but equal?

What happens when sigma1^2 and sigma2^2 are unknown and different?
What is the test for the third situation called?

Answer

A

When sigma1^2 and sigma2^2 are known:
Confidence interval: (xbar1 - xbar2) +- z_(alpha/2) sigma_(Xbar1 - Xbar2)
Test statistic: Z = ((Xbar1 - Xbar2) - mu_(D,0)) / sigma_(Xbar1 - Xbar2) ~ N(0, 1)

When sigma1^2 and sigma2^2 are unknown but equal:
The common population variance is best estimated from the pooled sample, i.e. from all n1 + n2 available observations, or from the 2 sample variances (if they are already known)

Confidence interval: (xbar1 - xbar2) +- t_(alpha/2, df) s_(Xbar1 - Xbar2)
Test statistic: t = ((Xbar1 - Xbar2) - mu_(D,0)) / s_(Xbar1 - Xbar2) ~ t_df, with df = n1 + n2 - 2

When sigma1^2 and sigma2^2 are unknown and different:
The population variances must be estimated separately

Confidence interval: (xbar1 - xbar2) +- t_(alpha/2, df) s_(Xbar1 - Xbar2)
Test statistic: t = ((Xbar1 - Xbar2) - mu_(D,0)) / s_(Xbar1 - Xbar2), which is approximately t_df distributed, with df approximated from the sample variances and sample sizes
This version of the independent-samples t-test is called the Welch t-test, or unequal variances t-test
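
The difference between cases 2 and 3 lies in how the standard error and the degrees of freedom are computed. The following is a sketch on two small made-up samples (not from the course material); the variable names are chosen here for illustration, and the manual results are compared against base R's t.test.

```r
# Hypothetical samples; illustration only.
x1 <- c(10.2, 11.5, 9.8, 12.1, 10.9, 11.3)
x2 <- c(9.1, 10.4, 8.7, 9.9, 10.8)
n1 <- length(x1); n2 <- length(x2)
v1 <- var(x1);    v2 <- var(x2)
diff_means <- mean(x1) - mean(x2)

# Case 2: pooled variance estimate, df = n1 + n2 - 2
sp2     <- ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
se_pool <- sqrt(sp2 * (1 / n1 + 1 / n2))
t_pool  <- diff_means / se_pool

# Case 3 (Welch): separate variance estimates, Welch-Satterthwaite df
se_welch <- sqrt(v1 / n1 + v2 / n2)
t_welch  <- diff_means / se_welch
df_welch <- (v1 / n1 + v2 / n2)^2 /
  ((v1 / n1)^2 / (n1 - 1) + (v2 / n2)^2 / (n2 - 1))

pooled <- t.test(x1, x2, var.equal = TRUE)
welch  <- t.test(x1, x2)          # var.equal = FALSE is the default

print(c(t_pool = t_pool, t_welch = t_welch, df_welch = df_welch))
```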

Question 11

Q

What are the requirements of the two-independent-samples Z/t test?

Answer

A

i. The data consist of 2 independent random samples of independent observations
ii. The variable of interest is quantitative and continuous
iii. The measurement scale is interval or ratio
iv. Either (Z-test) the population standard deviations, sigma1 and sigma2, are known and the sample means are at least approximately normally distributed
or (t-test) sigma1 and sigma2 are unknown but the sampled populations are normally distributed (at least approximately)

Question 12

Q

Automobile insurance companies take many factors into consideration when
setting the rates. These factors include age, marital status, and kilometres
driven per year. In order to determine the effect of gender (1: male, 2: female),
100 male and 100 female drivers were surveyed. Each was asked how many
kilometres (KMS, in thousands of kilometres) he or she drove in the past year.

What is the first step we need to take?

Given that the first step is taken and the sample variances appear to be similar, which variance case does this relate to?

What is the code for the confidence interval in R?

What are the hypotheses if we are testing whether males drive more than females (let mu1 = males and mu2 = females)?
How would we perform the above hypothesis test in R?

Answer

A

The first step is to find the sample variances of the two groups. We do this by obtaining preliminary descriptive statistics from the data

This relates to case (2): the population variances are unknown but equal

The code for the confidence interval:

t.test(KMS ~ Gender, var.equal = TRUE, conf.level = 0.90)
(two-sample t-test, which also reports the confidence interval)

Hypotheses:
H_0: mu1 = mu2
H_A: mu1 > mu2
t-stat: (10.233 - 9.659)/0.407 = 1.410

R code:
t.test(KMS ~ Gender, var.equal = TRUE, conf.level = 0.95, alternative = "greater")
(two-sample t-test)
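
A sketch of the whole workflow on simulated data, since the course data set is not reproduced on this card. The data frame name drivers and the simulation parameters are assumptions made for illustration; only the t.test calls mirror the answer above.

```r
set.seed(1)  # simulated data for illustration only
drivers <- data.frame(
  Gender = rep(c(1, 2), each = 100),            # 1: male, 2: female
  KMS    = c(rnorm(100, mean = 10, sd = 3),
             rnorm(100, mean = 9.5, sd = 3))
)

# Step 1: compare the sample variances of the two groups
vars <- tapply(drivers$KMS, drivers$Gender, var)
print(vars)

# 90% confidence interval for mu1 - mu2, assuming equal variances
ci <- t.test(KMS ~ Gender, data = drivers,
             var.equal = TRUE, conf.level = 0.90)$conf.int

# Right-tailed test of H_A: mu1 > mu2 (group 1 minus group 2)
res <- t.test(KMS ~ Gender, data = drivers,
              var.equal = TRUE, alternative = "greater")
print(ci)
print(res$p.value)
```

With the formula interface, alternative = "greater" refers to the first group level minus the second, so the coding 1: male, 2: female matches H_A: mu1 > mu2.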

Question 13

Q

Automobile insurance companies take many factors into consideration when
setting the rates. These factors include age, marital status, and kilometres
driven per year. In order to determine the effect of gender (1: male, 2: female),
100 male and 100 female drivers were surveyed. Each was asked how many
kilometres (KMS, in thousands of kilometres) he or she drove in the past year.

Following on from this question, suppose the sample variances appear similar but we are not willing to assume that the population variances are equal. What command would we then use in R?

Answer

A

t.test(KMS ~ Gender, conf.level = 0.95, alternative = "greater")
This gives us the Welch t-test (var.equal = FALSE is the default)

Question 14

Q

What is a nonparametric alternative to the two-independent-samples Z/t test? What are the requirements of this test?
What are the hypotheses for this test?
What do the rank sums of the two samples add up to, and what is the test statistic?

Answer

A

The Wilcoxon rank sum test (also known as the Mann-Whitney test):

i. The data consist of 2 independent random samples of independent observations
ii. The variable of interest is quantitative and continuous
iii. The measurement scale is at least ordinal

iv. The 2 sampled populations differ at most with respect to their central locations measured by the medians (i.e. they are identical in shape and spread)

Hypotheses (eta1 and eta2 are the population medians):
H_0: eta1 = eta2
H_A: eta1 not equal to eta2, or eta1 > eta2, or eta1 < eta2

The test is based on the ranks in the pooled sample of size n = n1 + n2

Rank sums of the 2 samples:
T1 + T2 = n(n+1)/2
Test statistic: T = T1 (the rank sum of sample 1)
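
The rank sums can be computed directly in base R. This sketch uses two small made-up samples without ties (not from the course material) and also shows how the statistic reported by base R's wilcox.test relates to T1.

```r
# Hypothetical samples without ties; illustration only.
x1 <- c(10.2, 11.5, 9.8, 12.1, 10.9, 11.3)
x2 <- c(9.1, 10.4, 8.7, 9.9, 10.8)
n1 <- length(x1); n2 <- length(x2); n <- n1 + n2

r  <- rank(c(x1, x2))          # ranks in the pooled sample
T1 <- sum(r[1:n1])             # rank sum of sample 1
T2 <- sum(r[(n1 + 1):n])       # rank sum of sample 2

# wilcox.test reports W = T1 - n1 (n1 + 1) / 2 rather than T1 itself
W <- unname(wilcox.test(x1, x2)$statistic)

print(c(T1 = T1, T2 = T2, total = n * (n + 1) / 2, W = W))
```

Because T1 + T2 is fixed at n(n+1)/2, knowing T1 is enough, which is why only one rank sum is used as the test statistic.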

Question 15

Q

What is the difference between the Wilcoxon signed ranks and rank-sum tests?

Answer

A

Wilcoxon signed ranks test: the observations are classified only by their position relative to the hypothesized median (smaller or larger)

Wilcoxon rank sum test: the observations are classified by the values of a grouping variable (like gender in our example)

Question 16

Q

Automobile insurance companies take many factors into consideration when
setting the rates. These factors include age, marital status, and kilometres
driven per year. In order to determine the effect of gender (1: male, 2: female),
100 male and 100 female drivers were surveyed. Each was asked how many
kilometres (KMS, in thousands of kilometres) he or she drove in the past year.
Following on from this question, let's also perform the Wilcoxon rank sum test for illustration. Since the samples are large (n1 = n2 = 100), we use R.
What are the hypotheses?
List the code.

Answer

A

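H_0: eta1 = eta2, H_A: eta1 > eta2 (eta1 and eta2 are the population medians)
R code (wilcox.exact is in the exactRankTests package):
wilcox.exact(KMS ~ Gender, exact = TRUE, conf.level = 0.95, alternative = "greater")
(exact Wilcoxon rank sum test)

Note: the sample sizes n1 and n2 do not have to be equal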

Question 17

Q

What is the code for the Wilcoxon signed ranks test (exact Wilcoxon signed rank test)?

What is the code for the Wilcoxon rank sum test?

Answer

A

wilcox.exact(Sample_1, Sample_2, exact = TRUE, conf.level = 0.95, paired = TRUE)
(exact Wilcoxon signed rank test)

wilcox.exact(Sample_1, Sample_2, exact = TRUE, conf.level = 0.95)
(exact Wilcoxon rank sum test)
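
For the paired version, the signed rank statistic can be reproduced by hand. This sketch uses made-up paired data chosen so that the absolute differences have no ties, and it uses base R's wilcox.test as a stand-in for wilcox.exact so that it runs without the exactRankTests package.

```r
# Hypothetical paired samples; illustration only.
Sample_1 <- c(112, 98, 105, 120, 131, 99, 108, 117)
Sample_2 <- c(104, 99, 100, 111, 125, 97, 105, 110)

D <- Sample_1 - Sample_2       # paired differences
r <- rank(abs(D))              # ranks of the absolute differences
V_manual <- sum(r[D > 0])      # sum of the ranks of the positive differences

# Base R exact signed rank test (two-sided by default)
res <- wilcox.test(Sample_1, Sample_2, paired = TRUE, exact = TRUE)

print(c(V_manual = V_manual, V = unname(res$statistic)))
print(res$p.value)
```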

Question 18

Q

Which is better, repeated measures design or the independent design?

Answer

A

In general the repeated measures design is potentially more efficient
because (everything else held constant) it reduces the standard error of the
estimator of the difference between the population means (or medians).
However, the repeated measures design also reduces the sample size and
the degrees of freedom, making it less likely to reject a false null hypothesis.
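
Both effects can be illustrated with a small simulation. The numbers below are simulated under assumed parameters (strong positive correlation between the before and after measurements) and are not from the course material.

```r
set.seed(42)  # simulated data for illustration only
n      <- 30
before <- rnorm(n, mean = 50, sd = 10)
after  <- before + rnorm(n, mean = 2, sd = 3)   # positively correlated with 'before'

# Standard error if the two columns were treated as independent samples
se_indep  <- sqrt(var(before) / n + var(after) / n)

# Standard error of the mean difference under the paired design
se_paired <- sd(after - before) / sqrt(n)

# The paired variance is the independent-samples variance minus 2 * cov / n
se_check  <- sqrt((var(before) + var(after) - 2 * cov(before, after)) / n)

df_indep  <- 2 * n - 2   # pooled-variance t-test on two samples of size n
df_paired <- n - 1       # paired t-test on n differences

print(c(se_indep = se_indep, se_paired = se_paired))
print(c(df_indep = df_indep, df_paired = df_paired))
```

The gain in precision comes from the covariance term, so it only materialises when the paired measurements are positively correlated.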