Week 3 | COMPARING TWO POPULATION CENTRAL LOCATIONS WITH PARAMETRIC AND NONPARAMETRIC TECHNIQUES Flashcards
What questions are best studied with a matched pairs experiment, based on a paired sample design or a repeated measures design?
Does a certain professional development program improve performance of staff?
(feel free to add more)
What does the matched pairs experiment (paired sample design or repeated measures design) involve? How many populations of experimental units, variables of interest and samples of paired observations should there be?
It involves observing a given set of randomly selected items twice (before and after the treatment), or observing pairs of items considered identical (or at least very similar) prior to the experiment, of which one receives the treatment while the other does not
1 population of experimental units
1 variable of interest and a single random sample of pairs of observations
What are the requirements for the matched-pairs Z/t tests?
i. The data is a random sample of independent pairs of observations (the before and after observations within a pair are not independent of each other)
ii. The variable of interest is quantitative and continuous
iii. The measurement scale is interval or ratio
iv. Either (Z test) the population standard deviation of the differences, \sigma_D, is known and the sample mean of the differences is at least approximately normally distributed
or (t test) \sigma_D is unknown but the population of the differences is normally distributed
What is the fourth assumption of the Wilcoxon signed ranks test?
iv. The distribution of the differences is symmetric (it does not have to be normal)
Ex 1:
In order to determine the effect of advertising in the Yellow Pages, a researcher took a sample of 40 retail stores that did not advertise in the Yellow Pages last year but did so this year. The annual sales (in thousands of dollars) for each store in both years were recorded.
Write down the hypotheses for testing whether sales improved between the 2 years
What is the decision rule for rejecting the null hypothesis?
H_0: mu_D = mu_1 - mu_2 = 0
H_A: mu_D = mu_1 - mu_2 > 0
(let X1 equal this year's sales and X2 equal last year's sales)
If the observed t-value is greater than the critical t-value (looked up manually in the t-table), we reject the null hypothesis
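The decision rule can be sketched by hand in R. The sales figures below are simulated purely for illustration (they are not the data from the exercise), and the 5% significance level is an assumption.

```r
# Simulated (hypothetical) sales, in thousands of dollars; NOT the exercise data
set.seed(1)
n  <- 40
X2 <- rnorm(n, mean = 500, sd = 50)       # last year's sales
X1 <- X2 + rnorm(n, mean = 10, sd = 30)   # this year's sales

D      <- X1 - X2                         # paired differences
t_obs  <- mean(D) / (sd(D) / sqrt(n))     # observed t-value under H_0: mu_D = 0
t_crit <- qt(0.95, df = n - 1)            # right-tail critical value, assumed alpha = 0.05

reject_H0 <- t_obs > t_crit               # decision rule for H_A: mu_D > 0

# Cross-check against the built-in paired t-test
fit <- t.test(X1, X2, paired = TRUE, alternative = "greater")
print(c(t_obs = t_obs, t_crit = t_crit))
print(reject_H0)
```

Comparing t_obs with t_crit leads to the same decision as comparing the p-value reported by t.test with the significance level.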
What is the R code for the paired t-test? What is the R code for checking normality?
When skew.2SE or kurt.2SE is greater than 1 in absolute value, what does that imply?
T-test:
t.test(X1, X2, conf.level = 0.90, paired = TRUE)
Normality (stat.desc is from the pastecs package):
stat.desc(X1 - X2, basic = FALSE, desc = FALSE, norm = TRUE)
When skew.2SE or kurt.2SE is greater than 1 in absolute value, the skewness or kurtosis differs significantly from zero (at roughly the 5% level), which implies non-normality
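As a complementary check that needs only base R, the Shapiro-Wilk test can be applied to the differences directly (stat.desc with norm = TRUE reports the same test as normtest.W and normtest.p). This is a minimal sketch on simulated differences, with an assumed 5% cut-off.

```r
# Simulated (hypothetical) paired differences; NOT the exercise data
set.seed(2)
D <- rnorm(40, mean = 10, sd = 30)

sw <- shapiro.test(D)            # H_0: the differences come from a normal population
print(sw$statistic)              # W statistic, close to 1 for normal-looking data
print(sw$p.value)

# Assumed cut-off of 0.05: a small p-value is evidence against normality
normality_plausible <- sw$p.value > 0.05
print(normality_plausible)
```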
In order to determine the effect of advertising in the Yellow Pages, a researcher
took a sample of 40 retail stores that did not advertise in the Yellow Pages last
year but did so this year. The annual sales (in thousands of dollars) for each
store in both years were recorded.
Following on from this question, what are some nonparametric tests we can perform to double-check our parametric t-test? (list the code)
Sign test (SignTest is from the DescTools package):
SignTest(X1, X2, mu = 0, alternative = "greater")
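A second nonparametric check is the Wilcoxon signed ranks test, which is available in base R. The sketch below runs it on simulated data (not the exercise data) and also shows a base-R version of the sign test through an exact binomial test, for cases where DescTools is not installed.

```r
# Simulated (hypothetical) sales; NOT the exercise data
set.seed(3)
X2 <- rnorm(40, mean = 500, sd = 50)       # last year
X1 <- X2 + rnorm(40, mean = 10, sd = 30)   # this year
D  <- X1 - X2

# Wilcoxon signed ranks test on the paired differences
wsr <- wilcox.test(X1, X2, paired = TRUE, alternative = "greater")

# Sign test via the exact binomial test: it uses only the signs of the differences
n_pos     <- sum(D > 0)
n_nonzero <- sum(D != 0)
sgn <- binom.test(n_pos, n_nonzero, p = 0.5, alternative = "greater")

print(wsr$p.value)
print(sgn$p.value)
```

The statistic V reported by wilcox.test is the sum of the ranks of the positive differences.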
What questions does the independent samples design study? How many populations of experimental units, variables of interest and independent random samples are there?
Do men spend more on newspapers and magazines than women? (Add more)
There are 2 populations of experimental units (male and female), 1 variable of interest (amount spent on magazines and newspapers) and 2 independent samples
What properties does the sampling distribution of the difference between the two sample means have?
Depending on the population variances (sigma1^2 and sigma2^2), what 3 possible cases do we distinguish?
i. E(X1bar - X2bar) = mu1 - mu2 (X1bar - X2bar is an unbiased estimator of mu1 - mu2)
ii. Var(X1bar - X2bar) = sigma1^2/n1 + sigma2^2/n2
iii. If both populations are normally distributed or the CLT applies:
Z = ((X1bar - X2bar) - (mu1 - mu2)) / sigma_(X1bar - X2bar) ~ N(0, 1), where sigma_(X1bar - X2bar) = sqrt(sigma1^2/n1 + sigma2^2/n2)
1) sigma1^2 and sigma2^2 are known
2) sigma1^2 and sigma2^2 are unknown but equal
3) sigma1^2 and sigma2^2 are unknown and different
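Properties i and ii can be checked with a small simulation. All population parameters and sample sizes below are made-up values chosen for illustration.

```r
# Hypothetical population parameters and sample sizes
set.seed(4)
n1 <- 30;  n2 <- 50
mu1 <- 10; mu2 <- 8
sd1 <- 3;  sd2 <- 5

# Draw many pairs of independent samples and record the difference in sample means
diffs <- replicate(20000, mean(rnorm(n1, mu1, sd1)) - mean(rnorm(n2, mu2, sd2)))

theory_mean <- mu1 - mu2                  # property i
theory_var  <- sd1^2 / n1 + sd2^2 / n2    # property ii

print(c(simulated = mean(diffs), theory = theory_mean))
print(c(simulated = var(diffs),  theory = theory_var))
```

The simulated mean and variance should land close to the theoretical values, up to simulation noise.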
What happens when sigma1^2 and sigma2^2 are known?
What happens when sigma1^2 and sigma2^2 are unknown but equal?
What happens when sigma1^2 and sigma2^2 are unknown and different?
What is the test for the third situation called?
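When sigma1^2 and sigma2^2 are known:
Confidence interval: (x1bar - x2bar) +- z_(alpha/2) sigma_(x1bar - x2bar), where sigma_(x1bar - x2bar) = sqrt(sigma1^2/n1 + sigma2^2/n2)
Test statistic: Z = ((X1bar - X2bar) - mu_(D,0)) / sigma_(X1bar - X2bar) ~ N(0, 1), where mu_(D,0) is the difference hypothesized under H_0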
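The known-variances case can be worked through from summary statistics alone. The sample means, population standard deviations and sample sizes below are hypothetical, and a 95% confidence level is assumed.

```r
# Hypothetical summary statistics with KNOWN population standard deviations
xbar1 <- 10.2; xbar2 <- 9.7
sd1   <- 2.9;  sd2   <- 2.8
n1    <- 100;  n2    <- 100
mu_D0 <- 0                                  # hypothesized difference under H_0

se     <- sqrt(sd1^2 / n1 + sd2^2 / n2)     # standard error of X1bar - X2bar
z_crit <- qnorm(0.975)                      # z_(alpha/2) for an assumed 95% level

ci    <- (xbar1 - xbar2) + c(-1, 1) * z_crit * se
z_obs <- ((xbar1 - xbar2) - mu_D0) / se
p_two_sided <- 2 * pnorm(-abs(z_obs))

print(ci)
print(c(z_obs = z_obs, p_two_sided = p_two_sided))
```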
When sigma1^2 and sigma2^2 are unknown but equal:
The common population variance is best estimated from the pooled sample, i.e. from all n1 + n2 available observations, or from the 2 sample variances (if they are already known): s_p^2 = ((n1 - 1) s1^2 + (n2 - 1) s2^2) / (n1 + n2 - 2)
Confidence interval: (x1bar - x2bar) +- t_(alpha/2) s_(x1bar - x2bar), where s_(x1bar - x2bar) = sqrt(s_p^2 (1/n1 + 1/n2))
Test statistic: t = ((X1bar - X2bar) - mu_(D,0)) / s_(X1bar - X2bar) ~ t_df, with df = n1 + n2 - 2
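A minimal sketch of the pooled (equal variances) calculation, done by hand and compared with t.test. The two samples are simulated for illustration only.

```r
# Simulated (hypothetical) samples from populations with equal variances
set.seed(5)
x1 <- rnorm(25, mean = 10, sd = 3)
x2 <- rnorm(30, mean = 9,  sd = 3)
n1 <- length(x1); n2 <- length(x2)

# Pooled estimate of the common variance
sp2 <- ((n1 - 1) * var(x1) + (n2 - 1) * var(x2)) / (n1 + n2 - 2)
se  <- sqrt(sp2 * (1 / n1 + 1 / n2))       # estimated standard error
df  <- n1 + n2 - 2

t_obs <- ((mean(x1) - mean(x2)) - 0) / se  # H_0: mu1 - mu2 = 0

# Cross-check against the built-in equal-variances t-test
fit <- t.test(x1, x2, var.equal = TRUE)
print(c(t_by_hand = t_obs, t_builtin = unname(fit$statistic), df = df))
```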
When sigma1^2 and sigma2^2 are unknown and different:
The population variances must be estimated separately, by the sample variances s1^2 and s2^2
Confidence interval: (x1bar - x2bar) +- t_(alpha/2) s_(x1bar - x2bar), where s_(x1bar - x2bar) = sqrt(s1^2/n1 + s2^2/n2)
Test statistic: t = ((X1bar - X2bar) - mu_(D,0)) / s_(X1bar - X2bar), which follows a t_df distribution only approximately, with df given by the Welch-Satterthwaite approximation
This version of the independent-samples t-test is called the Welch t-test, or unequal variances t-test
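The Welch calculation, including the approximate degrees of freedom, can be sketched by hand and compared with the default behaviour of t.test. The samples are simulated for illustration only.

```r
# Simulated (hypothetical) samples from populations with different variances
set.seed(6)
x1 <- rnorm(20, mean = 10, sd = 2)
x2 <- rnorm(35, mean = 9,  sd = 6)

v1 <- var(x1) / length(x1)                 # s1^2 / n1
v2 <- var(x2) / length(x2)                 # s2^2 / n2

se    <- sqrt(v1 + v2)                     # variances estimated separately
t_obs <- (mean(x1) - mean(x2)) / se

# Welch-Satterthwaite approximation to the degrees of freedom
df_welch <- (v1 + v2)^2 /
  (v1^2 / (length(x1) - 1) + v2^2 / (length(x2) - 1))

# t.test uses the Welch version unless var.equal = TRUE is given
fit <- t.test(x1, x2)
print(c(t_by_hand = t_obs, df_by_hand = df_welch))
```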
What are the requirements of the 2 independent samples Z/t test?
i. Data consists of 2 independent random samples of independent observations
ii. The variable of interest is quantitative and continuous
iii. The measurement scale is interval or ratio
iv. Either (Z-test) the population standard deviations, sigma1 and sigma2, are known and the sample means are at least approximately normally distributed
or (t-test) sigma1 and sigma2 are unknown but the sampled populations are normally distributed (at least approximately)
Automobile insurance companies take many factors into consideration when
setting the rates. These factors include age, marital status, and kilometres
driven per year. In order to determine the effect of gender (1: male, 2: female),
100 male and 100 female drivers were surveyed. Each was asked how many
kilometres (KMS, in thousands of kilometres) he or she drove in the past year.
What is the first step we need to take?
Given that the first step has been taken and the variances appear to be similar, which variance case does this relate to?
What is the code for the confidence interval in R?
What are the hypotheses if we are testing whether males drive more than females (let mu1 = males and mu2 = females)? How would we perform this hypothesis test in R?
The first step is to find the variances of the two groups. We do this by computing preliminary descriptive statistics (the sample variances) for each gender
This relates to case (2) of the variances (unknown but equal)
The code for the confidence interval:
t.test(KMS ~ Gender, var.equal = TRUE, conf.level = 0.90)
This runs the two-sample t-test and reports the confidence interval
Hypotheses are:
H_0: mu1 = mu2
H_A: mu1 > mu2
t-stat: (10.233 - 9.659)/0.407 = 1.410
R code:
t.test(KMS ~ Gender, var.equal = TRUE, conf.level = 0.95, alternative = "greater")
This gives the two-sample t-test
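The t-stat on this card can be reproduced from the quoted numbers. The sketch reads 10.233 and 9.659 as the two sample means and 0.407 as the standard error of their difference (an interpretation of the card), and takes df = 100 + 100 - 2 from the equal-variances case.

```r
# Numbers quoted on the card; their interpretation is an assumption (see above)
diff_means <- 10.233 - 9.659
se         <- 0.407

t_obs  <- diff_means / se                     # about 1.410
df     <- 100 + 100 - 2                       # equal-variances case: n1 + n2 - 2
t_crit <- qt(0.95, df)                        # right-tail critical value, alpha = 0.05

p_value   <- pt(t_obs, df, lower.tail = FALSE)  # one-sided p-value for H_A: mu1 > mu2
reject_H0 <- t_obs > t_crit

print(round(t_obs, 3))
print(reject_H0)
```

Under these assumptions the observed t-value stays below the critical value, so H_0 is not rejected at the 5% level.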
Automobile insurance companies take many factors into consideration when
setting the rates. These factors include age, marital status, and kilometres
driven per year. In order to determine the effect of gender (1: male, 2: female),
100 male and 100 female drivers were surveyed. Each was asked how many
kilometres (KMS, in thousands of kilometres) he or she drove in the past year.
Following on from this question, suppose the sample variances look similar but we are not willing to assume that the population variances are equal. What command would we then use in R?
t.test(KMS ~ Gender, conf.level = 0.95, alternative = "greater")
This gives us the Welch t-test
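The preliminary look at the variances and the choice between the two versions of the test can be sketched as follows. The data frame drivers is simulated and hypothetical (it only mimics the layout of the exercise, with KMS and Gender columns), and var.test is one possible formal check of equal variances, not necessarily the one used in the course.

```r
# Simulated (hypothetical) data laid out like the exercise: 1 = male, 2 = female
set.seed(7)
drivers <- data.frame(
  Gender = rep(c(1, 2), each = 100),
  KMS    = c(rnorm(100, mean = 10.2, sd = 3), rnorm(100, mean = 9.7, sd = 3))
)

# First step: sample variances by group
group_vars <- tapply(drivers$KMS, drivers$Gender, var)
print(group_vars)

# Optional formal check: F test of equal variances
vt <- var.test(KMS ~ Gender, data = drivers)
print(vt$p.value)

# Equal-variances t-test versus Welch t-test
pooled <- t.test(KMS ~ Gender, data = drivers, var.equal = TRUE, alternative = "greater")
welch  <- t.test(KMS ~ Gender, data = drivers, alternative = "greater")
print(c(df_pooled = unname(pooled$parameter), df_welch = unname(welch$parameter)))
```

With equal sample sizes the two t-statistics coincide, and only the degrees of freedom (and hence the p-values) differ.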
What is a nonparametric alternative to the two independent samples Z/t test? What are the requirements of this test?
What are the hypotheses for this test?
What is the test based on, what do the two rank sums add up to, and what is the test statistic?
The Mann-Whitney test (also called the Wilcoxon rank-sum test):
i. The data consists of 2 independent random samples of independent observations
ii. The variable of interest is quantitative and continuous
iii. The measurement scale is at least ordinal
iv. The 2 sampled populations differ at most with respect to their central locations measured by the medians (i.e. they are identical in shape and spread)
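Hypotheses (eta1 and eta2 denote the population medians):
H_0: eta1 = eta2
H_A: eta1 not equal to eta2, eta1 > eta2 or eta1 < eta2
The test is based on the ranks in the pooled sample of size n = n1 + n2
Rank sums of the 2 samples:
T1 + T2 = n(n + 1)/2
Test statistic: T = T1 (the rank sum of sample 1)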
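The rank sums can be computed by hand to see how they relate to the output of wilcox.test. The two samples are simulated for illustration only. Note that R reports the statistic W = T1 - n1(n1 + 1)/2 rather than T1 itself.

```r
# Simulated (hypothetical) independent samples
set.seed(8)
x1 <- rnorm(12, mean = 10, sd = 3)
x2 <- rnorm(15, mean = 9,  sd = 3)
n1 <- length(x1); n2 <- length(x2); n <- n1 + n2

r  <- rank(c(x1, x2))            # ranks in the pooled sample
T1 <- sum(r[1:n1])               # rank sum of sample 1
T2 <- sum(r[(n1 + 1):n])         # rank sum of sample 2

print(T1 + T2 == n * (n + 1) / 2)

# wilcox.test reports W, a shifted version of T1
fit <- wilcox.test(x1, x2)
W   <- T1 - n1 * (n1 + 1) / 2
print(c(T1 = T1, W_by_hand = W, W_builtin = unname(fit$statistic)))
```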
What is the difference between the Wilcoxon signed ranks and rank-sum tests?
Wilcoxon signed ranks test: the ranks are classified into two groups according to the position of each observation relative to the hypothesized median (smaller or larger)
Wilcoxon rank-sum test: the ranks are classified into two groups according to the values of a grouping variable (like gender in our example)