Week 5 Flashcards
Different causal relationships: direct causation, reverse causation, partial causation, spurious causation, time, coincidence, tautology
● Suppose you have evidence (after you have collected data in a research project) that clearly shows a pattern: variable A varies along with B.
● The correlation between A and B could be a result of a host of different reasons:
○ Direct causation: A and only A causes B
■ Note: even if we establish this ideal case, it does not mean that we understand how the causal relationship operates (the mechanism)
○ Reverse causation: A and B vary together, but it’s actually B that causes A
■ The more firemen fighting a fire, the bigger the fire is observed to be - actually, the larger the fire, the more firemen are needed
○ Partial causation: A causes B, but only because of the presence of something else C. So it’s A + C that is causing B
■ A healthy diet decreases your chances of getting cancer - but is it only the healthy diet? Probably not: a healthy diet usually goes along with being more fit and with other variables that affect your chances of getting cancer.
○ Spurious causation (confounding variables are present): A and B are both caused by a third, unidentified factor of C
■ A high grade in this course (A) correlates with higher grades in later courses (B): consider the good, hard-working student (C). A hard-working student will get good grades regardless of how they do in this course. Therefore, C can cause both A and B.
○ Time: the passage of time causes both A and B to vary independently
■ Global warming and the increased number of earthquakes and other natural disasters are a direct effect of the shrinking number of pirates since the early 1800s.
■ The process of time causes both A and B
■ When designing research take into consideration the effect of time
○ Coincidence: apparent relationship just due to random variation
■ Lincoln and Kennedy assassinations
■ Cherry-picking data that makes it look like you have an argument but you don’t
■ When it seems that variables are connected but there is a chance of coincidence you need a theoretical framework that offers a coherent explanation
○ Tautology: A and B are actually the same variable (A/B measure the same concept)
■ Level of economic development and quality of judiciary institutions - which comes first? Maybe both are measuring the overall development of the society you're looking at.
■ 100% of people who drink water die - if you're alive, you drink water; if you're dead, you can't drink water
● These are all possibilities: would like to eliminate as many of them as possible, in order to give us more confidence in a hypothesis on the causal relationship
○ Ideally, we need a methodological solution (design) that allows us to isolate the impact of A on B, taking into account all of these other possible situations (the presence of C, time, etc.)
○ If for instance, taking C into account (“controlling for”) makes the original correlation disappear, then there wasn’t really a relationship.
Hypothesis testing and statistical significance
● When testing hypotheses about this reality, we begin with two questions:
○ Is there a relationship between two variables in the population?
■ For instance, can we claim that people with a higher level of education are generally more satisfied with the job they have?
■ Very typically we term the two variables the independent and the dependent variable - notice these labels are assigned by me, not by the data; I choose them because I have a certain logic in mind
○ Could we determine this only by looking at a sample?
● Given that we rarely have information on the entire population of cases, hypothesis testing often begins with the observation that there is a trend, or relationship, or something interestingly odd in our sample.
○ Is this proof that one variable is linked to another (correlation)?
○ Is this what we would expect (from theory, past experience, other cases, etc.)?
● Tests of statistical significance are all intended to determine whether or not the data in the sample is indicative of a larger trend in a given population.
○ For example: is there a difference between the average income of adult men versus the rest of the population? Suppose we have data on:
■ The entire population - mean income for everyone in the country
■ A representative sample of 100 adult men, with a mean income that is different from the population's
● There are at least two possible explanations:
○ The sample mean is not significantly different from the population mean - i.e., the difference is trivial
○ The difference between the sample and the population is statistically significant - income for adult men is different from the whole population's
Null and research hypotheses
● We have two competing hypotheses
○ Null hypothesis: the difference is caused by random sampling error, and is not due to real differences between the two groups
■ The null always states that there is no statistically significant difference
○ The alternative hypothesis: the difference is “real” (i.e., the trend in the sample is also true in the population)
○ We can never infer with a 100% confidence level; there is always a chance that we are wrong. Rather than proving the relationship, we try to disprove the null hypothesis - if we can determine that it is reasonable to reject the null hypothesis, then we have support for the alternative hypothesis
● Results: suppose the results show the mean for the population is 4,000
○ The mean of the sample of men is 4,500, the standard deviation is 500, and the sample size is 100
○ So there is an observable difference between the parameter (4,000) and the statistic in the sample (4,500) - it seems that adult men do have a higher income on average - is this difference real, or is it caused by other factors?
● If you can reject the null hypothesis then you can say the result is statistically significant - a very specific term that doesn't simply mean "important" - i.e., the difference we were seeing in the sample was not due to random sampling error - therefore, we can conclude that adult men have a different income on average than the whole population
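The income example above can be worked through as a one-sample z test; a minimal sketch using the numbers from the notes (population mean 4,000; sample mean 4,500; SD 500; N = 100):

```python
# One-sample z test for the income example: is the sample of adult men
# significantly different from the whole population?
from math import sqrt

pop_mean = 4000      # known population mean
sample_mean = 4500   # mean income in the sample of adult men
sd = 500             # sample standard deviation
n = 100              # sample size

standard_error = sd / sqrt(n)                    # 500 / 10 = 50
z_obtained = (sample_mean - pop_mean) / standard_error
print(z_obtained)                                # 10.0

# 10.0 is far beyond the two-tailed critical score of 1.96,
# so we reject the null hypothesis at the 0.05 alpha level.
print(abs(z_obtained) > 1.96)                    # True
```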
5-step hypotheses testing model
- Make assumptions and meet test requirements
- State the null hypothesis and research hypothesis
- Select the sampling distribution and establish the critical region (i.e., the criteria to pass/fail the test)
- Compute the test statistic
- Make a decision and interpret results
Step 1: Assumptions and requirements
- For one sample hypothesis testing with t-tests:
- Sample must be randomly selected from the defined population
- The sample is selected so that it is representative of a subgroup of the whole population, one with a specific characteristic
- Level of measurement of the dependent variable must be interval/ratio
- Level of measurement for the independent variable must be dichotomous (only two possible values) - a nominal variable making one distinction between two categories
- Sampling distribution of means must be normal in shape
Step 2: State hypotheses
- Null: there is "no difference" - e.g., if we are comparing men's income to the population's, the mean of the population from which the sample comes is equal to the population mean we are comparing it with
- Research hypothesis: there is a difference - perhaps focus on direction of difference, smaller or larger
Step 3: sampling distribution and critical region
- Which distribution should we use?
- The z-distribution if N is larger than 120 (normal curve, Appendix A)
- The t-distribution if N is smaller than 120 (Appendix B)
- Select the critical region:
- Critical region is the area under the curve which includes all the unlikely sample outcomes if the null hypothesis were to be true - the region where you can reject the null hypothesis
- Typically a 0.05, or 5%, alpha level (the proportion of area under the curve which falls in the critical region) - 95% confidence, leaving 5% of the area in the critical region
- If we want 99% confidence, alpha would be 0.01 and 1% of the area falls in the critical region
- The z or t score is the one corresponding to the selected alpha level
- I.e., it is the score that corresponds to the threshold outside of which we fall into the critical region
- E.g., for an alpha of 0.05, the z score would be 1.96 for a two-tailed test
- If instead we had a smaller sample of 75 cases, the t score would be 1.993 for an alpha of 0.05
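The critical z scores quoted above can be looked up with the standard library's normal distribution (a sketch; t critical values for small samples still come from a t table such as Appendix B):

```python
# Looking up critical z scores for a chosen alpha level with the
# standard normal distribution (stdlib only; no z table needed).
from statistics import NormalDist

alpha = 0.05

# Two-tailed: alpha is split between both tails, alpha/2 in each.
z_crit_two = NormalDist().inv_cdf(1 - alpha / 2)
print(round(z_crit_two, 3))   # 1.96

# One-tailed: all of alpha sits in a single tail.
z_crit_one = NormalDist().inv_cdf(1 - alpha)
print(round(z_crit_one, 3))   # about 1.645 (tables often round to 1.65)
```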
Step 4: compute the test statistic
- The test statistic is the z or t score of the sample outcome we are interested in
- This score is referred to as z or t obtained
Step 5: discussion and interpretation
- If z obtained falls in the critical region
- If the z obtained is outside ±1.96, you can reject the null hypothesis
- A larger standard error will result in a lower z score
- When you can reject the null hypothesis, you can state that it is reasonable to argue there is a real difference between the groups - not that we necessarily know why this difference exists, only that it does exist
Example of 5-step model in practice
- A z critical to reject the null hypothesis is 1.96 for a two-tailed test but 1.65 for a one-tailed test
- Example: a sample of 152 felonies tried in a local court has a mean sentence of 27.3 months, is this significantly different from the average term for all felons across the nation, which is 28.7 months?
- Population mean = 28.7 months
- Sample
- 27.3 mean
- SD = 3.7
- N = 152
- Step 1: assumptions and requirements
- Random sample
- Interval data for dependent
- Dichotomous for the independent: whether they were tried in the local court or not
- Step 2: state hypotheses
- Null: there is no relationship (no difference between local and national sentences)
- Alternative: there is a relationship - and a specific type of relationship: the national mean is larger than the local mean
- We need a one-tailed test, because the critical region is only on one side of the mean - we're not even going to consider the region on the other side
- Step 3: distribution and critical region
- Sample is large enough for z distribution
- Alpha is 0.05 - confidence level of 95%, one-tailed because the hypothesis specifies one direction
- z score of 1.65, which cuts off 0.45 of the area on one side of the mean (0.50 + 0.45 = 0.95)
- Step 4: compute the statistic
- The z obtained is 4.67 in magnitude, which is beyond the 1.65 critical score
- Step 5:
- At the 95% confidence level, if this is a random sample we can be confident that the result would be replicated in 95% of other samples
- We reject the null hypothesis
- The difference cannot be attributed to sampling error
- The local court is handing down lower sentences for the same crime
- Now what?
- The 1.4 month difference between the two different means, does it really matter?
- All the test tells us is about statistical significance - we need to consider theory, policy relevance, and other data to draw a conclusion
- If the z obtained had not fallen in the critical region, then we could have said there was no significant difference between the local court and the national average
p-value - type I and type II errors
Making the decision with the p-value:
- Based on the obtained z or t score from one sample mean
- The p-value is the probability of obtaining a difference between the means at least this large if the null hypothesis is true
- A p-value less than 0.05 lets you reject the null hypothesis
- Therefore, using the p-value and comparing the obtained z or t score to the critical z or t score are practically the same decision
Type I error: we reject the null hypothesis when it is true
Type II error: we fail to reject the null hypothesis when it is false
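The p-value decision rule can be sketched with the standard library; the z score here is illustrative, not from the notes.

```python
# p-value decision rule: p is the probability of observing a difference
# at least this large if the null hypothesis is true.
from statistics import NormalDist

z_obtained = 2.5   # illustrative z score, not from the notes
alpha = 0.05

# Two-tailed p-value: the area in both tails beyond |z obtained|.
p_value = 2 * (1 - NormalDist().cdf(abs(z_obtained)))
print(round(p_value, 4))      # about 0.0124

# p < alpha is the same decision as |z obtained| > 1.96.
print(p_value < alpha)        # True -> reject the null hypothesis
```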
Correlation versus causation
● One frequent problem we have to deal with when trying to answer a research question is causality: can we really demonstrate a causal link between concepts by simply looking at a set of cases and comparing them?
● Worth noting that with only a few exceptions, all research methods and designs only help us establish a correlation: one independent variable (hypothesized cause) varies along with the dependent variable (outcome)
● To establish causality between two variables, we need both to observe correlation and to have a good theoretical explanation for the link
○ Example: age and political donations. We may find evidence illustrating correlation (i.e., as people get older, they tend to give more money to political parties). But can we explain it?
○ In other words, is the correlation causal? Does the variation in the independent variable cause the variation in the outcome?