Week 5 Flashcards
Different causal relationships: direct causation, reverse causation, partial causation, spurious causation, time, coincidence, tautology
● Suppose you have evidence (after you have collected data in a research project) that clearly shows a pattern: variable A varies along with B.
● The correlation between A and B could be a result of a host of different reasons:
○ Direct causation: A and only A causes B
■ Note: even if we establish this ideal case, it does not mean that we understand how the causal relationship operates (the mechanism)
○ Reverse causation: A and B vary together, but it’s actually B that causes A
■ The more firemen fighting a fire, the bigger the fire is observed to be - actually, the larger the fire, the more firemen are needed
○ Partial causation: A causes B, but only because of the presence of something else C. So it’s A + C that is causing B
■ A healthy diet decreases your chances of getting cancer - but is it only the healthy diet? Probably not: a healthy diet usually goes along with being more fit and with other variables that affect your chances of getting cancer.
○ Spurious causation (confounding variables are present): A and B are both caused by a third, unidentified factor of C
■ A high grade in this course (A) correlates with higher grades in later courses (B): consider the good, hard-working student (C). A hard-working student will get good grades regardless of how they do in this course. Therefore, C can cause both A and B.
○ Time: the passage of time causes both A and B to vary independently
■ Global warming and the increased number of earthquakes and other natural disasters are a direct effect of the shrinking number of pirates since the early 1800s.
■ The process of time causes both A and B
■ When designing research take into consideration the effect of time
○ Coincidence: apparent relationship just due to random variation
■ Lincoln and Kennedy assassinations
■ Cherry-picking data that makes it look like you have an argument but you don’t
■ When it seems that variables are connected but there is a chance of coincidence you need a theoretical framework that offers a coherent explanation
○ Tautology: A and B are actually the same variable (A/B measure the same concept)
■ Level of economic development and quality of judiciary institutions - which comes first? Maybe both are measuring the overall development of the society you're looking at.
■ 100% of people who drink water die - if you're alive, you drink water; if you're dead, you can't drink water
● These are all possibilities: would like to eliminate as many of them as possible, in order to give us more confidence in a hypothesis on the causal relationship
○ Ideally, we need a methodological solution (design) that allows us to isolate the impact of A on B, taking into account all of these other possible situations (the presence of C, time, etc.)
○ If for instance, taking C into account (“controlling for”) makes the original correlation disappear, then there wasn’t really a relationship.
Hypothesis testing and statistical significance
● When testing hypotheses about this reality, we begin with two questions:
○ Is there a relationship between two variables in the population?
■ For instance, can we claim that people with a higher level of education are generally more satisfied with the job they have?
■ Very typically we term the two variables the independent and the dependent variable - notice these labels are assigned by me, not by the data; I choose them because I have a certain logic in mind
○ Could we determine this only by looking at a sample?
● Given that we rarely have information on the entire population of cases, hypothesis testing often begins with the observation that there is a trend, or relationship, or something interestingly odd in our sample.
○ Is this proof that one variable is linked to another (correlation)?
○ Is this what we would expect (from theory, past experience, other cases, etc.)?
● Tests of statistical significance are all intended to determine whether or not the data in the sample is indicative of a larger trend in a given population.
○ For example: is there a difference between the average income of adult men versus the rest of the population? Suppose we have data on:
■ The entire population - mean income for everyone in the country
■ A representative sample of 100 adult men, with a mean income that is different from the population's
● There are at least two possible explanations:
○ The sample mean is not significantly different from the population mean - i.e., the difference is trivial
○ The difference between the sample and the population is statistically significant - income for adult men is different from the whole population's
Null and research hypotheses
● We have two competing hypotheses
○ Null hypothesis: the difference is caused by random sampling error, and is not due to real differences between the two groups
■ The null always states that there is no statistically significant difference
○ The alternative hypothesis: the difference is “real” (i.e., the trend in the sample is also true in the population)
○ We can never infer with a 100% confidence level; there is always a chance that we are wrong. Rather than proving the relationship, we try to disprove the null hypothesis - if we can determine that it is reasonable to reject the null hypothesis, then we have support for the alternative hypothesis
● Results: suppose the results show the mean for the population is 4,000
○ The mean of the sample of men is 4,500, the standard deviation is 500, and the sample size is 100
○ So there is an observable difference between the parameter (4,000) and the statistic in the sample (4,500) - it seems that adult men do have a higher income on average - is this difference real, or is it caused by other factors?
● If you can reject the null hypothesis then you can say the result is statistically significant - a very specific term that doesn't simply mean "important" - i.e., the difference we were seeing in the sample was not due to random sampling error - therefore, we can conclude that adult men have a different income on average than the whole population
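The income example above can be worked through as a one-sample z test; a minimal sketch using the numbers from the notes (population mean 4,000; sample mean 4,500; SD 500; N = 100):

```python
# One-sample z test for the income example: is the sample of adult men
# significantly different from the whole population?
from math import sqrt

pop_mean = 4000      # known population mean
sample_mean = 4500   # mean income in the sample of adult men
sd = 500             # sample standard deviation
n = 100              # sample size

standard_error = sd / sqrt(n)                    # 500 / 10 = 50
z_obtained = (sample_mean - pop_mean) / standard_error
print(z_obtained)                                # 10.0

# 10.0 is far beyond the two-tailed critical score of 1.96,
# so we reject the null hypothesis at the 0.05 alpha level.
print(abs(z_obtained) > 1.96)                    # True
```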
5-step hypotheses testing model
- Make assumptions and meet test requirements
- State the null hypothesis and research hypothesis
- Select the sampling distribution and establish the critical region (i.e., the criteria to pass/fail the test)
- Compute the test statistic
- Make a decision and interpret results
Step 1: Assumptions and requirements
- For one sample hypothesis testing with t-tests:
- Sample must be randomly selected from the defined population
- The sample is selected so that it is representative of a subgroup of the whole population, one with a specific characteristic
- Level of measurement of the dependent variable must be interval/ratio
- Level of measurement for the independent variable must be dichotomous (only two possible values) - a nominal variable making one distinction between two categories
- Sampling distribution of means must be normal in shape
Step 2: State hypotheses
- Null: there is "no difference" - e.g., if we are comparing men's income to the population's, the mean of the population from which the sample comes is equal to the population mean we are comparing it with
- Research hypothesis: there is a difference - perhaps focus on direction of difference, smaller or larger
Step 3: sampling distribution and critical region
- Which distribution should we use?
- The z-distribution if N is larger than 120 (normal curve, Appendix A)
- The t-distribution if N is smaller than 120 (Appendix B)
- Select the critical region:
- Critical region is the area under the curve which includes all the unlikely sample outcomes if the null hypothesis were to be true - the region where you can reject the null hypothesis
- Typically a 0.05, or 5%, alpha level (the proportion of area under the curve which falls in the critical region) - 95% confidence, leaving 5% of the area in the critical region
- If we want 99% confidence, alpha would be 0.01 and 1% of the area falls in the critical region
- The z or t score is the one corresponding to the selected alpha level
- I.e., it is the score that corresponds to the threshold outside of which we fall into the critical region
- E.g., for an alpha of 0.05, the z score would be 1.96 for a two-tailed test
- If instead we had a smaller sample of 75 cases, the t score would be 1.993 for an alpha of 0.05
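The critical z scores quoted above can be looked up with the standard library's normal distribution (a sketch; t critical values for small samples still come from a t table such as Appendix B):

```python
# Looking up critical z scores for a chosen alpha level with the
# standard normal distribution (stdlib only; no z table needed).
from statistics import NormalDist

alpha = 0.05

# Two-tailed: alpha is split between both tails, alpha/2 in each.
z_crit_two = NormalDist().inv_cdf(1 - alpha / 2)
print(round(z_crit_two, 3))   # 1.96

# One-tailed: all of alpha sits in a single tail.
z_crit_one = NormalDist().inv_cdf(1 - alpha)
print(round(z_crit_one, 3))   # about 1.645 (tables often round to 1.65)
```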
Step 4: compute the test statistic
- The test statistic is the z or t score of the sample outcome we are interested in
- This score is referred to as z or t obtained
Step 5: discussion and interpretation
- If z obtained falls in the critical region
- If the z obtained is outside ±1.96, you can reject the null hypothesis
- A larger standard error will result in a lower z score
- When you can reject the null hypothesis, you can state that it is reasonable to argue there is a real difference between the groups - not that we necessarily know why this difference exists, only that it does exist
Example of 5-step model in practice
- A z critical to reject the null hypothesis is 1.96 for a two-tailed test but 1.65 for a one-tailed test
- Example: a sample of 152 felonies tried in a local court has a mean sentence of 27.3 months, is this significantly different from the average term for all felons across the nation, which is 28.7 months?
- Population mean = 28.7 months
- Sample
- 27.3 mean
- SD = 3.7
- N = 152
- Step 1: assumptions and requirements
- Random sample
- Interval data for dependent
- Dichotomous for the independent: whether they were tried in the local court or not
- Step 2: state hypotheses
- Null: there is no relationship (no difference between local and national sentences)
- Alternative: there is a relationship - and a specific type of relationship: the national mean is larger than the local mean
- We need a one-tailed test, because the critical region is only on one side of the mean - we're not even going to consider the region on the other side
- Step 3: distribution and critical region
- Sample is large enough for z distribution
- Alpha is 0.05 - confidence level of 95%, one-tailed because the hypothesis specifies one direction
- z score of 1.65, which cuts off 0.45 of the area on one side of the mean (0.50 + 0.45 = 0.95)
- Step 4: compute the statistic
- The z obtained is 4.67 in magnitude, which is beyond the 1.65 critical score
- Step 5:
- At the 95% confidence level, if this is a random sample we can be confident that the result would be replicated in 95% of other samples
- We reject the null hypothesis
- The difference cannot be attributed to sampling error
- The local court is handing down lower sentences for the same crime
- Now what?
- The 1.4 month difference between the two different means, does it really matter?
- All the test tells us is about statistical significance - we need to consider theory, policy relevance, and other data to draw a conclusion
- If the z obtained had not fallen in the critical region, then we could have said there was no significant difference between the local court and the national average
p-value - type I and type II errors
Making the decision with the p-value:
- Based on the obtained z or t score from one sample mean
- The p-value is the probability of obtaining a difference between the means at least this large if the null hypothesis is true
- A p-value less than 0.05 lets you reject the null hypothesis
- Therefore, using the p-value and comparing the obtained z or t score to the critical z or t score are practically the same decision
Type I error: we reject the null hypothesis when it is true
Type II error: we fail to reject the null hypothesis when it is false
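The p-value decision rule can be sketched with the standard library; the z score here is illustrative, not from the notes.

```python
# p-value decision rule: p is the probability of observing a difference
# at least this large if the null hypothesis is true.
from statistics import NormalDist

z_obtained = 2.5   # illustrative z score, not from the notes
alpha = 0.05

# Two-tailed p-value: the area in both tails beyond |z obtained|.
p_value = 2 * (1 - NormalDist().cdf(abs(z_obtained)))
print(round(p_value, 4))      # about 0.0124

# p < alpha is the same decision as |z obtained| > 1.96.
print(p_value < alpha)        # True -> reject the null hypothesis
```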
Correlation versus causation
● One frequent problem we have to deal with when trying to answer a research question is causality: can we really demonstrate a causal link between concepts by simply looking at a set of cases and comparing them?
● Worth noting that with only a few exceptions, all research methods and designs only help us establish a correlation: one independent variable (hypothesized cause) varies along with the dependent variable (outcome)
● To establish causality between two variables, we need both to observe correlation and to have a good theoretical explanation for the link
○ Example: age and political donations. We may find evidence illustrating correlation (i.e., as people get older, they tend to give more money to political parties). But can we explain it?
○ In other words, is the correlation causal? Does the variation in the independent variable cause the variation in the outcome?