Final Flashcards
Residual
The vertical distance between the regression line and a given data point: the observed value minus the predicted value.
SSE
Sum of squared errors: takes the residuals, squares them, and adds them up.
The regression gives us the best-fitting line in terms of the sum of squared errors, that is, the line that makes the SSE as small as possible.
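A minimal R sketch of residuals and SSE. The five data points and the alternative line are made up for illustration; the point is only that the fitted line has a smaller SSE than another candidate line.

```r
# Toy data, invented for illustration only
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 6)

fit <- lm(y ~ x)            # OLS fit
res <- y - fitted(fit)      # residual = observed y minus predicted y
sse_ols <- sum(res^2)       # square the residuals, then add them up

# SSE for an arbitrary alternative line, y = 1 + 1 * x
sse_other <- sum((y - (1 + 1 * x))^2)

sse_ols < sse_other         # the OLS line has the smaller SSE
```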
Homoskedasticity
When the error term has the same variance for all observations, that is, at every value of X. This isn’t a problem.
Heteroskedasticity
When the error term does not have the same variance for all observations: its variance changes with the value of X.
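A small simulation can make the contrast concrete. This is a sketch under assumptions: the seed, sample size, and standard deviations are arbitrary choices, not values from the course.

```r
set.seed(1)                               # arbitrary seed so the draw is reproducible
n <- 2000
x <- seq(1, 10, length.out = n)

e_homo <- rnorm(n, mean = 0, sd = 2)      # same spread at every value of x
e_hetero <- rnorm(n, mean = 0, sd = x)    # spread grows with x

# Compare the variance in the upper half of x to the lower half
upper <- x > median(x)
ratio_homo <- var(e_homo[upper]) / var(e_homo[!upper])        # close to 1
ratio_hetero <- var(e_hetero[upper]) / var(e_hetero[!upper])  # well above 1
```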
Multicollinearity
When an independent variable is highly correlated with other independent variables, the variance of the coefficient we estimate for that variable will be high.
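A sketch of that variance inflation, using simulated data. The variable names, seed, and coefficients are assumptions made for the example: the same outcome is regressed on `x1` plus either an unrelated or a nearly duplicate second regressor.

```r
set.seed(1)                               # arbitrary seed
n <- 500
x1 <- rnorm(n)
x2_unrelated <- rnorm(n)                  # independent of x1
x2_related <- x1 + rnorm(n, sd = 0.1)     # almost a copy of x1
y <- 1 + 2 * x1 + rnorm(n)

# Helper (hypothetical name): standard error of the x1 coefficient
se_x1 <- function(fit) summary(fit)$coefficients["x1", "Std. Error"]

se_unrelated <- se_x1(lm(y ~ x1 + x2_unrelated))
se_related <- se_x1(lm(y ~ x1 + x2_related))   # much larger than se_unrelated
```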
Variables in a data set come in different forms:
- Dummy (AKA binary or dichotomous) variables
- Discrete versus continuous variables
- Ordinal variables
- Nominal variables
Data sets themselves also come in different forms:
- Cross-sectional data
- Panel (time-series) data
Logged variables
one version of a transformed variable
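As a small illustration (the incomes are made up), taking the natural log turns equal multiplicative steps into equal additive steps:

```r
income <- c(1000, 10000, 100000)   # made-up incomes, each 10 times the last
log_income <- log(income)          # natural log of each value
diff(log_income)                   # the steps are equal on the log scale
```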
Binary (dummy) variables are very useful.
- They are regularly used to control for specific effects (Think about the South in Larry Bartels’ article)
- They are used in experiments to identify the treated (1) and control (0) units.
Binary variables make differences
in means across two groups, or ‘average treatment effects’, very easy to calculate.
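A minimal sketch with made-up numbers: the coefficient on a 0/1 variable in a simple regression equals the difference in group means.

```r
# Toy experiment, values invented for illustration
treated <- c(0, 0, 0, 1, 1, 1)     # 0 = control, 1 = treated
y <- c(1, 2, 3, 5, 6, 7)

diff_means <- mean(y[treated == 1]) - mean(y[treated == 0])

fit <- lm(y ~ treated)
coef(fit)[["treated"]]             # same number as diff_means
```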
Discrete data
Comes in ‘bins’ or groups.
• Example: On a scale of 1 to 5, how much do you like this class? 1, 2, 3, 4, 5. (5! Obv.)
• Polity score is another example.
• Discrete data lacks precision. The bins may be clear, but they may not be. We may not know what going from 1 to 2 means.
Continuous data
Can take any value in a sequence. Examples:
• Annual income
• Votes for each candidate
• Percentages
Descriptive Data
it describes how the world is. Often, categorical data comes from qualitative research.
Categorical data can be ordinal or nominal.
• Ordinal can be ordered (low, medium, high; comparable to discrete data)
• Nominal cannot be ordered (majors such as political science, economics, sociology; states).
class(data$variable)
to find out if something is a factor
lev <- levels(data$factor)
to find out the levels of that factor and store it as an object
ifelse(data$factor == lev[1] | data$factor == lev[2], 1, 0)
If the variable is at either of these levels, code it as 1 (treated); otherwise code it as 0 (control).
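The three commands above can be run together as a sketch. The data frame and its `region` column are hypothetical, and treating the first two levels as ‘treated’ is arbitrary.

```r
# Hypothetical data frame with one factor column
data <- data.frame(region = factor(c("Midwest", "Northeast", "South", "West", "South")))

class(data$region)             # "factor"
lev <- levels(data$region)     # store the levels as an object

# 1 if the row is at the first or second level, 0 otherwise
data$treated <- ifelse(data$region == lev[1] | data$region == lev[2], 1, 0)
```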
Cross-sectional data:
- A sample of a population in a given period of time. You observe a bunch of units at one time period.
- Example: Representative public opinion poll before an election.
Repeated cross-sectional data:
- Taking different samples of a population over time. You observe different units over time.
- Example: Multiple waves of a representative public opinion poll, where different people respond (pick up the phone) in each wave (sample).
Panel (time-series) data:
- Seeing the same population repeatedly over time. You see the same units over given time periods.
- Example: countries’ GDP and wars by year from 1980 to 2015, turnout in each CA precinct from 2000 to 2014, survey efforts that repeatedly target the same group of people annually.
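The three data forms differ only in which units show up in which periods. A sketch with hypothetical respondent ids and years:

```r
# Cross-sectional: several units, one time period
cross <- data.frame(id = 1:4, year = 2012)

# Repeated cross-sectional: a different sample of units in each wave
repeated <- data.frame(id = c(1:4, 5:8), year = rep(c(2012, 2014), each = 4))

# Panel: the same units observed in every period
panel <- expand.grid(id = 1:4, year = c(2012, 2014))

table(panel$id)    # each id appears once per year
```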
Fixed effects models
control for unit-specific effects: they net out the average way each unit behaves over time.
• You could think of this as the ‘culture’ of a unit.
Fixed effects models also
control for time period effects
• they control for the average behavior in a given year.
• Example: average income during recession years, average turnout in each presidential election year.
Fixed effects approach
allows us to control for any factor that is fixed within a unit (or within a time period) across the entire panel, regardless of whether we observe the factor.
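One common way to fit such a model in base R is to add a dummy for every unit and every year. The sketch below uses a noise-free toy panel with made-up numbers, where the true effect of `x` is set to 2 and the unit effect is never given to the model as a measured variable.

```r
# Toy panel: 4 units observed in each of 5 years (all numbers invented)
panel <- expand.grid(unit = 1:4, year = 1:5)
panel$x <- panel$unit * panel$year                   # regressor of interest
unit_effect <- c(0, 5, 10, 15)[panel$unit]           # fixed 'culture' of each unit
year_effect <- c(0, 1, -1, 2, 0)[panel$year]         # shock shared by all units in a year
panel$y <- 2 * panel$x + unit_effect + year_effect   # true effect of x is 2

pooled <- lm(y ~ x, data = panel)                             # ignores the fixed effects
fe <- lm(y ~ x + factor(unit) + factor(year), data = panel)   # unit and year dummies

coef(pooled)[["x"]]    # biased upward, away from 2
coef(fe)[["x"]]        # recovers 2, up to floating-point error
```

Because `x` is larger in units with a larger unit effect, the pooled regression credits part of the unit effect to `x`. The dummies absorb it.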
Randomized experiments
deal with the endogeneity problem by creating exogeneity.
What factors do we need to control for, as covariates, in an experiment?
None! If randomization was done properly and the sample is large enough, we should, in expectation, have balance across all covariates.
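A quick balance check, as a sketch. The covariate `age`, the seed, and the sample size are assumptions chosen for the example.

```r
set.seed(1)                                      # arbitrary seed
n <- 10000
age <- rnorm(n, mean = 40, sd = 10)              # hypothetical pre-treatment covariate
treated <- sample(rep(c(0, 1), each = n / 2))    # random assignment, half treated

# Difference in mean age between treated and control units
balance_gap <- mean(age[treated == 1]) - mean(age[treated == 0])
balance_gap                                      # small relative to the sd of 10
```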