Introduction. Visualizing Data Flashcards
What is spurious correlation?
A spurious correlation occurs when two variables are correlated but don’t have a causal relationship
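A small simulation can make this concrete. The sketch below assumes two made-up series that both trend upward over time but do not influence each other; the trend slopes, noise level, and seed are illustrative choices, not values from these cards.

```python
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(0)  # fixed seed so the run is reproducible
n = 2000

# Two invented series that both grow over time but do not affect each other.
x = [1.0 * t + rng.gauss(0, 10) for t in range(n)]
y = [2.0 * t + rng.gauss(0, 10) for t in range(n)]

# Levels: the shared time trend produces a correlation close to 1.
r_levels = pearson(x, y)

# First differences remove the trend; what is left is unrelated noise.
dx = [x[t] - x[t - 1] for t in range(1, n)]
dy = [y[t] - y[t - 1] for t in range(1, n)]
r_diffs = pearson(dx, dy)

print(f"correlation in levels:      {r_levels:.3f}")
print(f"correlation in differences: {r_diffs:.3f}")
```

The strong correlation in levels comes entirely from the common trend, so it says nothing about a causal link between the two series.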
Omitted variable bias
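It occurs when the model leaves out an independent variable that has a causal effect on the dependent variable and is correlated with an included independent variable. The coefficient of the included variable then picks up part of the omitted variable's effect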
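The bias can be seen in a simulation. This is a minimal sketch under assumed values: the true effect of x on y is set to 1.0, and a confounder z (hypothetical) affects both x and y. Controlling for z is done here by regressing the residuals of y on the residuals of x after removing z from each, which gives the same coefficient as including z in the regression.

```python
import random

def mean(v):
    return sum(v) / len(v)

def slope(x, y):
    """OLS slope from regressing y on x (with an intercept)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def residuals(y, z):
    """Residuals from regressing y on z (with an intercept)."""
    b = slope(z, y)
    a = mean(y) - b * mean(z)
    return [yi - a - b * zi for yi, zi in zip(y, z)]

rng = random.Random(1)  # fixed seed so the run is reproducible
n = 5000

# Assumed data-generating process: z drives both x and y.
z = [rng.gauss(0, 1) for _ in range(n)]
x = [0.8 * zi + rng.gauss(0, 1) for zi in z]
y = [1.0 * xi + 2.0 * zi + rng.gauss(0, 1) for xi, zi in zip(x, z)]

# Omitting z: the slope on x absorbs part of z's effect.
biased = slope(x, y)

# Controlling for z: partial z out of both x and y, then regress.
controlled = slope(residuals(x, z), residuals(y, z))

print(f"slope of x, z omitted:  {biased:.2f}")
print(f"slope of x, z included: {controlled:.2f}")
```

With these assumed parameters the estimate that omits z lands near 2 instead of the true 1.0, while the controlled estimate stays close to 1.0.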
Simpson’s Paradox
It is a phenomenon in probability and statistics in which a trend appears in several groups of data but disappears or reverses when the groups are combined
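A worked example with invented counts shows how the reversal arises. The (successes, trials) numbers below are hypothetical and chosen only so that option A does better within each group while option B does better once the groups are pooled.

```python
# Hypothetical (successes, trials) counts, invented for illustration.
counts = {
    "group 1": {"A": (8, 10), "B": (70, 100)},
    "group 2": {"A": (30, 100), "B": (2, 10)},
}

# Success rate of each option within each group.
within = {
    group: {arm: s / t for arm, (s, t) in arms.items()}
    for group, arms in counts.items()
}

# Success rate of each option after pooling both groups.
combined = {}
for arm in ("A", "B"):
    successes = sum(counts[g][arm][0] for g in counts)
    trials = sum(counts[g][arm][1] for g in counts)
    combined[arm] = successes / trials

for group, rates in within.items():
    print(f"{group}: A = {rates['A']:.2f}, B = {rates['B']:.2f}")
print(f"combined: A = {combined['A']:.2f}, B = {combined['B']:.2f}")
```

The reversal happens because A is used mostly in the group with low success rates and B mostly in the group with high success rates, so group membership acts as a confounder.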
Unit of analysis
The observation described by a set of data. For example, voters, parties, bills, elections, voting decisions, legislative output. Very often our data have multiple levels of analysis (e.g., individuals, regions, countries), calling for different statistical techniques
Variables
Any characteristic related to the unit of analysis. A variable can take on different values for different observations
Types of variables
Nominal (e.g., political party), ordinal (e.g., school grades), interval (e.g., temperature in degrees Celsius), ratio (e.g., duration, GDP)
Data set
Set of variables for a given set of observations. Should come with a codebook
Hypothesis
Statement about the nature of the social and political world, often expressed as a statement about relationships between variables (e.g., “The lower X, the higher Y”)
Cross-section data
Sample of voters, governments, countries, or other units, taken at a given point in time. Observations are typically assumed to be independent
Time series data
Observations on units over time, e.g., number of conflicts in country X. Because past events can influence future events and lags in behavior are prevalent in the social sciences, time is an important dimension in such a data set. Observations are not independent across time (serial correlation)
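Serial correlation can be measured with the lag-1 autocorrelation, the correlation between each observation and the one before it. The sketch below compares a simulated series in which each value depends on the previous one with a series of independent draws; the persistence parameter 0.8 and the series length are assumptions made for illustration.

```python
import random

def lag1_autocorrelation(series):
    """Correlation between each value and the value one period earlier."""
    n = len(series)
    m = sum(series) / n
    num = sum((series[t] - m) * (series[t - 1] - m) for t in range(1, n))
    den = sum((v - m) ** 2 for v in series)
    return num / den

rng = random.Random(2)  # fixed seed so the run is reproducible
n = 5000
phi = 0.8  # assumed persistence: how strongly the past carries over

# Each value is phi times the previous value plus a new random shock.
dependent = [0.0]
for _ in range(n - 1):
    dependent.append(phi * dependent[-1] + rng.gauss(0, 1))

# Independent draws, as typically assumed for cross-section data.
independent = [rng.gauss(0, 1) for _ in range(n)]

r_dependent = lag1_autocorrelation(dependent)
r_independent = lag1_autocorrelation(independent)

print(f"lag-1 autocorrelation, dependent series:   {r_dependent:.2f}")
print(f"lag-1 autocorrelation, independent series: {r_independent:.2f}")
```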
Pooled time series cross-section data
Data consist of comparable time series data observed on a variety of units. For instance, units are countries, and for each country we observe annual data on a variety of political and economic variables. Typically, we have few units, but long time series. Pooling the data increases the number of observations and makes it possible to control for exogenous shocks. Observations are usually not independent.
Panel data
A large number of the same cross-sectional units, e.g., survey respondents, are observed repeatedly over a number of “waves” (interviews). With panel data, the time series is usually very short. Common in studies of political behavior. Examples include the German Socio-Economic Panel (SOEP) and the German Internet Panel (GIP) in Mannheim
Histogram
A bar graph that shows the distribution of the measurements of a variable. The height of each bar shows how many observations fall in a particular subinterval (bin), and the bins are plotted along the horizontal axis
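The counting step behind a histogram can be sketched in a few lines. The data values, the range 0 to 10, and the choice of five equal-width bins are all made up for illustration; `histogram_counts` is a hypothetical helper, not a library function.

```python
def histogram_counts(values, low, high, n_bins):
    """Count how many values fall in each of n_bins equal-width bins."""
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for v in values:
        if not low <= v <= high:
            continue  # ignore values outside the plotted range
        # A value equal to `high` is placed in the last bin.
        index = min(int((v - low) / width), n_bins - 1)
        counts[index] += 1
    return counts

data = [1, 2, 2, 3, 5, 6, 7, 7, 8, 9]  # invented measurements
low, high, n_bins = 0, 10, 5
counts = histogram_counts(data, low, high, n_bins)

# Text version of the plot: one row per bin, one mark per observation.
width = (high - low) / n_bins
for i, c in enumerate(counts):
    left = low + i * width
    print(f"[{left:4.1f}, {left + width:4.1f})  {'#' * c}")
```

Changing `n_bins` changes the shape of the picture, which is one of the deficiencies of histograms that density plots try to address.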
Density plot
Addresses the deficiencies of histograms by averaging and smoothing. It shows an estimate of the probability density function of the random variable X
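One common way to do the averaging and smoothing is kernel density estimation: a small bell curve is placed on every observation and the curves are averaged. The sketch below assumes a Gaussian kernel and a bandwidth of 1.0; both are illustrative choices, and the data are invented.

```python
import math

def gaussian_kde(data, bandwidth):
    """Return a function that estimates the density at any point x."""
    n = len(data)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        # Average of Gaussian bumps centered on each observation.
        return norm * sum(
            math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data
        )

    return density

data = [1, 2, 2, 3, 5, 6, 7, 7, 8, 9]  # invented measurements
density = gaussian_kde(data, bandwidth=1.0)

# Check that the estimate behaves like a density: it integrates to about 1.
step = 0.01
grid = [-10 + i * step for i in range(3001)]  # covers -10 to 20
area = sum(density(x) for x in grid) * step

for x in (2, 4, 7):
    print(f"estimated density at x = {x}: {density(x):.3f}")
print(f"area under the curve: {area:.3f}")
```

A larger bandwidth gives a smoother curve and a smaller one follows the data more closely, so the bandwidth plays a role similar to the bin width of a histogram.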
Measures of Central Tendency
Mode, Median, Mean
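The three measures can be computed with Python's standard `statistics` module. The values below are invented and include one large observation to show how the measures can differ.

```python
import statistics

values = [1, 2, 2, 3, 12]  # invented data with one large value

mode = statistics.mode(values)      # most frequent value
median = statistics.median(values)  # middle value of the sorted data
mean = statistics.mean(values)      # sum of the values divided by their number

print(f"mode = {mode}, median = {median}, mean = {mean}")
```

The single large value pulls the mean above the median and the mode, which is why the median is often preferred for skewed data.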