Introduction. Visualizing Data Flashcards
What is spurious correlation?
A spurious correlation occurs when two variables are correlated but don’t have a causal relationship
Omitted variable bias
It occurs when we do not include in the model an independent variable that has a causal effect on the dependent variable and is correlated with an included independent variable
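A small simulation can make omitted variable bias concrete. The sketch below is illustrative only: the variable names (`x`, `z`, `y`), the coefficients (1 for `x`, 2 for `z`), the sample size, and the seed are all assumptions chosen for the example. It compares the slope on `x` when `z` is left out of the model with the slope when `z` is included.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable (arbitrary choice)
n = 5000

# Assumed data-generating process: z drives both x and y.
z = [rng.gauss(0, 1) for _ in range(n)]
x = [zi + rng.gauss(0, 1) for zi in z]
y = [1.0 * xi + 2.0 * zi + rng.gauss(0, 1) for xi, zi in zip(x, z)]


def centered_cross_product(a, b):
    """Sum of (a_i - mean(a)) * (b_i - mean(b))."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))


sxx = centered_cross_product(x, x)
szz = centered_cross_product(z, z)
sxz = centered_cross_product(x, z)
sxy = centered_cross_product(x, y)
szy = centered_cross_product(z, y)

# Model omitting z: simple regression slope of y on x.
slope_omitting_z = sxy / sxx

# Model including z: coefficient on x from the two-regressor normal equations.
slope_including_z = (szz * sxy - sxz * szy) / (sxx * szz - sxz ** 2)

print(f"slope on x, z omitted:  {slope_omitting_z:.2f}")
print(f"slope on x, z included: {slope_including_z:.2f}")
```

Under these assumptions the first slope should land near 2 and the second near the true value of 1, because the omitted `z` both affects `y` and is correlated with `x`.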
Simpson’s Paradox
It is a phenomenon in probability and statistics in which a trend appears in several groups of data but disappears or reverses when the groups are combined
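A numeric sketch of Simpson's Paradox follows. The counts are invented purely for illustration (they are not real data), and the labels `A`, `B`, `group_1`, and `group_2` are hypothetical. Option `A` has the higher success rate inside each group, yet `B` has the higher rate once the groups are combined.

```python
# Invented counts for illustration only: (successes, trials) per group and option.
data = {
    "group_1": {"A": (8, 10), "B": (70, 100)},
    "group_2": {"A": (20, 100), "B": (1, 10)},
}

# Success rate of each option within each group.
within = {
    group: {option: s / t for option, (s, t) in options.items()}
    for group, options in data.items()
}

# Success rate of each option after pooling the groups.
combined = {}
for option in ("A", "B"):
    successes = sum(data[group][option][0] for group in data)
    trials = sum(data[group][option][1] for group in data)
    combined[option] = successes / trials

for group, rates in within.items():
    print(group, rates)
print("combined", combined)
```

The reversal arises because the two options are spread very unevenly across groups with very different baseline success rates.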
Unit of analysis
The observation described by a set of data, for example voters, parties, bills, elections, voting decisions, or legislative output. Very often our data have multiple levels of analysis (e.g., individuals, regions, countries), calling for different statistical techniques
Variables
Any characteristic related to the unit of analysis. A variable can take on different values for different observations
Types of variables
Nominal (e.g., political party), ordinal (e.g., school grades), interval (e.g., temperature in degrees Celsius), ratio (e.g., duration, GDP)
Data set
Set of variables for a given set of observations. Should come with a codebook
Hypothesis
Statement about the nature of the social and political world, often expressed as a statement about relationships between variables (e.g., “The lower X, the higher Y”)
Cross-section data
Sample of voters, governments, countries, or other units, taken at a given point in time. Observations are typically assumed to be independent
Time series data
Observations on units over time, e.g., number of conflicts in country X. Because past events can influence future events and lags in behavior are prevalent in the social sciences, time is an important dimension in such a data set. Observations are not independent across time (serial correlation)
Pooled time series cross-section data
Data consist of comparable time series data observed on a variety of units. For instance, units are countries, and for each country we observe annual data on a variety of political and economic variables. Typically, we have few units, but long time series. Pooling the data increases the number of observations and makes it possible to control for exogenous shocks. Observations are usually not independent
Panel data
A large number of the same cross-sectional units, e.g., survey respondents, are observed repeatedly over a number of “waves” (interviews). With panel data, the time series is usually very short. Common in studies of political behavior. For example, the German Socio-Economic Panel (SOEP) or the German Internet Panel (GIP) in Mannheim
A histogram
It shows the distribution of the measurements of a variable. It is a bar graph in which the height of each bar shows how many observations fall in a particular subinterval (bin); the bins are plotted along the horizontal axis
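The binning step behind a histogram can be sketched in a few lines. This is a minimal illustration assuming equal-width bins that span the observed minimum to maximum; the function name `histogram_counts` and the sample values are made up for the example.

```python
def histogram_counts(values, n_bins):
    """Count how many values fall into each of n_bins equal-width bins.

    Assumes the values are not all identical (otherwise the bin width is zero).
    """
    lowest, highest = min(values), max(values)
    width = (highest - lowest) / n_bins
    counts = [0] * n_bins
    for value in values:
        # The maximum value would index one past the end, so cap at the last bin.
        index = min(int((value - lowest) / width), n_bins - 1)
        counts[index] += 1
    return counts


values = [1, 2, 2, 3, 5, 6, 6, 6, 9, 10]
counts = histogram_counts(values, n_bins=3)

# Bins here are [1, 4), [4, 7) and [7, 10]; print a crude text histogram.
for i, count in enumerate(counts):
    print(f"bin {i}: {'#' * count}")
```

Each bar height is simply the count for its bin, so changing `n_bins` changes the shape of the picture even though the data stay the same.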
Density plot
Addresses the deficiencies of histograms by averaging and smoothing; it shows an estimate of the probability density function of the random variable X
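One common way to do this averaging and smoothing is a Gaussian kernel density estimate. The card does not say which method is meant, so the kernel, the bandwidth of 1.0, the function name `gaussian_kde`, and the sample values below are all assumptions made for illustration.

```python
import math


def gaussian_kde(data, bandwidth):
    """Return a function that evaluates a Gaussian kernel density estimate."""
    n = len(data)
    norm = n * bandwidth * math.sqrt(2 * math.pi)

    def density(x):
        # Average of normal curves centred on each observation.
        return sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data) / norm

    return density


data = [1, 2, 2, 3, 7]
density = gaussian_kde(data, bandwidth=1.0)

# Rough check that the estimate behaves like a density: area under the curve is about 1.
step = 0.01
area = sum(density(-10 + i * step) * step for i in range(3000))
print(round(area, 3))
```

A larger bandwidth gives a smoother curve and a smaller one follows the data more closely, which plays a role similar to the bin width in a histogram.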
Measures of Central Tendency
Mode, Median, Mean
Mode
Most frequently occurring value of X
Median
Value of X that falls in the middle position when the observations are ordered from smallest to largest. Median = 50th percentile = 2nd quartile
Mean
$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
When is mean = mode = median?
In a perfectly symmetric, unimodal distribution, e.g., the normal distribution
In a right-skewed (positive skew) distribution, which is greater: median or mean?
mean > median
In a left-skewed (negative skew) distribution, which is greater: median or mean?
median > mean
Which is more sensitive to outliers: mean or median?
mean
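The measures of central tendency and the outlier behaviour can be checked with Python's standard `statistics` module. The sample values below are made up for illustration.

```python
import statistics

values = [2, 3, 3, 4, 5]
with_outlier = values + [100]  # same data plus one extreme value

for label, sample in (("original", values), ("with outlier", with_outlier)):
    print(
        label,
        "mean =", statistics.mean(sample),
        "median =", statistics.median(sample),
        "mode =", statistics.mode(sample),
    )

mean_before = statistics.mean(values)
mean_after = statistics.mean(with_outlier)
median_before = statistics.median(values)
median_after = statistics.median(with_outlier)
```

Adding the single value 100 moves the mean from 3.4 to 19.5, while the median only moves from 3 to 3.5. The outlier also makes the sample right-skewed, so the mean ends up above the median.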
Sample Variance: definition and formula
Average of the squared deviations from the mean (dividing by n − 1 rather than n)
$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
Standard Deviation: definition and formula
Square root of the sample variance
$s = \sqrt{s^2}$
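As a worked check of the sample variance and standard deviation, the sketch below computes both by hand and compares them with `statistics.variance` and `statistics.stdev`, which also divide by n − 1. The sample values are made up for illustration.

```python
import math
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(values)
mean = sum(values) / n  # 5.0

# Sum of squared deviations from the mean, then divide by n - 1.
squared_deviations = [(x - mean) ** 2 for x in values]
sample_variance = sum(squared_deviations) / (n - 1)
sample_sd = math.sqrt(sample_variance)

print(sample_variance, statistics.variance(values))
print(sample_sd, statistics.stdev(values))
```

Here the squared deviations sum to 32, so the sample variance is 32 / 7. Dividing by n instead would give the population variance (`statistics.pvariance`).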
Range: definition and formula
Difference between the largest and smallest measurement
$\text{Range} = x_{\max} - x_{\min}$
Interquartile Range (IQR): definition and formula
Difference between the upper and lower quartiles (range of the middle 50% of the distribution)
$\text{IQR} = Q_3 - Q_1$
Q1 in boxplot
25th percentile
Q3 in boxplot
75th percentile
Q2 in boxplot
Median or 50th percentile
Q0 in boxplot
0th percentile, lowest data point excluding outliers
Q4 in boxplot
100th percentile, highest data point excluding outliers
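Lower Whisker
Q1 − 1.5 × IQR (lower limit; the whisker ends at the lowest data point at or above this value)
Upper Whisker
Q3 + 1.5 × IQR (upper limit; the whisker ends at the highest data point at or below this value)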
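The range, IQR, and boxplot quantities can be computed together as a quick check. This is a sketch under one assumption worth flagging: quartile conventions differ between textbooks and software, and the example uses `statistics.quantiles` with `method="inclusive"` (Python 3.8+). The sample values are made up for illustration.

```python
import statistics

values = sorted([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])

value_range = max(values) - min(values)

# Quartiles under the "inclusive" convention (an assumption; others give slightly different cut points).
q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

# Limits from the 1.5 x IQR rule.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Whiskers end at the most extreme data points inside the limits.
inside = [x for x in values if lower_limit <= x <= upper_limit]
lower_whisker, upper_whisker = min(inside), max(inside)
outliers = [x for x in values if x < lower_limit or x > upper_limit]

print("range:", value_range, "Q1:", q1, "median:", q2, "Q3:", q3, "IQR:", iqr)
print("limits:", lower_limit, upper_limit)
print("whiskers:", lower_whisker, upper_whisker, "outliers:", outliers)
```

With these values the limits are −3.5 and 14.5, so the whiskers end at 1 and 9 and the value 50 is drawn as an outlier.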