Introduction. Visualizing Data Flashcards

Question 1

Q

What is spurious correlation?

Answer

A

A spurious correlation occurs when two variables are correlated but have no causal relationship, for example because both depend on a third variable or because the association arose by chance.

Question 2

Q

Omitted variable bias

Answer

A

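It occurs when the model leaves out an independent variable that has a causal effect on the dependent variable and is also correlated with an included independent variable. The coefficient on the included variable then absorbs part of the omitted variable's effect.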
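
A short derivation may make the direction of the bias easier to remember. It is a sketch for the simplest case of one included regressor x and one omitted regressor z; the symbols (beta, delta, u, v) are notation assumed here, not taken from the card.

```latex
% True model (z is the omitted variable):
y_i = \beta_0 + \beta_1 x_i + \beta_2 z_i + u_i

% Relationship between the omitted and the included regressor:
z_i = \delta_0 + \delta_1 x_i + v_i

% Regressing y on x alone gives a slope estimator \tilde{\beta}_1 with
E[\tilde{\beta}_1] = \beta_1 + \beta_2 \delta_1

% The bias \beta_2 \delta_1 vanishes only if z has no effect on y
% (\beta_2 = 0) or z is unrelated to x (\delta_1 = 0).
```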

Question 3

Q

Simpson’s Paradox

Answer

A

It is a phenomenon in probability and statistics in which a trend appears in several groups of data but disappears or reverses when the groups are combined
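
A small numeric check can make the reversal concrete. The counts below are hypothetical (successes, trials) pairs chosen for illustration; the names `data`, `rate`, and `pooled` are likewise made up for this sketch.

```python
from fractions import Fraction

# Hypothetical (successes, trials) counts for two treatments in two groups.
data = {
    "small": {"A": (81, 87), "B": (234, 270)},
    "large": {"A": (192, 263), "B": (55, 80)},
}

def rate(successes, trials):
    """Exact success rate as a fraction."""
    return Fraction(successes, trials)

# Within each group, compare treatment A against treatment B.
within = {g: rate(*d["A"]) > rate(*d["B"]) for g, d in data.items()}

def pooled(treatment):
    """Success rate after combining both groups."""
    s = sum(d[treatment][0] for d in data.values())
    t = sum(d[treatment][1] for d in data.values())
    return rate(s, t)

a_wins_overall = pooled("A") > pooled("B")

print("A better within each group:", within)
print("A better after pooling:", a_wins_overall)
```

With these counts, A has the higher rate in both groups yet the lower rate after pooling, because A was applied mostly to the harder "large" group.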

Question 4

Q

Unit of analysis

Answer

A

The observation described by a set of data. For example, voters, parties, bills, elections, voting decisions, legislative output. Very often our data have multiple levels of analysis (e.g., individuals, regions, countries), calling for different statistical techniques

Question 5

Q

Variables

Answer

A

Any characteristic related to the unit of analysis. A variable can take on different values for different observations

Question 6

Q

Types of variables

Answer

A

Nominal (e.g., political party), ordinal (e.g., school grades), interval (e.g., temperature in degrees Celsius), ratio (e.g., GDP, duration)

Question 7

Q

Data set

Answer

A

Set of variables for a given set of observations. Should come with a codebook

Question 8

Q

Hypothesis

Answer

A

Statement about the nature of the social and political world, often expressed as a statement about relationships between variables (e.g., “The lower X, the higher Y”)

Question 9

Q

Cross-section data

Answer

A

Sample of voters, governments, countries, or other units, taken at a given point in time. Observations are typically assumed to be independent

Question 10

Q

Time series data

Answer

A

Observations on units over time, e.g., number of conflicts in country X. Because past events can influence future events and lags in behavior are prevalent in social sciences, time is an important dimension in such a data set. Observations are not independent across time (serial correlation)
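
Serial correlation can be illustrated with the lag-1 autocorrelation, i.e. the correlation of a series with itself shifted by one period. This is a minimal sketch; the function name `lag1_autocorr` and the two toy series are assumptions made for the example.

```python
def lag1_autocorr(x):
    """Lag-1 sample autocorrelation of a sequence of numbers."""
    n = len(x)
    m = sum(x) / n
    # Covariance between each value and its predecessor ...
    num = sum((x[t] - m) * (x[t - 1] - m) for t in range(1, n))
    # ... scaled by the total squared deviation from the mean.
    den = sum((v - m) ** 2 for v in x)
    return num / den

trending = [1, 2, 3, 4, 5, 6]      # each value resembles the previous one
alternating = [1, -1, 1, -1]       # each value is the opposite of the previous one

print(lag1_autocorr(trending), lag1_autocorr(alternating))
```

A value far from zero, positive or negative, signals that neighboring observations are not independent.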

Question 11

Q

Pooled time series cross-section data

Answer

A

Data consist of comparable time series data observed on a variety of units. For instance, units are countries, and for each country we observe annual data on a variety of political and economic variables. Typically, we have few units, but long time series. Pooling the data increases the number of observations and makes it possible to control for exogenous shocks. Observations are usually not independent.

Question 12

Q

Panel data

Answer

A

A large number of the same cross-sectional units, e.g., survey respondents, are observed repeatedly over a number of “waves” (interviews). With panel data, the time series is usually very short. Common in studies of political behavior. Examples include the German Socio-Economic Panel (SOEP) and the German Internet Panel (GIP) in Mannheim

Question 13

Q

A histogram

Answer

A

It shows the distribution of the measurements of a variable. It is a bar graph in which the height of each bar shows how many observations fall in a particular subinterval (bin); the bins are plotted along the horizontal axis
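
The binning step behind a histogram can be sketched in a few lines. The helper `histogram_counts`, the sample values, and the choice of five equal-width bins on [0, 10] are assumptions for illustration.

```python
def histogram_counts(values, lo, hi, n_bins):
    """Count how many values fall into each of n_bins equal-width bins on [lo, hi]."""
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        if lo <= v <= hi:
            # The right edge hi is assigned to the last bin.
            idx = min(int((v - lo) / width), n_bins - 1)
            counts[idx] += 1
    return counts

values = [1, 2, 2, 3, 5, 6, 6, 6, 9, 10]
counts = histogram_counts(values, 0, 10, 5)

# Text rendering: one row per bin, bar length = number of observations.
for i, c in enumerate(counts):
    print(f"[{i * 2:>2}, {i * 2 + 2:>2}) " + "#" * c)
```

Changing `n_bins` changes the picture noticeably, which is one of the weaknesses of histograms that density plots try to address.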

Question 14

Q

Density plot

Answer

A

Addresses the deficiencies of histograms by averaging and smoothing. It shows an estimate of the probability density function of the random variable X
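
One common way to do the averaging and smoothing is a kernel density estimate. The sketch below assumes a Gaussian kernel and a hand-picked bandwidth `h`; both are illustrative choices, and the function name `gaussian_kde` is made up here rather than taken from a library.

```python
import math

def gaussian_kde(sample, h):
    """Return a function x -> estimated density, using a Gaussian kernel with bandwidth h."""
    n = len(sample)
    norm = n * h * math.sqrt(2 * math.pi)

    def density(x):
        # Average of normal bumps centered on each observation.
        return sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in sample) / norm

    return density

sample = [1, 2, 2, 3, 5]          # hypothetical observations of X
density = gaussian_kde(sample, h=0.8)

# A density integrates to 1; check numerically with a simple Riemann sum.
step = 0.01
grid = [-10 + step * k for k in range(2501)]
area = sum(density(g) for g in grid) * step
print(round(area, 4))
```

A larger `h` gives a smoother curve, a smaller `h` a bumpier one, much like bin width in a histogram.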

Question 15

Q

Measures of Central Tendency

Answer

A

Mode, Median, Mean

Question 16

Q

Mode

Answer

A

Most frequently occurring value of X

Question 17

Q

Median

Answer

A

Value of X that falls in the middle position when the observations are ordered from smallest to largest.
Median = 50th percentile = 2nd quartile

Question 18

Q

Mean

Answer

A

$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
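
The three measures of central tendency can be compared on one small data set. The values in `x` are hypothetical; the `statistics` module is part of the Python standard library.

```python
import statistics

# Hypothetical measurements of a variable X.
x = [2, 3, 3, 5, 7, 10]

mode = statistics.mode(x)      # most frequently occurring value
median = statistics.median(x)  # middle of the ordered data (mean of the two middle values for even n)
mean = sum(x) / len(x)         # (1/n) * sum of all x_i

print(mode, median, mean)
```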

Question 19

Q

When mean=mode=median

Answer

A

In a perfectly symmetric distribution with a single peak, e.g., the normal distribution

Question 20

Q

In a right-skewed (positive skew) distribution, which is greater: median or mean?

Answer

A

mean > median

Question 21

Q

In a left-skewed (negative skew) distribution, which is greater: median or mean?

Answer

A

median > mean
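
The ordering of mean and median under skew can be checked on toy data. Both lists are hypothetical; the left-skewed one is simply the mirror image of the right-skewed one.

```python
import statistics

# A long right tail (the value 20) pulls the mean upward.
right_skewed = [1, 2, 2, 3, 3, 4, 20]
# Mirroring the data produces a long left tail instead.
left_skewed = [-v for v in right_skewed]

for name, data in [("right", right_skewed), ("left", left_skewed)]:
    print(name, "mean:", statistics.mean(data), "median:", statistics.median(data))
```

In both cases the mean is dragged toward the long tail while the median stays near the bulk of the data.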

Question 22

Q

Which is sensitive to outliers: mean or median?

Answer

A

The mean. A single extreme value shifts the mean, while the median depends only on the middle of the ordered data and barely moves

Question 23

Q

Sample Variance: definition and formula

Answer

A

Average of the squared deviations from the mean, with the sum divided by n − 1 rather than n
$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$

Question 24

Q

Sample Variance: definition and formula

Answer

A

Average of the squared deviations from the mean, with the sum divided by n − 1 rather than n
$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$

Question 25

Q

Standard Deviation: definition and formula

Answer

A

Square root of the sample variance
$s = \sqrt{s^2}$
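
The variance and standard deviation formulas can be checked by computing them by hand and comparing with the standard library. The data in `x` are hypothetical.

```python
import math
import statistics

x = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical sample
n = len(x)

x_bar = sum(x) / n                                   # sample mean
s2 = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)    # sample variance, divisor n - 1
s = math.sqrt(s2)                                    # sample standard deviation

# statistics.variance and statistics.stdev also use the n - 1 divisor.
print(s2, statistics.variance(x))
print(s, statistics.stdev(x))
```

The standard deviation is in the same units as X, which is why it is usually reported instead of the variance.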

Question 26

Q

Range: definition and formula

Answer

A

Difference between largest and smallest measurement:
$\text{Range} = x_{\max} - x_{\min}$

Question 27

Q

Interquartile Range (IQR): definition and formula

Answer

A

Difference between upper and lower quartiles (range of the middle 50% of the distribution)
$\text{IQR} = Q_3 - Q_1$
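
Range and IQR react very differently to a single extreme value, which the following sketch illustrates. The data are hypothetical, and quartiles are computed with `statistics.quantiles(..., method="inclusive")`; other quartile conventions exist and can give slightly different numbers.

```python
import statistics

x = [1, 2, 3, 4, 5, 6, 7, 8, 30]   # hypothetical sample with one extreme value

value_range = max(x) - min(x)

# quantiles(n=4) returns the three cut points Q1, Q2 (median), Q3.
q1, q2, q3 = statistics.quantiles(x, n=4, method="inclusive")
iqr = q3 - q1

print("range:", value_range)
print("Q1, Q3, IQR:", q1, q3, iqr)
```

Replacing 30 with 9 would shrink the range from 29 to 8 but leave the IQR unchanged.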

Question 28

Q

Q1 in boxplot

Answer

A

25th percentile

Question 29

Q

Q3 in boxplot

Answer

A

75th percentile

Question 30

Q

Q2 in boxplot

Answer

A

Median or 50th percentile

Question 31

Q

Q0 in boxplot

Answer

A

Minimum (0th percentile): the lowest data point excluding any outliers

Question 32

Q

Q4 in boxplot

Answer

A

Maximum (100th percentile): the highest data point excluding any outliers

Question 33

Q

Lower Whisker

Answer

A

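$Q_1 - 1.5 \times \text{IQR}$ (the lower fence; the whisker is drawn to the smallest observation that is not below it)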

Question 34

Q

Upper Whisker

Answer

A

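$Q_3 + 1.5 \times \text{IQR}$ (the upper fence; the whisker is drawn to the largest observation that is not above it)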
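
The boxplot quantities from the last cards can be put together in one sketch. It assumes the common 1.5 × IQR rule and the "inclusive" quartile method of `statistics.quantiles`; plotting libraries may use a different quartile convention. The data and variable names are hypothetical.

```python
import statistics

x = [1, 2, 3, 4, 5, 6, 7, 8, 30]   # hypothetical sample

q1, q2, q3 = statistics.quantiles(x, n=4, method="inclusive")
iqr = q3 - q1

# Fences: observations beyond them are treated as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

inside = [v for v in x if lower_fence <= v <= upper_fence]
outliers = [v for v in x if v < lower_fence or v > upper_fence]

# Whiskers end at the most extreme observations that are still inside the fences.
lower_whisker = min(inside)
upper_whisker = max(inside)

print("fences:", lower_fence, upper_fence)
print("whiskers:", lower_whisker, upper_whisker)
print("outliers:", outliers)
```

Note that the whisker ends are actual data points, so they usually sit inside the fences rather than exactly on them.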