Introduction. Visualizing Data Flashcards

1
Q

What is a spurious correlation?

A

A spurious correlation occurs when two variables are correlated but neither causes the other, for example because both depend on a third variable or because the association arose by chance.

2
Q

Omitted variable bias

A

Bias that occurs when the model leaves out an independent variable that has a causal effect on the dependent variable and is also correlated with at least one included independent variable.
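A minimal simulation sketch of this bias, using only the standard library. The data-generating process is an assumption made for illustration: x has a true effect of 1.0 on y, the omitted variable z has an effect of 2.0, and x is correlated with z. The helper names (`slope`, `residuals`) are hypothetical.

```python
import random

random.seed(0)
n = 5000

# Assumed data-generating process (illustrative only):
# z drives both x and y, so omitting z biases the slope on x.
z = [random.gauss(0, 1) for _ in range(n)]
x = [zi + random.gauss(0, 1) for zi in z]
y = [1.0 * xi + 2.0 * zi + random.gauss(0, 1) for xi, zi in zip(x, z)]

def mean(v):
    return sum(v) / len(v)

def slope(x_vals, y_vals):
    """OLS slope of y_vals on x_vals (simple regression with intercept)."""
    mx, my = mean(x_vals), mean(y_vals)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x_vals, y_vals))
    sxx = sum((a - mx) ** 2 for a in x_vals)
    return sxy / sxx

def residuals(target, regressor):
    """Residuals of target after regressing it on regressor."""
    b = slope(regressor, target)
    mt, mr = mean(target), mean(regressor)
    return [(t - mt) - b * (r - mr) for t, r in zip(target, regressor)]

# Model that omits z: the slope on x absorbs part of z's effect.
short_slope = slope(x, y)

# Model that controls for z, via partialling z out of both x and y.
long_slope = slope(residuals(x, z), residuals(y, z))

print(f"slope with z omitted:    {short_slope:.2f}")
print(f"slope with z controlled: {long_slope:.2f}")
```

Under these assumptions the slope with z omitted lands near 2 rather than the true 1, while controlling for z recovers a value close to 1.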

3
Q

Simpson’s Paradox

A

It is a phenomenon in probability and statistics in which a trend appears in several groups of data but disappears or reverses when the groups are combined
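A small numeric sketch of the reversal. The counts are invented for illustration and do not come from any real study: option A has the higher success rate inside each group, yet option B looks better once the groups are pooled, because the two options are spread very unevenly across the groups.

```python
# Hypothetical (successes, trials) counts, invented for illustration only.
counts = {
    "group 1": {"A": (9, 10), "B": (80, 100)},
    "group 2": {"A": (30, 100), "B": (2, 10)},
}

def rate(successes, trials):
    return successes / trials

# Success rate of each option within each group.
within = {
    group: {option: rate(*cell) for option, cell in arms.items()}
    for group, arms in counts.items()
}

# Success rate of each option after pooling the groups.
combined = {}
for option in ("A", "B"):
    successes = sum(counts[group][option][0] for group in counts)
    trials = sum(counts[group][option][1] for group in counts)
    combined[option] = rate(successes, trials)

for group, rates in within.items():
    print(group, {k: round(v, 2) for k, v in rates.items()})
print("combined", {k: round(v, 2) for k, v in combined.items()})
```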

4
Q

Unit of analysis

A

The observation described by a set of data, for example voters, parties, bills, elections, voting decisions, or legislative output. Data often have multiple levels of analysis (e.g., individuals, regions, countries), which call for different statistical techniques.

5
Q

Variables

A

Any characteristic related to the unit of analysis. A variable can take on different values for different observations

6
Q

Types of variables

A

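Levels of measurement: nominal (e.g., political party), ordinal (e.g., school grades), interval (e.g., temperature in degrees Celsius), ratio (e.g., duration, GDP). Ratio variables have a true zero point; interval variables do not.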

7
Q

Data set

A

Set of variables for a given set of observations. It should come with a codebook that documents the variables and how they are coded.

8
Q

Hypothesis

A

Statement about the nature of the social and political world, often expressed as a statement about relationships between variables (e.g., “The lower X, the higher Y”).

9
Q

Cross-section data

A

Sample of voters, governments, countries, or other units, taken at a given point in time. Observations are typically assumed to be independent

10
Q

Time series data

A

Observations on units over time, e.g., the number of conflicts in country X. Because past events can influence future events and lags in behavior are prevalent in the social sciences, time is an important dimension in such a data set. Observations are not independent across time (serial correlation).

11
Q

Pooled time series cross-section data

A

Data consist of comparable time series observed on a variety of units. For instance, the units are countries, and for each country we observe annual data on a variety of political and economic variables. Typically, we have few units but long time series. Pooling the data increases the number of observations and makes it possible to control for exogenous shocks. Observations are usually not independent.

12
Q

Panel data

A

A large number of the same cross-sectional units, e.g., survey respondents, are observed repeatedly over a number of “waves” (interviews). With panel data, the time series is usually very short. Common in studies of political behavior, for example the German Socio-Economic Panel (SOEP) or the German Internet Panel (GIP) in Mannheim.

13
Q

A histogram

A

A bar graph that shows the distribution of the measurements of a variable. The subintervals (bins) are plotted along the horizontal axis, and the height of each bar shows how many observations fall in that bin.
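A rough sketch of the counting behind a histogram, assuming equal-width bins. The function name `histogram_counts` and the sample ages are hypothetical, and the function assumes the values are not all identical (otherwise the bin width would be zero).

```python
def histogram_counts(values, n_bins):
    """Count how many values fall into each of n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value is placed in the last bin, not in a bin of its own.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(n_bins + 1)]
    return edges, counts

ages = [18, 21, 22, 25, 29, 31, 34, 35, 41, 58]  # made-up sample
edges, counts = histogram_counts(ages, 4)

# Text rendering: one row per bin, bar length = number of observations.
for left, right, count in zip(edges, edges[1:], counts):
    print(f"{left:5.1f} to {right:5.1f} | {'#' * count}")
```

Changing `n_bins` changes the shape of the picture, which is one of the weaknesses of histograms.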

14
Q

Density plot

A

Addresses the deficiencies of histograms by averaging and smoothing. It shows an estimate of the probability density function of the random variable X.
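One common way to obtain such a smoothed curve is kernel density estimation. The card does not name a method, so the Gaussian kernel, the bandwidth value, and the function name `kde` below are assumptions made for illustration.

```python
import math

def kde(values, bandwidth):
    """Return a function that evaluates a Gaussian kernel density estimate."""
    n = len(values)
    norm = n * bandwidth * math.sqrt(2 * math.pi)

    def density(x):
        # Average of Gaussian bumps centred on each observation.
        return sum(
            math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values
        ) / norm

    return density

sample = [1, 2, 2, 3, 7]  # made-up sample
density = kde(sample, bandwidth=1.0)

# Evaluate the smooth curve on a coarse grid.
for x in range(0, 9):
    print(x, round(density(x), 3))
```

A larger bandwidth gives a smoother curve; a smaller one follows the individual observations more closely.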

15
Q

Measures of Central Tendency

A

Mode, Median, Mean

16
Q

Mode

A

Most frequently occurring value of X

17
Q

Median

A

Value of X that falls in the middle position when the observations are ordered from smallest to largest.
Median = 50th percentile = 2nd quartile

18
Q

Mean

A

$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
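As a quick check, the three measures of central tendency can be computed with Python's standard `statistics` module. The sample values are made up for illustration.

```python
import statistics

values = [1, 2, 2, 3, 7]  # made-up sample

# Mean from the formula: sum of the observations divided by n.
mean_by_formula = sum(values) / len(values)

print("mode:  ", statistics.mode(values))
print("median:", statistics.median(values))
print("mean:  ", mean_by_formula, "(library:", statistics.mean(values), ")")
```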

19
Q

When does mean = median = mode?

A

In a perfectly symmetric, unimodal distribution, e.g., the normal distribution

20
Q

In a right-skewed (positive skew) distribution, which is greater: the median or the mean?

A

mean > median

21
Q

In a left-skewed (negative skew) distribution, which is greater: the median or the mean?

A

median > mean
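A tiny illustration of both skew cards, using made-up samples: a long right tail pulls the mean above the median, and mirroring the sample produces a long left tail with the opposite ordering.

```python
import statistics

right_skewed = [1, 2, 2, 3, 3, 4, 20]       # made-up, long right tail
left_skewed = [-v for v in right_skewed]    # mirrored, long left tail

for name, sample in (("right-skewed", right_skewed), ("left-skewed", left_skewed)):
    print(name, "mean:", statistics.mean(sample), "median:", statistics.median(sample))
```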

22
Q

Which is sensitive to outliers: the mean or the median?

A

mean
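To see the difference, here is a short sketch with made-up numbers: one extreme value is added to a small sample, and the mean and median are compared before and after.

```python
import statistics

sample = [2, 3, 4, 5, 6]          # made-up sample
with_outlier = sample + [100]     # same sample plus one extreme value

print("mean:  ", statistics.mean(sample), "->", statistics.mean(with_outlier))
print("median:", statistics.median(sample), "->", statistics.median(with_outlier))
```

The mean jumps from 4 to 20, while the median only moves from 4 to 4.5.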

23
Q

Sample Variance: definition and formula

A

Approximately the average of the squared deviations from the sample mean (the sum is divided by n − 1 rather than n)
$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$
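A minimal sketch of the sample variance: each deviation from the mean is squared, the squares are summed, and the sum is divided by n − 1. The function name `sample_variance` and the data are illustrative; the result is compared with `statistics.variance` from the standard library.

```python
import statistics

def sample_variance(xs):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    return sum((x - x_bar) ** 2 for x in xs) / (n - 1)

scores = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up sample, mean = 5
s2 = sample_variance(scores)

print(s2)                           # by hand: 32 / 7
print(statistics.variance(scores))  # library version for comparison
```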

24
Q

Standard Deviation: definition and formula

A

Square root of the sample variance
$s = \sqrt{s^2}$
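A short check, on made-up data, that the sample standard deviation is the square root of the sample variance. Unlike the variance, it is expressed in the same units as the data.

```python
import math
import statistics

scores = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up sample

s2 = statistics.variance(scores)   # sample variance (divides by n - 1)
s = statistics.stdev(scores)       # sample standard deviation

print(s, math.sqrt(s2))
```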

25
Q

Range: definition and formula

A

Difference between the largest and smallest measurement:
$\text{Range} = x_{\max} - x_{\min}$

26
Q

Interquartile Range (IQR): definition and formula

A

Difference between the upper and lower quartiles (range of the middle 50% of the distribution)
$\text{IQR} = x_{Q3} - x_{Q1}$
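A small sketch computing the range and the IQR on made-up data with `statistics.quantiles`. Quartile conventions differ between textbooks and software, so other tools may return slightly different Q1 and Q3 for the same data; the default method of `statistics.quantiles` is used here.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]  # made-up sample

# n=4 returns the three cut points Q1, Q2 (median), Q3.
q1, q2, q3 = statistics.quantiles(data, n=4)

value_range = max(data) - min(data)
iqr = q3 - q1

print("Q1:", q1, "Q2:", q2, "Q3:", q3)
print("range:", value_range, "IQR:", iqr)
```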

27
Q

Q1 in boxplot

A

25th percentile (first quartile)

28
Q

Q3 in boxplot

A

75th percentile (third quartile)

29
Q

Q2 in boxplot

A

Median, or 50th percentile

30
Q

Q0 in boxplot

A

Minimum (0th percentile). In a boxplot, the lower whisker ends at the lowest data point that is not an outlier.

31
Q

Q4 in boxplot

A

Maximum (100th percentile). In a boxplot, the upper whisker ends at the highest data point that is not an outlier.

32
Q

Lower Whisker

A

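Extends down to the smallest observation that is at least Q1 − 1.5 × IQR; observations below that bound are plotted as outliers.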

33
Q

Upper Whisker

A

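Extends up to the largest observation that is at most Q3 + 1.5 × IQR; observations above that bound are plotted as outliers.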
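A sketch of the whisker calculation on made-up data. It assumes the common Tukey convention, in which Q1 − 1.5 × IQR and Q3 + 1.5 × IQR act as bounds (fences), the whiskers end at the most extreme observations inside those bounds, and anything outside is flagged as an outlier. Plotting libraries may use other conventions.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 30]  # made-up sample with one extreme value

q1, _, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

# Bounds beyond which observations count as outliers (Tukey convention).
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme observations inside the bounds.
lower_whisker = min(v for v in data if v >= lower_fence)
upper_whisker = max(v for v in data if v <= upper_fence)
outliers = [v for v in data if v < lower_fence or v > upper_fence]

print("fences:  ", lower_fence, upper_fence)
print("whiskers:", lower_whisker, upper_whisker)
print("outliers:", outliers)
```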