Chapters 1 & 2 (Intro to Data and Visualizations) Flashcards

1
Q

Descriptive statistics

A

numbers used to summarize and describe data; do not involve generalizing beyond the data at hand

2
Q

Sample

A

a finite set of observations or a small subset of data drawn from a population or a larger subset of data to make inferences about the latter

3
Q

Population

A

larger set of data from which a sample is drawn; cannot be observed because it is theoretical

4
Q

Inferential statistics

A

mathematical procedures whereby we convert information about a sample into intelligent guesses about the population, assuming sampling is random

5
Q

Simple random sampling

A

Every member of the population has an equal chance of being selected as part of the sample; the selection of one member is independent of the selection of any other

6
Q

What is the importance of sample size?

A

Random samples, especially those with a small sample size, are not always representative of the population

7
Q

Random assignment

A

random division of the sample into two groups
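
A minimal sketch of random assignment, assuming an even split into two groups; the function name `random_assignment` and the participant IDs are hypothetical:

```python
import random

def random_assignment(participants, seed=0):
    """Shuffle the sample, then split it into two groups."""
    rng = random.Random(seed)          # seeded only so the sketch is repeatable
    shuffled = list(participants)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Hypothetical sample of 20 participant IDs.
treatment, control = random_assignment(range(1, 21))
print(len(treatment), len(control))  # 10 10
```

Because chance alone decides the group, no participant characteristic can systematically line up with the treatment.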

8
Q

What is the difference between failing to randomize assignment and having a non-random sample?

A

Failing to randomize invalidates the experimental findings while a non-random sample just restricts the generalizability of the results

9
Q

Stratified random sampling

A

(1) Identify the members of your sample that belong to each stratum or group in the population (2) Randomly sample from each subgroup so that the sizes of the subgroups in the sample are proportional to their sizes in the population
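
The two steps can be sketched as follows. This is an illustration only: the helper `stratified_sample`, the strata labels, and the 60/40 population are all made up for the example, and rounding is used for proportional allocation.

```python
import random

def stratified_sample(population, stratum_of, n, seed=0):
    """Draw about n members so each stratum keeps its population share."""
    rng = random.Random(seed)
    # Step 1: group members by the stratum they belong to.
    strata = {}
    for member in population:
        strata.setdefault(stratum_of(member), []).append(member)
    # Step 2: sample randomly within each stratum, in proportion to its size.
    sample = []
    for members in strata.values():
        k = round(n * len(members) / len(population))
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical population: 60 undergraduates and 40 graduates.
population = [("undergrad", i) for i in range(60)] + [("grad", i) for i in range(40)]
sample = stratified_sample(population, lambda m: m[0], n=10)
counts = {s: sum(1 for m in sample if m[0] == s) for s in ("undergrad", "grad")}
print(counts)  # {'undergrad': 6, 'grad': 4}
```

Which members are drawn depends on the seed, but the 6-to-4 split always mirrors the 60-to-40 split in the population.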

10
Q

Variables

A

properties or characteristics of some event, object, or person (observations) that can take on different values or amounts

11
Q

What is the number of levels of an independent variable?

A

the number of experimental conditions

12
Q

Qualitative vs. Quantitative variables

A

Qualitative/nominal/categorical variables have no numerical ordering but are often coded or represented with numbers, while quantitative variables are measured in numbers with some kind of unit

13
Q

Discrete vs. Continuous variables

A

Discrete variables take values in distinct steps (e.g. whole numbers on the scale), while continuous variables can take any value in a range, including decimals, and are not made of discrete steps

13
Q

What is the value of the area under the normal distribution bell curve?

A

1

14
Q

What is the probability of any exact value of x in a normal distribution?

A

0; the more precise the value of x, the closer the probability is to 0

15
Q

What does the area under the normal distribution curve and bounded between two given points on the x-axis represent?

A

the probability that a randomly chosen number will fall between the two points
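
As a rough illustration, the area between two points can be computed as a difference of two cumulative probabilities. The sketch below assumes a standard normal distribution (mean 0, standard deviation 1) and uses Python's standard-library `statistics.NormalDist`; the helper name `prob_between` is made up for the example.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def prob_between(dist, a, b):
    """Area under the curve between a and b: P(a < X <= b)."""
    return dist.cdf(b) - dist.cdf(a)

within_one_sd = prob_between(standard_normal, -1, 1)
print(round(within_one_sd, 4))                    # 0.6827

# An interval of zero width has zero area, so any exact value has probability 0.
print(prob_between(standard_normal, 0.5, 0.5))    # 0.0
```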

16
Q

Positively skewed distribution

A

“skewed to the right,” the longer tail extends in the positive direction

17
Q

Bimodal distribution

A

a distribution with two peaks

18
Q

What are the different kinds of kurtosis in a distribution?

A

Leptokurtic (long tails; more scores in the tails) and platykurtic (short tails; fewer scores in the tails)

19
Q

Probability distribution

A

specifies the probability of different events (or combinations) in a population

20
Q

Event

A

a specific combination of attributes observed in a particular observation; a generalization of a specific attribute to a combination of attributes

21
Q

Distribution

A

specifies the likelihood of different events in the population using a probability

22
Q

Objective probability

A

the probability of an event is the relative frequency of that event occurring if the situation is observed frequently (Frequentist); can be measured and repeated

23
Q

Subjective probability

A

the probability of an event is the belief about the likelihood that the given event will occur (Bayesian)

24
Q

Probability density function (PDF)

A

describes how likely a specific value is to occur; called a “density” because a continuous variable may not have well-defined individual values, so probability is only meaningful over a range of values

25
Q

What do you call a PDF for a discrete variable?

A

probability mass function (PMF)

26
Q

What notation (f) is used to express the value of the PDF for a variable x at value v?

A

f_x(v) = P(X = v), the probability of x taking on the value v

27
Q

Frequency tables

A

shows the frequencies of various response categories and their relative frequencies (proportion of responses in each category)
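
A small sketch of building such a table, using made-up survey responses and the standard-library `collections.Counter`:

```python
from collections import Counter

# Hypothetical responses to "What is your favourite colour?"
responses = ["red", "blue", "red", "green", "red", "blue", "red", "blue"]

counts = Counter(responses)
n = len(responses)

# category -> (frequency, relative frequency)
table = {category: (freq, freq / n) for category, freq in counts.most_common()}

print(f"{'category':<10}{'freq':>6}{'rel. freq':>12}")
for category, (freq, rel) in table.items():
    print(f"{category:<10}{freq:>6}{rel:>12.3f}")
```

Here red has frequency 4 and relative frequency 0.5; multiplying a relative frequency by 100 gives the percentage a pie chart slice would represent.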

28
Q

Pie chart

A

each category is represented by a slice in the pie where the area of the slice is proportional to the percentage of responses in the category (relative frequency x 100)

29
Q

When is a pie chart effective?

A

when displaying the relative frequencies of a small number of categories

30
Q

Datum (plural data)

A

information which is relevant for a decision or for drawing a conclusion; can be seen as a single attribute of a single observation aggregated together to produce a dataset

31
Q

Dataset

A

a collection of data, usually of different types

32
Q

When does information become data?

A

when it is used to answer a question

33
Q

How is good and bad data determined?

A

how suitable it is to a question

34
Q

Observations

A

a fundamental unit of analysis; like snapshots of a particular economic process or situation; have different attributes that are measured or described in different ways (the data)

35
Q

Quantitative vs Ordinal data

A

both order and distance (i.e. relative size) matter in quantitative data; only order matters in ordinal data, not magnitude

36
Q

Why is ordinal data a special case?

A

it can be thought of as an intermediate data type with both qualitative and quantitative properties

37
Q

Empirical model

A

a description of the processes which create the data we observe; how economists conceptualize the real-world forces around us

38
Q

What are the consequences of attempting to create narratives out of too little or too much data?

A

(1) A bias toward the measurable in our analysis even if it may not be appropriate e.g. streetlight effect (2) A lack of meaning or “junk” statistics, the creation of technically correct but realistically meaningless analysis

39
Q

What are the two ways to think about populations?

A

statistical population and real-world population

40
Q

Statistical population

A

a theoretical object representing all possible realizations of different observations and their likelihood of occurring

41
Q

Real-world population

A

the population exists but is incompletely or imperfectly observed

42
Q

Statistical inference

A

the use of statistics to overcome the limitation that we are working with a sample, not the population

43
Q

When is a sample considered representative of the population?

A

if a sample is large enough and drawn independently, meaning creating one observation does not affect other observations

44
Q

Representative sample

A

attributes occur in the sample in roughly similar proportion to how they occur in the population

45
Q

Weighting

A

sampling populations of particular interest more heavily, then adjusting your results to compensate; every observation gets an “importance” value based on the group it represents
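
One common way to do the adjustment, sketched here under made-up numbers, is to weight each respondent by the ratio of their group's population share to its sample share. The groups, shares, and responses below are all hypothetical.

```python
# Hypothetical survey: group B is 10% of the population but was deliberately
# oversampled so that it makes up half of the sample.
population_share = {"A": 0.9, "B": 0.1}
sample = [("A", 10.0)] * 50 + [("B", 20.0)] * 50  # (group, response)

# Share of each group in the sample.
sample_share = {
    g: sum(1 for group, _ in sample if group == g) / len(sample)
    for g in population_share
}

# "Importance" value for each group: population share relative to sample share.
weights = {g: population_share[g] / sample_share[g] for g in population_share}

unweighted_mean = sum(value for _, value in sample) / len(sample)
weighted_mean = (
    sum(weights[g] * value for g, value in sample)
    / sum(weights[g] for g, _ in sample)
)

print(unweighted_mean)          # 15.0
print(round(weighted_mean, 2))  # 11.0
```

The unweighted mean overstates group B's influence; the weighted mean matches what a 90/10 population would give (0.9 x 10 + 0.1 x 20 = 11).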

46
Q

Experimental data

A

collected in an experimental or controlled method wherein an experimenter intervenes or controls to manipulate some element of the setting

47
Q

Observational data

A

collected in a naturalistic setting without control or any explicit intervention; most common in economics

48
Q

How is experimental data powerful?

A

it allows us to focus on a single, well-controlled variable

49
Q

How is observational data powerful?

A

it can study many different questions at once

50
Q

Observations as combinations

A

observations are specific combinations of levels for the variables in the population

51
Q

Probability

A

used in a distribution to specify the likelihood of different events in the population

52
Q

What are 2 ways to interpret the probability of an event?

A

objective probability and subjective probability

53
Q

What are the 2 ways to express the distribution of a variable?

A

probability density function (PDF) and cumulative distribution function (CDF)

54
Q

Cumulative distribution function (CDF), sometimes loosely called the cumulative density function

A

the probability of obtaining a result less than or equal to the value; used when a variable is quantitative to express how extreme a value is

55
Q

What notation (f) is used for the value of the CDF for a variable x at value v?

A

F_x(v) = P(X <= v); this implies that you can calculate the CDF from the PDF by adding up the probability of all values less than or equal to v
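
For a discrete variable, that adding-up is a running total of the PMF. A minimal sketch, assuming a fair six-sided die as the example variable:

```python
from fractions import Fraction
from itertools import accumulate

# PMF of a fair six-sided die: each face has probability 1/6.
values = [1, 2, 3, 4, 5, 6]
pmf = {v: Fraction(1, 6) for v in values}

# CDF at v = running sum of the PMF over all values <= v.
cdf = dict(zip(values, accumulate(pmf[v] for v in values)))

print(cdf[4])  # 2/3
print(cdf[6])  # 1
```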

56
Q

Joint distributions

A

the probability of combinations of variables occurring; let us assess the relationships between different variables

57
Q

What is the notation for the joint PDF of x and y?

A

f_x,y(v, w) = P(X = v, Y = w)

58
Q

What is the notation for the joint CDF of x and y?

A

F_x,y(v, w) = P(X <= v, Y <= w)
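
To make the joint notation concrete, here is a sketch with a made-up joint PMF for two binary variables. The probabilities are invented for illustration, and `joint_cdf` and `marginal_x` are hypothetical helper names.

```python
from fractions import Fraction

# Hypothetical joint PMF: keys are (value of x, value of y).
joint_pmf = {
    (0, 0): Fraction(1, 4), (0, 1): Fraction(1, 4),
    (1, 0): Fraction(1, 8), (1, 1): Fraction(3, 8),
}

def joint_cdf(v, w):
    """P(X <= v, Y <= w): add up every cell at or below (v, w)."""
    return sum(p for (x, y), p in joint_pmf.items() if x <= v and y <= w)

def marginal_x(v):
    """P(X = v): add up the joint PMF over every value of y."""
    return sum(p for (x, _), p in joint_pmf.items() if x == v)

print(joint_pmf[(1, 1)])  # 3/8
print(joint_cdf(1, 0))    # 3/8
print(marginal_x(1))      # 1/2
```

If x and y were unrelated, P(X = 1, Y = 1) would equal P(X = 1) x P(Y = 1) = 1/2 x 5/8 = 5/16; the table gives 3/8 instead, which is the kind of relationship a joint distribution lets you assess.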

59
Q

What are some ways of representing a distribution?

A

tables of probabilities, figures or visualizations, equations for the PDF or CDF, etc.

60
Q

Empirical distributions

A

how likely observations in a sample take on a particular value or combination of values

61
Q

What is the notation for empirical distributions?

A

f^_x(v) and F^_x(v) (f and F with a hat) for the PDF and CDF respectively

62
Q

What is the relationship between a theoretical distribution and an empirical distribution?

A

If a sample is representative of the population and large enough, the empirical distribution (f^) will closely approximate the theoretical distribution (f)
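
A quick simulation can illustrate this. The sketch below assumes a fair die as the theoretical distribution and reports the largest gap between the empirical and theoretical PMF for a few sample sizes; the seed and the sample sizes are arbitrary choices.

```python
import random
from collections import Counter

def empirical_pmf(sample):
    """Relative frequency of each value observed in the sample."""
    counts = Counter(sample)
    return {v: counts[v] / len(sample) for v in sorted(counts)}

rng = random.Random(42)   # seeded only so the sketch is repeatable
theoretical = 1 / 6       # f(v) for every face of a fair die

gaps = {}
for n in (50, 5000, 100000):
    rolls = [rng.randint(1, 6) for _ in range(n)]
    f_hat = empirical_pmf(rolls)
    # Largest distance between f^(v) and f(v) over the six faces.
    gaps[n] = max(abs(f_hat.get(v, 0) - theoretical) for v in range(1, 7))
    print(n, round(gaps[n], 4))
```

The gap for the largest sample should be far smaller than what is typical for a sample of 50, although any single small sample can land close to the truth by chance.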

63
Q

Statistic

A

any object or quantity computed from a sample which is used to understand the sample or the population from which it is drawn

64
Q

What are 3 ways to think about visualizations?

A

(1) As a way to describe or visualize a set of statistics in order to understand the relationships within the data itself (2) As a statistic itself that takes in data and outputs a visualization (3) A combination of different elements called aesthetics

65
Q

Charts

A

diagrams in which data is graphically represented using symbols (and their relationships); different from an illustration and an infographic; can be static or interactive

66
Q

What are the components of a bar chart?

A

it has an axis (usually x-axis) that represents categories of a variable and another axis (usually y-axis) that represents the levels or quantities of another variable

67
Q

How is a relationship illustrated in a bar chart?

A

the relationship between the two axes is depicted via bars, where the height represents the value of the y variable for observations with the specific x value

68
Q

What are the types of bar charts?

A

histogram, frequency chart, stacked bar chart, 100% bar chart

69
Q

Histogram

A

a bar chart where the x variable is quantitative and is broken up into ordered “bins”; a common way to show a PDF
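
The binning step can be sketched in a few lines. This assumes equal-width bins and made-up height data; the function name `histogram` and its parameters are hypothetical.

```python
def histogram(values, low, width, n_bins):
    """Count how many values fall in each equal-width, ordered bin."""
    counts = [0] * n_bins
    for v in values:
        i = int((v - low) // width)   # index of the bin that contains v
        if 0 <= i < n_bins:           # values outside the range are ignored
            counts[i] += 1
    return counts

# Hypothetical heights in cm, binned as 150-159, 160-169, 170-179.
heights = [150, 152, 158, 161, 163, 164, 167, 171, 172, 179]
counts = histogram(heights, low=150, width=10, n_bins=3)
print(counts)  # [3, 4, 3]

# A text version of the bar chart: one bar per bin.
for i, c in enumerate(counts):
    start = 150 + 10 * i
    print(f"{start}-{start + 9}: {'#' * c}")
```

Dividing each count by the number of observations turns the bar heights into relative frequencies, which is what makes the histogram read as an estimate of the PDF.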

70
Q

Frequency chart

A

a bar chart where the y variable is the frequency of a category in the sample

71
Q

Stacked bar chart

A

it combines several observation types by category to give an idea of the composition

72
Q

100% bar chart

A

it normalizes the height of the bars, usually combined with stacking or something else

73
Q

What are bar charts used for?

A

to show changes in level or composition, to illustrate the distribution of variables in a dataset when y is the frequency (relative or absolute)

74
Q

When are bar charts most effective?

A

(1) the x variable is categorical or can be sensibly divided into relatively few categories and the y variable is summarizable by those categories (2) the objective of the visualization is to compare across categories of the x variable (3) there is a meaningful representation of the separation between the categories

75
Q

When do bar charts tend not to work?

A

when (1) data dimensionality is high and (2) variation across categories is hard to compare

76
Q

Bivariate charts

A

summarize the relationship between 2 distinct variables, not derived or related to one another; do not rely on space or time for their relationships or visualization

77
Q

What are the types of bivariate charts?

A

scatterplots, line plots, and connected line plots

78
Q

Scatterplots

A

two variables are plotted using points against one another

79
Q

Line plots

A

points are ordered according to some variable then connected with a line; sometimes called a time series plot when the x-axis is time

80
Q

Connected line plots

A

a scatterplot is connected using a line

81
Q

Dependence

A

the ability of one variable to predict the values of another variable e.g. correlation; says nothing about whether such a relationship is causal or not

82
Q

Aesthetic

A

any visual property of a visualization that can be mapped to a variable or held constant e.g. the distance represented by the x and y axes, color, shape, size, transparency, thickness, etc.

83
Q

Information density

A

how many dimensions of data are being displayed relative to the number of aesthetic dimensions being displayed

84
Q

Low information density

A

when aesthetic dimensions are greater than dimensions of data or variables

85
Q

Tufte’s Principles

A

(1) Physical representation of numbers should be proportional to the numerical quantities represented (2) Labels clarify the data (3) Show data variation, not design variation (4) Number of dimensions in the figure should not exceed the number in the data (5) Graphics must not quote data out of context (6) Use deflated or standardized units, especially in a time series