Chapter 1 & 2 (Intro to Data and Visualizations) Flashcards
Descriptive statistics
numbers used to summarize and describe data; do not involve generalizing beyond the data at hand
Sample
a finite set of observations or a small subset of data drawn from a population or a larger subset of data to make inferences about the latter
Population
larger set of data from which a sample is drawn; cannot be observed because it is theoretical
Inferential statistics
mathematical procedures whereby we convert information about a sample into intelligent guesses about the population, assuming sampling is random
Simple random sampling
Every member of the population has an equal chance of being selected as part of the sample; the selection of one member is independent of the selection of any other
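A minimal sketch of simple random sampling, assuming a made-up population of 100 member IDs (the population, seed, and sample size are illustrative assumptions, not from the course material):

```python
import random

# Hypothetical population of 100 member IDs (an assumption for illustration).
population = list(range(1, 101))

rng = random.Random(42)  # fixed seed so the draw is reproducible

# random.sample draws without replacement; every member is equally likely
# to end up in the sample.
sample = rng.sample(population, k=10)
print(sample)
```

Because `random.sample` draws without replacement, the draws are only approximately independent when the population is much larger than the sample; `rng.choices(population, k=10)` draws with replacement instead.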
What is the importance of sample size?
Random samples, especially those with a small sample size, are not always representative of the population
Random assignment
random division of the sample into two groups
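Random assignment can be sketched as a shuffle followed by a split. The participant IDs, seed, and group names below are hypothetical and chosen only for illustration:

```python
import random

# Hypothetical sample of 20 participant IDs (an assumption for illustration).
participants = list(range(1, 21))

rng = random.Random(7)      # fixed seed so the split is reproducible
shuffled = participants[:]  # copy, so the original order is kept
rng.shuffle(shuffled)

# First half goes to the treatment group, second half to the control group.
half = len(shuffled) // 2
treatment, control = shuffled[:half], shuffled[half:]
print(treatment, control)
```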
What is the difference between failing to randomize assignment and having a non-random sample?
Failing to randomize invalidates the experimental findings while a non-random sample just restricts the generalizability of the results
Stratified random sampling
(1) Identify which stratum (subgroup) each member of the population belongs to (2) Randomly sample from each stratum so that the subgroup sizes in the sample are proportional to those in the population
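The two steps can be sketched as follows, assuming a made-up population with three strata in a 60/30/10 split. The function name `stratified_sample` and all the data are hypothetical:

```python
import random
from collections import defaultdict

# Hypothetical population of (member_id, stratum) pairs: 60 "A", 30 "B", 10 "C".
population = (
    [(i, "A") for i in range(60)]
    + [(i, "B") for i in range(60, 90)]
    + [(i, "C") for i in range(90, 100)]
)

def stratified_sample(members, fraction, seed=0):
    """Draw the same fraction from every stratum (proportional allocation)."""
    rng = random.Random(seed)
    # Step 1: group members by stratum.
    strata = defaultdict(list)
    for member_id, stratum in members:
        strata[stratum].append(member_id)
    # Step 2: simple random sample within each stratum.
    return {
        stratum: rng.sample(ids, round(len(ids) * fraction))
        for stratum, ids in strata.items()
    }

sample = stratified_sample(population, fraction=0.2)
sizes = {stratum: len(ids) for stratum, ids in sample.items()}
print(sizes)  # {'A': 12, 'B': 6, 'C': 2}
```

The sample keeps the 60/30/10 proportions of the population, which a small simple random sample would not guarantee.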
Variables
properties or characteristics of some event, object, or person (observations) that can take on different values or amounts
What is the number of levels of an independent variable?
the number of experimental conditions
Qualitative vs. Quantitative variables
Qualitative/nominal/categorical variables have no numerical ordering but are often coded or represented with numbers, while quantitative variables are measured in numbers with some kind of unit
Discrete vs. Continuous variables
Discrete variables take only separate, countable values (typically whole numbers), while continuous variables can take any value in a range, including decimals, and are not made of discrete steps
What is the value of the area under the normal distribution bell curve?
1
What is the probability of any exact value of x in a normal distribution?
0; the more precise the value of x, the closer the probability is to 0
What does the area under the normal distribution curve and bounded between two given points on the x-axis represent?
the probability that a randomly chosen number will fall between the two points
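The three normal-curve facts above (total area 1, probability 0 at an exact value, area between two points as a probability) can be checked numerically. This is a sketch using the standard library's `statistics.NormalDist` with a standard normal (mean 0, standard deviation 1); the helper name `prob_between` is an assumption made for illustration:

```python
from statistics import NormalDist

z = NormalDist(mu=0, sigma=1)  # standard normal distribution

def prob_between(dist, a, b):
    """Area under the curve between a and b, computed as CDF(b) - CDF(a)."""
    return dist.cdf(b) - dist.cdf(a)

print(prob_between(z, -1, 1))     # roughly 0.68: within one SD of the mean
print(prob_between(z, -10, 10))   # essentially the whole area, about 1
print(prob_between(z, 0.5, 0.5))  # an exact value has zero width, so 0.0
```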
Positively skewed distribution
“skewed to the right,” the longer tail extends in the positive direction
Bimodal distribution
a distribution with two peaks
What are the different kinds of kurtosis in a distribution?
Leptokurtic (long, heavy tails; more scores in the tails) and platykurtic (short, light tails)
Probability distribution
specifies the probability of different events (or combinations) in a population
Event
a specific combination of attributes observed in a particular observation; a generalization of a specific attribute to a combination of attributes
Distribution
specifies the likelihood of different events in the population using a probability
Objective probability
the probability of an event is the relative frequency of that event occurring if the situation is observed frequently (Frequentist); can be measured and repeated
Subjective probability
the probability of an event is the belief about the likelihood that the given event will occur (Bayesian)
Probability density function (PDF)
describes how likely each value of a variable is; called a “density” because, for a continuous variable, any single exact value has probability 0, so probabilities come from the area under the function over a range of values
What do you call a PDF for a discrete variable?
probability mass function (PMF)
What notation (f) is used to express the value of the PDF for a variable x at value v?
fx(v)= P(X=v), the probability of x taking on the value v
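For a discrete variable the notation can be made concrete with a small PMF. The sketch below assumes X is one roll of a fair six-sided die; the names `pmf` and `f_X` are chosen for illustration:

```python
from fractions import Fraction

# Hypothetical discrete variable X: one roll of a fair six-sided die.
pmf = {v: Fraction(1, 6) for v in range(1, 7)}

def f_X(v):
    """Value of the PMF at v, i.e. P(X = v); 0 for values X cannot take."""
    return pmf.get(v, Fraction(0))

print(f_X(3))             # 1/6
print(f_X(7))             # 0
print(sum(pmf.values()))  # 1: the probabilities of all values add up to one
```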
Frequency tables
shows the frequencies of various response categories and their relative frequencies (proportion of responses in each category)
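A frequency table can be built in a few lines. This sketch assumes a made-up list of six survey responses:

```python
from collections import Counter

# Hypothetical survey responses (an assumption for illustration).
responses = ["yes", "no", "yes", "unsure", "yes", "no"]

counts = Counter(responses)
n = len(responses)

# Each category maps to (frequency, relative frequency).
table = {category: (count, count / n) for category, count in counts.items()}

for category, (count, rel_freq) in table.items():
    print(f"{category:<8}{count:>3}{rel_freq:>8.2f}")
```

Multiplying a relative frequency by 100 gives the percentage that a pie chart slice would represent.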
Pie chart
each category is represented by a slice in the pie where the area of the slice is proportional to the percentage of responses in the category (relative frequency x 100)
When is a pie chart effective?
when displaying the relative frequencies of a small number of categories
Datum (plural data)
information which is relevant for a decision or for drawing a conclusion; can be seen as a single attribute of a single observation aggregated together to produce a dataset
Dataset
a collection of data, usually of different types
When does information become data?
when it is used to answer a question
How is good and bad data determined?
by how suitable it is for answering the question at hand
Observations
a fundamental unit of analysis; like snapshots of a particular economic process or situation; have different attributes that are measured or described in different ways (the data)
Quantitative vs Ordinal data
both order and distance (i.e. relative size) matter in quantitative data; only order matters in ordinal data, not magnitude
Why is ordinal data a special case?
it can be thought of as an intermediate data type with both qualitative and quantitative properties
Empirical model
a description of the processes which create the data we observe; how economists conceptualize the real-world forces around us
What are the consequences of attempting to create narratives out of too little or too much data?
(1) A bias toward the measurable in our analysis even if it may not be appropriate e.g. streetlight effect (2) A lack of meaning or “junk” statistics, the creation of technically correct but realistically meaningless analysis
What are the two ways to think about populations?
statistical population and real-world population
Statistical population
a theoretical object representing all possible realizations of different observations and their likelihood of occurring
Real-world population
the population exists but is incompletely or imperfectly observed
Statistical inference
the use of statistics to overcome the limitation that we are working with a sample, not the population
When is a sample considered representative of the population?
if a sample is large enough and drawn independently, meaning creating one observation does not affect other observations
Representative sample
attributes occur in the sample in roughly similar proportion to how they occur in the population
Weighting
sampling groups of particular interest more heavily, then adjusting your results so that each group counts in proportion to its share of the population; every observation gets an “importance” value (weight) based on the group it represents
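One common way to do the adjustment, sketched here under made-up numbers, is to weight each observation by its group's population share divided by its sample share. The groups, shares, and outcome values are all hypothetical:

```python
# Hypothetical population shares (assumptions for illustration).
population_share = {"urban": 0.8, "rural": 0.2}

# Rural respondents were deliberately oversampled: half of the sample.
# Each observation is (group, outcome).
sample = [("urban", 10.0)] * 50 + [("rural", 20.0)] * 50
n = len(sample)

sample_share = {
    g: sum(1 for group, _ in sample if group == g) / n for g in population_share
}
# Weight = how much each observation should count to match the population.
weight = {g: population_share[g] / sample_share[g] for g in population_share}

unweighted_mean = sum(y for _, y in sample) / n
weighted_mean = (
    sum(weight[g] * y for g, y in sample) / sum(weight[g] for g, _ in sample)
)
print(unweighted_mean, weighted_mean)  # about 15 versus about 12
```

The unweighted mean overstates the rural outcome because rural respondents are overrepresented; the weighted mean matches 0.8 × 10 + 0.2 × 20 = 12.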
Experimental data
collected in an experimental or controlled method wherein an experimenter intervenes or controls to manipulate some element of the setting
Observational data
collected in a naturalistic setting without control or any explicit intervention; most common in economics
How is experimental data powerful?
it allows us to focus on a single, well-controlled variable
How is observational data powerful?
it can study many different questions at once
Observations as combinations
observations are specific combinations of levels for the variables in the population
Probability
used in a distribution to specify the likelihood of different events in the population
What are 2 ways to interpret the probability of an event?
objective probability and subjective probability
What are the 2 ways to express the distribution of a variable?
probability density function and cumulative distribution function
Cumulative distribution function (CDF)
the probability of obtaining a result no larger than a given value; used when a variable is quantitative to express how extreme a value is
What notation (f) is used for the value of the CDF for a variable x at value v?
Fx(v) = P(X<=v); this implies that you can calculate the CDF from the PDF by adding up the probability of all values less than or equal to v
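For a discrete variable, "adding up" is a literal sum over the PMF. A minimal sketch, again assuming a fair six-sided die (the names `pmf` and `F_X` are illustrative):

```python
from fractions import Fraction

# Hypothetical discrete variable X: one roll of a fair six-sided die.
pmf = {v: Fraction(1, 6) for v in range(1, 7)}

def F_X(v):
    """CDF at v: add up P(X = x) for every possible x with x <= v."""
    return sum((p for x, p in pmf.items() if x <= v), Fraction(0))

print(F_X(3))  # 1/2
print(F_X(6))  # 1
print(F_X(0))  # 0
```

For a continuous variable the sum becomes an integral of the PDF up to v.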
Joint distributions
the probability of combinations of variables occurring; let us assess the relationships between different variables
What is the notation for the joint PDF of x and y?
f x,y (v,w) = P(X=v, Y=w)
What is the notation for the joint CDF of x and y?
F x,y (v,w) = P(X<=v, Y<=w)
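The joint PDF and joint CDF can be illustrated with a small table of probabilities. The sketch assumes two hypothetical binary variables X and Y with made-up joint probabilities; `f_XY` and `F_XY` are illustrative names:

```python
from fractions import Fraction

# Hypothetical joint PMF for two binary variables X and Y.
# Keys are (value of X, value of Y); the probabilities sum to 1.
joint_pmf = {
    (0, 0): Fraction(1, 4),
    (0, 1): Fraction(1, 4),
    (1, 0): Fraction(1, 8),
    (1, 1): Fraction(3, 8),
}

def f_XY(v, w):
    """Joint PMF: P(X = v, Y = w)."""
    return joint_pmf.get((v, w), Fraction(0))

def F_XY(v, w):
    """Joint CDF: P(X <= v, Y <= w), summing every qualifying combination."""
    return sum(
        (p for (x, y), p in joint_pmf.items() if x <= v and y <= w),
        Fraction(0),
    )

print(f_XY(1, 1))  # 3/8
print(F_XY(1, 0))  # 3/8, from (0, 0) and (1, 0)
print(F_XY(1, 1))  # 1
```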
What are some ways of representing a distribution?
tables of probabilities, figures or visualizations, equations for the PDF or CDF, etc.
Empirical distributions
how likely observations in a sample take on a particular value or combination of values
What is the notation for empirical distributions?
f^x(v) and F^x(v) for PDF and CDF respectively
What is the relationship between a theoretical distribution and an empirical distribution?
If a sample is representative of the population and large enough, the empirical distribution (f^) will closely approximate the theoretical distribution (f)
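A small simulation can illustrate this relationship. The sketch assumes a fair six-sided die as the theoretical distribution and compares a small and a large simulated sample; the sample sizes, seed, and helper names are arbitrary choices for illustration:

```python
import random
from collections import Counter

rng = random.Random(0)  # fixed seed so the simulation is reproducible
THEORETICAL = 1 / 6     # P(X = v) for every face of a fair die

def empirical_pmf(n_rolls):
    """Relative frequency of each face in n_rolls simulated rolls."""
    counts = Counter(rng.randint(1, 6) for _ in range(n_rolls))
    return {v: counts[v] / n_rolls for v in range(1, 7)}

def max_gap(pmf):
    """Largest distance between the empirical and theoretical probabilities."""
    return max(abs(p - THEORETICAL) for p in pmf.values())

small = empirical_pmf(30)
large = empirical_pmf(60_000)
print(max_gap(small), max_gap(large))
```

The gap for the large sample is typically much smaller than for the small one, though any single small sample can land close to the theoretical values by chance.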
Statistic
any object or quantity computed from a sample which is used to understand the sample or the population from which it is drawn
What are 3 ways to think about visualizations?
(1) As a way to describe or visualize a set of statistics in order to understand the relationships within the data itself (2) As a statistic itself that takes in data and outputs a visualization (3) A combination of different elements called aesthetics
Charts
diagrams in which data is graphically represented using symbols (and their relationships); different from an illustration and an infographic; can be static or interactive
What are the components of a bar chart?
it has an axis (usually x-axis) that represents categories of a variable and another axis (usually y-axis) that represents the levels or quantities of another variable
How is a relationship illustrated in a bar chart?
the relationship between the two axes is depicted via bars, where the height represents the value of the y variable for observations with the specific x value
What are the types of bar charts?
histogram, frequency chart, stacked bar chart, 100% bar chart
Histogram
a bar chart where the x variable is quantitative and is broken up into ordered “bins”; a common way to show a PDF
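The binning step behind a histogram can be sketched without any plotting library. The data, range, and number of bins below are made up, and the `histogram` helper is hypothetical:

```python
def histogram(data, low, high, n_bins):
    """Count how many values fall into each of n_bins equal-width bins."""
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for x in data:
        if low <= x <= high:
            # The top edge (x == high) is placed in the last bin.
            index = min(int((x - low) / width), n_bins - 1)
            counts[index] += 1
    return counts

# Hypothetical quantitative observations (an assumption for illustration).
data = [1.2, 2.5, 2.7, 3.1, 4.8, 5.0, 7.3, 9.9]
counts = histogram(data, low=0, high=10, n_bins=5)
print(counts)  # [1, 3, 2, 1, 1]

# Text version of the chart: one bar per ordered bin.
for i, count in enumerate(counts):
    print(f"{2 * i:>2} to {2 * i + 2:<2} {'#' * count}")
```

Dividing each count by `len(data)` gives relative frequencies, which is what makes a histogram a rough picture of the PDF.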
Frequency chart
a bar chart where the y variable is the frequency of a category in the sample
Stacked bar chart
it combines several observation types by category to give an idea of the composition
100% bar chart
it normalizes the height of the bars, usually combined with stacking or something else
What are bar charts used for?
to show changes in level or composition, to illustrate the distribution of variables in a dataset when y is the frequency (relative or absolute)
When are bar charts most effective?
(1) the x variable is categorical or can be sensibly divided into relatively few categories and the y variable is summarizable by those categories (2) the objective of the visualization is to compare across categories of the x variable (3) there is a meaningful representation of the separation between the categories
When do bar charts tend not to work?
when (1) data dimensionality is high and (2) variation across categories is hard to compare
Bivariate charts
summarize the relationship between 2 distinct variables, not derived or related to one another; do not rely on space or time for their relationships or visualization
What are the types of bivariate charts?
scatterplots, line plots, and connected line plots
Scatterplots
two variables are plotted using points against one another
Line plots
points are ordered according to some variable then connected with a line; sometimes called a time series plot when the x-axis is time
Connected line plots
a scatterplot is connected using a line
Dependence
the ability of one variable to predict the values of another variable e.g. correlation; says nothing about whether such a relationship is causal or not
Aesthetic
any visual property of a visualization that can be mapped to a variable or held constant e.g. the distance represented by the x and y axes, color, shape, size, transparency, thickness, etc.
Information density
how many dimensions of data are being displayed relative to the number of aesthetic dimensions being displayed
Low information density
when the number of aesthetic dimensions exceeds the number of data dimensions (variables) being displayed
Tufte’s Principles
(1) Physical representation of numbers should be proportional to the numerical quantities represented (2) Labels clarify the data (3) Show data variation, not design variation (4) Number of dimensions in the figure should not exceed the number in the data (5) Graphics must not quote data out of context (6) Use deflated or standardized units, especially in a time series