Reading 2: Organizing, Visualizing, and Describing Data Flashcards
Identify and compare data types.
We may classify data types from three different perspectives: (a) numerical versus categorical, (b) time series versus cross sectional, and (c) structured versus unstructured.
(a) Numerical, or quantitative, data are values that can be counted or measured and may be discrete or continuous. Categorical, or qualitative, data are labels that can be used to classify a set of data into groups and may be nominal or ordinal.
(b) A time series is a set of observations taken at a sequence of points in time. Cross-sectional data are a set of comparable observations taken at one point in time. Time series and cross-sectional data may be combined to form panel data.
(c) Structured data are organized in a defined manner, such as rows and columns. Unstructured data are information presented in forms that are not regularly structured and may be generated by individuals, business processes, or sensors.
Describe how data are organized for quantitative analysis.
Data are typically organized into arrays for analysis. A time series is an example of a one-dimensional array. A data table is an example of a two-dimensional array.
Interpret frequency and related distributions.
- A frequency distribution groups observations into classes, or intervals. An interval is a range of values.
- Relative frequency is the percentage of total observations falling within an interval.
- Cumulative relative frequency for an interval is the sum of the relative frequencies for all values less than or equal to that interval’s maximum value.
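The three quantities above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the function name `frequency_distribution`, the equal-width intervals, the convention that the last interval includes its upper bound, and the sample returns are all assumptions made for the example.

```python
def frequency_distribution(values, lower, width, n_intervals):
    """Group values into equal-width intervals starting at `lower`.

    Assumes every value lies between `lower` and `lower + width * n_intervals`.
    Intervals are [lo, hi), except the last one, which also includes its upper bound.
    """
    counts = [0] * n_intervals
    for v in values:
        idx = int((v - lower) // width)
        idx = min(idx, n_intervals - 1)  # place the maximum value in the last interval
        counts[idx] += 1

    total = len(values)
    rows = []
    cumulative = 0
    for i, count in enumerate(counts):
        cumulative += count
        rows.append({
            "interval": (lower + i * width, lower + (i + 1) * width),
            "absolute": count,                        # absolute frequency
            "relative": count / total,                # share of all observations
            "cumulative_relative": cumulative / total,
        })
    return rows


# Hypothetical returns in percent, chosen only for illustration.
returns = [-4.0, -1.5, 0.5, 1.0, 2.5, 3.0, 3.5, 6.0, 7.5, 8.0]
table = frequency_distribution(returns, lower=-4.0, width=4.0, n_intervals=3)
for row in table:
    print(row)
```

The cumulative relative frequency of the last interval is always 1, which is a quick check that no observation was dropped.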
Interpret a contingency table.
A contingency table is a two-dimensional array with which we can analyze two variables at the same time. The rows represent some attributes of one of the variables and the columns represent those attributes for the other variable. The data in each cell show the joint frequency with which we observe a pair of attributes simultaneously. The total of frequencies for a row or a column is the marginal frequency for that attribute.
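As a rough sketch of joint and marginal frequencies, the snippet below counts pairs of attributes with `collections.Counter`. The two variables (sector and size) and the observations are hypothetical and serve only to illustrate the definitions.

```python
from collections import Counter

# Hypothetical observations: one (sector, size) pair per company.
observations = [
    ("Tech", "Large"), ("Tech", "Small"), ("Tech", "Large"),
    ("Energy", "Small"), ("Energy", "Small"), ("Energy", "Large"),
    ("Tech", "Small"), ("Energy", "Small"),
]

# Joint frequency: how often each pair of attributes occurs together.
joint = Counter(observations)

# Marginal frequencies: row totals (by sector) and column totals (by size).
row_marginal = Counter(sector for sector, _ in observations)
col_marginal = Counter(size for _, size in observations)

print(joint[("Energy", "Small")])   # joint frequency of one cell
print(row_marginal["Tech"])         # marginal frequency of one row
print(col_marginal["Small"])        # marginal frequency of one column
```

The row marginals and the column marginals each sum to the total number of observations.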
Describe ways that data may be visualized and evaluate uses of specific visualization.
A histogram is a bar chart of data that has been grouped into a frequency distribution.
A frequency polygon plots the midpoint of each interval on the horizontal axis and the absolute frequency for that interval on the vertical axis, and it connects the midpoints with straight lines.
A cumulative frequency distribution chart is a line chart of the cumulative absolute frequency or the cumulative relative frequency.
Bar charts can be used to illustrate relative sizes, degrees, or magnitudes. A grouped or clustered bar chart can illustrate two categories at once.
- In a stacked bar chart, the height of each bar represents the cumulative frequency for a category, and the colors within each bar represent joint frequencies.
- A tree map is another method for visualizing the relative sizes of categories.
A word cloud is generated by counting the uses of specific words in a text file. It displays the words that appear most often, in type sizes that are scaled to the frequency of their use.
Line charts are particularly useful for exhibiting time series. Multiple time series can be displayed on a line chart if their scales are comparable. Two time series with different scales can also share a line chart by using left and right vertical axes. A technique for adding a dimension to a line chart is to create a bubble line chart.
A scatter plot is a way of displaying how two variables tend to change together. The vertical axis represents one variable and the horizontal axis represents a second variable. Each point in the scatter plot shows the values of both variables for a single observation, such as one specific point in time.
A heat map uses color and shade to display data frequency.
Describe how to select among visualization types.
Which chart types tend to be most effective depends on what they are intended to visualize:
- Relationships. Scatter plots, scatter plot matrices, and heat maps.
- Comparisons. Bar charts, tree maps, and heat maps for comparisons among categories; line charts and bubble line charts for comparisons over time.
- Distributions. Histograms, frequency polygons, and cumulative distribution charts for numerical data; bar charts, tree maps, and heat maps for categorical data; and word clouds for unstructured data.
Calculate and interpret measures of central tendency.
Evaluate alternative definitions of mean to address an investment problem.
- Arithmetic mean is used to estimate the expected value, that is, the value of a single outcome from a distribution.
- Geometric mean is used to calculate or estimate periodic compound returns over multiple periods.
- Harmonic mean is used to calculate the average price paid with equal periodic investments.
- A trimmed mean omits outliers and a winsorized mean replaces outliers with given values, reducing the effect of outliers on the mean in both cases.
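The means listed above can be compared with a short Python sketch. The helper names (`geometric_mean_return`, `trimmed_mean`, `winsorized_mean`), the parameter `k`, and the data are assumptions for illustration. In particular, the winsorized version here replaces each outlier with the nearest remaining value, which is one common convention rather than the only one.

```python
import math
import statistics


def geometric_mean_return(returns):
    """Compound return per period: (product of (1 + r)) ** (1 / n) - 1."""
    growth = math.prod(1 + r for r in returns)
    return growth ** (1 / len(returns)) - 1


def trimmed_mean(values, k):
    """Arithmetic mean after dropping the k smallest and k largest values."""
    ordered = sorted(values)
    return statistics.mean(ordered[k:len(ordered) - k])


def winsorized_mean(values, k):
    """Arithmetic mean after replacing the k smallest and k largest values
    with the nearest remaining value on each side (assumed convention)."""
    ordered = sorted(values)
    if k > 0:
        ordered[:k] = [ordered[k]] * k
        ordered[-k:] = [ordered[-k - 1]] * k
    return statistics.mean(ordered)


# Hypothetical data, chosen only for illustration.
annual_returns = [0.10, -0.05, 0.20]
purchase_prices = [10.0, 20.0]   # price paid in each of two equal-sized purchases
with_outlier = [1, 2, 3, 4, 100]

arithmetic = statistics.mean(annual_returns)
geometric = geometric_mean_return(annual_returns)
average_price = statistics.harmonic_mean(purchase_prices)
trimmed = trimmed_mean(with_outlier, 1)
winsorized = winsorized_mean(with_outlier, 1)

print(arithmetic, geometric, average_price, trimmed, winsorized)
```

With these inputs the geometric mean return is below the arithmetic mean, and the harmonic mean of the two prices matches the average cost per share when the same amount is invested at each price.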
Calculate quantiles and interpret related visualizations.
Quantile is the general term for a value at or below which a stated proportion of the data in a distribution lies. Examples of quantiles include the following:
- Quartile. The distribution is divided into quarters.
- Quintile. The distribution is divided into fifths.
- Decile. The distribution is divided into tenths.
- Percentile. The distribution is divided into hundredths (percents).
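One way to compute a quantile is sketched below. It assumes a common location convention that these notes do not state: the y-th percentile sits at position (n + 1) × y / 100 in the sorted data, with linear interpolation between neighboring observations. The function name `percentile` and the data are hypothetical.

```python
def percentile(values, y):
    """Value at the y-th percentile, using location (n + 1) * y / 100
    in the sorted data and linear interpolation (assumed convention)."""
    ordered = sorted(values)
    n = len(ordered)
    location = (n + 1) * y / 100
    lower = int(location)           # 1-based position of the lower neighbor
    fraction = location - lower     # distance toward the upper neighbor
    if lower < 1:
        return ordered[0]
    if lower >= n:
        return ordered[-1]
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])


# Hypothetical data, chosen only for illustration.
data = [2, 4, 6, 8, 10, 12, 14]

first_quartile = percentile(data, 25)
median = percentile(data, 50)
third_quartile = percentile(data, 75)
print(first_quartile, median, third_quartile)
```

Other interpolation conventions exist and can give slightly different values on small samples, so results should be compared only within one convention.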
Calculate and interpret measures of dispersion.
Calculate and interpret target downside deviation.
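A minimal sketch of target downside deviation follows. It assumes the usual sample definition, which is not spelled out in these notes: the square root of the sum of squared deviations below the target, divided by n − 1, where n counts all observations. The function name and the returns are hypothetical.

```python
import math


def target_downside_deviation(returns, target):
    """Square root of the sum of squared shortfalls below `target`,
    divided by (n - 1), where n is the total number of observations."""
    squared_shortfalls = [(r - target) ** 2 for r in returns if r <= target]
    return math.sqrt(sum(squared_shortfalls) / (len(returns) - 1))


# Hypothetical returns in percent and a hypothetical target of 1 percent.
returns = [5, -1, 3, -3, 6]
deviation = target_downside_deviation(returns, target=1)
print(deviation)
```

Only the two returns below the target (−1 and −3) contribute to the sum, while the divisor still reflects all five observations.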
Interpret skewness.
Skewness describes the degree to which a distribution is not symmetric about its mean. A right-skewed distribution has positive skewness. A left-skewed distribution has negative skewness.
- For a positively skewed, unimodal distribution, the mean is greater than the median, which is greater than the mode.
- For a negatively skewed, unimodal distribution, the mean is less than the median, which is less than the mode.
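The ordering of mean, median, and mode for a positively skewed distribution can be checked on a small hypothetical sample. The skewness formula used here is a common sample version (an assumption, since the notes give no formula), and `sample_skewness` is an illustrative name.

```python
import statistics


def sample_skewness(values):
    """Sample skewness: n / ((n - 1)(n - 2)) * sum(((x - mean) / s) ** 3)."""
    n = len(values)
    mean = statistics.mean(values)
    s = statistics.stdev(values)    # sample standard deviation
    return n / ((n - 1) * (n - 2)) * sum(((x - mean) / s) ** 3 for x in values)


# Hypothetical unimodal sample with a long right tail.
right_skewed = [1, 2, 2, 2, 3, 3, 4, 5, 9]

skew = sample_skewness(right_skewed)
mean = statistics.mean(right_skewed)
median = statistics.median(right_skewed)
mode = statistics.mode(right_skewed)
print(skew, mean, median, mode)
```

Negating every value in the sample mirrors the distribution, which flips the sign of the skewness and reverses the ordering of the three measures.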
Interpret kurtosis.
Kurtosis measures the peakedness of a distribution and the probability of extreme outcomes (thickness of tails):
- Excess kurtosis is measured relative to a normal distribution, which has a kurtosis of 3.
- Positive values of excess kurtosis indicate a distribution that is leptokurtic (fat tails, more peaked), so the probability of extreme outcomes is greater than for a normal distribution.
- Negative values of excess kurtosis indicate a platykurtic distribution (thin tails, less peaked).
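As a rough illustration, excess kurtosis can be computed from the second and fourth central moments. The sketch below uses the simple population-moment version (an assumption; sample formulas add small-sample adjustments), and both data sets are hypothetical.

```python
def excess_kurtosis(values):
    """Fourth central moment divided by the squared second central moment,
    minus 3 (the kurtosis of a normal distribution)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    return m4 / m2 ** 2 - 3


# Hypothetical samples, chosen only for illustration.
fat_tailed = [-10, -1, 0, 0, 0, 0, 0, 0, 1, 10]   # most values near the center, a few extremes
thin_tailed = [-2, -1, 0, 1, 2]                   # evenly spread, no extremes

print(excess_kurtosis(fat_tailed))    # positive: leptokurtic
print(excess_kurtosis(thin_tailed))   # negative: platykurtic
```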
Interpret correlation between two variables.
Correlation is a standardized measure of association between two random variables.
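- It ranges in value from –1 to +1 and is equal to the covariance of the two variables divided by the product of their standard deviations, $\rho_{XY} = \mathrm{Cov}(X, Y) / (\sigma_X \sigma_Y)$.
- Scatter plots are useful for revealing nonlinear relationships that are not measured by correlation.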
- Correlation does not imply that changes in one variable cause changes in the other. Spurious correlation may result by chance or from the relationships of two variables to a third variable.
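The points above can be illustrated with a short sketch, assuming the usual sample definition of correlation (sample covariance divided by the product of the sample standard deviations). The function name `correlation` and the data are hypothetical. The second series depends on `x` exactly, yet its correlation with `x` is zero because the relationship is not linear.

```python
import math


def correlation(x, y):
    """Sample correlation: sample covariance / (std dev of x * std dev of y)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / (n - 1))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / (n - 1))
    return cov / (std_x * std_y)


# Hypothetical data, chosen only for illustration.
x = [-2, -1, 0, 1, 2]
linear = [2 * v + 1 for v in x]    # exact linear relationship
quadratic = [v ** 2 for v in x]    # exact but nonlinear relationship

r_linear = correlation(x, linear)
r_quadratic = correlation(x, quadratic)
print(r_linear, r_quadratic)
```

A scatter plot of `x` against `quadratic` would show the U-shaped pattern immediately, even though the correlation coefficient alone suggests no association.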