Organizing, Visualizing, and Describing Data Flashcards
Absolute dispersion
The amount of variability present without comparison to any reference point or benchmark.
Absolute frequency
The actual number of observations counted for each unique value of the variable (also called raw frequency).
Arithmetic mean
The sum of the observations divided by the number of observations.
Bar chart
A chart for plotting the frequency distribution of categorical data, where each bar represents a distinct category and each bar’s height is proportional to the frequency of the corresponding category. In technical analysis, a bar chart that plots four bits of data for each time interval—the high, low, opening, and closing prices. A vertical line connects the high and low prices. A cross-hatch left indicates the opening price and a cross-hatch right indicates the closing price.
Bimodal
A distribution that has two most frequently occurring values.
Box and whisker plot
A graphic for visualizing the dispersion of data across quartiles. It consists of a “box” with “whiskers” connected to the box.
Bubble line chart
A line chart that uses varying-sized bubbles to represent a third dimension of the data. The bubbles are sometimes color-coded to present additional information.
Categorical data
Values that describe a quality or characteristic of a group of observations and therefore can be used as labels to divide a dataset into groups to summarize and visualize (also called qualitative data).
Chi-square test of independence
A statistical test for detecting a potential association between categorical variables.
Clustered bar chart
A bar chart for showing joint frequencies for two categorical variables (also known as a grouped bar chart).
Coefficient of variation
The ratio of a set of observations’ standard deviation to the observations’ mean value.
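A minimal Python sketch of the coefficient of variation, assuming the sample standard deviation (n - 1 divisor) is the one intended; the data and the function name are hypothetical.

```python
import statistics

def coefficient_of_variation(values):
    """Sample standard deviation divided by the sample mean."""
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("coefficient of variation is undefined when the mean is zero")
    return statistics.stdev(values) / mean

returns = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # hypothetical observations
cv = coefficient_of_variation(returns)
```

Because the result is a ratio, it has no units, which is what makes it usable for comparing dispersion across datasets with different means.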
Confusion matrix
A grid used for error analysis in classification problems. It presents the counts of four outcomes: true positive (TP), false positive (FP), true negative (TN), and false negative (FN).
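As an illustration, the four cells can be tallied from paired actual and predicted labels. This is only a sketch: the label lists are made up, and 1 is assumed to mark the positive class.

```python
actual    = [1, 0, 1, 1, 0, 0, 1, 0]  # hypothetical true labels (1 = positive)
predicted = [1, 0, 0, 1, 1, 0, 1, 0]  # hypothetical model output

pairs = list(zip(actual, predicted))
tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # true positives
fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # false positives
tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # true negatives
fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # false negatives

# Rows are actual classes, columns are predicted classes.
confusion_matrix = [[tp, fn],
                    [fp, tn]]
```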
Contingency table
A table of the frequency distribution of observations classified on the basis of two discrete variables.
Continuous data
Data that can be measured and can take on any numerical value in a specified range of values.
Correlation
A measure of the linear relationship between two random variables.
Cost averaging
The periodic investment of a fixed amount of money.
Cross-sectional data
A list of the observations of a specific variable from multiple observational units at a given point in time. The observational units can be individuals, groups, companies, trading markets, regions, etc.
Cumulative absolute frequency
The running total of the absolute frequencies in a frequency distribution as one moves from the first bin to the last bin.
Cumulative frequency distribution chart
A chart that plots either the cumulative absolute frequency or the cumulative relative frequency on the y-axis against the upper limit of the interval and allows one to see the number or the percentage of the observations that lie below a certain value.
Cumulative relative frequency
A sequence of partial sums of the relative frequencies in a frequency distribution.
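The absolute, relative, and cumulative frequencies defined on these cards can be computed together. The sketch below uses a small hypothetical sample and treats each distinct value as its own bin.

```python
from collections import Counter
from itertools import accumulate

observations = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4]  # hypothetical sample

counts = Counter(observations)
values = sorted(counts)          # distinct values in ascending order
n = len(observations)

abs_freq = [counts[v] for v in values]          # absolute frequency
rel_freq = [f / n for f in abs_freq]            # relative frequency
cum_abs_freq = list(accumulate(abs_freq))       # cumulative absolute frequency
cum_rel_freq = [c / n for c in cum_abs_freq]    # cumulative relative frequency
```

The last cumulative absolute frequency equals the number of observations, and the last cumulative relative frequency equals 1.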
Data
A collection of numbers, characters, words, and text—as well as images, audio, and video—in a raw or organized format to represent facts or information.
Data table
See two-dimensional rectangular array.
Deciles
Quantiles that divide a distribution into 10 equal parts.
Descriptive statistics
The study of how data can be summarized effectively.
Discrete data
Numerical values that result from a counting process; therefore, practically speaking, the data are limited to a finite number of values.
Dispersion
The variability of a population or sample of observations around the central tendency.
Downside risk
Risk of incurring returns below a specified value.
Excess kurtosis
Degree of kurtosis (fatness of tails) relative to the kurtosis of the normal distribution.
Fat-Tailed
Describes a distribution that has fatter tails than a normal distribution (also called leptokurtic).
Fractile
A value at or below which a stated fraction of the data lies. Also called quantile.
Frequency distribution
A tabular display of data constructed either by counting the observations of a variable by distinct values or groups or by tallying the values of a numerical variable into a set of numerically ordered bins (also called a one-way table).
Frequency polygon
A graph of a frequency distribution obtained by drawing straight lines joining successive points representing the class frequencies.
Geometric mean
A measure of central tendency computed by taking the nth root of the product of n non-negative values.
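A short sketch of the geometric mean. It assumes the inputs are growth factors (1 + periodic return), a common use that the card itself does not state; the numbers are hypothetical.

```python
import math

def geometric_mean(values):
    """nth root of the product of n non-negative values."""
    if any(v < 0 for v in values):
        raise ValueError("geometric mean requires non-negative values")
    return math.prod(values) ** (1 / len(values))

growth_factors = [1.10, 0.95, 1.20]   # hypothetical: +10%, -5%, +20%
g = geometric_mean(growth_factors)
mean_return = g - 1                   # compound rate per period
```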
Grouped bar chart
A bar chart for showing joint frequencies for two categorical variables (also known as a clustered bar chart).
Harmonic mean
A type of weighted mean computed as the reciprocal of the arithmetic average of the reciprocals.
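One way to see the harmonic mean at work is cost averaging: if the same amount is invested at each price, the average cost per share equals the harmonic mean of the prices. The prices and the amount below are hypothetical.

```python
import statistics

prices = [10.0, 20.0]     # hypothetical purchase prices per share
amount = 100.0            # hypothetical fixed amount invested each period

# Harmonic mean: reciprocal of the arithmetic mean of the reciprocals.
h = len(prices) / sum(1 / p for p in prices)

# Same figure obtained from the purchases themselves.
shares_bought = sum(amount / p for p in prices)
average_cost = amount * len(prices) / shares_bought

# The standard library provides the same calculation.
h_stdlib = statistics.harmonic_mean(prices)
```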
Heat map
A type of graphic that organizes and summarizes data in a tabular format and represents it using a color spectrum.
Histogram
A chart that presents the distribution of numerical data by using the height of a bar or column to represent the absolute frequency of each bin or interval in the distribution.
Interquartile range
The difference between the third and first quartiles of a dataset.
Interval
With reference to grouped data, a set of values within which an observation falls.
Joint frequencies
The entries in the cells of a contingency table, each counting the observations that join one category of the row variable with one category of the column variable.
Leptokurtic
Describes a distribution that has fatter tails than a normal distribution (also called fat-tailed).
Line chart
A type of graph used to visualize ordered observations. In technical analysis, a plot of price data, typically closing prices, with a line connecting the points.
Linear interpolation
The estimation of an unknown value on the basis of two known values that bracket it, using a straight line between the two known values.
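Linear interpolation is commonly used to locate quantiles that fall between two observations. The sketch below assumes the location convention L = (n + 1) * y / 100 for the y-th percentile, which is one of several conventions and is not stated on these cards; the data are hypothetical.

```python
def percentile(values, y):
    """y-th percentile using location (n + 1) * y / 100 (an assumed convention)."""
    data = sorted(values)
    n = len(data)
    loc = (n + 1) * y / 100          # 1-based position in the sorted data
    if loc <= 1:
        return data[0]
    if loc >= n:
        return data[-1]
    lower = int(loc)                 # position of the observation just below
    frac = loc - lower               # how far to move toward the next one
    return data[lower - 1] + frac * (data[lower] - data[lower - 1])

sample = [10, 20, 30, 40, 50, 60, 70, 80, 90]  # hypothetical observations
q1 = percentile(sample, 25)
q3 = percentile(sample, 75)
iqr = q3 - q1                        # interquartile range
```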
Marginal frequencies
The sums determined by adding joint frequencies across rows or across columns in a contingency table.
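A small sketch of joint and marginal frequencies for two categorical variables. The sector and size labels are hypothetical.

```python
from collections import Counter

# Each observation is a (sector, size) pair; the data are made up.
observations = [
    ("Tech", "Large"), ("Tech", "Small"), ("Energy", "Large"),
    ("Tech", "Large"), ("Energy", "Small"), ("Energy", "Small"),
]

joint = Counter(observations)                                # cell counts
row_marginal = Counter(sector for sector, _ in observations)  # sums across each row
col_marginal = Counter(size for _, size in observations)      # sums down each column
```

Each marginal frequency equals the sum of the joint frequencies in its row or column, and both sets of marginals add up to the total number of observations.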
Mean absolute deviation
With reference to a sample, the mean of the absolute values of deviations from the sample mean.
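A minimal sketch of the mean absolute deviation on hypothetical data:

```python
def mean_absolute_deviation(values):
    """Average absolute distance of each observation from the sample mean."""
    mean = sum(values) / len(values)
    return sum(abs(x - mean) for x in values) / len(values)

sample = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical observations, mean = 5
mad = mean_absolute_deviation(sample)
```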
Measure of central tendency
A quantitative measure that specifies where data are centered.
Measures of location
Quantitative measures that describe the location or distribution of data. They include not only measures of central tendency but also other measures, such as percentiles.
Median
The value of the middle item of a set of items that has been sorted into ascending or descending order (i.e., the 50th percentile).
Mesokurtic
Describes a distribution with kurtosis equal to that of the normal distribution, namely, kurtosis equal to three.
Modal interval
With reference to grouped data, the interval containing the greatest number of observations (i.e., highest frequency).
Mode
The most frequently occurring value in a distribution.
Nominal data
Categorical values that are not amenable to being organized in a logical order. An example of nominal data is the classification of publicly listed stocks into sectors.
Numerical data
Values that represent measured or counted quantities as a number. Also called quantitative data.
Observation
The value of a specific variable collected at a point in time or over a specified period of time.
One-dimensional array
The simplest format for representing a collection of data of the same data type.
Ordinal data
Categorical values that can be logically ordered or ranked.
Panel data
A mix of time-series and cross-sectional data that contains observations through time on characteristics across multiple observational units.
Percentiles
Quantiles that divide a distribution into 100 equal parts.
Platykurtic
Describes a distribution that has relatively less weight in the tails than the normal distribution (also called thin-tailed).
Population
All members of a specified group.
Qualitative data
Values that describe a quality or characteristic of a group of observations and therefore can be used as labels to divide a dataset into groups to summarize and visualize (also called categorical data).
Quantile
A value at or below which a stated fraction of the data lies. Also referred to as a fractile.
Quantitative data
Values that represent measured or counted quantities as a number. Also called numerical data.
Quartiles
Quantiles that divide a distribution into four equal parts, each containing 25% of the observations.
Quintiles
Quantiles that divide a distribution into five equal parts, each containing 20% of the observations.
Range
The difference between the maximum and minimum values in a dataset.
Raw data
Data available in their original form as collected.
Relative dispersion
The amount of dispersion relative to a reference value or benchmark.
Relative frequency
The absolute frequency of each unique value of the variable divided by the total number of observations of the variable.
Sample
A subset of a population.
Sample correlation coefficient
A standardized measure of how two variables in a sample move together. It is the ratio of the sample covariance to the product of the two variables’ standard deviations.
Sample covariance
A measure of how two variables in a sample move together.
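The sample covariance and the sample correlation coefficient can be sketched together. The n - 1 divisor is assumed here, consistent with the sample variance card; the two series are hypothetical.

```python
import math

def sample_covariance(x, y):
    """Sum of cross-products of deviations divided by n - 1."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

def sample_correlation(x, y):
    """Sample covariance divided by the product of the sample standard deviations."""
    sd_x = math.sqrt(sample_covariance(x, x))   # covariance of x with itself is its variance
    sd_y = math.sqrt(sample_covariance(y, y))
    return sample_covariance(x, y) / (sd_x * sd_y)

x = [1, 2, 3, 4, 5]     # hypothetical series
y = [2, 4, 6, 8, 10]    # exactly 2 * x, so the correlation should be 1

cov_xy = sample_covariance(x, y)
r_xy = sample_correlation(x, y)
```

Covariance carries the units of both variables, while the correlation coefficient is unit-free and bounded between -1 and +1.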
Sample excess kurtosis
A sample measure of the degree of a distribution’s kurtosis in excess of the normal distribution’s kurtosis.
Sample mean
The sum of the sample observations divided by the sample size.
Sample skewness
A sample measure of the degree of asymmetry of a distribution.
Sample standard deviation
The positive square root of the sample variance.
Sample statistic
A quantity computed from or used to describe a sample.
Sample variance
The sum of squared deviations around the mean divided by the degrees of freedom.
Scatter plot matrix
A tool for organizing scatter plots between pairs of variables, making it easy to inspect all pairwise relationships in one combined visual.
Skewed
Not symmetrical.
Spurious correlation
Refers to: 1) correlation between two variables that reflects chance relationships in a particular dataset; 2) correlation induced by a calculation that mixes each of two variables with a third variable; and 3) correlation between two variables arising not from a direct relation between them but from their relation to a third variable.
Stacked bar chart
An alternative form for presenting the frequency distribution of two categorical variables, where bars representing the sub-groups are placed on top of each other to form a single bar. Each sub-section is shown in a different color to represent the contribution of each sub-group, and the overall height of the stacked bar represents the marginal frequency for the category.
Standard deviation
The positive square root of the variance; a measure of dispersion in the same units as the original data.
Statistic
A summary measure of a sample of observations.
Structured data
Data that are highly organized in a pre-defined manner, usually with repeating patterns.
Tag cloud / Word cloud
A visual device for representing textual data, which consists of words extracted from a source of textual data. The size of each distinct word is proportional to the frequency with which it appears in the given text (also known as word cloud).
Target semi-deviation / Target downside deviation
A measure of downside risk, calculated as the square root of the average of the squared deviations of observations below the target (also called target downside deviation).
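A sketch of target semi-deviation. It assumes the squared shortfalls are divided by n - 1, where n is the total number of observations, which is one common convention; the card only says "average". The returns and the target are hypothetical.

```python
import math

def target_semideviation(values, target):
    """Square root of the summed squared shortfalls below target, divided by n - 1."""
    n = len(values)
    # Only observations at or below the target contribute.
    downside = sum((x - target) ** 2 for x in values if x <= target)
    return math.sqrt(downside / (n - 1))   # n - 1 divisor is an assumption

returns = [-2.0, 1.0, 3.0, 5.0]   # hypothetical returns, in percent
semi_dev = target_semideviation(returns, target=2.0)
```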
Thin-Tailed
Describes a distribution that has relatively less weight in the tails than the normal distribution (also called platykurtic).
Time-series data
A sequence of observations for a single observational unit of a specific variable collected over time and at discrete and typically equally spaced intervals of time (such as daily, weekly, monthly, annually, or quarterly).
Tree-Map
A graphical tool for displaying categorical data. It consists of a set of colored rectangles to represent distinct groups, and the area of each rectangle is proportional to the value of the corresponding group.
Trimmed mean
A mean computed after excluding a stated small percentage of the lowest and highest observations.
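A minimal sketch of a trimmed mean, assuming the same fraction is cut from each tail and the number of dropped observations is rounded down. The data are hypothetical.

```python
def trimmed_mean(values, fraction_per_tail):
    """Mean after dropping the lowest and highest fraction of observations."""
    data = sorted(values)
    n = len(data)
    k = int(n * fraction_per_tail)   # observations dropped from each tail
    kept = data[k:n - k]
    return sum(kept) / len(kept)

sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]   # hypothetical data with one outlier
plain_mean = sum(sample) / len(sample)
trimmed = trimmed_mean(sample, 0.10)        # drops 1 and 100
```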
Trimodal
A distribution that has three most frequently occurring values.
Two-dimensional rectangular array
A popular form for organizing data for processing by computers or for presenting data visually. It consists of columns and rows to hold multiple variables and multiple observations, respectively (also called a data table).
Unimodal
A distribution with a single value that is most frequently occurring.
Unstructured data
Data that do not follow any conventionally organized forms.
Variable
A characteristic or quantity that can be measured, counted, or categorized and that is subject to change (also called a field, an attribute, or a feature).
Variance
The expected value (the probability-weighted average) of squared deviations from a random variable’s expected value.
Visualization
The presentation of data in a pictorial or graphical format for the purpose of increasing understanding and for gaining insights into the data.
Weighted mean
An average in which each observation is weighted by an index of its relative importance.
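A short sketch of a weighted mean, using hypothetical portfolio weights and asset returns. Dividing by the sum of the weights keeps the function correct even when the weights do not add up to 1.

```python
def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * v for v, w in zip(values, weights)) / sum(weights)

asset_returns = [10.0, 5.0, -2.0]   # hypothetical returns, in percent
weights = [0.5, 0.3, 0.2]           # hypothetical portfolio weights

portfolio_return = weighted_mean(asset_returns, weights)
```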
Winsorized mean
A mean computed after assigning a stated percentage of the lowest values equal to one specified low value and a stated percentage of the highest values equal to one specified high value.
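A sketch of a winsorized mean. It assumes the replacement values are the nearest observations that remain inside the cutoffs, and that the same fraction is winsorized in each tail; the card leaves both choices open. The data are hypothetical.

```python
def winsorized_mean(values, fraction_per_tail):
    """Mean after replacing each tail with the nearest retained observation."""
    data = sorted(values)
    n = len(data)
    k = int(n * fraction_per_tail)   # observations replaced in each tail
    if k > 0:
        low, high = data[k], data[n - k - 1]
        data = [low] * k + data[k:n - k] + [high] * k
    return sum(data) / n

sample = [0, 2, 3, 4, 5, 6, 7, 8, 10, 100]   # hypothetical data with one outlier
winsorized = winsorized_mean(sample, 0.10)   # 0 becomes 2, 100 becomes 10
```

Unlike trimming, winsorizing keeps the number of observations unchanged; the extreme values are pulled in rather than discarded.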
Word cloud
A visual device for representing textual data, which consists of words extracted from a source of textual data. The size of each distinct word is proportional to the frequency with which it appears in the given text (also known as tag cloud).