Organizing, Visualizing, and Describing Data Flashcards
Data that is measured or counted
Numerical
2 types of numerical data
Continuous and discrete
Data that can be measured and can take on any value in a range of values
Continuous numerical data - FV of an investment
Numerical data that result from a counting process
Discrete numerical data - the frequency of discrete compounding
2 types of data
Numerical and Categorical
Data that describe a characteristic or quality
categorical data
Other names for Numerical and Categorical data
Quantitative and Qualitative data
Categorical data not amenable to a logical order
nominal - stock sectors
Categorical data able to be logically ordered
ordinal data - ratings for investment funds
science of dealing with collection, analysis, interpretation, and presentation of numerical data
statistics
- the study of how large datasets can be effectively summarized
- studies of central tendency and variation of data
descriptive statistics
making extrapolations, estimates, forecasts about a large group from a smaller group
statistical inference
the complete group (objects, persons, items of interest) being studied
population
a portion of the group being studied
sample
parameter vs statistic
a descriptive measure of a population vs a sample, respectively (p&s)
measurement scale with equal distances between consecutive values; comment on zero
interval scale; zero is arbitrary
multiple data units at a given time
cross-sectional data
one unit observed across multiple time periods
time-series data
data that is patterned vs unpatterned
structured vs unstructured data
examples of structured data
market data - stock prices
fundamental data - financial statement data
analytics - cash flows
examples of unstructured data
produced by individuals - social media, posts, web searches
rank the measurement scales from most to least informative (interval, nominal, ordinal, ratio)
ratio, interval, ordinal, nominal
format for representing one variable
one-dimensional array
format for representing more than one variable via rows/columns
two-dimensional array
What is another name for a two-dimensional array?
data table
another name for a frequency distribution
one-way table
tool for summarizing data into groups or bins for display
frequency distribution
GICS stands for
Global Industry Classification Standard
real or actual frequency
absolute frequency
frequency as a percentage of the total number of observations
relative frequency
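A minimal sketch of absolute versus relative frequency, assuming a made-up list of sector labels (the data and variable names are illustrative, not from the deck):

```python
from collections import Counter

# Hypothetical sector labels for eight stocks (made-up data).
sectors = ["Energy", "Tech", "Tech", "Health", "Energy", "Tech", "Health", "Tech"]

absolute = Counter(sectors)            # absolute frequency: raw count per group
total = sum(absolute.values())         # total number of observations
relative = {name: count / total for name, count in absolute.items()}

for name in sorted(absolute):
    print(f"{name:<8} {absolute[name]:>2} {relative[name]:>7.1%}")
```

The relative frequencies always sum to 1 (100%), which is a quick sanity check on any frequency table.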
interval data with a true (absolute) zero point
ratio
raw data or non-summarized data
ungrouped data
data in a frequency distribution
grouped data
depiction of frequency distribution
histogram
also known as a 2 way table
contingency table
displays 2 or more categorical variables
contingency table
frequency at the intersection of a particular row and column
joint frequency
sums of joint frequencies
marginal frequency
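Joint and marginal frequencies can be sketched as follows, assuming a small made-up set of (sector, rating) observations; all names and data here are hypothetical:

```python
from collections import Counter

# Hypothetical observations: one (sector, rating) pair per stock.
observations = [
    ("Tech", "Buy"), ("Tech", "Hold"), ("Tech", "Buy"),
    ("Energy", "Hold"), ("Energy", "Sell"), ("Energy", "Hold"),
]

# Joint frequency: count at the intersection of a row (sector) and column (rating).
joint = Counter(observations)

# Marginal frequency: sum of the joint frequencies along a row or a column.
row_marginal = Counter()
col_marginal = Counter()
for (sector, rating), count in joint.items():
    row_marginal[sector] += count
    col_marginal[rating] += count

print(joint[("Tech", "Buy")], row_marginal["Tech"], col_marginal["Hold"])
```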
2x2 contingency table in matrix form comparing actual and predicted classes
confusion matrix
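A rough illustration of a 2x2 confusion matrix, assuming hypothetical actual and predicted labels (1 = positive class, 0 = negative class); the row and column layout is one common convention, not the only one:

```python
# Hypothetical labels: 1 = positive class, 0 = negative class.
actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

pairs = list(zip(actual, predicted))
tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # true positives
fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # false negatives
fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # false positives
tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # true negatives

# Rows are actual classes, columns are predicted classes.
confusion = [[tp, fn],
             [fp, tn]]
print(confusion)
```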
line graph connecting the midpoints of histogram bins, plotted against their frequencies
frequency polygon
frequency polygon with cumulative frequencies
ogive
circular depiction of data as a percent
pie chart
steps of creating a frequency distribution
- sort into ascending order
- calculate the range (maximum - minimum)
- choose # of bins (k)
- bin width = range/k
- place the observations in the bins
- construct a table of bins from smallest to largest
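The steps above can be sketched as a small function. This is a minimal sketch under stated assumptions: the function name, the sample data, and the choice to put the maximum value in the last bin are illustrative, and the data are assumed not to be all identical.

```python
def frequency_distribution(data, k):
    """Return a list of (bin_low, bin_high, count) rows for k equal-width bins."""
    ordered = sorted(data)                        # sort into ascending order
    low, high = ordered[0], ordered[-1]
    data_range = high - low                       # range = maximum - minimum
    if data_range == 0:
        raise ValueError("all observations are identical; range is zero")
    width = data_range / k                        # bin width = range / k
    counts = [0] * k
    for x in ordered:                             # place the observations in the bins
        index = min(int((x - low) / width), k - 1)  # the maximum goes in the last bin
        counts[index] += 1
    # construct the table of bins from smallest to largest
    return [(low + i * width, low + (i + 1) * width, counts[i]) for i in range(k)]

# Hypothetical observations (made-up data).
table = frequency_distribution([2, 4, 5, 7, 8, 9, 11, 12, 14, 18, 20, 22], k=4)
for bin_low, bin_high, count in table:
    print(f"{bin_low:>5.1f} to {bin_high:>5.1f}: {count}")
```

Each bin includes its lower bound and excludes its upper bound, except the last bin, which also includes the maximum.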
test of association between 2 categorical variables
chi-square test
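As a hedged sketch, the chi-square statistic for a contingency table can be computed by comparing each observed joint frequency with the frequency expected under independence (row total x column total / overall total). The function name and the 2x2 table below are hypothetical.

```python
def chi_square_statistic(observed):
    """Return (statistic, degrees_of_freedom) for a table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n  # expected count if independent
            statistic += (obs - expected) ** 2 / expected
    degrees_of_freedom = (len(observed) - 1) * (len(observed[0]) - 1)
    return statistic, degrees_of_freedom

# Hypothetical 2x2 table of joint frequencies.
stat, df = chi_square_statistic([[20, 30], [30, 20]])
print(stat, df)
```

The statistic is then compared with a chi-square critical value for the given degrees of freedom; with 1 degree of freedom the 5% critical value is about 3.84.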
arranges data by left digit and right digit to present data concentrations
stem and leaf plot
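A minimal stem-and-leaf sketch, assuming non-negative integers where the stem is everything but the last digit and the leaf is the last digit (the function name and data are illustrative):

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group non-negative integers by stem (leading digits) and leaf (last digit)."""
    plot = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        plot[stem].append(leaf)
    return dict(plot)

# Hypothetical observations (made-up data).
plot = stem_and_leaf([23, 12, 41, 15, 38, 21, 34, 23])
for stem, leaves in plot.items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
```

Long rows of leaves show where the data are concentrated.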
used in quality control to tally qualitative issues
pareto chart
2-variable numeric chart used to show correlation
scatter plot
measures of where data tends to cluster
measures of central tendency
mathematical average influenced by outliers
mean
middle value in an array, not affected by the magnitude of extreme values
median
most frequent value (a dataset can have 2 or more such values)
mode
central tendency measurement commonly used with ordinal data
median
central tendency measurement commonly used with interval/ratio data
mean
central tendency measurement commonly used with nominal data
mode
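The three measures of central tendency can be compared on one small dataset. This sketch uses Python's standard `statistics` module; the data are made up and include one deliberate outlier to show its pull on the mean.

```python
import statistics

# Hypothetical observations; 40 is a deliberate outlier.
values = [2, 3, 3, 5, 7, 40]

mean = statistics.mean(values)        # pulled upward by the outlier
median = statistics.median(values)    # middle of the sorted values, unaffected by 40's size
modes = statistics.multimode(values)  # list of all most-frequent values

print(mean, median, modes)
```

Here the mean is 10 while the median is 4, which illustrates why the median is often preferred when extreme values are present.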
measures of how spread out data is
measures of dispersion
average of the absolute values of the differences between each observation and the sample mean
mean absolute deviation (MAD)
sum of the squared differences between each observation and the mean, divided by n - 1 (sample) or N (population)
variance
measures variability of the dataset
variance
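A short worked sketch of mean absolute deviation and variance on made-up data, with the standard `statistics` module used only as a cross-check:

```python
import statistics

# Hypothetical observations (made-up data).
data = [2, 4, 4, 6, 9]
n = len(data)
mean = sum(data) / n

# MAD: average absolute deviation from the mean.
mad = sum(abs(x - mean) for x in data) / n

# Variance: squared deviations, divided by n - 1 (sample) or n (population).
squared_deviations = sum((x - mean) ** 2 for x in data)
sample_variance = squared_deviations / (n - 1)
population_variance = squared_deviations / n

print(mad, sample_variance, population_variance)
print(statistics.variance(data), statistics.pvariance(data))  # cross-check
```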
standard deviation expressed as a percentage of the mean
coefficient of variation
if data are roughly normally distributed, then about 68%, 95%, and 99.7% of observations fall within 1, 2, and 3 standard deviations of the mean
empirical rule
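The empirical rule's usual 68 / 95 / 99.7 percentages can be checked against the normal distribution itself. A small sketch using `statistics.NormalDist` from the standard library (the helper name `share_within` is made up for illustration):

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1

def share_within(k):
    """Share of a normal distribution lying within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} sd: {share_within(k):.4f}")
```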
describes the degree of asymmetry of a distribution around its mean
skewness
describes relationship of tails of a distribution to its center
kurtosis
describe the mean and median in a normal distribution
they are equal
tall and skinny distribution (more peaked, with fatter tails than the normal)
leptokurtic
wide and flat distribution (less peaked, with thinner tails than the normal)
platykurtic
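Skewness and kurtosis can be sketched with simple moment-based formulas. This is an assumption-laden illustration: it divides by n throughout (small-sample corrections are omitted), and the function names and data are made up.

```python
def central_moment(data, order):
    """Average of (x - mean) raised to the given power."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** order for x in data) / len(data)

def skewness(data):
    # Third central moment scaled by the standard deviation cubed.
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5

def excess_kurtosis(data):
    # Fourth central moment scaled by variance squared, minus 3 (the normal's kurtosis).
    return central_moment(data, 4) / central_moment(data, 2) ** 2 - 3

symmetric = [1, 2, 3, 4, 5]        # hypothetical symmetric data
right_skewed = [1, 1, 1, 1, 10]    # hypothetical data with a long right tail

print(skewness(symmetric), excess_kurtosis(symmetric))
print(skewness(right_skewed))
```

Positive excess kurtosis points to a leptokurtic distribution and negative excess kurtosis to a platykurtic one; the symmetric sample above has zero skewness and negative excess kurtosis.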
range between the first and third quartiles (the middle 50% of the distribution)
interquartile range
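A minimal interquartile range sketch using `statistics.quantiles` with its default method. Quartile conventions differ between textbooks and software, so other tools may give slightly different cut points; the data are made up.

```python
import statistics

# Hypothetical observations (made-up data).
data = [1, 2, 3, 4, 5, 6, 7]

# With n=4 the function returns the three quartile cut points.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1   # spread of the middle 50% of the observations

print(q1, q2, q3, iqr)
```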
what is coefficient of variation (CV) used for?
to compare datasets with different scales
what is population vs sample coefficient of variation
population CV = σ / μ
sample CV = s / x̄
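A short sketch of the sample CV formula, applied to made-up datasets on different scales (the function name and data are illustrative):

```python
import statistics

def sample_cv(data):
    """Sample coefficient of variation: s / x-bar."""
    return statistics.stdev(data) / statistics.mean(data)

# Hypothetical datasets measured on very different scales.
small_scale = [10, 12, 14]
large_scale = [1000, 1200, 1400]
more_variable = [100, 150, 200]

cv_small = sample_cv(small_scale)
cv_large = sample_cv(large_scale)
cv_variable = sample_cv(more_variable)
print(f"{cv_small:.4f} {cv_large:.4f} {cv_variable:.4f}")
```

The first two datasets have the same relative dispersion despite a 100x difference in scale, while the third is relatively more variable; comparing standard deviations alone would hide this.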