1.2 Summarizing Data Using Frequency Distributions Flashcards
A frequency distribution
summarizes the values of a numerical variable into a few intervals
helpful when working with large data sets
the types of frequency
Absolute frequency
Relative frequency
Cumulative relative frequency
Absolute frequency
The actual number of observations in each interval
Relative frequency
The absolute frequency divided by the total number of observations
Cumulative relative frequency
The relative frequencies added up from the first interval to the current interval
the steps to construct a frequency distribution
- Sort the data in ascending order.
- Calculate the range of the data.
Range = Maximum value − Minimum value
- Determine the desired number of intervals, k
- Determine the interval width.
Interval width = Range/k
- Construct a table based on the minimum value, maximum value, the desired number of intervals (k), and the interval width.
- Assign the observations into the respective intervals. Each observation will only fall in one interval since the intervals will not overlap.
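The steps above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name frequency_distribution, the sample data, and the choice to place the maximum value in the last interval are assumptions, not part of the original card. It also assumes the range is greater than zero.

```python
def frequency_distribution(data, k):
    """Build k equal-width intervals and count the observations in each."""
    values = sorted(data)                    # sort ascending
    lo, hi = values[0], values[-1]
    width = (hi - lo) / k                    # interval width = range / k
    counts = [0] * k
    for x in values:
        # Map x to its interval; the maximum value goes into the last interval.
        idx = min(int((x - lo) / width), k - 1)
        counts[idx] += 1

    n = len(values)
    table, cumulative = [], 0.0
    for i, absolute in enumerate(counts):
        relative = absolute / n              # relative frequency
        cumulative += relative               # cumulative relative frequency
        table.append({
            "lower": lo + i * width,
            "upper": lo + (i + 1) * width,
            "absolute": absolute,
            "relative": relative,
            "cumulative_relative": cumulative,
        })
    return table


# Hypothetical sample, invented for illustration.
sample = [2, 4, 4, 5, 7, 8, 9, 10, 11, 12]
table = frequency_distribution(sample, k=5)
for row in table:
    print(row)
```

Each row carries all three frequency types, so the last row's cumulative relative frequency should come out at (approximately) 1.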
a contingency table
used to present the frequency distributions for multiple categorical variables simultaneously
the data in a contingency table can display absolute frequencies or relative frequencies
joint frequencies
the individual cells in a contingency table, each at the intersection of one row category and one column category (one value from each of the two variables)
marginal frequencies
last row and last column of the contingency table
shows the total per variable
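As a rough illustration of joint and marginal frequencies, the sketch below tallies a small set of paired categorical observations. The sector/size data and all variable names are hypothetical.

```python
from collections import Counter

# Hypothetical observations: (sector, size) pairs, invented for illustration.
observations = [
    ("Tech", "Large"), ("Tech", "Small"), ("Tech", "Large"),
    ("Energy", "Small"), ("Energy", "Small"), ("Energy", "Large"),
    ("Tech", "Large"), ("Energy", "Small"),
]

joint = Counter(observations)                           # joint frequencies (cells)
row_totals = Counter(row for row, _ in observations)    # marginal: last column
col_totals = Counter(col for _, col in observations)    # marginal: last row
n = len(observations)

# Relative joint frequencies: each cell divided by the grand total.
relative_joint = {cell: count / n for cell, count in joint.items()}

print(joint[("Tech", "Large")], row_totals["Tech"], col_totals["Small"])
```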
confusion matrix
a contingency table of actual versus predicted classes
constructed to evaluate the performance of a classification (prediction) model
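A minimal sketch of a two-class confusion matrix follows. The labels are invented, and accuracy is used here only as one example of a performance measure that can be read off the table; the original card does not name a specific measure.

```python
from collections import Counter

# Hypothetical labels (1 = positive class, 0 = negative class).
actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

# Each cell of the confusion matrix is a joint frequency of (actual, predicted).
cells = Counter(zip(actual, predicted))
tp, fn = cells[(1, 1)], cells[(1, 0)]   # actual positives
fp, tn = cells[(0, 1)], cells[(0, 0)]   # actual negatives

accuracy = (tp + tn) / len(actual)      # share of correct predictions
print(tp, fn, fp, tn, accuracy)
```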
chi-square test of independence
test relationships between different variables in a contingency table
how to perform a chi-square test of independence
- Use marginal frequencies in the contingency table to construct another table with expected values of the observations
- Compare the expected values to the actual values to derive the chi-square test statistic
- Compare the chi-square test statistic to the chi-square critical value for a given level of significance
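The three steps can be sketched as follows. The expected value of each cell is (row total × column total) / grand total, and the statistic sums (observed − expected)² / expected over all cells. The function name and the 2×2 counts are hypothetical; 3.841 is the standard chi-square critical value for 1 degree of freedom at the 5% significance level.

```python
def chi_square_statistic(observed):
    """observed: list of rows of absolute joint frequencies."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(observed):
        for j, o_ij in enumerate(row):
            # Expected frequency under independence, from the marginal totals.
            e_ij = row_totals[i] * col_totals[j] / total
            statistic += (o_ij - e_ij) ** 2 / e_ij

    # Degrees of freedom = (rows - 1) * (columns - 1).
    dof = (len(observed) - 1) * (len(col_totals) - 1)
    return statistic, dof


# Hypothetical 2x2 contingency table of absolute frequencies.
observed = [[30, 10],
            [20, 40]]
statistic, dof = chi_square_statistic(observed)

CRITICAL_5PCT_DF1 = 3.841               # chi-square critical value, df = 1, 5% level
reject_independence = statistic > CRITICAL_5PCT_DF1
print(statistic, dof, reject_independence)
```

With these made-up counts the statistic exceeds the critical value, so the claim of independence would be rejected.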
chi-square test of independence
Test statistic > Critical value:
conclusion and implication
conclusion: Reject the claim of independence
implication: There is a significant association between the categorical variables
chi-square test of independence
Test statistic < Critical value:
conclusion and implication
conclusion: Do not reject the claim of independence
implication: There is no significant association between the categorical variables
in a frequency distribution, the absolute frequency measure most likely:
A: represents the percentages of each unique value of the variable.
B: represents the actual number of observations counted for each unique value of the variable.
C: allows for comparisons between datasets with different numbers of total observations.
B: represents the actual number of observations counted for each unique value of the variable.
A histogram
uses a chart to present the distribution of numerical data
non-overlapping intervals
from frequency distribution table
useful in presenting the frequency distribution of numerical data
A frequency polygon
similar to a histogram
However, rather than using bars, a frequency polygon plots each interval’s midpoint on the x-axis and the absolute frequency on the y-axis
points are connected through line segments
A cumulative frequency distribution chart
can also be used to illustrate the cumulative frequency distribution.
This shows how many observations lie below a certain value.
The curve is steepest over the range of values where most observations lie
bar chart
more appropriate when handling the frequency distribution of categorical data
types of bar chart
pareto chart
A grouped bar chart (or clustered bar chart)
stacked bar chart
A Pareto chart
sorts the categories by frequency in descending order along with a cumulative relative frequency line
A grouped bar chart (or clustered bar chart)
may also be used to show joint frequencies when there are multiple categorical variables.
stacked bar chart
instead of placing the bars side by side as in a grouped (clustered) bar chart, the sub-category bars are stacked one on top of the other
tree-map
the frequency distribution of categorical data can be displayed using a tree-map
made up of different colored rectangles that have areas that represent the frequency of each category
provides a clear picture of the category that has the highest frequency.
However, it may become challenging to read when there are too many sub-categories
A word cloud (a.k.a. tag cloud)
used to illustrate the frequency of textual data, which is a type of unstructured data.
It allows analysts to quickly spot the most frequent terms in a report/article
Words which appear more frequently have bigger sizes in the word cloud, and different colors may indicate different sentiments
A line chart
useful for visualizing ordered observations
typically used to present the change in data over time
One of the most common applications of a line chart in the finance industry is showing stock price trend over time
A line chart may accommodate more than one set of data
A bubble line chart
can be used to add a third variable into a two-dimensional line chart
A scatter plot
describes the joint variation in two numerical variables
shows the correlation between the variables at a particular point in time, which can be none, linear, or non-linear
The degree of association is shown by the distance between the data points and the line of best fit.
A scatter plot matrix
can be used to visualize pairwise associations for more than two variables.
A heat map
used to visualize the frequency distribution of categorical data
It enhances the presentation of a contingency table by introducing a color spectrum based on the frequency distribution
which type of chart should we use to explore or present a relationship between variables?
Scatter Plot
Scatter Plot Matrix
Heat Map
which type of chart should we use to explore or present a comparison among categories?
Bar Chart
Tree-map
Heat map
which type of chart should we use to explore or present a comparison over time?
Line Chart (two variables)
Bubble Line chart (three variables)
which type of chart should we use to explore or present a distribution with numerical data?
Histogram
Frequency polygon
cumulative distribution chart
which type of chart should we use to explore or present a distribution with categorical data?
Bar Chart
tree-map
Heat Map
which type of chart should we use to explore or present a distribution with unstructured data?
Word cloud
A bar chart is similar to a histogram except it offers an alternative presentation of the same data
is this true or false?
false
the most common statistical measures
Measures of central tendency
Measures of central tendency
indicate where the data are centered.
The most common central tendency measures
arithmetic mean
median
mode
weighted mean
geometric mean
population
all possible observations
extremely difficult to collect data on an entire population
sample statistic
computed from a sample, which is a subset of the population
can then be used to draw inferences about the corresponding population parameter
the most common measure of where the data are centered
The arithmetic mean
The arithmetic mean
the sum of the observations divided by the number of observations
basically, the easiest type of average
The sample mean
the arithmetic mean for a sample
The value of the mean is extremely sensitive to extreme values or outliers
true or false
true
three ways to deal with outliers
No adjustments: This is appropriate if all values are equally important and meaningful.
Remove all outliers: A trimmed mean
Replace outliers with another value: A winsorized mean
a trimmed mean
removing all outliers
calculated by discarding a certain percentage of the highest and lowest values
For example, with a sample of 100 observations, a 2% trimmed mean would be the arithmetic mean without the highest value (top 1%) and the lowest value (bottom 1%).
a winsorized mean
adjusts any outliers’ values to either an upper or lower limit.
No observations are excluded from the calculation
Replaces outliers with another value
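The trimmed and winsorized means can be compared with a short sketch. The function names and the data are hypothetical, and pct is assumed to be an integer total percentage split evenly between the two tails (so a 20% trim on 10 observations drops one value from each end).

```python
def trimmed_mean(data, pct):
    """Discard pct/2 percent of observations from each tail, then average."""
    values = sorted(data)
    cut = int(len(values) * pct // 200)      # observations dropped per tail
    kept = values[cut:len(values) - cut]
    return sum(kept) / len(kept)


def winsorized_mean(data, pct):
    """Replace pct/2 percent in each tail with the nearest retained value."""
    values = sorted(data)
    cut = int(len(values) * pct // 200)
    if cut:
        values[:cut] = [values[cut]] * cut           # raise low outliers
        values[-cut:] = [values[-cut - 1]] * cut     # cap high outliers
    return sum(values) / len(values)                 # no observation is excluded


# Hypothetical data with one extreme value.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
print(sum(data) / len(data))       # arithmetic mean: 14.5
print(trimmed_mean(data, 20))      # 5.5
print(winsorized_mean(data, 20))   # 5.5
```

Both adjustments pull the average away from the outlier, but only the trimmed mean reduces the number of observations used.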
The median
the middle item of a sorted list
does not use all information about the observations because it only focuses on their relative position
more complicated to calculate (less mathematically tractable) than the mean
if n is odd, what is the median
the ((n+1)/2)th term of the sorted list
if n is even, what is the median
the mean of the (n/2)th and ((n+2)/2)th terms of the sorted list
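A small sketch of the odd and even cases, using 1-based positions converted to Python's 0-based indexing. The function name median is just illustrative.

```python
def median(data):
    values = sorted(data)
    n = len(values)
    if n % 2 == 1:
        # Odd n: the ((n + 1) / 2)th term; subtract 1 for 0-based indexing.
        return values[(n + 1) // 2 - 1]
    # Even n: mean of the (n / 2)th and ((n + 2) / 2)th terms.
    return (values[n // 2 - 1] + values[(n + 2) // 2 - 1]) / 2


print(median([5, 1, 3]))       # 3
print(median([4, 1, 3, 2]))    # 2.5
```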
The mode
the most frequently occurring value in a distribution
Some distributions have more than one mode, while others have none
true or false
true
A distribution with just one mode
unimodal
A distribution with two modes
bimodal
modal intervals
mode for data grouped in intervals
it would be the interval with the highest bar
The weighted mean formula
$\bar{X}_w = \sum_{i=1}^{n} w_i X_i$, where the weights $w_i$ sum to 1
uses of weighted average in finance
used to calculate past portfolio or index returns
They can also be used to calculate future expected returns by weighting various scenarios
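As an illustration of the portfolio use case, the sketch below weights each asset's return by its portfolio weight. The returns, weights, and function name are hypothetical, and the weights are assumed to sum to 1.

```python
def weighted_mean(values, weights):
    # Assumption: weights are expressed as fractions that sum to 1.
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(w * x for w, x in zip(weights, values))


# Hypothetical asset returns and portfolio weights.
returns = [0.10, 0.04, -0.02]
weights = [0.5, 0.3, 0.2]
portfolio_return = weighted_mean(returns, weights)
print(portfolio_return)
```

The same function applies to expected returns if the values are scenario returns and the weights are scenario probabilities.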
The geometric mean
used to average rates over time or compute growth rates
It is often used to average portfolio returns from different time periods
The geometric mean is always less than or equal to the arithmetic mean
formula for geometric mean
$R_G = \sqrt[n]{(1 + R_1)(1 + R_2) \cdots (1 + R_n)} - 1$, where the 1 is subtracted after taking the nth root (not under it)
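A minimal sketch of the geometric mean return: multiply the (1 + return) terms, take the nth root, then subtract 1. The three period returns are made up, and math.prod requires Python 3.8 or later.

```python
import math


def geometric_mean_return(returns):
    growth = math.prod(1 + r for r in returns)     # compounded growth factor
    return growth ** (1 / len(returns)) - 1        # nth root, then subtract 1


# Hypothetical period returns.
returns = [0.10, -0.05, 0.20]
geometric = geometric_mean_return(returns)
arithmetic = sum(returns) / len(returns)
print(geometric, arithmetic)
```

Compounding the geometric mean over the three periods reproduces the total growth, and it comes out below the arithmetic mean for these returns.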
The harmonic mean
not as commonly used
The observation’s weight is inversely proportional to its magnitude
Smaller weights are assigned to larger observations
This property reduces the sensitivity of the harmonic mean to extremely large outliers
harmonic mean formula
$\bar{X}_H = \dfrac{n}{\sum_{i=1}^{n} (1/X_i)}$
when is the harmonic mean useful
when the data consists of ratios (e.g., P/Es)
It would also be appropriate if the analyst wants the average price paid for a security when investing the same dollar amount for several time periods (also known as the cost averaging technique)
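The cost averaging case can be checked with a short sketch: investing the same dollar amount at each price gives an average cost per share equal to the harmonic mean of the prices. The prices, the dollar amount, and the function name are hypothetical.

```python
def harmonic_mean(values):
    # Assumes all values are positive.
    return len(values) / sum(1 / x for x in values)


# Hypothetical purchase prices over two periods, same dollars invested each time.
prices = [10.0, 20.0]
dollars_per_period = 1000.0

shares_bought = sum(dollars_per_period / p for p in prices)
average_cost = dollars_per_period * len(prices) / shares_bought

print(harmonic_mean(prices), average_cost, sum(prices) / len(prices))
```

More shares are bought at the lower price, which is why the average cost sits below the arithmetic mean of the prices.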
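is the harmonic mean always more or less than the geometric mean?
what about the arithmetic mean?
harmonic mean ≤ geometric mean
geometric mean ≤ arithmetic mean, so harmonic ≤ geometric ≤ arithmetic (for positive values; they are equal only when all observations are identical)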
which Mean to use if we have a sample and we want to include all values including outliers?
arithmetic mean
which Mean to use if we have a sample and we want to do compounding?
geometric mean
the quantile
a value at or below which a stated fraction of the data is found, once the observations are arranged in ascending order
most common quantiles
four quartiles
five quintiles
ten deciles
one hundred percentiles
The yth percentile
the value at or below which y percent of the observations lie
For example, the 90th percentile score (P90) on an exam is the number that separates the top 10% scores from the bottom 90%
The interquartile range (IQR)
the difference between the third quartile and the first quartile
IQR = Q3 - Q1
how is the location of the yth percentile (Ly) calculated in a list of n observations ranked in ascending order (lowest to highest)?
Ly = (n + 1) * y/100
how do you find the percentile itself (Py)?
take a weighted average of the two neighboring ranked observations, using the decimal part of Ly (the location of the percentile) as the weight on the higher one
if Ly = 14.07
Py = X14 * (1 - 0.07) + X15 * 0.07, where X14 and X15 are the 14th and 15th ranked observations
if Ly = 16.63
Py = X16 * (1 - 0.63) + X17 * 0.63
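The location formula and the interpolation step can be sketched together. The function name and data are hypothetical, and clamping to the smallest or largest observation when the location falls outside the list is an assumption added here, not something the cards specify.

```python
def percentile(data, y):
    values = sorted(data)                  # rank in ascending order
    n = len(values)
    location = (n + 1) * y / 100           # Ly, a 1-based position
    lower = int(location)                  # integer part of Ly
    fraction = location - lower            # decimal part of Ly
    if lower < 1:                          # assumption: clamp below the first value
        return values[0]
    if lower >= n:                         # assumption: clamp above the last value
        return values[-1]
    # Weighted average of the two neighboring ranked observations.
    return values[lower - 1] * (1 - fraction) + values[lower] * fraction


# Hypothetical data, n = 9.
data = [10, 20, 30, 40, 50, 60, 70, 80, 90]
q1 = percentile(data, 25)    # Ly = 2.5 -> halfway between 20 and 30
q3 = percentile(data, 75)    # Ly = 7.5 -> halfway between 70 and 80
iqr = q3 - q1                # interquartile range, Q3 - Q1
print(q1, q3, iqr)
```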
The dispersion of data across quartiles can be visualized how?
using a box and whisker plot
box and whisker plot
The “box” has a height equal to the interquartile range and is connected by two “whiskers.”
The two whiskers are bounded by the “fences,”
the fences are the highest and the lowest values of the observations
dispersion (or variability) around the mean
dispersion addresses risk
The most common measures of absolute dispersion
range
mean absolute deviation
variance
standard deviation
The range
the difference between the maximum and minimum values
limited as a dispersion measure because it uses only the highest and lowest values
The mean absolute deviation (MAD)
uses all the observations in the sample, which makes it better than the range
MAD formula
$\text{MAD} = \dfrac{\sum_{i=1}^{n} |X_i - \bar{X}|}{n}$, the average of the absolute deviations of each observation from the sample mean
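A minimal sketch of the MAD calculation: sum the absolute deviations from the sample mean and divide by the number of observations. The data and function name are invented for illustration.

```python
def mean_absolute_deviation(data):
    mean = sum(data) / len(data)
    # Absolute values keep positive and negative deviations from cancelling out.
    return sum(abs(x - mean) for x in data) / len(data)


# Hypothetical sample with mean 6.
data = [2, 4, 6, 8, 10]
mad = mean_absolute_deviation(data)
print(mad)
```

Without the absolute values, the deviations from the mean would always sum to zero.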