Reading 2: Organizing, Visualizing, and Describing Data Flashcards
Identify and Compare data types
Data: defined as a collection of numbers, characters, words, and text, as well as images, audio, and video, in a raw or organized format to represent facts or information.
Numerical/Quantitative Data: values that represent measured or counted quantities as a number, which can be split into the following two categories:
1. Continuous Data: data that can be measured and can take on any numerical value in a specified range of values; 2. Discrete Data: numerical values that result from a counting process and are limited to a finite number of values.
Categorical/Qualitative Data: values that describe a quality or characteristic of a group of observations and can be used as labels to divide a dataset into groups to summarize and visualize. It can be split into the following two categories:
1. Nominal Data: categorical values that are not amenable to being organized in a logical order; 2. Ordinal Data: categorical values that can be logically ordered or ranked.
Cross-sectional data: list of observations of a specific variable from multiple observational units at a given point in time.
Time-series data: sequence of observations of a specific variable for a single observational unit, collected over time.
Panel data: mix of cross-sectional and time-series data that is frequently used in financial analysis and modeling
Structured data: highly organized in a pre-defined manner, usually with repeating patterns
Unstructured data: data that does not follow any conventionally organized form
Describe how data is organized for quantitative analysis
Organizing data into a one-dimensional or two-dimensional array is typically the first step in data analytics and modeling
Interpret frequency and related distributions
Frequency distribution (one-way table): a tabular display of data constructed either by counting the observations of a variable by distinct values or groups or by tallying the values of a numerical variable into a set of numerically ordered bins.
Absolute Frequency: actual number of observations counted for each unique value of the variable
Relative Frequency: calculated as the absolute frequency divided by the total number of observations
Procedure for constructing a frequency distribution:
1. Sort the data in ascending order
2. Compute the range: Max - Min
3. Decide the number of bins (k) in the frequency distribution
4. Determine the bin width: Range / k
5. Determine the first bin by adding the bin width to the minimum value, then determine each remaining bin by adding the bin width to the previous bin's upper limit
6. Determine the number of observations falling into each bin
7. Construct a table of the bins from smallest to largest
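Example: the steps above can be sketched in Python. This is a minimal illustration, not part of the reading. The function name, the sample returns, and the choice of k = 5 are hypothetical, and the sketch assumes the maximum is strictly greater than the minimum and that each bin includes its lower limit but not its upper limit (except the last bin, which includes the maximum).

```python
def frequency_distribution(data, k):
    """Return a list of bins with absolute, relative, and cumulative relative frequencies."""
    values = sorted(data)                 # Step 1: sort ascending
    lo, hi = values[0], values[-1]
    width = (hi - lo) / k                 # Steps 2-4: range divided by number of bins
    counts = [0] * k
    for x in values:
        # Each bin covers [lower, upper); the last bin also includes the maximum.
        idx = min(int((x - lo) / width), k - 1)
        counts[idx] += 1                  # Step 6: count observations per bin
    n = len(values)
    table, cumulative = [], 0
    for i, count in enumerate(counts):    # Step 7: bins from smallest to largest
        cumulative += count
        table.append({
            "lower": lo + i * width,      # Step 5: each bin starts where the last ended
            "upper": lo + (i + 1) * width,
            "absolute": count,
            "relative": count / n,
            "cumulative_relative": cumulative / n,
        })
    return table


# Hypothetical returns (in percent), used only for illustration
returns = [-4.0, -2.5, -1.0, 0.5, 1.0, 1.5, 2.0, 3.5, 4.0, 6.0]
freq_table = frequency_distribution(returns, k=5)
for row in freq_table:
    print(row)
```

The relative frequencies sum to 1, and the last cumulative relative frequency equals 1.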
Interpret a Contingency Table
Contingency Table: tabular format that displays the frequency distributions of two or more categorical variables simultaneously and is used to find patterns between the variables.
Note:
- Each variable in a contingency table must have a finite number of levels, which can either be ordered (ordinal data) or unordered (nominal data)
- Joint frequencies - the counts in the body of the table, each joining one level of the row variable with one level of the column variable
- Marginal frequencies - the sum of the joint frequencies across a row or down a column
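Example: a small contingency table built with Python's standard library. This is an illustrative sketch only. The sector and size labels and the observations are hypothetical.

```python
from collections import Counter

# Hypothetical observations: one (sector, size) pair per stock
observations = [
    ("Energy", "Small"), ("Energy", "Large"), ("Energy", "Large"),
    ("Tech", "Small"), ("Tech", "Small"), ("Tech", "Large"),
    ("Health", "Small"), ("Health", "Large"),
]

# Joint frequencies: count of each (row level, column level) combination
joint = Counter(observations)

# Marginal frequencies: joint frequencies summed across each row and each column
row_totals = Counter(sector for sector, _ in observations)
col_totals = Counter(size for _, size in observations)
total = len(observations)

for sector in row_totals:
    cells = {size: joint[(sector, size)] for size in col_totals}
    print(sector, cells, "row total:", row_totals[sector])
print("column totals:", dict(col_totals), "grand total:", total)
```

Dividing each joint frequency by the grand total, or by a row or column total, gives the corresponding relative frequencies.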
Describe ways that data may be visualized and evaluate the use of specific visualizations
Visualization: the presentation of data in a pictorial or graphical form for the purpose of increasing understanding and gaining insights into the data.
Types of Visualizations:
Histogram: chart that presents the distribution of numerical data by using the height of a bar or column to represent the absolute frequency or relative frequency of each bin or interval in the distribution.
Frequency Polygon: another tool, like the histogram, for displaying a frequency distribution. It plots the midpoint of each interval on the x-axis against the absolute frequency of that interval on the y-axis and connects the points with straight lines, so that frequency is displayed as an area under the curve.
Cumulative Frequency Distribution Chart: another way to visualize a frequency distribution. The chart plots either the cumulative absolute frequency or the cumulative relative frequency on the y-axis against the upper limit of each interval on the x-axis.
Bar Chart: Frequency distribution of categorical data can be plotted in a bar chart. Each bar represents a distinct category, with bar heights proportional to the frequency of the corresponding category.
Pareto Chart: a special case of a bar chart in which the categories are ordered by frequency in descending order and a line displays the cumulative relative frequency.
Grouped Bar Chart / Clustered Bar Chart: a bar chart used with two categorical variables to show their joint frequencies.
Tree Map: another graphical tool for displaying categorical data. It consists of a set of colored rectangles that represent distinct groups, with the area of each rectangle proportional to the value of the corresponding group.
Word Cloud: a visual device for representing textual data
Line Chart: type of graph used to visualize ordered observations. A line chart is often used to display the change in a data series over time. (X-axis: time / Y-axis: variable)
Bubble Line Chart: adds a dimension to the line chart, using bubbles of varying size to represent a third dimension of the data.
Scatter Plot: a type of graph for visualizing the joint variation in two numerical variables. X-axis: 1 variable and Y-Axis: the other variable.
Scatter Plot Matrix: a useful tool for organizing scatter plots between pairs of variables, making it easy to inspect all pairwise relationships in one combined visual.
Heat Map: type of graphic that organizes and summarizes data in a tabular format and represents the values using a color spectrum.
Describe how to select among visualization types
Relationships Assessment: scatter plots, scatter plot matrix, and Heat map
Distributions based on the following data:
- Numerical data: histogram, frequency polygon, cumulative distribution chart
- Categorical Data: bar chart, Pareto chart, tree map, and heat map
- Unstructured Data: word cloud
Comparisons among categories: bar chart, tree map, and heat map
Comparisons over time: line chart, bubble line chart
Calculate and interpret measures of central tendency
Measures of Central Tendency: specify where the data are centered
Measures of Location: include not only measures of central tendency but also other measures that illustrate the location or distribution of data
Arithmetic mean: sum of the values of the observations divided by the number of observations
Formula: X̄ = ΣXi / n (X̄ = sample mean, n = number of observations)
Median: value of the middle item of a set of items sorted in ascending or descending order.
Note: with the n items sorted, an odd n puts the median at position (n+1)/2; an even n makes the median the mean of the items at positions n/2 and (n/2)+1.
Mode: the most frequently occurring value in a distribution. Note, the highest frequency interval is called the modal interval.
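Example: the three measures of central tendency in Python. This is a minimal sketch with hypothetical data. The `median` helper is written out by hand only to show the position rule for odd and even n; the standard library's `statistics.median` gives the same result.

```python
import statistics


def median(values):
    """Middle item of the sorted values (positions are 1-based in the comments)."""
    s = sorted(values)
    n = len(s)
    if n % 2 == 1:
        return s[(n + 1) // 2 - 1]            # odd n: item at position (n+1)/2
    return (s[n // 2 - 1] + s[n // 2]) / 2    # even n: mean of positions n/2 and n/2 + 1


data = [2, 3, 3, 5, 7, 10]                    # hypothetical observations
mean_value = sum(data) / len(data)            # arithmetic mean
median_value = median(data)
mode_value = statistics.mode(data)            # most frequently occurring value

print(mean_value, median_value, mode_value)
```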
Evaluate alternative definitions of mean to address an investment problem
Arithmetic mean: standard, which is your normal average
Weighted mean: allows different (unequal) weights to be applied to different observations
Geometric mean: most frequently utilized to average rates of change over time or to compute the growth rate of a variable.
Harmonic mean: the value obtained by summing the reciprocals of the observations, averaging that sum by dividing it by the number of observations, and finally taking the reciprocal of the average. The harmonic mean can be useful in the presence of outliers and is best suited to data consisting of rates and ratios. It is a relatively specialized concept of the mean that is appropriate for averaging ratios when the ratios are repeatedly applied to a fixed quantity to yield a variable number of units.
Note: The geometric mean is always less than or equal to the arithmetic mean; the two are equal only when all observations are identical. The more dispersed the cash flows or returns, the larger the gap between the arithmetic mean and the geometric mean.
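Example: the alternative means side by side in Python. This is an illustrative sketch only. The function names and all input values are hypothetical, the weighted mean assumes the weights sum to 1, and the geometric mean is applied to periodic returns by compounding (1 + r).

```python
import math


def weighted_mean(values, weights):
    """Sum of weight * value; the weights are assumed to sum to 1."""
    return sum(w * x for w, x in zip(weights, values))


def geometric_mean_return(returns):
    """Compound the returns, take the n-th root, and subtract 1."""
    growth = math.prod(1 + r for r in returns)
    return growth ** (1 / len(returns)) - 1


def harmonic_mean(values):
    """Reciprocal of the average of the reciprocals (values must be positive)."""
    return len(values) / sum(1 / x for x in values)


returns = [0.10, -0.05, 0.20]                       # hypothetical annual returns
arithmetic = sum(returns) / len(returns)
geometric = geometric_mean_return(returns)          # average compound growth rate
weighted = weighted_mean([0.10, 0.04], [0.6, 0.4])  # e.g. a 60/40 portfolio return
harmonic = harmonic_mean([10.0, 20.0])              # e.g. average price paid when the
                                                    # same amount is invested at each price
print(arithmetic, geometric, weighted, harmonic)
```

With these inputs the geometric mean return comes out below the arithmetic mean return, consistent with the note above.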
Calculate quantiles and interpret related visualizations
Quantile (fractile): general term for a value at or below which a stated fraction of the data lies.
Quartiles: divide the distribution into quarters
Quintiles: divide the distribution into fifths
Deciles: divide the distribution into tenths
Percentiles: divide the distribution into hundredths
Formula:
Ly = (n + 1) × y / 100, where y is the percentile and n is the number of observations sorted in ascending order. If Ly is not a whole number, use linear interpolation: with k equal to Ly rounded down, the percentile value is X(k) + (Ly - k) × (X(k+1) - X(k)), where X(k) is the k-th sorted observation.
Data across quantiles is best visualized with a diagram such as the box and whisker plot.
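Example: the percentile location formula with linear interpolation, sketched in Python. The function name and data are hypothetical. Returning the smallest or largest observation when the location falls outside 1..n is an assumption of this sketch, not something stated in the notes.

```python
def percentile(values, y):
    """y-th percentile using the location Ly = (n + 1) * y / 100."""
    s = sorted(values)
    n = len(s)
    loc = (n + 1) * y / 100          # Ly, a 1-based position in the sorted data
    k = int(loc)                     # Ly rounded down
    if k < 1:                        # assumption: clamp to the minimum
        return s[0]
    if k >= n:                       # assumption: clamp to the maximum
        return s[-1]
    # Interpolate between the k-th and (k+1)-th sorted observations
    return s[k - 1] + (loc - k) * (s[k] - s[k - 1])


data = [10, 20, 30, 40, 50, 60, 70, 80, 90]    # hypothetical observations
q1 = percentile(data, 25)                      # first quartile
q2 = percentile(data, 50)                      # median
q3 = percentile(data, 75)                      # third quartile
print(q1, q2, q3, "interquartile range:", q3 - q1)
```

The quartiles and the interquartile range computed this way are the quantities a box and whisker plot displays.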
Calculate and interpret measures of dispersion
Dispersion: the variability around the central tendency. If the mean addresses reward, then dispersion represents risk.
Absolute Dispersion: amount of variability present without comparison to any reference point or benchmark
Range: Max value - Min value. One limitation of the range is that it uses only two observations, so it cannot tell us how the data are distributed.
Mean Absolute Deviation (“MAD”): the average of the absolute values of the deviations from the mean; taking absolute values ignores the signs of the deviations. Formula: MAD = Σ|Xi - X̄| / n (X̄ = sample mean)
Variance Formula (sample): s² = Σ(Xi - X̄)² / (n - 1)
Standard Deviation Formula: s = square root of the variance
Relative Dispersion: amount of dispersion relative to a reference value or benchmark
Coefficient of Variation: represents the amount of risk per unit of reward (the mean should be positive for the measure to be meaningful).
- Formula: CV = sample standard deviation / sample mean
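Example: the dispersion measures above computed in Python. This is a minimal sketch with hypothetical data; the function name and dictionary keys are illustrative. The variance uses n - 1 in the denominator (sample variance), while MAD divides by n.

```python
import math


def dispersion_measures(values):
    """Range, MAD, sample variance, sample standard deviation, and CV."""
    n = len(values)
    mean = sum(values) / n
    value_range = max(values) - min(values)
    mad = sum(abs(x - mean) for x in values) / n              # divides by n
    variance = sum((x - mean) ** 2 for x in values) / (n - 1)  # divides by n - 1
    std_dev = math.sqrt(variance)
    cv = std_dev / mean                                        # assumes a positive mean
    return {"range": value_range, "mad": mad, "variance": variance,
            "std_dev": std_dev, "cv": cv}


data = [2, 4, 4, 6, 9]               # hypothetical observations, mean = 5
stats = dispersion_measures(data)
print(stats)
```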
Calculate and Interpret target downside Deviation
Downside Deviation (downside risk): a measure of risk that focuses on returns below the mean or below some specified minimum target return. The target downside deviation, also referred to as the target semi-deviation, is a measure of the dispersion of the observations below the target.
Formula: square root of the sum of squared deviations from the target, taken only over observations at or below the target, divided by n - 1, where n is the total number of observations (not just those below the target).
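Example: target downside deviation in Python. This is a sketch under the reading of the formula given above: squared deviations are taken from the target, only observations at or below the target contribute, and the denominator is n - 1 with n counting all observations. The function name, returns, and target are hypothetical.

```python
import math


def target_downside_deviation(returns, target):
    """Dispersion of the observations at or below the target."""
    n = len(returns)                                  # all observations
    shortfall_sq = sum((r - target) ** 2 for r in returns if r <= target)
    return math.sqrt(shortfall_sq / (n - 1))


monthly_returns = [5, -3, 2, -1, 4]   # hypothetical returns, in percent
target = 1                            # hypothetical target return, in percent
tdd = target_downside_deviation(monthly_returns, target)
print(tdd)
```

Only the returns of -3 and -1 fall at or below the target here, so only their squared shortfalls (16 and 4) enter the numerator.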
Interpret skewness
Skewness: describes the degree to which a return distribution is not symmetric about its mean.
Positively skewed distribution: frequent small losses and a few extreme gains (long right tail). Mean > Median > Mode
Negatively skewed distribution: frequent small gains and a few extreme losses (long left tail). Mean < Median < Mode
Skewness Formula: computed as the average cubed deviation from the mean, standardized by dividing by the standard deviation cubed to make the measure free of scale.
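Example: the skewness calculation sketched in Python. The function name and data are hypothetical. Using the sample standard deviation (n - 1 denominator) for the standardization is an assumption of this sketch; other conventions apply small-sample adjustments.

```python
import math


def skewness(values):
    """Average cubed deviation from the mean, divided by the standard deviation cubed."""
    n = len(values)
    mean = sum(values) / n
    std_dev = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    avg_cubed_dev = sum((x - mean) ** 3 for x in values) / n
    return avg_cubed_dev / std_dev ** 3


right_tail = skewness([1, 2, 3, 4, 10])       # one extreme gain: positive skew
left_tail = skewness([-10, -4, -3, -2, -1])   # one extreme loss: negative skew
symmetric = skewness([1, 2, 3, 4, 5])         # symmetric data: zero skew
print(right_tail, left_tail, symmetric)
```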
Interpret Kurtosis
Kurtosis: measure of the combined weight of the tails of a distribution relative to the rest of the distribution. The types of kurtosis are:
- Leptokurtic (fat tailed): distribution that has fatter tails than the normal distribution. It generates more frequent extremely large deviations from the mean than the normal distribution.
- Platykurtic (thin tailed): distribution that has thinner tails than the normal distribution. It generates less frequent extremely large deviations from the mean than the normal distribution.
- Mesokurtic (normal): distribution similar to the normal distribution in the relative weight of its tails
Excess Kurtosis: computed by averaging the deviations from the mean raised to the 4th power, standardizing that average by dividing by the standard deviation raised to the 4th power, and then subtracting 3 (the kurtosis of the normal distribution). Positive: fatter tails than normal / negative: thinner tails than normal.
Note: most equity return series have been found to be leptokurtic.
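Example: excess kurtosis sketched in Python. The function name and data are hypothetical. As an assumption of this sketch, the standardization uses the sample variance (n - 1 denominator) squared, and no small-sample adjustment is applied.

```python
def excess_kurtosis(values):
    """Average 4th-power deviation over the standard deviation to the 4th power, minus 3."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((x - mean) ** 2 for x in values) / (n - 1)
    avg_fourth_dev = sum((x - mean) ** 4 for x in values) / n
    return avg_fourth_dev / variance ** 2 - 3     # variance squared = std dev to the 4th


flat = excess_kurtosis([1, 2, 3, 4, 5])           # evenly spread data: thin tails
print(flat)
```

A negative result, as for the evenly spread data here, indicates thinner tails than the normal distribution; a leptokurtic return series would give a positive value.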
Interpret correlation between two variables
Correlation: a measure of the linear relationship between two random variables.
Steps to compute the correlation:
1st step: Compute the sample covariance, a measure of how two variables vary together (their joint variability). It is the sum of the products of each pair of deviations from the two sample means, divided by n - 1: Cov(X,Y) = Σ(Xi - X̄)(Yi - Ȳ) / (n - 1). Using n - 1 in the denominator makes the sample covariance an unbiased estimate of the population covariance. If the random variables are returns, the units are returns squared.
2nd step: Compute the correlation coefficient, which is the ratio of the sample covariance to the product of the two variables' sample standard deviations: r = Cov(X,Y) / (sX × sY).
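Example: the two steps sketched in Python. The function names and data are hypothetical, and both the covariance and the standard deviations use n - 1 in the denominator.

```python
import math


def sample_covariance(x, y):
    """Step 1: sum of products of paired deviations from the means, over n - 1."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)


def correlation(x, y):
    """Step 2: covariance divided by the product of the sample standard deviations."""
    std_x = math.sqrt(sample_covariance(x, x))    # the covariance of x with itself is its variance
    std_y = math.sqrt(sample_covariance(y, y))
    return sample_covariance(x, y) / (std_x * std_y)


x = [1, 2, 3, 4, 5]
cov_up = sample_covariance(x, [2, 4, 6, 8, 10])
r_up = correlation(x, [2, 4, 6, 8, 10])           # perfect positive linear relationship
r_down = correlation(x, [10, 8, 6, 4, 2])         # perfect negative linear relationship
r_curve = correlation([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])   # y = x squared
print(cov_up, r_up, r_down, r_curve)
```

The last case gives a correlation of zero even though y is an exact function of x, because the relationship is not linear.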
Properties of Correlations:
- Correlation ranges from -1 to +1
- A correlation of 0 indicates an absence of any linear relationship
- A correlation close to +1 indicates a strong positive linear relationship, while a correlation of exactly +1 indicates a perfect positive linear relationship.
- A correlation close to -1 indicates a strong negative linear relationship, while a correlation of exactly -1 indicates a perfect negative linear relationship.
Limitations of Correlation Analysis:
- Two variables can have a strong non-linear relationship and still have a low correlation
- Outliers can cause correlation to be an unreliable measure
- Spurious correlation: the term used to refer to the following - 1) correlation between two variables that reflects a chance relationship, 2) correlation induced by a calculation that mixes each of the two variables with a third variable, and 3) correlation between two variables arising not from a direct relationship between them, but from their relation to a third variable.