Part 2. Organising, Visualising and Describing Data Flashcards
Data
A collection of numbers, characters, words and text, as well as audio and video, in a raw or organised format to represent facts or information.
Data Classifications
- Numerical vs Categorical Data
- Cross-sectional vs Time Series vs Panel Data
- Structured vs Unstructured Data
Numerical/Quantitative Data
Values that represent measured or counted quantities as a number.
This data can be split into two types:
- Continuous Data
- Discrete Data
Continuous Data
Data that can be measured and can take on any numerical value in a specified range of values.
Examples:
- The future value of a lump-sum investment, which measures the amount of money to be received after a certain period of time at a given interest rate.
- The price return of a stock, which measures the price change over a given period in percentage terms.
Discrete Data
Numerical values that result from a counting process, in which the data is limited to a finite number of values.
Example:
- The frequency of discrete compounding, m, counts the number of times that interest is accrued and paid out in a given year.
e.g. m = 12 means a monthly frequency.
Categorical Data/Qualitative
Values that describe a quality or characteristic of a group of observations, used as labels to divide a data set into groups to summarise and visualise.
Example:
- Bankrupt vs. Not Bankrupt
- Dividends increased vs. No Dividend Action
Nominal Data
Categorical values that are not amenable to being organised in a logical order/rank.
Example:
- Classification of publicly listed stocks into 11 sectors, defined by the Global Industry Classification Standard (GICS).
- Text labels i.e. Sector
- Numerical Label i.e. GICS Code
Ordinal Data
Categorical values that can be logically ordered or ranked, or numbers to identify categories.
Example:
- S&P star ratings for investment funds, in which 1 star represents a group of funds judged to have the worst performance/quality, and 2, 3, 4 and 5 stars represent groups judged to be successively better.
- Ranking growth-oriented investment funds based on their 5-year cumulative returns, e.g. assigning 1 to the top-performing 10% of funds.
Data classification based on data collection:
- Cross-sectional
- Time series
- Panel
Variable (field, attribute, feature)
A characteristic/quantity that can be measured, counted or categorised, and is subject to change.
Example:
- Stock price
- Market capitalisation
- Dividend and dividend yield
- Earnings per share (EPS)
- Price-to-earnings ratio (P/E)
Observation
The value of a specific variable collected at a point in time or over a specified period of time.
Example:
- DEF Inc. recorded EPS of $7.50; this value represented a 15% annual increase.
Cross-sectional Data
A list of the observations of a specific variable from multiple observational units at a given point in time.
These observational units can be individuals, groups, companies, trading markets, regions, etc.
Example:
- January inflation rates (i.e. the variable) for each of the euro area countries (i.e. the observational units) in the EU for a given year constitute cross-sectional data.
Time-series Data
A sequence of observations of a specific variable for a single observational unit, collected over time at discrete, typically equally spaced, intervals,
e.g. daily, weekly, monthly, quarterly, or annually.
Example:
- The daily closing prices (i.e. the variable) of a particular stock recorded for a given month constitute time-series data.
Panel Data
A mix of time-series and cross-sectional data that is frequently used in financial analysis and modeling.
Consists of observations through time on one or more variables for multiple observational units.
Example:
- Quarterly earnings per share in euros of three eurozone companies (Co. A-C) in a given year.
The data across Q1-Q4 for a single company are the time series.
The data across Co. A-C for a single quarter are the cross-section.
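The two ways of slicing a panel can be sketched in a few lines of Python. This is only an illustration: the EPS figures are made up, and the helper names `time_series` and `cross_section` are hypothetical.

```python
# Hypothetical quarterly EPS (in euros) for three companies.
# Keys are (company, quarter); all figures are invented for illustration.
panel = {
    ("Co. A", "Q1"): 1.10, ("Co. A", "Q2"): 1.15,
    ("Co. A", "Q3"): 1.20, ("Co. A", "Q4"): 1.25,
    ("Co. B", "Q1"): 0.80, ("Co. B", "Q2"): 0.85,
    ("Co. B", "Q3"): 0.75, ("Co. B", "Q4"): 0.90,
    ("Co. C", "Q1"): 2.00, ("Co. C", "Q2"): 2.05,
    ("Co. C", "Q3"): 2.10, ("Co. C", "Q4"): 1.95,
}


def time_series(panel, company):
    """One observational unit followed through time (Q1 to Q4)."""
    return [eps for (c, q), eps in sorted(panel.items()) if c == company]


def cross_section(panel, quarter):
    """All observational units at a single point in time."""
    return {c: eps for (c, q), eps in panel.items() if q == quarter}


print(time_series(panel, "Co. A"))
print(cross_section(panel, "Q2"))
```

Fixing the company and varying the quarter gives a time series; fixing the quarter and varying the company gives a cross-section.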
Structured Data
Highly organised in a pre-defined manner, usually with repeating patterns.
Typical form:
- One-dimensional arrays (time series of a single variable)
- Two-dimensional data tables (each column represents a variable or observation unit, and each row contains a set of values for the same columns).
Structured Data Types
- Market data - data issued by stock exchanges, such as intra-day and daily closing stock prices and trading volumes.
- Fundamental data - data contained in financial statements, such as earnings per share, price to earnings ratio, dividend yield and return on equity.
- Analytical data - data derived from analytics such as cash flow projections, or forecasted earnings growth.
Unstructured Data
Data that does not follow any conventionally organised forms.
Usually, financial models are able to take only structured data as inputs, so unstructured data must be transformed into structured data that models can process.
Examples:
- Text - financial news, social media posts
- Audio/Video - management earnings calls, presentations to analysts
Unstructured Data Types
- Produced by individuals (e.g. via social media posts, web searches, etc.)
- Generated by business processes (e.g. credit card transactions, corporate regulatory filings, etc.)
- Generated by sensors (e.g. satellite imagery, foot traffic by mobile devices, etc.)
Raw Data
Data available in their original form as collected. Such data typically cannot be used by humans or computers to directly extract information and insights.
Data can be organized for quantitative analysis using:
- One-dimensional arrays
- Two-dimensional rectangular arrays
One-dimensional array
Simplest form for representing a collection of data of the same data type, suitable for representing a single variable.
Example:
- Daily closing price of ABC Inc. stock, after the company went public.
- Closing prices are time-series data collected at daily intervals.
- Plotting the data against time shows whether they follow an increasing or decreasing trend and whether the time series repeats certain patterns in a systematic way over time. (The data can also be summarised by the central tendency and spread of their distribution.)
Two-dimensional rectangular arrays (data table)
Comprised of columns and rows to hold multiple variables and observations, respectively.
When a data table is used to organize data of a single observational unit, each column represents a different variable of the observational unit, and each row holds an observation for different variables; successive rows represent the observations for successive time periods.
Observations of each variable are a time-series sequence that is sorted in either ascending or descending time order.
Frequency Distribution
A tabular display of data constructed either by counting the observations of a variable by distinct values or groups, or by tallying the values of a numerical variable into a set of numerically ordered bins.
Helps the analysis of large amounts of numerical data, as it requires creating non-overlapping bins (intervals or buckets), and counts observations falling into each bin.
Constructing a frequency distribution:
- Count the number of observations for each unique value of the variable.
- Construct a table listing each unique value and the corresponding counts, and then sort the records by number of counts in descending or ascending order to facilitate the display.
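The two steps above can be sketched with Python's standard library. The sector labels are made-up data, and `frequency_table` is a hypothetical name used only for this illustration.

```python
from collections import Counter

# Hypothetical sector labels for ten stocks (invented data).
sectors = [
    "Energy", "Health Care", "Energy", "Financials", "Energy",
    "Financials", "Health Care", "Energy", "Utilities", "Financials",
]

# Step 1: count the observations for each unique value of the variable.
counts = Counter(sectors)

# Step 2: list each unique value with its count, sorted by count (descending).
frequency_table = sorted(counts.items(), key=lambda item: item[1], reverse=True)

for sector, absolute_frequency in frequency_table:
    print(f"{sector:<12} {absolute_frequency}")
```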
Absolute Frequency
The actual number of observations counted for each unique value of the variable (e.g. each sector).
Relative Frequency
Calculated as the absolute frequency of each unique value of the variable divided by the total number of observations.
This provides a normalised measure of the distribution of the data, allowing comparisons between datasets with different numbers of total observations.
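As a rough sketch of why the normalisation helps, the snippet below compares two made-up datasets of very different sizes. The function name `relative_frequencies` is hypothetical.

```python
from collections import Counter


def relative_frequencies(observations):
    """Absolute frequency of each unique value divided by the total count."""
    counts = Counter(observations)
    total = len(observations)
    return {value: count / total for value, count in counts.items()}


# Two hypothetical datasets: 4 observations and 100 observations.
small = ["Energy", "Energy", "Financials", "Utilities"]
large = ["Energy"] * 50 + ["Financials"] * 25 + ["Utilities"] * 25

print(relative_frequencies(small))
print(relative_frequencies(large))
```

The absolute frequencies differ (2 vs 50 for Energy), but the relative frequencies are the same, so the two distributions can be compared directly.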
Frequency Distribution for Numerical Data
- Sort the data in ascending order.
- Calculate the range of data, defined as Range = Maximum Value - Minimum Value.
- Decide on the number of bins (k) in the frequency distribution.
- Determine bin width as Range/k.
- Determine the first bin by adding the bin width to the minimum value. Then determine the remaining bins by successively adding the bin width to the prior bin's end point, stopping after reaching a bin that includes the maximum value.
- Determine the number of observations falling into each bin by counting the observations whose values are equal to or exceed the bin's minimum value yet are less than the bin's maximum value. The exception is the last bin: its maximum value equals the maximum value of the data, so the observation with the maximum value is included in that bin's count.
- Construct table of bins listed from smallest to largest that shows the no. of observations falling in each bin.
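The steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the only way to bin data: the function name `frequency_distribution` and the sample returns are hypothetical, and the code assumes the maximum exceeds the minimum so that the bin width is nonzero.

```python
def frequency_distribution(values, k):
    """Tally values into k equal-width bins.

    Returns a list of (bin_minimum, bin_maximum, count) from smallest to largest bin.
    Assumes max(values) > min(values).
    """
    data = sorted(values)                       # sort in ascending order
    minimum, maximum = data[0], data[-1]
    width = (maximum - minimum) / k             # bin width = Range / k
    counts = [0] * k
    for x in data:
        # Each bin covers [bin minimum, bin maximum); the min(...) clamp puts
        # the overall maximum value into the last bin.
        index = min(int((x - minimum) / width), k - 1)
        counts[index] += 1
    return [
        (minimum + i * width, minimum + (i + 1) * width, counts[i])
        for i in range(k)
    ]


# Hypothetical returns in percent (invented data).
returns = [-4.0, -2.5, -1.0, 0.0, 0.5, 1.5, 2.0, 3.0, 4.0]

for lower, upper, count in frequency_distribution(returns, 4):
    print(f"{lower:5.1f} to {upper:5.1f}: {count}")
```

With real data, values that sit almost exactly on a bin boundary can be affected by floating-point rounding, so boundary handling may need more care than this sketch gives it.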
Cumulative absolute frequency
Adds up the absolute frequencies as we move from the first bin to the last bin.
Cumulative relative frequency
A sequence of partial sums of the relative frequencies.
For the last bin, the cumulative absolute frequency will equal the number of observations in the dataset (1,258), and the cumulative relative frequency will equal 100%.
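A short sketch of both cumulative measures, using made-up absolute frequencies for four bins. Dividing each cumulative absolute frequency by the total gives the same result as summing the relative frequencies bin by bin.

```python
from itertools import accumulate

# Hypothetical absolute frequencies for four bins, smallest bin first.
absolute = [2, 1, 3, 3]
total = sum(absolute)

# Running sum of absolute frequencies from the first bin to the last.
cumulative_absolute = list(accumulate(absolute))

# Partial sums of the relative frequencies, computed as cumulative count / total.
cumulative_relative = [c / total for c in cumulative_absolute]

print(cumulative_absolute)
print(cumulative_relative)
```

The last entries equal the number of observations and 1.0 (100%), matching the statement above.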
Contingency table
A tabular format that displays the frequency distributions of two or more categorical variables simultaneously and is used for finding patterns between the variables.
A table with R levels of one variable in rows and C levels of the other variable in columns is referred to as an R x C table.
Joint Frequencies
The counts of observations obtained by joining one variable from the rows (e.g. sector) with the other variable from the columns (e.g. market cap) in a contingency table.
Marginal Frequencies
The sums obtained when the joint frequencies are added across each row and down each column.
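Joint and marginal frequencies can be illustrated with a small made-up sample of stocks, each labelled by sector and market cap. All names and data here are hypothetical.

```python
from collections import Counter

# Hypothetical (sector, market cap) pairs for eight stocks.
stocks = [
    ("Energy", "Large"), ("Energy", "Small"), ("Energy", "Small"),
    ("Financials", "Large"), ("Financials", "Large"), ("Financials", "Mid"),
    ("Utilities", "Mid"), ("Utilities", "Small"),
]

# Joint frequencies: one count per (row level, column level) combination.
joint = Counter(stocks)

# Marginal frequencies: joint frequencies summed across each row / column.
row_totals = Counter(sector for sector, _ in stocks)
column_totals = Counter(cap for _, cap in stocks)

print(joint[("Energy", "Small")])   # joint frequency of one cell
print(dict(row_totals))             # marginal frequencies by sector
print(dict(column_totals))          # marginal frequencies by market cap
```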
Applications of contingency tables
- Confusion Matrix
- Chi-square test for independence
Confusion Matrix
Evaluates the performance of a classification model.
e.g. a model classifying companies into two groups: those that default on their bond payments and those that do not default.
The matrix for displaying the model's results will be a 2 x 2 table, showing the frequency of actual defaults vs the model's predicted frequency of defaults.
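A 2 x 2 confusion matrix can be built by counting (actual, predicted) pairs. The outcomes below are invented, and placing actual classes in rows and predicted classes in columns is an assumption of this sketch rather than a fixed convention.

```python
from collections import Counter

# Hypothetical outcomes for eight companies: True = default, False = no default.
actual    = [True, True, False, False, False, True, False, False]
predicted = [True, False, False, False, True, True, False, False]

# Count each (actual, predicted) combination.
cells = Counter(zip(actual, predicted))

# Rows: actual default, actual no default.
# Columns: predicted default, predicted no default.
matrix = [[cells[(a, p)] for p in (True, False)] for a in (True, False)]

for row in matrix:
    print(row)
```

The diagonal cells hold the companies the model classified correctly; the off-diagonal cells hold its errors.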
Chi-square test for independence
Tests for a potential association between categorical variables.
The procedure involves using the marginal frequencies in the contingency table to construct a table with the expected values of the observations.
The actual and expected values are used to derive the chi-square test statistic.
The test statistic is then compared to a value from the chi-square distribution for a given level of significance.
If the test statistic is greater than the chi-square distribution value, then there is evidence to reject the claim of independence, implying a significant association between the categorical variables.
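The procedure can be sketched as below. The formulas are the standard textbook ones, which the notes above do not spell out: the expected value of a cell is (row total x column total) / overall total, and the statistic is the sum of (observed - expected)^2 / expected over all cells. The observed counts are invented, the function name is hypothetical, and 3.841 is the usual tabulated critical value for 1 degree of freedom at the 5% level.

```python
def chi_square_statistic(observed):
    """Chi-square statistic for an R x C table of observed joint frequencies."""
    row_totals = [sum(row) for row in observed]
    column_totals = [sum(column) for column in zip(*observed)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, actual in enumerate(row):
            # Expected count under independence, built from the marginal frequencies.
            expected = row_totals[i] * column_totals[j] / total
            statistic += (actual - expected) ** 2 / expected
    return statistic


# Hypothetical 2 x 2 contingency table of observed counts.
observed = [[30, 10],
            [10, 30]]

# Degrees of freedom = (R - 1) * (C - 1) = 1; critical value at the 5% level.
CRITICAL_VALUE_5PCT_DF1 = 3.841

statistic = chi_square_statistic(observed)
reject_independence = statistic > CRITICAL_VALUE_5PCT_DF1
print(statistic, reject_independence)
```

For larger tables the degrees of freedom, and therefore the critical value, change, so the hard-coded constant applies only to the 2 x 2 case.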
Visualization
The presentation of data in a pictorial/graphical format for the purpose of increasing understanding and gaining insights into the data.
Histogram
A chart that presents the distribution of numerical data by using the height of a bar, or column to represent the absolute frequency of each bin/interval in the distribution.
y axis - the absolute frequency/relative frequency in percentage terms.
x axis - represents the bin of a variable.
Absolute frequency histogram = answers the question of how many items are in each bin.
Relative frequency histogram = gives the proportion or percentage of the total observations in each bin.
Frequency Polygon
A chart that plots the midpoint of each bin on the x-axis and the absolute frequency for that bin on the y-axis, with neighbouring points connected by straight lines.
This can quickly convey a visual understanding of the distribution since it displays frequency as an area under the curve.
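The points of a frequency polygon can be derived from a binned frequency distribution, as in this small sketch. The bins and counts are made up; plotting the resulting points and joining them with straight lines is left to whichever charting tool is in use.

```python
# Hypothetical bins as (bin minimum, bin maximum, absolute frequency).
bins = [(-4.0, -2.0, 2), (-2.0, 0.0, 1), (0.0, 2.0, 3), (2.0, 4.0, 3)]

# One (x, y) point per bin: x is the bin midpoint, y is the absolute frequency.
polygon_points = [((lower + upper) / 2, count) for lower, upper, count in bins]

print(polygon_points)
```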
Cumulative Frequency Chart
A chart that can plot either the cumulative absolute frequency or the cumulative relative frequency on the y-axis against the upper limit of the interval.
This allows us to see the number or percentage of the observations that lie below a certain value.
Curve flattens = the bins in that range contain few observations.
Curve steepens = the bins in that range contain most of the observations.
Bar Chart
A plot of the frequency distribution of categorical data in which each bar represents a distinct category, with the bar’s height proportional to the frequency of the corresponding category.
Vertical bar chart:
- the y axis represents the absolute frequency/relative frequency
- the x axis represents the mutually exclusive categories to be compared, rather than bins that group numerical data.
Pareto Chart
A bar chart in which the categories are ordered by frequency in descending order and which includes a line displaying cumulative relative frequency.
The chart is used to highlight dominant categories or the most important groups.
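The data behind a Pareto chart can be prepared as in the sketch below: bars come from the sorted counts, and the line comes from the cumulative relative frequencies. The category counts are invented and `pareto_table` is a hypothetical helper name.

```python
from itertools import accumulate


def pareto_table(counts):
    """Return (category, frequency, cumulative relative frequency) rows,
    ordered by frequency in descending order."""
    ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    total = sum(counts.values())
    running = accumulate(count for _, count in ordered)
    return [
        (category, count, cumulative / total)
        for (category, count), cumulative in zip(ordered, running)
    ]


# Hypothetical number of stocks per sector.
counts = {"Utilities": 1, "Energy": 4, "Health Care": 2, "Financials": 3}

for category, frequency, cumulative_share in pareto_table(counts):
    print(f"{category:<12} {frequency:>2} {cumulative_share:.0%}")
```

In this made-up sample the first two categories already account for 70% of the observations, which is the kind of dominance the chart is meant to highlight.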
Grouped Bar Chart (Clustered Bar Chart)
Presents the frequency distribution of two categorical variables to show joint frequencies.
The bars within each cluster should be colored differently to distinguish between them, but color schemes for subgroups must be identical across sector clusters.
The bars in each sector cluster must always be placed in the same order throughout the chart.
Stacked Bar Chart
An alternative form for presenting the joint frequency distribution of two categorical variables.
Each subsection of the bar is shown in a different color to represent the contribution of each subgroup.
The overall height of the stacked bar represents the marginal frequency for the category.
Tree-Map
A graphical tool for presenting categorical data that consists of a set of colored rectangles representing distinct groups, where the area of each rectangle is proportional to the value of the corresponding group.
This can represent data with additional dimensions by displaying a nest of rectangles. To display joint frequencies of sub-groups, we split the rectangle into sub-sections where the area of each nested rectangle would be proportional to the number of stocks in each market capitalization sub-group.