Part 2. Organising, Visualising and Describing Data Flashcards
Data
A collection of numbers, characters, words and text, as well as audio and video, in a raw or organised format to represent facts or information.
Data Classifications
- Numerical vs Categorical Data
- Cross-sectional vs Time Series vs Panel Data
- Structured vs Unstructured Data
Numerical/Quantitative Data
Values that represent measured or counted quantities as a number.
This data can be split into two types:
- Continuous Data
- Discrete Data
Continuous Data
Data that can be measured and can take on any numerical value in a specified range of values.
Examples:
- The future value of a lump sum investment, which measures the amount of money to be received after a certain period of time at a given interest rate.
- The price return of a stock, which measures the price change over a given period in percentage terms.
Discrete Data
Numerical values that result from a counting process, in which the data is limited to a finite number of values.
Example:
- The frequency of discrete compounding, m, counts the number of times that interest is accrued and paid out in a given year.
e.g. m = 12 means a monthly frequency
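The two kinds of numerical data can be seen side by side in the standard future value formula FV = PV(1 + r/m)^(mN): the compounding frequency m is discrete, while the resulting future value is continuous. A minimal sketch, in which the function name and the sample inputs are illustrative assumptions and not part of the flashcards:

```python
def future_value(pv: float, annual_rate: float, m: int, years: float) -> float:
    """Future value of a lump sum with interest compounded m times per year."""
    if m < 1:
        # m is a count of compounding periods, so it must be a positive integer
        raise ValueError("m must be a positive integer")
    return pv * (1 + annual_rate / m) ** (m * years)


# Hypothetical inputs: 1,000 invested for one year at a 6% stated annual rate.
fv_annual = future_value(1000.0, 0.06, 1, 1)    # m = 1, annual compounding
fv_monthly = future_value(1000.0, 0.06, 12, 1)  # m = 12, monthly compounding
```

Changing m moves in whole steps (1, 2, 4, 12, ...), whereas the future value can land on any value in a range.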
Categorical/Qualitative Data
Values that describe a quality or characteristic of a group of observations, used as labels to divide a data set into groups to summarise and visualise.
Example:
- Bankrupt vs. Not Bankrupt
- Dividends increased vs. No Dividend Action
Nominal Data
Categorical values that are not amenable to being organised in a logical order/rank.
Example:
- Classification of publicly listed stocks into 11 sectors, defined by the Global Industry Classification Standard (GICS).
- Text label, e.g. the sector name
- Numerical label, e.g. the GICS code
Ordinal Data
Categorical values that can be logically ordered or ranked; numbers may be used to identify the ranked categories.
Example:
- S&P star ratings for investment funds, in which 1 star represents a group of funds judged to have the worst performance/quality, and 2, 3, 4 and 5 stars represent groups with increasingly better performance.
- Ranking growth-oriented investment funds based on their 5-year cumulative returns, e.g. assigning 1 to the top-performing 10% of funds.
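The ranking example can be sketched as follows. The flashcard only states that 1 goes to the top-performing 10% of funds; continuing with 2 for the next 10%, and so on, is an assumption made here, and the fund names and returns are made up for illustration:

```python
def decile_ranks(returns: dict[str, float]) -> dict[str, int]:
    """Rank funds 1-10 by return: 1 for the top 10%, 2 for the next 10%, etc."""
    ordered = sorted(returns, key=returns.get, reverse=True)  # best fund first
    n = len(ordered)
    return {fund: (i * 10) // n + 1 for i, fund in enumerate(ordered)}


# Hypothetical 5-year cumulative returns for ten funds.
fund_returns = {
    "Fund A": 0.42, "Fund B": 0.15, "Fund C": 0.31, "Fund D": -0.05,
    "Fund E": 0.27, "Fund F": 0.09, "Fund G": 0.55, "Fund H": 0.18,
    "Fund I": 0.02, "Fund J": 0.36,
}
ranks = decile_ranks(fund_returns)
```

The ranks are ordinal: they preserve the order of the funds, but the gap between rank 1 and rank 2 says nothing about how far apart the underlying returns are.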
Data classification based on data collection:
- Cross-sectional
- Time series
- Panel
Variable (field, attribute, feature)
A characteristic/quantity that can be measured, counted or categorised, and is subject to change.
Example:
- Stock price
- Market capitalisation
- Dividend and dividend yield
- Earnings per share (EPS)
- Price-to-earnings ratio (P/E)
Observation
The value of a specific variable collected at a point in time or over a specified period of time.
Example:
- DEF Inc. recorded EPS of $7.50, which represented a 15% annual increase.
Cross-sectional Data
A list of the observations of a specific variable from multiple observational units at a given point in time.
These observational units can be individuals, groups, companies, trading markets, regions, etc.
Example:
- January inflation rates (i.e. the variable) for each of the euro area countries (i.e. the observational units) in the EU for a given year constitute cross-sectional data.
Time-series Data
A sequence of observations of a specific variable for a single observational unit, collected over time at discrete and typically equally spaced intervals,
e.g. daily, weekly, monthly, quarterly or annually.
Example:
- The daily closing prices (i.e. the variable) of a particular stock recorded for a given month constitute time-series data.
Panel Data
A mix of time-series and cross-sectional data that is frequently used in financial analysis and modeling.
Consists of observations through time on one or more variables for multiple observational units.
Example:
- Quarterly earnings per share in euros of three eurozone companies (Co. A-C) in a given year.
The data across Q1-Q4 for a single company is the time series.
The data across Co. A-C for a single quarter is the cross-section.
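A minimal sketch of the example, using a plain nested dictionary as the panel. The EPS figures and helper names are hypothetical and only illustrate how the two slices differ:

```python
# Hypothetical quarterly EPS (in euros) for three companies; values are made up.
panel = {
    "Co. A": {"Q1": 1.10, "Q2": 1.15, "Q3": 1.20, "Q4": 1.30},
    "Co. B": {"Q1": 0.80, "Q2": 0.75, "Q3": 0.90, "Q4": 0.95},
    "Co. C": {"Q1": 2.00, "Q2": 2.10, "Q3": 2.05, "Q4": 2.20},
}


def time_series(data: dict, unit: str) -> list[float]:
    """All periods for one observational unit, in the order stored."""
    return list(data[unit].values())


def cross_section(data: dict, period: str) -> dict[str, float]:
    """All observational units at a single period."""
    return {unit: obs[period] for unit, obs in data.items()}


co_a_series = time_series(panel, "Co. A")   # one company across Q1-Q4
q2_snapshot = cross_section(panel, "Q2")    # all companies in Q2
```

Fixing the company and varying the quarter gives a time series; fixing the quarter and varying the company gives a cross-section.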
Structured Data
Highly organised in a pre-defined manner, usually with repeating patterns.
Typical form:
- One-dimensional arrays (time series of a single variable)
- Two-dimensional data tables (each column represents a variable or observation unit, and each row contains a set of values for the same columns).
Structured Data Types
- Market data - data issued by stock exchanges, such as intra-day and daily closing stock prices and trading volumes.
- Fundamental data - data contained in financial statements, such as earnings per share, price-to-earnings ratio, dividend yield and return on equity.
- Analytical data - data derived from analytics such as cash flow projections, or forecasted earnings growth.
Unstructured Data
Data that does not follow any conventionally organised forms.
Usually, financial models can take only structured data as inputs, so unstructured data must first be transformed into structured data that models can process.
Examples:
- Text - financial news, social media posts
- Audio/Video - management earnings calls, presentations to analysts
Unstructured Data Types
- Produced by individuals (e.g. via social media posts, web searches, etc.)
- Generated by business processes (e.g. credit card transactions, corporate regulatory filings, etc.)
- Generated by sensors (e.g. satellite imagery, foot traffic by mobile devices, etc.)
Raw Data
Data available in their original form as collected. Such data typically cannot be used by humans or computers to directly extract information and insights.
Data can be organized for quantitative analysis using:
- One-dimensional arrays
- Two-dimensional rectangular arrays
One-dimensional array
Simplest form for representing a collection of data of the same data type, suitable for representing a single variable.
Example:
- Daily closing price of ABC Inc. stock, after the company went public.
- Closing prices are time-series data collected at daily intervals.
- Plotting the data against time shows whether the data demonstrates an increasing or decreasing trend and whether the time series repeats certain patterns in a systematic way over time. Summarising the data also reveals the central tendency and the spread (variation) of its distribution.
Two-dimensional rectangular arrays (data table)
Composed of columns and rows to hold multiple variables and observations, respectively.
When a data table is used to organize data of a single observational unit, each column represents a different variable of the observational unit, and each row holds one observation of each variable; successive rows represent the observations for successive time periods.
Observations of each variable are a time-series sequence that is sorted in either ascending or descending time order.
Frequency Distribution
A tabular display of data constructed either by counting the observations of a variable by distinct values or groups, or by tallying the values of a numerical variable into a set of numerically ordered bins.
Helps in the analysis of large amounts of data. For numerical data, it requires creating non-overlapping bins (intervals or buckets) and counting the observations that fall into each bin.
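For a numerical variable, the binning step can be sketched as below. Equal-width bins, the closed upper end of the last bin, the function name and the sample returns are all assumptions made for illustration; the flashcard only requires that the bins do not overlap:

```python
def bin_counts(values: list[float], low: float, width: float, n_bins: int) -> list[int]:
    """Count observations in n_bins equal-width, non-overlapping bins.

    Each bin is [lower, upper), except the last, which also includes its upper end.
    """
    high = low + width * n_bins
    counts = [0] * n_bins
    for v in values:
        idx = int((v - low) // width)
        if v == high:
            idx = n_bins - 1  # place the maximum edge in the last bin
        if not 0 <= idx < n_bins:
            raise ValueError(f"{v} falls outside the range [{low}, {high}]")
        counts[idx] += 1
    return counts


# Hypothetical monthly returns in percent, binned into [-4, -2), [-2, 0), [0, 2), [2, 4].
monthly_returns = [-3.5, -1.2, 0.4, 0.9, 1.7, 2.3, 3.9, 4.0]
counts = bin_counts(monthly_returns, low=-4.0, width=2.0, n_bins=4)
```

Because every observation lands in exactly one bin, the counts always add up to the number of observations.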
Constructing a frequency distribution:
- Count the number of observations for each unique value of the variable.
- Construct a table listing each unique value and the corresponding counts, and then sort the records by number of counts in descending or ascending order to facilitate the display.
Absolute Frequency
The actual number of observations counted for each unique value of the variable (e.g. each sector).
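The construction steps for a categorical variable can be sketched with the standard library's collections.Counter. The list of sector labels is a made-up sample, not data from the flashcards:

```python
from collections import Counter

# Hypothetical sector label for each of six stocks.
sectors = ["Energy", "Financials", "Energy", "Health Care", "Financials", "Energy"]

# Step 1: count the observations for each unique value (the absolute frequencies).
absolute_frequency = Counter(sectors)

# Step 2: list each unique value with its count, sorted by count in descending order.
frequency_table = absolute_frequency.most_common()

for sector, count in frequency_table:
    print(f"{sector:<12} {count}")
```

Reversing frequency_table gives the ascending order instead.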