2 - Organizing, Visualizing, and Describing Data Flashcards
Data - definition
Data
A collection of numbers, characters, words, and text—as well as images, audio, and video—in a raw or organized format to represent facts or information.
Numerical data - definition
Numerical data
Values that represent measured or counted quantities as a number. Also called quantitative data.
Two types of numerical data?
Continuous data and discrete data.
Continuous data - definition
Continuous data
are data that can be measured and can take on any numerical value in a specified range of values.
Discrete data - definition
Discrete data
are numerical values that result from a counting process. So, practically speaking, the data are limited to a finite number of values.
Example of Discrete data
For example, the frequency of discrete compounding, m, counts the number of times that interest is accrued and paid out in a given year. The frequency could be monthly (m = 12), quarterly (m = 4), semi-yearly (m = 2), or yearly (m = 1).
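As a rough illustration of why m is discrete, the sketch below plugs each frequency into the standard future-value formula FV = PV(1 + r/m)^(m x years). The formula is not stated on the card, and the present value, rate, and horizon are made-up inputs chosen only for illustration.

```python
# Hypothetical inputs for illustration only (not from the flashcards).
PRESENT_VALUE = 100.0
STATED_ANNUAL_RATE = 0.06
YEARS = 1

# m is discrete: it counts compounding periods per year.
FREQUENCIES = {"yearly": 1, "semi-yearly": 2, "quarterly": 4, "monthly": 12}


def future_value(pv, rate, m, years):
    """Future value when interest compounds m times per year."""
    return pv * (1 + rate / m) ** (m * years)


future_values = {
    name: future_value(PRESENT_VALUE, STATED_ANNUAL_RATE, m, YEARS)
    for name, m in FREQUENCIES.items()
}

for name, fv in future_values.items():
    print(f"{name:12s} m={FREQUENCIES[name]:2d}  FV={fv:.4f}")
```

Only whole-number values of m make sense here, which is what makes the variable discrete rather than continuous.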
Categorical data - definition
Categorical data (also called qualitative data) are values that describe a quality or characteristic of a group of observations and therefore can be used as labels to divide a dataset into groups to summarize and visualize. Usually they can take only a limited number of values that are mutually exclusive.
Two types of categorical data?
Nominal
Ordinal
Nominal data - definition
Nominal data
are categorical values that are not amenable to being organized in a logical order.
Ordinal data - definition
Ordinal data
are categorical values that can be logically ordered or ranked. Ordinal data may also involve numbers to identify categories.
e.g. dates
3 ways data can be classified
cross-sectional, time series, and panel
Variable - definition
A variable
is a characteristic or quantity that can be measured, counted, or categorized and is subject to change. A variable can also be called a field, an attribute, or a feature.
Example of a variable - think finance
For example, stock price, market capitalization, dividend and dividend yield, earnings per share (EPS), and price-to-earnings ratio (P/E) are basic data variables for the financial analysis of a public company.
Observation - definition
An observation
is the value of a specific variable collected at a point in time or over a specified period of time.
Example of an observation - think finance
For example, last year DEF, Inc. recorded EPS of $7.50. This value represented a 15% annual increase.
Cross-sectional data - definition
Cross-sectional data
are a list of the observations of a specific variable from multiple observational units at a given point in time.
The observational units can be individuals, groups, companies, trading markets, regions, etc.
For example, January inflation rates (i.e., the variable) for each of the euro-area countries (i.e., the observational units) in the European Union for a given year constitute cross-sectional data.
Time-series data - definition
Time-series data
are a sequence of observations for a single observational unit of a specific variable collected over time and at discrete and typically equally spaced intervals of time, such as daily, weekly, monthly, annually, or quarterly.
For example, the daily closing prices (i.e., the variable) of a particular stock recorded for a given month constitute time-series data.
Panel data - definition
Panel data
are a mix of time-series and cross-sectional data that are frequently used in financial analysis and modelling.
Panel data consist of observations through time on one or more variables for multiple observational units. The observations in panel data are usually organized in a matrix format called a data table.
Exhibit 2 is an example of panel data showing quarterly earnings per share (i.e., the variable) for three companies (i.e., the observational units) in a given year by quarter. Each column is a time series of data that represents the quarterly EPS observations from Q1 to Q4 of a specific company, and each row is cross-sectional data that represent the EPS of all three companies of a particular quarter.
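A minimal sketch of the same layout in code, assuming three placeholder companies and invented EPS figures (these are not the values in Exhibit 2). The helper names time_series and cross_section are hypothetical.

```python
# Hypothetical quarterly EPS figures; NOT the values from Exhibit 2.
QUARTERS = ["Q1", "Q2", "Q3", "Q4"]
panel = {
    "Company A": [1.10, 1.25, 1.30, 1.45],
    "Company B": [0.80, 0.75, 0.90, 0.95],
    "Company C": [2.00, 2.10, 2.05, 2.20],
}


def time_series(company):
    """One column of the table: a single company's EPS across all quarters."""
    return dict(zip(QUARTERS, panel[company]))


def cross_section(quarter):
    """One row of the table: every company's EPS in a single quarter."""
    i = QUARTERS.index(quarter)
    return {company: eps[i] for company, eps in panel.items()}


print(time_series("Company A"))
print(cross_section("Q3"))
```

Reading down a column gives time-series data; reading across a row gives cross-sectional data.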
Structured data - definition
Structured data are highly organized in a pre-defined manner, usually with repeating patterns.
How many typical forms of structured data are there, and what are they?
2
The typical forms of structured data are one-dimensional arrays, such as a time series of a single variable,
or two-dimensional data tables, where each column represents a variable or an observation unit and each row contains a set of values for the same columns.
Pros of structured data
Structured data are relatively easy to enter, store, query, and analyze without much manual processing.
3 typical examples of structured company financial data
o Market data: data issued by stock exchanges, such as intra-day and daily closing stock prices and trading volumes.
o Fundamental data: data contained in financial statements, such as earnings per share, price to earnings ratio, dividend yield, and return on equity.
o Analytical data: data derived from analytics, such as cash flow projections or forecasted earnings growth.
Unstructured data - definition
Unstructured data, in contrast, are data that do not follow any conventionally organized forms.
Examples of Unstructured data
Some common types of unstructured data are text (such as financial news, posts in social media, and company filings with regulators) and audio/video, such as management's earnings calls and presentations to analysts.
How is Unstructured data collected
Unstructured data are typically alternative data as they are usually collected from unconventional sources.
Pros of Unstructured data
Unstructured data may offer new market insights not normally contained in data from traditional sources and may provide potential sources of returns for investment processes.
Cons of Unstructured data
Using unstructured data in investment analysis is challenging. Typically, financial models are able to take only structured data as inputs; therefore, unstructured data must first be transformed into structured data that models can process.
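One simple way to picture that transformation is to turn free text into fixed-length rows of keyword counts. The sketch below is only an illustration: the headlines, the keyword list, and the helper to_row are all hypothetical, and real pipelines use far richer features.

```python
import re
from collections import Counter

# Hypothetical headlines standing in for unstructured text.
headlines = [
    "DEF Inc. beats earnings estimates",
    "Analysts cut DEF Inc. price target",
    "Earnings season lifts markets",
]

# Hypothetical keyword list; each keyword becomes one column of the table.
KEYWORDS = ["earnings", "def", "markets"]


def to_row(text):
    """Turn one piece of text into a fixed-length row of keyword counts."""
    tokens = Counter(re.findall(r"[a-z]+", text.lower()))
    return [tokens[keyword] for keyword in KEYWORDS]


# Structured result: one row per headline, one column per keyword.
table = [to_row(headline) for headline in headlines]

for headline, row in zip(headlines, table):
    print(row, headline)
```

Each row now has the same columns, so the result is a two-dimensional data table that a model can take as input.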
3 categorisations of Unstructured data
By indicating the source from which the data are generated, such data can be classified into three groups:
o Produced by individuals (e.g., via social media posts and web searches);
o Generated by business processes (e.g., via credit card transactions and corporate regulatory filings);
o Generated by sensors (e.g., via satellite imagery and foot traffic tracked by mobile devices).
Why are raw data not used in quantitative analysis?
Raw data are not suitable for quantitative analysis; the data need to be cleaned and formatted.
They are typically formatted into one-dimensional arrays or two-dimensional rectangular arrays.
One-dimensional array - definition
One-dimensional array
The simplest format for representing a collection of data of the same data type. Represents a single variable.
Two-dimensional rectangular array - definition
Two-dimensional rectangular array
A popular form for organizing data for processing by computers or for presenting data visually. It comprises columns and rows that hold multiple variables and multiple observations, respectively (also called a data table).
Descriptive statistics - definition
Descriptive statistics
Measures that summarize the central tendency and spread (variation) of the data's distribution.
Frequency distribution - definition
Frequency distribution
A tabular display of data constructed either by counting the observations of a variable by distinct values or groups or by tallying the values of a numerical variable into a set of numerically ordered bins (also called a one-way table).
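For the numerical case, tallying into ordered bins might look like the sketch below. The return values and bin edges are invented, and the half-open convention (each bin includes its lower edge and excludes its upper edge) is an assumption rather than something the card specifies.

```python
# Hypothetical daily returns in percent (illustration only).
returns = [-2.3, -0.4, 0.1, 0.8, 1.2, 1.9, 2.5, -1.1, 0.3, 0.0]

# Hypothetical bin edges; bin i covers edges[i] <= value < edges[i + 1].
BIN_EDGES = [-3, -2, -1, 0, 1, 2, 3]


def bin_counts(values, edges):
    """Tally each value into the numerically ordered bin that contains it."""
    counts = [0] * (len(edges) - 1)
    for value in values:
        for i in range(len(edges) - 1):
            if edges[i] <= value < edges[i + 1]:
                counts[i] += 1
                break
    return counts


counts = bin_counts(returns, BIN_EDGES)

for i, count in enumerate(counts):
    print(f"[{BIN_EDGES[i]:>2}, {BIN_EDGES[i + 1]:>2})  {count}")
```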
How to construct a frequency distribution of a categorical variable
- Count the number of observations for each unique value of the variable.
- Construct a table listing each unique value and the corresponding counts, and then sort the records by number of counts in descending or ascending order to facilitate the display.
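The two steps above can be sketched as follows, assuming a made-up list of sector labels as the categorical variable.

```python
from collections import Counter

# Hypothetical sector labels for eight stocks (illustration only).
sectors = ["Energy", "Tech", "Tech", "Health", "Energy", "Tech", "Health", "Tech"]

# Step 1: count the observations for each unique value.
counts = Counter(sectors)

# Step 2: list each unique value with its count, sorted by count in
# descending order (ties broken alphabetically so the order is stable).
frequency_table = sorted(counts.items(), key=lambda item: (-item[1], item[0]))

for value, count in frequency_table:
    print(f"{value:8s} {count}")
```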
Absolute frequency - definition
Absolute frequency
The actual number of observations counted for each unique value of the variable (also called raw frequency).
Relative frequency - definition
Relative frequency
The absolute frequency of each unique value of the variable divided by the total number of observations of the variable.
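A short sketch of the calculation, assuming hypothetical absolute frequencies for three sector labels:

```python
# Hypothetical absolute frequencies (illustration only).
absolute = {"Tech": 4, "Energy": 2, "Health": 2}

# Total number of observations of the variable.
total = sum(absolute.values())

# Relative frequency = absolute frequency / total observations.
relative = {value: count / total for value, count in absolute.items()}

for value, share in relative.items():
    print(f"{value:8s} {absolute[value]}  {share:.2%}")
```

The relative frequencies always sum to 1, which makes them a quick check on the counts.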