2 - Organizing, Visualizing, and Describing Data Flashcards
Data - definition
Data
A collection of numbers, characters, words, and text—as well as images, audio, and video—in a raw or organized format to represent facts or information.
Numerical data - definition
Numerical data
Values that represent measured or counted quantities as a number. Also called quantitative data.
Two types of numerical data?
continuous data and discrete data.
Continuous data - definition
Continuous data
are data that can be measured and can take on any numerical value in a specified range of values.
Discrete data - definition
Discrete data
are numerical values that result from a counting process. So, practically speaking, the data are limited to a finite number of values.
Example of Discrete data
For example, the frequency of discrete compounding, m, counts the number of times that interest is accrued and paid out in a given year. The frequency could be monthly (m = 12), quarterly (m = 4), semi-yearly (m = 2), or yearly (m = 1).
Categorical data - definition
Categorical data (also called qualitative data) are values that describe a quality or characteristic of a group of observations and therefore can be used as labels to divide a dataset into groups to summarize and visualize. Usually they can take only a limited number of values that are mutually exclusive.
Two types of categorical data?
Nominal
Ordinal
Nominal data - definition
Nominal data
are categorical values that are not amenable to being organized in a logical order.
Ordinal data - definition
Ordinal data
are categorical values that can be logically ordered or ranked. Ordinal data may also involve numbers to identify categories.
e.g. dates
3 ways data can be classified
cross-sectional, time series, and panel
Variable - definition
A variable
is a characteristic or quantity that can be measured, counted, or categorized and is subject to change. A variable can also be called a field, an attribute, or a feature.
Example of a variable - think finance
For example, stock price, market capitalization, dividend and dividend yield, earnings per share (EPS), and price-to-earnings ratio (P/E) are basic data variables for the financial analysis of a public company.
Observation - definition
An observation
is the value of a specific variable collected at a point in time or over a specified period of time.
Example of an observation - think finance
For example, last year DEF, Inc. recorded EPS of $7.50. This value represented a 15% annual increase.
Cross-sectional data - definition
Cross-sectional data
are a list of the observations of a specific variable from multiple observational units at a given point in time.
The observational units can be individuals, groups, companies, trading markets, regions, etc.
For example, January inflation rates (i.e., the variable) for each of the euro-area countries (i.e., the observational units) in the European Union for a given year constitute cross-sectional data.
Time-series data - definition
Time-series data
are a sequence of observations for a single observational unit of a specific variable collected over time and at discrete and typically equally spaced intervals of time, such as daily, weekly, monthly, annually, or quarterly.
For example, the daily closing prices (i.e., the variable) of a particular stock recorded for a given month constitute time-series data.
Panel data - definition
Panel data
are a mix of time-series and cross-sectional data that are frequently used in financial analysis and modelling.
Panel data consist of observations through time on one or more variables for multiple observational units. The observations in panel data are usually organized in a matrix format called a data table.
Exhibit 2 is an example of panel data showing quarterly earnings per share (i.e., the variable) for three companies (i.e., the observational units) in a given year by quarter. Each column is a time series of data that represents the quarterly EPS observations from Q1 to Q4 of a specific company, and each row is cross-sectional data that represent the EPS of all three companies of a particular quarter.
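A minimal sketch of how a panel like this might be held in code. The company names and EPS figures below are invented for illustration and are not the values in Exhibit 2.

```python
# Hypothetical quarterly EPS panel: one list per company, one entry per quarter.
# All names and numbers are made up for illustration.
quarters = ["Q1", "Q2", "Q3", "Q4"]
panel = {
    "Company A": [1.10, 1.25, 1.30, 1.40],
    "Company B": [0.80, 0.75, 0.90, 0.95],
    "Company C": [2.00, 2.10, 2.05, 2.20],
}

# A column is a time series: one company observed over all quarters.
time_series_a = panel["Company A"]


def cross_section(quarter):
    """A row is cross-sectional data: all companies observed in one quarter."""
    i = quarters.index(quarter)
    return {company: eps[i] for company, eps in panel.items()}


print(time_series_a)
print(cross_section("Q2"))
```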
Structured data - definition
Structured data are highly organized in a pre-defined manner, usually with repeating patterns
How many types of structured data and names
2
The typical forms of structured data are one-dimensional arrays, such as a time series of a single variable,
or two-dimensional data tables, where each column represents a variable or an observation unit and each row contains a set of values for the same columns.
Pros to structured data
Structured data are relatively easy to enter, store, query, and analyze without much manual processing.
3 typical examples of structured company financial data
o Market data: data issued by stock exchanges, such as intra-day and daily closing stock prices and trading volumes.
o Fundamental data: data contained in financial statements, such as earnings per share, price to earnings ratio, dividend yield, and return on equity.
o Analytical data: data derived from analytics, such as cash flow projections or forecasted earnings growth.
Unstructured data - definition
Unstructured data, in contrast, are data that do not follow any conventionally organized forms.
Examples of Unstructured data
Some common types of unstructured data are text (such as financial news, posts in social media, and company filings with regulators) and audio/video, such as management’s earnings calls and presentations to analysts.
How is Unstructured data collected
Unstructured data are typically alternative data as they are usually collected from unconventional sources.
Pros of Unstructured data
Unstructured data may offer new market insights not normally contained in data from traditional sources and may provide potential sources of returns for investment processes.
Cons of Unstructured data
Using unstructured data in investment analysis is challenging. Typically, financial models are able to take only structured data as inputs; therefore, unstructured data must first be transformed into structured data that models can process.
3 categorisations of Unstructured data
By indicating the source from which the data are generated, such data can be classified into three groups:
o Produced by individuals (i.e., via social media posts, web searches, etc.);
o Generated by business processes (i.e., via credit card transactions, corporate regulatory filings, etc.);
o Generated by sensors (i.e., via satellite imagery, foot traffic by mobile devices, etc.).
Why is raw data not used in the quantitative analysis?
Raw data is not suitable for quantitative analysis – data needs to be clean and formatted.
Formatted into one-dimensional arrays or two-dimensional rectangular arrays
One-dimensional array - definition
One-dimensional array
The simplest format for representing a collection of data of the same data type. Represents a single variable.
Two-dimensional rectangular array - definition
Two-dimensional rectangular array
A popular form for organizing data for processing by computers or for presenting data visually. It is comprised of columns and rows to hold multiple variables and multiple observations, respectively (also called a data table).
Descriptive statistics - definition
Descriptive statistics
Measures that summarize the central tendency and the spread (variation) of the data’s distribution.
Frequency distribution - definition
Frequency distribution
A tabular display of data constructed either by counting the observations of a variable by distinct values or groups or by tallying the values of a numerical variable into a set of numerically ordered bins (also called a one-way table).
How to construct a frequency distribution of a categorical variable
- Count the number of observations for each unique value of the variable.
- Construct a table listing each unique value and the corresponding counts, and then sort the records by number of counts in descending or ascending order to facilitate the display.
Absolute frequency - definition
Absolute frequency
The actual number of observations counted for each unique value of the variable (also called raw frequency).
Relative frequency - definition
Relative frequency
The absolute frequency of each unique value of the variable divided by the total number of observations of the variable.
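The counting steps and the two frequency definitions above can be sketched as follows. The sector labels are hypothetical sample data, used only to show the mechanics.

```python
from collections import Counter

# Hypothetical sector labels for ten stocks (illustrative data only).
sectors = ["Tech", "Energy", "Tech", "Health", "Tech",
           "Energy", "Health", "Tech", "Energy", "Tech"]

absolute = Counter(sectors)             # absolute (raw) frequency per unique value
total = sum(absolute.values())          # total number of observations
relative = {k: v / total for k, v in absolute.items()}

# Sort by count in descending order to build the display table.
table = sorted(absolute.items(), key=lambda kv: kv[1], reverse=True)
for sector, count in table:
    print(f"{sector:<8}{count:>3}{relative[sector]:>8.0%}")
```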
Interval - definition
Interval
With reference to grouped data, a set of values within which an observation falls.
How to construct a frequency distribution of a numerical variable
- Sort the data in ascending order.
- Calculate the range of the data, defined as Range = Maximum value − Minimum value.
- Decide on the number of bins (k) in the frequency distribution.
- Determine bin width as Range/k.
- Determine the first bin by adding the bin width to the minimum value. Then, determine the remaining bins by successively adding the bin width to the prior bin’s end point and stopping after reaching a bin that includes the maximum value.
- Determine the number of observations falling into each bin by counting the number of observations whose values are equal to or exceed the bin minimum value yet are less than the bin’s maximum value. The exception is in the last bin, where the maximum value is equal to the last bin’s maximum, and therefore, the observation with the maximum value is included in this bin’s count.
- Construct a table of the bins listed from smallest to largest that shows the number of observations falling into each bin.
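The steps above can be sketched as a small function. The function name and the sample values are hypothetical, and the sketch assumes the data contain at least two distinct values so that the bin width is not zero.

```python
def frequency_distribution(values, k):
    """Tally values into k equal-width bins, following the listed steps."""
    data = sorted(values)                     # sort in ascending order
    lo, hi = data[0], data[-1]
    width = (hi - lo) / k                     # bin width = Range / k
    counts = [0] * k
    for x in data:
        # Bin i covers [lo + i*width, lo + (i+1)*width); the maximum
        # value is counted in the last bin.
        i = min(int((x - lo) / width), k - 1)
        counts[i] += 1
    bins = [(lo + i * width, lo + (i + 1) * width) for i in range(k)]
    return list(zip(bins, counts))


# Hypothetical observations (illustrative only).
observations = [-4, -1, 0, 2, 3, 5, 6, 8, 11, 16]
dist = frequency_distribution(observations, k=4)
for (low, high), count in dist:
    print(f"{low:>6.1f} to {high:>6.1f}: {count}")
```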
Cumulative absolute frequency - definition
Cumulative absolute frequency
Cumulates (i.e., adds up) in a frequency distribution the absolute frequencies as one moves from the first bin to the last bin.
Cumulative relative frequency - definition
Cumulative relative frequency
A sequence of partial sums of the relative frequencies in a frequency distribution.
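Both cumulative measures are running totals, which can be sketched with itertools.accumulate. The bin counts are hypothetical.

```python
from itertools import accumulate

# Hypothetical absolute frequencies for four ordered bins.
absolute = [3, 3, 2, 2]
total = sum(absolute)

# Running total of absolute frequencies from the first bin to the last.
cumulative_absolute = list(accumulate(absolute))

# Partial sums of the relative frequencies; the last entry is always 1.0.
cumulative_relative = [c / total for c in cumulative_absolute]

print(cumulative_absolute)
print(cumulative_relative)
```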
Contingency table - definition
Contingency table
A tabular format that displays the frequency distributions of two or more categorical variables simultaneously and is used for finding patterns between the variables. A contingency table for two categorical variables is also known as a two-way table.
A contingency table having R levels of one variable in rows and C levels of the other variable in columns is referred to as an R × C table.
Joint frequencies - definition
Joint frequencies
The entries in the cells of the contingency table that represent the joining of one variable from a row and the other variable from a column.
Marginal frequencies - definition
Marginal frequencies
The sums determined by adding joint frequencies across rows or across columns in a contingency table.
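A minimal sketch of joint and marginal frequencies for two categorical variables. The (sector, size) pairs are hypothetical observations.

```python
from collections import Counter

# Hypothetical observations: (sector, size) for each stock. Illustrative only.
observations = [
    ("Tech", "Large"), ("Tech", "Small"), ("Tech", "Large"),
    ("Energy", "Small"), ("Energy", "Small"), ("Energy", "Large"),
    ("Tech", "Large"), ("Energy", "Small"),
]

joint = Counter(observations)                      # joint frequencies (table cells)
row_totals = Counter(s for s, _ in observations)   # marginal frequencies by row
col_totals = Counter(z for _, z in observations)   # marginal frequencies by column

print(joint)
print(row_totals, col_totals)
```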
Confusion matrix - definition
Confusion matrix
A type of contingency table used for evaluating the performance of a classification model.
Chi-square test of independence - definition
Chi-square test of independence
A statistical test for detecting a potential association between categorical variables.
Two applications of the contingency table
o Evaluating the performance of a classification model (in this case, the contingency table is called a confusion matrix).
o Testing for a potential association between categorical variables by performing a chi-square test of independence.
Using a contingency table, how can you test for a potential association between categorical variables?
Perform a chi-square test of independence
- use marginal frequencies in the contingency table to construct a table with expected values of the observations. The actual values and expected values are used to derive the chi-square test statistic. This test statistic is then compared to a value from the chi-square distribution for a given level of significance. If the test statistic is greater than the chi-square distribution value, then there is evidence to reject the claim of independence, implying a significant association exists between the categorical variables
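The procedure above can be sketched as follows. The 2 x 2 table is hypothetical. The expected-value formula (row total times column total, divided by the number of observations) and the degrees of freedom (R - 1)(C - 1) are the standard ones for this test but are not spelled out in the card, and 3.841 is the 5% critical value for one degree of freedom.

```python
def chi_square_statistic(table):
    """Chi-square test statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence, built from marginal frequencies.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat


# Hypothetical 2 x 2 table of joint frequencies (illustrative numbers).
table = [[30, 10],
         [20, 40]]
stat = chi_square_statistic(table)

# 5% critical value of the chi-square distribution with 1 degree of freedom.
CRITICAL_5PCT_DF1 = 3.841
reject_independence = stat > CRITICAL_5PCT_DF1
print(round(stat, 3), reject_independence)
```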
What is Data Visualization - definition
Visualization is the presentation of data in a pictorial or graphical format for the purpose of increasing understanding and for gaining insights into the data.
What is a histogram - definition
A histogram is a chart that presents the distribution of numerical data by using the height of a bar or column to represent the absolute frequency of each bin or interval in the distribution.
How do we construct a histogram
To construct a histogram from a continuous variable, we first need to split the data into bins and summarize the data into a frequency distribution table.
In a histogram, the y-axis generally represents the absolute frequency or the relative frequency in percentage terms, while the x-axis usually represents the bins of the variable.
Bars have equal width.
The bars are usually drawn with no spaces in between, but small gaps can also be added between adjacent bars to increase readability
Pros of a histogram
Can present a large amount of numerical data that has been grouped into a frequency distribution and can allow a quick inspection of the shape, centre, and spread of the distribution to better understand it
What is a frequency polygon - definition
Frequency polygon
A graph of a frequency distribution obtained by drawing straight lines joining successive points representing the class frequencies.
How do we construct a frequency polygon
To construct a frequency polygon, we plot the midpoint of each return bin on the x-axis and the absolute frequency for that bin on the y-axis. We then connect neighbouring points with a straight line.
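A short sketch of the points a frequency polygon connects, assuming hypothetical bins and frequencies. Plotting is left out; any charting tool would then join the points with straight lines.

```python
# Hypothetical bins (lower, upper) and their absolute frequencies.
bins = [(-4.0, 1.0), (1.0, 6.0), (6.0, 11.0), (11.0, 16.0)]
frequencies = [3, 3, 2, 2]

# Each point of the polygon is (bin midpoint, absolute frequency).
points = [((low + high) / 2, f) for (low, high), f in zip(bins, frequencies)]
print(points)
```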
Pro of a frequency polygon
The frequency polygon can quickly convey a visual understanding of the distribution since it displays frequency as an area under the curve.
What is a cumulative frequency distribution chart - definition
Cumulative frequency distribution chart
A chart that plots either the cumulative absolute frequency or the cumulative relative frequency on the y-axis against the upper limit of the interval and allows one to see the number or the percentage of the observations that lie below a certain value.
What is a bar chart - definition
Bar chart
A chart for plotting the frequency distribution of categorical data, where each bar represents a distinct category and each bar’s height is proportional to the frequency of the corresponding category. In technical analysis, a bar chart that plots four pieces of data for each time interval: the high, low, opening, and closing prices. A vertical line connects the high and low prices. A cross-hatch left indicates the opening price and a cross-hatch right indicates the closing price.
Axis on a bar chart
y-axis still represents the absolute frequency or the relative frequency.
x-axis in a bar chart represents the mutually exclusive categories to be compared
In the case of two categorical variables, we need an enhanced version of the bar chart, what are they called?
Grouped bar chart
Stacked bar chart
What is a Grouped bar chart - definition
Grouped bar chart
A bar chart for showing joint frequencies for two categorical variables (also known as a clustered bar chart).
What is a Stacked bar chart - definition
Stacked bar chart
An alternative form for presenting the frequency distribution of two categorical variables, where bars representing the sub-groups are placed on top of each other to form a single bar. Each sub-section is shown in a different color to represent the contribution of each sub-group, and the overall height of the stacked bar represents the marginal frequency for the category.
What is a Tree-Map - definition
Tree-Map
Another graphical tool for displaying categorical data. It consists of a set of colored rectangles to represent distinct groups, and the area of each rectangle is proportional to the value of the corresponding group.
Con of a Tree-map
Tree-maps become difficult to read if the hierarchy involves more than three levels.
What is a Word cloud - definition
Word cloud
A visual device for representing textual data, which consists of words extracted from a source of textual data. The size of each distinct word is proportional to the frequency with which it appears in the given text (also known as tag cloud).
Pro of a Word cloud
This format allows us to quickly perceive the most frequent terms among the given text to provide information about the nature of the text, including topic and whether or not the text conveys positive or negative news.
What is a line chart - definition
Line chart
A type of graph used to visualize ordered observations. In technical analysis, a plot of price data, typically closing prices, with a line connecting the points.
What is a Bubble line chart - definition
Bubble line chart
A line chart that uses varying-sized bubbles to represent a third dimension of the data.
What is a Scatter plot - definition
Scatter plot
A chart in which two variables are plotted along the axes and points on the chart represent pairs of the two variables. In regression, the dependent variable is plotted on the vertical axis and the independent variable is plotted along the horizontal axis. Also known as a scattergram.
What is a Scatter plot matrix - definition
Scatter plot matrix
A tool for organizing scatter plots between pairs of variables, making it easy to inspect all pairwise relationships in one combined visual.
What is a Heat map - definition
Heat map
A type of graphic that organizes and summarizes data in a tabular format and represents it using a colour spectrum.
Guide to Selecting among Visualization Types
[Image not included: guide to selecting among visualization types]
What are Four typical pitfalls here that analysts should avoid?
- First, an improper chart type is selected to present data, which would hinder the accurate interpretation of data.
- Second, data are selectively plotted in favour of the conclusion an analyst intends to draw.
- Third, data are improperly plotted in a truncated graph that has a y-axis that does not start at zero.
- Last, but not least, is the improper scaling of axes.
Measure of central tendency - definition
Measure of central tendency
A quantitative measure that specifies where data are centered.
Measure of value - definition
Measure of value
A standard for measuring value; a function of money.
Measures of location - definition
Quantitative measures that describe the location or distribution of data. They include not only measures of central tendency but also other measures, such as percentiles.
Population - definition
Population
All members of a specified group.
Parameter - definition
A parameter is any descriptive measure of a population.
Sample statistic - definition
A sample statistic (statistic, for short) is a quantity computed from or used to describe a sample.
The arithmetic mean - definition
The arithmetic mean is the sum of the values of the observations divided by the number of observations.
The sample mean - definition
The sample mean is the arithmetic mean or arithmetic average computed for a sample
Sample Mean Formula - formula
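$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$, where $X_i$ is the $i$-th observation and $n$ is the number of observations in the sample.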
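The sample mean is the sum of the observations divided by their count, which can be sketched as below. The return figures are hypothetical.

```python
from statistics import mean

# Hypothetical sample of annual returns in percent (illustrative only).
sample = [4.0, -2.0, 7.5, 10.0, 0.5]

# Sum of the observations divided by the number of observations.
manual_mean = sum(sample) / len(sample)

# The standard library gives the same result.
library_mean = mean(sample)
print(manual_mean, library_mean)
```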
What are the 3 options for dealing with extreme values:
- Do nothing; use the data without any adjustment.
- Delete all the outliers.
- Replace the outliers with another value.
Dealing with extreme values - doing nothing
Is appropriate if the values are legitimate, correct observations, and it is important to reflect the whole of the sample distribution. Outliers may contain meaningful information, so excluding or altering these values may reduce valuable information. Further, because identifying a data point as extreme leaves it up to the judgment of the analyst, leaving in all observations eliminates the need to judge a value as extreme.
Dealing with extreme values - Delete all the outliers
One measure of central tendency in this case is the trimmed mean, which is computed by excluding a stated small percentage of the lowest and highest values and then computing an arithmetic mean of the remaining values. For example, a 5% trimmed mean discards the lowest 2.5% and the highest 2.5% of values and computes the mean of the remaining 95% of values.
Trimmed mean - A mean computed after excluding a stated small percentage of the lowest and highest observations.
Dealing with extreme values - Replace the outliers with another value
A measure of central tendency in this case is the winsorized mean. It is calculated by assigning a stated percentage of the lowest values equal to one specified low value and a stated percentage of the highest values equal to one specified high value, and then it computes a mean from the restated data.
Winsorized mean - A mean computed after assigning a stated percentage of the lowest values equal to one specified low value and a stated percentage of the highest values equal to one specified high value.
Trimmed mean - definition
Trimmed mean
A mean computed after excluding a stated small percentage of the lowest and highest observations.
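A minimal sketch of a trimmed mean. The function name and sample are hypothetical, and rounding the number of trimmed observations down is one convention among several.

```python
def trimmed_mean(values, pct):
    """Mean after discarding pct / 2 of the observations from each tail.

    pct is the total fraction trimmed: 0.20 drops the lowest 10% and the
    highest 10%. The count dropped per tail is rounded down (an assumption).
    """
    data = sorted(values)
    cut = int(len(data) * pct / 2)            # observations removed per tail
    kept = data[cut:len(data) - cut]
    return sum(kept) / len(kept)


# Hypothetical sample with one extreme value in each tail.
sample = [-50, 1, 2, 3, 4, 5, 6, 7, 8, 90]
plain = sum(sample) / len(sample)
trimmed = trimmed_mean(sample, 0.20)
print(plain, trimmed)
```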
Winsorized mean - definition
Winsorized mean
A mean computed after assigning a stated percentage of the lowest values equal to one specified low value and a stated percentage of the highest values equal to one specified high value.
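A minimal sketch of a winsorized mean. It assumes the "specified" low and high values are the nearest observations that are not replaced, which is a common convention but not the only possible choice. The function name and sample are hypothetical.

```python
def winsorized_mean(values, pct):
    """Mean after replacing pct / 2 of each tail with the nearest kept value.

    pct is the total fraction restated: 0.20 replaces the lowest 10% and
    the highest 10%. Using the nearest kept observation as the replacement
    value is an assumption of this sketch.
    """
    data = sorted(values)
    n = len(data)
    cut = int(n * pct / 2)                    # observations restated per tail
    low, high = data[cut], data[n - cut - 1]
    restated = [min(max(x, low), high) for x in data]
    return sum(restated) / n


# Hypothetical sample with one extreme value in each tail.
sample = [-50, 1, 2, 3, 4, 5, 6, 7, 9, 90]
winsorized = winsorized_mean(sample, 0.20)
print(winsorized)
```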
Median - definition
Median
The value of the middle item of a set of items that has been sorted into ascending or descending order (i.e., the 50th percentile).
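A short sketch of the median. Averaging the two middle items when the number of observations is even is the usual convention and is assumed here; the sample values are hypothetical.

```python
from statistics import median


def manual_median(values):
    """Middle item of the sorted data; average of the two middle items if even."""
    data = sorted(values)
    n = len(data)
    mid = n // 2
    if n % 2 == 1:
        return data[mid]
    return (data[mid - 1] + data[mid]) / 2


# Hypothetical samples: the extreme value 1000 does not move the median.
odd_sample = [1, 2, 3, 4, 1000]
even_sample = [1, 2, 3, 100]
print(manual_median(odd_sample), manual_median(even_sample))
```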
Pros of the median
A potential advantage of the median is that, unlike the mean, extreme values do not affect it.
Mode - definition
Mode
The mode is the most frequently occurring value in a distribution.
Pros of the mode
The mode is the only measure of central tendency that can be used with nominal data.
When a distribution has a single value that is most frequently occurring, what is it called?
Unimodal
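A minimal sketch for finding the mode and checking whether a distribution is unimodal, using statistics.multimode (Python 3.8 or later). The data are hypothetical; the last line shows the mode applied to nominal labels.

```python
from statistics import multimode

# Hypothetical numerical observations.
values = [2, 3, 3, 5, 7, 3, 5]
modes = multimode(values)          # all values tied for most frequent
is_unimodal = len(modes) == 1      # exactly one most frequent value

# The mode also works for nominal data, where a mean or median is undefined.
nominal_modes = multimode(["Tech", "Energy", "Tech"])
print(modes, is_unimodal, nominal_modes)
```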