2 - Organizing, Visualizing, and Describing Data Flashcards

1
Q

Data - definition

A

Data
A collection of numbers, characters, words, and text—as well as images, audio, and video—in a raw or organized format to represent facts or information.

2
Q

Numerical data - definition

A

Numerical data

Values that represent measured or counted quantities as a number. Also called quantitative data.

3
Q

Two types of numerical data?

A

Continuous data and discrete data.

4
Q

Continuous data - definition

A

Continuous data

are data that can be measured and can take on any numerical value in a specified range of values.

5
Q

Discrete data - definition

A

Discrete data
are numerical values that result from a counting process. So, practically speaking, the data are limited to a finite number of values.

6
Q

Example of Discrete data

A

For example, the frequency of discrete compounding, m, counts the number of times that interest is accrued and paid out in a given year. The frequency could be monthly (m = 12), quarterly (m = 4), semi-yearly (m = 2), or yearly (m = 1).

7
Q

Categorical data - definition

A
Categorical data (also called qualitative data) 
are values that describe a quality or characteristic of a group of observations and therefore can be used as labels to divide a dataset into groups to summarize and visualize. Usually they can take only a limited number of values that are mutually exclusive.
8
Q

Two types of categorical data?

A

Nominal

Ordinal

9
Q

Nominal data - definition

A

Nominal data

are categorical values that are not amenable to being organized in a logical order.

10
Q

Ordinal data - definition

A

Ordinal data
are categorical values that can be logically ordered or ranked. Ordinal data may also involve numbers to identify categories.

e.g. dates

11
Q

3 ways data can be classified

A

cross-sectional, time series, and panel

12
Q

Variable - definition

A

A variable
is a characteristic or quantity that can be measured, counted, or categorized and is subject to change. A variable can also be called a field, an attribute, or a feature.

13
Q

Example of a variable - think finance

A

For example, stock price, market capitalization, dividend and dividend yield, earnings per share (EPS), and price-to-earnings ratio (P/E) are basic data variables for the financial analysis of a public company.

14
Q

Observation - definition

A

An observation

is the value of a specific variable collected at a point in time or over a specified period of time.

15
Q

Example of an observation - think finance

A

For example, last year DEF, Inc. recorded EPS of $7.50. This value represented a 15% annual increase.

16
Q

Cross-sectional data - definition

A

Cross-sectional data

are a list of the observations of a specific variable from multiple observational units at a given point in time.

The observational units can be individuals, groups, companies, trading markets, regions, etc.

For example, January inflation rates (i.e., the variable) for each of the euro-area countries (i.e., the observational units) in the European Union for a given year constitute cross-sectional data.

17
Q

Time-series data - definition

A

Time-series data

are a sequence of observations for a single observational unit of a specific variable collected over time and at discrete and typically equally spaced intervals of time, such as daily, weekly, monthly, annually, or quarterly.

For example, the daily closing prices (i.e., the variable) of a particular stock recorded for a given month constitute time-series data.

18
Q

Panel data - definition

A

Panel data

are a mix of time-series and cross-sectional data that are frequently used in financial analysis and modelling.

Panel data consist of observations through time on one or more variables for multiple observational units. The observations in panel data are usually organized in a matrix format called a data table.

Exhibit 2 is an example of panel data showing quarterly earnings per share (i.e., the variable) for three companies (i.e., the observational units) in a given year by quarter. Each column is a time series that represents the quarterly EPS observations from Q1 to Q4 for a specific company, and each row is cross-sectional data that represent the EPS of all three companies in a particular quarter.
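
The column/row reading of a panel data table can be sketched in a few lines of Python. The company names and EPS figures below are hypothetical illustrations, not the values from Exhibit 2.

```python
# Hypothetical quarterly EPS figures (illustrative only, not from Exhibit 2).
quarters = ["Q1", "Q2", "Q3", "Q4"]
panel = {
    "Company A": [1.00, 1.10, 1.20, 1.30],
    "Company B": [0.50, 0.55, 0.60, 0.65],
    "Company C": [2.00, 1.90, 2.10, 2.20],
}

# A column of the data table: the time series for one company.
time_series_a = panel["Company A"]

# A row of the data table: the cross-section for one quarter.
q2_index = quarters.index("Q2")
cross_section_q2 = {company: eps[q2_index] for company, eps in panel.items()}

print(time_series_a)
print(cross_section_q2)
```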

19
Q

Structured data - definition

A

Structured data are highly organized in a pre-defined manner, usually with repeating patterns

20
Q

How many typical forms of structured data are there, and what are they?

A

2

The typical forms of structured data are one-dimensional arrays, such as a time series of a single variable,

or two-dimensional data tables, where each column represents a variable or an observation unit and each row contains a set of values for the same columns.

Structured data are relatively easy to enter, store, query, and analyze without much manual processing.

21
Q

Pros to structured data

A

Structured data are relatively easy to enter, store, query, and analyze without much manual processing.

22
Q

3 typical examples of structured company financial data

A

o Market data: data issued by stock exchanges, such as intra-day and daily closing stock prices and trading volumes.
o Fundamental data: data contained in financial statements, such as earnings per share, price to earnings ratio, dividend yield, and return on equity.
o Analytical data: data derived from analytics, such as cash flow projections or forecasted earnings growth.

23
Q

Unstructured data - definition

A

Unstructured data, in contrast, are data that do not follow any conventionally organized forms.

24
Q

Examples of Unstructured data

A

Some common types of unstructured data are text (such as financial news, posts in social media, and company filings with regulators) and audio/video, such as management's earnings calls and presentations to analysts.

25
Q

How is Unstructured data collected

A

Unstructured data are typically alternative data as they are usually collected from unconventional sources.

26
Q

Pros of Unstructured data

A

Unstructured data may offer new market insights not normally contained in data from traditional sources and may provide potential sources of returns for investment processes.

27
Q

Cons of Unstructured data

A

Using unstructured data in investment analysis is challenging. Typically, financial models are able to take only structured data as inputs; therefore, unstructured data must first be transformed into structured data that models can process.

28
Q

3 categorisations of Unstructured data

A

By indicating the source from which the data are generated, such data can be classified into three groups:
o Produced by individuals (e.g., via social media posts and web searches);
o Generated by business processes (e.g., via credit card transactions and corporate regulatory filings);
o Generated by sensors (e.g., via satellite imagery and foot traffic recorded by mobile devices).

29
Q

Why is raw data not used in the quantitative analysis?

A

Raw data are typically not in a form suitable for quantitative analysis; they first need to be cleaned and formatted.

The usual target formats are one-dimensional arrays or two-dimensional rectangular arrays.

30
Q

One-dimensional array - definition

A

One-dimensional array

The simplest format for representing a collection of data of the same data type. Represents a single variable.

31
Q

Two-dimensional rectangular array - definition

A

Two-dimensional rectangular array

A popular form for organizing data for processing by computers or for presenting data visually. It comprises columns and rows to hold multiple variables and multiple observations, respectively (also called a data table).

32
Q

Descriptive statistics - definition

A

Descriptive statistics

Measures that summarize the central tendency and the spread (variation) in the data's distribution.

33
Q

Frequency distribution - definition

A

Frequency distribution

A tabular display of data constructed either by counting the observations of a variable by distinct values or groups or by tallying the values of a numerical variable into a set of numerically ordered bins (also called a one-way table).

34
Q

How to construct a frequency distribution of a categorical variable

A
  1. Count the number of observations for each unique value of the variable.
  2. Construct a table listing each unique value and the corresponding counts, and then sort the records by number of counts in descending or ascending order to facilitate the display.
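
The two steps above can be sketched in Python with the standard library's Counter. The sector labels are hypothetical sample data.

```python
from collections import Counter

# Hypothetical sector labels for ten stocks (illustrative data).
sectors = ["Tech", "Energy", "Tech", "Health", "Tech",
           "Energy", "Health", "Tech", "Energy", "Tech"]

# Step 1: count the observations for each unique value.
counts = Counter(sectors)

# Step 2: list each value with its count, sorted by count in descending order.
frequency_table = sorted(counts.items(), key=lambda item: item[1], reverse=True)

for value, count in frequency_table:
    print(f"{value:<8}{count}")
```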
35
Q

Absolute frequency - definition

A

Absolute frequency

The actual number of observations counted for each unique value of the variable (also called raw frequency).

36
Q

Relative frequency - definition

A

Relative frequency

The absolute frequency of each unique value of the variable divided by the total number of observations of the variable.
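
As a small illustration, relative frequencies can be derived from absolute frequencies as follows. The analyst ratings are made-up sample data.

```python
from collections import Counter

# Hypothetical analyst ratings (illustrative data).
ratings = ["Buy", "Hold", "Buy", "Sell", "Buy", "Hold", "Buy", "Hold"]

absolute = Counter(ratings)            # absolute (raw) frequencies
total = sum(absolute.values())         # total number of observations

# Relative frequency = absolute frequency / total number of observations.
relative = {value: count / total for value, count in absolute.items()}

print(relative)
```

The relative frequencies always sum to 1 (or 100% when expressed as percentages).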

37
Q

Interval - definition

A

Interval

With reference to grouped data, a set of values within which an observation falls.

38
Q

How to construct a frequency distribution of a numerical variable

A
  1. Sort the data in ascending order.
  2. Calculate the range of the data, defined as Range = Maximum value − Minimum value.
  3. Decide on the number of bins (k) in the frequency distribution.
  4. Determine bin width as Range/k.
  5. Determine the first bin by adding the bin width to the minimum value. Then, determine the remaining bins by successively adding the bin width to the prior bin’s end point and stopping after reaching a bin that includes the maximum value.
  6. Determine the number of observations falling into each bin by counting the number of observations whose values are equal to or exceed the bin minimum value yet are less than the bin’s maximum value. The exception is in the last bin, where the maximum value is equal to the last bin’s maximum, and therefore, the observation with the maximum value is included in this bin’s count.
  7. Construct a table of the bins listed from smallest to largest that shows the number of observations falling into each bin.
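
A minimal sketch of the seven steps, assuming equal-width bins and a range greater than zero. The function name `frequency_distribution` and the sample values are hypothetical, and locating the bin by integer division is a shortcut that may be sensitive to floating-point rounding for values lying exactly on a bin edge.

```python
def frequency_distribution(values, k):
    """Tally values into k equal-width bins (assumes max > min)."""
    data = sorted(values)                       # step 1: sort ascending
    minimum, maximum = data[0], data[-1]
    width = (maximum - minimum) / k             # steps 2-4: range / k
    # Step 5: bin end points, built by successively adding the bin width.
    edges = [minimum + i * width for i in range(k)] + [maximum]
    counts = [0] * k
    for x in data:                              # step 6: count per bin
        # Lower bound inclusive, upper bound exclusive; the maximum value
        # is placed in the last bin.
        index = min(int((x - minimum) / width), k - 1)
        counts[index] += 1
    # Step 7: table of (bin start, bin end, count), smallest bin first.
    return [(edges[i], edges[i + 1], counts[i]) for i in range(k)]


sample = [0, 1, 2, 3, 4, 5, 5, 7, 9, 10]        # hypothetical observations
table = frequency_distribution(sample, k=5)
for start, end, count in table:
    print(f"{start:>5} to {end:<5} {count}")
```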
39
Q

Cumulative absolute frequency - definition

A

Cumulative absolute frequency

Cumulates (i.e., adds up) in a frequency distribution the absolute frequencies as one moves from the first bin to the last bin.

40
Q

Cumulative relative frequency - definition

A

Cumulative relative frequency

A sequence of partial sums of the relative frequencies in a frequency distribution.
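
Both cumulative measures are running totals, which can be sketched with `itertools.accumulate`. The bin counts are hypothetical. Here the cumulative relative frequency is computed as the cumulative absolute frequency divided by the total, which is equivalent to summing the relative frequencies.

```python
from itertools import accumulate

absolute = [2, 2, 3, 1, 2]                 # hypothetical bin counts, first to last bin
total = sum(absolute)

# Running total of the absolute frequencies.
cumulative_absolute = list(accumulate(absolute))

# Running share of all observations at or below each bin.
cumulative_relative = [c / total for c in cumulative_absolute]

print(cumulative_absolute)
print(cumulative_relative)
```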

41
Q

Contingency table - definition

A

Contingency table

A tabular format that displays the frequency distributions of two or more categorical variables simultaneously and is used for finding patterns between the variables. A contingency table for two categorical variables is also known as a two-way table.

A contingency table having R levels of one variable in rows and C levels of the other variable in columns is referred to as an R × C table.

42
Q

Joint frequencies - definition

A

Joint frequencies

The entries in the cells of the contingency table that represent the joining of one variable from a row and the other variable from a column.

43
Q

Marginal frequencies - definition

A

Marginal frequencies

The sums determined by adding joint frequencies across rows or across columns in a contingency table.
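
Joint and marginal frequencies for a small 2 × 2 contingency table might be tallied as below. The (size, sector) observations are hypothetical.

```python
from collections import Counter

# Hypothetical (size, sector) pairs, one per observation.
observations = [
    ("Small", "Tech"), ("Small", "Energy"), ("Large", "Tech"),
    ("Large", "Tech"), ("Small", "Tech"), ("Large", "Energy"),
]

# Joint frequencies: one count per (row level, column level) cell.
joint = Counter(observations)

# Marginal frequencies: joint frequencies summed across rows or columns.
row_totals = Counter(size for size, _ in observations)
column_totals = Counter(sector for _, sector in observations)

print(dict(joint))
print(dict(row_totals), dict(column_totals))
```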

44
Q

Confusion matrix - definition

A

Confusion matrix

A type of contingency table used for evaluating the performance of a classification model.

45
Q

Chi-square test of independence - definition

A

Chi-square test of independence

A statistical test for detecting a potential association between categorical variables.

46
Q

Two applications of the contingency table

A

o Evaluating the performance of a classification model (in this case, the contingency table is called a confusion matrix).
o Testing for a potential association between categorical variables by performing a chi-square test of independence.

47
Q

Using a contingency table, how can you test for a potential association between categorical variables?

A

Perform a chi-square test of independence

  • Use the marginal frequencies in the contingency table to construct a table with expected values of the observations. The actual values and expected values are used to derive the chi-square test statistic. This test statistic is then compared to a value from the chi-square distribution for a given level of significance. If the test statistic is greater than the chi-square distribution value, then there is evidence to reject the claim of independence, implying a significant association exists between the categorical variables.
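
The procedure can be sketched as follows, using the standard formulas: expected value = (row total × column total) / overall total, and test statistic = sum of (observed − expected)² / expected over all cells. The function name and the observed counts are hypothetical.

```python
def chi_square_statistic(table):
    """Chi-square statistic and degrees of freedom for a table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    column_totals = [sum(column) for column in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence, from the marginal frequencies.
            expected = row_totals[i] * column_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    degrees_of_freedom = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, degrees_of_freedom


observed_counts = [[30, 10], [20, 40]]          # hypothetical 2 x 2 table
statistic, dof = chi_square_statistic(observed_counts)
print(round(statistic, 3), dof)
```

For this made-up table the statistic is about 16.67 with 1 degree of freedom, which exceeds the 5% critical value of 3.841, so the claim of independence would be rejected.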
48
Q

What is Data Visualization - definition

A

Visualization is the presentation of data in a pictorial or graphical format for the purpose of increasing understanding and for gaining insights into the data.

49
Q

What is a histogram - definition

A

A histogram is a chart that presents the distribution of numerical data by using the height of a bar or column to represent the absolute frequency of each bin or interval in the distribution.

50
Q

How do we construct a histogram

A

To construct a histogram from a continuous variable, we first need to split the data into bins and summarize the data into a frequency distribution table.

In a histogram, the y-axis generally represents the absolute frequency or the relative frequency in percentage terms, while the x-axis usually represents the bins of the variable.

Bars have equal width.

The bars are usually drawn with no spaces in between, but small gaps can also be added between adjacent bars to increase readability.

51
Q

Pros of a histogram

A

A histogram can present a large amount of numerical data that has been grouped into a frequency distribution, allowing a quick inspection of the shape, centre, and spread of the distribution.

52
Q

What is a frequency polygon - definition

A

Frequency polygon

A graph of a frequency distribution obtained by drawing straight lines joining successive points representing the class frequencies.

53
Q

How do we construct a frequency polygon

A

To construct a frequency polygon, we plot the midpoint of each bin on the x-axis and the absolute frequency for that bin on the y-axis. We then connect neighbouring points with a straight line.

54
Q

Pro of a frequency polygon

A

The frequency polygon can quickly convey a visual understanding of the distribution since it displays frequency as an area under the curve.

55
Q

What is a cumulative frequency distribution chart - definition

A

Cumulative frequency distribution chart

A chart that plots either the cumulative absolute frequency or the cumulative relative frequency on the y-axis against the upper limit of the interval and allows one to see the number or the percentage of the observations that lie below a certain value.

56
Q

What is a bar chart - definition

A

Bar chart

A chart for plotting the frequency distribution of categorical data, where each bar represents a distinct category and each bar's height is proportional to the frequency of the corresponding category. In technical analysis, a bar chart plots four bits of data for each time interval: the high, low, opening, and closing prices. A vertical line connects the high and low prices. A cross-hatch on the left indicates the opening price and a cross-hatch on the right indicates the closing price.

57
Q

Axis on a bar chart

A

The y-axis represents the absolute frequency or the relative frequency, as in a histogram.
The x-axis represents the mutually exclusive categories to be compared.

58
Q

In the case of two categorical variables, we need an enhanced version of the bar chart, what are they called?

A

Grouped bar chart

Stacked bar chart

59
Q

What is a Grouped bar chart - definition

A

Grouped bar chart

A bar chart for showing joint frequencies for two categorical variables (also known as a clustered bar chart).

60
Q

What is a Stacked bar chart - definition

A

Stacked bar chart

An alternative form for presenting the frequency distribution of two categorical variables, where bars representing the sub-groups are placed on top of each other to form a single bar. Each sub-section is shown in a different color to represent the contribution of each sub-group, and the overall height of the stacked bar represents the marginal frequency for the category.

61
Q

What is a Tree-Map - definition

A

Tree-Map

Another graphical tool for displaying categorical data. It consists of a set of colored rectangles to represent distinct groups, and the area of each rectangle is proportional to the value of the corresponding group.

62
Q

Con of a Tree-map

A

Tree-maps become difficult to read if the hierarchy involves more than three levels.

63
Q

What is a Word cloud - definition

A

Word cloud

A visual device for representing textual data, which consists of words extracted from a source of textual data. The size of each distinct word is proportional to the frequency with which it appears in the given text (also known as tag cloud).

64
Q

Pro of a Word cloud

A

This format allows us to quickly perceive the most frequent terms among the given text to provide information about the nature of the text, including topic and whether or not the text conveys positive or negative news.

65
Q

What is a line chart - definition

A

Line chart

A type of graph used to visualize ordered observations. In technical analysis, a plot of price data, typically closing prices, with a line connecting the points.

66
Q

What is a Bubble line chart - definition

A

Bubble line chart

A line chart that uses varying-sized bubbles to represent a third dimension of the data.

67
Q

What is a Scatter plot - definition

A

Scatter plot

A chart in which two variables are plotted along the axes and points on the chart represent pairs of the two variables. In regression, the dependent variable is plotted on the vertical axis and the independent variable is plotted along the horizontal axis. Also known as a scattergram.

68
Q

What is a Scatter plot matrix - definition

A

Scatter plot matrix

A tool for organizing scatter plots between pairs of variables, making it easy to inspect all pairwise relationships in one combined visual.

69
Q

What is a Heat map - definition

A

Heat map

A type of graphic that organizes and summarizes data in a tabular format and represents it using a colour spectrum.

70
Q

Guide to Selecting among Visualization Types

A

INSERT PIC

71
Q

What are four typical pitfalls in data visualization that analysts should avoid?

A
  1. First, an improper chart type is selected to present data, which would hinder the accurate interpretation of data.
  2. Second, data are selectively plotted in favour of the conclusion an analyst intends to draw.
    For example, data
  3. Third, data are improperly plotted in a truncated graph that has a y-axis that does not start at zero.
  4. Last, but not least, is the improper scaling of axes.
72
Q

Measure of central tendency - definition

A

Measure of central tendency

A quantitative measure that specifies where data are centered.

73
Q

Measure of value - definition

A

Measure of value

A standard for measuring value; a function of money.

74
Q

Measures of location - definition

A

Quantitative measures that describe the location or distribution of data. They include not only measures of central tendency but also other measures, such as percentiles.

75
Q

Population - definition

A

Population

All members of a specified group.

76
Q

Parameter - definition

A

A parameter is any descriptive measure of a population. A sample statistic (statistic, for short) is a quantity computed from or used to describe a sample.

77
Q

Sample statistic - definition

A

A sample statistic (statistic, for short) is a quantity computed from or used to describe a sample.

78
Q

The arithmetic mean - definition

A

The arithmetic mean is the sum of the values of the observations divided by the number of observations.

79
Q

The sample mean - definition

A

The sample mean is the arithmetic mean or arithmetic average computed for a sample.

80
Q

Sample Mean Formula - formula

A

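$\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$, where $X_i$ is the i-th observation and $n$ is the number of observations in the sample.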

81
Q

What are the 3 options for dealing with extreme values?

A
  1. Do nothing; use the data without any adjustment.
  2. Delete all the outliers.
  3. Replace the outliers with another value.
82
Q

Dealing with extreme values - doing nothing

A

Doing nothing is appropriate if the values are legitimate, correct observations and it is important to reflect the whole of the sample distribution. Outliers may contain meaningful information, so excluding or altering these values may reduce valuable information. Further, because identifying a data point as extreme is left to the judgment of the analyst, leaving in all observations eliminates the need to judge a value as extreme.

83
Q

Dealing with extreme values - Delete all the outliers

A

One measure of central tendency in this case is the trimmed mean, which is computed by excluding a stated small percentage of the lowest and highest values and then computing an arithmetic mean of the remaining values. For example, a 5% trimmed mean discards the lowest 2.5% and the highest 2.5% of values and computes the mean of the remaining 95% of values.

Trimmed mean - A mean computed after excluding a stated small percentage of the lowest and highest observations.
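
A minimal sketch of a trimmed mean, assuming the stated percentage is split equally between the two tails and that the number of observations dropped from each tail is rounded down. The function name and the returns are hypothetical.

```python
def trimmed_mean(values, trim_fraction):
    """Mean after dropping trim_fraction of the observations, half from each tail."""
    data = sorted(values)
    # Observations dropped from each tail (rounded down: an assumption).
    cut = int(len(data) * trim_fraction / 2)
    kept = data[cut:len(data) - cut]
    return sum(kept) / len(kept)


returns = [-40, 1, 2, 3, 4, 5, 6, 7, 8, 60]     # hypothetical, two extreme values
plain_mean = sum(returns) / len(returns)
trimmed = trimmed_mean(returns, 0.20)           # drops the lowest 10% and highest 10%
print(plain_mean, trimmed)
```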

84
Q

Dealing with extreme values - Replace the outliers with another value

A

A measure of central tendency in this case is the winsorized mean. It is calculated by assigning a stated percentage of the lowest values equal to one specified low value and a stated percentage of the highest values equal to one specified high value, and then it computes a mean from the restated data.

Winsorized mean - A mean computed after assigning a stated percentage of the lowest values equal to one specified low value and a stated percentage of the highest values equal to one specified high value.
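
A minimal sketch of a winsorized mean. It assumes the "specified" low and high values are the nearest retained observations in each tail, which is one common convention and not something the card states. The function name and the returns are hypothetical.

```python
def winsorized_mean(values, tail_fraction):
    """Mean after replacing the lowest and highest tail_fraction of values
    with the nearest retained observation in each tail."""
    data = sorted(values)
    n = len(data)
    cut = int(n * tail_fraction)                # observations replaced in each tail
    low, high = data[cut], data[n - cut - 1]    # replacement values (assumption)
    # Clamp every observation to the interval [low, high].
    restated = [min(max(x, low), high) for x in data]
    return sum(restated) / n


returns = [-40, 1, 2, 3, 4, 5, 6, 7, 8, 60]     # hypothetical, two extreme values
winsorized = winsorized_mean(returns, 0.10)     # -40 becomes 1, 60 becomes 8
print(winsorized)
```

Unlike the trimmed mean, the sample size is unchanged: extreme observations are restated, not discarded.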

85
Q

Trimmed mean - definition

A

Trimmed mean

A mean computed after excluding a stated small percentage of the lowest and highest observations.

86
Q

Winsorized mean - definition

A

Winsorized mean

A mean computed after assigning a stated percentage of the lowest values equal to one specified low value and a stated percentage of the highest values equal to one specified high value.

87
Q

Median - definition

A

Median

The value of the middle item of a set of items that has been sorted into ascending or descending order (i.e., the 50th percentile).
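
A short sketch of the median. For an even number of observations it averages the two middle items, a common convention that the card does not state. The sample values are hypothetical.

```python
def median(values):
    """Middle item of the sorted values (average of the two middle items if n is even)."""
    data = sorted(values)
    n = len(data)
    middle = n // 2
    if n % 2 == 1:
        return data[middle]
    return (data[middle - 1] + data[middle]) / 2


sample = [1, 2, 3, 4, 100]                      # hypothetical, one extreme value
print(median(sample), sum(sample) / len(sample))
```

Here the median is 3 while the mean is 22, which illustrates how an extreme value pulls the mean but not the median.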

88
Q

Pros of the median

A

A potential advantage of the median is that, unlike the mean, it is not affected by extreme values.

89
Q

Mode - definition

A

Mode

The mode is the most frequently occurring value in a distribution.

90
Q

Pros of the mode

A

The mode is the only measure of central tendency that can be used with nominal data.

91
Q

When a distribution has a single value that is most frequently occurring, what is it called?

A

Unimodal
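
As an illustration, the mode can be found by counting, and the same count shows whether a distribution is unimodal. Because only counting is involved, this works on nominal data. The function name and the labels are hypothetical.

```python
from collections import Counter


def modes(values):
    """All values that share the highest frequency."""
    counts = Counter(values)
    highest = max(counts.values())
    return [value for value, count in counts.items() if count == highest]


sectors = ["Tech", "Energy", "Tech", "Health"]  # hypothetical nominal data
most_frequent = modes(sectors)
is_unimodal = len(most_frequent) == 1           # exactly one most frequent value
print(most_frequent, is_unimodal)
```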