2 - Organizing, Visualizing, and Describing Data Flashcards

Question

How is unstructured data collected?

Answer 1

Unstructured data are typically alternative data as they are usually collected from unconventional sources.

Answer 2

Unstructured data may offer new market insights not normally contained in data from traditional sources and may provide potential sources of returns for investment processes.

Answer 3

Using unstructured data in investment analysis is challenging. Typically, financial models are able to take only structured data as inputs; therefore, unstructured data must first be transformed into structured data that models can process.

Answer 4

Based on the source from which the data are generated, such data can be classified into three groups:
o Produced by individuals (e.g., via social media posts and web searches);
o Generated by business processes (e.g., via credit card transactions and corporate regulatory filings);
o Generated by sensors (e.g., via satellite imagery and foot traffic by mobile devices).

Answer 5

Raw data are not suitable for quantitative analysis; the data need to be cleaned and formatted into one-dimensional arrays or two-dimensional rectangular arrays.

Answer 6

One-dimensional array: The simplest format for representing a collection of data of the same data type. It represents a single variable.

Answer 7

Two-dimensional rectangular array: A popular form for organizing data for processing by computers or for presenting data visually. It consists of columns and rows that hold multiple variables and multiple observations, respectively (also called a data table).

Answer 8

Descriptive statistics: Measures that summarize central tendency and spread (variation) in the data’s distribution.

Answer 9

Frequency distribution: A tabular display of data constructed either by counting the observations of a variable by distinct values or groups or by tallying the values of a numerical variable into a set of numerically ordered bins (also called a one-way table).

Answer 10

1. Count the number of observations for each unique value of the variable.
2. Construct a table listing each unique value and the corresponding counts, and then sort the records by number of counts in descending or ascending order to facilitate the display.
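
The two steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `sectors` list is hypothetical data, and the descending sort is one of the two orders the card allows.

```python
from collections import Counter

# Hypothetical sample: sector labels for ten stocks (illustrative data only).
sectors = ["Energy", "Tech", "Tech", "Health", "Energy",
           "Tech", "Health", "Tech", "Energy", "Tech"]

# Step 1: count the observations for each unique value of the variable.
counts = Counter(sectors)

# Step 2: list each unique value with its count, sorted by count (descending).
frequency_table = sorted(counts.items(), key=lambda item: item[1], reverse=True)

for value, count in frequency_table:
    print(f"{value:<8}{count}")
```

Sorting by count rather than by label puts the most common category first, which is usually what a reader scans for.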

Answer 11

Absolute frequency: The actual number of observations counted for each unique value of the variable (also called raw frequency).

Answer 12

Relative frequency: The absolute frequency of each unique value of the variable divided by the total number of observations of the variable.
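
As a rough sketch of this definition, the snippet below divides each absolute frequency by the total number of observations. The `ratings` list is made-up data used only for illustration.

```python
from collections import Counter

# Hypothetical analyst ratings (illustrative data only).
ratings = ["Buy", "Hold", "Buy", "Sell", "Buy", "Hold", "Buy", "Hold"]

# Absolute frequency: the count of each unique value.
absolute = Counter(ratings)
total = sum(absolute.values())

# Relative frequency: absolute frequency divided by the total number of observations.
relative = {value: count / total for value, count in absolute.items()}

for value, share in relative.items():
    print(f"{value:<5}{share:.3f}")
```

The relative frequencies always sum to 1, which is a quick check on the calculation.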

Answer 13

Interval: With reference to grouped data, a set of values within which an observation falls.

Answer 14

1. Sort the data in ascending order.
2. Calculate the range of the data, defined as Range = Maximum value − Minimum value.
3. Decide on the number of bins (k) in the frequency distribution.
4. Determine bin width as Range/k.
5. Determine the first bin by adding the bin width to the minimum value. Then, determine the remaining bins by successively adding the bin width to the prior bin’s end point and stopping after reaching a bin that includes the maximum value.
6. Determine the number of observations falling into each bin by counting the number of observations whose values are equal to or exceed the bin minimum value yet are less than the bin’s maximum value. The exception is in the last bin, where the maximum value is equal to the last bin’s maximum, and therefore, the observation with the maximum value is included in this bin’s count.
7. Construct a table of the bins listed from smallest to largest that shows the number of observations falling into each bin.
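
The procedure above might be sketched as follows. The function name `build_frequency_distribution` and the `returns` data are hypothetical, and the sketch assumes at least two distinct values so that the bin width is positive.

```python
def build_frequency_distribution(values, k):
    """Return (bin_low, bin_high, count) tuples for k equal-width bins."""
    data = sorted(values)                 # Step 1: sort ascending
    low, high = data[0], data[-1]
    width = (high - low) / k              # Steps 2-4: range, k, bin width
    bins = []
    for i in range(k):                    # Step 5: successive bin end points
        bin_low = low + i * width
        is_last = i == k - 1
        # Pin the last bin's upper end to the maximum to avoid rounding drift.
        bin_high = high if is_last else low + (i + 1) * width
        # Step 6: count values in [bin_low, bin_high); the last bin also
        # includes its upper end so the maximum value is counted.
        count = sum(
            1 for x in data
            if bin_low <= x and (x < bin_high or (is_last and x == bin_high))
        )
        bins.append((bin_low, bin_high, count))
    return bins                           # Step 7: bins from smallest to largest


# Hypothetical returns in percent (illustrative data only).
returns = [-4.0, -2.5, -1.0, 0.0, 0.5, 1.0, 2.0, 3.5, 5.0, 6.0]
table = build_frequency_distribution(returns, k=5)

for bin_low, bin_high, count in table:
    print(f"{bin_low:>5.1f} to {bin_high:>4.1f}: {count}")
```

Here the range is 10 and the bin width is 2, so the value 2.0 lands in the bin starting at 2.0 rather than the one ending there.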

Answer 15

Cumulative absolute frequency: Cumulates (i.e., adds up) in a frequency distribution the absolute frequencies as one moves from the first bin to the last bin.

Answer 16

Cumulative relative frequency: A sequence of partial sums of the relative frequencies in a frequency distribution.
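
A minimal sketch of both cumulative measures, assuming a hypothetical list of absolute frequencies ordered from the first bin to the last. The cumulative relative frequencies are computed here from the cumulative counts, which gives the same partial sums while avoiding accumulated floating-point error.

```python
from itertools import accumulate

# Hypothetical absolute frequencies per bin, first bin to last bin.
absolute = [2, 1, 3, 2, 2]
total = sum(absolute)

# Cumulative absolute frequency: running total of the counts.
cumulative_absolute = list(accumulate(absolute))

# Relative frequency per bin, and its running total.
relative = [count / total for count in absolute]
cumulative_relative = [running / total for running in cumulative_absolute]

print(cumulative_absolute)
print(cumulative_relative)
```

The last cumulative absolute frequency equals the number of observations, and the last cumulative relative frequency equals 1.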

Answer 17

Contingency table: A tabular format that displays the frequency distributions of two or more categorical variables simultaneously and is used for finding patterns between the variables. A contingency table for two categorical variables is also known as a two-way table. A contingency table having R levels of one variable in rows and C levels of the other variable in columns is referred to as an R × C table.

Answer 18

Joint frequencies: The entries in the cells of the contingency table that represent the joining of one variable from a row and the other variable from a column.

Answer 19

Marginal frequencies: The sums determined by adding joint frequencies across rows or across columns in a contingency table.
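
To make joint and marginal frequencies concrete, here is a small sketch built on hypothetical (sector, size) pairs. The variable names are illustrative and not taken from the source.

```python
from collections import Counter

# Hypothetical observations: (sector, market-cap size) pairs.
observations = [
    ("Tech", "Large"), ("Tech", "Small"), ("Tech", "Large"),
    ("Energy", "Small"), ("Energy", "Small"), ("Energy", "Large"),
    ("Tech", "Large"), ("Energy", "Small"),
]

# Joint frequencies: one count per (row level, column level) cell.
joint = Counter(observations)

# Marginal frequencies: joint frequencies summed across each row or column.
row_totals = Counter(sector for sector, _ in observations)
column_totals = Counter(size for _, size in observations)

print(dict(joint))
print(dict(row_totals), dict(column_totals))
```

Both sets of marginal frequencies sum to the total number of observations.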

Answer 20

Confusion matrix: A type of contingency table used for evaluating the performance of a classification model.

Answer 21

Chi-square test of independence: A statistical test for detecting a potential association between categorical variables.

Answer 22

o Evaluating the performance of a classification model (in this case, the contingency table is called a confusion matrix).
o Testing for a potential association between categorical variables by performing a chi-square test of independence.

Answer 23

To perform a chi-square test of independence, use the marginal frequencies in the contingency table to construct a table with expected values of the observations. The actual values and expected values are used to derive the chi-square test statistic. This test statistic is then compared to a value from the chi-square distribution for a given level of significance. If the test statistic is greater than the chi-square distribution value, then there is evidence to reject the claim of independence, implying a significant association exists between the categorical variables.
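
The test described above can be sketched by hand for a small table. The `observed` counts are hypothetical, and the critical value of 3.841 is the standard chi-square value for 1 degree of freedom at the 5% significance level. A different table size or significance level would need a different critical value.

```python
# Hypothetical 2 x 2 contingency table of observed joint frequencies.
observed = [[30, 10],
            [20, 40]]

# Marginal frequencies.
row_totals = [sum(row) for row in observed]
column_totals = [sum(column) for column in zip(*observed)]
grand_total = sum(row_totals)

# Expected frequency under independence: row total * column total / grand total.
expected = [[r * c / grand_total for c in column_totals] for r in row_totals]

# Test statistic: sum over all cells of (observed - expected)^2 / expected.
chi_square = sum(
    (o - e) ** 2 / e
    for observed_row, expected_row in zip(observed, expected)
    for o, e in zip(observed_row, expected_row)
)

degrees_of_freedom = (len(observed) - 1) * (len(observed[0]) - 1)
critical_value = 3.841  # assumed: 1 degree of freedom, 5% significance
reject_independence = chi_square > critical_value

print(round(chi_square, 3), degrees_of_freedom, reject_independence)
```

With these counts the statistic is well above the critical value, so the claim of independence would be rejected.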

Answer 24

Visualization is the presentation of data in a pictorial or graphical format for the purpose of increasing understanding and for gaining insights into the data.

Answer 25

A histogram is a chart that presents the distribution of numerical data by using the height of a bar or column to represent the absolute frequency of each bin or interval in the distribution.

Answer 26

To construct a histogram from a continuous variable, we first need to split the data into bins and summarize the data into a frequency distribution table. In a histogram, the y-axis generally represents the absolute frequency or the relative frequency in percentage terms, while the x-axis usually represents the bins of the variable. The bars have equal width. They are usually drawn with no spaces in between, but small gaps can also be added between adjacent bars to increase readability.

Answer 27

A histogram can present a large amount of numerical data that has been grouped into a frequency distribution and allows a quick inspection of the shape, centre, and spread of the distribution to better understand it.

Answer 28

Frequency polygon: A graph of a frequency distribution obtained by drawing straight lines joining successive points representing the class frequencies.

Answer 29

To construct a frequency polygon, we plot the midpoint of each return bin on the x-axis and the absolute frequency for that bin on the y-axis. We then connect neighbouring points with a straight line.
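
As a small sketch of the construction just described, the points of a frequency polygon can be derived from a binned frequency distribution. The `bins` list is hypothetical, and no plotting library is assumed; the points would be handed to whatever charting tool is in use and joined with straight lines.

```python
# Hypothetical bins as (lower limit, upper limit, absolute frequency).
bins = [(-4.0, -2.0, 2), (-2.0, 0.0, 1), (0.0, 2.0, 3), (2.0, 4.0, 2), (4.0, 6.0, 2)]

# x = midpoint of each bin, y = absolute frequency of that bin.
polygon_points = [((low + high) / 2, count) for low, high, count in bins]

print(polygon_points)
```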

Answer 30

The frequency polygon can quickly convey a visual understanding of the distribution since it displays frequency as an area under the curve.

Answer 31

Cumulative frequency distribution chart: A chart that plots either the cumulative absolute frequency or the cumulative relative frequency on the y-axis against the upper limit of the interval and allows one to see the number or the percentage of the observations that lie below a certain value.

Answer 32

Bar chart: A chart for plotting the frequency distribution of categorical data, where each bar represents a distinct category and each bar's height is proportional to the frequency of the corresponding category. In technical analysis, a bar chart plots four pieces of data for each time interval: the high, low, opening, and closing prices. A vertical line connects the high and low prices. A cross-hatch on the left indicates the opening price and a cross-hatch on the right indicates the closing price.

Answer 33

The y-axis still represents the absolute frequency or the relative frequency, while the x-axis in a bar chart represents the mutually exclusive categories to be compared.

Answer 34

Grouped bar chart | Stacked bar chart

Answer 35

Grouped bar chart: A bar chart for showing joint frequencies for two categorical variables (also known as a clustered bar chart).

Answer 36

Stacked bar chart: An alternative form for presenting the frequency distribution of two categorical variables, where bars representing the sub-groups are placed on top of each other to form a single bar. Each sub-section is shown in a different color to represent the contribution of each sub-group, and the overall height of the stacked bar represents the marginal frequency for the category.

Answer 37

Tree-map: A graphical tool for displaying categorical data. It consists of a set of colored rectangles to represent distinct groups, and the area of each rectangle is proportional to the value of the corresponding group.

Answer 38

Tree-maps become difficult to read if the hierarchy involves more than three levels.

Answer 39

Word cloud: A visual device for representing textual data, which consists of words extracted from a source of textual data. The size of each distinct word is proportional to the frequency with which it appears in the given text (also known as a tag cloud).

Answer 40

This format allows us to quickly perceive the most frequent terms in the given text, which provides information about the nature of the text, including its topic and whether it conveys positive or negative news.

Answer 41

Line chart: A type of graph used to visualize ordered observations. In technical analysis, a plot of price data, typically closing prices, with a line connecting the points.

Answer 42

Bubble line chart: A line chart that uses varying-sized bubbles to represent a third dimension of the data.

Answer 43

Scatter plot: A chart in which two variables are plotted along the axes and points on the chart represent pairs of the two variables. In regression, the dependent variable is plotted on the vertical axis and the independent variable is plotted along the horizontal axis. Also known as a scattergram.

Answer 44

Scatter plot matrix: A tool for organizing scatter plots between pairs of variables, making it easy to inspect all pairwise relationships in one combined visual.

Answer 45

Heat map: A type of graphic that organizes and summarizes data in a tabular format and represents it using a colour spectrum.

Answer 46

INSERT PIC

Answer 47

1. An improper chart type is selected to present data, which hinders the accurate interpretation of the data.
2. Data are selectively plotted in favour of the conclusion an analyst intends to draw.
3. Data are improperly plotted in a truncated graph that has a y-axis that does not start at zero.
4. The axes are improperly scaled.

Answer 48

Measure of central tendency: A quantitative measure that specifies where data are centered.

Answer 49

Measure of value: A standard for measuring value; a function of money.

Answer 50

Quantitative measures that describe the location or distribution of data. They include not only measures of central tendency but also other measures, such as percentiles.

Answer 51

Population: All members of a specified group.

Answer 52

A parameter is any descriptive measure of a population. A sample statistic (statistic, for short) is a quantity computed from or used to describe a sample.

Answer 53

A sample statistic (statistic, for short) is a quantity computed from or used to describe a sample.

Answer 54

The arithmetic mean is the sum of the values of the observations divided by the number of observations.

Answer 55

The sample mean is the arithmetic mean or arithmetic average computed for a sample.

Answer 56

INSERT PICTURE

Answer 57

1. Do nothing; use the data without any adjustment. 2. Delete all the outliers. 3. Replace the outliers with another value.

Answer 58

Using the data without any adjustment is appropriate if the values are legitimate, correct observations and it is important to reflect the whole of the sample distribution. Outliers may contain meaningful information, so excluding or altering these values may reduce valuable information. Further, because identifying a data point as extreme is left to the judgment of the analyst, leaving in all observations eliminates the need to judge a value as extreme.

Answer 59

One measure of central tendency in this case is the trimmed mean, which is computed by excluding a stated small percentage of the lowest and highest values and then computing an arithmetic mean of the remaining values. For example, a 5% trimmed mean discards the lowest 2.5% and the highest 2.5% of values and computes the mean of the remaining 95% of values.
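
A minimal sketch of a trimmed mean, assuming the stated percentage is split evenly between the two tails as in the 5% example. The function name, the `returns` data, and the choice to round the per-tail count down to a whole number of observations are all assumptions made for illustration.

```python
def trimmed_mean(values, percent):
    """Mean after discarding percent/2 of the observations from each tail."""
    data = sorted(values)
    n = len(data)
    # Observations dropped from each tail (assumption: round down).
    cut = int(n * percent / 100 / 2)
    kept = data[cut:n - cut]
    return sum(kept) / len(kept)


# Hypothetical returns with one extreme value in each tail.
returns = [-20.0, 1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 4.0, 5.0, 30.0]

plain_mean = sum(returns) / len(returns)
mean_20_trimmed = trimmed_mean(returns, 20)  # drops the lowest and highest value

print(plain_mean, mean_20_trimmed)
```

With ten observations, a 20% trimmed mean removes one value from each end, so the two extremes no longer influence the result.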

Answer 60

A measure of central tendency in this case is the winsorized mean. It is calculated by assigning a stated percentage of the lowest values equal to one specified low value and a stated percentage of the highest values equal to one specified high value, and then computing a mean from the restated data.
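
One way to sketch a winsorized mean is shown below. The function name and data are hypothetical, and using the nearest retained observation in each tail as the "specified" low and high value is an assumption; other conventions (for example, interpolated percentiles) are possible.

```python
def winsorized_mean(values, percent):
    """Mean after replacing percent/2 of each tail with the nearest kept value."""
    data = sorted(values)
    n = len(data)
    # Observations restated in each tail (assumption: round down).
    cut = int(n * percent / 100 / 2)
    low_value = data[cut]            # specified low value
    high_value = data[n - cut - 1]   # specified high value
    # Clamp every observation into [low_value, high_value].
    restated = [min(max(x, low_value), high_value) for x in data]
    return sum(restated) / n


# Hypothetical returns with one extreme value in each tail.
returns = [-20.0, 1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 4.0, 5.0, 30.0]

mean_20_winsorized = winsorized_mean(returns, 20)
print(mean_20_winsorized)
```

Unlike trimming, winsorizing keeps the sample size unchanged: here -20.0 is restated as 1.0 and 30.0 as 5.0 before averaging.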

Answer 61

Trimmed mean: A mean computed after excluding a stated small percentage of the lowest and highest observations.

Answer 62

Winsorized mean: A mean computed after assigning a stated percentage of the lowest values equal to one specified low value and a stated percentage of the highest values equal to one specified high value.

Answer 63

Median: The value of the middle item of a set of items that has been sorted into ascending or descending order (i.e., the 50th percentile).

Answer 64

A potential advantage of the median is that, unlike the mean, extreme values do not affect it.
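
A quick illustration of this point, using made-up numbers: replacing the largest observation with an extreme value moves the mean substantially but leaves the median where it was.

```python
from statistics import mean, median

# Hypothetical data sets that differ only in their largest value.
base = [1.0, 2.0, 3.0, 4.0, 5.0]
with_outlier = [1.0, 2.0, 3.0, 4.0, 500.0]

print(mean(base), median(base))
print(mean(with_outlier), median(with_outlier))
```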

Answer 65

Mode: The most frequently occurring value in a distribution.

Answer 66

The mode is the only measure of central tendency that can be used with nominal data.
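
Because nominal categories cannot be averaged or ranked, counting is the only operation available, which is why the mode still works. A small sketch with hypothetical labels, using Python's standard `statistics` module (`multimode` requires Python 3.8 or later):

```python
from statistics import mode, multimode

# Hypothetical nominal data: sector labels have no numeric value or order.
sectors = ["Energy", "Tech", "Tech", "Health", "Energy", "Tech"]

most_common_sector = mode(sectors)

# A distribution can have more than one mode; multimode returns all of them.
tied_modes = multimode(["A", "B", "A", "B", "C"])

print(most_common_sector, tied_modes)
```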

(91 cards)