Part 2. Organising, Visualising and Describing Data Flashcards

Question

Relative Frequency

Answer 1

Calculated as the absolute frequency of each unique value of the variable divided by the total number of observations. This provides a normalised measure of the distribution of the data, allowing comparisons between datasets with different numbers of total observations.
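
As a minimal sketch of this calculation (the sector labels are hypothetical and not taken from the deck), relative frequencies can be obtained by dividing each count by the total number of observations:

```python
from collections import Counter

def relative_frequencies(values):
    """Return {value: absolute frequency / total number of observations}."""
    counts = Counter(values)      # absolute frequency of each unique value
    total = len(values)           # total number of observations
    return {value: count / total for value, count in counts.items()}

# Hypothetical sector labels for five stocks.
sectors = ["IT", "Energy", "IT", "Utilities", "IT"]
rel = relative_frequencies(sectors)
print(rel)
```

The relative frequencies always sum to 1, which is what makes datasets of different sizes comparable.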

Answer 2

1. Sort the data in ascending order. 2. Calculate the range of the data, defined as Range = Maximum value - Minimum value. 3. Decide on the number of bins (k) in the frequency distribution. 4. Determine the bin width as Range/k. 5. Determine the first bin by adding the bin width to the minimum value. Then determine the remaining bins by successively adding the bin width to the prior bin's end point, stopping after reaching a bin that includes the maximum value. 6. Determine the number of observations falling into each bin by counting the observations whose values equal or exceed the bin's minimum value yet are less than the bin's maximum value. The exception is the last bin, whose maximum equals the maximum value of the data, so the observation with the maximum value is included in that bin's count. 7. Construct a table of the bins, listed from smallest to largest, that shows the number of observations falling into each bin.
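
The steps above can be sketched as follows. The function name and the sample data are illustrative assumptions, and the sketch assumes the maximum is strictly greater than the minimum so that the bin width is not zero:

```python
def frequency_distribution(data, k):
    """Count observations in k equal-width bins, following the steps above."""
    values = sorted(data)                        # step 1: sort ascending
    lo, hi = values[0], values[-1]
    width = (hi - lo) / k                        # steps 2-4: range / k
    counts = [0] * k
    for x in values:
        # Each bin covers [lower, upper); the maximum value is placed
        # in the last bin, as described in step 6.
        index = min(int((x - lo) / width), k - 1)
        counts[index] += 1
    bins = [(lo + i * width, lo + (i + 1) * width) for i in range(k)]
    return list(zip(bins, counts))               # step 7: table of bins

# Hypothetical observations.
table = frequency_distribution([1, 2, 3, 4, 5, 6, 7, 8, 9, 11], 5)
for (lower, upper), count in table:
    print(f"{lower:5.1f} to {upper:5.1f}: {count}")
```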

Answer 3

Adds up the absolute frequencies as we move from the first bin to the last bin.

Answer 4

A sequence of partial sums of the relative frequencies. For the last bin, the cumulative absolute frequency will equal the number of observations in the dataset (1,258), and the cumulative relative frequency will equal 100%.
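
A short sketch of both cumulative measures, using hypothetical bin counts rather than the 1,258 observations mentioned above:

```python
from itertools import accumulate

absolute = [2, 5, 8, 4, 1]                 # hypothetical absolute frequencies per bin
total = sum(absolute)                      # number of observations

# Running totals from the first bin to the last bin.
cumulative_absolute = list(accumulate(absolute))
# Dividing by the total gives the partial sums of the relative frequencies.
cumulative_relative = [c / total for c in cumulative_absolute]

print(cumulative_absolute)
print(cumulative_relative)
```

The last entries equal the number of observations and 1.0 (100%) respectively.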

Answer 5

A tabular format that displays the frequency distributions of two or more categorical variables simultaneously and is used for finding patterns between the variables. This table having R levels of one variable in rows and C levels of the other variable in columns is referred to as R x C table.

Answer 6

Joining one variable from the row (e.g. sector) with the other variable from the column (e.g. market cap) to count observations in a contingency table.

Answer 7

The sums obtained when the joint frequencies are added across rows and across columns.
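
A minimal sketch of joint and marginal frequencies, assuming a hypothetical list of (sector, market cap) pairs with one pair per stock:

```python
from collections import Counter

# Hypothetical (sector, market cap) pairs, one per stock.
stocks = [("IT", "Large"), ("IT", "Small"), ("Energy", "Large"),
          ("IT", "Large"), ("Energy", "Small"), ("Energy", "Small")]

joint = Counter(stocks)                                # joint frequencies (cells)
row_totals = Counter(sector for sector, _ in stocks)   # marginal frequencies by sector
col_totals = Counter(cap for _, cap in stocks)         # marginal frequencies by market cap

print(joint[("IT", "Large")], row_totals["IT"], col_totals["Small"])
```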

Answer 8

1. Confusion Matrix | 2. Chi-square test for independence

Answer 9

Evaluates the performance of a classification model, e.g. a model classifying companies into two groups: those that default on their bond payments and those that do not. The matrix displaying the model's results is a 2 x 2 table showing the frequency of actual defaults versus the model's predicted frequency of defaults.

Answer 10

To test for a potential association between categorical variables. The procedure uses the marginal frequencies in the contingency table to construct a table of expected values. The actual and expected values are used to derive the chi-square test statistic, which is then compared to a value from the chi-square distribution for a given level of significance. If the test statistic is greater than the chi-square distribution value, there is evidence to reject the claim of independence, implying a significant association between the categorical variables.
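
The procedure can be sketched as below. The 2 x 2 counts are hypothetical, and the critical value 3.841 is the standard chi-square value for 1 degree of freedom at the 5% level, where degrees of freedom are (R - 1) x (C - 1):

```python
def chi_square_statistic(table):
    """Chi-square statistic for an R x C table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence, from the marginal frequencies.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    return statistic

observed = [[20, 30], [30, 20]]        # hypothetical observed counts
stat = chi_square_statistic(observed)
critical_5pct_1df = 3.841              # chi-square critical value, 1 df, 5% level
print(stat, stat > critical_5pct_1df)
```

Here every expected count is 25, so the statistic is 4 x (5^2 / 25) = 4.0, which exceeds the critical value.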

Answer 11

The presentation of data in a pictorial/graphical format for purpose of increasing understanding and gaining insights into the data.

Answer 12

A chart that presents the distribution of numerical data by using the height of a bar, or column to represent the absolute frequency of each bin/interval in the distribution. y axis - the absolute frequency/relative frequency in percentage terms. x axis - represents the bin of a variable. Absolute frequency histogram = answers the question of how many items are in each bin. Relative frequency histogram = gives the proportion or percentage of the total observations in each bin.

Answer 13

Plotting the mid point of each return bin on x-axis and the absolute frequency for that bin in the y-axis, connected with a straight line. This can quickly convey a visual understanding of the distribution since it displays frequency as an area under the curve.

Answer 14

A chart that can plot either the cumulative absolute frequency or the cumulative relative frequency on the y-axis against the upper limit of the interval. This allows us to see the number or percentage of observations that lie below a certain value. Curve flattens = the frequencies of observations in those bins are small. Curve steep = those bins contain most of the observations.

Answer 15

A frequency distribution of categorical data is plotted where each bar represents a distinct category, with the bar's height proportional to the frequency of the corresponding category. Vertical bar chart: - the y axis represents the absolute frequency/relative frequency - the x axis represents the mutually exclusive categories to be compared, rather than bins that group numerical data.

Answer 16

The categories in a bar chart are ordered by frequency in descending order, and the chart includes a line displaying cumulative relative frequency. The chart is used to highlight dominant categories or the most important groups.

Answer 17

Presents frequency distribution of 2 categorical variables to show joint frequencies. The bars within each cluster should be colored differently to distinguish between them, but color schemes for subgroups must be identical across sector clusters. The bars in each sector cluster must always be placed in the same order throughout the chart.

Answer 18

An alternative form for presenting the joint frequency distribution of two categorical variables. Each subsection of the bar is shown in a different color to represent the contribution of each subgroup. The overall height of the stacked bar represents the marginal frequency for the category.

Answer 19

A graphical tool for presenting categorical data, consisting of a set of colored rectangles that represent distinct groups, with the area of each rectangle proportional to the value of the corresponding group. This can represent data with additional dimensions by displaying a nest of rectangles. To display joint frequencies of sub-groups, we split each rectangle into sub-sections where the area of each nested rectangle is proportional to the number of stocks in each market capitalization sub-group.

Answer 20

A visual device for representing textual data, consisting of words extracted from a source of textual data, with the size of each distinct word being proportional to the frequency with which it appears in the given text. This format allows us to quickly perceive the most frequent terms in the text for information about its nature/sentiment. Words conveying different sentiments may be presented in different colors, e.g. positive words in green and negative words in red.

Answer 21

A type of chart used to visualise ordered observations, often used to display the change of data series over time. This facilitates showing changes in the data, and underlying trends in a clear and concise way, helping understand the current data and forecasting data series. It is especially helpful for making comparisons, with each distinct colour/pattern line representing each group of data.

Answer 22

This shows multidimensional data in one chart when an observational unit has more than 2 features of interest. Data points are replaced with varying-sized bubbles to represent the 3rd dimension of the data, and the bubbles can be color-coded to represent more information. E.g. each marker representing a revenue data point is replaced by a circular bubble with a size proportional to the magnitude of EPS in the corresponding quarter. The bubbles are colored in a binary scheme with green representing profits and red representing losses. 3 elements: - changes in revenue - changes in EPS - whether EPS represents a profit or a loss

Answer 23

A type of graph for visualising the joint variation in two numerical variables, useful for understanding potential relationships between variables. y axis = one variable x axis = other variable If data points seem to align along a straight line, then a significant relationship may exist among variables (positive or negative association). The strength of association is dependent on how closely the data points are clustered around the line, with a tight cluster signaling a stronger relationship. Assuming relationship among variables is apparent, the scatter plot can help spot extreme values (i.e. outliers).

Answer 24

A useful tool for organizing scatter plots between pairs of variables, making it easy to inspect all pairwise relationships in one combined visual. This contains each combination of bivariate scatter plots (i.e. S&P 500 vs each sector, IT vs utilities, IT vs financials, financial vs utilities), and univariate frequency distribution histograms for each variable plotted along the diagonal. Despite usefulness, these should not be considered as substitutes for robust statistical tests, but work alongside tests for best results.

Answer 25

A type of graphic that organises and summarizes data in a tabular format, and represents them using a colour spectrum. Cells in chart are colour coded to differentiate high values from low values defined by colour spectrum beside chart. Also used for visualising the degree of correlation among different variables.

Answer 26

1. An improper chart type is selected to present the data, which could hinder accurate interpretation. 2. Data are selectively plotted in favor of the conclusions an analyst intends to draw, e.g. presenting data over a short time frame can mistakenly point to a non-existent trend. 3. Data are improperly plotted in a truncated graph with a y-axis that does not start at zero, which creates a false impression of significant differences when they are actually small. 4. Improper scaling of the axes, e.g. a line chart with a higher than necessary maximum on the y-axis compresses the graph into an area close to the x-axis, making it appear less steep and less volatile than if properly plotted.

Answer 27

Specifies where the data are centered. These measures are widely used because they can be computed and applied relatively easily. Most common: - the arithmetic mean - the median - the mode - the weighted mean - the geometric mean - the harmonic mean

Answer 28

Includes measures of central tendency and other measures that illustrate the location or distribution of data. Most common: - quartiles - quintiles - deciles - percentiles

Answer 29

A summary measure of a set of observations. Descriptive statistics summarise the central tendency and the spread (variation) in the distribution of data.

Answer 30

The statistic summarises the set of all possible observations of a population.

Answer 31

If the statistic summarises a set of observations that is a subset of the population.

Answer 32

The sum of values of observations is divided by the number of observations.

Answer 33

The arithmetic mean/average is computed for a sample, as we cannot observe every member of the population, so instead observe a subset or sample of the population.

Answer 34

The distance between the mean and each outcome. This indicates risk, forming the foundation for the more complex concepts of variance, skewness, and kurtosis.

Answer 35

This may represent a rare but meaningful value in the population, an error in recording the value of an observation, or an observation generated from a different population than the one producing the other observations in the sample.

Answer 36

1. Examine the data, either by inspecting the sample observations if the sample is not too large or by using visualisation approaches. 2. Once comfortable that we have identified and eliminated errors, address what to do with extreme values in the sample. 3. Consider transforming the variable or selecting another variable that achieves the same purpose, e.g. alternative model specifications or variable transformation.

Answer 37

1. Do nothing; use data without any adjustment - value could be legitimate and present meaningful information. 2. Delete all the outliers - trimmed mean 3. Replace the outliers with another value. - winsorised mean

Answer 38

A measure of central tendency computed by excluding a stated small percentage of the lowest and highest values, and then computing an arithmetic mean of the remaining values. e.g. a 5% trimmed mean discards the lowest 2.5% and highest 2.5% of values and computes the mean of the remaining 95% of values.
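
A minimal sketch of a trimmed mean. The data are hypothetical, and the number of observations dropped per tail is rounded down, which is an assumption of this sketch:

```python
def trimmed_mean(data, trim_fraction):
    """Mean after discarding trim_fraction / 2 of the observations from each tail."""
    values = sorted(data)
    cut = int(len(values) * trim_fraction / 2)   # observations dropped per tail
    kept = values[cut:len(values) - cut]
    return sum(kept) / len(kept)

# Hypothetical returns with one extreme value in each tail.
returns = [-40, 2, 3, 5, 6, 7, 8, 9, 10, 90]
print(sum(returns) / len(returns))     # plain arithmetic mean: 10.0
print(trimmed_mean(returns, 0.20))     # drops the lowest and highest value: 6.25
```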

Answer 39

A measure of central tendency calculated by setting a stated percentage of the lowest values equal to one specified low value and a stated percentage of the highest values equal to one specified high value, and then computing a mean from the restated data. e.g. a 95% winsorized mean sets the bottom 2.5% of values equal to the value at or below which 2.5% of all values lie (2.5th percentile), and the top 2.5% of values equal to the value at or below which 97.5% of all values lie (97.5th percentile).
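
A rough sketch of a winsorised mean on hypothetical data. As a simplifying assumption, the cut-off values are taken as the nearest remaining observations rather than interpolated percentiles, and `tail_fraction` is the total share of observations replaced (0.20 replaces 10% in each tail, an 80% winsorised mean in the naming used above):

```python
def winsorized_mean(data, tail_fraction):
    """Mean after replacing tail_fraction / 2 of each tail with the nearest kept value."""
    values = sorted(data)
    cut = int(len(values) * tail_fraction / 2)    # observations replaced per tail
    low = values[cut]                             # replacement for the lowest values
    high = values[len(values) - cut - 1]          # replacement for the highest values
    restated = [min(max(x, low), high) for x in values]
    return sum(restated) / len(restated)

# Hypothetical returns with one extreme value in each tail.
returns = [-40, 2, 3, 5, 6, 7, 8, 9, 10, 90]
print(winsorized_mean(returns, 0.20))   # -40 becomes 2 and 90 becomes 10
```

Unlike trimming, the extreme observations stay in the sample, so the number of observations is unchanged.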

Answer 40

The value of the middle item of a set of items that has been sorted into ascending or descending order. For an odd-numbered sample of n items: the item in position (n+1)/2. For an even-numbered sample: the mean of the items in positions n/2 and (n+2)/2.
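
A small sketch of the median for odd and even sample sizes (the sample values are hypothetical). For an even n, the two middle items sit at 1-based positions n/2 and n/2 + 1 and are averaged:

```python
def median(data):
    """Middle value of the sorted data; averages the two middle values when n is even."""
    values = sorted(data)
    n = len(values)
    mid = n // 2
    if n % 2 == 1:
        return values[mid]                        # 1-based position (n + 1) / 2
    return (values[mid - 1] + values[mid]) / 2    # 1-based positions n/2 and n/2 + 1

print(median([3, 1, 7]))       # odd n
print(median([4, 1, 3, 2]))    # even n
```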

Answer 41

- It is affected less by outliers than the mean, so it is useful in describing data that follow a distribution that is not symmetric, e.g. revenue. - It does not use all the information about the size of the observations; it focuses on the relative position of the ranked observations. - The median is less mathematically tractable than the mean, as it requires ranking from smallest to largest, determining whether the sample size is odd or even, and then applying one of two calculations.

Answer 42

The most frequently occurring value in a distribution. A distribution can have more than 1 mode or no mode. The only measure of central tendency that can be used with nominal data. e.g. if we categorise investment funds into different styles and assign a number to each style, the mode of these categorised data is the most frequent investment fund style.

Answer 43

When a distribution has a single value that is most frequently occurring.

Answer 44

If a distribution has two most frequently occurring values, then it has 2 modes.

Answer 45

If the distribution has three most frequently occurring values, then it has 3 modes.

Answer 46

e.g. stock return data, and other data from continuous distributions.

Answer 47

When continuous data are grouped into bins, we often find an interval (possibly more) with the highest frequency.

Answer 48

Investment manager: $100m. Allocation: $70m equities, $30m bonds. Portfolio weights: 0.7 stocks and 0.3 bonds. What is the calculated portfolio return? - A weighted average of the returns on the stock and bond investments: multiply the return on the stock investment by 0.7 and the return on the bond investment by 0.3, then sum both results.

Answer 49

- Weighted average of the returns on the assets in the portfolio; the weight applied to each asset's return is the fraction of the portfolio invested in the asset.
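
A minimal sketch of the weighted mean, using the 0.7 / 0.3 weights from the allocation example. The asset returns of 10% and 4% are hypothetical:

```python
def weighted_mean(values, weights):
    """Sum of weight x value, where the weights are fractions that sum to 1."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(w * x for w, x in zip(weights, values))

weights = [0.7, 0.3]            # fraction of the portfolio in stocks and bonds
asset_returns = [0.10, 0.04]    # hypothetical stock and bond returns
portfolio_return = weighted_mean(asset_returns, weights)
print(portfolio_return)         # approximately 0.082, i.e. 8.2%
```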

Answer 50

e.g. the S&P 500 in the USA - each included stock receives a weight corresponding to its market value divided by the total market value of all stocks in the index. Expected value at year-end = (probability of expansion x forecast year-end level of S&P 500 assuming expansion) + (probability of contraction x forecast year-end level of S&P 500 assuming contraction). Expected return - a weighted average of possible future returns on the S&P 500, where the weights are probabilities that sum to 1.

Answer 51

Used to average rates of change over time or compute growth rate of a variable. Use: - average time series of rates of return on an asset or portfolio. - compute growth rate of a financial variable such as earnings or sales.

Answer 52

- Represents the growth rate or compound rate of return on an investment. (geometric) - Geometric - focus on profitability of an investment over a multi-period horizon. - Arithmetic - focus on average single-period performance.
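
A short sketch contrasting the two means on a hypothetical two-period return series. It assumes the usual compounding form of the geometric mean return, the n-th root of the product of (1 + r) minus 1:

```python
import math

def geometric_mean_return(returns):
    """Compound rate of return per period for a series of periodic returns."""
    growth = math.prod(1 + r for r in returns)    # ending value of 1 unit invested
    return growth ** (1 / len(returns)) - 1

returns = [0.50, -0.50]                           # hypothetical: +50% then -50%
arithmetic = sum(returns) / len(returns)          # average single-period return
geometric = geometric_mean_return(returns)        # multi-period compound return
print(arithmetic, geometric)
```

The arithmetic mean is 0%, yet one unit invested ends at 0.75, which the negative geometric mean of roughly -13.4% per period reflects.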

Answer 53

Another measure of central tendency, appropriate in cases in which the variable is a rate or ratio. The value obtained by summing the reciprocals of the observations 1/Xi, averaging that sum by dividing it by the number of observations n, and then taking the reciprocal of the average. A useful measure of central tendency in the presence of outliers. The harmonic mean is appropriate for averaging ratios (amount per unit) when the ratios are repeatedly applied to a fixed quantity to yield a variable number of units.
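
A minimal sketch of the harmonic mean, with a hypothetical example of investing a fixed amount at two different share prices. The average cost per share then equals the harmonic mean of the prices:

```python
def harmonic_mean(values):
    """Reciprocal of the average of the reciprocals of the observations."""
    return len(values) / sum(1 / x for x in values)

prices = [10.0, 20.0]        # hypothetical share prices in two periods
amount = 100.0               # fixed amount invested each period

shares_bought = sum(amount / p for p in prices)           # 10 + 5 shares
average_cost = amount * len(prices) / shares_bought       # total spent / total shares
print(harmonic_mean(prices), average_cost)
```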

Answer 54

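arithmetic mean > geometric mean > harmonic mean whenever the observations are positive and not all equal (the three means coincide when all observations are identical)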

Answer 55

Involves the periodic investment of a fixed amount of money. Arithmetic mean x harmonic mean = (geometric mean)^2; this identity is exact for two observations but need not hold for more.

Answer 56

A value at or below which a stated fraction of the data lies. Quartiles divide the distribution into quarters (1/4), quintiles into fifths (1/5), deciles into tenths (1/10), and percentiles into hundredths (1/100). The yth percentile is the value at or below which y% of observations lie.

Answer 57

The difference between the 3rd and 1st quartile, or IQR = Q3 - Q1.

Answer 58

Estimating an unknown value on the basis of two known values that surround it (i.e. lie above and below it), linear refers to straight line estimate.
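
A sketch of a percentile found by linear interpolation. The cards do not state where the yth percentile is located, so this assumes the common convention of 1-based location (n + 1) x y / 100 in the sorted data, and the sample values are hypothetical:

```python
def percentile(data, y):
    """y-th percentile, interpolating linearly between the two surrounding values."""
    values = sorted(data)
    n = len(values)
    location = (n + 1) * y / 100            # assumed location convention (1-based)
    location = min(max(location, 1), n)     # keep the location inside the data
    lower = int(location)                   # position just at or below the location
    fraction = location - lower
    if lower == n:
        return values[-1]
    # Straight-line estimate between the values below and above the location.
    return values[lower - 1] + fraction * (values[lower] - values[lower - 1])

data = [10, 20, 30, 40, 50, 60, 70, 80]     # hypothetical observations
q1 = percentile(data, 25)                   # location 2.25
q3 = percentile(data, 75)                   # location 6.75
iqr = q3 - q1                               # interquartile range
print(q1, q3, iqr)
```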

Answer 59

Box - spans from the lower bound of the second quartile (Q1) to the upper bound of the third quartile (Q3), i.e. the interquartile range, with the median or arithmetic average marked as the measure of central tendency. Whiskers - the lines that run from the box and are bounded by fences, which represent the lowest and highest values of the distribution.

Answer 60

1. Rank performance, e.g. of portfolios - Morningstar investment fund star rankings associate the number of stars with percentiles of performance relative to similar-style investment funds. 2. Investment research - e.g. the set of companies with returns falling below the 10th percentile cut-off forms the bottom return decile. Dividing data into quantiles based on a characteristic allows analysts to evaluate the impact of that characteristic on a quantity of interest, e.g. ranking companies by decile to compare the performance of small companies with larger ones.

Answer 61

The variability around the central tendency, addressing risk.

Answer 62

The amount of variability present without comparison to any reference point or benchmark.

Answer 63

1. Analyses data - range - mean absolute deviation 2. Measures risk - variance - standard deviation

Answer 64

Maximum value - minimum value Pros: - Ease of computation Cons: - only uses 2 pieces of information from the distribution, not representative.

Answer 65

A way to avoid the problem of negative deviations cancelling out positive deviations, which makes the plain mean of deviations always equal zero; absolute values of the deviations are averaged instead. Pros: Uses all observations in the sample, and is thus superior to the range as a measure of dispersion. Cons: It is difficult to manipulate mathematically compared with the next measure, the sample variance.

Answer 66

The average of squared deviations around the mean.

Answer 67

The positive square root of variance. More easily interpreted than variance as expressed in the same unit of measurement as the observations, by taking square root.
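
A compact sketch of the mean absolute deviation, sample variance, and sample standard deviation on hypothetical data. The sample variance here divides by n - 1, the usual sample convention:

```python
import math

def mean_absolute_deviation(data):
    """Average of the absolute deviations around the mean."""
    mean = sum(data) / len(data)
    return sum(abs(x - mean) for x in data) / len(data)

def sample_variance(data):
    """Sum of squared deviations around the mean, divided by n - 1."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** 2 for x in data) / (len(data) - 1)

def sample_std(data):
    """Positive square root of the sample variance (same units as the data)."""
    return math.sqrt(sample_variance(data))

returns = [2.0, 4.0, 4.0, 6.0, 9.0]      # hypothetical returns in percent, mean 5.0
mad = mean_absolute_deviation(returns)
variance = sample_variance(returns)
std = sample_std(returns)
print(mad, variance, std)
```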

Answer 68

When returns to an investor are below the mean or below some specified minimum target return.

Answer 69

Measure of dispersion of observations below the target. 1. Specify the target. 2. After identifying the observations below the target, find the sum of the squared negative deviations from the target. 3. Divide the sum by the total number of observations in the sample minus 1. 4. Take the square root.
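
The four steps can be sketched as follows. The returns and the 2% target are hypothetical:

```python
import math

def target_downside_deviation(returns, target):
    """Dispersion of the observations that fall below the target."""
    # Step 2: squared deviations for observations below the target only.
    squared_shortfalls = [(r - target) ** 2 for r in returns if r < target]
    # Step 3: divide by the total number of observations minus 1.
    # Step 4: take the square root.
    return math.sqrt(sum(squared_shortfalls) / (len(returns) - 1))

returns = [5.0, -2.0, 3.0, -1.0, 1.0]    # hypothetical returns in percent
target = 2.0                             # step 1: specify the target
tdd = target_downside_deviation(returns, target)
print(tdd)                               # sqrt((16 + 9 + 1) / 4)
```

Note that the divisor uses all observations in the sample, not only those below the target.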

Answer 70

The amount of dispersion relative to a reference value or benchmark, e.g. the coefficient of variation.

Answer 71

The ratio of the standard deviation of a set of observations to their mean value. Uses: - When observations are returns, CV measures the amount of risk (standard deviation) per unit of reward (mean return). - An issue when dealing with returns is that if the mean return is negative, the statistic is meaningless. - CV may be stated as a multiple or a percentage. By expressing the magnitude of variation among observations relative to their average size, CV permits direct comparisons of dispersion across different datasets. - A scale-free measure.
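
A brief sketch of the coefficient of variation using the standard library's `statistics` module. The data are hypothetical, and rejecting a non-positive mean is a design choice of this sketch that reflects the caveat above:

```python
import statistics

def coefficient_of_variation(data):
    """Sample standard deviation divided by the mean (risk per unit of reward)."""
    mean = statistics.mean(data)
    if mean <= 0:
        raise ValueError("CV is not meaningful for a non-positive mean")
    return statistics.stdev(data) / mean

returns = [2.0, 4.0, 4.0, 6.0, 9.0]      # hypothetical returns in percent
cv = coefficient_of_variation(returns)
print(cv)
```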

Answer 72

A symmetrical bell-shaped distribution that plays a central role in the mean-variance model of portfolio selection and is extensively used in financial risk management. Characteristics: - mean, median, and mode are equal - completely described by two parameters - its mean and variance (or standard deviation).

Answer 73

The average cubed deviation from the mean standardised by dividing by the standard deviation cubed to make the measure free of scale. - Cubing preserves the sign of the deviations from the mean. Positive skew with mean greater than median: - Means more than half of the deviations from the mean are negative and less than half are positive. - For the sum of cubed deviations to be positive, the losses must be small and likely, and gains less likely but more extreme. - A positive skew means the average magnitude of positive deviations is larger than the average magnitude of negative deviations.

Answer 74

The measure of the combined weight of the tails of a distribution relative to the rest of the distribution, i.e. the proportion of the total probability that lies outside of, say, 2.5 standard deviations of the mean. Normal distribution: kurtosis = 3. Fat-tailed distribution: kurtosis > 3. Thin-tailed distribution: kurtosis < 3.
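
A rough sketch of skewness and kurtosis as standardised averages of cubed and fourth-power deviations. As a simplifying assumption it divides by n throughout, so sample-adjusted formulas will give slightly different values, and both datasets are hypothetical:

```python
def skewness_and_kurtosis(data):
    """Return (skewness, kurtosis, excess kurtosis) from moments about the mean."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n    # average squared deviation
    m3 = sum((x - mean) ** 3 for x in data) / n    # average cubed deviation
    m4 = sum((x - mean) ** 4 for x in data) / n    # average fourth-power deviation
    skew = m3 / m2 ** 1.5                          # divide by standard deviation cubed
    kurt = m4 / m2 ** 2                            # divide by variance squared
    return skew, kurt, kurt - 3                    # excess kurtosis is relative to 3

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 1, 1, 6]     # many small values, one large positive deviation
print(skewness_and_kurtosis(symmetric))
print(skewness_and_kurtosis(right_skewed))
```

In the second dataset the mean (2) exceeds the median (1) and the skewness is positive, matching the description of positive skew.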

Answer 75

A distribution that has fatter tails than the normal distribution. This tends to generate more frequent extremely large deviations from mean than normal distribution.

Answer 76

A distribution that has thinner tails than the normal distribution.

Answer 77

A distribution similar to the normal distribution as concerns relative weight in the tails.

Answer 78

- The distribution is negatively skewed (skewness of -0.4260), reflecting the influence of observations below the mean of 0.0347%. - The highest frequency of returns occurs within -0.5 to 0.0 standard deviations from the mean (negative skew). - The distribution is fat-tailed, indicated by positive excess kurtosis of 3.7962. With fat tails, there is a concentration of returns around the mean and fewer observations in the regions between the central and the two tail regions.

Answer 79

The measure of the linear relationship between two random variables, with the first step being to consider how the 2 variables vary together, i.e. their covariance.

Answer 80

The measure of how two variables in a sample move together. The average value of the product of the deviations of observations on two random variables (Xi and Yi) from their sample means. e.g. if X tends to be above its mean when Y is above its mean, then there is a positive covariance.

Answer 81

A standardised measure of how two variables in a sample move together. The ratio of the sample covariance to the product of the two variables' standard deviations.
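
A minimal sketch of sample covariance and correlation on hypothetical data, dividing by n - 1 as is usual for sample statistics:

```python
import math

def sample_covariance(x, y):
    """Sum of the products of deviations from the sample means, divided by n - 1."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

def sample_correlation(x, y):
    """Sample covariance divided by the product of the sample standard deviations."""
    std_x = math.sqrt(sample_covariance(x, x))    # covariance of x with itself = variance
    std_y = math.sqrt(sample_covariance(y, y))
    return sample_covariance(x, y) / (std_x * std_y)

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]          # hypothetical: y is exactly 2 * x
cov = sample_covariance(x, y)
corr = sample_correlation(x, y)
print(cov, corr)
```

Because y is an exact linear function of x with a positive slope, the correlation is 1 up to floating-point rounding.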

Answer 82

1. Correlation ranges from -1 to +1 for 2 random variables X and Y: -1 ≤ rXY ≤ +1. 2. Correlation = 0 indicates no linear relationship between the variables. 3. Correlation close to +1 indicates a positive relationship, with +1 being a perfect linear relationship. 4. Correlation close to -1 indicates a negative relationship, with -1 being a perfectly inverse linear relationship.

Answer 83

- Not a reliable measure as two variables can have a strong non-linear relation, and still have very low correlation. - Unreliable measure when outliers are present in one or both variables. - Correlation does not imply causality. - Spurious correlation
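
The first limitation can be illustrated with a small sketch on hypothetical data in which y is fully determined by x, yet the correlation is zero because the relation is not linear:

```python
import math

def correlation(x, y):
    """Correlation computed from deviations around the sample means."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / math.sqrt(ss_x * ss_y)

x = [-2, -1, 0, 1, 2]
y = [v ** 2 for v in x]       # exact non-linear relation: y = x squared
corr = correlation(x, y)
print(corr)                   # 0.0 despite the exact relationship
```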

Answer 84

Refers to: 1. Correlation between two variables that reflects chance relationships in a particular dataset. 2. Correlation induced by a calculation that mixes each of two variables with a third variable. 3. Correlation between 2 variables arising not from a direct relation between them, but from their relation to a third variable.

Answer 85

Knowing the means and standard deviations of two variables, as well as the correlation between them, does not tell the entire story.

(109 cards)