Quantitative Methods Flashcards
Numerical Data (e.g. Discrete, Continuous)
Values that can be counted or measured.
We can perform mathematical operations only on numerical data.
Categorical Data (e.g. Nominal, Ordinal)
consist of labels that can be used to classify a set of data into groups. Categorical data may be nominal or ordinal.
Discrete Data
Countable data, such as the months, days, or hours in a year
Continuous Data
Can take any fractional value (e.g., the annual percentage return on an investment).
Nominal Data
Data that cannot be placed in a logical order
Ordinal Data
Can be ranked in logical order
Structured Data
Data that can be organised in a defined way
Time series
A set of observations taken periodically e.g. at equal intervals over time
Cross-sectional data
Refers to a set of comparable observations all taken at one specific point in time.
Panel Data
Time series and cross-sectional data combined
Unstructured Data
A mix of data with no defined structure
One-dimensional array
represents a single variable (e.g. a time series)
Two-dimensional array
Represents two variables (e.g. panel data)
Contingency table
A two-dimensional array that displays the joint frequencies of two variables
Confusion matrix
A contingency table (two variables) that displays predicted and actual occurrences of an event
Relationship between geometric and arithmetic mean
The geometric mean is always less than or equal to the arithmetic mean, and the difference increases as the dispersion of the observations increases
Trimmed mean
Estimate the mean without the effects of a given percentage of outliers.
Winsorized mean
Decrease the effect of outliers on the mean.
Harmonic mean
Calculate the average share cost from periodic purchases in a fixed dollar amount.
Empirical probability
established by analysing past data (outcomes)
A priori probability
Determined using reasoning and inspection (not data), e.g. looking at a coin and deciding there is a 50/50 chance of each outcome.
Subjective probability
Established using personal judgement
Unconditional probability (marginal probability)
the probability of an event regardless of the past or future occurrence of other events.
Conditional probability
Where the occurrence of one event affects the probability of the occurrence of another event, e.g. P(A | B), the probability of A given that B has occurred.
Multiplication rule of probability
P(AB) = P(A | B) * P(B)
Addition rule of probability
P(A or B) = P(A) + P(B) - P(AB)
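A quick numeric check of the two rules. The probabilities below are illustrative assumptions, not values from the cards:

```python
# Illustrative probabilities (assumed for the example)
p_b = 0.4          # P(B)
p_a_given_b = 0.5  # P(A | B)
p_a = 0.3          # P(A)

# Multiplication rule: P(AB) = P(A | B) * P(B)
p_ab = p_a_given_b * p_b

# Addition rule: P(A or B) = P(A) + P(B) - P(AB)
p_a_or_b = p_a + p_b - p_ab
```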
Probability distribution
The probabilities of all the possible outcomes for a random variable
A discrete random variable
When the number of possible outcomes can be counted and each outcome has a measurable (positive) probability, e.g. the number of days it may rain in a month.
A continuous random variable
When the number of possible outcomes is infinite, even if upper and lower bounds exist, e.g. the amount of rainfall per month.
The probability function , p(x)
gives the probability that a discrete random variable will take on a given value x
A cumulative probability function (cdf) , F(x)
gives the probability that a random variable will be less than or equal to a given value.
Binomial Random Variable - E(X) = np
Binomial Random Variable - Var(X) = np(1-p)
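These two formulas can be verified against a brute-force computation over the binomial probability function (n and p here are illustrative):

```python
from math import comb

n, p = 10, 0.3
mean_formula = n * p            # E(X) = np
var_formula = n * p * (1 - p)   # Var(X) = np(1-p)

# Brute force over the pmf: p(x) = C(n, x) * p^x * (1-p)^(n-x)
pmf = [comb(n, x) * p**x * (1 - p)**(n - x) for x in range(n + 1)]
mean_bf = sum(x * pmf[x] for x in range(n + 1))
var_bf = sum((x - mean_bf) ** 2 * pmf[x] for x in range(n + 1))
```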
For a continuous random variable X, the probability of any single value of X is
0
The normal distribution has the following key properties:
- It is completely described by its mean, μ, and variance, σ², stated as X ~ N(μ, σ²). In words, this says that "X is normally distributed with mean μ and variance σ²."
- Skewness = 0 (symmetrical): the distribution is symmetric about its mean, so P(X ≤ μ) = P(μ ≤ X) = 0.5, and mean = median = mode.
- Kurtosis = 3; this is a measure of how flat the distribution is. Recall that excess kurtosis is measured relative to 3, the kurtosis of the normal distribution.
- A linear combination of normally distributed random variables is also normally distributed.
- The probabilities of outcomes further above and below the mean get smaller and smaller but do not go to zero (the tails get very thin but extend infinitely).
univariate distribution
the distribution of a single random variable
A multivariate distribution
The distribution of two or more random variables (takes into account correlation coefficients). It specifies the probabilities associated with a group of random variables and is meaningful only when the behavior of each random variable in the group is in some way dependent on the behavior of the others.
Number of correlations in a portfolio
0.5n*(n-1)
n = no. of assets in portfolio / variables
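The formula as a one-line sketch:

```python
def n_correlations(n: int) -> int:
    """Number of distinct pairwise correlations among n assets: 0.5 * n * (n - 1)."""
    return n * (n - 1) // 2
```

For a 10-asset portfolio this gives 45 distinct correlations.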
Normal distribution: +/-1 s.d. from the mean
68% confidence interval
Normal distribution: +/-1.65 s.d. from the mean
90% confidence interval
Normal distribution: +/-1.96 s.d. from the mean
95% confidence interval
Normal distribution: +/-2.58 s.d. from the mean
99% confidence interval
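The reliability factors above can be recovered from the inverse CDF of the standard normal distribution, which is available in the Python standard library (3.8+):

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, s.d. 1
# Two-tailed reliability factor: the z-value leaving (1 - conf)/2 in each tail
factors = {conf: z.inv_cdf(1 - (1 - conf) / 2) for conf in (0.90, 0.95, 0.99)}
# factors[0.90] ~ 1.645, factors[0.95] ~ 1.960, factors[0.99] ~ 2.576
```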
“standardizing a random variable” (finding z)
measuring how far it lies from the arithmetic mean
z = the no. of standard deviations the variable is from the mean
How to calculate z
how many standard deviations a variable is from the mean
z = (x - population mean) / standard deviation = (x - µ) / σ
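For example (illustrative numbers):

```python
# An observation of 30 from a population with mean 25 and s.d. 2 (assumed values)
x, mu, sigma = 30.0, 25.0, 2.0
z = (x - mu) / sigma  # 2.5 standard deviations above the mean
```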
shortfall risk
probability that a portfolio return or value will be below a target return or value
Roy’s safety first ratio (SF ratio)
The number of standard deviations the target return lies below the expected return:
SF ratio = (expected return - target return) / standard deviation of returns
The larger the SF ratio, the lower the probability of falling below the minimum threshold.
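A minimal sketch of the ratio, using illustrative returns and standard deviations:

```python
def sf_ratio(expected_return: float, target_return: float, sigma: float) -> float:
    """Roy's safety-first ratio: s.d.s between expected and target return."""
    return (expected_return - target_return) / sigma

# Portfolio A is preferred: the larger SF ratio implies a lower
# probability of falling below the target return.
sf_a = sf_ratio(0.09, 0.03, 0.12)   # 0.5
sf_b = sf_ratio(0.08, 0.03, 0.20)   # 0.25
```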
For a standard normal distribution, F(0) is:
0.5
By the symmetry of the z-distribution, F(0) = 0.5: half the distribution lies on each side of the mean. (LOS 4.j)
Holding period return –> Continuously compounded rate
ln(1 + holding period return)
ln = natural log
Continuously compounded rate –> Holding period return
e^(continuously compounded rate) - 1
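The two conversions are inverses of each other:

```python
from math import exp, log

hpr = 0.10                     # 10% holding period return (illustrative)
cc_rate = log(1 + hpr)         # continuously compounded rate, ~0.0953
recovered = exp(cc_rate) - 1   # back to the holding period return
```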
t-distribution
- Symmetrical.
- Defined by degrees of freedom (df), where the degrees of freedom = no. of sample observations - 1 (n – 1), for sample means.
- More probability in the tails (“fatter tails”) than the normal distribution.
- As the degrees of freedom (the sample size) gets larger, the shape of the t-distribution more closely approaches a standard normal distribution.
NB: not platykurtic. The t-distribution is less peaked than the normal distribution but has fatter tails.
For t-distribution, the lower the degrees of freedom, the fatter the tails and the greater the probability of extreme outcomes.
chi-square distribution (x^2)
used to test variance of a normally distributed population
- Distribution of the sum of squared values of n independent standard normal random variables (all positive values)
- Asymmetric
- Degrees of freedom = n-1
- As degrees of freedom increases, approaches normal distribution
Degrees of freedom, k (in context of distribution charts)
the number of independent values that are free to vary when computing a statistic; e.g. with the sample mean fixed, n - 1 of the observations are free to vary
F-distribution
used to test the equality of two population variances
- Quotient of two chi-square random variables, each divided by its degrees of freedom (m and n); takes only positive values
- Asymmetric
- As degrees of freedom increase, approaches normal distribution
F-stat formula
F-distribution
F-stat = ( x1^2 / m ) / ( x2^2 / n )
= (chi-square for sample 1 / m ) / (chi-square for sample 2 / n )
Monte Carlo Simulation
used to estimate a distribution of asset prices
Generate thousands of simulated outcomes for the asset from assumed distributions of its input variables, then calculate the mean/variance of the outcomes and price the asset accordingly.
Use of Monte Carlo Simulation
- Value complex securities.
- Simulate the profits/losses from a trading strategy.
- Calculate estimates of value at risk (VaR) to determine the riskiness of a portfolio of assets and liabilities.
- Simulate pension fund assets and liabilities over time to examine the variability of the difference between the two.
- Value portfolios of assets that have abnormal returns distributions.
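A minimal Monte Carlo sketch, assuming a one-period lognormal price model (s0, mu, sigma and the model itself are illustrative assumptions, stdlib only):

```python
import random
from math import exp
from statistics import mean, stdev

random.seed(42)
s0, mu, sigma, n_sims = 100.0, 0.05, 0.20, 10_000

# Simulate terminal prices S_T = S_0 * exp(r), r ~ N(mu - sigma^2/2, sigma)
terminal = [s0 * exp(random.gauss(mu - sigma**2 / 2, sigma)) for _ in range(n_sims)]

# Summarize the simulated distribution of outcomes
est_mean, est_sd = mean(terminal), stdev(terminal)
```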
Binomial random variable
When there are only two possible outcomes of a given event.
Sampling error
The difference between a sample statistic and its corresponding population parameter
sampling error of the mean = sample mean - population mean = x̄ - µ
The standard error of the sample mean (when the population standard deviation is known)
standard deviation of the distribution of the sample means
σ_x̄ = σ / √n
Effect on the standard error of the sample mean, if the sample size (n) increases?
n ↑ ⇒ standard error ↓
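Numerically (σ = 12 is an illustrative population standard deviation):

```python
from math import sqrt

sigma = 12.0  # population standard deviation (assumed)
# Quadrupling the sample size halves the standard error: 2.4, 1.2, 0.6
std_errors = {n: sigma / sqrt(n) for n in (25, 100, 400)}
```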
Desirable characteristics of an estimator (sample statistic)
Unbiased
Efficient
Consistent
Simple random sampling
Selecting a sample where each item in the population has the same probability of being chosen
Stratified random sampling
randomly selecting samples proportionally from sub-groups. Sub-groups are formed based on one or more defining characteristics
Cluster sampling
Similar to stratified random sampling, but the subgroups (clusters) are not necessarily based on defining characteristics of the data.
One-stage: a random selection of clusters is made, and every item in those clusters forms the sample
Two-stage: a random sample is drawn from within each of the randomly selected clusters
Central limit theorem
For a population with a mean (µ) and a variance (σ^2), the sampling distribution of the sample mean for samples of size 30+ will be approximately normal, with mean µ and variance σ^2/n, regardless of the population's distribution
Confidence interval
A range of values in which the population mean is expected to lie, with a given probability
Reliability factor for 90% confidence interval
1.645
Reliability factor for 95% confidence interval
1.96
Reliability factor for 99% confidence interval
2.575
Confidence interval for a single item selected from the population
population mean (µ) +/- reliability factor * σ
Confidence interval for a point estimate (values used to estimate population parameters) selected from a sample
Sample mean +/- reliability factor * standard error
Confidence interval for a sample mean
population mean (µ) +/- reliability factor * standard error
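For example, a 95% confidence interval built from illustrative sample values:

```python
from math import sqrt

x_bar, s, n = 50.0, 8.0, 64      # sample mean, sample s.d., sample size (assumed)
std_err = s / sqrt(n)            # standard error = 1.0
lower = x_bar - 1.96 * std_err   # 48.04
upper = x_bar + 1.96 * std_err   # 51.96
```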
Which test statistic should be used for a normal distribution with a known variance?
z-statistic
Which test statistic should be used for a normal distribution with an unknown variance?
t-statistic
Which test statistic should be used for a non-normal distribution with a known variance?
z-statistic
NB not available with a small sample n<30
Which test statistic should be used for a non-normal distribution with an unknown variance?
t-statistic
NB not available with a small sample n<30
Jackknife method of estimating standard error of the sample mean
Calculate the s.d. of multiple sample means (each sample with one observation removed from the sample).
- Computationally simple
- Used when population is small
- Removes bias from statistical estimates
Bootstrap method of estimating standard error of the sample mean
Calculate the s.d. of the means of many resamples drawn with replacement from the original sample.
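A minimal bootstrap sketch (the sample values are illustrative):

```python
import random
from statistics import mean, stdev

random.seed(0)
sample = [4.2, 5.1, 6.3, 5.8, 4.9, 5.5, 6.0, 5.2]

# Resample with replacement many times; the s.d. of the resample means
# is the bootstrap estimate of the standard error of the sample mean.
boot_means = [mean(random.choices(sample, k=len(sample))) for _ in range(2000)]
se_boot = stdev(boot_means)
```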
Two issues of the idea that larger samples increase accuracy of understanding the population
- May contain wrong observations (from other populations)
- Additional cost
Data snooping
Repeatedly using the same sample of observations to search for patterns until one is found - leads to 'data snooping bias'
Sample selection bias
When certain observations are systematically excluded from the analysis (usually due to lack of available data)
Survivorship bias
Only including active/live data. e.g. only including active funds in an analysis of fund performance.
Time-period bias
Using data within a time period that is either too long or too short
Look-ahead bias
When a study tests a relationship with data that was not available on the test date.
Stratified random sampling is most often used to preserve the distribution of risk factors when creating a portfolio to track an index of:
Corporate bonds
Risk factors ('strata') can be more easily identified, and these form the basis of the sample
If random variable Y follows a lognormal distribution then the natural log of Y must be:
normally distributed.
Steps involved in hypothesis testing:
- State the hypothesis
- Select the test statistic
- Specify the level of significance
- State the decision rule for the hypothesis
- Collect the sample and calculate the sample statistics
- Make a decision about the hypothesis
- Make a decision based on the test results
Null hypothesis (Ho)
- Always includes the '=' sign
- Tested as a two-tailed test when Ha is stated with '≠'
- The hypothesis the researcher wants to reject
Alternative hypothesis (Ha)
- What is concluded if the null hypothesis is rejected
General decision rule for a two-tailed test:
Reject Ho (null hypothesis) if:
test statistic > upper critical value, or
test statistic < lower critical value
(in one of the outer tails)
test statistic equation
(sample statistic - hypothesized value) / SE of sample statistic
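For a z-test of the population mean, the equation works out as follows (sample and hypothesis values are illustrative):

```python
from math import sqrt

x_bar, mu_0, sigma, n = 52.0, 50.0, 8.0, 64   # assumed sample and hypothesis values

# Test statistic = (sample statistic - hypothesized value) / standard error
z_stat = (x_bar - mu_0) / (sigma / sqrt(n))   # = 2.0

# Two-tailed decision rule at the 5% significance level
reject_at_5pct = abs(z_stat) > 1.96
```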
Type I error
Rejecting the null hypothesis when it is true
Type II error
Failing to reject null hypothesis when it is false
determined by sample size and choice of significance level
Probability of making a Type I error
wrongly rejecting null hypothesis
The significance level (α)
Probability of correctly rejecting the null hypothesis?
The power of the test
= 1 - prob. of making a Type II error
What is the decision rule for rejecting or failing to reject the null hypothesis based on?
the distribution of the test statistic
Statistical significance
refers to the use of a sample to carry out a statistical test meant to reveal any significant deviation from the stated null hypothesis.
Economic significance
the degree to which a statistically significant result is also economically meaningful once practical considerations are taken into account
p-value
Probability of obtaining a test statistic that would lead to a rejection of the null hypothesis (assuming the null hypothesis is true)
The smallest level of significance where the null can be rejected
When is it appropriate to use a z-test as the appropriate hypothesis test of the population mean?
Normal distribution and known variance
When is it appropriate to use a t-test as the appropriate hypothesis test of the population mean?
Unknown variance
Critical z-values for 10% level of significance
Two-tailed test: +/-1.65
One-tailed test: +1.28 or -1.28
Critical z-values for 5% level of significance
Two-tailed test: +/-1.96
One-tailed test: +1.65 or -1.65
Critical z-values for 1% level of significance
Two-tailed test: +/-2.58
One-tailed test: +2.33 or -2.33
Difference in means test
Two populations that are independent and normally distributed
Paired comparisons test
Two populations that are dependent of each other and normally distributed
How to test for the variance of a normally distributed population
The chi-squared test
How to test whether the variances of two normal populations are equal
The F -test
Parametric tests
based on assumptions about population distribution and parameters (e.g. mean = 3, variance = 100)
Non-parametric tests
based on minimal/no assumptions about the population, and test things other than parameter values (e.g. rank correlation tests, runs tests)
How to test whether two characteristics in a sample of data are independent of each other?
The X^2 (chi-square) test of independence, using a contingency table
The appropriate test statistic for a test of the equality of variances for two normally distributed random variables, based on two independent random samples, is:
the F-test.
The appropriate test statistic to test the hypothesis that the variance of a normally distributed population is equal to 13 is:
the χ2 test.
A test of the population variance is a chi-square test.
The test statistic for a Spearman rank correlation test for a sample size greater than 30 follows:
a t-distribution.
The test statistic for the Spearman rank correlation test follows a t-distribution.
Assumptions of Linear Regression
- Linear relationship between the dependent and independent variables
- Variance of the residual term is constant (homoskedasticity)
- Residual terms independently and normally distributed
Coefficient of Determination (R^2)
= SSR / SST (regression sum of squares / total sum of squares)
measures the percentage of total variation in the Y variable explained by the variation in X
For simple regression, R^2 = (correlation between X and Y)^2
Factorial function
The factorial function, denoted n!, tells how many different ways n items can be arranged where all the items are included.
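For example:

```python
from math import factorial

n_arrangements = factorial(5)  # 5 distinct items can be ordered 5! = 120 ways
```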
Coefficient of Variation
σ/µ
For a test of the equality of two variances
F-statistic.
unbiased estimator
an estimator whose expected value equals the parameter it is intended to estimate
A consistent estimator
an estimator for which the probability of estimates close to the value of the population parameter increases as the sample size increases