Quant Methods Flashcards

1
Q

Yi

Dependent and independent variables // Graph function

A

Yi = b0 + b1Xi + εi

  1. Dependent variable is Yi
  2. Independent variable is Xi
  3. Error term is εi
  4. Coefficients are b0 (intercept) and b1 (slope coefficient)
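
As a rough illustration of how b0 and b1 are estimated, the sketch below fits the line by ordinary least squares. The function name and the data are hypothetical; the data are chosen to lie exactly on y = 1 + 2x, so every error term is zero.

```python
from statistics import mean


def fit_simple_regression(x, y):
    """Return (b0, b1) for the least-squares line y = b0 + b1 * x."""
    x_bar, y_bar = mean(x), mean(y)
    # Slope: sum of cross-deviations divided by sum of squared x-deviations.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Hypothetical data lying exactly on y = 1 + 2x.
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]
b0, b1 = fit_simple_regression(x, y)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
```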
2
Q

Scatter Plots types

A
3
Q

Correlation coefficient (ρ or r) (Formula)

A

Correlation standardizes covariance by dividing it by the product of the standard deviations: r = Cov(X,Y) / (Sx * Sy)

Perfect positive correlation: +1
Perfect negative correlation: -1
No correlation: 0

4
Q

Covariance (Formula)

A

A statistical measure of the degree to which two variables move together

Sample covariance: Cov(X,Y) = Σ (xi - xmean)(yi - ymean) / (n - 1)

5
Q

(Sample) Standard Deviation Formula

A

Sx = [Σ (xi - xmean)^2 / (n - 1)]^(1/2)

Easier with calculator!!

6
Q

Using calculator for Data Series to get Sx, Sy, r

A
  1. Add Data Series: [2nd] + [7]
  2. View Stats / Results: [2nd] + [8] > LIN [Down arrow]

Does not calculate Covariance!

BUT

Cov = rxy * Sx * Sy
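
For checking the calculator output by hand, here is a minimal sketch that computes Sx, Sy, the sample covariance and r, and confirms that Cov = rxy * Sx * Sy. The data series and the helper name are made up for illustration.

```python
from statistics import mean, stdev


def sample_covariance(x, y):
    """Sample covariance with the n - 1 divisor."""
    x_bar, y_bar = mean(x), mean(y)
    cross = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return cross / (len(x) - 1)


# Hypothetical data series.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

s_x, s_y = stdev(x), stdev(y)   # sample standard deviations (n - 1 divisor)
cov_xy = sample_covariance(x, y)
r_xy = cov_xy / (s_x * s_y)     # correlation = standardized covariance
```

Going the other way, `r_xy * s_x * s_y` recovers `cov_xy`, which is the workaround for the calculator not reporting covariance.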

7
Q

Limitations of correlation analysis

A
  1. Correlation coefficient assumes a linear relationship (it will not capture e.g. a parabolic one)
  2. Presence of outliers can be distortive
  3. Spurious correlation
    • Correlation does not imply causation (rain in NYC has no effect on London bus routes, although there might be a statistical correlation)
    • Correlations without sound basis are suspect
8
Q

Assumptions underlying simple linear regression

A
  1. Linear relationship – might need transformation to make linear
  2. Independent variable is not random – assume expected values of independent variable are correct
  3. Expected value of error term is zero
  4. Variance of error term is same across all observations (homoskedasticity)
  5. Error terms uncorrelated (no serial/auto correlation) across observations
  6. Error terms normally distributed
9
Q

Standard error of the estimate (SEE)

A

Standard error of the distribution of the errors about the regression line

The smaller the SEE, the better the fit of the estimated regression line (the points lie tighter to the line)

SEE = [SSE / (n - k - 1)]^(1/2)

n = # of observations; k = # of independent variables (simple regression: 1)

10
Q

Sum of squared errors (SSE)

A

UNEXPLAINED: sum of squared differences between Actual (yi) and Prediction (ŷ)

The estimated regression equation will not predict the values of y exactly; it will only estimate them

A measure of this error is SSE = Σ (yi - ŷi)^2 (ŷ is the predicted value)

11
Q

The coefficient of determination (R2)

A

Describes the percentage of the variation in the dependent variable explained by movements in the independent variable

In simple regression R2 is just r squared (the sign is lost; add it back from the sign of the slope when calculating r again)

R2 = 80% = 0.8 > |r| = 0.8^(1/2) = 0.89, and here r = -0.89 because the slope is negative (see below)

ŷ (predicted) = 0.4 - 0.3x > b1 = -0.3

Alternatively: R2 = RSS / TSS (if the same, R2 = 1 > perfect fit)
R2 = 1 - SSE / TSS (if SSE = 0, R2 = 1 > perfect fit)
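
The sums of squares behind R2 can be checked numerically. The sketch below is an illustration only: the data are made up, and 2.2 and 0.6 are the least-squares intercept and slope for that data, which is what makes TSS = RSS + SSE hold. The SEE line uses the standard n - k - 1 divisor.

```python
from math import sqrt
from statistics import mean


def goodness_of_fit(y_actual, y_pred, k=1):
    """Return SSE, RSS, TSS, R2 and SEE for a fitted regression."""
    n = len(y_actual)
    y_bar = mean(y_actual)
    sse = sum((y - yh) ** 2 for y, yh in zip(y_actual, y_pred))  # unexplained
    rss = sum((yh - y_bar) ** 2 for yh in y_pred)                # explained
    tss = sum((y - y_bar) ** 2 for y in y_actual)                # total
    return {
        "SSE": sse,
        "RSS": rss,
        "TSS": tss,
        "R2": rss / tss,
        "SEE": sqrt(sse / (n - k - 1)),
    }


# Hypothetical data; 2.2 and 0.6 are its least-squares coefficients.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
y_hat = [2.2 + 0.6 * xi for xi in x]
fit = goodness_of_fit(y, y_hat)
```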

12
Q

Total sum of the squares (TSS)

A

ACTUAL (yi) - MEAN: sum of squared differences between each actual value of y and the mean value of y

Alternatively, TSS = RSS + SSE

13
Q

Regression sum of the squares (RSS)

A

EXPLAINED: PREDICTION (ŷ) - MEAN

Sum of squared differences between the estimated values for y and the mean value of y

14
Q

Graphic: Relationship between TSS, RSS and SSE

A
15
Q

Relationship between TSS, RSS and SSE

A
  • Using SSE, TSS and RSS to measure the goodness of fit of the estimated regression equation
  • The estimated regression equation would be a perfect fit if every value of the dependent variable yi happened to lie on the estimated regression line. This would result in SSE=0 and RSS=TSS
  • RSS/TSS is known as the coefficient of determination and is denoted by R2: R2 = RSS / TSS
16
Q

Hypothesis testing on regression parameters

A
  • Confidence Interval on b0 and b1
  • For a 90% confidence interval, 10% significance, 5% (α/2) in each tail
  • More HT in Multiple Regressions
17
Q

ANOVA tables

A
  • ANOVA stands for ANalysis Of VAriance
  • It is a summary table produced by statistical software such as Excel
  • Using the ANOVA table, calculate the coefficient of determination
  • The global test for the significance of the slope coefficient
  • Use of the F-statistic
18
Q

Prediction intervals on the dependent variable

A
  • Range of dependent variable (Y) values for a given value of the independent variable (X) and a given level of probability
  • Two sources of error: Regression line and SEE

eg. 20 ——– 40

19
Q

Limitations of regression analysis

A
  1. Parameter instability - Regression relationships can change over time
  2. Public knowledge of relationships - If a number of analysts identify a regression relationship that works, prices will change to reflect the inflow of funds, possibly removing the trading opportunity
  3. Assumption violation - If regression assumptions are violated then hypothesis tests and predictions will be invalid
20
Q

Multiple Regression

A

Assumptions

  1. The relationship between the dependent variable and each independent variable is linear
  2. The independent variables are not random and there is no multicollinearity (x:x)
  3. The expected value of the error term is zero
  4. Error term is homoskedastic (variance of the error term is constant; same scatter)
  5. No serial correlation
  6. Error term is normally distributed
21
Q

ANOVA

A

Work out:

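  1. Degrees of freedom (DF) with k = # of independent variables; n = sample size: regression DF = k, error DF = n - k - 1, total DF = n - 1
  2. Sum of squares: two of the three will be given; derive the third from TSS = RSS + SSE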
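
The two steps above can be sketched as a small function that fills in an ANOVA table from RSS and SSE. The function name, dictionary layout and input values are illustrative assumptions, not the output of any particular package.

```python
def anova_table(rss, sse, n, k):
    """Build ANOVA rows from the explained and unexplained sums of squares."""
    tss = rss + sse              # TSS = RSS + SSE
    msr = rss / k                # mean square regression
    mse = sse / (n - k - 1)      # mean square error
    return {
        "regression": {"df": k, "SS": rss, "MS": msr},
        "error": {"df": n - k - 1, "SS": sse, "MS": mse},
        "total": {"df": n - 1, "SS": tss},
        "F": msr / mse,          # compare with the critical F value
        "R2": rss / tss,
    }


# Hypothetical inputs: one independent variable, five observations.
table = anova_table(rss=3.6, sse=2.4, n=5, k=1)
```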
22
Q

Using the regression equation to estimate the value

A

Becomes: Ŷ = 0.163 - (0.28 × 11) + (1.15 × 18) + (0.09 × 215) = 37.13

But this is only an estimate; we will want to apply confidence intervals to it

23
Q

Individual test: T-test

A

Testing the significance of each of the individual regression coefficients and the intercept

Tcalc: bi / S.E.

Tcrit: 2 (given in CFA)

TCalc > TCrit (in absolute terms) = REJECT NULL (H0: bi = 0)

then bi is not equal to 0 = SIGNIFICANT

24
Q

Global F-Test: Testing the validity of the whole regression

A

Testing whether all of the slope coefficients, as a group, are insignificant (H0: all slope coefficients = 0)

FCalc > FCrit (one-tailed test) = REJECT NULL: at least one does not equal zero

25
Q

T-Test: Specified Value

A

Determining whether a regression coefficient is significantly different from a specified value e.g. 1

Tcalc: (bi - 1) / S.E.

Tcrit: 2 (given in CFA)

TCalc > TCrit (in absolute terms) = REJECT NULL (H0: bi = 1)

then bi is not equal to 1 = SIGNIFICANTLY DIFFERENT from the specified value
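
Both t-tests (against zero and against a specified value) follow the same pattern, sketched below. The coefficient 1.15 is borrowed from the estimation card; the standard error of 0.25 is a made-up value, and the critical value of 2 is the rule of thumb quoted above.

```python
def t_test(b_hat, std_err, hypothesized=0.0, t_crit=2.0):
    """Return (t_calc, reject_null) for H0: coefficient = hypothesized."""
    t_calc = (b_hat - hypothesized) / std_err
    return t_calc, abs(t_calc) > t_crit


# Hypothetical standard error of 0.25 for a coefficient of 1.15.
t_zero, reject_zero = t_test(1.15, 0.25)                   # H0: b = 0
t_one, reject_one = t_test(1.15, 0.25, hypothesized=1.0)   # H0: b = 1
```

With these inputs the coefficient is significantly different from 0 but not significantly different from 1.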

26
Q

R2 Recap

A

“The percentage of the total variation in the dependent variable (Y) that is explained by the regression equation”

27
Q

Adjusted R2

A
  • The problem with R2 is that it will automatically increase if new independent variables are added, even if the new variable adds very little to the regression
  • Adjusted R2 takes into account the number of independent variables
  • It will only increase if the new independent variable pulls its weight

Example: Adding in a 4th variable increases R2 (which looks good). However, Adjusted R2 decreases, and that is WORSE. Prefer the option where Adjusted R2 holds up (flat or higher), even if R2 itself stays the same or is lower.

Interpret rather than use formula.
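
For intuition only, the sketch below uses the standard formula Adjusted R2 = 1 - [(n - 1) / (n - k - 1)] × (1 - R2) to reproduce the situation in the example. The sample size and R2 values are made up.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R2 for n observations and k independent variables."""
    return 1 - ((n - 1) / (n - k - 1)) * (1 - r2)


# Hypothetical: a 4th variable lifts R2 from 0.800 to 0.802 with n = 30.
three_vars = adjusted_r2(0.800, n=30, k=3)
four_vars = adjusted_r2(0.802, n=30, k=4)
```

Here R2 rises but Adjusted R2 falls, so the 4th variable does not pull its weight.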

28
Q

Dummy variables in regression analysis

A
  • Qualitative variables are important - E.g. investor confidence
  • Incorporate by dummy variables - Assigned either “1” or “0”
  • If you want to describe j circumstances with dummy variables you need j-1 dummy variables - E.g. month of year effect requires 11 dummy variables

Write a suitable regression equation and test significance (t-test: Tcalc with [b1 / S.E.] > Tcrit = REJECT = Significant)

29
Q

Homoskedasticity

A

Variance of the error terms is constant across all of the observed data

30
Q

Heteroskedasticity

A

Variance of the error terms is not constant across all of the observed data

Testing for conditional heteroskedasticity: Breusch-Pagan test

31
Q

Breusch-Pagan test

Testing for conditional heteroskedasticity

A
  • Regress the squared errors (residuals) on the independent variables
  • Determine R2 of this regression
  • If no conditional heteroskedasticity there will not be a strong relationship
  • If a high R2 there may be a strong relationship
  • But also need to consider the number of observations: the test statistic is n × R2 (chi-square distributed with k degrees of freedom)
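
A minimal sketch of why the number of observations matters, assuming the usual test statistic n × R2 compared with a chi-square critical value (k degrees of freedom, one-tailed). The R2 and n values are made up; 5.991 is the 5% chi-square critical value for 2 degrees of freedom.

```python
def breusch_pagan(n, r2_resid, chi2_crit):
    """Return (BP statistic, reject_null) where H0 is no conditional heteroskedasticity.

    r2_resid is the R2 from regressing squared residuals on the
    independent variables; chi2_crit is looked up for k degrees of freedom.
    """
    bp_stat = n * r2_resid
    return bp_stat, bp_stat > chi2_crit


# Same R2 of 0.10 with two independent variables, different sample sizes.
bp_large, reject_large = breusch_pagan(n=60, r2_resid=0.10, chi2_crit=5.991)
bp_small, reject_small = breusch_pagan(n=20, r2_resid=0.10, chi2_crit=5.991)
```

The same R2 signals conditional heteroskedasticity with 60 observations but not with 20.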
32
Q

Correcting for heteroskedasticity

A

How we would correct for conditional heteroskedasticity:

  1. Compute robust standard errors
  2. Modify the regression equation by using generalized least squares method

Robust standard errors correct Tcalc

33
Q

Autocorrelation / Serial correlation

E:E

A
  • The residuals of a regression are correlated across observations, so that a positive (or negative) error in one observation affects the probability that there will be a positive (or negative) error in the next observation (previous error predicts the next error; E:E)
  • Effect is that standard errors may be incorrect
  • Thus we may incorrectly reject/fail to reject null hypotheses about the population values
    • If one or more of the independent variables is a lagged value of the dependent variable, then serial correlation causes all regression parameters to be invalid – very serious problem as you may be performing the wrong type of regression
  • Detect with Durbin-Watson statistic
34
Q

Durbin-Watson statistic

Detect autocorrelation

A

DW ≈ 2 * (1 - r), where r is the correlation between consecutive residuals

  • Obtain the critical values of the DW statistic, dl (lower) and du (upper) (given in exam)
  • Testing for positive autocorrelation:
  • H0: No positive autocorrelation
    • If DWcalc < dl, reject H0
    • If DWcalc > du, do not reject H0
    • If DWcalc lies between dl and du, the test is inconclusive

Example:

  • DW Statistic = 1.87
  • Assume the lower and upper critical values are 1.61 and 1.74
    => DWcalc (1.87) > du (1.74) => do not reject = No positive autocorrelation
    => if DWcalc was 1.65 => between dl and du => Inconclusive
    => if DWcalc was 1.00 => smaller than dl => REJECT => +ve Autocorrelation
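
The decision rule in the example can be written out as a short sketch. The function names and return strings are illustrative; the statistic itself is computed with the standard definition (sum of squared changes in the residuals divided by the sum of squared residuals).

```python
def dw_statistic(residuals):
    """Durbin-Watson statistic from a list of regression residuals."""
    num = sum((curr - prev) ** 2 for prev, curr in zip(residuals, residuals[1:]))
    den = sum(e ** 2 for e in residuals)
    return num / den


def dw_decision(dw_calc, d_lower, d_upper):
    """Test of H0: no positive autocorrelation."""
    if dw_calc < d_lower:
        return "reject H0"
    if dw_calc > d_upper:
        return "do not reject H0"
    return "inconclusive"


# The three cases from the example, with dl = 1.61 and du = 1.74.
case_a = dw_decision(1.87, 1.61, 1.74)
case_b = dw_decision(1.65, 1.61, 1.74)
case_c = dw_decision(1.00, 1.61, 1.74)
```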
35
Q

Correcting for serial correlation

A
  1. Hansen method of adjusting the standard errors of the regression coefficients upwards
  2. Change the regression equation so that the autocorrelation is eliminated (do something different!!!!)

Hansen adjusts for both serial correlation and heteroskedasticity. It does not eliminate serial correlation.

36
Q

Multicollinearity

(X:X)

A

Definition

  • Multicollinearity occurs when two or more independent variables (or combinations of independent variables) in a regression model are highly (but not perfectly) correlated with each other (x:x)
    • Estimates of regression coefficients will be unreliable
    • Cannot distinguish individual impacts of independent variables

Detection of multicollinearity

  • High R2 (this works - your equation is predicting the movement in y)
  • Significant F-stat (at least one bi is significant)
  • but low t-stats on each regression coefficient (due to inflated standard errors) - not significant: might be evidence of multicollinearity
  • Can also be tested by pairwise correlation matrix but only when there are two independent variables (just look at the correlation of the pair: if close to -/+ 1 = multicollinear)

Correcting for multicollinearity

  • Reformulate the regression model, leaving out variables that appear to be redundant
  • Rerun the regression model
  • In practice it can be difficult to determine which variables to exclude so experimentation may be necessary
37
Q

Summary violation of assumptions

A
38
Q

Principles of model specification

A
  1. Model should be grounded in sensible economic reasoning - E.g. avoid data mining
  2. Functional form of variables should be appropriate - E.g. use logs of inputs if appropriate
  3. Model should be parsimonious i.e. achieving a lot with a little
  4. Model should be examined for violations of regression assumptions before being accepted
  5. Model should be tested ‘out of sample’, i.e. use new sample data before being accepted
39
Q

The model could fail because:

A
  1. One or more important variables are omitted (forget to put a variable in)
  2. One or more of the regression variables may need to be transformed - E.g. using natural logs for exponential data (or rescaling from millions to thousands)
  3. Data from different samples is pooled, e.g. using data from different stages of a company’s growth (mixing relationships)
40
Q

Models with qualitative dependent variables

NOT dummy (independent)

A

Qualitative dependent variables are where dummy variables are used as dependent rather than independent variables

There are three main models:

  1. Probit
    • Estimate the probability of a discrete outcome (e.g. that a company will go bankrupt). Uses normal distribution
  2. Logit model
    • is based on the ‘logistic distribution’, a simplified version of the normal distribution that was useful before computers were developed
  3. Discriminant analysis
    • Yields a linear function that is similar to a regression equation that will create an overall ‘score’ for the dependent variable based on the values of the independent variables. If the score is above a certain number, the dependent variable is assigned a value of ‘1’; otherwise, it is assigned a value of ‘0’

Qualitative dependent output!!

41
Q

Time-series // Time-series analysis

A

A time series is a set of observations on a variable’s outcomes in different time periods

Models to use: Trend model (linear / log linear) & Auto Regressive (AR)

Key issues:

  1. How do we predict a future value based on past values?
  2. How do we model seasonality?
  3. How do we choose which models to use?
  4. How do we model changes in the variance of the time series over time?
42
Q

Linear trend models

A

Residuals are probably serially correlated; use the DW statistic to spot it

43
Q

Limitations of trend models

A
  • Residuals are often serially correlated, which tends to bias the standard errors of regression coefficients downward (E:E; if overstated last period = overstated again)
  • This violates regression assumptions
  • Testing for serial correlation
    • Durbin-Watson test (see overleaf for reminder of DW test)
    • Plot a graph of Y against time, superimpose the linear trend model estimates and judge it by eye
44
Q

Log-linear trend models

A

A trend model in which the logarithm of the dependent variable (ln Yt) is linearly related to time

45
Q

Autoregressive time series models

A

If a trend model has unacceptably high serial correlation in its residuals, an autoregressive time series model may solve the problem

An autoregressive time-series model is one in which the value of a time series in one period (xt) is related to its value in previous periods (xt-1, xt-2, and so forth).

Valid statistical inferences can be made from autoregressive time-series models only if the time series is covariance stationary

46
Q

Covariance stationary

A

In essence, its mean and variance do not change over time

To be covariance stationary, a time series must satisfy three requirements:

  1. The mean of the time series must be constant and finite in all periods
  2. The variance of the time series must be constant and finite in all periods
  3. The covariance of the time series with itself for a fixed number of periods in the past or future must be constant and finite in all periods.

If a time series is not covariance stationary we cannot model it using an AR model. If a time series is not covariance stationary we may be able to transform it into one that is

47
Q

Standard Error Formula

A
48
Q

Testing for serial correlation in an autoregressive time series model

A
  • CANNOT use Durbin-Watson statistic
  • T-test to test the regression coefficients and the autocorrelations of the residuals
  1. The regression coefficient is statistically significant at 5% because t-stats are larger than 2.0 (IMPORTANT: Here significant is good as Xt-1 is explaining Xt)
  2. Autocorrelations of the residuals are all not significantly different from zero due to low t-stats (good news), so no need to re-specify the model
49
Q

Chain rule

A

Process of forecasting where uncertainty is added at each forecast period so multi-period forecasts have more uncertainty than single period forecasts
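
A minimal sketch of the chain rule for a first-order autoregressive model, xt = b0 + b1 × xt-1: each forecast is fed back in as the input for the next period, which is why errors compound. The coefficients and starting value are made up for illustration.

```python
def ar1_chain_forecast(x_last, b0, b1, steps):
    """Multi-period AR(1) forecasts, each built on the previous forecast."""
    forecasts = []
    x = x_last
    for _ in range(steps):
        x = b0 + b1 * x      # the one-step forecast becomes the next input
        forecasts.append(x)
    return forecasts


# Hypothetical AR(1): b0 = 2.0, b1 = 0.5, last observed value 10.0.
path = ar1_chain_forecast(x_last=10.0, b0=2.0, b1=0.5, steps=3)
```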

50
Q

Mean reversion

A
  • A time series exhibits the property of mean reversion if it tends to fall when its level is above its mean-reversion level (MRL) and rise when its level is below its mean-reversion level
  • Covariance stationary data will be mean reverting
  • b1 = 1 => No finite MRL => Not covariance stationary
  • b1 = 1 => Unit root => Random walk (not covariance stationary)
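
For an AR(1) model the mean-reversion level is usually computed as b0 / (1 - b1), which shows directly why b1 = 1 gives no finite MRL. The sketch below assumes that formula; the coefficient values are made up.

```python
def mean_reverting_level(b0, b1):
    """MRL of an AR(1) model; None when b1 = 1 (unit root, no finite MRL)."""
    if b1 == 1:
        return None
    return b0 / (1 - b1)


mrl = mean_reverting_level(b0=2.0, b1=0.5)        # finite level
no_mrl = mean_reverting_level(b0=0.3, b1=1.0)     # random walk with drift
```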
51
Q

Comparing Forecast Model Performance

A

Out of sample forecasts: we tend to look at out of sample results to compare forecast accuracy of two different models because the future is always out of sample

Typically compare performance using Root Mean Squared Error (RMSE):

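RMSE = √(average of E^2), where E is the forecast error (actual - forecast) in each out-of-sample period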

The smaller the better

Also consider coefficient stability!!
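
As an illustration, the sketch below compares two hypothetical models on the same out-of-sample data using RMSE (square root of the average squared forecast error). All numbers are made up.

```python
from math import sqrt


def rmse(actual, forecast):
    """Root mean squared error of a set of forecasts."""
    squared_errors = [(a - f) ** 2 for a, f in zip(actual, forecast)]
    return sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical out-of-sample observations and two competing forecasts.
actual = [10, 12, 11, 13]
model_a = [11, 12, 10, 13]
model_b = [12, 14, 9, 11]

rmse_a = rmse(actual, model_a)
rmse_b = rmse(actual, model_b)
```

Model A has the smaller RMSE, so it would be preferred on forecast accuracy alone.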

52
Q

Coefficient stability

A

Regression coefficients may not be stable over time

Don’t construct an AR model using data that crosses periods with very different underlying conditions; need to apply subjective judgment

53
Q

Simple random walk

A

A simple random walk is a time series whose value in every period equals its value in the previous period plus an unpredictable random error

Special case of a first-order autoregressive time series model in which b0 is 0 and b1 is 1

  1. Means that the best forecast of xt is xt-1, because the expected value of the error term is zero
  2. Note that it is not xt that is random, but the change xt - xt-1
  3. Random walks have an undefined mean-reverting level
  4. Random walk is NOT covariance stationary
54
Q

Random walk with drift

A

An autoregressive time series model in which b0 is not 0 and b1 is 1 is a random walk with a drift

Means that the best forecast of xt is b0 + xt-1, because the expected value of the error term is zero

The problem with all random walks is that the data is not covariance stationary

Can convert data to a covariance-stationary time series by first differencing

First-differenced series will have no predictive value but will help us conclude that the original series was a random walk.

55
Q

First Differencing

A

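Subtract the previous period's value from the current value: yt = xt - xt-1

For a random walk the differenced series is then:

yt = b0 + εt

Even though this does not help us to make predictions, it is nonetheless covariance stationary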
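
First differencing itself is a one-line transformation, sketched here on a made-up series: each new value is the change from the previous period.

```python
def first_difference(series):
    """Return the period-to-period changes of a time series."""
    return [curr - prev for prev, curr in zip(series, series[1:])]


# Hypothetical price series.
x = [100, 103, 101, 106, 110]
y = first_difference(x)
```

The differenced series has one fewer observation than the original.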

56
Q

Unit Root problem

A

Unit root = when a lag coefficient is not significantly different to one

Model not covariance stationary (need lag coefficient of less than 1)

If lag coefficient = 1 then we have a random walk. By definition all random walks have unit roots

If lag coefficient >1 then we have an explosive root

Need to transform into covariance stationary form with First Differencing

Test for unit root using the Dickey-Fuller test

57
Q

Dickey-Fuller Test

A

Test for unit root: Dickey-Fuller test to see if g = b1 - 1 is significantly different from zero

b1 - 1 = g

If b1 = 1 then g = 0, thus H0: g = 0

Calculate the t-stat as usual and compare it to a Dickey-Fuller critical value. If we reject the null then we do not have a unit root problem

58
Q

Seasonality in time-series models

A

Test Lags with T-test

If TCalc > TCrit = REJECT = Significant

Add lag: e.g. if the 4th lag is significant (quarterly data), re-specify the model to include a seasonal lag of one year

To test if the new model is correct, retest for seasonality

Once specified correctly can be used for forecasting

59
Q

ARCH Models

Autoregressive conditional heteroskedasticity models

A

To test for such a relationship: ARCH Test

  1. Regress the squared error terms on the previous period’s squared error terms
  2. If the regression coefficient (a1) of this ARCH(1) model is statistically significant (t-test), the error terms in the model are ARCH(1)
60
Q

Using ARCH model to predict the variance of the error terms

A

Use the ARCH equation to predict the variance of error terms in the t+1 period
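
A minimal sketch of the prediction step, assuming the usual ARCH(1) form in which next period's error variance equals a0 + a1 × (this period's squared error). The coefficient values and the error are made up.

```python
def arch1_variance_forecast(a0, a1, error_t):
    """Predicted variance of the error term in period t + 1 under ARCH(1)."""
    return a0 + a1 * error_t ** 2


# Hypothetical ARCH(1) estimates a0 = 0.5, a1 = 0.4 and a current error of 2.0.
predicted_variance = arch1_variance_forecast(a0=0.5, a1=0.4, error_t=2.0)
```

A large error this period raises the predicted variance for the next period, which is the sense in which the heteroskedasticity is conditional.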

61
Q

Cointegration

Definition
Example
How to test for it

A

Two (or more) time series might not be stationary, e.g. have a unit root problem, but if we regress the series against each other we might find we have a (covariance) stationary series – this is called cointegration. If the series are cointegrated then the error term of that regression will not have a unit root

Example: Regressing the price of a stock market index and also the associated future contract. Each one individually might exhibit a random walk however we would intuitively expect a stable relationship between them.

Only reliable for modelling where there is a long run, stable relationship.

Testing for cointegration: The (Engle-Granger) Dickey-Fuller test

62
Q

(Engle-Granger) Dickey-Fuller test

A

Testing for cointegration

Check whether the error term of the regression between the series has a unit root. If the series are cointegrated then the error term will not have a unit root.

If we reject the null (significantly different) then we conclude the error term is covariance stationary = no unit root = cointegrated

63
Q

Time-Series Analysis: Determining which model to use

A
64
Q

Machine learning defined

and 3 different classes

A
  • Extracting knowledge from large amounts of data (big data)
  • Goal of automating the decision-making process by ‘learning’ from known examples to determine an underlying structure in the data

Find the pattern, apply the pattern.

Broadly categorized into three distinct classes of techniques:

  1. Supervised learning
  2. Unsupervised learning
  3. Deep learning and reinforcement learning
65
Q

Supervised machine learning

A

Requires the use of a labelled data set i.e. matched set of observed inputs and the associated output

The ML algorithm is ‘trained’ using the labeled data set to infer the pattern-based prediction rule between the inputs and output

  • The ‘fit’ of the ML model is evaluated using labelled test data where the predicted targets (Y predicted) are compared to the actual targets (Y actual)
  • Two categories of problems:
    • Regression problems where the target variable is continuous (even if the ML technique used isn’t regression)
    • Classification problems where the target variable is categorical or ordinal e.g. fraudulent or non-fraudulent transactions
66
Q

Unsupervised machine learning

A

Does not make use of labeled data, the ML algorithm seeks to discover structure within the data set

Two types of problems suited towards unsupervised ML:

  1. Dimension reduction aims at reducing the number of features used whilst retaining variation across observations e.g. identifying major factors underlying asset price movements
  2. Clustering aims at sorting observations into groups (clusters) based on similarity; the groups may or may not be pre-specified (for example, the number of groups) e.g. sorting companies into groups based on financial statement data

67
Q

Deep learning and reinforcement learning

A

Deep learning: complex and sophisticated algorithms tackle highly complex tasks such as image and speech recognition

In reinforcement learning a computer learns from interacting with itself

Both are based on artificial neural networks (ANNs), and can be supervised or unsupervised

68
Q

Overfitting

A

Represents a model that fits its training data too well (i.e. it incorporates noise or random fluctuations) and does not predict well using out-of-sample data

Low or no in-sample error (Ein) but large out-of-sample error (Eout) represents poor generalization / overfit!

Main contributors to overfitting:

  1. High noise levels
  2. Too much complexity in the model i.e. features in the model, number of branches, linear or nonlinear relationship
69
Q

Sources of total out-of-sample error (Eout)

A
  1. Bias Error – the degree to which the model fits the training data (underfit?)
    • ML models with erroneous assumptions produce high bias and poor approximations which results in underfitting and high in-sample error
  2. Variance Error – a measure of how much the model’s results change in response to new data from the validation and test samples
    • An unstable model will pick up ‘noise’ and produce high variance causing overfitting and high out-of-sample error
  3. Base Error – error due to randomness in the data
70
Q

ML: Learning curves

A

High Variance Error = Over-fitting

High Bias Error = Under-fitting

71
Q

ML: Fitting curve

A

A fitting curve shows in- and out-of-sample error rates (Ein and Eout) against model complexity

Typically:

  • Linear functions are more susceptible to bias error and underfitting
  • Non-linear functions are more susceptible to variance error and overfitting

An optimal point (managing overfitting risk) of model complexity exists where the bias and variance error curves intersect and where Ein and Eout rates are minimized

72
Q

ML: Preventing overfitting

A
  1. Estimation of an overfitting penalty that increases in size with the number of included features
    • Prevents the algorithm from getting too complex during the selection and training process
    • Only include parameters that reduce out-of-sample error
  2. Cross-validation
    • A process aimed at reducing sampling bias
    • The challenge is to have a large enough data set to partition the data into representative groups for training, validation and testing (holdout sample).
    • k-fold cross validation: data (excluding a test sample) is shuffled randomly and split into k equal size sub-groups (typically 5 or 10), with k-1 groups used as training samples and one sample (the kth) used as a validation sample. The process is repeated k times so each data point is used in the training data set k-1 times and in the validation data set once. The average of the k-validation errors (mean Eval) taken as an estimate of the model’s Eout
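
The k-fold procedure can be sketched with the standard library alone. The function name, the fixed seed and the way folds are cut are illustrative choices, not a reference implementation; observations are represented by their index only.

```python
import random


def k_fold_splits(n_obs, k, seed=0):
    """Yield (training_indices, validation_indices) for each of k folds."""
    indices = list(range(n_obs))
    random.Random(seed).shuffle(indices)          # shuffle once, reproducibly
    folds = [indices[i::k] for i in range(k)]     # k roughly equal sub-groups
    for i in range(k):
        validation = folds[i]
        training = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield training, validation


# Hypothetical: 10 observations (test sample already set aside), 5 folds.
splits = list(k_fold_splits(n_obs=10, k=5))
```

Each observation lands in the validation set exactly once and in the training set k - 1 times; the model would be fitted k times and the k validation errors averaged.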
73
Q

ML: Penalized regression

A

Penalized regression is a process of regularisation that helps reduce the effect of ‘overfitting’ a model

A penalty term is created that increases in size as the number of included variables in the model increases, e.g. in Least Absolute Shrinkage and Selection Operator (LASSO) the penalty is λ × (sum of the absolute values of the slope coefficients)
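
To make the penalty concrete, the sketch below evaluates a LASSO-style objective, assumed here to be the sum of squared errors plus λ times the sum of absolute slope coefficients. The function name and all numbers are hypothetical, and no fitting is performed.

```python
def lasso_objective(y_actual, y_pred, slope_coefficients, lam):
    """Sum of squared errors plus the LASSO penalty term."""
    sse = sum((y - yh) ** 2 for y, yh in zip(y_actual, y_pred))
    penalty = lam * sum(abs(b) for b in slope_coefficients)
    return sse + penalty


# Hypothetical predictions from a model with two slope coefficients.
y_actual = [1.0, 2.0, 3.0]
y_pred = [1.0, 2.0, 4.0]
with_penalty = lasso_objective(y_actual, y_pred, [0.5, -1.5], lam=0.1)
no_penalty = lasso_objective(y_actual, y_pred, [0.5, -1.5], lam=0.0)
```

With λ = 0 the objective collapses to the ordinary sum of squared errors; a larger λ makes every non-zero coefficient more costly to keep.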

74
Q

ML: Support Vector Machine (SVM)

A

A very popular ML algorithm used for classification, regression, and outlier detection

SVM is a linear classifier that determines a hyperplane (e.g. a line) that optimally separates the data into two sets of data points

75
Q

ML: K-Nearest Neighbour (KNN)

A
  • Supervised ML technique used commonly for classification and sometimes for regression
  • Aims to classify a new observation by identifying similarities between the new observation and the existing data
  • KNN is a straightforward, intuitive, non-parametric technique that can be used in a multiclassification situation
  • However, defining the term ‘similar’ can be difficult
  • The value of k (a hyperparameter) must be carefully chosen:
    • Too small: Results in a high error rate and sensitivity to local outliers
    • Too big: Dilutes the concept of nearest neighbor by averaging too many outcomes
    • Even: May result in ties and no clear classification
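
A bare-bones sketch of KNN classification, taking "similar" to mean small Euclidean distance (one possible choice among many). The training points and labels are made up; k = 3 is odd so a two-class vote cannot tie.

```python
from collections import Counter
from math import dist


def knn_classify(train_points, train_labels, new_point, k=3):
    """Label a new observation by majority vote of its k nearest neighbours."""
    by_distance = sorted(
        zip(train_points, train_labels),
        key=lambda pair: dist(pair[0], new_point),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical labelled observations with two features each.
points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
labels = ["A", "A", "A", "B", "B", "B"]

near_a = knn_classify(points, labels, (2, 2))
near_b = knn_classify(points, labels, (6, 5))
```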
76
Q

ML: Classification and Regression Trees (CART)

A

ML technique used to predict either a:

  1. Categorical target variable, i.e. a classification problem, producing a classification tree, or
  2. Continuous outcome, i.e. a regression problem, producing a regression tree

Algorithm produces a visual decision tree with binary branching to classify observations

CART makes no assumptions about the characteristics of the training data. Therefore, if left unconstrained it can be subject to overfitting. This can be mitigated by the introduction of regularization parameters:

  • Maximum depth of tree
  • Minimum population at each node
  • Maximum number of decision nodes

77
Q

Ensemble learning

A

Combining the predictions from a collection of models to create an average predicted value

Heterogeneous learners: different types of algorithm combined together with a voting classifier

Homogeneous learners: combination of the same algorithm using different training data

78
Q

Random Forest Classifier

A

??

79
Q

Dimension reduction – Principal Components Analysis (PCA)

A

Unsupervised ML Algorithms: Process used to summarize or reduce highly correlated features into a few main, uncorrelated composite variables

80
Q

Clustering algorithms

A

Clustering groups observations solely on the basis of information found in the data with no pre-determined labelling. A cluster is created on a sub-set of data that is deemed to be ‘similar’

  • Cohesion – observations in each cluster are similar to each other
  • Separation – observations in two different clusters are as dissimilar as possible

Uncovers potentially interesting and novel relationships not previously identified using standard classifications to group companies such as industry and sector

Two popular approaches include:

  • K-means clustering
  • Hierarchical clustering

81
Q

K-means clustering

A

Iterative process of repeatedly partitioning data into a fixed number, k, of nonoverlapping clusters
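
The iteration can be sketched as two alternating steps: assign each observation to its nearest centroid, then move each centroid to the mean of its cluster. The version below is a simplified illustration with made-up points, hand-picked starting centroids and a fixed number of iterations rather than a convergence check.

```python
from math import dist


def k_means(points, centroids, iterations=10):
    """Return (centroids, clusters) after a fixed number of k-means passes."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its cluster.
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical observations forming two obvious groups; k = 2.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
final_centroids, final_clusters = k_means(points, centroids=[(0, 0), (10, 10)])
```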

82
Q

Hierarchical clustering

A

An iterative procedure that builds a hierarchy of clusters. The algorithm creates intermediate rounds of clusters that are of:

  1. Increasing size: Agglomerative – used in large datasets because of its fast computing speed. It makes decisions on local patterns without an initial global structure, therefore, it’s good at identifying smaller clusters
  2. Decreasing in size: Divisive – starts with an initial global structure and is better suited to identifying large clusters.
83
Q

Neural networks

A
84
Q

Deep Learning Nets (DLNs)

A
85
Q

Reinforcement Learning (RL)

A
86
Q

ML Summary

A
87
Q

Big Data

Definition

A

Characteristics of Big Data:

  1. Volume: Data collected in files, tables and datasets is large
  2. Velocity: The speed at which data is communicated is great! Real-time data is becoming the norm in many areas
  3. Variety: Data is collected from many different sources and in many different formats: - Structured data such as SQL tables and CSV files - Semi-structured data such as HTML code - Unstructured data such as video data

When using data for inference or prediction, there is a “Fourth V”:

  4. Veracity: Relates to the credibility and reliability of different data sources, e.g. fake news and spam emails
    • Identifying quality from quantity!
88
Q

Data Analysis: ML Model Building Summary

A
89
Q
A
90
Q

Traditional (Structured) ML Model Building Steps

A
  1. Conceptualization
    • Determining the inputs and output of the model, e.g. will the stock price rise or fall in a week’s time?
    • How will the model be used, and who will use it?
    • How will the model be incorporated into the business’s processes?
  2. Data Collection
    • Mostly data collected from internal and external sources in a structured form, e.g. cells with values
    • External data can be accessed through an application programming interface (API), which allows communication between different software components
  3. Data Preparation and Wrangling
    • Cleansing the data to resolve missing values or out-of-range values
    • Preprocessing the data: extracting, aggregating, filtering, and selecting relevant data columns
  4. Data Exploration
    • Involves exploratory data analysis, feature selection, and feature engineering
  5. Model Training
    • Selecting the appropriate ML method(s)
    • Evaluating the performance of the trained model
    • Tuning the ML model
91
Q

Text Based (Unstructured) ML Model Building Steps

A
  1. Text Problem Formulation
    • Identify the inputs and outputs, e.g. a sentiment score that is a structured output derived from an unstructured input, like text
  2. Data (Text) Curation
    • Gathering external text data via web services or web spidering (scraping or crawling) programs that extract raw content from a source, like web pages
  3. Text Preparation and Wrangling
    • Cleaning and preprocessing to convert the unstructured text into a format that can be interpreted by traditional modeling methods designed around structured inputs
  4. Text Exploration
    • The process of visualizing the text using techniques such as word clouds
    • Also, text feature selection and engineering
  5. Model Training

The output resulting from the process could be combined with other structured variables or used directly for forecasting and analysis. The details of steps 3 and 4 differ between structured data and text-based (unstructured) data. We will go on to look at these points in more detail.

92
Q

Introduction to Data Preparation and Wrangling

A

Data Preparation (Cleansing)

  • The process of examining, identifying, and mitigating errors in raw data
  • Common issues include missing, duplicated, erroneous or inaccurate values
  • Automated data can have similar issues due to software bugs and server failures

Data Wrangling (Preprocessing)

  • Involves the transformation and processing of the cleansed data so that it is ready to be used for ML model training
  • The data may be processed to deal with outliers, extraction of useful variables from the existing data, and also scaling the data

Different for Structured data // Unstructured (Text) data

93
Q

Structured Data: Data Preparation and Wrangling

Data Preparation (Cleansing)

A

Possible errors in a raw dataset (e.g. a table) include:

  • Incompleteness error – data is not present, i.e. a missing value
    • Seek alternative sources
    • Missing values and NAs must either be omitted, or be replaced with “NA” so that they can later be deleted or substituted with an imputed value (e.g. the mean, median or mode, or an assumed 0)
  • Invalidity error – data is outside of a meaningful range, creating invalid data
  • Inaccuracy error – data is not a measure of true value
  • Inconsistency error – data conflicts with corresponding data points or reality, e.g. a title column shows ‘Mrs.’ when the sex column states ‘male’
  • Non-uniformity error – data is not present in a consistent format, e.g. GBP and £
  • Duplication error – where duplicate observations are present
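
A minimal sketch of how three of these errors (duplication, invalidity, incompleteness) might be handled in plain Python. The records, the field names, the valid age range of 0 to 120, and the choice of the median as the imputed value are all assumptions made for illustration.

```python
from statistics import median

# Hypothetical raw table with typical errors.
raw = [
    {"id": 1, "age": 34},
    {"id": 2, "age": None},   # incompleteness error: missing value
    {"id": 3, "age": 210},    # invalidity error: outside a meaningful range
    {"id": 4, "age": 46},
    {"id": 1, "age": 34},     # duplication error: repeated observation
]

# 1. Remove duplicate observations, keeping the first occurrence.
seen, rows = set(), []
for record in raw:
    key = (record["id"], record["age"])
    if key not in seen:
        seen.add(key)
        rows.append(dict(record))

# 2. Treat out-of-range ages as missing (assumed valid range: 0 to 120).
for record in rows:
    if record["age"] is not None and not 0 <= record["age"] <= 120:
        record["age"] = None

# 3. Impute missing ages with the median of the observed ages.
observed = [record["age"] for record in rows if record["age"] is not None]
imputed_value = median(observed)
for record in rows:
    if record["age"] is None:
        record["age"] = imputed_value

cleaned_ages = [record["age"] for record in rows]
```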
94
Q

Structured Data: Data Preparation and Wrangling

Data Wrangling (Preprocessing)

A
  • Predominantly the transformation and scaling of data on the cleansed data set
  • Common transformations used in practice include:
    • Extraction – new variable extracted from a current variable e.g. Age from observed DoB
    • Aggregation – consolidation of two or more similar variables into one variable e.g. capital gains/losses and income combined to give total return
    • Filtration – data rows not required must be identified and filtered
    • Selection – data columns not intuitively needed can be removed
    • Conversion – the data (nominal, ordinal, continuous, categorical) may need to be converted in order to be processed further, e.g. removal of prefixes such as currency symbols
  • Outliers need to be identified so that they can be removed or replaced. Several techniques exist; common rules flag data values that are:
      • more than 3 standard deviations from the mean, or
      • above the upper bound of the 3rd quartile + 1.5 times the inter-quartile range
  • There are several methods to deal with outliers:
    • Trimming – removal of the outliers and extreme values
    • Winsorization – extreme values and outliers are replaced with the maximum (for large outliers) and minimum (for small outliers) values of data points that are not deemed to be outliers
  • Scaling
    • The process of adjusting the range of a feature by shifting and changing the scale of the data
    • Required for ML techniques that need scaled data, e.g. a neural network
    • Two common methods: Normalization (sensitive to outliers) & Standardization (assumes normal distribution & less sensitive to outliers)
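
The outlier rules and the two scaling methods can be illustrated with the standard library alone. This is a sketch under stated assumptions: the data is made up, the lower bound (1st quartile minus 1.5 times the IQR) is added here as the mirror image of the upper bound, and `statistics.quantiles` is used with its default quartile method.

```python
from statistics import mean, stdev, quantiles

values = [10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 10.0, 50.0]  # made-up data, one extreme value

# Inter-quartile range (IQR) rule for flagging outliers.
q1, _, q3 = quantiles(values, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
inliers = [v for v in values if lower <= v <= upper]

# Trimming: drop the outliers altogether.
trimmed = inliers

# Winsorization: cap outliers at the largest / smallest non-outlier value.
winsorized = [min(max(v, min(inliers)), max(inliers)) for v in values]

# Normalization rescales to the range [0, 1]: (x - min) / (max - min).
lo, hi = min(winsorized), max(winsorized)
normalized = [(v - lo) / (hi - lo) for v in winsorized]

# Standardization centres on 0 with unit standard deviation: (x - mean) / stdev.
mu, sigma = mean(winsorized), stdev(winsorized)
standardized = [(v - mu) / sigma for v in winsorized]
```

Because normalization depends on the minimum and maximum, a single extreme value left in the data would compress all other observations towards 0, which is why it is described as sensitive to outliers.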
95
Q

Unstructured: Data Preparation (Cleansing)

A

Basic operations in the text cleansing process include removing:

  • HTML tags
    • Required if the text is obtained from a website
  • Punctuation and numbers
    • Generally removed, because it is the words in a sentence that convey meaning, e.g. the presence of the word “boosted” in an earnings press release may indicate positive sentiment (rather than the numerical figure)
    • However, sometimes they can be useful, e.g. a % sign (which would be replaced with the annotation /PercentSign/ to preserve grammatical meaning in the text)
  • White spaces
    • Removal of unnecessary white spaces that might have been introduced by the removal of punctuation and numbers
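
These cleansing operations can be sketched with regular expressions. The function name `cleanse` and the sample sentence are hypothetical, and stripping HTML with a regex is a simplification that only suits simple markup.

```python
import re

def cleanse(raw_html: str) -> str:
    """Apply the basic text cleansing steps to one raw text string."""
    text = re.sub(r"<[^>]+>", " ", raw_html)       # remove HTML tags
    text = text.replace("%", " /PercentSign/ ")    # keep the % sign as an annotation
    text = re.sub(r"\d+(?:\.\d+)?", " ", text)     # remove numbers
    text = re.sub(r"[^\w\s/]", " ", text)          # remove punctuation (keep '/' for annotations)
    return re.sub(r"\s+", " ", text).strip()       # collapse the extra white space left behind

cleaned = cleanse("<p>Revenue was <b>boosted</b> by 12.5%, to $3 million.</p>")
print(cleaned)  # Revenue was boosted by /PercentSign/ to million
```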
96
Q

Unstructured: Text Wrangling (Preprocessing)

A
  • Involves the process of tokenization: splitting text into separate tokens (e.g. words)
    • Can be done at a character or word level (most common)
  • The normalization process involves the following:
    • Lowercasing
      • Removes the distinction among the same words, e.g. “It” and “it”
    • Stop words
      • Words such as “is”, “the” and “a” don’t always carry semantic meaning, so they are often removed at this stage (or maybe later, in the data exploration stage, because of their high word frequency)
    • Stemming
      • Converting inflected forms of a word into a base word, e.g. “fishing”, “fished”, and “fisher” to the stem “fish”
    • Lemmatization
      • Converting inflected forms of a word into its morphological root, known as a lemma
      • Requires an understanding of the relevant dictionary and is more computationally expensive and advanced
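
A minimal sketch of tokenization followed by the first three normalization steps. The stop-word list and the suffix-stripping stemmer are deliberately tiny, hypothetical stand-ins for the much richer rule sets used in practice. Lemmatization is not shown because it needs a dictionary.

```python
import re

STOP_WORDS = {"is", "the", "a", "and"}   # illustrative list only
SUFFIXES = ("ing", "ed", "er", "s")      # naive stemming rules, for illustration only

def stem(token):
    """Strip the first matching suffix, leaving a stem of at least 3 characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    tokens = re.findall(r"[A-Za-z]+", text)              # word-level tokenization
    tokens = [t.lower() for t in tokens]                 # lowercasing
    tokens = [t for t in tokens if t not in STOP_WORDS]  # stop-word removal
    return [stem(t) for t in tokens]                     # stemming

tokens = preprocess("The fisher fished and is fishing")
print(tokens)  # ['fish', 'fish', 'fish']
```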
97
Q

Creating Bag-of-Words (BOW)

Unstructured: Text Wrangling (Preprocessing)

A

A procedure used to analyze text. The BOW is the collection of distinct tokens observed across all the texts in a sample data set

The final BOW created after normalization can be viewed in a document term matrix (DTM) which makes the text more structured

An N-grams technique can be used to attach words together to show representation of word sequences e.g. a bigram such as “not_present”. This ensures the term “not” isn’t considered a single token that may have been removed during normalization.
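
As a small illustration, the sketch below builds a BOW with bigrams and lays it out as a document term matrix (one row per text, one column per token). The two token lists and the helper name `add_bigrams` are hypothetical.

```python
from collections import Counter

# Hypothetical texts, already tokenized and normalized.
docs = [
    ["profit", "not", "present"],
    ["profit", "rose", "profit"],
]

def add_bigrams(tokens):
    """Append bigrams such as 'not_present' to the single-word tokens."""
    return tokens + [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]

docs = [add_bigrams(d) for d in docs]

# The bag-of-words: every distinct token seen in any text.
vocabulary = sorted({token for d in docs for token in d})

# Document term matrix: how often each vocabulary token occurs in each text.
dtm = []
for d in docs:
    counts = Counter(d)
    dtm.append([counts[term] for term in vocabulary])
```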

98
Q

Data Exploration Summary

A

Involves three vital tasks:

  1. Exploratory Data Analysis (EDA): This preliminary step of data exploration involves the creation of graphs, charts, heat maps and word clouds. EDA helps stakeholders connect and ensure the prepared data is sensible. EDA also allows for inspection of simple questions and hypotheses which enables planning for the next stage
  2. Feature Selection: Where only the key features from the dataset are selected for ML model training
  3. Feature Engineering: Process of creating new features by changing or transforming existing features

Tasks 2 and 3 heavily influence model performance!

99
Q

Structured Data: Data Exploration

A
  1. Exploratory Data Analysis (EDA)
    1. Principal Components Analysis (PCA) can be used on high-dimension data
    2. Exploratory visualization for one-dimensional data (bar charts etc)
    3. Exploratory visualization for two-dimensional data includes scatterplots etc
  2. Feature Selection
    1. Removal of unneeded, irrelevant, and redundant features to achieve model parsimony
    2. Basic diagnostic tests are carried out to identify feature redundancy, heteroskedasticity, and multicollinearity
    3. Dimension reduction is carried out, creating new combinations of features that are uncorrelated, which helps to reduce cost and increase processing speed
  3. Feature Engineering
    1. This process helps to further optimize and improve the features, e.g. categorizing ages into retirement-age and non-retirement-age features
    2. For categorical data it may involve one hot encoding where a categorical feature is converted to a binary outcome of 0 or 1, e.g. is_RetirementAge assigned “0” for false, and “1” for true
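
Both ideas fit in a few lines of Python. The retirement threshold of 65, the sector labels, and the helper name `one_hot` are assumptions made for the example.

```python
def one_hot(values):
    """Convert one categorical feature into one 0/1 feature per category."""
    categories = sorted(set(values))
    return [{f"is_{c}": int(v == c) for c in categories} for v in values]

# Feature engineering: derive a binary feature from age (threshold of 65 is assumed).
ages = [45, 67, 70, 30]
is_retirement_age = [int(age >= 65) for age in ages]

# One hot encoding of a categorical feature.
sectors = ["Tech", "Energy", "Tech"]
encoded = one_hot(sectors)
print(encoded[0])  # {'is_Energy': 0, 'is_Tech': 1}
```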
100
Q

Unstructured Data: Exploratory Data Analysis (EDA)

A
  • Most common text analytical procedures are:
    • Text classification – supervised ML to classify texts into different classes
    • Topic modelling – unsupervised ML that groups texts into topic clusters
    • Fraud detection
    • Sentiment analysis – both supervised and unsupervised ML to predict the sentiment of texts
  • Statistical measures used as part of EDA on text data:
    • Term (or Collection) Frequency (TF) = (number of times a given token occurs in all texts) / (total number of tokens); allows the analyst to identify (and potentially remove) noisy terms
    • Word associations
    • Average sentence and word length
    • Word and syllable count
  • Word clouds are a common visual technique used
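
The term frequency formula can be checked on a toy example. The two tokenized texts below are made up.

```python
from collections import Counter

# Hypothetical tokenized texts.
texts = [["profit", "rose", "profit"], ["loss", "rose"]]

all_tokens = [token for text in texts for token in text]
counts = Counter(all_tokens)

# TF = occurrences of the token in all texts / total number of tokens.
tf = {token: n / len(all_tokens) for token, n in counts.items()}
print(tf)  # {'profit': 0.4, 'rose': 0.4, 'loss': 0.2}
```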
101
Q

Unstructured Data: Feature Selection

A
  • For text data this involves selecting a subset of the tokens occurring in the dataset; these tokens represent the features of the dataset
  • Noisy features are the most infrequent and the most frequent tokens in the dataset (e.g. stop words). Identifying and removing this noise is an important task
  • General feature selection methods include:
    1. Frequency measures
    2. Chi-square test
      • Used to test the independence of two events, e.g. the occurrence of the token vs. the occurrence of the class
      • Useful for ranking – tokens with the highest test statistic occur more frequently in texts associated with a particular class and may be selected as features
    3. Mutual information (MI)
      • Measures how much information a token contributes to a class of texts
      • Value of “0” if the token appears equally in all classes, or “1” if it occurs in only one class of texts
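
For the chi-square test, one common way to compute the statistic is from a 2x2 table of token presence against class membership. The sketch below uses the standard 2x2 shortcut formula; the argument names and the counts are hypothetical, and it assumes no row or column total is zero.

```python
def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for token presence vs. class membership.

    n11: texts in the class that contain the token
    n10: texts outside the class that contain the token
    n01: texts in the class that do not contain the token
    n00: texts outside the class that do not contain the token
    """
    n = n11 + n10 + n01 + n00
    numerator = n * (n11 * n00 - n10 * n01) ** 2
    denominator = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
    return numerator / denominator

# Token found in 40 of 50 texts in the class but only 10 of 50 texts outside it.
strong_token = chi_square(40, 10, 10, 40)   # 36.0
# Token spread evenly across both groups: independent of the class.
neutral_token = chi_square(25, 25, 25, 25)  # 0.0
```

Ranking tokens by this statistic puts `strong_token` ahead of `neutral_token`, so it would be the better candidate feature.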
102
Q

Unstructured Data: Feature Engineering

A

This process is similar to techniques used for structured data

Techniques used include:

  • Numbers: Numbers of certain length could be identified as a particular token, e.g. 5-digit number representing a telephone area code in the UK. A feature labelled /number5/ could be created to represent a token
  • N-grams
  • Named Entity Recognition (NER) and Parts of Speech (POS)
    • Algorithms that analyze individual tokens and their surrounding semantics while referring to a dictionary in order to tag each token with an object class, e.g. taking a sentence and attaching labels such as verb, noun, percent, time, money etc.
103
Q

Model Training

Three vital tasks

A
  1. Method Selection: Deciding which ML method(s) to use (ML section)
  2. Performance Evaluation: Techniques and measures used to quantify and understand the model performance
  3. Tuning: Decisions and actions to improve the model performance

Iterative process: Repeated many times until the desired level of model performance is attained

104
Q

Model Training: Model Selection

A
  • Factors to consider when selecting the ML method or algorithm to be used include:
    • Supervised or Unsupervised
    • Type of Data
    • Size of Data
  • Once the method is selected, certain method-related decisions need to be made, i.e. the hyperparameters must be set, e.g. the number of hidden layers in a neural network
  • Data needs to be split before training begins:
    • In-sample data: Training sample (60%)
    • Out-of-sample data (40%): Validation sample and Testing sample
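
A minimal sketch of such a split. The card only gives 60% in-sample and 40% out-of-sample, so dividing the out-of-sample data equally into 20% validation and 20% testing is an assumption, as are the function name and the fixed seed.

```python
import random

def split(data, train=0.6, validation=0.2, seed=42):
    """Shuffle the observations and cut them into training, validation and test samples."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)   # fixed seed so the split is reproducible
    n_train = round(len(shuffled) * train)
    n_validation = round(len(shuffled) * validation)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_validation],
        shuffled[n_train + n_validation:],  # whatever remains is the test sample
    )

observations = list(range(100))             # stand-in for 100 observations
train_set, validation_set, test_set = split(observations)
```

Shuffling suits independent observations; for time-series data a chronological split would normally be used instead.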
105
Q

Model Training: Performance Evaluation

A

The process of measuring the ‘Goodness of Fit’ of the ML model. Several techniques are used, and we will discuss methods suited to binary classification models

  1. Error Analysis – the computation of four basic evaluation metrics, which are summarized in a confusion matrix:
    • True positive (TP)
    • False positive (FP) – Type I error
    • True negative (TN)
    • False negative (FN) – Type II error
  2. Receiver Operating Characteristic (ROC) – assesses model performance by plotting a curve that represents the trade-off between the false positive rate and the true positive rate for various cutoff points (the thresholds at which an observation is classified as either “0” or “1”)
    • False Positive rate = FP / (TN + FP)
    • True Positive rate (Recall) = TP / (TP + FN)
  3. Root Mean Squared Error (RMSE)
    • Appropriate for continuous data predictions and is commonly used in regression
    • A single metric capturing all the prediction errors in the data (n observations)
    • Square root of mean of the squared differences between actual values and the model’s predicted values
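
The metrics above can be computed directly from actual and predicted values. The labels and predictions below are made up for illustration, and the function names are hypothetical.

```python
from math import sqrt

def confusion_counts(actual, predicted):
    """Count TP, FP, TN and FN for binary labels coded as 0 and 1."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # Type I error
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # Type II error
    return tp, fp, tn, fn

def rmse(actual, predicted):
    """Square root of the mean squared difference between actual and predicted values."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 1, 0, 0, 1, 0, 0, 1]
tp, fp, tn, fn = confusion_counts(actual, predicted)

false_positive_rate = fp / (tn + fp)   # 1 / 4 = 0.25
true_positive_rate = tp / (tp + fn)    # 3 / 4 = 0.75 (recall)

error = rmse([2.0, 4.0], [1.0, 5.0])   # sqrt((1 + 1) / 2) = 1.0
```

Each cutoff point produces one such pair of rates; plotting the pairs for many cutoffs traces out the ROC curve.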
106
Q

Model Training: Tuning

A
  • Once the model has been evaluated, based on the findings, the performance of the model needs to be improved:
    • High prediction error on the training set = Underfit
    • Prediction error on the cross-validation (CV) set is much higher than on the training set = Overfit
  • Two types of error in model fitting:
    • Bias error:
      • Model is overly simplified and does not learn adequately from the training data
      • Associated with underfitting
    • Variance error:
      • Model is overly complicated and starts to memorize the training data, and therefore performs poorly on new data
      • Associated with overfitting
    • It is not possible to remove both; however, it is possible to minimize the total aggregate error (bias and variance error)
  • Hyperparameters must be chosen in advance, e.g. regularization term (λ) in a supervised model, number of hidden layers in a NN
  • Grid search is a method of systematically training an ML model using various combinations of hyperparameters and choosing the combination with the best model performance
  • Results can be analyzed using a fitting curve
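
A grid search over a single hyperparameter can be sketched as below. The model is a deliberately simple one-feature ridge regression without an intercept, chosen only because it has a closed-form solution. The data, the grid of λ values, and the function names are assumptions made for illustration.

```python
def fit_ridge(xs, ys, lam):
    """Slope of a one-feature ridge regression without intercept:
    minimizes sum((y - b*x)^2) + lam * b^2, giving b = sum(xy) / (sum(x^2) + lam)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def mse(xs, ys, slope):
    """Mean squared prediction error of the fitted line on a data set."""
    return sum((y - slope * x) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Hypothetical training and cross-validation (CV) samples.
train_x, train_y = [1.0, 2.0, 3.0, 4.0], [2.1, 3.9, 6.2, 7.8]
cv_x, cv_y = [5.0, 6.0], [10.1, 11.9]

grid = [0.0, 0.1, 1.0, 10.0, 100.0]   # candidate values of the regularization term λ
results = {}
for lam in grid:
    slope = fit_ridge(train_x, train_y, lam)
    results[lam] = (mse(train_x, train_y, slope), mse(cv_x, cv_y, slope))

# Choose the hyperparameter with the lowest error on the CV sample.
best_lam = min(results, key=lambda lam: results[lam][1])
```

Plotting the training and CV errors stored in `results` against λ gives a fitting curve for this model.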
107
Q

Tuning: Fitting Curve

A
  1. Very low regularization:
    • Prediction error on the training set is small (memorizing the data) but high on the cross validation data set
    • High variance error and low bias error
    • Model is overfitted as it does not perform well on new data
  2. Very high regularization:
    • Too few features included so the model is unable to learn
    • High prediction error on both the training (suggesting high bias) and CV datasets
    • Suggests model underfitting
  3. Optimum regularization
    • Minimizes total error in a balanced fashion, with similar prediction errors on the training and CV datasets