Quant Methods Flashcards
Dependent and independent variables // Graph function
Yi = b0 + b1Xi + εi
- Dependent variable is Yi
- Independent variable is Xi
- Error term is εi
- Coefficients are b0 (intercept) and b1 (slope coefficient)
Scatter plot types
Correlation coefficient (ρ or r) (Formula)
Correlation standardizes covariance by dividing it by the product of the standard deviations
Perfect positive correlation: +1
Perfect negative correlation: -1
No correlation: 0
Covariance (Formula)
A statistical measure of the degree to which two variables move together
(Sample) Standard Deviation Formula
Sx = [ Σ(xi - x̄)² / (n - 1) ]^(1/2)
Easier with calculator!!
Using calculator for Data Series to get Sx, Sy, r
- Add Data Series: [2nd] + [7]
- View Stats / Results: [2nd] + [8] > LIN [Down arrow]
Does not calculate Covariance!
BUT
Cov = rxy * Sx * Sy
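A minimal numpy sketch (illustrative numbers, not exam data) tying these cards together: sample standard deviations, correlation, and the identity Cov = r * Sx * Sy:

```python
# Illustrative only; the data values are made up.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

sx = x.std(ddof=1)                    # sample standard deviation of x
sy = y.std(ddof=1)                    # sample standard deviation of y
cov_xy = np.cov(x, y, ddof=1)[0, 1]   # sample covariance

r = cov_xy / (sx * sy)                # correlation = covariance standardized
print(sx, sy, r)
print(cov_xy, r * sx * sy)            # Cov = r * Sx * Sy (same number)
```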
Limitations of correlation analysis
- Correlation coefficient assumes a linear relationship (not parabolic, etc.)
- Presence of outliers can be distortive
- Spurious correlation
- Correlation does not imply causation (rain in NYC has no effect on London bus routes, although there might be a statistical correlation)
- Correlations without sound basis are suspect
Assumptions underlying simple linear regression
- Linear relationship – might need transformation to make linear
- Independent variable is not random – assume expected values of independent variable are correct
- Expected value of error term is zero
- Variance of error term is same across all observations (homoskedasticity)
- Error terms uncorrelated (no serial/auto correlation) across observations
- Error terms normally distributed
Standard error of the estimate (SEE)
Standard error of the distribution of the errors about the regression line
The smaller the SEE, the better the fit of the estimated regression line; the tighter the points sit around the line
k = # of independent variables (single regression: 1)
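The usual formula, consistent with k as defined above (n - 2 degrees of freedom in simple regression):

```latex
SEE = \sqrt{\frac{SSE}{n-k-1}} = \sqrt{\frac{\sum_{i=1}^{n}\left(y_i-\hat{y}_i\right)^2}{n-k-1}}
```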
Sum of squared errors (SSE)
UNEXPLAINED: Actual (yi) - Predicted (ŷi)
The estimated regression equation will not predict the values of y exactly; it only estimates them
A measure of this error is the SSE (ŷ is the predicted value)
The coefficient of determination (R2)
Describes the percentage variation in the dependent variable explained by movements in the independent variable
Just r2 (the +/- sign is lost); add the sign back when recovering r
R2 = 80% = 0.8 > r = 0.8^(1/2) = 0.89; the slope is negative, so r = -0.89 (see below)
ŷ (predicted) = 0.4 - 0.3x > b1 = -0.3
Alternatively: R2 = RSS / TSS (if RSS = TSS, R2 = 1 > perfect fit)
R2 = 1 - SSE / TSS (if SSE = 0, R2 = 1 > perfect fit)
Total sum of the squares (TSS)
ACTUAL (yi) - MEAN
Alternatively, TSS = RSS +SSE
Regression sum of the squares (RSS)
EXPLAINED: Prediction (ŷ) - MEAN
Difference between the estimated values for y and the mean value of y
Graphic: Relationship between TSS, RSS and SSE
Relationship between TSS, RSS and SSE
- Using SSE, TSS and RSS to measure the goodness of fit of the estimated regression equation
- The estimated regression equation would be a perfect fit if every value of the dependent variable yi happened to lie on the estimated regression line. This would result in SSE=0 and RSS=TSS
- RSS/TSS is known as the coefficient of determination and is denoted by R2 :
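A short numpy sketch (made-up data) showing the decomposition and that RSS/TSS and 1 - SSE/TSS give the same R² for an OLS fit:

```python
# Illustrative only; data values are made up.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.0, 5.0, 7.0, 6.0, 9.0])

b1, b0 = np.polyfit(x, y, 1)        # OLS slope and intercept
y_hat = b0 + b1 * x
y_bar = y.mean()

sse = ((y - y_hat) ** 2).sum()      # unexplained: actual - prediction
rss = ((y_hat - y_bar) ** 2).sum()  # explained: prediction - mean
tss = ((y - y_bar) ** 2).sum()      # total: actual - mean

print(np.isclose(tss, rss + sse))   # TSS = RSS + SSE holds for an OLS fit
print(rss / tss, 1 - sse / tss)     # both equal R^2
```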
Hypothesis testing on regression parameters
- Confidence Interval on b0 and b1
- For a 90% confidence interval, 10% significance, 5% (a/2) in each tail
- More HT in Multiple Regressions
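In symbols, a confidence interval on the slope takes the standard form (simple regression, n - 2 degrees of freedom):

```latex
\hat{b}_1 \pm t_{\alpha/2,\;n-2}\cdot s_{\hat{b}_1}
```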
ANOVA tables
- ANOVA stands for ANalysis Of VAriance
- It is a summary table produced by statistical software such as Excel
- Using the ANOVA table, calculate the coefficient of determination
- The global test for the significance of the slope coefficient
- Use of the F-statistic
Prediction intervals on the dependent variable
- Range of dependent variable (Y) values for a given value of the independent variable (X) and a given level of probability
- Two sources of error: Regression line and SEE
e.g. a prediction interval such as 20 to 40
Limitations of regression analysis
- Parameter instability - Regression relationships can change over time
- Public knowledge of relationships - If a number of analysts identify a regression relationship that works, prices will change to reflect the inflow of funds, possibly removing the trading opportunity
- Assumption violation - If regression assumptions are violated then hypothesis test and predictions will be invalid
Multiple Regression
Assumptions
- The relationship between the dependent variable and each independent variable is linear
- The independent variables are not random and there is no multicollinearity (x:x)
- The expected value of the error term is zero
- Error term is homoskedastic (error variance is constant; the same scatter across observations)
- No serial correlation
- Error term is normally distributed
ANOVA
Work out:
- Degrees of freedom (DF) with k = # variables ; n = sample size
- Sum of squares: 2 will be given (TSS = RSS + SSE)
Using the regression equation to estimate the value
Becomes: Ŷ = 0.163 - (0.28 x 11) + (1.15 x 18) + (0.09 x 215) = 37.13
But this is only an estimate, we will want to apply confidence intervals to this
Individual test: T-test
Testing the significance of each of the individual regression coefficients and the
intercept
Tcalc = bi / S.E.(bi)
Tcrit: 2 (given in CFA)
|TCalc| > TCrit = REJECT NULL (H0: bi = 0)
then bi not equal to 0 = SIGNIFICANT
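In symbols (degrees of freedom n - k - 1 in multiple regression):

```latex
t_{calc} = \frac{\hat{b}_i - 0}{s_{\hat{b}_i}}, \qquad df = n - k - 1
```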
Global F-Test: Testing the validity of the whole regression
Testing to see whether or not all of the regression coefficients as a group are insignificant
FCalc > FCrit = REJECT NULL: at least one coefficient does not equal zero (one-tailed test)
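The F-statistic is built from the ANOVA quantities (df = k and n - k - 1):

```latex
F_{calc} = \frac{MSR}{MSE} = \frac{RSS/k}{SSE/(n-k-1)}
```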
T-Test: Specified Value
Determining whether a regression coefficient is significantly different from a specified value e.g. 1
Tcalc = (bi - 1) / S.E.(bi)
Tcrit: 2 (given in CFA)
|TCalc| > TCrit = REJECT NULL (H0: bi = 1)
then bi not equal to 1 = significantly different from the specified value
R2 Recap
“The percentage of the total variation in the dependent variable (Y) that is explained by the regression equation”
Adjusted R2
- The problem with R2 is that it will automatically increase if new independent variables are added, even if the new variable adds very little to the regression
- Adjusted R2 takes into account the number of independent variables
- It will only increase if the new independent variable pulls its weight
Example: adding a 4th variable increases R2 (which looks good), but Adjusted R2 decreases, which is worse - the new variable does not pull its weight, so prefer the model without it (judge on Adjusted R2, not R2).
Interpret rather than use formula.
Dummy variables in regression analysis
- Qualitative variables are important - E.g. investor confidence
- Incorporate by dummy variables - Assigned either “1” or “0”
- If you want to describe j circumstances with dummy variables you need j-1 dummy variables - E.g. month of year effect requires 11 dummy variables
Write a suitable regression equation and test significance (t-test: if Tcalc = b1 / S.E. > Tcrit, REJECT = significant); see the example equation below
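A hypothetical month-of-the-year example (11 dummies, January as the omitted base case):

```latex
R_t = b_0 + b_1\,Feb_t + b_2\,Mar_t + \dots + b_{11}\,Dec_t + \varepsilon_t
```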
Homoskedasticity
Variance of the error terms is constant across all of the observed data
Heteroskedasticity
Variance of the error terms is not constant across all of the observed data
Testing for conditional heteroskedasticity: Breusch-Pagan test
Breusch-Pagan test
Testing for conditional heteroskedasticity
- Regress the squared errors against each independent variable
- Determine R2 of these regressions
- If no conditional heteroskedasticity there will not be a strong relationship
- If a high R2 there may be a strong relationship
- But also need to consider the number of observations
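A hand-rolled sketch of the idea (numpy/scipy assumed, synthetic data): regress the squared residuals on the independent variable and use the n × R² statistic, compared against a chi-square distribution with k degrees of freedom:

```python
# Sketch only; residuals are simulated so that their variance depends on x.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 100
x = rng.normal(size=n)
resid = rng.normal(scale=1 + 0.5 * np.abs(x))     # heteroskedastic residuals

# Auxiliary regression: squared residuals on x (with intercept)
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, resid ** 2, rcond=None)
fitted = X @ beta
y_aux = resid ** 2
r2 = 1 - ((y_aux - fitted) ** 2).sum() / ((y_aux - y_aux.mean()) ** 2).sum()

bp_stat = n * r2                                  # Breusch-Pagan test statistic
p_value = 1 - stats.chi2.cdf(bp_stat, df=1)       # k = 1 independent variable
print(bp_stat, p_value)                           # small p-value -> conditional heteroskedasticity
```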
Correcting for heteroskedasticity
How we would correct for conditional heteroskedasticity:
- Compute robust standard errors
- Modify the regression equation by using generalized least squares method
Robust standard errors correct Tcalc
Autocorrelation / Serial correlation
E:E
- The residuals of a regression are correlated across observations, so that a positive (or negative) error in one observation affects the probability that there will be a positive (or negative) error in the next observation (previous error predicts the next error; E:E)
- Effect is that standard errors may be incorrect
- Thus we may incorrectly reject/fail to reject null hypotheses about the population values
- If one or more of the independent variables is a lagged value of the dependent variable, then serial correlation causes all regression parameters to be invalid – very serious problem as you may be performing the wrong type of regression
- Detect with Durbin-Watson statistic
Durbin-Watson statistic
Detect autocorrelation
DW = 2 * (1 - r)
- Obtain the critical value of the DW statistic (given in exam)
- Testing for positive autocorrelation:
H0: No positive autocorrelation
- If DWcalc < dl, reject H0
- If DWcalc > du, do not reject H0
- If dl ≤ DWcalc ≤ du, the test is inconclusive
Example:
- DW Statistic = 1.87
- Assume the lower and upper critical values are 1.61 and 1.74
=> DWcalc (1.87) > du (1.74) => do not reject = No positive autocorrelation
=> if DWcalc was 1.65 => Inconclusive
=> if DWcalc was 1.00 => smaller than dl => REJECT => +ve Autocorrelation
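A small numpy sketch (hypothetical residuals) computing the exact DW statistic alongside the 2(1 - r) approximation on the card:

```python
# Illustrative only; residuals are made up.
import numpy as np

e = np.array([0.5, 0.3, 0.4, -0.1, -0.3, -0.2, 0.1, 0.4])  # regression residuals

dw = np.sum(np.diff(e) ** 2) / np.sum(e ** 2)   # sum (e_t - e_{t-1})^2 / sum e_t^2
r = np.corrcoef(e[1:], e[:-1])[0, 1]            # lag-1 autocorrelation of residuals
print(dw, 2 * (1 - r))                          # approximately equal for long series
```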
Correcting for serial correlation
- Hansen method of adjusting the standard errors of the regression coefficients upwards
- Change the regression equation so that the autocorrelation is eliminated (do something different!!!!)
Hansen adjusts for both serial correlation and heteroskedasticity. It does not eliminate serial correlation.
Multicollinearity
(X:X)
Definition
- Multicollinearity occurs when two or more independent variables (or combinations of independent variables) in a regression model are highly (but not perfectly) correlated with each other (x:x)
- Estimates of regression coefficients will be unreliable
- Cannot distinguish individual impacts of independent variables
Detection of multicollinearity
- High R2 (the regression as a whole predicts movement in y well)
- Significant F-stat (at least one bi is significant)
- but low t-stats on each regression coefficient (due to overstated standard errors) - individually not significant: evidence of multicollinearity
- Can also be tested with a pairwise correlation matrix, but only when there are two independent variables (if their correlation is close to +/- 1 they are multicollinear)
Correcting for multicollinearity
- Reformulate the regression model, leaving out variables that appear to be redundant
- Rerun the regression model
- In practice it can be difficult to determine which variables to exclude so experimentation may be necessary
Summary: violation of assumptions
Principles of model specification
- Model should be grounded in sensible economic reasoning - E.g. avoid data mining
- Functional form of variables should be appropriate - E.g. use logs of inputs if appropriate
- Model should be parsimonious i.e. achieving a lot with a little
- Model should be examined for violations of regression assumptions before being accepted
- Model should be tested ‘out of sample’, i.e. use new sample data before being accepted
The model could fail because:
- One or more important variables are omitted (forget to put a variable in)
- One or more of the regression variables may need to be transformed - E.g. using natural logs for exponential data (or rescaling from millions to thousands)
- Data from different samples is pooled, e.g. using data from different stages of a company’s growth (mixing relationships)
Models with qualitative dependent variables
NOT dummy (independent)
Qualitative dependent variables are where dummy variables are used as dependent rather than independent variables
There are three main models:
- Probit model - Estimates the probability of a discrete outcome (e.g. that a company will go bankrupt); uses the normal distribution
- Logit model - Based on the ‘logistic distribution’, a simplified version of the normal distribution that was useful before computers were developed
- Discriminant analysis - Yields a linear function, similar to a regression equation, that creates an overall ‘score’ for the dependent variable based on the values of the independent variables. If the score is above a certain number, the dependent variable is assigned a value of ‘1’; otherwise, it is assigned a value of ‘0’
Qualitative dependent output!!
Time-series // Time-series analysis
A time series is a set of observations on a variable’s outcomes in different time periods
Models to use: Trend model (linear / log-linear) & Autoregressive (AR)
Key issues:
- How do we predict a future value based on past values?
- How do we model seasonality?
- How do we choose which models to use?
- How do we model changes in the variance of the time series over time?
Linear trend models
Serial correlation is likely; use DW to spot it
Limitations of trend models
- Residuals are often serially correlated, which tends to bias standard errors of regression coefficients downward (E:E; if overstated last period, likely overstated again this period)
- This violates regression assumptions
Testing for serial correlation
- Durbin-Watson test (see overleaf for a reminder of the DW test)
- Plot a graph of Y against time, superimpose the estimated linear trend line and judge the fit by eye
Log-linear trend models
A trend model in which the logarithm of the dependent variable (lnYt ) is linearly related to time
Autoregressive time series models
If a trend model has unacceptably high serial correlation in its residuals, an autoregressive time series model may solve the problem
An autoregressive time-series model is one in which the value of a time series in one period (xt) is related to its value in previous periods (xt-1, xt-2, and so forth).
Valid statistical inferences can be made from autoregressive time-series models only if the time series is covariance stationary
Covariance stationary
In essence that its mean and variance do not change over time
To be covariance stationary, a time series must satisfy three requirements:
- The mean of the time series must be constant and finite in all periods
- The variance of the time series must be constant and finite in all periods
- The covariance of the time series with itself for a fixed number of periods in the past or future must be constant and finite in all periods.
If a time series is not covariance stationary we cannot model it using an AR model. If time series is not covariance stationary we may be able to transform to one
Standard Error Formula
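Assuming this card refers to the standard error used for the residual-autocorrelation t-tests in the next card, the usual formula is one over the square root of the number of observations:

```latex
s_{\hat{\rho}_k} = \frac{1}{\sqrt{T}}
```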
Testing for serial correlation in an autoregressive time series model
- CANNOT use Durbin-Watson statistic
- T-test on the regression coefficients and on the autocorrelations of the residuals (each autocorrelation divided by its standard error)
- The regression coefficient is statistically significant at 5% because the t-stats are larger than 2.0 (IMPORTANT: here significant is good, as Xt-1 is explaining Xt)
- Autocorrelations of the residuals are all not significantly different from zero due to low t-stats (good news), so no need to re-specify the model
Chain rule
Process of forecasting where uncertainty is added at each forecast period so multi-period forecasts have more uncertainty than single period forecasts
Mean reversion
- A time series exhibits the property of mean reversion if it tends to fall when its level is above its mean-reversion level (MRL) and rise when its level is below its mean-reversion level
- Covariance stationary data will be mean reverting
- b1 = 1 => No finite MRL => Not covariance stationary
- b1 = 1 => Unit root => Random walk (not covariance stationary)
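For an AR(1) model xt = b0 + b1·xt-1 + εt, the mean-reverting level is:

```latex
x_{MRL} = \frac{b_0}{1 - b_1}
```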
Comparing Forecast Model Performance
Out of sample forecasts: we tend to look at out of sample results to compare forecast accuracy of two different models because the future is always out of sample
Typically compare performance using Root Mean Squared Error (RMSE):
RMSE = √[ Σ(actual - forecast)² / n ]
The smaller the better
Also consider coefficient stability!!
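A quick numpy sketch (hypothetical forecasts) comparing two models by out-of-sample RMSE:

```python
# Illustrative only; forecasts and actuals are made up.
import numpy as np

actual  = np.array([2.0, 2.4, 2.1, 2.6, 2.9])
model_a = np.array([2.1, 2.3, 2.3, 2.5, 2.8])
model_b = np.array([1.8, 2.7, 1.9, 2.9, 3.2])

def rmse(y, y_hat):
    return np.sqrt(np.mean((y - y_hat) ** 2))

print(rmse(actual, model_a), rmse(actual, model_b))  # prefer the lower RMSE
```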
Coefficient stability
Regression coefficients are not stable over time
Don’t use data to construct an AR model that crosses periods with very different underlying conditions, need to apply subjective judgment
Simple random walk
A simple random walk is a time series whose value in every period equals its value in the previous period plus an unpredictable random error
Special case of a first-order autoregressive time series model in which b0 is 0 and b1 is 1
- Means that the best forecast of xt is xt-1, because the expected value of the error term is zero
- Note that it is not xt that is random, but the variable xt - xt-1
- Random walks have an undefined mean-reverting level
- Random walk is NOT covariance stationary
Random walk with drift
An autoregressive time series model in which b0 is not 0 and b1 is 1 is a random walk with a drift
Means that the best forecast of xt is b0 + xt-1 ,because expected value of error term is zero
The problem with all random walks is that the data is not covariance stationary
Can convert data to a covariance-stationary time series by first differencing
First-differenced series will have no predictive value but will help us conclude that the original series was a random walk.
First Differencing
Define a new series yt = xt - xt-1
For a random walk this gives yt = b0 + εt (with b0 = 0 if there is no drift)
Even though this does not help us to make predictions, it is nonetheless covariance stationary
Unit Root problem
Unit root = when a lag coefficient is not significantly different to one
Model not covariance stationary (need lag coefficient of less than 1)
If lag coefficient = 1 then we have a random walk. By definition all random walks have unit roots
If lag coefficient >1 then we have an explosive root
Need to transform into covariance stationary form with First Differencing
Test for Unit root using Dickey Fuller test
Dickey-Fuller Test
Test for unit root: DFT to see if b1 - 1 (g) is significantly different to zero
b1 -1 = g
b1 = 1 then g = 0 thus H0 : g = 0
Calculate t stat as usual and compare to a Dickey-Fuller critical stat. - If reject null then do not have unit root problem
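The test regression is obtained by subtracting xt-1 from both sides of the AR(1) model:

```latex
x_t - x_{t-1} = b_0 + g\,x_{t-1} + \varepsilon_t,\qquad g = b_1 - 1,\qquad H_0: g = 0
```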
Seasonality in time-series models
Test Lags with T-test
If TCalc > TCrit = REJECT = Significant
Add a lag: i.e. if the 4th lag is significant (quarterly data), re-specify the model to include a seasonal lag of one year
To test if the new model is correct, retest for seasonality
Once specified correctly can be used for forecasting
ARCH Models
Autoregressive conditional heteroskedasticity models
To test for such a relationship: ARCH Test
- (Regress the squared error terms on the previous period’s squared error terms)
- If the regression coefficient (a1 ) of this ARCH(1) model is statistically significant (T-TEST), the error terms in the model are ARCH(1)
Using ARCH model to predict the variance of the error terms
Use the ARCH equation to predict the variance of error terms in the t+1 period
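In symbols, the ARCH(1) test regression and the resulting variance forecast:

```latex
\hat{\varepsilon}_t^{\,2} = a_0 + a_1\,\hat{\varepsilon}_{t-1}^{\,2} + u_t,\qquad
\hat{\sigma}_{t+1}^{\,2} = \hat{a}_0 + \hat{a}_1\,\hat{\varepsilon}_t^{\,2}
```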
Cointegration
Definition
Example
How to test for it
Two (or more) time series might not be stationary, e.g. have unit root problem, but if we regress the series against each other we might find we have a (covariance) stationary series – this is called cointegration. If the series are cointegrated then the error term above will not have a unit root
Example: Regressing the price of a stock market index and also the associated future contract. Each one individually might exhibit a random walk however we would intuitively expect a stable relationship between them.
Only reliable for modelling where there is a long run, stable relationship.
Testing for cointegration: The (Engle-Granger) Dickey-Fuller test
(Engle-Granger) Dickey-Fuller test
Testing for cointegration
Check whether Error Term has a unit root. If the series are cointegrated then the error term above will not have a unit root.
If we reject the null (g is significantly different from zero) then we conclude the error term is covariance stationary = no unit root = cointegrated
Time-Series Analysis: Determining which model to use
Machine learning defined
and 3 different classes
- Extracting knowledge from large amounts of data (big data)
- Goal of automating the decision-making process by
- ‘Learning’ from known examples to determine an underlying structure in the data
Find the pattern, apply the pattern.
Broadly categorized into three distinct classes of techniques:
- Supervised learning
- Unsupervised learning
- Deep Learning
Supervised machine learning
Requires the use of a labelled data set i.e. matched set of observed inputs and the associated output
The ML algorithm is ‘trained’ using the labeled data set to infer the pattern-based prediction rule between the inputs and output
- The ‘fit’ of the ML model is evaluated using labelled test data where the predicted targets (Y predicted) are compared to the actual targets (Y actual)
- Two categories of problems: 1. Regression problems where the target variable is continuous (even if the ML technique used isn’t regression) 2. Classification problems where the target variable is categorical or ordinal e.g. fraudulent or non-fraudulent transactions
Unsupervised machine learning
Does not make use of labeled data, the ML algorithm seeks to discover structure within the data set
Two types of problems suited towards unsupervised ML: 1. Dimension reduction aims at reducing the number of features used whilst retaining variation across observations e.g. identifying major factors underlying asset price movements 2. Clustering aims on sorting observations into groups (clusters) based on similarity that may or may not be pre-specified (for example, the number of groups) e.g. sorting companies into financial statement data groups
Deep learning and reinforcement learning
complex and sophisticated algorithms tackle highly complex tasks such as image and speech recognition
In reinforcement learning a computer learns through trial and error (e.g. by playing millions of games against itself), maximizing a defined reward
Both are based on artificial neural networks (ANNs), and can be supervised or unsupervised
Overfitting
Represents a model that fits its training data too well (i.e. it has incorporated noise or random fluctuations) and so does not predict well using out-of-sample data
Low or no in-sample error (Ein) but large out-of-sample error (Eout) represents poor generalization / overfit!
Main contributors to overfitting:
- High noise levels
- Too much complexity in the model i.e. features in the model, number of branches, linear or nonlinear relationship
Sources of total out-of-sample error (Eout)
- Bias Error – the degree to which the model fits the training data (associated with underfitting)
- ML models with erroneous assumptions produce high bias and poor approximations, which results in underfitting and high in-sample error
- Variance Error – a measure of how much the model’s results change in response to new data from the validation and test samples
- An unstable model will pick up ‘noise’ and produce high variance, causing overfitting and high out-of-sample error
- Base Error – error due to randomness in the data
ML: Learning curves
High Variance Error = Over-fitting
High Bias Error = Under-fitting
ML: Fitting curve
A fitting curve shows in-sample and out-of-sample error rates (Ein and Eout) against model complexity
Typically:
- Linear functions are more susceptible to bias error and underfitting
- Non-linear functions are more susceptible to variance error and overfitting
An optimal point (managing overfitting risk) of model complexity exists where the bias and variance error curves intersect and where Ein and Eout rates are minimized
ML: Preventing overfitting
- Estimation of an overfitting penalty that increases in size with the number of included features
- Prevents the algorithm from getting too complex during the selection and training process
- Only include parameters that reduce out-of-sample error
- Cross-validation
- A process aimed at reducing sampling bias
- The challenge is to have a large enough data set to partition the data into representative groups for training, validation and testing (holdout sample).
- k-fold cross validation: data (excluding a test sample) is shuffled randomly and split into k equal size sub-groups (typically 5 or 10), with k-1 groups used as training samples and one sample (the kth) used as a validation sample. The process is repeated k times so each data point is used in the training data set k-1 times and in the validation data set once. The average of the k-validation errors (mean Eval) taken as an estimate of the model’s Eout
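A minimal k-fold cross-validation sketch (scikit-learn assumed; data is synthetic and the linear model is only illustrative):

```python
# Sketch only; features, target and model choice are made up.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([0.5, -0.2, 0.1]) + rng.normal(scale=0.1, size=100)

kf = KFold(n_splits=5, shuffle=True, random_state=0)   # k = 5 folds
errors = []
for train_idx, val_idx in kf.split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    pred = model.predict(X[val_idx])
    errors.append(mean_squared_error(y[val_idx], pred))

print(np.mean(errors))   # mean validation error, an estimate of out-of-sample error
```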
ML: Penalized regression
Penalized regression is a process of regularisation that helps reduce the effect of ‘overfitting’ a model
- A penalty term is created that increases in size as the number of included variables in the model increases e.g. in Least Absolute Shrinkage and Selection Operator (LASSO):
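The LASSO objective in symbols (λ is the regularization hyperparameter; a larger λ shrinks more coefficients towards zero):

```latex
\min_{b}\;\sum_{i=1}^{n}\left(y_i-\hat{y}_i\right)^2 \;+\; \lambda\sum_{k=1}^{K}\left|\hat{b}_k\right|
```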
ML: Support Vector Machine (SVM)
A very popular ML algorithm used for classification, regression, and outlier detection
SVM is a linear classifier that determines a hyperplane (e.g. a line) that optimally separates the data into two sets of data points
ML: K-Nearest Neighbour (KNN)
- Supervised ML technique used commonly for classification and sometimes for regression
- Aims to classify a new observation by identifying similarities between the new observation and the existing data
- KNN is a straightforward, intuitive, non-parametric technique that can be used in a multiclassification situation
- However, defining the term ‘similar’ can be difficult
- The value of k (a hyperparameter) in the model must be carefully chosen:
- Too small: Results in a high error rate and sensitivity to local outliers
- Too big: Dilutes the concept of nearest neighbor by averaging too many outcomes
- Even: May result in ties and no clear classification
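A toy KNN classification sketch (scikit-learn assumed; data points are made up, k = 3 chosen odd to avoid ties):

```python
# Sketch only; the two-feature data and class labels are invented.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0],
              [6.0, 9.0], [1.2, 0.5], [7.0, 9.5]])
y = np.array([0, 0, 1, 1, 0, 1])           # two classes

knn = KNeighborsClassifier(n_neighbors=3)  # k = 3 nearest neighbours
knn.fit(X, y)
print(knn.predict([[1.1, 1.0], [6.5, 9.0]]))  # classify new observations
```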
ML: Classification and Regression Trees (CART)
ML technique used to predict either a:
- Categorical target variable, i.e. a classification problem, producing a classification tree, or
- Continuous outcome, i.e. a regression problem, producing a regression tree
Algorithm produces a visual decision tree with binary branching to classify observations
CART makes no assumptions about the characteristics of the training data - Therefore, if left unconstrained it can be subject to overfitting. This can be mitigated by the introduction of regularization parameters: • Maximum depth of tree • Minimum population at each node • Maximum number of decision nodes
Ensemble learning
Combining the predictions from a collection of models to create an average predicted value
Heterogeneous learners: different types of algorithms combined together with a voting classifier
Homogeneous learners: a combination of the same algorithm using different training data
Random Forest Classifier
An ensemble of many decision trees (homogeneous learners), each trained on a different random subset of the data and features (bagging); the forest classifies by majority vote across the trees
Dimension reduction – Principal Components Analysis (PCA)
Unsupervised ML Algorithms: Process used to summarize or reduce highly correlated features into a few main, uncorrelated composite variables
Clustering algorithms
Clustering groups observations solely on the basis of information found in the data, with no pre-determined labelling. A cluster is created from a sub-set of data that is deemed to be ‘similar’
- Cohesion – observations in each cluster are similar to each other
- Separation – observations in two different clusters are as dissimilar as possible
Uncovers potentially interesting and novel relationships not previously identified using standard classifications to group companies such as industry and sector
Two popular approaches include: • K-means clustering • Hierarchical clustering
K-means clustering
Iterative process of repeatedly partitioning data into a fixed number, k, of nonoverlapping clusters
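A toy k-means sketch (scikit-learn assumed; data points are made up, k fixed at 2):

```python
# Sketch only; the two obvious clusters are invented for illustration.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.1], [0.9, 1.3], [1.2, 0.8],
              [8.0, 8.2], [7.8, 8.5], [8.3, 7.9]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)            # cluster assignment for each observation
print(km.cluster_centers_)   # centroid of each cluster
```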
Hierarchical clustering
An iterative procedure that builds a hierarchy of clusters. The algorithm creates intermediate rounds of clusters that are of:
- Increasing size: Agglomerative – used in large datasets because of its fast computing speed. It makes decisions on local patterns without an initial global structure, therefore, it’s good at identifying smaller clusters
- Decreasing in size: Divisive – starts with an initial global structure and is better suited to identifying large clusters.
Neural networks
Deep Learning Nets (DLNs)
Reinforcement Learning (RL)
ML Summary
Big Data
Definition
Characteristics of Big Data:
- Volume: Data collected in files, tables and datasets is large
- Velocity: The speed at which data is communicated is great! Real-time data is becoming the norm in many areas
- Variety: Data is collected from many different sources and in many different formats: - Structured data such as SQL tables and CSV files - Semi-structured data such as HTML code - Unstructured data such as video data
When using data for inference or prediction, there is a “Fourth V”:
- Veracity: Relates to the credibility and reliability of different data sources e.g. fake news and spam emails • Identifying quality from quantity!
Data Analysis: ML Model Building Summary
Traditional (Structured) ML Model Building Steps
- Conceptualization • Determining what the inputs and output of the model are, e.g. will the stock price rise or fall in a week’s time? • How will the model be used, and who will use it? • How will the model be incorporated into the business’ processes?
- Data Collection • Mostly data collected from internal and external sources in a structured form, e.g. cells with values • External data can be accessed through an application programming interface (API) which allows communication between different software components
- Data Preparation and Wrangling • Cleansing the data to resolve missing values or out-of-range values • Preprocessing the data: Extracting, aggregating, filtering, and selecting relevant data columns
- Data Exploration • Involves exploratory data analysis, feature selection, and feature engineering
- Model Training • Selecting the appropriate ML method(s) • Evaluating the performance of the trained model • Tuning the ML model
Text Based (Unstructured) ML Model Building Steps
- Text Problem Formulation • Identify the inputs and outputs, e.g. identify a sentiment score that is structured output from an unstructured input, like text
- Data (Text) Curation • Gathering external text data via web services or web spidering (scraping or crawling) programs that extract raw content from a source, like web pages
- Text Preparation and Wrangling • Cleaning and preprocessing to convert the unstructured text into a format that can be interpreted by traditional modeling methods designed around structured inputs
- Text Exploration • The process of visualizing the text using techniques such as word clouds • Also, text feature selection and engineering
- Model Training
The output resulting from the process could be combined with other structured variables or used directly for forecasting and analysis. - The detail of steps 3 and 4 varies between structured data and text-based (unstructured) data. We will go on to look at these points in more detail.
Introduction to Data Preparation and Wrangling
Data Preparation (Cleansing)
- The process of examining, identifying, and mitigating errors in raw data
- Common issues include missing, duplicated, erroneous or inaccurate values
- Automated data can have similar issues due to software bugs and server failures
Data Wrangling (Preprocessing)
- Involves the transformation and processing of the cleansed data so that it is ready to be used for ML model training
- The data may be processed to deal with outliers, extraction of useful variables from the existing data, and also scaling the data
Different for Structured data // Unstructured (Text) data
Structured Data: Data Preparation and Wrangling
1. Data Preparation (Cleansing)
Possible errors in a raw dataset (e.g. a table) include:
- Incompleteness error – data is not present, i.e. missing values • Seek alternative sources • Missing values and NAs must be either deleted or replaced with an imputed value (e.g. the mean, median or mode, or assume 0)
- Invalidity error – data is outside of a meaningful range, creating invalid data
- Inaccuracy error – data is not a measure of true value
- Inconsistency error – data conflicts with corresponding data points or reality e.g. a title column shows ‘Mrs.’ when the sex column states ‘male’
- Non-uniformity error – data is not present in a consistent format e.g. GBP and £
- Duplication error – where duplicate observations are present
Structured Data: Data Preparation and Wrangling
Data Wrangling (Preprocessing)
- Predominantly the transformation and scaling of data on the cleansed data set
- Common transformations used in practice include:
- Extraction – new variable extracted from a current variable e.g. Age from observed DoB
- Aggregation – consolidation of two or more similar variables into one variable e.g. capital gains/losses and income combined to give total return
- Filtration – data rows not required must be identified and filtered
- Selection – data columns not intuitively needed can be removed
- Conversion – the data (nominal, ordinal, continuous, categorical) may need to be converted in order to be processed further, e.g. removal of prefixes such as currency symbols
- Outliers need to be identified so they can be removed or replaced. Several techniques exist. Data values are flagged as outliers if they are outside of:
- 3 standard deviations from the mean, or
- 1.5 times the inter-quartile range beyond the 3rd quartile (upper bound) or 1st quartile (lower bound)
- There are several methods to deal with outliers:
- Trimming – removal of the outliers and extreme values
- Winsorization – extreme values and outliers are replaced with the maximum (for large outliers) and minimum (for small outliers) values of data points that are not deemed to be outliers
- Scaling – the process of adjusting the range of a feature by shifting and changing the scale of the data
- Required for ML techniques requiring scaled data, e.g. a neural network
- Two common methods (formulas below): Normalization (sensitive to outliers) & Standardization (assumes normal distribution & less sensitive to outliers)
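The two scaling formulas in symbols:

```latex
X_{norm} = \frac{X_i - X_{min}}{X_{max} - X_{min}},\qquad
X_{std} = \frac{X_i - \mu}{\sigma}
```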
Unstructured: Data Preparation (Cleansing)
Basic operations in the text cleansing process includes removing:
- HTML tags - Required if the text is obtained from website
- Punctuation and numbers - Generally they are removed, as the words in the sentence convey the meaning, e.g. the presence of the word “boosted” in an earnings press release may indicate positive sentiment (rather than the number itself) - However, sometimes they can be useful e.g. the % sign (which would be replaced with the annotation /PercentSign/ to preserve its grammatical meaning in the text)
- White spaces - Removal of unnecessary white spaces that might have occurred because of the removal of punctuations and numbers
Unstructured: Text Wrangling (Preprocessing)
- Involves the process of tokenization: Process of splitting text into separate tokens (e.g. words) - Can be done at a character or word level (most common)
- The normalization process involves the following:
- Lowercasing • Removes the distinction among the same words e.g. “It” and “it”
- Stop words • Such as “is”, “the” and “a” don’t always carry a semantic meaning so they are often removed at this stage (or maybe later in the data exploration stage because of high word frequency)
- Stemming • Converting inflected forms of a word into a base word e.g. “fishing“, “fished“, and “fisher“ to the stem “fish”
- Lemmatization • Converting inflected forms of a word into its morphological root known as a lemma • Requires an understanding of the relevant dictionary and is more expensive and advanced
Creating Bag-of-Words (BOW)
Unstructured: Text Wrangling (Preprocessing)
A procedure used to analyze text: a bag-of-words is the collection of distinct tokens observed across all the texts in a sample data set
The final BOW created after normalization can be viewed in a document term matrix (DTM) which makes the text more structured
An N-grams technique can be used to attach words together to show representation of word sequences e.g. a bigram such as “not_present”. This ensures the term “not” isn’t considered a single token that may have been removed during normalization.
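A toy bag-of-words / document term matrix sketch (scikit-learn assumed; the sentences are made up), including bigrams so that “not” attaches to the following word:

```python
# Sketch only; two invented sentences to show the DTM structure.
from sklearn.feature_extraction.text import CountVectorizer

texts = ["profits boosted by strong sales",
         "profits not boosted this quarter"]

vec = CountVectorizer(ngram_range=(1, 2))   # single tokens and two-word sequences
dtm = vec.fit_transform(texts)              # document term matrix (sparse)

print(vec.get_feature_names_out())          # the bag of words, incl. bigrams like "not boosted"
print(dtm.toarray())                        # token counts per document
```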
Data Exploration Summary
Involves three vital tasks:
- Exploratory Data Analysis (EDA): This preliminary step of data exploration involves the creation of graphs, charts, heat maps and word clouds. EDA helps stakeholders connect and ensure the prepared data is sensible. EDA also allows for inspection of simple questions and hypotheses which enables planning for the next stage
- Feature Selection: Where only the key features from the dataset are selected for ML model training
- Feature Engineering: Process of creating new features by changing or transforming existing features
2 & 3 heavily influence model performance!
Structured Data: Data Exploration
- Exploratory Data Analysis (EDA)
- Principal Components Analysis (PCA) can be used on high-dimension data
- Exploratory visualization for one-dimensional data (bar charts etc)
- Exploratory visualization for two-dimensional data includes scatterplots etc
- Feature Selection
- Removal of unneeded, irrelevant, and redundant features to achieve model parsimony
- Basic diagnostic tests are carried out to identify: - Feature redundancy - Heteroskedasticity - Multicollinearity
- Dimension reduction is carried out which creates new combinations of features that are uncorrelated which helps to reduce cost and increase processing speeds
- Feature Engineering
- This process helps to further optimize and improve the features e.g. categorizing ages into either retirement and non-retirement age features
- For categorical data it may involve one hot encoding where a categorical feature is converted to a binary outcome of 0 or 1, e.g. is_RetirementAge assigned “0” for false, and “1” for true
Unstructured Data: Exploratory Data Analysis (EDA)
- Most common text analytical procedures are:
- Text classification – supervised ML to classify texts into different classes
- Topic modelling – unsupervised ML that groups texts into topic clusters
- Fraud detection
- Sentiment analysis – both supervised and unsupervised ML to predict the sentiment of texts
- Statistical measures used as part of EDA on text data:
- Term (or Collection) Frequency (TF) = No. of times a given token occurs in all texts/total number of tokens, and allows the analyst to identify (and potentially remove) noisy terms
- Word associations
- Average sentence and word length
- Word and syllable count
- Word clouds are a common visual technique used
Unstructured Data: Feature Selection
- For text data this involves selection of a subset of tokens occurring in the dataset, these represent features of the data set.
- Noisy features represent the most infrequent and most frequent tokens in the dataset (e.g. stop words). Identification and removing this noise is an important task
- General feature selection methods include:
- Frequency measures
- Chi-square test: • Used to test the independence of two events e.g. occurrence of the token vs. the occurrence of the class • Useful for ranking – tokens with the highest test statistic occur more frequently in texts associated with a particular class and may be selected as a feature
- Mutual information (MI): Measures how much information is contributed by a token to a class of texts • Value of “0” if the token appears equally in all classes, or “1” if it occurs in only one class of text
Unstructured Data: Feature Engineering
This process is similar to techniques used for structured data
Techniques used include:
- Numbers: Numbers of a certain length could be identified as a particular token, e.g. a 5-digit number representing a telephone area code in the UK. A feature labelled /number5/ could be created to represent the token
- N-grams
- Name Entity Recognition (NER) and Parts of Speech (POS) • Algorithms used to analyze individual tokens and their surrounding semantics whilst referencing to a dictionary in order to tag an object class to the token, e.g. taking a sentence and attaching labels such as verb, noun, percent, time, money etc.
Model Training
Three vital tasks
- Method Selection: Deciding which ML method(s) to use (ML section)
- Performance Evaluation: Techniques and measures used to quantify and understand the model performance
- Tuning: Decisions and actions to improve the model performance
Iterative process: Repeated many times until the desired level of model performance is attained
Model Training: Model Selection
- Factors to consider when selecting the ML method or algorithm to be used include:
- Supervised or Unsupervised
- Type of Data
- Size of Data
- Once the method is selected, certain method-related decisions need to be made, i.e. hyper-parameters e.g. number of hidden layers in a neural network
- Data needs to be split before training begins:
- In-sample data: Training sample (60%)
- Out-of-sample data: - Validation sample - Testing sample (40%)
Model Training: Performance Evaluation
The process of measuring the ‘Goodness of Fit’ of the ML model - Several techniques are used and we will discuss methods suited to binary classification models
- Error Analysis - The computation of four basic evaluation metrics (formulas follow the list below)
A confusion matrix is used to summarize the underlying outcomes:
- True positive (TP)
- False positive (FP) - Type I error
- True negative (TN)
- False negative (FN) - Type II error
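The four metrics usually computed from these counts (stated here for reference):

```latex
Precision = \frac{TP}{TP+FP},\quad Recall = \frac{TP}{TP+FN},\quad
Accuracy = \frac{TP+TN}{TP+FP+TN+FN},\quad F1 = \frac{2\cdot Precision\cdot Recall}{Precision + Recall}
```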
- Receiver Operating Characteristic (ROC) - Assesses model performance by plotting a curve that represents the trade-off between the false positive rate and the true positive rate for various cutoff points (for the observation to be classified as either “0” or “1”)
- False Positive rate = FP / (TN + FP)
- True Positive rate (Recall) = TP / (TP + FN)
- Root Mean Squared Error (RMSE)
- Appropriate for continuous data predictions and is commonly used in regression
- A single metric capturing all the prediction errors in the data (n)
- Square root of mean of the squared differences between actual values and the model’s predicted values
Model Training: Tuning
- Once the model has been evaluated, based on the findings, the performance of the model needs to be improved:
- High prediction error on the training set = Underfit
- Prediction error on the cross-validation (CV) set is much higher than on the training set = Overfit
- Two types of error in model fitting:
- Bias error: - Model is overly simplified and does not learn adequately from the training data - Associated with underfitting
- Variance error: - Model is overly complicated and starts to memorize the training data and therefore performs poorly on new data - Associated with overfitting
- It is not possible to remove both, however, it is possible to minimize the total aggregate error (bias and variance error)
- Hyperparameters must be chosen in advance, e.g. regularization term (λ) in a supervised model, number of hidden layers in a NN
- Grid search is a method of systematically training a ML model by using various combinations of hyperparameters and choosing the one with best model performance
- Results can be analyzed using a fitting curve
Tuning: Fitting Curve
- Very low regularization:
- Prediction error on the training set is small (memorizing the data) but high on the cross validation data set
- High variance error and low bias error
- Model is overfitted as it does not perform well on new data
- Very high regularization:
- Too few features included so the model is unable to learn
- High prediction error on both the training (suggesting high bias) and CV datasets
- Suggests model underfitting
- Optimum regularization:
- Minimizes total error in a balanced fashion, with prediction error in the training and CV datasets that are similar